Modeling the voice

The voice can be modelled as an acoustic system of two parts: the source and the filter.

This is a very common model of the human voice, fittingly called the source-filter model of voice production.

The most general description of the source-filter model does not make any assumptions about possible interactions between the source and the filter (e.g. the behaviour of the filter affecting the behaviour of the source and vice versa).

To simplify things, we will make the assumption, also quite common, that there are no such source-filter interactions.

This is a very important assumption, though not without drawbacks, because it will allow us to use linear prediction for formant estimation.

Expressing the source-filter model in the digital Z-domain yields the following:

$S(z) = G(z) \cdot V(z) \cdot R(z)$ , where $S(z)$ corresponds to the speech signal.

$G(z)$ is the source (glottal flow) and $V(z)$ is the filter (vocal tract filter).

$R(z)$ represents the lip radiation effect, which can be roughly modelled as a first-order FIR filter.

$R(z) = 1 - \alpha z^{-1}$ , where $\alpha \lesssim 1$
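In the time domain, this filter amounts to the difference equation $y[n] = x[n] - \alpha x[n-1]$. A minimal sketch with NumPy follows; the function name `lip_radiation` and the value `alpha = 0.97` are assumptions chosen for illustration, not values prescribed by the model.

```python
import numpy as np

def lip_radiation(x, alpha=0.97):
    """Apply R(z) = 1 - alpha * z^-1, i.e. y[n] = x[n] - alpha * x[n-1]."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    # The first sample has no predecessor, so it is left unchanged.
    y[1:] -= alpha * x[:-1]
    return y

# A constant (DC) input is almost entirely removed after the first sample,
# which shows the high-pass, differentiator-like behaviour of the filter.
print(lip_radiation(np.ones(4)))
```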

Now, in practice, we can often cancel out the lip radiation effect in the preprocessing stage by combining it with pre-emphasis. More on that later. For this reason we can consider only $G(z)$ and $V(z)$ from this point forward.

flowchart LR G("G(z)") --> V("V(z)") --> S["\text{Speech}"]

Let's go further in our modelling: we can assume that the vocal tract filter is a cascade of resonances and antiresonances.

flowchart LR G("G(z)") --> R1("R_1(z)") --> R2("R_2(z)") --> A1("A_1(z)") --> A2("A_2(z)") --> S["\text{Speech}"]

We can model each resonance as a two-pole filter where the two poles are complex conjugates of each other.

$R_k(z) = \dfrac 1 {(1 - \beta_k z^{-1})(1 - \overline {\beta_k} z^{-1})}$
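To see why a conjugate pair yields a filter with real coefficients, we can write the pole in polar form, $\beta_k = r_k e^{i \theta_k}$ (the symbols $r_k$ and $\theta_k$ are introduced here only for this illustration), and expand the denominator:

$(1 - \beta_k z^{-1})(1 - \overline{\beta_k} z^{-1}) = 1 - (\beta_k + \overline{\beta_k}) z^{-1} + \beta_k \overline{\beta_k} z^{-2} = 1 - 2 r_k \cos(\theta_k) z^{-1} + r_k^2 z^{-2}$

The imaginary parts cancel, so each resonance is a second-order section with real coefficients.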

And likewise, each antiresonance as a two-zero filter where the two zeroes are complex conjugates of each other.

$A_k(z) = (1 - \alpha_k z^{-1})(1 - \overline {\alpha_k} z^{-1})$

We can then express the poles and zeroes using the frequencies and bandwidths in Hertz, given a sampling frequency $f_s$ .

$\beta_k = \exp \left( - \dfrac { \pi B_{r,k}} {f_s} \right) \exp \left( 2i \pi \dfrac {F_{r,k}} {f_s} \right)$ , where $B_{r,k}$ and $F_{r,k}$ are the bandwidth and frequency of the $k$-th resonance.

$\alpha_k = \exp \left( - \dfrac { \pi B_{a,k}} {f_s} \right) \exp \left( 2i \pi \dfrac {F_{a,k}} {f_s} \right)$ , where $B_{a,k}$ and $F_{a,k}$ are the bandwidth and frequency of the $k$-th antiresonance.
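The same mapping applies to poles and zeroes, so one helper covers both. The sketch below is only an illustration: the name `root_from_hz` and the numeric values (a 500 Hz resonance, 80 Hz wide, sampled at 16 kHz) are hypothetical.

```python
import numpy as np

def root_from_hz(freq_hz, bandwidth_hz, fs):
    """Pole (or zero) location for a centre frequency and bandwidth in Hz."""
    radius = np.exp(-np.pi * bandwidth_hz / fs)   # bandwidth sets the modulus
    angle = 2.0 * np.pi * freq_hz / fs            # frequency sets the argument
    return radius * np.exp(1j * angle)

# Hypothetical example values, not taken from any measurement.
beta = root_from_hz(500.0, 80.0, 16000.0)
print(abs(beta), np.angle(beta))
```

A narrower bandwidth pushes the modulus closer to 1, that is, closer to the unit circle, while the modulus always stays below 1 for a positive bandwidth.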

Since we assumed that those filters are cascaded, we can express $V(z)$ as the product of all the resonance and antiresonance filters.

$V(z) = \dfrac {(1 - \alpha_1 z^{-1}) (1 - \overline{\alpha_1} z^{-1}) \ldots (1 - \alpha_J z ^{-1}) (1 - \overline {\alpha_J} z ^{-1})} {(1 - \beta_1 z^{-1}) (1 - \overline{\beta_1} z^{-1}) \ldots (1 - \beta_I z ^{-1}) (1 - \overline {\beta_I} z ^{-1})}$ , where $I$ and $J$ are the number of resonances and antiresonances respectively.

Expanding both products, we end up with a ratio of two polynomials in $z^{-1}$ , the numerator of degree $2J$ , the denominator of degree $2I$ .

$V(z) = \dfrac {\displaystyle \sum_{j=0}^{2J} {a_j z^{-j}}} {\displaystyle \sum_{i=0}^{2I} {b_i z^{-i}}}$
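As a sketch of this expansion, `numpy.poly` builds the monic polynomial with the given roots, and its coefficients read in order are exactly the coefficients of $z^{0}, z^{-1}, z^{-2}, \ldots$ in the product. The helper names and the resonance and antiresonance values below are hypothetical, chosen only to give $I = 2$ and $J = 1$.

```python
import numpy as np

def root_from_hz(freq_hz, bandwidth_hz, fs):
    """Pole (or zero) location for a centre frequency and bandwidth in Hz."""
    return np.exp(-np.pi * bandwidth_hz / fs) * np.exp(2j * np.pi * freq_hz / fs)

def expand_conjugate_pairs(roots):
    """Coefficients of prod_k (1 - r_k z^-1)(1 - conj(r_k) z^-1), in powers of z^-1."""
    roots = np.asarray(roots, dtype=complex)
    all_roots = np.concatenate([roots, np.conj(roots)])
    # Conjugate pairs give real coefficients; np.real drops rounding residue.
    return np.atleast_1d(np.real(np.poly(all_roots)))

fs = 16000.0
# Hypothetical values: two resonances (I = 2) and one antiresonance (J = 1).
poles = [root_from_hz(500.0, 80.0, fs), root_from_hz(1500.0, 120.0, fs)]
zeros = [root_from_hz(1000.0, 100.0, fs)]

b = expand_conjugate_pairs(poles)   # denominator coefficients b_0 .. b_{2I}
a = expand_conjugate_pairs(zeros)   # numerator coefficients a_0 .. a_{2J}
print(len(a) - 1, len(b) - 1)       # polynomial degrees 2J and 2I
```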

Considering this expression, we can reformulate the problem of formant estimation as solving for those two polynomials.
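Once a polynomial is known, the frequencies and bandwidths follow by inverting the pole mapping: $F = \dfrac{f_s}{2\pi} \arg(\beta)$ and $B = -\dfrac{f_s}{\pi} \ln |\beta|$. The sketch below only illustrates this inversion on a denominator built by hand; it does not estimate the polynomial from speech, and the name `hz_from_roots` and the test values are assumptions.

```python
import numpy as np

def hz_from_roots(coeffs, fs):
    """Frequencies and bandwidths (Hz) of the conjugate root pairs of a polynomial in z^-1."""
    roots = np.roots(coeffs)
    roots = roots[np.imag(roots) > 0]             # keep one root per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bands = -np.log(np.abs(roots)) * fs / np.pi
    order = np.argsort(freqs)
    return freqs[order], bands[order]

# Hand-built second-order denominator for a hypothetical 500 Hz resonance,
# 80 Hz wide, at a 16 kHz sampling frequency.
fs = 16000.0
r = np.exp(-np.pi * 80.0 / fs)
theta = 2.0 * np.pi * 500.0 / fs
b = [1.0, -2.0 * r * np.cos(theta), r * r]

freqs, bands = hz_from_roots(b, fs)
print(freqs, bands)
```

Real roots (with zero imaginary part) are discarded here because they do not correspond to a resonance in this model.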