
Modeling the voice

The voice can be modelled as an acoustic system of two parts: the source and the filter.

This is a very common model of the human voice, fittingly called the source-filter model of voice production.

The most general description of the source-filter model does not make any assumptions about possible interactions between the source and the filter (e.g. the behaviour of the filter affecting the behaviour of the source and vice versa).

To simplify things, we will make the assumption, also quite common, that there are no such source-filter interactions.

This is a very important assumption, though not without drawbacks, because it will allow us to use linear prediction for formant estimation.

Expressing the source-filter model in the digital Z-domain yields the following:

$S(z) = G(z) \cdot V(z) \cdot R(z)$, where $S(z)$ corresponds to the speech signal.

$G(z)$ is the source (glottal flow) and $V(z)$ is the filter (vocal tract filter).

$R(z)$ represents the lip radiation effect, which can be roughly modelled as a first-order FIR filter.

$R(z) = 1 - \alpha z^{-1}$, where $\alpha \lesssim 1$
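As a minimal sketch, the first-order FIR filter above corresponds to the difference equation $y[n] = x[n] - \alpha x[n-1]$. The function name `lip_radiation` and the value $\alpha = 0.97$ are illustrative assumptions, not values taken from this text.

```python
def lip_radiation(x, alpha=0.97):
    """Apply R(z) = 1 - alpha * z^-1, i.e. y[n] = x[n] - alpha * x[n-1].

    alpha=0.97 is an assumed, illustrative value slightly below 1.
    """
    y = []
    prev = 0.0  # assumption: the signal is zero before the first sample
    for sample in x:
        y.append(sample - alpha * prev)
        prev = sample
    return y


# A constant (DC) input is almost entirely removed after the first sample,
# which reflects the high-pass character of this filter.
out = lip_radiation([1.0, 1.0, 1.0, 1.0], alpha=0.97)
```

With $\alpha$ close to 1 the filter behaves roughly like a differentiator, which is why it has the same form as a pre-emphasis filter.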

Now, in practice, we can often cancel out the lip radiation effect in the preprocessing stage by combining it with pre-emphasis. More on that later. For this reason we can consider only $G(z)$ and $V(z)$ from this point forward.

[Flowchart: $G(z)$ → $V(z)$ → Speech]

Let's take the model further: we can assume that the vocal tract filter is a cascade of resonances and antiresonances.

[Flowchart: $G(z)$ → resonance filters $R_1(z)$, $R_2(z)$ and antiresonance filters $A_1(z)$, $A_2(z)$ → Speech]

We can model each resonance as a two-pole filter where the two poles are complex conjugates of each other.

$R_k(z) = \dfrac 1 {(1 - \beta_k z^{-1})(1 - \overline {\beta_k} z^{-1})}$

And likewise, each antiresonance as a two-zero filter where the two zeroes are complex conjugates of each other.

$A_k(z) = (1 - \alpha_k z^{-1})(1 - \overline {\alpha_k} z^{-1})$
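As a short worked step, multiplying out a complex-conjugate pair shows that each second-order section has purely real coefficients. This uses only the identities $\beta_k + \overline{\beta_k} = 2 \operatorname{Re}(\beta_k)$ and $\beta_k \overline{\beta_k} = |\beta_k|^2$:

$$
(1 - \beta_k z^{-1})(1 - \overline{\beta_k} z^{-1}) = 1 - 2 \operatorname{Re}(\beta_k) \, z^{-1} + |\beta_k|^2 \, z^{-2}
$$

The same expansion applies to the zeroes $\alpha_k$ of an antiresonance, so both $R_k(z)$ and $A_k(z)$ can be implemented with real arithmetic.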

We can then express the poles and zeroes using the frequencies and bandwidths in Hertz, given a sampling frequency $f_s$.

$\beta_k = \exp \left( - \dfrac { \pi B_{r,k}} {f_s} \right) \exp \left( 2i \pi \dfrac {F_{r,k}} {f_s} \right)$, where $B_{r,k}$ and $F_{r,k}$ are the bandwidth and frequency of the $k^{\text{th}}$ resonance.

$\alpha_k = \exp \left( - \dfrac { \pi B_{a,k}} {f_s} \right) \exp \left( 2i \pi \dfrac {F_{a,k}} {f_s} \right)$, where $B_{a,k}$ and $F_{a,k}$ are the bandwidth and frequency of the $k^{\text{th}}$ antiresonance.
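The mapping between a (frequency, bandwidth) pair and a pole or zero can be sketched in a few lines, together with its inverse. The function names and the example values (500 Hz, 80 Hz, 16 kHz) are hypothetical and chosen only for illustration.

```python
import cmath
import math


def root_from_formant(freq_hz, bandwidth_hz, fs):
    """Return exp(-pi * B / fs) * exp(2i * pi * F / fs) as a complex number."""
    radius = math.exp(-math.pi * bandwidth_hz / fs)
    angle = 2.0 * math.pi * freq_hz / fs
    return cmath.rect(radius, angle)


def formant_from_root(root, fs):
    """Invert the mapping: recover (frequency, bandwidth) in Hz from a root."""
    freq_hz = cmath.phase(root) * fs / (2.0 * math.pi)
    bandwidth_hz = -math.log(abs(root)) * fs / math.pi
    return freq_hz, bandwidth_hz


# Hypothetical resonance at 500 Hz with an 80 Hz bandwidth, sampled at 16 kHz.
beta = root_from_formant(500.0, 80.0, 16000.0)
f, b = formant_from_root(beta, 16000.0)
```

A positive bandwidth always gives a radius below 1, so the corresponding pole lies inside the unit circle. The inverse mapping is what lets frequencies and bandwidths be read back from estimated roots.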

Since we assumed that those filters are cascaded, we can express $V(z)$ as the product of all the resonance and antiresonance filters.

$V(z) = \dfrac {(1 - \alpha_1 z^{-1}) (1 - \overline{\alpha_1} z^{-1}) \ldots (1 - \alpha_J z ^{-1}) (1 - \overline {\alpha_J} z ^{-1})} {(1 - \beta_1 z^{-1}) (1 - \overline{\beta_1} z^{-1}) \ldots (1 - \beta_I z ^{-1}) (1 - \overline {\beta_I} z ^{-1})}$, where $I$ and $J$ are the numbers of resonances and antiresonances respectively.

Expanding both products, we end up with a ratio of two polynomials in $z^{-1}$: the numerator of degree $2J$ and the denominator of degree $2I$.

$V(z) = \dfrac {\displaystyle \sum_{j=0}^{2J} {a_j z^{-j}}} {\displaystyle \sum_{i=0}^{2I} {b_i z^{-i}}}$
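To make the expansion concrete, here is a minimal sketch that builds the coefficients $a_j$ and $b_i$ by multiplying the real second-order sections together. The helper names and the example frequencies and bandwidths are hypothetical; only the Python standard library is used.

```python
import cmath
import math


def root_from_formant(freq_hz, bandwidth_hz, fs):
    """Complex root for a given frequency and bandwidth in Hz."""
    radius = math.exp(-math.pi * bandwidth_hz / fs)
    return cmath.rect(radius, 2.0 * math.pi * freq_hz / fs)


def conjugate_pair_poly(r):
    """Coefficients of (1 - r z^-1)(1 - conj(r) z^-1) in powers of z^-1."""
    return [1.0, -2.0 * r.real, abs(r) ** 2]


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (a convolution)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


def expand(pairs, fs):
    """Expand a cascade of conjugate-pair sections into one polynomial."""
    coeffs = [1.0]
    for freq_hz, bandwidth_hz in pairs:
        section = conjugate_pair_poly(root_from_formant(freq_hz, bandwidth_hz, fs))
        coeffs = poly_mul(coeffs, section)
    return coeffs


fs = 16000.0
resonances = [(500.0, 80.0), (1500.0, 120.0)]  # I = 2 (assumed values)
antiresonances = [(1000.0, 100.0)]             # J = 1 (assumed values)

b_coeffs = expand(resonances, fs)      # denominator, 2I + 1 coefficients
a_coeffs = expand(antiresonances, fs)  # numerator, 2J + 1 coefficients
```

Each list starts with the coefficient of $z^0$, which is always 1 here because every factor has the form $1 - r z^{-1}$.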

Considering this expression, we can reformulate the problem of formant estimation as solving for those two polynomials.