Statistical parametric speech synthesis

That's quite a mouthful, but we need to use a general term because this topic includes both Hidden Markov Models and Neural Networks as the statistical model that predicts speech parameters.

Two methods are currently in use, and whilst they initially sound radically different, we’ll see that they have a lot in common.

  • HMM-based synthesis

    Hidden Markov Models are generative models, although their most common application is classification (Automatic Speech Recognition). But, of course, we can also generate new samples from them, and that's how we use them for speech synthesis.

  • DNN-based synthesis

    In HMM-based speech synthesis, the hard work is done by a regression tree. Trees are rather naive models, so why not use something more powerful? A (Deep) Neural Network is a learnable, general-purpose non-linear transform that can be used for regression.
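
To make the "generative" point from the HMM item concrete, here is a minimal sketch of drawing one observation sequence from a tiny HMM. Everything in it is an assumption made for illustration: the two-state left-to-right topology, the transition probabilities, the one-dimensional Gaussian emissions, and the names TRANSITIONS, EMISSIONS and sample_hmm. It is a toy, not how a real synthesiser produces its speech parameters.

```python
import random

# Hypothetical 2-state left-to-right HMM; all numbers are made up for illustration.
TRANSITIONS = {
    0: [(0, 0.7), (1, 0.3)],     # state 0: stay, or move on to state 1
    1: [(1, 0.8), (None, 0.2)],  # state 1: stay, or exit the model (None)
}
EMISSIONS = {0: (1.0, 0.1), 1: (3.0, 0.2)}  # state -> (mean, std dev) of a 1-D Gaussian


def sample_hmm(rng, max_frames=1000):
    """Generate one sequence of (state, observation) pairs by sampling."""
    state, frames = 0, []
    while state is not None and len(frames) < max_frames:
        # Emit one observation from the current state's Gaussian.
        mean, std = EMISSIONS[state]
        frames.append((state, rng.gauss(mean, std)))
        # Choose the next state according to the transition probabilities.
        next_states, probs = zip(*TRANSITIONS[state])
        state = rng.choices(next_states, weights=probs)[0]
    return frames


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for state, value in sample_hmm(rng):
        print(state, round(value, 2))
```

Each run walks through the states from left to right, emitting one value per frame, so the number of frames spent in each state (the duration) is itself random. The same model could instead be used for classification by asking how likely it is to have generated a given sequence.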