DNN-based synthesis

In HMM-based speech synthesis, the hard work of mapping linguistic features to acoustic parameters is done by a regression tree. Trees are rather naive models, so why not use something more powerful? A (Deep) Neural Network is a learnable, general-purpose non-linear transform that can be used for regression.
  • The basics

    A neural network can be thought of as a general, learnable, non-linear mapping. In synthesis, we use it to perform regression.

  • Preparing the data

    Since the inputs and outputs of a neural network must be vectors of numbers, we have to encode our data in an appropriate way.

  • Training

    An informal explanation of how the network is trained using back-propagation.

  • Synthesising

    Synthesis proceeds in stages: put text through the front end, encode the resulting linguistic features as vectors, run a forward pass through the network, optionally apply trajectory generation, then generate the waveform.
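
To make the outline above concrete, here is a minimal sketch in plain Python of the core ideas: encoding linguistic features as vectors, a forward pass through a small network, and training by back-propagation. Everything specific in it is an assumption made for illustration: the four-phone inventory PHONES, the single positional feature, the one-dimensional made-up "acoustic" targets, and the network size. It omits the front end, trajectory generation and waveform generation, and is not how any particular toolkit implements these steps.

```python
import math
import random

# Hypothetical toy phone inventory; a real front end produces far richer features.
PHONES = ["sil", "a", "t", "k"]

def encode(phone, position):
    """One-hot phone identity plus one numeric feature (position within the phone, 0..1)."""
    one_hot = [1.0 if phone == p else 0.0 for p in PHONES]
    return one_hot + [position]

def init_network(n_in, n_hidden, n_out, rng):
    """Small random weights for one tanh hidden layer and a linear output layer."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"W1": matrix(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": matrix(n_out, n_hidden), "b2": [0.0] * n_out}

def forward(net, x):
    """Forward pass: returns the hidden activations and the linear output."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(net["W1"], net["b1"])]
    y = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(net["W2"], net["b2"])]
    return h, y

def train_step(net, x, target, lr):
    """One gradient descent step on squared error, using back-propagation."""
    h, y = forward(net, x)
    # Gradient of E = 0.5 * sum((y - t)^2) with respect to the outputs
    d_y = [yi - ti for yi, ti in zip(y, target)]
    # Propagate the error back through W2 and the tanh non-linearity
    # (computed before W2 is updated)
    d_h = [(1.0 - hj * hj) * sum(d_y[k] * net["W2"][k][j] for k in range(len(d_y)))
           for j, hj in enumerate(h)]
    for k, dk in enumerate(d_y):
        for j, hj in enumerate(h):
            net["W2"][k][j] -= lr * dk * hj
        net["b2"][k] -= lr * dk
    for j, dj in enumerate(d_h):
        for i, xi in enumerate(x):
            net["W1"][j][i] -= lr * dj * xi
        net["b1"][j] -= lr * dj
    return 0.5 * sum(d * d for d in d_y)

def train(net, data, epochs, lr):
    """Repeated passes over the data; returns the mean loss for each epoch."""
    losses = []
    for _ in range(epochs):
        losses.append(sum(train_step(net, x, t, lr) for x, t in data) / len(data))
    return losses

# Made-up training pairs: encoded linguistic input -> one "acoustic" value
rng = random.Random(0)
data = [(encode("sil", 0.5), [0.0]),
        (encode("a", 0.5), [0.8]),
        (encode("t", 0.5), [-0.4]),
        (encode("k", 0.5), [0.3])]
net = init_network(len(PHONES) + 1, 8, 1, rng)
losses = train(net, data, epochs=200, lr=0.1)

# "Synthesis" here is just encoding an input and running a forward pass
_, prediction = forward(net, encode("a", 0.5))
```

In a real system the output vector would hold vocoder parameters for one frame rather than a single number, and the predicted frames would be passed on to trajectory generation (if used) and then to waveform generation.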