In HMM-based speech synthesis, the hard work is done by a regression tree. Trees are rather naive models, so why not use something more powerful? A (Deep) Neural Network is a learnable, general-purpose non-linear transform that can be used for regression.
The basics
A neural network can be thought of as a general, learnable, non-linear mapping. In synthesis, we use it to perform regression.
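To make "learnable, non-linear mapping" concrete, here is a minimal sketch of the forward pass of a small feed-forward network in plain Python. The layer sizes, the tanh hidden activation, the linear output layer and the random initialisation are all illustrative assumptions, not a description of any particular synthesis system. Training, which is what makes the weights "learnable", is not shown.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create one layer as (weights, biases) with small random weights.

    The initialisation range is an arbitrary choice for illustration.
    """
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(weights, biases, x):
    """Compute W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def forward(layers, x):
    """Map an input vector to an output vector.

    Hidden layers apply tanh (the non-linearity); the final layer is left
    linear, a common choice when the network performs regression.
    """
    for weights, biases in layers[:-1]:
        x = [math.tanh(v) for v in affine(weights, biases, x)]
    weights, biases = layers[-1]
    return affine(weights, biases, x)


rng = random.Random(0)
# Hypothetical sizes: 6 input features, two hidden layers of 8 units, 3 outputs.
sizes = [6, 8, 8, 3]
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
output = forward(layers, [1.0, 0.0, 0.0, 1.0, 0.0, 0.5])
```

With untrained weights the output values are meaningless. The point is only that a vector goes in, a vector comes out, and everything in between is a stack of simple weighted sums and non-linearities.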
Preparing the data
Since the inputs and outputs of a neural network must be vectors of numbers, we have to encode our data in an appropriate way.
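As a rough illustration of what "encoding as vectors" can mean on the input side, the sketch below turns categorical linguistic features into one-hot vectors and appends a numerical positional feature. The tiny phone inventory, the choice of features and the function names are all hypothetical; a real system would use a far larger feature set.

```python
# Hypothetical, deliberately tiny phone inventory.
PHONES = ["sil", "a", "k", "t"]


def one_hot(value, inventory):
    """Return a vector with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == item else 0.0 for item in inventory]


def encode_frame(prev_phone, phone, next_phone, position_in_phone):
    """Encode one frame's linguistic context as a flat vector of numbers.

    Categorical features (the three phone identities) become one-hot vectors;
    `position_in_phone` is assumed to already be a number between 0 and 1.
    """
    return (one_hot(prev_phone, PHONES)
            + one_hot(phone, PHONES)
            + one_hot(next_phone, PHONES)
            + [float(position_in_phone)])


# A frame one quarter of the way through a "k" that follows silence and precedes "a".
vector = encode_frame("sil", "k", "a", 0.25)
```

Here `vector` has 13 elements: three blocks of four for the phone identities, plus one positional value. The outputs (the speech parameters) are already numerical, so they need no such encoding.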
Synthesising
Synthesis involves putting text through the front end, encoding the resulting linguistic features as vectors, running a forward pass through the network, optionally performing trajectory generation, and finally generating the waveform.
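The order of these stages can be sketched as a single function. Every stage is passed in as a callable, and the names `front_end`, `encode`, `network`, `vocoder` and `trajectory_generation` are hypothetical placeholders. The toy stand-ins at the bottom exist only to make the sketch runnable and bear no resemblance to real components.

```python
def synthesise(text, front_end, encode, network, vocoder, trajectory_generation=None):
    """Run the synthesis stages in order and return whatever the vocoder produces."""
    # 1. Text -> linguistic features (one item per unit the network will process).
    linguistic_features = front_end(text)
    # 2. Linguistic features -> numerical input vectors.
    inputs = [encode(features) for features in linguistic_features]
    # 3. Forward pass: each input vector -> a vector of predicted speech parameters.
    frames = [network(x) for x in inputs]
    # 4. Optional trajectory generation over the whole sequence of frames.
    if trajectory_generation is not None:
        frames = trajectory_generation(frames)
    # 5. Speech parameters -> waveform.
    return vocoder(frames)


# Toy stand-ins, purely so the sketch can be executed.
def toy_front_end(text):
    return list(text)


def toy_encode(character):
    return [float(ord(character))]


def toy_network(x):
    return [x[0] * 2]


def toy_vocoder(frames):
    return [value for frame in frames for value in frame]


waveform = synthesise("ab", toy_front_end, toy_encode, toy_network, toy_vocoder)
```

Passing the stages in as arguments keeps the sketch independent of any particular front end, network or vocoder, and makes the optional step explicit.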