Build your own DNN voice

This exercise assumes that you have already built your own unit selection voice, and therefore have all the data you need.

Tools required
Only needed if you are setting this exercise up on your own. My Edinburgh students can skip this step.
- Merlin - CSTR's DNN toolkit
  CSTR's own DNN synthesis toolkit, built on top of Theano.
- Theano
  A framework for fast computation on CPUs and GPUs.
- SPTK
  A speech signal processing toolkit.
Prepare your workspace
Create the directory structures you need, and some configuration files.
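
As a rough sketch of this step, the Python below creates a workspace with one subdirectory per stage of the exercise. The directory names (`labels`, `features`, `config`, `models`, `synth`) and the `make_workspace` helper are illustrative assumptions, not the layout that Merlin or this exercise prescribes.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: these names are placeholders, not a required structure.
SUBDIRS = ("labels", "features", "config", "models", "synth")


def make_workspace(root, subdirs=SUBDIRS):
    """Create root and its subdirectories; safe to run more than once."""
    root = Path(root)
    created = []
    for name in subdirs:
        directory = root / name
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created


# Demonstrate in a throwaway temporary location.
workspace = Path(tempfile.mkdtemp()) / "dnn_voice"
created = make_workspace(workspace)
```

Because `mkdir` is called with `exist_ok=True`, re-running the function leaves an existing workspace untouched.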
Prepare the input labels
The DNN needs frame-level numerical labels, so we must derive those from the Festival utterance structures.
- Convert utterance structures to full context labels
  We will 'flatten' the utterance structures onto a linear sequence of context-dependent phone labels.
- Forced alignment
  The technique is similar to that for the unit selection voice, except that we have full context labels now.
- Convert label files to numerical values
  State-level full context labels are converted to frame-level numerical features.
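
The last conversion above can be sketched as follows. This is a minimal illustration, not Merlin's implementation: it assumes each aligned state has start and end times in HTK-style units of 100 ns, a frame shift of 5 ms (50000 units), and an already-numerical feature vector per state. The `expand_to_frames` function and the single within-state position feature are hypothetical choices made for the example.

```python
FRAME_SHIFT = 50000  # assumed 5 ms frame shift, in units of 100 ns


def expand_to_frames(segments, frame_shift=FRAME_SHIFT):
    """Repeat each state's feature vector once per frame.

    segments: list of (start, end, features) tuples, times in 100 ns units.
    A fractional position within the state is appended to every frame so
    that consecutive frames of the same state are not identical.
    """
    frames = []
    for start, end, features in segments:
        n_frames = (end - start) // frame_shift
        for i in range(n_frames):
            position = i / n_frames
            frames.append(list(features) + [position])
    return frames


# Two toy states: 15 ms (3 frames) followed by 10 ms (2 frames).
segments = [
    (0, 150000, [1.0, 0.0]),
    (150000, 250000, [0.0, 1.0]),
]
frames = expand_to_frames(segments)
```

Each state contributes as many rows as it has frames, so the network sees one input vector per frame.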
Prepare the output features
We need to use a vocoder to parameterise the waveforms.
Design the DNN
You need to choose the architecture for your DNN.
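
One way to compare candidate architectures is to count their trainable parameters. The sketch below does this for a fully connected feed-forward network; the `count_parameters` helper and all of the dimensions shown are hypothetical values chosen for illustration, not settings recommended by this exercise.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected feed-forward network.

    layer_sizes lists the number of units per layer, input first and
    output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weight matrix between the two layers
        total += n_out         # one bias per unit in the receiving layer
    return total


# Hypothetical sizes: the real input and output dimensions depend on your
# label features and on the vocoder parameters.
input_dim = 400
hidden_layers = [512, 512, 512, 512]
output_dim = 180

layer_sizes = [input_dim] + hidden_layers + [output_dim]
n_params = count_parameters(layer_sizes)
```

Wider or deeper networks have more parameters to learn, which generally calls for more training data and longer training.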
Train the DNN
The parameters of the DNN will be learned using stochastic gradient descent.
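
To make the update rule concrete, here is a minimal sketch of minibatch stochastic gradient descent. It fits a one-input linear model rather than a DNN, and it is written in plain Python rather than with Merlin or Theano, so it shows only the general idea. The function name, learning rate, batch size and toy data are all assumptions made for the example.

```python
import random


def sgd_linear(xs, ys, learning_rate=0.1, epochs=300, batch_size=10, seed=0):
    """Fit y = w * x + b by minibatch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * error * xs[i]
                grad_b += 2.0 * error
            # Step against the gradient averaged over the minibatch.
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
    return w, b


# Toy data generated from y = 2x + 1, with no noise.
xs = [i / 100 for i in range(100)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear(xs, ys)
```

A DNN is trained in the same way, except that the gradients for every layer are obtained by backpropagation.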
Synthesise
We can now put unseen sentences through the system, to perform synthesis.
Faster, faster
Disk access can be a limiting factor, so using local disk is usually much faster than network disk.
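
A minimal sketch of this idea is to copy the training data to local disk once, before training, and to read it from there afterwards. The `stage_to_local` helper, the directory names and the file name below are hypothetical, and temporary directories stand in for the network and local disks.

```python
import shutil
import tempfile
from pathlib import Path


def stage_to_local(source_dir, local_root):
    """Copy source_dir into local_root and return the path of the copy."""
    source_dir = Path(source_dir)
    destination = Path(local_root) / source_dir.name
    # dirs_exist_ok requires Python 3.8 or later.
    shutil.copytree(source_dir, destination, dirs_exist_ok=True)
    return destination


# Stand-in for a directory on network disk (placeholder file name).
network_dir = Path(tempfile.mkdtemp()) / "features"
network_dir.mkdir()
(network_dir / "utt_0001.bin").write_bytes(b"\x00" * 16)

# Stand-in for local scratch space on the machine doing the training.
local_root = Path(tempfile.mkdtemp())
local_copy = stage_to_local(network_dir, local_root)
```

Remember to copy any results you want to keep back to permanent storage, since local scratch space is often not backed up.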