Prepare the output features

We need to use a vocoder to parameterise the waveforms.

We will use the WORLD vocoder rather than STRAIGHT, because WORLD is Open Source.

The scripts/copy_synthesis.sh script performs the WORLD analysis and Resynthesis steps below. The paths are omitted for clarity here – see the script for details and try to understand each stage of processing.

Note: this script will take a little while to run, so I suggest you modify scripts/copy_synthesis.sh so that it only processes the first 20 or so files. Once you have the complete pipeline working all the way through to DNN training and synthesis, come back and process the remaining data.

WORLD analysis

Smooth spectral envelope analysis

WORLD's analysis program performs pitch tracking, spectral envelope estimation and aperiodicity estimation. It takes a single wav file as input

$ analysis file.wav f0.double sp.double ap.double

and, like STRAIGHT, it produces three streams of features: f0, the smooth spectral envelope, and aperiodic energy.

Mel-Generalised Cepstrum

From the spectral envelope, we will now use a pipeline of SPTK tools to extract Mel-Generalised Cepstral coefficients (the ‘mgc’ parameters)

$ x2x +df sp.double | sopr -R -m 32768.0 | mcep -a $alpha -m $order -l $fft_length -e 1.0E-8 -j 0 -f 0.0 -q 3 > file.mgc

x2x +df converts the double-precision (d) output of WORLD into floats (f)
sopr -R -m 32768.0 takes the square root (-R) and multiplies (-m) by 32768.0
mcep converts the spectrum to the Mel-Generalised Cepstrum, and the parameter $alpha sets the amount of frequency warping, which should be chosen to match the sampling rate (see the copy_synthesis.sh script for suggested values for the parameters)
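The arithmetic done by the first two stages of this pipeline can be sketched in Python. This is only an illustration of what x2x +df and sopr -R -m 32768.0 do to each value, not a replacement for SPTK. The function name scale_spectrum is hypothetical, and the input values are made up.

```python
import math
from array import array


def scale_spectrum(values):
    """Mimic `x2x +df | sopr -R -m 32768.0` on a list of numbers.

    Hypothetical helper, for illustration only.
    """
    # x2x +df: store each double-precision value as a 32-bit float
    as_floats = array("f", values)
    # sopr -R -m 32768.0: take the square root, then multiply by 32768.0
    scaled = [math.sqrt(v) * 32768.0 for v in as_floats]
    # sopr writes 32-bit floats, so round the results in the same way
    return list(array("f", scaled))


print(scale_spectrum([1.0, 4.0]))  # prints [32768.0, 65536.0]
```

The mcep stage is not reproduced here, because it is a full cepstral analysis rather than a per-value operation.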

log f0

We convert f0 to floats, then to log f0 like this

$ x2x +df f0.double | sopr -magic 0.0 -LN -MAGIC -1.0e+10 > file.lf0

sopr first marks all zero values (i.e. unvoiced frames, because WORLD gives them an f0 value of 0) as a magic number (-magic 0.0) so that the log is not applied to them, then takes the natural log (-LN) of the remaining values, and finally replaces every magic number with a fixed, large negative value (-MAGIC -1.0e+10)
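The same logic can be written out in a few lines of Python, which may make the handling of unvoiced frames easier to follow. This is a minimal sketch: the function name f0_to_lf0 is hypothetical, and it ignores the rounding to 32-bit floats that x2x and sopr perform.

```python
import math

UNVOICED = -1.0e10  # the value given to -MAGIC in the command above


def f0_to_lf0(f0_values, unvoiced_value=UNVOICED):
    """Hypothetical helper: natural log of voiced f0, fixed value for unvoiced."""
    lf0 = []
    for f0 in f0_values:
        if f0 == 0.0:
            # unvoiced frame: never pass 0 to log, use the fixed value instead
            lf0.append(unvoiced_value)
        else:
            # voiced frame: natural log, like sopr -LN
            lf0.append(math.log(f0))
    return lf0
```

For example, f0_to_lf0([0.0, 100.0]) gives the fixed unvoiced value for the first frame and the natural log of 100 for the second.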

Band aperiodicities

We only need to convert these to floats

$ x2x +df ap.double > file.bap

The three files file.mgc, file.lf0 and file.bap are the parameterised version of the original waveform file.wav
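Because these files are written by x2x and mcep as raw 32-bit floats with no header, a quick sanity check is to work out the number of frames in each stream from its size in bytes and confirm that the three streams agree. The sketch below does that. The function names are hypothetical, and the per-frame dimensions in the usage note are assumptions that you should replace with the values from your own scripts.

```python
def count_frames(n_bytes, values_per_frame, bytes_per_value=4):
    """Number of frames in a headerless float file of the given size."""
    frame_bytes = values_per_frame * bytes_per_value
    if n_bytes % frame_bytes != 0:
        raise ValueError("file size is not a whole number of frames")
    return n_bytes // frame_bytes


def frames_per_stream(sizes, dims):
    """Map each stream name to its frame count.

    sizes: stream name -> file size in bytes
    dims:  stream name -> number of values per frame (assumed by the caller)
    """
    return {name: count_frames(sizes[name], dims[name]) for name in sizes}
```

You could obtain the sizes with os.path.getsize("file.mgc") and so on, then call frames_per_stream with, for example, dims of 60 for mgc (if $order is 59), 1 for lf0 and 5 for bap. If the frame counts differ between streams, one of the assumed dimensions or one of the analysis steps is probably wrong.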

Resynthesis

It is common to perform “copy synthesis” at this point, so that we create a set of waveforms that represent an upper bound on how good the synthetic speech will be. We could resynthesise the waveform from the files f0.double sp.double ap.double but we also want to take the approximation involved in the Mel-Generalised Cepstrum into account, so we will reconstruct the spectrum from those parameters:

$ mgc2sp -a 0.77 -g 0 -m 59 -l 2048 -o 2 file.mgc | sopr -d 32768.0 -P | x2x +fd > file_resynthesised.sp

and then generate a waveform

$ synth 2048 48000 f0.double file_resynthesised.sp ap.double file_resynthesised.wav
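The sopr -d 32768.0 -P | x2x +fd part of the mgc2sp pipeline above undoes the scaling that was applied before mcep (square root, then multiply by 32768.0). A minimal Python sketch of that per-value arithmetic follows. The function name unscale_spectrum is hypothetical, and the sketch does not attempt to reproduce mgc2sp itself.

```python
from array import array


def unscale_spectrum(values):
    """Mimic `sopr -d 32768.0 -P | x2x +fd` on a list of numbers.

    Hypothetical helper, for illustration only.
    """
    # sopr -d 32768.0 -P: divide by 32768.0, then square
    squared = [(v / 32768.0) ** 2 for v in array("f", values)]
    # sopr writes 32-bit floats; x2x +fd then widens them to doubles.
    # Python floats are already double precision, so only the rounding matters.
    return [float(v) for v in array("f", squared)]


print(unscale_spectrum([32768.0, 65536.0]))  # prints [1.0, 4.0]
```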

You should listen to some of the vocoded-and-resynthesised waveforms to understand what vocoding alone (without any statistical modelling) has done to the quality. You will hear a difference, and vocoding will work better for some voices than others. We might try to tune some of the vocoder settings for your voice, but for now we will omit this step.

The scripts/copy_synthesis.sh script will resynthesise every utterance. This isn’t really necessary: you probably only need resynthesised versions of a few sentences, and in particular the test set. You could save a little time and disk space by modifying the script to only perform the resynthesis step for certain files.

Feature composition and normalisation

We need to combine the three streams of features into a single file per utterance, and globally normalise.

Edit your config file to turn feature composition and normalisation on

NORMLAB  : False
MAKECMP  : True
NORMCMP  : True
TRAINDNN : False
DNNGEN   : False
GENWAV   : False
CALMCD   : False

and run this step

$ python /Volumes/Network/courses/ss/dnn/dnn_tts/run_dnn.py feed_forward_dnn_WORLD.conf 

You will now have the composed features in a directory called data/nn_mgc_lf0_vuv_bap_199 (the number on the end indicates the dimensionality of the features) and the normalised features in data/nn_norm_mgc_lf0_vuv_bap_199.
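It can help to see where a number like 199 might come from. The sketch below shows one decomposition that adds up to 199. It assumes 60 mgc coefficients (order 59 plus the zeroth coefficient), 1 lf0 value and 5 band aperiodicities per frame, each with static, delta and delta-delta versions, plus a single voiced/unvoiced (vuv) flag. These values are assumptions for illustration; your config file is the authority on the actual dimensions.

```python
# Assumed static dimensions per frame; check them against your config file
STATIC_DIMS = {"mgc": 60, "lf0": 1, "bap": 5}


def composed_dimension(static_dims, windows=3, vuv=1):
    """Total dimension: each stream with `windows` versions, plus the vuv flag."""
    return sum(dim * windows for dim in static_dims.values()) + vuv


print(composed_dimension(STATIC_DIMS))  # prints 199
```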