The recording script

You will record two sets of speech data. The first is "neutral read-text" and the second will be of your own design.

There are two sources of speech material for training speech synthesis models:

  1. Purpose-made recordings using a script of our choice
  2. ‘Found’ data, such as audiobooks and podcasts

In both cases, the synthetic speech that the model eventually generates will be influenced by the speech used to train that model. The most obvious factors are the speaker, the content, and the speaking style.

For this exercise, we are going to train our model on a relatively small amount of speech, obtained from purpose-made recordings. We may need to combine this with some pre-existing purpose-made recordings from one or more other speakers.

We need to select a script for recording. The standard method was devised for unit selection synthesis: sentences are chosen greedily, one at a time, from a large text corpus (e.g., novels or newspapers) so as to maximise phonetic (and possibly prosodic) coverage. In the first part of this exercise, we will simply use the existing CMU ARCTIC script.
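To make the greedy selection idea concrete, here is a minimal sketch in Python. It is not the procedure used to build ARCTIC. The function names (`char_bigrams`, `greedy_select`) are hypothetical, and adjacent letter pairs are used only as a stand-in for diphones; a real script selection would run a text front end to obtain phone sequences first.

```python
def char_bigrams(sentence):
    """Assumed stand-in for a diphone extractor: adjacent letter pairs within each word."""
    units = set()
    for word in sentence.lower().split():
        letters = [c for c in word if c.isalpha()]
        units.update(a + b for a, b in zip(letters, letters[1:]))
    return units


def greedy_select(corpus, max_sentences, units_of=char_bigrams):
    """Repeatedly pick the sentence that adds the most not-yet-covered unit types."""
    candidates = [(sentence, units_of(sentence)) for sentence in corpus]
    covered = set()
    script = []
    while candidates and len(script) < max_sentences:
        # Score each remaining sentence by how many new unit types it would add.
        best_sentence, best_units = max(
            candidates, key=lambda item: len(item[1] - covered)
        )
        if not best_units - covered:
            break  # no remaining sentence improves coverage
        script.append(best_sentence)
        covered |= best_units
        candidates = [item for item in candidates if item[0] != best_sentence]
    return script, covered
```

Calling `greedy_select(sentences, max_sentences=600)` on a list of candidate sentences returns the chosen script in selection order together with the set of unit types it covers. Swapping `units_of` for a function that returns diphones (or diphones in prosodic context) changes what "coverage" means without changing the selection loop.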

You should record only the ‘A’ set of 593 prompts, which will yield around 30 minutes of speech material.

Because recording takes time (around 5 hours in the studio per hour of speech material obtained, so roughly 2.5 hours for the 30 minutes of ARCTIC A), you should start recording the ARCTIC A sentences immediately.

  • Adding your own material

    The ARCTIC script uses sentences from old novels, and was designed only for diphone coverage. You can do better!

  • Script design

    Once you have chosen your domain, you need to select a set of sentences to record in the studio.