The database

The quality of a unit selection system depends heavily on the speech database: both the quality of the recorded speech and the accuracy of its labels.
  • Creating the script

    The ideal database would have every unit in every possible linguistic context. That is impossible in practice, so we aim for the best coverage achievable within a limited number of sentences.

  • The effect of coverage

    In general, a larger database covers more units-in-context and so should give better-quality synthesis.

  • Sources of text

    Text can be drawn from many possible sources, but we need to take care over the type of material and respect any copyright restrictions.

  • Annotating the database

    The labels need to be consistent with predictions from the front-end, yet correspond to how the speaker said the utterance.

  • Dictionaries for accents

    Self-consistency is important for unit selection - units with the same label need to be pronounced the same - but different accents make things more complicated.

  • Automatic vs manual labelling

    Given unlimited resources, would we manually annotate the database instead of using forced alignment?

  • Labelling other properties

    The database needs to be labelled with more than just time-aligned phones. Where are those labels going to come from?

  • More interesting speech material

    The speaking style of the database determines that of the synthetic speech, so how about recording more interesting speech material?

  • Implicit vs explicit labels

    The target cost can use explicit information such as acoustic properties, or rely on implicit textual features. The database needs annotating accordingly.
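
As a rough illustration of "best coverage in a limited number of sentences" from the script-creation topic above, here is a minimal sketch of greedy text selection. It is one common way to approach the problem, not necessarily the method used in this course. The assumptions are loud ones: each candidate sentence has already been converted to a phone sequence by a front-end, diphones stand in for "units-in-context", and the function names (`diphones`, `greedy_select`) and phone symbols are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Diphone = Tuple[str, str]


def diphones(phones: List[str]) -> Set[Diphone]:
    """Return the set of adjacent phone pairs (diphones) in one sentence."""
    return set(zip(phones, phones[1:]))


def greedy_select(candidates: Dict[str, List[str]], max_sentences: int) -> List[str]:
    """Pick sentences one at a time, each time taking the sentence that
    adds the most diphone types not yet covered by the script so far.

    Assumption: `candidates` maps a sentence id to its phone sequence.
    Ties are broken by sentence id so that the result is deterministic.
    """
    covered: Set[Diphone] = set()
    chosen: List[str] = []
    remaining = dict(candidates)

    while remaining and len(chosen) < max_sentences:
        best_id = max(
            sorted(remaining),
            key=lambda sid: len(diphones(remaining[sid]) - covered),
        )
        gain = diphones(remaining[best_id]) - covered
        if not gain:
            # No remaining sentence adds any new diphone type: stop early.
            break
        covered |= gain
        chosen.append(best_id)
        del remaining[best_id]

    return chosen


if __name__ == "__main__":
    # Toy candidate pool with made-up phone strings.
    pool = {
        "s1": ["h", "e", "l", "ou"],
        "s2": ["h", "e", "l", "p"],
        "s3": ["l", "ou"],
        "s4": ["p", "a", "t", "h", "e"],
    }
    print(greedy_select(pool, max_sentences=3))
```

A real script would score richer units (for example, diphones combined with stress or phrase position) and might weight rare units differently, but the greedy loop would keep the same shape.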