If we accept Taylor's claim that we generally need only shallow processing of the text, then the problem of text-to-speech reduces to two decisions: which sub-word acoustic units to use, and which contextual features (derived from the text) to decorate those units with.
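To make the idea of "units decorated with contextual features" concrete, here is a minimal sketch. It assumes the unit is a phone and that the context consists of the neighbouring phones plus the position within the word. The class `DecoratedUnit`, the function `decorate`, the feature set, the `"sil"` boundary symbol, and the hand-written pronunciations are all hypothetical choices made for illustration; a real system might well choose different units and a much richer set of features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoratedUnit:
    """One sub-word unit plus a few shallow, text-derived context features."""
    phone: str        # the acoustic unit itself (assumed here to be a phone)
    left: str         # preceding phone, or "sil" at the start of the utterance
    right: str        # following phone, or "sil" at the end of the utterance
    word: str         # the word this phone belongs to
    pos_in_word: int  # 0-based position of the phone within its word
    word_len: int     # number of phones in that word


def decorate(words):
    """Turn a list of (word, [phones]) pairs into decorated units."""
    # Flatten to one entry per phone, remembering where it sits in its word.
    flat = [
        (word, index, len(phones), phone)
        for word, phones in words
        for index, phone in enumerate(phones)
    ]
    units = []
    for k, (word, index, length, phone) in enumerate(flat):
        left = flat[k - 1][3] if k > 0 else "sil"
        right = flat[k + 1][3] if k + 1 < len(flat) else "sil"
        units.append(DecoratedUnit(phone, left, right, word, index, length))
    return units


# Illustrative pronunciations, written by hand for this example.
utterance = [("the", ["dh", "ax"]), ("cat", ["k", "ae", "t"])]
units = decorate(utterance)
for unit in units:
    print(unit)
```

Each phone ends up as a record carrying its own context, so two occurrences of the same phone in different surroundings are treated as different units. Nothing here requires deep linguistic analysis: every feature is read directly off the phone sequence and the word boundaries.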
6 minutes 15 seconds
Understanding the problem