Prosody

Prosody for Text-To-Speech can be reduced to the problem of predicting pausing, duration, and F0.

The same written sentence can be spoken in many different ways.
Perhaps the most obvious thing a speaker can vary is the F0 contour, which is perceived by the listener as variations in pitch.
The speaker can also create variations in amplitude and duration.
The speaker is varying the prosody of the utterance.
Prosody is an area of linguistics where there are many competing theories and lots of disagreement.
But don't worry!
We're going to avoid all of that and keep things as simple as possible.
Here's a text sentence and a possible pronunciation written using the IPA.
This already tells us a lot about how to say this text out loud.
For example, it's already told us which vowels were going to get reduced to /ə/, compared to the dictionary entry for this word.
But this pronunciation doesn't specify everything we need to know before saying the sentence out loud.
If we want to create natural-sounding synthetic speech, we need more than this.
We're going to have to predict the fundamental frequency contour, the duration of each phoneme and, for longer sentences, perhaps where to insert pauses.
So again, without getting into any particular theory of prosody, here's a working summary that will be good enough for our purposes.
On the left are the linguistic functions that prosody performs in communication.
Phrasing is about how the words in a sentence form groups that are smaller than the sentence.
Rhythm is about timing and speaking rate.
Emphasis is about speakers placing relative importance on some words in a sentence, compared to others.
Intonation is the use of pitch - for example, to indicate a question.
There are also some paralinguistic functions ('para' means 'alongside'). Those generally involve knowing more than just the text, so we're not going to cover them in this course.
On the right are the acoustic correlates: the things that happen to the speech signal.
Since all we're doing is synthesising speech - we're making speech signals - all we really need to do is predict those acoustic values.
Now, voice quality cannot easily be controlled in most systems, so let's forget that one and focus on the job of predicting F0 and duration.
We're going to do that step-by-step.
Before predicting the duration of phonemes and the fundamental frequency contour, we need to predict how our sentence breaks into prosodic phrases, which might be marked by pauses or by movements in F0.
We could consider this task to be part of text processing because we'll need to use text features to do it: that's all we have available.
Often, punctuation is a good indicator.
Commas often indicate places where the speaker could (or should) break the sentence into phrases.
A comma could be marked with a pause, but it's not the only way.
You can use movements in F0 and changes in duration as well.
But using only punctuation is not good enough.
Listen to a speaker reading this sentence.
"Presently Wilbur raised his head and began speaking in that strange, resonant fashion which hinted at sound-producing organs unlike the run of mankind's."
I can hear clear phrase breaks after 'presently' and after 'fashion'.
There's no punctuation there.
In fact, the only comma in this sentence is part of a list structure and the speaker didn't make a phrase break there.
So this problem is not as trivial as just finding the punctuation.
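To make that concrete, here is a minimal sketch of a punctuation-only baseline, run on the sentence above. The function name `punctuation_breaks` and the choice of which punctuation marks trigger a break are assumptions made for illustration; the list `heard` simply records the two breaks described above.

```python
import re

def punctuation_breaks(text):
    """Return the words after which a punctuation-only baseline predicts a break."""
    breaks = []
    for token in text.split():
        # Assumption: a comma, semicolon or colon at the end of a token
        # triggers a sentence-internal phrase break.
        if re.search(r"[,;:]$", token):
            breaks.append(token.rstrip(",;:"))
    return breaks

sentence = ("Presently Wilbur raised his head and began speaking in that "
            "strange, resonant fashion which hinted at sound-producing "
            "organs unlike the run of mankind's.")

predicted = punctuation_breaks(sentence)
heard = ["Presently", "fashion"]  # the breaks the speaker actually made

print(predicted)                    # ['strange']
print(set(predicted) & set(heard))  # set(): no overlap at all
```

The baseline predicts one break, in the one place where the speaker did not make one, and misses both of the breaks that were actually there.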
After predicting where to break the sentence into phrases, we might predict the duration of each segment: of each phoneme.
The duration of a phoneme depends partly on its identity.
For example, /m/ is intrinsically longer than /p/.
Duration also depends on the context in which this phoneme is being spoken.
We therefore need to predict the duration of each of these using information about its identity and the context in which it occurs.
That context could be the neighbouring phonemes or anything else we have available, such as where it is within a syllable, whether that syllable has lexical stress,...
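As a rough sketch of what 'identity plus context' might look like as input to a duration predictor, the code below builds one set of features per phoneme. Everything here is an assumption made for illustration: the function name `phoneme_contexts`, the feature names, the use of '#' for a word boundary, and the syllabification of 'nothing'. It only prepares the features; it does not predict any durations.

```python
def phoneme_contexts(syllables):
    """Build one feature dict per phoneme.

    syllables: list of (phonemes, stressed) pairs, one pair per syllable.
    """
    flat = [p for phones, _ in syllables for p in phones]
    rows = []
    i = 0  # index of the current phoneme within the whole word
    for phones, stressed in syllables:
        for pos, p in enumerate(phones):
            rows.append({
                "phoneme": p,                                   # identity
                "previous": flat[i - 1] if i > 0 else "#",      # left neighbour
                "next": flat[i + 1] if i + 1 < len(flat) else "#",  # right neighbour
                "position_in_syllable": pos,
                "syllable_stressed": stressed,                  # lexical stress
            })
            i += 1
    return rows

# Assumed syllabification of 'nothing', with lexical stress on the first syllable.
word = [(["n", "ʌ"], True), (["θ", "ɪ", "ŋ"], False)]
rows = phoneme_contexts(word)
for row in rows:
    print(row)
```

A predictor would then map each of these feature sets to a duration.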
Then, having predicted duration, we need to place an F0 contour on the utterance.
Although we're not subscribing to any particular theory of prosody, we can say that most linguists would agree that the syllable is the smallest meaningful unit when it comes to F0 variations.
So we will make predictions on a per-syllable basis, not per-phoneme.
Here's a naturally-spoken sentence.
'Nothing's impossible.'
To understand the problem of F0 prediction, let's see how close we could get to that natural sentence with some very simple F0 contours.
Start with the simplest one of all: a constant value, a monotone F0.
'Nothing's impossible.'
OK, not very natural at all!
Speakers tend to gradually decrease F0 over the course of an utterance.
That's called 'declination'.
'Nothing's impossible.'
Well, that already sounds a lot better.
Clearly not natural, but much better than before.
Another thing speakers do is to place some F0 movement on some of the syllables in some of the words.
So I'll choose one syllable from each of the two words in the sentence: the one with primary lexical stress.
This one and this one.
I have chosen where to place intonation events and I'm going to now choose what type of event to use: what kind of movement F0 will make.
I'll use the simplest one of all, which is a simple rise and then fall of F0.
So I'll just put some little bumps on those two syllables.
That sounds like this.
'Nothing's impossible.'
And that's not bad for such a very simple F0 contour.
Let's just hear the natural one one more time.
'Nothing's impossible.'
OK, so we got somewhere close to the natural one with this very, very simple F0 contour: a declining baseline with a couple of rise-fall accents.
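That simple contour can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function name `f0_contour`, the frame count, the start and end frequencies, the accent positions and the bump height are all made-up values, and the raised-cosine shape is just one convenient way to draw a rise-fall.

```python
import math

def f0_contour(n_frames, start_hz, end_hz, accents, bump_hz):
    """Declining baseline plus rise-fall bumps.

    accents: list of (centre_frame, half_width_frames) pairs.
    """
    contour = []
    for t in range(n_frames):
        # Declination: a straight line falling from start_hz to end_hz.
        f0 = start_hz + (end_hz - start_hz) * t / (n_frames - 1)
        for centre, half_width in accents:
            d = abs(t - centre)
            if d < half_width:
                # Raised-cosine bump: bump_hz at the centre, falling to 0 at the edges.
                f0 += bump_hz * 0.5 * (1 + math.cos(math.pi * d / half_width))
        contour.append(f0)
    return contour

# Hypothetical values: two accents, one for each stressed syllable.
contour = f0_contour(n_frames=100, start_hz=130.0, end_hz=90.0,
                     accents=[(15, 10), (60, 10)], bump_hz=30.0)
print(round(contour[0], 1), round(contour[15], 1), round(contour[-1], 1))
```

Setting `accents=[]` gives the declination-only contour, and setting `end_hz` equal to `start_hz` as well gives the constant one.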
Now, at this point, if this were a whole course on prosody, we would have to have a long argument about what types of intonation events exist (rises, falls, rise-falls,...), how many types there are, and so on, and so forth.
But don't worry, we're not going to bother ourselves with that argument!
I talked about prosody and I reduced the problem to one of predicting things in a particular order.
Predicting phrase breaks, then phoneme durations, and then F0, which we reduced to predicting which syllables receive some event and then what sort of event that is.
But I actually didn't provide any methods for making those predictions.
That's because all of these tasks are now too hard for handwritten rules.
As with predicting pronunciation from spelling (for English, at least), we need something more powerful than rules.
We need machine learning.
The problems I've just outlined in this video are all going to be solved with machine learning.
They're all problems in which we predict something, given some other information: some contextual information.
We're going to predict prosody and pronunciation using machine learning.
Our first encounter with machine learning will be a decision tree.
Unlike most forms of machine learning, decision trees are - to some extent - human-friendly.
That means we'll be able to inspect the model that has been learned from data and understand how it works.
We could even attempt to write a decision tree by hand.
But to be clear, this human-friendliness is just a bonus.
It's not our top priority in machine learning.
It's just 'nice to have' sometimes.
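To give a feel for what 'human-friendly' means, here is a small decision tree written by hand for the phrase break task. It is a toy, not a learned model: the questions, their order, and the feature names (`followed_by_comma`, `comma_is_in_list`, `next_word_is_relative_pronoun`) are all assumptions made for illustration.

```python
# Each internal node asks a yes/no question; each leaf is a prediction.
tree = {
    "question": "followed_by_comma",
    "yes": {"question": "comma_is_in_list", "yes": "no-break", "no": "break"},
    "no": {"question": "next_word_is_relative_pronoun",
           "yes": "break", "no": "no-break"},
}

def classify(tree, features):
    """Walk from the root to a leaf, answering one question per node."""
    node = tree
    while isinstance(node, dict):
        node = node["yes"] if features[node["question"]] else node["no"]
    return node

# Hand-labelled features for three word boundaries in the Wilbur sentence.
after_strange = {"followed_by_comma": True, "comma_is_in_list": True,
                 "next_word_is_relative_pronoun": False}
after_fashion = {"followed_by_comma": False, "comma_is_in_list": False,
                 "next_word_is_relative_pronoun": True}
after_presently = {"followed_by_comma": False, "comma_is_in_list": False,
                   "next_word_is_relative_pronoun": False}

print(classify(tree, after_strange))    # no-break
print(classify(tree, after_fashion))    # break
print(classify(tree, after_presently))  # no-break
```

We can read the tree and see exactly why it makes each prediction. It also misses the break after 'presently', which is a reminder of why we would rather learn the tree from data than write it by hand.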