Pronunciation

The phoneme inventory is a design choice when we build a TTS or ASR system. The IPA is a helpful guide when making this choice, but we don't have to obey it, and are free to make different choices.

To convert text into speech, we first have to find the words by tokenising and normalising the input text.
The next step is to determine the pronunciation of each word.
We've seen that speech sounds form categories.
They're called phonemes, although we haven't carefully defined that term yet.
We're going to use those to specify the pronunciation of words.
Some steps in tokenisation and normalisation were performed with handwritten rules.
We could imagine, at least for now, using the same approach for pronunciation.
We would write rules that map the spelling of the word - which is a sequence of characters - to its pronunciation - which is a sequence of phonemes.
That's going to work for some languages, but not all.
Here's the IPA vowel chart again.
There are a lot of possible vowels because of every possible combination of height (4 values in this chart), advancement or front-back (3 values), and rounding (2 values).
That's 4 x 3 x 2 = 24, and then there are a few in-between ones to give us even more.
No language in the world uses that many vowels.
But why not?
The reason is that nature is very good at finding efficient solutions, and using all of those vowels would be very inefficient.
If speakers had to make that many vowels, they would have to be extremely precise with their articulator positions, and that would take a lot of energy.
Listeners would also have to expend a lot of effort in perceiving these very small differences.
So, when I say nature is very good at finding efficient solutions, I mean ones where we can be lazy and expend as little energy as possible.
So languages tend to have far fewer vowels than this.
A lot of languages have settled on five vowels as a good number: Spanish and Japanese, for example, and many, many more.
Whatever vowels a language uses, they're going to be widely dispersed around the vowel space to make it easy to produce and easy to perceive.
No language in the world would pick these four vowels, for example.
If a language has evolved to have 4 vowels, they're quite likely to be in the corners: to be far apart, acoustically and perceptually.
Vowel space is acoustically continuous.
There are no boundaries.
There are no categories.
We can make any combination of the two formants that we like.
So, of all the available vowels in the IPA chart, how do we discover which ones a particular language uses?
In other words, where in this continuous vowel space might we draw some category boundaries between the different vowels?
To put that another way: how do we tell the difference between variation within categories (which a speaker is going to make because they're not being precise, but which a listener will hear as the same category), and variation that crosses a category boundary?
Here's how you'd do it if you were documenting a language for the first time.
With the help of a native speaker, you'd look for pairs of words (or at least, possible words) that differ in only one sound.
If changing a single sound changes the word, then that pair of words is called a 'minimal pair', and the two contrasting sounds are different phonemes in that language.
That's the definition of a phoneme.
You'll see that I'm now writing the phonemes inside slashes like this.
Earlier, I had them in square brackets.
The slashes mean that this is a string of phonemes - of abstract categories.
There's no speech signal here.
The square brackets used earlier indicated that I was transcribing actual speech.
These are some minimal pairs that tell us about vowels in English.
'bit' and 'bet' are different words, so /ɪ/ and /e/ are different phonemes.
'bit' and 'beat' are different words, so /ɪ/ and /iː/ are different phonemes.
That doesn't just work for vowels; it works for consonants too.
'bit' and 'pit' are different words, so /b/ and /p/ are different phonemes.
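To make that test concrete, here's a small sketch in code (the word list and the helper function are just for illustration, not a real lexicon): given two pronunciations as sequences of phoneme symbols, we check whether they differ in exactly one position.

```python
# Sketch: check for minimal pairs in a toy pronunciation list.
# The words and pronunciations are illustrative, not a real lexicon.

def is_minimal_pair(pron_a, pron_b):
    """True if two same-length pronunciations differ in exactly one phoneme."""
    if len(pron_a) != len(pron_b):
        return False
    return sum(1 for a, b in zip(pron_a, pron_b) if a != b) == 1

words = {
    "bit":  ["b", "ɪ", "t"],
    "bet":  ["b", "e", "t"],
    "beat": ["b", "iː", "t"],
    "pit":  ["p", "ɪ", "t"],
}

print(is_minimal_pair(words["bit"], words["bet"]))  # True: /ɪ/ vs /e/ contrast
print(is_minimal_pair(words["bit"], words["pit"]))  # True: /b/ vs /p/ contrast
print(is_minimal_pair(words["bet"], words["pit"]))  # False: two sounds differ
```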
Now, you'd be excused for thinking that the phoneme inventory is going to be very well-defined for all the languages we might ever want to build a Text-To-Speech or Automatic Speech Recognition system for.
Unfortunately not!
The phoneme inventory is an internal part of the system: it's not exposed to the users.
So it's available to us as one of the many design choices that we'll need to make as the system designers and builders.
To give you an idea of that sort of choice, let's consider allophones.
Sometimes there are two sounds which we think might be different phonemes because they're acoustically different, but we just can't find any minimal pairs in the language.
Here are some English words containing /l/ sounds.
The word-initial one is often called 'light l' and the word-final one is often called 'dark l'.
They're produced with slightly different articulator positions and they have slightly different acoustics.
But you'll never find a minimal pair that differentiates between dark l and light l - they're called allophones.
Dark l and light l are just one example of allophonic variation.
Now, since we are the system designers it is up to us whether we want to use the same category for both dark and light l, or have them in two categories.
For Text-To-Speech, where we are generating speech and want the acoustics to be right - and the acoustics are different for dark and light l - we'll probably want different categories: in other words, different symbols.
But for Automatic Speech Recognition, since the difference never distinguishes two words, then perhaps using the same category (the same symbol) would be fine.
So for Automatic Speech Recognition, we might have the pronunciation like this for this orthographic word.
But for speech synthesis, we might have it like this and use a different symbol for the dark l.
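The exact symbols are on the slide rather than in this transcript, but to give a rough flavour of the design choice, the two entries might look something like this (the word and the symbols are just an illustration; 'ɫ' is one common way of writing dark l):

```python
# Illustration only: one orthographic word, two phoneme-set design choices.

# ASR: a single /l/ category, because the difference never distinguishes words.
asr_lexicon = {"feel": ["f", "iː", "l"]}

# TTS: a separate symbol for the word-final dark l, because the acoustics
# we generate should be different for the two variants.
tts_lexicon = {"feel": ["f", "iː", "ɫ"]}
```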
This isn't a complete course on phonology, which is that part of linguistics concerned with categories of sounds.
You need to take a course on that alongside these videos.
This video is really just to make the connection for you between sound categories and our two applications of Text-To-Speech and Automatic Speech Recognition.
So, let's continue with that.
Now we have a much clearer definition of what a phoneme is, we can get back to the problem we're currently trying to solve: converting a string of characters into a string of phonemes.
This is commonly called grapheme-to-phoneme conversion, or 'G2P' for short (even though the input is strictly characters rather than graphemes); you'll also find it called 'letter-to-sound'.
Here I've attempted to write down some context-sensitive rewrite rules for G2P.
On the left for Spanish, and on the right for English.
Like in the Handwritten Rules video, I'm not using any particular formalism for formatting these.
There are many ways of doing it, and in fact this notation is different to the previous videos.
We're not going to get hung up on notation!
This rule says that the character 'a' goes to the phoneme /a/ regardless of the context it occurs in.
This rule says that the character 'c', when it's going to be followed by the character 'i' goes to the phoneme /θ/.
Already, you can see that the rules for Spanish look nice and simple, and the rules for English are already considerably more complicated.
In fact, for Spanish, around 50 rules will cover everything.
Those can be written by hand and work well.
That's just not possible for English, because there are so many exceptions.
Try it for yourself if you want to prove me wrong!
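To make the idea of context-sensitive rewrite rules concrete, here's a minimal sketch of how a handful of Spanish-style rules might be applied in code. The rule set is tiny and purely illustrative - nothing like the full set of around 50 - and the notation is my own, not the one on the slide.

```python
# Minimal sketch of rule-based G2P, illustrative only.
# Each rule is (grapheme, required following character or None, phoneme).
# Rules are tried in order; the first matching rule wins.
rules = [
    ("c", "i", "θ"),   # 'c' before 'i' -> /θ/ (Castilian Spanish)
    ("c", "e", "θ"),   # 'c' before 'e' -> /θ/
    ("c", None, "k"),  # 'c' elsewhere -> /k/
    ("a", None, "a"),  # 'a' -> /a/ regardless of context
    ("o", None, "o"),
    ("i", None, "i"),
    ("n", None, "n"),
    ("s", None, "s"),
]

def g2p(word):
    phonemes = []
    for position, character in enumerate(word):
        following = word[position + 1] if position + 1 < len(word) else None
        for grapheme, context, phoneme in rules:
            if character == grapheme and (context is None or following == context):
                phonemes.append(phoneme)
                break
    return phonemes

print(g2p("casa"))    # ['k', 'a', 's', 'a']
print(g2p("cocina"))  # ['k', 'o', 'θ', 'i', 'n', 'a']
```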
For English, and some other languages, the only reliable source of word pronunciations is a dictionary: a large table of orthographic forms and their pronunciations, written by an expert lexicographer.
But even the largest dictionary can never cover all the words in the language because new words are invented every day.
What we need to do is to have a large dictionary and extrapolate from that to all the new words that aren't in the dictionary.
We need a way to learn from all the examples in the dictionary and automatically create G2P rules.
We'll treat the dictionary as data and we'll use machine learning to create our set of rules.
Now the 'rules' may or may not actually be rules.
So in general we should stop saying 'G2P rules' and start talking about a 'G2P model'.
We won't throw our dictionary away.
When performing Text-to-Speech, we'll always look for a word in the dictionary first, because it's reliable, and only if it's not there resort to G2P.
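In code, that strategy is very simple - something like this sketch, where 'dictionary' and 'g2p_model' are placeholders for the real lexicon and the learned model:

```python
# Sketch of the usual front-end strategy: trust the dictionary first, and
# only fall back to the (less reliable) G2P model for out-of-dictionary words.
def pronounce(word, dictionary, g2p_model):
    entry = dictionary.get(word.lower())
    if entry is not None:
        return entry        # reliable, expert-written pronunciation
    return g2p_model(word)  # learned model, used only when we have to
```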
Exactly what our dictionary contains might vary depending on our application.
The first line on this slide is a transcription written in the IPA.
It's written in square brackets to indicate that.
In other words, that's the transcription of how somebody actually said this word.
That's actually not a dictionary entry.
The second line is a dictionary entry that might be used for Automatic Speech Recognition.
The symbols are not in the IPA, but that's just a convenience to make them machine readable.
That difference isn't important.
The difference that is interesting is between the Automatic Speech Recognition dictionary and the Text-to-Speech dictionary.
This one is for ASR.
This one is for TTS.
This 3rd line is much richer than the others.
It has extra information.
It shows syllable structure.
For each syllable, it indicates - with the number 1 or another number - whether that syllable has lexical stress.
So this word is said 'impossible' with stress on the 2nd syllable; that's marked with this '1' here.
The TTS dictionary writes all the vowels as their full vowels.
It doesn't write anything as /ə/.
So it will be the job of the TTS system to decide whether the vowels in some of the unstressed syllables reduce to /ə/, as they did in this transcribed speech.
This symbol is a syllabic /l/ : that indicates a consonant that can form a syllable without needing a vowel.
Finally, the TTS dictionary might include Part-Of-Speech (POS) information, because that helps us look up the correct entry in the case of homographs.
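To show how much extra structure that is, here's one way the richer TTS entry might be represented as data. The exact symbols, syllabification, and Part-Of-Speech tag here are illustrative - they're not copied from the slide or from any particular lexicon.

```python
# Illustrative TTS-style entry: syllable structure, lexical stress
# (1 = stressed, 0 = unstressed), a syllabic /l/, and POS information.
tts_entry = {
    "word": "impossible",
    "pos": "adjective",
    "syllables": [
        {"phonemes": ["ɪ", "m"], "stress": 0},
        {"phonemes": ["p", "ɒ"], "stress": 1},  # stress on the 2nd syllable
        {"phonemes": ["s", "ɪ"], "stress": 0},
        {"phonemes": ["b", "l̩"], "stress": 0},  # syllabic /l/, no vowel needed
    ],
}

# An ASR-style entry for the same word might just be a flat string of
# machine-readable symbols (again, illustrative only).
asr_entry = {"impossible": ["i", "m", "p", "o", "s", "i", "b", "l"]}
```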
The key point to understand here is that the dictionary (and its phoneme set) is something where there are design choices to be made, and that we might make different choices for different applications.
We've defined the phoneme and we're going to use it to write down pronunciations of words, either by writing a dictionary or by learning from that dictionary a G2P model.
What we need now, then, is some machine learning to solve that problem of G2P.
Machine learning can offer us all sorts of different types of models that we might choose from.
We're going to start with something very simple, but which actually has a very wide range of applications.
Decision Trees can be used in lots and lots of problems where we make predictions about something.
That 'something' can be symbolic or numerical.
Those predictions are based on knowing the values of some predictors.
Those can also be of any type we like: symbolic, or numerical.
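As a taste of what's coming, here's a minimal sketch of a decision tree making a symbolic prediction from a symbolic predictor, using scikit-learn. The toy training data is invented purely to show the shape of the problem; in real G2P, the examples would come from the dictionary and we'd use much richer context.

```python
# Toy example: predict the phoneme for the letter 'c' from the following
# character, using a decision tree. The data is invented for illustration.
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

next_chars = [["i"], ["e"], ["a"], ["o"], ["u"], ["i"], ["a"]]  # predictor
phonemes = ["θ", "θ", "k", "k", "k", "θ", "k"]                  # prediction

# The tree needs numerical features, so one-hot encode the symbolic predictor.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(next_chars).toarray()

tree = DecisionTreeClassifier()
tree.fit(X, phonemes)

# What does 'c' map to when followed by 'e'?
print(tree.predict(encoder.transform([["e"]]).toarray()))  # ['θ']
```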
