Sub-word units

The training data generally won't contain examples of every word in our vocabulary, so we will need to use sub-word units.

So we're going to wrap up fairly quickly now. We'll just put everything together and cover a few things that we didn't talk about yet, such as how on earth you do truly connected speech recognition. How would you do this with a larger vocabulary? What about fancy language models? We'll see that although what we've spoken about so far, what we did in the assignment, looked really quite simplistic (just whole word models and really simple grammars), scaling from that to a 10,000-word vocabulary with a trigram language model is almost trivial, because we know everything we need to know already. In order to do that, we just need to see some pictures of how we get from a small model to a big model; the training algorithm stays the same and the recognition algorithm stays the same. So you pretty much know how to do it. I just need to kind of point the way.
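As a reminder of what a trigram language model means (this is the standard textbook definition, not anything specific to the assignment), it approximates the probability of a whole word sequence using only the two preceding words at each step:

    P(w_1, w_2, \dots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-2}, w_{t-1})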
So let's first talk about how we would do truly continuous speech, such as the lecture I'm giving now. What if we wanted to transcribe that? What do we have to do differently from our digit recogniser? First thing, obviously: we need a larger vocabulary. We might need 5,000 to 10,000 words for typical conversational speech, 30,000 words, or 100,000 words if you want arbitrary wide-coverage transcription. Those are the sorts of vocabulary sizes that commercial systems which want to subtitle YouTube videos might have. The words in the vocabulary are just an enumeration of every possible inflected form of every word. There's no morphology, no cleverness, in any of these vocabularies: it's just a list of every word form in a great big long dictionary.
So we need a large vocabulary, and we need to handle continuous speech: there aren't going to be these convenient silences between everything, the words are just going to run into each other. And we're going to face the same problem as in synthesis, which is that at recognition time someone will say a word to the system that we didn't have a recording of when we built the system, so we can't possibly train whole word models. We aren't going to deal with the case where the word's not even in the dictionary. Most systems don't do that; some advanced systems might try. Mostly we'll have a fixed dictionary, and whatever is said has to be within that vocabulary. We just don't have the corresponding acoustic training data for every word, so that sounds like a rather major challenge. But the changes are going to be relatively small.
We'll talk about that now. So whole word models are simply not going to work. We need some number of examples to train a model. That's because we need some number of frames to align with each Gaussian to get reliable estimates for its mean and its variance. If there are too few samples, we'll get very unreliable estimates of the mean and variance, so we need enough data per model. We can't do whole words because we couldn't possibly collect enough examples, so we're going to simply use sub-word models, and they're just models of phonemes. Phoneme models are the most typical thing to use. It might be language-dependent, we might do something a bit different for some languages, but phonemes are going to work pretty well.
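To see why the amount of data matters, here are the standard maximum-likelihood estimates for a single Gaussian (a textbook formula, nothing HTK-specific): with only N frames x_1, ..., x_N aligned to a state, we estimate

    \hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad
    \hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu})^2

and both estimates become very noisy when N is small, which is exactly the problem with rarely-seen whole words.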
We're going to need a dictionary that maps from words to phonemes. Hopefully you had a look at the dictionary that I gave you for the digit recogniser. It's a rather strange-looking dictionary that just says "one" is pronounced as "one". A dictionary is just a mapping from things in the language model to acoustic models. The things in the left column are the names of things that appear in the language model, and the things in the right column are the names of things that appear in the acoustic model. In the digit recogniser, the names of the models are the names of the words. If it was a phonetic system, the left column would still be words, the things that appear in the language model, and the right column would be a string of, for example, phonemes.
So, the pronunciation dictionary. We're just going to get that in the usual way: by writing it by hand. In fact, as researchers we don't like dealing with nasty things like hand-written pronunciation dictionaries, so if we want to expand the vocabulary we might even use letter-to-sound rules to bulk up that dictionary. You can get online dictionaries like CMUdict and other open-source, but rather low-quality, dictionaries. The entries in there may or may not be hand-checked; they may have just been automatically generated, so they're somewhat unreliable. That's not particularly critical as long as it's consistent between training and testing time. So we need a pronunciation dictionary, and it's just a mapping from language model things to acoustic model things.
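As a rough illustration (the entries here are hypothetical, not copied from the assignment files), the two kinds of dictionary might look like this in a Python sketch:

    # whole-word digit recogniser: language-model names map straight to acoustic-model names
    digit_dictionary = {
        "one": ["one"],
        "two": ["two"],
        "three": ["three"],
    }

    # phonetic system: the same words in the left column, but the right column is a string of phoneme model names
    phonetic_dictionary = {
        "one": ["w", "ah", "n"],
        "cat": ["k", "ae", "t"],
    }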
And the word models are going to be trivial to make: we're going to make them by joining together sub-word models. We'll come on to that in a moment.
Let's then talk quickly about language models. We know most of this, hopefully, from having done it in the assignment. There's the language model from the digit recogniser. It's kind of trivial, and we can write it by hand. We could have written it directly in this finite-state format, or we could have used a slightly more friendly language, such as this grammar language from HTK, to write it and then compile that into this format. It doesn't really matter; that's just a tool for writing it.
How, then, if we wanted, for example, to build this system but build it with sub-word models, where would we get the model of each of these words from? We can just make models of words by concatenating models of phonemes. And here we see that this is trivial because of a property of HMMs: the Markov property. Because the probability of generating an observation from a particular state does not depend on where we came from, only on being in that state, we can join models together and the result is still a completely valid HMM. If we want to make a model of this dictionary word "cat", and we want to make it from individual phoneme models, we just take the individual phoneme models, that's the /k/, that's the /ae/ and that's the /t/, and we join them together with these little arcs. We can see what those dummy states in HTK are useful for now. And that's a model of the word "cat" that we could then use to generate observation sequences for the word "cat". It's as simple as that; it's almost trivial.
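Here is a minimal sketch of that idea in Python (not HTK code; the class and function names are made up for illustration): each phone model is a short left-to-right HMM, and the word model is just the phone models joined end to end, with the dummy non-emitting entry and exit states acting as the glue.

    class HMM:
        """A toy HMM: a name plus a list of emitting states, each with a mean and variance."""
        def __init__(self, name, states):
            self.name = name
            self.states = states  # list of (mean, variance) tuples

    def concatenate(models, name):
        """Join models in sequence; the exit of each model feeds the entry of the next.
        The result is still a perfectly valid HMM."""
        states = []
        for m in models:
            states.extend(m.states)
        return HMM(name, states)

    # hypothetical 3-state phone models (parameter values are just placeholders)
    k, ae, t = (HMM(p, [(0.0, 1.0)] * 3) for p in ["k", "ae", "t"])
    cat = concatenate([k, ae, t], "cat")   # a 9-state model of the word "cat"

In a real system each state would also carry its transition probabilities; concatenation simply links the exit transition of one model to the entry of the next, which is exactly what the dummy states are for.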
And not only can we do that for recognition, we can also do it when we're training the models. So imagine we want to train phoneme models; we want to train models of these things. One solution would be to record data and hand-label the beginning and the end of every phone, every acoustic instance of a phoneme, just like you did in the digit recogniser. That would be immensely expensive. It wouldn't be very reliable, because people's accuracy at doing that is not going to be great unless they're highly trained, and it's going to take 10 or 100 times real time to do that labelling. But it turns out we don't need to do that.
Let's imagine instead that we just label the start and end of every word. How would we then train models of phonemes if we don't know where each model starts and ends? So let's say this is a model of the word "cat". We know the time stamp at which this model should start, so we know when we should enter this model, and we know the time stamp when we should leave this model, but we don't know how the states align with the frames. In other words, we don't know at what time we should go from one phoneme to the next phoneme in the model. That doesn't matter. We already know how to deal with that, because we already know how to train a model where we don't know the alignment between the states and the observations: we can first do uniform segmentation, then we can do Viterbi training.
And then we do full Baum-Welch. Because this thing here, who cares what it's a model of? It doesn't matter whether it's a model of "cat" or a whole word model. It's just an HMM: states with transition probabilities, and we know where it starts and where it ends, and we can just train its parameters in the usual way. Okay, it's that simple.
There's no cleverness there. We just temporarily make a model of the word "cat" by joining together our phoneme models, do the alignment, find which frames align with which states, and just remember that. Repeat that over all of the training data. And then, at the very end, this state will have participated in the word "cat" and in all the other words with a /k/ in them; it will have been aligned with a bunch of frames, and we just average them to get the mean for that state. So we don't need to know the phoneme boundaries.
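A minimal sketch of what uniform segmentation does, assuming we know only the start and end frames of the word (this is a hypothetical helper, not an HTK tool): the frames between the word boundaries are simply divided equally among the emitting states of the concatenated word model, giving a first, crude state-to-frame alignment from which means and variances can be estimated.

    def uniform_segmentation(n_frames, n_states):
        """Assign frames 0..n_frames-1 evenly to n_states states;
        returns one list of frame indices per state."""
        cuts = [round(i * n_frames / n_states) for i in range(n_states + 1)]
        return [list(range(cuts[i], cuts[i + 1])) for i in range(n_states)]

    # e.g. 30 frames of "cat" spread over the 9 states of the k-ae-t model above
    alignment = uniform_segmentation(30, 9)

Viterbi training then replaces this crude alignment with the best one, and Baum-Welch replaces the hard alignment with a soft one.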
We can go one step further. What if we don't even know the word boundaries? We just have whole utterances with their word transcriptions, whole sentences. We do the same thing: for each word, temporarily construct a word model like that, and then for each sentence concatenate the word models to make a temporary model for the sentence. It's just a great big long HMM; maybe it's got tens or hundreds of states. We know the start time, we know the end time, and we know the names of the sequence of models, because that's how we constructed it. We just do uniform segmentation, or Viterbi training, or Baum-Welch, to learn the parameters of that model.
So it turns out you didn't actually need to do what you did in the assignment. You could have got away without labelling the starts and ends of words, and just labelled the sequence of words in one great big long file. So we could have done that. The extension to connected speech, or to data where we don't have labels at the model level, perhaps only at the word level or the sentence level, is essentially trivial. We just temporarily construct models by concatenating sub-word models to make words, and words to make sentences, to get a temporary HMM. We do the alignment, remember that alignment, accumulate it across all the training sentences, and at the very end update the model parameters, and then go around again.
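Putting that together, here is a high-level sketch of embedded training in Python. The helpers align(), accumulate() and update() are assumed, not defined here (they stand for the alignment step, the per-state statistics gathering, and the final parameter update); concatenate() and the dictionary are from the sketches above.

    def embedded_training(sentences, phone_models, dictionary, n_iterations=5):
        """sentences: list of (audio, word_list) pairs with no time alignments at all."""
        for _ in range(n_iterations):
            stats = {}                                    # per-state accumulators
            for audio, words in sentences:
                # build one great big long HMM for the whole sentence
                word_hmms = [concatenate([phone_models[p] for p in dictionary[w]], w)
                             for w in words]
                sentence_hmm = concatenate(word_hmms, " ".join(words))
                alignment = align(sentence_hmm, audio)    # assumed helper: Viterbi or Baum-Welch alignment
                accumulate(stats, alignment, audio)       # assumed helper: remember which frames went to which states
            update(phone_models, stats)                   # assumed helper: re-estimate means and variances at the very end
        return phone_models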
In fact, we might even train a system in a very simplistic way that can work quite well: just forget the whole uniform segmentation and Viterbi steps and go straight in with Baum-Welch, with no alignments or labels at all. That works reasonably well if you've got lots of data, and it's called a flat start. So we could have started from prototype models, which have to have some parameters, otherwise we can't even start: set the parameters to something very naive, like zero for all the means and one for all the variances. We just align them with the data. That will be a rather arbitrary alignment, because there'll be nothing to tell it how to align, but beginnings will tend to align with beginnings and ends with ends. We'll get a first cut of the model parameters, and we just iterate that: do Baum-Welch with sentence-level labels. So in the assignment you used tools called HInit, which does uniform segmentation and Viterbi training, and HRest; HRest does Baum-Welch, but it needs to know where the models start and end.
And there's another tool that doesn't even need to know where the models start, and that's called HERest. The E stands for "embedded". This is a thing called embedded training, and it's what we do in a real, big system. We just have audio chunked into sentences, with word-level transcriptions of those whole sentences as text, and that's what we train on. We might do a flat start. If we're building a really big system, we'd probably just start out with some models from a previous system, which came from a previous system, going all the way back, probably, to something in the distant past that was trained on hand-aligned data. We might use those as seed models to get a first alignment, or we might just do a flat start.
Okay, so the key point is that none of this makes anything more complicated, because joining together HMMs just gives us an HMM. Therefore we already know what to do with it: we already know how to train it, and we already know how to do decoding with it. There are no new techniques whatsoever needed to do that. That's the beauty of HMMs. Now we see why the Markov property is so nice: we can do all of this because this state here just doesn't care that there was another model before it. It doesn't even know; tokens just arrive in it and we can compute emission probabilities. We just turn the handle.
