Additional Videos

Some more videos on HMM training for ASR that were already on speech.zone.

Total video to watch in this section: 91 minutes

Please note the Module 10 lecture (2 Dec 2021) has been cancelled due to the strike.

Once we understand token passing within a single HMM, the extension to continuous speech is surprisingly easy.

This video just has a plain transcript, not time-aligned to the video.
What we want to talk about now, in the last few minutes, just to get your mind started, is this: how do we take that idea of isolated words - dynamic programming over isolated, single HMMs generating single observation sequences - and string those together to make a recogniser that can recognise connected things? Connected words, or connected phonemes.
So we'd like a large vocabulary, perhaps. Maybe you wanted to build something not for the 10 digits but for 10,000 words or 100,000 words.
How would we do that? We're probably not going to sit down and record seven examples of each of those 10,000 words; that's not going to be the way to do it.
We'd also like to have connected or continuous speech: strings of things, one word followed by another word. That's something I would encourage you to try to get onto in the labs - strings of digits, with a little bit of silence in between to make life simple. We'll talk about that in the lab.
So how do we do that? It's going to turn out to be really, really easy, because we've set our models up in a way that makes it easy.
So let's just start with another way of writing down what we've been doing so far. So far, we've been doing isolated word recognition. For each of the words we have a hidden Markov model, with these dummy states at each end, and that, maybe, is the model of the word 'zero'. We could separately compute the probability of O given 'zero' and the probability of O given each of the other words, and then compare those. Or we could be a little bit clever and put all of that into a single computation. And we do it like this.
We'll take the models - this thing here is just a model, a model of 'zero', with however many states we fancy, and these little dummy states - and we're just going to string them together into a kind of super-HMM: one big hidden Markov model that joins them all through a dummy state at the beginning and a dummy state at the end. We now have a digit recogniser. That's a more complicated-looking model, but it's just a hidden Markov model.
Conceptually it's no different. It just happens to have these parallel branches: it's got branches like this, all these branches, and then they happen to come together at the end. There's nothing in there that's conceptually any different from a simple linear model.
Specifically, token passing will just work here, too. For example, if we want to do token passing on this model, we'll put a token - let's clear this - we'll put a token here, and we'll turn the handle.
It will send copies into each of the models.
The copies will do the thing they would have done in each model on its own: go around and do their thing. Tokens in here, tokens in here; turn the handle; lots of tokens all jumping forwards in the algorithm.
And just as we generate the last observation and turn the handle that last time, a token will pop out of each of them. This token will have written on it the probability of O given the model 'zero'; this one will have the probability of O given the word 'one'.
They all arrive in this state: 10 tokens will suddenly arrive. One of them will have the highest probability; we'll destroy the other nine and announce that token as the winner. Well - which word did that token come through? The token now needs to remember its path. And then we've done recognition.
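To make that concrete, here's a minimal token-passing sketch in Python. This is hypothetical code, not how HVite is actually implemented: transition probabilities are omitted, each word model is a left-to-right HMM with one diagonal-covariance Gaussian per state, and - with uniform word probabilities - comparing the tokens that pop out of each word's final state is equivalent to merging them at a shared end state.

```python
import numpy as np

# Each word model: a left-to-right HMM given as lists of per-state
# (mean, variance) vectors for a diagonal-covariance Gaussian.
# Transition probabilities are omitted for simplicity.

def log_gaussian(x, mean, var):
    # Log density of a diagonal-covariance Gaussian.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def recognise(observations, word_models):
    best_word, best_logp = None, -np.inf
    for word, (means, variances) in word_models.items():
        n_states = len(means)
        # tokens[j]: best log probability of any path ending in state j.
        tokens = np.full(n_states, -np.inf)
        tokens[0] = log_gaussian(observations[0], means[0], variances[0])
        for x in observations[1:]:   # turn the handle once per observation
            new = np.full(n_states, -np.inf)
            for j in range(n_states):
                stay = tokens[j]                            # self-loop
                move = tokens[j - 1] if j > 0 else -np.inf  # from the left
                new[j] = max(stay, move) + log_gaussian(x, means[j], variances[j])
            tokens = new
        # The token popping out of the final state carries log P(O | word).
        if tokens[-1] > best_logp:
            best_word, best_logp = word, tokens[-1]
    return best_word, best_logp
```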
And that's exactly how, in code, the HVite program you're using works. It doesn't separately compute each of those 10 models; it joins them together in a network, and you can see that network. It's there in the files: it's the language model, and it's in here. It's just a very simple grammar: it just says 'this, or this, or this, or this'.
There are no actual probabilities on these arcs. In other words, implicitly they all have equal probability: the uniform probability.
Just one more minute, then - we're running just slightly over. Let's just look at how that generalises to something more than just isolated words. It's going to be really, really easy.
Okay? When we take an HMM of a unit - whether it's a word or a sub-word unit like a phoneme - and we join it to another HMM, what's the result? It's just a bigger HMM. There's nothing different about it: it's all just Gaussians, transitions and states. It just looks bigger and a bit more complicated, but the algorithm isn't going to care about that. It's still going to work.
So if we want to make a model of a word - let's make a model of a word now from sub-word units, a model of the word 'cat' - imagine I never recorded the word 'cat', but I recorded some other things with a /k/ in them, some things with an /ae/ in them, something with a /t/ in them, and trained models of those phonemes from those recordings. And I want to make a model of 'cat'. I just take the model of /k/, the model of /ae/ and the model of /t/, and just use these special little dummy states - non-emitting states - to join them together, putting some transitions there. This thing here is just a model: a model of the word 'cat'. It happens to have nine states; the topology doesn't matter; it happens to be left-to-right. We've made a model for something that we never saw in the training data. We can string arbitrary HMMs together to make models of sequences of things, and that just generalises to making sequences of words.
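As a sketch of how trivial that joining is - a hypothetical representation in Python, where an HMM is just a list of its emitting states:

```python
# Joining HMMs end-to-end gives something that is still just an HMM.
# Hypothetical representation: an HMM is a list of emitting states.

def concatenate(*models):
    joined = []
    for model in models:
        joined.extend(model)  # the non-emitting dummy states just glue these
    return joined

# e.g. a nine-state model of "cat" from phoneme models trained on other words:
# cat = concatenate(models["k"], models["ae"], models["t"])
```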
Okay, so we can have models of utterances that are just sequences of word models; models of words that are just sequences of phone models; and each phone model is just a sequence of HMM states. There's a beautiful hierarchy to all of this modelling, and the point - then we'll stop on the next slide - is that it points us towards a form of language model that is compatible: in other words, one that has the same properties as a hidden Markov model. A language model that is just a finite state network allows us to string things together, so that when we plug acoustic models into the language model, we get something that is still a valid hidden Markov model. That tells us something about the sort of language model we're going to be allowed to use. Let's just quickly look at what one of those might look like.
The language model is going to have to have this sort of property: it's going to be something that we can write as a finite state network. It could be written by hand, or it could be automatically learned from data. It doesn't matter, as long as it's of the same form as the hidden Markov model. In other words, it's got states and transitions joining things together, and we're just going to substitute in the hidden Markov model for each word.
That could be a whole-word model, or it could be a model made from phoneme models. It doesn't matter. Substitute all of that in - we might call that compiling - and then just put a token here, turn the handle, and see what happens. Tokens flow through the model, and at some point a token pops out of the end with the answer on it.
This language model could be very simple or very complex.
It doesn't matter.
The same algorithm applies.
So the language model you get given for the assessment is this one. What you might want to do is think about how you would extend it to do sequences of digits: what would you need to do to this language model to allow one digit to follow another digit? A sketch of that idea follows.
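Here's one way to picture it, as a minimal sketch in Python. This is a hypothetical representation of a finite state network, not HTK's grammar format; the key change is the arcs that loop back, so another digit may follow.

```python
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

# Isolated digits: START -> one digit -> END.
isolated = {"START": [(digit, "END") for digit in DIGITS]}

# Digit sequences: after a digit, either say another digit or finish.
# (None, "END") stands for an empty transition out of the loop.
connected = {
    "START": [(digit, "LOOP") for digit in DIGITS],
    "LOOP": [(digit, "LOOP") for digit in DIGITS] + [(None, "END")],
}
```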

The training data generally won't contain examples of every word in our vocabulary, so we will need to use sub-word units.

This video just has a plain transcript, not time-aligned to the video.
So we're going to wrap up fairly quickly now: just put everything together and cover a few things that we didn't talk about yet, such as how on earth you do truly connected speech recognition. How would you do this with a larger vocabulary? What about fancy language models? We'll see that although what we've spoken about so far - what we did in the assignment - looked really quite simplistic, just whole-word models and really simple grammars, scaling from that to a 10,000-word vocabulary with a trigram language model is almost trivial, because we know everything we need to know already. In order to do that, we just need to see some pictures of how we get from a small model to a big model; the training algorithm is the same, and the recognition algorithm is the same. So you pretty much know how to do it. I just need to point the way.
So let's first talk about how we would do truly continuous speech, such as the lecture I'm giving now. What if we wanted to transcribe that? What do we have to do differently from our digit recogniser? First thing: obviously, we need a larger vocabulary. We might need 5,000 to 10,000 words for typical conversational speech; 30,000 words, or 100,000 words, if you want arbitrary, wide-coverage transcription. Those are the sorts of vocabulary sizes that commercial systems might have - to subtitle YouTube videos, say, using vocabularies of that sort of size. A word in the vocabulary is just listed: the vocabulary enumerates every possible inflected form of every word. There's no morphology, no cleverness, in any of these vocabularies; they just list every word form in a great big long dictionary. So: a large vocabulary.
And we need to handle continuous speech, so there's not going to be any of this nice silence between everything; words are just going to run into each other. And we're going to face the same problem as in synthesis, and that's at recognition time: someone will say a word to the system that we didn't have a recording of when we built the system, so we can't possibly train whole-word models. We aren't going to deal with the case where the word's not even in the dictionary; most systems don't do that. Some advanced systems might try, but mostly we'll have a fixed dictionary, and what's said has to be within that vocabulary. We just don't have the corresponding acoustic training data to do anything else. So that sounds like a rather major challenge, but the changes are going to be relatively small.
We'll talk about that now. So: whole-word models are simply not going to work. We need some number of examples to train a model. That's because we need some number of frames to align with each Gaussian to get reliable estimates of its mean and its variance. With too few samples we'll get very unreliable estimates of mean and variance. We need enough data per model, so we can't do whole words - we couldn't possibly collect enough examples. We're going to simply use sub-word models, and they're just models of phonemes. Phoneme models are the most typical thing to use. That might be language dependent - we might do something a bit different for some languages - but phonemes are going to work pretty well.
We're going to need a dictionary that maps from words to phonemes. Hopefully you had a look at the dictionary that I gave you for the digit recogniser. It's a rather strange-looking dictionary that just says 'one' is pronounced as 'one'. A dictionary is just a mapping from things in the language model to acoustic models. So the things in the left column are the names of things that appear in the language model, and the right column is the names of things that appear in the acoustic model. In the digit recogniser, those are the names of words and the names of whole-word models. If it was a phonetic system, the left column would still be words - things that appear in the language model - and the right column would be a string, for example, of phonemes.
So: a pronunciation dictionary. We're just going to get that in the usual way, by writing it by hand. In fact, as researchers, we don't like hand-writing nasty things like pronunciation dictionaries, so to expand the vocabulary we might even use letter-to-sound rules to bulk up that dictionary. So you get online dictionaries like CMUdict and other open-source, but rather low-quality, dictionaries. The entries in there may or may not be hand-checked; they may have just been automatically generated, so they're somewhat unreliable.
That's not particularly critical, as long as it's consistent between training and testing time. So we need a pronunciation dictionary, and it's just a mapping from language-model things to acoustic-model things. And the word models are going to be trivial to make: we're going to make them by joining together sub-word models. We'll come on to that in a moment.
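For example, the two kinds of dictionary might look like this - a sketch with illustrative entries, using ARPAbet-style phoneme symbols:

```python
# Whole-word system: language-model names map one-to-one onto
# acoustic-model names, which is why the digit dictionary looks so strange.
word_dictionary = {"one": ["one"], "two": ["two"]}

# Phonetic system: each word maps to a string of phoneme model names.
phone_dictionary = {"one": ["w", "ah", "n"], "two": ["t", "uw"]}
```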
Let's then talk quickly about language models. We know most of this, hopefully, from having done it in the assignment. There's the language model from the digit recogniser. It's kind of trivial, and we can write it by hand. We could have written it directly in this finite state format, or we could have used a slightly more friendly language, such as this grammar language from HTK, to write it and then compile it into this format. It doesn't really matter; that's just a tool for writing it.
How, then, if we wanted, for example, to build this system but build it with sub-word models, would we get the model of each of these words? We can just make models of words by concatenating models of phonemes. We now see that this is trivial because of the property of HMMs - because of this Markov property. Because the probability of generating an observation from a particular state does not depend on where we came from - it depends only on being in that state - we can join together models and the result is still a completely valid HMM.
We want to make a model of this dictionary word 'cat', and we want to make it from individual phoneme models. We just take the individual phoneme models - that's the /k/, that's the /ae/, and that's the /t/ - and we just join them together with these little arcs. We can see now what those dummy states in HTK are useful for. And that's a model of the word 'cat' that we could then use to generate observation sequences for the word 'cat'. It's as simple as that; it's almost trivial.
And not only can we do that for recognition; we can also do that when we're training the models. So imagine we want to train phoneme models. One solution would be to record data and hand-label the beginning and the end of every phone - every acoustic instance of a phoneme - just like you did in the digit recogniser. That would be immensely expensive, and it wouldn't be very reliable, because people's accuracy at doing that is not going to be great unless they're highly trained, and it's going to take 10 or 100 times real time to do that labelling. But it turns out we don't need to do that.
Let's imagine instead we just label the start and end of every word. How would we then train models of phonemes if we don't know where each phoneme starts and ends? So let's say this is a model of the word 'cat', and we know the timestamp at which this model should start, so we know when we should enter this model. We know the timestamp when we should leave this model, but we don't know how the states align with the frames. In other words, we don't know at what time we should go from one phoneme to the next phoneme in the model.
That doesn't matter. We already know how to do that, because we already know how to train a model where we don't know the alignment between the states and the observations. We can do first uniform segmentation, then we can do Viterbi training, and then we can do full Baum-Welch. Because this thing here - who cares what it's a model of? It doesn't matter whether it's a model of 'cat' or a whole-word model. It's an HMM: states with transition probabilities. We know where it starts and where it ends, and we can just train its parameters in the usual way. Okay, it's that simple.
There's no cleverness there.
We just temporarily make a model of the word 'cat' by joining together our phoneme models, do the alignment, find which frames align with which states, and just remember that. Repeat that over all of the training data. And then, at the very end, this state will have participated in the word 'cat' and in all the other words with a /k/ in them; it will have been aligned with a bunch of frames, and we just average them to get the mean for that state.
So we don't need to know the phoneme boundaries. Let's go one step further: what if we don't even know the word boundaries? We just have whole utterances with their word transcriptions - whole sentences. Just do the same thing: for each word, temporarily construct a word model like that, and then for each sentence concatenate the word models to make a temporary model of the sentence. It's just a great big long HMM; maybe it's got tens or hundreds of states. We know the start time, we know the end time, and we know the names of the sequence of models - that's how we constructed it. We just do uniform segmentation, or Viterbi training, or Baum-Welch, to learn the parameters of that model.
So it turns out you didn't actually need to do what you did in the assignment. You could have got away without labelling the starts and ends of words, and just labelled the sequence of words in one great big long file. So the extension to connected speech, or to data where we don't have labels at the model level - perhaps only at the word level or the sentence level - is essentially trivial. We just construct models temporarily by concatenating sub-word models to make words, and words to make sentences, to get a temporary HMM. Do the alignment, remember this alignment, accumulate that across all the training sentences, at the very end update the model parameters, and then go around again.
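A sketch of one such pass in Python might look like this. The helpers build_sentence_hmm() and align() are assumed - hypothetical stand-ins for what a tool like HERest does internally, not HTK's actual API - and only the means are updated, to keep it short:

```python
# One pass of embedded training (hypothetical sketch).

def embedded_training_pass(training_set, models):
    # Accumulators: per state, a running sum of frames and a frame count.
    totals, counts = {}, {}
    for observations, words in training_set:
        # Temporarily concatenate word models (themselves concatenated
        # phoneme models) into one great big long HMM for this sentence.
        sentence_hmm = build_sentence_hmm(words, models)
        # Align frames with states; each aligned frame feeds one state.
        for frame, state in align(sentence_hmm, observations):
            totals[state] = totals.get(state, 0.0) + frame
            counts[state] = counts.get(state, 0) + 1
    # Only at the very end, after all the sentences: update the means.
    for state in totals:
        models.set_mean(state, totals[state] / counts[state])
```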
In fact, we might even train a system in a very simplistic way that can work quite well: just forget the whole uniform segmentation and Viterbi steps, and go straight in with Baum-Welch, with no alignments or labels at all. That works reasonably well if you've got lots of data, and it's called a 'flat start'. We start from models which have got to have some parameters - otherwise we can't even start - and set the parameters to something very naive, like zero for all the means and one for all the variances. We just align those with the data, and that'll be a rather arbitrary alignment, because there'll be nothing to tell it how to align, but beginnings will tend to align with beginnings and ends with ends. We'll get a first cut of the model parameters, and we just iterate that: just do Baum-Welch with sentence-level transcriptions. So: you used a tool called HInit, which does uniform segmentation and Viterbi training, and HRest, which does Baum-Welch but needs to know where each model starts and ends.
And there's another tool that doesn't even need to know where the models start: that's called HERest, and the E stands for 'embedded'. This is a thing called embedded training, and it's what we do in a real big system.
We just have speech chopped into sentences, with word-level transcriptions of those whole sentences as text, and that's what we train on. We might do a flat start. If we're building a really big system, we'd probably just start with some models from a previous system - which came from a previous system, which went all the way back, probably, to sometime in the distant past, when models were trained on hand-aligned data. We might use those seed models to get a first alignment, or we might just do a flat start.
Okay, so the key point is that none of this makes anything more complicated, because joining together HMMs just gives us an HMM. Therefore we already know what to do with that: we already know how to train it, and we already know how to do decoding with it. There are no new techniques whatsoever needed. That's the beauty of HMMs. Now we see why this Markov property is so nice. We can do this because this state here just doesn't care that there was another model before it; it doesn't even know. Tokens arrive in it and we can compute emission probabilities. We just turn that handle.

You know the drill: think of the entire model as generating speech.

This video just has a plain transcript, not time-aligned to the video.
So here's the best way to think about the whole system, and this is the generative-model way of thinking about it. This is the way I want you to think about it; it's much easier to understand this way. We've got a hierarchy of generative models. In general we're going to recognise whole utterances. We'll call them utterances rather than sentences: 'sentence' implies some sort of grammatical unit, and we've no idea whether what people say is grammatical. We're just saying 'utterance' - an acoustic unit, a thing with silence at the beginning and end. So we have utterances.
We have a generative model of utterances, and the generative model generates sequences of words. Okay, so what is that generative model? That's the language model. We could take a language model and randomly generate sentences from it. This model can generate sentences: take a random walk through the language model. Start here, toss your coin - it's going to be a 10-sided die - look at the number, and take a random transition. Let's do it.
Here we are, and we generate the sentence 'one'. Do it again; maybe we generate the sentence 'zero'. We randomly generate sentences. The language model is a generative model of sequences of words. This is a very simple one; in general they generate whole word sequences.
So that is the language model. Right: for each word, we're now going to generate the sequence of phonemes - abstract names of the units that make up that word. There's another generative model here. Anybody like to propose what the name of that generative model is? It goes from words to phonemes... the pronunciation dictionary. So that's the dictionary; I'm just going to write 'dict'. The dictionary is a generative model: given a word, it will generate a sequence of phones. You look up the word and it will give you the sequence of phonemes. It might be just a simple deterministic mapping; a fancy dictionary might have two pronunciations for the same orthographic word, and it would randomly choose between the two - you might even put probabilities on those. Press the random button on the dictionary for a word, and out will pop a sequence of phonemes. That's a generative model.
The dictionary is also a generative model, right? And then those phonemes map onto the names of HMMs; this is usually just a simple look-up, a simple mapping. So there's a sequence of HMMs, and HMM states. What do they generate? You know: the observations. If we join all of those things together, we've got an utterance model that eventually generates sequences of observations. The whole thing is a generative model.
If we try really hard to think in this probabilistic modelling paradigm, think of one great big button. We press our big button: the language model randomly generates a sequence of words; for each of those words, the dictionary randomly generates a sequence of phonemes - that might just always be the same for each word, if it has a single pronunciation; the phonemes generate their sequence of HMM states; and the HMM states randomly generate their sequence of observations. So we could do speech synthesis from this model.
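As a sketch of that big button in Python - hypothetical objects: a language model with a sample() method, a dictionary mapping each word to its list of pronunciations, and phone HMMs that can sample observation sequences:

```python
import random

def press_the_big_button(language_model, dictionary, hmms):
    observations = []
    words = language_model.sample()        # random walk through the grammar
    for word in words:
        phones = random.choice(dictionary[word])  # pick a pronunciation
        for phone in phones:
            observations.extend(hmms[phone].sample())  # states emit frames
    return words, observations
```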
If you take the synthesis course, you'll see that's exactly what we actually do - we do something a bit clever to make these observations something we can turn back into waveforms. But that's how we could do speech synthesis. In recognition, we're essentially doing the same thing, except we're locking down the observations to be a particular sequence: when we press the button we don't generate randomly, we generate that particular sequence, and the byproduct of that is finding the probability of doing so. That's the probability of a particular utterance. And then we just have a search problem: doing that for every possible utterance in the language. We iterate over every utterance - that is, over every possible word sequence - and just run this big generative model. It seems a slightly backwards way of looking at things, but it's the right way to understand things. A good way to understand the dictionary is that, given a word, it emits the sequence of phonemes.

What kind of language models are possible for continuous speech?

This video just has a plain transcript, not time-aligned to the video.
So let's just refine, then, this idea of language models. Remind ourselves about this equation: Bayes' equation. Remember that we decided this term is awkward to compute and totally unnecessary, so we get rid of it: we turn the equals into a proportional sign. Then we say: what are we trying to do? We're trying to find the W that maximises this term, and that's the same W that maximises the right-hand side. So what we can do is insert an argmax over W here, and an argmax over W here. That argmax over W implies trying all the different Ws and choosing the one that maximises this value. Trying all the different values implies a search; that's what the Viterbi algorithm is doing - it's searching in an efficient, parallel way. The argmax on the left is the same as the argmax on the right, and that term can disappear because it doesn't involve W. So this equation is still true - the equals is still an equals - and we know the HMM computes this thing.
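Written out, that manipulation is:

$$
\hat{W} = \underset{W}{\operatorname{argmax}}\; P(W \mid O) = \underset{W}{\operatorname{argmax}}\; \frac{p(O \mid W)\,P(W)}{p(O)} = \underset{W}{\operatorname{argmax}}\; p(O \mid W)\,P(W)
$$

where $O$ is the observation sequence, $p(O \mid W)$ comes from the HMMs, $P(W)$ is the language model, and $p(O)$ disappears because it doesn't involve $W$.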
Let's just refine our ideas of how to compute P(W). The first sort is the one we used in the lab. They're not really probabilistic models: all they do is allow some sequences of words and not others. They're sort of non-probabilistic models. There's one that allows W to be 'one', or W to be 'two', and so on; W can never be 'one two'. They rule some things completely out and rule some things completely in. We could think of it as assigning a uniform probability - a probability of 0.1 - to each of the 10 possible things, and a probability of zero to all the impossible things. But there are no actual numbers in that grammar; there are no real probabilities, just implied ones, and what's implied is uniform: all of those branches are equally likely. They've all implicitly got a probability of 1/10 on them. So it's a very simple one.
We can expand that idea; it just generalises to any other sort of grammar. So hopefully many of you have got on to the digit sequences. We pull out this little bit of the model, get rid of that, and pull out this bit of the model here - there's a thing that does sequences of digits - and all we'd need to do to refine that would be to add a way for one digit to follow another here, and that would be one answer for the digit-sequence model. So this is a language that says you can have a sequence of digits, going along the bottom path, or you can have a sentence like, you know, 'call Maria'. We could do that by hand. Again, that's just going to assign non-zero probabilities to any valid path and exactly zero probability to any path that's not possible. That's rather naive and not going to be very useful for any real application - except maybe a very simplistic one, maybe a very simple dialogue system, where we're going to constrain what people are allowed to say.
We want to generalise that now to something that has probabilities, and eventually to something that we don't write by hand but learn from data. So the first model we might think about is something we might call the word-pair model. The first speech recogniser I ever wrote was for a really old task called Resource Management - Resource Management is a rather bizarre US Navy vocabulary, asking questions about ships - and the language model was this word-pair language model. It was initially written by hand: every word in the vocabulary, such as this one, just listed the words that were allowed to follow it. Not all words, because then it becomes a completely flat, useless language model: a subset of words were allowed to follow this word, and all the rest were not allowed. So this word-pair model here we could write by hand. Still, we could also learn it from data: just go and find all the pairs of words that we did see in some data set and remember them. The key point is that this can also be written as a finite state network. So you could write it like this.
So, for example, we could write out all the words that are allowed to start a sentence. Maybe we're only allowed to start a sentence with 'the'. And then after the word 'the', we could have the word 'cat' or 'hat' - that's what this arc here implies. And then after 'cat' we have to say one of these three words, and so on. The word-pair model maps directly onto a simple finite state network; that's pretty obvious. And then we can generalise that further and start putting probabilities on things.
So this model here is okay, except that if somebody says something that we didn't consider, it will be given exactly zero probability, and there's no way the recogniser could ever recognise it. So if someone said 'the mat' - it's impossible. The probability that W equals 'the mat' is zero, so we're guaranteed to always get that wrong if someone says it. We probably want a model that does something a bit softer than that: one that says 'the mat' is unlikely but not impossible, and puts a small number on it. Just because we didn't see it doesn't mean it's not possible. So we'd like to have all words possible after all other words, but not with uniform probabilities: higher probabilities for things we saw more in the training data, and lower probabilities for things we saw less. We want a probabilistic model for P(W).
So we can just generalise this word-pair model. I'm not going to draw the whole thing fully connected; I'm just going to draw a subset of what it might look like. These arcs that go from 'hat' to 'on' and from 'hat' to 'sat' have now got probabilities on them. This is saying that the probability of seeing 'on', given that the previous word was 'hat' - read it off - is 0.75, and the probability of seeing 'sat', given that the previous word was 'hat', is 0.25.
And we can generalise that: we can just put an arc from every word to every other word. The sum of the probabilities across those arcs needs to be one. We can then learn those from data, just by counting how many times those pairs of words occurred in some data.
This has now reached the edge of what we're going to cover in this course. We're not going to look at exactly how to estimate this from data; just note, conceptually, that the model is, for example, probabilistic word pairs - and let's give it its proper name: it's called a bigram. We could equally well have called it a 2-gram; that's also fine. A bigram language model can be learned from data simply by counting things, but the key point is that it can be mapped directly onto this finite state form. Any language model of a similar form can be handled in the same way. We're not going to draw it, because it'll get really big and messy, and we're not going to look at the details.
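But here's a minimal sketch of the counting itself in Python, on a hypothetical toy corpus (real estimation would also need smoothing for unseen pairs, which we're not covering):

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "hat", "sat"], ["the", "hat", "on"]]

pair_counts, word_counts = Counter(), Counter()
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        pair_counts[(prev, word)] += 1
        word_counts[prev] += 1

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return pair_counts[(prev, word)] / word_counts[prev]

print(bigram_prob("hat", "sat"))  # 0.5 in this toy corpus
```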
But we could imagine a model where the probability of a word doesn't just depend on the previous word, but on the previous two words - you have to remember a little word history - and that will be called a 3-gram or trigram. If you're doing any NLP courses, you've probably already seen these models, so this is easy. If you haven't, don't worry: this is as much as you need to understand for this course.
To repeat, the key point is that all of these models have the general form of an N-gram, where N is the order of the model. N could be one, where the probability of a word depends only on its identity; two, where it depends on its identity and the preceding word; three, where there's a little window of three words; and so on. All of these models can be written as finite state networks - the states are actually labelled with a context of the preceding N minus 1 words - and they can all be learnt from data. Why is that important? Because we can then compile them with the HMMs and do token passing.


There is a typo right at the end of this video – the pop-up caption should say “…left context” and not “…left contents”.

The problem we need to solve is that we don't know the alignment between states and observations.

This video just has a plain transcript, not time-aligned to the video.
We're going to go straight on now to the hard part of the course - possibly the most conceptually hard part. We're going to do it in entirely non-mathematical terms.
We haven't talked very much about transition probabilities, and we're not going to say a whole lot here about how we train them; I'll just mention how they might be trained. The values of those transition probabilities aren't so important: they're not doing a whole lot of work. In other words, they're not particularly discriminative between one class and another class. The Gaussian is really where it's at. The transition probabilities are a very crude model of duration; people have tried putting much more sophisticated models of duration in, for no win except a large computational cost. So duration is not a hugely discriminative cue in speech recognition. It's really the distribution in MFCC space that tells us what's what. So: we know how to estimate the parameters of a Gaussian probability density function.
Remind ourselves what 'pdf' means: it's called a density, rather than just a probability function, because it's for a continuously-valued thing. Just think of it as a scatter plot: how dense are those points in each region of space? So it's a density. It doesn't give you an absolute probability, but that doesn't matter, because it's equivalent and it gets us what we need. So: a probability density function.
We stated, without proof, that the way to estimate the parameters is to maximise the likelihood of the training data given the model: in other words, to turn the knob called 'mean' up and down until the training data looks as likely as possible, and simultaneously turn the knob called 'variance' up and down until it makes the training data as likely as possible. The simple estimates of taking the mean of the data and taking the variance of the data do that. In a theoretical course we would actually prove from first principles that those are the best estimates; we're just going to state it here.
One important thing to note is that the Gaussian only needs to see data that we thought was generated by that Gaussian. We don't need to see data from other Gaussians, or other classes: it's a purely generative paradigm. We only learn from positively-labelled examples; we're not learning, for example, to discriminate against other classes. So the model of the word 'eight' is just learned from your seven recordings of the word 'eight', and it doesn't need to look at 'seven's and make sure it's bad at generating 'seven'. We just hope that's the case, because it's good at generating 'eight'. That's somewhat simplistic; advanced systems might go further than that and try to separate the classes, but we're not doing that here.
Now, we want to estimate the Gaussians in the HMM states, and we immediately have a problem: we've got sequences of observations of variable length, and HMMs with more than one emitting state. We don't know which state generated which observation, and that's the problem to solve in training. We're going to solve it, firstly, through a ridiculously simple and naive method; then we're going to use a slightly better, reasonable method that's still an approximation; and then we're going to look at what we really do. Remember that in testing - in decoding - when we compute the probability that a model generated an observation sequence, the correct thing to do, by the definition of a hidden Markov model, would be to add up the probabilities of all the different state sequences that could have generated it: sum their probabilities together. That's the probability of the model generating that sequence.
That's rather expensive, because there are a lot of state sequences. At test time - recognition time - we really care about speed; computational cost really matters. So we make an approximation: we just look for the single most likely state sequence, found by dynamic programming, by the Viterbi algorithm. That's what token passing gets us. The probability of the single most likely sequence is a pretty good approximation to the total probability. Empirically, we find that's good enough: it gives us just as good recognition results as if we did the right thing, with much faster computation. At training time, we don't care nearly so much about computational cost: it's done once, offline, before we need to run the system. So we're going to do the right thing in training: we are going to consider every possible state sequence when we're doing training. We should really do it during recognition too, but there it's too expensive and it doesn't really help performance. At training time, though, it is worth going the extra mile to do the right thing. But first we're going to use the approximation as one crude form of training that gets us partway towards the right thing.
Just remind ourselves, then, of our empirical estimate of the mean - that's what the little hat on the mu means: it's not really the mean of the Gaussian. Some speech comes into the recogniser, and we're going to pretend, for the purposes of recognition, that the thing that generated it was actually an HMM. That's the paradigm we're working in. It wasn't - it was a person - but we're going to pretend it was an HMM. Therefore there is a 'true' HMM out there in the world generating the speech that we're trying to recognise, and it really has values for the mean and the variance. Those are the true values, but we can only empirically estimate them by looking at the data that this HMM was generating. That's what the little hat on top means: an empirical estimate of the mean. The estimate is very simple: sum together all the observations associated with this Gaussian - that's the association we need to work out - and take their mean, so just divide by the number of them. And the same for the variance.
Okay, this is just the squared distance - the difference squared, because we want to make it symmetric - again summed over all the data points and divided by the number. So it's just the mean squared difference: the width of the Gaussian. On average, how far are the data points from the mean? How broad is this distribution?
How broad is this distribution? So we say those without proof we're going to apply those.
Except we don't know which observations to sum up.
So these observations here, this implies all of them in a sequence.
But some of them will come from one state.
Some will come for the next day.
Someone come from the next state.
We have to make that association.
So we know which ones to Adam and then divide by m.

The problem we need to solve is that we don't know the alignment between states and observations.

This video just has a plain transcript, not time-aligned to the video.
The state sequence is hidden, and we know we're Bayesians. We're happy now, hopefully, with this idea that the state sequence is a random variable, and it can take all possible values all at the same time. They all exist in little parallel universes; we don't need to just pick one. They all exist, with varying probabilities. Remember the coin-tossing experiment? We toss our coin and we don't look at the result. If we're frequentists, we just guess a fixed value - heads - and we're right half the time and wrong half the time. A Bayesian will maintain uncertainty and say: 'Well, I believe it's half heads and half tails', maintaining this uncertain distribution. It's a different philosophical way of thinking about it.
It's the same with a hidden Markov model. There's a hidden Markov model inside a box; we can't know its parameters, and speech pops out - random observations of speech pop out. We think about what's happening inside the box. Well, it's simultaneously using all possible state sequences and summing them together to generate a probability of this speech. There isn't a single answer to the state sequence, so we don't know it.
So we're in a bit of a Catch-22. We have no model - it's got no parameters - and therefore we can't work out which Gaussian generated which observation. So we need to start somewhere. We can state - we won't prove it; it's just been proved over and over again - that there is no single equation that, given some speech and given a blank prototype HMM, just writes down in one step 'the mean of state three equals some equation involving the data'. There's no such equation. All we can do is make the model a bit closer to the data, and then iterate.

A one-step method to get an initial estimate of model parameters. Typically used to initialise the models.

This video just has a plain transcript, not time-aligned to the video.
So we're first going to do a really quick and dirty solution that does operate in one step; that gives you a rather poor model. Then we're going to do an iterative method that gets you a much better model, but still not perfect, because it only considers a single state sequence. And then we'll look at the real thing, and we'll look at it mainly qualitatively, just to understand its properties. If you want the maths, it's there in the books; I'm not expecting you to understand the mathematical formulation of this full algorithm, unless you want to. You do need to understand conceptually why it's better than the Viterbi approximation.
I'm going to ignore transition probabilities, saying only that, as we're doing all of these alignments, we can essentially just count how many times each transition was used and normalise those counts into probabilities. So it's actually quite easy to estimate transition probabilities, but we're just going to ignore them and concentrate on the most important parameters, which are the mean and variance of each Gaussian.
So here's a really quick and dirty way of doing it; it's not going to give us a very good model. In all these examples, we've got a model with three emitting states - so a five-state model, in HTK speak - and we've got an observation sequence with six observations. Each of those is a vector of MFCCs, perhaps with 39 dimensions. And we can see that there's more than one way this model could have generated the observation sequence.
One of those paths is this very crude one: it just says we spend about the same amount of time in each of the three states. Six divides into three nicely - that's because I chose the example carefully. If it didn't divide nicely, we'd just have to divide it crudely. We assign uniformly: we'll slice this into three parts and just crudely assign the first third to the first state, and so on.
Okay, so we'll take these two observations - these actual MFCC vectors that came from little 25 millisecond bits of speech; they have values. We'll take these two, add them together and divide by two, and that will become the mean of the Gaussian in this state. That's an empirical estimate of the mean. And we'll take their average squared distance from the mean, and we'll assign that as the variance. If we're using full covariance we'd have a matrix, but in all these examples we just have a vector. And we do the same here: these two we assume were generated by this Gaussian, and they'll give its mean and its variance. Now, that's clearly over-simplistic, because we know our speech changes in duration - for example, with speaking rate.
It doesn't stretch linearly, so we wouldn't want to just crudely say that the first state always models the first third of this phoneme or word. It might model the first sound in this phoneme or word, which might not stretch linearly like that. However, this does get us straight to some parameters for the model. The mean of this state, the mean of this state, and the mean of this state will be different: this one will be from here, this next one will be from here, and this one from here. So they'll be different, albeit crude, approximations to the true model.
So we've got somewhere: from a model with no parameters, we've got to a crude model with some parameters. There's no need to repeat this, because every time we do it, we'll get the same answer - are we happy with that idea? It's just deterministic; if I do it a second time, nothing will be different. So it's a one-step algorithm, and it's not very good. Let's just call it 'uniform segmentation', for obvious reasons. I guess it's a crude model, but that's a good start. Take the statistics of the observations aligned with a state, and those are the estimates of the parameters of that state. So we've solved the problem of which observations were generated by which state by making the rather simple assumption that they're just uniformly segmented.
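The whole method fits in a few lines. Here's a sketch in Python - hypothetical code, though HInit's first step does something of this sort:

```python
import numpy as np

def uniform_segmentation(observations, n_emitting_states):
    # observations: array of shape (n_frames, n_dims), e.g. MFCC vectors.
    # Crudely slice the sequence into equal parts, one per emitting state
    # (np.array_split copes when it doesn't divide nicely).
    slices = np.array_split(observations, n_emitting_states)
    means = [s.mean(axis=0) for s in slices]
    variances = [s.var(axis=0) for s in slices]  # diagonal covariance
    return means, variances
```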

Iteratively re-aligning the model with the data and updating the model parameters based on the single best alignment.

This video just has a plain transcript, not time-aligned to the video.
Now we're going to improve this model. I'm going to state - I've just stated - that there isn't a single-step solution. There's no equation that immediately gets us to the true model. All we can do is take a model that we've already got and try to make it a little bit better. In other words, we're going to use an iterative method.
So one analogy that might or might not be helpful: what if we have an algorithm that's just trying to find the highest mountain in Britain? We don't have a helicopter and we don't have a map; we can't look at a picture of Britain and just pick it. We don't have that oracle knowledge. All we have is the local surroundings where we are. So a simple algorithm would be to just keep walking uphill. Eventually we get to the top of a hill. It's not guaranteed to be the biggest hill, but it will be locally a maximum. If you take very small steps, you might take a long time to get to the top of the hill, but you'll very precisely find the top - it just might be only the local hill. So we start walking uphill from here, and we're going to end up wherever we end up - say, still around here - but it's certainly not the biggest mountain.
We might therefore try to think of a better way of finding the biggest hill. One way of doing that: we take giant steps all over Britain - always uphill, but very large, very crude steps. We'll find there's a bigger hill, but we won't be very good at finding its exact top, because we keep stepping past it and down the other side. Zigzagging gets us roughly to a place where there are bigger hills, but it won't find us the exact top of one of them. We might then switch back to our slow algorithm, taking small steps again, to fine-tune and get to the very top. So often in machine learning we'll do algorithms like this: we'll have a fast and dirty algorithm that gets us to the right region quickly, and then switch to a slow algorithm that fine-tunes and gets us to the top of that region.
But none of these algorithms makes any guarantee that, when you do get to the top of a hill and you converge - there are no steps you could take that go up - that really is the biggest mountain. There might be another one we never explored, that we never got to. All we can say is that it's bigger than anything in the immediate vicinity. So these iterative methods have the potential of finding you a solution that's not the globally optimal solution, and you'll never know whether it was the globally optimal solution, because you don't know what that solution is. You can just empirically compare one solution with another and see which is better; we never know for sure.
So the true HMM has parameters which maximise the likelihood of the training data. We don't know what they are; we can't know what they are. All we can do is find our best local guess at them. So far we have just this very crude algorithm where we linearly - uniformly - segmented the data and assigned it to the model. That's clearly not very good. Can we do better than that? Of course we can.
We already know how to do one thing that's better than that: it's what we do during recognition. We find the single most likely state sequence that generates the data - that's the Viterbi algorithm. Now, to do that, the model has to have some parameters; we have to start with a model that has parameters. In the uniform segmentation method, the states might have no Gaussians at the beginning - it doesn't matter; they're not involved in making the uniform segmentation. So that works for a blank, empty model: it immediately gives the model some parameter values. Not very good ones, but a first guess. Everything after that needs a model to start with. We're going to have to do uniform segmentation first, just to get a guess at the parameters of the model, or make some other guess, like randomising the parameters of the model, or setting them to the global mean and variance - some other guess. But there have to be some parameters.
Once we've got parameters, we can use the Viterbi algorithm - actually implemented as token passing - to find a better alignment between observations and states. It will still be a hard alignment: each observation will belong to exactly one state, so it will be a forced, hard alignment. And then we can just do the same sort of thing again: for the observations associated with a state, we just take their mean and update the parameters of that state. Uniform segmentation is quick and dirty and doesn't give a very good model, but at least it has parameters. Given that crude model, we'll realign it with the data. So, for example, we might perform token passing.
Now the model that we got from the crude alignment aligns itself with the data in this way; this is the single most likely state sequence that generates these observations. We start here, as always. We go here and generate an observation, then we go on to the next state and generate this observation, go round here and generate this observation, go round here again and generate this observation, then go on here and generate observations here and here. So the state sequence - let's use HTK numbering - is going to go 2, 3, 3, 3, 4, 4. So this one belongs to state two, these ones belong to state three, and these ones belong to state four. This is the single most likely way that this model could have generated the observations, and now we'll update the model parameters on that basis.
Remember, the mean that we have at the moment in this state was actually the mean of these two observations, from the first step. We now say that it's actually more likely that this state also generated this other observation here: we steal it from the state next door and shuffle the alignment around. We've got a slightly better alignment than the uniform one. And then we're just going to take all of these guys, add them up, divide by three, and get this mean; all of these guys for this one; and this one has only one - that won't work very well, but in general the sequences will be longer - take its mean and variance, and update the mean here and the mean here. We update the model parameters, and now they're slightly better than they were before.
So take those - take the means and variances - and update the model parameters. Now the model has changed, and so, with these new model parameters, maybe this is no longer the most likely alignment between observations and model. You can see what's happening here: we have a model with not quite the right parameters, from which we can find an alignment; we then change the model parameters, so we need to find the alignment again; the alignment might change, so we need to change the model parameters again. We're going to go around that loop until it converges: for example, until the alignment stops moving about or, more generally, until the likelihood of the data stops increasing.
So the model is slightly different. We update the model parameters, realign the data again, and things shuffle around. Now what's happened is that this state here is still good at generating these two, but this state has stolen this one and is now taking these. Okay, so we've decided that, for this model, the best way of modelling the data is that the first state has quite a short duration and the second a longer duration: this generates these two and this generates these three. Are we happy so far? Any questions? So we can see that every time we go around this algorithm, we change the alignment, and therefore we need to update the model parameters; and because we've updated the model parameters, that might change the alignment. So we're just going to iterate backwards and forwards between those two things until we can't do any better.
We'll just measure the likelihood of the training data: as the tokens go round and the winning token pops out at the end, we'll look at its likelihood and remember it. Next time we go around, hopefully the likelihood is better. We'll make a little plot of that, and when it stops getting better we'll stop - or maybe we'll just give up after a fixed number of iterations. So we keep doing this, going around, updating the parameters again and again. We go around until we converge in terms of likelihood.
This stay here just generates a single observation, so computing its meaning variance is a problem.
The mean is just equal to the observation.
The variances.
Zero.
That's no good.
So in reality, we can't reliably train a model on a single training example, because we might just get a single frame associating with state.
You could try that an experiment, see if you can train them all on a single observation secret things might go wrong, but in general we don't just have one training example for each model.
We have many, so we'll do this.
This is for the first recording of Let's Say, it's this word.
Eight, but we'll find this alignment.
Remember it and then we'll pop in our second recording of this word.
Eight.
Maybe this one's a bit longer.
It's got a few extra frames.
This's our second recording.
Find that alignment, remember it.
And then when we update the state parameters, we'll just pull the observations across the different recordings.
We just pull them all together.
So across the multiple examples in the training set find all the things that were associated with state to be at least one from each of the recordings and possibly a sequence.
I just add them all together and divide by So this generalises trivially to multiple training examples.
You just do the alignment, separate for each, pull everything together and then do your state updates.
That implies you need to pass through all the training data once and then update the marble parameters and go around this league, so that's looking okay.
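Here's a sketch of that loop in Python. The helpers viterbi_align() and update_state() are assumed - hypothetical stand-ins for the alignment and update steps - and note that frames are pooled across all recordings before any parameters change.

```python
# A sketch of Viterbi training across multiple recordings.

def viterbi_training(recordings, model, n_iterations=10):
    for _ in range(n_iterations):
        # 1. Align every recording with the current model, remembering
        #    which frames went to which state, pooled across recordings.
        pools = {j: [] for j in range(model.n_states)}
        for observations in recordings:
            states = viterbi_align(model, observations)
            for frame, state in zip(observations, states):
                pools[state].append(frame)
        # 2. Only now update every state from its pooled frames.
        for state, frames in pools.items():
            model.update_state(state, frames)  # new mean and variance
    return model
```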
That will give us a reasonable model. Uniform segmentation is instant, but it doesn't give a great model. Viterbi training is going to be fast, because the Viterbi algorithm is extremely efficient, and it gives us quite a lot better model. In HTK, both of those are folded together into one tool called HInit - it initialises the model; that's what the name means - and it just does these two things. So HInit will print out iterations, and those iterations are iterations of this Viterbi-style training. It will produce a trained model and save it - it's going into your hmm0 directory. You could do recognition with that model; it will work. It might not be as good as the model we're about to make, but you could compare them. So this is a roughly-trained model, but it's fast.

By summing over all possible state sequences we marginalise away the state sequence: we don't care what value it takes.

This video just has a plain transcript, not time-aligned to the video.
Right, but we need to do the right thing.
The right thing is not to take the single most likely state sequence and just pool the data from that.
It's to consider all of the other state sequences too, and allow them to contribute to the estimates of the state parameters: the means and the variances.
OK, we're going to do this kind of longhand, and then we're going to quickly see that there's a much more efficient way of implementing it, called the Baum-Welch algorithm.
Let's say this is the single most likely state sequence for a three-state model generating six observations.
Let's just write the states out, so we're really clear about it, and we'll use HTK numbering.
The state sequence is 2 3 3 4 4 4.
That's your number-one, best state sequence.
If you're going to pick one, this is the best.
But there are lots and lots of other state sequences.
For example, we could have gone 2 2 3 4 4 4, and so on, and so on: many, many other state sequences.
All of them are a little bit less likely than this one, but maybe not that much less likely, so we should be taking them into account, because we should be summing over all possible state sequences.
That's the definition of a hidden Markov model.
So we're going to have to find some algorithm that does that.
Great.
So what we're going to do is consider every possible state sequence that the model could have used inside this black box.
Simultaneously; just do all of them.
They all exist at once; up to now we've just picked the most likely one and used that, and we're now going to consider all of them.
We'll weight them by their associated probabilities.
The most likely one is going to make a strong contribution to the model parameters, but it's going to combine with some of the other ones as well.
There are lots of different state sequences, and each of them has an associated probability.
We happen to know that this one was the highest, but there are other ones we want to take into account as well.
So we need to sum over all of them.
So we're going to introduce an algorithm.
We're not going to derive it; we're just going to think about the concepts of it and introduce terminology that you'll see in textbooks.
The general idea of this next algorithm is a bit like Viterbi training.
It has two parts to it.
In Viterbi training, there's one part which is finding the alignment, and a second part which is updating the model parameters given that alignment; and that was just the single most likely alignment.
What we're going to do now is consider all possible alignments, and then, given those, update the model parameters.
There are still two parts to it, and all it will do is slightly improve the model.
The only guarantee we have is that the model won't get any worse on the training data.
We have no promises about test data; that's up to us to engineer.
Still two parts to it, then: finding the alignment between states and observations, except that what we're actually going to do is average over all possible state alignments, weighted by their individual probabilities.
The fancy word for averaging in probabilities is "expectation"; that's the expectation part.
Then, given these alignments (all these different alignments), we're going to turn the knobs on all the model parameters to maximise the likelihood of the training data.
That's the maximisation step, and it's just a simple equation that says the mean of a state equals a weighted sum of the observations that we've aligned with it.
The general algorithm is of a type called expectation-maximisation, and it's so common it's often just called EM.
If you're taking other courses in machine learning or NLP, or the speech recognition course, you'll see that there are lots of different ways this concept can be applied to models: it applies whenever we don't have a single-step solution, and all we can do is take a step towards a better solution.
So there's some averaging part, and then there's some updating-the-parameters part: expectation-maximisation.
And when it's applied to hidden Markov models, it gets a more specific name: it's called the Baum-Welch algorithm.
That's two people: Professor Baum and Professor Welch.
OK, let's rewrite Viterbi training in what seems like a slightly funny way, and then you'll see why we did that.
Think about this alignment.
The mean here equals the sum of this thing plus this thing, divided by two.
Let's rewrite that in a different way: as a sum over everything, weighted by some weights.
The weights are the probabilities of aligning each observation with the state.
It's going to be exactly the same thing here: zero times this thing, because it never aligned, plus one times this, plus one times this, plus zero times this, zero times this, and zero times this.
So we'll take those things and sum them up, weighted by hard numbers: zeros and ones.
And then we'll divide by the sum of the weights.
So that will be one over two here.
You can see that's exactly the same thing.
It's just a more general form.
So the mean of the state is the sum of all observations, each weighted by something, and the weight is the probability of aligning that state with that observation.
That general form is kind of nice.
It would be a good way to implement it, and Viterbi gives us the weights: it says the weight for this one is zero (it didn't align), the weight for this one is one, and so on.
Let's write it down in this general form.
The mean of a particular state is a sum over all the observations (so this sum is now over all the observations, the entire training data), and each of those observations gets weighted by some value, a probability: a weight.
And then we need to normalise, so that the weights behave like probabilities and sum correctly, and that normalising factor is just the sum of the weights.
In that example it was one over two; the normalising factor was two.
What we're going to do now is let these weights go soft: instead of hard weights of 0 or 1, they're going to be soft weights.
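Written out as a formula, in standard textbook notation rather than anything defined in the lecture (the symbol $\gamma_j(t)$ is the usual name for this weight), the update for the mean of state $j$ is:

$$\hat{\mu}_j = \frac{\sum_{t=1}^{T} \gamma_j(t)\, o_t}{\sum_{t=1}^{T} \gamma_j(t)}$$

Under Viterbi training every $\gamma_j(t)$ is exactly 0 or 1; from here on, it becomes a soft value between 0 and 1.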
OK, so what is this P? It's the probability that a particular state generated a particular observation.
Under Viterbi, that's either "it absolutely did", probability one, or "it absolutely didn't", probability zero.
But if we look across all the different state sequences and take these averages, sometimes that observation will be aligned with state two, and sometimes it will be aligned with some other state.
So those weights now become values in between: not exactly zero or exactly one.
This probability is going to be something like the probability of being in a state at a particular time, which is exactly the same thing as saying that state generated that observation.
So this probability here is going to be a value that's not zero or one: it's going to be our belief, the probability, that when this HMM generates the observation sequence, this observation (observation six in the sequence) came from state three.
In some state sequences it did; in some it didn't.
So, how do we get that probability out? OK, with this EM algorithm, which becomes Baum-Welch: let's just see if we can work through very quickly how that might work.
So let's write all the state sequences down.
We've got 2 2 2 2 3 4, then 2 2 3 3 4 4, all the way down here, eventually getting to the last one, 2 3 4 4 4 4.
We've got six things in the observation sequence, and three emitting states, called 2, 3 and 4, and these are all the state sequences we could fill in.
One of them was the most probable one.
I've already forgotten which one it was.
It was 2 3 3 4 4 4; so, somewhere in here, there's the 2 3 3 4 4 4 one, and here it is.
This one is the one Viterbi finds us.
It finds the single most likely one.
And then from that we get the weights.
The weights are these: when we compute the new mean for state two, it just takes the first observation weighted by one, plus all the other observations weighted by zero.
When we compute the new mean for state three, it takes the first observation weighted by zero, plus the second one weighted by one, plus the third one weighted by one, plus the others weighted by zero.
That's what this alignment tells us.
Now, the probability of each state sequence: I'm going to write these out as if they were actual probabilities, although really they're probability densities.
Let's just say that 0.36 is the probability of that one.
All the others have probabilities too; maybe they're all lower than that.
Maybe this one's a bit less likely, and this one's a bit less likely still.
These are the probabilities of these particular state sequences.
So the Baum-Welch algorithm is going to do something quite simple.
When we update the mean of state two, instead of only looking here and saying "state two takes the first observation with weight one, and the rest with weight zero", it says: OK, we'll take the first observation, that's this first slot here, weighted with this weight.
Plus the first observation with this weight.
Plus the second observation with this weight.
Plus all of these with this weight.
Plus all of these with this weight.
So we write out the mean of state two as a sum over all of the observations, all six of them, and the weights are these weights: one for every time that state two had the possibility of generating that observation.
OK, it's a slightly complex idea, but think of it in comparison to Viterbi: it's just a soft version of Viterbi.
It uses all possible state sequences, and then just does a weighted sum of the observations.
OK, so it's a pretty tricky concept, and it brings us to the end of the lecture in a few minutes.
So: we've gone from uniform segmentation, which we knew was wrong, but since we didn't have any parameters in the model yet, there was actually nothing else we could do, other than randomly initialise them, or just initialise them all to zeros and ones.
Then we said: we've got a model that's not very good, but we can use that model to find an alignment of the data, use the alignment to update the model, and the model gets a little bit better.
We go around that until the model stops getting better, and then we flip into the real thing.
And the real thing is to do this.
Now, this list of all possible state sequences could be quite long, so it implies this might be computationally more expensive.
To do this, for every possible state sequence we'll compute its probability (its likelihood), and we'll use that as a weight when we do the weighted sum of the observations.
Viterbi does a weighted sum where the weights are zeros and ones.
Baum-Welch just does a weighted sum where the weights are soft.
They're numbers somewhere between 0 and 1, effectively.
OK, a question about things going backwards and forwards: that's a little bit of a misleading way of thinking about things like the Viterbi algorithm.
The Viterbi algorithm appears to operate in a left-to-right fashion, but it could equally go right to left, or start in the middle and go outwards; we'd just find the same answer.
All it guarantees to find is the single most likely state sequence.
It doesn't matter: you could reverse the speech and reverse the model, and you would get the same result.
Will it converge? Well, it will, because we'll update the model parameters and that will change the alignments.
That's the iteration: the model parameters change, then the alignment changes.
The same thing's going to happen in Baum-Welch.
So we'll take the weighted sum.
Let's just compute the mean of state two.
It's 0.21 times observation one, plus 0.36 times observation one, plus 0.12 times observation two, plus 0.12 times observation two, plus... the weights are getting very small now; these are making small contributions... plus 0.04 times observation one, plus 0.04 times observation two, and so on.
So it's just these weights and these observations that get summed up.
These very unlikely state sequences up here have a very low weight, so those alignments make a very small contribution to the new value of the mean of the state.
The one that makes the biggest contribution, obviously, is the single most likely one, with its big 0.36 weight.
And we just do the same to update the other states.
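As a sanity check on the idea, here is a self-contained Python toy that does Baum-Welch the longhand way, exactly as in this worked example: enumerate every left-to-right state sequence, weight each by its likelihood, and compute the weighted mean for one state. All the numbers, and the decision to ignore transition probabilities, are my own simplifications for illustration, not the lecture's.

```python
import itertools
import math

obs = [0.2, 0.4, 0.5, 1.1, 1.3, 1.2]   # six 1-D observations (made up)
states = [2, 3, 4]                      # emitting states, HTK numbering
means = {2: 0.3, 3: 0.8, 4: 1.2}        # current model parameters (made up)
var = 0.1                                # one shared variance, for brevity

def gauss(x, mu):
    # Gaussian output probability density
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def sequences(n):
    # All left-to-right sequences: start in 2, end in 4, steps of 0 or +1
    for seq in itertools.product(states, repeat=n):
        if (seq[0] == 2 and seq[-1] == 4
                and all(b - a in (0, 1) for a, b in zip(seq, seq[1:]))):
            yield seq

# Weight of each sequence: product of its emission probabilities
# (transition probabilities omitted purely to keep the sketch short)
weights = {seq: math.prod(gauss(o, means[s]) for o, s in zip(obs, seq))
           for seq in sequences(len(obs))}

# New mean of state 2: every observation, weighted by the probability
# that state 2 generated it, summed over all sequences, then normalised
num = sum(w * o for seq, w in weights.items()
          for o, s in zip(obs, seq) if s == 2)
den = sum(w * sum(1 for s in seq if s == 2) for seq, w in weights.items())
print("updated mean of state 2:", num / den)
```

A real implementation never enumerates sequences like this; avoiding exactly that enumeration is what the lattice trick described next is for.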
So what happens? We get some model from somewhere: perhaps Viterbi training, or random initialisation, anything we want.
We'll find all possible state alignments between that model and our observations; in general there will be a lot of them.
Each of them will have a probability, and we'll use that as a weight.
And then we'll put that weight into this formula: the new mean of a particular state is just the sum over all observations in the entire data set, each weighted by the probability, across all of these different ways of aligning model with observations, that that state generated that observation; in some of the alignments it did, with some weight, and in some it didn't, with weight zero.
And this weight, we're just going to give it a rather impressionistic name: it's the state occupancy, the probability of being in that state at that time.
That captures the alignment.
And then there's some normalising factor to make sure everything is properly a probability, and the normalising factor is just going to be the sum of the weights.
OK, so we don't want the maths for that in this course, but if you prefer to understand it through the maths, do that.
Let's just finish off by saying a few more things about this.
So it's called the Baum-Welch algorithm.
Please note the spelling: Welch, not Welsh.
The Baum-Welch algorithm is just a form of expectation-maximisation, specialised to hidden Markov models.
In the little manual example we worked through, I explicitly listed all possible state sequences.
That's a bit like what we did in the early days, before we discovered dynamic programming, when we said: let's just enumerate all of the paths, calculate each of them separately, and then pick the most likely one.
And then we realised that they've got an awful lot of things in common.
Lots of them start by staying in this state for three frames, so they can all share that computation.
That's what dynamic programming tells us: all these different paths have all sorts of sub-sequences in common, and each bit of computation can be done once and shared amongst all the paths that share that state sub-sequence.
It's easiest to think about the common prefixes, but actually it works for any common sub-sequence anywhere; it's not just the prefixes, it's all the bits in the middle too.
I won't go into the details: the diagrams in the Jurafsky & Martin book and the others try to explain this and, in my opinion, don't succeed; Baum-Welch is a bit complicated.
But we can see, impressionistically, that for Baum-Welch, all of those state sequences I just listed out have also got lots of things in common, so we can share the computation between all these state sequences.
So we don't need to explicitly write them all out one by one and compute them separately: compute their probabilities, their state occupancies, and do the weighted sum.
We can compute it all in one go, with something like dynamic programming, but instead of choosing the max at each time step, we do a sum at each time step.
So we compute it on some sort of matrix or lattice, a bit like dynamic programming.
We don't need to understand that for this course; we can just say that this is actually also quite efficient.
It's more computation than picking the single most likely path and throwing everything else away, because here we need to sum, we need to visit the whole grid; but it can still be done much more efficiently than the dumb way of writing all the sequences out longhand, because it shares all possible common sub-sequences of computation.
There are diagrams that try to explain that, but they're beyond the scope of this course.
OK, so when do we stop training? Well, we stop when the likelihood of the training data stops increasing.
You can see that coming out of HTK.
The Baum-Welch algorithm in HTK is computed by this program called HRest, which means re-estimation: in other words, it implies the model must already have some parameters, and it will just update them.
All HRest promises to do is not make the model any worse, and hopefully to make it better; "better" means it has a higher probability of generating the training data.
We never see the test data.
There are no promises about the test data.
There are no guarantees about that.
That's your job as an engineer: to engineer the training data so that you predict the model will be good at generating the test data.

Yes, again! Really, this is the best way to see how everything integrates elegantly, just by multiplying probabilities.

We're not going to reiterate all of this; we're just going to make sure we can now see all the components quickly, all together, and see how everything fits together.
If we wanted to push the generative model right to its absolute limit, we could say that vectors of MFCCs generate little frames of waveform.
In synthesis, we really, literally, will do that.
In recognition, that little bit at the very bottom of the chain, between the waveform and the features, is actually deterministic signal processing.
It's handcrafted.
It's not really part of the whole generative-model framework.
For example, there's no distribution over those waveforms, so parameterisation is kind of a separate step.
In ASR it just takes the waveform and immediately replaces it with a sequence of vectors.
Typically MFCCs; other parameterisations are available.
There are other things we could use, but we're not going to talk about them here.
For example, we could use filterbank coefficients, if our recogniser didn't use Gaussians.
So we parameterise the speech and then just throw the waveforms away.
This is always the first step in speech recognition: turn these messy waveforms straight into a sequence of observation vectors.
Then we need acoustic models of sub-word units.
We know how to do that with HMMs.
We need something to map between language-model units and acoustic-model units.
If the acoustic models are sub-word models, then we need a dictionary.
We're just stating that here; we're going to go into much more depth about that soon.
We could write one by hand, or buy one, or get one from somewhere.
And we need something that generates sequences of words for whole utterances: that's going to be a language model.
So these are our probabilistic generative models.
The parameterisation doesn't quite fit into the generative-model paradigm, and we won't really force it: it's just going to be some signal processing out front, some handcrafted stuff.
We need to talk about a few other little topics to glue everything together, so let's just try and do all of that in one go.
OK, so how might you try and see the whole speech recogniser together? Well, it's very tempting to draw some sort of flow chart.
You've got something that's a speech recogniser; your speech, your waveform, goes in, and what pops out is W-hat: the W that maximises the probability under this generative model.
It's tempting to think of it that way, and it's certainly the case that there's a piece of software: it loads a waveform, or pulls it off the sound card, and it prints out a word sequence.
But this is a very misleading way of thinking about it, because it breaks our generative-model view of things.
It's an implementation diagram; it's not really a diagram of the true probabilistic model that's going on.
So I'd encourage you not to think in flow charts like that.
I would encourage you very strongly to think in this way instead: think of it as something that, given words, generates speech, or rather generates the sequences of MFCCs.
Everything in there is a probabilistic generative model.
Because we do that, it becomes really obvious how to fit these different generative models together.
It also gives us some clues about what forms of generative models are going to work, and what forms are not going to work.
And if we make all of our generative models compatible with token passing, in other words finite-state, then everything glues together in a beautiful, clean way, as we'll see in a second.
OK: parameterisation is just this deterministic signal processing.
Hopefully you understand why we need to do each of those steps.
A lot of it is to do with the fact that we've chosen a rather naive model, the hidden Markov model.
It's got some very powerful assumptions in it.
For example, we would like to use diagonal-covariance Gaussians, which assume that the coefficients within an observation are independent of each other.
We've done a lot of massaging, a lot of manipulation of our features, to try and make that as true as possible.
That's what the cosine transform does.
The HMM also makes an incredibly powerful, strong assumption: that one observation is conditionally independent of the previous one and the next one, and of all the other observations.
Conditionally, I mean, given the HMM state that generated it.
And we've done a lot of things to the features to mitigate those crazy and wrong assumptions we made when we chose the HMM and the Gaussian: chose them, remember, because they're mathematically so convenient.
We did this cosine transform to make things statistically independent within frames.
And then we have these delta and delta-delta coefficients, which are the differences between frames, and which capture the dependence that the HMM can't model; the HMM is like a goldfish going around its bowl: once it's gone round, it's completely forgotten how it got there.
So we can't condition the probability of one observation on the value of the previous one, because the model has already forgotten it.
Instead, we put some information about the previous frame into the current frame, as deltas: as gradients, slopes, rates of change; and there's even information about the rate of change of the rate of change, in the delta-deltas.
So the deltas and delta-deltas are just getting us around this HMM assumption.
We did a lot of feature massaging, but it's better to do that than to try and use a much more complex and expensive model; when we compare, we find that that's the case.
So we parameterise things, and then, from this point on, everything is a generative model.
We have an acoustic model that generates sequences of observations, and we've got a language model that generates sequences of words.
Typically that's something like an N-gram.
We're not covering how to learn that from data in this course; that's for your NLP courses, or the speech recognition course next semester.
Real systems use something like a trigram, perhaps a 4-gram, if we've got enough data.
If we think about trigrams and 4-grams, these sequences of three words and four words, and think about counting them on the web or in some large database of text, we won't see every possible trigram.
There's a very large number of those.
OK: if we have 20,000 words (20k; and 20,000 words is not a big vocabulary), the number of word pairs is 20,000 squared.
That's a pretty large number, and the number of trigrams is 20,000 cubed.
That's an even larger number; we'll never see all of those, however big our database is.
So what real systems do is use the trigram when they've seen it enough times; where there's a gap, the model backs off to the bigram, and then it backs off to the unigram.
This is called backing off, or smoothing.
Real systems do complex things with language models, and that's the domain of the ASR course next semester; a minimal sketch of the basic idea follows.
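This is a rough, hedged sketch of plain thresholded back-off in Python; real back-off schemes (Katz and friends) also discount and redistribute probability mass, which is omitted here, and all the names are invented.

```python
def p_next(w1, w2, w3, trigrams, bigrams, unigrams, min_count=2):
    # P(w3 | w1, w2): trust the trigram only if we saw it enough times
    if trigrams.get((w1, w2, w3), 0) >= min_count:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    # otherwise back off to the bigram estimate of P(w3 | w2)
    if bigrams.get((w2, w3), 0) >= min_count:
        return bigrams[(w2, w3)] / unigrams[w2]
    # and finally to the unigram estimate of P(w3)
    return unigrams.get(w3, 0) / sum(unigrams.values())
```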
So the number of trigrams and bigrams in a model can be very, very large; these are the ones we do see in the data.
And you can see that the number of parameters in the model, where each one of these is a probability, is very, very large.
So a language model might take a lot of memory.
And then we can see why a speech recognition system might need a lot of memory to run.
HVite, which we've used in the lab, is very naive and simplistic.
It loads the entire language model, expands it into the entire finite-state network, pops HMMs in where the words were, and ends up with one enormous HMM: a network of states.
That doesn't scale well to 6.7 million trigrams, with lots of sub-word models pasted in there; we'd just run out of memory.
So real systems don't do what HVite does: they compile bits of the network dynamically, in a very complex way.
But it's a good way of thinking about it, the best way to understand it: the language model is finite-state, and we know our HMMs are finite-state, so we can just do this simple compiling, which is substituting the HMMs in where the words are, and then just doing token passing.
So don't think like this.
You'll see diagrams like this in research papers and so on, and that's OK as a way to explain how you implemented things and what your models were.
But it doesn't really help you understand the generative-model paradigm, so don't get hung up on pictures like that.
Instead, think of pictures more like this.
This bit at the bottom doesn't quite fit our generative-model paradigm.
It's just some signal processing.
So the speech signal kind of goes into the parameterisation, but after that, we see the arrows going strictly this way.
The sentence model generates words; that's the language model.
The word model generates phone names; that's the dictionary: we probably just look up the names of the phones in a table and get a model for each, and that gives us HMMs.
An HMM has a sequence of states, and in the states are Gaussian probability density functions, and they generate observations.
And each of these is a sequence model: a sentence model generates a sequence of words; a word generates a sequence of phone names; a phone name generates a sequence of HMM states; and an HMM state generates at least one observation, and probably a sequence of observations.
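Here is a tiny, self-contained Python toy of that chain of sequence models; everything in it (the two-word dictionary, the three-state phone HMMs, the one-observation-per-state simplification, and all the numbers) is invented purely for illustration.

```python
import random

dictionary = {"the": ["dh", "ax"], "cat": ["k", "ae", "t"]}   # word -> phones
# each phone gets a 3-state HMM; each state is a (mean, sd) Gaussian
phone_hmms = {p: [(0.0, 1.0)] * 3 for p in "dh ax k ae t".split()}

def generate(words):
    obs = []
    for word in words:                       # sentence model chose the words
        for phone in dictionary[word]:       # dictionary: word -> phone names
            for mean, sd in phone_hmms[phone]:   # phone -> HMM states
                # each state emits at least one observation; self-loops would
                # emit more, but one per state keeps the sketch short
                obs.append(random.gauss(mean, sd))
    return obs

print(generate(["the", "cat"]))
```

The only point is the direction of the arrows: every level generates the level below it, all the way down to the observations.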

The idea of compiling all the models together is very natural, if we are taking the generative view.

Let's just draw out explicitly how this idea of compilation works.
Let's do it for a really simple model: a phone-dialling model.
The same thing would work for our bigram.
So: a start state, some sort of end state, and some words; the names of some people to dial, say.
We might have a network like this.
That's our language model: we drew it by hand here, but it could have been learned from data, whatever we want.
And we can turn this into a network of HMM states, step by step.
So the first thing we do is replace the words with their sequences of phone names.
Let's just do this one.
(OK, so my symbols are a bit wrong, but never mind.)
So we just rewrite a word as its sequence of phone names, and then, for each of those phones, we put in its little HMM.
We just do that everywhere.
So words get substituted with their phone names, and phone names get substituted with their HMM states.
Let's quickly just do that.
So these all disappear, and here's what we're going to be left with; let me draw it quickly.
Remember that it was made in this way: there are self-transitions, and the probabilities on these arcs inside an HMM, these things here, are just the HMM transition probabilities.
We didn't talk very much about how we might learn those, but they do get learned during Viterbi training and Baum-Welch training: we count how many times each arc is used, in proportion to all the other arcs out of that state.
These arcs in between phone names might be pronunciation probabilities.
Imagine we had a fancy dictionary that allowed two pronunciations for this word: the second phone name, maybe, alternates, so we might have branches in here.
So imagine those two pronunciations.
These arcs are then going to carry the probabilities of the two competing pronunciations of a word.
So there are arcs within HMMs, these ones, and there are arcs from the dictionary, possibly; in general there will mostly just be one pronunciation per word.
And then there are the arcs that connect things together.
These are language-model arcs, and they might have language-model probabilities.
So the probabilities can come from the language model, from the dictionary, or from inside the acoustic HMM.
But none of that matters, because they're all just probabilities: just numbers on arcs.
And then we put our token in at the beginning here, with its probability of one, the only token; we send copies down the arcs and just propagate them through.
They just flow through the model, generating the sequence of observations, and at some point some tokens arrive here at the end; we pick the biggest one, and that's the winner.
And we'd better make sure that we know what path it took.
So we do one little thing in token passing that we didn't mention yet, and that's that tokens need to remember the words they went through.
(They could remember every state they went through, if we cared about that.)
So what tokens are going to do, as they go down language-model arcs, is this: there are going to be little tags on these language-model arcs, and the tags are the names of words; as tokens pass through them, they just add that word to their word record.
So we take the winning token and just look at it, and it'll say "I went through this word and this word and this word", and that's the recognition result; as well as having its probability on it, which we only cared about for comparing it to other tokens.
OK, so this generative thing here, this idea of a generative model, turns, when we come to implement it, into what we might call compilation, which is exactly what HVite does.
When you run HVite, you'll see it print some stuff out.
It'll say something about the number of states and the number of arcs it's made.
That's just it telling you how big the network was that it compiled.
It won't be very big for the digit recogniser; it will be a little bit bigger for the sequence recogniser, because there are a few extra arcs.
So it's just telling you how big that thing it compiled was.
And the compilation is the thing we just did: substituting things in until we end up with one great big network.
It's a finite-state network, and then we just do token passing on that.
So the fact that these arrows are all pointing the same way, and everything's a generative model, means that we can compile together the language model, the dictionary and the acoustic model into a single, unified model, and then we can just run the same algorithms on it.
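Here is a rough sketch, in Python, of that substitution step; the data structures (the arc tuples and the dictionaries) and the transition probabilities are all invented for illustration, and HVite's real internals differ. Note the word label carried on the exit arc: that is the tag a token reads to build up its word record.

```python
def compile_network(lm_arcs, dictionary, phone_states):
    # lm_arcs: (source, destination, word, language-model probability)
    # dictionary: word -> list of phone names
    # phone_states: phone name -> list of HMM state ids
    states, arcs = set(), []
    for (src, dst, word, lm_prob) in lm_arcs:
        prev = src
        for phone in dictionary[word]:
            # a real compiler makes a fresh copy of the states for every
            # occurrence of a phone; we reuse the ids to keep this short
            for s in phone_states[phone]:
                states.add(s)
                arcs.append((prev, s, None, 1.0))   # arc into the state
                arcs.append((s, s, None, 0.5))      # self-transition (made up)
                prev = s
        # the exit arc carries the word label, so a token passing through
        # it can append the word to its word record
        arcs.append((prev, dst, word, lm_prob))
    return states, arcs
```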

The Baum-Welch algorithm can be applied to data comprising whole utterances transcribed with words, without needing any manual within-utterance alignment.

Let's just reiterate that this also applies in training.
For training, the algorithm we apply might be the Baum-Welch algorithm, rather than the Viterbi algorithm.
We have sentences.
Let's assume our training data is partitioned into sentences, with silence at the beginning and the end.
We have word transcriptions, with no alignment.
We just know these words were in this sentence.
The sentence model, then, is just the transcription.
So it's like a language model that says "it's only ever this sentence".
So imagine we have a training sentence which is labelled just with a string of words, with no alignments.
We'll use that.
We'll look up the phone names for each of those words.
We'll look up the HMM states for those phone names.
That's our hierarchy of generative models: we construct, temporarily, the HMM of the sentence "the cat sat", and then we have our observation sequence for that training sentence.
Given that HMM and that observation sequence, we can find the alignment, with Viterbi or with Baum-Welch, anything we like, update the model parameters, and then just go around and around and around.
The only little trickiness there, which isn't really very difficult, is that in our training data we won't just train on one sentence; we'll train on a whole bunch of sentences.
Say we have 1,000 sentences: then this alignment step runs in Baum-Welch.
That's the E-step.
And the E-step just needs to store some information; it won't actually update the model parameters.
What it will store is the probability of being in each state at each particular time, and therefore, effectively, which observations aligned with each state, in a probabilistic sense.
In fact, what it will really store is the weighted sum of those observations, weighted by those probabilities.
Remember, we're doing averaging, so that involves a weighted sum.
And then there's the sum of the weights, which we'll need later for normalisation, so we'll store that too.
We store these two things in something called accumulators: just buckets we throw numbers into, and we just accumulate during this E-step, without changing the model parameters.
And we do that for all of the training sentences, hundreds or thousands of them, accumulating all of this stuff.
At the very end, each state has an accumulator that says: these are the observations that aligned with me, with these weights, and this is the sum of those weights.
And then we perform the second step, the maximisation step, which essentially just divides one by the other and updates the mean; it does the equivalent computation for the variances.
So the E-step, obviously, is going to be the thing that takes all the time.
That step computes all of the alignments across all of the data, and at the very, very end each state says: across all of those thousands of sentences, I aligned with these observations, with these weights; sum them up with those weights and compute my mean.
That's the M-step, the maximisation step.
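A minimal sketch of those accumulators in Python, under stated assumptions: training_sentences, a model object with states and set_mean(), and a forward_backward() returning the state occupancies are all hypothetical names standing in for machinery not shown here, and observations are treated as one-dimensional.

```python
from collections import defaultdict

# Accumulators: one pair of buckets per state, shared across ALL sentences
weighted_sum = defaultdict(float)   # sum over time of occupancy * observation
weight_total = defaultdict(float)   # sum over time of occupancy

# E-step: visit every sentence, accumulate, change no parameters
for obs_seq in training_sentences:
    gamma = forward_backward(model, obs_seq)   # occupancies (not shown here)
    for t, o in enumerate(obs_seq):
        for state in model.states:
            weighted_sum[state] += gamma[state][t] * o
            weight_total[state] += gamma[state][t]

# M-step: one division per state, at the very end
for state in model.states:
    model.set_mean(state, weighted_sum[state] / weight_total[state])
```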

By removing unlikely partial paths (tokens), we can make recognition much faster.

The only thing to say, finally, is something you've hopefully already discovered from doing the practical.
And that is that in a large model (your models weren't large, but imagine a really large model), most of the tokens, most of the time, will be very unlikely.
They'll be off in parts of the language model with very different words, nothing to do with the acoustics.
And they'll go round and round, and we'll keep crunching their numbers, and they'll keep bumping into more likely tokens and being deleted.
But we can save a lot of computation by making an approximation: let's just throw away anything that looks like it has no chance of ever winning.
If a token's probability falls too low, we just throw it away.
The most common way of doing that is called beam search.
At every single iteration, somewhere in the network there will be a current token that's the most probable.
It might not be the eventual winner, but it's currently doing the best.
Anything that's worse than that, by some margin called the beam, just gets deleted.
So we set this beam width below the best token.
If we make the beam really tight, we can make things go really fast.
But we risk throwing away the token that actually would have gone on to win; it's just that we didn't know that at this point.
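A minimal sketch of that pruning step in Python, assuming a hypothetical Token structure with a log_prob field and a beam width measured in log-probability units (the value is made up):

```python
def prune(tokens, beam=200.0):
    # Keep only tokens within `beam` of the best current log probability;
    # anything further below the best is assumed to have no chance of winning
    best = max(t.log_prob for t in tokens)
    return [t for t in tokens if t.log_prob >= best - beam]
```

This would run once per frame, after tokens have been propagated, so the saving compounds over the whole utterance.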
