Computing P(W)

What kind of language models are possible for continuous speech?

So let's just refine, then, this idea of language models, and remind ourselves about this equation: Bayes' equation.
And remember that we decided this thing is awkward to compute and totally unnecessary, so we get rid of it.
We turn this into a proportional sign, and then we ask: what are we trying to do? We're trying to find the W that maximises this term.
That's the same W that maximises the right hand side.
So what we can do is write an argmax over W in here, and an argmax over W in here.
That argmax over W implies trying all the different Ws and choosing the one that maximises this value.
Looking at all the different values implies a search.
That's what the Viterbi algorithm is doing: it's searching in an efficient, parallel way.
So we're going to find the argmax on the left; it's always the same as the argmax on the right.
That term can disappear because it doesn't involve W.
So this equation is still true: that equals is still an equals.
And so we know the HMM computes this thing.
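Written out in standard notation (this is a sketch of the reasoning, not copied from the slide; O stands for the observed acoustics and W for the word sequence):

P(W \mid O) = \frac{P(O \mid W)\,P(W)}{P(O)} \propto P(O \mid W)\,P(W)

\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} P(O \mid W)\,P(W)

The denominator P(O) can be dropped from the argmax because it does not depend on W; the HMM supplies P(O | W), and the language model supplies P(W).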
Let's just refine our ideas of how to compute P(W). The first ones are the ones we used in the lab.
They're not really probabilistic models.
All they do is allow some sequences of words and not others.
They're sort of non-probabilistic models.
And here's one: this allows W can be "one", or W can be "two", and so on, but W can never be anything else.
It rules some things completely out and rules some things completely in.
We could think of it as assigning uniform probabilities: a probability of 0.1 to each of the 10 possible things, and a probability of zero to all the impossible things.
But there are no actual numbers in that grammar; there are no real probabilities, just implied ones, and what's implied is that they're uniform.
All of those branches are equally likely.
They've all got a probability of 1/10 on them implicitly, so it's a very simple one.
We could expand that idea; it just generalises to any other sort of grammar.
So hopefully many of you have gone on to the digit sequences.
We pull out this little bit of the model, get rid of that, and pull out this bit of the model here.
There's a thing that accepts sequences of digits, and all we'd need to do to refine that would be to add a junction here, and that would be one answer for the digit sequence model.
So this is the language model that says you can have a sequence of digits, going along the bottom path, or you can have the sentence, you know, "Call Maria".
We could do that by hand.
And again, that's just going to assign a non-zero probability to any valid path and exactly zero probability to any path that's not possible.
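As a rough sketch of what such a hand-written, non-probabilistic grammar amounts to (the allowed sentences below are invented for illustration, not taken from the lab):

# A hand-written grammar: it only says which word sequences are allowed.
# Implicitly every allowed sequence is equally likely, and everything else
# gets a probability of exactly zero.
ALLOWED_SENTENCES = {
    ("call", "maria"),
    ("one",), ("two",), ("three",),            # isolated digits
    ("one", "two"), ("three", "one", "two"),   # some digit sequences
}

def p_w(words):
    # Implied "probability" of a word sequence under the grammar:
    # uniform over the allowed sentences, zero for everything else.
    if tuple(words) in ALLOWED_SENTENCES:
        return 1.0 / len(ALLOWED_SENTENCES)
    return 0.0

print(p_w(["call", "maria"]))   # small but non-zero
print(p_w(["the", "mat"]))      # exactly zero: could never be recognised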
That's rather naive and not going to be very useful for any real application, except maybe a very simplistic one: maybe a very simple dialogue system where we're going to constrain what people are allowed to say.
We want to generalise that now to something that has probabilities, and eventually to something that we don't write by hand but that we learn from data.
So the first model we might think about is something we might call the word pair model.
The first speech recogniser I ever wrote was for a really old task called Resource Management, and Resource Management is a rather bizarre US Navy corpus of queries asking questions about ships.
Its language model was this word pair language model, and it was written initially by hand: for every word in the vocabulary, such as this one, we just listed the words that were allowed to follow it.
Not all words, because then it becomes just a completely flat, useless language model.
A subset of words were allowed to follow this word, and all the rest were not allowed.
So this word pair model here, we could write by hand still, but we could also learn it from data: just go and find all the pairs of words that we did see in some data set and remember them.
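A minimal sketch of learning such a word pair model from data (the toy corpus here is invented for illustration): just remember, for each word, which words were ever seen to follow it.

from collections import defaultdict

# Tiny invented corpus, one sentence per string.
corpus = [
    "the cat sat on the mat",
    "the hat sat on the cat",
]

# For each word, the set of words seen to follow it.
allowed_next = defaultdict(set)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        allowed_next[prev].add(nxt)

print(sorted(allowed_next["the"]))   # ['cat', 'hat', 'mat']
print(sorted(allowed_next["sat"]))   # ['on']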
The key point is that this can also be written as this finite state network.
So you could write it like this.
So, for example, we could write out all the words that are allowed to start a sentence.
So maybe we're only allowed to start a sentence with "the".
And then after the word "the", we could have this word, "cat", or "hat".
That's what this arc here implies.
And then after "cat", we have to say one of these three words, and so on. The word pair model directly maps onto a simple finite state network.
That's pretty obvious.
And then we can generalise that further and start putting probabilities on things.
So this model here is okay, except if somebody says something that we didn't consider: it will be given an exactly zero probability, and there'll be no way the recogniser could ever recognise it.
So if someone said "the mat", it's impossible: the probability that W equals "the mat" under that model equals zero, so we guarantee we will always get that wrong.
If someone says that, we probably want a model that does something a bit softer than that: one that says "the mat" is unlikely but not impossible, and would put a small number on that.
Just because we didn't see it doesn't mean it's not possible.
So we'd like to have all words possible after all other words, but not with uniform probabilities: higher probabilities for the things we saw more in the training data, and lower probabilities for things we saw less.
So we want a probabilistic model for P(W).
And so we could just generalise this word pair model. I'm not going to draw the whole thing fully connected; I'm just going to draw a subset of what it might look like.
So these arcs that go from "hat" to "on" and from "hat" to "sat" have now got probabilities on them.
This is saying the probability of seeing "on", given that the previous word was "hat" (read it off), is 0.75, and the probability of seeing "sat", given that the previous word was "hat", is 0.25.
And we could generalise that: we could just put an arc from every word to every other word.
The sum of the probabilities across those arcs needs to be one.
We can then learn those from data just by counting how many times those pairs of words occurred in some data.
This has now reached the edge of what we're going to cover in this course.
We're not really going to look at exactly how to estimate these from data, just note conceptually that a model is, for example, probabilistic word pairs. And let's give it its proper name.
It's called a bigram: "bi", two, a bigram.
We could also equally well call it a 2-gram; that's also fine.
A bigram language model can be learned from data simply by counting things, but the key point is that it can be mapped directly onto this finite state form.
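Conceptually, that counting really is all there is to it. A minimal sketch under the same toy assumptions as before (an invented corpus, and none of the smoothing a real estimator would need):

from collections import Counter

# Tiny invented corpus.
corpus = [
    "the hat sat on the mat",
    "the hat on the cat",
    "the hat on the mat",
    "the hat on the hat",
]

pair_counts = Counter()   # counts of (previous word, next word)
prev_counts = Counter()   # counts of each word appearing as a previous word
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        pair_counts[(prev, nxt)] += 1
        prev_counts[prev] += 1

def p_bigram(nxt, prev):
    # P(next word | previous word) = count(prev, next) / count(prev)
    if prev_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, nxt)] / prev_counts[prev]

print(p_bigram("on", "hat"))    # 0.75 in this toy corpus
print(p_bigram("sat", "hat"))   # 0.25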
Any language model of a similar form could be handled in the same way.
We're not going to draw it, because it'll get really big and messy, and we're not going to look at the details.
But we could imagine a model where the probability of a word doesn't just depend on the previous word, but on the previous two words: you have to remember a little more history. And that would be called a 3-gram, or trigram.
If you're doing any NLP courses, you've probably already seen these models, right? So this is easy.
If you haven't, don't worry: this is as much as you need to understand for this course.
To repeat, the key point is that all of these models have the general form of an N-gram, where N is the order of the model.
N could be 1, where the probability of a word depends only on its own identity; 2, where it depends on its identity and the preceding word; or 3, where it's a little window of three words; and so on.
All of these models can be written as finite state networks.
The states are actually labelled with a context of the preceding N-1 words, and they can all be learnt from data and written as a finite state network.
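As a sketch of that general idea (with made-up probabilities and no smoothing): for an N-gram, the state is just the tuple of the preceding N-1 words, and each state has a distribution over the next word on its outgoing arcs.

# A trigram (N = 3) viewed as a finite state network: states are labelled
# with the preceding two words, and arcs carry next-word probabilities.
trigram = {
    ("the", "cat"): {"sat": 0.6, "is": 0.4},
    ("cat", "sat"): {"on": 1.0},
    ("sat", "on"):  {"the": 1.0},
}

def p_sentence(words, n=3):
    # Probability of a word sequence under the trigram, ignoring
    # sentence-start padding and unseen contexts for simplicity.
    p = 1.0
    for i in range(n - 1, len(words)):
        state = tuple(words[i - (n - 1):i])
        p *= trigram.get(state, {}).get(words[i], 0.0)
    return p

print(p_sentence(["the", "cat", "sat", "on", "the"]))   # 0.6 * 1.0 * 1.0 = 0.6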
Why is that important? Because we can then compile it with the HMMs and do token passing.

There is a typo right at the end of this video: the pop-up caption should say "…left context" and not "…left contents".