From subword units to n-grams: hierarchy of models

Defining a hierarchy of models: we can compile different HMMs to create models of utterances

This video just has a plain transcript, not time-aligned to the video.
This is the last module of the course. Module nine actually has two parts to it. The first part is actually quite easy: connected speech. Surprisingly easy. The second part is quite hard: our final destination of HMM training will be the Baum-Welch algorithm, which is conceptually the hardest part of the entire course.

So it looks like there's an awful lot to cover in this module, but the individual videos are pretty complete, and so this is really just an overview to join all of that together and show you how the individual videos fit into this topic map. These are topics we've actually covered previously, but we're going to see them link back in now and help us get towards our final destination for HMM training: the Baum-Welch algorithm.

Before that, let's just deal with connected speech, which is surprisingly easy. We've made it easy by choosing to use finite state models, and we're going to keep choosing finite state models, and everything will compile together into a single graph that will just be another hidden Markov model.
We motivated HMMs from pattern matching, but I think we can get beyond that now, so we can treat the HMM as a model in its own right. It's a finite state model, so it's made of a finite number of states joined together in a network with transitions. In each of those states is a Gaussian probability density function whose job is to emit observations, and so the model is emitting, or generating, an observation sequence. It's a probabilistic generative model, so we've moved on now from pattern matching into thinking always in terms of probabilistic generative models.

The key point of the model is that it can capture the variability that's present in natural examples of speech. It learns from data, so the model accounts for variability, and it does that at its core by using probability density functions (pdfs), and we're going to use Gaussians. The finite state part of the model is simply there to specify the sequence in which to use those pdfs. So the model is a generative model: it generates sequences. For that reason, we now refer to our MFCC vectors as observations, because they are the observed output of a generative model.

When the model generates an observation sequence, all we see is the observation sequence. We don't see how the model did that. Specifically, we don't see the state sequence the model took: that state sequence is hidden. The state sequence is just another way of saying the alignment between states and observations. Many possible state sequences can generate the same observation sequence. Each state sequence will have a different probability of generating the observation sequence, but there will be many non-zero ones, and that arises because a Gaussian can generate any observation with non-zero probability: a Gaussian assigns a non-zero probability (or, strictly speaking, probability density) to any value.
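To make the generative view concrete, here is a minimal Python sketch of a three-state, left-to-right HMM with one-dimensional Gaussian emission densities taking a random walk and emitting an observation sequence. All of the parameters are invented for illustration; a real acoustic model would use multivariate Gaussians over MFCC vectors.

```python
import random

# Invented parameters for a 3-state left-to-right HMM (illustration only).
# Each emitting state has a 1-D Gaussian: (mean, standard deviation).
emission = {1: (0.0, 1.0), 2: (3.0, 0.5), 3: (-2.0, 1.5)}
# Self-loop probability for each state; otherwise move on to the next state.
self_loop = {1: 0.6, 2: 0.7, 3: 0.5}

def generate():
    """Random walk through the states, emitting one observation per visit."""
    state, observations, state_sequence = 1, [], []
    while state <= 3:
        state_sequence.append(state)
        mean, sd = emission[state]
        observations.append(random.gauss(mean, sd))
        if random.random() > self_loop[state]:
            state += 1      # leave the state; "state 4" is effectively the exit
    return observations, state_sequence

obs, states = generate()
print("hidden state sequence:", states)   # in recognition we never get to see this
print("observation sequence:", [round(o, 2) for o in obs])
```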
In practice, when we're doing inference with the model (using the model to do pattern recognition, to do speech recognition), we can get away with only considering the single most likely state sequence. There's an exact algorithm that finds that one most likely state sequence, called the Viterbi algorithm, and that's what makes speech recognition fast and efficient. But now we need to revisit the definition of the model and see that this approximation of using the most likely state sequence needs to be replaced with a sum over all possible state sequences, because when we train the model we care a lot less about speed and much more about getting the best estimate of the model's parameters.
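As a toy illustration of the difference (not the efficient algorithms themselves), this sketch brute-forces every possible state sequence of an invented two-state HMM for a short observation sequence, and compares the probability of the single best sequence, which is what the Viterbi algorithm finds, with the sum over all sequences, which is what training will need.

```python
import math
from itertools import product

# Invented 2-state HMM: transition probabilities and 1-D Gaussian emissions.
trans = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.0, (1, 1): 1.0}
initial = {0: 1.0, 1: 0.0}
emission = {0: (0.0, 1.0), 1: (3.0, 1.0)}          # (mean, standard deviation)

def gauss_density(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

observations = [0.2, 1.9, 3.1]

best, total = 0.0, 0.0
for seq in product([0, 1], repeat=len(observations)):
    p = initial[seq[0]] * gauss_density(observations[0], *emission[seq[0]])
    for prev, cur, x in zip(seq, seq[1:], observations[1:]):
        p *= trans[(prev, cur)] * gauss_density(x, *emission[cur])
    best = max(best, p)    # what Viterbi finds (efficiently, without enumerating)
    total += p             # what training needs: the sum over all state sequences

print(f"most likely single state sequence: {best:.6f}")
print(f"sum over all state sequences:      {total:.6f}")
```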
So let's get on with the first part of the module, which is the easiest part, and that's to deal with connected speech. Then we'll come back to this idea of training a model.

A lot of the language used so far about hidden Markov models has involved saying things like "there is a model for each class that we're trying to recognise". Typically, those classes have been whole words, and we assumed that we would need to do a separate computation for each model, to compute the probability that it emitted the given observation sequence that we're trying to recognise. We would repeat that a number of times, one time for each word in the vocabulary: perhaps 10 times for our digit recogniser. Compare 10 numbers, pick the highest, and label the observation sequence with that class. That's fine. That would work, but it's hard to see how it might extend to sequences of words.

So we're now going to introduce language models. Language models will allow us to account for the prior probability of a word, or of a word sequence, and now we'll be able to move away from computing 10 separate models to combining them into a single model. That's where we're going, and our solution is going to be an N-gram. So the solution to word sequences is to create a model of an utterance that generates words. Everything is a generative model.
Let's just recap how an N-gram language model works. In an N-gram, we have to choose the value of N. A popular choice might be two for a bigram, three for a trigram, and maybe higher if we have a lot of data. Let's draw a 3-gram, or trigram. In an N-gram language model there are a number of states, and the states are labelled with a word history. So in a trigram language model the word history is the previous two words, and we're getting the probability of the next word: that's where N = 3 comes from. So we label a state with a history. There are going to be some number of states, and if we had M words in our vocabulary, we're going to have M squared such states in a trigram. I won't draw them all, because if M was in the thousands, we would have millions of such states. Let's just draw a few examples.

Here's one where the word history is "the cat". Any arc coming out of that state is going to be the probability of the next word, given that the history was "the cat". Maybe the next word might be "sat", which is quite a likely word, but there would be one arc for every possible following word. So we've got "the cat" in the history, and "sat" is the thing we're predicting the probability of. We generate the word "sat" as we move down this arc. The history now slides forward, so the history becomes "cat sat". "The" has dropped off the end, because it's outside the N-gram now and N is only three, so the window of history is two words. Any arc coming out of here is going to be the probability of some word given "cat sat", and so we could put any word we like in here and predict its probability, and so on.

So the states are labelled with the word history, and the arcs carry the probability of the next word given that word history. Out of every state there will be lots of arcs, going to all the possible next words. If this was a complete model, there would be M such arcs, because any word could follow this history, in principle with a non-zero probability. This model has a size that's governed by the vocabulary and by N. If N got larger, if instead of a trigram we attempted to build a 4-gram, each state would be labelled with a three-word history, and we would now have not M squared but M cubed states. So we can see why we very quickly have to keep N under control: the trigram already has a lot of states, and the 4-gram is going to have a huge number of states. It's very rare that we might see anything bigger than four; maybe 5-grams are about the limit.
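As a quick back-of-the-envelope check of that growth (assuming, just for illustration, a vocabulary of M = 10,000 words):

```python
M = 10_000                      # assumed vocabulary size, purely for illustration
for N in (2, 3, 4):
    histories = M ** (N - 1)    # one state per (N-1)-word history
    parameters = M ** N         # one probability per (history, next word) pair
    print(f"{N}-gram: {histories:,} histories, {parameters:,} probabilities")
```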
It's not that the model is too big; it's that the number of parameters the model has is too big, and those parameters have to be estimated from data by counting word sequence occurrences in some corpus. We're not going to do the training of N-grams in this course, but I think you can see intuitively that it would be very straightforward. We would just count how many times "sat" followed "the cat", count how many times any other word followed that same word history, normalise, and get a probability from that. That would be a simple way to estimate the probabilities of this N-gram.
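Here is a minimal sketch of that counting-and-normalising idea: plain maximum likelihood estimation with no smoothing, on an invented toy corpus.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat slept on the mat".split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
history_counts = Counter(zip(corpus, corpus[1:]))

def p_next(w, history):
    """P(w | two-word history), estimated by counting and normalising."""
    return trigram_counts[(*history, w)] / history_counts[history]

print(p_next("sat", ("the", "cat")))    # 0.5: "the cat" is followed by "sat" half the time
print(p_next("slept", ("the", "cat")))  # 0.5
```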
For our purposes, the most important property of the N-gram language model is that it's finite state. There might be a large number of states, but there's a fixed, countable number: it's finite.
That's the solution to connected speech: make a model that emits word sequences. If we took a random walk around this N-gram language model, it would emit random sentences.

The second problem we have, if we want to do large vocabulary connected speech recognition, is that the vocabulary might be large. We need to define it in advance: we want to specify the words that our recogniser will recognise. And we don't want to be restricted to having many examples of each of those words in the training data. That would be very restrictive, and it would prevent us from ever adding a word to our recogniser without going back and getting more recorded speech. We don't want to have to do that. If we've got a vocabulary of tens of thousands of words, we can't guarantee that every one of them occurs enough times in the training data, so we simply can't use whole word models.

The solution to that is very obvious: it's exactly what we did in speech synthesis. We didn't concatenate recordings of words; we concatenated recordings of phoneme-sized things. In speech synthesis those were diphones. Let's start with something very simple for speech recognition: the phonemes themselves. We'll make models of phonemes. The solution is of the same form as for speech synthesis: use sub-word models, and phonetics tells us the appropriate sub-word unit to use. It also allows us to write a dictionary, which is the mapping from words to phoneme sequences, by hand, in the same way we would do for speech synthesis. So, alongside the N-gram solution for connected speech, we have the solution for large vocabulary, and that's to use sub-word units.
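For example, a hand-written pronunciation dictionary is just a mapping from each word in the vocabulary to a phoneme sequence. A tiny sketch (the entries and phoneme symbols are illustrative, not taken from any particular dictionary):

```python
# A toy pronunciation dictionary: word -> phoneme sequence (illustrative symbols).
dictionary = {
    "the": ["dh", "ax"],
    "cat": ["k", "ae", "t"],
    "sat": ["s", "ae", "t"],
}

def pronounce(word_sequence):
    """Expand a word sequence into the phoneme sequence its word models would emit."""
    return [phone for word in word_sequence for phone in dictionary[word]]

print(pronounce(["the", "cat", "sat"]))
# ['dh', 'ax', 'k', 'ae', 't', 's', 'ae', 't']
```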
But everything's a generative model, so this is also a generative model: we're going to have a generative model of a word. Let's have a generative model of the orthographic word "cat" (that's the spelling of "cat"), and it's going to be finite state, like that. Now, it doesn't really matter in finite state formalisms whether we label things on the states or on the arcs; we'll move through them all. There are different possible formalisms. We saw in the finite state transducers for speech synthesis that all the action happened on the arcs, and the states we just called the memory. We saw in the N-gram language model that the probabilities are on the arcs, and the states are the memory of the history. Here I've drawn the phonemes on the states, but I could equally have drawn them on arcs joined together with states. That really doesn't matter. The point, again, is that this is a finite state model of the word "cat": given the orthographic word, it emits, it generates, the sequence of phonemes.

We've got things happening at a couple of different levels there. We've got sentences made of words. Sometimes we use the word "utterance" rather than "sentence", just to be more general: "sentence" implies some grammatical structure, and spoken language isn't always grammatical, so we use the more general term "utterance". An utterance is made of words; it's a sequence of words. A word is a sequence of phonemes.
So there's some hierarchy in spoken language, and having described that very simply, we can solve the problem of large vocabulary connected speech recognition with this hierarchy of models, which, combined, will give us a generative model of a complete utterance. First, a generative model of the utterance emits a word sequence. For each word in that sequence, a generative model of that word emits a phoneme sequence; that model is created from a dictionary (I would say that that model is the dictionary). And then, for each phoneme, we have a hidden Markov model, and that emits an observation sequence: it generates the acoustics of that phoneme.
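A compact sketch of that hierarchy as a chain of generative steps. Every component here is a stand-in: a fixed word sequence instead of an N-gram, a three-word toy dictionary, and a fake phoneme "HMM" that just returns a few random vectors.

```python
import random

# Stand-ins for the three levels of the hierarchy (all invented for illustration).
def utterance_model():                       # e.g. an N-gram: emits a word sequence
    return ["the", "cat", "sat"]

dictionary = {"the": ["dh", "ax"], "cat": ["k", "ae", "t"], "sat": ["s", "ae", "t"]}

def phoneme_hmm(phone, frames=3, dim=2):     # emits an observation sequence per phoneme
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(frames)]

# Utterance -> words -> phonemes -> observations.
observations = []
for word in utterance_model():
    for phone in dictionary[word]:
        observations.extend(phoneme_hmm(phone))

print(f"{len(observations)} observation vectors generated for one utterance")
```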
In the assignment we don't have phoneme models; we're omitting this step here, and we have a generative model of the utterance that emits a word sequence and a generative model of each word that emits an observation sequence. That's fine too: it's still a hierarchy; it just omits one of the levels.
So we have this hierarchy of models that are all finite state. All we need to do is define how to join them together. The technical term for that in finite state machines is composition. We're going to use another term, compiling, which is the terminology often used by HTK, for example. How does that work?

This is only going to be possible if all the generative models are finite state. We combine them together, or compose them in finite state model terminology, to make a generative model of an utterance that emits acoustic observations: it emits MFCCs. The normal terminology in the field is not really to talk about an utterance model; we call that the language model, for example the N-gram written as a finite state network. Instead of talking about a word model, we just talk about the dictionary, the pronunciation dictionary. And the sub-word model we call the acoustic model, because it's the part of the model that finally, actually, emits the MFCCs, or whatever observation vector we've chosen.
Composition in finite state models can mean quite a lot of different things, and this is not a complete course on finite state models, so I'm only going to define the sort of composition that we're interested in here, and that's this hierarchical way of doing things. Think of part of our model; let's keep things simple and have an utterance model where the states are just labelled with words. We can think of replacing each word with this word model, and then we can think of replacing each of these phonemes with a hidden Markov model. And now it's obvious why we like HTK's way of having non-emitting start and end states, because they're convenient places with which to glue together all these different models.

So composition here means replacing each word with a model of that word, to get this, and then, within that, replacing each phoneme with its generative model, its acoustic model. That's an operation happening on a finite state network. It's not a transducer of the form we saw in speech synthesis: it doesn't have input and output; it just generates. If we remember the path we took through it, we can recover the most likely sequence of models and use that as the output of the system.
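Here is a sketch of that substitution, under heavy simplifying assumptions: the utterance model is reduced to a single fixed word sequence rather than a branching network, each phoneme HMM is represented just by the names of its three emitting states, and self-loop transitions are left out. The point is only to show words being replaced by phonemes, and phonemes by HMM states, glued together at non-emitting states.

```python
# Toy inputs (all invented): a word-level network reduced to a single path,
# a pronunciation dictionary, and 3 emitting states per phoneme HMM.
utterance = ["the", "cat"]
dictionary = {"the": ["dh", "ax"], "cat": ["k", "ae", "t"]}
states_per_phone = 3

def compile_graph(words):
    """Replace each word by phonemes, each phoneme by its HMM states: one big HMM."""
    graph = ["<start>"]                      # non-emitting entry state (the glue)
    for word in words:
        for phone in dictionary[word]:
            for i in range(1, states_per_phone + 1):
                graph.append(f"{word}/{phone}/s{i}")   # an emitting state of one phone HMM
    graph.append("<end>")                    # non-emitting exit state
    return graph

for state in compile_graph(utterance):
    print(state)
```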
The resulting structure is a finite state model, and I'm going to call it the graph, but, as we're about to see, it's actually just one big HMM. We'll always have some well-defined place to start and somewhere to end, and we're going to have a big finite state network which is the composition of an N-gram language model (or any other finite state language model, perhaps one drawn by hand in the case of a simple digit recogniser), a dictionary, and an acoustic model for each one of the phonemes. After the composition has happened, after the compilation has happened, we just get this big graph, and it's just an HMM. So we might have some non-emitting states, some emitting states, more non-emitting states, more emitting states, and so on.

Once we've made this graph, the composition process is finished. All that we have is an HMM. This is just an HMM, more complicated looking than some of the HMMs you've seen before, but with nothing in it that's different. It's only got some non-emitting states that just do the glue; it's got some emitting states that will emit observations every time we visit them; it's got some transitions which are part of the individual HMMs, some transitions which join HMMs together, some transitions which come out of one word and into the start of the next word, and so on.
It doesn't matter, for the purpose of computation, where each of the transitions came from, because the whole thing is just an HMM. Some of them (this one, maybe) might be a probability from the language model. Some of them might be self-transitions on HMM states; they come from the acoustic model. None of that matters, because if you understand token passing, you put a token here, press "go" on the token passing algorithm, and at the point that all the observations have been generated, you look at which token arrives at the end state at exactly that time. That's the winner. All it needs to do is remember the sequence of models it went through, or at least each word that it went through, and that will be the recognition output. It's really that simple. There's nothing more to say. The algorithm is the same on this network as it would be on a small, simple HMM.
The HTK tool HVite does exactly this in software: it composes (it compiles) this graph and then does token passing on it. Token passing on that graph is just a practical implementation of dynamic programming, and because we're doing it on something that's a hidden Markov model, we call it the Viterbi algorithm. So we can do the Viterbi algorithm on that graph. That graph, as I just drew it, is the same as the picture of an HMM when it's drawn like this: the finite state picture of an HMM. There's no explicit time axis. I was very careful not to present this as having time flowing from left to right. It's just a finite state model, because we could have transitions like this, and that's perfectly legal. That graph is the finite state picture of the model for speech recognition. We can do dynamic programming directly on that data structure, and the implementation of that is called token passing.
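A minimal token passing sketch over a small compiled graph, in log probabilities. This is the standard Viterbi recursion written as token propagation, not HTK's actual implementation; the graph is a simple chain with self-loops, and the transition and emission scores are invented.

```python
import math

NEG_INF = float("-inf")

# Invented compiled graph: a chain of 4 emitting states, each with a self-loop.
# Emission scores stand in for log Gaussian densities of each frame in each state.
n_states = 4
self_loop_logp = math.log(0.6)
next_logp = math.log(0.4)
emission_logp = [                 # emission_logp[frame][state], invented numbers
    [-1.0, -5.0, -6.0, -7.0],
    [-2.0, -1.5, -5.0, -6.0],
    [-4.0, -2.0, -1.0, -5.0],
    [-6.0, -4.0, -2.0, -1.2],
]

# One token (best log probability so far) per state; start with everything in state 0.
tokens = [0.0] + [NEG_INF] * (n_states - 1)

for frame in emission_logp:
    new_tokens = [NEG_INF] * n_states
    for state, logp in enumerate(tokens):
        if logp == NEG_INF:
            continue
        # Self-loop: stay in the same state.
        new_tokens[state] = max(new_tokens[state], logp + self_loop_logp)
        # Forward transition: move to the next state in the chain.
        if state + 1 < n_states:
            new_tokens[state + 1] = max(new_tokens[state + 1], logp + next_logp)
    # Every surviving token absorbs the emission score of this frame in its state.
    tokens = [t + e for t, e in zip(new_tokens, frame)]

print("winning token (log probability in the final state):", tokens[-1])
```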
As soon as that graph gets reasonably large, we will want to do some pruning, and pruning is an approximation. The Viterbi algorithm is an exact algorithm: it will always exactly find the most likely state sequence through the model; it will find the same one every time; it's not an approximation. But to do that, it has to consider many state sequences, and many of those will be pretty unlikely a lot of the time, so we can make the algorithm go a lot faster by pruning away, by discarding, some paths.

If you've seen the little video about token passing, here's a screenshot of it. At this moment in the algorithm there are a lot of tokens alive. The algorithm has been running for enough time steps, and has emitted enough observations, that tokens have propagated throughout the whole model. Some of these tokens will be very likely: they are likely to end up on the best path; they have a good chance of being the winner. Others might be very unlikely. We might be willing to take a small gamble and say they're so unlikely we don't think they'll ever become the winner. We can't prove that, we can only guess it; we guess that they won't become the winner and we discard them, we delete them. So imagine we delete that one there. This is an approximation. If we do too much pruning, we will eventually delete tokens that would have gone on to be the winner, without knowing it, because we haven't finished the computation. But normally we can do quite a lot of pruning without taking much risk. That pruning will make computation much faster, because we are massively reducing the number of tokens that we need to propagate at each time step.

Think about some big graph with a very large number of HMM states. Most of the tokens will have a very low probability. How do we decide whether they're too low? How do we decide which tokens to prune? We just find the best token at that moment in time, and we define a probability that is some distance below that, some threshold, and that amount is often called the beam. So this is beam pruning, and we discard all tokens that fall outside the beam: those with a lower probability than that threshold.
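And the pruning step itself is tiny. A sketch, applied to the set of tokens alive at one time step (the state names, log probabilities and beam width are all invented):

```python
# Tokens alive at one time step: state -> log probability (invented numbers).
tokens = {"s1": -12.3, "s2": -15.8, "s3": -40.2, "s4": -19.9, "s5": -55.0}
beam = 10.0    # prune anything more than this far (in log probability) below the best

best = max(tokens.values())
survivors = {state: logp for state, logp in tokens.items() if logp >= best - beam}

print("best token:", best)
print("survivors inside the beam:", survivors)   # s3 and s5 are discarded
```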
That finishes connected speech. We form an utterance model, which is an N-gram; it emits words. For each word, we emit phonemes. That's a hierarchy of models. We join them together through a process which we might formally call composition, but which we'll more informally call compiling; that's what HTK does when it compiles a graph. We sometimes call that the recognition network. That's the thing on which we can do token passing, directly on that data structure, and that will perform the process of recognition. It is an implementation of the Viterbi algorithm for hidden Markov models. If the graph is reasonably large, we'll want to speed up that computation by discarding very unlikely paths before they've propagated all the way through the model. That's an approximation, and it's a process known as pruning.
