Summary – compiling the recognition network

The idea of compiling all the models together is very natural if we take the generative view.

Let's just draw out explicitly how this idea of compilation works.
Let's do it for our really simple model: a phone dialling model.
The same thing would work for our bigram.
So: a start state, an end state, and some words; I don't know, maybe some people's names.
We might have our network like this.
That's our language model. We drew it by hand, but it could equally have been learned from data.
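To make that concrete, here is a minimal sketch of what such a word network might look like as a data structure. This is purely illustrative: the state names, words and probabilities are all invented, not from the video.

    # A hypothetical word-level network: a start state, an end state,
    # and arcs labelled with words and their language model probabilities.
    # Every name and number here is made up.
    word_network = {
        "start": [("call", 0.6, "w1"), ("dial", 0.4, "w2")],
        "w1":    [("jane", 0.5, "end"), ("john", 0.5, "end")],
        "w2":    [("nine", 1.0, "end")],
        "end":   [],
    }
    # Each entry: state -> list of (word label, probability, next state).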
We can turn this into a network of HMM states, step by step.
So the first thing we do is replace each word with its sequence of phonemes.
Let's just do this one.
OK, so we just rewrite the word as its sequence of phonemes, and then, for each of those phonemes, we'll just put in its little HMM.
OK, we just do that everywhere.
So words get substituted with their phonemes, and phonemes get substituted with their HMM states.
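As a sketch of that substitution step, something like the following; the dictionary entries and the assumption of three emitting states per phoneme HMM are invented for illustration, and this is not HTK's actual code.

    # Hypothetical compilation by substitution: word -> phonemes -> HMM states.
    dictionary = {
        "one": ["w", "ah", "n"],   # invented pronunciations
        "two": ["t", "uw"],
    }

    def compile_word(word, states_per_phoneme=3):
        """Replace a word with its phoneme sequence, then each phoneme
        with a chain of named HMM states."""
        states = []
        for phoneme in dictionary[word]:
            for i in range(1, states_per_phoneme + 1):
                states.append(f"{word}/{phoneme}/s{i}")
        return states

    print(compile_word("two"))
    # ['two/t/s1', 'two/t/s2', 'two/t/s3', 'two/uw/s1', 'two/uw/s2', 'two/uw/s3']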
Let's quickly just do that.
So these all disappear, and what we're going to be left with I'll just draw quickly.
Remember, HMMs made in this way have self-transitions, and the probabilities on these arcs inside an HMM are just the HMM transition probabilities.
We didn't talk very much about how we might learn those, but they are going to get learned during Viterbi training and Baum–Welch training: we count how many times each arc is used, in proportion to all the other arcs out of that state.
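That counting idea might look something like this in code; it's a toy version of the re-estimation step, with invented counts.

    # Toy arc-usage counts gathered during training (invented numbers):
    # arc_counts[state][next_state] = how many times that arc was used.
    arc_counts = {
        "s1": {"s1": 70, "s2": 30},   # self-loop taken 70 times, exit 30
        "s2": {"s2": 55, "s3": 45},
    }

    def estimate_transitions(counts):
        """Each arc's probability is its count divided by the total
        count of all arcs out of the same state."""
        probs = {}
        for state, outgoing in counts.items():
            total = sum(outgoing.values())
            probs[state] = {nxt: c / total for nxt, c in outgoing.items()}
        return probs

    print(estimate_transitions(arc_counts)["s1"])   # {'s1': 0.7, 's2': 0.3}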
The arcs in between phonemes might be pronunciation probabilities.
Imagine we have a fancy dictionary that allowed two pronunciations for this word.
So the second phoneme maybe has alternatives, and we might have branches in here.
So imagine those two pronunciations.
And so these arcs are going to carry the probabilities of the two competing pronunciations of a word.
So those ones aren't within the HMMs; they're arcs from the dictionary.
In general, though, there will mostly just be one pronunciation per word.
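A dictionary with pronunciation variants might be represented like this; the words, pronunciations and probabilities are all invented for illustration.

    # Hypothetical dictionary entries; the branch between the competing
    # pronunciations carries these probabilities.
    pronunciations = {
        "either": [
            (["iy", "dh", "er"], 0.6),   # "ee-ther"
            (["ay", "dh", "er"], 0.4),   # "eye-ther"
        ],
        "hello": [
            (["hh", "ax", "l", "ow"], 1.0),   # the usual case: one pronunciation
        ],
    }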
And then there's the arcs that connect things together.
These arcs are language model arcs, and they might have language model probabilities.
So the probabilities could come from the language model, from the dictionary, or from inside the acoustic HMMs.
But none of that matters, because they're just probabilities: they're just numbers on arcs.
And then we just put our token in at the beginning here, with its probability of one. It's the only token; we send copies of it down the arcs and just propagate them through.
They just flow through the model, generating a sequence of observations, and at some point some tokens arrive here, at the end.
We pick the biggest one, and that's the winner.
We'd better make sure that we know what path it took.
So we add one little thing to token passing that we didn't mention yet, and that's that tokens need to remember the words they went through.
They could remember every state they went through, if we cared about that.
So here's what tokens are going to do: as they go down language model arcs, there are going to be little tags on those arcs.
The tags are the names of words, and as tokens pass through them, they'll just add that word to their word record, so that we can look at the winning token.
And it'll say: I went through this word, and this word, and this word; and that's the recognition result. It also has its probability, which we don't care about any more; that was only for comparing to other tokens.
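Here's a minimal sketch of that bookkeeping; it is not HTK's implementation, and the network, words and probabilities are invented. Each token carries a log probability plus the list of word tags it has passed through.

    import math

    class Token:
        def __init__(self, log_prob=0.0, words=()):
            self.log_prob = log_prob
            self.words = list(words)   # the word record

    def pass_token(token, arc_log_prob, word_tag=None):
        """Send a copy of the token down an arc; if the arc carries a
        word tag (a language model arc), append it to the word record."""
        new = Token(token.log_prob + arc_log_prob, token.words)
        if word_tag is not None:
            new.words.append(word_tag)
        return new

    # At each state we'd keep only the best (max) token; at the end state
    # we pick the winner and read off its word record.
    winner = max(
        [pass_token(Token(), math.log(0.6), "call"),
         pass_token(Token(), math.log(0.4), "dial")],
        key=lambda t: t.log_prob,
    )
    print(winner.words)   # ['call'] -- the recognition result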
OK, so this generative thing here, this idea of a generative model, turns into what we might call compilation when we come to implement it, which is exactly what HVite does.
When you run HVite, you'll see it print some stuff out.
It'll say something about the number of states and the number of arcs it's made.
That's just it telling you how big the network was that it compiled.
It won't be very big for the digit recogniser; it will be a little bit bigger for the sequence recogniser, because there are a few extra arcs.
So it's just telling you how big the thing it compiled was, and the compilation was the thing we just did: substituting things in until we end up with one great big network.
It's a finite state network, and then we just do token passing on that.
So the fact that these arrows are all pointing the same way, and that everything is a generative model, means that we can compile together the language model, the dictionary and the acoustic model into a single, unified model, and then we can just run the same algorithms on it.
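To connect that back to HVite's printout: once everything has been substituted into one finite-state network, "number of states" and "number of arcs" are simply the size of that structure. A toy illustration, with an invented network:

    # A toy compiled network: state -> list of (next_state, log_prob) arcs.
    network = {
        "start": [("s1", -0.5)],
        "s1":    [("s1", -0.3), ("s2", -1.2)],
        "s2":    [("end", -0.7)],
        "end":   [],
    }
    n_states = len(network)
    n_arcs = sum(len(arcs) for arcs in network.values())
    print(f"states: {n_states}, arcs: {n_arcs}")   # states: 4, arcs: 4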
