Summary – the generative view

Yes, again! Really, this is the best way to see how everything integrates elegantly, just by multiplying probabilities.

We're not going to reiterate all of it in detail; the point is to see all the components quickly, all together, and see how everything fits together.
If we wanted to really push the generative model right to its absolute limit, we could say that the vectors of MFCCs generate little frames of waveform.
That's what we really want in synthesis, and there we really will literally do that.
In recognition, though, that little bit at the very bottom of the chain, between the waveform and the features, is actually deterministic signal processing.
It's handcrafted.
It's not really part of the whole generative model framework.
For example, there's no distribution over those waveform samples, so parameterisation is kind of a separate step.
ASR just takes the waveform and immediately replaces it with a sequence of feature vectors.
Typically MFCCs; other parameterisations are available.
There are other things we could use.
We're not going to talk about them here, but, for example, we could use filterbank coefficients if our recogniser didn't use Gaussians.
So: parameterise the speech, and then just throw the waveform away.
This is always the first step in speech recognition: from these messy waveforms straight to a sequence of observation vectors.
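Just to make that step concrete, here is a minimal sketch of what the parameterisation might look like in code, assuming the librosa library and some typical (but not prescribed) settings; this isn't the tool used in the lab, just an illustration.

    # A minimal sketch of parameterisation, assuming the librosa library.
    # The frame length, shift and number of coefficients are typical values,
    # not ones prescribed by the course.
    import librosa

    waveform, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
    mfccs = librosa.feature.mfcc(
        y=waveform, sr=sr,
        n_mfcc=13,                  # 13 cepstral coefficients per frame
        n_fft=int(0.025 * sr),      # 25 ms analysis frames
        hop_length=int(0.010 * sr)  # 10 ms frame shift
    )
    # mfccs.shape == (13, number_of_frames): one observation vector per frame.
    # From here on, the waveform itself is thrown away.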
Then we need acoustic models of sub-word units.
We know how to do that with HMMs.
We need something to map between language model units and acoustic model units.
If we're using sub-word models, then we need a dictionary.
We're just stating that here; we're not going to go into much more depth about it.
We're going to assume we could write one by hand, or buy one, or get one from somewhere.
Finally, we need something that generates sequences of words for whole utterances: that's going to be a language model.
So these are our probabilistic generative models.
The parameterisation doesn't quite fit in the generative model paradigm, and we won't try to force it.
It's just going to be some signal processing out front, some handcrafted stuff.
Then we need to talk about a few other little topics, to glue everything together.
Let's just try and do that all in one go.
Okay, so how might you try and see the whole speech recogniser together?
Well, it's very tempting to draw some sort of flow chart: you've got some sort of speech recogniser, and you've got your speech, your waveform.
Somehow the waveform goes in, and what pops out is a word sequence.
Let's call it Ŵ: the W that maximises the probability under this generative model.
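In symbols, that's the familiar decision rule: pick the word sequence whose generative model was most likely to have produced the observed features,

    \hat{W} = \underset{W}{\arg\max}\; P(W)\, p(O \mid W)

where O is the sequence of observation (MFCC) vectors, P(W) is the language model, and p(O | W) is the acoustic model, with the dictionary supplying the mapping from words to sub-word units.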
It's tempting to think of it as a piece of software, and that's certainly the case: there's a piece of software that loads a waveform, or pulls it off the sound card, and prints out a word sequence.
But this is a very misleading way of thinking about it, because it breaks our generative model view of things.
It's an implementational diagram.
It's not really a diagram of the true probabilistic model that's going on, so I'd encourage you not to think in flow charts like that.
I would encourage you very strongly to think in this way.
Instead, think of it as something that, given words, generates speech, or rather sequences of MFCCs; everything in there is a probabilistic generative model.
Once we do that, it becomes really obvious how to fit these different generative models together.
It also gives us some clues about what forms of generative model are going to work and what forms are not.
If we make all of our generative models compatible with token passing, in other words finite state, then everything glues together in a beautiful, clean way.
More on that in a second.
Okay, parameterisation is just this deterministic signal processing.
Hopefully you understand why we need to do each of these steps.
A lot of this is to do with the fact that we've chosen a rather naive model, the hidden Markov model.
It's got some very powerful assumptions in it.
For example, we would like to use diagonal-covariance Gaussians, which assume that the coefficients within an observation are independent of each other.
We've done a lot of massaging, a lot of manipulation of our features, to try and make that as true as possible.
That's what the cosine transform does.
The HMM also makes this incredibly powerful, strong assumption that one observation is conditionally independent of the previous one, the next one, and every other observation, given the HMM state that generated it.
We'll see that we've done a lot of things to the features to mitigate those crazy and wrong assumptions we made when we chose the HMM.
We chose it, remember, because it's mathematically so convenient.
We did the cosine transform to make things statistically independent within frames.
And then we have these delta and delta-delta coefficients, which are the differences between frames, and which get us the dependence that the HMM can't model.
The HMM is like a goldfish going around its bowl: once it's gone round, it's completely forgotten how it got there, so we can't condition the probability of one observation on the value of the previous one, because it's already forgotten.
So instead, we put some information about the previous one into the current frame, as deltas: as gradients, slopes, rates of change.
There's even information about the rate of change of the rate of change: that's the delta-deltas.
So the deltas and delta-deltas are just getting us around this HMM assumption.
We did a lot of feature massaging, but it's better to do that than to try and use a much more complex and expensive model.
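As a rough sketch of those two manipulations (illustrative only: the exact number of coefficients, window lengths and delta formulas vary between systems), the cosine transform and the deltas might look like this:

    # Sketch of the two feature manipulations described above; numbers are illustrative.
    # Assumes a matrix of log filterbank energies with shape (num_frames, num_filters).
    import numpy as np
    from scipy.fftpack import dct

    def cepstra(log_fbank, n_coeffs=13):
        # DCT across the filterbank dimension decorrelates the coefficients within
        # a frame, so diagonal-covariance Gaussians become a reasonable model.
        return dct(log_fbank, type=2, axis=1, norm="ortho")[:, :n_coeffs]

    def deltas(features):
        # Simple central difference between neighbouring frames: puts information
        # about adjacent frames into the current one, which the HMM itself cannot model.
        padded = np.vstack([features[:1], features, features[-1:]])
        return (padded[2:] - padded[:-2]) / 2.0

    # Stack statics, deltas and delta-deltas into one 39-dimensional observation:
    # statics = cepstra(log_fbank)
    # obs = np.hstack([statics, deltas(statics), deltas(deltas(statics))])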
And that's just empirically what we find to be the case.
So we parameterise things, and then, from this point on, everything is a generative model.
We have an acoustic model that generates sequences of observations, and a language model that generates sequences of words.
Typically, something like an n-gram.
We're not covering how to learn that from data in this course.
That's for your NLP courses, or the speech recognition course next semester.
Real systems use something like a trigram, perhaps a 4-gram, but only if we've got enough data.
If we think about trigrams and 4-grams, these sequences of three or four words, and think about looking on the Web, or in some large database of text: we won't see every possible trigram.
There's a very large number of them.
Okay, if we have 20,000 words, a 20k vocabulary, the number of word pairs is 20,000 squared.
That's a pretty large number, and the number of trigrams is 20,000 cubed.
That's an even larger number; we'll never see all of those, however big our database is.
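Just to make that arithmetic explicit:

    # The counting argument above, for a 20,000-word vocabulary.
    V = 20_000
    print(f"possible bigrams:  {V**2:,}")   # 400,000,000
    print(f"possible trigrams: {V**3:,}")   # 8,000,000,000,000
    # However large the text corpus, most of these will never be observed.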
So what real systems do is try to use trigrams when they've seen them enough times; when there's a gap, the model backs off to the bigram, and then it can back off to the unigram.
That's called backing off, or smoothing.
Real systems do complex things with language models, and that's the domain of the ASR course next semester.
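Here is a deliberately over-simplified sketch of the back-off idea only; real schemes such as Katz back-off also discount the counts and weight the lower-order estimates so everything still sums to one, which is material for that course, not this one.

    # Over-simplified back-off: use the trigram estimate if it has been seen often
    # enough, otherwise fall back to the bigram, then the unigram.
    # trigram_counts, bigram_counts and unigram_counts are assumed to be dicts of
    # counts gathered from a text corpus; MIN_COUNT is an illustrative threshold.
    MIN_COUNT = 2

    def word_prob(w, history, trigram_counts, bigram_counts, unigram_counts, total_words):
        # history is the pair of preceding words (w_minus2, w_minus1)
        w2, w1 = history
        if trigram_counts.get((w2, w1, w), 0) >= MIN_COUNT:
            return trigram_counts[(w2, w1, w)] / bigram_counts[(w2, w1)]
        if bigram_counts.get((w1, w), 0) >= MIN_COUNT:
            return bigram_counts[(w1, w)] / unigram_counts[w1]
        return unigram_counts.get(w, 0) / total_words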
So the number of trigrams and bigrams in the model, the ones we do see in the data, is very, very large.
Each one of these is a parameter of the model, a probability, so the number of parameters is very, very large and the language model might take a lot of memory.
Now we can see why a speech recognition system might need a lot of memory to run.
HVite, which we've used in the lab, is very naive and simplistic.
It loads the entire language model and expands it into the entire finite state network.
It pops HMMs in where the words are, and ends up with the most enormous HMM network of states.
That doesn't scale well to 6.7 million trigrams with lots of sub-word models pasted in; we'd just run out of memory.
So real systems don't do what HVite does.
They compile bits of the network dynamically, in a very complex way.
Okay, but it's a good way of thinking about it; it's the best way to understand it.
The language model is finite state, and we know our HMMs are finite state.
So we can just do this simple compiling, which is substituting the HMMs in where the words are, and then just do token passing over the result.
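As a toy illustration of that compilation (the data structures here are invented and far simpler than HTK's, and the "language model" is just a single fixed word sequence): substitute each word's phoneme HMMs in where the word sits, and you are left with one big finite-state network of HMM states.

    # Toy compilation of a recognition network: the dictionary maps words to phonemes,
    # and each phoneme HMM is represented here as just a list of state names.
    # All data structures are invented for illustration.
    dictionary = {"the": ["dh", "ax"], "cat": ["k", "ae", "t"]}
    phone_hmms = {p: [f"{p}_state{i}" for i in (1, 2, 3)]
                  for p in ("dh", "ax", "k", "ae", "t")}

    def expand_word(word):
        # Replace a word with the concatenated states of its phoneme HMMs.
        return [s for phone in dictionary[word] for s in phone_hmms[phone]]

    # A (very) small language model network: the single word sequence "the cat".
    word_network = ["the", "cat"]
    state_network = [s for word in word_network for s in expand_word(word)]
    print(state_network)
    # One long chain of HMM states; a real network would branch at every word
    # boundary, and token passing would then run over the whole thing.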
So don't think like this; you'll see diagrams like this in research papers and things.
They're okay for explaining how you implemented things and what your models were.
But they don't really help you understand the generative model paradigm, so don't get hung up on pictures like that.
Instead, think of pictures more like this.
This bit at the bottom doesn't quite fit our generative model paradigm; it's just some signal processing.
The speech signal kind of goes into the parameterisation, but after that, we see the arrows strictly going this way.
The sentence model generates words: that's the language model.
The word model generates phoneme names, and then we have models of those acoustic phones: that's the dictionary.
We basically just look the phoneme names up in a table and get a model for each, and that gives us HMMs.
An HMM has a sequence of states, and in the states are Gaussian probability density functions, and they generate observations.
And each of these is a sequence model: a sentence generates a sequence of words, a word generates a sequence of phonemes, a phoneme generates a sequence of HMM states, and an HMM state generates at least one, and probably a sequence of, observations.
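To close the loop on the "multiplying probabilities" idea, here is a toy sketch of that generative story from top to bottom; the dictionary, state parameters and self-loop probability are all invented, purely to show the structure of the chain.

    # Toy generative chain: sentence -> words -> phonemes -> HMM states -> observations.
    # Each level emits a sequence for the level above, and every choice along the way
    # has a probability; multiplying them all gives the probability of the utterance.
    import random

    dictionary = {"the": ["dh", "ax"], "cat": ["k", "ae", "t"]}        # invented
    state_params = {p: [(i * 1.0, 0.5) for i in (1, 2, 3)]             # (mean, stdev)
                    for p in ("dh", "ax", "k", "ae", "t")}

    def emit_from_state(mean, stdev, self_loop=0.6):
        # A state generates at least one observation, and with each self-loop
        # transition probably another, each drawn from the state's Gaussian.
        obs = [random.gauss(mean, stdev)]
        while random.random() < self_loop:
            obs.append(random.gauss(mean, stdev))
        return obs

    def generate(words):
        obs = []
        for word in words:                               # language model gave us the words
            for phone in dictionary[word]:               # dictionary: word -> phonemes
                for mean, stdev in state_params[phone]:  # phoneme -> HMM states
                    obs.extend(emit_from_state(mean, stdev))
        return obs

    print(len(generate(["the", "cat"])), "observation frames generated")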
