Hidden Markov Models for ASR

Intro to Hidden Markov Models, comparison to DTW

00:00–00:07 We've been on a journey towards the Hidden Markov Model, and finally we've arrived.
00:07–00:29 We started by talking about pattern matching, and we did that with an algorithm called dynamic time warping. We stored a template, called an exemplar: just a recorded example of some speech we'd like to recognize, perhaps an isolated word. We matched that against some unknown speech that we would like to label.
00:29–00:34 And we measured distance between the stored template and the unknown.
00:34–00:41 We then realized that this lacks any way of modeling the natural variability of speech.
00:41–00:48 The template is just a single stored example; it doesn't express how much we can vary from that example and still be within the same class.
00:48–00:58 So we decided to solve that by replacing the exemplar with a probabilistic model. We looked in our toolbox of probabilistic models and found the Gaussian as the only model that we can use.
00:58–01:01 It has the right mathematical properties and it's very simple.
01:01–01:11 We're going to use the Gaussian as the core of our probabilistic model, and that led us into a little bit of a detour to worry about the sort of features that we could model with a Gaussian.
01:11–01:21 And we spent some time doing feature engineering to come up with features called MFCCs that are suitable for modeling with diagonal covariance Gaussians.
01:21–01:41 So now, armed with the knowledge of dynamic time warping and its shortcomings, and knowing that we've got these excellent features called Mel Frequency Cepstral Coefficients, we're ready to build a probabilistic generative model of sequences of MFCC vectors, and that model is going to be the Hidden Markov Model.
01:41–01:45 We're going to have to understand what's hidden about it and what's Markov about it.
01:45–01:49 So, to orient ourselves in our journey, this is what we know so far.
01:49–01:52 We talked about distances between patterns.
01:52–01:57 We saw that there's an interesting problem there of aligning sequences of varying length.
01:57–02:06 We found a very nice algorithm for doing that, called dynamic programming, and an instantiation of that algorithm, dynamic time warping.
02:06–02:11 We could perform the algorithm of dynamic programming on a data structure called a grid.
02:11–02:15 And that's just a place to store partial distances.
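To make the grid concrete, here is a minimal sketch of dynamic programming on it, with Euclidean local distances. The function names and the simple three-way path constraint are my own illustrative choices, not code from the course:

```python
import math

# The grid D stores partial distances: D[i][j] is the cost of the best
# alignment of template frames 0..i with unknown frames 0..j.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(template, unknown):
    I, J = len(template), len(unknown)
    D = [[math.inf] * J for _ in range(I)]
    # End-pointed speech: the first frames are assumed to align.
    D[0][0] = euclidean(template[0], unknown[0])
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            # Best path into this cell from the allowed predecessors.
            best_prev = min(
                D[i - 1][j] if i > 0 else math.inf,
                D[i][j - 1] if j > 0 else math.inf,
                D[i - 1][j - 1] if i > 0 and j > 0 else math.inf,
            )
            D[i][j] = best_prev + euclidean(template[i], unknown[j])
    return D[I - 1][J - 1]
```

Each cell holds only the best partial distance so far, which is exactly why the grid is such a compact data structure.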
02:15–02:18 We're now going to invent the Hidden Markov Model.
02:18–02:21 We'll understand what's hidden about it.
02:21–02:27 It's its state sequence, and that's connected to the alignment between templates and unknowns.
02:27–02:31 The state sequence is a way of expressing the alignment.
02:31–02:44 And then we'll develop an algorithm for doing computations with the Hidden Markov Model that deals with this problem of the hidden state sequence. Its data structure is going to be called a lattice, and it's very similar to the grid in dynamic time warping.
02:44–02:46 Think of it as a generalization of the grid.
02:46–02:57 And so we're now going to perform dynamic programming on a different data structure, the lattice. That will be dynamic programming for Hidden Markov Models, and the algorithm gets a special name again.
02:57–03:00 It's called the Viterbi algorithm.
03:00–03:07 That's the connection between dynamic time warping, as we know it so far, and Hidden Markov Models, as we're about to learn them.
03:07–03:10 So we know that the single template in dynamic time warping doesn't capture variability.
03:10–03:17 There were many solutions proposed to that before someone invented the HMM: for example, storing multiple templates to express variability.
03:17–03:22 That's a reasonable way to capture variability: have a variety of natural exemplars.
03:22–03:28 The HMM is going to leverage that idea too, but it's not going to store multiple exemplars.
03:28–03:32 It's going to summarize their properties in statistics.
03:32–03:37 Essentially, that just boils down to the means and variances of Gaussians.
03:37–03:42 So although there are ways of dealing with variability in dynamic time warping, we're not going to cover them.
03:42–03:56 We're going to go straight into using statistical models. In some abstract feature space, let's draw a slice of cepstral feature space, say the 4th versus the 7th cepstral coefficient.
03:56–04:12 We're going to have multiple recorded examples of a particular class, and rather than store those, we're going to summarize them as a distribution, which has a mean and a standard deviation along each dimension, or a variance if you prefer.
04:12–04:15 So here's what we're going to cover in this module, module 8.
04:15–04:19 We already know about the multivariate Gaussian as a generative model.
04:19–04:26 A multivariate Gaussian generates vectors, and we're going to start calling them observations because they're emitted from this generative model.
04:26–04:33 We know that speech changes its properties over time, so we're going to need a sequence of Gaussians to generate speech.
04:33–04:37 So we need a mechanism to do that sequencing, and that amounts to dealing with duration.
04:37–04:42 And we're going to come up with a very simple mechanism for that, which is a finite state network.
04:42–04:49 And so when we put these Gaussian generative models into the states of a finite state network, we get the Hidden Markov Model.
04:49–04:58 We're going to see that when the model generates data, it could take many different state sequences to do that, and that's why we say the state sequence is hidden.
04:58–05:01 It's unknown to us, and we have to deal with that problem.
05:01–05:08 One way of doing that is to draw out a data structure in which the Hidden Markov Model is replicated at every time instant; that's called a lattice.
05:08–05:15 We're also going to see that we can perform dynamic programming directly on a data structure that is the Hidden Markov Model itself, and that's an implementational difference.
05:15–05:26 Both of those are dynamic programming for the Hidden Markov Model, and both ways of doing things, either on the lattice or directly on the Hidden Markov Model itself, are the Viterbi algorithm.
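As a preview, Viterbi dynamic programming over the lattice can be sketched like this. The function signature and the toy log-probability inputs are assumptions of mine, not the course's notation:

```python
import math

def viterbi(log_trans, log_emit):
    """Best log probability over all state sequences.
    log_trans[i][j]: log probability of transitioning from state i to j.
    log_emit[t][j]:  log probability of state j emitting observation t.
    Assumes a left-to-right model that starts in state 0 and must end
    in the last state."""
    T, N = len(log_emit), len(log_emit[0])
    # One column of the lattice per time step, one cell per state.
    best = [[-math.inf] * N for _ in range(T)]
    best[0][0] = log_emit[0][0]
    for t in range(1, T):
        for j in range(N):
            for i in range(N):
                cand = best[t - 1][i] + log_trans[i][j] + log_emit[t][j]
                if cand > best[t][j]:
                    best[t][j] = cand
    return best[T - 1][N - 1]
```

The inner `max` over predecessor states is the same "keep only the best partial path" step as in dynamic time warping; only the data structure and the local score have changed.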
05:26–05:33 When we finish all of that, we'll finally remember that we don't yet have a way of estimating the parameters of the model from data.
05:33–05:38 So for everything here in module 8, we're going to assume that someone's given us the model: some pre-trained model.
05:39–05:41 We don't have the algorithm for training the model yet.
05:41–05:49 So at the very end, we'll make a first step towards that, and we'll just remind ourselves how we can train a single Gaussian on some data.
05:49–05:56 We might call that fitting the Gaussian to the data, or estimating the mean and variance of the Gaussian given some data.
05:56–06:06 And we'll see that that is easy, but the hidden state sequence property of the HMM makes it a little harder to do for a Hidden Markov Model and a sequence of observations.
06:06–06:08 But that's coming in the next module.
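That single-Gaussian training step is simple enough to sketch now. This 1-D version (my own illustration, not the course's code) just computes the sample mean and the maximum-likelihood variance:

```python
def fit_gaussian(data):
    """Estimate the mean and variance of a 1-D Gaussian from data:
    the 'fitting the Gaussian to the data' step."""
    n = len(data)
    mean = sum(data) / n
    # Maximum-likelihood estimate divides by n (not n - 1).
    variance = sum((x - mean) ** 2 for x in data) / n
    return mean, variance
```

What makes the HMM harder is that we don't know which state's Gaussian each observation belongs to, so we can't directly form these per-state sums.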
06:08–06:15 So we know already that a single Gaussian, a multivariate Gaussian, can generate feature vectors, and we're going to call them observations.
06:15–06:18 That's just to remind ourselves that they are generated by a model.
06:18–06:20 Let's just make sure we fully understand that.
06:20–06:23 Again, I'll draw some part of our feature space.
06:23–06:29 I can't draw in 12-dimensional MFCC space or higher, so I'm always going to draw just two of the dimensions.
06:29–06:32 Any two; it doesn't matter.
06:32–06:35 Two of the MFCC coefficients.
06:35–06:37 How does a Gaussian generate?
06:37–06:42 Well, the Gaussian has a mean, and a mean is just a point in the space.
06:42–06:45 And we have a variance along each of the feature dimensions.
06:45–06:53 We're going to assume there's no covariance, and so there's some variance: a standard deviation in this direction, and perhaps some in this direction.
06:53–06:59 So one way of drawing that is to draw a contour line, one standard deviation from the mean.
06:59–07:00 Like that.
07:00–07:02 This model can generate data points.
07:02–07:04 What does that mean?
07:04–07:09 It means we can randomly sample from the model, and it will tend to generate samples nearer the mean.
07:09–07:14 And just how spread out they are from the mean is governed by the variance parameter.
07:14–07:26 So if I press the button on the model to randomly sample, it generates a data point, or another data point, or another data point, and just occasionally a data point far from the mean.
07:26–07:33 More often than not, they're near the mean, and so it generates data points that have this Gaussian distribution.
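"Pressing the button" can be sketched as random sampling from a diagonal-covariance Gaussian. The means, standard deviations, and seed below are arbitrary toy values of mine:

```python
import random

def sample(mean, stdev, rng):
    # With no covariance, each dimension is an independent 1-D draw.
    return [rng.gauss(m, s) for m, s in zip(mean, stdev)]

rng = random.Random(0)
# 1000 presses of the button: points cluster around the mean [1.0, -2.0],
# with spread governed by the standard deviations.
points = [sample([1.0, -2.0], [0.5, 0.1], rng) for _ in range(1000)]
```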
07:38–07:41 So that's the model being a generative model.
07:41–07:42 But we're not doing generation.
07:42–07:44 We're not doing synthesis of speech.
07:44–07:46 We're doing recognition of speech.
07:46–07:50 So what does it mean when we say there's a generative model?
07:50–08:00 Well, it says we are pretending, we're assuming, that the feature vectors extracted from the speech coming into our speech recognizer were really generated by a Gaussian.
08:00–08:07 We just don't know which Gaussian, and the job of the classifier is to determine which of the Gaussians is the most probable one to have generated a feature vector.
08:07–08:11 Now, of course, the feature vectors weren't generated by a Gaussian.
08:11–08:14 They're extracted from speech, and the speech was generated by the human vocal tract.
08:14–08:16 But we're making an assumption.
08:16–08:22 We're assuming that their distribution is Gaussian, and that we can model it with a generative model that has a Gaussian distribution.
08:22–08:23 So it's an assumption.
08:23–08:31 Now, since these blue data points were extracted from natural speech, and what we're trying to do is decide whether they were generated by this Gaussian, we don't really generate from the Gaussian.
08:31–08:33 We just take the Gaussian's parameters.
08:33–08:39 We take one of these data points, and we compute the probability that this observation came from this Gaussian.
08:39–08:46 And we just read that off the Gaussian curve as the probability density, and we'll take that as proportional to the probability.
08:46–08:57 So the core operation that we're going to do with our Gaussian generative model whilst doing speech recognition is to take a feature vector and compute the probability that it was emitted by a particular Gaussian.
08:57–08:59 And that's what the Gaussian equation tells us.
08:59–09:00 It's the equation of the curve.
09:00–09:04 That's the equation that describes probability density in one dimension.
09:04–09:07 We've got the Gaussian distribution.
09:07–09:13 Along comes some data point, and we just read off the value of probability density.
09:13–09:14 It's that simple.
09:14–09:30 When we do that in multiple dimensions with an assumption of no covariance, we'll just do the one-dimensional case separately for every dimension and multiply all the probability densities together to get the probability of the observation vector.
09:30–09:34 And that's an assumption of independence between the features in the vector.
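That per-dimension multiplication can be sketched directly. One detail that is mine, not the transcript's: implementations usually sum log densities rather than multiplying densities, to avoid numerical underflow.

```python
import math

def log_density(obs, mean, var):
    """Log probability density of one observation vector under a
    diagonal-covariance Gaussian: the product of 1-D densities
    becomes a sum of 1-D log densities."""
    total = 0.0
    for x, m, v in zip(obs, mean, var):
        # 1-D Gaussian log density for this dimension.
        total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
    return total
```

Reading the density "off the curve" is just evaluating this for a given observation: density is highest at the mean and falls off with squared distance, scaled by the variance.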
09:34–09:42 So that mechanism of the multivariate Gaussian as a generative model is going to be right at the heart of the HMM that we're about to invent.
09:42–09:47 It solves the problem of the Euclidean distance measure being too naive.
09:47–09:49 How does it solve it?
09:49–09:54 Because it learns the amount of variability in each dimension of the feature vector and captures that as part of the model.
09:54–09:58 The Euclidean distance measure only knows about distance from the mean.
09:58–10:00 It doesn't have a standard deviation parameter.
10:00–10:04 So let's cast our minds back to dynamic time warping and just remember how it works.
10:04–10:08 This is the unknown, and that's the time running through the unknown.
10:08–10:10 This is a template that we've stored.
10:10–10:16 We'll assume that the first frames of each always align with each other, because the speech has been carefully end-pointed.
10:16–10:17 There's no leading silence.
10:17–10:26 So we'll align this frame with this frame, and then dynamic time warping just involves finding a path to here that sums up local distances along the way.
10:26–10:29 And those local distances are Euclidean distances here.
10:29–10:33 So the Gaussian is going to solve the problem of the local distance.
10:33–10:40 Let's take a local distance in this cell, between this vector and this vector, and it's going to capture variability.
10:40–10:44 Storing a single frame of an exemplar doesn't capture variability.
10:44–10:46 That's the main job of the Gaussian.
10:46–10:48 But there's another problem with templates.
10:48–10:51 A template doesn't capture variability in duration either.
10:51–10:56 There's a single example, an arbitrarily chosen recording of a digit, say.
10:56–10:59 Maybe this is a recording of me saying "three".
10:59–11:01 The duration of that is arbitrary.
11:01–11:05 It doesn't model the variation in duration we might expect.
11:05–11:07 The template might also be a little bit redundant.
11:07–11:11 It might be that all of these frames sound very similar.
11:11–11:15 They're the vowel at the end of the word "three", and we don't need to store three frames for that.
11:15–11:20 We could just store one, or even better, we could store the mean and the variance of those frames.
11:20–11:28 So we're going to get rid of the stored exemplar, and we're going to get to choose the temporal resolution of the thing that's replacing it.
11:28–11:32 The thing that's replacing it is going to be not an example, but a model.
11:32–11:39 And so my exemplar had seven frames, but maybe I don't need seven frames to capture the word "three".
11:39–11:43 Maybe it's made of just three different parts, three different sounds.
11:43–11:51 So at the same time as getting rid of the exemplar, I'm going to gain control over the temporal resolution, the granularity, of my model.
11:51–11:54 And for now, let's choose three.
11:54–11:56 This is a design choice.
11:56–12:06 We're going to get to choose this number here, and we can choose it on any basis we like, but probably we should choose it to give the best recognition accuracy, by experimentation.
12:06–12:16 So where I used to have frames of an exemplar, I'm now going to instead store their average and variance.
12:16–12:19 And I'm going to do that separately for each of the three parts.
12:19–12:30 So for this part here, I'm going to store a multivariate Gaussian in the same dimensions as the feature space, and it's going to store a mean and a variance.
12:30–12:33 And for the purpose of this module, I'm not going to worry about where they come from.
12:33–12:40 I'm just going to assume that we will later devise some mechanism, some algorithm, for training this model.
12:40–12:45 We're always going to assume, for the rest of this module, that the model has been pre-trained: it was given to us.
12:45–12:49 So instead of storing an exemplar, here I'm going to store a Gaussian.
12:49–12:54 I'm just going to draw one in one dimension, because I can't draw in this many dimensions.
12:54–13:02 It's going to be a mean and a standard deviation, or we could store the variance.
13:02–13:05 For the middle part, I'm going to store another one.
13:05–13:12 It's going to have a potentially different mean and a different standard deviation.
13:12–13:20 And the final one, the third one, will have another mean and another standard deviation.
13:20–13:31 So where there used to be a sequence of feature vectors for each of these three parts, there is now a statistical description of what that part of the word sounds like.
13:31–13:40 So this axis is the model, and the model is in three parts: it models the beginning of the word, the middle of the word, and the end of the word.
13:40–13:43 And we'll just leave it like that, a little abstract.
13:43–13:53 Certainly we're not claiming that these are going to be phonemes, and we've already said that it doesn't have to be three: it's a number of our choosing, and we'll worry about how to choose it another time.
13:53–13:58 So the model so far is a sequence of three Gaussian distributions.
13:58–14:02 I've drawn them in one dimension, but in reality, of course, they're multivariate.
14:1514:18 The duration of the unknown is going to be variable.
14:1814:21 Every time we do recognition, that might be a different number.
14:2114:29 As we'll see when we develop the model, the length of the observation sequence here is generally going to be longer than the number of parts in the model.
14:2914:38 And so we've got to decide whether this Gaussian generates just this frame, or the first two, or maybe the first three.
14:3814:43 And then how do we move up to the next Gaussian to start generating the next frames?
14:4314:47 And when do we move up to the next Gaussian to start generating the remaining frames?
14:4714:55 So we need some mechanism to decide when to transition from the beginning of a word to a middle of a word, and from the middle of the word to the end of the word.
14:5514:59 The simplest mechanism that I know to do that is a finite state network.
14:5915:01 So let's draw a finite state network.
15:0115:12 We're going to be able to transition between the states, and these transitions say that we have to go from left to right through the model.
15:1215:13 We don't go backwards.
15:1315:14 Speech doesn't reverse.
15:1415:20 And we now need a mechanism that says that you can, from each of these states, generate more than one observation.
15:2015:22 Remember, these are the observations.
15:2215:29 And that's just going to be a self-transition on the state.
15:2915:38 And inside the states are the multivariate Gaussians that I had before, with their means and variances.
15:3815:51 So we started with a generative model of a single feature vector, a multivariate Gaussian, and we've put it into a sequencing mechanism, a finite state network, that simply tells us in what order to use our various Gaussians.
15:5116:02 And we have some transitions between those states that controls that, and we have some self-transitions on the state that allow us to emit more than one observation from the same Gaussian.
16:0216:06 So those transitions are our model of duration, and we just invented the Hidden Markov Model.
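The model we've just assembled can be sketched as a generator. The three means, standard deviations, and self-transition probability below are toy values I've chosen for illustration, not anything from the course:

```python
import random

def generate(means, stdevs, self_loop, rng):
    """Left-to-right HMM with one diagonal Gaussian per state.
    Each time step emits one observation from the current state's
    Gaussian; with probability self_loop we take the self-transition
    and stay (emitting more observations from the same Gaussian),
    otherwise we move to the next state."""
    obs, state = [], 0
    while state < len(means):
        obs.append([rng.gauss(m, s) for m, s in zip(means[state], stdevs[state])])
        if rng.random() >= self_loop:
            state += 1   # left-to-right: never backwards
    return obs

rng = random.Random(1)
means  = [[0.0], [5.0], [10.0]]   # beginning / middle / end of the word
stdevs = [[0.1], [0.1], [0.1]]
seq = generate(means, stdevs, self_loop=0.6, rng=rng)
```

Running this repeatedly gives observation sequences of different lengths from the same three-state model, which is exactly the duration variability the self-transitions are there to provide.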
