recognising words
November 22, 2017 at 17:50 #8522
I’m a little bit unclear about the steps involved in recognising a word. Last week we learned about token passing, but I am not sure where this gets used.
Here’s what I think so far:
– let’s say our system has two models: A (for “cat”) and B (for “bat”)
– the input speech is “bat” (a sequence of MFCCs will be extracted from it)
– model A generates a sequence of MFCCs
I’m not sure what happens after that; the lecture slide pack doesn’t make sense to me.
November 22, 2017 at 18:30 #8523
The missing component in your explanation is the language model. This is what connects the individual word models into a single network (like the one in the “Token Passing game” we played in class).
The language model and all the individual HMMs of words are compiled together into a single network. This recognition network is also an HMM, just with a more complicated topology than the individual word HMMs.
Because the recognition network is just an HMM, we can perform Token Passing to find the most likely path through it that generated the given observation sequence.
The tokens will each keep a record of which word models they pass through. Then, we can read this record from the winning token to find the most likely word sequence.
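Here’s a rough sketch of that idea in Python (not code from any real toolkit – the little two-word network, the transition probabilities and the emission function are all invented): each token carries a log probability and a word history, word-end arcs append a word to that history, and we read the answer off the best token that reaches the end state.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Token:
    log_prob: float
    word_history: list = field(default_factory=list)

# Hypothetical recognition network: states 0..3, arcs as
# (from_state, to_state, log transition prob, word label or None).
# A word label on an arc means "record this word when a token crosses it".
arcs = [
    (0, 1, math.log(0.5), None),   # enter the model for "cat"
    (0, 2, math.log(0.5), None),   # enter the model for "bat"
    (1, 1, math.log(0.6), None),   # self-loop inside "cat"
    (1, 3, math.log(0.4), "cat"),  # leave "cat" (word-end arc)
    (2, 2, math.log(0.6), None),   # self-loop inside "bat"
    (2, 3, math.log(0.4), "bat"),  # leave "bat" (word-end arc)
]

def emission_log_prob(state, observation):
    # Placeholder: in a real system this would be a Gaussian (or GMM) score
    # for an MFCC vector, not a 1-D toy.
    return -abs(observation - state)

def token_passing(observations, start_state=0, end_state=3):
    # One (best) token per state; start with a single token in the start state.
    tokens = {start_state: Token(log_prob=0.0)}
    for obs in observations:
        new_tokens = {}
        for (src, dst, log_a, word) in arcs:
            if src not in tokens:
                continue
            tok = tokens[src]
            score = tok.log_prob + log_a + emission_log_prob(dst, obs)
            history = tok.word_history + [word] if word else tok.word_history
            # Viterbi approximation: keep only the best token in each state.
            if dst not in new_tokens or score > new_tokens[dst].log_prob:
                new_tokens[dst] = Token(score, history)
        tokens = new_tokens
    return tokens.get(end_state)

winner = token_passing([1.0, 1.1, 2.9])
if winner:
    print(winner.word_history, winner.log_prob)
```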
November 22, 2017 at 18:55 #8524
So I understand that the language model is just all the HMMs connected together.
So going back to my example:
            --------- HMM for cat ---------
start ___/                                 \___ end
           \                               /
            --------- HMM for bat ---------
I put a token at start and send a token along each model.
For each model, is only one observation generated? Or does another token passing process occur in the model to find the most likely observation?
November 22, 2017 at 19:08 #8525
The language model is not quite the same as “all the HMMs connected together”.
The language model, on its own, is a generative model that generates (or if you prefer emits) words.
The language model and the acoustic models (the HMMs of words) are combined – we usually say compiled – into a single network. Some arcs in this recognition network come from the language model, others come from the acoustic models.
We can only compile the language model and acoustic models if they are finite state. Any N-gram language model can be written as a finite state network. That’s the main reason that we use N-grams (rather than some other sort of language model).
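To make “an N-gram can be written as a finite state network” concrete, here is a toy sketch in Python of a bigram over just “cat” and “bat” (the probabilities are made up): one state per word history, one arc per bigram. During compilation, each arc that emits a word would be expanded by substituting in that word’s HMM.

```python
import math

# Toy bigram language model over the vocabulary {"cat", "bat"},
# with invented probabilities: P(next word | previous word).
bigram = {
    "<s>": {"cat": 0.6, "bat": 0.4},
    "cat": {"bat": 0.5, "</s>": 0.5},
    "bat": {"cat": 0.3, "</s>": 0.7},
}

# Written as a finite state network: one state per word history (for a
# bigram, the history is just the previous word) and one arc per bigram,
# weighted by its log probability.
states = list(bigram.keys()) + ["</s>"]
arcs = [
    (prev, word, math.log(p))
    for prev, successors in bigram.items()
    for word, p in successors.items()
]

print("states:", states)
for (src, word, log_p) in arcs:
    dst = word if word in bigram else "</s>"
    print(f"{src:>4} --[{word}, log P = {log_p:.2f}]--> {dst}")
```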
November 22, 2017 at 22:36 #8529
I have one more question. In slide 55 of the slide pack, it says that once we have the sequence of feature vectors for an unknown word, each model generates this sequence. What I don’t understand is, if the unknown word is “car”, how can a model for “cat” generate the sequence for “car”?
November 23, 2017 at 07:48 #8530
But we don’t know which word our sequence of feature vectors corresponds to. This is what we are trying to find out.
So, we can only try generating it with every possible model (or every possible sequence of models, in the case of connected speech), and search for the one that generates it with the highest probability.
Because we are using Gaussian pdfs, any model can generate any observation sequence. The model of “cat” can generate an observation sequence that corresponds to “car”. But, if we have trained our models correctly, then it will do so with a lower probability than the model of “car”.
A Gaussian pdf assigns a non-zero probability to any possible observation – the long “tails” of the distribution never quite go down to zero. The probability of observations far away from the mean becomes very small, but never zero.
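Here’s a quick numerical illustration in Python (with made-up one-dimensional means and variances standing in for real multivariate Gaussians over MFCCs): an observation that really “belongs” to the “car” model still gets a non-zero density from the “cat” model, just a much smaller one.

```python
import math

def gaussian_log_pdf(x, mean, var):
    # Log density of a univariate Gaussian: finite (never -infinity) for any x.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Invented one-dimensional stand-ins for two emission pdfs.
cat_mean, cat_var = 0.0, 1.0
car_mean, car_var = 3.0, 1.0

observation = 2.8  # an observation that really came from "car"

print(f"log p(o | cat) = {gaussian_log_pdf(observation, cat_mean, cat_var):.2f}")  # small, but not -infinity
print(f"log p(o | car) = {gaussian_log_pdf(observation, car_mean, car_var):.2f}")  # much larger
```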