Acoustic Model and Language Model
- This topic has 5 replies, 3 voices, and was last updated 7 years, 1 month ago by Simon.
-
November 7, 2017 at 15:41 #8283
I don’t think I understand what these are.
From Jurafsky and Martin, I read that these models calculate some probabilities. The acoustic model has something to do with the input waveforms. The language model predicts how likely a word is in a sentence?
Please can someone clearly explain what these models represent.
-
November 7, 2017 at 21:47 #8285
The best route to understanding this is first to understand Bayes’ rule.
If W is a word sequence and O is the observed speech signal:
The language model represents our prior beliefs about what sequences of words are more or less likely. We say “prior” because this is knowledge that we have before we even hear (or “observe”) any speech signal. The language model computes P(W). Notice that O is not involved.
When we use a generative model, such as an HMM, as the acoustic model, it computes the probability of the observed speech signal given a possible word sequence. This is called the likelihood and is written P(O|W).
Neither of those quantities is what we actually need, if we are trying to decide what was said. We actually want to calculate the probability of every possible word sequence (so we can choose the most probable one), given the speech signal. This quantity is called the posterior, because we can only know its value after observing the speech, and is written P(W|O).
Bayes’ rule tells us how we can combine the prior and the likelihood to calculate the posterior – or at least something proportional to it, which is good enough for our purposes of choosing the value of W that maximises P(W|O).
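Written out in symbols, the argument above looks roughly like this. The notation is a sketch: the hat on W for "the chosen word sequence" is an assumption, not something from the textbook passage being discussed. The last step holds because P(O) does not depend on W, so dividing by it cannot change which W wins.

```latex
\begin{align*}
\hat{W} &= \operatorname*{arg\,max}_{W} P(W \mid O) \\
        &= \operatorname*{arg\,max}_{W} \frac{P(O \mid W)\, P(W)}{P(O)} \\
        &= \operatorname*{arg\,max}_{W} P(O \mid W)\, P(W)
\end{align*}
```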
You might think this is rather abstract and conceptually hard. You’d be right. Developing both an intuitive and formal understanding of probabilistic modelling takes some time.
-
November 26, 2017 at 23:31 #8563
Where does the prior come into ASR? So far, I thought we were just comparing likelihoods, which I assume are proportional to the posteriors if we assume that all the priors are the same (e.g., each word is equally probable in the digit recognizer).
– I guess it is not hard to integrate priors into the language model, based on word frequency.
– For the word model, if there are alternate pronunciations, the prior may be all we can go by?
– But how to include priors into the phone/acoustic model? Or do they not play a role (since each phone or subphone has just 1 model), and we assume likelihood = posterior?
-
November 27, 2017 at 09:47 #8565
The language model computes the prior, P(W). If you like, we might say that the language model is the prior. It’s called the prior because we can calculate it before observing O.
In the isolated digit recogniser, P(W) is never actually made explicit, because it’s a uniform distribution. But you can think of having P(W=w) = 0.1 for all words w.
The acoustic model computes the likelihood, P(O|W).
We combine them, using Bayes’ rule, to obtain the posterior P(W|O); we ignore P(O), because it is the same for every candidate W and so only acts as a constant scaling factor.
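To make the combination concrete, here is a minimal Python sketch for the isolated digit case. Everything in it is illustrative: the log-likelihood values are made up (they stand in for what ten digit HMMs might output for one utterance), and the names `recognise`, `log_posteriors`, `UNIFORM` and `SKEWED` are hypothetical. It works in the log domain, so multiplying prior and likelihood becomes adding.

```python
import math

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

# Made-up values of log P(O|W=w) for a single utterance (assumption, not real data).
LOG_LIKELIHOODS = dict(zip(DIGITS, [-120.0, -101.0, -115.0, -118.0, -110.0,
                                    -100.5, -125.0, -112.0, -119.0, -102.0]))

def recognise(log_likelihoods, log_prior):
    """Pick the w that maximises log P(O|W=w) + log P(W=w); P(O) is ignored."""
    return max(log_likelihoods, key=lambda w: log_likelihoods[w] + log_prior[w])

def log_posteriors(log_likelihoods, log_prior):
    """Return log P(W=w|O) for every w, this time including the P(O) term."""
    joint = {w: log_likelihoods[w] + log_prior[w] for w in log_likelihoods}
    # log P(O) = log sum_w P(O|w) P(w), computed stably (log-sum-exp).
    peak = max(joint.values())
    log_evidence = peak + math.log(sum(math.exp(s - peak) for s in joint.values()))
    return {w: s - log_evidence for w, s in joint.items()}

# Uniform prior: P(W=w) = 0.1 for all ten digits.
UNIFORM = {w: math.log(0.1) for w in DIGITS}
# A hypothetical non-uniform prior in which "one" is much more common.
SKEWED = {w: math.log(0.55 if w == "one" else 0.05) for w in DIGITS}

ml_choice = max(LOG_LIKELIHOODS, key=LOG_LIKELIHOODS.get)
uniform_choice = recognise(LOG_LIKELIHOODS, UNIFORM)
skewed_choice = recognise(LOG_LIKELIHOODS, SKEWED)

print("maximum likelihood:", ml_choice)
print("uniform prior:     ", uniform_choice)
print("skewed prior:      ", skewed_choice)
```

With the uniform prior the winner is the same as simply comparing likelihoods, which is why the prior never needs to be made explicit in the digit recogniser. With the skewed prior the decision can change, even though the acoustic scores are identical.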
Now, to incorporate alternative pronunciation probabilities, we’d need to introduce a new random variable to our equations, and decide how to compute it. Try for yourself…
-
November 27, 2017 at 20:45 #8588
So what is the language model computed from? Another corpus? I thought the transitions between words were also learned during training…
-
November 27, 2017 at 20:54 #8589
An n-gram language model would be learned from a large text corpus. The simplest method is just to count how often each word follows other words, and then normalise to probabilities.
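As a rough illustration of "count, then normalise", here is a small Python sketch of a bigram model. It is not the course's method or any toolkit's API: the function name `train_bigram`, the three-sentence toy corpus, and the `<s>` / `</s>` sentence boundary markers are all assumptions made for the example, and there is no smoothing, so unseen word pairs simply get no entry.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate P(word | previous word) by relative frequency (no smoothing)."""
    bigram_counts = defaultdict(Counter)  # bigram_counts[prev][word] = count
    history_counts = Counter()            # how often prev was followed by anything
    for sentence in sentences:
        # <s> and </s> mark sentence start and end (an assumed convention).
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[prev][word] += 1
            history_counts[prev] += 1
    # Normalise each row of counts so that it sums to 1.
    return {
        prev: {word: count / history_counts[prev] for word, count in followers.items()}
        for prev, followers in bigram_counts.items()
    }

# Toy corpus, purely for illustration.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram(corpus)

print(model["the"])  # probabilities of each word that followed "the"
print(model["cat"])  # probabilities of each word that followed "cat"
```

In this toy corpus "the" is followed by "cat" twice and "dog" once, so the estimates for that history are 2/3 and 1/3.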
In general, we don’t train the language model only on the transcripts of the speech we are using to train the HMMs. We usually need a lot more data than this, and so train on text-only data. This is beyond the scope of the Speech Processing course, where we don’t actually cover the training of n-gram language models.
We just need to know how to write them in a finite state form, and then use them for recognition.
In the digit recogniser assignment, the language model is even simpler than an n-gram and so we write it by hand.