Forum Replies Created
In the specific case of a left-to-right HMM topology, you are right that the first and last observations in any observation sequence must indeed be emitted from the first and last states, respectively.
But, we still only know about some part of the state sequence, and the complete state sequence remains unknown to us: it is still a hidden random variable. It’s just that the distribution of this hidden random variable is ever so slightly restricted.
In the general case of an arbitrary model topology and an observation sequence that is longer than the shortest path through that model, this is not the case. But, even in this general case, we still know something about possible values of the hidden state sequence. Any state sequences that are shorter or longer than the observation sequence have zero probability, and non-zero-probability values of the state sequence are restricted to those of exactly the right length.
Let’s separate out a few different aspects of this.
The storage space required for a covariance matrix is [latex]O(D^2)[/latex], where D is the dimensionality of the observation vector.
The computational cost can be worked out by looking at the vector-matrix-vector multiplication in the formula for a multivariate Gaussian – can you work that out?
But the real issue is the large number of model parameters — which is also [latex]O(D^2)[/latex] — that need to be estimated from data.
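For reference, here is the multivariate Gaussian density (writing [latex]o[/latex] for the observation vector): the vector-matrix-vector multiplication in question is the quadratic form [latex](o-\mu)^\top \Sigma^{-1} (o-\mu)[/latex] inside the exponent, so counting the multiplications it needs will answer the question above.
[latex]
\mathcal{N}(o; \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(o-\mu)^\top \Sigma^{-1} (o-\mu)\right)
[/latex]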
No, in general Automatic Speech Recognition (ASR) systems have a fixed vocabulary and therefore, as you correctly state, any out-of-vocabulary (OOV) words would be recognised as similar-sounding in-vocabulary words (or, more likely, as a sequence of short words).
It is possible to build an open-vocabulary system, but this is somewhat unusual. The vast majority of ASR systems that you will read about in the literature have a fixed vocabulary.
The non-emitting (also known as ‘dummy’) start and end states are there to avoid having to separately specify two sets of parameters: the probabilities of starting in each state, and of ending in each state. Using the non-emitting states allows us to write those parameters on transitions out of the starting non-emitting state, and into the ending non-emitting state. Some textbooks do not use the non-emitting states, and therefore the parameters of the model must also include these starting and ending state distributions. That’s messy, and easily avoided by doing things ‘the HTK way’.
In a left-to-right model, there will typically be only one transition from the starting non-emitting state: it goes to the first emitting state, with probability 1. But we could have additional transitions if we wished: this would allow the model to emit the first observation from one of the other states.
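To make that concrete, here is an illustrative transition matrix (my own toy numbers) for a 5-state left-to-right model in which states 1 and 5 are the non-emitting start and end states. Row 1 encodes the ‘probability of starting in each state’ (here, probability 1 of entering the first emitting state, state 2), and the last column encodes the ‘probability of ending’ from each emitting state:
[latex]
A = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0.6 & 0.4 & 0 & 0 \\ 0 & 0 & 0.6 & 0.4 & 0 \\ 0 & 0 & 0 & 0.7 & 0.3 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}
[/latex]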
All those dummy states are still there in the compiled model (language model + acoustic models) – they are joined up with arcs, and on those arcs are the language model probabilities.
The language model on its own is not an HMM – it’s just a finite state machine.
Your question is mainly about what the word models look like. Yes, we could indeed use a single emitting state per word. Then, we would be modelling just the average spectral envelope across the entire word duration. That may well be enough to distinguish words in a very small vocabulary system (e.g., isolated digits), but is a rather naive model.
Using more states per word gives us a finer temporal granularity. For example, 3 emitting states per word allows the modelling of (roughly speaking) the beginning, middle and end sounds of that word. Such a model is probably a better generative model of that word, and conversely a worse generative model of other words, so should lead to more accurate recognition.
In larger systems, we use one model per sub-word unit (e.g., phone-sized units) and then of course we will have multiple states per word.
Try it for yourself by experiment – it’s easy to vary the number of emitting states in a whole-word system. You’ll probably want to do such an experiment on a reasonably large multi-speaker dataset in order to get reliable results.
That versus which
Examples:
1. The bicycle that I saw yesterday was red.
2. The bicycle, which I saw yesterday, was red.
This one is simple. If the part starting with ‘that’ or ‘which’ can be deleted and you still have a sentence that means the same thing, then it’s optional and you should use ‘which’. If you can’t delete it, then it’s obligatory and you should use ‘that’.
In Example 1, I am distinguishing between the bicycle that I saw yesterday and some other possible bicycles (perhaps one that I saw today). In Example 2, there is only one possible bicycle that I could be talking about. I could have not told you about seeing it yesterday and would still have communicated the same meaning: that it is red.
Another way to make the distinction is that clauses with ‘that’ are restrictive: they narrow down the scope of what you are talking about. Clauses with ‘which’ just add optional extra information without doing that: they are ‘nonrestrictive’.
Because the ‘which’ version is optional information, you will usually want to put some commas around it, as in the second example above.
As usual, Grammar Girl explains it well.
Less versus fewer
The traditional answer is that fewer is for countable things (sheep, people, days, apples,…) and less is for things you can’t count (water, excitement, pain, work, …). That is still my default answer and is the safe choice.
Grammar Girl is usually a good source for this sort of writing information, and there you’ll find some useful exceptions to the ‘is it countable?’ rule.
In reality, the distinction between less and fewer is not quite so clear and you can argue either way.
A general rule
When writing scientifically, the last thing you want is your reader fixating on your English usage instead of the actual content that you are trying to communicate. So, play it safe and follow conventions. If in doubt, find another construction that avoids tricky word choices that you might get wrong.
What is a stochastic model?
The term stochastic is the opposite of deterministic. We could also use words like ‘random’ or ‘probabilistic’ instead of ‘stochastic’.
An HMM is stochastic because the state sequence is a random variable: we simply don’t know what value it took, when the model generated a particular observation sequence.
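Here is a toy sketch (entirely my own illustration, with made-up parameters) of what that means in practice: generating from an HMM uses randomness twice per time step, once for the emission and once for the transition, so the state sequence really is a random variable.

import numpy as np

rng = np.random.default_rng()
a = np.array([[0.6, 0.4, 0.0],    # a simple left-to-right transition matrix
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])   # the last state loops, to keep the sketch short
means = [0.0, 5.0, 10.0]          # one 1-D Gaussian (unit variance) per state

state, states, observations = 0, [], []
for _ in range(10):
    states.append(state)
    observations.append(rng.normal(means[state], 1.0))  # random emission
    state = rng.choice(3, p=a[state])                    # random transition

print(states)  # the hidden state sequence: typically different on every run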
Summing over all values of Q versus taking only the single most likely value
What we are summing up are the individual probabilities of the observation sequence and one particular state sequence (Q takes on one of its many possible values), given the model W, which is p(O,Q|W). That’s not the value we need. We want p(O|W). We don’t care about Q – it’s a hidden random variable.
The quantity that we are computing by summing over (i.e., integrating away) Q is the total conditional probability of the observation sequence O given the model W, that is p(O|W). This is also known as the likelihood. Here’s a mathematical statement of integrating away Q that shows why we need to sum over all possible state sequences to get the value we really want:
[latex]
p(O|W) = \sum_Q p(O,Q|W)
[/latex]
Note that I’ve been using a small ‘p’ everywhere because we are using Gaussians, and so these are not actually probabilities, but probability densities.
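To make the sum concrete, here is a minimal sketch of the Forward algorithm (my own illustration, not HTK code; all variable names are assumptions), which computes this sum efficiently without enumerating every state sequence:

import numpy as np

# a[i, j] is the transition probability from state i to state j,
# b[j, t] is the probability (or density) of emitting observation o_t
# from state j, and pi is the initial state distribution.
def forward(a, b, pi):
    J, T = b.shape
    alpha = np.zeros((J, T))
    alpha[:, 0] = pi * b[:, 0]                # initialise at t = 0
    for t in range(1, T):
        # sum over all predecessor states, not just the best one
        alpha[:, t] = (alpha[:, t - 1] @ a) * b[:, t]
    return alpha[:, -1].sum()                 # p(O|W), if any state may be final

In practice this is computed in the log domain, to avoid numerical underflow.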
HMMs actually compute the likelihood p(O|W) where O is the observation sequence and W is the HMM (or a sequence of HMMs). Note the small “p” – it’s a probability density, because we are using Gaussians, and not a true probability, although it is in some loose sense “proportional” to probability.
So, the likelihood computed by the HMM, and the probability from the language model, P(W), are on different scales. They have a quite different range of values. You can see that for yourself in the output from HTK when it prints acoustic log likelihoods, which tend to be large negative numbers like -1352.3.
Therefore, some scaling of these two values is necessary before combining them. We do that scaling in the log domain: multiply the language model log probability by a constant before adding it to the acoustic model log likelihood. If we didn’t do this, the acoustic model log likelihood would dominate and the language model would have little influence.
We call the language model scaling factor (sometimes called the language model weight) a hyperparameter because it is something separate from the actual HMM and language model parameters.
Separately from this, there is another hyperparameter, called the word insertion penalty. This allows us to trade off insertion errors versus deletion errors in order to minimise Word Error Rate.
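Putting the two hyperparameters together, the combined score used during recognition looks like this (my own notation, not HTK’s: [latex]s[/latex] is the language model scaling factor, [latex]w[/latex] is the word insertion penalty, and [latex]N(W)[/latex] is the number of words in [latex]W[/latex]):
[latex]
\hat{W} = \operatorname{argmax}_W \left[ \log p(O|W) + s \log P(W) + w\,N(W) \right]
[/latex]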
Note that hyperparameters must never be set using the test data. We should reserve some of the training data for this (and not use that part for training the HMMs or language model). This set is typically called a development set.
Generally, both the language model scaling factor and the word insertion penalty are tuned empirically (by trial and error) on a development set.
Let’s drop the term “best state sequence” because that’s probably misleading and not helpful.
The HMM is a stochastic generative model, and when it generates an observation sequence it does so using a randomly-chosen state sequence. We can never know the “true” sequence because it’s a hidden random variable (let’s call it Q). What we have to do is consider all possible state sequences (so Q is best described as a probability distribution). We have to think in a Bayesian way: we must say that Q is simultaneously taking on all possible values, some more likely than others.
The correct way to compute the probability of an observation sequence having been generated by a model, is to sum over all possible values of Q (technically, we can call this operation ‘integrating away’ Q because we only care about the probability and not about the value of Q). The quantity we obtain is called the Forward probability and can be computed using the Forward algorithm.
But sometimes we need the fastest possible computation and are willing to make an approximation to the Forward probability. So, instead of summing over all values of Q, we just pick the single most likely value and compute the probability of the observation sequence given that one value of Q. This is what we do during recognition.
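Continuing the illustrative notation from the Forward sketch above, the Viterbi approximation simply replaces each sum over predecessor states with a max:

import numpy as np

def viterbi(a, b, pi):
    J, T = b.shape
    delta = np.zeros((J, T))
    delta[:, 0] = pi * b[:, 0]
    for t in range(1, T):
        # keep only the best predecessor, instead of summing over all of them
        delta[:, t] = (delta[:, t - 1][:, None] * a).max(axis=0) * b[:, t]
    return delta[:, -1].max()                 # probability of the single best path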
Now let’s attempt to answer the question “Why does training with the Baum-Welch algorithm lead to a better model than training with the Viterbi algorithm?”
The key is indeed that Baum-Welch computes exact state occupancies. That means that it exactly computes how much each observation vector in the observation sequence should contribute to the updated mean and variance of each state. This gives a more robust estimate of those parameters than the Viterbi algorithm, because each state receives weighted contributions from many different observations.
What do I mean by “more robust”? Here, I mean that the Baum-Welch algorithm is less sensitive to misalignments between states and observations that might happen in the Viterbi algorithm.
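In symbols (using the standard forward probability [latex]\alpha_t(i)[/latex] and backward probability [latex]\beta_t(i)[/latex]), the state occupancy and the resulting update of a state’s mean are:
[latex]
\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{p(O|W)} \qquad \hat{\mu}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\,o_t}{\sum_{t=1}^{T} \gamma_t(i)}
[/latex]
Viterbi training replaces [latex]\gamma_t(i)[/latex] with a hard 0-or-1 alignment, which is exactly the approximation being discussed.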
How much difference does it actually make? Design an experiment and find out for yourself!
This paper might offer some insights:
L. J. Rodriguez-Fuentes and M. I. Torres, “Comparative Study of the Baum-Welch and Viterbi Training Algorithms Applied to Read and Spontaneous Speech Recognition” in Pattern Recognition and Image Analysis, First Iberian Conference, IbPRIA 2003, Puerto de Andratx, Mallorca, Spain, June 4-6, 2003. DOI: 10.1007/978-3-540-44871-6_98
Your understanding is quite good – certainly the main points are along the right lines.
To answer the questions:
1. The Viterbi algorithm operates left-to-right (although it could of course operate right-to-left if we really wanted) and so is “sequential” in that sense. The calculation of the Forward probability is very similar, except that all paths are summed, rather than just considering the most probable one at each state and time.
2. The Viterbi algorithm and the Forward-Backward (i.e., Baum-Welch) algorithm are computing different things. The Viterbi algorithm only finds the single most likely path, and its corresponding probability (which can then be used as a good approximation of the total Forward probability that the model generated the given observation sequence). The Baum-Welch algorithm computes more than this: it does not consider just one path but all possible paths; and, it computes the probability of being in any given state at any given time (called the “state occupancy”), summed across all possible paths that pass through that state at the given time.
3. We can indeed use either algorithm for training the model, but the Baum-Welch algorithm computes exact state occupancies whereas the Viterbi algorithm only computes an approximation. So, the Baum-Welch is more accurate and we expect it would therefore lead to better estimates of the model’s parameters (which in fact it does – how could you demonstrate that empirically?).
4. The Viterbi algorithm can be used to train the model, as we’ve stated above, but will result in slightly inferior (i.e., less likely) estimates for the model parameters. Because it is much faster than Baum-Welch, it is often used to quickly get the model parameters to approximately the most likely values, before “fine tuning” them with Baum-Welch.
The Inverse Fourier Transform (or alternatively a Discrete Cosine Transform – either can be used) doesn’t create any more information, but you are right to say that it does “unpack” the information and spread it out along the quefrency axis. That’s very much like the process by which the Fourier transform “unpacks” the frequencies in a waveform and lays them out along the frequency axis to make the spectrum. We can then perform some operations more conveniently in the frequency domain (e.g., finding the formants, or performing filtering).
After the “unpacking” in cepstral analysis, it’s possible to ignore information that is not wanted: when extracting MFCCs this means ignoring all the higher quefrency components and only retaining the first few coefficients (typically, 12). That is done simply by truncating the cepstrum (effectively setting everything above the 12th coefficient to zero).
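Here is a minimal sketch of that truncation step (assumed sizes and stand-in data, not HTK’s actual implementation), using the DCT route:

import numpy as np
from scipy.fftpack import dct

# stand-in for the log outputs of 26 mel filterbank channels for one frame
log_mel = np.log(np.random.rand(26) + 1e-8)

cepstrum = dct(log_mel, type=2, norm='ortho')  # 'unpack' onto the quefrency axis
mfccs = cepstrum[:12]                          # keep only the first 12 coefficients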
Is it any clearer now? If not – keep asking!
You could use a loop to construct a string (e.g., in order to pass it as a command line argument):
S="" for X in `cat myfile.scp` do S=${S}" -I "${X} done echo $S # ... now do something useful with ${S}
where myfile.scp is a plain text file with one value per line.
Let’s be clear about terminology: the model itself can “do” nothing more than randomly generate observation sequences. If we want to do something else, then we need an algorithm to do that.
For example, the Baum-Welch algorithm finds the most likely values of the Gaussian parameters and transition probabilities, given some training data.
Typically, the number of states is set by the system builder (e.g., you!). Three emitting states per phoneme is the most common arrangement when using phoneme models. For whole-word models, the right number of states is less obvious.
You make a very good point about the number of states in the model: if the model has a simple left-to-right topology (determined by the transitions), then the minimum length of observation sequence that it can emit is equal to the number of emitting states, because every path must pass through every state and emit at least one observation from each. If the number of states is large, this will be a problem: the model will assign zero probability (either in training or testing) to any sequence shorter than this.
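You can see this effect numerically with a toy forward computation (my own example, with made-up numbers): a strict left-to-right model with 5 emitting states and no skips assigns exactly zero probability to a 3-observation sequence, because no path can reach the final state in time.

import numpy as np

a = np.array([[0.5, 0.5, 0.0, 0.0, 0.0],   # strict left-to-right: no skips
              [0.0, 0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # must start in the first state
b = np.full((5, 3), 0.1)                   # 5 states, but only T = 3 observations

J, T = b.shape
alpha = np.zeros((J, T))
alpha[:, 0] = pi * b[:, 0]
for t in range(1, T):
    alpha[:, t] = (alpha[:, t - 1] @ a) * b[:, t]
print(alpha[-1, -1])                       # must end in the last state: prints 0.0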