› Forums › Automatic speech recognition › HTK › Viterbi vs Forward-Backward (Baum-Welch)
- This topic has 8 replies, 5 voices, and was last updated 5 years, 2 months ago by Simon.
-
AuthorPosts
-
-
November 22, 2015 at 02:14 #762
I’m trying to understand each of these algorithms, and the relation they have in HTK training process as the differences between them.
So far, (I think) I undertand:
1) Forward-Backward:
Forward probability: probability that a certain vector (ot) is generated by a state j at a time t. To estimate the FP for state j, we need the FP value of all the possible previous states i, weighted by the transition matrix probabilities (the most likely transitions will have a greater value), iteratively. Takes the “past” into account (a past of one state).
Backward probability: probability of generate a vector from time t+1 to the end, given that we are in state i at time t, iteratively (so, the possible states . Takes the “future” into account.
So, if we mix both, iteratively, we get the path and we get the most likely alignment.2) Viterbi: calculates the best sequence storing partial paths, the paths that are winning so far (and dropping the ones that have lower probability so far). When it gets to the end, goes back using pointers to get the most likely path.
The difference: Viterbi seems more “sequential”, starting at the star and without going back until the very end where it follows the traceback to retrieve the path. Meanwhile, the B-F algorithm, as the name says, goes back and forward iteratively.
Then, the questions:
1. (is that right?)
2. is there any other important difference between them?3. why Viterbi is not enough and we need to use B-F?
4. why do we need to use Viterbi before B-F?Reading further the manual, I think that Viterbi is used to perform the second alignment with the training data (after the rough equally divided segmentation done with k-means), and F-B is used to re-estimate means, variances and transitions and “state ocupation”. So B-F is doing more things than Viterbi.
Thank you.
-
November 22, 2015 at 10:43 #764
Your understanding is quite good – certainly the main points are along the right lines.
To answer the questions:
1. The Viterbi algorithm operates left-to-right (although could of course operate right-to-left if we really wanted) and so is “sequential” in that sense. The calculation of the Forward probability is very similar, except that all paths are summed, rather than just considering the most probable one at each state / each time.
2. The Viterbi algorithm and the Forward-Backward (i.e., Baum-Welch) algorithm are computing different things. The Viterbi algorithm only finds the single most likely path, and its corresponding probability (which can then be used as a good approximation of the total Forward probability that the model generated the given observation sequence). The Baum-Welch algorithm computes more than this: it does not consider just one path but all possible paths; and, it computes the probability of being in any given state at any given time (called the “state occupancy”), summed across all possible paths that pass through that state at the given time.
3. We can indeed use either algorithm for training the model, but the Baum-Welch algorithm computes exact state occupancies whereas the Viterbi algorithm only computes an approximation. So, the Baum-Welch is more accurate and we expect it would therefore lead to better estimates of the model’s parameters (which in fact it does – how could you demonstrate that empirically?).
4. The Viterbi algorithm can be used to train the model, as we’ve stated above, but will result in slightly inferior (i.e., less likely) estimates for the model parameters. Because it is much faster than Baum-Welch, it is often used to quickly get the model parameters to approximately the most likely values, before “fine tuning” them with Baum-Welch.
-
November 23, 2015 at 11:12 #800
Following on to this topic: I have trouble understanding why the ‘single most likely state sequence’ (from Viterbi), is not in fact the best state sequence for our model. Why does considering, and weighting, every possible state sequence (from Baum-Welch) generate a better model? Yes, I can prove that this is in fact true empirically, but I still don’t really understand why. I’m assuming the key is what you stated above: “Baum-Welch algorithm computes exact state occupancies whereas the Viterbi algorithm only computes an approximation”. The video on this on LEARN is also quite good and clear…but there’s still this gap in my understanding. Why isn’t ‘the best’ (the most likely) the best?
-
November 23, 2015 at 11:44 #803
Let’s drop the term “best state sequence” because that’s probably misleading and not helpful.
The HMM is a stochastic generative model, and when it generates an observation sequence it does so using a randomly-chosen state sequence. We can never know the “true” sequence because it’s a hidden random variable (let’s call it Q). What we have to do is consider all possible state sequences (so Q is best described as a probability distribution). We have to think in a Bayesian way: we must say that Q is simultaneously taking on all possible values, some more likely than others.
The correct way to compute the probability of an observation sequence having been generated by a model, is to sum over all possible values of Q (technically, we can call this operation ‘integrating away’ Q because we only care about the probability and not about the value of Q). The quantity we obtain is called the Forward probability and can be computed using the Forward algorithm.
But sometimes we need the fastest possible computation and are willing to make an approximation to the Forward probability. So, instead of summing over all values of Q, we just pick the single most likely value and compute the probability of the observation sequence given that one value of Q. This is what we do during recognition.
Now let’s attempt to answer the question “Why does training with the Baum-Welch algorithm lead to a better model than training with the Viterbi algorithm?”
The key is indeed that Baum-Welch computes exact state occupancies. That means that it exactly computes how much each observation vector in the observation sequence should contribute to the updated mean and variance of each state. This gives a more robust estimate of those parameters than the Viterbi algorithm, because each state receives weighted contributions from many different observations.
What do I mean by “more robust”? Here, I mean that the Baum-Welch algorithm is less sensitive to misalignments between states and observations that might happen in the Viterbi algorithm.
How much difference does it actually make? Design an experiment and find out for yourself!
This paper might offer some insights:
L. J. Rodriguez-Fuentes and M. I. Torres, “Comparative Study of the Baum-Welch and Viterbi Training Algorithms Applied to Read and Spontaneous Speech Recognition” in Pattern Recognition and Image Analysis, First Iberian Conference, IbPRIA 2003, Puerto de Andratx, Mallorca, Spain, June 4-6, 2003. DOI: 10.1007/978-3-540-44871-6_98
-
November 24, 2015 at 18:59 #851
I also have some trouble understanding the difference between summing over all possible values of Q (Forward algorithm) and choosing the most likely state sequence (Viterbi algoritm). I read your reply but I’m still confused as to why we wouldn’t always want the most likely state sequence?
If we think of Q as a probability distribution, wouldn’t the sum over all possible values of Q (“integrating away”) be equal to 1?
I would also like to have a definition of “stochastic model”. Does this refer to the use of probabilities (Bayesian point of view)?
-
November 25, 2015 at 09:02 #855
Summing over all values of Q versus taking only the single most likely value
What we are summing up are the individual probabilities of the observation sequence and one particular state sequence (Q takes on one of its many possible values), given the model W, which is p(O,Q|W). That’s not the value we need. We want p(O|W). We don’t care about Q – it’s a hidden random variable.
The quantity that we are computing by summing over (i.e., integrating away) Q is the total conditional probability of the observation sequence O given the model W, that is p(O|W). This is also known as the likelihood. Here’s a mathematical statement of integrating away Q that shows why we need to sum over all possible state sequences to get the value we really want:
[latex]
p(O|W) = \sum_Q p(O,Q|W)
[/latex]and note that I’ve been using small ‘p’ everywhere because we are using Gaussians and so things are not actually probabilities, but probability densities.
-
November 25, 2015 at 09:05 #856
What is a stochastic model?
The term stochastic is the opposite of deterministic. We could also use words like ‘random’ or ‘probabilistic’ instead of ‘stochastic’.
An HMM is stochastic because the state sequence is a random variable: we simply don’t know what value it took, when the model generated a particular observation sequence.
-
-
-
November 21, 2019 at 11:26 #10304
Hi, I have a question regarding the objective function of these two algorithms.
I’m not sure if this this correct, but my understanding is that both algorithms can be used to weigh the empirical mean/variance of each possible alignment, the only difference is that while Viterbi has only 0 or 1 for weighing, B-F uses the true probability distribution to weigh each alignment.
The objective function for Viterbi is “argmax over w P(W|O) ∝ argmax over w P(O|W)P(W)”. But what is the objective function for B-F? Also, do all weights used in B-F have to add up to 1?
-
November 21, 2019 at 12:18 #10305
Both algorithms find alignments between states and observations.
Both algorithms express this alignment as the probability of an observation aligning with a state, which is the same thing as the state sequence. In Viterbi, we can think of those probabilities being “hard” or 1s and 0s because there just one state sequence is considered. In Baum-Welch, they will be “soft” because all state sequences are considered.
These probabilities are then used as the weights in a weighted sum of observations to re-estimate the means of the Gaussians. The weights will not sum to one, so this weighted sum must be normalised by dividing by the sum of the weights.
Remember that in this course, we’re not deriving the equations to express the above. So bear in mind that whilst the concept of “hard” and “soft” alignment is perfectly correct, the exact computations of the weights might be slightly more complex.
In training, W is constant for each training sample, so no language model is needed.
The objective function of both algorithms when used for training is to maximise the probability of the observations given the model: P(O|W). This is called “maximum likelihood training” and is the simplest and most obvious thing to do.
(However, this simple objective function does not directly relate to the final task, which is one of classification. So, in advanced courses on ASR, other objective functions would be developed which more directly relate to minimising the recognition error rate. Those are much more complex.)
-
-
AuthorPosts
- You must be logged in to reply to this topic.