Forum Replies Created
HMMs actually compute the likelihood p(O|W), where O is the observation sequence and W is the HMM (or a sequence of HMMs). Note the small “p” – because we are using Gaussians, this is a probability density and not a true probability, although it is in some loose sense “proportional” to probability.
So, the likelihood computed by the HMM, and the probability from the language model, P(W), are on different scales. They have a quite different range of values. You can see that for yourself in the output from HTK when it prints acoustic log likelihoods, which tend to be large negative numbers like -1352.3.
Therefore, some scaling of these two values is necessary before combining them. We do that scaling in the log domain: multiply the language model log probability by a constant value (the scale factor) before adding it to the acoustic model log likelihood. If we didn’t do this, the acoustic model likelihood would dominate and the language model would have little influence.
We call the language model scaling factor (sometimes called the language model weight) a hyperparameter because it is something separate from the actual HMM and language model parameters.
Separately from this, there is another hyperparameter, called the word insertion penalty. This allows us to trade off insertion errors versus deletion errors in order to minimise Word Error Rate.
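Putting those two hyperparameters together, recognition picks the word sequence \(W\) that maximises a combined score. In the sketch below, \(\alpha\) is the language model scale factor, \(\beta\) is the word insertion penalty and \(N(W)\) is the number of words in \(W\) (these symbols are just my notation); in HTK they correspond to the grammar scale factor and word insertion penalty options of HVite.
[latex]
\hat{W} = \arg\max_W \left[ \log p(O|W) + \alpha \log P(W) + \beta N(W) \right]
[/latex]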
Note that hyperparameters must never be set using the test data. We should reserve some of the training data for this (and not use that part for training the HMMs or language model). This set is typically called a development set.
Generally, both the language model scaling factor and the word insertion penalty are tuned empirically (by trial and error) on a development set.
Let’s drop the term “best state sequence” because that’s probably misleading and not helpful.
The HMM is a stochastic generative model, and when it generates an observation sequence it does so using a randomly-chosen state sequence. We can never know the “true” sequence because it’s a hidden random variable (let’s call it Q). What we have to do is consider all possible state sequences (so Q is best described as a probability distribution). We have to think in a Bayesian way: we must say that Q is simultaneously taking on all possible values, some more likely than others.
The correct way to compute the probability of an observation sequence having been generated by a model is to sum over all possible values of Q (technically, we can call this operation marginalising, or ‘integrating away’ Q, because we only care about the probability and not about the value of Q). The quantity we obtain is called the Forward probability and can be computed using the Forward algorithm.
But sometimes we need the fastest possible computation and are willing to make an approximation to the Forward probability. So, instead of summing over all values of Q, we just pick the single most likely value and compute the probability of the observation sequence given that one value of Q: this is what the Viterbi algorithm does, and it is what we do during recognition.
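To make the difference concrete, here is a minimal sketch of the two recursions for a single HMM. This is illustrative Python only (not HTK code), and the variable names are mine: A holds the transition probabilities, B the output probabilities of each observation in each state, and pi the initial state distribution.

import numpy as np

def forward_and_viterbi(A, B, pi):
    # A[i, j] : transition probability from state i to state j
    # B[j, t] : output probability (density) of observation t in state j
    # pi[j]   : probability of starting in state j
    N, T = B.shape
    alpha = np.zeros((N, T))   # Forward: summed over all state sequences
    delta = np.zeros((N, T))   # Viterbi: the single best state sequence only
    alpha[:, 0] = delta[:, 0] = pi * B[:, 0]
    for t in range(1, T):
        for j in range(N):
            alpha[j, t] = np.sum(alpha[:, t - 1] * A[:, j]) * B[j, t]   # sum over all paths
            delta[j, t] = np.max(delta[:, t - 1] * A[:, j]) * B[j, t]   # keep only the best path
    forward_prob = np.sum(alpha[:, -1])   # p(O | model), summed over all state sequences
    viterbi_prob = np.max(delta[:, -1])   # probability along the single best state sequence
    return forward_prob, viterbi_prob

In practice the computation is done in the log domain to avoid numerical underflow, and the entry and exit states need proper handling, but the sum-versus-max in the middle is the entire difference between the Forward probability and the Viterbi approximation to it.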
Now let’s attempt to answer the question “Why does training with the Baum-Welch algorithm lead to a better model than training with the Viterbi algorithm?”
The key is indeed that Baum-Welch computes exact state occupancies. That means that it exactly computes how much each observation vector in the observation sequence should contribute to the updated mean and variance of each state. This gives a more robust estimate of those parameters than the Viterbi algorithm, because each state receives weighted contributions from many different observations.
What do I mean by “more robust”? Here, I mean that the Baum-Welch algorithm is less sensitive to misalignments between states and observations that might happen in the Viterbi algorithm.
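To make that concrete: writing \(\gamma_t(j)\) for the state occupancy (the probability of being in state \(j\) at time \(t\), given the whole observation sequence), the Baum-Welch re-estimate of the mean of state \(j\) is a weighted average over every frame:
[latex]
\hat{\mu}_j = \frac{\sum_{t=1}^{T} \gamma_t(j) \, o_t}{\sum_{t=1}^{T} \gamma_t(j)}
[/latex]
Viterbi training uses the same formula but forces each \(\gamma_t(j)\) to be either 0 or 1, so every frame contributes to exactly one state.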
How much difference does it actually make? Design an experiment and find out for yourself!
This paper might offer some insights:
L. J. Rodriguez-Fuentes and M. I. Torres, “Comparative Study of the Baum-Welch and Viterbi Training Algorithms Applied to Read and Spontaneous Speech Recognition” in Pattern Recognition and Image Analysis, First Iberian Conference, IbPRIA 2003, Puerto de Andratx, Mallorca, Spain, June 4-6, 2003. DOI: 10.1007/978-3-540-44871-6_98
Your understanding is quite good – certainly the main points are along the right lines.
To answer the questions:
1. The Viterbi algorithm operates left-to-right (although it could of course operate right-to-left if we really wanted) and so is “sequential” in that sense. The calculation of the Forward probability is very similar, except that all paths are summed, rather than keeping only the most probable one at each state and each time.
2. The Viterbi algorithm and the Forward-Backward (i.e., Baum-Welch) algorithm are computing different things. The Viterbi algorithm only finds the single most likely path and its corresponding probability (which can then be used as a good approximation of the total Forward probability that the model generated the given observation sequence). The Baum-Welch algorithm computes more than this: it does not consider just one path but all possible paths, and it computes the probability of being in any given state at any given time (called the “state occupancy”), summed across all possible paths that pass through that state at the given time (see the sketch just after this list).
3. We can indeed use either algorithm for training the model, but the Baum-Welch algorithm computes exact state occupancies whereas the Viterbi algorithm only computes an approximation. So, the Baum-Welch is more accurate and we expect it would therefore lead to better estimates of the model’s parameters (which in fact it does – how could you demonstrate that empirically?).
4. The Viterbi algorithm can be used to train the model, as we’ve stated above, but will result in slightly inferior (i.e., less likely) estimates for the model parameters. Because it is much faster than Baum-Welch, it is often used to quickly get the model parameters to approximately the most likely values, before “fine tuning” them with Baum-Welch.
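Continuing the sketch from earlier (illustrative Python again; alpha comes from the Forward pass above and beta is assumed to come from a matching Backward pass), the state occupancies mentioned in point 2 are obtained like this:

import numpy as np

def state_occupancy(alpha, beta):
    # alpha[j, t] = p(o_1 ... o_t, state j at time t | model), from the Forward pass
    # beta[j, t]  = p(o_t+1 ... o_T | state j at time t, model), from the Backward pass
    p_O = np.sum(alpha[:, -1])    # total probability of the whole observation sequence
    gamma = alpha * beta / p_O    # gamma[j, t] = P(state j at time t | O, model)
    return gamma                  # every column (i.e., every time t) sums to 1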
The Inverse Fourier Transform (or alternatively a Discrete Cosine Transform – either can be used) doesn’t create any more information, but you are right to say that it does “unpack” the information and spread it out along the quefrency axis. That’s very much like the process by which the Fourier transform “unpacks” the frequencies in a waveform and lays them out along the frequency axis to make the spectrum. We can then perform some operations more conveniently in the frequency domain (e.g., finding the formants, or performing filtering).
After the “unpacking” in cepstral analysis, it’s possible to ignore information that is not wanted: when extracting MFCCs this means ignoring all the higher quefrency components and only retaining the first few coefficients (typically, 12). That is done simply by truncating the cepstrum (effectively setting everything above the 12th coefficient to zero).
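If it helps to see those last steps in code, here is a rough Python sketch (not the code of any particular toolkit; the earlier stages of MFCC extraction, such as windowing, the FFT and the mel filterbank, are assumed to have produced the filterbank energies for one frame already):

import numpy as np
from scipy.fft import dct

def truncated_cepstrum(filterbank_energies, num_coefficients=12):
    log_energies = np.log(filterbank_energies)            # the log step
    cepstrum = dct(log_energies, type=2, norm='ortho')    # the "unpacking" step
    return cepstrum[:num_coefficients]                    # truncate: keep only the low quefrencies

Whether the 0th coefficient (which mostly reflects overall energy) is kept or replaced by a separate energy feature varies between systems.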
Is it any clearer now? If not – keep asking!
You could use a loop to construct a string (e.g., in order to pass it as a command line argument):
S="" for X in `cat myfile.scp` do S=${S}" -I "${X} done echo $S # ... now do something useful with ${S}
where myfile.scp is a plain text file with one value per line.
Let’s be clear about terminology: the model itself can “do” nothing more than randomly generate observation sequences. If we want to do something else, then we need an algorithm to do that.
For example, the Baum-Welch algorithm finds the most likely values of the Gaussian parameters and transition probabilities, given some training data.
Typically, the number of states is set by the system builder (e.g., you!). Three emitting states per phoneme is the most common arrangement when using phoneme models. For whole-word models, the right number of states is less obvious.
You make a very good point about the number of states in the model: if the model has a simple left-to-right topology (determined by the transitions), then the minimum length of observation sequence that it can emit is equal to the number of states. If the number of states is large, this will be a problem: the model will assign zero probability (either in training or testing) to any sequence shorter than this.
We could certainly consider hand-labelling the data with the correct state sequence. In other words, for every frame in the training data, we would annotate it with the state of the model it aligns with.
But, that would be very hard, for two reasons:
- How do we know what the correct state sequence is anyway?
- There are 100 frames per second, and we might have hours of data. It might take rather a long time to do this hand-labelling.
Your suggestion to divide the word models into sub-word units (in fact, phonemes) is a good idea, but we still wouldn’t want to hand-label the phones in a large speech corpus (phonetic transcription and alignment takes about 100 times real time: 100 hours of work, per hour of speech data).
But what if we wanted to have more than one state per phoneme, which we normally would do (3 emitting states per phoneme model is the usual arrangement in most systems)? How would we then hand-align the observations with the three sub-phonetic states?
We will see that the model can itself find alignments with the training data, and labelling at the phoneme level is not needed. In fact, we don’t even need to align the word boundaries either, we just need to know the model sequence for each training utterance. The Baum-Welch algorithm takes care of everything else.
I think we both have errors 🙂 I’ve attached a spreadsheet that calculates this worked example.
model A: 0.22, 0.10, 0.19, 0.28, 0.28, 0.28 – product is 0.00010
model B: 0.24, 0.05, 0.18, 0.20, 0.20, 0.22 – product is 0.00002
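Written out, the arithmetic is:
[latex]
0.22 \times 0.10 \times 0.19 \times 0.28 \times 0.28 \times 0.28 \approx 0.00010
[/latex]
[latex]
0.24 \times 0.05 \times 0.18 \times 0.20 \times 0.20 \times 0.22 \approx 0.00002
[/latex]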
Please post your observation probabilities here and I’ll compare them to mine – it’s possible there are errors in my calculations.
When you say “calculate the PDF” do you mean how do we estimate the parameters of the Gaussians in an HMM? That will be coming up in lectures shortly.
As for finding the most probable state sequence, that is the Viterbi algorithm and is coming up in lectures even sooner.
This is coming up in lectures – but we need to understand recognition first.
The error message means that HTK cannot find the correct labels (hence the .lab extension) for the file s1574060_test01. First it looked in the MLF and didn’t find them; then it looked in the same place as the .rec files – that’s why the path given in the error message is rec/.
So: you have a mistake in your MLF: perhaps you missed out the correct labels for s1574060_test01, or there is a formatting error. If you can’t find it, post your MLF here (as an attachment) and I’ll find the problem.
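For reference, a label MLF entry has this general shape (the filename is taken from your error message; the words are just placeholders):

#!MLF!#
"*/s1574060_test01.lab"
SOME
WORDS
HERE
.

The #!MLF!# header appears once at the top of the file, there must be an entry whose pattern matches every test file (note the .lab extension in the pattern), and each entry ends with a full stop on a line of its own.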
The notation of 9.374690e-01 is called scientific notation, and is a way to display very large or very small values. The “e” stands for exponent and we can read the “eNN” as either
- “times 10 to the power NN”
- “move the decimal point NN places” (negative NN means move to the left, positive NN means move to the right)
they both mean the same thing. “e+00” would mean “don’t move the decimal place”. The number before the “e” always has exactly one digit before the decimal point, and that is always non-zero.
Here’s how that works for your numbers:
9.374690e-01 means “take 9.374690 and move the decimal place -01 places” in other words one position to the left. So we have
9.374690e-01 = 0.9374690
6.253099e-02 = 0.06253099
and now we see that 0.9374690 + 0.06253099 = 0.99999999, which is 1 to within the rounding of the printed values
Try for yourself: write these numbers in scientific notation:
123.435
0.00043
-3487.12
[showhide more_text="Reveal the answers"]
123.435 = 1.23435e+02
0.00043 = 4.3e-04
-3487.12 = -3.48712e+03
[/showhide]
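If you want to check answers like these another way, most programming languages will print numbers in scientific notation for you; for example, in Python (nothing to do with HTK, just a quick check):

# print each number in scientific notation, one per line:
# 1.234350e+02, 4.300000e-04, -3.487120e+03
for value in [123.435, 0.00043, -3487.12]:
    print(f"{value:e}")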
Their explanation is that the log is a form of dynamic range compression, which is a standard technique in audio engineering used to narrow the range of energies found in a signal.
Another motivation might be to simulate a property of human hearing, which also involves a kind of dynamic range compression so that we can hear very quiet sounds but also tolerate very loud sounds.
However, there is a much better theoretical motivation for taking the logarithm in the spectral domain when extracting MFCCs: the convolution of source and filter in the time domain is a multiplication in the spectral domain (of the source spectrum and the vocal tract filter’s frequency response), and the logarithm converts that multiplication into an addition, so that the source and filter can be separated more easily.
Transforming a signal into a domain where a convolution has become addition is called “homomorphic filtering”.
The process of extracting MFCCs from a waveform is approximately a type of homomorphic filtering.
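In symbols (my notation, just for this sketch): if \(|S(f)|\) is the magnitude spectrum of the source and \(|H(f)|\) is the magnitude response of the vocal tract filter, then
[latex]
|X(f)| = |S(f)| \, |H(f)| \quad \Rightarrow \quad \log |X(f)| = \log |S(f)| + \log |H(f)|
[/latex]
so after the log the source and filter contributions are simply added together, and an addition is much easier to undo than a multiplication or a convolution.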
Well spotted!
The covariance matrix is symmetrical. Along the diagonal are the variance values (the “covariance between each dimension and itself” if you like). Off the diagonal are the covariance values between pairs of dimensions.
Since the covariance between a and b is the same as the covariance between b and a, the upper triangle of this matrix (above the diagonal) is the same as the lower triangle (below the diagonal).
Let’s define covariance formally to understand why this is:
Here are two variables – the elements of a two-dimensional feature vector: \([X_1 , X_2]\)
First, let’s write down the variance of \(X_1\), which is defined simply as the average squared distance from the mean – in other words, to estimate it from data, we simply compute the squared difference between every data point and the mean, and take the average of that.
[latex]
\sigma^2_1 = var(X_1) = E[ (X_1 - \mu_1)(X_1 - \mu_1) ]
[/latex]The “E[…]” notation is just a fancy formal way of saying “the average value” and the E stands for “expected value” or “expectation”.
Here’s the covariance between \(X_1\) and \(X_2\)
[latex]
cov(X_1,X_2) = E[ (X_1 - \mu_1)(X_2 - \mu_2) ]
[/latex]Now, for yourself, write down the covariance between \(X_2\) and \(X_1\). You will find that it’s equal to the value above.
[showhide more_text="Reveal the answer" less_text="Hide the answer" hidden="yes"]
Here’s the covariance between \(X_2\) and \(X_1\)
[latex]
cov(X_2,X_1) = E[ (X_2 - \mu_2)(X_1 - \mu_1) ]
[/latex]and because multiplication is commutative we can write
[latex]
(X_1 - \mu_1)(X_2 - \mu_2) = (X_2 - \mu_2)(X_1 - \mu_1)
[/latex]and therefore
[latex]
cov(X_1,X_2) = cov(X_2,X_1)
[/latex]Let’s move up to three dimensions. Noting that \(cov(X_1,X_2)\) can be written as \(\Sigma_{12}\), the full covariance matrix looks like this:
[latex]
\Sigma = \left( \begin{array}{ccc}
\Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\
\Sigma_{21} & \Sigma_{22} & \Sigma_{23} \\
\Sigma_{31} & \Sigma_{32} &\Sigma_{33} \end{array} \right)
[/latex]But we normally write \(\sigma_1^2\) rather than \(\Sigma_{11}\), and since \(\Sigma_{12} = \Sigma_{21}\) we can write this:
[latex]
\Sigma = \left( \begin{array}{ccc}
\sigma_1^2 & \Sigma_{12} & \Sigma_{13} \\
\Sigma_{12} & \sigma_2^2 & \Sigma_{23} \\
\Sigma_{13} & \Sigma_{23} & \sigma_3^2 \end{array} \right)
[/latex]See how the matrix is symmetrical. It has just over half as many parameters as you might have thought at first. But, the number of parameters in a covariance matrix is still proportional to the square of the dimension of the feature vector. That’s one reason we might try to make feature vectors as low-dimensional as possible before modelling them with a Gaussian.
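You can check all of this numerically; here is a small sketch using numpy (the data is random, purely for illustration):

import numpy as np

X = np.random.randn(1000, 3) * [1.0, 2.0, 0.5]   # 1000 made-up 3-dimensional feature vectors

Sigma = np.cov(X, rowvar=False)      # rowvar=False: each column of X is one dimension

print(Sigma)                         # a 3x3 matrix
print(np.allclose(Sigma, Sigma.T))   # True: the matrix is symmetrical
print(np.diag(Sigma))                # the diagonal holds the three variances

For a \(d\)-dimensional feature vector there are \(d(d+1)/2\) free parameters rather than \(d^2\), which is the “just over half” mentioned above.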
Confused by the notation?
The subscripts are always indexing the dimension of the feature vector. The superscript “2” in \(\sigma^2\) just means “squared”: \(\sigma^2 = \sigma \times \sigma\)
The notation of upper and lower case sigma is also potentially confusing, because \(\Sigma\) is a covariance matrix, \(\sigma\) is standard deviation, and \(\sigma^2\) is variance. We do not write \(\Sigma^2\) for the covariance matrix!
[/showhide]
PS – let me know if the maths doesn’t render in your browser.