Forum Replies Created
We could certainly consider hand-labelling the data with the correct state sequence. In other words, for every frame in the training data, we would annotate it with the state of the model it aligns with.
But that would be very hard, for two reasons:
- How do we know what the correct state sequence is anyway?
- There are 100 frames per second, and we might have hours of data. It might take rather a long time to do this hand-labelling (see the arithmetic below).
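To put a number on that (using the 100 frames per second figure from above):
[latex]
100 \ \textrm{frames/second} \times 3600 \ \textrm{seconds/hour} = 360{,}000 \ \textrm{frames per hour of speech}
[/latex]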
Your suggestion to divide the word models into sub-word units (in fact, phonemes) is a good idea, but we still wouldn’t want to hand-label the phones in a large speech corpus (phonetic transcription and alignment takes about 100 times real time: 100 hours of work, per hour of speech data).
But what if we wanted to have more than one state per phoneme, which we normally would do (3 emitting states per phoneme model is the usual arrangement in most systems)? How would we then hand-align the observations with the three sub-phonetic states?
We will see that the model can itself find alignments with the training data, so labelling at the phoneme level is not needed. In fact, we don’t need to align the word boundaries either; we just need to know the model sequence for each training utterance. The Baum-Welch algorithm takes care of everything else.
I think we both have errors 🙂 I’ve attached a spreadsheet that calculates this worked example.
model A: 0.22, 0.10, 0.19, 0.28, 0.28, 0.28 – the product is 0.00010
model B: 0.24, 0.05, 0.18, 0.20, 0.20, 0.22 – the product is 0.00002
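If you want to double-check a product like this yourself, here’s a quick sketch using awk from the shell (the values are model A’s probabilities from above; since those are already rounded, the product computed from them may differ in the final digit from the spreadsheet’s value):

echo "0.22 0.10 0.19 0.28 0.28 0.28" | awk '{p = 1; for (i = 1; i <= NF; i++) p *= $i; printf "%.5f\n", p}'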
Please post your observation probabilities here and I’ll compare them to mine – it’s possible there are errors in my calculations.
When you say “calculate the PDF”, do you mean how we estimate the parameters of the Gaussians in an HMM? That will be coming up in lectures shortly.
As for finding the most probable state sequence, that is the Viterbi algorithm and is coming up in lectures even sooner.
This is coming up in lectures – but we need to understand recognition first.
The error message means that HTK cannot find the correct labels (hence the .lab extension) for the file s1574060_test01. First it looked in the MLF and didn’t find them; then it looked in the same place as the .rec files – that’s why the path given in the error message is rec/.
So: you have a mistake in your MLF: perhaps you missed out the correct labels for s1574060_test01, or there is a formatting error. If you can’t find it, post your MLF here (as an attachment) and I’ll find the problem.
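For reference, a well-formed MLF entry looks like this (the labels “one two three” are just placeholders here – yours will be whatever the correct transcription is):

#!MLF!#
"*/s1574060_test01.lab"
one
two
three
.

Note the #!MLF!# header on the first line, the quoted filename pattern ending in .lab, and the full stop on a line of its own terminating each entry.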
The notation 9.374690e-01 is called scientific notation, and is a way to display very large or very small values. The “e” stands for exponent and we can read the “eNN” as either
- “times 10 to the power NN”
- “move the decimal point NN places (negative NN means move to the left, positive NN means move to the right)”
They both mean the same thing. “e+00” would mean “don’t move the decimal point”. The number before the “e” always has exactly one digit before the decimal point, and that digit is always non-zero.
Here’s how that works for your numbers:
9.374690e-01 means “take 9.374690 and move the decimal point -01 places”, in other words one position to the left. So we have
9.374690e-01 = 0.9374690
6.253099e-02 = 0.06253099
and now we see that indeed 0.9374690 + 0.06253099 = 0.99999999, which is 1 up to rounding error.
Try for yourself: write these numbers in scientific notation:
123.435
0.00043
-3487.12
[showhide more_text="Reveal the answers"]
123.435 = 1.23435e+02
0.00043 = 4.3e-04
-3487.12 = -3.48712e+03
[/showhide]
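You can also ask bash to do the conversion for you: printf’s %e format prints any number in scientific notation (it pads to six decimal places by default):

printf "%e\n" 123.435 0.00043 -3487.12
# prints:
# 1.234350e+02
# 4.300000e-04
# -3.487120e+03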
Their explanation is that the log is a form of dynamic range compression, which is a standard technique in audio engineering used to narrow the range of energies found in a signal.
Another motivation might be to simulate a property of human hearing, which also involves a kind of dynamic range compression so that we can hear very quiet sounds but also tolerate very loud sounds.
However, there is a much better theoretical motivation for taking the logarithm in the spectral domain when extracting MFCCs: it converts a multiplication in the spectral domain (the source spectrum multiplied by the vocal tract filter’s frequency response, which corresponds to a convolution in the time domain) into an addition, so that the source and filter can be separated more easily.
Transforming a signal into a domain where a convolution has become addition is called “homomorphic filtering”.
The process of extracting MFCCs from a waveform is approximately a type of homomorphic filtering.
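In symbols (writing \(e(t)\) for the source, \(h(t)\) for the vocal tract filter, and \(*\) for convolution – the symbols here are just my choice):
[latex]
x(t) = e(t) * h(t) \Rightarrow |X(f)| = |E(f)| \, |H(f)| \Rightarrow \log |X(f)| = \log |E(f)| + \log |H(f)|
[/latex]
Once source and filter are additive rather than multiplicative, they are much easier to separate.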
Well spotted!
The covariance matrix is symmetrical. Along the diagonal are the variance values (the “covariance between each dimension and itself” if you like). Off the diagonal are the covariance values between pairs of dimensions.
Since the covariance between a and b is the same as the covariance between b and a, the upper triangle of this matrix (above the diagonal) is the same as the lower triangle (below the diagonal).
Let’s define covariance formally to understand why this is:
Here are two variables – the elements of a two-dimensional feature vector: \([X_1 , X_2]\)
First, let’s write down the variance of \(X_1\), which is defined simply as the average squared distance from the mean – in other words, to estimate it from data, we simply compute the squared difference between every data point and the mean, and take the average of that.
[latex]
\sigma^2_1 = var(X_1) = E[ (X_1 - \mu_1)(X_1 - \mu_1) ]
[/latex]
The “E[…]” notation is just a fancy formal way of saying “the average value” and the E stands for “expected value” or “expectation”.
Here’s the covariance between \(X_1\) and \(X_2\)
[latex]
cov(X_1,X_2) = E[ (X_1 - \mu_1)(X_2 - \mu_2) ]
[/latex]
Now, for yourself, write down the covariance between \(X_2\) and \(X_1\). You will find that it’s equal to the value above.
[showhide more_text="Reveal the answer" less_text="Hide the answer" hidden="yes"]
Here’s the covariance between \(X_2\) and \(X_1\)
[latex]
cov(X_2,X_1) = E[ (X_2 - \mu_2)(X_1 - \mu_1) ]
[/latex]
and because multiplication is commutative we can write
[latex]
(X_1 - \mu_1)(X_2 - \mu_2) = (X_2 - \mu_2)(X_1 - \mu_1)
[/latex]
and therefore
[latex]
cov(X_1,X_2) = cov(X_2,X_1)
[/latex]
Let’s move up to three dimensions. Noting that \(cov(X_1,X_2)\) can be written as \(\Sigma_{12}\), the full covariance matrix looks like this:
[latex]
\Sigma = \left( \begin{array}{ccc}
\Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\
\Sigma_{21} & \Sigma_{22} & \Sigma_{23} \\
\Sigma_{31} & \Sigma_{32} & \Sigma_{33} \end{array} \right)
[/latex]
But we normally write \(\sigma_1^2\) rather than \(\Sigma_{11}\), and since \(\Sigma_{12} = \Sigma_{21}\) we can write this:
[latex]
\Sigma = \left( \begin{array}{ccc}
\sigma_1^2 & \Sigma_{12} & \Sigma_{13} \\
\Sigma_{12} & \sigma_2^2 & \Sigma_{23} \\
\Sigma_{13} & \Sigma_{23} & \sigma_3^2 \end{array} \right)
[/latex]
See how the matrix is symmetrical: it has just over half as many parameters as you might have thought at first. But the number of parameters in a covariance matrix is still proportional to the square of the dimension of the feature vector. That’s one reason we might try to make feature vectors as low-dimensional as possible before modelling them with a Gaussian.
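To make that concrete: for a \(d\)-dimensional feature vector, a full covariance matrix has \(d(d+1)/2\) unique parameters rather than \(d^2\). Taking \(d=39\) as an example (a typical MFCC-plus-derivatives vector; the exact dimension is just an assumption here):
[latex]
\frac{d(d+1)}{2} = \frac{39 \times 40}{2} = 780 \quad \textrm{rather than} \quad 39^2 = 1521
[/latex]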
Confused by the notation?
The subscripts are always indexing the dimension of the feature vector. The superscript “2” in \(\sigma^2\) just means “squared”: \(\sigma^2 = \sigma \times \sigma\)
The notation of upper and lower case sigma is also potentially confusing, because \(\Sigma\) is a covariance matrix, \(\sigma\) is standard deviation, and \(\sigma^2\) is variance. We do not write \(\Sigma^2\) for the covariance matrix!
[/showhide]
PS – let me know if the maths doesn’t render in your browser.
You might want to load the list of values from a file:
for X in `cat myfile.scp`
do
  echo The value of X is ${X}
done
where myfile.scp is a plain text file with one value per line. This is a good way to loop around a list of files, for example.
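For example (a sketch – I’m assuming here that myfile.scp contains one .wav path per line):

for FILE in `cat myfile.scp`
do
  # basename strips the directory and the .wav extension
  echo Processing `basename ${FILE} .wav`
done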
To loop around a range of numerical values, you can use
for X in {1..10}
do
  echo The value of X is ${X}
done
A more flexible way is to use the seq command, which allows you to control the increment step size, use non-integer values, and control the format in which the number is printed:
for X in $(seq 1 10)
do
  echo The value of X is ${X}
done
or
for X in $(seq -w 1 0.5 6)
do
  echo The value of X is ${X}
done
and so on. Type ‘man seq’ at a bash prompt to read the manual for the seq command.
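The format option is handy for things like zero-padded file numbers. A sketch (the file naming is just an example):

for N in $(seq -f "%03g" 1 5)
do
  echo file_${N}.wav   # prints file_001.wav ... file_005.wav
done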
The basic loop around a fixed set of values looks like this:
for X in 1 2 3
do
  echo The value of X is ${X}
done
where the values are actually strings, so we can also have
for X in 34 b purple c 99 a
do
  echo The value of X is ${X}
done
or
for FRUIT in apples oranges pears
do
  echo The current fruit is ${FRUIT}
done
This is an internal feature and you don’t need to understand what it means. The key phrase-level feature on the example above is “NB”, meaning “no break”.
OK, so my hypothesis about non-ASCII characters is probably wrong here. You seem to have found a pretty bad error in the part of the pipeline that detects/classifies/expands non-standard words. Can you speculate on exactly where this might have happened, and maybe even propose where a change would have to be made to fix this problem?
The unknown / blank item in the Word relation is probably the place where the pound sign sat just after tokenisation; it was deleted after completion of the non-standard word processing step (because we don’t want “pounds three billion”).
This sounds like a unit selection error. The most likely explanation is that there is a unit in the speech database that is labelled as the vowel in “red” but actually sounds like the vowel in “reed”.
It’s easy to see how that might happen: there was a front-end error during the labelling of the database (e.g., the database utterance contained the word “read” pronounced as “reed”, but the front end predicted the phone sequence for the pronunciation “red” and so aligned that phone label with the speech). Automatic labelling works well, but may not always be able to detect that type of error.
The unit selection algorithm is susceptible to mislabelling errors and has only limited ways of detecting them at synthesis time.