Forum Replies Created
The Euclidean distance metric is effectively the same as a Gaussian with a constant variance.
Even EM offers no guarantee to find the model parameters that maximise the likelihood of the training data (stated as “the maximum likelihood parameter settings” in the question). It can only find a local maximum, for the reasons explained in this topic. So, b. is untrue.
Yes, iv. is not true – how could it be, when we might not know anything about the test set whilst training the model?
Let’s consider the other options:
i. says that EM will find the best possible model. But we also know that it’s just an iterative “hill climbing” algorithm that stops when it cannot climb any higher (“height” means likelihood of the training data). EM is also sensitive to the starting position: we may get a different final model if we start from a different initial model. These two facts tell us that it cannot guarantee to maximise the likelihood of the training data – the best it can do is find a local maximum.
ii. we’ve already seen that this is true: hill-climbing will never take us downhill
iii. this is true by definition: EM updates all model parameters in each M step
This leads us to the correct answer of c.
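If you want to see this behaviour concretely, here is a minimal sketch (not taken from any toolkit) of EM for a toy 1-D Gaussian mixture with fixed unit variances and equal weights: within a run, the training-data log likelihood never decreases, but different initialisations can end up at different final models.

```python
import numpy as np

def em_1d_gmm(x, means, n_iter=50):
    """Minimal EM for a 1-D Gaussian mixture with fixed unit variances
    and equal weights, just to illustrate the hill-climbing behaviour."""
    means = np.array(means, dtype=float)
    log_likelihoods = []
    for _ in range(n_iter):
        # E step: responsibility of each component for each data point
        d = x[:, None] - means[None, :]
        log_p = -0.5 * d**2 - 0.5 * np.log(2 * np.pi)
        resp = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # log likelihood of the *current* model (equal component weights)
        ll = np.log(np.exp(log_p).mean(axis=1)).sum()
        # M step: update all means at once
        means = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        log_likelihoods.append(ll)
    return means, log_likelihoods

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

# Different initialisations may converge to different final models,
# but within each run the training-data likelihood never decreases.
for init in ([-1.0, 1.0], [5.0, 6.0]):
    means, lls = em_1d_gmm(x, init)
    assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))   # monotonic
    print(init, "->", np.round(means, 2), "final log-lik:", round(lls[-1], 1))
```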
You’re correct to rule out c. – this would make dynamic programming inapplicable. The same goes for d., which even requires knowledge of the future state sequence!
If b. were true, then what would happen when two tokens (in Token Passing) meet in a particular state? What if they had different previous states (which of course they always will)?
The correct answer is a. – this is in fact a statement of the Markov property of the model.
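To make this concrete, here is a minimal sketch of Token Passing over a toy left-to-right model. The transition and emission log probabilities are made-up numbers, purely for illustration. When tokens meet in a state, only the best one survives, and that is safe precisely because of the Markov property.

```python
import math

# Made-up emission log probs per state per frame, and transition log probs.
log_emit = {
    0: [-1.0, -2.0, -3.0, -4.0],
    1: [-3.0, -1.0, -1.5, -2.5],
    2: [-4.0, -3.0, -1.0, -0.5],
}
log_trans = {(0, 0): -0.5, (0, 1): -1.0,
             (1, 1): -0.5, (1, 2): -1.0,
             (2, 2): -0.5}
n_frames = 4

# One token per state: the log probability of the best partial path
# ending in that state at the current time.
tokens = {0: log_emit[0][0], 1: -math.inf, 2: -math.inf}

for t in range(1, n_frames):
    new_tokens = {s: -math.inf for s in tokens}
    for (src, dst), ltp in log_trans.items():
        candidate = tokens[src] + ltp + log_emit[dst][t]
        # When tokens from different predecessors meet in `dst`, keep only
        # the best one. The Markov property justifies this: the future
        # depends only on the current state, so the losing token can
        # never catch up later.
        if candidate > new_tokens[dst]:
            new_tokens[dst] = candidate
    tokens = new_tokens

print("best final log probability:", tokens[2])
```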
iii. pseudo-pitch marks
These are pitch marks placed in unvoiced regions, solely for the purpose of signal processing algorithms that operate pitch-synchronously (e.g., TD-PSOLA). They are an internal part of the signal processing algorithm, and so only exist within the waveform generator.
iv. pitch contour
This is a target F0 contour (remember that we use the terms “F0” and “pitch” interchangeably in this field, even though only “F0” is technically correct). It is predicted from the text by the front end (e.g., by a succession of classification and regression trees).
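To make the distinction concrete, here is a minimal sketch that converts a frame-level F0 contour into pitch marks, inserting pseudo-pitch marks in the unvoiced regions. The 10 ms default period used for those pseudo-pitch marks is an arbitrary choice for illustration, not a value from any particular system.

```python
# A minimal sketch, assuming we already have a frame-level F0 contour (Hz)
# and voicing decisions (0 Hz = unvoiced).

frame_shift = 0.005          # seconds between F0 frames
default_period = 0.010       # arbitrary pseudo period for unvoiced regions
f0 = [0, 0, 120, 125, 130, 0, 0, 110, 115, 0]

pitch_marks = []
t = 0.0
end_time = len(f0) * frame_shift
while t < end_time:
    frame = min(int(t / frame_shift), len(f0) - 1)
    if f0[frame] > 0:
        period = 1.0 / f0[frame]        # real pitch mark, one per period
    else:
        period = default_period         # pseudo-pitch mark, unvoiced region
    pitch_marks.append(round(t, 4))
    t += period

print(pitch_marks)
```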
Annable K is correct – phone durations can be part of the linguistic specification because they are predicted from the text (e.g., with a regression tree).
The linguistic specification is what is passed from the front-end (text processor) to the waveform generator. So, you need to decide what is predicted in the front-end, and what is done by the waveform generator.
The linguistic specification can only include things that are predicted in the front end.
Yes, ii. is correct.
The range of possible F0 modification factors is quite limited in TD-PSOLA, before audible artefacts are produced.
TD-PSOLA operates on pitch periods, and so needs pseudo-pitch marks in unvoiced regions. A source-filter model can modify the duration of unvoiced sounds simply by inputting a longer random source signal. It doesn’t need to divide unvoiced speech regions into pseudo-pitch periods.
So, iii. and iv. are both true.
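Here is a minimal sketch of that last point: a source-filter model lengthens an unvoiced sound simply by generating a longer random excitation and passing it through the same filter. The filter coefficients below are made up for illustration; in practice they would be estimated (e.g., by LPC analysis).

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
fs = 16000
a = [1.0, -0.5, 0.3]        # stand-in all-pole (denominator) coefficients

def unvoiced(duration_s):
    excitation = rng.standard_normal(int(duration_s * fs))  # random source
    return lfilter([1.0], a, excitation)                    # same filter

original = unvoiced(0.10)    # 100 ms fricative-like sound
longer = unvoiced(0.15)      # 150 ms version: no pseudo-pitch periods needed

print(len(original), len(longer))
```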
Kamen D is correct.
Imagine a system that simply deleted all words: it always produces no words in the output (hypothesis). It never makes any insertion or substitution errors. This system should get a WER of 100%.
The formula in (b) would not work: it involves dividing by zero (since there are no words in the hypothesis).
The formula in (d) gives the correct value: for our system, the number of deletions is equal to the number of words in the reference, and so we get a value of 1 (in other words, a WER of 100%).
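A quick sanity check of this reasoning, assuming (d) is WER = (S + D + I) / N with N the number of reference words, and that (b) divides by the number of words in the hypothesis instead:

```python
# The "delete everything" system: no insertions, no substitutions,
# every reference word is deleted.

reference = "the cat sat on the mat".split()
hypothesis = []                       # the system outputs no words
assert len(hypothesis) == 0

substitutions, insertions = 0, 0
deletions = len(reference)            # every reference word is deleted

wer_d = (substitutions + deletions + insertions) / len(reference)
print(wer_d)                          # 1.0, i.e. a WER of 100%

# Formula (b) would divide by len(hypothesis) == 0: a division by zero.
```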
Kamen D is correct about i., ii. and iii.
Here, globally optimal means that the model we learn from data is the best possible model (of that form), given the data.
We cannot guarantee this for a CART, because we use a greedy training algorithm. We do not consider every possible tree (think about how many possible trees there are, for even a small set of questions), and therefore we cannot be sure that there isn’t a better tree than the one we have learned.
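As an illustration of what “greedy” means here, below is a minimal sketch of choosing a single split: we pick the one question with the best immediate impurity (entropy) reduction, and never consider whole trees. The features, questions and data points are made up.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Each item: (features, class label of the predictee). All values made up.
data = [({"stress": 1, "vowel": 1}, "long"),
        ({"stress": 1, "vowel": 0}, "short"),
        ({"stress": 0, "vowel": 1}, "short"),
        ({"stress": 0, "vowel": 0}, "short")]

questions = ["stress", "vowel"]

def best_split(items):
    parent = entropy([label for _, label in items])
    best = None
    for q in questions:
        yes = [label for feats, label in items if feats[q] == 1]
        no = [label for feats, label in items if feats[q] == 0]
        gain = parent - (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(items)
        if best is None or gain > best[1]:
            best = (q, gain)
    return best

# Greedy: take the locally best question and never revisit the choice,
# so there is no guarantee the resulting tree is globally optimal.
print(best_split(data))
```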
Kamen D’s answer is correct. Let’s look at the impossible answers:
ii. the number of classes (of the predictee) is a constant, so this cannot be used as a stopping criterion, since it doesn’t change as we grow the tree
iv. again, this doesn’t change as we grow the tree, so cannot be used as a stopping criterion
In DTW, there is no summing at all. We are only interested in the single best path, and so take the minimum path cost at every grid point.
Well, when using a probability distribution, we also need to compute the probability (or likelihood) of our observation under each of many possible density functions (e.g., one per class, or one per HMM state).
The overall complexity of computing a (log) probability and computing a Euclidean distance is essentially the same. We can see this in the formula for a Gaussian, where the key operation is the same (squared distance from the mean).
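To spell that out, here is the log density of a D-dimensional Gaussian, written with a single shared variance σ² (a simplifying assumption, just to make the comparison obvious):

$$\log \mathcal{N}(x;\,\mu,\,\sigma^2 I) \;=\; -\frac{1}{2\sigma^2}\,\lVert x-\mu\rVert^2 \;-\; \frac{D}{2}\log\!\left(2\pi\sigma^2\right)$$

The first term is just the squared Euclidean distance from the mean, scaled by a constant; the second term is a constant. So comparing observations (or paths) by log likelihood under a constant-variance Gaussian costs essentially the same as comparing them by Euclidean distance.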
Join discontinuity means the difference in any signal properties across a join (concatenation point).
Those signal properties include anything that is perceptually relevant, including (but not limited to) the spectrum.
So, spectral discontinuity is a sub-component of join discontinuity.
What other components can you think of?
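As one concrete example, here is a minimal sketch of the spectral sub-component, measured as the Euclidean distance between the spectral (e.g., MFCC) frames either side of the join. The frame values are made-up numbers, purely for illustration.

```python
import numpy as np

# Last spectral frame of the left unit and first frame of the right unit.
left_unit_last_frame = np.array([12.1, -3.4, 5.0, 0.7])
right_unit_first_frame = np.array([11.5, -2.9, 4.2, 1.1])

spectral_discontinuity = np.linalg.norm(left_unit_last_frame - right_unit_first_frame)

# Other sub-components of join discontinuity (e.g., F0 and energy) could be
# measured the same way and combined, for example as a weighted sum.
print(spectral_discontinuity)
```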
Jurafsky & Martin (J&M) include a figure showing the “classical” cepstrum, and this is what is confusing you. As you say, they fail to make a clear connection between this and MFCCs.
To clear this up, we need to distinguish between the classical cepstrum, and what actually happens in creating MFCCs.
Let’s start with the classical cepstrum, as in J&M’s Figure 9.14 (borrowed from Taylor, who gives a better explanation – read that if you can).
WARNING: in J&M’s Figure 9.14, the plots for (a) and (b) need to be swapped in order for the caption to be correct! My explanation below assumes you’ve corrected this figure.
The three subfigures illustrate the key stages of
(a) obtaining the spectrum from the waveform, using an FFT – in this domain, the source and filter are multiplied together
(b) taking the log of the spectrum, which makes the source and filter additive
(c) performing a series expansion (e.g., DCT) which “lays out” the different components of the log spectrum along an axis, such that the source components and filter components are in different places along that axis and can easily be separated. In J&M’s Figure 9.14(c) we can see the fundamental period as a small peak around the middle of the cepstrum.
There is no filterbank in the classical cepstrum, and no Mel-scaling of the frequency axis.
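Here is a minimal sketch of the classical cepstrum for one frame, following stages (a) to (c) above. I’ve used an inverse FFT as the series expansion (a DCT is another common choice), and the toy pulse-train frame is only there so that the fundamental-period peak shows up.

```python
import numpy as np

def classical_cepstrum(frame):
    spectrum = np.abs(np.fft.rfft(frame))        # (a) magnitude spectrum: source x filter (multiplied)
    log_spectrum = np.log(spectrum + 1e-10)      # (b) log: source and filter become additive
    cepstrum = np.fft.irfft(log_spectrum)        # (c) expansion: source and filter components end up
    return cepstrum                              #     in different regions of the quefrency axis

# A toy "voiced" frame: an impulse train, just to have something periodic.
fs = 16000
frame = np.zeros(512)
frame[::160] = 1.0                               # 100 Hz fundamental at 16 kHz
cep = classical_cepstrum(frame)

# The fundamental period shows up as a peak around quefrency = fs / F0 samples.
print(np.argmax(np.abs(cep[50:250])) + 50)       # expect something near 160
```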
Mel Frequency Cepstral Coefficients are inspired by the classical cepstrum and use the same key processing steps, plus one additional stage: a Mel-scaled filterbank. This happens after 9.14(a), so 9.14(b) becomes a smooth spectral envelope (no harmonics) on a Mel scale, and 9.14(c) would no longer have the small peak corresponding to the fundamental period.
The filterbank serves two purposes. First, it’s an easy way to warp the frequency scale from linear (Hertz) to a Mel scale, simply by placing the filter’s centre frequencies evenly apart on a Mel scale. Second, it’s an opportunity to smooth the spectrum and reduce the prominence of the harmonics – in other words, to produce a spectrum that contains less information about the source.
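And here is a corresponding minimal MFCC-style sketch: the same key steps, with the Mel-scaled filterbank inserted after the FFT. The numbers of filters and coefficients are typical choices, not from any particular recipe, and steps such as pre-emphasis, windowing and liftering are omitted; this is a sketch, not the exact processing of any toolkit.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters with centre frequencies evenly spaced on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def mfcc(frame, fs=16000, n_filters=26, n_ceps=13):
    power = np.abs(np.fft.rfft(frame)) ** 2             # (a) power spectrum
    fbank = mel_filterbank(n_filters, len(frame), fs)
    mel_energies = fbank @ power                         # Mel warping + smoothing away the harmonics
    log_mel = np.log(mel_energies + 1e-10)               # (b) log
    return dct(log_mel, type=2, norm="ortho")[:n_ceps]   # (c) DCT, keep the low-order coefficients

frame = np.random.default_rng(0).standard_normal(512)
print(mfcc(frame).shape)                                 # (13,)
```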
To summarise: J&M’s Figure 9.14(c) is the classical cepstrum and is not one of the stages on the way to MFCCs.