Forum Replies Created
The problem should now be solved.
Technical details: videos were being served from the subdomain media.speech.zone, which was not covered by the speech.zone SSL certificate. Chrome is stricter than other browsers and refused to load anything from that subdomain. Videos are now served from the primary domain.

Good catch! Not sure what happened there, but the first video (“Why? When? Which aspects?”) is now re-instated. There are 4 videos in Module 5.
The forward algorithm computes P(O|W) correctly by summing all terms. The Viterbi algorithm computes an approximation of P(O|W) by finding only the largest term (= most likely path) in the sum.
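As a toy illustration (made-up numbers, not from any real model), here is the sum-versus-max distinction on a tiny 2-state HMM: the forward recursion sums the incoming path likelihoods, while the Viterbi recursion keeps only the largest.

```python
import numpy as np

# Hypothetical 2-state HMM, purely to illustrate sum vs max.
# a[i, j]: transition probability from state i to state j
# b[j, t]: probability of the observation at time t given state j
a = np.array([[0.6, 0.4],
              [0.0, 1.0]])
b = np.array([[0.9, 0.7, 0.1],
              [0.1, 0.3, 0.8]])
pi = np.array([1.0, 0.0])      # always start in state 0

T = b.shape[1]
alpha = pi * b[:, 0]           # forward: sum over incoming paths
delta = pi * b[:, 0]           # Viterbi: max over incoming paths
for t in range(1, T):
    alpha = (alpha @ a) * b[:, t]
    delta = (delta[:, None] * a).max(axis=0) * b[:, t]

p_forward = alpha.sum()        # exact P(O|W): sum over all paths
p_viterbi = delta.max()        # likelihood of the single most likely path
print(p_forward, p_viterbi)
assert p_viterbi <= p_forward  # Viterbi is always a lower bound
```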
Token Passing uses a much smaller data structure: the HMM (= a finite state model) itself, which is one “column” (= all model states at a particular time) of the lattice.
So, Token Passing is equivalent to working with the lattice whilst only ever needing one column of it in memory at any given time.
Token Passing is a time-synchronous algorithm – all paths (= tokens) are extended forwards in time at once.
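A minimal sketch of time-synchronous Token Passing, assuming a single left-to-right HMM and hypothetical log-probability inputs (a real recogniser would use trained emission densities). Note that only one “column” of tokens, one per model state, is ever held in memory:

```python
import math

# Sketch only: hypothetical log-probability inputs, not a real decoder.
NEG_INF = float("-inf")

def token_pass(log_trans, log_emit, T):
    """log_trans[i][j]: log transition probability from state i to j.
    log_emit(j, t): log emission probability of the observation at time t in state j.
    Only one 'column' of tokens (one per state) is kept in memory."""
    n = len(log_trans)
    tokens = [NEG_INF] * n
    tokens[0] = log_emit(0, 0)          # the token enters the first state
    for t in range(1, T):
        new_tokens = [NEG_INF] * n
        for j in range(n):              # each state keeps only its best incoming token
            best = max(tokens[i] + log_trans[i][j] for i in range(n))
            new_tokens[j] = best + log_emit(j, t)
        tokens = new_tokens             # the previous column is discarded
    return tokens[-1]                   # best-path log likelihood in the final state
```

All tokens are advanced by one time step together, which is exactly what “time-synchronous” means.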
[Everything below this point is beyond the scope of Speech Processing]
There are non-time-synchronous algorithms. Working on the full lattice would allow paths to be extended over states, or over time, in many different ways. When combined with beam pruning, a clever ordering of the search can lead to doing fewer computations overall. This becomes important for large vocabulary connected speech recognition (LVCSR).
But we then also have the problem that the lattice is too big to construct in memory, so we create only the parts of it that we are going to search. Historical footnote: my Masters dissertation was an implementation of a stack decoder performing A* search; this avoids constructing the lattice in memory, whilst searching it in “best first” order.
In HTK, HVite does Token Passing, which becomes inefficient for LVCSR. For LVCSR, HDecode is much more sophisticated and efficient.

We could go from uniform segmentation to Baum-Welch, skipping Viterbi training. In fact, we could even go from the prototype model directly to Baum-Welch.
Baum-Welch is an iterative algorithm that gradually changes the model parameters to maximise the likelihood of the training data. The only proof we have for this algorithm is that each iteration increases (in fact, does not decrease) the likelihood of the training data. There is no way to know whether the final model parameters are globally-optimal.
This type of algorithm is sensitive to the initial model parameters. The final model we get could be different, depending on where we start from. This is also true for Viterbi training.
So, we use uniform segmentation to get a much better model than the prototype (which has zero mean and unit variance for all Gaussians in all states). Starting from this model should give a better final model than starting from the prototype.
The model from uniform segmentation is used as the initial model for Viterbi training, which in turn provides the initial model for Baum-Welch.
Another reason to perform training in these phases is that Viterbi training is faster than Baum-Welch and will get us quite close to the final model. This reduces the number of iterations of Baum-Welch that are needed.
The Viterbi algorithm is not a greedy algorithm. It performs a global optimisation and guarantees to find the most likely state sequence, by exploring all possible state sequences.
An example of a greedy algorithm is the one for training a CART. Unlike the Viterbi algorithm, this algorithm does not explore all possibilities: it takes a sequence of local, hard decisions (about the best question to split the data) and does not backtrack. It does not guarantee to find the globally-optimal tree.
Orthogonal means that the correlation between any pair of basis functions is zero. This property is necessary for there to be a unique set of coefficients when we analyse a given signal using a set of basis functions.
Think of it as “there is no amount of one basis function contained in any other one”, or as “if we analyse a signal that happens to be the same as one of the basis functions, that coefficient and that coefficient only will be non-zero”.
Yes, the similarity between a basis function and the signal being analysed is computed by multiplying them sample-by-sample and summing up.
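To make this concrete, here is a quick check (using NumPy, with an arbitrary frame length) that cosine basis functions like those of the DFT are orthogonal: multiplying any two different ones sample-by-sample and summing gives zero, while each one has a non-zero “similarity” with itself.

```python
import numpy as np

# Orthogonality of cosine basis functions over one analysis frame.
N = 64                                  # arbitrary frame length for illustration
n = np.arange(N)
basis = [np.cos(2 * np.pi * k * n / N) for k in range(1, 5)]

for i, bi in enumerate(basis):
    for j, bj in enumerate(basis):
        dot = np.dot(bi, bj)            # multiply sample-by-sample and sum
        if i == j:
            assert abs(dot - N / 2) < 1e-9   # non-zero only with itself
        else:
            assert abs(dot) < 1e-9           # zero for any different pair
```

This is also the “if the signal is one of the basis functions, only that coefficient is non-zero” property: analysing `basis[2]` against the set gives a non-zero result for `basis[2]` alone.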
The tapered window is applied to the extracted pitch period units. You can think of it as being applied as we overlap-add them back together. Each extracted unit, and therefore its tapered window, is centred on a pitch mark (= an epoch).
December 14, 2020 at 11:08 in reply to: correlation between MFCCs and thier deltas and delta-deltas #13650

There will be some correlation between the MFCCs and their deltas, yes. But, empirically, we still obtain a benefit (i.e., lower WER) by adding them even if this covariance is not modelled.
Your answer for a) i. is correct and the diagrams are fine, although you use an unrealistic example F0 of 5 Hz.
Your answer for a) ii. is basically correct, except that instead of discarding the pitch periods at the end (which is equivalent to truncating the signal), we would distribute the discarding across the duration of the signal (e.g., discard every 3rd period) so that we retain the complete spectral change from the start to the end.
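A sketch of the idea of distributing the discards (with hypothetical numbered units standing in for actual pitch-period waveforms):

```python
# Shorten a sequence of pitch-period units by discarding every 3rd unit,
# spread across the whole signal rather than truncating the end.
units = list(range(12))                 # 12 pitch periods, labelled 0..11
kept = [u for i, u in enumerate(units) if (i + 1) % 3 != 0]
print(kept)   # units from both the start and the end survive,
              # so the complete spectral change is retained
```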
Overall, this is a strong answer and would get a good mark.
Yes, “dimensionality reduction” and “ability to control the feature vector dimension” are two aspects of the same thing. Truncating the cepstrum is a very well-motivated way to choose the dimensionality of the resulting feature vector because the cepstral coefficients have a meaningful order, with the lower ones being more important than the higher ones.
Smaller feature vectors are a good idea when modelling with Gaussians because Gaussians in high dimensions have more parameters to estimate (and don’t work very well in practice). So the motivation for keeping feature dimension low is not the speed of computation but the number of model parameters.
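To see why, here is the parameter count for a single Gaussian as a function of the feature dimension d: diagonal covariance needs 2d parameters (mean plus variances), while full covariance needs d + d(d+1)/2, which grows quadratically.

```python
# Parameters in one Gaussian of dimension d.
def gaussian_params(d, full_covariance=False):
    return d + (d * (d + 1) // 2 if full_covariance else d)

# Example dimensions: 13 (e.g. MFCCs alone) vs 39 (with deltas and delta-deltas).
for d in (13, 39):
    print(d, gaussian_params(d), gaussian_params(d, full_covariance=True))
```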
Sampling rate of 16 kHz and 25 ms frame duration means 16000*0.025 = 400 samples in the analysis frame. So the DFT produces 200 magnitudes and 200 phases, of which we discard the phases leaving 200 DFT coefficients.
We didn’t cover this explicitly and it would not be examinable, but an FFT is a restricted case of the DFT which requires the analysis frame to contain a power of 2 number of samples (256, 512, 1024, etc). In the above case, we would need to perform the FFT on 512 waveform samples and not 400.
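The arithmetic above can be checked directly (assuming, as is common, that the frame would be padded out to the FFT length):

```python
# Samples per frame, DFT output size, and the FFT length.
fs = 16000                  # sampling rate in Hz
frame = 0.025               # frame duration in seconds
n_samples = int(fs * frame)
print(n_samples)            # 400 samples in the analysis frame
print(n_samples // 2)       # 200 magnitudes (and 200 phases, which we discard)

fft_size = 1                # smallest power of 2 that fits the frame
while fft_size < n_samples:
    fft_size *= 2
print(fft_size)             # 512, so the FFT operates on 512 samples, not 400
```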
The sum over all paths computed by the forward algorithm will include the single most likely path, the second most likely, and so on…
One way to understand Viterbi is that we can approximate a sum of many terms (here, path likelihoods) by just taking the largest term. This seems fine if we assume that the largest term is much bigger than all the rest. Try adding these numbers together as quickly as you can:
362823 + 2321 + 123 + 32 + 21 + 14 + 8 + 3 + 1
You could give a very quick answer: “The sum is about 362823”. You would be pretty close.
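You can verify how good this approximation is on the actual numbers:

```python
# Approximating a sum of many terms by its largest term.
terms = [362823, 2321, 123, 32, 21, 14, 8, 3, 1]
exact = sum(terms)
approx = max(terms)
print(exact, approx, approx / exact)   # the largest term alone is within ~1% of the sum
```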
If that argument isn’t entirely convincing, then another way to understand Viterbi, at least for use in recognition, is to ask you to compute which of these sums will result in the largest value, as quickly as you can:
362823 + 12321 + 9123 + 632 + 321 + 14 + 12 + 3 + 1
221344 + 13234 + 1023 + 332 + 211 + 47 + 11 + 4 + 2
and you could do that just by looking at the largest term in each sum and comparing those, again avoiding actually doing the summation.
Yes, this is a reasonable style for an exam answer – you don’t need to write a perfectly-crafted essay under exam conditions. Brief points, even bullet points, are OK providing you communicate your understanding and show your reasoning. Mere fact-recall will only get you a certain mark – to get full marks you need to go beyond that and give full explanations.
A diagram would be a nice way to convey the first part of your answer – draw the pipeline. For the online exam in Gradescope, it is essential to draw diagrams – don’t only write purely-textual answers in a word processor.
Methods for performing POS tagging are not in-scope for Speech Processing. You can just assume that it is possible to tag text with POS tags very accurately, at least for any language with enough training data (which would be hand-tagged text).