Forum Replies Created
December 14, 2020 at 11:08 in reply to: correlation between MFCCs and their deltas and delta-deltas #13650
There will be some correlation between the MFCCs and their deltas, yes. But, empirically, we still obtain a benefit (i.e., lower WER) by adding them even if this covariance is not modelled.
Your answer for a) i. is correct and the diagrams are fine, although you use an unrealistically low example F0 of 5 Hz.
Your answer for a) ii. is basically correct, except that instead of discarding the pitch periods at the end (which is equivalent to truncating the signal), we would distribute the discarding across the duration of the signal (e.g., discard every 3rd period) so that we retain the complete spectral change from the start to the end.
Overall, this is a strong answer and would get a good mark.
Yes, “dimensionality reduction” and “ability to control the feature vector dimension” are two aspects of the same thing. Truncating the cepstrum is a very well-motivated way to choose the dimensionality of the resulting feature vector because the cepstral coefficients have a meaningful order, with the lower ones being more important than the higher ones.
Smaller feature vectors are a good idea when modelling with Gaussians because Gaussians in high dimensions have more parameters to estimate (and don’t work very well in practice). So the motivation for keeping feature dimension low is not the speed of computation but the number of model parameters.
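If it helps to see the numbers, here's a quick Python sketch of how the parameter count grows with feature dimension (the function name and the example dimensions of 13 MFCCs vs 200 DFT magnitudes are my own illustration):

```python
def gaussian_param_count(d, full_covariance=True):
    """Number of parameters in one d-dimensional Gaussian:
    d mean values, plus either a full covariance matrix
    (d*(d+1)//2 unique entries, since it is symmetric)
    or a diagonal covariance (d variances)."""
    if full_covariance:
        return d + d * (d + 1) // 2
    return d + d

# 13 MFCCs vs 200 raw DFT magnitudes, full covariance:
print(gaussian_param_count(13))   # 104
print(gaussian_param_count(200))  # 20300
```

The quadratic growth of the covariance term is exactly why we want a low feature dimension (and why diagonal covariance is common).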
Sampling rate of 16 kHz and 25 ms frame duration means 16000*0.025 = 400 samples in the analysis frame. So the DFT produces 200 magnitudes and 200 phases, of which we discard the phases leaving 200 DFT coefficients.
We didn’t cover this explicitly and it would not be examinable, but an FFT is a restricted case of the DFT which requires the analysis frame to contain a power of 2 number of samples (256, 512, 1024, etc). In the above case, we would need to perform the FFT on 512 waveform samples and not 400.
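The arithmetic above can be checked in a few lines of Python (a sketch; the variable names are my own):

```python
import math

sample_rate = 16000       # Hz
frame_duration = 0.025    # 25 ms

n_samples = int(sample_rate * frame_duration)
print(n_samples)          # 400 samples in the analysis frame

# A radix-2 FFT needs a power-of-2 frame length, so we would
# zero-pad the 400 samples up to the next power of 2:
n_fft = 2 ** math.ceil(math.log2(n_samples))
print(n_fft)              # 512
```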
The sum over all paths computed by the forward algorithm will include the single most likely path, the second most likely, and so on…
One way to understand Viterbi is that we can approximate a sum of many terms (here, path likelihoods) by just taking the largest term. This seems fine if we assume that the largest term is much bigger than all the rest. Try adding these numbers together as quickly as you can:
362823 + 2321 + 123 + 32 + 21 + 14 + 8 + 3 + 1
You could give a very quick answer: “The sum is about 362823”. You would be pretty close.
If that argument isn’t entirely convincing, then another way to understand Viterbi, at least for use in recognition, is to ask you to compute which of these sums will result in the largest value, as quickly as you can:
362823 + 12321 + 9123 + 632 + 321 + 14 + 12 + 3 + 1
221344 + 13234 + 1023 + 332 + 211 + 47 + 11 + 4 + 2
and you could do that just by looking at the largest term in each sum and comparing those, again avoiding actually doing the summation.
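You can verify both arguments with a couple of lines of Python (using the numbers from above):

```python
terms = [362823, 2321, 123, 32, 21, 14, 8, 3, 1]
print(sum(terms))  # the exact sum: 365346
print(max(terms))  # the Viterbi-style approximation: 362823

# For recognition we only need to know which sum is larger,
# and comparing the largest terms gives the same answer here:
a = [362823, 12321, 9123, 632, 321, 14, 12, 3, 1]
b = [221344, 13234, 1023, 332, 211, 47, 11, 4, 2]
print((sum(a) > sum(b)) == (max(a) > max(b)))  # True
```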
Yes, this is a reasonable style for an exam answer – you don’t need to write a perfectly-crafted essay under exam conditions. Brief points, even bullet points, are OK providing you communicate your understanding and show your reasoning. Mere fact-recall will only get you a certain mark – to get full marks you need to go beyond that and give full explanations.
A diagram would be a nice way to convey the first part of your answer – draw the pipeline. For the online exam in Gradescope, it is essential to draw diagrams – don’t only write purely-textual answers in a word processor.
Methods for performing POS tagging are not in-scope for Speech Processing. You can just assume that it is possible to tag text with POS tags very accurately, at least for any language with enough training data (which would be hand-tagged text).
Between you, you have the main points:
dimensionality reduction
reduction of covariance
elimination of evidence of F0
perceptual weighting (Mel scale)
Any three of those would be a good answer, along with explanations of why these are advantageous. An excellent answer would briefly indicate how each of them is achieved for MFCCs and why FFT coefficients don’t have that property (e.g., state the FFT dimension for speech sampled at a typical sampling rate with a typical analysis frame duration).
TD-PSOLA cannot modify the spectral envelope, so it cannot remove spectral discontinuities at joins (e.g., between diphones). d) is correct.
Baum-Welch does the correct computation. This is “by definition” – because the model has a hidden state sequence, the correct thing to do is integrate out that random variable = sum over all values it can take.
Baum-Welch provides a better estimate of the model parameters than Viterbi, in both a theoretical and empirical sense.
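To make the "sum over all paths vs single best path" distinction concrete, here's a toy sketch of the forward and Viterbi computations on an invented 2-state HMM (all the probabilities are made up for illustration; this is not course code):

```python
# Toy 2-state HMM observed for 3 frames. All numbers invented.
init = [0.6, 0.4]          # initial state probabilities
A = [[0.7, 0.3],           # transition probabilities A[i][j]
     [0.4, 0.6]]
b = [[0.9, 0.2],           # b[t][s]: likelihood of observation t
     [0.8, 0.3],           # in state s
     [0.1, 0.7]]

def forward(init, A, b):
    """Sum over all state sequences (what Baum-Welch uses)."""
    alpha = [p * b[0][s] for s, p in enumerate(init)]
    for t in range(1, len(b)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(A))) * b[t][j]
                 for j in range(len(A))]
    return sum(alpha)

def viterbi(init, A, b):
    """Max over all state sequences (the single best path)."""
    delta = [p * b[0][s] for s, p in enumerate(init)]
    for t in range(1, len(b)):
        delta = [max(delta[i] * A[i][j] for i in range(len(A))) * b[t][j]
                 for j in range(len(A))]
    return max(delta)

# The sum over all paths can never be smaller than the best path:
print(forward(init, A, b) >= viterbi(init, A, b))  # True
```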
An accent can only occur when the word is spoken. An accent could be realised, for example, by some movement in F0 or increased energy. Lexically-stressed syllables are candidates that might (but might not) receive an accent when spoken aloud.
The full details of linear prediction (LPC) are not part of the 2020-21 version of Speech Processing, although we did cover the form of the filter (the difference equation).
There are two sub-parts within b):
What problems arise…
TD-PSOLA first divides a speech waveform into pitch periods (= fundamental periods) each of which we assume to be the impulse response of the vocal tract filter. Then it reconstructs a modified waveform using overlap-add. There are potential problems in both steps.
Problems created when the original speech is divided into pitch periods: TD-PSOLA assumes that the impulse response of the vocal tract lasts for a shorter time than T0, which is not always true. So, the extracted pitch periods are not exactly individual impulse responses: each one contains something of the preceding impulse response. This is made worse by the fact that, in practice, we need to extract units of duration 2xT0 to allow for reducing F0 (see below).
Problems created when we use overlap-add to reconstruct the modified waveform – F0 modification: When F0 is increased a lot (T0 is decreased), we heavily overlap the pitch periods. Because the pitch periods are not “clean” and “isolated” impulse responses, this causes distortion. When F0 is reduced a lot (T0 is increased), eventually there will be gaps in-between the pitch periods – if we have extracted units of duration 2xT0 this will occur when we attempt to modify F0 to half of the original or less.
Problems created when we use overlap-add to reconstruct the modified waveform – duration modification: extreme amounts of modification involve either deleting many pitch periods, or duplicating many. Deleting many will disrupt the slowly-changing nature of the vocal tract impulse response. Duplicating many will result in a signal that effectively has a constant vocal tract frequency response for multiple consecutive pitch periods; this can sound unnatural.
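The "gaps when F0 is halved" point is easy to check with a little bookkeeping sketch (my own illustration, not course code): each extracted unit spans 2xT0 samples, so once the new period exceeds 2xT0 no unit can reach the next one.

```python
# Hypothetical illustration of the gap problem in TD-PSOLA.
T0 = 80               # original fundamental period, in samples
unit_length = 2 * T0  # each extracted unit spans 2*T0 samples

def has_gaps(T0_new, n_units=10):
    """Place units at the new spacing and check coverage."""
    covered = [False] * (T0_new * n_units + unit_length)
    for k in range(n_units):
        start = k * T0_new
        for i in range(start, start + unit_length):
            covered[i] = True
    return not all(covered[:T0_new * n_units])

print(has_gaps(120))  # False: units still overlap (F0 lowered a bit)
print(has_gaps(200))  # True: F0 lowered below half, so gaps appear
```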
How does linear predictive…
Assuming that we perfectly separate source and filter (it’s a big assumption, but the question doesn’t ask us to discuss that), we are now free to make any amount of modification to F0 and duration.
F0 modification is straightforward: simply adjust the frequency of the impulse train. Since (we assume) source and filter are perfectly separated, this will be just like making natural speech with a different frequency of vocal fold vibration, going through the vocal tract filter. The problems of TD-PSOLA either overlapping imperfect impulse responses too much, or leaving gaps in-between, don’t arise because the linear predictive filter produces the correct impulse response for every input impulse and they combine in the right way (convolution).
Duration modification is straightforward: simply adjust the duration of the impulse train. The problem of the vocal tract filter remaining piecewise constant when TD-PSOLA increases duration by a large amount can be overcome by interpolating the filter coefficients so that they change gradually and are never constant.
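Since we did cover the form of the filter (the difference equation), here's a minimal sketch of the source-filter idea: the same all-pole filter excited by impulse trains of two different periods. The coefficients are invented purely to give a stable resonance; this is an illustration, not course code.

```python
# Difference equation of a 2nd-order all-pole filter:
#   y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
a = [1.6, -0.9]   # hypothetical predictor coefficients (one resonance)

def synthesise(period, n_samples):
    """Excite the filter with an impulse train of the given period."""
    x = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    y = [0.0, 0.0]  # initial conditions
    for n in range(n_samples):
        y.append(x[n] + a[0] * y[-1] + a[1] * y[-2])
    return y[2:]

low_f0  = synthesise(period=160, n_samples=800)  # longer period
high_f0 = synthesise(period=80,  n_samples=800)  # shorter period
# Same filter, different excitation period: F0 changes, but the
# filter produces its full impulse response for every impulse.
```

Because the filter is applied by the difference equation itself, the impulse responses combine by convolution automatically: there is nothing to overlap-add and no gaps can arise.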
This answer really needs to be in the form of a diagram. Try uploading one here, annotated with the terms you use in your written explanation, and I’ll check it for you.
In general, females have shorter vocal tracts than males, and therefore higher formant frequencies. So iii. is true.
Harmonics are at multiples of F0. Since female speech has generally a higher F0 than male speech, the harmonics will be at multiples of a higher F0 = more widely spaced. So i. is also true.
The signal in Ladefoged Fig 6.2 could be generated by passing an impulse train with a fundamental frequency of 200 Hz through a filter which only passes through frequencies in the range 1800 Hz to 2200 Hz. For example, a filter with a single resonance at 2000 Hz and a narrow bandwidth.
The harmonics in the filtered signal are still at integer multiples of the fundamental. The fundamental frequency of the filtered signal is still 200 Hz even though there is no harmonic at that frequency.
The filter cannot change the fundamental frequency. It can only modify the spectral envelope = it can only change the amplitudes of harmonics, not their frequencies.
One interesting consequence of this is that we perceive such signals as having a pitch equal to their fundamental frequency, even if there is no energy at that frequency. Our perception of pitch is based not simply on identifying the fundamental, but on the harmonic structure.
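You can convince yourself of this numerically: a signal made only of components at 1800, 2000 and 2200 Hz (equal invented amplitudes here) still repeats every 1/200 s, because 200 Hz is the greatest common divisor of those frequencies. A quick sketch:

```python
import math

def x(t):
    """Harmonics 9, 10 and 11 of 200 Hz -- no energy at 200 Hz itself."""
    return (math.cos(2 * math.pi * 1800 * t)
            + math.cos(2 * math.pi * 2000 * t)
            + math.cos(2 * math.pi * 2200 * t))

T0 = 1 / 200  # fundamental period of the filtered signal
samples = [x(n / 16000) for n in range(80)]       # one period at 16 kHz
shifted = [x(n / 16000 + T0) for n in range(80)]  # one period later
print(all(abs(s - t) < 1e-9 for s, t in zip(samples, shifted)))  # True
```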
Yes, you are correct: “topology” just means the shape of the Hidden Markov Model (HMM) = how many states it has and what transitions between them are possible.
For modelling speech, a left-to-right topology is the correct choice. Speech does not time-reverse, the phones in a word must appear in the correct order, etc.
For speech, we do not generally use “parallel path” HMMs, which have transitions that allow some states to be skipped. We use strictly left-to-right models in which the only valid paths pass through all the emitting states in order.
The only exception to this might be an HMM for noise or silence in which we might add some other transitions, or connect all emitting states with all other emitting states with transitions in both directions to make an ergodic HMM.
So, in the general case, an HMM could have transitions between any pair of states, including self-transitions. That’s why, when we derive algorithms for doing computations with HMMs, we must consider all possible transitions and not restrict ourselves to a left-to-right topology.
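As a concrete picture, the topology is visible directly in the transition matrix. Here are hypothetical matrices for 3 emitting states (the probabilities are invented): a strictly left-to-right model has zeros everywhere below the diagonal, while an ergodic model has none.

```python
# Left-to-right: each state may only self-loop or move forward.
left_to_right = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 0.5],   # remaining 0.5 exits the model
]

# Ergodic (e.g. a noise/silence model): every state connects to
# every state, in both directions, including itself.
ergodic = [
    [0.4, 0.3, 0.3],
    [0.3, 0.4, 0.3],
    [0.3, 0.3, 0.4],
]

def is_left_to_right(A):
    """True if no transition goes backwards in state order."""
    return all(A[i][j] == 0.0 for i in range(len(A)) for j in range(i))

print(is_left_to_right(left_to_right))  # True
print(is_left_to_right(ergodic))        # False
```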