Forum Replies Created
For Speech Processing, we only cover within-sentence tokenisation using simple methods: manually created rules or regular expressions.
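A minimal sketch of regex-based within-sentence tokenisation (the pattern here is illustrative, not a pattern from the course materials):

```python
import re

# Hypothetical pattern: runs of word characters, or single punctuation marks.
# A real text-processing front end would use hand-tuned rules on top of this.
token_pattern = re.compile(r"\w+|[^\w\s]")

def tokenise(sentence):
    """Split one sentence into word and punctuation tokens."""
    return token_pattern.findall(sentence)

print(tokenise("Dr. Smith's talk starts at 3pm."))
```

Note that this simple pattern treats every period as a token, which is exactly why sentence segmentation is the harder, separate problem.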
Splitting a longer text into sentences is a different, and probably harder, problem, since it involves deciding, for example, whether each period marks the end of a sentence or something else, such as an abbreviation. This problem might be hard enough to require machine learning, such as a CART trained on manually segmented text. It is out of scope for this course.
Unit selection, by definition, involves searching amongst many unit sequences, regardless of the unit type.
PSOLA can be applied to the residual signal in Residual-Excited Linear Prediction (RELP). The residual is of course just a waveform, so it is in the time domain. But we usually reserve the term TD-PSOLA for application to speech waveforms only.
PSOLA doesn’t apply in the frequency domain.
Unit selection isn’t mentioned in this question, so the answer should be independent of that.
In all exam questions, read very carefully to see precisely what is being asked (and what is not being asked). A common mistake is to skim a question and assume you know what is being asked.
Specifically, exam questions this year might appear to be similar to past questions that you can remember, but the wording might be changed and so they might be asking a different question (even if the list of available answers is the same).
The term “linear prediction” means a source-filter model using a linear-predictive (=all pole) filter in combination with a source (normally a pulse train, for voiced speech).
Option ii. talks about the spectrum, not the spectrogram. A linear predictive filter approximates the vocal tract frequency response, and is responsible for imposing the spectral envelope on to the source spectrum. So one way to estimate the spectral envelope from a speech signal is to fit a linear predictive filter and then plot its frequency response.
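That procedure can be sketched in a few lines. This is a minimal illustration, not a production vocal-tract estimator: the predictor coefficients are found by the autocorrelation (Yule-Walker) method, and the synthetic frame below is a stand-in for a short voiced speech frame.

```python
import numpy as np

def lpc(x, order):
    """Fit an all-pole (linear predictive) filter: solve R a = r (Yule-Walker)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

fs = 16000  # assumed sampling rate in Hz
t = np.arange(0, 0.05, 1 / fs)
rng = np.random.default_rng(0)
# Illustrative frame: two "resonances" plus a little noise for numerical stability
frame = (np.sin(2 * np.pi * 500 * t)
         + 0.5 * np.sin(2 * np.pi * 1500 * t)
         + 0.01 * rng.standard_normal(t.size))

a = lpc(frame, order=8)
# Spectral envelope = magnitude response of the all-pole filter 1 / A(z),
# where A(z) = 1 - sum_k a_k z^{-k}
A = np.fft.rfft(np.concatenate(([1.0], -a)), n=1024)
envelope = 1.0 / np.abs(A)
```

Plotting `envelope` against frequency would show the smooth spectral envelope, with peaks at the resonances, rather than the fine harmonic structure of the source.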
The statement “voicing is caused by air flowing through the glottis” is always true. There is no other way to generate voicing.
But the statement “air flowing through the glottis causes voicing” would not be correct all the time: air can flow through an open glottis without setting the vocal folds into vibration, as in whispering or voiceless sounds.
Declination is the gradual decline of F0 over an utterance – see Figure 8.12 in J&M 8.3.
The frequency range of the spectrogram (its minimum and maximum values), as you nearly said, is determined only by the sampling rate of the waveform: the maximum is the Nyquist frequency, which is half the sampling rate. The sampling rate is not specified in the question, but you assumed it to be 16 kHz.
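The arithmetic, under that assumed 16 kHz sampling rate:

```python
fs = 16000           # assumed sampling rate in Hz (not given in the question)
nyquist = fs / 2     # maximum frequency shown in the spectrogram
print(nyquist)       # 8000.0 Hz
```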
Speaking rate is indeed “how fast” the speech is: it’s commonly measured in syllables per second.
The question talks about a single period of the waveform. If you took the spectrum of this single period, what would that look like? Would you see formants? Would you see harmonics?
There are many variants on path weightings. We don’t need to get lost in those details – just aim to understand dynamic programming and its application to aligning two sequences of differing lengths.
The reason for the double weighting on diagonal paths is that the alternative route against which a diagonal step is compared involves summing the costs of one horizontal step and one vertical step; without the double weighting, the comparison would be “unfair”.
This problem of correctly weighting paths arises because there is no modelling of duration. In an HMM, there is a (very simple) duration model: the transition probabilities.
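The double weighting can be sketched as a small dynamic programming alignment. This is an illustrative toy (scalar sequences, absolute-difference local cost), not the exact formulation from the course:

```python
import numpy as np

def dtw(x, y):
    """Dynamic programming alignment of two sequences of differing lengths.

    A diagonal step costs twice the local distance, so it is compared fairly
    against the horizontal-plus-vertical pair of steps it replaces.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])          # local distance
            D[i, j] = min(D[i - 1, j] + d,        # vertical step
                          D[i, j - 1] + d,        # horizontal step
                          D[i - 1, j - 1] + 2 * d)  # diagonal, double weighted
    return D[n, m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # the extra 2 is absorbed at zero cost
```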
The question isn’t asking you whether “regular expression counts as one type of handcrafted rule”. You just need to decide which of those four techniques are typically used.
Correct. Any local ascent algorithm, such as EM (or gradient-based backpropagation for training a neural network – out of scope for Speech Processing), can only find a local optimum.
The Baum-Welch algorithm used for training HMMs maximises the likelihood only in the sense that it finds a local maximum: it only guarantees that there is no small change in the model’s parameters that would further increase the likelihood of the training data.
Pseudo pitch marks are simply uniformly spaced at some constant value of T0, such as 0.01 s (which is an F0 of 100 Hz).
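A tiny sketch of that arithmetic, with an illustrative region length:

```python
# Pseudo pitch marks at a constant period T0 (values here are illustrative).
T0 = 0.01                       # period in seconds -> F0 = 1 / T0 = 100 Hz
duration = 0.05                 # hypothetical unvoiced-region length, seconds
n_marks = round(duration / T0) + 1
marks = [round(i * T0, 6) for i in range(n_marks)]
print(marks)                    # [0.0, 0.01, 0.02, 0.03, 0.04, 0.05]
```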
Declination in F0 is caused by the speaker’s lungs gradually emptying and therefore producing lower pressure and airflow through the glottis. This decreases the rate of vibration of the vocal folds.
EM is a local ascent algorithm: each iteration is guaranteed not to decrease the likelihood, so it can only find a local maximum.
(Very few models have a training algorithm that guarantees finding the global optimum.)