Forum Replies Created
Correct. Any gradient ascent algorithm such as EM (or backpropagation for training a neural network – out of scope for Speech Processing) can only find a local optimum.
The Baum-Welch algorithm used for training HMMs maximises the likelihood only in the sense that it finds a local maximum: it only guarantees that there is no small change in the model’s parameters that would further increase the likelihood of the training data.
Pseudo pitch marks are simply uniformly spaced at some constant value of T0, such as 0.01 s (which corresponds to an F0 of 100 Hz).
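For example, a minimal sketch in Python of placing pseudo pitch marks at a fixed T0 (the utterance duration here is made up):

```python
import numpy as np

# Minimal sketch: place pseudo pitch marks at a fixed period T0.
# The values are illustrative, not from any particular system.
T0 = 0.01            # pitch period in seconds (i.e. F0 = 1 / T0 = 100 Hz)
duration = 2.0       # length of the utterance in seconds (made up)

# Uniformly spaced mark times: 0.00, 0.01, 0.02, ...
pitch_marks = np.arange(0.0, duration, T0)

print(pitch_marks[:5])   # [0.   0.01 0.02 0.03 0.04]
```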
Declination in F0 is caused by the speaker’s lungs gradually emptying and therefore producing lower pressure and airflow through the glottis. This decreases the rate of vibration of the vocal folds.
EM is a gradient ascent algorithm. It can only find a local maximum.
(Very few models have a training algorithm that guarantees finding the global optimum.)
Read the question carefully and you will see it is asking you to compare diphone and phone units and say why diphones are preferred. It is not asking you simply to say which of i.-iv. are true.
Because iv. is true for both phones and diphones, it cannot be a reason to prefer one over the other.
There are unit inventories for which there is not a unique unit sequence. Some early systems had a mixed inventory of phones, diphones, demi-syllables and larger units.
The transition probabilities are also trained. We didn’t cover how this is done in class, and it’s out of scope for the course, but it’s quite simple:
In Viterbi-style training, simply count how many times each transition is taken (when the model uses the most likely state sequence to generate the training data) and finally normalise to make the probabilities of all transitions out of a state sum to 1.
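If it helps, here is a rough sketch of that counting-and-normalising step in Python. The state sequences are made-up toy data; a real implementation would of course obtain them from the Viterbi alignment of the model against the training data:

```python
from collections import defaultdict

# Sketch of Viterbi-style re-estimation of transition probabilities.
# Toy state sequences stand in for the most likely state sequences
# found by the Viterbi algorithm for each training utterance.
state_sequences = [
    [1, 1, 2, 2, 2, 3],
    [1, 2, 2, 3, 3, 3],
]

# Count how many times each transition i -> j is taken.
counts = defaultdict(lambda: defaultdict(int))
for seq in state_sequences:
    for i, j in zip(seq, seq[1:]):
        counts[i][j] += 1

# Normalise so the probabilities of all transitions out of a state sum to 1.
trans_prob = {
    i: {j: c / sum(outgoing.values()) for j, c in outgoing.items()}
    for i, outgoing in counts.items()
}

print(trans_prob)
# approximately: state 1: {1: 0.33, 2: 0.67}, state 2: {2: 0.6, 3: 0.4}, state 3: {3: 1.0}
```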
This slide is talking about the most basic form of waveform concatenation synthesis, in which we store only one example of each unit type:
inventory = the set of stored waveform units
- using phonemes as the type would require only around 45 stored waveform units
- diphones would require ~2000 stored waveform units
- and so on… larger units generally have more types
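A back-of-the-envelope count (the figures are only approximate):

```python
# Approximate count of unit types (illustrative numbers).
num_phones = 45
num_diphones = num_phones ** 2   # upper bound: every phone-to-phone pair

print(num_phones)     # 45 stored units if the type is the phone
print(num_diphones)   # 2025; ~2000 in practice, since some pairs never occur
```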
But similar arguments apply to unit selection:
inventory = the set of unique unit types
database = the stored waveforms (usually complete natural utterances) from which units are selected; multiple instances of each unit type are available
Unit selection involves searching all possible candidate unit sequences to find the best-sounding sequence. Even using dynamic programming, this will involve significant computation.
As the database of speech to draw candidates from increases in size, the number of available candidates increases in proportion, but the number of possible sequences increases exponentially.
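For anyone curious, here is a minimal sketch of that dynamic-programming search in Python. The target and join cost functions are just placeholders operating on numbers; a real system computes the target cost from linguistic features and the join cost from acoustic mismatch at the concatenation point:

```python
# Minimal sketch of unit-selection search with dynamic programming.
# The cost functions below are placeholders, not those of any real system.

def target_cost(target_spec, candidate):
    return abs(target_spec - candidate)            # placeholder

def join_cost(left_candidate, right_candidate):
    return abs(left_candidate - right_candidate)   # placeholder

def unit_selection(target_specs, candidates_per_target):
    """Find the candidate sequence with minimum total cost.

    candidates_per_target[t] is the list of candidate units for target t.
    Brute force would examine every possible sequence (exponential in the
    number of targets); dynamic programming keeps only the best partial
    cost for each candidate at each step.
    """
    # best[c] = lowest cost of any partial sequence ending in candidate c
    best = [target_cost(target_specs[0], c) for c in candidates_per_target[0]]
    back = []

    for t in range(1, len(target_specs)):
        new_best, pointers = [], []
        for c in candidates_per_target[t]:
            costs = [
                best[p] + join_cost(prev, c)
                for p, prev in enumerate(candidates_per_target[t - 1])
            ]
            p_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[p_min] + target_cost(target_specs[t], c))
            pointers.append(p_min)
        best, back = new_best, back + [pointers]

    # Trace back the best path from the cheapest final candidate.
    idx = min(range(len(best)), key=best.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [candidates_per_target[t][i] for t, i in enumerate(path)], min(best)

# Toy example: 3 targets, a few numeric "candidates" each.
units, cost = unit_selection([1.0, 2.0, 3.0],
                             [[0.9, 1.5], [1.8, 2.6], [2.9, 3.3]])
print(units, cost)
```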
Longer tubes have lower resonant frequencies than shorter tubes.
It’s vocal tract, not “vocal track”.
Vocal tract length is determined mainly by anatomy; speakers can only vary it a little (e.g., by protruding the lips).
Likewise, a speaker’s F0 range is determined by anatomy and can only be varied within that range.
F0 and formants are independent things, so “adjust their F0 so that the speech they produce has the same formants” is incorrect.
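To make the length/resonance point concrete, here is a quick sketch modelling the vocal tract as a uniform tube closed at one end (the lengths and speed of sound are only illustrative):

```python
# Resonant frequencies of a uniform tube closed at one end (a crude model
# of the vocal tract): f_n = (2n - 1) * c / (4 * L).
c = 350.0                        # approximate speed of sound in warm, moist air (m/s)

for L in (0.175, 0.145):         # e.g. a longer vs a shorter vocal tract (m)
    resonances = [(2 * n - 1) * c / (4 * L) for n in (1, 2, 3)]
    print(L, [round(f) for f in resonances])

# 0.175 m -> roughly [500, 1500, 2500] Hz
# 0.145 m -> higher resonances, roughly [603, 1810, 3017] Hz
```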
This was a bug – thanks for spotting it. Now fixed.
J&M 8.3 was only missing from the reading list displayed within the module 4 page.
You are right that this is a typo: “decreases” should read “increases”.
Does this change your answer?
You are right that using a filterbank with suitably wide filters should eliminate all evidence of F0 in the resulting set of filter outputs. That’s the theory…
…however, even with wide filter bandwidths, there will still be some correlation between a filter’s output and F0: the amount of energy in a filter’s output will vary a little up/down as more/fewer harmonics happen to fall within its pass band.
In this case, the motivation for using the cepstrum remains obtaining a decorrelated representation, and we should still truncate the cepstrum to discard the higher cepstral coefficients, which will be those that correlate most with F0 (and to reduce dimensionality).
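As a minimal sketch of that truncation step: random numbers stand in for the log filterbank outputs of one analysis frame, and 26 filters / 13 coefficients are just typical illustrative values:

```python
import numpy as np
from scipy.fft import dct

# Sketch of truncating the cepstrum. Random numbers stand in for the
# log filterbank outputs of one analysis frame (e.g. 26 mel filters).
rng = np.random.default_rng(0)
log_filterbank = rng.normal(size=26)

# A series expansion (the DCT) of the log filterbank outputs gives the
# cepstral coefficients; keeping only the first few (e.g. 13) discards
# the higher coefficients, which correlate most with F0, and also
# reduces dimensionality.
cepstrum = dct(log_filterbank, type=2, norm='ortho')
truncated = cepstrum[:13]

print(truncated.shape)   # (13,)
```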
This topic about using a remote machine is for MSc dissertation projects. For Speech Processing, we do not have a remote machine that you can log in to.
That’s a great question. If we read outdated books like Holmes & Holmes we will find discussion of LPC features for ASR, and other features derived from a source-filter analysis such as PLP (Perceptual Linear Prediction). PLP was popular for quite a while.
LPC analysis makes a strong assumption about the shape of the spectral envelope: that it can be modelled as an all-pole filter. MFCCs use a more general-purpose approach of series expansion that does not make this assumption.
LPC analysis requires solving for the filter coefficients, and there are multiple ways to do that. They all have limitations, and the process can be error-prone, especially when the speech is not clean.
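For illustration, here is a rough sketch of the autocorrelation method, solving the normal equations directly rather than with the Levinson-Durbin recursion. A synthetic frame stands in for windowed speech, and the usual sign conventions and error handling are glossed over:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

# Rough sketch of LPC analysis by the autocorrelation method, solving the
# normal equations R a = r for the predictor coefficients. A synthetic
# signal stands in for one windowed frame of speech.
rng = np.random.default_rng(0)
frame = rng.normal(size=400) * np.hamming(400)

order = 12                                    # number of poles in the all-pole filter
r = np.correlate(frame, frame, mode='full')   # autocorrelation sequence
r = r[len(frame) - 1:]                        # keep non-negative lags r[0], r[1], ...

# Toeplitz system: first column is r[0..order-1], right-hand side is r[1..order].
a = solve_toeplitz(r[:order], r[1:order + 1])

print(a.shape)   # (12,) predictor coefficients
```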