Forum Replies Created
Yes, the most likely cause is an incorrect sampling rate for the waveforms. See this topic.
You shouldn’t edit the main dictionary, but you can override a pronunciation using the command lex.add.entry in the addendum, which is stored in my_lexicon.scm in the assignment.

The filtering can be done directly in the time domain. It’s easiest to describe and understand filtering in the frequency domain, but the filter can be implemented as a direct operation on the speech waveform samples.
Filter design is an entire subject on its own and out of scope. But we can understand one very simple form of low-pass filter that is easy to implement: a moving average. If we take a moving average of a speech waveform, that will smooth out the smaller details (i.e., remove the higher frequencies).
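As an illustration only (not taken from the assignment code; the window length below is an arbitrary choice), a moving-average low-pass filter is just a few lines of numpy:

import numpy as np

def moving_average(x, window_length=41):
    # replace each sample by the mean of the surrounding window_length samples;
    # a longer window removes more of the higher frequencies
    kernel = np.ones(window_length) / window_length
    return np.convolve(x, kernel, mode="same")  # "same" keeps the original length

# example: smooth a noisy 100 Hz sine sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t) + 0.3 * np.random.randn(fs)
y = moving_average(x)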
The time offset is because the main peak in the speech waveform may not align with the peak that we found in the low-pass-filtered version. That could be for two separate reasons. The first is to do with the phases of the many different harmonic frequencies making up the speech signal. The second is the phase response of the low-pass filter (e.g., the filter introduces a time delay).
I think a visual inspection (plot both distributions, suitably normalised, on the same axes) would suffice for the purposes of the Speech Synthesis assignment.
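For example (a purely illustrative sketch; the variable names and placeholder data below are made up), the two distributions could be overlaid like this:

import numpy as np
import matplotlib.pyplot as plt

# placeholder data standing in for the two sets of values being compared
values_a = np.random.normal(120, 20, 1000)
values_b = np.random.normal(125, 15, 1000)

# density=True normalises each histogram, so the two are directly comparable
plt.hist(values_a, bins=50, density=True, alpha=0.5, label="distribution A")
plt.hist(values_b, bins=50, density=True, alpha=0.5, label="distribution B")
plt.legend()
plt.show()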
To record sound, we measure deviations above or below mean (resting) air pressure using a microphone [*]. The vertical axis on an audio waveform plot corresponds to air pressure. That is why waveform samples during silence have values around zero.
Our idealised impulse train is a waveform where all samples have a value of exactly zero, except for one sample per period which has a positive value. This is the simplest possible signal that contains energy at all multiples of the fundamental frequency.
As you have realised, this idealised signal is not physically possible – a real signal does indeed need to also have regions of below-mean-air-pressure. (It does not need to be symmetric though, so we don’t need “negative impulses” to balance the positive ones.)
[* Actually, microphones vary in precisely what they measure: pressure, pressure gradient across a diaphragm, velocity, …etc. This subtlety is not important to understand here.]
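Purely as an illustration (the sampling rate and fundamental frequency below are arbitrary choices), the idealised impulse train can be constructed and inspected like this:

import numpy as np

fs = 16000   # sampling rate in Hz
f0 = 100     # fundamental frequency in Hz
x = np.zeros(fs)                 # one second of exactly-zero samples
period = int(fs / f0)
x[::period] = 1.0                # one positive-valued sample per period

# the magnitude spectrum has energy at f0 and at all of its multiples
spectrum = np.abs(np.fft.rfft(x))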
First, let’s determine whether that is the original distribution of the corpus – a little digging shows that it’s not. From the paper we eventually find the v2 update, which has this license. You can read that to check whether it allows your intended use (it does).
In the class for Module 4 – the database – you can think more about whether this dataset is suitable (e.g., is it going to be easy to read out loud? will the text present normalisation issues?) and, if it is suitable, how to select a subset for recording.
The problem should now be solved.
Technical details: videos were being served from the subdomain media.speech.zone, which was not covered by the speech.zone SSL certificate. Chrome is stricter than other browsers and refused to load anything from that subdomain. Videos are now served from the primary domain.

Good catch! Not sure what happened there, but the first video (“Why? When? Which aspects?”) is now re-instated. There are 4 videos in Module 5.
The forward algorithm computes P(O|W) correctly by summing all terms. The Viterbi algorithm computes an approximation of P(O|W) by finding only the largest term (= most likely path) in the sum.
Token Passing uses a much smaller data structure: the HMM (= a finite state model) itself, which is one “column” (= all model states at a particular time) of the lattice.
So, Token Passing is equivalent to working with the lattice whilst only ever needing one column of it in memory at any given time.
Token Passing is a time-synchronous algorithm – all paths (= tokens) are extended forwards in time at once.
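As a minimal illustration (toy numbers, not from any real model), here are the two recursions side by side; the only difference is a sum versus a max, which is why the Viterbi result is a lower bound on the forward result:

import numpy as np

pi = np.array([1.0, 0.0])          # initial state probabilities
A = np.array([[0.6, 0.4],          # transition probabilities
              [0.0, 1.0]])
B = np.array([[0.7, 0.3],          # probability of each observation
              [0.2, 0.8]])         # symbol in each state
O = [0, 1, 1]                      # an observation sequence

def forward(O):
    # sums over all state sequences: the exact P(O | model)
    alpha = pi * B[:, O[0]]
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def viterbi(O):
    # keeps only the most likely state sequence: an approximation
    delta = pi * B[:, O[0]]
    for o in O[1:]:
        delta = (delta[:, None] * A).max(axis=0) * B[:, o]
    return delta.max()

print(forward(O), viterbi(O))      # forward >= viterbi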
[Everything below this point is beyond the scope of Speech Processing]
There are non-time-synchronous algorithms. Working on the full lattice would allow paths to be extended over states, or over time, in many different ways. When combined with beam pruning, a clever ordering of the search can lead to doing fewer computations overall. This becomes important for large vocabulary connected speech recognition (LVCSR).
But we then also have the problem that the lattice is too big to construct in memory, so we create only the parts of it that we are going to search. Historical footnote: my Masters dissertation was an implementation of a stack decoder performing A* search; this avoids constructing the lattice in memory, whilst searching it in “best first” order.
In HTK, HVite does Token Passing, which becomes inefficient for LVCSR. For LVCSR, HDecode is much more sophisticated and efficient.

We could go from uniform segmentation to Baum-Welch, skipping Viterbi training. In fact, we could even go from the prototype model directly to Baum-Welch.
Baum-Welch is an iterative algorithm that gradually changes the model parameters to maximise the likelihood of the training data. The only proof we have for this algorithm is that each iteration increases (in fact, does not decrease) the likelihood of the training data. There is no way to know whether the final model parameters are globally-optimal.
This type of algorithm is sensitive to the initial model parameters. The final model we get could be different, depending on where we start from. This is also true for Viterbi training.
So, we use uniform segmentation to get a much better model than the prototype (which has zero mean and unit variance for all Gaussians in all states). Starting from this model should give a better final model than starting from the prototype.
The model from uniform segmentation is used as the initial model for Viterbi training, which in turn provides the initial model for Baum-Welch.
Another reason to perform training in these phases is that Viterbi training is faster than Baum-Welch and will get us quite close to the final model. This reduces the number of iterations of Baum-Welch that are needed.
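As a sketch of the flow only (the function names below are hypothetical and their bodies are placeholders; they stand in for the real tools that perform each step), the point is simply that each phase’s output model becomes the next phase’s initial model:

def prototype_model():
    # in reality: zero mean and unit variance for all Gaussians in all states
    return {"means": 0.0, "variances": 1.0}

def uniform_segmentation_init(model, data):
    # in reality: divide each training utterance equally among the states
    # and estimate each state's parameters from its share of the frames
    return model

def viterbi_training(model, data):
    # in reality: align using the single most likely state sequence, re-estimate, repeat
    return model

def baum_welch(model, data):
    # in reality: iterate until the training-data likelihood stops improving
    return model

data = None                                      # placeholder for the training data
model = prototype_model()
model = uniform_segmentation_init(model, data)   # a much better starting point
model = viterbi_training(model, data)            # fast; gets close to the final model
model = baum_welch(model, data)                  # fewer iterations needed from a good start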
The Viterbi algorithm is not a greedy algorithm. It performs a global optimisation and guarantees to find the most likely state sequence, by exploring all possible state sequences.
An example of a greedy algorithm is the one for training a CART. Unlike the Viterbi algorithm, this algorithm does not explore all possibilities: it takes a sequence of local, hard decisions (about the best question to split the data) and does not backtrack. It does not guarantee to find the globally-optimal tree.
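A small runnable sketch of that greedy behaviour (the data, questions and entropy-based split criterion are all made up for illustration): at each node the single locally best question is chosen, the data are split, and the decision is never revisited.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def grow_tree(items, questions):
    # items: list of (features, label); questions: list of predicates on features
    labels = [lab for _, lab in items]
    if len(set(labels)) == 1 or not questions:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority label
    def gain(q):
        yes = [lab for f, lab in items if q(f)]
        no = [lab for f, lab in items if not q(f)]
        if not yes or not no:
            return -1.0
        return entropy(labels) - (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(items)
    best = max(questions, key=gain)                   # local, hard decision
    yes_items = [(f, lab) for f, lab in items if best(f)]
    no_items = [(f, lab) for f, lab in items if not best(f)]
    if not yes_items or not no_items:
        return Counter(labels).most_common(1)[0][0]
    # no backtracking: the chosen split is final
    return (best, grow_tree(yes_items, questions), grow_tree(no_items, questions))

# toy usage with two made-up binary features
items = [({"vowel": True, "voiced": True}, "V"),
         ({"vowel": False, "voiced": True}, "C"),
         ({"vowel": False, "voiced": False}, "C")]
questions = [lambda f: f["vowel"], lambda f: f["voiced"]]
tree = grow_tree(items, questions)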
Orthogonal means that the correlation between any pair of basis functions is zero. This property is necessary for there to be a unique set of coefficients when we analyse a given signal using a set of basis functions.
Think of it as “there is no amount of one basis function contained in any other one”, or as “if we analyse a signal that happens to be the same as one of the basis functions, that coefficient and that coefficient only will be non-zero”.
Yes, the similarity between a basis function and the signal being analysed is computed by multiplying them sample-by-sample and summing up.
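A small numerical illustration, using cosine basis functions as an example: the sample-by-sample multiply-and-sum is just a dot product, which is zero between two different basis functions and non-zero between a basis function and itself.

import numpy as np

N = 64
n = np.arange(N)
b1 = np.cos(2 * np.pi * 1 * n / N)   # basis function: one cycle per analysis window
b2 = np.cos(2 * np.pi * 2 * n / N)   # basis function: two cycles per analysis window

def similarity(x, y):
    # multiply sample-by-sample and sum
    return np.sum(x * y)

print(similarity(b1, b2))            # (numerically) zero: the pair is orthogonal
print(similarity(b1, b1))            # non-zero

# analysing a signal that happens to equal one basis function:
# only that basis function's coefficient is non-zero
signal = b2
print(similarity(signal, b1), similarity(signal, b2))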
The tapered window is applied to the extracted pitch period units. You can think of it as being applied as we overlap-add them back together. Each extracted unit, and therefore its tapered window, is centred on a pitch mark (= an epoch).
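As a much-simplified sketch of the idea only (not the full algorithm; the sampling rate, epoch spacing and unit length below are arbitrary, and the epochs are assumed to be well away from the signal edges): each unit is cut out around an epoch, multiplied by a tapered window centred on that epoch, and the windowed units are overlap-added back together.

import numpy as np

def windowed_unit(x, epoch, half_length):
    # cut out roughly two pitch periods centred on the epoch (pitch mark)
    # and apply a tapered (Hanning) window, also centred on the epoch
    unit = x[epoch - half_length : epoch + half_length]
    return unit * np.hanning(len(unit))

def overlap_add(units, epochs, length, half_length):
    # place each windowed unit at its (possibly shifted) epoch and sum
    y = np.zeros(length)
    for unit, epoch in zip(units, epochs):
        y[epoch - half_length : epoch + half_length] += unit
    return y

# toy usage: epochs every 160 samples (100 Hz at a 16 kHz sampling rate)
fs = 16000
x = np.random.randn(fs)                     # placeholder waveform
half = 160                                  # about one pitch period either side
epochs = list(range(2 * half, fs - 2 * half, 160))
units = [windowed_unit(x, e, half) for e in epochs]
y = overlap_add(units, epochs, len(x), half)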