Forum Replies Created
December 14, 2020 at 11:08 in reply to: correlation between MFCCs and their deltas and delta-deltas #13650
There will be some correlation between the MFCCs and their deltas, yes. But, empirically, we still obtain a benefit (i.e., lower WER) by adding them even if this covariance is not modelled.
Your answer for a) i. is correct and the diagrams are fine, although you use an unrealistically low example F0 of 5 Hz.
Your answer for a) ii. is basically correct, except that instead of discarding the pitch periods at the end (which is equivalent to truncating the signal), we would distribute the discarding across the duration of the signal (e.g., discard every 3rd period) so that we retain the complete spectral change from the start to the end.
Overall, this is a strong answer and would get a good mark.
Yes, “dimensionality reduction” and “ability to control the feature vector dimension” are two aspects of the same thing. Truncating the cepstrum is a very well-motivated way to choose the dimensionality of the resulting feature vector because the cepstral coefficients have a meaningful order, with the lower ones being more important than the higher ones.
Smaller feature vectors are a good idea when modelling with Gaussians because Gaussians in high dimensions have more parameters to estimate (and don’t work very well in practice). So the motivation for keeping feature dimension low is not the speed of computation but the number of model parameters.
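If it helps to see the numbers, here's a quick Python sketch of how the parameter count grows with feature dimension (the function name and the example dimensions of 13 MFCCs vs 200 DFT magnitudes are my own illustration):

```python
def gaussian_param_count(d, full_covariance=True):
    """Number of parameters in one d-dimensional Gaussian:
    d mean values, plus either a full covariance matrix
    (d*(d+1)//2 unique entries, since it is symmetric)
    or a diagonal covariance (d variances)."""
    if full_covariance:
        return d + d * (d + 1) // 2
    return d + d

# 13 MFCCs vs 200 raw DFT magnitudes, full covariance:
print(gaussian_param_count(13))   # 104
print(gaussian_param_count(200))  # 20300
```

The quadratic growth of the covariance term is exactly why we want a low feature dimension (and why diagonal covariance is common).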
Sampling rate of 16 kHz and 25 ms frame duration means 16000*0.025 = 400 samples in the analysis frame. So the DFT produces 200 magnitudes and 200 phases, of which we discard the phases leaving 200 DFT coefficients.
We didn’t cover this explicitly and it would not be examinable, but an FFT is a restricted case of the DFT which requires the analysis frame to contain a power of 2 number of samples (256, 512, 1024, etc). In the above case, we would need to perform the FFT on 512 waveform samples and not 400.
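The arithmetic above can be checked in a few lines of Python (a sketch; the variable names are my own):

```python
import math

sample_rate = 16000       # Hz
frame_duration = 0.025    # 25 ms

n_samples = int(sample_rate * frame_duration)
print(n_samples)          # 400 samples in the analysis frame

# A radix-2 FFT needs a power-of-2 frame length, so we would
# zero-pad the 400 samples up to the next power of 2:
n_fft = 2 ** math.ceil(math.log2(n_samples))
print(n_fft)              # 512
```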
The sum over all paths computed by the forward algorithm will include the single most likely path, the second most likely, and so on…
One way to understand Viterbi is that we can approximate a sum of many terms (here, path likelihoods) by just taking the largest term. This seems fine if we assume that the largest term is much bigger than all the rest. Try adding these numbers together as quickly as you can:
362823 + 2321 + 123 + 32 + 21 + 14 + 8 + 3 + 1
You could give a very quick answer: “The sum is about 362823”. You would be pretty close.
If that argument isn’t entirely convincing, then another way to understand Viterbi, at least for use in recognition, is to ask you to compute which of these sums will result in the largest value, as quickly as you can:
362823 + 12321 + 9123 + 632 + 321 + 14 + 12 + 3 + 1
221344 + 13234 + 1023 + 332 + 211 + 47 + 11 + 4 + 2
and you could do that just by looking at the largest term in each sum and comparing those, again avoiding actually doing the summation.
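You can verify both arguments with a couple of lines of Python (using the numbers from above):

```python
terms = [362823, 2321, 123, 32, 21, 14, 8, 3, 1]
print(sum(terms))  # the exact sum: 365346
print(max(terms))  # the Viterbi-style approximation: 362823

# For recognition we only need to know which sum is larger,
# and comparing the largest terms gives the same answer here:
a = [362823, 12321, 9123, 632, 321, 14, 12, 3, 1]
b = [221344, 13234, 1023, 332, 211, 47, 11, 4, 2]
print((sum(a) > sum(b)) == (max(a) > max(b)))  # True
```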
Yes, this is a reasonable style for an exam answer – you don’t need to write a perfectly-crafted essay under exam conditions. Brief points, even bullet points, are OK providing you communicate your understanding and show your reasoning. Mere fact-recall will only get you a certain mark – to get full marks you need to go beyond that and give full explanations.
A diagram would be a nice way to convey the first part of your answer – draw the pipeline. For the online exam in Gradescope, it is essential to draw diagrams – don’t only write purely-textual answers in a word processor.
Methods for performing POS tagging are not in-scope for Speech Processing. You can just assume that it is possible to tag text with POS tags very accurately, at least for any language with enough training data (which would be hand-tagged text).
Between you, you have the main points:
dimensionality reduction
reduction of covariance
elimination of evidence of F0
perceptual weighting (Mel scale)
Any three of those would be a good answer, along with explanations of why these are advantageous. An excellent answer would briefly indicate how each of them is achieved for MFCCs and why FFT coefficients don’t have that property (e.g., state the FFT dimension for speech sampled at a typical sampling rate with a typical analysis frame duration).
TD-PSOLA cannot modify the spectral envelope, so it cannot remove spectral discontinuities at joins (e.g., between diphones). d) is correct.
Baum-Welch does the correct computation. This is “by definition” – because the model has a hidden state sequence, the correct thing to do is integrate out that random variable = sum over all values it can take.
Baum-Welch provides a better estimate of the model parameters than Viterbi, in both a theoretical and empirical sense.
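To make the "sum over all paths vs single best path" distinction concrete, here's a toy sketch of the forward and Viterbi computations on an invented 2-state HMM (all the probabilities are made up for illustration; this is not course code):

```python
# Toy 2-state HMM observed for 3 frames. All numbers invented.
init = [0.6, 0.4]          # initial state probabilities
A = [[0.7, 0.3],           # transition probabilities A[i][j]
     [0.4, 0.6]]
b = [[0.9, 0.2],           # b[t][s]: likelihood of observation t
     [0.8, 0.3],           # in state s
     [0.1, 0.7]]

def forward(init, A, b):
    """Sum over all state sequences (what Baum-Welch uses)."""
    alpha = [p * b[0][s] for s, p in enumerate(init)]
    for t in range(1, len(b)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(A))) * b[t][j]
                 for j in range(len(A))]
    return sum(alpha)

def viterbi(init, A, b):
    """Max over all state sequences (the single best path)."""
    delta = [p * b[0][s] for s, p in enumerate(init)]
    for t in range(1, len(b)):
        delta = [max(delta[i] * A[i][j] for i in range(len(A))) * b[t][j]
                 for j in range(len(A))]
    return max(delta)

# The sum over all paths can never be smaller than the best path:
print(forward(init, A, b) >= viterbi(init, A, b))  # True
```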
An accent can only occur when the word is spoken. An accent could be realised, for example, by some movement in F0 or increased energy. Lexically-stressed syllables are candidates that might (but might not) receive an accent when spoken aloud.
The full details of linear prediction (LPC) are not part of the 2020-21 version of Speech Processing, although we did cover the form of the filter (the difference equation).
There are two sub-parts within b):
What problems arise…
TD-PSOLA first divides a speech waveform into pitch periods (= fundamental periods) each of which we assume to be the impulse response of the vocal tract filter. Then it reconstructs a modified waveform using overlap-add. There are potential problems in both steps.
Problems created when the original speech is divided into pitch periods: TD-PSOLA assumes that the impulse response of the vocal tract lasts for a shorter time than T0, which is not always true. So, the extracted pitch periods are not exactly individual impulse responses: each one contains something of the preceding impulse response. This is made worse by the fact that, in practice, we need to extract units of duration 2xT0 to allow for reducing F0 (see below).
Problems created when we use overlap-add to reconstruct the modified waveform – F0 modification: When F0 is increased a lot (T0 is decreased), we heavily overlap the pitch periods. Because the pitch periods are not “clean” and “isolated” impulse responses, this causes distortion. When F0 is reduced a lot (T0 is increased), eventually there will be gaps in-between the pitch periods – if we have extracted units of duration 2xT0 this will occur when we attempt to modify F0 to half of the original or less.
Problems created when we use overlap-add to reconstruct the modified waveform – duration modification: extreme amounts of modification involve either deleting many pitch periods, or duplicating many. Deleting many will disrupt the slowly-changing nature of the vocal tract impulse response. Duplicating many will result in a signal that effectively has a constant vocal tract frequency response for multiple consecutive pitch periods; this can sound unnatural.
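The "gaps when F0 is halved" point is easy to check with a little bookkeeping sketch (my own illustration, not course code): each extracted unit spans 2xT0 samples, so once the new period exceeds 2xT0 no unit can reach the next one.

```python
# Hypothetical illustration of the gap problem in TD-PSOLA.
T0 = 80               # original fundamental period, in samples
unit_length = 2 * T0  # each extracted unit spans 2*T0 samples

def has_gaps(T0_new, n_units=10):
    """Place units at the new spacing and check coverage."""
    covered = [False] * (T0_new * n_units + unit_length)
    for k in range(n_units):
        start = k * T0_new
        for i in range(start, start + unit_length):
            covered[i] = True
    return not all(covered[:T0_new * n_units])

print(has_gaps(120))  # False: units still overlap (F0 lowered a bit)
print(has_gaps(200))  # True: F0 lowered below half, so gaps appear
```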
How does linear predictive…
Assuming that we perfectly separate source and filter (it’s a big assumption, but the question doesn’t ask us to discuss that), we are now free to make any amount of modification to F0 and duration.
F0 modification is straightforward: simply adjust the frequency of the impulse train. Since (we assume) source and filter are perfectly separated, this will be just like making natural speech with a different frequency of vocal fold vibration, going through the vocal tract filter. The problems of TD-PSOLA either overlapping imperfect impulse responses too much, or leaving gaps in-between, don’t arise because the linear predictive filter produces the correct impulse response for every input impulse and they combine in the right way (convolution).
Duration modification is straightforward: simply adjust the duration of the impulse train. The problem of the vocal tract filter remaining piecewise constant when TD-PSOLA increases duration by a large amount can be overcome by interpolating the filter coefficients so that they change gradually and are never constant.
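Since we did cover the form of the filter (the difference equation), here's a minimal sketch of the source-filter idea: the same all-pole filter excited by impulse trains of two different periods. The coefficients are invented purely to give a stable resonance; this is an illustration, not course code.

```python
# Difference equation of a 2nd-order all-pole filter:
#   y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
a = [1.6, -0.9]   # hypothetical predictor coefficients (one resonance)

def synthesise(period, n_samples):
    """Excite the filter with an impulse train of the given period."""
    x = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    y = [0.0, 0.0]  # initial conditions
    for n in range(n_samples):
        y.append(x[n] + a[0] * y[-1] + a[1] * y[-2])
    return y[2:]

low_f0  = synthesise(period=160, n_samples=800)  # longer period
high_f0 = synthesise(period=80,  n_samples=800)  # shorter period
# Same filter, different excitation period: F0 changes, but the
# filter produces its full impulse response for every impulse.
```

Because the filter is applied by the difference equation itself, the impulse responses combine by convolution automatically: there is nothing to overlap-add and no gaps can arise.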
This answer really needs to be in the form of a diagram. Try uploading one here, annotated with the terms you use in your written explanation, and I’ll check it for you.
In general, females have shorter vocal tracts than males, and therefore higher formant frequencies. So iii. is true.
Harmonics are at multiples of F0. Since female speech has generally a higher F0 than male speech, the harmonics will be at multiples of a higher F0 = more widely spaced. So i. is also true.
The signal in Ladefoged Fig 6.2 could be generated by passing an impulse train with a fundamental frequency of 200 Hz through a filter which only passes through frequencies in the range 1800 Hz to 2200 Hz. For example, a filter with a single resonance at 2000 Hz and a narrow bandwidth.
The harmonics in the filtered signal are still at integer multiples of the fundamental. The fundamental frequency of the filtered signal is still 200 Hz even though there is no harmonic at that frequency.
The filter cannot change the fundamental frequency. It can only modify the spectral envelope = it can only change the amplitudes of harmonics, not their frequencies.
One interesting consequence of this is that we perceive such signals as having a pitch equal to their fundamental frequency, even if there is no energy at that frequency. Our perception of pitch is based not simply on identifying the fundamental, but on the harmonic structure.
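You can convince yourself of this numerically: a signal made only of components at 1800, 2000 and 2200 Hz (equal invented amplitudes here) still repeats every 1/200 s, because 200 Hz is the greatest common divisor of those frequencies. A quick sketch:

```python
import math

def x(t):
    """Harmonics 9, 10 and 11 of 200 Hz -- no energy at 200 Hz itself."""
    return (math.cos(2 * math.pi * 1800 * t)
            + math.cos(2 * math.pi * 2000 * t)
            + math.cos(2 * math.pi * 2200 * t))

T0 = 1 / 200  # fundamental period of the filtered signal
samples = [x(n / 16000) for n in range(80)]       # one period at 16 kHz
shifted = [x(n / 16000 + T0) for n in range(80)]  # one period later
print(all(abs(s - t) < 1e-9 for s, t in zip(samples, shifted)))  # True
```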
Yes, you are correct: “topology” just means the shape of the Hidden Markov Model (HMM) = how many states it has and what transitions between them are possible.
For modelling speech, a left-to-right topology is the correct choice. Speech does not time-reverse, the phones in a word must appear in the correct order, etc.
For speech, we do not generally use “parallel path” HMMs, which have transitions that allow some states to be skipped. We use strictly left-to-right models in which the only valid paths pass through all the emitting states in order.
The only exception to this might be an HMM for noise or silence in which we might add some other transitions, or connect all emitting states with all other emitting states with transitions in both directions to make an ergodic HMM.
So, in the general case, an HMM could have transitions between any pair of states, including self-transitions. That’s why, when we derive algorithms for doing computations with HMMs, we must consider all possible transitions and not restrict ourselves to a left-to-right topology.
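As a concrete picture, the topology is visible directly in the transition matrix. Here are hypothetical matrices for 3 emitting states (the probabilities are invented): a strictly left-to-right model has zeros everywhere below the diagonal, while an ergodic model has none.

```python
# Left-to-right: each state may only self-loop or move forward.
left_to_right = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 0.5],   # remaining 0.5 exits the model
]

# Ergodic (e.g. a noise/silence model): every state connects to
# every state, in both directions, including itself.
ergodic = [
    [0.4, 0.3, 0.3],
    [0.3, 0.4, 0.3],
    [0.3, 0.3, 0.4],
]

def is_left_to_right(A):
    """True if no transition goes backwards in state order."""
    return all(A[i][j] == 0.0 for i in range(len(A)) for j in range(i))

print(is_left_to_right(left_to_right))  # True
print(is_left_to_right(ergodic))        # False
```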