Page 37

Forum Replies Created

Viewing 15 posts - 541 through 555 (of 1,084 total)

← 1 2 3 … 36 37 38 … 71 72 73 →

Author

Posts
December 19, 2017 at 00:41 in reply to: Question 7 – linguistic specification #8804
Simon
Professor
iii. pseudo-pitch marks

These are pitch marks placed in unvoiced regions, solely for the purpose of signal processing algorithms that operate pitch synchronously (e.g., TD-PSOLA). They are an internal part of the signal processing algorithm, and so only exist within the waveform generator

iv. pitch contour

This is a target F0 contour (remember that we use the terms “F0” and “pitch” interchangeably in this field, even though only “F0” is technically correct). It is predicted from the text by the front end (e.g., by a succession of classification and regression trees).
December 19, 2017 at 00:38 in reply to: Question 7 – linguistic specification #8803
Simon
Professor
Annable K is correct – phone durations can be part of the linguistic specification because they are predicted from the text (e.g., with a regression tree).
December 18, 2017 at 02:45 in reply to: Question 7 – linguistic specification #8756
Simon
Professor
The linguistic specification is what is passed from the front-end (text processor) to the waveform generator. So, you need to decide what is predicated in the front-end, and what is done by the waveform generator.

The linguistic specification can only include things that are predicted in the front end.
December 18, 2017 at 02:43 in reply to: Question 14 – source-filter vs TD-PSOLA #8755
Simon
Professor
Yes, ii. is correct.

The range of possible F0 modification factor is quite limited in TD-PSOLA, before audible artefacts are produced.

TD-PSOLA operates on pitch periods, and so needs pseudo pitch marks in unvoiced regions. A source-filter model can modify the duration of unvoiced sounds simply by inputting a longer random source signal. It doesn’t need to divide unvoiced speech regions into pseudo pitch periods.

So, iii. and iv. are both true.
December 18, 2017 at 02:38 in reply to: Question 26 – Word Error Rate #8754
Simon
Professor
Kamen D is correct.

Imagine a system that simply deleted all words: it always produces no words in the output (hypothesis). It never makes any insertion or substitution errors. This system should get a WER of 100%.

The formula in (b) would not work: it involves dividing by zero (since there are no words in the hypothesis).

The formula in (d) gives the correct value: for our system, the number of deletions is equal to the number of words in the reference, and so we get a value of 1 (in other words, a WER of 100%).
December 18, 2017 at 02:30 in reply to: Question 12 – CART drawbacks #8753
Simon
Professor
Kamen D is correct about i., ii. and iii.

Here, globally optimal means that the model we learn from data is the best possible model (of that form), given the data.

We cannot guarantee this for a CART, because we use a greedy training algorithm. We do not consider every possible tree (think about how many possible trees there are, for even a small set of questions), and therefore we cannot be sure that there isn’t a better tree than the one we have learned.
December 18, 2017 at 02:27 in reply to: Question 11 – CART stopping criteria #8752
Simon
Professor
Kamen D’s answer is correct. Let’s look at the impossible answers:

ii. the number of classes (of the predictee) is a constant, so this cannot be used as a stopping criterion, since it doesn’t change as we grow the tree

iv. again, this doesn’t change as we grow the tree, so cannot be used as a stopping criterion
December 18, 2017 at 01:19 in reply to: Summing up over ALL path costs? #8734
Simon
Professor
In DTW, there is no summing at all. We are only interested in the signal best path, and so take the minimum path cost at every grid point.
December 18, 2017 at 01:18 in reply to: Complexity of using Euclidean distance #8733
Simon
Professor
Well, when using a probability distribution, we also need to compute the probability (or likelihood) of our observation having been generated by many possible density functions (e.g., one per class, or one per HMM state).

The overall complexity of computing a (log) probability and computing a Euclidean distance is essentially the same. We can see this in the formula for a Guassian, where the key operation is the same (squared distance from the mean).
December 18, 2017 at 01:16 in reply to: Question 15 – join discontinuity #8732
Simon
Professor
Join discontinuity means the difference in any signal properties across a join (concatenation point).

Those signal properties include anything that is perceptually relevant, including (but not limited to) the spectrum.

So, spectral discontinuity is a sub-component of join discontinuity.

What other components can you think of?
November 29, 2017 at 08:46 in reply to: Source in the cepstrum – eliminated after Mel? #8601
Simon
Professor
Jurafsky & Martin (J&M) include a figure showing the “classical” cepstrum, and this is what is confusing you. As you say, they fail to make a clear connection between this and MFCCs.

To clear this up, we need to distinguish between the classical cepstrum, and what actually happens in creating MFCCs.

Let’s start with the classical cepstrum, as in J&M’s Figure 9.14 (borrowed from Taylor, who gives a better explanation – read that if you can).

WARNING: in J&M’s Figure 9.14, the plots for (a) and (b) need to be swapped in order for the caption to be correct! My explanation below assumes you’ve corrected this figure.

The three subfigures illustrate the key stages of

(a) obtaining the spectrum from the waveform, using an FFT – in this domain, the source and filter are multiplied together

(b) taking the log of the spectrum, which makes the source and filter additive

(c) performing a series expansion (e.g., DCT) which “lays out” the different components of the log spectrum along an axis, such that the source components and filter components are in different places along that axis and can easily be separated. In J&M’s Figure 9.14(c) we can see the fundamental period as a small peak around the middle of the cepstrum.

There is no filterbank in the classical cepstrum, and no Mel-scaling of the frequency axis.

Mel Frequency Cepstral Co-efficients are inspired by the classical cepstrum and use the same key processing steps. In addition, MFCC extraction involves some additional processing: a Mel-scaled filterbank. This happens after 9.14(a) and so 9.14(b) becomes a smooth spectral envelope (no harmonics) on a Mel scale, and 9.14(c) would no longer have the small peak corresponding to the fundamental period.

The filterbank serves two purposes. First, it’s an easy way to warp the frequency scale from linear (Hertz) to a Mel scale, simply by placing the filter’s centre frequencies evenly apart on a Mel scale. Second, it’s an opportunity to smooth the spectrum and reduce the prominence of the harmonics – in other words, to produce a spectrum that contains less information about the source.

To summarise: J&M’s Figure 9.14(c) is the classical cepstrum and is not one of the stages on the way to MFCCs.
November 27, 2017 at 20:54 in reply to: Acoustic Model and Language Model #8589
Simon
Professor
A n-gram language model would be learned from a large text corpus. The simplest method is just to count how often each word follows other words, and then normalise to probabilities.

In general, we don’t train the language model only on the transcripts of the speech we are using to train the HMMs. We usually need a lot more data than this, and so train on text-only data. This is beyond the scope of the Speech Processing course, where we don’t actually cover the training of n-gram language models

We just need to now how to write them in a finite state form, and then use them for recognition.

In the digit recogniser assignment, the language model is even simpler than an n-gram and so we write it by hand.
November 27, 2017 at 09:47 in reply to: Acoustic Model and Language Model #8565
Simon
Professor
The language model computes the prior, P(W). If you like, we might say that the language model is the prior. It’s called the prior because we can calculate it before observing O.

In the isolated digit recogniser, P(W) is never actually made explicit, because it’s a uniform distribution. But you can think of having P(W=w) = 0.1 for all words w.

The acoustic model computes the likelihood, P(O|W).

We combine them, using Bayes’ rule, to obtain the posterior P(W|O); we ignore the constant scaling factor of P(O).

Now, to incorporate alternative pronunciation probabilities, we’d need to introduce a new random variable to our equations, and decide how to compute it. Try for yourself…
November 26, 2017 at 16:34 in reply to: Transition Matrix in training a model #8560
Simon
Professor
Yes, the transition matrix is also updated – you can verify this for yourself by inspecting it in the prototype models, the intermediate models after HInit and the final models after HRest.

Conceptually, training the transition probabilities is straightforward: we just count how often each transition is used, and then normalise the counts to probabilities (to make the probabilities across all transitions leaving any given state sum to 1). This counting is very easy for the Viterbi training – we literally count how often each transition was used in the single best alignment for each training example, and sum across all training examples. For Baum-Welch it’s conceptually the same, but we use “soft counting” when summing across all alignments and all training examples.

To see for yourself how much contribution the transition matrix makes to the model, you could even do an experiment (optional!), such as manually editing the transition matrices of the final models to reset them to the values from the prototype model, but leaving the Gaussians untouched.
November 23, 2017 at 07:48 in reply to: recognising words #8530
Simon
Professor
But we don’t know which word our sequence of feature vectors corresponds to. This is what we are trying to find out.

So, we can only try generating it with every possible model (or every possible sequence of models, in the case of connected speech), and search for the one that generates it with the highest probability.

Because we are using Gaussian pdfs, any model can generate any observation sequence. The model of “cat” can generate an observation sequence that corresponds to “car”. But, if we have trained our models correctly, then it will do so with a lower probability than the model of “car”.

A Gaussian pdf assigns a non-zero probability to any possible observation – the long “tails” of the distribution never quite go down to zero. The probability of observations far away from the mean becomes very small, but never zero.
Author

Posts

Viewing 15 posts - 541 through 555 (of 1,084 total)

← 1 2 3 … 36 37 38 … 71 72 73 →

Simon

Forum Replies Created

Search the forums

Note

Latest Activity

Search the forums

Speech Synthesis