Forum Replies Created
This slide is talking about the most basic form of waveform concatenation synthesis, in which we store only one example of each unit type:
inventory = the set of stored waveform units
- using phonemes as the type would require only around 45 stored waveform units
- diphones would require ~2000 stored waveform units
- and so on … larger unit sizes generally have more types
But similar arguments apply to unit selection:
inventory = the set of unique unit types
database = the stored waveforms (usually complete natural utterances) from which units are selected; multiple instances of each unit type are available
Unit selection involves searching all possible candidate unit sequences to find the best-sounding sequence. Even using dynamic programming, this will involve significant computation.
As the database of speech to draw candidates from increases in size, the number of available candidates increases in proportion, but the number of possible sequences increases exponentially.
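To make that scaling concrete, here is a toy sketch (not the code of any real synthesiser) comparing exhaustive enumeration of candidate sequences with a dynamic-programming (Viterbi-style) search over the same lattice. The targets and candidates are plain numbers and the cost functions are invented placeholders.

```python
# A toy comparison (not real synthesiser code) of exhaustive search versus
# dynamic programming over a lattice of candidate units. Targets and
# candidates are plain numbers; the cost functions are placeholders.

import itertools


def target_cost(target, candidate):
    # Placeholder: how well a candidate unit matches the target specification.
    return abs(target - candidate)


def join_cost(prev_candidate, candidate):
    # Placeholder: how well two adjacent candidates concatenate.
    return abs(prev_candidate - candidate)


def exhaustive_search(targets, candidates_per_target):
    """Score every possible candidate sequence: C**N sequences for N targets."""
    best_cost, best_seq = float("inf"), None
    for seq in itertools.product(*candidates_per_target):
        cost = sum(target_cost(t, c) for t, c in zip(targets, seq))
        cost += sum(join_cost(a, b) for a, b in zip(seq, seq[1:]))
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_cost, best_seq


def viterbi_search(targets, candidates_per_target):
    """Dynamic programming: roughly N * C**2 local cost computations."""
    # best[c] = lowest total cost of any partial sequence ending in candidate c
    best = {c: target_cost(targets[0], c) for c in candidates_per_target[0]}
    for target, candidates in zip(targets[1:], candidates_per_target[1:]):
        best = {
            c: target_cost(target, c)
            + min(prev + join_cost(p, c) for p, prev in best.items())
            for c in candidates
        }
    return min(best.values())  # backpointers would recover the sequence itself
```

Both functions find the same minimum total cost, but exhaustive_search scores C**N complete sequences while viterbi_search only computes about N * C**2 local costs. Either way, a bigger database means more candidates C per target, so the search gets more expensive as the database grows.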
Longer tubes have lower resonant frequencies than shorter tubes.
It’s vocal tract, not “vocal track”.
Vocal tract length is determined mainly by anatomy; speakers can only vary it a little (e.g., by protruding the lips).
Likewise, a speaker’s F0 range is determined by anatomy and can only be varied within that range.
F0 and formants are independent things, so “adjust their F0 so that the speech they produce has the same formants” is incorrect.
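To put rough numbers on that, here is a sketch using the standard uniform-tube (quarter-wavelength resonator) approximation of the vocal tract, closed at the glottis and open at the lips. The tube lengths and speed of sound are just illustrative round numbers, and note that F0 appears nowhere in this calculation because it is set by the source, not the tube.

```python
# Uniform-tube approximation of the vocal tract (closed at the glottis, open
# at the lips): resonances fall at F_n = (2n - 1) * c / (4 * L).
# The lengths and speed of sound below are illustrative round numbers only.

SPEED_OF_SOUND = 350.0  # m/s, a typical textbook value for warm, humid air


def tube_resonances(length_m, n_resonances=3):
    return [(2 * n - 1) * SPEED_OF_SOUND / (4 * length_m)
            for n in range(1, n_resonances + 1)]


for length_cm in (17.5, 14.0):  # a longer and a shorter vocal tract
    freqs = [round(f) for f in tube_resonances(length_cm / 100.0)]
    print(f"{length_cm} cm tube: {freqs} Hz")

# 17.5 cm tube: [500, 1500, 2500] Hz
# 14.0 cm tube: [625, 1875, 3125] Hz   <- shorter tube, higher resonances
```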
This was a bug – thanks for spotting it. Now fixed.
J&M 8.3 was only missing from the reading list displayed within the module 4 page.
You are right that this is a typo: “decreases” should read “increases”.
Does this change your answer?
You are right that using a filterbank with suitably wide filters should eliminate all evidence of F0 in the resulting set of filter outputs. That’s the theory…
…however, even with wide filter bandwidths, there will still be some correlation between a filter’s output and F0: the amount of energy in a filter’s output will vary a little up/down as more/fewer harmonics happen to fall within its pass band.
In this case, the motivation for using the cepstrum remains obtaining a decorrelated representation, and we should still truncate the cepstrum to discard the higher cepstral coefficients, which will be those that correlate the most with F0 (and to reduce dimensionality).
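As a concrete illustration of that truncation step, here is a minimal sketch in which the filterbank outputs are random stand-ins rather than a real analysis frame; the filterbank size and number of coefficients are typical values, not prescriptions.

```python
# Minimal sketch: take the cepstrum (a DCT) of the log filterbank outputs and
# truncate it, keeping only the lower coefficients. The "energies" here are
# random placeholders; in practice they come from a mel filterbank applied to
# the magnitude spectrum of one analysis frame.

import numpy as np
from scipy.fft import dct

num_filters = 26   # a typical mel filterbank size
num_ceps = 13      # a typical number of cepstral coefficients to keep

log_energies = np.log(np.random.rand(num_filters) + 1e-6)  # placeholder frame

cepstrum = dct(log_energies, type=2, norm="ortho")  # decorrelating transform
mfccs = cepstrum[:num_ceps]   # truncate: discard the higher coefficients,
                              # which carry most of the remaining F0-related
                              # variation (and reduce dimensionality)
print(mfccs.shape)  # (13,)
```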
This topic about using a remote machine is for MSc dissertation projects. For Speech Processing, we do not have a remote machine that you can log in to.
That’s a great question. If we read outdated books like Holmes & Holmes we will find discussion of LPC features for ASR, and other features derived from a source-filter analysis such as PLP (Perceptual Linear Prediction). PLP was popular for quite a while.
LPC analysis makes a strong assumption about the shape of the spectral envelope: that it can be modelled as an all-pole filter. MFCCs use a more general-purpose approach of series expansion that does not make this assumption.
LPC analysis requires solving for the filter coefficients, and there are multiple ways to do that. They all have limitations, and the process can be error-prone, especially when the speech is not clean.
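To make the all-pole assumption concrete, here is a minimal sketch of the autocorrelation method for a single windowed frame. The frame is random noise standing in for speech, and this is only one of the several ways of solving for the coefficients mentioned above (real implementations usually use the Levinson-Durbin recursion with extra safeguards).

```python
# Minimal sketch of autocorrelation-method LPC for one windowed frame.
# The frame below is random noise standing in for speech; the order is a
# typical value, not a prescription.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz


def lpc(frame, order=12):
    # Autocorrelation of the frame
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Solve the Yule-Walker (Toeplitz) equations for the predictor coefficients
    a = solve_toeplitz(r[:order], r[1:order + 1])
    # Prediction-error filter A(z) = 1 - sum_k a_k z^{-k}
    return np.concatenate(([1.0], -a))


frame = np.hamming(400) * np.random.randn(400)   # placeholder "speech" frame
a = lpc(frame, order=12)

# The spectral envelope is |1 / A(e^{jw})|: an all-pole shape by construction,
# which is exactly the assumption MFCCs avoid making.
w, h = freqz([1.0], a, worN=256)
envelope = np.abs(h)
```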
The labels are for different things in training and testing.
For the training data, we need to know the start and end time of each training example because HRest requires this: it trains one model at a time.
For the test data, we simply need the correct label for each test .mfcc file, to use as a reference when computing the WER.
You should reference the manual for whatever version you are using (you can find that by running any HTK program with just the -V flag).
Both algorithms find alignments between states and observations.
Both algorithms express this alignment as the probability of an observation aligning with a state, which is the same thing as the state sequence. In Viterbi, we can think of those probabilities as being “hard”, or 1s and 0s, because just one state sequence is considered. In Baum-Welch, they will be “soft”, because all state sequences are considered.
These probabilities are then used as the weights in a weighted sum of observations to re-estimate the means of the Gaussians. The weights will not sum to one, so this weighted sum must be normalised by dividing by the sum of the weights.
Remember that in this course, we’re not deriving the equations to express the above. So bear in mind that whilst the concept of “hard” and “soft” alignment is perfectly correct, the exact computations of the weights might be slightly more complex.
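Here is a minimal numerical sketch of just that mean update; the observations and alignment probabilities are made up for illustration, and, as the previous paragraph warns, the exact computations in the full re-estimation formulae are more involved than this.

```python
# Minimal sketch of the mean update described above. gamma[t, s] is the (hard
# or soft) probability of observation t aligning with state s; the numbers are
# made up for illustration, not taken from a real model.

import numpy as np

observations = np.array([[1.0], [2.0], [4.0], [5.0]])   # T=4 frames, 1-dim features

# "Hard" alignment (Viterbi): each frame belongs to exactly one state.
gamma_hard = np.array([[1.0, 0.0],
                       [1.0, 0.0],
                       [0.0, 1.0],
                       [0.0, 1.0]])

# "Soft" alignment (Baum-Welch): each frame is shared between the states.
gamma_soft = np.array([[0.9, 0.1],
                       [0.7, 0.3],
                       [0.2, 0.8],
                       [0.1, 0.9]])


def update_means(gamma, obs):
    # Weighted sum of observations, normalised by the sum of the weights
    # (the weights for a state do not sum to one across frames).
    weighted_sums = gamma.T @ obs                  # shape (num_states, dim)
    totals = gamma.sum(axis=0)[:, np.newaxis]      # shape (num_states, 1)
    return weighted_sums / totals


print(update_means(gamma_hard, observations))  # [[1.5], [4.5]]
print(update_means(gamma_soft, observations))  # means pulled slightly towards
                                               # each other by the shared frames
```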
In training, W is constant for each training sample, so no language model is needed.
The objective function of both algorithms when used for training is to maximise the probability of the observations given the model: P(O|W). This is called “maximum likelihood training” and is the simplest and most obvious thing to do.
(However, this simple objective function does not directly relate to the final task, which is one of classification. So, in advanced courses on ASR, other objective functions would be developed which more directly relate to minimising the recognition error rate. Those are much more complex.)
The states in a whole word model do not correspond to phonemes. Figure 9.4 in Jurafsky & Martin (2nd edition) implies that this is the case, but what they are actually doing is constructing a word model from sub-word (phoneme) models, and their phoneme models have a single state (which is not common – normally we use 3 states). The figure is misleading.
The number of emitting states in a model is a design choice we need to make. As you correctly say, more states means we will need more training data, because the model will have more parameters.
In the digit recogniser assignment, there are a variety of “prototype” models that have varying numbers of states, for you to experiment with. It’s certainly worth doing an experiment to investigate this; make sure it’s one using large training and test sets, not just a single speaker.
You could try using a different number of states for each digit in the vocabulary, but that’s probably not the most fruitful line of experiments.
First, you can and should only run make_mfccs for your own data. (The only exception would be an advanced experiment varying the parameterisation, and you should talk that over with me in a lab session before attempting it.)
There are ongoing permissions problems on the server, and so I’ve reset them again. Please try again and report back.
Each model is of one digit. There are always 10 models (look in the models directory to see them). This is the same, regardless of what data you train the models on.
You need to run HInit (and then HRest) once for each model you want to train. The basic scripts do this already for you, using a for loop, and you’ll keep that structure for speaker-independent experiments too.