Forum Replies Created
Yes, these are both the spectrum of a voiced speech sound. The upper one appears to be on a linear vertical scale, so we only see the very largest amplitudes and everything else appears to be zero. The lower plot is on a logarithmic vertical scale and therefore we can see both very large and very small magnitudes on the same plot. The lower plot is more informative.
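To see the difference concretely, here is a minimal sketch (plain numpy/matplotlib, nothing to do with the exercise scripts; the frame length and the synthetic "voiced" signal are just stand-ins) that plots the same magnitude spectrum on a linear scale and on a dB (logarithmic) scale:

# Minimal sketch: the same magnitude spectrum on a linear and a dB (log) scale.
# The signal is a synthetic stand-in for a voiced frame so the example is
# self-contained; substitute a real frame of speech at sample rate fs.
import numpy as np
import matplotlib.pyplot as plt

fs = 16000
t = np.arange(0, 0.032, 1 / fs)                 # one 32 ms analysis frame
x = sum(np.sin(2 * np.pi * 120 * k * t) / k     # harmonics of a 120 Hz "voice"
        for k in range(1, 40))

magnitude = np.abs(np.fft.rfft(x * np.hanning(len(x))))
freqs = np.fft.rfftfreq(len(x), 1 / fs)

fig, (ax_lin, ax_log) = plt.subplots(2, 1, sharex=True)
ax_lin.plot(freqs, magnitude)                   # linear: only the largest peaks are visible
ax_lin.set_ylabel("magnitude (linear)")
ax_log.plot(freqs, 20 * np.log10(magnitude + 1e-12))   # dB: small magnitudes visible too
ax_log.set_ylabel("magnitude (dB)")
ax_log.set_xlabel("frequency (Hz)")
plt.show()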
The section of the exercise instructions about the target cost weight describes how to change it.
See the notes about pre-selection and pruning – these are probably why you are not hearing any difference. You cannot disable pre-selection (because this ensures the right thing is said!) but you can disable pruning.
To confirm that different candidates are actually being used, you can examine the Unit relation of the utterance. It’s possible that the selected units are changing but that you cannot hear the small difference this makes. Use the commands described here to examine which candidates are selected:
festival> (set! myutt (SayText "Hello world."))
festival> (utt.relation.print myutt 'Unit)
You should get everything working on a DICE machine, using just the CPU and a small data set, before attempting to use Eddie. Have you done that?
Looks like you have all processes set to “False”, which means nothing will be done (other than loading the config files and writing some log output). Set the processes you actually want to run to “True”.
You need to post error messages to get help with them.
The voice definition needs to be somewhere in Festival’s path – e.g., put it alongside any voices that were installed when you installed Festival originally.
Look in
/Volumes/Network/courses/ss/festival/lib.incomplete/voices-multisyn/english/localdir_multisyn-gam
on the lab machines. This is a special kind of voice definition that looks for all the voice files in the current working directory (i.e., wherever you start Festival from).
The instructions are only intended to be used on the machines in the lab. Sounds like you might be doing this on your own machine?
The voice definition file is provided for you – just make sure you set up your workspace correctly, and source the setup file in each new shell – this sets paths and so on, so that Festival finds the voice definition file.
The attached files didn’t upload (you need to add a .txt ending to get around the security on the forum). To debug on Eddie, do not submit jobs to the queue, but instead use an interactive session (using qlogin with appropriate flags to get on to a suitable GPU node).
But, talk to classmates first – several people are attempting to use Eddie and you should share knowledge and effort. Also, look at the Informatics MLP cluster, where people have got Merlin working – see
https://www.wiki.ed.ac.uk/display/CSTR/CSTR+computing+resources
There must be at least one pitchmark in every segment, to make pitch-synchronous signal processing possible. (Note: the earlier pitchmarking step inserts evenly spaced fake pitchmarks in unvoiced speech.)
A segment without any pitchmarks is most probably caused by misaligned labels, although very bad pitchmarking is also a potential cause.
See Section 4.2.3 of Multisyn: Open-domain unit selection for the Festival speech synthesis system, for example.
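As a rough illustration of that check (this is not part of the Multisyn build scripts; the data structures here are assumptions), you could find segments that contain no pitchmark like this:

# Illustrative check only, not part of the actual voice-building scripts:
# flag any labelled segment that contains no pitchmark.
# labels: list of (start, end, name) tuples in seconds, from the forced alignment
# pitchmarks: sorted list of pitchmark times in seconds
import bisect

def segments_without_pitchmarks(labels, pitchmarks):
    bad = []
    for start, end, name in labels:
        i = bisect.bisect_left(pitchmarks, start)   # first pitchmark at or after start
        if i >= len(pitchmarks) or pitchmarks[i] >= end:
            bad.append((start, end, name))          # nothing falls inside [start, end)
    return bad

# toy example: the middle segment has no pitchmark inside it
labels = [(0.00, 0.10, "sil"), (0.10, 0.18, "p"), (0.18, 0.30, "a")]
pitchmarks = [0.02, 0.05, 0.08, 0.19, 0.22, 0.25, 0.28]
print(segments_without_pitchmarks(labels, pitchmarks))   # -> [(0.1, 0.18, 'p')]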
You are right that ‘bad pitchmarking’ is detected during the build_utts step, whilst transferring timestamps from the forced alignment onto the utterance structures for the database.
Ah – poor wording in the paper. Blame the last author. This is clearer:
“Spectral discontinuity is estimated by calculating the Euclidean distance between a pair of vectors of 12 MFCCs: one from either side of a potential join point.”
So, indeed, there is one frame either side of the join.
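Here is a minimal sketch of that measure (plain numpy; this is not Festival’s actual code, and the MFCC matrices are assumed to be precomputed):

# Sketch of the spectral-discontinuity measure described above: Euclidean
# distance between two 12-dimensional MFCC vectors, one from the frame on
# either side of a potential join point. Not Festival's actual implementation.
import numpy as np

def spectral_join_cost(mfcc_left, mfcc_right):
    # mfcc_left, mfcc_right: (num_frames, 12) MFCC matrices for the two
    # candidate units; use the last frame of the left unit and the first
    # frame of the right unit, i.e. the frames either side of the join.
    a = mfcc_left[-1]
    b = mfcc_right[0]
    return float(np.linalg.norm(a - b))

# toy usage with random stand-in "MFCCs"
rng = np.random.default_rng(0)
left, right = rng.normal(size=(20, 12)), rng.normal(size=(15, 12))
print(spectral_join_cost(left, right))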
Can you point me to the exact place in the paper where this is mentioned, please?
That looks like an error, although it will only have an effect for relatively low-pitched female voices.
You are welcome to look at the Festival source code (which is now showing its age) but making these deep modifications is far beyond the scope of this exercise. You are not expected to do this.
Restrict yourself to changing things that are described in the instructions, and that can be done easily at the Festival interactive prompt. For example, you can change the relative weight between target cost and join cost.
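For orientation, that weight enters the total cost in roughly the standard Hunt & Black form (Festival’s implementation details, such as sub-weights inside the target cost, may differ):

C(t_{1:n}, u_{1:n}) = w \sum_{i=1}^{n} T(t_i, u_i) + \sum_{i=2}^{n} J(u_{i-1}, u_i)

so increasing w makes the search favour candidates that match the target specification, and decreasing it favours smoother joins.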
Both of you have identified interesting things to investigate. But, you would probably need a much larger database (and a lot more time) for such experiments to make sense.
You’re absolutely right about the current (2017-18) layout of the exercise to build your own unit selection voice. It’s a deep (and, even worse, variable-depth) hierarchical structure based too closely on how I personally like to arrange my thoughts, and not actually that helpful for students.
In previous years, the content of my courses was also arranged in this way, but the new versions have a structure with limited nesting. Student feedback suggests the new structure is much easier to navigate.
I plan to change all the exercises to have a similar simple structure, but didn’t want to do this in the middle of an academic year.
Thank you for the constructive feedback. Next year’s students will thank you!
There is no point in simply copying pronunciations predicted using the LTS model into a dictionary. The dictionary is for storing exceptions to the LTS model.
So, if the LTS model gets the pronunciation correct, no need to add it to the dictionary.
But, if the LTS model gets the pronunciation wrong, you need to add a manually-created correct entry to the dictionary.
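As a toy illustration of the “exceptions only” idea (plain Python, nothing to do with Festival’s actual dictionary format; the words and pronunciations are made up):

# Toy illustration of "the dictionary stores exceptions to the LTS model".
# lts_predict stands in for the letter-to-sound model; the words and
# pronunciations are invented for this example.
def lts_predict(word):
    fake_lts_output = {"cat": "k a t", "yacht": "y a ch t", "dog": "d o g"}
    return fake_lts_output[word]

hand_corrected = {"cat": "k a t", "yacht": "y o t", "dog": "d o g"}

# Keep only the entries where the LTS model gets it wrong: the exceptions.
exception_dictionary = {
    word: pron
    for word, pron in hand_corrected.items()
    if lts_predict(word) != pron
}
print(exception_dictionary)   # -> {'yacht': 'y o t'}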
The phone set used is the one for whichever voice you currently have loaded (i.e., you need to make sure this is the one you want).
Do not use different dictionaries for different stages in the process. This makes no sense at all: the symbol sets used by different dictionaries are not interchangeable (even if it looks like some of the same symbols are used – the names of the symbols are arbitrary).