Forum Replies Created
Information Theory is a powerful tool, but well beyond the scope of Speech Processing. I’d be happy to help you one-on-one or in a small group, if this is something you are trying to understand.
Good answer, Danielle. But we should note that this paper is specifically about frequency scales for representing pitch (the perceptual correlate of fundamental frequency), rather than the more general spectral envelope information (e.g., formant frequencies) that is important for speech recognition.
Regarding the choice of frequency scale for Automatic Speech Recognition (ASR), the key property we want is a non-linear scale that compresses the higher frequencies more than the lower ones. In other words, the resulting features (e.g., filterbank energies) use more coefficients to describe the most important (i.e., most informative) frequency range for speech, up to around 3 kHz, and fewer coefficients for the higher frequencies that are less important (i.e., contain less information).
All perceptual scales (Mel, Bark, etc.) have this property. They will all work much the same for this application, and the choice is made either through personal preference or empirically by experimentation. The Mel scale is by far the most popular for ASR.
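To make the compression concrete, here is a minimal Python sketch using one common form of the Mel formula (the 2595 * log10(1 + f/700) variant; the exact constants differ between implementations, so treat this as illustrative rather than definitive):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert Hz to Mel using one common variant of the formula."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# Equal ratios in Hz correspond to smaller and smaller steps in Mel at high frequencies:
for f in [500, 1000, 2000, 4000, 8000]:
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} Mel")
```

Doubling the frequency from 4 kHz to 8 kHz adds far fewer Mels than doubling from 500 Hz to 1 kHz, which is exactly the compression we want when allocating filterbank channels.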
Apologies for not updating this information – yes, this year we will again use the built-in microphone on an iMac in the lab (or any other iMac you have access to).
I’ve reported the problem – the submission system is set up by the teaching offices, not me. You can email your submission to them, if Learn is not working.
As far as I can tell, the submission links are already there. You do need to read and confirm an “own work” declaration before the submit link becomes visible.
If you have any trouble, go directly to the appropriate PPLS teaching office for help.
There is no point submitting audio for the first Speech Processing assignment. Is this what you are trying to achieve?
Because you are using a specific voice in the lab, all you need to do is provide the exact text and the audio is precisely reproducible.
Let’s separate out two different things:
1. the exact wording used by another author
If you really want to repeat this wording, then it must be placed in quotation marks, followed by a citation that includes the page number. In general, you should avoid using quotations, because they do not show your own understanding as well as using your own words does. The reason to use a quote is when the precise wording is important in itself – for example, when you wish to make a comment about the wording.
2. ideas, facts, concepts, etc that are from another author
This will apply quite widely, of course. The correct thing to do here is to describe the idea in your own words, but use a citation to acknowledge the source of the idea. Try to cite the original source, or at the very least a textbook.
You use the term “memorise”. If you mean “learn and understand the basic ideas, facts, concepts, etc” then this is absolutely fine – this is what you are supposed to do.
But if you mean literally “memorising the wording used by another author” then this is a Very Bad Idea, and not a good way to learn. Aim for understanding and not mere memorisation.
It is better to properly understand a small number of key concepts, than to memorise large amounts of material without understanding.
The difference between t and j is meaningful, but Taylor forgot to spell it out explicitly.
In the equations where the candidate unit, u, is indexed by t (meaning time as an integer counting through the target sequence, not time measured in seconds), he is referring to the unit selected at that time to be used for synthesis.
In the equations where u is indexed using another variable, j, he is using j to index all the available candidates of that unit type, of which one will be selected.
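To see the two indices side by side, here is a minimal sketch of a generic unit selection cost function in the Hunt & Black style (the symbols below are illustrative and not necessarily Taylor’s exact notation):

```latex
% Illustrative unit selection cost (generic notation, not necessarily Taylor's symbols).
% At target position t (an integer counting through the target sequence) there are J_t
% candidate units u_t^{(j)}, indexed by j = 1, ..., J_t; the search chooses one candidate
% per position so as to minimise the total cost:
\[
  \hat{u}_{1:T} \;=\; \operatorname*{arg\,min}_{j_1, \dots, j_T}
      \sum_{t=1}^{T} C^{\mathrm{target}}\!\left(s_t,\, u_t^{(j_t)}\right)
    \;+\; \sum_{t=2}^{T} C^{\mathrm{join}}\!\left(u_{t-1}^{(j_{t-1})},\, u_t^{(j_t)}\right)
\]
```

So t counts through the target sequence (one selected unit per position), while j ranges over all the available candidates at that position.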
October 13, 2018 at 18:41 in reply to: Handbook of Phonetic Sciences – Chapter 20 – Intro to Signal Processing #9419
Linear Prediction refers to a specific form of filter being used in a source-filter model. A linear predictive filter is very simple: it predicts each speech sample (the filter’s output) as a weighted sum of the previous few speech (i.e., filter output) samples. The weights are called the filter coefficients.
Such a filter has only resonances (technically called “poles”), and no anti-resonances (“zeros”). It can be used as a simple model of the vocal tract. We need to excite the filter with an input signal, such as a pulse train. The output generated will be a synthetic speech waveform.
The frequency response of the filter corresponds to the spectral envelope of the generated speech.
It’s important to realise that, when we model speech with a simple source-filter model, such as linear prediction, we are only really modelling properties of the signal. We are not directly modelling the vocal tract in any realistic sense.
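To make this concrete, here is a minimal NumPy/SciPy sketch of the idea above: an all-pole filter, built here from two hand-picked resonances purely for illustration, excited by a pulse train to produce a crude synthetic vowel. The specific frequencies, pole radius and F0 below are made up for the example, not recommended values.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000               # sampling rate (Hz)
f0 = 120                 # fundamental frequency of the pulse-train excitation (Hz)
n = int(0.1 * fs)        # 100 ms of output

# Each resonance contributes a conjugate pole pair: 1 - 2 r cos(theta) z^-1 + r^2 z^-2
def resonator(freq_hz, r=0.97):
    theta = 2 * np.pi * freq_hz / fs
    return np.array([1.0, -2.0 * r * np.cos(theta), r * r])

# An illustrative all-pole "vocal tract" with resonances near 500 Hz and 1500 Hz
denominator = np.polymul(resonator(500), resonator(1500))

# Excitation: a simple pulse train, one pulse per pitch period
excitation = np.zeros(n)
excitation[::fs // f0] = 1.0

# Filter the excitation through the all-pole filter: the output is a synthetic waveform
synthetic = lfilter([1.0], denominator, excitation)
```

In practice, the filter coefficients would be estimated from real speech (e.g., by the autocorrelation or Burg method) rather than constructed by hand; the frequency response of the resulting filter is the spectral envelope mentioned above.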
Letter-to-phone alignment is needed when preparing the data for training a letter-to-sound model such as a classification tree. This is because letter-to-sound is a sequence-to-sequence problem, but a classification tree only deals with fixed-length input and output. We therefore use the common ‘trick’ of sliding a fixed-length window along the sequence of predictors (which are the letters, in this case).
One way to find the alignment is by using Dynamic Programming, which searches for the best-scoring alignment between the two sequences. We will define (by hand) a simple cost function which (for example) gives a lower cost – i.e., a higher probability – to aligning a letter that is a vowel with a phoneme that is a vowel, and the same for consonants. Or, the cost function could list, for every letter, the phonemes that it is allowed to align with.
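Purely to illustrate what such a hand-crafted cost function might look like, here is a small Python sketch; the letter and phoneme classes and the cost values are made up for the example, not taken from any particular system:

```python
# Hypothetical substitution cost for aligning one letter with one phoneme.
# Lower cost = more plausible alignment; classes and values are illustrative only.
VOWEL_LETTERS = set("aeiouy")
VOWEL_PHONEMES = {"aa", "ae", "ah", "eh", "ey", "ih", "iy", "ow", "uh", "uw"}

def align_cost(letter, phoneme):
    letter_is_vowel = letter.lower() in VOWEL_LETTERS
    phoneme_is_vowel = phoneme.lower() in VOWEL_PHONEMES
    if letter_is_vowel == phoneme_is_vowel:
        return 0.0   # vowel with vowel, or consonant with consonant: cheap
    return 1.0       # vowel with consonant (or vice versa): expensive
```

Dynamic Programming would then search for the alignment of the whole letter sequence with the whole phoneme sequence that minimises the total cost, allowing insertions and deletions (e.g., a silent letter aligning with an ‘empty’ phoneme).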
Dynamic Programming is coming up later in the course – we’ll first encounter it in the Dynamic Time Warping (DTW) method for speech recognition. I suggest waiting until we get there, and then revisiting this topic to see if you can work out how to apply Dynamic Programming to this problem.
I’ll leave one hint here for you to come back to: in DTW, we create a grid and place the template along one axis and the unknown word along the other. For letter-to-phoneme alignment, we would place the letters along one axis and the phonemes along the other.
Post a follow-up later in the course if you need more help.
Yes, it’s an error in some editions of the book, corrected in later editions.
October 4, 2018 at 12:19 in reply to: Handbook of Phonetic Sciences – Chapter 20 – Intro to Signal Processing #9400
Poles and zeros are properties of a filter. They correspond to the physical properties of resonance and anti-resonance.
It is common to model the vocal tract as an all-pole filter: something with only resonances. The most common way of obtaining such an all-pole filter is Linear Prediction.
The relationship between our model (i.e., filter) parameters and the vocal tract shape is not trivial, because our model is such a simplistic approximation of the true vocal tract. So, for example, we wouldn’t normally use pole frequencies as features for Automatic Speech Recognition (although in the early days, features like that were widely used).
But, for conceptual understanding, we can say that the poles of a Linear Prediction filter correspond to resonant frequencies of the vocal tract, which we call formants. (Poles occur in pairs, and there will be two poles per formant).
To do formant tracking, we could fit an all-pole filter to a speech signal, and use the poles to identify the formants.
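A rough sketch of that idea in Python, assuming librosa is available: y is a short voiced frame of speech at sampling rate sr, the LP order of 12 and the thresholds for discarding implausible poles are common rules of thumb rather than definitive choices, and the file name is hypothetical.

```python
import numpy as np
import librosa

# Load a short voiced segment (hypothetical file name, purely for illustration)
y, sr = librosa.load("vowel.wav", sr=16000)

# Fit an all-pole (Linear Prediction) filter to the frame
a = librosa.lpc(y, order=12)

# The poles are the roots of the denominator polynomial
poles = np.roots(a)
poles = poles[np.imag(poles) > 0]              # keep one of each conjugate pair

# Convert pole angles to frequencies (Hz) and pole radii to bandwidths (Hz)
freqs = np.angle(poles) * sr / (2 * np.pi)
bandwidths = -np.log(np.abs(poles)) * sr / np.pi

# Crude formant estimates: discard very low-frequency or very broad resonances
formants = sorted(f for f, bw in zip(freqs, bandwidths) if f > 90 and bw < 400)
print(formants[:3])   # rough F1, F2, F3 estimates
```

This is the classic LPC approach to formant estimation; a real formant tracker would also smooth the estimates over time and handle frames where a formant is missing or poorly modelled.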
[This level of detail is beyond the scope of Speech Processing. These concepts are still important, and will become more relevant in Speech Synthesis.]
October 4, 2018 at 12:05 in reply to: Handbook of Phonetic Sciences – Chapter 20 – Intro to Signal Processing #9397
The inner product between two signals is calculated by multiplying the corresponding samples (one from each signal) and summing up those values.
Intuitively, think of this as a measure of how similar the two signals are. If they are similar, then the inner product will have a high value. If they are very different, it will have a low value. So, we can understand the Fourier transform in this intuitive way:
Take the signal we want to analyse.
Create a sine wave of a particular frequency, and take the inner product between this and our signal. The resulting value is “how much of that frequency is present in our signal”. Plot that result, as a dot on a chart with frequency along the horizontal axis, and “how much” (i.e., magnitude) on the vertical axis.
Repeat for a range of frequencies. Join the dots. The final plot is the spectrum.
Fourier theory will tell us exactly what frequencies of sine waves we need to use, in order to perfectly characterise the signal (i.e., for the spectrum and the signal to contain exactly the same information, and thus to be able to make one from the other, in either direction).
Now on to phase: this is the relative offset (i.e., shift in time) between the sine wave and the signal, before we take the inner product. You correctly state that phase is important. Luckily, Fourier analysis not only computes the magnitudes of frequency components present in our signal, it also computes the phase that each sine wave needs to be at, so that when we sum those sine waves together we reconstruct our signal exactly.
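Here is a minimal NumPy sketch of that intuition: take inner products of the signal with cosine and sine waves at a set of analysis frequencies, and read off a magnitude and a phase for each. This is just the Discrete Fourier Transform written out naively, so in practice you would use an FFT instead.

```python
import numpy as np

def naive_spectrum(x):
    """Magnitude and phase spectrum of x via explicit inner products (a naive DFT)."""
    N = len(x)
    n = np.arange(N)
    magnitudes, phases = [], []
    for k in range(N // 2 + 1):                                # k cycles per N samples
        cos_part = np.dot(x, np.cos(2 * np.pi * k * n / N))    # inner product with a cosine
        sin_part = np.dot(x, np.sin(2 * np.pi * k * n / N))    # inner product with a sine
        magnitudes.append(np.hypot(cos_part, sin_part))        # "how much" of this frequency
        phases.append(np.arctan2(-sin_part, cos_part))         # offset of that component
    return np.array(magnitudes), np.array(phases)

# Example: a 300 Hz sine wave sampled at 8 kHz produces a single peak at 300 Hz
fs = 8000
t = np.arange(int(0.02 * fs)) / fs
x = np.sin(2 * np.pi * 300 * t)
mags, _ = naive_spectrum(x)
freqs = np.arange(len(mags)) * fs / len(x)
print(freqs[np.argmax(mags)])   # approximately 300.0
```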
When Taylor says “suprasegmental prosody” (which he elaborates later on, in Section 6.5.2 of his book) he means aspects of prosody closely associated with the words and the literal meaning of the utterance. For example: syllable stress within a word, or placing a prominence on a content word, or tone in a tonal language.
He uses “affective prosody” (Section 6.5.1) to mean aspects of prosody that convey emotion, attitude and other things determined by the mental state of the talker.
Under “augmentative prosody” (Section 6.5.3) he includes the use of prosody to aid communication, such as using rising intonation at the end of a yes/no question: even though this is not essential, it significantly aids communication efficiency. Another example would be placing phrase breaks to help the listener disambiguate information.
[This material is beyond the scope of the Speech Processing course, but would be in-scope for Speech Synthesis]
Yes, DCT means Discrete Cosine Transform. We will be coming on to that in the later part of Speech Processing, when we consider how to extract useful features from the FFT spectrum, to use for Automatic Speech Recognition. We’ll also be looking at the Mel scale. Wait until we get there, then ask the question again.