Forum Replies Created
Between you, you have the main points:
dimensionality reduction
reduction of covariance
elimination of evidence of F0
perceptual weighting (Mel scale)

Any three of those would be a good answer, along with explanations of why these are advantageous. An excellent answer would briefly indicate how each of them is achieved for MFCCs and why FFT coefficients don’t have that property (e.g., state the FFT dimension for speech sampled at a typical sampling rate with a typical analysis frame duration).
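As a rough illustration of the dimensionality point, here is a minimal sketch (assuming 16 kHz sampling and a 25 ms frame; librosa is just one convenient tool, not something the assignment uses):

```python
import numpy as np
import librosa  # one convenient MFCC implementation (an assumption, not required by the course)

fs = 16000                        # assumed sampling rate (Hz)
frame_dur = 0.025                 # assumed analysis frame duration (25 ms)
frame_len = int(fs * frame_dur)   # 400 samples per frame
n_fft = 512                       # next power of two >= 400

y = np.random.randn(fs)           # 1 second of noise, standing in for speech

# FFT magnitude spectrum: n_fft/2 + 1 = 257 coefficients per frame
spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=frame_len, win_length=frame_len))

# MFCCs: typically only 12-13 coefficients per frame
mfcc = librosa.feature.mfcc(y=y, sr=fs, n_mfcc=13, n_fft=n_fft,
                            hop_length=frame_len, win_length=frame_len)

print(spec.shape[0], "FFT magnitude coefficients vs", mfcc.shape[0], "MFCCs per frame")
```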
TD-PSOLA cannot modify the spectral envelope, so it cannot remove spectral discontinuities at joins (e.g., between diphones). d) is correct.
Baum-Welch does the correct computation. This is “by definition” – because the model has a hidden state sequence, the correct thing to do is integrate out that random variable = sum over all values it can take.
Baum-Welch provides a better estimate of the model parameters than Viterbi, in both a theoretical and empirical sense.
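If it helps to see the difference, here is a toy numerical sketch (the HMM and all its numbers are invented purely for illustration):

```python
import numpy as np

# Toy 2-state HMM (all values made up for illustration)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])        # transition probabilities
pi = np.array([0.6, 0.4])         # initial state probabilities
# b[t, j] = likelihood of the observation at time t given state j
b = np.array([[0.9, 0.2],
              [0.8, 0.3],
              [0.1, 0.7]])

# Forward algorithm: SUM over all state sequences (what Baum-Welch builds on)
alpha = pi * b[0]
for t in range(1, len(b)):
    alpha = (alpha @ A) * b[t]
total_likelihood = alpha.sum()          # P(O | model), hidden states integrated out

# Viterbi: MAX over state sequences (only the single best path)
delta = pi * b[0]
for t in range(1, len(b)):
    delta = (delta[:, None] * A).max(axis=0) * b[t]
best_path_likelihood = delta.max()      # always <= the sum over all paths

print("sum over all paths:", total_likelihood)
print("best single path:  ", best_path_likelihood)
```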
An accent can only occur when the word is spoken. An accent could be realised, for example, by some movement in F0 or increased energy. Lexically-stressed syllables are candidates that might (but might not) receive an accent when spoken aloud.
The full details of linear prediction (LPC) are not part of the 2020-21 version of Speech Processing, although we did cover the form of the filter (the difference equation).
There are two sub-parts within b):
What problems arise…
TD-PSOLA first divides a speech waveform into pitch periods (= fundamental periods) each of which we assume to be the impulse response of the vocal tract filter. Then it reconstructs a modified waveform using overlap-add. There are potential problems in both steps.
Problems created when the original speech is divided into pitch periods: we assume that the impulse response of the vocal tract lasts for a shorter time than T0, which is not always true. So, the extracted pitch periods are not exactly individual impulse responses: each of them contains something of the preceding impulse response. This is made worse by the fact that, in practice, we need to extract units of duration 2xT0 to allow for reducing F0 (see below).
Problems created when we use overlap-add to reconstruct the modified waveform – F0 modification: When F0 is increased a lot (T0 is decreased), we heavily overlap the pitch periods. Because the pitch periods are not “clean” and “isolated” impulse responses, this causes distortion. When F0 is reduced a lot (T0 is increased), eventually there will be gaps in-between the pitch periods – if we have extracted units of duration 2xT0 this will occur when we attempt to modify F0 to half of the original or less.
Problems created when we use overlap-add to reconstruct the modified waveform – duration modification: extreme amounts of modification involve either deleting many pitch periods, or duplicating many. Deleting many will disrupt the slowly-changing nature of the vocal tract impulse response. Duplicating many will result in a signal that effectively has a constant vocal tract frequency response for multiple consecutive pitch periods; this can sound unnatural.
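Here is a very stripped-down sketch of the overlap-add step for F0 modification, assuming the epoch (pitch mark) locations are already given; epoch detection, duration modification and many practical details are left out:

```python
import numpy as np

def ola_change_f0(x, epochs, f0_scale):
    """Toy TD-PSOLA-style resynthesis: extract ~2*T0 Hann-windowed pitch
    periods centred on each epoch (given as sample indices) and overlap-add
    them at new spacings so that F0 is scaled by f0_scale."""
    y = np.zeros(int(len(x) / f0_scale) + len(x) + 1)
    out_pos = float(epochs[1])
    for i in range(1, len(epochs)):
        T0 = epochs[i] - epochs[i - 1]              # local pitch period (samples)
        if epochs[i] - T0 < 0 or epochs[i] + T0 > len(x):
            continue
        seg = x[epochs[i] - T0: epochs[i] + T0].astype(float)
        seg *= np.hanning(len(seg))                 # taper so segments can overlap-add
        start = int(out_pos) - T0
        if 0 <= start and start + len(seg) <= len(y):
            y[start: start + len(seg)] += seg       # overlap-add at the new position
        out_pos += T0 / f0_scale                    # new spacing between pitch periods
    return y[: int(out_pos) + 1]
```

With windows of length 2xT0, once f0_scale drops below 0.5 the new spacing T0/f0_scale exceeds the window length, so gaps appear between successive pitch periods: exactly the limit described above.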
How does linear predictive…
Assuming that we perfectly separate source and filter (it’s a big assumption, but the question doesn’t ask us to discuss that), we are now free to make any amount of modification to F0 and duration.
F0 modification is straightforward: simply adjust the frequency of the impulse train. Since (we assume) source and filter are perfectly separated, this will be just like making natural speech with a different frequency of vocal fold vibration, going through the vocal tract filter. The problems of TD-PSOLA either overlapping imperfect impulse responses too much, or leaving gaps in-between, don’t arise because the linear predictive filter produces the correct impulse response for every input impulse and they combine in the right way (convolution).
Duration modification is straightforward: simply adjust the duration of the impulse train. The problem of the vocal tract filter remaining piecewise constant when TD-PSOLA increases duration by a large amount can be overcome by interpolating the filter coefficients so that they change gradually and are never constant.
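A sketch of the resynthesis side, assuming we already have the LPC filter coefficients for one frame (the coefficient values and function name here are placeholders, not taken from the course materials):

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
a = np.array([1.0, -1.8, 0.9])   # placeholder LPC coefficients (order 2), assumed given

def synth(f0, duration):
    """Excite the all-pole LPC filter 1/A(z) with an impulse train at f0.
    Changing f0 or duration only changes the excitation; the filter
    (spectral envelope) is untouched."""
    n = int(fs * duration)
    excitation = np.zeros(n)
    excitation[:: int(fs / f0)] = 1.0          # one impulse every T0 = fs/f0 samples
    return lfilter([1.0], a, excitation)       # convolve with the filter's impulse response

low = synth(f0=100, duration=0.5)     # lower F0: impulses further apart
high = synth(f0=250, duration=0.25)   # higher F0 and shorter duration
```

Because every impulse is convolved with the full impulse response of the filter, there is no notion of segments overlapping too much or leaving gaps; interpolating the coefficients between frames keeps the envelope changing smoothly during large duration increases.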
This answer really needs to be in the form of a diagram. Try uploading one here, annotated with the terms you use in your written explanation, and I’ll check it for you.
In general, females have shorter vocal tracts than males, and therefore higher formant frequencies. So iii. is true.
Harmonics are at multiples of F0. Since female speech has generally a higher F0 than male speech, the harmonics will be at multiples of a higher F0 = more widely spaced. So i. is also true.
The signal in Ladefoged Fig 6.2 could be generated by passing an impulse train with a fundamental frequency of 200 Hz through a filter which only passes frequencies in the range 1800 Hz to 2200 Hz. For example, a filter with a single resonance at 2000 Hz and a narrow bandwidth.
The harmonics in the filtered signal are still at integer multiples of the fundamental. The fundamental frequency of the filtered signal is still 200 Hz even though there is no harmonic at that frequency.
The filter cannot change the fundamental frequency. It can only modify the spectral envelope = it can only change the amplitudes of harmonics, not their frequencies.
One interesting consequence of this is that we perceive such signals as having a pitch equal to their fundamental frequency, even if there is no energy at that frequency. Our perception of pitch is based not simply on identifying the fundamental, but on the harmonic structure.
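You could verify this construction yourself; here is a minimal sketch (the particular Butterworth band-pass design is my assumption; any narrow filter around 2000 Hz would do):

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 16000
f0 = 200                                    # fundamental frequency (Hz)
x = np.zeros(fs)                            # 1 second
x[:: fs // f0] = 1.0                        # impulse train: harmonics at 200, 400, 600, ... Hz

# Narrow band-pass around 2000 Hz: only the harmonics near 1800-2200 Hz survive
b, a = butter(4, [1800, 2200], btype="bandpass", fs=fs)
y = lfilter(b, a, x)

# The spectrum of y is dominated by the harmonics at 1800, 2000 and 2200 Hz,
# yet the waveform still repeats every 1/200 s, so the fundamental frequency
# (and the perceived pitch) is still 200 Hz.
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), 1 / fs)
print(freqs[np.argmax(spectrum)])           # ~2000 Hz: strongest harmonic, not the fundamental
```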
Yes, you are correct: “topology” just means the shape of the Hidden Markov Model (HMM) = how many states it has and what transitions between them are possible.
For modelling speech, a left-to-right topology is the correct choice. Speech does not time-reverse, the phones in a word must appear in the correct order, etc.
For speech, we do not generally use “parallel path” HMMs, which have transitions that allow some states to be skipped. We use strictly left-to-right models in which the only valid paths pass through all the emitting states in order.
The only exception to this might be an HMM for noise or silence in which we might add some other transitions, or connect all emitting states with all other emitting states with transitions in both directions to make an ergodic HMM.
So, in the general case, an HMM could have transitions between any pair of states, including self-transitions. That’s why, when we derive algorithms for doing computations with HMMs, we must consider all possible transitions and not restrict ourselves to a left-to-right topology.
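To make the topology idea concrete, here is a toy pair of transition matrices for a 3-emitting-state model (the probability values are invented; in reality they are learned during training):

```python
import numpy as np

# Strictly left-to-right: only self-transitions and moves to the next state.
# Row i gives P(next state = j | current state = i).
left_to_right = np.array([
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],   # exit from the final state is handled separately in practice
])

# Ergodic: every state can reach every other state (e.g., for a noise model).
ergodic = np.array([
    [0.4, 0.3, 0.3],
    [0.3, 0.4, 0.3],
    [0.3, 0.3, 0.4],
])

# The zeros in the left-to-right matrix are what forbid time-reversal and skips;
# the general algorithms still sum (or max) over all entries, zeros included.
assert np.allclose(left_to_right.sum(axis=1), 1.0)
assert np.allclose(ergodic.sum(axis=1), 1.0)
```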
In the simple grammar used in the assignment, we assume there is an equal (= uniform) probability of each digit. For the sequences part of the assignment, we also assume all sequences have equal probability.
But in the more general case of connected speech recognition, we will learn the prior P(W) from data. Usually that involves learning (= training) an N-gram language model from a corpus of text: the details of learning an N-gram are out-of-scope for Speech Processing, but you do need to understand that such a model is finite state and what a trained model looks like.
So, the answer to “how do we know the probability of a word sequence before we observe any acoustic evidence (= speech)?” is that we pre-calculate and store it: that’s the language model. In the general case of an N-gram, we use data to estimate the probability of every possible N-gram in the language by counting its frequency in a text corpus.
Our prior belief about W is P(W). When we receive the acoustic evidence O, we compute the likelihood P(O|W). We then revise (= update) our belief about W in the light of this new evidence, by multiplying the likelihood and the prior, to get P(W|O). [Ignoring P(O).]
P(W|O) is the posterior: it’s what we believe about the distribution of W given (= after receiving) the acoustic evidence O.
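Written out as the standard decision rule (nothing beyond what is stated above):

\[
\hat{W} \;=\; \operatorname*{arg\,max}_{W} P(W \mid O)
        \;=\; \operatorname*{arg\,max}_{W} \frac{P(O \mid W)\,P(W)}{P(O)}
        \;=\; \operatorname*{arg\,max}_{W} P(O \mid W)\,P(W)
\]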
The three states allow the model to capture sound changes within a phone – the “beginning”, “middle” and “end”. Any number of emitting states could be used, but 3 has been found to work well and is – as you correctly state – the most common choice.
Look in the VMWare settings for “Network Adapter” and make sure it’s enabled. Play around with the settings there to see if that helps – there are different options for how the VM gets an internet connection from the host computer.
There are various troubleshooting guides on the VMWare website, depending on what OS your host computer is running.
In general, we only need transcriptions without time alignments to train HMMs, including for monophone models. The method for training models in such a situation is known as “embedded training” but this is slightly beyond the scope of the course.
But in the Digit Recogniser assignment – and in the theory part of the Speech Processing course – we are using a simpler method for training HMMs, which does require the data to be labelled with the start and end times of each model (whole-word models, in the case of the Digit Recogniser assignment).
You need to rsync the files (as per the start of the TTS assignment, which was in Tutorial B of Module 3).

For each training utterance, there will be a word-level transcription. A phone-level transcription is needed in order to determine the phone models to join together to make an utterance model. The phone transcription might be created simply by looking each word up in the dictionary and replacing it with its phone sequence.
These transcriptions do not need to have any timing information though – they are just sequences of words or phones.
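A minimal sketch of that dictionary lookup, with made-up entries (not the assignment’s actual dictionary or phone set):

```python
# Toy pronunciation dictionary: word -> phone sequence (illustrative entries only)
dictionary = {
    "one":   ["w", "ah", "n"],
    "two":   ["t", "uw"],
    "three": ["th", "r", "iy"],
}

def word_to_phone_transcription(words):
    """Replace each word in a word-level transcription with its phone
    sequence. No timing information is needed or produced."""
    phones = []
    for w in words:
        phones.extend(dictionary[w.lower()])
    return phones

print(word_to_phone_transcription(["one", "three", "two"]))
# ['w', 'ah', 'n', 'th', 'r', 'iy', 't', 'uw']
```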