Forum Replies Created
You are right that HMMs have “a set of probability distributions” but there are two different types of probability distribution in an HMM. One type is the emission probability density functions: the multivariate Gaussian in each emitting state that generates observations (MFCCs).
What is the other type?
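If it helps to make the first type concrete, here is a minimal sketch (using NumPy and SciPy, with made-up dimensions and parameter values) of the emission pdfs only: one multivariate Gaussian per emitting state, each generating MFCC-like observation vectors.

```python
# A minimal sketch of the emission probability density functions only:
# one multivariate Gaussian per emitting state, each generating
# MFCC-like observation vectors. All parameter values are made up
# purely for illustration.
import numpy as np
from scipy.stats import multivariate_normal

DIM = 13          # e.g., 13 MFCCs per frame (illustrative)
N_STATES = 3      # three emitting states (illustrative)

rng = np.random.default_rng(0)

# One Gaussian (mean vector + diagonal covariance) per emitting state.
emission_pdfs = [
    multivariate_normal(mean=rng.normal(size=DIM),
                        cov=np.diag(rng.uniform(0.5, 2.0, size=DIM)))
    for _ in range(N_STATES)
]

# Generate (emit) one observation from state 1,
# then evaluate its likelihood under each state's pdf:
obs = emission_pdfs[1].rvs(random_state=rng)
likelihoods = [pdf.pdf(obs) for pdf in emission_pdfs]
print(likelihoods)
```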
In the video Cepstral Analysis, Mel-Filterbanks, MFCCs we first had a recap of filterbank features. These would be great features, except that neighbouring filterbank coefficients are strongly correlated (they co-vary).
We then reminded ourselves of how the source and filter combine in the time domain using convolution, or in the frequency domain using multiplication. We made them additive by taking the log and devised a way to deconvolve the source and filter. This video only explained the classical cepstrum – there was no Mel scale or filterbank.
Finally, in the video From MFCCs, towards a generative model using HMMs we developed MFCCs, by using our filterbank features as a starting point, then applying the same crucial steps as we would for using the cepstrum to obtain the filter without the source: take the log (make source and filter additive), series expansion (separate source and filter along the quefrency axis), truncate (discard the source).
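As a rough sketch of those three steps applied to mel filterbank energies (the input is assumed to already be a frames-by-filters matrix of filterbank energies; 26 filters and 12 retained coefficients are just typical, illustrative choices):

```python
# A sketch of the three steps above, applied to mel filterbank energies:
# take the log, series expansion (here a DCT), then truncate.
import numpy as np
from scipy.fft import dct

def filterbank_to_mfcc(fbank_energies, n_ceps=12):
    """fbank_energies: array of shape (n_frames, n_filters), all values > 0."""
    log_fbank = np.log(fbank_energies)                       # make source and filter additive
    cepstra = dct(log_fbank, type=2, axis=1, norm='ortho')   # series expansion (DCT)
    return cepstra[:, :n_ceps]                                # truncate: keep low quefrencies

# Example with random positive "energies" standing in for real filterbank output:
fake_fbank = np.abs(np.random.randn(100, 26)) + 1e-3
mfccs = filterbank_to_mfcc(fake_fbank)
print(mfccs.shape)   # (100, 12)
```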
If, after reading the questions again carefully, you are certain they are duplicates, please email the screenshots to me.
First, remember that pitch is the perceptual correlate of F0. We can only measure F0 from a speech waveform. Pitch only exists in the mind of the listener.
When we say F0 is not lexically contrastive in ASR, we mean that it is not useful for telling two words apart. The output of ASR is the text, so we do not need to distinguish “preSENT” from “PREsent”, for example; we simply need to output the written form “present”.
Duration is lexically contrastive because there are pairs of words in the language that differ in their vowel length.
Hidden Markov Models do model duration. Can you explain how they do that?
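For reference once you have thought about it: in a standard HMM, duration modelling comes from the state transition structure. A minimal sketch, assuming a single state with self-transition probability a_ii (the value below is illustrative), of the duration distribution that implies:

```python
# A minimal sketch (illustrative numbers) of the duration distribution
# implied by a single HMM state with self-transition probability a_ii:
# P(stay exactly d frames) = a_ii**(d-1) * (1 - a_ii), a geometric distribution.
import numpy as np

a_ii = 0.8                                   # illustrative self-transition probability
durations = np.arange(1, 21)                 # durations in frames
p_duration = a_ii ** (durations - 1) * (1 - a_ii)

print("expected duration (frames):", 1 / (1 - a_ii))   # mean of the geometric distribution
print(dict(zip(durations[:5], np.round(p_duration[:5], 3))))
```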
It’s preferred to copy-paste text into your forum post rather than use screenshots (which are not searchable, and which we cannot quote in our replies).
You are trying to use an MFCC file which does not exist – that’s what the HTK error “Cannot open Parm File” means.
Take a look in
/Volumes/Network/courses/sp/data/mfcc/
to see how the data are organised into train and test partitions, and what the filenames are.

The train and test data are arranged differently:
Because we have labels for the train data, we can keep all of the training examples from each speaker in a single MFCC file, where the corresponding label file specifies not just the labels (e.g., “one” or “seven”) but also the start and end times.
The test data are cut into individual digits, ready to be recognised.
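A quick way to inspect that directory and to check whether the exact file HTK is complaining about really exists (the path is the one given above; the example filename in the last two lines is hypothetical, so substitute the one from your script):

```python
# List what actually exists under the course data directory, then check
# the specific file from the HTK "Cannot open Parm File" error.
from pathlib import Path

mfcc_root = Path("/Volumes/Network/courses/sp/data/mfcc")

if mfcc_root.exists():
    # See how the data are organised into train and test partitions:
    for entry in sorted(mfcc_root.iterdir()):
        print(entry)

# Check whether the specific file HTK failed to open is really there:
candidate = mfcc_root / "train" / "some_speaker.mfcc"   # hypothetical name
print(candidate, "exists?", candidate.exists())
```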
November 10, 2022 at 21:39 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16489
Only you can answer that: you need to give enough background for your reader to understand the points you will make later in the report. For example, if your explanation for an audible concatenation refers to source and filter properties, you should have provided enough background about that for your explanation to be understood by your reader.
It’s good practice to specify the language (and accent, or other properties, when relevant) you are working with: you would be amazed at how many published papers forget to do that! Likewise, it is good practice to be clear about what data are used, where they come from, etc.
The data here include both the data in the unit selection voice and the sentences you use to illustrate mistakes.
November 7, 2022 at 22:09 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16466
Correct! (Also in the pronunciation dictionary, of course.)
Actually, the symbol set is not exactly phonemes – it includes allophones, for example. What is the difference?
November 7, 2022 at 20:58 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16464
You correctly state that diphones are used because they capture co-articulation.
But are you sure phonemes are not used anywhere in the system?
November 7, 2022 at 18:27 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16457
There are lots of connections. Some hints:
do we use phonemes in TTS?
speech sounds are affected by the surrounding speech sounds through the process of co-articulation (which occurs both within and between words)
the source and filter each have different consequences for the acoustic properties of a speech sound: how is that knowledge used in TTS?
They are in the Unit relation.

It is far preferable to find a better source, such as a textbook or peer-reviewed paper. The problem with Wikipedia is that almost anyone can write or edit an entry and we don’t usually know anything about them. It is hard to trust a source when we do not know the author.
You will find me occasionally linking to Wikipedia in answers to forum posts. I only do that when I know the article is correct. I would generally not cite Wikipedia in a scientific paper or in my main teaching materials.
November 7, 2022 at 08:30 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16418
As mentioned in many other posts, do not focus so heavily on Festival – it’s just a piece of software! The assignment is about the general principles of TTS.
Therefore, in the background section, you will want to explain those general principles: what does your reader need to know, in order to understand your explanations of the mistakes later in the report? That might include not only how each step is done, but also whether that step is easy or hard, solved with current techniques or still an open problem, etc.
The formatting instructions specify which headings are compulsory and whether you can add subsections below them (yes, you can).
November 6, 2022 at 20:55 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16403
Word count is defined in the writing up instructions.
November 6, 2022 at 18:43 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16398
The answer is in the phonetics material of Speech Processing – go back over Module 1 to recap how speech is produced, then Module 2, which covers the acoustic properties of vowels and consonants. You might also find Module 4 helps you answer this question, especially the last video, “Phoneme”.