Forum Replies Created
November 7, 2022 at 18:27 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16457
There are lots of connections. Some hints:
do we use phonemes in TTS?
speech sounds are affected by the surrounding speech sounds through the process of co-articulation (which occurs both within and between words)
the source and filter each have different consequences for the acoustic properties of a speech sound: how is that knowledge used in TTS?
They are in the Unit relation.
It is far preferable to find a better source, such as a textbook or peer-reviewed paper. The problem with Wikipedia is that almost anyone can write or edit an entry and we don’t usually know anything about them. It is hard to trust a source when we do not know the author.
You will find me occasionally linking to Wikipedia in answers to forum posts. I only do that when I know the article is correct. I would generally not cite Wikipedia in a scientific paper or in my main teaching materials.
November 7, 2022 at 08:30 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16418
As mentioned in many other posts, do not focus so heavily on Festival – it’s just a piece of software! The assignment is about the general principles of TTS.
Therefore, in the background section, you will want to explain those general principles: what does your reader need to know, in order to understand your explanations of the mistakes later in the report? That might include not only how each step is done, but also whether that step is easy or hard, already solved with current techniques or still an open problem, and so on.
The formatting instructions specify which headings are compulsory and whether you can add subsections below them (yes, you can).
November 6, 2022 at 20:55 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16403
Word count is defined in the writing up instructions.
November 6, 2022 at 18:43 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16398
The answer is in the phonetics material of Speech Processing – go back over Module 1 to recap how speech is produced, then Module 2 which covers the acoustic properties of vowels and consonants. You might also find Module 4 helps you answer this question, especially the last video, “Phoneme”.
Yes, that’s how you convert a log probability back to a probability (noting that logs are not necessarily base 10, although in Festival I believe they are).
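As a quick illustration of that conversion, here is a minimal sketch. The value of log_p is made up for the example and is not taken from Festival output, and the function name log_to_prob is my own; the point is only that the result depends on which base the logarithm uses.

```python
import math

def log_to_prob(log_p, base=10.0):
    """Convert a log probability back to a probability for the given base."""
    return base ** log_p

# Hypothetical log probability, chosen only for illustration.
log_p = -2.0

p_base10 = log_to_prob(log_p, base=10.0)     # interpret as a base-10 log
p_natural = log_to_prob(log_p, base=math.e)  # interpret as a natural log

print(p_base10, p_natural)
```

The same number gives quite different probabilities under the two interpretations, so it is worth checking which base a tool uses before converting.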
“transition probabilities of observed states and emission states” is a little muddled – I suggest waiting until HMMs are covered in the Speech Processing course and then seeing whether that helps you understand better. States emit observations according to some emission probability distribution, and there are also probabilities on the transitions between states.
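To make the two kinds of probability concrete, here is a toy sketch of a two-state HMM. Everything in it is an assumption for illustration: the state names, the discrete observation symbols and all the numbers are invented, and a real speech model would normally use continuous emission distributions rather than a lookup table.

```python
# Toy two-state HMM; all names and numbers are made up for illustration.
initial = {"s1": 1.0, "s2": 0.0}

# Probabilities on the transitions between states.
transition = {
    "s1": {"s1": 0.6, "s2": 0.4},
    "s2": {"s1": 0.0, "s2": 1.0},
}

# Each state emits observations according to its own distribution.
emission = {
    "s1": {"a": 0.9, "b": 0.1},
    "s2": {"a": 0.2, "b": 0.8},
}

def joint_probability(states, observations):
    """P(states, observations): product of transition and emission terms."""
    prob = initial[states[0]] * emission[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        prob *= transition[prev][cur] * emission[cur][obs]
    return prob

p = joint_probability(["s1", "s1", "s2"], ["a", "a", "b"])
print(p)
```

Note that observations are emitted by states, and transitions happen between states: there are no “observed states” or “emission states” as separate things.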
For this assignment, you are not expected to modify the method Festival uses for any step in the pipeline, such as prob_models. The combination of which methods are used for each part of the pipeline is part of the voice definition and it won’t always make sense to modify one method in isolation (e.g., a subsequent step in the pipeline might be expecting a specific relation to be created by a preceding step).
Those are log probabilities. It is good practice to store probabilities as their logarithm because absolute probabilities can be very small numbers which are hard to store with sufficient precision.
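To see why storing logarithms helps, here is a minimal sketch. The list of probabilities is invented for the example; it simply stands in for a long sequence of small per-step probabilities that need to be multiplied together.

```python
import math

# Hypothetical per-step probabilities: 100 steps, each with probability 1e-5.
probs = [1e-5] * 100

# Multiplying the raw probabilities underflows to zero in floating point,
# because the true answer (1e-500) is too small to represent.
product = 1.0
for p in probs:
    product *= p

# Adding base-10 log probabilities keeps the same information safely.
log_sum = sum(math.log10(p) for p in probs)

print(product, log_sum)
```

The product is lost entirely, whereas the sum of logs is an ordinary number of modest size that can still be compared with the log probability of any competing hypothesis.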
Diphone concatenation and TD-PSOLA could be implemented as a single process during waveform generation. We can simultaneously modify the F0 and duration inside each individual diphone and overlap-add the last pitch period of each diphone with the first pitch period of the next diphone to concatenate them.
In TD-PSOLA, the width of the analysis window is typically twice the fundamental period so that each window contains two pitch periods. This is so that, if we need to space the pitch periods further apart (i.e., to reduce F0), there is some waveform to ‘fill the gap’.
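The windowing and overlap-add idea can be sketched as follows. This is a simplified illustration under strong assumptions, not how Festival implements it: F0 is constant, the pitch marks are already known, each output pitch mark simply reuses the nearest analysis frame, and the function names (hann, extract_frames, overlap_add) are my own.

```python
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def extract_frames(signal, pitch_marks, period):
    """Cut one Hann-windowed frame, two pitch periods wide, around each mark."""
    window = hann(2 * period)
    frames = []
    for mark in pitch_marks:
        frame = []
        for i in range(2 * period):
            idx = mark - period + i
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            frame.append(sample * window[i])
        frames.append(frame)
    return frames

def overlap_add(frames, pitch_marks, period, new_period, length):
    """Place frames at a new pitch-mark spacing and sum where they overlap."""
    out = [0.0] * length
    new_mark = pitch_marks[0]
    while new_mark < length:
        # Reuse the analysis frame whose pitch mark is nearest in time.
        k = min(range(len(pitch_marks)),
                key=lambda j: abs(pitch_marks[j] - new_mark))
        for i, value in enumerate(frames[k]):
            idx = new_mark - period + i
            if 0 <= idx < length:
                out[idx] += value
        new_mark += new_period
    return out

# Synthetic test signal: a sine wave with a pitch period of 20 samples.
period = 20
signal = [math.sin(2 * math.pi * n / period) for n in range(200)]
marks = list(range(period, 200, period))

frames = extract_frames(signal, marks, period)
same_f0 = overlap_add(frames, marks, period, period, len(signal))
lower_f0 = overlap_add(frames, marks, period, 25, len(signal))
```

With the original spacing, the overlapping windows sum to one and the waveform is reconstructed between the first and last pitch marks. With the wider spacing of 25 samples (a lower F0), each frame is still 40 samples wide, so neighbouring frames continue to overlap and there is waveform to fill the gap.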
The nicely-plotted waveform shown in the video at 5:50 is for one of the diphones in the sequence. At 5:45 my hand-drawn waveform for one diphone was unfortunately only two periods long – that was sloppy of me and potentially confusable with a TD-PSOLA analysis frame, which it is not. A diphone would generally be longer than that.
Each impulse response does not represent one phone. A phone is generally much longer than T0: the vocal tract shape changes much more slowly than the vocal folds vibrate.
November 4, 2022 at 18:15 in reply to: Finding documentation about recognizing non-standard words in Festival #16262
You can assume that Festival does it in the way described in this core reading. Remember that the assignment is about Text-to-Speech in general, not narrowly about Festival.
This recommended reading from Taylor will help you think about which parts of the problem are harder than others.
If you want to read beyond the course readings (which is of course optional and not expected, but something you may choose to do if aiming for a very high mark), then Chapter 5 of Taylor’s book, “Text decoding: finding the words from the text”, goes into more depth.
November 4, 2022 at 15:59 in reply to: Taking at look at letter-to-sound rules (for new words) #16256
No, you don’t need to explain specific failures by referencing the actual tree. That would not be very insightful anyway: although we often say that decision trees are human readable, the error might occur very deep down the tree and be hard to explain.
Instead, focus on the general properties of the model being used. For example, why might a decision tree make errors at all? How could you do better: a bigger tree, a smaller tree, a change from a decision tree to another model, training on more data, and so on?
For a word that Festival’s letter-to-sound model got wrong, was it because that word might be particularly hard for some reason? What reason?
Yes: what is the overall aim of the report? What question(s) are you trying to answer? Essentially, you want to give the reader some motivation to read the rest of the report: what will they find out, and why is it interesting?
November 4, 2022 at 12:17 in reply to: Taking at look at letter-to-sound rules (for new words) #16231