Forum Replies Created
It’s good practice to specify the language (and accent, or other properties, when relevant) you are working with: you would be amazed at how many published papers forget to do that! Likewise, it is good practice to be clear about what data are used, where they come from, etc.
The data here include both the speech data in the unit selection voice and the sentences you use to illustrate mistakes.
November 7, 2022 at 22:09 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16466
Correct! (Also in the pronunciation dictionary, of course.)
Actually, the symbol set is not exactly phonemes – it includes allophones, for example. What is the difference?
November 7, 2022 at 20:58 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16464
You correctly state that diphones are used because they capture co-articulation.
But are you sure phonemes are not used anywhere in the system?
November 7, 2022 at 18:27 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16457
There are lots of connections. Some hints:
do we use phonemes in TTS?
speech sounds are affected by the surrounding speech sounds through the process of co-articulation (which occurs both within and between words)
the source and filter each have different consequences for the acoustic properties of a speech sound: how is that knowledge used in TTS?
They are in the Unit relation.
It is far preferable to find a better source, such as a textbook or peer-reviewed paper. The problem with Wikipedia is that almost anyone can write or edit an entry and we don’t usually know anything about them. It is hard to trust a source when we do not know the author.
You will find me occasionally linking to Wikipedia in answers to forum posts. I only do that when I know the article is correct. I would generally not cite Wikipedia in a scientific paper or in my main teaching materials.
November 7, 2022 at 08:30 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16418
As mentioned in many other posts, do not focus so heavily on Festival – it’s just a piece of software! The assignment is about the general principles of TTS.
Therefore, in the background section, you will want to explain those general principles: what does your reader need to know, in order to understand your explanations of the mistakes later in the report? That might include not only how each step is done, but also whether that step is easy or hard, solved with current techniques or still an open problem, etc.
The formatting instructions specify which headings are compulsory and whether you can add subsections below them (yes, you can).
November 6, 2022 at 20:55 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16403
Word count is defined in the writing up instructions.
November 6, 2022 at 18:43 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16398
The answer is in the phonetics material of Speech Processing – go back over Module 1 to recap how speech is produced, then Module 2 which covers the acoustic properties of vowels and consonants. You might also find Module 4 helps you answer this question, especially the last video, “Phoneme”.
Yes, that’s how you convert a log probability back to a probability (noting that logs are not necessarily base 10, although in Festival I believe they are).
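As a small illustration of that conversion, here is a minimal sketch. The function name `log_to_prob` and the example value are my own for illustration, not anything taken from Festival; the point is only that you must raise the same base that was used to take the logarithm.

```python
import math

def log_to_prob(log_p: float, base: float = 10.0) -> float:
    """Invert a logarithm: p = base ** log_p."""
    return base ** log_p

# The same illustrative probability stored using two different log bases.
p = 0.001
log10_p = math.log10(p)   # base-10 log, about -3
ln_p = math.log(p)        # natural log, about -6.9

# Each value converts back correctly only when the matching base is used.
print(log_to_prob(log10_p, base=10.0))
print(log_to_prob(ln_p, base=math.e))
```

Using the wrong base gives a valid-looking but incorrect number, so it is worth checking which base a tool reports before interpreting its output.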
“transition probabilities of observed states and emission states” is a little muddled – I suggest waiting for HMMs to be covered in the Speech Processing course and then seeing whether that helps you understand better. States emit observations according to some emission probability distribution, and there are also probabilities on the transitions between states.
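To make the distinction concrete, here is a toy sketch with two states. Everything in it is an assumption for illustration: the state names, the numbers, and the use of discrete emissions (models for speech normally use continuous distributions). It computes the joint log probability of one particular state sequence together with one observation sequence.

```python
import math

# Hypothetical two-state model; all numbers are made up for illustration.
initial = {"s1": 0.8, "s2": 0.2}          # P(first state)
transition = {                             # P(next state | current state)
    "s1": {"s1": 0.7, "s2": 0.3},
    "s2": {"s1": 0.1, "s2": 0.9},
}
emission = {                               # P(observation | state)
    "s1": {"a": 0.9, "b": 0.1},
    "s2": {"a": 0.2, "b": 0.8},
}

def joint_log_prob(states, observations):
    """log P(states, observations) = initial + transition terms + emission terms."""
    log_p = math.log(initial[states[0]])
    log_p += math.log(emission[states[0]][observations[0]])
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        log_p += math.log(transition[prev][cur])   # probability on the transition
        log_p += math.log(emission[cur][obs])      # state emits the observation
    return log_p

print(math.exp(joint_log_prob(["s1", "s1", "s2"], ["a", "a", "b"])))
```

Note that the states themselves are never "observed" in this picture: only the emitted symbols are, which is why the model is called hidden.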
For this assignment, you are not expected to modify the method used by Festival for any step in the pipeline, such as prob_models. The combination of methods used across the pipeline is part of the voice definition, and it won’t always make sense to modify one method in isolation (e.g., a subsequent step in the pipeline might be expecting a specific relation to be created by a preceding step).
Those are log probabilities. It is good practice to store probabilities as their logarithm because absolute probabilities can be very small numbers which are hard to store with sufficient precision.
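A quick way to see the precision problem is to multiply many small probabilities together and compare that with summing their logs. This is only a sketch: the count of 400 and the value 0.001 are arbitrary choices of mine, and base 10 is used simply to match the earlier remark about Festival.

```python
import math

# 400 illustrative probabilities of 0.001 each.
probs = [1e-3] * 400

# Multiplying directly: the true answer is 1e-1200, which is far smaller
# than the smallest positive double-precision number, so it underflows.
product = 1.0
for p in probs:
    product *= p

# Summing logs instead: the result is about -1200, which is easy to store.
log_sum = sum(math.log10(p) for p in probs)

print(product)   # underflows to zero
print(log_sum)
```

In the log domain, multiplication of probabilities becomes addition, so the whole computation can stay there and never needs to be converted back until a single final value is required.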
Diphone concatenation and TD-PSOLA could be implemented as a single process during waveform generation. We can simultaneously modify the F0 and duration inside each individual diphone and overlap-add the last pitch period of each diphone with the first pitch period of the next diphone to concatenate them.
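The join itself can be sketched as a simple cross-fade over one pitch period. This is a simplification under stated assumptions: the function name `overlap_add_join` is hypothetical, waveforms are plain lists of samples, the pitch period is the same on both sides of the join, and a linear ramp stands in for the tapered window a real implementation would use.

```python
def overlap_add_join(left, right, period):
    """Join two waveforms by cross-fading the last `period` samples of `left`
    with the first `period` samples of `right`."""
    if period <= 0 or period > min(len(left), len(right)):
        raise ValueError("period must be between 1 and the shorter waveform length")
    joined = list(left[:-period])            # everything before the overlap region
    offset = len(left) - period
    for i in range(period):
        fade_in = (i + 0.5) / period         # weight for `right`, rising towards 1
        fade_out = 1.0 - fade_in             # weight for `left`; the two sum to 1
        joined.append(fade_out * left[offset + i] + fade_in * right[i])
    joined.extend(right[period:])            # everything after the overlap region
    return joined

# Two toy "diphones" of 10 samples each, joined over a 4-sample pitch period.
result = overlap_add_join([1.0] * 10, [0.0] * 10, period=4)
print(len(result))
```

Because the two units share one pitch period at the join, the result is one period shorter than the two units laid end to end.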
In TD-PSOLA, the width of the analysis window is typically twice the fundamental period so that each window contains two pitch periods. This is so that, if we need to space the pitch periods further apart (i.e., to reduce F0), there is some waveform to ‘fill the gap’.
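Here is a minimal sketch of why the two-period window matters. It makes several assumptions that are mine rather than anything from the videos: F0 is constant, pitch marks are given as sample indices, a Hann taper is used, and the function names are hypothetical. Each analysis frame spans two pitch periods centred on a pitch mark, and the frames are then overlap-added at a new spacing.

```python
import math

def hann(n):
    """Periodic Hann window of length n (copies shifted by n/2 sum to 1)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def psola_change_period(wave, pitch_marks, period, new_period):
    """Re-space pitch periods from `period` to `new_period` samples, keeping duration."""
    window = hann(2 * period)                # analysis window spans two pitch periods
    frames = []
    for mark in pitch_marks:
        start = mark - period
        if start >= 0 and mark + period <= len(wave):   # need a full window
            frame = [wave[start + i] * window[i] for i in range(2 * period)]
            frames.append((mark, frame))
    if not frames:
        raise ValueError("no pitch mark has a full window around it")

    out = [0.0] * len(wave)
    centre = frames[0][0]
    while centre + period <= len(wave):
        # Reuse the analysis frame nearest in time, so duration is preserved.
        _, frame = min(frames, key=lambda f: abs(f[0] - centre))
        for i, sample in enumerate(frame):
            out[centre - period + i] += sample
        centre += new_period                 # larger spacing means lower F0
    return out

period = 20
wave = [math.sin(2.0 * math.pi * n / period) for n in range(200)]
marks = list(range(period, 200, period))
lower_f0 = psola_change_period(wave, marks, period, new_period=25)
```

With `new_period` equal to `period` the overlapping windows sum to one and the original waveform is recovered between the first and last marks. With a larger `new_period`, the tails of each two-period frame are what fill the space between the more widely spaced pitch marks.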
The nicely-plotted waveform shown in the video at 5:50 is for one of the diphones in the sequence. At 5:45 my hand-drawn waveform for one diphone was unfortunately only two periods long – that was sloppy of me and potentially confusable with a TD-PSOLA analysis frame, which it is not. A diphone would generally be longer than that.
Each impulse response does not represent one phone. A phone is generally much longer than T0: the vocal tract shape changes much more slowly than the vocal folds vibrate.
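Some rough arithmetic makes the difference in time scale clear. The numbers below are assumptions chosen only for illustration (an F0 of 100 Hz and a phone lasting 80 ms), not measurements.

```python
f0_hz = 100.0              # assumed fundamental frequency
phone_duration_s = 0.080   # assumed phone duration (80 ms)

t0_s = 1.0 / f0_hz                         # fundamental period: 10 ms
periods_per_phone = phone_duration_s / t0_s

print(t0_s)
print(periods_per_phone)   # roughly 8 pitch periods within this one phone
```

So, under these assumptions, the vocal folds produce several pulses, each exciting the vocal tract filter, during a single phone.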
November 4, 2022 at 18:15 in reply to: Finding documentation about recognizing non-standard words in Festival #16262
You can assume that Festival does it in the way described in this core reading. Remember that the assignment is about Text-to-Speech in general, not narrowly about Festival.
This recommended reading from Taylor will help you think about which parts of the problem are harder than others.
If you want to read beyond the course readings (which is of course optional and not expected, but something you may choose to do if aiming for a very high mark), then Chapter 5 of Taylor’s book, “Text decoding: finding the words from the text”, goes into more depth.