Forum Replies Created
Yes, that’s how you convert a log probability back to a probability (noting that logs are not necessarily base 10, although in Festival I believe they are).
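Purely as an illustration (the base and the example value here are assumptions, not something read from a real Festival utterance), converting back looks like this:

```python
import math

log_p = -2.3  # an illustrative log probability

# If the logarithm is base 10 (as I believe Festival uses):
p_base10 = 10 ** log_p       # about 0.005

# If it were a natural logarithm instead:
p_natural = math.exp(log_p)  # about 0.100

print(p_base10, p_natural)
```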
“transition probabilities of observed states and emission states” is a little muddled – I suggest waiting for HMMs to be covered in the Speech Processing course and then see if that helps you understand better. States emit observations according to some emission probability distribution, and there are also probabilities on the transitions between states.
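If a concrete picture helps before the lectures get there, here is a toy sketch with invented numbers (two states, two possible observations); it is only meant to show that emission and transition probabilities are two different sets of parameters:

```python
# A toy HMM with two states. All numbers are invented, purely for illustration.

# Transition probabilities: P(next state | current state)
transitions = {
    "S1": {"S1": 0.7, "S2": 0.3},
    "S2": {"S1": 0.4, "S2": 0.6},
}

# Emission probabilities: each state emits observations from its own distribution
emissions = {
    "S1": {"a": 0.9, "b": 0.1},
    "S2": {"a": 0.2, "b": 0.8},
}

# Probability of being in S1, emitting "a", then moving to S2 and emitting "b":
p = emissions["S1"]["a"] * transitions["S1"]["S2"] * emissions["S2"]["b"]
print(p)  # 0.9 * 0.3 * 0.8 = 0.216
```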
You are not expected, for this assignment, to modify the method used by Festival for any step in the pipeline, such as prob_models. The combination of which methods are used for each part of the pipeline is part of the voice definition, and it won’t always make sense to modify one method in isolation (e.g., a subsequent step in the pipeline might be expecting a specific relation to be created by a preceding step).

Those are log probabilities. It is good practice to store probabilities as their logarithm, because absolute probabilities can be very small numbers which are hard to store with sufficient precision.
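Here is a small demonstration of why (the numbers are arbitrary): multiplying many small probabilities underflows in double precision, whereas summing their logs does not.

```python
import math

p = 1e-5               # an illustrative per-step probability
n = 100                # number of steps

product = 1.0
log_sum = 0.0
for _ in range(n):
    product *= p
    log_sum += math.log(p)

print(product)   # 0.0 - underflow: the true value 1e-500 is not representable
print(log_sum)   # about -1151.3, i.e. log(1e-500), stored without difficulty
```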
Diphone concatenation and TD-PSOLA could be implemented as a single process during waveform generation. We can simultaneously modify the F0 and duration inside each individual diphone and overlap-add the last pitch period of each diphone with the first pitch period of the next diphone to concatenate them.
In TD-PSOLA, the width of the analysis window is typically twice the fundamental period so that each window contains two pitch periods. This is so that, if we need to space the pitch periods further apart (i.e., to reduce F0), there is some waveform to ‘fill the gap’.
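If you want to see the idea in code, here is a heavily simplified sketch (not Festival’s implementation; the pitchmark handling, edge cases and window choice are all assumptions): analysis frames two pitch periods wide are cut around each pitchmark, windowed, and then overlap-added at a new spacing to change F0.

```python
import numpy as np

def psola_resynthesis(x, pitchmarks, new_period):
    """Very simplified TD-PSOLA-style resynthesis (illustrative only).

    x          : 1-D waveform (numpy array)
    pitchmarks : sample indices of the analysis pitchmarks
    new_period : desired output pitch period in samples (sets the new F0)
    """
    # Analysis: take a two-pitch-period window around each pitchmark
    T0 = int(np.median(np.diff(pitchmarks)))   # original pitch period (samples)
    half = T0                                  # window is 2*T0 wide -> half-width T0
    window = np.hanning(2 * half)

    frames = []
    for m in pitchmarks:
        if m - half < 0 or m + half > len(x):
            continue                           # skip edge frames for simplicity
        frames.append(x[m - half:m + half] * window)

    # Synthesis: overlap-add the frames at the *new* spacing
    y = np.zeros(len(frames) * new_period + 2 * half)
    for i, frame in enumerate(frames):
        start = i * new_period
        y[start:start + 2 * half] += frame
    return y
```

Spacing the frames closer together than the original period raises F0; spacing them further apart lowers it, and the two-period-wide frames are what fill the gaps.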
The nicely-plotted waveform shown in the video at 5:50 is for one of the diphones in the sequence. At 5:45 my hand-drawn waveform for one diphone was unfortunately only two periods long – that was sloppy of me and potentially confusable with a TD-PSOLA analysis frame, which it is not. A diphone would generally be longer than that.
Each impulse response does not represent one phone. A phone is generally much longer than T0: the vocal tract shape changes much more slowly than the vocal folds vibrate. For example, at an F0 of 100 Hz, T0 is only 10 ms, whereas a typical phone lasts tens to hundreds of milliseconds, so many pitch periods occur within a single phone.
November 4, 2022 at 18:15 in reply to: Finding documentation about recognizing non-standard words in Festival #16262
You can assume that Festival does it in the way described in this core reading. Remember that the assignment is about Text-to-Speech in general, not narrowly about Festival.
This recommended reading from Taylor will help you think about which parts of the problem are harder than others.
If you want to read beyond the course readings (which is of course optional and not expected, but something you may choose to do if aiming for a very high mark), then Chapter 5 Text decoding: finding the words from the text of Taylor’s book goes into more depth.
November 4, 2022 at 15:59 in reply to: Taking at look at letter-to-sound rules (for new words) #16256
No, you don’t need to explain specific failures by referencing the actual tree. That would not be very insightful anyway: although we often say that decision trees are human readable, the error might occur very deep down the tree and be hard to explain.
Instead, focus on the general properties of the model being used. For example, why might a decision tree make errors at all? How could you do better: a bigger tree, a smaller tree, a change from a decision tree to another model, training it on more data, and so on?
For a word that Festival’s letter-to-sound model got wrong, was it because that word might be particularly hard for some reason? What reason?
Yes: what is the overall aim of the report? What question(s) are you trying to answer? Essentially, you want to give the reader some motivation to read the rest of the report: what will they find out, and why is it interesting?
November 4, 2022 at 12:17 in reply to: Taking at look at letter-to-sound rules (for new words) #16231
Yes – see here. Remember that, for English, they are not rules as such, but a decision tree learned from data (i.e., from the dictionary).
Can you point at the exact video and timestamp for context, so I can give a precise answer?
Yes, it’s possible to think of diphone synthesis and unit selection along a continuum.
At one end is diphone synthesis, in which we have exactly one copy of each diphone type, so there is no need for a selection algorithm, but we will need lots of signal processing to manipulate the recordings.
At the other end is unit selection with an infinitely large database containing all conceivable variants of every possible diphone type. Now, selection from that database becomes the critical step. With perfect selection criteria, we will find such a perfect sequence of units that no signal processing will be required: the units will already have exactly the right acoustic properties for the utterance being synthesised.
Real systems, of course, live somewhere in-between. We can’t have an infinite database; even if we could, there are no perfect selection criteria (target cost and join cost). A real system will select a pretty good sequence of units most of the time. A little signal processing might be employed, for example to smooth the joins.
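To make the two cost terms a little more concrete, here is a minimal sketch of the total cost of one candidate unit sequence (the function names and the simple summation are my own simplification, not any particular system’s):

```python
def total_cost(targets, candidates, target_cost, join_cost):
    """Total cost of one candidate unit sequence (illustrative simplification).

    targets     : list of target specifications (one per diphone position)
    candidates  : list of selected candidate units, same length as targets
    target_cost : function(target, unit) -> how well the unit matches the target
    join_cost   : function(unit_a, unit_b) -> how well two units join
    """
    cost = sum(target_cost(t, u) for t, u in zip(targets, candidates))
    cost += sum(join_cost(a, b) for a, b in zip(candidates[:-1], candidates[1:]))
    return cost
```

A real system searches over all candidate sequences for the one that minimises this total, which is where Dynamic Programming comes in (see the reply about the Viterbi algorithm below).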
First person singular is acceptable for the lab report. The passive is not wrong, but can create problems including being more verbose, or (much more importantly) being potentially ambiguous about who did something.
It’s crucial to be clear about which ideas and work are your own, and which are not. Prioritise that above any stylistic decisions.
November 1, 2022 at 15:50 in reply to: Module 6 – acoustic similarity features for join costs #16209
There are many ways of measuring the spectral difference between two speech sounds. Formants would be one way, although it can be difficult to accurately estimate them in speech, and they don’t exist for all speech sounds, so we rarely use them. The cepstrum is a way to parameterise the spectral envelope, and we’ll be properly defining the cepstrum in the upcoming part of Speech Processing about Automatic Speech Recognition.
Cepstral distance would be a good choice for measuring how similar two speech sounds are whilst ignoring their F0. (That’s what we’ll use it for in Automatic Speech Recognition.) So, you are correct that we can use it as part of the join cost in unit selection speech synthesis for measuring how different the spectral envelopes are on either side of a potential join.
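As a sketch of what that might look like (real systems may weight the coefficients or use other distance measures; the vectors here are made up):

```python
import numpy as np

def cepstral_distance(c_left, c_right):
    """Euclidean distance between two cepstral coefficient vectors,
    e.g. from the last frame of one candidate unit and the first frame
    of the next candidate. (Illustrative only.)"""
    c_left = np.asarray(c_left, dtype=float)
    c_right = np.asarray(c_right, dtype=float)
    return np.sqrt(np.sum((c_left - c_right) ** 2))

# Two made-up 4-dimensional cepstral vectors either side of a potential join:
print(cepstral_distance([1.2, -0.3, 0.5, 0.1], [1.0, -0.2, 0.6, 0.0]))
```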
Your question about “cepstral distance between the realised acoustics of the chosen phones and the acoustics of the target specification” is straying into material from the Speech Synthesis course, in a method known as hybrid unit selection. Best to wait for that course, rather than answer the question now.
We’ll be covering the Viterbi algorithm in the next section of Speech Processing, about Automatic Speech Recognition. In Module 7, we will encounter the general algorithm of Dynamic Programming in the method known as Dynamic Time Warping. Later, we will see another form of Dynamic Programming, called the Viterbi algorithm, applied to Hidden Markov Models. So, wait for those parts of the course, then ask your question again.
Your questions about whether to take the maximum or minimum relate to whether we are maximising the total (which is what we would do if it was a probability) or minimising it (which is what we would do if it was a distance).
Regarding your questions about whether to multiply or sum: if we are working with probabilities, we multiply. If we are working with distances or costs, we sum (and in fact, we will end up working with log probabilities, which we will sum).
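As a preview, and only as a sketch (the trellis here is generic and the numbers invented, not a specific HMM or unit selection lattice), here is the min-sum form of Dynamic Programming; the max-product form over probabilities is the same algorithm with min/argmin replaced by max/argmax, or equivalently a maximum over summed log probabilities:

```python
import numpy as np

def min_cost_path(local_costs, transition_costs):
    """Dynamic Programming over a trellis, minimising a total summed cost.

    local_costs      : array (T, N) - cost of each of N candidates at each of T steps
    transition_costs : array (N, N) - cost of moving from candidate i to candidate j

    Returns the best path (list of candidate indices) and its total cost.
    """
    T, N = local_costs.shape
    best = np.zeros((T, N))                    # best total cost of reaching each candidate
    backpointer = np.zeros((T, N), dtype=int)

    best[0] = local_costs[0]
    for t in range(1, T):
        for j in range(N):
            costs = best[t - 1] + transition_costs[:, j] + local_costs[t, j]
            backpointer[t, j] = np.argmin(costs)
            best[t, j] = costs[backpointer[t, j]]

    # Trace back the best path from the cheapest final candidate
    path = [int(np.argmin(best[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointer[t, path[-1]]))
    path.reverse()
    return path, float(best[-1].min())

# Example: 3 steps, 2 candidates per step (numbers invented)
local = np.array([[1.0, 4.0], [2.0, 0.5], [1.0, 3.0]])
trans = np.array([[0.0, 2.0], [2.0, 0.0]])
print(min_cost_path(local, trans))   # ([0, 0, 0], 4.0)
```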
The F0 estimation tool is part of CSTR’s own Speech Tools library and is an implementation of this algorithm:
Y. Medan, E. Yair and D. Chazan, “Super resolution pitch determination of speech signals,” IEEE Transactions on Signal Processing, vol. 39, no. 1, pp. 40-48, Jan. 1991, DOI: 10.1109/78.80763.
with improvements described in:
Paul C. Bagshaw, Steven Hiller and Mervyn A. Jack, “Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching,” in Proc. EUROSPEECH’93, pp. 1003-1006.
At the heart of the algorithm is the same idea as RAPT: the correlation between the waveform and a time-shifted copy of itself (variously called autocorrelation or cross-correlation).
For the purposes of the assignment, you may assume that the algorithm is essentially the same as RAPT, since that is the one taught in the course.
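To show just that core idea (and none of the refinements that make a real tracker like RAPT or the Speech Tools implementation work well), here is a crude autocorrelation-based estimate for a single voiced frame; the test signal is invented:

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Crude autocorrelation F0 estimate for one voiced frame (illustrative only)."""
    frame = frame - np.mean(frame)
    # Correlate the frame with time-shifted copies of itself (zero lag onwards)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]

    lag_min = int(fs / f0_max)        # shortest plausible pitch period, in samples
    lag_max = int(fs / f0_min)        # longest plausible pitch period, in samples
    best_lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / best_lag

# Example: a 100 Hz periodic signal sampled at 16 kHz
fs = 16000
t = np.arange(0, 0.04, 1 / fs)
frame = np.sin(2 * np.pi * 100 * t) + 0.3 * np.sin(2 * np.pi * 200 * t)
print(estimate_f0(frame, fs))        # close to 100.0
```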
The pitchmarking method is indeed based on finding negative-going zero-crossings of the differentiated waveform. Here’s the code if you want to read it (not required!).
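Purely to illustrate the idea (the real pitchmarker does considerably more, e.g. filtering and correcting spurious marks), negative-going zero-crossings of the differentiated waveform can be found like this:

```python
import numpy as np

def negative_going_zero_crossings(x):
    """Indices where the differentiated waveform crosses zero going negative.
    These correspond to local maxima of the original waveform. (Sketch only.)"""
    d = np.diff(x)                       # differentiate the waveform
    # A negative-going zero-crossing: d >= 0 at one sample, d < 0 at the next
    return np.where((d[:-1] >= 0) & (d[1:] < 0))[0] + 1

# Example on a simple 100 Hz sinusoid: the marks fall near the waveform peaks
fs = 16000
t = np.arange(0, 0.02, 1 / fs)
x = np.sin(2 * np.pi * 100 * t)
print(negative_going_zero_crossings(x))
```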