Forum Replies Created
The SoundStream codes are an alternative to the mel spectrogram.
To do Text-to-Speech, we would train a model to generate SoundStream codes, instead of generating a mel spectrogram.
Before training the system, we would pass all our training data waveforms through the SoundStream encoder, thus converting each waveform into a sequence of codes.
(In the case of a mel spectrogram, we would pass each waveform through a mel-scale filterbank to convert it to a mel spectrogram.)
Then we train a speech synthesis model to predict a code sequence given a phone (or text) input.
To do speech synthesis, we perform inference with the model to generate a sequence of codes, given a phone (or text) input. We then pass that sequence of codes through the decoder of SoundStream which outputs a waveform.
(In the case of a mel spectrogram, we would pass the mel spectrogram to a neural vocoder, which would output a waveform.)
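As a rough illustration of the two pipelines, here is a minimal sketch in Python. The codec, acoustic model, and vocoder objects and their encode/decode/predict methods are hypothetical placeholders for this sketch, not the API of any real SoundStream or vocoder implementation:

# Minimal sketch of the two TTS pipelines described above.
# All objects and methods here are hypothetical placeholders,
# not a real SoundStream or vocoder API.

def prepare_training_targets(waveforms, codec):
    """Convert each training waveform into a sequence of codec codes
    (the analogue of computing a mel spectrogram for each waveform)."""
    return [codec.encode(w) for w in waveforms]

def synthesise_with_codec(phones, acoustic_model, codec):
    """Codec-based synthesis: phones -> codes -> waveform."""
    codes = acoustic_model.predict(phones)   # model trained to output codec codes
    return codec.decode(codes)               # SoundStream-style decoder gives a waveform

def synthesise_with_mel(phones, acoustic_model, vocoder):
    """Mel-based synthesis: phones -> mel spectrogram -> waveform."""
    mel = acoustic_model.predict(phones)     # model trained to output mel frames
    return vocoder(mel)                      # neural vocoder gives a waveform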
If the output is small, and you’re running Festival in interactive mode, just copy-paste from the terminal into any plain text editor.
If you want to capture everything from an interactive session, this will capture stdout in the file out.txt but still send it to the terminal, so you can use Festival interactively:

$ festival | tee out.txt

If you are running Festival in batch (non-interactive) mode, you can redirect stdout to a file using > like this:

$ festival -b some_batch_script.scm > out.txt
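If you are driving Festival from a script rather than the shell, one possible equivalent of that batch redirection is to capture stdout via Python's subprocess module (the script name here is just the same example placeholder as above):

# Run a Festival batch script and capture its stdout in out.txt,
# equivalent to: festival -b some_batch_script.scm > out.txt
import subprocess

with open("out.txt", "w") as f:
    subprocess.run(["festival", "-b", "some_batch_script.scm"], stdout=f, check=True)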
You can’t make a causal link from a lower target cost to “sounding better”, at least for any individual diphone or even an individual utterance. As you say, other factors are at play – notably the join cost.
Remember that the costs are only ever used relative to other costs: the search minimises the total cost.
If you want to inspect the target cost for a synthesised utterance, it is available in the Utterance object.
To inspect the differences between selected units (e.g., the diphones from different source utterances that you mention), you can look at the utterances they were taken from. For example, you could look at the original left and right phonetic context of the diphone in the source utterance, and compare that to the context in which it is being used in the target utterance. The more different these are, the worse we expect that unit to sound. This difference is exactly what the target cost measures.
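As a toy illustration of that idea (this is not Festival's actual target cost; the context features and weights below are invented for the example), a context-mismatch cost can be sketched like this:

# Toy target sub-cost: count (weighted) mismatched context features between the
# context a candidate unit came from and the context it is wanted in.
# This only illustrates the idea; the feature names and weights are invented.

TARGET_WEIGHTS = {
    "left_phone": 1.0,       # identity of the phone to the left of the diphone
    "right_phone": 1.0,      # identity of the phone to the right
    "stress": 0.5,           # lexical stress of the containing syllable
    "phrase_position": 0.5,  # e.g. initial / medial / final in the phrase
}

def toy_target_cost(candidate_context: dict, target_context: dict) -> float:
    """Sum of weights for every context feature that does not match."""
    return sum(
        weight
        for feature, weight in TARGET_WEIGHTS.items()
        if candidate_context.get(feature) != target_context.get(feature)
    )

# Example: a candidate diphone whose source context differs in two features.
candidate = {"left_phone": "s", "right_phone": "t", "stress": 1, "phrase_position": "medial"}
target    = {"left_phone": "s", "right_phone": "d", "stress": 0, "phrase_position": "medial"}
print(toy_target_cost(candidate, target))  # 1.0 + 0.5 = 1.5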
The unit selection search algorithm only guarantees that the selected sequence has the lowest sum of join and target costs.
It does not necessarily select an individual candidate unit that has the lowest target cost for its target position. So be careful when talking about “achieving a lower target cost”. The search will of course tend to achieve that, but only for the whole sequence.
When you say “the choice of one [candidate unit is] better than the other”, I think you simply mean “sounds better”. So that is what you would report to illustrate this; remember that expert listening (i.e., by yourself) is a valid method, provided you specify that in the experiment.
We went through this in a few recent classes (Module 6, Module 8, and the first class of the state-of-the-art module), so revise those classes first.
In your experiments, you have learned about unit selection when it is put into practice: when you built new voices from data, or when you synthesised new sentences using those voices. One thing you should have learned is the one you stated: the sensitivity of unit selection to some of the many design choices.
The marks under “practical implications for current methods” are for discussing the implications of what you have learned for methods such as FastPitch, Tacotron 2, or the latest approaches using language modelling. For example, do some or all current methods have the same design choices as unit selection? If so, would they be more or less sensitive to each choice?
A concrete example: the unit selection voices you have built all require pitch tracking to provide a value of F0. You may have done an experiment to discover what happens when the value of F0 is poorly estimated. FastPitch also requires F0. What do you think would happen if a FastPitch model was trained with poorly-estimated F0 values?
A second concrete example: for unit selection to work correctly, we require at least one recording of every possible diphone type. For it to work well, we require multiple recordings in a variety of contexts. We call this “coverage”. What might the coverage requirements be of current methods? Do they need more or less coverage than unit selection?
A third concrete example: unit selection, in principle (although not in the voices you have built), can use signal processing to manipulate the speech – for example, to make the joins less perceptible or to impose a desired prosodic pattern. This requires a representation of the speech waveform where properties including F0 can be modified. Is that still applicable for a current method which generates a mel spectrogram? What about an audio codec such as SoundStream?
How you incorporate this into your report is up to you: designing a good structure is part of the assignment.
Connected speech effects, including elision, will of course make forced alignment harder because there is a greater degree of mismatch between the labels and the speech. In your example above, there probably is no good alignment of those labels because there is acoustically little or no [v] in the speech.
This is a fundamental challenge in speech, and not easily solved!
But, if your alignments generally look OK, then you can say that forced alignment has been successful and move on through the subsequent steps of building the voice.
Figuring out why forced alignment fails, and then solving that, is part of the assignment.
The most common cause is too much mismatch between the labels and the speech. That might be as simple as excessively long leading/trailing silences (solution: endpoint), or something more tricky like the voice talent’s pronunciations being too different to those in the dictionary, or letter-to-sound pronunciations which are a poor match to how the voice talent pronounced certain words.
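For the endpointing case, one possible approach is to trim long leading and trailing silences automatically before alignment. This is only a sketch: the directory names and the top_db threshold are assumptions, and you should check the result by listening to or inspecting the trimmed waveforms.

# Trim long leading/trailing silence from each recording before forced alignment.
# The wav/ and wav_endpointed/ directory names and the top_db threshold are
# assumptions for this sketch; adjust them for your own data.
import glob
import os

import librosa
import soundfile as sf

os.makedirs("wav_endpointed", exist_ok=True)

for path in glob.glob("wav/*.wav"):
    audio, sr = librosa.load(path, sr=None)               # keep the original sample rate
    trimmed, _ = librosa.effects.trim(audio, top_db=40)   # drop quiet leading/trailing regions
    sf.write(os.path.join("wav_endpointed", os.path.basename(path)), trimmed, sr)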
Sometimes, the easiest solution is to use additional data (e.g., your own ARCTIC A recordings) to train the models.
Remember that this is not the same as including all of that data in the unit selection database: you could use all your data to train the alignment models, but only use specific subsets in the unit selection database for the voice you are building.
There are two different things going on here:
1. a handful of “bad pitch marking” warnings is acceptable, but not for every segment. See this post: https://speech.zone/forums/topic/bad-pitch-marking/#post-9237
2. most sp labels will have zero duration, and when you view them in Wavesurfer they will be drawn on top of a correct phone label, making them invisible. You need to manually delete all zero-duration sp labels before loading the file in Wavesurfer, as described in the “Find and fix a labelling error” step.
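One possible way to do that deletion automatically is sketched below. It assumes xlabel-style .lab files in a lab/ directory, where each label line starts with an end time and ends with the label name, and where a zero-duration sp shares its end time with the previous label; check your own files before relying on this. It writes to a separate lab_nosp/ directory so the originals are untouched.

# Remove zero-duration "sp" labels from xlabel-style .lab files.
# Assumptions for this sketch (check against your own files):
#   - label lines start with an end time and end with the label name
#   - a zero-duration sp has the same end time as the previous label
# Cleaned files are written to lab_nosp/ so the originals are untouched.
import glob
import os

os.makedirs("lab_nosp", exist_ok=True)

for path in glob.glob("lab/*.lab"):
    kept = []
    prev_end = None
    with open(path) as f:
        for line in f:
            parts = line.split()
            try:
                end = float(parts[0])
            except (ValueError, IndexError):
                kept.append(line)      # header or blank line: keep unchanged
                continue
            label = parts[-1]
            if label == "sp" and prev_end is not None and end == prev_end:
                continue               # drop the zero-duration sp label
            prev_end = end
            kept.append(line)
    with open(os.path.join("lab_nosp", os.path.basename(path)), "w") as f:
        f.writelines(kept)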
Yes, that’s correct. You can use different data to train the models for alignment than you eventually include in the unit selection database. (But be careful to report this, if it affects any of your experiments.)
For warning 8232, search the forums.
Looks like your forced alignment is very poor. You will find that all the words are there, but that the labels have become collapsed to the start and end of the file.
How much speech data are the models being trained on? If it is only a small amount, you could try adding the ARCTIC A utterances to your utts.data (just during forced alignment), so that the models are trained on more data and are more likely to work.

Lab sessions are unstructured
This is only partially fair feedback. The lab sessions do generally start with an overview from the lead tutor (usually Korin) about what to focus on that week. We have noticed that many students do not pay attention to this overview (we know this from the questions they ask later in that lab session).
We will continue to provide guidance at the start of each lab session about where you should be in the assignment (e.g., with reference to the milestones) and what to focus on.
The assignment is overwhelming
Including: too open-ended, instructions long and hard to follow, unclear expectations, and similar comments
The open-ended and slightly under-specified nature of the assignment is by design, so that students need to actively think about what they are doing, and why they are doing it, and not merely follow a sequence of instructions.
The goal of the assignment is to consolidate and test your understanding of the course material.
But I agree that this can be overwhelming and we should provide a little more structure and guidance. For this year, I have already added two class elements to address this:
- On 2024-02-13, we went over the whole assignment and made links to the relevant parts of the course material. The key takeaway was that you should develop your understanding of that course material by doing that aspect of the assignment and then demonstrate that understanding in the write-up.
- On 2024-03-12, we will go over the newly-revised structured marking scheme and see how to get a good mark. In other words, we will see how to demonstrate understanding.
Further coursework guidance may be added, depending on your feedback about the above.
Keep doing this
Number of people mentioning each point is given in parentheses.
Videos and flipped classroom format (17)
In-class interactivity (16)
Whiteboard group exercises (9)
Quizzes on speech.zone (6)
Have you inspected the alignment? Load one of your utterances and the corresponding labels into Wavesurfer or Praat and inspect them. Try that for a few different utterances.