We went through this in a few recent classes (Module 6, Module 8, and the first class of the state-of-the-art module), so revise those classes first.
In your experiments, you have seen unit selection put into practice: building new voices from data, and synthesising new sentences using those voices. One thing you should have learned is the one you stated: the sensitivity of unit selection to some of its many design choices.
The marks under “practical implications for current methods” are for discussing the implications of what you have learned for methods such as FastPitch, Tacotron 2, or the latest approaches using language modelling. For example, do some or all current methods have the same design choices as unit selection? If so, would they be more or less sensitive to each choice?
A concrete example: the unit selection voices you have built all require pitch tracking to provide a value of F0. You may have done an experiment to discover what happens when the value of F0 is poorly estimated. FastPitch also requires F0. What do you think would happen if a FastPitch model was trained with poorly-estimated F0 values?
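To make the F0 example concrete, here is a toy sketch (all names are illustrative, not part of any toolkit) of the classic pitch-tracking failure mode: random octave errors, where F0 is halved or doubled on some voiced frames. You could use something like this to corrupt the F0 values fed to training and observe the effect.

```python
import random

def corrupt_f0(f0_track, error_rate=0.1, seed=0):
    """Simulate classic pitch-tracking failures: random octave errors
    (halving or doubling F0) on a fraction of voiced frames.
    Frames with F0 == 0.0 are treated as unvoiced and left alone."""
    rng = random.Random(seed)
    corrupted = []
    for f0 in f0_track:
        if f0 > 0 and rng.random() < error_rate:
            f0 *= rng.choice([0.5, 2.0])  # octave-down or octave-up error
        corrupted.append(f0)
    return corrupted

clean = [0.0, 120.0, 125.0, 130.0, 128.0, 0.0]  # 0.0 = unvoiced frame
noisy = corrupt_f0(clean, error_rate=0.5)
```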
A second concrete example: for unit selection to work correctly, we require at least one recording of every possible diphone type. For it to work well, we require multiple recordings in a variety of contexts. We call this “coverage”. What might the coverage requirements be of current methods? Do they need more or less coverage than unit selection?
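Coverage is easy to measure yourself. A minimal sketch (assuming utterances are already available as lists of phone symbols; the format is illustrative) that counts diphone tokens and flags types with only a single recording:

```python
from collections import Counter

def diphone_coverage(utterances):
    """Count diphone tokens across a list of utterances, where each
    utterance is a list of phone symbols. Pads with 'sil' so that
    utterance-initial and utterance-final diphones are also counted."""
    counts = Counter()
    for phones in utterances:
        padded = ["sil"] + phones + ["sil"]
        for left, right in zip(padded, padded[1:]):
            counts[(left, right)] += 1
    return counts

script = [["h", "ax", "l", "ow"], ["l", "ow", "n", "ow"]]
counts = diphone_coverage(script)
# Types with only one token: covered, but with no variety of contexts.
singletons = [d for d, c in counts.items() if c == 1]
```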
A third concrete example: unit selection, in principle (although not in the voices you have built), can use signal processing to manipulate the speech – for example, to make the joins less perceptible or to impose a desired prosodic pattern. This requires a representation of the speech waveform where properties including F0 can be modified. Is that still applicable for a current method which generates a mel spectrogram? What about an audio codec such as SoundStream?
How you incorporate this into your report is up to you: designing a good structure is part of the assignment.
Connected speech effects, including elision, will of course make forced alignment harder because there is a greater degree of mismatch between the labels and the speech. In your example above, there probably is no good alignment of those labels because there is acoustically little or no [v] in the speech.
This is a fundamental challenge in speech, and not easily solved!
But, if your alignments generally look OK, then you can say that forced alignment has been successful and move on through the subsequent steps of building the voice.
Figuring out why forced alignment fails, and then solving that, is part of the assignment.
The most common cause is too much mismatch between the labels and the speech. That might be as simple as excessively long leading/trailing silences (solution: endpoint), or something more tricky like the voice talent’s pronunciations being too different to those in the dictionary, or letter-to-sound pronunciations which are a poor match to how the voice talent pronounced certain words.
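On the endpointing solution: the idea is simply to trim long leading and trailing silences before alignment. A naive energy-based sketch (real tools are more robust; thresholds and frame sizes here are arbitrary):

```python
def endpoint(samples, frame_len=160, threshold=1e-4, pad_frames=5):
    """Naive energy-based endpointing: trim leading/trailing frames whose
    mean squared amplitude falls below a threshold, keeping a few frames
    of padding around the detected speech."""
    n_frames = len(samples) // frame_len
    energies = [
        sum(s * s for s in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]
    speech = [i for i, e in enumerate(energies) if e >= threshold]
    if not speech:
        return samples  # nothing above threshold: leave untouched
    start = max(0, speech[0] - pad_frames) * frame_len
    end = min(n_frames, speech[-1] + 1 + pad_frames) * frame_len
    return samples[start:end]
```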
Sometimes, the easiest solution is to use additional data (e.g., your own ARCTIC A recordings) to train the models.
Remember that this is not the same as including all of that data in the unit selection database: you could use all your data to train the alignment models, but only use specific subsets in the unit selection database for the voice you are building.
There are two different things going on here:
1. a handful of “bad pitch marking” warnings is acceptable, but not one for every segment. See this post: https://speech.zone/forums/topic/bad-pitch-marking/#post-9237
2. most sp labels will have zero duration, and when you view them in Wavesurfer they will be drawn on top of a correct phone label, thus making it invisible. You need to manually delete all zero-duration sp labels before loading the file in Wavesurfer, as described in the “Find and fix a labelling error” step.

Yes, that’s correct. You can use different data to train the models for alignment than you eventually include in the unit selection database. (But be careful to report this, if it affects any of your experiments.)
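On the zero-duration sp labels: rather than deleting them by hand, you could filter the label file with a short script. A sketch, assuming labels are plain “start end label” lines (adapt this to whatever format your files actually use):

```python
def strip_zero_duration_sp(lines):
    """Drop zero-duration 'sp' entries from label-file lines of the
    assumed form 'start end label'; all other lines pass through."""
    kept = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            start, end, label = parts
            if label == "sp" and float(start) == float(end):
                continue  # zero-duration sp: would hide the label beneath it
        kept.append(line)
    return kept

labels = [
    "0.00 0.25 sil",
    "0.25 0.25 sp",   # zero duration
    "0.25 0.40 ax",
]
cleaned = strip_zero_duration_sp(labels)
```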
For warning 8232, search the forums.
Looks like your forced alignment is very poor. You will find that all the words are there, but that the labels have become collapsed to the start and end of the file.
How much speech data are the models being trained on? If it is only a small amount, you could try adding the ARCTIC A utterances to your
utts.data (just during forced alignment), so that the models are trained on more data and are more likely to work.

Lab sessions are unstructured
This is only partially fair feedback. The lab sessions do generally start with an overview from the lead tutor (usually Korin) about what to focus on that week. We have noticed that many students do not pay attention to this overview (we know this from the questions they ask later in that lab session).
We will continue to provide guidance at the start of each lab session about where you should be in the assignment (e.g., with reference to the milestones) and what to focus on.
The assignment is overwhelming
Including: too open-ended, instructions that are long and hard to follow, unclear expectations, and similar comments.
The open-ended and slightly under-specified nature of the assignment is by design, so that students need to actively think about what they are doing, and why they are doing it, and not merely follow a sequence of instructions.
The goal of the assignment is to consolidate and test your understanding of the course material.
But I agree that this can be overwhelming and we should provide a little more structure and guidance. For this year, I have already added two class elements to address this:
- On 2024-02-13, we went over the whole assignment and made links to the relevant parts of the course material. The key takeaway was that you should develop your understanding of that course material by doing that aspect of the assignment and then demonstrate that understanding in the write-up.
- On 2024-03-12, we will go over the newly-revised structured marking scheme and see how to get a good mark. In other words, we will see how to demonstrate understanding.
Further coursework guidance may be added, depending on your feedback about the above.
Keep doing this
Number of people mentioning each point is given in parentheses.
Videos and flipped classroom format (17)
In-class interactivity (16)
Whiteboard group exercises (9)
Quizzes on speech.zone (6)
Have you inspected the alignment? Load one of your utterances and the corresponding labels into Wavesurfer or Praat and inspect them. Try that for a few different utterances.
This script makes a list of the
.mfcc files on which the forced-alignment HMMs will be trained. After running it, the file train.scp should contain the list of .mfcc filenames, corresponding to the lines in utts.data.

How large is your corpus?
What is the goal of finding the out-of-dictionary words?
If you wish to exclude all sentences that contain such a word, then you’ll have to find them all – you could do this using the provided Festival script (which might be slow for a very large corpus) or some other way (by writing your own code).
But if your aim is to identify all the words you might need to add to the dictionary, then you are not expected to do that for the large source corpus. You might need to rely on letter-to-sound to provide pronunciations during the text selection phase.
You should only manually write pronunciations for a modest number of words appearing in your (much smaller) recording script.
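Finding the out-of-dictionary words in your recording script amounts to a set difference. A minimal sketch, assuming you can load the script words and dictionary headwords into Python lists (the names here are illustrative, not the Festival interface):

```python
def out_of_dictionary(script_words, dictionary):
    """Return the words in the recording script that are missing from
    the pronunciation dictionary (case-folded, deduplicated, sorted)."""
    known = {w.lower() for w in dictionary}
    return sorted({w.lower() for w in script_words} - known)

dictionary = ["the", "cat", "sat"]
script = ["The", "cat", "sat", "on", "the", "mat"]
oov = out_of_dictionary(script, dictionary)  # candidates for hand-written pronunciations
```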
Check your quota like this:
$ quota --hum
If the figure in the
space column is larger than the quota (which is generally 5000 MB = 5 GB), then you need to remove files. Use the du command mentioned earlier in this topic to find what is taking up the most space. (The abrt-cli status warning is probably also caused by a full quota.)

If you are making many voices for the Speech Synthesis assignment, in separate copies of the ss directory, you can share files between them where applicable (e.g., the wav directory for voices that use the same database). See https://speech.zone/forums/topic/symbolic-links/ (if that looks tricky, do it with a tutor in a lab session).

You shouldn’t need that much space to do the assignment. It’s likely that you have a large number of unnecessary files somewhere. Check disk usage like this:
cd
du -sh *
du -sh .?*
cd changes to your home directory. The first du measures the size of all regular files and directories. The second uses a glob .?* that matches all the hidden items (anything whose name starts with a period), including the directory .cache.

It should be safe to delete the contents of .cache, if that is the offending directory, or you can delete just some of its subdirectories if you prefer.