Forum Replies Created
You need to post error messages to get help with them.
The voice definition needs to be somewhere in Festival’s path – e.g., put it alongside any voices that were installed when you installed Festival originally.
Look in
/Volumes/Network/courses/ss/festival/lib.incomplete/voices-multisyn/english/localdir_multisyn-gam
on the lab machines. This is a special kind of voice definition that looks for all the voice files in the current working directory (i.e., wherever you start Festival from).
The instructions are only intended to be used on the machines in the lab. Sounds like you might be doing this on your own machine?
The voice definition file is provided for you – just make sure you set up your workspace correctly, and source the setup file in each new shell – this sets paths and so on, so that Festival finds the voice definition file.
The attached files didn’t upload (you need to add a .txt ending to get around the security on the forum). To debug on Eddie, do not submit jobs to the queue, but instead use an interactive session (using qlogin with appropriate flags to get on to a suitable GPU node).
But, talk to classmates first – several people are attempting to use Eddie and you should share knowledge and effort. Also, look at the Informatics MLP cluster, where people have got Merlin working – see
https://www.wiki.ed.ac.uk/display/CSTR/CSTR+computing+resources
There must be at least one pitchmark in every segment, to make pitch-synchronous signal processing possible. (Note: the earlier pitchmarking step inserts evenly spaced fake pitchmarks in unvoiced speech.)
A segment without any pitchmarks is most probably caused by misaligned labels, although very bad pitchmarking is also a potential cause.
See Section 4.2.3 of Multisyn: Open-domain unit selection for the Festival speech synthesis system, for example.
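If it helps, here's the check expressed as a minimal Python sketch (the file formats and variable names are made up for illustration – this is not Festival's code): flag any segment whose time span contains no pitchmark.

import numpy as np

def find_empty_segments(segments, pitchmarks):
    # segments: list of (label, start_time, end_time) tuples, in seconds
    # pitchmarks: sorted list/array of pitchmark times, in seconds
    pm = np.asarray(pitchmarks)
    empty = []
    for label, start, end in segments:
        # count pitchmarks falling inside [start, end)
        if np.searchsorted(pm, end) - np.searchsorted(pm, start) == 0:
            empty.append((label, start, end))
    return empty

# e.g., no pitchmark falls inside the /t/ segment, so it is flagged:
print(find_empty_segments([('ae', 0.0, 0.1), ('t', 0.1, 0.16)], [0.02, 0.05, 0.08]))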
You are right that ‘bad pitchmarking’ is detected during the
build_utts
step, whilst transferring timestamps from the forced alignment onto the utterance structures for the database.

Ah – poor wording in the paper. Blame the last author. This is clearer:
“Spectral discontinuity is estimated by calculating the Euclidean distance between a pair of vectors of 12 MFCCs: one from either side of a potential join point.”
So, indeed, there is one frame either side of the join.
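In case it helps, here's that calculation as a minimal Python sketch (assuming the 12 MFCCs per frame have already been extracted; the function name is mine, not Festival's):

import numpy as np

def spectral_join_cost(mfcc_left, mfcc_right):
    # Euclidean distance between the 12-dimensional MFCC vectors taken from
    # the frame immediately before and the frame immediately after
    # a potential join point
    return float(np.linalg.norm(np.asarray(mfcc_left) - np.asarray(mfcc_right)))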
Can you point me to the exact place in the paper where this is mentioned, please?
That looks like an error, although it will only have an effect for relatively low-pitched female voices.
You are welcome to look at the Festival source code (which is now showing its age) but making these deep modifications is far beyond the scope of this exercise. You are not expected to do this.
Restrict yourself to changing things that are described in the instructions, and that can be done easily at the Festival interactive prompt. For example, you can change the relative weight between target cost and join cost.
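Conceptually, what that weight does is this (a sketch only – the real computation happens inside Festival's unit selection search, and you change the weight at the Festival prompt rather than in Python):

def total_cost(target_cost, join_cost, w):
    # one common formulation: scale the target cost relative to the join cost
    # a larger w favours units that match the target specification;
    # a smaller w favours smoother-sounding joins
    return w * target_cost + join_cost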
Both of you have identified interesting things to investigate. But, you would probably need a much larger database (and a lot more time) for such experiments to make sense.
You’re absolutely right about the current (2017-18) layout of the exercise to build your own unit selection voice. It’s a deep (and, even worse, variable-depth) hierarchical structure based too closely on how I personally like to arrange my thoughts, and not actually that helpful for students.
In previous years, the content of my courses was also arranged in this way, but the new versions have a structure with limited nesting. Student feedback suggests the new structure is much easier to navigate.
I plan to change all the exercises to have a similar simple structure, but didn’t want to do this in the middle of an academic year.
Thank you for the constructive feedback. Next year's students will thank you!
There is no point in simply copying pronunciations predicted using the LTS model into a dictionary. The dictionary is for storing exceptions to the LTS model.
So, if the LTS model gets the pronunciation correct, no need to add it to the dictionary.
But, if the LTS model gets the pronunciation wrong, you need to add a manually-created correct entry to the dictionary.
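In other words, the workflow is just this (a hypothetical Python sketch – lts.predict and the dictionary interface are made-up names for illustration):

def maybe_add_to_dictionary(word, correct_pron, lts, dictionary):
    predicted = lts.predict(word)  # hypothetical LTS interface
    if predicted != correct_pron:
        # the dictionary stores exceptions to the LTS model
        dictionary[word] = correct_pron
    # if the LTS prediction is already correct, store nothing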
The phone set used is the one for whichever voice you currently have loaded (i.e., you need to make sure this is the one you want).
Do not use different dictionaries for different stages in the process. This makes no sense at all: the symbol sets used by different dictionaries are not interchangeable (even if it looks like some of the same symbols are used – the names of the symbols are arbitrary).
You probably used the wrong voice, or wrong dictionary, to create your labels. For example, you ran some steps with one dictionary, and other steps with a different dictionary.
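A quick sanity check for this kind of mistake is to compare the symbols in your label files against the phone set of the currently loaded voice (a Python sketch with made-up variable names):

def check_label_symbols(label_phones, phone_set):
    # any symbol in the labels that is not in the loaded voice's phone set
    # is a telltale sign that different stages used different dictionaries
    unknown = sorted(set(label_phones) - set(phone_set))
    if unknown:
        print('symbols not in the loaded phone set:', unknown)
    return unknown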
This will probably be because of Apple’s over-strict security settings. The files are not damaged. Try downloading them in a browser other than Safari, or on a different computer (not in the lab).
March 3, 2018 at 10:57 in reply to: Amount of source text to start with, for my text selection algorithm #9131

If you can only find 550 in-domain sentences, then you're going to have to record pretty much all of them. So, no point doing text selection, but you should still measure the coverage.
You then propose to experiment with text selection using a bigger set of source data. That's a good idea – just what I recommend in the previous answer above.

You can measure coverage, demonstrate that your algorithm works, and so on – all good material for your report. But you perhaps don't need to record that dataset, unless you really enjoy recording.
In general, I’d expect you to record Arctic A plus additional data of your own design amounting to about the same size as Arctic A. I think you’re proposing a third set of the same size again. If you’re efficient at recording (i.e., you get almost all sentences right in one ‘take’), and the time taken to get this data ready for voice building (e.g., sanity checking, choosing the best ‘take’) is not too much, then you could do it. But it’s definitely not essential.
March 3, 2018 at 10:09 in reply to: Amount of source text to start with, for my text selection algorithm #9129

Your methodology is good: find as much domain-specific material as possible, and then use an algorithm to select the subset with best coverage.
You suggest that, because you are starting with a small amount of source text, you should select a smaller subset to record.
Actually, I would recommend selecting a subset that is the same size as Arctic A (you decide how to measure 'size'), because this would enable interesting comparisons to be made.
Starting with only 1100 sentences will limit how much your algorithm will be able to improve coverage, compared to random selection of a subset of the same size. But, it’s still a worthwhile exercise, because you’re doing all the important steps. So, go ahead.
In your report, you can acknowledge the limitations, and you could also show how much your algorithm was able to improve coverage. So, you have lots of ways to demonstrate your understanding and to get a good mark.
If you want to demonstrate that your text selection algorithm would work better given more source text, then you could run it on a much larger set (e.g., 1 million sentences) and measure coverage vs random selection. Don’t bother recording the selected sentences though – the goal is just to show that your algorithm works.
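For reference, a typical way to do this is greedy selection on diphone types, measuring coverage of the selected subset against the full set. The following Python sketch is illustrative only (the data structures and names are my own, not part of the exercise):

from itertools import tee

def diphones(phones):
    # adjacent pairs, e.g. ['sil', 'k', 'a'] -> {('sil', 'k'), ('k', 'a')}
    a, b = tee(phones)
    next(b, None)
    return set(zip(a, b))

def greedy_select(sentences, target_size):
    # sentences: list of (text, phone_sequence) pairs
    # repeatedly pick the sentence adding the most unseen diphone types
    covered, selected = set(), []
    remaining = list(sentences)
    while remaining and len(selected) < target_size:
        best = max(remaining, key=lambda s: len(diphones(s[1]) - covered))
        if not diphones(best[1]) - covered:
            break  # nothing left adds new coverage
        covered |= diphones(best[1])
        selected.append(best)
        remaining.remove(best)
    return selected, covered

def coverage(covered, all_diphones):
    # fraction of all diphone types that the selected subset covers
    return len(covered) / len(all_diphones)

Running the same coverage measure on a random subset of the same size gives you the baseline to beat.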