Forum Replies Created
Yes, these are both the spectrum of a voiced speech sound. The upper one appears to be on a linear vertical scale, so we only see the very largest amplitudes and everything else appears to be zero. The lower plot is on a logarithmic vertical scale and therefore we can see both very large and very small magnitudes on the same plot. The lower plot is more informative.
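To see the difference concretely, here is a minimal sketch (plain numpy/matplotlib, nothing to do with the exercise scripts; the frame length and the synthetic "voiced" signal are just stand-ins) that plots the same magnitude spectrum on a linear scale and on a dB (logarithmic) scale:

# Minimal sketch: the same magnitude spectrum on a linear and a dB (log) scale.
# The signal is a synthetic stand-in for a voiced frame so the example is
# self-contained; substitute a real frame of speech at sample rate fs.
import numpy as np
import matplotlib.pyplot as plt

fs = 16000
t = np.arange(0, 0.032, 1 / fs)                 # one 32 ms analysis frame
x = sum(np.sin(2 * np.pi * 120 * k * t) / k     # harmonics of a 120 Hz "voice"
        for k in range(1, 40))

magnitude = np.abs(np.fft.rfft(x * np.hanning(len(x))))
freqs = np.fft.rfftfreq(len(x), 1 / fs)

fig, (ax_lin, ax_log) = plt.subplots(2, 1, sharex=True)
ax_lin.plot(freqs, magnitude)                   # linear: only the largest peaks are visible
ax_lin.set_ylabel("magnitude (linear)")
ax_log.plot(freqs, 20 * np.log10(magnitude + 1e-12))   # dB: small magnitudes visible too
ax_log.set_ylabel("magnitude (dB)")
ax_log.set_xlabel("frequency (Hz)")
plt.show()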
The section of the exercise instructions about the target cost weight describes how to change it.
See the notes about pre-selection and pruning – these are probably why you are not hearing any difference. You cannot disable pre-selection (because this ensures the right thing is said!) but you can disable pruning.
To confirm that different candidates are actually being used, you can examine the Unit relation of the utterance. It’s possible that the selected units are changing but that you cannot hear the small difference this makes. Use the commands described here to examine which candidates are selected:
festival> (set! myutt (SayText "Hello world."))
festival> (utt.relation.print myutt 'Unit)
You should get everything working on a DICE machine, using just the CPU and a small data set, before attempting to use Eddie. Have you done that?
Looks like you have all processes set to “False”, which means nothing will be done (other than loading the config files and writing some log output). Set the processes you actually want to run to “True”.
You need to post error messages to get help with them.
The voice definition needs to be somewhere in Festival’s path – e.g., put it alongside any voices that were installed when you installed Festival originally.
Look in
/Volumes/Network/courses/ss/festival/lib.incomplete/voices-multisyn/english/localdir_multisyn-gam
on the lab machines. This is a special kind of voice definition that looks for all the voice files in the current working directory (i.e., wherever you start Festival from).
The instructions are only intended to be used on the machines in the lab. Sounds like you might be doing this on your own machine?
The voice definition file is provided for you – just make sure you set up your workspace correctly, and source the setup file in each new shell – this sets paths and so on, so that Festival finds the voice definition file.
The attached files didn’t upload (you need to add a .txt ending to get around the security on the forum). To debug on Eddie, do not submit jobs to the queue, but instead use an interactive session (using qlogin with appropriate flags to get on to a suitable GPU node).
But, talk to classmates first – several people are attempting to use Eddie and you should share knowledge and effort. Also, look at the Informatics MLP cluster, where people have got Merlin working – see
https://www.wiki.ed.ac.uk/display/CSTR/CSTR+computing+resources
There must be at least one pitchmark in every segment, to make pitch-synchronous signal processing possible. (Note: the earlier pitchmarking step inserts evenly spaced fake pitchmarks in unvoiced speech.)
A segment without any pitchmarks is most probably caused by misaligned labels, although very bad pitchmarking is also a potential cause.
See Section 4.2.3 of Multisyn: Open-domain unit selection for the Festival speech synthesis system, for example.
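As a rough illustration of that check (this is not part of the Multisyn build scripts; the data structures here are assumptions), you could find segments that contain no pitchmark like this:

# Illustrative check only, not part of the actual voice-building scripts:
# flag any labelled segment that contains no pitchmark.
# labels: list of (start, end, name) tuples in seconds, from the forced alignment
# pitchmarks: sorted list of pitchmark times in seconds
import bisect

def segments_without_pitchmarks(labels, pitchmarks):
    bad = []
    for start, end, name in labels:
        i = bisect.bisect_left(pitchmarks, start)   # first pitchmark at or after start
        if i >= len(pitchmarks) or pitchmarks[i] >= end:
            bad.append((start, end, name))          # nothing falls inside [start, end)
    return bad

# toy example: the middle segment has no pitchmark inside it
labels = [(0.00, 0.10, "sil"), (0.10, 0.18, "p"), (0.18, 0.30, "a")]
pitchmarks = [0.02, 0.05, 0.08, 0.19, 0.22, 0.25, 0.28]
print(segments_without_pitchmarks(labels, pitchmarks))   # -> [(0.1, 0.18, 'p')]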
You are right that ‘bad pitchmarking’ is detected during the build_utts step, whilst transferring timestamps from the forced alignment onto the utterance structures for the database.
Ah – poor wording in the paper. Blame the last author. This is clearer:
“Spectral discontinuity is estimated by calculating the Euclidean distance between a pair of vectors of 12 MFCCs: one from either side of a potential join point.”
So, indeed, there is one frame either side of the join.
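Here is a minimal sketch of that measure (plain numpy; this is not Festival’s actual code, and the MFCC matrices are assumed to be precomputed):

# Sketch of the spectral-discontinuity measure described above: Euclidean
# distance between two 12-dimensional MFCC vectors, one from the frame on
# either side of a potential join point. Not Festival's actual implementation.
import numpy as np

def spectral_join_cost(mfcc_left, mfcc_right):
    # mfcc_left, mfcc_right: (num_frames, 12) MFCC matrices for the two
    # candidate units; use the last frame of the left unit and the first
    # frame of the right unit, i.e. the frames either side of the join.
    a = mfcc_left[-1]
    b = mfcc_right[0]
    return float(np.linalg.norm(a - b))

# toy usage with random stand-in "MFCCs"
rng = np.random.default_rng(0)
left, right = rng.normal(size=(20, 12)), rng.normal(size=(15, 12))
print(spectral_join_cost(left, right))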
Can you point me to the exact place in the paper where this is mentioned, please?
That looks like an error, although it will only have an effect for relatively low-pitched female voices.
You are welcome to look at the Festival source code (which is now showing its age) but making these deep modifications is far beyond the scope of this exercise. You are not expected to do this.
Restrict yourself to changing things that are described in the instructions, and that can be done easily at the Festival interactive prompt. For example, you can change the relative weight between target cost and join cost.
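For orientation, that weight enters the total cost in roughly the standard Hunt & Black form (Festival’s implementation details, such as sub-weights inside the target cost, may differ):

C(t_{1:n}, u_{1:n}) = w \sum_{i=1}^{n} T(t_i, u_i) + \sum_{i=2}^{n} J(u_{i-1}, u_i)

so increasing w makes the search favour candidates that match the target specification, and decreasing it favours smoother joins.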
Both of you have identified interesting things to investigate. But, you would probably need a much larger database (and a lot more time) for such experiments to make sense.
You’re absolutely right about the current (2017-18) layout of the exercise to build your own unit selection voice. It’s a deep (and, even worse, variable-depth) hierarchical structure based too closely on how I personally like to arrange my thoughts, and not actually that helpful for students.
In previous years, the content of my courses was also arranged in this way, but the new versions have a structure with limited nesting. Student feedback suggests the new structure is much easier to navigate.
I plan to change all the exercises to have a similar simple structure, but didn’t want to do this in the middle of an academic year.
Thank you for the constructive feedback. Next year’s students will thank you!
There is no point in simply copying pronunciations predicted using the LTS model into a dictionary. The dictionary is for storing exceptions to the LTS model.
So, if the LTS model gets the pronunciation correct, no need to add it to the dictionary.
But, if the LTS model gets the pronunciation wrong, you need to add a manually-created correct entry to the dictionary.
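As a toy illustration of the “exceptions only” idea (plain Python, nothing to do with Festival’s actual dictionary format; the words and pronunciations are made up):

# Toy illustration of "the dictionary stores exceptions to the LTS model".
# lts_predict stands in for the letter-to-sound model; the words and
# pronunciations are invented for this example.
def lts_predict(word):
    fake_lts_output = {"cat": "k a t", "yacht": "y a ch t", "dog": "d o g"}
    return fake_lts_output[word]

hand_corrected = {"cat": "k a t", "yacht": "y o t", "dog": "d o g"}

# Keep only the entries where the LTS model gets it wrong: the exceptions.
exception_dictionary = {
    word: pron
    for word, pron in hand_corrected.items()
    if lts_predict(word) != pron
}
print(exception_dictionary)   # -> {'yacht': 'y o t'}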
The phone set used is the one for whichever voice you currently have loaded (i.e., you need to make sure this is the one you want).
Do not use different dictionaries for different stages in the process. This makes no sense at all: the symbol sets used by different dictionaries are not interchangeable (even if it looks like some of the same symbols are used – the names of the symbols are arbitrary).