Forum Replies Created
Yes, DCT means Discrete Cosine Transform. We will come to that in the later part of Speech Processing, when we consider how to extract useful features from the FFT spectrum for use in Automatic Speech Recognition. We’ll also be looking at the Mel scale. Wait until we get there, then ask the question again.
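To preview what is coming: the DCT in question is the type-II DCT applied to log (mel) filterbank energies to produce cepstral coefficients. Here is a minimal pure-Python sketch of that transform; the toy input values are illustrative only:

```python
import math

def dct_ii(x):
    """Type-II DCT: the transform used to turn log (mel) filterbank
    energies into cepstral coefficients (MFCCs)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N))
            for k in range(N)]

# Toy example: 8 log filterbank energies -> 8 cepstral coefficients.
log_energies = [2.0, 2.1, 1.9, 1.5, 1.2, 1.0, 0.9, 0.8]
cepstrum = dct_ii(log_energies)
# In practice only the first dozen or so coefficients are kept as features.
```

Note that coefficient 0 is just the sum (overall energy) of the input, and a perfectly flat input produces zeros everywhere else, which is why the DCT compactly summarises the smooth spectral envelope.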
Yes, these are both the spectrum of a voiced speech sound. The upper one appears to be on a linear vertical scale, so we only see the very largest amplitudes and everything else appears to be zero. The lower plot is on a logarithmic vertical scale and therefore we can see both very large and very small magnitudes on the same plot. The lower plot is more informative.
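The effect of the two vertical scales is easy to demonstrate with made-up magnitudes spanning several orders of magnitude, as in a real speech spectrum:

```python
import math

# Hypothetical spectral magnitudes spanning five orders of magnitude.
magnitudes = [1000.0, 50.0, 1.0, 0.01]

# Linear scale (normalised to the maximum): the small values are
# indistinguishable from zero next to the largest peak.
linear = [m / max(magnitudes) for m in magnitudes]   # [1.0, 0.05, 0.001, 1e-05]

# Logarithmic (dB) scale: large and small magnitudes remain visible together.
db = [20 * math.log10(m) for m in magnitudes]        # [60.0, ~34.0, 0.0, -40.0]
```

On the linear scale, everything below a few percent of the peak plots as essentially zero, which is exactly what the upper plot shows.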
In the exercise, see the section on the target cost weight for how to change it.
See the notes about pre-selection and pruning – these are probably why you are not hearing any difference. You cannot disable pre-selection (because this ensures the right thing is said!) but you can disable pruning.
To confirm that different candidates are actually being used, you can examine the Unit relation of the utterance. It’s possible that the selected units are changing but you cannot hear the small difference this makes. Use the commands described here to examine which candidates are selected:
festival> (set! myutt (SayText "Hello world."))
festival> (utt.relation.print myutt 'Unit)
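As an aside, the effect of pruning can be sketched in a few lines of Python. The candidate representation and field names here are illustrative, not Festival’s actual implementation: with a smaller beam, fewer candidate units survive to the join-cost search, which is why disabling pruning can change which units get selected.

```python
def prune(candidates, beam_size):
    """Keep only the beam_size candidates with the lowest target cost,
    discarding the rest before the search over join costs."""
    return sorted(candidates, key=lambda c: c["target_cost"])[:beam_size]

# Three hypothetical candidates for one target diphone:
candidates = [{"unit": "ae_3", "target_cost": 2.5},
              {"unit": "ae_1", "target_cost": 0.4},
              {"unit": "ae_2", "target_cost": 1.1}]

survivors = prune(candidates, 2)   # ae_1 and ae_2 survive; ae_3 is pruned
```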
You should get everything working on a DICE machine, using just the CPU and a small data set, before attempting to use Eddie. Have you done that?
Looks like you have all processes set to “False” which means nothing will be done (other than loading the config files and writing some log output).
You need to post error messages to get help with them.
The voice definition needs to be somewhere in Festival’s path – e.g., put it alongside any voices that were installed when you installed Festival originally.
Look in
/Volumes/Network/courses/ss/festival/lib.incomplete/voices-multisyn/english/localdir_multisyn-gam
on the lab machines. This is a special kind of voice definition that looks for all the voice files in the current working directory (i.e., wherever you start Festival from).
The instructions are only intended to be used on the machines in the lab. Sounds like you might be doing this on your own machine?
The voice definition file is provided for you – just make sure you set up your workspace correctly, and source the setup file in each new shell – this sets paths and so on, so that Festival finds the voice definition file.
The attached files didn’t upload (you need to add a .txt ending to get around the security on the forum). To debug on Eddie, do not submit jobs to the queue, but instead use an interactive session (using qlogin with appropriate flags to get on to a suitable GPU node).
But, talk to classmates first – several people are attempting to use Eddie and you should share knowledge and effort. Also, look at the Informatics MLP cluster, where people have got Merlin working – see
https://www.wiki.ed.ac.uk/display/CSTR/CSTR+computing+resources
There must be at least one pitchmark in every segment, to make pitch-synchronous signal processing possible. (Note: the earlier pitchmarking step inserts evenly spaced fake pitchmarks in unvoiced speech.)
A segment without any pitchmarks is most probably caused by misaligned labels, although very bad pitchmarking is also a potential cause.
See Section 4.2.3 of Multisyn: Open-domain unit selection for the Festival speech synthesis system, for example.
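The filling-in of unvoiced regions can be sketched as follows. The representation of pitchmarks as a list of times, and the 5 ms default period, are illustrative assumptions, not Festival’s actual settings:

```python
def fill_unvoiced_pitchmarks(pitchmarks, region_start, region_end, period=0.005):
    """Insert evenly spaced 'fake' pitchmarks across an unvoiced region,
    as the earlier pitchmarking step does, so that every segment ends up
    containing at least one pitchmark for pitch-synchronous processing."""
    fakes = []
    n = 1
    while region_start + n * period < region_end - 1e-9:
        fakes.append(region_start + n * period)
        n += 1
    return sorted(pitchmarks + fakes)

# An unvoiced gap between 0.10 s and 0.13 s gets fake marks every 5 ms:
pm = fill_unvoiced_pitchmarks([0.08, 0.10, 0.13], 0.10, 0.13)
```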
You are right that ‘bad pitchmarking’ is detected during the build_utts step, whilst transferring timestamps from the forced alignment onto the utterance structures for the database.
Ah – poor wording in the paper. Blame the last author. This is clearer:
“Spectral discontinuity is estimated by calculating the Euclidean distance between a pair of vectors of 12 MFCCs: one from either side of a potential join point.”
So, indeed, there is one frame either side of the join.
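In code form, that spectral discontinuity measure is just a Euclidean distance between two 12-dimensional MFCC vectors. A minimal sketch (the function name and toy vectors are illustrative):

```python
import math

def spectral_join_cost(mfcc_left, mfcc_right):
    """Euclidean distance between two 12-dimensional MFCC vectors,
    one taken from either side of a potential join point."""
    assert len(mfcc_left) == len(mfcc_right) == 12
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mfcc_left, mfcc_right)))

# Identical frames join for free; dissimilar frames incur a cost:
cost = spectral_join_cost([0.5] * 12, [0.5] * 12)   # 0.0
```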
Can you point me to the exact place in the paper that this is mentioned please?
That looks like an error, although it will only have an effect for relatively low-pitched female voices.
You are welcome to look at the Festival source code (which is now showing its age) but making these deep modifications is far beyond the scope of this exercise. You are not expected to do this.
Restrict yourself to changing things that are described in the instructions, and that can be done easily at the Festival interactive prompt. For example, you can change the relative weight between target cost and join cost.
Both of you have identified interesting things to investigate. But, you would probably need a much larger database (and a lot more time) for such experiments to make sense.
You’re absolutely right about the current (2017-18) layout of the exercise to build your own unit selection voice. It’s a deep (and, even worse, variable-depth) hierarchical structure based too closely on how I personally like to arrange my thoughts, and not actually that helpful for students.
In previous years, the content of my courses was also arranged in this way, but the new versions have a structure with limited nesting. Student feedback suggests the new structure is much easier to navigate.
I plan to change all the exercises to have a similar simple structure, but didn’t want to do this in the middle of an academic year.
Thank you for the constructive feedback. Next year’s students will thank you!