Forum Replies Created
Something has gone wrong with your PATH, because there is a mixture of festival_linux and festival_mac elements in the above messages.
This is an unsolved bug on the lab computers – see here for workarounds.
The number of “mixture components” is the number of individual Gaussians that are summed together in the Gaussian Mixture Model in each HMM state. Each such Gaussian is multidimensional (as you say, with 39 dimensions).
If you’re finding it hard to separate out the two concepts “number of mixture components” and “multivariate”, then aim to understand things in this order:
1. An individual univariate Gaussian: it emits observations that are vectors with 1 dimension (or just scalars, if you prefer)
2. Extend that to a mixture of univariate Gaussian components: observations are still vectors with 1 dimension, but drawn from a more complex probability density function (with multiple modes)
3. Go back to the individual univariate Gaussian in 1 dimension
4. Now extend that single univariate Gaussian to a single multivariate Gaussian: it emits observations that are vectors with, say, 39 dimensions but drawn from a distribution with only one mode
5. Put 2 and 4 together to get a model that emits observations that are vectors with 39 dimensions (that’s the multivariate part), and are drawn from a more complex probability density function (because it’s now a mixture distribution) – see the sketch below
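To make that progression concrete, here is a minimal numerical sketch of steps 1, 2, 4 and 5, assuming numpy and scipy are available; the means, variances and mixture weights are entirely arbitrary, and this is only an illustration, not part of the assignment:

import numpy as np
from scipy.stats import norm, multivariate_normal

x1 = 0.3                 # one observation with 1 dimension
x39 = np.zeros(39)       # one observation with 39 dimensions (arbitrary values)
w = [0.6, 0.4]           # mixture weights (must sum to 1)

# 1. Single univariate Gaussian
ll_1 = norm.logpdf(x1, loc=0.0, scale=1.0)

# 2. Mixture of two univariate Gaussian components: still 1-dimensional
#    observations, but a multi-modal density
ll_2 = np.logaddexp(np.log(w[0]) + norm.logpdf(x1, -1.0, 0.5),
                    np.log(w[1]) + norm.logpdf(x1, 2.0, 1.5))

# 4. Single multivariate Gaussian: 39-dimensional observations, one mode
#    (diagonal covariance, written here as an identity matrix)
ll_4 = multivariate_normal.logpdf(x39, mean=np.zeros(39), cov=np.eye(39))

# 5. Mixture of multivariate Gaussians: the GMM used in each HMM state
ll_5 = np.logaddexp(
    np.log(w[0]) + multivariate_normal.logpdf(x39, np.zeros(39), np.eye(39)),
    np.log(w[1]) + multivariate_normal.logpdf(x39, np.ones(39), 2 * np.eye(39)))

print(ll_1, ll_2, ll_4, ll_5)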
Those binaries are only built for Mac, so wouldn’t work on the new lab machines.
Using an objective measure is quite advanced for this assignment, and well beyond what is expected, but it’s good to try! The simplest approach is to install what is needed on your own machine (provided it’s Linux or Mac), and you could do that by following the installation instructions for Merlin which should give you all the tools you need.
If installation proves difficult, then don’t waste too much time on this, and move on to some other experiment.
Yes – perhaps you might discuss the relative importance of different error types (e.g., small time-alignment discrepancies vs. large alignment errors vs. incorrect vowel-reduction detection, etc.)
The phrase “and interpolates through the unvoiced regions of speech” was a mistake and I’ve removed it.
You are right that the pitchfork program (called by make_pm_wave) adds pseudo-pitchmarks in unvoiced regions when the -fill flag is used.

If there are consistent differences between the two halves of your data (e.g., recording conditions, voice quality, speaking effort, …), then the join cost will tend to favour sequences of diphones from within only one half. Try reducing the relative join cost weight to see if you get more switching between the halves within each utterance. Also try synthesising sentences from one half of the corpus, and then synthesising variants on those sentences that gradually move away from that half (perhaps introducing words only found in the other half).
It’s the nature of unit selection that you will always find some examples of better-sounding synthesis from a “theoretically-worse” voice (less data, domain mismatch, different join/target cost weight, etc). All you can hope for is that one voice sounds better than the other on average: you won’t get every single utterance sounding better in one voice than another.
This has important implications for the listening test and how many utterances you should use to measure this average difference.
In the dictionary’s phoneset, /?/ is the glottal stop. This character is problematic for HTK and also if used as part of a filename, and so is mapped to /Q/ for the forced alignment stage.
The glottal stop is a bit unusual in unilex-rpx: it never occurs in the dictionary, but can be predicted by the letter-to-sound model (*) – see this post. For this reason, it is not included in the phone list used for forced-alignment, which leads to the error you are getting.
Solution: don’t use the glottal stop in any pronunciations you add to my_lexicon.scm.

As already noted above, there is no point adding the output from LTS to my_lexicon.scm.

The correct method for adding a pronunciation for “bullheaded” would be to look up pronunciations of similar-sounding words (here, “bull”, “head”, “headed”) and assemble a pronunciation from these fragments, making any small modifications you think are needed.
(*) Normally, this would be impossible, because the letter-to-sound model is trained on the dictionary entries. I don’t actually know why it is possible in unilex-rpx.
Always remember that Praat is only showing an estimated value of F0, found using a method based on autocorrelation. Since Praat does not know a priori your F0 range, it uses a very wide range (75 to 600 Hz by default).
The low values you are seeing could either be truly low values (perhaps due to creaky voice as you suggest), or errors in estimation. To decide, you need to inspect the waveform closely and estimate the true fundamental period yourself.
If you need to find your minimum and maximum values of F0 in order to set parameters for automatic F0 estimation (“pitch tracking”), then you don’t need to include these extreme values: just aim to cover the majority of values, omitting outliers.
One approach used to set minimum and maximum automatically is to run the pitch tracker with a wide F0 range, plot a histogram of the resulting values across the whole database, and use that to choose tighter limits for a second run of the pitch tracker. (I’m not suggesting to do that, unless you are particularly interested in inspecting the F0 distribution.)
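If you did want to look at the distribution, here is a minimal sketch of that histogram step. It assumes you have already written all the estimated F0 values for the database to a plain text file, one value per line, with 0 for unvoiced frames; the filename and the percentiles are just examples:

import numpy as np
import matplotlib.pyplot as plt

f0 = np.loadtxt("all_f0_values.txt")   # hypothetical dump of F0 estimates in Hz
f0 = f0[f0 > 0]                        # discard unvoiced frames

plt.hist(f0, bins=100)
plt.xlabel("F0 (Hz)")
plt.ylabel("count")
plt.show()

# Pick tighter limits for a second pass of the pitch tracker,
# ignoring outliers; the exact percentiles are a judgement call.
f0_min, f0_max = np.percentile(f0, [2.5, 97.5])
print(f"suggested range: {f0_min:.0f} to {f0_max:.0f} Hz")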
You need more than a standard Festival installation in order to perform the assignment on your own machine. You should use the lab machines.
It is a good idea for each voice to have approximately the same loudness (which is a perceptual property). You can apply a global scaling factor to all the synthetic waveforms for a voice using sox:
% sox -v 3.5 in.wav out.wav
Choose a volume scaling factor (3.5 in the above example) by ear, then apply the same scaling to all waveforms from that voice. Be careful not to clip: it’s good practice to inspect the waveforms afterwards (e.g., in Praat) to check for this. You shouldn’t need to adjust the volume very precisely to achieve sufficiently uniform perceived loudness across voices.
It’s always safer to scale down rather than up, but if one voice is very quiet then scaling that one up is fine.
You might be tempted to normalise (i.e., automatically apply the maximum scaling-up that is safe without clipping) using sox, but doing that for individual files can result in varying perceptual loudness because the phonetic content will vary utterance-to-utterance. Loudness is not the same as amplitude!
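If you would rather not open every scaled file in Praat, one way to automate the clipping check is to report the peak sample value of each file. This sketch assumes 16-bit PCM wav files in a directory called scaled (both assumptions, not part of the assignment setup):

import glob
import wave
import numpy as np

for path in sorted(glob.glob("scaled/*.wav")):
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expecting 16-bit samples"
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    # Cast up before abs() so that -32768 does not overflow
    peak = np.abs(samples.astype(np.int32)).max()
    # Peaks at or very close to 32767 suggest the scaling factor is too large
    print(f"{path}  peak = {peak}  ({peak / 32768:.1%} of full scale)")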
The only changes, other than timing, will be vowel reductions and detection of short pauses at word boundaries (“sp” with non-zero duration). When you use different data, you get different models, and therefore it might be that the model of schwa now has a higher likelihood than the model of the full vowel, or vice versa.
Look inside the do_alignment script and find the dictionary that is used, to understand how the forced alignment detects vowel reduction (hint: try drawing the recognition network – which will be used for token passing – for one utterance).

The acoustic features used by the join cost are computed using short-term analysis with a fixed frame rate and frame duration. So you will need to have substantially-different pitch marks in order for the join to move enough for the join cost to use features from a different frame.
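As a rough back-of-the-envelope illustration (the 5 ms frame shift and the times below are assumed values, not necessarily what your voice uses):

FRAME_SHIFT = 0.005   # seconds; assumed frame shift of the join cost features

def frame_index(t):
    """Index of the analysis frame a join at time t would use."""
    return round(t / FRAME_SHIFT)

original_join = 1.2310   # made-up pitch mark position, in seconds
moved_join    = 1.2325   # the same pitch mark moved by 1.5 ms

print(frame_index(original_join), frame_index(moved_join))
# both map to frame 246, so the join cost sees identical feature vectors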
The run time might be dominated by the time taken to load the voice. There are several ways to control for that:
- Use a machine with no other users logged in
- Put a copy of the voice on the local disk of the machine you are using (e.g., copy your ss folder to /tmp and change to that folder before starting Festival) so that the loading time is fast and consistent
- Likewise, if saving the waveform output, make sure to write to local disk, or better to the system “black hole” file /dev/null, or even better not to write any output at all
- Synthesise a large enough set of sentences to give a run time of a minute or more, thus making the few seconds of voice loading time irrelevant
Remember also to report the “user” time (which is the time used by the process) and not “real” (which is wall-clock time).
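If you script the timing rather than reading it off the shell’s time command, a minimal sketch is below; the Scheme file name is hypothetical, and festival -b runs Festival in batch mode:

import resource
import subprocess

subprocess.run(["festival", "-b", "synthesise_test_set.scm"], check=True)

# CPU time of terminated child processes: the equivalent of "user" time,
# as opposed to wall-clock ("real") time
usage = resource.getrusage(resource.RUSAGE_CHILDREN)
print(f"user time: {usage.ru_utime:.2f} s")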
There are interactions between “Observation pruning” and “Beam pruning” and you will want to disable one when experimenting with the other. With small databases where there is only one candidate for some diphones, pruning will have no effect in those parts of the search: that one candidate will always have to be used, no matter what. So, do these experiments with the largest possible database (e.g., ARCTIC slt).
You need to deduce which subsequent steps depend on pitch marking – try drawing a flowchart of the steps in the voice building process showing the flow of information between them.
For example, pitch marks determine potential join locations, and so the computation of join cost coefficients will be affected. Therefore any steps related to join cost will need to be re-run.
(If in doubt, run all subsequent steps anyway, to be on the safe side.)
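If it helps to make the flowchart explicit, you could write the dependencies down and compute what needs re-running. In this sketch the step names and the dependencies are only illustrative, so check them against the actual voice building scripts:

# Each step maps to the steps whose output it consumes (illustrative only)
DEPENDS_ON = {
    "pitchmarking": [],
    "lpc_analysis": ["pitchmarking"],
    "join_cost_coefficients": ["pitchmarking"],
    "utterance_building": [],
    "build_voice": ["lpc_analysis", "join_cost_coefficients", "utterance_building"],
}

def needs_rerun(changed_step):
    """Return every step that directly or indirectly depends on changed_step."""
    affected = set()
    frontier = {changed_step}
    while frontier:
        frontier = {step for step, deps in DEPENDS_ON.items()
                    if step not in affected and frontier & set(deps)}
        affected |= frontier
    return affected

print(needs_rerun("pitchmarking"))
# lpc_analysis, join_cost_coefficients and build_voice (in some order)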