Forum Replies Created
Something has gone wrong with your PATH, because there is a mixture of festival_linux and festival_mac elements in the above messages.
This is an unsolved bug on the lab computers – see here for workarounds.
The number of “mixture components” is the number of individual Gaussians that are summed together in the Gaussian Mixture Model in each HMM state. Each such Gaussian is multidimensional (as you say, with 39 dimensions).
If you’re finding it hard to separate out the two concepts “number of mixture components” and “multivariate”, then aim to understand things in this order:
1. An individual univariate Gaussian: it emits observations that are vectors with 1 dimension (or just scalars, if you prefer)
2. Extend that to a mixture of univariate Gaussian components: observations are still vectors with 1 dimension, but drawn from a more complex probability density function (with multiple modes)
3. Go back to the individual univariate Gaussian in 1 dimension
4. Now extend that single univariate Gaussian to a single multivariate Gaussian: it emits observations that are vectors with, say, 39 dimensions but drawn from a distribution with only one mode
5. Put 2 and 4 together to get a model that emits observations that are vectors with 39 dimensions (that’s the multivariate part), and are drawn from a more complex probability density function (because it’s now a mixture distribution) – see the sketch below
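To make that progression concrete, here is a minimal numerical sketch of steps 1, 2, 4 and 5, assuming numpy and scipy are available; the means, variances and mixture weights are entirely arbitrary, and this is only an illustration, not part of the assignment:

import numpy as np
from scipy.stats import norm, multivariate_normal

x1 = 0.3                 # one observation with 1 dimension
x39 = np.zeros(39)       # one observation with 39 dimensions (arbitrary values)
w = [0.6, 0.4]           # mixture weights (must sum to 1)

# 1. Single univariate Gaussian
ll_1 = norm.logpdf(x1, loc=0.0, scale=1.0)

# 2. Mixture of two univariate Gaussian components: still 1-dimensional
#    observations, but a multi-modal density
ll_2 = np.logaddexp(np.log(w[0]) + norm.logpdf(x1, -1.0, 0.5),
                    np.log(w[1]) + norm.logpdf(x1, 2.0, 1.5))

# 4. Single multivariate Gaussian: 39-dimensional observations, one mode
#    (diagonal covariance, written here as an identity matrix)
ll_4 = multivariate_normal.logpdf(x39, mean=np.zeros(39), cov=np.eye(39))

# 5. Mixture of multivariate Gaussians: the GMM used in each HMM state
ll_5 = np.logaddexp(
    np.log(w[0]) + multivariate_normal.logpdf(x39, np.zeros(39), np.eye(39)),
    np.log(w[1]) + multivariate_normal.logpdf(x39, np.ones(39), 2 * np.eye(39)))

print(ll_1, ll_2, ll_4, ll_5)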
Those binaries are only built for Mac, so wouldn’t work on the new lab machines.
Using an objective measure is quite advanced for this assignment, and well beyond what is expected, but it’s good to try! The simplest approach is to install what is needed on your own machine (provided it’s Linux or Mac), and you could do that by following the installation instructions for Merlin which should give you all the tools you need.
If installation proves difficult, then don’t waste too much time on this, and move on to some other experiment.
Yes – perhaps you might discuss the relative importance of different error types (e.g., small time-alignment discrepancies vs. large alignment errors vs. incorrect vowel-reduction detection, etc.)
The phrase “and interpolates through the unvoiced regions of speech” was a mistake and I’ve removed it.
You are right that the pitchfork program (called by make_pm_wave) adds pseudo-pitchmarks in unvoiced regions when the -fill flag is used.

If there are consistent differences between the two halves of your data (e.g., recording conditions, voice quality, speaking effort, …), then the join cost will tend to favour sequences of diphones from within only one half. Try reducing the relative join cost weight to see if you get more switching between the halves within each utterance. Also try synthesising sentences from one half of the corpus, and then synthesising variants on those sentences that gradually move away from that half (perhaps introducing words only found in the other half).
It’s the nature of unit selection that you will always find some examples of better-sounding synthesis from a “theoretically-worse” voice (less data, domain mismatch, different join/target cost weight, etc). All you can hope for is that one voice sounds better than the other on average: you won’t get every single utterance sounding better in one voice than another.
This has important implications for the listening test and how many utterances you should use to measure this average difference.
In the dictionary’s phoneset, /?/ is the glottal stop. This character is problematic for HTK and also if used as part of a filename, and so is mapped to /Q/ for the forced alignment stage.
The glottal stop is a bit unusual in unilex-rpx: it never occurs in the dictionary, but can be predicted by the letter-to-sound model (*) – see this post. For this reason, it is not included in the phone list used for forced-alignment, which leads to the error you are getting.
Solution: don’t use the glottal stop in any pronunciations you add to my_lexicon.scm.

As already noted above, there is no point adding the output from LTS to my_lexicon.scm.

The correct method for adding a pronunciation for “bullheaded” would be to look up pronunciations of similar-sounding words (here, “bull”, “head”, “headed”) and assemble a pronunciation from these fragments, making any small modifications you think are needed.
(*) Normally, this would be impossible, because the letter-to-sound model is trained on the dictionary entries. I don’t actually know why it is possible in unilex-rpx.
Always remember that Praat is only showing an estimated value of F0, found using a method based on autocorrelation. Since Praat does not know a priori your F0 range, it uses a very wide range (75 to 600 Hz by default).
The low values you are seeing could either be truly low values (perhaps due to creaky voice as you suggest), or errors in estimation. To decide, you need to inspect the waveform closely and estimate the true fundamental period yourself.
If you need to find your minimum and maximum values of F0 in order to set parameters for automatic F0 estimation (“pitch tracking”), then you don’t need to include these extreme values: just aim to cover the majority of values, omitting outliers.
One approach used to set minimum and maximum automatically is to run the pitch tracker with a wide F0 range, plot a histogram of the resulting values across the whole database, and use that to choose tighter limits for a second run of the pitch tracker. (I’m not suggesting to do that, unless you are particularly interested in inspecting the F0 distribution.)
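If you did want to look at the distribution, here is a minimal sketch of that histogram step. It assumes you have already written all the estimated F0 values for the database to a plain text file, one value per line, with 0 for unvoiced frames; the filename and the percentiles are just examples:

import numpy as np
import matplotlib.pyplot as plt

f0 = np.loadtxt("all_f0_values.txt")   # hypothetical dump of F0 estimates in Hz
f0 = f0[f0 > 0]                        # discard unvoiced frames

plt.hist(f0, bins=100)
plt.xlabel("F0 (Hz)")
plt.ylabel("count")
plt.show()

# Pick tighter limits for a second pass of the pitch tracker,
# ignoring outliers; the exact percentiles are a judgement call.
f0_min, f0_max = np.percentile(f0, [2.5, 97.5])
print(f"suggested range: {f0_min:.0f} to {f0_max:.0f} Hz")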
You need more than a standard Festival installation in order to perform the assignment on your own machine. You should use the lab machines.
It is a good idea for each voice to have approximately the same loudness (which is a perceptual property). You can apply a global scaling factor to all the synthetic waveforms for a voice using sox:
% sox -v 3.5 in.wav out.wav
Choose a volume scaling factor (3.5 in the above example) by ear, then apply the same scaling to all waveforms from that voice. Be careful not to clip: it’s good practice to inspect the waveforms afterwards (e.g., in Praat) to check for this. You shouldn’t need to adjust the volume very precisely to achieve sufficiently uniform perceived loudness across voices.
It’s always safer to scale down rather than up, but if one voice is very quiet then scaling that one up is fine.
You might be tempted to normalise (i.e., automatically apply the maximum scaling-up that is safe without clipping) using sox, but doing that for individual files can result in varying perceptual loudness because the phonetic content will vary utterance-to-utterance. Loudness is not the same as amplitude!
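If you would rather not open every scaled file in Praat, one way to automate the clipping check is to report the peak sample value of each file. This sketch assumes 16-bit PCM wav files in a directory called scaled (both assumptions, not part of the assignment setup):

import glob
import wave
import numpy as np

for path in sorted(glob.glob("scaled/*.wav")):
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expecting 16-bit samples"
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    # Cast up before abs() so that -32768 does not overflow
    peak = np.abs(samples.astype(np.int32)).max()
    # Peaks at or very close to 32767 suggest the scaling factor is too large
    print(f"{path}  peak = {peak}  ({peak / 32768:.1%} of full scale)")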
The only changes, other than timing, will be vowel reductions and detection of short pauses at word boundaries (“sp” with non-zero duration). When you use different data, you get different models, and therefore it might be that the model of schwa now has a higher likelihood than the model of the full vowel, or vice versa.
Look inside the do_alignment script and find the dictionary that is used, to understand how the forced alignment detects vowel reduction (hint: try drawing the recognition network – which will be used for token passing – for one utterance).

The acoustic features used by the join cost are computed using short-term analysis with a fixed frame rate and frame duration. So you will need to have substantially-different pitch marks in order for the join to move enough for the join cost to use features from a different frame.
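As a rough back-of-the-envelope illustration (the 5 ms frame shift and the times below are assumed values, not necessarily what your voice uses):

FRAME_SHIFT = 0.005   # seconds; assumed frame shift of the join cost features

def frame_index(t):
    """Index of the analysis frame a join at time t would use."""
    return round(t / FRAME_SHIFT)

original_join = 1.2310   # made-up pitch mark position, in seconds
moved_join    = 1.2325   # the same pitch mark moved by 1.5 ms

print(frame_index(original_join), frame_index(moved_join))
# both map to frame 246, so the join cost sees identical feature vectors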
The run time might be dominated by the time taken to load the voice. There are several ways to control for that:
- Use a machine with no other users logged in
- Put a copy of the voice on the local disk of the machine you are using (e.g., copy your ss folder to /tmp and change to that folder before starting Festival) so that the loading time is fast and consistent
- Likewise, if saving the waveform output, make sure to write to local disk, or better to the system “black hole” file /dev/null, or even better not to write any output at all
- Synthesise a large enough set of sentences to give a run time of a minute or more, thus making the few seconds of voice loading time irrelevant
Remember also to report the “user” time (which is the time used by the process) and not “real” (which is wall-clock time).
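If you script the timing rather than reading it off the shell’s time command, a minimal sketch is below; the Scheme file name is hypothetical, and festival -b runs Festival in batch mode:

import resource
import subprocess

subprocess.run(["festival", "-b", "synthesise_test_set.scm"], check=True)

# CPU time of terminated child processes: the equivalent of "user" time,
# as opposed to wall-clock ("real") time
usage = resource.getrusage(resource.RUSAGE_CHILDREN)
print(f"user time: {usage.ru_utime:.2f} s")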
There are interactions between “Observation pruning” and “Beam pruning” and you will want to disable one when experimenting with the other. With small databases where there is only one candidate for some diphones, pruning will have no effect in those parts of the search: that one candidate will always have to be used, no matter what. So, do these experiments with the largest possible database (e.g., ARCTIC slt).
You need to deduce which subsequent steps depend on pitch marking – try drawing a flowchart of the steps in the voice building process showing the flow of information between them.
For example, pitch marks determine potential join locations, and so the computation of join cost coefficients will be affected. Therefore any steps related to join cost will need to be re-run.
(If in doubt, run all subsequent steps anyway, to be on the safe side.)
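If it helps to make the flowchart explicit, you could write the dependencies down and compute what needs re-running. In this sketch the step names and the dependencies are only illustrative, so check them against the actual voice building scripts:

# Each step maps to the steps whose output it consumes (illustrative only)
DEPENDS_ON = {
    "pitchmarking": [],
    "lpc_analysis": ["pitchmarking"],
    "join_cost_coefficients": ["pitchmarking"],
    "utterance_building": [],
    "build_voice": ["lpc_analysis", "join_cost_coefficients", "utterance_building"],
}

def needs_rerun(changed_step):
    """Return every step that directly or indirectly depends on changed_step."""
    affected = set()
    frontier = {changed_step}
    while frontier:
        frontier = {step for step, deps in DEPENDS_ON.items()
                    if step not in affected and frontier & set(deps)}
        affected |= frontier
    return affected

print(needs_rerun("pitchmarking"))
# lpc_analysis, join_cost_coefficients and build_voice (in some order)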