Forum Replies Created
The fundamental frequency of a complex wave does not necessarily have the largest amplitude in the spectrum.
We can use the source-filter model to understand how that is possible, using figure 4.13 from Ladefoged that you attached. These are idealised speech waveforms, made by passing an impulse train through a filter.
The filter is particularly simple in this example: it has a single resonance at 600 Hz.
Energy at or close to the resonant frequency is amplified by the filter, whereas energy at frequencies far away from the resonant frequency is attenuated. To convince yourself that a filter can do that, think about a brass instrument like a trumpet: the input is generated by vibrating lips at the mouthpiece, which is not very loud, yet the output can be very loud indeed.
The input impulse train in Ladefoged’s example has a fundamental frequency of 100 Hz in the uppermost plot. This contains equal amounts of energy at 100 Hz, 200 Hz, 300 Hz, 400 Hz, 500 Hz, 600 Hz, 700 Hz, and so on.
Thinking in the frequency domain will be easier than thinking in the time domain. The spectrum of that impulse train tells us that the waveform is equivalent to a sine wave at 100 Hz, added to one at 200 Hz, another one at 300 Hz, and so on.
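To see that equivalence concretely, here is a small numpy sketch (my own illustration, not part of Ladefoged’s figure): adding up equal-amplitude sine components at 100 Hz, 200 Hz, 300 Hz and so on really does produce sharp pulses every 10 ms, i.e. an impulse train with a 100 Hz fundamental.

import numpy as np

fs = 16000                      # sample rate in Hz (my choice for this illustration)
f0 = 100                        # fundamental frequency in Hz
t = np.arange(0, 0.03, 1 / fs)  # 30 ms of time axis

# Sum equal-amplitude cosine components at every harmonic of f0 below the Nyquist frequency.
wave = sum(np.cos(2 * np.pi * k * f0 * t) for k in range(1, fs // (2 * f0)))

# The peaks of the summed waveform fall every 10 ms, which is one period of 100 Hz.
peak_samples = np.where(wave > 0.9 * wave.max())[0]
print(peak_samples / fs)        # approximately [0.0, 0.01, 0.02]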
All that a (linear) filter can do to a sine wave is change its amplitude: increase or decrease it. The amount of increase or decrease plotted against frequency is called the frequency response of the filter. The filter in the example has a peak in its frequency response at 600 Hz, meaning that any input at that frequency (e.g., the 600 Hz sine wave component of the impulse train) will be amplified.
Try this yourself in the lab: take an impulse train and pass it through a filter that has a single resonance (in Praat you can use “filter one formant”), then inspect the waveform and the spectrum. With appropriate filter settings you can almost entirely attenuate the fundamental. But listen to the resulting signal and you will perceive the same pitch as the original impulse train.
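If you would rather script that experiment than click through Praat, here is a rough scipy sketch of the same idea. The single resonance is approximated with a second-order peaking filter; the 600 Hz centre frequency, the Q value and the sample rate are my own choices rather than anything prescribed by Praat.

import numpy as np
from scipy import signal

fs = 16000                          # sample rate in Hz (assumed)
f0, resonance = 100, 600            # fundamental of the impulse train; filter resonance

# One second of impulse train: one unit impulse every fs/f0 samples.
x = np.zeros(fs)
x[::fs // f0] = 1.0

# A single resonance, approximated here by scipy's second-order peaking filter.
# Q controls the bandwidth; a large Q gives a narrow, strong resonance.
b, a = signal.iirpeak(resonance / (fs / 2), Q=30)
y = signal.lfilter(b, a, x)

# Compare spectrum magnitudes at the fundamental and at the resonance.
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), 1 / fs)
for f in (f0, resonance):
    bin_index = np.argmin(np.abs(freqs - f))
    print(f"{f} Hz: {20 * np.log10(spectrum[bin_index]):.1f} dB")

You could also scale y, write it out with scipy.io.wavfile.write and listen: even with the fundamental heavily attenuated, the perceived pitch is still that of the 100 Hz impulse train.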
People have tried using automatic speech recognition to evaluate the intelligibility of synthetic speech, but with only limited success. So the simple answer is that there is no objective measure of intelligibility.
The Blizzard Challenges in 2008, 2009, and 2010 included tasks on Mandarin Chinese and the summary papers for these years (available from http://festvox.org/blizzard/index.html) tell you about the two measures used: pinyin error rate with or without tone.
The general answer is: no, Word Error Rate is not the most useful measure of intelligibility for all languages.
There is no single index of all available TTS front-ends – the closest thing would be on the SynSIG website’s software list.
Availability varies widely with language, and for some there is no free software available.
So the short answer is “No – you’ll need to talk to your supervisor”.
Here are some useful guidelines on making good figures.
1. A weighted average WER is correct, but the simplest way to calculate this is just to sum up insertions, deletions and substitutions across the entire test set, then divide by the total number of words in the reference (see the short sketch after this list).
2. I’m not sure you would often want to report WER for individual sentences – this will be highly variable and likely to be based on very few samples. You would need to have a specific reason to report (and analyse) per-sentence WER.
3. No, that’s not the reason! It’s easy to automate WER calculation. Published work using too few listeners just indicates lazy experimenters!
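To make point 1 concrete, here is a tiny sketch of that pooled calculation; the per-sentence counts are invented purely for illustration.

# Hypothetical per-sentence counts: (insertions, deletions, substitutions, reference word count)
counts = [
    (0, 1, 2, 8),    # sentence 1
    (1, 0, 0, 12),   # sentence 2
    (0, 0, 3, 5),    # sentence 3
]

total_errors = sum(ins + dele + sub for ins, dele, sub, _ in counts)
total_ref_words = sum(words for _, _, _, words in counts)
print(f"WER = {100 * total_errors / total_ref_words:.1f}%")   # 7 errors / 25 words = 28.0%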
The Blizzard Challenge uses a standard dynamic programming approach to align the reference with what the listener transcribed – very much like HResults from HTK or sclite. WER is then calculated in the usual way, summing up insertions, deletions and substitutions and dividing by the total number of words in the reference.
The procedure is slightly enhanced for Blizzard to allow for listeners’ typos, which are defined in a manually-created lookup table that is updated for each new test set, once we see the typical mistakes listeners make for those particular sentences.
For your listening tests, I recommend manually correcting any typos, then either computing WER manually, or using HResults – that’s just a matter of getting things in the right file format. Your reference would be in an MLF and your listener transcriptions would be in .rec files.
Whilst we are on this topic, this is a good time to remember that in general you cannot compute WER per sentence, then average over all sentences. This is only valid if all sentences have the same number of words (in the reference).
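For anyone who wants to compute this themselves rather than via HResults, here is a rough sketch of the standard dynamic-programming alignment and of the pooled WER; the two example sentences are invented. It also shows how averaging per-sentence WERs gives a different number when sentence lengths differ.

def align_counts(ref, hyp):
    """Minimum-edit-distance alignment of reference and hypothesis word lists.
    Returns (substitutions, deletions, insertions), in the spirit of HResults/sclite."""
    # dp[i][j] = (total errors, substitutions, deletions, insertions) for ref[:i] versus hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)              # everything deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)              # everything inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, ins = dp[i - 1][j - 1]
            match_or_sub = (e, s, d, ins) if ref[i - 1] == hyp[j - 1] else (e + 1, s + 1, d, ins)
            e, s, d, ins = dp[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = dp[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            dp[i][j] = min(match_or_sub, deletion, insertion)   # fewest total errors wins
    return dp[-1][-1][1:]                    # (substitutions, deletions, insertions)

# Invented example: two reference sentences with very different lengths.
refs = ["the cat sat on the mat", "hello"]
hyps = ["the cat sat on a mat", "yellow"]

total_errors, total_ref_words, per_sentence = 0, 0, []
for ref, hyp in zip(refs, hyps):
    s, d, ins = align_counts(ref.split(), hyp.split())
    total_errors += s + d + ins
    total_ref_words += len(ref.split())
    per_sentence.append((s + d + ins) / len(ref.split()))

print("pooled WER:", 100 * total_errors / total_ref_words)            # 2 errors / 7 words = 28.6%
print("mean per-sentence WER:", 100 * sum(per_sentence) / len(refs))  # (1/6 + 1/1) / 2 = 58.3%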
I’ve not heard of this error before, but I do know that many people have used Qualtrics successfully.
One thing to try would be converting the wav file to a high-bitrate mp3 (at least 128 kbps, ideally 320 kbps) and seeing whether Qualtrics prefers that. Not ideal, but an acceptable workaround.
If the error persists, contact the IS helpline by email for support.
You are probably overthinking. If the audio plays correctly in whatever tool you use to implement the listening test, then there is no reason to pad with extra silence.
Why do you need to do this?
One option would be just to add some silence using sox.
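If you do decide you need padding and would rather stay in Python than use sox, a minimal sketch is below; the filenames are placeholders and this assumes an uncompressed PCM wav that scipy can read.

import numpy as np
from scipy.io import wavfile

# Pads half a second of silence onto the end of the file (works for mono or stereo).
fs, audio = wavfile.read("stimulus.wav")
silence = np.zeros((fs // 2,) + audio.shape[1:], dtype=audio.dtype)
wavfile.write("stimulus_padded.wav", fs, np.concatenate([audio, silence]))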
The error message tells you the problem: the file
resources/mixup10.hed
does not exist. You will need to create it. Look at the ones that do exist to figure out what needs to go in it – just some simple commands to HHEd.
One easy way is to write a very simple program (Python or shell script) to generate a file that contains something like
(voice_localdir_multisyn-rpx)
(set! myutt (SayText "Hello world."))
(utt.save.wave myutt "sentence001.wav" 'riff)
(set! myutt (SayText "Here is the next sentence."))
(utt.save.wave myutt "sentence002.wav" 'riff)
Your program should save this into a file, perhaps called
generate_test_sentences.scm
and then you can execute that in Festival simply by passing it on the command line like this:
$ festival generate_test_sentences.scm
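If you take the Python route, a minimal sketch of such a generator might look like the following; the sentence list is a placeholder for your own test set.

# Minimal sketch: write a Festival/Scheme script that synthesises each test sentence.
# The sentences below are placeholders; replace them with your own test set.
sentences = [
    "Hello world.",
    "Here is the next sentence.",
]

with open("generate_test_sentences.scm", "w") as f:
    f.write("(voice_localdir_multisyn-rpx)\n")
    for n, text in enumerate(sentences, start=1):
        text = text.replace('"', '\\"')   # keep the Scheme string valid if a sentence contains quotes
        f.write(f'(set! myutt (SayText "{text}"))\n')
        f.write(f'(utt.save.wave myutt "sentence{n:03d}.wav" \'riff)\n')

Run it once with python to produce generate_test_sentences.scm, then pass that to Festival exactly as shown above.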
The course should have more material on neural networks and state-of-the-art methods such as WaveNet, right from the start
The course will cover all of these things, don’t worry. But it makes no sense to jump into state-of-the-art models until the foundations are solid, so that we can motivate why particular approaches are the state-of-the-art.
In other courses that I am familiar with, the state-of-the-art is covered at the start of the course, but then I find students lack the foundations and simply don’t really understand the material. They then have to backtrack and learn the foundations that are needed, sometimes on their own.
As it happens, WaveNet, in its original text-to-waveform configuration, was only very briefly the state-of-the-art and has already been superseded. That’s another reason to cover the state-of-the-art at the end of the course, so that we can include the very latest approaches from this fast-moving field.
Could the assignment be done in pairs or groups?
I’m not a fan of pair or group work, where a single report is submitted and therefore all members of the group receive the same mark. My experience with observing this in other courses is that some students do a lot more work than others.
You can of course work together in the lab, discussing the theory and practice of speech synthesis, talking about what limited domain you might use, what kinds of hypotheses make sense, or which tool to use to implement a listening test, etc. But you must then execute the work yourself.
More instructions on shell scripting / why can’t we use Python
The voice building ‘recipe’ we are using is written as shell or Scheme scripts, and it’s not easy to change that. Learning more shell scripting (in addition to what was covered in Speech Processing) is an important aspect of the course.
There is some shell scripting help on the forums and I am always happy to add to that, in response to specific questions and requests.
You can use Python for everything you implement yourself, such as a text selection algorithm.