Here are some useful guidelines on reporting WER figures.
1. A weighted average WER is correct, but the simplest way to calculate this is just to sum up insertions, deletions and substitutions across the entire test set, then divide by the total number of words in the reference.
2. I’m not sure you would often want to report WER for individual sentences – this will be highly variable and likely to be based on very few samples. You would need to have a specific reason to report (and analyse) per-sentence WER.
3. No, that’s not the reason! It’s easy to automate WER calculation. Published work using too few listeners just indicates lazy experimenters!
The Blizzard Challenge uses a standard dynamic programming approach to align the reference with what the listener transcribed – very much like HResults from HTK or sclite. WER is then calculated in the usual way, summing up insertions, deletions and substitutions and dividing by the total number of words in the reference.
The procedure is slightly enhanced for Blizzard to allow for listeners’ typos (which are defined in a manually-created lookup table, updated for each new test set once we see the typical mistakes listeners make for those particular sentences).
For your listening tests, I recommend manually correcting any typos, then either computing WER manually, or using HResults – that’s just a matter of getting things in the right file format. Your reference would be in an MLF and your listener transcriptions would be in .rec files.
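For example, once your transcriptions are in that format, an HResults command along these lines should do the scoring (the file names here are just placeholders for your own reference MLF, word list and .rec files):

$ HResults -I reference.mlf wordlist.txt rec/*.rec

WER is then simply 100 minus the word accuracy (%Acc) that HResults reports.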
Whilst we are on this topic, this is a good time to remember that in general you cannot compute WER per sentence, then average over all sentences. This is only valid if all sentences have the same number of words (in the reference).
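To make that concrete, here is a minimal Python sketch of the correct corpus-level calculation, contrasted with the tempting but wrong per-sentence average. It assumes you already have the insertion, deletion and substitution counts for each sentence (from HResults or your own alignment); the counts below are invented purely for illustration.

# Each entry is (insertions, deletions, substitutions, reference word count)
# for one sentence; these counts are made up for illustration only.
sentences = [
    (0, 1, 2, 12),
    (1, 0, 0, 3),
    (2, 1, 3, 20),
]

# Correct: pool the error counts across the whole test set, then divide once.
total_errors = sum(i + d + s for i, d, s, n in sentences)
total_ref_words = sum(n for i, d, s, n in sentences)
print(f"Corpus WER: {100 * total_errors / total_ref_words:.1f}%")

# Incorrect (unless every sentence has the same reference length):
# averaging the per-sentence WERs gives a different answer.
mean_sentence_wer = sum((i + d + s) / n for i, d, s, n in sentences) / len(sentences)
print(f"Mean of per-sentence WERs: {100 * mean_sentence_wer:.1f}%")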
I’ve not heard of this error before, and I do know that many people have used Qualtrics successfully.
One thing to try would be converting the wav file to a high-bitrate mp3 (at least 128 kbps and ideally 320 kbps) and seeing whether Qualtrics accepts that. Not ideal, but an acceptable workaround.
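For example, if you have ffmpeg with the LAME encoder available (that is an assumption; any decent converter will do), something like this produces a 320 kbps mp3, with the file names as placeholders:

$ ffmpeg -i stimulus.wav -codec:a libmp3lame -b:a 320k stimulus.mp3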
If the error persists, contact the IS helpline by email for support.
You are probably overthinking. If the audio plays correctly in whatever tool you use to implement the listening test, then there is no reason to pad with extra silence.
Why do you need to do this?
One option would be just to add some silence using sox.
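For example, this appends half a second of silence to the end of a file (file names are placeholders, and you should adjust the durations to whatever you actually need):

$ sox input.wav padded.wav pad 0 0.5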
The error message tells you the problem: the file
resources/mixup10.hed
does not exist. You will need to create it. Look at the ones that do exist to figure out what needs to go in it – just some simple commands to HHEd.
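As a rough sketch only (do check the existing .hed files for the exact pattern this recipe uses, because the state range and item list may differ), a mixup file usually just contains an HHEd MU command that increases the number of Gaussian mixture components, something like:

MU 10 {*.state[2-4].mix}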
One easy way is to write a very simple program (Python or shell script) to generate a file that contains something like:
(voice_localdir_multisyn-rpx)
(set! myutt (SayText "Hello world."))
(utt.save.wave myutt "sentence001.wav" 'riff)
(set! myutt (SayText "Here is the next sentence."))
(utt.save.wave myutt "sentence002.wav" 'riff)
Your program should save this into a file, perhaps called
generate_test_sentences.scm
and then you can execute that in Festival simply by passing it on the command line like this:
$ festival generate_test_sentences.scm
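If you take the Python route, a minimal sketch might look like the following; the sentence list is obviously a placeholder for your own test sentences, and the voice name should match whichever voice you have built.

# Minimal sketch: write a Festival Scheme script that synthesises each
# test sentence and saves it as a numbered wav file.
sentences = [
    "Hello world.",
    "Here is the next sentence.",
]

with open("generate_test_sentences.scm", "w") as f:
    f.write("(voice_localdir_multisyn-rpx)\n")
    for i, text in enumerate(sentences, start=1):
        f.write(f'(set! myutt (SayText "{text}"))\n')
        f.write(f'(utt.save.wave myutt "sentence{i:03d}.wav" \'riff)\n')

Run that once to create the .scm file, then pass it to Festival as shown above.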
The course should have more material on neural networks and state-of-the-art methods such as Wavenet, right from the start
The course will cover all of these things, don’t worry. But it makes no sense to jump into state-of-the-art models until the foundations are solid, so that we can motivate why particular approaches are the state-of-the-art.
In other courses that I am familiar with, the state-of-the-art is covered at the start of the course, but then I find students lack the foundations and simply don’t really understand the material. They then have to backtrack and learn the foundations that are needed, sometimes on their own.
As it happens, Wavenet, in its original text-to-waveform configuration, was only very briefly the state-of-the-art and has already been superseded. That’s another reason to cover the state-of-the-art at the end of the course, so that we can include the very latest approaches from this fast-moving field.
Could the assignment be done in pairs or groups?
I’m not a fan of pair or group work, where a single report is submitted and therefore all members of the group receive the same mark. My experience with observing this in other courses is that some students do a lot more work than others.
You can of course work together in the lab, discussing the theory and practice of speech synthesis, talking about what limited domain you might use, what kinds of hypotheses make sense, or which tool to use to implement a listening test, etc. But you must then execute the work yourself.
More instructions on shell scripting / why can’t we use Python
The voice building ‘recipe’ we are using is written as shell or Scheme scripts, and it’s not easy to change that. Learning more shell scripting (in addition to what was covered in Speech Processing) is an important aspect of the course.
There is some shell scripting help on the forums, and I am always happy to add to that in response to specific questions and requests.
You can use Python for everything you implement yourself, such as a text selection algorithm.
Lab tasks for each week could be clearer / labs could be more structured
We will provide more class-wide instructions during the remaining lab sessions, whilst still leaving plenty of time for individual help.
Positive comments
Number of people mentioning each point is given in parentheses.
Group work / interactive classes (13)
The videos (8) and subtitles / transcripts (3)
Flipped classroom format (7)
Labs (6) and specifically the tutor (2)
Milestones for the assignment (4)
speech.zone in general, including content, navigation (4)
ASF – translating linguistic features to acoustic representation
Predicting acoustic features from linguistic features is a regression problem. We already have the necessary labelled training data: the speech database that will be used for unit selection.
One way to do the regression would be to train a regression tree (a CART). This is the method used in so-called “HMM-based speech synthesis” that we will cover in the second half of the course. But in HMM synthesis, the predicted acoustic features are used as input to a vocoder to create a waveform, rather than in an ASF target cost function.
We might then replace the tree with a better regression model: a neural network. We’ll cover this method after HMM synthesis.
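To make the idea concrete, here is a minimal sketch using scikit-learn (an assumption on my part; we won’t necessarily use that library in the course), where each row of X is a vector of linguistic features and each row of Y is the corresponding vector of acoustic features:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# Made-up stand-ins for real data: 1000 training examples, 40 linguistic
# features predicting 25 acoustic features (e.g., vocoder parameters).
rng = np.random.default_rng(0)
X = rng.random((1000, 40))   # linguistic features (inputs)
Y = rng.random((1000, 25))   # acoustic features (regression targets)

# A regression tree: the CART used in HMM-based synthesis is a close relative.
tree = DecisionTreeRegressor(max_depth=10).fit(X, Y)

# The same regression problem solved with a neural network instead of a tree.
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, Y)

predicted_acoustics = net.predict(X[:1])   # acoustic features for one input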
Once we know about HMM and neural network speech synthesis (both using vocoders rather than unit selection + waveform concatenation), we can then come back to the ASF formulation of unit selection. We will find that this is usually called “hybrid speech synthesis” and is covered towards the end of the course.
Your analogy with programming languages is along the right lines. In this context:
“high level” means “further away from the waveform”, “more abstract” and “changing at a slower rate”
“low level” means “closer to the waveform”, “more concrete (e.g., specified more precisely using more parameters)” and “changing more rapidly”