So that's about all we have to say about the evaluation of speech synthesis. We could evaluate things about the system, but mainly we wanted to evaluate the speech itself: the evaluation of synthetic speech. Let's wrap up by reminding ourselves why we were doing that in the first place, and by looking again at what to do with the outcome.

Let's repeat our diagram of how research works and, indeed, how development works. We need carefully controlled experiments to tell the differences between, for example, four system variants that might all be very similar. A brilliant idea may have made only a very small improvement in the quality of the synthetic speech, so we need a very sensitive evaluation that can detect that, tell us to take that improvement forward, and leave the others behind.

We also need to make sure there are no confounding factors. When we look at these four systems, the only difference between them must be the thing we changed, and everything else must be kept constant. For example, the front end, the database, the signal processing: whatever we haven't changed must be identical, so there are no confounds. Then, when we do the evaluation and find that one of the variants was better and take it forward, we can assume that it was our change that was responsible for that.

What we can't easily do is run a listening test, look at the result, and somehow backpropagate it through the pipeline of the text-to-speech system to find that, say, the letter-to-sound module was to blame. That's because these pipeline architectures are simply not invertible: we can push things through them in one direction, but we can't push the errors back through them in the other direction. If we had a complete end-to-end machine learning approach to the problem, for example a great big neural network that went all the way from text to a waveform, we could imagine backpropagating errors through the system. That's one direction the field is heading in: something that's learnable in an end-to-end fashion and not just in a modular fashion. However, we're not quite there yet. Of course, that great big machine learning black box isn't a final solution either, because although we might be able to learn in an end-to-end fashion, when it does make mistakes we can't point to any part of the system to blame, precisely because it's a black box. So there's still an advantage to a pipeline architecture made of modules that we understand, that we can take apart, and that we can replace.

Let's finish off with a table of recommendations about which test to use to measure which property. Depending on what you want to evaluate, which test should you pick?

Naturalness is the most common thing to be evaluated. It's perfectly good to use a Mean Opinion Score, very often on a five-point scale (you might prefer a seven-point scale for some reason), or something closely related to it such as MUSHRA, which has a built-in error-checking and calibration mechanism. A forced choice is also perfectly okay, because it's quite a straightforward task to explain to listeners: which of these two samples sounds the most natural? I can't immediately think of any other task you could ask listeners to do from which you could gauge the naturalness of a signal.
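As an illustration (not part of the original lecture), here is a minimal sketch of how the ratings from a MOS naturalness test might be summarised per system. The data, system names, and scale are made up, and a real evaluation would use a proper statistical analysis rather than this rough confidence interval.

```python
# Minimal sketch: summarising a MOS naturalness test (hypothetical data).
# Each system variant is rated by listeners on a 1-5 opinion scale.
import statistics

# ratings[system] = list of 1-5 opinion scores collected from listeners
ratings = {
    "A": [4, 5, 4, 4, 3, 5, 4],
    "B": [3, 4, 3, 4, 3, 3, 4],
}

for system, scores in ratings.items():
    mos = statistics.mean(scores)
    # standard error gives a rough idea of how sensitive the test is
    se = statistics.stdev(scores) / len(scores) ** 0.5
    print(f"System {system}: MOS = {mos:.2f} (approx. 95% CI +/- {1.96 * se:.2f})")
```

A small difference in MOS between two very similar systems will only be detectable if the confidence intervals are narrow, which is exactly why a sensitive, well-controlled listening test matters.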
Similarity to a target speaker is essentially the same sort of task as naturalness, so people tend to use MOS or MUSHRA type tests, obviously with a reference sample of what the target speaker is supposed to sound like. You could use a forced choice to ask which of two samples sounds more like a particular target speaker. Judging whether something is or is not the target speaker is a slightly odd task for listeners, because in real life someone either is or is not the target, so making graded distinctions may be a little bit tricky; but this is probably fine.

One thing you'll find in the literature, and indeed in some standards, is the use of opinion scores to rate intelligibility. It's hard to see why that would be acceptable. How do you know whether you've understood something? If something is perfectly intelligible, or very unintelligible, then we could make a judgement and give an opinion. But fine distinctions? If we mishear a word, how do we know we've misheard it? So I would never recommend the use of opinion scores, or MUSHRA, or anything like that to rate intelligibility. I would only ever trust a task where people type in what they heard (a small sketch of scoring typed responses appears at the end of this section). The only exception might be where there are only a couple of possible things listeners could hear, such as minimal pairs, in which case you could possibly do a forced choice, but this is very uncommon these days.

You will also see reports of evaluations that don't really say what they're looking for: it's just preference. They're probably implying naturalness, but maybe not saying it. In this case, if we're really not sure what we're asking listeners to do, we're just asking "Do you like this system more, or this other system more?", and all you can really do is a forced choice preference test, because you're not telling them what to listen to at all, just asking for an overall preference. That would also be appropriate for choosing between two different voices: which is most pleasant, for example, which is different from naturalness.

Well, we've reached the end of the first part of the course, which has been all about unit selection, the data behind it, and how to evaluate it. What we'll move on to next is a different form of synthesis, called statistical parametric speech synthesis. What you've learned in this unit selection part will still be highly relevant for several reasons. Of course, we will want to evaluate the speech from the statistical parametric method, and what we've talked about in this module on evaluation all still applies. We'll still need a database, and we'll probably still want coverage in that database, although it might not be as important as in unit selection. Eventually we'll join together the statistical parametric method and the unit selection method into something called hybrid synthesis. So now that you know all of this, take what you've learned, go and put it into practise, and go and build your own unit selection voice.
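As an illustration (again, not part of the original lecture), here is a minimal sketch of how a transcription-based intelligibility test might be scored: listeners type in what they heard, and each response is compared to the reference sentence using word error rate, computed with standard word-level edit distance. The example sentences are made up.

```python
# Minimal sketch: word error rate for a transcription-based intelligibility test.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dynamic-programming edit distance over words
    # (counts substitutions, insertions and deletions)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# e.g. the sentence that was synthesised vs. what a listener typed in
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
# about 0.17: one substitution out of six reference words
```

Averaging this over many listeners and many sentences gives an objective intelligibility measure, which is far more trustworthy than asking listeners for an opinion about how intelligible something was.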
Wrap up
We'll conclude with a reminder of what to do with the outcome of an evaluation, and some recommendations for what types of test are suitable for various purposes.