In this module, we're going to talk about evaluation. We'll start right at the beginning and think about why we even need to evaluate, and what we're going to do with the result of that. That will help us think about when it's appropriate to evaluate. Do we do that while we're building a system? Do we wait until we're finished? There are many aspects of a system we might evaluate: its internal components, its reliability, or its speed. But we are mainly going to focus on the more obvious kind of speech synthesis evaluation, which is to listen to the speech and make some judgements, perhaps about its naturalness and its intelligibility.

Once we have established those things, we'll move on to thinking about exactly how to do that. How do you evaluate? In that section, we will look at methodologies for evaluation: how we present the stimuli to listeners, who those listeners might be, what the stimuli should be, and some other aspects of experimental design. And once we've finished an evaluation, we need to do something with what we've found. So, what do we do with the outcome of an evaluation?

Before you look at this module, you should just check that you already know the following things. Of course, you should have a pretty good understanding of the complete process of text-to-speech synthesis by this point. And because you've got that understanding, you should have some ideas about where errors can come from: why synthetic speech is not perfect. For example, in the front end, things might go wrong very early, in text normalisation, so that the wrong words are said. Pronunciations could be incorrect: letter-to-sound could go wrong. And other things, like prosody, could be wrong. Things will also go wrong when we generate the waveform. Let's just think about unit selection in this section. Unit selection errors might be that the units we chose were from inappropriate contexts, so they have the wrong co-articulation, for example. Or they might not have joined very well: we'll get concatenation artefacts.

To have a good understanding of why people hear some errors but not others, it will be useful to go and revise your acoustic phonetics and make sure you know a little bit about speech perception. Different errors will lead to different perceptual consequences: they'll impact different aspects of the synthetic speech. Some things are obviously going to affect the naturalness of the synthetic speech. It will sound unnatural in some way: unnatural prosody, or unnatural signal quality because of concatenation, perhaps. Other types of errors might in fact affect intelligibility. An obvious one there is text normalisation going wrong: the synthesiser says the wrong words and the listener understands the wrong thing. But intelligibility could also be impacted by audible joins that disrupt the listener's perception. So both naturalness and intelligibility can be affected by things that happen in the front end and in waveform generation.

We're going to see that this is a general trend in evaluation: we might be able to measure something in the synthetic speech, but that thing we measure doesn't have a direct connection back to only one property of the system. We have to figure that out for ourselves. And that's one reason it's rather hard to automatically optimise the system based on the results of a listening test.
Because you can't feed back the results of that listening test down into the insides of the system and say, "This error was because of letter-to-sound".

Now, before we just jump in and try to evaluate synthetic speech, let's sit down and think very carefully about that. We're going to be a little bit systematic. We're not just going to make some synthetic speech, play it to listeners and see what they think. We're going to make a plan. We're going to start right back at the beginning, thinking: "Why do we want to evaluate?" "What is it that we want to get out of that evaluation?" Most of the time, we're trying to learn something for ourselves that will help us make our system better. We're trying to guide the future development of the system. When we think we've got a pretty good system, we might then go out and compare it to somebody else's system and hope that we're better. Maybe then we can publish a paper: our technique is an improvement over some baseline. If we're going to sell this thing, we might simply want to say it's good enough to put in front of customers, and that will be just a simple pass-fail decision.

Depending on our goals, we might then need to decide when we're going to do that evaluation. If it's to improve our own system, we might do it fairly frequently. It might be during development. That might be of individual components, or of the end-to-end system, depending on what we want to learn. Assuming that it's the synthetic speech we're going to evaluate, rather than the performance of some component of the system, we have to think: "What is it about speech that we could actually measure?" Some obvious things will be: can you understand what was said? Did it sound something like natural speech? Maybe less obviously - and something that will become a lot clearer when we move on to parametric forms of synthesis - there is speaker similarity, because there's no guarantee that the synthetic speech sounds like the person that we want it to sound like.

Then, having established all of that, we'll get to the meat. We'll think about how to evaluate. We'll think about what we're going to ask our listeners to do: we're going to design a task. We'll put that into some test design, which includes the materials we should use. We'll think about how to measure the listeners' performance on that task, with those materials. We'll also quite briefly mention the idea of objective measures: in other words, measures that don't need a listener - measures that are in fact an algorithm operating on the synthetic speech, for example, comparing it to a natural example and measuring some distance.
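As a concrete illustration of that last idea, here is a minimal sketch of one possible objective measure: align a synthetic and a natural version of the same sentence with dynamic time warping and average a per-frame spectral distance along the alignment. It assumes you have already extracted per-frame acoustic features (for example, mel-cepstra) as NumPy arrays; the arrays below are random placeholders, and this is an illustration rather than any standard measure.

```python
import numpy as np

def dtw_distance(ref, syn):
    """Average per-frame Euclidean distance along the best DTW path.

    ref, syn : arrays of shape (num_frames, num_coeffs)
    """
    n, m = len(ref), len(syn)
    # Pairwise Euclidean distances between every reference and synthetic frame
    dist = np.linalg.norm(ref[:, None, :] - syn[None, :, :], axis=-1)
    # Standard DTW recursion over a cumulative cost matrix
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j],      # skip a reference frame
                cost[i, j - 1],      # skip a synthetic frame
                cost[i - 1, j - 1],  # match the two frames
            )
    # Roughly normalise by path length so utterances of different lengths are comparable
    return cost[n, m] / (n + m)

# Toy usage: random arrays standing in for real acoustic features
natural = np.random.randn(200, 25)
synthetic = np.random.randn(180, 25)
print(dtw_distance(natural, synthetic))
```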
When you've done all of that, you need to then make a decision about what to do next. For example, how do we improve our system, knowing what we've learned in the listening test? So what do we do with the outcome of a listening test, or some other evaluation?

So, on to the first of those in a bit more detail, then. Why do we need to evaluate at all? It's not always obvious, because it's something that's just never spelled out. I was never taught this explicitly, and I've never really seen it in a textbook. But why do we evaluate? Well, it's to make decisions. Imagine I have a basic system. I've taken the Festival speech synthesis system. I've had four brilliant ideas about how to make it better. I've built four different variations on Festival, each one embodying one of these brilliant ideas. Now I'd like to find out which of them was the best idea. So I'll generate some examples from those systems and I'll run my evaluation. There will be some outcome from that, and the outcome helps me make a decision: which of those ideas was best? Let's imagine that the idea in this system was a great idea. So perhaps I'll drop the other ideas. I'll take this idea and make some further enhancements to it - some more variants on it. So I'll make some new systems in this evolutionary way. We've now got another bunch of systems we would like to evaluate. I had, again, four ideas (it's a magic number!), and again I'd like to find out which of those was the best idea. Again, we'll do another evaluation, and perhaps one of those turns out to be better than the rest. I'll take that idea forward, and the other ideas will stay on the shelf.

So research is a little bit like this greedy, evolutionary search. We have an idea. We compare it to our other ideas. We pick the best of them. We take that forward and refine it and improve it until we can't make it any better. Evaluation is obviously playing a critical role in this, because it's helping us to make the decision that this was the best, and this was the best, and that the route our research should take should be like this, not some other path. So that's why we need to evaluate.

You could do these evaluations at many possible points in time. If we're building a new front end for a language, or trying to improve the one we've already got, we'll probably do a lot of testing on those isolated components. For example, we might have a module which does some normalisation. We're going to put a better POS tagger in, or the letter-to-sound model might have got better, or we might have improved the dictionary. In those cases, we might be able to do some sort of objective testing against gold-standard data: against things we know, with labels. That's just normal machine learning, with a training set and a test set. But we also might want to know what the effect of those components is when they work within a complete system. Some components, of course, only make sense in a complete system, such as the thing that generates the waveform. We have to look at the waveforms (i.e., listen to the waveforms) to know how good they are.

We're not going to say too much more about evaluating components - that's better covered when we're talking about how these components work. We're really going to focus on complete systems. We're not in industry here, so we're not going to want to know: does it pass or fail? We're really going to be concentrating on comparing between systems. Those might be two of our own systems, like the four variants embodying my great ideas. They might be compared against a baseline - an established standard - such as plain Festival. Or they might be compared against other people's systems, as in, for example, the Blizzard Challenge evaluation campaigns.

Now, when we're making those comparisons between variants of systems, we need to be careful to control things so that, when we have a winner, we know the reason was our great idea and not some other change we made too. In the Blizzard Challenge, and indeed in all sorts of internal testing, we'll keep certain things under control. For example, we'll keep the same front end, and we'll keep the same database, if we're trying to evaluate different forms of waveform generation. So we're already seeing some aspects of experimental design.
We need to be sure that the only difference between the systems is the one we're interested in, and that everything else is controlled, so that - when we get our outcome - we know the only explanation is the variable of interest and not some other confounding thing, like a change of database.

Now, there's a whole lot of terminology flying around testing - whether it's software testing, whole-system testing, or whatever it might be - and we're not going to get bogged down in it. But let's just look at a couple of terms that come up in some textbooks. Sometimes we see the term "glass box". That means we can look inside the system that we're testing. We have complete access to it. We can go inside the code; we could even put changes into the code for the purposes of testing. This kind of testing is normally about things like reliability and speed: making sure the system doesn't crash, and making sure it runs fast enough to be usable. We're really not going to talk much about that. If we wanted to, we could go and look at the source code. For example, in this piece of source code, we can see who wrote it, so there's a high chance there might be a bug here! We might put in specific tests - unit tests - to check that there are no silly bugs: put in test cases with known correct output and compare against them (there's a small sketch of that idea below).

Coming back to the terminology, we'll sometimes see the term "black box". This is when we're not allowed to look inside and see how things work. We can't change the code for the purpose of testing it. All we can do is put things in and get things out. So, to measure performance, we can maybe get some objective measures against gold-standard data. That would be the sort of thing we might do for a Part-Of-Speech tagger.

One would hope that making an improvement anywhere in the system would lead to an improvement in the synthetic speech. If only that were the case! Systems like Text-to-Speech synthesisers are complex. They typically have something like a pipeline architecture, which propagates errors, and that can lead to very tricky interactions between the components. Those interactions can mean that improving something early in the pipeline actually causes problems later in the pipeline. In other words, making an improvement actually makes the synthetic speech worse!

Let's take some examples. Imagine we fix a problem in text normalisation, so that currencies are now correctly normalised where they used to be incorrectly normalised. However, if we only do that in the run-time system and don't change the underlying database, we will now have a mismatch between the database labels and the content of what we try to say at run time. We might get worse performance - for example, lower naturalness - because we now have more joins. Similarly, improving the letter-to-sound module might start producing phoneme sequences which are low-frequency in the database, because the database used the old letter-to-sound module, which never produced those sequences. So again, we will get more joins between our units.

So, in general, these pipeline architectures, which are the norm in Text-to-Speech, can lead to unfortunate interactions. And in general, of course, in all of software engineering, fixing one bug can easily reveal other bugs that you hadn't noticed until then. Nevertheless, we're always going to try to improve systems. We want to improve those components, but we might have to propagate those improvements right through the pipeline. We might have to completely rebuild the system and re-label the database whenever we change, for example, our letter-to-sound module.
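To make the earlier "test cases with known correct output" idea concrete, here is a minimal sketch of a unit test for the currency normalisation example just discussed. The normalise() function is a hypothetical stand-in for whatever the front end's text normalisation module actually provides; only the shape of the test matters here.

```python
import unittest

def normalise(text):
    # Hypothetical stand-in for a real text normalisation module,
    # handling just enough for this illustration.
    return text.replace("£3.50", "three pounds fifty")

class TestTextNormalisation(unittest.TestCase):
    def test_currency(self):
        # A test case with known correct output
        self.assertEqual(normalise("It costs £3.50."),
                         "It costs three pounds fifty.")

if __name__ == "__main__":
    unittest.main()
```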
So: we know why we need to evaluate - to make decisions, for example, about where to go with our research. We can do those evaluations for components or for whole systems, and we've decided when to do it. Now we can think about what it is about speech that we could evaluate. From here on, we're going to focus on just evaluating the synthetic speech and forget about looking inside the system.

What aspects of speech could we possibly quantify? There are lots of descriptors, lots of words we could use to talk about this. It's tempting to use the word "quality". That's a common term in speech coding: for transmission down telephone systems, we talk about the "quality" of the speech. But that's a little bit of a vague term for synthetic speech, because there are many different dimensions, so we tend not to talk about quality so much. Rather, we use the term naturalness. Naturalness implies some similarity to natural speech from a real human talker. In general, it's assumed that that's what we're aiming for in Text-to-Speech.

I would also like that speech to be understandable. There are a load of different terms you could use to talk about that property, but the most common one is "intelligibility". That's simply the ability of a listener to recall, or to write down, the words that he or she heard. We might try to evaluate some higher-level things, such as understanding or comprehension, but then we're going to start interacting with things like the memory of the listener. We'd be starting to measure properties of the listener when really we want to measure properties of our synthetic speech.

Occasionally we might measure speaker similarity. As I've already mentioned, in the parametric synthesis case it's possible to produce synthetic speech that sounds reasonably natural and is highly intelligible, but doesn't sound very much like the person that we recorded. That sometimes matters. It doesn't always matter, so there's no reason to evaluate it unless you care about it. And there's a whole lot of other things you might imagine evaluating. Go away and think about what they might be, and then put them into practice yourself: you are going to evaluate synthetic speech along a dimension that's not one of the ones on this slide, and see if you can find something new that you could measure. We're not going to consider these other things any further. They would be straightforward to measure, and you could do those experiments for yourself: for example, you could consider how pruning affects speed and how that trades off against naturalness.

Now, even these simple descriptors are probably not that simple. In particular, it seems to me that naturalness is not a single dimension. We could imagine speech that's segmentally natural - the phonemes are reproduced nicely - but prosodically unnatural, or vice versa. So naturalness might need unpacking. But the convention in the field is to give it as an instruction to the listener - "Please rate the naturalness of the synthetic speech." - and to assume that they can do that along a one-dimensional scale.

Similarly, intelligibility is usually defined as the ability to transcribe what was said, but there might be more to listening to synthetic speech than just getting the words right. You might like to think about whether somebody really understood the meaning of the sentence. They might have managed to transcribe the words, but if the prosody was all wrong, they might have understood a different meaning.
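Since intelligibility is defined here as the ability to write down the words that were said, it is usually scored by comparing each listener's transcript with the words the synthesiser was asked to say, often as a word error rate. Here is a minimal sketch of that scoring step; the sentences are invented for illustration.

```python
def word_error_rate(reference, transcription):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = transcription.lower().split()
    # Levenshtein distance over words, by dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + substitution)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# What the synthesiser said vs. what a listener typed
print(word_error_rate("the cat sat on the mat",
                      "the cat sat on a mat"))  # one substitution: ~0.17
```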
More generally, we might think about how much effort it is to listen to synthetic speech. It seems a reasonable hypothesis that it's harder work listening to synthetic speech than to natural speech. There are methods out there that try to probe people's effort, or attention, or all sorts of other factors: measuring things about their pupils, or sticking electrodes on their scalp to measure things about their brain activity. These are very much research tools; they're not widely used in synthetic speech evaluation. The reason is that it is then very hard to separate out measuring things about the listener from measuring things about the synthetic speech, and so there's a confound there: for example, we might be measuring the listener's working memory, not how good the synthetic speech was. So, at the moment, these things are not widely used. It would be nice to think that we could use them to get a deeper understanding of synthetic speech, but that's an open question, so I'll put it to one side.

For now, let's wrap this section up with a quick recap. We know why we should evaluate: to find things out and make decisions. We could do it at various points, and you need to choose when. And we've listed some initial aspects of the system - specifically of the speech - that we could measure.

From this point on, in the next video, we're going to focus on the mainstream: what's normal in the field, and what would be expected if you wanted to publish a paper about your speech synthesis research. We're going to evaluate the output from a complete system. We're going to do that for an end-to-end Text-to-Speech system: text in, synthetic speech out. And we're going to measure the two principal things that we can: naturalness and intelligibility. So what we need to do now is find out how.
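As a small pointer towards that "how": when listeners are asked to rate naturalness on a scale, as described above, the ratings for each system are commonly summarised as a Mean Opinion Score (MOS). Here is a minimal sketch of that summary step, with invented ratings on a 1-to-5 scale and a rough normal-approximation confidence interval; real analyses often prefer non-parametric comparisons for this kind of ordinal data.

```python
import statistics

# Invented naturalness ratings (1 = completely unnatural, 5 = completely natural)
ratings = {
    "system_A": [4, 5, 3, 4, 4, 5, 4, 3, 4, 4],
    "system_B": [3, 2, 4, 3, 3, 2, 3, 4, 3, 3],
}

for system, scores in ratings.items():
    mos = statistics.mean(scores)
    # Rough 95% confidence interval using a normal approximation
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    print(f"{system}: MOS = {mos:.2f} +/- {half_width:.2f}")
```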