Module status: ready.
Before starting this module, you should already have an understanding of:
- What kinds of errors can be made by:
  - the front end
  - the waveform generator (using unit selection and concatenation of waveforms)
- Basic acoustic phonetics, and some knowledge of speech perception
Download the slides for the module 5 videos
Total video to watch in this module: 65 minutes
In this module, we're going to talk about evaluation. We'll start right at the beginning and think about why we even need to evaluate, and what we're going to do with the result of that. That will help us think about when it's appropriate to evaluate. Do we do that while we're building a system? Do we wait until we're finished? There are many aspects of a system we might evaluate: its internal components, its reliability, or its speed. But we are mainly going to focus on the more obvious speech synthesis evaluation, which is to listen to the speech and make some judgements, perhaps about its naturalness and its intelligibility. Once we have established those things, we'll move on to thinking about exactly how to do that. How do you evaluate? In that section, we will look at methodologies for evaluation: how we present the stimuli to listeners, who those listeners might be, what the stimuli should be, and some other aspects of experimental design. And once we've finished the evaluation, we need to do something with what we've found. So, what do we do with the outcome of an evaluation?

Before you look at this module, you should just check that you already know the following things. Of course, you should have a pretty good understanding of the complete process of text-to-speech synthesis by this point. And because you've got that understanding, you should have some ideas about where errors can come from: why synthetic speech is not perfect. For example, in the front end, things might go wrong very early in text normalisation, so the wrong words are said. Pronunciations could be incorrect: letter-to-sound could go wrong. And other things like prosody could be wrong. Things will also go wrong when we generate the waveform. Let's just think about unit selection in this section. Unit selection errors might be that the units we chose were from inappropriate contexts, so they have the wrong co-articulation, for example. Or they might not have joined very well: we'll get concatenation artefacts.

To have a good understanding of why people hear some errors but not other errors, it will be useful to go and revise your acoustic phonetics and make sure you know a little bit about speech perception. Different errors will lead to different perceptual consequences. They'll impact different aspects of the synthetic speech. Some things are obviously going to affect the naturalness of the synthetic speech: it will sound unnatural in some way - unnatural prosody, or unnatural signal quality because of concatenation, perhaps. Other types of errors might in fact affect intelligibility. An obvious one there is: if text normalisation goes wrong, the synthesiser says the wrong words and the listener understands the wrong thing. But intelligibility could also be impacted by audible joins that disrupt the listener's perception. So both naturalness and intelligibility can be affected by things that happen in the front end and in waveform generation. We're going to see that's a general trend in evaluation: we might be able to measure something in the synthetic speech, but the thing that we measure doesn't have a direct connection back to only one property of the system. We have to figure that out for ourselves. And that's one reason it's rather hard to automatically optimise the system based on the results of a listening test.
Because you can't feed back the results of that listening test down into the insides of the system and say, "This error was because of letter-to-sound". Now, before we just jump in and try to evaluate synthetic speech, let's sit down and think very carefully about that. We're going to be a little bit systematic. We're not just going to make some synthetic speech, play it to listeners and see what they think. We're going to make a plan. We're going to start right back at the beginning, thinking: "Why do we want to evaluate?" "What is it that we want to get out of that evaluation?" Most of the time, we're trying to learn something for ourselves that will help us make our system better. We're trying to guide the future development of the system. When we think we've got a pretty good system, we might then go out and compare it to somebody else's system and hope that we're better. Maybe then we can publish a paper: our technique is an improvement over some baseline. If we're going to sell this thing, we might simply want to say it's good enough to put in front of customers, and that'll be just a simple pass-fail decision.

Depending on our goals, we might then need to decide when we're going to do that evaluation. If it's to improve our own system, we might do it fairly frequently. It might be during development. That might be of individual components, or of the end-to-end system, depending on what we want to learn. Assuming that it's the synthetic speech we're going to evaluate, rather than the performance of some component of the system, we have to think: "What is it about speech that we could actually measure?" Some obvious things will be: Can you understand what was said? Did it sound something like natural speech? Maybe less obviously - and something that will become a lot clearer when we move on to parametric forms of synthesis - is speaker similarity, because there's no guarantee that the synthetic speech sounds like the person that we want it to sound like.

Then, having established all of that, we'll get to the meat. We'll think about how to evaluate. We'll think about what we're going to ask our listeners to do. We're going to design a task, and put that into some test design, which includes the materials we should use. We'll think about how to measure the listeners' performance on that task, with those materials. We'll also quite briefly mention the idea of objective measures. In other words, measures that don't need a listener - measures that are in fact an algorithm operating on the synthetic speech, for example comparing it to a natural example and measuring some distance. When you've done all of that, you need to then make a decision about what to do next. For example, how do we improve our system, knowing what we've learned in the listening test? So what do we do with the outcome of a listening test, or some other evaluation?

So, on to the first of those in a bit more detail, then. Why do we need to evaluate at all? It's not always obvious, because it's something that's just never spelled out. I was never taught this explicitly, and I've never really seen it in a textbook. But why do we evaluate? Well, it's to make decisions. Imagine I have a basic system. I've taken the Festival speech synthesis system. I've had four brilliant ideas about how to make it better. I've built four different variations on Festival, each one embodying one of these brilliant ideas. Now I'd like to find out which of them was the best idea. So I'll generate some examples from those systems, and I'll run my evaluation.
There was some outcome from that, and the outcome helps me make a decision: which of those ideas was best? Let's imagine that the idea in this system was a great idea. So perhaps I'll drop the other ideas. I'll take this idea, and I'll make some further enhancements to it - some more variants on it. So I'll make some new systems in this evolutionary way. I've now got another bunch of systems I would like to evaluate. Again I had four ideas (it's a magic number!), and again I'd like to find out which of those was the best idea. Again, we'll do another evaluation, and perhaps one of those turns out to be better than the rest. I'll take that idea forward, and the other ideas will stay on the shelf. And so research is a little bit like this greedy evolutionary search. We have an idea. We compare it to our other ideas. We pick the best of them. We take that forward and refine it and improve it until we can't make it any better. Evaluation is obviously playing a critical role in this, because it's helping us to make the decision that this was the best, and this was the best, and that the route our research should take should be like this, not some other path. So that's why we need to evaluate.

You could do these evaluations at many possible points in time. If we're building a new front end for a language, or trying to improve the one we've already got, we'll probably do a lot of testing on those isolated components. For example, we might have a module which does some normalisation. We're going to put a better POS tagger in, or the letter-to-sound model might have got better, or we might have improved the dictionary. We might be able - in those cases - to do some sort of objective testing against gold standard data: against things we know, with labels. That's just normal machine learning, with a training set and a test set. But we also might want to know what the effect of those components is when they work within a complete system. Some components, of course, only make sense in a complete system, such as the thing that generates the waveform. We have to look at the waveforms (i.e., listen to the waveforms) to know how good they are. We're not going to say too much more about evaluating components - that's better covered when we're talking about how those components work. We're really going to focus on complete systems. We're not in industry here, so we're not going to want to know "does it pass or fail?". We're really going to be concentrating on comparing between systems. Now, those might be two of our own systems, like the four variants embodying my great ideas. The comparison might be against a baseline - an established standard - such as plain Festival. Or it might be competitive, against other people's systems, as in, for example, the Blizzard Challenge evaluation campaigns. Now, when we're making those comparisons between variants of systems, we need to be careful to control things so that, when we have a winner, we know the reason was our great idea and not some other change we made too. In the Blizzard Challenge, and indeed in all sorts of internal testing, we'll keep certain things under control. For example, we'll keep the same front end, and we'll keep the same database, if we're trying to evaluate different forms of waveform generation. So we're already seeing some aspects of experimental design.
We need to be sure that the only difference between the systems is the one we're interested in, and everything else is controlled, so that - when we get our outcome - we know the only explanation is the variable of interest and not some other confounding thing, like having changed the database.

Now, there's a whole lot of terminology flying around testing - whether it's software testing or whole-system testing, whatever it might be - and we're not really going to get bogged down in it. But let's just look at a couple of things that come up in some textbooks. Sometimes we see this term: "glass box". That means we can look inside the system that we're testing. We have complete access to it. We can go inside the code. We could even put changes into the code for the purposes of testing. This kind of testing is normally about things like reliability and speed: making sure the system doesn't crash, making sure it runs fast enough to be usable. We're really not going to talk much about that. If we wanted to, we could go and look at the source code. For example, in this piece of source code, we can see who wrote it, so there's a high chance there might be a bug here! We might put in specific tests - unit tests - to check that there are no silly bugs: put in test cases with known correct output and compare against them. Coming back to this terminology, we'll sometimes see the term "black box". This is when we're not allowed to look inside and see how things work. We can't change the code for the purpose of testing it. All we can do is put things in and get things out. So, to measure performance, we can maybe get some objective measures against gold standard data. That will be the sort of thing we might do for a Part-Of-Speech tagger.

One would hope that making an improvement anywhere in the system would lead to an improvement in the synthetic speech. If only that were the case! Systems like Text-to-Speech synthesisers are complex. They typically have something like a pipeline architecture that propagates errors, and that can lead to very tricky interactions between the components. Those interactions can mean that improving something early in the pipeline actually causes problems later in the pipeline. In other words, making an improvement actually makes the synthetic speech worse! Let's take some examples. Imagine we fix a problem in text normalisation so that currencies are now correctly normalised, whereas they used to be incorrectly normalised. However, if we only do that in the run-time system and don't change the underlying database, we will now have a mismatch between the database labels and the content of what we try to say at run time. We might get worse performance - for example, lower naturalness - because we now have more joins. Similarly, an improved letter-to-sound module might start producing phoneme sequences which are low frequency in the database, because the database was labelled with the old letter-to-sound module, which never produced those sequences. So again, we will get more joins in our units. So, in general, these pipeline architectures, which are the norm in Text-to-Speech, can lead to unfortunate interactions. And in general, of course, in all of software engineering, fixing one bug can easily reveal other bugs that you hadn't noticed until then. Nevertheless, we're always going to try and improve systems. We want to improve those components, but we might have to propagate those improvements right through the pipeline.
We might have to completely rebuild the system and re-label the database whenever we change, for example, our letter-to-sound module. So we know why we need to evaluate: to make decisions, for example, about where to go with our research. We can do those evaluations for components or for whole systems. We've decided when to do it. And now we can think about what it is about speech that we could evaluate. So we're now going to focus on just synthetic speech evaluation and forget about looking inside the system.

What aspects of speech could we possibly quantify? There are lots of descriptors, lots of words we could use to talk about this. It's tempting to use the word "quality". That's a common term in speech coding: for transmission down telephone systems, we talk about the "quality" of the speech. That's a little bit of a vague term for synthetic speech, because there are many different dimensions, so we tend not to talk about that [quality] so much. Rather, we use the term naturalness. Naturalness implies some similarity to natural speech from a real human talker. In general, it's assumed that that's what we're aiming for in Text-to-Speech. I would also like that speech to be understandable. There's a load of different terms you could use to talk about that property; the most common one is "intelligibility". That's simply the ability of a listener to recall, or to write down, the words that he or she heard. We might try and evaluate some higher-level things, such as understanding or comprehension, but then we're going to start interacting with things like the memory of the listener. So we're going to start measuring listener properties, when really we want to measure properties of our synthetic speech. Occasionally we might measure speaker similarity. As I've already mentioned, in the parametric synthesis case it's possible to produce synthetic speech that sounds reasonably natural and is highly intelligible, but doesn't sound very much like the person that we recorded. That sometimes matters. It doesn't always matter, so there's no reason to evaluate it unless you care about it. And there's a whole lot of other things you might imagine evaluating. Go away and think about what they might be, and then put them into practice yourself: you are going to evaluate synthetic speech along a dimension that's not one of the ones on this slide, and see if you can find something new that you could measure. We're not going to consider those other things any further. They would be straightforward to measure, and you could do those experiments for yourself. You could consider how pruning affects speed, and how that trades off against, for example, naturalness.

Now, even these simple descriptors are probably not that simple. In particular, it seems to me that naturalness is not a single dimension. We could imagine speech that's segmentally natural - the phonemes are reproduced nicely - but prosodically unnatural. Or vice versa. So naturalness might need unpacking, but the convention in the field is to give it as an instruction to the listener - "Please rate the naturalness of the synthetic speech." - and assume that they can do that along a one-dimensional scale. Similarly, intelligibility is usually defined as the ability to transcribe what was said, but there might be more to listening to synthetic speech than just getting the words right. You might like to think about whether somebody really understood the meaning of the sentence.
Perhaps they managed to transcribe the words but, because the prosody was all wrong, they understood a different meaning. More generally, we might think about how much effort it is to listen to synthetic speech. It seems a reasonable hypothesis that it's hard work listening to synthetic speech compared to natural speech. There are methods out there to try and probe people's effort or attention, or all sorts of other factors: measuring things about their pupils, or sticking electrodes on their scalp to measure things about their brain activity. These are very much research tools. They're not widely used in synthetic speech evaluation. The reason is that it is then very hard to separate out measuring things about the listener from measuring things about the synthetic speech, and so there's a confound there. For example, we might be measuring the listener's working memory, not how good the synthetic speech was. So at the moment these things are not widely used. It would be nice to think that we could use them to get a deeper understanding of synthetic speech, but that's an open question. So I'll put that to one side.

For now, let's wrap this section up with a quick recap. We know why we should evaluate: to find stuff out and make decisions. We could do it at various points, and you need to choose. And we've listed some initial aspects of the system, specifically of the speech, that we could measure. From this point on, in the next video, we're going to focus on the mainstream. What's normal in the field? What would be expected if you wanted to publish a paper about your speech synthesis research? We're going to evaluate the output from a complete system. We're going to do that for an end-to-end Text-to-Speech system: text in, synthetic speech out. And we're going to measure the two principal things that we can: naturalness and intelligibility. So what we need to do now is find out how.
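As an aside, here is what the component-level testing against gold standard data mentioned earlier might look like in practice. This is a minimal sketch only, not part of the lecture: the normaliser is a trivial placeholder standing in for your own front end, and the test sentences and their expected expansions are invented for illustration.

```python
# Minimal sketch of objective, gold-standard testing of one front-end component
# (text normalisation). The normaliser below is a trivial placeholder standing in
# for a real front end; the (input, expected) pairs are invented for illustration.

SUBSTITUTIONS = {"£": " pounds ", "Dr.": "doctor"}

def normalise(text):
    """Placeholder normaliser: real front ends are far more sophisticated."""
    for raw, spoken in SUBSTITUTIONS.items():
        text = text.replace(raw, spoken)
    return " ".join(text.lower().replace(".", "").split())

GOLD = [
    ("Hello world.", "hello world"),
    ("It costs £3.50.", "it costs three pounds fifty"),
    ("Dr. Smith lives on Smith Dr.", "doctor smith lives on smith drive"),
]

def evaluate(cases):
    """Sentence-level accuracy against gold-standard labels, plus the failures."""
    failures = [(text, want, normalise(text))
                for text, want in cases if normalise(text) != want]
    return 1.0 - len(failures) / len(cases), failures

if __name__ == "__main__":
    accuracy, failures = evaluate(GOLD)
    print(f"sentence accuracy: {accuracy:.0%}")
    for text, want, got in failures:
        print(f"FAIL: {text!r}\n  expected: {want!r}\n  got:      {got!r}")
```

Note how the placeholder fails on "Smith Dr." - exactly the kind of context-dependent front-end error discussed above, and exactly what gold-standard test cases are there to catch.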
So now on to how to do the evaluation. There are two main forms of evaluation. One is subjective, in which we use human beings as listeners, and their task involves listening to synthetic speech and doing something - making some response. So we're going to look at how to design a test of that form, including things such as what materials you would use. The other form of evaluation is called objective, and that would use some algorithm, some automated procedure, to evaluate the synthetic speech without the need for listeners. That's a very attractive idea, because using listeners can be slow, expensive and just inconvenient: you have to recruit people and get them to listen to your synthetic speech. However, objective measures are far from perfect, and they're currently not a replacement for subjective evaluation. We'll look at a couple of forms of objective measure, just in principle, not in huge detail. The most obvious idea would be to take some natural speech and somehow measure the difference - the distance - between your synthetic speech and this reference sample of natural speech, on the assumption that a distance of zero is the best you can do and means you're perfectly natural. Or you could use something a bit more sophisticated. You could try and model something about human perception, which isn't as simple as just a distance; we might need a model of perception to do that. That's even harder, and it's definitely a research topic and not a well-established way of evaluating synthetic speech at this time. So we'll focus on the subjective - in other words, the listening test.

In subjective evaluation, the subjects are people, sometimes called listeners. And they're going to be asked to do something. They're going to be given a task, and it's up to us what that task might be. In all cases, it's going to be best if it's simple and obvious to the listener. In other words, it shouldn't need a lot of instruction, because people don't read instructions. We might just play pairs of samples of synthetic speech, perhaps from two different variants of our system, and just say: which one do you prefer? That's a forced choice pairwise preference test. We might want a more graded judgement than that. For example, we might ask them to make a rating on a scale, maybe a Likert scale with a number of points on it. Those points might be described: sometimes we just describe the end points, sometimes we describe all the points. They'll listen to one sample and give a response on a 1-to-5 scale; that's very common for naturalness. Their task might be to do something else, for example to type in the words that they heard.

It's reasonable to wonder whether just anybody could do these tasks. And so we might ask: should we train the listeners in order to, for example, pay attention to specific aspects of speech? This idea of training is pretty common in experimental phonetics, where we would like the listeners to do something very specific: to attend to a particular acoustic property. But in evaluating synthetic speech, we're usually more concerned with getting a lot of listeners. And if we want a lot of listeners, and they're going to listen to a lot of material, and we're going to repeat the test many times, training listeners is a bit problematic. It's quite time consuming, and we can never be certain that they've completely learned to do what we ask them to do. So this is quite uncommon in evaluating synthetic speech.
The exception to that would be an expert listener: for example, a linguist or a native speaker in a company listening to a system - a prototype system - and giving specific feedback on what's wrong with it. But naive listeners - the ones we can recruit off the street and pay to do our listening test - would in general not be trained. We'll just assume that, for example, as native speakers of the language, they are able to choose between two things and express a preference. That's such a simple, obvious task that they can do it without any further instruction or training. If we wanted a deeper analysis than simply a preference, an alternative to training the listeners is to give them what appears to be a very simple task and then use some sophisticated statistical analysis to untangle what they actually did and to measure, for example, on what dimensions they were making those judgements about difference in this pairwise forced choice test. Doing this more sophisticated analysis - here, that's something called multi-dimensional scaling, which projects the differences into a space whose axes we can then interpret - is not really mainstream. It's not so common: far and away the most common thing is to give a very obvious task - either a pairwise or Likert-scale type task, or a typing test for intelligibility. These are what we're going to concentrate on.

So we've decided to give our listeners a simple and obvious task. Now we need to think about how they are going to do that task. For example, are listeners able to give absolute judgements about naturalness when played a single sample with no reference? Are they going to reliably rate it always as a three out of five? If we think that's true, we could design a task that relies on this absolute nature of listeners' judgements. That's possibly not true for things like naturalness: if you've just heard a very unnatural system and then a more natural system, your rating would be inflated by that. So, many times, we might want to make the safer assumption that listeners make relative judgements, and one way to be sure that they're always making relative judgements - and to give some calibration to that - is to include some references. The most obvious reference, of course, is natural speech itself. But there are other references, such as a standard speech synthesis system that never changes as our prototype system gets better.

When we've decided whether our listeners need some help calibrating their judgements, or whether they can be relied upon to give absolute - in other words, repeatable - judgements every time, we need to present stimuli to them and get those judgements back. And that needs an interface. Very often that will be done in a Web browser, so that we can do these tests online. It needs to present the stimulus, and it needs to obtain their response. That's pretty straightforward. So we've decided to get, let's say, relative judgements from listeners: present them with a reference stimulus and then the things they've got to judge; get them to, for example, choose from a pull-down menu on a five-point scale; and get their response back. What remains, then, is how many samples to play them - in other words, how big to make the test - and how many listeners to do this with. So we need to think about the size of the test. And something that we'll talk about as we go through these points in detail - but always keep in mind - is the listeners themselves. Sometimes they're called subjects. What type of listener are they? Maybe they should be native speakers.
Maybe it doesn't matter. Where are we going to find them? Well, offering payment is usually effective. And, importantly, how do we know that they're doing the job well? So, for example, when we fail to find any difference between two systems, is it because there really was no difference? Or is it because the listeners weren't paying attention? So we want some quality control, and that would be built into the test design; we'll see that as we go through.

Let's examine each of those points in a little bit more detail and make them concrete with some examples of how they might appear to a listener. By far and away the most common sort of listening test is one that attempts to measure the naturalness of the speech synthesiser. In this test, listeners hear a single stimulus in isolation, typically one sentence, and then they give a judgement on that single stimulus. We take the average of those judgements, and we call that the mean opinion score. Now, here we have to call this an absolute judgement, because that score is on a scale of 1 to 5 and it's of a single stimulus. It's not relative to anything else. However, we must be very careful not to conflate two things. It's absolute in the sense that it's of a single stimulus, and we expect the listener to be able to do that task without reference to some baseline or some topline. However, that does not mean that if we repeat this test another time we will get exactly the same judgements. In other words, it's not necessarily repeatable, and it's certainly not necessarily comparable across listening tests. There's no guarantee of that; we would have to design that in.

To understand why there's this difference - between the task being absolute in terms of the judgement the listener gives, and the test not necessarily being repeatable or comparable - imagine we have a system that's quite good. We put it in a listening test, and in the same listening test are some variants on that system which are all really, really bad. Our "quite good" system is going to get good scores. Maybe our listeners are going to keep scoring it four out of five, because it's quite good. It's not perfect, but it's quite good. Now let's imagine another version of the test. We put exactly the same system in - it's still quite good - but it's mixed in with some other variants on the system that are also quite good. In fact, it's really hard to tell the difference between them, so they're all somewhere in the middle. It's very likely our listeners, in that case, will just give everything a three, which is halfway between one and five: it's not really bad, it's not really good, and they can't tell the difference between them, so they just give everything a score of three. So we might get a different score for the same system, depending on the context around it. In other words, listeners probably do recalibrate themselves according to what they've heard recently or elsewhere in the test. We need to worry about that.

One solution to this problem of not being able to get truly absolute, calibrated opinions from listeners is not to ask for them at all, but to ask only for comparisons - only for relative judgements. So we'll give them multiple things to listen to - the most obvious number is two - and we'll ask them to say which they prefer. If you want to capture the strength of the preference, you might also put in a "don't care", or "equally natural", or "can't tell the difference" sort of option.
If they always choose that, then we know that these systems are hard to tell apart. That's optional: we could just have two options, A is better or B is better. There are variations on this that aren't just pairwise forced choice: we might use more than two stimuli, we might include some references, and we might even hybridise rating and sorting. We'll look in a moment at a test called MUSHRA that attempts to take the best of the pairwise forced choice (which is essentially relative), the MOS (which is essentially absolute), and the idea of calibration by providing reference samples, and put all of that into a single test design that's intuitive for listeners and has some nice properties for us.

So, mean opinion score is the most common by far, and that's because the most common thing we want to evaluate is naturalness. Here's a typical interface. It's got an audio player - in this case, the listener can listen as many times as they want - and it's got a way of giving the response. Here it's a pull-down menu, but it could equally be boxes to tick, which is another option. And here we can see that all the points along the scale are described. When they've done that, they submit their score and move on to the next sample. If we're asking them to transcribe what they heard, maybe the interface would look like this: some way of playing the audio, and a place to type in their text. In this particular design, they're only allowed to listen to the speech once. That's another design decision we have to make: would their judgement change if they were allowed to listen many, many times? Certainly they would take longer to do the test. And other designs are possible. We won't go into a lot of detail on this one, but we could ask for multiple judgements on the same stimulus. That might be particularly appropriate if the stimulus is long - for example, a paragraph from an audiobook - and we might try and get judgements not just of naturalness, but of some of those sub-dimensions of naturalness. Here are some examples of those scales.

And finally, that MUSHRA design, which is an international standard. There are variants on it, but this is the official MUSHRA design. We have a reference that we can listen to; that would be natural speech. And then we have a set of stimuli that we have to listen to: in this test there are eight, so you can see you can do a lot of samples side by side. For each of them the listener has to choose a score, but in doing so they are also implicitly sorting the systems. So it's a hybrid sorting and rating task. MUSHRA builds in a nice form of quality control. It instructs the listener that one of these eight stimuli is in fact the reference, and that they must give it a score of 100; in fact, they're not allowed to move on to the next set until they do so. This forces them to listen to everything, and it gives some calibration. We might also put in a lower bound - a bottom sample that we know should always be worse than all of the systems; that's known as the anchor. It's not obvious what that would be for synthetic speech, so it's not currently used. But the idea of having this hidden reference is very useful, because it calibrates - it provides a reference to listen to - and it catches listeners who are not doing the task properly. So it gives some quality control and some calibration. Now, this is definitely not a course on statistics.
And so we're not going to go into the nitty gritty of exactly how the size of a sample determines the statistical power that we would need to make confident statements such as "system A is better than system B". But what we can have are some pretty useful rules of thumb, which we know from experience tend to lead to reliable results in listening tests. The first one is not to stress the listeners out too much, and very definitely not to allow them to become bored, because their judgements might change. A test duration of 45 minutes is actually really quite long, and we certainly wouldn't want to go beyond that. We might actually go much shorter than that; in online situations, where we don't have control over the listeners and their listening situation - where they're recruited via the Web - we might prefer much, much shorter tasks. Because one individual listener might be atypical - they might just hear synthetic speech in a different way, or have different preferences to the population - we don't want a single listener's responses to have an undue influence on the overall statistics. And the only way to avoid that is to have a lot of different listeners. I would say we need at least 20 listeners in any reasonable listening test, and preferably always more than that - maybe 30 or more. The same applies to the materials: the text that we synthesise and then play to listeners. It's possible that one sentence is atypical and favours one system over another, just out of luck. And so, again, the only way to mitigate that is to use as many different sentences as possible. Our test design might combine the number of listeners and the number of sentences, as we're going to see in a bit; the number of sentences might be determined for us by the number of systems and the number of listeners that we want to use.

Far and away the easiest thing to do is to construct one listening test and then just repeat the same test with many different listeners. In other words, all listeners individually each hear everything. They all hear the same stimuli, and the most we might do is put them in a different random order for each listener. That seems a great idea, because it's going to be simple to deploy and it's going to be balanced, because all listeners hear all stimuli. But there are two situations where it's not possible to do that. One of them is obvious: if we want to compare many different systems, and we're sure we need many different texts, we simply can't fit all of that into a 45-minute test, or a lot less. So we would have to split that across multiple listeners. But there's a more complex and much more important reason that we might not have a within-subjects design, and that's because we might not want to play certain sequences or certain combinations of stimuli to one listener. That's because there might be effects of priming - in other words, having heard one thing, the response to some other thing changes - or of ordering: for example, if we hear a really good synthesiser and then a really bad synthesiser, the bad one's score will be even lower because of that ordering effect. So if we're worried about these effects, we're going to have to have a different sort of design, and that design is called between-subjects.
The essence of this design is that we form a single virtual subject from a group of people, and we split the test so that one thing heard by one listener can't possibly affect another thing heard by a different listener, because they don't communicate with each other. In other words, there's no memory carry-over effect from listener to listener, so we can effectively make a virtual listener who doesn't have a memory. In both designs, whether it's within-subjects or between-subjects, we can build in various forms of quality control. The simplest is actually to deliberately repeat some items and make sure listeners give about the same judgement in the repeated case - to check that they are consistent and not random. One way to do that in a forced choice - a pairwise test - is simply to repeat a pair, ideally in the opposite order. In fact, in some pairwise listening tests every A/B pair is played in the order A followed by B, and somewhere else in the test it's played as B followed by A, so all pairs are repeated. Then we look at the consistency across those repeats to get a measure of how reliable our listeners were, and we might throw away listeners who are very unreliable.

So I'm now going to explain one type of between-subjects design that achieves a few very important goals. The goals are the following. We have a set of synthesisers; here I've numbered them 0, 1, 2, 3, 4 - so five synthesisers. We would like each of them to synthesise the same set of sentences. We'd like those sentences to be heard by a set of listeners; here they're called subjects. But there are some constraints. We can't allow an individual subject to hear the same sentence more than once. That means it's not possible for a single subject to do the whole listening test, because that would involve hearing every sentence synthesised by all five synthesisers. So what we do is form groups of subjects, which are the rows of this square. The square is called a Latin square. It's got the special property that each number appears in every row and every column just once. So we'll take one of our subjects, and they will listen across a row of this square: they will hear sentence A said by synthesiser number zero, and they will perform a task. Here we're doing an intelligibility task, where it's absolutely crucial that no subject hears the same sentence more than once. So they must make one pass through one row of the matrix, and they can't do any of the other rows. They'll type in what they heard, then they'll move on and hear the next one, listening across a row of this Latin square. You'll notice that they're going to hear every sentence just once each, and each time it's said by a different synthesiser - a different number in that cell. This achieves the property that they only hear each sentence once, but the subject hears all the different synthesisers, so they'll be somehow calibrated: their relative judgements will be affected by having heard both the best and the worst synthesiser, and not only bad ones or only good ones. Now, when this subject has completed this row of the square, they've heard every synthesiser - but each synthesiser was only saying a single sentence. That's not fair, because sentence A might be easier to synthesise than sentence B, so at the moment synthesiser zero has an unfair advantage. So we need to complete the design.
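To make the completed square concrete, here is a minimal sketch (not from the lecture; the sentence labels are made up) that builds a cyclic 5x5 Latin square and prints what each subject in one group would hear.

```python
# Minimal sketch of a Latin square design for 5 systems and 5 sentences.
# Row r, column c holds the system that says sentence c to subject r of the group.

n = 5
sentences = ["A", "B", "C", "D", "E"]   # made-up sentence labels

# A cyclic Latin square: each system appears once in every row and every column.
square = [[(r + c) % n for c in range(n)] for r in range(n)]

for r, row in enumerate(square):
    plan = ", ".join(f"sentence {sentences[c]} by system {row[c]}" for c in range(n))
    print(f"subject {r}: {plan}")

# Every subject hears each sentence exactly once; across the five subjects
# (one 'virtual listener'), every system is heard saying every sentence.
```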
We complete it by balancing the design: we take another subject, who listens to the sentences said by a different permutation of the synthesisers, and we work our way down the rows like this. So we've got five individual subjects, but they're grouped together to form a single virtual listener. And so we need to have at least five people; if we want more than five, we need multiples of five, so we put the same number of people in each row. This virtual listener, composed of five human beings, listens like this through the Latin square, and they hear every synthesiser saying every sentence. That's five by five: 25 things. But no individual human being ever hears the same sentence twice. And then we just aggregate the opinions or the responses - in this case the typed-in responses, for example as word error rates - across all subjects to form our final statistics. The assumption here is that all subjects behave the same, and that we can take five individual people, split the test across them, and then aggregate the results.

Okay, we got into a bit of detail there about a very specific listening test design, although it is a highly recommended one and it's widely used, particularly in the Blizzard Challenge, to make sure that we don't have effects of repeated presentation of the same text to a listener. That's crucial in intelligibility testing, because listeners will remember those sentences. So we're still talking about subjective evaluation - in other words, using listeners and getting them to do a task. We've decided what the task might be, such as typing in what they heard or choosing from a five-point scale. We've looked at various test interfaces - they might be pull-down menus or type-in boxes - and, if it's necessary, and it very often is, we use this between-subjects design to remove the problem of repeated stimuli. But what sentences are we actually going to synthesise and play to the listeners? In other words, what are the materials we should use?

Now, there's a possible tension when choosing those materials. If our system is for a particular domain - maybe it's a personal assistant - we would probably want to synthesise sentences in that domain and measure its performance, say its naturalness, on such sentences. That might conflict with other requirements of the evaluation and the sort of analysis we might want to do, particularly if we want to test intelligibility. So we might sometimes even want to use isolated words, to narrow down the range of possible errors a listener can make and to make our analysis much simpler. We could even use minimal pairs. That's not so common these days, but it's still possible. More likely we'll use full sentences, to get a bit more variety; that makes things a little harder to analyse, but it's going to be a lot more pleasant for listeners.

Let's look at the sort of materials that are typical when testing intelligibility. We would perhaps prefer to use nice, normal sentences, either from the domain of application or pulled from a source such as a newspaper. The problem with these sentences is that, in quiet conditions - which we hope our listeners are in - we'll tend to get a ceiling effect: they're basically perfectly intelligible. Now, that's success in the sense that our system works, but it's not useful experimentally, because we can't tell the difference between different systems. One might be more intelligible than the other, but we can't tell if they're both synthesising such easy material, and so we have to pull them apart by making things harder for listeners.
The standard way of making the task harder for listeners is to make the sentences less predictable. We tend to pick things that are syntactically correct - because otherwise they would sound very unnatural - but semantically unpredictable, because they have kind of conflicting meanings within them. These are very far from actual system usage; that's a reasonable criticism of them, but they take us away from the ceiling effect and help us to differentiate between different systems. That can be very important, especially in system development. A simpler way of doing intelligibility testing, which gets us down to specific individual errors, is to use minimal pairs, and we'll typically embed them in sentences because that's nicer for the listener: "Now we will say cold again." "Now we will say gold again." They differ in one phonetic feature, and we can tell if a synthesiser is making that error. This is very, very time consuming, because we're getting one small data point from each full sentence. It's actually rarely used these days; much more common - in fact, fairly standard - is to use semantically unpredictable sentences. These can be generated by algorithm from a template, or a set of template sentences, into whose slots we drop words from some lists. Those lists might be of words of controlled frequency - medium-frequency words, and so on.

There are other standards out there for measuring intelligibility, as there are for naturalness and for quality. You need to be a little bit careful about simply applying these standards, because they were almost exclusively developed for measuring natural speech. What they're trying to capture is not the effects of synthetic speech, but the effects of a transmission channel and the degradations it has imposed upon the original natural speech, so they don't always automatically translate to the synthetic speech situation. There are other ways than semantically unpredictable sentences to avoid the ceiling effect. We might add noise; that's always going to make the task harder. We might distract the listener with a parallel task that increases their cognitive load. But in that case we're again starting to worry that we're measuring something about the listener's ability and not about the synthetic speech. Maybe you can think of some novel ways to avoid the ceiling effect without using rather unnatural, semantically unpredictable sentences.

Naturalness is more straightforward, because we can use relatively normal sentences, just pulled from some domain. It could be the domain of usage of the system; it might be some other domain. We still might pay a little bit of attention to them and use some designed text. The most common form of that is sometimes known as the Harvard or IEEE sentences. These have the nice property that, within their lists, there's some phonetic balance.
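To tie the intelligibility-testing ideas together, here is a minimal sketch (not from the lecture): a toy template-based generator of semantically unpredictable sentences, and the standard word-level edit distance used to score listeners' typed-in responses as a word error rate. The template and word lists are invented; a real design would control word frequency and syntactic variety much more carefully.

```python
import random

# Toy SUS generator: a syntactic template whose slots are filled from word lists.
TEMPLATE = "the {adj} {noun} {verb} the {noun2}"
WORDS = {
    "adj":   ["green", "sudden", "hollow", "bright"],
    "noun":  ["chair", "river", "pencil", "cloud"],
    "verb":  ["drinks", "paints", "follows", "breaks"],
    "noun2": ["window", "spoon", "mountain", "letter"],
}

def make_sus(rng=random):
    """One syntactically correct but semantically unpredictable sentence."""
    return TEMPLATE.format(**{slot: rng.choice(words) for slot, words in WORDS.items()})

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edits to turn the first i reference words into the first j response words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

if __name__ == "__main__":
    stimulus = make_sus()
    print("stimulus:", stimulus)
    print("WER:", wer(stimulus, "the green chair drinks the spoon"))
```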
Now, while using listeners to evaluate your synthetic speech is by far the most common thing to do, it would be nice if we could find a measure - an algorithm, perhaps - that could evaluate objectively. We're using the word "objectively" here to imply that listeners are subjective. So, is that possible? In a subjective test, we have subjects, and they express, perhaps, an opinion about the synthetic speech. In an objective test, we have an algorithm. It might be as simple as measuring a distance between the synthetic speech and a reference sample, which is almost certainly going to be natural speech. Or we might use some more sophisticated model, and the model will try and take account of human perception. So let's turn our minds to whether an objective measure is even possible, think about why it would be non-trivial, and then look at what can be done so far. But throughout all of this, let's always remember that we're not yet able to replace listeners.

Simple objective measures involve just comparing synthetic speech to a reference. The reference will be natural speech, and the comparison will probably be just some distance. Now, there's already a huge number of assumptions in doing that. First of all, it assumes that natural speech is somehow a gold standard and we can't do better than that. But any natural speaker's rendition of a sentence is just one possible way of saying that sentence, and there are many, many possible ways. It also assumes that the distance between a synthetic sample and this natural reference is somehow going to indicate what listeners would have said about it - that it's going to correlate with their opinion scores. Most objective measures are just a sum of local differences. So we're going to align the natural and synthetic speech - that could be done with, say, dynamic time warping, or we could force our synthesiser to generate speech with the same durational pattern as the natural reference. Then we'll sum the local differences, and that will be some total distance. That would be our measure, and a large distance would imply less natural. But this use of a single natural reference as a gold standard is obviously flawed, for the reason that we've already stated: there is more than one correct way of saying any sentence. So the method, as it stands, fails to account for natural variation. We could try and mitigate that by having multiple natural examples, but we can't really capture all of the possible valid and correct natural ways of saying a sentence.

These simple objective measures are just based on properties of the signal - that's all we have to go on. And the properties are the obvious things: the stuff that might correlate with phonetic identity, for example the spectral envelope, and the stuff that might correlate with prosody - the F0 contour, maybe energy. For the spectral envelope, we just think about a distortion: the difference between natural and synthetic. We would want to do that in a representation that's somehow relevant to perception, that captures some perceptual properties. One thing we know of that does that is the mel cepstrum: so, Mel Cepstral Distortion. For an F0 contour, there are a couple of different things we might measure. We might just look at the difference, in Hertz or some other unit, between the natural and synthetic contours, sum those up, and calculate the root mean squared error between the two contours. That's just a sum of local differences.
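Written out explicitly (an added illustration, not from the lecture slides), over $N$ aligned voiced frames that error is

$$\mathrm{RMSE}_{F_0} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\Big(F_0^{\mathrm{nat}}(n) - F_0^{\mathrm{syn}}(n)\Big)^2}$$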
That has a problem: if there's just a small offset between the two contours, it will show up as rather a large error. And so we might also measure the shape of the two contours, with something like correlation. If we produce a synthetic F0 contour that essentially has the right shape but is offset, or whose amplitude is a little bit too big or too small, we would still predict that it sounds quite good to listeners: it has peaks in the right places, for example. So for F0 we might have a pair of measures: error and correlation.

Let's just see what properties those measures are looking at. If we've got a natural and a synthetic version of a sentence, the first thing we're going to do, of course, is make some alignment between them. So we'll extract some features - maybe MFCCs - and do some procedure of alignment. A good way to do that would be dynamic time warping. That's just going to make a non-linear alignment between these two sequences, matching up the most similar sounds with each other. Then we can take frames that have been aligned with one another and look at the local difference. Maybe we'll take that frame there and its corresponding frame - those two were aligned by the dynamic time warping. From this one - this is just a spectrum, with frequency on one axis and magnitude on the other - we might extract the spectral envelope and represent that with the cepstrum. We'll do the same for the other, and in the cepstral domain we'll compare these two things and measure the difference, or distortion. Then we just sum that up over every frame of the utterance. That will be Mel Cepstral Distortion.

Now, this is a pretty low-level signal measure - it's the same sort of thing we might do in speech recognition to say whether two things are the same or different - but it's got some properties that seem reasonable. If the two signals were identical, the distortion would be zero; that seems reasonable. If the two signals were extremely different, the distortion would be high; that also seems reasonable. So, to that extent, it will correlate with listeners' judgements: very high cepstral distortion would suggest that listeners will say the two sound very different or - if played only the synthetic sample - that it sounds very unnatural. And zero distortion means the synthetic is identical to the natural. So for big differences this measure is probably going to work. However, the relationship between Mel Cepstral Distortion and what listeners think isn't completely smooth and monotonic, and so small differences are not going to be captured so reliably by this measure. And in system development, as we incrementally improve our system, it's actually small differences that we most want to measure. Nevertheless, this distance measure - or distortion - in the mel cepstral domain is very widely used in statistical parametric synthesis. It's very much the same thing that we're actually minimising when we're training our systems, and so it's a possible measure of system performance, or of the predicted naturalness of the final speech. Something to think about for yourself: would this measure make sense for synthetic speech made by concatenation of waveforms, remembering that those are locally, essentially, perfectly natural - any local frame is natural speech? It might not be the same as the natural reference, but it's perfectly natural. Would this measure capture the naturalness of unit selection speech? That's something to think about for yourself.
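As a concrete illustration of that procedure, here is a minimal sketch only, assuming the librosa library is available, using MFCCs as a rough stand-in for the mel cepstrum, and with hypothetical file names; the exact constant and choice of coefficients vary between papers.

```python
import numpy as np
import librosa

def mel_cepstral_distortion(path_natural, path_synthetic, n_mfcc=13):
    """Rough MCD (in dB) between a natural and a synthetic utterance.

    MFCCs stand in for the mel cepstrum; DTW provides the non-linear alignment;
    c0 (overall energy) is excluded, as is conventional.
    """
    y_nat, sr = librosa.load(path_natural, sr=None)
    y_syn, _ = librosa.load(path_synthetic, sr=sr)

    # Feature matrices of shape (n_mfcc, n_frames)
    C_nat = librosa.feature.mfcc(y=y_nat, sr=sr, n_mfcc=n_mfcc)
    C_syn = librosa.feature.mfcc(y=y_syn, sr=sr, n_mfcc=n_mfcc)

    # Dynamic time warping gives pairs of aligned frame indices
    _, path = librosa.sequence.dtw(X=C_nat, Y=C_syn)

    # Sum (then average) the local differences along the alignment path
    k = 10.0 / np.log(10.0) * np.sqrt(2.0)
    frame_dists = [k * np.linalg.norm(C_nat[1:, i] - C_syn[1:, j]) for i, j in path]
    return float(np.mean(frame_dists))

# Example (hypothetical files):
# print(mel_cepstral_distortion("natural.wav", "synthetic.wav"))
```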
Let's move on to the other common thing to measure objectively, and that's F0. Here, again, we'll have to make some alignment between synthetic and natural. We'll look at local differences - perhaps just the absolute or squared difference - and sum those up across the utterance so that we get the root mean squared error. Correlation would measure the similarity in the shapes of the two contours: for example, whether they have peaks in the same places. Again, this has some reasonable properties. If the contours are identical, the RMSE is zero, and we would expect listeners to say the prosody was perfect, or perfectly natural. If they're radically different, we get a large error, and that's a prediction of unnaturalness. But, once again, there are lots of valid natural contours, and our one reference contour isn't the only way of saying the sentence, so the error won't perfectly correlate with listeners' opinions.

So the summary message for the simple objective measures is: you'll find them widely used and reported in the literature, especially for statistical parametric synthesis. They have some use as a development tool, because they can be computed very quickly during development without going out to listeners, and they're actually quite close to the measures we're minimising when training statistical models anyway. But we should exercise caution in their use, and they're not a replacement for using listeners. We might want to report these results alongside subjective results, and if there's a disagreement, we'll probably believe the subjective result and not the objective measure.

But these are very simple, low-level, signal-processing-based measures. There's not a whole lot to do with perception in here. The cepstrum might be the mel cepstrum, to get some perceptual warping. For F0, we might not work in Hertz; we might put that onto a perceptual scale - maybe we'll just take the log and then express the differences in semitones. So we can put in a little perceptual knowledge, but not a lot. And so it seems reasonable to try and put a much more complex perceptual model into our objective measure. There is a place where reasonably sophisticated models of perception - in fact, things that have been fitted to perceptual data - are widely used and are reliable; they're in fact part of international standards. That's the field of telecommunications. So, once again, this is for the transmission of natural speech, which gets distorted in transmission, and these measures either assess the received signal or the difference between the transmitted and received signals. The standard measure until recently was called PESQ, and recently a new, better measure has been proposed, called POLQA. Both of these standards exist in software implementations, and they're used widely by telecommunications companies. As they stand - certainly this is true for PESQ - they don't correlate well, or in fact at all, with listeners' opinions of synthetic speech. So we can't just take these black boxes and use them to evaluate the naturalness of synthetic speech, pretending that synthetic speech is just a distorted form of natural speech, because the distortions are completely different from the sort of distortions we get in the transmission of speech. However, there are attempts to take these measures, which are typically big combinations of many, many different features, and modify them: for example, modify the weights in this weighted combination and fit those weights to listening test results for synthetic speech.
So there are some modified versions, and they do work to some limited extent. They're not yet good enough to replace listeners, but they are now correlating, roughly speaking, with listeners' opinions. So, to wrap this part up: there are standards out there, like PESQ, and it's tempting to think we could just apply those to synthetic speech and they will give us a number. Unfortunately, that number's pretty meaningless. It's not a good predictor of what listeners would say if we did the proper listening test. We could try modifying PESQ, which has been done.
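Going back to the simple F0 measures described earlier, here is a minimal sketch of RMSE and correlation computed over a pair of contours. It assumes the two contours have already been time-aligned frame-by-frame (for example via the same DTW path used for the spectral features); dropping unvoiced frames and converting to semitones are simplifications chosen for illustration, not the only reasonable choices.

```python
# Sketch of F0 RMSE and correlation between time-aligned natural and
# synthetic F0 contours (in Hz). Unvoiced frames (F0 = 0) are simply dropped;
# a real implementation needs an explicit policy for voicing mismatches.
import numpy as np

def f0_rmse_and_correlation(f0_nat, f0_syn, semitones=True):
    f0_nat = np.asarray(f0_nat, dtype=float)
    f0_syn = np.asarray(f0_syn, dtype=float)

    # Keep only frames voiced in both contours
    voiced = (f0_nat > 0) & (f0_syn > 0)
    a, b = f0_nat[voiced], f0_syn[voiced]

    if semitones:
        # Perceptual scale: 12 * log2 expresses differences in semitones
        a = 12.0 * np.log2(a)
        b = 12.0 * np.log2(b)

    rmse = np.sqrt(np.mean((a - b) ** 2))   # overall error
    corr = np.corrcoef(a, b)[0, 1]          # similarity of contour shape
    return float(rmse), float(corr)
```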
So that's about all we have to say about the evaluation of speech synthesis. We could evaluate things about the system, but mainly we wanted to evaluate the speech itself - so it's the evaluation of synthetic speech. Let's just wrap up by reminding ourselves why we were doing that in the first place, and looking again at what to do with the outcome of it. So let's just repeat our diagram of how research works and, indeed, how development works. We need carefully controlled experiments to tell the differences between, for example, these four system variants, which might all be very similar. Our brilliant idea may have made only a very small improvement in the quality of the synthetic speech. We need a very sensitive evaluation that can detect that and tell us to take that improvement forward and to leave the others behind. We need to make sure there are no confounding factors. So, when we look at these 1, 2, 3, 4 systems, the only difference between them must be the thing we changed, and everything else must be kept constant. For example, the front end, the database, the signal processing - whatever we haven't changed must be identical, so there are no confounds. When we do this evaluation, find that one of the variants is better, and take it forward, we're then assuming that it was our change that was responsible for that. What we can't do easily is run a listening test, look at the result of it, and somehow back-propagate that through the pipeline of the text-to-speech system to find that it was the letter-to-sound module that was to blame. That's because these pipeline architectures are simply not invertible. We can push things through them in one direction; we can't push the errors back through them in the other direction. If we had a complete end-to-end machine learning approach to the problem - for example, a great big neural net that went all the way from text to a waveform - we could imagine back-propagating errors through the system. And that's one direction the field is heading in: something that's learnable in an end-to-end fashion and not just in a modular fashion. However, we're not quite there yet. Of course, that great big machine learning black box isn't a final solution, because although we might be able to learn in an end-to-end fashion, when it does make mistakes we really can't point to any part of the system to blame, because it's a black box. So there's still an advantage to a pipeline architecture made of modules that we understand, that we can take apart, and that we can replace. Let's finish off with a table of recommendations about which test to use to measure which property - so, depending on what you want to evaluate, which test should you pick? Naturalness is the most common thing to be evaluated. It's perfectly good to use Mean Opinion Score, very often on a five-point scale (though you might prefer a seven-point scale for some reason), or something closely related to that, MUSHRA, which has a built-in error-checking and calibration mechanism. A forced choice is also perfectly okay, because it's quite a straightforward task to explain to listeners: which of these two samples sounds the most natural? I can't immediately think of a task, other than those, that you could ask listeners to do from which you could gauge the naturalness of a signal.
Similarity to Target Speaker is essentially the same sort of task as naturalness, so people tend to do MOS- or MUSHRA-type tests - obviously, with a reference sample of what the target speaker is supposed to sound like. You could use forced choice to say which of two samples sounds more like a particular target speaker. Judging whether something is or is not a target speaker is a slightly odd task for listeners, because in real life someone either is or is not the target, so making graded distinctions may be a little bit tricky. But this is probably fine. One thing you'll find in the literature, and indeed in some standards, is the use of opinion scores to rate intelligibility. It's hard to see how that could be acceptable: how do you know whether you've understood something? OK, if something is perfectly intelligible or very unintelligible, then we could make a judgement on that and give an opinion. But fine distinctions? If we mishear a word, how do we know we've misheard it? So I would never recommend the use of opinion scores or MUSHRA or anything like that to rate intelligibility. I would only ever really trust people typing in what they heard as the task. The only exception to that might be if there are only a couple of possible things they could hear, such as in minimal pairs, where you could possibly do a forced choice; but this is very uncommon these days. You will see reports of evaluations that don't really say what they're looking for - it's just preference. They're probably implying naturalness, but not saying it. In this case, if we're really not sure what we're asking listeners to do - we're just asking, do you like this system more, or this other system more - then all you can really do is a forced choice preference test, because you're not really telling them what to listen for at all, just asking for an overall preference. That would also be appropriate for choosing between two different voices: which was most pleasant, for example, which is different from naturalness. Well, we've reached the end of the first part of the course, which has been all about unit selection, the data behind it, and how to evaluate it. What we'll move on to next is a different form of synthesis, called Statistical Parametric Speech Synthesis. And what you've learned in this unit selection part will still be highly relevant, for several reasons. Of course, we will want to evaluate the speech from the statistical parametric method, and what we've talked about in this module on evaluation all still applies. We'll still need a database, and we'll probably still want coverage in that database, although it might not be as important as in unit selection. And eventually we'll join together the statistical parametric method and the unit selection method into something called hybrid synthesis. So now that you know all of this, take what we've learned, go and put it into practice, and go and build your own unit selection voice.
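Since typing in what was heard is the recommended task for intelligibility, the responses are usually scored with some form of word error rate against the reference text. Here is a minimal, illustrative sketch under naive assumptions (whitespace tokenisation, lower-casing only); real scoring pipelines normalise spelling variants, punctuation, homophones, and so on before comparison.

```python
# Minimal Word Error Rate (WER) sketch for scoring typed listener responses
# against the reference sentence, as in intelligibility tests such as SUS.
def word_error_rate(reference, hypothesis):
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Standard dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)

    # Substitutions + deletions + insertions, divided by reference length
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: a made-up SUS-style sentence and a listener's typed response
print(word_error_rate("the table walked through the blue truth",
                      "the table walked through blue tooth"))
```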
Reading
Taylor – Section 17.2 – Evaluation
Testing of the system by the developers, as well as via listening tests.
Bennett: Large Scale Evaluation of Corpus-based Synthesisers
An analysis of the first Blizzard Challenge, which is an evaluation of speech synthesisers using a common database.
Benoît et al: The SUS test
A method for evaluating the intelligibility of synthetic speech, which avoids the ceiling effect.
Clark et al: Statistical analysis of the Blizzard Challenge 2007 listening test results
Explains the types of statistical tests that are employed in the Blizzard Challenge. These are deliberately quite conservative; for example, MOS data is correctly treated as ordinal. Also includes a section on Multi-Dimensional Scaling (MDS), which is not as widely used as the other types of analysis.
Mayo et al: Multidimensional scaling of listener responses to synthetic speech
Multi-dimensional scaling is a way to uncover the different perceptual dimensions that listeners use, when rating synthetic speech.
Norrenbrock et al: Quality prediction of synthesised speech…
Although standard speech quality measures such as PESQ do not work well for synthetic speech, specially constructed methods do work to some extent.
King: Measuring a decade of progress in Text-to-Speech
A distillation of the key findings of the first 10 years of the Blizzard Challenge.
Let’s test your recall of the video material. Make sure you’ve watched all the videos and read through the corresponding slides. Don’t refer back to them whilst doing this quiz!
Download the slides for the class on 2024-02-06
Download the whiteboards from the class on 2024-02-06
You are now in a position to design and run your own listening test.
As with all experimental work, your listening test should be driven by one or more clearly stated hypotheses.
There are lots of free tools available to help you implement a test: most of these operate in any web browser – see the forums for some suggestions or to ask for help (login required).