So now on to how to do the evaluation. There are two main forms of evaluation. The first is subjective, in which we use human beings as listeners: their task involves listening to synthetic speech and doing something, making some response. We're going to look at how to design a test of that form, including things such as what materials you would use. The other form of evaluation is called objective, and that uses some algorithm, some automated procedure, to evaluate the synthetic speech without the need for listeners. That's a very attractive idea, because using listeners can be slow, expensive and just inconvenient: you have to recruit people and get them to listen to your synthetic speech. However, objective measures are far from perfect, and they're currently not a replacement for subjective evaluation. We'll look at a couple of forms of objective measure just in principle, not in huge detail. The most obvious idea would be to take some natural speech and somehow measure the difference - the distance - between your synthetic speech and this reference sample of natural speech, on the assumption that a distance of zero is the best you can do: then you're perfectly natural. Or you could use something a bit more sophisticated: you could try to model something about human perception, which isn't as simple as just a distance, and we might need a model of perception to do that. That's even harder, and it's definitely a research topic rather than a well-established way of evaluating synthetic speech at this time. So we'll focus on the subjective form, in other words the listening test. In subjective evaluation, the subjects are people, sometimes called listeners, and they're going to be asked to do something. They're going to be given a task, and it's up to us what that task might be. In all cases, it's best if the task is simple and obvious to the listener; in other words, it doesn't need a lot of instruction, because people don't read instructions. We might just play pairs of samples of synthetic speech, perhaps from two different variants of our system, and ask which one they prefer. That's a forced choice pairwise preference test. We might want a more graded judgement than that. For example, we might ask them to make a rating on a scale, maybe a Likert scale with a number of points on it. Those points might be described: sometimes we describe just the end points, sometimes all of the points. They'll listen to one sample and give a response on a 1-to-5 scale; that's very common for naturalness. Or their task might be to do something else, for example to type in the words that they heard. It's reasonable to wonder whether just anybody can do these tasks, and so we might ask: should we train the listeners, for example to pay attention to specific aspects of the speech? This idea of training is pretty common in experimental phonetics, where we would like the listeners to do something very specific: to attend to a particular acoustic property. But in evaluating synthetic speech, we're usually more concerned with getting a lot of listeners. And if we want a lot of listeners, who are going to listen to a lot of material, and we're going to repeat the test many times, then training listeners is a bit problematic. It's quite time-consuming, and we can never be certain that they've completely learned to do what we asked them to do. So training is quite uncommon in evaluating synthetic speech.
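To make the distance idea concrete, here is a minimal sketch of one common objective measure of this kind, mel-cepstral distortion. It assumes we already have mel-cepstral feature sequences for the natural reference and the synthetic version of the same sentence, and that they have already been time-aligned to the same number of frames; the function name and data layout are just for illustration.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    """Frame-averaged mel-cepstral distortion (in dB) between two
    already time-aligned cepstral sequences of shape (frames, coeffs).
    The 0th coefficient (overall energy) is conventionally excluded."""
    assert ref_mcep.shape == syn_mcep.shape, "sequences must be aligned"
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]              # drop c0
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))  # per-frame distance
    return float((10.0 / np.log(10.0)) * per_frame.mean())

# Toy usage with random "features", just to show the call shape.
rng = np.random.default_rng(0)
natural = rng.normal(size=(200, 25))
synthetic = natural + rng.normal(scale=0.1, size=(200, 25))
print(f"MCD: {mel_cepstral_distortion(natural, synthetic):.2f} dB")
```

In practice the natural and synthetic utterances have different durations, so the two sequences are usually aligned first (for example with dynamic time warping), and - as the rest of this section argues - a small distance does not guarantee that listeners will actually judge the speech as natural.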
The exception to that would be an expert listener: for example, a linguist, a native speaker in a company, listening to a system - a prototype system - and giving specific feedback on what's wrong with it. But naive listeners, the ones we can recruit on the street and pay to do our listening test, in general would not be trained. We'll just assume that, for example as native speakers of the language, they are able to choose between two things and express a preference. That's such a simple, obvious task that they can do it without any further instruction or training. If we wanted a deeper analysis than simply a preference, an alternative to training the listeners is to give them what appears to be a very simple task and then use some sophisticated statistical analysis to untangle what they actually did and to measure, for example, on what dimensions they were making those judgements about difference in this pairwise forced choice test. Doing this more sophisticated analysis - one example is multi-dimensional scaling, which projects the differences into a space whose axes we can then interpret - is not really mainstream. It's not so common: by far and away the most common thing is to give a very obvious task, either pairwise or Likert-scale type tasks, or a typing test for intelligibility. These are what we're going to concentrate on. So we've decided to give our listeners a simple and obvious task, and now we need to think about how they are going to do that task. For example, are listeners able to give absolute judgements about naturalness when played a single sample with no reference? Are they going to reliably rate it as, say, a three out of five every time? If we think that's true, we could design a task that relies on this absolute nature of listeners' judgements. That's possibly not true for things like naturalness: if you've just heard a very unnatural system and then a more natural system, your rating of the latter will be inflated by that. So, much of the time, we might want to make the safer assumption that listeners make relative judgements, and one way to be sure that they're always making relative judgements, and to give them some calibration, is to include some references. The most obvious reference, of course, is natural speech itself, but there are other references, such as a standard speech synthesis system that never changes as our prototype system gets better. Once we've decided whether our listeners need some help calibrating their judgements, or whether they can be relied upon to give absolute - in other words, repeatable - judgements every time, we need to present stimuli to them and get those judgements back. That needs an interface. Very often that will be done in a Web browser, so that we can run these tests online. It needs to present the stimulus and it needs to obtain the response; that's pretty straightforward. So we've decided to get, let's say, relative judgements from listeners: present them with a reference stimulus and then the things they've got to judge, get them to choose, for example, from a pull-down menu on a five-point scale, and get their response back. What remains, then, is how many samples to play them - so how big to make the test - and how many listeners to do this with. So we need to think about the size of the test. And something we'll mention as we go through these points in detail, but should always keep in mind, is the listeners themselves, sometimes called subjects. What type of listener are they? Maybe they should be native speakers.
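As an aside on that multi-dimensional scaling idea, here's a minimal sketch - not a standard evaluation recipe - assuming we have already summarised the listeners' pairwise judgements as a symmetric dissimilarity matrix between systems. The matrix and system names below are made up for illustration.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical dissimilarity matrix between four systems, e.g. the
# proportion of listeners who judged each pair of systems to differ.
systems = ["A", "B", "C", "D"]
dissimilarity = np.array([
    [0.0, 0.2, 0.7, 0.8],
    [0.2, 0.0, 0.6, 0.7],
    [0.7, 0.6, 0.0, 0.3],
    [0.8, 0.7, 0.3, 0.0],
])

# Project the systems into a 2-D space that preserves these distances
# as well as possible; we then try to interpret the axes perceptually.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)

for name, (x, y) in zip(systems, coords):
    print(f"system {name}: ({x:+.2f}, {y:+.2f})")
```

Plotting these coordinates and seeing which systems cluster together is then a matter of interpretation; as noted above, this is not a mainstream way of evaluating synthetic speech.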
Maybe it doesn't matter. Where are we going to find them? Well, offering payment is usually effective. And, importantly, how do we know that they're doing the job well? For example, when we fail to find any difference between two systems, is it because there really was no difference, or is it because the listeners weren't paying attention? So we want some quality control, and that should be built into the test design; we'll see how as we go through. Let's examine each of those points in a little more detail and make them concrete with some examples of how they might appear to a listener. By far and away the most common sort of listening test is one that attempts to measure the naturalness of the speech synthesiser. In this test, listeners hear a single stimulus in isolation, typically one sentence, and then they give a judgement on that single stimulus; we take the average of those judgements and call it the mean opinion score. Now, we have to call this an absolute judgement, because the score is on a scale of 1 to 5 and it's of a single stimulus; it's not relative to anything else. However, we must be very careful not to conflate two things. It's absolute in the sense that it's of a single stimulus, and we expect the listener to be able to do that task without reference to some baseline or some topline. However, that does not mean that if we repeat the task another time we will get exactly the same judgements. In other words, it's not necessarily repeatable, and it's certainly not necessarily comparable across listening tests - there's no guarantee of that; we would have to design that in. To understand why there's this difference between the task being absolute, in terms of the judgement the listener gives, and the test not necessarily being repeatable or comparable, imagine we have a system that's quite good. We put it in a listening test, and in the same listening test are some variants of that system which are all really, really bad. Our "quite good" system is going to get good scores; maybe our listeners will keep scoring it four out of five, because it's quite good - not perfect, but quite good. Now imagine another version of the test. We put exactly the same system in - it's still quite good - but it's mixed in with some other variants of the system that are also quite good. In fact, it's really hard to tell the difference between them, so they're all somewhere in the middle. It's very likely that our listeners, in that case, will just give everything a three, which is halfway between one and five, because it's not really bad, it's not really good, and they can't tell the difference between them - so they just give everything a score of three. So we might get a different score for the same system, depending on the context around it. In other words, listeners probably do recalibrate themselves according to what they've heard recently or elsewhere in the test. We need to worry about that. One solution to this problem of not being able to get truly absolute, calibrated opinions from listeners is not to ask for them at all, but to ask only for comparisons - only for relative judgements. So we'll give them multiple things to listen to - the most obvious number is two - and we'll ask them which they prefer. If we want to capture the strength of the preference, we might also put in a "don't care", "equally natural" or "can't tell the difference" sort of option.
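As a concrete illustration of the mean opinion score itself, here's a minimal sketch of the aggregation step, assuming the responses have already been collected as (system, listener, score) tuples. The system names, listener IDs and scores below are made up.

```python
from collections import defaultdict

# Hypothetical responses: (system, listener, score on a 1-5 scale).
responses = [
    ("baseline", "L01", 3), ("baseline", "L02", 4), ("baseline", "L03", 3),
    ("proposed", "L01", 4), ("proposed", "L02", 5), ("proposed", "L03", 4),
]

scores = defaultdict(list)
for system, _listener, score in responses:
    scores[system].append(score)

# The mean opinion score is simply the average rating per system.
for system, ratings in scores.items():
    mos = sum(ratings) / len(ratings)
    print(f"{system}: MOS = {mos:.2f} (n = {len(ratings)})")
```

As the discussion above warns, such an MOS is only meaningful relative to the other systems that appeared in the same test; comparing MOS values across separate listening tests is not safe.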
And if they always choose that option, then we know that the systems are hard to tell apart. That's optional: we could just have two options, A is better or B is better. There are variations on this that aren't just pairwise forced choice: we might use more than two stimuli, we might include some references, and we might even hybridise rating and sorting. We'll look in a moment at a test called MUSHRA that attempts to take the best of the pairwise forced choice, which is essentially relative, and the MOS, which is essentially absolute, plus the idea of calibration by providing reference samples, and put all of that into a single test design that's intuitive for listeners and has some nice properties for us. So mean opinion score is the most common by far, and that's because the most common thing we want to evaluate is naturalness. Here's a typical interface. It's got an audio player - in this case the listener can listen as many times as they want - and it's got a way of giving the response. Here it's a pull-down menu, but it could equally be boxes to tick; that's another option. Here we can see that all the points along the scale are described, and when they've made their choice, they submit their score and move on to the next sample. If we're asking them to transcribe what they heard, the interface might look like this: some way of playing the audio and a place to type in the text. In this particular design they're only allowed to listen to the speech once, and that's another design decision we have to make: would their judgement change if they were allowed to listen many, many times? Certainly they would take longer to do the test. Other designs are possible. We won't go into a lot of detail on this one, but we could ask for multiple judgements on the same stimulus; that might be particularly appropriate if the stimulus is long, for example a paragraph from an audiobook. And we might try to capture things that are not just naturalness, but maybe some of those sub-dimensions of naturalness; here are some examples of those scales. And finally, there's the MUSHRA design, which is an international standard. There are variants of it, but this is the official MUSHRA design. We have a reference that we can listen to - that would be natural speech - and then we have a set of stimuli that we have to listen to. In this test there are eight, so you can see that a lot of samples can be compared side by side. For each of them the listener has to choose a score, but in doing so is also implicitly sorting the systems; so it's a hybrid sorting-and-rating task. MUSHRA builds in a nice form of quality control. It instructs the listener that one of these eight stimuli is in fact the reference, and they must give it a score of 100; in fact, they're not allowed to move on to the next set until they do so. This forces them to listen to everything, and it gives some calibration. We might also put in a lower bound - a bottom sample that we know should always be worse than all of the systems - known as the anchor. It's not obvious what that would be for synthetic speech, so it's not currently used. But the idea of having this hidden reference is very useful, because it both calibrates - it provides a reference to listen to - and it catches listeners who are not doing the task properly. So it gives some quality control and some calibration. Now, this is definitely not a course on statistics.
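Here's a minimal sketch of that hidden-reference quality-control idea: flag or discard listeners who fail to give the hidden copy of the natural reference a score at (or near) the top of the scale. The data layout, label "hidden_ref" and threshold are assumptions for illustration, not part of the MUSHRA standard.

```python
# Hypothetical MUSHRA-style responses: scores[listener][stimulus] = 0-100,
# where "hidden_ref" is the hidden copy of the natural reference.
scores = {
    "L01": {"hidden_ref": 100, "sysA": 62, "sysB": 45},
    "L02": {"hidden_ref": 55,  "sysA": 70, "sysB": 30},   # missed the reference
    "L03": {"hidden_ref": 97,  "sysA": 58, "sysB": 52},
}

THRESHOLD = 90  # how highly the hidden reference must be rated to keep a listener

reliable = {
    listener: ratings
    for listener, ratings in scores.items()
    if ratings["hidden_ref"] >= THRESHOLD
}

print("kept listeners:", sorted(reliable))   # L01 and L03; L02 is excluded
```

In the strict design described above, the interface already refuses to let the listener move on until the hidden reference is scored 100, so a post-hoc check like this is most useful when that constraint is relaxed, or simply as a sanity check on the collected data.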
And so we're not going to go into the nitty-gritty of exactly how the size of a sample determines the statistical power we would need to make confident statements such as "system A is better than system B". But what we can have are some pretty useful rules of thumb, which apply to all listening tests and which we know from experience tend to lead to reliable results. The first one is not to stress the listeners out too much, and very definitely not to allow them to become bored, because their judgements might change. A test duration of 45 minutes is actually really quite long, and we certainly wouldn't want to go beyond that. We might actually go much shorter than that; in online situations where we don't have control over the listeners or their listening conditions - where they're recruited via the Web - we might prefer much, much shorter tasks. Because one individual listener might be atypical - they might just hear synthetic speech in a different way, they might have different preferences to the population - we don't want a single listener's responses to have an undue influence on the overall statistics, and the only way to avoid that is to have a lot of different listeners. I would say we need at least 20 listeners in any reasonable listening test, and preferably more than that, maybe 30 or more. The same applies to the material - the text that we synthesise and then play to listeners. It's possible that one sentence is atypical and favours one system over another, just out of luck, and so again the only way to mitigate that is to use as many different sentences as possible. Our test design might combine numbers of listeners and numbers of sentences, as we're going to see in a bit, so the number of sentences might be determined for us by the number of systems and the number of listeners that we want to use. Far and away the easiest thing to do is to construct one listening test and then just repeat the same test with many different listeners. In other words, all listeners individually each hear everything: they all hear the same stimuli, and the most we might do is put them in a different random order for each listener. That seems a great idea, because it's going to be simple to deploy and it's going to be balanced, because all listeners hear all stimuli. But there are two situations where it's not possible to do that. One of them is obvious: if we want to compare many different systems, and we're sure we need many different texts, we simply can't fit all of that into a 45-minute test, or a lot less. So we would have to split the test across multiple listeners. But there's a more complex and much more important reason why we might not want a within-subjects design, and that's because we might not want to play certain sequences or certain combinations of stimuli to the same listener. There might be effects of priming - in other words, having heard one thing, the response to some other thing changes - or of ordering: for example, if we hear a really good synthesiser and then a really bad synthesiser, the bad one's score will be even lower because of that ordering effect. So if we're worried about these effects, we're going to have to use a different sort of design, and that design is called between subjects.
The essence of this design is that we form a single virtual subject from a group of people, splitting the material between them, so that one thing heard by one listener can't possibly affect another thing heard by a different listener, because they don't communicate with each other. In other words, there's no memory carry-over from listener to listener, so we can effectively make a virtual listener who doesn't have a memory. In both designs, whether within subjects or between subjects, we can build in various forms of quality control. The simplest one is to deliberately repeat some items and check that listeners give about the same judgement in the repeated case - in other words, to check that listeners are consistent and not random. One way to do that in a forced choice - a pairwise test - is simply to repeat a pair, ideally in the opposite order. In fact, in some pairwise listening tests every A/B pair is played in the order A followed by B, and somewhere else in the test it's played as B followed by A, so all pairs are repeated. Then we look at the consistency across those repeats to get a measure of how reliable our listeners were, and we might throw away listeners who are very unreliable. I'm now going to explain one type of between-subjects design that achieves a few very important goals. The goals are the following: we have a set of synthesisers - here I've numbered them 0, 1, 2, 3, 4, so five synthesisers - and we would like each of them to synthesise the same set of sentences. We'd like those sentences to be heard by a set of listeners, here called subjects. But there are some constraints: we can't allow an individual subject to hear the same sentence more than once. That means it's not possible for a single subject to do the whole listening test, because that would involve hearing every sentence synthesised by all five synthesisers. So what we do is form groups of subjects, which correspond to the rows of this square. The square is called a Latin square, and it has the special property that each number appears in every row and every column just once. We'll take one of our subjects, and they will listen across a row of this square, so they will hear sentence A said by synthesiser number zero, and they will perform a task. Here we're doing an intelligibility task, where it's absolutely crucial that no subject hears the same sentence more than once. So they must make one pass through one row of the matrix, and they can't do any of the other rows. They'll type in what they heard, then move on and hear the next one, listening across a row of this Latin square. You'll notice that they're going to hear every sentence just once, and each time it's said by a different synthesiser - a different number in that cell. This achieves the property that the subject only hears each sentence once, but hears all the different synthesisers, so they'll be somehow calibrated: their relative judgements will be affected by having heard both the best and the worst synthesiser, not only bad ones or only good ones. Now, when this subject has completed this row of the square, they've heard every synthesiser - but each synthesiser was only saying a single sentence. That's not fair, because sentence A might be easier to synthesise than sentence B, so at the moment synthesiser zero has an unfair advantage. So we need to complete that.
We need to balance the design by taking another subject, who listens to the sentences said by a different permutation of the synthesisers, and we work our way down the rows. So we've got five individual subjects, but they're grouped together to form a single virtual listener. That means we need at least five people, and if we want more than five, we need a multiple of five, so that we put the same number of people in each row. This virtual listener, composed of five human beings, listens through the Latin square like this and hears every synthesiser saying every sentence - that's five by five, 25 things - but no individual human being ever hears the same sentence twice. Then we just aggregate the opinions or responses - in this case the typed-in responses, for example as word error rates - across all subjects to form our final statistics. The assumption here is that all subjects behave the same, so that we can take five individual people, split the test across them, and then aggregate the results. Okay, we got into a bit of detail there about a very specific listening test design, but it is a highly recommended one, and it's widely used, particularly in the Blizzard Challenge, to make sure that we don't have effects of repeated presentation of the same text to a listener. That's crucial in intelligibility testing, because listeners will remember those sentences. So we're still talking about subjective evaluation - in other words, using listeners and getting them to do a task. We've decided what the task might be, such as typing in what they heard or choosing from a five-point scale. We've looked at various test interfaces: they might use pull-down menus or type-in boxes. And if necessary - and it very often is - we use this between-subjects design to remove the problem of repeated stimuli. But what sentences are we actually going to synthesise and play to the listeners? In other words, what materials should we use? There's a possible tension when choosing these materials. If our system is for a particular domain - maybe it's a personal assistant - we would probably want to synthesise sentences in that domain and measure its performance, say its naturalness, on such sentences. That might conflict with other requirements of the evaluation and the sort of analysis we want to do, particularly if we want to test intelligibility. We might sometimes even want to use isolated words, to narrow down the range of possible errors a listener can make and so make our analysis much simpler. We could even use minimal pairs; that's not so common these days, but it's still possible. More likely, we'll use full sentences to get a bit more variety, which makes things a little harder to analyse but is a lot more pleasant for listeners. Let's look at the sort of materials that are typical when testing intelligibility. We would perhaps prefer to use nice, normal sentences, either from the domain of application or pulled from a source such as a newspaper. The problem with these sentences is that in quiet conditions, which we hope our listeners are in, we'll tend to get a ceiling effect: they're basically perfectly intelligible. Now, that's a success in the sense that our system works, but it's not useful experimentally, because we can't tell the difference between different systems. One might be more intelligible than the other, but we can't tell if they're both synthesising such easy material. And so we have to pull them apart by making things harder for listeners.
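To make that concrete, here is a minimal sketch of how such a design could be generated. This is not the actual Blizzard Challenge tooling; it just builds a cyclic Latin square for five synthesisers and five sentences, and prints which synthesiser each of the five subjects (one per row, together forming the virtual listener) hears for each sentence.

```python
from string import ascii_uppercase

def latin_square(n: int) -> list[list[int]]:
    """Cyclic n x n Latin square: each value 0..n-1 appears exactly once
    in every row and every column."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]

n_systems = 5
square = latin_square(n_systems)
sentences = list(ascii_uppercase[:n_systems])   # sentences A..E

# One human subject per row; together the rows cover every
# (sentence, synthesiser) combination exactly once, while no
# individual subject hears any sentence more than once.
for subject, row in enumerate(square, start=1):
    plan = ", ".join(f"{sent}->sys{sys}" for sent, sys in zip(sentences, row))
    print(f"subject {subject}: {plan}")
```

In a real test we would also randomise the presentation order within each row and assign an equal number of real listeners to each row, so that the aggregated statistics stay balanced.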
The standard way of making the task harder for listeners is to make the sentences less predictable. We tend to pick sentences that are syntactically correct, because otherwise they would sound very unnatural, but semantically unpredictable, because they have conflicting meanings within them. These are very far from actual system usage - that's a reasonable criticism of them - but they take us away from the ceiling effect and help us to differentiate between systems, and that can be very important, especially during system development. A simpler way of doing intelligibility testing, which gets us down to specific individual errors, is to use minimal pairs, and we'll typically embed them in sentences because that's nicer for the listener: "Now we will say cold again", "Now we will say gold again". The words differ in one phonetic feature, and we can tell whether a synthesiser is making that error. This is very, very time-consuming, because we're getting one small data point from a full sentence, so it's actually rarely used these days. Much more common - in fact, fairly standard - is to use semantically unpredictable sentences. These can be generated by algorithm from a template, or a set of template sentences, with slots into which we drop words from some lists; those lists might contain words of controlled frequency, say medium-frequency words, and so on (there's a small sketch of this idea after this paragraph). There are other standards out there for measuring intelligibility, as there are for naturalness and for quality. You need to be a little careful about simply applying these standards, because they were almost exclusively developed for measuring natural speech, and so what they're trying to capture is not the effects of synthetic speech but the effects of a transmission channel and the degradations it has imposed on original natural speech. These don't always automatically translate to the synthetic speech situation. There are other ways than semantically unpredictable sentences to avoid the ceiling effect. We might add noise; that's always going to make the task harder. We might distract the listener with a parallel task that increases their cognitive load. But in those cases we start to worry again that we're measuring something about the listener's ability and not about the synthetic speech. Maybe you can think of some novel ways to avoid the ceiling effect without using rather unnatural, semantically unpredictable sentences. Naturalness is more straightforward, because we can use relatively normal sentences: just pull them from some domain, perhaps the domain of usage of the system, or some other domain. We still might pay a little attention to them and use some designed text, the most common form of which are the sentences sometimes known as the Harvard or IEEE sentences. These have the nice property that, within their lists, there's some phonetic balance.
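Here's that sketch of the template-and-slots idea for generating semantically unpredictable sentences. The templates and word lists below are invented for illustration; real SUS materials use standardised syntactic templates and word lists of controlled frequency.

```python
import random

# Hypothetical templates and word lists: syntactically well-formed frames
# whose slots are filled with words drawn from (ideally frequency-controlled) lists.
templates = [
    "The {adj} {noun} {verb} the {noun2}.",
    "Why does the {noun} {verb} the {adj} {noun2}?",
]
words = {
    "adj":   ["strong", "green", "early", "quiet"],
    "noun":  ["table", "hope", "river", "law"],
    "verb":  ["drinks", "climbs", "paints", "ends"],
    "noun2": ["sky", "letter", "spoon", "market"],
}

def make_sus(rng: random.Random) -> str:
    """Fill one randomly chosen template with randomly chosen slot words."""
    template = rng.choice(templates)
    return template.format(**{slot: rng.choice(choices)
                              for slot, choices in words.items()})

rng = random.Random(0)            # fixed seed, so the test set is reproducible
for _ in range(5):
    print(make_sus(rng))
```

Because the word combinations carry no coherent meaning, listeners cannot guess words from context, which is exactly what pulls the test away from the ceiling effect described above.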
Subjective evaluation
The most commonly used, and the most reliable, method of evaluation is to ask people to listen to synthetic speech and provide a response. Often that is simply a preference, or an opinion score.