Why? When? Which aspects?

What are our goals when evaluating synthetic speech?

00:05 - 00:21 In this module, we're going to talk about evaluation. We'll start right at the beginning. Think about why we even need to evaluate. What we're going to do with the result of that. That will help us think about when it's appropriate to evaluate. Do we do that while we're building a system? Do we wait until we're finished?
00:22 - 00:40 There are many aspects of a system. We might evaluate its internal components, its reliability, or speed. But we are mainly going to focus on the more obvious speech synthesis evaluation, which is to listen to the speech and make some judgements, perhaps about its naturalness and its intelligibility.
00:41 - 01:06 Once we have established those things, we'll move on to thinking about exactly how to do that. How do you evaluate? In that section, we will look at methodologies for evaluation. How we present the stimuli to listeners. Who those listeners might be. What the stimulus should be. And some other aspects of experimental design. And once we've finished evaluation, we need to do something with what we've found. So, what do we do with the outcome of an evaluation?
01:09 - 02:31 Before you look at this module, you should just check that you already know the following things. Of course, you should have a pretty good understanding of the complete process of text-to-speech synthesis by this point. And because you've got that understanding, you should have some ideas about where errors can come from: why synthetic speech is not perfect. For example, in the front end, things might go wrong very early, in text normalisation, so the wrong words are said. Pronunciations could be incorrect: letter-to-sound could go wrong. And other things like prosody could be wrong. Things will also go wrong when we generate the waveform. Let's just think about unit selection in this section. Unit selection errors might be that the units we chose were from inappropriate contexts, so they have the wrong co-articulation, for example. Or they might not have joined very well: we'll get concatenation artefacts. To have a good understanding of why people hear some errors but not other errors, it will be useful to go and revise your acoustic phonetics and make sure you know a little bit about speech perception. Different errors will lead to different perceptual consequences. They'll impact different aspects of the synthetic speech. Some things are obviously going to affect the naturalness of the synthetic speech. It will sound unnatural in some way: unnatural prosody, unnatural signal quality because of concatenation, perhaps.
02:31 - 02:50 Other types of errors might in fact affect intelligibility. An obvious one there is: if text normalisation goes wrong, the synthesiser says the wrong words and the listener understands the wrong thing. But intelligibility could also be impacted by audible joins that disrupt the listener's perception.
02:50 - 03:29 So both naturalness and intelligibility can be affected by things that happen in the front end and in waveform generation. We're going to see that's a general trend in evaluation: we might be able to measure something in the synthetic speech, but the thing that we measure doesn't have a direct connection back to only one property of the system. We have to figure that out for ourselves. And that's one reason it's rather hard to automatically optimise the system based on the results of a listening test. Because you can't feed the results of that listening test back down into the insides of the system and say, "This error was because of letter-to-sound".
03:29 - 04:16 Now, before we just jump in and try to evaluate synthetic speech, let's sit down and think very carefully about that. We're going to be a little bit systematic. We're not just going to make some synthetic speech, play it to listeners and see what they think. We're going to make a plan. We're going to start right back at the beginning, thinking: "Why do we want to evaluate?" "What is it that we want to get out of that evaluation?" Most of the time, we're trying to learn something for ourselves that will help us make our system better. We're trying to guide the future development of the system. When we think we've got a pretty good system, we might then go out and compare it to somebody else's system and hope that we're better. Maybe then we can publish a paper: our technique is an improvement over some baseline. If we're going to sell this thing, we might simply want to say it's good enough to put in front of customers, and that'll be just a simple pass-fail decision.
04:17 - 05:03 Depending on our goals, we might then need to decide when we're going to do that evaluation. If it's to improve our own system, we might do it fairly frequently. It might be during development. That might be of individual components, or the end-to-end system, depending on what we want to learn. Assuming that it's the synthetic speech we're going to evaluate, rather than the performance of some component of the system, we have to think: "What is it about speech that we could actually measure?" Some obvious things will be: Can you understand what was said? Did it sound something like natural speech? Maybe less obviously, and something that will become a lot clearer when we move on to parametric forms of synthesis, is speaker similarity. Because there's no guarantee that the synthetic speech sounds like the person that we want it to sound like.
05:04 - 05:25 Then, having established all of that, we'll get to the meat. We'll think about how to evaluate. We'll think about what we're going to ask our listeners to do. We're going to design a task, and put that into some test design, which includes the materials we should use. We'll think about how to measure the listeners' performance on that task, with those materials.
05:26 - 05:40 We'll also quite briefly mention the idea of objective measures. In other words, measures that don't need a listener: measures that are in fact an algorithm operating on the synthetic speech. For example, comparing it to a natural example and measuring some distance.
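To make that concrete, here is a minimal sketch of one such objective measure, assuming we have a natural reference recording and a synthetic version of the same sentence. The file names are hypothetical, librosa and numpy are assumed to be available, and MFCCs stand in for true mel-cepstra; a real evaluation would also align frames with dynamic time warping rather than simply truncating.

import numpy as np
import librosa

def mel_cepstral_distance(natural_path, synthetic_path, sr=16000, n_mfcc=13):
    """Frame-averaged cepstral distance (in dB) between two utterances."""
    nat, _ = librosa.load(natural_path, sr=sr)
    syn, _ = librosa.load(synthetic_path, sr=sr)
    # MFCCs as a convenient stand-in for mel-cepstral coefficients
    c_nat = librosa.feature.mfcc(y=nat, sr=sr, n_mfcc=n_mfcc)
    c_syn = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc)
    # Crude alignment: truncate both to the shorter utterance
    n = min(c_nat.shape[1], c_syn.shape[1])
    diff = c_nat[1:, :n] - c_syn[1:, :n]          # drop c0 (overall energy)
    # MCD-style scaling: (10 / ln 10) * sqrt(2 * sum of squared differences) per frame
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=0))
    return float(np.mean(per_frame))

# Hypothetical file names: a natural recording and the synthetic version of the same sentence
print(mel_cepstral_distance("natural.wav", "synthetic.wav"))

A smaller distance suggests the synthetic speech is spectrally closer to the natural reference, but, as the video stresses, such a number has no direct connection back to any single component of the system.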
05:41 - 05:54 When you've done all of that, you then need to make a decision about what to do next. For example, how do we improve our system, knowing what we've learned in the listening test? So what do we do with the outcome of a listening test, or some other evaluation?
05:57 - 06:14 So, on to the first of those in a bit more detail, then. Why do we need to evaluate at all? It's not always obvious, because it's something that's just never spelled out. I was never taught this explicitly, and I've never really seen it in a textbook. But why do we evaluate? Well, it's to make decisions.
06:15 - 06:44 Imagine I have a basic system. I've taken the Festival speech synthesis system. I've had four brilliant ideas about how to make it better. I've built four different variations on Festival, each one embodying one of these brilliant ideas. Now I'd like to find out which of them was the best idea. So I'll generate some examples from those systems and make my evaluation. There will be some outcome from that, and the outcome helps me make a decision: which of those ideas was best?
06:44 - 07:20 Let's imagine that the idea in this system was a great idea. So perhaps I'll drop the other ideas. I'll take this idea, and I'll make some further enhancements to it - some more variants on it. So I'll make some new systems in this evolutionary way. I have now got another bunch of systems I would like to evaluate. Again I had four ideas (it's a magic number!), and again I'd like to find out which of those was the best idea. So we'll do another evaluation, and perhaps one of those turns out to be better than the rest. I'll take that idea forward, and the other ideas will stay on the shelf.
07:21 - 07:50 And so research is a little bit like this greedy evolutionary search. We have an idea. We compare it to our other ideas. We pick the best of them. We take that forward and refine it and improve it until we can't make it any better. Evaluation is obviously playing a critical role in this because it's helping us to make the decision that this was the best, and this was the best, and that the route our research should take should be like this, not some other path. So that's why we need to evaluate.
07:50 - 08:24 You could do these evaluations at many possible points in time. If we're building a new front end for a language, or trying to improve the one we've already got, we'll probably do a lot of testing on those isolated components. For example, we might have a module which does some normalisation. We're going to put a better POS tagger in, or the letter-to-sound model might have got better, or we might have improved the dictionary. In those cases, we might be able to do some sort of objective testing against gold-standard data: against things we know, with labels. That's just normal machine learning, with a training set and a test set.
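As an illustration, here is a minimal sketch of that kind of objective component test, assuming a hypothetical letter-to-sound model and a handful of held-out gold-standard dictionary entries; the function passed in and the data shown are invented for this example.

def letter_to_sound_accuracy(test_set, predict_pronunciation):
    """Fraction of held-out words whose predicted pronunciation exactly matches the gold standard."""
    correct = sum(1 for word, gold in test_set.items()
                  if predict_pronunciation(word) == gold)
    return correct / len(test_set)

# Hypothetical held-out dictionary entries: word -> gold phone sequence
test_set = {
    "cat":   ["k", "a", "t"],
    "yacht": ["y", "o", "t"],
}

# A deliberately naive "model" that just spells the word out letter by letter
print(letter_to_sound_accuracy(test_set, lambda word: list(word)))   # 0.0 for this baseline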
08:25 - 09:17 But we also might want to know what the effect of those components is when they work within a complete system. Some components, of course, only make sense in a complete system, such as the thing that generates a waveform. We have to look at the waveforms (i.e., listen to the waveforms) to know how good they are. We're not going to say too much more about evaluating components - that's better covered when we're talking about how these components work. We're really going to focus on complete systems. We're not in industry here, so we're not going to want to know: does it pass or fail? We're really going to be concentrating on comparing between systems. Those might be two of our own systems, like the four variants of my great ideas. They might be compared against a baseline - an established standard - such as plain Festival. Or they might be compared against other people's systems, as in, for example, the Blizzard Challenge evaluation campaigns.
09:18 - 09:44 Now, when we're making those comparisons between variants of systems, we need to be careful to control things, so that when we have a winner, we know the reason was our great idea and not some other change we made. In the Blizzard Challenge, and indeed in all sorts of internal testing, we'll keep certain things under control. For example, we'll keep the same front end, and we'll keep the same database, if we're trying to evaluate different forms of waveform generation.
09:45 - 10:04 So we're already seeing some aspects of experimental design. We need to be sure that the only difference between the systems is the one we're interested in, and everything else is controlled, so that - when we get our outcome - we know the only explanation is the variable of interest and not some other confounding thing, like a change of database.
10:04 - 10:58 Now, there's a whole lot of terminology flying around testing - whether it's software testing, whole-system testing, or whatever it might be - and we're not really going to get bogged down in it. But let's just look at a couple of things that come up in some textbooks. Sometimes we see this term: "glass box". That means we can look inside the system that we're testing. We have complete access to it. We could go inside the code. We could even put changes into the code for the purposes of testing. This kind of testing is normally about things like reliability and speed: to make sure the system doesn't crash, and make sure it runs fast enough to be usable. We're really not going to talk much about that. If we wanted to, we could go and look at the source code. For example, in this piece of source code, we can see who wrote it, so there's a high chance there might be a bug here. We might put in specific unit tests to check that there are no silly bugs: put in test cases with known correct output and compare against them.
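As a small illustration of that last point, here is a minimal sketch of unit tests with known correct output for a hypothetical text normalisation function. The module name, function name and expected expansions are all assumptions made for this example, not part of any particular system.

import unittest
from mytts.frontend import normalise   # hypothetical front-end module

class TestTextNormalisation(unittest.TestCase):
    def test_currency(self):
        # Known correct expansion of a currency expression
        self.assertEqual(normalise("£5"), "five pounds")

    def test_year(self):
        # Known correct expansion of a year
        self.assertEqual(normalise("1984"), "nineteen eighty four")

if __name__ == "__main__":
    unittest.main()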
10:58 - 11:20 Coming back to this terminology, we'll sometimes see this term: "black box". This is now when we're not allowed to look inside and see how things work. We can't change the code for the purpose of testing it. All we can do is put things in and get things out. So, to measure performance, we can maybe get some objective measures against gold-standard data. That would be the sort of thing we might do for a Part-Of-Speech tagger.
11:20 - 11:53 One would hope that making an improvement anywhere in the system would lead to an improvement in the synthetic speech. If only that were the case! Systems like Text-to-Speech synthesisers are complex. They typically have something like a pipeline architecture that propagates errors, and that can lead to very tricky interactions between the components. Those interactions can mean that improving something early in the pipeline actually causes problems later in the pipeline. In other words, making an improvement actually makes the synthetic speech worse!
11:54 - 12:18 Let's take some examples. Imagine we fix a problem in text normalisation so that currencies are now correctly normalised, whereas they used to be incorrectly normalised. However, if we only do that in the run-time system and don't change the underlying database, we will now have a mismatch between the database labels and the content of what we try to say at run time. We might now get worse performance - for example, lower naturalness - because we have more joins.
12:18 - 12:34 Similarly, improving the letter-to-sound module might start producing phoneme sequences which are low frequency in the database, because the database used the old letter-to-sound module, which never produced those sequences. So again, we will get more joins in our units.
12:34 - 12:48 So, in general then, these pipeline architectures, which are the norm in Text-to-Speech, can lead to unfortunate interactions. And in general, of course, in all of software engineering, fixing one bug can easily reveal other bugs that you hadn't noticed until then.
12:48 - 13:05 Nevertheless, we're always going to try and improve systems. We want to improve those components, but we might have to propagate those improvements right through the pipeline. We might have to completely rebuild the system and re-label the database whenever we change, for example, our letter-to-sound module.
13:05 - 13:22 So we know why we need to evaluate: to make decisions, for example, about where to go with our research. We can do those evaluations for components or for whole systems. We've decided when to do it. And now we'll think about what it is about speech that we could evaluate.
13:22 - 13:56 So we're now going to focus just on synthetic speech evaluation, and forget about looking inside the system. What aspects of speech could we possibly quantify? There are lots of descriptors, lots of words we could use to talk about this. It's tempting to use the word "quality". That's a common term in speech coding: for transmission down telephone systems, we talk about the "quality" of the speech. That's a little bit of a vague term for synthetic speech, because there are many different dimensions. So we tend not to talk about quality so much.
13:56 - 14:40 Rather, we use the term naturalness. Naturalness implies some similarity to natural speech from a real human talker. In general, it's assumed that that's what we're aiming for in Text-to-Speech. I would also like that speech to be understandable. There's a load of different terms you could use to talk about that property; the most common one is "intelligibility". That's simply the ability of a listener to recall, or to write down, the words that he or she heard. We might try to evaluate some higher-level things, such as understanding or comprehension, but then we're going to start interacting with things like the memory of the listener. So we're going to start measuring listener properties when really we want to measure properties of our synthetic speech.
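To make that definition of intelligibility concrete, here is a minimal sketch of word error rate: the word-level edit distance between what the synthesiser was asked to say and what a listener typed, divided by the length of the reference. The example sentences are invented for illustration.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between reference words and a listener's transcription, divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of six reference words: WER is about 0.17
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))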
14:41 - 15:02 Occasionally, we might measure speaker similarity. As I've already mentioned, in the parametric synthesis case it's possible to produce synthetic speech that sounds reasonably natural and is highly intelligible, but doesn't sound very much like the person that we recorded. That sometimes matters. It doesn't always matter, so there's no reason to evaluate it unless you care about it.
15:02 - 15:32 And there's a whole lot of other things you might imagine evaluating. Go away and think about what they might be, and then put them into practise yourself: evaluate synthetic speech along a dimension that's not one of the ones on this slide, and see if you can find something new that you could measure. We're not going to consider those other things any further here. They would be straightforward to measure, and you could do those experiments for yourself. You could consider how pruning affects speed, and how that trades off against, for example, naturalness.
15:34 - 15:53 Now, even these simple descriptors are probably not that simple. In particular, it seems to me that naturalness is not a single dimension. We could imagine speech that's segmentally natural - the phonemes are reproduced nicely - but prosodically unnatural. Or vice versa.
15:53 - 16:06 So naturalness might need unpacking, but the convention in the field is to give that as an instruction to the listener: "Please rate the naturalness of the synthetic speech." and assume that they can do that along a 1-dimensional scale.
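In practice, those one-dimensional ratings are usually collected on a 5-point scale and summarised as a Mean Opinion Score (MOS). Here is a minimal sketch of that aggregation; the ratings are invented, and a real listening test would need proper experimental design and usually more careful statistics for an ordinal scale.

import statistics

def mean_opinion_score(ratings):
    """Average of 1-to-5 naturalness ratings, with a rough 95% confidence half-width."""
    mos = statistics.mean(ratings)
    # Approximate interval, assuming the sample mean is roughly normally distributed
    half_width = 1.96 * statistics.stdev(ratings) / (len(ratings) ** 0.5)
    return mos, half_width

ratings = [4, 3, 5, 4, 4, 2, 3, 4]   # one rating per stimulus, from the listening test
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} ± {ci:.2f}")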
16:07 - 17:13 Similarly, intelligibility is usually defined as the ability to transcribe what was said, but there might be more to listening to synthetic speech than just getting the words right. You might like to think about whether somebody really understood the meaning of the sentence. They might have managed to transcribe the words, but if the prosody was all wrong, they might have understood a different meaning. More generally, we might think about how much effort it is to listen to synthetic speech. It seems a reasonable hypothesis that it's hard work listening to synthetic speech compared to natural speech. There are methods out there to try and probe people's effort or attention, or all sorts of other factors: measuring things about their pupils, or sticking electrodes on their scalp to measure things about their brain activity. These are very much research tools. They're not widely used in synthetic speech evaluation. The reason is that it's then very hard to separate out measuring things about the listener from measuring things about the synthetic speech, so there's a confound there. For example, we might end up measuring the listener's working memory, not how good the synthetic speech was.
17:14 - 17:23 So, at the moment, these things are not widely used. It would be nice to think that we could use them to get a deeper understanding of synthetic speech, but that's an open question. So I'll put that to one side.
17:23 - 17:47 For now, let's wrap this section up with a quick recap. We know why we should evaluate: to find stuff out and make decisions. We could do it at various points, and you need to choose. And we've made an initial list of some aspects of the system - specifically of the speech - that we could measure.
17:47 - 18:22 So, from this point on, in the next video we're going to focus on the mainstream. What's normal in the field? What would be expected if you wanted to publish a paper about your speech synthesis research? We're going to evaluate the output from a complete system. We're going to do that for an end-to-end Text-to-Speech system: text in, synthetic speech out. And we're going to measure the two principal things that we can: naturalness and intelligibility. So what we need to do now is find out how.
