Before we get into the method of unit selection speech synthesis, it will be a good idea to establish some key concepts. What motivates the method? Why does it work? And what might its limitations be? Let's go right back to how speech is produced. Speech is, of course, produced in the human vocal tract, and so the signal that we observe - the waveform - is the result of several interacting processes. For example, when the tongue moves up and down, this isn't perfectly synchronized with the opening of the lips, or the opening and closing of the velum, or the bringing together of the vocal folds to start voicing. Each of these articulations is only loosely synchronized with the others and can spill over into subsequent sounds. That's the process known as co-articulation. The end result of co-articulation is that the sound that's produced changes depending on the context in which it's produced: not just the sound that came before, but also the sound that's coming next, as the articulators anticipate where they need to go. That's an articulatory process. But co-articulation effectively happens at many levels, not just at the low acoustic level inside the vocal tract. There are phonological effects, such as assimilation, where entire sounds change category depending on their environment. The prosodic environment also changes the way sounds are produced. For example, their fundamental frequency and their duration are strongly influenced by their position in the phrase. From that description of speech, it sounds like it's going to be almost impossible to divide speech into small atomic units: in other words, units that we don't subdivide any further. Strictly speaking, that's true. What we know about phonetics and linguistics tells us that speech is not a linear string of units like beads on a necklace. The beads on a necklace don't overlap; you can cut them apart, put them back together, and perfectly rebuild your necklace. We can't quite do that with speech. However, we would like to pretend that we CAN do that with speech. Because, if we're willing to do that - if we're willing to pretend - then we can do speech synthesis by concatenating waveform fragments. We know that this could give very high speech quality, because we're essentially playing back perfect natural speech that was pre-recorded. So the potential is for extremely high quality. The only catch - the thing that we have to work around - is that we can't quite cut speech into units and just put them back together again in any order. The solution to that problem is to think in terms of some base unit type, and to accept that those units are context-dependent. In other words, their sound changes depending on their linguistic environment. Therefore, rather than just having these base units, we'll have context-dependent versions, or "flavours", of those units. In fact, we'll have many, many different versions of each unit: one version for every possible different linguistic context in which it could occur. That sounds quite reasonable. If there were a finite set of contexts in which a sound could occur, we could just record a version of the sound for every possible context. Then, for any sentence we needed to synthesize, we would have exactly the right units-in-context available to us.
Unfortunately, if we enumerate all the possible contexts, they'll actually be pretty much infinite, because context (theoretically, at least) spans the entire sentence - possibly beyond. But it's not all bad news. Let's think about what's really important about context. What's important is the effect that it has on the current speech sound. Let's state that clearly. We can describe the linguistic context of a sound: for example, the preceding phoneme + the following phoneme. The number of different contexts seems to be effectively infinite. But what really matters is whether that context has a discernible - in other words, audible - effect on the sound. Does it change the sound? We can hope that there is actually a rather smaller number of effectively-different contexts. That's the idea that unit selection is going to build on. Let's reduce the problem to something a bit simpler and think only about one particular aspect of linguistic context that we definitely know affects the current sound. That is: the identity of the preceding and the following sounds. This is handled relatively easily, because there's definitely a small number of such contexts, and in fact, if we use the diphone as the unit type, we're effectively capturing this left or right context-dependency automatically in the base unit type. I'll just spell that out, so we are completely clear. This is a recording of me saying the word "complete". We could segment it and label it. Let's just take that first syllable (in red). Let's think about the [@] sound in "complete". If we're willing to assume that the effects of the following nasal - the [m] sound - only affect the latter half of the [@], we could form diphone units by cutting the phones in half and taking the diphone as our base unit type. That's the diphone that we already know about. If there are N phoneme categories, there are about N^2 diphone types. So there's an explosion in the number of types - from N to N^2 - just by taking one aspect of context into account. While diphone synthesis has solved the local co-articulation problem by hardwiring it into the unit type - in other words, by considering left or right phonetic context - it doesn't solve any of the other problems of variation according to linguistic context. Our single recorded example of each diphone still needs to be manipulated with some fairly extensive signal processing. The most obvious example of that would be prosody, where we would need to modify F0 and duration to impose some predicted prosody. Now, we have signal processing techniques that can do that reasonably well, so that seems OK. What's less obvious are the more subtle variations: for example, of the spectral envelope, voicing quality, or other things that correlate with prosody. It's not obvious what to modify there. Our techniques for modifying the spectral envelope are not as good as the ones for modifying F0 and duration. So there's a question mark there. Let's summarize the key problems with diphone synthesis. The most obvious - the one we hear when we listen to diphone synthesis - is the signal processing. It introduces artifacts and degrades the signal. That's a problem. However, there's a deeper, more fundamental problem that's harder to solve: the things that we're imposing with signal processing have had to be predicted from text, and our predictions of them aren't perfect. So, even if we had perfect signal processing - even if it didn't introduce any artifacts or degradation - we still wouldn't know exactly what to do with it.
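To make the "about N^2 diphone types" claim concrete, here's a minimal sketch (my illustration, not part of the lecture) that enumerates diphone types from a toy phone set; the phone labels are hypothetical, and a real inventory would have roughly 40-50 phonemes.

```python
from itertools import product

# Hypothetical toy phone set, just for illustration; a real phoneme
# inventory for English has roughly 40-50 categories.
phones = ["p", "t", "k", "m", "n", "@", "i:", "sil"]

# Every ordered pair of phones is a possible diphone type: about N^2 of them.
diphones = [f"{left}-{right}" for left, right in product(phones, repeat=2)]

print(len(phones), "phone types ->", len(diphones), "diphone types")  # 8 -> 64
```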
What should we use the signal processing to modify? We have to make predictions of what the speech should sound like, from the text, and then impose those predictions with signal processing. Unit selection is going to get us out of all of these problems by choosing units that already have the right properties. The way that we'll get them from the database - the way that we'll select them - will be an implicit prediction of their properties from the text. We'll see that later. Let's pursue the idea of diphones for a moment. Although it's too naive, and going this way won't actually work, it will help us understand what we're trying to do. Let's try hardwiring into the unit type ALL of the linguistic context. So, instead of N phoneme types giving us about N^2 diphone types, let's now have a version of each diphone in lexically stressed and unstressed positions. The database size will double. But that's not enough. We need things to vary prosodically, so let's have versions in phrase-final and non-final positions. The database size will double again. We could keep doing that for all of the linguistic context factors that we think are important. The number of types is going to grow exponentially with the number of factors. That's not going to work, but it's a reasonable way to start understanding unit selection. In unit selection, we wish we could record this almost infinite database of all possible variation. Then, at synthesis time, we'd always have the right unit available for any sentence we wanted to say. In practice, though, we can only record a fixed-size database. It might be very large, but it won't be infinite. That database can only capture some small fraction of the possible combinations of linguistic factors. For example: stress, phrase-finality, phonetic environment, and so on. From this finite database we have to "make do". We have to do the best we can, choosing units that we think will sound as similar as possible to the unit that we wish we had, if only we had that infinite database. What makes that possible? What makes unit selection synthesis feasible? The answer is that some (hopefully very many) linguistic contexts lead to about the same speech sound. In other words, some of the linguistic environment has a very weak or negligible effect on the current sound. More generally, certain combinations of linguistic features in the environment all lead to about the same speech sound. What that means, then, is that instead of having to record and store a version of every speech sound in every possible linguistic context, we can instead have a sufficient variety that captures the acoustic variation. We will always find, from amongst those, one that's "good enough" at synthesis time. When we record this database (we're going to say a lot more about that later on in the course), what we want is variety. We want the effects of context, because we don't want to have to impose them with signal processing. We want the same speech sound many, many times, in many different linguistic contexts, and sounding different in each of those contexts. The key concepts are these: record a database of natural speech - probably somebody reading out sentences - that contains the natural variation we want to hear, caused by linguistic context; then, at synthesis time, search for the most appropriate sequence of units - in other words, the one that we predict will sound best when concatenated.
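To see why hard-wiring every context factor into the unit type can't work, here's a small sketch (my own illustrative numbers, not from the lecture): each binary factor doubles the inventory, so with k factors on top of the ~N^2 diphone types we would need about N^2 * 2^k distinct unit types, each of which must be recorded at least once.

```python
# Sketch of the combinatorial explosion: every binary context factor we
# hard-wire into the unit type doubles the number of types we must record.
n_phones = 45                      # hypothetical phoneme inventory size
n_types = n_phones ** 2            # ~N^2 base diphone types

# Hypothetical binary context factors we might try to hard-wire in.
factors = ["lexically stressed", "phrase-final", "pitch-accented", "word-initial"]

for factor in factors:
    n_types *= 2                   # inventory doubles with each extra factor
    print(f"after adding '{factor}': {n_types} unit types")

# With k factors: N^2 * 2^k types - exponential growth in k.
```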
We're going to put aside the question of exactly what's in the database until a bit later, because we don't quite know yet what we want. Neither do we quite know what the best unit size would be. Lots of sizes are possible: the diphone is the most obvious one. Everything that we're going to say is general, though, whether we're talking about diphones or half-phones, or even whole phones. All of the theory that we're going to talk about applies to all of these different unit types. The principles will be the same, so we don't need to decide that at this point. Let's wrap this part up with a little orientation, putting ourselves in the bigger picture to see how far we've got and what's coming up. Until now, all we knew about was diphone speech synthesis, with one recorded copy of each type of unit. We'd already decided that whole phones (recordings of phonemes) were not appropriate because of co-articulation. So we adopted a first-order solution to that, which was just to capture the co-articulation between adjacent phones in the speech signal. That only captures very local phonetic variation. Everything else - for example, prosody - had to be imposed using fairly extensive signal manipulation. What we're going to do now is deliberately record the variation that we want, so that at synthesis time we can produce lots of variation: all the speech sounds in lots of different contexts. The best way to do that is going to be to record naturally-occurring speech: people reading out sentences - natural utterances. Synthesis is now going to involve very carefully choosing, from that large database, the sounds that are closest to the ones we want (see the sketch below). It's worth stating the terminology, because there is potential for confusion here. When we say diphone speech synthesis, we mean one copy of each unit type. When we say unit selection speech synthesis, we mean many copies of each unit type - in fact, as many as possible, in as many different variations as possible. Now, the actual base unit type in unit selection could well be the diphone. That's the most common choice. So there's a potential for confusion there.
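As a rough preview of what "carefully choosing" might look like, here is a toy sketch (my own, heavily simplified; the actual method, including how well consecutive units join and a search over whole unit sequences, comes later in the course). For a single target diphone, it picks the database candidate whose linguistic context best matches the context we want; the feature names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    diphone: str                    # e.g. "@-m"
    context: dict = field(default_factory=dict)

def context_mismatch(target: Unit, candidate: Unit) -> float:
    """Count the context features where the candidate differs from the target."""
    return sum(
        1.0
        for feature, wanted in target.context.items()
        if candidate.context.get(feature) != wanted
    )

# Tiny hypothetical database: several recorded copies of the same diphone type,
# each from a different linguistic context.
database = [
    Unit("@-m", {"stressed": False, "phrase_final": False}),
    Unit("@-m", {"stressed": False, "phrase_final": True}),
    Unit("@-m", {"stressed": True,  "phrase_final": False}),
]

# The unit we wish we had, described by the context predicted from the text.
target = Unit("@-m", {"stressed": False, "phrase_final": True})

candidates = [u for u in database if u.diphone == target.diphone]
best = min(candidates, key=lambda u: context_mismatch(target, u))
print("selected candidate context:", best.context)
```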
Key concepts
Linguistic context affects the acoustic realisation of speech sounds. But several different linguistic contexts can lead to almost the same sound. Unit selection takes advantage of this "interchangeability".