Module status: ready
This module covers unit selection speech synthesis and provides a complete picture of the method. Initially, we will not go into much depth regarding the database of pre-recorded speech, because it’s not yet clear what it should contain. Also, this module contains only a high-level description of the target cost; there is more detail about the target cost in the next module.
Let’s test your background knowledge. You should be able to answer these questions without looking back over the Speech Processing material.
Download the slides for the module 2 videos
Total video to watch in this module: 60 minutes
Whilst the video is playing, click on a line in the transcript to play the video from that point. In this module we're going to cover unit selection. We're going to look at the entire system. The key concepts that are involved. The idea of selecting from a large database of relatively natural recorded speech. Of using some cost functions to choose amongst the many possible sequences of units available in that database. We're going to look at a thing called the target cost function and the join cost function that capture these costs. We're not going to look in complete detail at the target cost function because that will come a little bit later, but we'll finish off by seeing how a search is conducted efficiently to find the optimal unit sequence. Before starting on that, you need to make sure you already know the following things: The first area of importance is a little bit of phonetics. What do we need to know about phonetics? We need to know what a phoneme is: a category of sound and a phone is a realization of that - someone speaking that sound. For the consonants, we need to know that we can describe them using these distinctive features. The first feature is place and that's where in the mouth the sound is made and in particular with a consonant where some sort of constriction is formed and in the IPA chart those are arranged along the horizontal axis of this table. So going from the lips right back to the vocal folds, this is a dimension called place. At each place is possibly used several different manners of making the sound and those are arranged on the vertical axis, not in any particular order. For example we could make a complete closure, and let sound pressure release in an explosive fashion. That's a plosive sound. And finally, for any combination of place and manner, we can also vary the activity of the vocal folds. That's voicing. 
And so it's possible to make a further contrast, for example between /p/ and /b/ by vibrating the vocal folds in the case of /b/. We can describe vowels in a similar way. We can talk about the place in the mouth that the tongue is positioned. For example: is it near the roof of the mouth? Or is it down near the bottom? So, how open or closed is the vocal tract by the tongue? And we can talk about where that positioning takes place, whether it's near the front of the mouth or the back of the mouth. And these two dimensions, sometimes called height and front-back, characterize the vowel sounds. Here too, there's the third dimension. We can make an additional contrast, by slightly extending the length of the vocal tract, by protruding the lips. And that's called rounding. So again, there's a set of distinctive features that we can use to describe the vowel sounds: height, front-back, and rounding. And these will become useful later on, in unit selection where we have to decide how similar two sounds are. The next thing you need to know about is the source filter model. This was covered in the course Speech Processing. In a source filter model, there's a source of sound: that's either the vibration of the vocal folds, or some form of sound in the vocal tract such as release of a closure, or frication. It's possible of course for both of those things to happen at the same time. We might make voiced fricatives, like [v] or [z]. The source of sound goes through a filter: that's the vocal tract. That filter imposes a vocal tract frequency response. One way of describing that frequency response is in terms of the peaks (the resonant frequencies). They're called formants. Another way is just to talk more generally about the overall spectral envelope. So you need to know something about the source filter model. You should also already know about the front end, what's sometimes called the text processor. 
The front end takes the text input and adds lots and lots of information to that based on either rules, or statistical models, or sources of knowledge such as the pronunciation dictionary. This picture here captures a lot of the work that's done in the front end. There's the input text and the front end adds things such as: part of speech tags, possibly some structural information, possibly some prosodic information, and always some pronunciation information. As we progress we're going to see that we might find it convenient to attach all of that information to the segment - to the phoneme. So we might end up with structures that look rather flatter, that are essentially context-dependent phonemes. That will become clearer as we go through unit selection. But what you need to know at this stage is where all of that information comes from: where the linguistic specification is made, and that's in the front end. Now, we've already also covered a more basic form of speech synthesis, that's called diphone speech synthesis, in which we record one of each type of speech sound, and we use a very special speech sound: that's the second half of one phone and the first half of a consecutive phone said in connected speech and we perform quite a lot of signal processing at synthesis time to impose, for example, the prosody that we require, on that sequence of waveforms. And the final thing that you need to know about, that we will have covered already in automatic speech recognition for example, is dynamic programming. In unit selection, we're going to have to make a choice between many possible sequences of waveform fragments. In other words, we're going to search amongst many possibilities and the search space could be extremely large and so we need an efficient way of doing that search and we're going to formulate the problem in a way that allows us to use dynamic programming to make that search very efficient.
Before getting into all the details, let's just play with some unit selection. Here's an interactive example. You can find it on the website speech.zone. We're going to try and synthesize my name and so I've found the appropriate diphones from a database. I've used one of the Arctic databases for this, and I've just pulled out a few candidates for each target position. So, along the bottom there we have the target diphone sequence, and above it we have the candidates. Each of these candidates is just a little waveform fragment. So, we can listen to those and, to say my name, we need to pick one candidate from each column. So, for example ... This is interactive, so if we select ... those will synthesize the waveform. This one, this one, all of those ones. There are many other sequences we can try. Have a go for yourself. Try another sequence. I've labeled each of the candidates with the utterance that it comes from, and I've also made sure that any candidates coming from the same utterance were contiguous in that utterance. For example, these all came contiguously. And so on. See if you can find the best sequence out of all the possible sequences. Now, in a real system we can't do this interactively with listeners. We have to automate it. We need criteria for choosing between the different candidates for each target position. We also need to decide how well they might concatenate. We're going to see that knowing if units were contiguous in the original database could be very helpful. Because, of course we expect those to join perfectly. In fact, we won't even cut them up. We'll take them as a larger unit. So, I'd like you to go and play with this interactive example and then continue watching the videos.
Here’s a link to the interactive demo.
Before we get into the method of unit selection speech synthesis, it will be a good idea to get some key concepts established. What motivates the method? Why does it work? And perhaps what its limitations might be. Let's go right back to how speech is produced. Obviously speech is produced in the human vocal tract. Therefore the signal that we observe - the waveform - is a result of several interacting processes. For example, when the tongue moves up and down this isn't perfectly synchronized with the opening of the lips, or the opening / closing of the velum, or the bringing together of the vocal folds to start voicing. Each of these articulations is loosely synchronized and can spill over into subsequent sounds. That's the process known as co-articulation. The end result of co-articulation is that the sound that's produced is changed depending on the context in which it's produced; not just the sound that came before but also the sound that's coming next, as the articulators anticipate where they need to go next. That's an articulatory process. But co-articulation happens effectively at many levels, not just at the low acoustic level inside the vocal tract. There are phonological effects such as assimilation where entire sounds change category depending on their environment. The prosodic environment changes the way sounds are produced. For example, their fundamental frequency and their duration are strongly influenced by their position in the phrase. From that description of speech, it sounds like it's almost going to be impossible to divide speech into small atomic units: in other words, units that we don't subdivide any further. Strictly speaking, that's true. What we know about phonetics and linguistics tells us that speech is not a linear string of units like beads on a necklace. 
The beads on the necklace don't overlap; you can cut them apart you can put them back together and perfectly rebuild your necklace. We can't quite do that with speech. However, we would like to pretend that we CAN do that with speech. Because, if we're willing to do that - if we're willing to pretend - then we can do speech synthesis by concatenating waveform fragments. We know that this could give very high speech quality because we're essentially playing back perfect natural speech that was pre-recorded. So the potential is for extremely high quality. The only catch - the thing that we have to work around - is that we can't quite cut speech into units and just put them back together again in any order. The solution to that problem is to think about some base unit type and that those units are context-dependent. In other words, their sound changes depending on their linguistic environment. Therefore, rather than just having these base units, we'll have context-dependent versions or "flavours" of those units. In fact, we'll have many, many different versions of each unit: one version for every possible different linguistic context in which it could occur. That sounds quite reasonable. If there was a finite set of contexts in which a sound could occur, we could just record a version of the sound for every possible context. Then, for any sentence we needed to synthesize, we would have exactly the right units-in-context available to us. Unfortunately, if we enumerate all the possible contexts, they'll actually be pretty much infinite because context (theoretically at least) spans the entire sentence - possibly beyond. But it's not all bad news. Let's think about what's really important about context. What's important is the effect that it has on the current speech sound. Let's state that clearly. We can describe the linguistic context of a sound: for example, the preceding phoneme + the following phoneme. The number of different contexts seems to be about infinite. 
But what really matters is whether that context has a discernible - in other words audible - effect on the sound. Does it change the sound? We can hope that there is actually a rather smaller number of effectively-different contexts. That's the idea that unit selection is going to build on. Let's reduce the problem to something a bit simpler and think only about one particular aspect of linguistic context that we definitely know affects the current sound. That is: the identity of the preceding and the following sounds. This is handled relatively easily because there's definitely a small number of such contexts and in fact if we use the diphone as the unit type we're effectively capturing this left or right context-dependency automatically in the base unit type. I'll just spell that out, so we are completely clear. This is a recording of me saying the word "complete". We could segment it and label it. Let's just take that first syllable (in red). Let's think about the [@] sound in "complete". If we're willing to assume that the effects of the following nasal - the [m] sound - only affect the latter half of the [@], we could form diphone units by cutting in half and taking this as our base unit type. That's the diphone that we already know about. If there are N phoneme categories, there are about N^2 diphone types. So there's an explosion (a quadratic increase) in the number of types, just by taking one aspect of context into account. While diphone synthesis has solved the local co-articulation problem by hardwiring that into the unit type, in other words considering left or right phonetic context, it doesn't solve any of the other problems of variation according to linguistic context. Our single recorded example of each diphone still needs to be manipulated with some fairly extensive signal processing. The most obvious example of that would be prosody where we would need to modify F0 and duration to impose some predicted prosody. 
Now, we have signal processing techniques that can do that reasonably well, so that seems OK. What's less obvious are the more subtle variations, for example of the spectral envelope, voicing quality, or things that do correlate with prosody. It's not obvious what to modify there. Our techniques for modifying the spectral envelope are not as good as the ones for modifying F0 and duration. So there's a question mark there. Let's summarize the key problems with diphone synthesis. The most obvious - the one that we hear when we listen to diphone synthesis - is the signal processing. It introduces artifacts and degrades the signal. That's a problem. However, there's a deeper more fundamental problem that's harder to solve: the things that we're imposing with signal-processing have had to be predicted from text, and our predictions of them aren't perfect. So, even if we had perfect signal processing - even if it didn't introduce any artifacts or degradation - we still wouldn't know exactly what to do with it. What should we use the signal processing to modify? We have to make predictions of what the speech should sound like, from the text, and then impose those predictions with signal processing. Unit selection is going to get us out of all of these problems by choosing units that already have the right properties. The way that we'll get them from the database - the way that we'll select them - will be an implicit prediction of their properties from text. We'll see that later. Let's pursue the idea of diphones for a moment. Although it's too naive, and going this way won't actually work, it will help us understand what we're trying to do. Let's try hardwiring into the unit type ALL of the linguistic context. So instead of N phoneme types giving us about N^2 diphone types, let's now have a version of each diphone in lexically stressed and unstressed positions. The database size will double. But that's not enough. 
We need things to vary prosodically, so let's have things in phrase-final and non-final positions. The database size will double again. We could keep doing that for all of the linguistic context factors that we think are important. The number of types is going to grow exponentially with the number of factors. That's not going to work, but it's a reasonable way to start understanding unit selection. In unit selection, we wish we could record this almost infinite database of all possible variation. Then, at synthesis time, we'd always have the right unit available for any sentence we wanted to say. In practice though, we can only record a fixed size database. It might be very large but it won't be infinite. That database can only capture some small fraction of the possible combinations of linguistic factors. For example: stress, phrase-finality, phonetic environment,... and so on. From this finite database we have to "make do". We have to do the best we can, choosing units that we think will sound as similar as possible to the unit that we wish we had, if only we had that infinite database. What makes that possible? What makes unit selection synthesis feasible? The answer is that some (hopefully very many) linguistic contexts lead to about the same speech sound. In other words, some of the linguistic environment has a very weak or negligible effect on the current sound. More generally, certain combinations of linguistic features in the environment all lead to about the same speech sound. What that means then is that, instead of having to record and store a version of every speech sound in every possible linguistic context, we can instead have a sufficient variety that captures the acoustic variation. We will always find, from amongst those, one that's "good enough" at synthesis time. When we record this database (we're going to say a lot more about that later on in the course) what we want is variety. 
We want the effects of context because we don't want to have to impose them with signal processing. We want the same speech sound many, many times in many different linguistic contexts, and sounding different in each of those contexts. The key concepts are: to record a database of natural speech - probably somebody reading out sentences - that contains the natural variation we want to hear, that has been caused by linguistic context; at synthesis time, we're going to search for the most appropriate sequence of units, in other words the one that we predict will sound the best when concatenated. We're going to put aside the question of exactly what's in the database until a bit later, because we don't quite know yet what we want. Neither do we quite know what the best unit size would be. Lots of sizes are possible: the diphone is the most obvious one. Everything that we're going to say is general though, whether we're talking about diphones or half-phones, or even whole phones. All of the theory that we're going to talk about is going to apply to all of these different unit types. The principles will be the same, so we don't need to decide that at this point. Let's wrap this part up with a little orientation, putting ourselves in the bigger picture to see how far we've got, and what's coming up. Until now, all we knew about was diphone speech synthesis, with one recorded copy of each type of unit. We'd already decided that whole phones (recordings of phonemes) were not appropriate because of co-articulation. So we made a first order solution to that which was just to capture the co-articulation between adjacent phones in the speech signal. That only captures very local phonetic variation. Everything else - for example prosody - had to be imposed using fairly extensive signal manipulation. 
What we're going to do now is we're going to deliberately record the variation that we want, to be able to produce at synthesis time lots of variation: all the speech sounds in lots of different contexts. The best way to do that is going to be to record naturally-occurring speech: people reading out sentences - natural utterances. Synthesis is now going to involve very carefully choosing from that large database the sounds that are closest to the ones we want. It's worth stating the terminology, because there is a potential for confusion here. When we say diphone speech synthesis, we mean one copy of each type. When we say unit selection speech synthesis, we mean many copies of each unit type, in fact as many as possible in as many different variations as possible. Now, the actual base unit type in unit selection could well be the diphone. That's the most common choice. So there's a potential for confusion there.
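To make the counting argument from this video concrete, here is a small sketch of how the number of unit types grows when linguistic context is hardwired into the unit type. The inventory size of 45 phonemes and the function names are illustrative assumptions, not figures from the course; the only facts taken from the transcript are "about N^2 diphone types" and "the database size will double" for each two-valued factor such as stress or phrase-finality.

```python
from __future__ import annotations


def diphone_types(num_phonemes: int) -> int:
    """Upper bound on diphone types: every ordered pair of phonemes."""
    return num_phonemes ** 2


def context_dependent_types(num_phonemes: int, factor_sizes: list[int]) -> int:
    """Unit types when each context factor is hardwired into the type.

    factor_sizes holds the number of values each factor can take,
    e.g. 2 for stressed/unstressed, 2 for phrase-final/non-final.
    """
    total = diphone_types(num_phonemes)
    for size in factor_sizes:
        total *= size  # each factor multiplies the inventory
    return total


N = 45  # assumed phoneme inventory size, for illustration only

print("diphones only:        ", diphone_types(N))
print("+ stress:             ", context_dependent_types(N, [2]))
print("+ stress, phrase-final:", context_dependent_types(N, [2, 2]))
print("+ 10 binary factors:  ", context_dependent_types(N, [2] * 10))
```

With ten two-valued factors the inventory is already in the millions, which is why recording one copy of every type in every context is not feasible.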
Now that we have those key concepts established, let's work our way up to a complete description of unit selection. The first thing we're going to construct is a target unit sequence. That's a sequence of the ideal units that we would like to be able to find in the database. These target units are going to be abstract linguistic things. They don't have waveforms. When we talk about the search a bit later on, we'll find that we want the information about the target, about the candidates from the database, and also about the way that they join, to all be stored locally. That's going to reduce the complexity of the search problem dramatically. So, let's do that straight away. What the front-end gives us is this structured linguistic representation. It has connections between the different tiers. It has structure. For example, it might have some tree shapes, or some bracketing like this structure. We're going to attach all the information to the phoneme tier. We're going to produce a flat representation, where all of that higher-level structure - such as part of speech - is attached down on to the pronunciation. What we've got effectively, is a string of segments (that's just a fancy word for phones) with contextual information attached. So: context-dependent phones. One of them might be like that, and it's part of a sequence. In this example, my base unit type is the phoneme. A real system probably wouldn't do that. I'm just using it to make the diagrams simpler to draw and simpler for you to understand. I'm going to move that to the top of the page, because we need some room to put the candidates on this diagram later. 
What we have in this target unit sequence is a linear string of base unit types, and each of those is annotated with all of its linguistic context: everything that we think might be important for affecting the way that this particular phoneme is pronounced, in this particular environment. It's essential to understand that this information here is stored inside this target unit specification. We do not need to refer to its context, to read off that linguistic specification. It's local. I'm just going to repeat again that the base unit type in this diagram is the phoneme - just for simplicity. We could build a system like that, though we wouldn't expect it to work that well. In reality we'll probably use diphones. That is going to make the diagram look a bit messy. So we'll just pretend that the whole phone is the acoustic unit: so the base unit type is the phoneme. There's our target unit sequence, and what we'd like to do now is go and find candidate waveform fragments to render that utterance. We're going to get those candidate units from a pre-recorded database. The full details of the database are not yet clear to us. That's for a good reason: we don't know exactly what we need to put in that database yet, because that all depends on how we're going to select units from it. For every target unit position - such as this one - we're going to go to the database and retrieve all the candidates that match the base unit type. So we'll pull all of the waveform fragments out that have been labelled, in this case, with the phoneme /@/. Here's one candidate for the first one we found, and remember that the candidates are waveform fragments. They're also going to be annotated with the same linguistic specification as the target. That waveform's what we're going to concatenate eventually to produce speech. In general (if we've designed our database well), we'll have multiple candidates for each target position. 
So we can go off and fetch more from the database: we'll get all of them in fact. I've only got a tiny database in this toy example, so I just found 5. In general, in a big system, we might find hundreds or thousands for some of the more common unit types. We're going to repeat that for all of the target positions. Let's do the next one. Now, it seems a bit odd to treat silence as a recorded unit here, but remember in the case of diphones we're just going to treat silence as if it was another segment type - another phoneme. So we can have silence-to-speech and speech-to-silence diphones, just like any other diphone. We'll go off now and get candidates for all of the other target positions. At this stage, I haven't applied any selection criteria at all, except that we're insisting on an exact match between the base unit type of each target and the candidates that are available to synthesize that part of the utterance. That implies that our database minimally needs to contain at least one recording of every base unit type. That should be pretty easy to design into any reasonable size database. Right, where are we at this point? Let's orient ourselves again into the bigger picture. We have run the front end, and got a linguistic specification of the complete utterance. We've attached the linguistic specification down on to the segment - on to the pronunciation level. That's given us a linear string: a sequence of target units. Those target units are each annotated with linguistic features. They do not yet have waveforms. We're going to find - for each target - a single candidate from the database. So far, all we've managed to do is retrieve all possible candidates from the database, and we've just matched on the base unit type. So what remains to be done is to choose amongst the multiple candidates for each target position, so that we end up with a sequence of candidates that we can concatenate to produce output speech. 
We're going to need some principle on which to select from all the different possible sequences of candidates. Of course, what we want is the one that sounds the best. We're going to have to formalize and define what we mean by "best sounding", quantify that, and then come up with an algorithm to find the best sequence. It's important to remember at all times that the linguistic features are locally attached to each target and each candidate unit. Specifically, we don't need to look at the neighbours. That's going to be particularly important for the candidates, because for different sequences of candidate units their neighbours might change. That will not change their linguistic features. The linguistic features are determined by the source utterance in the recorded database where that candidate came from. The next steps are to come up with some function that quantifies "best sounding", and then to search for the sequence of candidates that optimizes that function.
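The steps described in this video (a flat target sequence with locally stored linguistic features, then retrieval of every candidate whose base unit type matches) can be sketched as a small data structure. This is a minimal sketch under stated assumptions: the class names, the feature names (`stress`, `phrase_final`), the utterance identifiers and the toy database are all hypothetical, and the base unit type is the phoneme purely to mirror the simplified diagrams.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TargetUnit:
    base_type: str   # e.g. a phoneme label; a real system might use diphones
    features: dict   # linguistic context, stored locally on the unit


@dataclass
class Candidate:
    base_type: str
    features: dict   # same kind of specification as the target
    utterance_id: str  # source utterance in the recorded database
    waveform: list   # placeholder for the waveform fragment


def build_index(database):
    """Group database units by base unit type for fast retrieval."""
    index = defaultdict(list)
    for cand in database:
        index[cand.base_type].append(cand)
    return index


def retrieve_candidates(targets, index):
    """Return one list of candidates per target position.

    The only criterion at this stage is an exact match on base unit type.
    """
    lattice = []
    for target in targets:
        cands = index.get(target.base_type, [])
        if not cands:
            raise KeyError(f"no recording of unit type {target.base_type!r}")
        lattice.append(cands)
    return lattice


# Hypothetical toy database and target sequence.
database = [
    Candidate("@", {"stress": 0, "phrase_final": False}, "utt_a", []),
    Candidate("@", {"stress": 0, "phrase_final": True}, "utt_b", []),
    Candidate("m", {"stress": 0, "phrase_final": False}, "utt_a", []),
]
targets = [
    TargetUnit("@", {"stress": 0, "phrase_final": False}),
    TargetUnit("m", {"stress": 0, "phrase_final": False}),
]

lattice = retrieve_candidates(targets, build_index(database))
print([len(column) for column in lattice])
```

Because each unit carries its own features, nothing here needs to look at neighbouring units, which is the locality property the transcript emphasizes.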
We've retrieved from the database a number of possible candidate waveform fragments to use in each target position. The task now is to choose amongst them. There are many, many possible sequences of candidates, even for this very small example here. Let's just pick one of them for illustration... and you can imagine how many more there are. We want to measure how well each of those will sound. We want to quantify it: put a number on it. Then we're going to pick the one that we predict will sound the best. So, what do we need to take into account when selecting from amongst those many, many possible candidate sequences? Perhaps the most obvious one is that, when we're choosing a candidate - let's say for this position - from these available candidates, we could consider the linguistic context of the target: in other words, its linguistic environment in this target sentence. We could consider the linguistic environment of each individual candidate, and measure how close they are. We're going to look at the similarity between a candidate and a target in terms of their linguistic contexts. The motivation for that is pretty obvious. If we could find candidates from identical linguistic contexts to those in the target unit sequence, we'd effectively be pulling out the entire target sentence from the database. Now, that's not possible in general, because there's an infinite number of sentences that our system will have to synthesize. So we're not (in general) going to find exactly-matched candidate units, measured in terms of their linguistic context. We're going to have to use candidate units from mismatched non-identical linguistic contexts. So we need a function to measure this mismatch. We need to quantify it. We're going to do that with a function. The function is going to return a cost (we might call that a distance). The function is called the target cost function. 
A target cost of zero means that the linguistic context - measured using whatever features are available to us - was identical between target and candidate. That's rarely (if ever) going to be the case, so we're going to try and look for ones that have low target costs. The way I've just described that is in terms of linguistic features: effectively counting how many linguistic features (for example left phonetic context, or syllable stress, or position in phrase) match and how many mis-match. The number of mismatches will lead us to a cost. Taylor, in his book, proposes two possible formulations of the target cost function. One of them is what we've just described. It basically counts up the number of mismatched linguistic features. He calls that the "Independent Feature Formulation" because a mismatch in one feature and a mismatch in another feature both count independently towards the total cost. The function won't do anything clever about particular combinations of mismatch. Another way to think about measuring the mismatch between a candidate and a target is in terms of their acoustic features, but we can't do that directly because the targets don't have any acoustic features. We're trying to synthesize them. They're just abstract linguistic specifications at this point. So, if we wanted to measure the difference between a target and a candidate acoustically (which really is what we want to do: we want to know if they're going to sound the same or not) we would have to make a prediction about the acoustic properties of the target units. The target cost is a very important part of unit selection, so we're going to devote a later part of the course to that, and not go into the details just at this moment. All we need at this point is to know that we can have a function that measures how close a candidate is to the target. That closeness could be measured in terms of whether they have similar linguistic environments, or whether they "sound the same". 
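As a rough illustration of the mismatch-counting formulation described above, here is a minimal sketch of a target cost. The transcript only says that mismatched features are counted; the per-feature weights, the feature names and the weight values below are assumptions added for illustration, and setting every weight to 1 recovers a plain count.

```python
def target_cost(target_features, candidate_features, weights):
    """Sum the weights of the linguistic features that do not match.

    Each feature contributes independently: the function does nothing
    clever about particular combinations of mismatches.
    """
    cost = 0.0
    for name, weight in weights.items():
        if target_features.get(name) != candidate_features.get(name):
            cost += weight
    return cost


# Hypothetical features and weights, not taken from any real system.
weights = {"left_phone": 1.0, "right_phone": 1.0, "stress": 0.5, "phrase_final": 0.5}

target = {"left_phone": "k", "right_phone": "m", "stress": 0, "phrase_final": False}
exact = dict(target)
mismatched = {"left_phone": "k", "right_phone": "n", "stress": 1, "phrase_final": False}

print(target_cost(target, exact, weights))       # identical context
print(target_cost(target, mismatched, weights))  # right_phone and stress differ
```

A cost of zero only arises when every listed feature matches, which corresponds to the identical-context case in the transcript.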
That measure of "sounding the same" involves an extra step of making some prediction of the acoustic properties of those target units. Measuring similarity between an individual candidate and its target position is only part of the story. What are we going to do with those candidates after we've selected them? We're going to concatenate their waveforms, and play that back, and hope a listener doesn't notice that we've made a new utterance by concatenating fragments of other utterances. The most perceptible artefact we get in unit selection synthesis is sometimes those concatenation points, or "joins". Therefore, we're going to have to quantify how good each join is, and take that into account when choosing the sequence of candidates. So the second part of quantifying the best-sounding candidate sequence is to measure this concatenation quality. Let's focus on this target position, and let's imagine we've decided that this candidate has got the lowest overall target cost. It's tempting just to choose that - because we'll make an instant local decision - and then repeat that for each target position, choosing its candidate with the lowest target cost. However, that fails to take into account whether this candidate will concatenate well with the candidates either side. So, before choosing this particular candidate, we need to quantify how well it will concatenate with each of the things it needs to join to. The same will be true to the left as well. We can see now that the choice of candidate in this position depends (i.e., it's going to change, potentially) on the choice of candidate in the neighbouring positions. So, in general, then we're going to have to measure the join cost - the potential quality of the concatenation - between every possible pair of units... and so on for all the other positions. So we have to compute all of these costs and they have to be taken into account when deciding which overall sequence of candidates is best. 
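The point that a purely local choice can be wrong is easy to see on a toy example. The sketch below compares picking the lowest target cost in each position with scoring whole sequences by target costs plus join costs. Everything in it is an assumption for illustration: the candidate names, the cost values, and the stand-in join cost (an F0 difference scaled by 1/100). It enumerates every sequence, which is only workable for tiny examples; the efficient dynamic programming search is a separate topic.

```python
from itertools import product


def join_cost(left, right):
    """Toy stand-in for a join cost: scaled F0 difference at the join."""
    return abs(left["f0"] - right["f0"]) / 100.0


def sequence_cost(seq):
    """Total cost: all target costs plus the join cost of each adjacent pair."""
    total = sum(c["target_cost"] for c in seq)
    total += sum(join_cost(a, b) for a, b in zip(seq, seq[1:]))
    return total


def greedy_sequence(lattice):
    """Instant local decisions: lowest target cost in each position."""
    return [min(column, key=lambda c: c["target_cost"]) for column in lattice]


def best_sequence(lattice):
    """Exhaustive search over every possible candidate sequence."""
    return list(min(product(*lattice), key=sequence_cost))


# Two target positions, two hypothetical candidates each.
lattice = [
    [{"name": "a", "target_cost": 0.1, "f0": 100.0},
     {"name": "b", "target_cost": 0.2, "f0": 130.0}],
    [{"name": "c", "target_cost": 0.1, "f0": 130.0},
     {"name": "d", "target_cost": 0.3, "f0": 100.0}],
]

greedy = greedy_sequence(lattice)
best = best_sequence(lattice)
print("greedy:", [c["name"] for c in greedy])
print("best:  ", [c["name"] for c in best])
```

Here the greedy choice joins candidates with a large F0 jump, while the best overall sequence accepts a slightly worse target cost in the first position to get a smooth join.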
Our join cost function has to make a prediction about how perceptible the join will be. Will a listener notice there's a join? Why would a listener notice there's been a join in some speech? Well, that's because there'll be a mismatch in the acoustic properties around the join. That mismatch - that change in acoustic properties - will be larger than is normal in natural connected speech. For example, sudden discontinuities in F0 don't happen in natural speech. So, if they do happen in synthetic speech they are likely to be heard by listeners. Our join cost function is going to measure the sorts of things that we think listeners can hear. The obvious ones are going to be the pitch (or the physical underlying property: fundamental frequency / F0), the energy - if speech suddenly gets louder or quieter we will notice that, if it's in an unnatural way - and, more generally, the overall spectral characteristics. Underlying all of this there's an assumption. The assumption is that measuring acoustic mismatch is a prediction of the perceived discontinuity that a listener will experience when listening to this speech. If we're going to use multiple acoustic properties in the join cost function, then we have to combine those mismatches in some way. A typical way is the way that Festival's Multisyn unit selection engine works. That's to measure the mismatch in each of those three properties separately and then sum them together. Since some might be more important than others, there'll be some weights. So, it'll be a weighted sum of mismatches. It's also quite common to inject a little bit of phonetic knowledge into the join cost. We know that listeners are much more sensitive to some sorts of discontinuities than others. A simple way of expressing that is to say that they are much more likely to notice a join in some segment types than in other segment types. 
For example, making joins in unvoiced fricatives is fairly straightforward: the spectral envelope doesn't have much detail, and there's no pitch to have a mismatch in. So we can quite easily splice those things together. Whereas perhaps in a more complex sound, such as a liquid or a diphthong, with a complex and changing spectral envelope, it's more difficult to hide the joins in those sounds. So, very commonly, join costs will also include some rules which express phonetic knowledge about where the joins are best placed. Here's a graphical representation of what the join cost is doing. We have a diphone on the left, and a diphone on the right. (Or, in our simple example, just whole phones) We have their waveforms, because these are candidates from the database. Because we have their waveforms, we can extract any acoustic properties that we like. In this example, we've extracted fundamental frequency, energy and the spectral envelope. It's plotted here as a spectrogram. We could parameterize that spectral envelope any way we like. This picture is using formants to make things obvious. More generally, we wouldn't use formants: they're rather hard to track automatically. We'd use a more generalized representation like the cepstrum. We're going to measure the mismatch in each of these properties. For example... the F0 is slightly discontinuous, so that's going to contribute something to the cost. The energy is continuous here, so there's very low mismatch (so, low cost) in the energy. We're similarly going to quantify the difference in the spectral envelope just before the join and just after the join. We're going to sum up those mismatches with some weights that express the relative importance of them, perceptually. That's a really simple join cost. It's going to work perfectly well, although it's a little bit simple. Its main limitation is it's extremely local. 
We just took the last frame (maybe 20ms) of one diphone and the first frame (maybe the first 20ms) of the next diphone (the next candidate that we're considering concatenating) and we're just measuring the very local mismatch between those. That will fail to capture things like sudden changes of direction. Maybe F0 has no discontinuity but in the left diphone it was increasing and in the right diphone it was decreasing. That sudden change from increasing to decreasing will also be unnatural: listeners might notice. So we could improve that: we could put several frames around the join and measure the join cost across multiple frames. We could look at the rate of change (the deltas). Or we could just generalize that much further and build some probabilistic model of what trajectories of natural speech parameters normally look like, compare that model's prediction to the concatenated diphones, and measure how natural they are under this model. Now, eventually we are going to go there: we're going to have a statistical model that's going to do that for us, but we're not ready for that yet because we don't know about statistical models. So we're going to defer that for later. Once we've understood statistical models and how they can be used to synthesize speech themselves, we'll then come back to unit selection and see how that statistical model can help us compute the join cost, and in fact also the target cost. When we use a statistical model underlying our unit selection system we call that "hybrid synthesis". But that's for later: we'll come back to that.
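The simple, purely local join cost described above can be sketched as a weighted sum of three mismatches measured between the last frame of the left unit and the first frame of the right unit. This is an illustration only, not Festival's Multisyn code: the frame layout (a dict with `f0`, `energy` and `cepstrum`), the weights and the use of Euclidean distance for the spectral term are all assumptions, and both frames are assumed to be voiced so that F0 is defined.

```python
import math


def join_cost(left_last, right_first, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of F0, energy and spectral mismatch across a join.

    left_last   -- final frame of the left candidate
    right_first -- first frame of the right candidate
    Each frame is a dict with 'f0', 'energy' and 'cepstrum' (list of floats).
    """
    w_f0, w_energy, w_spectral = weights
    f0_mismatch = abs(left_last["f0"] - right_first["f0"])
    energy_mismatch = abs(left_last["energy"] - right_first["energy"])
    # Euclidean distance between the two cepstral vectors.
    spectral_mismatch = math.dist(left_last["cepstrum"],
                                  right_first["cepstrum"])
    return (w_f0 * f0_mismatch
            + w_energy * energy_mismatch
            + w_spectral * spectral_mismatch)


# Toy frames: F0 jumps by 5 Hz, energy and spectrum are continuous.
frame_a = {"f0": 120.0, "energy": 0.5, "cepstrum": [1.0, 2.0]}
frame_b = {"f0": 125.0, "energy": 0.5, "cepstrum": [1.0, 2.0]}
example_cost = join_cost(frame_a, frame_b, weights=(0.1, 1.0, 1.0))
```

In practice the three terms have very different numeric ranges, so the weights (or some normalisation of each term) do a lot of work. Extending this to several frames either side of the join, or to deltas, would follow the same pattern.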
We can now wrap up the description of unit selection by looking at the search. We need to understand why a search is necessary at all: what the search is finding for us. It's finding the lowest cost sequence of candidates. We'll see that that search can be made very efficient indeed. We'll wrap up at the very end, by saying how that search could be made even faster if we needed to do so. The ideas here are very similar to those in automatic speech recognition, so make sure you understand the basics of Hidden Markov Models and the Viterbi algorithm before you start on this part. By definition, because our cost functions are measuring perceptual mismatch (either the perceptual mismatch between a target and a possible candidate for that target, or the perceptual mismatch between a candidate and a consecutive candidate that we're considering concatenating it with) the lowest cost path should sound the best. In other words, it should sound as close as possible to the target that we're trying to say, and sound the most natural. Of course, these cost functions are not perfect. They're either based on linguistic features, or acoustic properties. Those are not the same thing as perception. Our cost functions are just predictions of perceived quality. There's always a possibility of trying to make our cost functions better. Eventually, the best possible current solution to these cost functions is actually a complete statistical model. We'll come back to that much later in the course when we come full circle and look at hybrid methods. For now, we've got relatively simple cost functions and we're going to define the best candidate sequence as the one that has the lowest total cost. The total cost is just a sum of local costs. Let's draw one candidate sequence and define what the cost of that sequence would be. There's one path through this lattice of candidates. 
The total cost of this sequence will be the target cost of this candidate measured with respect to this target - the mismatch between those two things - possibly that's a simple weighted sum of linguistic feature mismatches; plus the join cost between these two units; plus the target cost of this candidate with respect to its target; plus the concatenation (or join) cost to the next unit; and so on, summed across the entire sequence. We should already understand that we can't make local choices, because the choice of one candidate depends on what we're concatenating it to. So, through the join cost there's a kind of "domino effect". The choice of unit in this position will have an effect on the choice of unit in this position, and vice versa. Everything is symmetrical. We could have drawn that path going from right to left. There's a definition of best path: it's simply the one with the lowest total cost, which is a sum of local costs. We've understood now this "domino effect": that one choice, anywhere in the search, has an effect potentially on all of the other units that are chosen to go with it, because of the join cost. Now, of course there is one of the sequences that has the lowest total cost. It's lower than all of the rest. The search is going to be required to find that sequence. Now we're going to understand why it was so important that all of those costs (the target cost and the join cost) could be computed entirely locally, and therefore we can do dynamic programming. So, let's remind ourselves in general terms how the magic of dynamic programming works. It works by breaking the problem into separate independent problems. Let's draw a couple of paths and see how dynamic programming could make that computation more efficient. Consider these two paths. Let's just give them names: refer to them as Path A and Path B. Path A and Path B have a common prefix. Up to the choice of this unit they're the same. 
Therefore, when we're computing the total cost of Path B, we can reuse the computations of Path A up to that point. We only have to compute the bit that's different - this point here. That idea generalizes to paths that have common suffixes, or common infixes. In fact we can break the problem right down and use dynamic programming to make this search just as efficient as if this was a Hidden Markov Model. Let's spell that out. Let's see where the dynamic programming step happens. To make the dynamic programming work, we're going to explore in this example from left to right. It doesn't matter: we could do right to left, but we'll do left to right. We'll explore all paths in parallel, this way. We'll start at the beginning, and we'll send paths forwards in parallel. They will propagate. Let's look at the middle part of the problem. Imagine considering choosing this unit. This unit lies on several possible paths coming from the left. It's either preceded by that unit, that unit, that one, or that one. It has a concatenation cost and then the paths could head off in other directions: or it could go here, or here. We can state the same thing as we stated in dynamic time warping, or in Hidden Markov Model-based speech recognition: That the lowest cost path through this point must include the lowest cost path up to this point. Because, if we've decided that we're choosing this unit, then all of the choices here are now independent from all the choices here. The past and the future are independent, given the present. That's the dynamic programming step. That looks incredibly similar to dynamic time warping on the grid. Or we could think of this as a lattice: we're passing tokens through the lattice, so it's something like a Hidden Markov Model. You'll see the idea written formally in the readings. This is the classic paper from Hunt & Black, where this formulation of unit selection was written down for the first time. 
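The dynamic programming step just described can be sketched as a left-to-right Viterbi-style search over the lattice of candidates. This is a minimal illustration under stated assumptions, not the formulation as written in Hunt & Black: the function name `viterbi_search`, the way candidates are represented, and the two toy cost functions are all hypothetical, and every target position is assumed to have at least one candidate.

```python
def viterbi_search(targets, candidates, target_cost, join_cost):
    """Find the lowest total cost path through a lattice of candidates.

    candidates[i] is the list of candidates for targets[i].
    Returns (total_cost, chosen_candidates).
    """
    # best[i][j] = (lowest cost of any path ending at candidate j of
    #               position i, index of its predecessor in position i-1)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        column = []
        for cand in candidates[i]:
            # The best path through `cand` must extend the best path
            # up to one of its possible predecessors.
            cost, prev = min(
                (best[i - 1][k][0] + join_cost(p, cand), k)
                for k, p in enumerate(candidates[i - 1])
            )
            column.append((cost + target_cost(targets[i], cand), prev))
        best.append(column)

    # Pick the cheapest end point, then follow back-pointers.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return total, path


# Toy lattice: each candidate is (precomputed target cost, F0 at its edges).
toy_targets = ["t1", "t2", "t3"]
toy_candidates = [
    [(0.0, 100), (1.0, 120)],
    [(0.0, 150), (0.5, 118)],
    [(0.0, 121), (2.0, 100)],
]
toy_target_cost = lambda target, cand: cand[0]
toy_join_cost = lambda left, right: abs(left[1] - right[1])

best_total, best_path = viterbi_search(
    toy_targets, toy_candidates, toy_target_cost, toy_join_cost)
```

In this toy lattice, picking the lowest target cost in each position independently gives a path with large F0 jumps at both joins. The search instead returns a path whose candidates have slightly higher target costs but much smoother joins. The work grows with the number of positions times the square of the candidates per position, rather than with the number of complete paths.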
This diagram is a way of writing down the search problem. We can see that the costs are local and that the shape of this graph allows us to do dynamic programming in a very simple way. This is essentially just a Hidden Markov Model. As we've described it so far, unit selection concatenates small fragments of waveform. In our rather simplified toy examples we've been pretending that those fragments are phones (whole phones: recordings of phonemes). But, we've been reminding ourselves all along that that's not really going to work very well. We are better off using diphones. In either case, there still seems to be a fundamental problem with the way that we've described the situation. To understand that, let's just go back to this diagram. To synthesize this target sequence, we pick one from each column of the candidates, and concatenate them. That implies that there's a join between every pair of consecutive candidates. That's a LOT of joins! We know that joins are the single biggest problem with unit selection speech synthesis. The joins are what our listener is most likely to notice. How can we reduce the number of joins? An obvious way is to make the units longer in duration: bigger units. For example, instead of diphones, we could use half-syllables or whole syllables, or some other bigger unit. That's a great idea. That will work extremely well: bigger units = fewer joins. Generally we're going to get higher quality. Let's think more generally about that. There are two sorts of system we could imagine building. One is where all of the units are of the same type - they're all diphones, or they're all syllables - so they're all the same: they're homogeneous. The lattice will look very much like the pictures we've drawn so far, but the unit type might change. A more complicated system might use units of different types. 
It might use whole words, if we happen to have the word in the inventory, and then syllables to make up words we don't have, and then diphones to make up syllables that we don't have. That way, we can say anything, but we try and use the biggest units available in the database. Older systems used to be built like that. We say those units are heterogeneous. The lattice is going to look a bit messy in that case, but we could still implement it and still build such a system. Let's see that in pictures because it's going to be easier to understand. Here's a homogeneous system. All the units are of the same type. Here they're whole phones. They could be diphones. It could be any unit you like, but they must all be of the same approximate size. When I say size, I mean size of linguistic unit, so, a half-syllable or a syllable. That's easy. The number of joins is the same for any path through this lattice. The number of concatenation points is the same. We could potentially reduce the number of concatenation points (the number of joins) by trying to find longer units where they're available, and kind of "filling in the gaps" with smaller units when they're not available. Here's a lattice of candidates that has these heterogeneous unit types. When I say lattice, I'm referring to the fact that there are paths that can go through these units, like this, and so forth. There's a lattice of paths. You could build systems like that. I've built ones like that with half-syllables and diphones and things, all mixed together. They're a little bit messy to code, and you have to be a little bit careful about normalizing the costs of each path, so that they could be compared with each other. Fortunately there's a very easy way to build a system that effectively has longer and longer units in it, where they're available in the database, but automatically falls back to shorter units. At the same time, it can make shorter units out of longer units where that's preferable. 
We can do that simply by realizing that each of these multi-phone units is made of several single phone units. Of course, in a real system we'd have multi-diphone units made of single diphones. This picture could simply be redrawn as follows. We write down the individual constituent units of those larger units, but we note that they were consecutive in the database: that they were spoken contiguously together as a single unit. That's what these red lines indicate. There's one little trick that's very common (it's pretty much universal in unit selection systems) to take a system that's essentially homogeneous and get magically larger units out of it. That's to just simply record these contiguous units and define the join cost as 0 between them and not calculate it. So, for example in this particular database it looks like the word cat occurred in its entirety. So we put that into the lattice, but we put it in as the three separate units. We just remember that, if we make a path that passes through all three of them, it incurs no join cost. When we search this lattice, the search is going to find (in general) lower cost paths, if it can join up more of these red lines, because they have zero join cost. But, it's not forced to do that, because it might not always be the best path. For example there's a path here that essentially concatenates whole words. There it is. Those individual words are going to sound perfect because they're just recordings of whole words from the database. But it might be the case that the joins between them are very unnatural. Maybe there's a big F0 discontinuity. So that might not be the best path through this lattice. It doesn't matter; we don't need to make a hard decision. There might be better paths through this lattice that don't concatenate exactly those whole words. Maybe this path. This path takes advantage of some of those "free" or zero-cost joins. Maybe it's got a lower total cost than the other path. The search will decide. 
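The zero join cost trick can be sketched as a thin wrapper around whatever acoustic join cost the system uses: if two candidates were contiguous in the database, the cost is defined as 0 and the acoustic comparison is skipped. The field names `utt` (source utterance) and `index` (position within that utterance) are hypothetical bookkeeping chosen for this illustration.

```python
def join_cost_with_contiguity(left, right, acoustic_join_cost):
    """Join cost that is free for units recorded contiguously.

    left, right        -- candidate units, each a dict holding the
                          utterance it came from ('utt') and its
                          position within that utterance ('index')
    acoustic_join_cost -- function(left, right) used for all other pairs
    """
    contiguous = (left["utt"] == right["utt"]
                  and left["index"] + 1 == right["index"])
    if contiguous:
        return 0.0  # spoken together in the database: nothing to join
    return acoustic_join_cost(left, right)


# Toy units: the first two come from the same recording of "cat".
unit_k = {"utt": "utt_017", "index": 4}
unit_ae = {"utt": "utt_017", "index": 5}
unit_t_elsewhere = {"utt": "utt_203", "index": 11}

# Stand-in acoustic cost that always returns 1.0, for illustration only.
constant_cost = lambda left, right: 1.0

free_join = join_cost_with_contiguity(unit_k, unit_ae, constant_cost)
real_join = join_cost_with_contiguity(unit_ae, unit_t_elsewhere, constant_cost)
```

Because the search only ever sees costs, it needs no special handling of longer units: paths that follow runs of contiguous units simply accumulate less join cost, and the search remains free to break a run where that gives a lower total.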
We'll finish with a final reminder that this picture should really be written with diphones but that would be a little messy and confusing to understand. Another good idea would be to write out the problem in half-phones. Then we could put zero join costs between pairs of half-phones that make up diphones. We'd get a system that's effectively a diphone system that can fall back to half-phones where the diphones aren't suitable. Perhaps, because of the database design, there was a diphone missing. It can do what's called "backing off". A half-phone system makes a lot of sense with this zero join cost trick, where we get effectively variable-sized units from half-phone to diphone to multi-diphone. A nice advantage of a half-phone system is that it can sometimes make joins at phone boundaries. Generally, it's not a good idea, but there are specific cases where joining at phone boundaries works pretty well. An obvious one is that we can put an [s] on the end of something to make the plural. We can make that join at the phone boundary fairly successfully. We now have a complete picture of unit selection. We didn't say very much about the target cost. We said that we could simply look at the mismatches in linguistic features, or that we could make some acoustic prediction about the target and then look at the mismatch in acoustic features. But we didn't say exactly how those two things would be done. The target cost is so important, it's covered in its own section of the course and that's coming next. We're going to look in a lot more detail at these two different formulations: the Independent Feature Formulation and the Acoustic Space Formulation. And mixing those two things together, which is what actually happens in many real systems. When we talk about the Acoustic Space Formulation we'll once again point forward to statistical models and then eventually to hybrid systems. 
After we've completed our look at the target cost, we'd better decide what's going in our database. At the moment we just know there is a database. We think it's probably got natural speech: probably someone reading out whole sentences. But, what sentences? What's the ideal database? We'll look at how to design the ideal database. We'll see that we want coverage. We want maximum variation, so that our target cost function has a lot of candidates to choose amongst for each target position, and that our join cost function can find nice smooth joins between those candidates.
Here are some questions about the videos in this module. If you need to watch the videos again whilst answering the questions, that’s allowed.
The chapter from Taylor is listed as Essential for both Module 2 and Module 3. Suggestion: watch the Module 2 videos, read this chapter through once, watch the Module 3 videos, then come back to the chapter.
Reading
Taylor – Chapter 16 – Unit-selection synthesis
A substantial chapter covering target cost, join cost and search.
Jurafsky & Martin – Section 8.5 – Unit Selection (Waveform) Synthesis
A brief explanation. Worth reading before tackling the more substantial chapter in Taylor (Speech Synthesis course only).
Download the slides for the class on 2025-01-21 14:10-15:00
You now have a complete picture of unit selection speech synthesis. The synthetic speech is created by concatenating pre-recorded waveform fragments. These fragments are found from a database of natural speech.
The quality of the synthetic speech is heavily reliant on two things, each of which is covered in more detail in the next part of the course:
- The target cost function
- The database of recorded natural speech