Whilst the video is playing, click on a line in the transcript to play the video from that point. We can now wrap up the description of unit selection by looking at the search. We need to understand why a search is necessary at all: what the search is finding for us. It's finding the lowest cost sequence of candidates. We'll see that that search can be made very efficient indeed. We'll wrap up at the very end, by saying how that search could be made even faster if we needed to do so. The ideas here are very similar to those in automatic speech recognition, so make sure you understand the basics of Hidden Markov Models and the Viterbi algorithm before you start on this part. By definition, because our cost functions are measuring perceptual mismatch (either the perceptual mismatch between a target and a possible candidate for that target, or the perceptual mismatch between a candidate and a consecutive candidate that we're considering concatenating it with) the lowest cost path should sound the best. In other words, it should sound as close as possible to the target that we're trying to say, and sound the most natural. Of course, these cost functions are not perfect. They're either based on linguistic features, or acoustic properties. Those are not the same thing as perception. Our cost functions are just predictions of perceived quality. There's always a possibility of trying to make our cost functions better. Eventually, the best possible current solution to these cost functions is actually a complete statistical model. We'll come back to that much later in the course when we come full circle and look at hybrid methods. For now, we've got relatively simple cost functions and we're going to define the best candidate sequence as the one that has the lowest total cost. The total cost is just a sum of local costs. Let's draw one candidate sequence and define what the cost of that sequence would be. There's one path through this lattice of candidates. The total cost of this sequence will be the target cost of this candidate measured with respect to this target - the mismatch between those two things - possibly that's a simple weighted sum of linguistic feature mismatches; plus the join cost between these two units; plus the target cost of this candidate with respect to its target; plus the concatenation (or join) cost to the next unit; and so on, summed across the entire sequence. We should already understand that we can't make local choices, because the choice of one candidate depends on what we're concatenating it to. So, through the join cost there's a kind of "domino effect". The choice of unit in this position will have an effect on the choice of unit in this position, and vice versa. Everything is symmetrical. We could have drawn that path going from right to left. There's a definition of best path: it's simply the one with the lowest total cost, which is a sum of local costs. We've understood now this "domino effect": that one choice, anywhere in the search, has an effect potentially on all of the other units that are chosen to go with it, because of the join cost. Now, of course there is one of the sequences that has the lowest total cost. It's lower than all of the rest. The search is going to be required to find that sequence. Now we're going to understand why it was so important that all of those costs (the target cost and the join cost) could be computed entirely locally, and therefore we can do dynamic programming. So, let's remind ourselves in general terms how the magic of dynamic programming works. It works by breaking the problem into separate independent problems. Let's draw a couple of paths and see how dynamic programming could make that computation more efficient. Consider these two paths. Let's just give them names: refer to them a Path A and Path B. Path A and Path B be have a common prefix. Up to the choice of this unit they're the same . Therefore, when we're computing the total cost of Path B, we can reuse the computations of Path A up to that point. We only have to compute the bit that's different - this point here. That idea generalizes to paths that have common suffixes, or common infixes. In fact we can break the problem right down and use dynamic programming to make this search just as efficient as if this was a Hidden Markov Model. Let's spell that out. Let's see where the dynamic programming step happens. To make the dynamic programming work, we're going to explore in this example from left to right. It doesn't matter: we could do right to left, but we'll do left to right. We'll explore all paths in parallel, this way. We'll start at the beginning, and we'll send paths forwards in parallel. They will propagate. Let's look at the middle part of the problem. Imagine considering choosing this unit. This unit lies on several possible paths coming from the left. It's either preceded by that unit, that unit, that one, or that one. It has a concatenation cost and then the paths could head off in other directions: or it could go here, or here. We can state the same thing as we stated in dynamic time warping, or in Hidden Markov Model-based speech recognition: That the lowest cost path through this point must include the lowest cost path up to this point. Because, if we've decided that we're choosing this unit, then all of the choices here are now independent from all the choices here. The past and the future are independent, given the present. That's the dynamic programming step. That looks incredibly similar to dynamic time warping on the grid. Or we could think of this as a lattice: we're passing tokens through the lattice, so it's something like a Hidden Markov Model. You'll see the idea written formally in the readings. This is the classic paper from Hunt & Black, where this formulation of unit selection was written down for the first time. This diagram is a way of writing down the search problem. We can see that the costs are local and that the shape of this graph allows us to do dynamic programming in a very simple way. This is essentially just a Hidden Markov Model. As we've described it so far, unit selection concatenates small fragments of waveform. In our rather simplified toy examples we've been pretending that those fragments are phones (whole phones: recordings of phonemes). But, we've been reminding ourselves all along that that's not really going to work very well. We are better off using diphones. In either case, there still seems to be a fundamental problem with the way that we've described the situation. To understand that, let's just go back to this diagram. To synthesize this target sequence, we pick one from each column of the candidates, and concatenate them. That implies that there's a join between every pair of consecutive candidates. That's a LOT of joins! We know that joins are the single biggest problem with unit selection speech synthesis. The joins are what our listener is most likely to notice. How can we reduce the number of joins? An obvious way is to make the units longer in duration: bigger units. For example, instead of diphones, we could use half-syllables or whole syllables, or some other bigger unit. That's a great idea. That will work extremely well: bigger units = fewer joins. Generally we're going to get higher quality. Let's think more generally about that. There are two sorts of system we could imagine building. One is where all of the units are of the same type - they're all diphones, or they're all syllables - so they're all the same: they're homogeneous. The lattice will look very much like the pictures we've drawn so far, but the unit type might change. A more complicated system might use units of different types. It might use whole words, if we happen to have the word in the inventory, and then syllables to make up words we don't have, and then diphones to make up syllables that we don't have. That way, we can say anything, but we try and use the biggest units available in the database. Older systems used to be built like that. We say those units are heterogeneous. The lattice is going to look a bit messy in that case, but we could still implement it and still build such a system. Let's see that in pictures because it's going to be easier to understand. Here's a homogeneous system. All the units are of the same type. Here they're whole phones. They could be diphones. It could be any unit you like, but they must all be of the same approximate size. When I say size, I mean size of linguistic unit, so, a half-syllable or a syllable. That's easy. The number of joins is the same for any path through this lattice. The number of concatenation points is the same. We could potentially reduce the number of concatenation points (the number of joins) by trying to find longer units where they're available, and kind of "filling in the gaps" with smaller units when they're not available. Here's a lattice of candidates that has these heterogeneous unit types. When I say lattice, I'm referring to the fact that there are paths that can go through these units, like this, and so forth. There's a lattice of paths. You could build systems like that. I've built ones like that with half-syllables and diphones and things, all mixed together. They're a little bit messy to code, and you have to be a little bit careful about normalizing the costs of each path, so that they could be compared with each other. Fortunately there's a very easy way to build a system that effectively has longer and longer units in it, where they're available in the database, but automatically falls back to shorter units. At the same time, it can make shorter units out of longer units where that's preferable. We can do that simply by realizing that each of these multi-phone units is made of several single phone units. Of course, in a real system we'd have multi-diphone units made of single diphones. This picture could simply be redrawn as follows. We write down the individual constituent units of those larger units, but we note that they were consecutive in the database: that they were spoken contiguously together as a single unit. That's what these red lines indicate. There's one little trick that's very common (it's pretty much universal in unit selection systems) to take a system that's essentially homogeneous and get magically larger units out of it. That's to just simply record these contiguous units and define the join cost as 0 between them and not calculate it. So, for example in this particular database it looks like the word cat occurred in its entirety. So we put that into the lattice, but we put it in as the three separate units. We just remember that, if we make a path that passes through all three of them, it incurs no join cost. When we search this lattice, the search is going to find (in general) lower cost paths, if it can join up more of these red lines, because they have zero join cost. But, it's not forced to do that, because it might not always be the best path. For example there's a path here that essentially concatenates whole words. There it is. Those individual words are going to sound perfect because they're just recordings of whole words from the database. But it might be the case that the joins between them are very unnatural. Maybe there's a big F0 discontinuity. So that might not be the best path through this lattice. It doesn't matter; we don't need to make a hard decision. There might be better paths through this lattice that don't concatenate exactly those whole words. Maybe this path. This path takes advantage of some of those "free" or zero-cost joins. Maybe it's got a lower total cost than the other path. The search will decide. We'll finish with a final reminder that this picture should really be written with diphones but that would be a little messy and confusing to understand. Another good idea would be to write out the problem in half-phones. Then we could put zero join costs between pairs of half-phones that make up diphones. We'd get a system that's effectively a diphone system that can fall back to half-phones where the diphones aren't suitable. Perhaps, because of about database design, there was a diphone missing. It can do what's called "backing off". A half phone system makes a lot of sense with this zero join cost trick, where we get effectively variable-sized units from half-phone to diphone to multi-diphone. A nice advantage of a half-phone system is that it can sometimes make joins at phone boundaries. Generally, it's not a good idea, but there are specific cases where joining at phone boundaries works pretty well. An obvious one is that we can put an [s] on the end of something to make the plural. We can make that join at the phone boundary fairly successfully. We now have a complete picture of unit selection. We didn't say very much about the target cost. We said that we could simply look at the mismatches in linguistic features, or that we could make some acoustic prediction about the target and then look at the mismatch in acoustic features. But we didn't say exactly how those two things would be done. The target cost is so important, it's covered in its own section of the course and that's coming next. We're going to look in a lot more detail about these two different formulations: the Independent Feature Formulation and the Acoustic Space Formulation. And mixing those two things together, which is what actually happens in many real systems. When we talk about the Acoustic Space Formulation we'll once again point forward to statistical models and then eventually to hybrid systems. After we've completed our look at the target cost, we'd better decide what's going in our database. At the moment we just know there is a database. We think it's probably got natural speech: probably someone reading out whole sentences. But, what sentences? What's the ideal database? We'll look at how to design the ideal database. We'll see that we want coverage. We want maximum variation, so that our target cost function has a lot of candidates to choose amongst for each target position, and that our join cost function can find nice smooth joins between those candidates.
Search
With multiple candidates available for each target position, a search must be performed.
Log in if you want to mark this as completed
|
|