Bonus material: partial synthesis

The simplest type of hybrid synthesis is essentially unit selection with an ASF target cost function, where the acoustic features are created through "partial synthesis".


This video just has a plain transcript, not time-aligned to the video.

THIS IS AN UNCORRECTED AUTOMATIC TRANSCRIPT. IT MAY BE CORRECTED LATER IF TIME PERMITS.

In this module, we're going to bring together the two key concepts that we've covered so far.
And those are unit selection, which generates waveforms by concatenation of recordings, and statistical parametric speech synthesis, which uses a model trained on data.
And we're going to use that statistical model to drive a unit selection system.
So obviously, you'll need a very good understanding of statistical parametric speech synthesis before proceeding.
You could do that with either hidden Markov models or deep neural networks.
It doesn't matter.
Either of them could be combined with unit selection to make a hybrid system.
You also obviously need to understand how unit selection works, and particularly that you could get potentially very good naturalness from such a system.
Now, statistical parametric speech synthesis might be flexible and robust to labelling errors, but in systems that use a vocoder, naturalness is limited by that, as well as by other things such as the statistical model not being perfect.
Hybrid synthesis has the potential to improve naturalness compared to statistical parametric speech synthesis.
In contrast, unit selection potentially offers excellent naturalness, simply because it's playing back recorded waveforms.
But if the database has errors of any sort, and particularly labelling errors, they will very strongly affect the naturalness.
Another problem with the unit selection system is it's quite hard work to optimise it on new data, even a new speaker of the same language.
We need to perhaps change weights in the target cost or join cost.
That's hard work, and we can never be completely sure that we've done the best possible job of that.
So by combining statistical parametric systems with unit selection, we have the potential of taking the best of both worlds: in particular, taking this robustness, and the fact that we can automatically learn from data, and combining it with the naturalness of playing back waveforms.
We can get a system which has the naturalness of unit selection but is not as affected by, for example, labelling errors in the corpus, and which perhaps isn't as much work to optimise on a new voice.
As well as knowing about those two concepts, you need to know about the components behind them, and you particularly need to know something about signal processing.
What we need to know here is how we might parameterise a speech signal, and that we might do that in rather different ways if we're classifying those speech signals (for example, when we're doing automatic speech recognition) compared to if we want to regenerate the speech signal from the parametric form, from the speech parameters.
That's called vocoding, and so we might use very different parameters in these two cases.
Back when we talked about unit selection, we spent some time thinking about sparsity.
We considered how that interacts with the type of target cost function we're using: whether it's measuring similarity between candidates and targets in an Independent Feature Formulation (IFF) way, in other words, based only on linguistic features, or in acoustic space, that is, an ASF-style target cost.
And we made the claim that if we could measure similarity well in an acoustic space, we might suffer from fewer sparsity problems than when we measure in linguistic space.
If you don't remember why that is, go back to the module on unit selection and compare again the Independent Feature Formulation and the Acoustic Space Formulation for the target cost function.
When we talked about statistical parametric speech synthesis, in the module on HMMs and then the module on deep neural networks, we tried to have a unified view of all of that.
And if we'd like a very short description of what statistical parametric speech synthesis is, it's a regression problem from a sequence of linguistic features to a sequence of speech parameters.
So it's a sequence-to-sequence regression problem.
So we're now going to take the knowledge of signal processing and how we might represent speech, along with the problem that unit selection has with sparsity, combined with a technique for sequence-to-sequence regression, such as a deep neural network, and build what we're going to call a hybrid speech synthesiser: hybrid simply because it combines unit selection and a statistical model.
A phrase that you will have come across in the readings from Taylor is this idea of partial synthesis.
This idea is going to be important now.
In the statistical model that we use to drive this unit selection system in the hybrid setup, we don't need to generate a speech waveform from the parametric representation; that will eventually happen through concatenation.
That's why Taylor says partial synthesis.
We're not going all the way to a speech waveform.
We're going to some other representation, which is good enough to then select waveforms.
That means we've got choices about what representation we generate.
It does not need to be the same as we would need when driving a vocoder.
For example, our model could just generate MFCCs, and we could use those to match against candidates in the database.
Equally, we don't need to generate at the high frame rate that we would need if we were vocoding, perhaps 200 frames per second.
We might predict the acoustic properties far less often, maybe once per segment or once for each half of a diphone, and use that in the target cost function.
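As a rough illustration of working at a lower rate, here is a minimal Python sketch. It assumes, purely for illustration, that we already have frame-level predictions and unit boundaries, and that averaging within each unit is an acceptable summary; the model could equally predict per-unit values directly. The function name `pool_per_unit` and the toy data are hypothetical.

```python
def pool_per_unit(frames, boundaries):
    """Average frame-level feature vectors within each unit.

    frames: list of equal-length feature vectors (lists of floats), one per frame.
    boundaries: list of (start, end) frame indices, end exclusive, one pair
        per unit (for example, one per half-diphone). These are assumed given.
    Returns one averaged feature vector per unit.
    """
    pooled = []
    for start, end in boundaries:
        if end <= start:
            raise ValueError("each unit must contain at least one frame")
        segment = frames[start:end]
        dim = len(segment[0])
        # Mean of each feature dimension over the frames in this unit.
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled


# Toy example: four frames of 2-dimensional features, split into two units.
frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
units = pool_per_unit(frames, [(0, 2), (2, 4)])
```

Each unit is then described by a single vector, which is what a target cost function operating once per unit would compare against the candidates.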
So we're a lot more flexible in what we generate from our statistical model when we're doing hybrid synthesis compared to statistical parametric speech synthesis.
That's the idea of partial synthesis.
Keep that in mind throughout: the statistical model may or may not be generating vocoder parameters; it might be generating something a bit simpler.
Another way to describe hybrid speech synthesis is that it's just statistical parametric speech synthesis with a clever vocoder, a vocoder that generates speech in a clever way.
So we could draw a picture of that.
We have our statistical models.
They might be models of context-dependent phones, and they will generate a parametric representation of speech: speech parameters.
That doesn't have to be vocoder parameters.
It could be anything you want.
But instead of using a vocoder to get from those parameters to a waveform, we use something else.
We use a database of recorded speech and then concatenate the fragments: the candidates.
We can view that as statistical parametric speech synthesis with a very clever vocoder based on a speech database.
Or we could describe this as fairly traditional unit selection with a target cost function operating in acoustic space.
So the acoustic space will be the speech parameters, which are whatever you want them to be, let's say MFCCs.
And it's unit selection.
So there's a picture of a set of candidates, and those candidates form a lattice.
And our job is to find a path through the lattice that sounds good.
The target cost function is going to be based on the match between this parametric representation and a particular candidate we're considering.
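The search for a path through the lattice can be sketched as a small dynamic-programming (Viterbi-style) search. This is a minimal sketch under stated assumptions, not a description of any particular system: the target costs are assumed to be precomputed in a table, the join cost is a caller-supplied function, and the name `best_path` and the toy costs are hypothetical.

```python
def best_path(target_costs, join_cost):
    """Find the lowest-cost path through a candidate lattice.

    target_costs[t][j]: target cost of candidate j at target position t.
    join_cost(t, i, j): cost of joining candidate i at position t - 1
        to candidate j at position t.
    Returns (total_cost, path), where path[t] is the chosen candidate index.
    """
    best = list(target_costs[0])  # best cost of any path ending at each candidate
    back = []                     # back-pointers, one list per position after the first
    for t in range(1, len(target_costs)):
        new_best, pointers = [], []
        for j, tc in enumerate(target_costs[t]):
            costs = [best[i] + join_cost(t, i, j) for i in range(len(best))]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min] + tc)
            pointers.append(i_min)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    last = min(range(len(best)), key=best.__getitem__)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


# Toy lattice: three target positions, two candidates each.
toy_target_costs = [[1.0, 0.5], [0.25, 2.0], [0.5, 0.125]]
# Hypothetical join cost: free if the candidate index stays the same, 0.25 otherwise.
toy_join = lambda t, i, j: 0.0 if i == j else 0.25
total, path = best_path(toy_target_costs, toy_join)
```

In this toy lattice the cheapest path switches candidate at every position, because the saving in target cost outweighs the join cost each time.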
So those are just different ways of describing the same thing.
Use whichever you're most comfortable with.
Think of it as unit selection with a statistical model doing the target cost's job.
Or think of it as statistical parametric speech synthesis with a rather clever vocoder.
Just to be clear, then: the speech parameters can be anything that you want, because we don't actually need to be able to reconstruct the waveform from them, so we could call those a partial synthesis.
It's not a full specification.
It's just enough to make comparisons with candidates from the database, and we're going to measure this distance, and that's the target cost.
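To make the idea of measuring this distance concrete, here is a minimal Python sketch. It assumes, for illustration only, that both the predicted parameters and each candidate are summarised as a fixed-length feature vector (for example, a few MFCCs) and that plain Euclidean distance is used; a real system might weight or normalise the dimensions differently. The names `target_cost` and `rank_candidates` are hypothetical.

```python
import math


def target_cost(predicted, candidate):
    """Euclidean distance between a predicted feature vector and a candidate's."""
    if len(predicted) != len(candidate):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(predicted, candidate)))


def rank_candidates(predicted, candidates):
    """Return candidate indices ordered from lowest to highest target cost."""
    return sorted(range(len(candidates)),
                  key=lambda i: target_cost(predicted, candidates[i]))


# Toy example: one predicted vector and three candidates from the database.
predicted = [1.0, 1.0]
candidates = [[4.0, 5.0], [1.0, 2.0], [1.0, 1.0]]
ranking = rank_candidates(predicted, candidates)
```

The predicted vector never has to be turned into a waveform; it only has to be good enough to order the candidates sensibly.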
I quite like the following analogy, So let's see if it works for you.
When we generate images by computer of real objects, for example, people, it's quite usual to start from measurements of real objects and then to make a model.
And then we can control that model, for example, animate it, make it move.
We render that model to make it look photorealistic, if that's what we want.
So here's how that would work for making a face.
First we get some raw measurement data from a human subject, so some sort of 3D scanning device would measure lots of points on the surface of somebody's face and give us this raw data.
That's like the speech database we get in the studio: waveforms.
The raw data is high-dimensional and hard to work with.
It's not easy to directly, for example, animate this representation.
So we turn that into a parametric model, which loses some detail but gains control.
It has fewer dimensions to control, but those are now meaningful, so we could change the shape of the mouth more easily in this representation, for example.
Those are just parameters.
So to generate the final image, maybe for a movie or a computer game, we have to give a surface to this set of parameters.
So think of these parameters as the mel-cepstrum.
We could just shade that model like this.
For me, that's a bit like vocoding.
It looks now like a person.
It's kind of convincing, but there's something unnatural about it.
It's rather smooth, and in this case it doesn't have any colour, and it doesn't have much texture, because that's a very simple rendering from the parametric model.
If we want to make this look better, we need to put some photorealistic images on top of this shaded model.
So the way that can be done is essentially to take lots of little image tiles and to cover up the shaded model with little photographs in this case, little photographs of skin.
This method of photorealistic rendering, by tiling little real images on top of a mesh from a parametric model, is very similar to one particular form of hybrid speech synthesis.