I'm going to describe one particular form of hybrid speech synthesis now, taking it just as a case study. In other words, there are many other ways of doing this. I'm picking this particular way of doing it because I think it's very easy to understand, and I just like the name "trajectory tiling". There's the reading available in the reading list; I strongly recommend this paper to you.

The core idea is very simple. We're going to generate speech parameters using a statistical model. In this paper it's a hidden Markov model, but if we were repeating this work today we would probably just use a deep neural network. Those speech parameters are going to be, for example, a spectral envelope, fundamental frequency and energy. They could be vocoder parameters, and in this paper they basically are; or they could be simpler: a lower frame rate or a lower spectral resolution would probably still work. Whatever we do, we're going to generate those parameters from a statistical model. Then we're just going to go to a unit database and find the sequence of waveform fragments that somehow matches those parameters. We have to measure that match, and then concatenate, just as if we were doing unit selection. So this is a very economical, prototypical approach to hybrid speech synthesis, and that's why I like this paper: it takes a fairly straightforward approach and achieves very good results.

In the coming slides I'm going to be using this diagram from the paper, or parts of it. The diagram attempts to say everything all in one figure, but we're going to deconstruct it and look at it bit by bit.

Let's start with just a general overview. We have a statistical model that generates parametric forms of speech. Here, that's the fundamental frequency, the gain (let's just call that energy), and these things called line spectral pairs, which represent the spectral envelope; we'll say a little bit more about exactly what those are in a moment. The paper calls these the guiding parameter trajectories. That's a nice name. We're going to select waveform fragments guided by this specification. In other words, we might not slavishly obey it exactly: we allow a bit of mismatch between the selected waveforms and this representation, some distance that we measure there, and we'll be willing to compromise on that in return for good joins. In other words, we're going to weight the sum of join costs and target costs, just like in unit selection. So these parameter trajectories are a guide. We might not get a speech signal that has precisely these properties, but we'll be close to them.

We'll then go to the database of speech and pull out lots of speech fragments. These things here are called waveform tiles, by analogy with the little image tiles that we saw earlier. A little bit confusingly, the paper calls this structure a "sausage". Let's call it a lattice, because a lattice is what it really is. And we're going to do the usual thing in unit selection: find the lowest-cost path through this network, and concatenate the corresponding sequence of waveform tiles to produce the output speech signal.
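That lowest-cost search is the same dynamic programming we used in unit selection. Here is a minimal Viterbi-style sketch of it, just as a reminder of the shape of the computation; the cost functions are placeholders for the target and join costs described in the rest of this section, and this is a generic illustration, not the paper's implementation.

```python
import numpy as np

def lowest_cost_path(target_costs, join_cost):
    """Viterbi-style dynamic programming over the candidate lattice.

    target_costs[t][j]: target cost of candidate j for unit t.
    join_cost(t, i, j): cost of joining candidate i of unit t
    to candidate j of unit t + 1. Returns one index per unit.
    """
    cost = np.asarray(target_costs[0], dtype=float)
    backpointers = []
    for t in range(1, len(target_costs)):
        n = len(target_costs[t])
        new_cost, bp = np.empty(n), np.empty(n, dtype=int)
        for j in range(n):
            # Cheapest way to arrive at candidate j of unit t.
            totals = [cost[i] + join_cost(t - 1, i, j) for i in range(len(cost))]
            bp[j] = int(np.argmin(totals))
            new_cost[j] = totals[bp[j]] + target_costs[t][j]
        backpointers.append(bp)
        cost = new_cost
    # Trace back from the cheapest final candidate.
    path = [int(np.argmin(cost))]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Toy example: three units, two candidates each, join cost = |i - j|.
print(lowest_cost_path([[0.0, 1.0], [2.0, 0.5], [1.0, 1.0]],
                       lambda t, i, j: abs(i - j)))
```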
A key component, obviously, is how to measure the distance between a candidate, one of these waveform tiles, and the acoustic specification coming from the guiding statistical parametric speech synthesis system, an HMM in this case. In order to measure that distance, we obviously have to get these two things into the same domain, the same representation. We can't measure the distance between a waveform in the time domain and a spectral envelope; we have to convert one to be in the same representation as the other. The obvious thing to do is to take this bit of speech, extract the same features from it, and measure the distance between those extracted features and the guiding parameter trajectories. That seems reasonable. We'll see in a moment that we can actually do slightly better than that, but for now, let's assume we extract parameters from the candidate speech and we measure the distance, perhaps just the Euclidean distance (maybe with the parameters normalised), between the parameters of the speech and the guiding trajectories. Then we just sum that over all the frames of a unit, and that will be the target cost of that candidate. We'll come back to the join cost in a moment.

Before we go any further, we'd better just understand what these line spectral pairs, or LSPs, are. I'm going to give you a very informal idea of what they capture, and of how they're rather different from the cepstrum. Here's the FFT spectrum of a frame of voiced speech. We can tell it's voiced because we can see the harmonics, and it's probably a vowel because it's got some formant structure; that looks very obvious. Let's just extract the envelope: there are lots of ways we could do that, and we've talked about them before. The line spectral pairs, quite often called line spectral frequencies, are a way of representing the shape of that spectral envelope. I'm going to say this rather informally; it isn't precisely true, but it's approximately the case, and it's a very nice way to understand them. We have a pair of values representing each peak, each formant. So let's guess where they might be on this diagram: there will be two for this formant, then two here, two here, and maybe some representing the rest of the shape, like that. The key property of these line spectral pairs is that they are more closely spaced where there is a sharper peak in the spectrum. So think of them as capturing, somehow, the formant frequency and bandwidth using a pair of numbers. Now, they don't map exactly onto formants (this is a rather informal way of describing things), but I think that's a good enough understanding to go forward with this paper.

So the line spectral pairs, or line spectral frequencies, each have a value, and that value is a frequency. On this little extract of the figure we can plot them on a diagram that has time going this way and frequency going that way: the same space in which a spectrogram would be plotted. Each line spectral frequency clearly changes over time; it has a trajectory. Now, the method in this paper could probably equally well have used MFCCs; it just happens to use line spectral pairs. One nice thing about line spectral pairs is that we can actually draw pictures of them like this: they are meaningful, they're interpretable. That's not the case with MFCCs.
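For the curious, here is roughly where line spectral frequencies come from. Starting from the LPC polynomial A(z), we form a symmetric and an antisymmetric polynomial, P(z) = A(z) + z^{-(p+1)}A(1/z) and Q(z) = A(z) - z^{-(p+1)}A(1/z); their roots lie on the unit circle, and the angles of those roots are the LSFs. This numpy-only sketch is for illustration; a production implementation would use a more careful root-finder.

```python
import numpy as np

def lpc_to_lsf(a):
    """Convert LPC coefficients a = [1, a1, ..., ap] into line
    spectral frequencies in radians, sorted in ascending order."""
    a = np.asarray(a, dtype=float)
    # Build P and Q by padding with one zero and reversing, which
    # implements the z^-(p+1) A(1/z) term.
    a_ext = np.concatenate([a, [0.0]])
    P = a_ext + a_ext[::-1]
    Q = a_ext - a_ext[::-1]
    lsf = []
    for poly in (P, Q):
        roots = np.roots(poly)
        # Roots come in conjugate pairs on the unit circle; keep the
        # upper-half-plane member of each pair. The trivial real roots
        # at z = +1 and z = -1 fail the imaginary-part test.
        lsf.extend(np.angle(roots[np.imag(roots) > 1e-9]))
    return np.sort(np.array(lsf))

# For a real speech frame, the coefficients could come from an LPC
# analysis, e.g. a = librosa.lpc(frame, order=16), if librosa is available.
```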
OK, we've understood line spectral frequencies at least well enough to know what's going on in this picture. So we've got this specification, which has come out of our statistical model: it's got energy, F0, and these line spectral frequencies representing the spectral envelope. And we've got a candidate waveform from my database, and we're going to try and compare them. The way to do that is to convert this waveform into the same space as the parameters coming out of the statistical model. We can't compare directly with the waveform, so from the waveform we will extract some parameters; I'm going to sketch that by hand. So this has got some F0 value, it's got some energy, and it's got these line spectral frequencies with their trajectories over time; again, that's time and frequency on that little bit of the diagram. So I've now got the representation from the statistical model in the same domain as the representation of this speech waveform, and now it will be very easy to use any distance measure we like, say the Euclidean distance, between those two. We do it frame by frame and then just sum up across all the frames: a sum over time.

Now, there's a problem in doing that. The problem is that the parameters we extract from a waveform will look a little bit different from the ones generated by the HMM. In particular, they will be noisy, and the ones generated by the HMM will be rather smooth, because of the nature of the statistical model. Here are two figures illustrating that idea. On the left, I've got natural speech with LSFs extracted from it; on the right, I've overlaid, on top of the natural speech spectrogram, trajectories generated by the statistical model. Look how much smoother they are. So there's some systematic mismatch between natural LSF trajectories and ones generated from our statistical model.

Here's another picture of the same thing, with the trajectories overlaid on top of each other. In blue, I've got trajectories extracted from natural speech, and in red I've got ones generated from a statistical model; in this case, it happens to be a deep neural network. Let's zoom in. It's really obvious that there's this systematic mismatch. Now, the most obvious aspect of that mismatch is that the blue ones are very noisy and the red ones are very smooth, but there's actually a deeper problem with the mismatch, and that's that the statistical model will make systematic errors in its predictions. They will be systematic: it will always make the same error. So for the same phone in the same context, it will tend to make the same error over and over again, whereas in the natural speech, for that phone in that context, every speech sample will be different.

So this mismatch is a problem, and we have a clever way of getting around it. What we'll do, instead of extracting the features from the waveforms (in other words, from the candidates), is actually regenerate them for the training data using our chosen statistical model. That regeneration is essentially synthesising the training data. That seems slightly odd at first, but when we think about it deeply we'll realise this is an excellent way to remove mismatch, because the trajectories that we now have for the training data are from the same model that we'll be using at synthesis time. So, for example, they will have the same smoothness property; but, more importantly and more fundamentally, they will contain the same systematic errors.

If you're finding that idea a little bit hard to grasp, cast your mind back to unit selection, where we thought about what sort of labels to put on the database and what sort of things to consider in the target cost. We thought about whether, when we label the database, we should use the canonical phone sequence, or a very close phonetic transcription that is exactly what the speaker said. And we came to the conclusion that consistency was more important than accuracy, because we wanted no mismatch between the database and what happens at synthesis time. That's exactly what's happening here.
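Here is a minimal sketch of that target cost: a frame-by-frame Euclidean distance between the guiding trajectories and a candidate tile's trajectories (which, as just argued, should be regenerated by the model rather than extracted from the waveform), summed over the frames of the unit. The stacking of F0, energy and LSFs into one vector per frame, and the optional per-dimension weights, are assumptions for illustration.

```python
import numpy as np

def target_cost(guide, candidate, weights=None):
    """Sum of frame-wise Euclidean distances between two trajectory
    matrices of shape (num_frames, num_dims), where each row stacks,
    say, F0, energy and the LSFs for one frame. Assumes the candidate
    has already been aligned to the guide's frames."""
    guide = np.asarray(guide, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    if weights is None:
        weights = np.ones(guide.shape[1])
    # Per-dimension weighting balances streams with different scales.
    diff = (guide - candidate) * weights
    return float(np.sqrt((diff ** 2).sum(axis=1)).sum())
```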
The labels that we're putting on the database here are speech parameter trajectories. The labels are acoustic labels, because our target cost function is going to operate in an acoustic space. So it's important that those acoustic labels (F0, line spectral frequencies, energy), the acoustic labels we're putting on the training data, look very much like the ones we'll get at synthesis time. For example, if we asked our hybrid synthesiser to say a sentence from the training data, we'd like it to retrieve that entire sentence intact, and that's going to be much more likely if the labels we put on the training data have been regenerated in this way, and not extracted from the natural speech.

So that's our target cost taken care of. Given the guiding parameter trajectories and a waveform fragment in the database, which we also have trajectories for (not extracted from the fragment itself, but regenerated from the model for the entire training data), we can compute the target cost with any distance function we like between those two.

The other component we need, obviously, is a join cost. As we take paths through this lattice (what the paper calls a "sausage"), we are considering concatenating one candidate with another candidate, and we need a join cost. Now, we could just use the same sort of join cost as in unit selection: take the one Festival uses, a weighted sum over MFCCs, F0 and energy. That would work; that's fine. This paper does something a little bit different: it actually combines the join cost with a method for finding a good join point.

Here's a familiar idea, but used in a different way. Remember when we were estimating the fundamental frequency of speech signals? We used an idea called autocorrelation, or cross-correlation. We took two signals, which were just copies of the same signal, and we slid one backwards and forwards with respect to the other, looking for self-similarity (which is just a fancy word for similarity with itself). Our purpose then was to find a shift corresponding to the fundamental period, from which we can get F0. What's happening here is essentially the same measure, but used in a different way: we're now doing it between two different waveforms. This is the candidate to the left, and this is the candidate to the right, when we're considering where we might join them. And the join might just be a simple overlap-and-add, so we're trying to find a place where the waveforms align with each other with the most similarity. So we'll take one of them and, as this diagram implies, slide it backwards and forwards with respect to the other. At each lag, each offset, we'll measure the cross-correlation between them within some window, and that will give us a number. We will find the lag which maximises that number, which maximises the correlation, the similarity; this time it's the similarity between two different waveforms. If we can find that point of maximum similarity, that suggests it's a really good point to join the waveforms. So we'll line them up at that position of maximum similarity, and just do a simple fade-out of one waveform and fade-in of the next waveform, and overlap-and-add.
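Here is a minimal numpy sketch of that idea: slide one waveform against the other, measure the normalised cross-correlation in a window at each lag, and keep the lag with the highest score. The window length and lag range are assumptions; in practice they would be on the order of a few pitch periods.

```python
import numpy as np

def best_join_point(left, right, window=256, max_lag=160):
    """Return (best_lag, best_score) for joining `left` to `right`.

    Compares the last `window` samples of `left` with a window of
    `right` starting at each candidate lag, using normalised
    cross-correlation. A high score means the waveforms line up well,
    so (1 - best_score) can serve as the join cost, and best_lag
    says where to overlap-add."""
    tail = left[-window:].astype(float)
    best_lag, best_score = 0, -np.inf
    for lag in range(max_lag):
        head = right[lag:lag + window].astype(float)
        if len(head) < window:
            break
        denom = np.linalg.norm(tail) * np.linalg.norm(head)
        score = float(np.dot(tail, head) / denom) if denom > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score
```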
That's for finding a good place to join these two particular candidates. But that number we computed, the correlation at the best possible offset, is also a good measure of how well they join, and so that's used as the join cost in this paper. So for every possible join, every candidate and every possible successive candidate, we put the pair through this cross-correlation algorithm, sliding one backwards and forwards with respect to the other, finding the point of maximum similarity between them, and making a note of that similarity value, that cross-correlation value. That's the join cost that's put into the lattice for the search, and when we do eventually find the best path through the lattice, we'll know exactly where to make the cross-fade between the units (there's a sketch of that cross-fade below).

Now, it should be apparent, from the fact that we're doing this cross-correlation, which involves trying lots of different offsets or lags, that this is a relatively expensive sort of join cost to compute: probably a lot more expensive than our simple Euclidean distance over MFCCs at a single join point. But it's doing a bit more than just being a join cost: it's finding the best possible join point as well. Contrast that with what happens in Festival, where the join points are predetermined. Each diphone has a left boundary and a right boundary, which are pitch-synchronous, and those join points are the same regardless of what we concatenate that candidate with. Here, something smarter is happening: the particular join point for this candidate will vary, within a range of a few pitch periods, depending on what we're going to join it to.

Now, the paper also describes how the underlying HMM system is trained. We don't need to go into that: it's a slightly more sophisticated form of training that we don't need to worry about here because, to be honest, if we were building the system today we would just use a deep neural network instead of HMMs.

So let's summarise what we now know about this method with the rather wonderful name of trajectory tiling. The core idea is simple. We pick a statistical model; here it's an HMM. We generate speech parameters using that statistical model; here, those parameters are effectively what we would have used to drive a vocoder. In fact, that's probably because they recycled an HMM system they already had from a complete statistical parametric system, but they could have used different speech parameters; that would have been OK. And then we basically do pretty straightforward unit selection: we find the sequence of waveform fragments (in the paper they're called tiles; in Festival we call them candidates), and then we concatenate that sequence.

Of course, there are some details to each of those steps. The spectral envelope is represented in a very particular way, as line spectral frequencies; that's just a nice representation of the spectral envelope, and other choices would be possible. The paper does a standard thing, which is to regenerate the training data with the trained statistical model, to provide the acoustic specification of the training data: in other words, the acoustic labels on the training data. And as we said, that's for precisely the same reasons as when we use an independent-feature-formulation target cost: we prefer consistency in the linguistic labels over faithfulness to what the speaker actually said, and err on the side of canonical pronunciations with just minor deviations.
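Before the final point, here is a minimal sketch of the cross-fade itself: once the search has chosen a sequence of tiles, each boundary is overlapped at its stored best lag, with a fade-out of one waveform against a fade-in of the next. The linear ramp, and reusing the correlation window length as the overlap length, are assumptions for illustration.

```python
import numpy as np

def crossfade_join(left, right, lag, overlap=256):
    """Concatenate two waveform tiles, overlap-adding `overlap` samples:
    the tail of `left` fades out while `right`, offset by the best lag
    found by the cross-correlation search, fades in."""
    right = right[lag:]                    # start `right` at the best offset
    fade = np.linspace(0.0, 1.0, overlap)  # linear fade-in ramp
    mixed = left[-overlap:] * (1.0 - fade) + right[:overlap] * fade
    return np.concatenate([left[:-overlap], mixed, right[overlap:]])
```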
And the final nice thing the paper does is that it has a join cost function that does two things at once; we get two for the price of one. Not only does it measure the mismatch, which is the join cost that goes into the search: as a byproduct, it finds a good concatenation point, and we can remember that. So when we do choose a sequence of candidates, we know precisely where the best place to overlap each of them is.

Well, what comes next? Well, it's up to you now. At this point I'm going to stop the videos, because it's futile to make videos about the state of the art: it's going to change all the time. You now need to go to the primary literature, and by that I mean journal papers or conference papers, not textbooks. So although Taylor is excellent, it's dated 2009, so it's not going to tell you about the state of the art, and you need to research for yourself what the current state of the art is. I'm not even going to speculate about what it might be by the time you're watching this video. That's your job: go and find primary literature. Start with recent papers and work your way back, to find out what's happening in speech synthesis today, whether it's neural networks or some other paradigm. I've provided a list of the key journals and conferences; those are good places to start looking, and anything published in those venues is worth considering as something to read.

And that's all, folks.
Bonus material: trajectory tiling
A case study based on one of the readings.