Using listeners to evaluate your synthetic speech is by far the most common approach. But it would be nice if we could find a measure, an algorithm perhaps, that could evaluate objectively. We're using the word "objectively" here to imply that listeners are subjective. So is that possible? In a subjective test, we have subjects, and they express an opinion about the synthetic speech. In an objective test, we have an algorithm. It might be as simple as measuring distances between the synthetic speech and a reference sample, which is almost certainly going to be natural speech. Or we might use some more sophisticated model, one that tries to take account of human perception.

So let's turn our minds to whether an objective measure is even possible. Think about why it would be non-trivial, and then we'll look at what can be done so far. But throughout all of this, let's always remember that we're not yet able to replace listeners.

Simple objective measures involve just comparing synthetic speech to a reference. The reference will be natural speech, and the comparison will probably be some distance. Now, there's already a number of assumptions in doing that. First of all, it assumes that natural speech is somehow the gold standard, and that we can't do better than that. But any natural speaker's rendition of a sentence is just one possible way of saying that sentence, and there are many, many possible ways. It also assumes that the distance between a synthetic sample and this natural reference is somehow going to indicate what listeners would have said about it: that it's going to correlate with their opinion scores.

Most objective measures are just a sum of local differences. So we're going to align the natural and synthetic speech; that could be done with, say, dynamic time warping.
Or we could force our synthesiser to generate speech with the same durational pattern as the natural reference. Then we'll sum the local differences, and that will be some total distance. That would be our measure, and a large distance would imply less natural. But this use of a single natural reference as the gold standard is obviously flawed for the reason that we've already stated: there is more than one correct way of saying any sentence. So the method, as it stands, fails to account for natural variation. We could try to mitigate that by having multiple natural examples, but we can't really capture all of the valid and correct natural ways of saying a sentence.

These simple objective measures are just based on properties of the signal. That's all we have to go on. And the properties are the obvious things: the stuff that might correlate with phonetic identity, for example the spectral envelope, and the stuff that might correlate with prosody, the F0 contour and maybe energy. For the spectral envelope, we just think about a distortion, the difference between natural and synthetic. We would want to do that in a representation that's somehow relevant to perception, that captures some perceptual properties. One thing that we know of that does that is the Mel cepstrum: hence Mel Cepstral Distortion. For an F0 contour, there are a couple of different things we might measure. We might just look at the difference, in Hertz or some other unit, between the natural and synthetic contours, sum those up, and calculate the root mean squared error between the two contours. That's just a sum of local differences. It has the problem that even a small offset between the two contours will show up as a rather large error. And so we might also measure the shape of the two contours with something like correlation.
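That pair of F0 measures can be sketched in a few lines of Python. This is a minimal illustration using NumPy: it assumes the two contours have already been time-aligned, and that only voiced frames, where F0 is actually defined, are being compared.

```python
import numpy as np

def f0_rmse(f0_nat, f0_syn):
    """Root mean squared error between two aligned F0 contours (in Hz)."""
    return float(np.sqrt(np.mean((f0_nat - f0_syn) ** 2)))

def f0_correlation(f0_nat, f0_syn):
    """Pearson correlation between the contours: a measure of shape
    similarity, unaffected by a constant offset or overall scaling."""
    return float(np.corrcoef(f0_nat, f0_syn)[0, 1])
```

For a synthetic contour that is simply the natural one shifted up by 10 Hz, the RMSE is 10 Hz while the correlation is still 1: exactly the situation where the error measure penalises a contour whose shape is right.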
If we produce a synthetic F0 contour that essentially has the right shape, but is offset, or its amplitude is a little too big or too small, we would still predict that it sounds fine to listeners: it has peaks in the right places, for example. So for F0 we might have a pair of measures: error and correlation.

Let's just see what properties those measures are looking at. If we've got a natural and a synthetic version of a sentence, the first thing we're going to do, of course, is make some alignment between them. So we'll extract some features, maybe MFCCs, and do some procedure of alignment. A good way to do that is dynamic time warping, which just makes a non-linear alignment between the two sequences, matching up the most similar frames. Then we can take frames that have been aligned with one another and look at the local difference. Take a frame and its corresponding frame: those two were aligned by the dynamic time warping. From each we have a spectrum, with frequency on one axis and magnitude on the other. We might extract the spectral envelope, represent it with the cepstrum, do the same for the other frame, and then, in the cepstral domain, compare the two and measure the difference, or distortion. We sum that up over every frame of the utterance, and that's Mel Cepstral Distortion.

Now, this is a pretty low-level signal measure. It's the same sort of thing we might do in speech recognition to decide whether two things are the same or different, but it has some properties that seem reasonable. If the two signals are identical, the distortion is zero: that seems reasonable. If the two signals are extremely different, the distortion is high, and that seems reasonable too. So, to that extent, it will correlate with listeners' judgements: very high cepstral distortion would suggest that listeners will say the two samples sound very different, or, if played only the synthetic sample, that it sounds very unnatural. And zero distortion means the synthetic speech is identical to the natural speech.

So for big differences, this measure is probably going to work. However, the relationship between Mel Cepstral Distortion and what listeners think isn't completely smooth and monotonic, and so small differences are not going to be captured so reliably by this measure. And in system development, as we incrementally improve our system, it's actually small differences that we really want to measure the most. Nevertheless, this distance measure, or distortion, in the Mel cepstral domain is very widely used in statistical parametric synthesis. It's very much the same thing that we're actually minimising when training our systems, and so it's a possible measure of system performance, or of the predicted naturalness of the final speech.

Something to think about for yourself: would this measure make sense for synthetic speech made by concatenation of waveforms? Remember that such speech is locally, essentially, perfectly natural: any local frame is natural speech. It might not be the same as the natural reference, but it is perfectly natural. Would this measure capture the naturalness of unit selection speech? That's something to think about for yourself.

Let's move on to the other common thing to measure objectively, and that's F0. Here again, we'll have to make some alignment between synthetic and natural. We'll look at local differences, perhaps just the absolute or squared difference, and sum that up across the utterance so that we get the root mean squared error. Correlation would measure the similarity in the shapes of the two contours: for example, whether they have peaks in the same places. Again, this has some reasonable properties. If the contours are identical, the RMSE is zero, and we would expect listeners to say the prosody was perfect, or perfectly natural.
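The Mel Cepstral Distortion procedure walked through earlier, align the two utterances with dynamic time warping, then sum per-frame cepstral distances, can be sketched as follows. This is a minimal illustration: feature extraction is assumed to have happened already, and the per-frame distance uses one common dB-scaled definition of MCD (excluding the 0th, energy, coefficient), not necessarily the exact variant of any particular toolkit.

```python
import numpy as np

def dtw_path(nat, syn):
    """Dynamic time warping between two feature sequences of shape
    (frames, dims). Returns a list of aligned frame-index pairs."""
    Tn, Ts = len(nat), len(syn)
    # Local distance: Euclidean distance between every pair of frames.
    dist = np.linalg.norm(nat[:, None, :] - syn[None, :, :], axis=-1)
    # Accumulated cost with the usual three predecessor moves.
    cost = np.full((Tn, Ts), np.inf)
    cost[0, 0] = dist[0, 0]
    for i in range(Tn):
        for j in range(Ts):
            if i == 0 and j == 0:
                continue
            prev = min(cost[i - 1, j] if i > 0 else np.inf,
                       cost[i, j - 1] if j > 0 else np.inf,
                       cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            cost[i, j] = dist[i, j] + prev
    # Backtrack from the end to recover the warping path.
    path, (i, j) = [(Tn - 1, Ts - 1)], (Tn - 1, Ts - 1)
    while (i, j) != (0, 0):
        steps = [(a, b) for a, b in [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
                 if a >= 0 and b >= 0]
        i, j = min(steps, key=lambda ab: cost[ab])
        path.append((i, j))
    return path[::-1]

def mel_cepstral_distortion(mc_nat, mc_syn):
    """Mean MCD in dB over DTW-aligned frames.
    mc_nat, mc_syn: (frames, dims) mel cepstra, 0th coefficient excluded."""
    per_frame = []
    for i, j in dtw_path(mc_nat, mc_syn):
        diff = mc_nat[i] - mc_syn[j]
        # dB scaling: (10 / ln 10) * sqrt(2 * sum of squared differences)
        per_frame.append((10.0 / np.log(10.0)) *
                         np.sqrt(2.0 * np.sum(diff ** 2)))
    return float(np.mean(per_frame))
```

Identical sequences give a distortion of exactly zero, and the distortion grows as the cepstra diverge, matching the two sanity checks just mentioned.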
If they're radically different, we get a large error, and that's a prediction of unnaturalness. But once again, there are lots of valid natural contours. Our one reference contour isn't the only way of saying the sentence, and so the error won't perfectly correlate with listeners' opinions.

So the summary message for the simple objective measures is: you'll find them widely used and reported in the literature, especially for statistical parametric synthesis. They have some use as a development tool, because they can be computed very quickly during development without going out to listeners, and they're actually quite close to the measures we're minimising when training statistical models anyway. But we should use them with caution: they're not a replacement for listeners. We might want to report these results alongside subjective results, and if there's a disagreement, we'll probably believe the subjective result, and not the objective measure.

These are very simple, low-level, signal-processing-based measures. There's not a whole lot to do with perception in here. The cepstrum might be the Mel cepstrum, to get some perceptual warping. For F0, we might not work in Hertz but put the contour onto a perceptual scale: maybe just take the log and then express the differences in semitones. So we can put in a little perceptual knowledge, but not a lot. It therefore seems reasonable to try to put a much more sophisticated perceptual model into our objective measure.

There is a place where reasonably sophisticated models of perception, in fact models that have been fitted to perceptual data, are widely used and are reliable. They're even part of international standards. That's the field of telecommunications. So, once again for the transmission of natural speech, which gets distorted in transmission, these measures evaluate either the received signal alone or the difference between the transmitted and received signals.
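The perceptual scaling of F0 mentioned above, comparing contours in semitones rather than Hertz, is just a log transform. A small sketch (the reference frequency of 100 Hz is an arbitrary choice, since it cancels when differences between two contours are taken):

```python
import numpy as np

def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Map F0 in Hz onto a logarithmic semitone scale (12 per octave)."""
    return 12.0 * np.log2(np.asarray(f0_hz) / ref_hz)
```

A doubling of F0 (one octave) is always 12 semitones, whether it is 100 to 200 Hz or 200 to 400 Hz, so equal perceptual steps receive equal weight in the error.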
The standard measure until recently was called PESQ, and more recently a new, better measure has been proposed, called POLQA. Both standards exist in software implementations, and they're used widely by telecommunications companies. As they stand (certainly this is true for PESQ), they don't correlate well, in fact they may not correlate at all, with listeners' opinions of synthetic speech. So we can't just take these black boxes and use them to evaluate the naturalness of synthetic speech, pretending that synthetic speech is just a distorted form of natural speech, because the distortions are completely different from the sorts of distortions we get in the transmission of speech. However, there are attempts to take these measures, which are typically big weighted combinations of many, many different features, and modify them: for example, to re-fit the weights in that weighted combination to listening test results for synthetic speech. So there are some modified versions, and they do work to some limited extent. They're not yet good enough to replace listeners, but they do now correlate, roughly speaking, with listeners' opinions.

So, to wrap this part up: there are standards out there, like PESQ, and it's tempting to think we could just apply them to synthetic speech and they will give us a number. Unfortunately, that number is pretty meaningless. It's not a good predictor of what listeners would say if we did a proper listening test. We could try modifying PESQ, which has been done.
Objective evaluation
It would be convenient to avoid using listeners and instead use an objective, or algorithmic, measure to evaluate synthetic speech. This is possible, but only to a rather limited extent. Use with caution!