Module 6 – acoustic similarity features for join costs
October 31, 2022 at 09:37 #16178
I’m wondering about the sub-features for join costs mentioned in Jurafsky and Martin.
The three main things we’d like to match, so that the join is smooth at each point, are the spectral features (like formant trajectories), energy (/amplitude), and pitch (/F0). Is there a one-to-one mapping between these three things and the features mentioned in JM:
– cepstral distance = similar spectral features
– absolute difference in log power = similar energy
– absolute difference in F0 = similar F0
… the correspondence between the first two isn’t obvious to me, especially since cepstral distance seems to be a general method for computing the difference between features: it is described both as a sub-feature in the join cost equation and as a measure of the success of selecting and joining a phone sequence (cepstral distance between the realised acoustics of the chosen phones and the acoustics of the target specification).
November 1, 2022 at 15:50 #16209
There are many ways of measuring the spectral difference between two speech sounds. Formants would be one way, although it can be difficult to accurately estimate them in speech, and they don’t exist for all speech sounds, so we rarely use them. The cepstrum is a way to parameterise the spectral envelope, and we’ll be properly defining the cepstrum in the upcoming part of Speech Processing about Automatic Speech Recognition.
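If it helps to see the idea rather than wait for the formal definition, here is a minimal sketch of the (real) cepstrum of one windowed frame, assuming NumPy; the features actually used in ASR (mel filterbanks, liftering, MFCCs) are more involved and will be covered there.

```python
import numpy as np

def cepstrum(frame, n_coeffs=13):
    """Simplified real cepstrum of one windowed speech frame.

    The cepstrum is the inverse DFT of the log magnitude spectrum.
    Keeping only the first few coefficients retains the smooth
    spectral envelope and discards the fine harmonic (F0) detail.
    """
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # small offset avoids log(0)
    return np.fft.irfft(log_mag)[:n_coeffs]
```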
Cepstral distance would be a good choice for measuring how similar two speech sounds are whilst ignoring their F0. (That’s what we’ll use it for in Automatic Speech Recognition.) So, you are correct that we can use it as part of the join cost in unit selection speech synthesis for measuring how different the spectral envelopes are on either side of a potential join.
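To make the mapping in your question concrete, here is a minimal, purely illustrative sketch of such a join cost, assuming each candidate unit’s edge frame is summarised by hypothetical 'mfcc', 'log_power' and 'f0' values; the weights and feature extraction are placeholders, not those of any particular system.

```python
import numpy as np

def join_cost(left_frame, right_frame, weights=(1.0, 1.0, 1.0)):
    """Illustrative join cost between the frames either side of a potential join.

    Each frame is assumed (for this sketch) to be a dict with:
      'mfcc'      - vector of cepstral coefficients (spectral envelope)
      'log_power' - scalar log energy
      'f0'        - fundamental frequency in Hz
    """
    w_spec, w_pow, w_f0 = weights

    # cepstral distance ~ mismatch in spectral envelope
    spectral = np.linalg.norm(np.asarray(left_frame['mfcc']) - np.asarray(right_frame['mfcc']))
    # absolute difference in log power ~ mismatch in energy
    power = abs(left_frame['log_power'] - right_frame['log_power'])
    # absolute difference in F0 ~ mismatch in pitch
    pitch = abs(left_frame['f0'] - right_frame['f0'])

    return w_spec * spectral + w_pow * power + w_f0 * pitch
```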
Your question about “cepstral distance between the realised acoustics of the chosen phones and the acoustics of the target specification” is straying into material from the Speech Synthesis course, in a method known as hybrid unit selection. Best to wait for that course, rather than answer the question now.