Autocorrelation and Pitch Prediction in FastPitch Vs. UnitSelec
- This topic has 1 reply, 2 voices, and was last updated 9 months, 2 weeks ago by Simon.
April 8, 2024 at 15:18 #17711
Hi,
About the differences between FastPitch and Unit Selection specifically in terms of how they handle pitch/F0:
Both methods use autocorrelation (as described in Talkin’s RAPT) to predict ground truth F0 for a given phoneme, correct? The difference is that in UnitSelec there is no NN component trained to reduce the loss between predicted F0 and ground truth F0. In UnitSelec, you can synthesise only the F0 variants that are explicitly present in your database, but in FastPitch it is possible to synthesise pitch that isn’t explicitly present in your database, as long as it is an F0 value that can be interpolated from your existing training data.
1) Is this a correct understanding of their differences in terms of how they predict and produce different F0s?
Fundamentally, it seems to me that both UnitSelec and FastPitch rely on autocorrelation (to varying degrees) but FastPitch allows for F0 manipulation at inference, so it has an advantage.
2) The FastPitch paper says the model was trained on 24 hours of single-speaker speech with transcriptions. I’m assuming these recordings include a good range of F0, so there are varying pitch values for the model to be trained on (is this true?). That makes me conclude that both UnitSelec and FastPitch rely on curated recordings (with whatever prosody you want to synthesise), and that even FastPitch can’t just synthesise expressive speech from training data that is relatively unexpressive, right? (Of course, it has other advantages, like allowing you to alter the pitch during inference.)
Thanks!
April 8, 2024 at 16:00 #17712
You need to more clearly separate two independent design choices:
1. How to estimate F0 for recorded speech (which will become the database for a unit selection system, or the training data for a FastPitch model).
The method for estimating F0 (whether autocorrelation-based like RAPT, or something else) is independent of the method used for synthesis. The synthesis methods just need values for F0; they don’t care where those values come from.
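To make that first step concrete, here is a minimal sketch of correlation-based F0 estimation for a single voiced frame. It picks the lag with the highest normalised cross-correlation, which is the core idea behind RAPT, but it is not RAPT itself: there is no voicing decision and no dynamic programming across frames. The function name `estimate_f0`, its parameters, and the synthetic test frame are all assumptions made for illustration.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of one voiced frame from its correlation peak.

    Illustrative only: a real tracker such as RAPT also decides
    voiced/unvoiced and smooths candidates across frames.
    """
    min_lag = int(sample_rate / f0_max)  # shortest period considered
    max_lag = min(int(sample_rate / f0_min), len(frame) - 1)
    if min_lag < 1 or min_lag > max_lag:
        raise ValueError("frame too short for the requested F0 range")

    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        a = frame[:len(frame) - lag]  # frame
        b = frame[lag:]               # same frame, shifted by `lag`
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic 50 ms frame: a pure 100 Hz sine sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 100 * i / sample_rate) for i in range(400)]
f0 = estimate_f0(frame, sample_rate)
print(f"estimated F0: {f0:.1f} Hz")
```

Whatever produces these per-frame values, the output is just a list of numbers that can label a unit selection database or serve as training targets for FastPitch.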
2. Using F0 during synthesis (which will be either the unit selection algorithm, or FastPitch inference).
In a unit selection system that doesn’t employ any signal modification, you are correct in stating that the system can only synthesise speech with F0 values found in the database. FastPitch can, in theory, generate any F0 value.
But both methods use the data to learn how to predict F0, so they are both constrained by what is present in the database. The ‘model’ of F0 prediction in unit selection is implicit: the combination of the target and join cost functions. The model of F0 prediction in FastPitch is explicit.
So, in practice, as you suggest, FastPitch is very constrained by what is present in the training data. In that regard, it’s not so very different to unit selection.
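As a rough illustration of what an implicit F0 model means in unit selection, the toy sketch below runs a Viterbi search over candidate units. The target cost looks only at a linguistic feature (stress) and the join cost penalises F0 discontinuities, so the output F0 contour emerges from the two costs and can only contain values stored in the database. The `Unit` class, the cost definitions, their weights, and the tiny database are all hypothetical and not taken from any particular system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    phone: str
    stressed: bool
    f0: float  # mean F0 of the recorded unit, in Hz

def target_cost(unit, want_stressed):
    # Hypothetical: mismatch on a single linguistic feature.
    return 0.0 if unit.stressed == want_stressed else 1.0

def join_cost(prev, cur):
    # Hypothetical: penalise F0 jumps at the concatenation point.
    return abs(prev.f0 - cur.f0) / 100.0

def select_units(targets, database):
    """Viterbi search for the lowest total target + join cost."""
    prev_layer = None  # list of (cost_so_far, path) per candidate
    for phone, stressed in targets:
        candidates = [u for u in database if u.phone == phone]
        if not candidates:
            raise ValueError(f"no units for phone {phone!r}")
        layer = []
        for u in candidates:
            tc = target_cost(u, stressed)
            if prev_layer is None:
                layer.append((tc, [u]))
            else:
                cost, path = min(
                    ((c + join_cost(p[-1], u), p) for c, p in prev_layer),
                    key=lambda t: t[0],
                )
                layer.append((cost + tc, path + [u]))
        prev_layer = layer
    return min(prev_layer, key=lambda t: t[0])[1]

database = [
    Unit("a", True, 180.0), Unit("a", False, 120.0), Unit("a", True, 210.0),
    Unit("b", False, 125.0), Unit("b", False, 200.0),
]
targets = [("a", True), ("b", False), ("a", False)]
chosen = select_units(targets, database)
print([u.f0 for u in chosen])
```

No F0 value is ever predicted directly here. An explicit predictor, as in FastPitch, would instead output a number for each symbol, but it is fitted to the same kind of data.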