Autocorrelation and Pitch Prediction in FastPitch Vs. UnitSelec

This topic has 1 reply, 2 voices, and was last updated 10 months, 4 weeks ago by Simon.

Viewing 1 reply thread

Author

Posts
- April 8, 2024 at 15:18 #17711
  Manisha V
  Student
  Hi,
  
  About the differences between FastPitch and Unit Selection specifically in terms of how they handle pitch/F0:
  
  Both methods use autocorrelation (as described in Talkin’s RAPT) to predict ground truth F0 for a given phoneme, correct? The difference is that in UnitSelec there is no NN component that is trained to reduce the loss between predicted F0 and ground truth F0. In UnitSelec, you can synthesise only the F0 variants that are explicitly present in your database, but in FastPitch it is possible to synthesise pitch that isn’t explicitly present in your database, as long as it is an F0 value that can be interpolated from your existing training data.
  
  1) Is this a correct understanding of their differences in terms of how they predict and produce different F0s?
  
  Fundamentally, it seems to me that both UnitSelec and FastPitch rely on autocorrelation (to varying degrees) but FastPitch allows for F0 manipulation at inference, so it has an advantage.
  
  2) The FastPitch paper says it is trained on 24 hours of single-speaker speech with transcriptions. I’m assuming these recordings include a good range of F0, so there are varying pitch values for the model to be trained on (is this true?). It makes me conclude that both UnitSelec and FastPitch rely on curated recordings (with whatever prosody you want to synthesise) and that even FastPitch can’t just synthesise expressive speech from training data that is relatively unexpressive, right? (Of course, it has other advantages, like allowing you to alter the pitch during inference.)
  
  Thanks!
- April 8, 2024 at 16:00 #17712
  Simon
  Professor
  You need to more clearly separate two independent design choices:
  
  1. How to estimate F0 for recorded speech (which will become the database for a unit selection system, or the training data for a FastPitch model).
  
  The method for estimating F0 (whether autocorrelation-based like RAPT, or something else) is independent of the method used for synthesis. The synthesis methods just need values for F0; they don’t care where those values come from.
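To make the first design choice concrete, here is a minimal sketch of autocorrelation-based F0 estimation for a single voiced frame. It is a bare peak picker, not RAPT itself (RAPT uses normalised cross-correlation plus dynamic programming across frames), and the function name `estimate_f0_autocorr`, the search range of 60 to 400 Hz, and the synthetic test signal are all assumptions made for illustration.

```python
import numpy as np

def estimate_f0_autocorr(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of one voiced frame from the peak of its autocorrelation.

    Illustrative only: no voicing decision, no tracking across frames.
    """
    frame = frame - np.mean(frame)
    # Full autocorrelation; keep only the non-negative lags.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Restrict the search to lags that correspond to plausible F0 values.
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag

# Synthetic "voiced" frame: a 200 Hz fundamental plus one harmonic.
sample_rate = 16000
t = np.arange(int(0.04 * sample_rate)) / sample_rate
frame = np.sin(2 * np.pi * 200.0 * t) + 0.5 * np.sin(2 * np.pi * 400.0 * t)

f0 = estimate_f0_autocorr(frame, sample_rate)
print(f"Estimated F0: {f0:.1f} Hz")
```

The values this returns could be stored alongside units in a unit selection database or used as training targets for a model; nothing in the estimator depends on which synthesis method consumes them.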
  
  2. Using F0 during synthesis (which will be either the unit selection algorithm, or FastPitch inference).
  
  In a unit selection system that doesn’t employ any signal modification, you are correct in stating that the system can only synthesise speech with F0 values found in the database. FastPitch can, in theory, generate any F0 value.
  
  But both methods use the data to learn how to predict F0, so they are both constrained by what is present in the database. The ‘model’ of F0 prediction in unit selection is implicit: the combination of the target and join cost functions. The model of F0 prediction in FastPitch is explicit.
  
  So, in practice, as you suggest, FastPitch is very constrained by what is present in the training data. In that regard, it’s not so very different to unit selection.
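The contrast between implicit and explicit F0 prediction can be sketched with a toy example. Everything here is hypothetical: the five-unit database, the single context feature (position in phrase), the nearest-context target cost with no join cost, and the straight-line fit standing in for a trained pitch predictor. Neither function reflects how a real unit selection system or FastPitch is implemented.

```python
import numpy as np

# Hypothetical toy database: one context feature per unit (position in the
# phrase, from 0 to 1) and the F0 in Hz that was measured for that unit.
db_position = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
db_f0 = np.array([220.0, 210.0, 195.0, 185.0, 170.0])

def unit_selection_f0(position):
    """Implicit prediction: choose the unit whose context matches best.

    The F0 comes attached to the chosen unit, so the output is always a
    value that already exists in the database.
    """
    best = int(np.argmin(np.abs(db_position - position)))
    return float(db_f0[best])

# Explicit prediction: fit a straight line to the same data. This stands in
# for a trained network and is an assumption made for illustration.
slope, intercept = np.polyfit(db_position, db_f0, deg=1)

def explicit_model_f0(position):
    """Explicit prediction: evaluate the fitted function at the query."""
    return float(slope * position + intercept)

query = 0.6
selected = unit_selection_f0(query)
predicted = explicit_model_f0(query)
print(f"unit selection: {selected:.1f} Hz, explicit model: {predicted:.1f} Hz")
```

The explicit model can output a value that no unit in the database has, but the trend it follows was learned entirely from that database, which is the sense in which both approaches are constrained by the data.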