Synthesis with SoundStream
I think I understand how SoundStream can encode a waveform into compressed embeddings and decode those embeddings back into a waveform, but I am not completely sure how this connects to text-to-speech. Would SoundStream just be used for compressing audio, or can it also convert text to audio?
(This question is probably in the wrong section of the forum, but I couldn’t work out where it should go.)
The SoundStream codes are an alternative to the mel spectrogram.
To do Text-to-Speech, we would train a model to generate SoundStream codes, instead of generating a mel spectrogram.
Before training the system, we would pass all our training data waveforms through the SoundStream encoder, thus converting each waveform into a sequence of codes.
(In the case of a mel spectrogram, we would pass each waveform through a mel-scale filterbank to convert it to a mel spectrogram.)
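To make that concrete, here is a rough Python sketch of this data-preparation step. SoundStream has no single standard public API, so the `encoder` object and its `.encode()` method below are placeholders for whichever implementation you use; the librosa calls in the mel branch are real and just show the parallel step.

```python
import numpy as np
import librosa

# Placeholder: a pretrained SoundStream encoder with an assumed interface
# encode(wav) -> integer codes of shape [n_frames, n_quantizers]
# (one row of residual-VQ indices per compressed frame).
encoder = ...

def prepare_code_targets(wav_paths, sr=24000):
    """Convert each training waveform into a sequence of SoundStream codes."""
    targets = []
    for path in wav_paths:
        wav, _ = librosa.load(path, sr=sr)
        targets.append(encoder.encode(wav))
    return targets

def prepare_mel_targets(wav_paths, sr=22050):
    """The mel-spectrogram equivalent of the same step."""
    targets = []
    for path in wav_paths:
        wav, _ = librosa.load(path, sr=sr)
        mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)
        targets.append(np.log(mel + 1e-5))  # log compression, as is common
    return targets
```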
Then we train a speech synthesis model to predict a code sequence given a phone (or text) input.
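One practical difference from the mel case: SoundStream codes are discrete indices into codebooks (residual vector quantization), so the synthesis model can be trained with a cross-entropy loss over code indices rather than a regression loss over mel frames. Below is a deliberately simplified PyTorch sketch of one training step. The architecture is made up for illustration, and it pretends the phone sequence and the code-frame sequence have the same length, which a real system would handle with an alignment or duration model.

```python
import torch
import torch.nn as nn

CODEBOOK_SIZE = 1024   # assumption: entries per SoundStream codebook
N_QUANTIZERS = 8       # assumption: residual-VQ depth

class PhonesToCodes(nn.Module):
    """Hypothetical model: phone IDs -> logits over code indices,
    one distribution per frame per quantizer."""
    def __init__(self, n_phones, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(n_phones, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, N_QUANTIZERS * CODEBOOK_SIZE)

    def forward(self, phone_ids):
        x = self.embed(phone_ids)
        h, _ = self.encoder(x)
        logits = self.head(h)  # [batch, frames, N_QUANTIZERS * CODEBOOK_SIZE]
        return logits.view(*logits.shape[:2], N_QUANTIZERS, CODEBOOK_SIZE)

model = PhonesToCodes(n_phones=100)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Toy batch: assumes phone and code-frame sequences are already aligned.
phones = torch.randint(0, 100, (2, 50))                         # [batch, seq]
codes = torch.randint(0, CODEBOOK_SIZE, (2, 50, N_QUANTIZERS))  # targets

opt.zero_grad()
logits = model(phones)  # [2, 50, N_QUANTIZERS, CODEBOOK_SIZE]
loss = loss_fn(logits.reshape(-1, CODEBOOK_SIZE), codes.reshape(-1))
loss.backward()
opt.step()
```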
To do speech synthesis, we perform inference with the model to generate a sequence of codes, given a phone (or text) input. We then pass that sequence of codes through the SoundStream decoder, which outputs a waveform.
(In the case of a mel spectrogram, we would pass the mel spectrogram to a neural vocoder, which would output a waveform.)
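Continuing the sketch, inference then looks like this. The `decoder` object is again a placeholder for a pretrained SoundStream decoder, and the greedy argmax over code logits is a simplification (systems in this family typically sample codes autoregressively).

```python
import torch

# Placeholder: a pretrained SoundStream decoder with an assumed interface
# decode(codes) -> waveform samples.
decoder = ...

@torch.no_grad()
def synthesize(model, phone_ids):
    logits = model(phone_ids.unsqueeze(0))    # [1, frames, n_q, codebook]
    codes = logits.argmax(dim=-1).squeeze(0)  # greedy code choice per frame
    return decoder.decode(codes)              # codes -> waveform

# In the mel-spectrogram pipeline, the model would output a mel spectrogram
# here instead, and a neural vocoder would play the decoder's role.
```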