Synthesis with SoundStream
I think I understand how SoundStream can encode a waveform into compressed embeddings and decode those embeddings back into a waveform, but I am not completely sure how this connects to text-to-speech. Would SoundStream just be used for compressing audio, or can it also convert text to audio?
(This question is probably in the wrong section of the forum, but I couldn’t work out where it should go.)
The SoundStream codes are an alternative to the mel spectrogram.
To do Text-to-Speech, we would train a model to generate SoundStream codes, instead of generating a mel spectrogram.
Before training the system, we would pass all our training data waveforms through the SoundStream encoder, thus converting each waveform into a sequence of codes.
(In the case of a mel spectrogram, we would pass each waveform through a mel-scale filterbank to convert it to a mel spectrogram.)
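To make that concrete, here is a rough Python sketch of this data-preparation step. SoundStream has no single standard public API, so the `encoder` object and its `.encode()` method below are placeholders for whichever implementation you use; the librosa calls in the mel branch are real and just show the parallel step.

```python
import numpy as np
import librosa

# Placeholder: a pretrained SoundStream encoder with an assumed interface
# encode(wav) -> integer codes of shape [n_frames, n_quantizers]
# (one row of residual-VQ indices per compressed frame).
encoder = ...

def prepare_code_targets(wav_paths, sr=24000):
    """Convert each training waveform into a sequence of SoundStream codes."""
    targets = []
    for path in wav_paths:
        wav, _ = librosa.load(path, sr=sr)
        targets.append(encoder.encode(wav))
    return targets

def prepare_mel_targets(wav_paths, sr=22050):
    """The mel-spectrogram equivalent of the same step."""
    targets = []
    for path in wav_paths:
        wav, _ = librosa.load(path, sr=sr)
        mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)
        targets.append(np.log(mel + 1e-5))  # log compression, as is common
    return targets
```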
Then we train a speech synthesis model to predict a code sequence given a phone (or text) input.
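One practical difference from the mel case: SoundStream codes are discrete indices into codebooks (residual vector quantization), so the synthesis model can be trained with a cross-entropy loss over code indices rather than a regression loss over mel frames. Below is a deliberately simplified PyTorch sketch of one training step. The architecture is made up for illustration, and it pretends the phone sequence and the code-frame sequence have the same length, which a real system would handle with an alignment or duration model.

```python
import torch
import torch.nn as nn

CODEBOOK_SIZE = 1024   # assumption: entries per SoundStream codebook
N_QUANTIZERS = 8       # assumption: residual-VQ depth

class PhonesToCodes(nn.Module):
    """Hypothetical model: phone IDs -> logits over code indices,
    one distribution per frame per quantizer."""
    def __init__(self, n_phones, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(n_phones, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, N_QUANTIZERS * CODEBOOK_SIZE)

    def forward(self, phone_ids):
        x = self.embed(phone_ids)
        h, _ = self.encoder(x)
        logits = self.head(h)  # [batch, frames, N_QUANTIZERS * CODEBOOK_SIZE]
        return logits.view(*logits.shape[:2], N_QUANTIZERS, CODEBOOK_SIZE)

model = PhonesToCodes(n_phones=100)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Toy batch: assumes phone and code-frame sequences are already aligned.
phones = torch.randint(0, 100, (2, 50))                         # [batch, seq]
codes = torch.randint(0, CODEBOOK_SIZE, (2, 50, N_QUANTIZERS))  # targets

opt.zero_grad()
logits = model(phones)  # [2, 50, N_QUANTIZERS, CODEBOOK_SIZE]
loss = loss_fn(logits.reshape(-1, CODEBOOK_SIZE), codes.reshape(-1))
loss.backward()
opt.step()
```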
To do speech synthesis, we perform inference with the model to generate a sequence of codes, given a phone (or text) input. We then pass that sequence of codes through the SoundStream decoder, which outputs a waveform.
(In the case of a mel spectrogram, we would pass the mel spectrogram to a neural vocoder, which would output a waveform.)
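Continuing the sketch, inference then looks like this. The `decoder` object is again a placeholder for a pretrained SoundStream decoder, and the greedy argmax over code logits is a simplification (systems in this family typically sample codes autoregressively).

```python
import torch

# Placeholder: a pretrained SoundStream decoder with an assumed interface
# decode(codes) -> waveform samples.
decoder = ...

@torch.no_grad()
def synthesize(model, phone_ids):
    logits = model(phone_ids.unsqueeze(0))    # [1, frames, n_q, codebook]
    codes = logits.argmax(dim=-1).squeeze(0)  # greedy code choice per frame
    return decoder.decode(codes)              # codes -> waveform

# In the mel-spectrogram pipeline, the model would output a mel spectrogram
# here instead, and a neural vocoder would play the decoder's role.
```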