Can we absorb the process of alignment into the DNN training process?
When training a DNN-based SPSS system, it seems that we first have to obtain an alignment, for example using the EM algorithm. The alignment tells us the duration of each phone, and we use that duration to decide how many frames to generate for the phone.
This seems tedious. Can we use a DNN to find the alignment between text and audio directly?
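To make the duration step concrete, here is a minimal sketch (my own illustration, not from the post) of turning per-phone durations from a forced alignment into frame counts, assuming a hypothetical 5 ms frame shift. Rounding error is carried forward so the total frame count matches the total duration:

```python
# Assumed frame shift of 5 ms (200 frames per second) -- a common but
# not universal choice in SPSS systems.
FRAME_SHIFT_S = 0.005

def durations_to_frames(phone_durations_s):
    """Convert phone durations (in seconds) to integer frame counts,
    carrying rounding error forward so the totals stay consistent."""
    frames = []
    cumulative_s = 0.0
    cumulative_frames = 0
    for d in phone_durations_s:
        cumulative_s += d
        # Round the *cumulative* boundary, not each duration independently.
        target = round(cumulative_s / FRAME_SHIFT_S)
        frames.append(target - cumulative_frames)
        cumulative_frames = target
    return frames

# e.g. phones lasting 50 ms, 120 ms, and 33 ms
print(durations_to_frames([0.050, 0.120, 0.033]))  # [10, 24, 7]
```

The acoustic model is then asked to generate exactly that many frames per phone, which is why the alignment has to exist before (or alongside) acoustic model training.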
I know that an end-to-end ASR system can find the alignment using the acoustic model itself, e.g. with CTC training. In machine translation, word alignments can be found using an RNN with an attention mechanism.
Is there a possibility to apply these methods from ASR or MT to speech synthesis?
If so, why aren’t there any relevant papers? Is there any particular difficulty in speech synthesis preventing us from applying these methods?
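For reference, the attention idea mentioned above amounts to learning a soft alignment matrix between output (frame) steps and input (phone) steps. A minimal NumPy sketch of dot-product attention, purely illustrative and with made-up state dimensions:

```python
import numpy as np

def soft_alignment(decoder_states, encoder_states):
    """Return attention weights of shape (T_dec, T_enc).
    Each row is a distribution over encoder (text) positions,
    i.e. a soft alignment for one output frame."""
    scores = decoder_states @ encoder_states.T           # (T_dec, T_enc)
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over encoder
    return weights

# Toy example: 4 output frames attending over 3 phones, state dim 8.
rng = np.random.default_rng(0)
A = soft_alignment(rng.normal(size=(4, 8)), rng.normal(size=(3, 8)))
print(A.shape)        # (4, 3)
print(A.sum(axis=1))  # each row sums to 1
```

In such a model the alignment is learned jointly with the acoustic mapping, rather than being fixed in advance by a separate HMM/EM step.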
Hi,
Yes, there have already been some attempts at this. For example:
http://www.isca-speech.org/archive/Interspeech_2016/pdfs/0134.PDF
http://ssw9.net/download/ssw9_proceedings.pdf#page=125
Regards,
Felipe