Can we absorb the process of alignment into the DNN training process?
When training a DNN-based SPSS system, it seems that we first have to obtain an alignment, for example using the EM algorithm. The alignment tells us the duration of each phone, and we use that duration to decide how many frames to generate for the phone.
This seems tedious. Can we use a DNN to find the alignment between text and audio directly?
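To make the duration step concrete, here is a minimal sketch (my own illustration, not from the post) of turning per-phone durations from a forced alignment into frame counts, assuming a hypothetical 5 ms frame shift. Rounding error is carried forward so the total frame count matches the total duration:

```python
# Assumed frame shift of 5 ms (200 frames per second) -- a common but
# not universal choice in SPSS systems.
FRAME_SHIFT_S = 0.005

def durations_to_frames(phone_durations_s):
    """Convert phone durations (in seconds) to integer frame counts,
    carrying rounding error forward so the totals stay consistent."""
    frames = []
    cumulative_s = 0.0
    cumulative_frames = 0
    for d in phone_durations_s:
        cumulative_s += d
        # Round the *cumulative* boundary, not each duration independently.
        target = round(cumulative_s / FRAME_SHIFT_S)
        frames.append(target - cumulative_frames)
        cumulative_frames = target
    return frames

# e.g. phones lasting 50 ms, 120 ms, and 33 ms
print(durations_to_frames([0.050, 0.120, 0.033]))  # [10, 24, 7]
```

The acoustic model is then asked to generate exactly that many frames per phone, which is why the alignment has to exist before (or alongside) acoustic model training.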
I know that an end-to-end ASR system can find the alignment using the acoustic model itself, e.g. with CTC training. In machine translation, word alignments can be found using an RNN with an attention mechanism.
Is there a possibility to apply these methods from ASR or MT to speech synthesis?
If so, why aren’t there any relevant papers? Is there any particular difficulty in speech synthesis preventing us from applying these methods?
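For reference, the attention idea mentioned above amounts to learning a soft alignment matrix between output (frame) steps and input (phone) steps. A minimal NumPy sketch of dot-product attention, purely illustrative and with made-up state dimensions:

```python
import numpy as np

def soft_alignment(decoder_states, encoder_states):
    """Return attention weights of shape (T_dec, T_enc).
    Each row is a distribution over encoder (text) positions,
    i.e. a soft alignment for one output frame."""
    scores = decoder_states @ encoder_states.T           # (T_dec, T_enc)
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over encoder
    return weights

# Toy example: 4 output frames attending over 3 phones, state dim 8.
rng = np.random.default_rng(0)
A = soft_alignment(rng.normal(size=(4, 8)), rng.normal(size=(3, 8)))
print(A.shape)        # (4, 3)
print(A.sum(axis=1))  # each row sums to 1
```

In such a model the alignment is learned jointly with the acoustic mapping, rather than being fixed in advance by a separate HMM/EM step.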
Hi,
Yes, there have already been some attempts at this. For example:
http://www.isca-speech.org/archive/Interspeech_2016/pdfs/0134.PDF
http://ssw9.net/download/ssw9_proceedings.pdf#page=125
Regards,
Felipe