Silence Removal when Normalising Input Features

This topic has 1 reply, 2 voices, and was last updated 9 years, 1 month ago by Simon.

Author

Posts
- April 20, 2016 at 11:56 #3162
  Joseph M
  Student
  You say: “Most silence frames are automatically removed at this stage, so that the distribution of frames is more balanced. This has been found to improve the training of the DNN.”
  
  This seems to explain why punctuation breaks and end-of-phrase breaks get removed at this step (your recipe says ‘most’, but it seems like ‘all’). As a result, many generated files (after training) sound cut off at the end, depending on where the forced alignment placed the beginning (first state) of the phrase-final silence label. Internal pauses, such as those after commas, are also missing.
  
  Can you explain why this improves training, and is there a way to keep the silence frames that really affect perception, like the ones mentioned?
- April 23, 2016 at 11:33 #3165
  Simon
  Professor
  To be more precise: most frames of all regions labelled as silence are removed.
  
  It improves training (as found empirically) because otherwise the training data is dominated by silence frames, and the network will optimise for generating silence in preference to speech sounds. It is very easy to minimise the error on silence, and that error contributes too much to the total if there are a lot of silence frames.
  
  To prevent the truncation of phrase-final speech sounds, the correct solution is to improve the forced alignment.
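
To make the idea of removing "most frames of all regions labelled as silence" concrete, here is a minimal sketch in plain Python. It is not the recipe's actual code: the function name `keep_mask`, the parameter `keep`, and the toy frame counts are all assumptions made up for illustration. It keeps every speech frame plus a few silence frames at each edge of every silent region, and then compares the proportion of silence before and after.

```python
def keep_mask(is_silence, keep=5):
    """Return a list of booleans: True for frames to keep for training.

    is_silence: list of booleans, True where the alignment labels a frame
                as silence (hypothetical input format).
    keep:       hypothetical parameter, the number of silence frames kept
                at each edge of every silent region.
    """
    n = len(is_silence)
    mask = [not s for s in is_silence]  # every speech frame is kept
    i = 0
    while i < n:
        if not is_silence[i]:
            i += 1
            continue
        # Found a silent region: locate its start (inclusive) and end (exclusive).
        start = i
        while i < n and is_silence[i]:
            i += 1
        end = i
        # Keep up to `keep` frames at the beginning and at the end of the region.
        for j in range(start, min(start + keep, end)):
            mask[j] = True
        for j in range(max(end - keep, start), end):
            mask[j] = True
    return mask


# Toy utterance: leading silence, speech, a short pause, speech, trailing silence.
labels = [True] * 20 + [False] * 30 + [True] * 10 + [False] * 30 + [True] * 40
mask = keep_mask(labels, keep=5)

silence_before = sum(labels) / len(labels)
kept_silence = sum(1 for s, m in zip(labels, mask) if s and m)
silence_after = kept_silence / sum(mask)
print(f"silence fraction before: {silence_before:.2f}, after: {silence_after:.2f}")
```

In this toy case the short internal pause survives whole because it is no longer than `2 * keep` frames, while the long leading and trailing silences are cut down. Note that this only decides which frames labelled as silence are used for training. If the alignment places the silence label too early, the truncated speech frames are still mislabelled, so a sketch like this does not replace fixing the forced alignment.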