Silence Removal when Normalising Input Features
You say: “Most silence frames are automatically removed at this stage, so that the distribution of frames is more balanced. This has been found to improve the training of the DNN.”
This seems to explain why punctuation breaks and end-of-phrase breaks get removed at this step (your recipe says ‘most’, but it seems like ‘all’). As a result, many generated files (after training) sound cut off at the end, depending on where the forced alignment placed the beginning (first state) of the phrase-final silence label. Internal pauses, such as those after commas, are also missing.
Can you explain why this improves training, and is there a way to keep the silence frames that really affect perception, like the ones mentioned?
To be more precise: most frames in every region labelled as silence are removed.
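As a rough illustration of what "most frames" of a silence region can mean, here is a minimal sketch that drops silence frames but keeps a few next to speech. This is not Merlin's actual implementation: the function name `drop_most_silence`, the `keep_edge` parameter and the per-frame boolean mask are all assumptions made for the example.

```python
import numpy as np

def drop_most_silence(features, is_silence, keep_edge=2):
    """Keep every speech frame plus up to `keep_edge` silence frames
    on each side of speech. Hypothetical helper, not a Merlin function."""
    is_silence = np.asarray(is_silence, dtype=bool)
    speech = ~is_silence
    keep = speech.copy()
    for shift in range(1, keep_edge + 1):
        # Mark frames that lie `shift` frames before a speech frame.
        keep[:-shift] |= speech[shift:]
        # Mark frames that lie `shift` frames after a speech frame.
        keep[shift:] |= speech[:-shift]
    return features[keep], keep

# 12 frames: 4 silence, 3 speech, 5 silence (labels assumed to come from alignment).
is_silence = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1], dtype=bool)
features = np.arange(12).reshape(12, 1)
kept, mask = drop_most_silence(features, is_silence, keep_edge=2)
```

Here frames 2 to 8 survive: the three speech frames and two silence frames on either side. Note that the result depends entirely on where the labels put the silence boundaries, so a misplaced boundary still removes real speech.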
It improves training (as found empirically) because otherwise the training data is dominated by silence frames and the network will optimise for generating silence in preference to speech sounds (it’s very easy to minimise the error on silence, and that contributes too much to total error if there are a lot of silence frames).
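To make the imbalance argument concrete, here is a small sketch with made-up numbers: a hypothetical utterance of 70 silence frames and 30 speech frames, all with the same per-frame error at the start of training. The helper `silence_share_of_loss` is an assumption for illustration only.

```python
import numpy as np

def silence_share_of_loss(per_frame_error, is_silence):
    """Fraction of the summed per-frame error that comes from silence frames."""
    per_frame_error = np.asarray(per_frame_error, dtype=float)
    is_silence = np.asarray(is_silence, dtype=bool)
    return per_frame_error[is_silence].sum() / per_frame_error.sum()

# Hypothetical utterance: 70 silence frames followed by 30 speech frames.
is_silence = np.array([True] * 70 + [False] * 30)
error = np.ones(100)  # equal error on every frame (an assumption)
share_before = silence_share_of_loss(error, is_silence)

# Keep only 10 of the 70 silence frames, and all speech frames.
keep = np.ones(100, dtype=bool)
keep[10:70] = False
share_after = silence_share_of_loss(error[keep], is_silence[keep])
```

With these numbers, silence accounts for 70% of the total error before removal and 25% afterwards, so the gradient is driven mostly by speech frames once the silence is trimmed.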
To prevent the truncation of phrase-final speech sounds, the correct solution is to improve the forced alignment.