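A label alignment error means that the start or end time of a phone label is wrong. Small errors are not the main problem: we derive diphone boundaries from these phone boundaries, and a diphone boundary falls inside a phone rather than at its edge, so a slightly misplaced phone boundary moves it only a little.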
What to listen for
You should listen for a substantial error which leads to either a sound missing from a synthetic utterance or (and this is easier to hear) an extraneous sound inserted. Find a nice short test sentence that contains such an error.
Very occasionally, the forced alignment works so well that you may find it hard to identify an obvious error, so limit the amount of time you spend looking for one. If you think you’ve got a voice with no alignment errors, lucky you, but send me some audio samples for checking – the chances are that you are wrong!
How to find out which units were used to synthesise a sentence
festival> (set! myutt (SayText "Hello world."))
festival> (utt.relation.print myutt 'Unit)
In the printed Unit relation, look for the source_utt and source_end features to find the source utterance and location of each diphone. Decide which of these utterances must contain the misaligned label.
How to correct the error
The label files from forced alignment will contain many sp (short pause) labels of zero duration. Wavesurfer does not handle these correctly, so you need to manually remove these zero-duration sp labels (using a plain text editor such as Aquamacs) first. You only need to do this for an utterance for which you wish to correct the label alignments. Load the waveform and the labels for the utterance into Wavesurfer. Now, you should move the label times earlier or later, to the correct alignment with the waveform. Do not change their names; do not add or delete any labels. Save the labels and quit Wavesurfer. Repeat this procedure for any other utterances that you need to correct.
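If you have several utterances to correct, removing the zero-duration sp labels by hand gets tedious. Here is a minimal Python sketch of that one step. It assumes a label format the text above does not confirm: an optional header ending in a line containing only "#", followed by one label per line in the form "end-time colour label-name", with each label starting where the previous one ends. The function names are hypothetical. Check one of your own label files before relying on it.

```python
from pathlib import Path


def drop_zero_duration_sp(lines, pause_label="sp", tol=1e-6):
    """Return the label lines with zero-duration pause labels removed.

    Assumed format (check your own files): an optional header ending in a
    line containing only '#', then one label per line as
    '<end time> <colour> <label name>'.
    """
    kept = []
    in_header = any(line.strip() == "#" for line in lines)
    prev_end = 0.0  # a label starts where the previous label ends
    for line in lines:
        if in_header:
            kept.append(line)
            if line.strip() == "#":
                in_header = False
            continue
        fields = line.split()
        try:
            end = float(fields[0])
            name = fields[2]
        except (IndexError, ValueError):
            kept.append(line)  # blank or unrecognised line: leave untouched
            continue
        if name == pause_label and abs(end - prev_end) <= tol:
            continue  # zero-duration pause: drop it
        kept.append(line)
        prev_end = end
    return kept


def clean_label_file(path):
    """Rewrite one label file in place, keeping a '.orig' backup copy."""
    path = Path(path)
    lines = path.read_text().splitlines(keepends=True)
    backup = path.with_name(path.name + ".orig")
    backup.write_text("".join(lines))
    path.write_text("".join(drop_zero_duration_sp(lines)))
```

Call clean_label_file only on the label files you intend to open in Wavesurfer. All other lines, including sp labels with a non-zero duration, are written back unchanged.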
You now need to rebuild the utterance structures, to incorporate the new label times. You will also need to rebuild the stripped join cost coefficients.
Evaluating the effect of fixing the error
Fixing a single error won’t have a measurable effect on the average quality of the voice, so you shouldn’t try to measure any improvement in a formal listening test. Make an informal assessment and decide whether you fixed the error.
Moving a phone label boundary might lead to the unit selection algorithm choosing a different unit sequence, possibly not including the unit you just corrected!
To get around this, you can force Festival to only select from the same database utterances as before you made the correction:
- Make a new utts.data file listing only those utterances
- Fix the error and rebuild the voice
- Run Festival using the new utts.data file
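The first step in the list above can be scripted. The following is a sketch only, and it assumes that each line of utts.data has the form ( utterance_name "text of the utterance" ), which you should confirm against your own file. The function names and the output filename in the usage note are hypothetical.

```python
def filter_utts_data(lines, keep_ids):
    """Return only the utts.data lines whose utterance name is in keep_ids.

    Assumed line format (check your own file):
        ( utterance_name "text of the utterance" )
    """
    keep = set(keep_ids)
    kept = []
    for line in lines:
        # Drop the opening bracket, then take the first token as the name.
        fields = line.strip().lstrip("(").split()
        if fields and fields[0] in keep:
            kept.append(line)
    return kept


def write_reduced_utts_data(src, dst, keep_ids):
    """Read utts.data from src and write the reduced listing to dst."""
    with open(src) as f:
        lines = f.readlines()
    with open(dst, "w") as f:
        f.writelines(filter_utts_data(lines, keep_ids))
```

The utterance names to keep are the source_utt values you noted when inspecting the Unit relation. For example, write_reduced_utts_data("utts.data", "utts_subset.data", names) leaves the original file untouched and writes the reduced listing to a new file.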