Align mismatch
- This topic has 2 replies, 3 voices, and was last updated 2 years, 4 months ago by Korin Richmond.
-
March 22, 2021 at 14:15 #13907
Hi there,
I was trying to synthesise one of my voices and ran into this problem when executing the command: (build_utts "utts.data" 'unilex-rpx)
In the terminal, I get this:
recipe_0033
recipe_0034
Bad pitchmarking: # 4.3639998.
recipe_0035
recipe_0036
recipe_0037
recipe_0038
recipe_0039
recipe_0040
recipe_0041
recipe_0042
align missmatch at t (0.000000) r (9.500000) recipe_0042
SIOD ERROR: nil
BACKTRACE:
0: (if
(string-matches (item.name (car actual-segments)) closure_regex)
(begin
(…)
(set! actual-segments (cdr actual-segments)))
…)
1: (while
(and segments actual-segments)
(set!
wname
(…))
…)
2: (if
(eq? (cadr l) (quote apml))
(align_utt_apml (car l))
…)
3: (f (car l2))
4: (cons (f (car l2)) r)
5: (set! r (cons (f (car l2)) r))
6: (while l2 (set! r (cons (f (car l2)) r)) (set! l2 (cdr l2)))
7: (mapcar
(lambda
(l)
(format t "%s
" (car l))
…)
p)
8: (build_utts "utts.data" (quote unilex-rpx))
festival> (build_utts "utts.data" 'unilex-rpx)
What can I do? The process got interrupted and I don’t know how to proceed.
Thanks!
-
March 22, 2021 at 18:19 #13909
Have you inspected recipe_0042? Is there anything weird about it? Is the label wrong somehow? What happens if you remove recipe_0042 from utts.data and try running that script again?
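To try the "remove it and re-run" test without editing utts.data by hand, something like the Python sketch below could work. It assumes each entry sits on one line in the usual `( utterance_id "text" )` layout. The function name and the sample sentences are made up for illustration.

```python
def drop_utterance(lines, utt_id):
    """Return the utts.data lines without the entry whose ID is utt_id."""
    kept = []
    for line in lines:
        # Assumed entry layout: ( recipe_0042 "Some text." )
        tokens = line.strip().lstrip("(").split()
        if tokens and tokens[0] == utt_id:
            continue  # skip the suspect utterance
        kept.append(line)
    return kept


# Made-up sample entries, only to show the behaviour.
sample = [
    '( recipe_0041 "Stir the sauce." )\n',
    '( recipe_0042 "Add the flour." )\n',
    '( recipe_0043 "Bake for an hour." )\n',
]
filtered = drop_utterance(sample, "recipe_0042")
print("".join(filtered), end="")
```

In practice you would read the lines from utts.data and write the result to a copy, so that the original list stays intact while you test.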
-
April 1, 2022 at 11:20 #15836
Looking at the Scheme code that runs when you call (build_utts …):
/Volumes/Network/courses/ss/festival/festival_linux/multisyn_build/scm/build_unitsel.scm
can help you understand what’s going on “under the hood”, I think.
In short, the purpose of the “build_utts” function is to build the *.utt Utterance files in your “utt/” directory. These contain all the linguistic information about the utterances in your voice database – in effect they *are* your voice database for your final voice (well, those, together with the waveform resynthesis parameters in your “lpc/” directory, and the join cost coefficients in your “coef/” or “coef2/” directory).
As part of its work, it takes the input text for one sentence from utts.data, does front-end processing to generate a standard EST_Utterance data structure, then adds in other information. A critical part of that information is the phone timings identified by the forced alignment done with HTK (i.e. the “do_alignment…” stage).
For this you’ll see a function called “align_utt_internal”: its job is to load the HTK-derived phone labelling, reconcile it with the phone sequence in the Festival Utterance structure, and then copy across the timings.
The two phone sequences are not expected to match completely, mostly because: i) HTK may have found optional short pauses between words; ii) the “phone_substitution” file in the alignment process allows some phones to be changed to others (e.g. full vowels reduced to schwa). The Scheme code therefore tries to reconcile these anticipated differences between the phone sequence Festival predicted and the phone sequence that HTK forced alignment says matches the recorded speech.
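As a rough illustration of that reconciliation idea, here is a simplified Python sketch. It is not the logic in build_unitsel.scm, which handles more cases. The function name, the "sp" pause label, the phone symbols and the timings are all assumptions made for the example.

```python
def reconcile(predicted, aligned, substitutions, pause="sp"):
    """Walk the aligned labels and match them against the predicted phones.

    predicted:     phone names from the front end
    aligned:       (phone, end_time) pairs from forced alignment
    substitutions: maps a predicted phone to the set of phones the
                   aligner was allowed to use in its place
    Returns the aligned (phone, end_time) pairs once every one is accounted for.
    """
    timed = []
    i = 0  # position in the predicted sequence
    for phone, end in aligned:
        expected = predicted[i] if i < len(predicted) else None
        allowed = substitutions.get(expected, set())
        if expected is not None and (phone == expected or phone in allowed):
            timed.append((phone, end))  # exact match or permitted substitution
            i += 1
        elif phone == pause:
            timed.append((phone, end))  # optional pause found by the aligner
        else:
            raise ValueError(
                f"mismatch: predicted {expected!r}, aligned {phone!r} at {end}"
            )
    if i != len(predicted):
        raise ValueError("aligned labels ended before the predicted phones did")
    return timed


# Made-up example: a full vowel reduced to schwa plus one inserted pause.
predicted = ["a", "n", "d", "i", "t"]
aligned = [("@", 0.08), ("n", 0.15), ("d", 0.21),
           ("sp", 0.30), ("i", 0.38), ("t", 0.45)]
timed = reconcile(predicted, aligned, {"a": {"@"}})
```

Both anticipated differences pass through here. Any other difference raises an error, which is the same kind of situation the "align missmatch" message reports.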
(btw, note the script also adds other information to the utterance structures it builds at this point, e.g. any phones whose log duration looks like an outlier get a “bad dur” flag, and any without pitchmarks get a “no pms” flag, etc. That information is used by the cost calculation during Viterbi search, usually to avoid selecting those suspect units.)
The error you are getting indicates that “align_utt_internal” has failed to reconcile the two phone sequences. In other words, they differ in a way the code does not anticipate.
To diagnose this, look at the phone sequence from forced alignment and compare it to the one the Festival front end produces by default for your voice (i.e. with its phoneset and lexicon). Where do they differ? Have you perhaps done the forced alignment with one accent setting, but are then trying to build the utterances using another? Or did you perhaps add a word to the lexicon to correct a pronunciation at one stage but not the other? Basically, there’s some difference!
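Once you have both phone sequences for the failing utterance as plain lists, a generic diff will show where they part ways. Here is a minimal Python sketch using the standard-library difflib module. The function name and the phone symbols are made up, and getting the two sequences out of Festival and the HTK label file is left to you.

```python
from difflib import SequenceMatcher


def phone_differences(front_end, aligned):
    """List the places where two phone sequences differ.

    Each entry is (kind, front_end_phones, aligned_phones), where kind is
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    matcher = SequenceMatcher(a=front_end, b=aligned, autojunk=False)
    return [
        (tag, front_end[i1:i2], aligned[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up example: the two stages disagree about one vowel.
front_end = ["t", "@", "m", "aa", "t", "ou"]
aligned = ["t", "@", "m", "ei", "t", "ou"]
diffs = phone_differences(front_end, aligned)
print(diffs)  # [('replace', ['aa'], ['ei'])]
```

A difference like this one, which is neither an optional pause nor a listed substitution, would point to a lexicon or accent mismatch between the two stages.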