Align mismatch

This topic has 2 replies, 3 voices, and was last updated 2 years, 11 months ago by Korin Richmond.

- March 22, 2021 at 14:15 #13907
  Elisa G
  Student
  Hi there,
  
  I was trying to synthesise one of my voices and ran into this problem when executing the command: (build_utts "utts.data" 'unilex-rpx)
  
  In the terminal, I get this:
  recipe_0033
  recipe_0034
  Bad pitchmarking: # 4.3639998.
  recipe_0035
  recipe_0036
  recipe_0037
  recipe_0038
  recipe_0039
  recipe_0040
  recipe_0041
  recipe_0042
  align missmatch at t (0.000000) r (9.500000) recipe_0042
  SIOD ERROR: nil
  BACKTRACE:
  0: (if
  (string-matches (item.name (car actual-segments)) closure_regex)
  (begin
  (…)
  (set! actual-segments (cdr actual-segments)))
  …)
  1: (while
  (and segments actual-segments)
  (set!
  wname
  (…))
  …)
  2: (if
  (eq? (cadr l) (quote apml))
  (align_utt_apml (car l))
  …)
  3: (f (car l2))
  4: (cons (f (car l2)) r)
  5: (set! r (cons (f (car l2)) r))
  6: (while l2 (set! r (cons (f (car l2)) r)) (set! l2 (cdr l2)))
  7: (mapcar
  (lambda
  (l)
  (format t "%s
  " (car l))
  …)
  p)
  8: (build_utts "utts.data" (quote unilex-rpx))
  festival> (build_utts "utts.data" 'unilex-rpx)
  
  What can I do?
  The process got interrupted and I don’t know how to proceed.
  
  Thanks!
- March 22, 2021 at 18:19 #13909
  Aidan P
  Student
  Have you inspected recipe_0042? Is there anything weird about it? Is the label wrong somehow? What happens if you remove recipe_0042 from utts.data and try running that script again?
- April 1, 2022 at 11:20 #15836
  Korin Richmond
  Professor
  If you look at the scheme code for what happens when you call (build_utts …):
  
  /Volumes/Network/courses/ss/festival/festival_linux/multisyn_build/scm/build_unitsel.scm
  
  it can help you understand what’s going on “under the hood” I think.
  
  In short, the purpose of the “build_utts” function is to build the *.utt Utterance files in your “utt/” directory. These contain all the linguistic information about the utterances in your voice database – in effect they *are* your voice database for your final voice (well, those, together with the waveform resynthesis parameters in your “lpc/” directory, and the join cost coefficients in your “coef/” or “coef2/” directory).
  
  As part of its work, it takes the input text for one sentence from utts.data, does front-end processing to generate a standard EST_Utterance data structure, then adds in other information. A critical part of that information is the phone timings identified by the forced alignment done with HTK (i.e. the “do_alignment…” stage…).
  
  For this you’ll see a function called “align_utt_internal” – its job is to load the HTK-derived phone labelling, reconcile it with the phone sequence in the Festival Utterance structure, and then copy the timings across.
  
  The two phone sequences are not expected to match completely, mainly because: i) HTK may have found optional short pauses between words; ii) the “phone_substitution” file used in the alignment process allows some phones to be changed to others (e.g. full vowels reduced to schwa). The scheme code therefore tries to reconcile these anticipated differences between the phone sequence Festival predicted and the phone sequence that HTK forced alignment says matches the recorded speech.
  
  (By the way, the script also adds other information to the utterance structures it builds at this point. For example, any phones whose log duration looks like an outlier get a “bad dur” flag, and any that have no pitchmarks get a “no pms” flag. That information is used in the cost calculation during Viterbi search, usually to avoid selecting those suspect units.)
  
  The error you are getting indicates that “align_utt_internal” has failed to reconcile the two phone sequences. In other words, they differ in a way the code does not expect.
  
  To diagnose this, you should look at the phone sequence from forced alignment and compare it to the one the Festival front end produces by default for your voice (i.e. phoneset and lexicon). Where do they differ? Have you perhaps done the forced alignment with one accent setting, but are then trying to build the utterances using another? Or did you perhaps add a word to the lexicon to correct a pronunciation at one stage but not the other? Basically, there’s some difference!
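
To make the reconciliation idea above concrete, here is a minimal Python sketch of the kind of comparison described. It is not the code in build_unitsel.scm and does not reproduce “align_utt_internal”. The function name `reconcile`, the pause labels `sp` and `sil`, the phone symbols, and the shape of the substitution table are all assumptions made for illustration. It walks the two phone sequences in step, skips pauses that appear only in the aligned sequence, accepts listed substitutions, and reports where the first unexplained difference occurs.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumed labels for optional pauses inserted by the aligner (hypothetical).
PAUSE_LABELS: Set[str] = {"sp", "sil"}


def reconcile(
    predicted: List[str],
    aligned: List[str],
    substitutions: Dict[str, Set[str]],
) -> Optional[Tuple[int, int]]:
    """Compare a predicted phone sequence with an aligned one.

    Returns None if the sequences agree once optional pauses and allowed
    substitutions are taken into account. Otherwise returns the pair of
    indices (into predicted, into aligned) of the first unexplained
    difference.
    """
    i = j = 0
    while i < len(predicted) and j < len(aligned):
        p, a = predicted[i], aligned[j]
        if p == a or a in substitutions.get(p, set()):
            # Same phone, or a substitution the table allows.
            i += 1
            j += 1
        elif a in PAUSE_LABELS:
            # A pause present only in the aligned sequence: skip it.
            j += 1
        else:
            return (i, j)
    # Trailing pauses in the aligned sequence are also acceptable.
    while j < len(aligned) and aligned[j] in PAUSE_LABELS:
        j += 1
    if i < len(predicted) or j < len(aligned):
        # One sequence ran out before the other.
        return (i, j)
    return None
```

For the failing utterance, you would fill `predicted` with the phones the front end generates and `aligned` with the phone labels from forced alignment, both extracted by hand or by your own script. A non-None result points at the position to inspect first, for example a word whose lexicon entry differs between the two stages.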