Forum Replies Created
As discussed (+hopefully solved!) in the lab today, you need to:
– follow the instructions in the Festival manual on defining a new lexicon
https://www.cstr.ed.ac.uk/projects/festival/manual/festival_13.html#:~:text=A%20Lexicon%20in%20Festival%20is,words%20not%20in%20either%20list (including compiling the lexicon first)
– and then add a few lines in the build_unitsel.scm file (so, save a copy to your home space and then edit it), in order to define what to do if the user who’s building a voice requests your new lexicon (see the sketch below).
(solved in the lab, but summarised above for completeness and future users!)
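As a minimal sketch of the lexicon-definition part (all names, paths and pronunciations below are illustrative, not the course’s actual ones – follow the manual page above for the details):

(lex.create "mylex")                         ; register a new lexicon called "mylex"
(lex.set.phoneset "unilex")                  ; the phone set the entries are written in
(lex.set.compile.file "/path/to/mylex.out")  ; the compiled lexicon, built beforehand with (lex.compile ...)
(lex.set.lts.method 'lts_rules)              ; what to do for words not in the lexicon
(lex.set.lts.ruleset 'nrl)                   ; illustrative - use whatever LTS setup matches your phone set
(lex.add.entry '("zzyzx" nil (((z ai) 1) ((z i k s) 0)))) ; optional addendum entry (pronunciation made up!)

Then, roughly speaking, the extra lines in your copy of build_unitsel.scm would select this lexicon (e.g. via (lex.select "mylex")) when the person building a voice asks for it.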
The ‘oir’ phone is in the Festival phone set definition for unilex (see /Volumes/Network/courses/ss/festival/festival_linux/festival/lib/unilex_phones.scm).
Or do you mean the phone_list file that gets put in the alignment directory when doing force alignment with the multisyn build tools? If so, you’re right, that vowel doesn’t seem to be there. Who exactly created that file is now lost in the mists of time, but it *is* a very rare phone!
If that phone appears in the phone transcriptions of sentences you need to align with HTK/multisyn-build tools, then just add it to the phone_list file (in the “alignment” subdirectory).
OK, so it turns out the (tts_file …) function used by this script creates utterances of the “Token” type, which doesn’t have the “iform” feature, so you can’t simply print that feature out unfortunately. Instead, you can just print out all the items in the Token relation.
Here’s some Scheme code for that. If you add this “utt.print.tokens” function to the scheme file, you can then call it at the point where the after-synth hooks print the flat utterance representation for each sentence.
(define (utt.print.tokens utt)
  "(utt.print.tokens UTT)
Print tokens of UTT (items in Token relation) to standard out."
  (mapcar
   (lambda (token)
     (format t "%s " (item.feat token 'name)))
   (utt.relation.leafs utt 'Token))
  (format t "\n")
)
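If you also want it to run automatically after every synthesis in a plain interactive Festival session (rather than via the exercise script’s own hook setup), a hedged way to do that is via the standard after_synth_hooks variable:

(set! after_synth_hooks (list utt.print.tokens))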
Look higher up in the scheme code:
(utt.save utt (path-append globalSaveDir uttfilename)) ; save the utterance structure
Festival help on utt.save:
festival> (utt.save
(utt.save UTT FILENAME TYPE)
Save UTT in FILENAME in an Xlabel-like format. If FILENAME is "-"
then print output to stdout. TYPE may be nil or est_ascii

Hmm, that seems to suggest the feature is indeed not defined in the utterance structure. The easiest way to check what information is contained in your utterance structures would be to save one (or all) and post an example here (they are just text files, so you can also view them yourself).
You’ll see indications in the code of how/where the utterance file gets saved – can you attach one of those files here please?
Aha, yes, the bit you’re looking for here is:
(utt.feat utt "iform")
This will look up a feature called “iform” on the utterance object – this is the “input form” of the text for a Text style utterance object.
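For example, a minimal sketch you could try in an interactive Festival session (the sentence is just an illustration):

(set! myutt (utt.synth (Utterance Text "Hello there.")))   ; build and synthesise a Text-type utterance
(format t "iform: %s\n" (utt.feat myutt "iform"))          ; print its input form
(utt.save myutt "-")                                       ; dump the whole structure to stdout for inspection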
Yes, some of the Multisyn build tools are Python scripts which need an extension module wrapper around the Edinburgh Speech Tools (compiled native binaries). There are 2 options:
i) if you look in the $EST/config/config file in your EST directory, you’ll see an option to switch on compilation of those wrappers to match your architecture. That might just work.
ii) The last section of CPSLP was on extension modules and used a wrapper around EST as an exercise. It’s available for Mac (Intel) & Linux, for Python versions 3.8 and 3.9. You could try importing that (on a Mac you’d need to use an x86_64 version of Python). If that works, the interface is *slightly* different to the one used by the multisyn build tools, so you’d need to tweak the multisyn build tools’ Python scripts that use it – nothing major, just a slightly different, updated API.
Yes, you need to use the same front end (i.e. phone set, lexicon, G2P model, etc.) for alignment and voice building as you will use at run time for the resulting synthetic voice.
It sounds like you may have run the initial MLF creation step using a different Festival voice? (Maybe the default one that’s loaded when you start Festival?)
Most likely, yes.
Similar topic: #15813

Your my_lexicon.scm file tries to use some functions/symbols from the “build_unitsel.scm” file – as this error indicates, when running the code in ./my_lexicon.scm Festival is looking for “setup_phoneset_and_lexicon”, which it cannot find unless you load the build_unitsel.scm file first. If you look at the contents of build_unitsel.scm (which I recommend doing – it’s instructional!), you’ll see it’s indeed defined there.
If you don’t want to include that “build_unitsel.scm” on the command line each time you want to use my_lexicon.scm, you could always just insert a line to load it at the top of the my_lexicon.scm file, before any of the code tries to use bits found there. You can do that, for example, with just the (load …) function.
btw, every time festival starts up, it checks for a file at ~/.festivalrc. If that exists, it runs the code contained in it. So, if there’s something you want done automatically *every time* festival starts, then you can just put the code in that file (i.e. in this case, to load these two scheme files). But remember – it will do it every time!
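For example, a minimal sketch (the build_unitsel.scm path below is illustrative – point it at wherever your copy lives). At the top of my_lexicon.scm:

(load "/path/to/your/copy/of/build_unitsel.scm") ; defines setup_phoneset_and_lexicon etc.

or, to have both loaded automatically every time Festival starts, in ~/.festivalrc:

(load "/path/to/your/copy/of/build_unitsel.scm")
(load "/path/to/my_lexicon.scm")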
The candidate beam width dictates how many candidate units will be considered for each target unit (candidate units with a target cost outside the beam width will be dropped), while the other beam width dictates how many Viterbi paths are kept alive at each point (paths with a total score outwith the beam width of the best one are dropped).
One explanation for the smaller than expected time difference between the two pruning conditions you give could be that aggressive candidate pruning means there are fewer opportunities for path pruning? We’d just need to know how many paths are considered at each point for the two conditions to properly understand what’s happening here.
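To make the candidate beam concrete, here is a minimal sketch in Festival-style Scheme (purely illustrative – this is not the multisyn implementation): drop any candidate whose target cost is more than the beam width above the best candidate’s cost.

(define (prune_candidates cands best_cost beam)
  "Keep only (unit target-cost) pairs whose cost is within BEAM of BEST_COST."
  (cond
   ((not cands) nil)
   ((<= (- (cadr (car cands)) best_cost) beam)
    (cons (car cands) (prune_candidates (cdr cands) best_cost beam)))
   (t (prune_candidates (cdr cands) best_cost beam))))

(prune_candidates '((u1 0.0) (u2 0.4) (u3 2.3)) 0.0 1.0)
; -> ((u1 0.0) (u2 0.4)) : u3 falls outside the beam and is dropped

Path (Viterbi) beam pruning is the same idea applied to the cumulative path score at each step of the search.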
No, Edinburgh Speech Tools/Festival can’t produce those files.
Those are files derived from STRAIGHT analysis as part of the standard HTS (HMM-based synthesis) build recipe, as Matt Shannon notes elsewhere:
https://github.com/MattShannon/HTS-demo_CMU-ARCTIC-SLT-STRAIGHT-AR-decision-tree/issues/2
Looking at the data processing in that recipe, for example, you’ll see the Matlab version of STRAIGHT is used, followed by several subsequent steps using SPTK tools, Perl scripts, etc.
If you look at the scheme code for what happens when you call (build_utts …):
/Volumes/Network/courses/ss/festival/festival_linux/multisyn_build/scm/build_unitsel.scm
it can help you understand what’s going on “under the hood” I think.
In short, the purpose of the “build_utts” function is to build the *.utt Utterance files in your “utt/” directory. These contain all the linguistic information about the utterances in your voice database – in effect they *are* your voice database for your final voice (well, those, together with the waveform resynthesis parameters in your “lpc/” directory, and the join cost coefficients in your “coef/” or “coef2/” directory).
As part of its work, it takes the input text for one sentence from utts.data, does front-end processing to generate a standard EST_Utterance data structure, then adds in other information. A critical part of that information is the phone timings identified by the force alignment processing done with HTK (i.e. the “do_alignment…” stage…).
For this you’ll see a function called “align_utt_internal” – its job is to load the HTK-derived phone labelling and reconcile it with the phone sequence in the Festival Utterance structure, then copy across the timings.
It is expected that the two phone sequences won’t match completely, mostly because: i) there may be optional short pauses that HTK found between words; ii) the “phone_substitution” file in the alignment process allows some phones to be changed to others (e.g. full vowels to be reduced to schwa). The scheme code therefore tries to reconcile these anticipated differences between the phone sequence Festival predicted and the phone sequence HTK force alignment says matches what the recorded speech actually contains.
(btw, note the script also adds other information to the utterance structures it builds at this point – e.g. phones whose log duration looks like an outlier get a “bad dur” flag, phones without any pitchmarks get a “no pms” flag, etc. That information is used by the cost calculation during Viterbi search, usually to avoid selecting those suspect units…)
The error you are getting indicates the “align_utt_internal” script has failed to reconcile the two phone sequences. In other words, they differ in a way it didn’t anticipate.
To diagnose this, you should look at the phone sequence from force alignment and compare it to the one produced by the Festival front end for your voice (i.e. phoneset and lexicon) by default. Where do they differ? Have you perhaps done the forced alignment with one accent setting, but are then trying to build the utterances using another? Or did you perhaps add a word to the lexicon to correct a word pronunciation at one stage but not the other? Basically, there’s some difference!
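As a hedged starting point for that comparison, you can print Festival’s predicted phone sequence for a sentence like this (the sentence below is illustrative – use one from your utts.data, with the same voice/phoneset/lexicon setup as your build), then compare it by eye with the corresponding label file from the alignment directory:

(set! u (utt.synth (Utterance Text "An illustrative sentence from utts.data.")))
(mapcar
 (lambda (seg) (format t "%s " (item.feat seg 'name)))
 (utt.relation.items u 'Segment))
(format t "\n")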
> 1) I thought the main goal of epoch detection is to find a consistent point within each pitch period. Why is it then relevant that this point is on the largest peak?
I can think of two reasons: i) notionally, we are seeking to find the very instant of glottal closure (GCI), which is the point of maximum excitation; ii) we would typically want to centre any window (e.g. Hamming or Hann window, when doing PSOLA or other pitch-synchronous processing) at the GCI because maximum energy is there (and also the closed phase just following it).
> 2) After counting the zero-crossings in the derivative of the waveform: why are the zero-crossings not on the largest peaks/ why do we need to shift our marks? I thought we remove everything besides F0, so each maximum in the waveform (i.e. largest peaks) will correspond to a zero-crossing in the derivative. How can it be that a zero-crossing does not correspond to the largest peak in the waveform?
Because signals can be “messy” and the algorithm imperfect 🙂
The lab machines have Anaconda installed, with Python 3.7 and a large number of packages pre-installed.
You can find that version of Python at:
/anaconda3/bin/python3
Finally, if you ever need to install packages that aren’t pre-installed in that Python installation, you can always first create a virtual environment (or alternatively use a conda env) – you can then install packages into this virtualenv without needing root privileges.