Forum Replies Created
As discussed (+hopefully solved!) in the lab today, you need to:
– follow the instructions in the Festival manual on defining a new lexicon
https://www.cstr.ed.ac.uk/projects/festival/manual/festival_13.html#:~:text=A%20Lexicon%20in%20Festival%20is,words%20not%20in%20either%20list (including compiling the lexicon first)
– and then add a few lines in the build_unitsel.scm file (so, save a copy to your home space and then edit it), in order to define what to do if the user who’s building a voice requests your new lexicon (see the sketch below).
(solved in the lab, but summarised above for completeness and future users!)
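As a minimal sketch of the lexicon-definition part (all names, paths and pronunciations below are illustrative, not the course’s actual ones – follow the manual page above for the details):

(lex.create "mylex")                         ; register a new lexicon called "mylex"
(lex.set.phoneset "unilex")                  ; the phone set the entries are written in
(lex.set.compile.file "/path/to/mylex.out")  ; the compiled lexicon, built beforehand with (lex.compile ...)
(lex.set.lts.method 'lts_rules)              ; what to do for words not in the lexicon
(lex.set.lts.ruleset 'nrl)                   ; illustrative - use whatever LTS setup matches your phone set
(lex.add.entry '("zzyzx" nil (((z ai) 1) ((z i k s) 0)))) ; optional addendum entry (pronunciation made up!)

Then, roughly speaking, the extra lines in your copy of build_unitsel.scm would select this lexicon (e.g. via (lex.select "mylex")) when the person building a voice asks for it.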
The ‘oir’ phone is in the Festival phone set definition for unilex (see /Volumes/Network/courses/ss/festival/festival_linux/festival/lib/unilex_phones.scm).
Or do you mean the phone_list file that gets put in the alignment directory when doing force alignment with the multisyn build tools? If so, you’re right, that vowel doesn’t seem to be there. Who exactly created that file is now lost in the mists of time, but it *is* a very rare phone!
If that phone appears in the phone transcriptions of sentences you need to align with HTK/multisyn-build tools, then just add it to the phone_list file (in the “alignment” subdirectory).
OK, so it turns out the (tts_file …) function used by this script creates utterances of the “Token” type, which doesn’t have the “iform” feature, so you can’t simply print that feature out unfortunately. Instead, you can just print out all the items in the Token relation.
Here’s some Scheme code for that. If you add this “utt.print.tokens” function to the scheme file, you can then call it at the point where the after-synth hooks print the flat utterance representation for each sentence.
(define (utt.print.tokens utt)
  "(utt.print.tokens UTT)
Print tokens of UTT (items in Token relation) to standard out."
  (mapcar
   (lambda (token)
     (format t "%s " (item.feat token 'name)))
   (utt.relation.leafs utt 'Token))
  (format t "\n")
)
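If you also want it to run automatically after every synthesis in a plain interactive Festival session (rather than via the exercise script’s own hook setup), a hedged way to do that is via the standard after_synth_hooks variable:

(set! after_synth_hooks (list utt.print.tokens))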
Look higher up in the scheme code:
(utt.save utt (path-append globalSaveDir uttfilename)) ; save the utterance structure
Festival help on utt.save:
festival> (utt.save
(utt.save UTT FILENAME TYPE)
Save UTT in FILENAME in an Xlabel-like format. If FILENAME is "-"
then print output to stdout. TYPE may be nil or est_ascii

Hmm, that seems to suggest the feature is indeed not defined in the utterance structure. The easiest way to check what information is contained in your utterance structures would be to save one (or all) and post an example here (they are just text files, so you can also view them yourself).
You’ll see indications in the code of how/where the utterance file gets saved – can you attach one of those files here please?
Aha, yes, the bit you’re looking for here is:
(utt.feat utt "iform")
This will look up a feature called “iform” on the utterance object – this is the “input form” of the text for a Text style utterance object.
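For example, a minimal sketch you could try in an interactive Festival session (the sentence is just an illustration):

(set! myutt (utt.synth (Utterance Text "Hello there.")))   ; build and synthesise a Text-type utterance
(format t "iform: %s\n" (utt.feat myutt "iform"))          ; print its input form
(utt.save myutt "-")                                       ; dump the whole structure to stdout for inspection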
Yes, some of the Multisyn build tools are Python scripts which need an extension module wrapper around the Edinburgh Speech Tools (compiled native binaries). There are 2 options:
i) if you look in the $EST/config/config file in your EST directory, you’ll see an option to switch on compilation of those wrappers to match your architecture. That might just work.
ii) The last section of CPSLP was on extension modules and used a wrapper around EST as an exercise. It’s available for Mac (Intel) & Linux, for Python versions 3.8 and 3.9. You could try importing that (on a Mac you’d need to use an x86_64 version of Python). If that works, the interface is *slightly* different to the one used by the multisyn build tools, so you’d need to tweak the multisyn build tools’ Python scripts that use it – nothing major, just a slightly different, updated API.
Yes, you need to use the same front end (i.e. phone set, lexicon, G2P model, etc.) for alignment and voice building as you will use at run time for the resulting synthetic voice.
It sounds like you may have run the initial MLF creation step using a different Festival voice? (Maybe the default one that’s loaded when you start Festival?)
Most likely, yes.
Similar topic: #15813

Your my_lexicon.scm file tries to use some functions/symbols from the “build_unitsel.scm” file – as this error indicates, when running the code in ./my_lexicon.scm Festival is looking for “setup_phoneset_and_lexicon”, which it cannot find unless you load the build_unitsel.scm file first. If you look at the contents of build_unitsel.scm (which I recommend doing – it’s instructional!), you’ll see it’s indeed defined there.
If you don’t want to include that “build_unitsel.scm” on the command line each time you want to use my_lexicon.scm, you could always just insert a line to load it at the top of the my_lexicon.scm file, before any of the code tries to use bits found there. You can do that, for example, with just the (load …) function.
btw, every time festival starts up, it checks for a file at ~/.festivalrc. If that exists, it runs the code contained in it. So, if there’s something you want done automatically *every time* festival starts, then you can just put the code in that file (i.e. in this case, to load these two scheme files). But remember – it will do it every time!
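For example, a minimal sketch (the build_unitsel.scm path below is illustrative – point it at wherever your copy lives). At the top of my_lexicon.scm:

(load "/path/to/your/copy/of/build_unitsel.scm") ; defines setup_phoneset_and_lexicon etc.

or, to have both loaded automatically every time Festival starts, in ~/.festivalrc:

(load "/path/to/your/copy/of/build_unitsel.scm")
(load "/path/to/my_lexicon.scm")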
The candidate beam width dictates how many candidate units will be considered for each target unit (candidate units with a target cost outside the beam width will be dropped), while the other beam width dictates how many Viterbi paths are kept alive at each point (paths with a total score outwith the beam width of the best one are dropped).
One explanation for the smaller than expected time difference between the two pruning conditions you give could be that aggressive candidate pruning means there are fewer opportunities for path pruning? We’d just need to know how many paths are considered at each point for the two conditions to properly understand what’s happening here.
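To make the candidate beam concrete, here is a minimal sketch in Festival-style Scheme (purely illustrative – this is not the multisyn implementation): drop any candidate whose target cost is more than the beam width above the best candidate’s cost.

(define (prune_candidates cands best_cost beam)
  "Keep only (unit target-cost) pairs whose cost is within BEAM of BEST_COST."
  (cond
   ((not cands) nil)
   ((<= (- (cadr (car cands)) best_cost) beam)
    (cons (car cands) (prune_candidates (cdr cands) best_cost beam)))
   (t (prune_candidates (cdr cands) best_cost beam))))

(prune_candidates '((u1 0.0) (u2 0.4) (u3 2.3)) 0.0 1.0)
; -> ((u1 0.0) (u2 0.4)) : u3 falls outside the beam and is dropped

Path (Viterbi) beam pruning is the same idea applied to the cumulative path score at each step of the search.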
No, Edinburgh Speech Tools/Festival can’t produce those files.
Those are files derived from STRAIGHT analysis as part of the standard HTS (HMM-based synthesis) build recipe, as Matt Shannon notes elsewhere:
https://github.com/MattShannon/HTS-demo_CMU-ARCTIC-SLT-STRAIGHT-AR-decision-tree/issues/2
Looking at the data processing in that recipe, for example, you’ll see the Matlab version of STRAIGHT is used, followed by several subsequent steps using SPTK tools, Perl scripts, etc.
If you look at the scheme code for what happens when you call (build_utts …):
/Volumes/Network/courses/ss/festival/festival_linux/multisyn_build/scm/build_unitsel.scm
it can help you understand what’s going on “under the hood” I think.
In short, the purpose of the “build_utts” function is to build the *.utt Utterance files in your “utt/” directory. These contain all the linguistic information about the utterances in your voice database – in effect they *are* your voice database for your final voice (well, those, together with the waveform resynthesis parameters in your “lpc/” directory, and the join cost coefficients in your “coef/” or “coef2/” directory).
As part of its work, it takes the input text for one sentence from utts.data, does front-end processing to generate a standard EST_Utterance data structure, then adds in other information. A critical part of that information is the phone timings identified by the force alignment processing done with HTK (i.e. the “do_alignment…” stage…).
For this you’ll see a function called “align_utt_internal” – its job is to load the HTK-derived phone labelling and reconcile it with the phone sequence in the Festival Utterance structure, then copy across the timings.
It is expected that the two phone sequences won’t match completely, mostly because: i) there may be optional short pauses that HTK found between words; ii) the “phone_substitution” file in the alignment process allows some phones to be changed to others (e.g. full vowels to be reduced to schwa). The scheme code therefore tries to reconcile these anticipated differences between the phone sequence Festival predicted and the phone sequence HTK force alignment says matches what the recorded speech actually contains.
(btw, note the script also adds other information to the utterance structures it builds at this point – e.g. phones whose log duration looks like an outlier get a “bad dur” flag, phones without any pitchmarks get a “no pms” flag, etc. That information is used by the cost calculation during Viterbi search, usually to avoid selecting those suspect units…)
The error you are getting indicates the “align_utt_internal” script has failed to reconcile the two phone sequences. In other words, they differ in a way it didn’t anticipate.
To diagnose this, you should look at the phone sequence from force alignment and compare it to the one produced by the Festival front end for your voice (i.e. phoneset and lexicon) by default. Where do they differ? Have you perhaps done the forced alignment with one accent setting, but are then trying to build the utterances using another? Or did you perhaps add a word to the lexicon to correct a word pronunciation at one stage but not the other? Basically, there’s some difference!
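As a hedged starting point for that comparison, you can print Festival’s predicted phone sequence for a sentence like this (the sentence below is illustrative – use one from your utts.data, with the same voice/phoneset/lexicon setup as your build), then compare it by eye with the corresponding label file from the alignment directory:

(set! u (utt.synth (Utterance Text "An illustrative sentence from utts.data.")))
(mapcar
 (lambda (seg) (format t "%s " (item.feat seg 'name)))
 (utt.relation.items u 'Segment))
(format t "\n")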
> 1) I thought the main goal of epoch detection is to find a consistent point within each pitch period. Why is it then relevant that this point is on the largest peak?
I can think of two reasons: i) notionally, we are seeking to find the very instant of glottal closure (GCI), which is the point of maximum excitation; ii) we would typically want to centre any window (e.g. Hamming or Hann window, when doing PSOLA or other pitch-synchronous processing) at the GCI because maximum energy is there (and also the closed phase just following it).
> 2) After counting the zero-crossings in the derivative of the waveform: why are the zero-crossings not on the largest peaks/ why do we need to shift our marks? I thought we remove everything besides F0, so each maximum in the waveform (i.e. largest peaks) will correspond to a zero-crossing in the derivative. How can it be that a zero-crossing does not correspond to the largest peak in the waveform?
Because signals can be “messy” and the algorithm imperfect 🙂
The lab machines have Anaconda installed, with Python 3.7 and a large number of packages pre-installed.
You can find that version of Python at:
/anaconda3/bin/python3
Finally, if you ever need to install packages that aren’t pre-installed in that Python installation, you can always first create a virtual environment (or alternatively use a conda env) – you can then install packages into this virtualenv without needing root privileges.