Forum Replies Created
This script makes a list of the .mfcc files on which the forced-alignment HMMs will be trained. After running it, the file train.scp should contain the list of .mfcc filenames, corresponding to the lines in utts.data.

How large is your corpus?
What is the goal of finding the out-of-dictionary words?
If you wish to exclude all sentences that contain such a word, then you’ll have to find them all – you could do this using the provided Festival script (which might be slow for a very large corpus) or some other way (by writing your own code).
But if your aim is to identify all the words you might need to add to the dictionary, then you are not expected to do that for the large source corpus. You might need to rely on letter-to-sound to provide pronunciations during the text selection phase.
You should only manually write pronunciations for a modest number of words appearing in your (much smaller) recording script.
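A minimal sketch of the dictionary-lookup idea, as an alternative to the provided Festival script. How you load the list of known headwords depends on your dictionary's format (e.g., the first field of each line in a Festival-style lexicon); the function and its arguments here are illustrative, not part of the toolkit.

```python
import re

def oov_words(text, known_words):
    """Return the sorted out-of-dictionary word types found in `text`.

    `known_words` is any iterable of headwords from your
    pronunciation dictionary.
    """
    known = {w.lower() for w in known_words}
    # crude tokenisation: letters and apostrophes only
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    return sorted({t for t in tokens if t not in known})
```

For a very large source corpus you would run this over each file in turn and accumulate the OOV set, rather than reading everything into memory.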
Check your quota like this:
$ quota --hum
If the figure in the space column is larger than the quota (which is generally 5000 MB = 5 GB) then you need to remove files. Use the du command mentioned earlier in this topic to find what is taking up the most space.

(The abrt-cli status warning is probably also caused by full quota.)
warning is probably also caused by full quota.)If you are making many voices for the Speech Synthesis assignment, in separate copies of the
ss
directory, you can share files between them where applicable (e.g., thewav
directory for voices that use the same database). See https://speech.zone/forums/topic/symbolic-links/ (if that looks tricky, do it with a tutor in a lab session).You shouldn’t need that much space to do the assignment. It’s likely that you have a large amount of unnecessary files somewhere. Check disk usage like this:
cd
du -sh *
du -sh .?*
cd changes to your home directory. The first du measures the size of all regular files and directories. The second uses a glob .?* that matches all the hidden items (anything whose name starts with a period), including the directory .cache.

It should be safe to delete the contents of .cache, if that is the offending directory, or you can delete just some of the subdirectories if you prefer.

What you need to do is pass your text through Festival’s front end, to obtain the pronunciation, from which you can easily determine the diphones.
You are already doing that as part of building the voice, in the step where you do forced alignment – look at the Creating the initial labels step of Time-align the labels.
See also this topic for other ways to do this.
It’s important that, whichever method you use, you load the same phone set and pronunciation dictionary that your final voice will use (e.g., don’t use CMUdict).
Some of the posts in this topic may also be helpful.
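Once you have the phone sequence for an utterance from the front end, determining the diphones is just a matter of pairing adjacent phones. A minimal sketch (the example phone names are illustrative):

```python
def diphones(phones):
    """Return the list of adjacent phone pairs in one phone sequence."""
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]

def diphone_coverage(utterances):
    """Return the set of diphone types covered by a list of
    phone sequences (one sequence per utterance)."""
    return {d for phones in utterances for d in diphones(phones)}
```

You could use diphone_coverage, for example, to compare candidate recording scripts during text selection.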
The purpose of building a voice from pre-existing ARCTIC recordings (available from festvox.org) was to learn the tools.
Your experiments will use voices built from recordings of yourself, both of the ARCTIC script and your own script, separately and in combination. That is why it’s important to make all your recordings in the same studio.
You are likely to need to build a number of different voices – e.g., from different subsets of the recordings – to answer whatever questions you devise for your experiments.
That page is from 2019-20 and is now outdated. The fileserver fs1.ppls.ed.ac.uk may no longer be available – see instead this page and scroll down to Note [24/10/23], which says to replace fs1.ppls.ed.ac.uk with one of the lab machines instead.

Does that work?
January 18, 2024 at 18:50 in reply to: Unit Selection exercise – No module named ‘EST_Utterance’ #17436
As noted in the first class, it is non-trivial to build the version of Festival required for this exercise: it requires compilation of Python wrappers for the underlying C++ code using SWIG. Getting this compilation to work correctly involves performing Magic.
We strongly recommend using the lab machines, where everything Just Works.
Anyone really determined to build their own Python wrappers should bring their machine to a lab session and smile very nicely at Korin.
The number of observations in the observation sequence is fixed, and they all have to be generated by the model (i.e., the compiled-together language model and acoustic model).
There are many possible paths through the model that could generate this observation sequence. Some paths will pass through mostly short words, each of which generates a short sequence of observations (because short words tend to have short durations when spoken). Other paths pass through long words, each of which will typically generate a longer sequence of observations.
So, to generate the fixed-length observation sequence, the model might take a path through many short words, or through a few long words, or something in-between.
Paths through many short words are likely to contain insertion errors. Paths through a few long words are likely to contain deletion errors. The path with the lowest WER is likely to be a compromise between the two: we need some way to control that, which is what the WIP provides.
Again, J&M’s explanation of the LMSF is not the best, so don’t get lost in their explanations of the interaction between LMSF and WIP.
In summary:
- The LMSF is required because the language model computes probability mass, whilst the acoustic model computes probability density.
- The WIP enables us to trade off insertion errors against deletion errors.
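The combined score, in the log domain, can be sketched numerically (cf. J&M equation 9.50). The function and the values below are illustrative, not taken from a real recogniser:

```python
def combined_log_score(log_acoustic, log_lm, n_words, lmsf, log_wip):
    """Combined log score of one hypothesis:
    log p(O|W) + LMSF * log P(W) + N * log WIP,
    where N is the number of words in the hypothesis."""
    return log_acoustic + lmsf * log_lm + n_words * log_wip

# Two made-up competing hypotheses for the same observation sequence:
# one path through many short words, one through a few long words.
many_short = combined_log_score(-1200.0, -40.0, 12, 15.0, -10.0)
few_long = combined_log_score(-1210.0, -25.0, 5, 15.0, -10.0)
```

With a negative log WIP, every additional word costs the path something, so paths through many short words (prone to insertion errors) are penalised; making the value less negative (or positive) shifts the balance back towards them, reducing deletion errors instead.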
“fixed” just means that it is a constant value
The word insertion penalty, which is a log probability, is “logWIP” in J&M equation 9.50. It is added to the partial path log probability (e.g., the token log probability in a token passing implementation) once for each word in that partial path, which is why it is multiplied by N in the equation.
The HTK manual says
The grammar scale factor is the amount by which the language model probability is scaled before being added to each token as it transits from the end of one word to the start of the next
but of course, they mean “language model log probability”, and when they say
The word insertion penalty is a fixed value added to each token
they mean “added to the log probability of each token” (the same applies to the previous point too).
The HTK term “penalty” is potentially misleading, since in their implementation the value is added not subtracted. Conceptually there is no difference and it doesn’t really matter: we can just experiment with positive and negative values to find a value that minimises the WER on some held-out data.
The implementation in HTK is consistent with J&M equation 9.50.
When J&M say
Thus, if (on average) the language model probability decreases…
they are talking about the probability decreasing as the sentence length increases, since more and more word probabilities will be multiplied together.
Their explanation of the LMSF is rather long-winded. There is a much simpler and better explanation for why we need to scale the language model probability when combining it with the acoustic model likelihood. In equation 9.48, P(O|W) implies that the acoustic model calculates a probability mass. It generally does not!
If the acoustic model uses Gaussian probability density functions, it cannot compute probability mass. It can only compute a probability density. Density is proportional to the probability mass in a small region around the observation O. The constant of proportionality is unknown.
Since we always work in the log probability domain, equation 9.48 involves a sum of two log probabilities.
The acoustic model will compute quantities on a different scale to the language model. We need to account for the unknown constant of proportionality by scaling one or other of them in this sum. The convention is to scale the language model log probability, hence the LMSF. We typically find a good value for the LMSF empirically (e.g., by minimising the Word Error Rate on some held-out data).
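It is easy to demonstrate that a density is not a probability mass: a Gaussian with a small standard deviation has a density at its mean far greater than 1. The parameters below are illustrative.

```python
import math

def gaussian_density(x, mean, stdev):
    """Probability density of a univariate Gaussian at x."""
    z = (x - mean) / stdev
    return math.exp(-0.5 * z * z) / (stdev * math.sqrt(2.0 * math.pi))

# With stdev = 0.01, the density at the mean is about 39.9 – a value
# no probability mass could ever take.
density_at_mean = gaussian_density(0.0, 0.0, 0.01)
```

Since the scale of such densities depends on arbitrary properties of the feature space, the acoustic model's log scores are on an unknown scale relative to the language model's, hence the empirically tuned LMSF.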
The error

Unable to open label file rec/panagiot_test.2.lab

tells us that HResults cannot find the .lab file (which contains the correct label) corresponding to the recognition result stored in the file rec/panagiot_test.2.rec.

HResults is looking for that .lab file in the rec directory. But this is because it did not find it within the MLF (Master Label File) panagiot_test.mlf – perhaps this file is missing from that user’s data?

The error
(standard_in) 2: syntax error
occurs later – so debug that next…
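For reference, a Master Label File groups many label files into one: it starts with a magic header, then for each utterance a quoted filename pattern, the labels (one per line), and a terminating full stop. The label names below are purely illustrative:

```
#!MLF!#
"*/panagiot_test.2.lab"
SIL
SOME
WORDS
SIL
.
```

If the entry for a given .rec file is absent from the MLF, HResults falls back to looking for a matching .lab file on disk, which produces the error above.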
Your error is occurring with HResults, not HVite:

FATAL ERROR - Terminating program HResults