Forum Replies Created
You are telling HRest to load models/proto/$PROTO, whereas it should start with the models just created by HInit, although this won’t actually cause an error in HTK.

Look for an error message from HRest – if it is failing to create any models, it should report an error.

Remember to always wipe all models (from hmm0 and hmm1) before every experiment: this will help you catch errors.

You should use either a loop (around all the files to be recognised), or the -S option (to pass the name of a file in which all files to be recognised are listed), but not both. The -S option will generally be faster (*). Why is that?

(*) although you might not notice the difference on the lab computers, because network speed typically dominates the run-time.
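To make the comparison concrete, here is a hedged sketch of the two approaches. The file and directory names (mfc/, hmm1/hmmdefs, wdnet, dict, wordlist, test.scp) are placeholders, not the exact names from the exercise, and you will need your own remaining HVite options.

# approach 1: a shell loop, starting HVite afresh for every file
# (each run writes its output as a .rec label file next to the input)
for f in mfc/*.mfc; do
    HVite -H hmm1/hmmdefs -w wdnet dict wordlist "$f"
done

# approach 2: list all the files once, then pass that list via -S,
# so HVite is started (and the models and network loaded) only once
ls mfc/*.mfc > test.scp
HVite -H hmm1/hmmdefs -w wdnet -i rec.mlf -S test.scp dict wordlist

The second approach only pays the start-up cost (loading the HMMs and building the recognition network) once, rather than once per file.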
Yes, there are a total of 39 elements in the feature vectors. 12 of them are the MFCCs. You have 12+13+13=38 though.
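To complete the arithmetic (assuming the usual HTK parameterisation for this exercise, such as MFCC_0_D_A or MFCC_E_D_A): the static part of each vector has 13 elements, the 12 MFCCs plus either C0 or log energy, and appending the deltas and delta-deltas gives 13 + 13 + 13 = 39.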
The language model computes P(W) where W is the word sequence of one utterance to be recognised. For the digit recogniser, can you locate and inspect the language model that is being used?
The acoustic model computes P(O|W) where O is the observation sequence. How does O relate to the MFCCs?
How are P(O|W) and P(W) combined to calculate P(W|O), and why do we need to do that?
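For reference, the standard way of combining them is Bayes’ rule: P(W|O) = P(O|W) P(W) / P(O). Since P(O) is the same for every candidate word sequence, the recogniser simply searches for the W that maximises P(O|W) P(W).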
You should also exercise extreme caution in uploading data to external AI tools, since they are likely to retain this data and potentially include it in the training data for a future update of the tool.
You do not have permission to share any of the data for this assignment outside the University.
Please read “Please briefly describe if you use any external Artificial Intelligence (AI) related tools in doing this assignment…” which applies to all use of such tools.
Whilst professional programmers certainly use AI “co-pilots” to write code faster, I strongly recommend against this for beginners: for one thing, you do not yet have the skill to judge whether the resulting code is correct!
You will learn a lot more, and build more confidence, doing it yourself from scratch.
These notes are not currently available – there is everything you need here on speech.zone and the forums.
Reproducing a figure from another source may not be the best way to show your understanding.
If you really want to do this, then you need to cite the source in the caption of the figure, in the same way that you would cite the source of a direct text quote.
It’s up to you whether to include the original figure, or redraw it.
No, they are not. I’ve clarified that in the Formatting instructions.
You can check when AT 4.02 is available (i.e., whenever there is no class scheduled) by visiting timetables then searching for 4.02 and selecting the one in Appleton Tower. This link should take you directly there (although you may need to authenticate with EASE first).
Key points
You are correct that the x-axis (horizontal) will be frequency, and will be labelled in units of Hertz (Hz).
The vertical axis is magnitude, which is most commonly plotted on a logarithmic scale and is therefore labelled in decibels (dB).
Additional detail
Magnitude is a ratio (in this case, of filter output to its input), and therefore has no units: formally, we say it is dimensionless. So dB are not actually a unit, but a scale.
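A quick worked example: for amplitude (magnitude) ratios, the decibel value is 20 log10(ratio). A filter that passes its input unchanged has a ratio of 1, i.e. 0 dB; one that halves the amplitude has a ratio of 0.5, i.e. 20 log10(0.5) ≈ −6 dB.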
emulabel is an outdated program mentioned in some old documentation on festvox.org.

In the “Pitchmark the speech” step, the command make_pmlab_pm creates label files from the pitchmarks, and places them in the pm_lab directory. These can be viewed in the same way as any other label file (such as the aligned phone labels), e.g., using wavesurfer.

You can use Qualtrics to build the survey, but host your audio files somewhere else, then enter their URLs into Qualtrics.
You can host the audio files anywhere that is able to provide a URL for the file. For example, a free GitHub Pages site, which might give you URLs like this:
https://jonojace.github.io/IS19-robustness-audio-samples/figure3/g_100clean.wav
Yes, you need both an abstract and an introduction.
You need to more clearly separate two independent design choices:
1. How to estimate F0 for recorded speech (which will become the database for a unit selection system, or the training data for a FastPitch model).
The method for estimating F0 (whether autocorrelation-based like RAPT, or something else) is independent of the method used for synthesis. The synthesis methods just need values for F0; they don’t care where they come from.
2. Using F0 during synthesis (which will be either the unit selection algorithm, or FastPitch inference).
In a unit selection system that doesn’t employ any signal modification, you are correct in stating that the system can only synthesise speech with F0 values found in the database. FastPitch can, in theory, generate any F0 value.
But both methods use the data to learn how to predict F0, so they are both constrained by what is present in the database. The ‘model’ of F0 prediction in unit selection is implicit: the combination of target and join cost function. The model of F0 prediction in FastPitch is explicit.
So, in practice, as you suggest, FastPitch is very constrained by what is present in the training data. In that regard, it’s not so very different to unit selection.
There is probably either a formatting error or a non-ASCII character in your utts.data.

If you can’t easily locate it, try using binary search to find the offending line (here I’ll assume utts.data has 600 lines):

0. make a backup of utts.data

1. make a file containing only the first half of utts.data, for example with

head -300 utts.data > first.data

2. try check_script on first.data

3a. if you get an error, then take the first half again

head -150 utts.data > first.data

3b. if you don’t get an error, make a file containing the first three-quarters of utts.data

head -450 utts.data > first.data

and iterate, varying the number of lines in a binary search pattern, until you home in on the error.
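A quicker first check (assuming GNU grep, since the -P option is not available everywhere) is to search directly for any non-ASCII characters and print the line numbers of matches:

grep -nP '[^\x00-\x7F]' utts.data

If that finds nothing, the problem is more likely a formatting error (for example a missing quote or bracket), and the binary search above is still the way to find it.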