Forum Replies Created
We’ll be covering the Viterbi algorithm in the next section of Speech Processing, on Automatic Speech Recognition. In Module 7, we will encounter the general technique of Dynamic Programming in the method known as Dynamic Time Warping. Later, we will see another form of Dynamic Programming, called the Viterbi algorithm, applied to Hidden Markov Models. So wait for those parts of the course, then ask your question again.
Your questions about whether to take the maximum or the minimum relate to whether we are maximising the total (which is what we would do if it were a probability) or minimising it (which is what we would do if it were a distance).
Regarding your questions about whether to multiply or sum: if we are working with probabilities, we multiply. If we are working with distances or costs, we sum (and in fact, we will end up working with log probabilities, which we will sum).
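To make the multiply-versus-sum point concrete, here is a tiny Python sketch (my own illustration with made-up numbers, not course code): multiplying probabilities and summing log probabilities pick out the same best path.

import math

# Two made-up candidate paths, each a sequence of step probabilities.
path_a = [0.5, 0.9, 0.8]
path_b = [0.6, 0.7, 0.7]

# Probabilities: multiply along each path, then take the maximum.
prob_a = math.prod(path_a)   # 0.36
prob_b = math.prod(path_b)   # 0.294
best = max(prob_a, prob_b)

# Log probabilities: sum along each path, still take the maximum.
log_a = sum(math.log(p) for p in path_a)
log_b = sum(math.log(p) for p in path_b)

# Same winner either way, and exp() recovers the probability.
# (With distances or costs we would sum and take the minimum instead.)
assert math.isclose(math.exp(max(log_a, log_b)), best)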
The F0 estimation tool is part of CSTR’s own Speech Tools library and is an implementation of this algorithm:
Y. Medan, E. Yair and D. Chazan, “Super resolution pitch determination of speech signals,” in IEEE Transactions on Signal Processing, vol. 39, no. 1, pp. 40-48, Jan. 1991, DOI: 10.1109/78.80763.
with improvements described in:
Paul C. Bagshaw, Steven Hiller and Mervyn A. Jack, “Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching,” in Proc. EUROSPEECH’93, pp. 1003-1006, 1993.
At the heart of the algorithm is the same idea as RAPT: the correlation between the waveform and a time-shifted copy of itself (variously called autocorrelation or cross-correlation).
For the purposes of the assignment, you may assume that the algorithm is essentially the same as RAPT, since that is the one taught in the course.
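If you are curious what that core idea looks like in code, here is a minimal NumPy sketch (my own illustration, not the Speech Tools implementation): it picks the strongest autocorrelation peak within a plausible lag range and converts that lag to a frequency.

import numpy as np

def autocorr_f0(x, fs, fmin=60.0, fmax=400.0):
    # Correlate the frame with time-shifted copies of itself.
    x = x - np.mean(x)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    # Only consider lags corresponding to plausible F0 values.
    lo, hi = int(fs / fmax), int(fs / fmin)
    best_lag = lo + np.argmax(ac[lo:hi])
    return fs / best_lag

# Usage: a synthetic 40 ms 'voiced' frame at 120 Hz.
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(autocorr_f0(frame, fs))  # close to 120

A real tracker such as RAPT adds normalisation, generates several candidate lags per frame, and uses dynamic programming to choose among them.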
The pitchmarking method is indeed based on finding negative-going zero-crossings of the differentiated waveform. Here’s the code if you want to read it (not required!).
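In case it helps to see the idea on its own, here is a minimal NumPy sketch (my own illustration, not the actual Speech Tools pitchmarker, which also filters the signal and corrects the candidates):

import numpy as np

def pitchmark_times(x, fs):
    d = np.diff(x)  # differentiated waveform
    # Negative-going zero-crossings: the derivative goes from >= 0 to < 0,
    # i.e. at the peaks of the original waveform.
    idx = np.where((d[:-1] >= 0) & (d[1:] < 0))[0] + 1
    return idx / fs  # times in seconds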
Yes, they are created using the ‘keyword lexicon’ technique invented by Fitt & Isard.
The error is caused by using letter-to-sound rules for the word ‘arugula’. The solution is to write a pronunciation (one that does not contain a glottal stop) for this word and add it to my_lexicon.scm.
Miscellaneous
There was one suggestion to use Piazza, which is a good tool. The University has a track record of changing the Virtual Learning Environment (VLE) every five years or so. The current platform is a tangled mess of many tools, ‘integrated’ in Learn. Critically, a previous VLE change led to the total loss of all past student-contributed discussions for this course. I do not want that to ever happen again, which is why I built speech.zone.
Some suggestions mentioned things we are already doing: releasing the videos earlier (they were on speech.zone from the start of the course) and providing lecture recordings (which are on Learn, and mentioned on every class tab within the course on speech.zone).
A better classroom
I wish I could promise that, and especially offer a room with work surfaces for your notebooks and laptops, but I have no control over this. Asking for a different room would probably mean somewhere even more obscure and certainly much further from George Square. (I don’t know what to say to the person who complained the room is too far from Appleton Tower…it’s less than 10 minutes’ walk!)
More guidance on the coursework report
Only a small number of responses mentioned this, asking for a prescribed structure. That was done in Speech Processing, and you are expected to use what you learned there in this course. You should no longer need a prescribed structure to know what makes a good report.
Improve the coursework
There were various valid criticisms of the lab-based assignment. We plan to improve the assignment and would like to introduce an element of state-of-the-art modelling. However, this course has no programming prerequisite, and we need to keep the assignment accessible to all students.
Requiring students to record some original data is very important – it’s something that most people in industry lack experience of – so we will keep this.
Keep doing this
The number of people mentioning each point is given in parentheses.
Videos (14)
Interactive classes including individual/pair/group work (9)
Reading papers in class (4)
Use of diagrams (3)
Does this topic help?
You just need to load my_lexicon.scm whenever you start Festival to perform synthesis. Simply give it as an argument to Festival:

$ festival my_lexicon.scm

which tells Festival to load (and execute) the contents of my_lexicon.scm immediately after it starts.

Yes, the most likely cause is an incorrect sampling rate of the waveforms. See this topic.
You shouldn’t edit the main dictionary, but you can override a pronunciation using the command lex.add.entry in the addendum, which is stored in my_lexicon.scm in the assignment.
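For example, an addendum entry in my_lexicon.scm looks something like this (a sketch only: the part-of-speech tag and the phones, including their syllable and stress structure, are illustrative and must use the phone set of the voice you are running):

; illustrative phones and POS tag - check against your voice's phone set
(lex.add.entry
  '("arugula" n (((ax) 0) ((r uw) 1) ((g ax) 0) ((l ax) 0))))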
The filtering can be done directly in the time domain. It’s easiest to describe and understand filtering in the frequency domain, but the filter can be implemented as a direct operation on the speech waveform samples.
Filter design is an entire subject on its own and out of scope. But we can understand one very simple form of low-pass filter that is easy to implement: a moving average. If we take a moving average of a speech waveform, that will smooth out the smaller details (i.e., remove the higher frequencies).
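As a sketch (my own example, not part of the assignment), a moving average is just a convolution with a short uniform kernel:

import numpy as np

def moving_average(x, n=9):
    # n-point moving average: a very simple low-pass filter.
    # Larger n smooths more, i.e. removes more of the higher frequencies.
    return np.convolve(x, np.ones(n) / n, mode="same")

# Usage: smooth a noisy 100 Hz sine sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
noisy = np.sin(2 * np.pi * 100 * t) + 0.2 * np.random.randn(fs)
smoothed = moving_average(noisy)

Note that a causal version of this filter delays its output by (n-1)/2 samples; mode="same" centres the kernel to avoid that, but it is one simple example of the kind of filter delay discussed below.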
The time offset is because the main peak in the speech waveform may not align with the peak that we found in the low-pass-filtered version. That could be for two separate reasons. The first is to do with the phases of the many different harmonic frequencies making up the speech signal. The second is the phase response of the low-pass filter (e.g., the filter introduces a time delay).
I think a visual inspection (plot both distributions, suitably normalised, on the same axes) would suffice for the purposes of the Speech Synthesis assignment.
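Here is a minimal matplotlib sketch of that kind of plot (the variable names are hypothetical, and I am assuming for illustration that the two distributions are F0 values; substitute your own measurements):

import numpy as np
import matplotlib.pyplot as plt

# Placeholder data standing in for values measured from your utterances.
f0_natural = np.random.normal(120, 15, 500)
f0_synthetic = np.random.normal(125, 10, 500)

# density=True normalises each histogram to a probability density,
# making the two distributions directly comparable on the same axes.
plt.hist(f0_natural, bins=40, density=True, alpha=0.5, label="natural")
plt.hist(f0_synthetic, bins=40, density=True, alpha=0.5, label="synthetic")
plt.xlabel("F0 (Hz)")
plt.ylabel("density")
plt.legend()
plt.show()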