Forum Replies Created
Yes, they are created using the ‘keyword lexicon’ technique invented by Fitt & Isard.
The error is caused by using letter-to-sound rules for the word ‘arugula’. The solution is to write a pronunciation (that does not contain a glottal stop) for this word and add that to my_lexicon.scm.
Miscellaneous
There was one suggestion to use Piazza, which is a good tool. The University has a track record of changing the Virtual Learning Environment (VLE) every 5 or so years. The current platform is a tangled mess of many tools, ‘integrated’ in Learn. Critically, a previous VLE change led to total loss of all past student-contributed discussions for this course. I do not want that to ever happen again, which is why I built speech.zone.
Some things were mentioned that we are already doing: release the videos earlier (they were on speech.zone from the start of the course) and provide lecture recordings (which are on Learn, and mentioned on every class tab within the course on speech.zone).
A better classroom
I wish I could promise that, and especially offer a room with work surfaces for your notebooks and laptops, but I have no control over this. Asking for a different room would probably mean somewhere even more obscure and certainly much further from George Square. (I don’t know what to say to the person who complained the room is too far from Appleton Tower…it’s less than 10 minutes’ walk!)
More guidance on the coursework report
Only a small number of responses mentioned this, asking for a prescribed structure. That was done in Speech Processing, and you are expected to use what you learned there in this course. You should no longer need a prescribed structure to know what makes a good report.
Improve the coursework
There were various valid criticisms of the lab-based assignment. We plan to improve the assignment and would like to introduce an element of state-of-the-art modelling. However, this course has no programming pre-requisite, and we need to keep the assignment accessible to all students.
Requiring students to record some original data is very important – it’s something that most people in industry lack experience of – so we will keep this.
Keep doing this
Number of people mentioning each point is given in parentheses.
Videos (14)
Interactive classes including individual/pair/group work (9)
Reading papers in class (4)
Use of diagrams (3)
Does this topic help?
You just need to load my_lexicon.scm whenever you start Festival for performing synthesis. Simply give it as an argument to Festival:
bash$ festival my_lexicon.scm
which tells Festival to load (and execute) the contents of my_lexicon.scm immediately after it starts.
Yes, the most likely cause is incorrect sampling rate of waveforms. See this topic.
You shouldn’t edit the main dictionary, but you can override a pronunciation using the command lex.add.entry in the addendum, which is stored in my_lexicon.scm in the assignment.
The filtering can be done directly in the time domain. It’s easiest to describe and understand filtering in the frequency domain, but the filter can be implemented as a direct operation on the speech waveform samples.
Filter design is an entire subject on its own and out of scope. But we can understand one very simple form of low-pass filter that is easy to implement: a moving average. If we take a moving average of a speech waveform, that will smooth out the smaller details (i.e., remove the higher frequencies).
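To make the moving average concrete, here is a minimal sketch in plain Python. The function names, the 16 kHz sampling rate, the two test frequencies and the window width of 9 samples are all assumptions chosen for illustration; they are not taken from the assignment or from any particular pitch tracker.

```python
import math

def moving_average(samples, width):
    """Centred moving average. `width` should be odd.
    Near the edges the window is simply shortened."""
    half = width // 2
    smoothed = []
    for n in range(len(samples)):
        lo = max(0, n - half)
        hi = min(len(samples), n + half + 1)
        window = samples[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed

def sine(freq_hz, sample_rate, num_samples):
    """A unit-amplitude sine wave, as a list of samples."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(num_samples)]

def peak(samples):
    """Largest absolute sample value."""
    return max(abs(x) for x in samples)

SAMPLE_RATE = 16000   # assumed sampling rate in Hz
WIDTH = 9             # assumed window width in samples

low = sine(100, SAMPLE_RATE, 1600)    # slowly varying component
high = sine(4000, SAMPLE_RATE, 1600)  # fine detail

low_smoothed = moving_average(low, WIDTH)
high_smoothed = moving_average(high, WIDTH)

print("100 Hz peak before/after:", peak(low), peak(low_smoothed))
print("4000 Hz peak before/after:", peak(high), peak(high_smoothed))
```

The 100 Hz component passes through almost unchanged, whilst the 4000 Hz component is strongly attenuated: that is low-pass behaviour. A wider window would smooth more, lowering the effective cut-off frequency.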
The time offset is because the main peak in the speech waveform may not align with the peak that we found in the low-pass-filtered version. That could be for two separate reasons. The first is to do with the phases of the many different harmonic frequencies making up the speech signal. The second is the phase response of the low-pass filter (e.g., the filter introduces a time delay).
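The second of those reasons, the delay introduced by the filter, can be seen in a small sketch. This assumes a causal moving average (each output sample is the average of the current and previous input samples) and a made-up triangular pulse standing in for one pitch period; real pitch trackers may use a different filter, so treat this purely as an illustration.

```python
def causal_moving_average(samples, width):
    """Average of the current sample and the previous width-1 samples.
    Samples before the start of the signal are treated as zero."""
    out = []
    for n in range(len(samples)):
        window = samples[max(0, n - width + 1):n + 1]
        out.append(sum(window) / width)
    return out

def peak_index(samples):
    """Index of the largest sample value."""
    return samples.index(max(samples))

# Hypothetical signal: a triangular pulse with its peak at sample 50.
signal = [max(0.0, 10.0 - abs(n - 50)) for n in range(100)]

WIDTH = 9  # assumed window width in samples
filtered = causal_moving_average(signal, WIDTH)

delay = peak_index(filtered) - peak_index(signal)
print("peak in original signal:", peak_index(signal))
print("peak in filtered signal:", peak_index(filtered))
print("delay in samples:", delay)
```

For this particular filter the delay is (WIDTH - 1) / 2 samples, so the peak found in the filtered signal has to be shifted back by that amount before looking for the corresponding peak in the original waveform.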
I think a visual inspection (plot both distributions, suitably normalised, on the same axes) would suffice for the purposes of the Speech Synthesis assignment.
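As a rough sketch of what "suitably normalised" could mean: convert raw counts into relative frequencies, so that a large corpus and a small subset can be compared on the same scale. The data below are made up, and the text-based bars are only a stand-in for a proper plot made with whatever plotting tool you prefer.

```python
from collections import Counter

def relative_frequencies(items):
    """Map each distinct item to its count divided by the total count."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}

# Hypothetical toy data: unit types in a large corpus and in a small subset.
large_corpus = list("aaaaabbbccd" * 10)
small_subset = list("aaabbcd")

corpus_dist = relative_frequencies(large_corpus)
subset_dist = relative_frequencies(small_subset)

# Crude side-by-side 'plot': one pair of bars per unit type, same scale.
for unit in sorted(set(corpus_dist) | set(subset_dist)):
    corpus_bar = "#" * round(40 * corpus_dist.get(unit, 0.0))
    subset_bar = "*" * round(40 * subset_dist.get(unit, 0.0))
    print(f"{unit} corpus {corpus_bar}")
    print(f"  subset {subset_bar}")
```

Without the normalisation, the larger set would dominate the plot simply because it has more tokens, and the shapes of the two distributions would be hard to compare by eye.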
To record sound, we measure deviations above or below mean (resting) air pressure using a microphone [*]. The vertical axis on an audio waveform plot corresponds to air pressure. That is why waveform samples during silence have values around zero.
Our idealised impulse train is a waveform where all samples have a value of exactly zero, except for one sample per period which has a positive value. This is the simplest possible signal that contains energy at all multiples of the fundamental frequency.
As you have realised, this idealised signal is not physically possible – a real signal does indeed need to also have regions of below-mean-air-pressure. (It does not need to be symmetric though, so we don’t need “negative impulses” to balance the positive ones.)
[* Actually, microphones vary in precisely what they measure: pressure, pressure gradient across a diaphragm, velocity, …etc. This subtlety is not important to understand here.]
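If you would like to check the claim about the idealised impulse train for yourself, here is a minimal sketch using a direct (slow) DFT in plain Python. The period of 8 samples and the signal length of 64 samples are arbitrary choices for illustration, and the function names are my own.

```python
import cmath

def impulse_train(period, num_samples):
    """All samples zero, except one positive sample per period."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]

def dft_magnitudes(samples):
    """Magnitude of each bin of the discrete Fourier transform."""
    N = len(samples)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * n / N)
                    for n, x in enumerate(samples)))
            for k in range(N)]

PERIOD = 8        # assumed fundamental period in samples
NUM_SAMPLES = 64  # a whole number of periods

magnitudes = dft_magnitudes(impulse_train(PERIOD, NUM_SAMPLES))

# The fundamental falls in bin NUM_SAMPLES / PERIOD; harmonics are its multiples.
fundamental_bin = NUM_SAMPLES // PERIOD
for k, magnitude in enumerate(magnitudes[:NUM_SAMPLES // 2 + 1]):
    label = "harmonic" if k % fundamental_bin == 0 else ""
    print(k, round(magnitude, 6), label)
```

The magnitude is the same at every multiple of the fundamental (including 0 Hz, because this signal has a non-zero mean) and is effectively zero everywhere else.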
First let’s determine if that is the original distribution of the corpus – a little digging shows that it’s not. From the paper we eventually find the v2 update which has this license. You can read that to check if it allows your intended use (it does).
In the class for Module 4 – the database, you can think more about whether this dataset is suitable (e.g., is it going to be easy to read out loud? will the text present normalisation issues?) and, if it is suitable, how to select a subset for recording.