Forum Replies Created
Taylor is separating it out here because he is trying to show how the equations align with the physics of sound propagation in the vocal tract.
Lip radiation can be treated as a constant effect: effectively, a filter that boosts high frequencies. This filtering effect is independent of the configuration of the articulators (Taylor, 2009, equation 11.29).
Furthermore, the constant high-pass filtering effect of lip radiation is more than cancelled out by another constant effect of low-pass filtering at the sound source:
It is this, combined with the radiation effect, that gives all speech spectra their characteristic spectral slope.
(Taylor, 2009, page 332)
So, we don’t need any learnable model parameters for these effects. We can account for them either by absorbing this constant effect into the vocal tract filter (which might be modelled using linear prediction) or by pre-emphasising the signal in the time domain (Taylor, 2009, page 375) to make its spectrum flatter, before any subsequent modelling, processing or feature extraction.
Pre-emphasis is standard practice in most speech processing – can you find where this is done in the digit recogniser?
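To make that concrete, here is a minimal pre-emphasis sketch in Python; the first-order filter form and the 0.97 coefficient are typical choices for illustration, not values taken from the readings:

```python
import numpy as np

def pre_emphasise(signal, coefficient=0.97):
    """First-order high-pass filter: y[n] = x[n] - a * x[n-1].
    Flattens the overall spectral slope caused by the source and
    lip-radiation effects, before any further analysis."""
    return np.append(signal[0], signal[1:] - coefficient * signal[:-1])
```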
Could you re-upload your TextGrid file, but change the suffix to .txt first (you’ll need to do that on the command line)?
The Fourier transform is invertible, which means that you get as many data points out as you put in. This means that using a longer analysis window in the time domain gives you more points (often referred to as “FFT bins”) in the frequency domain magnitude spectrum (we will ignore phase). Those points (“bins”) are equally spaced from 0 Hz up to the Nyquist frequency.
So, a longer analysis frame in the time domain means that the spectrum has more detail: higher frequency resolution.
The Fourier spectrum is effectively an average over the entire analysis frame (formally, we are making an assumption that the spectrum is constant throughout the frame). A longer time window means averaging over a longer duration of the signal and thus being less precise in the time domain: lower time resolution.
There is a trade-off between time and frequency resolution. In practical terms, this means we have to choose an analysis frame length that is appropriate for our signal. For speech, 25 ms is a common choice.
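A quick numerical sketch shows how the number of FFT bins and their spacing change with frame length (the 16 kHz sampling rate and the particular frame lengths are just illustrative assumptions):

```python
import numpy as np

fs = 16000                                  # assumed sampling rate (Hz)
for frame_ms in (10, 25, 50):
    n = int(fs * frame_ms / 1000)           # samples in the analysis frame
    frame = np.random.randn(n)              # stand-in for a windowed speech frame
    magnitude = np.abs(np.fft.rfft(frame))  # n//2 + 1 bins from 0 Hz to Nyquist
    print(f"{frame_ms} ms frame: {len(magnitude)} bins, {fs / n:.0f} Hz apart")

# 10 ms ->  81 bins, 100 Hz apart
# 25 ms -> 201 bins,  40 Hz apart
# 50 ms -> 401 bins,  20 Hz apart
```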
The choice between red and green could be arbitrary, but it would be better to choose whichever gives you the lowest global distance (as in DTW).
There are many problems with pattern matching using exemplars, whether using linear or dynamic alignment, and the HMM will solve most of them. For example, the number of local distances summed together might vary with different alignments, making it hard to compare them fairly.
Holmes & Holmes Chapter 8 details various solutions that were proposed for some of these problems, such as introducing ad-hoc penalties, or placing restrictions on how paths move through the grid (e.g., slope constraints). These all demonstrate that the method has many problems and ad-hoc design decisions.
The filter-bank is performing feature extraction from the speech signal. Even though this chapter is now outdated, filter-bank features are back in use for Automatic Speech Recognition (ASR).
“excitation periodicity” refers to the nature of the vocal fold excitation.
In the time domain, this means the waveform is periodic and its energy will fluctuate over time. Holmes & Holmes say
is also necessary to apply some time-smoothing to remove it
but in fact all this really means in practice is that we need to use a sufficiently long analysis frame (with a duration of, say, at least 2 pitch periods) for short-term analysis. A typical analysis frame duration would be 25 ms or 1/40th of a second.
In the frequency domain, the periodic sound source means that there is harmonic structure in the spectrum.
For ASR, we generally do not want to capture any information about F0 in our features, and so the filters in the filter-bank need to have large enough bandwidths to avoid resolving the harmonics. That is, each filter needs to be wider than, say, twice F0. You could think of this as “blurring” the spectrum (like an out-of-focus photograph) so that we can only see the overall shape and cannot make out the fine detail of the harmonics.
Holmes & Homes describes this as
It is best, therefore, to make the bandwidth of the spectral resolution such that it will not resolve the harmonics of the fundamental of voiced speech.
Module 7 covers feature engineering in more depth and we’ll see that some further processing of the filter-bank outputs can improve things. For example, we will take the log of each filter’s output power.
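As a sketch of what this looks like in practice, assuming the librosa library is available (the 23-filter count, frame sizes, 16 kHz sampling rate and file path are illustrative choices, not values taken from the readings):

```python
import numpy as np
import librosa

# "speech.wav" is a placeholder path.
signal, sr = librosa.load("speech.wav", sr=16000)

# 25 ms frames, 10 ms shift, 23 mel-spaced filters: a small number of
# broad filters, so the harmonic fine detail is smoothed away.
mel_power = librosa.feature.melspectrogram(
    y=signal, sr=sr,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
    n_mels=23)

# Take the log of each filter's output power.
log_fbank = np.log(mel_power + 1e-10)    # small constant avoids log(0)
print(log_fbank.shape)                   # (23, number_of_frames)
```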
The reading is “Jurafsky & Martin – Chapter 9 introduction” – meaning the 2.5 pages of un-numbered introduction text immediately after the chapter heading and before 9.1.
Positive comments
Each of these points was made by at least 5 people:
- The course is interesting
- The course is well organised
- The videos are helpful
- The website is good
- The topic map is useful
Several of you also said the tutors were good (and there were no negative points about them).
The forums are sometimes confusing and hard to navigate: there is a lot of content
This is inevitable to some extent, but I do take the following actions during the course, and between years, to maintain some order in the forums:
- editing students’ questions for clarity
- moving topics to the most relevant forum
- merging related topics
I don’t expect you to read every topic in the forums! Use the search function (there is a separate search box only for the forums, separate from the main site search). I also don’t mind duplicate questions – I will simply merge topics and point you to where the question was already answered.
Expectations for the assignment should be more clear
Regarding the lab work: as noted above, I have designed this to be exploratory and only semi-structured. This is to complement other learning styles in the course (e.g., classes are much more structured).
Regarding the written lab report: yes, it is difficult to make expectations clear to such a diverse class where approximately half the students have mostly written essays in previous courses, whilst others have never written an essay. Marking for the first assignment takes this into account: there is flexibility in how you interpret the guidance provided and there is more than one way to get a high mark. We also provide lots of individual and class-wide feedback to help with the second assignment.
Lab instructions for the first assignment should contain examples of each type of mistake we are looking for
I gave live examples in the lab of each type of mistake, rather than write them in the instructions. I will not be adding these to the written instructions, because this assignment is deliberately exploratory to encourage you to develop your own understanding and not simply to follow a sequence of instructions in a prescribed order.
Classes are too fast-paced (and spend too long on introductions and basic material, then rush the harder material)
I will try to pace classes better, getting through the preliminaries more quickly to leave more time for the harder concepts.
I already assume that all students have completed the videos and essential readings ahead of class, and I will continue to do this.
(Only one respondent thought classes were too slow)
Lab sessions need more structure and guidance
We will increase this in the remaining lab sessions, which are for the second coursework assignment on Automatic Speech Recognition.
There are already more detailed milestones for this assignment than the previous one, and we’ll link the lab session content closely to those. We will say in each lab session what the goal of the session is, and what you should be learning in it.
The expansion method depends on the category of NSW. Some are trivial and can be done with very simple rules (e.g., LSEQ) or no further processing (ASWD), whilst others require something more sophisticated, e.g., time or money expressions where there is context-dependency and possibly re-ordering between the characters and the words.
FSTs would be a sensible formalism for some categories, and these would generally be written by hand (possibly expressed via another formalism such as a grammar). The precise choice per NSW category will depend on the particular system, so don’t get hung up on that too much.
For the purposes of the assignment, you can assume Festival uses a variety of methods including both simple rules and FSTs.
You are right that an FST is both an acceptor (for recognising a pattern of characters in an NSW and thus classifying the token as being of that NSW type) and an emitter (for outputting the words). However, you are also right that classification takes place before expansion, so we only use the FST to transduce (“translate”) the token’s characters to words.
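Purely as an illustration of the idea (not Festival’s actual implementation), a couple of trivially simple expansion rules might look like this; the category names follow the NSW taxonomy, everything else is hypothetical:

```python
import re

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def expand_lseq(token):
    """LSEQ: spell out a letter sequence, e.g. 'BBC' -> 'B B C'."""
    return " ".join(token)

def expand_money(token):
    """Very naive money expansion, e.g. '$5' -> 'five dollars'.
    Note the re-ordering: the currency symbol precedes the digits,
    but the word 'dollars' follows the number."""
    match = re.fullmatch(r"\$(\d)", token)
    if match:
        return f"{ONES[int(match.group(1))]} dollars"
    return token

print(expand_lseq("BBC"))   # B B C
print(expand_money("$5"))   # five dollars
```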
You can’t do this assignment with the kal_diphone voice. You would need to take a copy of the Scottish male voice from the lab computers (and delete it from your own device afterwards).
Given the large variety of student computers, we don’t currently offer any technical support on installing Festival. There is a Festival mailing list where you can find answers to most questions.
The Appleton Tower lab is fully supported and is the best place to do the assignment, partly because you will learn more by working alongside other students.
Yes, J&M (2nd edition, Section 8.1.1) are discussing segmenting a longer text into sentences, rather than dividing a sentence into tokens for further processing. The former is the harder problem, for the reasons they explain.
They don’t explicitly discuss the latter, but imply that hand-written rules using whitespace and punctuation would be enough, given that this is what happens to the entire text before a classifier is used to find End-Of-Sentence boundaries.
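As an illustration of how little is needed for basic tokenisation, here is a toy whitespace-and-punctuation splitter; the regex is my own example, not what J&M or Festival actually use:

```python
import re

def tokenise(sentence):
    """Split one sentence into tokens: runs of word characters,
    or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenise("Dr. Smith paid $5, then left."))
# ['Dr', '.', 'Smith', 'paid', '$', '5', ',', 'then', 'left', '.']
```

Notice that this happily splits the full stop off “Dr.” – deciding whether that full stop actually ends a sentence is the harder, End-Of-Sentence problem that needs a classifier.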
Festival will process multi-sentence text, although its internal data structure assumes a single utterance and there is no representation of “sentences” within an utterance.
So, for the purposes of this assignment you should restrict yourself to isolated single sentences.