Forum Replies Created

Viewing 15 posts - 421 through 435 (of 1,087 total)

November 8, 2019 at 11:04 in reply to: Using Praat to label data #10170
Simon
Professor
Could you re-upload your TextGrid file, but with the suffix changed to .txt? (You’ll need to do that on the command line.)
November 7, 2019 at 11:49 in reply to: Trading Off Time and Frequency Resolution #10166
Simon
Professor
The Fourier transform is invertible, which means that you get as many data points out as you put in. This means that using a longer analysis window in the time domain gives you more points (often referred to as “FFT bins”) in the frequency domain magnitude spectrum (we will ignore phase). Those points (“bins”) are equally spaced from 0 Hz up to the Nyquist frequency.

So, a longer analysis frame in the time domain means that the spectrum has more detail: higher frequency resolution.

The Fourier spectrum is effectively an average over the entire analysis frame (formally, we are making an assumption that the spectrum is constant throughout the frame). A longer time window means averaging over a longer duration of the signal and thus being less precise in the time domain: lower time resolution.

There is a trade-off between time and frequency resolution. In practical terms, this means we have to choose an analysis frame length that is appropriate for our signal. For speech, 25 ms is a common choice.
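To make the trade-off concrete, here is a minimal sketch in Python using NumPy. The 16 kHz sampling rate and the three frame durations are assumptions chosen for illustration; only the 25 ms value comes from the discussion above.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz; assumed for illustration


def describe_frame(duration_ms, sample_rate=SAMPLE_RATE):
    """Return (samples in frame, number of magnitude bins, bin spacing in Hz)."""
    n_samples = int(round(sample_rate * duration_ms / 1000))
    frame = np.zeros(n_samples)              # the signal content does not matter here
    magnitude = np.abs(np.fft.rfft(frame))   # bins from 0 Hz up to the Nyquist frequency
    bin_spacing_hz = sample_rate / n_samples
    return n_samples, len(magnitude), bin_spacing_hz


for duration_ms in (5, 25, 100):
    n, bins, spacing = describe_frame(duration_ms)
    print(f"{duration_ms:3d} ms frame: {n} samples, {bins} bins, {spacing:.0f} Hz apart")
```

With these assumed settings, the 25 ms frame has 400 samples and gives bins 40 Hz apart, whereas the 5 ms frame gives bins 200 Hz apart. (For a real-valued signal the magnitude spectrum is symmetric, so `rfft` keeps only the non-redundant half: N/2 + 1 bins for N samples.)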
November 7, 2019 at 11:12 in reply to: Linear Time Warping, alignment issues #10165
Simon
Professor
The choice between red and green could be arbitrary, but it would be better to choose whichever gives you the lowest global distance (as in DTW).

There are many problems with pattern matching using exemplars, whether using linear or dynamic alignment, and the HMM will solve most of them. For example, the number of local distances summed together might vary with different alignments, making it hard to compare them fairly.

Holmes & Holmes Chapter 8 details various solutions that were proposed for some of these problems, such as introducing ad-hoc penalties, or placing restrictions on how paths move through the grid (e.g., slope constraints). These all demonstrate that the method has many problems and ad-hoc design decisions.
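The point about the number of local distances can be seen in a small sketch of DTW. This is an illustration under simplifying assumptions, not the formulation in Holmes & Holmes: the sequences are plain numbers rather than feature vectors, the local distance is the absolute difference, and the function name `dtw` is my own.

```python
import numpy as np


def dtw(template, test):
    """Align two 1-D sequences.

    Returns (global distance, number of local distances summed on the best path).
    """
    n, m = len(template), len(test)
    cost = np.full((n, m), np.inf)
    steps = np.zeros((n, m), dtype=int)
    for i in range(n):
        for j in range(m):
            local = abs(template[i] - test[j])
            if i == 0 and j == 0:
                cost[i, j], steps[i, j] = local, 1
                continue
            # Allowed predecessors: vertical, horizontal and diagonal moves.
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1, j], steps[i - 1, j]))
            if j > 0:
                candidates.append((cost[i, j - 1], steps[i, j - 1]))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1, j - 1], steps[i - 1, j - 1]))
            # Lowest accumulated cost wins; ties go to the shorter path.
            best_cost, best_steps = min(candidates)
            cost[i, j] = best_cost + local
            steps[i, j] = best_steps + 1
    return float(cost[-1, -1]), int(steps[-1, -1])


print(dtw([1, 2, 3], [1, 2, 3]))        # (0.0, 3)
print(dtw([1, 2, 3], [1, 1, 2, 3, 3]))  # (0.0, 5)
```

The second alignment sums 5 local distances and the first only 3. Both happen to total zero here, but with noisy data the longer path would accumulate more distance simply because it has more terms, which is why the raw global distances are hard to compare fairly.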
November 6, 2019 at 09:38 in reply to: Holmes & Holmes – Chapter 8 #10157
Simon
Professor
The filter-bank is performing feature extraction from the speech signal. Even though this chapter is now outdated, filter-bank features are back in use for Automatic Speech Recognition (ASR).

“excitation periodicity” refers to the nature of the vocal fold excitation.

In the time domain, this means the waveform is periodic and its energy will fluctuate over time. Holmes & Holmes say

“… is also necessary to apply some time-smoothing to remove it”

but in practice all this really means is that we need to use a sufficiently long analysis frame (say, at least 2 pitch periods in duration) for short-term analysis. A typical analysis frame duration would be 25 ms, which is 1/40th of a second.

In the frequency domain, the periodic sound source means that there is harmonic structure in the spectrum.

For ASR, we generally do not want to capture any information about F0 in our features, and so the filters in the filter-bank need to have large enough bandwidths to avoid resolving the harmonics. That is, each filter needs to be wider than, say, twice F0. You could think of this as “blurring” the spectrum (like an out-of-focus photograph) so that we can only see the overall shape and cannot make out the fine detail of the harmonics.

Holmes & Holmes describe this as

“It is best, therefore, to make the bandwidth of the spectral resolution such that it will not resolve the harmonics of the fundamental of voiced speech.”

Module 7 covers feature engineering in more depth and we’ll see that some further processing of the filter-bank outputs can improve things. For example, we will take the log of each filter’s output power.
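Here is a rough sketch of the idea in Python using NumPy. It is deliberately simplified and is not the filter-bank described in the chapter: I use equal-width rectangular bands rather than the overlapping, perceptually spaced filters a real system would use. The sampling rate, the F0 of 120 Hz, the 500 Hz bandwidth and the function name `log_filterbank` are all assumptions for illustration.

```python
import numpy as np

SAMPLE_RATE = 16000     # Hz; assumed
FRAME_MS = 25           # analysis frame duration, as discussed above
F0 = 120.0              # Hz; assumed fundamental of a synthetic "voiced" sound
BAND_WIDTH_HZ = 500.0   # assumed; comfortably wider than twice F0


def log_filterbank(frame, sample_rate=SAMPLE_RATE, band_width_hz=BAND_WIDTH_HZ):
    """Sum the power spectrum in equal-width rectangular bands, then take the log."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    edges = np.arange(0.0, sample_rate / 2 + band_width_hz, band_width_hz)
    # Each band covers [low, high); the single bin exactly at Nyquist is left out.
    band_power = np.array([
        power[(freqs >= low) & (freqs < high)].sum()
        for low, high in zip(edges[:-1], edges[1:])
    ])
    return np.log(band_power + 1e-10)  # small constant avoids log(0)


# One 25 ms frame of a synthetic periodic signal with 30 harmonics of F0.
n_samples = int(SAMPLE_RATE * FRAME_MS / 1000)
t = np.arange(n_samples) / SAMPLE_RATE
frame = sum(np.sin(2 * np.pi * F0 * k * t) / k for k in range(1, 31))

features = log_filterbank(frame)
print(len(features), "filter outputs for one frame")
```

At the assumed F0 of 120 Hz, the 25 ms frame spans 3 pitch periods, and each 500 Hz band contains several harmonics, so the individual harmonics are blurred together and only the overall spectral shape remains in the features.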
November 5, 2019 at 12:08 in reply to: Reading for module 6 #10151
Simon
Professor
The reading is “Jurafsky & Martin – Chapter 9 introduction” – meaning the 2.5 pages of un-numbered introduction text immediately after the chapter heading and before 9.1.
November 4, 2019 at 09:22 in reply to: Response to Speech Processing feedback of 2019-10-24 #10139
Simon
Professor
Positive comments

Each of these points was made by at least 5 people:

The course is interesting

The course is well organised

The videos are helpful

The website is good

The topic map is useful

Several of you also said the tutors were good (and there were no negative points about them).
November 4, 2019 at 09:18 in reply to: Response to Speech Processing feedback of 2019-10-24 #10138
Simon
Professor
The forums are sometimes confusing and hard to navigate: there is a lot of content

This is inevitable to some extent, but I do take the following actions during the course, and between years, to maintain some order in the forums:
- editing students’ questions for clarity
- moving topics to the most relevant forum
- merging related topics
I don’t expect you to read every topic in the forums! Use the search function (the forums have their own search box, separate from the main site search). I also don’t mind duplicate questions – I will simply merge topics and point you to where the question was already answered.
November 4, 2019 at 09:13 in reply to: Response to Speech Processing feedback of 2019-10-24 #10136
Simon
Professor
Expectations for the assignment should be more clear

Regarding the lab work: as noted above, I have designed this to be exploratory and only semi-structured. This is to complement other learning styles in the course (e.g., classes are much more structured).

Regarding the written lab report: yes, it is difficult to make expectations clear to such a diverse class where approximately half the students have mostly written essays in previous courses, whilst others have never written an essay. Marking for the first assignment takes this into account: there is flexibility in how you interpret the guidance provided and there is more than one way to get a high mark. We also provide lots of individual and class-wide feedback to help with the second assignment.
November 4, 2019 at 09:07 in reply to: Response to Speech Processing feedback of 2019-10-24 #10135
Simon
Professor
Lab instructions for the first assignment should contain examples of each type of mistake we are looking for

I gave live examples in the lab of each type of mistake, rather than writing them in the instructions. I will not be adding these to the written instructions, because this assignment is deliberately exploratory to encourage you to develop your own understanding and not simply to follow a sequence of instructions in a prescribed order.
November 4, 2019 at 09:05 in reply to: Response to Speech Processing feedback of 2019-10-24 #10134
Simon
Professor
Classes are too fast-paced (and spend too long on introductions and basic material, then rush the harder material)

I will try to pace classes better, getting through the preliminaries more quickly to leave more time for the harder concepts.

I already assume that all students have completed the videos and essential readings ahead of class, and I will continue to do this.

(Only one respondent thought classes were too slow.)
November 4, 2019 at 09:02 in reply to: Response to Speech Processing feedback of 2019-10-24 #10133
Simon
Professor
Lab sessions need more structure and guidance

We will increase this in the remaining lab sessions, which are for the second coursework assignment on Automatic Speech Recognition.

There are already more detailed milestones for this assignment than the previous one, and we’ll link the lab session content closely to those. We will say in each lab session what the goal of the session is, and what you should be learning in it.
October 26, 2019 at 18:24 in reply to: Explaining expansion of NSWs #10073
Simon
Professor
The expansion method depends on the category of NSW. Some are trivial and can be done with very simple rules (e.g., LSEQ) or no further processing (ASWD), whilst others require something more sophisticated, e.g., time or money expressions where there is context-dependency and possibly re-ordering between the characters and the words.

FSTs would be a sensible formalism for some categories, and these would generally be written by hand (possibly expressed via another formalism such as a grammar). The precise choice per NSW category will depend on the particular system, so don’t get hung up on that too much.

For the purposes of the assignment, you can assume Festival uses a variety of methods including both simple rules and FSTs.

You are right that an FST is both an acceptor (for recognising a pattern of characters in an NSW and thus classifying the token as being of that NSW type) and an emitter (for outputting the words). However, you are also right that classification takes place before expansion, so we only use the FST to transduce (“translate”) the token’s characters to words.
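To illustrate the difference between trivial and less trivial categories, here is a toy sketch in Python. It is not how Festival does it: the function `expand_nsw`, the `MONEY` label and the single-digit dollar pattern are all hypothetical, and a real system would use hand-written FSTs or grammars covering far more cases. The category is passed in because classification has already happened before expansion.

```python
import re

# Single digits only, to keep the sketch short.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def expand_nsw(token, category):
    """Expand one already-classified NSW token into a list of words."""
    if category == "ASWD":
        # Read as an ordinary word: no further processing.
        return [token]
    if category == "LSEQ":
        # Letter sequence: a trivial rule, one "word" per letter.
        return [letter.lower() for letter in token if letter.isalpha()]
    if category == "MONEY":
        # The characters are symbol-then-amount, but the words are amount-then-unit.
        match = re.fullmatch(r"\$(\d)", token)
        if match is None:
            raise ValueError(f"unsupported money expression: {token!r}")
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return [DIGIT_WORDS[amount], unit]
    raise ValueError(f"unknown NSW category: {category!r}")


print(expand_nsw("BBC", "LSEQ"))   # ['b', 'b', 'c']
print(expand_nsw("$5", "MONEY"))   # ['five', 'dollars']
```

Even this tiny money rule shows the re-ordering (the currency symbol comes first in the characters but last in the words) and the context-dependency (singular versus plural unit) that make such categories harder than LSEQ.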
October 26, 2019 at 15:05 in reply to: Installing the Scottish male voice for festival at home #10068
Simon
Professor
You can’t do this assignment with the kal_diphone voice. You would need to take a copy of the Scottish male voice from the lab computers (and delete it from your own device afterwards).

Given the large variety of student computers, we don’t currently offer any technical support on installing Festival. There is a Festival mailing list where you can find most answers.

The Appleton Tower lab is fully supported and is the best place to do the assignment, partly because you will learn more by working alongside other students.
October 26, 2019 at 12:35 in reply to: Sentence tokenization in festival. #10064
Simon
Professor
Yes, J&M (2nd edition, Section 8.1.1) are discussing segmenting a longer text into sentences, rather than dividing a sentence into tokens for further processing. The former is the harder problem, for the reasons they explain.

They don’t explicitly discuss the latter, but imply that hand-written rules using whitespace and punctuation would be enough, given that this is what happens to the entire text before a classifier is used to find End-Of-Sentence boundaries.

Festival will process multi-sentence text, although its internal data structure assumes a single utterance and there is no representation of “sentences” within an utterance.

So, for the purposes of this assignment you should restrict yourself to isolated single sentences.
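The kind of hand-written whitespace-and-punctuation rule mentioned above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Festival’s tokeniser: the function name `tokenize`, the two punctuation sets and the (leading punctuation, token, trailing punctuation) output format are my own assumptions.

```python
LEADING_PUNCTUATION = "\"'(["
TRAILING_PUNCTUATION = ".,;:!?\"')]"


def tokenize(sentence):
    """Split one sentence on whitespace, then peel punctuation off each chunk.

    Returns a list of (leading punctuation, token, trailing punctuation) tuples.
    """
    tokens = []
    for chunk in sentence.split():
        core = chunk.lstrip(LEADING_PUNCTUATION)
        leading = chunk[:len(chunk) - len(core)]
        name = core.rstrip(TRAILING_PUNCTUATION)
        trailing = core[len(name):]
        tokens.append((leading, name, trailing))
    return tokens


for token in tokenize('He said "stop", then left.'):
    print(token)
```

Rules this simple will get some cases wrong (for example, they strip the final full stop from an abbreviation), which is one reason the later stages of the front end need to handle NSWs carefully.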
October 26, 2019 at 11:36 in reply to: Repetition of Token and POS #10061
Simon
Professor
Token_POS is where Festival stores additional disambiguation information needed for certain homographs where POS alone is not sufficient (e.g., “read”).

http://www.cstr.ed.ac.uk/projects/festival/manual/festival_15.html#SEC59