Forum Replies Created
The same techniques – for example, a classification tree – are used for unseen names as for other types of unseen words. But in some systems, separate classifiers are used for the two cases.
The classifier for names might use additional features provided by some earlier stage in the pipeline – for example, a prediction (“guess”) of which foreign language the word originates from.
This prediction would come from a language classifier that would itself need to be trained in a supervised manner from labelled data, such as a large list of words tagged with language of origin. This classifier might use features derived from the sequence of letters in the word, or even simply letter frequency, which differs between languages.
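In case a concrete picture helps, here is a minimal sketch of such a classifier – a naive Bayes model over letter frequencies, trained on a tiny made-up word list (the words, languages and smoothing choices are all just illustrative assumptions, not anything taken from Festival):

```
# Train a naive Bayes language-of-origin classifier using letter frequencies
# as features, from a small labelled list of (word, language) pairs.
import math
from collections import Counter, defaultdict

TRAIN = [                         # hypothetical labelled training data
    ("schmidt", "german"), ("mueller", "german"), ("zimmermann", "german"),
    ("giovanni", "italian"), ("rossini", "italian"), ("bianchi", "italian"),
]

def train(data):
    """Count letter frequencies per language, plus language priors."""
    counts, priors = defaultdict(Counter), Counter()
    for word, lang in data:
        priors[lang] += 1
        counts[lang].update(word.lower())
    return counts, priors

def classify(word, counts, priors, alpha=1.0):
    """Return the most probable language of origin for an unseen word."""
    vocab = len({letter for counter in counts.values() for letter in counter})
    best_lang, best_score = None, -math.inf
    for lang in counts:
        total = sum(counts[lang].values())
        score = math.log(priors[lang] / sum(priors.values()))
        for letter in word.lower():
            # Laplace smoothing stops unseen letters zeroing the probability.
            score += math.log((counts[lang][letter] + alpha) / (total + alpha * vocab))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

counts, priors = train(TRAIN)
print(classify("paganini", counts, priors))   # most likely "italian"
print(classify("schulze", counts, priors))    # most likely "german"
```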
For our purpose (which is to form a simple theoretical source-filter model of speech production), I think it’s OK to say that the vocal folds are simply vibrating during vowel sounds and the only thing that can vary about this is the frequency of vibration (F0).
In tone languages, F0 can distinguish words and so is a phonological feature. In other languages (e.g., English), F0 does not distinguish one word from another.
Of course, reality is more complex. The vocal folds can be used to control voice quality. For example, in breathy speech, the folds never completely close and the leaking airflow results in turbulence (like in a fricative).
The term magnitude is usually used with regard to the spectrum (e.g., obtained by the FFT). It is used to distinguish from the phase spectrum, which we don’t really need to worry about here.
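If it helps to see the two side by side, here is a small NumPy sketch (the sample rate and sine frequency are arbitrary choices, purely for illustration):

```
# Take the FFT of a short signal, then separate the complex spectrum into
# its magnitude spectrum and its phase spectrum.
import numpy as np

fs = 16000                          # sample rate in Hz (arbitrary)
t = np.arange(0, 0.02, 1 / fs)      # 20 ms of signal
x = np.sin(2 * np.pi * 440 * t)     # a 440 Hz sine wave

X = np.fft.rfft(x)                  # complex-valued spectrum
magnitude = np.abs(X)               # magnitude spectrum (what spectrum displays show)
phase = np.angle(X)                 # phase spectrum (usually ignored for our purposes)

freqs = np.fft.rfftfreq(len(x), d=1 / fs)
print(freqs[np.argmax(magnitude)])  # the peak is close to 440 Hz
```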
Try it for yourself!
Draw a sine wave on some graph paper and then sample it slightly more than 2 times per period. Scan that and post it here, then we can discuss it.
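If you’d rather simulate the exercise than scan graph paper, here is a rough NumPy equivalent (the frequencies are arbitrary):

```
# Sample a sine wave at just over 2 samples per period and inspect the samples.
import numpy as np

f0 = 1.0                       # one cycle per second
fs = 2.2 * f0                  # slightly more than 2 samples per period
n = np.arange(20)              # sample indices
samples = np.sin(2 * np.pi * f0 * n / fs)

for value in samples:
    print(f"{value:+.2f}")

# Because f0 is (just) below the Nyquist frequency fs/2, the samples still
# represent a component at f0 without aliasing, even though the sampled
# waveform looks very sparse and its sample heights drift from cycle to cycle.
```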
We’ll come on to this a little later, and then a lot more in the second semester course. The answer depends on the type of synthesis, and here are a few factors:
For all synthesis methods:
- Prosody – both intonation (F0) and duration
For concatenative synthesis:
- Discontinuities at the joins
- Units taken from inappropriate contexts (e.g., mismatched co-articulation)
For parametric synthesis:
- Artefacts of vocoding (e.g., ‘buzzy’ speech)
- Effects of averaging several speech samples together (e.g., ‘muffled’ speech)
You are correct: the Fast Fourier Transform (FFT) is simply a fast implementation of the Discrete Fourier Transform (DFT). Both are discrete: they take as input a digital signal (i.e., sampled) and produce as output a discrete (i.e., sampled) spectrum.
Wavesurfer is showing you a discrete spectrum – it’s just “joining up the dots” with a line, to make visualisation easier. That’s just the same as what happens in the waveform display.
The spectrum produced by the FFT is discrete in frequency: in other words, it is “sampled” at a set of evenly spaced frequencies between 0Hz and the Nyquist frequency.
The resolution (i.e., how closely spaced the samples are) depends directly on the analysis frame length.
If you zoom in far enough on either the spectrum or the waveform, you’ll see this discrete nature.
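To make the resolution-versus-frame-length relationship concrete, here’s a short NumPy sketch (16 kHz is just an example sample rate):

```
# The spacing between spectral samples ("bins") is fs / N, where N is the
# analysis frame length in samples: longer frames give finer resolution.
import numpy as np

fs = 16000                                  # sample rate in Hz
for frame_length in (256, 1024):            # frame lengths in samples
    freqs = np.fft.rfftfreq(frame_length, d=1 / fs)
    spacing = freqs[1] - freqs[0]
    print(f"N = {frame_length:4d}: bins every {spacing:.1f} Hz, "
          f"from 0 Hz up to the Nyquist frequency ({freqs[-1]:.0f} Hz)")
```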
OK – there are now week-by-week lists for each course.
“Well developed” meant in terms of linguistic knowledge and resources. Specifically, because POS tagging is performed using supervised machine learning, we need lots of accurately labelled (i.e., hand-tagged) data on which to train our tagger.
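As a toy illustration of what “supervised” means here (the tagged words are made up, and a most-frequent-tag baseline is far simpler than a real POS tagger):

```
# Train a most-frequent-tag baseline from a tiny hand-tagged word list, then
# use it to tag new words. Real taggers use context and much more data, but
# the reliance on labelled training data is the same.
from collections import Counter, defaultdict

TAGGED = [                      # hypothetical (word, tag) training pairs
    ("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
    ("the", "DET"), ("dog", "NOUN"), ("barked", "VERB"),
    ("a", "DET"), ("record", "NOUN"), ("to", "PART"), ("record", "VERB"),
]

counts = defaultdict(Counter)
for word, tag in TAGGED:
    counts[word][tag] += 1

def tag(word):
    """Return the tag most often seen with this word during training."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NOUN"               # crude back-off for unseen words

print(tag("cat"), tag("the"), tag("spectrogram"))
```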
I don’t have a nice physical explanation of this phenomenon, I’m afraid. You are clearly including the resonant frequency(ies) of the bottle in some way.
When you say “the wave that gets the best resonation” you are referring to a standing wave. The wavelength of this standing wave will be related to the length of the tube.
The waves are standing within the tube. This means that a pattern is set up inside the tube, caused by the back-and-forth propagation of sound waves that reinforce one another. These waves do not “leave” the tube as such. Rather, the standing wave transmits energy to the air beyond the tube, which then propagates to the listener’s ears.
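If you want rough numbers, here is a sketch that makes a deliberately strong assumption: treat the resonator as a uniform tube, closed at one end and open at the other (the idealisation we use for the vocal tract), rather than as an actual bottle (which behaves more like a Helmholtz resonator):

```
# For a uniform tube closed at one end, standing waves fit when the tube
# length is an odd number of quarter wavelengths, so resonances occur at
# odd multiples of c / 4L.
c = 350.0        # approximate speed of sound in warm, moist air (m/s)
L = 0.17         # tube length in metres (roughly an adult vocal tract)

for k in (1, 3, 5):
    print(f"resonance at about {k * c / (4 * L):.0f} Hz")

# With L = 0.17 m this gives roughly 500, 1500 and 2500 Hz, which are the
# familiar formant frequencies of a neutral vowel.
```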
You need to investigate the earlier steps in the pipeline to see where the difference between “UoE” and “UOE” first arises.
See this topic.
Yes, that’s correct. If we just want to detect whether a string matches a pattern (e.g., regular expression), we can use an automaton to “accept” it. No output is required: the fact that the automaton reaches the end state means the input string matched (i.e., was “accepted”).
If we want to convert the input to some output (e.g., convert “8.46” to “eight forty six”), then we’d use a transducer.
The theory of automata and transducers is beyond the scope of this course, but we do need to know what they can be used for.
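Just to give a feel for the distinction, here is a toy Python sketch rather than a real finite-state toolkit (the pattern and the word tables are hand-written and deliberately incomplete):

```
# An "acceptor" only answers yes/no; a "transducer" maps input to output.
import re

ONES = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
        6: "six", 7: "seven", 8: "eight", 9: "nine"}
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}

def accepts(token):
    """Acceptor: does the token match a simple time pattern such as '8.46'?"""
    return re.fullmatch(r"\d\.[0-5]\d", token) is not None

def transduce(token):
    """Transducer-like mapping from an accepted token to output words.
    Deliberately partial: e.g. minutes below twenty are not handled."""
    if not accepts(token):
        return None
    hour, minutes = token.split(".")
    tens, units = int(minutes[0]), int(minutes[1])
    if int(hour) == 0 or tens < 2:
        return None
    words = [ONES[int(hour)], TENS[tens]]
    if units:
        words.append(ONES[units])
    return " ".join(words)

print(accepts("8.46"))     # True -- accepted, but no output produced
print(transduce("8.46"))   # eight forty six
```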
Normalisation means converting the sequence of tokens into a sequence of words. A word is something that we can attempt to look up in a dictionary, or pass to the Letter To Sound module. That is, a word is something we can pronounce (i.e., say out loud).
In Festival, you can detect when a pronunciation has come from the dictionary: it will have a correct Part Of Speech (POS) tag. Pronunciations predicted by the Letter To Sound (LTS) module have a ‘nil’ part of speech tag.
The example of caterpillar here returns a nil POS tag.
The syllabification of words whose pronunciation comes from LTS must also be done automatically, and therefore can contain errors.
An incorrect syllabification could indeed have consequences for speech synthesis later in the pipeline. It might affect the prediction of prosody. In unit selection, it might affect the units chosen from the database.