Forum Replies Created
We’ll come on to this a little later, and then a lot more in the second semester course. The answer depends on the type of synthesis, and here are a few factors:
For all synthesis methods:
- Prosody – both intonation (F0) and duration
For concatenative synthesis:
- Discontinuities at the joins
- Units taken from inappropriate contexts (e.g., mismatched co-articulation)
For parametric synthesis:
- Artefacts of vocoding (e.g., ‘buzzy’ speech)
- Effects of averaging several speech samples together (e.g., ‘muffled’ speech)
You are correct: the Fast Fourier Transform (FFT) is simply a fast implementation of the Discrete Fourier Transform (DFT). Both are discrete: they take as input a digital signal (i.e., sampled) and produce as output a discrete (i.e., sampled) spectrum.
Wavesurfer is showing you a discrete spectrum – it’s just “joining up the dots” with a line, to make visualisation easier. That’s just the same as what happens in the waveform display.
The spectrum produced by the FFT is discrete in frequency: in other words, it is “sampled” at a set of evenly spaced frequencies between 0 Hz and the Nyquist frequency.
The resolution (i.e., how closely spaced the samples are) depends directly on the analysis frame length.
If you zoom in far enough on either the spectrum or the waveform, you’ll see this discrete nature.
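If it helps to see this concretely, here is a minimal sketch (assuming NumPy; the sampling rate and the 440 Hz sinusoid are made up for the example) showing that the FFT output is sampled in frequency, and that the spacing of those samples is set by the frame length:

```python
import numpy as np

fs = 16000                        # assumed sampling rate in Hz
for n_samples in (256, 1024):     # two different analysis frame lengths
    t = np.arange(n_samples) / fs
    frame = np.sin(2 * np.pi * 440 * t)           # a 440 Hz sinusoid
    spectrum = np.fft.rfft(frame)                 # discrete spectrum from 0 Hz to Nyquist
    freqs = np.fft.rfftfreq(n_samples, d=1 / fs)  # the evenly spaced analysis frequencies
    print(f"{n_samples}-sample frame: {len(freqs)} frequency bins, "
          f"spaced {freqs[1] - freqs[0]:.1f} Hz apart")
```

The longer frame gives more closely spaced frequency samples (15.6 Hz apart instead of 62.5 Hz), which is exactly the resolution effect described above.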
OK – there are now week-by-week lists for each course.
“Well developed” meant in terms of linguistic knowledge and resources. Specifically, because POS tagging is performed using supervised machine learning, we need lots of accurately labelled (i.e., hand-tagged) data on which to train our tagger.
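To make “supervised” concrete, here is a tiny sketch using NLTK’s hand-tagged Penn Treebank sample (this is just an illustration of training from labelled data, not how Festival’s own tagger was built; the corpus needs a one-off download):

```python
import nltk
from nltk.corpus import treebank

# nltk.download('treebank')                    # one-off download of the hand-tagged data

tagged_sentences = treebank.tagged_sents()     # sentences that humans have already POS-tagged
tagger = nltk.UnigramTagger(tagged_sentences)  # a very simple supervised tagger

print(tagger.tag(["The", "price", "of", "the", "stock", "rose"]))
# Words never seen in the hand-tagged data come back with a None tag,
# which is one reason we need *lots* of labelled data.
```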
I don’t have a nice physical explanation of this phenomenon, I’m afraid. You are clearly including the resonant frequency(ies) of the bottle in some way.
When you say “the wave that gets the best resonation” you are referring to a standing wave. The wavelength of this standing wave will be related to the length of the tube.
The waves are standing within the tube. This means that a pattern is set up within the tube, caused by the back-and-forth propagation of sound waves that reinforce one another. These waves do not “leave” the tube as such. Rather, the standing wave transmits energy to the air beyond the tube, which then propagates to the listener’s ears.
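As a rough, idealised illustration (treating the bottle as a uniform tube that is closed at one end and open at the other, with a speed of sound of about 350 m/s; a real bottle behaves more like a Helmholtz resonator, so this is only a sketch), the standing-wave resonances sit at odd multiples of a quarter wavelength:

```latex
f_n = \frac{(2n - 1)\,c}{4L}, \qquad n = 1, 2, 3, \ldots
```

For example, a tube of length L = 0.175 m would have its lowest resonance at f_1 = 350 / (4 × 0.175) = 500 Hz.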
You need to investigate the earlier steps in the pipeline to see where the difference between “UoE” and “UOE” first arises.
See this topic.
Yes, that’s correct. If we just want to detect whether a string matches a pattern (e.g., regular expression), we can use an automaton to “accept” it. No output is required: the fact that the automaton reaches the end state means the input string matched (i.e., was “accepted”).
If we want to convert the input to some output (e.g., convert “8.46” to “eight forty six”), then we’d use a transducer.
The theory of automata and transducers is beyond the scope of this course, but we do need to know what they can be used for.
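If a toy example helps, here is a sketch of the distinction (the regular expression and the number-to-words mapping are invented for illustration; this is not how Festival implements it):

```python
import re

# Acceptor: does the token match a "time" pattern? The only result is yes/no.
def accepts_time(token):
    return re.fullmatch(r"\d{1,2}\.\d{2}", token) is not None

# Transducer (toy): map an accepted token to an output word sequence.
NUMBER_WORDS = {"8": "eight", "46": "forty six"}

def transduce_time(token):
    hours, minutes = token.split(".")
    return f"{NUMBER_WORDS.get(hours, hours)} {NUMBER_WORDS.get(minutes, minutes)}"

print(accepts_time("8.46"))     # True  -- accepted, but no output produced
print(transduce_time("8.46"))   # 'eight forty six'
```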
Normalisation means converting the sequence of tokens into a sequence of words. A word is something that we can attempt to look up in a dictionary, or pass to the Letter To Sound module. That is, a word is something we can pronounce (i.e., say out loud).
In Festival, you can detect when a pronunciation has come from the dictionary: it will have a correct Part Of Speech (POS) tag. Pronunciations predicted by the Letter To Sound (LTS) module have a ‘nil’ part of speech tag.
The example of caterpillar here returns a nil POS tag.
The syllabification of words whose pronunciation comes from LTS must also be performed automatically, and can therefore contain errors.
An incorrect syllabification could indeed have consequences for speech synthesis later in the pipeline. It might affect the prediction of prosody. In unit selection, it might affect the units chosen from the database.
When you say “discretize” I think you perhaps mean “segment” – dividing the continuous speech stream into a sequence of units.
You are correct in thinking that a sequence of vowels is hard to segment (think about diphthongs) and that some consonants are relatively easy. This has implications for concatenative speech synthesis, in which we segment and then re-sequence recorded speech.
The concern of phonology is to place speech units into a finite set of discrete categories that can distinguish the words of a language (think about minimal pairs).
Adding more proper names would be a good idea, because those are generally hard to get right with LTS. But we could never cover every possible proper name. Likewise, adding “unusual” (i.e., low frequency) words cannot in itself solve the problem.
One simple reason is that new words are invented all the time, and no dictionary can include every possible word we might encounter.
Your comment about the efficiency of LTS is spot-on though, in terms of storage space. After creating the LTS model, we could remove all words from the dictionary that this LTS model correctly predicts. That would reduce the size of the dictionary. Festival does not do this, but commercial systems may.
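A sketch of that pruning idea, with an invented toy lexicon and a stand-in for the LTS model (not Festival’s actual API):

```python
def prune_lexicon(lexicon, lts_predict):
    """Keep only the entries the LTS model gets wrong; everything it
    predicts correctly can be dropped from the dictionary to save space."""
    return {word: pron for word, pron in lexicon.items()
            if lts_predict(word) != pron}

# Toy example: an "LTS model" that only handles regular spellings.
toy_lts = {"cat": "k ae t", "yacht": "y ae ch t"}   # gets "yacht" wrong
lexicon = {"cat": "k ae t", "yacht": "y oh t"}

print(prune_lexicon(lexicon, lambda w: toy_lts.get(w, "")))
# {'yacht': 'y oh t'}  -- only the word the LTS model mispredicts needs to stay
```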
The “rule” you suggest is performing Word Sense Disambiguation (in the case of “bass” at least). So, the general solution is to add a Word Sense Disambiguation module to the front-end of the system. I’ll add this to the list of possible topics for the next lecture.
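Purely as an illustration of what such a rule might look like (a hand-written toy, not a real Word Sense Disambiguation module; the cue words and phone strings are made up for the example):

```python
# Toy homograph rule for "bass": pick a pronunciation from nearby context words.
FISH_CUES = {"fishing", "lake", "river", "caught"}
MUSIC_CUES = {"guitar", "band", "drum", "player"}

def bass_pronunciation(context_words):
    context = {w.lower() for w in context_words}
    if context & MUSIC_CUES:
        return "b ey s"   # the musical sense
    if context & FISH_CUES:
        return "b ae s"   # the fish sense
    return "b ey s"       # fall back to a default when the context gives no clue

print(bass_pronunciation(["he", "plays", "bass", "guitar"]))          # b ey s
print(bass_pronunciation(["bass", "fishing", "on", "the", "lake"]))   # b ae s
```

A real WSD module would learn these decisions from labelled data rather than relying on a hand-written cue list.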
You are seeing the artefacts caused by the discontinuity at the edges of the window.
The “columns” are the spectra at the start and end of the window. They are not “super high frequency” – what you are seeing is energy across all frequencies.
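Here is a small sketch (assuming NumPy; the tone frequency and frame length are made up) of why an abrupt edge spreads energy across all frequencies, and why a tapered window reduces it:

```python
import numpy as np

fs, n = 16000, 512
t = np.arange(n) / fs
tone = np.sin(2 * np.pi * 1000.3 * t)    # a tone that does not line up with the frame edges

rect = np.abs(np.fft.rfft(tone))                   # rectangular window: abrupt edges
hann = np.abs(np.fft.rfft(tone * np.hanning(n)))   # Hann window: tapered edges

freqs = np.fft.rfftfreq(n, d=1 / fs)
far_from_tone = freqs > 4000                       # look well away from the 1 kHz tone
print("leaked energy, rectangular window:", rect[far_from_tone].sum())
print("leaked energy, Hann window:       ", hann[far_from_tone].sum())
```

The untapered (rectangular) case leaks far more energy away from the tone, and that spread across the whole frequency range is what appears as the vertical columns.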