Forum Replies Created
We’ll come on to this a little later, and then a lot more in the second semester course. The answer depends on the type of synthesis, and here are a few factors:
For all synthesis methods:
- Prosody – both intonation (F0) and duration
For concatenative synthesis:
- Discontinuities at the joins
- Units taken from inappropriate contexts (e.g., mismatched co-articulation)
For parametric synthesis:
- Artefacts of vocoding (e.g., ‘buzzy’ speech)
- Effects of averaging several speech samples together (e.g., ‘muffled’ speech)
You are correct: the Fast Fourier Transform (FFT) is simply a fast implementation of the Discrete Fourier Transform (DFT). Both are discrete: they take as input a digital signal (i.e., sampled) and produce as output a discrete (i.e., sampled) spectrum.
Wavesurfer is showing you a discrete spectrum – it’s just “joining up the dots” with a line, to make visualisation easier. That’s just the same as what happens in the waveform display.
The spectrum produced by the FFT is discrete in frequency: in other words, it is “sampled” at a set of evenly spaced frequencies between 0 Hz and the Nyquist frequency.
The resolution (i.e., how closely spaced the samples are) depends directly on the analysis frame length.
If you zoom in far enough on either the spectrum or the waveform, you’ll see this discrete nature.
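If it helps to see this concretely, here is a minimal sketch (assuming NumPy; the sampling rate and the 440 Hz sinusoid are made up for the example) showing that the FFT output is sampled in frequency, and that the spacing of those samples is set by the frame length:

```python
import numpy as np

fs = 16000                        # assumed sampling rate in Hz
for n_samples in (256, 1024):     # two different analysis frame lengths
    t = np.arange(n_samples) / fs
    frame = np.sin(2 * np.pi * 440 * t)           # a 440 Hz sinusoid
    spectrum = np.fft.rfft(frame)                 # discrete spectrum from 0 Hz to Nyquist
    freqs = np.fft.rfftfreq(n_samples, d=1 / fs)  # the evenly spaced analysis frequencies
    print(f"{n_samples}-sample frame: {len(freqs)} frequency bins, "
          f"spaced {freqs[1] - freqs[0]:.1f} Hz apart")
```

The longer frame gives more closely spaced frequency samples (15.6 Hz apart instead of 62.5 Hz), which is exactly the resolution effect described above.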
OK – there are now week-by-week lists for each course.
“Well developed” meant in terms of linguistic knowledge and resources. Specifically, because POS tagging is performed using supervised machine learning, we need lots of accurately labelled (i.e., hand-tagged) data on which to train our tagger.
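To make “supervised” concrete, here is a tiny sketch using NLTK’s hand-tagged Penn Treebank sample (this is just an illustration of training from labelled data, not how Festival’s own tagger was built; the corpus needs a one-off download):

```python
import nltk
from nltk.corpus import treebank

# nltk.download('treebank')                    # one-off download of the hand-tagged data

tagged_sentences = treebank.tagged_sents()     # sentences that humans have already POS-tagged
tagger = nltk.UnigramTagger(tagged_sentences)  # a very simple supervised tagger

print(tagger.tag(["The", "price", "of", "the", "stock", "rose"]))
# Words never seen in the hand-tagged data come back with a None tag,
# which is one reason we need *lots* of labelled data.
```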
I don’t have a nice physical explanation of this phenomenon, I’m afraid. You are clearly including the resonant frequency(ies) of the bottle in some way.
When you say “the wave that gets the best resonation” you are referring to a standing wave. The wavelength of this standing wave will be related to the length of the tube.
The waves are standing within the tube. This means that a pattern is set up within the tube, caused by the back-and-forth propagation of sound waves that reinforce one another. These waves do not “leave” the tube as such. Rather, the standing wave transmits energy to the air beyond the tube, which then propagates to the listener’s ears.
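As a rough, idealised illustration (treating the bottle as a uniform tube that is closed at one end and open at the other, with a speed of sound of about 350 m/s; a real bottle behaves more like a Helmholtz resonator, so this is only a sketch), the standing-wave resonances sit at odd multiples of a quarter wavelength:

```latex
f_n = \frac{(2n - 1)\,c}{4L}, \qquad n = 1, 2, 3, \ldots
```

For example, a tube of length L = 0.175 m would have its lowest resonance at f_1 = 350 / (4 × 0.175) = 500 Hz.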
You need to investigate the earlier steps in the pipeline to see where the difference between “UoE” and “UOE” first arises.
See this topic.
Yes, that’s correct. If we just want to detect whether a string matches a pattern (e.g., regular expression), we can use an automaton to “accept” it. No output is required: the fact that the automaton reaches the end state means the input string matched (i.e., was “accepted”).
If we want to convert the input to some output (e.g., convert “8.46” to “eight forty six”), then we’d use a transducer.
The theory of automata and transducers is beyond the scope of this course, but we do need to know what they can be used for.
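If a toy example helps, here is a sketch of the distinction (the regular expression and the number-to-words mapping are invented for illustration; this is not how Festival implements it):

```python
import re

# Acceptor: does the token match a "time" pattern? The only result is yes/no.
def accepts_time(token):
    return re.fullmatch(r"\d{1,2}\.\d{2}", token) is not None

# Transducer (toy): map an accepted token to an output word sequence.
NUMBER_WORDS = {"8": "eight", "46": "forty six"}

def transduce_time(token):
    hours, minutes = token.split(".")
    return f"{NUMBER_WORDS.get(hours, hours)} {NUMBER_WORDS.get(minutes, minutes)}"

print(accepts_time("8.46"))     # True  -- accepted, but no output produced
print(transduce_time("8.46"))   # 'eight forty six'
```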
Normalisation means converting the sequence of tokens into a sequence of words. A word is something that we can attempt to look up in a dictionary, or pass to the Letter To Sound module. That is, a word is something we can pronounce (i.e., say out loud).
In Festival, you can detect when a pronunciation has come from the dictionary: it will have a correct Part Of Speech (POS) tag. Pronunciations predicted by the Letter To Sound (LTS) module have a ‘nil’ part of speech tag.
The example of caterpillar here returns a nil POS tag.
The syllabification of words whose pronunciation comes from LTS must also be performed automatically, and can therefore contain errors.
An incorrect syllabification could indeed have consequences for speech synthesis later in the pipeline. It might affect the prediction of prosody. In unit selection, it might affect the units chosen from the database.
When you say “discretize” I think you perhaps mean “segment” – dividing the continuous speech stream into a sequence of units.
You are correct in thinking that a sequence of vowels is hard to segment (think about diphthongs) and that some consonants are relatively easy. This has implications for concatenative speech synthesis, in which we segment and then re-sequence recorded speech.
The concern of phonology is to place speech units into a finite set of discrete categories that can distinguish the words of a language (think about minimal pairs).
Adding more proper names would be a good idea, because those are generally hard to get right with LTS. But we could never cover every possible proper name. Likewise, adding “unusual” (i.e., low frequency) words cannot in itself solve the problem.
One simple reason is that new words are invented all the time, and no dictionary can include every possible word we might encounter.
Your comment about the efficiency of LTS is spot-on though, in terms of storage space. After creating the LTS model, we could remove all words from the dictionary that this LTS model correctly predicts. That would reduce the size of the dictionary. Festival does not do this, but commercial systems may.
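A sketch of that pruning idea, with an invented toy lexicon and a stand-in for the LTS model (not Festival’s actual API):

```python
def prune_lexicon(lexicon, lts_predict):
    """Keep only the entries the LTS model gets wrong; everything it
    predicts correctly can be dropped from the dictionary to save space."""
    return {word: pron for word, pron in lexicon.items()
            if lts_predict(word) != pron}

# Toy example: an "LTS model" that only handles regular spellings.
toy_lts = {"cat": "k ae t", "yacht": "y ae ch t"}   # gets "yacht" wrong
lexicon = {"cat": "k ae t", "yacht": "y oh t"}

print(prune_lexicon(lexicon, lambda w: toy_lts.get(w, "")))
# {'yacht': 'y oh t'}  -- only the word the LTS model mispredicts needs to stay
```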
The “rule” you suggest is performing Word Sense Disambiguation (in the case of “bass” at least). So, the general solution is to add a Word Sense Disambiguation module to the front-end of the system. I’ll add this to the list of possible topics for the next lecture.
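Purely as an illustration of what such a rule might look like (a hand-written toy, not a real Word Sense Disambiguation module; the cue words and phone strings are made up for the example):

```python
# Toy homograph rule for "bass": pick a pronunciation from nearby context words.
FISH_CUES = {"fishing", "lake", "river", "caught"}
MUSIC_CUES = {"guitar", "band", "drum", "player"}

def bass_pronunciation(context_words):
    context = {w.lower() for w in context_words}
    if context & MUSIC_CUES:
        return "b ey s"   # the musical sense
    if context & FISH_CUES:
        return "b ae s"   # the fish sense
    return "b ey s"       # fall back to a default when the context gives no clue

print(bass_pronunciation(["he", "plays", "bass", "guitar"]))          # b ey s
print(bass_pronunciation(["bass", "fishing", "on", "the", "lake"]))   # b ae s
```

A real WSD module would learn these decisions from labelled data rather than relying on a hand-written cue list.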
You are seeing the artefacts caused by the discontinuity at the edges of the window.
The “columns” are the spectra at the start and end of the window. They are not “super high frequency” – what you are seeing is energy across all frequencies.
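Here is a small sketch (assuming NumPy; the tone frequency and frame length are made up) of why an abrupt edge spreads energy across all frequencies, and why a tapered window reduces it:

```python
import numpy as np

fs, n = 16000, 512
t = np.arange(n) / fs
tone = np.sin(2 * np.pi * 1000.3 * t)    # a tone that does not line up with the frame edges

rect = np.abs(np.fft.rfft(tone))                   # rectangular window: abrupt edges
hann = np.abs(np.fft.rfft(tone * np.hanning(n)))   # Hann window: tapered edges

freqs = np.fft.rfftfreq(n, d=1 / fs)
far_from_tone = freqs > 4000                       # look well away from the 1 kHz tone
print("leaked energy, rectangular window:", rect[far_from_tone].sum())
print("leaked energy, Hann window:       ", hann[far_from_tone].sum())
```

The untapered (rectangular) case leaks far more energy away from the tone, and that spread across the whole frequency range is what appears as the vertical columns.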