Forum Replies Created
When you say “discretize” I think you perhaps mean “segment” – dividing the continuous speech stream into a sequence of units.
You are correct in thinking that a sequence of vowels is hard to segment (think about diphthongs) and that some consonants are relatively easy. This has implications for concatenative speech synthesis, in which we segment and then re-sequence recorded speech.
The concern of phonology is to place speech units into a finite set of discrete categories that can distinguish the words of a language (think about minimal pairs).
Adding more proper names would be a good idea, because those are generally hard to get right with LTS. But we could never cover every possible proper name. Likewise, adding “unusual” (i.e., low frequency) words cannot in itself solve the problem.
One simple reason is that new words are invented all the time, and no dictionary can include every possible word we might encounter.
Your comment about the efficiency of LTS is spot-on though, in terms of storage space. After creating the LTS model, we could remove all words from the dictionary that this LTS model correctly predicts. That would reduce the size of the dictionary. Festival does not do this, but commercial systems may.
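Here is a minimal sketch of that pruning idea in Python (not something Festival actually does; the lts_predict function is a hypothetical stand-in for whatever letter-to-sound model your system provides):

    def prune_dictionary(dictionary, lts_predict):
        # Keep only the entries the LTS model gets wrong; everything the
        # model already predicts correctly can be dropped and re-generated
        # at run time.
        return {word: pron
                for word, pron in dictionary.items()
                if lts_predict(word) != pron}

    # e.g. dictionary = {"cat": "k ae t", "yacht": "y aa t"}
    # pruned = prune_dictionary(dictionary, lts_predict)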
The “rule” you suggest is performing Word Sense Disambiguation (in the case of “bass” at least). So, the general solution is to add a Word Sense Disambiguation module to the front-end of the system. I’ll add this to the list of possible topics for the next lecture.
You are seeing the artefacts caused by the discontinuity at the edges of the window.
The “columns” are the spectra at the start and end of the window. They are not “super high frequency” – what you are seeing is energy across all frequencies.
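If you want to convince yourself that an abrupt edge really does spread energy across all frequencies, here is a small numpy sketch (the 16kHz rate, 20 ms frame and 317 Hz tone are just illustrative choices, not anything from the lab):

    import numpy as np

    fs = 16000                          # sampling rate (Hz), illustrative
    t = np.arange(0, 0.02, 1 / fs)      # one 20 ms frame
    x = np.sin(2 * np.pi * 317 * t)     # a tone that does not complete a
                                        # whole number of cycles in the frame

    rect = np.abs(np.fft.rfft(x))                        # abrupt edges
    hann = np.abs(np.fft.rfft(x * np.hanning(len(x))))   # tapered edges

    # Compare the energy far away from 317 Hz (above 2 kHz), in dB:
    bin_2k = int(2000 * len(x) / fs)
    print("rectangular:", 20 * np.log10(rect[bin_2k:].max()))
    print("tapered    :", 20 * np.log10(hann[bin_2k:].max()))

The rectangular (un-tapered) case leaves far more energy smeared across the high frequencies.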
Start with the papers in the special session “Singing Synthesis Challenge: Fill-In the Gap” at Interspeech 2016 and look at the bibliographies of those papers to find your way back through the literature.
It’s because each sample is stored as a binary number with a fixed number of bits. Let’s use 4 bits, which would give only these possible numbers (with decimal equivalents):
0000 = 0    0001 = 1    0010 = 2    0011 = 3
0100 = 4    0101 = 5    0110 = 6    0111 = 7
1000 = 8    1001 = 9    1010 = 10   1011 = 11
1100 = 12   1101 = 13   1110 = 14   1111 = 15
That means that each individual sample will be quantised into one of those 16 possible values (i.e., amplitudes). No “in-between” values are possible.
Using more bits means more values are possible. The standard value in consumer audio is 16 bits. In music production, 24 bits is common.
How many possible values are there with 16 bits? What about 24 bits?
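A quick way to check your answer, plus a toy illustration of forcing a sample onto one of the available levels (Python, purely illustrative):

    # Number of distinct amplitude values at a given bit depth:
    for bits in (4, 16, 24):
        print(bits, "bits:", 2 ** bits, "possible values")

    # Quantising a sample in the range [-1, 1) to a 4-bit integer code:
    def quantise(sample, bits=4):
        levels = 2 ** bits
        return round((sample + 1) / 2 * (levels - 1))   # code in 0..levels-1

    print(quantise(0.1234))   # the "in-between" value is forced onto level 8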
First question: why is CD audio at 44.1kHz and not 44kHz (please note: kHz, not KHZ or K)? The reason is historical and not really important: it dates back to the early days of digital audio and compatibility with video frame rates.
Second question: why are there so many other “standard” sampling rates? The main alternatives are 48kHz, 96kHz, 192kHz and (rarely) 384kHz. Each one is double the previous rate, which is convenient when converting between sampling rates (especially when downsampling).
You probably have a sound card built into your computer that will handle 44.1kHz and 48kHz. If you’ve got a more expensive model, it may also handle 96kHz. Only professional equipment (e.g. in recording studios) uses 192kHz and above.
None of this really matters for speech. 16kHz sounds OK, 48kHz sounds better, and there is little point going higher than that.
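If you ever need to convert between these rates yourself, scipy can do it; here is a sketch, assuming scipy is installed (resample_poly applies an anti-aliasing low-pass filter before discarding samples):

    import numpy as np
    from scipy.signal import resample_poly

    fs_in, fs_out = 48000, 16000          # a convenient 3:1 integer ratio
    t = np.arange(0, 1.0, 1 / fs_in)
    x = np.sin(2 * np.pi * 440 * t)       # a 440 Hz test tone

    y = resample_poly(x, up=1, down=3)    # low-pass filter, then keep every 3rd sample
    print(len(x), "->", len(y))           # 48000 -> 16000

Converting 44.1kHz to 48kHz needs the much less convenient ratio 160:147, which is part of why rates that are simple multiples of one another are preferred.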
In general, we analyse each frame individually.
You’re probably referring to Wavesurfer’s feature that takes the average spectrum across the selected region. In this case, the region is divided into frames (the size of which is controlled by the FFT points setting). Each frame is analysed (i.e., passed through the FFT) and the resulting spectra are averaged to obtain the spectrum that is displayed.
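A rough sketch of that computation in numpy (Wavesurfer’s exact implementation may differ, e.g. in its choice of window or whether frames overlap):

    import numpy as np

    def average_spectrum(x, frame_length=512):
        # Split the region into non-overlapping frames, take the magnitude
        # spectrum of each, and average them.
        window = np.hanning(frame_length)
        n_frames = len(x) // frame_length
        spectra = [np.abs(np.fft.rfft(x[i * frame_length:(i + 1) * frame_length] * window))
                   for i in range(n_frames)]
        return np.mean(spectra, axis=0)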
We covered this in the week 2 lectures.
A couple of people have requested this.
I used to use a Talis Resource List for this, but it’s not possible to automatically synchronise that with the speech.zone website, and so they easily end up disagreeing. This is confusing for students.
I will investigate auto-generating such a list on the speech.zone website, but I suspect this will involve writing some code, so it will take time.
In the meantime, please construct your own list, as you watch the videos.
This was hopefully clarified in the week 2 lectures.
Two things are going on here:
1. What you see in the FFT spectrum is plotted on a logarithmic vertical scale, so that emphasises the very low energy parts. You can ignore these and just focus on the peaks.
2. We see a peak with some width, not a perfect vertical line. The width of that peak depends on
a) the analysis window size (number of FFT points): longer window = higher frequency resolution = narrower peak
b) the use of a tapered window, which introduces this as an artefact (but without a tapered window we would have worse artefacts, due to discontinuities in the time-domain signal)
A technical aside (not relevant for this course): different tapered window shapes – Hamming, Hanning, Blackman,… – lead to slightly different widths and shapes of this peak.
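To see point (a) concretely, here is a small numpy sketch measuring how wide the peak of a Hann-windowed 1kHz tone is for different window lengths (the 16kHz rate and the “within 20 dB of the peak” width measure are arbitrary choices for illustration):

    import numpy as np

    fs = 16000
    f0 = 1000                                     # tone frequency (Hz)

    def peak_width_hz(n_samples):
        t = np.arange(n_samples) / fs
        x = np.sin(2 * np.pi * f0 * t) * np.hanning(n_samples)
        mag = np.abs(np.fft.rfft(x))
        wide_bins = (mag > mag.max() / 10).sum()  # bins within 20 dB of the peak
        return wide_bins * fs / n_samples         # number of bins * bin spacing

    for n in (256, 512, 1024, 2048):              # the "FFT points" setting
        print(n, "point window:", peak_width_hz(n), "Hz wide")

Doubling the window length halves the width of the peak.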
The low-pass filter removes all energy above the cut-off frequency – not just harmonics, but frication and any other sounds.
The cut-off frequency of the low-pass filter needs to be no higher than the Nyquist frequency. Real filters have (as you point out) a slope between the pass-band and the stop-band, not a perfect cut-off, and so we will have to filter out some energy just below the Nyquist frequency as well.
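As a sketch of that in practice (assuming scipy; the filter order and the choice of 0.9 × Nyquist as the cut-off are just illustrative), here is what downsampling 48kHz audio to 16kHz looks like if you design the anti-aliasing filter yourself:

    import numpy as np
    from scipy.signal import butter, sosfilt

    fs_in, fs_out = 48000, 16000
    new_nyquist = fs_out / 2                     # 8000 Hz

    t = np.arange(0, 1.0, 1 / fs_in)
    x = (np.sin(2 * np.pi * 440 * t)             # a tone we want to keep
         + 0.3 * np.sin(2 * np.pi * 12000 * t))  # energy above the new Nyquist

    # Cut off a little below the new Nyquist frequency, because a real filter
    # has a sloping transition band rather than a perfect "brick wall".
    sos = butter(10, 0.9 * new_nyquist, btype='low', fs=fs_in, output='sos')
    y = sosfilt(sos, x)[::3]                     # filter, then keep every 3rd sample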
A “spectrum which plots a whole utterance” would show us the long-term average spectrum of the speech. This is somewhat interesting – for example, we can then infer what kinds of additive noise would, or would not, reduce the intelligibility of the signal.
But the long-term average spectrum is not useful for phonetic analysis, and that’s what we are focussed on here.
Aliasing is not so much a “loss of fidelity” as a distortion. We will introduce frequencies into the sampled signal that are false: they are related to the contents of the original signal above the Nyquist frequency (mirrored about the Nyquist frequency in fact).
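You can see the mirroring directly with a short numpy experiment (this samples a tone “naively” in software, with no anti-aliasing filter; a real ADC would have filtered it out first):

    import numpy as np

    fs = 16000                       # sampling rate; Nyquist is 8000 Hz
    true_freq = 10000                # a tone above the Nyquist frequency
    t = np.arange(0, 0.1, 1 / fs)
    x = np.sin(2 * np.pi * true_freq * t)

    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    print("peak appears at", freqs[spectrum.argmax()], "Hz")
    # prints 6000 Hz: the 10 kHz tone has been mirrored about 8 kHz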