Forum Replies Created
Start with the papers in the special session “Singing Synthesis Challenge: Fill-In the Gap” at Interspeech 2016 and look at the bibliographies of those papers to find your way back through the literature.
It’s because each sample is stored as a binary number with a fixed number of bits. Let’s use 4 bits, which would give only these possible numbers (with decimal equivalents):
0000 = 0 0001 = 1 0010 = 2 0011 = 3 0100 = 4 0101 = 5 0110 = 6 0111 = 7 1000 = 8 1001 = 9 1010 = 10 1011 = 11 1100 = 12 1101 = 13 1110 = 14 1111 = 15
That means that each individual sample will be quantised into one of those 16 possible values (i.e., amplitudes). No “in-between” values are possible.
Using more bits means more values are possible. The standard value in consumer audio is 16 bits. In music production, 24 bits is common.
How many possible values are there with 16 bits? What about 24 bits?
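If you want to check your answer, here is a minimal sketch (not part of the course materials; the amplitude range and the mapping onto levels are just illustrative assumptions) that counts the levels and quantises one example sample:

```
def num_levels(bits):
    """Number of distinct values a sample can take with this many bits."""
    return 2 ** bits

for bits in (4, 16, 24):
    print(bits, "bits ->", num_levels(bits), "possible values")

# Quantising one sample: map an amplitude in [-1.0, 1.0) onto 4-bit levels.
bits = 4
levels = 2 ** bits                    # 16 levels
step = 2.0 / levels                   # width of one quantisation step
amplitude = 0.3                       # an example sample value
quantised_level = round(amplitude / step)
print("amplitude", amplitude, "is stored as level", quantised_level)
```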
First question: why is CD audio at 44.1kHz and not 44kHz (please note: kHz, not KHZ or K)? The reason is historical rather than important: it dates back to the early days of digital audio and the need for compatibility with video frame rates.
Second question: why are there so many other “standard” sampling rates? The main alternatives are 48kHz, 96kHz, 192kHz and (rarely) 384kHz. Each one is double the lower rate, which is convenient when converting between sampling rates (especially when downsampling).
You probably have a sound card built into your computer that will handle 44.1kHz and 48kHz. If you’ve got a more expensive model, it may also handle 96kHz. Only professional equipment (e.g. in recording studios) uses 192kHz and above.
None of this really matters for speech. 16kHz sounds OK, 48kHz sounds better, and there is little point going higher than that.
In general, we analyse each frame individually.
You’re probably referring to Wavesurfer’s feature to take the average spectrum across the selected region. In this case, the region is divided into frames (the size of which is controlled by the FFT points setting). Each frame is analysed (i.e., passed through the FFT) and the resulting per-frame spectra are averaged to obtain the spectrum that is displayed.
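As a rough sketch of the general recipe (this is not Wavesurfer’s actual code; the frame length, hop and window choice are assumptions):

```
import numpy as np

def average_spectrum(signal, frame_length=512, hop=256):
    window = np.hanning(frame_length)          # tapered analysis window
    spectra = []
    for start in range(0, len(signal) - frame_length + 1, hop):
        frame = signal[start:start + frame_length] * window
        spectra.append(np.abs(np.fft.rfft(frame)))   # per-frame magnitude spectrum
    return np.mean(spectra, axis=0)            # average across frames

# Example: 1 second of a 200 Hz sine sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
avg = average_spectrum(np.sin(2 * np.pi * 200 * t))
print(avg.shape)   # one averaged magnitude value per frequency bin
```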
We covered this in the week 2 lectures.
A couple of people have requested this.
I used to use a Talis Resource List for this, but it’s not possible to automatically synchronise that with the speech.zone website, and so they easily end up disagreeing. This is confusing for students.
I will investigate auto-generating such a list on the speech.zone website, but this will involve writing some code, I suspect, so will take time.
In the meantime, please construct your own list, as you watch the videos.
This was hopefully clarified in the week 2 lectures.
Two things are going on here:
1. What you see in the FFT spectrum is plotted on a logarithmic vertical scale, which emphasises the very low-energy parts. You can ignore these and just focus on the peaks.
2. We see a peak with some width, not a perfect vertical line. The width of that peak depends on:
a) the analysis window size (number of FFT points): longer window = higher frequency resolution = narrower peak
b) the use of a tapered window, which introduces this as an artefact (but without a tapered window we would have worse artefacts due to discontinuities in the time-domain signal); see the sketch below
A technical aside (not relevant for this course): different tapered window shapes – Hamming, Hanning, Blackman,… – lead to slightly different widths and shapes of this peak.
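If you want to see points 2a and 2b for yourself, here is an illustrative sketch (the tone frequency, window lengths and the crude half-maximum width measure are my own choices, not anything from the lectures):

```
import numpy as np

fs = 16000
f0 = 1010   # deliberately not centred on an FFT bin

def peak_width_hz(n_points, taper):
    t = np.arange(n_points) / fs
    x = np.sin(2 * np.pi * f0 * t)
    if taper:
        x *= np.hanning(n_points)              # tapered (Hanning) window
    mag = np.abs(np.fft.rfft(x))
    bins_above_half_max = np.sum(mag > 0.5 * mag.max())
    return bins_above_half_max * fs / n_points # convert bins to Hz

for n in (256, 1024):
    for taper in (False, True):
        print(f"window = {n} points, tapered = {taper}: "
              f"peak width ~ {peak_width_hz(n, taper):.0f} Hz")
```

The longer window gives finer frequency resolution (a narrower peak in Hz); the tapered window widens the main peak slightly but suppresses the sidelobe artefacts.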
The low-pass filter removes all energy above the cut-off frequency – not just harmonics, but frication and any other sounds.
The cut-off frequency of the low-pass filter needs to be no higher than the Nyquist frequency. Real filters have (as you point out) a slope between the pass-band and the stop-band, not a perfect cut-off, and so we will have to filter out some energy just below the Nyquist frequency as well.
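Here is a sketch of what that looks like in practice when downsampling from 48kHz to 16kHz (the filter type, order and cut-off are illustrative choices, not recommendations):

```
import numpy as np
from scipy.signal import butter, filtfilt

fs_in, fs_out = 48000, 16000          # downsampling from 48 kHz to 16 kHz

# Cut-off a little below the new Nyquist frequency (8000 Hz), because a real
# filter has a sloped transition band rather than a perfect cut-off.
b, a = butter(6, 7000, btype='low', fs=fs_in)

t = np.arange(fs_in) / fs_in
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 12000 * t)

filtered = filtfilt(b, a, x)                 # remove energy above ~7 kHz
downsampled = filtered[:: fs_in // fs_out]   # then keep every 3rd sample
# (scipy.signal.resample_poly can do both steps in one call)
```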
A “spectrum which plots a whole utterance” would show us the long-term average spectrum of the speech. This is somewhat interesting – for example, we can then infer what kinds of additive noise would, or would not, reduce the intelligibility of the signal.
But the long-term average spectrum is not useful for phonetic analysis, and that’s what we are focussed on here.
Aliasing is not so much a “loss of fidelity” as a distortion. We will introduce frequencies into the sampled signal that are false: they are related to the contents of the original signal above the Nyquist frequency (mirrored about the Nyquist frequency in fact).
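A tiny worked example with made-up numbers:

```
fs = 16000
nyquist = fs / 2          # 8000 Hz

f_true = 10000            # a component above the Nyquist frequency
f_alias = fs - f_true     # it appears mirrored about the Nyquist frequency
print(f"{f_true} Hz sampled at {fs} Hz shows up as a false {f_alias} Hz component")
# -> 10000 Hz shows up as 6000 Hz: 2000 Hz above Nyquist becomes 2000 Hz below it
```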
The simplest version of the source-filter model only uses one source at a time (either periodic, or non-periodic). This cannot model voiced fricatives, so we need to upgrade the source-filter model to have mixed excitation.
This is still not a great model though, because the voiced and unvoiced sources will be shaped by the same filter, whereas in the vocal tract the two sources are often at different physical locations and so have a different amount of vocal tract between the source and the lips.
Sounds like clicks and even plosive bursts are not well-modelled by a simple source-filter model.
But, in the end, we need to stress that the source-filter model is a model of the speech signal (that’s all we need) and not a faithful model of the physics of speech production (which would be interesting, but not essential for our purposes).
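For the curious, here is a very crude sketch of the mixed-excitation idea (not a real synthesiser; the F0 and the filter coefficients are arbitrary placeholders, just to show the structure of the model):

```
import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 120                                   # fundamental frequency in Hz
n = fs                                     # one second of samples

# Voiced source: impulse train at F0.  Unvoiced source: white noise.
voiced = np.zeros(n)
voiced[:: fs // f0] = 1.0
unvoiced = np.random.randn(n) * 0.05

# Mixed excitation: both sources active at once (e.g., a voiced fricative).
excitation = voiced + unvoiced

# A single all-pole "vocal tract" filter shapes both sources identically,
# which is exactly the limitation mentioned above.
a = [1.0, -1.3, 0.8]                       # arbitrary example coefficients
speech_like = lfilter([1.0], a, excitation)
```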
That’s basically correct. In singing, especially by sopranos, it is harder to discriminate different vowel sounds from one another.
The spectral envelope exists at all frequencies, but the only evidence available to the listener is at the harmonics. That is, the harmonics sample the envelope (just like digital sampling of audio). This means that more widely-spaced harmonics (due to higher F0) provide a lower-resolution representation of the spectral envelope. This makes things harder for the listener.
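To make the sampling analogy concrete, here is a small sketch with a made-up spectral envelope (the envelope shape and the F0 values are just for illustration):

```
import numpy as np

def envelope(f):
    """A made-up smooth spectral envelope with a peak around 1000 Hz."""
    return np.exp(-((f - 1000.0) / 600.0) ** 2)

for f0 in (100, 400):                       # speech-like vs soprano-like F0
    harmonics = np.arange(f0, 4000, f0)     # harmonic frequencies below 4 kHz
    evidence = envelope(harmonics)          # the envelope is only "seen" here
    print(f"F0 = {f0} Hz: the listener gets {len(evidence)} samples of the envelope below 4 kHz")
```

The higher the F0, the fewer points at which the envelope is sampled, so the listener has less evidence about its shape.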
Breaking your question into two parts:
1. Yes, we can perceive the fundamental frequency (as pitch) even if there is zero energy at the fundamental frequency. Our perceptual system interprets the harmonic structure (i.e., the spacing between the higher harmonics) and “fills in” the fundamental.
2. FFT analysis will only show frequencies that actually exist in the signal. The first peak in the FFT does not necessarily correspond to the first harmonic.
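A small sketch of point 1 (the 200 Hz fundamental and the harmonic numbers are just example values):

```
import numpy as np

fs = 16000
t = np.arange(fs) / fs
f0 = 200

# A signal containing only harmonics 2-5 of a 200 Hz fundamental (400-1000 Hz)
x = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(2, 6))

mag = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1 / fs)
peaks = freqs[mag > 0.5 * mag.max()]
print(peaks)   # ~[400, 600, 800, 1000]: no energy at 200 Hz itself
```

The FFT shows no peak at 200 Hz, yet the harmonic spacing is still 200 Hz, and that spacing is what our perception “fills in” as the pitch.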
Yes, that’s correct.
In detailed, sophisticated vocal tract models (e.g., finite element simulations of the aeroacoustics), the cross-sectional area of the vocal tract becomes important.
But for our purposes, we just need to understand why speech has formants, and what determines their values.