Forum Replies Created
Start with the papers in the special session “Singing Synthesis Challenge: Fill-In the Gap” at Interspeech 2016 and look at the bibliographies of those papers to find your way back through the literature.
It’s because each sample is stored as a binary number with a fixed number of bits. Let’s use 4 bits, which would give only these possible numbers (with decimal equivalents):
0000 = 0 0001 = 1 0010 = 2 0011 = 3 0100 = 4 0101 = 5 0110 = 6 0111 = 7 1000 = 8 1001 = 9 1010 = 10 1011 = 11 1100 = 12 1101 = 13 1110 = 14 1111 = 15
That means that each individual sample will be quantised into one of those 16 possible values (i.e., amplitudes). No “in-between” values are possible.
Using more bits means more values are possible. The standard value in consumer audio is 16 bits. In music production, 24 bits is common.
How many possible values are there with 16 bits? What about 24 bits?
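If you want to check your answer, here is a minimal sketch (not part of the course materials; the amplitude range and the mapping onto levels are just illustrative assumptions) that counts the levels and quantises one example sample:

```
def num_levels(bits):
    """Number of distinct values a sample can take with this many bits."""
    return 2 ** bits

for bits in (4, 16, 24):
    print(bits, "bits ->", num_levels(bits), "possible values")

# Quantising one sample: map an amplitude in [-1.0, 1.0) onto 4-bit levels.
bits = 4
levels = 2 ** bits                    # 16 levels
step = 2.0 / levels                   # width of one quantisation step
amplitude = 0.3                       # an example sample value
quantised_level = round(amplitude / step)
print("amplitude", amplitude, "is stored as level", quantised_level)
```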
First question: why is CD audio at 44.1kHz and not 44kHz (please note: kHz, not KHZ or K)? The reason is historical rather than important: it dates back to the early days of digital audio and the need for compatibility with video frame rates.
Second question: why are there so many other “standard” sampling rates? The main alternatives are 48kHz, 96kHz, 192kHz and (rarely) 384kHz. Each one is double the lower rate, which is convenient when converting between sampling rates (especially when downsampling).
You probably have a sound card built into your computer that will handle 44.1kHz and 48kHz. If you’ve got a more expensive model, it may also handle 96kHz. Only professional equipment (e.g. in recording studios) uses 192kHz and above.
None of this really matters for speech. 16kHz sounds OK, 48kHz sounds better, and there is little point going higher than that.
In general, we analyse each frame individually.
You’re probably referring to Wavesurfer’s feature to take the average spectrum across the selected region. In this case, the region is divided into frames (the size of which is controlled by the FFT points setting). Each frame is analysed (i.e., passed through the FFT) and the resulting per-frame spectra are averaged to obtain the spectrum that is displayed.
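As a rough sketch of the general recipe (this is not Wavesurfer’s actual code; the frame length, hop and window choice are assumptions):

```
import numpy as np

def average_spectrum(signal, frame_length=512, hop=256):
    window = np.hanning(frame_length)          # tapered analysis window
    spectra = []
    for start in range(0, len(signal) - frame_length + 1, hop):
        frame = signal[start:start + frame_length] * window
        spectra.append(np.abs(np.fft.rfft(frame)))   # per-frame magnitude spectrum
    return np.mean(spectra, axis=0)            # average across frames

# Example: 1 second of a 200 Hz sine sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
avg = average_spectrum(np.sin(2 * np.pi * 200 * t))
print(avg.shape)   # one averaged magnitude value per frequency bin
```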
We covered this in the week 2 lectures.
A couple of people have requested this.
I used to use a Talis Resource List for this, but it’s not possible to automatically synchronise that with the speech.zone website, and so they easily end up disagreeing. This is confusing for students.
I will investigate auto-generating such a list on the speech.zone website, but this will involve writing some code, I suspect, so will take time.
In the meantime, please construct your own list, as you watch the videos.
This was hopefully clarified in the week 2 lectures.
Two things are going on here:
1. What you see in the FFT spectrum is plotted on a logarithmic vertical scale, which emphasises the very low-energy parts. You can ignore these and just focus on the peaks.
2. We see a peak with some width, not a perfect vertical line. The width of that peak depends on:
a) the analysis window size (number of FFT points): longer window = higher frequency resolution = narrower peak
b) the use of a tapered window, which introduces this as an artefact (but without a tapered window we would have worse artefacts due to discontinuities in the time-domain signal); see the sketch below
A technical aside (not relevant for this course): different tapered window shapes – Hamming, Hanning, Blackman,… – lead to slightly different widths and shapes of this peak.
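If you want to see points 2a and 2b for yourself, here is an illustrative sketch (the tone frequency, window lengths and the crude half-maximum width measure are my own choices, not anything from the lectures):

```
import numpy as np

fs = 16000
f0 = 1010   # deliberately not centred on an FFT bin

def peak_width_hz(n_points, taper):
    t = np.arange(n_points) / fs
    x = np.sin(2 * np.pi * f0 * t)
    if taper:
        x *= np.hanning(n_points)              # tapered (Hanning) window
    mag = np.abs(np.fft.rfft(x))
    bins_above_half_max = np.sum(mag > 0.5 * mag.max())
    return bins_above_half_max * fs / n_points # convert bins to Hz

for n in (256, 1024):
    for taper in (False, True):
        print(f"window = {n} points, tapered = {taper}: "
              f"peak width ~ {peak_width_hz(n, taper):.0f} Hz")
```

The longer window gives finer frequency resolution (a narrower peak in Hz); the tapered window widens the main peak slightly but suppresses the sidelobe artefacts.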
The low-pass filter removes all energy above the cut-off frequency – not just harmonics, but frication and any other sounds.
The cut-off frequency of the low-pass filter needs to be no higher than the Nyquist frequency. Real filters have (as you point out) a slope between the pass-band and the stop-band, not a perfect cut-off, and so we will have to filter out some energy just below the Nyquist frequency as well.
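Here is a sketch of what that looks like in practice when downsampling from 48kHz to 16kHz (the filter type, order and cut-off are illustrative choices, not recommendations):

```
import numpy as np
from scipy.signal import butter, filtfilt

fs_in, fs_out = 48000, 16000          # downsampling from 48 kHz to 16 kHz

# Cut-off a little below the new Nyquist frequency (8000 Hz), because a real
# filter has a sloped transition band rather than a perfect cut-off.
b, a = butter(6, 7000, btype='low', fs=fs_in)

t = np.arange(fs_in) / fs_in
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 12000 * t)

filtered = filtfilt(b, a, x)                 # remove energy above ~7 kHz
downsampled = filtered[:: fs_in // fs_out]   # then keep every 3rd sample
# (scipy.signal.resample_poly can do both steps in one call)
```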
A “spectrum which plots a whole utterance” would show us the long-term average spectrum of the speech. This is somewhat interesting – for example, we can then infer what kinds of additive noise would, or would not, reduce the intelligibility of the signal.
But the long-term average spectrum is not useful for phonetic analysis, and that’s what we are focussed on here.
Aliasing is not so much a “loss of fidelity” as a distortion. We will introduce frequencies into the sampled signal that are false: they are related to the contents of the original signal above the Nyquist frequency (mirrored about the Nyquist frequency in fact).
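A tiny worked example with made-up numbers:

```
fs = 16000
nyquist = fs / 2          # 8000 Hz

f_true = 10000            # a component above the Nyquist frequency
f_alias = fs - f_true     # it appears mirrored about the Nyquist frequency
print(f"{f_true} Hz sampled at {fs} Hz shows up as a false {f_alias} Hz component")
# -> 10000 Hz shows up as 6000 Hz: 2000 Hz above Nyquist becomes 2000 Hz below it
```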
The simplest version of the source-filter model only uses one source at a time (either periodic, or non-periodic). This cannot model voiced fricatives, so we need to upgrade the source-filter model to have mixed excitation.
This is still not a great model though, because the voiced and unvoiced sources will be shaped by the same filter, whereas in the vocal tract the two sources are often at different physical locations and so have a different amount of vocal tract between the source and the lips.
Sounds like clicks and even plosive bursts are not well-modelled by a simple source-filter model.
But, in the end, we need to stress that the source-filter model is a model of the speech signal (that’s all we need) and not a faithful model of the physics of speech production (which would be interesting, but not essential for our purposes).
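For the curious, here is a very crude sketch of the mixed-excitation idea (not a real synthesiser; the F0 and the filter coefficients are arbitrary placeholders, just to show the structure of the model):

```
import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 120                                   # fundamental frequency in Hz
n = fs                                     # one second of samples

# Voiced source: impulse train at F0.  Unvoiced source: white noise.
voiced = np.zeros(n)
voiced[:: fs // f0] = 1.0
unvoiced = np.random.randn(n) * 0.05

# Mixed excitation: both sources active at once (e.g., a voiced fricative).
excitation = voiced + unvoiced

# A single all-pole "vocal tract" filter shapes both sources identically,
# which is exactly the limitation mentioned above.
a = [1.0, -1.3, 0.8]                       # arbitrary example coefficients
speech_like = lfilter([1.0], a, excitation)
```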
That’s basically correct. In singing, especially by sopranos, it is harder to discriminate different vowel sounds from one another.
The spectral envelope exists at all frequencies, but the only evidence available to the listener is at the harmonics. That is, the harmonics sample the envelope (just like digital sampling of audio). This means that more widely-spaced harmonics (due to higher F0) provide a lower-resolution representation of the spectral envelope. This makes things harder for the listener.
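To make the sampling analogy concrete, here is a small sketch with a made-up spectral envelope (the envelope shape and the F0 values are just for illustration):

```
import numpy as np

def envelope(f):
    """A made-up smooth spectral envelope with a peak around 1000 Hz."""
    return np.exp(-((f - 1000.0) / 600.0) ** 2)

for f0 in (100, 400):                       # speech-like vs soprano-like F0
    harmonics = np.arange(f0, 4000, f0)     # harmonic frequencies below 4 kHz
    evidence = envelope(harmonics)          # the envelope is only "seen" here
    print(f"F0 = {f0} Hz: the listener gets {len(evidence)} samples of the envelope below 4 kHz")
```

The higher the F0, the fewer points at which the envelope is sampled, so the listener has less evidence about its shape.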
Breaking your question into two parts:
1. Yes, we can perceive the fundamental frequency (as pitch) even if there is zero energy at the fundamental frequency. Our perceptual system interprets the harmonic structure (i.e., the spacing between the higher harmonics) and “fills in” the fundamental.
2. FFT analysis will only show frequencies that actually exist in the signal. The first peak in the FFT does not necessarily correspond to the first harmonic.
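A small sketch of point 1 (the 200 Hz fundamental and the harmonic numbers are just example values):

```
import numpy as np

fs = 16000
t = np.arange(fs) / fs
f0 = 200

# A signal containing only harmonics 2-5 of a 200 Hz fundamental (400-1000 Hz)
x = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(2, 6))

mag = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1 / fs)
peaks = freqs[mag > 0.5 * mag.max()]
print(peaks)   # ~[400, 600, 800, 1000]: no energy at 200 Hz itself
```

The FFT shows no peak at 200 Hz, yet the harmonic spacing is still 200 Hz, and that spacing is what our perception “fills in” as the pitch.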
Yes, that’s correct.
In detailed, sophisticated vocal tract models (e.g., finite element simulations of the aeroacoustics), the cross-sectional area of the vocal tract becomes important.
But for our purposes, we just need to understand why speech has formants, and what determines their values.