Forum Replies Created
The simplest version of the source-filter model only uses one source at a time (either periodic or non-periodic). This cannot model voiced fricatives, so we need to upgrade the source-filter model to have mixed excitation.
This is still not a great model though, because the voiced and unvoiced sources will be shaped by the same filter, whereas in the vocal tract the two sources are often at different physical locations and so each has a different amount of vocal tract between it and the lips.
Sounds like clicks and even plosive bursts are not well-modelled by a simple source-filter model.
But, in the end, we need to stress that the source-filter model is a model of the speech signal (that’s all we need) and not a faithful model of the physics of speech production (which would be interesting, but not essential for our purposes).
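To make the mixed-excitation idea concrete, here is a minimal sketch of it as a signal model (Python with NumPy and SciPy; the sampling rate, F0, noise level and filter coefficients are all placeholder values I have chosen for illustration, not estimated from real speech): an impulse train and white noise are mixed, then shaped by a single filter.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                         # sampling rate (assumed)
f0 = 120                           # F0 of the voiced source (assumed)
n = fs                             # one second of samples

# Periodic source: an impulse train at F0
voiced = np.zeros(n)
voiced[::fs // f0] = 1.0

# Non-periodic source: white noise
unvoiced = 0.1 * np.random.randn(n)

# Mixed excitation: both sources active at once (e.g., for a voiced fricative)
excitation = voiced + unvoiced

# A single all-pole "vocal tract" filter shapes the mixed excitation.
# These coefficients are placeholders, not estimated from real speech.
a = [1.0, -1.3, 0.8]
speech_like = lfilter([1.0], a, excitation)
```

Note that this simple sketch has exactly the limitation described above: both sources pass through the same filter.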
That’s basically correct. In singing, especially by female sopranos, it is harder to discriminate different vowel sounds from one another.
The spectral envelope exists at all frequencies, but the only evidence available to the listener is at the harmonics. That is, the harmonics sample the envelope (just like digital sampling of audio). This means that more widely-spaced harmonics (due to higher F0) provide a lower-resolution representation of the spectral envelope. This makes things harder for the listener.
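As a toy illustration of that sampling effect (the envelope below is made up, with hypothetical peaks standing in for formants), evaluate a smooth envelope only at the harmonic frequencies and see how much coarser the sampling becomes as F0 rises:

```python
import numpy as np

def envelope(f):
    # A made-up smooth envelope with peaks near 500 Hz and 1500 Hz
    # (hypothetical "formants", not measured values)
    return np.exp(-((f - 500) / 300) ** 2) + 0.7 * np.exp(-((f - 1500) / 400) ** 2)

for f0 in (100, 400):                        # a low and a high fundamental frequency
    harmonics = np.arange(f0, 4000, f0)      # harmonic frequencies up to 4 kHz
    sampled = envelope(harmonics)            # the envelope is only "seen" at these points
    print(f"F0 = {f0} Hz: envelope sampled at {len(harmonics)} points")
```

At F0 = 100 Hz the envelope is sampled at 39 points below 4 kHz; at F0 = 400 Hz, at only 9.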
Breaking your question into two parts:
1. Yes, we can perceive the fundamental frequency (as pitch) even if there is zero energy at the fundamental frequency. Our perceptual system interprets the harmonic structure (i.e., the spacing between the higher harmonics) and “fills in” the fundamental.
2. FFT analysis will only show frequencies that actually exist in the signal. The first peak in the FFT does not necessarily correspond to the first harmonic.
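A quick way to check point 2 yourself (a sketch using NumPy, with frequencies I have chosen arbitrarily): build a signal from components at 400, 600 and 800 Hz with nothing at 200 Hz. The harmonic spacing implies a 200 Hz fundamental, but the first FFT peak is at 400 Hz.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                              # one second of samples
# Harmonics of an implied 200 Hz fundamental, but no energy at 200 Hz itself
x = sum(np.sin(2 * np.pi * f * t) for f in (400, 600, 800))

magnitude = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
first_peak_hz = freqs[np.argmax(magnitude > magnitude.max() / 2)]
print(first_peak_hz)   # 400.0 -- the first FFT peak, not the perceived 200 Hz pitch
```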
Yes, that’s correct.
In detailed, sophisticated vocal tract models (e.g., finite element simulations of the aeroacoustics), the area of the vocal tract becomes important.
But for our purposes, we just need to understand why speech has formants, and what determines their values.
The signal you describe is quite extreme. Those other harmonics exist but have zero amplitude. Even the fundamental (we could call this the first harmonic) has zero amplitude.
A more reasonable example would be a square wave, in which only the odd harmonics have energy and all the even ones (2nd, 4th, 6th, …) have zero amplitude.
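You can verify this numerically (a sketch assuming NumPy and SciPy; the sampling rate and F0 are arbitrary choices):

```python
import numpy as np
from scipy.signal import square

fs, f0 = 8000, 100                     # arbitrary sampling rate and fundamental
t = np.arange(fs) / fs                 # one second, so FFT bins are 1 Hz apart
x = square(2 * np.pi * f0 * t)         # square wave at 100 Hz

magnitude = np.abs(np.fft.rfft(x))
for k in range(1, 8):                  # harmonics 1 to 7
    print(k, round(magnitude[k * f0], 1))
# Odd harmonics (1, 3, 5, 7) have large magnitudes; even ones are near zero.
```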
I’ll bring a tuning fork to main lecture 2, and we’ll find out…
We’ll look at this in main lecture 2.
The spectrum and spectrogram are showing us exactly the same information. The spectrum is for a single frame of speech, and the spectrogram is for a sequence of frames (so will reveal changes in the spectrum over time).
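In other words, a spectrogram is just many spectra stacked over time. A minimal sketch of that relationship (the frame length and hop size are typical but arbitrary choices, and the input is noise standing in for speech):

```python
import numpy as np

def frame_spectrum(signal, start, frame_len=400):
    # Magnitude spectrum of one windowed frame (25 ms at 16 kHz)
    frame = signal[start:start + frame_len] * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frame))

def simple_spectrogram(signal, frame_len=400, hop=160):
    # Stack the spectra of successive frames (10 ms hop)
    starts = range(0, len(signal) - frame_len, hop)
    return np.stack([frame_spectrum(signal, s, frame_len) for s in starts])

fs = 16000
x = np.random.randn(fs)                  # stand-in for one second of speech
one_spectrum = frame_spectrum(x, 0)      # a single frame -> one spectrum
spec = simple_spectrogram(x)             # many frames -> spectrum over time
print(one_spectrum.shape, spec.shape)    # (201,) and (n_frames, 201)
```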
We’ll go over this in main lecture 2.
I think you’re referring to the way Praat defaults to plotting spectrograms with a frequency axis from 0 to 5000 Hz, regardless of the sampling rate. In other words, Praat often does not show the full frequency range of the signal.
Why do you think Praat limits the frequency range when plotting spectrograms?
In Wavesurfer, the spectrogram is always plotted from 0 to the Nyquist frequency (half the sampling rate).
It’s always good to know several tools.
I find Wavesurfer faster and easier to use, and it’s widely used in speech technology for tasks such as labelling speech. Praat is the more common tool in the field of phonetics, and is more powerful. Personally, I don’t like the way Praat labels the axes (it doesn’t provide tick marks).
Use whichever you prefer.
For this course, the older edition will be OK, at least for the first 6 chapters.
Good question! Think about the relative width and length of the vocal tract, and what resonant frequencies each would give rise to.
We’ll answer this question properly in main lecture 2.
Yes, we can almost always tell. Look at some of the synthetic speech provided here
and see what you can discover for yourself. We can revisit this question after one or two lectures on speech synthesis – so ask it again then.
Sampling happens when we convert from analogue to digital, so we need to worry about the Nyquist frequency then: when we record the raw data.
Our soundcard takes care of this for us. It includes a low-pass filter to remove all frequencies above the Nyquist frequency. This is done in the analogue domain, before going digital.
We also need to take care to do the equivalent thing when digitally downsampling any previously-recorded signal: our downsampling program must include a low-pass filter before the downsampling step.
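For example (a sketch assuming SciPy; the sampling rates are arbitrary choices), scipy.signal.decimate applies a low-pass filter before discarding samples, which is the equivalent step described above:

```python
import numpy as np
from scipy.signal import decimate

fs_old = 48000
x = np.random.randn(fs_old)        # stand-in for one second recorded at 48 kHz

# Downsample by a factor of 3, to 16 kHz. decimate() low-pass filters the
# signal below the new Nyquist frequency (8 kHz) before keeping every 3rd sample.
y = decimate(x, 3)
fs_new = fs_old // 3               # 16000 Hz
```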
Well-spotted! You are correct that signals close to the Nyquist frequency will not be very well represented. We’ll look at this in foundation lecture 2.