Forum Replies Created
You are running the wrong version of Festival (an old version that is also installed in the VM). You need to set your PATH as described in the instructions for Module 3 Tutorial B, to pick up the correct version.

That means you are not connected to the VPN. Have you installed the VPN client (Forticlient) and used it to connect to the VPN?
OK – good – you can reach the machine. That’s possible without the VPN, but logging in to it requires the VPN.
To test if you are on the VPN, use a browser (try this on both your host computer and in the VM) and type “What’s my IP” into Google. You are looking for an IP address starting with one of these:

129.215
192.41

or maybe one of these:

192.107
192.82
193.62
193.63
194.80
194.81

Make sure your host computer is connected to the VPN. Then, in the VM, try these and report your results back here:
$ nslookup scp1.ppls.ed.ac.uk
$ ping scp1.ppls.ed.ac.uk
$ ping 129.215.204.88
A very interesting topic. I’m not going to answer this now; instead, please ask this question again in Module 7, when we will use knowledge of human hearing to motivate feature extraction for Automatic Speech Recognition.
A simplified description of the vocal folds is that they are closed most of the time. The air pressure from the lungs eventually forces them to burst open and release the pressure, after which they snap shut very rapidly.
This is not random motion, but a very particular pattern of opening and closing.
The signal generated by one cycle of the vocal folds (mainly related to the very rapid closing) can be approximated as an impulse.
Hence, the acoustic wave generated at the glottis is approximately an impulse train. So, we use an impulse train as a model of this signal, and our model has just one parameter: F0.
In the literature on phonation you will of course find much more sophisticated descriptions and models than this. The glottal wave is not exactly an impulse train, but has a particular shape which can be parameterised – e.g., the Liljencrants-Fant (LF) model has 4 parameters. This is beyond the scope of Speech Processing.
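To make this concrete, here is a minimal Python sketch (my own illustration, with assumed values for the sampling rate and frame length – it is not part of the course materials) of a source model whose only parameter is F0:

import numpy as np

fs = 16000        # assumed sampling rate, in Hz
f0 = 100          # fundamental frequency, in Hz - the model's only parameter
duration = 0.05   # length of the generated signal, in seconds

# One impulse per fundamental period, zero everywhere else.
t0 = int(fs / f0)                 # fundamental period, in samples
x = np.zeros(int(fs * duration))
x[::t0] = 1.0                     # place an impulse every t0 samples

print(int(np.sum(x)), "impulses in", duration, "s at F0 =", f0, "Hz")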
Yes, that’s why there is a non-zero magnitude in the 0 Hz DFT bin.
Is the waveform of an impulse train centred on zero?
Yes, that’s correct – we always plot all the bins, even if their magnitude is zero. Normally we just join them up with a line for easy visualisation, but the notebooks plot them as distinct points to help you understand the process.
Your last sentence gets it: the 0 Hz basis function is a horizontal line. Its amplitude is not 0, because that wouldn’t be much use when weighted and summed to make the signal being analysed, so it’s at an amplitude of 1.
We need this 0 Hz component to account for any offset (bias) in the signal: is it on average above zero, below zero, or centred on zero?
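Here is a minimal sketch (mine, not from the notebooks, using numpy with arbitrary example values) showing that the 0 Hz bin measures exactly this offset:

import numpy as np

t = np.arange(64) / 64           # a 1 s frame at an assumed 64 Hz sampling rate
x = np.sin(2 * np.pi * 4 * t)    # a signal centred on zero: its mean is 0

print(np.abs(np.fft.rfft(x))[0])        # 0 Hz bin magnitude: ~0
print(np.abs(np.fft.rfft(x + 0.5))[0])  # add an offset of 0.5: now 0.5 * 64 = 32

The magnitude of bin 0 is just the number of samples times the signal’s mean, so it is zero only when the signal is centred on zero.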
Welcome to the wonderful world of prosody! Terminology is sometimes used differently by different authors.
This is why the video ‘Prosody’ avoided getting into definitions and concentrated on the engineering application. So, keeping to what is relevant for speech synthesis, and talking only about English:
Words are made of syllables. In the pronunciation dictionary, at least one syllable in the word is marked as having primary lexical stress, and perhaps some other syllables as having secondary lexical stress.
When spoken in citation form (i.e., as an isolated word, obeying the dictionary pronunciation), the primary lexically-stressed syllable will sound more prominent than the others. The speaker will make some F0 movement on it to achieve this, and probably also make it louder and longer than usual. This is called a pitch accent. There might also be smaller pitch accents on the other lexically-stressed syllables.
In connected speech, not every lexically-stressed syllable in a spoken sentence will receive a pitch accent (there would be too many). Only some words in the sentence will be chosen by the speaker to receive a pitch accent, which will be placed on a lexically-stressed syllable.
So: lexical stress marked in the dictionary indicates syllables that might receive a pitch accent in connected speech.
Syllables that are not accented may have their vowels reduced, potentially all the way to schwa.
‘Sampling’ is indeed used to mean several things.
When talking about digital signals, sampling means the process of converting an analogue signal (in continuous time) to a digital one (in discrete time): we take samples at regular intervals. A ‘sample’ here is a single number, stored in binary form (e.g., using 16 bits).
But we could also use the term ‘sample’ to describe a speech waveform taken from a larger one, perhaps for the purposes of speech synthesis. This use of ‘sampling’ is like in music production when someone ‘borrows’ a sample (e.g., a few notes, or a drum beat) from an existing track, and makes new music with it.
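Here is a minimal sketch of the first sense (my own illustration in Python, with an assumed sampling rate and a stand-in signal – not from the course materials):

import numpy as np

fs = 16000                              # sampling rate: 16000 samples per second
t = np.arange(0, 0.01, 1 / fs)          # sample times, at regular intervals
x = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for the analogue signal

x16 = np.int16(x * 32767)   # each sample stored using 16 bits
print(x16[:8])              # the first few samples, as integers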
In your plot, make sure you know the difference between the bins (all the red points) and the energy in the signal (in this case, the harmonics), which is carried only by the red points with non-zero magnitude.
Earlier, we agreed that the lowest frequency basis function is the one with a single cycle in the analysis frame (e.g., 1 Hz for a 1 s analysis frame).
Your plot is consistent with that, except that there is also a bin at 0 Hz. We often disregard this because it doesn’t tell us about the frequency content of the signal, but about something else.
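If it helps, here is a minimal sketch (mine, with an assumed sampling rate) listing the DFT bin frequencies for a 1 s analysis frame – note the 0 Hz bin at the start:

import numpy as np

fs = 16000                           # assumed sampling rate, in Hz
n = fs                               # a 1 s frame, so the bins are 1 Hz apart
freqs = np.fft.rfftfreq(n, d=1/fs)   # the frequency of each bin
print(freqs[:5])                     # [0. 1. 2. 3. 4.]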
To understand why there is some energy at 0 Hz, and what that means, can you first describe what that 0 Hz basis function looks like?
That will work fine, or use ctrl-C in the Terminal to terminate the Jupyter process (it will ask you to confirm that you really want to quit).

You are correct: “bins are then the discrete frequencies of the basis functions” – they are indexed by k in the DFT equation in the notebooks.
The DFT bins are determined only by the duration of the analysis frame and the sampling rate. They do not and cannot depend on the signal (e.g., on its fundamental frequency), because the DFT works for any signal and gives the same frequency resolution in all cases (for a particular sampling rate and analysis frame duration).
You correctly state that, for a 1 s analysis frame duration, the lowest frequency bin will be at 1 Hz. (We don’t even need to know the sampling rate to work this out.)
You are also correct in stating that when analysing a signal with energy at an “awkward” frequency (i.e., not an exact bin frequency), we will get leakage across several adjacent bins around that frequency.
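A minimal sketch of that leakage (my own example, with assumed values):

import numpy as np

fs = 64                          # assumed sampling rate, in Hz
t = np.arange(fs) / fs           # a 1 s frame, so the bins are 1 Hz apart
x = np.sin(2 * np.pi * 2.5 * t)  # 2.5 Hz: not an exact bin frequency

mags = np.abs(np.fft.rfft(x))
print(np.round(mags[:6], 1))     # energy leaks into the bins around 2.5 Hz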
Your final point is also correct: to construct a single impulse, we need to sum together all basis functions with equal amplitudes. The spectrum is flat, and there are no harmonics (because the single impulse is not a periodic signal).
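Again as a sketch (mine, with an arbitrary frame length), you can verify that flat spectrum directly:

import numpy as np

x = np.zeros(64)
x[0] = 1.0   # a single impulse - not periodic, so no harmonics

print(np.abs(np.fft.rfft(x)))   # every bin has magnitude 1: a flat spectrum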
Overall, your understanding looks pretty good!