Forum Replies Created
This means that you have not performed synthesis for that Utterance object. Perhaps you only set the text but didn’t complete the rest of the pipeline? A Wave relation only exists in an Utterance object if the waveform generation step has been run (e.g., SayText runs all steps, including that one).

No, you have to find mistakes: instances where the output speech is incorrect. Festival’s POS tagger is really old and not particularly good, so you will find mistakes. You might need to craft sentences in order to do this. Think about ambiguity!
It looks like students who enrolled slightly later on the course didn’t have their access enabled. Computing support are fixing this now…
Update: now fixed. Everyone should be able to rsync.

This issue with sound playback occurs in some VM installations, and can usually be solved like this.
You don’t need to use sudo to edit that file – it’s in your file space, so you can edit it as yourself.

You should add the PATH line at the end of the file, as per the instructions, not the start.
Don’t type the “$” before commands – this is used in the instructions to indicate the bash prompt (to differentiate from the Festival prompt).
You are running the wrong version of Festival (an old version that is also installed in the VM). You need to set your PATH as described in the instructions for Module 3 Tutorial B, to pick up the correct version.

That means you are not connected to the VPN. Have you installed the VPN client (Forticlient) and used it to connect to the VPN?
OK – good – you can reach the machine. That’s possible without the VPN, but logging in to it requires the VPN.
To test if you are on the VPN, use a browser (try this on both your host computer and in the VM) and type “What’s my IP” into Google. You are looking for an IP address starting with one of these:
129.215
192.41

or maybe one of these:

192.107
192.82
193.62
193.63
194.80
194.81

Make sure your host computer is connected to the VPN. Then, in the VM, try these and report your results back here:
$ nslookup scp1.ppls.ed.ac.uk
$ ping scp1.ppls.ed.ac.uk
$ ping 129.215.204.88
A very interesting topic. I’m not going to answer this now, but ask you to ask this question again in Module 7, when we will use knowledge of human hearing to motivate feature extraction for Automatic Speech Recognition.
A simplified description of the vocal folds is that they are closed most of the time. The air pressure from the lungs eventually forces them to burst open and release the pressure, after which they snap shut very rapidly.
This is not random motion, but a very particular pattern of opening and closing.
The signal generated by one cycle of the vocal folds (mainly related to the very rapid closing) can be approximated as an impulse.
Hence, the acoustic wave generated at the glottis is approximately an impulse train. So, we use an impulse train as a model of this signal, and our model has just one parameter: F0.
In the literature on phonation you will of course find much more sophisticated descriptions and models than this. The glottal wave is not exactly an impulse train, but has a particular shape which can be parameterised – e.g., the Liljencrants-Fant (LF) model has 4 parameters. This is beyond the scope of Speech Processing.
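If it helps to see this concretely, here is a minimal sketch (in Python with NumPy – my choice of tool, not code taken from the course materials) of building an impulse train from the model’s single parameter, F0:

import numpy as np

fs = 16000      # sample rate in Hz (an assumed value)
f0 = 100        # fundamental frequency in Hz: the model's only parameter
duration = 0.5  # length of the signal in seconds

# Place a single non-zero sample once per fundamental period;
# every other sample is zero.
period_in_samples = int(round(fs / f0))
x = np.zeros(int(fs * duration))
x[::period_in_samples] = 1.0

Plotting x is a good way to convince yourself that this one parameter is all the model needs to control the rate at which the glottal pulses occur.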
Yes, that’s why there is a non-zero magnitude in the 0 Hz DFT bin.
Is the waveform of an impulse train centred on zero?
Yes, that’s correct – we always plot all the bins, even if their magnitude is zero. Normally we just join them up with a line for easy visualisation, but the notebooks plot them as distinct points to help you understand the process.
Your last sentence gets it: the 0 Hz basis function is a horizontal line. It’s not at 0 because that wouldn’t be much use when weighted and summed to make the signal being analysed, so it’s at an amplitude of 1.
We need this 0 Hz component to account for any offset (bias) in the signal: is it on average above zero, below zero, or centred on zero?
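If you want to check this numerically, here is a minimal sketch (again in Python with NumPy – an assumption about tooling, not code from the notebooks). Bin 0 of the DFT correlates the signal with that constant basis function of amplitude 1, so it is simply the sum of all the samples: non-zero whenever the signal is not centred on zero.

import numpy as np

# An impulse train is never negative, so on average it sits above zero.
x = np.zeros(160)
x[::16] = 1.0   # ten impulses of amplitude 1

X = np.fft.fft(x)

print(X[0].real)            # 10.0: the 0 Hz bin
print(np.sum(x))            # 10.0: the sum of the samples
print(len(x) * np.mean(x))  # 10.0: N times the mean, i.e., the offset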
Welcome to the wonderful world of prosody! Terminology is sometimes used differently by different authors.
This is why the video ‘Prosody’ avoided getting into definitions and concentrated on the engineering application. So…keeping to what is relevant for speech synthesis, and talking only about English:
Words are made of syllables. In the pronunciation dictionary, at least one syllable in the word is marked as having primary lexical stress, and perhaps some other syllables as having secondary lexical stress.
When spoken in citation form (i.e., as an isolated word, obeying the dictionary pronunciation), the primary lexically-stressed syllable will sound more prominent than the others. The speaker will make some F0 movement on it to achieve this, and probably also make it louder and longer than usual. This is called a pitch accent. There might also be smaller pitch accents on the other lexically-stressed syllables.
In connected speech, not every lexically-stressed syllable in a spoken sentence will receive a pitch accent (there would be too many). Only some words in the sentence will be chosen by the speaker to receive a pitch accent, which will be placed on a lexically-stressed syllable.
So: lexical stress marked in the dictionary indicates syllables that might receive a pitch accent in connected speech.
Syllables that are not accented may have their vowels reduced, potentially all the way to schwa.