Forum Replies Created
In general, no.
Anything that you might be able to find in a dictionary (if you imagine having a really huge dictionary) is a Standard Word (even if our particular dictionary doesn’t include it).
Another way to decide whether something is a Standard Word might be to say that its pronunciation has to be determined directly from its spelling, using the same method as for all other Standard Words, without any other processing first.
Anything that you would not expect to find in the dictionary, however large it was, is a Non-Standard Word (NSW). These need converting to Standard Word(s) before attempting dictionary lookup – and that process is called normalisation.
For example, I just made up the word “Simonification”. If that got into common usage (it’s possible!) then one day dictionary writers would include it in their dictionaries. So, that’s a Standard Word, even though no current dictionary yet includes it.
In contrast, no dictionary would ever attempt to include
- £1.25
- £1.26
- £1.27
- etc.
so those are NSWs.
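If it helps to see the idea in code, here is a minimal Python sketch of one normalisation rule, for exactly that token type. The function names (normalise_currency, number_to_words) are made up for this illustration; a real front-end has rules or trained models for many NSW classes (dates, times, abbreviations, numbers, and so on).

```python
import re

def number_to_words(n):
    """Tiny number expander, enough for this example (0-99)."""
    units = ["zero", "one", "two", "three", "four",
             "five", "six", "seven", "eight", "nine"]
    teens = ["ten", "eleven", "twelve", "thirteen", "fourteen",
             "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    tens = ["", "", "twenty", "thirty", "forty",
            "fifty", "sixty", "seventy", "eighty", "ninety"]
    if n < 10:
        return [units[n]]
    if n < 20:
        return [teens[n - 10]]
    if n % 10 == 0:
        return [tens[n // 10]]
    return [tens[n // 10], units[n % 10]]

def normalise_currency(token):
    """Expand a currency NSW such as '£1.25' into Standard Words."""
    m = re.fullmatch(r"£(\d+)\.(\d{2})", token)
    if m is None:
        return [token]  # not a token type this rule handles
    pounds, pence = int(m.group(1)), int(m.group(2))
    words = number_to_words(pounds)
    words.append("pound" if pounds == 1 else "pounds")
    words += number_to_words(pence)
    words.append("pence")
    return words

print(normalise_currency("£1.25"))  # ['one', 'pound', 'twenty', 'five', 'pence']
```

The point is simply that the NSW gets replaced by Standard Words, which can then go through dictionary lookup (or letter-to-sound) in the usual way.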
October 25, 2019 at 15:01 in reply to: Linear Predictive Coding (LPC) – what is the residual? #10028
The residual is a special waveform. It is what you need to input to the filter in order to exactly reconstruct the speech signal.
The filter is not a perfect simulation of the vocal tract. The vocal folds also do not generate a perfect impulse train. Therefore, putting an impulse train through an LPC filter will not produce perfectly natural speech.
We really like the simple form of the filter because it is easy to solve equations to find its coefficients, given a frame of natural speech waveform. So, we account for the imperfections in the model by replacing the pulse train with the residual, which contains all the information “left over” from the original speech waveform that the filter was not able to model (that’s why it’s called the “residual”).
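If you want to see this concretely, here is a minimal Python sketch, assuming librosa and scipy are available (the file name voiced_frame.wav is just a placeholder): estimate the LPC coefficients for one frame, inverse filter to obtain the residual, then pass the residual back through the all-pole filter to reconstruct the frame exactly.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

# Load some voiced speech at 16 kHz and take one ~25 ms frame
y, fs = librosa.load("voiced_frame.wav", sr=16000)
frame = y[:400]

order = 16
a = librosa.lpc(frame, order=order)   # [1, a1, ..., ap]: denominator of the all-pole filter

# Inverse filtering: pass the speech through A(z) to get the residual
residual = lfilter(a, [1.0], frame)

# Re-synthesis: pass the residual back through 1/A(z)
reconstruction = lfilter([1.0], a, residual)

print(np.max(np.abs(frame - reconstruction)))  # ~0: exact reconstruction, up to rounding
```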
Join smoothing:
- we can manipulate the filter coefficients of a few frames around each join, if we wish to remove discontinuities in the spectral envelope.
- we can use PSOLA on the residual (it’s just a waveform) to manipulate the fundamental frequency, if we wish to remove pitch discontinuities across joins
So, in residual-excited LPC (RELP) we still need to use PSOLA to manipulate the F0 and duration of the residual! Why not do that processing on the speech waveform itself (i.e., TD-PSOLA)? Because PSOLA works better on residual waveforms than on speech waveforms: the residual is closer to an impulse train, so overlap-add creates fewer artefacts.
October 25, 2019 at 13:58 in reply to: Referencing several parts of the same work (e.g., chapters in a book). #10026
If the whole book is by the same author(s), it should appear only once in the bibliography.
If the book is a collection of chapters by different authors, then each chapter should be a separate entry in the bibliography, under the appropriate author(s).
When citing any longer work, and especially a book, you should narrow down the citation at the point where you cite it, to help your reader precisely locate the material you are implicitly asking them to read. This could be to a chapter, section, or page(s), such as:
“Text processing for TTS usually involves a sequence of diverse operations. Some are as simple as splitting sentences into tokens, others as challenging as disambiguating homographs (Taylor, 2009, Section 4.1).”
Omitting the section number in the above example implies that you expect the reader to read the entire book in order to locate the material supporting your claim. That is an easy way to annoy your reader.
I generally prefer to use section or chapter rather than page number because this should be more consistent across different editions, the e-book version, and so on.
Citing an entire long work would only be appropriate if it really is the entire work that supports your claim, such as:
“Taylor (2009) provides a comprehensive overview of the state-of-the-art in TTS as it was in 2009, but there have been significant developments since then.”
October 25, 2019 at 11:25 in reply to: Linear Predictive Coding (LPC) – what is the residual? #10022
The spectral envelope of the speech signal (the output of the source-filter model) is equal to the frequency response of the filter. That is, the filter is solely responsible for shaping the speech spectral envelope.
In the frequency domain, we multiply frequency responses together, so the residual spectrum has to be flat so that when it is multiplied by the filter frequency response we get the speech spectral envelope.
In the idealised (simplified) version of the model, the residual is replaced with a train of impulses. This has a flat spectral envelope, as you will have discovered in the lab when analysing that signal. It has harmonic structure, but all harmonics are of the same amplitude.
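You can check the flat envelope yourself with a few lines of numpy: build an impulse train and look at the magnitude of its DFT at the harmonics; every harmonic has the same amplitude.

```python
import numpy as np

fs = 16000            # sampling rate (Hz)
f0 = 100              # fundamental frequency (Hz)

# One second of a 100 Hz impulse train: one impulse every fs/f0 = 160 samples
x = np.zeros(fs)
x[::fs // f0] = 1.0

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(fs, d=1.0 / fs)

# Magnitude at the first few harmonics: all identical, i.e. a flat envelope
for k in range(1, 6):
    idx = np.argmin(np.abs(freqs - k * f0))
    print(f"{k * f0:4d} Hz : {spectrum[idx]:.1f}")
```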
There is a note on the library page for the ebook that says “Access to this ebook is unavailable until 04/07/2020” – this is probably because the publisher has withdrawn it, or we have exceeded the number of views allowed.
This book is published by a very strange and uncooperative publisher who won’t let us purchase any more “copies” of the ebook and limits the number of views (per year, I think).
You’ll have to do things the normal way (you might think “old-fashioned”) and actually go to the library to read the book! There are copies on reserve.
You might also ask classmates if they have long-term loan copies or their own copy, and ask to share.
Note: I expect taught MSc students on degrees including SLP and some of the Informatics programmes to buy this book and not use library copies.
The filter parameters control the spectral envelope (e.g., the formants) and so this is what we manipulate to smooth joins. So we do alter the values of the filter coefficients (which are model parameters).
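To make that concrete, here is a rough Python sketch using scipy.signal.freqz; the coefficient values are invented purely for illustration. The frequency response computed from a frame’s filter coefficients is its spectral envelope, and the mismatch between the envelopes either side of a join is what smoothing tries to reduce.

```python
import numpy as np
from scipy.signal import freqz

fs = 16000

# Hypothetical all-pole coefficients for the frames either side of a join
a_left = np.array([1.0, -1.6, 0.9])    # frame just before the join
a_right = np.array([1.0, -1.2, 0.7])   # frame just after the join

# The filter's frequency response is the spectral envelope it imposes
w, h_left = freqz([1.0], a_left, worN=512, fs=fs)
_, h_right = freqz([1.0], a_right, worN=512, fs=fs)

envelope_left = 20 * np.log10(np.abs(h_left))
envelope_right = 20 * np.log10(np.abs(h_right))

# The gap between the two envelopes is the spectral discontinuity at the join
print(np.max(np.abs(envelope_left - envelope_right)), "dB maximum mismatch")
```

One caveat: interpolating raw LPC coefficients directly can produce unstable filters, so in practice the smoothing is usually done in a better-behaved parameterisation such as line spectral frequencies.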
There is an answer to your question in Forums > Speech Processing – Exam revision > Top Hat questions
October 15, 2019 at 13:46 in reply to: Installing the Scottish male voice for festival at home #9975
No, those are the instructions for actually building a new voice. That’s the coursework for Speech Synthesis in semester 2.
To install a voice, you simply need to copy the appropriate files from the computers in the lab. Korin is the best person to write instructions for that, so I’ll ask him to do so.
That’s a good idea and I will try to do this.
It will take a little time because the 3rd edition looks substantially different from the 2nd edition and therefore I need to read it and compare the content to the current readings.
The official readings (i.e., what is examinable) will remain those from the 2nd edition.
All classes are recorded and can be accessed via the Learn page for the course.
You are probably using an analysis window that is too short, which prevents you from resolving the spectrum in enough detail.
In Wavesurfer’s spectrum window, the parameter to change is “FFT points”. Try a larger value such as 4096 (which would be about 0.25 seconds at a 16 kHz sampling rate) or 8192.
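To see why this matters, work out the frequency resolution: each FFT bin spans (sampling rate ÷ FFT points) Hz, and you need a bin spacing much smaller than the harmonic spacing if you want to see the individual harmonics clearly. A quick check in Python:

```python
fs = 16000  # sampling rate (Hz)
for n_fft in (256, 1024, 4096, 8192):
    print(f"{n_fft:5d} points -> {n_fft / fs * 1000:6.0f} ms window, "
          f"{fs / n_fft:5.2f} Hz per bin")
```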
The fundamental frequency of a complex wave does not necessarily have the largest amplitude in the spectrum.
We can use the source-filter model to understand how that is possible, using figure 4.13 from Ladefoged that you attached. These are idealised speech waveforms, made by passing an impulse train through a filter.
The filter is particularly simple in this example: it has a single resonance at 600 Hz.
Energy at or close to the resonant frequency is amplified by the filter, whereas energy at frequencies far away from the resonant frequency is attenuated. To convince yourself that a filter can do that, think about a brass instrument like a trumpet: the input is generated by vibrating lips at the mouthpiece, which is not very loud, yet the output can be very loud indeed.
The input impulse train in Ladefoged’s example has a fundamental frequency of 100 Hz in the uppermost plot. This contains equal amounts of energy at 100 Hz, 200 Hz, 300 Hz, 400 Hz, 500 Hz, 600 Hz, 700 Hz, and so on.
Thinking in the frequency domain will be easier than the time domain. The spectrum of that impulse train tells us that the waveform is equivalent to a sine wave at 100 Hz, added to one at 200 Hz, another one at 300 Hz, and so on.
All that a (linear) filter can do to a sine wave is change its amplitude: increase or decrease it. The amount of increase or decrease plotted against frequency is called the frequency response of the filter. The filter in the example has a peak in its frequency response at 600 Hz, meaning that any input at that frequency (e.g., the 600 Hz sine wave component of the impulse train) will be amplified.
Try this yourself in the lab: take an impulse train and pass it through a filter that has a single resonance (in Praat you can use “filter one formant“), then inspect the waveform and the spectrum. With appropriate filter settings you can almost entirely attenuate the fundamental. But, listen to the resulting signal and you will perceive the same pitch as the original impulse train.
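If you would rather try it in Python than Praat, here is a rough equivalent, with a two-pole resonator standing in for “filter one formant” (the pole radius r and the resonance frequency are just illustrative values). The 600 Hz harmonic comes out roughly 20 dB stronger than the fundamental, yet the output still sounds like it has a pitch of 100 Hz.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 100      # fundamental of the impulse train (Hz)
fc = 600      # resonance ("formant") frequency of the filter (Hz)
r = 0.99      # pole radius: closer to 1 means a sharper, narrower resonance

# One second of a 100 Hz impulse train
x = np.zeros(fs)
x[::fs // f0] = 1.0

# Single-resonance (two-pole) filter with its peak near fc
a = [1.0, -2 * r * np.cos(2 * np.pi * fc / fs), r * r]
y = lfilter([1.0], a, x)

# Compare the output amplitudes at the fundamental and at the 600 Hz harmonic
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(fs, d=1.0 / fs)
for f in (100, 600):
    idx = np.argmin(np.abs(freqs - f))
    print(f"{f} Hz harmonic: {20 * np.log10(spectrum[idx]):.1f} dB")
```

Save y to a wav file (e.g. with soundfile.write) and listen: the pitch you hear is still 100 Hz, even though the 100 Hz component is far from being the strongest one.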
People have tried using automatic speech recognition to evaluate the intelligibility of synthetic speech, but with only limited success. So the simple answer is that there is no objective measure of intelligibility.
The Blizzard Challenges in 2008, 2009, and 2010 included tasks on Mandarin Chinese and the summary papers for these years (available from http://festvox.org/blizzard/index.html) tell you about the two measures used: pinyin error rate with or without tone.
The general answer is: no, Word Error Rate is not the most useful measure of intelligibility for all languages.
There is no single index of all available TTS front-ends – the closest thing would be on the SynSIG website’s software list.
Availability varies widely with language, and for some there is no free software available.
So the short answer is “No – you’ll need to talk to your supervisor”.