Forum Replies Created
Features must be specified by the system designer, using knowledge of the problem. If in doubt, every possible feature that the designer can think of is extracted – the tree building algorithm will select the most useful ones.
Questions about features can be hand-designed, but can often be automatically created from the features (e.g., enumerate all possible yes/no questions about every possible value of every feature).
At each iteration of tree building (i.e., splitting a node), all available questions are tried, and the one that most reduces the entropy (in the case of a classification task) is placed into the tree at that node.
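In case it helps, here is a minimal sketch of that splitting step in plain Python (not from any particular toolkit; the features, questions and class labels are invented for illustration). Every candidate question is tried and the one giving the greatest entropy reduction is kept:

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def best_question(examples, questions):
        """Return the question giving the largest reduction in entropy.

        examples:  list of (features_dict, class_label) pairs
        questions: list of (name, predicate) pairs, where predicate(features) -> bool
        """
        parent_entropy = entropy([label for _, label in examples])
        best, best_gain = None, -1.0
        for name, predicate in questions:
            yes = [label for feats, label in examples if predicate(feats)]
            no = [label for feats, label in examples if not predicate(feats)]
            if not yes or not no:
                continue  # this question does not actually split the data
            # weighted entropy of the two child nodes
            child_entropy = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(examples)
            gain = parent_entropy - child_entropy
            if gain > best_gain:
                best, best_gain = name, gain
        return best, best_gain

    # toy data: predict a prosodic class from two made-up features
    examples = [
        ({"pos": "NN", "stressed": True},  "long"),
        ({"pos": "DT", "stressed": False}, "short"),
        ({"pos": "NN", "stressed": False}, "long"),
        ({"pos": "VB", "stressed": True},  "short"),
    ]
    questions = [
        ("is_noun",     lambda f: f["pos"] == "NN"),
        ("is_stressed", lambda f: f["stressed"]),
    ]
    print(best_question(examples, questions))  # ('is_noun', 1.0)

In a real system this selection is simply repeated at every node until a stopping criterion is met.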
ToBI is symbolic – all prosodic events (accents or boundary tones) are categorised into a set of classes (the most common accent being H*, which is a rise-fall shape).
Tilt is parametric and represents intonation events as a small set of continuous parameters that describe the shape.
ToBI is suitable for hand-labelling prosody (although this is a highly skilled task). Tilt is not designed for hand-labelling: it is for converting the F0 contours of events (which might be themselves manually or automatically found) into numbers which we can then model.
The material on predicting prosody shows how prediction can be done in several stages. We might use ToBI for the categories in the first stages (placement of events, classification of their types) and perhaps then use Tilt as the parametric representation of shape for realisation as a fragment of F0 contour.
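For the curious, here is a rough sketch of how the Tilt shape parameter can be computed from the rise and fall of a single event, following the usual definitions from Taylor's Tilt model (the numerical values are made up; a full event parameterisation also includes the peak position and the F0 value at the peak):

    def tilt_parameters(a_rise, a_fall, d_rise, d_fall):
        """Compute Tilt parameters for one intonation event.

        a_rise, a_fall: amplitudes (Hz) of the rise and fall parts of the event
        d_rise, d_fall: durations (s) of the rise and fall parts
        Returns overall amplitude, duration, and the tilt shape parameter
        (+1 = pure rise, 0 = equal rise and fall, -1 = pure fall).
        """
        a_rise, a_fall = abs(a_rise), abs(a_fall)
        amplitude = a_rise + a_fall
        duration = d_rise + d_fall
        tilt_amp = (a_rise - a_fall) / (a_rise + a_fall)
        tilt_dur = (d_rise - d_fall) / (d_rise + d_fall)
        tilt = (tilt_amp + tilt_dur) / 2.0
        return amplitude, duration, tilt

    # a roughly symmetrical rise-fall accent (like a typical H*) gives tilt near 0
    print(tilt_parameters(a_rise=30.0, a_fall=28.0, d_rise=0.12, d_fall=0.15))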
It’s important to understand that prosody is very much an unsolved problem in TTS, and so everything you read on prosody should be seen in that light. There is no single method that “just works” and each system will probably take a different approach.
Try searching for ToBI in the forums to find related topics.
It’s also worth noting that certain users need exceptionally high speaking rates. Blind computer programmers are one example.
Some people may still use formant synthesis, depending on personal preference.
In this topic I noted that we generally avoid performing such signal modifications in unit selection speech synthesis because they degrade the quality. However, speed-up of a synthetic voice is sometimes necessary (e.g., it is required by blind users) and the paper referenced in the above thread compares several ways to do that.
In older diphone synthesis, it was necessary to manipulate duration and F0 independently, and there are several ways to do that. You can try TD-PSOLA for yourself in Praat.
In unit selection (i.e., concatenative) speech synthesis, we generally avoid making any modifications to the recorded speech, because they will introduce artefacts and so degrade naturalness. It’s much easier to vary speaking rate in statistical parametric synthesis.
To read our recent research in this area, try
Cassia Valentini-Botinhao, Markus Toman, Michael Pucher, Dietmar Schabus, and Junichi Yamagishi. “Intelligibility of time-compressed synthetic speech: Compression method and speaking style” in Speech Communication, Volume 74, Nov 2015, pp 52–64. DOI: http://dx.doi.org/10.1016/j.specom.2015.09.002
One finding is that linear time compression of normal speech is the best strategy. This is not what happens in natural speech though. As you point out, natural fast speech does indeed involve more and more deletions, but these seem to harm intelligibility and so we must conclude they are done to benefit the speaker (less effort) and not the listener.
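Purely as an illustration of what uniform (linear) time compression means, here is a sketch using an off-the-shelf phase-vocoder time stretch from librosa. This is not the method used in the paper, and it is not TD-PSOLA; the file names are placeholders:

    # Uniform (linear) time compression of a waveform: 1.5x faster, pitch unchanged.
    import librosa
    import soundfile as sf

    y, sr = librosa.load("input.wav", sr=None)           # "input.wav" is a placeholder path
    y_fast = librosa.effects.time_stretch(y, rate=1.5)   # phase-vocoder time stretch
    sf.write("output_fast.wav", y_fast, sr)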
Yes, there is – it’s described here, as part of the coursework for the Speech Synthesis course.
That will tell you the utterance number, and then you can look that up in the list of ARCTIC sentences that was used as the recording script for the cstr_edi_awb_arctic_multisyn voice. To listen to the original source sentence, find the appropriate wav file here.
It’s a common misconception that a language with fixed inventory of phonological units (whether Consonant-Vowel units, syllables, or whatever) can be perfectly synthesised from an inventory containing a single recording of each such unit.
All languages have a fixed inventory of phonemes (it’s not possible to invent new ones!) and also of syllables (due to phonotactic constraints), or whatever the equivalent is in that language (e.g., the mora in Japanese).
The key point is that the acoustic realisation of each unit is influenced by its context, and so having multiple recordings of each (from many different contexts) will give better results.
Tone languages still have intonation. Tone by itself does not entirely determine F0. Tone is typically realised as the shapes of an F0 contour, not absolute values. In tone languages, F0 is carrying both segmental and supra-segmental information.
Anything that is not in the pronunciation dictionary will have to be dealt with by the Letter-To-Sound (LTS) model.
[I’m merging this into “Jurafsky & Martin – Chapter 8” where a related question about pronunciation of names has been asked previously]
Nothing will go wrong when using any number of bits.
However, the choices of 8 and 16 are the most convenient because of the way computers store numbers. 8 bits is one byte and corresponds to a char in software (in C family languages). 16 bits corresponds to a short (or, on older 16-bit platforms, an int). Using 9 bits, for example, would be very inconvenient when writing software, since there is no built-in type that is of exactly that size.
Deep down in the operating system (in fact, in the hardware), everything is stored with a fixed number of bits. In modern operating systems, this is now usually 64 bits (older computers used 32 bits). The operating system can very neatly pack 8 or 16 bit numbers into memory. It would be messy to pack 9 bit numbers into memory, and also wasteful since we couldn’t use all 64 bits.
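Here is a small sketch (using numpy; the signal itself is arbitrary) of why 8 and 16 bit samples are so convenient to store, while something like 9 bits would not be:

    import numpy as np

    # one second of a 440 Hz sine at 16 kHz, values in [-1, 1]
    t = np.arange(16000) / 16000.0
    x = np.sin(2 * np.pi * 440 * t)

    # quantise to 16 bits: one sample = one int16 = exactly 2 bytes
    x16 = np.round(x * 32767).astype(np.int16)
    print(x16.dtype, x16.nbytes)   # int16, 32000 bytes

    # quantise to 8 bits: one sample = one int8 = exactly 1 byte
    x8 = np.round(x * 127).astype(np.int8)
    print(x8.dtype, x8.nbytes)     # int8, 16000 bytes

    # there is no 9-bit integer type: you would have to pack samples across
    # byte boundaries yourself, or waste bits by storing each 9-bit value
    # in a 16-bit container.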
[please ask one question per post – see the topic bit depth for your second question]
dB is always 10 log(ratio), because the 10 converts from bels to decibels
the 20 comes from 10 log(ratio^2), where raising to the power of 2 converts magnitude to power
remember that log(x × x) = log(x) + log(x) = 2 log(x)
the key point to remember is that dB is a logarithmic scale, which is generally much more useful, for example when plotting a spectrum
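A quick numerical check of that identity (plain Python; the amplitude ratio is arbitrary):

    import math

    amplitude_ratio = 0.5   # e.g. the signal's magnitude halves

    power_db = 10 * math.log10(amplitude_ratio ** 2)   # power is magnitude squared
    magnitude_db = 20 * math.log10(amplitude_ratio)    # the "20 log" shortcut

    print(power_db, magnitude_db)   # both are approximately -6.02 dB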
It was a typo on the page – the video really is just 42 seconds long – nothing is missing
The same techniques – for example, a classification tree – are used for unseen names as for other types of unseen words. But, in some systems, separate classifiers are used in the two cases.
The classifier for names might use additional features, provided by some earlier stage in the pipeline. For example, a prediction (“guess”) at which foreign language the word originates from.
This prediction would come from a language classifier that would itself need to be trained in a supervised manner from labelled data, such as a large list of words tagged with language of origin. This classifier might use features derived from the sequence of letters in the word, or even simply letter frequency, which differs between languages.
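Just to make the idea concrete, here is a toy sketch of a letter-frequency language guesser (the training word lists, language names and function names are all invented for illustration; a real classifier would be trained on large labelled name lists and would probably use letter n-grams rather than single letters):

    import math
    from collections import Counter

    def letter_log_probs(words):
        """Estimate letter log-probabilities from a list of example words."""
        counts = Counter(ch for w in words for ch in w.lower() if ch.isalpha())
        total = sum(counts.values())
        # add-one smoothing over the alphabet so unseen letters are not impossible
        return {ch: math.log((counts.get(ch, 0) + 1) / (total + 26))
                for ch in "abcdefghijklmnopqrstuvwxyz"}

    # tiny, made-up training lists
    training = {
        "German":  ["schmidt", "müller", "schneider", "bach", "zweig"],
        "Italian": ["rossi", "esposito", "ricci", "gallo", "bianchi"],
    }
    models = {lang: letter_log_probs(words) for lang, words in training.items()}

    def guess_language(word):
        """Score the word under each language's letter model; return the best."""
        scores = {lang: sum(probs.get(ch, math.log(1 / 26))
                            for ch in word.lower() if ch.isalpha())
                  for lang, probs in models.items()}
        return max(scores, key=scores.get)

    print(guess_language("schwarz"))   # likely "German"
    print(guess_language("bianco"))    # likely "Italian"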
For our purpose (which is to form a simple theoretical source-filter model of speech production), I think it’s OK to say that the vocal folds are simply vibrating during vowel sounds and the only thing that can vary about this is the frequency of vibration (F0).
In tone languages, F0 can distinguish words and so is a phonological feature. In other languages (e.g., English), F0 does not distinguish words: it carries prosodic rather than phonemic information.
Of course, reality is more complex. The vocal folds can be used to control voice quality. For example, in breathy speech, the folds never completely close and the leaking airflow results in turbulence (like in a fricative).
The term magnitude is usually used with regard to the spectrum (e.g., obtained by the FFT). It is used to distinguish from the phase spectrum, which we don’t really need to worry about here.
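For example, with numpy the magnitude and phase spectra are obtained like this (the test signal is arbitrary):

    import numpy as np

    # a short frame of a 250 Hz sine sampled at 16 kHz
    sr = 16000
    t = np.arange(512) / sr
    frame = np.sin(2 * np.pi * 250 * t)

    spectrum = np.fft.rfft(frame)    # complex-valued DFT
    magnitude = np.abs(spectrum)     # the magnitude spectrum
    phase = np.angle(spectrum)       # the phase spectrum (usually ignored here)

    # frequency (Hz) of the largest magnitude peak
    freqs = np.fft.rfftfreq(512, d=1 / sr)
    print(freqs[np.argmax(magnitude)])   # 250.0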
Try it for yourself!
Draw a sine wave on some graph paper and then sample it slightly more than 2 times per period. Scan that and post it here, then we can discuss it.
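If you prefer, the same exercise can be simulated numerically (the frequencies here are arbitrary):

    import numpy as np

    f0 = 100.0           # frequency of the sine wave (Hz)
    fs = 2.1 * f0        # sampling rate slightly above 2 samples per period

    n = np.arange(42)    # 42 samples = 0.2 seconds at this rate
    samples = np.sin(2 * np.pi * f0 * n / fs)

    # The samples still trace out an oscillation: the underlying 100 Hz sine
    # can in principle be reconstructed, but consecutive samples drift slowly
    # through the cycle, which is what your graph-paper sketch will show.
    for value in samples[:10]:
        print(round(value, 2))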