Forum Replies Created
Yes, that’s how you convert a log probability back to a probability (noting that logs are not necessarily base 10, although in Festival I believe they are).
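Purely as an illustration (the base and the example value here are assumptions, not something read from a real Festival utterance), converting back looks like this:

```python
import math

log_p = -2.3  # an illustrative log probability

# If the logarithm is base 10 (as I believe Festival uses):
p_base10 = 10 ** log_p       # about 0.005

# If it were a natural logarithm instead:
p_natural = math.exp(log_p)  # about 0.100

print(p_base10, p_natural)
```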
“transition probabilities of observed states and emission states” is a little muddled – I suggest waiting for HMMs to be covered in the Speech Processing course and then see if that helps you understand better. States emit observations according to some emission probability distribution, and there are also probabilities on the transitions between states.
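If a concrete picture helps before the lectures get there, here is a toy sketch with invented numbers (two states, two possible observations); it is only meant to show that emission and transition probabilities are two different sets of parameters:

```python
# A toy HMM with two states. All numbers are invented, purely for illustration.

# Transition probabilities: P(next state | current state)
transitions = {
    "S1": {"S1": 0.7, "S2": 0.3},
    "S2": {"S1": 0.4, "S2": 0.6},
}

# Emission probabilities: each state emits observations from its own distribution
emissions = {
    "S1": {"a": 0.9, "b": 0.1},
    "S2": {"a": 0.2, "b": 0.8},
}

# Probability of being in S1, emitting "a", then moving to S2 and emitting "b":
p = emissions["S1"]["a"] * transitions["S1"]["S2"] * emissions["S2"]["b"]
print(p)  # 0.9 * 0.3 * 0.8 = 0.216
```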
You are not expected, for this assignment, to modify the method used by Festival for any step in the pipeline, such as prob_models. The combination of which methods are used for each part of the pipeline is part of the voice definition, and it won’t always make sense to modify one method in isolation (e.g., a subsequent step in the pipeline might be expecting a specific relation to be created by a preceding step).

Those are log probabilities. It is good practice to store probabilities as their logarithm, because absolute probabilities can be very small numbers which are hard to store with sufficient precision.
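Here is a small demonstration of why (the numbers are arbitrary): multiplying many small probabilities underflows in double precision, whereas summing their logs does not.

```python
import math

p = 1e-5               # an illustrative per-step probability
n = 100                # number of steps

product = 1.0
log_sum = 0.0
for _ in range(n):
    product *= p
    log_sum += math.log(p)

print(product)   # 0.0 - underflow: the true value 1e-500 is not representable
print(log_sum)   # about -1151.3, i.e. log(1e-500), stored without difficulty
```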
Diphone concatenation and TD-PSOLA could be implemented as a single process during waveform generation. We can simultaneously modify the F0 and duration inside each individual diphone and overlap-add the last pitch period of each diphone with the first pitch period of the next diphone to concatenate them.
In TD-PSOLA, the width of the analysis window is typically twice the fundamental period so that each window contains two pitch periods. This is so that, if we need to space the pitch periods further apart (i.e., to reduce F0), there is some waveform to ‘fill the gap’.
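If you want to see the idea in code, here is a heavily simplified sketch (not Festival’s implementation; the pitchmark handling, edge cases and window choice are all assumptions): analysis frames two pitch periods wide are cut around each pitchmark, windowed, and then overlap-added at a new spacing to change F0.

```python
import numpy as np

def psola_resynthesis(x, pitchmarks, new_period):
    """Very simplified TD-PSOLA-style resynthesis (illustrative only).

    x          : 1-D waveform (numpy array)
    pitchmarks : sample indices of the analysis pitchmarks
    new_period : desired output pitch period in samples (sets the new F0)
    """
    # Analysis: take a two-pitch-period window around each pitchmark
    T0 = int(np.median(np.diff(pitchmarks)))   # original pitch period (samples)
    half = T0                                  # window is 2*T0 wide -> half-width T0
    window = np.hanning(2 * half)

    frames = []
    for m in pitchmarks:
        if m - half < 0 or m + half > len(x):
            continue                           # skip edge frames for simplicity
        frames.append(x[m - half:m + half] * window)

    # Synthesis: overlap-add the frames at the *new* spacing
    y = np.zeros(len(frames) * new_period + 2 * half)
    for i, frame in enumerate(frames):
        start = i * new_period
        y[start:start + 2 * half] += frame
    return y
```

Spacing the frames closer together than the original period raises F0; spacing them further apart lowers it, and the two-period-wide frames are what fill the gaps.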
The nicely-plotted waveform shown in the video at 5:50 is for one of the diphones in the sequence. At 5:45 my hand-drawn waveform for one diphone was unfortunately only two periods long – that was sloppy of me and potentially confusable with a TD-PSOLA analysis frame, which it is not. A diphone would generally be longer than that.
Each impulse response does not represent one phone. A phone is generally much longer than T0: the vocal tract shape changes much more slowly than the vocal folds vibrate. For example, at an F0 of 100 Hz, T0 is only 10 ms, whereas a typical phone lasts tens to hundreds of milliseconds, so many pitch periods occur within a single phone.
November 4, 2022 at 18:15 in reply to: Finding documentation about recognizing non-standard words in Festival #16262
You can assume that Festival does it in the way described in this core reading. Remember that the assignment is about Text-to-Speech in general, not narrowly about Festival.
This recommended reading from Taylor will help you think about which parts of the problem are harder than others.
If you want to read beyond the course readings (which is of course optional and not expected, but something you may choose to do if aiming for a very high mark), then Chapter 5 Text decoding: finding the words from the text of Taylor’s book goes into more depth.
November 4, 2022 at 15:59 in reply to: Taking at look at letter-to-sound rules (for new words) #16256
No, you don’t need to explain specific failures by referencing the actual tree. That would not be very insightful anyway: although we often say that decision trees are human readable, the error might occur very deep down the tree and be hard to explain.
Instead, focus on the general properties of the model being used. For example, why might a decision tree make errors at all? How could you do better: a bigger tree, a smaller tree, a change from a decision tree to another model, training it on more data, and so on?
For a word that Festival’s letter-to-sound model got wrong, was it because that word might be particularly hard for some reason? What reason?
Yes: what is the overall aim of the report? What question(s) are you trying to answer? Essentially, you want to give the reader some motivation to read the rest of the report: what will they find out, and why is it interesting?
November 4, 2022 at 12:17 in reply to: Taking at look at letter-to-sound rules (for new words) #16231
Yes – see here. Remember that, for English, they are not rules as such, but a decision tree learned from data (i.e., from the dictionary).
Can you point at the exact video and timestamp for context, so I can give a precise answer?
Yes, it’s possible to think of diphone synthesis and unit selection along a continuum.
At one end is diphone synthesis, in which we have exactly one copy of each diphone type, so there is no need for a selection algorithm, but we will need lots of signal processing to manipulate the recordings.
At the other end is unit selection with an infinitely large database containing all conceivable variants of every possible diphone type. Now, selection from that database becomes the critical step. With perfect selection criteria, we will find such a perfect sequence of units that no signal processing will be required: the units will already have exactly the right acoustic properties for the utterance being synthesised.
Real systems, of course, live somewhere in-between. We can’t have an infinite database; even if we could, there are no perfect selection criteria (target cost and join cost). A real system will select a pretty good sequence of units most of the time. A little signal processing might be employed, for example to smooth the joins.
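To make the two cost terms a little more concrete, here is a minimal sketch of the total cost of one candidate unit sequence (the function names and the simple summation are my own simplification, not any particular system’s):

```python
def total_cost(targets, candidates, target_cost, join_cost):
    """Total cost of one candidate unit sequence (illustrative simplification).

    targets     : list of target specifications (one per diphone position)
    candidates  : list of selected candidate units, same length as targets
    target_cost : function(target, unit) -> how well the unit matches the target
    join_cost   : function(unit_a, unit_b) -> how well two units join
    """
    cost = sum(target_cost(t, u) for t, u in zip(targets, candidates))
    cost += sum(join_cost(a, b) for a, b in zip(candidates[:-1], candidates[1:]))
    return cost
```

A real system searches over all candidate sequences for the one that minimises this total, which is where Dynamic Programming comes in (see the reply about the Viterbi algorithm below).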
First person singular is acceptable for the lab report. The passive is not wrong, but can create problems including being more verbose, or (much more importantly) being potentially ambiguous about who did something.
It’s crucial to be clear about which ideas and work are your own, and which are not. Prioritise that above any stylistic decisions.
November 1, 2022 at 15:50 in reply to: Module 6 – acoustic similarity features for join costs #16209
There are many ways of measuring the spectral difference between two speech sounds. Formants would be one way, although it can be difficult to accurately estimate them in speech, and they don’t exist for all speech sounds, so we rarely use them. The cepstrum is a way to parameterise the spectral envelope, and we’ll be properly defining the cepstrum in the upcoming part of Speech Processing about Automatic Speech Recognition.
Cepstral distance would be a good choice for measuring how similar two speech sounds are whilst ignoring their F0. (That’s what we’ll use it for in Automatic Speech Recognition.) So, you are correct that we can use it as part of the join cost in unit selection speech synthesis for measuring how different the spectral envelopes are on either side of a potential join.
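As a sketch of what that might look like (real systems may weight the coefficients or use other distance measures; the vectors here are made up):

```python
import numpy as np

def cepstral_distance(c_left, c_right):
    """Euclidean distance between two cepstral coefficient vectors,
    e.g. from the last frame of one candidate unit and the first frame
    of the next candidate. (Illustrative only.)"""
    c_left = np.asarray(c_left, dtype=float)
    c_right = np.asarray(c_right, dtype=float)
    return np.sqrt(np.sum((c_left - c_right) ** 2))

# Two made-up 4-dimensional cepstral vectors either side of a potential join:
print(cepstral_distance([1.2, -0.3, 0.5, 0.1], [1.0, -0.2, 0.6, 0.0]))
```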
Your question about “cepstral distance between the realised acoustics of the chosen phones and the acoustics of the target specification” is straying into material from the Speech Synthesis course, in a method known as hybrid unit selection. Best to wait for that course, rather than answer the question now.
We’ll be covering the Viterbi algorithm in the next section of Speech Processing, about Automatic Speech Recognition. In Module 7, we will encounter the general algorithm of Dynamic Programming in the method known as Dynamic Time Warping. Later, we will see another form of Dynamic Programming, called the Viterbi algorithm, applied to Hidden Markov Models. So, wait for those parts of the course, then ask your question again.
Your questions about whether to take the maximum or minimum relate to whether we are maximising the total (which is what we would do if it was a probability) or minimising it (which is what we would do if it was a distance).
Regarding your questions about whether to multiply or sum: if we are working with probabilities, we multiply. If we are working with distances or costs, we sum (and in fact, we will end up working with log probabilities, which we will sum).
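As a preview, and only as a sketch (the trellis here is generic and the numbers invented, not a specific HMM or unit selection lattice), here is the min-sum form of Dynamic Programming; the max-product form over probabilities is the same algorithm with min/argmin replaced by max/argmax, or equivalently a maximum over summed log probabilities:

```python
import numpy as np

def min_cost_path(local_costs, transition_costs):
    """Dynamic Programming over a trellis, minimising a total summed cost.

    local_costs      : array (T, N) - cost of each of N candidates at each of T steps
    transition_costs : array (N, N) - cost of moving from candidate i to candidate j

    Returns the best path (list of candidate indices) and its total cost.
    """
    T, N = local_costs.shape
    best = np.zeros((T, N))                    # best total cost of reaching each candidate
    backpointer = np.zeros((T, N), dtype=int)

    best[0] = local_costs[0]
    for t in range(1, T):
        for j in range(N):
            costs = best[t - 1] + transition_costs[:, j] + local_costs[t, j]
            backpointer[t, j] = np.argmin(costs)
            best[t, j] = costs[backpointer[t, j]]

    # Trace back the best path from the cheapest final candidate
    path = [int(np.argmin(best[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointer[t, path[-1]]))
    path.reverse()
    return path, float(best[-1].min())

# Example: 3 steps, 2 candidates per step (numbers invented)
local = np.array([[1.0, 4.0], [2.0, 0.5], [1.0, 3.0]])
trans = np.array([[0.0, 2.0], [2.0, 0.0]])
print(min_cost_path(local, trans))   # ([0, 0, 0], 4.0)
```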
The F0 estimation tool is part of CSTR’s own Speech Tools library and is an implementation of this algorithm:
Y. Medan, E. Yair and D. Chazan, “Super resolution pitch determination of speech signals,” IEEE Transactions on Signal Processing, vol. 39, no. 1, pp. 40-48, Jan. 1991, DOI: 10.1109/78.80763.
with improvements described in:
Paul C. Bagshaw, Steven Hiller and Mervyn A. Jack, “Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching,” in Proc. EUROSPEECH’93, pp. 1003-1006.
At the heart of the algorithm is the same idea as RAPT: the correlation between the waveform and a time-shifted copy of itself (variously called autocorrelation or cross-correlation).
For the purposes of the assignment, you may assume that the algorithm is essentially the same as RAPT, since that is the one taught in the course.
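To show just that core idea (and none of the refinements that make a real tracker like RAPT or the Speech Tools implementation work well), here is a crude autocorrelation-based estimate for a single voiced frame; the test signal is invented:

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Crude autocorrelation F0 estimate for one voiced frame (illustrative only)."""
    frame = frame - np.mean(frame)
    # Correlate the frame with time-shifted copies of itself (zero lag onwards)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]

    lag_min = int(fs / f0_max)        # shortest plausible pitch period, in samples
    lag_max = int(fs / f0_min)        # longest plausible pitch period, in samples
    best_lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / best_lag

# Example: a 100 Hz periodic signal sampled at 16 kHz
fs = 16000
t = np.arange(0, 0.04, 1 / fs)
frame = np.sin(2 * np.pi * 100 * t) + 0.3 * np.sin(2 * np.pi * 200 * t)
print(estimate_f0(frame, fs))        # close to 100.0
```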
The pitchmarking method is indeed based on finding negative-going zero-crossings of the differentiated waveform. Here’s the code if you want to read it (not required!).
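Purely to illustrate the idea (the real pitchmarker does considerably more, e.g. filtering and correcting spurious marks), negative-going zero-crossings of the differentiated waveform can be found like this:

```python
import numpy as np

def negative_going_zero_crossings(x):
    """Indices where the differentiated waveform crosses zero going negative.
    These correspond to local maxima of the original waveform. (Sketch only.)"""
    d = np.diff(x)                       # differentiate the waveform
    # A negative-going zero-crossing: d >= 0 at one sample, d < 0 at the next
    return np.where((d[:-1] >= 0) & (d[1:] < 0))[0] + 1

# Example on a simple 100 Hz sinusoid: the marks fall near the waveform peaks
fs = 16000
t = np.arange(0, 0.02, 1 / fs)
x = np.sin(2 * np.pi * 100 * t)
print(negative_going_zero_crossings(x))
```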