Forum Replies Created
Can you point at the exact video and timestamp for context, so I can give a precise answer?
Yes, it’s possible to think of diphone synthesis and unit selection along a continuum.
At one end is diphone synthesis, in which we have exactly one copy of each diphone type, so there is no need for a selection algorithm, but we will need lots of signal processing to manipulate the recordings.
At the other end is unit selection with an infinitely large database containing all conceivable variants of every possible diphone type. Now, selection from that database becomes the critical step. With perfect selection criteria, we will find such a perfect sequence of units that no signal processing will be required: the units will already have exactly the right acoustic properties for the utterance being synthesised.
Real systems, of course, live somewhere in between. We can’t have an infinite database; even if we could, there are no perfect selection criteria (target cost and join cost). A real system will select a pretty good sequence of units most of the time. A little signal processing might be employed, for example to smooth the joins.
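If it helps to see that written down (a generic sketch of the standard unit selection cost, not any one system’s exact notation): for a target specification t_1, …, t_N and candidate units u_1, …, u_N, the search chooses the unit sequence minimising

C(t_{1:N}, u_{1:N}) = \sum_{i=1}^{N} T(t_i, u_i) + \sum_{i=2}^{N} J(u_{i-1}, u_i)

where T is the target cost and J is the join cost. The two extremes above correspond to having only one candidate per position (so the minimisation is trivial) and to a database so rich that the minimum can be driven to essentially zero.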
First person singular is acceptable for the lab report. The passive is not wrong, but it can create problems: it tends to be more verbose and, much more importantly, it can be ambiguous about who did what.
It’s crucial to be clear about which ideas and work are your own, and which are not. Prioritise that above any stylistic decisions.
There are many ways of measuring the spectral difference between two speech sounds. Formants would be one way, although it can be difficult to estimate them accurately in speech, and they don’t exist for all speech sounds, so we rarely use them. The cepstrum is a way to parameterise the spectral envelope, and we’ll properly define the cepstrum in the upcoming part of Speech Processing about Automatic Speech Recognition.
Cepstral distance would be a good choice for measuring how similar two speech sounds are whilst ignoring their F0. (That’s what we’ll use it for in Automatic Speech Recognition.) So, you are correct that we can use it as part of the join cost in unit selection speech synthesis for measuring how different the spectral envelopes are on either side of a potential join.
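As a simple concrete example of such a measure (one common choice, not the only possibility): if the frames on either side of a candidate join are represented by cepstral coefficient vectors c and c', a Euclidean cepstral distance is

d(c, c') = \sqrt{\sum_{k=1}^{K} (c_k - c'_k)^2}

where K is the number of cepstral coefficients kept. The lower-order coefficients describe the spectral envelope, which is why this distance largely ignores F0.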
Your question about “cepstral distance between the realised acoustics of the chosen phones and the acoustics of the target specification” is straying into material from the Speech Synthesis course, specifically a method known as hybrid unit selection. It’s best to wait for that course, rather than answer the question now.
We’ll be covering the Viterbi algorithm in the next section of Speech Processing, about Automatic Speech Recognition. In Module 7, we will encounter the general algorithm of Dynamic Programming in the method known as Dynamic Time Warping. Later, we will see another form of Dynamic Programming, called the Viterbi algorithm, applied to Hidden Markov Models. So, wait for those parts of the course, then ask your question again.
Your questions about whether to take the maximum or minimum relate to whether we are maximising the total (which is what we would do if it was a probability) or minimising it (which is what we would do if it was a distance).
Regarding your questions about whether to multiply or sum: if we are working with probabilities, we multiply. If we are working with distances or costs, we sum (and in fact, we will end up working with log probabilities, which we will sum).
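The two cases are linked by taking logs (this is just standard algebra, nothing specific to one algorithm): because the logarithm is monotonically increasing,

\arg\max_u \prod_i p_i(u) = \arg\max_u \sum_i \log p_i(u)

so a product of probabilities to be maximised becomes a sum of log probabilities to be maximised; and since we can negate and minimise instead, the problem has exactly the same form as minimising a sum of distances or costs.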
The F0 estimation tool is part of CSTR’s own Speech Tools library and is an implementation of this algorithm:
Y. Medan, E. Yair and D. Chazan, “Super resolution pitch determination of speech signals,” in IEEE Transactions on Signal Processing, vol. 39, no. 1, pp. 40-48, Jan. 1991, DOI: 10.1109/78.80763.
with improvements described in
Paul C. Bagshaw, Steven Hiller and Mervyn A. Jack. “Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching” in Proc. EUROSPEECH’93, pp 1003-1006.
At the heart of the algorithm is the same idea as RAPT: the correlation between the waveform and a time-shifted copy of itself (variously called autocorrelation or cross-correlation).
For the purposes of the assignment, you may assume that the algorithm is essentially the same as RAPT, since that is the one taught in the course.
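For the general idea (a sketch of autocorrelation-based F0 estimation in general, not the exact computation inside the Speech Tools): for a frame of waveform x[n] of length N samples, compute the correlation with a copy of itself shifted by lag \tau,

R(\tau) = \sum_{n=0}^{N-\tau-1} x[n] \, x[n+\tau]

then find the lag \tau^* that maximises R(\tau) within a plausible range of pitch periods, and estimate F_0 = f_s / \tau^*, where f_s is the sampling frequency.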
The pitchmarking method is indeed based on finding negative-going zero-crossings of the differentiated waveform. Here’s the code if you want to read it (not required!).
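In other words (a generic statement of the idea, not the exact code): if d[n] = x[n] - x[n-1] approximates the derivative of the waveform being pitchmarked, then a negative-going zero-crossing is any sample n where

d[n-1] > 0 \text{ and } d[n] \le 0

and each such crossing gives a candidate pitchmark.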
Yes, they are created using the ‘keyword lexicon’ technique invented by Fitt & Isard.
The error is caused by using letter-to-sound rules for the word ‘arugula’. The solution is to write a pronunciation (that does not contain a glottal stop) for this word and add that to my_lexicon.scm.
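For instance, an entry can be added with Festival’s lex.add.entry function. The phone symbols below are purely illustrative; you must use symbols from the phone set of your own voice, and a pronunciation you have checked yourself:

;; format: (WORD PART-OF-SPEECH (SYLLABLES)), each syllable being ((phones) stress)
(lex.add.entry
  '("arugula" nil (((ax) 0) ((r uw) 1) ((g ax) 0) ((l ax) 0))))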
Miscellaneous
There was one suggestion to use Piazza, which is a good tool. The University has a track record of changing the Virtual Learning Environment (VLE) every 5 years or so. The current platform is a tangled mess of many tools, ‘integrated’ in Learn. Critically, a previous VLE change led to the total loss of all past student-contributed discussions for this course. I do not want that to ever happen again, which is why I built speech.zone.
Some things were mentioned that we are already doing: release the videos earlier (they were on speech.zone from the start of the course) and provide lecture recordings (which are on Learn, and mentioned on every class tab within the course on speech.zone).
A better classroom
I wish I could promise that, and especially offer a room with work surfaces for your notebooks and laptops, but I have no control over this. Asking for a different room would probably mean somewhere even more obscure and certainly much further from George Square. (I don’t know what to say to the person who complained the room is too far from Appleton Tower…it’s less than 10 minutes’ walk!)
More guidance on the coursework report
Only a small number of responses mentioned this, asking for a prescribed structure. That was done in Speech Processing, and you are expected to use what you learned there in this course. You should no longer need a prescribed structure to know what makes a good report.
Improve the coursework
There were various valid criticisms of the lab-based assignment. We plan to improve the assignment and would like to introduce an element of state-of-the-art modelling. However, this course has no programming prerequisite, and we need to keep the assignment accessible to all students.
Requiring students to record some original data is very important – it’s something that most people in industry lack experience of – so we will keep this.
Keep doing this
The number of people mentioning each point is given in parentheses.
Videos (14)
Interactive classes including individual/pair/group work (9)
Reading papers in class (4)
Use of diagrams (3)
You just need to load my_lexicon.scm whenever you start Festival to perform synthesis. Simply give it as an argument to Festival:

bash$ festival my_lexicon.scm

which tells Festival to load (and execute) the contents of my_lexicon.scm immediately after it starts.
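Equivalently, if Festival is already running, you can load it at the interactive prompt using the standard Scheme load function:

festival> (load "my_lexicon.scm")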