Forum Replies Created
Yes, the most likely cause is an incorrect sampling rate for the waveforms. See this topic.
You shouldn’t edit the main dictionary, but you can override a pronunciation using the command lex.add.entry in the addendum, which is stored in my_lexicon.scm in the assignment.

The filtering can be done directly in the time domain. It’s easiest to describe and understand filtering in the frequency domain, but the filter can be implemented as a direct operation on the speech waveform samples.
Filter design is an entire subject on its own and out of scope. But we can understand one very simple form of low-pass filter that is easy to implement: a moving average. If we take a moving average of a speech waveform, that will smooth out the smaller details (i.e., remove the higher frequencies).
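As an illustration only (not taken from the assignment code; the window length below is an arbitrary choice), a moving-average low-pass filter is just a few lines of numpy:

import numpy as np

def moving_average(x, window_length=41):
    # replace each sample by the mean of the surrounding window_length samples;
    # a longer window removes more of the higher frequencies
    kernel = np.ones(window_length) / window_length
    return np.convolve(x, kernel, mode="same")  # "same" keeps the original length

# example: smooth a noisy 100 Hz sine sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t) + 0.3 * np.random.randn(fs)
y = moving_average(x)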
The time offset is because the main peak in the speech waveform may not align with the peak that we found in the low-pass-filtered version. That could be for two separate reasons. The first is to do with the phases of the many different harmonic frequencies making up the speech signal. The second is the phase response of the low-pass filter (e.g., the filter introduces a time delay).
I think a visual inspection (plot both distributions, suitably normalised, on the same axes) would suffice for the purposes of the Speech Synthesis assignment.
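For example (a purely illustrative sketch; the variable names and placeholder data below are made up), the two distributions could be overlaid like this:

import numpy as np
import matplotlib.pyplot as plt

# placeholder data standing in for the two sets of values being compared
values_a = np.random.normal(120, 20, 1000)
values_b = np.random.normal(125, 15, 1000)

# density=True normalises each histogram, so the two are directly comparable
plt.hist(values_a, bins=50, density=True, alpha=0.5, label="distribution A")
plt.hist(values_b, bins=50, density=True, alpha=0.5, label="distribution B")
plt.legend()
plt.show()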
To record sound, we measure deviations above or below mean (resting) air pressure using a microphone [*]. The vertical axis on an audio waveform plot corresponds to air pressure. That is why waveform samples during silence have values around zero.
Our idealised impulse train is a waveform where all samples have a value of exactly zero, except for one sample per period which has a positive value. This is the simplest possible signal that contains energy at all multiples of the fundamental frequency.
As you have realised, this idealised signal is not physically possible – a real signal does indeed need to also have regions of below-mean-air-pressure. (It does not need to be symmetric though, so we don’t need “negative impulses” to balance the positive ones.)
[* Actually, microphones vary in precisely what they measure: pressure, pressure gradient across a diaphragm, velocity, …etc. This subtlety is not important to understand here.]
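Purely as an illustration (the sampling rate and fundamental frequency below are arbitrary choices), the idealised impulse train can be constructed and inspected like this:

import numpy as np

fs = 16000   # sampling rate in Hz
f0 = 100     # fundamental frequency in Hz
x = np.zeros(fs)                 # one second of exactly-zero samples
period = int(fs / f0)
x[::period] = 1.0                # one positive-valued sample per period

# the magnitude spectrum has energy at f0 and at all of its multiples
spectrum = np.abs(np.fft.rfft(x))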
First, let’s determine whether that is the original distribution of the corpus – a little digging shows that it’s not. From the paper we eventually find the v2 update, which has this license. You can read that to check whether it allows your intended use (it does).
In the class for Module 4 – the database – you can think more about whether this dataset is suitable (e.g., is it going to be easy to read out loud? will the text present normalisation issues?) and, if it is suitable, how to select a subset for recording.
The problem should now be solved.
Technical details: videos were being served from the subdomain media.speech.zone, which was not covered by the speech.zone SSL certificate. Chrome is stricter than other browsers and refused to load anything from that subdomain. Videos are now served from the primary domain.

Good catch! Not sure what happened there, but the first video (“Why? When? Which aspects?”) is now re-instated. There are 4 videos in Module 5.
The forward algorithm computes P(O|W) correctly by summing all terms. The Viterbi algorithm computes an approximation of P(O|W) by finding only the largest term (= most likely path) in the sum.
Token Passing uses a much smaller data structure: the HMM (= a finite state model) itself, which is one “column” (= all model states at a particular time) of the lattice.
So, Token Passing is equivalent to working with the lattice whilst only ever needing one column of it in memory at any given time.
Token Passing is a time-synchronous algorithm – all paths (= tokens) are extended forwards in time at once.
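As a minimal illustration (toy numbers, not from any real model), here are the two recursions side by side; the only difference is a sum versus a max, which is why the Viterbi result is a lower bound on the forward result:

import numpy as np

pi = np.array([1.0, 0.0])          # initial state probabilities
A = np.array([[0.6, 0.4],          # transition probabilities
              [0.0, 1.0]])
B = np.array([[0.7, 0.3],          # probability of each observation
              [0.2, 0.8]])         # symbol in each state
O = [0, 1, 1]                      # an observation sequence

def forward(O):
    # sums over all state sequences: the exact P(O | model)
    alpha = pi * B[:, O[0]]
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def viterbi(O):
    # keeps only the most likely state sequence: an approximation
    delta = pi * B[:, O[0]]
    for o in O[1:]:
        delta = (delta[:, None] * A).max(axis=0) * B[:, o]
    return delta.max()

print(forward(O), viterbi(O))      # forward >= viterbi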
[Everything below this point is beyond the scope of Speech Processing]
There are non-time-synchronous algorithms. Working on the full lattice would allow paths to be extended over states, or over time, in many different ways. When combined with beam pruning, a clever ordering of the search can lead to doing fewer computations overall. This becomes important for large vocabulary connected speech recognition (LVCSR).
But we then also have the problem that the lattice is too big to construct in memory, so we create only the parts of it that we are going to search. Historical footnote: my Masters dissertation was an implementation of a stack decoder performing A* search; this avoids constructing the lattice in memory, whilst searching it in “best first” order.
In HTK, HVite does Token Passing, which becomes inefficient for LVCSR. For LVCSR, HDecode is much more sophisticated and efficient.

We could go from uniform segmentation to Baum-Welch, skipping Viterbi training. In fact, we could even go from the prototype model directly to Baum-Welch.
Baum-Welch is an iterative algorithm that gradually changes the model parameters to maximise the likelihood of the training data. The only proof we have for this algorithm is that each iteration increases (in fact, does not decrease) the likelihood of the training data. There is no way to know whether the final model parameters are globally-optimal.
This type of algorithm is sensitive to the initial model parameters. The final model we get could be different, depending on where we start from. This is also true for Viterbi training.
So, we use uniform segmentation to get a much better model than the prototype (which has zero mean and unit variance for all Gaussians in all states). Starting from this model should give a better final model than starting from the prototype.
The model from uniform segmentation is used as the initial model for Viterbi training, which in turn provides the initial model for Baum-Welch.
Another reason to perform training in these phases is that Viterbi training is faster than Baum-Welch and will get us quite close to the final model. This reduces the number of iterations of Baum-Welch that are needed.
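As a sketch of the flow only (the function names below are hypothetical and their bodies are placeholders; they stand in for the real tools that perform each step), the point is simply that each phase’s output model becomes the next phase’s initial model:

def prototype_model():
    # in reality: zero mean and unit variance for all Gaussians in all states
    return {"means": 0.0, "variances": 1.0}

def uniform_segmentation_init(model, data):
    # in reality: divide each training utterance equally among the states
    # and estimate each state's parameters from its share of the frames
    return model

def viterbi_training(model, data):
    # in reality: align using the single most likely state sequence, re-estimate, repeat
    return model

def baum_welch(model, data):
    # in reality: iterate until the training-data likelihood stops improving
    return model

data = None                                      # placeholder for the training data
model = prototype_model()
model = uniform_segmentation_init(model, data)   # a much better starting point
model = viterbi_training(model, data)            # fast; gets close to the final model
model = baum_welch(model, data)                  # fewer iterations needed from a good start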
The Viterbi algorithm is not a greedy algorithm. It performs a global optimisation and guarantees to find the most likely state sequence, by exploring all possible state sequences.
An example of a greedy algorithm is the one for training a CART. Unlike the Viterbi algorithm, this algorithm does not explore all possibilities: it takes a sequence of local, hard decisions (about the best question to split the data) and does not backtrack. It does not guarantee to find the globally-optimal tree.
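A small runnable sketch of that greedy behaviour (the data, questions and entropy-based split criterion are all made up for illustration): at each node the single locally best question is chosen, the data are split, and the decision is never revisited.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def grow_tree(items, questions):
    # items: list of (features, label); questions: list of predicates on features
    labels = [lab for _, lab in items]
    if len(set(labels)) == 1 or not questions:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority label
    def gain(q):
        yes = [lab for f, lab in items if q(f)]
        no = [lab for f, lab in items if not q(f)]
        if not yes or not no:
            return -1.0
        return entropy(labels) - (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(items)
    best = max(questions, key=gain)                   # local, hard decision
    yes_items = [(f, lab) for f, lab in items if best(f)]
    no_items = [(f, lab) for f, lab in items if not best(f)]
    if not yes_items or not no_items:
        return Counter(labels).most_common(1)[0][0]
    # no backtracking: the chosen split is final
    return (best, grow_tree(yes_items, questions), grow_tree(no_items, questions))

# toy usage with two made-up binary features
items = [({"vowel": True, "voiced": True}, "V"),
         ({"vowel": False, "voiced": True}, "C"),
         ({"vowel": False, "voiced": False}, "C")]
questions = [lambda f: f["vowel"], lambda f: f["voiced"]]
tree = grow_tree(items, questions)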
Orthogonal means that the correlation between any pair of basis functions is zero. This property is necessary for there to be a unique set of coefficients when we analyse a given signal using a set of basis functions.
Think of it as “there is no amount of one basis function contained in any other one”, or as “if we analyse a signal that happens to be the same as one of the basis functions, that coefficient and that coefficient only will be non-zero”.
Yes, the similarity between a basis function and the signal being analysed is computed by multiplying them sample-by-sample and summing up.
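A small numerical illustration, using cosine basis functions as an example: the sample-by-sample multiply-and-sum is just a dot product, which is zero between two different basis functions and non-zero between a basis function and itself.

import numpy as np

N = 64
n = np.arange(N)
b1 = np.cos(2 * np.pi * 1 * n / N)   # basis function: one cycle per analysis window
b2 = np.cos(2 * np.pi * 2 * n / N)   # basis function: two cycles per analysis window

def similarity(x, y):
    # multiply sample-by-sample and sum
    return np.sum(x * y)

print(similarity(b1, b2))            # (numerically) zero: the pair is orthogonal
print(similarity(b1, b1))            # non-zero

# analysing a signal that happens to equal one basis function:
# only that basis function's coefficient is non-zero
signal = b2
print(similarity(signal, b1), similarity(signal, b2))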
The tapered window is applied to the extracted pitch period units. You can think of it as being applied as we overlap-add them back together. Each extracted unit, and therefore its tapered window, is centred on a pitch mark (= an epoch).
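As a much-simplified sketch of the idea only (not the full algorithm; the sampling rate, epoch spacing and unit length below are arbitrary, and the epochs are assumed to be well away from the signal edges): each unit is cut out around an epoch, multiplied by a tapered window centred on that epoch, and the windowed units are overlap-added back together.

import numpy as np

def windowed_unit(x, epoch, half_length):
    # cut out roughly two pitch periods centred on the epoch (pitch mark)
    # and apply a tapered (Hanning) window, also centred on the epoch
    unit = x[epoch - half_length : epoch + half_length]
    return unit * np.hanning(len(unit))

def overlap_add(units, epochs, length, half_length):
    # place each windowed unit at its (possibly shifted) epoch and sum
    y = np.zeros(length)
    for unit, epoch in zip(units, epochs):
        y[epoch - half_length : epoch + half_length] += unit
    return y

# toy usage: epochs every 160 samples (100 Hz at a 16 kHz sampling rate)
fs = 16000
x = np.random.randn(fs)                     # placeholder waveform
half = 160                                  # about one pitch period either side
epochs = list(range(2 * half, fs - 2 * half, 160))
units = [windowed_unit(x, e, half) for e in epochs]
y = overlap_add(units, epochs, len(x), half)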