Forum Replies Created
Building synthetic voices from spontaneous speech is an area of active research.
Although we might be able to gather a lot of spontaneous speech, one barrier is that we then have to manually transcribe it. The second barrier is that it is hard to align the phonetic sequence with the speech; this is for many of the same reasons that Automatic Speech Recognition of such speech is hard (you list some of them: disfluencies, co-articulations, deletions,…).
The hypothesised advantage of using spontaneous speech, over read text, is that the voice would sound more natural.
You put your finger on the core theoretical problem though: without a good model of the variation in spontaneous speech (including a predictive model of that variation given only text input), it is indeed just unwanted noise in the database.
If we want a voice that speaks in a single speaking style (or emotion), then we can simply record data in that style and build the voice as usual. That will work very well, but will not scale to producing many different styles / emotions / etc.
Can you each try to specify more precisely what you mean by ‘expression’ before I continue my answer? Is it a property of whole utterances, or parts of utterances, for example?
In many systems, I think a mixture of explicit and implicit labels is used.
Read what Taylor has to say about intuitive versus analytic labelling (sections 17.1.4 to 17.1.6).
When you mention hand-labelling POS tags, I think a better idea is to hand-label some training data, then train a POS tagger, and use that to label your actual database. This is what is done in practice.
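If it helps to see that bootstrapping idea as code, here is a minimal sketch (the data structures are my assumptions, and a real tagger would of course use context rather than just per-word counts): hand-label a small set of sentences, train on them, then automatically label the rest of the database.

from collections import Counter, defaultdict

# hand_labelled: a small set of sentences you tagged by hand,
# e.g. [[("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")], ...]
def train_most_frequent_tag(hand_labelled):
    counts = defaultdict(Counter)
    for sentence in hand_labelled:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # for each word seen in training, remember its most frequent tag
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_sentence(words, model, default="NOUN"):
    # back off to a default tag for words never seen in the hand-labelled data
    return [(word, model.get(word.lower(), default)) for word in words]

# usage: automatically label the (much larger) unlabelled database
# model = train_most_frequent_tag(hand_labelled)
# tagged_database = [tag_sentence(words, model) for words in unlabelled_database]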
Can you suggest which explicit labels you would like to place on the database?
Labelling prosody on the database is one of those topics that has a long history of research, but no really good solutions: it’s a very hard problem.
We are fairly sure that highly-accurate ToBI labels are helpful, provided that we have them for both the database utterances and test sentences. So, even if we hand-label the database, we still have the hard problem of accurately predicting from text at synthesis time, in a way that is consistent with the database labels.
Yes, many people have looked at simpler systems than ToBI. Festival reduces the number of boundary strength levels, for example. Your suggestion to train a model on a hand-labelled subset of the data and use that to label the rest of the database is excellent: this is indeed what people do. But there remains the “predicting from text” problem at synthesis time.
One simpler approach is to think just about prominences and boundaries.
Perhaps a more promising approach these days is to label the database with more sophisticated linguistic information than plain old POS tags, such as shallow syntactic and semantic structure.
We certainly do need to train models of full and reduced vowels before we can use forced alignment to choose which one is the most likely label to place on a particular segment in the database.
We’ll look at how to train those models in the lecture.
It’s certainly the case in unit selection that there are many versions that will sound as good as the one chosen via the target and join costs. Actually, there will very probably be many that sound better, but were not the lowest cost sequence in the search (why is that?).
It’s easy in principle to generate an n-best list during a Viterbi search (although this is not implemented in Festival).
Here’s an idea for how you might generate variants from your own unit selection voice without modifying any code:
- Synthesise the sentence, and examine the utterance structure to see which prompts from the database were used
- Remove one or more (maybe all) of those prompts from utts.data (a sketch of this step follows the list)
- Restart Festival
- Synthesise the sentence again: different units will be chosen
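For the second step, here is a minimal Python sketch of filtering utts.data, assuming you have already collected the IDs of the prompts that were used; the file names, and the assumption that each line of utts.data has the form ( utterance_id "prompt text" ), are mine.

# hypothetical IDs of the prompts used for this sentence,
# found by examining the utterance structure
used = {"arctic_a0123", "arctic_b0456"}

with open("utts.data") as infile, open("utts.data.reduced", "w") as outfile:
    for line in infile:
        fields = line.split()
        # each line looks like: ( arctic_a0123 "some prompt text" )
        utt_id = fields[1] if len(fields) > 1 else ""
        if utt_id not in used:
            outfile.write(line)

Then point the voice at utts.data.reduced (or back up utts.data and overwrite it), restart Festival, and synthesise the sentence again.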
Sure – that would be fine.
In general, I don’t think people use Natural Language Generation (NLG) for this, mainly because NLG systems are typically limited domain, and so will only generate a closed set of sentences (or at least, from a closed vocabulary).
The vast majority of missing diphones will be cross-word (why is that?). So, all you would really need to do is find word pairs that contain the required diphone. However, you would want these to occur in a reasonably natural sentence, so that they can be used in the same way as the other prompts (i.e., recorded and used in their entirety).
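As a toy illustration of the cross-word point, here is a sketch that finds word pairs whose junction produces a required diphone; the miniature lexicon is purely hypothetical, and in practice you would use your full pronunciation lexicon and then look for (or write) natural sentences containing those pairs.

# toy pronunciation lexicon: word -> phone sequence (purely illustrative)
lexicon = {
    "big":    ["b", "ih", "g"],
    "yellow": ["y", "eh", "l", "ow"],
    "zebra":  ["z", "iy", "b", "r", "ax"],
    "vogue":  ["v", "ow", "g"],
}

missing_diphone = ("g", "y")   # suppose the g-y diphone is missing from the database

# the cross-word diphone of a word pair is
# (last phone of the first word, first phone of the second word)
pairs = [
    (w1, w2)
    for w1, phones1 in lexicon.items()
    for w2, phones2 in lexicon.items()
    if w1 != w2 and (phones1[-1], phones2[0]) == missing_diphone
]
print(pairs)   # [('big', 'yellow'), ('vogue', 'yellow')]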
You might need to cut a string on a separator, keeping only some parts of it. There are lots of ways to do that. The built-in cut command is one way (you can also pass it a file, in which case it will perform the same operation on every line). The pipe “|” sends the output of one process to the input of the next.
$ # -c cuts using character positions
$ echo some_file.txt | cut -c6-9
file
$ # -d cuts using the delimiter you specify
$ echo some_file.txt | cut -d"_" -f1
some
$ # and -f specifies which field(s) you want to keep
$ echo some_file.txt | cut -d"_" -f2
file.txt
$ echo a_long_file_name.txt | cut -d"_" -f2-4
long_file_name.txt
I’ve clarified my response: removing the question sentences entirely seems to be preferable to keeping them but removing their question marks.
We’ll look at a more detailed example of greedy text selection in the lecture.
Your suggestion to normalise for the length of the sentence is a good idea, otherwise we might just select the longest sentences (because they contain more diphones than shorter sentences).
You make a good point about final total coverage: 100% might be impossible simply because there are no occurrences of certain very rare diphones in our large corpus. The ARCTIC corpus covers around 75-80% of all possible diphones. The initial large corpus contained at least one example of about 90% of all possible diphones (reducing to around 80% when discarding sentences that are not “nice”), so that would be a ceiling on the possible coverage that could ever be obtained.
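Here is a minimal sketch of greedy selection with that length normalisation, assuming each candidate sentence has already been converted to its diphone sequence; the data structures and stopping criterion are just illustrative.

def greedy_select(sentences, wanted_coverage=0.99):
    """sentences: list of (text, diphone_list) pairs; returns selected texts and coverage."""
    # the set of diphone types that occur anywhere in the corpus is the ceiling
    # on coverage: very rare diphones may simply never occur at all
    all_types = {d for _, diphones in sentences for d in diphones}
    covered, selected = set(), []
    remaining = list(sentences)
    while remaining and len(covered) < wanted_coverage * len(all_types):
        # score = number of *new* diphone types, normalised by sentence length,
        # so that we do not just keep picking the longest sentences
        def score(item):
            _, diphones = item
            return len(set(diphones) - covered) / len(diphones)
        best = max(remaining, key=score)
        if score(best) == 0:
            break                      # nothing left adds any new coverage
        selected.append(best[0])
        covered |= set(best[1])
        remaining.remove(best)
    return selected, len(covered) / len(all_types)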
A training algorithm is used to train a model on some data. Give me more context to your question and I’ll provide a more specific answer.
We’ll do a more detailed example in the lecture.
I agree that this is a somewhat strange design decision in the ARCTIC corpora. In the tech report, the authors don’t justify this decision, but I assume it is because questions are too sparse to attempt coverage of them, and because the features used in their text selection algorithm don’t capture the differences between statements and questions.
Your suggestion to remove sentences that are questions from the corpus entirely, rather than keep them without a question mark, seems sensible to me.
Your descriptions of IFF and ASF are correct. You are also right to say that the acoustic features in ASF are predicted from the same linguistic features used in IFF.
The key point to understand is that many different combinations of linguistic features can give rise to (nearly) the same acoustic features. So, sparsity might be less of a problem in the ASF case.
In other words, we don’t really need to find a candidate unit that has the same linguistic features as the target; we just need it to sound like it has the same linguistic features.
However, for an ASF target cost to work well, we need to
- predict the acoustic features accurately from the linguistic features
- measure distances in acoustic space in a way that correlates with perception
Neither of those is trivial. Your phrase “direct mappings” suggests these mappings are easy to learn: they are not.
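To make the shape of this concrete, here is a sketch of an ASF target cost: predict acoustic features for the target from its linguistic features, then measure a weighted distance to each candidate’s actual acoustic features. The particular features, weights and predictor are assumptions for illustration; point 1 is deliberately left unimplemented, because learning that mapping is the hard part, and point 2 is exactly the question of whether this distance correlates with perception.

import math

# hypothetical per-unit acoustic features: mean F0 (Hz), duration (s), energy
FEATURE_WEIGHTS = {"f0": 1.0, "duration": 2.0, "energy": 0.5}

def predict_acoustic_features(linguistic_features):
    # stand-in for a trained regression model (e.g. a regression tree or neural
    # network) mapping linguistic features to acoustic features
    raise NotImplementedError("this mapping has to be learned, and that is not easy")

def asf_target_cost(predicted, candidate):
    # weighted Euclidean distance between predicted and candidate acoustic features
    return math.sqrt(sum(
        w * (predicted[k] - candidate[k]) ** 2
        for k, w in FEATURE_WEIGHTS.items()
    ))

# usage sketch:
# predicted = predict_acoustic_features(target_linguistic_features)
# costs = [asf_target_cost(predicted, c) for c in candidate_acoustic_features]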
I think it’s one of those terms that linguists use so frequently, they forget to define it carefully. First we need to know what a phrase is. In the context of speech, we mean the prosodic phrase. This short sentence has a single prosodic phrase when spoken:
“The cat sat on the mat.”
and this one has two:
“The cat sat on the mat, and the dog ran round the tree.”
Phrase-final means the last word, syllable or phone in a prosodic phrase. It’s important because special things happen in phrase-final position: syllables become longer, and F0 often lowers (in statements), for example.