Jurafsky & Martin – Chapter 8
- This topic has 22 replies, 11 voices, and was last updated 5 years, 9 months ago by Simon.
-
September 24, 2016 at 18:41 #4933
Speech Synthesis
-
October 2, 2016 at 18:24 #5059
Jurafsky and Martin write that where homographs cannot be resolved by part of speech, they are simply ignored in TTS systems (page 291). However, would it not be possible to create rules which could resolve such cases in most, or at least some, instances? For example, they discuss the word ‘bass’, which cannot be distinguished by the word class it belongs to. But you could write a rule saying that if the word is followed by ‘guitar’, the /b ey s/ pronunciation should be used, and presumably there are some cases where there would then be no confusion about which pronunciation to use. Does such a rule resolve too few cases for it to be worth including in the set of rules?
-
October 3, 2016 at 15:17 #5063
The “rule” you suggest is performing Word Sense Disambiguation (in the case of “bass” at least). So, the general solution is to add a Word Sense Disambiguation module to the front-end of the system. I’ll add this to the list of possible topics for the next lecture.
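To make the idea concrete, a rule of the kind suggested above might look something like the sketch below; the trigger words and pronunciations are illustrative assumptions, not taken from any real system.

```python
# Hypothetical homograph table: for each spelling, a default pronunciation
# plus context words that trigger the alternative one.
HOMOGRAPHS = {
    "bass": {
        "default": "b ae s",              # the fish
        "alternative": "b ey s",          # the instrument / low voice
        "triggers": {"guitar", "player", "drum", "line"},
    },
}

def disambiguate(word, next_word):
    """Pick a pronunciation for a homograph, using the following word as context."""
    entry = HOMOGRAPHS.get(word.lower())
    if entry is None:
        return None                       # not a homograph we handle
    if next_word and next_word.lower() in entry["triggers"]:
        return entry["alternative"]
    return entry["default"]

print(disambiguate("bass", "guitar"))     # b ey s
print(disambiguate("bass", "fishing"))    # b ae s
```

A full Word Sense Disambiguation module generalises this by learning which context words (or other features) predict each sense, rather than relying on hand-written trigger lists.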
-
-
October 2, 2016 at 21:01 #5060
J&M section 8.2 (Grapheme-to-phoneme conversion) describes the pronunciation of unseen words/names using context-based probabilities of particular phonemes; the same section contrasts pronouncing unseen names with pronouncing other unseen words, and says that the pronunciation of unseen names can be deduced using features of foreign languages. Why couldn’t this same principle be applied to unseen words? Further, since the language of origin for unseen names likely cannot be determined from the context of the sentence, how would the correct language be determined?
-
October 5, 2016 at 13:19 #5177
The same techniques are used for unseen names and for other types of unseen words – for example a classification tree. But, in some systems, separate classifiers are used in the two cases.
The classifier for names might use additional features, provided by some earlier stage in the pipeline. For example, a prediction (“guess”) at which foreign language the word originates from.
This prediction would come from a language classifier that would itself need to be trained in a supervised manner from labelled data, such as a large list of words tagged with language of origin. This classifier might use features derived from the sequence of letters in the word, or even simply letter frequency, which differs between languages.
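As a rough illustration of the letter-frequency idea (the training data, features and model choice here are my own assumptions, not a description of any particular system), a language-of-origin classifier might look like this:

```python
from collections import Counter
import math

def letter_ngrams(word, n=2):
    """Letter n-gram features, e.g. 'perez' -> ['pe', 'er', 're', 'ez']."""
    w = word.lower()
    return [w[i:i + n] for i in range(len(w) - n + 1)]

class LanguageOfOriginClassifier:
    """Naive Bayes over letter bigrams, trained on words labelled with their language."""

    def __init__(self):
        self.counts = {}   # language -> Counter of n-grams
        self.totals = {}   # language -> total n-gram count

    def train(self, labelled_words):
        for word, lang in labelled_words:
            self.counts.setdefault(lang, Counter()).update(letter_ngrams(word))
        self.totals = {lang: sum(c.values()) for lang, c in self.counts.items()}

    def classify(self, word):
        best_lang, best_score = None, -math.inf
        for lang, c in self.counts.items():
            vocab = len(c)
            # add-one smoothing so unseen n-grams do not zero out the score
            score = sum(math.log((c[g] + 1) / (self.totals[lang] + vocab))
                        for g in letter_ngrams(word))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

clf = LanguageOfOriginClassifier()
clf.train([("jimenez", "spanish"), ("fernandez", "spanish"), ("martinez", "spanish"),
           ("smith", "english"), ("whittaker", "english"), ("johnson", "english")])
print(clf.classify("menendez"))   # "spanish" with this toy training data
```

The predicted language can then be passed downstream as an extra feature for the name-specific letter-to-sound classifier.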
-
October 8, 2016 at 10:01 #5262
I am not quite sure I understand the difference between ToBI and the Tilt model (8.3.4). Could you please explain it in a more simplified way?
-
October 8, 2016 at 10:23 #5263
ToBI is symbolic – all prosodic events (accents or boundary tones) are categorised into a set of classes (the most common accent being H*, which is a rise-fall shape).
Tilt is parametric and represents intonation events as a small set of continuous parameters that describe the shape.
ToBI is suitable for hand-labelling prosody (although this is a highly skilled task). Tilt is not designed for hand-labelling: it is for converting the F0 contours of events (which might be themselves manually or automatically found) into numbers which we can then model.
The material on predicting prosody shows how prediction can be done in several stages. We might use ToBI for the categories in the first stages (placement of events, classification of their types) and perhaps then use Tilt as the parametric representation of shape for realisation as a fragment of F0 contour.
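To illustrate what “parametric” means here, below is a minimal sketch of a Tilt-style parameterisation of a single event; treat the exact formulae as an assumption to be checked against Taylor’s original papers rather than a definitive statement.

```python
def tilt_parameters(a_rise, a_fall, d_rise, d_fall):
    """Collapse the rise and fall of one intonation event into Tilt-style
    parameters: overall amplitude, overall duration, and a single 'tilt'
    value in [-1, +1] (-1 = pure fall, 0 = equal rise and fall, +1 = pure rise)."""
    amplitude = abs(a_rise) + abs(a_fall)    # total F0 excursion (Hz)
    duration = d_rise + d_fall               # total event duration (s)
    tilt_amp = (abs(a_rise) - abs(a_fall)) / amplitude if amplitude else 0.0
    tilt_dur = (d_rise - d_fall) / duration if duration else 0.0
    tilt = 0.5 * (tilt_amp + tilt_dur)
    return amplitude, duration, tilt

# A symmetrical rise-fall accent (roughly what ToBI would label H*)
print(tilt_parameters(a_rise=30.0, a_fall=-30.0, d_rise=0.12, d_fall=0.12))
# -> (60.0, 0.24, 0.0): equal rise and fall gives a tilt of zero
```

The key contrast with ToBI is that these are continuous numbers we can model and regenerate, rather than a symbolic category such as H*.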
It’s important to understand that prosody is very much an unsolved problem in TTS, and so everything you read on prosody should be seen in that light. There is no single method that “just works” and each system will probably take a different approach.
Try searching for ToBI in the forums to find related topics.
-
-
October 11, 2016 at 18:20 #5418
In chapter 8.3.2, it says there are four levels of prominence: emphatic accent, pitch accent, unaccented and reduced. Here I understand those four terms as just four ways to emphasise (or ‘de-emphasise’) part of an utterance.
However, in the ToBI model, ‘pitch accent’ appears again as a set of classes like H* and L*. Is ‘pitch accent’ here the same as that in the four levels of prominence? If so, what makes it unique to be chosen as a set of classes in ToBI? If not, what does it mean respectively in the two different contexts?
And another trivial question: what is an IP-initial accent in chapter 8.3.6?
Thanks!
-
October 12, 2016 at 11:42 #5457
ToBI is a description of the shape of intonation events (i.e., small fragments of F0 contour). We could make a syllable sound more prominent using one of several different shapes of F0 contour; the most obvious choice is a simple rise-fall (H*) but other shapes can also add prominence.
ToBI does also attempt to associate a function with some accent types (e.g., L* for “surprise” in Figure 8.10). But, many people (including me) are sceptical about this functional aspect of ToBI, because there really isn’t a simple mapping between shapes of F0 contours and the underlying meaning.
“IP” means “intonation phrase”, as described in 8.3.1. So an “IP-initial accent” is the first accent in an intonation phrase.
-
-
October 12, 2016 at 11:23 #5455
In Figure 8.10 in 8.3.4, I noticed that the positions of the two labels H* and L* for pitch accent are different within the word “marianna”, and they seem to sit at very specific points in the waveform. Does this imply that ToBI assigns the pitch accent labels to phones instead of words? If so, on what criteria does the model base its choice of phone?
Another question is about the accent ratio model in 8.3.2. I am a bit confused about the condition on k/N. Why does B(k,N,0.5) need to be ≤ 0.05? And why does the model use 0.5 to differentiate accent ratio?
Thank you!
-
October 12, 2016 at 11:48 #5458
ToBI associates accents with words, but in fact intonation events align with syllables. Accents align with a particular syllable in the word (usually one with lexical stress) and their precise timing (earlier or later) can also matter.
In the accent ratio model, the ≤0.05 is saying “no more than 5%” and is a statistical significance test. It is there so that the model only makes predictions in cases where it is confident (e.g., because it has seen enough examples in the training data).
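To make the numbers concrete, here is a minimal sketch of the accent ratio calculation as defined in 8.3.2; using scipy’s binomial test is my own choice, not something specified in the book.

```python
from scipy.stats import binomtest

def accent_ratio(k, n):
    """Accent ratio of a word: k/n if that ratio is significantly different
    from 0.5 under a binomial test (p <= 0.05), otherwise 0.5.
    k = number of times the word was accented, n = total occurrences."""
    p_value = binomtest(k, n, 0.5).pvalue
    return k / n if p_value <= 0.05 else 0.5

print(accent_ratio(2, 100))   # rarely accented and plenty of data -> 0.02
print(accent_ratio(3, 5))     # too few examples to be confident -> 0.5
```

So 0.5 is simply the neutral fallback value used whenever the data do not give us confidence that the word’s accenting behaviour really differs from chance.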
(in future, please can you split each question into a separate post – it makes the forums easier to read and search)
-
-
October 16, 2016 at 20:30 #5480
In section 8.4, J&M talk about pitch marking and pitch tracking. Instinctively, I associate these with intonation and stress, but this might be inaccurate. So, what is the relation, if any, of pitch marking and tracking to stress and intonation patterns?
-
October 16, 2016 at 20:40 #5482
Pitch marking means finding the instants of glottal closure. Pitch marks are moments in time. The interval of time between two pitch marks is the pitch period, denoted as T0, which of course is equal to 1 / F0.
You might think that pitch marking would be the best way to find F0. However, it’s actually not, because pitch marking is hard to do accurately and will give a lot of local error in the estimate for F0.
Pitch marking is useful for signal processing, such as TD-PSOLA.
Pitch tracking is a procedure to find the value of F0, as it varies over time.
Pitch tracking is done over longer windows (i.e., multiple pitch periods) to get better accuracy, and can take advantage of the continuity of F0 in order to get a more robust and error-free estimate of its value.
Pitch tracking is useful for visualising F0, analysing intonation, and building models of it.
Exactly how pitch marking and pitch tracking work is beyond the scope of the Speech Processing course, but is covered in the more advanced Speech Synthesis course.
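Purely as an illustration of why analysing a longer window helps (and not a description of how any particular pitch tracker actually works), here is a minimal autocorrelation-based F0 estimate for a single frame:

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=75.0, f0_max=400.0):
    """Estimate F0 (Hz) for one frame of speech via autocorrelation.
    Returns None if no clear periodicity is found (e.g. unvoiced speech)."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f0_max)                    # shortest period we will accept
    lag_max = min(int(fs / f0_min), len(ac) - 1)  # longest period we will accept
    if lag_max <= lag_min or ac[0] <= 0:
        return None
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    if ac[lag] < 0.3 * ac[0]:                     # weak peak -> treat as unvoiced
        return None
    return fs / lag
```

A practical pitch tracker does considerably more than this, in particular imposing continuity on the F0 track across frames to remove octave errors.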
-
-
October 12, 2018 at 20:59 #9415
In Chapter 8.2 (page 294), the book explains a letter-to-phone alignment algorithm. I understand this is applied to the training data (as well as the test data) for a probabilistic g2p algorithm. However, the book doesn’t describe this particular algorithm in detail.
How does the algorithm find all alignments between the pronunciation and the spelling (which conform to the allowable phones)? Could you provide a concrete example of this?
-
October 13, 2018 at 18:31 #9418
Letter-to-phone alignment is needed when preparing the data for training the letter-to-sound model such as a classification tree. This is because letter-to-sound is a sequence-to-sequence problem, but a classification tree only deals with fixed-length input and output. We therefore do a common ‘trick’ of sliding a fixed length window along the sequence of predictors (which are the letters, in this case).
One way to find the alignment is by using Dynamic Programming, which searches for the most likely alignment between the two sequences. We will define (by hand) a simple cost function which (for example) gives a lower cost to alignments between letters that are vowels and phonemes that are vowels, and likewise for consonants. Or, the cost function could list, for every letter, the phonemes that it is allowed to align with.
Dynamic Programming is coming up later in the course – we’ll first encounter it in the Dynamic Time Warping (DTW) method for speech recognition. I suggest waiting until we get there, and then revisiting this topic to see if you can work out how to apply Dynamic Programming to this problem.
I’ll leave one hint here for you to come back to: in DTW, we create a grid and place the template along one axis and the unknown word along the other. For letter-to-phoneme alignment, we would place the letters along one axis and the phonemes along the other.
Post a follow-up later in the course if you need more help.
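For anyone coming back to this after the DTW material, the sketch below shows the idea, assuming a hand-written vowel/consonant cost function; the phone symbols, costs and example are illustrative, not taken from the book.

```python
# Hypothetical letter and phone classes; a real system would instead list,
# for every letter, the phones it is allowed to align with ("allowables").
VOWEL_LETTERS = set("aeiouy")
VOWEL_PHONES = {"aa", "ae", "ah", "ao", "eh", "ih", "iy", "uh", "uw", "ey", "ay", "ow"}

def cost(letter, phone):
    """Hand-written cost: cheap when letter and phone are both vowels or
    both consonants, expensive otherwise; fixed cost for epsilon alignment."""
    if letter is None or phone is None:       # letter or phone aligned to nothing
        return 1.0
    same_class = (letter in VOWEL_LETTERS) == (phone in VOWEL_PHONES)
    return 0.0 if same_class else 2.0

def align_cost(letters, phones):
    """Dynamic Programming over a grid with letters on one axis and phones
    on the other, returning the cost of the best alignment."""
    n, m = len(letters), len(phones)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + cost(letters[i - 1], None)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + cost(None, phones[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + cost(letters[i - 1], phones[j - 1]),  # align pair
                D[i - 1][j] + cost(letters[i - 1], None),               # letter -> epsilon
                D[i][j - 1] + cost(None, phones[j - 1]),                # phone -> epsilon
            )
    return D[n][m]   # a backtrace through D recovers the alignment itself

print(align_cost("cake", ["k", "ey", "k"]))   # final 'e' aligns to epsilon
```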
-
-
October 16, 2018 at 08:06 #9423
In section 8.5, page 312, when discussing target cost, the arguments of the vector(?) representation of the target cost, T, are initially given as (St, Ut), where the former refers to the target specification and the latter to the potential unit.
The index on U somehow changes to (St[p], Uj[p]) later, when talking about the subcost with respect to the feature specification of the diphone. The equation in 8.20 also uses Uj, so I wanted to ask whether this difference means something or is (potentially) a printing mistake, because I don’t understand why the indices of the target specification and the potential unit shouldn’t be the same.
-
October 19, 2018 at 13:27 #9447
The difference between t and j is meaningful, but Taylor forgot to spell it out explicitly.
In the equations where the candidate unit, u, is indexed by t (meaning time as an integer counting through the target sequence, not time measured in seconds), he is referring to the unit selected at that time to be used for synthesis.
In the equations where u is indexed using another variable, j, he is using j to index all the available candidates of that unit type, of which one will be selected.
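As a hedged sketch of where the candidate index j fits (the feature names and weights below are invented for illustration, not taken from the book):

```python
def target_cost(target_spec, candidate, weights):
    """T(s_t, u_j): weighted sum of per-feature subcosts between the target
    specification s_t and one candidate u_j of the required unit type."""
    cost = 0.0
    for feature, weight in weights.items():
        # simplest possible subcost: 0 if the feature matches, 1 otherwise
        cost += weight * (0.0 if target_spec[feature] == candidate[feature] else 1.0)
    return cost

def best_candidate(target_spec, candidates, weights):
    """Choose, from all candidates u_j of this unit type, the one selected for time t."""
    return min(candidates, key=lambda u: target_cost(target_spec, u, weights))

weights = {"stress": 2.0, "phrase_final": 1.0}                    # assumed weights
target = {"stress": "stressed", "phrase_final": False}
candidates = [{"stress": "unstressed", "phrase_final": False},
              {"stress": "stressed", "phrase_final": True}]
print(best_candidate(target, candidates, weights))                # the second candidate
```

In a real system the winning candidate at each time is of course not chosen independently like this, but jointly with the join cost via a Viterbi search over the whole sequence.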
-
-
October 28, 2018 at 19:43 #9500
On page 292, Figure 8.6, I don’t understand what information the numbers provide (CMU Pronouncing Dictionary). (See picture number 1 attached)
Later there is an example from the UNISYN dictionary. What do the symbols mean? What information do they provide? (*, .>, ~, etc.) (See picture number 2 attached)
Finally, on page 293, equation 8.11 shows a rule from Allen et al. (1987) … I have no idea how to read that expression. (Picture number 3)
-
November 5, 2018 at 14:25 #9556
J&M shouldn’t really have included examples from UNISYN without explaining the notation, which is much more sophisticated than other dictionaries. You don’t really need to know this level of detail, but if you are interested then the notation is explained in the UNISYN manual:
Curly brackets {} surround free morphemes, left angle brackets << enclose prefixes, right angle brackets >> enclose suffixes, and equals signs == join bound morphemes or relatively unproductive affixes, for example ‘ade’ in ‘blockade’.
-
November 5, 2018 at 14:29 #9558
Again, including an example from Allen et al. without explaining the notation is not very helpful of J&M. I also would not know how to read that rule without reading the Allen paper in full. I think J&M are just making the point that stress assignment is complex, and showing us an esoteric rule as evidence of this.
-
-
October 30, 2018 at 03:16 #9515
I realise that, for my first question, the numbers in the CMU dict mark the stress on the syllable.
Still unsure about the other two pictures.
-
October 5, 2016 at 19:33 #5200
How does a TTS program tackle the pronunciation of words and names foreign to the intended language, such as “Jiménez” for English, which are not readily available in the dictionary? Is it purely through voice-specific rules or some other front-end process?
-
October 6, 2016 at 07:47 #5202
Anything that is not in the pronunciation dictionary will have to be dealt with by the Letter-To-Sound (LTS) model.
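Schematically, the front-end decision is just a dictionary lookup with a fallback; the function and variable names below are hypothetical, and the LTS model itself would be a trained classifier such as the classification tree discussed earlier in this thread.

```python
# Tiny illustrative lexicon; a real one (e.g. CMUdict or UNISYN) has over 100k entries.
LEXICON = {
    "hello": ["hh", "ax", "l", "ow"],
    "world": ["w", "er", "l", "d"],
}

def letter_to_sound(word):
    """Placeholder for a trained LTS model, e.g. a classification tree that
    predicts one phone per letter from a window of surrounding letters."""
    raise NotImplementedError("train an LTS model and call it here")

def pronounce(word):
    """Dictionary lookup first; fall back to the LTS model for unseen words."""
    entry = LEXICON.get(word.lower())
    if entry is not None:
        return entry
    return letter_to_sound(word.lower())   # e.g. "jiménez" would end up here
```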
[I’m merging this into “Jurafsky & Martin – Chapter 8” where a related question about pronunciation of names has been asked previously]
-
-