Holmes & Holmes – Speech Synthesis and Recognition – Chapter 8
- This topic has 9 replies, 5 voices, and was last updated 4 years, 8 months ago by Simon.
October 25, 2016 at 14:35 #5595
Template matching and dynamic time warping
-
October 28, 2018 at 00:21 #9498
In Speech Synthesis and Recognition, page 124 (continuous speech recognition), the caption of Figure 8.9 says that three template sequences are being considered: T1-T3-T1-T3, T1-T3-T3-T1 and T1-T1-T1-T1. However, judging from the illustration, I read the three sequences as T1-T3-T1-T2, T1-T3-T3-T2 and T1-T3-T3-T3. Could you please explain how to read this trace-back chart correctly? Thanks!
-
November 5, 2018 at 14:42 #9559
There is an error in the caption of your version (is this the ebook?). The caption in my hardcopy correctly lists the paths, which are:
T1 – T3 – T3 – T3
T1 – T3 – T3 – T2
T1 – T3 – T1 – T2
The take-home message from this reading is that working with these data structures is not necessarily the easiest way to understand (or even to implement in practice) dynamic programming. We will shortly see a much more elegant approach called Token Passing.
-
-
October 30, 2018 at 17:14 #9517
Page 111:
“Because the excitation periodicity is evident in the amplitude variations of the output from a broadband analysis, it is also necessary to apply some time-smoothing to remove it. Such time-smoothing will also remove most of the fluctuations that result from randomness in turbulent excitation.”
Because there is no diagram that accompanies this explanation, I don’t fully understand how the excitation periodicity is visible (or what it appears as) when performing broadband analysis. I assume that broadband analysis produces a spectrum from which various frequencies present in the signal are observable. So, is ‘excitation periodicity’ the same thing as F0? I would also like to clarify what ‘time-smoothing’ is.
Page 112:
“Use of filter-bank power directly gives most weight to more intense regions of the spectrum, where a change of 2 or 3 dB will represent a very large absolute difference.”
Why is this the case? What is meant by filter-bank power? Does this ‘power’ refer to some kind of power spectrum generated from filter-bank analysis?
-
November 5, 2019 at 18:53 #10152
Same question as #9517.
I’m confused about what “excitation periodicity” is and what “resolve the harmonics of the fundamental of voiced speech” means, as in “It is best, therefore, to make the bandwidth of the spectral resolution such that it will not resolve the harmonics of the fundamental of voiced speech.” Could you elaborate on them a bit? I have to say there are many areas in this book that are very confusing…
-
November 6, 2019 at 09:38 #10157
The filter-bank is performing feature extraction from the speech signal. Even though this chapter is now outdated, filter-bank features are back in use for Automatic Speech Recognition (ASR).
“excitation periodicity” refers to the nature of the vocal fold excitation.
In the time domain, this means the waveform is periodic and its energy will fluctuate over time. Holmes & Holmes say that it
“is also necessary to apply some time-smoothing to remove it”
but in fact all this really means in practice is that we need to use a sufficiently long analysis frame (with a duration of, say, at least 2 pitch periods) for short-term analysis. A typical analysis frame duration would be 25 ms or 1/40th of a second.
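As a rough illustration of that rule of thumb, the sketch below counts how many pitch periods fit inside one analysis frame. The 16 kHz sampling rate, the example F0 values, and the function names are assumptions chosen for illustration; they do not come from Holmes & Holmes.

```python
def pitch_periods_in_frame(frame_ms: float, f0_hz: float) -> float:
    """Number of pitch periods spanned by a frame of the given duration."""
    return frame_ms * f0_hz / 1000.0


def frame_length_samples(frame_ms: float, sample_rate_hz: int) -> int:
    """Frame duration converted to a whole number of samples."""
    return round(frame_ms * sample_rate_hz / 1000)


if __name__ == "__main__":
    frame_ms = 25.0          # typical frame duration mentioned above
    sample_rate_hz = 16000   # assumed sampling rate
    print("samples per frame:", frame_length_samples(frame_ms, sample_rate_hz))
    for f0_hz in (80.0, 120.0, 220.0):  # assumed example F0 values
        periods = pitch_periods_in_frame(frame_ms, f0_hz)
        print(f"F0 = {f0_hz:5.1f} Hz -> {periods:.2f} pitch periods per frame")
```

With these assumed values, even a low-pitched voice at 80 Hz still has two full pitch periods inside a 25 ms frame, so the energy measured per frame no longer follows the individual glottal pulses.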
In the frequency domain, the periodic sound source means that there is harmonic structure in the spectrum.
For ASR, we generally do not want to capture any information about F0 in our features, and so the filters in the filter-bank need to have large enough bandwidths to avoid resolving the harmonics. That is, each filter needs to be wider than, say, twice F0. You could think of this as “blurring” the spectrum (like an out-of-focus photograph) so that we can only see the overall shape and cannot make out the fine detail of the harmonics.
Holmes & Holmes describe this as
“It is best, therefore, to make the bandwidth of the spectral resolution such that it will not resolve the harmonics of the fundamental of voiced speech.”
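A toy sketch of what "resolving the harmonics" means is given below. It assumes a highly idealised situation that is not in the book: a voiced sound with F0 = 100 Hz whose harmonics all have unit power, and rectangular filters that simply sum the harmonics falling inside their passband. The function names are hypothetical.

```python
def filter_output(centre_hz: float, bandwidth_hz: float,
                  f0_hz: float, max_hz: float = 4000.0) -> int:
    """Total power passed by an ideal rectangular filter, assuming every
    harmonic of f0_hz up to max_hz has unit power."""
    low = centre_hz - bandwidth_hz / 2.0
    high = centre_hz + bandwidth_hz / 2.0
    total = 0
    k = 1
    while k * f0_hz <= max_hz:
        if low <= k * f0_hz < high:
            total += 1
        k += 1
    return total


def sweep(bandwidth_hz: float, f0_hz: float = 100.0) -> list:
    """Filter output as the centre frequency moves from 500 Hz to 1490 Hz."""
    return [filter_output(c, bandwidth_hz, f0_hz) for c in range(500, 1500, 10)]


if __name__ == "__main__":
    narrow = sweep(50.0)    # narrower than F0: individual harmonics show up
    wide = sweep(250.0)     # wider than twice F0: harmonics are blurred
    print("narrow filter outputs:", sorted(set(narrow)))
    print("wide filter outputs:  ", sorted(set(wide)))
```

In this toy setting the narrow filter's output swings between "a harmonic is present" and "nothing at all" as its centre frequency moves, whereas the wide filter always contains two or three harmonics, so its output mostly reflects the overall spectral level rather than the position of individual harmonics.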
Module 7 covers feature engineering in more depth and we’ll see that some further processing of the filter-bank outputs can improve things. For example, we will take the log of each filter’s output power.
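To connect this with the page 112 quote about a change of 2 or 3 dB, here is a small worked example. The power values (1 and 1000, in arbitrary linear units) and the function names are assumptions for illustration only.

```python
import math


def db_to_power_ratio(change_db: float) -> float:
    """Convert a change in dB to the corresponding ratio of powers."""
    return 10.0 ** (change_db / 10.0)


def compare_change(power: float, change_db: float):
    """Apply a change of change_db to a linear power value and return
    (absolute difference in linear power, difference in dB)."""
    changed = power * db_to_power_ratio(change_db)
    abs_diff = changed - power
    log_diff_db = 10.0 * math.log10(changed) - 10.0 * math.log10(power)
    return abs_diff, log_diff_db


if __name__ == "__main__":
    for power in (1.0, 1000.0):  # assumed quiet and intense channel powers
        abs_diff, log_diff_db = compare_change(power, 3.0)
        print(f"power {power:7.1f}: linear difference {abs_diff:8.3f}, "
              f"log difference {log_diff_db:.2f} dB")
```

The same 3 dB change produces a linear difference that is 1000 times larger in the intense channel than in the quiet one, so a distance computed on raw power is dominated by the intense regions. After taking the log, both channels contribute the same 3 dB.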
-
-
November 15, 2019 at 21:32 #10243
Page 112, end of 2nd paragraph: “It can be seen from the frequency scales that the channels are closer together in the lower-frequency regions.” But I think Figure 8.1 shows the opposite? That is, the lower frequencies seem to be more separated on the scale. I have attached an annotated copy of Figure 8.1.
-
November 15, 2019 at 22:49 #10245
The figure uses a non-linear frequency scale on the vertical axis.
The lower frequency channels are closer together on a linear frequency scale and this is what the figure shows.
Compare the 3 channels covering the lower frequency range of 0 Hz to 1 kHz (which is a range of 1 kHz) with the 3 channels covering the higher frequency range from 3 kHz to 5 kHz (which is a range of 2 kHz). The ones in the lower frequency range are closer together (i.e., there are more “channels per kHz”) than those in the higher range.
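The "channels per kHz" comparison is simple arithmetic, sketched here with the channel counts and ranges quoted above. The function name is hypothetical.

```python
def channels_per_khz(n_channels: int, low_hz: float, high_hz: float) -> float:
    """Average channel density over a frequency range, measured on a
    linear frequency scale."""
    return n_channels / ((high_hz - low_hz) / 1000.0)


if __name__ == "__main__":
    low_band = channels_per_khz(3, 0.0, 1000.0)      # 3 channels in 0-1 kHz
    high_band = channels_per_khz(3, 3000.0, 5000.0)  # 3 channels in 3-5 kHz
    print(f"0-1 kHz: {low_band} channels per kHz")
    print(f"3-5 kHz: {high_band} channels per kHz")
```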
-
December 8, 2019 at 20:56 #10474
regarding section 8.7:
I don’t understand why we need to double the distance contribution for diagonal paths (2d(i,j)) when using a symmetrical algorithm to deal with timescale differences (page 117), and why it is no longer needed if we alternatively use slopes to disallow vertical paths (page 118).
-
December 8, 2019 at 21:43 #10481
There are many variants on path weightings. We don’t need to get lost in those details – just aim to understand dynamic programming and its application to aligning two sequences of differing lengths.
The reason for the double weighting on diagonal paths is that the alternative route between the same two points consists of one horizontal step and one vertical step, and so accumulates two local distances. Without the doubling, the diagonal step would accumulate only one, so the comparison would be “unfair”.
This problem of correctly weighting paths arises because there is no modelling of duration. In an HMM, there is a (very simple) duration model: the transition probabilities.
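A minimal sketch of the symmetric weighting is given below. It is one common way to write the recursion, not necessarily the exact formulation in Holmes & Holmes: the scalar "features", the absolute-difference local distance, the boundary handling, and the name `dtw_cost` are all assumptions made for illustration.

```python
def dtw_cost(template, test, diagonal_weight: float = 2.0) -> float:
    """Accumulated cost of the best alignment between two sequences of numbers.

    Horizontal and vertical steps add the local distance d(i, j) once;
    diagonal steps add diagonal_weight * d(i, j).
    """
    inf = float("inf")
    n, m = len(template), len(test)
    # cost[i][j] = best accumulated cost of aligning the first i template
    # frames with the first j test frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - test[j - 1])  # assumed local distance
            cost[i][j] = min(
                cost[i - 1][j] + d,                        # vertical step
                cost[i][j - 1] + d,                        # horizontal step
                cost[i - 1][j - 1] + diagonal_weight * d,  # diagonal step
            )
    return cost[n][m]


if __name__ == "__main__":
    template = [0.0, 0.0]
    test = [1.0, 1.0]  # every local distance is exactly 1
    print("diagonal weight 2:", dtw_cost(template, test, 2.0))
    print("diagonal weight 1:", dtw_cost(template, test, 1.0))
```

In this toy case every local distance is 1. With a diagonal weight of 2, the all-diagonal path and the paths that replace the second diagonal with a horizontal plus a vertical step all cost 4, so none is favoured just for its shape. With a weight of 1, the all-diagonal path costs 2 against 3 for the others, so diagonal paths win simply because they contain fewer steps.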
-
-