Taylor writes on page 516 that linguistic features are ‘high-level’ while acoustic features are ‘low-level’. What exactly does this mean? Are ‘high-level’ and ‘low-level’ meant in a programming sense, e.g. Python is relatively high-level and C is relatively low-level? If not, please correct my understanding.
Taken from page 489 of this reading: “A further complication arises because the unit-selection system isn’t a generative model in the normal probabilistic sense. Rather it is a hybrid model that uses a function to select units (which is not problematic) but then just concatenates the actual units from the database, rather than generating the units from a parameterised model. This part is problematic for a maximum-likelihood type of approach. The problem arises because, if we try to synthesise an utterance in the database, the unit-selection system should find those actual units in the utterance and use them. Exact matches for the whole specification will be found and all the target and join costs will be zero. The result is that the synthesized sentence in every case is identical to the database sentence.”
I am confused – what exactly is the problem that Taylor is talking about? Is it:
1. Because the unit-selection system is not a generative model, trying to synthesise a sentence that is actually one of the original sentences in the database will result in the system simply selecting all of the units from that database utterance, leading to zero costs;
or
2. Because the unit-selection system is not a generative model, trying to synthesise a sentence that is actually one of the original sentences in the database should result in the system simply selecting all of the units from that database utterance, but this does not happen. (See the sketch after this list for how I read option 1.)
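To make my first interpretation concrete, here is a toy sketch of the cost computation as I understand it. This is entirely my own invention, not Taylor’s system: the Unit class and both cost functions are made up, but they show why an exact database match drives every target and join cost to zero.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    utterance: int        # which database utterance the unit came from
    index: int            # position within that utterance
    features: tuple       # linguistic feature values (toy)
    pitch: float = 100.0  # toy acoustic property compared at joins

def target_cost(spec, unit):
    # Toy subcost: number of mismatched linguistic features.
    return sum(s != u for s, u in zip(spec, unit.features))

def join_cost(left, right):
    # Units that were contiguous in the database join for free:
    # no concatenation artefact is possible there.
    if left.utterance == right.utterance and left.index + 1 == right.index:
        return 0.0
    return abs(left.pitch - right.pitch)  # toy acoustic mismatch

def total_cost(specs, units):
    cost = sum(target_cost(s, u) for s, u in zip(specs, units))
    cost += sum(join_cost(a, b) for a, b in zip(units, units[1:]))
    return cost

# If the target sentence is already in the database, its original
# contiguous units match every feature and every join exactly:
specs = [("ax",), ("b",), ("aw",)]
originals = [Unit(utterance=0, index=i, features=f) for i, f in enumerate(specs)]
assert total_cost(specs, originals) == 0.0
```

If this is right, the selected units are the original recording itself, so the “synthesised” sentence is a bit-for-bit copy of the database sentence, which is what I take interpretation 1 to mean.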
I did a Jurafsky and Martin reading (chapter 9.3) before this, and it used a diagram from this chapter (page 334, Figure 9.14).
However, in this reading the magnitude spectrum and log magnitude spectrum appear to be swapped (page 354, Figure 12.11; compare 12.11a and 12.11b with 9.14a and 9.14b from the previous reading).
So which one is correct and which one is wrong? Which one is the normal spectrum and which one is its logged version?
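In case it helps, here is how I would check which panel is which for myself; a minimal numpy sketch (my own, not from either book):

```python
import numpy as np

# Magnitude spectrum vs. log magnitude spectrum of a short frame.
# The logged version compresses the dynamic range, so low-amplitude
# detail (weaker harmonics, noise floor) becomes visible; the linear
# version is dominated by a few strong peaks.
fs = 16000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 200 * t) + 0.01 * np.sin(2 * np.pi * 3000 * t)

magnitude = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
log_magnitude = 20 * np.log10(magnitude + 1e-12)  # in dB; avoid log(0)

# In the linear spectrum the 3 kHz component is ~100x smaller than the
# 200 Hz one and barely visible; in dB it sits only ~40 dB lower.
```

If I have this right, the dB version is the one where detail remains visible across the whole frequency range, and the linear version is the spiky one with a few dominant peaks.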
Because the excitation periodicity is evident in the amplitude variations of the output from a broadband analysis, it is also necessary to apply some time-smoothing to remove it. Such time-smoothing will also remove most of the fluctuations that result from randomness in turbulent excitation.
Because no diagram accompanies this explanation, I don’t fully understand how the excitation periodicity is visible (or what it appears as) when performing broadband analysis. I assume that broadband analysis produces a spectrum from which the various frequencies present in the signal are observable. So, is ‘excitation periodicity’ the same thing as F0? I would also like to clarify what ‘time-smoothing’ is.
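My current reading of the passage, as a toy sketch (the signal, window lengths, and smoothing kernel are all invented by me, not taken from the book): with a short (“broadband”) analysis window, the energy of each analysis frame rises and falls once per glottal cycle, i.e. at F0, and “time-smoothing” is just low-pass filtering that energy track over time.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
f0 = 120.0

# Crude glottal-pulse-like excitation: an impulse train at F0,
# shaped by a short window standing in for the vocal tract.
excitation = (np.mod(t * f0, 1.0) < f0 / fs).astype(float)
signal = np.convolve(excitation, np.hanning(30), mode="same")

frame_len = 40  # 5 ms: shorter than one pitch period (~8.3 ms at 120 Hz)
energy = np.array([np.sum(signal[i:i + frame_len] ** 2)
                   for i in range(0, len(signal) - frame_len, frame_len // 2)])
# 'energy' pulses at F0 -- my guess at what "excitation periodicity" means.

kernel = np.ones(8) / 8  # ~20 ms moving average = "time-smoothing"?
smoothed = np.convolve(energy, kernel, mode="same")
# 'smoothed' keeps slow envelope changes but removes the F0 ripple
# (and most fluctuations from turbulent/noise excitation too).
```

Is that roughly the right picture?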
Page 112:
Use of filter-bank power directly gives most weight to more intense regions of the spectrum, where a change of 2 or 3 dB will represent a very large absolute difference.
Why is this the case? What is meant by ‘filter-bank power’? Does this ‘power’ refer to some kind of power spectrum generated from filter-bank analysis?
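Here is the arithmetic as I understand it (my own worked example, using only the standard dB definition): a dB step is relative, so +3 dB always roughly doubles the power, and the absolute size of that change therefore depends entirely on where in the spectrum you start.

```python
# The same 3 dB change at two different levels.
def db_to_power(db, ref=1.0):
    return ref * 10 ** (db / 10)

strong = db_to_power(60)  # intense region of the spectrum
weak = db_to_power(20)    # low-energy region

delta_strong = db_to_power(63) - strong  # ~1e6 absolute power units
delta_weak = db_to_power(23) - weak      # ~100 absolute power units
print(delta_strong / delta_weak)         # 10000x larger for the same 3 dB
```

If that is right, it would explain the quoted sentence: a distance computed directly on raw filter-bank power is dominated by the most intense spectral regions, even when the perceptual change (in dB) is the same everywhere.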
Hi Fabian, I don’t know whether you also happen to take Phonetics and Lab Phonology, a separate course from Speech Processing. Last week in class there was a very similar question about the differences between the various perceptual scales and why some are preferred over others in certain situations.
Our lecturer Bert provided us with a very informative paper: an experimental evaluation of the different perceptual scales that illustrates the differences between them and the experimental purposes each might be used for.
I found that really helpful and I thought it might help you too, so I have uploaded it here.
In section 8.5, page 312, when discussing target cost, the arguments of the (vector?) representation of the target cost, T, are initially given as (S_t, U_t), where the former refers to the target specification and the latter to the candidate unit.
The index on U then changes to (S_t[p], U_j[p]) when discussing the subcost with respect to the feature specification of the diphone. Equation 8.20 also uses U_j, so I wanted to ask whether this difference means something or is (potentially) a printing mistake, because I don’t understand why the indices on the target specification and the candidate unit shouldn’t be the same.
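My current guess, as a sketch (an assumption on my part, not Taylor’s code): t indexes the position in the target sequence while j ranges over all candidate units in the database, so the subscripts differ on purpose. The total target cost would then be a weighted sum of per-feature subcosts T_p over the feature index p, in the spirit of equation 8.20.

```python
# S_t, U_j: dicts mapping feature name p -> feature value.
# subcosts: dict mapping p -> a function comparing the two values.
def target_cost(S_t, U_j, weights, subcosts):
    return sum(w_p * subcosts[p](S_t[p], U_j[p])
               for p, w_p in weights.items())

def stress_mismatch(s, u):
    # Toy subcost for one feature: 0 if it matches, 1 if not.
    return 0.0 if s == u else 1.0

cost = target_cost({"stress": 1}, {"stress": 0},
                   weights={"stress": 2.0},
                   subcosts={"stress": stress_mismatch})
print(cost)  # 2.0: one mismatched feature, weight 2
```

On that reading, (S_t, U_t) on page 312 would be the odd one out, since comparing the target at t against candidate j is exactly the point of the search. But I may have this backwards, hence the question.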
On page 15 of this reading (page 40 of the book), Taylor seems to distinguish between ‘suprasegmental “prosody”’, ‘affective prosody’ and ‘augmentative prosody’ when describing the complete prosody generation model. I know this may be a little outside the scope of this course, but what exactly is the difference between these three? Previously, I had assumed that all prosodic features were simply suprasegmental.
Hi, I just stumbled upon this and would like to ask whether it is the answer to why this and this are essentially the same thing but appear different (because of Wavesurfer).
Quick question: on page 778 of this reading, the caption for Figure 20.12 reads: ‘the third panel shows the first 13 values of the DCT of each column of the Mel spectrogram…[continued]’.
I would like to confirm whether this ‘DCT’ refers to the Discrete Cosine Transform. If so, do I need to know more about why and how it is applied to the Mel spectrogram? Also, how does the DCT differ from the Fourier Transform (and its variants) such that it is used on the Mel spectrogram instead?
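For reference, here is how I would reproduce the step the caption seems to describe, assuming it is the usual MFCC recipe (this is my assumption, not a quote of Jurafsky & Martin’s code; the array here is dummy data):

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
log_mel = rng.random((40, 100))  # 40 Mel bands x 100 frames (dummy data)

# axis=0: take the DCT-II of each column (one frame's Mel spectrum)
# independently, then keep only the first 13 coefficients per frame.
mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:13, :]
print(mfcc.shape)  # (13, 100)
```

As I understand it, the DCT is used here rather than a Fourier Transform because the log Mel spectrum is real-valued (so no complex transform is needed) and the DCT packs most of the spectral-envelope information into the first few coefficients, which is why only 13 are kept; I would appreciate confirmation of that.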
Hello! I have a question about the material discussed on pages 61–64, specifically the references to Fig. 5.1. I am struggling to see how the diagram shows what the accompanying text says it does.
I’m not sure how the diagram shows the composition of the complex wave or its preferred resonances; I was under the impression that the one on the top is a representation of the complex wave itself, while the bottom spectrum shows the frequencies of the source’s resonances.
The text says the diagram shows that the source responds best to frequencies around 1000 Hz and hardly responds to anything above 3000 Hz, but the peak of the spectrum of the sound wave seems to be around 4000 Hz, and that value is not mentioned at all. In general, I just didn’t know where to look or what each detail of the diagram is supposed to reflect. I could make assumptions based on the shape of the diagram and its alignment, but I thought it would be best to clarify so I can gain a better understanding.
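In case it helps to separate the two ideas, here is a toy numeric illustration of the difference between a wave’s spectrum and a resonator’s response curve (all numbers invented by me; this is not the figure, just the general source-vs-response distinction):

```python
import numpy as np

freqs = np.linspace(0, 5000, 501)

# Toy response curve: strongest response near 1000 Hz,
# almost no response above 3000 Hz.
response = 1.0 / (1.0 + ((freqs - 1000) / 600) ** 2)

# Toy input spectrum whose own peak happens to be near 4000 Hz.
source = np.exp(-((freqs - 4000) / 800) ** 2) + 0.3

# What comes out is the pointwise product of the two.
output = source * response
print(freqs[np.argmax(output)])  # close to 1000 Hz, not 4000 Hz
```

That is, a peak at 4000 Hz in the wave’s own spectrum would not contradict the system responding best near 1000 Hz; but whether that is actually what the two panels of Fig. 5.1 show is exactly what I am hoping someone can confirm.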