Taylor writes on page 516 that linguistic features are ‘high-level’ while acoustic features are ‘low-level’. What exactly does this mean? Are ‘high-level’ and ‘low-level’ meant in a programming sense, e.g. Python is relatively high-level and C is relatively low-level? If not, please correct my understanding.
Taken from page 489 of this reading: “A further complication arises because the unit-selection system isn’t a generative model in the normal probabilistic sense. Rather it is a hybrid model that uses a function to select units (which is not problematic) but then just concatenates the actual units from the database, rather than generating the units from a parameterised model. This part is problematic for a maximum-likelihood type of approach. The problem arises because, if we try to synthesise an utterance in the database, the unit-selection system should find those actual units in the utterance and use them. Exact matches for the whole specification will be found and all the target and join costs will be zero. The result is that the synthesized sentence in every case is identical to the database sentence.”
I am confused – what exactly is the problem that Taylor is talking about? Is it:
1. Because the unit-selection system is not a generative model, trying to synthesise a sentence that is actually one of the original sentences in the database will result in the system simply selecting all of the units from that database utterance, leading to zero costs;
or
2. Because the unit-selection system is not a generative model, trying to synthesise a sentence that is actually one of the original sentences in the database should result in the system simply selecting all of the units from that database utterance, but this does not happen. (See the sketch after this list for how I read option 1.)
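To make my first interpretation concrete, here is a toy sketch of the cost computation as I understand it. This is entirely my own invention, not Taylor’s system: the Unit class and both cost functions are made up, but they show why an exact database match drives every target and join cost to zero.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    utterance: int        # which database utterance the unit came from
    index: int            # position within that utterance
    features: tuple       # linguistic feature values (toy)
    pitch: float = 100.0  # toy acoustic property compared at joins

def target_cost(spec, unit):
    # Toy subcost: number of mismatched linguistic features.
    return sum(s != u for s, u in zip(spec, unit.features))

def join_cost(left, right):
    # Units that were contiguous in the database join for free:
    # no concatenation artefact is possible there.
    if left.utterance == right.utterance and left.index + 1 == right.index:
        return 0.0
    return abs(left.pitch - right.pitch)  # toy acoustic mismatch

def total_cost(specs, units):
    cost = sum(target_cost(s, u) for s, u in zip(specs, units))
    cost += sum(join_cost(a, b) for a, b in zip(units, units[1:]))
    return cost

# If the target sentence is already in the database, its original
# contiguous units match every feature and every join exactly:
specs = [("ax",), ("b",), ("aw",)]
originals = [Unit(utterance=0, index=i, features=f) for i, f in enumerate(specs)]
assert total_cost(specs, originals) == 0.0
```

If this is right, the selected units are the original recording itself, so the “synthesised” sentence is a bit-for-bit copy of the database sentence, which is what I take interpretation 1 to mean.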
I did a Jurafsky and Martin reading (chapter 9.3) before this, and it used a diagram from this chapter (page 334, Figure 9.14).
However, in this reading the magnitude spectrum and log magnitude spectrum appear to be swapped (page 354, Figure 12.11; compare 12.11a and 12.11b with 9.14a and 9.14b from the previous reading).
So which one is correct and which one is wrong? Which one is the normal spectrum and which one is its logged version?
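In case it helps, here is how I would check which panel is which for myself; a minimal numpy sketch (my own, not from either book):

```python
import numpy as np

# Magnitude spectrum vs. log magnitude spectrum of a short frame.
# The logged version compresses the dynamic range, so low-amplitude
# detail (weaker harmonics, noise floor) becomes visible; the linear
# version is dominated by a few strong peaks.
fs = 16000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 200 * t) + 0.01 * np.sin(2 * np.pi * 3000 * t)

magnitude = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
log_magnitude = 20 * np.log10(magnitude + 1e-12)  # in dB; avoid log(0)

# In the linear spectrum the 3 kHz component is ~100x smaller than the
# 200 Hz one and barely visible; in dB it sits only ~40 dB lower.
```

If I have this right, the dB version is the one where detail remains visible across the whole frequency range, and the linear version is the spiky one with a few dominant peaks.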
Because the excitation periodicity is evident in the amplitude variations of the output from a broadband analysis, it is also necessary to apply some time-smoothing to remove it. Such time-smoothing will also remove most of the fluctuations that result from randomness in turbulent excitation.
Because no diagram accompanies this explanation, I don’t fully understand how the excitation periodicity is visible (or what it appears as) when performing broadband analysis. I assume that broadband analysis produces a spectrum from which the various frequencies present in the signal are observable. So, is ‘excitation periodicity’ the same thing as F0? I would also like to clarify what ‘time-smoothing’ is.
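My current reading of the passage, as a toy sketch (the signal, window lengths, and smoothing kernel are all invented by me, not taken from the book): with a short (“broadband”) analysis window, the energy of each analysis frame rises and falls once per glottal cycle, i.e. at F0, and “time-smoothing” is just low-pass filtering that energy track over time.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
f0 = 120.0

# Crude glottal-pulse-like excitation: an impulse train at F0,
# shaped by a short window standing in for the vocal tract.
excitation = (np.mod(t * f0, 1.0) < f0 / fs).astype(float)
signal = np.convolve(excitation, np.hanning(30), mode="same")

frame_len = 40  # 5 ms: shorter than one pitch period (~8.3 ms at 120 Hz)
energy = np.array([np.sum(signal[i:i + frame_len] ** 2)
                   for i in range(0, len(signal) - frame_len, frame_len // 2)])
# 'energy' pulses at F0 -- my guess at what "excitation periodicity" means.

kernel = np.ones(8) / 8  # ~20 ms moving average = "time-smoothing"?
smoothed = np.convolve(energy, kernel, mode="same")
# 'smoothed' keeps slow envelope changes but removes the F0 ripple
# (and most fluctuations from turbulent/noise excitation too).
```

Is that roughly the right picture?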
Page 112:
Use of filter-bank power directly gives most weight to more intense regions of the spectrum, where a change of 2 or 3 dB will represent a very large absolute difference.
Why is this the case? What is meant by ‘filter-bank power’? Does this ‘power’ refer to some kind of power spectrum generated from filter-bank analysis?
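Here is the arithmetic as I understand it (my own worked example, using only the standard dB definition): a dB step is relative, so +3 dB always roughly doubles the power, and the absolute size of that change therefore depends entirely on where in the spectrum you start.

```python
# The same 3 dB change at two different levels.
def db_to_power(db, ref=1.0):
    return ref * 10 ** (db / 10)

strong = db_to_power(60)  # intense region of the spectrum
weak = db_to_power(20)    # low-energy region

delta_strong = db_to_power(63) - strong  # ~1e6 absolute power units
delta_weak = db_to_power(23) - weak      # ~100 absolute power units
print(delta_strong / delta_weak)         # 10000x larger for the same 3 dB
```

If that is right, it would explain the quoted sentence: a distance computed directly on raw filter-bank power is dominated by the most intense spectral regions, even when the perceptual change (in dB) is the same everywhere.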
Hi Fabian, I don’t know whether you also happen to take Phonetics and Lab Phonology, a separate course from Speech Processing. Last week in class there was a very similar question about the differences between the various perceptual scales and why some are preferred over others in certain situations.
Our lecturer Bert provided us with a very informative paper: an experimental evaluation of the different perceptual scales that illustrates the differences between them and the experimental purposes each might be used for.
I found that really helpful and I thought it might help you too, so I have uploaded it here.
In section 8.5, page 312, when discussing target cost, the arguments of the (vector?) representation of the target cost, T, are initially given as (S_t, U_t), where the former refers to the target specification and the latter to the candidate unit.
The index on U then changes to (S_t[p], U_j[p]) when discussing the subcost with respect to the feature specification of the diphone. Equation 8.20 also uses U_j, so I wanted to ask whether this difference means something or is (potentially) a printing mistake, because I don’t understand why the indices on the target specification and the candidate unit shouldn’t be the same.
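My current guess, as a sketch (an assumption on my part, not Taylor’s code): t indexes the position in the target sequence while j ranges over all candidate units in the database, so the subscripts differ on purpose. The total target cost would then be a weighted sum of per-feature subcosts T_p over the feature index p, in the spirit of equation 8.20.

```python
# S_t, U_j: dicts mapping feature name p -> feature value.
# subcosts: dict mapping p -> a function comparing the two values.
def target_cost(S_t, U_j, weights, subcosts):
    return sum(w_p * subcosts[p](S_t[p], U_j[p])
               for p, w_p in weights.items())

def stress_mismatch(s, u):
    # Toy subcost for one feature: 0 if it matches, 1 if not.
    return 0.0 if s == u else 1.0

cost = target_cost({"stress": 1}, {"stress": 0},
                   weights={"stress": 2.0},
                   subcosts={"stress": stress_mismatch})
print(cost)  # 2.0: one mismatched feature, weight 2
```

On that reading, (S_t, U_t) on page 312 would be the odd one out, since comparing the target at t against candidate j is exactly the point of the search. But I may have this backwards, hence the question.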
On page 15 of this reading (page 40 of the book), Taylor seems to distinguish between ‘suprasegmental “prosody”’, ‘affective prosody’ and ‘augmentative prosody’ when describing the complete prosody generation model. I know this may be a little outside the scope of this course, but what exactly is the difference between these three? Previously, I had assumed that all prosodic features were simply suprasegmental.
Hi, I just stumbled upon this and would like to ask whether it is the answer to why this and this are essentially the same thing but appear different (because of Wavesurfer).
Quick question: on page 778 of this reading, the caption for Figure 20.12 reads: ‘the third panel shows the first 13 values of the DCT of each column of the Mel spectrogram…[continued]’.
I would like to confirm whether this ‘DCT’ refers to the Discrete Cosine Transform. If so, do I need to know more about why and how it is applied to the Mel spectrogram? Also, how does the DCT differ from the Fourier Transform (and its variants) such that it is used on the Mel spectrogram instead?
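For reference, here is how I would reproduce the step the caption seems to describe, assuming it is the usual MFCC recipe (this is my assumption, not a quote of Jurafsky & Martin’s code; the array here is dummy data):

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
log_mel = rng.random((40, 100))  # 40 Mel bands x 100 frames (dummy data)

# axis=0: take the DCT-II of each column (one frame's Mel spectrum)
# independently, then keep only the first 13 coefficients per frame.
mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:13, :]
print(mfcc.shape)  # (13, 100)
```

As I understand it, the DCT is used here rather than a Fourier Transform because the log Mel spectrum is real-valued (so no complex transform is needed) and the DCT packs most of the spectral-envelope information into the first few coefficients, which is why only 13 are kept; I would appreciate confirmation of that.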
Hello! I have a question about the material discussed on pages 61–64, specifically the references to Fig. 5.1. I am struggling to see how the diagram shows what the accompanying text says it does.
I’m not sure how the diagram shows the composition of the complex wave or its preferred resonances; I was under the impression that the one on the top is a representation of the complex wave itself, while the bottom spectrum shows the frequencies of the source’s resonances.
The text says the diagram shows that the source responds best to frequencies around 1000 Hz and hardly responds to anything above 3000 Hz, but the peak of the spectrum of the sound wave seems to be around 4000 Hz, and that value is not mentioned at all. In general, I just didn’t know where to look or what each detail of the diagram is supposed to reflect. I could make assumptions based on the shape of the diagram and its alignment, but I thought it would be best to clarify so I can gain a better understanding.
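In case it helps to separate the two ideas, here is a toy numeric illustration of the difference between a wave’s spectrum and a resonator’s response curve (all numbers invented by me; this is not the figure, just the general source-vs-response distinction):

```python
import numpy as np

freqs = np.linspace(0, 5000, 501)

# Toy response curve: strongest response near 1000 Hz,
# almost no response above 3000 Hz.
response = 1.0 / (1.0 + ((freqs - 1000) / 600) ** 2)

# Toy input spectrum whose own peak happens to be near 4000 Hz.
source = np.exp(-((freqs - 4000) / 800) ** 2) + 0.3

# What comes out is the pointwise product of the two.
output = source * response
print(freqs[np.argmax(output)])  # close to 1000 Hz, not 4000 Hz
```

That is, a peak at 4000 Hz in the wave’s own spectrum would not contradict the system responding best near 1000 Hz; but whether that is actually what the two panels of Fig. 5.1 show is exactly what I am hoping someone can confirm.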