Forum Replies Created
It is possible to get really excellent results from Microsoft Word, if you are an expert user and know how to control all the settings. However, in my experience very few people manage that. The default output from Word looks ugly. It is poor at typesetting equations. For long documents, it becomes unreliable.
For these reasons, I recommend learning (and mastering) LaTeX. It is a little harder to learn than Word, but its default output is better. I don’t recommend learning it just to write the coursework for this course; but, if you need to write a dissertation later in your programme, this is a better tool than Word.
This is a case where citing the online version is the correct thing to do. Cite it as you would any other URL (e.g., mention the date on which you last accessed it).
For the voice used in this assignment, this is done by rules hardwired into the low-level C++ code, which are specific to the Unilex dictionary.
(You are not expected to be able to read or understand the code, but feel free to try).
EDIT – see below for a more detailed answer explaining what the rules do.
These questions can be very useful, because a single split can give a large reduction in entropy, and we end up with a smaller tree than if we had to ask each individual question in sequence.
Including category questions is very common. It’s a kind of feature engineering because it’s exactly equivalent to adding a new 2-valued predictor to every data point.
This is a good way to include domain knowledge or our own intuitions about the problem.
Related example: In HMM-based automatic speech recognition, regression trees are used to cluster the parameters of context-dependent models. Category questions are used as standard.
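To illustrate the point about a new 2-valued predictor, here is a minimal Python sketch; the "function word" category and the POS tags are made up purely for illustration:

# Hypothetical example: a category question is just a derived yes/no predictor.
FUNCTION_WORDS = {"DET", "CONJ", "PREP"}   # assumed category membership

data_points = [
    {"pos_current": "NOUN"},
    {"pos_current": "DET"},
    {"pos_current": "VERB"},
]

for point in data_points:
    # "Is the current word a function word?" becomes an ordinary binary predictor.
    point["is_function_word"] = point["pos_current"] in FUNCTION_WORDS

print(data_points)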
When we partition some data points using a binary question, we hope to make the distribution of values of the predictee less uniform and more predictable. In other words, we try to reduce the entropy of the probability distribution of the predictee.
If we manage to do that, we have gained some information about the value of the predictee. We know more about it (= we are more certain of its value) after the split than before it.
The reduction in entropy from before to after the split is the information gain, measured in bits.
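Here is a minimal Python sketch of those two quantities for a single binary split; the function names are just for illustration:

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy (in bits) of the distribution of predictee values.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(all_labels, yes_labels, no_labels):
    # Reduction in entropy from before the split to the weighted average after it.
    total = len(all_labels)
    after = (len(yes_labels) / total) * entropy(yes_labels) \
            + (len(no_labels) / total) * entropy(no_labels)
    return entropy(all_labels) - after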
[other part of question answered separately – please include only one question per post]
Yes, LPC uses a simple linear filter that is time-invariant.
You should assume that Festival only accepts plain ASCII characters and cannot interpret characters with accents / diacritics.
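If your source text does contain accented characters, one option is to strip the diacritics before passing the text in. This is a sketch using only Python's standard library, not anything built into Festival:

import unicodedata

def to_ascii(text):
    # Decompose accented characters, then drop anything that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("café naïve"))   # -> "cafe naive"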
This was done in main lecture 5 of Speech Processing.
It’s a matter of degree, without a right/wrong answer. You need to strike a balance: cite support for each claim or fact, but don’t let the density of citations make the text unreadable.
In your example, a citation is not essential at the end of that sentence, but you will want to provide citations once you start describing the individual processes.
Do we calculate the filter coefficients as if they were the filter producing the speech we’re trying to synthesise?
Yes, that’s correct – in effect, we fit the filter to the spectral envelope of the speech.
Would inverse filtering the recorded speech result in a pulse train as the excitation signal?
No. The filter is simple and cannot model speech perfectly. The error in this modelling is captured in the residual signal. The residual is a waveform. For voiced speech, it will be more similar to a pulse train than the speech was, but not exactly a pulse train.
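Here is a minimal sketch of LPC analysis followed by inverse filtering, assuming librosa and SciPy are available; the filename, frame length and LPC order are arbitrary illustrative choices:

import librosa
from scipy.signal import lfilter

# "utterance.wav" is a placeholder for any mono speech recording.
speech, sr = librosa.load("utterance.wav", sr=16000)

frame = speech[:400]                # one analysis frame (25 ms at 16 kHz)
a = librosa.lpc(frame, order=16)    # fit the all-pole (LPC) filter to this frame

# Inverse filtering: apply A(z), the prediction error filter, to the speech.
# The output is the residual - not a perfect pulse train, even for voiced speech.
residual = lfilter(a, [1.0], frame)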
We have joins in consonants too. So in the example /k ae t/, the diphones would be
sil_k k_ae ae_t t_sil
where sil is “silence” and is just another phoneme.
You correctly spot that we might not want to place the cut point at exactly the centre (50% point) in all cases. In the case of stops, we will make the join in the closure portion.
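If it helps, here is a minimal Python sketch of turning a phoneme sequence into a diphone sequence; the function name is just for illustration:

def phones_to_diphones(phones):
    # Pad with silence at both ends, then take each adjacent pair as a diphone.
    padded = ["sil"] + phones + ["sil"]
    return [f"{a}_{b}" for a, b in zip(padded, padded[1:])]

print(phones_to_diphones(["k", "ae", "t"]))   # -> ['sil_k', 'k_ae', 'ae_t', 't_sil']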
Yes, a pulse train is an approximation to the sound produced by vocal fold vibration.
Although it might not seem like a particularly good approximation, it is simple and mathematically convenient. The principal difference between a pulse train and the true signal is that the pulse train has a flat spectral envelope.
That’s not a problem though: we can include the modelling of the actual spectral envelope of the vocal fold signal in the vocal tract filter.
In other respects, the pulse train has the correct properties: specifically, it has energy at every multiple of F0 (a “comb-like” or “line” spectrum).
So, we can say that a source-filter model is really a model of the signal, and not a literal model of the physics of speech production.
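Here is a minimal NumPy sketch of the point about the spectrum; the sample rate and F0 are arbitrary illustrative values:

import numpy as np

sr, f0 = 16000, 100            # sample rate (Hz) and F0 (Hz)
period = sr // f0              # samples per pitch period (T0)

# A pulse train: one unit impulse at the start of every pitch period, for 1 second.
pulses = np.zeros(sr)
pulses[::period] = 1.0

# Its magnitude spectrum has equal energy at every multiple of F0:
# a flat spectral envelope with a "comb-like" line structure.
spectrum = np.abs(np.fft.rfft(pulses))
freqs = np.fft.rfftfreq(len(pulses), d=1 / sr)
print(freqs[spectrum > 0.5 * spectrum.max()][:5])   # ~ 0, 100, 200, 300, 400 Hz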
Let’s follow your working:
I label the predictors as their Parts-Of-Speech
Correct – but you should say that you annotate the training data samples with values for each of the three predictors. In this example, we are using the POS of the preceding, current and following word as the predictors 1, 2 and 3 respectively.
use the question Is the label after “BREAK” PUNC?
Let’s word that more carefully. Questions must be about predictors of the current data point. So you should say:
Ask the question Is predictor 3 = PUNC?
Now partition the data accordingly.
everything which is punctuation comes after a BREAK, and everything which isn’t punctuation is a conjunction
This is where you’ve made the mistake. For the question Is predictor 3 = PUNC?, 8 data points have the answer “Yes” and all of them have the value “NO-BREAK” for the predictee, which indeed is a distribution with zero entropy. So far, so good.
Now look at the 26 data points for which the answer to Is predictor 3 = PUNC? was “No”. The distribution of predictee values is 4 BREAKs and 22 NO-BREAKs. That distribution does not have zero entropy.
Your reasoning about “everything which isn’t punctuation is a conjunction” is incorrect. You are looking at the distribution of values of a predictor. When measuring entropy, we look only at the values of the predictee, because that is the thing we are trying to predict. The reduction in entropy measures how much more predictable the predictee has become after a particular split of the data.
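To make the arithmetic concrete, here is the same calculation written out in plain Python, using only the counts above:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Before the split: 4 BREAK and 30 NO-BREAK among all 34 data points.
before = entropy([4, 30])                    # about 0.52 bits

# After "Is predictor 3 = PUNC?":
#   "Yes" branch:  8 points, all NO-BREAK    -> zero entropy
#   "No" branch:  26 points, 4 BREAK and 22 NO-BREAK
after = (8 / 34) * entropy([8]) + (26 / 34) * entropy([4, 22])

print(before - after)                        # the information gain, in bits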
I’ve tried in the past setting questions as homework or lecture preparation, but only a few students took part. We can try again, perhaps in the speech recognition part of the Speech Processing course.
Pitch marking means finding the instants of glottal closure. Pitch marks are moments in time. The interval of time between two consecutive pitch marks is the pitch period, denoted as T0, which of course is equal to 1 / F0.
You might think that pitch marking would be the best way to find F0. However, it’s actually not, because pitch marking is hard to do accurately and will give a lot of local error in the estimate for F0.
Pitch marking is useful for signal processing, such as TD-PSOLA.
Pitch tracking is a procedure to find the value of F0, as it varies over time.
Pitch tracking is done over longer windows (i.e., multiple pitch periods) to get better accuracy, and can take advantage of the continuity of F0 in order to get a more robust and error-free estimate of its value.
Pitch tracking is useful for visualising F0, analysing intonation, and building models of it.
Exactly how pitch marking and pitch tracking work is beyond the scope of the Speech Processing course, but is covered in the more advanced Speech Synthesis course.
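Purely as an illustration (this goes beyond what you need for Speech Processing), here is a minimal pitch tracking sketch assuming librosa is available; the filename and the F0 search range are placeholders:

import librosa
import numpy as np

# "utterance.wav" is a placeholder for any mono speech recording.
speech, sr = librosa.load("utterance.wav", sr=16000)

# pYIN gives an F0 estimate per analysis frame, plus a voiced/unvoiced decision.
f0, voiced_flag, voiced_prob = librosa.pyin(speech, fmin=60, fmax=400, sr=sr)

t0 = 1.0 / f0[voiced_flag]     # pitch periods (T0 = 1 / F0) for the voiced frames
print(np.nanmean(f0))          # average F0 over the utterance, in Hz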