Forum Replies Created
Let’s clarify your understanding. The most important thing to say first is that we must separate the description of the analysis and synthesis parts.
You said that we calculate “one coefficient for each pitch period” – no, we calculate the complete set of filter coefficients (there might be 16 of them, say) for each analysis frame.
We have choices about the analysis frame: it might be a pitch period, but that would require pitch marking the signal, which is error-prone. So, let’s just consider the simple case where the analysis frame is of fixed duration (25ms, say) and the frames are spaced at fixed times (every 10ms, say).
After calculating the filter coefficients, we inverse filter the frame of speech signal. This gives us the residual signal for the current analysis frame. The residual is a waveform. If we use this residual signal to excite the filter (i.e., as the excitation signal), we will get near-perfect reconstruction of the frame of speech being analysed.
We store the filter coefficients and the residual waveform together. They are “matched”: only the combination of residual and filter from the same analysis frame will give near-perfect speech output. If we “mix and match” across different analysis frames, we will not get such good reconstruction.
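To make the analysis stage concrete, here is a minimal NumPy/SciPy sketch of the steps just described. The sample rate, the frame and hop sizes, and the filter order of 16 are simply the illustrative values used above, and the random signal is a stand-in for real speech:

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(frame, order=16):
    """Fit 'order' prediction coefficients to one analysis frame
    (autocorrelation method)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])

fs = 16000                      # sample rate (assumed)
frame_len = int(0.025 * fs)     # 25ms analysis frame
hop = int(0.010 * fs)           # frames spaced every 10ms
speech = np.random.randn(fs)    # stand-in for one second of real speech

for start in range(0, len(speech) - frame_len, hop):
    frame = speech[start:start + frame_len]
    a = lpc(frame)
    # inverse filter A(z) = 1 - sum_k a_k z^{-k} gives the residual waveform
    A = np.concatenate(([1.0], -a))
    residual = lfilter(A, [1.0], frame)
    # exciting 1/A(z) with its own residual reconstructs the frame
    reconstruction = lfilter([1.0], A, residual)
    assert np.allclose(frame, reconstruction)
```

The assert holds because the residual and the filter come from the same analysis frame; that is exactly the “matched” property described above.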
You have correctly understood that we “should not manipulate the filter, because it might deviate from the perfect match”. That is true. So, we will only manipulate the filter by small amounts (for join smoothing), to avoid too much mismatch. We may also manipulate the residual using overlap-and-add (to modify F0) – this will also create some amount of mismatch. So, again, we will limit the amount of manipulation to limit the severity of the mismatch.
Now on to the synthesis stage, which happens every time we use the TTS system to say a sentence…
Here, we have choices about the resynthesis frame. It could be as simple as the fixed analysis frame from above. This will work but, because the filter coefficients are updated at a fixed rate (every 10ms, which is 100 times per second), we may hear an artefact: a constant 100Hz “buzz”.
We can’t avoid updating the filter, but we can be clever about the rate at which we do it. If we update not every 10ms but every pitch period, then we will create an artefact not at 100Hz but at F0. Since the listener will perceive F0 anyway (in voiced speech), we can “hide” the artefact “behind” the natural F0.
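As a hedged sketch (the function and argument names are made up, not taken from any particular toolkit), pitch-synchronous updating during resynthesis might look like this, assuming the pitch marks are already known and the coefficient sets come from the fixed-rate analysis above:

```python
import numpy as np
from scipy.signal import lfilter

def resynthesise(residual, coeff_sets, hop, pitch_marks):
    """coeff_sets[i] holds a_1..a_p for the analysis frame at sample i * hop;
    pitch_marks are sample indices of the (assumed known) pitch periods."""
    out = np.zeros_like(residual)
    zi = np.zeros(coeff_sets.shape[1])   # filter memory, length = order p
    edges = np.concatenate(([0], pitch_marks, [len(residual)]))
    for start, end in zip(edges[:-1], edges[1:]):
        if end <= start:
            continue
        # switch to the coefficient set of the nearest analysis frame,
        # but only at a pitch mark, so the update artefact sits at F0
        i = min(int(round(start / hop)), len(coeff_sets) - 1)
        A = np.concatenate(([1.0], -coeff_sets[i]))
        out[start:end], zi = lfilter([1.0], A, residual[start:end], zi=zi)
    return out
```

Carrying the filter state zi across update points preserves the filter’s memory each time the coefficients change, so we don’t add extra discontinuities beyond the update itself.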
In diphone synthesis, there is just one recorded copy of each diphone. The F0 and duration of that recorded copy will be arbitrary. If we simply concatenated these recordings, we would get an arbitrary and very probably discontinuous F0 contour. We must manipulate the recording in order to impose the predicted F0 (e.g., to get gradual declination over a phrase), and to impose predicted duration.
In unit selection synthesis, we have many recordings of each diphone to choose from. In some versions of unit selection (covered in detail in the Speech Synthesis course), we will use the front end’s predictions of F0 and duration to help us choose the most appropriate one.
If you use Chrome, there are various plugins that will do this for you, such as HTML5 Video Speed Control or Video Speed Controller.
Vector graphics means that a figure is represented by actual lines and text. This means it can be scaled without losing quality. Vector graphics can be created using many drawing packages, provided you export in a suitable format. PDF is usually best.
The alternative is a bitmap, or image. Such images are represented by individual pixels. There are many file formats for this (PNG, BMP, JPG). In these formats, you need to make sure that the resolution of the image (the number of pixels in the horizontal and vertical dimensions) is high enough.
A good rule of thumb: if you actually print your document on A4 paper, you should not be able to detect the individual pixels in your figures.
Sometimes, you have to use a bitmap format. For example, a spectrogram is inherently an image, not vector graphics. In these cases, when you take a screenshot, make the window as large as possible on your screen before taking the screenshot. This will give you the maximum resolution.
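If you happen to make your figures with matplotlib (just one possibility, of course), the two options look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 1, 1000)
fig, ax = plt.subplots()
ax.plot(t, np.sin(2 * np.pi * 5 * t))
ax.set_xlabel("time (s)")

fig.savefig("figure.pdf")           # vector: real lines and text, scales cleanly
fig.savefig("figure.png", dpi=300)  # bitmap: you must pick a resolution; 300 dpi prints well
```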
It is possible to get really excellent results from Microsoft Word, if you are an expert user and know how to control all the settings. However, in my experience very few people manage that. The default output from Word looks ugly. It is poor at typesetting equations. For long documents, it becomes unreliable.
For these reasons, I recommend learning (and mastering) LaTeX. It is a little harder to learn than Word, but its default output is better. I don’t recommend learning it just to write the coursework for this course; but, if you need to write a dissertation later in your programme, this is a better tool than Word.
This is a case where citing the online version is the correct thing to do. Cite it as you would any other URL (e.g., mention the date on which you last accessed it).
For the voice used in this assignment, this is done by rules hardwired into the low-level C++ code, which are specific to the Unilex dictionary.
(You are not expected to be able to read or understand the code, but feel free to try).
EDIT – see below for a more detailed answer explaining what the rules do.
These questions can be very useful, because a single split can give a large reduction in entropy, and we end up with a smaller tree than if we had to ask each individual question in sequence.
Including category questions is very common. It’s a kind of feature engineering because it’s exactly equivalent to adding a new 2-valued predictor to every data point.
This is a good way to include domain knowledge or our own intuitions about the problem.
Related example: In HMM-based automatic speech recognition, regression trees are used to cluster the parameters of context-dependent models. Category questions are used as standard.
When we partition some data points using a binary question, we hope to make the distribution of values of the predictee less uniform and more predictable. In other words, we try to reduce the entropy of the probability distribution of the predictee.
If we manage to do that, we have gained some information about the value of the predictee. We know more about it (= we are more certain of its value) after the split than before it.
The reduction in entropy from before to after the split is the information gain, measured in bits.
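Here is a minimal Python sketch of these calculations on a made-up toy example, including a category question of the kind discussed above:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of the distribution of predictee values."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(labels, answers):
    """Reduction in entropy after one binary (yes/no) split."""
    yes = [l for l, a in zip(labels, answers) if a]
    no = [l for l, a in zip(labels, answers) if not a]
    after = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
    return entropy(labels) - after

# toy data: predict a duration class from the phone's identity
phones   = ["p", "t", "k", "m", "n", "a", "i", "u"]
duration = ["short", "short", "short", "long", "long", "long", "long", "long"]

# category question: "is the phone in the set {p, t, k}?" -- exactly
# equivalent to adding a new 2-valued predictor to every data point
answers = [ph in {"p", "t", "k"} for ph in phones]
print(information_gain(duration, answers))   # ~0.95 bits
```

In this toy case a single category split separates the two classes completely, so the information gain equals the entire entropy before the split (about 0.95 bits), which illustrates why such questions can yield a much smaller tree.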
[other part of question answered separately – please include only one question per post]
Yes, LPC uses a simple linear filter that is time-invariant.
You should assume that Festival only accepts plain ASCII characters and cannot interpret characters with accents / diacritics.
This was done in main lecture 5 of Speech Processing.
It’s a matter of degree, without a right/wrong answer. You need to strike a balance: cite support for each claim or fact, but don’t let the density of citations make the text unreadable.
In your example, a citation is not essential at the end of that sentence, but you will want to provide citations once you start describing the individual processes.
Do we calculate the filter coefficients as if they were the filter producing the speech we’re trying to synthesise?
Yes, that’s correct – in effect, we fit the filter to the spectral envelope of the speech.
Would inverse filtering the recorded speech result in a pulse train as the excitation signal?
No. The filter is simple and cannot model speech perfectly. The error in this modelling is captured in the residual signal. The residual is a waveform. For voiced speech, it will be more similar to a pulse train than the speech was, but not exactly a pulse train.
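A quick synthetic demonstration of this point (everything here is made up for illustration: a toy one-resonance filter, a 100Hz pulse train, and a little added noise to stand in for what the simple filter cannot model):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(frame, order=16):               # same helper as in the earlier sketch
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])

fs, f0 = 16000, 100
excitation = np.zeros(fs)
excitation[::fs // f0] = 1.0            # an exact 100Hz pulse train

# toy one-resonance "vocal tract": pole pair at 500Hz, radius 0.95
rad, w = 0.95, 2 * np.pi * 500 / fs
A_true = [1.0, -2 * rad * np.cos(w), rad * rad]
speech = lfilter([1.0], A_true, excitation)
speech += 0.01 * np.random.randn(fs)    # detail the simple filter cannot model

a = lpc(speech)
residual = lfilter(np.concatenate(([1.0], -a)), [1.0], speech)

def peakiness(x):                       # max over RMS: higher = more pulse-like
    return np.max(np.abs(x)) / np.sqrt(np.mean(x * x))

print(peakiness(speech), peakiness(residual))
```

The printed peakiness is higher for the residual than for the speech: inverse filtering has removed the resonance and left something closer to, but not exactly, a pulse train.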
We have joins in consonants too. So in the example /k ae t/, the diphones would be
sil_k k_ae ae_t t_sil
where sil is “silence” and is just another phoneme.
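In code, the construction is just “pad with silence, then take adjacent pairs”; a tiny sketch:

```python
def phones_to_diphones(phones):
    """Pad the phone string with silence, then take adjacent pairs."""
    padded = ["sil"] + phones + ["sil"]
    return [f"{a}_{b}" for a, b in zip(padded, padded[1:])]

print(phones_to_diphones(["k", "ae", "t"]))
# ['sil_k', 'k_ae', 'ae_t', 't_sil']
```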
You correctly spot that we might not want to place the cut point at exactly the centre (the 50% point) in all cases. In the case of stops, we will make the join in the closure portion.