Forum Replies Created
Also, I think that run_dnn.py might be hardwired to use only tanh layers regardless of what the configuration file specifies.
Your reasoning behind why we need to cluster (also called “tie”) models is correct, yes.
The nodes in the tree each contain a question about a phonetic feature (e.g., “is the previous phone nasal?”). The tree is simply a CART. The phonetic features are the predictors. The predictee is the current model state’s parameters (mean and variance of its Gaussian).
The tree is learned in very much the same way as a classification or regression tree.
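For concreteness, here is a minimal sketch (in Python, with made-up feature names and Gaussian values, not the HTK/HTS implementation) of descending such a tree: internal nodes ask yes/no questions about the phonetic context, and every model state that reaches the same leaf shares that leaf’s Gaussian parameters.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Leaf:
    mean: list       # shared Gaussian mean for every state clustered here
    variance: list   # shared Gaussian variance

@dataclass
class Node:
    question: Callable[[dict], bool]   # e.g. "is the previous phone nasal?"
    yes: object                        # Node or Leaf
    no: object                         # Node or Leaf

def find_leaf(tree, context):
    """Descend the tree by answering phonetic-feature questions about the context."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(context) else tree.no
    return tree

# Two different full-context states that answer the questions the same way
# end up at the same leaf, i.e. they are tied to one shared Gaussian.
tree = Node(
    question=lambda c: c["prev_phone_nasal"],
    yes=Leaf(mean=[0.1], variance=[1.0]),
    no=Leaf(mean=[0.5], variance=[1.2]),
)
assert find_leaf(tree, {"prev_phone_nasal": True}) is find_leaf(
    tree, {"prev_phone_nasal": True}
)
```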
Your question about how this eventually affects the generated waveform can be restated in two parts:
1. how does this affect the models’ parameters?
2. how do model parameters affect the waveform that they generate?
The answer to 1. you have already figured out: the models share parameters, that’s all. We don’t need to average the group of models (actually, model states) that end up at a leaf – we simply have only one shared (= tied) state there, and it is trained on all the corresponding data. So, if you like, you might instead think of the tree as finding all the suitable data that this shared state should be trained on, pooled across a group of sufficiently-similar contexts.
The answer to 2. is via the usual generation process of statistical parametric speech synthesis: the models generate trajectories of vocoder parameters, and those are then vocoded into a waveform.
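To make that generation step concrete, here is a schematic sketch: each tied state contributes its mean vector of vocoder parameters for its predicted number of frames, the frames are concatenated into a trajectory, and the trajectory is vocoded. This skips the dynamic-feature smoothing (MLPG) used in real systems, and `vocode` is a hypothetical stand-in for a real vocoder call.

```python
import numpy as np

def generate_trajectory(state_means, durations):
    """Repeat each tied state's mean vector for its predicted duration (in frames)."""
    frames = [np.tile(mean, (n_frames, 1))
              for mean, n_frames in zip(state_means, durations)]
    return np.vstack(frames)   # shape: (total_frames, vocoder_param_dim)

means = [np.array([1.0, 0.5]), np.array([0.8, 0.7])]
trajectory = generate_trajectory(means, durations=[3, 5])
# waveform = vocode(trajectory)   # hypothetical vocoder call (e.g. STRAIGHT/WORLD)
```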
To be more precise: most frames of all regions labelled as silence are removed.
It improves training (as found empirically) because otherwise the training data is dominated by silence frames and the network will optimise for generating silence in preference to speech sounds (it’s very easy to minimise the error on silence, and that contributes too much to total error if there are a lot of silence frames).
To prevent the truncation of phrase-final speech sounds, the correct solution is to improve the forced alignment.
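For illustration, here is a minimal sketch of the silence-removal idea, assuming frame-level labels in which silence is marked “sil”; the label name, the keep fraction, and the random sampling strategy are all illustrative (a real recipe would more likely trim each silence region down to a few frames at its boundaries).

```python
import random

def remove_most_silence(frames, labels, keep_fraction=0.05, sil_label="sil"):
    """Keep every speech frame, but only a small fraction of silence frames."""
    kept = []
    for frame, label in zip(frames, labels):
        if label != sil_label or random.random() < keep_fraction:
            kept.append(frame)
    return kept
```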
I believe you can (and should) now switch to using run_lstm.py in all cases, both LSTM and purely feed-forward architectures.
This error is relevant because it leads to incorrect labels on the database, which unit selection is not robust to. So, it may be worth mentioning. Perhaps you could suggest some solutions (one would be better WSD of course) and think about how a statistical parametric approach would be affected by the same kind of front-end error.
Correct – join locations are at pitch marks, and the signal (*) is overlap-added at the joins with an overlap region of one pitch period.
(*) the signal could in principle be the speech waveform, but in Festival it is the residual waveform, because the released version of Festival uses RELP signal processing and not TD-PSOLA.
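A minimal sketch of what happens at a join, assuming a linear cross-fade over one pitch period; in Festival the `left` and `right` signals would be residual segments rather than raw waveforms, and the actual windowing may differ.

```python
import numpy as np

def overlap_add_join(left, right, pitch_period_samples):
    """Concatenate two units, cross-fading over one pitch period at the join."""
    n = pitch_period_samples
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = np.linspace(0.0, 1.0, n)
    overlap = left[-n:] * fade_out + right[:n] * fade_in
    return np.concatenate([left[:-n], overlap, right[n:]])
```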
In the Unilex dictionary (RPX accent) there are the following entries for “lower”:
("lower" (nn glare) (((l ow @) 1))) ("lower" (vb glare) (((l ow @) 1))) ("lower" (vb make-low) (((l ou @) 1))) ("lower" (vbp glare) (((l ow @) 1))) ("lower" (vbp make-low) (((l ou @) 1)))
The “glare” part of the entry is the word sense, which could be used by Word Sense Disambiguation. Festival’s WSD module very probably doesn’t know about the rather obscure “glare” sense of the word “lower” (as in to glare at someone).
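For illustration, a small sketch of how the (part-of-speech, sense) pair selects between the two pronunciations; the lookup structure is made up, not Festival’s internal representation, but the pronunciations are taken from the entries above.

```python
# Pronunciations from the Unilex (RPX) entries for "lower" shown above.
UNILEX_LOWER = {
    ("nn", "glare"):     "l ow @",
    ("vb", "glare"):     "l ow @",
    ("vb", "make-low"):  "l ou @",
    ("vbp", "glare"):    "l ow @",
    ("vbp", "make-low"): "l ou @",
}

def pronounce_lower(pos, sense):
    return UNILEX_LOWER[(pos, sense)]

print(pronounce_lower("vb", "make-low"))   # l ou @
print(pronounce_lower("vb", "glare"))      # l ow @  (only reachable with correct WSD)
```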
The pitch marks determine the possible join locations (the forced-alignment boundaries are rounded up or down to the nearest pitch mark), so unit choices can change.
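A tiny sketch of that rounding step, with illustrative pitch mark times:

```python
def snap_to_pitch_mark(boundary_time, pitch_marks):
    """Round a forced-alignment boundary to the nearest pitch mark."""
    return min(pitch_marks, key=lambda pm: abs(pm - boundary_time))

pitch_marks = [0.512, 0.520, 0.529, 0.537]      # times in seconds (illustrative)
print(snap_to_pitch_mark(0.526, pitch_marks))   # 0.529
```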
To turn off the target cost, set its weight to zero. The target costs will all be 0 when you inspect the Unit relation.
(Lecture 5 was about pitch tracking and pitchmarking. Do you mean Lecture 6?)
Unit selection systems do not necessarily require a vocoder, but could optionally use one so that joins can be smoothed. Alternatively, we might only do time-domain processing (i.e., direct waveform concatenation).
The RELP coding used in Festival could be thought of as a type of vocoding, although the need for the residual (which is itself a waveform) would make this a rather inconvenient vocoder for other purposes, such as statistical parametric speech synthesis.
Use direct quotes very sparingly, and wherever possible state things in your own words. One situation where a direct quote is needed is where reporting the precise wording used by that author is important (e.g., you are going to criticise it).
If you are just attributing an idea or supporting a claim, then there is generally no need for a direct quote. Describe the idea in your own words, and place the citation at a point that makes the connection between the idea and the citation obvious.
In your example, you are making a claim that something is well-known, and so a citation is essential there. No direct quote is needed in this case.
Many citation styles are possible. The APA style is a good default choice.
Any reasonable style (i.e., anything in current use in a major journal in our field) is acceptable. APA is a particularly good choice of referencing style.
The first thing we might observe is that objective measures (in your case, you are proposing “number of pitchmarking errors”) do not always correlate with perceptual results. If they did, life would be much easier!
There are clearly some complex interactions between the various factors that you mention. That’s typical of unit selection, and reflects the difficulty in automatically tuning this type of system.
I can’t suggest a simple reason for the results you are observing, but some other things to look at might be:
– total number of joins in each case
– what happens when you turn off the target cost, in each case
You’ve got all the essential points. The coefficients needed for RELP synthesis are stored in two parallel sets of files: the LPC filter coefficients and the residual signals. The filter coefficients are a sequence of vectors (one vector = one set of filter coefficients at a certain point in time) and these are pitch-synchronous, and so implicitly represent the pitch marks (your point 2 is correct). The answer to point 3 is that the information is already there in the filter coefficients, and there is no need to duplicate it in the residuals. Filter coefficients and residuals “belong together”, and for each utterance there is a pair of files.
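For illustration, a sketch of how that pair is used at synthesis time: each frame’s residual segment is passed through the all-pole LPC synthesis filter defined by the matching coefficient vector. File formats and frame bookkeeping are greatly simplified here; this is not Festival’s actual code.

```python
import numpy as np
from scipy.signal import lfilter

def relp_resynthesise(frames):
    """frames: list of (lpc_coefficients, residual_segment) pairs for one utterance.

    Each coefficient vector is [1, a1, ..., ap]; the residual segment is the
    excitation for that pitch-synchronous frame.  Filter state is not carried
    across frames here, which is a simplification.
    """
    output = []
    for a, residual in frames:
        # All-pole synthesis filter 1/A(z): residual (excitation) -> speech
        output.append(lfilter([1.0], a, residual))
    return np.concatenate(output)
```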
Here’s a paper on which I am a co-author, to give an example of reducing both the word count and the amount of space (from nearly 5 pages down to 4 pages), as well as making editorial improvements to the text.
Attachments: