Forum Replies Created
This error is relevant because it leads to incorrect labels in the database, which unit selection is not robust to. So it may be worth mentioning. Perhaps you could suggest some solutions (one would be better WSD, of course) and think about how a statistical parametric approach would be affected by the same kind of front-end error.
Correct – join locations are at pitch marks, and the signal (*) is overlap-added at the joins with an overlap region of one pitch period.
(*) the signal could in principle be the speech waveform, but in Festival it is the residual waveform because the released version of Festival uses RELP signal processing and not TD-PSOLA.
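To make that concrete, here is a minimal sketch in Python (using NumPy; the function and the linear crossfade shape are my own illustration, not Festival’s actual code) of overlap-adding two waveform segments at a join, with an overlap of one pitch period’s worth of samples:

import numpy as np

def overlap_add_join(left, right, overlap):
    # Crossfade the last `overlap` samples of `left` (one pitch period)
    # with the first `overlap` samples of `right`, then concatenate.
    fade_out = np.linspace(1.0, 0.0, overlap)
    fade_in = 1.0 - fade_out
    crossfaded = left[-overlap:] * fade_out + right[:overlap] * fade_in
    return np.concatenate([left[:-overlap], crossfaded, right[overlap:]])

In Multisyn, the segments joined in this way are the candidates’ residual waveforms, with the join placed at a pitch mark.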
In the Unilex dictionary (RPX accent) there are the following entries for “lower”:
("lower" (nn glare) (((l ow @) 1))) ("lower" (vb glare) (((l ow @) 1))) ("lower" (vb make-low) (((l ou @) 1))) ("lower" (vbp glare) (((l ow @) 1))) ("lower" (vbp make-low) (((l ou @) 1)))
The “glare” part of the entry is the word sense, which could be used by Word Sense Disambiguation. Festival’s WSD module very probably doesn’t know about the rather obscure “glare” sense of the word “lower” (as in to glare at someone).
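To illustrate why this matters, here is a toy sketch in Python (the dictionary structure and the fallback behaviour are invented for illustration and are not Festival’s lexicon interface): pronunciation lookup for a homograph like “lower” depends on both the part of speech and the word sense, so without reliable WSD the front end can end up with the wrong vowel.

# Hypothetical, simplified view of the Unilex entries for "lower".
LEXICON = {
    ("lower", "nn", "glare"):     "l ow @",
    ("lower", "vb", "glare"):     "l ow @",
    ("lower", "vb", "make-low"):  "l ou @",
    ("lower", "vbp", "glare"):    "l ow @",
    ("lower", "vbp", "make-low"): "l ou @",
}

def lookup(word, pos, sense=None):
    # With a sense from WSD the lookup is exact; without one we fall back
    # to the first matching (word, pos) entry, which may be the wrong
    # pronunciation for the intended meaning.
    if sense is not None:
        return LEXICON.get((word, pos, sense))
    for (w, p, s), pron in LEXICON.items():
        if w == word and p == pos:
            return pron
    return None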
The pitch marks determine the possible join locations (the unit boundaries from forced alignment are rounded up or down to the nearest pitch mark), so unit choices can change.
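For example (a small Python sketch; the times are invented), snapping a forced-alignment boundary to the nearest pitch mark looks like this, which is why a change in the pitch marks can move the candidate boundaries and hence change which units get selected:

def snap_to_pitch_mark(boundary_time, pitch_marks):
    # Move a forced-alignment boundary to the closest pitch mark time.
    return min(pitch_marks, key=lambda t: abs(t - boundary_time))

# e.g. snap_to_pitch_mark(0.4132, [0.405, 0.411, 0.418]) returns 0.411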
To turn off the target cost, set its weight to zero. The target costs will all be 0 when you inspect the Unit relation.
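Conceptually (a sketch only; the weights and cost functions here are placeholders, not Festival’s actual parameter names), candidates are ranked by a weighted sum of target and join costs, so a zero target-cost weight means selection is driven purely by the join cost:

def total_cost(target_cost, join_cost, target_weight, join_weight):
    # With target_weight = 0.0, only the join cost influences selection,
    # and the (weighted) target costs you inspect afterwards are all 0.
    return target_weight * target_cost + join_weight * join_cost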
(Lecture 5 was about pitch tracking and pitchmarking. Do you mean Lecture 6?)
Unit selection systems do not necessarily require a vocoder, but could optionally use one so that joins could be smoothed. Alternatively, we might only do time-domain processing (i.e., direct waveform concatenation).
The RELP coding used in Festival could be thought of as a type of vocoding, although the need for the residual (which is itself a waveform) would make this a rather inconvenient vocoder for other purposes, such as statistical parametric speech synthesis.
Use direct quotes very sparingly, and wherever possible state things in your own words. One situation where a direct quote is needed is where reporting the precise wording used by that author is important (e.g., you are going to criticise it).
If you are just attributing an idea, or supporting a claim, then there is generally no need for a direct quote. Describe the idea in your own words, and place the citation at a place that makes the connection between the idea and the citation obvious.
In your example, you are making a claim that something is well-known, and so a citation is essential there. No direct quote is needed in this case.
Many citation styles are possible. The APA style is a good default choice.
Any reasonable style (i.e., anything in current use in a major journal in our field) is acceptable. APA is a particularly good choice of referencing style.
The first thing we might observe is that objective measures (in your case, you are proposing “number of pitchmarking errors”) do not always correlate with perceptual results. If they did, life would be much easier!
There are clearly some complex interactions between the various factors that you mention. That’s typical of unit selection, and reflects the difficulty in automatically tuning this type of system.
I can’t suggest a simple reason for the results you are observing, but some other things to look at might be:
– total number of joins in each case (one way to count these is sketched after this list)
– what happens when you turn off the target cost, in each case
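For the first of those, here is one way to count joins, sketched in Python under the assumption that each selected candidate records which source utterance it came from and where it sits in that utterance (these are not Festival’s actual data structures):

def count_joins(candidates):
    # A join occurs wherever two consecutive selected candidates are not
    # contiguous pieces of the same source utterance.
    joins = 0
    for prev, cur in zip(candidates, candidates[1:]):
        contiguous = (prev["utterance"] == cur["utterance"]
                      and prev["end_index"] == cur["start_index"])
        if not contiguous:
            joins += 1
    return joins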
You’ve got all the essential points. The coefficients needed for RELP synthesis are stored in two parallel sets of files: the LPC filter coefficients and the residual signals. The filter coefficients are a sequence of vectors (one vector = one set of filter coefficients at a certain point in time); these are pitch-synchronous, and so implicitly represent the pitch marks (your point 2 is correct). The answer to point 3 is that the information is already there in the filter coefficients, so there is no need to duplicate it in the residuals. Filter coefficients and residuals “belong together”, and for each utterance there is a pair of files.
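As a rough sketch of that pairing (Python; the class and field names are mine, not the actual speech_tools/EST file formats):

from dataclasses import dataclass
import numpy as np

@dataclass
class RelpUtterance:
    # One vector of LPC filter coefficients per pitch period; the time of
    # each vector implicitly gives the pitch mark locations.
    coefficient_frames: np.ndarray  # shape (num_pitch_periods, lpc_order + 1)
    frame_times: np.ndarray         # shape (num_pitch_periods,), in seconds
    residual: np.ndarray            # the residual waveform for the whole utterance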
Here’s a paper on which I am a co-author, to give an example of reducing both the word count and the amount of space (from nearly 5 pages down to 4 pages), as well as making editorial improvements to the text.
Here’s an example of reducing the word count – the first page shows the original (thanks to the anonymous student who contributed this example) and the second page is my edit. The constraint here was word count, and not space.
The filter coefficients are not cross-faded. Remember that they are specified frame-by-frame (not sample-by-sample, like the residual). We just concatenate the sequences of frames of filter coefficients for all the candidates – this gives us a complete sequence of filter coefficients for the full utterance.
You need to do some more detective work to find out how the pitchmarks are found. For example, try omitting the pitchmarking step and see what happens as you build the voice.
What is the pipeline for concatenation and RELP waveform generation?
A complete residual signal (which is just a waveform) for the whole utterance is constructed by concatenating the residuals of the selected candidates. Overlap-and-add (i.e., crossfade) is performed at the joins, over a duration of one pitch period. A corresponding sequence of LPC filter coefficients is also constructed.
The function lpc_filter_fast in .../speech_tools/sigpr/filter.cc then does the waveform generation. The inputs are the utterance-length residual waveform and the sequence of LPC filter coefficients. I’ve just realised that I wrote that code nearly 20 years ago…

/*************************************************************************/
/* Author : Simon King */
/* Date : October 1996 */
/*-----------------------------------------------------------------------*/
/* Filter functions */
/* */
/*=======================================================================*/
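A much-simplified sketch of that final step (Python with SciPy; real RELP synthesis in speech_tools is pitch-synchronous and more careful about frame boundaries, filter state and coefficient conventions than this):

import numpy as np
from scipy.signal import lfilter

def relp_synthesis(residual, coefficient_frames, frame_boundaries):
    # residual: 1-D array of samples; coefficient_frames: 2-D array with one
    # row of LPC coefficients per frame; frame_boundaries: (start, end) sample
    # indices for each frame. Excite the all-pole LPC synthesis filter with
    # the residual, switching coefficients frame by frame and carrying the
    # filter state across frame boundaries.
    state = np.zeros(coefficient_frames.shape[1] - 1)
    output = []
    for coeffs, (start, end) in zip(coefficient_frames, frame_boundaries):
        frame_out, state = lfilter([1.0], coeffs, residual[start:end], zi=state)
        output.append(frame_out)
    return np.concatenate(output)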
Why does Festival use Residual Excited LP (RELP)?
The released version of Festival uses RELP for two reasons. The first reason is practical – TD-PSOLA is patented:
Method and apparatus for speech synthesis by wave form overlapping and adding EP0363233 (A1)
The patent was filed by the French state via its research centre CNET, which later became France Telecom (known today as Orange).
The second reason is that RELP allows pitch/time/spectral envelope modification, as you mention. In the older diphone engine, RELP is indeed used for time- and pitch-modification. In Multisyn, no modification or join smoothing is performed, although in principle it would be possible to add this to the implementation.