Taylor – Text-to-speech synthesis – Chapter 16
January 25, 2017 at 11:56 #6622
In Taylor’s book, the total cost of selecting a unit is a sum of the target cost and the join cost, but it seems that the two are treated equally.
Would it make more sense to have weights for both, to reflect the relative importance of the target cost and the join cost?
e.g. cost = 0.3*target_cost + 0.7*join_cost
January 25, 2017 at 13:28 #6624
Usually, as you suggest, there will be a relative weighting between the two costs. You can experiment with that for yourself in Festival.
The join cost and target cost might be on quite different scales, since they are measuring entirely different things. Internally, some normalisation may be applied to partially address this. The cost functions themselves generally also involve summing up sub-costs, and again there will be weights involved in that sum: you can experiment with this for yourself in Festival’s join cost function too.
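To make that relative weighting concrete, here is a minimal sketch in Python (not Festival’s actual code; all function names, sub-costs and weight values are illustrative assumptions) of how a total cost could combine two weighted sums of sub-costs:

```python
# Illustrative sketch only: a total cost combining a weighted target cost and
# a weighted join cost, each of which is itself a weighted sum of sub-costs.

def target_cost(spec, unit, sub_weights):
    """Weighted sum of target sub-costs (e.g. stress, phrasing, F0 mismatch)."""
    return sum(w * sub(spec, unit) for sub, w in sub_weights)

def join_cost(left_unit, right_unit, sub_weights):
    """Weighted sum of join sub-costs (e.g. F0, energy, spectral mismatch)."""
    return sum(w * sub(left_unit, right_unit) for sub, w in sub_weights)

def total_cost(specs, units, target_subs, join_subs, w_target=0.3, w_join=0.7):
    """Relative weighting between the two cost types, as in the example above."""
    t = sum(target_cost(s, u, target_subs) for s, u in zip(specs, units))
    j = sum(join_cost(a, b, join_subs) for a, b in zip(units, units[1:]))
    return w_target * t + w_join * j

# Toy usage with dummy numeric "units" and one sub-cost of each type:
target_subs = [(lambda s, u: abs(s - u), 1.0)]   # e.g. an F0 mismatch term
join_subs = [(lambda a, b: abs(a - b), 1.0)]     # e.g. a discontinuity term
print(total_cost([100, 110, 120], [102, 109, 125], target_subs, join_subs))
```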
January 21, 2019 at 16:42 #9670
Taken from Page 489 of this reading: “A further complication arises because the unit-selection system isn’t a generative model in the normal probabilistic sense. Rather it is a hybrid model that uses a function to select units (which is not problematic) but then just concatenates the actual units from the database, rather than generating the units from a parameterised model. This part is problematic for a maximum-likelihood type of approach. The problem arises because, if we try to synthesise an utterance in the database, the unit-selection system should find those actual units in the utterance and use them. Exact matches for the whole specification will be found and all the target and join costs will be zero. The result is that the synthesized sentence in every case is identical to the database sentence.”
I am confused – what exactly is the problem that Taylor is talking about? Is it:
1. Because the unit-selection system is not a generative model, trying to synthesise a sentence that is actually one of the original sentences in the database will result in the system simply selecting all the units from that database entry, leading to zero costs,
or
2. Because the unit-selection system is not a generative model, trying to synthesise a sentence that is actually one of the original sentences in the database should result in the system simply selecting all the units from that database entry, but this does not happen?
January 22, 2019 at 13:03 #9672
I think it’s the first one – if the test sentence is in the training data, there is an exact match in which no “new” joins need to be made, so the total cost is zero. The problem Taylor is discussing is that, because the output is then just a playback of human speech, we never get a situation where we can compare synthesised speech against recorded speech.
Later in that section Taylor talks about holding out a test set of utterances and synthesising those sentences from the remaining (training) data. With this, you can compare the natural speech with hopefully-close-to-identical synthesised speech, rather than comparing the natural speech with… the same natural speech.
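As a toy illustration of that held-out setup (illustrative only; nothing here is from Festival or Taylor’s code), you could split the recorded utterances like this, synthesise the held-out sentences from the remaining database, and then compare them against the natural recordings:

```python
# Illustrative sketch: hold out some utterances so that synthesised and
# natural versions of the same sentences can be compared.
import random

def split_database(utterances, test_fraction=0.1, seed=0):
    """Hold out a test set; the rest stays in the unit-selection database."""
    utts = list(utterances)
    random.Random(seed).shuffle(utts)
    n_test = max(1, int(len(utts) * test_fraction))
    return utts[n_test:], utts[:n_test]   # (database, held-out test set)

database, held_out = split_database([f"utt_{i:03d}" for i in range(200)])
# Synthesise each held-out sentence using only `database`, then compare it
# against the corresponding natural recording (e.g. in a listening test).
```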
January 22, 2019 at 18:18 #9673
In section 16.3.4, Taylor is talking about the specific problem of how to set the target cost weights. If unit selection were a generative model (e.g., an HMM), we could use the usual objective of maximising likelihood.
The problem is that there is no explicit “model” as such – the unit selection system actually contains the training data, rather than abstracting away (i.e., generalising) from it by fitting a model.
Because it has “memorised” the training data exactly, it is perfectly fitted to the training data (we would say “over-fitted” if it was a generative model). This means that changing the target cost weights has absolutely no effect (*) on the output when we generate sentences from the training data.
(*) a weighted sum of zero terms is always zero, regardless of the weights
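A trivial numerical illustration of the footnote (the values and weights are made up): whatever weights you pick, the weighted sum of zero sub-costs stays at zero, so the weights cannot be tuned on sentences from the training data.

```python
# If every sub-cost is zero (exact match in the database), the weighted sum
# is zero no matter what the weights are.
sub_costs = [0.0, 0.0, 0.0]                      # exact match: all zero
for weights in ([1, 1, 1], [10, 0.1, 5], [0.3, 0.7, 2.5]):
    total = sum(w * c for w, c in zip(weights, sub_costs))
    print(weights, "->", total)                  # always 0.0
```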
January 26, 2019 at 18:07 #9681
Taylor writes on page 516 that linguistic features are ‘high-level’ while acoustic features are ‘low-level’. What exactly does this mean? Are ‘high-level’ and ‘low-level’ meant in a programming sense, e.g. Python is relatively high-level and C is relatively low-level? If not, please correct my understanding.
January 27, 2019 at 11:19 #9682
Your analogy with programming languages is along the right lines. In this context:
“high level” means “further away from the waveform”, “more abstract” and “changing at a slower rate”
“low level” means “closer to the waveform”, “more concrete (e.g., specified more precisely using more parameters)” and “changing more rapidly”
February 4, 2020 at 13:42 #10643
In 16.2.4, when Taylor discusses the trade-off between dimensionality reduction and accuracy, he states that “there is a natural distance metric between the specification and units and this simplifies the design of the target function”. How should we understand the “natural distance metric” mentioned here?
February 16, 2020 at 14:24 #10658
F0 is real-valued. Taylor argues that this means there is a very natural way to measure the distance between two F0 values. For example, we could take their difference. I would make this argument on the basis of perception: it is clear that a larger difference in F0 values will generally produce a larger perceived difference in two speech sounds. The relationship is not linear, but at least it is monotonic.
This is in contrast to using multiple high-level features such as stress, accentuation, phrasing and phonetic identity. It is not at all clear what distance metric we should use here, for reasons including:
- they are not real-valued
- we don’t know their relative importance
- we don’t know if/how they are correlated with one another
- the relationship with perception is not so obvious as for F0
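Here is a small sketch (illustrative only; the feature names and weights are assumptions, not anything from Taylor or Festival) contrasting the “natural” distance for real-valued F0 with the guesswork needed for categorical linguistic features:

```python
# Real-valued F0: a simple absolute difference is a natural distance, and it
# is monotonic (though not linear) with the perceived difference.
def f0_distance(f0_spec, f0_unit):
    return abs(f0_spec - f0_unit)

# Categorical linguistic features: all we can really do is count mismatches,
# and we have to guess the weights because the relative importance of the
# features, and their correlations, are unknown.
def linguistic_distance(spec, unit, weights):
    return sum(w for feat, w in weights.items() if spec[feat] != unit[feat])

print(f0_distance(120.0, 135.0))   # 15.0 Hz: a graded, interpretable distance

spec = {"stress": 1, "accented": True, "phrase_final": False}
unit = {"stress": 0, "accented": True, "phrase_final": False}
weights = {"stress": 1.0, "accented": 1.0, "phrase_final": 1.0}  # guesses
print(linguistic_distance(spec, unit, weights))  # 1.0: but how "far" is that?
```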