› Forums › Speech Synthesis › Evaluation › Automatic speech synthesis evaluation with ASR system
- This topic has 4 replies, 2 voices, and was last updated 7 years, 6 months ago by Jiewen Z.
February 14, 2017 at 20:02 #6771
After I learnt the current evaluation methods for SS, I feel that there is a lot of room to improve, and I have some ideas about evaluation.
Firstly, we currently care about the intelligibility and naturalness of SS. I think those are probably not the final points/criteria that we should really care about. My reason is that there is no really clear boundary between intelligibility and naturalness; they are connected somehow. For example, if we speak in a very unnatural way, this can add more difficulty to comprehending meaning. At least, our brain will spend more time to understand it correctly. Just like the GMM covariance matrix, intelligibility and naturalness really form a full covariance matrix, not a diagonal one. But current evaluation methods evaluate them separately, which I think is not appropriate.
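The covariance analogy could be made concrete with a small sketch. Assuming we had per-utterance intelligibility and naturalness ratings (the numbers below are invented purely for illustration), the sample covariance matrix would have non-zero off-diagonal entries whenever the two dimensions co-vary:

```python
# Hypothetical per-utterance listening-test scores (made up for illustration)
intelligibility = [4.5, 3.0, 2.0, 4.0, 3.5]
naturalness = [4.0, 2.5, 1.5, 3.8, 3.0]

def cov(xs, ys):
    """Sample covariance between two equal-length score lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# 2x2 covariance matrix over (intelligibility, naturalness)
C = [[cov(intelligibility, intelligibility), cov(intelligibility, naturalness)],
     [cov(naturalness, intelligibility), cov(naturalness, naturalness)]]
```

If the two dimensions were truly independent (a "diagonal" matrix), `C[0][1]` would be near zero; for scores that move together, as here, it is clearly positive.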
Secondly, if we think about evaluation from the communication perspective, I think we should really focus on the effectiveness of communication itself. If we look at the SS system as a noisy-channel model which contaminates the original information (text) and transforms it into a waveform that encodes that information, then our evaluation is really a way to find out how heavily the information (text) is contaminated. To evaluate, we are actually doing information recovery.
From the above two points, I think if we stand higher and look further, intelligibility and naturalness are just two terms coined by humans to partly describe information recovery. What else is encoded in natural speech? We do not know. But from the communication perspective, how quickly and accurately we can recover the information really depends on the contaminated signal itself, if we use the same tool.
If that is true, why not use an ASR system to help us automatically recover information, as a quantitative evaluation method? My idea is this: we train an ASR system on natural speech, and have human speakers and the SS system produce speech from the same set of sentences. Then we use the ASR system to recognise both the human speech and the SS speech, producing two transcripts for each sentence. Finally, we can use the difference between the WER for human speech and the WER for SS speech as the score. We may think ASR is not perfect, so it cannot be used. But since we only need a relative score, not an absolute one (there is no way to get that), a score from ASR is still reasonable.
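As a sketch of how the scoring step could work (this is just an illustration, not an established protocol; the transcripts below are invented), the WER difference can be computed with a standard word-level edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical ASR transcripts of the same sentence, spoken by a human and by a TTS system
text = "the quick brown fox jumps over the lazy dog"
asr_of_human = "the quick brown fox jumps over the lazy dog"
asr_of_tts = "the quick brown fox jump over a lazy dog"

# The proposed relative score: how much worse ASR does on synthetic speech
delta = wer(text, asr_of_tts) - wer(text, asr_of_human)
```

The idea is that `delta` near zero would mean the synthesis degrades the recoverable information about as little as natural speech does.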
From what I can see, I think this method is feasible. Please help to comment.
February 16, 2017 at 13:34 #6777
“if we speak in a very unnatural way, this can add more difficulty to comprehending meaning. At least, our brain will spend more time to understand it correctly”
That seems a sensible hypothesis, and I believe it is true.
What we do know (from the Blizzard Challenge) is that the most intelligible system is definitely not always the most natural. A very basic statistical parametric system can be highly intelligible without sounding very natural.
What we don’t know as much about is whether “our brain will spend more time” processing such speech. We might formally call that “cognitive load”. We have a hypothesis that some or all forms of synthetic speech impose a higher cognitive load on the listener than natural speech does. Testing this hypothesis is ongoing research.
February 16, 2017 at 13:40 #6778
“I think we really should focus on the effectiveness of communication itself.”
Yes, that would be a good idea. The challenge in doing that would be creating an evaluation that truly measured communication success. It’s possible, but likely to be more complicated than a simple listening + transcribing task.
Thinking more generally, you are touching on the idea of “ecological validity” in evaluation, which is an evaluation that accurately measures “real world” performance. It is not easy to make an evaluation ecologically valid, and is very likely to be much more complex to set up than a simple listening test. There is also a significant risk of introducing confounding or uncontrolled external factors into the experimental situation. These would make the results harder to interpret.
The reason that researchers very rarely worry about ecological validity in synthetic speech evaluation is a mixture of laziness, cost, and the problem of external confounding factors.
February 17, 2017 at 22:21 #6780
Thank you, this is very helpful!
February 16, 2017 at 13:44 #6779
“Why not use an ASR system as an objective evaluation method?”
This idea is often proposed, and occasionally tried. In essence, it uses ASR as a model of a listener, and assumes that its accuracy correlates with listener performance.
However, an ASR system is unlikely to be a very good model of human perception and language processing, and so you would need to find out whether this is true before proceeding.
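One way to test that assumption before relying on ASR-based scores would be to check, across a set of systems, how well ASR error rates correlate with human listeners' error rates on the same stimuli. A minimal sketch (the per-system WERs below are invented for illustration):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-system word error rates: ASR vs. human transcribers
asr_wer = [0.05, 0.12, 0.30, 0.22]
human_wer = [0.02, 0.08, 0.25, 0.15]

r = pearson(asr_wer, human_wer)
```

Only if `r` were consistently high across many systems and conditions would ASR accuracy be a trustworthy proxy for listener performance.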