Percentage of English words requiring WSD
- This topic has 1 reply, 2 voices, and was last updated 8 years, 2 months ago by Simon.
October 9, 2016 at 19:48 #5289
I had asked in last week’s lecture what percentage of words (dictionary entries) require word sense disambiguation. Before you do the work of checking for me, is there a simple way to access the dataset containing dictionary entries in Festival so that I could check? Or could you share how you would go about checking for yourself?
Also, is there a way that we can get some estimates for the probabilities of these words occurring in an input text? In other words, I would like to know not only the percentage of words requiring WSD but specifically the likelihood of encountering such a word when processing text. I realise that it would be a rough estimate, but I’m curious nonetheless.
-
October 12, 2016 at 12:01 #5459
On the system here in Edinburgh, you can see a dictionary that marks word sense at
/Volumes/ss/festival/festival_mac/festival/lib/dicts/unilex/unilex-edi.out
You can count the number of entries that include a word sense. It’s very small: 342 out of 116740 lexical baseforms (roughly 0.3%). Here’s an extract:
("repress" (vb keep-down) (((t^ i) 0) ((p r e s) 1)))
("repress" (vb press-again) (((t^ ii) 3) ((p r e s) 1)))
("repress" (vbp keep-down) (((t^ i) 0) ((p r e s) 1)))
("repress" (vbp press-again) (((t^ ii) 3) ((p r e s) 1)))
("repressed" (jj keep-down) (((t^ i) 0) ((p r e s t) 1)))
("repressed" (jj press-again) (((t^ ii) 3) ((p r e s t) 1)))
("repressed" (vbd keep-down) (((t^ i) 0) ((p r e s t) 1)))
("repressed" (vbd press-again) (((t^ ii) 3) ((p r e s t) 1)))
("repressed" (vbn keep-down) (((t^ i) 0) ((p r e s t) 1)))
("repressed" (vbn press-again) (((t^ ii) 3) ((p r e s t) 1)))
("represses" (vbz keep-down) (((t^ i) 0) ((p r e s) 1) ((i z) 0)))
("represses" (vbz press-again) (((t^ ii) 3) ((p r e s) 1) ((i z) 0)))
("repressing" (jj keep-down) (((t^ i) 0) ((p r e s) 1) ((i n) 0)))
("repressing" (jj press-again) (((t^ ii) 3) ((p r e s) 1) ((i n) 0)))
("repressing" (nn keep-down) (((t^ i) 0) ((p r e s) 1) ((i n) 0)))
("repressing" (nn press-again) (((t^ ii) 3) ((p r e s) 1) ((i n) 0)))
("repressing" (vbg keep-down) (((t^ i) 0) ((p r e s) 1) ((i n) 0)))
("repressing" (vbg press-again) (((t^ ii) 3) ((p r e s) 1) ((i n) 0)))
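If you want to do the count yourself, something like the following rough Python sketch might be a starting point. It assumes one entry per line in the shape shown in the extract, and it assumes that an entry carries a word sense exactly when its second field is a bracketed list of two or more symbols, such as (vb keep-down). I have not checked that every entry in the file follows that shape, so treat the result as approximate. The function name and the "cat" entry are made up for illustration.

```python
import re

# Assumed entry shape, based on the extract above:
#   ("headword" (pos sense) (syllables...))
# The second field is captured either as a bracketed list or a bare symbol.
ENTRY_RE = re.compile(r'^\(\s*"([^"]+)"\s+(\([^()]*\)|[^\s()]+)')

def sense_marked_headwords(lines):
    """Return (headwords with a sense tag, all headwords seen)."""
    all_words, sense_words = set(), set()
    for line in lines:
        m = ENTRY_RE.match(line.strip())
        if not m:
            continue  # skip blank lines or lines in an unexpected shape
        word, pos_field = m.group(1), m.group(2)
        all_words.add(word)
        # Assumption: a sense tag is present when the bracketed field
        # holds two or more symbols, e.g. "(vb keep-down)".
        if pos_field.startswith("(") and len(pos_field[1:-1].split()) >= 2:
            sense_words.add(word)
    return sense_words, all_words

sample = [
    '("repress" (vb keep-down) (((t^ i) 0) ((p r e s) 1)))',
    '("repress" (vb press-again) (((t^ ii) 3) ((p r e s) 1)))',
    '("cat" (nn) (((k a t) 1)))',  # hypothetical entry without a sense tag
]
marked, seen = sense_marked_headwords(sample)
print(len(marked), "of", len(seen), "headwords carry a sense tag")
```

To run it on the real dictionary, pass an open file handle instead of `sample`. It counts distinct headwords rather than individual entries, which may or may not match how the 342 figure was obtained.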
To get a better idea of how often this matters in practice, we would need to take a large corpus that is representative of the input text we expect, and count how often one of those 342 words occurs.
To refine that, we should only count the times where its pronunciation would have been incorrect based on POS alone. However, that would be expensive, because we would have to know the correct pronunciation – for example, by manually annotating the text.
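The simple corpus count (without the refinement) could be sketched roughly as below. The tokeniser is deliberately crude, and the word set and text are placeholders rather than real data; in practice you would load the sense-marked headwords from the dictionary and read a large corpus from disk. Because it ignores whether POS alone would already give the right pronunciation, the figure it produces is only an upper bound on how often WSD actually matters.

```python
import re
from collections import Counter

def wsd_token_rate(text, wsd_words):
    """Fraction of word tokens in `text` that appear in `wsd_words`."""
    # Crude tokenisation: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    hits = sum(counts[w] for w in wsd_words)
    return hits / len(tokens) if tokens else 0.0

# Placeholder inputs for illustration only.
wsd_words = {"repress", "repressed", "represses", "repressing"}
text = "They repressed the memory. Nobody represses the vinyl again."
rate = wsd_token_rate(text, wsd_words)
print(f"{rate:.3f}")
```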