Front end: non-ASCII characters

This topic has 3 replies, 3 voices, and was last updated 9 years, 7 months ago by Simon.

Author

Posts
- October 26, 2015 at 21:41 #436
  Anonymous Student
  Student
  I found an interesting normalization error in Festival:
  the text “£3 billion” results in the words “three billion pounds billion”, with the word “billion” doubled.
  This also happens with similar words such as “million”. However, it does not happen with “$” in place of “£”.
  
  In the (utt.relation.print utt 'Word) listing, there is an unknown character as the very first element,
  probably a remnant of the pound symbol.
  This also happens for other characters outside the ASCII range, but those unknown characters are simply not read.
  Maybe the second byte of the pound symbol confuses the normalizer? Or is there a simple rule error?
- October 26, 2015 at 21:42 #437
  Simon
  Professor
  Festival can handle UTF-8 and UTF-16 characters, but not via the interactive command-line interface. This is a limitation of the input method. You would need to read such text from a file.
- October 26, 2015 at 22:54 #451
  Lars H
  Student
  I tried reading “£3 billion” from a UTF-8 file using (tts "input.txt" nil) and it still says “three billion pounds billion”.
  - October 27, 2015 at 20:45 #458
    Simon
    Professor
    OK, so my hypothesis about non-ASCII characters is probably wrong here. You seem to have found a pretty bad error in the part of the pipeline that detects/classifies/expands non-standard words. Can you speculate on exactly where this might have happened, and maybe even propose where a change would have to be made to fix this problem?
    
    The unknown / blank item in the Word relation is probably the place where the pound sign used to be just after tokenisation, but has been deleted after completion of the non-standard word processing step (because we don’t want “pounds three billion”).
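
One way to picture the kind of rule error discussed above is the toy expander below. It is only a sketch in Python and is not Festival's code: the function `expand`, its `consume_magnitude_for` parameter, and the lookup tables are all hypothetical names invented for illustration. It assumes the currency rule reads ahead to a following word such as "billion" and emits it before the currency name, but only marks that word as already used for some symbols. Under that assumption, "$3 billion" comes out correctly while "£3 billion" gets a doubled "billion", matching the symptom reported in the thread.

```python
import re

# Hypothetical lookup tables for this sketch; not taken from Festival.
MAGNITUDES = {"million", "billion", "trillion"}
CURRENCY = {"$": "dollars", "£": "pounds"}
DIGIT_WORDS = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five"}


def number_words(digits):
    """Spell out a single digit; anything else is returned unchanged."""
    return DIGIT_WORDS.get(digits, digits)


def expand(tokens, consume_magnitude_for=("$", "£")):
    """Expand currency tokens into words.

    For a token like "£3" followed by "billion", the magnitude word is
    emitted before the currency name. The following token is skipped only
    if the symbol is listed in consume_magnitude_for; leaving a symbol out
    of that tuple simulates a rule that forgets to consume the next token.
    """
    words = []
    skip_next = False
    for i, tok in enumerate(tokens):
        if skip_next:
            skip_next = False
            continue
        match = re.fullmatch(r"([$£])(\d+)", tok)
        if match is None:
            words.append(tok)
            continue
        symbol, digits = match.groups()
        words.append(number_words(digits))
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt in MAGNITUDES:
            words.append(nxt)
            # If the symbol is not listed, "billion" is emitted here and
            # then emitted again when the loop reaches the next token.
            skip_next = symbol in consume_magnitude_for
        words.append(CURRENCY[symbol])
    return words


# Simulated faulty rule set: only "$" consumes the magnitude word.
print(expand(["£3", "billion"], consume_magnitude_for=("$",)))
print(expand(["$3", "billion"], consume_magnitude_for=("$",)))
# Simulated fix: "£" consumes it too.
print(expand(["£3", "billion"]))
```

If something like this is what happens, the place to look would be wherever the token-to-word rules treat "$" specially when a magnitude word follows, and the change would be to give "£" the same treatment. That remains a guess until checked against Festival's actual rules.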