Tokenisation & normalisation

When processing almost any text, we need to find the words. This involves splitting the input character sequence into tokens and normalising each token into words.

In many spoken language applications, we use text data.
That might be the input to a Text-to-Speech synthesiser.
It could be the transcript of the training data for an Automatic Speech Recogniser.
In both cases, we need the text to be made only of words.
One reason for that is because we might eventually need to find the pronunciations, and that might involve looking in the dictionary.
In this video, we're only going to define the problems of tokenisation and normalisation, but not give any solutions.
Those are in other videos and in readings.
Rather, I'll be asking you to think, not just about this particular problem, but more generally about the difficulty of each part of the problem and what kind of solutions would be needed.
Some parts are easy, some parts are much harder.
I'm going to use some conventions from Paul Taylor's book on speech synthesis, starting here with writing the text in the Courier font.
So here is the written form of a sentence.
It is important to understand, this is just a sequence of characters.
These are not yet words.
For example, there's use of case: there's a capital 'H' there.
There are numbers, there's a currency symbol, and there's some punctuation.
To say this sentence out loud, we need to find the underlying words.
We're going to write those in ALL CAPS.
These are all things that we might be able to find in a dictionary.
That's a reasonably straightforward sentence.
But how about this one?
Look how different the written form can be from the underlying words that we need to say out loud.
So how hard can that be?
Well, try it for yourself.
There are several parts of this sentence that are not standard words.
What I mean by that is you would not find them in a dictionary.
Can you find them?
Pause the video.
I found this, which needs to be expanded into the word 'Doctor'; this, which needs to be expanded into 'the seventeenth'; and this, which needs to be expanded into 'seventeen seventy three'.
My examples are all restricted to English because that's the language we have in common.
But the same problems and the general form of solutions are common across many languages.
Let's start right at the beginning of the pipeline with an apparently easy task: splitting the input into individual sentences.
Why do we need to do that at all?
That's because almost all spoken language technology, including speech synthesisers, can only deal with individual sentences.
In other words, the way they generate speech only uses information within the current sentence.
Can you segment this text into sentences?
Of course, as a speaker of the language, you can.
You'll realise that there are just three possible characters that can end a sentence in English: these ones.
From that, you could imagine writing a simple rule that detects these three characters and segments a text into sentences.
Will that work all the time?
Unfortunately not!
There are cases where, for example, a period does not mean the end of sentence.
So even this apparently simple task - of splitting text into sentences - is not entirely trivial.
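To make that concrete, here is a minimal sketch in Python of the naive rule just described: split wherever one of those three characters is followed by white space. The example sentence is invented for illustration.

```python
import re

text = "Dr. Foster went to Gloucester. He stepped in a puddle!"  # invented example

# Naive rule: a sentence ends at '.', '!' or '?' followed by white space.
naive_sentences = re.split(r'(?<=[.!?])\s+', text)
print(naive_sentences)
# ['Dr.', 'Foster went to Gloucester.', 'He stepped in a puddle!']
# The period after the abbreviation 'Dr.' is wrongly treated as a sentence end.
```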
So have a think about what kind of technique you might need to resolve this problem.
This period is ambiguous.
It could be the end of a sentence, or not.
We need to resolve that ambiguity.
Have a think about whether that's something that you could write down as a speaker of the language.
Maybe you could express it in a rule?
Or would you need to see lots and lots of examples of periods and label them as either 'end of sentence' or 'not end of sentence' and learn something from those labelled examples?
Remember, we're not specifying solutions here.
We're trying to survey the problems and get an idea of how hard they are, and what kind of techniques we're going to need.
In later topics and readings we'll actually provide some solutions.
They might be as simple as handwritten rules that capture the knowledge of a speaker of the language.
Or they might be something more complicated, such as a model learned from data that's been labelled by speakers of the language.
Now that we have individual sentences such as this rather splendid one here, we need to break it down into some smaller units for further processing.
This is still not made of words.
Our goal is to find the underlying words.
So can we just split on white space?
Would that be good enough?
Well, no, not here, because that would leave these as potential tokens.
That's not a word: that expands into 'three inches'.
So once again, have a think about whether you could write down, from your knowledge of the language, a way of tokenising this text reliably.
Or, again, would you actually need to label a large set of data with how it is tokenised and then learn something from that labelled data?
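As a small illustration (the sentence below is invented, not the one on the slide), splitting on white space alone is easy to implement, but it leaves behind tokens that are clearly not words yet:

```python
sentence = 'In 1773 it rained over 3" in one day.'  # invented example sentence

# Splitting on white space leaves tokens such as '3"' and 'day.' that are
# not yet words: they still need classification and expansion.
tokens = sentence.split()
print(tokens)
# ['In', '1773', 'it', 'rained', 'over', '3"', 'in', 'one', 'day.']
```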
Once we've tokenised (we've broken the text into first sentences, and then the sentences into tokens, which might be words or might not be words yet), we need to decide whether there's further processing required for each of those tokens.
Consider some of the tokens in this sentence.
What do we need to do to expand these into words?
We need to classify them.
We need to decide whether they're already natural language, such as all the things I've just greyed out, or whether they are some other type, such as 'year'.
We then need to resolve ambiguity.
We've detected that this is not a standard word, but it's ambiguous as to whether it expands into 'Doctor' or 'Drive'.
Once we've detected and classified these types and resolved that ambiguity, we need to verbalise.
We need to turn all these non-natural language tokens into natural language: into words.
So which steps of that do you think are hard and which are easy?
Specifically, consider this token here.
Is it easy to decide whether this should be read as a year 'seventeen seventy three', or a cardinal number 'one thousand seven hundred and seventy three', or as a sequence of digits 'one seven seven three'?
Then, once you've done that correctly - you have decided it is a year - how hard is it to expand that into the underlying words?
There are different steps to the problem.
Some are hard, and some are easy.
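To give a flavour of what the detection-and-classification step might look like, here is a toy heuristic in Python; the thresholds and the use of the previous word are assumptions made purely for illustration, not a real system.

```python
def classify_number_token(token, previous_word=None):
    """Toy heuristic: decide how a digit-only token should be read.

    Returns 'year', 'digits' or 'cardinal'. The rules below are
    illustrative assumptions only, not a recommended design.
    """
    if not token.isdigit():
        raise ValueError("expected a digit-only token")
    value = int(token)
    # Four-digit numbers in a plausible range, especially after words like
    # 'in' or 'since', are probably years: 'in 1773' -> seventeen seventy three.
    if len(token) == 4 and 1000 <= value <= 2100 and previous_word in ("in", "since", "from"):
        return "year"
    # Long digit strings (phone numbers, IDs) are often read digit by digit.
    if len(token) > 4:
        return "digits"
    return "cardinal"

print(classify_number_token("1773", previous_word="in"))  # year
print(classify_number_token("1773"))                      # cardinal
```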
One reason that we carefully distinguished written form from the underlying words is that the written form very often contains ambiguity.
We already saw that in the abbreviation 'Dr.', which might or might not have a period after it.
But it's not just abbreviations.
The natural language tokens can also be ambiguous.
When the same written form can denote several different underlying words, we say that it's a 'homograph'.
'Homo' means 'the same' and 'graph' means 'written'.
These are the three ways in which homographs come about.
There are abbreviations, which typically omit characters, and that can make words that were distinct share the same written form:
Drive / Doctor.
Street / Saint.
metres / miles.
There are pure accidents, where unrelated words happen to share a written form, such as 'polish' (to make shiny) and 'Polish' (from Poland).
I'm going to leave the others for you to think about yourself.
Finally, there are written forms that could denote one of several related underlying words, such as the noun 'record' and the verb 'record'.
All of this ambiguity will need to be resolved before we can determine the underlying words to say them out loud in our Text-to-Speech synthesiser.
But can you do that right now?
Looking at these written forms, could you tell me unambiguously what the spoken word will be?
Of course not!
You need more information.
So the interesting question is, "What information do you need to resolve this ambiguity?"
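As a hint at one possible answer, here is a toy sketch in Python: even a single neighbouring word - a scrap of syntactic context - can be informative. The pronunciations and the rule are illustrative assumptions only.

```python
# Rough, ARPAbet-style pronunciations, just to make the point that the
# written form 'record' alone does not determine the spoken word.
PRONUNCIATIONS = {
    ("record", "noun"): "R EH1 K AO0 D",  # stress on the first syllable
    ("record", "verb"): "R IH0 K AO1 D",  # stress on the second syllable
}

def disambiguate_record(previous_word):
    """Toy rule: a determiner before 'record' suggests the noun reading."""
    if previous_word.lower() in ("a", "the", "this", "that", "her", "his", "my"):
        return PRONUNCIATIONS[("record", "noun")]
    return PRONUNCIATIONS[("record", "verb")]

print(disambiguate_record("the"))  # noun, as in 'broke the record'
print(disambiguate_record("to"))   # verb, as in 'about to record'
```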
Let's summarise the key steps in tokenisation and normalisation of text, ready for Text-to-Speech synthesis.
We tokenise the input character sequence into sentences and then into tokens.
Then, for each token, we're going to classify it as either already being natural language or being what's called a Non-Standard Word, which we write as NSW.
That might be an abbreviation, a cardinal number, a year, a date, a money expression, and so on.
For both natural language tokens and non-standard word tokens, we need to resolve ambiguity and find the underlying form.
Once we've done that, we need to verbalise the Non-Standard Words into natural language; for example, turning sequences like this into 'seventeen seventy three'.
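As a rough sketch of what verbalising a year might involve, here is one way to expand a four-digit year in Python; it deliberately ignores special cases such as 'nineteen oh five' or 'two thousand and five'.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits_to_words(n):
    """Read a number from 0 to 99 as words."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"

def verbalise_year(token):
    """Read a four-digit year as two pairs of digits: 1773 -> 'seventeen seventy three'."""
    first, second = int(token[:2]), int(token[2:])
    if second == 0:
        return f"{two_digits_to_words(first)} hundred"  # e.g. 1800 -> 'eighteen hundred'
    return f"{two_digits_to_words(first)} {two_digits_to_words(second)}"

print(verbalise_year("1773"))  # seventeen seventy three
```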
I've outlined the key problems in turning text - the written form - into a sequence of words to be spoken out loud.
But I didn't offer any solutions, because I want you to think about which parts of this problem are relatively easy and could be solved simply using knowledge from the head of a native speaker encapsulated, perhaps, in rules, and which problems are much harder and can't be solved in that way.
In other words, I asked you to think about what types of solution are going to be needed.
These fall into those two broad categories.
In one, it's about linguistic knowledge that could be expressed in a way that can be implemented in software.
In the other, it's not about expressing knowledge directly (for example, in rules), but just using it to provide examples - and we'll call that 'data' - and then learning a solution from those examples.
Both categories of solution have their place in Text-to-Speech and in all sorts of other natural language processing applications.
Let's see where we're going next.
Because we're right at the start of text processing, I'm going to look quite far into the future to try and give you the big picture.
We'll make a first attempt to capture linguistic knowledge in simple rules.
I'll call them 'handwritten' because we're going to go directly from knowledge in the mind of a user of the language to rules that we can implement in software.
We'll see that has some uses, but is limited.
We'll then look at a more powerful and general way to express that knowledge, called 'finite state transducers', which we could also write by hand.
With those methods understood, we'll attempt to use them for the problem of predicting pronunciation from spelling.
That can work quite well for some languages.
We can write a pretty comprehensive set of rules for Spanish to do a good job of predicting pronunciation from spelling.
But it doesn't work very well for English.
Have a think about why.
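To give a feel for why rules can take you a long way in a language like Spanish, here is a toy, incomplete sketch of the idea in Python; the phone symbols lean towards Castilian and ignore stress, dialect variation and many contexts, so treat it only as an illustration of rule-based letter-to-sound conversion, not an accurate system.

```python
# Spellings are matched greedily, left to right; more specific (longer)
# spellings are listed before shorter ones that share a prefix.
RULES = [
    ("ch", "tʃ"), ("ll", "ʝ"), ("rr", "r"), ("qu", "k"),
    ("gue", "ge"), ("gui", "gi"), ("ce", "θe"), ("ci", "θi"),
    ("ge", "xe"), ("gi", "xi"),
    ("a", "a"), ("b", "b"), ("c", "k"), ("d", "d"), ("e", "e"), ("f", "f"),
    ("g", "g"), ("h", ""), ("i", "i"), ("j", "x"), ("l", "l"), ("m", "m"),
    ("n", "n"), ("ñ", "ɲ"), ("o", "o"), ("p", "p"), ("r", "ɾ"), ("s", "s"),
    ("t", "t"), ("u", "u"), ("v", "b"), ("y", "ʝ"), ("z", "θ"),
]

def spanish_letter_to_sound(word):
    """Toy rule-based letter-to-sound conversion for Spanish."""
    word = word.lower()
    phones, i = [], 0
    while i < len(word):
        for spelling, phone in RULES:
            if word.startswith(spelling, i):
                phones.append(phone)
                i += len(spelling)
                break
        else:
            i += 1  # no rule for this character: skip it
    return "".join(phones)

print(spanish_letter_to_sound("chocolate"))  # tʃokolate
print(spanish_letter_to_sound("guerra"))     # gera
```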
We will then have encountered a problem - predicting pronunciation from spelling for English - that needs more than handwritten rules.
It requires learning from examples: from data.
So, for pronunciation, and for other problems such as predicting prosody, we need some way of learning from data.
We meet machine learning for the first time, and we're going to look at decision trees.
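As a very small taste of that, here is a sketch using scikit-learn's DecisionTreeClassifier on the earlier end-of-sentence problem; the features and the labelled examples are invented purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Each period is described by two features:
#   [previous token is a known abbreviation, next character is upper case]
# and labelled 1 (end of sentence) or 0 (not). All values are invented.
X = [
    [1, 1],  # 'Dr. Foster'  -> not an end of sentence
    [1, 1],  # 'St. Andrews' -> not an end of sentence
    [0, 1],  # 'puddle. He'  -> end of sentence
    [0, 1],  # 'rain. It'    -> end of sentence
    [0, 0],  # '3.5'         -> not an end of sentence
]
y = [0, 0, 1, 1, 0]

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[0, 1]]))  # [1]: looks like a sentence boundary
print(tree.predict([[1, 1]]))  # [0]: probably just an abbreviation
```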
