Finite state transducer

Finite State Transducers provide general-purpose machinery for rewriting an input sequence as an output sequence. They have many uses, including verbalising non-standard words (NSWs) into natural language.

Whilst rules can work well in some simple cases, sometimes we need a more powerful technique.
One possibility is a finite state transducer.
We'll look at them here as a generalisation of handwritten rules, and that means we're just going to be creating a finite state transducer by hand, for now.
Remember that the input to our Text-to-Speech system is just a string of characters.
There are no words; we have to find them.
To find the words we need to tokenise, classify, resolve ambiguity, and verbalise.
This video is about finite state transducers, which could be used to classify non-standard words as well as to verbalise them.
Let's use the two non-standard words in this sentence as examples and just focus on the process of verbalisation.
That means we've already detected that this is a year and that this is a money amount.
Some parts of verbalisation might be non-trivial.
The pound symbol is at the beginning of the written form, but the end of the spoken form.
So we need some machinery that can deal with that.
Now, this is not a complete course on finite state transducers.
That would be part of a larger topic of formal languages in either computer science or computational linguistics.
Here we're just interested in their capabilities.
Capturing our own linguistic knowledge and expertise is a perfectly valid approach to solving some problems in Text-to-Speech and spoken language processing more generally.
One challenge in doing that is to find a suitable form in which we can express our knowledge.
Rules are one option, and they will often need to access the context of the token we're processing.
We saw context-sensitive rewrite rules and collocation rules doing that.
Finite state transducers are another option.
They're closely related and, for our purposes, let's just say they're part of the same family of approaches.
Here is a simple finite state transducer.
It's made of states and there are, of course, a finite number of them.
The notation I'm using is that there's a start state, and that's bold.
There's an end state written with two circles.
Then normal states written like that.
Those states are connected by arcs, and it's on the arcs where the action happens.
The action involves the finite state transducer consuming some input and emitting some output.
The arcs are directed: they have arrows on them.
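To make that concrete in code, here is a minimal sketch of such a machine in Python. This is my own illustration, not anything from the course: the arc representation and the names (EPS, transduce) are just assumptions, and the walker only handles deterministic machines of the kind in the examples that follow.

```python
# A minimal sketch of this kind of machine (my own illustration, not
# course code): an FST is a list of arcs of the form
# (state, input, output, next state), where EPS on the input side means
# "consume nothing" and EPS on the output side means "emit nothing".

EPS = ""  # epsilon

def transduce(arcs, start, final, text):
    """Walk the machine, consuming one character per arc and collecting
    the emitted words. Assumes a deterministic machine: at most one
    matching arc out of each state for a given input character."""
    state, words = start, []
    for ch in text:
        matches = [(out, dst) for (src, inp, out, dst) in arcs
                   if src == state and inp == ch]
        if not matches:
            raise ValueError(f"no arc from state {state!r} on input {ch!r}")
        out, state = matches[0]
        if out != EPS:
            words.append(out)
    # Input exhausted: follow any epsilon-input arcs, i.e. arcs that
    # emit something without consuming anything (used for money later).
    while state != final:
        matches = [(out, dst) for (src, inp, out, dst) in arcs
                   if src == state and inp == EPS]
        if not matches:
            raise ValueError(f"stuck in state {state!r} at end of input")
        out, state = matches[0]
        if out != EPS:
            words.append(out)
    return " ".join(words)
```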
Let's make everything clearer with a concrete example of verbalising the numbers 10 to 19.
I'll have a start state and then we'll go to the next state and we'll consume the first digit '1'.
But we can't emit anything yet because we don't know whether it's going to be 10 or 11 or 17 or 19.
So we write 'epsilon' for emitting nothing.
This state has meaning.
The model has no external memory, no place to store anything; it only has states and arcs.
This state means we have consumed '1' and we're waiting to see what happens next.
So, in fact, the states are the memory of the network.
The more we have to memorise before we can emit anything, the bigger this network is going to have to be.
All we can do next is consume another character.
We consume a '0' and now we can emit the word 'ten'.
Consume a '1' and emit 'eleven'.
Given any of those individual numbers 10, 11, 12, up to 19, this machine will consume the input and emit a word as output and then finish in its end state.
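Written as an arc table for the transduce sketch above (the state names 'start', 'seen1' and 'end' are my own), the whole machine is just this list:

```python
# The 10-to-19 machine as an arc table.
TEENS = [
    ("start", "1", EPS, "seen1"),   # consume '1', emit nothing yet
    ("seen1", "0", "ten",       "end"),
    ("seen1", "1", "eleven",    "end"),
    ("seen1", "2", "twelve",    "end"),
    ("seen1", "3", "thirteen",  "end"),
    ("seen1", "4", "fourteen",  "end"),
    ("seen1", "5", "fifteen",   "end"),
    ("seen1", "6", "sixteen",   "end"),
    ("seen1", "7", "seventeen", "end"),
    ("seen1", "8", "eighteen",  "end"),
    ("seen1", "9", "nineteen",  "end"),
]

print(transduce(TEENS, "start", "end", "17"))  # -> 'seventeen'
```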
This transducer only handles 10 to 19.
So, on your own as an exercise, try expanding it (just with pencil and paper) to verbalise all numbers from 10 to 99.
You should make the network as small as possible.
In other words, with as few states as possible.
Let's do another example, this time years such as this one here, which we read as 'seventeen ninety'.
We consume the digits one by one and optionally emit some output.
We can consume the '1' but we can't say anything yet.
We can consume the next digit, and now we can say 'ten', 'eleven', ... 'seventeen', and so on.
In this case, we'd have consumed the '1', consumed the '7', and emitted 'seventeen'.
Then we consume the next digit and can emit immediately: for example, if we've just consumed a '9', we emit 'ninety'.
We don't need to wait for any more input.
Then the final digit: if it's a '0', we emit nothing, otherwise, we emit 'one' to 'nine'.
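In the same arc-table notation (again my own sketch, with state names that aren't from the video and most arcs elided), the year machine so far looks like this:

```python
# The simple year machine, reading e.g. '1790' as 'seventeen ninety'.
# Only a few arcs are written out; the rest follow the same pattern.
YEARS = [
    ("start", "1", EPS, "y1"),        # consume '1', emit nothing yet
    ("y1", "7", "seventeen", "y2"),   # second digit: emit the century
    ("y1", "9", "nineteen",  "y2"),
    # ... arcs for the other second digits go here
    ("y2", "2", "twenty", "y3"),      # third digit: emit the tens
    ("y2", "9", "ninety", "y3"),
    # ... arcs for 'thirty' to 'eighty'
    ("y3", "0", EPS,    "end"),       # final digit: '0' emits nothing,
    ("y3", "1", "one",  "end"),       # otherwise 'one' to 'nine'
    ("y3", "9", "nine", "end"),
    # ... arcs for 'two' to 'eight'
]

print(transduce(YEARS, "start", "end", "1790"))  # -> 'seventeen ninety'
```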
This finite state transducer once again works for some years, but not all.
First of all, I want you to, as an exercise on your own, decide which years it will work for and which it won't.
Then improve it to handle all possible four-digit years.
Let's set a reasonable range on that: let's go from 1066 to 1999.
If you succeed in doing that, then extend it into the 21st century as well.
A final example: money amounts in which the currency symbol is written at the start but spoken at the end.
Have a go on your own.
Pause the video.
Well, hopefully you have understood that the model needs to memorise things by having states.
So this model needs a start state and an end state, as they always do.
We need to consume the '£' symbol, but we can't emit anything yet.
This state means that we've seen a '£' symbol and we're waiting to see what happens next.
Towards the end of the finite state transducer, we'll have to emit the word 'pounds'.
Let's write an arc for that.
There won't be anything to consume.
We've already consumed the '£' symbol.
Somewhere in the middle you need a finite state transducer that is a general number expander.
I'll let you have a go writing one.
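Here is a skeleton of that machine in the same notation (my own sketch: the embed helper is hypothetical, and the teens table stands in for the general number expander you're asked to write):

```python
def embed(arcs, prefix):
    """Rename every state in a sub-machine so it can be plugged into a
    bigger machine without name clashes (a hypothetical helper)."""
    return [(prefix + s, i, o, prefix + d) for (s, i, o, d) in arcs]

MONEY = (
    [("start", "£", EPS, "num:start")]      # consume '£', emit nothing yet
    + embed(TEENS, "num:")                  # stand-in for a full number expander
    + [("num:end", EPS, "pounds", "end")]   # nothing left to consume: say 'pounds'
)

print(transduce(MONEY, "start", "end", "£15"))  # -> 'fifteen pounds'
```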
If you manage that, try generalising to multiple currencies in which the symbol is written at the beginning but the word is said at the end.
I'll give you a hint on how to do that.
It can use the same number-expanding finite state transducer, but it'll have to do something different at the beginning and at the end.
Then you'll have discovered a really useful property of finite state transducers: that we can join together smaller ones to make larger ones.
Here, we reused a general-purpose number expander to expand currency amounts.
That's good engineering.
We build one good number expander, and then use it many times.
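One possible solution to the multi-currency exercise, continuing the same sketch (and only one of several designs): because the states are the only memory, each currency takes its own copy of the embedded number expander, and which copy we finish in tells us which currency word to emit.

```python
CURRENCY = (
    [("start", "£", EPS, "gbp:start"),      # the path taken here...
     ("start", "$", EPS, "usd:start")]      # ...remembers the currency
    + embed(TEENS, "gbp:")                  # one copy of the number expander
    + embed(TEENS, "usd:")                  # per currency: states are the memory
    + [("gbp:end", EPS, "pounds",  "end"),
       ("usd:end", EPS, "dollars", "end")]
)

print(transduce(CURRENCY, "start", "end", "$12"))  # -> 'twelve dollars'
```

Notice the cost: duplicating the expander for every currency is exactly the earlier point that the more the machine has to remember before it can emit anything, the more states it needs.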
This is the end of this little sequence of topics on text processing.
But finite state methods are really important, and they're going to reappear much later, in Automatic Speech Recognition.
They could be used to model language: in other words, the possible sequences of words, and perhaps their probabilities of occurring.
They can also be used as the basis for a model of speech itself.
If we have a model of language (of word sequences) and a model of speech, we can then take advantage of this ability to combine small, finite state models into larger ones to make models of complete utterances from models of smaller things such as words or sub-word units like phonemes.
