You should focus on experiments using speaker-independent systems for your report.
To see why, consider reusing the models you already trained in the speaker-dependent experiment.
If you use models trained only on your own speech to recognise the Test set of another speaker, what Word Error Rate do you expect?
The WER will probably be high unless the other speaker's recordings sound very much like yours, so this experiment isn't going to tell you anything very informative about what makes ASR better or worse. Instead, your main experiments should investigate speaker-independent systems trained and tested on larger numbers of speakers, so that your findings potentially have wider applicability.
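As a reminder of what is being measured: WER is the number of word substitutions, deletions, and insertions (found by word-level edit distance) divided by the number of words in the reference transcription. A minimal self-contained sketch:

```python
def wer(reference, hypothesis):
    """Word Error Rate = (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("one two three four", "one too three"))  # 1 sub + 1 del over 4 words = 0.5
```

In practice your toolkit's scoring tool reports this for you; the sketch is just to make clear what the number means.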
When designing your experiments, the test speakers must be distinct from the training speakers, and you need to control all factors that are not of interest, including accent, gender, microphone type, and the amount of training data.
Below are some possible designs for experiments investigating the effect of gender in training and testing the digit recogniser. Some of these designs are better than others. Can you work out the pros and cons of each?
- The effect of gender, with simplistic control over accent and microphone
- Training set: the training data of 20 female UK English speakers using headset microphones
- Test set A: the test data of 20 female UK English speakers not in the training set, also using headset microphones
- Test set B: the test data of 20 male UK English speakers (obviously not in the training set), also using headset microphones
- The effect of gender, with more sophisticated control over accent and microphone (version 1)
- Training set A: the training data of 50 female speakers, with a mixture of accents and microphones
- Training set B: the training data of 50 male speakers, with a mixture of accents and microphones in the same proportions as training set A
- Test set: the test data of 50 female speakers not in training set A, with a mixture of accents and microphones in the same proportions as training set A
- The effect of gender, with more sophisticated control over accent and microphone (version 2)
- Training set: the training data of 50 female speakers with a mixture of accents and microphones
- Test set A: the test data of 50 female speakers not in the training set, with a mixture of accents and microphones in the same proportions as the training set
- Test set B: the test data of 50 male speakers, with a mixture of accents and microphones in the same proportions as the training set
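To make the set-construction step concrete, here is a minimal sketch of how sets like those in version 2 could be assembled. The speaker metadata here is entirely made up (random genders, accents, and microphones); in your own work it would come from the corpus documentation. Sampling the same number of speakers from every (accent, microphone) cell guarantees the two test sets have the factor proportions matched:

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical speaker metadata: id -> (gender, accent, microphone)
speakers = {f"spk{i:03d}": (random.choice("MF"),
                            random.choice(["UK", "US"]),
                            random.choice(["headset", "desktop"]))
            for i in range(200)}

def sample_matched(gender, n_per_cell, exclude=frozenset()):
    """Pick n_per_cell speakers of the given gender from every
    (accent, microphone) cell, so any two sets built this way
    have identical accent/microphone proportions."""
    cells = defaultdict(list)
    for spk, (g, accent, mic) in speakers.items():
        if g == gender and spk not in exclude:
            cells[(accent, mic)].append(spk)
    chosen = []
    for cell, pool in sorted(cells.items()):
        if len(pool) < n_per_cell:
            raise ValueError(f"only {len(pool)} speakers in cell {cell}")
        chosen += random.sample(pool, n_per_cell)
    return chosen

train = sample_matched("F", n_per_cell=5)                        # training set
test_a = sample_matched("F", n_per_cell=3, exclude=set(train))   # matched gender
test_b = sample_matched("M", n_per_cell=3)                       # mismatched gender
```

The `exclude` argument enforces the rule that test speakers must not appear in the training set; male speakers are disjoint from the female training set by construction.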
In general, you should aim for more sophisticated designs that allow you to use more data, as this potentially allows you to generalise your findings better. However, there are always trade-offs to be made when deciding on an experimental design. One issue might be that you simply cannot find the same proportions for the factors you want to control. You'll need to weigh the pros and cons of a not-quite-perfectly-balanced design that covers more test speakers per condition of interest (e.g., gender in the examples above) against a perfectly balanced design that has only a few test speakers per condition. If your test set has only 5 speakers per condition, do you think the results would be very different if you picked 5 other speakers for that condition?
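One way to get intuition for that last question is to look at how much the mean WER moves when you resample which 5 speakers are in the test set. The per-speaker WERs below are invented numbers purely for illustration; in practice you would score each test speaker separately and use those figures:

```python
import random
import statistics

random.seed(1)

# Hypothetical per-speaker WERs (%) for one condition
per_speaker_wer = [4.1, 6.3, 3.8, 9.2, 5.0, 4.7, 7.5, 5.5, 6.0, 8.1]

# Mean WER over many different random picks of 5 speakers
means = [statistics.mean(random.sample(per_speaker_wer, 5))
         for _ in range(1000)]

print(f"5-speaker means range from {min(means):.1f}% to {max(means):.1f}%")
print(f"standard deviation of the mean: {statistics.stdev(means):.2f}")
```

If this spread is comparable to the difference between your experimental conditions, the result with only 5 speakers per condition is not trustworthy, which is an argument for the larger, slightly less balanced design.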
Some possible experiments to try for this data set
What effect does microphone type have?
Design an experiment to discover whether the microphone type is important. This might involve finding out whether some microphones give a lower Word Error Rate than others, or measuring the effect of a mismatch between the Training and Test sets. Remember to control all the other factors.
You can perform equivalent experiments to investigate the gender and accent factors too.
What effect does the amount of training data have?
In machine learning, it’s often said that more training data is better. But is that always the case? Design some experiments to explore this. Include cases where the Training and Test sets are well matched (e.g., in gender and/or accent) and cases where there is a mismatch. Which is more important: matched training data, or simply more data?
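When varying the amount of training data, one useful design choice is to make the subsets nested, so that the 25-speaker set contains the 10-speaker set, and so on; then the only thing changing between conditions is the amount of data, not which particular speakers happen to be included. A sketch (the speaker IDs and the `train_and_score` step are hypothetical placeholders for your own training pipeline):

```python
import random

random.seed(2)

# Hypothetical pool of training speaker IDs
all_training_speakers = [f"spk{i:03d}" for i in range(100)]
random.shuffle(all_training_speakers)

# Nested subsets: each larger set contains all the smaller ones
for n in (10, 25, 50, 100):
    subset = all_training_speakers[:n]
    # train_and_score(subset, test_set) would train a model on these
    # speakers and return its WER -- that function is your pipeline, not shown here
    print(f"train on {n:3d} speakers, e.g. {subset[:3]} ...")
```

Plotting WER against training-set size then gives a learning curve, and repeating it for matched versus mismatched test sets addresses the question above directly.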
These questions are very important in commercial systems: it costs a lot of money to obtain the training data, so we want to collect the most useful data we can.
There are many other experiments you can think about! You don’t have to restrict yourselves to the ones listed here!