Forum Replies Created
Your proposal is to use perceptual data (i.e., from listening tests with human subjects) to define a target cost function. It’s a good idea, and has been tried, but it’s difficult to get enough perceptual data to automatically learn such a function.
In the following paper, we describe a simple target cost function (in the form of a classifier) that is learned from perceptual data. It worked, but did not beat Festival’s standard IFF target cost function. Note that our novel target cost function is still using only linguistic features as input, and doesn’t use acoustic properties of the candidates.
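If you wanted to experiment with the general idea yourself, here is a minimal sketch, not the method from the paper: train a classifier on listener judgements of candidate units and use its predicted probability of "perceptually bad" as the target cost. The features, toy data and use of scikit-learn are all my own illustrative assumptions.

    # Minimal sketch (not the method from the cited paper): learn a target-cost-like
    # function from perceptual judgements. Each training example is a vector of
    # linguistic-feature mismatches between the target specification and a candidate
    # unit (e.g. stress match, phrase-position match, POS match), labelled 1 if
    # listeners judged the result bad, 0 otherwise. Features and data are toy examples.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([
        [0, 0, 0],   # all linguistic features match
        [1, 0, 0],   # stress mismatch only
        [1, 1, 0],
        [1, 1, 1],   # everything mismatches
    ])
    y = np.array([0, 0, 1, 1])   # toy listener judgements: 1 = perceptually bad

    clf = LogisticRegression().fit(X, y)

    def target_cost(mismatch_vector):
        """Use the classifier's probability of 'perceptually bad' as the cost."""
        return clf.predict_proba(np.asarray(mismatch_vector).reshape(1, -1))[0, 1]

    print(target_cost([1, 0, 1]))

A learned cost of this kind still only sees linguistic features, just as in the paper; it does not look at acoustic properties of the candidates.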
http://www.outsideecho.com/llsti/pubs/Pitch_marking.pdf is wrong in this regard.
make_pm_wave is a script that runs the program pitchmark.
The documentation is here. Note that this method does actually work on speech waveforms, although at the time the manual was written we were still using Laryngograph signals, which later proved to be unnecessary.
pitchmark works as described in the video.
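For reference, here is a rough sketch of what a script like make_pm_wave does for each waveform: it just runs the Edinburgh Speech Tools pitchmark program. I have written it in Python for illustration; the flag values are typical Festvox-style settings and are assumptions to be checked against the documentation, in particular the pitch period bounds, which depend on the speaker.

    # Rough sketch of what make_pm_wave does per file: call the Edinburgh Speech
    # Tools 'pitchmark' program on each waveform. Flag values are illustrative
    # (typical Festvox-style settings); check them against the documentation and
    # adjust -min/-max (pitch period bounds in seconds) for your speaker.
    import glob
    import os
    import subprocess

    est_bin = os.path.expandvars("$ESTDIR/bin")   # assumes ESTDIR points to speech_tools

    os.makedirs("pm", exist_ok=True)
    for wav in glob.glob("wav/*.wav"):
        base = os.path.splitext(os.path.basename(wav))[0]
        subprocess.run(
            [os.path.join(est_bin, "pitchmark"), wav,
             "-o", f"pm/{base}.pm", "-otype", "est",
             "-min", "0.005", "-max", "0.012",
             "-fill", "-def", "0.01", "-wave_end"],
            check=True,
        )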
The answer is in the complete IEEE style manual (which I do not expect you to read in its entirety); it gives examples of how to cite a specific part of a work, including:
[3, pp. 5-10]
[3, eq. (2)]
[3, Fig. 1]
[3, Appendix I]
[3, Sec. 4.5]
[3, Ch. 2, pp. 5-10]
[3, Algorithm 5]
You probably do not need an appendix, unless you think that it’s a useful place for some content (e.g., tables of results) that would otherwise interrupt the flow of the main body of the paper. For example, you might have compact summaries as tables or graphs in the main body, with the underlying data in an appendix. This is optional, and most scientific papers don’t go that far.
It’s usually better, where possible, to provide results within the main body, so that the reader has them handy without turning the page. In fact, you should try hard to make tables and figures appear on the same page as the text that first refers to them.
If you do include an appendix, it will contribute to the word count.
It sounds like you are relying on letter-to-sound to provide pronunciations for this word – is that the case? You should manually add the pronunciation for “Skulason” to your lexicon. Avoid using the glottal stop in that pronunciation.
The pronunciation /s k ? l ei z n!/ looks pretty weird to me, and a native speaker of English would have difficulty producing a glottal stop in the context [s k _ l].
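For illustration only: one way to add an entry is with Festival’s lex.add.entry, placed in a Scheme file that your voice loads. The snippet below just writes such a form out; the file name, part-of-speech field and phone symbols are placeholders, not a recommended pronunciation, so substitute symbols from the phone set your lexicon actually uses.

    # Illustration only: append a Festival lex.add.entry form to a Scheme file
    # loaded by your voice. The file name, POS field ('nil') and phone symbols
    # are placeholders; substitute symbols from your own phone set, and choose a
    # pronunciation without a glottal stop.
    entry = '(lex.add.entry \'("skulason" nil (((s k uu) 1) ((l @) 0) ((s o n) 0))))'

    with open("my_lexicon.scm", "a") as f:   # hypothetical addenda file
        f.write(entry + "\n")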
That sounds reasonable, yes.
But what if, for a few individual sentences, the ‘degraded’ voice is actually more natural than the full voice? It’s perfectly possible! What response would you expect from listeners in those cases? Will you need to give them instructions about how to respond in such cases?
“How [does] the decrease of the size of the database degrade the system?”
is not a hypothesis – it’s a research question. That’s OK, but the word “how” is ambiguous: are you talking about “how much” or “in what way” or even “why”?
Degradation Category Rating would be a valid paradigm, but you need to have a non-degraded reference sample against which each test stimulus can be compared by the listener. That makes perfect sense in speech coding, where the codec will always degrade each and every sample.
But for synthetic speech, what will your reference be – natural speech?
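To make the paradigm concrete, here is a minimal sketch of how DCR trials are usually organised: each trial plays a reference sample followed by the corresponding test sample, and the listener rates the degradation on the standard 5-point scale. The file names are invented; deciding what the reference should be is exactly the question above.

    # Minimal sketch of Degradation Category Rating (DCR) trials: each trial
    # presents a reference sample then the corresponding test sample, and the
    # listener rates the degradation on a 5-point scale. File names are invented.
    import random

    SCALE = {
        5: "Degradation is inaudible",
        4: "Degradation is audible but not annoying",
        3: "Degradation is slightly annoying",
        2: "Degradation is annoying",
        1: "Degradation is very annoying",
    }

    sentences = ["utt001", "utt002", "utt003"]
    trials = [
        {"reference": f"{s}_reference.wav",   # the non-degraded reference (to be decided!)
         "test": f"{s}_degraded.wav"}         # e.g. the reduced-database voice
        for s in sentences
    ]
    random.shuffle(trials)   # randomise presentation order for each listener

    for trial in trials:
        # play trial["reference"], then trial["test"], then collect a rating from SCALE
        print(trial["reference"], "->", trial["test"])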
Q is the glottal stop. In some phone sets, the symbol ? is used (since that’s closest to the IPA symbol) – but that character would cause problems in HTK because it is used as part of the pseudo-regular expression language for state clustering.
Treat it just like any other phone as far as coverage is concerned.
Remember that the phone set depends on the dictionary you are using, and there is not necessarily any correspondence between symbols in one phone set and those in another: they are somewhat arbitrary.
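If your dictionary’s phone set does use ? (or anything else that HTK or the shell will mis-parse), one simple safeguard is to map such symbols to safe names before writing any HTK label or question files. Here is a minimal sketch of that idea; the mapping itself is just an example.

    # Minimal sketch: map phone symbols that clash with HTK's pattern syntax
    # (e.g. '?', which HTK treats as a wildcard in state-clustering questions)
    # to safe names before writing label or question files. The mapping is an example.
    SAFE_NAMES = {"?": "Q"}   # glottal stop: use 'Q' rather than the IPA-like '?'

    def htk_safe(phone: str) -> str:
        return SAFE_NAMES.get(phone, phone)

    phones = ["s", "k", "?", "l"]
    print([htk_safe(p) for p in phones])   # ['s', 'k', 'Q', 'l']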
the voice built on carefully selected, domain-specific data performs better overall than… a voice that combines both datasets
This is quite possible. More data is not always better! You should definitely try to understand why this is.
It depends what your hypothesis is. Tell me that, and we can decide what the minimum experiment required to test that hypothesis is.
Always think in terms of hypothesis testing, and not testing of voices.
Things that multiple people thought were good about the course
- Lively lectures with varied activities
- Interesting course content
- Resources are easily accessible
- The videos
- Readings, both content and quantity
- The assignment is interesting / fun / helpful for learning
The structure of the web pages for the assignment is not as good as for course content
You are correct. I may improve it in future, but do not want to change it mid-way through the course.
However, there is a deliberate design to the coursework instructions: they are intended to make you work a bit, to help you learn.
For example, the instructions do not make the dependencies between the steps explicit. You are expected to work this out as part of the learning process: e.g., you have to work out that changing the alignment will change the join positions, and therefore you would need to re-calculate the join cost coefficients.
What software can we use for running our listening tests?
In-class Question-and-Answer is better than using the forums
Yes, you are right. So, please ask more questions in class!
Provide more help on scripting in the labs
I do provide quite a bit of help in-person – just put your hand up more often and ask me for this in any lab session. There is also a forum dedicated to this topic.