Forum Replies Created
Record the lectures
This is already being done and the recordings automatically appear on Learn, usually just a few hours after each lecture. This was announced at the start of the course.
Lectures are too short
When combined with 1 hour of tutored lab time per week, each student has 2 or 3 hours of contact per week. This is consistent with PPLS guidelines for 10 and 20 credit courses.
To get the most benefit from the available lecture time, do the following:
- Do all essential reading well before the lecture
- Ask your questions in advance
- Come to the 9.00 lectures
- Use the lab sessions to ask further questions
I don’t like the lecture room
There is an instant poll to find out whether this is a widely-held view.
Lab sessions are too short to finish all the work required
This is by design. You need to spend a substantial amount of time in the lab beyond the tutored lab sessions. You can obtain an unlimited amount of help by using the forums.
To get the most out of the tutored lab sessions, it should be obvious that you need to read all the instructions in advance. I’ve noticed some students spending time in the tutored sessions simply going through the instructions. This is not a smart strategy.
Ideally, I would like to offer 2-hour lab sessions. However, the current lab facility is not big enough for this and is in high demand. I hope to do this in future years, when we might get a larger lab.
Provide the lab instructions further in advance
I think this one is the students’ responsibility: be more proactive and read further ahead in the course materials. This is a level 10/11 course. The materials have been available since before the start of the course.
However, I will find some time within each 10.00 lecture to highlight how far through the 2nd assignment you should be by that point.
Need more “big picture” / recap / overview
In the next part of the course (speech recognition), I will include a short section at the start of each 10.00 lecture, summarising what you should know at that point, and how the current topic fits into the overall picture.
I’m sorry about this – I have no control over it. The teaching offices are responsible for creating the submission links and sending you all information on how to submit. Please visit them in person tomorrow morning to find out what to do.
Again – my apologies – I have to rely on the teaching offices for this stuff.
I really cite just a little in my report, is that acceptable?
You need to do both.
Citing published work is not a substitute for providing explanations in your own words. You should combine the two, using citations to support your explanations.
Website structure is too deeply nested
I accept that the nesting has gone too deep in some places. Restructuring the pages part-way through a course is likely to cause more problems than it solves, so this feedback will be acted upon in the future, for both Speech Synthesis and Speech Processing.
Let’s clarify your understanding – the most important thing to say first is that we must separate the description of the analysis and synthesis parts.
You said that we calculate “one coefficient for each pitch period” – no, we calculate the complete set of filter coefficients (there might be 16 of them, say) for each analysis frame.
We have choices about the analysis frame: it might be a pitch period, but that would require pitch marking the signal, which is error-prone. So, let’s just consider the simple case where the analysis frame is of fixed duration (25ms, say) and the frames are spaced at fixed times (every 10ms, say).
After calculating the filter coefficients, we inverse filter the frame of speech signal. This gives us the residual signal for the current analysis frame. The residual is a waveform. If we use this residual signal to excite the filter (i.e., as the excitation signal), we will get near-perfect reconstruction of the frame of speech being analysed.
We store the filter coefficients and the residual waveform together. They are “matched”: only the combination of residual and filter from the same analysis frame will give near-perfect speech output. If we “mix and match” across different analysis frames, we will not get such good reconstruction.
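To make the analysis stage concrete, here is a rough sketch in Python – a hypothetical illustration, not actual course code. The 16kHz sample rate, 25ms frame, order-16 filter and the random placeholder frame are all assumptions:

```python
import numpy as np
from scipy.signal import lfilter

def lpc(frame, order=16):
    """Estimate one complete set of LPC coefficients for an analysis
    frame, using the autocorrelation method (normal equations)."""
    # Autocorrelation of the frame (in practice you would window it first)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz system R a = r, solved for predictor coefficients a1..ap
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Inverse filter A(z) = 1 - a1 z^-1 - ... - ap z^-p
    return np.concatenate(([1.0], -a))

# One fixed-duration analysis frame: 25 ms at 16 kHz = 400 samples.
# A random frame stands in for real speech here.
frame = np.random.randn(400)

A = lpc(frame, order=16)               # complete set of coefficients
residual = lfilter(A, [1.0], frame)    # inverse filtering gives the residual
resynth = lfilter([1.0], A, residual)  # excite the filter with its residual

# Matched residual + filter give near-perfect reconstruction
assert np.allclose(resynth, frame)
```

Note that the final assertion only holds because the residual and filter come from the same analysis frame – exactly the “matched” property described above.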
You have correctly understood that we “should not manipulate the filter, because it might deviate from the perfect match”. That is true. So, we will only manipulate the filter by small amounts (for join smoothing), to avoid too much mismatch. We may also manipulate the residual using overlap-and-add (to modify F0) – this will also create some amount of mismatch. So, again, we will limit the amount of manipulation to limit the severity of the mismatch.
Now on to the synthesis stage, which happens every time we use the TTS system to say a sentence…
Here, we have choices about the resynthesis frame. It could be as simple as the fixed analysis frame from above. This will work, but because the filter coefficients are updated at a fixed rate (every 10ms, which is 100 times per second) we may hear an artefact: a constant 100Hz “buzz”.
We can’t avoid updating the filter, but we can be clever about the rate at which we do it. If we update not every 10ms but every pitch period, then we will create an artefact not at 100Hz but at F0. Since the listener will perceive F0 anyway (in voiced speech), then we can “hide” the artefact “behind” the natural F0.
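As a hedged sketch of that idea – the function, its arguments, and the carrying of filter state across segment boundaries are my simplifications, not a real implementation:

```python
import numpy as np
from scipy.signal import lfilter

def resynthesise_pitch_sync(residual, frames_A, pitch_marks, hop=160):
    """Hypothetical sketch: run the residual through the LPC filter,
    switching coefficients at pitch marks instead of every hop samples
    (hop=160 is 10 ms at 16 kHz). frames_A[i] holds the coefficients of
    the analysis frame starting at sample i * hop. Carrying the filter
    state across boundaries is a simplification that keeps the output
    waveform continuous."""
    out = np.zeros_like(residual)
    zi = np.zeros(len(frames_A[0]) - 1)  # filter memory
    start = 0
    for end in list(pitch_marks) + [len(residual)]:
        # pick the analysis frame whose coefficients cover this segment
        A = frames_A[min(start // hop, len(frames_A) - 1)]
        out[start:end], zi = lfilter([1.0], A, residual[start:end], zi=zi)
        start = end
    return out
```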
In diphone synthesis, there is just one recorded copy of each diphone. The F0 and duration of that recorded copy will be arbitrary. If we simply concatenated these recordings, we would get an arbitrary and very probably discontinuous F0 contour. We must manipulate the recording in order to impose the predicted F0 (e.g., to get gradual declination over a phrase), and to impose predicted duration.
In unit selection synthesis, we have many recordings of each diphone to choose from. In some versions of unit selection (covered in detail in the Speech Synthesis course), we will use the front end’s predictions of F0 and duration to help us choose the most appropriate one.
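As a toy illustration of that choice – the candidate format, weights, and cost function are assumptions, and real systems also use join costs and a search over the whole sentence:

```python
# Hypothetical sketch: rank candidate units by a target cost built from
# the front end's predicted F0 and duration.
def best_candidate(candidates, target_f0, target_dur, w_f0=1.0, w_dur=1.0):
    def target_cost(unit):
        return (w_f0 * abs(unit["f0"] - target_f0)
                + w_dur * abs(unit["dur"] - target_dur))
    return min(candidates, key=target_cost)

candidates = [{"f0": 110.0, "dur": 0.12},
              {"f0": 128.0, "dur": 0.09},
              {"f0": 119.0, "dur": 0.10}]
print(best_candidate(candidates, target_f0=120.0, target_dur=0.10))
# -> {'f0': 119.0, 'dur': 0.1}
```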
If you use Chrome, there are various plugins that will do this for you, such as HTML5 Video Speed Control or Video Speed Controller.
In a vector graphics format, a figure is represented by actual lines and text, so it can be scaled without losing quality. Vector graphics can be created with many drawing packages, provided you export in a suitable format. PDF is usually best.
The alternative is a bitmap (image) format, in which a figure is represented by individual pixels. There are many file formats for this (PNG, BMP, JPG). With these, you need to make sure that the resolution of the image (the number of pixels in the horizontal and vertical dimensions) is high enough.
A good rule of thumb: if you print your document on A4 paper, you should not be able to detect the individual pixels in your figures.
Sometimes, you have to use a bitmap format. For example, a spectrogram is inherently an image, not a vector graphic. In these cases, make the window as large as possible on your screen before taking the screenshot. This will give you the maximum resolution.
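For example, if you make your figures in Python with matplotlib (any drawing package with PDF export works the same way), a minimal sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 1, 500)
fig, ax = plt.subplots()
ax.plot(t, np.sin(2 * np.pi * 5 * t))
ax.set_xlabel("Time (s)")

# Vector output: stored as lines and text, so it scales without
# losing quality. PDF is usually best.
fig.savefig("figure.pdf")

# Bitmap output: fix the resolution explicitly; 300 dpi is a common
# target for printing on A4.
fig.savefig("figure.png", dpi=300)
```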
It is possible to get really excellent results from Microsoft Word, if you are an expert user and know how to control all the settings. However, in my experience very few people manage that. The default output from Word looks ugly. It is poor at typesetting equations. For long documents, it becomes unreliable.
For these reasons, I recommend learning (and mastering) LaTeX. It is a little harder to learn than Word, but its default output is better. I don’t recommend learning it just to write the coursework for this course; but, if you need to write a dissertation later in your programme, this is a better tool than Word.
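For instance, here is a minimal LaTeX document that typesets an equation – the equation is just the standard linear-prediction relation, used as an example:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}

The source-filter model reconstructs the speech sample $s[n]$ from the
residual $e[n]$ and the predictor coefficients $a_k$:
\begin{equation}
  s[n] = e[n] + \sum_{k=1}^{p} a_k \, s[n-k]
\end{equation}

\end{document}
```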
This is a case where citing the online version is the correct thing to do. Cite it as you would any other URL (e.g., mention the date on which you last accessed it).
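If you are using BibTeX, a hypothetical entry might look like this – the key, author, title, and URL are placeholders, not a real reference:

```bibtex
% Sketch of a BibTeX entry for an online-only source, recording
% the date of last access in the note field.
@misc{author2016online,
  author       = {Author, A.},
  title        = {Title of the Online Version},
  howpublished = {\url{https://example.com/resource}},
  note         = {Accessed: 25 October 2016}
}
```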