DNN Basics
- This topic has 6 replies, 3 voices, and was last updated 8 years, 10 months ago by Simon.
-
March 11, 2016 at 16:06 #2719
I have a few questions regarding the basic explanations of simple NNs in the videos:
In “The basics”
1. Is each input unit a vector, or is each vector an input, with each unit representing one dimension of the vector? And does each input represent one frame? So are frames trained one by one, with all the inputs trained in one epoch and then trained again in the next epoch?
2. By a “matrix”, do you mean the weights that multiply the input? In one hidden layer, each weight is a number, so why do the weights form a matrix instead of a vector?
In “Preparing the data”
3. Could you explain more about the clock ticking thing?
-
March 12, 2016 at 12:48 #2725
All good questions – I’ll cover these points in the next lecture.
-
March 13, 2016 at 11:11 #2779
Each individual input unit (“neuron”) in a neural network takes just one number as its input (typically scaled to between -1 and 1, or to between 0 and 1).
The set of input units therefore takes a vector as input: this is a single frame of linguistic specification.
In the simplest configuration (a feed-forward network), the neural network takes a single frame of linguistic specification as input, and predicts a single frame of vocoder parameters as output. Each frame is thus processed independently of all the other frames.
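To make the frame-by-frame picture concrete, here is a minimal NumPy sketch of a forward pass. The layer sizes, the tanh hidden activation, the random (untrained) weights, and the name `forward` are all assumptions for illustration only; a real system would have many more input features, and far more hidden units.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_in, n_hidden, n_out = 6, 4, 3

# One weight matrix and one bias vector per layer (random, i.e. untrained).
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_out, n_hidden))
b2 = np.zeros(n_out)

def forward(x):
    """Map one frame of input features to one frame of output features."""
    h = np.tanh(W1 @ x + b1)   # hidden layer: non-linear activation
    return W2 @ h + b2         # output layer: linear activation

# Three frames of linguistic specification, one row per frame, scaled to [0, 1].
# Each of the n_in input units receives exactly one number from a frame.
frames = rng.uniform(0.0, 1.0, size=(3, n_in))

# Each frame is processed independently of all the other frames.
outputs = np.array([forward(x) for x in frames])
```

Changing or reordering the other rows of `frames` would not change the output for any given frame, which is what "processed independently" means here.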
-
March 13, 2016 at 11:21 #2780
Don’t use phrases like “the frames are trained” or “inputs are trained” – you will confuse yourself. The trainable parameters in a neural network are the weights that connect one layer to the next layer, plus the bias values in each unit (neuron).
In the simplest form of training, the training input vectors are presented to the network one at a time, and a forward pass is made. The error between the network’s outputs and the correct (target) outputs is measured. After all training vectors have been presented, the average error at each output unit is calculated, and this signal is back-propagated: the weights (and biases) are thus updated by a small amount. This whole procedure is called gradient descent. It is repeated for a number of epochs (i.e., passes through the complete training set), until the average error – summed across all the outputs – stops decreasing (i.e., converges to a stable value).
In practice, the above procedure is very slow. So, the weights are usually updated after processing smaller subsets (called “mini-batches”) of the training set. The training procedure is then called stochastic gradient descent. The term stochastic is used to indicate that the procedure is now approximate, because we are measuring the error on only a subset of the training data. It is usual to randomly shuffle all the frames in the training set before dividing it into mini-batches, so that each mini-batch contains a nice variety of input and output values. In general, this form of training is much faster: it takes fewer epochs to converge.
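The difference between the two procedures can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not how a real toolkit does it: the "network" is a single linear layer so that the gradient fits on one line, the data are synthetic and noise-free, and the function name `train`, the learning rate and the batch sizes are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for (input frame, target frame) pairs.
n_frames, n_in, n_out = 200, 5, 2
X = rng.uniform(-1.0, 1.0, size=(n_frames, n_in))
true_W = rng.normal(size=(n_in, n_out))
Y = X @ true_W

def train(batch_size, n_epochs=20, lr=0.1):
    """Train a single linear layer; return the mean squared error after each epoch."""
    W = np.zeros((n_in, n_out))
    b = np.zeros(n_out)
    shuffle_rng = np.random.default_rng(1)
    errors = []
    for epoch in range(n_epochs):
        order = shuffle_rng.permutation(n_frames)   # shuffle the frames
        for start in range(0, n_frames, batch_size):
            idx = order[start:start + batch_size]
            x, y = X[idx], Y[idx]
            err = (x @ W + b) - y                   # forward pass, then error
            # Gradient of half the squared error, averaged over this batch.
            W -= lr * (x.T @ err) / len(idx)
            b -= lr * err.mean(axis=0)
        errors.append(float(np.mean((X @ W + b - Y) ** 2)))
    return errors

full_batch = train(batch_size=n_frames)   # gradient descent: 1 update per epoch
mini_batch = train(batch_size=20)         # mini-batches: 10 updates per epoch
```

With the same number of epochs, the mini-batch run makes ten times as many weight updates, so on this toy problem `mini_batch[-1]` ends up well below `full_batch[-1]`. Setting `batch_size=n_frames` recovers plain gradient descent as described above.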
-
March 13, 2016 at 11:26 #2781
There is some variation in the terminology used to refer to the weights that connect one layer to the next. Because the number of weights between two layers is equal to the product of the numbers of units in the two layers, it is natural to think of the weights as being a rectangular matrix: hence “weight matrix”.
However, many authors conceptualise all the trainable parameters of the network (several weight matrices and all the individual biases) as one variable, and they will place them all together into a vector: hence “parameter vector” or “weight vector”. This is a notational convenience, so we can write a single equation for the derivative of the error with respect to the weight vector as a whole.
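A small NumPy sketch of the two views, using made-up layer sizes (6 inputs, 4 hidden units, 3 outputs). The variable names, including `theta` for the parameter vector, are just a common convention and not taken from the videos.

```python
import numpy as np

# Hypothetical layer sizes: 6 inputs, 4 hidden units, 3 outputs.
n_in, n_hidden, n_out = 6, 4, 3

W1 = np.zeros((n_hidden, n_in))    # 4 x 6 = 24 weights between input and hidden
b1 = np.zeros(n_hidden)            # one bias per hidden unit
W2 = np.zeros((n_out, n_hidden))   # 3 x 4 = 12 weights between hidden and output
b2 = np.zeros(n_out)               # one bias per output unit

# The same parameters, flattened and joined into one "parameter vector".
theta = np.concatenate([W1.ravel(), b1, W2.ravel(), b2])
print(W1.size, W2.size, theta.size)   # prints: 24 12 43
```

Each weight is still a single number; the matrix is just the natural way to index them, with one row per unit in the receiving layer and one column per unit in the sending layer.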
-
March 13, 2016 at 12:39 #2782
Adding to this topic on the basics of NNs…
I don’t understand how people choose the number of hidden layers, the number of units per layer, and the functions to put in them. Is it just a matter of trial and error? For example, in the Zen reading, a footnote says:
“5 Although the linear activation function is popular in DNN-based regression, our preliminary experiments showed that the DNN with the sigmoid activation function at the output layer consistently outperformed those with the linear one.”
Is there any intuition for choosing your functions, based on how you think the net should transform the input to get to the desired output (specifically here for speech synthesis)? Or do you try different combinations and, once you have the best result, try to understand why that architecture was better?
-
March 13, 2016 at 13:23 #2785
Choices about sizes and numbers of hidden layers are generally made empirically, to minimise the error on a development set. In the quote you give above, that is what Zen is saying: he tried different options and chose the one that worked best.
It is computationally expensive to explore all possible architectures, so in practice these things are roughly optimised and then left fixed (e.g., 5 hidden layers of 1024 units each).
The transformation from input linguistic features to output vocoder features is highly non-linear. The only essential requirement in a neural network is that the units in the hidden layers have non-linear activation functions (if all activations were linear, the network would be a linear function regardless of the number of layers: it would be a sequence of matrix multiplies and additions of biases).
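The point about linear activations is easy to check numerically. In this sketch the sizes and random weights are arbitrary assumptions, and tanh simply stands in for any non-linear activation function.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 6, 4, 3
W1, b1 = rng.normal(size=(n_hidden, n_in)), rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), rng.normal(size=n_out)
x = rng.uniform(-1.0, 1.0, size=n_in)

# Two layers with linear activations...
two_linear_layers = W2 @ (W1 @ x + b1) + b2

# ...give exactly the same output as one layer with combined weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b
collapses = np.allclose(two_linear_layers, one_layer)

# With a non-linear hidden activation, that single layer no longer matches.
with_tanh = W2 @ np.tanh(W1 @ x + b1) + b2
differs = not np.allclose(with_tanh, one_layer)
```

Here `collapses` is `True` for any choice of weights, because W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2). The same argument applies to any number of stacked linear layers.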