Forums › Speech Synthesis › Merlin › Problems training the dnn
- This topic has 9 replies, 4 voices, and was last updated 8 years ago by Felipe E.
June 8, 2016 at 10:32 #3238
I’m trying to train on a small data set and I’ve been having a lot of trouble training the net (working on the AT computers, by the way…).
1) If I try to train with “run_dnn.py”, I get this error (Andy was getting the same error yesterday):
File "/Volumes/Network/courses/ss/dnn/dnn_tts/run_dnn.py", line 957, in <module>
main_function(cfg)
File "/Volumes/Network/courses/ss/dnn/dnn_tts/run_dnn.py", line 765, in main_function
hyper_params = cfg.hyper_params, buffer_size = cfg.buffer_size, plot = cfg.plot)
File "/Volumes/Network/courses/ss/dnn/dnn_tts/run_dnn.py", line 182, in train_DNN
private_l2_reg = float(hyper_params['private_l2_reg'])
KeyError: 'private_l2_reg'
2) If I try to train with "run_lstm.py", I first get all these "malloc" errors (I read the post about it), which I "fixed" by restarting the terminal; they no longer appear, so the training effectively completes.
But when I go to the generation step, I realise that the model has not been saved in the "nets_model" folder (nothing in there). I checked the permissions and ran the training again, but it still does not seem to save the model anywhere (I also don’t see any line in the log file saying that it is saving the model). Anything I could do to fix this?
Thank you!
-
June 8, 2016 at 13:40 #3243
1. Some of the variables are not used in the code and hence are commented out in configuration/configuration.py – you can search for "private_l2_reg" in the hyper-params section of that file and uncomment it. However, there is no reason to use "run_dnn.py" now, as everything has been updated in "run_lstm.py". Please set the "sequential_training" variable to False if you want to run a DNN, and to True if you have RNN/LSTM hidden layers. Also, please update "configuration/configuration.py" to the current version.
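For illustration, the two settings mentioned above look roughly like this (a sketch with made-up values; the key name "private_l2_reg" comes from the traceback and "sequential_training" from the advice above, but the real configuration/configuration.py is laid out differently):

```python
# Sketch of the relevant settings, NOT Merlin's actual configuration file.
hyper_params = {
    'learning_rate': 0.0002,      # value is illustrative
    'l2_reg': 0.00001,            # value is illustrative
    'private_l2_reg': 0.00001,    # uncomment/add this key to avoid the KeyError
}

# False -> plain feed-forward DNN; True -> RNN/LSTM hidden layers
sequential_training = False
```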
2. Please use a smaller learning rate for training — or else your training may converge before the 10th epoch and hence the model won't be saved. Alternatively, you can comment out line 313 in the current version of the script "run_lstm.py", which says to save the model only after the 10th epoch:
if epoch>10: ## comment this line
cPickle.dump(best_dnn_model, open(nnets_file_name, 'wb'))
-
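As an illustration of what saving from epoch 1 onward could look like, here is a Python 3 sketch (the function name and arguments are hypothetical stand-ins; the real loop lives inside "run_lstm.py" and uses cPickle, the Python 2 name for pickle):

```python
import pickle

def train_and_checkpoint(model, nnets_file_name, validation_errors):
    """Save the model whenever validation error improves, from epoch 1 onward.

    Sketch of the change discussed above: instead of saving only after
    epoch 10, keep the best model seen so far regardless of epoch.
    `validation_errors` stands in for the per-epoch errors Merlin computes.
    """
    best_error = float('inf')
    best_epoch = None
    for epoch, val_error in enumerate(validation_errors, start=1):
        if val_error < best_error:
            best_error = val_error
            best_epoch = epoch
            with open(nnets_file_name, 'wb') as f:
                pickle.dump(model, f)  # overwrite with the best model so far
    return best_epoch, best_error
```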
July 29, 2016 at 22:50 #3923
Hi,
With regard to your second point, how problematic is it if you use that fix (commenting out the if condition) and the best model saved is from the 3rd epoch, but the training does not fully converge until around epoch 20? I’ve had better-sounding results from models trained for longer, until around convergence.
Thanks
-
July 31, 2016 at 16:57 #3980
Hi Andrew,
I would use the best model from around convergence, but you can compare results and decide. For example, you can modify line 313 so that models from epoch 3 onward can be saved:
if epoch>3:
Thanks,
Felipe
-
August 1, 2016 at 13:23 #3982
Thanks,
Would it be enough to just use the model from the last epoch? Or would it be best to save the model at each epoch, then take the best one from around convergence and use that?
-
August 1, 2016 at 20:29 #4007
Hi Andrew,
I would say that you should use the last model. However, maybe your data is not converging. Could you post a picture of the errors, or provide the values of training and validation errors, please?
Thanks,
Felipe
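If you do save a separate checkpoint per epoch (a hypothetical workflow — the stock script keeps a single file), picking the epoch with the lowest validation error is a one-liner:

```python
def pick_checkpoint(val_error_by_epoch):
    """Return the epoch number with the lowest validation error.

    `val_error_by_epoch` maps epoch number -> validation error, e.g. as
    parsed from the training log (hypothetical helper, not part of Merlin).
    """
    return min(val_error_by_epoch, key=val_error_by_epoch.get)
```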
-
August 1, 2016 at 20:37 #4009
-
August 1, 2016 at 21:05 #4018
OK, do what Simon suggested.
The plot looks strange. In my opinion, you should also try:
– Using the model stored at epoch 3.
– Decreasing the learning rate: try 0.1 or 0.5 times the learning rate you are currently using.
Thanks,
Felipe
-
August 1, 2016 at 21:16 #4025
Especially for this model (8 kHz), models trained for longer tend to sound better, as the buzz/stochastic noise decreases to nearly nothing, so the silences between words are preserved. The 0.0001 learning rate for this model does give a better validation error and shows a better trend.
This behaviour is particularly prevalent in the 16 kHz models.
-
August 1, 2016 at 21:24 #4030
Yes, that looks much better.
Felipe