Forum Replies Created
Using screen to create a persistent remote session
Now let’s imagine that we want to run a program on the remote machine that takes a long time. It’s risky to do this directly on the command line, because if the ssh connection fails, then the program will terminate.
One option is to use nohup (find information on the web), but I strongly recommend using a different (and strangely underused) technique: screen.
screen gives you a persistent session on the remote machine. You can disconnect and reconnect to it whenever you like, and leave programs running in it. They will continue running after you disconnect.
To make screen easy to use, you need to set up a configuration file first. Log in to the remote machine and create a file called .screenrc in your home directory there. Note that this filename starts with a period. Here is an example of what to put in this file – the first line has a tricky character sequence – two backquotes inside double quotes:
escape "``"
and after that line, put this:
# define a bigger scrollback, default is 100 lines
defscrollback 1024
hardstatus on
hardstatus alwayslastline
hardstatus string "%{.bW}%-w%{.rW}%n %t%{-}%+w %=%{..G} %H %{..Y} %d/%m %C%a "
# some tabs, with titles - change this to whatever you like
screen -t script -h 1000 1
screen -t log -h 1000 2
screen -t bash -h 1000 3
Once you’ve created that file (I’ve also attached an example – download and rename it), log out of the remote machine.
Now connect to the remote machine like this, which uses ssh to start screen:
$ ssh -t kairos.inf.ed.ac.uk /usr/bin/screen -D -R
You can navigate between the tabs using the key sequence ` (backquote) then either p (previous) or n (next). You have three separate shells running in this example.
If you actually need to type a backquote character, just press ` twice.
Try disconnecting – either just kill the Terminal that screen is running in, or use the key sequence ` (backquote) then d (for ‘detach’). To re-connect, just use the ssh command above. Your screen will come back just as you left it – magic!
Attachments:
June 6, 2016 at 14:44 in reply to: Synthesising directly from a phone sequence rather than text #3230
SayPhones is probably only going to work for a diphone voice, not a Multisyn unit selection voice. Try loading a diphone voice and see if that works. You are going to get monotonic F0 though, I think.
Two things to do:
1. ask for more quota (http://www.inf.ed.ac.uk/systems/support/form/ – mention my name)
2. make a directory in /disk/scratch on the local machine you are working on – this is NOT backed up, so should just be used for temporary working space
I’ve updated …/dnn_tts/configuration/configuration.py in the centrally installed version.
June 5, 2016 at 10:41 in reply to: Synthesising directly from a phone sequence rather than text #3224
I’m not sure of the solution to this. Let’s talk in person – is Festival the best framework for you, or should we consider a DNN system?
You don’t need utterance structures for the very simple case that you are trying at this point (treat letters as phonemes, and use no other linguistic information). To build a voice, you simply need to figure out how to create the input features for training the DNN. You need to use the “Prepare the input labels” steps of the DNN voice building exercise as your starting point, but replace some steps with your own scripts.
For example, you do not need the step “Convert utterance structures to full context labels” – you need to create these full context labels using your own script (I suggest starting with a “full context” of triphones or quinphones).
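Here is a minimal sketch of the kind of script I mean, assuming you start with triphones – the left^current+right label format and the function name are just for illustration, not a format that anything in the toolkit requires:

# Sketch only: build triphone "full context" labels from a letter sequence.
# The label format (left^current+right) is illustrative - adapt it to whatever
# your question file and label-conversion step expect.

def make_triphone_labels(letters, pad="sil"):
    """Return one label per letter, of the form left^current+right."""
    padded = [pad] + list(letters) + [pad]
    labels = []
    for i in range(1, len(padded) - 1):
        labels.append("%s^%s+%s" % (padded[i - 1], padded[i], padded[i + 1]))
    return labels

# e.g. the word "hello", treating each letter as a phoneme:
# make_triphone_labels("hello") gives
# ['sil^h+e', 'h^e+l', 'e^l+l', 'l^l+o', 'l^o+sil']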
The “Convert label files to numerical values” step will be essentially the same, but you’ll need to modify the questions so that they correctly query your labels.
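As an illustration of what those questions might look like if you write your own conversion script – the question set here (just the identity of the letter in each context position) is only an example, not what the real question file contains:

# Sketch only: turn triphone letter labels (left^current+right) into binary
# vectors. Each "question" here asks whether a given context position is a
# given symbol; real question sets can be much richer than this.

import string

SYMBOLS = list(string.ascii_lowercase) + ["sil"]

def label_to_vector(label):
    left, rest = label.split("^")
    current, right = rest.split("+")
    vector = []
    for value in (left, current, right):
        # one-hot encode the symbol in this context position
        vector.extend(1 if symbol == value else 0 for symbol in SYMBOLS)
    return vector

# label_to_vector("h^e+l") has length 3 x 27 = 81 binary inputs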
It’s well worth doing all of this with your own scripts (they are quite simple) because this will give you a deeper understanding of all the steps involved. Then, you could switch to the Ossian framework, which will automate some of this for you.
Yes – you are on the right lines – just assume that each letter is a phoneme.
A malloc (“memory allocation”) error of “can’t allocate region” suggests that you are running out of memory (RAM). Try reducing the minibatch size.
In sequence training, the minibatches need to be constructed from entire utterances, rather than randomised frames. So, the minibatch size will vary slightly, and not be constant. This may be why you only get this error seemingly randomly.
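If it helps, here is a sketch of that difference (not the toolkit’s actual code): for frame-level training you can shuffle frames and cut fixed-size minibatches, but for sequence training each minibatch must hold whole utterances, so its size varies:

# Sketch (not the toolkit's code): frame-shuffled vs utterance-level minibatches.

import random

def frame_minibatches(frames, batch_size):
    """Frame-level training: shuffle all frames, then cut fixed-size batches."""
    shuffled = list(frames)
    random.shuffle(shuffled)
    for i in range(0, len(shuffled), batch_size):
        yield shuffled[i:i + batch_size]

def utterance_minibatches(utterances, target_frames):
    """Sequence training: pack whole utterances until roughly target_frames
    frames, so the actual minibatch size varies from batch to batch."""
    batch, n_frames = [], 0
    for utt in utterances:
        batch.append(utt)
        n_frames += len(utt)
        if n_frames >= target_frames:
            yield batch
            batch, n_frames = [], 0
    if batch:
        yield batch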
You can take a copy of bandmat from /Volumes/Network/courses/ss/dnn/dnn_tts/bandmat, or add that directory to your PYTHONPATH.

Currently libxml isn’t working on the lab machines. But it’s not needed – comment out all lxml (or modules from lxml) imports. These will be in frontend/label_composer.py and frontend/label_normalisation.py.
Or, if you want to be more future-proof (you might need the libxml functions if you want to integrate with Ossian), wrap the imports in try...except, such as

try:
    from lxml import etree
except ImportError:
    print "Failed to import etree from lxml"

or

try:
    import lxml
    from lxml import etree
    from lxml.etree import *
    MODULE_PARSER = etree.XMLParser()
except ImportError:
    print "Failed to import lxml"
Stripping problematic punctuation from utts.data should be OK in your first build of this system. Come back and solve this problem later.
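If you want a starting point for that stripping, here is a sketch – it assumes your utts.data lines look like ( fileid "some text" ), so check your own file first:

# Sketch only: strip problematic punctuation from the quoted text in utts.data.
# Assumes lines of the form: ( fileid "some text" ) - check your own file first.

import re

PUNCTUATION = '";:()[]'

def strip_punctuation(line):
    match = re.match(r'\(\s*(\S+)\s+"(.*)"\s*\)', line.strip())
    if match is None:
        return line
    fileid, text = match.groups()
    cleaned = "".join(ch for ch in text if ch not in PUNCTUATION)
    return '( %s "%s" )' % (fileid, cleaned)

# with open("utts.data") as f:
#     for line in f:
#         print(strip_punctuation(line))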
Also, I think that run_dnn.py might be hardwired to use only tanh layers regardless of what the configuration file specifies.
Your reasoning behind why we need to cluster (also called “tie”) models is correct, yes.
The nodes in the tree each contain a question about a phonetic feature (e.g., “is the previous phone nasal?”). The tree is simply a CART. The phonetic features are the predictors. The predictee is the current model state’s parameters (mean and variance of its Gaussian).
The tree is learned in very much the same way as a classification or regression tree.
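If it helps to see the idea in miniature, here is a toy illustration – the questions, contexts and state names are made up, and a real tree is learned from data rather than written by hand:

# Toy illustration of decision-tree state tying. Each internal node asks a
# yes/no question about the phonetic context; each leaf is one shared (tied)
# state, trained on all the data routed to it.

NASALS = {"m", "n", "ng"}
VOWELS = {"a", "e", "i", "o", "u"}

def is_prev_nasal(context):
    return context["prev"] in NASALS

def is_current_vowel(context):
    return context["current"] in VOWELS

# A tree is either ("leaf", tied_state_name) or
# (question_function, yes_subtree, no_subtree).
TREE = (is_prev_nasal,
        ("leaf", "tied_state_A"),
        (is_current_vowel,
         ("leaf", "tied_state_B"),
         ("leaf", "tied_state_C")))

def tie(context, tree=TREE):
    """Walk the tree and return the shared state this context maps to."""
    if tree[0] == "leaf":
        return tree[1]
    question, yes_branch, no_branch = tree
    return tie(context, yes_branch if question(context) else no_branch)

# tie({"prev": "n", "current": "a"})  ->  "tied_state_A"
# tie({"prev": "s", "current": "a"})  ->  "tied_state_B"
# tie({"prev": "s", "current": "t"})  ->  "tied_state_C"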
Your question about how this eventually affects the generated waveform can be restated in two parts:
1. how does this affect the models’ parameters?
2. how do model parameters affect the waveform that they generate?
The answer to 1. you have already figured out: the models share parameters, that’s all. We don’t need to average the group of models (actually, model states) that end up at a leaf – we simply have only one shared (= tied) state there and it is trained on all the corresponding data. So, if you like, you might instead think of the tree as finding all the suitable data that this shared state should be trained on, pooled across a group of sufficiently-similar contexts.
The answer to 2. is via the usual generation process of statistical parametric speech synthesis: the models generate trajectories of vocoder parameters, and those are then vocoded into a waveform.
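Purely to make that concrete, here is a toy sketch (it ignores delta features, MLPG and the vocoder itself): if each frame is assigned to a tied state, the simplest possible “trajectory” is just the sequence of those states’ means, which would then be handed to a vocoder:

# Toy sketch only: a piecewise-constant "trajectory" of vocoder parameters,
# built by repeating each tied state's mean for its duration. Real systems use
# dynamic (delta) features and MLPG to get smooth trajectories, then a vocoder
# turns the parameters into a waveform.

def generate_trajectory(state_sequence, durations, state_means):
    """state_sequence: list of tied state names
    durations: number of frames to spend in each state
    state_means: dict mapping state name -> list of parameter means"""
    trajectory = []
    for state, n_frames in zip(state_sequence, durations):
        trajectory.extend([state_means[state]] * n_frames)
    return trajectory  # one parameter vector per frame

# e.g.
# means = {"tied_state_A": [1.0, 0.2], "tied_state_B": [0.5, 0.9]}
# generate_trajectory(["tied_state_A", "tied_state_B"], [3, 2], means)
# -> [[1.0, 0.2], [1.0, 0.2], [1.0, 0.2], [0.5, 0.9], [0.5, 0.9]]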
To be more precise: most frames of all regions labelled as silence are removed.
It improves training (as found empirically) because otherwise the training data is dominated by silence frames and the network will optimise for generating silence in preference to speech sounds (it’s very easy to minimise the error on silence, and that contributes too much to total error if there are a lot of silence frames).
To prevent the truncation of phrase-final speech sounds, the correct solution is to improve the forced alignment.
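For what it’s worth, here is a sketch of the idea (not the toolkit’s implementation – the margin parameter is just my own illustration): drop frames labelled as silence, but keep a few either side of speech so boundaries aren’t clipped:

# Sketch of silence trimming (not the toolkit's implementation): remove most
# silence frames, keeping a small margin around each speech region.

def keep_mask(frame_labels, silence_label="sil", margin=5):
    """Return one boolean per frame: True if the frame should be kept."""
    n = len(frame_labels)
    is_speech = [lab != silence_label for lab in frame_labels]
    keep = list(is_speech)
    for i, speech in enumerate(is_speech):
        if speech:
            # also keep `margin` frames either side of every speech frame
            for j in range(max(0, i - margin), min(n, i + margin + 1)):
                keep[j] = True
    return keep

# e.g. 20 silence + 10 speech + 20 silence frames, with margin=3:
# labels = ["sil"] * 20 + ["a"] * 10 + ["sil"] * 20
# sum(keep_mask(labels, margin=3)) == 16   # 10 speech + 3 + 3 silence frames kept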
I believe you can (and should) now switch to using run_lstm.py in all cases, both LSTM and purely feed-forward architectures.