Skills: Tensorboard

Tensorboard allows us to monitor the training of our model.

What you will learn in this part:

  • What is Tensorboard?
  • How to run Tensorboard on a remote machine and connect to it from a web browser on your local machine

Tensorboard is essentially a simple web server: it reads log files and serves a small website that displays them as charts and so on. The log files are written by the process that trains your model on a GPU node. You can find them in your experiment directory.
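To check that there is something for Tensorboard to display, you can look for the log files yourself. The following is a minimal sketch, assuming the training process writes standard Tensorboard event files (whose names contain "tfevents"); list_event_files is a hypothetical helper name, and the directory you pass is your own experiment directory.

```shell
# List Tensorboard event files below a directory.
# Assumption: the log files carry "tfevents" in their names, as
# standard Tensorboard event files do.
list_event_files() {
    # $1 = directory to search, e.g. your experiment directory
    find "$1" -type f -name '*tfevents*' | sort
}
```

If this prints nothing for your experiment directory, training has probably not written any logs yet.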

Connecting to Tensorboard running on a remote machine

To view Tensorboard, we simply need to visit that website in a browser. There is only one complication: you need to run Tensorboard on a machine that has access to the log files (e.g., an Eddie login node) but you need to run the web browser on your local computer (e.g., your laptop).

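The solution is “port forwarding” over ssh: connections to a port on your local machine are relayed through the ssh connection to a port on the remote machine, which creates the necessary link between the two.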

On the remote machine, check the hostname of the login node you are on:

hostname

which will return something like login01.ecdf.ed.ac.uk, where the 01 might vary. Now cd to the project directory where you are training a model (or where training has finished and you want to inspect the logs) and start Tensorboard. This assumes you have already activated the py312torch27cuda118 environment:

cd ${EXP_DIR}/${TTS_PROJECT}
OPENBLAS_NUM_THREADS=1 tensorboard --port 0 --logdir logs_and_checkpoints

Look for a message like `TensorBoard 2.20.0 at http://localhost:12345/` where the port number 12345 will vary. (Why will it vary? Because --port 0 asks for any unused port: multiple users might be running Tensorboard on the same login node, and you each need a unique port number.)
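If you would rather not read the port number off the screen by eye, it can be pulled out of that message. This is a minimal sketch, assuming the message has the form shown above; tb_port is a hypothetical helper name, not part of Tensorboard.

```shell
# Extract the port number from a Tensorboard startup message read on stdin.
# Assumption: the message contains a URL of the form http://localhost:PORT/
tb_port() {
    sed -n 's#.*http://localhost:\([0-9][0-9]*\)/.*#\1#p'
}
```

For example, echo 'TensorBoard 2.20.0 at http://localhost:12345/' | tb_port prints 12345. Lines that contain no such URL produce no output.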

(If you get errors when trying to start Tensorboard, you might already have it running, so kill any existing processes with killall tensorboard then try again. If that doesn’t work, and you have VS Code connected to the same login node, quit VS Code too.)
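Another thing worth ruling out is that the tensorboard command is simply not available, which would be the case if the environment has not been activated in this shell. The following is a small sketch of such a check; have_tensorboard is a hypothetical helper name.

```shell
# Report whether a tensorboard command is available in the current shell.
have_tensorboard() {
    if command -v tensorboard >/dev/null 2>&1; then
        echo "tensorboard found at: $(command -v tensorboard)"
    else
        # Typically means the environment is not activated in this shell.
        echo "tensorboard not found on PATH; is the environment activated?" >&2
        return 1
    fi
}
```

Run have_tensorboard on the login node before starting Tensorboard. If it reports that the command was not found, activate the environment and try again.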

Using the hostname and port number from above, and taking careful note of the -ext added after the login node name, do this in a terminal on your local machine (replacing s1234567 with your own username):

ssh -NL 6006:localhost:12345 -X s1234567@login01-ext.ecdf.ed.ac.uk

Alternatively, if your ssh configuration defines an eddie alias that connects to the same login node that Tensorboard is running on, that would simply become:

ssh -NL 6006:localhost:12345 -X eddie

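This has “tunnelled” port 12345 on the remote machine to port 6006 on your local machine. Because of the -N option, ssh runs no remote command: it prints nothing and does not return to the prompt, so leave it running for as long as you need the tunnel (Ctrl-C closes it). Now visit http://localhost:6006/ in a web browser running on your local machine.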
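Putting the pieces together, the tunnel command can be assembled from the output of hostname and the port number that Tensorboard reported. This is only a sketch: tunnel_command is a hypothetical helper name, and it assumes the external name of a login node is formed by adding -ext to the first part of its hostname, as in the example above. It prints the command rather than running it, so you can check it first.

```shell
# Print the ssh tunnel command for a given username, login node and port.
# Assumption: loginNN.ecdf.ed.ac.uk is reached from outside as
# loginNN-ext.ecdf.ed.ac.uk, as in the example above.
tunnel_command() {
    # $1 = username, $2 = output of `hostname` on the login node,
    # $3 = port reported by Tensorboard, $4 = local port (defaults to 6006)
    ext_host="${2%%.*}-ext.${2#*.}"
    printf 'ssh -NL %s:localhost:%s -X %s@%s\n' "${4:-6006}" "$3" "$1" "$ext_host"
}
```

For example, tunnel_command s1234567 login01.ecdf.ed.ac.uk 12345 prints ssh -NL 6006:localhost:12345 -X s1234567@login01-ext.ecdf.ed.ac.uk, which you can then paste into a terminal on your local machine. If you pick a different local port as the fourth argument, visit that port in the browser instead of 6006.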