
Cluster Usage

Jupyter on the cluster

First, connect to the cluster and pull the newest changes

	ssh <user>@login.leonhard.ethz.ch
	cd <project>
	git pull

Then run this script locally

./start_jupyter_nb.sh LeoOpen alelidis 8 01:20 4096 1
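The positional arguments of start_jupyter_nb.sh are not documented here; based on the example they plausibly correspond to cluster name, NETHZ username, number of cores, run time (HH:MM), memory in MB, and number of GPUs (this mapping is an assumption). A minimal sketch that builds the same command:

```shell
# ASSUMPTION: positional args are cluster, NETHZ user, cores, runtime, memory (MB), GPUs
CLUSTER=LeoOpen
NETHZ_USER=alelidis
CORES=8
RUNTIME=01:20
MEM_MB=4096
GPUS=1
CMD="./start_jupyter_nb.sh $CLUSTER $NETHZ_USER $CORES $RUNTIME $MEM_MB $GPUS"
echo "$CMD"
```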

Then connect to TensorBoard by forwarding its port over SSH.

	ssh <user>@login.leonhard.ethz.ch -L localhost:17605:localhost:17605
	cd DeepExplain/experiments/logs
	module load python_gpu/3.6.4
	tensorboard --logdir ./adv --host "0.0.0.0" --port 17605

When done, click Logout in the notebook.

Setup

Connect to ETH through the VPN; you can then log in to the cluster using your NETHZ username and password. First, upload the data and code files to the cluster

scp -r -v <data and code folder> <user>@login.leonhard.ethz.ch:data

Make sure to rename the data folder to "data" for the code to run correctly.

Then log in to the cluster using SSH and load the required modules

ssh <user>@login.leonhard.ethz.ch
module load python_gpu/3.6.1 hdf5/1.10.1

Running the code

Run preProcessor.py to generate the required NumPy files in "data/preprocessingOut/"

bsub -n 8 -R "[mem=8000]" python preProcessor.py

Then use the batch submission system to submit the job

bsub -W 4:00 -n 8 -R "rusage[mem=8000,ngpus_excl_p=1]" python <file name>.py

Options:

  • W: maximum run time for the job in HH:MM format (4 hours in the example above)
  • n: number of CPU cores to use (8 in the example above)
  • R: resources to request:
    - mem: memory (RAM) in megabytes per CPU core (8 cores * 8000 MB = 64 GB in the example above)
    - ngpus_excl_p: number of GPUs to use
  • I: interactive mode, which sends the job's output directly to your terminal instead of an output file
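Note that the mem value in rusage is per core, so the total memory granted equals cores times mem. A quick sketch of that arithmetic for the example above:

```shell
# rusage[mem=...] is megabytes *per core*: total RAM = cores * mem
TOTAL_MB=64000   # 64 GB total, as in the example above
CORES=8
PER_CORE=$((TOTAL_MB / CORES))
echo "bsub -n $CORES -R \"rusage[mem=$PER_CORE]\" python <file name>.py"
```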

The regular GPU model used is the NVIDIA GTX 1080 with 8 GB of dedicated memory.

Checking job status

To show the status of all the jobs you have queued or running after submission:

bbjobs

Or, for less information:

bjobs

To check a job's output (for a job not running in interactive mode)

bpeek <job ID>

After the job finishes, it generates a file named lsf.<job ID> in your home directory containing the job's output.

To terminate a job

bkill <job ID>

Max resources without having to wait

  • Memory: 160 GB
  • Cores: 24
  • GPUs: 1
  • Run time: 120 hours
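A submission requesting roughly these limits can be sketched as follows; the per-core memory is derived from the 160 GB total, and <file name>.py stands in for your script:

```shell
# Sketch: request the stated per-user limits (160 GB over 24 cores, 1 GPU, 120 h)
TOTAL_MB=160000
CORES=24
PER_CORE=$((TOTAL_MB / CORES))   # integer division: 6666 MB per core
echo "bsub -W 120:00 -n $CORES -R \"rusage[mem=$PER_CORE,ngpus_excl_p=1]\" python <file name>.py"
```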

See the ETH scientific computing documentation for more information about the batch system and clusters.