Short instructions to Jyväskylä Clusters
puck (frontend puck.it.jyu.fi) and oberon (frontend oberon.it.jyu.fi)
Version 0.2

Queue batch system Slurm
========================
see https://research.csc.fi/documents/48467/85840/taito_user_guide.pdf
(general CSC computing information page https://docs.csc.fi/computing/overview)

- A job needs a batch script file

Sample batch scripts:

A single core job
-------------------------------------
#!/bin/bash
#SBATCH -J testjob
#SBATCH -o testoutput
#SBATCH -n 1
#SBATCH --ntasks=1

echo "Current working directory is `pwd`"
echo "Running on `hostname`"

### load modules that your program needs
## module add puck_...

### run program "myprogram"
## myprogram

## test command:
uptime
-------------------------------------

A 48 core job, 2 nodes in puck using OpenMPI
--------------------------------------
#!/bin/bash
#SBATCH -J testjob
#SBATCH -o testoutput
#SBATCH -e erroroutput
#SBATCH -n 48
## reserves 2 nodes = 48 cores ; same as #SBATCH -N 2

## load modules the program needs
## module "puck_openmpi" tells communication to go via fast 10.0.40.xx network
module add puck_openmpi

## don't specify number of tasks, mpirun knows you reserved 48 cores and does "mpirun -np 48"
# test command is hostname:
mpirun hostname
---------------------------------------

=============================

- Contents of a script file

  lines beginning with #SBATCH are Slurm instructions
  lines beginning with ## or ### are comments

  Description of some slurm instructions:
  ----------------------------------------
  #!/bin/bash           : run the job commands in bash shell
                          (this is done only when the job actually starts execution, not before)
  #SBATCH -J mytest     : job name is "mytest"
  #SBATCH -o out        : output to file "out" in the working directory
  #SBATCH -e errorout   : error message output
  #SBATCH -n 1          : serial job, reserve a single core, same as --ntasks=1
  #SBATCH -n 4          : parallel job, reserve 4 cores, same as --ntasks=4
  #SBATCH -N 1          : reserve one node (24 cores), same as --nodes=1
  #SBATCH --time=45:00  : optional; the job takes at most 45 minutes
                          (it will be killed if it has not finished in 45 minutes)

- Send the job to queue. The following command puts the job in script file "slurm.job" to queue:

    sbatch slurm.job

  In return, you get a job number, "jobid"

- Monitor current queue situation, "R" means running, "PD" means pending

    squeue

- Graphical

    sview

- View the list of free/occupied nodes and their queue assignments

    sinfo
    sinfo -l    ; more information
    sinfo -a    ; view all queues, also ones you have no right to submit to

- Job queues: just one (old test queue is no longer available)

    normal      default queue, max 3-day jobs (sinfo shows TIMELIMIT 3-00:00:00)

- Cancel the job :

    scancel jobid

Environment modules
===================
Modules set up the correct environment for your job: all paths to executables and libraries.
A module can set up an environment for using a specific version of a program.
Everything except the system libraries and compilers is loaded as a module.
Modules made specifically for puck are named puck_name

Some module commands:

  module avail          : list of available modules
  module load puck_gpaw : load the default GPAW environment, specifically made for puck
  module add puck_gpaw  : same thing
  module list           : list of modules that are currently loaded
  module purge          : unload all modules

Example: the command gcc invokes the system gcc compiler, which is usually quite old.
If you want to use GCC version 6.1.0, do

  module add puck_gcc/6.1.0

then the compiler and its libraries are visible, and gcc --version returns

  gcc (GCC) 6.1.0
  Copyright (C) 2016 Free Software Foundation, Inc.
  This is free software; see the source for copying conditions. There is NO
  warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Some programs are licensed to a certain group of users.
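
A minimal sketch of how the module commands above might be used inside a batch
script, so that the job runs with GCC 6.1.0 instead of the system compiler.
The job name, output file name and 5-minute time limit are made-up values, and
show_compiler is a hypothetical helper. The "command -v module" guard is an
assumption added so the script also runs on a machine without Environment
Modules; on the cluster itself it should not be needed.

```shell
#!/bin/bash
#SBATCH -J gcctest
#SBATCH -o gcctest.out
#SBATCH -n 1
#SBATCH --time=05:00

## hypothetical helper: report which gcc the job will actually use
show_compiler() {
    echo "Using compiler: $(command -v gcc || echo 'gcc not found')"
}

## start from a clean environment, then load the wanted compiler version
## (guarded: only done where the "module" command exists)
if command -v module >/dev/null 2>&1; then
    module purge
    module add puck_gcc/6.1.0
    module list
fi

show_compiler
```

Submitted with sbatch, the output file should show the path of the gcc that the
module made visible; calling "gcc --version" in the script prints the version.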

Things not to do
================
- Don't run jobs in the login node - except very short data analysis/plotting
- Don't run jobs interactively, run only through the queue system Slurm
- Don't run jobs in your home directory /home/$user, use your work directory /n/work00/$user
- Don't reserve more resources than you really use

Multinode parallel jobs
=======================
- Always use network 10.0.40.0/24 ; for OpenMPI jobs, load the local OpenMPI module (puck_openmpi on puck)

A good idea is to
=================
- Monitor that your job is actually running.
  * Use "squeue"
  * If in doubt, see what nodes the job uses and log in to one of them using ssh.
    Try "top" to see if your processes consume 100% of the cpu

      ssh compute-0-10     (if the job runs in node 10)
      top
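
The monitoring steps above can be sketched as two small shell functions. The
function names job_nodes and check_job_cpu are hypothetical, and the sketch
assumes that ssh from the frontend to the compute nodes works without further
setup. It relies on the standard Slurm commands "squeue -h -j jobid -o %N"
(node list of a job) and "scontrol show hostnames" (expand a node list).

```shell
#!/bin/bash

## print the nodes allocated to a job, one per line
job_nodes() {
    local jobid="$1"
    local nodelist
    ## -h: no header line, -o "%N": print only the allocated node list
    nodelist=$(squeue -h -j "$jobid" -o "%N")
    if [ -z "$nodelist" ]; then
        echo "job $jobid has no nodes (pending or finished?)" >&2
        return 1
    fi
    ## expand a compressed list such as compute-0-[10-11] into single names
    scontrol show hostnames "$nodelist"
}

## run top once on the first node of the job and show the busiest processes
check_job_cpu() {
    local node
    node=$(job_nodes "$1" | head -n 1)
    [ -n "$node" ] || return 1
    echo "Checking node: $node"
    ## -b: batch mode, -n 1: print one snapshot and exit
    ssh "$node" top -b -n 1 | head -n 15
}
```

After sourcing the file on the frontend, "check_job_cpu jobid" prints one
snapshot of top from the first node of the job; your processes should be near
100% cpu each.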