Short instructions for the Jyväskylä clusters puck (puck.it.jyu.fi) and oberon (oberon.it.jyu.fi)
Version 0.5

NEW 2024-2025 FIRST-TIME LOGIN PROCESS:

1) Log into jalava.cc.jyu.fi using your JYU credentials.
   First, you need to have "Unix shell server logon rights" activated at
   https://account.jyu.fi/oma/services
2) Log in from jalava to the cluster front end cfe.it.jyu.fi using your JYU credentials.
3) Add the ssh key from your own computer to the file .ssh/authorized_keys in your cfe home directory.
   IMPORTANT: The file already contains keys, so don't overwrite it!
4) Log into oberon.

After adding the ssh key, you should be able to log in to cfe directly from your own computer,
IF YOUR COMPUTER IS IN THE JYU NETWORK! If not, use the JYU VPN.


Queue batch system Slurm
========================

See https://research.csc.fi/taito-constructing-a-batch-job-file

- A job needs a batch script file.
  Sample batch script files are attached (a sketch of a single-core batch file is also given
  at the end of this section):
    slurm_single.job : a single-core job
    slurm_mpi.job    : a 48-core job (2 nodes) using OpenMPI

- Contents of a script file:
  lines beginning with #SBATCH are Slurm instructions
  lines beginning with ## or ### are comments

  Description of some Slurm instructions:
  ----------------------------------------
  #!/bin/bash              : run the job commands in the bash shell (this is done only when
                             the job actually starts execution, not before)
  #SBATCH -J mytest        : job name is "mytest"
  #SBATCH -o out           : output to file "out" in the working directory
  #SBATCH -e errorout      : error messages to file "errorout"
  #SBATCH -n 1             : serial job, reserve a single core, same as --ntasks=1
  #SBATCH -n 4             : parallel job, reserve 4 cores, same as --ntasks=4
  #SBATCH -N 1             : reserve one node (24 cores), same as --nodes=1
  #SBATCH --time=45:00     : optional; the job takes at most 45 minutes (it will be killed
                             if it has not finished in 45 minutes)
  #SBATCH --partition=test : optional; put the job in the test queue (the default queue is
                             used if not specified)

- Send the job to the queue. The following command puts the job in script file "slurm.job"
  into the queue:
    sbatch slurm.job
  In return you get a job number, "jobid".

- Monitor the current queue situation ("R" means running, "PD" means pending):
    squeue

- Graphical queue view:
    sview

- View the list of free/occupied nodes and their queue assignments:
    sinfo
    sinfo -l : more information
    sinfo -a : view all queues, also the ones you have no right to submit to

- Job queues in puck.it.jyu.fi (may change, there is always a default queue):
    normal        : default queue, 24 nodes, max 12 nodes per job, max 3-day jobs
    test          : testing only, one node, short (at the moment 1 hour) test jobs
    riskybusiness : 10 nodes, max 10 nodes per job; may be overbooked with grid jobs and run
                    slower than expected (in practice: use it as you would use the normal queue)

- Cancel a job:
    scancel jobid
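The attached slurm_single.job is not reproduced in these instructions, so the following is only
a minimal sketch of what a single-core batch file can look like. The executable name
my_serial_prog and the 15-minute time limit are placeholders, not values from the attached file;
replace them with your own.

    #!/bin/bash
    #SBATCH -J mytest           ## job name
    #SBATCH -o out              ## standard output to file "out" in the working directory
    #SBATCH -e errorout         ## error messages to file "errorout"
    #SBATCH -n 1                ## serial job, reserve a single core
    #SBATCH --time=15:00        ## optional; kill the job if it is not finished in 15 minutes
    ###SBATCH --partition=test  ## optional; uncomment (one #) to use the test queue

    ## Everything below runs on the compute node when the job starts.
    cd "$SLURM_SUBMIT_DIR"      ## the directory where sbatch was run (use /n/work00/$user, not home)
    ./my_serial_prog            ## placeholder executable name

Submit it with
    sbatch slurm_single.job
and check its state with
    squeue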
Environment modules
===================

Modules set up the correct environment for your job: all paths to executables and libraries.
A module can set up the environment for a specific version of a program. Everything except the
system libraries and compilers is loaded as modules. Modules made specifically for puck are
named puck_name.

Some module commands:
    module avail          : list the available modules
    module load puck_gpaw : load the default GPAW environment, specifically made for puck
    module add puck_gpaw  : same thing
    module list           : list the modules that are currently loaded
    module purge          : unload all modules

Example: the command gcc invokes the usually quite old system gcc compiler. If you want to use
GCC version 6.1.0, do
    module add puck_gcc/6.1.0
Then the compiler and its libraries are visible, and gcc --version returns
    gcc (GCC) 6.1.0
    Copyright (C) 2016 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Some programs are licensed only to a certain group of users.


Things not to do
================

- Don't run jobs on the login node - except very short data analysis/plotting.
- Don't run jobs interactively; run them only through the queue system Slurm.
- Don't run jobs in your home directory /home/$user; use your work directory /n/work00/$user.
- Don't reserve more resources than you really use.


Multinode parallel jobs
=======================

- Always use the network 10.0.40.0/24; for OpenMPI jobs, load the local openmpi module.
  (A sketch of a two-node MPI batch file is given at the end of this document.)


A good idea is to
=================

- Monitor that your job is actually running.
  * Use "squeue".
  * If in doubt, see which nodes the job uses and log in to one of them using ssh.
    Try "top" to see if your processes consume 100% of the CPU:
        ssh compute-0-10     (if the job runs on node 10)
        top
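The attached slurm_mpi.job is likewise not reproduced here, so the following is only a sketch of
what a 48-core (2-node) OpenMPI batch file can look like. The module name openmpi, the executable
name my_mpi_prog, and the use of srun as the launcher are assumptions, not taken from the attached
file; check "module avail" and the attached slurm_mpi.job for the names actually used on puck.

    #!/bin/bash
    #SBATCH -J mpitest          ## job name
    #SBATCH -o out              ## standard output
    #SBATCH -e errorout         ## error messages
    #SBATCH -N 2                ## reserve two nodes
    #SBATCH -n 48               ## 48 cores in total (24 cores per node)
    #SBATCH --time=02:00:00     ## optional; kill the job after two hours

    module purge
    module load openmpi         ## assumed module name; the local OpenMPI module is expected to
                                ## direct MPI traffic over the 10.0.40.0/24 network

    srun ./my_mpi_prog          ## placeholder executable; depending on the local OpenMPI build,
                                ## mpirun ./my_mpi_prog may be used instead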