Short instructions for the Jyväskylä clusters puck (puck.it.jyu.fi) and oberon (oberon.it.jyu.fi)
Version 0.5

NEW 2024-2025 FIRST-TIME LOGIN PROCESS:

1) Log into jalava.cc.jyu.fi using your JYU credentials.
   First, you need to have "Unix shell server logon rights" activated at
   https://account.jyu.fi/oma/services
2) Log in from jalava to the cluster front end cfe.it.jyu.fi using your JYU credentials.
3) Add the ssh key from your own computer to the file .ssh/authorized_keys in your cfe home directory.
   IMPORTANT: The file already contains keys, so don't overwrite it!
4) Log into oberon.

After adding the ssh key, you should be able to log in to cfe directly from your own computer,
IF YOUR COMPUTER IS IN THE JYU NETWORK! If not, use the JYU VPN.


Queue batch system Slurm
========================

See https://research.csc.fi/taito-constructing-a-batch-job-file

- A job needs a batch script file.
  Sample batch script files are attached (a sketch of a single-core batch file is also given
  at the end of this section):
    slurm_single.job : a single-core job
    slurm_mpi.job    : a 48-core job (2 nodes) using OpenMPI

- Contents of a script file:
  lines beginning with #SBATCH are Slurm instructions
  lines beginning with ## or ### are comments

  Description of some Slurm instructions:
  ----------------------------------------
  #!/bin/bash              : run the job commands in the bash shell (this is done only when
                             the job actually starts execution, not before)
  #SBATCH -J mytest        : job name is "mytest"
  #SBATCH -o out           : output to file "out" in the working directory
  #SBATCH -e errorout      : error messages to file "errorout"
  #SBATCH -n 1             : serial job, reserve a single core, same as --ntasks=1
  #SBATCH -n 4             : parallel job, reserve 4 cores, same as --ntasks=4
  #SBATCH -N 1             : reserve one node (24 cores), same as --nodes=1
  #SBATCH --time=45:00     : optional; the job takes at most 45 minutes (it will be killed
                             if it has not finished in 45 minutes)
  #SBATCH --partition=test : optional; put the job in the test queue (the default queue is
                             used if not specified)

- Send the job to the queue. The following command puts the job in script file "slurm.job"
  into the queue:
    sbatch slurm.job
  In return you get a job number, "jobid".

- Monitor the current queue situation ("R" means running, "PD" means pending):
    squeue

- Graphical queue view:
    sview

- View the list of free/occupied nodes and their queue assignments:
    sinfo
    sinfo -l : more information
    sinfo -a : view all queues, also the ones you have no right to submit to

- Job queues in puck.it.jyu.fi (may change, there is always a default queue):
    normal        : default queue, 24 nodes, max 12 nodes per job, max 3-day jobs
    test          : testing only, one node, short (at the moment 1 hour) test jobs
    riskybusiness : 10 nodes, max 10 nodes per job; may be overbooked with grid jobs and run
                    slower than expected (in practice: use it as you would use the normal queue)

- Cancel a job:
    scancel jobid
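The attached slurm_single.job is not reproduced in these instructions, so the following is only
a minimal sketch of what a single-core batch file can look like. The executable name
my_serial_prog and the 15-minute time limit are placeholders, not values from the attached file;
replace them with your own.

    #!/bin/bash
    #SBATCH -J mytest           ## job name
    #SBATCH -o out              ## standard output to file "out" in the working directory
    #SBATCH -e errorout         ## error messages to file "errorout"
    #SBATCH -n 1                ## serial job, reserve a single core
    #SBATCH --time=15:00        ## optional; kill the job if it is not finished in 15 minutes
    ###SBATCH --partition=test  ## optional; uncomment (one #) to use the test queue

    ## Everything below runs on the compute node when the job starts.
    cd "$SLURM_SUBMIT_DIR"      ## the directory where sbatch was run (use /n/work00/$user, not home)
    ./my_serial_prog            ## placeholder executable name

Submit it with
    sbatch slurm_single.job
and check its state with
    squeue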
Environment modules
===================

Modules set up the correct environment for your job: all paths to executables and libraries.
A module can set up the environment for a specific version of a program. Everything except the
system libraries and compilers is loaded as modules. Modules made specifically for puck are
named puck_name.

Some module commands:
    module avail          : list the available modules
    module load puck_gpaw : load the default GPAW environment, specifically made for puck
    module add puck_gpaw  : same thing
    module list           : list the modules that are currently loaded
    module purge          : unload all modules

Example: the command gcc invokes the usually quite old system gcc compiler. If you want to use
GCC version 6.1.0, do
    module add puck_gcc/6.1.0
Then the compiler and its libraries are visible, and gcc --version returns
    gcc (GCC) 6.1.0
    Copyright (C) 2016 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Some programs are licensed only to a certain group of users.


Things not to do
================

- Don't run jobs on the login node - except very short data analysis/plotting.
- Don't run jobs interactively; run them only through the queue system Slurm.
- Don't run jobs in your home directory /home/$user; use your work directory /n/work00/$user.
- Don't reserve more resources than you really use.


Multinode parallel jobs
=======================

- Always use the network 10.0.40.0/24; for OpenMPI jobs, load the local openmpi module.
  (A sketch of a two-node MPI batch file is given at the end of this document.)


A good idea is to
=================

- Monitor that your job is actually running.
  * Use "squeue".
  * If in doubt, see which nodes the job uses and log in to one of them using ssh.
    Try "top" to see if your processes consume 100% of the CPU:
        ssh compute-0-10     (if the job runs on node 10)
        top
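The attached slurm_mpi.job is likewise not reproduced here, so the following is only a sketch of
what a 48-core (2-node) OpenMPI batch file can look like. The module name openmpi, the executable
name my_mpi_prog, and the use of srun as the launcher are assumptions, not taken from the attached
file; check "module avail" and the attached slurm_mpi.job for the names actually used on puck.

    #!/bin/bash
    #SBATCH -J mpitest          ## job name
    #SBATCH -o out              ## standard output
    #SBATCH -e errorout         ## error messages
    #SBATCH -N 2                ## reserve two nodes
    #SBATCH -n 48               ## 48 cores in total (24 cores per node)
    #SBATCH --time=02:00:00     ## optional; kill the job after two hours

    module purge
    module load openmpi         ## assumed module name; the local OpenMPI module is expected to
                                ## direct MPI traffic over the 10.0.40.0/24 network

    srun ./my_mpi_prog          ## placeholder executable; depending on the local OpenMPI build,
                                ## mpirun ./my_mpi_prog may be used instead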