Overview
ICARE provides computational resources to registered and accredited ICARE users. Registered users can run their own code in a Linux environment that is very similar to the ICARE production environment, with online access to the entire ICARE archive. This PaaS (Platform as a Service) offering is especially useful for users who run code on long time-series data sets and cannot afford to download huge amounts of data to their own facility. It is also useful for maturing and testing code intended to run operationally in the ICARE production environment. The service is suitable both for interactive use and for massive batch processing on the back-end computing nodes of the cluster.
Registration
A specific registration is required to access ICARE computing resources. Because ICARE resources are limited, access is restricted to partners working with ICARE on collaborative projects. Register for ICARE data services first (see here), then fill out this additional registration form to request an SSH account. You will be asked to provide additional information, including the framework of your request and an ICARE project referent. If you only want to access ICARE data services (i.e. FTP or web access), please use the data access registration form.
Description of the cluster
The ICARE computing cluster is composed of one front-end server and 114 allocated cores spread over 5 back-end computing nodes (see table):
- 1 front-end server (access.icare.univ-lille1.fr)
- 5 computing nodes
| Server | Number of cores allocated to cluster | Hyperthreading | Processor | RAM |
|---|---|---|---|---|
| Front-end (access.icare.univ-lille1.fr) | 26 | No | 2x Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz | 384 GB |
| Node 001 | 26 | No | 2x Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz | 384 GB |
| Nodes 002-005 | 22 physical cores (44 logical cores) | Yes | 2x Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz | 384 GB |
The front-end server is the primary access point to the cluster. No intensive processing is to be run on the front-end server; it is dedicated to interactive use only. All intensive processing jobs must be run on the computing nodes and must be submitted through the SLURM job scheduler (see below).
Disk Space
- Home Directory (40 TB total)
Note: home directories are shared by all nodes of the cluster, so any modification made to your home directory on one node (e.g. the front-end server) is also visible on all the other nodes.
- Main Storage Space /work_users (40 TB total)
- Scratch Space /scratch (78 TB total)
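The scratch space listed above is typically where temporary or intermediate files should go. A minimal sketch of preparing a per-user working directory there, assuming /scratch/$USER is writable for your account (the submit.sh examples below use the same convention):

# Create a personal working directory on the scratch space (assumed writable)
mkdir -p /scratch/$USER/temp
# Point temporary files there for the current session
export TMPDIR=/scratch/$USER/temp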
Logging in
To use the computing cluster, log in to the front-end server access.icare.univ-lille.fr using your ICARE username and password:

ssh -X username@access.icare.univ-lille.fr
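If you log in frequently, an entry in your local ~/.ssh/config can shorten the command. A minimal sketch, where the alias "icare" is only an example name to adapt to your own account:

# ~/.ssh/config on your local machine (alias name is arbitrary)
Host icare
    HostName access.icare.univ-lille.fr
    User username
    ForwardX11 yes      # equivalent to the -X option above

You can then log in with simply "ssh icare".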
Cluster Software and Environment Modules
We use the Environment Modules package to provide dynamic modification of a user's environment. Environment Modules is a tool that simplifies shell initialization and lets users easily modify their environment during a session with modulefiles. Each modulefile contains the information needed to configure the shell for an application.
The main module commands are:
module avail              # list all available modules you can load
module list               # list your currently loaded modules
module load moduleName    # load moduleName into your environment
module unload moduleName  # unload moduleName from your environment
When you log in to the ICARE cluster, some modules are automatically loaded for your convenience. Initially, your module environment is not empty!
- Display the modules loaded by default
[ops@access ~]$ module list
Currently Loaded Modulefiles:
  1) rhel6/icare_env/1-00_with_PYTHON_2.6   2) rhel6/idl/8.2   3) rhel6/matlab/R2018b
- Display all available software installed on the cluster
[ops@access ~]$ module avail
------------------------------------- /usr/local/modulefiles -------------------------------------------
rhel6/anaconda/2/5.3.1   rhel6/ferret/6.82   rhel6/icare_env/1-00_with_PYTHON_2.6   rhel6/idl/8.2     rhel6/matlab/R2012a
rhel6/anaconda/3/5.3.1   rhel6/ferret/6.9    rhel6/icare_env/2-01_with_PYTHON_2.7   rhel6/idl/8.7.2   rhel6/matlab/R2018b
- Show what a module sets for your shell environment
module show rhel6/icare_env/1-00_with_PYTHON_2.6
-------------------------------------------------------------------
/usr/local/modulefiles/rhel6/icare_env/1-00_with_PYTHON_2.6:

prepend-path  PATH /usr/local/env64_rhel6_1-00/opt/hdf4/bin:/usr/local/env64_rhel6_1-00/opt/netcdf/bin:/usr/local/env64_rhel6_1-00/bin/swig:/usr/local/env64_rhel6_1-00/bin:/usr/local/env64_rhel6_1-00/opt/scilab/bin
append-path   PATH /usr/local/env64_rhel6_1-00/opt/scilab/bin:/usr/local/env64_rhel6_1-00/opt/gcc/bin
setenv        PYTHONPATH /usr/local/env64_rhel6_1-00/lib64/python2.6/site-packages/grib_api:/usr/local/env64_rhel6_1-00/lib64/python2.6/site-packages:/usr/local/env64_rhel6_1-00/lib/python2.6/site-packages/grib_api:/usr/local/env64_rhel6_1-00/lib/python2.6/site-packages
prepend-path  LD_LIBRARY_PATH /usr/local/env64_rhel6_1-00/opt/netcdf/lib64:/usr/local/env64_rhel6_1-00/lib64:/usr/local/env64_rhel6_1-00/opt/scilab/lib64/scilab:/usr/local/env64_rhel6_1-00/opt/netcdf/lib:/usr/local/env64_rhel6_1-00/lib:/usr/local/env64_rhel6_1-00/opt/instantclient_11_2:/usr/local/env64_rhel6_1-00/opt/scilab/lib/scilab
setenv        BUFR_TABLES /usr/local/env64_rhel6_1-00/lib/python2.6/site-packages/pybufr_ecmwf/ecmwf_bufrtables
setenv        FER_DIR /usr/local/env64_rhel6_1-00/opt/ferret
setenv        JAVA_HOME /usr/local/env64_rhel6_1-00/opt/jdk
setenv        FER_DSETS /usr/local/env64_rhel6_1-00/opt/ferret/fer_dsets
setenv        FER_WEB_BROWSER firefox
setenv        FER_DATA_THREDDS http://ferret.pmel.noaa.gov/geoide/geoIDECleanCatalog.xml /usr/local/env64_rhel6_1-00/opt/ferret/fer_dsets /usr/local/env64_rhel6_1-00/opt/ferret/fer_dsets
setenv        FER_DATA /usr/local/env64_rhel6_1-00/opt/ferret/fer_dsets/data /usr/local/env64_rhel6_1-00/opt/ferret/go /usr/local/env64_rhel6_1-00/opt/ferret/examples
setenv        FER_DESCR /usr/local/env64_rhel6_1-00/opt/ferret/fer_dsets/descr
setenv        FER_GRIDS /usr/local/env64_rhel6_1-00/opt/ferret/fer_dsets/grids
setenv        FER_GO /usr/local/env64_rhel6_1-00/opt/ferret/go /usr/local/env64_rhel6_1-00/opt/ferret/examples /usr/local/env64_rhel6_1-00/opt/ferret/contrib
setenv        FER_EXTERNAL_FUNCTIONS /usr/local/env64_rhel6_1-00/opt/ferret/ext_func/libs
setenv        FER_PALETTE /usr/local/env64_rhel6_1-00/opt/ferret/ppl
setenv        SPECTRA /usr/local/env64_rhel6_1-00/opt/ferret/ppl
setenv        FER_FONTS /usr/local/env64_rhel6_1-00/opt/ferret/ppl/fonts
setenv        PLOTFONTS /usr/local/env64_rhel6_1-00/opt/ferret/ppl/fonts
setenv        FER_LIBS /usr/local/env64_rhel6_1-00/opt/ferret/lib
setenv        FER_DAT /usr/local/env64_rhel6_1-00/opt/ferret
-------------------------------------------------------------------
- Get help information about a module
module help rhel6/anaconda/3/5.3.1
----------- Module Specific Help for 'rhel6/anaconda/3/5.3.1' --------------------
This modulefile defines all the paths and variables needed to use the ICARE environment anaconda3-5.3.1
.............................................
- Loading / unloading modules
[ops@access ~]$ module list
Currently Loaded Modulefiles:
  1) rhel6/icare_env/1-00_with_PYTHON_2.6   2) rhel6/idl/8.2   3) rhel6/matlab/R2018b
[ops@access ~]$ which matlab
/usr/local/modules/rhel6/matlab/R2018b/bin/matlab
[ops@access ~]$ module unload rhel6/matlab/R2018b
[ops@access ~]$ module list
Currently Loaded Modulefiles:
  1) rhel6/icare_env/1-00_with_PYTHON_2.6   2) rhel6/idl/8.2
[ops@access ~]$ module load rhel6/matlab/R2012a
[ops@access ~]$ module list
Currently Loaded Modulefiles:
  1) rhel6/icare_env/1-00_with_PYTHON_2.6   2) rhel6/idl/8.2   3) rhel6/matlab/R2012a
[ops@access ~]$ which matlab
/usr/local/modules/rhel6/matlab/R2012a/bin/matlab
- Unload ALL software modules
[ops@access ~]$ module purge
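If you want the same set of modules at every session instead of the default ones, you can load them from your shell startup file. A minimal sketch for bash users; the module names are taken from the listing above and should be adapted to your needs:

# In ~/.bashrc (bash users): start from a clean module environment,
# then load the tools you actually use
module purge
module load rhel6/anaconda/3/5.3.1
module load rhel6/idl/8.7.2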
Running your jobs
No intensive processing is to be run on the front-end node. Processing jobs must be submitted through the SLURM job scheduler to run on the computing nodes. SLURM (Simple Linux Utility for Resource Management) is a workload manager and job scheduling system for Linux clusters. In the current configuration, all the computing nodes belong to a single partition named "COMPUTE" (i.e. all jobs end up in the same queue). The maximum RAM allowed is 4 GB per job and the default maximum execution time is 24 hours (jobs are automatically killed when this limit is reached). See the --time option below to modify that limit.
The job priority is automatically adjusted based on the resources requested by the user when submitting the job: the lower the requested resources, the higher the priority.
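For example, the default limits can be overridden per job with SBATCH directives (or the equivalent command-line options); whether a larger request is granted depends on the limits enforced on the COMPUTE partition:

# Illustrative sketch of overriding the defaults in a batch script
#SBATCH --time=48:00:00   # request 48 hours instead of the 24-hour default
#SBATCH --mem=4000        # request the 4 GB maximum explicitly (in MB)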
SLURM commands
Jobs can be submitted to the scheduler using sbatch or srun.
- sbatch: to submit a job to the queue
The job is submitted via the sbatch command. SLURM then assigns a number to the job and places it in the queue. It will execute when the resources are available.
ops@access:~ $ sbatch submit.sh
Submitted batch job 17
Example (submit.sh) for bash users
#!/bin/bash
#===============================================================================
# SBATCH options:
#SBATCH --job-name=TestJob               # Defines a name for the batch job
#SBATCH --time=10:00                     # Time limit for the job (format = m:s or h:m:s or d-h:m:s)
#SBATCH -o OUTPUT_FILE                   # Specifies the file containing the stdout
#SBATCH -e ERROR_FILE                    # Specifies the file containing the stderr
#SBATCH --mem=2000                       # Memory limit per compute node for the job
#SBATCH --partition=COMPUTE              # Partition is a queue for jobs (default is COMPUTE)
#SBATCH --mail-type=ALL                  # When email is sent to user (all notifications)
#SBATCH --mail-user=user@univ-lille.fr   # User's email address

### Set the TMPDIR environment variable to a directory that is accessible to your user ID
export TMPDIR=/scratch/$USER/temp
mkdir -p $TMPDIR

### Purge any previously loaded modules
module purge

### Load the application
module load rhel6/anaconda/3/5.3.1       # load module anaconda / Python 3.6.7

### Run the program
./executable_name
#===============================================================================
Example (submit.sh) for tcsh users
#!/bin/tcsh
#===============================================================================
# SBATCH options:
#SBATCH --job-name=TestJob               # Defines a name for the batch job
#SBATCH --time=10:00                     # Time limit for the job (format = m:s or h:m:s or d-h:m:s)
#SBATCH -o OUTPUT_FILE                   # Specifies the file containing the stdout
#SBATCH -e ERROR_FILE                    # Specifies the file containing the stderr
#SBATCH --mem=2000                       # Memory limit per compute node for the job
#SBATCH --partition=COMPUTE              # Partition is a queue for jobs (default is COMPUTE)
#SBATCH --mail-type=ALL                  # When email is sent to user (all notifications)
#SBATCH --mail-user=user@univ-lille.fr   # User's email address

### Set the TMPDIR environment variable to a directory that is accessible to your user ID
setenv TMPDIR /scratch/$USER/temp
mkdir -p $TMPDIR

### Purge any previously loaded modules
module purge

### Load the application
module load rhel6/anaconda/3/5.3.1       # load module anaconda / Python 3.6.7

### Run the program
./executable_name
#===============================================================================
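For long time-series processing, a SLURM job array can run the same script once per input (e.g. one task per day). A minimal sketch for bash users, assuming job arrays are enabled on this cluster; process_day.sh is a hypothetical per-day script to replace with your own executable:

#!/bin/bash
#SBATCH --job-name=TimeSeries
#SBATCH --partition=COMPUTE
#SBATCH --time=10:00
#SBATCH --mem=2000
#SBATCH --array=1-31          # one task per day of the month (assumes job arrays are allowed)
#SBATCH -o array_%A_%a.out    # %A = array job ID, %a = task index

# Hypothetical processing script taking the day index as argument
./process_day.sh ${SLURM_ARRAY_TASK_ID}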
- srun: to submit a job for interactive execution (as you would execute any command on the command line); you lose the prompt until the execution is complete.
Example of a run in the COMPUTE partition with a 30-minute time limit:
ops@access:~ $ srun --partition=COMPUTE --time=30:00 job.sh
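If you need an interactive shell on a compute node rather than a single command, srun can also allocate a pseudo-terminal. A minimal sketch, assuming interactive shells are permitted on the COMPUTE partition:

# Open an interactive bash session on a compute node for 30 minutes
ops@access:~ $ srun --partition=COMPUTE --time=30:00 --pty bash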
- squeue: to view information about jobs
ops@access:~ $ squeue
ops@access:~ $ squeue -u <myusername>
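The output of squeue can also be customized with its standard format option; for example (format codes from the SLURM documentation):

# Show only your jobs with job ID, partition, name, state and elapsed time
ops@access:~ $ squeue -u <myusername> -o "%.10i %.10P %.20j %.8T %.10M"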
- scancel: to remove a job from the queue, or cancel it if it is running
ops@access:~ $ scancel <jobid>
ops@access:~ $ scancel -u <myusername> --state=pending    (cancels all pending jobs of <myusername>)
ops@access:~ $ scancel -u <myusername> --state=running    (cancels all running jobs of <myusername>)
- sinfo: provides information about nodes and partitions
sinfo -N -l
NODELIST   NODES  PARTITION  STATE  CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
node001    1      COMPUTE*   idle   26    2:14:1  386225  0         1000    (null)    none
node002    1      COMPUTE*   idle   22    2:24:1  386225  0         1000    (null)    none
node003    1      COMPUTE*   idle   22    2:24:1  386225  0         1000    (null)    none
node004    1      COMPUTE*   idle   22    2:24:1  386225  0         1000    (null)    none
node005    1      COMPUTE*   idle   22    2:24:1  386225  0         1000    (null)    none
- scontrol: to see the configuration and state of a job
ops@access:~ $ scontrol show job <jobid>
- sview: a graphical user interface version of these commands
The following table translates some of the more commonly used options for qsub to their sbatch equivalents:
qsub to sbatch translation

| To specify the: | qsub option | sbatch option | Comments |
|---|---|---|---|
| Queue/partition | -q QUEUENAME | -p QUEUENAME | Torque "queues" are called "partitions" in SLURM. Note: the partition/queue structure has been simplified, see below. |
| Number of nodes/cores requested | -l nodes=NUMBERCORES | -n NUMBERCORES | See below |
| | -l nodes=NUMBERNODES:CORESPERNODE | -N NUMBERNODES -n NUMBERCORES | |
| Wallclock limit | -l walltime=TIMELIMIT | -t TIMELIMIT | TIMELIMIT should have the form HOURS:MINUTES:SECONDS. SLURM supports some other time formats as well. |
| Memory requirements | -l mem=MEMORYmb | --mem=MEMORY | Torque/Maui: total memory used by the job. SLURM: memory per node. MEMORY in MB. |
| | -l pmem=MEMORYmb | --mem-per-cpu=MEMORY | This is per CPU/core. MEMORY in MB. |
| Stdout file | -o FILENAME | -o FILENAME | On SLURM this will combine stdout/stderr if -e is not also given. |
| Stderr file | -e FILENAME | -e FILENAME | On SLURM this will combine stderr/stdout if -o is not also given. |
| Combining stdout/stderr | -j oe | -o OUTFILE and no -e option | stdout and stderr merged to stdout/OUTFILE |
| | -j eo | -e ERRFILE and no -o option | stdout and stderr merged to stderr/ERRFILE |
| Email address | -M EMAILADDR | --mail-user=EMAILADDR | |
| Email options | -m b | --mail-type=BEGIN | Send email when job starts |
| | -m e | --mail-type=END | Send email when job ends |
| | -m be | --mail-type=BEGIN --mail-type=END | Send email when job starts and ends |
| Job name | -N NAME | --job-name=NAME | |
| Working directory | -d DIR | --workdir=DIR | |
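As a concrete illustration of the table above, a typical Torque header could be rewritten for sbatch as follows; the values are placeholders and the original qsub directives are kept as comments for comparison:

#!/bin/bash
#SBATCH --job-name=MyJob                 # was: #PBS -N MyJob
#SBATCH -p COMPUTE                       # was: #PBS -q COMPUTE
#SBATCH -n 4                             # was: #PBS -l nodes=4
#SBATCH -t 02:00:00                      # was: #PBS -l walltime=02:00:00
#SBATCH --mem=2000                       # was: #PBS -l mem=2000mb
#SBATCH -o job.out                       # was: #PBS -o job.out
#SBATCH --mail-user=user@univ-lille.fr   # was: #PBS -M user@univ-lille.fr

./executable_name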
See also
Documentation for SLURM and the SLURM commands is available online:
http://slurm.schedmd.com
http://slurm.schedmd.com/man_index.html