Computing resources

Overview

ICARE provides computational resources to registered and accredited ICARE users. Registered users can run their own code in a Linux environment that is very similar to the ICARE production environment, with online access to the entire ICARE archive. This PaaS (Platform as a Service) offering is especially useful for users who run code on long time-series data sets and cannot afford to download huge amounts of data to their own facility. It is also useful for maturing and testing code intended to run operationally in the ICARE production environment. The service is suitable both for interactive use and for massive batch processing dispatched to the back-end computing nodes of the cluster.

Registration

A specific registration is required to access ICARE computing resources. Because ICARE resources are limited, access is restricted to partners working with ICARE on collaborative projects. Register for ICARE data services first (see here), then fill out this additional registration form to request an SSH account. You will be asked to provide additional information, including the framework of your request and an ICARE project referent.
If you only want to access ICARE data services (i.e. SFTP or web access), please use the data access registration form instead.

Description of the cluster

The ICARE computing cluster is composed of one front-end server and 192 allocated cores spread over 4 back-end computing nodes (see table):

  • 1 front-end server (access.icare.univ-lille.fr)
  • 4 computing nodes

Server | Cores allocated to cluster | Hyperthreading | Processor | RAM
Front-end (access.icare.univ-lille.fr) | 40 | Yes | Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz | 384 GB
Nodes 006-009 (each) | 24 physical cores (48 logical cores) | Yes | Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz | 384 GB

The front-end server is the primary access point to the cluster. No intensive processing is to be run on the front-end server; it is dedicated to interactive use only. All intensive processing jobs must be run on the computing nodes and must be submitted through the SLURM job scheduler (see below).

Disk Space

  • Home Directory (51 TB total)

This space should be used for storing files you want to keep in the long term, such as source code, scripts, etc. The home directory is backed up nightly.
Note: home directories are shared by all nodes of the cluster, so a modification made to your home directory on one machine is also visible on all the others.

  • Main Storage Space /work_users (75 TB total)

This is the main storage space for large amounts of data. This work space is backed up nightly.

  • Scratch Space /scratch (50 TB total)

The scratch filesystem is intended for temporary storage and should be considered volatile. Older
files are subject to being automatically purged. No backup of any kind is performed for this work
space.
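
As a minimal sketch of how the scratch space is typically used (the per-user subdirectory /scratch/$USER shown here is an illustrative convention, not a documented requirement):

# create a personal temporary directory on the volatile scratch filesystem
mkdir -p /scratch/$USER/tmp
export TMPDIR=/scratch/$USER/tmp

# ... run processing that writes its temporary files to $TMPDIR ...

# clean up when done: /scratch is neither backed up nor permanent
rm -rf /scratch/$USER/tmp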

Logging in

To use the computing cluster, log in to the front-end server access.icare.univ-lille.fr using your ICARE username and password:

ssh -X username@access.icare.univ-lille.fr
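
If you log in frequently, an entry in your local SSH client configuration can shorten the command. This is a sketch only; the host alias icare is arbitrary and username must be replaced by your ICARE username:

# ~/.ssh/config on your own machine
Host icare
    HostName access.icare.univ-lille.fr
    User username
    ForwardX11 yes

# afterwards, "ssh icare" is equivalent to the full command above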

Cluster Software and Environment Modules

We use the Environment Modules package to provide dynamic modification of a user's environment.
The Environment Modules package is a tool that simplifies shell initialization and lets users easily modify their environment during a session with modulefiles. Each modulefile contains the information needed to configure the shell for an application.
The main module commands are:

module avail # to list all available modules you can load
module list # to list your currently loaded modules
module load moduleName # to load moduleName into your environment
module unload moduleName # to unload moduleName from your environment

When you log in to the ICARE cluster, some modules are automatically loaded for your convenience.
Initially, your module environment is not empty!

  • Display default environment variables

To see the default environment that you get at login, issue the “module list” command.

[ops@access ~]$ module list 

Currently Loaded Modulefiles:

1) scilab/6.1.0 6) cmake/3.15.3 11) proj/6.2.1 16) netcdf-c/4.6.3 21) swig/4.0.2
2) jdk/13.0.1 7) git/2.21.0 12) HDFView/3.1.0 17) netcdf-fortran/4.4.5 22) Python/2.7.16
3) gcc/9.2.0 8) cvs2svn/2.5.0 13) hdf4/4.2.14 18) netcdf-cxx4/4.3.0 23) coda/2.21
4) openssl/1.1.1d 9) eccodes/2.20.0 14) HDF-EOS2/20v1.00 19) nccmp/1.9.1.0 24) icare-env/3.0.0/python2
5) curl/7.65.3 10) geos/3.7.3 15) hdf5/1.10.5 20) oracle_instantclient/18.5.0.0 25) perl/5.32.0
  • Display all available software installed on the cluster
[ops@access ~]$ module avail
--------------------------- modulefiles ----------------------------
rhel7/Anaconda/3/2020.11 rhel7/hdf4/4.2.14-without-netcdf rhel7/netcdf-c/4.6.3
rhel7/cmake/3.15.3 rhel7/hdf5/1.10.5 rhel7/netcdf-c/4.9.0
rhel7/coda/2.21 rhel7/HDF-EOS2/20v1.00 rhel7/netcdf-cxx4/4.3.0
rhel7/coda/2.24 rhel7/HDFView/3.1.0 rhel7/netcdf-fortran/4.4.5
rhel7/coda/2.24.1 rhel7/icare-env/3.0.0/python2 rhel7/openssl/1.1.1d
rhel7/conda_envs/dataviz rhel7/icare-env/3.0.0/python3 rhel7/oracle_instantclient/18.5.0.0
rhel7/conda_envs/dataviz_v3 rhel7/icare-env/3.1.0/python3 rhel7/perl/5.32.0
rhel7/conda_envs/dataviz_v4 rhel7/idl/8.2 rhel7/proj/6.2.1
rhel7/curl/7.65.3 rhel7/idl/8.7 rhel7/proj/9.0.1
rhel7/cvs2svn/2.5.0 rhel7/idl/8.8 rhel7/Python/2.7.16
rhel7/eccodes/2.13.1 rhel7/intel/2021.1.1 rhel7/Python/3.10.5
rhel7/eccodes/2.20.0 rhel7/jdk/13.0.1 rhel7/Python/3.8.0
rhel7/gcc/9.2.0 rhel7/matlab/R2012a rhel7/scilab/6.0.2
rhel7/gdal/3.5.1 rhel7/matlab/R2018b rhel7/scilab/6.1.0
rhel7/gdal/3.6.0 rhel7/matlab/R2020a rhel7/sqlite/3.31.0
rhel7/geos/3.7.3 rhel7/matlab_runtime/R2012a rhel7/swig/4.0.2
rhel7/git/2.21.0 rhel7/matlab_runtime/R2018b
rhel7/hdf4/4.2.14 rhel7/nccmp/1.9.1.0
  • Show what a module sets for your shell environment
module show rhel7/Python/3.10.5
-------------------------------------------------------------------
/usr/local/modulefiles/rhel7/Python/3.10.5:
prepend-path    PATH /usr/local//modules/rhel7/Python/3.10.5/bin
prepend-path    LD_LIBRARY_PATH /usr/local//modules/rhel7/Python/3.10.5/lib
prepend-path    PKG_CONFIG_PATH /usr/local//modules/rhel7/Python/3.10.5/lib/pkgconfig
prepend-path    PYTHONPATH /usr/local//modules/rhel7/Python/3.10.5/lib/python3.10/site-packages/osgeo
prepend-path    CARTOPY_DATA_DIR /usr/local//modules/rhel7/Python/3.10.5/lib/python3.10/site-packages/cartopy/data
-------------------------------------------------------------------
  • Get help information about a module
module help rhel7/Python/3.10.5
----------- Module Specific Help for 'rhel7/Python/3.10.5' --------
This modulefile defines the pathes and variables for the package
Python-3.10.5
.............................................
  • Loading/ unloading modules

Modules can be loaded and unloaded dynamically.

[ops@access ~]$ module load rhel7/matlab/R2018b
[ops@access ~]$ which matlab
/usr/local/modules/rhel7/matlab/R2018b/bin/matlab
[ops@access ~]$ module unload rhel7/matlab/R2018b
  • Unload ALL software modules

The module purge command will remove all currently loaded modules. This is particularly useful if
you have to run incompatible software (e.g. Python 2.x vs. Python 3.x). The module unload
command will remove a specific module.

[ops@access ~]$ module purge
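
For example, to switch from the Python 2 environment loaded by default to one of the Python 3 modules listed by module avail, a minimal sketch is:

# drop everything loaded at login, then load only what is needed
module purge
module load rhel7/Python/3.10.5

# verify which interpreter is now found first in the PATH
which python3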

Running your jobs

No intensive processing is to be run on the front-end node. Processing jobs must be submitted
through the SLURM job scheduler to run on the computing nodes. SLURM (Simple Linux Utility
for Resource Management) is a workload manager and job scheduling system for Linux clusters.
In the current configuration, all the computing nodes belong to a single partition named
“COMPUTE” (i.e. all jobs end up in the same queue). The maximum RAM allowed is 4 GB per job
and the maximum execution time is 24 hours by default (i.e. jobs are automatically killed if this
limit is reached). See the --time option below to modify that limit.
Job priority is automatically adjusted based on the resources requested by the user when
scheduling the job: the lower the requested resources, the higher the priority.
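
As an illustration of overriding the default limits at submission time (the values below are arbitrary examples, not site recommendations; the same options can also be set as #SBATCH directives in the submission script, see the examples below):

# request a 2-hour wallclock limit and 3000 MB of memory for this job only
sbatch --time=02:00:00 --mem=3000 submit.sh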

SLURM commands

Jobs can be submitted to the scheduler using sbatch or srun.
sbatch: to submit a job to the queue

The job is submitted via the sbatch command. SLURM then assigns a number to the job and places
it in the queue. It will execute when the resources are available.

ops@access:~ $ sbatch submit.sh
Submitted batch job 17

Example (submit.sh) for bash users

#!/bin/bash
#===============================================================================
# SBATCH options:
#SBATCH --job-name=TestJob              # Defines a name for the batch job
#SBATCH --time=10:00                    # Time limit for the job (format: m:s, h:m:s or d-h:m:s)
#SBATCH -o OUTPUT_FILE                  # File receiving the stdout
#SBATCH -e ERROR_FILE                   # File receiving the stderr
#SBATCH --mem=2000                      # Memory limit per compute node for the job
#SBATCH --partition=COMPUTE             # Partition (queue) for the job (default is COMPUTE)
#SBATCH --mail-type=ALL                 # When email is sent to the user (all notifications)
#SBATCH --mail-user=user@univ-lille.fr  # User's email address

### Set the TMPDIR environment variable to a directory accessible to the user ID
export TMPDIR=/scratch/$USER/temp
mkdir -p $TMPDIR

### Purge any previously loaded modules
module purge

### Load the application
module load rhel7/Python/3.8.0          # load the Python 3.8.0 module

### Run the program
./executable_name
#===============================================================================

Example (submit.sh) for tcsh users

#!/bin/tcsh
#===============================================================================
# SBATCH options:
#SBATCH --job-name=TestJob              # Defines a name for the batch job
#SBATCH --time=10:00                    # Time limit for the job (format: m:s, h:m:s or d-h:m:s)
#SBATCH -o OUTPUT_FILE                  # File receiving the stdout
#SBATCH -e ERROR_FILE                   # File receiving the stderr
#SBATCH --mem=2000                      # Memory limit per compute node for the job
#SBATCH --partition=COMPUTE             # Partition (queue) for the job (default is COMPUTE)
#SBATCH --mail-type=ALL                 # When email is sent to the user (all notifications)
#SBATCH --mail-user=user@univ-lille.fr  # User's email address

### Set the TMPDIR environment variable to a directory accessible to the user ID
setenv TMPDIR /scratch/$USER/temp
mkdir -p $TMPDIR

### Purge any previously loaded modules
module purge

### Load the application
module load rhel7/Python/3.10.5         # load the Python 3.10.5 module

### Run the program
./executable_name
#===============================================================================
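
A typical submit-and-monitor sequence then looks like the following sketch (the job ID is whatever sbatch returns, and OUTPUT_FILE/ERROR_FILE are the names set in the script above):

# submit the batch script; sbatch prints the job ID
sbatch submit.sh

# check the state of your jobs in the queue (see squeue below)
squeue -u $USER

# once the job has finished, inspect the output and error files
cat OUTPUT_FILE ERROR_FILE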

srun: to submit a job for interactive execution (as you would execute any command line),
i.e. you lose the prompt until the execution is complete.
Example of a run in the COMPUTE partition for 30 minutes:

ops@access:~ $ srun --partition=COMPUTE --time=30:00 job.sh
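
srun can also be used to open an interactive shell on a compute node for testing, using its standard --pty option (a sketch, assuming interactive sessions are allowed on this cluster):

# request an interactive bash session on a COMPUTE node for 30 minutes
srun --partition=COMPUTE --time=30:00 --pty bash

# the prompt now runs on a compute node; type "exit" to release the allocation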

squeue: to view information about jobs

Usage:

ops@access:~ $ squeue
ops@access:~ $ squeue -u <myusername>

scancel: to remove a job from the queue, or cancel it if it is running

ops@access:~ $ scancel <jobid>
ops@access:~ $ scancel -u <myusername> --state=pending (cancels all pending jobs of <myusername>)
ops@access:~ $ scancel -u <myusername> --state=running (cancels all running jobs of <myusername>)

sinfo: provides information about nodes and partitions

sinfo -N -l
NODELIST NODES PARTITION STATE CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
node006  1     COMPUTE*  idle  48   2:12:2 385563 0        1000   (null)   none
node007  1     COMPUTE*  idle  48   2:12:2 385563 0        1000   (null)   none
node008  1     COMPUTE*  idle  48   2:12:2 385563 0        1000   (null)   none
node009  1     COMPUTE*  idle  48   2:12:2 385563 0        1000   (null)   none

scontrol: to see the configuration and state of a job

ops@access:~ $ scontrol show job <jobid>

sview: a graphical user interface showing the state of jobs, partitions and nodes

The following table translates some of the more commonly used options for qsub to their sbatch
equivalents:

qsub to sbatch translation

To specify the: | qsub option | sbatch option | Comments
Queue/partition | -q QUEUENAME | -p QUEUENAME | Torque “queues” are called “partitions” in SLURM. Note: the partition/queue structure has been simplified, see below.
Number of nodes/cores requested | -l nodes=NUMBERCORES | -n NUMBERCORES | See below
 | -l nodes=NUMBERNODES:CORESPERNODE | -N NUMBERNODES -n NUMBERCORES |
Wallclock limit | -l walltime=TIMELIMIT | -t TIMELIMIT | TIMELIMIT should have the form HOURS:MINUTES:SECONDS. SLURM supports some other time formats as well.
Memory requirements | -l mem=MEMORYmb | --mem=MEMORY | Torque/Maui: total memory used by the job. SLURM: memory per node.
 | -l pmem=MEMORYmb | --mem-per-cpu=MEMORY | Per CPU/core. MEMORY in MB.
Stdout file | -o FILENAME | -o FILENAME | SLURM combines stdout and stderr in this file if -e is not also given
Stderr file | -e FILENAME | -e FILENAME | SLURM combines stderr and stdout in this file if -o is not also given
Combining stdout/stderr | -j oe | -o OUTFILE and no -e option | stdout and stderr merged to stdout/OUTFILE
 | -j eo | -e ERRFILE and no -o option | stdout and stderr merged to stderr/ERRFILE
Email address | -M EMAILADDR | --mail-user=EMAILADDR |
Email options | -m b | --mail-type=BEGIN | Send email when the job starts
 | -m e | --mail-type=END | Send email when the job ends
 | -m be | --mail-type=BEGIN --mail-type=END | Send email when the job starts and ends
Job name | -N NAME | --job-name=NAME |
Working directory | -d DIR | --workdir=DIR |
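
As an illustration of the translation above, here is a sketch of the same minimal job header written for both schedulers (the directive values are arbitrary examples):

# Torque/PBS version (submitted with qsub)
#PBS -q COMPUTE
#PBS -l walltime=01:00:00
#PBS -l mem=2000mb
#PBS -N TestJob
#PBS -o OUTPUT_FILE

# SLURM equivalent (submitted with sbatch)
#SBATCH -p COMPUTE
#SBATCH -t 01:00:00
#SBATCH --mem=2000
#SBATCH --job-name=TestJob
#SBATCH -o OUTPUT_FILE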

See also

Documentation of SLURM and its commands is available online:
http://slurm.schedmd.com
http://slurm.schedmd.com/man_index.html
