Computing environment

When the scheduler allocates resources, the directory /scratch/slurm.<jobid>.<taskid>/tmp is automatically created on the allocated computing server to store temporary files generated at runtime.

At the same time, the scheduler sets the TMPDIR environment variable to /tmp and maps /tmp to the directory created in /scratch. All you need to do is write your temporary files to /tmp, which actually points to the temporary directory (accessible only by the user) in /scratch. At the end of the job, this directory is automatically deleted by the scheduler.
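As a purely illustrative sketch (my_program, its options and the destination path below are placeholders, not site-defined names), a job script can work in TMPDIR and copy back only the final result:

#!/bin/bash
#SBATCH --job-name=tmpdir-example
#SBATCH --time=01:00:00

# Write intermediate files to the scratch-backed temporary directory;
# it is created by the scheduler and deleted at the end of the job.
my_program --workdir "$TMPDIR" --output "$TMPDIR/result.dat"

# Keep only the final result by copying it to a permanent (GROUP or THRONG) directory.
cp "$TMPDIR/result.dat" /path/to/your/group/directory/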

Important

For better performance, it is recommended to use TMPDIR when accessing a large volume of data, as the /scratch storage space is large and local to the node.

However, if you wish, for example, to pass data from one job to another, do not use a worker's /scratch as a permanent directory; use your THRONG or GROUP directories instead.

Storage and licence resources

Upon submission, it is necessary to declare the storage systems accessed by your jobs, as well as any software licenses used. This declaration is made using the -L option of the sbatch command. For an example syntax, please refer to the example of a standard job.
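For instance (an illustration only: sps and matlab below are assumed resource names, and the actual names must be taken from the scontrol listings described below):

% sbatch -L sps,matlab myjob.sh

The same declaration can be written as a directive inside the job script:

#SBATCH -L sps,matlab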

To find out the resource limits, use the scontrol command:

% scontrol show lic
  • For an exhaustive list of resources requiring declaration upon submission:

% scontrol show lic | grep default
  • To find out the resource limitations of a <group>:

% scontrol show lic | grep -A1 <group>

Attention

Omit _default or _<group> suffixes on the submission line

Please refer to the MATLAB page for the correct declaration syntax.
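As a hypothetical illustration of this rule: if scontrol show lic listed the resources sps_default and sps_<group>, the submission line would declare only the base name:

% sbatch -L sps myjob.sh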

Group limitations

Each computing group has a limit on the number of slots concurrently occupied by its running jobs; this limit depends on the group's yearly resource request. To find out this limit for a given group (account):

% sacctmgr show assoc where account="<group>" user= format="Account,Grpcpus"
   Account  GrpCPUs
---------- --------
   <group>     2000
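To compare this limit with the slots currently occupied by your group's running jobs, one possible check (a sketch relying on the standard squeue %C output field, which prints the number of CPUs allocated to each job) is:

% squeue -A <group> -t RUNNING -h -o "%C" | awk '{total+=$1} END {print total}'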

Once this limit is reached, subsequent submissions will remain pending in the queue until the completion of running jobs frees the required number of slots. However, if you receive the following error upon submission:

sbatch: error: AssocMaxSubmitJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

or, when monitoring a waiting job, you see the following squeue output:

% squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            936551       htc littlene     user PD       0:00      1 (AssocMaxJobsLimit)

you have probably been blocked! Please check your inbox or contact user support.

Parameter limitations

The upper limit of the time parameter -t depends on the quality of service associated with the partition used.

The upper limits for memory --mem, number of tasks -n, and CPUs per task -c depend on the partition and the node used.
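For illustration (the values are arbitrary and must stay within the limits of the chosen partition), a submission combining these parameters could look like:

% sbatch -t 06:00:00 --mem=8G -n 1 -c 4 myjob.sh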

Important

To help you quantify the parameter values for your jobs, please refer to the paragraph Job profiling.

For an overview of the available resources and their limits, please refer to the page Information about the slurm batch system.

The main guidelines for the limitations of these resources per partition are the following:

  • htc limits the job to a single node; this implies that the CPU limits will be the hardware limits of the node used, and the memory will be limited to 150 GB.

    • htc_daemon limits jobs to 1 CPU and 3 GB of memory.

    • htc_highmem: the memory and CPU limits will be the hardware limits of the node used.

  • flash has the same CPU and memory limits as htc.

  • In the hpc partition, the job can “overflow” onto several nodes. The memory and CPU limits will therefore be the total CPU and memory available on the HPC platform.

  • gpu limits the job to a single node. The memory limit will be calculated as for htc_highmem; on the other hand, the limit on the number of CPUs is strictly linked to the number of GPUs requested and depends on the lowest CPU/GPU ratio of the GPU platform (linked to the type and configuration of the corresponding hardware).

    • If, for example, one node in the platform has a ratio of 5 CPUs for each GPU, and another has a ratio of 6, the maximum limit will be 5 CPUs for each GPU requested.
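Continuing this example, a sketch of the corresponding job script directives (assuming the generic Slurm --gres syntax for requesting GPUs; the exact GPU request option in use may differ) for a job asking for 2 GPUs, and therefore at most 10 CPUs:

# 2 GPUs requested; with the lowest ratio of 5 CPUs per GPU, at most 10 CPUs may be requested
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH -c 10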