Computing environment
As the scheduler allocates resources, the directory /scratch/slurm.<jobid>.<taskid>/tmp is automatically created on the allocated computing server to store the temporary files generated during runtime. At the same time, the scheduler sets the TMPDIR environment variable to /tmp and maps /tmp to the directory created in /scratch. All you need to do is write any temporary files to /tmp, which actually points to the temporary directory (accessible only by the user) in /scratch. At the end of the job, this directory is automatically deleted by the scheduler.
Important
For better performance, it is recommended to use TMPDIR when accessing a large volume of data, as the /scratch storage space is large and local to the node.
However, if you wish, for example, to pass data from one job to another, you should not use a worker's /scratch as a permanent directory, but rather your THRONG or GROUP directories.
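As a minimal sketch (the program and file names below are hypothetical), a batch script can simply write its temporary data under $TMPDIR and copy back only what must be kept:
#!/bin/bash
#SBATCH -t 0-01:00
# $TMPDIR points to the per-job directory in /scratch and is deleted when the job ends.
my_program --workdir "$TMPDIR" > "$TMPDIR/intermediate.out"   # hypothetical program
# Copy only the results to keep onto permanent storage before the job finishes.
cp "$TMPDIR/intermediate.out" "$HOME/results/"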
Storage and licence resources
Upon submission, it is necessary to declare the storage systems accessed by your jobs, as well as any software licences used. This declaration is made using the -L option of the sbatch command. For an example of the syntax, please refer to the example of a standard job.
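As a hedged illustration, the declaration can go either in the script header or on the command line; the resource names sps and matlab below are placeholders, to be replaced by the names reported by scontrol show lic for the storage systems and licences your job actually uses:
#SBATCH -L sps,matlab
or, equivalently, on the command line:
% sbatch -L sps,matlab myjob.sh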
In order to know the resource limits, use the scontrol command:
% scontrol show lic
For an exhaustive list of resources requiring declaration upon submission:
% scontrol show lic | grep default
To find out the resource limitations of a given <group>:
% scontrol show lic | grep -A1 <group>
Attention
Omit the _default or _<group> suffixes on the submission line.
Please refer to the MATLAB page for the correct declaration syntax.
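For instance, assuming a hypothetical storage resource that scontrol lists as sps_default, the submission line declares only the base name sps:
% scontrol show lic | grep default
LicenseName=sps_default
% sbatch -L sps myjob.sh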
Group limitations
Each computing group has a limit on the number of slots concurrently occupied by running jobs; this limit depends on the group's yearly resource request. To know this limit for a given group (account):
% sacctmgr show assoc where account="<group>" user= format="Account,Grpcpus"
Account GrpCPUs
---------- --------
<group> 2000
Once this limit is reached, subsequent submissions will remain pending in the queue until the completion of running jobs frees the required number of slots. However, if you receive the following error upon submission:
sbatch: error: AssocMaxSubmitJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
or, when monitoring a waiting job, you see the following squeue output:
% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
936551 htc littlene user PD 0:00 1 (AssocMaxJobsLimit)
you have probably been blocked! Please check your inbox or contact user support.
Parameter limitations
The upper limit of the time parameter -t depends on the quality of service associated with the partition used. The upper limits of memory --mem, number of tasks -n, and CPUs per task -c depend on the partition and the node used.
Important
To help you quantify the parameter values for your jobs, please refer to the paragraph Job profiling.
For an overview of the available resources and their limits, please refer to the page Information about slurm batch system.
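As a minimal sketch (the values are illustrative and must stay within the limits of the chosen partition), these parameters are typically set in the batch script header:
#SBATCH -t 0-04:00       # wall time, limited by the QOS of the partition
#SBATCH --mem=8G         # memory, limited by the partition and the node
#SBATCH -n 1             # number of tasks
#SBATCH -c 4             # CPUs per task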
The main guidelines for the limitations of these resources per partition are the following:
- htc limits the job to a single node; this implies that the CPU limits will be the hardware limits of the node used, and the memory will be limited to 150 GB.
- htc_daemon limits jobs to 1 CPU and 3 GB of memory.
- htc_highmem: the memory and CPU limits will be the hardware limits of the node used.
- flash has the same CPU and memory limits as htc.
- In the hpc partition the job can “overflow” onto several nodes. The memory and CPU limits will therefore be the total CPU and memory available on the HPC platform.
- gpu limits the job to a single node. The memory limit will be calculated as for htc_highmem; on the other hand, the limit on the number of CPUs is strictly linked to the number of GPUs requested and depends on the lowest CPU/GPU ratio of the GPU platform (linked to the type and configuration of the corresponding hardware). If, for example, one node in the platform has a ratio of 5 CPUs for each GPU, and another has a ratio of 6, the maximum limit will be 5 CPUs for each GPU requested.
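As an illustrative sketch (the GPU request syntax shown is generic Slurm and the values are hypothetical; adapt them to the platform), a job requesting 2 GPUs on a platform whose lowest ratio is 5 CPUs per GPU can request at most 2 x 5 = 10 CPUs:
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2     # 2 GPUs requested (generic Slurm syntax, assumed here)
#SBATCH -c 10            # at most 2 GPUs x 5 CPUs/GPU (lowest ratio of the platform)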