Submit a job¶
sbatch allows the submission of a batch script. For interactive jobs, please refer to the srun command described below.
% sbatch my_slurm_script.sh
sbatch: slurm_job_submit: Submission node: cca001
sbatch: slurm_job_submit: Set partition to: htc
sbatch: slurm_job_submit: Info : Partition limited to one node per job.
Submitted batch job 936607
Upon submission, the command outputs some information: the submission machine (here cca001), the partition where the job will be run (here htc), and the job id.
srun allows the allocation of the specified resources, and runs a program or an application. Furthermore, srun allows the execution of parallel tasks, and is generally called in a script setting different commands or tasks to run in parallel (parallel jobs).
The srun and sbatch commands admit the same set of parameters, but the srun execution, unlike sbatch, is not associated with an interactive shell. The main consequence is that errors potentially encountered during the execution of srun will not be reported in the output file (which is the case for sbatch). To associate the command to a shell, use the corresponding srun option.
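As a sketch (the script contents and values below are illustrative), a parallel job is typically submitted with sbatch, and the script calls srun, which launches the command once per allocated task:

```shell
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=0-00:10:00

# srun runs the command once per task of the allocation made by sbatch,
# here 4 instances of hostname in parallel
srun hostname
```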
The job submission directory is the working directory. TMPDIR may be used for storing large amounts of data, but it is not the working directory. You may force Slurm to use this space, or any other space, as your working directory with the -D | --chdir= option.
If any option is not declared upon submission, SLURM will allocate by default the following parameters to the job:
--partition=htc --ntasks=1 --time=7-00:00:00
To facilitate the computing farm management, we suggest declaring an approximate estimate of the job's execution time.
Essential sbatch options¶
You may find below a brief description of the essential
sbatch options. To browse all the available options, please refer to the command help
-n | --ntasks=
- states the maximum number of parallel tasks launched by the job. By default, it corresponds to the allocated CPU number. If this option is used with srun, the task will be repeated as many times as specified, in parallel
-c | --cpus-per-task=
- states the number of cores per process. This option must be specified if a parent process launches parallel child processes.
--mem=
- states the amount of needed memory, ex. 5G
-N | --nodes=
- states the number of computing servers needed
-L | --licenses=
- states the types of storage and software resources needed by the job
-p | --partition=
- selects the partition to be used
-t | --time=
- sets a limit on the total run time of the job allocation. Acceptable formats: “hours:minutes:seconds”, “days-hours:minutes:seconds”.
-A | --account=
- states the group to be charged with the resources used by the job
-a | --array=
- allows the description of a job array; , is the separator to define a list, and - defines an interval, ex. --array=1,3,5 or --array=0-15
-J | --job-name=
- defines the job name
-D | --chdir=
- sets the working directory of the batch script before it is executed. The path can be specified as full path or relative path
-o | --output=
- states the file name for the standard output; by default slurm-%j.out, where %j is the job id
-e | --error=
- states the file name for the error messages
--mail-user=
- states the email address to receive the chosen notifications, such as job state changes
--mail-type=
- states the type of alert to be notified
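Putting the essential options together, a submission script could look like the following sketch; the job name, resource values and file names are placeholders to adapt:

```shell
#!/bin/bash
#SBATCH --job-name=my_analysis     # -J: job name
#SBATCH --partition=htc            # -p: partition
#SBATCH --ntasks=1                 # -n: number of tasks
#SBATCH --cpus-per-task=4          # -c: cores per task
#SBATCH --mem=8G                   # memory needed by the job
#SBATCH --time=0-02:00:00          # -t: 2 hours maximum
#SBATCH --output=%x_%j.out         # -o: standard output (%x: job name, %j: job id)
#SBATCH --error=%x_%j.err          # -e: error messages

echo "Job running on $(hostname)"
```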
Environment and limitations¶
The TMPDIR environment variable is set to /tmp. During the job, this folder is mounted on a private per-user folder in /scratch/slurm.<jobid>.<taskid>/tmp. Temporary data and the mapping are deleted at the end of the job, ensuring both privacy and security.
It is recommended to use TMPDIR for performance when reading a large volume of data, as the /scratch partition is large and local.
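A typical pattern is to stage data into TMPDIR, work on the local copy, and bring the results back before the job ends, since TMPDIR is wiped when the job finishes. In the sketch below, big_input.dat and the sort step are illustrative placeholders:

```shell
#!/bin/bash
#SBATCH --partition=htc
#SBATCH --time=0-01:00:00

# stand-in for your real input data
printf 'b\na\nc\n' > big_input.dat
# stage the input into the fast local space
workdir="${TMPDIR:-/tmp}"
cp big_input.dat "$workdir/"
# work on the local copy (the sort is a placeholder for the real processing)
( cd "$workdir" && sort big_input.dat > result.dat )
# copy the results back before TMPDIR is cleaned up at the end of the job
cp "$workdir/result.dat" "${SLURM_SUBMIT_DIR:-.}/"
```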
Each computing group is limited in the number of slots concurrently occupied by running jobs; the limit depends on the group's yearly resources request. To know this limit for a given group (account):
% sacctmgr show assoc where account="<group>" user= format="Account,Grpcpus"
   Account  GrpCPUs
---------- --------
   <group>     2000
Once this limit is reached, subsequent submissions will remain pending in the queue until the completion of running jobs frees the required number of slots. However, if you receive the following error upon submission:
sbatch: error: AssocMaxSubmitJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
or, when monitoring a waiting job, you see the following:
% squeue
 JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
936551       htc littlene  user PD  0:00     1 (AssocMaxJobsLimit)
you have probably been blocked. Please check your inbox or contact the user support.
CPU and memory requests¶
Without any specification upon submission, the number of cores allocated to the job will be 1, and the associated memory will be 3 GB. Slurm maintains this ratio between the number of cores and the memory, correcting (over-approximating) the allocated resource values if needed. Examples:
% sbatch -n 2 --mem=4G my_script.sh
- Slurm will allocate the 2 CPU cores, since the resulting associated memory (
2 x 3 = 6 GB) covers the requested memory.
% sbatch -n 2 --mem=8G my_script.sh
- Slurm will here allocate 3 CPU cores to allow the 8G request to be accepted (3 x 3 = 9 GB).
CPU cores and memory are limited by the physical resources available on the worker nodes. For instance, for the default partition htc, the maximum number of available CPU cores is 64, and the associated memory is 192 GB.
It is not necessary to specify both the memory and the number of CPU cores required: specifying one resource sets the value of the other.
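The over-approximation described above can be sketched as a quick calculation. The helper below is hypothetical (not a Slurm tool) and assumes the 3 GB-per-core ratio of the htc partition:

```shell
#!/bin/sh
# Hypothetical helper: given -n <tasks> and --mem=<GB>, compute how many
# cores Slurm will actually allocate on the htc partition (3 GB per core).
cores_for_request() {
  tasks=$1
  mem_gb=$2
  ratio=3
  # ceiling division: cores needed to cover the memory request
  mem_cores=$(( (mem_gb + ratio - 1) / ratio ))
  # Slurm keeps the larger of the task count and the memory-derived count
  if [ "$mem_cores" -gt "$tasks" ]; then
    echo "$mem_cores"
  else
    echo "$tasks"
  fi
}

cores_for_request 2 4   # -n 2 --mem=4G -> 2 cores (2 x 3 = 6 GB covers 4 GB)
cores_for_request 2 8   # -n 2 --mem=8G -> 3 cores (3 x 3 = 9 GB covers 8 GB)
```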
- In the case of the htc_highmem partition, 37 GB per CPU core will be allocated,
- In the case of the gpu partition, the necessary CPU and memory resources will be automatically allocated according to the number of requested GPUs.
Storage and licence resources¶
The storage systems accessed by your jobs, as well as any software licenses used, must be declared upon submission. This is carried out using the -L option of the sbatch command:
% sbatch -L sps,matlab my_script.sh
In order to know the resource limits, use the command:
% scontrol show lic
For an exhaustive list of resources requiring declaration upon submission:
% scontrol show lic | grep default
To find out the resource limitations of a given group:
% scontrol show lic | grep -A1 <group>
For group-specific resources, use the syntax <resource>_<group> on the submission line.
Please refer to the MATLAB page to know the correct declaration syntax.