Job monitoring
Computing Service Status
To assess the computing platform occupancy status, you may refer to the user portal or run the following command:
% sjstat | more
Scheduling pool data:
-------------------------------------------------------------
Pool           Memory  Cpus  Total  Usable  Free  Other Traits
-------------------------------------------------------------
htc*         265000Mb    64    190     190     0  htc,x86_64,el9
htc*         540000Mb   112     56      56     0  htc,x86_64,el9
htc_inter    192934Mb    64      1       1     0  htc_interactive,x86_64,el9
htc_inter    128361Mb    40      2       2     0  htc_interactive,x86_64,el9
htc_highm   1546522Mb    40      1       1     1  htc_highmem,x86_64,el9
gpu          191850Mb    24      9       9     0  gpu,x86_64,el9
[..]
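To see at a glance which pools still have free nodes, the table above can be summed per pool. The following is a minimal sketch, not part of sjstat or Slurm: the function name free_nodes_by_pool is hypothetical, and it assumes the column layout shown above (Memory ending in Mb in the second column, Free in the sixth).

```shell
# Hypothetical helper (not a Slurm command): sums the "Free" column per pool.
# Assumed input layout, as in the sjstat table above:
#   Pool Memory Cpus Total Usable Free Other-Traits
free_nodes_by_pool() {
    awk '
        # Keep only data rows: the second field is a memory size such as 265000Mb
        $2 ~ /^[0-9]+Mb$/ {
            pool = $1
            sub(/\*$/, "", pool)   # drop the trailing "*" marker on the pool name
            free[pool] += $6       # sixth column is "Free"
        }
        END {
            for (p in free) printf "%s %d\n", p, free[p]
        }
    ' | LC_ALL=C sort
}

# Usage on the cluster (requires sjstat):
#   sjstat | free_nodes_by_pool
```

Header and separator lines are ignored because their second field is not a memory size.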
Job submission status
The squeue command displays information about jobs, including the execution time, the current state (ST column, where R means running and PD means pending), the job name, and the partition in which the job runs:
% squeue
JOBID  PARTITION  NAME   USER        ST  TIME  NODES  NODELIST(REASON)
465    htc        hello  <username>  R   0:01  1      ccwtbslurm01
The main options to squeue are:
-t [running|pending]
displays only the jobs in the running or pending state
[[-v] -l] -j
displays a specific job; -l gives a long output, -v a verbose output
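When many jobs are queued, a count per state is often easier to read than the full listing. The sketch below assumes the state names are printed one per line, which squeue can do with its -h (no header) and -o "%T" (extended state name) options; the function name count_jobs_by_state is hypothetical.

```shell
# Hypothetical helper: counts jobs per state.
# Input: one state name per line (e.g. RUNNING, PENDING).
count_jobs_by_state() {
    # sort groups identical states, uniq -c counts them,
    # awk swaps the columns to print "STATE COUNT"
    LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on the cluster (requires squeue):
#   squeue -h -u "$USER" -o "%T" | count_jobs_by_state
```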
Attention
The squeue output truncates some job fields to 8 characters. To get the full value, use the -O option. For example, to display the full job name:
% squeue -O JobID,Name
For more information about this command and its output, refer to the official documentation and to the manual page:
% man squeue
Job efficiency
The seff command displays the resources used by a specific job and calculates the job's efficiency:
% seff <jobid>
Job ID: <jobid>
Cluster: ccslurmlocal
User/Group: <username>/<groupid>
State: CANCELLED (exit code 0)
Cores: 1
CPU Utilized: 00:12:50
CPU Efficiency: 98.59% of 00:13:01 core-walltime
Job Wall-clock time: 00:13:01
Memory Utilized: 120.00 KB
Memory Efficiency: 0.00% of 0.00 MB
Attention
The seff command samples job activity approximately every 30 seconds. The value of the Memory Utilized field is therefore not the absolute maximum of the memory used, but the maximum of the sampled values. For a more accurate assessment of your job, use the method described in the Job profiling section.
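The seff report can also be checked from a script, for example to flag jobs that used only a small share of their allocated CPU time. This is a minimal sketch that assumes the "CPU Efficiency:" line has the format shown in the sample output above; the function names and the threshold are hypothetical.

```shell
# Hypothetical helpers working on the text printed by seff (read from stdin).

# Prints the CPU efficiency as a number, without the "%" sign.
cpu_efficiency() {
    awk '/^CPU Efficiency:/ { sub(/%$/, "", $3); print $3 }'
}

# Succeeds (exit status 0) when the CPU efficiency is below the threshold $1.
is_cpu_inefficient() {
    local threshold="$1"
    local eff
    eff=$(cpu_efficiency)
    [ -n "$eff" ] && awk -v e="$eff" -v t="$threshold" 'BEGIN { exit !(e < t) }'
}

# Usage on the cluster (requires seff):
#   seff <jobid> | is_cpu_inefficient 50 && echo "low CPU efficiency"
```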
Job hold and alteration
The scontrol command manages jobs. Its hold, update and release options respectively hold a job (prevent it from being scheduled), modify it, and release it (make it eligible for scheduling again):
% scontrol [hold|update|release] <job list>
The following attributes can be changed after a job is submitted:
wall clock limit,
job name,
job dependency.
Note
In some cases, these attributes can be updated for pending jobs. The wall clock limit may only be reduced, never increased.
The following job attributes cannot be updated during runtime:
number of GPUs requested,
node(s),
memory.
For more information about this command, refer to its help (scontrol -h) and to the following documentation:
https://slurm.schedmd.com/scontrol.html#OPT_update
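The hold, update, release sequence described above can be sketched as a small script. It assumes the JobId= and TimeLimit= keywords of scontrol update; the function names run and reduce_walltime, and the DRY_RUN variable, are hypothetical. With DRY_RUN=1 the commands are only printed, which makes it possible to review them before touching a real job.

```shell
# Hypothetical wrapper: prints the command when DRY_RUN=1, otherwise runs it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hold a pending job, reduce its wall clock limit, then release it.
#   $1: job ID
#   $2: new time limit, in the same format as sbatch -t (e.g. 0-00:30)
reduce_walltime() {
    local jobid="$1"
    local limit="$2"
    run scontrol hold "$jobid" &&
    run scontrol update "JobId=$jobid" "TimeLimit=$limit" &&
    run scontrol release "$jobid"
}

# Usage:
#   DRY_RUN=1 reduce_walltime 465 0-00:30   # print the commands only
#   reduce_walltime 465 0-00:30             # run them (requires scontrol)
```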
Job deletion
The scancel command cancels one or more jobs:
% scancel <jobid>
or all of a specific user’s jobs:
% scancel -u <username>
or a whole series of jobs having the same name:
% scancel -n <jobname>
The full name of the job series is required; use squeue (with the -O option if the name is truncated) to retrieve it.
For more information about this command, refer to its help (scancel -h).
Ended job status
The sacct command displays the state, the partition and the account of a job:
% sacct
JobID        JobName    Partition  Account    AllocCPUS  State      ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1377         stress.sh  htc        ccin2p3    8          CANCELLED+ 0:0
1381         stress.sh  htc        ccin2p3    8          COMPLETED  0:0
1381.batch   batch                 ccin2p3    8          COMPLETED  0:0
The output format can be customized with the --format option:
% sacct --format="Account,JobID,NodeList,CPUTime,MaxRSS"
Account    JobID        NodeList        CPUTime    MaxRSS
---------- ------------ --------------- ---------- ----------
ccin2p3    1523         ccwslurm0001    00:10:14
ccin2p3    1523.batch   ccwslurm0001    00:10:14
ccin2p3    1524         ccwslurm0001    00:10:14
or set the SACCT_FORMAT environment variable to define a new default output:
% export SACCT_FORMAT=Account,JobID,NodeList,CPUTime,MaxRSS
% sacct
Account    JobID        NodeList        CPUTime    MaxRSS
---------- ------------ --------------- ---------- ----------
...        ...          ...             ...        ...
To display the complete list of available fields:
% sacct -e
For more information about this command, refer to its help (sacct -h).
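For scripted checks, the column-aligned output is awkward to parse. A common alternative is the pipe-delimited output produced by the sacct -P option. The sketch below assumes input lines of the form JobID|State|ExitCode; the function name non_completed_jobs is hypothetical.

```shell
# Hypothetical helper: keeps the jobs whose state is not COMPLETED.
# Input: pipe-delimited lines "JobID|State|ExitCode" read from stdin.
non_completed_jobs() {
    awk -F'|' '$2 != "COMPLETED"'
}

# Usage on the cluster (requires sacct):
#   -X  one line per job, without job steps
#   -n  no header line
#   -P  pipe-delimited output
#   sacct -X -n -P --format=JobID,State,ExitCode | non_completed_jobs
```

Jobs that are still running or pending are also listed, since their state is not COMPLETED either.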
Job profiling
The computing platform can profile a job. It provides the user with an HTML file, which can be opened in a browser and shows profiling information and graphs, and an XML file with the raw profiling values.
To profile a job, add at least the --profile=task option to your submission line:
% sbatch -t 0-01:00 -n 3 --mem 7G --profile=task [--acctg-freq=task=10] job.sh
--profile=task
activates the profiling agent
--acctg-freq=task=<number>
defines the polling interval in seconds (between 1 and 15). By default, <number>=15.
After launching your production job with the above options, you may launch a second job using the following syntax:
% sbatch -t 0-01:00 -n 1 --mem 1G -d <jobid> slurm_profiling <jobid>
where <jobid> is the ID of the job you want to profile. The files profile_<jobid>.html and profile_<jobid>.xml will be created in your job working directory.
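The two submissions above can be chained in one script so that the job ID does not have to be copied by hand. This is a sketch under assumptions: it relies on sbatch --parsable, which prints the job ID (optionally followed by ";" and the cluster name), it reuses the resource values and the slurm_profiling call from the examples above as they are, and the function names are hypothetical.

```shell
# Hypothetical helpers chaining the production job and the profiling job.

# sbatch --parsable prints "jobid" or "jobid;cluster": keep the job ID only.
extract_jobid() {
    printf '%s\n' "${1%%;*}"
}

# Succeeds if $1 is an integer between 1 and 15 (the range given above).
valid_acctg_freq() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 15 ]
}

# $1: polling interval in seconds, $2: job script.
submit_with_profiling() {
    local freq="$1"
    local script="$2"
    local out jobid
    valid_acctg_freq "$freq" || {
        echo "polling interval must be between 1 and 15" >&2
        return 1
    }
    # Production job, with the profiling agent enabled
    out=$(sbatch --parsable -t 0-01:00 -n 3 --mem 7G \
          --profile=task --acctg-freq=task="$freq" "$script") || return 1
    jobid=$(extract_jobid "$out")
    # Profiling job, same syntax as in the text above
    sbatch -t 0-01:00 -n 1 --mem 1G -d "$jobid" slurm_profiling "$jobid"
}

# Usage on the cluster (requires sbatch):
#   submit_with_profiling 10 job.sh
```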
Note
It is possible to profile an interactive job:
Start the session (here, a GPU interactive job):
% srun -t 0-02:00 -n 4 --profile=task [--acctg-freq=task=10] --mem 2G --pty --gpus 1 bash -i
In the session, retrieve the jobid:
% echo $SLURM_JOBID
Once the session has ended (exit or Ctrl-d), run the profiling job as described above.