Job monitoring

Computing Service Status

To assess the computing platform occupancy status, you may refer to the user portal or run the following command:

% sjstat | more

Scheduling pool data:
-------------------------------------------------------------
Pool        Memory  Cpus  Total Usable   Free  Other Traits
-------------------------------------------------------------
htc*       265000Mb    64    190    190      0  htc,x86_64,el9
htc*       540000Mb   112     56     56      0  htc,x86_64,el9
htc_inter  192934Mb    64      1      1      0  htc_interactive,x86_64,el9
htc_inter  128361Mb    40      2      2      0  htc_interactive,x86_64,el9
htc_highm 1546522Mb    40      1      1      1  htc_highmem,x86_64,el9
gpu        191850Mb    24      9      9      0  gpu,x86_64,el9
[..]

Job submission status

The squeue command displays information about jobs in the queue. Among other things, it gives the execution time, the current state (ST column; R for running, PD for pending), the job name, and the partition in which the job runs:

% squeue
JOBID PARTITION     NAME     USER         ST       TIME      NODES NODELIST(REASON)
465   htc           hello    <username>   R        0:01      1     ccwtbslurm01

The main options to squeue are:

-t [running|pending]: displays only the jobs in the given state (running or pending)
[[-v] -l] -j <jobid>: displays a specific job; -l gives a long output, -v a verbose output
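
As an illustration of the ST column and the -t filter, the sketch below counts jobs per state. It assumes a POSIX shell. count_states is a hypothetical helper name, not a Slurm tool; the squeue options in the comments (-h to drop the header, -u to select a user, -o "%t" to print only the compact state) are standard squeue options.

```shell
# Hypothetical helper: count jobs per state code.
# Reads one state code per line (e.g. R, PD) on stdin and
# prints "<state> <count>" lines.
count_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# On the cluster, feed it with your own jobs (not run here):
#   squeue -h -u "$USER" -o "%t" | count_states
# To list only your pending jobs with the -t filter:
#   squeue -u "$USER" -t pending
```

The helper only reads text on its standard input, so it can be tried on any list of state codes without access to the cluster.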

Attention

The default squeue output truncates some job fields to 8 characters. To get the full value, use the -O option. For example, to display the full job name:

% squeue -O JobID,Name

For more detailed information about your pending and running jobs, use the scontrol command:

% scontrol show job=<job id>

This command does not work on jobs that have ended; for those, use sacct.
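
The scontrol output is a series of Key=Value pairs, so a single attribute can be extracted with standard text tools. The following is a minimal sketch, assuming a POSIX shell and values that contain no spaces. job_field is a hypothetical helper name.

```shell
# Hypothetical helper: print the value of one key from
# "scontrol show job" style output read on stdin.
# Assumption: the value contains no spaces or tabs.
job_field() {
    # Put each Key=Value pair on its own line, then match the key.
    tr -s ' \t' '\n\n' |
        awk -F= -v key="$1" '!found && $1 == key {
            sub(/^[^=]*=/, "")   # drop "Key=", keep the value
            print
            found = 1
        }'
}

# On the cluster (not run here):
#   scontrol show job=<job id> | job_field JobState
#   scontrol show job=<job id> | job_field TimeLimit
```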

Job efficiency

The seff command displays the resources used by a specific job and calculates its efficiency.

% seff <jobid>
Job ID: <jobid>
Cluster: ccslurmlocal
User/Group: <username>/<groupid>
State: CANCELLED (exit code 0)
Cores: 1
CPU Utilized: 00:12:50
CPU Efficiency: 98.59% of 00:13:01 core-walltime
Job Wall-clock time: 00:13:01
Memory Utilized: 120.00 KB
Memory Efficiency: 0.00% of 0.00 MB
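
The CPU Efficiency line in the output above is consistent with dividing CPU Utilized by the core-walltime, that is, the number of cores multiplied by the wall-clock time. The sketch below reproduces that arithmetic with the values from the example. It assumes a POSIX shell; to_seconds and cpu_efficiency are hypothetical helper names, not part of seff.

```shell
# Convert a Slurm-style duration ([D-]HH:MM:SS) to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else         print $1 * 3600 + $2 * 60 + $3
    }'
}

# CPU efficiency in percent: CPU time / (cores * wall-clock time).
# $1 = CPU Utilized, $2 = Job Wall-clock time, $3 = Cores
cpu_efficiency() {
    cpu=$(to_seconds "$1")
    wall=$(to_seconds "$2")
    awk -v c="$cpu" -v w="$wall" -v n="$3" \
        'BEGIN { printf "%.2f\n", 100 * c / (w * n) }'
}

# Values from the seff example: 770 s of CPU over 781 s on 1 core.
cpu_efficiency 00:12:50 00:13:01 1   # prints 98.59
```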

Attention

The seff command samples job activity approximately every 30 seconds. The value of the Memory Utilized field is therefore not the absolute maximum of the used memory, but the maximum of the sampled values. For a more accurate assessment of your job, please use the method explained in the Job profiling section.

Job hold and alteration

The scontrol command is used to manage jobs. Its hold, update and release options respectively hold a pending job (keep it from starting), modify it, and make it eligible for scheduling again:

% scontrol [hold|update|release] <job list>

The following attributes can be changed after a job is submitted:

wall clock limit,
job name,
job dependency.

Note

In some cases, these attributes can be updated for pending jobs. The wall clock limit may only be reduced, never increased.

The following job attributes cannot be updated during runtime:

number of GPUs requested,
node(s),
memory.
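
The hold, update and release sequence described above can be sketched as follows for the case of lowering a wall clock limit. This is a dry run under stated assumptions: shorten_job_plan is a hypothetical helper that only prints the commands, and the job ID and time limit are illustrative values. JobId= and TimeLimit= are standard scontrol update keywords.

```shell
# Hypothetical helper: print (without running) the commands that hold a
# pending job, lower its wall clock limit, then release it.
# $1 = job ID, $2 = new, shorter time limit
shorten_job_plan() {
    echo "scontrol hold $1"
    echo "scontrol update JobId=$1 TimeLimit=$2"
    echo "scontrol release $1"
}

# Illustrative values; review the printed commands before running them.
shorten_job_plan 465 00:30:00
```

Printing the commands first makes it easy to check the job ID and the new limit before anything is changed on the cluster.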

For more information about this command, please refer to the help (scontrol -h) and the following documentation: https://slurm.schedmd.com/scontrol.html#OPT_update

Job deletion

The scancel command cancels one or more jobs:

% scancel <jobid>

or all of a specific user’s jobs:

% scancel -u <username>

or a whole series of jobs having the same name:

% scancel -n <jobname>

This requires the full name of the job series; to obtain it, see the squeue -O option described above.

For more information about this command, please refer to the help scancel -h.

Ended job status

The sacct command displays accounting information about jobs, such as their state, partition and account:

% sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1377          stress.sh   htc         ccin2p3          8 CANCELLED+      0:0
1381          stress.sh   htc         ccin2p3          8  COMPLETED      0:0
1381.batch        batch               ccin2p3          8  COMPLETED      0:0

Important

Workflow managers (Parsl, Snakemake, etc.) use sacct to track jobs. Slurm has no parameter to limit how often sacct is executed, so a misconfigured tool can issue too many requests and degrade the service.

We recommend that you configure these tools carefully, grouping job queries into a single call (sacct <jobid1>,<jobid2> instead of two separate commands) and/or limiting the execution frequency (ideally no more than once every 10 seconds).
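
The two recommendations can be sketched as a small tracking loop. This is a minimal illustration, assuming a POSIX shell; join_ids and wait_for_jobs are hypothetical helper names, and the loop only checks for the PENDING and RUNNING states, which is a simplification. The sacct options used (-n, -X, -P, -j, --format) are standard.

```shell
# Join the job IDs given as arguments into one comma-separated list.
join_ids() {
    (IFS=,; echo "$*")
}

# Hypothetical tracker: a single sacct call covers every job, repeated at
# most once every 10 seconds until none is pending or running.
# Simplification: other transient states are ignored. Not called here.
wait_for_jobs() {
    ids=$(join_ids "$@")
    while sacct -n -X -P -j "$ids" --format=State |
            grep -Eq 'PENDING|RUNNING'; do
        sleep 10
    done
}

join_ids 1523 1524   # prints 1523,1524
```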

The output format can be customized for a single call with the --format option:

% sacct --format="Account,JobID,NodeList,CPUTime,MaxRSS"
   Account        JobID        NodeList    CPUTime     MaxRSS
---------- ------------ --------------- ---------- ----------
   ccin2p3 1523            ccwslurm0001   00:10:14
   ccin2p3 1523.batch      ccwslurm0001   00:10:14
   ccin2p3 1524            ccwslurm0001   00:10:14

or set the environment variable SACCT_FORMAT to define a new default output format:

% export SACCT_FORMAT=Account,JobID,NodeList,CPUTime,MaxRSS

% sacct
   Account        JobID        NodeList    CPUTime     MaxRSS
---------- ------------ --------------- ---------- ----------
       ... ...                      ...        ...        ...

To display the complete list of available fields:

% sacct -e

For more information about this command, please refer to the help sacct -h.

Job profiling

The computing platform can profile a job. Profiling produces an HTML file with profiling information and graphs, which can be opened in a browser, and an XML file with the raw profiling values.

To profile a job, add at least the --profile=task option to your submission line:

% sbatch -t 0-01:00 -n 3 --mem 7G --profile=task --acctg-freq=task=15 job.sh

--profile=task: Activates the profiling agent
--acctg-freq=task=<number>: Optional parameter. Defines the polling interval in seconds (between 1 and 15). If not present, the default is <number>=15

After submitting your production job with the above options, submit a second job using the following syntax:

% sbatch -t 0-01:00 -n 1 --mem 1G -d <jobid> slurm_profiling <jobid>

Here, <jobid> is the ID of the job you want to profile. The files profile_<jobid>.html and profile_<jobid>.xml will be created in your job working directory.
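
The two submissions can be chained in a script so that the job ID does not have to be copied by hand. The sketch below is one possible way to do it, assuming a POSIX shell; parse_jobid and submit_with_profiling are hypothetical helper names. It relies on the standard sbatch --parsable option, which prints the job ID, optionally followed by a semicolon and the cluster name.

```shell
# sbatch --parsable prints "<jobid>" or "<jobid>;<cluster>"; keep the ID only.
parse_jobid() {
    echo "${1%%;*}"
}

# Hypothetical wrapper chaining the two submissions shown above.
# $1 = job script (e.g. job.sh). Not called here.
submit_with_profiling() {
    out=$(sbatch --parsable -t 0-01:00 -n 3 --mem 7G \
        --profile=task --acctg-freq=task=15 "$1") || return 1
    jobid=$(parse_jobid "$out")
    sbatch -t 0-01:00 -n 1 --mem 1G -d "$jobid" slurm_profiling "$jobid"
}

# On the cluster (not run here):
#   submit_with_profiling job.sh
```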

Note

It is possible to profile an interactive job:

Start the session (here, a GPU interactive job):

% srun -t 0-02:00 -n 4 --profile=task --mem 2G --pty --gpus 1 bash -i

In the session, retrieve the jobid:
```
% echo $SLURM_JOBID
```
Once the session has ended (exit or Ctrl-d), run the profiling job as described above.