Job monitoring

Computing Service Status

To check how busy the computing platform is, consult the user portal or run the following command:

% sjstat

Scheduling pool data:
-------------------------------------------------------------
Pool        Memory  Cpus  Total Usable   Free  Other Traits
-------------------------------------------------------------
htc*       144801Mb    48    191    191      0  htc
htc*       192934Mb    64    190    190      0  htc
htc_inter  192934Mb    64      1      1      0  htc_interactive
htc_inter  128361Mb    40      2      2      0  htc_interactive
[..]
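
When several lines share the same pool name, a per-pool total can be easier to read. The sketch below is only an illustration: it assumes the column layout shown above (Pool, Memory, Cpus, Total, Usable, Free, Other Traits), and pool_summary is a hypothetical helper name, not a platform command.

```shell
# pool_summary: read sjstat output on stdin and print, for each pool,
# the summed "Usable" and "Free" columns.
# Assumption: data rows have a second field such as "144801Mb".
pool_summary() {
    awk '
        $2 ~ /^[0-9]+Mb$/ {            # keep data rows, skip headers and rulers
            pool = $1
            sub(/\*$/, "", pool)       # drop the trailing "*" marker
            usable[pool] += $5
            free[pool]   += $6
        }
        END {
            for (p in usable)
                printf "%s usable=%d free=%d\n", p, usable[p], free[p]
        }
    ' | LC_ALL=C sort                  # awk iteration order is unspecified
}

# Possible use on a login node:
#   sjstat | pool_summary
```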

Job submission status

The squeue command displays information about the jobs in the queue. Among other things, it gives the elapsed execution time, the current state (ST column, for example R for running and PD for pending), the job name, and the partition in which the job runs:

% squeue
JOBID PARTITION     NAME     USER      ST       TIME      NODES NODELIST(REASON)
465   multiseq      hello    user      R        0:01      1     ccwtbslurm01

The main options to squeue are:

-t [running|pending]

displays only the jobs in the given state (running or pending)

[[-v] -l] -j <jobid>

displays a specific job; -l gives a long output, -v a verbose output
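
As a complement to these options, the following sketch counts jobs per state. It relies on the standard squeue options -h (no header), -u (user filter) and -o "%T" (state in long form), which are not described on this page; count_by_state is a hypothetical helper name.

```shell
# count_by_state: read one job state per line on stdin
# and print "STATE=count" for each distinct state.
count_by_state() {
    LC_ALL=C sort | uniq -c | awk '{ printf "%s=%d\n", $2, $1 }'
}

# Possible use on a login node:
#   squeue -h -u "$USER" -o "%T" | count_by_state
```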

Attention

By default, squeue truncates some fields, such as the job name, to 8 characters. To get the full value, use the -O option. For example, to display the job ID and the job name:

% squeue -O JobID,Name

For more information about this command and its output, please refer to the official documentation or to the manual page:

% man squeue

Job efficiency

The seff command displays the resources used by a specific job and calculates its efficiency.

% seff <jobid>
Job ID: <jobid>
Cluster: ccslurmlocal
User/Group: <userid>/<groupid>
State: CANCELLED (exit code 0)
Cores: 1
CPU Utilized: 00:12:50
CPU Efficiency: 98.59% of 00:13:01 core-walltime
Job Wall-clock time: 00:13:01
Memory Utilized: 120.00 KB
Memory Efficiency: 0.00% of 0.00 MB
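
To check efficiency from a script, the "CPU Efficiency" line can be extracted from this output. The sketch below assumes the output format shown above; cpu_efficiency and check_efficiency are hypothetical helper names, and the threshold is a value chosen by the user.

```shell
# cpu_efficiency: read seff output on stdin and print the CPU efficiency
# as a bare number (for example 98.59).
cpu_efficiency() {
    awk -F': ' '$1 == "CPU Efficiency" { sub(/%.*/, "", $2); print $2 }'
}

# check_efficiency MIN: read seff output on stdin and report whether the
# CPU efficiency is below MIN percent.
check_efficiency() {
    min=$1
    eff=$(cpu_efficiency)
    # awk performs the floating-point comparison; exit status 0 means "below"
    if awk -v e="$eff" -v m="$min" 'BEGIN { exit !(e + 0 < m + 0) }'; then
        echo "low CPU efficiency: ${eff}%"
    else
        echo "CPU efficiency OK: ${eff}%"
    fi
}

# Possible use on a login node:
#   seff <jobid> | check_efficiency 50
```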

Job hold and alteration

The scontrol command manages jobs. Its hold, update and release options respectively hold a job (keep it pending so that it is not scheduled), modify it, and release it so that it can be scheduled again:

% scontrol [hold|update|release] <jobs id list>

The following attributes can be changed after a job is submitted:

  • wall clock limit,

  • job name,

  • job dependency.

In some cases, these attributes can only be updated for pending jobs.

Finally, please note that some job attributes cannot be updated during runtime:

  • number of GPUs requested,

  • node(s),

  • memory.
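
The hold, update and release sequence described above can be sketched as follows for the wall clock limit. This is an illustration under assumptions: it uses the JobId= and TimeLimit= keywords from the scontrol documentation linked in this section, the helper names run and change_time_limit are hypothetical, and site policy may restrict which changes a regular user is allowed to make (for example, raising a limit may be refused).

```shell
# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# change_time_limit JOBID NEW_LIMIT
# Hold a pending job, change its wall clock limit, then release it.
# The sequence stops at the first command that fails.
change_time_limit() {
    job_id=$1
    new_limit=$2
    run scontrol hold "$job_id" &&
    run scontrol update JobId="$job_id" TimeLimit="$new_limit" &&
    run scontrol release "$job_id"
}

# Preview the commands without touching the job:
#   DRY_RUN=1 change_time_limit <jobid> 00:30:00
```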

For more information about this command, please refer to the help (scontrol -h) and to the following documentation: https://slurm.schedmd.com/scontrol.html#OPT_update

Job deletion

The scancel command deletes one or more jobs:

% scancel <jobid>

or all of a specific user’s jobs:

% scancel -u <userid>

or a whole series of jobs having the same name:

% scancel -n <jobname>

This requires the full name of the job series, which can be obtained with the squeue command (see the -O option above).

For more information about this command, please refer to the help (scancel -h).
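
To find the exact name to pass to scancel -n, the full job names can be listed first. The sketch below assumes that squeue -O accepts a field width (written here as Name:100) and pads each field with spaces up to that width; distinct_job_names is a hypothetical helper name.

```shell
# distinct_job_names: read job names on stdin (one per line, possibly
# padded with trailing blanks) and print each distinct name once.
distinct_job_names() {
    sed 's/[[:space:]]*$//' | LC_ALL=C sort -u
}

# Possible use on a login node:
#   squeue -h -u "$USER" -O Name:100 | distinct_job_names
#   scancel -n <jobname>
```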

Ended job status

The sacct command displays accounting information about jobs, including ended ones, such as their state, partition and account:

% sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1377          stress.sh   multiseq    ccin2p3          8 CANCELLED+      0:0
1381          stress.sh   multiseq    ccin2p3          8  COMPLETED      0:0
1381.batch        batch               ccin2p3          8  COMPLETED      0:0

The output format can be customized with the --format option:

% sacct --format="Account,JobID,NodeList,CPUTime,MaxRSS"
   Account        JobID        NodeList    CPUTime     MaxRSS
---------- ------------ --------------- ---------- ----------
   ccin2p3 1523            ccwslurm0001   00:10:14
   ccin2p3 1523.batch      ccwslurm0001   00:10:14
   ccin2p3 1524            ccwslurm0001   00:10:14

Alternatively, set the SACCT_FORMAT environment variable to define the default output format:

% export SACCT_FORMAT=Account,JobID,NodeList,CPUTime,MaxRSS
% sacct
   Account        JobID        NodeList    CPUTime     MaxRSS
---------- ------------ --------------- ---------- ----------
       ... ...                      ...        ...        ...

To display the complete list of available fields:

% sacct -e

For more information about this command, please refer to the help (sacct -h).
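
For scripted checks, the fixed-width table is awkward to parse. The sketch below assumes the standard sacct options -n (no header) and -P (fields separated by "|"), which are not described on this page, and a field order of JobID, JobName, State, ExitCode; not_completed is a hypothetical helper name.

```shell
# not_completed: read "JobID|JobName|State|ExitCode" lines on stdin and
# print the jobs whose state is not COMPLETED.
# Job steps (IDs containing a dot, such as 1381.batch) are skipped.
not_completed() {
    awk -F'|' '$1 !~ /\./ && $3 != "COMPLETED" { print $1, $3, $4 }'
}

# Possible use on a login node:
#   sacct -n -P --format=JobID,JobName,State,ExitCode | not_completed
```

Jobs that are still running or pending are also listed, since their state is not COMPLETED.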

Job profiling

The computing platform can profile a job. It provides the user with an HTML file, which can be opened in a browser and shows profiling information and graphs, and an XML file containing the raw profiling values.

To profile a job, add at least one option to your submission line:

% sbatch -t 0-01:00 -n 3 --mem 7G --profile=task [--acctg-freq=task=10] job.sh

--profile=task

activates the profiling agent

--acctg-freq=task=<number>

defines the polling interval in seconds (between 1 and 15). Default: <number>=15

After submitting your production job with the above options, submit a second job using the following syntax:

% sbatch -t 0-01:00 -n 1 --mem 1G -d <jobid> slurm_profiling <jobid>

<jobid> is the ID of the job you want to profile. The HTML and XML files will be created in the working directory of your job.
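
The two submissions can be chained so that the job ID does not have to be copied by hand. The sketch below reuses the resource values of the examples above and assumes the standard sbatch option --parsable, which prints only the job ID (possibly followed by ";" and a cluster name); submit_with_profiling is a hypothetical helper name.

```shell
# submit_with_profiling SCRIPT
# Submit SCRIPT with task profiling enabled, then submit the
# slurm_profiling job that depends on it.
submit_with_profiling() {
    job_script=$1
    # Capture the job ID of the production job; stop if submission fails
    job_id=$(sbatch --parsable -t 0-01:00 -n 3 --mem 7G --profile=task "$job_script") || return 1
    job_id=${job_id%%;*}      # strip an optional ";cluster" suffix
    sbatch -t 0-01:00 -n 1 --mem 1G -d "$job_id" slurm_profiling "$job_id"
}

# Possible use on a login node:
#   submit_with_profiling job.sh
```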