Job monitoring

The qstat command

You can check the status of your pending or running jobs with the qstat command. You get information about the current status of your job:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
5947764 0.00000 hello_worl login        qw    09/31/2019 16:22:23                                    1

Here, the job is waiting (status qw: queued and waiting). The most common statuses are:

r: running (job running)
Rr: running, re-submit (job running that has been restarted)
qw: queued and waiting (job waiting)
Eqw: queued and waiting with error

The other possible statuses are listed below:

Category    State                                            GE Letter Code
----------  -----------------------------------------------  ----------------------------------------------
Pending     pending                                          qw
            pending, user hold                               hqw
            pending, system hold                             hqw
            pending, user and system hold                    hqw
            pending, user hold, re-queue                     hRwq
            pending, system hold, re-queue                   hRwq
            pending, user and system hold, re-queue          hRwq
Running     running                                          r
            transferring                                     t
            running, re-submit                               Rr
            queued, re-submit                                Rq
            transferring, re-submit                          Rt
Suspended   job suspended                                    s, ts
            queue suspended                                  S, tS
            queue suspended by alarm                         T, tT
            all suspended with re-submit                     Rs, Rts, RS, RtS, RT, RtT
Error       all pending states with errors                   Eqw, Ehqw, EhRqw
Deleted     all running and suspended states with deletion   dr, dt, dRr, dRt, ds, dS, dT, dRs, dRS, dRT
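
When post-processing qstat output in a script, it can help to collapse these letter codes back into the categories of the table. The function below is a minimal sketch: the name classify_state is hypothetical (it is not a Grid Engine command), and it only knows the codes listed above.

```shell
# classify_state is a hypothetical helper, not part of Grid Engine.
# It maps a state letter code from the table above to its category.
classify_state() {
    case "$1" in
        d*)            echo "deleted" ;;    # dr, dt, dRr, ds, dS, dT, ...
        E*)            echo "error" ;;      # Eqw, Ehqw, EhRqw
        *s*|*S*|*T*)   echo "suspended" ;;  # s, ts, S, tS, T, tT, Rs, RtT, ...
        r|t|Rr|Rq|Rt)  echo "running" ;;
        qw|hqw|hRwq)   echo "pending" ;;
        *)             echo "unknown" ;;    # code not listed in the table
    esac
}
```

For example, classify_state Eqw prints "error" and classify_state tS prints "suspended". The order of the patterns matters: deletion and error codes are tested first because they also contain letters used by other categories.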

The qstat command provides several useful options:

-s <p|r|s>
displays the jobs in a specific state (“p” for pending, “r” for running, and “s” for suspended)
-u <@groupname>
shows the jobs submitted by a given group
-j <jobid>
displays detailed information about a particular job
-nenv
reduces verbosity
-r
displays information about resources requested by a job
-ext
displays additional information about jobs

For the full list of options, see:

% man qstat
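
To get a quick summary rather than a line per job, the qstat output can be piped through a small filter. This is a sketch under assumptions: count_by_state is a hypothetical name, and the filter assumes the default qstat layout, where the state is the fifth whitespace-separated column after two header lines (this breaks if a job name contains spaces).

```shell
# count_by_state is a hypothetical helper: it reads default qstat output on
# stdin and prints how many output lines are in each state.
# Assumption: two header lines, then one line per job (or per running array
# task), with the state in the 5th whitespace-separated column.
count_by_state() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Typical use on a submit host (requires qstat):
#   qstat -u "$(id -un)" | count_by_state
```

Each output line gives a state code followed by its count, so a large number next to qw simply means that most of your jobs are still waiting.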

Useful commands

Check waiting jobs

To understand why your job is pending, you can use the check_waiting_job command:

% check_waiting_job -j <jobid>

If there is a problem, an error message is returned. The command checks:

  1. if you or your group have been blocked or your computing resources have been limited: in that case contact our user support.
% check_waiting_job -j 12345 -v
DEBUG  verbosity turned on
ERROR  User <user> or group <group> locked
  2. if the requested resources exceed the limit set for the requested queue (check this page)
% check_waiting_job -j 987654 -v
DEBUG  verbosity turned on
DEBUG  User <user> or group <group> is not limited or blocked
DEBUG  Read long queue configuration
DEBUG  User <user> allowed on queue long
ERROR  s_rss not valid for the queue long: asked 4.6 - available 4.0
  3. if you have sufficient submission rights for the requested queue
% check_waiting_job -j 1111111 -v
DEBUG  verbosity turned on
DEBUG  User <user> or group <group> is not limited or blocked
INFO   Queue mc_debug defined by the job
DEBUG  Read mc_debug queue configuration
ERROR  User <user>  can NOT run jobs on queue mc_debug

If no problems are detected, you generally have to wait for the resources to be available (check the resource usage).

The available options to the command are listed using check_waiting_job -h.
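
If several jobs are waiting, the check can be repeated for each of them. The function below is a sketch only: check_pending_jobs is a hypothetical name, and it assumes that the job ID is the first column of the default qstat output, after two header lines.

```shell
# check_pending_jobs is a hypothetical wrapper, not a site-provided command.
# It runs check_waiting_job on every pending job of a user
# (default: the current user).
check_pending_jobs() {
    user="${1:-$(id -un)}"
    # Skip the two qstat header lines and keep the job-ID column.
    qstat -s p -u "$user" | awk 'NR > 2 { print $1 }' | sort -u |
    while read -r jobid; do
        echo "== job $jobid =="
        # stdin is redirected so the command cannot consume the job list.
        check_waiting_job -j "$jobid" < /dev/null
    done
}

# Typical use on a submit host:
#   check_pending_jobs
```

The -v option shown in the examples above can be added to the check_waiting_job call if the debug trace is wanted for every job.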

Delete jobs

Jobs can be deleted with the qdel command. You can only delete your own jobs.

To cancel all your jobs:

% qdel -u <loginname>

To cancel one or more jobs:

% qdel <jobid>[,jobid,...]

To cancel a single task of an array job:

% qdel <jobid>.<taskid>

To cancel multiple tasks of an array job:

% qdel <jobid>.<taskid_first>-<taskid_last>[:interval]
# or with the -t option
% qdel <jobid> -t <taskid_first>-<taskid_last>[:interval]

Type man qdel for all the options of the qdel command.
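
To cancel a subset of your jobs selected by name, the job IDs can be collected from qstat and passed to qdel as a comma-separated list. This is a sketch under assumptions: qdel_by_name is a hypothetical helper, and it relies on the default qstat layout, where the job name is the third column and is truncated to 10 characters.

```shell
# qdel_by_name is a hypothetical helper, not a Grid Engine command.
# It deletes all jobs of the current user whose (possibly truncated) name
# matches the given awk regular expression.
qdel_by_name() {
    pattern="$1"
    # Skip the two header lines, keep job IDs whose name column matches,
    # then join them with commas.
    ids=$(qstat -u "$(id -un)" |
          awk -v pat="$pattern" 'NR > 2 && $3 ~ pat { print $1 }' |
          sort -u | paste -sd, -)
    if [ -z "$ids" ]; then
        echo "no job name matches '$pattern'" >&2
        return 1
    fi
    echo "deleting jobs: $ids"
    qdel "$ids"
}

# Typical use on a submit host:
#   qdel_by_name '^hello_'
```

Because qstat truncates names, a pattern that targets the end of a long job name may never match; anchoring the pattern at the start of the name is the safer choice.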

Edit jobs

The qalter command is used to modify resource requests for jobs that are still in the pending state. To modify a resource of a job, for example the requested memory:

% qalter -mods l_hard s_rss 2G <jobid>

To add a resource to a job, for example sps:

% qalter -adds l_hard sps 1 <jobid>

To delete a resource from a job, for example hpss:

% qalter -clears l_hard hpss <jobid>

To remove all resources from a job:

% qalter -clearp l_hard <jobid>

To replace the list of resources for a job:

% qalter -l resource=value[,resource2=value2,resource3=value3] <jobid>

Type man qalter to see all the possible options (change of queue, e-mail, ...).

Suspend and release a job

You can put one or more submitted jobs on hold with the qhold command:

% qhold <jobid>

To release a job that was held with qhold, use the qrls command:

% qrls <jobid>

Jobs in queue with error (Eqw status) can be released with the qmod command:

% qmod -cj <jobid>
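
When many jobs are in the Eqw state, the same command can be applied to each of them in a loop. The function below is a sketch: clear_eqw_jobs is a hypothetical name, and it assumes the default qstat layout (job ID in the first column, state in the fifth, two header lines). Clearing the error state only helps once the cause of the error has been fixed.

```shell
# clear_eqw_jobs is a hypothetical helper, not a Grid Engine command.
# It clears the error state of every job of the current user that is
# reported as Eqw by qstat.
clear_eqw_jobs() {
    qstat -u "$(id -un)" |
    awk 'NR > 2 && $5 == "Eqw" { print $1 }' | sort -u |
    while read -r jobid; do
        echo "clearing error state of job $jobid"
        qmod -cj "$jobid" < /dev/null
    done
}

# Typical use on a submit host:
#   clear_eqw_jobs
```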