Completed jobs

The qacct command

The qacct command provides information about past usage of UGE. An accounting record is written to the accounting file for each completed job. This file contains the information on the jobs completed during the last 5 days. Older data is in the accounting file for the month of execution.

By default, the qacct command gives access to job information from the last 5 days. To obtain information on a particular job that completed within the last 5 days, type:

% qacct -j <jobid>

To access information that is older than 5 days, use the “-f” option to select the accounting file for the relevant month:

% qacct -o <loginname> -j -f /opt/sge/ccin2p3/common/accounting.YYYY.MM
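The choice between the two forms above can be sketched in Python as follows. This is only an illustration: the helper name `qacct_args`, its parameters, and the idea of passing the execution date explicitly are assumptions, while the path pattern is the one shown above.

```python
from datetime import date, timedelta

# Path pattern taken from the text above; YYYY.MM is the month of execution.
ACCOUNTING_PATTERN = "/opt/sge/ccin2p3/common/accounting.{year:04d}.{month:02d}"

def qacct_args(job_id, run_date, today=None):
    """Build the qacct argument list for one job (hypothetical helper)."""
    today = today or date.today()
    args = ["qacct", "-j", str(job_id)]
    # Records older than 5 days are read from the monthly accounting file.
    if today - run_date > timedelta(days=5):
        path = ACCOUNTING_PATTERN.format(year=run_date.year, month=run_date.month)
        args += ["-f", path]
    return args

# Example: a job executed in August 2020, queried at the end of September 2020.
print(" ".join(qacct_args(5947764, date(2020, 8, 18), today=date(2020, 9, 30))))
```

The returned list can be passed to `subprocess.run` on a machine where qacct is available.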

Note

For more details on the meaning of the returned values, see Accounting and CPU time.

The return code

In the output of the qacct -j command, the failed and exit_status lines can help you understand why a job failed. If both are equal to zero, your job completed successfully:

...
failed 0
exit_status 0
...

Otherwise, there was a problem.
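As a minimal sketch of this check, the following Python function scans the text printed by qacct -j and reports whether both fields are zero. The function name and the sample text are illustrative assumptions; only the two field names come from the output shown above.

```python
def job_succeeded(qacct_output):
    """Return True if both 'failed' and 'exit_status' are 0 in qacct -j text."""
    values = {}
    for line in qacct_output.splitlines():
        parts = line.split()
        # Keep only the two fields of interest. The first token after the
        # field name is the numeric code; any trailing description is ignored.
        if len(parts) >= 2 and parts[0] in ("failed", "exit_status"):
            values[parts[0]] = int(parts[1].rstrip(":"))
    # A missing field is treated as "not known to be successful".
    return values.get("failed") == 0 and values.get("exit_status") == 0

# Illustrative excerpt, not real accounting data.
sample = "failed       0\nexit_status  0\n"
print(job_succeeded(sample))
```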

exit_status

The exit_status reflects the signal received by a job: it is the signal number plus 128. If a soft limit is exceeded, the following signals are sent:

exit_status                   corresponding signal
152 = 24 (SIGXCPU) + 128      SIGXCPU: CPU (_cpu) or memory (_rss) limit
138 = 10 (SIGUSR1) + 128      SIGUSR1: elapsed time (_rt) limit
153 = 25 (SIGXFSZ) + 128      SIGXFSZ: file size (_fsize) limit

If a hard limit is exceeded, SIGKILL is sent, which gives an exit_status of 137 = 9 (SIGKILL) + 128. exit_status values below 128 are defined by the user. A complete table with a brief summary of the meaning of each Unix signal is available through man 7 signal.
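The arithmetic above (signal number = exit_status - 128) can be illustrated with a short Python sketch. The function name and the wording of the returned strings are assumptions; the signal numbers and meanings are those listed in this section.

```python
# Signal numbers and meanings as listed in this section (Linux numbering).
SIGNALS = {
    9: ("SIGKILL", "hard limit exceeded"),
    10: ("SIGUSR1", "elapsed time (_rt) soft limit"),
    24: ("SIGXCPU", "CPU (_cpu) or memory (_rss) soft limit"),
    25: ("SIGXFSZ", "file size (_fsize) soft limit"),
}

def describe_exit_status(exit_status):
    """Translate an exit_status value into a short description."""
    if exit_status < 128:
        # Values below 128 are chosen by the job itself.
        return f"user-defined exit code {exit_status}"
    signum = exit_status - 128
    # Signals not covered in this section are left to man 7 signal.
    name, meaning = SIGNALS.get(signum, (f"signal {signum}", "see man 7 signal"))
    return f"{name}: {meaning}"

for status in (152, 138, 153, 137, 1):
    print(status, "->", describe_exit_status(status))
```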

failed

A failed value other than zero usually indicates that a job could not be started on a compute server. Some codes are reported for jobs that did run:

failed                                  consequence
1: assumedly before job                 Job could not be started
7: before prolog                        Job could not be started
8: in prolog                            Job could not be started
10: in pestart                          Job could not be started
19: before writing exit_status
21: in recognizing job
25: rescheduling                        Job ran, job will be rescheduled
26: opening input/output file           Job could not be started, stderr/stdout could not be opened
28: changing into working directory     Job could not be started, error changing to start directory
29: invalid execution state
37: qmaster enforced h_rt limit
100: assumedly after job                Job ran, job killed by a signal

To learn more about the different options of the qacct command, see man qacct. To interpret the output of the qacct command, see man accounting.

The logs

By default, two files are created in your HOME directory once a job has completed:

<jobname>.o<jobid> (stdout standard output)
<jobname>.e<jobid> (stderr standard error)

For an array job, one pair of files is created per task, named as follows:

<jobname>.o<jobid>.<taskid>
<jobname>.e<jobid>.<taskid>
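The naming scheme above can be sketched in Python, for instance to collect the log files of an array job. The helper name and its parameters are assumptions for illustration, and the job name and job ID are example values.

```python
def log_file_names(jobname, jobid, task_ids=None):
    """Return the default stdout/stderr file names of a job or an array job."""
    if task_ids is None:
        # Regular job: one .o file and one .e file.
        return [f"{jobname}.o{jobid}", f"{jobname}.e{jobid}"]
    names = []
    for task in task_ids:
        # Array job: one pair of files per task.
        names.append(f"{jobname}.o{jobid}.{task}")
        names.append(f"{jobname}.e{jobid}.{task}")
    return names

print(log_file_names("hello_world.sh", 5947764))
print(log_file_names("hello_world.sh", 5947764, task_ids=range(1, 3)))
```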

Take the example of a Hello World job whose two log files are in your HOME:

hello_world.sh.e<jobid> # empty, if there were no errors
hello_world.sh.o<jobid>

The output can be viewed, for example, with the command cat:

% cat hello_world.sh.o5947764
**********************************************************************
*                     Grid Engine Batch System                       *
*              IN2P3 Computing Centre, Villeurbanne FR               *
**********************************************************************
* User:                           login                              *
* Group:                          ccin2p3                            *
* Jobname:                        hello_world.sh                     *
* JobID:                          5947764                            *
* Queue:                          long                               *
* Worker:                         ccwsge0632.in2p3.fr                *
* Operating system:               Linux 3.10.0-1127.10.1.el7.x86_64  *
* Project:                        P_ccin2p3                          *
**********************************************************************
* Submitted on:                   Mon Aug 17 11:37:21 CEST 2020      *
* Started on:                     Mon Aug 17 11:37:33 CEST 2020      *
**********************************************************************

Hello World!

My working directory is:
/scratch/5947764.1.long
on the server:
ccwsge0632
Done!!

**********************************************************************
* Submitted on:                   Mon Aug 17 11:37:21 CEST 2020      *
* Started on:                     Mon Aug 17 11:37:33 CEST 2020      *
* Ended on:                       Tue Aug 18 11:58:25 CEST 2020      *
* Exit status:                    0                                  *
**********************************************************************
* Requested                                                          *
*   CPU cores:                    1 core(s)                          *
*   CPU time:                     48:00:00 (172800 seconds) (1)      *
**********************************************************************
* Consumed                                                           *
*   wallclock:                    24:20:52 (87652 seconds)           *
*   CPU time:                     24:09:01 (86941 seconds)           *
*   CPU scaling factor:           10.87                              *
*   normalized CPU time:          262:30:52 (945052 HS06 seconds)    *
*   CPU efficiency:               99 % (2)                           *
*   vmem:                         2.374 GB (3)                       *
*   maxvmem:                      2.403 GB (3)                       *
*   maxrss:                       487.258 MB (3)                     *
**********************************************************************
Notes:
(1) Formula: requested CPU time * requested CPU cores
(2) Formula: CPU time / ( wallclock * requested CPU cores )
(3) See man sge_accounting
For more information: qacct -j 5947764

While your job is running, its standard output (stdout) and standard error (stderr) are written to the local disk of the compute server. At the end of the job, they are copied (by default) to your HOME directory. To change the location and name of the stdout and stderr files, as well as the jobname of your job, see the section Submission options.

Attention

Keep the stdout / stderr output files to a reasonable size (less than 100 MB). Larger outputs should be redirected to TMPDIR and copied to the appropriate storage system at the end of the job, rather than written to HOME (as is the case by default).

Accounting and CPU time

The wallclock time of a job is the time elapsed between the beginning of the job and its end. The CPU time is the time during which the job actually uses the CPU. The CPU time can be significantly lower than the wallclock time, for example when the job is I/O bound: in that case we speak of an “inefficient job”.

Moreover, in the case of a multi-core job, the CPU time may be greater than the wallclock time. The reason is that the job scheduler provides in the accounting the cumulative CPU time over all cores, while the wallclock time is independent from the number of cores used.

We therefore have in principle:

wallclock_time > CPU_time/#cores
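This relation is what the “CPU efficiency” line of the job output measures, using formula (2) from its notes: CPU time / (wallclock * requested CPU cores). A minimal Python sketch, with an assumed function name and the values of the Hello World example:

```python
def cpu_efficiency(cpu_seconds, wallclock_seconds, cores=1):
    """CPU time / (wallclock * cores), as in note (2) of the job output."""
    return cpu_seconds / (wallclock_seconds * cores)

# Values from the example output: 86941 s of CPU time, 87652 s of wallclock, 1 core.
eff = cpu_efficiency(86941, 87652, cores=1)
print(f"CPU efficiency: {eff:.1%}")
```

A value close to 1 means the cores were busy for almost the whole duration of the job; an I/O-bound job gives a noticeably lower value.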

At CC-IN2P3, the CPU time is commonly expressed in “HS06” time, a normalized unit that depends on the power of the cores. For example, for a CPU time of 3h on one or more cores with a power factor of 11 HS06 per core, we have the relation:

CPU time (h.HS06) = 3 (h) * 11 (HS06) = 33 h.HS06
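The same conversion can be written as a short Python sketch. The function names are assumptions; the second call simply reformats the “normalized CPU time” value of the example job output (945052 HS06 seconds) as hours:minutes:seconds.

```python
def normalized_cpu_time(cpu_seconds, hs06_factor):
    """CPU time in HS06 seconds: raw CPU seconds times the HS06 factor."""
    return cpu_seconds * hs06_factor

def format_hms(seconds):
    """Format a number of seconds as hours:minutes:seconds."""
    seconds = int(round(seconds))
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"

# 3 h of CPU time with a factor of 11 HS06, expressed in h.HS06.
print(normalized_cpu_time(3 * 3600, 11) / 3600)
# Normalized CPU time of the example job output.
print(format_hms(945052))
```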

In the job stdout, the “normalized CPU time” line gives the CPU time in HS06 time (days:hours:minutes:seconds), the “CPU scaling factor” line gives the HS06 factor of the processor, and the “CPU time” line gives the CPU time in seconds. In the output of a qacct command, “wallclock” is the wallclock time in seconds and “cpu” is the CPU time in HS06 seconds.