Completed jobs
The qacct command
The qacct command provides information about past usage of UGE. An accounting record is written to the accounting file for each completed job. This file covers the jobs completed during the last 5 days; older records are kept in the accounting file for the month of execution.
By default, qacct only looks at this 5-day window. To obtain information on a particular job completed within the last 5 days, type:
% qacct -j <jobid>
To access information that is older than 5 days, use the “-f” option to select the accounting file for the relevant month:
% qacct -o <loginname> -j -f /opt/sge/ccin2p3/common/accounting.YYYY.MM
Note
For more details on the meaning of the returned values, see Accounting and CPU time.
The return code
In the output of the qacct -j command, the failed and exit_status lines can help you understand why a job failed.
If both are equal to zero, your job completed successfully:
...
failed 0
exit_status 0
...
Otherwise, there was a problem.
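As an illustration, the check described above can be sketched in Python. This is a minimal sketch, not an official tool: it assumes the record has been captured as text (for example from qacct -j <jobid>) and that each line holds a field name followed by its value, as in the excerpt above. The helper names parse_qacct_record and job_succeeded, and the sample text, are hypothetical.

```python
def parse_qacct_record(text):
    """Turn the 'name value' lines of one accounting record into a dict."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:          # skip separators and empty lines
            record[parts[0]] = parts[1].strip()
    return record


def job_succeeded(record):
    """A job is considered successful when failed and exit_status are both 0."""
    # Only the leading integer is used: a non-zero 'failed' value may be
    # followed by a short description (assumption about the output layout).
    failed = int(record["failed"].split()[0])
    exit_status = int(record["exit_status"].split()[0])
    return failed == 0 and exit_status == 0


# Hypothetical sample, reduced to the two lines of interest.
sample = """\
failed       0
exit_status  0
"""
print(job_succeeded(parse_qacct_record(sample)))
```

If a job has several records (an array job, for instance), the text would need to be split per record before calling the parser.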
exit_status
The exit_status value reflects the signal received by the job: it equals the signal number plus 128. If a soft limit is exceeded, the following signals are sent:
exit_status | corresponding signal |
---|---|
152 = 24 (SIGXCPU) + 128 | SIGXCPU: CPU (_cpu) or memory (_rss) limit |
138 = 10 (SIGUSR1) + 128 | SIGUSR1: elapsed time (_rt) limit |
153 = 25 (SIGXFSZ) + 128 | SIGXFSZ: file size (_fsize) limit |
If a hard limit is exceeded, a SIGKILL signal is sent, which gives an exit_status of 137 = 9 (SIGKILL) + 128. exit_status values below 128 are defined by the user.
A complete table with a brief summary of the meaning of each Unix kill signal is available through man 7 signal.
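The "signal number + 128" rule can be written as a small lookup. The sketch below hard-codes only the signal numbers listed in the table above (Linux numbering); the function name describe_exit_status and the wording of the messages are illustrative choices, not part of UGE.

```python
# Signal numbers taken from the table above (Linux numbering).
SIGNAL_MEANING = {
    24: ("SIGXCPU", "soft CPU or memory limit exceeded"),
    10: ("SIGUSR1", "soft elapsed time limit exceeded"),
    25: ("SIGXFSZ", "soft file size limit exceeded"),
    9: ("SIGKILL", "hard limit exceeded"),
}


def describe_exit_status(exit_status):
    """Give a short, human-readable reading of an exit_status value."""
    if exit_status == 0:
        return "success"
    if exit_status < 128:
        # Values below 128 are chosen by the user's own script or program.
        return "user-defined exit code %d" % exit_status
    signum = exit_status - 128
    default = ("signal %d" % signum, "see man 7 signal")
    name, meaning = SIGNAL_MEANING.get(signum, default)
    return "%s: %s" % (name, meaning)


for status in (0, 3, 137, 138, 152, 153):
    print(status, "->", describe_exit_status(status))
```

Signals that are not in the table fall back to a pointer to man 7 signal rather than a guessed meaning.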
failed
A failed value other than zero usually indicates that a job could not be started on a compute server; codes 25 and 100 are reported for jobs that did run:
failed | consequence |
---|---|
1: assumedly before job | Job could not be started |
7: before prolog | Job could not be started |
8: in prolog | Job could not be started |
10: in pestart | Job could not be started |
19: before writing exit_status | |
21: in recognizing job | |
25: rescheduling | Job ran, job will be rescheduled |
26: opening input/output file | Job could not be started, stderr/stdout could not be opened |
28: changing into working directory | Job could not be started, error changing to start directory |
29: invalid execution state | |
37: qmaster enforced h_rt limit | |
100: assumedly after job | Job ran, job killed by a signal |
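For scripted post-processing, the table above can be condensed into a helper that tells whether the job ran at all. This is only a sketch built from the rows of the table; codes whose consequence column is empty are reported as undocumented rather than guessed, and the name failed_consequence is hypothetical.

```python
# Codes for which the table states that the job could not be started.
NOT_STARTED = {1, 7, 8, 10, 26, 28}

# Codes for which the table states that the job did run.
RAN = {
    25: "job ran, job will be rescheduled",
    100: "job ran, job killed by a signal",
}


def failed_consequence(code):
    """Map a 'failed' code to the consequence given in the table."""
    if code == 0:
        return "no failure reported"
    if code in NOT_STARTED:
        return "job could not be started"
    if code in RAN:
        return RAN[code]
    return "consequence not documented in the table"


print(failed_consequence(26))
print(failed_consequence(100))
```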
To learn more about the different options of the qacct command, see man qacct.
To interpret the output of the qacct command, see man accounting.
The logs
By default, two files are created in your HOME once a submitted job has completed:
<jobname>.o<jobid> (stdout standard output)
<jobname>.e<jobid> (stderr standard error)
When submitting an array job, these files will be named (there will be as many files as there are tasks):
<jobname>.o<jobid>.<taskid>
<jobname>.e<jobid>.<taskid>
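The naming scheme above can be reproduced in a few lines, which may help when a script has to collect the logs of an array job. This is a minimal sketch under the default naming described here; the function log_file_names is hypothetical and does not account for submission options that rename or relocate the files.

```python
def log_file_names(jobname, jobid, taskid=None):
    """Return the default (stdout, stderr) log file names of a job or task."""
    # Array job tasks get an extra ".<taskid>" suffix.
    suffix = "" if taskid is None else ".%d" % taskid
    stdout_name = "%s.o%d%s" % (jobname, jobid, suffix)
    stderr_name = "%s.e%d%s" % (jobname, jobid, suffix)
    return stdout_name, stderr_name


# Single job, then task 3 of an array job with the same job ID.
print(log_file_names("hello_world.sh", 5947764))
print(log_file_names("hello_world.sh", 5947764, taskid=3))
```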
Take the example of a Hello World job whose two log files are in your HOME:
hello_world.sh.e<jobid> # empty, if there were no errors
hello_world.sh.o<jobid>
The output can be viewed, for example, with the command cat:
% cat hello_world.sh.o5947764
**********************************************************************
* Grid Engine Batch System *
* IN2P3 Computing Centre, Villeurbanne FR *
**********************************************************************
* User: login *
* Group: ccin2p3 *
* Jobname: hello_world.sh *
* JobID: 5947764 *
* Queue: long *
* Worker: ccwsge0632.in2p3.fr *
* Operating system: Linux 3.10.0-1127.10.1.el7.x86_64 *
* Project: P_ccin2p3 *
**********************************************************************
* Submitted on: Mon Aug 17 11:37:21 CEST 2020 *
* Started on: Mon Aug 17 11:37:33 CEST 2020 *
**********************************************************************
Hello World!
My working directory is:
/scratch/5947764.1.long
on the server:
ccwsge0632
Done!!
**********************************************************************
* Submitted on: Mon Aug 17 11:37:21 CEST 2020 *
* Started on: Mon Aug 17 11:37:33 CEST 2020 *
* Ended on: Tue Aug 18 11:58:25 CEST 2020 *
* Exit status: 0 *
**********************************************************************
* Requested *
* CPU cores: 1 core(s) *
* CPU time: 48:00:00 (172800 seconds) (1) *
**********************************************************************
* Consumed *
* wallclock: 24:20:52 (87652 seconds) *
* CPU time: 24:09:01 (86941 seconds) *
* CPU scaling factor: 10.87 *
* normalized CPU time: 262:30:52 (945052 HS06 seconds) *
* CPU efficiency: 99 % (2) *
* vmem: 2.374 GB (3) *
* maxvmem: 2.403 GB (3) *
* maxrss: 487.258 MB (3) *
**********************************************************************
Notes:
(1) Formula: requested CPU time * requested CPU cores
(2) Formula: CPU time / ( wallclock * requested CPU cores )
(3) See man sge_accounting
For more information: qacct -j 5947764
While your job is running, its standard outputs stdout and stderr are written to the local disk of the compute server; at the end of the job they are copied (by default) to your HOME directory.
To change the position and name of the stdout and stderr outputs, as well as the jobname of your job, see the section Submission options.
Attention
It is preferable to keep the stdout / stderr output files to a reasonable size (less than 100 MB). Larger outputs should be redirected to TMPDIR and copied to the appropriate storage system at the end of the job, rather than to HOME (as is the case by default).
Accounting and CPU time
The wallclock time of a job is the time elapsed between its start and its end. The CPU time is the time during which the job actually uses the CPU. It can be significantly lower than the wallclock time, for example when the job is I/O bound; such a job is said to be “inefficient”.
Conversely, for a multi-core job, the CPU time may be greater than the wallclock time: the job scheduler reports the cumulative CPU time over all cores, while the wallclock time is independent of the number of cores used.
We therefore have in principle:
wallclock_time > CPU_time/#cores
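The same relation underlies the “CPU efficiency” line of the job log, whose note (2) gives the formula CPU time / (wallclock * requested CPU cores). A minimal sketch, using the figures from the Hello World log shown earlier; the function name cpu_efficiency is illustrative.

```python
def cpu_efficiency(cpu_seconds, wallclock_seconds, cores=1):
    """CPU time divided by (wallclock time * number of requested cores)."""
    return cpu_seconds / (wallclock_seconds * cores)


# Values from the example log: 86941 s of CPU, 87652 s of wallclock, 1 core.
efficiency = cpu_efficiency(86941, 87652, cores=1)
print("CPU efficiency: %.0f %%" % (100 * efficiency))  # prints "CPU efficiency: 99 %"
```

A value well below 100 % points to a job that spends much of its time waiting, for example on I/O.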
At CC-IN2P3, the CPU time is commonly expressed in “HS06” time, a normalized unit that depends on the power of the cores. For example, for a CPU time of 3 h on one or more cores with a power factor of 11 HS06 per core, we have:
CPU time (h.HS06) = 3 (h) * 11 (HS06) = 33 h.HS06
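The conversion is a plain multiplication, sketched below together with a helper that prints a number of seconds as hours:minutes:seconds. Both function names are hypothetical. Applied to the Hello World log shown earlier (86941 s, factor 10.87), the product is about 945049 HS06 seconds, close to the 945052 displayed there; the small gap presumably comes from the scaling factor being shown rounded.

```python
def normalized_cpu_time(cpu_time, hs06_factor):
    """Convert a CPU time to HS06 time (same time unit as the input)."""
    return cpu_time * hs06_factor


def format_hms(total_seconds):
    """Format a number of seconds as hours:minutes:seconds."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return "%d:%02d:%02d" % (hours, minutes, seconds)


# Example from the text: 3 h at 11 HS06 per core.
print(normalized_cpu_time(3, 11))  # prints 33

# Normalized CPU time reported in the example log, in HS06 seconds.
print(format_hms(945052))  # prints 262:30:52
```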
In the job stdout, the line “normalized CPU time” gives the CPU time in HS06 time, formatted as (days:hours:minutes:seconds).HS06, the “CPU scaling factor” is the HS06 factor of the processor, and the “CPU time” is the CPU time in seconds.
In the output of a qacct command, “wallclock” is the wallclock time in seconds, and “cpu” is the CPU time in HS06 seconds.