blob: 6e5028e640aa479c90a892c53010580914ec764e [file] [log] [blame] [edit]
<!--#include virtual="header.txt"-->
<h1>Job Exit Codes</h1>
<p>A job's exit code (aka exit status, return code and completion
code) is captured by SLURM and saved as part of the job record. For
sbatch jobs, the exit code that is captured is the output of the batch
script. For salloc jobs, the exit code will be the return value of
the exit call that terminates the salloc session. For srun, the exit
code will be the return value of the command that srun executes.</p>
<p>Any non-zero exit code will be assumed to be a job failure and will
result in a Job State of FAILED with a Reason of
"NonZeroExitCode".</p>
<p>The exit code is an 8 bit unsigned number ranging between 0 and
255. While it is possible for a job to return a negative exit code,
SLURM will display it as an unsigned value in the 0 - 255 range.</p>
<h2>Job Step Exit Codes</h2>
<p>When a job contains multiple job steps, the exit code of each
executable invoked by srun is saved individually to the job step
record.</p>
<h2>Signaled Jobs</h2>
<p>When a job or step is sent a signal that causes its termination,
SLURM also captures the signal number and saves it to the job or step
record.</p>
<h2>Displaying Exit Codes and Signals</h2>
<p>SLURM displays a job's exit code in the output of the <b>scontrol
show job</b> and the <b>sview</b> utility. SLURM displays job step
exit codes in the output of the <b>scontrol show step</b> and the
<b>sview</b> utility.
<p>When a signal was responsible for a job or step's termination, the
signal number will be displayed after the exit code, delineated by a
colon(:).</p>
<h2>Database Job/Step Records</h2>
<p>The SLURM control daemon sends job and step records to the SLURM
database when the SLURM accounting_storage plugin is installed. Job
and step records sent to the SLURM db can be viewed using the
<b>sacct</b> command. The default <b>sacct</b> output contains an
ExitCode field whose format mirrors the output of <b>scontrol</b> and
<b>sview</b> described above.</p>
<h1>Derived Exit Code and String</h1>
<p>After reading the above description of a job's exit code, one can
imagine a scenario where a central task of a batch job fails but the
script returns an exit code of zero, indicating success. In many
cases, a user may not be able to ascertain the success or failure of a
job until after they have examined the job's output files.</p>
<p>A new job field, the "derived exit code", has been added to the job
record in SLURM 2.2. It is initially set to the value of the highest
exit code returned by all of the job's steps (srun invocations). The
job's derived exit code is determined by the SLURM control daemon
and sent to the database when the accounting_storage plugin is
enabled.</p>
<p>In addition to the derived exit code, the job record in the SLURM
db contains an additional field known as the "derived exit string".
This is initialized to NULL and can only be changed by the user. A
new option has been added to the <b>sacctmgr</b> command to provide
the user the means to modify these two new fields of the job record.
No other modification to the job record is allowed. For those who
prefer a simpler command specifically designed to view and modify the
derived exit code/string, the <b>sjobexitmod</b> wrapper has been
created (see below).</p>
<p>The user now has the means to annotate a job's exit code after it
completes and provide a description of what failed. This includes the
ability to annotate a successful completion to jobs that appear to
have failed but actually succeeded.</p>
<h2>The sjobexitmod command</h2>
<p>A new command is available in SLURM 2.2 to display and update the
two new derived exit fields of the SLURM db's job record.
<b>sjobexitmod</b> can first be used to display the existing exit code
/ string for a job:</p>
<PRE>
> sjobexitmod -l 123
JobID Account NNodes NodeList State ExitCode DerivedExitCode DerivedExitStr
------------ ---------- -------- --------------- ---------- -------- --------------- --------------
123 lc 1 tux0 COMPLETED 0:0 0:0
</PRE>
If a change is desired, <b>sjobexitmod</b> can modify the derived fields:
<PRE>
> sjobexitmod -e 49 -r "out of memory" 123
Modification of job 123 was successful.
> sjobexitmod -l 123
JobID Account NNodes NodeList State ExitCode DerivedExitCode DerivedExitStr
------------ ---------- -------- --------------- ---------- -------- --------------- --------------
123 lc 1 tux0 COMPLETED 0:0 49:0 out of memory
</PRE>
<p>The existing <b>sacct</b> command also supports the two new derived
exit fields:</p>
<PRE>
> sacct -X -j 123 -o JobID,NNodes,State,ExitCode,DerivedExitcode,DerivedExitStr
JobID NNodes State ExitCode DerivedExitCode DerivedExitStr
------------ -------- ---------- -------- --------------- --------------
123 1 COMPLETED 0:0 49:0 out of memory
</PRE>
<p style="text-align:center;">Last modified 1 November 2010</p>
<!--#include virtual="footer.txt"-->