<!--#include virtual="header.txt"-->
<h1><a name="top">Job Launch Design Guide</a></h1>
<h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>
<p>This guide describes, at a high level, the processes that occur in order
to initiate a job, including the daemons and plugins involved. It describes
the processes of job allocation, step allocation, task launch and
job termination. The functionality of tens of thousands of lines of code
has been distilled here into a couple of pages of text, so much detail is
necessarily omitted.</p>
<h2 id="job_allocation">Job Allocation
<a class="slurm_link" href="#job_allocation"></a>
</h2>
<p>The first step of the process is to create a job allocation, which is
a claim on compute resources. A job allocation can be created using the
<b>salloc</b>, <b>sbatch</b> or <b>srun</b> command. The <b>salloc</b> and
<b>sbatch</b> commands only create resource allocations, while the <b>srun</b>
command will create a resource allocation (if not already running within one)
and launch tasks. Each of these commands fills in a data structure
identifying the specifications of the job allocation request (e.g. node
count, task count, etc.) based upon command line options and environment
variables, and sends that RPC to the <b>slurmctld</b> daemon. The UID and GID of
the user launching the job are included in a credential which will be used
later to restrict access to the job, so further steps run in the allocation
must be launched using the same UID and GID as those used to create
the allocation. If the new job request is the highest priority, the
<b>slurmctld</b> daemon will attempt to select resources for it immediately;
otherwise it will validate that the job request can be satisfied at some time
and queue the request. In either case the request will receive a response
almost immediately containing one of the following:</p>
<ul>
<li>A job ID and the resource allocation specification (nodes, CPUs, etc.)</li>
<li>A job ID and notification of the job being in a queued state</li>
<li>An error code</li>
</ul>
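<p>As a sketch of what this looks like from the command line (the job IDs, node
counts and script name below are illustrative), each of these commands results
in a job allocation request being sent to <b>slurmctld</b>:</p>
<pre>
$ salloc -N2 -n4                # allocation only; a shell starts once resources are granted
salloc: Granted job allocation 1234

$ sbatch -N2 -n4 my_script.sh   # allocation only; the script runs when resources are granted
Submitted batch job 1235

$ srun -N2 -n4 hostname         # creates an allocation (if not already in one) and launches tasks
</pre>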
<p>The process of selecting resources for a job request involves multiple steps,
some of which involve plugins. The process is as follows:</p>
<ol>
<li>Call <b>job_submit</b> plugins to modify the request as appropriate</li>
<li>Validate that the options are valid for this user (e.g. valid partition
name, valid limits, etc.)</li>
<li>Determine if this job is the highest priority runnable job; if so, attempt
to allocate resources for it now, otherwise only validate that it
could run if no other jobs existed</li>
<li>Determine which nodes could be used for the job. If the feature
specification uses an exclusive OR option, then multiple iterations of the
selection process below will be required with disjoint sets of nodes</li>
<li>Call the <b>select</b> plugin to select the best resources for the request</li>
<li>The <b>select</b> plugin will consider network topology and the topology within
a node (e.g. sockets, cores, and threads) to select the best resources for the
job</li>
<li>If the job cannot be initiated using available resources and preemption
support is configured, the <b>select</b> plugin will also determine if the job
can be initiated after preempting lower priority jobs. If so, preemption is
initiated as needed to start the job</li>
</ol>
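<p>Which plugins participate in this selection process depends on the site
configuration. One way to check the relevant settings on a running cluster is
shown below; the values in the sample output are only examples and will differ
between installations:</p>
<pre>
$ scontrol show config | grep -E 'JobSubmitPlugins|SelectType|PreemptType|PreemptMode|TopologyPlugin'
JobSubmitPlugins        = lua
PreemptMode             = REQUEUE
PreemptType             = preempt/partition_prio
SelectType              = select/cons_tres
SelectTypeParameters    = CR_CORE_MEMORY
TopologyPlugin          = topology/tree
</pre>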
<h2 id="step_allocation">Step Allocation
<a class="slurm_link" href="#step_allocation"></a>
</h2>
<p>The <b>srun</b> command is always used for job step creation. It fills in
a job step request RPC using information from the command line and environment
variables, then sends that request to the <b>slurmctld</b> daemon. It is
important to note that many of the <b>srun</b> options are intended for job
allocation and are not supported by the job step request RPC (for example the
socket, core and thread information is not supported). If a job step uses
all of the resources allocated to the job then the lack of support for some
options is not important. If one wants to execute multiple job steps using
various subsets of resources allocated to the job, this shortcoming could
prove problematic. It is also worth noting that the logic used to select
resources for a job step is relatively simple and entirely contained within
the <b>slurmctld</b> daemon code (the <b>select</b> plugin is not used for job
steps). If the request cannot be immediately satisfied due to a request for
exclusive access to resources, an appropriate error message will be sent and
the <b>srun</b> command will retry the request periodically.
(<b>NOTE</b>: It would be desirable to queue job step requests to support
job step dependencies and better performance in the initiation of job steps,
but that is not currently supported.)
If the request can be satisfied, the response contains a digitally signed
credential (by the <b>cred</b> plugin) identifying the resources to be used.</p>
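<p>As a sketch of how several job steps can share one allocation (the application
names are placeholders), the batch script below runs two steps concurrently, each
on a subset of the allocated tasks; every <b>srun</b> invocation results in a
separate job step request being sent to <b>slurmctld</b>:</p>
<pre>
#!/bin/bash
#SBATCH -N2 -n4

# Two concurrent steps, each using half of the allocated tasks.
# --exclusive asks for dedicated resources within the allocation, so a step
# may wait until the resources it needs are released by an earlier step.
srun -n2 --exclusive ./app_a &
srun -n2 --exclusive ./app_b &
wait
</pre>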
<h2 id="task_launch">Task Launch
<a class="slurm_link" href="#task_launch"></a>
</h2>
<p>The <b>srun</b> command builds a task launch request data structure
including the credential, executable name, file names, etc. and sends it to
the <b>slurmd</b> daemon on node zero of the job step allocation. The
<b>slurmd</b> daemon validates the signature and forwards the request to the
<b>slurmd</b> daemons on other nodes to launch tasks for that job step. The
degree of fanout in this message forwarding is configurable using the
<b>TreeWidth</b> parameter. Each <b>slurmd</b> daemon tests that the job has
not been cancelled since the credential was issued (due to a possible race
condition) and spawns a <b>slurmstepd</b> program to manage the job step.
Note that the <b>slurmctld</b> daemon is not directly involved in task
launch in order to minimize the overhead on this critical resource.</p>
<p>Each <b>slurmstepd</b> program executes a single job step.
Besides the functions listed below, the <b>slurmstepd</b> program also
executes several SPANK plugin functions at various times.</p>
<ol>
<li>Performs MPI setup (using the appropriate plugin)</li>
<li>Calls the <b>switch</b> plugin to perform any needed network configuration</li>
<li>Creates a container for the job step using a <b>proctrack</b> plugin</li>
<li>Changes the user ID to that of the user</li>
<li>Configures I/O for the tasks (either using files or a socket connection back
to the <b>srun</b> command)</li>
<li>Sets up environment variables for the tasks, including many task-specific
environment variables</li>
<li>Forks and execs the tasks</li>
</ol>
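<p>The parent process of each launched task is the <b>slurmstepd</b> program
managing that step on the node, which can be observed from within the tasks
themselves (the node names in the sample output are illustrative):</p>
<pre>
$ srun -N2 -n2 bash -c 'echo "task $SLURM_PROCID on $(hostname): parent is $(ps -o comm= -p $PPID)"'
task 0 on node01: parent is slurmstepd
task 1 on node02: parent is slurmstepd
</pre>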
<h2 id="step_termination">Job Step Termination
<a class="slurm_link" href="#step_termination"></a>
</h2>
<p>There are several ways in which a job step or job can terminate, each with
slight variations in the logic executed. The simplest case is when the tasks run
to completion. The <b>srun</b> command will note the termination of output from the
tasks and notify the <b>slurmctld</b> daemon that the job step has completed.
<b>slurmctld</b> will simply log the job step termination. A job step can
also be explicitly cancelled by a user, reach the end of its time limit, etc.;
those cases follow a sequence of steps very similar to that for job termination,
which is described below.</p>
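<p>A single job step can also be signaled or cancelled by its
&lt;job_id&gt;.&lt;step_id&gt; without ending the job allocation (the IDs below
are illustrative):</p>
<pre>
$ scancel 1234.0                 # cancel only step 0 of job 1234; the allocation remains
$ scancel --signal=INT 1234.1    # or deliver a specific signal to another step
</pre>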
<h2 id="job_termination">Job Termination
<a class="slurm_link" href="#job_termination"></a>
</h2>
<p>Job termination can be either user-initiated (e.g. the <b>scancel</b> command) or
system-initiated (e.g. the time limit being reached). Termination ultimately requires
the <b>slurmctld</b> daemon to notify the <b>slurmd</b> daemons on allocated
nodes that the job is to be ended. Each <b>slurmd</b> daemon does the following:</p>
<ol>
<li>Send SIGCONT and SIGTERM signals to any user tasks</li>
<li>Wait <b>KillWait</b> seconds if there are any user tasks</li>
<li>Send a SIGKILL signal to any user tasks</li>
<li>Wait for all tasks to complete</li>
<li>Execute any <b>Epilog</b> program</li>
<li>Send an epilog_complete RPC to the <b>slurmctld</b> daemon</li>
</ol>
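<p>Both termination paths can be illustrated from the command line (the job ID
and script name below are placeholders). In either case the sequence above runs
on every allocated node, with the interval between SIGTERM and SIGKILL controlled
by the <b>KillWait</b> configuration parameter:</p>
<pre>
$ scancel 1234                    # user-initiated termination
$ sbatch --time=00:01:00 job.sh   # this job's termination will be system-initiated when the one minute limit is reached
</pre>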
<h2 id="job_record">Job Accounting Records
<a class="slurm_link" href="#job_record"></a>
</h2>
<p>When Slurm is configured to use SlurmDBD to store job records (i.e.
<i>AccountingStorageType=accounting_storage/slurmdbd</i>), there are multiple
records stored for each job. There is a record for the job as a
whole, as well as entries for the following types of job steps:</p>
<ul>
<li><b>extern step</b> &mdash; A step created for each job as long as you have
<i>PrologFlags=contain</i> in your slurm.conf. Each node in the job will
have a slurmstepd process created for the extern step.
<a href=pam_slurm_adopt.html>pam_slurm_adopt</a> uses this step to contain
external connections.</li>
<li><b>batch step</b> &mdash; A step created for jobs that were submitted with
sbatch. The batch host, or the primary node for the job, will run an instance
of slurmstepd for the batch step, which is used to run the script provided
to sbatch.</li>
<li><b>interactive step</b> &mdash; A step created for jobs that were
submitted with salloc when <i>LaunchParameters=use_interactive_step</i> is
configured in your slurm.conf. The node on which you have the interactive
shell will run an instance of slurmstepd to run the shell or the command
provided to salloc.</li>
<li><b>normal step</b> &mdash; A job can have multiple normal steps, which
appear in sacct as &lt;<b>job_id</b>&gt;.&lt;<b>step_id</b>&gt;. These steps
are created when srun is called from inside the job, and the resulting slurmstepd
will run the command passed to srun. Each step will have one instance of
slurmstepd created per node in the step, and each instance of slurmstepd can
run multiple tasks in the same step.</li>
</ul>
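<p>With accounting enabled, <b>sacct</b> reports the job record together with one
record per step. For an sbatch job that ran two <b>srun</b> steps on a cluster
with <i>PrologFlags=contain</i> set, the output might look like the following
(the job ID, step names and states are illustrative):</p>
<pre>
$ sacct -j 1234 --format=JobID,JobName,State
JobID           JobName      State
------------ ---------- ----------
1234             job.sh  COMPLETED
1234.batch        batch  COMPLETED
1234.extern      extern  COMPLETED
1234.0            app_a  COMPLETED
1234.1            app_b  COMPLETED
</pre>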
<p style="text-align:center;">Last modified 1 August 2022</p>
<!--#include virtual="footer.txt"-->