<!--#include virtual="header.txt"-->
<h1><a name="top">Job Launch Design Guide</a></h1>
<h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>
<p>This guide describes, at a high level, the processes that occur in order
to initiate a job, including the daemons and plugins involved. It describes
the processes of job allocation, step allocation, task launch and
job termination. The functionality of tens of thousands of lines of code
has been distilled here into a couple of pages of text, so much detail is
necessarily omitted.</p>
<h2 id="job_allocation">Job Allocation
<a class="slurm_link" href="#job_allocation"></a>
</h2>
<p>The first step of the process is to create a job allocation, which is
a claim on compute resources. A job allocation can be created using the
<b>salloc</b>, <b>sbatch</b> or <b>srun</b> command. The <b>salloc</b> and
<b>sbatch</b> commands only create resource allocations, while the <b>srun</b>
command will create a resource allocation (if not already running within one)
and launch tasks. Each of these commands fills in a data structure
identifying the specifications of the job allocation request (e.g. node
count, task count, etc.) based upon command line options and environment
variables, and sends that RPC to the <b>slurmctld</b> daemon. The UID and GID of
the user launching the job are included in a credential which will be used
later to restrict access to the job, so further steps run in the allocation
must be launched using the same UID and GID as those used to create
the allocation. If the new job request is the highest priority, the
<b>slurmctld</b> daemon will attempt to select resources for it immediately;
otherwise it will validate that the job request can be satisfied at some time
and queue the request. In either case the request will receive a response
almost immediately containing one of the following:</p>
<ul>
<li>A job ID and the resource allocation specification (nodes, CPUs, etc.)</li>
<li>A job ID and notification of the job being in a queued state</li>
<li>An error code</li>
</ul>
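<p>As a sketch of what this looks like from the command line (the job IDs, node
counts and script name below are illustrative), each of these commands results
in a job allocation request being sent to <b>slurmctld</b>:</p>
<pre>
$ salloc -N2 -n4                # allocation only; a shell starts once resources are granted
salloc: Granted job allocation 1234

$ sbatch -N2 -n4 my_script.sh   # allocation only; the script runs when resources are granted
Submitted batch job 1235

$ srun -N2 -n4 hostname         # creates an allocation (if not already in one) and launches tasks
</pre>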
<p>The process of selecting resources for a job request involves multiple steps,
some of which involve plugins. The process is as follows:</p>
<ol>
<li>Call <b>job_submit</b> plugins to modify the request as appropriate</li>
<li>Validate that the options are valid for this user (e.g. valid partition
name, valid limits, etc.)</li>
<li>Determine if this job is the highest priority runnable job; if so, attempt
to allocate resources for it now, otherwise only validate that it
could run if no other jobs existed</li>
<li>Determine which nodes could be used for the job. If the feature
specification uses an exclusive OR option, then multiple iterations of the
selection process below will be required with disjoint sets of nodes</li>
<li>Call the <b>select</b> plugin to select the best resources for the request</li>
<li>The <b>select</b> plugin will consider network topology and the topology within
a node (e.g. sockets, cores, and threads) to select the best resources for the
job</li>
<li>If the job cannot be initiated using available resources and preemption
support is configured, the <b>select</b> plugin will also determine if the job
can be initiated after preempting lower priority jobs. If so, preemption is
initiated as needed to start the job</li>
</ol>
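<p>Which plugins participate in this selection process depends on the site
configuration. One way to check the relevant settings on a running cluster is
shown below; the values in the sample output are only examples and will differ
between installations:</p>
<pre>
$ scontrol show config | grep -E 'JobSubmitPlugins|SelectType|PreemptType|PreemptMode|TopologyPlugin'
JobSubmitPlugins        = lua
PreemptMode             = REQUEUE
PreemptType             = preempt/partition_prio
SelectType              = select/cons_tres
SelectTypeParameters    = CR_CORE_MEMORY
TopologyPlugin          = topology/tree
</pre>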
<h2 id="step_allocation">Step Allocation
<a class="slurm_link" href="#step_allocation"></a>
</h2>
<p>The <b>srun</b> command is always used for job step creation. It fills in
a job step request RPC using information from the command line and environment
variables, then sends that request to the <b>slurmctld</b> daemon. It is
important to note that many of the <b>srun</b> options are intended for job
allocation and are not supported by the job step request RPC (for example the
socket, core and thread information is not supported). If a job step uses
all of the resources allocated to the job then the lack of support for some
options is not important. If one wants to execute multiple job steps using
various subsets of resources allocated to the job, this shortcoming could
prove problematic. It is also worth noting that the logic used to select
resources for a job step is relatively simple and entirely contained within
the <b>slurmctld</b> daemon code (the <b>select</b> plugin is not used for job
steps). If the request cannot be immediately satisfied due to a request for
exclusive access to resources, an appropriate error message will be sent and
the <b>srun</b> command will retry the request periodically.
(<b>NOTE</b>: It would be desirable to queue job step requests to support
job step dependencies and better performance in the initiation of job steps,
but that is not currently supported.)
If the request can be satisfied, the response contains a digitally signed
credential (by the <b>cred</b> plugin) identifying the resources to be used.</p>
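<p>As a sketch of how several job steps can share one allocation (the application
names are placeholders), the batch script below runs two steps concurrently, each
on a subset of the allocated tasks; every <b>srun</b> invocation results in a
separate job step request being sent to <b>slurmctld</b>:</p>
<pre>
#!/bin/bash
#SBATCH -N2 -n4

# Two concurrent steps, each using half of the allocated tasks.
# --exclusive asks for dedicated resources within the allocation, so a step
# may wait until the resources it needs are released by an earlier step.
srun -n2 --exclusive ./app_a &
srun -n2 --exclusive ./app_b &
wait
</pre>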
<h2 id="task_launch">Task Launch
<a class="slurm_link" href="#task_launch"></a>
</h2>
<p>The <b>srun</b> command builds a task launch request data structure
including the credential, executable name, file names, etc. and sends it to
the <b>slurmd</b> daemon on node zero of the job step allocation. The
<b>slurmd</b> daemon validates the signature and forwards the request to the
<b>slurmd</b> daemons on other nodes to launch tasks for that job step. The
degree of fanout in this message forwarding is configurable using the
<b>TreeWidth</b> parameter. Each <b>slurmd</b> daemon tests that the job has
not been cancelled since the credential was issued (due to a possible race
condition) and spawns a <b>slurmstepd</b> program to manage the job step.
Note that the <b>slurmctld</b> daemon is not directly involved in task
launch in order to minimize the overhead on this critical resource.</p>
<p>Each <b>slurmstepd</b> program executes a single job step.
Besides the functions listed below, the <b>slurmstepd</b> program also
executes several SPANK plugin functions at various times.</p>
<ol>
<li>Performs MPI setup (using the appropriate plugin)</li>
<li>Calls the <b>switch</b> plugin to perform any needed network configuration</li>
<li>Creates a container for the job step using a <b>proctrack</b> plugin</li>
<li>Changes the user ID to that of the user</li>
<li>Configures I/O for the tasks (either using files or a socket connection back
to the <b>srun</b> command)</li>
<li>Sets up environment variables for the tasks, including many task-specific
environment variables</li>
<li>Forks and execs the tasks</li>
</ol>
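<p>The parent process of each launched task is the <b>slurmstepd</b> program
managing that step on the node, which can be observed from within the tasks
themselves (the node names in the sample output are illustrative):</p>
<pre>
$ srun -N2 -n2 bash -c 'echo "task $SLURM_PROCID on $(hostname): parent is $(ps -o comm= -p $PPID)"'
task 0 on node01: parent is slurmstepd
task 1 on node02: parent is slurmstepd
</pre>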
<h2 id="step_termination">Job Step Termination
<a class="slurm_link" href="#step_termination"></a>
</h2>
<p>There are several ways in which a job step or job can terminate, each with
slight variations in the logic executed. The simplest case is when the tasks run
to completion. The <b>srun</b> command will note the termination of output from the
tasks and notify the <b>slurmctld</b> daemon that the job step has completed.
<b>slurmctld</b> will simply log the job step termination. A job step can
also be explicitly cancelled by a user, reach the end of its time limit, etc.;
those cases follow a sequence of steps very similar to that for job termination,
which is described below.</p>
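<p>A single job step can also be signaled or cancelled by its
&lt;job_id&gt;.&lt;step_id&gt; without ending the job allocation (the IDs below
are illustrative):</p>
<pre>
$ scancel 1234.0                 # cancel only step 0 of job 1234; the allocation remains
$ scancel --signal=INT 1234.1    # or deliver a specific signal to another step
</pre>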
<h2 id="job_termination">Job Termination
<a class="slurm_link" href="#job_termination"></a>
</h2>
<p>Job termination can be either user-initiated (e.g. the <b>scancel</b> command) or
system-initiated (e.g. the time limit being reached). Termination ultimately requires
the <b>slurmctld</b> daemon to notify the <b>slurmd</b> daemons on allocated
nodes that the job is to be ended. Each <b>slurmd</b> daemon does the following:</p>
<ol>
<li>Send SIGCONT and SIGTERM signals to any user tasks</li>
<li>Wait <b>KillWait</b> seconds if there are any user tasks</li>
<li>Send a SIGKILL signal to any user tasks</li>
<li>Wait for all tasks to complete</li>
<li>Execute any <b>Epilog</b> program</li>
<li>Send an epilog_complete RPC to the <b>slurmctld</b> daemon</li>
</ol>
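<p>Both termination paths can be illustrated from the command line (the job ID
and script name below are placeholders). In either case the sequence above runs
on every allocated node, with the interval between SIGTERM and SIGKILL controlled
by the <b>KillWait</b> configuration parameter:</p>
<pre>
$ scancel 1234                    # user-initiated termination
$ sbatch --time=00:01:00 job.sh   # this job's termination will be system-initiated when the one minute limit is reached
</pre>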
<h2 id="job_record">Job Accounting Records
<a class="slurm_link" href="#job_record"></a>
</h2>
<p>When Slurm is configured to use SlurmDBD to store job records (i.e.
<i>AccountingStorageType=accounting_storage/slurmdbd</i>), there are multiple
records stored for each job. There is a record for the job as a
whole, as well as entries for the following types of job steps:</p>
<ul>
<li><b>extern step</b> &mdash; A step created for each job as long as you have
<i>PrologFlags=contain</i> in your slurm.conf. Each node in the job will
have a slurmstepd process created for the extern step.
<a href=pam_slurm_adopt.html>pam_slurm_adopt</a> uses this step to contain
external connections.</li>
<li><b>batch step</b> &mdash; A step created for jobs that were submitted with
sbatch. The batch host, or the primary node for the job, will run an instance
of slurmstepd for the batch step, which is used to run the script provided
to sbatch.</li>
<li><b>interactive step</b> &mdash; A step created for jobs that were
submitted with salloc when <i>LaunchParameters=use_interactive_step</i> is
configured in your slurm.conf. The node on which you have the interactive
shell will run an instance of slurmstepd to run the shell or the command
provided to salloc.</li>
<li><b>normal step</b> &mdash; A job can have multiple normal steps, which
appear in sacct as &lt;<b>job_id</b>&gt;.&lt;<b>step_id</b>&gt;. These steps
are created when srun is called from inside the job, and the resulting slurmstepd
will run the command passed to srun. Each step will have one instance of
slurmstepd created per node in the step, and each instance of slurmstepd can
run multiple tasks in the same step.</li>
</ul>
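<p>With accounting enabled, <b>sacct</b> reports the job record together with one
record per step. For an sbatch job that ran two <b>srun</b> steps on a cluster
with <i>PrologFlags=contain</i> set, the output might look like the following
(the job ID, step names and states are illustrative):</p>
<pre>
$ sacct -j 1234 --format=JobID,JobName,State
JobID           JobName      State
------------ ---------- ----------
1234             job.sh  COMPLETED
1234.batch        batch  COMPLETED
1234.extern      extern  COMPLETED
1234.0            app_a  COMPLETED
1234.1            app_b  COMPLETED
</pre>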
<p style="text-align:center;">Last modified 1 August 2022</p>
<!--#include virtual="footer.txt"-->