\section{SLURM Operation and Services}
\subsection{Command Line Utilities}
The command line utilities are the user interface to SLURM functionality.
They offer users access to remote execution and job control. They also
permit administrators to dynamically change the system configuration.
These commands all use SLURM APIs which are directly available for
more sophisticated applications.
\begin{itemize}
\item {\tt scancel}: Cancel a running or a pending job or job step,
subject to authentication and authorization. This command can also
be used to send an arbitrary signal to all processes on all nodes
associated with a job or job step.
\item {\tt scontrol}: Perform privileged administrative commands
such as draining a node or partition in preparation for maintenance.
Many \scontrol\ functions can only be executed by privileged users.
\item {\tt sinfo}: Display a summary of partition and node information.
An assortment of filtering and output format options are available.
\item {\tt squeue}: Display the queue of running and waiting jobs
and/or job steps. A wide assortment of filtering, sorting, and output
format options are available.
\item {\tt srun}: Allocate resources, submit jobs to the SLURM queue,
and initiate parallel tasks (job steps).
Every set of executing parallel tasks has an associated \srun\ that
initiated it and, if the \srun\ persists, is managing it.
Jobs may be submitted for batch execution, in which case
\srun\ terminates after job submission.
Jobs may also be submitted for interactive execution, where \srun\ keeps
running to shepherd the running job. In this case,
\srun\ negotiates connections with the remote {\tt slurmd}'s
to initiate the job,
receive stdout and stderr, forward stdin, and respond to signals from the user.
The \srun\ may also be instructed to allocate a set of resources and
spawn a shell with access to those resources.
\srun\ has a total of 13 parameters to control where and when the job
is initiated.
\end{itemize}
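As noted above, these utilities are layered on SLURM APIs that applications
may call directly. The following is a minimal sketch of such a call, assuming
functions along the lines of {\tt slurm\_load\_jobs} and
{\tt slurm\_free\_job\_info\_msg} declared in {\tt slurm/slurm.h}; exact names
and signatures may differ between SLURM versions.
\begin{verbatim}
/* Illustrative sketch only; consult slurm.h for the
 * authoritative prototypes. */
#include <stdio.h>
#include <time.h>
#include <slurm/slurm.h>

int main(void)
{
    job_info_msg_t *jobs = NULL;

    /* Request the current job table from slurmctld. */
    if (slurm_load_jobs((time_t) 0, &jobs, 0) != 0) {
        slurm_perror("slurm_load_jobs");
        return 1;
    }
    printf("%u jobs in the queue\n", (unsigned) jobs->record_count);
    slurm_free_job_info_msg(jobs);
    return 0;
}
\end{verbatim}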
\subsection{Plugins}
To support the use of different infrastructures, SLURM uses a
general-purpose plugin mechanism.
A SLURM plugin is a dynamically linked code object which is
loaded explicitly at run time by the SLURM libraries.
A plugin provides a customized implementation of a well-defined
API for tasks such as authentication, interconnect fabric management,
and task scheduling.
A common set of functions is defined for use by all of the different
infrastructures of a particular variety.
For example, the authentication plugin must define functions
such as:
{\tt slurm\_auth\_activate} to create a credential,
{\tt slurm\_auth\_verify} to verify a credential to
approve or deny authentication,
{\tt slurm\_auth\_get\_uid} to get the user ID associated with
a specific credential, etc.
It must also define the data structure used, a plugin type,
and a plugin version number.
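As an illustration, the interface exported by an authentication plugin might
look roughly like the following. This is a hypothetical sketch: the
identification symbols, credential type, and function signatures shown here
are assumptions, not the actual SLURM plugin API.
\begin{verbatim}
/* Hypothetical sketch of an authentication plugin's exports. */
#include <sys/types.h>
#include <stdint.h>

/* Identification symbols read by the plugin loader. */
const char     plugin_type[]  = "auth/example";
const uint32_t plugin_version = 1;

/* Opaque credential type defined by this plugin. */
typedef struct slurm_auth_credential slurm_auth_credential_t;

/* Create a credential for the calling user. */
slurm_auth_credential_t *slurm_auth_activate(int ttl);

/* Verify a credential; a nonzero return denies authentication. */
int slurm_auth_verify(slurm_auth_credential_t *cred);

/* Return the user ID bound to a verified credential. */
uid_t slurm_auth_get_uid(slurm_auth_credential_t *cred);
\end{verbatim}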
The available plugins are defined in the configuration file.
%When a slurm daemon is initiated, it reads the configuration
%file to determine which of the available plugins should be used.
%For example {\em AuthType=auth/authd} says to use the plugin for
%authd based authentication and {\em PluginDir=/usr/local/lib}
%identifies the directory in which to find the plugin.
\subsection{Communications Layer}
SLURM presently uses Berkeley sockets for communications.
However, we anticipate using the plugin mechanism to easily
permit use of other communications layers.
At LLNL we are using Ethernet for SLURM communications and
the Quadrics Elan switch exclusively for user applications.
The SLURM configuration file permits the specification of each
node's hostname as well as a separate name to be used for communications.
%In the case of a control machine known as {\em mcri} to be
%communicated with using the name {\em emcri} (say to indicate
%an ethernet communications path), this is represented in the
%configuration file as {\em ControlMachine=mcri ControlAddr=emcri}.
%The name used for communication is the same as the hostname unless
%%otherwise specified.
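As a minimal sketch of the underlying socket pattern (not SLURM's actual
implementation), a daemon might resolve a peer's communication name and open
a TCP connection as follows; the helper name and error handling are
illustrative only.
\begin{verbatim}
/* Sketch: connect to a peer daemon by its communication name. */
#include <string.h>
#include <netdb.h>
#include <unistd.h>
#include <sys/socket.h>

int connect_to_peer(const char *commname, const char *port)
{
    struct addrinfo hints, *res, *rp;
    int fd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_UNSPEC;    /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */
    if (getaddrinfo(commname, port, &hints, &res) != 0)
        return -1;
    for (rp = res; rp != NULL; rp = rp->ai_next) {
        fd = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, rp->ai_addr, rp->ai_addrlen) == 0)
            break;                    /* connected */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;                        /* -1 on failure */
}
\end{verbatim}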
While SLURM is able to manage 1000 nodes without difficulty using
sockets and Ethernet, we are reviewing other communication
mechanisms which may offer improved scalability.
One possible alternative is STORM\cite{STORM01}.
STORM uses the cluster interconnect and Network Interface Cards to
provide high-speed communications including a broadcast capability.
STORM only supports the Quadrics Elan interconnect at present,
but does offer the promise of improved performance and scalability.
\subsection{Security}
SLURM has a simple security model:
Any user of the cluster may submit parallel jobs for execution and cancel
his own jobs. Any user may view SLURM configuration and state
information.
Only privileged users may modify the SLURM configuration,
cancel any jobs, or perform other restricted activities.
Privileged users in SLURM include the users {\em root}
and {\tt SlurmUser} (as defined in the SLURM configuration file).
If permission to modify SLURM configuration is
required by others, set-uid programs may be used to grant specific
permissions to specific users.
We presently support three authentication mechanisms via plugins:
{\tt authd}\cite{Authd02}, {\tt munged}, and {\tt none}.
A plugin can easily be developed for Kerberos or other authentication
mechanisms as desired.
The \munged\ implementation is described below.
A \munged\ daemon running as user {\em root} on each node confirms the
identity of the user making the request using the {\tt getpeername}
function and generates a credential.
The credential contains a user ID,
group ID, time-stamp, lifetime, some pseudo-random information, and
any user supplied information. The \munged\ uses a private key to
generate a Message Authentication Code (MAC) for the credential.
The \munged\ then uses a public key to symmetrically encrypt
the credential including the MAC.
SLURM daemons and programs transmit this encrypted
credential with communications. The SLURM daemon receiving the message
sends the credential to \munged\ on that node.
The \munged\ decrypts the credential using its private key, validates it
and returns the user ID and group ID of the user originating the
credential.
The \munged\ prevents replay of a credential on any single node
by recording credentials that have already been authenticated.
In SLURM's case, the user supplied information includes node
identification information to prevent a credential from being
used on nodes it is not destined for.
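A hypothetical sketch of what such a credential might contain follows; the
field names, sizes, and layout are illustrative and do not reflect the
actual \munged\ wire format.
\begin{verbatim}
/* Illustrative credential layout; not munged's actual format. */
#include <sys/types.h>
#include <stdint.h>
#include <time.h>

#define SALT_LEN 8              /* assumed pseudo-random padding */

struct auth_credential {
    uid_t    uid;               /* requesting user               */
    gid_t    gid;               /* requesting group              */
    time_t   created;           /* time-stamp                    */
    uint32_t lifetime;          /* seconds until expiration      */
    uint8_t  salt[SALT_LEN];    /* pseudo-random data            */
    uint32_t data_len;          /* user-supplied payload length, */
    uint8_t  data[];            /*  e.g. destination node names  */
};
/* A MAC computed over the fields above is appended, and the
 * whole credential is then encrypted before transmission. */
\end{verbatim}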
When resources are allocated to a user by the controller, a
{\em job step credential} is generated by combining the user ID, job ID,
step ID, the list of resources allocated (nodes), and the credential
lifetime. This job step credential is encrypted with
a \slurmctld\ private key. This credential
is returned to the requesting agent ({\tt srun}) along with the
allocation response, and must be forwarded to the remote {\tt slurmd}'s
upon job step initiation. \slurmd\ decrypts this credential with the
\slurmctld 's public key to verify that the user may access
resources on the local node. \slurmd\ also uses this job step credential
to authenticate standard input, output, and error communication streams.
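The following sketch suggests the kind of check a \slurmd\ might perform on a
decrypted job step credential; the structure and function names are
hypothetical, and the real verification also depends on the \slurmctld\
signature described above.
\begin{verbatim}
/* Hypothetical check of a decrypted job step credential. */
#include <string.h>
#include <time.h>
#include <sys/types.h>
#include <stdint.h>

struct step_credential {
    uid_t     uid;          /* user the step runs as    */
    uint32_t  job_id;
    uint32_t  step_id;
    time_t    expiration;   /* creation time + lifetime */
    int       node_count;
    char    **nodes;        /* allocated node names     */
};

/* Return 1 if this node may run the step, 0 otherwise. */
int credential_valid(const struct step_credential *cred,
                     const char *this_node)
{
    int i;
    if (time(NULL) > cred->expiration)
        return 0;                          /* credential expired    */
    for (i = 0; i < cred->node_count; i++)
        if (strcmp(cred->nodes[i], this_node) == 0)
            return 1;                      /* node is in allocation */
    return 0;                              /* not allocated here    */
}
\end{verbatim}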
%Access to partitions may be restricted via a {\em RootOnly} flag.
%If this flag is set, job submit or allocation requests to this
%partition are only accepted if the effective user ID originating
%the request is a privileged user.
%The request from such a user may submit a job as any other user.
%This may be used, for example, to provide specific external schedulers
%with exclusive access to partitions. Individual users will not be
%permitted to directly submit jobs to such a partition, which would
%prevent the external scheduler from effectively managing it.
%Access to partitions may also be restricted to users who are
%members of specific Unix groups using a {\em AllowGroups} specification.
\subsection{Job Initiation}
There are three modes in which jobs may be run by users under SLURM. The
first and simplest is {\em interactive} mode, in which stdout and
stderr are displayed on the user's terminal in real time, and stdin and
signals may be forwarded from the terminal transparently to the remote
tasks. The second is {\em batch} mode, in which the job is
queued until the request for resources can be satisfied, at which time the
job is run by SLURM as the submitting user. In {\em allocate} mode,
a job is allocated to the requesting user, under which the user may
manually run job steps via a script or in a sub-shell spawned by \srun .
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/connections.eps,scale=0.5}}
\caption{\small Job initiation connections overview. 1. The \srun\ connects to
\slurmctld\ requesting resources. 2. \slurmctld\ issues a response
with a list of nodes and a job credential. 3. The \srun\ opens a listen
port for every task in the job step, then sends a run job step
request to \slurmd . 4. The \slurmd 's initiate the job step and connect
back to \srun\ for stdout/err. }
\label{connections}
\end{figure}
Figure~\ref{connections} gives a high-level depiction of the connections
that occur between SLURM components during a general interactive job
startup.
The \srun\ requests a resource allocation and job step initiation from the {\tt slurmctld},
which responds with the job ID, the list of allocated nodes, and a job credential
if the request is granted.
The \srun\ then initializes listen ports for each
task and sends a message to the {\tt slurmd}'s on the allocated nodes requesting
that the remote processes be initiated. The {\tt slurmd}'s begin execution of
the tasks and connect back to \srun\ for stdout and stderr. This process and
the other initiation modes are described in more detail below.
\subsubsection{Interactive mode initiation}
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/interactive-job-init.eps,scale=0.5} }
\caption{\small Interactive job initiation. \srun\ simultaneously allocates
nodes
and a job step from \slurmctld\ then sends a run request to all
\slurmd 's in the job. Dashed arrows indicate a periodic request that
may or may not occur during the lifetime of the job.}
\label{init-interactive}
\end{figure}
Interactive job initiation is illustrated in Figure~\ref{init-interactive}.
The process begins with a user invoking \srun\ in interactive mode.
In Figure~\ref{init-interactive}, the user has requested an interactive
run of the executable ``{\tt cmd}'' in the default partition.
After processing command line options, \srun\ sends a message to
\slurmctld\ requesting a resource allocation and a job step initiation.
This message simultaneously requests an allocation (or job) and a job step.
The \srun\ waits for a reply from {\tt slurmctld}, which may not come instantly
if the user has requested that \srun\ block until resources are available.
When resources are available
for the user's job, \slurmctld\ replies with a job step credential, the list of
nodes that were allocated, CPUs per node, and so on. The \srun\ then sends
a message to each \slurmd\ on the allocated nodes requesting that a job
step be initiated. The \slurmd 's verify that the job is valid using
the forwarded job step credential and then respond to \srun .
Each \slurmd\ invokes a job thread to handle the request, which in turn
invokes a task thread for each requested task. The task thread connects
back to a port opened by \srun\ for stdout and stderr. The host and
port for this connection are contained in the run request message sent
to this machine by \srun . Once stdout and stderr have successfully
been connected, the task thread takes the necessary steps to initiate
the user's executable on the node, initializing environment, current
working directory, and interconnect resources if needed.
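A rough sketch of the launch of a single task, once its stdout/stderr
connection is established, is shown below; it is run in a forked child,
{\tt connect\_to\_peer} is the hypothetical helper sketched in the
communications section, and the real \slurmd\ handles many additional
details (signals, environment setup, interconnect resources).
\begin{verbatim}
/* Sketch: connect a task's I/O back to srun, then exec it. */
#include <unistd.h>

/* Hypothetical helper: open a TCP connection to host:port. */
extern int connect_to_peer(const char *host, const char *port);

void launch_task(const char *srun_host, const char *srun_port,
                 const char *workdir, char *const argv[],
                 char *const envp[])
{
    int io = connect_to_peer(srun_host, srun_port);
    if (io < 0)
        _exit(1);

    /* Route the task's stdout and stderr back to srun. */
    dup2(io, STDOUT_FILENO);
    dup2(io, STDERR_FILENO);

    /* Establish the working directory, then start the
     * user's executable with the requested environment. */
    if (chdir(workdir) != 0)
        _exit(1);
    execve(argv[0], argv, envp);
    _exit(127);                 /* exec failed */
}
\end{verbatim}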
Once the user process exits, the task thread records the exit status and
sends a task exit message back to \srun . When all local processes
terminate, the job thread exits. The \srun\ process either waits
for all tasks to exit, or attempts to clean up the remaining processes
some time after the first task exits.
Regardless, once all
tasks are finished, \srun\ sends a message to the \slurmctld\ releasing
the allocated nodes, then exits with an appropriate exit status.
When the \slurmctld\ receives notification that \srun\ no longer needs
the allocated nodes, it issues a request for the epilog to be run on each of
the \slurmd 's in the allocation. As \slurmd 's report that the epilog ran
successfully, the nodes are returned to the partition.
\subsubsection{Batch mode initiation}
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/queued-job-init.eps,scale=0.5} }
\caption{\small Queued job initiation.
\slurmctld\ initiates the user's job as a batch script on one node.
The batch script contains an \srun\ call that initiates parallel tasks
after instantiating a job step with the controller. The shaded region is
a compressed representation and is illustrated in more detail in the
interactive diagram (Figure~\ref{init-interactive}).}
\label{init-batch}
\end{figure}
Figure~\ref{init-batch} illustrates the initiation of a batch job in SLURM.
Once a batch job is submitted, \srun\ sends a batch job request
to \slurmctld\ that contains the input/output location for the job, current
working directory, environment, and requested number of nodes. The
\slurmctld\ queues the request in its priority-ordered queue.
Once the resources are available and the job has a high enough priority,
\slurmctld\ allocates the resources to the job and contacts the first node
of the allocation requesting that the user job be started. In this case,
the job may either be another invocation of \srun\ or a {\em job script} which
may have multiple invocations of \srun\ within it. The \slurmd\ on the remote
node responds to the run request, initiating the job thread, task thread,
and user script. An \srun\ executed from within the script detects that it
has access to an allocation and initiates a job step on some or all of the
nodes within the job.
Once the job step is complete, the \srun\ in the job script notifies the
\slurmctld , and terminates. The job script continues executing and may
initiate further job steps. Once the job script completes, the task
thread running the job script collects the exit status and sends a task exit
message to the \slurmctld . The \slurmctld\ notes that the job is complete
and requests that the job epilog be run on all nodes that were allocated.
As the \slurmd 's respond with successful completion of the epilog,
the nodes are returned to the partition.
\subsubsection{Allocate mode initiation}
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/allocate-init.eps,scale=0.5} }
\caption{\small Job initiation in allocate mode. Resources are allocated and
\srun\ spawns a shell with access to the resources. When the user runs
an \srun\ from within the shell, a job step is initiated under
the allocation.}
\label{init-allocate}
\end{figure}
In allocate mode, the user wishes to allocate a job and interactively run
job steps under that allocation. The process of initiation in this mode
is illustrated in Figure~\ref{init-allocate}. The invoked \srun\ sends
an allocate request to \slurmctld , which, if resources are available,
responds with the list of allocated nodes, job ID, etc. The \srun\
process spawns a shell on the user's terminal with access to the
allocation, then waits for the shell to exit, at which time the job
is considered complete.
An \srun\ initiated within the allocate sub-shell recognizes that it
is running under an allocation and therefore already within a job. Provided
with no other arguments, \srun\ started in this manner initiates a job
step on all nodes within the current job. However, the user may select
a subset of these nodes implicitly.
An \srun\ executed from the sub-shell reads the environment and
user options, then notifies the controller that it is starting a job step
under the current job. The \slurmctld\ registers the job step and responds
with a job credential. The \srun\ then initiates the job step using the same
general method as described in the section on interactive job initiation.
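For example, such detection might be as simple as inspecting an environment
variable exported by the allocating \srun ; the variable name used here is
illustrative.
\begin{verbatim}
/* Sketch: detect whether we are already inside an allocation.
 * Returns the job ID string, or NULL if a new allocation is
 * needed.  The variable name is illustrative. */
#include <stdlib.h>

static const char *existing_allocation(void)
{
    return getenv("SLURM_JOBID");
}
\end{verbatim}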
When the user exits the allocate sub-shell, the original \srun\ receives
the exit status, notifies \slurmctld\ that the job is complete, and exits.
The controller runs the epilog on each of the allocated nodes, returning
nodes to the partition as they complete the epilog.