| \section{SLURM Architecture} |
| |
| As a cluster resource manager, SLURM has three key functions. First, |
| it allocates exclusive and/or non-exclusive access to resources to users for |
| some duration of time so they can perform work. Second, it provides |
| a framework for starting, executing, and monitoring work |
| on the set of allocated nodes. Finally, it arbitrates |
| conflicting requests for resources by managing a queue of pending work. |
Users and system administrators interact with SLURM using simple
commands: \srun, \scancel, \squeue, \sinfo, and \scontrol.
| |
| %Users interact with SLURM through four command line utilities: |
| %\srun\ for submitting a job for execution and optionally controlling it |
| %interactively, |
| %\scancel\ for early termination of a pending or running job, |
| %\squeue\ for monitoring job queues, and |
| %\sinfo\ for monitoring partition and overall system state. |
| %System administrators perform privileged operations through an additional |
| %command line utility: {\tt scontrol}. |
| % |
| %The central controller daemon, {\tt slurmctld}, maintains the global state |
| %and directs operations. |
| %Compute nodes simply run a \slurmd\ daemon (similar to a remote shell |
| %daemon) to export control to SLURM. |
| % |
| %SLURM is not a sophisticated batch system. |
| %In fact, it was expressly designed to provide high-performance |
| %parallel job management while leaving scheduling decisions to an |
| %external entity as will be described later. |
| |
| \begin{figure}[tb] |
| \centerline{\epsfig{file=../figures/arch.eps,scale=0.40}} |
| \caption{SLURM Architecture} |
| \label{arch} |
| \end{figure} |
| |
As shown in Figure~\ref{arch}, SLURM consists of a \slurmd\ daemon
running on each compute node, a central \slurmctld\ daemon running on
a management node (with an optional fail-over twin), and five command line
utilities,
| % {\tt srun}, {\tt scancel}, {\tt sinfo}, {\tt squeue}, and {\tt scontrol}, |
| which can run anywhere in the cluster. |
| |
The entities managed by these SLURM daemons include {\em nodes}, the
compute resource in SLURM; {\em partitions}, which group nodes into
logically disjoint sets; {\em jobs}, or allocations of resources assigned
to a user for a specified amount of time; and {\em job steps}, which are
sets of tasks within a job.
Each job is allocated nodes within a single partition.
Once a job is assigned a set of nodes, the user is able to initiate
parallel work in the form of job steps in any configuration within the
allocation. For instance, a single job step may be started that utilizes
all nodes allocated to the job, or several job steps may independently
use a portion of the allocation.
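
The containment relationships among these entities can be summarized
with a small sketch. The following C structures and the bitmap
representation are purely illustrative assumptions and do not reflect
SLURM's actual data structures:

\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t node_bitmap_t;   /* one bit per node (up to 64 here) */

struct partition {                /* logically disjoint set of nodes  */
    const char   *name;
    node_bitmap_t nodes;
};

struct job {                      /* allocation within one partition  */
    int               id;
    struct partition *part;
    node_bitmap_t     alloc;
};

struct job_step {                 /* set of tasks within a job        */
    int           id;
    struct job   *job;
    node_bitmap_t nodes;
};

/* A step may only use nodes from its job's allocation, and the job
 * may only use nodes from its (single) partition.                    */
static bool step_is_valid(const struct job_step *s)
{
    return (s->nodes & ~s->job->alloc) == 0 &&
           (s->job->alloc & ~s->job->part->nodes) == 0;
}

int main(void)
{
    struct partition debug = { "debug", 0x0F };    /* nodes 0-3       */
    struct job       job   = { 1, &debug, 0x0F };  /* whole partition */
    struct job_step  s1    = { 0, &job, 0x03 };    /* nodes 0-1       */
    struct job_step  s2    = { 1, &job, 0x0C };    /* nodes 2-3       */

    printf("%d %d\n", step_is_valid(&s1), step_is_valid(&s2));
    return 0;
}
\end{verbatim}

Here two job steps independently use disjoint halves of a single job's
four-node allocation, corresponding to the second case above.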
| |
| %\begin{figure}[tcb] |
| %\centerline{\epsfig{file=../figures/entities.eps,scale=0.7}} |
| %\caption{SLURM Entities} |
| %\label{entities} |
| %\end{figure} |
| % |
| %Figure~\ref{entities} further illustrates the interrelation of these |
| %entities as they are managed by SLURM. The diagram shows a group of |
| %compute nodes split into two partitions. Partition 1 is running one |
| %job, with one job step utilizing the full allocation of that job. |
| %The job in Partition 2 has only one job step using half of the original |
| %job allocation. |
| %That job might initiate additional job step(s) to utilize |
| %the remaining nodes of its allocation. |
| |
| \begin{figure}[tb] |
| \centerline{\epsfig{file=../figures/slurm-arch.eps,scale=0.5}} |
| \caption{SLURM Architecture - Subsystems} |
| \label{archdetail} |
| \end{figure} |
| |
| Figure~\ref{archdetail} exposes the subsystems that are implemented |
| within the \slurmd\ and \slurmctld\ daemons. These subsystems |
| are explained in more detail below. |
| |
| \subsection{SLURM Local Daemon (Slurmd)} |
| |
| The \slurmd\ is a multi-threaded daemon running on each compute node. |
| It reads the common SLURM configuration file and recovers any |
| previously saved state information, |
| notifies the controller that it is active, waits for work, |
| executes the work, returns status, and waits for more work. |
| Since it initiates jobs for other users, it must run with root privilege. |
| %It also asynchronously exchanges node and job status information with {\tt slurmctld}. |
| The only job information it has at any given time pertains to its |
| currently executing jobs. |
The \slurmd\ performs the following major tasks.
| |
| \begin{itemize} |
| \item {\em Machine and Job Status Services}: Respond to controller |
| requests for machine and job state information, and send asynchronous |
| reports of some state changes (e.g. \slurmd\ startup) to the controller. |
| |
| \item {\em Remote Execution}: Start, monitor, and clean up after a set |
| of processes (typically belonging to a parallel job) as dictated by the |
| \slurmctld\ daemon or an \srun\ or \scancel\ command. Starting a process may |
| include executing a prolog program, setting process limits, setting real |
| and effective user id, establishing environment variables, setting working |
| directory, allocating interconnect resources, setting core file paths, |
| initializing the Stream Copy Service, and managing |
| process groups. Terminating a process may include terminating all members |
| of a process group and executing an epilog program. |
| |
\item {\em Stream Copy Service}: Allow handling of stderr, stdout, and
stdin of remote tasks. Job input may be redirected from a file or files, an
\srun\ process, or /dev/null. Job output may be saved into local files or
sent back to the \srun\ command. Regardless of the location of stdout or
stderr, all job output is locally buffered to avoid blocking local tasks
(a minimal sketch of this buffering technique follows the list).
| |
| \item {\em Job Control}: Allow asynchronous interaction with the |
| Remote Execution environment by propagating signals or explicit job |
| termination requests to any set of locally managed processes. |
| |
| \end{itemize} |
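
The local buffering mentioned under the Stream Copy Service can be
illustrated with a minimal, self-contained sketch. It shows only the
underlying technique (draining a task's output into a local buffer so
the task never blocks on a slow consumer) and is not slurmd's actual
implementation:

\begin{verbatim}
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int task_out[2];                   /* stands in for a task's stdout */
    if (pipe(task_out) != 0)
        return 1;

    if (fork() == 0) {                 /* the "task" writes some output */
        const char msg[] = "hello from task\n";
        close(task_out[0]);
        write(task_out[1], msg, sizeof(msg) - 1);
        _exit(0);
    }
    close(task_out[1]);

    char   buf[4096];                  /* local buffer                  */
    size_t used = 0;
    struct pollfd pfd = { .fd = task_out[0], .events = POLLIN };

    /* Drain the task's output as soon as it appears ...               */
    while (used < sizeof(buf) && poll(&pfd, 1, -1) > 0) {
        ssize_t n = read(task_out[0], buf + used, sizeof(buf) - used);
        if (n <= 0)
            break;
        used += (size_t)n;
    }
    /* ... and forward it later to a local file or back to srun.       */
    fwrite(buf, 1, used, stdout);
    return 0;
}
\end{verbatim}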
| |
| \subsection{SLURM Central Daemon (Slurmctld)} |
| |
| Most SLURM state information is maintained by the controller, {\tt slurmctld}. |
| The \slurmctld\ is multi-threaded with independent read and write locks |
| for the various data structures to enhance scalability. |
| When \slurmctld\ starts, it reads the SLURM configuration file. |
| It can also read additional state information |
| from a checkpoint file generated by a previous execution of {\tt slurmctld}. |
For fault tolerance, full controller state information is written to
disk periodically, and incremental changes are written to disk immediately.
| The \slurmctld\ runs in either master or standby mode, depending on the |
| state of its fail-over twin, if any. |
| The \slurmctld\ need not execute with root privilege. |
| %In fact, it is recommended that a unique user entry be created for |
| %executing \slurmctld\ and that user must be identified in the SLURM |
| %configuration file as {\tt SlurmUser}. |
| The \slurmctld\ consists of three major components: |
| |
| \begin{itemize} |
\item {\em Node Manager}: Monitors the state of each node in
the cluster. It periodically polls the \slurmd\ daemons for status and
receives asynchronous state change notifications from them.
It ensures that nodes have the prescribed configuration before being
considered available for use.
| |
\item {\em Partition Manager}: Groups nodes into non-overlapping sets called
{\em partitions}. Various job limits and access controls can be associated
with each partition. The Partition Manager also allocates nodes
to jobs based upon node and partition states and configurations. Requests
to initiate jobs come from the Job Manager. The \scontrol\ command may be
used to administratively alter node and partition configurations.
| |
\item {\em Job Manager}: Accepts user job requests and places pending
jobs in a priority-ordered queue.
The Job Manager is awakened on a periodic basis and whenever there
is a change in state that might permit a job to begin running, such
as job completion, job submission, or a partition-up or
node-up transition. The Job Manager then makes a pass
through the priority-ordered job queue. The highest-priority jobs
for each partition are allocated resources when possible. As soon as an
allocation failure occurs for any partition, no lower-priority jobs for
that partition are considered for initiation
(a sketch of this pass follows the list).
After completing the scheduling cycle, the Job Manager's scheduling
thread sleeps. Once a job has been allocated resources, the Job Manager
transfers the necessary state information to those nodes, permitting the
job to commence execution. When the Job Manager detects that
all nodes associated with a job have completed their work, it initiates
clean-up and performs another scheduling cycle as described above.
| |
| \end{itemize} |
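
The scheduling pass performed by the Job Manager can be sketched as
follows. The code is illustrative only (the job identifiers, partition
sizes, and {\tt try\_allocate} test are hypothetical) and is not the
actual \slurmctld\ implementation:

\begin{verbatim}
#include <stdbool.h>
#include <stdio.h>

enum { NPART = 2 };

struct job {
    int id;
    int priority;
    int part;          /* partition index          */
    int nodes_needed;
};

static int free_nodes[NPART] = { 4, 2 };  /* hypothetical node counts */

static bool try_allocate(const struct job *j)
{
    if (j->nodes_needed > free_nodes[j->part])
        return false;
    free_nodes[j->part] -= j->nodes_needed;
    return true;
}

int main(void)
{
    /* Queue already sorted by decreasing priority. */
    struct job queue[] = {
        { 101, 90, 0, 3 }, { 102, 80, 1, 4 },
        { 103, 70, 1, 1 }, { 104, 60, 0, 1 },
    };
    bool part_blocked[NPART] = { false, false };

    for (size_t i = 0; i < sizeof(queue) / sizeof(queue[0]); i++) {
        struct job *j = &queue[i];
        if (part_blocked[j->part])
            continue;              /* skip lower-priority jobs here    */
        if (try_allocate(j))
            printf("job %d started\n", j->id);
        else                       /* allocation failure stops this    */
            part_blocked[j->part] = true;   /* partition for this pass */
    }
    return 0;
}
\end{verbatim}

In actual operation, the allocation test also takes node and partition
state, configured limits, and access controls into account, as
described above.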