| % Presenter info: |
| % SIGOPS |
| % |
| % Use a two-column, 8.5" x 11" document format |
| % Use a 0.75" margin everywhere |
% All fonts MUST be Type 1 or 3 PostScript fonts from the Latin 1 Fontset.
% Do not use TrueType, bitmapped, or Ryumin fonts.
| % Use 10-point font size. |
| % |
| |
| \documentclass{../common/acm} |
| |
| \usepackage{verbatim,moreverb} |
| \usepackage{float} |
| \usepackage{graphicx} |
| |
| % Needed for LLNL cover page / TID bibliography style |
| \usepackage{calc} |
| \usepackage{epsfig} |
| \usepackage{graphics} |
| \usepackage{hhline} |
| \input{pstricks} |
| \input{pst-node} |
| \usepackage{chngpage} |
| \input{llnlCoverPage} |
| |
| % Uncomment for DRAFT "watermark" |
| %\usepackage{draftcopy} |
| |
| % Times font as default roman |
| \renewcommand{\sfdefault}{phv} |
| \renewcommand{\rmdefault}{ptm} |
| %\renewcommand{\ttdefault}{pcr} |
| %\usepackage{times} |
| \renewcommand{\labelitemi}{$\bullet$} |
| |
| \setlength{\oddsidemargin}{-0.25in} |
| \setlength{\evensidemargin}{-0.25in} |
| \setlength{\topmargin}{-0.25in} % was{-0.75in} |
| \setlength{\headsep}{0.5in} |
| \setlength{\footskip}{24.0pt} |
| \setlength{\textheight}{9.5in} |
| \setlength{\columnwidth}{3.32in} |
| \setlength{\textwidth}{7.0in} |
| \setlength{\columnsep}{0.30in} |
| \setlength{\columnseprule}{0.0pt} |
| |
| % couple of macros for the title page and document |
| \def\ctit{Scalable Resource Management with SLURM} |
| \def\ucrl{UCRL-JC-TBD} |
| \def\auth{Morris Jette} |
| \def\pubdate{January 23, 2006} |
| \def\journal{Operating Systems Reviews} |
| |
| \begin{document} |
| |
| % make the cover page |
| %\makeLLNLCover{\ucrl}{\ctit}{\auth}{\journal}{\pubdate}{0in}{0in} |
| |
| \title{\ctit} |
| \numberofauthors{1} |
| \author{ |
| \alignauthor Morris Jette\\ |
| \affaddr{Lawrence Livermore National Laboratory}\\ |
| \affaddr{Livermore, CA, USA}\\ |
| \email{jette1@llnl.gov} |
| } |
| |
| \maketitle |
| \begin{abstract} |
| While the increase in performance of individual computer |
| processors through the years has been impressive, the demand |
| for compute resources has grown even more rapidly. |
| This demand has been met over the past couple of decades by |
| turning to parallel computing. |
| While operating systems, compilers, libraries, and a wide |
| variety of tools exist to make effective use of independent |
| computers, a new class of tools is required to effectively |
| utilize a parallel computer. |
| One important tool is the resource manager, which can be viewed |
as the ``glue'' to run applications on the parallel computer.
| One very popular resource manager is the Simple Linux |
| Utility for Resource Management (SLURM, |
| http://www.llnl.gov/linux/slurm). |
| SLURM performs resource management on about 1000 parallel |
| computers around the world, including many of the largest |
| systems. |
| This paper will describe resource management scalability |
| issues and SLURM's implementation. |
| \end{abstract} |
| |
| |
| \section{Introduction} |
| |
| Executing a parallel program on a cluster involves several |
steps. First, resources (cores, memory, nodes, etc.) suitable
| for running the program must be identified. These resources |
| are then typically dedicated to the program. In some |
| cases, the computer's network/interconnect must be |
| configured for the program's use. The program's tasks |
| are initiated on the allocated resources. The input and |
| output for all of these tasks must be processed. |
Upon program termination, its allocated resources
are released for use by other programs.
| All of these operations are straightforward, but performing |
| resource management on clusters containing thousands of |
| nodes and over 130,000 processor cores requires more |
| than a high degree of parallelism. |
| In many respects, data management and fault-tolerance issues |
| are paramount. |
| |
| SLURM is a resource manager jointly developed by Lawrence |
| Livermore National Laboratory (LLNL), |
Hewlett-Packard, and Linux NetworX~\cite{SLURM2003,Yoo2003,SlurmWeb}.
| SLURM's general characteristics include: |
| |
| \begin{itemize} |
| \item {\tt Simplicity}: SLURM is simple enough to allow motivated |
| end users to understand its source code and add functionality. |
| It supports only a few simple scheduling algorithms, |
| but relies upon an external scheduler for sophisticated |
| workload prioritization. |
| |
| \item {\tt Open Source}: SLURM is available to everyone and |
| will remain free. |
| Its source code is distributed under the GNU General Public |
License~\cite{GPL2002}.
| |
| \item {\tt Portability}: SLURM is written in the C language, |
| with a GNU {\em autoconf} configuration engine. |
SLURM provides a fully functional core with a
wide assortment of plugins available for customization.
| A total of 30 different plugins are available and provide a |
| great deal of flexibility in configuring SLURM using a |
| building block approach. |
| |
| \item {\tt Scalability}: SLURM is designed for scalability to clusters |
| of thousands of nodes. Time to fully execute (allocate, launch |
| tasks, process I/O, and deallocate resources) a simple program |
| is only 2 seconds for 4,800 tasks on 2,400 nodes. Clusters |
| containing up to 16,384 nodes have been emulated with highly |
| scalable performance. |
| |
| \item {\tt Fault Tolerance}: SLURM can handle a variety of failures |
in hardware or the infrastructure without inducing failures in
| the workload. |
| |
\item {\tt Security}: SLURM employs cryptographic techniques to authenticate
users to services and services to each other, with a variety of options
| available through the plugin mechanism. |
| |
| \item {\tt System Administrator Friendly}: SLURM utilizes |
| a simple configuration file and minimizes distributed state. |
| Its configuration may be changed at any time without impacting running |
| jobs. Heterogeneous nodes within a cluster may be easily managed. SLURM |
| interfaces are usable by scripts and its behavior is highly deterministic. |
| |
| \end{itemize} |
| |
| \section{Architecture} |
| |
| \begin{figure}[tb] |
| \centerline{\includegraphics[width=3.32in]{../figures/arch2.eps}} |
| \caption{\small SLURM architecture} |
| \label{arch} |
| \end{figure} |
| |
| SLURM's commands and daemons are illustrated in Figure~\ref{arch}. |
| The main SLURM control program, {\tt slurmctld}, orchestrates |
| activities throughout the cluster. While highly optimized, |
| {\tt slurmctld} is best run on a dedicated node of the cluster for optimal performance. |
| In addition, SLURM provides the option of running a backup controller |
| on another node for increased fault-tolerance. |
| Each node in the cluster available for running user applications |
| has a relatively small daemon called {\tt slurmd} that monitors |
| and manages resources and work within that node. |
| Several user tools are provided: |
| |
| \begin{itemize} |
| \item {\tt scontrol} is a system administrator tool to monitor and change |
| system configuration |
| |
| \item {\tt sinfo} reports status and configuration of the nodes and queues |
| |
| \item {\tt squeue} reports status of jobs |
| |
| \item {\tt scancel} can signal and/or cancel jobs or job steps |
| |
| \item {\tt smap} reports node, queue, and job status including |
| topological state information |
| |
| \item {\tt sacct} reports job accounting information |
| |
| \item {\tt sbcast} copies a file to local file systems on a job's allocated nodes |
| |
| \item {\tt srun} is used to allocate resources and spawn the job's tasks. |
| \end{itemize} |
| |
| \begin{figure}[tcb] |
| \centerline{\includegraphics[width=3.32in]{../figures/entities2.eps}} |
| \caption{\small SLURM entities: nodes, partitions, jobs, and job steps} |
| \label{entities} |
| \end{figure} |
| |
| The entities managed by SLURM are illustrated in Figure~\ref{entities} |
| and include: |
{\em nodes}, including their processors, memory, and temporary disk space;
{\em partitions}, which are collections of nodes with various limits and constraints;
{\em jobs}, which are allocations of resources assigned
to a user for a specified amount of time; and
{\em job steps}, which are sets of (possibly parallel) tasks within a job.
Each node must be capable of independent scheduling and job execution\footnote{On BlueGene computers, the c-nodes cannot be independently
| scheduled. Each midplane or base partition is considered a SLURM node |
| with 1,024 processors. SLURM supports the execution of more than one |
| job per BlueGene node.}. |
| Each job in the priority-ordered queue is allocated nodes within a single |
| partition. |
Since nodes can be in multiple partitions, partitions can be
thought of as general-purpose job queues.
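
For illustration, one way these entities might be represented in C is
sketched below. The structures and field names are invented for this
discussion and do not correspond to SLURM's actual internal data structures.

\begin{verbatim}
#include <stdint.h>
#include <time.h>

/* Illustrative only; not SLURM's actual structures. */
typedef struct node {
    char     *name;         /* e.g. "linux42"        */
    uint16_t  cpus;         /* processors            */
    uint32_t  real_memory;  /* memory (MB)           */
    uint32_t  tmp_disk;     /* temporary disk (MB)   */
} node_t;

typedef struct partition {
    char     *name;         /* e.g. "batch"          */
    uint64_t *node_bitmap;  /* member nodes          */
    uint32_t  max_time;     /* per-job time limit    */
} partition_t;

typedef struct job {
    uint32_t  job_id;
    uint32_t  user_id;
    char     *partition;    /* partition used        */
    uint64_t *node_bitmap;  /* allocated nodes       */
    time_t    end_time;     /* allocation expires    */
} job_t;

typedef struct job_step {
    uint32_t  job_id;       /* parent job            */
    uint32_t  step_id;
    uint32_t  num_tasks;    /* (parallel) tasks      */
} job_step_t;
\end{verbatim}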
| |
| \section{Implementation} |
| |
Very high scalability was treated as a top priority in SLURM's design.
More comprehensive support for smaller clusters was, in fact,
added only in later revisions.
| For example, the initial implementation allocated whole nodes |
| to jobs, in part to avoid the extra overhead of tracking individual |
| processors. |
| While allocation of entire nodes to jobs is still a recommended mode of |
| operation for very large clusters, an alternate SLURM plugin provides |
resource management down to the resolution of individual processors.
| |
SLURM's {\tt srun} command and the daemons are extensively
| multi-threaded. |
| {\tt slurmctld} also maintains independent read and |
| write locks for critical data structures. |
| The combination of these two features means that, |
| for example, three users can get job state information at the same |
| time that a system administrator is modifying the time limit |
| for a partition. |
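
The locking discipline can be illustrated with POSIX read-write locks:
any number of readers proceed concurrently, while a writer obtains
exclusive access. The sketch below is schematic only and is not
{\tt slurmctld}'s actual code; the variable and function names are
invented for this example.

\begin{verbatim}
#include <pthread.h>

/* Schematic only: many readers may proceed at   */
/* once; a writer waits for exclusive access.    */
static pthread_rwlock_t part_lock =
    PTHREAD_RWLOCK_INITIALIZER;
static int part_time_limit = 60;  /* minutes */

int read_time_limit(void)   /* e.g. sinfo path */
{
    pthread_rwlock_rdlock(&part_lock);  /* shared */
    int limit = part_time_limit;
    pthread_rwlock_unlock(&part_lock);
    return limit;
}

void set_time_limit(int min)  /* e.g. scontrol path */
{
    pthread_rwlock_wrlock(&part_lock);  /* exclusive */
    part_time_limit = min;
    pthread_rwlock_unlock(&part_lock);
}
\end{verbatim}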
| |
| \begin{figure}[tcb] |
| \centerline{\includegraphics[width=3.32in]{../figures/comm.eps}} |
| \caption{\small Hierarchical SLURM communications with fanout=2} |
| \label{comms} |
| \end{figure} |
| |
| Communications to large numbers of nodes are optimized in two |
ways. First, the programs initiating communications are multithreaded
| and can process tens or even hundreds of simultaneous active |
| communications. |
| Second, the {\tt slurmd} daemon is designed to forward |
communications on a hierarchical basis as shown in Figure~\ref{comms}.
| For example, the initiation of tasks on 1000 nodes does not require |
{\tt srun} to communicate directly with all 1000 nodes. {\tt Srun}
| can communicate directly with {\tt slurmd} daemons on 32 nodes |
| (the degree of fanout in communications is configurable). |
Each of those {\tt slurmd} daemons simultaneously forwards the request
to the {\tt slurmd} daemons on another 32 nodes.
| This improves performance by distributing the communication workload. |
Note that every communication is authenticated and acknowledged
| for fault-tolerance. |
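
The benefit of the hierarchy is that the number of forwarding rounds
grows only logarithmically with the node count. The following sketch
(illustrative only, not SLURM code) computes that depth under the
simple assumption that coverage multiplies by the fanout each round.

\begin{verbatim}
#include <stdio.h>

/* Forwarding rounds needed to reach "nodes" nodes */
/* when each daemon forwards to "fanout" others.   */
static int rounds(int nodes, int fanout)
{
    int reached = 1, depth = 0;
    while (reached < nodes) {
        reached *= fanout;  /* coverage per round */
        depth++;
    }
    return depth;
}

int main(void)
{
    printf("%d\n", rounds(1000, 32));   /* 2 rounds */
    printf("%d\n", rounds(16384, 32));  /* 3 rounds */
    return 0;
}
\end{verbatim}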
| |
A number of interesting papers~\cite{Jones2003,Kerbyson2001,Petrini2003,Phillips2003,Tsafrir2005}
| have recently been written about |
| the impact of system daemons and other system overhead on |
| parallel job performance. This {\tt system noise} can have a |
| dramatic impact upon the performance of highly parallel jobs. |
In a simplified example, consider a system daemon that runs for
| one second out of every 100 seconds on every node in a cluster. |
| For serial jobs this |
| reduces throughput by one percent, but the impact is compounded |
| on parallel computers if these daemons do not all execute concurrently. |
| If the parallel program runs on 100 nodes and tries to synchronize |
| every second, almost every synchronization period will include a |
| one second delay from the daemon running on one of the 100 nodes. |
| This effectively limits job parallelism to about 50-way, orders |
| of magnitude smaller than the largest systems currently available. |
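
This effect can be made concrete with a simplified model, sketched
below, in which each node's daemon starts at a random point within its
100-second period and a one-second synchronization window counts as
delayed whenever any daemon's run overlaps it. Under these particular
assumptions the large majority of windows are delayed; the code is
purely illustrative and makes no claim about how real daemons are
scheduled.

\begin{verbatim}
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NODES  100
#define PERIOD 100.0  /* daemon: 1 s of every 100 s */

int main(void)
{
    double off[NODES];
    int delayed = 0, windows = (int)PERIOD;

    /* Each daemon starts at a random point in its */
    /* 100 s period.                               */
    for (int n = 0; n < NODES; n++)
        off[n] = (double)rand() / RAND_MAX * PERIOD;

    /* Count 1 s sync windows [w, w+1) overlapped  */
    /* by at least one daemon's 1 s run.           */
    for (int w = 0; w < windows; w++) {
        for (int n = 0; n < NODES; n++) {
            double d = fmod(off[n] - w + PERIOD,
                            PERIOD);
            if (d < 1.0 || d > PERIOD - 1.0) {
                delayed++;
                break;
            }
        }
    }
    printf("delayed: %d of %d windows\n",
           delayed, windows);
    return 0;
}
\end{verbatim}
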
| SLURM addresses this issue by: |
| \begin{itemize} |
| \item Making the {\tt slurmd} daemon resource requirements negligible |
| \item Supporting configurations that let the {\tt slurmd} daemon sleep |
| during the entire job execution period |
\item Performing any work that the {\tt slurmd} daemons must do in a
highly synchronized fashion across all nodes
| \end{itemize} |
| In addition, the default mode of operation is to allocate entire |
| nodes with all of their processors to applications rather than |
| individual processors on each node. |
| This eliminates the possibility of interference between jobs, |
| which could severely degrade performance of parallel applications. |
| Allocation of resources to the resolution of individual processors |
| on each node is supported by SLURM, but this comes at a higher cost |
| in terms of the data managed. |
| The selection of resource resolution is provided by different plugins. |
| |
| Resource management of large clusters entails the processing of |
| large quantities of data, both for the software and the |
| system administrator. |
| In order to ease the burden on system administrators, SLURM |
| configuration files and tools all support node naming using |
| numeric ranges. |
For example, ``linux[1-4096]'' represents 4096 node names with
a prefix of ``linux'' and numeric suffixes from 1 to 4096.
This naming convention permits even the largest clusters
to be described in a configuration file containing only a
couple of dozen lines.
| State information output from various SLURM commands uses |
| the same convention to maintain a modest volume of output |
on even the largest clusters.
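
The expansion of such a range expression can be sketched as follows.
The sketch handles only a single ``prefix[lo-hi]'' form, whereas
SLURM's actual hostlist syntax is considerably richer (lists of
ranges, zero padding, and so on); the code is illustrative only.

\begin{verbatim}
#include <stdio.h>

/* Expand "prefix[lo-hi]" into node names.         */
/* Single-range form only; SLURM's hostlist syntax */
/* is far richer.                                  */
static void expand(const char *expr)
{
    char prefix[64];
    int lo, hi;

    if (sscanf(expr, "%63[^[][%d-%d]",
               prefix, &lo, &hi) != 3) {
        printf("%s\n", expr);  /* plain node name */
        return;
    }
    for (int i = lo; i <= hi; i++)
        printf("%s%d\n", prefix, i);
}

int main(void)
{
    expand("linux[1-4096]");  /* linux1 ... linux4096 */
    return 0;
}
\end{verbatim}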
| |
| Extensive use is made of bitmaps to represent nodes in the cluster. |
| For example, bitmaps are maintained for each unique node configuration, |
| the nodes associated with each partition, nodes allocated to |
| the active jobs, nodes available for use, etc. This reduces most |
| scheduling operations to very rapid AND and OR operations on those bitmaps. |
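
A minimal sketch of the idea follows: nodes are mapped to bit
positions so that set intersection and union become word-wise AND and
OR operations. The type and helper names below are invented for
illustration and are not SLURM's.

\begin{verbatim}
#include <stdint.h>

#define MAX_NODES 16384
#define WORDS     (MAX_NODES / 64)

typedef struct { uint64_t w[WORDS]; } node_bitmap_t;

static void bit_set(node_bitmap_t *b, int node)
{
    b->w[node / 64] |= (uint64_t)1 << (node % 64);
}

static int bit_test(const node_bitmap_t *b, int node)
{
    return (b->w[node / 64] >> (node % 64)) & 1;
}

/* dst = a AND b, e.g. "partition nodes now idle" */
static void bit_and(node_bitmap_t *dst,
                    const node_bitmap_t *a,
                    const node_bitmap_t *b)
{
    for (int i = 0; i < WORDS; i++)
        dst->w[i] = a->w[i] & b->w[i];
}

/* dst = a OR b, e.g. "nodes busy with any job" */
static void bit_or(node_bitmap_t *dst,
                   const node_bitmap_t *a,
                   const node_bitmap_t *b)
{
    for (int i = 0; i < WORDS; i++)
        dst->w[i] = a->w[i] | b->w[i];
}
\end{verbatim}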
| |
| \section{Application Launch} |
| |
| To better illustrate SLURM's operation, the execution of an |
| application is detailed below and illustrated in Figure~\ref{launch}. |
| This example is based upon a typical configuration and the |
| {\em interactive} mode, in which stdout and |
| stderr are displayed on the user's terminal in real time, and stdin and |
| signals may be forwarded from the terminal transparently to the remote |
| tasks. |
| |
| \begin{figure}[tb] |
| \centerline{\includegraphics[width=3.32in]{../figures/launch.eps}} |
| \caption{\small SLURM Job Launch} |
| \label{launch} |
| \end{figure} |
| |
| The task launch request is initiated by a user's execution of the |
| {\tt srun} command. {\tt Srun} has a multitude of options to specify |
| resource requirements such as minimum memory per node, minimum |
| temporary disk space per node, features associated with nodes, |
| partition to use, node count, task count, etc. |
{\tt Srun} gets a credential identifying the user and group,
then sends the request to {\tt slurmctld} (message 1).
| |
| {\tt Slurmctld} authenticates the request and identifies the resources |
| to be allocated using a series of bitmap operations. |
| First the nodes containing the appropriate resources (processors, |
| memory, temporary disk space, and features) are identified through |
| comparison with a node configuration table, which typically has |
| a very small number of entries. |
| The resulting bitmap is ANDed with the bitmap associated with the |
| requested partition. |
| This bitmap is ANDed with the bitmap identifying available nodes. |
| The requested node and/or processor count is then satisfied from |
| the nodes identified with the resulting bitmap. |
| This completes the job allocation process, but for interactive |
| mode, a job step credential is also constructed for the allocation |
| and sent to {\tt srun} in the reply (message 2). |
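
Using the bitmap helpers sketched earlier, the allocation sequence
described above can be illustrated as follows. Again, this is not
SLURM's actual code; the function and parameter names are invented,
and the node-selection policy shown (lowest-numbered nodes first) is
merely one possibility.

\begin{verbatim}
#include <string.h>  /* memset */

/* Illustrative only; builds on the node_bitmap_t  */
/* helpers sketched in the Implementation section. */
int allocate_nodes(const node_bitmap_t *feasible,
                   const node_bitmap_t *partition,
                   const node_bitmap_t *avail,
                   int want, node_bitmap_t *alloc)
{
    node_bitmap_t cand;
    int found = 0;

    bit_and(&cand, feasible, partition);
    bit_and(&cand, &cand, avail);

    memset(alloc, 0, sizeof(*alloc));
    for (int n = 0;
         n < MAX_NODES && found < want; n++) {
        if (bit_test(&cand, n)) {  /* lowest first */
            bit_set(alloc, n);
            found++;
        }
    }
    return (found == want) ? 0 : -1;  /* -1: unmet */
}
\end{verbatim}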
| |
The {\tt srun} command opens sockets for task input and output, then
| sends the job step credential directly to the {\tt slurmd} daemons |
| (message 3) in order to launch the tasks, which is acknowledged |
| (message 4). |
Note that the {\tt slurmctld} and {\tt slurmd} daemons do not directly
| communicate during the task launch operation in order to minimize the |
| workload on the {\tt slurmctld}, which has to manage the entire |
| cluster. |
| |
| Task termination is communicated to {\tt srun} over the same |
| socket used for input and output. |
| When all tasks have terminated, {\tt srun} notifies {\tt slurmctld} |
| of the job step termination (message 5). |
| {\tt Slurmctld} authenticates the request, acknowledges it |
| (message 6) and sends messages to the {\tt slurmd} daemons to |
ensure that all processes associated with the job have
| terminated (message 7). |
| Upon receipt of job termination confirmation on each node (message 8), |
| {\tt slurmctld} releases the resources for use by another job. |
| |
| The full time for execution of a simple parallel application across |
| a few nodes is a few milliseconds. |
| The time reaches a few seconds for jobs that span thousands of |
| nodes. |
The times vary with the hardware and configuration used,
but are hardly noticeable to the user even at the largest scales.
| |
| \section{Conclusion} |
| |
| SLURM has demonstrated high reliability and scalability. |
| It provides resource management on several clusters containing |
| over 2,000 compute nodes and has emulated clusters containing |
up to 16,384 compute nodes, four times the size of the
| largest parallel computer in use today. |
| The cluster size is presently limited by the size of some SLURM |
| data structures, but operation on larger systems is certainly |
| feasible with minor code changes. |
| |
| \raggedright |
| % make the bibliography |
| \bibliographystyle{plain} |
| \bibliography{project} |
| |
| % make the back cover page |
| %\makeLLNLBackCover |
| \end{document} |