| \section{Introduction} |
Linux clusters, often constructed from commodity off-the-shelf (COTS) components,
have become increasingly popular as a computing platform for parallel computation
in recent years, mainly due to their high performance-to-cost ratio.
| Researchers have built and used small to medium size clusters for various |
| applications~\cite{BeowulfWeb,LokiWeb}. |
The continuing decrease in the price of COTS parts, in conjunction with
the good scalability of the cluster architecture, has now made it feasible to economically
build large-scale clusters with thousands of processors~\cite{MCRWeb,PCRWeb}.
| |
An essential component for harnessing such a computer is a
resource management system.
A resource management system (or resource manager) performs such crucial tasks as
scheduling user jobs, monitoring machine and job status, launching user applications, and
managing machine configuration.
| An ideal resource manager should be simple, efficient, scalable, fault-tolerant, |
| and portable. |
| |
Unfortunately, no open-source resource management system currently available
satisfies these requirements.
A survey~\cite{Jette02} has revealed that many existing resource managers have
poor scalability and fault-tolerance, rendering them unsuitable for large clusters
with thousands of processors~\cite{LoadLevelerWeb,LoadLevelerManual}.
| While some proprietary cluster managers are suitable for large clusters, |
| they are typically designed for particular computer systems and/or |
| interconnects~\cite{RMS,LoadLevelerWeb,LoadLevelerManual}. |
| Proprietary systems can also be expensive and unavailable in source-code form. |
| Furthermore, proprietary cluster management functionality is usually provided as a |
| part of a specific job scheduling system package. |
This mandates the use of the given scheduler merely to manage a cluster,
even when that scheduler does not meet the needs of the organization hosting the cluster.
A clear separation of cluster management functionality from scheduling policy is therefore desirable.
| |
| This observation led us to set out to design a simple, highly scalable, and |
| portable resource management system. |
The result of this effort is the Simple Linux Utility for Resource Management
(SLURM\footnote{A tip of the hat to Matt Groening and the creators of {\em Futurama},
where Slurm is the most popular carbonated beverage in the universe.}).
| SLURM was developed with the following design goals: |
| |
| \begin{itemize} |
| \item {\em Simplicity}: SLURM is simple enough to allow motivated end-users |
| to understand its source code and add functionality. The authors will |
| avoid the temptation to add features unless they are of general appeal. |
| |
| \item {\em Open Source}: SLURM is available to everyone and will remain free. |
| Its source code is distributed under the GNU General Public |
| License~\cite{GPLWeb}. |
| |
| \item {\em Portability}: SLURM is written in the C language, with a GNU |
| {\em autoconf} configuration engine. |
| While initially written for Linux, other UNIX-like operating systems |
| should be easy porting targets. |
SLURM also supports a general-purpose {\em plugin} mechanism, which
permits a variety of different infrastructures to be easily supported.
The SLURM configuration file specifies which set of plugin modules
should be used (see the configuration excerpt following this list).
| |
| \item {\em Interconnect independence}: SLURM supports UDP/IP based |
| communication as well as the Quadrics Elan3 and Myrinet interconnects. |
| Adding support for other interconnects is straightforward and utilizes |
| the plugin mechanism described above. |
| |
\item {\em Scalability}: SLURM is designed for scalability to clusters of
thousands of nodes.
Jobs may specify their resource requirements in a variety of ways,
including options and ranges (e.g., a minimum and maximum node count),
potentially permitting faster initiation than would otherwise be
possible; an example follows this list.
| |
| \item {\em Robustness}: SLURM can handle a variety of failure modes |
| without terminating workloads, including crashes of the node running |
| the SLURM controller. |
| User jobs may be configured to continue execution despite the failure |
| of one or more nodes on which they are executing. |
Nodes allocated to a job are available for reuse as soon as the job(s)
allocated to them terminate.
If some nodes fail to complete job termination
in a timely fashion due to hardware or software problems, only the
scheduling of those tardy nodes will be affected.
| |
\item {\em Security}: SLURM employs cryptographic techniques to authenticate
users to services and services to each other, with a variety of options
available through the plugin mechanism.
| SLURM does not assume that its networks are physically secure, |
| but does assume that the entire cluster is within a single |
| administrative domain with a common user base across the |
| entire cluster. |
| |
\item {\em System administrator friendly}: SLURM is configured via a
simple configuration file and minimizes distributed state.
| Its configuration may be changed at any time without impacting running jobs. |
| Heterogeneous nodes within a cluster may be easily managed. |
| SLURM interfaces are usable by scripts and its behavior is highly |
| deterministic. |
| |
| \end{itemize} |
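To illustrate the plugin selection mentioned under {\em Portability},
the excerpt below sketches what a SLURM configuration file might contain.
The parameter names and values are illustrative only and do not
constitute a complete or authoritative configuration.

\begin{verbatim}
# Illustrative slurm.conf excerpt (parameter names are examples)
ControlMachine=mcr0          # node running the SLURM controller
BackupController=mcr1        # fail-over controller node
AuthType=auth/munge          # authentication plugin
SwitchType=switch/elan       # Quadrics Elan3 interconnect plugin
SchedulerType=sched/builtin  # simple FIFO scheduler plugin
\end{verbatim}

Because each subsystem is selected by name in this one file, an
administrator can substitute a different implementation (for example,
a different authentication or interconnect plugin) without modifying
SLURM itself.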
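As an example of the resource ranges mentioned under {\em Scalability},
a user might request between 16 and 32 nodes for a 64-task job using
SLURM's {\tt srun} utility ({\tt my\_app} is a placeholder application):

\begin{verbatim}
# Request 64 tasks on any 16 to 32 nodes; the range gives the
# controller freedom to start the job on whatever suitable set
# of nodes becomes available first.
srun -N 16-32 -n 64 ./my_app
\end{verbatim}

Accepting a range rather than an exact node count gives SLURM more
placement freedom, which is the source of the potentially faster job
initiation noted above.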
| |
The main contribution of our work is a readily available
tool that anybody can use to efficiently manage clusters of various sizes and architectures.
SLURM is highly scalable\footnote{SLURM was observed to launch a 1900-task job across 950 nodes in less than five seconds on a recently installed cluster at Lawrence Livermore National Laboratory.}.
Thanks to its plugin capability, SLURM can be ported to other cluster systems with
minimal effort, and its well-defined interfaces allow it to be used with any
meta-batch scheduler or Grid resource broker~\cite{Gridbook}.
| |
| The rest of the paper is organized as follows. |
Section 2 describes the architecture of SLURM in detail.
Section 3 discusses the services provided by SLURM, followed by a performance study of
SLURM in Section 4. A brief survey of existing cluster management systems is presented in Section 5.
| %Section 6 describes how the SLURM can be used with more sphisticated external schedulers. |
Concluding remarks and future development plans for SLURM are given in Section 6.