% Presenter info:
% http://www.linuxclustersinstitute.org/Linux-HPC-Revolution/presenterinfo.html
%
% Main Text Layout
% Set the main text in 10 point Times Roman or Times New Roman (normal),
% (no boldface), using single line spacing. All text should be in a single
% column and justified.
%
% Opening Style (First Page)
% This includes the title of the paper, the author names, organization and
% country, the abstract, and the first part of the paper.
% * Start the title 35mm down from the top margin in Times Roman font, 16
% point bold, range left. Capitalize only the first letter of the first
% word and proper nouns.
% * On a new line, type the authors' names, organizations, and country only
% (not the full postal address, although you may add the name of your
% department), in Times Roman, 11 point italic, range left.
% * Start the abstract with the heading two lines below the last line of the
% address. Set the abstract in Times Roman, 12 point bold.
% * Leave one line, then type the abstract in Times Roman 10 point, justified
% with single line spacing.
%
% Other Pages
% For the second and subsequent pages, use the full 190 x 115mm area and type
% in one column beginning at the upper right of each page, inserting tables
% and figures as required.
%
% We're recommending the Lecture Notes in Computer Science styles from
% Springer Verlag --- google on Springer Verlag LaTeX. These work nicely,
% *except* that it does not work with the hyperref package. Sigh.
%
% http://www.springer.de/comp/lncs/authors.html
%
% NOTE: This is an excerpt from the document in slurm/doc/pubdesign
\documentclass[10pt,onecolumn,times]{../common/llncs}
\usepackage{verbatim,moreverb}
\usepackage{float}
% Needed for LLNL cover page / TID bibliography style
\usepackage{calc}
\usepackage{epsfig}
\usepackage{graphics}
\usepackage{hhline}
\input{pstricks}
\input{pst-node}
\usepackage{chngpage}
\input{llnlCoverPage}
% Uncomment for DRAFT "watermark"
\usepackage{draftcopy}
% Times font as default roman
\renewcommand{\sfdefault}{phv}
\renewcommand{\rmdefault}{ptm}
%\renewcommand{\ttdefault}{pcr}
%\usepackage{times}
\renewcommand{\labelitemi}{$\bullet$}
\setlength{\textwidth}{115mm}
\setlength{\textheight}{190mm}
\setlength{\oddsidemargin}{(\paperwidth-\textwidth)/2 - 1in}
\setlength{\topmargin}{(\paperheight-\textheight-\headheight-\headsep-\footskip)/2 - 1in + .5in}
% couple of macros for the title page and document
\def\ctit{SLURM Resource Management for Blue Gene/L}
\def\ucrl{UCRL-JC-TBD}
\def\auth{Morris Jette \\ Danny Auble \\ Dan Phung}
\def\pubdate{October 18, 2004}
\def\journal{Conference TBD}
\begin{document}
% make the cover page
%\makeLLNLCover{\ucrl}{\ctit}{\auth}{\journal}{\pubdate}{0in}{0in}
% Title - 16pt bold
\vspace*{35mm}
\noindent\Large
\textbf{\ctit}
\vskip1\baselineskip
% Authors - 11pt
\noindent\large
{Morris Jette, Dan Phung and Danny Auble} \\
{\em Lawrence Livermore National Laboratory, USA}
\vskip2\baselineskip
% Abstract heading - 12pt bold
\noindent\large
\textbf{Abstract}
\vskip1\baselineskip
% Abstract itself - 10pt
\noindent\normalsize
The Blue Gene/L (BGL) system is a highly scalable computer developed
by IBM and deployed at Lawrence Livermore National Laboratory (LLNL).
The current system has over 131,000 processors interconnected by a
three-dimensional toroidal network with complex rules for managing
the network and allocating resources to jobs.
SLURM (Simple Linux Utility for Resource Management) was selected to
fulfill this resource management role.
SLURM is an open source, fault-tolerant, and highly scalable cluster
management and job scheduling system in widespread use on Linux clusters.
This paper presents an overview of the BGL resource management issues
and of the SLURM architecture.
It also describes how SLURM provides resource
management for BGL and presents preliminary performance results.
% define some additional macros for the body
\newcommand{\munged}{{\tt munged}}
\newcommand{\srun}{{\tt srun}}
\newcommand{\scancel}{{\tt scancel}}
\newcommand{\squeue}{{\tt squeue}}
\newcommand{\scontrol}{{\tt scontrol}}
\newcommand{\sinfo}{{\tt sinfo}}
\newcommand{\slurmctld}{{\tt slurmctld}}
\newcommand{\slurmd}{{\tt slurmd}}
\newcommand{\smap}{{\tt smap}}
\section{Overview}
The Blue Gene/L (BGL) system offers a unique cell-based design in which
the capacity can be expanded without introducing bottlenecks
\cite{BlueGeneWeb,BlueGeneL2002}.
The Blue Gene/L system delivered to LLNL consists of
131,072 processors and 33TB of memory \cite{BlueGene2002}.
The peak computational rate will exceed 360 TeraFLOPs.
Simple Linux Utility for Resource Management (SLURM)\footnote{A tip of
the hat to Matt Groening and the creators of {\em Futurama},
where Slurm is the most popular carbonated beverage in the universe.}
is a resource management system suitable for use on both small and
very large clusters.
SLURM was developed by Lawrence Livermore National Laboratory
(LLNL), Linux NetworX and HP.
It has been deployed on hundreds of Linux clusters world-wide and has
proven both highly reliable and highly scalable.
\section{Architecture of Blue Gene/L}
The basic building-blocks of BGL are c-nodes.
Each c-node consists
of two processors based upon the PowerPC 440, 512 MB of memory
and support for five separate networks on a single chip.
One of the processors may be used for computations and the
second used exclusively for communications.
Alternatively, both processors may be used for computations.
These c-nodes are subsequently grouped into base partitions, each consisting
of 512 c-nodes in an eight by eight by eight array with the same
network support.
The BGL system delivered to LLNL consists of 128 base
partitions organized in an eight by four by four array.
The minimal resource allocation unit for applications is one
base partition so that at most 128 simultaneous jobs may execute.
The c-nodes execute a custom micro-kernel.
System calls that cannot be processed directly by the c-node
micro-kernel are routed to one of the system's I/O nodes.
There are 1024 I/O nodes running the Linux operating system,
each of which services the requests from 64 c-nodes.
Three distinct communications networks are supported:
a three-dimensional torus with direct nearest-neighbor connections;
a global tree network for broadcast and reduction operations; and
a barrier network for synchronization.
The torus network connects each node to
its nearest neighbors in the X, Y and Z directions for a
total of six of these connections for each node.
Only parallel user applications execute on the c-nodes.
BGL has eight front-end nodes for other user tasks.
Users can log in to the front-end nodes to compile and
launch parallel applications. Front-end nodes can also
be used for pre- and post-processing of data files.
BGL system administrative functions are performed on a
computer known as the service node, which also maintains
a DB2 database used for many BGL management functions.
% TO DO: Mesh vs. Torus. Wiring rules.
% TO DO: Overhead of starting a new job (e.g. reboot nodes).
% NOTE: Be careful not to use non-public information (don't use
% information directly from the "IBM Confidential" documents).
\section{Architecture of SLURM}
Only a brief description of the SLURM architecture and implementation is provided
here.
A more thorough treatment of the SLURM design and implementation is
available from several sources \cite{SLURM2003,SlurmWeb}.
Several SLURM features make it well suited to serve as a resource manager
for Blue Gene/L.
\begin{itemize}
\item {\tt Scalability}:
The SLURM daemons are highly parallel with independent read and write
locks on the various data structures.
SLURM presently manages several Linux clusters with over 1000 nodes
and executes full-system parallel jobs on these systems in a few seconds.
\item {\tt Portability}:
SLURM is written in the C language, with a GNU {\em autoconf} configuration engine.
While initially written for Linux, other Unix-like operating systems including
AIX have proven to be easy porting targets.
SLURM also supports a general purpose ``plugin'' mechanism, which
permits a variety of different infrastructures to be easily supported.
The SLURM configuration file specifies which set of plugin modules
should be used.
For example, plugins are used for interfacing with different authentication
mechanisms and interconnects.
\item {\tt Fault Tolerance}: SLURM can handle a variety of failure
modes without terminating workloads, including crashes of the node
running the SLURM controller. User jobs may be configured to continue
execution despite the failure of one or more nodes on which they are
executing. The user command controlling a job, {\tt srun}, may detach
and reattach from the parallel tasks at any time.
\item {\tt System Administrator Friendly}: SLURM utilizes
a simple configuration file and minimizes distributed state.
Its configuration may be changed at any time without impacting running
jobs. SLURM interfaces are usable by scripts and its behavior is
highly deterministic.
\end{itemize}
As a cluster resource manager, SLURM has three key functions. First,
it allocates exclusive and/or non-exclusive access to resources (compute
nodes) to users for some duration of time so they can perform work.
Second, it provides a framework for starting, executing, and monitoring
work (normally a parallel job) on the set of allocated nodes. Finally,
it arbitrates conflicting requests for resources by managing a queue of
pending work.
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/arch.eps,scale=0.35}}
\caption{\small SLURM architecture}
\label{arch}
\end{figure}
As shown in Figure~\ref{arch}, SLURM consists of a \slurmd\ daemon
running on each compute node, a central \slurmctld\ daemon running
on a management node (with optional fail-over twin), and five command
line utilities: \srun, \scancel, \sinfo, \squeue, and \scontrol,
which can run anywhere in the cluster.
The central controller daemon, \slurmctld, maintains the global
state and directs operations.
\slurmctld\ monitors the state of nodes (through {\tt slurmd}),
groups nodes into partitions with various constraints,
manages a queue of pending work, and
allocates resources to pending jobs and job steps.
\slurmctld\ does not directly execute any user jobs, but
provides overall management of jobs and resources.
Compute nodes simply run a \slurmd\ daemon (similar to a remote
shell daemon) to export control to SLURM.
Each \slurmd\ monitors machine status,
performs remote job execution, manages the job's I/O, and otherwise
manages the jobs and job steps for its execution host.
Users interact with SLURM through four command line utilities:
\srun\ for submitting a job for execution and optionally controlling
it interactively,
\scancel\ for signalling or terminating a pending or running job,
\squeue\ for monitoring job queues, and
\sinfo\ for monitoring partition and overall system state.
System administrators perform privileged operations through an
additional command line utility, {\tt scontrol}.
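For illustration, a simple interactive use of these utilities might
resemble the following session (the option values and program name
are illustrative only):
{\small
\begin{verbatim}
# launch a two-node, four-task job, then examine the queue and nodes
srun -N2 -n4 hostname
squeue
sinfo
\end{verbatim}
}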
The entities managed by these SLURM daemons include {\em nodes}, the
compute resource in SLURM, {\em partitions}, which group nodes into
disjoint logical sets, {\em jobs}, or allocations of resources assigned
to a user for a specified amount of time, and {\em job steps}, which
are sets of (possibly parallel) tasks within a job. Each job in the
priority-ordered queue is allocated nodes within a single partition.
Once a job is assigned a set of nodes, the user is able to initiate
parallel work in the form of job steps in any configuration within
the job's allocation.
For instance, a single job step may be started that utilizes all nodes
allocated to the job, or several job steps may independently use a
portion of the allocation.
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/entities.eps,scale=0.5}}
\caption{\small SLURM entities: nodes, partitions, jobs, and job steps}
\label{entities}
\end{figure}
Figure~\ref{entities} further illustrates the interrelation of these
entities as they are managed by SLURM by showing a group of
compute nodes split into two partitions. Partition 1 is running one job,
with one job step utilizing the full allocation of that job. The job
in Partition 2 has only one job step using half of the original job
allocation. That job might initiate additional job steps to utilize
the remaining nodes of its allocation.
In order to simplify the use of different infrastructures,
SLURM uses a general purpose plugin mechanism. A SLURM plugin is a
dynamically linked code object that is loaded explicitly at run time
by the SLURM libraries. A plugin provides a customized implementation
of a well-defined API connected to tasks such as authentication
or job scheduling. A common set of functions is defined
for use by all of the different infrastructures of a particular variety.
For example, the authentication plugin must define functions such as
{\tt slurm\_auth\_create} to create a credential, {\tt slurm\_auth\_verify}
to verify a credential to approve or deny authentication, etc.
When a SLURM command or daemon is initiated, it
reads the configuration file to determine which of the available plugins
should be used. For example, {\em AuthType=auth/munge} selects the
plugin for munge-based authentication and {\em PluginDir=/usr/local/lib}
identifies the directory in which to find the plugins.
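As a concrete illustration, the plugin-related lines of a SLURM
configuration file might resemble the following excerpt (only the
two parameters discussed above are shown; the remainder of the
configuration is omitted):
{\small
\begin{verbatim}
# plugin selection excerpt (illustrative)
AuthType=auth/munge
PluginDir=/usr/local/lib
\end{verbatim}
}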
\section {Blue Gene/L Specific Resource Management Issues}
Since a BGL base partition is the minimum allocation unit for a job,
it was natural to consider each one as an independent SLURM node.
This meant SLURM would manage a very reasonable 128 nodes
rather than tens of thousands of individual c-nodes.
The \slurmd\ daemon was originally designed to execute on each
SLURM node to monitor the status of that node, launch job steps, etc.
Unfortunately, BGL prohibited the execution of SLURM daemons
on any of the c-nodes within the base partitions.
SLURM was compelled to execute \slurmd\ on one or more
front-end nodes.
In addition, the typical Unix mechanisms used to interact with a
compute host (e.g. getting memory size or processor count) do not
function normally with BGL base partitions.
This issue was addressed by adding a SLURM parameter to
indicate when it is running on a front-end node, in which case
there is assumed to be a single \slurmd\ for the entire system.
We anticipate changing this in the future to support multiple
\slurmd\ daemons on the front-end nodes.
SLURM was originally designed to address a one-dimensional topology,
and this impacted a variety of areas from naming conventions to
node selection.
SLURM provides resource management on several Linux clusters
exceeding 1000 nodes and it is impractical to display or otherwise
work with hundreds of individual node names.
SLURM addresses this by using a compressed hostlist range format to indicate
ranges of node names.
For example, "linux[0-1023]" was used to represent 1024 nodes
with names having a prefix of "linux" and a numeric suffix ranging
from "0" to "1023" (e.g. "linux0" through "linux1023").
The most reasonable way to name the BGL nodes seemed to be
using a three-digit suffix, but rather than indicating a monotonically
increasing number, each digit would represent the base partition's
location in the X, Y and Z dimensions (the value of X ranges
from 0 to 7, Y from 0 to 3, and Z from 0 to 3 on the LLNL system).
For example, "bgl012" would represent the base partition at
the position X=0, Y=1 and Z=2.
Since BGL resources naturally tend to be rectangular prisms in
shape, we modified the hostlist range format to indicate the two
extreme base partition locations.
The name prefix is always "bgl".
Within the brackets one lists the base partition with the smallest
X, Y and Z coordinates, followed by an "x", followed by the base
partition with the highest X, Y and Z coordinates.
For example, "bgl[200x311]" represents the following eight base
partitions: bgl200, bgl201, bgl210, bgl211, bgl300, bgl301, bgl310
and bgl311.
Note that this method cannot accommodate blocks of base
partitions that wrap around the torus boundaries particularly well,
although a hostlist range format of this sort is supported:
"bgl[000x011,700x711]".
The node selection functionality is another topology-aware
SLURM component.
Rather than embedding BGL-specific logic into a multitude of
locations, all of this logic was put into a single plugin.
The pre-existing node selection logic was put into a plugin
supporting typical Linux clusters with node names based
upon a one-dimensional array.
The BGL-specific plugin not only selects nodes for pending jobs
based upon the BGL topology, but also issues the BGL-specific APIs
to monitor the system health (draining nodes with any failure
mode) and to perform initialization and termination sequences for the job.
BGL's topology requirements necessitated the addition of several
\srun\ options: {\em --geometry} to specify the dimensions required by
the job,
{\em --no-rotate} to indicate whether the geometry specification may be rotated
in three dimensions,
{\em --comm-type} to indicate the communications type, either mesh or torus, and
{\em --node-use} to specify whether the second processor on a c-node should
be used to execute the user application or be used for communications.
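For example, a request of the following general form might ask for
eight base partitions arranged as a $2\times2\times2$ prism with torus
communications (the option syntax and values shown are illustrative only):
{\small
\begin{verbatim}
# request 8 base partitions in a 2x2x2 geometry with torus
# communications (illustrative syntax); myscript launches the
# parallel application as described below
srun -N8 --geometry=2x2x2 --no-rotate --comm-type=torus myscript
\end{verbatim}
}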
While \srun\ accepts these new options on all computer systems,
the node selection plugin logic is used to manage this data in an
opaque data type.
Since these new data types are unused on non-BGL systems, the
functions to manage them perform no work.
Other computers with other topology requirements will be able to
take advantage of this plugin infrastructure with minimal effort.
In order to provide users with a clear view of the BGL topology, a new
tool was developed.
\smap\ presents the same type of information as the \sinfo\ and \squeue\
commands, but graphically displays the location of the SLURM nodes
(BGL base partitions) assigned to jobs or partitions, as shown in
Table~\ref{smap_out}.
\begin{table}[t]
\begin{center}
\begin{tabular}[c]{c}
\\
\fbox{
\begin{minipage}[c]{1.0\linewidth}
{\scriptsize \verbatiminput{smap.output} }
\end{minipage}
}
\\
\end{tabular}
\caption{\label{smap_out} Output of the \smap\ command}
\end{center}
\end{table}
Rather than modifying SLURM to initiate and manage the parallel
tasks for BGL jobs, we decided to utilize existing software from IBM.
This eliminated a multitude of software integration issues.
SLURM will manage resources, select resources for the job,
set an environment variable BGL\_PARTITION\_ID, and spawn
a script.
The job will initiate its parallel tasks through the use of {\em mpirun}.
{\em mpirun} uses BGL-specific APIs to launch and manage the
tasks.
We disabled SLURM's job step support for normal users to
mitigate the possible impact of users inadvertently attempting
to initiate job steps through SLURM.
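A sketch of the sort of script SLURM might spawn is shown below; the
script contents and the {\em mpirun} invocation are illustrative only
and depend upon the local installation:
{\small
\begin{verbatim}
#!/bin/sh
# Illustrative job script spawned by SLURM after it has selected
# the resources and set BGL_PARTITION_ID for the allocation.
echo "Job running in BGL partition $BGL_PARTITION_ID"
# The parallel tasks are launched through IBM's mpirun, which uses
# BGL-specific APIs to start and manage them.
mpirun my_application
\end{verbatim}
}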
\section{Blue Gene/L Network Wiring Issues}
TBD
Static partitioning
\section{Results}
TBD
\section{Future Plans}
Dynamic partitioning
\raggedright
% make the bibliography
\bibliographystyle{splncs}
\bibliography{project}
% make the back cover page
%\makeLLNLBackCover
\end{document}