% Presenter info:
% http://www.linuxclustersinstitute.org/Linux-HPC-Revolution/presenterinfo.html
%
% Main Text Layout
% Set the main text in 10 point Times Roman or Times New Roman (normal),
% (no boldface), using single line spacing. All text should be in a single
% column and justified.
%
% Opening Style (First Page)
% This includes the title of the paper, the author names, organization and
% country, the abstract, and the first part of the paper.
% * Start the title 35mm down from the top margin in Times Roman font, 16
% point bold, range left. Capitalize only the first letter of the first
% word and proper nouns.
% * On a new line, type the authors' names, organizations, and country only
% (not the full postal address, although you may add the name of your
% department), in Times Roman, 11 point italic, range left.
% * Start the abstract with the heading two lines below the last line of the
% address. Set the abstract in Times Roman, 12 point bold.
% * Leave one line, then type the abstract in Times Roman 10 point, justified
% with single line spacing.
%
% Other Pages
% For the second and subsequent pages, use the full 190 x 115mm area and type
% in one column beginning at the upper right of each page, inserting tables
% and figures as required.
%
% We're recommending the Lecture Notes in Computer Science styles from
% Springer Verlag --- google on Springer Verlag LaTeX. These work nicely,
% *except* that it does not work with the hyperref package. Sigh.
%
% http://www.springer.de/comp/lncs/authors.html
%
% NOTE: This is an excerpt from the document in slurm/doc/pubdesign
\documentclass[10pt,onecolumn,times]{../common/llncs}
\usepackage{verbatim,moreverb}
\usepackage{float}
% Needed for LLNL cover page / TID bibliography style
\usepackage{calc}
\usepackage{epsfig}
\usepackage{graphics}
\usepackage{hhline}
\input{pstricks}
\input{pst-node}
\usepackage{chngpage}
\input{llnlCoverPage}
% Uncomment for DRAFT "watermark"
\usepackage{draftcopy}
% Times font as default roman
\renewcommand{\sfdefault}{phv}
\renewcommand{\rmdefault}{ptm}
%\renewcommand{\ttdefault}{pcr}
%\usepackage{times}
\renewcommand{\labelitemi}{$\bullet$}
\setlength{\textwidth}{115mm}
\setlength{\textheight}{190mm}
\setlength{\oddsidemargin}{(\paperwidth-\textwidth)/2 - 1in}
\setlength{\topmargin}{(\paperheight-\textheight-\headheight-\headsep-\footskip)/2 - 1in + .5in}
% couple of macros for the title page and document
\def\ctit{SLURM Resource Management for Blue Gene/L}
\def\ucrl{UCRL-JC-TBD}
\def\auth{Morris Jette \\ Danny Auble \\ Dan Phung}
\def\pubdate{October 18, 2004}
\def\journal{Conference TBD}
\begin{document}
% make the cover page
%\makeLLNLCover{\ucrl}{\ctit}{\auth}{\journal}{\pubdate}{0in}{0in}
% Title - 16pt bold
\vspace*{35mm}
\noindent\Large
\textbf{\ctit}
\vskip1\baselineskip
% Authors - 11pt
\noindent\large
{Morris Jette, Dan Phung and Danny Auble} \\
{\em Lawrence Livermore National Laboratory, USA}
\vskip2\baselineskip
% Abstract heading - 12pt bold
\noindent\large
\textbf{Abstract}
\vskip1\baselineskip
% Abstract itself - 10pt
\noindent\normalsize
The Blue Gene/L (BGL) system is a highly scalable computer developed
by IBM and deployed at Lawrence Livermore National Laboratory (LLNL).
The current system has over 131,000 processors interconnected by a
three-dimensional toroidal network with complex rules for managing
the network and allocating resources to jobs.
SLURM (Simple Linux Utility for Resource Management) was selected to
fulfill this resource management role.
SLURM is an open source, fault-tolerant, and highly scalable cluster
management and job scheduling system in widespread use on Linux clusters.
This paper presents an overview of the BGL resource management issues
and of the SLURM architecture.
It also describes how SLURM provides resource
management for BGL and presents preliminary performance results.
% define some additional macros for the body
\newcommand{\munged}{{\tt munged}}
\newcommand{\srun}{{\tt srun}}
\newcommand{\scancel}{{\tt scancel}}
\newcommand{\squeue}{{\tt squeue}}
\newcommand{\scontrol}{{\tt scontrol}}
\newcommand{\sinfo}{{\tt sinfo}}
\newcommand{\slurmctld}{{\tt slurmctld}}
\newcommand{\slurmd}{{\tt slurmd}}
\newcommand{\smap}{{\tt smap}}
\section{Overview}
The Blue Gene/L (BGL) system offers a unique cell-based design in which
the capacity can be expanded without introducing bottlenecks
\cite{BlueGeneWeb,BlueGeneL2002}.
The Blue Gene/L system delivered to LLNL consists of
131,072 processors and 33TB of memory \cite{BlueGene2002}.
The peak computational rate will exceed 360 TeraFLOPs.
Simple Linux Utility for Resource Management (SLURM)\footnote{A tip of
the hat to Matt Groening and the creators of {\em Futurama},
where Slurm is the most popular carbonated beverage in the universe.}
is a resource management system suitable for use on both small and
very large clusters.
SLURM was developed by Lawrence Livermore National Laboratory
(LLNL), Linux NetworX and HP.
It has been deployed on hundreds of Linux clusters world-wide and has
proven both highly reliable and highly scalable.
\section{Architecture of Blue Gene/L}
The basic building-blocks of BGL are c-nodes.
Each c-node consists
of two processors based upon the PowerPC 440, 512 MB of memory
and support for five separate networks on a single chip.
One of the processors may be used for computations and the
second used exclusively for communications.
Alternatively, both processors may be used for computations.
These c-nodes are subsequently grouped into base partitions, each consisting
of 512 c-nodes in an eight by eight by eight array with the same
network support.
The BGL system delivered to LLNL consists of 128 base
partitions organized in an eight by four by four array.
The minimal resource allocation unit for applications is one
base partition so that at most 128 simultaneous jobs may execute.
The c-nodes execute a custom micro-kernel.
System calls that cannot be processed directly by the c-node
micro-kernel are routed to one of the system's I/O nodes.
There are 1024 I/O nodes running the Linux operating system,
each of which services the requests from 64 c-nodes.
Three distinct communications networks are supported:
a three-dimensional torus with direct nearest-neighbor connections;
a global tree network for broadcast and reduction operations; and
a barrier network for synchronization.
The torus network connects each node to
its nearest neighbors in the X, Y and Z directions for a
total of six of these connections for each node.
Only parallel user applications execute on the c-nodes.
BGL has eight front-end nodes for other user tasks.
Users can log in to the front-end nodes to compile and
launch parallel applications. Front-end nodes can also
be used for pre- and post-processing of data files.
BGL system administrative functions are performed on a
computer known as the service node, which also maintains
a DB2 database used for many BGL management functions.
% TO DO: Mesh vs. Torus. Wiring rules.
% TO DO: Overhead of starting a new job (e.g. reboot nodes).
% NOTE: Be careful not to use non-public information (don't use
% information directly from the "IBM Confidential" documents).
\section{Architecture of SLURM}
Only a brief description of the SLURM architecture and implementation is provided
here.
A more thorough treatment of the SLURM design and implementation is
available from several sources \cite{SLURM2003,SlurmWeb}.
Several SLURM features make it well suited to serve as a resource manager
for Blue Gene/L.
\begin{itemize}
\item {\tt Scalability}:
The SLURM daemons are highly parallel with independent read and write
locks on the various data structures.
SLURM presently manages several Linux clusters with over 1000 nodes
and executes full-system parallel jobs on these systems in a few seconds.
\item {\tt Portability}:
SLURM is written in the C language, with a GNU {\em autoconf} configuration engine.
While initially written for Linux, other Unix-like operating systems including
AIX have proven to be easy porting targets.
SLURM also supports a general purpose ``plugin'' mechanism, which
permits a variety of different infrastructures to be easily supported.
The SLURM configuration file specifies which set of plugin modules
should be used.
For example, plugins are used for interfacing with different authentication
mechanisms and interconnects.
\item {\tt Fault Tolerance}: SLURM can handle a variety of failure
modes without terminating workloads, including crashes of the node
running the SLURM controller. User jobs may be configured to continue
execution despite the failure of one or more nodes on which they are
executing. The user command controlling a job, {\tt srun}, may detach
and reattach from the parallel tasks at any time.
\item {\tt System Administrator Friendly}: SLURM utilizes
a simple configuration file and minimizes distributed state.
Its configuration may be changed at any time without impacting running
jobs. SLURM interfaces are usable by scripts and its behavior is
highly deterministic.
\end{itemize}
As a cluster resource manager, SLURM has three key functions. First,
it allocates exclusive and/or non-exclusive access to resources (compute
nodes) to users for some duration of time so they can perform work.
Second, it provides a framework for starting, executing, and monitoring
work (normally a parallel job) on the set of allocated nodes. Finally,
it arbitrates conflicting requests for resources by managing a queue of
pending work.
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/arch.eps,scale=0.35}}
\caption{\small SLURM architecture}
\label{arch}
\end{figure}
As shown in Figure~\ref{arch}, SLURM consists of a \slurmd\ daemon
running on each compute node, a central \slurmctld\ daemon running
on a management node (with optional fail-over twin), and five command
line utilities: \srun, \scancel, \sinfo, \squeue, and \scontrol,
which can run anywhere in the cluster.
The central controller daemon, \slurmctld, maintains the global
state and directs operations.
\slurmctld\ monitors the state of nodes (through {\tt slurmd}),
groups nodes into partitions with various constraints,
manages a queue of pending work, and
allocates resources to pending jobs and job steps.
\slurmctld\ does not directly execute any user jobs, but
provides overall management of jobs and resources.
Compute nodes simply run a \slurmd\ daemon (similar to a remote
shell daemon) to export control to SLURM.
Each \slurmd\ monitors machine status,
performs remote job execution, manages the job's I/O, and otherwise
manages the jobs and job steps for its execution host.
Users interact with SLURM through four command line utilities:
\srun\ for submitting a job for execution and optionally controlling
it interactively,
\scancel\ for signalling or terminating a pending or running job,
\squeue\ for monitoring job queues, and
\sinfo\ for monitoring partition and overall system state.
System administrators perform privileged operations through an
additional command line utility, {\tt scontrol}.
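For illustration, a simple interactive use of these utilities might
resemble the following session (the option values and program name
are illustrative only):
{\small
\begin{verbatim}
# launch a two-node, four-task job, then examine the queue and nodes
srun -N2 -n4 hostname
squeue
sinfo
\end{verbatim}
}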
The entities managed by these SLURM daemons include {\em nodes}, the
compute resource in SLURM, {\em partitions}, which group nodes into
disjoint logical sets, {\em jobs}, or allocations of resources assigned
to a user for a specified amount of time, and {\em job steps}, which
are sets of (possibly parallel) tasks within a job. Each job in the
priority-ordered queue is allocated nodes within a single partition.
Once a job is assigned a set of nodes, the user is able to initiate
parallel work in the form of job steps in any configuration within
the job's allocation.
For instance, a single job step may be started that utilizes all nodes
allocated to the job, or several job steps may independently use a
portion of the allocation.
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/entities.eps,scale=0.5}}
\caption{\small SLURM entities: nodes, partitions, jobs, and job steps}
\label{entities}
\end{figure}
Figure~\ref{entities} further illustrates the interrelation of these
entities as they are managed by SLURM by showing a group of
compute nodes split into two partitions. Partition 1 is running one job,
with one job step utilizing the full allocation of that job. The job
in Partition 2 has only one job step using half of the original job
allocation. That job might initiate additional job steps to utilize
the remaining nodes of its allocation.
In order to simplify the use of different infrastructures,
SLURM uses a general purpose plugin mechanism. A SLURM plugin is a
dynamically linked code object that is loaded explicitly at run time
by the SLURM libraries. A plugin provides a customized implementation
of a well-defined API connected to tasks such as authentication
or job scheduling. A common set of functions is defined
for use by all of the different infrastructures of a particular variety.
For example, the authentication plugin must define functions such as
{\tt slurm\_auth\_create} to create a credential, {\tt slurm\_auth\_verify}
to verify a credential to approve or deny authentication, etc.
When a SLURM command or daemon is initiated, it
reads the configuration file to determine which of the available plugins
should be used. For example, {\em AuthType=auth/munge} selects the
plugin for munge-based authentication and {\em PluginDir=/usr/local/lib}
identifies the directory in which to find the plugins.
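As a concrete illustration, the plugin-related lines of a SLURM
configuration file might resemble the following excerpt (only the
two parameters discussed above are shown; the remainder of the
configuration is omitted):
{\small
\begin{verbatim}
# plugin selection excerpt (illustrative)
AuthType=auth/munge
PluginDir=/usr/local/lib
\end{verbatim}
}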
\section {Blue Gene/L Specific Resource Management Issues}
Since a BGL base partition is the minimum allocation unit for a job,
it was natural to consider each one as an independent SLURM node.
This meant SLURM would manage a very reasonable 128 nodes
rather than tens of thousands of individual c-nodes.
The \slurmd\ daemon was originally designed to execute on each
SLURM node to monitor the status of that node, launch job steps, etc.
Unfortunately, BGL prohibited the execution of SLURM daemons
on any of the c-nodes within the base partitions.
SLURM was compelled to execute \slurmd\ on one or more
front-end nodes.
In addition, the typical Unix mechanisms used to interact with a
compute host (e.g. getting memory size or processor count) do not
function normally with BGL base partitions.
This issue was addressed by adding a SLURM parameter to
indicate when it is running on a front-end node, in which case
there is assumed to be a single \slurmd\ for the entire system.
We anticipate changing this in the future to support multiple
\slurmd\ daemons on the front-end nodes.
SLURM was originally designed to address a one-dimensional topology,
and this impacted a variety of areas from naming conventions to
node selection.
SLURM provides resource management on several Linux clusters
exceeding 1000 nodes and it is impractical to display or otherwise
work with hundreds of individual node names.
SLURM addresses this by using a compressed hostlist range format to indicate
ranges of node names.
For example, "linux[0-1023]" was used to represent 1024 nodes
with names having a prefix of "linux" and a numeric suffix ranging
from "0" to "1023" (e.g. "linux0" through "linux1023").
The most reasonable way to name the BGL nodes seemed to be
using a three-digit suffix, but rather than indicating a monotonically
increasing number, each digit would represent the base partition's
location in the X, Y and Z dimensions (the value of X ranges
from 0 to 7, Y from 0 to 3, and Z from 0 to 3 on the LLNL system).
For example, "bgl012" would represent the base partition at
the position X=0, Y=1 and Z=2.
Since BGL resources naturally tend to be rectangular prisms in
shape, we modified the hostlist range format to indicate the two
extreme base partition locations.
The name prefix is always "bgl".
Within the brackets one lists the base partition with the smallest
X, Y and Z coordinates, followed by an "x", followed by the base
partition with the highest X, Y and Z coordinates.
For example, "bgl[200x311]" represents the following eight base
partitions: bgl200, bgl201, bgl210, bgl211, bgl300, bgl301, bgl310
and bgl311.
Note that this method cannot accommodate blocks of base
partitions that wrap around the torus boundaries particularly well,
although a hostlist range format of this sort is supported:
"bgl[000x011,700x711]".
The node selection functionality is another topology-aware
SLURM component.
Rather than embedding BGL-specific logic into a multitude of
locations, all of this logic was put into a single plugin.
The pre-existing node selection logic was put into a plugin
supporting typical Linux clusters with node names based
upon a one-dimensional array.
The BGL-specific plugin not only selects nodes for pending jobs
based upon the BGL topology, but also issues the BGL-specific APIs
to monitor the system health (draining nodes with any failure
mode) and to perform initialization and termination sequences for the job.
BGL's topology requirements necessitated the addition of several
\srun\ options: {\em --geometry} to specify the dimensions required by
the job,
{\em --no-rotate} to indicate whether the geometry specification may be rotated
in three dimensions,
{\em --comm-type} to indicate the communications type, either mesh or torus, and
{\em --node-use} to specify whether the second processor on a c-node should
be used to execute the user application or be used for communications.
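For example, a request of the following general form might ask for
eight base partitions arranged as a $2\times2\times2$ prism with torus
communications (the option syntax and values shown are illustrative only):
{\small
\begin{verbatim}
# request 8 base partitions in a 2x2x2 geometry with torus
# communications (illustrative syntax); myscript launches the
# parallel application as described below
srun -N8 --geometry=2x2x2 --no-rotate --comm-type=torus myscript
\end{verbatim}
}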
While \srun\ accepts these new options on all computer systems,
the node selection plugin logic is used to manage this data in an
opaque data type.
Since these new data types are unused on non-BGL systems, the
functions to manage them perform no work.
Other computers with other topology requirements will be able to
take advantage of this plugin infrastructure with minimal effort.
In order to provide users with a clear view of the BGL topology, a new
tool was developed.
\smap\ presents the same type of information as the \sinfo\ and \squeue\
commands, but graphically displays the location of the SLURM nodes
(BGL base partitions) assigned to jobs or partitions, as shown in
Table~\ref{smap_out}.
\begin{table}[t]
\begin{center}
\begin{tabular}[c]{c}
\\
\fbox{
\begin{minipage}[c]{1.0\linewidth}
{\scriptsize \verbatiminput{smap.output} }
\end{minipage}
}
\\
\end{tabular}
\caption{\label{smap_out} Output of the \smap\ command}
\end{center}
\end{table}
Rather than modifying SLURM to initiate and manage the parallel
tasks for BGL jobs, we decided to utilize existing software from IBM.
This eliminated a multitude of software integration issues.
SLURM will manage resources, select resources for the job,
set an environment variable BGL\_PARTITION\_ID, and spawn
a script.
The job will initiate its parallel tasks through the use of {\em mpirun}.
{\em mpirun} uses BGL-specific APIs to launch and manage the
tasks.
We disabled SLURM's job step support for normal users to
mitigate the possible impact of users inadvertently attempting
to initiate job steps through SLURM.
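A sketch of the sort of script SLURM might spawn is shown below; the
script contents and the {\em mpirun} invocation are illustrative only
and depend upon the local installation:
{\small
\begin{verbatim}
#!/bin/sh
# Illustrative job script spawned by SLURM after it has selected
# the resources and set BGL_PARTITION_ID for the allocation.
echo "Job running in BGL partition $BGL_PARTITION_ID"
# The parallel tasks are launched through IBM's mpirun, which uses
# BGL-specific APIs to start and manage them.
mpirun my_application
\end{verbatim}
}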
\section{Blue Gene/L Network Wiring Issues}
TBD
Static partitioning
\section{Results}
TBD
\section{Future Plans}
Dynamic partitioning
\raggedright
% make the bibliography
\bibliographystyle{splncs}
\bibliography{project}
% make the back cover page
%\makeLLNLBackCover
\end{document}