% Presenter info:
% SIGOPS
%
% Use a two-column, 8.5" x 11" document format
% Use a 0.75" margin everywhere
% All fonts MUST be Type 1 or 3 PostScript fonts from the Latin 1 Fontset.
% Do not use TrueType, bitmapped, or Ryumin fonts.
% Use 10-point font size.
%
\documentclass{../common/acm}
\usepackage{verbatim,moreverb}
\usepackage{float}
\usepackage{graphicx}
% Needed for LLNL cover page / TID bibliography style
\usepackage{calc}
\usepackage{epsfig}
\usepackage{graphics}
\usepackage{hhline}
\input{pstricks}
\input{pst-node}
\usepackage{chngpage}
\input{llnlCoverPage}
% Uncomment for DRAFT "watermark"
%\usepackage{draftcopy}
% Times font as default roman
\renewcommand{\sfdefault}{phv}
\renewcommand{\rmdefault}{ptm}
%\renewcommand{\ttdefault}{pcr}
%\usepackage{times}
\renewcommand{\labelitemi}{$\bullet$}
\setlength{\oddsidemargin}{-0.25in}
\setlength{\evensidemargin}{-0.25in}
\setlength{\topmargin}{-0.25in} % was{-0.75in}
\setlength{\headsep}{0.5in}
\setlength{\footskip}{24.0pt}
\setlength{\textheight}{9.5in}
\setlength{\columnwidth}{3.32in}
\setlength{\textwidth}{7.0in}
\setlength{\columnsep}{0.30in}
\setlength{\columnseprule}{0.0pt}
% couple of macros for the title page and document
\def\ctit{Scalable Resource Management with SLURM}
\def\ucrl{UCRL-JC-TBD}
\def\auth{Morris Jette}
\def\pubdate{January 23, 2006}
\def\journal{Operating Systems Reviews}
\begin{document}
% make the cover page
%\makeLLNLCover{\ucrl}{\ctit}{\auth}{\journal}{\pubdate}{0in}{0in}
\title{\ctit}
\numberofauthors{1}
\author{
\alignauthor Morris Jette\\
\affaddr{Lawrence Livermore National Laboratory}\\
\affaddr{Livermore, CA, USA}\\
\email{jette1@llnl.gov}
}
\maketitle
\begin{abstract}
While the increase in performance of individual computer
processors through the years has been impressive, the demand
for compute resources has grown even more rapidly.
This demand has been met over the past couple of decades by
turning to parallel computing.
While operating systems, compilers, libraries, and a wide
variety of tools exist to make effective use of independent
computers, a new class of tools is required to effectively
utilize a parallel computer.
One important tool is the resource manager, which can be viewed
as the ``glue'' to run applications on the parallel computer.
One very popular resource manager is the Simple Linux
Utility for Resource Management (SLURM,
http://www.llnl.gov/linux/slurm).
SLURM performs resource management on about 1000 parallel
computers around the world, including many of the largest
systems.
This paper will describe resource management scalability
issues and SLURM's implementation.
\end{abstract}
\section{Introduction}
Executing a parallel program on a cluster involves several
steps. First, resources (cores, memory, nodes, etc.) suitable
for running the program must be identified. These resources
are then typically dedicated to the program. In some
cases, the computer's network/interconnect must be
configured for the program's use. The program's tasks
are initiated on the allocated resources, and the input and
output of all of these tasks must be processed.
Upon the program's termination, its allocated resources
are released for use by other programs.
All of these operations are straightforward, but performing
resource management on clusters containing thousands of
nodes and over 130,000 processor cores requires more
than a high degree of parallelism.
In many respects, data management and fault-tolerance issues
are paramount.
SLURM is a resource manager jointly developed by Lawrence
Livermore National Laboratory (LLNL),
Hewlett-Packard, and Linux NetworX~\cite{SLURM2003,Yoo2003,SlurmWeb}.
SLURM's general characteristics include:
\begin{itemize}
\item {\tt Simplicity}: SLURM is simple enough to allow motivated
end users to understand its source code and add functionality.
It supports only a few simple scheduling algorithms,
but relies upon an external scheduler for sophisticated
workload prioritization.
\item {\tt Open Source}: SLURM is available to everyone and
will remain free.
Its source code is distributed under the GNU General Public
License~\cite{GPL2002}.
\item {\tt Portability}: SLURM is written in the C language,
with a GNU {\em autoconf} configuration engine.
SLURM provides a fully functional core, with a
wide assortment of plugins available for customization.
A total of 30 different plugins are available, providing a
great deal of flexibility in configuring SLURM using a
building-block approach.
\item {\tt Scalability}: SLURM is designed for scalability to clusters
of thousands of nodes. Time to fully execute (allocate, launch
tasks, process I/O, and deallocate resources) a simple program
is only 2 seconds for 4,800 tasks on 2,400 nodes. Clusters
containing up to 16,384 nodes have been emulated with highly
scalable performance.
\item {\tt Fault Tolerance}: SLURM can handle a variety of failures
in hardware or the infrastructure without inducing failures in
the workload.
\item {\tt Security}: SLURM employs crypto technology to authenticate
users to services and services to each other with a variety of options
available through the plugin mechanism.
\item {\tt System Administrator Friendly}: SLURM utilizes
a simple configuration file and minimizes distributed state.
Its configuration may be changed at any time without impacting running
jobs. Heterogeneous nodes within a cluster may be easily managed. SLURM
interfaces are usable by scripts and its behavior is highly deterministic.
\end{itemize}
\section{Architecture}
\begin{figure}[tb]
\centerline{\includegraphics[width=3.32in]{../figures/arch2.eps}}
\caption{\small SLURM architecture}
\label{arch}
\end{figure}
SLURM's commands and daemons are illustrated in Figure~\ref{arch}.
The main SLURM control program, {\tt slurmctld}, orchestrates
activities throughout the cluster. While highly optimized,
{\tt slurmctld} is best run on a dedicated node of the cluster for optimal performance.
In addition, SLURM provides the option of running a backup controller
on another node for increased fault-tolerance.
Each node in the cluster available for running user applications
has a relatively small daemon called {\tt slurmd} that monitors
and manages resources and work within that node.
Several user tools are provided:
\begin{itemize}
\item {\tt scontrol} is a system administrator tool to monitor and change
system configuration
\item {\tt sinfo} reports status and configuration of the nodes and queues
\item {\tt squeue} reports status of jobs
\item {\tt scancel} can signal and/or cancel jobs or job steps
\item {\tt smap} reports node, queue, and job status including
topological state information
\item {\tt sacct} reports job accounting information
\item {\tt sbcast} copies a file to local file systems on a job's allocated nodes
\item {\tt srun} is used to allocate resources and spawn the job's tasks.
\end{itemize}
\begin{figure}[tcb]
\centerline{\includegraphics[width=3.32in]{../figures/entities2.eps}}
\caption{\small SLURM entities: nodes, partitions, jobs, and job steps}
\label{entities}
\end{figure}
The entities managed by SLURM are illustrated in Figure~\ref{entities}
and include:
{\em nodes}, the compute resources with their processors, memory, and temporary disk space;
{\em partitions}, collections of nodes with various limits and constraints;
{\em jobs}, allocations of resources assigned
to a user for a specified amount of time; and
{\em job steps}, sets of (possibly parallel) tasks within a job.
Each node must be capable of independent scheduling and job execution%
\footnote{On BlueGene computers, the c-nodes cannot be independently
scheduled. Each midplane or base partition is considered a SLURM node
with 1,024 processors. SLURM supports the execution of more than one
job per BlueGene node.}.
Each job in the priority-ordered queue is allocated nodes within a single
partition.
Since nodes can be in multiple partitions, one can think of
partitions as general-purpose queues for jobs.
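These relationships can be sketched with a few hypothetical Python data
structures. The class and field names below are illustrative only, not
SLURM's actual internal representation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    processors: int
    memory_mb: int
    tmp_disk_mb: int

@dataclass
class Partition:
    name: str
    nodes: List[Node]      # a node may appear in several partitions
    max_time_min: int      # an example limit/constraint

@dataclass
class JobStep:
    step_id: int
    tasks: int             # possibly parallel tasks within the job

@dataclass
class Job:
    job_id: int
    user: str
    time_limit_min: int
    allocated: List[Node] = field(default_factory=list)
    steps: List[JobStep] = field(default_factory=list)

# The same node may belong to more than one partition.
n1 = Node("linux1", processors=2, memory_mb=2048, tmp_disk_mb=4096)
debug = Partition("debug", [n1], max_time_min=30)
batch = Partition("batch", [n1], max_time_min=1440)
job = Job(1234, "alice", time_limit_min=60,
          allocated=[n1], steps=[JobStep(0, tasks=2)])
```

Note that the same node object appears in both partitions, mirroring
SLURM's ability to place one node in multiple partitions.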
\section{Implementation}
Very high scalability was a top priority in SLURM's design;
more comprehensive support for smaller clusters was in fact
added in later revisions.
For example, the initial implementation allocated whole nodes
to jobs, in part to avoid the extra overhead of tracking individual
processors.
While allocation of entire nodes to jobs is still a recommended mode of
operation for very large clusters, an alternate SLURM plugin provides
resource management down to the resolution of individual processors.
SLURM's {\tt srun} command and the daemons are extensively
multi-threaded.
{\tt slurmctld} also maintains independent read and
write locks for critical data structures.
The combination of these two features means that,
for example, three users can get job state information at the same
time that a system administrator is modifying the time limit
for a partition.
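The effect of independent read and write locks can be illustrated with a
minimal readers/writer lock sketch. SLURM itself is written in C; this
Python model is a simplified illustration, not its actual locking code:

```python
import threading

class ReadWriteLock:
    """Many concurrent readers or one exclusive writer."""

    def __init__(self):
        self._readers = 0
        self._mutex = threading.Lock()   # protects the reader count
        self._write = threading.Lock()   # held while writes are excluded

    def acquire_read(self):
        with self._mutex:
            self._readers += 1
            if self._readers == 1:
                self._write.acquire()    # first reader blocks writers

    def release_read(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._write.release()    # last reader admits writers

    def acquire_write(self):
        self._write.acquire()

    def release_write(self):
        self._write.release()
```

Under this scheme, several users querying job state take the read side
concurrently, while an administrator changing a partition's time limit
takes the write side and waits until all readers have finished.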
\begin{figure}[tcb]
\centerline{\includegraphics[width=3.32in]{../figures/comm.eps}}
\caption{\small Hierarchical SLURM communications with fanout=2}
\label{comms}
\end{figure}
Communications to large numbers of nodes are optimized in two
ways. First, the programs initiating communications are multithreaded
and can process tens or even hundreds of simultaneous active
communications.
Second, the {\tt slurmd} daemon is designed to forward
communications on a hierarchical basis as shown in Figure~\ref{comms}.
For example, the initiation of tasks on 1000 nodes does not require
{\tt srun} to communicate directly with all 1000 nodes. {\tt Srun}
can communicate directly with {\tt slurmd} daemons on 32 nodes
(the degree of fanout in communications is configurable).
Each of those {\tt slurmd} daemons will simultaneously forward the
request to {\tt slurmd} daemons on another 32 nodes.
This improves performance by distributing the communication workload.
Note that every communication is authenticated and acknowledged
for fault-tolerance.
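With the fanout of 32 described above, the number of sequential
forwarding rounds grows only logarithmically with cluster size. A short
illustrative sketch (the function name is ours, not SLURM's):

```python
def forwarding_rounds(nodes: int, fanout: int = 32) -> int:
    """Rounds of hierarchical forwarding needed to reach `nodes`
    daemons when srun, and then each contacted daemon, forwards
    the request to `fanout` further nodes per round."""
    covered, frontier, rounds = 0, 1, 0
    while covered < nodes:
        frontier *= fanout      # srun -> 32, each of those -> 32 more, ...
        covered += frontier
        rounds += 1
    return rounds
```

Thus 1000 nodes are reached in two rounds and even an emulated
16,384-node cluster in three, which is why the communication workload
stays well distributed.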
A number of interesting papers~\cite{Jones2003,Kerbyson2001,Petrini2003,Phillips2003,Tsafrir2005}
have recently been written about
the impact of system daemons and other system overhead on
parallel job performance. This {\em system noise} can have a
dramatic impact upon the performance of highly parallel jobs.
In a simplified example, consider a system daemon that runs for
one second out of every 100 seconds on every node in a cluster.
For serial jobs this
reduces throughput by one percent, but the impact is compounded
on parallel computers if these daemons do not all execute concurrently.
If the parallel program runs on 100 nodes and tries to synchronize
every second, almost every synchronization period will include a
one second delay from the daemon running on one of the 100 nodes.
This effectively limits job parallelism to about 50-way, orders
of magnitude smaller than the largest systems currently available.
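The arithmetic behind this example can be checked under the simplifying
assumption that each node's daemon occupies one uniformly random,
independent second of the 100-second period (the function below is
purely illustrative):

```python
def hit_probability(nodes: int, period: int = 100) -> float:
    """Probability that a given 1-second synchronization interval
    is delayed by at least one node's daemon, assuming each daemon
    runs 1 second per `period` at an independent random phase."""
    return 1 - (1 - 1 / period) ** nodes
```

For a serial job this gives the expected one percent loss, while for 100
nodes roughly two thirds of synchronization intervals are delayed under
this model; with more nodes the figure rapidly approaches every
interval.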
SLURM addresses this issue by:
\begin{itemize}
\item Making the {\tt slurmd} daemon resource requirements negligible
\item Supporting configurations that let the {\tt slurmd} daemon sleep
during the entire job execution period
\item If the {\tt slurmd} daemons do perform work, it is done in a
highly synchronized fashion across all nodes
\end{itemize}
In addition, the default mode of operation is to allocate entire
nodes with all of their processors to applications rather than
individual processors on each node.
This eliminates the possibility of interference between jobs,
which could severely degrade performance of parallel applications.
Allocation of resources to the resolution of individual processors
on each node is supported by SLURM, but this comes at a higher cost
in terms of the data managed.
The selection of resource resolution is provided by different plugins.
Resource management of large clusters entails the processing of
large quantities of data, both for the software and the
system administrator.
In order to ease the burden on system administrators, SLURM
configuration files and tools all support node naming using
numeric ranges.
For example, ``linux[1-4096]'' represents 4096 node names with
a prefix of ``linux'' and numeric suffixes from 1 to 4096.
This naming convention permits even the largest clusters
to be described in a configuration file containing only a
couple of dozen lines.
State information output from various SLURM commands uses
the same convention to keep the volume of output modest
on even the largest clusters.
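A minimal sketch of how such a range expression can be expanded,
handling only the simple prefix[lo-hi] form of the example above
(SLURM's actual hostlist support is considerably richer):

```python
import re
from typing import List

def expand_hostlist(expr: str) -> List[str]:
    """Expand 'linux[1-3]' into ['linux1', 'linux2', 'linux3'].
    Names without a bracketed range pass through unchanged."""
    m = re.fullmatch(r"(\w+)\[(\d+)-(\d+)\]", expr)
    if not m:
        return [expr]
    prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return [f"{prefix}{i}" for i in range(lo, hi + 1)]
```

The inverse operation, compressing thousands of node names back into a
single range expression, is what keeps command output and configuration
files compact.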
Extensive use is made of bitmaps to represent nodes in the cluster.
For example, bitmaps are maintained for each unique node configuration,
the nodes associated with each partition, nodes allocated to
the active jobs, nodes available for use, etc. This reduces most
scheduling operations to very rapid AND and OR operations on those bitmaps.
\section{Application Launch}
To better illustrate SLURM's operation, the execution of an
application is detailed below and illustrated in Figure~\ref{launch}.
This example is based upon a typical configuration and the
{\em interactive} mode, in which stdout and
stderr are displayed on the user's terminal in real time, and stdin and
signals may be forwarded from the terminal transparently to the remote
tasks.
\begin{figure}[tb]
\centerline{\includegraphics[width=3.32in]{../figures/launch.eps}}
\caption{\small SLURM Job Launch}
\label{launch}
\end{figure}
The task launch request is initiated by a user's execution of the
{\tt srun} command. {\tt Srun} has a multitude of options to specify
resource requirements such as minimum memory per node, minimum
temporary disk space per node, features associated with nodes,
partition to use, node count, task count, etc.
{\tt Srun} gets a credential to identify the user and their group,
then sends the request to {\tt slurmctld} (message 1).
{\tt Slurmctld} authenticates the request and identifies the resources
to be allocated using a series of bitmap operations.
First, the nodes containing the appropriate resources (processors,
memory, temporary disk space, and features) are identified through
comparison with a node configuration table, which typically has
a very small number of entries.
The resulting bitmap is ANDed with the bitmap associated with the
requested partition.
This bitmap is ANDed with the bitmap identifying available nodes.
The requested node and/or processor count is then satisfied from
the nodes identified with the resulting bitmap.
This completes the job allocation process, but for interactive
mode, a job step credential is also constructed for the allocation
and sent to {\tt srun} in the reply (message 2).
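The sequence of bitmap operations described above can be sketched in a
few lines, with Python integers standing in for SLURM's bitmap
structures (illustrative only, not SLURM's implementation):

```python
from typing import Optional

def select_nodes(config_ok: int, partition: int,
                 available: int, count: int) -> Optional[int]:
    """Pick `count` nodes via the bitmap ANDs described above.
    Bit i set means node i qualifies; returns a bitmap of the
    chosen nodes, or None if the request cannot be satisfied."""
    candidates = config_ok & partition & available
    chosen, found, i = 0, 0, 0
    while candidates >> i and found < count:
        if (candidates >> i) & 1:
            chosen |= 1 << i
            found += 1
        i += 1
    return chosen if found == count else None
```

Because each AND touches every node in a single machine-word-wide
operation per chunk of the bitmap, the cost of candidate selection is
essentially independent of how complex the individual node
configurations are.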
The {\tt srun} command opens sockets for task input and output, then
sends the job step credential directly to the {\tt slurmd} daemons
(message 3) in order to launch the tasks, which is acknowledged
(message 4).
Note that the {\tt slurmctld} and {\tt slurmd} daemons do not directly
communicate during the task launch operation in order to minimize the
workload on the {\tt slurmctld}, which has to manage the entire
cluster.
Task termination is communicated to {\tt srun} over the same
socket used for input and output.
When all tasks have terminated, {\tt srun} notifies {\tt slurmctld}
of the job step termination (message 5).
{\tt Slurmctld} authenticates the request, acknowledges it
(message 6) and sends messages to the {\tt slurmd} daemons to
ensure that all processes associated with the job have
terminated (message 7).
Upon receipt of job termination confirmation on each node (message 8),
{\tt slurmctld} releases the resources for use by another job.
The full time for execution of a simple parallel application across
a few nodes is a few milliseconds.
The time reaches a few seconds for jobs that span thousands of
nodes.
The times will vary with the hardware and configuration used,
but are hardly noticeable to the user at even the largest scales.
\section{Conclusion}
SLURM has demonstrated high reliability and scalability.
It provides resource management on several clusters containing
over 2,000 compute nodes and has emulated clusters containing
up to 16,384 compute nodes, four times the size of the
largest parallel computer in use today.
The cluster size is presently limited by the size of some SLURM
data structures, but operation on larger systems is certainly
feasible with minor code changes.
\raggedright
% make the bibliography
\bibliographystyle{plain}
\bibliography{project}
% make the back cover page
%\makeLLNLBackCover
\end{document}