% Presenter info:
% http://www.linuxclustersinstitute.org/Linux-HPC-Revolution/presenterinfo.html
%
% Main Text Layout
% Set the main text in 10 point Times Roman or Times New Roman (normal),
% (no boldface), using single line spacing. All text should be in a single
% column and justified.
%
% Opening Style (First Page)
% This includes the title of the paper, the author names, organization and
% country, the abstract, and the first part of the paper.
% * Start the title 35mm down from the top margin in Times Roman font, 16
% point bold, range left. Capitalize only the first letter of the first
% word and proper nouns.
% * On a new line, type the authors' names, organizations, and country only
% (not the full postal address, although you may add the name of your
% department), in Times Roman, 11 point italic, range left.
% * Start the abstract with the heading two lines below the last line of the
% address. Set the abstract in Times Roman, 12 point bold.
% * Leave one line, then type the abstract in Times Roman 10 point, justified
% with single line spacing.
%
% Other Pages
% For the second and subsequent pages, use the full 190 x 115mm area and type
% in one column beginning at the upper right of each page, inserting tables
% and figures as required.
%
% We're recommending the Lecture Notes in Computer Science styles from
% Springer Verlag --- google on Springer Verlag LaTeX. These work nicely,
% *except* that they do not work with the hyperref package. Sigh.
%
% http://www.springer.de/comp/lncs/authors.html
\documentclass[10pt,onecolumn,times]{llncs}
\usepackage{verbatim,moreverb}
\usepackage{float}
% Needed for LLNL cover page / TID bibliography style
\usepackage{calc}
\usepackage{epsfig}
\usepackage{graphics}
\usepackage{hhline}
\input{pstricks}
\input{pst-node}
\usepackage{chngpage}
\input{llnlCoverPage}
% Uncomment for DRAFT "watermark"
%\usepackage{draftcopy}
% Times font as default roman
\renewcommand{\sfdefault}{phv}
\renewcommand{\rmdefault}{ptm}
%\renewcommand{\ttdefault}{pcr}
%\usepackage{times}
\renewcommand{\labelitemi}{$\bullet$}
\setlength{\textwidth}{115mm}
\setlength{\textheight}{190mm}
\setlength{\oddsidemargin}{(\paperwidth-\textwidth)/2 - 1in}
\setlength{\topmargin}{(\paperheight-\textheight-\headheight-\headsep-\footskip)/2 - 1in + .5in}
% couple of macros for the title page and document
\def\ctit{SLURM: Simple Linux Utility for Resource \linebreak
Management}
\def\ucrl{UCRL-JC-147996 REV 1}
\def\auth{Morris Jette \\ Mark Grondona}
\def\pubdate{June 23, 2003}
\def\journal{ClusterWorld Conference and Expo}
\begin{document}
% make the cover page
\makeLLNLCover{\ucrl}{\ctit}{\auth}{\journal}{\pubdate}{0in}{0in}
% Title - 16pt bold
\vspace*{35mm}
\noindent\Large
\textbf{\ctit}
\vskip1\baselineskip
% Authors - 11pt
\noindent\large
{Morris Jette and Mark Grondona \\
{\em Lawrence Livermore National Laboratory, USA}}
\vskip2\baselineskip
% Abstract heading - 12pt bold
\noindent\large
\textbf{Abstract}
\vskip1\baselineskip
% Abstract itself - 10pt
\noindent\normalsize
Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job scheduling
system for Linux clusters of thousands of nodes. Components include
machine status, partition management, job management, scheduling, and
stream copy modules. This paper presents an overview of the SLURM
architecture and functionality.
% define some additional macros for the body
\newcommand{\munged}{{\tt munged}}
\newcommand{\srun}{{\tt srun}}
\newcommand{\scancel}{{\tt scancel}}
\newcommand{\squeue}{{\tt squeue}}
\newcommand{\scontrol}{{\tt scontrol}}
\newcommand{\sinfo}{{\tt sinfo}}
\newcommand{\slurmctld}{{\tt slurmctld}}
\newcommand{\slurmd}{{\tt slurmd}}
\section{Overview}
Simple Linux Utility for Resource Management (SLURM)\footnote{A tip of
the hat to Matt Groening and the creators of {\em Futurama},
where Slurm is the most popular carbonated beverage in the universe.}
is a resource management system suitable for use on large and small Linux
clusters. After surveying \cite{Jette2002} resource managers available
for Linux and finding none that were simple, highly scalable, and portable
to different cluster architectures and interconnects, the authors set out
to design a new system.
The resulting design is a resource management system with the following general
characteristics:
\begin{itemize}
\item {\tt Simplicity}: SLURM is simple enough to allow motivated end users
to understand its source code and add functionality. The authors will
avoid the temptation to add features unless they are of general appeal.
\item {\tt Open Source}: SLURM is available to everyone and will remain free.
Its source code is distributed under the GNU General Public
License \cite{GPL2002}.
\item {\tt Portability}: SLURM is written in the C language, with a GNU
{\em autoconf} configuration engine.
While initially written for Linux, other Unix-like operating systems
should be easy porting targets.
SLURM also supports a general purpose ``plugin'' mechanism, which
permits a variety of different infrastructures to be easily supported.
The SLURM configuration file specifies which set of plugin modules
should be used.
\item {\tt Interconnect Independence}: SLURM currently supports UDP/IP-based
communication and the Quadrics Elan3 interconnect. Adding support for
other interconnects, including topology constraints, is straightforward
and utilizes the plugin mechanism described above.
\item {\tt Scalability}: SLURM is designed for scalability to clusters of
thousands of nodes. The SLURM controller for a cluster with 1000 nodes
occupies on the order of 2 MB of memory, and excellent performance has
been demonstrated. Jobs may specify their resource requirements in a
variety of ways, including explicit requirements and ranges of
acceptable values.
\item {\tt Fault Tolerance}: SLURM can handle a variety of failure
modes without terminating workloads, including crashes of the node
running the SLURM controller. User jobs may be configured to continue
execution despite the failure of one or more nodes on which they are
executing. The user command controlling a job, {\tt srun}, may detach
from and reattach to the parallel tasks at any time. Nodes allocated to
a job become available for reuse as soon as the job(s) allocated to them
terminate. If some nodes fail to complete job termination in a
timely fashion because of hardware or software problems, only the
scheduling of those tardy nodes will be affected.
\item {\tt Security}: SLURM employs crypto technology to authenticate
users to services and services to each other with a variety of options
available through the plugin mechanism. SLURM does not assume that its
networks are physically secure, but it does assume that the entire cluster
is within a single administrative domain with a common user base.
\item {\tt System Administrator Friendly}: SLURM utilizes
a simple configuration file and minimizes distributed state.
Its configuration may be changed at any time without impacting running
jobs. Heterogeneous nodes within a cluster may be easily managed. SLURM
interfaces are usable by scripts and its behavior is highly deterministic.
\end{itemize}
\subsection{What Is SLURM?}
As a cluster resource manager, SLURM has three key functions. First,
it allocates exclusive and/or non-exclusive access to resources (compute
nodes) to users for some duration of time so they can perform work.
Second, it provides a framework for starting, executing, and monitoring
work (normally a parallel job) on the set of allocated nodes. Finally,
it arbitrates conflicting requests for resources by managing a queue of
pending work.
Users interact with SLURM through four command line utilities: \srun\
for submitting a job for execution and optionally controlling it
interactively, \scancel\ for terminating a pending or running job,
\squeue\ for monitoring job queues, and \sinfo\ for monitoring partition
and overall system state. System administrators perform privileged
operations through an additional command line utility, {\tt scontrol}.
The central controller daemon, {\tt slurmctld}, maintains the global
state and directs operations. Compute nodes simply run a \slurmd\ daemon
(similar to a remote shell daemon) to export control to SLURM.
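
For illustration, a typical session might exercise these commands as
follows (the node count and job id are illustrative):
\begin{verbatim}
srun --nodes 2 hostname   # run a two-node job interactively
squeue                    # display pending and running jobs
sinfo                     # summarize partition and node states
scancel 42                # cancel job 42
\end{verbatim}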
\subsection{What SLURM Is Not}
SLURM is not a comprehensive cluster administration or monitoring package.
While SLURM knows the state of its compute nodes, it makes no attempt
to put this information to use in other ways, such as with a general
purpose event logging mechanism or a back-end database for recording
historical state. It is expected that SLURM will be deployed in a
cluster with other tools performing those functions.
SLURM is not a meta-batch system like Globus \cite{Globus2002} or DPCS
(Distributed Production Control System) \cite{DPCS2002}. SLURM supports
resource management across a single cluster.
SLURM is not a sophisticated batch system. In fact, it was expressly
designed to provide high-performance parallel job management while
leaving scheduling decisions to an external entity. Its default scheduler
implements First-In First-Out (FIFO). An external entity can establish
a job's initial priority through a plugin. An external scheduler may
also submit, signal, and terminate jobs as well as reorder the queue of
pending jobs via the API.
\section{Architecture}
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/arch.eps,scale=0.35}}
\caption{\small SLURM architecture}
\label{arch}
\end{figure}
As shown in Figure~\ref{arch}, SLURM consists of a \slurmd\ daemon
running on each compute node, a central \slurmctld\ daemon running
on a management node (with optional fail-over twin), and five command
line utilities: {\tt srun}, {\tt scancel}, {\tt sinfo}, {\tt squeue},
and {\tt scontrol}, which can run anywhere in the cluster.
The entities managed by these SLURM daemons include {\em nodes}, the
compute resource in SLURM, {\em partitions}, which group nodes into
disjoint logical sets, {\em jobs}, or allocations of resources assigned
to a user for a specified amount of time, and {\em job steps}, which
are sets of (possibly parallel) tasks within a job. Each job in the
priority-ordered queue is allocated nodes within a single partition.
Once an allocation request fails, no lower priority jobs for that
partition will be considered for a resource allocation. Once a job is
assigned a set of nodes, the user is able to initiate parallel work in
the form of job steps in any configuration within the allocation. For
instance, a single job step may be started that utilizes all nodes
allocated to the job, or several job steps may independently use a
portion of the allocation.
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/entities.eps,scale=0.5}}
\caption{\small SLURM entities: nodes, partitions, jobs, and job steps}
\label{entities}
\end{figure}
Figure~\ref{entities} further illustrates the interrelation of these
entities as they are managed by SLURM by showing a group of
compute nodes split into two partitions. Partition 1 is running one job,
with one job step utilizing the full allocation of that job. The job
in Partition 2 has only one job step using half of the original job
allocation. That job might initiate additional job steps to utilize
the remaining nodes of its allocation.
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/slurm-arch.eps,scale=0.5}}
\caption{SLURM architecture - subsystems}
\label{archdetail}
\end{figure}
Figure~\ref{archdetail} shows the subsystems that are implemented
within the \slurmd\ and \slurmctld\ daemons. These subsystems are
explained in more detail below.
\subsection{Slurmd}
\slurmd\ is a multi-threaded daemon running on each compute node and
can be compared to a remote shell daemon: it reads the common SLURM
configuration file and saved state information,
notifies the controller that it is active, waits
for work, executes the work, returns status, then waits for more work.
Because it initiates jobs for other users, it must run as user {\em root}.
It also asynchronously exchanges node and job status with {\tt slurmctld}.
The only job information it has at any given time pertains to its
currently executing jobs. \slurmd\ has four major components:
\begin{itemize}
\item {\tt Machine and Job Status Services}: Respond to controller
requests for machine and job state information and send asynchronous
reports of some state changes (e.g., \slurmd\ startup) to the controller.
\item {\tt Remote Execution}: Start, manage, and clean up after a set
of processes (typically belonging to a parallel job) as dictated by
the \slurmctld\ daemon or an \srun\ or \scancel\ command. Starting a
process may include executing a prolog program, setting process limits,
setting real and effective uid, establishing environment variables,
setting working directory, allocating interconnect resources, setting core
file paths, initializing stdio, and managing process groups. Terminating
a process may include terminating all members of a process group and
executing an epilog program.
\item {\tt Stream Copy Service}: Allow handling of stderr, stdout, and
stdin of remote tasks. Job input may be redirected
from a single file or multiple files (one per task), an
\srun\ process, or /dev/null. Job output may be saved into local files or
returned to the \srun\ command. Regardless of the location of stdout/err,
all job output is locally buffered to avoid blocking local tasks.
\item {\tt Job Control}: Allow asynchronous interaction with the Remote
Execution environment by propagating signals or explicit job termination
requests to any set of locally managed processes.
\end{itemize}
\subsection{Slurmctld}
Most SLURM state information exists in {\tt slurmctld}, also known as
the controller. \slurmctld\ is multi-threaded with independent read
and write locks for the various data structures to enhance scalability.
When \slurmctld\ starts, it reads the SLURM configuration file and
any previously saved state information. Full controller state
information is written to disk periodically, with incremental changes
written to disk immediately for fault tolerance. \slurmctld\ runs in
either master or standby mode, depending on the state of its fail-over
twin, if any. \slurmctld\ need not execute as user {\em root}. In fact,
it is recommended that a unique user entry be created for executing
\slurmctld\ and that user must be identified in the SLURM configuration
file as {\tt SlurmUser}. \slurmctld\ has three major components:
\begin{itemize}
\item {\tt Node Manager}: Monitors the state of each node in the cluster.
It polls {\tt slurmd}s for status periodically and receives state
change notifications from \slurmd\ daemons asynchronously. It ensures
that nodes have the prescribed configuration before being considered
available for use.
\item {\tt Partition Manager}: Groups nodes into non-overlapping sets
called partitions. Each partition can have associated with it
various job limits and access controls. The Partition Manager also
allocates nodes to jobs based on node and partition states and
configurations. Requests to initiate jobs come from the Job Manager.
\scontrol\ may be used to administratively alter node and partition
configurations.
\item {\tt Job Manager}: Accepts user job requests and places pending jobs
in a priority-ordered queue. The Job Manager is awakened on a periodic
basis and whenever there is a change in state that might permit a job to
begin running, such as job completion, job submission, partition {\em up}
transition, node {\em up} transition, etc. The Job Manager then makes
a pass through the priority-ordered job queue. The highest priority
jobs for each partition are allocated resources when possible. As soon as
an allocation failure occurs for any partition, no lower-priority jobs
for that partition are considered for initiation. After completing the
scheduling cycle, the Job Manager's scheduling thread sleeps. Once a
job has been allocated resources, the Job Manager transfers necessary
state information to those nodes, permitting the job to commence execution.
When the Job Manager detects that all nodes associated with a job
have completed their work, it initiates cleanup and performs another
scheduling cycle as described above.
\end{itemize}
\subsection{Command Line Utilities}
The command line utilities offer users access to remote execution and
job control. They also permit administrators to dynamically change
the system configuration. These commands use SLURM APIs that are
directly available for more sophisticated applications.
\begin{itemize}
\item {\tt scancel}: Cancel a running or a pending job or job step,
subject to authentication and authorization. This command can also be
used to send an arbitrary signal to all processes on all nodes associated
with a job or job step.
\item {\tt scontrol}: Perform privileged administrative commands
such as bringing down a node or partition in preparation for maintenance.
Many \scontrol\ functions can only be executed by privileged users.
\item {\tt sinfo}: Display a summary of partition and node information.
An assortment of filtering and output format options is available.
\item {\tt squeue}: Display the queue of running and waiting jobs and/or
job steps. A wide assortment of filtering, sorting, and output format
options is available.
\item {\tt srun}: Allocate resources, submit jobs to the SLURM queue,
and initiate parallel tasks (job steps). Every set of executing parallel
tasks has an associated \srun\ that initiated it and, if the \srun\
persists, manages it. Jobs may be submitted for later execution
(e.g., batch), in which case \srun\ terminates after job submission.
Jobs may also be submitted for interactive execution, where \srun\ keeps
running to shepherd the running job. In this case, \srun\ negotiates
connections with remote {\tt slurmd}s for job initiation and to get
stdout and stderr, forward stdin,\footnote{\srun\ command line options
select the stdin handling method, such as broadcast to all tasks, or
send only to task 0.} and respond to signals from the user. \srun\
may also be instructed to allocate a set of resources and spawn a shell
with access to those resources.
\end{itemize}
\subsection{Plugins}
In order to simplify the use of different infrastructures,
SLURM uses a general purpose plugin mechanism. A SLURM plugin is a
dynamically linked code object that is loaded explicitly at run time
by the SLURM libraries. A plugin provides a customized implementation
of a well-defined API connected to tasks such as authentication,
interconnect fabric, and task scheduling. A common set of functions is defined
for use by all of the different infrastructures of a particular variety.
For example, the authentication plugin must define functions such as
{\tt slurm\_auth\_create} to create a credential, {\tt slurm\_auth\_verify}
to verify a credential to approve or deny authentication,
{\tt slurm\_auth\_get\_uid} to get the uid associated with a specific
credential, etc. It also must define the data structure used, a plugin
type, a plugin version number, etc. When a SLURM daemon is initiated, it
reads the configuration file to determine which of the available plugins
should be used. For example, {\em AuthType=auth/authd} says to use the
plugin for authd based authentication and {\em PluginDir=/usr/local/lib}
identifies the directory in which to find the plugin.
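
As a concrete illustration, a trivial ``none''-style authentication
plugin might be sketched in C as follows. The credential structure and
return conventions here are hypothetical; only the function names come
from the plugin API description above.
\begin{verbatim}
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

/* Hypothetical null credential: records the caller's identity
 * without any cryptographic protection (testing use only). */
typedef struct slurm_auth_credential {
    uid_t uid;
    gid_t gid;
} slurm_auth_credential_t;

/* Create a credential for the current user. */
slurm_auth_credential_t *slurm_auth_create(void)
{
    slurm_auth_credential_t *cred = malloc(sizeof(*cred));
    if (cred != NULL) {
        cred->uid = geteuid();
        cred->gid = getegid();
    }
    return cred;
}

/* Verify a credential; the null plugin accepts any non-null one. */
int slurm_auth_verify(slurm_auth_credential_t *cred)
{
    return (cred != NULL) ? 0 : -1;    /* 0 == authenticated */
}

/* Return the uid bound to a verified credential. */
uid_t slurm_auth_get_uid(slurm_auth_credential_t *cred)
{
    return cred->uid;
}
\end{verbatim}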
\subsection{Communications Layer}
SLURM presently uses Berkeley sockets for communications. However,
we anticipate using the plugin mechanism to permit use of
other communications layers. At LLNL we use Ethernet for SLURM
communications and the Quadrics Elan switch exclusively for user
applications. The SLURM configuration file permits the identification
of each node's hostname as well as the name by which it is to be
addressed for communications. For example, if a control machine known
as {\em mcri} is to be contacted using the name {\em emcri} (say, to
indicate an Ethernet communications path), this is represented in the
configuration file as {\em ControlMachine=mcri ControlAddr=emcri}.
The name used for communication is the same as the hostname unless
otherwise specified.
Internal SLURM functions pack and unpack data structures in machine
independent format. We considered the use of XML style messages, but we
felt this would adversely impact performance (albeit slightly). If XML
support is desired, it is straightforward to perform a translation and
use the SLURM APIs.
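
The following sketch conveys the flavor of such machine-independent
packing; the buffer type and function names are hypothetical, but the
technique (fixed-width integers converted to network byte order) is the
one described above.
\begin{verbatim}
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>    /* htonl(), ntohl() */

/* Hypothetical fixed-size buffer; offset marks the position. */
typedef struct {
    unsigned char data[1024];
    size_t offset;
} buf_t;

/* Append a 32-bit value in network byte order. */
void pack32(uint32_t value, buf_t *buf)
{
    uint32_t net = htonl(value);
    memcpy(buf->data + buf->offset, &net, sizeof(net));
    buf->offset += sizeof(net);
}

/* Extract a 32-bit value, converting to host byte order. */
uint32_t unpack32(buf_t *buf)
{
    uint32_t net;
    memcpy(&net, buf->data + buf->offset, sizeof(net));
    buf->offset += sizeof(net);
    return ntohl(net);
}
\end{verbatim}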
\subsection{Security}
SLURM has a simple security model: any user of the cluster may submit
parallel jobs for execution and cancel his or her own jobs. Any user may view
SLURM configuration and state information. Only privileged users
may modify the SLURM configuration, cancel any job, or perform other
restricted activities. Privileged users in SLURM include the users
{\em root} and {\em SlurmUser} (as defined in the SLURM configuration file).
If permission to modify SLURM configuration is required by others, set-uid
programs may be used to grant specific permissions to specific users.
\subsubsection{Communication Authentication.}
Historically, inter-node authentication has been accomplished via the use
of reserved ports and set-uid programs. In this scheme, daemons check the
source port of a request to ensure that it is less than a certain value
and thus only accessible by {\em root}. The communications over that
connection are then implicitly trusted. Because reserved ports are a
limited resource and set-uid programs are a possible security concern,
we have employed a credential-based authentication scheme that
does not depend on reserved ports. In this design, a SLURM authentication
credential is attached to every message and authoritatively verifies the
uid and gid of the message originator. Once recipients of SLURM messages
verify the validity of the authentication credential, they can use the uid
and gid from the credential as the authoritative identity of the sender.
The actual implementation of the SLURM authentication credential is
relegated to an ``auth'' plugin. We presently have implemented three
functional authentication plugins: authd~\cite{Authd2002},
Munge, and none. The ``none'' authentication type employs a null
credential and is only suitable for testing and networks where security
is not a concern. Both the authd and Munge implementations employ
cryptography to generate a credential for the requesting user that
may then be authoritatively verified on any remote nodes. However,
authd assumes a secure network and Munge does not. Other authentication
implementations, such as a credential based on Kerberos, should be easy
to develop using the auth plugin API.
\subsubsection{Job Authentication.}
When resources are allocated to a user by the controller, a ``job step
credential'' is generated by combining the uid, job id, step id, the list
of resources allocated (nodes), and the credential lifetime and signing
the result with the \slurmctld\ private key. This credential grants the
user access to allocated resources and removes the burden from \slurmd\
to contact the controller to verify requests to run processes. \slurmd\
verifies the signature on the credential against the controller's public
key and runs the user's request if the credential is valid. Part of the
credential signature is also used to validate stdout, stdin,
and stderr connections from \slurmd\ to \srun .
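
The credential's content can be pictured as the hypothetical C
structure below; the field names are illustrative, but the fields
themselves (uid, job id, step id, node list, lifetime, and signature)
are exactly those enumerated above.
\begin{verbatim}
#include <stdint.h>
#include <sys/types.h>
#include <time.h>

#define SIG_LEN 128    /* hypothetical signature size */

/* Illustrative job step credential layout.  slurmctld fills in
 * the fields and signs them with its private key; slurmd checks
 * the signature with the controller's public key before running
 * any work under this credential. */
typedef struct job_step_credential {
    uid_t         uid;             /* user the tasks run as */
    uint32_t      job_id;
    uint32_t      step_id;
    char          node_list[256];  /* e.g. "dev[6-7]" */
    time_t        expiration;      /* credential lifetime */
    unsigned char signature[SIG_LEN];
} job_step_credential_t;
\end{verbatim}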
\subsubsection{Authorization.}
Access to partitions may be restricted via a {\em RootOnly} flag.
If this flag is set, job submit or allocation requests to this partition
are only accepted if the effective uid originating the request is a
privileged user. A privileged user may submit a job as any other user.
This may be used, for example, to provide specific external schedulers
with exclusive access to partitions. Individual users will not be
permitted to directly submit jobs to such a partition, which would
prevent the external scheduler from effectively managing it. Access to
partitions may also be restricted to users who are members of specific
Unix groups using an {\em AllowGroups} specification.
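
For example, a partition reserved for an external scheduler might be
described by configuration lines of roughly the following form (the
partition, node, and group names here are hypothetical):
\begin{verbatim}
PartitionName=batch  Nodes=linux[001-128] RootOnly=YES AllowGroups=ALL
PartitionName=debug  Nodes=linux[129-130] RootOnly=NO  AllowGroups=staff
\end{verbatim}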
\subsection{Example: Executing a Batch Job}
In this example a user wishes to run a job in batch mode, in which \srun\
returns immediately and the job executes in the background when resources
are available. The job is a two-node run of a script containing {\em mping},
a simple MPI application. The user submits the job:
\begin{verbatim}
srun --batch --nodes 2 --nprocs 2 mping.sh
\end{verbatim}
The script {\tt mping.sh} contains:
\begin{verbatim}
#!/bin/sh
srun hostname
srun mping 1 1048576
\end{verbatim}
The initial \srun\ command authenticates the user to the controller and
submits the job request. The request includes the \srun\ environment,
current working directory, and command line option information. By
default, stdout and stderr are sent to files in the current working
directory and stdin is copied from {\tt /dev/null}.
The controller consults the Partition Manager to test whether the job
will ever be able to run. If the user has requested a non-existent partition,
a non-existent constraint,
etc., the Partition Manager returns an error and the request is discarded.
The failure is reported to \srun , which informs the user and exits, for
example:
\begin{verbatim}
srun: error: Unable to allocate resources: Invalid partition name
\end{verbatim}
On successful submission, the controller assigns the job a unique
{\em SLURM job id}, adds it to the job queue, and returns the job's
job id to \srun , which reports this to the user and exits, returning
success to the user's shell:
\begin{verbatim}
srun: jobid 42 submitted
\end{verbatim}
The controller awakens the Job Manager, which tries to run jobs starting
at the head of the priority-ordered job queue. It finds job {\em 42} and
makes a successful request to the Partition Manager to allocate two nodes
from the default (or requested) partition: {\em dev6} and {\em dev7}.
The Job Manager then sends a request to the \slurmd\ on the first
node in the job {\em dev6} to execute the script specified on the user's
command line.\footnote{Had the user specified an executable file rather
than a job script, an \srun\ program would be initiated on the first
node and \srun\ would initiate the executable with the desired task
distribution.} The Job Manager also sends a copy of the environment,
current working directory, stdout and stderr location, along with other
options. Additional environment variables are appended to the user's
environment before it is sent to the remote \slurmd\ detailing the job's
resources, such as the SLURM job id ({\em 42}) and the allocated nodes
({\em dev[6-7]}).
The remote \slurmd\ establishes the new environment, executes a SLURM
prolog program (if one is configured) as user {\em root}, and executes
the job script (or command) as the submitting user. The \srun\ within
the job script detects that it is running with allocated resources from
the presence of a {\tt SLURM\_JOBID} environment variable. \srun\
connects to \slurmctld\ to request a job step to run on all nodes of
the current job. \slurmctld\ validates the request and replies with a
job step credential and switch resources. \srun\ then contacts {\tt slurmd}s
running on both {\em dev6} and {\em dev7}, passing the job step credential,
environment, current working directory, command path and arguments,
and interconnect information. The {\tt slurmd}s verify the valid job
step credential, connect stdout and stderr back to \srun , establish
the environment, and execute the command as the submitting user.
Unless instructed otherwise by the user, stdout and stderr are
copied to a file in the current working directory by \srun :
\begin{verbatim}
/path/to/cwd/slurm-42.out
\end{verbatim}
The user may examine output files at any time if they reside
in a globally accessible directory. In this example
{\tt slurm-42.out} would contain the output of the job script's two
commands ({\tt hostname} and {\em mping}):
\begin{verbatim}
dev6
dev7
1 pinged 0: 1 bytes 5.38 uSec 0.19 MB/s
1 pinged 0: 2 bytes 5.32 uSec 0.38 MB/s
1 pinged 0: 4 bytes 5.27 uSec 0.76 MB/s
1 pinged 0: 8 bytes 5.39 uSec 1.48 MB/s
...
1 pinged 0: 1048576 bytes 4682.97 uSec 223.91 MB/s
\end{verbatim}
When the tasks complete execution, \srun\ is notified by \slurmd\ of
each task's exit status. \srun\ reports job step completion to the Job
Manager and exits. \slurmd\ detects when the job script terminates,
notifies the Job Manager of its exit status, and begins cleanup. The Job
Manager directs the {\tt slurmd}s formerly assigned to the job to run
the SLURM epilog program (if one is configured) as user {\em root}.
Finally, the Job Manager releases the resources allocated to job {\em 42}
and updates the job status to {\em complete}. The record of a job's
existence is eventually purged.
\subsection{Example: Executing an Interactive Job}
In this example a user wishes to run the same {\em mping} command
in interactive mode, in which \srun\ blocks while the job executes
and stdout/stderr of the job are copied onto stdout/stderr of {\tt srun}.
The user submits the job, this time without the {\tt --batch} option:
\begin{verbatim}
srun --nodes 2 --nprocs 2 mping 1 1048576
\end{verbatim}
The \srun\ command authenticates the user to the controller and makes a
request for a resource allocation and job step. The Job Manager
responds with a list of nodes, a job step credential, and interconnect
resources on successful allocation. If resources are not immediately
available, the request terminates or blocks depending on user options.
If the request is successful, \srun\ forwards the job run request to
the assigned {\tt slurmd}s in the same manner as the \srun\ in the batch
job script. In this case, the user sees the program output on stdout of
{\tt srun}:
\begin{verbatim}
1 pinged 0: 1 bytes 5.38 uSec 0.19 MB/s
1 pinged 0: 2 bytes 5.32 uSec 0.38 MB/s
1 pinged 0: 4 bytes 5.27 uSec 0.76 MB/s
1 pinged 0: 8 bytes 5.39 uSec 1.48 MB/s
...
1 pinged 0: 1048576 bytes 4682.97 uSec 223.91 MB/s
\end{verbatim}
When the job terminates, \srun\ receives an EOF (End Of File) on each
stream and closes it, then receives the task exit status from each
{\tt slurmd}. The \srun\ process notifies \slurmctld\ that the job is
complete and terminates. The controller contacts all {\tt slurmd}s allocated
to the terminating job and issues a request to run the SLURM epilog,
then releases the job's resources.
Most signals received by \srun\ while the job is executing are
transparently forwarded to the remote tasks. SIGINT (generated by
Control-C) is a special case: it causes \srun\ to report remote task
status unless two SIGINTs are received in rapid succession, in which
case the running job is terminated.
SIGQUIT (Control-$\backslash$) is another special case. SIGQUIT forces
termination of the running job.
\section{Slurmctld Design}
\slurmctld\ is modular and multi-threaded with independent read and
write locks for the various data structures to enhance scalability.
The controller includes the following subsystems: Node Manager, Partition
Manager, and Job Manager. Each of these subsystems is described in
detail below.
\subsection{Node Management}
The Node Manager monitors the state of nodes. Node information monitored
includes:
\begin{itemize}
\item Count of processors on the node
\item Size of real memory on the node
\item Size of temporary disk storage
\item State of node (RUN, IDLE, DRAINED, etc.)
\item Weight (preference in being allocated work)
\item Feature (arbitrary description)
\item IP address
\end{itemize}
The SLURM administrator can specify a list of system node names using
a numeric range in the SLURM configuration file or in the SLURM tools
(e.g., ``{\em NodeName=linux[001-512] CPUs=4 RealMemory=1024 TmpDisk=4096 \linebreak
Weight=4 Feature=Linux}''). These values for CPUs, RealMemory, and
TmpDisk are considered to be the minimal node configuration values
acceptable for the node to enter into service. The \slurmd\ registers
whatever resources actually exist on the node, and this is recorded
by the Node Manager. Actual node resources are checked on \slurmd\
initialization and periodically thereafter. If a node registers with
fewer resources than configured, it is placed in DOWN state and
the event logged. Otherwise, the actual resources reported are recorded
and possibly used as a basis for scheduling (e.g., if the node has more
RealMemory than recorded in the configuration file, the actual node
configuration may be used for determining suitability for any application;
alternately, the data in the configuration file may be used for possibly
improved scheduling performance). Note that the node name syntax with numeric
range permits even very large heterogeneous clusters to be described in
only a few lines. In fact, a smaller number of unique configurations
can provide SLURM with greater efficiency in scheduling work.
{\em Weight} is used to order available nodes in assigning work to
them. In a heterogeneous cluster, more capable nodes (e.g., larger memory
or faster processors) should be assigned a larger weight. The units
are arbitrary and should reflect the relative value of each resource.
Pending jobs are assigned the least capable nodes (i.e., lowest weight)
that satisfy their requirements. This tends to leave the more capable
nodes available for those jobs requiring those capabilities.
{\em Feature} is an arbitrary string describing the node, such as a
particular software package, file system, or processor speed. While the
feature does not have a numeric value, one might include a numeric value
within the feature name (e.g., ``1200MHz'' or ``16GB\_Swap''). If the
nodes on the cluster have disjoint features (e.g., different ``shared''
file systems), one should identify these as features (e.g., ``FS1'',
``FS2'', etc.). A program may then specify that all nodes allocated to
it should have the same feature, but that any of the specified features
are acceptable (e.g., ``{\em Feature=FS1$|$FS2$|$FS3}'' means the job should be
allocated nodes that all have the feature ``FS1'' or they all have feature
``FS2,'' etc.).
Node records are kept in an array with hash table lookup. If nodes are
given names containing sequence numbers (e.g., ``lx01'', ``lx02'', etc.),
the hash table permits specific node records to be located very quickly;
therefore, this is our recommended naming convention for larger clusters.
An API is available to view any of this information and to update some
node information (e.g., state). APIs designed to return SLURM state
information permit the specification of a time stamp. If the requested
data has not changed since the time stamp specified by the application,
the application's current information need not be updated. The API
returns a brief ``No Change'' response rather than returning relatively
verbose state information. Changes in node configurations (e.g., node
count, memory, etc.) or the nodes actually in the cluster should be
reflected in the SLURM configuration files. SLURM configuration may be
updated without disrupting any jobs.
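
The server side of this time-stamp protocol reduces to a comparison
like the one sketched below; the type and constant names are
hypothetical.
\begin{verbatim}
#include <time.h>

enum { SLURM_SUCCESS = 0, SLURM_NO_CHANGE_IN_DATA = 1 };

/* Hypothetical node table; last_change records when any node
 * record was last modified. */
typedef struct {
    time_t last_change;
    /* ... node records ... */
} node_table_t;

/* If nothing changed since the requester's time stamp, send a
 * brief "no change" reply instead of the full state dump. */
int load_node_state(const node_table_t *tbl, time_t requester_stamp)
{
    if (tbl->last_change <= requester_stamp)
        return SLURM_NO_CHANGE_IN_DATA;
    /* ... otherwise pack and return all node records ... */
    return SLURM_SUCCESS;
}
\end{verbatim}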
\subsection{Partition Management}
The Partition Manager identifies groups of nodes to be used for execution
of user jobs. One might consider this the actual resource scheduling
component. Data associated with a partition includes:
\begin{itemize}
\item Name
\item RootOnly flag to indicate that only users {\em root} or
{\tt SlurmUser} may allocate resources in this partition (for any user)
\item List of associated nodes
\item State of partition (UP or DOWN)
\item Maximum time limit for any job
\item Minimum and maximum nodes allocated to any single job
\item List of groups permitted to use the partition (defaults to ALL)
\item Shared access (YES, NO, or FORCE)
\item Default partition (if no partition is specified in a job request)
\end{itemize}
It is possible to alter most of this data in real-time in order to affect
the scheduling of pending jobs (currently executing jobs would not be
affected). This information is confined to the controller machine(s)
for better scalability. It is used by the Job Manager (and possibly an
external scheduler), which either executes only on the control machine
or communicates only with the control machine.
The nodes in a partition may be designated for exclusive or non-exclusive
use by a job. A {\tt shared} value of YES indicates that jobs may
share nodes on request. A {\tt shared} value of NO indicates that
jobs are always given exclusive use of allocated nodes. A {\tt shared}
value of FORCE indicates that jobs are never ensured exclusive
access to nodes, but SLURM may initiate multiple jobs on the nodes
for improved system utilization and responsiveness. In this case,
job requests for exclusive node access are not honored. Non-exclusive
access may negatively impact the performance of parallel jobs or cause
them to fail upon exhausting shared resources (e.g., memory or disk
space). However, shared resources may improve overall system utilization
and responsiveness. The proper support of shared resources, including
enforcement of limits on these resources, entails a substantial amount of
effort, which we are not presently planning to expend. However, we have
designed SLURM so as to not preclude the addition of such a capability
at a later time if so desired. Future enhancements could include
constraining jobs to a specific CPU count or memory size within a
node, which could be used to effectively space-share individual nodes.
The Partition Manager will allocate nodes to pending jobs on request
from the Job Manager.
Submitted jobs can specify desired partition, time limit, node count
(minimum and maximum), CPU count (minimum), task count, the need for
contiguous node assignment, and an explicit list of nodes to be included
and/or excluded in its allocation. Nodes are selected so as to satisfy
all job requirements. For example, a job requesting four CPUs and four
nodes will actually be allocated eight CPUs and four nodes in the case
of all nodes having two CPUs each. The request may also indicate node
configuration constraints such as minimum real memory or CPUs per node,
required features, shared access, etc. Overall there are 13 different
parameters that may identify resource requirements for a job.
Nodes are selected for possible assignment to a job based on the
job's configuration requirements (e.g., partition specification, minimum
memory, temporary disk space, features, node list, etc.). The selection
is refined by determining which nodes are up and available for use.
Groups of nodes are then considered in order of weight, with the nodes
having the lowest {\em Weight} preferred. Finally, the physical location
of the nodes is considered.
Bit maps are used to indicate which nodes are up, idle, associated
with each partition, and associated with each unique configuration.
This technique permits scheduling decisions to normally be made by
performing a small number of tests followed by fast bit map manipulations.
If so configured, a job's resource requirements would be compared with
the (relatively small number of) node configuration records, each of
which has an associated bit map. Usable node configuration bitmaps would
be ANDed with the selected partition's bit map, ANDed with the UP node
bit map and possibly ANDed with the IDLE node bit map (this last test
depends on the desire to share resources). This method can eliminate
tens of thousands of individual node configuration comparisons that
would otherwise be required in large heterogeneous clusters.
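
A sketch of this bit map test appears below, using one 64-bit word per
64 nodes; the type and helper names are hypothetical.
\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

#define MAP_WORDS 16    /* 16 x 64 bits covers 1024 nodes */
typedef struct { uint64_t w[MAP_WORDS]; } bitmap_t;

/* candidates = config AND partition AND up [AND idle]:
 * a handful of word-wide ANDs replaces tens of thousands of
 * per-node configuration comparisons. */
void select_candidates(const bitmap_t *config, const bitmap_t *part,
                       const bitmap_t *up, const bitmap_t *idle,
                       int need_idle, bitmap_t *candidates)
{
    for (size_t i = 0; i < MAP_WORDS; i++) {
        uint64_t bits = config->w[i] & part->w[i] & up->w[i];
        if (need_idle)    /* skip this AND if sharing is allowed */
            bits &= idle->w[i];
        candidates->w[i] = bits;
    }
}
\end{verbatim}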
The actual selection of nodes for allocation to a job is currently tuned
for the Quadrics interconnect. This hardware supports hardware message
broadcast only if the nodes are contiguous. If a job is not allocated
contiguous nodes, a slower software based multi-cast mechanism is used.
Jobs will be allocated contiguous nodes to the extent possible (in
fact, contiguous node allocation may be specified as a requirement on
job submission). If contiguous nodes cannot be allocated to a job, it
will be allocated resources from the minimum number of sets of contiguous
nodes possible. If multiple sets of contiguous nodes can be allocated
to a job, the one that most closely fits the job's requirements will
be used. This technique will leave the largest contiguous sets of nodes
intact for jobs requiring them.
The Partition Manager builds a list of nodes to satisfy a job's request.
It also caches the IP addresses of each node and provides this information
to \srun\ at job initiation time for improved performance.
The failure of any node to respond to the Partition Manager only affects
jobs associated with that node. In fact, a job may indicate it should
continue executing even if allocated nodes cease responding. In this
case, the job needs to provide for its own fault tolerance. All other
jobs and nodes in the cluster will continue to operate after a node
failure. No additional work is allocated to the failed node, and it will
be pinged periodically to determine when it resumes responding. The node
may then be returned to service (depending on the {\tt ReturnToService}
parameter in the SLURM configuration).
\subsection{Configuration}
A single configuration file applies to all SLURM daemons and commands.
Most of this information is used only by the controller. Only the
host and port information is referenced by most commands. A sample
configuration file is shown in Table~\ref{sample_config}.
\begin{table}[t]
\begin{center}
\begin{tabular}[c]{c}
\\
\fbox{
\begin{minipage}[c]{0.8\linewidth}
{\tiny \verbatiminput{sample.config} }
\end{minipage}
}
\\
\end{tabular}
\caption{Sample SLURM config file \label{sample_config}}
\end{center}
\end{table}
\subsection{Job Manager}
There are a multitude of parameters associated with each job, including:
\begin{itemize}
\item Job name
\item Uid
\item Job id
\item Working directory
\item Partition
\item Priority
\item Node constraints (processors, memory, features, etc.)
\end{itemize}
Job records have an associated hash table for rapidly locating
specific records. They also have bit maps of requested and/or
allocated nodes (as described above).
The core functions supported by the Job Manager include:
\begin{itemize}
\item Request resource (job may be queued)
\item Reset priority of a job
\item Status job (including node list, memory and CPU use data)
\item Signal job (send arbitrary signal to all processes associated
with a job)
\item Terminate job (remove all processes)
\item Change node count of running job (could fail if insufficient
resources are available)
%\item Preempt/resume job (future)
%\item Checkpoint/restart job (future)
\end{itemize}
Jobs are placed in a priority-ordered queue and allocated nodes as
selected by the Partition Manager. SLURM implements a very simple default
scheduling algorithm, namely FIFO. An attempt is made to schedule pending
jobs on a periodic basis and whenever any change in job, partition,
or node state might permit the scheduling of a job.
We are aware that this scheduling algorithm does not satisfy the needs
of many customers, and we provide the means for establishing other
scheduling algorithms. Before a newly arrived job is placed into the
queue, an external scheduler plugin assigns its initial priority.
A plugin function is also called at the start of each scheduling
cycle to modify job or system state as desired. SLURM APIs permit an
external entity to alter the priorities of jobs at any time and re-order
the queue as desired. The Maui Scheduler \cite{Jackson2001,Maui2002}
is one example of an external scheduler suitable for use with SLURM.
LLNL uses DPCS \cite{DPCS2002} as SLURM's external scheduler.
DPCS is a meta-scheduler with flexible scheduling algorithms that
suit our needs well.
It also provides the scalability required for this application.
DPCS maintains pending job state internally and transfers jobs to SLURM
(or another underlying resource manager) only when they are to begin
execution.
By not transferring jobs to a particular resource manager earlier,
jobs are assured of being initiated on the first resource satisfying
their requirements, whether a Linux cluster with SLURM or an IBM SP
with LoadLeveler (assuming a highly flexible application).
This mode of operation may also be suitable for computational grid
schedulers.
In a future release, the Job Manager will collect resource consumption
information (CPU time used, CPU time allocated, and real memory used)
associated with a job from the \slurmd\ daemons. Presently, only the
wall-clock run time of a job is monitored. When a job approaches its
time limit (as defined by wall-clock execution time) or an imminent
system shutdown has been scheduled, the job is terminated. The actual
termination process is to notify \slurmd\ daemons on nodes allocated
to the job of the termination request. The \slurmd\ job termination
procedure, including job signaling, is described in Section~\ref{slurmd}.
One may think of a job as described above as an allocation of resources
rather than a collection of parallel tasks. The job script executes
\srun\ commands to initiate the parallel tasks or ``job steps.'' The
job may include multiple job steps, executing sequentially and/or
concurrently either on separate or overlapping nodes. Job steps have
associated with them specific nodes (some or all of those associated with
the job), tasks, and a task distribution (cyclic or block) over the nodes.
The management of job steps is considered a component of the Job Manager.
Supported job step functions include:
\begin{itemize}
\item Register job step
\item Get job step information
\item Run job step request
\item Signal job step
\end{itemize}
Job step information includes a list of nodes (entire set or subset of
those allocated to the job) and a credential used to bind communications
between the tasks across the interconnect. The \slurmctld\ constructs
this credential and sends it to the \srun\ initiating the job step.
\subsection{Fault Tolerance}
SLURM supports system level fault tolerance through the use of a secondary
or ``backup'' controller. The backup controller, if one is configured,
periodically pings the primary controller. Should the primary controller
cease responding, the backup loads state information from the last state
save and assumes control. When the primary controller is returned to
service, it tells the backup controller to save state and terminate.
The primary then loads state and assumes control.
SLURM utilities and API users read the configuration file and initially try
to contact the primary controller. Should that attempt fail, an attempt
is made to contact the backup controller before returning an error.
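
In code, this fail-over logic amounts to the sketch below; the connect
helper and the port number are hypothetical, but the
try-primary-then-backup order is as described.
\begin{verbatim}
#include <string.h>
#include <netdb.h>
#include <unistd.h>
#include <sys/socket.h>

/* Open a TCP connection to host:port; returns a socket fd or -1. */
static int try_connect(const char *host, const char *port)
{
    struct addrinfo hints, *res;
    int fd;

    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;
    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

/* Contact the primary controller first; fall back to the backup
 * (if configured) before reporting an error, as the utilities do. */
int connect_controller(const char *primary, const char *backup)
{
    int fd = try_connect(primary, "7002");   /* hypothetical port */
    if (fd < 0 && backup != NULL)
        fd = try_connect(backup, "7002");
    return fd;
}
\end{verbatim}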
SLURM attempts to minimize the amount of time a node is unavailable
for work. Nodes assigned to jobs are returned to the partition as
soon as they successfully clean up user processes and run the system
epilog. In this manner,
those nodes that fail to successfully run the system epilog, or those
with unkillable user processes, are held out of the partition while
the remaining nodes are quickly returned to service.
SLURM considers neither the crash of a compute node nor termination
of \srun\ as a critical event for a job. Users may specify on a per-job
basis whether the crash of a compute node should result in the premature
termination of their job. Similarly, if the host on which \srun\ is
running crashes, the job continues execution and no output is lost.
\section{Slurmd Design}\label{slurmd}
The \slurmd\ daemon is a multi-threaded daemon for managing user jobs
and monitoring system state. Upon initiation it reads the configuration
file, recovers any saved state, captures system state,
attempts an initial connection to the SLURM
controller, and awaits requests. It services requests for system state,
accounting information, job initiation, job state, job termination,
and job attachment. On the local node, it offers an API to translate
local process ids into SLURM job ids.
The most common action of \slurmd\ is to report system state on request.
Upon \slurmd\ startup and periodically thereafter, it gathers the
processor count, real memory size, and temporary disk space for the
node. Should those values change, the controller is notified. In a
future release of SLURM, \slurmd\ will also capture CPU, real-memory, and
virtual-memory consumption from the process table entries for uploading
to {\tt slurmctld}.
%FUTURE: Another thread is
%created to capture CPU, real-memory and virtual-memory consumption from
%the process table entries. Differences in resource utilization values
%from one process table snapshot to the next are accumulated. \slurmd\
%insures these accumulated values are not decremented if resource
%consumption for a user happens to decrease from snapshot to snapshot,
%which would simply reflect the termination of one or more processes.
%Both the real and virtual memory high-water marks are recorded and
%the integral of memory consumption (e.g. megabyte-hours). Resource
%consumption is grouped by uid and SLURM job id (if any). Data
%is collected for system users ({\em root}, {\em ftp}, {\em ntp},
%etc.) as well as customer accounts.
%The intent is to capture all resource use including
%kernel, idle and down time. Upon request, the accumulated values are
%uploaded to \slurmctld\ and cleared.
\slurmd\ accepts requests from \srun\ and \slurmctld\ to initiate
and terminate user jobs. The initiate job request contains such
information as real uid, effective uid, environment variables, working
directory, task numbers, job step credential, interconnect specifications and
authorization, core paths, SLURM job id, and the command line to execute.
System-specific programs can be executed on each allocated node prior
to the initiation of a user job and after the termination of a user
job (e.g., {\em Prolog} and {\em Epilog} in the configuration file).
These programs are executed as user {\em root} and can be used to
establish an appropriate environment for the user (e.g., permit logins,
disable logins, terminate orphan processes, etc.). \slurmd\ executes
the prolog program, resets its session id, and then initiates the job
as requested. It records to disk the SLURM job id, session id, process
id associated with each task, and user associated with the job. In the
event of \slurmd\ failure, this information is recovered from disk in
order to identify active jobs.
When \slurmd\ receives a job termination request from the SLURM
controller, it sends SIGTERM to all running tasks in the job,
waits for {\em KillWait} seconds (as specified in the configuration
file), then sends SIGKILL. If the processes do not terminate,
\slurmd\ notifies \slurmctld , which logs the event and sets the node's state
to DRAINED. After all processes have terminated, \slurmd\ executes the
configured epilog program, if any.
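
A minimal sketch of this termination sequence follows; the session id
comes from {\tt slurmd}'s saved job state and {\em KillWait} from the
configuration file.
\begin{verbatim}
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>

/* Terminate a job's process group: SIGTERM first, SIGKILL after
 * KillWait seconds.  Returns 0 on success, -1 if any process
 * survives (the node would then be marked DRAINED). */
int terminate_job(pid_t session_id, unsigned int kill_wait)
{
    kill(-session_id, SIGTERM);   /* polite request to exit */
    sleep(kill_wait);             /* grace period from config */
    kill(-session_id, SIGKILL);   /* force remaining tasks */
    sleep(1);                     /* allow the kill to settle */
    /* Signal 0 probes for survivors without sending anything. */
    return (kill(-session_id, 0) == 0) ? -1 : 0;
}
\end{verbatim}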
\section{Command Line Utilities}
\subsection{scancel}
\scancel\ terminates queued jobs or signals running jobs or job steps.
The default signal is SIGKILL, which indicates a request to terminate
the specified job or job step. \scancel\ identifies the job(s) to
be signaled through user specification of the SLURM job id, job step id,
user name, partition name, and/or job state. If a job id is supplied,
all job steps associated with the job are affected as well as the job
and its resource allocation. If a job step id is supplied, only that
job step is affected. \scancel\ can only be executed by the job's owner
or a privileged user.
\subsection{scontrol}
\scontrol\ is a tool meant for SLURM administration by user {\em root}.
It provides the following capabilities:
\begin{itemize}
\item {\tt Shutdown}: Cause \slurmctld\ and \slurmd\ to save state
and terminate.
\item {\tt Reconfigure}: Cause \slurmctld\ and \slurmd\ to reread the
configuration file.
\item {\tt Ping}: Display the status of primary and backup \slurmctld\ daemons.
\item {\tt Show Configuration Parameters}: Display the values of general SLURM
configuration parameters such as locations of files and values of timers.
\item {\tt Show Job State}: Display the state information of a particular job
or all jobs in the system.
\item {\tt Show Job Step State}: Display the state information of a particular
job step or all job steps in the system.
\item {\tt Show Node State}: Display the state and configuration information
of a particular node, a set of nodes (using numeric range syntax to
identify their names), or all nodes.
\item {\tt Show Partition State}: Display the state and configuration
information of a particular partition or all partitions.
\item {\tt Update Job State}: Update the state information of a particular job
in the system. Note that not all state information can be changed in this
fashion (e.g., the nodes allocated to a job).
\item {\tt Update Node State}: Update the state of a particular node. Note
that not all state information can be changed in this fashion (e.g., the
amount of memory configured on a node). In some cases, you may need
to modify the SLURM configuration file and cause it to be reread
using the ``Reconfigure'' command described above.
\item {\tt Update Partition State}: Update the state of a particular
partition. Note that not all state information can be changed in this fashion
(e.g., the default partition). In some cases, you may need to modify
the SLURM configuration file and cause it to be reread using the
``Reconfigure'' command described above.
\end{itemize}
\subsection{squeue}
\squeue\ reports the state of SLURM jobs. It can filter these
jobs based on a user-supplied specification of job state (RUN, PENDING,
etc.), job id,
user name, job name, etc. If no specification is supplied, the state of
all pending and running jobs is reported.
\squeue\ also has a variety of sorting and output options.
\subsection{sinfo}
\sinfo\ reports the state of SLURM partitions and nodes. By default,
it reports a summary of partition state with node counts and a summary
of the configuration of those nodes. A variety of sorting and
output formatting options exist.
\subsection{srun}
\srun\ is the user interface to accessing resources managed by SLURM.
Users may utilize \srun\ to allocate resources, submit batch jobs,
run jobs interactively, attach to currently running jobs, or launch a
set of parallel tasks (job step) for a running job. \srun\ supports a
full range of options to specify job constraints and characteristics,
for example minimum real memory, temporary disk space, and CPUs per node,
as well as time limits, stdin/stdout/stderr handling, signal handling,
and the working directory for the job. The full range of options is detailed
in Table~\ref{srun_opts}.
\begin{table}[!tb]
\begin{center}
\begin{tabular}[t]{lcl}
\hhline{---}
Option & Arg Type & Description \\
\hhline{---}
{\em attach} & string & attach \srun\ to a running job \\
{\em allocate} & boolean& allocate nodes only \\
{\em batch} & boolean& submit a batch script to job queue \\
{\em cddir} & string & working directory of remote processes \\
{\em constraint} & string & arbitrary feature constraints \\
{\em contiguous} & boolean& allocate contiguous nodes only \\
{\em cpus-per-task } & number & number of CPUs needed per process \\
{\em distribution} & string & distribution method for processes (block$|$cyclic) \\
{\em error} & string & location of stderr redirection \\
{\em exclude} & string & do not allocate from a specific set of hosts \\
{\em immediate} & boolean& exit if resources are not immediately available \\
{\em input} & string & location of stdin redirection \\
{\em join} & string & join existing \srun\ to collect output of a running job \\
{\em job-name} & string & name of job \\
{\em label} & boolean& prepend task number to lines of stdout/err \\
{\em mem} & number & minimum amount of real memory per node \\
{\em mincpus} & number & minimum number of CPUs per node \\
% {\em no-allocate} & boolean& do not allocate nodes, SlurmUser only \\
{\em no-kill} & boolean& don't kill job if allocated nodes fail \\
{\em nodelist} & string & request a specific set of hosts \\
{\em nodes} & numbers& minimum and maximum number of nodes on which to run \\
{\em ntasks} & number & number of tasks to run \\
{\em output} & string & location of stdout redirection \\
{\em overcommit} & boolean& allow more than 1 process per CPU \\
{\em partition} & string & partition name in which to run \\
{\em share} & boolean& allow nodes to be shared with other jobs \\
% {\em slurmd-debug} & number & get slurmd error message at this level of detail \\
{\em threads} & number & run with this number of communication threads \\
{\em time} & number & wall-clock time limit in minutes \\
{\em tmp} & number & minimum amount of temporary disk space \\
{\em verbose} & boolean& verbose operation \\
{\em version} & boolean& print \srun\ version and exit \\
{\em wait} & number & seconds to wait after first task end before killing job \\
\hhline{---}
\end{tabular}
\caption{\label{srun_opts} List of \srun\ user options}
\end{center}
\end{table}
The \srun\ utility can run in four different modes: {\em interactive},
in which the \srun\ process remains resident in the user's session,
manages stdout/stderr/stdin, and forwards signals to the remote tasks;
{\em batch}, in which \srun\ submits a job script to the SLURM queue for
later execution; {\em allocate}, in which \srun\ requests resources from
the SLURM controller and spawns a shell with access to those resources;
and {\em attach}, in which \srun\ attaches to a currently
running job and displays stdout/stderr in real time from the remote
tasks.
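The following command lines sketch the four modes, using the long-form
option names from Table~\ref{srun_opts} (the executable, script name,
and node counts are illustrative):
\begin{verbatim}
  srun --nodes=2 --ntasks=4 cmd      # interactive
  srun --batch --nodes=2 script.sh   # batch
  srun --allocate --nodes=2          # allocate (spawns a shell)
  srun --attach <jobid>              # attach
\end{verbatim}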
% FUTURE:
% An interactive job may also be forced into the background with a special
% control sequence typed at the user's terminal. Output from the running
% job is subsequently redirected to files in the current working directory
% and stdin is copied from {\tt /dev/null}. A backgrounded job may be
% reattached to a user's terminal at a later time by running
% \begin{verbatim}
% srun --attach <jobid>
% \end{verbatim}
% at any time, though the remote \srun\ is not terminated as the result
% of an attach.
\section{Job Initiation Design}
There are three modes in which jobs may be run by users under SLURM. The
first and simplest mode is {\em interactive} mode, in which stdout and
stderr are displayed on the user's terminal in real time, and stdin and
signals may be forwarded from the terminal transparently to the remote
tasks. The second mode is {\em batch} or {\em queued} mode, in which the job is
queued until the request for resources can be satisfied, at which time the
job is run by SLURM as the submitting user. In the third mode, {\em allocate}
mode, a job is allocated to the requesting user, under which the user may
manually run job steps via a script or in a sub-shell spawned by \srun .
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/connections.eps,scale=0.4}}
\caption{\small Job initiation connections overview. 1. \srun\ connects to
\slurmctld\ requesting resources. 2. \slurmctld\ issues a response,
with list of nodes and job step credential. 3. \srun\ opens a listen
port for job IO connections, then sends a run job step
request to \slurmd . 4. \slurmd\ initiates job step and connects
back to \srun\ for stdout/err}
\label{connections}
\end{figure}
Figure~\ref{connections} shows a high-level depiction of the connections
that occur between SLURM components during a general interactive
job startup. \srun\ requests a resource allocation and job step
initiation from the {\tt slurmctld}, which responds with the job id,
list of allocated nodes, job step credential, etc., if the request is granted.
\srun\ then initializes a listen port for stdio connections and connects
to the {\tt slurmd}s on the allocated nodes requesting that the remote
processes be initiated. The {\tt slurmd}s begin execution of the tasks and
connect back to \srun\ for stdout and stderr. This process and other
initiation modes are described in more detail below.
\subsection{Interactive Job Initiation}
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/interactive-job-init.eps,scale=0.45} }
\caption{\small Interactive job initiation. \srun\ simultaneously allocates
nodes and a job step from \slurmctld\ then sends a run request to all
{\tt slurmd}s in job. Dashed arrows indicate a periodic request that
may or may not occur during the lifetime of the job}
\label{init-interactive}
\end{figure}
Interactive job initiation is shown in
Figure~\ref{init-interactive}. The process begins with a user invoking
\srun\ in interactive mode. In Figure~\ref{init-interactive}, the user
has requested an interactive run of the executable ``{\tt cmd}'' in the
default partition.
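Such a request might correspond to a command line as simple as:
\begin{verbatim}
  srun cmd
\end{verbatim}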
After processing command line options, \srun\ sends a message to
\slurmctld\ requesting a resource allocation and a job step initiation.
This message simultaneously requests an allocation (or job) and a job
step. \srun\ waits for a reply from {\tt slurmctld}, which may not come
instantly if the user has requested that \srun\ block until resources are
available. When resources are available for the user's job, \slurmctld\
replies with a job step credential, list of nodes that were allocated,
cpus per node, and so on. \srun\ then sends a message to each \slurmd\ on
the allocated nodes requesting that a job step be initiated.
The \slurmd\ daemons verify that the job is valid using the forwarded job
step credential and then respond to \srun .
Each \slurmd\ invokes a job manager process to handle the request, which
in turn invokes a session manager process that initializes the session for
the job step. An IO thread is created in the job manager that connects
all tasks' IO back to a port opened by \srun\ for stdout and stderr.
Once stdout and stderr have successfully been connected, the task thread
takes the necessary steps to initiate the user's executable on the node,
initializing environment, current working directory, and interconnect
resources if needed.
Each \slurmd\ forks a copy of itself that is responsible for the job
step on this node. This local job manager process then creates an
IO thread that initializes stdout, stdin, and stderr streams for each
local task and connects these streams to the remote \srun . Meanwhile,
the job manager forks a session manager process that initializes
the session, becomes the requesting user, and invokes the user's processes.
As user processes exit, their exit codes are collected, aggregated when
possible, and sent back to \srun\ in the form of a task exit message.
Once all tasks have exited, the session manager exits, and the job
manager process waits for the IO thread to complete, then exits.
The \srun\ process either waits for all tasks to exit, or attempts to
clean up the remaining processes some time after the first task exits
(based on a user option). Regardless, once all tasks are finished,
\srun\ sends a message to the \slurmctld\ releasing the allocated nodes,
then exits with an appropriate exit status.
When the \slurmctld\ receives notification that \srun\ no longer needs
the allocated nodes, it issues a request for the epilog to be run on
each of the {\tt slurmd}s in the allocation. As {\tt slurmd}s report that the
epilog ran successfully, the nodes are returned to the partition.
\subsection{Queued (Batch) Job Initiation}
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/queued-job-init.eps,scale=0.45} }
\caption{\small Queued job initiation.
\slurmctld\ initiates the user's job as a batch script on one node.
The batch script contains an \srun\ call that initiates parallel tasks
after instantiating a job step with the controller. The shaded region is
a compressed representation and is shown in more detail in the
interactive diagram (Figure~\ref{init-interactive})}
\label{init-batch}
\end{figure}
Figure~\ref{init-batch} shows the initiation of a queued job in
SLURM. The user invokes \srun\ in batch mode by supplying the {\tt --batch}
option to \srun . Once user options are processed, \srun\ sends a batch
job request to \slurmctld\ that identifies the stdin, stdout and stderr file
names for the job, current working directory, environment, requested
number of nodes, etc.
The \slurmctld\ queues the request in its priority-ordered queue.
Once the resources are available and the job has a high enough priority, \linebreak
\slurmctld\ allocates the resources to the job and contacts the first node
of the allocation requesting that the user job be started. In this case,
the job may either be another invocation of \srun\ or a job script
including invocations of \srun . The \slurmd\ on
the remote node responds to the run request, initiating the job manager,
session manager, and user script. An \srun\ executed from within the script
detects that it has access to an allocation and initiates a job step on
some or all of the nodes within the job.
Once the job step is complete, the \srun\ in the job script notifies
the \slurmctld , and terminates. The job script continues executing and
may initiate further job steps. Once the job script completes, the task
thread running the job script collects the exit status and sends a task
exit message to the \slurmctld . The \slurmctld\ notes that the job
is complete and requests that the job epilog be run on all nodes that
were allocated. As the {\tt slurmd}s respond with successful completion
of the epilog, the nodes are returned to the partition.
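As a concrete sketch of this mode (the file names are illustrative; the
options appear in Table~\ref{srun_opts}), the user might submit
\begin{verbatim}
  srun --batch --nodes=2 myscript.sh
\end{verbatim}
where {\tt myscript.sh} is a shell script that launches its parallel
work as job steps:
\begin{verbatim}
  #!/bin/sh
  srun --ntasks=4 ./a.out
\end{verbatim}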
\subsection{Allocate Mode Initiation}
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/allocate-init.eps,scale=0.45} }
\caption{\small Job initiation in allocate mode. Resources are allocated and
\srun\ spawns a shell with access to the resources. When the user runs
an \srun\ from within the shell, a job step is initiated under
the allocation
\label{init-allocate}
\end{figure}
In allocate mode, the user wishes to allocate a job and interactively run
job steps under that allocation. The process of initiation in this mode
is shown in Figure~\ref{init-allocate}. The invoked \srun\ sends
an allocate request to \slurmctld , which, if resources are available,
responds with a list of nodes allocated, job id, etc. The \srun\ process
spawns a shell on the user's terminal with access to the allocation,
then waits for the shell to exit (at which time the job is considered
complete).
An \srun\ initiated within the allocate sub-shell recognizes that
it is running under an allocation and therefore already within a
job. Provided with no other arguments, \srun\ started in this manner
initiates a job step on all nodes within the current job.
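A session in this mode might look as follows, where {\tt sh\$} marks the
prompt of the spawned sub-shell and the node count is illustrative:
\begin{verbatim}
  srun --allocate --nodes=4
  sh$ srun hostname
  sh$ exit
\end{verbatim}
The bare \srun\ inside the sub-shell starts a job step on all four
allocated nodes, and exiting the shell ends the job.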
% Maybe later:
%
% However, the user may select a subset of these nodes implicitly by using
% the \srun\ {\tt --nodes} option, or explicitly by specifying a relative
% nodelist ( {\tt --nodelist=[0-5]} ).
An \srun\ executed from the sub-shell reads the environment and user
options, then notifies the controller that it is starting a job step under
the current job. The \slurmctld\ registers the job step and responds
with a job step credential. \srun\ then initiates the job step using the same
general method as for interactive job initiation.
When the user exits the allocate sub-shell, the original \srun\ receives
exit status, notifies \slurmctld\ that the job is complete, and exits.
The controller runs the epilog on each of the allocated nodes, returning
nodes to the partition as they successfully complete the epilog.
%
% Information in this section seems like it should be some place else
% (Some of it is incorrect as well)
% -mark
%
%\section{Infrastructure}
%
%The state of \slurmctld\ is written periodically to disk for fault
%tolerance. SLURM daemons are initiated via {\tt inittab} using the {\tt
%respawn} option to insure their continuous execution. If the control
%machine itself becomes inoperative, its functions can easily be moved in
%an automated fashion to another node. In fact, the computers designated
%as both primary and backup control machine can easily be relocated as
%needed without loss of the workload by changing the configuration file
%and restarting all SLURM daemons.
%
%The {\tt syslog} tools are used for logging purposes and take advantage
%of the severity level parameter.
%
%Direct use of the Elan interconnect is provided a version of MPI developed
%and supported by Quadrics. SLURM supports this version of MPI with no
%modifications.
%
%SLURM supports the TotalView debugger\cite{Etnus2002}. This requires
%\srun\ to not only maintain a list of nodes used by each job step, but
%also a list of process ids on each node corresponding the application's
%tasks.
\section{Results}
\begin{figure}[htb]
\centerline{\epsfig{file=../figures/times.eps}}
\caption{Time to execute {\em /bin/hostname} with various node counts}
\label{timing}
\end{figure}
We were able to perform some SLURM tests on a 1000-node cluster
in November 2002. Some development was still underway at that time
and tuning had not been performed. The results for executing the
program {\em /bin/hostname} on two tasks per node and various node
counts are shown in Figure~\ref{timing}. We found SLURM performance
to be comparable to the
Quadrics Resource Management System (RMS) \cite{Quadrics2002} for all
job sizes and about 80 times faster than IBM LoadLeveler\cite{LL2002}
at tested job sizes.
\section{Future Plans}
SLURM began production use on LLNL Linux clusters in March 2003
and is available from our web site\cite{SLURM2003}.
While SLURM is able to manage 1000 nodes without difficulty using
sockets and Ethernet, we are reviewing other communication mechanisms
that may offer improved scalability. One possible alternative
is STORM \cite{STORM2001}. STORM uses the cluster interconnect
and Network Interface Cards to provide high-speed communications,
including a broadcast capability. STORM only supports the Quadrics
Elan interconnect at present, but it does offer the promise of improved
performance and scalability.
Looking ahead, we anticipate adding support for additional
interconnects (InfiniBand and the IBM
Blue Gene \cite{BlueGene2002} system\footnote{Blue Gene has a different
interconnect than any supported by SLURM and a 3-D topology with
restrictive allocation constraints.}). We anticipate adding a job
preempt/resume capability to the next release of SLURM. This will
provide an external scheduler with the infrastructure required to perform gang
scheduling. We also anticipate adding a checkpoint/restart capability at
some time in the future, and we plan to support changing the node count
associated with running jobs (as needed for MPI2). Recording resource
use by each parallel job is planned for a future release.
\section{Acknowledgments}
SLURM is jointly developed by LLNL and Linux NetworX.
Contributors to SLURM development include:
\begin{itemize}
\item Jay Windley of Linux NetworX for his development of the plugin
mechanism and work on the security components
\item Joey Ekstrom for his work developing the user tools
\item Kevin Tew for his work developing the communications infrastructure
\item Jim Garlick for his development of the Quadrics Elan interface and
technical guidance
\item Gregg Hommes, Bob Wood, and Phil Eckert for their help designing the
SLURM APIs
\item Mark Seager and Greg Tomaschke for their support of this project
\item Chris Dunlap for technical guidance
\item David Jackson of Linux NetworX for technical guidance
\item Fabrizio Petrini of Los Alamos National Laboratory for his work to
integrate SLURM with STORM communications
\end{itemize}
%\appendix
%\newpage
%\section{Glossary}
%
%\begin{description}
%\item[Authd] User authentication mechanism
%\item[DCE] Distributed Computing Environment
%\item[DFS] Distributed File System (part of DCE)
%\item[DPCS] Distributed Production Control System, a meta-batch system
% and resource manager developed by LLNL
%\item[Globus] Grid scheduling infrastructure
%\item[Kerberos] Authentication mechanism
%\item[LoadLeveler] IBM's parallel job management system
%\item[LLNL] Lawrence Livermore National Laboratory
%\item[Munge] User authentication mechanism developed by LLNL
%\item[NQS] Network Queuing System (a batch system)
%\item[RMS] Quadrics' Resource Management System
%\item[TotalView] Etnus' debugger
%\end{description}
% make the bibliography
\bibliographystyle{unsrt}
\bibliography{project}
% make the back cover page
\makeLLNLBackCover
\end{document}