% Presenter info:
% http://www.linuxclustersinstitute.org/Linux-HPC-Revolution/presenterinfo.html
%
% Main Text Layout
% Set the main text in 10 point Times Roman or Times New Roman (normal),
% (no boldface), using single line spacing. All text should be in a single
% column and justified.
%
% Opening Style (First Page)
% This includes the title of the paper, the author names, organization and
% country, the abstract, and the first part of the paper.
% * Start the title 35mm down from the top margin in Times Roman font, 16
% point bold, range left. Capitalize only the first letter of the first
% word and proper nouns.
% * On a new line, type the authors' names, organizations, and country only
% (not the full postal address, although you may add the name of your
% department), in Times Roman, 11 point italic, range left.
% * Start the abstract with the heading two lines below the last line of the
% address. Set the abstract in Times Roman, 12 point bold.
% * Leave one line, then type the abstract in Times Roman 10 point, justified
% with single line spacing.
%
% Other Pages
% For the second and subsequent pages, use the full 190 x 115mm area and type
% in one column beginning at the upper right of each page, inserting tables
% and figures as required.
%
% We're recommending the Lecture Notes in Computer Science styles from
% Springer Verlag --- google on Springer Verlag LaTeX. These work nicely,
% *except* that they do not work with the hyperref package. Sigh.
%
% http://www.springer.de/comp/lncs/authors.html
\documentclass[10pt,onecolumn,times]{llncs}
\usepackage{verbatim,moreverb}
\usepackage{float}
% Needed for LLNL cover page / TID bibliography style
\usepackage{calc}
\usepackage{epsfig}
\usepackage{graphics}
\usepackage{hhline}
\input{pstricks}
\input{pst-node}
\usepackage{chngpage}
\input{llnlCoverPage}
% Uncomment for DRAFT "watermark"
%\usepackage{draftcopy}
% Times font as default roman
\renewcommand{\sfdefault}{phv}
\renewcommand{\rmdefault}{ptm}
%\renewcommand{\ttdefault}{pcr}
%\usepackage{times}
\renewcommand{\labelitemi}{$\bullet$}
\setlength{\textwidth}{115mm}
\setlength{\textheight}{190mm}
\setlength{\oddsidemargin}{(\paperwidth-\textwidth)/2 - 1in}
\setlength{\topmargin}{(\paperheight-\textheight-\headheight-\headsep-\footskip)/2 - 1in + .5in}
% couple of macros for the title page and document
\def\ctit{SLURM: Simple Linux Utility for Resource \linebreak
Management}
\def\ucrl{UCRL-JC-147996 REV 1}
\def\auth{Morris Jette \\ Mark Grondona}
\def\pubdate{June 23, 2003}
\def\journal{ClusterWorld Conference and Expo}
\begin{document}
% make the cover page
\makeLLNLCover{\ucrl}{\ctit}{\auth}{\journal}{\pubdate}{0in}{0in}
% Title - 16pt bold
\vspace*{35mm}
\noindent\Large
\textbf{\ctit}
\vskip1\baselineskip
% Authors - 11pt
\noindent\large
{Morris Jette and Mark Grondona \\
{\em Lawrence Livermore National Laboratory, USA}}
\vskip2\baselineskip
% Abstract heading - 12pt bold
\noindent\large
\textbf{Abstract}
\vskip1\baselineskip
% Abstract itself - 10pt
\noindent\normalsize
Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job scheduling
system for Linux clusters of thousands of nodes. Components include
machine status, partition management, job management, scheduling, and
stream copy modules. This paper presents an overview of the SLURM
architecture and functionality.
% define some additional macros for the body
\newcommand{\munged}{{\tt munged}}
\newcommand{\srun}{{\tt srun}}
\newcommand{\scancel}{{\tt scancel}}
\newcommand{\squeue}{{\tt squeue}}
\newcommand{\scontrol}{{\tt scontrol}}
\newcommand{\sinfo}{{\tt sinfo}}
\newcommand{\slurmctld}{{\tt slurmctld}}
\newcommand{\slurmd}{{\tt slurmd}}
\section{Overview}
Simple Linux Utility for Resource Management (SLURM)\footnote{A tip of
the hat to Matt Groening and the creators of {\em Futurama},
where Slurm is the most popular carbonated beverage in the universe.}
is a resource management system suitable for use on large and small Linux
clusters. After surveying \cite{Jette2002} resource managers available
for Linux and finding none that were simple, highly scalable, and portable
to different cluster architectures and interconnects, the authors set out
to design a new system.
The resulting design is a resource management system with the following general
characteristics:
\begin{itemize}
\item {\tt Simplicity}: SLURM is simple enough to allow motivated end users
to understand its source code and add functionality. The authors will
avoid the temptation to add features unless they are of general appeal.
\item {\tt Open Source}: SLURM is available to everyone and will remain free.
Its source code is distributed under the GNU General Public
License \cite{GPL2002}.
\item {\tt Portability}: SLURM is written in the C language, with a GNU
{\em autoconf} configuration engine.
While initially written for Linux, other Unix-like operating systems
should be easy porting targets.
SLURM also supports a general purpose ``plugin'' mechanism, which
permits a variety of different infrastructures to be easily supported.
The SLURM configuration file specifies which set of plugin modules
should be used.
\item {\tt Interconnect Independence}: SLURM currently supports UDP/IP-based
communication and the Quadrics Elan3 interconnect. Adding support for
other interconnects, including topology constraints, is straightforward
and utilizes the plugin mechanism described above.
\item {\tt Scalability}: SLURM is designed for scalability to clusters of
thousands of nodes. The SLURM controller for a cluster with 1000 nodes
occupies on the order of 2 MB of memory, and excellent performance has
been demonstrated. Jobs may specify their resource requirements in a
variety of ways, including explicit requirements and ranges of
acceptable values.
\item {\tt Fault Tolerance}: SLURM can handle a variety of failure
modes without terminating workloads, including crashes of the node
running the SLURM controller. User jobs may be configured to continue
execution despite the failure of one or more nodes on which they are
executing. The user command controlling a job, {\tt srun}, may detach
from and reattach to the parallel tasks at any time. Nodes allocated to
a job become available for reuse as soon as the job(s) allocated to them
terminate. If some nodes fail to complete job termination in a
timely fashion because of hardware or software problems, only the
scheduling of those tardy nodes will be affected.
\item {\tt Security}: SLURM employs crypto technology to authenticate
users to services and services to each other with a variety of options
available through the plugin mechanism. SLURM does not assume that its
networks are physically secure, but it does assume that the entire cluster
is within a single administrative domain with a common user base.
\item {\tt System Administrator Friendly}: SLURM utilizes
a simple configuration file and minimizes distributed state.
Its configuration may be changed at any time without impacting running
jobs. Heterogeneous nodes within a cluster may be easily managed. SLURM
interfaces are usable by scripts and its behavior is highly deterministic.
\end{itemize}
\subsection{What Is SLURM?}
As a cluster resource manager, SLURM has three key functions. First,
it allocates exclusive and/or non-exclusive access to resources (compute
nodes) to users for some duration of time so they can perform work.
Second, it provides a framework for starting, executing, and monitoring
work (normally a parallel job) on the set of allocated nodes. Finally,
it arbitrates conflicting requests for resources by managing a queue of
pending work.
Users interact with SLURM through four command line utilities: \srun\
for submitting a job for execution and optionally controlling it
interactively, \scancel\ for terminating a pending or running job,
\squeue\ for monitoring job queues, and \sinfo\ for monitoring partition
and overall system state. System administrators perform privileged
operations through an additional command line utility, {\tt scontrol}.
The central controller daemon, {\tt slurmctld}, maintains the global
state and directs operations. Compute nodes simply run a \slurmd\ daemon
(similar to a remote shell daemon) to export control to SLURM.
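
For illustration, a typical session might exercise these commands as
follows (the node count and job id are illustrative):
\begin{verbatim}
srun --nodes 2 hostname   # run a two-node job interactively
squeue                    # display pending and running jobs
sinfo                     # summarize partition and node states
scancel 42                # cancel job 42
\end{verbatim}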
\subsection{What SLURM Is Not}
SLURM is not a comprehensive cluster administration or monitoring package.
While SLURM knows the state of its compute nodes, it makes no attempt
to put this information to use in other ways, such as with a general
purpose event logging mechanism or a back-end database for recording
historical state. It is expected that SLURM will be deployed in a
cluster with other tools performing those functions.
SLURM is not a meta-batch system like Globus \cite{Globus2002} or DPCS
(Distributed Production Control System) \cite{DPCS2002}. SLURM supports
resource management across a single cluster.
SLURM is not a sophisticated batch system. In fact, it was expressly
designed to provide high-performance parallel job management while
leaving scheduling decisions to an external entity. Its default scheduler
implements First-In First-Out (FIFO). An external entity can establish
a job's initial priority through a plugin. An external scheduler may
also submit, signal, and terminate jobs as well as reorder the queue of
pending jobs via the API.
\section{Architecture}
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/arch.eps,scale=0.35}}
\caption{\small SLURM architecture}
\label{arch}
\end{figure}
As shown in Figure~\ref{arch}, SLURM consists of a \slurmd\ daemon
running on each compute node, a central \slurmctld\ daemon running
on a management node (with optional fail-over twin), and five command
line utilities: {\tt srun}, {\tt scancel}, {\tt sinfo}, {\tt squeue},
and {\tt scontrol}, which can run anywhere in the cluster.
The entities managed by these SLURM daemons include {\em nodes}, the
compute resource in SLURM, {\em partitions}, which group nodes into
disjoint logical sets, {\em jobs}, or allocations of resources assigned
to a user for a specified amount of time, and {\em job steps}, which
are sets of (possibly parallel) tasks within a job. Each job in the
priority-ordered queue is allocated nodes within a single partition.
Once an allocation request fails, no lower priority jobs for that
partition will be considered for a resource allocation. Once a job is
assigned a set of nodes, the user is able to initiate parallel work in
the form of job steps in any configuration within the allocation. For
instance, a single job step may be started that utilizes all nodes
allocated to the job, or several job steps may independently use a
portion of the allocation.
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/entities.eps,scale=0.5}}
\caption{\small SLURM entities: nodes, partitions, jobs, and job steps}
\label{entities}
\end{figure}
Figure~\ref{entities} further illustrates the interrelation of these
entities as they are managed by SLURM by showing a group of
compute nodes split into two partitions. Partition 1 is running one job,
with one job step utilizing the full allocation of that job. The job
in Partition 2 has only one job step using half of the original job
allocation. That job might initiate additional job steps to utilize
the remaining nodes of its allocation.
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/slurm-arch.eps,scale=0.5}}
\caption{SLURM architecture - subsystems}
\label{archdetail}
\end{figure}
Figure~\ref{archdetail} shows the subsystems that are implemented
within the \slurmd\ and \slurmctld\ daemons. These subsystems are
explained in more detail below.
\subsection{Slurmd}
\slurmd\ is a multi-threaded daemon running on each compute node and
can be compared to a remote shell daemon: it reads the common SLURM
configuration file and saved state information,
notifies the controller that it is active, waits
for work, executes the work, returns status, then waits for more work.
Because it initiates jobs for other users, it must run as user {\em root}.
It also asynchronously exchanges node and job status with {\tt slurmctld}.
The only job information it has at any given time pertains to its
currently executing jobs. \slurmd\ has four major components:
\begin{itemize}
\item {\tt Machine and Job Status Services}: Respond to controller
requests for machine and job state information and send asynchronous
reports of some state changes (e.g., \slurmd\ startup) to the controller.
\item {\tt Remote Execution}: Start, manage, and clean up after a set
of processes (typically belonging to a parallel job) as dictated by
the \slurmctld\ daemon or an \srun\ or \scancel\ command. Starting a
process may include executing a prolog program, setting process limits,
setting real and effective uid, establishing environment variables,
setting working directory, allocating interconnect resources, setting core
file paths, initializing stdio, and managing process groups. Terminating
a process may include terminating all members of a process group and
executing an epilog program.
\item {\tt Stream Copy Service}: Allow handling of stderr, stdout, and
stdin of remote tasks. Job input may be redirected
from a single file or multiple files (one per task), an
\srun\ process, or /dev/null. Job output may be saved into local files or
returned to the \srun\ command. Regardless of the location of stdout/err,
all job output is locally buffered to avoid blocking local tasks.
\item {\tt Job Control}: Allow asynchronous interaction with the Remote
Execution environment by propagating signals or explicit job termination
requests to any set of locally managed processes.
\end{itemize}
\subsection{Slurmctld}
Most SLURM state information exists in {\tt slurmctld}, also known as
the controller. \slurmctld\ is multi-threaded with independent read
and write locks for the various data structures to enhance scalability.
When \slurmctld\ starts, it reads the SLURM configuration file and
any previously saved state information. Full controller state
information is written to disk periodically, with incremental changes
written to disk immediately for fault tolerance. \slurmctld\ runs in
either master or standby mode, depending on the state of its fail-over
twin, if any. \slurmctld\ need not execute as user {\em root}. In fact,
it is recommended that a unique user entry be created for executing
\slurmctld\ and that user must be identified in the SLURM configuration
file as {\tt SlurmUser}. \slurmctld\ has three major components:
\begin{itemize}
\item {\tt Node Manager}: Monitors the state of each node in the cluster.
It polls {\tt slurmd}s for status periodically and receives state
change notifications from \slurmd\ daemons asynchronously. It ensures
that nodes have the prescribed configuration before being considered
available for use.
\item {\tt Partition Manager}: Groups nodes into non-overlapping sets
called partitions. Each partition can have associated with it
various job limits and access controls. The Partition Manager also
allocates nodes to jobs based on node and partition states and
configurations. Requests to initiate jobs come from the Job Manager.
\scontrol\ may be used to administratively alter node and partition
configurations.
\item {\tt Job Manager}: Accepts user job requests and places pending jobs
in a priority-ordered queue. The Job Manager is awakened on a periodic
basis and whenever there is a change in state that might permit a job to
begin running, such as job completion, job submission, partition {\em up}
transition, node {\em up} transition, etc. The Job Manager then makes
a pass through the priority-ordered job queue. The highest priority
jobs for each partition are allocated resources when possible. As soon as
an allocation failure occurs for any partition, no lower-priority jobs
for that partition are considered for initiation. After completing the
scheduling cycle, the Job Manager's scheduling thread sleeps. Once a
job has been allocated resources, the Job Manager transfers necessary
state information to those nodes, permitting the job to commence execution.
When the Job Manager detects that all nodes associated with a job
have completed their work, it initiates cleanup and performs another
scheduling cycle as described above.
\end{itemize}
\subsection{Command Line Utilities}
The command line utilities offer users access to remote execution and
job control. They also permit administrators to dynamically change
the system configuration. These commands use SLURM APIs that are
directly available for more sophisticated applications.
\begin{itemize}
\item {\tt scancel}: Cancel a running or a pending job or job step,
subject to authentication and authorization. This command can also be
used to send an arbitrary signal to all processes on all nodes associated
with a job or job step.
\item {\tt scontrol}: Perform privileged administrative commands
such as bringing down a node or partition in preparation for maintenance.
Many \scontrol\ functions can only be executed by privileged users.
\item {\tt sinfo}: Display a summary of partition and node information.
An assortment of filtering and output format options is available.
\item {\tt squeue}: Display the queue of running and waiting jobs and/or
job steps. A wide assortment of filtering, sorting, and output format
options is available.
\item {\tt srun}: Allocate resources, submit jobs to the SLURM queue,
and initiate parallel tasks (job steps). Every set of executing parallel
tasks has an associated \srun\ that initiated it and, if the \srun\
persists, manages it. Jobs may be submitted for later execution
(e.g., batch), in which case \srun\ terminates after job submission.
Jobs may also be submitted for interactive execution, where \srun\ keeps
running to shepherd the running job. In this case, \srun\ negotiates
connections with remote {\tt slurmd}s for job initiation and to get
stdout and stderr, forward stdin,\footnote{\srun\ command line options
select the stdin handling method, such as broadcast to all tasks, or
send only to task 0.} and respond to signals from the user. \srun\
may also be instructed to allocate a set of resources and spawn a shell
with access to those resources.
\end{itemize}
\subsection{Plugins}
In order to simplify the use of different infrastructures,
SLURM uses a general purpose plugin mechanism. A SLURM plugin is a
dynamically linked code object that is loaded explicitly at run time
by the SLURM libraries. A plugin provides a customized implementation
of a well-defined API connected to tasks such as authentication,
interconnect fabric, and task scheduling. A common set of functions is defined
for use by all of the different infrastructures of a particular variety.
For example, the authentication plugin must define functions such as
{\tt slurm\_auth\_create} to create a credential, {\tt slurm\_auth\_verify}
to verify a credential to approve or deny authentication,
{\tt slurm\_auth\_get\_uid} to get the uid associated with a specific
credential, etc. It also must define the data structure used, a plugin
type, a plugin version number, etc. When a SLURM daemon is initiated, it
reads the configuration file to determine which of the available plugins
should be used. For example, {\em AuthType=auth/authd} says to use the
plugin for authd based authentication and {\em PluginDir=/usr/local/lib}
identifies the directory in which to find the plugin.
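
As a concrete illustration, a trivial ``none''-style authentication
plugin might be sketched in C as follows. The credential structure and
return conventions here are hypothetical; only the function names come
from the plugin API description above.
\begin{verbatim}
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

/* Hypothetical null credential: records the caller's identity
 * without any cryptographic protection (testing use only). */
typedef struct slurm_auth_credential {
    uid_t uid;
    gid_t gid;
} slurm_auth_credential_t;

/* Create a credential for the current user. */
slurm_auth_credential_t *slurm_auth_create(void)
{
    slurm_auth_credential_t *cred = malloc(sizeof(*cred));
    if (cred != NULL) {
        cred->uid = geteuid();
        cred->gid = getegid();
    }
    return cred;
}

/* Verify a credential; the null plugin accepts any non-null one. */
int slurm_auth_verify(slurm_auth_credential_t *cred)
{
    return (cred != NULL) ? 0 : -1;    /* 0 == authenticated */
}

/* Return the uid bound to a verified credential. */
uid_t slurm_auth_get_uid(slurm_auth_credential_t *cred)
{
    return cred->uid;
}
\end{verbatim}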
\subsection{Communications Layer}
SLURM presently uses Berkeley sockets for communications. However,
we anticipate using the plugin mechanism to permit use of
other communications layers. At LLNL we use Ethernet for SLURM
communications and the Quadrics Elan switch exclusively for user
applications. The SLURM configuration file permits the identification
of each node's hostname as well as the name by which it is to be
addressed for communications. For example, if a control machine known
as {\em mcri} is to be contacted using the name {\em emcri} (say, to
indicate an Ethernet communications path), this is represented in the
configuration file as {\em ControlMachine=mcri ControlAddr=emcri}.
The name used for communication is the same as the hostname unless
otherwise specified.
Internal SLURM functions pack and unpack data structures in machine
independent format. We considered the use of XML style messages, but we
felt this would adversely impact performance (albeit slightly). If XML
support is desired, it is straightforward to perform a translation and
use the SLURM APIs.
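
The following sketch conveys the flavor of such machine-independent
packing; the buffer type and function names are hypothetical, but the
technique (fixed-width integers converted to network byte order) is the
one described above.
\begin{verbatim}
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>    /* htonl(), ntohl() */

/* Hypothetical fixed-size buffer; offset marks the position. */
typedef struct {
    unsigned char data[1024];
    size_t offset;
} buf_t;

/* Append a 32-bit value in network byte order. */
void pack32(uint32_t value, buf_t *buf)
{
    uint32_t net = htonl(value);
    memcpy(buf->data + buf->offset, &net, sizeof(net));
    buf->offset += sizeof(net);
}

/* Extract a 32-bit value, converting to host byte order. */
uint32_t unpack32(buf_t *buf)
{
    uint32_t net;
    memcpy(&net, buf->data + buf->offset, sizeof(net));
    buf->offset += sizeof(net);
    return ntohl(net);
}
\end{verbatim}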
\subsection{Security}
SLURM has a simple security model: any user of the cluster may submit
parallel jobs for execution and cancel his or her own jobs. Any user may view
SLURM configuration and state information. Only privileged users
may modify the SLURM configuration, cancel any job, or perform other
restricted activities. Privileged users in SLURM include the users
{\em root} and {\em SlurmUser} (as defined in the SLURM configuration file).
If permission to modify SLURM configuration is required by others, set-uid
programs may be used to grant specific permissions to specific users.
\subsubsection{Communication Authentication.}
Historically, inter-node authentication has been accomplished via the use
of reserved ports and set-uid programs. In this scheme, daemons check the
source port of a request to ensure that it is less than a certain value
and thus only accessible by {\em root}. The communications over that
connection are then implicitly trusted. Because reserved ports are a
limited resource and set-uid programs are a possible security concern,
we have employed a credential-based authentication scheme that
does not depend on reserved ports. In this design, a SLURM authentication
credential is attached to every message and authoritatively verifies the
uid and gid of the message originator. Once recipients of SLURM messages
verify the validity of the authentication credential, they can use the uid
and gid from the credential as the authoritative identity of the sender.
The actual implementation of the SLURM authentication credential is
relegated to an ``auth'' plugin. We presently have implemented three
functional authentication plugins: authd~\cite{Authd2002},
Munge, and none. The ``none'' authentication type employs a null
credential and is only suitable for testing and networks where security
is not a concern. Both the authd and Munge implementations employ
cryptography to generate a credential for the requesting user that
may then be authoritatively verified on any remote nodes. However,
authd assumes a secure network and Munge does not. Other authentication
implementations, such as a credential based on Kerberos, should be easy
to develop using the auth plugin API.
\subsubsection{Job Authentication.}
When resources are allocated to a user by the controller, a ``job step
credential'' is generated by combining the uid, job id, step id, the list
of resources allocated (nodes), and the credential lifetime and signing
the result with the \slurmctld\ private key. This credential grants the
user access to allocated resources and removes the burden from \slurmd\
to contact the controller to verify requests to run processes. \slurmd\
verifies the signature on the credential against the controller's public
key and runs the user's request if the credential is valid. Part of the
credential signature is also used to validate stdout, stdin,
and stderr connections from \slurmd\ to \srun .
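
The credential's content can be pictured as the hypothetical C
structure below; the field names are illustrative, but the fields
themselves (uid, job id, step id, node list, lifetime, and signature)
are exactly those enumerated above.
\begin{verbatim}
#include <stdint.h>
#include <sys/types.h>
#include <time.h>

#define SIG_LEN 128    /* hypothetical signature size */

/* Illustrative job step credential layout.  slurmctld fills in
 * the fields and signs them with its private key; slurmd checks
 * the signature with the controller's public key before running
 * any work under this credential. */
typedef struct job_step_credential {
    uid_t         uid;             /* user the tasks run as */
    uint32_t      job_id;
    uint32_t      step_id;
    char          node_list[256];  /* e.g. "dev[6-7]" */
    time_t        expiration;      /* credential lifetime */
    unsigned char signature[SIG_LEN];
} job_step_credential_t;
\end{verbatim}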
\subsubsection{Authorization.}
Access to partitions may be restricted via a {\em RootOnly} flag.
If this flag is set, job submit or allocation requests to this partition
are only accepted if the effective uid originating the request is a
privileged user. A privileged user may submit a job as any other user.
This may be used, for example, to provide specific external schedulers
with exclusive access to partitions. Individual users will not be
permitted to directly submit jobs to such a partition, which would
prevent the external scheduler from effectively managing it. Access to
partitions may also be restricted to users who are members of specific
Unix groups using an {\em AllowGroups} specification.
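
For example, a partition reserved for an external scheduler might be
described by configuration lines of roughly the following form (the
partition, node, and group names here are hypothetical):
\begin{verbatim}
PartitionName=batch  Nodes=linux[001-128] RootOnly=YES AllowGroups=ALL
PartitionName=debug  Nodes=linux[129-130] RootOnly=NO  AllowGroups=staff
\end{verbatim}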
\subsection{Example: Executing a Batch Job}
In this example a user wishes to run a job in batch mode, in which \srun\
returns immediately and the job executes in the background when resources
are available. The job is a two-node run of a script containing {\em mping},
a simple MPI application. The user submits the job:
\begin{verbatim}
srun --batch --nodes 2 --nprocs 2 mping.sh
\end{verbatim}
The script {\tt mping.sh} contains:
\begin{verbatim}
#!/bin/sh
srun hostname
srun mping 1 1048576
\end{verbatim}
The initial \srun\ command authenticates the user to the controller and
submits the job request. The request includes the \srun\ environment,
current working directory, and command line option information. By
default, stdout and stderr are sent to files in the current working
directory and stdin is copied from {\tt /dev/null}.
The controller consults the Partition Manager to test whether the job
will ever be able to run. If the user has requested a non-existent partition,
a non-existent constraint,
etc., the Partition Manager returns an error and the request is discarded.
The failure is reported to \srun , which informs the user and exits, for
example:
\begin{verbatim}
srun: error: Unable to allocate resources: Invalid partition name
\end{verbatim}
On successful submission, the controller assigns the job a unique
{\em SLURM job id}, adds it to the job queue, and returns the job's
job id to \srun , which reports this to the user and exits, returning
success to the user's shell:
\begin{verbatim}
srun: jobid 42 submitted
\end{verbatim}
The controller awakens the Job Manager, which tries to run jobs starting
at the head of the priority-ordered job queue. It finds job {\em 42} and
makes a successful request to the Partition Manager to allocate two nodes
from the default (or requested) partition: {\em dev6} and {\em dev7}.
The Job Manager then sends a request to the \slurmd\ on the first
node in the job {\em dev6} to execute the script specified on the user's
command line.\footnote{Had the user specified an executable file rather
than a job script, an \srun\ program would be initiated on the first
node and \srun\ would initiate the executable with the desired task
distribution.} The Job Manager also sends a copy of the environment,
current working directory, stdout and stderr location, along with other
options. Additional environment variables are appended to the user's
environment before it is sent to the remote \slurmd\ detailing the job's
resources, such as the SLURM job id ({\em 42}) and the allocated nodes
({\em dev[6-7]}).
The remote \slurmd\ establishes the new environment, executes a SLURM
prolog program (if one is configured) as user {\em root}, and executes
the job script (or command) as the submitting user. The \srun\ within
the job script detects that it is running with allocated resources from
the presence of a {\tt SLURM\_JOBID} environment variable. \srun\
connects to \slurmctld\ to request a job step to run on all nodes of
the current job. \slurmctld\ validates the request and replies with a
job step credential and switch resources. \srun\ then contacts {\tt slurmd}s
running on both {\em dev6} and {\em dev7}, passing the job step credential,
environment, current working directory, command path and arguments,
and interconnect information. The {\tt slurmd}s verify the valid job
step credential, connect stdout and stderr back to \srun , establish
the environment, and execute the command as the submitting user.
Unless instructed otherwise by the user, stdout and stderr are
copied to a file in the current working directory by \srun :
\begin{verbatim}
/path/to/cwd/slurm-42.out
\end{verbatim}
The user may examine output files at any time if they reside
in a globally accessible directory. In this example
{\tt slurm-42.out} would contain the output of the job script's two
commands ({\tt hostname} and {\em mping}):
\begin{verbatim}
dev6
dev7
1 pinged 0: 1 bytes 5.38 uSec 0.19 MB/s
1 pinged 0: 2 bytes 5.32 uSec 0.38 MB/s
1 pinged 0: 4 bytes 5.27 uSec 0.76 MB/s
1 pinged 0: 8 bytes 5.39 uSec 1.48 MB/s
...
1 pinged 0: 1048576 bytes 4682.97 uSec 223.91 MB/s
\end{verbatim}
When the tasks complete execution, \srun\ is notified by \slurmd\ of
each task's exit status. \srun\ reports job step completion to the Job
Manager and exits. \slurmd\ detects when the job script terminates,
notifies the Job Manager of its exit status, and begins cleanup. The Job
Manager directs the {\tt slurmd}s formerly assigned to the job to run
the SLURM epilog program (if one is configured) as user {\em root}.
Finally, the Job Manager releases the resources allocated to job {\em 42}
and updates the job status to {\em complete}. The record of a job's
existence is eventually purged.
\subsection{Example: Executing an Interactive Job}
In this example a user wishes to run the same {\em mping} command
in interactive mode, in which \srun\ blocks while the job executes
and stdout/stderr of the job are copied onto stdout/stderr of {\tt srun}.
The user submits the job, this time without the {\tt --batch} option:
\begin{verbatim}
srun --nodes 2 --nprocs 2 mping 1 1048576
\end{verbatim}
The \srun\ command authenticates the user to the controller and makes a
request for a resource allocation and job step. The Job Manager
responds with a list of nodes, a job step credential, and interconnect
resources on successful allocation. If resources are not immediately
available, the request terminates or blocks depending on user options.
If the request is successful, \srun\ forwards the job run request to
the assigned {\tt slurmd}s in the same manner as the \srun\ in the batch
job script. In this case, the user sees the program output on stdout of
{\tt srun}:
\begin{verbatim}
1 pinged 0: 1 bytes 5.38 uSec 0.19 MB/s
1 pinged 0: 2 bytes 5.32 uSec 0.38 MB/s
1 pinged 0: 4 bytes 5.27 uSec 0.76 MB/s
1 pinged 0: 8 bytes 5.39 uSec 1.48 MB/s
...
1 pinged 0: 1048576 bytes 4682.97 uSec 223.91 MB/s
\end{verbatim}
When the job terminates, \srun\ receives an EOF (End Of File) on each
stream and closes it, then receives the task exit status from each
{\tt slurmd}. The \srun\ process notifies \slurmctld\ that the job is
complete and terminates. The controller contacts all {\tt slurmd}s allocated
to the terminating job and issues a request to run the SLURM epilog,
then releases the job's resources.
Most signals received by \srun\ while the job is executing are
transparently forwarded to the remote tasks. SIGINT (generated by
Control-C) is a special case: it causes \srun\ to report remote task
status unless two SIGINTs are received in rapid succession, in which
case the running job is terminated.
SIGQUIT (Control-$\backslash$) is another special case. SIGQUIT forces
termination of the running job.
\section{Slurmctld Design}
\slurmctld\ is modular and multi-threaded with independent read and
write locks for the various data structures to enhance scalability.
The controller includes the following subsystems: Node Manager, Partition
Manager, and Job Manager. Each of these subsystems is described in
detail below.
\subsection{Node Management}
The Node Manager monitors the state of nodes. Node information monitored
includes:
\begin{itemize}
\item Count of processors on the node
\item Size of real memory on the node
\item Size of temporary disk storage
\item State of node (RUN, IDLE, DRAINED, etc.)
\item Weight (preference in being allocated work)
\item Feature (arbitrary description)
\item IP address
\end{itemize}
The SLURM administrator can specify a list of system node names using
a numeric range in the SLURM configuration file or in the SLURM tools
(e.g., ``{\em NodeName=linux[001-512] CPUs=4 RealMemory=1024 TmpDisk=4096 \linebreak
Weight=4 Feature=Linux}''). These values for CPUs, RealMemory, and
TmpDisk are considered to be the minimal node configuration values
acceptable for the node to enter into service. The \slurmd\ registers
whatever resources actually exist on the node, and this is recorded
by the Node Manager. Actual node resources are checked on \slurmd\
initialization and periodically thereafter. If a node registers with
fewer resources than configured, it is placed in DOWN state and
the event logged. Otherwise, the actual resources reported are recorded
and possibly used as a basis for scheduling (e.g., if the node has more
RealMemory than recorded in the configuration file, the actual node
configuration may be used for determining suitability for any application;
alternately, the data in the configuration file may be used for possibly
improved scheduling performance). Note that the node name syntax with numeric
range permits even very large heterogeneous clusters to be described in
only a few lines. In fact, a smaller number of unique configurations
can provide SLURM with greater efficiency in scheduling work.
{\em Weight} is used to order available nodes in assigning work to
them. In a heterogeneous cluster, more capable nodes (e.g., larger memory
or faster processors) should be assigned a larger weight. The units
are arbitrary and should reflect the relative value of each resource.
Pending jobs are assigned the least capable nodes (i.e., lowest weight)
that satisfy their requirements. This tends to leave the more capable
nodes available for those jobs requiring those capabilities.
{\em Feature} is an arbitrary string describing the node, such as a
particular software package, file system, or processor speed. While the
feature does not have a numeric value, one might include a numeric value
within the feature name (e.g., ``1200MHz'' or ``16GB\_Swap''). If the
nodes on the cluster have disjoint features (e.g., different ``shared''
file systems), one should identify these as features (e.g., ``FS1'',
``FS2'', etc.). A program may then specify that all nodes allocated to
it should have the same feature, but that any of the specified features
are acceptable (e.g., ``{\em Feature=FS1$|$FS2$|$FS3}'' means the job should be
allocated nodes that all have the feature ``FS1'' or they all have feature
``FS2,'' etc.).
Node records are kept in an array with hash table lookup. If nodes are
given names containing sequence numbers (e.g., ``lx01'', ``lx02'', etc.),
the hash table permits specific node records to be located very quickly;
therefore, this is our recommended naming convention for larger clusters.
An API is available to view any of this information and to update some
node information (e.g., state). APIs designed to return SLURM state
information permit the specification of a time stamp. If the requested
data has not changed since the time stamp specified by the application,
the application's current information need not be updated. The API
returns a brief ``No Change'' response rather than returning relatively
verbose state information. Changes in node configurations (e.g., node
count, memory, etc.) or the nodes actually in the cluster should be
reflected in the SLURM configuration files. SLURM configuration may be
updated without disrupting any jobs.
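
The server side of this time-stamp protocol reduces to a comparison
like the one sketched below; the type and constant names are
hypothetical.
\begin{verbatim}
#include <time.h>

enum { SLURM_SUCCESS = 0, SLURM_NO_CHANGE_IN_DATA = 1 };

/* Hypothetical node table; last_change records when any node
 * record was last modified. */
typedef struct {
    time_t last_change;
    /* ... node records ... */
} node_table_t;

/* If nothing changed since the requester's time stamp, send a
 * brief "no change" reply instead of the full state dump. */
int load_node_state(const node_table_t *tbl, time_t requester_stamp)
{
    if (tbl->last_change <= requester_stamp)
        return SLURM_NO_CHANGE_IN_DATA;
    /* ... otherwise pack and return all node records ... */
    return SLURM_SUCCESS;
}
\end{verbatim}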
\subsection{Partition Management}
The Partition Manager identifies groups of nodes to be used for execution
of user jobs. One might consider this the actual resource scheduling
component. Data associated with a partition includes:
\begin{itemize}
\item Name
\item RootOnly flag to indicate that only users {\em root} or
{\tt SlurmUser} may allocate resources in this partition (for any user)
\item List of associated nodes
\item State of partition (UP or DOWN)
\item Maximum time limit for any job
\item Minimum and maximum nodes allocated to any single job
\item List of groups permitted to use the partition (defaults to ALL)
\item Shared access (YES, NO, or FORCE)
\item Default partition (if no partition is specified in a job request)
\end{itemize}
It is possible to alter most of this data in real-time in order to affect
the scheduling of pending jobs (currently executing jobs would not be
affected). This information is confined to the controller machine(s)
for better scalability. It is used by the Job Manager (and possibly an
external scheduler), which either executes only on the control machine
or communicates only with the control machine.
The nodes in a partition may be designated for exclusive or non-exclusive
use by a job. A {\tt shared} value of YES indicates that jobs may
share nodes on request. A {\tt shared} value of NO indicates that
jobs are always given exclusive use of allocated nodes. A {\tt shared}
value of FORCE indicates that jobs are never ensured exclusive
access to nodes, but SLURM may initiate multiple jobs on the nodes
for improved system utilization and responsiveness. In this case,
job requests for exclusive node access are not honored. Non-exclusive
access may negatively impact the performance of parallel jobs or cause
them to fail upon exhausting shared resources (e.g., memory or disk
space). However, shared resources may improve overall system utilization
and responsiveness. The proper support of shared resources, including
enforcement of limits on these resources, entails a substantial amount of
effort, which we are not presently planning to expend. However, we have
designed SLURM so as to not preclude the addition of such a capability
at a later time if so desired. Future enhancements could include
constraining jobs to a specific CPU count or memory size within a
node, which could be used to effectively space-share individual nodes.
The Partition Manager will allocate nodes to pending jobs on request
from the Job Manager.
Submitted jobs can specify desired partition, time limit, node count
(minimum and maximum), CPU count (minimum), task count, the need for
contiguous node assignment, and an explicit list of nodes to be included
and/or excluded in its allocation. Nodes are selected so as to satisfy
all job requirements. For example, a job requesting four CPUs and four
nodes will actually be allocated eight CPUs and four nodes in the case
of all nodes having two CPUs each. The request may also indicate node
configuration constraints such as minimum real memory or CPUs per node,
required features, shared access, etc. Overall there are 13 different
parameters that may identify resource requirements for a job.
Nodes are selected for possible assignment to a job based on the
job's configuration requirements (e.g., partition specification, minimum
memory, temporary disk space, features, node list, etc.). The selection
is refined by determining which nodes are up and available for use.
Groups of nodes are then considered in order of weight, with the nodes
having the lowest {\em Weight} preferred. Finally, the physical location
of the nodes is considered.
Bit maps are used to indicate which nodes are up, idle, associated
with each partition, and associated with each unique configuration.
This technique permits scheduling decisions to normally be made by
performing a small number of tests followed by fast bit map manipulations.
If so configured, a job's resource requirements would be compared with
the (relatively small number of) node configuration records, each of
which has an associated bit map. Usable node configuration bitmaps would
be ANDed with the selected partition's bit map, ANDed with the UP node
bit map and possibly ANDed with the IDLE node bit map (this last test
depends on the desire to share resources). This method can eliminate
tens of thousands of individual node configuration comparisons that
would otherwise be required in large heterogeneous clusters.
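
A sketch of this bit map test appears below, using one 64-bit word per
64 nodes; the type and helper names are hypothetical.
\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

#define MAP_WORDS 16    /* 16 x 64 bits covers 1024 nodes */
typedef struct { uint64_t w[MAP_WORDS]; } bitmap_t;

/* candidates = config AND partition AND up [AND idle]:
 * a handful of word-wide ANDs replaces tens of thousands of
 * per-node configuration comparisons. */
void select_candidates(const bitmap_t *config, const bitmap_t *part,
                       const bitmap_t *up, const bitmap_t *idle,
                       int need_idle, bitmap_t *candidates)
{
    for (size_t i = 0; i < MAP_WORDS; i++) {
        uint64_t bits = config->w[i] & part->w[i] & up->w[i];
        if (need_idle)    /* skip this AND if sharing is allowed */
            bits &= idle->w[i];
        candidates->w[i] = bits;
    }
}
\end{verbatim}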
The actual selection of nodes for allocation to a job is currently tuned
for the Quadrics interconnect. This hardware supports hardware message
broadcast only if the nodes are contiguous. If a job is not allocated
contiguous nodes, a slower software based multi-cast mechanism is used.
Jobs will be allocated contiguous nodes to the extent possible (in
fact, contiguous node allocation may be specified as a requirement on
job submission). If contiguous nodes cannot be allocated to a job, it
will be allocated resources from the minimum number of sets of contiguous
nodes possible. If multiple sets of contiguous nodes can be allocated
to a job, the one that most closely fits the job's requirements will
be used. This technique will leave the largest contiguous sets of nodes
intact for jobs requiring them.
The Partition Manager builds a list of nodes to satisfy a job's request.
It also caches the IP addresses of each node and provides this information
to \srun\ at job initiation time for improved performance.
The failure of any node to respond to the Partition Manager only affects
jobs associated with that node. In fact, a job may indicate it should
continue executing even if allocated nodes cease responding. In this
case, the job needs to provide for its own fault tolerance. All other
jobs and nodes in the cluster will continue to operate after a node
failure. No additional work is allocated to the failed node, and it will
be pinged periodically to determine when it resumes responding. The node
may then be returned to service (depending on the {\tt ReturnToService}
parameter in the SLURM configuration).
\subsection{Configuration}
A single configuration file applies to all SLURM daemons and commands.
Most of this information is used only by the controller. Only the
host and port information is referenced by most commands. A sample
configuration file is shown in Table~\ref{sample_config}.
\begin{table}[t]
\begin{center}
\begin{tabular}[c]{c}
\\
\fbox{
\begin{minipage}[c]{0.8\linewidth}
{\tiny \verbatiminput{sample.config} }
\end{minipage}
}
\\
\end{tabular}
\caption{Sample SLURM config file \label{sample_config}}
\end{center}
\end{table}
\subsection{Job Manager}
There are a multitude of parameters associated with each job, including:
\begin{itemize}
\item Job name
\item Uid
\item Job id
\item Working directory
\item Partition
\item Priority
\item Node constraints (processors, memory, features, etc.)
\end{itemize}
Job records have an associated hash table for rapidly locating
specific records. They also have bit maps of requested and/or
allocated nodes (as described above).
The core functions supported by the Job Manager include:
\begin{itemize}
\item Request resource (job may be queued)
\item Reset priority of a job
\item Status job (including node list, memory and CPU use data)
\item Signal job (send arbitrary signal to all processes associated
with a job)
\item Terminate job (remove all processes)
\item Change node count of running job (could fail if insufficient
resources are available)
%\item Preempt/resume job (future)
%\item Checkpoint/restart job (future)
\end{itemize}
Jobs are placed in a priority-ordered queue and allocated nodes as
selected by the Partition Manager. SLURM implements a very simple default
scheduling algorithm, namely FIFO. An attempt is made to schedule pending
jobs on a periodic basis and whenever any change in job, partition,
or node state might permit the scheduling of a job.
We are aware that this scheduling algorithm does not satisfy the needs
of many customers, and we provide the means for establishing other
scheduling algorithms. Before a newly arrived job is placed into the
queue, an external scheduler plugin assigns its initial priority.
A plugin function is also called at the start of each scheduling
cycle to modify job or system state as desired. SLURM APIs permit an
external entity to alter the priorities of jobs at any time and re-order
the queue as desired. The Maui Scheduler \cite{Jackson2001,Maui2002}
is one example of an external scheduler suitable for use with SLURM.
LLNL uses DPCS \cite{DPCS2002} as SLURM's external scheduler.
DPCS is a meta-scheduler with flexible scheduling algorithms that
suit our needs well.
It also provides the scalability required for this application.
DPCS maintains pending job state internally and transfers jobs to SLURM
(or another underlying resource manager) only when they are to begin
execution.
By not transferring jobs to a particular resource manager earlier,
jobs are assured of being initiated on the first resource satisfying
their requirements, whether a Linux cluster with SLURM or an IBM SP
with LoadLeveler (assuming a highly flexible application).
This mode of operation may also be suitable for computational grid
schedulers.
In a future release, the Job Manager will collect resource consumption
information (CPU time used, CPU time allocated, and real memory used)
associated with a job from the \slurmd\ daemons. Presently, only the
wall-clock run time of a job is monitored. When a job approaches its
time limit (as defined by wall-clock execution time) or an imminent
system shutdown has been scheduled, the job is terminated. The actual
termination process is to notify \slurmd\ daemons on nodes allocated
to the job of the termination request. The \slurmd\ job termination
procedure, including job signaling, is described in Section~\ref{slurmd}.
One may think of a job as described above as an allocation of resources
rather than a collection of parallel tasks. The job script executes
\srun\ commands to initiate the parallel tasks or ``job steps.'' The
job may include multiple job steps, executing sequentially and/or
concurrently either on separate or overlapping nodes. Job steps have
associated with them specific nodes (some or all of those associated with
the job), tasks, and a task distribution (cyclic or block) over the nodes.
The management of job steps is considered a component of the Job Manager.
Supported job step functions include:
\begin{itemize}
\item Register job step
\item Get job step information
\item Run job step request
\item Signal job step
\end{itemize}
Job step information includes a list of nodes (entire set or subset of
those allocated to the job) and a credential used to bind communications
between the tasks across the interconnect. The \slurmctld\ constructs
this credential and sends it to the \srun\ initiating the job step.
\subsection{Fault Tolerance}
SLURM supports system level fault tolerance through the use of a secondary
or ``backup'' controller. The backup controller, if one is configured,
periodically pings the primary controller. Should the primary controller
cease responding, the backup loads state information from the last state
save and assumes control. When the primary controller is returned to
service, it tells the backup controller to save state and terminate.
The primary then loads state and assumes control.
SLURM utilities and API users read the configuration file and initially try
to contact the primary controller. Should that attempt fail, an attempt
is made to contact the backup controller before returning an error.
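
In code, this fail-over logic amounts to the sketch below; the connect
helper and the port number are hypothetical, but the
try-primary-then-backup order is as described.
\begin{verbatim}
#include <string.h>
#include <netdb.h>
#include <unistd.h>
#include <sys/socket.h>

/* Open a TCP connection to host:port; returns a socket fd or -1. */
static int try_connect(const char *host, const char *port)
{
    struct addrinfo hints, *res;
    int fd;

    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;
    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

/* Contact the primary controller first; fall back to the backup
 * (if configured) before reporting an error, as the utilities do. */
int connect_controller(const char *primary, const char *backup)
{
    int fd = try_connect(primary, "7002");   /* hypothetical port */
    if (fd < 0 && backup != NULL)
        fd = try_connect(backup, "7002");
    return fd;
}
\end{verbatim}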
SLURM attempts to minimize the amount of time a node is unavailable
for work. Nodes assigned to jobs are returned to the partition as
soon as they successfully clean up user processes and run the system
epilog. In this manner,
those nodes that fail to successfully run the system epilog, or those
with unkillable user processes, are held out of the partition while
the remaining nodes are quickly returned to service.
SLURM considers neither the crash of a compute node nor termination
of \srun\ as a critical event for a job. Users may specify on a per-job
basis whether the crash of a compute node should result in the premature
termination of their job. Similarly, if the host on which \srun\ is
running crashes, the job continues execution and no output is lost.
\section{Slurmd Design}\label{slurmd}
The \slurmd\ daemon is a multi-threaded daemon for managing user jobs
and monitoring system state. Upon initiation it reads the configuration
file, recovers any saved state, captures system state,
attempts an initial connection to the SLURM
controller, and awaits requests. It services requests for system state,
accounting information, job initiation, job state, job termination,
and job attachment. On the local node, it offers an API to translate
local process ids into SLURM job ids.
The most common action of \slurmd\ is to report system state on request.
Upon \slurmd\ startup and periodically thereafter, it gathers the
processor count, real memory size, and temporary disk space for the
node. Should those values change, the controller is notified. In a
future release of SLURM, \slurmd\ will also capture CPU, real-memory, and
virtual-memory consumption from the process table entries for uploading
to {\tt slurmctld}.
%FUTURE: Another thread is
%created to capture CPU, real-memory and virtual-memory consumption from
%the process table entries. Differences in resource utilization values
%from one process table snapshot to the next are accumulated. \slurmd\
%insures these accumulated values are not decremented if resource
%consumption for a user happens to decrease from snapshot to snapshot,
%which would simply reflect the termination of one or more processes.
%Both the real and virtual memory high-water marks are recorded and
%the integral of memory consumption (e.g. megabyte-hours). Resource
%consumption is grouped by uid and SLURM job id (if any). Data
%is collected for system users ({\em root}, {\em ftp}, {\em ntp},
%etc.) as well as customer accounts.
%The intent is to capture all resource use including
%kernel, idle and down time. Upon request, the accumulated values are
%uploaded to \slurmctld\ and cleared.
\slurmd\ accepts requests from \srun\ and \slurmctld\ to initiate
and terminate user jobs. The initiate job request contains such
information as real uid, effective uid, environment variables, working
directory, task numbers, job step credential, interconnect specifications and
authorization, core paths, SLURM job id, and the command line to execute.
System-specific programs can be executed on each allocated node prior
to the initiation of a user job and after the termination of a user
job (e.g., {\em Prolog} and {\em Epilog} in the configuration file).
These programs are executed as user {\em root} and can be used to
establish an appropriate environment for the user (e.g., permit logins,
disable logins, terminate orphan processes, etc.). \slurmd\ executes
the prolog program, resets its session id, and then initiates the job
as requested. It records to disk the SLURM job id, session id, process
id associated with each task, and user associated with the job. In the
event of \slurmd\ failure, this information is recovered from disk in
order to identify active jobs.
When \slurmd\ receives a job termination request from the SLURM
controller, it sends SIGTERM to all running tasks in the job,
waits for {\em KillWait} seconds (as specified in the configuration
file), then sends SIGKILL. If the processes do not terminate,
\slurmd\ notifies \slurmctld , which logs the event and sets the node's state
to DRAINED. After all processes have terminated, \slurmd\ executes the
configured epilog program, if any.
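
A minimal sketch of this termination sequence follows; the session id
comes from {\tt slurmd}'s saved job state and {\em KillWait} from the
configuration file.
\begin{verbatim}
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>

/* Terminate a job's process group: SIGTERM first, SIGKILL after
 * KillWait seconds.  Returns 0 on success, -1 if any process
 * survives (the node would then be marked DRAINED). */
int terminate_job(pid_t session_id, unsigned int kill_wait)
{
    kill(-session_id, SIGTERM);   /* polite request to exit */
    sleep(kill_wait);             /* grace period from config */
    kill(-session_id, SIGKILL);   /* force remaining tasks */
    sleep(1);                     /* allow the kill to settle */
    /* Signal 0 probes for survivors without sending anything. */
    return (kill(-session_id, 0) == 0) ? -1 : 0;
}
\end{verbatim}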
\section{Command Line Utilities}
\subsection{scancel}
\scancel\ terminates queued jobs or signals running jobs or job steps.
The default signal is SIGKILL, which indicates a request to terminate
the specified job or job step. \scancel\ identifies the job(s) to
be signaled through user specification of the SLURM job id, job step id,
user name, partition name, and/or job state. If a job id is supplied,
all job steps associated with the job are affected as well as the job
and its resource allocation. If a job step id is supplied, only that
job step is affected. \scancel\ can only be executed by the job's owner
or a privileged user.
\subsection{scontrol}
\scontrol\ is a tool meant for SLURM administration by user {\em root}.
It provides the following capabilities:
\begin{itemize}
\item {\tt Shutdown}: Cause \slurmctld\ and \slurmd\ to save state
and terminate.
\item {\tt Reconfigure}: Cause \slurmctld\ and \slurmd\ to reread the
configuration file.
\item {\tt Ping}: Display the status of primary and backup \slurmctld\ daemons.
\item {\tt Show Configuration Parameters}: Display the values of general SLURM
configuration parameters such as locations of files and values of timers.
\item {\tt Show Job State}: Display the state information of a particular job
or all jobs in the system.
\item {\tt Show Job Step State}: Display the state information of a particular
job step or all job steps in the system.
\item {\tt Show Node State}: Display the state and configuration information
of a particular node, a set of nodes (using numeric range syntax to
identify their names), or all nodes.
\item {\tt Show Partition State}: Display the state and configuration
information of a particular partition or all partitions.
\item {\tt Update Job State}: Update the state information of a particular job
in the system. Note that not all state information can be changed in this
fashion (e.g., the nodes allocated to a job).
\item {\tt Update Node State}: Update the state of a particular node. Note
that not all state information can be changed in this fashion (e.g., the
amount of memory configured on a node). In some cases, you may need
to modify the SLURM configuration file and cause it to be reread
using the ``Reconfigure'' command described above.
\item {\tt Update Partition State}: Update the state of a particular
partition. Note that not all state information can be changed in this fashion
(e.g., the default partition). In some cases, you may need to modify
the SLURM configuration file and cause it to be reread using the
``Reconfigure'' command described above.
\end{itemize}
\subsection{squeue}
\squeue\ reports the state of SLURM jobs. It can filter these
jobs based on a user-supplied specification of job state (RUN, PENDING,
etc.), job id,
user name, job name, etc. If no specification is supplied, the state of
all pending and running jobs is reported.
\squeue\ also has a variety of sorting and output options.
\subsection{sinfo}
\sinfo\ reports the state of SLURM partitions and nodes. By default,
it reports a summary of partition state with node counts and a summary
of the configuration of those nodes. A variety of sorting and
output formatting options exist.
\subsection{srun}
\srun\ is the user interface to accessing resources managed by SLURM.
Users may utilize \srun\ to allocate resources, submit batch jobs,
run jobs interactively, attach to currently running jobs, or launch a
set of parallel tasks (job step) for a running job. \srun\ supports a
full range of options to specify job constraints and characteristics,
for example minimum real memory, temporary disk space, and CPUs per node,
as well as time limits, stdin/stdout/stderr handling, signal handling,
and the working directory for the job. The full range of options is detailed
in Table~\ref{srun_opts}.
\begin{table}[!tb]
\begin{center}
\begin{tabular}[t]{lcl}
\hhline{---}
Option & Arg Type & Description \\
\hhline{---}
{\em attach} & string & attach \srun\ to a running job \\
{\em allocate} & boolean& allocate nodes only \\
{\em batch} & boolean& submit a batch script to job queue \\
{\em cddir} & string & working directory of remote processes \\
{\em constraint} & string & arbitrary feature constraints \\
{\em contiguous} & boolean& allocate contiguous nodes only \\
{\em cpus-per-task } & number & number of CPUs needed per process \\
{\em distribution} & string & distribution method for processes (block$|$cyclic) \\
{\em error} & string & location of stderr redirection \\
{\em exclude} & string & do not allocate from a specific set of hosts \\
{\em immediate} & boolean& exit if resources are not immediately available \\
{\em input} & string & location of stdin redirection \\
{\em join} & string & join existing \srun\ to collect output of a running job \\
{\em job-name} & string & name of job \\
{\em label} & boolean& prepend task number to lines of stdout/err \\
{\em mem} & number & minimum amount of real memory per node \\
{\em mincpus} & number & minimum number of CPUs per node \\
% {\em no-allocate} & boolean& do not allocate nodes, SlurmUser only \\
{\em no-kill} & boolean& don't kill job if allocated nodes fail \\
{\em nodelist} & string & request a specific set of hosts \\
{\em nodes} & numbers& minimum and maximum number of nodes on which to run \\
{\em ntasks} & number & number of tasks to run \\
{\em output} & string & location of stdout redirection \\
{\em overcommit} & boolean& allow more than 1 process per CPU \\
{\em partition} & string & partition name in which to run \\
{\em share} & boolean& allow nodes to be shared with other jobs \\
% {\em slurmd-debug} & number & get slurmd error message at this level of detail \\
{\em threads} & number & run with this number of communication threads \\
{\em time} & number & wall-clock time limit in minutes \\
{\em tmp} & number & minimum amount of temporary disk space \\
{\em verbose} & boolean& verbose operation \\
{\em version} & boolean& print \srun\ version and exit \\
{\em wait} & number & seconds to wait after first task end before killing job \\
\hhline{---}
\end{tabular}
\caption{\label{srun_opts} List of \srun\ user options}
\end{center}
\end{table}
The \srun\ utility can run in four different modes: {\em interactive},
in which the \srun\ process remains resident in the user's session,
manages stdout/stderr/stdin, and forwards signals to the remote tasks;
{\em batch}, in which \srun\ submits a job script to the SLURM queue for
later execution; {\em allocate}, in which \srun\ requests resources from
the SLURM controller and spawns a shell with access to those resources;
and {\em attach}, in which \srun\ attaches to a currently
running job and displays stdout/stderr in real time from the remote
tasks.
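The following command lines sketch the four modes, using the long-form
option names from Table~\ref{srun_opts} (the executable, script name,
and node counts are illustrative):
\begin{verbatim}
  srun --nodes=2 --ntasks=4 cmd      # interactive
  srun --batch --nodes=2 script.sh   # batch
  srun --allocate --nodes=2          # allocate (spawns a shell)
  srun --attach <jobid>              # attach
\end{verbatim}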
% FUTURE:
% An interactive job may also be forced into the background with a special
% control sequence typed at the user's terminal. Output from the running
% job is subsequently redirected to files in the current working directory
% and stdin is copied from {\tt /dev/null}. A backgrounded job may be
% reattached to a user's terminal at a later time by running
% \begin{verbatim}
% srun --attach <jobid>
% \end{verbatim}
% at any time, though the remote \srun\ is not terminated as the result
% of an attach.
\section{Job Initiation Design}
There are three modes in which jobs may be run by users under SLURM. The
first and simplest mode is {\em interactive} mode, in which stdout and
stderr are displayed on the user's terminal in real time, and stdin and
signals may be forwarded from the terminal transparently to the remote
tasks. The second mode is {\em batch} or {\em queued} mode, in which the job is
queued until the request for resources can be satisfied, at which time the
job is run by SLURM as the submitting user. In the third mode, {\em allocate}
mode, a job is allocated to the requesting user, under which the user may
manually run job steps via a script or in a sub-shell spawned by \srun .
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/connections.eps,scale=0.4}}
\caption{\small Job initiation connections overview. 1. \srun\ connects to
\slurmctld\ requesting resources. 2. \slurmctld\ issues a response,
with list of nodes and job step credential. 3. \srun\ opens a listen
port for job IO connections, then sends a run job step
request to \slurmd . 4. \slurmd\ initiates job step and connects
back to \srun\ for stdout/err}
\label{connections}
\end{figure}
Figure~\ref{connections} shows a high-level depiction of the connections
that occur between SLURM components during a general interactive
job startup. \srun\ requests a resource allocation and job step
initiation from the {\tt slurmctld}, which responds with the job id,
list of allocated nodes, job step credential, etc., if the request is granted.
\srun\ then initializes a listen port for stdio connections and connects
to the {\tt slurmd}s on the allocated nodes requesting that the remote
processes be initiated. The {\tt slurmd}s begin execution of the tasks and
connect back to \srun\ for stdout and stderr. This process and other
initiation modes are described in more detail below.
\subsection{Interactive Job Initiation}
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/interactive-job-init.eps,scale=0.45} }
\caption{\small Interactive job initiation. \srun\ simultaneously allocates
nodes and a job step from \slurmctld\ then sends a run request to all
{\tt slurmd}s in job. Dashed arrows indicate a periodic request that
may or may not occur during the lifetime of the job}
\label{init-interactive}
\end{figure}
Interactive job initiation is shown in
Figure~\ref{init-interactive}. The process begins with a user invoking
\srun\ in interactive mode. In Figure~\ref{init-interactive}, the user
has requested an interactive run of the executable ``{\tt cmd}'' in the
default partition.
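Such a request might correspond to a command line as simple as:
\begin{verbatim}
  srun cmd
\end{verbatim}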
After processing command line options, \srun\ sends a message to
\slurmctld\ requesting a resource allocation and a job step initiation.
This message simultaneously requests an allocation (or job) and a job
step. \srun\ waits for a reply from {\tt slurmctld}, which may not come
instantly if the user has requested that \srun\ block until resources are
available. When resources are available for the user's job, \slurmctld\
replies with a job step credential, list of nodes that were allocated,
cpus per node, and so on. \srun\ then sends a message to each \slurmd\ on
the allocated nodes requesting that a job step be initiated.
The \slurmd\ daemons verify that the job is valid using the forwarded job
step credential and then respond to \srun .
Each \slurmd\ invokes a job manager process to handle the request, which
in turn invokes a session manager process that initializes the session for
the job step. An IO thread is created in the job manager that connects
all tasks' IO back to a port opened by \srun\ for stdout and stderr.
Once stdout and stderr have successfully been connected, the task thread
takes the necessary steps to initiate the user's executable on the node,
initializing environment, current working directory, and interconnect
resources if needed.
Each \slurmd\ forks a copy of itself that is responsible for the job
step on this node. This local job manager process then creates an
IO thread that initializes stdout, stdin, and stderr streams for each
local task and connects these streams to the remote \srun . Meanwhile,
the job manager forks a session manager process that initializes
the session, becomes the requesting user, and invokes the user's processes.
As user processes exit, their exit codes are collected, aggregated when
possible, and sent back to \srun\ in the form of a task exit message.
Once all tasks have exited, the session manager exits, and the job
manager process waits for the IO thread to complete, then exits.
The \srun\ process either waits for all tasks to exit, or attempts to
clean up the remaining processes some time after the first task exits
(based on a user option). Regardless, once all tasks are finished,
\srun\ sends a message to the \slurmctld\ releasing the allocated nodes,
then exits with an appropriate exit status.
When the \slurmctld\ receives notification that \srun\ no longer needs
the allocated nodes, it issues a request for the epilog to be run on
each of the {\tt slurmd}s in the allocation. As {\tt slurmd}s report that the
epilog ran successfully, the nodes are returned to the partition.
\subsection{Queued (Batch) Job Initiation}
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/queued-job-init.eps,scale=0.45} }
\caption{\small Queued job initiation.
\slurmctld\ initiates the user's job as a batch script on one node.
The batch script contains an \srun\ call that initiates parallel tasks
after instantiating a job step with the controller. The shaded region is
a compressed representation and is shown in more detail in the
interactive diagram (Figure~\ref{init-interactive})}
\label{init-batch}
\end{figure}
Figure~\ref{init-batch} shows the initiation of a queued job in
SLURM. The user invokes \srun\ in batch mode by supplying the {\tt --batch}
option to \srun . Once user options are processed, \srun\ sends a batch
job request to \slurmctld\ that identifies the stdin, stdout and stderr file
names for the job, current working directory, environment, requested
number of nodes, etc.
The \slurmctld\ queues the request in its priority-ordered queue.
Once the resources are available and the job has a high enough priority, \linebreak
\slurmctld\ allocates the resources to the job and contacts the first node
of the allocation requesting that the user job be started. In this case,
the job may either be another invocation of \srun\ or a job script
including invocations of \srun . The \slurmd\ on
the remote node responds to the run request, initiating the job manager,
session manager, and user script. An \srun\ executed from within the script
detects that it has access to an allocation and initiates a job step on
some or all of the nodes within the job.
Once the job step is complete, the \srun\ in the job script notifies
the \slurmctld , and terminates. The job script continues executing and
may initiate further job steps. Once the job script completes, the task
thread running the job script collects the exit status and sends a task
exit message to the \slurmctld . The \slurmctld\ notes that the job
is complete and requests that the job epilog be run on all nodes that
were allocated. As the {\tt slurmd}s respond with successful completion
of the epilog, the nodes are returned to the partition.
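As a concrete sketch of this mode (the file names are illustrative; the
options appear in Table~\ref{srun_opts}), the user might submit
\begin{verbatim}
  srun --batch --nodes=2 myscript.sh
\end{verbatim}
where {\tt myscript.sh} is a shell script that launches its parallel
work as job steps:
\begin{verbatim}
  #!/bin/sh
  srun --ntasks=4 ./a.out
\end{verbatim}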
\subsection{Allocate Mode Initiation}
\begin{figure}[tb]
\centerline{\epsfig{file=../figures/allocate-init.eps,scale=0.45} }
\caption{\small Job initiation in allocate mode. Resources are allocated and
\srun\ spawns a shell with access to the resources. When the user runs
an \srun\ from within the shell, a job step is initiated under
the allocation
\label{init-allocate}
\end{figure}
In allocate mode, the user wishes to allocate a job and interactively run
job steps under that allocation. The process of initiation in this mode
is shown in Figure~\ref{init-allocate}. The invoked \srun\ sends
an allocate request to \slurmctld , which, if resources are available,
responds with a list of nodes allocated, job id, etc. The \srun\ process
spawns a shell on the user's terminal with access to the allocation,
then waits for the shell to exit (at which time the job is considered
complete).
An \srun\ initiated within the allocate sub-shell recognizes that
it is running under an allocation and therefore already within a
job. Provided with no other arguments, \srun\ started in this manner
initiates a job step on all nodes within the current job.
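A session in this mode might look as follows, where {\tt sh\$} marks the
prompt of the spawned sub-shell and the node count is illustrative:
\begin{verbatim}
  srun --allocate --nodes=4
  sh$ srun hostname
  sh$ exit
\end{verbatim}
The bare \srun\ inside the sub-shell starts a job step on all four
allocated nodes, and exiting the shell ends the job.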
% Maybe later:
%
% However, the user may select a subset of these nodes implicitly by using
% the \srun\ {\tt --nodes} option, or explicitly by specifying a relative
% nodelist ( {\tt --nodelist=[0-5]} ).
An \srun\ executed from the sub-shell reads the environment and user
options, then notifies the controller that it is starting a job step under
the current job. The \slurmctld\ registers the job step and responds
with a job step credential. \srun\ then initiates the job step using the same
general method as for interactive job initiation.
When the user exits the allocate sub-shell, the original \srun\ receives
exit status, notifies \slurmctld\ that the job is complete, and exits.
The controller runs the epilog on each of the allocated nodes, returning
nodes to the partition as they successfully complete the epilog.
%
% Information in this section seems like it should be some place else
% (Some of it is incorrect as well)
% -mark
%
%\section{Infrastructure}
%
%The state of \slurmctld\ is written periodically to disk for fault
%tolerance. SLURM daemons are initiated via {\tt inittab} using the {\tt
%respawn} option to insure their continuous execution. If the control
%machine itself becomes inoperative, its functions can easily be moved in
%an automated fashion to another node. In fact, the computers designated
%as both primary and backup control machine can easily be relocated as
%needed without loss of the workload by changing the configuration file
%and restarting all SLURM daemons.
%
%The {\tt syslog} tools are used for logging purposes and take advantage
%of the severity level parameter.
%
%Direct use of the Elan interconnect is provided a version of MPI developed
%and supported by Quadrics. SLURM supports this version of MPI with no
%modifications.
%
%SLURM supports the TotalView debugger\cite{Etnus2002}. This requires
%\srun\ to not only maintain a list of nodes used by each job step, but
%also a list of process ids on each node corresponding the application's
%tasks.
\section{Results}
\begin{figure}[htb]
\centerline{\epsfig{file=../figures/times.eps}}
\caption{Time to execute {\em /bin/hostname} with various node counts}
\label{timing}
\end{figure}
We were able to perform some SLURM tests on a 1000-node cluster
in November 2002. Some development was still underway at that time
and tuning had not been performed. The results for executing the
program {\em /bin/hostname} on two tasks per node and various node
counts are shown in Figure~\ref{timing}. We found SLURM performance
to be comparable to the
Quadrics Resource Management System (RMS) \cite{Quadrics2002} for all
job sizes and about 80 times faster than IBM LoadLeveler\cite{LL2002}
at tested job sizes.
\section{Future Plans}
SLURM began production use on LLNL Linux clusters in March 2003
and is available from our web site\cite{SLURM2003}.
While SLURM is able to manage 1000 nodes without difficulty using
sockets and Ethernet, we are reviewing other communication mechanisms
that may offer improved scalability. One possible alternative
is STORM \cite{STORM2001}. STORM uses the cluster interconnect
and Network Interface Cards to provide high-speed communications,
including a broadcast capability. STORM only supports the Quadrics
Elan interconnect at present, but it does offer the promise of improved
performance and scalability.
Looking ahead, we anticipate adding support for additional
interconnects (InfiniBand and the IBM
Blue Gene \cite{BlueGene2002} system\footnote{Blue Gene has a different
interconnect than any supported by SLURM and a 3-D topology with
restrictive allocation constraints.}). We anticipate adding a job
preempt/resume capability to the next release of SLURM. This will
provide an external scheduler with the infrastructure required to perform gang
scheduling. We also anticipate adding a checkpoint/restart capability at
some time in the future, and we plan to support changing the node count
associated with running jobs (as needed for MPI2). Recording resource
use by each parallel job is planned for a future release.
\section{Acknowledgments}
SLURM is jointly developed by LLNL and Linux NetworX.
Contributors to SLURM development include:
\begin{itemize}
\item Jay Windley of Linux NetworX for his development of the plugin
mechanism and work on the security components
\item Joey Ekstrom for his work developing the user tools
\item Kevin Tew for his work developing the communications infrastructure
\item Jim Garlick for his development of the Quadrics Elan interface and
technical guidance
\item Gregg Hommes, Bob Wood, and Phil Eckert for their help designing the
SLURM APIs
\item Mark Seager and Greg Tomaschke for their support of this project
\item Chris Dunlap for technical guidance
\item David Jackson of Linux NetworX for technical guidance
\item Fabrizio Petrini of Los Alamos National Laboratory for his work to
integrate SLURM with STORM communications
\end{itemize}
%\appendix
%\newpage
%\section{Glossary}
%
%\begin{description}
%\item[Authd] User authentication mechanism
%\item[DCE] Distributed Computing Environment
%\item[DFS] Distributed File System (part of DCE)
%\item[DPCS] Distributed Production Control System, a meta-batch system
% and resource manager developed by LLNL
%\item[Globus] Grid scheduling infrastructure
%\item[Kerberos] Authentication mechanism
%\item[LoadLeveler] IBM's parallel job management system
%\item[LLNL] Lawrence Livermore National Laboratory
%\item[Munge] User authentication mechanism developed by LLNL
%\item[NQS] Network Queuing System (a batch system)
%\item[RMS] Quadrics' Resource Management System
%\item[TotalView] Etnus' debugger
%\end{description}
% make the bibliography
\bibliographystyle{unsrt}
\bibliography{project}
% make the back cover page
\makeLLNLBackCover
\end{document}