\documentclass{article}
\usepackage{graphics}
\usepackage{epsfig}
\usepackage{draftcopy}
\author{Moe Jette, Chris Dunlap, Jim Garlick, Mark Grondona\\
\{jette,cdunlap,garlick,grondona\}@llnl.gov}
\title{Survey of Batch/Resource Management-Related System Software}
\begin{document}
\maketitle
\begin{abstract}
Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job
scheduling system for Linux clusters of
thousands of nodes. Components include machine status, partition
management, job management, and scheduling modules. The design also
includes a scalable, general-purpose communication infrastructure.
Development will take place in four phases: Phase I results in a solid
infrastructure; Phase II produces a functional but limited interactive
job initiation capability without use of the interconnect/switch;
Phase III provides switch support and documentation; Phase IV provides
job status reporting, fault-tolerance, and job queueing and control through
Livermore's Distributed Production Control System (DPCS), a meta-batch and
resource management system.
\end{abstract}
\vspace{0.25in}
\section{PBS (Portable Batch System)}
The Portable Batch System (PBS)\footnote{http://www.openpbs.org/}
is a flexible batch queuing and
workload management system originally developed by Veridian Systems
for NASA. It operates on networked, multi-platform UNIX environments,
including heterogeneous clusters of workstations, supercomputers, and
massively parallel systems. PBS was developed as a replacement for
NQS (Network Queuing System) by many of the same people.
PBS supports sophisticated scheduling logic (via the Maui
Scheduler\footnote{http://superclustergroup.org/maui}).
PBS spawns daemons on each
machine to shepherd the job's tasks (similar to LoadLeveler
and Condor). It provides an interface for administrators to easily
plug in their own scheduling modules (a nice feature). PBS can support
long delays in file staging (in and out) with retry. Host
authentication is provided by checking port numbers (low port numbers are
accessible only to user root). A credential service is used for user authentication.
It has the job prolog and epilog feature, which is useful. PBS supports a
high priority queue for smaller "interactive" jobs. A signal to the daemons
causes the current log file (e.g. accounting) to be closed, renamed with a
time-stamp, and a new log file created.
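
The port-number check mentioned above is a standard UNIX technique: only
root may bind a port below 1024, so a daemon can treat a connection
arriving from such a low port as having been initiated by root on the peer
host. The sketch below illustrates the general idea with plain socket calls;
it is not PBS source code, and the helper name is hypothetical.
\begin{verbatim}
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Return 1 if the connected peer used a reserved (root-only) port.
 * Hypothetical helper illustrating the technique; PBS's actual
 * implementation differs. */
int peer_is_privileged(int sock)
{
    struct sockaddr_in peer;
    socklen_t len = sizeof(peer);

    if (getpeername(sock, (struct sockaddr *) &peer, &len) < 0) {
        perror("getpeername");
        return 0;
    }
    return ntohs(peer.sin_port) < IPPORT_RESERVED;   /* below 1024 */
}
\end{verbatim}
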
Specific complaints about PBS from members of the OSCAR group (Jeremy Enos,
Jeff Squyres, Tim Mattson):
\begin{itemize}
\item Sensitivity to hostname configuration on the server; improper
configuration results in hard-to-diagnose failure modes. Once
configuration is correct, this issue disappears.
\item When a compute node in the system dies, everything slows down.
PBS is single-threaded and continues to try to contact down nodes,
while other activities like scheduling jobs, answering qsub/qstat
requests, etc., have to wait for a complete timeout cycle before being
processed.
\item Default scheduler is just FIFO, but Maui can be plugged in so this
is not a big issue.
\item Weak mechanism for starting/cleaning up parallel jobs (pbsdsh).
When a job is killed, pbsdsh kills the processes it started, but
if a process does not die on the first attempt, it may continue running.
\item PBS server continues to mark specific nodes offline, even though they
are healthy. Restarting the server fixes this.
\item Lingering jobs. Jobs assigned to nodes, and then bounced back to the
queue for any reason, maintain their assignment to those nodes, even
if another job has already started on them. This is a cleanup problem.
\item When the PBS server process is restarted, it puts running jobs at risk.
\item Poor diagnostic messages. This problem can be as serious as any
other; it occasionally turns small, simple problems into major turmoil.
For example, improper hostname configuration produces a variety of
symptoms, all of which are misleading as to the real problem.
\item Rumored to have problems when the number of jobs in the queues gets
large.
\item Scalability problems on large systems.
\item Not portable to Windows.
\item Source code is a mess and difficult for others (e.g. the open source
community) to improve/expand.
\item Licensing problems (see below).
\end{itemize}
The main strengths mentioned are PBS's portability and broad user base.
PBS is owned by Veridian and is released as three separate products with
different licenses: {\em PBS Pro} is a commercial product sold by Veridian;
{\em OpenPBS} is a pseudo open source version of PBS that requires
registration; and
{\em PBS} is a GPL-like, true open source version of PBS.
Bug fixes go into PBS Pro. When a major revision of PBS Pro comes out,
the previous version of PBS Pro becomes OpenPBS, and the previous version
of OpenPBS becomes PBS. The delay getting bug fixes (some reported by the
open source community) into the true open source version of PBS is the source
of some frustration.
\section{Maui}
Maui Scheduler\footnote{http://supercluster.org/maui}
is an advance reservation HPC batch scheduler for use with SP,
O2K, and UNIX/Linux clusters. It is widely used to extend the
functionality of PBS and LoadLeveler.
\section{DPCS}
The Distributed Production Control System (DPCS)\footnote{
http://www.llnl.gov/icc/lc/dpcs/dpcs\_overview.html}
is a resource manager developed by Lawrence Livermore National Laboratory (LLNL).
DPCS is (or will soon be) open source, although its use is presently
confined to LLNL. The development of DPCS began in 1990 and it has
evolved into a highly scalable and fault-tolerant meta-scheduler
operating on top of LoadLeveler, RMS, and NQS. DPCS provides:
\begin{itemize}
\item Basic data collection and reporting mechanisms for project-level,
near real-time accounting.
\item Resource allocation to customers, with limits established according to
each customer's organizational budget.
\item Proactive delivery of services to organizations that are relatively
underserviced using a fair-share resource allocation scheme.
\item Automated, highly flexible system with feedback for proactive delivery
of resources.
\item Even distribution of the workload across available computers.
\item Flexible prioritization of production workload, including "run on demand."
\item Dynamic reconfiguration and re-tuning.
\item Graceful degradation in service to prevent overuse of a computer where
not authorized.
\end{itemize}
While DPCS does have some attractive characteristics, it supports only a
limited number of computer systems: IBM RS/6000 and SP, Linux with RMS,
Sun Solaris, and Compaq Alpha. DPCS also lacks commercial support.
\section{LoadLeveler}
LoadLeveler\footnote{
http://www-1.ibm.com/servers/eserver/pseries/library/sp\_books/loadleveler.html}
is a proprietary batch system and parallel job manager by
IBM. LoadLeveler supports few non-IBM systems. Its native scheduling is
very primitive, and other software (e.g. Maui or DPCS) is required for
reasonable performance. Many soft and hard limits are available.
A very flexible queue and job class structure is available operating in "matrix" fashion
(probably overly complex). Many configuration files exist with signals to
daemons used to update configuration (like LSF, good). All jobs must
be initiated through LoadLeveler (no real "interactive" jobs, just
high priority queue for smaller jobs). Job accounting is only available
on termination (very bad for long-running jobs). Good status
information on nodes and LoadLeveler daemons is available. LoadLeveler
allocates jobs either entire nodes or shared nodes, depending upon configuration.
A special version of MPI is required. LoadLeveler allocates
interconnect resources, spawns the user's processes, and manages the
job afterwards. Daemons also monitor the switch and node health using
a "heart-beat monitor." One fundamental problem is that when the
"Central Manager" restarts, it forgets about all nodes and jobs. They
appear in the database only after checking in via the heartbeat. It
needs to periodically write state to disk instead of doing
"cold-starts" after the daemon fails, which is rare. It has the job
prolog and epilog feature, which permits us to enable/disable logins
and remove stray processes.
LoadLeveler evolved from Condor, or what was Condor a decade ago.
While I am less familiar with LSF and Condor than LoadLeveler, they
all appear very similar with LSF having the far more sophisticated
scheduler. We should carefully review their data structures and
daemons before designing our own.
\section{LSF (Load Sharing Facility)}
LSF\footnote{http://www.platform.com/}
is a proprietary batch system and parallel job manager by
Platform Computing. It is deployed on a wide variety of computer
architectures. Sophisticated scheduling software including
fair-share, backfill, consumable resources, job preemption, many soft
and hard limits, etc. Very flexible queue structure (perhaps overly
complex). Limits are available on both a per-process and per-job
basis. Time limits include CPU time and wall-clock time. Many
configuration files with signals to daemons used to update
configuration (like LoadLeveler, good). All jobs must be initiated
through LSF to be accounted for and managed by LSF ("interactive"
jobs can be executed through a high priority queue for
smaller jobs). Job accounting is available in near real-time (important
for long-running jobs). Jobs are initiated from the same directory from
which they were submitted (not good for computer centers with diverse systems
under LSF control). Good status information on nodes and LSF daemons.
Allocates jobs either entire nodes or shared nodes depending upon
configuration.
A special version of MPI is required. LSF allocates interconnect
resources, spawns the user's processes, and manages the job
afterwards. While I am less familiar with LSF than LoadLeveler, they
appear very similar with LSF having the far more sophisticated
scheduler. We should carefully review their data structures and
daemons before designing our own.
\section{Condor}
Condor\footnote{http://www.cs.wisc.edu/condor/} is a
batch system and parallel job manager
developed by the University of Wisconsin.
Condor was the basis for IBM's LoadLeveler and both share very similar
underlying infrastructure. Condor has a very sophisticated checkpoint/restart
service that does not rely upon kernel changes, but rather upon a variety of
library changes (which prevent it from being completely general). The
Condor checkpoint/restart service has been integrated into LSF,
Codine, and DPCS. Condor is designed to operate across a
heterogeneous environment, mostly to harness the compute resources of
workstations and PCs. It has an interesting "advertising" service.
Servers advertise their available resources and consumers advertise
their requirements for a broker to perform matches. The checkpoint
mechanism is used to relocate work on demand (when the "owner" of a
desktop machine wants to resume work).
\section{Memory Channel (Compaq)}
Memory Channel is a high-speed interconnect developed by
Digital/Compaq with related software for parallel job execution.
A special version of MPI is required. The application spawns tasks on
other nodes. These tasks connect themselves to the high speed
interconnect. There is no system-level tool to spawn the tasks, allocate
interconnect resources, or otherwise manage the parallel job (Note:
This is sometimes a problem when jobs fail, requiring system
administrators to release interconnect resources. There are also
performance problems related to resource sharing).
\section{Linux PAGG Process Aggregates}
PAGG\footnote{http://oss.sgi.com/projects/pagg/}
consists of modifications to the Linux kernel that allow
developers to implement Process AGGregates as loadable kernel modules.
A process aggregate is defined as a collection of processes that are
all members of the same set. A set would be implemented as a container
for the member processes. For instance, process sessions and groups
could have been implemented as process aggregates.
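
The analogy with existing UNIX process groups can be made concrete. The
sketch below uses only standard POSIX calls (it is not the PAGG kernel
interface) to place several children in one process group and then signal
and reap them as a unit; PAGG generalizes this kind of container so that
new aggregate types can be defined in loadable kernel modules.
\begin{verbatim}
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t group = 0;            /* process group id of the "aggregate" */
    int i;

    for (i = 0; i < 3; i++) {
        pid_t pid = fork();
        if (pid == 0) {         /* child: join the group and idle */
            setpgid(0, group);  /* first child becomes group leader */
            pause();
            _exit(0);
        }
        if (group == 0)
            group = pid;        /* group id is the first child's pid */
        setpgid(pid, group);    /* parent sets it too, avoiding a race */
    }

    sleep(1);                   /* let the children block in pause() */
    kill(-group, SIGTERM);      /* signal every member of the group */
    while (wait(NULL) > 0)      /* reap all members */
        ;
    printf("process group %d terminated\n", (int) group);
    return 0;
}
\end{verbatim}
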
\section{BPROC}
The Beowulf Distributed Process Space
(BProc\footnote{http://bproc.sourceforge.net/})
is a set of kernel
modifications, utilities, and libraries which allow a user to start
processes on other machines in a Beowulf-style cluster. Remote
processes started with this mechanism appear in the process table
of the front end machine in a cluster. This allows remote process
management using the normal UNIX process control facilities. Signals
are transparently forwarded to remote processes and exit status is
received using the usual wait() mechanisms.
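
A minimal sketch of this usage model follows. It assumes the
{\tt bproc\_rfork()} call and header from the BProc library (exact names
and signatures may vary by version); the point is only that a remotely
started child is managed with ordinary fork()/wait() semantics from the
front end.
\begin{verbatim}
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <sys/bproc.h>           /* assumed BProc header */

int main(void)
{
    int node = 1;                /* target compute node number */
    int status;
    pid_t pid = bproc_rfork(node);   /* like fork(), child lands on 'node' */

    if (pid < 0) {
        perror("bproc_rfork");
        return 1;
    }
    if (pid == 0) {              /* child, now on the remote node */
        execlp("hostname", "hostname", (char *) NULL);
        _exit(127);
    }

    /* parent, on the front end: normal UNIX process control applies */
    waitpid(pid, &status, 0);    /* exit status via the usual wait() calls */
    printf("remote pid %d exited with status %d\n",
           (int) pid, WEXITSTATUS(status));
    return 0;
}
\end{verbatim}
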
\section{xcat}
Presumably IBM's suite of cluster management software
(xcat\footnote{http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246041.html})
includes a batch system. Look into this.
\section{CPLANT}
CPLANT\footnote{http://www.cs.sandia.gov/cplant/} includes
a Parallel Job Launcher, Compute Node Daemon Process,
Compute Node Allocator, and Compute Node Status Tool.
\section{NQS}
NQS\footnote{http://umbc7.umbc.edu/nqs/nqsmain.html},
the Network Queueing System, is a serial batch system.
\section{LAM / MPI}
LAM (Local Area Multicomputer)\footnote{http://www.lam-mpi.org/}
is an MPI programming environment and development system for heterogeneous
computers on a network.
With LAM, a dedicated cluster or an existing network
computing infrastructure can act as one parallel computer solving
one problem. LAM features extensive debugging support in the
application development cycle and peak performance for production
applications. LAM features a full implementation of the MPI
communication standard.
\section{MPICH}
MPICH\footnote{http://www-unix.mcs.anl.gov/mpi/mpich/}
is a freely available, portable implementation of MPI,
the Standard for message-passing libraries.
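
Both LAM and MPICH implement the same MPI standard, so a minimal program
such as the one below builds unmodified under either (for example with
their {\tt mpicc} compiler wrappers); only the way the job is launched
differs between the two environments.
\begin{verbatim}
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */

    printf("task %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down cleanly */
    return 0;
}
\end{verbatim}
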
\section{Quadrics RMS}
Quadrics
RMS\footnote{http://www.quadrics.com/downloads/documentation/}
(Resource Management System) is a cluster management system for
Linux and Tru64 that supports the
Elan3 interconnect.
\section{Sun Grid Engine}
SGE\footnote{http://www.sun.com/gridware/} is now proprietary.
\section{SCIDAC}
The Scientific Discovery through Advanced Computing (SciDAC)
project\footnote{http://www.scidac.org/ScalableSystems}
has a Resource Management and Accounting working group
and a white paper\cite{Res2000}. Deployment of a system with
the required fault-tolerance and scalability is scheduled
for June 2006.
\section{GNU Queue}
GNU Queue\footnote{http://www.gnuqueue.org/home.html}.
\section{Clubmask}
Clubmask\footnote{http://clubmask.sourceforge.net} is based on bproc.
Separate queueing system?
\section{SQMX}
Part of the SCE Project\footnote{http://www.opensce.org/},
SQMX\footnote{http://www.beowulf.org/pipermail/beowulf-announce/2001-January/000086.html} is worth taking a look at.
\newpage
\bibliographystyle{plain}
\bibliography{project}
\end{document}