| \documentclass{article} |
| \usepackage{graphics} |
| \usepackage{epsfig} |
| \usepackage{draftcopy} |
| \author{Moe Jette, Chris Dunlap, Jim Garlick, Mark Grondona\\ |
| \{jette,cdunlap,garlick,grondona\}@llnl.gov} |
| \title{Survey of Batch/Resource Management-Related System Software} |
| |
| |
| \begin{document} |
| |
| \maketitle |
| |
| \begin{abstract} |
| Simple Linux Utility for Resource Management (SLURM) is an open source, |
| fault-tolerant, and highly scalable cluster management and job |
| scheduling system for Linux clusters of |
| thousands of nodes. Components include machine status, partition |
| management, job management, and scheduling modules. The design also |
| includes a scalable, general-purpose communication infrastructure. |
| Development will take place in four phases: Phase I results in a solid |
| infrastructure; Phase II produces a functional but limited interactive |
| job initiation capability without use of the interconnect/switch; |
| Phase III provides switch support and documentation; Phase IV provides |
job status reporting, fault-tolerance, and job queueing and control through
| Livermore's Distributed Production Control System (DPCS), a metabatch and |
| resource management system. |
| \end{abstract} |
| |
| \vspace{0.25in} |
| |
| \section{PBS (Portable Batch System)} |
| |
| The Portable Batch System (PBS)\footnote{http://www.openpbs.org/} |
| is a flexible batch queuing and |
| workload management system originally developed by Veridian Systems |
| for NASA. It operates on networked, multi-platform UNIX environments, |
| including heterogeneous clusters of workstations, supercomputers, and |
| massively parallel systems. PBS was developed as a replacement for |
| NQS (Network Queuing System) by many of the same people. |
| |
PBS supports sophisticated scheduling logic (via the Maui
Scheduler\footnote{http://superclustergroup.org/maui}).
PBS spawns daemons on each
machine to shepherd the job's tasks (similar to LoadLeveler
and Condor). It provides an interface for administrators to easily
plug in their own scheduling modules (a nice feature). PBS can support
long delays in file staging (in and out) with retry. Host
authentication is provided by checking port numbers (low port numbers
are accessible only to user root). A credential service is used for
user authentication. It has the job prolog and epilog feature, which is
useful. PBS supports a high-priority queue for smaller ``interactive''
jobs. A signal to the daemons causes the current log file (e.g., the
accounting log) to be closed, renamed with a time stamp, and a new log
file to be created.
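As a minimal sketch of that signal-driven log rotation (the use of
SIGHUP and the file names here are assumptions for illustration, not
PBS's actual implementation):

\begin{verbatim}
/* Sketch: reopen a daemon's log file on SIGHUP, as the text
 * above describes for PBS.  File names are illustrative. */
#include <signal.h>
#include <stdio.h>
#include <time.h>

static FILE *logfp;
static volatile sig_atomic_t rotate_requested;

static void on_sighup(int sig)
{
    (void)sig;
    rotate_requested = 1;   /* defer real work to the main loop */
}

static void rotate_log(void)
{
    char stamped[64];
    time_t now = time(NULL);

    strftime(stamped, sizeof(stamped), "accounting.%Y%m%d-%H%M%S",
             localtime(&now));
    fclose(logfp);                        /* close current log      */
    rename("accounting.log", stamped);    /* keep time-stamped copy */
    logfp = fopen("accounting.log", "a"); /* start a fresh log      */
    rotate_requested = 0;
}

int main(void)
{
    signal(SIGHUP, on_sighup);
    logfp = fopen("accounting.log", "a");
    for (;;) {              /* daemon main loop (sketch) */
        if (rotate_requested)
            rotate_log();
        pause();            /* sleep until the next signal */
    }
}
\end{verbatim}

Setting only a flag in the handler and doing the file operations in
the main loop keeps the handler async-signal-safe.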
| |
| Specific complaints about PBS from members of the OSCAR group (Jeremy Enos, |
| Jeff Squyres, Tim Mattson): |
| \begin{itemize} |
| \item Sensitivity to hostname configuration on the server; improper |
configuration results in hard-to-diagnose failure modes. Once
| configuration is correct, this issue disappears. |
\item When a compute node in the system dies, everything slows down.
PBS is single-threaded and continues trying to contact down nodes,
while other activities like scheduling jobs, answering qsub/qstat
requests, etc., have to wait for a complete timeout cycle before being
processed (a non-blocking alternative is sketched after this list).
| \item Default scheduler is just FIFO, but Maui can be plugged in so this |
| is not a big issue. |
\item Weak mechanism for starting and cleaning up parallel jobs (pbsdsh).
When a job is killed, pbsdsh kills the processes it started, but
if a process does not die on the first attempt it may survive the cleanup.
| \item PBS server continues to mark specific nodes offline, even though they |
| are healthy. Restarting the server fixes this. |
\item Lingering jobs. Jobs assigned to nodes and then bounced back to the
queue for any reason maintain their assignment to those nodes, even
if another job has already started on them. This is a cleanup
problem.
| \item When the PBS server process is restarted, it puts running jobs at risk. |
\item Poor diagnostic messages. This problem can be as serious as any
other, occasionally turning small, simple problems into major turmoil.
For example, improper hostname configuration produces a variety of
symptoms, all of which are misleading as to the underlying problem.
| \item Rumored to have problems when the number of jobs in the queues gets |
| large. |
| \item Scalability problems on large systems. |
\item Not portable to Windows.
| \item Source code is a mess and difficult for others (e.g. the open source |
| community) to improve/expand. |
| \item Licensing problems (see below). |
| \end{itemize} |
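The single-threaded timeout complaint above is not inherent to the
problem; as a hedged sketch (addresses and the one-second limit are
illustrative), a node probe can be bounded so that one dead node
cannot stall the server for a full TCP timeout:

\begin{verbatim}
/* Sketch: probe a node without blocking the whole server.  A
 * blocking connect() stalls for the full TCP timeout on a dead
 * node; a non-blocking connect() plus select() bounds the wait. */
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

static int node_alive(struct sockaddr_in *addr)
{
    int err = 1;
    socklen_t len = sizeof(err);
    fd_set wfds;
    struct timeval tv = { 1, 0 };   /* give up after one second */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return 0;
    fcntl(fd, F_SETFL, O_NONBLOCK); /* do not block on connect  */
    connect(fd, (struct sockaddr *)addr, sizeof(*addr));

    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    if (select(fd + 1, NULL, &wfds, NULL, &tv) > 0)
        getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    close(fd);
    return err == 0;                /* connected within bound   */
}
\end{verbatim}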
The strengths mentioned are PBS's portability and broad user base.
| |
| PBS is owned by Veridian and is released as three separate products with |
| different licenses: {\em PBS Pro} is a commercial product sold by Veridian; |
{\em OpenPBS} is a pseudo-open-source version of PBS that requires
| registration; and |
| {\em PBS} is a GPL-like, true open source version of PBS. |
| |
| Bug fixes go into PBS Pro. When a major revision of PBS Pro comes out, |
| the previous version of PBS Pro becomes OpenPBS, and the previous version |
of OpenPBS becomes PBS. The delay in getting bug fixes (some reported by the
| open source community) into the true open source version of PBS is the source |
| of some frustration. |
| |
| \section{Maui} |
| |
| Maui Scheduler\footnote{http://supercluster.org/maui} |
| is an advance reservation HPC batch scheduler for use with SP, |
| O2K, and UNIX/Linux clusters. It is widely used to extend the |
functionality of PBS and LoadLeveler.
| |
| \section{DPCS} |
| |
| The Distributed Production Control System (DPCS)\footnote{ |
| http://www.llnl.gov/icc/lc/dpcs/dpcs\_overview.html} |
| is a resource manager developed by Lawrence Livermore National Laboratory (LLNL). |
| DPCS is (or will soon be) open source, although its use is presently |
| confined to LLNL. The development of DPCS began in 1990 and it has |
| evolved into a highly scalable and fault-tolerant meta-scheduler |
| operating on top of LoadLeveler, RMS, and NQS. DPCS provides: |
| \begin{itemize} |
| \item Basic data collection and reporting mechanisms for project-level, |
| near real-time accounting. |
\item Resource allocation to customers, with limits established by each
customer's organizational budget.
\item Proactive delivery of services to relatively underserviced
organizations using a fair-share resource allocation scheme (an
illustrative formula follows this list).
| \item Automated, highly flexible system with feedback for proactive delivery |
| of resources. |
| \item Even distribution of the workload across available computers. |
\item Flexible prioritization of production workload, including ``run on demand.''
| \item Dynamic reconfiguration and re-tuning. |
| \item Graceful degradation in service to prevent overuse of a computer where |
| not authorized. |
| \end{itemize} |
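DPCS's actual fair-share algorithm is not documented here, so the
following is a generic illustration only. A fair-share scheme might
compute for each organization $u$ a priority
\begin{displaymath}
p_u \;=\; s_u \;-\;
  \frac{\sum_{t \ge 0} d^{\,t}\, U_u(t)}
       {\sum_{v} \sum_{t \ge 0} d^{\,t}\, U_v(t)},
\end{displaymath}
where $s_u$ is $u$'s allocated share of the machine, $U_u(t)$ is its
usage $t$ accounting periods ago, and $d \in (0,1)$ decays old usage.
Organizations with $p_u > 0$ are underserviced relative to their
budgets and are scheduled first, which yields the proactive delivery
described above.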
| |
| While DPCS does have some attractive characteristics, it supports only a |
| limited number of computer systems: IBM RS/6000 and SP, Linux with RMS, |
| Sun Solaris, and Compaq Alpha. DPCS also lacks commercial support. |
| |
| \section{LoadLeveler} |
| |
| LoadLeveler\footnote{ |
| http://www-1.ibm.com/servers/eserver/pseries/library/sp\_books/loadleveler.html} |
| is a proprietary batch system and parallel job manager by |
IBM. LoadLeveler supports few non-IBM systems. Its native scheduling
is very primitive, and other software (e.g., Maui or DPCS) is required
for reasonable performance. Many soft and hard limits are available.
A very flexible queue and job class structure is available, operating
in ``matrix'' fashion (probably overly complex). Many configuration
files exist, with signals to daemons used to update the configuration
(like LSF; good). All jobs must be initiated through LoadLeveler (no
real ``interactive'' jobs, just a high-priority queue for smaller
jobs). Job accounting is only available on termination (very bad for
long-running jobs). Good status information on nodes and LoadLeveler
daemons is available. LoadLeveler allocates jobs either entire nodes
or shared nodes, depending upon configuration.
| |
A special version of MPI is required. LoadLeveler allocates
interconnect resources, spawns the user's processes, and manages the
job afterwards. Daemons also monitor the switch and node health using
a ``heart-beat monitor.'' One fundamental problem is that when the
``Central Manager'' restarts, it forgets about all nodes and jobs; they
appear in its database only after checking in via the heartbeat. It
should periodically write its state to disk instead of doing a
``cold start'' after the daemon fails (a rare event). It has the job
prolog and epilog feature, which permits us to enable/disable logins
and remove stray processes.
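A hedged sketch of the periodic state write suggested above (the
state layout and file names are assumptions for illustration):

\begin{verbatim}
/* Sketch: periodically checkpoint manager state to disk so a
 * restarted daemon can warm-start instead of waiting for every
 * node and job to check in via the heartbeat. */
#include <stdio.h>
#include <unistd.h>

struct manager_state {
    int n_nodes;
    int n_jobs;
    /* ... node and job tables would follow ... */
};

static int checkpoint(const struct manager_state *st)
{
    FILE *fp = fopen("state.tmp", "wb");

    if (!fp)
        return -1;
    if (fwrite(st, sizeof(*st), 1, fp) != 1) {
        fclose(fp);
        return -1;
    }
    fflush(fp);
    fsync(fileno(fp));        /* force the data to disk       */
    fclose(fp);
    /* atomic replacement: readers see old or new, never half */
    return rename("state.tmp", "state.dat");
}
\end{verbatim}

The write-to-temporary-then-rename step matters: a crash during the
write leaves the previous checkpoint intact.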
| |
| LoadLeveler evolved from Condor, or what was Condor a decade ago. |
| While I am less familiar with LSF and Condor than LoadLeveler, they |
all appear very similar, with LSF having the far more sophisticated
| scheduler. We should carefully review their data structures and |
| daemons before designing our own. |
| |
| \section{LSF (Load Sharing Facility)} |
| |
| LSF\footnote{http://www.platform.com/} |
| is a proprietary batch system and parallel job manager by |
Platform Computing. It is widely deployed on a wide variety of
computer architectures. Its sophisticated scheduling software includes
fair-share, backfill, consumable resources, job preemption, many soft
and hard limits, etc. Very flexible queue structure (perhaps overly
complex). Limits are available on both a per-process and a per-job
basis. Time limits include CPU time and wall-clock time. Many
configuration files exist, with signals to daemons used to update the
configuration (like LoadLeveler; good). All jobs must be initiated
through LSF to be accounted for and managed by LSF (``interactive''
jobs can be executed through a high-priority queue for
smaller jobs). Job accounting is available in near real time (important
for long-running jobs). Jobs are initiated from the same directory from
which they were submitted (not good for computer centers with diverse
systems under LSF control). Good status information on nodes and LSF
daemons is available. LSF allocates jobs either entire nodes or shared
nodes, depending upon configuration.
| |
| A special version of MPI is required. LSF allocates interconnect |
| resources, spawns the user's processes, and manages the job |
afterwards. While I am less familiar with LSF than LoadLeveler, they
appear very similar, with LSF having the far more sophisticated
| scheduler. We should carefully review their data structures and |
| daemons before designing our own. |
| |
| |
| \section{Condor} |
| |
| |
| Condor\footnote{http://www.cs.wisc.edu/condor/} is a |
| batch system and parallel job manager |
| developed by the University of Wisconsin. |
| Condor was the basis for IBM's LoadLeveler and both share very similar |
underlying infrastructure. Condor has a very sophisticated
checkpoint/restart service that does not rely upon kernel changes, but
rather upon a variety of library changes (which prevents it from being
completely general). The Condor checkpoint/restart service has been
integrated into LSF, Codine, and DPCS. Condor is designed to operate
across a heterogeneous environment, mostly to harness the compute
resources of workstations and PCs. It has an interesting
``advertising'' service: servers advertise their available resources,
consumers advertise their requirements, and a broker performs the
matches. The checkpoint mechanism is used to relocate work on demand
(when the ``owner'' of a desktop machine wants to resume work).
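Condor's real advertising mechanism is the ClassAd language; the
following is only a miniature sketch of the advertise-and-match idea
with a fixed set of attributes:

\begin{verbatim}
/* Sketch: servers advertise offers, jobs advertise requirements,
 * and a broker pairs them.  A fixed struct stands in for the far
 * more general ClassAd attribute lists. */
#include <stddef.h>
#include <stdio.h>

struct ad {
    const char *name;   /* who is advertising         */
    int  cpus;          /* offered or required CPUs   */
    long memory_mb;     /* offered or required memory */
};

/* An offer satisfies a request if it covers every resource. */
static int matches(const struct ad *offer, const struct ad *req)
{
    return offer->cpus >= req->cpus &&
           offer->memory_mb >= req->memory_mb;
}

int main(void)
{
    struct ad offers[] = { { "nodeA", 2,  512 },
                           { "nodeB", 8, 4096 } };
    struct ad job = { "job42", 4, 1024 };

    for (size_t i = 0; i < sizeof(offers) / sizeof(offers[0]); i++)
        if (matches(&offers[i], &job))
            printf("broker: %s -> %s\n", job.name, offers[i].name);
    return 0;
}
\end{verbatim}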
| |
| |
| |
| \section{Memory Channel (Compaq)} |
| |
| Memory Channel is a high-speed interconnect developed by |
| Digital/Compaq with related software for parallel job execution. |
A special version of MPI is required. The application spawns tasks on
other nodes, and these tasks connect themselves to the high-speed
interconnect. No system-level tool spawns the tasks, allocates
interconnect resources, or otherwise manages the parallel job. (Note:
this is sometimes a problem when jobs fail, requiring system
administrators to release interconnect resources. There are also
performance problems related to resource sharing.)
| |
| \section{Linux PAGG Process Aggregates} |
| |
| |
| PAGG\footnote{http://oss.sgi.com/projects/pagg/} |
consists of modifications to the Linux kernel that allow
developers to implement Process AGGregates as loadable kernel modules.
| A process aggregate is defined as a collection of processes that are |
| all members of the same set. A set would be implemented as a container |
| for the member processes. For instance, process sessions and groups |
| could have been implemented as process aggregates. |
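As a conceptual sketch only (the real PAGG interface is a set of
kernel hooks for loadable modules, not this user-space structure),
a process aggregate is simply a named set of processes:

\begin{verbatim}
/* Sketch: a process aggregate as a container of member processes,
 * analogous to a session or process group. */
#include <sys/types.h>

#define PAGG_MAX 1024

struct process_aggregate {
    char  name[32];          /* e.g. a job or session identifier */
    pid_t members[PAGG_MAX]; /* pids belonging to this set       */
    int   n_members;
};

/* A child would inherit its parent's aggregate, just as it
 * inherits a session or process group. */
static int pagg_attach(struct process_aggregate *pa, pid_t pid)
{
    if (pa->n_members >= PAGG_MAX)
        return -1;
    pa->members[pa->n_members++] = pid;
    return 0;
}
\end{verbatim}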
| |
| \section{BPROC} |
| |
| |
| The Beowulf Distributed Process Space |
| (BProc\footnote{http://bproc.sourceforge.net/}) |
is a set of kernel
modifications, utilities, and libraries which allow a user to start
| processes on other machines in a Beowulf-style cluster. Remote |
| processes started with this mechanism appear in the process table |
| of the front end machine in a cluster. This allows remote process |
| management using the normal UNIX process control facilities. Signals |
| are transparently forwarded to remote processes and exit status is |
| received using the usual wait() mechanisms. |
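A hedged sketch of these semantics, assuming libbproc's
bproc\_rfork() call (which, like fork(), returns the child pid on the
front end and 0 in the child, with the child running on the given
node); the header path and node number are assumptions:

\begin{verbatim}
/* Sketch: start a process on a remote node, then reap it with
 * ordinary wait() semantics on the front end. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
#include <sys/bproc.h>   /* libbproc header (assumed path) */

int main(void)
{
    int status;
    int node = 3;                  /* illustrative node number */
    pid_t pid = bproc_rfork(node);

    if (pid < 0) {
        perror("bproc_rfork");
        exit(1);
    }
    if (pid == 0) {                /* child: now on remote node */
        execlp("hostname", "hostname", (char *)NULL);
        _exit(127);
    }
    /* parent: the remote child appears in the front end's
     * process table, so waitpid() works as usual. */
    waitpid(pid, &status, 0);
    printf("remote child %d exited with status %d\n",
           (int)pid, WEXITSTATUS(status));
    return 0;
}
\end{verbatim}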
| |
| \section{xcat} |
| |
| Presumably IBM's suite of cluster management software |
| (xcat\footnote{http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246041.html}) |
| includes a batch system. Look into this. |
| |
| \section{CPLANT} |
| |
| CPLANT\footnote{http://www.cs.sandia.gov/cplant/} includes |
a Parallel Job Launcher, a Compute Node Daemon Process,
a Compute Node Allocator, and a Compute Node Status Tool.
| |
| \section{NQS} |
| |
| NQS\footnote{http://umbc7.umbc.edu/nqs/nqsmain.html}, |
| the Network Queueing System, is a serial batch system. |
| |
| \section{LAM / MPI} |
| |
| LAM (Local Area Multicomputer)\footnote{http://www.lam-mpi.org/} |
| is an MPI programming environment and development system for heterogeneous |
| computers on a network. |
| With LAM, a dedicated cluster or an existing network |
| computing infrastructure can act as one parallel computer solving |
| one problem. LAM features extensive debugging support in the |
| application development cycle and peak performance for production |
| applications. LAM features a full implementation of the MPI |
| communication standard. |
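A minimal MPI program illustrates the common ground: LAM (or any MPI
implementation, such as MPICH below) can compile and run it with its
compiler and launcher wrappers:

\begin{verbatim}
/* Minimal MPI program: each task reports its rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* join the parallel job */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this task's id        */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of tasks */

    printf("task %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
\end{verbatim}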
| |
| \section{MPICH} |
| |
| MPICH\footnote{http://www-unix.mcs.anl.gov/mpi/mpich/} |
| is a freely available, portable implementation of MPI, |
| the Standard for message-passing libraries. |
| |
| \section{Quadrics RMS} |
| |
| Quadrics |
| RMS\footnote{http://www.quadrics.com/downloads/documentation/} |
| (Resource Management System) is a cluster management system for |
Linux and Tru64 that supports the
| Elan3 interconnect. |
| |
| \section{Sun Grid Engine} |
| |
| SGE\footnote{http://www.sun.com/gridware/} is now proprietary. |
| |
| |
| \section{SCIDAC} |
| |
| The Scientific Discovery through Advanced Computing (SciDAC) |
| project\footnote{http://www.scidac.org/ScalableSystems} |
| has a Resource Management and Accounting working group |
| and a white paper\cite{Res2000}. Deployment of a system with |
| the required fault-tolerance and scalability is scheduled |
| for June 2006. |
| |
| \section{GNU Queue} |
| |
| GNU Queue\footnote{http://www.gnuqueue.org/home.html}. |
| |
| \section{Clubmask} |
| Clubmask\footnote{http://clubmask.sourceforge.net} is based on bproc. |
| Separate queueing system? |
| |
| \section{SQMX} |
| Part of the SCE Project\footnote{http://www.opensce.org/}, |
| SQMX\footnote{http://www.beowulf.org/pipermail/beowulf-announce/2001-January/000086.html} is worth taking a look at. |
| |
| \newpage |
| \bibliographystyle{plain} |
| \bibliography{project} |
| |
| \end{document} |