\section{Related Work}
\subsection*{Portable Batch System (PBS)}
The Portable Batch System (PBS)~\cite{PBS}
is a flexible batch queuing and
workload management system originally developed by Veridian Systems
for NASA. It operates on networked, multi-platform UNIX environments,
including heterogeneous clusters of workstations, supercomputers, and
massively parallel systems. PBS was developed as a replacement for
NQS (Network Queuing System) by many of the same people.
PBS supports sophisticated scheduling logic (via the Maui
Scheduler).
PBS spawns daemons on each
machine to shepherd the job's tasks.
It provides an interface through which administrators can easily
plug in their own scheduling modules. PBS can tolerate
long delays in file staging by retrying. Host
authentication is provided by checking port numbers (low port numbers are
accessible only to user root), and a credential service is used for user
authentication. PBS has a job prolog and epilog feature and supports a
high-priority queue for smaller ``interactive'' jobs. A signal to the daemons
causes the current log file to be closed, renamed with a
time-stamp, and a new log file to be created.
Although PBS is portable and has a broad user base, it has significant drawbacks.
PBS is single-threaded and hence exhibits poor performance on large clusters.
This is particularly problematic when a compute node fails:
PBS repeatedly tries to contact the down node while all other activity waits.
PBS also has a weak mechanism for starting and cleaning up parallel jobs.
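As an illustration of the host-authentication approach mentioned above, the
sketch below shows the generic UNIX low-port check (a minimal example of our
own in C, not PBS source code): a daemon trusts a peer only if its connection
originates from a reserved port, which only root may bind.
\begin{verbatim}
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Sketch of the generic low-port check, not PBS code: accept the
 * peer on socket fd only if its source port is reserved (< 1024),
 * since only root may bind such ports on classic UNIX systems. */
static int peer_used_privileged_port(int fd)
{
    struct sockaddr_in peer;
    socklen_t len = sizeof(peer);

    if (getpeername(fd, (struct sockaddr *) &peer, &len) < 0)
        return 0;                 /* cannot verify the peer: reject */
    return ntohs(peer.sin_port) < IPPORT_RESERVED;
}
\end{verbatim}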
%Specific complaints about PBS from members of the OSCAR group (Jeremy Enos,
%Jeff Squyres, Tim Mattson):
%\begin{itemize}
%\item Sensitivity to hostname configuration on the server; improper
% configuration results in hard to diagnose failure modes. Once
% configuration is correct, this issue disappears.
%\item When a compute node in the system dies, everything slows down.
% PBS is single-threaded and continues to try to contact down nodes,
% while other activities like scheduling jobs, answering qsub/qstat
% requests, etc., have to wait for a complete timeout cycle before being
% processed.
%\item Default scheduler is just FIFO, but Maui can be plugged in so this
% is not a big issue.
%\item Weak mechanism for starting/cleaning up parallel jobs (pbsdsh).
% When a job is killed, pbsdsh kills the processes it started, but
% if the process doesn't die on the first shot it may continue on.
%\item PBS server continues to mark specific nodes offline, even though they
% are healthy. Restarting the server fixes this.
%\item Lingering jobs. Jobs assigned to nodes, and then bounced back to the
% queue for any reason, maintain their assignment to those nodes, even
% if another job had already started on them. This is a poor clean up
% issue.
%\item When the PBS server process is restarted, it puts running jobs at risk.
%\item Poor diagnostic messages. This problem can be as serious as ANY other
% problem. This problem makes small, simple problems turn into huge
% turmoil occasionally. For example, the variety of symptoms that arise
% from improper hostname configuration. All the symptoms that result are
% very misleading to the real problem.
%\item Rumored to have problems when the number of jobs in the queues gets
% large.
%\item Scalability problems on large systems.
%\item Non-portable to Windows
%\item Source code is a mess and difficult for others (e.g. the open source
% community) to improve/expand.
%\item Licensing problems (see below).
%\end{itemize}
%The one strength mentioned is PBS's portability and broad user base.
%
%PBS is owned by Veridian and is released as three separate products with
%different licenses: {\em PBS Pro} is a commercial product sold by Veridian;
%{\em OpenPBS} is an pseudo open source version of PBS that requires
%registration; and
%{\em PBS} is a GPL-like, true open source version of PBS.
%
%Bug fixes go into PBS Pro. When a major revision of PBS Pro comes out,
%the previous version of PBS Pro becomes OpenPBS, and the previous version
%of OpenPBS becomes PBS. The delay getting bug fixes (some reported by the
%open source community) into the true open source version of PBS is the source
%of some frustration.
\subsection*{Quadrics RMS}
Quadrics RMS~\cite{Quadrics02}
(Resource Management System) is a resource manager for
UNIX systems equipped with Quadrics Elan interconnects.
RMS functionality and performance are excellent,
but its major limitation is the requirement for a Quadrics interconnect.
The proprietary code and cost may also pose difficulties under some
circumstances.
\subsection*{Maui Scheduler}
Maui Scheduler~\cite{Maui} is an advance-reservation HPC batch scheduler
for use with SP, O2K, and UNIX/Linux clusters.
It is widely used to extend the functionality of PBS and LoadLeveler,
either of which Maui requires to perform parallel job initiation and management.
\subsection*{Distributed Production Control System (DPCS)}
The Distributed Production Control System (DPCS)~\cite{DPCS}
is a scheduler developed at Lawrence Livermore National Laboratory (LLNL).
DPCS provides basic data collection and reporting
mechanisms for project-level, near real-time accounting, and it allocates
resources to customers with established limits per the customers'
organizational budgets.
In addition, DPCS evenly distributes workload across available computers
and supports dynamic reconfiguration and graceful degradation of service to
prevent unauthorized overuse of a computer.
%DPCS is (or will soon be) open source, although its use is presently
%confined to LLNL. The development of DPCS began in 1990 and it has
%evolved into a highly scalable and fault-tolerant meta-scheduler
%operating on top of LoadLeveler, RMS, and NQS. DPCS provides:
%\begin{itemize}
%\item Basic data collection and reporting mechanisms for project-level,
% near real-time accounting.
%\item Resource allocation to customers with established limits per
% customers' organizational budgets.
%\item Proactive delivery of services to organizations that are relatively
% underserviced using a fair-share resource allocation scheme.
%\item Automated, highly flexible system with feedback for proactive delivery
% of resources.
%\item Even distribution of the workload across available computers.
%\item Flexible prioritization of production workload, including "run on demand."
%\item Dynamic reconfiguration and re-tuning.
%\item Graceful degradation in service to prevent overuse of a computer where
% not authorized.
%\end{itemize}
DPCS supports only a
limited number of computer systems: IBM RS/6000 and SP, Linux,
Sun Solaris, and Compaq Alpha.
Like the Maui Scheduler, DPCS requires an underlying infrastructure for
parallel job initiation and management (LoadLeveler, NQS, RMS or SLURM).
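A minimal sketch of the allocation and even-distribution ideas described
above (our own illustration in C with assumed fields, not DPCS code): a job
is refused if its organization has exhausted its allocated budget, and is
otherwise dispatched to the least-loaded eligible computer.
\begin{verbatim}
/* Illustration only, with assumed fields; not DPCS code. */
typedef struct {
    double used;       /* resources consumed so far (e.g., CPU hours) */
    double allocated;  /* limit from the organization's budget        */
} org_account_t;

/* Refuse work from an organization that is over its allocation. */
static int within_allocation(const org_account_t *org)
{
    return org->used < org->allocated;
}

/* Even distribution: dispatch to the least-loaded of n computers. */
static int least_loaded(const double load[], int n)
{
    int i, best = 0;
    for (i = 1; i < n; i++)
        if (load[i] < load[best])
            best = i;
    return best;
}
\end{verbatim}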
\subsection*{LoadLeveler}
LoadLeveler~\cite{LoadLevelerManual,LoadLevelerWeb}
is a proprietary batch system and parallel job manager by
IBM. LoadLeveler supports few non-IBM systems. Its native
scheduling is very primitive, and other software, such as the Maui Scheduler
or DPCS, is required for reasonable performance.
LoadLeveler has a simple and very flexible queue and job class structure
that operates in a ``matrix'' fashion.
The biggest problem of LoadLeveler is its poor scalability:
it typically requires 20 minutes to launch even a trivial 500-node, 8000-task
job on the IBM SP computers at LLNL.
%In addition, all jobs must be initiated through the LoadLeveler, and a special version of
%MPI is requested to run a parallel job.
%[So do RMS, SLURM, etc. for interconnect set-up - Moe]%
%
%Many configuration files exist with signals to
%daemons used to update configuration (like LSF, good). All jobs must
%be initiated through LoadLeveler (no real "interactive" jobs, just
%high priority queue for smaller jobs). Job accounting is only available
%on termination (very bad for long-running jobs). Good status
%information on nodes and LoadLeveler daemons is available. LoadLeveler
%allocates jobs either entire nodes or shared nodes ,depending upon configuration.
%
%A special version of MPI is required. LoadLeveler allocates
%interconnect resources, spawns the user's processes, and manages the
%job afterwards. Daemons also monitor the switch and node health using
%a "heart-beat monitor." One fundamental problem is that when the
%"Central Manager" restarts, it forgets about all nodes and jobs. They
%appear in the database only after checking in via the heartbeat. It
%needs to periodically write state to disk instead of doing
%"cold-starts" after the daemon fails, which is rare. It has the job
%prolog and epilog feature, which permits us to enable/disable logins
%and remove stray processes.
%
%LoadLeveler evolved from Condor, or what was Condor a decade ago.
%While I am less familiar with LSF and Condor than LoadLeveler, they
%all appear very similar with LSF having the far more sophisticated
%scheduler. We should carefully review their data structures and
%daemons before designing our own.
%
\subsection*{Load Sharing Facility (LSF)}
LSF~\cite{LSF}
is a proprietary batch system and parallel job manager by
Platform Computing. Widely deployed across a variety of computer
architectures, it has sophisticated scheduling software including
fair-share, backfill, consumable resources, job preemption, and a
very flexible queue structure.
It also provides good status information on nodes and LSF daemons.
While LSF is quite powerful, it is not open-source and can be costly on
larger clusters.
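Of the scheduling policies listed above, backfill is perhaps the least
self-explanatory. The following sketch (our own schematic of conservative
backfill in C, not LSF's implementation) shows the core test: a
lower-priority job may start ahead of its turn only if it fits on the
currently idle nodes and its time limit guarantees that it finishes before
the highest-priority waiting job is due to start.
\begin{verbatim}
/* Schematic of conservative backfill; not LSF code. */
typedef struct {
    int nodes;       /* nodes requested                  */
    int time_limit;  /* user-supplied run-time limit (s) */
} job_t;

/* May `candidate' start now without delaying the reservation held
 * for the highest-priority waiting job at top_job_start_time? */
static int can_backfill(const job_t *candidate, int idle_nodes,
                        int now, int top_job_start_time)
{
    if (candidate->nodes > idle_nodes)
        return 0;                        /* does not fit right now */
    return now + candidate->time_limit <= top_job_start_time;
}
\end{verbatim}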
%The LSF share many of its shortcomings with the LoadLeveler: job initiation only
%through LSF, requirement of a spwcial MPI library, etc.
%Limits are available on both a per process bs per-job
%basis. Time limits include CPU time and wall-clock time. Many
%configuration files with signals to daemons used to update
%configuration (like LoadLeveler, good). All jobs must be initiated
%through LSF to be accounted for and managed by LSF ("interactive"
%jobs can be executed through a high priority queue for
%smaller jobs). Job accounting only available in near real-time (important
%for long-running jobs). Jobs initiated from same directory as
%submitted from (not good for computer centers with diverse systems
%under LSF control). Good status information on nodes and LSF daemons.
%Allocates jobs either entire nodes or shared nodes depending upon
%configuration.
%
%A special version of MPI is required. LSF allocates interconnect
%resources, spawns the user's processes, and manages the job
%afterwards. While I am less familiar with LSF than LoadLeveler, they
%appear very similar with LSF having the far more sophisticated
%scheduler. We should carefully review their data structures and
%daemons before designing our own.
\subsection*{Condor}
Condor~\cite{Condor,Litzkow88,Basney97}
is a batch system and parallel job manager
developed by the University of Wisconsin.
Condor was the basis for IBM's LoadLeveler, and the two share a very similar
underlying infrastructure. Condor has a very sophisticated checkpoint/restart
service that does not rely upon kernel changes but rather upon a variety of
library changes (which prevent it from being completely general). The
Condor checkpoint/restart service has been integrated into LSF,
Codine, and DPCS. Condor is designed to operate across a
heterogeneous environment, mostly to harness the compute resources of
workstations and PCs. It has an interesting ``advertising'' service:
servers advertise their available resources, consumers advertise
their requirements, and a broker performs the matches. The checkpoint
mechanism is used to relocate work on demand (when the ``owner'' of a
desktop machine wants to resume work).
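The essence of the advertising service can be sketched as follows (our own
schematic in C; Condor's actual mechanism, the ClassAd language, is far more
expressive): machines advertise what they offer, jobs advertise what they
require, and the broker pairs compatible advertisements.
\begin{verbatim}
#include <string.h>

/* Schematic of advertise-and-match; not Condor's ClassAd system. */
typedef struct {
    int cpus;             /* CPUs offered or required        */
    int memory_mb;        /* memory offered or required (MB) */
    const char *arch;     /* architecture string             */
} ad_t;

/* The broker pairs a machine's offer with a job's request when
 * the offer satisfies every requirement in the request. */
static int ads_match(const ad_t *offer, const ad_t *request)
{
    return offer->cpus      >= request->cpus      &&
           offer->memory_mb >= request->memory_mb &&
           strcmp(offer->arch, request->arch) == 0;
}
\end{verbatim}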
%
%\subsection*{Linux PAGG Process Aggregates}
%
%PAGG~\cite{PAGG}
%consists of modifications to the linux kernel that allows
%developers to implement Process AGGregates as loadable kernel modules.
%A process aggregate is defined as a collection of processes that are
%all members of the same set. A set would be implemented as a container
%for the member processes. For instance, process sessions and groups
%could have been implemented as process aggregates.
%
\subsection*{Beowulf Distributed Process Space (BPROC)}
The Beowulf Distributed Process Space
(BPROC)
is a set of kernel
modifications, utilities, and libraries that allow a user to start
processes on other machines in a Beowulf-style cluster~\cite{BProc}. Remote
processes started with this mechanism appear in the process table
of the front end machine in a cluster. This allows remote process
management using the normal UNIX process control facilities. Signals
are transparently forwarded to remote processes and exit status is
received using the usual wait() mechanisms. This tight coupling of
a cluster's nodes is convenient, but high scalability can be difficult
to achieve.
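The resulting programming model can be sketched as follows (a minimal C
example of our own; the remote-fork call appears only as a comment because
its exact BProc interface is an assumption here). The point is that a child,
even when placed on another node, is controlled with the ordinary
fork/kill/wait primitives.
\begin{verbatim}
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int status;
    /* Under BPROC the child could be created on a remote node
     * (via the library's remote-fork call); plain fork() is
     * used here so the sketch runs anywhere. */
    pid_t pid = fork();

    if (pid == 0) {                    /* child process            */
        execlp("hostname", "hostname", (char *) NULL);
        _exit(127);
    }
    kill(pid, SIGCONT);                /* signal as for any child  */
    waitpid(pid, &status, 0);          /* usual wait() mechanism   */
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}
\end{verbatim}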
%\subsection{xcat}
%
%Presumably IBM's suite of cluster management software
%(xcat\footnote{http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246041.html})
%includes a batch system. Look into this.
%
%\subsection{CPLANT}
%
%CPLANT\footnote{http://www.cs.sandia.gov/cplant/} includes
%Parallel Job Launcher, Compute Node Daemon Process,
%Compute Node Allocator, Compute Node Status Tool.
%
%\subsection{NQS}
%
%NQS\footnote{http://umbc7.umbc.edu/nqs/nqsmain.html},
%the Network Queueing System, is a serial batch system.
%
%\subsection*{LAM / MPI}
%
%Local Area Multicomputer (LAM)~\cite{LAM}
%is an MPI programming environment and development system for heterogeneous
%computers on a network.
%With LAM, a dedicated cluster or an existing network
%computing infrastructure can act as one parallel computer solving
%one problem. LAM features extensive debugging support in the
%application development cycle and peak performance for production
%applications. LAM features a full implementation of the MPI
%communication standard.
%
%\subsection{MPICH}
%
%MPICH\footnote{http://www-unix.mcs.anl.gov/mpi/mpich/}
%is a freely available, portable implementation of MPI,
%the Standard for message-passing libraries.
%
%\subsection{Sun Grid Engine}
%
%SGE\footnote{http://www.sun.com/gridware/} is now proprietary.
%
%
%\subsection{SCIDAC}
%
%The Scientific Discovery through Advanced Computing (SciDAC)
%project\footnote{http://www.scidac.org/ScalableSystems}
%has a Resource Management and Accounting working group
%and a white paper\cite{Res2000}. Deployment of a system with
%the required fault-tolerance and scalability is scheduled
%for June 2006.
%
%\subsection{GNU Queue}
%
%GNU Queue\footnote{http://www.gnuqueue.org/home.html}.
%
%\subsection{Clubmask}
%Clubmask\footnote{http://clubmask.sourceforge.net} is based on bproc.
%Separate queueing system?
%
%\subsection{SQMX}
%Part of the SCE Project\footnote{http://www.opensce.org/},
%SQMX\footnote{http://www.beowulf.org/pipermail/beowulf-announce/2001-January/000086.html} is worth taking a look at.