| \section{Related Work} |
| \subsection*{Portable Batch System (PBS)} |
| |
| The Portable Batch System (PBS)~\cite{PBS} |
| is a flexible batch queuing and |
| workload management system originally developed by Veridian Systems |
| for NASA. It operates on networked, multi-platform UNIX environments, |
| including heterogeneous clusters of workstations, supercomputers, and |
| massively parallel systems. PBS was developed as a replacement for |
| NQS (Network Queuing System) by many of the same people. |
| |
| PBS supports sophisticated scheduling logic (via the Maui |
| Scheduler). |
| PBS spawn's daemons on each |
| machine to shepherd the job's tasks. |
| It provides an interface for administrators to easily |
| interface their own scheduling modules. PBS can support |
| long delays in file staging with retry. Host |
| authentication is provided by checking port numbers (low ports numbers are only |
| accessible to user root). Credential service is used for user authentication. |
| It has the job prolog and epilog feature. |
| PBS Supports |
| high priority queue for smaller "interactive" jobs. Signal to daemons |
| causes current log file to be closed, renamed with |
| time-stamp, and a new log file created. |
| |
| Although the PBS is portable and has a broad user base, it has significant drawbacks. |
| PBS is single threaded and hence exhibits poor performance on large clusters. |
| This is particularly problematic when a compute node in the system fails: |
| PBS tries to contact down node while other activities must wait. |
| PBS also has a weak mechanism for starting and cleaning up parallel jobs. |
| %Specific complaints about PBS from members of the OSCAR group (Jeremy Enos, |
| %Jeff Squyres, Tim Mattson): |
| %\begin{itemize} |
| %\item Sensitivity to hostname configuration on the server; improper |
| % configuration results in hard to diagnose failure modes. Once |
| % configuration is correct, this issue disappears. |
| %\item When a compute node in the system dies, everything slows down. |
| % PBS is single-threaded and continues to try to contact down nodes, |
| % while other activities like scheduling jobs, answering qsub/qstat |
| % requests, etc., have to wait for a complete timeout cycle before being |
| % processed. |
| %\item Default scheduler is just FIFO, but Maui can be plugged in so this |
| % is not a big issue. |
| %\item Weak mechanism for starting/cleaning up parallel jobs (pbsdsh). |
| % When a job is killed, pbsdsh kills the processes it started, but |
| % if the process doesn't die on the first shot it may continue on. |
| %\item PBS server continues to mark specific nodes offline, even though they |
| % are healthy. Restarting the server fixes this. |
| %\item Lingering jobs. Jobs assigned to nodes, and then bounced back to the |
| % queue for any reason, maintain their assignment to those nodes, even |
| % if another job had already started on them. This is a poor clean up |
| % issue. |
| %\item When the PBS server process is restarted, it puts running jobs at risk. |
| %\item Poor diagnostic messages. This problem can be as serious as ANY other |
| % problem. This problem makes small, simple problems turn into huge |
| % turmoil occasionally. For example, the variety of symptoms that arise |
| % from improper hostname configuration. All the symptoms that result are |
| % very misleading to the real problem. |
| %\item Rumored to have problems when the number of jobs in the queues gets |
| % large. |
| %\item Scalability problems on large systems. |
| %\item Non-portable to Windows |
| %\item Source code is a mess and difficult for others (e.g. the open source |
| % community) to improve/expand. |
| %\item Licensing problems (see below). |
| %\end{itemize} |
| %The one strength mentioned is PBS's portability and broad user base. |
| % |
| %PBS is owned by Veridian and is released as three separate products with |
| %different licenses: {\em PBS Pro} is a commercial product sold by Veridian; |
| %{\em OpenPBS} is an pseudo open source version of PBS that requires |
| %registration; and |
| %{\em PBS} is a GPL-like, true open source version of PBS. |
| % |
| %Bug fixes go into PBS Pro. When a major revision of PBS Pro comes out, |
| %the previous version of PBS Pro becomes OpenPBS, and the previous version |
| %of OpenPBS becomes PBS. The delay getting bug fixes (some reported by the |
| %open source community) into the true open source version of PBS is the source |
| %of some frustration. |
| |
| |
| \subsection{Quadrics RMS} |
| |
| Quadrics RMS\cite{Quadrics02} |
| (Resource Management System) is for |
| Unix systems having Quadrics Elan interconnects. |
| RMS functionality and performance is excellent. |
| It major limitation is the requirement for a Quadrics interconnect. |
| The proprietary code and cost may also pose difficulties under some |
| circumstances. |
| |
| |
| \subsection*{Maui Scheduler} |
| |
| Maui Scheduler~\cite{Maui} is an advanced reservation HPC batch scheduler |
| for use with SP, O2K, and UNIX/Linux clusters. |
| It is widely used to extend the functionality of PBS and LoadLeveler, |
| which Maui requires to perform the parallel job initiation and management. |
| |
| \subsection*{Distributed Production Control System (DPCS)} |
| |
| The Distributed Production Control System (DPCS)~\cite{DPCS} |
| is a scheduler developed at Lawrence Livermore National Laboratory (LLNL). |
| The DPCS provides basic data collection and reporting |
| mechanisms for prject-level, near real-time accounting and resource allocation |
| to customers with established limits per customers' organization budgets, |
| In addition, the DPCS evenly distributes workload across available computers |
| and supports dynamic reconfiguration and graceful degradation of service to prevent |
| overuse of a computer where not authorized. |
| %DPCS is (or will soon be) open source, although its use is presently |
| %confined to LLNL. The development of DPCS began in 1990 and it has |
| %evolved into a highly scalable and fault-tolerant meta-scheduler |
| %operating on top of LoadLeveler, RMS, and NQS. DPCS provides: |
| %\begin{itemize} |
| %\item Basic data collection and reporting mechanisms for project-level, |
| % near real-time accounting. |
| %\item Resource allocation to customers with established limits per |
| % customers' organizational budgets. |
| %\item Proactive delivery of services to organizations that are relatively |
| % underserviced using a fair-share resource allocation scheme. |
| %\item Automated, highly flexible system with feedback for proactive delivery |
| % of resources. |
| %\item Even distribution of the workload across available computers. |
| %\item Flexible prioritization of production workload, including "run on demand." |
| %\item Dynamic reconfiguration and re-tuning. |
| %\item Graceful degradation in service to prevent overuse of a computer where |
| % not authorized. |
| %\end{itemize} |
| |
| DPCS supports only a |
| limited number of computer systems: IBM RS/6000 and SP, Linux, |
| Sun Solaris, and Compaq Alpha. |
| Like the Maui Scheduler, DPCS requires an underlying infrastructure for |
| parallel job initiation and management (LoadLeveler, NQS, RMS or SLURM). |
| |
| \subsection*{LoadLeveler} |
| |
| LoadLeveler~\cite{LoadLevelerManual,LoadLevelerWeb} |
| is a proprietary batch system and parallel job manager by |
| IBM. LoadLeveler supports few non-IBM systems. Very primitive |
| scheduling software exists and other software is required for reasonable |
| performance, such as Maui Scheduler or DPCS. |
| The LoadLeveler has a simple and very flexible queue and job class structure available |
| operating in "matrix" fashion. |
| The biggest problem of the LoadLeveler is its poor scalability. |
| It typically requires 20 minutes to execute even a trivial 500-node, 8000-task |
| on the IBM SP computers at LLNL. |
| %In addition, all jobs must be initiated through the LoadLeveler, and a special version of |
| %MPI is requested to run a parallel job. |
| %[So do RMS, SLURM, etc. for interconnect set-up - Moe]% |
| % |
| %Many configuration files exist with signals to |
| %daemons used to update configuration (like LSF, good). All jobs must |
| %be initiated through LoadLeveler (no real "interactive" jobs, just |
| %high priority queue for smaller jobs). Job accounting is only available |
| %on termination (very bad for long-running jobs). Good status |
| %information on nodes and LoadLeveler daemons is available. LoadLeveler |
| %allocates jobs either entire nodes or shared nodes ,depending upon configuration. |
| % |
| %A special version of MPI is required. LoadLeveler allocates |
| %interconnect resources, spawns the user's processes, and manages the |
| %job afterwards. Daemons also monitor the switch and node health using |
| %a "heart-beat monitor." One fundamental problem is that when the |
| %"Central Manager" restarts, it forgets about all nodes and jobs. They |
| %appear in the database only after checking in via the heartbeat. It |
| %needs to periodically write state to disk instead of doing |
| %"cold-starts" after the daemon fails, which is rare. It has the job |
| %prolog and epilog feature, which permits us to enable/disable logins |
| %and remove stray processes. |
| % |
| %LoadLeveler evolved from Condor, or what was Condor a decade ago. |
| %While I am less familiar with LSF and Condor than LoadLeveler, they |
| %all appear very similar with LSF having the far more sophisticated |
| %scheduler. We should carefully review their data structures and |
| %daemons before designing our own. |
| % |
| \subsection*{Load Sharing Facility (LSF)} |
| |
| LSF~\cite{LSF} |
| is a proprietary batch system and parallel job manager by |
| Platform Computing. Widely deployed on a wide variety of computer |
| architectures, it has sophisticated scheduling software including |
| fair-share, backfill, consumable resources, an job preemption and |
| very flexible queue structure. |
| It also provides good status information on nodes and LSF daemons. |
| While LSF is quite powerful, it is not open-source and can be costly on |
| larger clusters. |
| %The LSF share many of its shortcomings with the LoadLeveler: job initiation only |
| %through LSF, requirement of a spwcial MPI library, etc. |
| %Limits are available on both a per process bs per-job |
| %basis. Time limits include CPU time and wall-clock time. Many |
| %configuration files with signals to daemons used to update |
| %configuration (like LoadLeveler, good). All jobs must be initiated |
| %through LSF to be accounted for and managed by LSF ("interactive" |
| %jobs can be executed through a high priority queue for |
| %smaller jobs). Job accounting only available in near real-time (important |
| %for long-running jobs). Jobs initiated from same directory as |
| %submitted from (not good for computer centers with diverse systems |
| %under LSF control). Good status information on nodes and LSF daemons. |
| %Allocates jobs either entire nodes or shared nodes depending upon |
| %configuration. |
| % |
| %A special version of MPI is required. LSF allocates interconnect |
| %resources, spawns the user's processes, and manages the job |
| %afterwards. While I am less familiar with LSF than LoadLeveler, they |
| %appear very similar with LSF having the far more sophisticated |
| %scheduler. We should carefully review their data structures and |
| %daemons before designing our own. |
| |
| |
| \subsection*{Condor} |
| |
| Condor~\cite{Condor,Litzkow88,Basney97} |
| is a batch system and parallel job manager |
| developed by the University of Wisconsin. |
| Condor was the basis for IBM's LoadLeveler and both share very similar |
| underlying infrastructure. Condor has a very sophisticated checkpoint/restart |
| service that does not rely upon kernel changes, but a variety of |
| library changes (which prevent it from being completely general). The |
| Condor checkpoint/restart service has been integrated into LSF, |
| Codine, and DPCS. Condor is designed to operate across a |
| heterogeneous environment, mostly to harness the compute resources of |
| workstations and PCs. It has an interesting "advertising" service. |
| Servers advertise their available resources and consumers advertise |
| their requirements for a broker to perform matches. The checkpoint |
| mechanism is used to relocate work on demand (when the "owner" of a |
| desktop machine wants to resume work). |
| |
| % |
| %\subsection*{Linux PAGG Process Aggregates} |
| % |
| %PAGG~\cite{PAGG} |
| %consists of modifications to the linux kernel that allows |
| %developers to implement Process AGGregates as loadable kernel modules. |
| %A process aggregate is defined as a collection of processes that are |
| %all members of the same set. A set would be implemented as a container |
| %for the member processes. For instance, process sessions and groups |
| %could have been implemented as process aggregates. |
| % |
| |
| \subsection*{Beowulf Distributed Process Space (BPROC)} |
| |
| The Beowulf Distributed Process Space |
| (BPROC) |
| is set of kernel |
| modifications, utilities and libraries which allow a user to start |
| processes on other machines in a Beowulf-style cluster~\cite{BProc}. Remote |
| processes started with this mechanism appear in the process table |
| of the front end machine in a cluster. This allows remote process |
| management using the normal UNIX process control facilities. Signals |
| are transparently forwarded to remote processes and exit status is |
| received using the usual wait() mechanisms. This tight coupling of |
| a cluster's nodes is convenient, but high scalability can be difficult |
| to achieve. |
| |
| %\subsection{xcat} |
| % |
| %Presumably IBM's suite of cluster management software |
| %(xcat\footnote{http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246041.html}) |
| %includes a batch system. Look into this. |
| % |
| %\subsection{CPLANT} |
| % |
| %CPLANT\footnote{http://www.cs.sandia.gov/cplant/} includes |
| %Parallel Job Launcher, Compute Node Daemon Process, |
| %Compute Node Allocator, Compute Node Status Tool. |
| % |
| %\subsection{NQS} |
| % |
| %NQS\footnote{http://umbc7.umbc.edu/nqs/nqsmain.html}, |
| %the Network Queueing System, is a serial batch system. |
| % |
| %\subsection*{LAM / MPI} |
| % |
| %Local Area Multicomputer (LAM)~\cite{LAM} |
| %is an MPI programming environment and development system for heterogeneous |
| %computers on a network. |
| %With LAM, a dedicated cluster or an existing network |
| %computing infrastructure can act as one parallel computer solving |
| %one problem. LAM features extensive debugging support in the |
| %application development cycle and peak performance for production |
| %applications. LAM features a full implementation of the MPI |
| %communication standard. |
| % |
| %\subsection{MPICH} |
| % |
| %MPICH\footnote{http://www-unix.mcs.anl.gov/mpi/mpich/} |
| %is a freely available, portable implementation of MPI, |
| %the Standard for message-passing libraries. |
| % |
| %\subsection{Sun Grid Engine} |
| % |
| %SGE\footnote{http://www.sun.com/gridware/} is now proprietary. |
| % |
| % |
| %\subsection{SCIDAC} |
| % |
| %The Scientific Discovery through Advanced Computing (SciDAC) |
| %project\footnote{http://www.scidac.org/ScalableSystems} |
| %has a Resource Management and Accounting working group |
| %and a white paper\cite{Res2000}. Deployment of a system with |
| %the required fault-tolerance and scalability is scheduled |
| %for June 2006. |
| % |
| %\subsection{GNU Queue} |
| % |
| %GNU Queue\footnote{http://www.gnuqueue.org/home.html}. |
| % |
| %\subsection{Clubmask} |
| %Clubmask\footnote{http://clubmask.sourceforge.net} is based on bproc. |
| %Separate queueing system? |
| % |
| %\subsection{SQMX} |
| %Part of the SCE Project\footnote{http://www.opensce.org/}, |
| %SQMX\footnote{http://www.beowulf.org/pipermail/beowulf-announce/2001-January/000086.html} is worth taking a look at. |