| \section{SLURM Operation and Services} |
| \subsection{Command Line Utilities} |
| |
| The command line utilities are the user interface to SLURM functionality. |
| They offer users access to remote execution and job control. They also |
| permit administrators to dynamically change the system configuration. |
These commands all use SLURM APIs, which are directly available to
more sophisticated applications.
| |
| \begin{itemize} |
| \item {\tt scancel}: Cancel a running or a pending job or job step, |
| subject to authentication and authorization. This command can also |
| be used to send an arbitrary signal to all processes on all nodes |
| associated with a job or job step. |
| |
| \item {\tt scontrol}: Perform privileged administrative commands |
| such as draining a node or partition in preparation for maintenance. |
| Many \scontrol\ functions can only be executed by privileged users. |
| |
| \item {\tt sinfo}: Display a summary of partition and node information. |
An assortment of filtering and output format options are available.
| |
| \item {\tt squeue}: Display the queue of running and waiting jobs |
| and/or job steps. A wide assortment of filtering, sorting, and output |
| format options are available. |
| |
| \item {\tt srun}: Allocate resources, submit jobs to the SLURM queue, |
| and initiate parallel tasks (job steps). |
Every set of executing parallel tasks has an associated \srun\ which
initiated it and, if the \srun\ persists, manages it.
| Jobs may be submitted for batch execution, in which case |
| \srun\ terminates after job submission. |
Jobs may also be submitted for interactive execution, where \srun\ keeps
running to shepherd the job. In this case,
\srun\ negotiates connections with the remote {\tt slurmd}'s
to initiate the job,
receive stdout and stderr, forward stdin, and respond to signals from the user.
| The \srun\ may also be instructed to allocate a set of resources and |
| spawn a shell with access to those resources. |
| \srun\ has a total of 13 parameters to control where and when the job |
| is initiated. |
| |
| \end{itemize} |
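
As noted above, these utilities are thin wrappers around SLURM APIs that
more sophisticated applications may call directly. The minimal sketch below
queries the job queue much as {\tt squeue} does; it assumes the job-query
calls exported by {\tt slurm.h}, whose exact signatures may differ between
SLURM versions.

\begin{verbatim}
/* Minimal sketch: query and print the job queue via the SLURM API,
 * much as squeue does.  Function signatures are illustrative and
 * may differ between SLURM versions. */
#include <stdio.h>
#include <slurm/slurm.h>

int main(void)
{
    job_info_msg_t *jobs = NULL;

    /* Ask slurmctld for the current job table. */
    if (slurm_load_jobs((time_t) 0, &jobs, 0) != 0) {
        slurm_perror("slurm_load_jobs");
        return 1;
    }

    /* Print one line per job, then release the message. */
    slurm_print_job_info_msg(stdout, jobs, 1);
    slurm_free_job_info_msg(jobs);
    return 0;
}
\end{verbatim}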
| |
| \subsection{Plugins} |
| |
| In order to make the use of different infrastructures possible, |
| SLURM uses a general purpose plugin mechanism. |
| A SLURM plugin is a dynamically linked code object which is |
| loaded explicitly at run time by the SLURM libraries. |
A plugin provides a customized implementation of a well-defined
API for tasks such as authentication, interconnect fabric support,
and task scheduling.
| A common set of functions is defined for use by all of the different |
| infrastructures of a particular variety. |
| For example, the authentication plugin must define functions |
| such as: |
| {\tt slurm\_auth\_activate} to create a credential, |
| {\tt slurm\_auth\_verify} to verify a credential to |
| approve or deny authentication, |
| {\tt slurm\_auth\_get\_uid} to get the user ID associated with |
| a specific credential, etc. |
It also must define the data structure used, a plugin type,
and a plugin version number.
The plugins to be used are specified in the SLURM configuration file.
| %When a slurm daemon is initiated, it reads the configuration |
| %file to determine which of the available plugins should be used. |
| %For example {\em AuthType=auth/authd} says to use the plugin for |
| %authd based authentication and {\em PluginDir=/usr/local/lib} |
| %identifies the directory in which to find the plugin. |
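
The skeleton below suggests what an authentication plugin's exported
interface might look like. The function names follow those listed above;
the argument types, the version constant, and the {\tt auth/none} type
string are illustrative only.

\begin{verbatim}
/* Sketch of an authentication plugin's exported symbols.  Function
 * names follow the text above; argument types and constants are
 * illustrative, not SLURM's actual plugin interface definition. */
#include <sys/types.h>
#include <stdint.h>

const char     plugin_name[]  = "Null authentication plugin";
const char     plugin_type[]  = "auth/none";
const uint32_t plugin_version = 1;

/* Opaque credential type defined by the plugin. */
typedef struct slurm_auth_credential slurm_auth_credential_t;

/* Create a credential for the requesting user. */
slurm_auth_credential_t *slurm_auth_activate(int secs_to_live);

/* Verify a credential; returns 0 if authentication is approved. */
int slurm_auth_verify(slurm_auth_credential_t *cred);

/* Return the user ID bound to a verified credential. */
uid_t slurm_auth_get_uid(slurm_auth_credential_t *cred);
\end{verbatim}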
| |
| \subsection{Communications Layer} |
| |
| SLURM presently uses Berkeley sockets for communications. |
| However, we anticipate using the plugin mechanism to easily |
| permit use of other communications layers. |
| At LLNL we are using an Ethernet for SLURM communications and |
| the Quadrics Elan switch exclusively for user applications. |
The SLURM configuration file permits the specification of each
node's hostname as well as a separate name to be used for communications.
| %In the case of a control machine known as {\em mcri} to be |
| %communicated with using the name {\em emcri} (say to indicate |
| %an ethernet communications path), this is represented in the |
| %configuration file as {\em ControlMachine=mcri ControlAddr=emcri}. |
| %The name used for communication is the same as the hostname unless |
| %%otherwise specified. |
| |
| While SLURM is able to manage 1000 nodes without difficulty using |
| sockets and Ethernet, we are reviewing other communication |
| mechanisms which may offer improved scalability. |
| One possible alternative is STORM\cite{STORM01}. |
| STORM uses the cluster interconnect and Network Interface Cards to |
| provide high-speed communications including a broadcast capability. |
STORM only supports the Quadrics Elan interconnect at present,
| but does offer the promise of improved performance and scalability. |
| |
| \subsection{Security} |
| |
| SLURM has a simple security model: |
Any user of the cluster may submit parallel jobs for execution and cancel
his or her own jobs. Any user may view SLURM configuration and state
information.
| Only privileged users may modify the SLURM configuration, |
| cancel any jobs, or perform other restricted activities. |
| Privileged users in SLURM include the users {\em root} |
| and {\tt SlurmUser} (as defined in the SLURM configuration file). |
| If permission to modify SLURM configuration is |
| required by others, set-uid programs may be used to grant specific |
| permissions to specific users. |
| |
| We presently support three authentication mechanisms via plugins: |
| {\tt authd}\cite{Authd02}, {\tt munged} and {\tt none}. |
A plugin can easily be developed for Kerberos or other authentication
mechanisms as desired.
| The \munged\ implementation is described below. |
| A \munged\ daemon running as user {\em root} on each node confirms the |
identity of the user making the request using the {\tt getpeername}
| function and generates a credential. |
| The credential contains a user ID, |
| group ID, time-stamp, lifetime, some pseudo-random information, and |
| any user supplied information. The \munged\ uses a private key to |
| generate a Message Authentication Code (MAC) for the credential. |
The \munged\ then symmetrically encrypts the credential,
including the MAC, with a private key.
| SLURM daemons and programs transmit this encrypted |
| credential with communications. The SLURM daemon receiving the message |
| sends the credential to \munged\ on that node. |
| The \munged\ decrypts the credential using its private key, validates it |
| and returns the user ID and group ID of the user originating the |
| credential. |
| The \munged\ prevents replay of a credential on any single node |
| by recording credentials that have already been authenticated. |
| In SLURM's case, the user supplied information includes node |
| identification information to prevent a credential from being |
| used on nodes it is not destined for. |
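
The information carried in a \munged\ credential, as described above, can
be pictured roughly as follows; the structure is purely illustrative and
does not reproduce {\tt munged}'s actual data layout or wire format.

\begin{verbatim}
/* Illustrative view of the fields carried in a munge credential.
 * This is not munged's actual data structure or wire format. */
#include <sys/types.h>
#include <stdint.h>
#include <time.h>

typedef struct {
    uid_t    uid;            /* user ID of the requester              */
    gid_t    gid;            /* group ID of the requester             */
    time_t   created;        /* time-stamp of credential creation     */
    uint32_t lifetime;       /* seconds the credential remains valid  */
    uint8_t  nonce[16];      /* pseudo-random data to deter forgery   */
    void    *user_data;      /* caller-supplied data; SLURM places    */
    uint32_t user_data_len;  /*   node identification here            */
    uint8_t  mac[16];        /* MAC computed over the fields above    */
} munge_cred_fields_t;
\end{verbatim}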
| |
| When resources are allocated to a user by the controller, a |
| {\em job step credential} is generated by combining the user ID, job ID, |
| step ID, the list of resources allocated (nodes), and the credential |
| lifetime. This job step credential is encrypted with |
| a \slurmctld\ private key. This credential |
| is returned to the requesting agent ({\tt srun}) along with the |
| allocation response, and must be forwarded to the remote {\tt slurmd}'s |
| upon job step initiation. \slurmd\ decrypts this credential with the |
| \slurmctld 's public key to verify that the user may access |
| resources on the local node. \slurmd\ also uses this job step credential |
| to authenticate standard input, output, and error communication streams. |
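
A job step credential can be thought of as carrying the following fields;
again, the names and types below are a sketch rather than \slurmctld 's
actual structure.

\begin{verbatim}
/* Illustrative contents of a job step credential.  Field names and
 * types are a sketch, not slurmctld's actual structure. */
#include <sys/types.h>
#include <stdint.h>
#include <time.h>

typedef struct {
    uid_t    uid;         /* user the job step runs as             */
    uint32_t job_id;      /* job to which the step belongs         */
    uint32_t step_id;     /* step within that job                  */
    char    *node_list;   /* nodes allocated to the job step       */
    time_t   expiration;  /* credential lifetime                   */
    /* The credential is encrypted (signed) with slurmctld's private
     * key; each slurmd verifies it with the public key before
     * launching tasks or accepting I/O streams for the step. */
} job_step_cred_t;
\end{verbatim}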
| |
| %Access to partitions may be restricted via a {\em RootOnly} flag. |
| %If this flag is set, job submit or allocation requests to this |
| %partition are only accepted if the effective user ID originating |
| %the request is a privileged user. |
| %The request from such a user may submit a job as any other user. |
| %This may be used, for example, to provide specific external schedulers |
| %with exclusive access to partitions. Individual users will not be |
| %permitted to directly submit jobs to such a partition, which would |
| %prevent the external scheduler from effectively managing it. |
| %Access to partitions may also be restricted to users who are |
| %members of specific Unix groups using a {\em AllowGroups} specification. |
| |
| \subsection{Job Initiation} |
| |
There are three modes in which jobs may be run by users under SLURM. The
first and simplest is {\em interactive} mode, in which stdout and
| stderr are displayed on the user's terminal in real time, and stdin and |
| signals may be forwarded from the terminal transparently to the remote |
| tasks. The second is {\em batch} mode, in which the job is |
| queued until the request for resources can be satisfied, at which time the |
| job is run by SLURM as the submitting user. In {\em allocate} mode, |
| a job is allocated to the requesting user, under which the user may |
| manually run job steps via a script or in a sub-shell spawned by \srun . |
| |
| \begin{figure}[tb] |
| \centerline{\epsfig{file=../figures/connections.eps,scale=0.5}} |
| \caption{\small Job initiation connections overview. 1. The \srun\ connects to |
| \slurmctld\ requesting resources. 2. \slurmctld\ issues a response, |
| with list of nodes and job credential. 3. The \srun\ opens a listen |
| port for every task in the job step, then sends a run job step |
| request to \slurmd . 4. \slurmd 's initiate job step and connect |
| back to \srun\ for stdout/err. } |
| \label{connections} |
| \end{figure} |
| |
| Figure~\ref{connections} gives a high-level depiction of the connections |
| that occur between SLURM components during a general interactive job |
| startup. |
The \srun\ requests a resource allocation and job step initiation from the {\tt slurmctld},
which responds with the job ID, the list of allocated nodes, and the job credential
if the request is granted.
| The \srun\ then initializes listen ports for each |
| task and sends a message to the {\tt slurmd}'s on the allocated nodes requesting |
| that the remote processes be initiated. The {\tt slurmd}'s begin execution of |
| the tasks and connect back to \srun\ for stdout and stderr. This process and |
| the other initiation modes are described in more detail below. |
| |
| \subsubsection{Interactive mode initiation} |
| |
| \begin{figure}[tb] |
| \centerline{\epsfig{file=../figures/interactive-job-init.eps,scale=0.5} } |
| \caption{\small Interactive job initiation. \srun\ simultaneously allocates |
| nodes |
| and a job step from \slurmctld\ then sends a run request to all |
\slurmd 's in the job. Dashed arrows indicate a periodic request that
| may or may not occur during the lifetime of the job.} |
| \label{init-interactive} |
| \end{figure} |
| |
| Interactive job initiation is illustrated in Figure~\ref{init-interactive}. |
| The process begins with a user invoking \srun\ in interactive mode. |
| In Figure~\ref{init-interactive}, the user has requested an interactive |
| run of the executable ``{\tt cmd}'' in the default partition. |
| |
| After processing command line options, \srun\ sends a message to |
| \slurmctld\ requesting a resource allocation and a job step initiation. |
| This message simultaneously requests an allocation (or job) and a job step. |
| The \srun\ waits for a reply from {\tt slurmctld}, which may not come instantly |
| if the user has requested that \srun\ block until resources are available. |
| When resources are available |
| for the user's job, \slurmctld\ replies with a job step credential, list of |
nodes that were allocated, cpus per node, and so on. The \srun\ then sends
a message to each \slurmd\ on the allocated nodes requesting that a job
| step be initiated. The \slurmd 's verify that the job is valid using |
| the forwarded job step credential and then respond to \srun . |
| |
| Each \slurmd\ invokes a job thread to handle the request, which in turn |
| invokes a task thread for each requested task. The task thread connects |
back to a port opened by \srun\ for stdout and stderr. The host and
port for this connection are contained in the run request message sent
to this machine by \srun . Once stdout and stderr have successfully
| been connected, the task thread takes the necessary steps to initiate |
| the user's executable on the node, initializing environment, current |
| working directory, and interconnect resources if needed. |
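
Because SLURM communications use Berkeley sockets, the connect-back for
stdout and stderr is an ordinary TCP connection from the task thread to
the host and port carried in the run request. The sketch below illustrates
this; the {\tt resp\_host} and {\tt resp\_port} names are hypothetical.

\begin{verbatim}
/* Sketch: task thread connecting back to the srun host/port carried
 * in the run request, to carry stdout/stderr traffic.  The
 * resp_host/resp_port parameter names are hypothetical. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int connect_back(const char *resp_host, uint16_t resp_port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(resp_port);
    if (inet_pton(AF_INET, resp_host, &addr.sin_addr) != 1 ||
        connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;   /* caller dup2()s this onto the task's stdout/stderr */
}
\end{verbatim}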
| |
| Once the user process exits, the task thread records the exit status and |
| sends a task exit message back to \srun . When all local processes |
terminate, the job thread exits. The \srun\ process either waits
for all tasks to exit or attempts to clean up the remaining processes
some time after the first task exits.
| Regardless, once all |
| tasks are finished, \srun\ sends a message to the \slurmctld\ releasing |
| the allocated nodes, then exits with an appropriate exit status. |
| |
| When the \slurmctld\ receives notification that \srun\ no longer needs |
| the allocated nodes, it issues a request for the epilog to be run on each of |
| the \slurmd 's in the allocation. As \slurmd 's report that the epilog ran |
| successfully, the nodes are returned to the partition. |
| |
| |
| \subsubsection{Batch mode initiation} |
| |
| \begin{figure}[tb] |
| \centerline{\epsfig{file=../figures/queued-job-init.eps,scale=0.5} } |
| \caption{\small Queued job initiation. |
| \slurmctld\ initiates the user's job as a batch script on one node. |
The batch script contains an \srun\ call that initiates parallel tasks
after instantiating a job step with the controller. The shaded region is
| a compressed representation and is illustrated in more detail in the |
| interactive diagram (Figure~\ref{init-interactive}).} |
| \label{init-batch} |
| \end{figure} |
| |
| Figure~\ref{init-batch} illustrates the initiation of a batch job in SLURM. |
Once a batch job is submitted, \srun\ sends a batch job request
to \slurmctld\ that contains the input/output location for the job, current
working directory, environment, and requested number of nodes. The
| \slurmctld\ queues the request in its priority ordered queue. |
| |
| Once the resources are available and the job has a high enough priority, |
| \slurmctld\ allocates the resources to the job and contacts the first node |
| of the allocation requesting that the user job be started. In this case, |
| the job may either be another invocation of \srun\ or a {\em job script} which |
| may have multiple invocations of \srun\ within it. The \slurmd\ on the remote |
| node responds to the run request, initiating the job thread, task thread, |
| and user script. An \srun\ executed from within the script detects that it |
| has access to an allocation and initiates a job step on some or all of the |
| nodes within the job. |
| |
Once the job step is complete, the \srun\ in the job script notifies the
\slurmctld\ and terminates. The job script continues executing and may
| initiate further job steps. Once the job script completes, the task |
| thread running the job script collects the exit status and sends a task exit |
| message to the \slurmctld . The \slurmctld\ notes that the job is complete |
| and requests that the job epilog be run on all nodes that were allocated. |
| As the \slurmd 's respond with successful completion of the epilog, |
| the nodes are returned to the partition. |
| |
| \subsubsection{Allocate mode initiation} |
| |
| \begin{figure}[tb] |
| \centerline{\epsfig{file=../figures/allocate-init.eps,scale=0.5} } |
| \caption{\small Job initiation in allocate mode. Resources are allocated and |
\srun\ spawns a shell with access to the resources. When the user runs
an \srun\ from within the shell, a job step is initiated under
the allocation.}
| \label{init-allocate} |
| \end{figure} |
| |
| In allocate mode, the user wishes to allocate a job and interactively run |
| job steps under that allocation. The process of initiation in this mode |
| is illustrated in Figure~\ref{init-allocate}. The invoked \srun\ sends |
| an allocate request to \slurmctld , which, if resources are available, |
| responds with a list of nodes allocated, job id, etc. The \srun\ |
| process spawns a shell on the user's terminal with access to the |
| allocation, then waits for the shell to exit at which time the job |
| is considered complete. |
| |
| An \srun\ initiated within the allocate sub-shell recognizes that it |
| is running under an allocation and therefore already within a job. Provided |
| with no other arguments, \srun\ started in this manner initiates a job |
| step on all nodes within the current job. However, the user may select |
| a subset of these nodes implicitly. |
| |
An \srun\ executed from the sub-shell reads the environment and
user options, then notifies the controller that it is starting a job step
under the current job. The \slurmctld\ registers the job step and responds
with a job step credential. The \srun\ then initiates the job step using the same
| general method as described in the section on interactive job initiation. |
| |
| When the user exits the allocate sub-shell, the original \srun\ receives |
| exit status, notifies \slurmctld\ that the job is complete, and exits. |
| The controller runs the epilog on each of the allocated nodes, returning |
| nodes to the partition as they complete the epilog. |