| <!--#include virtual="header.txt"--> |
| |
| <h1>Slurm Federated Scheduling Guide</h1> |
| |
| <ul> |
| <li><a href="#overview">Overview</a></li> |
| <li><a href="#configuration">Configuration</a></li> |
| <li><a href="#jobids">Federated Job IDs</a></li> |
| <li><a href="#jobsubmission">Job Submission</a></li> |
| <li><a href="#jobscheduling">Job Scheduling</a></li> |
| <li><a href="#jobrequeue">Job Requeue</a></li> |
| <li><a href="#interactivejobs">Interactive Jobs</a></li> |
| <li><a href="#jobcancel">Canceling Jobs</a></li> |
| <li><a href="#jobmodify">Job Modification</a></li> |
| <li><a href="#jobarrays">Job Arrays</a></li> |
| <li><a href="#statuscmds">Status Commands</a></li> |
| <li><a href="#glossary">Glossary</a></li> |
| <li><a href="#limitations">Limitations</a></li> |
| </ul> |
| |
| <h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2> |
<p>Slurm includes support for creating a federation of clusters
and scheduling jobs in a peer-to-peer fashion between them. Jobs submitted to a
federation receive a job ID that is unique among all clusters in the
federation. A job is submitted to the local cluster (the cluster defined in the
slurm.conf) and is then replicated across the clusters in the federation. Each
cluster then independently attempts to schedule the job based on its own
scheduling policies. The clusters coordinate with the "origin" cluster (the
cluster the job was submitted to) to schedule the job.</p>
| |
| <p> |
<b>NOTE</b>: This is not intended as a high-throughput environment. If
scheduling more than 50,000 jobs a day, consider configuring fewer clusters
that sibling jobs can be submitted to, or directing load to the local cluster
only (e.g. the --cluster-constraint= or -M submission options can be used to
do this).
| </p> |
| |
| |
| <h2 id="configuration">Configuration |
| <a class="slurm_link" href="#configuration"></a> |
| </h2> |
<p>
A federation is created using the sacctmgr command, which creates the
federation in the database and adds clusters to it.
</p>
| |
| <br> |
| To create a federation use: |
| <pre> |
| sacctmgr add federation <federation_name> [clusters=<list_of_clusters>] |
| </pre> |
| |
| Clusters can be added or removed from a federation using: |
| <br> |
| <b>NOTE</b>: A cluster can only be a member of one federation at a time. |
| <pre> |
| sacctmgr modify federation <federation_name> set clusters[+-]=<list_of_clusters> |
| sacctmgr modify cluster <cluster_name> set federation=<federation_name> |
| sacctmgr modify federation <federation_name> set clusters= |
| sacctmgr modify cluster <cluster_name> set federation= |
| </pre> |
<b>NOTE</b>: If a cluster is removed from a federation without first being
drained:
<ul>
<li>Jobs running on the removed cluster, or that originated from it, will
continue to run as non-federated jobs.</li>
<li>A job pending on the origin cluster will remain pending there as a
non-federated job, and the remaining sibling jobs will be removed.</li>
<li>If the origin cluster is being removed and a pending job is viable on only
one other cluster, the job will remain pending on that cluster as a
non-federated job.</li>
<li>If the origin cluster is being removed and a pending job is viable on
multiple clusters other than the origin cluster, the remaining sibling jobs
will remain pending as federated jobs and the remaining sibling clusters will
schedule amongst themselves to start the job.</li>
</ul>
| |
| <br> |
| <br> |
| Federations can be deleted using: |
| <pre> |
| sacctmgr delete federation <federation_name> |
| </pre> |
| |
| |
| Generic features can be assigned to clusters and can be requested at submission |
| using the <b>--cluster-constraint=[!]<feature_list></b> option: |
| <pre> |
| sacctmgr modify cluster <cluster_name> set features[+-]=<feature_list> |
| </pre> |
| |
| A cluster's federated state can be set using: |
| <pre> |
| sacctmgr modify cluster <cluster_name> set fedstate=<state> |
| </pre> |
| where possible states are: |
| <ul> |
| <li><b>ACTIVE:</b> Cluster will actively accept and schedule federated |
| jobs</li> |
| <li><b>INACTIVE:</b> Cluster will not schedule or accept any jobs</li> |
| <li><b>DRAIN:</b> Cluster will not accept any new jobs and will let |
| existing federated jobs complete</li> |
| <li><b>DRAIN+REMOVE:</b> Cluster will not accept any new jobs and will |
| remove itself from the federation once all federated jobs have |
| completed. When removed from the federation, the cluster will |
| accept jobs as a non-federated cluster</li> |
| </ul> |
| |
Federation configuration can be viewed using:
| <pre> |
| sacctmgr show federation [tree] |
| sacctmgr show cluster withfed |
| </pre> |
| |
After clusters are added to a federation and the controllers are started, their
status can be viewed from the controller using:
| <pre> |
| scontrol show federation |
| </pre> |
| |
| |
By default, the status commands will show a local view. A default federated
view can be set by configuring the following parameter in slurm.conf:
| <pre> |
| FederationParameters=fed_display |
| </pre> |
| |
| |
| <h2 id="jobids">Federated Job IDs<a class="slurm_link" href="#jobids"></a></h2> |
When a job is submitted to a federation it gets a federated job ID that is
unique across all clusters in the federation. A federated job ID packs the
origin cluster's ID and the job's local ID into a single unsigned 32-bit
integer:
| |
| <pre> |
| Bits 0-25: Local Job ID |
| Bits 26-31: Cluster Origin ID |
| </pre> |
| |
Federated job IDs allow the controllers to determine which cluster a job was
submitted to by looking at the job's cluster origin ID.
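<p>
The bit layout above can be illustrated with a short Python sketch. The helper
functions below are not part of Slurm; only the bit layout is taken from the
description above.
</p>

```python
# Illustrative encode/decode of a federated job ID:
#   bits 0-25:  local job ID
#   bits 26-31: origin cluster ID
# These helpers are not Slurm code; they only mirror the documented layout.

LOCAL_ID_BITS = 26

def encode_fed_job_id(origin_cluster_id, local_job_id):
    """Pack the origin cluster ID and local job ID into one 32-bit value."""
    return (origin_cluster_id << LOCAL_ID_BITS) | local_job_id

def decode_fed_job_id(fed_job_id):
    """Split a federated job ID into (origin_cluster_id, local_job_id)."""
    origin_cluster_id = fed_job_id >> LOCAL_ID_BITS
    local_job_id = fed_job_id & ((1 << LOCAL_ID_BITS) - 1)
    return origin_cluster_id, local_job_id

fed_id = encode_fed_job_id(2, 1234)
print(fed_id)                     # 134218962
print(decode_fed_job_id(fed_id))  # (2, 1234)
```

One consequence of this layout is that local job IDs in a federation are
limited to 26 bits, i.e. at most 67,108,863.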
| |
| |
| <h2 id="jobsubmission">Job Submission |
| <a class="slurm_link" href="#jobsubmission"></a> |
| </h2> |
| <p> |
| When a federated cluster receives a job submission, it will submit copies of the |
| job (<b>sibling jobs</b>) to each eligible cluster. Each cluster will then |
| independently attempt to schedule the job. |
| </p> |
| |
| <p> |
| Jobs can be directed to specific clusters in the federation using the |
| <b>-M,--clusters=<cluster_list></b> and the new |
| <b>--cluster-constraint=[!]<constraint_list></b> options. |
| </p> |
| |
| <p> |
Using the <b>-M,--clusters=<cluster_list></b> option, the submission command
(sbatch, salloc, srun) will pick one cluster from the list to submit
the job to and will also pass along the list of clusters with the job. The
clusters in the list will be the only viable clusters that sibling jobs can be
submitted to. For example, the submission:
| <pre> |
| cluster1$ sbatch -Mcluster2,cluster3 script.sh |
| </pre> |
| will submit the job to either cluster2 or cluster3 and will only submit sibling |
| jobs to cluster2 and cluster3 even if there are more clusters in the federation. |
| </p> |
| |
| <p> |
Using the <b>--cluster-constraint=[!]<constraint_list></b> option will
submit sibling jobs only to the clusters that have the requested cluster
feature(s) -- or don't have the feature(s) if using <b>!</b>. Cluster features
are added using the <b>sacctmgr modify cluster <cluster_name> set
features[+-]=<feature_list></b> command.
| </p> |
| |
| <p> |
<b>NOTE</b>: When using the <b>!</b> option, add quotes around the option to
prevent the shell from interpreting the <b>!</b> (e.g.
--cluster-constraint='!highmem').
| </p> |
| |
| |
| <p> |
| When using both the <b>--cluster-constraint=</b> and |
| <b>--clusters=</b> options together, the origin cluster will only submit |
| sibling jobs to clusters that meet both requirements. |
| </p> |
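<p>
For example, assuming a federation of clusters cluster1 through cluster4 with
a cluster feature named "gpu" assigned to two of them (all names here are
illustrative, not from this guide), sibling jobs would only be submitted to
clusters satisfying both options:
</p>
<pre>
sacctmgr modify cluster cluster2 set features+=gpu
sacctmgr modify cluster cluster3 set features+=gpu
sbatch --cluster-constraint=gpu -Mcluster2,cluster3,cluster4 script.sh
</pre>
In this sketch, sibling jobs would go only to cluster2 and cluster3, since
cluster4 is in the cluster list but lacks the "gpu" feature.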
| |
| <p> |
| Held or dependent jobs are kept on the origin cluster until they are released |
| or are no longer dependent, at which time they are submitted to other viable |
| clusters in the federation. If a job becomes held or dependent |
| after being submitted, the job is removed from every cluster but the origin. |
| </p> |
| |
| <h2 id="jobscheduling">Job Scheduling |
| <a class="slurm_link" href="#jobscheduling"></a> |
| </h2> |
| <p> |
Each cluster in the federation independently attempts to schedule each job,
coordinating with the <b>origin cluster</b> (the cluster the job was submitted
to) to allocate resources to a federated job. When a cluster determines it can
attempt to allocate resources for a job, it communicates with the origin
cluster to verify that no other cluster is attempting to allocate resources at
the same time. If no other cluster is attempting to, the cluster will attempt
to allocate resources for the job. If it succeeds, it notifies the origin
cluster that it started the job, and the origin cluster notifies the clusters
with sibling jobs to remove them and put them in a <b>revoked</b> state. If the
cluster was unable to allocate resources to the job, it lets the origin cluster
know so that other clusters can attempt to schedule the job. If it was the main
scheduler attempting to allocate resources, the main scheduler will stop
looking at further jobs in the job's partition. If it was the backfill
scheduler attempting to allocate resources, the resources will be reserved for
the job.
| </p> |
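<p>
The coordination described above can be sketched in Python. This is a
simplified model, not Slurm source code: the class and method names below are
invented for illustration, and real Slurm performs this coordination via RPCs
between the slurmctld daemons.
</p>

```python
# Simplified model of sibling/origin coordination: a sibling cluster must
# obtain permission from the origin cluster before allocating resources,
# and the winning cluster causes the other sibling jobs to be revoked.
# All names here are illustrative, not Slurm internals.

class OriginCluster:
    def __init__(self, siblings):
        self.siblings = set(siblings)  # clusters holding sibling jobs
        self.locked_by = None          # cluster currently trying to allocate
        self.started_on = None         # cluster that started the job

    def try_lock(self, cluster):
        """A sibling asks permission before attempting an allocation."""
        if self.started_on or self.locked_by:
            return False
        self.locked_by = cluster
        return True

    def allocation_failed(self, cluster):
        """Release the lock so other siblings may attempt to schedule."""
        if self.locked_by == cluster:
            self.locked_by = None

    def allocation_succeeded(self, cluster):
        """Record the winner and return the siblings to be revoked."""
        self.started_on = cluster
        self.locked_by = None
        revoked = self.siblings - {cluster}
        self.siblings = {cluster}
        return revoked  # these sibling jobs enter the revoked (RV) state

origin = OriginCluster(["cluster1", "cluster2", "cluster3"])
assert origin.try_lock("cluster2")      # cluster2 may attempt allocation
assert not origin.try_lock("cluster3")  # cluster3 must wait its turn
origin.allocation_failed("cluster2")    # cluster2 found no free resources
assert origin.try_lock("cluster3")      # now cluster3 may try
print(sorted(origin.allocation_succeeded("cluster3")))
# ['cluster1', 'cluster2']
```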
| |
| <p> |
| If an origin cluster is down, then the remote siblings will coordinate with a |
| job's viable siblings to schedule the job. When the origin cluster comes back |
| up, it will sync with the other siblings. |
| </p> |
| |
| <h2 id="jobrequeue">Job Requeue |
| <a class="slurm_link" href="#jobrequeue"></a> |
| </h2> |
| <p> |
When a federated job is requeued, the origin cluster is notified and will
submit new sibling jobs to viable clusters. The requeued job is then eligible
to start on a different cluster than the one it previously ran on.
| </p> |
| |
| <p> |
The slurm.conf options <b>RequeueExit</b> and <b>RequeueExitHold</b> are
controlled by the origin cluster.
| </p> |
| |
| |
| <h2 id="interactivejobs">Interactive Jobs |
| <a class="slurm_link" href="#interactivejobs"></a> |
| </h2> |
| <p> |
Interactive jobs -- jobs submitted with srun and salloc -- can be submitted to
the local cluster and get an allocation from a different cluster. When an salloc
job allocation is granted by a cluster other than the local cluster, the
environment variable SLURM_WORKING_CLUSTER will be set with the remote sibling
cluster's IP address, port and RPC version so that any srun commands will know
which cluster to communicate with.
| </p> |
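<p>
For example (the cluster names and the variable's exact value below are
illustrative), one could verify which cluster granted the allocation:
</p>
<pre>
cluster1$ salloc -N1 bash
cluster1$ echo $SLURM_WORKING_CLUSTER
cluster2:10.0.0.12:6817:9216
</pre>
If the variable is unset, the allocation was granted by the local cluster.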
| |
| <p> |
<b>NOTE</b>: All compute nodes must be accessible to all submission hosts for
this to work.
| <br> |
<b>NOTE</b>: The current implementation of the MPI interfaces in Slurm requires
that SlurmdSpooldir be the same on the host where srun is run as it is on the
compute nodes in the allocation. If it isn't, a workaround is to get an
allocation that puts the user on an actual compute node. Then the srun
commands on the compute nodes will use the slurm.conf that corresponds to the
correct cluster. Setting <b>LaunchParameters=use_interactive_step</b> in
slurm.conf will put the user on an actual compute node when using salloc.
| </p> |
| |
| |
| <h2 id="jobcancel">Canceling Jobs |
| <a class="slurm_link" href="#jobcancel"></a> |
| </h2> |
| <p> |
| Cancel requests in the federation will cancel the running sibling job or all |
| pending sibling jobs. Specific pending sibling jobs can be removed by using |
| <b>scancel</b>'s <b>--sibling=<cluster_name></b> option to remove the |
| sibling job from the job's active sibling list. |
| </p> |
| |
| |
| <h2 id="jobmodify">Job Modification |
| <a class="slurm_link" href="#jobmodify"></a> |
| </h2> |
Job modifications are routed to the origin cluster, which then pushes the
changes out to each sibling job.
| |
| |
| <h2 id="jobarrays">Job Arrays<a class="slurm_link" href="#jobarrays"></a></h2> |
| Currently, job arrays only run on the origin cluster. |
| |
| |
| <h2 id="statuscmds">Status Commands |
| <a class="slurm_link" href="#statuscmds"></a> |
| </h2> |
| <p> |
By default, status commands such as squeue, sinfo, sprio, sacct and sreport
show a view local to the local cluster. A unified view of the jobs in the
federation can be obtained using the <b>--federation</b> option to each status
command. The <b>--federation</b> option causes the status command to first
check whether the local cluster is part of a federation. If it is, the command
will query each cluster in parallel for job information and combine it into
one unified view.
| </p> |
| |
| <p> |
The <b>FederationParameters=fed_display</b> slurm.conf parameter makes all
status commands present a federated view by default --
equivalent to setting the <b>--federation</b> option for each status command.
The federated view can be overridden using the <b>--local</b> option. Using the
<b>--clusters,-M</b> option will also override the federated view and give a
local view for the given cluster(s).
| </p> |
| |
| <p> |
Using the existing <b>--clusters,-M</b> option, the status commands will output
the information in the traditional format, where each cluster's information is
listed separately.
| </p> |
| |
| <h3 id="squeue">squeue<a class="slurm_link" href="#squeue"></a></h3> |
| <p> |
squeue has a <b>--sibling</b> option that will show each sibling job
rather than merging them into one.
| </p> |
| |
| <p> |
Several long format options are available to display a job's federated
information:
| <ul> |
| <li> |
| <b>cluster:</b> Name of the cluster that is running the job or |
| job step. |
| </li> |
| <li> |
| <b>siblingsactive:</b> Cluster names of where federated sibling |
| jobs exist. |
| </li> |
| <li> |
| <b>siblingsactiveraw:</b> Cluster IDs of where federated |
| sibling jobs exist. |
| </li> |
| <li> |
| <b>siblingsviable:</b> Cluster names of where federated sibling |
| jobs are viable to run. |
| </li> |
| <li> |
		<b>siblingsviableraw:</b> Cluster IDs of where federated
		sibling jobs are viable to run.
| </li> |
| </ul> |
| </p> |
| |
| <p> |
| squeue output can be sorted using the <b>-S cluster</b> option. |
| </p> |
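<p>
For example (output will of course vary by site), a unified federated queue
with sibling information might be requested as:
</p>
<pre>
squeue --federation -O jobid,name,statecompact,cluster,siblingsviable,siblingsactive
squeue --federation --sibling
</pre>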
| |
| |
| <h3 id="sinfo">sinfo<a class="slurm_link" href="#sinfo"></a></h3> |
| <p> |
| sinfo will show the partitions from each cluster in one view. In a federated |
| view, the cluster name is displayed with each partition. The cluster name can be |
| specified in the format options using the short format <b>%V</b> or the long |
| format <b>cluster</b> options. The output can be sorted by cluster names using |
| the <b>-S %[+-]V</b> option. |
| </p> |
| |
| <h3 id="sprio">sprio<a class="slurm_link" href="#sprio"></a></h3> |
| <p> |
| In a federated view, sprio displays the job information from the local cluster |
| or from the first cluster to report the job. Since each sibling job could have a |
| different priority on each cluster it may be helpful to use the <b>--sibling</b> |
| option to show all records of a job to get a better picture of a job's priority. |
| The name of the cluster reporting the job record can be displayed using the |
<b>%c</b> format option. The cluster name is shown by default when using the
<b>--sibling</b> option.
| </p> |
| |
| <h3 id="sacct">sacct<a class="slurm_link" href="#sacct"></a></h3> |
| <p> |
| By default, sacct will not display "revoked" jobs and will show the job from the |
| cluster that ran the job. However, "revoked" jobs can be viewed using the |
| <b>--duplicate/-D</b> option. |
| </p> |
| |
| <h3 id="sreport">sreport<a class="slurm_link" href="#sreport"></a></h3> |
| <p> |
| sreport will combine the reports from each cluster and display them as one. |
| </p> |
| |
| <h3 id="scontrol">scontrol<a class="slurm_link" href="#scontrol"></a></h3> |
| <p> |
| The following scontrol options will display a federated view: |
| <ul> |
| <li>show [--federation|--sibling] jobs</li> |
| <li>show [--federation] steps</li> |
| <li>completing</li> |
| </ul> |
| </p> |
| |
| <p> |
The following scontrol options are handled in a federation. If the command is
run from a cluster other than the job's origin cluster, it will be routed to
the origin cluster.
| <ul> |
| <li>hold</li> |
| <li>uhold</li> |
| <li>release</li> |
| <li>requeue</li> |
| <li>requeuehold</li> |
| <li>suspend</li> |
| <li>update job</li> |
| </ul> |
| </p> |
| |
| <p> |
| All other scontrol options should be directed to the specific cluster either by |
| issuing the command on the cluster or using the <b>--cluster/-M</b> option. |
| </p> |
| |
| |
| |
| |
| <h2 id="glossary">Glossary<a class="slurm_link" href="#glossary"></a></h2> |
| <ul> |
| <li> |
| <b>Federated Job:</b> A job that is submitted to the federated |
| cluster. It has a unique job ID across all clusters (Origin |
| Cluster ID + Local Job ID). |
| </li> |
| <li> |
| <b>Sibling Job:</b> A copy of the federated job that is |
| submitted to other federated clusters. |
| </li> |
| <li> |
| <b>Local Cluster:</b> The cluster found in the slurm.conf that |
| the commands will talk to by default. |
| </li> |
| <li> |
| <b>Origin Cluster:</b> The cluster that the federated job was |
| originally submitted to. The origin cluster submits sibling jobs |
| to other clusters in the federation. The origin cluster |
| determines whether a sibling job can run or not. Communications |
| for the federated job are routed through the origin cluster. |
| </li> |
| <li> |
| <b>Sibling Cluster:</b> The cluster that is associated with a |
| sibling job. |
| </li> |
| <li> |
| <b>Origin Job:</b> The federated job that resides on the cluster |
| that it was originally submitted to. |
| </li> |
| <li> |
| <b>Revoked (RV) State:</b> The state that the origin job is in |
| while the origin job is not actively being scheduled on the |
| origin cluster (e.g. not a viable sibling or one of the sibling |
| jobs is running on a remote cluster). Or the state that a remote |
| sibling job is put in when another sibling is allocated nodes. |
| </li> |
| <li> |
| <b>Viable Sibling:</b> a cluster that is eligible to run a |
| sibling job based off of the requested clusters, cluster |
| features and state of the cluster (e.g. active, draining, etc.). |
| </li> |
| <li> |
		<b>Active Sibling:</b> a cluster that currently has an active
		sibling job and is able to schedule it.
| </li> |
| </ul> |
| |
| <h2 id="limitations">Limitations |
| <a class="slurm_link" href="#limitations"></a> |
| </h2> |
| <ul> |
| <li>A federated job that fails due to resources (partition, node counts, |
| etc.) on the local cluster will be rejected and won't be |
| submitted to other sibling clusters even if it could run on |
| them.</li> |
| <li>Job arrays only run on the cluster that they were submitted to.</li> |
| <li>Job modification must succeed on the origin cluster for the changes |
| to be pushed to the sibling jobs on remote clusters. </li> |
| <li>Modifications to anything other than jobs are disabled in sview.</li> |
| <li>sview grid is disabled in a federated view.</li> |
| </ul> |
| |
| <p style="text-align:center;">Last modified 9 June 2021</p> |
| |
| <!--#include virtual="footer.txt"--> |