 <!--#include virtual="header.txt"-->

 <h1>Slurm Federated Scheduling Guide</h1>

 <ul>
 <li><a href="#overview">Overview</a></li>
 <li><a href="#configuration">Configuration</a></li>
 <li><a href="#jobids">Federated Job IDs</a></li>
 <li><a href="#jobsubmission">Job Submission</a></li>
 <li><a href="#jobscheduling">Job Scheduling</a></li>
 <li><a href="#jobrequeue">Job Requeue</a></li>
 <li><a href="#interactivejobs">Interactive Jobs</a></li>
 <li><a href="#jobcancel">Canceling Jobs</a></li>
 <li><a href="#jobmodify">Job Modification</a></li>
 <li><a href="#jobarrays">Job Arrays</a></li>
 <li><a href="#statuscmds">Status Commands</a></li>
 <li><a href="#glossary">Glossary</a></li>
 <li><a href="#limitations">Limitations</a></li>
 </ul>

 <h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>
 <p>Slurm includes support for creating a federation of clusters
 and scheduling jobs in a peer-to-peer fashion between them. Jobs submitted to a
 federation receive a job ID that is unique among all clusters in the
 federation. A job is submitted to the local cluster (the cluster defined in the
 slurm.conf) and is then replicated across the clusters in the federation. Each
 cluster then independently attempts to schedule the job based on its own
 scheduling policies. The clusters coordinate with the "origin" cluster (the
 cluster the job was submitted to) to schedule the job.
 </p>

 <p>
 <b>NOTE</b>: This is not intended as a high-throughput environment. If
 scheduling more than 50,000 jobs a day, consider reducing the number of clusters
 that sibling jobs can be submitted to, or directing load
 to the local cluster only (e.g. by using the --cluster-constraint= or -M
 submission options).
 </p>


 <h2 id="configuration">Configuration
 <a class="slurm_link" href="#configuration"></a>
 </h2>
 <p>
 A federation is created using the sacctmgr command to create a federation in the
 database and by adding clusters to a federation.
 </p>

 <br>
 To create a federation use:
 <pre>
 sacctmgr add federation &lt;federation_name&gt; [clusters=&lt;list_of_clusters&gt;]
 </pre>

 Clusters can be added to or removed from a federation using:
 <br>
 <b>NOTE</b>: A cluster can only be a member of one federation at a time.
 <pre>
 sacctmgr modify federation &lt;federation_name&gt; set clusters[+-]=&lt;list_of_clusters&gt;
 sacctmgr modify cluster &lt;cluster_name&gt; set federation=&lt;federation_name&gt;
 sacctmgr modify federation &lt;federation_name&gt; set clusters=
 sacctmgr modify cluster &lt;cluster_name&gt; set federation=
 </pre>
 <b>NOTE</b>: If a cluster is removed from a federation without first being
 drained, running jobs on the removed cluster, or that originated from the
 removed cluster, will continue to run as non-federated jobs. If a job is pending
 on the origin cluster, the job will remain pending on the origin cluster as a
 non-federated job and the remaining sibling jobs will be removed. If the origin
 cluster is being removed and the job is pending and is only viable on one
 cluster then it will remain pending on the viable cluster as a non-federated
 job. If the origin cluster is being removed and the job is pending and viable on
 multiple clusters other than the origin cluster, then the remaining sibling jobs
 will stay pending as a federated job and the remaining sibling clusters will
 schedule amongst themselves to start the job.

 <br>
 <br>
 Federations can be deleted using:
 <pre>
 sacctmgr delete federation &lt;federation_name&gt;
 </pre>


 Generic features can be assigned to clusters and can be requested at submission
 using the <b>--cluster-constraint=[!]&lt;feature_list&gt;</b> option:
 <pre>
 sacctmgr modify cluster &lt;cluster_name&gt; set features[+-]=&lt;feature_list&gt;
 </pre>

 A cluster's federated state can be set using:
 <pre>
 sacctmgr modify cluster &lt;cluster_name&gt; set fedstate=&lt;state&gt;
 </pre>
 where possible states are:
 <ul>
 	<li><b>ACTIVE:</b> Cluster will actively accept and schedule federated
 		jobs</li>
 	<li><b>INACTIVE:</b> Cluster will not schedule or accept any jobs</li>
 	<li><b>DRAIN:</b> Cluster will not accept any new jobs and will let
 		existing federated jobs complete</li>
 	<li><b>DRAIN+REMOVE:</b> Cluster will not accept any new jobs and will
 		remove itself from the federation once all federated jobs have
 		completed. When removed from the federation, the cluster will
 		accept jobs as a non-federated cluster</li>
 </ul>

 Federation configuration can be viewed using:
 <pre>
 sacctmgr show federation [tree]
 sacctmgr show cluster withfed
 </pre>

 After clusters are added to a federation and the controllers are started, their
 status can be viewed from the controller using:
 <pre>
 scontrol show federation
 </pre>


 By default the status commands will show a local view. A default federated view
 can be set by configuring the following parameter in the slurm.conf:
 <pre>
 FederationParameters=fed_display
 </pre>


 <h2 id="jobids">Federated Job IDs<a class="slurm_link" href="#jobids"></a></h2>
 When a job is submitted to a federation it gets a federated job ID. Job IDs in
 the federation are unique across all clusters in the federation. A federated
 job ID is an unsigned 32-bit integer that combines the origin cluster's
 ID with the cluster's local job ID.

 <pre>
 Bits 0-25:  Local Job ID
 Bits 26-31: Cluster Origin ID
 </pre>

 Federated job IDs allow the controllers to know which cluster the job was
 submitted to by looking at the cluster origin id of the job.
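
 <p>
 The bit layout above can be illustrated with a short Python sketch. The
 function names <b>split_fed_job_id</b> and <b>make_fed_job_id</b> are
 hypothetical helpers written for this example and are not part of Slurm; only
 the 26/6 bit split is taken from the layout shown above.
 </p>

```python
LOCAL_ID_BITS = 26                        # bits 0-25 hold the local job ID
CLUSTER_ID_BITS = 6                       # bits 26-31 hold the cluster origin ID
LOCAL_ID_MASK = (1 << LOCAL_ID_BITS) - 1  # 0x3FFFFFF


def split_fed_job_id(fed_job_id):
    """Return (cluster_origin_id, local_job_id) for a federated job ID."""
    if not 0 <= fed_job_id < (1 << 32):
        raise ValueError("federated job IDs are unsigned 32-bit integers")
    return fed_job_id >> LOCAL_ID_BITS, fed_job_id & LOCAL_ID_MASK


def make_fed_job_id(cluster_origin_id, local_job_id):
    """Pack a cluster origin ID and a local job ID into one 32-bit value."""
    if not 0 <= cluster_origin_id < (1 << CLUSTER_ID_BITS):
        raise ValueError("cluster origin ID must fit in 6 bits")
    if not 0 <= local_job_id <= LOCAL_ID_MASK:
        raise ValueError("local job ID must fit in 26 bits")
    return (cluster_origin_id << LOCAL_ID_BITS) | local_job_id


print(split_fed_job_id(67108865))  # (1, 1)
print(make_fed_job_id(2, 100))     # 134217828
```

 <p>
 With this layout, job 67108865 decodes to cluster origin ID 1 and local job
 ID 1, which is why federated job IDs look much larger than the job IDs of a
 non-federated cluster.
 </p>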


 <h2 id="jobsubmission">Job Submission
 <a class="slurm_link" href="#jobsubmission"></a>
 </h2>
 <p>
 When a federated cluster receives a job submission, it will submit copies of the
 job (<b>sibling jobs</b>) to each eligible cluster. Each cluster will then
 independently attempt to schedule the job.
 </p>

 <p>
 Jobs can be directed to specific clusters in the federation using the
 <b>-M,--clusters=&lt;cluster_list&gt;</b> and the new
 <b>--cluster-constraint=[!]&lt;constraint_list&gt;</b> options.
 </p>

 <p>
 Using the <b>-M,--clusters=&lt;cluster_list&gt;</b> option, the submission command
 (sbatch, salloc, srun) will pick one cluster from the list of clusters to submit
 the job to and will also pass along the list of clusters with the job. The
 clusters in the list will be the only viable clusters that sibling jobs can be
 submitted to. For example, the submission:
 <pre>
 cluster1$ sbatch -Mcluster2,cluster3 script.sh
 </pre>
 will submit the job to either cluster2 or cluster3 and will only submit sibling
 jobs to cluster2 and cluster3 even if there are more clusters in the federation.
 </p>

 <p>
 Using the <b>--cluster-constraint=[!]&lt;constraint_list&gt;</b> option will
 submit sibling jobs to only the clusters that have the requested cluster
 feature(s) -- or don't have the feature(s) if using <b>!</b>. Cluster features
 are added using the <b>sacctmgr modify cluster &lt;cluster_name&gt; set
 features[+-]=&lt;feature_list&gt;</b> option.
 </p>

 <p>
 <b>NOTE</b>: When using the <b>!</b> option, add quotes around the option to
 prevent the shell from interpreting the <b>!</b> (e.g.
 --cluster-constraint='!highmem').
 </p>


 <p>
 When using both the <b>--cluster-constraint=</b> and
 <b>--clusters=</b> options together, the origin cluster will only submit
 sibling jobs to clusters that meet both requirements.
 </p>
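
 <p>
 How the two options combine can be sketched as a small Python model. This is
 an illustration only, not Slurm's implementation: the function
 <b>viable_siblings</b>, the cluster names, and the feature names are
 hypothetical. It assumes that a cluster matches a constraint list when it has
 at least one of the listed features (and, with <b>!</b>, none of them), and it
 ignores the cluster's federated state.
 </p>

```python
def viable_siblings(cluster_features, clusters=None, cluster_constraint=None):
    """Return the sorted names of clusters that may receive a sibling job.

    cluster_features   maps cluster name -> set of cluster feature names
    clusters           models -M/--clusters (comma-separated names)
    cluster_constraint models --cluster-constraint ([!]feature[,feature...])
    """
    viable = set(cluster_features)
    if clusters:
        # Only clusters named with -M remain candidates.
        viable &= set(clusters.split(","))
    if cluster_constraint:
        negate = cluster_constraint.startswith("!")
        wanted = set(cluster_constraint.lstrip("!").split(","))
        # Assumption: "has any listed feature", inverted when "!" is given.
        viable = {
            name for name in viable
            if bool(cluster_features[name] & wanted) != negate
        }
    return sorted(viable)


federation = {
    "cluster1": {"highmem"},
    "cluster2": {"highmem", "gpu"},
    "cluster3": {"gpu"},
}
print(viable_siblings(federation, clusters="cluster2,cluster3",
                      cluster_constraint="highmem"))       # ['cluster2']
print(viable_siblings(federation, cluster_constraint="!highmem"))  # ['cluster3']
```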

 <p>
 Held or dependent jobs are kept on the origin cluster until they are released
 or are no longer dependent, at which time they are submitted to other viable
 clusters in the federation. If a job becomes held or dependent
 after being submitted, the job is removed from every cluster but the origin.
 </p>

 <h2 id="jobscheduling">Job Scheduling
 <a class="slurm_link" href="#jobscheduling"></a>
 </h2>
 <p>
 Each cluster in the federation independently attempts to schedule each job with
 the exception of coordinating with the <b>origin cluster</b> (the cluster the
 job was submitted to) to allocate resources to a federated job. When a cluster
 determines it can attempt to allocate resources for a job it communicates with
 the origin cluster to verify that no other cluster is attempting to allocate
 resources at the same time. If no other cluster is attempting to allocate
 resources the cluster will attempt to allocate resources for the job. If it
 succeeds then it will notify the origin cluster that it started the job and the
 origin cluster will notify the clusters with sibling jobs to remove the sibling
 jobs and put them in a <b>revoked</b> state. If the cluster was unable to
 allocate resources to the job then it lets the origin cluster know so that other
 clusters can attempt to schedule the job. If it was the main scheduler
 attempting to allocate resources then the main scheduler will stop looking at
 further jobs in the job's partition. If it was the backfill scheduler attempting
 to allocate resources then the resources will be reserved for the job.
 </p>
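
 <p>
 The coordination described above can be pictured with a minimal Python model of
 the origin cluster's bookkeeping for one job. This is a sketch under stated
 assumptions, not Slurm's code or RPC protocol: the class <b>OriginJob</b>, its
 methods, and the state strings are hypothetical names chosen for the example.
 </p>

```python
class OriginJob:
    """Origin-side bookkeeping for one federated job (illustrative only)."""

    def __init__(self, siblings):
        # Every sibling job starts out pending on its cluster.
        self.states = {name: "PENDING" for name in siblings}
        self.lock_holder = None  # cluster currently trying to allocate

    def request_start(self, cluster):
        """A sibling cluster asks whether it may try to allocate resources."""
        if self.lock_holder is not None or self.states[cluster] != "PENDING":
            return False
        self.lock_holder = cluster
        return True

    def report_result(self, cluster, started):
        """The cluster holding the lock reports whether allocation succeeded."""
        if cluster != self.lock_holder:
            raise ValueError("only the cluster holding the lock may report")
        self.lock_holder = None
        if started:
            # The job runs on one cluster; all other siblings are revoked.
            for name in self.states:
                self.states[name] = "RUNNING" if name == cluster else "REVOKED"


job = OriginJob(["cluster1", "cluster2", "cluster3"])
print(job.request_start("cluster2"))   # True
print(job.request_start("cluster3"))   # False, cluster2 is still trying
job.report_result("cluster2", started=False)
print(job.request_start("cluster3"))   # True, the lock was released
job.report_result("cluster3", started=True)
print(job.states)
# {'cluster1': 'REVOKED', 'cluster2': 'REVOKED', 'cluster3': 'RUNNING'}
```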

 <p>
 If an origin cluster is down, then the remote siblings will coordinate with a
 job's viable siblings to schedule the job. When the origin cluster comes back
 up, it will sync with the other siblings.
 </p>

 <h2 id="jobrequeue">Job Requeue
 <a class="slurm_link" href="#jobrequeue"></a>
 </h2>
 <p>
 When a federated job is requeued, the origin cluster is notified. The origin
 cluster then submits new sibling jobs to viable clusters, so the federated
 job is eligible to start on a different cluster than the one it ran on.
 </p>

 <p>
 slurm.conf options <b>RequeueExit</b> and <b>RequeueExitHold</b> are controlled
 by the origin cluster.
 </p>


 <h2 id="interactivejobs">Interactive Jobs
 <a class="slurm_link" href="#interactivejobs"></a>
 </h2>
 <p>
 Interactive jobs -- jobs submitted with srun and salloc -- can be submitted to
 the local cluster and get an allocation from a different cluster. When an salloc
 job allocation is granted by a cluster other than the local cluster, a new
 environment variable, SLURM_WORKING_CLUSTER, will be set with the remote sibling
 cluster's IP address, port and RPC version so that any sruns will know which
 cluster to communicate with.
 </p>

 <p>
 <b>NOTE</b>: All compute nodes must be accessible to all
 submission hosts for this to work.
 <br>
 <b>NOTE</b>: The current implementation of the MPI interfaces in Slurm requires
 the SlurmdSpoolDir to be the same on the host where the srun is being run as it
 is on the compute nodes in the allocation. If they differ, a workaround is to
 get an allocation that puts the user on the actual compute node. Then the sruns
 on the compute nodes will be using the slurm.conf that corresponds to the
 correct cluster. Setting <b>LaunchParameters=use_interactive_step</b> in
 slurm.conf will put the user on an actual compute node when using salloc.
 </p>


 <h2 id="jobcancel">Canceling Jobs
 <a class="slurm_link" href="#jobcancel"></a>
 </h2>
 <p>
 Cancel requests in the federation will cancel the running sibling job or all
 pending sibling jobs. Specific pending sibling jobs can be removed by using
 <b>scancel</b>'s <b>--sibling=&lt;cluster_name&gt;</b> option to remove the
 sibling job from the job's active sibling list.
 </p>


 <h2 id="jobmodify">Job Modification
 <a class="slurm_link" href="#jobmodify"></a>
 </h2>
 Job modifications are routed to the origin cluster, which pushes the changes
 out to each sibling job.


 <h2 id="jobarrays">Job Arrays<a class="slurm_link" href="#jobarrays"></a></h2>
 Currently, job arrays only run on the origin cluster.


 <h2 id="statuscmds">Status Commands
 <a class="slurm_link" href="#statuscmds"></a>
 </h2>
 <p>
 By default, status commands such as squeue, sinfo, sprio, sacct and sreport will
 show a view of the local cluster only. A unified view of the jobs in the
 federation can be viewed using the <b>--federation</b> option to each status
 command. The <b>--federation</b> option causes the status command to first
 check if the local cluster is part of a federation. If it is, then the command
 will query each cluster in parallel for job info and will combine the
 information into one unified view.
 </p>

 <p>
 A new <b>FederationParameters=fed_display</b> slurm.conf parameter has been
 added so that all status commands will present a federated view by default --
 equivalent to setting the <b>--federation</b> option for each status command.
 The federated view can be overridden using the <b>--local</b> option. Using the
 <b>--clusters,-M</b> option will also override the federated view and give a
 local view for the given cluster(s).
 </p>

 <p>
 Using the existing <b>--clusters,-M</b> option, the status commands will output
 the information in the same format that exists today where each cluster's
 information is listed separately.
 </p>
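
 <p>
 The interaction of these options can be summarized with a small Python sketch.
 The function <b>resolve_view</b> is a hypothetical helper, not part of any
 Slurm command. It follows the precedence stated above
 (<b>--clusters,-M</b> and <b>--local</b> override the federated view); the
 result when <b>--local</b> and <b>--federation</b> are given together is an
 assumption of this example.
 </p>

```python
def resolve_view(fed_display=False, federation=False, local=False, clusters=None):
    """Return which view a status command would present.

    fed_display models FederationParameters=fed_display in slurm.conf.
    federation, local and clusters model --federation, --local and -M.
    """
    if clusters:
        # -M overrides the federated view: local view of the named cluster(s).
        return "local:" + clusters
    if local:
        # --local overrides a federated default (assumed to win over --federation).
        return "local"
    if federation or fed_display:
        return "federated"
    return "local"


print(resolve_view())                                # local
print(resolve_view(fed_display=True))                # federated
print(resolve_view(fed_display=True, local=True))    # local
print(resolve_view(fed_display=True, clusters="cluster2"))  # local:cluster2
```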

 <h3 id="squeue">squeue<a class="slurm_link" href="#squeue"></a></h3>
 <p>
 squeue also has a new <b>--sibling</b> option that will show each sibling job
 rather than merge them into one.
 </p>

 <p>
 Several new long format options have been added to display the job's federated
 information:
 <ul>
 	<li>
 		<b>cluster:</b> Name of the cluster that is running the job or
 		job step.
 	</li>
 	<li>
 		<b>siblingsactive:</b> Cluster names of where federated sibling
 		jobs exist.
 	</li>
 	<li>
 		<b>siblingsactiveraw:</b> Cluster IDs of where federated
 		sibling jobs exist.
 	</li>
 	<li>
 		<b>siblingsviable:</b> Cluster names of where federated sibling
 		jobs are viable to run.
 	</li>
 	<li>
 		<b>siblingsviableraw:</b> Cluster IDs of where federated
 		sibling jobs are viable to run.
 	</li>
 </ul>
 </p>

 <p>
 squeue output can be sorted using the <b>-S cluster</b> option.
 </p>


 <h3 id="sinfo">sinfo<a class="slurm_link" href="#sinfo"></a></h3>
 <p>
 sinfo will show the partitions from each cluster in one view. In a federated
 view, the cluster name is displayed with each partition. The cluster name can be
 specified in the format options using the short format <b>%V</b> or the long
 format <b>cluster</b> options. The output can be sorted by cluster names using
 the <b>-S %[+-]V</b> option.
 </p>

 <h3 id="sprio">sprio<a class="slurm_link" href="#sprio"></a></h3>
 <p>
 In a federated view, sprio displays the job information from the local cluster
 or from the first cluster to report the job. Since each sibling job could have a
 different priority on each cluster it may be helpful to use the <b>--sibling</b>
 option to show all records of a job to get a better picture of a job's priority.
 The name of the cluster reporting the job record can be displayed using the
 <b>%c</b> format option. The cluster name is shown by default when using the
 <b>--sibling</b> option.
 </p>

 <h3 id="sacct">sacct<a class="slurm_link" href="#sacct"></a></h3>
 <p>
 By default, sacct will not display "revoked" jobs and will show the job from the
 cluster that ran the job. However, "revoked" jobs can be viewed using the
 <b>--duplicate/-D</b> option.
 </p>

 <h3 id="sreport">sreport<a class="slurm_link" href="#sreport"></a></h3>
 <p>
 sreport will combine the reports from each cluster and display them as one.
 </p>

 <h3 id="scontrol">scontrol<a class="slurm_link" href="#scontrol"></a></h3>
 <p>
 The following scontrol options will display a federated view:
 <ul>
 	<li>show [--federation|--sibling] jobs</li>
 	<li>show [--federation] steps</li>
 	<li>completing</li>
 </ul>
 </p>

 <p>
 The following scontrol options are handled in a federation. If the command is
 run from a cluster other than the origin cluster it will be routed to the
 origin cluster.
 <ul>
 	<li>hold</li>
 	<li>uhold</li>
 	<li>release</li>
 	<li>requeue</li>
 	<li>requeuehold</li>
 	<li>suspend</li>
 	<li>update job</li>
 </ul>
 </p>

 <p>
 All other scontrol options should be directed to the specific cluster either by
 issuing the command on the cluster or using the <b>--cluster/-M</b> option.
 </p>


 <h2 id="glossary">Glossary<a class="slurm_link" href="#glossary"></a></h2>
 <ul>
 	<li>
 		<b>Federated Job:</b> A job that is submitted to the federated
 		cluster. It has a unique job ID across all clusters (Origin
 		Cluster ID + Local Job ID).
 	</li>
 	<li>
 		<b>Sibling Job:</b> A copy of the federated job that is
 		submitted to other federated clusters.
 	</li>
 	<li>
 		<b>Local Cluster:</b> The cluster found in the slurm.conf that
 		the commands will talk to by default.
 	</li>
 	<li>
 		<b>Origin Cluster:</b> The cluster that the federated job was
 		originally submitted to. The origin cluster submits sibling jobs
 		to other clusters in the federation. The origin cluster
 		determines whether a sibling job can run or not. Communications
 		for the federated job are routed through the origin cluster.
 	</li>
 	<li>
 		<b>Sibling Cluster:</b> The cluster that is associated with a
 		sibling job.
 	</li>
 	<li>
 		<b>Origin Job:</b> The federated job that resides on the cluster
 		that it was originally submitted to.
 	</li>
 	<li>
 		<b>Revoked (RV) State:</b> The state that the origin job is in
 		while the origin job is not actively being scheduled on the
 		origin cluster (e.g. not a viable sibling or one of the sibling
 		jobs is running on a remote cluster). Or the state that a remote
 		sibling job is put in when another sibling is allocated nodes.
 	</li>
 	<li>
 		<b>Viable Sibling:</b> A cluster that is eligible to run a
 		sibling job based off of the requested clusters, cluster
 		features and state of the cluster (e.g. active, draining, etc.).
 	</li>
 	<li>
 		<b>Active Sibling:</b> A cluster that actively has a
 		sibling job and is able to schedule the job.
 	</li>
 </ul>

 <h2 id="limitations">Limitations
 <a class="slurm_link" href="#limitations"></a>
 </h2>
 <ul>
 	<li>A federated job that fails due to resources (partition, node counts,
 		etc.) on the local cluster will be rejected and won't be
 		submitted to other sibling clusters even if it could run on
 		them.</li>
 	<li>Job arrays only run on the cluster that they were submitted to.</li>
 	<li>Job modification must succeed on the origin cluster for the changes
 		to be pushed to the sibling jobs on remote clusters. </li>
 	<li>Modifications to anything other than jobs are disabled in sview.</li>
 	<li>sview grid is disabled in a federated view.</li>
 </ul>

 <p style="text-align:center;">Last modified 9 June 2021</p>

 <!--#include virtual="footer.txt"-->
	<!--#include virtual="header.txt"-->

	<h1>Slurm Federated Scheduling Guide</h1>

	<ul>
	<li><a href="#overview">Overview</a></li>
	<li><a href="#configuration">Configuration</a></li>
	<li><a href="#jobids">Federated Job IDs</a></li>
	<li><a href="#jobsubmission">Job Submission</a></li>
	<li><a href="#jobscheduling">Job Scheduling</a></li>
	<li><a href="#jobrequeue">Job Requeue</a></li>
	<li><a href="#interactivejobs">Interactive Jobs</a></li>
	<li><a href="#jobcancel">Canceling Jobs</a></li>
	<li><a href="#jobmodify">Job Modification</a></li>
	<li><a href="#jobarrays">Job Arrays</a></li>
	<li><a href="#statuscmds">Status Commands</a></li>
	<li><a href="#glossary">Glossary</a></li>
	<li><a href="#limitations">Limitations</a></li>
	</ul>

	<h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>
	<p>Slurm includes support for creating a federation of clusters
	and scheduling jobs in a peer-to-peer fashion between them. Jobs submitted to a
	federation receive a unique job ID that is unique among all clusters in the
	federation. A job is submitted to the local cluster (the cluster defined in the
	slurm.conf) and is then replicated across the clusters in the federation. Each
	cluster then independently attempts to the schedule the job based off of its own
	scheduling policies. The clusters coordinate with the "origin" cluster (cluster
	the job was submitted to) to schedule the job.

	<p>
	<b>NOTE</b>: This is not intended as a high-throughput environment. If
	scheduling more than 50,000 jobs a day, consider configuring fewer clusters that
	the sibling jobs can be submitted to or directing load
	to the local cluster only (e.g. --cluster-constraint= or -M submission options
	could be used to do this).
	</p>


	<h2 id="configuration">Configuration
	<a class="slurm_link" href="#configuration"></a>
	</h2>
	<p>
	A federation is created using the sacctmgr command to create a federation in the
	database and by adding clusters to a federation.
	</p>

	<br>
	To create a federation use:
	<pre>
	sacctmgr add federation <federation_name> [clusters=<list_of_clusters>]
	</pre>

	Clusters can be added or removed from a federation using:
	<br>
	<b>NOTE</b>: A cluster can only be a member of one federation at a time.
	<pre>
	sacctmgr modify federation <federation_name> set clusters[+-]=<list_of_clusters>
	sacctmgr modify cluster <cluster_name> set federation=<federation_name>
	sacctmgr modify federation <federation_name> set clusters=
	sacctmgr modify cluster <cluster_name> set federation=
	</pre>
	<b>NOTE</b>: If a cluster is removed from a federation without first being
	drained, running jobs on the removed cluster, or that originated from the
	removed cluster, will continue to run as non-federated jobs. If a job is pending
	on the origin cluster, the job will remain pending on the origin cluster as a
	non-federated job and the remaining sibling jobs will be removed. If the origin
	cluster is being removed and the job is pending and is only viable on one
	cluster then it will remain pending on the viable cluster as a non-federated
	job. If the origin cluster is being removed and the job is pending and viable on
	multiple clusters other than the origin cluster, then the remaining pending jobs
	will remain pending as a federated job and the remaining sibling clusters will
	schedule amongst themselves to start the job.

	<br>
	<br>
	Federations can be deleted using:
	<pre>
	sacctmgr delete federation <federation_name>
	</pre>


	Generic features can be assigned to clusters and can be requested at submission
	using the <b>--cluster-constraint=[!]<feature_list></b> option:
	<pre>
	sacctmgr modify cluster <cluster_name> set features[+-]=<feature_list>
	</pre>

	A cluster's federated state can be set using:
	<pre>
	sacctmgr modify cluster <cluster_name> set fedstate=<state>
	</pre>
	where possible states are:
	<ul>
	<li><b>ACTIVE:</b> Cluster will actively accept and schedule federated
	jobs</li>
	<li><b>INACTIVE:</b> Cluster will not schedule or accept any jobs</li>
	<li><b>DRAIN:</b> Cluster will not accept any new jobs and will let
	existing federated jobs complete</li>
	<li><b>DRAIN+REMOVE:</b> Cluster will not accept any new jobs and will
	remove itself from the federation once all federated jobs have
	completed. When removed from the federation, the cluster will
	accept jobs as a non-federated cluster</li>
	</ul>

	Federation configuration can be viewed used using:
	<pre>
	sacctmgr show federation [tree]
	sacctmgr show cluster withfed
	</pre>

	After clusters are added to a federation and the controllers are started their
	status can be viewed from the controller using:
	<pre>
	scontrol show federation
	</pre>


	By default the status commands will show a local view. A default federated view
	can be set by configuring the following parameter in the slurm.conf:
	<pre>
	FederationParameters=fed_display
	</pre>


	<h2 id="jobids">Federated Job IDs<a class="slurm_link" href="#jobids"></a></h2>
	When a job is submitted to a federation it gets a federated job id. Job ids in
	the federation are unique across all clusters in the federation. A federated
	job ID is made by utilizing an unsigned 32 bit integer to assign the cluster's
	ID and the cluster's local ID.

	<pre>
	Bits 0-25: Local Job ID
	Bits 26-31: Cluster Origin ID
	</pre>

	Federated job IDs allow the controllers to know which cluster the job was
	submitted to by looking at the cluster origin id of the job.


	<h2 id="jobsubmission">Job Submission
	<a class="slurm_link" href="#jobsubmission"></a>
	</h2>
	<p>
	When a federated cluster receives a job submission, it will submit copies of the
	job (<b>sibling jobs</b>) to each eligible cluster. Each cluster will then
	independently attempt to schedule the job.
	</p>

	<p>
	Jobs can be directed to specific clusters in the federation using the
	<b>-M,--clusters=<cluster_list></b> and the new
	<b>--cluster-constraint=[!]<constraint_list></b> options.
	</p>

	<p>
	Using the <b>-M,--clusters=<cluster_list></b> the submission command
	(sbatch, salloc, srun) will pick one cluster from the list of clusters to submit
	the job to and will also pass along the list of clusters with the job. The
	clusters in the list will be the only viable clusters that siblings jobs can be
	submitted to. For example the submission:
	<pre>
	cluster1$ sbatch -Mcluster2,cluster3 script.sh
	</pre>
	will submit the job to either cluster2 or cluster3 and will only submit sibling
	jobs to cluster2 and cluster3 even if there are more clusters in the federation.
	</p>

	<p>
	Using the <b>--cluster-constraint=[!]<constraint_list></b> option will
	submit sibling jobs to only the clusters that have the requested cluster
	feature(s) -- or don't have the feature(s) if using <b>!</b>. Cluster features
	are added using the <b>sacctmgr modify cluster <cluster_name> set
	features[+-]=<feature_list></b> option.
	</p>

	<p>
	<b>NOTE</b>: When using the <b>!</b> option, add quotes around the option to
	prevent the shell from interpreting the <b>!</b> (e.g
	--cluster-constraint='!highmem').
	</p>


	<p>
	When using both the <b>--cluster-constraint=</b> and
	<b>--clusters=</b> options together, the origin cluster will only submit
	sibling jobs to clusters that meet both requirements.
	</p>

	<p>
	Held or dependent jobs are kept on the origin cluster until they are released
	or are no longer dependent, at which time they are submitted to other viable
	clusters in the federation. If a job becomes held or dependent
	after being submitted, the job is removed from every cluster but the origin.
	</p>

	<h2 id="jobscheduling">Job Scheduling
	<a class="slurm_link" href="#jobscheduling"></a>
	</h2>
	<p>
	Each cluster in the federation independently attempts to schedule each job with
	the exception of coordinating with the <b>origin cluster</b> (cluster where the
	job was submitted to) to allocate resources to a federated job. When a cluster
	determines it can attempt to allocate resources for a job it communicates with
	the origin cluster to verify that no other cluster is attempting to allocate
	resources at the same time. If no other cluster is attempting to allocate
	resources the cluster will attempt to allocate resources for the job. If it
	succeeds then it will notify the origin cluster that it started the job and the
	origin cluster will notify the clusters with sibling jobs to remove the sibling
	jobs and put them in a <b>revoked</b> state. If the cluster was unable to
	allocate resources to the job then it lets the origin cluster know so that other
	clusters can attempt to schedule the job. If it was the main scheduler
	attempting to allocate resources then the main scheduler will stop looking at
	further jobs in the job's partition. If it was the backfill scheduler attempting
	to allocate resources then the resources will be reserved for the job.
	</p>

	<p>
	If an origin cluster is down, then the remote siblings will coordinate with a
	job's viable siblings to schedule the job. When the origin cluster comes back
	up, it will sync with the other siblings.
	</p>

	<h2 id="jobrequeue">Job Requeue
	<a class="slurm_link" href="#jobrequeue"></a>
	</h2>
	<p>
	When a federated job is requeued the origin cluster is notified and the origin
	cluster will then submit new sibling jobs to viable clusters and the federated
	job is eligible to start on a different cluster than the one it ran on.
	</p>

	<p>
	slurm.conf options <b>RequeueExit</b> and <b>RequeueExitHold</b> are controlled
	by the origin cluster.
	</p>


	<h2 id="interactivejobs">Interactive Jobs
	<a class="slurm_link" href="#interactivejobs"></a>
	</h2>
	<p>
	Interactive jobs -- jobs submitted with srun and salloc -- can be submitted to
	the local cluster and get an allocation from a different cluster. When an salloc
	job allocation is granted by a cluster other than the local cluster, a new
	environment variable, SLURM_WORKING_CLUSTER, will be set with the remote sibling
	cluster's IP address, port and RPC version so that any sruns will know which
	cluster to communicate with.
	</p>

	<p>
	<b>NOTE</b>: It is required that all compute nodes must be accessible to all
	submission hosts for this to work.
	<br>
	<b>NOTE</b>: The current implementation of the MPI interfaces in Slurm require
	the SlurmdSpooldir to be the same on the host where the srun is being run as it
	is on the compute nodes in the allocation. If they aren't, a workaround is to
	get an allocation that puts the user on the actual compute node. Then the sruns
	on the compute nodes will be using the slurm.conf that corresponds to the
	correct cluster. Setting <b>LaunchParameters=use_interactive_step</b>
	slurm.conf will put the user on an actual compute node when using salloc.
	</p>


	<h2 id="jobcancel">Canceling Jobs
	<a class="slurm_link" href="#jobcancel"></a>
	</h2>
	<p>
	Cancel requests in the federation will cancel the running sibling job or all
	pending sibling jobs. Specific pending sibling jobs can be removed by using
	<b>scancel</b>'s <b>--sibling=<cluster_name></b> option to remove the
	sibling job from the job's active sibling list.
	</p>


	<h2 id="jobmodify">Job Modification
	<a class="slurm_link" href="#jobmodify"></a>
	</h2>
	Job modifications are routed to the origin cluster where the origin cluster will
	push out the changes to each sibling job.


	<h2 id="jobarrays">Job Arrays<a class="slurm_link" href="#jobarrays"></a></h2>
	Currently, job arrays only run on the origin cluster.


	<h2 id="statuscmds">Status Commands
	<a class="slurm_link" href="#statuscmds"></a>
	</h2>
	<p>
	By default, status commands, such as: squeue, sinfo, sprio, sacct, sreport, will
	show a view local to the local cluster. A unified view of the jobs in the
	federation can be viewed using the <b>--federation</b> option to each status
	command. The <b>--federation</b> command causes the status command to first
	check if the local cluster is part of a federation. If it is then the command
	will query each cluster in parallel for job info and will combine the
	information into one unified view.
	</p>

	<p>
	A new <b>FederationParameters=fed_display</b> slurm.conf parameter has been
	added so that all status commands will present a federated view by default --
	equivalent to setting the <b>--federation</b> option for each status command.
	The federated view can be overridden using the <b>--local</b> option. Using the
	<b>--clusters,-M</b> option will also override the federated view and give a
	local view for the given cluster(s).
	</p>

	<p>
	Using the existing <b>--clusters,-M option</b>, the status commands will output
	the information in the same format that exists today where each cluster's
	information is listed separately.
	</p>

	<h3 id="squeue">squeue<a class="slurm_link" href="#squeue"></a></h3>
	<p>
	squeue also has a new <b>--sibling</b> option that will show each sibling job
	rather than merge them into one.
	</p>

	<p>
	Several new long format options have been added to display the job's federated
	information:
	<ul>
	<li>
	<b>cluster:</b> Name of the cluster that is running the job or
	job step.
	</li>
	<li>
	<b>siblingsactive:</b> Cluster names of where federated sibling
	jobs exist.
	</li>
	<li>
	<b>siblingsactiveraw:</b> Cluster IDs of where federated
	sibling jobs exist.
	</li>
	<li>
	<b>siblingsviable:</b> Cluster names of where federated sibling
	jobs are viable to run.
	</li>
	<li>
	<b>siblingsviableraw:</b> Cluster IDs of where federated
	sibling jobs are viable to run.
	</li>
	</ul>
	</p>

	<p>
	squeue output can be sorted using the <b>-S cluster</b> option.
	</p>
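
	<p>
	The options above might be combined as in the following sketch. It assumes
	the long format fields are passed through squeue's <b>-O</b>
	(<b>--Format</b>) option together with the standard <i>jobid</i> field. No
	output is shown because it depends on the federation.
	</p>
	<pre>
# One line per federated job, with its cluster and sibling information,
# sorted by cluster name
squeue --federation -O jobid,cluster,siblingsactive,siblingsviable -S cluster

# One line per sibling job instead of one merged line per federated job
squeue --federation --sibling
</pre>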


	<h3 id="sinfo">sinfo<a class="slurm_link" href="#sinfo"></a></h3>
	<p>
	sinfo will show the partitions from each cluster in one view. In a federated
	view, the cluster name is displayed with each partition. The cluster name can be
	specified in the format options using the short format <b>%V</b> or the long
	format <b>cluster</b> options. The output can be sorted by cluster names using
	the <b>-S %[+-]V</b> option.
	</p>
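
	<p>
	For example, a format string that places the cluster name first could look
	like the sketch below. The fields after <b>%V</b> (partition, availability,
	node count and node state) are an arbitrary choice made for illustration.
	</p>
	<pre>
# Cluster name (%V) followed by partition, availability, node count and state
sinfo --federation -o "%V %P %a %D %T"
</pre>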

	<h3 id="sprio">sprio<a class="slurm_link" href="#sprio"></a></h3>
	<p>
	In a federated view, sprio displays the job information from the local cluster
	or from the first cluster to report the job. Since each sibling job could have a
	different priority on each cluster it may be helpful to use the <b>--sibling</b>
	option to show all records of a job to get a better picture of a job's priority.
	The name of the cluster reporting the job record can be displayed using the
	<b>%c</b> format option. The cluster name is shown by default when using
	<b>--sibling</b> option.
	</p>
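
	<p>
	A short sketch of both approaches follows. The second command assumes the
	standard sprio format fields <b>%i</b> (job ID) and <b>%Y</b> (job priority)
	alongside <b>%c</b>.
	</p>
	<pre>
# One record per sibling job, with the reporting cluster shown by default
sprio --federation --sibling

# Merged federated view with the reporting cluster added explicitly
sprio --federation -o "%c %i %Y"
</pre>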

	<h3 id="sacct">sacct<a class="slurm_link" href="#sacct"></a></h3>
	<p>
	By default, sacct does not display "revoked" jobs and shows the job from the
	cluster that ran it. However, "revoked" jobs can be viewed using the
	<b>--duplicates/-D</b> option.
	</p>
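
	<p>
	As an illustration, the commands below look up one federated job with and
	without its revoked sibling records. The job ID is hypothetical, and the
	field list is only one possible selection of sacct fields.
	</p>
	<pre>
# Only the record from the cluster that ran the job (hypothetical job ID)
sacct -j 134217730 --format=JobID,Cluster,State

# Include the revoked sibling records as well
sacct -D -j 134217730 --format=JobID,Cluster,State
</pre>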

	<h3 id="sreport">sreport<a class="slurm_link" href="#sreport"></a></h3>
	<p>
	sreport will combine the reports from each cluster and display them as one.
	</p>

	<h3 id="scontrol">scontrol<a class="slurm_link" href="#scontrol"></a></h3>
	<p>
	The following scontrol options will display a federated view:
	<ul>
	<li>show [--federation|--sibling] jobs</li>
	<li>show [--federation] steps</li>
	<li>completing</li>
	</ul>
	</p>

	<p>
	The following scontrol options are handled in a federation. If the command is
	run from a cluster other than the job's origin cluster, it is routed to the
	origin cluster.
	<ul>
	<li>hold</li>
	<li>uhold</li>
	<li>release</li>
	<li>requeue</li>
	<li>requeuehold</li>
	<li>suspend</li>
	<li>update job</li>
	</ul>
	</p>

	<p>
	All other scontrol options should be directed to the specific cluster, either by
	issuing the command on that cluster or by using the <b>--clusters/-M</b> option.
	</p>
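
	<p>
	The three cases above can be sketched as follows. The job ID and the cluster
	name <i>fed2</i> are hypothetical. The federation options are written before
	the subcommand here, which is how scontrol accepts its command-line options.
	</p>
	<pre>
# Federated view: one record per sibling job
scontrol --sibling show jobs

# Federation-aware: routed to the job's origin cluster if run elsewhere
scontrol hold 134217730

# Not federation-aware: address one cluster explicitly (hypothetical name)
scontrol -M fed2 show partition
</pre>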




	<h2 id="glossary">Glossary<a class="slurm_link" href="#glossary"></a></h2>
	<ul>
	<li>
	<b>Federated Job:</b> A job that is submitted to a cluster in a
	federation. It has a unique job ID across all clusters (Origin
	Cluster ID + Local Job ID).
	</li>
	<li>
	<b>Sibling Job:</b> A copy of the federated job that is
	submitted to other federated clusters.
	</li>
	<li>
	<b>Local Cluster:</b> The cluster found in the slurm.conf that
	the commands will talk to by default.
	</li>
	<li>
	<b>Origin Cluster:</b> The cluster that the federated job was
	originally submitted to. The origin cluster submits sibling jobs
	to other clusters in the federation. The origin cluster
	determines whether a sibling job can run or not. Communications
	for the federated job are routed through the origin cluster.
	</li>
	<li>
	<b>Sibling Cluster:</b> The cluster that is associated with a
	sibling job.
	</li>
	<li>
	<b>Origin Job:</b> The federated job that resides on the cluster
	that it was originally submitted to.
	</li>
	<li>
	<b>Revoked (RV) State:</b> The state that the origin job is in
	while it is not actively being scheduled on the
	origin cluster (e.g. the origin cluster is not a viable sibling, or one
	of the sibling jobs is running on a remote cluster). It is also the state
	that a remote sibling job is put in when another sibling is allocated nodes.
	</li>
	<li>
	<b>Viable Sibling:</b> A cluster that is eligible to run a
	sibling job based on the requested clusters, cluster
	features and state of the cluster (e.g. active, draining, etc.).
	</li>
	<li>
	<b>Active Sibling:</b> A cluster that actively has a
	sibling job and is able to schedule the job.
	</li>
	</ul>
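
	<p>
	To make the "Origin Cluster ID + Local Job ID" composition concrete, the
	sketch below splits a federated job ID into its two parts with shell
	arithmetic. It assumes the layout described in the Federated Job IDs section,
	with the local job ID in the low 26 bits and the origin cluster ID in the
	bits above them. The job ID is a made-up value chosen for illustration.
	</p>
	<pre>
# Hypothetical federated job ID used only for illustration
jobid=134217730

# Assumed layout: the low 26 bits hold the local job ID and the bits
# above them hold the origin cluster ID (2^26 = 67108864)
local_id=$(( jobid % 67108864 ))
cluster_id=$(( jobid / 67108864 ))

echo "origin cluster ID: ${cluster_id}, local job ID: ${local_id}"
</pre>
	<p>
	Division and modulo are used in place of shifts and masks so the snippet can
	be pasted into any POSIX shell unchanged.
	</p>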

	<h2 id="limitations">Limitations
	<a class="slurm_link" href="#limitations"></a>
	</h2>
	<ul>
	<li>A federated job that fails due to resources (partition, node counts,
	etc.) on the local cluster will be rejected and won't be
	submitted to other sibling clusters even if it could run on
	them.</li>
	<li>Job arrays only run on the cluster that they were submitted to.</li>
	<li>Job modification must succeed on the origin cluster for the changes
	to be pushed to the sibling jobs on remote clusters. </li>
	<li>Modifications to anything other than jobs are disabled in sview.</li>
	<li>sview grid is disabled in a federated view.</li>
	</ul>

	<p style="text-align:center;">Last modified 9 June 2021</p>

	<!--#include virtual="footer.txt"-->