<!--#include virtual="header.txt"-->
<h1>Heterogeneous Job Support</h1>
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#submitting">Submitting Jobs</a></li>
<li><a href="#burst_buffer">Burst Buffers</a></li>
<li>
<a href="#managing">Managing Jobs</a>
<ul><li><a href="#accounting">Accounting</a></li></ul>
</li>
<li><a href="#job_steps">Launching Applications (Job Steps)</a></li>
<li><a href="#env_var">Environment Variables</a></li>
<li><a href="#examples">Examples</a></li>
<li><a href="#limitations">Limitations</a></li>
<li><a href="#het_steps">Heterogeneous Steps</a></li>
<li><a href="#sys_admin">System Administrator Information</a></li>
</ul>
<h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>
<p>Slurm version 17.11 and later supports the ability to submit and manage
heterogeneous jobs, in which each component has virtually all job options
available including partition, account and QOS (Quality Of Service).
For example, part of a job might require four cores and 4 GB for each of 128
tasks while another part of the job would require 16 GB of memory and one CPU.</p>
<h2 id="submitting">Submitting Jobs
<a class="slurm_link" href="#submitting"></a>
</h2>
<p>The <i>salloc</i>, <i>sbatch</i> and <i>srun</i> commands can all be used
to submit heterogeneous jobs.
Resource specifications for each component of the heterogeneous job should be
separated with the ":" character.
For example:</p>
<pre>
$ sbatch --cpus-per-task=4 --mem-per-cpu=1g --ntasks=128 : \
--cpus-per-task=1 --mem-per-cpu=16g --ntasks=1 my.bash
</pre>
<p>Options specified for one component of a heterogeneous job (or job step) will
be used for subsequent components to the extent expected to be helpful.
Propagated options can be reset as desired for each component (e.g. a different
account name could be specified for each hetjob component).
For example, <i>--immediate</i> and <i>--job-name</i> are propagated, while
<i>--ntasks</i> and <i>--mem-per-cpu</i> are reset to default values for each
component.
A list of propagated options follows, with an example after the list.</p>
<ul>
<li>--account</li>
<li>--acctg-freq</li>
<li>--begin</li>
<li>--cluster-constraint</li>
<li>--clusters</li>
<li>--comment</li>
<li>--deadline</li>
<li>--delay-boot</li>
<li>--dependency</li>
<li>--distribution</li>
<li>--epilog (option available only in srun)</li>
<li>--error</li>
<li>--export</li>
<li>--export-file</li>
<li>--exclude</li>
<li>--get-user-env</li>
<li>--gid</li>
<li>--hold</li>
<li>--ignore-pbs</li>
<li>--immediate</li>
<li>--input</li>
<li>--job-name</li>
<li>--kill-on-bad-exit (option available only in srun)</li>
<li>--label (option available only in srun)</li>
<li>--mcs-label</li>
<li>--mem</li>
<li>--msg-timeout (option available only in srun)</li>
<li>--no-allocate (option available only in srun)</li>
<li>--no-requeue</li>
<li>--nice</li>
<li>--no-kill</li>
<li>--open-mode (option available only in srun)</li>
<li>--output</li>
<li>--parsable</li>
<li>--priority</li>
<li>--profile</li>
<li>--propagate</li>
<li>--prolog (option available only in srun)</li>
<li>--pty (option available only in srun)</li>
<li>--qos</li>
<li>--quiet</li>
<li>--quit-on-interrupt (option available only in srun)</li>
<li>--reboot</li>
<li>--reservation</li>
<li>--requeue</li>
<li>--signal</li>
<li>--slurmd-debug (option available only in srun)</li>
<li>--task-epilog (option available only in srun)</li>
<li>--task-prolog (option available only in srun)</li>
<li>--time</li>
<li>--test-only</li>
<li>--time-min</li>
<li>--uid</li>
<li>--unbuffered (option available only in srun)</li>
<li>--verbose</li>
<li>--wait</li>
<li>--wait-all-nodes</li>
<li>--wckey</li>
<li>--workdir</li>
</ul>
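<p>As a minimal sketch of this propagation (the account name "abc", job name
"hetdemo" and script name are illustrative), the account and job name given
for the first component below are inherited by the second component unless
explicitly reset there, while --ntasks must be supplied for each component:</p>
<pre>
# "--account" and "--job-name" propagate to the second component;
# "--ntasks" must be given for each component. A different account could
# be set for the second component by adding --account to it.
$ sbatch --account=abc --job-name=hetdemo --ntasks=16 : --ntasks=1 my.bash
</pre>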
<p>The task distribution specification applies separately within each job
component. Consider for example a heterogeneous job with each component being
allocated 4 CPUs on 2 nodes. In our example, job component zero is allocated
2 CPUs on node "nid00001" and 2 CPUs on node "nid00002". Job component one is
allocated 2 CPUs on node "nid00003" and 2 CPUs on node "nid00004". A task
distribution of "cyclic" will distribute the first 4 tasks in a cyclic fashion
on nodes "nid00001" and "nid00002", then distribute the next 4 tasks in a cyclic
fashion on nodes "nid00003" and "nid00004" as shown below.</p>
<table width="100%" border=1 cellspacing=4 cellpadding=4>
<tr><td align=center>Node nid00001</td><td align=center>Node nid00002</td><td align=center>Node nid00003</td><td align=center>Node nid00004</td></tr>
<tr><td align=center nowrap>Rank 0</td><td align=center>Rank 1</td><td align=center>Rank 4</td><td align=center>Rank 5</td></tr>
<tr><td align=center nowrap>Rank 2</td><td align=center>Rank 3</td><td align=center>Rank 6</td><td align=center>Rank 7</td></tr>
</table>
<br>
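<p>Within an allocation like the one described above, such a layout could be
requested as follows (a sketch; the application name "my_app" is illustrative).
Since --distribution is a propagated option, it only needs to be specified for
the first component:</p>
<pre>
$ srun --distribution=cyclic --ntasks=4 my_app : --ntasks=4 my_app
</pre>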
<p>Some options should be specified only in the first hetjob component.
For example, specifying a batch job output file in the second hetjob component's
options will result in the first hetjob component (where the batch script
executes) using the default output file name.</p>
<p>Environment variables used to specify default options for the job submit
command will be applied to every component of the heterogeneous job
(e.g. <i>SBATCH_ACCOUNT</i>).</p>
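<p>For example (a sketch; the account name "abc" is illustrative), an account
default supplied through the environment applies to both components below:</p>
<pre>
$ export SBATCH_ACCOUNT=abc
$ sbatch --ntasks=16 : --ntasks=1 my.bash
</pre>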
<p>Batch job options can be included in the submitted script for multiple
heterogeneous job components. The options for each component should be
separated by a line containing "#SBATCH hetjob" as shown below.</p>
<pre>
$ cat new.bash
#!/bin/bash
#SBATCH --cpus-per-task=4 --mem-per-cpu=16g --ntasks=1
#SBATCH hetjob
#SBATCH --cpus-per-task=2 --mem-per-cpu=1g --ntasks=8
srun run.app
$ sbatch new.bash
</pre>
<p>This is equivalent to the following:</p>
<pre>
$ cat my.bash
#!/bin/bash
srun run.app
$ sbatch --cpus-per-task=4 --mem-per-cpu=16g --ntasks=1 : \
--cpus-per-task=2 --mem-per-cpu=1g --ntasks=8 my.bash
</pre>
<p>The batch script will be executed in the first node in the first component
of the heterogeneous job. For the above example, that will be the job component
with 1 task, 4 CPUs and 64 GB of memory (16 GB for each of the 4 CPUs).</p>
<p>If a heterogeneous job is submitted to run in multiple clusters <u>not</u>
part of a federation (e.g. "sbatch --clusters=alpha,beta ...") then the entire
job will be sent to the cluster expected to be able to start all components
at the earliest time.</p>
<p>A resource limit test is performed when a heterogeneous job is submitted in
order to immediately reject jobs that will not be able to start with current
limits.
The individual components of the heterogeneous job are validated, like all
regular jobs.
The heterogeneous job as a whole is also tested, but in a more limited
fashion with respect to quality of service (QOS) limits.
Each component of a heterogeneous job counts as a "job" with respect to
resource limits.</p>
<h2 id="burst_buffer">Burst Buffers
<a class="slurm_link" href="#burst_buffer"></a>
</h2>
<p>A burst buffer can either be persistent or linked to a specific job ID.
Since a heterogeneous job consists of multiple job IDs, a job-specific burst
buffer will be associated with only one heterogeneous job component.
Each component can have its own burst buffer directives, and they are processed
separately. Only a persistent burst buffer can be accessed by all components
of a heterogeneous job. Persistent burst buffers are only available in the
datawarp plugin. A sample batch script demonstrating this for the datawarp
plugin is shown below.</p>
<pre>
#!/bin/bash
#SBATCH --nodes=1 --constraint=haswell
#BB create_persistent name=alpha capacity=10 access=striped type=scratch
#DW persistentdw name=alpha
#SBATCH hetjob
#SBATCH --nodes=16 --constraint=knl
#DW persistentdw name=alpha
...
</pre>
<p><b>NOTE</b>: Cray's DataWarp interface directly reads the job script, but
has no knowledge of Slurm's "hetjob" directive, so Slurm internally rebuilds
the script for each job component so that only that component's burst buffer
directives are included in it. In the batch script for the first job component,
the burst buffer directives of the other job components are replaced with
"#EXCLUDED directive", where directive is "DW" or "BB" for the datawarp plugin
and is the <a href="burst_buffer.html#submit_lua">configured</a> value for the
lua plugin.
This prevents their interpretation by the Cray infrastructure and aids
administrators in writing an interface for the lua plugin.
Since the batch script will only be executed by the first job
component, the scripts for subsequent job components will not include commands
from the original script. These scripts are built and managed by Slurm for
internal purposes (and are visible from various Slurm commands) from a user
script as shown above. An example is shown below:</p>
<pre>
<b>Rebuilt script for first job component</b>
#!/bin/bash
#SBATCH --nodes=1 --constraint=haswell
#BB create_persistent name=alpha capacity=10 access=striped type=scratch
#DW persistentdw name=alpha
#SBATCH hetjob
#SBATCH --nodes=16 --constraint=knl
#EXCLUDED DW persistentdw name=alpha
...
<b>Rebuilt script for second job component</b>
#!/bin/bash
#SBATCH --nodes=16 --constraint=knl
#DW persistentdw name=alpha
exit 0
</pre>
<h2 id="managing">Managing Jobs<a class="slurm_link" href="#managing"></a></h2>
<p>Information maintained in Slurm for a heterogeneous job includes:</p>
<ul>
<li><i>job_id</i>: Each component of a heterogeneous job will have its own
unique <i>job_id</i>.</li>
<li><i>het_job_id</i>: This identification number applies to all components
of the heterogeneous job. All components of the same job will have the same
<i>het_job_id</i> value and it will be equal to the <i>job_id</i> of the
first component. We refer to this as the "heterogeneous job leader".</li>
<li><i>het_job_id_set</i>: Regular expression identifying all <i>job_id</i>
values associated with the job.</li>
<li><i>het_job_offset</i>: A unique sequence number applied to each component
of the heterogeneous job. The first component will have a <i>het_job_offset</i>
value of 0, the next a value of 1, etc.</li>
</ul>
<table width="100%" border=1 cellspacing=0 cellpadding=4>
<tr>
<th width="25%"><b>job_id</b></th>
<th width="25%"><b>het_job_id</b></th>
<th width="25%"><b>het_job_offset</b></th>
<th width="25%"><b>het_job_id_set</b></th>
</tr>
<tr><td>123</td><td>123</td><td>0</td><td>123-127</td></tr>
<tr><td>124</td><td>123</td><td>1</td><td>123-127</td></tr>
<tr><td>125</td><td>123</td><td>2</td><td>123-127</td></tr>
<tr><td>126</td><td>123</td><td>3</td><td>123-127</td></tr>
<tr><td>127</td><td>123</td><td>4</td><td>123-127</td></tr>
</table>
<p>Table 1: Example job IDs</p>
<p>The <i>squeue</i> and <i>sview</i> commands report the
components of a heterogeneous job using the format
"&lt;het_job_id&gt;+&lt;het_job_offset&gt;".
For example "123+4" would represent heterogeneous job id 123 and its fifth
component (note: the first component has a <i>het_job_offset</i> value of 0).</p>
<p>A request for a specific job ID that identifies the first component
of a heterogeneous job (i.e. the "heterogeneous job leader") will return
information about all components of that job. For example:</p>
<pre>
$ squeue --job=93
JOBID PARTITION NAME USER ST TIME NODES NODELIST
93+0 debug bash adam R 18:18 1 nid00001
93+1 debug bash adam R 18:18 1 nid00011
93+2 debug bash adam R 18:18 1 nid00021
</pre>
<p>A request to cancel or otherwise signal a heterogeneous job leader will be applied to
all components of that heterogeneous job. A request to cancel a specific component of
the heterogeneous job using the "#+#" notation will apply only to that specific component.
For example:</p>
<pre>
$ squeue --job=93
JOBID PARTITION NAME USER ST TIME NODES NODELIST
93+0 debug bash adam R 19:18 1 nid00001
93+1 debug bash adam R 19:18 1 nid00011
93+2 debug bash adam R 19:18 1 nid00021
$ scancel 93+1
$ squeue --job=93
JOBID PARTITION NAME USER ST TIME NODES NODELIST
93+0 debug bash adam R 19:38 1 nid00001
93+2 debug bash adam R 19:38 1 nid00021
$ scancel 93
$ squeue --job=93
JOBID PARTITION NAME USER ST TIME NODES NODELIST
</pre>
<p>While a heterogeneous job is in pending state, only the entire job can be
cancelled rather than its individual components.
A request to cancel an individual component of a heterogeneous job in
pending state will return an error.
After the job has begun execution, the individual component can be cancelled.</p>
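<p>A sketch of this behavior for a hypothetical pending heterogeneous job 123:</p>
<pre>
# While heterogeneous job 123 is still pending:
$ scancel 123+1   # returns an error; a pending component can not be cancelled individually
$ scancel 123     # cancels the entire pending heterogeneous job
</pre>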
<p>Email notification for job state changes (the <i>--mail-type</i> option)
is only supported for a heterogeneous job leader. Requests for email
notifications for other components of a heterogeneous job will be silently
ignored.</p>
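<p>For example (a sketch; the mail address is illustrative), mail options
should be placed on the first (leader) component:</p>
<pre>
$ sbatch --mail-type=END --mail-user=adam@example.com --ntasks=16 : \
--ntasks=1 my.bash
</pre>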
<p>Requests to modify an individual component of a job using the scontrol
command must specify the job ID with the "#+#" notation.
A request to modify a job by specifying the het_job_id will modify all
components of a heterogeneous job.
For example:</p>
<pre>
# Change the account of component 2 of heterogeneous job 123:
$ scontrol update jobid=123+2 account=abc
# Change the time limit of all components of heterogeneous job 123:
$ scontrol update jobid=123 timelimit=60
</pre>
<p>The following operations can only be requested of a heterogeneous job leader
and will be applied to all components of that heterogeneous job. Requests to
operate on individual components of the heterogeneous job will return an error,
as sketched after the list below.</p>
<ul>
<li>requeue</li>
<li>resume</li>
<li>suspend</li>
</ul>
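<p>A sketch of these operations against a hypothetical heterogeneous job 123:</p>
<pre>
# Operate on the whole heterogeneous job through its leader:
$ scontrol suspend 123     # suspends every component
$ scontrol resume 123      # resumes every component
$ scontrol requeue 123     # requeues every component
# Operating on an individual component is rejected:
$ scontrol suspend 123+2   # returns an error
</pre>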
<p>The sbcast command supports heterogeneous job allocations. By default,
sbcast will copy files to all nodes in the job allocation. The -j/--jobid
option can be used to copy files to individual components as shown below.</p>
<pre>
$ sbcast --jobid=123 data /tmp/data
$ sbcast --jobid=123.0 app0 /tmp/app0
$ sbcast --jobid=123.1 app1 /tmp/app1
</pre>
<p>The srun command's --bcast option will transfer files to the nodes associated
with the application(s) to be launched, as specified by the --het-group option.</p>
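<p>For example (a sketch; the program name and destination path are
illustrative), the executable below is copied only to the nodes of component 1
before it is launched there:</p>
<pre>
$ srun --het-group=1 --bcast=/tmp/app1 app1
</pre>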
<p>Slurm has a configuration option to control the behavior of some commands with
respect to heterogeneous jobs.
By default, a request to cancel, hold or release a job ID that is not the
het_job_id, but that of a job component, will only operate on that one component
of the heterogeneous job.
If the SchedulerParameters configuration parameter includes the option
"whole_hetjob", then the operation applies to all components of the job when
any job component is specified. In the example below, the
scancel command will cancel all components of job 93 if
SchedulerParameters=whole_hetjob is configured; otherwise only job 93+1 will be
cancelled. If a specific heterogeneous job component is specified (e.g. "scancel
93+1"), then only that one component will be affected.</p>
<pre>
$ squeue --job=93
JOBID PARTITION NAME USER ST TIME NODES NODELIST
93+0 debug bash adam R 19:18 1 nid00001
93+1 debug bash adam R 19:18 1 nid00011
93+2 debug bash adam R 19:18 1 nid00021
$ scancel 94 (where job ID 94 is equivalent to 93+1)
# Cancel 93+0, 93+1 and 93+2 if SchedulerParameters includes "whole_hetjob"
# Cancel only 93+1 if SchedulerParameters does not include "whole_hetjob"
</pre>
<h3 id="accounting">Accounting<a class="slurm_link" href="#accounting"></a></h3>
<p>Slurm's accounting database records the het_job_id and het_job_offset
fields.
The sacct command reports jobs using the format
"&lt;het_job_id&gt;+&lt;het_job_offset&gt;" and can accept a job ID
specification for filtering using the same format.
If a het_job_id value is specified as a job filter, then information about
all components of that job will be reported by default, as shown below.
The <i>--whole-hetjob=[yes|no]</i> option can be used to force reporting of
either all components of the job or only the specific component requested,
regardless of whether the job filter is the het_job_id (leader) or not.
</p>
<pre>
$ sacct -j 67767
JobID JobName Partition Account AllocCPUS State ExitCode
------- ------- --------- ------- --------- --------- --------
67767+0 foo debug test 2 COMPLETED 0:0
67767+1 foo debug test 4 COMPLETED 0:0
$ sacct -j 67767+1
JobID JobName Partition Account AllocCPUS State ExitCode
------- ------- --------- ------- --------- --------- --------
67767+1 foo debug test 4 COMPLETED 0:0
$ sacct -j 67767 --whole-hetjob=no
JobID JobName Partition Account AllocCPUS State ExitCode
------- ------- --------- ------- --------- --------- --------
67767+0 foo debug test 2 COMPLETED 0:0
$ sacct -j 67767+1 --whole-hetjob=yes
JobID JobName Partition Account AllocCPUS State ExitCode
------- ------- --------- ------- --------- --------- --------
67767+0 foo debug test 2 COMPLETED 0:0
67767+1 foo debug test 4 COMPLETED 0:0
</pre>
<h2 id="job_steps">Launching Applications (Job Steps)
<a class="slurm_link" href="#job_steps"></a>
</h2>
<p>The srun command is used to launch applications.
By default, the application is launched only on the first component of a
heterogeneous job, but options are available to support different behaviors.</p>
<p>srun's "--het-group" option defines which hetjob component(s) are to have
applications launched for them. The --het-group option takes an expression
defining which component(s) are to launch an application for an individual
execution of the srun command. The expression can contain one or more component
index values in a comma separated list. Ranges of index values can be specified
in a hyphen separated list. By default, an application is launched only on
component number zero. Some examples follow:</p>
<ul>
<li>--het-group=2</li>
<li>--het-group=0,4</li>
<li>--het-group=1,3-5</li>
</ul>
<p><b>IMPORTANT:</b> The ability to execute a single application across more
than one job allocation does not work with all MPI implementations or Slurm MPI
plugins. Slurm's ability to execute such an application can be disabled on the
entire cluster by adding "disable_hetjob_steps" to Slurm's SchedulerParameters
configuration parameter.</p>
<p><b>IMPORTANT:</b> While the srun command can be used to launch heterogeneous
job steps, mpirun would require substantial modification to support
heterogeneous applications. We are aware of no such mpirun development efforts
at this time.</p>
<p>By default, the applications launched by a single execution of the srun
command (even for different components of the heterogeneous job) are combined
into one MPI_COMM_WORLD with non-overlapping task IDs.</p>
<p>As with the salloc and sbatch commands, the ":" character is used to
separate multiple components of a heterogeneous job.
This convention means that the stand-alone ":" character can not be used as an
argument to an application launched by srun.
Different applications and arguments can be executed for each job component.
If some heterogeneous job component lacks an application specification, the next
application specification provided will be used for the earlier components
lacking one, as shown below.</p>
<pre>
$ srun --label -n2 : -n1 hostname
0: nid00012
1: nid00012
2: nid00013
</pre>
<p>If multiple srun commands are executed concurrently, this may result in resource
contention (e.g. memory limits preventing some job step components from being
allocated resources because two srun commands are executing at the same time).
If the srun --het-group option is used to create multiple job steps (for the
different components of a heterogeneous job), those job steps will be created
sequentially.
When multiple srun commands execute at the same time, this may result in some
step allocations taking place, while others are delayed.
Only after all job step allocations have been granted will the application
be launched.</p>
<p>All components of a job step will have the same step ID value.
If job steps are launched on subsets of the job components there may be gaps in
the step ID values for individual job components.</p>
<pre>
$ salloc -n1 : -n2 -C beta bash
salloc: Pending job allocation 1721
salloc: Granted job allocation 1721
$ srun --het-group=0,1 true # Launches steps 1721.0 and 1722.0
$ srun --het-group=0 true # Launches step 1721.1, no 1722.1
$ srun --het-group=0,1 true # Launches steps 1721.2 and 1722.2
</pre>
<p>The maximum het-group index specified in a job step request (either explicitly
specified or implied by the ":" separator) must not exceed the highest component
index in the heterogeneous job allocation (the number of components minus one,
since the indexes start at zero). For example:</p>
<pre>
$ salloc -n1 -C alpha : -n2 -C beta bash
salloc: Pending job allocation 1728
salloc: Granted job allocation 1728
$ srun --het-group=0,1 hostname
nid00001
nid00008
nid00008
$ srun hostname : date : id
error: Attempt to run a job step with het-group value of 2,
but the job allocation has maximum value of 1
</pre>
<h2 id="env_var">Environment Variables
<a class="slurm_link" href="#env_var"></a>
</h2>
<p>Slurm environment variables will be set independently for each component of
the job by appending "_HET_GROUP_" and a sequence number to the usual name.
In addition, the "SLURM_JOB_ID" environment variable will contain the job ID
of the heterogeneous job leader and "SLURM_HET_SIZE" will contain the number of
components in the job. Note that if using srun with a single specific
het group (for instance --het-group=1) "SLURM_JOB_ID" will contain the job
ID of the heterogeneous job leader. The job ID for a specific heterogeneous
component is set in "SLURM_JOB_ID_HET_GROUP_&lt;component_id&gt;". For example:
</p>
<pre>
$ salloc -N1 : -N2 bash
salloc: Pending job allocation 11741
salloc: job 11741 queued and waiting for resources
salloc: job 11741 has been allocated resources
$ env | grep SLURM
SLURM_JOB_ID=11741
SLURM_HET_SIZE=2
SLURM_JOB_ID_HET_GROUP_0=11741
SLURM_JOB_ID_HET_GROUP_1=11742
SLURM_JOB_NODES_HET_GROUP_0=1
SLURM_JOB_NODES_HET_GROUP_1=2
SLURM_JOB_NODELIST_HET_GROUP_0=nid00001
SLURM_JOB_NODELIST_HET_GROUP_1=nid[00011-00012]
...
$ srun --het-group=1 printenv SLURM_JOB_ID
11741
11741
$ srun --het-group=0 printenv SLURM_JOB_ID
11741
$ srun --het-group=1 printenv SLURM_JOB_ID_HET_GROUP_1
11742
11742
$ srun --het-group=0 printenv SLURM_JOB_ID_HET_GROUP_0
11741
</pre>
<p>The various MPI implementations rely heavily upon Slurm environment variables
for proper operation.
A single MPI application executing in a single MPI_COMM_WORLD requires a
uniform set of environment variables that reflect a single job allocation.
The example below shows how Slurm sets environment variables for MPI.</p>
<pre>
$ salloc -N1 : -N2 bash
salloc: Pending job allocation 11751
salloc: job 11751 queued and waiting for resources
salloc: job 11751 has been allocated resources
$ env | grep SLURM
SLURM_JOB_ID=11751
SLURM_HET_SIZE=2
SLURM_JOB_ID_HET_GROUP_0=11751
SLURM_JOB_ID_HET_GROUP_1=11752
SLURM_JOB_NODELIST_HET_GROUP_0=nid00001
SLURM_JOB_NODELIST_HET_GROUP_1=nid[00011-00012]
...
$ srun --het-group=0,1 env | grep SLURM
SLURM_JOB_ID=11751
SLURM_JOB_NODELIST=nid[00001,00011-00012]
...
</pre>
<h2 id="examples">Examples<a class="slurm_link" href="#examples"></a></h2>
<p>Create a heterogeneous resource allocation containing one node with 256GB
of memory and a feature of "haswell" plus 2176 cores on 32 nodes with a
feature of "knl". Then launch a program called "server" on the "haswell" node
and "client" on the "knl" nodes. Each application will be in its own
MPI_COMM_WORLD.</p>
<pre>
salloc -N1 --mem=256GB -C haswell : \
-n2176 -N32 --ntasks-per-core=1 -C knl bash
srun server &
srun --het-group=1 client &
wait
</pre>
<p>This variation of the above example launches programs "server" and "client"
in a single MPI_COMM_WORLD.</p>
<pre>
salloc -N1 --mem=256GB -C haswell : \
-n2176 -N32 --ntasks-per-core=1 -C knl bash
srun server : client
</pre>
<p>The SLURM_PROCID environment variable will be set to reflect a global
task rank. Each spawned process will have a unique SLURM_PROCID.</p>
<p>Similarly, the SLURM_NPROCS and SLURM_NTASKS environment variables will be set
to reflect a global task count (both environment variables will have the same
value).
SLURM_NTASKS will be set to the total count of tasks in all components.
Note that the task rank and count values are needed by MPI and typically
determined by examining Slurm environment variables.</p>
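<p>A sketch of these values for a two-component allocation with one task per
node (the output lines are illustrative):</p>
<pre>
$ salloc -N1 : -N2 bash
$ srun --het-group=0,1 bash -c 'echo "task $SLURM_PROCID of $SLURM_NTASKS"'
task 0 of 3
task 1 of 3
task 2 of 3
</pre>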
<h2 id="limitations">Limitations
<a class="slurm_link" href="#limitations"></a>
</h2>
<p>The backfill scheduler has limitations in how it tracks usage of CPUs and
memory in the future.
This typically requires that the backfill scheduler be able to allocate each
component of a heterogeneous job on a different node in order to begin the
job's resource allocation, even if multiple components of the job do actually
get allocated resources on the same node.</p>
<p>In a federation of clusters, a heterogeneous job will execute entirely on
the cluster from which the job is submitted. The heterogeneous job will not
be eligible to migrate between clusters or to have different components of
the job execute on different clusters in the federation.</p>
<p>Caution must be taken when submitting heterogeneous jobs that request
multiple overlapping partitions. When the partitions share the same resources
it's possible to starve your own job by having the first job component request
enough nodes that the scheduler isn't able to fill the subsequent request(s).
Consider an example where you have partition <i>p1</i> that contains 10 nodes
and partition <i>p2</i> that exists on 5 of the same nodes. If you submit a
heterogeneous job that requests 5 nodes in <i>p1</i> and 5 nodes in <i>p2</i>,
the scheduler may try to allocate some of the nodes from the <i>p2</i>
partition for the first job component, preventing the scheduler from being
able to fulfill the second request, resulting in a job that is never able to
start.</p>
<p>Magnetic reservations cannot "attract" heterogeneous jobs - heterogeneous
jobs will only run in magnetic reservations if they explicitly request the
reservation.</p>
<p>Job arrays of heterogeneous jobs are not supported.</p>
<p>The srun command's --no-allocate option is not supported
for heterogeneous jobs.</p>
<p>Only one job step per heterogeneous job component can be launched by a
single srun command (e.g.
"srun --het-group=0 alpha : --het-group=0 beta" is not supported).</p>
<p>The sattach command can only be used to attach to a single component of
a heterogeneous job at a time.</p>
<p>License requests are only allowed on the first job component (e.g.
"sbatch -L ansys:2 : script.sh").</p>
<p>Heterogeneous jobs are only scheduled by the backfill scheduler plugin.
The more frequently executed scheduling logic only starts jobs on a first-in
first-out (FIFO) basis and lacks logic for concurrently scheduling all
components of a heterogeneous job.</p>
<p>Heterogeneous jobs are not supported by gang scheduling.</p>
<p>Slurm's Perl APIs do not support heterogeneous jobs.</p>
<p>The srun --multi-prog option can not be used to span more than one
heterogeneous job component.</p>
<p>The srun --open-mode option is by default set to "append".</p>
<p>Ancient versions of OpenMPI and their derivatives (i.e. Cray MPI) are
dependent upon communication ports being assigned to them by Slurm. Such MPI
jobs will experience step launch failure if <u>any</u> component of a
heterogeneous job step is unable to acquire the allocated ports.
Non-heterogeneous job steps will retry step launch using a new set of
communication ports (no change in Slurm behavior).</p>
<!-- NOTE: Correcting this would necessitate assigning the same set of ports
to all components of the heterogeneous job (not possible today) plus changes to
srun in order to better synchronize the step startup and error handling. -->
<h2 id="het_steps">Heterogeneous Steps
<a class="slurm_link" href="#het_steps"></a>
</h2>
<p>Slurm version 20.11 introduces the ability to request heterogeneous job
steps from within a non-homogeneous job allocation. This allows you the
flexibility to have different layouts for job steps without requiring the
use of heterogeneous jobs, where having separate jobs for the components
may be undesirable.</p>
<p>Some limitations for heterogeneous steps are that the steps must be able
to run on unique nodes. You also cannot request heterogeneous steps from within
a heterogeneous job.</p>
<p>An example scenario would be if you have a task that needs to use 1 GPU
per processor while another task needs all the available GPUs on a node with
only one processor. This can be accomplished as shown below:</p>
<pre>
$ salloc -N2 --exclusive --gpus=10
salloc: Granted job allocation 61034
$ srun -N1 -n4 --gpus=4 printenv SLURMD_NODENAME : -N1 -n1 --gpus=6 printenv SLURMD_NODENAME
node02
node01
node01
node01
node01
</pre>
<h2 id="sys_admin">System Administrator Information
<a class="slurm_link" href="#sys_admin"></a>
</h2>
<p>The job submit plugin is invoked independently for each component of a
heterogeneous job.</p>
<p>The spank_init_post_opt() function is invoked once for each component of a
heterogeneous job. This permits site defined options on a per job component
basis.</p>
<p>Scheduling of heterogeneous jobs is performed only by the sched/backfill
plugin and all heterogeneous job components are either all scheduled at the same
time or deferred. The pending reason of heterogeneous jobs isn't set until
backfill evaluation.
In order to ensure the timely initiation of both heterogeneous and
non-heterogeneous jobs, the backfill scheduler alternates between two different
modes on each iteration.
In the first mode, if a heterogeneous job component can not be initiated
immediately, its expected start time is recorded and all subsequent components
of that job will be considered for starting no earlier than the latest
component's expected start time.
In the second mode, all heterogeneous job components will be considered for
starting no earlier than the latest component's expected start time.
After completion of the second mode, all heterogeneous job expected start time
data is cleared and the first mode will be used in the next backfill scheduler
iteration.
Regular (non-heterogeneous) jobs are scheduled independently on each iteration
of the backfill scheduler.</p>
<p>For example, consider a heterogeneous job with three components.
When considered as independent jobs, the components could be initiated at times
now (component 0), now plus 2 hours (component 1), and now plus 1 hour
(component 2).
When the backfill scheduler runs in the first mode:</p>
<ol>
<li>Component 0 will be noted as possible to start now, but will not be
initiated because the additional components must also be initiated</li>
<li>Component 1 will be noted as possible to start in 2 hours</li>
<li>Component 2 will not be considered for scheduling until 2 hours in the
future, which leaves some additional resources available for scheduling to other
jobs</li>
</ol>
<p>When the backfill scheduler executes next, it will use the second mode and
(assuming no other state changes) all three job components will be considered
available for scheduling no earlier than 2 hours in the future, which may allow
other jobs to be allocated resources before heterogeneous job component 0
could be initiated.</p>
<p>The heterogeneous job start time data will be cleared before the first
mode is used in the next iteration in order to consider system status changes
which might permit the heterogeneous job to be initiated at an earlier time than
previously determined.</p>
<p>A resource limit test is performed when a heterogeneous job is submitted in
order to immediately reject jobs that will not be able to start with current
limits.
The individual components of the heterogeneous job are validated, like all
regular jobs.
The heterogeneous job as a whole is also tested, but in a more limited
fashion with respect to quality of service (QOS) limits.
This is due to the complexity of each job component having up to three sets of
limits (association, job QOS and partition QOS).
Note that successful submission of any job (heterogeneous or otherwise) does
not ensure the job will be able to start without exceeding some limit.
For example a job's CPU limit test does not consider that CPUs might not be
allocated individually, but resource allocations might be performed by whole
core, socket or node.
Each component of a heterogeneous job counts as a "job" with respect to
resource limits.</p>
<p>For example, a user might have a limit of 2 concurrent running jobs and submit
a heterogeneous job with 3 components.
Since each component counts as a job against that limit, the heterogeneous job
as a whole may never be able to start, and its continued evaluation will have
an adverse effect upon scheduling other jobs, especially other heterogeneous
jobs.</p>
<p style="text-align:center;">Last modified 04 January 2024</p>
<!--#include virtual="footer.txt"-->