| <!--#include virtual="header.txt"--> |
| |
| <h1>Heterogeneous Job Support</h1> |
| |
| <ul> |
| <li><a href="#overview">Overview</a></li> |
| <li><a href="#submitting">Submitting Jobs</a></li> |
| <li><a href="#burst_buffer">Burst Buffers</a></li> |
| <li> |
| <a href="#managing">Managing Jobs</a> |
| <ul><li><a href="#accounting">Accounting</a></li></ul> |
| </li> |
| <li><a href="#job_steps">Launching Applications (Job Steps)</a></li> |
| <li><a href="#env_var">Environment Variables</a></li> |
| <li><a href="#examples">Examples</a></li> |
| <li><a href="#limitations">Limitations</a></li> |
| <li><a href="#het_steps">Heterogeneous Steps</a></li> |
| <li><a href="#sys_admin">System Administrator Information</a></li> |
| </ul> |
| |
| <h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2> |
| |
| <p>Slurm version 17.11 and later supports the ability to submit and manage |
| heterogeneous jobs, in which each component has virtually all job options |
| available including partition, account and QOS (Quality Of Service). |
| For example, part of a job might require four cores and 4 GB for each of 128 |
| tasks while another part of the job would require 16 GB of memory and one CPU.</p> |
| |
| <h2 id="submitting">Submitting Jobs |
| <a class="slurm_link" href="#submitting"></a> |
| </h2> |
| |
| <p>The <i>salloc</i>, <i>sbatch</i> and <i>srun</i> commands can all be used |
| to submit heterogeneous jobs. |
| Resource specifications for each component of the heterogeneous job should be |
separated with the ":" character.
| For example:</p> |
| <pre> |
$ sbatch --cpus-per-task=4 --mem-per-cpu=1g --ntasks=128 : \
         --cpus-per-task=1 --mem-per-cpu=16g --ntasks=1 my.bash
| </pre> |
| |
<p>Options specified for one component of a heterogeneous job (or job step) will
be used for subsequent components to the extent that this is expected to be helpful.
Propagated options can be reset as desired for each component (e.g. a different
account name could be specified for each hetjob component).
For example, <i>--immediate</i> and <i>--job-name</i> are propagated, while
<i>--ntasks</i> and <i>--mem-per-cpu</i> are reset to default values for each
component.
A list of propagated options follows, with an example after the list.</p>
| <ul> |
| <li>--account</li> |
| <li>--acctg-freq</li> |
| <li>--begin</li> |
| <li>--cluster-constraint</li> |
| <li>--clusters</li> |
| <li>--comment</li> |
| <li>--deadline</li> |
| <li>--delay-boot</li> |
| <li>--dependency</li> |
| <li>--distribution</li> |
| <li>--epilog (option available only in srun)</li> |
| <li>--error</li> |
| <li>--export</li> |
| <li>--export-file</li> |
| <li>--exclude</li> |
| <li>--get-user-env</li> |
| <li>--gid</li> |
| <li>--hold</li> |
| <li>--ignore-pbs</li> |
| <li>--immediate</li> |
| <li>--input</li> |
| <li>--job-name</li> |
| <li>--kill-on-bad-exit (option available only in srun)</li> |
| <li>--label (option available only in srun)</li> |
| <li>--mcs-label</li> |
| <li>--mem</li> |
| <li>--msg-timeout (option available only in srun)</li> |
| <li>--no-allocate (option available only in srun)</li> |
| <li>--no-requeue</li> |
| <li>--nice</li> |
| <li>--no-kill</li> |
| <li>--open-mode (option available only in srun)</li> |
| <li>--output</li> |
| <li>--parsable</li> |
| <li>--priority</li> |
| <li>--profile</li> |
| <li>--propagate</li> |
| <li>--prolog (option available only in srun)</li> |
| <li>--pty (option available only in srun)</li> |
| <li>--qos</li> |
| <li>--quiet</li> |
| <li>--quit-on-interrupt (option available only in srun)</li> |
| <li>--reboot</li> |
| <li>--reservation</li> |
| <li>--requeue</li> |
| <li>--signal</li> |
| <li>--slurmd-debug (option available only in srun)</li> |
| <li>--task-epilog (option available only in srun)</li> |
| <li>--task-prolog (option available only in srun)</li> |
| <li>--time</li> |
| <li>--test-only</li> |
| <li>--time-min</li> |
| <li>--uid</li> |
| <li>--unbuffered (option available only in srun)</li> |
| <li>--verbose</li> |
| <li>--wait</li> |
| <li>--wait-all-nodes</li> |
| <li>--wckey</li> |
| <li>--workdir</li> |
| </ul> |
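
<p>Below is a minimal sketch of this behavior (the account and partition names
are assumptions): the propagated <i>--account</i> option applies to both
components, while <i>--ntasks</i> and <i>--partition</i> are given separately
for each component.</p>
<pre>
$ sbatch --account=physics --partition=cpu --ntasks=128 : \
         --partition=bigmem --ntasks=1 my.bash
</pre>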
| |
| <p>The task distribution specification applies separately within each job |
| component. Consider for example a heterogeneous job with each component being |
| allocated 4 CPUs on 2 nodes. In our example, job component zero is allocated |
| 2 CPUs on node "nid00001" and 2 CPUs on node "nid00002". Job component one is |
| allocated 2 CPUs on node "nid00003" and 2 CPUs on node "nid00004". A task |
| distribution of "cyclic" will distribute the first 4 tasks in a cyclic fashion |
| on nodes "nid00001" and "nid00002", then distribute the next 4 tasks in a cyclic |
| fashion on nodes "nid00003" and "nid00004" as shown below.</p> |
| |
| <table width="100%" border=1 cellspacing=4 cellpadding=4> |
| <tr><td align=center>Node nid00001</td><td align=center>Node nid00002</td><td align=center>Node nid00003</td><td align=center>Node nid00004</td></tr> |
| <tr><td align=center nowrap>Rank 0</td><td align=center>Rank 1</td><td align=center>Rank 4</td><td align=center>Rank 5</td></tr> |
| <tr><td align=center nowrap>Rank 2</td><td align=center>Rank 3</td><td align=center>Rank 6</td><td align=center>Rank 7</td></tr> |
| </table> |
| <br> |
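
<p>A request producing the layout above might look like the following sketch
(node counts and the application name are assumptions); note that
<i>--distribution</i> is propagated to the second component.</p>
<pre>
$ salloc -N2 -n4 : -N2 -n4 bash
$ srun --distribution=cyclic --het-group=0,1 ./app
</pre>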
| |
| <p>Some options should be specified only in the first hetjob component. |
| For example, specifying a batch job output file in the second hetjob component's |
| options will result in the first hetjob component (where the batch script |
| executes) using the default output file name.</p> |
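
<p>For example, the following sketch (the file name is an assumption) places
the <i>--output</i> option with the first hetjob component:</p>
<pre>
$ sbatch --output=result.out -N1 : -N2 my.bash
</pre>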
| |
| <p>Environment variables used to specify default options for the job submit |
| command will be applied to every component of the heterogeneous job |
| (e.g. <i>SBATCH_ACCOUNT</i>).</p> |
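
<p>A brief sketch (the account name is an assumption) in which the default
account applies to both components of the job:</p>
<pre>
$ SBATCH_ACCOUNT=physics sbatch -N1 : -N2 my.bash
</pre>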
| |
<p>Batch job options can be included in the submitted script for multiple
heterogeneous job components. Each component should be separated by a line
containing the directive "#SBATCH hetjob" as shown below.</p>
| <pre> |
| $ cat new.bash |
| #!/bin/bash |
| #SBATCH --cpus-per-task=4 --mem-per-cpu=16g --ntasks=1 |
| #SBATCH hetjob |
| #SBATCH --cpus-per-task=2 --mem-per-cpu=1g --ntasks=8 |
| |
| srun run.app |
| |
| $ sbatch new.bash |
| </pre> |
| |
<p>This is equivalent to the following:</p>
| |
| <pre> |
| $ cat my.bash |
| #!/bin/bash |
| srun run.app |
| |
| $ sbatch --cpus-per-task=4 --mem-per-cpu=16g --ntasks=1 : \ |
| --cpus-per-task=2 --mem-per-cpu=1g --ntasks=8 my.bash |
| </pre> |
| |
| <p>The batch script will be executed in the first node in the first component |
| of the heterogeneous job. For the above example, that will be the job component |
| with 1 task, 4 CPUs and 64 GB of memory (16 GB for each of the 4 CPUs).</p> |
| |
| <p>If a heterogeneous job is submitted to run in multiple clusters <u>not</u> |
| part of a federation (e.g. "sbatch --cluster=alpha,beta ...") then the entire |
| job will be sent to the cluster expected to be able to start all components |
| at the earliest time.</p> |
| |
| <p>A resource limit test is performed when a heterogeneous job is submitted in |
| order to immediately reject jobs that will not be able to start with current |
| limits. |
| The individual components of the heterogeneous job are validated, like all |
| regular jobs. |
| The heterogeneous job as a whole is also tested, but in a more limited |
| fashion with respect to quality of service (QOS) limits. |
| Each component of a heterogeneous job counts as a "job" with respect to |
| resource limits.</p> |
| |
| <h2 id="burst_buffer">Burst Buffers |
| <a class="slurm_link" href="#burst_buffer"></a> |
| </h2> |
| |
| <p>A burst buffer can either be persistent or linked to a specific job ID. |
| Since a heterogeneous job consists of multiple job IDs, a job-specific burst |
| buffer will be associated with only one heterogeneous job component. |
| Each component can have its own burst buffer directives, and they are processed |
| separately. Only a persistent burst buffer can be accessed by all components |
| of a heterogeneous job. Persistent burst buffers are only available in the |
| datawarp plugin. A sample batch script demonstrating this for the datawarp |
| plugin is appended.</p> |
| |
| <pre> |
| #!/bin/bash |
| #SBATCH --nodes=1 --constraint=haswell |
| #BB create_persistent name=alpha capacity=10 access=striped type=scratch |
| #DW persistentdw name=alpha |
| #SBATCH hetjob |
| #SBATCH --nodes=16 --constraint=knl |
| #DW persistentdw name=alpha |
| ... |
| </pre> |
| |
<p><b>NOTE</b>: Cray's DataWarp interface directly reads the job script, but
has no knowledge of Slurm's "hetjob" directive, so Slurm internally rebuilds
the script for each job component so that only that job component's burst buffer
directives are included in that script. The batch script for the first job
component will be modified to replace the burst buffer directives of other
job components with "#EXCLUDED directive", where directive is "DW" or "BB"
for the datawarp plugin and is the
<a href="burst_buffer.html#submit_lua">configured</a> value for the lua plugin.
This prevents their interpretation by the Cray infrastructure and aids
administrators in writing an interface for the lua plugin.
| Since the batch script will only be executed by the first job |
| component, the subsequent job components will not include commands from the |
| original script. These scripts are built and managed by Slurm for internal |
| purposes (and visible from various Slurm commands) from a user script as shown |
| above. An example is shown below:</p> |
| |
| <pre> |
| <b>Rebuilt script for first job component</b> |
| |
| #!/bin/bash |
| #SBATCH --nodes=1 --constraint=haswell |
| #BB create_persistent name=alpha capacity=10 access=striped type=scratch |
| #DW persistentdw name=alpha |
| #SBATCH hetjob |
| #SBATCH --nodes=16 --constraint=knl |
| #EXCLUDED DW persistentdw name=alpha |
| ... |
| |
| |
| <b>Rebuilt script for second job component</b> |
| |
| #!/bin/bash |
| #SBATCH --nodes=16 --constraint=knl |
| #DW persistentdw name=alpha |
| exit 0 |
| </pre> |
| |
| <h2 id="managing">Managing Jobs<a class="slurm_link" href="#managing"></a></h2> |
| |
| <p>Information maintained in Slurm for a heterogeneous job includes:</p> |
| <ul> |
| <li><i>job_id</i>: Each component of a heterogeneous job will have its own |
| unique <i>job_id</i>.</li> |
| <li><i>het_job_id</i>: This identification number applies to all components |
| of the heterogeneous job. All components of the same job will have the same |
| <i>het_job_id</i> value and it will be equal to the <i>job_id</i> of the |
| first component. We refer to this as the "heterogeneous job leader".</li> |
| <li><i>het_job_id_set</i>: Regular expression identifying all <i>job_id</i> |
| values associated with the job.</li> |
| <li><i>het_job_offset</i>: A unique sequence number applied to each component |
| of the heterogeneous job. The first component will have a <i>het_job_offset</i> |
| value of 0, the next a value of 1, etc.</li> |
| </ul> |
| |
| <table width="100%" border=1 cellspacing=0 cellpadding=4> |
| <tr> |
| <th width="25%"><b>job_id</b></th> |
| <th width="25%"><b>het_job_id</b></th> |
| <th width="25%"><b>het_job_offset</b></th> |
| <th width="25%"><b>het_job_id_set</b></th> |
| </tr> |
| |
| <tr><td>123</td><td>123</td><td>0</td><td>123-127</td></tr> |
| <tr><td>124</td><td>123</td><td>1</td><td>123-127</td></tr> |
| <tr><td>125</td><td>123</td><td>2</td><td>123-127</td></tr> |
| <tr><td>126</td><td>123</td><td>3</td><td>123-127</td></tr> |
| <tr><td>127</td><td>123</td><td>4</td><td>123-127</td></tr> |
| </table> |
| <p>Table 1: Example job IDs</p> |
| |
<p>The <i>squeue</i> and <i>sview</i> commands report the
components of a heterogeneous job using the format
"&lt;het_job_id&gt;+&lt;het_job_offset&gt;".
For example "123+4" would represent heterogeneous job id 123 and its fifth
component (note: the first component has a <i>het_job_offset</i> value of 0).</p>
| |
<p>A request for a specific job ID that identifies the ID of the first component
| of a heterogeneous job (i.e. the "heterogeneous job leader") will return |
| information about all components of that job. For example:</p> |
| <pre> |
| $ squeue --job=93 |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 93+0 debug bash adam R 18:18 1 nid00001 |
| 93+1 debug bash adam R 18:18 1 nid00011 |
| 93+2 debug bash adam R 18:18 1 nid00021 |
| </pre> |
| |
| <p>A request to cancel or otherwise signal a heterogeneous job leader will be applied to |
| all components of that heterogeneous job. A request to cancel a specific component of |
| the heterogeneous job using the "#+#" notation will apply only to that specific component. |
| For example:</p> |
| <pre> |
| $ squeue --job=93 |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 93+0 debug bash adam R 19:18 1 nid00001 |
| 93+1 debug bash adam R 19:18 1 nid00011 |
| 93+2 debug bash adam R 19:18 1 nid00021 |
| $ scancel 93+1 |
| $ squeue --job=93 |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 93+0 debug bash adam R 19:38 1 nid00001 |
| 93+2 debug bash adam R 19:38 1 nid00021 |
| $ scancel 93 |
| $ squeue --job=93 |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| </pre> |
| |
| <p>While a heterogeneous job is in pending state, only the entire job can be |
| cancelled rather than its individual components. |
| A request to cancel an individual component of a heterogeneous job in |
| pending state will return an error. |
| After the job has begun execution, the individual component can be cancelled.</p> |
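
<p>For example (job ID assumed), while heterogeneous job 123 is still pending:</p>
<pre>
$ scancel 123+1    # rejected while the job is pending
$ scancel 123      # cancels the entire pending heterogeneous job
</pre>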
| |
| <p>Email notification for job state changes (the <i>--mail-type</i> option) |
| is only supported for a heterogeneous job leader. Requests for email |
| notifications for other components of a heterogeneous job will be silently |
| ignored.</p> |
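
<p>For example, the notification options would be given with the first (leader)
component's options (the e-mail address is an assumption):</p>
<pre>
$ sbatch --mail-type=END --mail-user=adam@example.com -N1 : -N2 my.bash
</pre>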
| |
| <p>Requests to modify an individual component of a job using the scontrol |
| command must specify the job ID with the "#+#" notation. |
| A request to modify a job by specifying the het_job_id will modify all |
| components of a heterogeneous job. |
| For example:</p> |
| <pre> |
| # Change the account of component 2 of heterogeneous job 123: |
| $ scontrol update jobid=123+2 account=abc |
| |
| # Change the time limit of all components of heterogeneous job 123: |
| $ scontrol update jobid=123 timelimit=60 |
| </pre> |
| |
<p>The following operations can only be requested for a heterogeneous job
leader and will be applied to all components of that heterogeneous job.
Requests to operate on individual components of the heterogeneous job will
return an error. An example follows the list below.</p>
| <ul> |
| <li>requeue</li> |
| <li>resume</li> |
| <li>suspend</li> |
| </ul> |
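
<p>For example, assuming heterogeneous job 123, these operations are requested
against the job leader and affect every component:</p>
<pre>
$ scontrol suspend 123
$ scontrol resume 123
$ scontrol requeue 123
</pre>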
| |
| <p>The sbcast command supports heterogeneous job allocations. By default, |
| sbcast will copy files to all nodes in the job allocation. The -j/--jobid |
| option can be used to copy files to individual components as shown below.</p> |
| <pre> |
| $ sbcast --jobid=123 data /tmp/data |
| $ sbcast --jobid=123.0 app0 /tmp/app0 |
| $ sbcast --jobid=123.1 app1 /tmp/app1 |
| </pre> |
| |
<p>The srun command's --bcast option will transfer files to the nodes associated
with the application to be launched, as specified by the --het-group option.</p>
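
<p>For example, the following sketch (the program name and destination path are
assumptions) copies and launches a program only on the nodes of het group 1:</p>
<pre>
$ srun --het-group=1 --bcast=/tmp/app1 app1
</pre>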
| |
<p>Slurm has a configuration option to control the behavior of some commands with
respect to heterogeneous jobs.
By default, a request to cancel, hold or release a job ID that is not the
het_job_id, but that of a job component, will operate on only that one component
of the heterogeneous job.
If the SchedulerParameters configuration parameter includes the option
"whole_hetjob", then the operation applies to all components of the job when
any job component is specified. In the example below, the
scancel command will cancel all components of job 93 if
SchedulerParameters=whole_hetjob is configured; otherwise only job 93+1 will be
cancelled. If a specific heterogeneous job component is specified (e.g. "scancel
93+1"), then only that one component will be affected.</p>
| <pre> |
| $ squeue --job=93 |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 93+0 debug bash adam R 19:18 1 nid00001 |
| 93+1 debug bash adam R 19:18 1 nid00011 |
| 93+2 debug bash adam R 19:18 1 nid00021 |
| $ scancel 94 (where job ID 94 is equivalent to 93+1) |
| # Cancel 93+0, 93+1 and 93+2 if SchedulerParameters includes "whole_hetjob" |
| # Cancel only 93+1 if SchedulerParameters does not include "whole_hetjob" |
| </pre> |
| |
| <h3 id="accounting">Accounting<a class="slurm_link" href="#accounting"></a></h3> |
| |
<p>Slurm's accounting database records the het_job_id and het_job_offset
fields.
The sacct command reports jobs using the format
"&lt;het_job_id&gt;+&lt;het_job_offset&gt;" and can accept a job ID
specification for filtering using the same format.
If a het_job_id value is specified as a job filter, then information about
all components of that job will be reported by default, as shown below.
The <i>--whole-hetjob=[yes|no]</i> option can be used to force reporting of
information about all the components of that job or only about the specific
component requested, regardless of whether the job filter includes the
het_job_id (leader) or not.
</p>
| |
| <pre> |
| $ sacct -j 67767 |
| JobID JobName Partition Account AllocCPUS State ExitCode |
| ------- ------- --------- ------- --------- --------- -------- |
| 67767+0 foo debug test 2 COMPLETED 0:0 |
| 67767+1 foo debug test 4 COMPLETED 0:0 |
| |
| $ sacct -j 67767+1 |
| JobID JobName Partition Account AllocCPUS State ExitCode |
| ------- ------- --------- ------- --------- --------- -------- |
| 67767+1 foo debug test 4 COMPLETED 0:0 |
| |
| $ sacct -j 67767 --whole-hetjob=no |
| JobID JobName Partition Account AllocCPUS State ExitCode |
| ------- ------- --------- ------- --------- --------- -------- |
67767+0 foo debug test 2 COMPLETED 0:0
| |
| $ sacct -j 67767+1 --whole-hetjob=yes |
| JobID JobName Partition Account AllocCPUS State ExitCode |
| ------- ------- --------- ------- --------- --------- -------- |
| 67767+0 foo debug test 2 COMPLETED 0:0 |
| 67767+1 foo debug test 4 COMPLETED 0:0 |
| </pre> |
| |
| <h2 id="job_steps">Launching Applications (Job Steps) |
| <a class="slurm_link" href="#job_steps"></a> |
| </h2> |
| |
| <p>The srun command is used to launch applications. |
| By default, the application is launched only on the first component of a |
| heterogeneous job, but options are available to support different behaviors.</p> |
| |
| <p>srun's "--het-group" option defines which hetjob component(s) are to have |
| applications launched for them. The --het-group option takes an expression |
| defining which component(s) are to launch an application for an individual |
| execution of the srun command. The expression can contain one or more component |
index values in a comma-separated list. Ranges of index values can be specified
using a hyphen separator. By default, an application is launched only on
| component number zero. Some examples follow:</p> |
| <ul> |
| <li>--het-group=2</li> |
| <li>--het-group=0,4</li> |
| <li>--het-group=1,3-5</li> |
| </ul> |
| |
| <p><b>IMPORTANT:</b> The ability to execute a single application across more |
| than one job allocation does not work with all MPI implementations or Slurm MPI |
| plugins. Slurm's ability to execute such an application can be disabled on the |
| entire cluster by adding "disable_hetjob_steps" to Slurm's SchedulerParameters |
| configuration parameter.</p> |
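
<p>A slurm.conf sketch disabling such job steps cluster-wide (the other
SchedulerParameters value shown is only a placeholder):</p>
<pre>
SchedulerParameters=bf_continue,disable_hetjob_steps
</pre>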
| |
| <p><b>IMPORTANT:</b> While the srun command can be used to launch heterogeneous |
| job steps, mpirun would require substantial modification to support |
| heterogeneous applications. We are aware of no such mpirun development efforts |
| at this time.</p> |
| |
| <p>By default, the applications launched by a single execution of the srun |
| command (even for different components of the heterogeneous job) are combined |
| into one MPI_COMM_WORLD with non-overlapping task IDs.</p> |
| |
<p>As with the salloc and sbatch commands, the ":" character is used to
separate multiple components of a heterogeneous job, which also makes it
possible to execute a different application with different arguments for each
job component.
This convention means that the stand-alone ":" character can not be used as an
argument to an application launched by srun.
If some heterogeneous job component lacks an application specification, the next
application specification provided will be used for earlier components lacking
one, as shown below.</p>
| <pre> |
| $ srun --label -n2 : -n1 hostname |
| 0: nid00012 |
| 1: nid00012 |
| 2: nid00013 |
| </pre> |
| |
<p>If multiple srun commands are executed concurrently, this may result in resource
contention (e.g. memory limits preventing some job step components from being
allocated resources because two srun commands are executing at the same time).
If the srun --het-group option is used to create multiple job steps (for the
different components of a heterogeneous job), those job steps will be created
sequentially.
When multiple srun commands execute at the same time, this may result in some
step allocations taking place while others are delayed.
Only after all job step allocations have been granted will the application
be launched.</p>
| |
| <p>All components of a job step will have the same step ID value. |
| If job steps are launched on subsets of the job components there may be gaps in |
| the step ID values for individual job components.</p> |
| <pre> |
$ salloc -n1 : -n2 bash
| salloc: Pending job allocation 1721 |
| salloc: Granted job allocation 1721 |
| $ srun --het-group=0,1 true # Launches steps 1721.0 and 1722.0 |
| $ srun --het-group=0 true # Launches step 1721.1, no 1722.1 |
| $ srun --het-group=0,1 true # Launches steps 1721.2 and 1722.2 |
| </pre> |
| |
<p>The maximum het-group index specified for a job step (either explicitly
specified or implied by the ":" separator) must not exceed the index of the
last component in the heterogeneous job allocation. For example:</p>
| <pre> |
| $ salloc -n1 -C alpha : -n2 -C beta bash |
| salloc: Pending job allocation 1728 |
| salloc: Granted job allocation 1728 |
| $ srun --het-group=0,1 hostname |
| nid00001 |
| nid00008 |
| nid00008 |
| $ srun hostname : date : id |
| error: Attempt to run a job step with het-group value of 2, |
| but the job allocation has maximum value of 1 |
| </pre> |
| |
| <h2 id="env_var">Environment Variables |
| <a class="slurm_link" href="#env_var"></a> |
| </h2> |
| |
| <p>Slurm environment variables will be set independently for each component of |
| the job by appending "_HET_GROUP_" and a sequence number to the usual name. |
| In addition, the "SLURM_JOB_ID" environment variable will contain the job ID |
| of the heterogeneous job leader and "SLURM_HET_SIZE" will contain the number of |
| components in the job. Note that if using srun with a single specific |
| het group (for instance --het-group=1) "SLURM_JOB_ID" will contain the job |
| ID of the heterogeneous job leader. The job ID for a specific heterogeneous |
| component is set in "SLURM_JOB_ID_HET_GROUP_<component_id>". For example: |
| </p> |
| <pre> |
| $ salloc -N1 : -N2 bash |
| salloc: Pending job allocation 11741 |
| salloc: job 11741 queued and waiting for resources |
| salloc: job 11741 has been allocated resources |
| $ env | grep SLURM |
| SLURM_JOB_ID=11741 |
| SLURM_HET_SIZE=2 |
| SLURM_JOB_ID_HET_GROUP_0=11741 |
| SLURM_JOB_ID_HET_GROUP_1=11742 |
| SLURM_JOB_NODES_HET_GROUP_0=1 |
| SLURM_JOB_NODES_HET_GROUP_1=2 |
| SLURM_JOB_NODELIST_HET_GROUP_0=nid00001 |
| SLURM_JOB_NODELIST_HET_GROUP_1=nid[00011-00012] |
| ... |
| $ srun --het-group=1 printenv SLURM_JOB_ID |
| 11741 |
| 11741 |
| $ srun --het-group=0 printenv SLURM_JOB_ID |
| 11741 |
| $ srun --het-group=1 printenv SLURM_JOB_ID_HET_GROUP_1 |
| 11742 |
| 11742 |
| $ srun --het-group=0 printenv SLURM_JOB_ID_HET_GROUP_0 |
| 11741 |
| </pre> |
| |
| <p>The various MPI implementations rely heavily upon Slurm environment variables |
| for proper operation. |
| A single MPI application executing in a single MPI_COMM_WORLD requires a |
| uniform set of environment variables that reflect a single job allocation. |
| The example below shows how Slurm sets environment variables for MPI.</p> |
| <pre> |
| $ salloc -N1 : -N2 bash |
salloc: Pending job allocation 11751
| salloc: job 11751 queued and waiting for resources |
| salloc: job 11751 has been allocated resources |
| $ env | grep SLURM |
| SLURM_JOB_ID=11751 |
| SLURM_HET_SIZE=2 |
| SLURM_JOB_ID_HET_GROUP_0=11751 |
| SLURM_JOB_ID_HET_GROUP_1=11752 |
| SLURM_JOB_NODELIST_HET_GROUP_0=nid00001 |
| SLURM_JOB_NODELIST_HET_GROUP_1=nid[00011-00012] |
| ... |
| $ srun --het-group=0,1 env | grep SLURM |
| SLURM_JOB_ID=11751 |
| SLURM_JOB_NODELIST=nid[00001,00011-00012] |
| ... |
| </pre> |
| |
| <h2 id="examples">Examples<a class="slurm_link" href="#examples"></a></h2> |
| |
| <p>Create a heterogeneous resource allocation containing one node with 256GB |
| of memory and a feature of "haswell" plus 2176 cores on 32 nodes with a |
| feature of "knl". Then launch a program called "server" on the "haswell" node |
| and "client" on the "knl" nodes. Each application will be in its own |
| MPI_COMM_WORLD.</p> |
| <pre> |
| salloc -N1 --mem=256GB -C haswell : \ |
| -n2176 -N32 --ntasks-per-core=1 -C knl bash |
| srun server & |
| srun --het-group=1 client & |
| wait |
| </pre> |
| |
| <p>This variation of the above example launches programs "server" and "client" |
| in a single MPI_COMM_WORLD.</p> |
| <pre> |
| salloc -N1 --mem=256GB -C haswell : \ |
| -n2176 -N32 --ntasks-per-core=1 -C knl bash |
| srun server : client |
| </pre> |
| |
| <p>The SLURM_PROCID environment variable will be set to reflect a global |
| task rank. Each spawned process will have a unique SLURM_PROCID.</p> |
| |
| <p>Similarly, the SLURM_NPROCS and SLURM_NTASKS environment variables will be set |
| to reflect a global task count (both environment variables will have the same |
| value). |
| SLURM_NTASKS will be set to the total count of tasks in all components. |
| Note that the task rank and count values are needed by MPI and typically |
| determined by examining Slurm environment variables.</p> |
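
<p>A minimal sketch, assuming a small two-component allocation, showing the
global rank and total task count seen by each task:</p>
<pre>
$ salloc -n2 : -n3 bash
$ srun --het-group=0,1 printenv SLURM_PROCID | sort -n
0
1
2
3
4
$ srun --het-group=0,1 printenv SLURM_NTASKS | uniq
5
</pre>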
| |
| <h2 id="limitations">Limitations |
| <a class="slurm_link" href="#limitations"></a> |
| </h2> |
| |
| <p>The backfill scheduler has limitations in how it tracks usage of CPUs and |
| memory in the future. |
| This typically requires the backfill scheduler be able to allocate each |
| component of a heterogeneous job on a different node in order to begin its |
| resource allocation, even if multiple components of the job do actually get |
| allocated resources on the same node.</p> |
| |
| <p>In a federation of clusters, a heterogeneous job will execute entirely on |
| the cluster from which the job is submitted. The heterogeneous job will not |
| be eligible to migrate between clusters or to have different components of |
| the job execute on different clusters in the federation.</p> |
| |
| <p>Caution must be taken when submitting heterogeneous jobs that request |
| multiple overlapping partitions. When the partitions share the same resources |
| it's possible to starve your own job by having the first job component request |
| enough nodes that the scheduler isn't able to fill the subsequent request(s). |
| Consider an example where you have partition <i>p1</i> that contains 10 nodes |
| and partition <i>p2</i> that exists on 5 of the same nodes. If you submit a |
| heterogeneous job that requests 5 nodes in <i>p1</i> and 5 nodes in <i>p2</i>, |
| the scheduler may try to allocate some of the nodes from the <i>p2</i> |
| partition for the first job component, preventing the scheduler from being |
| able to fulfill the second request, resulting in a job that is never able to |
| start.</p> |
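
<p>A sketch of the problematic request described above (the partition names are
assumptions):</p>
<pre>
$ sbatch --partition=p1 -N5 : --partition=p2 -N5 my.bash
</pre>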
| |
| <p>Magnetic reservations cannot "attract" heterogeneous jobs - heterogeneous |
| jobs will only run in magnetic reservations if they explicitly request the |
| reservation.</p> |
| |
| <p>Job arrays of heterogeneous jobs are not supported.</p> |
| |
| <p>The srun command's --no-allocate option is not supported |
| for heterogeneous jobs.</p> |
| |
| <p>Only one job step per heterogeneous job component can be launched by a |
| single srun command (e.g. |
| "srun --het-group=0 alpha : --het-group=0 beta" is not supported).</p> |
| |
| <p>The sattach command can only be used to attach to a single component of |
| a heterogeneous job at a time.</p> |
| |
<p>License requests are only allowed on the first job component
(e.g.
"sbatch -L ansys:2 : script.sh").</p>
| |
| <p>Heterogeneous jobs are only scheduled by the backfill scheduler plugin. |
| The more frequently executed scheduling logic only starts jobs on a first-in |
| first-out (FIFO) basis and lacks logic for concurrently scheduling all |
| components of a heterogeneous job.</p> |
| |
<p>Heterogeneous jobs are not supported with gang scheduling.</p>
| |
| <p>Slurm's Perl APIs do not support heterogeneous jobs.</p> |
| |
| <p>The srun --multi-prog option can not be used to span more than one |
| heterogeneous job component.</p> |
| |
| <p>The srun --open-mode option is by default set to "append".</p> |
| |
| <p>Ancient versions of OpenMPI and their derivatives (i.e. Cray MPI) are |
| dependent upon communication ports being assigned to them by Slurm. Such MPI |
| jobs will experience step launch failure if <u>any</u> component of a |
| heterogeneous job step is unable to acquire the allocated ports. |
| Non-heterogeneous job steps will retry step launch using a new set of |
| communication ports (no change in Slurm behavior).</p> |
| <!-- NOTE: Correcting this would necessitate assigning the same set of ports |
| to all components of the heterogeneous job (not possible today) plus changes to |
| srun in order to better synchronize the step startup and error handling. --> |
| |
| <h2 id="het_steps">Heterogeneous Steps |
| <a class="slurm_link" href="#het_steps"></a> |
| </h2> |
| |
<p>Slurm version 20.11 introduces the ability to request heterogeneous job
steps from within a non-heterogeneous job allocation. This allows you the
| flexibility to have different layouts for job steps without requiring the |
| use of heterogeneous jobs, where having separate jobs for the components |
| may be undesirable.</p> |
| |
| <p>Some limitations for heterogeneous steps are that the steps must be able |
| to run on unique nodes. You also cannot request heterogeneous steps from within |
| a heterogeneous job.</p> |
| |
| <p>An example scenario would be if you have a task that needs to use 1 GPU |
| per processor while another task needs all the available GPUs on a node with |
only one processor. This can be accomplished like this:</p>
| |
| <pre> |
| $ salloc -N2 --exclusive --gpus=10 |
| salloc: Granted job allocation 61034 |
| $ srun -N1 -n4 --gpus=4 printenv SLURMD_NODENAME : -N1 -n1 --gpus=6 printenv SLURMD_NODENAME |
| node02 |
| node01 |
| node01 |
| node01 |
| node01 |
| </pre> |
| |
| <h2 id="sys_admin">System Administrator Information |
| <a class="slurm_link" href="#sys_admin"></a> |
| </h2> |
| |
| <p>The job submit plugin is invoked independently for each component of a |
| heterogeneous job.</p> |
| |
<p>The spank_init_post_opt() function is invoked once for each component of a
heterogeneous job. This permits site-defined options on a per-job-component
basis.</p>
| |
| <p>Scheduling of heterogeneous jobs is performed only by the sched/backfill |
| plugin and all heterogeneous job components are either all scheduled at the same |
| time or deferred. The pending reason of heterogeneous jobs isn't set until |
| backfill evaluation. |
| In order to ensure the timely initiation of both heterogeneous and |
| non-heterogeneous jobs, the backfill scheduler alternates between two different |
| modes on each iteration. |
| In the first mode, if a heterogeneous job component can not be initiated |
| immediately, its expected start time is recorded and all subsequent components |
| of that job will be considered for starting no earlier than the latest |
| component's expected start time. |
| In the second mode, all heterogeneous job components will be considered for |
| starting no earlier than the latest component's expected start time. |
| After completion of the second mode, all heterogeneous job expected start time |
| data is cleared and the first mode will be used in the next backfill scheduler |
| iteration. |
Regular (non-heterogeneous) jobs are scheduled independently on each iteration
| of the backfill scheduler.</p> |
| |
<p> For example, consider a heterogeneous job with three components.
When considered as independent jobs, the components could be initiated at times
now (component 0), now plus 2 hours (component 1), and now plus 1 hour
(component 2).
When the backfill scheduler runs in the first mode:</p>
<ol>
<li>Component 0 will be noted as possible to start now, but will not be initiated
due to the additional components still to be initiated</li>
<li>Component 1 will be noted as possible to start in 2 hours</li>
<li>Component 2 will not be considered for scheduling until 2 hours in the
future, which leaves some additional resources available for scheduling to other
jobs</li>
</ol>
| |
| <p>When the backfill scheduler executes next, it will use the second mode and |
| (assuming no other state changes) all three job components will be considered |
| available for scheduling no earlier than 2 hours in the future, which may allow |
| other jobs to be allocated resources before heterogeneous job component 0 |
| could be initiated.</p> |
| |
| <p>The heterogeneous job start time data will be cleared before the first |
| mode is used in the next iteration in order to consider system status changes |
which might permit the heterogeneous job to be initiated at an earlier time than
| previously determined.</p> |
| |
| <p>A resource limit test is performed when a heterogeneous job is submitted in |
| order to immediately reject jobs that will not be able to start with current |
| limits. |
| The individual components of the heterogeneous job are validated, like all |
| regular jobs. |
| The heterogeneous job as a whole is also tested, but in a more limited |
| fashion with respect to quality of service (QOS) limits. |
| This is due to the complexity of each job component having up to three sets of |
| limits (association, job QOS and partition QOS). |
| Note that successful submission of any job (heterogeneous or otherwise) does |
| not ensure the job will be able to start without exceeding some limit. |
| For example a job's CPU limit test does not consider that CPUs might not be |
| allocated individually, but resource allocations might be performed by whole |
| core, socket or node. |
| Each component of a heterogeneous job counts as a "job" with respect to |
| resource limits.</p> |
| |
<p>For example, a user might have a limit of 2 concurrent running jobs and submit
a heterogeneous job with 3 components.
Because each component counts as a job, the submission would be accepted but
could never start under that limit.
Such a situation will have an adverse effect upon scheduling other jobs,
especially other heterogeneous jobs.</p>
| |
| |
| <p style="text-align:center;">Last modified 04 January 2024</p> |
| |
| <!--#include virtual="footer.txt"--> |