| <!--#include virtual="header.txt"--> |
| |
| <H1>Scheduling Configuration Guide</H1> |
| |
| <h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2> |
| |
| <P>Slurm is designed to perform a quick and simple scheduling attempt at |
| events such as job submission or completion and configuration changes. |
| During these event-triggered scheduling attempts, at most |
| <b>default_queue_depth</b> jobs (default is 100) will be considered.</P> |
| |
| <p>At less frequent intervals, defined by <b>sched_interval</b>, the main |
| scheduling loop will run, considering all jobs while still honoring the |
| <b>partition_job_depth</b> limit.</p> |
| |
| <P>In both cases, jobs are evaluated in strict priority order. Once any |
| job or job array task in a partition is left pending, no other jobs in that |
| partition will be scheduled, so that resources are not taken from the |
| higher-priority pending job.</P> |
| |
| <P>A more comprehensive scheduling attempt is typically done by the backfill |
| scheduling plugin, which considers job run time and resources required to |
| determine if lower-priority jobs would actually take resources needed by |
| higher-priority jobs. This allows the backfill scheduler to assign more specific |
| <a href="job_reason_codes.html">reasons</a> to pending jobs, or to start jobs |
| that were previously pending.</P> |
| |
| <h2 id="config">Scheduling Configuration |
| <a class="slurm_link" href="#config"></a> |
| </h2> |
| |
| <P>The <B>SchedulerType</B> configuration parameter specifies the scheduler |
| plugin to use. |
| Options are sched/backfill, which performs backfill scheduling, and |
| sched/builtin, which attempts to schedule jobs in a strict priority order within |
| each partition/queue.</P> |
| |
| <P>There is also a <B>SchedulerParameters</B> configuration parameter which |
| can specify a wide range of parameters as described below. |
| This first set of parameters applies to all scheduling configurations. |
| See the <a href="slurm.conf.html">slurm.conf(5)</a> man page for more details. |
| </P> |
| |
| <UL> |
| <LI><B>default_queue_depth=#</B> - Specifies the number of jobs to consider for |
| scheduling on each event that may result in a job being scheduled. |
| Default value is 100 jobs. Since this happens frequently, a relatively |
| small number is generally best.</LI> |
| <LI><B>defer</B> - Do not attempt to schedule jobs individually at submit time. |
| Can be useful for high-throughput computing.</LI> |
| <LI><B>max_switch_wait=#</B> - Specifies the maximum time a job can wait for |
| its desired number of leaf switches. Default value is 300 seconds.</LI> |
| <LI><B>partition_job_depth=#</B> - Specifies how many jobs are tested in any |
| single partition. Default value is 0 (no limit).</LI> |
| <LI><B>sched_interval=#</B> - Specifies how frequently, in seconds, the main |
| scheduling loop will execute and test all pending jobs, with the |
| <b>partition_job_depth</b> limit in place. The default value is 60 seconds.</LI> |
| </UL> |
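| |
| <P>As a minimal sketch, the parameters above could be combined in |
| <i>slurm.conf</i> as shown below. Multiple <B>SchedulerParameters</B> options |
| are given as a single comma-separated list. The values shown simply restate |
| the defaults listed above, except for <b>partition_job_depth</b>, whose value |
| of 50 is an arbitrary illustration and not a recommendation.</P> |
| <pre> |
| # Illustrative fragment only; tune values for your site. |
| SchedulerType=sched/backfill |
| # partition_job_depth=50 is a hypothetical value chosen for this example. |
| SchedulerParameters=default_queue_depth=100,max_switch_wait=300,partition_job_depth=50,sched_interval=60 |
| </pre> |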
| |
| <h2 id="backfill">Backfill Scheduling |
| <a class="slurm_link" href="#backfill"></a> |
| </h2> |
| |
| <P>The backfill scheduling plugin is loaded by default. |
| Without backfill scheduling, each partition is scheduled strictly in priority |
| order, which typically results in significantly lower system utilization and |
| responsiveness than otherwise possible. |
| Backfill scheduling will start lower priority jobs if doing so does not delay |
| the expected start time of <B>any</B> higher priority jobs. |
| Since the expected start time of pending jobs depends upon the expected |
| completion time of running jobs, reasonably accurate time limits are important |
| for backfill scheduling to work well.</P> |
| |
| <P>Slurm's backfill scheduler takes into consideration every running job. |
| It then considers pending jobs in priority order, determining when and where |
| each will start, taking into consideration the possibility of |
| <a href="preempt.html">job preemption</a>, |
| <a href="gang_scheduling.html">gang scheduling</a>, |
| <a href="gres.html">generic resource (GRES) requirements</a>, |
| memory requirements, etc. |
| If the job under consideration can start immediately without impacting the |
| expected start time of any higher priority job, then it does so. |
| Otherwise the resources required by the job will be reserved during the job's |
| expected execution time. |
| The backfill plugin will set the expected start time for pending jobs and |
| place these reserved nodes into a <B>'Planned'</B> state. A job's |
| expected start time can be seen using the <b>squeue --start</b> command. |
| For performance reasons, the backfill scheduler reserves whole nodes for jobs, |
| even if jobs don't require whole nodes. |
| </P> |
| |
| <P>The scheduling logic builds a sorted list of job-partition pairs. Jobs |
| submitted to multiple partitions will have as many entries in the list as |
| requested partitions. By default, the backfill scheduler may evaluate all the |
| job-partition pairs for a single job, potentially reserving resources for each |
| pair, but only starting the job in the reservation offering the earliest start |
| time.</P> |
| |
| <P>Having a single job reserving resources for multiple partitions could impede |
| other jobs (or hetjob components) from reserving resources already reserved for |
| the partitions that don't offer the earliest start time. |
| A single job that requests multiple partitions can also prevent itself |
| from starting earlier in a lower priority partition if the partitions overlap |
| nodes and a backfill reservation in the higher priority partition blocks nodes |
| that are also in the lower priority partition.</P> |
| |
| <P>Backfill scheduling is difficult without reasonable time limit estimates |
| for jobs, but some configuration parameters can help.</P> |
| <UL> |
| <LI><B>DefaultTime</B> - Default job time limit (specify value by partition)</LI> |
| <LI><B>MaxTime</B> - Maximum job time limit (specify value by partition)</LI> |
| <LI><B>OverTimeLimit</B> - Amount by which a job can exceed its time limit |
| before it is killed. A system-wide configuration parameter.</LI> |
| </UL> |
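| |
| <P>A sketch of how these time limit parameters might appear in |
| <i>slurm.conf</i> follows. The partition name, node names and all values are |
| hypothetical and chosen only for illustration. <B>OverTimeLimit</B> is |
| expressed in minutes.</P> |
| <pre> |
| # Illustrative fragment only; names and values are assumptions. |
| # System-wide: allow jobs to exceed their time limit by 5 minutes. |
| OverTimeLimit=5 |
| # Per partition: jobs without an explicit limit get 30 minutes, |
| # and no job may request more than 2 hours. |
| PartitionName=debug Nodes=node[01-04] DefaultTime=00:30:00 MaxTime=02:00:00 State=UP |
| </pre> |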
| |
| <P>Backfill scheduling is a time consuming operation. |
| Locks are released briefly every two seconds so that other operations can be |
| processed, for example new job submission requests. |
| Backfill scheduling can optionally continue execution after the lock release |
| and ignore newly submitted jobs (<B>SchedulerParameters=bf_continue</B>). |
| Doing so will permit consideration of more jobs, but may result in the delayed |
| scheduling of newly submitted jobs. |
| A partial list of <B>SchedulerParameters</B> configuration parameters related to |
| backfill scheduling follows. |
| For more details and a complete list of the backfill related SchedulerParameters |
| see the <a href="slurm.conf.html">slurm.conf(5)</a> man page. |
| </P> |
| |
| <UL> |
| <LI><B>bf_continue</B> - If set, then continue backfill scheduling after |
| periodically releasing locks for other operations.</LI> |
| <LI><B>bf_interval=#</B> - Interval between backfill scheduling attempts. |
| Default value is 30 seconds.</LI> |
| <LI><B>bf_max_job_part=#</B> - Maximum number of jobs to initiate per partition |
| in each backfill cycle. Default value is 0 (no limit).</LI> |
| <LI><B>bf_max_job_start=#</B> - Maximum number of jobs to initiate |
| in each backfill cycle. Default value is 0 (no limit).</LI> |
| <LI><B>bf_max_job_test=#</B> - Maximum number of jobs to consider for backfill |
| scheduling in each backfill cycle. Default value is 100 jobs.</LI> |
| <LI><B>bf_max_job_user=#</B> - Maximum number of jobs to initiate per user |
| in each backfill cycle. Default value is 0 (no limit).</LI> |
| <LI><B>bf_max_time=#</B> - Maximum time in seconds the backfill scheduler can |
| spend (including time spent sleeping when locks are released) before |
| discontinuing. The default value is the value of <B>bf_interval</B>, which |
| defaults to 30 seconds.</LI> |
| <LI><B>bf_one_resv_per_job</B> - Disallow adding more than one backfill |
| reservation per job. This option makes it so that a job submitted to multiple |
| partitions will stop reserving resources once the first job-partition pair |
| has booked a backfill reservation. Subsequent pairs from the same job will |
| only be tested to start now. This allows other jobs to book the |
| other pairs' resources at the cost of not guaranteeing that the multi-partition |
| job will start in the partition offering the earliest start time (unless it |
| can start immediately). This option is disabled by default.</LI> |
| <LI><B>bf_resolution=#</B> - Time resolution of backfill scheduling. |
| Default value is 60 seconds. |
| Larger values are appropriate if job time limits are imprecise and/or if |
| small delays in starting pending jobs are acceptable in order to achieve |
| higher system utilization.</LI> |
| <LI><B>bf_window=#</B> - How long, in minutes, into the future to look when |
| determining when and where jobs can start. |
| Higher values result in more overhead and less responsiveness. |
| A value at least as long as the highest allowed time limit is generally |
| advisable to prevent job starvation. |
| In order to limit the amount of data managed by the backfill scheduler, |
| if the value of bf_window is increased, then it is generally advisable |
| to also increase <B>bf_resolution</B>. |
| The default value is 1440 minutes (one day).</LI> |
| <LI><B>bf_yield_interval=#</B> - |
| The backfill scheduler will periodically relinquish locks in order for other |
| pending operations to take place. This specifies how frequently, in |
| microseconds, the locks are relinquished. The default value is 2,000,000 microseconds |
| (2 seconds). Smaller values may be helpful for high throughput computing when |
| used in conjunction with the bf_continue option.</LI> |
| <LI><B>bf_yield_sleep=#</B> - |
| The backfill scheduler will periodically relinquish locks in order for other |
| pending operations to take place. This specifies how long, in microseconds, |
| the locks remain relinquished each time. The default value is 500,000 |
| microseconds (0.5 seconds).</LI> |
| </UL> |
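| |
| <P>The following fragment sketches one way the backfill options above might be |
| combined in <i>slurm.conf</i>. It assumes a site whose longest allowed time |
| limit is two days, so <b>bf_window</b> is raised to 2880 minutes and |
| <b>bf_resolution</b> is raised with it, as advised above. All values are |
| illustrative assumptions rather than recommendations.</P> |
| <pre> |
| # Illustrative fragment only; values are assumptions for a hypothetical site. |
| SchedulerType=sched/backfill |
| # bf_continue: keep going after each periodic lock release. |
| # bf_window=2880: look two days ahead (value is in minutes). |
| # bf_resolution=120: coarser time resolution (seconds) to offset the larger window. |
| # bf_max_job_user=10: hypothetical per-user cap on jobs started per cycle. |
| SchedulerParameters=bf_continue,bf_interval=30,bf_window=2880,bf_resolution=120,bf_max_job_user=10 |
| </pre> |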
| |
| <p style="text-align:center;">Last modified 04 June 2024</p> |
| |
| <!--#include virtual="footer.txt"--> |