 <!--#include virtual="header.txt"-->

 <H1>Scheduling Configuration Guide</H1>

 <h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>

 <P>Slurm is designed to perform a quick and simple scheduling attempt at
 events such as job submission or completion and configuration changes.
 During these event-triggered scheduling attempts, up to
 <b>default_queue_depth</b> jobs (default is 100) will be considered.</P>

 <p>At less frequent intervals, defined by <b>sched_interval</b>, the main
 scheduling loop will run, considering all jobs while still honoring the
 <b>partition_job_depth</b> limit.</p>

 <P>In both cases, jobs are evaluated in a strict priority order and once any
 job or job array task in a partition is left pending, no other jobs in that
 partition will be scheduled to avoid taking resources from the higher-priority
 pending job.</P>

 <P>A more comprehensive scheduling attempt is typically done by the backfill
 scheduling plugin, which considers job run time and resources required to
 determine if lower-priority jobs would actually take resources needed by
 higher-priority jobs. This allows the backfill scheduler to assign more specific
 <a href="job_reason_codes.html">reasons</a> to pending jobs, or to start jobs
 that were previously pending.</P>

 <h2 id="config">Scheduling Configuration
 <a class="slurm_link" href="#config"></a>
 </h2>

 <P>The <B>SchedulerType</B> configuration parameter specifies the scheduler
 plugin to use.
 Options are sched/backfill, which performs backfill scheduling, and
 sched/builtin, which attempts to schedule jobs in a strict priority order within
 each partition/queue.</P>
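
 <P>As a minimal sketch, the plugin is selected with a single line in
 slurm.conf. The value shown is one of the two options named above; which one
 suits a given site is a local decision.</P>
<pre>
# slurm.conf (excerpt): select the backfill scheduler plugin
SchedulerType=sched/backfill
</pre>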

 <P>There is also a <B>SchedulerParameters</B> configuration parameter which
 can specify a wide range of parameters as described below.
 This first set of parameters applies to all scheduling configurations.
 See the <a href="slurm.conf.html">slurm.conf(5)</a> man page for more details.
 </P>

 <UL>
 <LI><B>default_queue_depth=#</B> - Specifies the number of jobs to consider for
 scheduling on each event that may result in a job being scheduled.
 Default value is 100 jobs. Since this happens frequently, a relatively
 small number is generally best.</LI>
 <LI><B>defer</B> - Do not attempt to schedule jobs individually at submit time.
 Can be useful for high-throughput computing.</LI>
 <LI><B>max_switch_wait=#</B> - Specifies the maximum time a job can wait for
 its desired number of leaf switches. Default value is 300 seconds.</LI>
 <LI><B>partition_job_depth=#</B> - Specifies how many jobs are tested in any
 single partition. Default value is 0 (no limit).</LI>
 <LI><B>sched_interval=#</B> - Specifies how frequently, in seconds, the main
 scheduling loop will execute and test all pending jobs, with the
 <b>partition_job_depth</b> limit in place. The default value is 60 seconds.</LI>
 </UL>
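
 <P>For illustration only, the options above might be combined as follows.
 The values are arbitrary examples rather than recommendations. Multiple
 options share one comma-separated <B>SchedulerParameters</B> line.</P>
<pre>
# slurm.conf (excerpt); example values only, tune for your site
# Consider up to 200 jobs per event-triggered attempt, test at most 50 jobs
# per partition, and run the main scheduling loop every 120 seconds.
SchedulerParameters=default_queue_depth=200,partition_job_depth=50,sched_interval=120
</pre>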

 <h2 id="backfill">Backfill Scheduling
 <a class="slurm_link" href="#backfill"></a>
 </h2>

 <P>The backfill scheduling plugin is loaded by default.
 Without backfill scheduling, each partition is scheduled strictly in priority
 order, which typically results in significantly lower system utilization and
 responsiveness than otherwise possible.
 Backfill scheduling will start lower priority jobs if doing so does not delay
 the expected start time of <B>any</B> higher priority jobs.
 Since the expected start time of pending jobs depends upon the expected
 completion time of running jobs, reasonably accurate time limits are important
 for backfill scheduling to work well.</P>

 <P>Slurm's backfill scheduler takes into consideration every running job.
 It then considers pending jobs in priority order, determining when and where
 each will start, taking into consideration the possibility of
 <a href="preempt.html">job preemption</a>,
 <a href="gang_scheduling.html">gang scheduling</a>,
 <a href="gres.html">generic resource (GRES) requirements</a>,
 memory requirements, etc.
 If the job under consideration can start immediately without impacting the
 expected start time of any higher priority job, then it does so.
 Otherwise the resources required by the job will be reserved during the job's
 expected execution time.
 The backfill plugin sets the expected start time for pending jobs and places
 these reserved nodes into a <B>'Planned'</B> state. A job's
 expected start time can be seen using the <b>squeue --start</b> command.
 For performance reasons, the backfill scheduler reserves whole nodes for jobs,
 even if jobs don't require whole nodes.
 </P>

 <P>The scheduling logic builds a sorted list of job-partition pairs. Jobs
 submitted to multiple partitions will have as many entries in the list as
 requested partitions. By default, the backfill scheduler may evaluate all the
 job-partition pairs for a single job, potentially reserving resources for each
 pair, but only starting the job in the reservation offering the earliest start
 time.</P>

 <P>Having a single job reserving resources for multiple partitions could impede
 other jobs (or hetjob components) from reserving resources already reserved for
 the partitions that don't offer the earliest start time.
 A single job that requests multiple partitions can also prevent itself
 from starting earlier in a lower priority partition if the partitions overlap
 nodes and a backfill reservation in the higher priority partition blocks nodes
 that are also in the lower priority partition.</P>
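
 <P>A site where multi-partition jobs cause this kind of blocking might, as one
 possible approach, limit each job to a single backfill reservation with the
 <B>bf_one_resv_per_job</B> option described below. A minimal sketch, shown on
 its own; in practice it would be added to the site's existing comma-separated
 <B>SchedulerParameters</B> line:</P>
<pre>
# slurm.conf (excerpt): allow at most one backfill reservation per job
SchedulerParameters=bf_one_resv_per_job
</pre>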

 <P>Backfill scheduling is difficult without reasonable time limit estimates
 for jobs, but some configuration parameters can help.</P>
 <UL>
 <LI><B>DefaultTime</B> - Default job time limit (specify value by partition)</LI>
 <LI><B>MaxTime</B> - Maximum job time limit (specify value by partition)</LI>
 <LI><B>OverTimeLimit</B> - Amount by which a job can exceed its time limit
 before it is killed. A system-wide configuration parameter.</LI>
 </UL>
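
 <P>A sketch of how these limits might appear in slurm.conf. The partition
 name, node list, and values are hypothetical placeholders chosen for
 illustration.</P>
<pre>
# slurm.conf (excerpt); partition name, nodes and limits are placeholders
# System-wide: allow jobs to overrun their time limit by up to 5 minutes.
OverTimeLimit=5
# Per partition: jobs default to 1 hour and may request at most 2 days.
PartitionName=batch Nodes=node[01-16] DefaultTime=01:00:00 MaxTime=2-00:00:00 State=UP
</pre>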

 <P>Backfill scheduling is a time-consuming operation.
 Locks are released briefly every two seconds so that other operations can be
 processed, for example new job submission requests.
 Backfill scheduling can optionally continue execution after the lock release
 and ignore newly submitted jobs (<B>SchedulerParameters=bf_continue</B>).
 Doing so will permit consideration of more jobs, but may result in the delayed
 scheduling of newly submitted jobs.
 A partial list of <B>SchedulerParameters</B> configuration parameters related to
 backfill scheduling follows.
 For more details and a complete list of the backfill-related
 SchedulerParameters, see the <a href="slurm.conf.html">slurm.conf(5)</a> man page.
 </P>

 <UL>
 <LI><B>bf_continue</B> - If set, then continue backfill scheduling after
 periodically releasing locks for other operations.</LI>
 <LI><B>bf_interval=#</B> - Interval between backfill scheduling attempts.
 Default value is 30 seconds.</LI>
 <LI><B>bf_max_job_part=#</B> - Maximum number of jobs to initiate per partition
 in each backfill cycle. Default value is 0 (no limit).</LI>
 <LI><B>bf_max_job_start=#</B> - Maximum number of jobs to initiate
 in each backfill cycle. Default value is 0 (no limit).</LI>
 <LI><B>bf_max_job_test=#</B> - Maximum number of jobs to consider for backfill
 scheduling in each backfill cycle. Default value is 100 jobs.</LI>
 <LI><B>bf_max_job_user=#</B> - Maximum number of jobs to initiate per user
 in each backfill cycle. Default value is 0 (no limit).</LI>
 <LI><B>bf_max_time=#</B> - Maximum time in seconds the backfill scheduler can
 spend (including time spent sleeping when locks are released) before
 discontinuing. The default value is the value of <B>bf_interval</B>, which
 defaults to 30 seconds.</LI>
 <LI><B>bf_one_resv_per_job</B> - Disallow adding more than one backfill
 reservation per job. This option makes it so that a job submitted to multiple
 partitions will stop reserving resources once the first job-partition pair
 has booked a backfill reservation. Subsequent pairs from the same job will
 only be tested to start now. This allows other jobs to book the
 other pairs' resources, at the cost of not guaranteeing that the multi-partition
 job will start in the partition offering the earliest start time (unless it
 can start immediately). This option is disabled by default.</LI>
 <LI><B>bf_resolution=#</B> - Time resolution of backfill scheduling.
 Default value is 60 seconds.
 Larger values are appropriate if job time limits are imprecise and/or
 small delays in starting pending jobs are acceptable in order to achieve
 higher system utilization.</LI>
 <LI><B>bf_window=#</B> - How long, in minutes, into the future to look when
 determining when and where jobs can start.
 Higher values result in more overhead and less responsiveness.
 A value at least as long as the highest allowed time limit is generally
 advisable to prevent job starvation.
 In order to limit the amount of data managed by the backfill scheduler,
 if the value of bf_window is increased, then it is generally advisable
 to also increase <B>bf_resolution</B>.
 The default value is 1440 minutes (one day).</LI>
 <LI><B>bf_yield_interval=#</B> -
 The backfill scheduler will periodically relinquish locks in order for other
 pending operations to take place. This specifies how frequently, in
 microseconds, the locks are relinquished. The default value is 2,000,000
 microseconds (2 seconds). Smaller values may be helpful for high-throughput
 computing when used in conjunction with the bf_continue option.</LI>
 <LI><B>bf_yield_sleep=#</B> -
 The backfill scheduler will periodically relinquish locks in order for other
 pending operations to take place. This specifies how long, in microseconds,
 the locks are relinquished each time. The default value is 500,000
 microseconds (0.5 seconds).</LI>
 </UL>
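
 <P>As an illustration, the fragment below sketches a backfill configuration
 for a site whose longest allowed time limit is assumed to be seven days.
 bf_window is set to 10080 minutes to cover that limit, and bf_resolution is
 raised in proportion (7 times 60 = 420 seconds) so that the ratio of window to
 resolution matches the defaults. The proportional scaling is a simple
 heuristic used for this example, not a documented rule, and all values are
 examples only.</P>
<pre>
# slurm.conf (excerpt); example values only, tune for your site
# bf_window is in minutes; bf_interval and bf_resolution are in seconds.
SchedulerParameters=bf_continue,bf_interval=60,bf_window=10080,bf_resolution=420,bf_max_job_test=500
</pre>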

 <p style="text-align:center;">Last modified 04 June 2024</p>

 <!--#include virtual="footer.txt"-->