| <!--#include virtual="header.txt"--> |
| |
| <h1>Resource Limits</h1> |
| |
| <p>Familiarity with Slurm's <a href="accounting.html">Accounting</a> web page |
| is strongly recommended before use of this document.</p> |
| |
| <h2 id="hierarchy">Hierarchy<a class="slurm_link" href="#hierarchy"></a></h2> |
| |
| <p>Slurm's hierarchical limits are enforced in the following order |
| with Job QOS and Partition QOS order being reversible by using the QOS |
| flag 'OverPartQOS':</p> |
| <ol> |
| <li>Partition QOS limit</li> |
| <li>Job QOS limit</li> |
| <li>User association</li> |
| <li>Account association(s), ascending the hierarchy</li> |
| <li>Root/Cluster association</li> |
| <li>Partition limit</li> |
| <li>None</li> |
| </ol> |
| |
| <p>Note: If limits are defined at multiple points in this hierarchy, |
| the point in this list where the limit is first defined will be used. |
| Consider the following example:</p> |
| <ul> |
| <li>MaxJobs=20 and MaxSubmitJobs is undefined in the partition QOS</li> |
| <li>No limits are set in the job QOS and</li> |
| <li>MaxJobs=4 and MaxSubmitJobs=50 in the user association</li> |
| </ul> |
| <p>The limits in effect will be MaxJobs=20 and MaxSubmitJobs=50.</p> |
| |
| <p>Note: The precedence order above applies to all limits except |
| Max[Time|Wall] and [Min|Max]Nodes. For these limits, a job can't exceed |
| the bounds set at the Partition level, even though the Partition limit is |
| listed near the bottom of the order and even if a QOS or association limit |
| would allow more. By default, these limits are therefore bounded by the |
| Partition values. To ignore the Partition bound, set the corresponding |
| QOS flag (PartitionTimeLimit, PartitionMaxNodes or PartitionMinNodes); the |
| job is then held to the limits imposed at the QOS and/or association level, |
| following the order above. |
| <b>Grp*</b> limits are also an exception. A more restrictive limit at the |
| Account level will be enforced before a less restrictive limit at the User |
| level. This is due to the nature of the limit being enforced, requiring that |
| the limit at the highest level not be exceeded. |
| </p> |
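| |
| <p>As a minimal sketch of the QOS flags mentioned above, the following |
| commands set them on two QOS. The QOS names <i>long</i> and <i>high</i> are |
| hypothetical and assumed to exist already:</p> |
| <pre> |
| # Let jobs in QOS "long" follow QOS/association time limits instead of the partition MaxTime |
| sacctmgr modify qos long set Flags=PartitionTimeLimit |
| # Apply the limits of job QOS "high" before those of the partition QOS |
| sacctmgr modify qos high set Flags=OverPartQOS |
| </pre> |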
| |
| <h2 id="config">Configuration<a class="slurm_link" href="#config"></a></h2> |
| |
| <p>Scheduling policy information must be stored in a database |
| as specified by the <b>AccountingStorageType</b> configuration parameter |
| in the <b>slurm.conf</b> configuration file. |
| Information can be recorded in a <a href="http://www.mysql.com/">MySQL</a> or |
| <a href="https://mariadb.org/">MariaDB</a> database. |
| For security and performance reasons, the use of |
| SlurmDBD (Slurm Database Daemon) as a front-end to the |
| database is strongly recommended. |
| SlurmDBD uses a Slurm authentication plugin (e.g. MUNGE). |
| SlurmDBD also uses an existing Slurm accounting storage plugin |
| to maximize code reuse. |
| SlurmDBD uses data caching and prioritization of pending requests |
| in order to optimize performance. |
| While SlurmDBD relies upon existing Slurm plugins for authentication |
| and database use, the other Slurm commands and daemons are not required |
| on the host where SlurmDBD is installed. |
| Only the <i>slurmdbd</i> and <i>slurm-plugins</i> RPMs are required |
| for SlurmDBD execution.</p> |
| |
| <p>Both accounting and scheduling policies are configured based upon |
| an <i>association</i>. An <i>association</i> is a 4-tuple consisting |
| of the cluster name, bank account, user and (optionally) the Slurm |
| partition. |
| In order to enforce scheduling policy, set the value of |
| <b>AccountingStorageEnforce</b>. |
| This option takes a comma-separated list of the policies to |
| enforce. The valid options are:</p> |
| <ul> |
| <li>associations - This will prevent users from running jobs if |
| their <i>association</i> is not in the database. This option will |
| prevent users from accessing invalid accounts. |
| </li> |
| <li>limits - This will enforce limits set to associations. By setting |
| this option, the 'associations' option is also set. |
| </li> |
| <li>qos - This will require all jobs to specify (either overtly or by |
| default) a valid qos (Quality of Service). QOS values are defined for |
| each association in the database. By setting this option, the |
| 'associations' option is also set. |
| </li> |
| <li>safe - When a job uses an association or qos that has a TRES-minutes |
| limit set, this will ensure the job is only launched if it will be able |
| to run to completion within that limit. Without this option set, jobs will be |
| launched as long as their usage hasn't reached the TRES-minutes limit, |
| which can lead to jobs being launched but then killed when the limit is |
| reached. |
| With the 'safe' option set, a job won't be killed due to limits, |
| even if the limits are changed after the job was started and the |
| association or qos violates the updated limits. |
| By setting this option, both the 'associations' option and the |
| 'limits' option are set automatically. |
| </li> |
| <li>wckeys - This will prevent users from running jobs under a wckey |
| that they don't have access to. By using this option, the |
| 'associations' option is also set. The 'TrackWCKey' option is also |
| set to true. |
| </li> |
| </ul> |
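| |
| <p>For illustration, a minimal slurm.conf fragment that enables enforcement |
| of associations, limits and QOS might look like the following. The |
| combination of options is only an example; choose the ones that match |
| your site policy:</p> |
| <pre> |
| # Store accounting data through SlurmDBD, as recommended above |
| AccountingStorageType=accounting_storage/slurmdbd |
| # 'limits' and 'qos' each imply 'associations'; it is listed here for clarity |
| AccountingStorageEnforce=associations,limits,qos |
| </pre> |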
| |
| <p><b>NOTE</b>: The association is a combination of cluster, account, |
| user names and optional partition name. |
| <br> |
| Without AccountingStorageEnforce being set (the default behavior) |
| jobs will be executed based upon policies configured in Slurm on each |
| cluster. |
| </p> |
| |
| <h2 id="tools">Tools<a class="slurm_link" href="#tools"></a></h2> |
| |
| <p>The tool used to manage accounting policy is <i>sacctmgr</i>. |
| It can be used to create and delete cluster, user, bank account, |
| and partition records plus their combined <i>association</i> record. |
| See <i>man sacctmgr</i> for details on this tool and examples of |
| its use.</p> |
| |
| <p>Changes made to the scheduling policy are uploaded to |
| the Slurm control daemons on the various clusters and take effect |
| immediately. When an association is deleted, all running or pending |
| jobs which belong to that association are immediately canceled. |
| When limits are lowered, running jobs will not be canceled to |
| satisfy the new limits, but the new lower limits will be enforced.</p> |
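| |
| <p>A short sketch of building an association with <i>sacctmgr</i> follows. |
| The cluster, account and user names are hypothetical:</p> |
| <pre> |
| # Register the cluster, then an account on it, then a user in that account |
| sacctmgr add cluster snowflake |
| sacctmgr add account science Cluster=snowflake Description="science accounts" Organization=science |
| sacctmgr add user bob Account=science |
| # Display the resulting association hierarchy and a few of its limits |
| sacctmgr list associations cluster=snowflake format=Account,Cluster,User,MaxJobs,MaxSubmitJobs,GrpTRES tree |
| </pre> |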
| |
| <h2 id="limits">Association specific limits and scheduling policies |
| <a class="slurm_link" href="#limits"></a> |
| </h2> |
| <p>These represent the limits and scheduling policies relevant to Associations. |
| When dealing with Associations, most of these limits are available |
| not only for the user association, but also for each cluster and account. |
| Limits and policies are applied in the following order: |
| <br> |
| 1. The option specified for the user association. |
| <br> |
| 2. The option specified for the account. |
| <br> |
| 3. The option specified for the cluster. |
| <br> |
| 4. If nothing is configured at the above levels, no limit will be applied. |
| </p> |
| |
| <p>These are just the limits and policies for Associations. For a more |
| complete description of the columns available to be displayed, see the |
| <a href="sacctmgr.html#SECTION_LIST/SHOW-ASSOCIATION-FORMAT-OPTIONS"> |
| sacctmgr</a> man page.</p> |
| |
| <dl> |
| <dt id="assoc_fairshare"><b>Fairshare</b> |
| <a class="slurm_link" href="#assoc_fairshare"></a></dt> |
| <dd>Integer value used for determining priority. |
| Essentially this is the amount of claim this association and its |
| children have to the system. It can also be the string "parent". |
| When used on a user, this means that the parent association is used |
| for fairshare. If Fairshare=parent is set on an account, that |
| account's children will be effectively re-parented for fairshare |
| calculations to the first parent of their parent that is not |
| Fairshare=parent. Limits remain the same; only the fairshare value |
| is affected. |
| </dd> |
| |
| <dt id="assoc_grpjobs"><b>GrpJobs</b> |
| <a class="slurm_link" href="#assoc_grpjobs"></a></dt> |
| <dd>The total number of jobs able to run at any given |
| time from an association and its children. If |
| this limit is reached, new jobs will be queued but only allowed to |
| run after previous jobs complete from this group. |
| </dd> |
| |
| <dt id="assoc_grpjobsaccrue"><b>GrpJobsAccrue</b> |
| <a class="slurm_link" href="#assoc_grpjobsaccrue"></a></dt> |
| <dd>The total number of pending jobs able to accrue age |
| priority at any given time from an association and its children. If |
| this limit is reached, new jobs will be queued but not accrue age priority |
| until after previous jobs are removed from pending in this group. |
| This limit does not determine if the job can run or not, it only limits the |
| age factor of the priority. |
| </dd> |
| |
| <dt id="assoc_grpsubmitjobs"><b>GrpSubmitJobs</b> |
| <a class="slurm_link" href="#assoc_grpsubmitjobs"></a></dt> |
| <dd>The total number of jobs able to be submitted |
| to the system at any given time from an association and its children. |
| If this limit is reached, new submission requests will be |
| denied until previous jobs complete from this group. |
| </dd> |
| |
| <dt id="assoc_grptres"><b>GrpTRES</b> |
| <a class="slurm_link" href="#assoc_grptres"></a></dt> |
| <dd>The total count of TRES able to be used at any given |
| time from jobs running from an association and its children. If |
| this limit is reached, new jobs will be queued but only allowed to |
| run after resources have been relinquished from this group. |
| </dd> |
| |
| <dt id="assoc_grptresmins"><b>GrpTRESMins</b> |
| <a class="slurm_link" href="#assoc_grptresmins"></a></dt> |
| <dd>The total number of TRES minutes that can |
| possibly be used by past, present and future jobs |
| running from an association and its children. If any limit is reached, |
| all running jobs with that TRES in this group will be killed, and no new |
| jobs will be allowed to run. This usage is decayed (at a rate of |
| PriorityDecayHalfLife). It can also be reset (according to |
| PriorityUsageResetPeriod) in order to allow jobs to run against the |
| association tree. |
| This limit only applies when using the Priority Multifactor plugin. |
| </dd> |
| |
| <dt id="assoc_grptresrunmins"><b>GrpTRESRunMins</b> |
| <a class="slurm_link" href="#assoc_grptresrunmins"></a></dt> |
| <dd>Used to limit the combined total number of TRES |
| minutes used by all jobs running with an association and its |
| children. This takes into consideration time limit of |
| running jobs and consumes it. If the limit is reached, no new jobs |
| are started until other jobs finish to allow time to free up. |
| </dd> |
| |
| <dt id="assoc_grpwall"><b>GrpWall</b> |
| <a class="slurm_link" href="#assoc_grpwall"></a></dt> |
| <dd>The maximum wall clock time running jobs are able |
| to be allocated in aggregate for an association and its children. |
| If this limit is reached, future jobs in this association will be |
| queued until they are able to run inside the limit. |
| This usage is decayed (at a rate of |
| PriorityDecayHalfLife). It can also be reset (according to |
| PriorityUsageResetPeriod) in order to allow jobs to run against the |
| association tree again. |
| </dd> |
| |
| <dt id="assoc_maxjobs"><b>MaxJobs</b> |
| <a class="slurm_link" href="#assoc_maxjobs"></a></dt> |
| <dd>The total number of jobs able to run at any given |
| time for the given association. If this limit is reached, new jobs will |
| be queued but only allowed to run after existing jobs in the association |
| complete. |
| </dd> |
| |
| <dt id="assoc_maxjobsaccrue"><b>MaxJobsAccrue</b> |
| <a class="slurm_link" href="#assoc_maxjobsaccrue"></a></dt> |
| <dd>The maximum number of pending jobs able to accrue age |
| priority at any given time for the given association. If this limit is |
| reached, new jobs will be queued but will not accrue age priority |
| until after existing jobs in the association are moved from a pending state. |
| This limit does not determine if the job can run, it only limits the |
| age factor of the priority. |
| </dd> |
| |
| <dt id="assoc_maxsubmitjobs"><b>MaxSubmitJobs</b> |
| <a class="slurm_link" href="#assoc_maxsubmitjobs"></a></dt> |
| <dd>The maximum number of jobs able to be submitted |
| to the system at any given time from the given association. If |
| this limit is reached, new submission requests will be denied until |
| existing jobs in this association complete. |
| </dd> |
| |
| <dt id="assoc_maxtresminsperjob"><b>MaxTRESMinsPerJob</b> |
| <a class="slurm_link" href="#assoc_maxtresminsperjob"></a></dt> |
| <dd>A limit of TRES minutes to be used by a job. |
| If this limit is reached, the job will be killed if not running in |
| Safe mode, otherwise the job will pend until enough time is given to |
| complete the job. |
| </dd> |
| |
| <dt id="assoc_maxtresperjob"><b>MaxTRESPerJob</b> |
| <a class="slurm_link" href="#assoc_maxtresperjob"></a></dt> |
| <dd>The maximum size in TRES any given job can |
| have from the association. |
| </dd> |
| |
| <dt id="assoc_maxtrespernode"><b>MaxTRESPerNode</b> |
| <a class="slurm_link" href="#assoc_maxtrespernode"></a></dt> |
| <dd>The maximum size in TRES each node in a job |
| allocation can use. |
| </dd> |
| |
| <!-- For future use |
| <li><b>MaxTRESRunMinsPerJob=</b> A limit of TRES minutes to be used by jobs |
| running from the given association/QOS. If this limit is |
| reached the job will be killed will be allowed to run. |
| </li> |
| --> |
| |
| <dt id="assoc_maxwalldurationperjob"><b>MaxWallDurationPerJob</b> |
| <a class="slurm_link" href="#assoc_maxwalldurationperjob"></a></dt> |
| <dd>The maximum wall clock time any individual job |
| can run for in the given association. If this limit is reached, |
| the job will be denied at submission. |
| </dd> |
| |
| <dt id="assoc_minpriothreshold"><b>MinPrioThreshold</b> |
| <a class="slurm_link" href="#assoc_minpriothreshold"></a></dt> |
| <dd>Minimum priority required to reserve resources |
| in the given association. Used to override bf_min_prio_reserve. |
| See <a href="slurm.conf.html#OPT_bf_min_prio_reserve=#"> |
| bf_min_prio_reserve</a> for details. |
| </dd> |
| |
| <dt id="assoc_qos"><b>QOS</b> |
| <a class="slurm_link" href="#assoc_qos"></a></dt> |
| <dd>Comma-separated list of QOSs an association is |
| able to run. |
| </dd> |
| </dl> |
| |
| <p><b>NOTE</b>: When modifying a TRES field with <i>sacctmgr</i>, one must |
| specify which TRES to modify (see <a href="tres.html">TRES</a> for complete |
| list) as in the following examples: </p> |
| <pre> |
| SET: |
| sacctmgr modify user bob set GrpTRES=cpu=1500,mem=200,gres/gpu=50 |
| UNSET: |
| sacctmgr modify user bob set GrpTRES=cpu=-1,mem=-1,gres/gpu=-1 |
| </pre> |
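| |
| <p>Limits that are not TRES-based are set the same way, without a TRES name. |
| In this sketch the user <i>bob</i>, the account <i>science</i> and all |
| values are hypothetical:</p> |
| <pre> |
| # Per-association limits on a user |
| sacctmgr modify user bob set MaxJobs=4 MaxSubmitJobs=50 |
| # Aggregate limit on an account and all of its children |
| sacctmgr modify account science set GrpJobs=100 |
| # Clear a limit by setting it to -1 |
| sacctmgr modify user bob set MaxJobs=-1 |
| </pre> |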
| |
| |
| <h2 id="qos">QOS specific limits and scheduling policies |
| <a class="slurm_link" href="#qos"></a> |
| </h2> |
| |
| <p>As noted <a href="#hierarchy">above</a>, the default behavior is that |
| a limit set on a Partition QOS will be applied before a limit on the job's |
| requested QOS. You can change this behavior with the <i>OverPartQOS</i> |
| flag.</p> |
| |
| <p>Unless noted, if a job request breaches a given limit |
| on its own, the job will pend unless the job's QOS has the DenyOnLimit |
| flag set, which will cause the job to be denied at submission. When |
| Grp limits are considered with respect to this flag the Grp limit |
| is treated as a Max limit.</p> |
| |
| <dl> |
| <dt id="qos_gracetime"><b>GraceTime</b> |
| <a class="slurm_link" href="#qos_gracetime"></a></dt> |
| <dd>Preemption grace time to be extended to a job which |
| has been selected for preemption in the format of |
| &lt;hh&gt;:&lt;mm&gt;:&lt;ss&gt;. The default value is zero, |
| meaning no preemption grace time is allowed on this QOS. This value |
| is only meaningful for QOS PreemptMode=CANCEL and PreemptMode=REQUEUE. |
| </dd> |
| |
| <dt id="qos_grpjobs"><b>GrpJobs</b> |
| <a class="slurm_link" href="#qos_grpjobs"></a></dt> |
| <dd>The total number of jobs able to run at any given time |
| from a QOS. If this limit is reached, new jobs will be queued but only |
| allowed to run after previous jobs complete from this group. |
| </dd> |
| |
| <dt id="qos_grpjobsaccrue"><b>GrpJobsAccrue</b> |
| <a class="slurm_link" href="#qos_grpjobsaccrue"></a></dt> |
| <dd>The total number of pending jobs able to accrue age priority at any |
| given time from a QOS. If this limit is reached, new jobs will be queued but |
| will not accrue age based priority until after previous jobs are removed |
| from pending in this group. This limit does not determine if the job can |
| run or not, it only limits the age factor of the priority. This limit only |
| applies to the job's QOS and not the partition's QOS. |
| </dd> |
| |
| <dt id="qos_grpsubmitjobs"><b>GrpSubmitJobs</b> |
| <a class="slurm_link" href="#qos_grpsubmitjobs"></a></dt> |
| <dd>The total number of jobs able to be submitted to the system at any |
| given time from a QOS. If this limit is reached, new submission requests |
| will be denied until previous jobs complete from this group. |
| </dd> |
| |
| <dt id="qos_grptres"><b>GrpTRES</b> |
| <a class="slurm_link" href="#qos_grptres"></a></dt> |
| <dd>The total count of TRES able to be used at any given time from jobs |
| running from a QOS. If this limit is reached, new jobs will be queued but |
| only allowed to run after resources have been relinquished from this group. |
| </dd> |
| |
| <dt id="qos_grptresmins"><b>GrpTRESMins</b> |
| <a class="slurm_link" href="#qos_grptresmins"></a></dt> |
| <dd>The total number of TRES minutes that can possibly be used by past, |
| present and future jobs running from a QOS. If any limit is reached, |
| all running jobs with that TRES in this group will be killed, and no new |
| jobs will be allowed to run. This usage is decayed (at a rate of |
| PriorityDecayHalfLife). It can also be reset (according to |
| PriorityUsageResetPeriod) in order to allow jobs to run against the |
| QOS again. QOS that have the NoDecay flag set do not decay GrpTRESMins, |
| see <a href="qos.html#qos_other">QOS Options</a> for details. |
| This limit only applies when using the Priority Multifactor plugin. |
| </dd> |
| |
| <dt id="qos_grptresrunmins"><b>GrpTRESRunMins</b> |
| <a class="slurm_link" href="#qos_grptresrunmins"></a></dt> |
| <dd>Used to limit the combined total number of TRES |
| minutes used by all jobs running with a QOS. This takes into |
| consideration the time limit of running jobs and consumes it. |
| If the limit is reached, no new jobs are started until other jobs |
| finish to allow time to free up. |
| </dd> |
| |
| <dt id="qos_grpwall"><b>GrpWall</b> |
| <a class="slurm_link" href="#qos_grpwall"></a></dt> |
| <dd>The maximum wall clock time running jobs are able |
| to be allocated in aggregate for a QOS. If this limit is reached, |
| future jobs in this QOS will be queued until they are able to run |
| inside the limit. This usage is decayed (at a rate of |
| PriorityDecayHalfLife). It can also be reset (according to |
| PriorityUsageResetPeriod) in order to allow jobs to run against the |
| QOS again. QOS that have the NoDecay flag set do not decay GrpWall. |
| See <a href="qos.html#qos_other">QOS Options</a> for details. |
| </dd> |
| |
| <dt id="qos_limitfactor"><b>LimitFactor</b> |
| <a class="slurm_link" href="#qos_limitfactor"></a></dt> |
| <dd>A float that is factored into an association's [Grp|Max]TRES limits. |
| For example, if the LimitFactor is 2, then an association with a GrpTRES of |
| 30 CPUs would be allowed to allocate 60 CPUs when running under this QOS. |
| |
| <b>NOTE</b>: This factor is only applied to associations running in this |
| QOS and is not applied to any limits in the QOS itself. |
| </dd> |
| |
| <dt id="qos_maxjobsaccruepa"><b>MaxJobsAccruePerAccount</b> |
| <a class="slurm_link" href="#qos_maxjobsaccruepa"></a></dt> |
| <dd>The maximum number of pending jobs an |
| account (or sub-account) can have accruing age priority at any given time. |
| This limit does not determine if the job can run, it only limits the |
| age factor of the priority. |
| </dd> |
| |
| <dt id="qos_maxjobsaccruepu"><b>MaxJobsAccruePerUser</b> |
| <a class="slurm_link" href="#qos_maxjobsaccruepu"></a></dt> |
| <dd>The maximum number of pending jobs a |
| user can have accruing age priority at any given time. |
| This limit does not determine if the job can run, it only limits the |
| age factor of the priority. |
| </dd> |
| |
| <dt id="qos_maxjobspa"><b>MaxJobsPerAccount</b> |
| <a class="slurm_link" href="#qos_maxjobspa"></a></dt> |
| <dd>The maximum number of jobs an account (or sub-account) can have running at |
| a given time. |
| </dd> |
| |
| <dt id="qos_maxjobspu"><b>MaxJobsPerUser</b> |
| <a class="slurm_link" href="#qos_maxjobspu"></a></dt> |
| <dd>The maximum number of jobs a user can |
| have running at a given time. |
| </dd> |
| |
| <dt id="qos_maxsubmitjobspa"><b>MaxSubmitJobsPerAccount</b> |
| <a class="slurm_link" href="#qos_maxsubmitjobspa"></a></dt> |
| <dd>The maximum number of jobs an account (or sub-account) can have running and |
| pending at a given time. |
| </dd> |
| |
| <dt id="qos_maxsubmitjobspu"><b>MaxSubmitJobsPerUser</b> |
| <a class="slurm_link" href="#qos_maxsubmitjobspu"></a></dt> |
| <dd>The maximum number of jobs a user can |
| have running and pending at a given time. |
| </dd> |
| |
| <dt id="qos_maxtresminsperjob"><b>MaxTRESMinsPerJob</b> |
| <a class="slurm_link" href="#qos_maxtresminsperjob"></a></dt> |
| <dd>Maximum number of TRES minutes each job is able to use. |
| </dd> |
| |
| <dt id="qos_maxtrespa"><b>MaxTRESPerAccount</b> |
| <a class="slurm_link" href="#qos_maxtrespa"></a></dt> |
| <dd>The maximum number of TRES an account can |
| allocate at a given time. |
| </dd> |
| |
| <dt id="qos_maxtrespj"><b>MaxTRESPerJob</b> |
| <a class="slurm_link" href="#qos_maxtrespj"></a></dt> |
| <dd>The maximum number of TRES each job is able to use. |
| </dd> |
| |
| <dt id="qos_maxtrespn"><b>MaxTRESPerNode</b> |
| <a class="slurm_link" href="#qos_maxtrespn"></a></dt> |
| <dd>The maximum number of TRES each node in a job allocation can use. |
| </dd> |
| |
| <dt id="qos_maxtrespu"><b>MaxTRESPerUser</b> |
| <a class="slurm_link" href="#qos_maxtrespu"></a></dt> |
| <dd>The maximum number of TRES a user can |
| allocate at a given time. |
| </dd> |
| |
| <dt id="qos_maxwalldurationpj"><b>MaxWallDurationPerJob</b> |
| <a class="slurm_link" href="#qos_maxwalldurationpj"></a></dt> |
| <dd>Maximum wall clock time each job is able to use. Format is &lt;min&gt; |
| or &lt;min&gt;:&lt;sec&gt; or &lt;hr&gt;:&lt;min&gt;:&lt;sec&gt; or |
| &lt;days&gt;-&lt;hr&gt;:&lt;min&gt;:&lt;sec&gt; or &lt;days&gt;-&lt;hr&gt;. |
| The value is recorded in minutes with rounding as needed. |
| </dd> |
| |
| <dt id="qos_minpriothreshold"><b>MinPrioThreshold</b> |
| <a class="slurm_link" href="#qos_minpriothreshold"></a></dt> |
| <dd>Minimum priority required to reserve resources when scheduling. |
| </dd> |
| |
| <dt id="qos_mintresperjob"><b>MinTRESPerJob</b> |
| <a class="slurm_link" href="#qos_mintresperjob"></a></dt> |
| <dd>The minimum size in TRES any given job can |
| have when using the requested QOS. |
| </dd> |
| |
| <dt id="qos_usagefactor"><b>UsageFactor</b> |
| <a class="slurm_link" href="#qos_usagefactor"></a></dt> |
| <dd>A float that is factored into a job's TRES usage (e.g. RawUsage, |
| TRESMins, TRESRunMins). For example, if the UsageFactor is 2, every |
| TRESBillingUnit second a job runs counts as 2. If the UsageFactor |
| is 0.5, every second counts as only half a second. |
| A setting of 0 adds no timed usage from the job. |
| |
| The usage factor only applies to the job's QOS and not the partition QOS. |
| <br> |
| If the UsageFactorSafe flag is set and AccountingStorageEnforce includes |
| <i>Safe</i>, jobs will only be able to run if the job can run to completion |
| with the UsageFactor applied, and won't be killed due to limits. |
| <br> |
| If the UsageFactorSafe flag is not set and AccountingStorageEnforce includes |
| <i>Safe</i>, a job will be able to be scheduled without the UsageFactor |
| applied and won't be killed due to limits. |
| <br> |
| If the UsageFactorSafe flag is not set and AccountingStorageEnforce does |
| not include <i>Safe</i>, a job will be scheduled as long as the limits are |
| not reached, but could be killed due to limits. |
| <br> |
| See <a href="slurm.conf.html#OPT_AccountingStorageEnforce"> |
| AccountingStorageEnforce</a> in the slurm.conf man page. |
| </dd> |
| </dl> |
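| |
| <p>As a sketch of how several of the limits above might be combined, the |
| following creates a QOS, sets limits on it and grants a user access. |
| The QOS name <i>normal</i>, the user <i>bob</i> and all values are |
| hypothetical:</p> |
| <pre> |
| sacctmgr add qos normal |
| # At most 10 running jobs and 64 CPUs per user, 2 days of wall time per job |
| sacctmgr modify qos normal set MaxJobsPerUser=10 MaxTRESPerUser=cpu=64 MaxWall=2-00:00:00 |
| # Deny at submission any job whose request breaches a limit on its own |
| sacctmgr modify qos normal set Flags=DenyOnLimit |
| # Allow the user's association to run in this QOS |
| sacctmgr modify user bob set QOS+=normal |
| </pre> |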
| |
| |
| <p>The <b>MaxNodes</b> and <b>MaxTime</b> options already exist in |
| Slurm's configuration on a per-partition basis, but the above options |
| provide the ability to impose limits on a per-user basis. The |
| <b>MaxJobs</b> option provides an entirely new mechanism for Slurm to |
| control the workload any individual may place on a cluster in order to |
| achieve some balance between users.</p> |
| |
| <p>When assigning limits to a QOS to use for a Partition QOS, |
| keep in mind that those limits are enforced at the QOS level, not |
| individually for each partition. For example, if a QOS has a |
| <b>GrpTRES=cpu=20</b> limit defined and the QOS is assigned to two |
| unique partitions, users will be limited to 20 CPUs for the QOS |
| rather than being allowed 20 CPUs for each partition.</p> |
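| |
| <p>The situation described above could be set up roughly as follows. The QOS |
| name <i>partqos</i>, the partition names and the node lists are hypothetical; |
| the slurm.conf lines are shown as comments to keep the sketch in one place:</p> |
| <pre> |
| # Create the QOS that holds the shared limit |
| sacctmgr add qos partqos |
| sacctmgr modify qos partqos set GrpTRES=cpu=20 |
| # slurm.conf: both partitions reference the same QOS, so jobs in the two |
| # partitions together are limited to 20 CPUs |
| #   PartitionName=short Nodes=node[01-04] QOS=partqos |
| #   PartitionName=long  Nodes=node[05-08] QOS=partqos |
| </pre> |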
| |
| <p>Fair-share scheduling is based upon the hierarchical bank account |
| data maintained in the Slurm database. More information can be found |
| in the <a href="priority_multifactor.html">priority/multifactor</a> |
| plugin description.</p> |
| |
| <h3 id="gres_limits">Specific limits over GRES |
| <a class="slurm_link" href="#gres_limits"></a> |
| </h3> |
| <p>When a GRES has a type associated with it and a limit is applied |
| to this specific type (e.g. <i>MaxTRESPerUser=gres/gpu:tesla=1</i>), the |
| type's limit will not be enforced if a user requests a generic GRES. In this |
| situation an additional Lua job submit plugin that checks the user request may |
| be useful. For example, if one requests <i>--gres=gpu:2</i> with a |
| limit of <i>MaxTRESPerUser=gres/gpu:tesla=1</i> in place, the limit won't be |
| enforced, so it will still be possible to get two teslas. |
| </p> |
| <p> |
| This is due to a design limitation. The only way to enforce such a limit |
| is to combine the specification of the limit with a job submit plugin that |
| forces the user to always request a specific type model. |
| </p> |
| <p> |
| An example of a basic Lua job submit plugin function could be: |
| </p> |
| <pre> |
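| function slurm_job_submit(job_desc, part_list, submit_uid) |
|    -- Only inspect jobs that request GRES |
|    if (job_desc.gres ~= nil) |
|    then |
|       -- Check each comma-separated GRES entry |
|       for g in job_desc.gres:gmatch("[^,]+") |
|       do |
|          -- Matches entries such as "gpu" or "gpu:2", i.e. a gpu request with no type |
|          local bad = string.match(g,'^gpu[:]*[0-9]*$') |
|          if (bad ~= nil) |
|          then |
|             slurm.log_info("User specified gpu GRES without type: %s", bad) |
|             slurm.user_msg("You must always specify a type when requesting gpu GRES") |
|             return slurm.ERROR |
|          end |
|       end |
|    end |
|    -- Accept every other job |
|    return slurm.SUCCESS |
| end |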
| </pre> |
| <p> Having this script and the limit in place will force the users to always |
| specify a gpu with its type, thus enforcing the limits for each specific |
| model. |
| </p> |
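| |
| <p>To make the effect concrete, a sketch of the limit together with two |
| requests follows. It assumes the job submit plugin function above is in |
| place; the QOS name <i>normal</i> and the script name <i>job.sh</i> are |
| hypothetical:</p> |
| <pre> |
| # Limit each user to one tesla GPU in this QOS |
| sacctmgr modify qos normal set MaxTRESPerUser=gres/gpu:tesla=1 |
| # Expected to be rejected by the plugin: no GPU type is given |
| sbatch --qos=normal --gres=gpu:2 job.sh |
| # Expected to pass the plugin: the type is given, so the tesla limit applies |
| sbatch --qos=normal --gres=gpu:tesla:1 job.sh |
| </pre> |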
| |
| <p>When <b>TRESBillingWeights</b> are defined for a partition, both typed and |
| non-typed resources should be included. For example, if you have 'tesla' GPUs |
| in one partition and you only define the billing weights for the 'tesla' typed |
| GPU resource, then those weights will not be applied to the generic GPUs.</p> |
| |
| <p>It is also advisable to set <b>AccountingStorageTRES</b> for both generic |
| and specific gres types, otherwise requests that ask for the generic instance |
| of a gres won't be accounted for. For example, to track generic GPUs and |
| Tesla GPUs, you would set this in your slurm.conf: |
| </p> |
| <pre> |
| AccountingStorageTRES=gres/gpu,gres/gpu:tesla |
| </pre> |
| |
| <p> |
| See <a href="tres.html">Trackable Resources (TRES)</a> for details. |
| </p> |
| |
| <h2 id="reasons">Job Reason Codes |
| <a class="slurm_link" href="#reasons"></a> |
| </h2> |
| |
| <p>These reason codes can be used to identify why a job is waiting for |
| execution. A job may be waiting for more than one reason, in which case |
| only one of those reasons is displayed.</p> |
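| |
| <p>For instance, the reason for each of a user's pending jobs can be listed |
| with <i>squeue</i>, whose <i>%r</i> format field prints the reason. The user |
| name <i>bob</i> is hypothetical:</p> |
| <pre> |
| squeue --user=bob --states=PENDING --format="%.10i %.9P %.20j %.8T %r" |
| </pre> |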
| |
| <p><b>AccountingPolicy</b> — Fallback reason when others not matched.</p> |
| |
| <p><b>AccountNotAllowed</b> — Job is in an account not allowed in a |
| partition.</p> |
| |
| <p><b>AssocGrpBB</b> — The job's association has reached its aggregate |
| Burst Buffer limit.</p> |
| |
| <p><b>AssocGrpBBMinutes</b> — The job's association has reached the |
| maximum number of minutes allowed in aggregate for Burst Buffers by past, |
| present and future jobs.</p> |
| |
| <p><b>AssocGrpBBRunMinutes</b> — The job's association has reached the |
| maximum number of minutes allowed in aggregate for Burst Buffers by |
| currently running jobs.</p> |
| |
| <p><b>AssocGrpBilling</b> — The job's association has reached its |
| aggregate Billing limit.</p> |
| |
| <p><b>AssocGrpBillingMinutes</b> — The job's association has reached |
| the maximum number of minutes allowed in aggregate for the Billing value of |
| a resource by past, present and future jobs.</p> |
| |
| <p><b>AssocGrpBillingRunMinutes</b> — The job's association has reached |
| the maximum number of minutes allowed in aggregate for the Billing value of a |
| resource by currently running jobs.</p> |
| |
| <p><b>AssocGrpCpuLimit</b> — The job's association has reached its |
| aggregate CPU limit.</p> |
| |
| <p><b>AssocGrpCPUMinutesLimit</b> — The job's association has reached |
| the maximum number of minutes allowed in aggregate for CPUs by past, present |
| and future jobs.</p> |
| |
| <p><b>AssocGrpCPURunMinutesLimit</b> — The job's association has reached |
| the maximum number of minutes allowed in aggregate for CPUs by currently |
| running jobs.</p> |
| |
| <p><b>AssocGrpEnergy</b> — The job's association has reached its |
| aggregate Energy limit.</p> |
| |
| <p><b>AssocGrpEnergyMinutes</b> — The job's association has reached the |
| maximum number of minutes allowed in aggregate for Energy by past, present |
| and future jobs.</p> |
| |
| <p><b>AssocGrpEnergyRunMinutes</b> — The job's association has reached |
| the maximum number of minutes allowed in aggregate for Energy by currently |
| running jobs.</p> |
| |
| <p><b>AssocGrpGRES</b> — The job's association has reached its aggregate |
| GRES limit.</p> |
| |
| <p><b>AssocGrpGRESMinutes</b> — The job's association has reached the |
| maximum number of minutes allowed in aggregate for a GRES by past, present |
| and future jobs.</p> |
| |
| <p><b>AssocGrpGRESRunMinutes</b> — The job's association has reached the |
| maximum number of minutes allowed in aggregate for a GRES by currently |
| running jobs.</p> |
| |
| <p><b>AssocGrpJobsLimit</b> — The job's association has reached the |
| maximum number of allowed jobs in aggregate.</p> |
| |
| <p><b>AssocGrpLicense</b> — The job's association has reached its |
| aggregate license limit.</p> |
| |
| <p><b>AssocGrpLicenseMinutes</b> — The job's association has reached the |
| maximum number of minutes allowed in aggregate for Licenses by past, present |
| and future jobs.</p> |
| |
| <p><b>AssocGrpLicenseRunMinutes</b> — The job's association has reached |
| the maximum number of minutes allowed in aggregate for Licenses by currently |
| running jobs.</p> |
| |
| <p><b>AssocGrpMemLimit</b> — The job's association has reached its |
| aggregate Memory limit.</p> |
| |
| <p><b>AssocGrpMemMinutes</b> — The job's association has reached the |
| maximum number of minutes allowed in aggregate for Memory by past, present |
| and future jobs.</p> |
| |
| <p><b>AssocGrpMemRunMinutes</b> — The job's association has reached the |
| maximum number of minutes allowed in aggregate for Memory by currently |
| running jobs.</p> |
| |
| <p><b>AssocGrpNodeLimit</b> — The job's association has reached its |
| aggregate Node limit.</p> |
| |
| <p><b>AssocGrpNodeMinutes</b> — The job's association has reached the |
| maximum number of minutes allowed in aggregate for Nodes by past, present and |
| future jobs.</p> |
| |
| <p><b>AssocGrpNodeRunMinutes</b> — The job's association has reached the |
| maximum number of minutes allowed in aggregate for Nodes by currently running |
| jobs.</p> |
| |
| <p><b>AssocGrpSubmitJobsLimit</b> — The job's association has reached the |
| maximum number of jobs that can be running or pending in aggregate at a given |
| time.</p> |
| |
| <p><b>AssocGrpUnknown</b> — The job's association has reached its |
| aggregate limit for an unknown generic resource.</p> |
| |
| <p><b>AssocGrpUnknownMinutes</b> — The job's association has reached |
| the maximum number of minutes allowed in aggregate for an unknown generic |
| resource by past, present and future jobs.</p> |
| |
| <p><b>AssocGrpUnknownRunMinutes</b> — The job's association has reached |
| the maximum number of minutes allowed in aggregate for an unknown generic |
| resource by currently running jobs.</p> |
| |
| <p><b>AssocGrpWallLimit</b> — The job's association has reached its |
| aggregate limit for the amount of walltime requested by running jobs.</p> |
| |
| <p><b>AssocMaxBBMinutesPerJob</b> — The Burst Buffer request exceeds |
| the maximum number of minutes each job is allowed to use for the requested |
| association.</p> |
| |
| <p><b>AssocMaxBBPerJob</b> — The Burst Buffer request exceeds the |
| maximum each job is allowed to use for the requested association.</p> |
| |
| <p><b>AssocMaxBBPerNode</b> — The Burst Buffer request exceeds the |
| maximum number each node in a job allocation is allowed to use for the |
| requested association.</p> |
| |
| <p><b>AssocMaxBillingMinutesPerJob</b> — The request exceeds the |
| maximum number of minutes each job is allowed to use, with Billing taken into |
| account, for the requested association.</p> |
| |
| <p><b>AssocMaxBillingPerJob</b> — The resource request exceeds the |
| maximum Billing limit each job is allowed to use for the requested |
| association.</p> |
| |
| <p><b>AssocMaxBillingPerNode</b> — The request exceeds the maximum |
| Billing limit each node in a job allocation is allowed to use for the |
| requested association.</p> |
| |
| <p><b>AssocMaxCpuMinutesPerJobLimit</b> — The CPU request exceeds the |
| maximum number of minutes each job is allowed to use for the requested |
| association.</p> |
| |
| <p><b>AssocMaxCpuPerJobLimit</b> — The CPU request exceeds the maximum |
| each job is allowed to use for the requested association.</p> |
| |
| <p><b>AssocMaxCpuPerNode</b> — The request exceeds the maximum number |
| of CPUs each node in a job allocation is allowed to use for the requested |
| association.</p> |
| |
| <p><b>AssocMaxEnergyMinutesPerJob</b> — The Energy request exceeds the |
| maximum number of minutes each job is allowed to use for the requested |
| association.</p> |
| |
| <p><b>AssocMaxEnergyPerJob</b> — The Energy request exceeds the maximum |
| each job is allowed to use for the requested association.</p> |
| |
| <p><b>AssocMaxEnergyPerNode</b> — The request exceeds the maximum |
| amount of Energy each node in a job allocation is allowed to use for the |
| requested association.</p> |
| |
| <p><b>AssocMaxGRESMinutesPerJob</b> — The GRES request exceeds the |
| maximum number of minutes each job is allowed to use for the requested |
| association.</p> |
| |
| <p><b>AssocMaxGRESPerJob</b> — The GRES request exceeds the maximum |
| each job is allowed to use for the requested association.</p> |
| |
| <p><b>AssocMaxGRESPerNode</b> — The request exceeds the maximum number |
| of a GRES each node in a job allocation is allowed to use for the requested |
| association.</p> |
| |
| <p><b>AssocMaxJobsLimit</b> — The limit on the number of jobs each |
| user is allowed to run at a given time has been met for the requested |
| association.</p> |
| |
| <p><b>AssocMaxLicenseMinutesPerJob</b> — The License request exceeds |
| the maximum number of minutes each job is allowed to use for the requested |
| association.</p> |
| |
| <p><b>AssocMaxLicensePerJob</b> — The License request exceeds the |
| maximum each job is allowed to use for the requested association.</p> |
| |
| <p><b>AssocMaxMemMinutesPerJob</b> — The Memory request exceeds the |
| maximum number of minutes each job is allowed to use for the requested |
| association.</p> |
| |
| <p><b>AssocMaxMemPerJob</b> — The Memory request exceeds the maximum |
| each job is allowed to use for the requested association.</p> |
| |
| <p><b>AssocMaxMemPerNode</b> — The request exceeds the maximum amount |
| of Memory each node in a job allocation is allowed to use for the requested |
| association.</p> |
| |
| <p><b>AssocMaxNodeMinutesPerJob</b> — The Node request exceeds the |
| maximum number of minutes each job is allowed to use for the requested |
| association.</p> |
| |
| <p><b>AssocMaxNodePerJobLimit</b> — The number of nodes requested |
| exceeds the maximum each job is allowed to use for the requested |
| association.</p> |
| |
| <p><b>AssocMaxSubmitJobLimit</b> — The limit on the number of jobs each |
| user is allowed to have running or pending at a given time has been met for |
| the requested association.</p> |
| |
| <p><b>AssocMaxUnknownMinutesPerJob</b> — The request of an unknown |
| trackable resource exceeds the maximum number of minutes each job is allowed |
| to use for the requested association.</p> |
| |
| <p><b>AssocMaxUnknownPerJob</b> — The request of an unknown trackable |
| resource exceeds the maximum each job is allowed to use for the requested |
| association.</p> |
| |
| <p><b>AssocMaxUnknownPerNode</b> — The request exceeds the maximum |
| number of an unknown trackable resource each node in a job allocation is |
| allowed to use for the requested association.</p> |
| |
| <p><b>AssocMaxWallDurationPerJobLimit</b> — The limit on the amount of |
| wall time a job can request has been exceeded for the requested association. |
| </p> |
| |
| <p><b>AssociationJobLimit</b> — The job's association has reached |
| its maximum job count.</p> |
| |
| <p><b>AssociationResourceLimit</b> — The job's association has reached |
| some resource limit.</p> |
| |
| <p><b>AssociationTimeLimit</b> — The job's association has reached its |
| time limit.</p> |
| |
| <p><b>BadConstraints</b> — The job's constraints cannot be satisfied.</p> |
| |
| <p><b>BeginTime</b> — The job's earliest start time has not yet been |
| reached.</p> |
| |
| <p><b>BurstBufferOperation</b> — Burst Buffer operation for the job |
| failed.</p> |
| |
| <p><b>BurstBufferResources</b> — There are insufficient resources |
| in a Burst Buffer resource pool.</p> |
| |
| <p><b>BurstBufferStageIn</b> — The Burst Buffer plugin is in the |
| process of staging the environment for the job.</p> |
| |
| <p><b>Cleaning</b> — The job is being requeued and still cleaning up |
| from its previous execution.</p> |
| |
| <p><b>DeadLine</b> — This job has violated the configured Deadline.</p> |
| |
| <p><b>Dependency</b> — This job has a dependency on another job that |
| has not been satisfied.</p> |
| |
| <p><b>DependencyNeverSatisfied</b> — This job has a dependency on |
| another job that will never be satisfied.</p> |
| |
| <p><b>FedJobLock</b> — The job is waiting for the clusters in the |
| federation to sync up and issue a lock.</p> |
| |
| <p><b>FrontEndDown</b> — No front end node is available to execute this |
| job.</p> |
| |
| <p><b>InactiveLimit</b> — The job reached the system InactiveLimit.</p> |
| |
| <p><b>InvalidAccount</b> — The job's account is invalid.</p> |
| |
| <p><b>InvalidQOS</b> — The job's QOS is invalid.</p> |
| |
| <p><b>JobArrayTaskLimit</b> — The job array's limit on the number of |
| simultaneously running tasks has been reached.</p> |
| |
| <p><b>JobHeldAdmin</b> — The job is held by a system administrator.</p> |
| |
| <p><b>JobHeldUser</b> — The job is held by the user.</p> |
| |
| <p><b>JobHoldMaxRequeue</b> — The job has been requeued enough times to |
| reach the MAX_BATCH_REQUEUE limit.</p> |
| |
| <p><b>JobLaunchFailure</b> — The job could not be launched. This may |
| be due to a file system problem, invalid program name, etc.</p> |
| |
| <p><b>Licenses</b> — The job is waiting for a license.</p> |
| |
| <p><b>MaxBBPerAccount</b> — The job's Burst Buffer request exceeds the |
| per-Account limit on the job's QOS.</p> |
| |
| <p><b>MaxBillingPerAccount</b> — The job's Billing request exceeds the |
| per-Account limit on the job's QOS.</p> |
| |
| <p><b>MaxCpuPerAccount</b> — The job's CPU request exceeds the |
| per-Account limit on the job's QOS.</p> |
| |
| <p><b>MaxEnergyPerAccount</b> — The job's Energy request exceeds the |
| per-Account limit on the job's QOS.</p> |
| |
| <p><b>MaxGRESPerAccount</b> — The job's GRES request exceeds the |
| per-Account limit on the job's QOS.</p> |
| |
| <p><b>MaxJobsPerAccount</b> — This job exceeds the per-Account limit |
| on the number of jobs for the job's QOS.</p> |
| |
| <p><b>MaxLicensePerAccount</b> — The job's License request exceeds the |
| per-Account limit on the job's QOS.</p> |
| |
| <p><b>MaxMemoryPerAccount</b> — The job's Memory request exceeds the |
| per-Account limit on the job's QOS.</p> |
| |
| <p><b>MaxMemPerLimit</b> — The job violates the limit on the maximum |
| amount of Memory per-CPU or per-Node.</p> |
| |
| <p><b>MaxNodePerAccount</b> — The number of nodes requested by the job |
| exceeds the per-Account limit on the number of nodes for the job's QOS.</p> |
| |
| <p><b>MaxSubmitJobsPerAccount</b> — This job exceeds the per-Account |
| limit on the number of jobs in a pending or running state for the job's QOS. |
| </p> |
| |
| <p><b>MaxUnknownPerAccount</b> — The job's request for an unknown GRES |
| exceeds the per-Account limit on the job's QOS.</p> |
| |
| <p><b>NodeDown</b> — A node required by the job is down.</p> |
| |
| <p><b>NonZeroExitCode</b> — The job terminated with a non-zero exit |
| code.</p> |
| |
| <p><b>None</b> — The job hasn't had a reason assigned to it yet.</p> |
| |
| <p><b>OutOfMemory</b> — The job failed with an Out Of Memory error. |
| </p> |
| |
| <p><b>PartitionConfig</b> — Fallback reason when the job violates |
| some limit on the partition.</p> |
| |
| <p><b>PartitionDown</b> — The partition required by this job is in |
| a DOWN state.</p> |
| |
| <p><b>PartitionInactive</b> — The partition required by this job is |
| in an Inactive state and not able to start jobs.</p> |
| |
| <p><b>PartitionNodeLimit</b> — The number of nodes required by this |
| job is outside of its partition's current limits. It can also indicate that |
| required nodes are DOWN or DRAINED.</p> |
| |
| <p><b>PartitionTimeLimit</b> — The job's time limit exceeds its |
| partition's current time limit.</p> |
| |
| <p><b>PowerNotAvail</b> — The job requests more power than is available |
| when using the cray_aries power management plugin.</p> |
| |
| <p><b>PowerReserved</b> — The job's power request is for more than |
| what is currently available when using the cray_aries power management plugin. |
| </p> |
| |
| <p><b>Priority</b> — One or more higher priority jobs exist for the |
| partition associated with the job or for the advanced reservation.</p> |
| |
| <p><b>Prolog</b> — The job's PrologSlurmctld program is still running. |
| </p> |
| |
| <p><b>QOSGrpBB</b> — The job's QOS has reached its aggregate |
| Burst Buffer limit.</p> |
| |
| <p><b>QOSGrpBBMinutes</b> — The job's QOS has reached the |
| maximum number of minutes allowed in aggregate for Burst Buffers by past, |
| present and future jobs.</p> |
| |
| <p><b>QOSGrpBBRunMinutes</b> — The job's QOS has reached the |
| maximum number of minutes allowed in aggregate for Burst Buffers by |
| currently running jobs.</p> |
| |
| <p><b>QOSGrpBilling</b> — The job's QOS has reached its aggregate |
| Billing limit.</p> |
| |
| <p><b>QOSGrpBillingMinutes</b> — The job's QOS has reached |
| the maximum number of minutes allowed in aggregate for the Billing value of |
| a resource by past, present and future jobs.</p> |
| |
| <p><b>QOSGrpBillingRunMinutes</b> — The job's QOS has reached the |
| maximum number of minutes allowed in aggregate for the Billing value of a |
| resource by currently running jobs.</p> |
| |
| <p><b>QOSGrpCpuLimit</b> — The job's QOS has reached its aggregate |
| CPU limit.</p> |
| |
| <p><b>QOSGrpCPUMinutesLimit</b> — The job's QOS has reached the |
| maximum number of minutes allowed in aggregate for CPUs by past, present |
| and future jobs.</p> |
| |
| <p><b>QOSGrpCPURunMinutesLimit</b> — The job's QOS has reached |
| the maximum number of minutes allowed in aggregate for CPUs by currently |
| running jobs.</p> |
| |
| <p><b>QOSGrpEnergy</b> — The job's QOS has reached its aggregate |
| Energy limit.</p> |
| |
| <p><b>QOSGrpEnergyMinutes</b> — The job's QOS has reached the |
| maximum number of minutes allowed in aggregate for Energy by past, present |
| and future jobs.</p> |
| |
| <p><b>QOSGrpEnergyRunMinutes</b> — The job's QOS has reached the |
| maximum number of minutes allowed in aggregate for Energy by currently |
| running jobs.</p> |
| |
| <p><b>QOSGrpGRES</b> — The job's QOS has reached its aggregate GRES |
| limit.</p> |
| |
| <p><b>QOSGrpGRESMinutes</b> — The job's QOS has reached the maximum |
| number of minutes allowed in aggregate for a GRES by past, present and |
| future jobs.</p> |
| |
| <p><b>QOSGrpGRESRunMinutes</b> — The job's QOS has reached the |
| maximum number of minutes allowed in aggregate for a GRES by currently |
| running jobs.</p> |
| |
| <p><b>QOSGrpJobsLimit</b> — The job's QOS has reached the maximum |
| number of allowed jobs in aggregate.</p> |
| |
| <p><b>QOSGrpLicense</b> — The job's QOS has reached its aggregate |
| license limit.</p> |
| |
| <p><b>QOSGrpLicenseMinutes</b> — The job's QOS has reached the |
| maximum number of minutes allowed in aggregate for Licenses by past, present |
| and future jobs.</p> |
| |
| <p><b>QOSGrpLicenseRunMinutes</b> — The job's QOS has reached the |
| maximum number of minutes allowed in aggregate for Licenses by currently |
| running jobs.</p> |
| |
| <p><b>QOSGrpMemLimit</b> — The job's QOS has reached its aggregate |
| Memory limit.</p> |
| |
| <p><b>QOSGrpMemoryMinutes</b> — The job's QOS has reached the |
| maximum number of minutes allowed in aggregate for Memory by past, present |
| and future jobs.</p> |
| |
| <p><b>QOSGrpMemoryRunMinutes</b> — The job's QOS has reached the |
| maximum number of minutes allowed in aggregate for Memory by currently |
| running jobs.</p> |
| |
| <p><b>QOSGrpNodeLimit</b> — The job's QOS has reached its |
| aggregate Node limit.</p> |
| |
| <p><b>QOSGrpNodeMinutes</b> — The job's QOS has reached the maximum |
| number of minutes allowed in aggregate for Nodes by past, present and |
| future jobs.</p> |
| |
| <p><b>QOSGrpNodeRunMinutes</b> — The job's QOS has reached the maximum |
| number of minutes allowed in aggregate for Nodes by currently running jobs.</p> |
| |
| <p><b>QOSGrpSubmitJobsLimit</b> — The job's QOS has reached the maximum |
| number of jobs that can be running or pending in aggregate at a given time.</p> |
| |
| <p><b>QOSGrpUnknown</b> — The job's QOS has reached its aggregate limit |
| for an unknown generic resource.</p> |
| |
| <p><b>QOSGrpUnknownMinutes</b> — The job's QOS has reached the maximum |
| number of minutes allowed in aggregate for an unknown generic resource by |
| past, present and future jobs.</p> |
| |
| <p><b>QOSGrpUnknownRunMinutes</b> — The job's QOS has reached the |
| maximum number of minutes allowed in aggregate for an unknown generic |
| resource by currently running jobs.</p> |
| |
| <p><b>QOSGrpWallLimit</b> — The job's QOS has reached its aggregate |
| limit for the amount of walltime requested by running jobs.</p> |
| |
| <p><b>QOSJobLimit</b> — The job's QOS has reached its maximum job |
| count.</p> |
| |
| <p><b>QOSMaxBBMinutesPerJob</b> — The Burst Buffer request exceeds |
| the maximum number of minutes each job is allowed to use for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxBBPerJob</b> — The Burst Buffer request exceeds the |
| maximum each job is allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxBBPerNode</b> — The Burst Buffer request exceeds the |
| maximum number each node in a job allocation is allowed to use for the |
| requested QOS.</p> |
| |
| <p><b>QOSMaxBBPerUser</b> — The Burst Buffer request exceeds the |
| maximum number each user is allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxBillingMinutesPerJob</b> — The request exceeds the |
| maximum number of minutes each job is allowed to use, with Billing taken into |
| account, for the requested QOS.</p> |
| |
| <p><b>QOSMaxBillingPerJob</b> — The resource request exceeds the |
| maximum Billing limit each job is allowed to use for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxBillingPerNode</b> — The request exceeds the maximum |
| Billing limit each node in a job allocation is allowed to use for the |
| requested QOS.</p> |
| |
| <p><b>QOSMaxBillingPerUser</b> — The request exceeds the maximum |
| Billing limit each user is allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxCpuMinutesPerJobLimit</b> — The CPU request exceeds the |
| maximum number of minutes each job is allowed to use for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxCpuPerJobLimit</b> — The CPU request exceeds the maximum |
| each job is allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxCpuPerNode</b> — The request exceeds the maximum number |
| of CPUs each node in a job allocation is allowed to use for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxCpuPerUserLimit</b> — The CPU request exceeds the maximum |
| each user is allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxEnergyMinutesPerJob</b> — The Energy request exceeds the |
| maximum number of minutes each job is allowed to use for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxEnergyPerJob</b> — The Energy request exceeds the maximum |
| each job is allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxEnergyPerNode</b> — The request exceeds the maximum |
| amount of Energy each node in a job allocation is allowed to use for the |
| requested QOS.</p> |
| |
| <p><b>QOSMaxEnergyPerUser</b> — The request exceeds the maximum |
| amount of Energy each user is allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxGRESMinutesPerJob</b> — The GRES request exceeds the |
| maximum number of minutes each job is allowed to use for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxGRESPerJob</b> — The GRES request exceeds the maximum |
| each job is allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxGRESPerNode</b> — The request exceeds the maximum number |
| of a GRES each node in a job allocation is allowed to use for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxGRESPerUser</b> — The request exceeds the maximum number |
| of a GRES each user is allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxJobsPerUserLimit</b> — The limit on the number of jobs a |
| user is allowed to run at a given time has been met for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxLicenseMinutesPerJob</b> — The License request exceeds |
| the maximum number of minutes each job is allowed to use for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxLicensePerJob</b> — The License request exceeds the |
| maximum each job is allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxLicensePerUser</b> — The License request exceeds the |
| maximum each user is allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxMemoryMinutesPerJob</b> — The Memory request exceeds the |
| maximum number of minutes each job is allowed to use for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxMemoryPerJob</b> — The Memory request exceeds the maximum |
| each job is allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxMemoryPerNode</b> — The request exceeds the maximum amount |
| of Memory each node in a job allocation is allowed to use for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxMemoryPerUser</b> — The request exceeds the maximum amount |
| of Memory each user is allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxNodeMinutesPerJob</b> — The Node request exceeds the |
| maximum number of minutes each job is allowed to use for the |
| requested QOS.</p> |
| |
| <p><b>QOSMaxNodePerJobLimit</b> — The number of nodes requested |
| exceeds the maximum each job is allowed to use for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxNodePerUserLimit</b> — The number of nodes requested |
| exceeds the maximum each user is allowed to use for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxSubmitJobPerUserLimit</b> — The limit on the number of |
| jobs each user is allowed to have running or pending at a given time has |
| been met for the requested QOS.</p> |
| |
| <p><b>QOSMaxUnknownMinutesPerJob</b> — The request of an unknown |
| trackable resource exceeds the maximum number of minutes each job is allowed |
| to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxUnknownPerJob</b> — The request of an unknown trackable |
| resource exceeds the maximum each job is allowed to use for the requested |
| QOS.</p> |
| |
| <p><b>QOSMaxUnknownPerNode</b> — The request exceeds the maximum |
| number of an unknown trackable resource each node in a job allocation is |
| allowed to use for the requested QOS.</p> |
| |
| <p><b>QOSMaxUnknownPerUser</b> — The request exceeds the maximum |
| number of an unknown trackable resource each user is allowed to use for |
| the requested QOS.</p> |
| |
| <p><b>QOSMaxWallDurationPerJobLimit</b> — The limit on the amount of |
| wall time a job can request has been exceeded for the requested QOS.</p> |
| |
| <p><b>QOSMinBB</b> — The Burst Buffer request does not meet the |
| minimum each job is required to request for the requested QOS.</p> |
| |
| <p><b>QOSMinBilling</b> — The resource request does not meet the |
| minimum Billing value each job is required to request for the requested |
| QOS.</p> |
| |
| <p><b>QOSMinCpuNotSatisfied</b> — The CPU request does not meet the |
| minimum each job is required to request for the requested QOS.</p> |
| |
| <p><b>QOSMinEnergy</b> — The Energy request does not meet the |
| minimum each job is required to request for the requested QOS.</p> |
| |
| <p><b>QOSMinGRES</b> — The GRES request does not meet the |
| minimum each job is required to request for the requested QOS.</p> |
| |
| <p><b>QOSMinLicense</b> — The License request does not meet the |
| minimum each job is required to request for the requested QOS.</p> |
| |
| <p><b>QOSMinMemory</b> — The Memory request does not meet the |
| minimum each job is required to request for the requested QOS.</p> |
| |
| <p><b>QOSMinNode</b> — The number of nodes requested does not meet the |
| minimum each job is required to request for the requested QOS.</p> |
| |
| <p><b>QOSMinUnknown</b> — The request of an unknown trackable resource |
| does not meet the minimum each job is required to request for the requested QOS.</p> |
| |
| <p><b>QOSNotAllowed</b> — The job requests a QOS that is not allowed by |
| the requested association or partition.</p> |
| |
| <p><b>ReservationDeleted</b> — The job requested a reservation that is |
| no longer on the system.</p> |
| |
| <p><b>QOSResourceLimit</b> — The job's QOS has reached some resource |
| limit.</p> |
| |
| <p><b>QOSTimeLimit</b> — The job's QOS has reached its time limit.</p> |
| |
| <p><b>QOSUsageThreshold</b> — The required QOS usage threshold has been |
| breached.</p> |
| |
| <p><b>ReqNodeNotAvail</b> — Some node specifically required by the job |
| is not currently available. The node may currently be in use, reserved for |
| another job, in an advanced reservation, DOWN, DRAINED, or not responding. |
| Nodes which are DOWN, DRAINED, or not responding will be identified as part |
| of the job's "reason" field as "UnavailableNodes". Such nodes will typically |
| require the intervention of a system administrator to make available.</p> |
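| |
| <p>As a rough illustration of the "UnavailableNodes" text mentioned above, the |
| following Python sketch extracts the node expression from a reason string. It |
| assumes the reason looks like "ReqNodeNotAvail, UnavailableNodes:node[01-03]"; |
| that layout, the helper name unavailable_nodes and the node names are |
| illustrative assumptions, not a documented interface.</p> |

```python
from typing import Optional

# Assumed marker text; verify it against the reason strings on your own system.
MARKER = "UnavailableNodes:"


def unavailable_nodes(reason: str) -> Optional[str]:
    """Return the node expression that follows the marker, or None if absent."""
    index = reason.find(MARKER)
    if index == -1:
        return None
    # Everything after the marker is treated as the node expression.
    nodes = reason[index + len(MARKER):].strip()
    return nodes or None


# Hypothetical reason strings used only for demonstration.
print(unavailable_nodes("ReqNodeNotAvail, UnavailableNodes:node[01-03]"))
print(unavailable_nodes("Priority"))
```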
| |
| <p><b>Reservation</b> — The job is waiting for its advanced reservation to |
| become available.</p> |
| |
| <p><b>Resources</b> — The job is waiting for resources to become available.</p> |
| |
| <p><b>SchedDefer</b> — The job requests an immediate allocation but |
| <b>SchedulerParameters=defer</b> is configured in the slurm.conf.</p> |
| |
| <p><b>SystemFailure</b> — Failure of the Slurm system, a file system, |
| the network, etc.</p> |
| |
| <p><b>TimeLimit</b> — The job exhausted its time limit.</p> |
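| |
| <p>Many of the reason names above follow a visible naming pattern: a scope |
| prefix (Assoc or QOS), a limit kind (Grp, Max or Min) and then the resource. |
| The Python sketch below splits a name along those lines. It is a heuristic |
| built only from the names listed on this page, not an official Slurm grammar, |
| and the function name describe_reason and its result keys are hypothetical.</p> |

```python
import re
from typing import Dict, Optional

# Heuristic pattern derived only from the reason names listed on this page:
# a scope prefix (Assoc or QOS) followed by a limit kind (Grp, Max or Min).
REASON_PATTERN = re.compile(r"^(Assoc|QOS)(Grp|Max|Min)(.+)$")
SCOPES = {"Assoc": "association", "QOS": "qos"}


def describe_reason(reason: str) -> Dict[str, Optional[str]]:
    """Split a reason name into scope, limit kind, minutes marker and the rest."""
    match = REASON_PATTERN.match(reason)
    if match is None:
        # Names such as Priority or MaxCpuPerAccount carry no scope prefix.
        return {"scope": None, "kind": None, "minutes": None, "rest": reason}
    prefix, kind, rest = match.groups()
    # Check RunMinutes first because it also contains the text "Minutes".
    if "RunMinutes" in rest:
        minutes = "RunMinutes"
    elif "Minutes" in rest:
        minutes = "Minutes"
    else:
        minutes = None
    return {"scope": SCOPES[prefix], "kind": kind, "minutes": minutes, "rest": rest}


for name in ("AssocGrpNodeRunMinutes", "QOSMinMemory", "AssociationJobLimit"):
    print(name, describe_reason(name))
```

| <p>Names that do not match, such as AssociationJobLimit or the per-Account |
| Max* reasons, are returned unchanged with no scope, so a caller can fall back |
| to the descriptions above.</p> |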
| |
| <p style="text-align: center;">Last modified 01 January 2024</p> |
| |
| <!--#include virtual="footer.txt"--> |