| .TH "slurm.conf" "5" "Slurm Configuration File" "August 2025" "Slurm Configuration File" |
| |
| .SH "NAME" |
| slurm.conf \- Slurm configuration file |
| |
| .SH "DESCRIPTION" |
| \fBslurm.conf\fP is an ASCII file which describes general Slurm |
| configuration information, the nodes to be managed, information about |
| how those nodes are grouped into partitions, and various scheduling |
| parameters associated with those partitions. This file should be |
| consistent across all nodes in the cluster. |
| .LP |
| The file location can be modified at execution time by setting the SLURM_CONF |
| environment variable. The Slurm daemons also allow you to override |
| both the built\-in and environment\-provided location using the "\-f" |
| option on the command line. |
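| For example (the path shown is only illustrative): |
| .nf |
| $ SLURM_CONF=/opt/slurm/etc/slurm.conf sinfo |
| $ slurmctld \-f /opt/slurm/etc/slurm.conf |
| .fi |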
| .LP |
| The contents of the file are case insensitive except for the names of nodes |
| and partitions. Any text following a "#" in the configuration file is treated |
| as a comment through the end of that line. |
| Changes to the configuration file take effect upon restart of |
| Slurm daemons, daemon receipt of the SIGHUP signal, or execution |
| of the command "scontrol reconfigure" unless otherwise noted. |
| Changes to TCP listening settings will require a daemon restart. |
| .LP |
| If a line begins with the word "Include" followed by whitespace |
| and then a file name, that file will be included inline with the current |
| configuration file. For large or complex systems, multiple configuration files |
| may prove easier to manage and enable reuse of some files (See INCLUDE |
| MODIFIERS for more details). |
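| For example, node and partition definitions could be kept in separate files |
| (the file names shown are only illustrative): |
| .nf |
| Include /etc/slurm/nodes.conf |
| Include /etc/slurm/partitions.conf |
| .fi |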
| .LP |
| Note on file permissions: |
| .LP |
| The \fIslurm.conf\fR file must be readable by all users of Slurm, since it |
| is used by many of the Slurm commands. Other files that are defined |
| in the \fIslurm.conf\fR file, such as log files and job accounting files, |
| may need to be created/owned by the user "SlurmUser" to be successfully |
| accessed. Use the "chown" and "chmod" commands to set the ownership |
| and permissions appropriately. |
| See the section \fBFILE AND DIRECTORY PERMISSIONS\fR for information |
| about the various files and directories used by Slurm. |
| |
| .SH "PARAMETERS" |
| .LP |
| The overall configuration parameters available include: |
| |
| .TP |
| \fBAccountingStorageBackupHost\fR |
| The name of the backup machine hosting the accounting storage database. |
| If used with the accounting_storage/slurmdbd plugin, this is where the backup |
| slurmdbd would be running. |
| Only used with systems using SlurmDBD, ignored otherwise. |
| .IP |
| |
| .TP |
| \fBAccountingStorageEnforce\fR |
| This controls what level of association\-based enforcement to impose |
| on job submissions. Valid options are any comma\-separated combination of the |
| following, many of which implicitly include other options (see the example |
| following this list): |
| |
| .IP |
| .RS |
| .TP 2 |
| \fBall\fR |
| Implies all other available options except \fBnojobs\fR and \fBnosteps\fR. |
| .IP |
| |
| .TP |
| \fBassociations\fR |
| No new job is allowed to run unless a corresponding association exists in the |
| system. |
| .IP |
| |
| .TP |
| \fBlimits\fR |
| Users can be limited by association to whatever job size or run time limits are |
| defined. Implies \fBassociations\fR. |
| .IP |
| |
| .TP |
| \fBnojobs\fR |
| Slurm will not account for any jobs or steps on the system. |
| Implies \fBnosteps\fR. |
| .IP |
| |
| .TP |
| \fBnosteps\fR |
| Slurm will not account for any steps that have run. |
| .IP |
| |
| .TP |
| \fBqos\fR |
| Jobs will not be scheduled unless a valid qos is specified. |
| Implies \fBassociations\fR. |
| .IP |
| |
| .TP |
| \fBsafe\fR |
| A job will only be launched against an association or qos that has a |
| TRES\-minutes limit set if the job will be able to run to completion. Without |
| this option set, jobs will be launched as long as their usage hasn't reached |
| the TRES\-minutes limit. This can lead to jobs being launched but then killed |
| when the limit is reached. With this option, a job won't be killed due to limits, |
| even if the limits are changed after the job was started and the association or |
| qos violates the updated limits. Implies \fBlimits\fR and \fBassociations\fR. |
| .IP |
| |
| .TP |
| \fBwckeys\fR |
| Jobs will not be scheduled unless a valid workload characterization key is |
| specified. Implies \fBassociations\fR and \fBTrackWCKey\fR (a separate |
| configuration option). |
| .RE |
| .IP |
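| For example, the following line enforces associations, limits, QOS and safe |
| limit checking: |
| .nf |
| AccountingStorageEnforce=associations,limits,qos,safe |
| .fi |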
| |
| .TP |
| \fBAccountingStorageExternalHost\fR |
| A comma\-separated list of external slurmdbds (<host/ip>[:port][,...]) to |
| register with. If no port is given, the \fBAccountingStoragePort\fR will be |
| used. |
| |
| This allows clusters registered with the external slurmdbd to communicate with |
| each other using the \fI\-\-cluster/\-M\fR client command options. |
| |
| The cluster will add itself to the external slurmdbd if it doesn't exist. If a |
| non\-external cluster already exists on the external slurmdbd, the slurmctld |
| will ignore registering to the external slurmdbd. |
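| For example (host names are only illustrative): |
| .nf |
| AccountingStorageExternalHost=extdbd1:6819,extdbd2 |
| .fi |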
| .IP |
| |
| .TP |
| \fBAccountingStorageHost\fR |
| The name of the machine hosting the accounting storage database. |
| Only used with systems using SlurmDBD, ignored otherwise. |
| .IP |
| |
| .TP |
| \fBAccountingStorageParameters\fR |
| Comma\-separated list of options. |
| .IP |
| .RS |
| .TP 2 |
| \fBmax_step_records\fR=\# |
| The number of steps that are recorded in the database for each job -- excluding |
| batch, extern, and interactive steps. |
| .IP |
| |
| .RE |
| .IP |
| The following comma\-separated key\-value options are used to establish a |
| secure connection to the database (see the example following this list): |
| .IP |
| .RS |
| .TP 2 |
| \fBSSL_CERT\fR |
| The path name of the client public key certificate file. |
| .IP |
| |
| .TP |
| \fBSSL_CA\fR |
| The path name of the Certificate Authority (CA) certificate file. |
| .IP |
| |
| .TP |
| \fBSSL_CAPATH\fR |
| The path name of the directory that contains trusted SSL CA certificate files. |
| .IP |
| |
| .TP |
| \fBSSL_KEY\fR |
| The path name of the client private key file. |
| .IP |
| |
| .TP |
| \fBSSL_CIPHER\fR |
| The list of permissible ciphers for SSL encryption. |
| .RE |
| .IP |
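| For example (the paths shown are only illustrative): |
| .nf |
| AccountingStorageParameters=SSL_CERT=/etc/slurm/ssl/cert.pem,SSL_KEY=/etc/slurm/ssl/key.pem |
| .fi |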
| |
| .TP |
| \fBAccountingStoragePass\fR |
| The password used to gain access to the database to store the |
| accounting data. Only used for database type storage plugins, ignored |
| otherwise. In the case of SlurmDBD (Database Daemon) with MUNGE |
| authentication this can be configured to use a MUNGE daemon |
| specifically configured to provide authentication between clusters |
| while the default MUNGE daemon provides authentication within a |
| cluster. In that case, \fBAccountingStoragePass\fR should specify the |
| named port to be used for communications with the alternate MUNGE |
| daemon (e.g. "/var/run/munge/global.socket.2"). The default value is |
| NULL. |
| .IP |
| |
| .TP |
| \fBAccountingStoragePort\fR |
| The listening port of the accounting storage database server. |
| Only used for database type storage plugins, ignored otherwise. |
| The default value is SLURMDBD_PORT as established at system |
| build time. If no value is explicitly specified, it will be set to 6819. |
| This value must be equal to the \fBDbdPort\fR parameter in the |
| slurmdbd.conf file. |
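| For example, a minimal SlurmDBD\-based accounting configuration might look |
| like this (the host name is only illustrative): |
| .nf |
| AccountingStorageType=accounting_storage/slurmdbd |
| AccountingStorageHost=dbhost |
| AccountingStoragePort=6819 |
| .fi |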
| .IP |
| |
| .TP |
| \fBAccountingStorageTRES\fR |
| Comma\-separated list of resources you wish to track on the cluster. |
| These are the resources requested by the sbatch/srun job when it |
| is submitted. Currently this consists of any GRES, BB (burst buffer) or |
| license along with CPU, Memory, Node, Energy, FS/[Disk|Lustre], IC/OFED, Pages, |
| and VMem. By default Billing, CPU, Energy, Memory, Node, FS/Disk, Pages and VMem |
| are tracked. These default TRES cannot be disabled, but only appended to. |
| AccountingStorageTRES=gres/craynetwork,license/iop1 |
| will track billing, cpu, energy, memory, nodes, fs/disk, pages and vmem along |
| with a gres called craynetwork as well as a license called iop1. Whenever these |
| resources are used on the cluster they are recorded. The TRES are automatically |
| set up in the database on the start of the slurmctld. |
| |
| If multiple GRES of different types are tracked (e.g. GPUs of different types), |
| then job requests with matching type specifications will be recorded. |
| Given a configuration of |
| "AccountingStorageTRES=gres/gpu,gres/gpu:tesla,gres/gpu:volta", |
| then "gres/gpu:tesla" and "gres/gpu:volta" will track only jobs that explicitly |
| request those two GPU types, while "gres/gpu" will track allocated GPUs of any |
| type ("tesla", "volta" or any other GPU type). |
| |
| Given a configuration of |
| "AccountingStorageTRES=gres/gpu:tesla,gres/gpu:volta", |
| then "gres/gpu:tesla" and "gres/gpu:volta" will track jobs that explicitly |
| request those GPU types. |
| If a job requests GPUs, but does not explicitly specify the GPU type, then |
| its resource allocation will be accounted for as either "gres/gpu:tesla" or |
| "gres/gpu:volta", although the accounting may not match the actual GPU type |
| allocated to the job and the GPUs allocated to the job could be heterogeneous. |
| In an environment containing various GPU types, use of a job_submit plugin |
| may be desired in order to force jobs to explicitly specify some GPU type. |
| |
| \fBNOTE\fR: Setting gres/gpu will also set gres/gpumem and gres/gpuutil. |
| gres/gpumem and gres/gpuutil can be set individually when gres/gpu is not set. |
| .IP |
| |
| .TP |
| \fBAccountingStorageType\fR |
| The accounting storage mechanism type. Unset by default, which indicates |
| that accounting records are not maintained. |
| |
| Current options are: |
| .IP |
| .RS |
| .TP |
| \fBaccounting_storage/slurmdbd\fR |
| The accounting records will be written to the SlurmDBD, which manages an |
| underlying MySQL database. See "man slurmdbd" for more information. |
| .RE |
| .IP |
| |
| .TP |
| \fBAccountingStoreFlags\fR |
| Comma\-separated list used to modify which fields the slurmctld sends to the |
| accounting database. Current options are: |
| .IP |
| .RS |
| |
| .TP |
| \fBjob_comment\fR |
| Include the job's comment field in the job complete message sent to the Accounting Storage database. |
| Note the AdminComment and SystemComment are always recorded in the database. |
| .IP |
| |
| .TP |
| \fBjob_env\fR |
| Include a batch job's environment variables used at job submission in the job |
| start message sent to the Accounting Storage database. |
| .IP |
| |
| .TP |
| \fBjob_extra\fR |
| Include the job's extra field in the job complete message sent to the Accounting |
| Storage database. |
| .IP |
| |
| .TP |
| \fBjob_script\fR |
| Include the job's batch script in the job start message sent to the Accounting Storage database. |
| .IP |
| |
| .TP |
| \fBno_stdio\fR |
| Exclude the stdio paths when recording data into the database on a job or |
| step start. StdOut, StdErr and StdIn db fields for jobs and steps will be empty. |
| .RE |
| .IP |
| |
| .TP |
| \fBAcctGatherNodeFreq\fR |
| The sampling interval, in seconds, used by the AcctGather plugins for node |
| accounting. If no AcctGather plugins are configured, this parameter is |
| ignored. For the acct_gather_energy/rapl plugin, set a value less than 300 |
| because the counters may overflow beyond this rate. |
| The default value is zero, which disables accounting sampling for nodes. |
| Note: The accounting sampling interval for jobs is determined by the value |
| of \fBJobAcctGatherFrequency\fR. |
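| For example, to sample node energy data every 30 seconds with the RAPL |
| plugin (a sketch, assuming RAPL is available on the nodes): |
| .nf |
| AcctGatherEnergyType=acct_gather_energy/rapl |
| AcctGatherNodeFreq=30 |
| .fi |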
| .IP |
| |
| .TP |
| \fBAcctGatherEnergyType\fR |
| Identifies the plugin to be used for energy consumption accounting. |
| The jobacct_gather plugin and slurmd daemon call this plugin to collect |
| energy consumption data for jobs and nodes. Energy consumption data is |
| collected at the node level, so the measurements will only reflect a job's |
| real consumption when the job has an exclusive allocation of the node. When |
| nodes are shared between jobs, the consumed energy reported per job (through |
| sstat or sacct) will not reflect the energy actually consumed by each job. |
| By default, no energy data is collected. |
| |
| Configurable values at present are: |
| .IP |
| .RS |
| .TP 20 |
| \fBacct_gather_energy/gpu\fR |
| Energy consumption data is collected from the GPU management library (e.g. rsmi) |
| for the corresponding type of GPU. Only available for rsmi at present. |
| Note: slurmd will keep gpu plugin loaded after configuration when this is set. |
| .IP |
| |
| .TP |
| \fBacct_gather_energy/ipmi\fR |
| Energy consumption data is collected from the Baseboard Management Controller |
| (BMC) using the Intelligent Platform Management Interface (IPMI). |
| .IP |
| |
| .TP |
| \fBacct_gather_energy/pm_counters\fR |
| Energy consumption data is collected from the Baseboard Management |
| Controller (BMC) for HPE Cray systems. |
| .IP |
| |
| .TP |
| \fBacct_gather_energy/rapl\fR |
| Energy consumption data is collected from hardware sensors using the Running |
| Average Power Limit (RAPL) mechanism. Note that enabling RAPL may require the |
| execution of the command "sudo modprobe msr". |
| .IP |
| |
| .TP |
| \fBacct_gather_energy/xcc\fR |
| Energy consumption data is collected from the Lenovo SD650 XClarity Controller |
| (XCC) using IPMI OEM raw commands. |
| .RE |
| .IP |
| |
| .TP |
| \fBAcctGatherInterconnectType\fR |
| Identifies the plugin to be used for interconnect network traffic accounting. |
| The jobacct_gather plugin and slurmd daemon call this plugin to collect |
| network traffic data for jobs and nodes. |
| Network traffic data is collected at the node level, so the collected values |
| will only reflect a job's real traffic when the job has an exclusive |
| allocation of the node. When nodes are shared between jobs, the network |
| traffic reported per job (through sstat or sacct) will not reflect the |
| traffic actually generated by each job. |
| |
| Configurable values at present are: |
| .IP |
| .RS |
| .TP 20 |
| \fBacct_gather_interconnect/ofed\fR |
| Infiniband network traffic data are collected from the hardware monitoring |
| counters of Infiniband devices through the OFED library. |
| In order to account for per job network traffic, add the "ic/ofed" TRES to |
| \fIAccountingStorageTRES\fR. |
| .IP |
| |
| .TP |
| \fBacct_gather_interconnect/sysfs\fR |
| Network traffic statistics are collected from the Linux sysfs |
| pseudo\-filesystem for specific interfaces defined in |
| \fBacct_gather.conf\fR(5). |
| In order to account for per job network traffic, add the "ic/sysfs" TRES to |
| \fIAccountingStorageTRES\fR. |
| .RE |
| .IP |
| |
| .TP |
| \fBAcctGatherFilesystemType\fR |
| Identifies the plugin to be used for filesystem traffic accounting. |
| The jobacct_gather plugin and slurmd daemon call this plugin to collect |
| filesystem traffic data for jobs and nodes. |
| Filesystem traffic data is collected at the node level, so the collected |
| values will only reflect a job's real traffic when the job has an exclusive |
| allocation of the node. When nodes are shared between jobs, the filesystem |
| traffic reported per job (through sstat or sacct) will not reflect the |
| traffic actually generated by each job. |
| |
| Configurable values at present are: |
| .IP |
| .RS |
| .TP 20 |
| \fBacct_gather_filesystem/lustre\fR |
| Lustre filesystem traffic data are collected from the counters found in |
| /proc/fs/lustre/. |
| In order to account for per job lustre traffic, add the "fs/lustre" TRES to |
| \fIAccountingStorageTRES\fR. |
| .RE |
| .IP |
| |
| .TP |
| \fBAcctGatherProfileType\fR |
| Identifies the plugin to be used for detailed job profiling. |
| The jobacct_gather plugin and slurmd daemon call this plugin to collect |
| detailed data such as I/O counts, memory usage, or energy consumption for jobs |
| and nodes. There are interfaces in this plugin to collect data at step start |
| and completion, task start and completion, and at the accounting gather |
| frequency. Data collected at the node level can only be attributed to a job |
| when the job has an exclusive allocation of the node. |
| |
| Configurable values at present are: |
| .IP |
| .RS |
| .TP 20 |
| \fBacct_gather_profile/hdf5\fR |
| This enables the HDF5 plugin. The directory where the profile files |
| are stored and which values are collected are configured in the |
| acct_gather.conf file. |
| .IP |
| |
| .TP |
| \fBacct_gather_profile/influxdb\fR |
| This enables the influxdb plugin. The influxdb instance host, port, database, |
| retention policy and which values are collected are configured in the |
| acct_gather.conf file. |
| .RE |
| .IP |
| |
| .TP |
| \fBAllowSpecResourcesUsage\fR |
| If set to "YES", Slurm allows individual jobs to override node's configured |
| CoreSpecCount value. For a job to take advantage of this feature, |
| a command line option of \-\-core\-spec must be specified. The default |
| value for this option is "YES" for Cray systems and "NO" for other system types. |
| .IP |
| |
| .TP |
| \fBAuthAltTypes\fR |
| Comma\-separated list of alternative authentication plugins that the slurmctld |
| will permit for communication. Acceptable values at present include |
| \fBauth/jwt\fR. |
| |
| \fBNOTE\fR: If \fBAuthAltParameters\fR is not used to specify a path to the |
| required jwt_hs256.key then slurmctld will default to looking for it in the |
| \fBStateSaveLocation\fR. |
| The jwt_hs256.key should only be visible to the SlurmUser and root. It is not |
| suggested to place the jwt_hs256.key on any nodes other than the machine running |
| \fBslurmctld\fR and the machine running \fBslurmdbd\fR. |
| \fBauth/jwt\fR can be activated by the presence of the \fISLURM_JWT\fR |
| environment variable. When activated, it will override the default |
| \fBAuthType\fR. |
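| For example, to allow JWT authentication alongside the default AuthType |
| (the key path is only illustrative): |
| .nf |
| AuthAltTypes=auth/jwt |
| AuthAltParameters=jwt_key=/var/spool/slurmctld/jwt_hs256.key |
| .fi |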
| .IP |
| |
| .TP |
| \fBAuthAltParameters\fR |
| Used to define options for the alternative authentication plugins. Multiple |
| options may be comma\-separated. |
| .IP |
| .RS |
| .TP 15 |
| \fBdisable_token_creation\fR |
| Disable "scontrol token" use by non\-SlurmUser accounts. |
| .TP |
| \fBmax_token_lifespan\fR=<seconds> |
| Set max lifespan (in seconds) for any token generated for user accounts. Limit |
| applies to all users except SlurmUser. Sites wishing to have per user limits |
| should generate tokens using JWT\-compatible tools and/or an authenticating |
| proxy, instead of using \fIscontrol token\fR. |
| .IP |
| |
| .TP |
| \fBjwks\fR= |
| Absolute path to JWKS file. Key should be owned by SlurmUser or root, must be |
| readable by SlurmUser, with suggested permissions of 0400. It must not be |
| writable by 'other'. |
| Only RS256 keys are supported, although other key types may be listed in the |
| file. If set, no HS256 key will be loaded by default (and token generation is |
| disabled), although the jwt_key setting may be used to explicitly re\-enable |
| HS256 key use (and token generation). |
| .IP |
| |
| .TP |
| \fBjwt_key\fR= |
| Absolute path to JWT key file. Key must be HS256. Key should be owned by |
| SlurmUser or root, must be readable by SlurmUser, with suggested permissions of |
| 0400. It must not be accessible by 'other'. |
| If not set, the default key file is jwt_hs256.key in \fIStateSaveLocation\fR. |
| .IP |
| |
| .TP |
| \fBuserclaimfield\fR= |
| Use an alternative claim field for the Slurm UserName \fBsun\fR field. This |
| option is designed to allow compatibility with tokens generated outside of |
| Slurm. (This field may also be known as a grant.) |
| .br |
| Default: (disabled) |
| .RE |
| .IP |
| |
| .TP |
| \fBAuthInfo\fR |
| Additional information to be used for authentication of communications |
| between the Slurm daemons (slurmctld and slurmd) and the Slurm |
| clients. The interpretation of this option is specific to the |
| configured \fBAuthType\fR. |
| Multiple options may be specified in a comma\-delimited list. |
| If not specified, the default authentication information will be used. |
| .IP |
| .RS |
| .TP 14 |
| \fBcred_expire\fR |
| Default job step credential lifetime, in seconds (e.g. "cred_expire=1200"). |
| It must be long enough to load the user environment, run the prolog, |
| handle the slurmd getting paged out of memory, etc. |
| This also controls how long a requeued job must wait before starting again. |
| The default value is 120 seconds. |
| .IP |
| |
| .TP |
| \fBsocket\fR |
| Path name to a MUNGE daemon socket to use |
| (e.g. "socket=/var/run/munge/munge.socket.2"). |
| The default value is "/var/run/munge/munge.socket.2". |
| Used by \fBauth/munge\fR and \fBcred/munge\fR. |
| .IP |
| |
| .TP |
| \fBttl\fR |
| Credential lifetime, in seconds (e.g. "ttl=300"). |
| The default value is dependent on the \fBAuthType\fR used. |
| For \fBauth/munge\fR, the default value is dependent upon the MUNGE |
| installation, but is typically 300 seconds. For \fBauth/slurm\fR, the default |
| value is 60 seconds. For \fBauth/jwt\fR, the default value is 1800 seconds. |
| .IP |
| |
| .TP |
| \fBuse_client_ids\fR |
| Allow the \fBauth/slurm\fR plugin to authenticate users without relying on |
| the user information from LDAP or the operating system. When coupled with |
| nss_slurm, the user information can be managed on the compute nodes by |
| slurmstepd. This would allow the cluster to operate in an environment where |
| only the login nodes have access to LDAP/OS user information. |
| See <https://slurm.schedmd.com/nss_slurm.html> for more information. |
| .RE |
| .IP |
| |
| .TP |
| \fBAuthType\fR |
| The authentication method for communications between Slurm |
| components. |
| All Slurm daemons and commands must be terminated prior to changing |
| the value of \fBAuthType\fR and later restarted. |
| Changes to this value will interrupt outstanding job steps and prevent them |
| from completing. |
| Acceptable values at present: |
| .RS |
| .TP |
| \fBauth/munge\fR |
| Indicates that MUNGE is to be used (default). |
| (See "https://dun.github.io/munge/" for more information). |
| .IP |
| |
| .TP |
| \fBauth/slurm\fR |
| Use Slurm's internal authentication plugin. |
| .RE |
| .IP |
| |
| .TP |
| \fBBackupAddr\fR |
| Deprecated option, see \fBSlurmctldHost\fR. |
| .IP |
| |
| .TP |
| \fBBackupController\fR |
| Deprecated option, see \fBSlurmctldHost\fR. |
| .IP |
| |
| .TP |
| \fBBatchStartTimeout\fR |
| The maximum time (in seconds) that a batch job is permitted to take to |
| launch before it is considered missing and its allocation is released. |
| The default value is 10 (seconds). Larger values may be |
| required if more time is needed to execute the \fBProlog\fR, load |
| user environment variables, or if the slurmd daemon gets paged from memory. |
| .br |
| .br |
| \fBNOTE\fR: The test for a job being successfully launched is only performed when |
| the Slurm daemon on the compute node registers state with the slurmctld daemon |
| on the head node, which happens fairly rarely. |
| Therefore a job will not necessarily be terminated if its start time exceeds |
| \fBBatchStartTimeout\fR. |
| This configuration parameter also applies to task launch and avoids aborting |
| \fBsrun\fR commands due to long\-running \fBProlog\fR scripts. |
| .IP |
| |
| .TP |
| \fBBcastExclude\fR |
| Comma\-separated list of absolute directory paths to be excluded when |
| autodetecting and broadcasting executable shared object dependencies through |
| \fBsbcast\fR or \fBsrun \-\-bcast\fR. The keyword "\fInone\fR" can be used to |
| indicate that no directory paths should be excluded. The default value is |
| "\fI/lib,/usr/lib,/lib64,/usr/lib64\fR". This option can be overridden by |
| \fBsbcast \-\-exclude\fR and \fBsrun \-\-bcast\-exclude\fR. |
| .IP |
| |
| .TP |
| \fBBcastParameters\fR |
| Controls sbcast and srun \-\-bcast behavior. Multiple options can be specified |
| in a comma separated list. |
| Supported values include: |
| .IP |
| .RS |
| .TP 15 |
| \fBDestDir\fR= |
| Destination directory for file being broadcast to allocated compute nodes. |
| Default value is current working directory, or \-\-chdir for srun if set. |
| .IP |
| |
| .TP |
| \fBCompression\fR= |
| Specify default file compression library to be used. |
| Supported values are "lz4" and "none". |
| The default value with the sbcast \-\-compress option is "lz4" and "none" otherwise. |
| Some compression libraries may be unavailable on some systems. |
| .IP |
| |
| .TP |
| \fBsend_libs\fR |
| If set, attempt to autodetect and broadcast the executable's shared object |
| dependencies to allocated compute nodes. The files are placed in a directory |
| alongside the executable. For \fBsrun\fR only, the \fBLD_LIBRARY_PATH\fR is |
| automatically updated to include this cache directory as well. |
| This can be overridden with either \fBsbcast\fR or \fBsrun\fR |
| \fB\-\-send\-libs\fR option. By default this is disabled. |
| .RE |
| .IP |
| |
| .TP |
| \fBBurstBufferType\fR |
| The plugin used to manage burst buffers. Unset by default. |
| Acceptable values at present are: |
| .IP |
| .RS |
| .TP |
| \fBburst_buffer/datawarp\fR |
| Use Cray DataWarp API to provide burst buffer functionality. |
| .IP |
| |
| .TP |
| \fBburst_buffer/lua\fR |
| This plugin provides hooks to an API that is defined by a Lua script. This |
| plugin was developed to provide system administrators with a way to do any task |
| (not only file staging) at different points in a job's life cycle. |
| .RE |
| .IP |
| |
| .TP |
| \fBCertgenParameters\fR |
| Comma\-separated list of options for the certgen plugin. |
| Supported values include: |
| .IP |
| .RS |
| .TP |
| \fBcertgen_script=\fR |
| Absolute path to executable script to generate self-signed TLS certificate. |
| The private key generated by \fBkeygen_script\fR is passed in as stdin, and |
| only the certificate PEM file should be printed to stdout. Must return 0 on |
| success, and non-zero on error. |
| .IP |
| |
| .TP |
| \fBkeygen_script=\fR |
| Absolute path to executable script to generate private key used later to |
| generate a self-signed certificate. Only the private key PEM file should be |
| printed to stdout, which will be later sent as stdin to \fBcertgen_script\fR. |
| Must return 0 on success, and non-zero on error. |
| .RE |
| .IP |
| |
| .TP |
| \fBCertgenType\fR |
| Specify the certgen plugin that will be used. |
| Acceptable values at present: |
| .IP |
| .RS |
| .TP |
| \fBcertgen/script\fR |
| Use built-in/configured scripts to generate certificate key pair. |
| .RE |
| .IP |
| |
| .TP |
| \fBCertmgrParameters\fR |
| Used to define parameters for the certmgr plugin. |
| .IP |
| .RS |
| .TP |
| \fBcertificate_renewal_period=\fR |
| slurmd/sackd will request a new signed certificate from slurmctld at this |
| specified interval (in minutes). |
| |
| Default is 1440 minutes (once per day). |
| .IP |
| |
| .TP |
| \fBgenerate_csr_script=\fR |
| Path to script used to generate certificate signing requests. The nodename is |
| passed in as an argument to the script. The script must print only the |
| certificate signing request PEM file to stdout, and return 0 on success. Must |
| return non-zero on error. |
| |
| Required with certmgr/script. Only run by daemons requesting certificates. |
| .IP |
| |
| .TP |
| \fBget_node_cert_key_script=\fR |
| Path to script used to get node's private key which was used to generate the |
| CSR returned by \fBgenerate_csr_script\fR. The nodename is passed in as an |
| argument to the script. The script must print the node's private key (PEM file) |
| to stdout. Must return 0 on success, and non-zero on error. |
| |
| Required with certmgr/script. Only run by daemons requesting certificates. |
| .IP |
| |
| .TP |
| \fBget_node_token_script=\fR |
| Path to script used to get node's unique token which will be validated by |
| slurmctld using the script set by \fBvalidate_node_script=\fR. |
| The nodename is passed in as an argument to the script. The script must print |
| the node's unique token to stdout, and return 0 on success. Must return |
| non-zero on error. |
| |
| Required with certmgr/script. Only run by daemons requesting certificates. |
| .IP |
| |
| .TP |
| \fBsign_csr_script=\fR |
| Path to script used to sign incoming certificate signing requests. |
| This script will only be called if \fBvalidate_node_script=\fR was |
| already called on the accompanying unique node token and returned with a |
| zero exit code. |
| The certificate signing request (as given by \fBgenerate_csr_script=\fR) is |
| passed as an argument to this script. |
| The script must print the new signed certificate to stdout, and return 0 on |
| success. Must return non-zero on error. |
| |
| Required with certmgr/script. Only run by slurmctld. |
| .IP |
| |
| .TP |
| \fBsingle_use_tokens\fR |
| Unique node tokens that are dynamically set (e.g. set via scontrol) will be |
| consumed upon successful certificate signing. |
| .IP |
| |
| .TP |
| \fBvalidate_node_script=\fR |
| Path to script used to validate a unique node token. |
| The unique node token is passed as an argument to this script. |
| If the script finds the node token to be valid, return 0. |
| Otherwise, if the node token is invalid, return non-zero. |
| |
| Required with certmgr/script. Only run by slurmctld. |
| .RE |
| .IP |
| |
| .TP |
| \fBCertmgrType\fR |
| Plugin used to dynamically renew TLS certificates for slurmd/sackd. |
| .RS |
| .TP |
| \fBcertmgr/script\fR |
| Use script hooks to implement certificate management. See |
| \fBCertmgrParameters\fR for details on how to set up these scripts. |
| .RE |
| .IP |
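| A minimal sketch of a script\-based certificate management configuration |
| (script paths are only illustrative; see \fBCertmgrParameters\fR above for |
| the full set of required scripts): |
| .nf |
| CertmgrType=certmgr/script |
| CertmgrParameters=sign_csr_script=/etc/slurm/sign_csr.sh,validate_node_script=/etc/slurm/validate_node.sh |
| .fi |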
| |
| .TP |
| \fBCliFilterParameters\fR |
| Extra parameters for cli_filter plugins. Multiple options may be |
| comma\-separated. Acceptable values at present are: |
| .IP |
| .RS |
| .TP |
| \fBcli_filter_lua_path\fR=\fI<path>\fR |
| Absolute path to the cli_filter.lua script to be used when cli_filter/lua is |
| enabled. If this is not defined, the default path will be used instead (same |
| path to slurm.conf). |
| |
| \fBNOTE\fR: The configured directory containing the cli_filter.lua script |
| should have 755 permissions, the script itself 644, and both should be owned |
| by SlurmdUser. |
| .RE |
| .IP |
| |
| .TP |
| \fBCliFilterPlugins\fR |
| A comma\-delimited list of command line interface option filter/modification |
| plugins. The specified plugins will be executed in the order listed. |
| No cli_filter plugins are used by default. Acceptable values at present are: |
| .IP |
| .RS |
| .TP |
| \fBcli_filter/lua\fR |
| This plugin allows you to write your own implementation of a cli_filter |
| using lua. |
| .IP |
| |
| .TP |
| \fBcli_filter/syslog\fR |
| This plugin enables logging of job submission activity. All the |
| salloc/sbatch/srun options are logged to syslog together with environment |
| variables in JSON format. If the plugin is not the last one in the list, it |
| may log values different from what was actually sent to slurmctld. |
| .IP |
| |
| .TP |
| \fBcli_filter/user_defaults\fR |
| This plugin looks for the file $HOME/.slurm/defaults and reads every line of it |
| as a \fIkey\fR=\fIvalue\fR pair, where \fIkey\fR is any of the job submission |
| options available to salloc/sbatch/srun and \fIvalue\fR is a default value |
| defined by the user. For instance: |
| .nf |
| time=1:30 |
| mem=2048 |
| .fi |
| The above will result in a user defined default for each of their jobs of |
| "\-t 1:30" and "\-\-mem=2048". |
| .RE |
| .IP |
| |
| .TP |
| \fBClusterName\fR |
| The name by which this Slurm managed cluster is known in the |
| accounting database. This is needed to distinguish accounting records |
| when multiple clusters report to the same database. Because of limitations |
| in some databases, any upper case letters in the name will be silently mapped |
| to lower case. In order to avoid confusion, it is recommended that the name |
| be lower case. The cluster name must be 40 characters or less in order to |
| comply with the limit on the maximum length for table names in MySQL/MariaDB. |
| .IP |
| |
| .TP |
| \fBCommunicationParameters\fR |
| Comma\-separated list of communication options (see the example following |
| this list). |
| .IP |
| .RS |
| .TP 15 |
| \fBblock_null_hash\fR |
| Require all Slurm authentication tokens to include a newer (20.11.9 and |
| 21.08.8) payload that provides an additional layer of security against |
| credential replay attacks. This option should only be enabled once all Slurm |
| daemons have been upgraded to 20.11.9/21.08.8 or newer, and all jobs that were |
| started before the upgrade have been completed. |
| .IP |
| |
| .TP |
| \fBhost_unreach_retry_count\fR=\# |
| When a node tries to connect() to another node, connect() may return an error |
| with EHOSTUNREACH if the host is unreachable. If this parameter is set, this |
| is the number of times that Slurm will retry making that connection. Slurm will |
| wait for 500 milliseconds in between each try. The default for this parameter |
| is zero (Slurm will not retry if EHOSTUNREACH is returned). |
| .IP |
| |
| .TP |
| \fBDisableIPv4\fR |
| Disable IPv4 only operation for all slurm daemons (except slurmdbd). This |
| should also be set in your \fBslurmdbd.conf\fR file. |
| .IP |
| |
| .TP |
| \fBEnableIPv6\fR |
| Enable using IPv6 addresses for all slurm daemons (except slurmdbd). When |
| using both IPv4 and IPv6, address family preferences will be based on your |
| /etc/gai.conf file. This should also be set in your \fBslurmdbd.conf\fR file. |
| .IP |
| |
| .TP |
| \fBgetnameinfo_cache_timeout\fR |
| When MUNGE is used as the AuthType, slurmctld uses getnameinfo() to obtain |
| the hostname from the IP address stored in the MUNGE credential. This |
| parameter controls the number of seconds slurmctld should cache the IP to |
| hostname resolution. When set to 0 the cache is disabled. The default value |
| is 60. |
| .IP |
| |
| .TP |
| \fBkeepaliveinterval\fR=\# |
| Specifies the interval, in seconds, between keepalive probes on idle |
| connections. |
| This affects connections between srun and its slurmstepd process as well as all |
| connections to the slurmdbd. |
| The default is to use the system default settings. |
| .IP |
| |
| .TP |
| \fBkeepaliveprobes\fR=\# |
| Specifies the number of unacknowledged keepalive probes sent before considering |
| the connection broken. |
| This affects connections between srun and its slurmstepd process as well as all |
| connections to the slurmdbd. |
| The default is to use the system default settings. |
| .IP |
| |
| .TP |
| \fBkeepalivetime\fR=\# |
| Specifies how long, in seconds, before a connection is marked as needing a |
| keepalive probe as well as how long to delay closing a connection to process |
| messages still in the queue. |
| This affects connections between srun and its slurmstepd process as well as all |
| connections to the slurmdbd. |
| Longer values can be used to improve reliability of communications in the event |
| of network failures. |
| The default is for keepalive to be disabled. |
| .IP |
| |
| .TP |
| \fBNoCtldInAddrAny\fR |
| Used to make the slurmctld bind directly to the address that its node name |
| resolves to, instead of binding to any address on the node, |
| which is the default. |
| .IP |
| |
| .TP |
| \fBNoInAddrAny\fR |
| Used to bind directly to the address that the node name resolves to, instead |
| of binding to any address on the node, which is the default. |
| This option applies to all daemons/clients except for the slurmctld. |
| .RE |
| .IP |
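| For example: |
| .nf |
| CommunicationParameters=block_null_hash,keepalivetime=60 |
| .fi |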
| |
| .TP |
| \fBCompleteWait\fR |
| The time to wait, in seconds, when any job is in the COMPLETING state |
| before any additional jobs are scheduled. This is to attempt to keep jobs on |
| nodes that were recently in use, with the goal of preventing fragmentation. |
| If set to zero, pending jobs will be started as soon as possible. |
| Since a COMPLETING job's resources are released for use by other |
| jobs as soon as the \fBEpilog\fR completes on each individual node, |
| this can result in very fragmented resource allocations. |
| To provide jobs with the minimum response time, a value of zero is |
| recommended (no waiting). |
| To minimize fragmentation of resources, a value equal to \fBKillWait\fR |
| plus two is recommended. |
| In that case, setting \fBKillWait\fR to a small value may be beneficial. |
| The default value of \fBCompleteWait\fR is zero seconds. |
| The value may not exceed 65533. |
| |
| \fBNOTE\fR: Setting \fBreduce_completing_frag\fR affects the behavior |
| of \fBCompleteWait\fR. |
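| For example, with "KillWait=30" the recommendation above yields: |
| .nf |
| CompleteWait=32 |
| .fi |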
| .IP |
| |
| .TP |
| \fBControlAddr\fR |
| Deprecated option, see \fBSlurmctldHost\fR. |
| .IP |
| |
| .TP |
| \fBControlMachine\fR |
| Deprecated option, see \fBSlurmctldHost\fR. |
| .IP |
| |
| .TP |
| \fBCpuFreqDef\fR |
| Default CPU governor to use when running a job step if it has not been |
| explicitly set with the \-\-cpu\-freq option. Acceptable values at present |
| include one of the following governors: |
| .IP |
| .RS |
| .TP 14 |
| \fBConservative\fR |
| attempts to use the Conservative CPU governor |
| .IP |
| |
| .TP |
| \fBOnDemand\fR |
| attempts to use the OnDemand CPU governor |
| .IP |
| |
| .TP |
| \fBPerformance\fR |
| attempts to use the Performance CPU governor |
| .IP |
| |
| .TP |
| \fBPowerSave\fR |
| attempts to use the PowerSave CPU governor |
| .TP |
| Default: Use system default. No attempt to set the governor is made if the |
| \-\-cpu\-freq option has not been specified. |
| .RE |
| .IP |
| |
| .TP |
| \fBCpuFreqGovernors\fR |
| List of CPU frequency governors allowed to be set with the salloc, sbatch, or |
| srun option \-\-cpu\-freq. |
| Acceptable values at present include: |
| .IP |
| .RS |
| .TP 14 |
| \fBConservative\fR |
| attempts to use the Conservative CPU governor |
| .IP |
| |
| .TP |
| \fBOnDemand\fR |
| attempts to use the OnDemand CPU governor (a default value) |
| .IP |
| |
| .TP |
| \fBPerformance\fR |
| attempts to use the Performance CPU governor (a default value) |
| .IP |
| |
| .TP |
| \fBPowerSave\fR |
| attempts to use the PowerSave CPU governor |
| .IP |
| |
| .TP |
| \fBSchedUtil\fR |
| attempts to use the SchedUtil CPU governor |
| .IP |
| |
| .TP |
| \fBUserSpace\fR |
| attempts to use the UserSpace CPU governor (a default value) |
| .TP |
| Default: OnDemand, Performance and UserSpace. |
| .RE |
| .IP |
| |
| .TP |
| \fBCredType\fR |
| The cryptographic signature tool to be used in the creation of |
| job step credentials. |
| Acceptable values at present are: |
| .RS |
| .TP |
| \fBcred/munge\fR |
| Indicates that Munge is to be used (default). |
| .IP |
| |
| .TP |
| \fBcred/slurm\fR |
| Use Slurm's internal credential format. |
| .RE |
| .IP |
| |
| .TP |
| \fBDataParserParameters\fR=<\fIdata_parser\fR> |
| Apply default value for data_parser plugin parameters. See \fI\-\-json\fR or |
| \fI\-\-yaml\fR arguments in \fBsacct\fR(1), \fBscontrol\fR(1), \fBsinfo\fR(1), |
| \fBsqueue\fR(1), \fBsacctmgr\fR(1), \fBsdiag\fR(1), and \fBsshare\fR(1). |
| .br |
| Default: Latest data_parser plugin version with no flags selected. |
| .IP |
| |
| .TP |
| \fBDebugFlags\fR |
| Defines specific subsystems which should provide more detailed event logging. |
| Multiple subsystems can be specified with comma separators. |
| Most DebugFlags will result in additional logging messages for the identified |
| subsystems if \fBSlurmctldDebug\fR is at 'verbose' or higher. |
| More logging may impact performance. |
| |
| \fBNOTE\fR: You can also set debug flags by having the \fBSLURM_DEBUG_FLAGS\fR |
| environment variable defined with the desired flags when the process (client |
| command, daemon, etc.) is started. |
| The environment variable takes precedence over the setting in the slurm.conf. |
| |
| Valid subsystems include the following (see the example after this list): |
| .IP |
| .RS |
| .TP 17 |
| \fBAccrue\fR |
| Accrue counters accounting details |
| .IP |
| |
| .TP |
| \fBAgent\fR |
| RPC agents (outgoing RPCs from Slurm daemons) |
| .IP |
| |
| .TP |
| \fBAuditRPCs\fR |
| For all inbound RPCs to slurmctld, print the originating address, authenticated |
| user, and RPC type before the connection is processed. |
| .IP |
| |
| .TP |
| \fBAuditTLS\fR |
| Print TLS certificates being used |
| .IP |
| |
| .TP |
| \fBBackfill\fR |
| Backfill scheduler details |
| .IP |
| |
| .TP |
| \fBBackfillMap\fR |
| Backfill scheduler to log a very verbose map of reserved resources through |
| time. Combine with \fBBackfill\fR for a verbose and complete view of the |
| backfill scheduler's work. |
| .IP |
| |
| .TP |
| \fBBurstBuffer\fR |
| Burst Buffer plugin |
| .IP |
| |
| .TP |
| \fBCgroup\fR |
| Cgroup details |
| .IP |
| |
| .TP |
| \fBConMgr\fR |
| Connection manager details |
| .IP |
| |
| .TP |
| \fBCPU_Bind\fR |
| CPU binding details for jobs and steps |
| .IP |
| |
| .TP |
| \fBCpuFrequency\fR |
| Cpu frequency details for jobs and steps using the \-\-cpu\-freq option. |
| .IP |
| |
| .TP |
| \fBData\fR |
| Generic data structure details. |
| .IP |
| |
| .TP |
| \fBDBD_Agent\fR |
| RPC agent (outgoing RPCs to the DBD) |
| .IP |
| |
| .TP |
| \fBDependency\fR |
| Job dependency debug info |
| .IP |
| |
| .TP |
| \fBElasticsearch\fR |
| Elasticsearch debug info (deprecated). Alias of \fBJobComp\fR. |
| .IP |
| |
| .TP |
| \fBEnergy\fR |
| AcctGatherEnergy debug info |
| .IP |
| |
| .TP |
| \fBFederation\fR |
| Federation scheduling debug info |
| .IP |
| |
| .TP |
| \fBGres\fR |
| Generic resource details |
| .IP |
| |
| .TP |
| \fBHetjob\fR |
| Heterogeneous job details |
| .IP |
| |
| .TP |
| \fBGang\fR |
| Gang scheduling details |
| .IP |
| |
| .TP |
| \fBGLOB_SILENCE\fR |
| Do not display error messages about glob "*" symbols in conf files. |
| .IP |
| |
| .TP |
| \fBJobAccountGather\fR |
| Common job account gathering details (not plugin specific). |
| .IP |
| |
| .TP |
| \fBJobComp\fR |
| Job Completion plugin details |
| .IP |
| |
| .TP |
| \fBJobContainer\fR |
| Job container plugin details |
| .IP |
| |
| .TP |
| \fBLicense\fR |
| License management details |
| .IP |
| |
| .TP |
| \fBNetwork\fR |
| Network details. \fBWarning\fR: activating this flag may cause logging of |
| passwords, tokens or other authentication credentials. |
| .IP |
| |
| .TP |
| \fBNetworkRaw\fR |
| Dump raw hex values of key Network communications. \fBWarning\fR: This flag |
| will cause very verbose logs and may cause logging of passwords, tokens or |
| other authentication credentials. |
| .IP |
| |
| .TP |
| \fBNodeFeatures\fR |
| Node Features plugin debug info |
| .IP |
| |
| .TP |
| \fBNO_CONF_HASH\fR |
| Do not log when the slurm.conf files differ between Slurm daemons |
| .IP |
| |
| .TP |
| \fBPower\fR |
| Power management plugin and power save (suspend/resume programs) details |
| .IP |
| |
| .TP |
| \fBPriority\fR |
| Job prioritization |
| .IP |
| |
| .TP |
| \fBProfile\fR |
| AcctGatherProfile plugins details |
| .IP |
| |
| .TP |
| \fBProtocol\fR |
| Communication protocol details |
| .IP |
| |
| .TP |
| \fBReservation\fR |
| Advanced reservations |
| .IP |
| |
| .TP |
| \fBRoute\fR |
| Message forwarding debug info |
| .IP |
| |
| .TP |
| \fBScript\fR |
| Debug info regarding any script called by Slurm. This includes slurmctld |
| executed scripts such as PrologSlurmctld and EpilogSlurmctld. |
| .IP |
| |
| .TP |
| \fBSelectType\fR |
| Resource selection plugin |
| .IP |
| |
| .TP |
| \fBSteps\fR |
| Slurmctld resource allocation for job steps |
| .IP |
| |
| .TP |
| \fBSwitch\fR |
| Switch plugin |
| .IP |
| |
| .TP |
| \fBTLS\fR |
| TLS plugin |
| .IP |
| |
| .TP |
| \fBTraceJobs\fR |
| Trace jobs in slurmctld. It will print detailed job information |
| including state, job IDs and allocated node counts. |
| .IP |
| |
| .TP |
| \fBTriggers\fR |
| Slurmctld triggers |
| .RE |
| .IP |
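| For example, to get detailed backfill scheduler logging: |
| .nf |
| DebugFlags=Backfill,BackfillMap |
| .fi |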
| |
| .TP |
| \fBDefCpuPerGPU\fR |
| Default count of CPUs allocated per allocated GPU. This value is used only if |
| the job specified neither \-\-cpus\-per\-task nor \-\-cpus\-per\-gpu. |
| .IP |
| |
| .TP |
| \fBDefMemPerCPU\fR |
| Default real memory size available per usable allocated CPU in megabytes. |
| Used to avoid over\-subscribing memory and causing paging. |
| \fBDefMemPerCPU\fR would generally be used if individual processors |
| are allocated to jobs (\fBSelectType=select/cons_tres\fR). |
| The default value is 0 (unlimited). |
| Also see \fBDefMemPerGPU\fR, \fBDefMemPerNode\fR and \fBMaxMemPerCPU\fR. |
| \fBDefMemPerCPU\fR, \fBDefMemPerGPU\fR and \fBDefMemPerNode\fR are |
| mutually exclusive. |
| |
| |
| \fBNOTE\fR: This applies to \fBusable\fR allocated CPUs in a job allocation. |
| This is important when more than one thread per core is configured. |
| If a job requests \-\-threads\-per\-core with fewer threads on a core than |
| exist on the core (or \-\-hint=nomultithread which implies |
| \-\-threads\-per\-core=1), the job will be unable to use those extra threads on |
| the core and those threads will not be included in the memory per CPU |
| calculation. But if the job has access to all threads on the core, those threads |
| will be included in the memory per CPU calculation even if the job did not |
| explicitly request those threads. |
| |
| In the following examples, each core has two threads. |
| |
| In this first example, two tasks can run on separate hyperthreads |
| in the same core because \-\-threads\-per\-core is not used. The |
| third task uses both threads of the second core. The allocated |
| memory per cpu includes all threads: |
| |
| .nf |
| .ft B |
| $ salloc \-n3 \-\-mem\-per\-cpu=100 |
| salloc: Granted job allocation 17199 |
| $ sacct \-j $SLURM_JOB_ID \-X \-o jobid%7,reqtres%35,alloctres%35 |
| JobID ReqTRES AllocTRES |
| \-\-\-\-\-\-\- \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- |
| 17199 billing=3,cpu=3,mem=300M,node=1 billing=4,cpu=4,mem=400M,node=1 |
| .ft |
| .fi |
| |
| In this second example, because of \-\-threads\-per\-core=1, each |
| task is allocated an entire core but is only able to use one |
| thread per core. Allocated CPUs includes all threads on each |
| core. However, allocated memory per cpu includes only the |
| usable thread in each core. |
| |
| .nf |
| .ft B |
| $ salloc \-n3 \-\-mem\-per\-cpu=100 \-\-threads\-per\-core=1 |
| salloc: Granted job allocation 17200 |
| $ sacct \-j $SLURM_JOB_ID \-X \-o jobid%7,reqtres%35,alloctres%35 |
| JobID ReqTRES AllocTRES |
| \-\-\-\-\-\-\- \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- |
| 17200 billing=3,cpu=3,mem=300M,node=1 billing=6,cpu=6,mem=300M,node=1 |
| .ft |
| .fi |
| .IP |
| |
| .TP |
| \fBDefMemPerGPU\fR |
| Default real memory size available per allocated GPU in megabytes. |
| The default value is 0 (unlimited). |
| Please note a best effort attempt is made to predict which GPUs on the system |
| will be used, but this could change between job submission and start time, |
| causing \fBMaxMemPerNode\fR to potentially not work as expected for |
| heterogeneous jobs. |
| Also see \fBDefMemPerCPU\fR and \fBDefMemPerNode\fR. |
| \fBDefMemPerCPU\fR, \fBDefMemPerGPU\fR and \fBDefMemPerNode\fR are |
| mutually exclusive. |
| .IP |
| |
| .TP |
| \fBDefMemPerNode\fR |
| Default real memory size available per allocated node in megabytes. |
| Used to avoid over\-subscribing memory and causing paging. |
| \fBDefMemPerNode\fR would generally be used if whole nodes |
| are allocated to jobs (\fBSelectType=select/linear\fR) and |
| resources are over\-subscribed (\fBOverSubscribe=yes\fR or |
| \fBOverSubscribe=force\fR). |
| The default value is 0 (unlimited). |
| Also see \fBDefMemPerCPU\fR, \fBDefMemPerGPU\fR and \fBMaxMemPerCPU\fR. |
| \fBDefMemPerCPU\fR, \fBDefMemPerGPU\fR and \fBDefMemPerNode\fR are |
| mutually exclusive. |
| .IP |
| |
| .TP |
| \fBDependencyParameters\fR |
| Multiple options may be comma\-separated (see the example following this |
| list). |
| .IP |
| .RS |
| .TP |
| \fBdisable_remote_singleton\fR |
| By default, when a federated job has a singleton dependency, each cluster in the |
| federation must clear the singleton dependency before the job's singleton |
| dependency is considered satisfied. Enabling this option means that only the |
| origin cluster must clear the singleton dependency. This option must be set |
| in every cluster in the federation. |
| .IP |
| |
| .TP |
| \fBkill_invalid_depend\fR |
| If a job has an invalid dependency that can never be satisfied, terminate the |
| job and set its state to JOB_CANCELLED. By default the job stays pending |
| with the reason DependencyNeverSatisfied. |
| .IP |
| |
| .TP |
| \fBmax_depend_depth\fR=\# |
| Maximum number of jobs to test for a circular job dependency. Stop testing |
| after this number of job dependencies have been tested. The default value is |
| 10 jobs. |
| .RE |
| .IP |
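| For example: |
| .nf |
| DependencyParameters=kill_invalid_depend,max_depend_depth=20 |
| .fi |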
| |
| .TP |
| \fBDisableRootJobs\fR |
| If set to "YES" then user root will be prevented from running any jobs. |
| The default value is "NO", meaning user root will be able to execute jobs. |
| \fBDisableRootJobs\fR may also be set by partition. |
| .IP |
| |
| .TP |
| \fBEioTimeout\fR |
| The number of seconds srun waits for slurmstepd to close the TCP/IP |
| connection used to relay data between the user application and srun |
| when the user application terminates. The default value is 60 seconds. |
| May not exceed 65533. |
| .IP |
| |
| .TP |
| \fBEnforcePartLimits\fR |
| Controls whether partition limits are enforced when a job is submitted to the |
| cluster. The partition limits being considered by this option are its |
| configured MaxMemPerCPU, MaxMemPerNode, MinNodes, MaxNodes, MaxTime, AllocNodes, |
| AllowAccounts, AllowGroups, AllowQOS, and QOS usage threshold. It also considers |
| if the job requests more nodes than exist in the partition. If set, then a |
| job and job QOS cannot be submitted that exceed partition limits. |
| .IP |
| .RS |
| .TP |
| \fBALL\fR |
| Jobs which exceed the number of nodes in a partition and/or any of its |
| configured limits will be rejected at submission time. If the job is submitted |
| to multiple partitions, the job must satisfy the limits on all the requested |
| partitions. |
| .IP |
| |
| .TP |
| \fBANY\fR |
| Jobs will be accepted if they satisfy the limits on at least one of the |
| requested partitions. |
| .IP |
| |
| .TP |
| \fBNO\fR |
| Partition limits will not be enforced at submit time, but will still be enforced |
| during scheduling. This includes jobs that request more nodes than exist in |
| any of the partitions, so jobs can be submitted to empty partitions. A job that |
| exceeds the limits on all requested partitions will remain queued until the |
| partition limits are altered. This is the default. |
| .RE |
| .IP |
| |
| .TP |
| \fBEpilog\fR |
| Pathname of a script to execute as user root on every node when a user's job |
| completes (e.g. "/usr/local/slurm/epilog"). If it is not an absolute path name |
| (i.e. it does not start with a slash), it will be searched for in the same |
| directory as the slurm.conf file. A glob pattern (See \fBglob\fR (7)) may also |
| be used to run more than one epilog script (e.g. "/etc/slurm/epilog.d/*"). |
| When more than one epilog script is configured, they are executed in reverse |
| alphabetical order (z-a -> Z-A -> 9-0). The Epilog script(s) may be used |
| to purge files, disable user login, etc. |
| By default there is no epilog. |
| See \fBProlog and Epilog Scripts\fR for more information. |
| |
| \fBNOTE\fR: It is possible to configure multiple epilog scripts by including |
| this option on multiple lines. |
| .IP |
| |
| .TP |
| \fBEpilogMsgTime\fR |
| The number of microseconds that the slurmctld daemon requires to process |
| an epilog completion message from the slurmd daemons. This parameter can |
| be used to prevent a burst of epilog completion messages from being sent |
| at the same time which should help prevent lost messages and improve |
| throughput for large jobs. |
| The default value is 2000 microseconds. |
| For a 1000 node job, this spreads the epilog completion messages out over |
| two seconds. |
| .IP |
| |
| .TP |
| \fBEpilogSlurmctld\fR |
| Fully qualified pathname of a program for the slurmctld to execute |
| upon termination of a job allocation (e.g. |
| "/usr/local/slurm/epilog_controller"). |
| The program executes as SlurmUser, which gives it permission to drain |
| nodes and requeue the job if a failure occurs (See scontrol(1)). |
| Exactly what the program does and how it accomplishes this is completely at |
| the discretion of the system administrator. |
| Information about the job being initiated, its allocated nodes, etc. are |
| passed to the program using environment variables. |
| See \fBProlog and Epilog Scripts\fR for more information. |
| |
| \fBNOTE\fR: It is possible to configure multiple EpilogSlurmctld scripts by |
| including this option on multiple lines. |
| .IP |
| |
| .TP |
| \fBEpilogTimeout\fR |
| The interval in seconds Slurm waits for Epilog scripts before terminating |
| them. The default value is \fBPrologEpilogTimeout\fR. This interval applies |
| to the Epilog run by the slurmd daemon after the job, the EpilogSlurmctld |
| run by the slurmctld daemon, and the SPANK plugin epilog call: |
| slurm_spank_job_epilog. |
| .br |
| If the Epilog or slurm_spank_job_epilog time out, the node is drained. |
| In all cases, errors are logged. |
| .IP |
| |
| .TP |
| \fBFairShareDampeningFactor\fR |
| Dampen the effect of exceeding a user or group's fair share of allocated |
| resources. Higher values will provide greater ability to differentiate |
| between exceeding the fair share at high levels (e.g. a value of 1 results |
| in almost no difference between overconsumption by a factor of 10 and 100, |
| while a value of 5 will result in a significant difference in priority). |
| The default value is 1. |
| .IP |
| |
| .TP |
| \fBFederationParameters\fR |
| Used to define federation options. Multiple options may be comma separated. |
| .IP |
| .RS |
| .TP |
| \fBfed_display\fR |
| If set, then the client status commands (e.g. squeue, sinfo, sprio, etc.) will |
| display information in a federated view by default. This option is functionally |
| equivalent to using the \-\-federation options on each command. Use the client's |
| \-\-local option to override the federated view and get a local view of the |
| given cluster. |
| |
| Allow client commands to use the \-\-cluster option even when the \fBslurmdbd\fR |
| is down by retrieving cluster records from \fBslurmctld\fR instead. |
| .RE |
| .IP |
| |
| .TP |
| \fBFirstJobId\fR |
| The job id to be used for the first job submitted to Slurm. |
| Generated job id values will be incremented by 1 for each subsequent job. |
| Value must be larger than 0. The default value is 1. |
| Also see \fBMaxJobId\fR. |
| .IP |
| |
| .TP |
| \fBGresTypes\fR |
| A comma\-delimited list of generic resources to be managed (e.g. |
| \fIGresTypes=gpu,mps\fR). |
| These resources may have an associated GRES plugin of the same name providing |
| additional functionality. |
| No generic resources are managed by default. |
| Ensure this parameter is consistent across all nodes in the cluster for |
| proper operation. |
| .IP |
| |
| .TP |
| \fBGroupUpdateForce\fR |
| If set to a non\-zero value, then information about which users are members |
| of groups allowed to use a partition will be updated periodically, even when |
| there have been no changes to the /etc/group file. |
| If set to zero, group member information will be updated only after the |
| /etc/group file is updated. |
| The default value is 1. |
| Also see the \fBGroupUpdateTime\fR parameter. |
| .IP |
| |
| .TP |
| \fBGroupUpdateTime\fR |
| Controls how frequently information about which users are members of |
| groups allowed to use a partition will be updated, and how long user |
| group membership lists will be cached. |
| The time interval is given in seconds with a default value of 600 seconds. |
| A value of zero will prevent periodic updating of group membership information. |
| Also see the \fBGroupUpdateForce\fR parameter. |
| .IP |
| |
| .TP |
| \fBGpuFreqDef\fR=[<\fItype\fR>=]<\fIvalue\fR>[,<\fItype\fR>=<\fIvalue\fR>] |
| Default GPU frequency to use when running a job step if it |
| has not been explicitly set using the \-\-gpu\-freq option. |
| This option can be used to independently configure the GPU and its memory |
| frequencies. |
| There is no default value. If this parameter is unset and the \-\-gpu\-freq |
| option has not been used, no attempt to change the GPU frequency is made. |
| After the job is completed, the frequencies of all affected GPUs will be reset |
| to the highest possible values. |
| In some cases, system power caps may override the requested values. |
| The field \fItype\fR can be "memory". |
| If \fItype\fR is not specified, the GPU frequency is implied. |
| The \fIvalue\fR field can either be "low", "medium", "high", "highm1" or |
| a numeric value in megahertz (MHz). |
| If the specified numeric value is not possible, a value as close as |
| possible will be used. |
| See below for definition of the values. |
| Examples of use include "GpuFreqDef=medium,memory=high" and "GpuFreqDef=450". |
| |
| Supported \fIvalue\fR definitions: |
| .IP |
| .RS |
| .TP 10 |
| \fBlow\fR |
| the lowest available frequency. |
| .IP |
| |
| .TP |
| \fBmedium\fR |
| attempts to set a frequency in the middle of the available range. |
| .IP |
| |
| .TP |
| \fBhigh\fR |
| the highest available frequency. |
| .IP |
| |
| .TP |
| \fBhighm1\fR |
| (high minus one) will select the next highest available frequency. |
| .RE |
| .IP |
| |
| .TP |
| \fBHashPlugin\fR |
| Identifies the type of hash plugin to use for network communication. |
| Acceptable values include: |
| |
| .IP |
| .RS |
| .TP 15 |
| \fBhash/k12\fR |
| Hashes are generated by the KangarooTwelve cryptographic hash function. |
| This is the default. |
| .IP |
| |
| .TP |
| \fBhash/sha3\fR |
| Hashes are generated by the SHA-3 cryptographic hash function. |
| .RE |
| .IP |
| |
| \fBNOTE\fR: Make sure that HashPlugin has the same value both in slurm.conf |
| and in slurmdbd.conf. |
| |
| .TP |
| \fBHealthCheckInterval\fR |
| The interval in seconds between executions of \fBHealthCheckProgram\fR. |
| The default value is zero, which disables execution. |
| .IP |
| |
| .TP |
| \fBHealthCheckNodeState\fR |
| Identify what node states should execute the \fBHealthCheckProgram\fR. |
| Multiple state values may be specified with a comma separator. |
| The default value is ANY to execute on nodes in any state. |
| .IP |
| .RS |
| .TP 12 |
| \fBALLOC\fR |
| Run on nodes in the ALLOC state (all CPUs allocated). |
| .IP |
| |
| .TP |
| \fBANY\fR |
| Run on nodes in any state. |
| .IP |
| |
| .TP |
| \fBCYCLE\fR |
| Rather than running the health check program on all nodes at the same time, |
| cycle through running on all compute nodes through the course of the |
| \fBHealthCheckInterval\fR. May be combined with the various node state |
| options. |
| .IP |
| |
| .TP |
| \fBIDLE\fR |
| Run on nodes in the IDLE state. |
| .IP |
| |
| .TP |
| \fBNONDRAINED_IDLE\fR |
| Run on nodes that are in the IDLE state and not DRAINED. |
| .IP |
| |
| .TP |
| \fBMIXED\fR |
| Run on nodes in the MIXED state (some CPUs idle and other CPUs allocated). |
| .IP |
| |
| .TP |
| \fBSTART_ONLY\fR |
| Run only at slurmd startup. |
| .RE |
| .IP |
| |
| .TP |
| \fBHealthCheckProgram\fR |
| Fully qualified pathname of a script to execute as user root periodically |
| on all compute nodes that are \fBnot\fR in the NOT_RESPONDING state. This |
| program may be used to verify the node is fully operational and DRAIN the node |
| or send email if a problem is detected. |
| Any action to be taken must be explicitly performed by the program |
| (e.g. execute |
| "scontrol update NodeName=foo State=drain Reason=tmp_file_system_full" |
| to drain a node). |
| The execution interval is controlled using the \fBHealthCheckInterval\fR |
| parameter. |
| Note that the \fBHealthCheckProgram\fR will be executed at the same time |
| on all nodes to minimize its impact upon parallel programs. |
| This program will be killed if it does not terminate normally within |
| 60 seconds. |
| This program will also be executed when the slurmd daemon is first started and |
| before it registers with the slurmctld daemon. If \fBHealthCheckNodeState\fR is |
| \fBSTART_ONLY\fR it will be executed only when the slurmd daemon is first |
| started. |
| By default, no program will be executed. |
| .IP |
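| As an illustrative sketch only (the script path and interval are hypothetical), |
| a site could combine the health check options to run a check on idle nodes, |
| staggered across the interval: |
| .nf |
| HealthCheckProgram=/usr/local/sbin/node_health.sh |
| HealthCheckInterval=300 |
| HealthCheckNodeState=IDLE,CYCLE |
| .fi |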
| |
| .TP |
| \fBHttpParserType\fR |
| Specify the http_parser implementation that will be used. Default is |
| \fIhttp_parser/libhttp_parser\fR. |
| Acceptable values at present: |
| .IP |
| .RS |
| .TP |
| \fBhttp_parser/libhttp_parser\fR |
| Use the libhttp_parser based plugin. |
| .RE |
| .IP |
| |
| .TP |
| \fBInactiveLimit\fR |
| The interval, in seconds, after which a non\-responsive job allocation |
| command (e.g. \fBsrun\fR or \fBsalloc\fR) will result in the job being |
| terminated. If the node on which the command is executed fails or the |
| command abnormally terminates, this will terminate its job allocation. |
| This option has no effect upon batch jobs. |
| When setting a value, take into consideration that a debugger using \fBsrun\fR |
| to launch an application may leave the \fBsrun\fR command in a stopped state |
| for extended periods of time. |
| This limit is ignored for jobs running in partitions with the |
| \fBRootOnly\fR flag set (the scheduler running as root will be |
| responsible for the job). |
| The default value is unlimited (zero) and may not exceed 65533 seconds. |
| .IP |
| |
| .TP |
| \fBInteractiveStepOptions\fR |
| When LaunchParameters=use_interactive_step is enabled, launching salloc will |
| automatically start an srun process with InteractiveStepOptions to launch |
| a terminal on a node in the job allocation. |
| The default value is "\-\-interactive \-\-preserve\-env \-\-pty $SHELL". |
| The "\-\-interactive" option is intentionally not documented in the srun man |
| page. It is meant only to be used in \fBInteractiveStepOptions\fR in order to |
| create an "interactive step" that will not consume resources so that other |
| steps may run in parallel with the interactive step. |
| .IP |
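| As a sketch of one possible configuration (the options shown simply restate |
| the documented default), salloc can be made to open a shell on an allocated |
| node rather than locally: |
| .nf |
| LaunchParameters=use_interactive_step |
| InteractiveStepOptions="\-\-interactive \-\-preserve\-env \-\-pty $SHELL" |
| .fi |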
| |
| .TP |
| \fBJobAcctGatherType\fR |
| The JobAcctGather plugin collects memory, cpu, io, interconnect, energy and gpu |
| usage information at the task level, depending on which plugins are configured |
| in Slurm. This parameter will control how some of these metrics will be |
| collected. Unset by default. |
| |
| Configurable values at present are: |
| .IP |
| .RS |
| .TP 20 |
| \fBjobacct_gather/cgroup\fR (recommended) |
| Collect cpu and memory statistics by reading the task's cgroup directory |
| interfaces (e.g. memory.stat, cpu.stat) by issuing a call to the configured |
| CgroupPlugin (see "man cgroup.conf"). |
| This mechanism ignores JobAcctGatherParams=UsePss or NoShared since these are |
| used only when reading memory usage from the proc filesystem. |
| .IP |
| |
| .TP |
| \fBjobacct_gather/linux\fR |
| Collect cpu and memory statistics by reading procfs. The plugin will take all |
| the pids of the task and for each of them will read /proc/<pid>/stat. If UsePss |
| is set it will also read /proc/<pid>/smaps, and if NoShared is set it will also |
| read /proc/<pid>/statm (see \fBJobAcctGatherParams\fR for more information). |
| |
| This plugin carries a performance penalty on jobs with a large number of spawned |
| processes since it needs to iterate over all the task pids and aggregate the |
| stats into one single metric for the ppid, and then these values need to be |
| aggregated to the task stats. |
| .RE |
| .IP |
| |
| \fBNOTE\fR: Changing the plugin type when jobs are running in the cluster is |
| possible. The already running steps will keep using the previous plugin |
| mechanism, while new steps will use the new mechanism. |
| .IP |
| |
| .TP |
| \fBJobAcctGatherFrequency\fR |
| The job accounting and profiling sampling intervals, specified for each data |
| type. Multiple comma\-separated \fB<datatype>=<interval>\fR intervals may be |
| specified. If an interval is provided without a datatype, it will be assigned |
| to the \fBtask\fR datatype. Supported datatypes are as follows: |
| .IP |
| .RS |
| .TP 12 |
| Affects accounting and profiling: |
| .IP |
| .RS |
| |
| .TP |
| \fBtask\fR=<\fIinterval\fR> |
| sampling interval in seconds for task usage by the jobacct_gather plugins and |
| for task profiling by the acct_gather_profile plugin. |
| Defaults to 30. |
| .br |
| .br |
| If this interval is 0 (disabled), accounting information is collected only at |
| job termination, which reduces Slurm |
| interference with the job, but also means that the statistics about a job |
| are only derived from a single sample and don't reflect the average or maximum |
| of several samples throughout the life of the job. |
| .IP |
| .RE |
| |
| .TP |
| Affects profiling only: |
| .IP |
| .RS |
| |
| .TP |
| \fBenergy\fR=<\fIinterval\fR> |
| sampling interval in seconds for energy profiling using the acct_gather_energy |
| plugin. Defaults to 0 (disabled). |
| .IP |
| |
| .TP |
| \fBnetwork\fR=<\fIinterval\fR> |
| sampling interval in seconds for infiniband profiling using the |
| acct_gather_interconnect plugin. Defaults to 0 (disabled). |
| .IP |
| |
| .TP |
| \fBfilesystem\fR=<\fIinterval\fR> |
| sampling interval in seconds for filesystem profiling using the |
| acct_gather_filesystem plugin. Defaults to 0 (disabled). |
| .IP |
| .br |
| .RE |
| .RE |
| .IP |
| Smaller (non\-zero) values have a greater impact upon job performance, |
| but a value of 30 seconds is not likely to be noticeable for |
| applications having less than 10,000 tasks. |
| .br |
| .br |
| Users can independently override each interval on a per job basis using the |
| \fB\-\-acctg\-freq\fR option when submitting the job. |
| .br |
| This value should be lower than or equal to \fBEnergyIPMIFreq\fR when using |
| \fIacct_gather_energy/ipmi\fR or xcc plugins as otherwise it will unnecessarily |
| get repeated values on successive polls. |
| .IP |
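| A minimal sketch combining the gathering plugin and sampling intervals |
| described above (the interval values are illustrative): |
| .nf |
| JobAcctGatherType=jobacct_gather/cgroup |
| JobAcctGatherFrequency=task=30,energy=60 |
| .fi |
| A user could still override these intervals for an individual job, e.g. with |
| "\-\-acctg\-freq=task=10". |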
| |
| .TP |
| \fBJobAcctGatherParams\fR |
| Arbitrary parameters for the job account gather plugin. |
| Acceptable values at present include: |
| .IP |
| .RS |
| .TP 20 |
| \fBNoShared\fR |
| Exclude shared memory from RSS. This option cannot be used with UsePSS. |
| Only compatible with \fBjobacct_gather/linux\fR plugin. |
| .IP |
| |
| .TP |
| \fBUsePss\fR |
| Use PSS value instead of RSS to calculate real usage of memory. The PSS value |
| will be saved as RSS. This option cannot be used with NoShared. Only compatible |
| with \fBjobacct_gather/linux\fR plugin. |
| .IP |
| |
| .TP |
| \fBOverMemoryKill\fR |
| Kill processes detected using more memory than requested by their steps, each |
| time accounting information is gathered by the JobAcctGather plugin. |
| This parameter should be used with caution because a job exceeding its memory |
| allocation may affect other processes and/or machine health. |
| |
| \fBNOTE\fR: If available, it is recommended to limit memory by enabling |
| task/cgroup as a TaskPlugin and making use of ConstrainRAMSpace=yes in the |
| cgroup.conf instead of using this JobAcctGather mechanism for memory |
| enforcement. Using JobAcctGather is polling based and there is a |
| delay before a job is killed, which could lead to system Out of Memory events. |
| |
| \fBNOTE\fR: When using \fBOverMemoryKill\fR, if the combined memory used by |
| all the processes in a step exceeds the memory limit, the entire step will be |
| killed/cancelled by the JobAcctGather plugin. |
| This differs from the behavior when using \fBConstrainRAMSpace\fR, where |
| processes in the step will be killed, but the step will be left active, |
| possibly with other processes left running. |
| .IP |
| |
| .TP |
| \fBDisableGPUAcct\fR |
| Do not do accounting of GPU usage and skip any gpu driver library call. This |
| parameter can help to improve performance if the GPU driver response is slow. |
| .RE |
| .IP |
| |
| .TP |
| \fBJobCompHost\fR |
| The name of the machine hosting the job completion database. |
| Only used for database type storage plugins, ignored otherwise. |
| .IP |
| |
| .TP |
| \fBJobCompLoc\fR |
| This option sets a string which has different meanings depending on |
| \fBJobCompType\fR: |
| .IP |
| .RS |
| .TP |
| If \fBjobcomp/elasticsearch\fR: |
| Instructs this plugin to send the finished job records information to the |
| Elasticsearch server URL endpoint (including the port number and the target |
| index) configured in this option. This string should typically take the form |
| of \fI<host>:<port>/<target>/_doc\fR. There is no default value for |
| JobCompLoc when this plugin is enabled. |
| |
| \fBNOTE\fR: Refer to <https://slurm.schedmd.com/elasticsearch.html> for more |
| information. |
| .IP |
| |
| .TP |
| If \fBjobcomp/filetxt\fR: |
| Instructs this plugin to send the finished job records information to a file |
| configured in this option. This string should represent an absolute path to |
| a file. The default value for this plugin is \fI/var/log/slurm_jobcomp.log\fR. |
| .IP |
| |
| .TP |
| If \fBjobcomp/kafka\fR: |
| When this plugin is configured, finished (and optionally start running) job |
| records information is sent to a Kafka server. The plugin makes use of |
| \fBlibrdkafka\fR. This string represents an absolute path to a file containing |
| key=value pairs configuring the library behavior. For the plugin to work |
| properly, this file needs to exist and at least the \fIbootstrap.servers\fR |
| \fBlibrdkafka\fR property needs to be configured in it. There is no default |
| value for JobCompLoc when this plugin is enabled. |
| |
| \fBNOTE\fR: For a full list of \fBlibrdkafka\fR properties, please refer to |
| the library documentation. You can also view the jobcomp_kafka page for more |
| information: <https://slurm.schedmd.com/jobcomp_kafka.html> |
| |
| \fBNOTE\fR: The target Kafka topic(s) and other plugin parameters can be |
| configured via \fBJobCompParams\fR. |
| .IP |
| |
| .TP |
| If \fBjobcomp/lua\fR: |
| This option is ignored in this plugin. The finished job record is processed |
| by a hardcoded \fIjobcomp.lua\fR script expected to be located in the same |
| location of slurm.conf. There is no default value for JobCompLoc when this |
| plugin is enabled. |
| .IP |
| |
| .TP |
| If \fBjobcomp/mysql\fR: |
| Instructs this plugin to send the finished job records information to a database |
| name configured in this option. This string should represent a database name. |
| The default value for this plugin is \fIslurm_jobcomp_db\fR. |
| .IP |
| |
| .TP |
| If \fBjobcomp/script\fR: |
| The finished job record information is made available via environment variables |
| and processed by a script with name configured by this option. This string |
| should represent a path to a script. There is no default value for JobCompLoc |
| when this plugin is enabled. It needs to be explicitly configured or the |
| plugin will fail to initialize. |
| .RE |
| .IP |
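| For example, assuming a hypothetical Elasticsearch server reachable at |
| es.example.com, the plugin and its endpoint could be configured together as: |
| .nf |
| JobCompType=jobcomp/elasticsearch |
| JobCompLoc=http://es.example.com:9200/slurm/_doc |
| .fi |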
| |
| .TP |
| \fBJobCompParams\fR |
| Pass arbitrary text string to job completion plugin. |
| Also see \fBJobCompType\fR. |
| .RS |
| .IP |
| |
| .TP |
| Optional comma-separated list for \fBjobcomp/elasticsearch\fR: |
| .RS |
| .IP |
| |
| .TP |
| \fBsend_script\fR |
| Sends the job script as part of jobcomp messages. |
| .IP |
| |
| .RE |
| .IP |
| |
| .TP |
| Optional comma-separated list for \fBjobcomp/kafka\fR: |
| .RS |
| .IP |
| |
| .TP |
| \fBenable_job_start\fR |
| Instruct the \fBjobcomp/kafka\fR plugin to send a subset of the job record |
| fields to the \fBtopic_job_start\fR Kafka topic when a job first starts running. |
| |
| \fBNOTE\fR: The writing when the job finishes (historical purpose of the plugin) |
| is always enabled by default and can't be disabled. |
| |
| \fBNOTE\fR: The subset of fields for job start events is slightly smaller than |
| those sent when the job finishes. |
| .IP |
| |
| .TP |
| \fBflush_timeout\fR=<milliseconds> |
| Maximum time (in milliseconds) to wait for all outstanding produce requests, |
| etc., to be completed. This is passed as a timeout argument to the |
| \fBlibrdkafka\fR flush API function, called on plugin termination. This is done |
| prior to destroying the producer instance to make sure all queued and in-flight |
| produce requests are completed before terminating. |
| For non-blocking calls, set to 0. |
| To wait indefinitely for an event, set to -1 (not recommended, since this is |
| called on plugin fini and could block slurmctld graceful termination). |
| Accepted values are [-1,2147483647]. |
| Defaults to 500 (milliseconds). |
| .IP |
| |
| .TP |
| \fBpoll_interval\fR=<seconds> |
| Seconds between calls to \fBlibrdkafka\fR API poll function, which polls the |
| provided Kafka handle for events. The plugin spawns a separate thread to perform |
| this call at the configured interval. |
| Accepted values are [0,4294967295]. |
| Defaults to 2 (seconds). |
| .IP |
| |
| .TP |
| \fBrequeue_on_msg_timeout\fR |
| Instruct the delivery report callback to requeue messages that failed delivery |
| because their time waiting for successful delivery reached the \fBlibrdkafka\fR |
| property \fBmessage.timeout.ms\fR. |
| Defaults to not set (don't requeue and thus discard these messages). |
| .IP |
| |
| .TP |
| \fBsend_script\fR |
| Sends the job script as part of jobcomp messages. |
| .IP |
| |
| .TP |
| \fBtopic\fR=<string> |
| Target Kafka topic to send messages to when a job finishes. |
| Defaults to \fBClusterName\fR. |
| .IP |
| |
| .TP |
| \fBtopic_job_start\fR=<string> |
| Target Kafka topic to send messages to when a job starts running. |
| Defaults to \fB<ClusterName>-job-start\fR. |
| |
| \fBNOTE\fR: It is advisable that job start running event records be sent to a |
| different Kafka topic than the topic configured for job finish event records. |
| .RE |
| .IP |
| |
| .RE |
| .IP |
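| As an illustrative sketch for the kafka plugin (the file path, topic name and |
| timeout are hypothetical): |
| .nf |
| JobCompType=jobcomp/kafka |
| JobCompLoc=/etc/slurm/rdkafka.conf |
| JobCompParams=enable_job_start,topic=slurm_jobs,flush_timeout=1000 |
| .fi |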
| |
| .TP |
| \fBJobCompPass\fR |
| The password used to gain access to the database to store the job |
| completion data. |
| Only used for database type storage plugins, ignored otherwise. |
| .IP |
| |
| .TP |
| \fBJobCompPort\fR |
| The listening port of the job completion database server. |
| Only used for database type storage plugins, ignored otherwise. |
| .IP |
| |
| .TP |
| \fBJobCompType\fR |
| The job completion logging mechanism type. Unset by default. |
| Acceptable values at present include: |
| .IP |
| .RS |
| .TP |
| \fBjobcomp/elasticsearch\fR |
| Upon job completion, a record of the job should be written to an |
| Elasticsearch server, specified by the \fBJobCompLoc\fR parameter. |
| .br |
| \fBNOTE\fR: More information is available at the Slurm web site |
| ( https://slurm.schedmd.com/elasticsearch.html ). |
| .IP |
| |
| .TP |
| \fBjobcomp/filetxt\fR |
| Upon job completion, a record of the job should be written to a text file, |
| specified by the \fBJobCompLoc\fR parameter. |
| .IP |
| |
| .TP |
| \fBjobcomp/kafka\fR |
| Upon job completion (or optionally job start running), a record of the job |
| should be sent to a Kafka server, specified by the file path referenced in |
| \fBJobCompLoc\fR and/or using other \fBJobCompParams\fR. |
| .IP |
| |
| .TP |
| \fBjobcomp/lua\fR |
| Upon job completion, a record of the job should be processed by the |
| \fIjobcomp.lua\fR script, located in the default script directory |
| (typically the subdirectory \fIetc\fR of the installation directory). |
| .IP |
| |
| .TP |
| \fBjobcomp/mysql\fR |
| Upon job completion, a record of the job should be written to a MySQL |
| or MariaDB database, specified by the \fBJobCompLoc\fR parameter. |
| .IP |
| |
| .TP |
| \fBjobcomp/script\fR |
| Upon job completion, a script specified by the \fBJobCompLoc\fR parameter is |
| to be executed with environment variables providing the job information. |
| .RE |
| .IP |
| |
| .TP |
| \fBJobCompUser\fR |
| The user account for accessing the job completion database. |
| Only used for database type storage plugins, ignored otherwise. |
| .IP |
| |
| .TP |
| \fBJobContainerType\fR |
| Identifies the plugin to be used for job isolation through Linux namespaces. |
| \fBNOTE\fR: See \fBProctrackType\fR for resource containment and usage tracking. |
| Acceptable values at present include: |
| .IP |
| .RS |
| .TP 20 |
| \fBjob_container/tmpfs\fR |
| Used to create a private namespace on the filesystem for jobs, which houses |
| temporary file systems (/tmp and /dev/shm) for each job. 'PrologFlags=Contain' |
| must be set to use this plugin. |
| .RE |
| .IP |
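| A minimal sketch of the per\-job tmpfs isolation described above (typically |
| paired with a job_container.conf defining the base path for the private |
| mounts): |
| .nf |
| JobContainerType=job_container/tmpfs |
| PrologFlags=Contain |
| .fi |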
| |
| .TP |
| \fBJobFileAppend\fR |
| This option controls what to do if a job's output or error file |
| exists when the job is started. |
| If \fBJobFileAppend\fR is set to a value of 1, then append to |
| the existing file. |
| By default, any existing file is truncated. |
| .IP |
| |
| .TP |
| \fBJobRequeue\fR |
| This option controls the default ability for batch jobs to be requeued. |
| Jobs may be requeued explicitly by a system administrator, after node |
| failure, or upon preemption by a higher priority job. |
| If \fBJobRequeue\fR is set to a value of 1, then batch jobs may be requeued |
| unless explicitly disabled by the user. |
| If \fBJobRequeue\fR is set to a value of 0, then batch jobs will not be requeued |
| unless explicitly enabled by the user. |
| Use the \fBsbatch\fR \fI\-\-no\-requeue\fR or \fI\-\-requeue\fR |
| option to change the default behavior for individual jobs. |
| The default value is 1. |
| .IP |
| |
| .TP |
| \fBJobSubmitPlugins\fR |
| These are intended to be site\-specific plugins which can be used to set |
| default job parameters and/or logging events. Slurm can be configured to use |
| multiple job_submit plugins if desired, which must be specified as a |
| comma\-delimited list and will be executed in the order listed. |
| .nf |
| e.g. for multiple job_submit plugin configuration: |
| JobSubmitPlugins=lua,require_timelimit |
| .fi |
| Take a look at <https://slurm.schedmd.com/job_submit_plugins.html> for further |
| plugin implementation details. No job submission plugins are used by default. |
| Currently available plugins are: |
| .IP |
| .RS |
| .TP 24 |
| \fBall_partitions\fR |
| Set default partition to all partitions on the cluster. |
| .IP |
| |
| .TP |
| \fBdefaults\fR |
| Set default values for job submission or modify requests. |
| .IP |
| |
| .TP |
| \fBlogging\fR |
| Log select job submission and modification parameters. |
| .IP |
| |
| .TP |
| \fBlua\fR |
| Execute a Lua script implementing site's own job_submit logic. Only one Lua |
| script will be executed. It must be named "job_submit.lua" and must be located |
| in the default configuration directory (typically the subdirectory "etc" of the |
| installation directory). Sample Lua scripts can be found with the Slurm |
| distribution, in the directory contribs/lua. Slurmctld will fatal on startup if |
| the configured lua script is invalid. Slurm will try to load the script for each |
| job submission. If the script is broken or removed while slurmctld is running, |
| Slurm will fallback to the previous working version of the script. |
| \fBWarning\fR: slurmctld runs this script while holding internal locks, and |
| only a single copy of this script can run at a time. This blocks most |
| concurrency in slurmctld. Therefore, this script should run to completion as |
| quickly as possible. |
| .IP |
| |
| .TP |
| \fBpartition\fR |
| Set a job's default partition based upon job submission parameters and |
| available partitions. |
| .IP |
| |
| .TP |
| \fBpbs\fR |
| Translate PBS job submission options to Slurm equivalent (if possible). |
| .IP |
| |
| .TP |
| \fBrequire_timelimit\fR |
| Force job submissions to specify a timelimit. |
| .RE |
| .IP |
| |
| \fBNOTE\fR: For examples of use see the Slurm code in "src/plugins/job_submit" |
| and "contribs/lua/job_submit*.lua" then modify the code to satisfy your needs. |
| .IP |
| |
| .TP |
| \fBKillOnBadExit\fR |
| If set to 1, a step will be terminated immediately if any task crashes or |
| aborts, as indicated by a non\-zero exit code. |
| With the default value of 0, if one of the processes crashes or aborts, |
| the other processes will continue to run while the crashed or aborted process |
| waits. The user can override this configuration parameter by using srun's |
| \fB\-K\fR, \fB\-\-kill\-on\-bad\-exit\fR. |
| .IP |
| |
| .TP |
| \fBKillWait\fR |
| The interval, in seconds, given to a job's processes between the |
| SIGTERM and SIGKILL signals upon reaching its time limit. |
| If the job fails to terminate gracefully in the interval specified, |
| it will be forcibly terminated. |
| The default value is 30 seconds. |
| The value may not exceed 65533. |
| .IP |
| |
| .TP |
| \fBMaxBatchRequeue\fR |
| Maximum number of times a batch job may be automatically requeued before |
| being marked as JobHeldAdmin. (Mainly useful when the \fBSchedulerParameters\fR |
| option \fBnohold_on_prolog_fail\fR is enabled.) |
| The default value is 5. |
| .IP |
| |
| .TP |
| \fBNodeFeaturesPlugins\fR |
| Identifies the plugins to be used for support of node features which can |
| change through time. For example, a node which might be booted with various |
| BIOS settings. This is supported through the use of a node's active_features |
| and available_features information. |
| Acceptable values at present include: |
| .IP |
| .RS |
| .TP |
| \fBnode_features/knl_cray\fR |
| Used only for Intel Knights Landing processors (KNL) on Cray systems. |
| See https://slurm.schedmd.com/intel_knl.html for more information. |
| .IP |
| |
| .TP |
| \fBnode_features/knl_generic\fR |
| Used for Intel Knights Landing processors (KNL) on a generic Linux system. |
| See https://slurm.schedmd.com/intel_knl.html for more information. |
| .IP |
| |
| .TP |
| \fBnode_features/helpers\fR |
| Used to report and modify features on nodes using arbitrary scripts or |
| programs. |
| See helpers.conf man page for more information: |
| https://slurm.schedmd.com/helpers.conf.html |
| .RE |
| .IP |
| |
| .TP |
| \fBLaunchParameters\fR |
| Identifies options to the job launch plugin. |
| Acceptable values include: |
| .IP |
| .RS |
| .TP 24 |
| \fBbatch_step_set_cpu_freq\fR |
| Set the cpu frequency for the batch step from the given \-\-cpu\-freq option, |
| or from the slurm.conf CpuFreqDef value. By default only steps started with |
| srun will utilize the cpu freq setting options. |
| |
| \fBNOTE\fR: If you are using srun to launch your steps inside a batch script |
| (advised), this option can create a situation where multiple agents set the |
| cpu_freq, since the batch step usually runs on the same resources as one or |
| more of the steps that the sruns in the script will create. |
| .IP |
| |
| .TP 24 |
| \fBcray_net_exclusive\fR |
| Allow jobs on a Cray XC cluster exclusive access to network resources. |
| This should only be set on clusters providing exclusive access to each |
| node to a single job at once, and not using parallel steps within the job, |
| otherwise resources on the node can be oversubscribed. |
| .IP |
| |
| .TP 24 |
| \fBenable_nss_slurm\fR |
| Permits passwd and group resolution for a job to be serviced by slurmstepd rather |
| than requiring a lookup from a network based service. See |
| https://slurm.schedmd.com/nss_slurm.html for more information. |
| .IP |
| |
| .TP 24 |
| \fBlustre_no_flush\fR |
| If set on a Cray XC cluster, then do not flush the Lustre cache on job step |
| completion. This setting will only take effect after reconfiguring, and will |
| only take effect for newly launched jobs. |
| .IP |
| |
| .TP 24 |
| \fBmem_sort\fR |
| Sort NUMA memory at step start. User can override this default with |
| SLURM_MEM_BIND environment variable or \-\-mem\-bind=nosort command line option. |
| .IP |
| |
| .TP |
| \fBmpir_use_nodeaddr\fR |
| When launching tasks, Slurm creates entries in MPIR_proctable that are used by |
| parallel debuggers, profilers, and related tools to attach to running processes. |
| By default the MPIR_proctable entries contain MPIR_procdesc structures where |
| the host_name is set to NodeName. If this option is specified, |
| NodeAddr will be used in this context instead. |
| .IP |
| |
| .TP |
| \fBdisable_send_gids\fR |
| By default, the slurmctld will look up and send the user_name and extended gids |
| for a job, rather than independently on each node as part of each task launch. |
| This helps mitigate issues around name service scalability when launching jobs |
| involving many nodes. Using this option will disable this functionality. This |
| option is ignored if enable_nss_slurm is specified. |
| .IP |
| |
| .TP 24 |
| \fBslurmstepd_memlock\fR |
| Lock the slurmstepd process's current memory in RAM. |
| .IP |
| |
| .TP |
| \fBslurmstepd_memlock_all\fR |
| Lock the slurmstepd process's current and future memory in RAM. |
| .IP |
| |
| .TP |
| \fBtest_exec\fR |
| Have srun verify existence of the executable program along with user |
| execute permission on the node where srun was called before attempting to |
| launch it on nodes in the step. |
| .IP |
| |
| .TP |
| \fBuse_interactive_step\fR |
| Have salloc use the Interactive Step to launch a shell on an allocated compute |
| node rather than locally to wherever salloc was invoked. This is accomplished |
| by launching the srun command with InteractiveStepOptions as options. |
| |
| This does not affect salloc called with a command as an argument. These jobs |
| will continue to be executed as the calling user on the calling host. |
| .IP |
| |
| .TP |
| \fBulimit_pam_adopt\fR |
| When pam_slurm_adopt is used to join an external process into a job cgroup, |
| RLIMIT_RSS is set, as is done for tasks running in regular steps. |
| .RE |
| .IP |
| |
| .TP |
| \fBLicenses\fR |
| Specification of licenses (or other resources available on all |
| nodes of the cluster) which can be allocated to jobs. |
| License names can optionally be followed by a colon |
| and count with a default count of one. |
| Multiple license names should be comma separated (e.g. |
| "Licenses=foo:4,bar"). |
| Note that Slurm prevents jobs from being scheduled if their |
| required license specification is not available. |
| Slurm does not prevent jobs from using licenses that are |
| not explicitly listed in the job submission specification. |
| .IP |
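| For example (the license names and counts are illustrative), a cluster\-wide |
| pool of licenses could be declared as: |
| .nf |
| Licenses=fluent:30,matlab:10,vasp |
| .fi |
| Jobs would then request them at submission time, e.g. with |
| "sbatch \-L fluent:2". |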
| |
| .TP |
| \fBLogTimeFormat\fR |
| Format of the timestamp in slurmctld and slurmd log files. Accepted |
| format values include "iso8601", "iso8601_ms", "rfc5424", "rfc5424_ms", |
| "rfc3339", "clock", "short" and "thread_id". The values ending in "_ms" differ |
| from the ones without in that fractional seconds with millisecond precision are |
| printed. The default value is "iso8601_ms". The "rfc5424" formats are the same |
| as the "iso8601" formats except that the timezone value is also shown. |
| The "clock" format shows a timestamp in microseconds retrieved |
| with the C standard clock() function. The "short" format is a short |
| date and time format. The "thread_id" format shows the timestamp |
| in the C standard ctime() function form without the year but |
| including the microseconds, the daemon's process ID and the current thread name |
| and ID. |
| .IP |
| |
| .TP |
| \fBMailDomain\fR |
| Domain name used to qualify usernames if an email address is not explicitly |
| given with the "\-\-mail\-user" option. If unset, the local MTA will need to |
| qualify the local address itself. Changes to MailDomain will only affect new |
| jobs. |
| .IP |
| |
| .TP |
| \fBMailProg\fR |
| Fully qualified pathname to the program used to send email per user request. |
| The default value is "/bin/mail" (or "/usr/bin/mail" if "/bin/mail" does not |
| exist but "/usr/bin/mail" does exist). |
| The program is called with arguments suitable for the default mail command, |
| however additional information about the job is passed in the form of |
| environment variables. |
| |
| Additional variables are the same as those passed to \fIPrologSlurmctld\fR and |
| \fIEpilogSlurmctld\fR with additional variables in the following contexts: |
| .IP |
| .RS |
| .TP |
| \fBALL\fR |
| .IP |
| .RS |
| .TP |
| \fBSLURM_JOB_STATE\fR |
| The base state of the job when the MailProg is called. |
| .RE |
| .IP |
| .RS |
| .TP |
| \fBSLURM_JOB_MAIL_TYPE\fR |
| The mail type triggering the mail. |
| .RE |
| .RE |
| .IP |
| .RS |
| .TP |
| \fBBEGIN\fR |
| .IP |
| .RS |
| .TP |
| \fBSLURM_JOB_QUEUED_TIME\fR |
| The amount of time the job was queued. |
| .RE |
| .RE |
| .IP |
| .RS |
| .TP |
| \fBEND, FAIL, REQUEUE, TIME_LIMIT_*\fR |
| .IP |
| .RS |
| .TP |
| \fBSLURM_JOB_RUN_TIME\fR |
| The amount of time the job ran for. |
| .RE |
| .RE |
| .IP |
| .RS |
| .TP |
| \fBEND, FAIL\fR |
| .IP |
| .RS |
| .TP |
| \fBSLURM_JOB_EXIT_CODE_MAX\fR |
| Job's exit code or highest exit code for an array job. |
| .RE |
| .IP |
| .RS |
| .TP |
| \fBSLURM_JOB_EXIT_CODE_MIN\fR |
| Job's minimum exit code for an array job. |
| .RE |
| .IP |
| .RS |
| .TP |
| \fBSLURM_JOB_TERM_SIGNAL_MAX\fR |
| Job's highest signal for an array job. |
| .RE |
| .RE |
| .IP |
| .RS |
| .TP |
| \fBSTAGE_OUT\fR |
| .IP |
| .RS |
| .TP |
| \fBSLURM_JOB_STAGE_OUT_TIME\fR |
| Job's staging out time. |
| .RE |
| .RE |
| .IP |
| |
| .TP |
| \fBMaxArraySize\fR |
| The maximum job array size. The maximum job array task index value will be |
| one less than MaxArraySize to allow for an index value of zero. |
| Configure MaxArraySize to 0 in order to disable job array use. |
| The value may not exceed 4000001. |
| The value of \fBMaxJobCount\fR should be much larger than \fBMaxArraySize\fR. |
| The default value is 1001. |
| See also \fBmax_array_tasks\fR in SchedulerParameters. |
| .IP |
| |
| .TP |
| \fBMaxDBDMsgs\fR |
| When communication to the SlurmDBD is not possible, the slurmctld will queue |
| messages meant to be processed when the SlurmDBD is available again. |
| In order to avoid running out of memory the slurmctld will only queue so many |
| messages. The default value is 10000, or \fBMaxJobCount\fR * 2 + Node Count |
| * 4, whichever is greater. The value can not be less than 10000. |
| .IP |
| |
| .TP |
| \fBMaxJobCount\fR |
| The maximum number of jobs slurmctld can have in memory at one time. |
| Combine with \fBMinJobAge\fR to ensure the slurmctld daemon does not exhaust |
| its memory or other resources. Once this limit is reached, requests to submit |
| additional jobs will fail. The default value is 10000 jobs. |
| \fBNOTE\fR: Each task of a job array counts as one job even though they will not |
| occupy separate job records until modified or initiated. |
| Performance can suffer with more than a few hundred thousand jobs. |
| Setting a MaxSubmitJobs limit per user is generally valuable to prevent a single |
| user from filling the system with jobs. |
| This is accomplished using Slurm's database and configuring enforcement of |
| resource limits. |
| .IP |
| |
| .TP |
| \fBMaxJobId\fR |
| The maximum job id to be used for jobs submitted to Slurm without a specific |
| requested value. Job ids are unsigned 32bit integers with the first 26 bits |
| reserved for local job ids and the remaining 6 bits reserved for a cluster id |
| to identify a federated job's origin. The maximum allowed local job id is |
| 67,108,863 (0x3FFFFFF). The default value is 67,043,328 (0x03ff0000). |
| \fBMaxJobId\fR only applies to the local job id and not the federated job id. |
| Job id values generated will be incremented by 1 for each subsequent job. Once |
| \fBMaxJobId\fR is reached, the next job will be assigned \fBFirstJobId\fR. |
| Federated jobs will always have a job ID of 67,108,865 or higher. |
| Also see \fBFirstJobId\fR. |
| .IP |
| |
| .TP |
| \fBMaxMemPerCPU\fR |
| Maximum real memory size available per allocated CPU in megabytes. |
| Used to avoid over\-subscribing memory and causing paging. |
| \fBMaxMemPerCPU\fR would generally be used if individual processors |
| are allocated to jobs (\fBSelectType=select/cons_tres\fR). |
| The default value is 0 (unlimited). |
| Also see \fBDefMemPerCPU\fR, \fBDefMemPerGPU\fR and \fBMaxMemPerNode\fR. |
| \fBMaxMemPerCPU\fR and \fBMaxMemPerNode\fR are mutually exclusive. |
| |
| \fBNOTE\fR: If a job specifies a memory per CPU limit that exceeds this system |
| limit, that job's count of CPUs per task will try to automatically increase. |
| This may result in the job failing due to CPU count limits. This |
| auto\-adjustment feature is a best\-effort one and optimal assignment is not |
| guaranteed due to the possibility of having heterogeneous configurations and |
| multi\-partition/qos jobs. If this is a concern it is advised to use a job |
| submit LUA plugin instead to enforce auto\-adjustments to your specific needs. |
| .IP |
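| As an illustration (the values are hypothetical), a per\-CPU memory default |
| and cap might be configured as: |
| .nf |
| SelectType=select/cons_tres |
| DefMemPerCPU=2048 |
| MaxMemPerCPU=8192 |
| .fi |
| A job requesting more memory per CPU than the cap may then have its CPUs per |
| task increased, as described in the NOTE above. |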
| |
| .TP |
| \fBMaxMemPerNode\fR |
| Maximum real memory size available per allocated node in a job allocation in |
| megabytes. Used to avoid over\-subscribing memory and causing paging. |
| \fBMaxMemPerNode\fR would generally be used if whole nodes |
| are allocated to jobs (\fBSelectType=select/linear\fR) and |
| resources are over\-subscribed (\fBOverSubscribe=yes\fR or |
| \fBOverSubscribe=force\fR). |
| The default value is 0 (unlimited). |
| Also see \fBDefMemPerNode\fR and \fBMaxMemPerCPU\fR. |
| \fBMaxMemPerCPU\fR and \fBMaxMemPerNode\fR are mutually exclusive. |
| .IP |
| |
| .TP |
| \fBMaxNodeCount\fR |
| Maximum count of nodes which may exist in the controller. By default MaxNodeCount |
| will be set to the number of nodes found in the slurm.conf. MaxNodeCount will |
| be ignored if less than the number of nodes found in the |
| slurm.conf. The total number of nodes in a system cannot exceed 65536. Increase |
| MaxNodeCount to accommodate dynamically created nodes with dynamic node |
| registrations and nodes created with scontrol. |
| .IP |
| |
| .TP |
| \fBMaxStepCount\fR |
| The maximum number of steps that any job can initiate. This parameter |
| is intended to limit the effect of bad batch scripts. |
| The default value is 40000 steps. |
| .IP |
| |
| .TP |
| \fBMaxTasksPerNode\fR |
| Maximum number of tasks Slurm will allow a job step to spawn |
| on a single node. The default \fBMaxTasksPerNode\fR is 512. |
| May not exceed 65533. |
| .IP |
| |
| .TP |
| \fBMCSParameters\fR |
| MCS = Multi\-Category Security |
| MCS Plugin Parameters. |
| The supported parameters are specific to the \fBMCSPlugin\fR. |
| Changes to this value take effect when the Slurm daemons are reconfigured. |
| More information about MCS is available here |
| <https://slurm.schedmd.com/mcs.html>. |
| .IP |
| |
| .TP |
| \fBMCSPlugin\fR |
| MCS = Multi\-Category Security : associate a security label to jobs and ensure |
| that nodes can only be shared among jobs using the same security label. |
| Unset by default. Acceptable values include: |
| .IP |
| .RS |
| .TP 12 |
| \fBmcs/account\fR |
| only users with the same account can share the nodes (requires enabling of accounting). |
| .IP |
| |
| .TP |
| \fBmcs/group\fR |
| only users with the same group can share the nodes. |
| .IP |
| |
| .TP |
| \fBmcs/user\fR |
| a node cannot be shared with other users. |
| .IP |
| |
| .TP |
| \fBmcs/label\fR |
| only jobs with the same arbitrary label string can share nodes. |
| .RE |
| .IP |
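| A brief sketch (the parameter string is illustrative; see the MCS |
| documentation for the exact syntax accepted by each plugin): |
| .nf |
| MCSPlugin=mcs/account |
| MCSParameters=enforced,select |
| .fi |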
| |
| .TP |
| \fBMessageTimeout\fR |
| Time permitted for a round\-trip communication to complete |
| in seconds. Default value is 10 seconds. For systems with |
| shared nodes, the slurmd daemon could be paged out and |
| necessitate higher values. |
| .IP |
| |
| .TP |
| \fBMinJobAge\fR |
| The minimum age of a completed job before its record is cleared from the list |
| of jobs slurmctld keeps in memory. Combine with \fBMaxJobCount\fR |
| to ensure the slurmctld daemon does not exhaust |
| its memory or other resources. The default value is 300 seconds. |
| A value of zero prevents any job record purging. |
| Jobs are not purged during a backfill cycle, so it can take longer than |
| MinJobAge seconds to purge a job if using the backfill scheduling plugin. |
| In order to eliminate some possible race conditions, the minimum non\-zero |
| value for \fBMinJobAge\fR recommended is 2. |
| .IP |
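| A sketch of how \fBMaxJobCount\fR and \fBMinJobAge\fR are typically tuned |
| together (the values are illustrative only): |
| .nf |
| MaxJobCount=50000 |
| MinJobAge=600 |
| .fi |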
| |
| .TP |
| \fBMpiDefault\fR |
| Identifies the default type of MPI to be used. |
| Unset by default, which allows Slurm to work with versions of MPI other than |
| those listed below. |
| Srun may override this configuration parameter in any case. |
| Currently supported versions include: |
| \fBpmi2\fR and \fBpmix\fR. |
| More information about MPI use is available here |
| <https://slurm.schedmd.com/mpi_guide.html>. |
| .IP |
| |
| .TP |
| \fBMpiParams\fR |
| MPI-related parameters. Multiple parameters may be comma separated. Currently |
| supported parameters include: |
| .IP |
| .RS |
| |
| .TP |
| \fBports\fR=\#\-\# |
| Identifies a range of communication ports used by native Cray's PMI. |
| .IP |
| |
| .TP |
| \fBdisable_slurm_hydra_bootstrap\fR |
| Disable environment variable injection in allocations for the following |
| variables: I_MPI_HYDRA_BOOTSTRAP, I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS, |
| HYDRA_BOOTSTRAP, HYDRA_LAUNCHER_EXTRA_ARGS. |
| |
| Manually setting I_MPI_HYDRA_BOOTSTRAP or HYDRA_BOOTSTRAP to 'slurm' in the |
| allocation will skip this parameter and injection of extra args will be |
| performed as usual. |
| .RE |
| .IP |
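| An illustrative sketch (the port range is arbitrary and only meaningful on |
| systems using Cray's PMI): |
| .nf |
| MpiDefault=pmix |
| MpiParams=ports=20000\-32767 |
| .fi |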
| |
| .TP |
| \fBOverTimeLimit\fR |
| Number of minutes by which a job can exceed its time limit before |
| being canceled. |
| Normally a job's time limit is treated as a \fIhard\fR limit and the job will be |
| killed upon reaching that limit. |
| Configuring \fBOverTimeLimit\fR will result in the job's time limit being |
| treated like a \fIsoft\fR limit. |
| Adding the \fBOverTimeLimit\fR value to the \fIsoft\fR time limit provides a |
| \fIhard\fR time limit, at which point the job is canceled. |
| This is particularly useful for backfill scheduling, which bases its |
| decisions upon each job's soft time limit. |
| The default value is zero. |
| May not exceed 65533 minutes. |
| A value of "UNLIMITED" is also supported. |
| .IP |
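| For example, with the illustrative setting below, a job submitted with a |
| 60\-minute time limit would be backfill\-scheduled against its 60\-minute soft |
| limit but would not be canceled until 70 minutes had elapsed: |
| .nf |
| OverTimeLimit=10 |
| .fi |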
| |
| .TP |
| \fBPluginDir\fR |
| Identifies the places in which to look for Slurm plugins. |
| This is a colon\-separated list of directories, like the PATH |
| environment variable. |
| The default value is the prefix given at configure time + "/lib/slurm". |
| .IP |
| |
| .TP |
| \fBPlugStackConfig\fR |
| Location of the config file for Slurm stackable plugins that use |
| the Stackable Plugin Architecture for Node job (K)control (SPANK). |
| This provides support for a highly configurable set of plugins to |
| be called before and/or after execution of each task spawned as |
| part of a user's job step. Default location is "plugstack.conf" |
| in the same directory as the system slurm.conf. For more information |
| on SPANK plugins, see the \fBspank\fR(8) manual. |
| .IP |
| |
| .TP |
| \fBPreemptMode\fR |
| Mechanism used to preempt jobs or enable gang scheduling. When the |
| \fBPreemptType\fR parameter is set to enable preemption, the |
| \fBPreemptMode\fR selects the default mechanism used to preempt the eligible |
| jobs for the cluster. |
| .br |
| \fBPreemptMode\fR may be specified on a per partition basis to override this |
| default value if \fBPreemptType=preempt/partition_prio\fR. Alternatively, it |
| can be specified on a per QOS basis if \fBPreemptType=preempt/qos\fR. In either |
| case, a valid default \fBPreemptMode\fR value must be specified for the |
| cluster as a whole when preemption is enabled. |
| .br |
| The \fBGANG\fR option is used to enable gang scheduling independent of |
| whether preemption is enabled (i.e. independent of the \fBPreemptType\fR |
| setting). It can be specified in addition to a \fBPreemptMode\fR setting with |
| the two options comma separated (e.g. \fBPreemptMode=SUSPEND,GANG\fR). |
| .br |
| See <https://slurm.schedmd.com/preempt.html> and |
| <https://slurm.schedmd.com/gang_scheduling.html> for more details. |
| |
| \fBNOTE\fR: |
| For performance reasons, the backfill scheduler reserves whole nodes for jobs, |
| not partial nodes. If during backfill scheduling a job preempts one or more |
| other jobs, the whole nodes for those preempted jobs are reserved for the |
| preemptor job, even if the preemptor job requested fewer resources than that. |
| These reserved nodes aren't available to other jobs during that backfill |
| cycle, even if the other jobs could fit on the nodes. Therefore, jobs may |
| preempt more resources during a single backfill iteration than they requested. |
| .br |
| \fBNOTE\fR: |
| For a heterogeneous job to be considered for preemption, all components |
| must be eligible for preemption. When a heterogeneous job is to be preempted |
| the first identified component of the job with the highest order PreemptMode |
| (\fBSUSPEND\fR (highest), \fBREQUEUE\fR, \fBCANCEL\fR (lowest)) will be |
| used to set the PreemptMode for all components. The \fBGraceTime\fR and user |
| warning signal for each component of the heterogeneous job remain unique. |
| Heterogeneous jobs are excluded from GANG scheduling operations. |
| .IP |
| .RS |
| .TP 12 |
| \fBOFF\fR |
| Is the default value and disables job preemption and gang scheduling. |
| It is only compatible with preemption being disabled at the global level. |
| A common use case for this parameter is to set it on a partition to disable |
| preemption for that partition. |
| .IP |
| |
| .TP |
| \fBCANCEL\fR |
| The preempted job will be cancelled. |
| .IP |
| |
| .TP |
| \fBGANG\fR |
| Enables gang scheduling (time slicing) of jobs in the same partition, and |
| allows the resuming of suspended jobs. In order to use gang scheduling, the |
| \fBGANG\fR option must be specified at the cluster level. |
| |
| \fBNOTE:\fR |
| If \fBGANG\fR scheduling is enabled with |
| \fBPreemptType=preempt/partition_prio\fR, the controller will ignore |
| \fBPreemptExemptTime\fR and the following \fBPreemptParameters\fR: |
| \fBreorder_count\fR, \fBstrict_order\fR, and \fByoungest_first\fR. |
| |
| \fBNOTE\fR: |
| Gang scheduling is performed independently for each partition, so |
| if you only want time\-slicing by \fBOverSubscribe\fR, without any preemption, |
| then configuring partitions with overlapping nodes is not recommended. |
| On the other hand, if you want to use \fBPreemptType=preempt/partition_prio\fR |
| to allow jobs from higher PriorityTier partitions to Suspend jobs from lower |
| PriorityTier partitions you will need overlapping partitions, and |
| \fBPreemptMode=SUSPEND,GANG\fR to use the Gang scheduler to resume the suspended |
| jobs(s). You must configure the partition's \fIOverSubscribe\fR setting to |
| \fIFORCE\fR for all partitions in which time\-slicing is to take place. |
| In any case, time\-slicing won't happen between jobs on different partitions. |
| |
| \fBNOTE\fR: |
| Heterogeneous jobs are excluded from GANG scheduling operations. |
| |
| \fBNOTE\fR: |
| In the case of overlapping partitions, if a node is allocated to a job that |
| allows sharing of resources (OverSubscribe=FORCE, or OverSubscribe=YES and the |
| job was submitted with \fB\-s\fR/\fB\-\-oversubscribe\fR), it can only be |
| allocated to jobs from the same partition. |
| |
| .IP |
| |
| .TP |
| \fBREQUEUE\fR |
| Preempts jobs by requeuing them (if possible) or canceling them. |
| For jobs to be requeued they must have the \-\-requeue sbatch option set |
| or the cluster wide JobRequeue parameter in slurm.conf must be set to \fB1\fR. |
| .IP |
| |
| .TP |
| \fBSUSPEND\fR |
| The preempted jobs will be suspended, and later the Gang scheduler will resume |
| them. Therefore the \fBSUSPEND\fR preemption mode always needs the \fBGANG\fR |
| option to be specified at the cluster level. Also, because the suspended jobs |
| will still use memory on the allocated nodes, Slurm needs to be able to track |
| memory resources to be able to suspend jobs. |
| .br |
| When suspending jobs, Slurm sends the SIGTSTP signal, waits the time specified |
| by \fBPreemptParameters=suspend_grace_time\fR (default is 2 seconds), then |
| sends the SIGSTOP signal. The SIGCONT signal is sent when resuming jobs. |
| .br |
| If \fBPreemptType=preempt/qos\fR is configured and if the preempted job(s) and |
| the preemptor job are on the same partition, then they will share resources with |
| the Gang scheduler (time\-slicing). If not (i.e. if the preemptees and preemptor |
| are on different partitions) then the preempted jobs will remain suspended until |
| the preemptor ends. |
| |
| \fBNOTE\fR: Because gang scheduling is performed independently for each |
| partition, if using \fBPreemptType=preempt/partition_prio\fR then jobs in |
| higher PriorityTier partitions will suspend jobs in lower PriorityTier |
| partitions to run on the released resources. Only when the preemptor job ends |
| will the suspended jobs be resumed by the Gang scheduler. |
| .br |
| \fBNOTE\fR: Suspended jobs will not release GRES. Higher priority jobs will not |
| be able to preempt to gain access to GRES. |
| .IP |
| |
| .TP |
| \fBPRIORITY\fR |
| Allow preemption only if the preemptor's job priority is higher than the |
| preemptee's job priority. |
| |
| .TP |
| \fBWITHIN\fR |
| For \fBPreemptType=preempt/qos\fR, allow jobs within the same qos to preempt |
| one another. While this can be set globally here, it is recommended that this |
| only be set directly on a relevant subset of the system qos values instead. |
| .RE |
| .IP |
| |
| |
| .TP |
| \fBPreemptParameters\fR |
| Multiple options may be comma separated. |
| .RS |
| .TP |
| \fBmin_exempt_priority\fR=\# |
| Threshold value for the job's global priority. Only those jobs with priority |
| lower than this value will be marked as preemptable. |
| .IP |
| |
| .TP |
| \fBreclaim_licenses\fR |
| If set, jobs may be preempted to reclaim licenses. Otherwise jobs requesting |
| busy licenses will have to wait even if they have preemption priority. |
| The logic to support this option is only available in the select/cons_tres |
| plugin. Jobs that use OR in the license request are not eligible to preempt |
| other jobs to reclaim licenses. |
| .IP |
| |
| .TP |
| \fBreorder_count\fR=\# |
| Specify how many attempts should be made in reordering preemptable jobs to |
| minimize the total number of jobs that will be preempted. |
| The default value is 1. High values may adversely impact performance. |
| Changes to the order of jobs on these attempts can be enabled with |
| \fBstrict_order\fR. |
| The logic to support this option is only available in the select/cons_tres |
| plugin. |
| .IP |
| |
| .TP |
| \fBsend_user_signal\fR |
| Send the user signal (e.g. \-\-signal=<sig_num>) at preemption time even if the |
| signal time hasn't been reached. In the case of a gracetime preemption the user |
| signal will be sent if the user signal has been specified and not sent, |
| otherwise a SIGTERM will be sent to the tasks. |
| .IP |
| |
| .TP |
| \fBstrict_order\fR |
| When reordering preemptable jobs, place the most recently tested job at the |
| front of the list since we are certain that it actually added resources needed |
| by the new job. This ensures that with enough reorder attempts, the minimum |
| possible number of jobs will be preempted. |
| See also \fBreorder_count\fR. |
| The logic to support this option is only available in the select/cons_tres |
| plugin. |
| .IP |
| |
| .TP |
| \fBsuspend_grace_time\fR |
| Specifies, in units of seconds, the preemption grace time when using |
| \fBPreemptMode=SUSPEND\fR. |
| When a job is suspended, the SIGTSTP signal will be sent, and then after waiting |
| the specified suspend grace time, the SIGSTOP signal will be sent. |
| The default value is 2 seconds. |
| .br |
| \fBNOTE\fR: This parameter is only used when \fBPreemptMode=SUSPEND\fR is |
| configured or when suspending jobs with scontrol suspend. |
| For setting the preemption grace time when using other preemption modes, |
| see \fBGraceTime\fR. |
| .IP |
| |
| .TP |
| \fByoungest_first\fR |
| If set, then the preemption sorting algorithm will be changed to sort by the |
| job start times to favor preempting younger jobs over older. (Requires |
| preempt/partition_prio or preempt/qos plugins.) |
| .IP |
| .RE |
| .IP |
| |
| .TP |
| \fBPreemptType\fR |
| Specifies the plugin used to identify which jobs can be |
| preempted in order to start a pending job. Unset by default. |
| .IP |
| .RS |
| .TP |
| \fBpreempt/partition_prio\fR |
| Job preemption is based upon partition \fBPriorityTier\fR. |
| Jobs in higher \fBPriorityTier\fR partitions may preempt jobs from lower |
| \fBPriorityTier\fR partitions. |
| This is not compatible with \fBPreemptMode=OFF\fR. |
| .IP |
| |
| .TP |
| \fBpreempt/qos\fR |
| Job preemption rules are specified by Quality Of Service (QOS) specifications |
| in the Slurm database. |
| In the case of \fBPreemptMode=SUSPEND\fR, a preempting job has to be submitted |
| to a partition with a higher PriorityTier or to the same partition. Submission |
| to the same partition results in the preemptor QOS gang scheduling the |
| preemptee QOS. |
| This option is not compatible with \fBPreemptMode=OFF\fR. |
| A configuration of \fBPreemptMode=SUSPEND\fR is only supported by the |
| \fBSelectType=select/cons_tres\fR plugin. |
| See the \fBsacctmgr\fR man page to configure the options for \fBpreempt/qos\fR. |
| .RE |
| .IP |
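| A sketch combining the preemption options described above (the values are |
| illustrative; the corresponding QOS preemption rules must still be defined |
| with sacctmgr): |
| .nf |
| PreemptType=preempt/qos |
| PreemptMode=SUSPEND,GANG |
| PreemptParameters=suspend_grace_time=5,youngest_first |
| .fi |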
| |
| .TP |
| \fBPreemptExemptTime\fR |
| Global option for minimum run time for all jobs before they can be considered |
| for preemption. Any QOS PreemptExemptTime takes precedence over the global |
| option. This is only honored for \fBPreemptMode=REQUEUE\fR and |
| \fBPreemptMode=CANCEL\fR. |
| .br |
| A time of \-1 disables the option, equivalent to 0. Acceptable time formats |
| include "minutes", "minutes:seconds", "hours:minutes:seconds", "days\-hours", |
| "days\-hours:minutes", and "days\-hours:minutes:seconds". |
| .IP |
| |
| .TP |
| \fBPrEpParameters\fR |
| Parameters to be passed to the \fBPrEpPlugins\fR. |
| .IP |
| |
| .TP |
| \fBPrEpPlugins\fR |
| A resource for programmers wishing to write their own plugins for the Prolog and |
| Epilog (PrEp) scripts. The default, and currently the only implemented plugin is |
| \fIprep/script\fR. Additional plugins can be specified in a comma\-separated |
| list. For more information please see the PrEp Plugin API documentation page: |
| <https://slurm.schedmd.com/prep_plugins.html> |
| .IP |
| |
| .TP |
| \fBPriorityCalcPeriod\fR |
| The period of time in minutes in which the half\-life decay will be |
| re\-calculated. |
| Applicable only if PriorityType=priority/multifactor. |
| The default value is 5 (minutes). |
| .IP |
| |
| .TP |
| \fBPriorityDecayHalfLife\fR |
| This controls how long prior resource use is considered when determining |
| how over\- or under\-serviced an association is (user, bank account and |
| cluster) for purposes of job priority. |
| The record of usage will be decayed over time, with half of the original value |
| cleared at age \fBPriorityDecayHalfLife\fR. |
| If set to 0 no decay will be applied. |
| This is helpful if you want to enforce hard time limits per association. If |
| set to 0 \fBPriorityUsageResetPeriod\fR must be set to some interval. |
| Applicable only if PriorityType=priority/multifactor. |
| The unit is a time string (i.e. min, hr:min:00, days\-hr:min:00, |
| or days\-hr). The default value is 7\-0 (7 days). |
| .IP |
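| A brief sketch of a multifactor setup using the decay mechanism (the values |
| are illustrative): |
| .nf |
| PriorityType=priority/multifactor |
| PriorityDecayHalfLife=14\-0 |
| PriorityCalcPeriod=5 |
| .fi |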
| |
| .TP |
| \fBPriorityFavorSmall\fR |
| Specifies that small jobs should be given preferential scheduling priority. |
| Applicable only if PriorityType=priority/multifactor. |
| Supported values are "YES" and "NO". The default value is "NO". |
| .IP |
| |
| .TP |
| \fBPriorityFlags\fR |
| Flags to modify priority behavior. |
| Applicable only if PriorityType=priority/multifactor. |
| The keywords below have no associated value |
| (e.g. "PriorityFlags=ACCRUE_ALWAYS,SMALL_RELATIVE_TO_TIME"). |
| .IP |
| .RS |
| .TP 17 |
| \fBACCRUE_ALWAYS\fR |
| If set, priority age factor will be increased despite job ineligibility due to |
| either dependencies, holds or begin time in the future. Accrue limits are |
| ignored. |
| .IP |
| |
| .TP |
| \fBCALCULATE_RUNNING\fR |
| If set, priorities will be recalculated not only for pending jobs, but also |
| running and suspended jobs. |
| .IP |
| |
| .TP |
| \fBDEPTH_OBLIVIOUS\fR |
| If set, priority will be calculated in a manner similar to the normal |
| multifactor calculation, but the depth of the associations in the tree does |
| not adversely affect their priority. This option automatically enables |
| NO_FAIR_TREE. |
| .IP |
| |
| .TP |
| \fBNO_FAIR_TREE\fR |
| Disables the "fair tree" algorithm, and reverts to "classic" fair share |
| priority scheduling. |
| .IP |
| |
| .TP |
| \fBINCR_ONLY\fR |
| If set, priority values will only increase in value. Job priority will never |
| decrease in value. |
| .IP |
| |
| .TP |
| \fBMAX_TRES\fR |
| If set, the weighted TRES value (e.g. TRESBillingWeights) is calculated as the |
| MAX of individual TRESs on a node (e.g. cpus, mem, gres) plus the sum of all |
| global TRESs (e.g. licenses). |
| .IP |
| |
| .TP |
| \fBMAX_TRES_GRES\fR |
| If set, the weighted TRES value (e.g. TRESBillingWeights) is calculated as the |
| MAX of individual TRESs on a node (e.g. cpus, mem), plus the billable gres, plus |
| the sum of all global TRESs (e.g. licenses). |
| .IP |
| |
| .TP |
| \fBNO_NORMAL_ALL\fR |
| If set, all NO_NORMAL_* flags are set. |
| .IP |
| |
| .TP |
| \fBNO_NORMAL_ASSOC\fR |
| If set, the association factor is not normalized against the highest association |
| priority. |
| .IP |
| |
| .TP |
| \fBNO_NORMAL_PART\fR |
| If set, the partition factor is not normalized against the highest partition |
| \fBPriorityJobFactor\fR. |
| .IP |
| |
| .TP |
| \fBNO_NORMAL_QOS\fR |
| If set, the QOS factor is not normalized against the highest qos priority. |
| .IP |
| |
| .TP |
| \fBNO_NORMAL_TRES\fR |
| If set, the TRES factor is not normalized against the job's partition TRES |
| counts. |
| .IP |
| |
| .TP |
| \fBSMALL_RELATIVE_TO_TIME\fR |
| If set, the job's size component will be based not upon the job size alone,
| but upon the job's size divided by its time limit.
| .RE |
| .IP |
| |
| .TP |
| \fBPriorityMaxAge\fR |
| Specifies the job age which will be given the maximum age factor in computing |
| priority. For example, a value of 30 minutes would result in all jobs over
| 30 minutes old receiving the same age\-based priority.
| Applicable only if PriorityType=priority/multifactor. |
| The unit is a time string (i.e. min, hr:min:00, days\-hr:min:00, |
| or days\-hr). The default value is 7\-0 (7 days). |
| .IP |
| |
| .TP |
| \fBPriorityParameters\fR |
| Arbitrary string used by the PriorityType plugin. |
| .IP |
| |
| .TP |
| \fBPrioritySiteFactorParameters\fR |
| Arbitrary string used by the PrioritySiteFactorPlugin plugin. |
| .IP |
| |
| .TP |
| \fBPrioritySiteFactorPlugin\fR |
| This specifies an optional plugin to be used alongside "priority/multifactor",
| which is meant to initially set and continuously update the SiteFactor |
| priority factor. Unset by default. |
| .IP |
| |
| .TP |
| \fBPriorityType\fR |
| This specifies the plugin to be used in establishing a job's scheduling |
| priority. |
| Also see \fBPriorityFlags\fR for configuration options. |
| The default value is "priority/multifactor". |
| .IP |
| .RS |
| .TP |
| \fBpriority/basic\fR |
| Jobs are evaluated in a First In, First Out (FIFO) manner. |
| .IP |
| |
| .TP |
| \fBpriority/multifactor\fR |
| Jobs are assigned a priority based upon a variety of factors |
| that include size, age, Fairshare, etc. |
| .IP |
| .RE |
| .RS |
| .nr step 1 1 |
| When not FIFO scheduling, jobs are prioritized in the following order: |
| .br |
| |
| 1. Jobs that can preempt |
| .br |
| 2. Jobs with an advanced reservation |
| .br |
| 3. Partition PriorityTier |
| .br |
| 4. Job priority |
| .br |
| 5. Job submit time |
| .br |
| 6. Job ID |
| .RE |
| .IP |
| |
| .TP |
| \fBPriorityUsageResetPeriod\fR |
| At this interval the usage of associations will be reset to 0. This is used
| if you want to enforce hard limits of time usage per association. If
| PriorityDecayHalfLife is set to 0, no decay will happen and this is the
| only way to reset the usage accumulated by running jobs. By default this is
| turned off, and it is advised to use the PriorityDecayHalfLife option instead
| to avoid a situation where nothing can run on your cluster. However, if your
| allocation scheme only grants fixed amounts of time on your system, this is
| the way to enforce it.
| Applicable only if PriorityType=priority/multifactor. |
| .IP |
| .RS |
| .TP 12 |
| \fBNONE\fR |
| Never clear historic usage. The default value. |
| .IP |
| |
| .TP |
| \fBNOW\fR |
| Clear the historic usage now. |
| Executed at startup and reconfiguration time. |
| .IP |
| |
| .TP |
| \fBDAILY\fR |
| Cleared every day at midnight. |
| .IP |
| |
| .TP |
| \fBWEEKLY\fR |
| Cleared every week on Sunday at time 00:00. |
| .IP |
| |
| .TP |
| \fBMONTHLY\fR |
| Cleared on the first day of each month at time 00:00. |
| .IP |
| |
| .TP |
| \fBQUARTERLY\fR |
| Cleared on the first day of each quarter at time 00:00. |
| .IP |
| |
| .TP |
| \fBYEARLY\fR |
| Cleared on the first day of each year at time 00:00. |
| .RE |
| .IP |
| |
| .TP |
| \fBPriorityWeightAge\fR |
| An integer value that sets the degree to which the queue wait time |
| component contributes to the job's priority. |
| Applicable only if PriorityType=priority/multifactor. |
| Requires AccountingStorageType=accounting_storage/slurmdbd. |
| The default value is 0. |
| .IP |
| |
| .TP |
| \fBPriorityWeightAssoc\fR |
| An integer value that sets the degree to which the association |
| component contributes to the job's priority. |
| Applicable only if PriorityType=priority/multifactor. |
| The default value is 0. |
| .IP |
| |
| .TP |
| \fBPriorityWeightFairshare\fR |
| An integer value that sets the degree to which the fair\-share |
| component contributes to the job's priority. |
| Applicable only if PriorityType=priority/multifactor. |
| Requires AccountingStorageType=accounting_storage/slurmdbd. |
| The default value is 0. |
| .IP |
| |
| .TP |
| \fBPriorityWeightJobSize\fR |
| An integer value that sets the degree to which the job size |
| component contributes to the job's priority. |
| Applicable only if PriorityType=priority/multifactor. |
| The default value is 0. |
| .IP |
| |
| .TP |
| \fBPriorityWeightPartition\fR |
| Partition factor used by priority/multifactor plugin in calculating job priority. |
| Applicable only if PriorityType=priority/multifactor. |
| The default value is 0. |
| .IP |
| |
| .TP |
| \fBPriorityWeightQOS\fR |
| An integer value that sets the degree to which the Quality Of Service |
| component contributes to the job's priority. |
| Applicable only if PriorityType=priority/multifactor. |
| The default value is 0. |
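| .IP
| As an illustration only (actual weights are site\-specific), the weight
| parameters above are typically configured together, for example:
| .IP
| .nf
| PriorityType=priority/multifactor
| PriorityWeightAge=1000
| PriorityWeightFairshare=10000
| PriorityWeightJobSize=1000
| PriorityWeightPartition=1000
| PriorityWeightQOS=2000
| .fi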
| .IP |
| |
| .TP |
| \fBPriorityWeightTRES\fR |
| A comma\-separated list of TRES Types and weights that sets the degree that each |
| TRES Type contributes to the job's priority. |
| .IP |
| .nf |
| e.g. |
| PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000 |
| .fi |
| |
| Applicable only if PriorityType=priority/multifactor and if |
| AccountingStorageTRES is configured with each TRES Type. |
| Negative values are allowed. |
| The default values are 0. |
| .IP |
| |
| .TP |
| \fBPrivateData\fR |
| This controls what type of information is hidden from regular users. |
| By default, all information is visible to all users. |
| User \fBSlurmUser\fR and \fBroot\fR can always view all information. |
| Multiple values may be specified with a comma separator. |
| Acceptable values include: |
| .IP |
| .RS |
| .TP |
| \fBaccounts\fR |
| (NON\-SlurmDBD ACCOUNTING ONLY) Prevents users from viewing any account |
| definitions unless they are coordinators of them. |
| .IP |
| |
| .TP |
| \fBevents\fR |
| Prevents users from viewing event information unless they have operator status
| or above. |
| .IP |
| |
| .TP |
| \fBjobs\fR |
| Prevents users from viewing jobs or job steps belonging |
| to other users. (NON\-SlurmDBD ACCOUNTING ONLY) Prevents users from viewing |
| job records belonging to other users unless they are coordinators of |
| the association running the job when using sacct. |
| .IP |
| |
| .TP |
| \fBnodes\fR |
| Prevents users from viewing node state information. |
| .IP |
| |
| .TP |
| \fBpartitions\fR |
| Prevents users from viewing partition state information. |
| .IP |
| |
| .TP |
| \fBreservations\fR |
| Prevents regular users from viewing reservations which they can not use. |
| .IP |
| |
| .TP |
| \fBusage\fR |
| Prevents users from viewing usage of any other user. This applies to sshare.
| (NON\-SlurmDBD ACCOUNTING ONLY) Prevents users from viewing
| usage of any other user. This applies to sreport.
| .IP |
| |
| .TP |
| \fBusers\fR |
| (NON\-SlurmDBD ACCOUNTING ONLY) Prevents users from viewing |
| information of any user other than themselves. This also means users can
| only see the associations they are involved with.
| Coordinators can see associations of all users in the account they are |
| coordinator of, but can only see themselves when listing users. |
| .RE |
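| .IP
| For example, to hide job, usage and user information from regular users
| (an illustrative combination only):
| .IP
| .nf
| PrivateData=jobs,usage,users
| .fi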
| .IP |
| |
| .TP |
| \fBProctrackType\fR |
| Identifies the plugin to be used for process tracking on a job step basis. |
| The slurmd daemon uses this mechanism to identify all processes |
| which are children of processes it spawns for a user job step. |
| \fBNOTE\fR: "proctrack/linuxproc" and "proctrack/pgid" can fail to |
| identify all processes associated with a job since processes |
| can become a child of the init process (when the parent process |
| terminates) or change their process group. |
| To reliably track all processes, "proctrack/cgroup" is highly recommended. |
| \fBNOTE\fR: The \fBJobContainerType\fR applies to job namespace isolation,
| while \fBProctrackType\fR applies to job resource limits and tracking. |
| Acceptable values at present include: |
| .IP |
| .RS |
| .TP |
| \fBproctrack/cgroup\fR |
| Uses linux cgroups to constrain and track processes, and is the default |
| for systems with cgroup support. |
| .br |
| \fBNOTE\fR: See "man cgroup.conf" for configuration details. |
| .IP |
| |
| .TP |
| \fBproctrack/linuxproc\fR |
| Uses linux process tree using parent process IDs. |
| .IP |
| |
| .TP |
| \fBproctrack/pgid\fR |
| Uses Process Group IDs. |
| .br |
| \fBNOTE\fR: This is the default for the BSD family. |
| .RE |
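| .IP
| For example, to use the recommended cgroup\-based tracking noted above:
| .IP
| .nf
| ProctrackType=proctrack/cgroup
| .fi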
| .IP |
| |
| .TP |
| \fBProlog\fR |
| Pathname of a program for the slurmd to execute whenever it is asked to run a |
| job step from a new job allocation. If it is not an absolute path name (i.e. it |
| does not start with a slash), it will be searched for in the same directory as |
| the slurm.conf file. A glob pattern (See \fBglob\fR (7)) may also be used to |
| specify more than one program to run (e.g. "/etc/slurm/prolog.d/*"). When more |
| than one prolog script is configured, they are executed in reverse alphabetical |
| order (z-a -> Z-A -> 9-0). The slurmd executes the prolog before starting |
| the first job step. The prolog script or scripts may be used to purge files, |
| enable user login, etc. By default there is no prolog. Any configured script |
| is expected to complete execution quickly (in less time than |
| \fBMessageTimeout\fR). |
| If the prolog fails (returns a non\-zero exit code), this will result in the |
| node being set to a DRAIN state and the job being requeued. The job will be |
| placed in a held state, unless \fBnohold_on_prolog_fail\fR is configured in |
| \fBSchedulerParameters\fR. |
| See \fBProlog and Epilog Scripts\fR for more information. |
| |
| \fB\fBNOTE\fR: It is possible to configure multiple prolog scripts by including |
| this option on multiple lines.\fR |
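| .IP
| As a sketch only (the paths are illustrative), a directory of prolog scripts
| can be configured with a glob pattern, or several scripts can be listed on
| separate lines:
| .IP
| .nf
| Prolog=/etc/slurm/prolog.d/*
| # or, alternatively, multiple explicit entries (hypothetical script names):
| Prolog=/etc/slurm/prolog_gpu.sh
| Prolog=/etc/slurm/prolog_scratch.sh
| .fi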
| .IP |
| |
| .TP |
| \fBPrologEpilogTimeout\fR |
| The interval in seconds Slurm waits for Prolog and Epilog before terminating |
| them. The default behavior is to wait indefinitely. This interval applies to |
| the Prolog and Epilog run by slurmd daemon before and after the job, the |
| PrologSlurmctld and EpilogSlurmctld run by slurmctld daemon, and the SPANK |
| plugin prolog/epilog calls: slurm_spank_job_prolog and slurm_spank_job_epilog. |
| .br |
| If the PrologSlurmctld times out, the job is requeued if possible. |
| If the Prolog or slurm_spank_job_prolog time out, the job is requeued if |
| possible and the node is drained. |
| If the Epilog or slurm_spank_job_epilog time out, the node is drained. |
| In all cases, errors are logged. |
| .br |
| \fB\fBNOTE\fR: This value is not used for prologs if PrologTimeout is configured.
| Likewise, this value is not used for epilogs if EpilogTimeout is configured.\fR
| .IP |
| |
| .TP |
| \fBPrologTimeout\fR |
| The interval in seconds Slurm waits for the Prolog before terminating it. The
| default value is \fBPrologEpilogTimeout\fR. This interval applies to the Prolog |
| run by slurmd daemon before the job, the PrologSlurmctld run by slurmctld |
| daemon, and the SPANK plugin prolog call: slurm_spank_job_prolog. |
| .br |
| If the PrologSlurmctld times out, the job is requeued if possible. |
| If the Prolog or slurm_spank_job_prolog time out, the job is requeued if |
| possible and the node is drained. In all cases, errors are logged. |
| .IP |
| |
| .TP |
| \fBPrologFlags\fR |
| Flags to control the Prolog behavior. By default no flags are set. |
| Multiple flags may be specified in a comma\-separated list. |
| Currently supported options are: |
| .IP |
| .RS |
| .TP 8 |
| \fBAlloc\fR |
| If set, the Prolog script will be executed at job allocation. By default, |
| Prolog is executed just before the task is launched. Therefore, when salloc |
| is started, no Prolog is executed. Alloc is useful for preparing things |
| before a user starts to use any allocated resources. |
| In particular, this flag is needed on a Cray system when cluster compatibility |
| mode is enabled. |
| |
| \fBNOTE\fR: Use of the Alloc flag will increase the time required to start jobs. |
| .IP |
| |
| .TP |
| \fBContain\fR |
| At job allocation time, use the ProcTrack plugin to create a job container |
| on all allocated compute nodes. |
| This container may be used for user processes not launched under Slurm control, |
| for example pam_slurm_adopt may place processes launched through a direct user |
| login into this container. If using pam_slurm_adopt, then ProcTrackType must be |
| set to \fBproctrack/cgroup\fR. |
| Setting the Contain flag implicitly sets the Alloc flag.
| .IP |
| |
| .TP |
| \fBDeferBatch\fR |
| If set, slurmctld will wait until the prolog completes on all allocated |
| nodes before sending the batch job launch request. With just the Alloc flag, |
| slurmctld will launch the batch step as soon as the first node in the job |
| allocation completes the prolog. |
| .IP |
| |
| .TP |
| \fBNoHold\fR |
| If set, the Alloc flag should also be set. This will allow for salloc to not |
| block until the prolog is finished on each node. The blocking will happen when |
| steps reach the slurmd and before any execution has happened in the step. |
| This is a much faster way to work; if you are using srun to launch your tasks,
| you should use this flag. This flag cannot be combined with the Contain or X11
| flags.
| .IP |
| |
| .TP |
| \fBForceRequeueOnFail\fR |
| When a batch job fails to launch due to a Prolog failure, always requeue it |
| automatically even if the job requested no requeues. |
| |
| \fB\fBNOTE\fR: Setting this flag implicitly sets the Alloc flag.\fR |
| .IP |
| |
| .TP |
| \fBRunInJob\fR |
| Make the Prolog/Epilog run in the extern slurmstepd. This will run it as one
| of the job's processes and contain it in the job's cgroup if one is configured.
| Setting the RunInJob flag implicitly sets the Contain and Alloc flags.
| .IP |
| |
| .TP |
| \fBSerial\fR |
| By default, the Prolog and Epilog scripts run concurrently on each node. |
| This flag forces those scripts to run serially within each node, but with |
| a significant penalty to job throughput on each node. |
| |
| \fB\fBNOTE\fR: This is incompatible with RunInJob.\fR |
| .IP |
| |
| .TP |
| \fBX11\fR |
| Enable Slurm's built\-in X11 forwarding capabilities. |
| This is incompatible with \fBProctrackType=proctrack/linuxproc\fR. |
| Setting the X11 flag implicitly enables both Contain and Alloc flags as well. |
| .RE |
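| .IP
| For example, to run the Prolog at allocation time and create a job container
| for use with pam_slurm_adopt as discussed above (illustrative only):
| .IP
| .nf
| # Contain implicitly sets the Alloc flag
| PrologFlags=Contain
| ProctrackType=proctrack/cgroup
| .fi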
| .IP |
| |
| .TP |
| \fBPrologSlurmctld\fR |
| Fully qualified pathname of a program for the slurmctld daemon to execute |
| before granting a new job allocation (e.g. |
| "/usr/local/slurm/prolog_controller"). |
| The program executes as SlurmUser on the same node where the slurmctld daemon |
| executes, giving it permission to drain |
| nodes and requeue the job if a failure occurs or cancel the job if appropriate. |
| Exactly what the program does and how it accomplishes this is completely at |
| the discretion of the system administrator. |
| Information about the job being initiated, its allocated nodes, etc. are |
| passed to the program using environment variables. |
| While this program is running, the nodes associated with the job will
| have a POWER_UP/CONFIGURING flag set in their state, which can be readily
| viewed.
| The slurmctld daemon will wait indefinitely for this program to complete. |
| Once the program completes with an exit code of zero, the nodes will be
| considered ready for use and the job will be started.
| If some node can not be made available for use, the program should drain |
| the node (typically using the scontrol command) and terminate with a non\-zero |
| exit code. |
| A non\-zero exit code will result in the job being requeued (where possible) |
| or killed. Note that only batch jobs can be requeued. |
| See \fBProlog and Epilog Scripts\fR for more information. |
| |
| \fB\fBNOTE\fR: It is possible to configure multiple prolog scripts by including |
| this option on multiple lines.\fR |
| .IP |
| |
| .TP |
| \fBPropagatePrioProcess\fR |
| Controls the scheduling priority (nice value) of user spawned tasks. |
| .IP |
| .RS |
| .TP 5 |
| \fB0\fR |
| The tasks will inherit the scheduling priority from the slurm daemon. |
| This is the default value. |
| .IP |
| |
| .TP |
| \fB1\fR |
| The tasks will inherit the scheduling priority of the command used to |
| submit them (e.g. \fBsrun\fR or \fBsbatch\fR). |
| Unless the job is submitted by user root, the tasks will have a scheduling |
| priority no higher than the slurm daemon spawning them. |
| .IP |
| |
| .TP |
| \fB2\fR |
| The tasks will inherit the scheduling priority of the command used to |
| submit them (e.g. \fBsrun\fR or \fBsbatch\fR) with the restriction that |
| their nice value will always be one higher than the slurm daemon (i.e. the
| tasks' scheduling priority will be lower than that of the slurm daemon).
| .RE |
| .IP |
| |
| .TP |
| \fBPropagateResourceLimits\fR |
| A comma\-separated list of resource limit names. |
| The slurmd daemon uses these names to obtain the associated (soft) limit |
| values from the user's process environment on the submit node. |
| These limits are then propagated and applied to the jobs that |
| will run on the compute nodes. |
| This parameter can be useful when system limits vary among nodes. |
| Any resource limits that do not appear in the list are not propagated. |
| However, the user can override this by specifying which resource limits |
| to propagate with the sbatch or srun "\-\-propagate" option. If neither
| \fBPropagateResourceLimits\fR nor \fBPropagateResourceLimitsExcept\fR is
| configured and the "\-\-propagate" option is not specified, then the default
| action is to propagate all limits. Only one of the parameters, either |
| PropagateResourceLimits or PropagateResourceLimitsExcept, may be specified. |
| The user limits can not exceed hard limits under which the slurmd daemon |
| operates. If the user limits are not propagated, the limits from the slurmd |
| daemon will be propagated to the user's job. The limits used for the Slurm |
| daemons can be set in the /etc/sysconfig/slurm file. For more information, see:
| https://slurm.schedmd.com/faq.html#memlock |
| The following limit names are supported by Slurm (although some |
| options may not be supported on some systems): |
| .IP |
| .RS |
| .TP 10 |
| \fBALL\fR |
| All limits listed below (default) |
| .IP |
| |
| .TP |
| \fBNONE\fR |
| No limits listed below |
| .IP |
| |
| .TP |
| \fBAS\fR |
| The maximum address space (virtual memory) for a process. |
| .IP |
| |
| .TP |
| \fBCORE\fR |
| The maximum size of core file |
| .IP |
| |
| .TP |
| \fBCPU\fR |
| The maximum amount of CPU time |
| .IP |
| |
| .TP |
| \fBDATA\fR |
| The maximum size of a process's data segment |
| .IP |
| |
| .TP |
| \fBFSIZE\fR |
| The maximum size of files created. Note that if the user sets FSIZE to less |
| than the current size of the slurmd.log, job launches will fail with |
| a 'File size limit exceeded' error. |
| .IP |
| |
| .TP |
| \fBMEMLOCK\fR |
| The maximum size that may be locked into memory |
| .IP |
| |
| .TP |
| \fBNOFILE\fR |
| The maximum number of open files |
| .IP |
| |
| .TP |
| \fBNPROC\fR |
| The maximum number of processes available |
| .IP |
| |
| .TP |
| \fBRSS\fR |
| The maximum resident set size. Note that this only has effect with Linux |
| kernels 2.4.30 or older or BSD. |
| .IP |
| |
| .TP |
| \fBSTACK\fR |
| The maximum stack size |
| .RE |
| .IP |
| |
| .TP |
| \fBPropagateResourceLimitsExcept\fR |
| A comma\-separated list of resource limit names. |
| By default, all resource limits will be propagated, (as described by |
| the \fBPropagateResourceLimits\fR parameter), except for the limits |
| appearing in this list. The user can override this by specifying which |
| resource limits to propagate with the sbatch or srun "\-\-propagate" option. |
| See \fBPropagateResourceLimits\fR above for a list of valid limit names. |
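| .IP
| For example (illustrative only), to propagate all user limits except the
| locked\-memory limit:
| .IP
| .nf
| PropagateResourceLimitsExcept=MEMLOCK
| .fi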
| .IP |
| |
| .TP |
| \fBRebootProgram\fR |
| Program to be executed on each compute node to reboot it. Invoked on each node |
| once it becomes idle after the command "scontrol reboot" is executed by |
| an authorized user or a job is submitted with the "\-\-reboot" option. |
| After rebooting, the node is returned to normal use. |
| See \fBResumeTimeout\fR to configure the time you expect a reboot to finish in. |
| A node will be marked DOWN if it doesn't reboot within \fBResumeTimeout\fR. |
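| .IP
| A minimal sketch (the program path and timeout are illustrative assumptions,
| not defaults):
| .IP
| .nf
| RebootProgram=/usr/sbin/reboot
| ResumeTimeout=600
| .fi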
| .IP |
| |
| .TP |
| \fBReconfigFlags\fR |
| Flags to control various actions that may be taken when an "scontrol |
| reconfig" command is issued. Currently the options are: |
| .IP |
| .RS |
| .TP 17 |
| \fBKeepPartInfo\fR |
| If set, an "scontrol reconfig" command will maintain the in\-memory |
| value of partition "state" and other parameters that may have been |
| dynamically updated by "scontrol update". Partition information in |
| the slurm.conf file will be merged with in\-memory data. This flag |
| supersedes the KeepPartState flag. |
| .IP |
| |
| .TP |
| \fBKeepPartState\fR |
| If set, an "scontrol reconfig" command will preserve only the current |
| "state" value of in\-memory partitions and will reset all other |
| parameters of the partitions that may have been dynamically updated by |
| "scontrol update" to the values from the slurm.conf file. Partition |
| information in the slurm.conf file will be merged with in\-memory |
| data. |
| .IP |
| |
| .TP |
| \fBKeepPowerSaveSettings\fR |
| If set, an "scontrol reconfig" command will preserve the current state of |
| SuspendExcNodes, SuspendExcParts and SuspendExcStates. |
| .IP |
| |
| .RE |
| .RS 7 |
| By default none of the above flags are set, and
| "scontrol reconfig" will rebuild the partition information using only
| the definitions in the slurm.conf file.
| .RE |
| .IP |
| |
| .TP |
| \fBRequeueExit\fR |
| Enables automatic requeue for batch jobs which exit with the specified |
| values. |
| Separate multiple exit codes with a comma and/or specify numeric ranges using
| a "\-" separator (e.g. "RequeueExit=1\-9,18").
| Jobs will be put back into pending state and later scheduled again.
| Restarted jobs will have the environment variable \fBSLURM_RESTART_COUNT\fP |
| set to the number of times the job has been restarted. |
| .IP |
| |
| .TP |
| \fBRequeueExitHold\fR |
| Enables automatic requeue for batch jobs which exit with the specified |
| values, with these jobs being held until released manually by the user. |
| Separate multiple exit codes with a comma and/or specify numeric ranges using
| a "\-" separator (e.g. "RequeueExitHold=10\-12,16").
| These jobs are put in the \fBJOB_SPECIAL_EXIT\fP exit state. |
| Restarted jobs will have the environment variable \fBSLURM_RESTART_COUNT\fP |
| set to the number of times the job has been restarted. |
| .IP |
| |
| .TP |
| \fBResumeFailProgram\fR |
| The program that will be executed when nodes fail to resume by
| \fBResumeTimeout\fR. The argument to the program will be the names of the failed
| nodes (using Slurm's hostlist expression format). |
| Programs will be killed if they run longer than the largest configured, global |
| or partition, \fBResumeTimeout\fR or \fBSuspendTimeout\fR. |
| .IP |
| |
| .TP |
| \fBResumeProgram\fR |
| Slurm supports a mechanism to reduce power consumption on nodes that |
| remain idle for an extended period of time. |
| This is typically accomplished by reducing voltage and frequency or powering |
| the node down. |
| \fBResumeProgram\fR is the program that will be executed when a node |
| in power save mode is assigned work to perform. |
| For reasons of reliability, \fBResumeProgram\fR may execute more than once |
| for a node when the \fBslurmctld\fR daemon crashes and is restarted. |
| If \fBResumeProgram\fR is unable to restore a node to service with a responding |
| slurmd and an updated BootTime, it should set the node state to DOWN, which will |
| result in a requeue of any job associated with the node - this will happen |
| automatically if the node doesn't register within ResumeTimeout. |
| \fBSchedulerParameters=requeue_on_resume_failure\fR can be used to always |
| requeue batch jobs in this situation, even if the job requested no requeues. |
| If the node isn't actually rebooted (i.e. when multiple\-slurmd is configured) |
| starting slurmd with "\-b" option might be useful. |
| The program executes as \fBSlurmUser\fR. |
| The argument to the program will be the names of nodes to |
| be removed from power savings mode (using Slurm's hostlist |
| expression format). A job to node mapping is available in JSON format by |
| reading the temporary file specified by the \fBSLURM_RESUME_FILE\fR environment |
| variable. |
| This file is closed once slurmctld shuts down. If ResumeProgram is running, |
| slurmctld shutdown is delayed by up to ten seconds to give ResumeProgram time |
| to read this file. Therefore, this file should be read at the beginning of |
| ResumeProgram. |
| By default no program is run. |
| Programs will be killed if they run longer than the largest configured, global |
| or partition, \fBResumeTimeout\fR or \fBSuspendTimeout\fR. |
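| .IP
| A minimal power\-saving sketch (the program path and values are illustrative
| assumptions, not defaults):
| .IP
| .nf
| ResumeProgram=/usr/local/slurm/resume_nodes.sh
| ResumeTimeout=600
| ResumeRate=100
| .fi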
| .IP |
| |
| .TP |
| \fBResumeRate\fR |
| The rate at which nodes in power save mode are returned to normal |
| operation by \fBResumeProgram\fR. |
| The value is a number of nodes per minute and it can be used to prevent |
| power surges if a large number of nodes in power save mode are |
| assigned work at the same time (e.g. a large job starts). |
| A value of zero results in no limits being imposed. |
| The default value is 300 nodes per minute. |
| .IP |
| |
| .TP |
| \fBResumeTimeout\fR |
| Maximum time permitted (in seconds) between when a node resume request |
| is issued and when the node is actually available for use. |
| Nodes which fail to respond in this time frame will be marked DOWN and |
| the jobs scheduled on the node requeued if possible. |
| Nodes which reboot after this time frame will be marked DOWN with a reason of |
| "Node unexpectedly rebooted." |
| The default value is 60 seconds, and the maximum value is either 65533 or |
| INFINITE. |
| .IP |
| |
| .TP |
| \fBResvEpilog\fR |
| Fully qualified pathname of a program for the slurmctld to execute |
| when a reservation ends. It does not run when a running reservation is |
| deleted. The program can be used to cancel jobs, modify |
| partition configuration, etc. |
| The reservation named will be passed as an argument to the program. |
| By default there is no epilog. |
| .IP |
| |
| .TP |
| \fBResvOverRun\fR |
| Describes how long a job already running in a reservation should be |
| permitted to execute after the end time of the reservation has been |
| reached. |
| The time period is specified in minutes and the default value is 0 |
| (kill the job immediately). |
| The value may not exceed 65533 minutes, although a value of "UNLIMITED" |
| is supported to permit a job to run indefinitely after its reservation |
| is terminated. |
| .IP |
| |
| .TP |
| \fBResvProlog\fR |
| Fully qualified pathname of a program for the slurmctld to execute |
| when a reservation begins. The program can be used to cancel jobs, modify |
| partition configuration, etc. |
| The reservation named will be passed as an argument to the program. |
| By default there is no prolog. |
| .IP |
| |
| .TP |
| \fBReturnToService\fR |
| Controls when a DOWN node will be returned to service. |
| The default value is 0. |
| Supported values include |
| .IP |
| .RS |
| .TP 4 |
| \fB0\fR |
| A node will remain in the DOWN state until a system administrator |
| explicitly changes its state (even if the slurmd daemon registers |
| and resumes communications). |
| .IP |
| |
| .TP |
| \fB1\fR |
| A DOWN node will become available for use upon registration with a |
| valid configuration only if it was set DOWN due to being non\-responsive. |
| If the node was set DOWN for any other reason (low memory, |
| unexpected reboot, etc.), its state will not automatically |
| be changed. |
| A node registers with a valid configuration if its memory, GRES, CPU count, |
| etc. are equal to or greater than the values configured in slurm.conf. |
| .IP |
| |
| .TP |
| \fB2\fR |
| A DOWN node will become available for use upon registration with a |
| valid configuration. The node could have been set DOWN for any reason. |
| A node registers with a valid configuration if its memory, GRES, CPU count, |
| etc. are equal to or greater than the values configured in slurm.conf. |
| .RE |
| .IP |
| |
| .TP |
| \fBSchedulerParameters\fR |
| The interpretation of this parameter varies by \fBSchedulerType\fR. |
| Multiple options may be comma separated. |
| .IP |
| .RS |
| .TP |
| \fBallow_zero_lic\fR |
| If set, then job submissions requesting more than configured licenses won't be |
| rejected. |
| .IP |
| |
| .TP |
| \fBassoc_limit_stop\fR |
| If set and a job cannot start due to association limits, then do not attempt |
| to initiate any lower priority jobs in that partition. Setting this can |
| decrease system throughput and utilization, but avoids potentially starving
| larger jobs that would otherwise be prevented from launching indefinitely.
| .IP |
| |
| .TP |
| \fBbatch_sched_delay\fR=\# |
| How long, in seconds, the scheduling of batch jobs can be delayed. |
| This can be useful in a high\-throughput environment in which batch jobs are |
| submitted at a very high rate (i.e. using the sbatch command) and one wishes |
| to reduce the overhead of attempting to schedule each job at submit time. |
| The default value is 3 seconds. |
| .IP |
| |
| .TP |
| \fBbb_array_stage_cnt\fR=\# |
| Number of tasks from a job array that should be available for burst buffer |
| resource allocation. Higher values will increase the system overhead as each |
| task from the job array will be moved to its own job record in memory, so |
| relatively small values are generally recommended. |
| The default value is 10. |
| .IP |
| |
| .TP |
| \fBbf_allow_magnetic_slot\fR |
| By default the backfill scheduler will not add a slot in the bf plan when a job |
| attempts to use a magnetic reservation. This option reverses this to make the |
| backfill scheduler add slots in the bf plan when jobs are eligible to run in a |
| magnetic reservation. With this option enabled, jobs inside magnetic |
| reservations will respect priorities and also be counted against the backfill |
| limits such as \fBbf_max_job_test\fR. |
| \fBNOTE\fR: |
| Backfill first evaluates jobs inside reservations, which means all |
| magnetic jobs will be tested first. When enabling this option, make sure to |
| revise (increasing if necessary) the configured backfill limits to ensure the
| backfill cycle gets to test the expected jobs in the queue.
| \fBNOTE\fR: |
| If \fBbf_one_resv_per_job\fR is used along with this option the magnetic |
| reservation slot will now be the only slot in the bf plan. Otherwise the |
| slot will be the first one outside the magnetic reservation. |
| .IP |
| |
| .TP |
| \fBbf_busy_nodes\fR |
| When selecting resources for pending jobs to reserve for future execution |
| (i.e. the job can not be started immediately), then preferentially select |
| nodes that are in use. |
| This will tend to leave currently idle resources available for backfilling |
| longer running jobs, but may result in allocations having less than optimal |
| network topology. |
| This option is currently only supported by the select/cons_tres plugin. |
| .IP |
| |
| .TP |
| \fBbf_continue\fR |
| The backfill scheduler periodically releases locks in order to permit other |
| operations to proceed rather than blocking all activity for what could be an |
| extended period of time. |
| Setting this option will cause the backfill scheduler to continue processing |
| pending jobs from its original job list after releasing locks even if job |
| or node state changes. |
| .IP |
| |
| .TP |
| \fBbf_hetjob_immediate\fR |
| Instruct the backfill scheduler to attempt to start a heterogeneous job as |
| soon as all of its components are determined able to do so. Otherwise, the |
| backfill scheduler will delay heterogeneous jobs initiation attempts until |
| after the rest of the queue has been processed. This delay may result in lower |
| priority jobs being allocated resources, which could delay the initiation of |
| the heterogeneous job due to account and/or QOS limits being reached. This |
| option is disabled by default. If enabled and \fBbf_hetjob_prio=min\fR is not |
| set, then it will be set automatically.
| .IP |
| |
| .TP |
| \fBbf_hetjob_prio=[min|avg|max]\fR |
| At the beginning of each backfill scheduling cycle, a list of pending to be |
| scheduled jobs is sorted according to the precedence order configured in |
| \fBPriorityType\fR. This option instructs the scheduler to alter the sorting |
| algorithm to ensure that all components belonging to the same heterogeneous job |
| will be attempted to be scheduled consecutively (thus not fragmented in the |
| resulting list). More specifically, all components from the same heterogeneous |
| job will be treated as if they all have the same priority (minimum, average or |
| maximum depending upon this option's parameter) when compared with other jobs |
| (or other heterogeneous job components). The original order will be preserved |
| within the same heterogeneous job. Note that the operation is calculated for |
| the \fBPriorityTier\fR layer and for the \fBPriority\fR resulting from the |
| priority/multifactor plugin calculations. When enabled, if any heterogeneous job |
| requested an advanced reservation, then all of that job's components will be |
| treated as if they had requested an advanced reservation (and get |
| preferential treatment in scheduling). |
| |
| Note that this operation does not update the \fBPriority\fR values of the |
| heterogeneous job components, only their order within the list, so the output of |
| the sprio command will not be affected. |
| |
| Heterogeneous jobs have special scheduling properties: they are only scheduled |
| by the backfill scheduling plugin, each of their components is considered |
| separately when reserving resources (and might have different \fBPriorityTier\fR |
| or different \fBPriority\fR values), and no heterogeneous job component is |
| actually allocated resources until all of its components can be initiated.
| This may imply potential scheduling deadlock scenarios because components |
| from different heterogeneous jobs can start reserving resources in an |
| interleaved fashion (not consecutively), but none of the jobs can reserve |
| resources for all components and start. Enabling this option can help to |
| mitigate this problem. By default, this option is disabled. |
| .IP |
| |
| .TP |
| \fBbf_interval\fR=\# |
| The number of seconds between backfill iterations. |
| Higher values result in less overhead and better responsiveness. |
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| Default: 30, Min: 1, Max: 10800 (3h). |
| A setting of \-1 will disable the backfill scheduling loop. |
| .IP |
| |
| .TP |
| \fBbf_job_part_count_reserve\fR=\# |
| The backfill scheduling logic will reserve resources for the specified count |
| of highest priority jobs in each partition. |
| For example, bf_job_part_count_reserve=10 will cause the backfill scheduler to |
| reserve resources for the ten highest priority jobs in each partition. |
| Any lower priority job that can be started using currently available resources |
| and not adversely impact the expected start time of these higher priority jobs |
| will be started by the backfill scheduler.
| The default value is zero, which will reserve resources for any pending job |
| and delay initiation of lower priority jobs. |
| Also see bf_min_age_reserve and bf_min_prio_reserve. |
| Default: 0, Min: 0, Max: 100000. |
| .IP |
| |
| .TP |
| \fBbf_licenses\fR |
| Require the backfill scheduling logic to track and plan for license |
| availability. By default, any job blocked on license availability will not |
| have resources reserved which can lead to job starvation. |
| This option implicitly enables \fBbf_running_job_reserve\fR. |
| .IP |
| |
| .TP |
| \fBbf_max_job_array_resv\fR=\# |
| The maximum number of tasks from a job array for which the backfill scheduler |
| will reserve resources in the future. |
| Since job arrays can potentially have millions of tasks, the overhead in |
| reserving resources for all tasks can be prohibitive. |
| In addition various limits may prevent all the jobs from starting at the |
| expected times. |
| This has no impact upon the number of tasks from a job array that can be |
| started immediately, only those tasks expected to start at some future time. |
| Default: 20, Min: 0, Max: 1000. |
| \fBNOTE\fR: |
| Jobs submitted to multiple partitions appear in the job queue once per |
| partition. If different copies of a single job array record aren't consecutive |
| in the job queue and another job array record is in between, then |
| bf_max_job_array_resv tasks are considered per partition that the job is |
| submitted to. |
| .IP |
| |
| .TP |
| \fBbf_max_job_assoc\fR=\# |
| The maximum number of jobs per user association to attempt starting with the |
| backfill scheduler. |
| This setting is similar to \fBbf_max_job_user\fR but is handy if a user |
| has multiple associations equating to basically different users. |
| One can set this limit to prevent users from flooding the backfill |
| queue with jobs that cannot start and that prevent jobs of other users
| from starting.
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| Also see the \fBbf_max_job_user\fR, \fBbf_max_job_part\fR, |
| \fBbf_max_job_test\fR, and \fBbf_max_job_user_part\fR options. |
| Set \fBbf_max_job_test\fR to a value much higher than \fBbf_max_job_assoc\fR. |
| Default: 0 (no limit), Min: 0, Max: bf_max_job_test. |
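| .IP
| As an illustration only, these per\-association and per\-user limits are
| typically set well below bf_max_job_test, for example:
| .IP
| .nf
| SchedulerParameters=bf_max_job_test=2000,bf_max_job_assoc=50,bf_max_job_user=50
| .fi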
| .IP |
| |
| .TP |
| \fBbf_max_job_part\fR=\# |
| The maximum number of jobs per partition to attempt starting with the backfill |
| scheduler. This can be especially helpful for systems with large numbers of |
| partitions and jobs. |
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| Also see the \fBpartition_job_depth\fR and \fBbf_max_job_test\fR options. |
| Set \fBbf_max_job_test\fR to a value much higher than \fBbf_max_job_part\fR. |
| Default: 0 (no limit), Min: 0, Max: bf_max_job_test. |
| .IP |
| |
| .TP |
| \fBbf_max_job_start\fR=\# |
| The maximum number of jobs which can be initiated in a single iteration |
| of the backfill scheduler. |
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| Default: 0 (no limit), Min: 0, Max: 10000. |
| .IP |
| |
| .TP |
| \fBbf_max_job_test\fR=\# |
| The maximum number of jobs to attempt backfill scheduling for |
| (i.e. the queue depth). |
| Higher values result in more overhead and less responsiveness. |
| Until an attempt is made to backfill schedule a job, its expected |
| initiation time value will not be set. |
| In the case of large clusters, configuring a relatively small value may be |
| desirable. |
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| Default: 500, Min: 1, Max: 1,000,000. |
| .IP |
| |
| .TP |
| \fBbf_max_job_user\fR=\# |
| The maximum number of jobs per user to attempt starting with the backfill |
| scheduler for ALL partitions. |
| One can set this limit to prevent users from flooding the backfill |
| queue with jobs that cannot start and that prevent jobs of other users
| from starting. This is similar to the MAXIJOB limit in Maui.
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| Also see the \fBbf_max_job_part\fR, \fBbf_max_job_test\fR, and |
| \fBbf_max_job_user_part\fR options. |
| Set \fBbf_max_job_test\fR to a value much higher than \fBbf_max_job_user\fR. |
| Default: 0 (no limit), Min: 0, Max: bf_max_job_test. |
| .IP |
| |
| .TP |
| \fBbf_max_job_user_part\fR=\# |
| The maximum number of jobs per user per partition to attempt starting with the |
| backfill scheduler for any single partition. |
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| Also see the \fBbf_max_job_part\fR, \fBbf_max_job_test\fR, and |
| \fBbf_max_job_user\fR options. |
| Default: 0 (no limit), Min: 0, Max: bf_max_job_test. |
| .IP |
| |
| .TP |
| \fBbf_max_time\fR=\# |
| The maximum time in seconds the backfill scheduler can spend (including time |
| spent sleeping when locks are released) before discontinuing, even if maximum |
| job counts have not been reached. |
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| The default value is the value of bf_interval (which defaults to 30 seconds). |
| Default: bf_interval value (def. 30 sec), Min: 1, Max: 3600 (1h). |
| \fBNOTE\fR: If bf_interval is short and bf_max_time is large, this may cause |
| locks to be acquired too frequently and starve out other serviced RPCs. It's |
| advisable if using this parameter to set max_rpc_cnt high enough that |
| scheduling isn't always disabled, and low enough that the interactive |
| workload can get through in a reasonable period of time. max_rpc_cnt needs to |
| be below 256 (the default RPC thread limit). Running around the middle (150) |
| may give you good results. |
| \fBNOTE\fR: When increasing the amount of time spent in the backfill scheduling |
| cycle, Slurm can be prevented from responding to client requests in a timely |
| manner. To address this you can use \fBmax_rpc_cnt\fR to specify a number of
| queued RPCs at which the scheduler will stop and respond to these requests.
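| .IP
| A sketch of the kind of combination described above (values are illustrative
| and require site\-specific tuning):
| .IP
| .nf
| SchedulerParameters=bf_interval=60,bf_max_time=300,max_rpc_cnt=150
| .fi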
| .IP |
| |
| .TP |
| \fBbf_min_age_reserve\fR=\# |
| The backfill and main scheduling logic will not reserve resources for pending |
| jobs until they have been pending and runnable for at least the specified |
| number of seconds. |
| In addition, jobs waiting for less than the specified number of seconds will |
| not prevent a newly submitted job from starting immediately, even if the newly |
| submitted job has a lower priority. |
| This can be valuable if jobs lack time limits or all time limits have the same |
| value. |
| The default value is zero, which will reserve resources for any pending job |
| and delay initiation of lower priority jobs. |
| Also see bf_job_part_count_reserve and bf_min_prio_reserve. |
| Default: 0, Min: 0, Max: 2592000 (30 days). |
| .IP |
| |
| .TP |
| \fBbf_min_prio_reserve\fR=\# |
| The backfill and main scheduling logic will not reserve resources for pending |
| jobs unless they have a priority equal to or higher than the specified value. |
| In addition, jobs with a lower priority will not prevent a newly submitted job |
| from starting immediately, even if the newly submitted job has a lower priority. |
| This can be valuable if one wished to maximize system utilization without regard |
| for job priority below a certain threshold. |
| The default value is zero, which will reserve resources for any pending job |
| and delay initiation of lower priority jobs. |
| Also see bf_job_part_count_reserve and bf_min_age_reserve. |
| Default: 0, Min: 0, Max: 2^63. |
| .IP |
| |
| .TP |
| \fBbf_node_space_size\fR=\# |
| Size of backfill node_space table. Adding a single job to backfill reservations |
| in the worst case can consume two node_space records. |
| In the case of large clusters, configuring a relatively small value may be |
| desirable. |
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| Also see bf_max_job_test and bf_running_job_reserve. |
| Default: bf_max_job_test, Min: 2, Max: 2,000,000. |
| .IP |
| |
| .TP |
| \fBbf_one_resv_per_job\fR |
| Disallow adding more than one backfill reservation per job. |
| The scheduling logic builds a sorted list of job-partition pairs. Jobs |
| submitted to multiple partitions have as many entries in the list as requested |
| partitions. By default, the backfill scheduler may evaluate all the |
| job-partition entries for a single job, potentially reserving resources for |
| each pair, but only starting the job in the reservation offering the earliest |
| start time. |
| Having a single job reserving resources for multiple partitions could impede |
| other jobs (or hetjob components) from reserving resources already reserved for |
| the partitions that don't offer the earliest start time. |
| A single job that requests multiple partitions can also prevent itself from |
| starting earlier in a lower priority partition if the partitions overlap |
| nodes and a backfill reservation in the higher priority partition blocks nodes |
| that are also in the lower priority partition. |
| This option makes it so that a job submitted to multiple partitions will stop |
| reserving resources once the first job-partition pair has booked a backfill |
| reservation. Subsequent pairs from the same job will only be tested to start |
| now. This allows for other jobs to be able to book the other pairs' resources at
| the cost of not guaranteeing that the multi partition job will start in the |
| partition offering the earliest start time (unless it can start immediately). |
| This option is disabled by default. |
| .IP |
| |
| .TP |
| \fBbf_resolution\fR=\# |
| The number of seconds in the resolution of data maintained about when jobs |
| begin and end. Higher values result in better responsiveness and quicker |
| backfill cycles by using larger blocks of time to determine node eligibility. |
| However, higher values lead to less efficient system planning, and may miss |
| opportunities to improve system utilization. |
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| Default: 60, Min: 1, Max: 3600 (1 hour). |
| .IP |
| |
| .TP |
| \fBbf_running_job_reserve\fR |
| Add an extra step to backfill logic, which creates backfill reservations |
| for jobs running on whole nodes. |
| This option is disabled by default. |
| .IP |
| |
| .TP |
| \fBbf_topopt_enable\fR |
| Enable experimental hook to control whether to delay jobs in backfill for a |
| better placement. Modify \fIsrc/plugins/sched/backfill/oracle.c\fR for testing. |
| |
| It is recommended to disable the main scheduler so that all jobs are planned |
| through backfill and utilize the oracle() function. This can be done by setting
| \fBSchedulerParameters=sched_interval=-1\fR. |
| |
| It's also recommended to run with |
| \fBSchedulerParameters=bf_running_job_reserve\fR for better planning. |
| .IP |
| |
| .TP |
| \fBbf_topopt_iterations\fR |
| The number of successive backfill map slots that a job may be delayed. |
| This option applies only when the \fBbf_topopt_enable\fR is set. |
| .IP |
| |
| .TP |
| \fBbf_window\fR=\# |
| The number of minutes into the future to look when considering jobs to schedule. |
| Higher values result in more overhead and less responsiveness. |
| A value at least as long as the highest allowed time limit is generally |
| advisable to prevent job starvation. |
| In order to limit the amount of data managed by the backfill scheduler, |
| if the value of \fBbf_window\fR is increased, then it is generally advisable |
| to also increase \fBbf_resolution\fR. |
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| Default: 1440 (1 day), Min: 1, Max: 43200 (30 days). |
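| .IP
| For example (illustrative values), a site whose longest time limit is 7 days
| might pair a larger window with a coarser resolution:
| .IP
| .nf
| SchedulerParameters=bf_window=10080,bf_resolution=600
| .fi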
| .IP |
| |
| .TP |
| \fBbf_window_linear\fR=\# |
| For performance reasons, the backfill scheduler will decrease precision in |
| calculation of job expected termination times. By default, the precision starts |
| at 30 seconds and that time interval doubles with each evaluation of currently |
| executing jobs when trying to determine when a pending job can start. This |
| algorithm can support an environment with many thousands of running jobs, but |
| can result in the expected start time of pending jobs being gradually
| deferred due to lack of precision. A value for bf_window_linear will cause
| the time interval to be increased by a constant amount on each iteration. |
| The value is specified in units of seconds. For example, a value of 60 will |
| cause the backfill scheduler on the first iteration to identify the job ending |
| soonest and determine if the pending job can be started after that job plus |
| all other jobs expected to end within 30 seconds (default initial value) of the |
| first job. On the next iteration, the pending job will be evaluated for |
| starting after the next job expected to end plus all jobs ending within |
| 90 seconds of that time (30 second default, plus the 60 second option value). |
| The third iteration will have a 150 second window and the fourth 210 seconds. |
| Without this option, the time windows will double on each iteration and thus |
| be 30, 60, 120, 240 seconds, etc. The use of bf_window_linear is not recommended |
| with more than a few hundred simultaneously executing jobs. |
| .IP |
| |
| .TP |
| \fBbf_yield_interval\fR=\# |
| The backfill scheduler will periodically relinquish locks in order for other |
| pending operations to take place. |
| This specifies the times when the locks are relinquished in microseconds. |
| Smaller values may be helpful for high throughput computing when used in |
| conjunction with the \fBbf_continue\fR option. |
| Also see the \fBbf_yield_sleep\fR option. |
| Default: 2,000,000 (2 sec), Min: 1, Max: 10,000,000 (10 sec). |
| .IP |
| |
| .TP |
| \fBbf_yield_rpc_cnt\fR |
| If the number of active threads in the slurmctld daemon is lower than this |
| value, continue scheduling of jobs. The scheduler will check this condition at |
| certain points in code and release previously yielded locks if necessary. This |
| is used to instruct Slurm when to stop processing requests and start scheduling |
| new jobs, which is the opposite of \fBmax_rpc_cnt\fR. In conjunction with |
| \fBmax_rpc_cnt\fR, it can improve Slurm's responsiveness to spikes of requests. |
| Default: MAX((max_rpc_cnt / 10), 20) (option disabled), Min: 0, Max: 200. |
| .IP |
| .RS |
| \fBNOTE\fR: If a value is set, then a value lower than \fBmax_rpc_cnt\fR is |
| recommended. It may require some tuning for each system, but needs to be high |
| enough that scheduling isn't always disabled, and low enough that requests can |
| get through in a reasonable period of time. Avoid both values being close enough |
| to cause continuous switching between request processing and scheduling. |
| .RE |
| .IP |
| |
| .TP |
| \fBbf_yield_sleep\fR=\# |
| The backfill scheduler will periodically relinquish locks in order for other |
| pending operations to take place. |
| This specifies the length of time for which the locks are relinquished in |
| microseconds. |
| Also see the \fBbf_yield_interval\fR option. |
| Default: 500,000 (0.5 sec), Min: 1, Max: 10,000,000 (10 sec). |
| .IP |
| |
| .TP |
| \fBbuild_queue_timeout\fR=\# |
| Defines the maximum time that can be devoted to building a queue of jobs to |
| be tested for scheduling. |
| If the system has a huge number of jobs with dependencies, just building the |
| job queue can take so much time as to adversely impact overall system |
| performance and this parameter can be adjusted as needed. |
| The default value is 2,000,000 microseconds (2 seconds). |
| .IP |
| |
| .TP |
| \fBcorrespond_after_task_cnt\fR=\# |
| Defines the number of array tasks that get split for the potential aftercorr
| dependency check. A low number may result in dependent task check failures
| when the job being depended on gets purged before the split.
| Default: 10. |
| .IP |
| |
| .TP |
| \fBdefault_queue_depth\fR=\# |
| The default number of jobs to attempt scheduling (i.e. the queue depth) when a |
| running job completes or other routine actions occur, however the frequency |
| with which the scheduler is run may be limited by using the \fBdefer\fR or |
| \fBsched_min_interval\fR parameters described below. |
| The main scheduling loop will run (ignoring this limit) |
| on a less frequent basis as defined by the |
| \fBsched_interval\fR option described below. The default value is 100. |
| See the \fBpartition_job_depth\fR option to limit depth by partition. |
| .IP |
| |
| .TP |
| \fBdefer\fR |
| Setting this option will avoid attempting to schedule each job |
| individually at job submit time, but defer it until a later time when |
| scheduling multiple jobs simultaneously may be possible. |
| This option may improve system responsiveness when large numbers of jobs |
| (many hundreds) are submitted at the same time, but it will delay the |
| initiation time of individual jobs. Also see \fBdefault_queue_depth\fR above. |
| .IP |
| |
| .TP |
| \fBdefer_batch\fR |
| Like \fBdefer\fR, but only will defer scheduling for batch jobs. Interactive |
| allocations from salloc/srun will still attempt to schedule immediately upon |
| submission. |
| .IP |
| |
| .TP |
| \fBdelay_boot\fR=\# |
| Do not reboot nodes in order to satisfy this job's feature specification if
| the job has been eligible to run for less than this time period. |
| If the job has waited for less than the specified period, it will use only |
| nodes which already have the specified features. |
| The argument is in units of minutes. |
| Individual jobs may override this default value with the \fB\-\-delay\-boot\fR |
| option. |
| .IP |
| |
| .TP |
| \fBdisable_job_shrink\fR |
| Deny user requests to shrink the size of running jobs. (However, running jobs |
| may still shrink due to node failure if the \-\-no\-kill option was set.) |
| .IP |
| |
| .TP |
| \fBdisable_hetjob_steps\fR |
| Disable job steps that span heterogeneous job allocations. |
| .IP |
| |
| .TP |
| \fBenable_hetjob_steps\fR |
| Enable job steps that span heterogeneous job allocations. |
| The default value. |
| .IP |
| |
| .TP |
| \fBenable_job_state_cache\fR |
| Enables an independent cache of job state details within slurmctld. This allows |
| processing of `\fBsqueue\fR \-\-only\-job\-state` and related RPCs with minimal
| impact on other slurmctld operations. |
| .IP |
| |
| .TP |
| \fBenable_user_top\fR |
| Enable use of the "scontrol top" command by non\-privileged users. |
| .IP |
| |
| .TP |
| \fBextra_constraints\fR |
| Enable node filtering with the \-\-extra option for salloc, sbatch, and srun |
| and the node's Extra field. |
| .IP |
| |
| .TP |
| \fBignore_constraint_validation\fR |
| If set and a job requests \-\-constraint, any features in the request that
| would make the request invalid on the current system will not generate an
| error. This is helpful for dynamic systems where nodes with features come
| and go. Jobs will remain in the job queue until the requested feature is
| present in the cluster and available.
| Please note that using this option will not protect you from typos.
| See also ignore_prefer_validation. |
| .IP |
| |
| .TP |
| \fBIgnore_NUMA\fR |
| Some processors (e.g. AMD Opteron 6000 series) contain multiple NUMA nodes per |
| socket. This is a configuration which does not map into the hardware entities |
| that Slurm optimizes resource allocation for (PU/thread, core, socket, |
| baseboard, node and network switch). In order to optimize resource allocations |
| on such hardware, Slurm will consider each NUMA node within the socket as a |
| separate socket by default. Use the Ignore_NUMA option to report the correct |
| socket count, but \fBnot\fR optimize resource allocations on the NUMA nodes. |
| .IP |
| |
| \fBNOTE\fR: Since hwloc 2.0, NUMA nodes are not part of the main/CPU topology tree.
| Because of that, if Slurm is built with hwloc 2.0 or above, Slurm will treat
| HWLOC_OBJ_PACKAGE as Socket. You can change this behavior using
| \fBSlurmdParameters\fR=l3cache_as_socket. |
| .IP |
| |
| .TP |
| \fBignore_prefer_validation\fR |
| If set, and a job requests --prefer any features in the request that would |
| create an invalid request with the current system will not generate an error. |
| This is helpful for dynamic systems where nodes with features come and go. |
| Please note using this option will not protect you from typos. |
| See also ignore_constraint_validation. |
| .IP |
| |
| .TP |
| \fBmax_array_tasks\fR |
| Specify the maximum number of tasks that can be included in a job array. |
| The default limit is MaxArraySize, but this option can be used to set a lower |
| limit. For example, max_array_tasks=1000 and MaxArraySize=100001 would permit |
| a maximum task ID of 100000, but limit the number of tasks in any single job |
| array to 1000. |
| .IP |
| |
| .TP |
| \fBmax_rpc_cnt\fR=\# |
| If the number of active threads in the slurmctld daemon is equal to or |
| larger than this value, defer scheduling of jobs. The scheduler will check |
| this condition at certain points in code and yield locks if necessary. |
| This can improve Slurm's ability to process requests at a cost of initiating |
| new jobs less frequently. Default: 0 (option disabled), Min: 0, Max: 1000. |
| .IP |
| .RS |
| \fBNOTE\fR: The maximum number of threads (MAX_SERVER_THREADS) is internally set |
| to 256 and defines the number of served RPCs at a given time. Setting max_rpc_cnt |
| to more than 256 will only be useful to let backfill continue scheduling work
| after locks have been yielded (i.e. every 2 seconds) if there are at most
| MAX(max_rpc_cnt/10, 20) RPCs in the queue. For example, with max_rpc_cnt=1000
| the scheduler will be allowed to continue after yielding locks only when there
| are no more than 100 pending RPCs.
| If a value is set, then a value of 10 or higher is recommended. It may require |
| some tuning for each system, but needs to be high enough that scheduling isn't |
| always disabled, and low enough that requests can get through in a reasonable |
| period of time. |
| .RE |
| .IP |
| |
| .TP |
| \fBmax_sched_time\fR=\# |
| How long, in seconds, that the main scheduling loop will execute for before |
| exiting. |
| If a value is configured, be aware that all other Slurm operations will be |
| deferred during this time period. |
| Make certain the value is lower than \fBMessageTimeout\fR. |
| If a value is not explicitly configured, the default value is half of |
| \fBMessageTimeout\fR with a minimum default value of 1 second and a maximum |
| default value of 2 seconds. |
| For example if MessageTimeout=10, the time limit will be 2 seconds |
| (i.e. MIN(10/2, 2) = 2). |
| .IP |
| |
| .TP |
| \fBmax_script_size\fR=\# |
| Specify the maximum size of a batch script, in bytes. |
| The default value is 4 megabytes. |
| Larger values may adversely impact system performance. |
| .IP |
| |
| .TP |
| \fBmax_submit_line_size\fR=\# |
| Specify the maximum size of a submit line, in bytes. |
| The default value is 1 megabyte.
| This option cannot exceed 2 megabytes. |
| .IP |
| |
| .TP |
| \fBmax_switch_wait\fR=\# |
| Maximum number of seconds that a job can delay execution waiting for the |
| specified desired switch count. The default value is 300 seconds. |
| .IP |
| |
| .TP |
| \fBno_backup_scheduling\fR |
| If used, the backup controller will not schedule jobs when it takes over. The |
| backup controller will allow jobs to be submitted, modified and cancelled but |
| won't schedule new jobs. This is useful in Cray environments when the backup |
| controller resides on an external Cray node. |
| .IP |
| |
| .TP |
| \fBnohold_on_prolog_fail\fR |
| By default, if the Prolog exits with a non\-zero value the job is requeued in |
| a held state. By specifying this parameter the job will be requeued but not |
| held so that the scheduler can dispatch it to another host. |
| .IP |
| |
| .TP |
| \fBpack_serial_at_end\fR |
| If used with the select/cons_tres plugin, |
| then put serial jobs at the end of |
| the available nodes rather than using a best fit algorithm. |
| This may reduce resource fragmentation for some workloads. |
| .IP |
| |
| .TP |
| \fBpartition_job_depth\fR=\# |
| The default number of jobs to attempt scheduling (i.e. the queue depth) |
| from each partition/queue in Slurm's main scheduling logic. |
| This limit will be enforced for all main scheduler cycles. |
| The functionality is similar to that provided by the \fBbf_max_job_part\fR |
| option for the backfill scheduling logic. |
| The default value is 0 (no limit). |
| Jobs excluded from attempted scheduling based upon partition will not be
| counted against the \fBdefault_queue_depth\fR limit.
| Also see the \fBbf_max_job_part\fR option. |
| .IP |
| |
| .TP |
| \fBreduce_completing_frag\fR |
| This option is used to control how scheduling of resources is performed when |
| jobs are in the COMPLETING state, which influences potential fragmentation. |
| If this option is not set then no jobs will be started in any partition when |
| any job is in the COMPLETING state for less than \fBCompleteWait\fR seconds. |
| If this option is set then no jobs will be started in any individual partition |
| that has a job in COMPLETING state for less than \fBCompleteWait\fR seconds. |
| In addition, no jobs will be started in any partition with nodes that overlap |
| with any nodes in the partition of the completing job. |
| This option is to be used in conjunction with \fBCompleteWait\fR. |
| |
| \fBNOTE\fR: \fBCompleteWait\fR must be set in order for this to work. If |
| \fBCompleteWait=0\fR then this option does nothing. |
| |
| \fBNOTE\fR: \fBreduce_completing_frag\fR only affects the main scheduler, not |
| the backfill scheduler. |
| .IP |
| |
| .TP |
| \fBrequeue_on_resume_failure\fR |
| In the event that nodes fail to resume by \fBResumeTimeout\fR, all batch jobs |
| will be requeued -- even if the jobs requested not to be requeued. This is |
| similar to \fBPrologFlags=ForceRequeueOnFail\fR. |
| .IP |
| |
| .TP |
| \fBsalloc_wait_nodes\fR |
| If defined, the salloc command will wait until all allocated nodes are ready for |
| use (i.e. booted) before the command returns. By default, salloc will return as |
| soon as the resource allocation has been made. The salloc command can use the |
| \-\-wait\-all\-nodes option to override this configuration parameter. |
| .IP |
| |
| .TP |
| \fBsbatch_wait_nodes\fR |
| If defined, the sbatch script will wait until all allocated nodes are ready for |
| use (i.e. booted) before the initiation. By default, the sbatch script will be |
| initiated as soon as the first node in the job allocation is ready. The sbatch |
| command can use the \-\-wait\-all\-nodes option to override this configuration |
| parameter. |
| .IP |
| |
| .TP |
| \fBsched_interval\fR=\# |
| How frequently, in seconds, the main scheduling loop will execute and test all |
| pending jobs, with only the \fBpartition_job_depth\fR limit in place. |
| The default value is 60 seconds. |
| A setting of \-1 will disable the main scheduling loop. |
| .IP |
| |
| .TP |
| \fBsched_max_job_start\fR=\# |
| The maximum number of jobs that the main scheduling logic will start in any |
| single execution. |
| The default value is zero, which imposes no limit. |
| .IP |
| |
| .TP |
| \fBsched_min_interval\fR=\# |
| How frequently, in microseconds, the main scheduling loop will execute and test |
| any pending jobs. |
| The scheduler runs in a limited fashion every time that any event happens which |
| could enable a job to start (e.g. job submit, job terminate, etc.). |
| If these events happen at a high frequency, the scheduler can run very |
| frequently and consume significant resources if not throttled by this option. |
| This option specifies the minimum time between the end of one scheduling |
| cycle and the beginning of the next scheduling cycle. |
| A value of zero will disable throttling of the scheduling logic interval. |
| The default value is 2 microseconds. |
| .IP |
| |
| .TP |
| \fBspec_cores_first\fR |
| Specialized cores will be selected from the first cores of the first sockets, |
| cycling through the sockets on a round robin basis. |
| By default, specialized cores will be selected from the last cores of the |
| last sockets, cycling through the sockets on a round robin basis. |
| .IP |
| |
| .TP |
| \fBstep_retry_count\fR=\# |
| When a step completes and there are steps ending resource allocation, then |
| retry step allocations for at least this number of pending steps. |
| Also see \fBstep_retry_time\fR. |
| The default value is 8 steps. |
| .IP |
| |
| .TP |
| \fBstep_retry_time\fR=\# |
| When a step completes and there are steps ending resource allocation, then |
| retry step allocations for all steps which have been pending for at least this |
| number of seconds. |
| Also see \fBstep_retry_count\fR. |
| The default value is 60 seconds. |
| .IP |
| |
| .TP |
| \fBtime_min_as_soft_limit\fR |
| Treat the \-\-time\-min limit as a soft time limit for the job. Scheduling |
| will plan for the shorter duration, while permitting the job to continue |
| running until the ("hard") \-\-time limit. |
| .IP |
| |
| .TP |
| \fBwhole_hetjob\fR |
| Requests to cancel, hold or release any component of a heterogeneous job will |
| be applied to all components of the job. |
| |
| \fBNOTE\fR: This option was previously named whole_pack and this is still |
| supported for backwards compatibility. |
| .RE |
| .IP |
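| Multiple \fBSchedulerParameters\fR options are specified together on a single
| comma\-separated line. For example (the values shown are illustrative only,
| not recommendations):
| .nf
| SchedulerParameters=defer_batch,max_rpc_cnt=150,delay_boot=10
| .fi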
| |
| .TP |
| \fBSchedulerTimeSlice\fR |
| Number of seconds in each time slice when gang scheduling is enabled |
| (\fBPreemptMode=SUSPEND,GANG\fR). |
| The value must be between 5 seconds and 65533 seconds. |
| The default value is 30 seconds. |
| .IP |
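| For example, a hypothetical configuration for gang scheduling with 60\-second
| time slices:
| .nf
| PreemptMode=SUSPEND,GANG
| SchedulerTimeSlice=60
| .fi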
| |
| .TP |
| \fBSchedulerType\fR |
| Identifies the type of scheduler to be used. |
| The \fBscontrol\fR command can be used to manually change job priorities |
| if desired. |
| Acceptable values include: |
| .IP |
| .RS |
| .TP |
| \fBsched/backfill\fR |
| For a backfill scheduling module to augment the default FIFO scheduling. |
| Backfill scheduling will initiate lower\-priority jobs if doing |
| so does not delay the expected initiation time of any higher |
| priority job. |
| Effectiveness of backfill scheduling is dependent upon users specifying |
| job time limits, otherwise all jobs will have the same time limit and |
| backfilling is impossible. |
| See the documentation for the \fBSchedulerParameters\fR option above.
| This is the default configuration. |
| .IP |
| |
| .TP |
| \fBsched/builtin\fR |
| This is the FIFO scheduler which initiates jobs in priority order. |
| If any job in the partition can not be scheduled, no lower priority job in that |
| partition will be scheduled. |
| An exception is made for jobs that can not run due to partition constraints |
| (e.g. the time limit) or down/drained nodes. |
| In that case, lower priority jobs can be initiated and not impact the higher |
| priority job. |
| .RE |
| .IP |
| |
| .TP |
| \fBScronParameters\fR |
| Multiple options may be comma separated. |
| .IP |
| .RS |
| .TP |
| \fBenable\fR |
| Enable the use of scrontab to submit and manage periodic repeating jobs. |
| .IP |
| |
| .TP |
| \fBexplicit_scancel\fR |
| When cancelling an scrontab job, require the user to explicitly request |
| cancelling the job with the \-\-cron flag in scancel.
| .RE |
| .IP |
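| For example, enabling scrontab and requiring explicit cancellation (combining
| both options described above):
| .nf
| ScronParameters=enable,explicit_scancel
| .fi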
| |
| .TP |
| \fBSelectType\fR |
| Identifies the type of resource selection algorithm to be used. |
| When changed, all job information (running and pending) will be |
| lost, since the job state save format used by each plugin is different. |
| The only exception to this is when changing from the legacy cons_res to |
| cons_tres. |
| |
| Acceptable values include |
| .IP |
| .RS |
| .TP |
| \fBselect/cons_tres\fR |
| The resources (cores, memory, GPUs and all other trackable resources) within |
| a node are individually allocated as consumable resources. |
| Note that whole nodes can be allocated to jobs for selected |
| partitions by using the \fIOverSubscribe=Exclusive\fR option. |
| See the partition \fBOverSubscribe\fR parameter for more information. |
| This is the default value. |
| .IP |
| |
| .TP |
| \fBselect/linear\fR |
| For allocation of entire nodes assuming a one\-dimensional array of nodes in
| which sequentially ordered nodes are preferable.
| For a heterogeneous cluster (e.g. different CPU counts on the various nodes), |
| resource allocations will favor nodes with high CPU counts as needed based upon |
| the job's node and CPU specification if TopologyPlugin=topology/flat is |
| configured. Use of other topology plugins with select/linear and heterogeneous |
| nodes is not recommended and may result in valid job allocation requests being |
| rejected. The linear plugin is not designed to track generic resources on a |
| node. In cases where generic resources (such as GPUs) need to be tracked, |
| the cons_tres plugin should be used instead. |
| .RE |
| .IP |
| |
| .TP |
| \fBSelectTypeParameters\fR |
| The permitted values of \fBSelectTypeParameters\fR depend upon the |
| configured value of \fBSelectType\fR. |
| The only supported options for \fBSelectType=select/linear\fR are |
| \fBCR_ONE_TASK_PER_CORE\fR and |
| \fBCR_Memory\fR, which treats memory as a consumable resource and |
| prevents memory over subscription with job preemption or gang scheduling. |
| By default \fBSelectType=select/linear\fR allocates whole nodes to jobs without |
| considering their memory consumption. |
| By default \fBSelectType=select/cons_tres\fR uses \fBCR_Core_Memory\fR, which
| allocates cores to jobs while considering their memory consumption.
| |
| The following options are supported by the \fBSelectType=select/cons_tres\fR
| plugin (an example configuration follows this list of options):
| .IP |
| .RS |
| .TP |
| \fBCR_CPU\fR |
| CPUs are consumable resources. |
| Configure the number of \fBCPUs\fR on each node, which may be equal to the |
| count of cores or hyper\-threads on the node depending upon the desired minimum |
| resource allocation. |
| The node's \fBBoards\fR, \fBSockets\fR, \fBCoresPerSocket\fR and |
| \fBThreadsPerCore\fR may optionally be configured and result in job |
| allocations which have improved locality; however doing so will prevent |
| more than one job from being allocated on each core. |
| .IP |
| |
| .TP |
| \fBCR_CPU_Memory\fR |
| CPUs and memory are consumable resources. |
| Configure the number of \fBCPUs\fR on each node, which may be equal to the |
| count of cores or hyper\-threads on the node depending upon the desired minimum |
| resource allocation. |
| The node's \fBBoards\fR, \fBSockets\fR, \fBCoresPerSocket\fR and |
| \fBThreadsPerCore\fR may optionally be configured and result in job |
| allocations which have improved locality; however doing so will prevent |
| more than one job from being allocated on each core. |
| Setting a value for \fBDefMemPerCPU\fR is strongly recommended. |
| .IP |
| |
| .TP |
| \fBCR_Core\fR |
| Cores are consumable resources. |
| On nodes with hyper\-threads, each thread is counted as a CPU to |
| satisfy a job's resource requirement, but multiple jobs are not |
| allocated threads on the same core. |
| The count of CPUs allocated to a job is rounded up to account for every |
| CPU on an allocated core. This will also impact the total allocated memory when
| \-\-mem\-per\-cpu is used, making it a multiple of the total number of CPUs on
| the allocated cores.
| .IP |
| |
| .TP |
| \fBCR_Core_Memory\fR |
| Cores and memory are consumable resources. |
| On nodes with hyper\-threads, each thread is counted as a CPU to |
| satisfy a job's resource requirement, but multiple jobs are not |
| allocated threads on the same core. |
| The count of CPUs allocated to a job may be rounded up to account for every |
| CPU on an allocated core. |
| Setting a value for \fBDefMemPerCPU\fR is strongly recommended. |
| .IP |
| |
| .TP |
| \fBCR_ONE_TASK_PER_CORE\fR |
| Allocate one task per core by default. |
| Without this option, by default one task will be allocated per |
| thread on nodes with more than one \fBThreadsPerCore\fR configured. |
| \fBNOTE\fR: This option cannot be used with CR_CPU*. |
| .IP |
| |
| .TP |
| \fBCR_CORE_DEFAULT_DIST_BLOCK\fR |
| Allocate cores within a node using block distribution by default. |
| This is a pseudo\-best\-fit algorithm that minimizes the number of |
| boards and minimizes the number of sockets (within minimum boards) |
| used for the allocation. |
| This default behavior can be overridden specifying a particular |
| "\-m" parameter with srun/salloc/sbatch. |
| Without this option, cores will be allocated cyclically across the sockets. |
| .IP |
| |
| .TP |
| \fBCR_LLN\fR |
| Schedule resources to jobs on the least loaded nodes (based upon the number |
| of idle CPUs). This is generally only recommended for an environment with |
| serial jobs as idle resources will tend to be highly fragmented, resulting |
| in parallel jobs being distributed across many nodes. |
| Note that node \fBWeight\fR takes precedence over how many idle resources are |
| on each node. |
| Also see the partition configuration parameter \fBLLN\fR to use the least
| loaded nodes in selected partitions.
| .IP |
| |
| .TP |
| \fBCR_Pack_Nodes\fR |
| If a job allocation contains more resources than will be used for launching |
| tasks (e.g. if whole nodes are allocated to a job), then rather than |
| distributing a job's tasks evenly across its allocated nodes, pack them as |
| tightly as possible on these nodes. |
| For example, consider a job allocation containing two \fBentire\fR nodes with |
| eight CPUs each. |
| If the job starts ten tasks across those two nodes without this option, it will |
| start five tasks on each of the two nodes. |
| With this option, eight tasks will be started on the first node and two tasks |
| on the second node. |
| This can be superseded by "NoPack" in srun's "\-\-distribution" option. |
| CR_Pack_Nodes only applies when the "block" task distribution method is used. |
| .IP |
| |
| .TP |
| \fBLL_SHARED_GRES\fR |
| When allocating resources for a shared GRES (gres/mps, gres/shard), prefer |
| least loaded device (in terms of already allocated fraction). This way jobs are |
| spread across GRES devices on the node, instead of the default behavior where |
| the first available device is used. |
| This option is only supported by select/cons_tres plugin. |
| .IP |
| |
| .TP |
| \fBCR_Socket\fR |
| Sockets are consumable resources. |
| On nodes with multiple cores, each core or thread is counted as a CPU |
| to satisfy a job's resource requirement, but multiple jobs are not |
| allocated resources on the same socket. |
| .IP |
| |
| .TP |
| \fBCR_Socket_Memory\fR |
| Memory and sockets are consumable resources. |
| On nodes with multiple cores, each core or thread is counted as a CPU |
| to satisfy a job's resource requirement, but multiple jobs are not |
| allocated resources on the same socket. |
| Setting a value for \fBDefMemPerCPU\fR is strongly recommended. |
| .IP |
| |
| .TP |
| \fBMULTIPLE_SHARING_GRES_PJ\fR |
| By default, shared gres requests are satisfied by only one sharing gres per
| job on each node. This option allows multiple sharing gres to be used on a
| single node to satisfy a job's shared gres requirements.
| Example: if there are 10 shards per gpu and 12 shards are requested, instead of
| being denied, the job will be allocated 2 gpus: one using 10 shards and the
| other using 2 shards.
| .IP |
| |
| .TP |
| \fBENFORCE_BINDING_GRES\fR |
| Set \fB\-\-gres\-flags=enforce\-binding\fR as the default in every job. |
| This can be overridden with \fB\-\-gres\-flags=disable\-binding\fR. |
| .IP |
| |
| .TP |
| \fBONE_TASK_PER_SHARING_GRES\fR |
| Set \fB\-\-gres\-flags=one\-task\-per\-sharing\fR as the default in every job. |
| This can be overridden with \fB\-\-gres\-flags=multiple\-tasks\-per\-sharing\fR. |
| .IP |
| |
| .LP |
| \fBNOTE\fR: If memory isn't configured as a consumable resource (CR_CPU, |
| CR_Core or CR_Socket without _Memory) memory can be oversubscribed and will not |
| be constrained by task/cgroup even if it is configured in cgroup.conf. In this |
| case the \fI\-\-mem\fR option is only used to filter out nodes with lower
| configured memory and does not take running jobs into account. For instance, |
| two jobs requesting all the memory of a node can run at the same time. |
| .RE |
| .IP |
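| As an illustration of the options above, a configuration treating cores and
| memory as consumable resources might look like the following (the
| \fBDefMemPerCPU\fR value is illustrative only):
| .nf
| SelectType=select/cons_tres
| SelectTypeParameters=CR_Core_Memory
| DefMemPerCPU=2048
| .fi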
| |
| .TP |
| \fBSlurmctldAddr\fR |
| An optional address to be used for communications to the currently active |
| slurmctld daemon, normally used with Virtual IP addressing of the currently |
| active server. |
| If this parameter is not specified then each primary and backup server will |
| have its own unique address used for communications as specified in the |
| \fBSlurmctldHost\fR parameter. |
| If this parameter is specified then the \fBSlurmctldHost\fR parameter will |
| still be used for communications to specific slurmctld primary or backup |
| servers, for example to cause all of them to read the current configuration |
| files or shutdown. |
| Also see the \fBSlurmctldPrimaryOffProg\fR and \fBSlurmctldPrimaryOnProg\fR
| configuration parameters to configure programs that manipulate the virtual IP
| address.
| .IP |
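| A hypothetical virtual IP setup might look like the following (hostnames and
| addresses are illustrative only):
| .nf
| SlurmctldAddr=10.0.0.100
| SlurmctldHost=ctld1(10.0.0.1)
| SlurmctldHost=ctld2(10.0.0.2)
| .fi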
| |
| .TP |
| \fBSlurmctldDebug\fR |
| The level of detail to provide \fBslurmctld\fR daemon's logs. |
| The default value is \fBinfo\fR. |
| If the \fBslurmctld\fR daemon is initiated with \-v or \-\-verbose options, |
| that debug level will be preserved or restored upon reconfiguration. |
| .IP |
| .RS |
| .TP 10 |
| \fBquiet\fR |
| Log nothing |
| .IP |
| |
| .TP |
| \fBfatal\fR |
| Log only fatal errors |
| .IP |
| |
| .TP |
| \fBerror\fR |
| Log only errors |
| .IP |
| |
| .TP |
| \fBinfo\fR |
| Log errors and general informational messages |
| .IP |
| |
| .TP |
| \fBverbose\fR |
| Log errors and verbose informational messages |
| .IP |
| |
| .TP |
| \fBdebug\fR |
| Log errors and verbose informational messages and debugging messages |
| .IP |
| |
| .TP |
| \fBdebug2\fR |
| Log errors and verbose informational messages and more debugging messages |
| .IP |
| |
| .TP |
| \fBdebug3\fR |
| Log errors and verbose informational messages and even more debugging messages |
| .IP |
| |
| .TP |
| \fBdebug4\fR |
| Log errors and verbose informational messages and even more debugging messages |
| .IP |
| |
| .TP |
| \fBdebug5\fR |
| Log errors and verbose informational messages and even more debugging messages |
| .RE |
| .IP |
| |
| .TP |
| \fBSlurmctldHost\fR |
| The short, or long, hostname of the machine where Slurm control daemon is |
| executed (i.e. the name returned by the command "hostname \-s"). |
| This hostname is optionally followed by either the IP address or |
| a name by which the address can be identified, enclosed in parentheses. e.g. |
| .nf |
| SlurmctldHost=slurmctl\-primary(12.34.56.78) |
| .fi |
| |
| Each host running an instance of slurmctld should have a \fBSlurmctldHost=\fR |
| entry. e.g. |
| .nf |
| SlurmctldHost=slurmctl\-primary1 |
| SlurmctldHost=slurmctl\-primary2 |
| SlurmctldHost=slurmctl\-primary3(12.34.56.78) |
| .fi |
| |
| SlurmctldHost must be specified at least once. If specified more than once, the |
| first entry will run as the primary and all other entries as standby backups. |
| If the primary host fails, the first backup will change from standby to primary |
| until the first host comes back online. This same process will repeat if the new |
| primary fails. |
| |
| Slurm daemons need to be reconfigured (e.g. "scontrol reconfig") for changes to |
| this parameter to take effect. It is okay for jobs to be running when making |
| these changes, as the running steps will get the updated SlurmctldHost info. |
| |
| Every slurmctld host controller must have access to the \fBStateSaveLocation\fR |
| directory, which must be readable and writable from the primary and all backup |
| controllers at all times. |
| |
| Refer to the \fBRELOCATING CONTROLLERS\fR section if you need to change this. |
| .IP |
| |
| .TP |
| \fBSlurmctldLogFile\fR |
| Fully qualified pathname of a file into which the \fBslurmctld\fR daemon's |
| logs are written. |
| The default value is none (performs logging via syslog). |
| .br |
| See the section \fBLOGGING\fR if a pathname is specified. |
| .IP |
| |
| .TP |
| \fBSlurmctldParameters\fR |
| Multiple options may be comma separated. |
| .IP |
| .RS |
| .TP |
| \fBallow_user_triggers\fR |
| Permit setting triggers from non\-root/slurm_user users. SlurmUser must also |
| be set to root to permit these triggers to work. See the \fBstrigger\fR man |
| page for additional details. |
| .IP |
| |
| .TP |
| \fBcloud_dns\fR |
| By default, Slurm expects that the network address for a cloud node won't |
| be known until the creation of the node and that Slurm will be notified of the |
| node's address (e.g. \fBscontrol update nodename=<name> nodeaddr=<addr>\fR). |
| Since Slurm communications rely on the node configuration found in the |
| slurm.conf, Slurm will tell the client command, after waiting for all nodes to |
| boot, each node's IP address. However, in environments where the nodes are in
| DNS, this step can be avoided by configuring this option. |
| .IP |
| |
| .TP |
| \fBconmgr_max_connections\fR=\fI<connection_count>\fR |
| Specify the maximum number of connections to be processed at any given time. |
| This does not influence the maximum number of pending connections as that is |
| controlled by the kernel. Defaults to 50. |
| .IP |
| |
| .TP |
| \fBconmgr_threads\fR=\fI<thread_count>\fR |
| The number of threads in the thread pool used for receiving and
| processing connections on the listening sockets. While increasing this
| value may improve the capacity of conmgr to handle a larger number of |
| connections, it does not generally improve the speed at which RPCs are |
| processed. It is highly recommended that benchmarking be conducted when |
| this value is changed to ensure optimal performance. |
| .IP |
| |
| .TP |
| \fBconmgr_use_poll\fR |
| Use \fIpoll\fR(2) instead of \fIepoll\fR(7) for monitoring file descriptors. |
| .IP |
| |
| .TP |
| \fBconmgr_connect_timeout\fR=\fI<seconds>\fR |
| Wait \fI<seconds>\fR before considering an outbound connection attempt to be |
| timed out. Defaults to the value of \fBMessageTimeout\fR. |
| .IP |
| |
| .TP |
| \fBconmgr_read_timeout\fR=\fI<seconds>\fR |
| Wait \fI<seconds>\fR before considering a read from a file descriptor to be |
| timed out. Defaults to the value of \fBMessageTimeout\fR. |
| .IP |
| |
| .TP |
| \fBconmgr_quiesce_timeout\fR=\fI<seconds>\fR |
| Wait \fI<seconds>\fR before considering quiesce to be timed out. Upon timeout, |
| all (non-listening) active connections will be closed to allow the quiesce to |
| start. Defaults to two times the value of \fBMessageTimeout\fR.
| .IP |
| |
| .TP |
| \fBconmgr_wait_write_delay\fR=\fI<seconds>\fR |
| When waiting for the kernel to flush the outgoing buffer, poll the kernel for
| changes every \fI<seconds>\fR seconds. Defaults to the value of \fBMessageTimeout\fR.
| .IP |
| |
| .TP |
| \fBconmgr_write_timeout\fR=\fI<seconds>\fR |
| Wait \fI<seconds>\fR before considering a write from a file descriptor to be |
| timed out. Defaults to the value of \fBMessageTimeout\fR. |
| .IP |
| |
| .TP |
| \fBdisable_triggers\fR |
| Disable the ability to register new triggers. |
| .IP |
| |
| .TP |
| \fBenable_configless\fR |
| Permit "configless" operation by the slurmd, slurmstepd, and user commands. |
| When enabled the slurmd will be permitted to retrieve config files and |
| \fBProlog\fR and \fBEpilog\fR scripts from the slurmctld, and on any 'scontrol |
| reconfigure' command new configs and scripts will be automatically pushed out |
| and applied to nodes that are running in this "configless" mode. |
| See https://slurm.schedmd.com/configless_slurm.html for more details. |
| |
| \fBNOTE\fR: Included files with the \fBInclude\fR directive will only be pushed |
| if the filename has no path separators and is located adjacent to slurm.conf. |
| |
| \fBNOTE\fR: \fBProlog\fR and \fBEpilog\fR scripts will only be pushed if the |
| filenames have no path separators and are located adjacent to slurm.conf. |
| Glob patterns (See \fBglob\fR (7)) are not supported. |
| .IP |
| |
| .TP |
| \fBidle_on_node_suspend\fR |
| Mark nodes as idle, regardless of current state, when suspending nodes with |
| \fBSuspendProgram\fR so that nodes will be eligible to be resumed at a later |
| time. |
| .IP |
| |
| .TP |
| \fBnode_reg_mem_percent\fR=\# |
| Percentage of memory a node is allowed to register with, without being marked
| as invalid with low memory. Default is 100. For State=CLOUD nodes, the default
| is 90. To disable this for cloud nodes, set it to 100. \fIconfig_overrides\fR
| takes precedence over this option.
|
| It's recommended that \fItask/cgroup\fR with \fIConstrainRamSpace\fR be
| configured. A memory cgroup limit won't be set higher than the actual memory on
| the node. If needed, configure \fIAllowedRamSpace\fR in the cgroup.conf to add
| a buffer.
| .IP |
| |
| .TP |
| \fBno_quick_restart\fR |
| By default, starting a new instance of the slurmctld will kill any already
| running instance before taking control. If this option is set, this will not
| happen unless the \fB\-i\fR option is used.
| .IP |
| |
| .TP |
| \fBpower_save_interval\fR |
| How often the power_save thread looks to resume and suspend nodes. The |
| power_save thread will do work sooner if there are node state changes. Default |
| is 10 seconds. |
| .IP |
| |
| .TP |
| \fBpower_save_min_interval\fR |
| How often the power_save thread, at a minimum, looks to resume and suspend |
| nodes. Default is 0. |
| .IP |
| |
| .TP |
| \fBmax_powered_nodes\fR |
| The maximum number of powered\-up nodes across the cluster. Once this is reached,
| jobs requesting additional nodes will not start, and "scontrol power up |
| <nodes>" will fail. |
| .IP |
| |
| .TP |
| \fBmax_dbd_msg_action\fR |
| Action used once MaxDBDMsgs is reached, options are 'discard' (default) and 'exit'. |
| |
| When 'discard' is specified and MaxDBDMsgs is reached, pending messages of the
| Step start and complete types are purged first; if MaxDBDMsgs is reached again,
| Job start messages are purged. Job completion and node state change messages
| continue to consume the space freed by the purges until MaxDBDMsgs is reached
| again, at which point no new messages are tracked, creating data loss and
| potentially runaway jobs.
| |
| When 'exit' is specified and MaxDBDMsgs is reached, the slurmctld will exit
| instead of discarding any messages. It will be impossible to start the
| slurmctld with this option if the slurmdbd is down and the slurmctld is
| tracking more than MaxDBDMsgs.
| .IP |
| |
| .TP |
| \fBreboot_from_controller\fR |
| Run the \fBRebootProgram\fR from the controller instead of on the slurmds. The |
| RebootProgram will be passed a comma\-separated list of nodes to reboot as the |
| first argument and if applicable the required features needed for reboot as the |
| second argument. |
| .IP |
| |
| .TP |
| \fBrl_bucket_size\fR= |
| Size of the token bucket. This permits a certain amount of RPC burst from a |
| user before the steady\-state rate limit takes effect. |
| The default value is 30. |
| .IP |
| |
| .TP |
| \fBrl_enable\fR |
| Enable per\-user RPC rate\-limiting support. Client\-commands will be told to |
| back off and sleep for a second once the limit has been reached. |
| This is implemented as a "token bucket", which permits a certain degree of |
| "bursty" RPC load from an individual user before holding them to a |
| steady\-state RPC load established by the refill period and rate.
| A worked example appears at the end of this list of parameters.
| .IP |
| |
| .TP |
| \fBrl_log_freq\fR= |
| The maximum frequency (in seconds) at which log messages about an individual
| user exceeding the RPC limit are printed. Set to 0 to log every occurrence.
| Set to \-1 to disable the log message entirely.
| The default value is 0. |
| .IP |
| |
| .TP |
| \fBrl_refill_period\fR= |
| How frequently, in seconds, additional tokens are added to each user's
| bucket.
| The default value is 1. |
| .IP |
| |
| .TP |
| \fBrl_refill_rate\fR= |
| How many tokens to add to the bucket on each period. |
| The default value is 2. |
| .IP |
| |
| .TP |
| \fBrl_table_size\fR= |
| Number of entries in the user hash\-table. The recommended value is at least
| twice the number of active user accounts on the system.
| The default value is 8192. |
| .IP |
| |
| .TP |
| \fBenable_stepmgr\fR |
| Enable slurmstepd step management system wide. This enables job steps to be
| managed by a single extern slurmstepd associated with the job.
| This is beneficial for jobs that submit many steps inside their allocations. |
| \fBPrologFlags=contain\fR must be set. |
| .IP |
| |
| .TP |
| \fBuser_resv_delete\fR |
| Allow any user able to run in a reservation to delete it. |
| .IP |
| |
| .TP |
| \fBvalidate_nodeaddr_threads\fR= |
| During startup, slurmctld looks up the address for each compute node in the |
| system. On large systems this can cause considerable delay; this option permits
| the slurmctld to handle the lookup calls concurrently and can reduce system
| startup time considerably. The default value is 1. The maximum permitted value
| is 64.
| .RE |
| .IP |
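| As a worked example of the rl_* options above, the following (illustrative)
| settings allow each user to burst up to 30 RPCs, after which the user is held
| to a steady\-state rate of roughly 2 RPCs per second, since 2 tokens are added
| to the bucket every 1 second:
| .nf
| SlurmctldParameters=rl_enable,rl_bucket_size=30,rl_refill_rate=2,rl_refill_period=1
| .fi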
| |
| .TP |
| \fBSlurmctldPidFile\fR |
| Fully qualified pathname of a file into which the \fBslurmctld\fR daemon |
| may write its process id. This may be used for automated signal processing. |
| The default value is "/var/run/slurmctld.pid". |
| .IP |
| |
| .TP |
| \fBSlurmctldPort\fR |
| The port number that the Slurm controller, \fBslurmctld\fR, listens |
| to for work. The default value is SLURMCTLD_PORT as established at system |
| build time. If none is explicitly specified, it will be set to 6817. |
| \fBSlurmctldPort\fR may also be configured to support a range of port |
| numbers in order to accept larger bursts of incoming messages by specifying |
| two numbers separated by a dash (e.g. \fBSlurmctldPort=6817\-6818\fR). |
| \fBNOTE\fR: Either the \fBslurmctld\fR and \fBslurmd\fR daemons must not
| execute on the same nodes, or the values of \fBSlurmctldPort\fR and
| \fBSlurmdPort\fR must be different.
| |
| \fBNOTE\fR: On Cray systems, Realm\-Specific IP Addressing (RSIP) will |
| automatically try to interact with anything opened on ports 8192\-60000. |
| Configure SlurmctldPort to use a port outside of the configured SrunPortRange |
| and RSIP's port range. |
| .IP |
| |
| .TP |
| \fBSlurmctldPrimaryOffProg\fR |
| This program is executed when a slurmctld daemon running as the primary server |
| becomes a backup server. The controller will wait for this script to end before |
| fully shutting down. By default no program is executed. |
| See also the related "SlurmctldPrimaryOnProg" parameter. |
| .IP |
| |
| .TP |
| \fBSlurmctldPrimaryOnProg\fR |
| This program is executed when a slurmctld daemon running as a backup server |
| becomes the primary server. The controller will wait for this script to end |
| before fully starting up. By default no program is executed. |
| When using virtual IP addresses to manage highly available Slurm services,
| this program can be used to add the IP address to an interface (and optionally |
| try to kill the unresponsive slurmctld daemon and flush the ARP caches on |
| nodes on the local Ethernet fabric). |
| See also the related "SlurmctldPrimaryOffProg" parameter. |
| .IP |
| |
| .TP |
| \fBSlurmctldSyslogDebug\fR |
| The slurmctld daemon will log events to the syslog file at the specified |
| level of detail. If not set, the slurmctld daemon will log to syslog at
| level \fBfatal\fR, unless there is no \fBSlurmctldLogFile\fR: in that case, if
| the daemon is running in the background it will log to syslog at the level
| specified by \fBSlurmctldDebug\fR (at \fBfatal\fR if \fBSlurmctldDebug\fR is
| set to \fBquiet\fR), and if it is running in the foreground the syslog level
| will be set to \fBquiet\fR.
| .IP |
| .RS |
| .TP 10 |
| \fBquiet\fR |
| Log nothing |
| .IP |
| |
| .TP |
| \fBfatal\fR |
| Log only fatal errors |
| .IP |
| |
| .TP |
| \fBerror\fR |
| Log only errors |
| .IP |
| |
| .TP |
| \fBinfo\fR |
| Log errors and general informational messages |
| .IP |
| |
| .TP |
| \fBverbose\fR |
| Log errors and verbose informational messages |
| .IP |
| |
| .TP |
| \fBdebug\fR |
| Log errors and verbose informational messages and debugging messages |
| .IP |
| |
| .TP |
| \fBdebug2\fR |
| Log errors and verbose informational messages and more debugging messages |
| .IP |
| |
| .TP |
| \fBdebug3\fR |
| Log errors and verbose informational messages and even more debugging messages |
| .IP |
| |
| .TP |
| \fBdebug4\fR |
| Log errors and verbose informational messages and even more debugging messages |
| .IP |
| |
| .TP |
| \fBdebug5\fR |
| Log errors and verbose informational messages and even more debugging messages |
| .RE |
| .IP |
| \fBNOTE\fR: By default, Slurm's systemd service file starts the slurmctld daemon |
| in the foreground with the \-\-systemd option. This means that systemd will |
| capture stdout/stderr output and print that to syslog, independent of Slurm |
| printing to syslog directly. To prevent systemd from doing this, add |
| "StandardOutput=null" and "StandardError=null" to the respective service files |
| or override files. |
| .IP |
| |
| .TP |
| \fBSlurmctldTimeout\fR |
| The interval, in seconds, that the backup controller waits for the |
| primary controller to respond before assuming control. |
| The default value is 120 seconds. |
| May not exceed 65533. |
| .IP |
| |
| .TP |
| \fBSlurmdDebug\fR |
| The level of detail to provide \fBslurmd\fR daemon's logs. |
| The default value is \fBinfo\fR. |
| .IP |
| .RS |
| .TP 10 |
| \fBquiet\fR |
| Log nothing |
| .IP |
| |
| .TP |
| \fBfatal\fR |
| Log only fatal errors |
| .IP |
| |
| .TP |
| \fBerror\fR |
| Log only errors |
| .IP |
| |
| .TP |
| \fBinfo\fR |
| Log errors and general informational messages |
| .IP |
| |
| .TP |
| \fBverbose\fR |
| Log errors and verbose informational messages |
| .IP |
| |
| .TP |
| \fBdebug\fR |
| Log errors and verbose informational messages and debugging messages |
| .IP |
| |
| .TP |
| \fBdebug2\fR |
| Log errors and verbose informational messages and more debugging messages |
| .IP |
| |
| .TP |
| \fBdebug3\fR |
| Log errors and verbose informational messages and even more debugging messages |
| .IP |
| |
| .TP |
| \fBdebug4\fR |
| Log errors and verbose informational messages and even more debugging messages |
| .IP |
| |
| .TP |
| \fBdebug5\fR |
| Log errors and verbose informational messages and even more debugging messages |
| .RE |
| .IP |
| |
| .TP |
| \fBSlurmdLogFile\fR |
| Fully qualified pathname of a file into which the \fBslurmd\fR daemon's |
| logs are written. |
| The default value is none (performs logging via syslog). |
| The first "%h" within the name is replaced with the hostname on which the |
| \fBslurmd\fR is running. |
| The first "%n" within the name is replaced with the Slurm node name on which the |
| \fBslurmd\fR is running. |
| .br |
| See the section \fBLOGGING\fR if a pathname is specified. |
| .IP |
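| For example, a per\-node log file using the "%n" substitution described above
| (the directory is illustrative only):
| .nf
| SlurmdLogFile=/var/log/slurm/slurmd.%n.log
| .fi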
| |
| .TP |
| \fBSlurmdParameters\fR |
| Parameters specific to the slurmd daemon.
| Multiple options may be comma separated. An example appears after the list of
| options below.
| .IP |
| .RS |
| .TP |
| \fBallow_ecores\fR |
| If set, and processors on your nodes have E\-Cores, allows them to be used
| for scheduling and task placement. (By default, E\-Cores are ignored.)
| .IP |
| |
| .TP |
| \fBconfig_overrides\fR |
| If set, consider the configuration of each node to be that specified in the |
| slurm.conf configuration file and any node with less than the |
| configured resources will \fBnot\fR be set to INVAL/INVALID_REG. |
| This option is generally only useful for testing purposes. |
| Equivalent to the now deprecated FastSchedule=2 option. |
| .IP |
| |
| .TP |
| \fBconmgr_max_connections\fR=\fI<connection_count>\fR |
| Specify the maximum number of connections to be processed at any given time. |
| This does not influence the maximum number of pending connections as that is |
| controlled by the kernel. |
| .IP |
| |
| .TP |
| \fBconmgr_threads\fR=\fI<thread_count>\fR |
| The number of threads to use for receiving connections on the
| listening socket.
| .IP |
| |
| .TP |
| \fBconmgr_use_poll\fR |
| Use \fIpoll\fR(2) instead of \fIepoll\fR(7) for monitoring file descriptors. |
| .IP |
| |
| .TP |
| \fBconmgr_connect_timeout\fR=\fI<seconds>\fR |
| Wait \fI<seconds>\fR before considering an outbound connection attempt to be |
| timed out. Defaults to the value of \fBMessageTimeout\fR. |
| .IP |
| |
| .TP |
| \fBconmgr_read_timeout\fR=\fI<seconds>\fR |
| Wait \fI<seconds>\fR before considering a read from a file descriptor to be |
| timed out. Defaults to the value of \fBMessageTimeout\fR. |
| .IP |
| |
| .TP |
| \fBconmgr_quiesce_timeout\fR=\fI<seconds>\fR |
| Wait \fI<seconds>\fR before considering quiesce to be timed out. Upon timeout, |
| all (non-listening) active connections will be closed to allow the quiesce to |
| start. Defaults to two times the value of \fBMessageTimeout\fR.
| .IP |
| |
| .TP |
| \fBconmgr_wait_write_delay\fR=\fI<seconds>\fR |
| When waiting for the kernel to flush the outgoing buffer, poll the kernel for
| changes every \fI<seconds>\fR seconds. Defaults to the value of \fBMessageTimeout\fR.
| .IP |
| |
| .TP |
| \fBconmgr_write_timeout\fR=\fI<seconds>\fR |
| Wait \fI<seconds>\fR before considering a write from a file descriptor to be |
| timed out. Defaults to the value of \fBMessageTimeout\fR. |
| .IP |
| |
| .TP |
| \fBl3cache_as_socket\fR |
| Use the hwloc l3cache as the socket count. Can be useful on certain processors |
| where the socket level is too coarse, and the l3cache may provide better |
| task distribution. (E.g., along CCX boundaries instead of socket boundaries.) |
| Mutually exclusive with numa_node_as_socket. |
| Requires hwloc v2.
| .IP
|
| .TP
| \fBnuma_node_as_socket\fR
| Use the hwloc NUMA node to determine the main hierarchy object to be used as
| the socket. If this option is set, Slurm will check the parent object of the
| NUMA node and use it as the socket. This option may be useful for architectures
| like AMD Epyc, where the number of NUMA nodes per socket may be configured.
| Mutually exclusive with l3cache_as_socket. |
| Requires hwloc v2. |
| .IP |
| |
| .TP |
| \fBshutdown_on_reboot\fR |
| If set, the Slurmd will shut itself down when a reboot request is received. |
| .IP |
| |
| .TP |
| \fBcontain_spank\fR |
| If set and a job_container plugin is specified, the spank_user(), |
| spank_task_post_fork() and spank_task_exit() calls will be run inside the job |
| container. |
| .RE |
| .IP |
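| For example, combining two of the options described above (illustrative only):
| .nf
| SlurmdParameters=l3cache_as_socket,shutdown_on_reboot
| .fi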
| |
| .TP |
| \fBSlurmdPidFile\fR |
| Fully qualified pathname of a file into which the \fBslurmd\fR daemon may write |
| its process id. This may be used for automated signal processing. |
| The first "%h" within the name is replaced with the hostname on which the |
| \fBslurmd\fR is running. |
| The first "%n" within the name is replaced with the Slurm node name on which the |
| \fBslurmd\fR is running. |
| The default value is "/var/run/slurmd.pid". |
| .IP |
| |
| .TP |
| \fBSlurmdPort\fR |
| The port number that the Slurm compute node daemon, \fBslurmd\fR, listens |
| to for work. The default value is SLURMD_PORT as established at system |
| build time. If none is explicitly specified, its value will be 6818. |
| \fBNOTE\fR: Either the slurmctld and slurmd daemons must not execute
| on the same nodes, or the values of \fBSlurmctldPort\fR and \fBSlurmdPort\fR
| must be different.
| |
| \fBNOTE\fR: On Cray systems, Realm\-Specific IP Addressing (RSIP) will |
| automatically try to interact with anything opened on ports 8192\-60000. |
| Configure SlurmdPort to use a port outside of the configured SrunPortRange |
| and RSIP's port range. |
| .IP |
| |
| .TP |
| \fBSlurmdSpoolDir\fR |
| Fully qualified pathname of a directory into which the \fBslurmd\fR |
| daemon's state information and batch job script information are written. This |
| must be a common pathname for all nodes, but should represent a directory which |
| is local to each node (reference a local file system). The default value |
| is "/var/spool/slurmd". |
| The first "%h" within the name is replaced with the hostname on which the |
| \fBslurmd\fR is running. |
| The first "%n" within the name is replaced with the Slurm node name on which the |
| \fBslurmd\fR is running. |
| .IP |
| |
| .TP |
| \fBSlurmdSyslogDebug\fR |
| The slurmd daemon will log events to the syslog file at the specified |
| level of detail. If not set, the slurmd daemon will log to syslog at
| level \fBfatal\fR, unless there is no \fBSlurmdLogFile\fR: in that case, if
| the daemon is running in the background it will log to syslog at the level
| specified by \fBSlurmdDebug\fR (at \fBfatal\fR if \fBSlurmdDebug\fR is set to
| \fBquiet\fR), and if it is running in the foreground the syslog level will be
| set to \fBquiet\fR.
| .IP |
| .RS |
| .TP 10 |
| \fBquiet\fR |
| Log nothing |
| .IP |
| |
| .TP |
| \fBfatal\fR |
| Log only fatal errors |
| .IP |
| |
| .TP |
| \fBerror\fR |
| Log only errors |
| .IP |
| |
| .TP |
| \fBinfo\fR |
| Log errors and general informational messages |
| .IP |
| |
| .TP |
| \fBverbose\fR |
| Log errors and verbose informational messages |
| .IP |
| |
| .TP |
| \fBdebug\fR |
| Log errors and verbose informational messages and debugging messages |
| .IP |
| |
| .TP |
| \fBdebug2\fR |
| Log errors and verbose informational messages and more debugging messages |
| .IP |
| |
| .TP |
| \fBdebug3\fR |
| Log errors and verbose informational messages and even more debugging messages |
| .IP |
| |
| .TP |
| \fBdebug4\fR |
| Log errors and verbose informational messages and even more debugging messages |
| .IP |
| |
| .TP |
| \fBdebug5\fR |
| Log errors and verbose informational messages and even more debugging messages |
| .RE |
| .IP |
| \fBNOTE\fR: By default, Slurm's systemd service file starts the slurmd daemon in |
| the foreground with the \-\-systemd option. This means that systemd will capture |
| stdout/stderr output and print that to syslog, independent of Slurm printing to |
| syslog directly. To prevent systemd from doing this, add "StandardOutput=null" |
| and "StandardError=null" to the respective service files or override files. |
| .IP |
| |
| .TP |
| \fBSlurmdTimeout\fR |
| The interval, in seconds, that the Slurm controller waits for \fBslurmd\fR |
| to respond before configuring that node's state to DOWN. |
| A value of zero indicates the node will not be tested by \fBslurmctld\fR to |
| confirm the state of \fBslurmd\fR, the node will not be automatically set to |
| a DOWN state indicating a non\-responsive \fBslurmd\fR, and some other tool |
| will take responsibility for monitoring the state of each compute node |
| and its \fBslurmd\fR daemon. |
| Slurm's hierarchical communication mechanism is used to ping the \fBslurmd\fR |
| daemons in order to minimize system noise and overhead. |
| The default value is 300 seconds. |
| The value may not exceed 65533 seconds. |
| .IP |
| |
| .TP |
| \fBSlurmdUser\fR |
| The name of the user that the \fBslurmd\fR daemon executes as. |
| This user must exist on all nodes of the cluster for authentication |
| of communications between Slurm components. |
| The default value is "root", which should be kept in almost all cases so that |
| slurmd can run jobs as the user that submitted them. |
| .IP |
| |
| .TP |
| \fBSlurmSchedLogFile\fR |
| Fully qualified pathname of the scheduling event logging file. |
| The syntax of this parameter is the same as for \fBSlurmctldLogFile\fR. |
| In order to configure scheduler logging, set both the \fBSlurmSchedLogFile\fR |
| and \fBSlurmSchedLogLevel\fR parameters. |
| .IP |
| |
| .TP |
| \fBSlurmSchedLogLevel\fR |
| The initial level of scheduling event logging, similar to the |
| \fBSlurmctldDebug\fR parameter used to control the initial level of |
| \fBslurmctld\fR logging. |
| Valid values for \fBSlurmSchedLogLevel\fR are "0" (scheduler logging |
| disabled) and "1" (scheduler logging enabled). |
| If this parameter is omitted, the value defaults to "0" (disabled). |
| In order to configure scheduler logging, set both the \fBSlurmSchedLogFile\fR |
| and \fBSlurmSchedLogLevel\fR parameters. |
| The scheduler logging level can be changed dynamically using \fBscontrol\fR. |
| .IP |
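| For example, to enable scheduler event logging to a dedicated file (the path
| is illustrative only):
| .nf
| SlurmSchedLogFile=/var/log/slurm/sched.log
| SlurmSchedLogLevel=1
| .fi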
| |
| .TP |
| \fBSlurmUser\fR |
| The name of the user that the \fBslurmctld\fR daemon executes as. |
| For security purposes, a user other than "root" is recommended. |
| This user must exist on all nodes of the cluster for authentication |
| of communications between Slurm components. |
| The default value is "root". |
| .IP |
| |
| .TP |
| \fBSrunEpilog\fR |
| Fully qualified pathname of an executable to be run by srun following |
| the completion of a job step. The command line arguments for the |
| executable will be the command and arguments of the job step. This |
| configuration parameter may be overridden by srun's \fB\-\-epilog\fR |
| parameter. Note that while the other "Epilog" executables (e.g., |
| TaskEpilog) are run by slurmd on the compute nodes where the tasks are |
| executed, the \fBSrunEpilog\fR runs on the node where the "srun" is |
| executing. |
| .IP |
| |
| .TP |
| \fBSrunPortRange\fR |
| The \fBsrun\fR command creates a set of listening ports to communicate with the
| controller, the slurmstepd and to handle the application I/O.
| By default these ports are ephemeral, meaning the port numbers are selected
| by the kernel. Using this parameter allows sites to configure a range of ports
| from which srun ports will be selected. This is useful if sites want to
| allow only a certain port range on their network.
| |
| \fBNOTE\fR: On Cray systems, Realm\-Specific IP Addressing (RSIP) will |
| automatically try to interact with anything opened on ports 8192\-60000. |
| Configure SrunPortRange to use a range of ports above those used by RSIP, |
| ideally 1000 or more ports, for example "SrunPortRange=60001\-63000". |
| |
| \fBNOTE\fR: \fBSrunPortRange\fR must be large enough to cover the expected |
| number of srun ports created. A single srun opens 4 listening ports plus 2 |
| more for every 48 hosts beyond the first 48. Use of the \fB\-\-pty\fR option |
| will result in an additional port being used. |
| |
| Example: |
| .nf |
| srun \-N 1 will use 4 listening ports. |
| srun \-\-pty \-N 1 will use 5 listening ports. |
| srun \-N 48 will use 4 listening ports. |
| srun \-N 50 will use 6 listening ports. |
| srun \-N 200 will use 12 listening ports. |
| .fi |
| .IP |
| |
| .TP |
| \fBSrunProlog\fR |
| Fully qualified pathname of an executable to be run by srun prior to |
| the launch of a job step. The command line arguments for the |
| executable will be the command and arguments of the job step. This |
| configuration parameter may be overridden by srun's \fB\-\-prolog\fR |
| parameter. Note that while the other "Prolog" executables (e.g., |
| TaskProlog) are run by slurmd on the compute nodes where the tasks are |
| executed, the \fBSrunProlog\fR runs on the node where the "srun" is |
| executing. |
| .IP |
| |
| .TP |
| \fBStateSaveLocation\fR |
| Fully qualified pathname of a directory into which the Slurm controller, |
| \fBslurmctld\fR, saves its state (e.g. "/usr/local/slurm/checkpoint"). |
| Slurm state will be saved here to recover from system failures.
| \fBSlurmUser\fR must be able to create files in this directory. |
| If you have a secondary \fBSlurmctldHost\fR configured, this location should be |
| readable and writable by both systems. |
| Since all running and pending job information is stored here, the use of |
| a reliable file system (e.g. RAID) is recommended. |
| The default value is "/var/spool". |
| If any slurm daemons terminate abnormally, their core files will also be written |
| into this directory. |
| .IP |
| |
| .TP |
| \fBSuspendExcNodes\fR |
| Specifies the nodes which are not to be placed in power save mode, even
| if the node remains idle for an extended period of time. |
| Use Slurm's hostlist expression or NodeSets to identify nodes with an optional |
| ":" separator and count of nodes to exclude from the preceding range. |
| For example "nid[10\-20]:4" will prevent 4 powered up nodes in the set |
| "nid[10\-20]" from being powered down. |
| Multiple sets of nodes can be specified with or without counts in a comma |
| separated list (e.g "nid[10\-20]:4,nid[80\-90]:2"). |
| By default no nodes are excluded. |
| This value may be updated with scontrol. |
| See \fBReconfigFlags=KeepPowerSaveSettings\fR for setting persistence. |
| .IP |
| |
| .TP |
| \fBSuspendExcParts\fR |
| Specifies the partitions whose nodes are not to be placed in power save
| mode, even if the node remains idle for an extended period of time. |
| Multiple partitions can be identified and separated by commas. |
| By default no nodes are excluded. |
| This value may be updated with scontrol. |
| See \fBReconfigFlags=KeepPowerSaveSettings\fR for setting persistence. |
| .IP |
| |
| .TP |
| \fBSuspendExcStates\fR |
| Specifies node states that are not to be powered down automatically. |
| Valid states include CLOUD, DOWN, DRAIN, DYNAMIC_FUTURE, DYNAMIC_NORM, FAIL, |
| INVALID_REG, MAINTENANCE, NOT_RESPONDING, PERFCTRS, PLANNED, and RESERVED. |
| By default, nodes in any of these states will be powered down if idle for
| \fBSuspendTime\fR.
| This value may be updated with scontrol. |
| See \fBReconfigFlags=KeepPowerSaveSettings\fR for setting persistence. |
| .IP |
| |
| .TP |
| \fBSuspendProgram\fR |
| \fBSuspendProgram\fR is the program that will be executed when a node |
| remains idle for an extended period of time. |
| This program is expected to place the node into some power save mode. |
| This can be used to reduce the frequency and voltage of a node or |
| completely power the node off. |
| The program executes as \fBSlurmUser\fR. |
| The argument to the program will be the names of nodes to |
| be placed into power savings mode (using Slurm's hostlist |
| expression format). |
| By default, no program is run. |
| Programs will be killed if they run longer than the largest configured, global |
| or partition, \fBResumeTimeout\fR or \fBSuspendTimeout\fR. |
| .IP |
| |
| .TP |
| \fBSuspendRate\fR |
| The rate at which nodes are placed into power save mode by \fBSuspendProgram\fR. |
| The value is the number of nodes per minute and it can be used to prevent
| a large drop in power consumption (e.g. after a large job completes).
| A value of zero results in no limits being imposed. |
| The default value is 60 nodes per minute. |
| .IP |
| |
| .TP |
| \fBSuspendTime\fR |
| Nodes which remain idle or down for this number of seconds will be placed into |
| power save mode by \fBSuspendProgram\fR. |
| Setting \fBSuspendTime\fR to anything but INFINITE (or \-1) will enable power |
| save mode. INFINITE is the default. |
| .IP |
| |
| .TP |
| \fBSuspendTimeout\fR |
| Maximum time permitted (in seconds) between when a node suspend request |
| is issued and when the node is shut down.
| At that time the node must be ready for a resume request to be issued |
| as needed for new work. |
| The default value is 30 seconds. |
| .IP |
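| Putting the power\-save parameters described above together, a minimal
| illustrative sketch might be (the program path is an assumption; a complete
| setup also involves the corresponding Resume* parameters):
| .nf
| SuspendProgram=/usr/local/sbin/slurm_suspend.sh
| SuspendTime=600
| SuspendRate=60
| SuspendTimeout=30
| SuspendExcNodes=nid[10\-20]:4
| .fi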
| |
| .TP |
| \fBSwitchParameters\fR |
| Optional parameters for the switch plugin. |
| |
| On HPE Slingshot systems configured with \fBSwitchType=switch/hpe_slingshot\fR, |
| the following parameters are supported |
| (separate multiple parameters with a comma): |
| .IP |
| |
| .RS |
| .TP |
| \fBvnis\fR=<\fImin\fR>-<\fImax\fR> |
| Range of VNIs to allocate for jobs and applications. |
| The default value is 1024-65535. |
| .IP |
| |
| .TP |
| \fBdestroy_retries\fR=<\fIretry attempts\fR> |
| Configure the number of times destroying CXI services is retried at the end of |
| the step. There is a one second pause between each retry. |
| The default value is 5. |
| .IP |
| |
| .TP |
| \fBtcs\fR=<\fIclass1\fR>[:<\fIclass2\fR>]... |
| Set of traffic classes to configure for applications. |
| Supported traffic classes are DEDICATED_ACCESS, LOW_LATENCY, BULK_DATA, and |
| BEST_EFFORT. The traffic classes may also be specified as TC_DEDICATED_ACCESS, |
| TC_LOW_LATENCY, TC_BULK_DATA, and TC_BEST_EFFORT. |
| .IP |
| |
| .TP |
| \fBsingle_node_vni\fR=<\fIall\fR|\fIuser\fR|\fInone\fR> |
| If set to 'all', allocate a VNI for all job steps (by default, no VNI will be |
| allocated for single-node job steps). |
| If set to 'user', allocate a VNI for single-node job steps using the \fBsrun\fR |
| \fB\-\-network=single_node_vni\fR option or \fBSLURM_NETWORK=single_node_vni\fR |
| environment variable. |
| If set to 'none' (or if \fBsingle_node_vni\fR is not set), do not allocate any |
| VNI for single-node job steps. |
| For backwards compatibility, setting \fBsingle_node_vni\fR with no argument is |
| equivalent to 'all'. |
| .IP |
| |
| .TP |
| \fBjob_vni\fR=<\fIall\fR|\fIuser\fR|\fInone\fR> |
| If set to 'all', allocate an additional VNI for jobs, shared among all job steps. |
| If set to 'user', allocate an additional VNI for any job using the \fBsrun\fR |
| \fB\-\-network=job_vni\fR option or \fBSLURM_NETWORK=job_vni\fR environment |
| variable. |
| If set to 'none' (or if \fBjob_vni\fR is not set), do not allocate any |
| additional VNI for jobs. For backwards compatibility, setting \fBjob_vni\fR with |
| no argument is equivalent to 'all'. |
| .IP |
| |
| .TP |
| \fBadjust_limits\fR |
| If set, slurmd will set an upper bound on network resource reservations |
| by taking the per-NIC maximum resource quantity and subtracting the |
| reserved or used values (whichever is higher) for any system network services; |
| this is the default. |
| .IP |
| |
| .TP |
| \fBno_adjust_limits\fR |
| If set, slurmd will calculate network resource reservations |
| based only upon the per-resource configuration default and number of tasks |
| in the application; it will not set an upper bound on those reservation |
| requests based on resource usage of already-existing system network services. |
| Setting this will mean more application launches could fail based |
| on network resource exhaustion, but if the application |
| absolutely needs a certain amount of resources to function, this option |
| will ensure that. |
| .IP |
| |
| .TP |
| \fBhwcoll_addrs_per_job\fR |
| The number of Slingshot hardware collectives multicast addresses to allocate |
| per job, for jobs larger than \fBhwcoll_num_nodes\fR nodes. |
| .IP |
| |
| .TP |
| \fBhwcoll_num_nodes\fR |
| The minimum number of nodes for a job to be allocated Slingshot hardware |
| collectives, since the hardware collective engine is not expected to offer a |
| meaningful performance boost for jobs spanning only a small number of nodes. |
| .IP |
| |
| .TP |
| \fBfm_url\fR |
| If set, Slurm will use the configured URL to interface with the fabric |
| manager to enable Slingshot hardware collectives. |
| Note \fBenable_stepmgr\fR needs to be set for hardware collectives to run. |
| .IP |
| |
| .TP |
| \fBfm_auth\fR |
| HPE fabric manager REST API authentication type |
| (BASIC or OAUTH, default OAUTH). |
| .IP |
| |
| .TP |
| \fBfm_authdir\fR |
| Directory containing authentication info files (default /etc/fmsim |
| for BASIC authentication, /etc/wlm-client-auth for OAUTH authentication). |
| .IP |
| |
| .TP |
| \fBfm_mtls_url\fR |
| This sets an alternative URL to \fBfm_url\fR that Slurm daemons will use to |
| interface with the fabric manager to enable Slingshot hardware collectives when |
| mTLS authentication is enabled. If this is not set, \fBfm_url\fR will be used |
| instead. To enable mTLS authentication see \fBfm_mtls_ca\fR, \fBfm_mtls_cert\fR, |
| and \fBfm_mtls_key\fR. |
| |
| \fBNote\fR: Both \fBfm_url\fR and \fBenable_stepmgr\fR must be set to enable |
| Slingshot hardware collectives. |
| .IP |
| |
| .TP |
| \fBfm_mtls_ca\fR |
| Path to Certificate Authority (CA) bundle file or directory containing a file |
| signed by the fabric manager certificate. If set, the identity of the fabric |
| manager server will be verified if Slingshot hardware collectives are enabled. |
| See also \fBfm_mtls_cert\fR and \fBfm_mtls_key\fR. |
| |
| \fBNote\fR: This option is not required to enable mTLS authentication with the |
| fabric manager. However, without it the client (slurmctld and stepmgr processes) |
| will not be able to verify the server identity. |
| .IP |
| |
| .TP |
| \fBfm_mtls_cert\fR |
| Path to client public certificate. This is required to enable mTLS |
| authentication with the fabric manager when Slingshot hardware collectives are |
| enabled. See also \fBfm_mtls_ca\fR and \fBfm_mtls_key\fR. |
| .IP |
| |
| .TP |
| \fBfm_mtls_key\fR |
| Path to client private key. This is required to enable mTLS authentication to |
| the fabric manager when Slingshot hardware collectives are enabled. |
| See also \fBfm_mtls_ca\fR and \fBfm_mtls_cert\fR. |
| .IP |
| |
| .TP |
| \fBdef_<rsrc>\fR=<\fIval\fR> |
| Per-CPU reserved allocation for this resource. |
| .IP |
| |
| .TP |
| \fBres_<rsrc>\fR=<\fIval\fR> |
| Per-node reserved allocation for this resource. |
| If set, overrides the per-CPU allocation. |
| .IP |
| |
| .TP |
| \fBmax_<rsrc>\fR=<\fIval\fR> |
| Maximum per-node allocation for this resource. |
| .IP |
| .RE |
| |
| The resources that may be configured are: |
| .IP |
| |
| .RS |
| .TP |
| \fBtxqs\fR |
| Transmit command queues. The default is 2 per-CPU, maximum 1024 per-node. |
| .IP |
| |
| .TP |
| \fBtgqs\fR |
| Target command queues. The default is 1 per-CPU, maximum 512 per-node. |
| .IP |
| |
| .TP |
| \fBeqs\fR |
| Event queues. The default is 2 per-CPU, maximum 2047 per-node. |
| .IP |
| |
| .TP |
| \fBcts\fR |
| Counters. The default is 1 per-CPU, maximum 2047 per-node. |
| .IP |
| |
| .TP |
| \fBtles\fR |
| Trigger list entries. The default is 1 per-CPU, maximum 2048 per-node. |
| .IP |
| |
| .TP |
| \fBptes\fR |
| Portals table entries. The default is 6 per-CPU, maximum 2048 per-node. |
| .IP |
| |
| .TP |
| \fBles\fR |
| List entries. The default is 16 per-CPU, maximum 16384 per-node. |
| .IP |
| |
| .TP |
| \fBacs\fR |
| Addressing contexts. The default is 2 per-CPU, maximum 1022 per-node. |
| .IP |
| .RE |
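| For example, a Slingshot system might be configured with (values are |
| illustrative): |
| .br |
| SwitchType=switch/hpe_slingshot |
| .br |
| SwitchParameters=vnis=32768\-65535,tcs=DEDICATED_ACCESS:BEST_EFFORT,def_txqs=4 |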
| |
| On systems configured with \fBSwitchType=switch/nvidia_imex\fR, the following |
| parameters are supported: |
| .RS |
| .TP |
| \fBimex_channel_count\fR |
| Number of channels that can be configured. Channels allow nodes to create a |
| secure method of sharing memory. The default value is 2048. |
| |
| \fBNOTE\fR: The batch and interactive steps will not have imex channels |
| created since they run on a single node. Once you start creating job steps |
| that span nodes you will see the channels created. |
| .RE |
| .IP |
| |
| .TP |
| \fBSwitchType\fR |
| Identifies the type of switch or interconnect used for application |
| communications. |
| By default no switch plugin is used; none is needed for interconnects that |
| require no special processing for job launch or termination (e.g. Ethernet |
| and InfiniBand). |
| All Slurm daemons, commands and running jobs must be restarted or reconfigured |
| for a change in \fBSwitchType\fR to take effect. |
| If running jobs exist at the time \fBslurmctld\fR is restarted with a new |
| value of \fBSwitchType\fR, records of all jobs in any state may be lost. |
| Acceptable values include: |
| .IP |
| .RS |
| .TP 15 |
| \fBswitch/hpe_slingshot\fR |
| For HPE Slingshot systems. |
| .IP |
| |
| .TP |
| \fBswitch/nvidia_imex\fR |
| For allocating unique channels within an NVIDIA IMEX domain. |
| .RE |
| .IP |
| |
| .TP |
| \fBTaskEpilog\fR |
| Fully qualified pathname of a program to be executed as the slurm job's user |
| after termination of each task. Will run inside of the job's container if |
| configured. Should not be used for policy enforcement. |
| See \fBTaskProlog\fR for execution order details. |
| .IP |
| |
| .TP |
| \fBTaskPlugin\fR |
| Identifies the type of task launch plugin, typically used to provide |
| resource management within a node (e.g. pinning tasks to specific |
| processors). More than one task plugin can be specified in a comma\-separated |
| list. The prefix of "task/" is optional. Unset by default. |
| Acceptable values include: |
| .IP |
| .RS |
| .TP 15 |
| \fBtask/affinity\fR |
| binds processes to specified resources using sched_setaffinity(). |
| This enables the \-\-cpu\-bind and/or \-\-mem\-bind srun options. |
| .IP |
| |
| .TP |
| \fBtask/cgroup\fR |
| enables process containment to specified resources using Cgroups cpuset |
| interface. This enables the \-\-cpu\-bind and/or \-\-mem\-bind srun options. |
| \fBNOTE\fR: see "man cgroup.conf" for configuration details. |
| .RE |
| .IP |
| |
| .RS |
| \fBNOTE\fR: It is recommended to stack \fBtask/cgroup,task/affinity\fR together |
| when configuring TaskPlugin, and setting \fBConstrainCores=yes\fR in |
| \fBcgroup.conf\fR. This setup uses the task/affinity plugin for setting the |
| cpu mask for tasks and uses the task/cgroup plugin to fence tasks into the |
| allocated cpus. |
| .RE |
| .IP |
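| For example, a configuration following the recommendation above would be: |
| .br |
| TaskPlugin=task/cgroup,task/affinity |
| .br |
| together with \fBConstrainCores=yes\fR in \fBcgroup.conf\fR. |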
| |
| .TP |
| \fBTaskPluginParam\fR |
| Optional parameters for the task plugin. |
| Multiple options should be comma separated. |
| \fBNone\fR, \fBSockets\fR, \fBCores\fR and \fBThreads\fR are mutually |
| exclusive and treated as a last possible source of \-\-cpu\-bind default. See also |
| Node and Partition CpuBind options. |
| .IP |
| .RS |
| .TP |
| \fBCores\fR |
| Bind tasks to cores by default. |
| Overrides automatic binding. |
| .IP |
| |
| .TP |
| \fBNone\fR |
| Perform no task binding by default. |
| Overrides automatic binding. |
| .IP |
| |
| .TP |
| \fBSockets\fR |
| Bind to sockets by default. |
| Overrides automatic binding. |
| .IP |
| |
| .TP |
| \fBThreads\fR |
| Bind to threads by default. |
| Overrides automatic binding. |
| .IP |
| |
| .TP |
| \fBSlurmdSpecOverride\fR |
| If slurmd is started in a cgroup which has cpuset or memory constraints, then |
| \fBCpuSpecList\fR and \fBMemSpecLimit\fR will be set accordingly and will |
| override the configured values, preventing those constrained resources from |
| being scheduled. With cgroup/v1, the slurmd and slurmstepd daemons will then |
| not be able to use any of these resources, whereas the normal cgroup/v1 |
| behavior is to constrain the daemons to \fBCpuSpecList\fR and \fBMemSpecLimit\fR. |
| .IP |
| |
| .TP |
| \fBSlurmdOffSpec\fR |
| If specialized cores or CPUs are identified for the node (i.e. the |
| \fBCoreSpecCount\fR or \fBCpuSpecList\fR are configured for the node), |
| then Slurm daemons running on the compute node (i.e. slurmd and slurmstepd) |
| should run outside of those resources (i.e. specialized resources are |
| completely unavailable to Slurm daemons and jobs spawned by Slurm). |
| .IP |
| |
| .TP |
| \fBOOMKillStep\fR |
| Set this parameter to kill the whole step in all the nodes in case an OOM event |
| is triggered in any task of the step. |
| |
| This applies to entire allocations but does not apply to the external step. |
| It can be overridden by the user. |
| |
| \fBNOTE\fR: This parameter requires the \fBtask/cgroup\fR plugin, Cgroups v2, |
| and a kernel newer than 4.19. |
| .IP |
| |
| .TP |
| \fBVerbose\fR |
| Verbosely report binding before tasks run by default. |
| .IP |
| |
| .TP |
| \fBAutobind\fR |
| Set a default binding in the event that "auto binding" doesn't find a match. |
| Set to Threads, Cores or Sockets (E.g. TaskPluginParam=autobind=threads). |
| .RE |
| .IP |
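| For example, an illustrative combination of the options above: |
| .br |
| TaskPluginParam=Cores,Verbose,SlurmdOffSpec |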
| |
| .TP |
| \fBTaskProlog\fR |
| Fully qualified pathname of a program to be executed as the slurm job's user |
| prior to initiation of each task. Will run inside of the job's container if |
| configured. Should not be used for policy enforcement. |
| Besides the normal environment variables, this has SLURM_TASK_PID |
| available to identify the process ID of the task being started. |
| Standard output from this program can be used to control the environment |
| variables and output for the user program. |
| .IP |
| .RS |
| .TP 20 |
| \fBexport NAME=value\fR |
| Will set environment variables for the task being spawned. |
| Everything after the equal sign to the end of the |
| line will be used as the value for the environment variable. |
| Exporting of functions is not currently supported. |
| .IP |
| |
| .TP |
| \fBprint ...\fR |
| Will cause that line (without the leading "print ") |
| to be printed to the job's standard output. |
| .IP |
| |
| .TP |
| \fBunset NAME\fR |
| Will clear environment variables for the task being spawned. |
| .IP |
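| For example, a TaskProlog whose standard output contains the following lines |
| (the variable names and values are illustrative) would set one environment |
| variable for the task, print a message to the job's output, and clear another |
| variable: |
| .br |
| export MY_APP_SCRATCH=/local/scratch |
| .br |
| print task prolog finished |
| .br |
| unset HTTP_PROXY |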
| |
| .TP |
| The order of task prolog/epilog execution is as follows: |
| .IP |
| |
| .TP |
| \fB1. pre_launch_priv()\fR |
| Function in TaskPlugin |
| .IP |
| |
| .TP |
| \fB2. pre_launch()\fR |
| Function in TaskPlugin |
| .IP |
| |
| .TP |
| \fB3. TaskProlog\fR |
| System\-wide per task program defined in slurm.conf |
| .IP |
| |
| .TP |
| \fB4. User prolog\fR |
| Job\-step\-specific task program defined using |
| \fBsrun\fR's \fB\-\-task\-prolog\fR option or \fBSLURM_TASK_PROLOG\fR |
| environment variable |
| .IP |
| |
| .TP |
| \fB5. Task\fR |
| Execute the job step's task |
| .IP |
| |
| .TP |
| \fB6. User epilog\fR |
| Job\-step\-specific task program defined using |
| \fBsrun\fR's \fB\-\-task\-epilog\fR option or \fBSLURM_TASK_EPILOG\fR |
| environment variable |
| .IP |
| |
| .TP |
| \fB7. TaskEpilog\fR |
| System\-wide per task program defined in slurm.conf |
| .IP |
| |
| .TP |
| \fB8. post_term()\fR |
| Function in TaskPlugin |
| .RE |
| .IP |
| |
| .TP |
| \fBTCPTimeout\fR |
| Time permitted for TCP connection to be established. Default value is 2 seconds. |
| .IP |
| |
| .TP |
| \fBTLSParameters\fR |
| Comma\-separated options for the TLS plugin configured by \fBTLSType\fR. |
| Supported values include: |
| .IP |
| .RS |
| .TP |
| \fBca_cert_file=\fR |
| Path of certificate authority (CA) certificate. Must exist on all hosts and be |
| accessible by all Slurm components. File permissions must be 644, and owned by |
| SlurmUser/root. |
| |
| Default path is "ca_cert.pem" in the Slurm configuration directory |
| .IP |
| |
| .TP |
| \fBctld_cert_file=\fR |
| Path of certificate used by slurmctld. Must chain to \fBca_cert_file\fR. Should |
| only exist on host running slurmctld. File permissions must be 600, and owned |
| by SlurmUser. |
| |
| Default path is "ctld_cert.pem" in the Slurm configuration directory |
| .IP |
| |
| .TP |
| \fBctld_cert_key_file=\fR |
| Path of private key that accompanies \fBctld_cert_file\fR. Should only exist on |
| host running slurmctld. File permissions must be 600, and owned by SlurmUser. |
| |
| Default path is "ctld_cert_key.pem" in the Slurm configuration directory |
| .IP |
| |
| .TP |
| \fBrestd_cert_file=\fR |
| Path of certificate used by slurmrestd. Must chain to \fBca_cert_file\fR. Should |
| only exist on host running slurmrestd. File permissions must be 600, and owned |
| by the user that runs slurmrestd. |
| |
| Default path is "restd_cert.pem" in the Slurm configuration directory |
| .IP |
| |
| .TP |
| \fBrestd_cert_key_file=\fR |
| Path of private key that accompanies \fBrestd_cert_file\fR. Should only exist |
| on host running slurmrestd. File permissions must be 600, and owned by the user |
| that runs slurmrestd. |
| |
| Default path is "restd_cert_key.pem" in the Slurm configuration directory |
| .IP |
| |
| .TP |
| \fBsackd_cert_file=\fR |
| Path of certificate used by sackd. Must chain to \fBca_cert_file\fR. Should |
| only exist on host running sackd. File permissions must be 600, and owned |
| by SlurmUser. |
| |
| Default path is "sackd_cert.pem" in the Slurm configuration directory |
| |
| NOTE: If not using the certmgr plugin, this file needs to exist. |
| .IP |
| |
| .TP |
| \fBsackd_cert_key_file=\fR |
| Path of private key that accompanies \fBsackd_cert_file\fR. Should only exist on |
| host running sackd. File permissions must be 600, and owned by SlurmUser. |
| |
| Default path is "sackd_cert_key.pem" in the Slurm configuration directory |
| |
| NOTE: If not using the certmgr plugin, this file needs to exist. |
| .IP |
| |
| .TP |
| \fBslurmd_cert_file=\fR |
| Path of certificate used by slurmd. Must chain to \fBca_cert_file\fR. Should |
| only exist on host running slurmd. File permissions must be 600, and owned |
| by SlurmUser. |
| |
| Default path is "slurmd_cert.pem" in the Slurm configuration directory |
| |
| NOTE: If not using the certmgr plugin, this file needs to exist. |
| .IP |
| |
| .TP |
| \fBslurmd_cert_key_file=\fR |
| Path of private key that accompanies \fBslurmd_cert_file\fR. Should only exist on |
| host running slurmd. File permissions must be 600, and owned by SlurmUser. |
| |
| Default path is "slurmd_cert_key.pem" in the Slurm configuration directory |
| |
| NOTE: If not using the certmgr plugin, this file needs to exist. |
| .IP |
| |
| .TP |
| \fBload_system_certificates\fR |
| Load certificates found in default system locations (e.g. /etc/ssl) into trust store. |
| |
| Default is to not load system certificates, and to rely solely on |
| \fBca_cert_file\fR to establish trust. |
| .IP |
| |
| .TP |
| \fBsecurity_policy_version=\fR |
| Security policy version used by s2n. See s2n documentation for more details. |
| Default security policy is "20230317", which is FIPS compliant and includes TLS 1.3. |
| .RE |
| .IP |
| |
| .TP |
| \fBTLSType\fR |
| Specify the TLS implementation that will be used. Unset by default. |
| Acceptable values at present: |
| .IP |
| .RS |
| .TP |
| \fBtls/s2n\fR |
| Use the s2n TLS plugin. Requires additional configuration and causes significant |
| processing overhead, but allows all Slurm communication to be encrypted. Refer |
| to the TLS guide for more details: <https://slurm.schedmd.com/tls.html> |
| .RE |
| .IP |
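| For example, to enable TLS with a shared CA certificate (the path is |
| illustrative): |
| .br |
| TLSType=tls/s2n |
| .br |
| TLSParameters=ca_cert_file=/etc/slurm/ca_cert.pem,load_system_certificates |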
| |
| .TP |
| \fBTmpFS\fR |
| Fully qualified pathname of the file system available to user jobs for |
| temporary storage. This parameter is used in establishing a node's \fBTmpDisk\fR |
| space. |
| The default value is "/tmp". |
| .IP |
| |
| .TP |
| \fBTopologyParam\fR |
| Comma\-separated options identifying network topology options. |
| .IP |
| .RS |
| .TP 17 |
| \fBDragonfly\fR |
| Optimize allocation for Dragonfly network. |
| Valid when TopologyPlugin=topology/tree. |
| .IP |
| |
| .TP |
| \fBRoutePart\fR |
| Instead of using the plugin's default route calculation, use partition node |
| lists to route communications from the controller. Once on the compute node, |
| communications will be routed using the requested plugin's normal algorithm, |
| following TreeWidth if applicable. If a node is in multiple partitions, |
| the first partition seen will be used. The controller will communicate directly |
| with any nodes that aren't in a partition. |
| .IP |
| |
| .TP |
| \fBBlockAsNodeRank\fR |
| Assign the same node rank to all nodes under one base block. |
| This can be useful if the naming convention for the nodes does not match the |
| network topology. |
| Valid when topology/block is a cluster default topology. |
| .IP |
| |
| .TP |
| \fBSwitchAsNodeRank\fR |
| Assign the same node rank to all nodes under one leaf switch. |
| This can be useful if the naming convention for the nodes does not match the |
| network topology. |
| Valid when topology/tree is a cluster default topology. |
| .IP |
| |
| .TP |
| \fBRouteTree\fR |
| Use the switch hierarchy defined in a \fItopology.conf\fR file for routing |
| instead of just scheduling. |
| Valid when TopologyPlugin=topology/tree. |
| Incompatible with dynamic nodes. |
| .IP |
| |
| .TP |
| \fBTopoMaxSizeUnroll\fR=\# |
| Maximum number of individual job sizes automatically unrolled |
| from min-max nodes job specification. |
| Default: -1 (option disabled). |
| Valid when TopologyPlugin=topology/block. |
| .IP |
| |
| .TP |
| \fBTopoOptional\fR |
| Only optimize allocation for network topology if the job includes a switch |
| option. Since optimizing resource allocation for topology involves much higher |
| system overhead, this option can be used to impose the extra overhead only on |
| jobs which can take advantage of it. If most job allocations are not optimized |
| for network topology, they may fragment resources to the point that topology |
| optimization for other jobs will be difficult to achieve. |
| \fBNOTE\fR: Jobs may span across nodes without common parent switches with |
| this enabled. |
| .RE |
| .IP |
| |
| .TP |
| \fBTopologyPlugin\fR |
| Identifies the plugin to be used for determining the network topology |
| and optimizing job allocations to minimize network contention. |
| See \fBNETWORK TOPOLOGY\fR below for details. |
| Additional plugins may be provided in the future which gather topology |
| information directly from the network. |
| Acceptable values include: |
| .IP |
| .RS |
| .TP 21 |
| \fBtopology/block\fR |
| used for a block network topology, as described in the \fBtopology.conf\fR(5) |
| man page |
| .IP |
| |
| .TP |
| \fBtopology/flat\fR |
| best\-fit logic over one\-dimensional topology. This is the default. |
| .IP |
| |
| .TP |
| \fBtopology/tree\fR |
| used for a hierarchical network with the select/cons_tres plugin, |
| as described in the \fBtopology.conf\fR(5) |
| man page |
| .RE |
| \fBNOTE\fR: This option is ignored if topology.yaml exists. |
| .IP |
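| For example, a hierarchical network described in \fBtopology.conf\fR(5) that |
| is also used for message routing, optimizing only jobs that request it: |
| .br |
| TopologyPlugin=topology/tree |
| .br |
| TopologyParam=RouteTree,TopoOptional |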
| |
| .TP |
| \fBTrackWCKey\fR |
| Boolean yes or no. Controls whether the Workload Characterization Key (WCKey) |
| is displayed and tracked. Must be set for wckey usage to be tracked correctly. |
| \fBNOTE\fR: You must also set TrackWCKey in your slurmdbd.conf file to create |
| historical usage reports. |
| .IP |
| |
| .TP |
| \fBTreeWidth\fR |
| \fBSlurmd\fR daemons use a virtual tree network for communications. |
| \fBTreeWidth\fR specifies the width of the tree (i.e. the fanout). |
| The default value is 16, meaning each slurmd daemon can |
| communicate with up to 16 other slurmd daemons. This value balances offloading |
| slurmctld (max 16 threads running), time of communication, and node fault |
| tolerance (4368 nodes can be contacted with three message hops). The default |
| value will work well for most clusters; however, on bigger systems this value |
| can be increased to avoid long timeouts and retransmissions caused by |
| unresponsive nodes. The value may not exceed 65533. |
| .IP |
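| For example, on a very large system the fanout might be raised: |
| .br |
| TreeWidth=32 |
| .br |
| By the same calculation, a fanout of 32 allows 32 + 1024 + 32768 = 33824 |
| slurmd daemons to be reached within three message hops. |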
| |
| .TP |
| \fBUnkillableStepProgram\fR |
| If the processes in a job step are determined to be unkillable for a period |
| of time specified by the \fBUnkillableStepTimeout\fR variable, the program |
| specified by \fBUnkillableStepProgram\fR will be executed. |
| By default no program is run. |
| |
| See section \fBUNKILLABLE STEP PROGRAM SCRIPT\fR for more information. |
| .IP |
| |
| .TP |
| \fBUnkillableStepTimeout\fR |
| The length of time, in seconds, that Slurm will wait before deciding that |
| processes in a job step are unkillable (after they have been signaled with |
| SIGKILL) and execute \fBUnkillableStepProgram\fR. |
| The default timeout value is 60 seconds or five times the value of |
| MessageTimeout, whichever is greater. |
| If exceeded, the compute node will be drained to prevent future jobs from being |
| scheduled on the node. |
| |
| \fBNOTE\fR: Ensure that UnkillableStepTimeout is at least 5 times larger than |
| MessageTimeout, otherwise it can lead to unexpected draining of nodes. |
| .IP |
| |
| .TP |
| \fBUrlParserType\fR |
| Specify the url_parser implementation that will be used. Default is |
| \fIurl_parser/libhttp_parser\fR. |
| Acceptable values at present: |
| .IP |
| .RS |
| .TP |
| \fBurl_parser/libhttp_parser\fR |
| Use the libhttp_parser based plugin. |
| .RE |
| .IP |
| |
| .TP |
| \fBUsePAM\fR |
| If set to 1, PAM (Pluggable Authentication Modules for Linux) will be enabled. |
| PAM is used to establish the upper bounds for resource limits. With PAM support |
| enabled, local system administrators can dynamically configure system resource |
| limits. Changing the upper bound of a resource limit will not alter the limits |
| of running jobs, only jobs started after a change has been made will pick up |
| the new limits. |
| The default value is 0 (not to enable PAM support). |
| Remember that PAM also needs to be configured to support Slurm as a service. |
| For sites using PAM's directory based configuration option, a configuration |
| file named \fBslurm\fR should be created. The module\-type, control\-flags, and |
| module\-path names that should be included in the file are: |
| .br |
| auth required pam_localuser.so |
| .br |
| auth required pam_shells.so |
| .br |
| account required pam_unix.so |
| .br |
| account required pam_access.so |
| .br |
| session required pam_unix.so |
| .br |
| For sites configuring PAM with a general configuration file, the appropriate |
| lines (see above), where \fBslurm\fR is the service\-name, should be added. |
| See <https://slurm.schedmd.com/pam_slurm_adopt.html> for more details. |
| |
| \fBNOTE\fR: The UsePAM option has nothing to do with the |
| \fBcontribs/pam/pam_slurm\fR and/or \fBcontribs/pam_slurm_adopt\fR modules, so |
| these two modules can work independently of the value set for UsePAM. |
| .IP |
| |
| .TP |
| \fBVSizeFactor\fR |
| Memory specifications in job requests apply to real memory size (also known |
| as resident set size). It is possible to enforce virtual memory limits for |
| both jobs and job steps by limiting their virtual memory to some percentage |
| of their real memory allocation. The \fBVSizeFactor\fR parameter specifies |
| the job's or job step's virtual memory limit as a percentage of its real |
| memory limit. For example, if a job's real memory limit is 500MB and |
| VSizeFactor is set to 101 then the job will be killed if its real memory |
| exceeds 500MB or its virtual memory exceeds 505MB (101 percent of the |
| real memory limit). |
| The default value is 0, which disables enforcement of virtual memory limits. |
| The value may not exceed 65533 percent. |
| |
| \fBNOTE\fR: This parameter is dependent on \fBOverMemoryKill\fR being |
| configured in \fBJobAcctGatherParams\fR. It is also possible |
| to configure the \fBTaskPlugin\fR to use \fBtask/cgroup\fR for memory |
| enforcement. \fBVSizeFactor\fR will not have an effect on memory enforcement |
| done through cgroups. |
| .IP |
| |
| .TP |
| \fBWaitTime\fR |
| Specifies how many seconds the srun command should by default wait after |
| the first task terminates before terminating all remaining tasks. The |
| "\-\-wait" option on the srun command line overrides this value. |
| The default value is 0, which disables this feature. |
| May not exceed 65533 seconds. |
| .IP |
| |
| .TP |
| \fBX11Parameters\fR |
| For use with Slurm's built\-in X11 forwarding implementation. |
| .IP |
| .RS |
| .TP 8 |
| \fBhome_xauthority\fR |
| If set, xauth data on the compute node will be placed in \fB~/.Xauthority\fR |
| rather than in a temporary file under \fBTmpFS\fR. |
| .RE |
| .IP |
| |
| .SH "NODE CONFIGURATION" |
| The configuration of nodes (or machines) to be managed by Slurm is |
| also specified in \fB/etc/slurm.conf\fR. |
| Changes in node configuration (e.g. adding nodes, changing their |
| processor count, etc.) require restarting or reconfiguring all slurmctld |
| and slurmd daemons. |
| All slurmd daemons must know each node in the system to forward |
| messages in support of hierarchical communications. |
| Only the NodeName must be supplied in the configuration file. |
| All other node configuration information is optional. |
| It is advisable to establish baseline node configurations, |
| especially if the cluster is heterogeneous. |
| Nodes which register to the system with less than the configured resources |
| (e.g. too little memory), will be placed in the "DOWN" state to |
| avoid scheduling jobs on them. |
| Establishing baseline configurations will also speed Slurm's |
| scheduling process by permitting it to compare job requirements |
| against these (relatively few) configuration parameters and |
| possibly avoid having to check job requirements |
| against every individual node's configuration. |
| The resources checked at node registration time are: CPUs, |
| RealMemory and TmpDisk. |
| .LP |
| Default values can be specified with a record in which |
| \fBNodeName\fR is "DEFAULT". |
| The default entry values will apply only to lines following it in the |
| configuration file and the default values can be reset multiple times |
| in the configuration file with multiple entries where "NodeName=DEFAULT". |
| Each line where \fBNodeName\fR is "DEFAULT" will replace or add to previous |
| default values and will not reinitialize the default values. |
| The "NodeName=" specification must be placed on every line |
| describing the configuration of nodes. |
| A single node name can not appear as a NodeName value in more than one line |
| (duplicate node name records will be ignored). |
| In fact, it is generally possible and desirable to define the |
| configurations of all nodes in only a few lines. |
| This convention permits significant optimization in the scheduling |
| of larger clusters. |
| In order to support the concept of jobs requiring consecutive nodes |
| on some architectures, |
| node specifications should be placed in this file in consecutive order. |
| No single node name may be listed more than once in the configuration |
| file. |
| Use "DownNodes=" to record the state of nodes which are temporarily |
| in a DOWN, DRAIN or FAILING state without altering permanent |
| configuration information. |
| A job step's tasks are allocated to nodes in the order the nodes appear |
| in the configuration file. There is presently no capability within |
| Slurm to arbitrarily order a job step's tasks. |
| .LP |
| Multiple node names may be comma separated (e.g. "alpha,beta,gamma") |
| and/or a simple node range expression may optionally be used to |
| specify numeric ranges of nodes to avoid building a configuration |
| file with large numbers of entries. |
| The node range expression can contain one pair of square brackets |
| with a sequence of comma\-separated numbers and/or ranges of numbers |
| separated by a "\-" (e.g. "linux[0\-64,128]", or "lx[15,18,32\-33]"). |
| Note that the numeric ranges can include one or more leading |
| zeros to indicate the numeric portion has a fixed number of digits |
| (e.g. "linux[0000\-1023]"). |
| Multiple numeric ranges can be included in the expression |
| (e.g. "rack[0\-63]_blade[0\-41]"). |
| If one or more numeric expressions are included, one of them |
| must be at the end of the name (e.g. "unit[0\-31]rack" is invalid), |
| but arbitrary names can always be used in a comma\-separated list. |
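| For example, the following two lines (hardware values are illustrative) define |
| a common baseline and then sixteen nodes whose communication addresses differ |
| from their names: |
| .br |
| NodeName=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64000 |
| .br |
| NodeName=lx[0\-15] NodeAddr=elx[0\-15] |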
| .LP |
| The node configuration specifies the following information: |
| |
| .TP |
| \fBNodeName\fR |
| Name that Slurm uses to refer to a node. |
| Typically this would be the string that "/bin/hostname \-s" returns. |
| It may also be the fully qualified domain name as returned by "/bin/hostname \-f" |
| (e.g. "foo1.bar.com"), or any valid domain name associated with the host |
| through the host database (/etc/hosts) or DNS, depending on the resolver |
| settings. Note that if the short form of the hostname is not used, it |
| may prevent use of hostlist expressions (the numeric portion in brackets |
| must be at the end of the string). |
| It may also be an arbitrary string if \fBNodeHostname\fR is specified. |
| If the \fBNodeName\fR is "DEFAULT", the values specified |
| with that record will apply to subsequent node specifications |
| unless explicitly set to other values in that node record or |
| replaced with a different set of default values. |
| Each line where \fBNodeName\fR is "DEFAULT" will replace or add to previous |
| default values and not reinitialize the default values. |
| For architectures in which the node order is significant, |
| nodes will be considered consecutive in the order defined. |
| For example, if the configuration for "NodeName=charlie" immediately |
| follows the configuration for "NodeName=baker" they will be |
| considered adjacent in the computer. |
| \fBNOTE\fR: If the \fBNodeName\fR is "ALL" the process parsing the configuration |
| will exit immediately as it is an internally reserved word. |
| .IP |
| |
| .TP |
| \fBNodeHostname\fR |
| Typically this would be the string that "/bin/hostname \-s" returns. |
| It may also be the fully qualified domain name as returned by "/bin/hostname \-f" |
| (e.g. "foo1.bar.com"), or any valid domain name associated with the host |
| through the host database (/etc/hosts) or DNS, depending on the resolver |
| settings. Note that if the short form of the hostname is not used, it |
| may prevent use of hostlist expressions (the numeric portion in brackets |
| must be at the end of the string). |
| A node range expression can be used to specify a set of nodes. |
| If an expression is used, the number of nodes identified by |
| \fBNodeHostname\fR on a line in the configuration file must |
| be identical to the number of nodes identified by \fBNodeName\fR. |
| By default, the \fBNodeHostname\fR will be identical in value to |
| \fBNodeName\fR. |
| .IP |
| |
| .TP |
| \fBNodeAddr\fR |
| Name that a node should be referred to in establishing |
| a communications path. |
| This name will be used as an |
| argument to the getaddrinfo() function for identification. |
| If a node range expression is used to designate multiple nodes, |
| they must exactly match the entries in the \fBNodeName\fR |
| (e.g. "NodeName=lx[0\-7] NodeAddr=elx[0\-7]"). |
| \fBNodeAddr\fR may also contain IP addresses. |
| By default, the \fBNodeAddr\fR will be identical in value to |
| \fBNodeHostname\fR. |
| .IP |
| |
| .TP |
| \fBBcastAddr\fR |
| Alternate network path to be used for sbcast network traffic to a given node. |
| This name will be used as an argument to the getaddrinfo() function. |
| If a node range expression is used to designate multiple nodes, |
| they must exactly match the entries in the \fBNodeName\fR |
| (e.g. "NodeName=lx[0\-7] BcastAddr=elx[0\-7]"). |
| \fBBcastAddr\fR may also contain IP addresses. |
| By default, the \fBBcastAddr\fR is unset, and sbcast traffic will be routed |
| to the \fBNodeAddr\fR for a given node. |
| Note: cannot be used with CommunicationParameters=NoInAddrAny. |
| .IP |
| |
| .TP |
| \fBBoards\fR |
| Number of Baseboards in nodes with a baseboard controller. |
| Note that when Boards is specified, SocketsPerBoard, |
| CoresPerSocket, and ThreadsPerCore should be specified. |
| The default value is 1. |
| .IP |
| |
| .TP |
| \fBCoreSpecCount\fR |
| Number of cores reserved for system use. |
| Depending upon the \fBTaskPluginParam\fR option of \fBSlurmdOffSpec\fR, |
| the Slurm daemon slurmd may either be confined to these |
| resources (the default) or prevented from using these resources. |
| If cgroup/v1 is used, the same applies to the slurmstepd processes. |
| Isolation of slurmd from user jobs may improve application performance. |
| A job can use these cores if AllowSpecResourcesUsage=yes and the user |
| explicitly requests less than the configured CoreSpecCount. |
| If this option and \fBCpuSpecList\fR are both designated for a |
| node, an error is generated. For information on the algorithm used by Slurm |
| to select the cores refer to the core specialization documentation |
| ( https://slurm.schedmd.com/core_spec.html ). |
| .IP |
| |
| .TP |
| \fBCoresPerSocket\fR |
| Number of cores in a single physical processor socket (e.g. "2"). |
| The CoresPerSocket value describes physical cores, not the |
| logical number of processors per socket. |
| \fBNOTE\fR: If you have multi\-core processors, you will likely |
| need to specify this parameter in order to optimize scheduling. |
| The default value is 1. |
| .IP |
| |
| .TP |
| \fBCpuBind\fR |
| If a job step request does not specify an option to control how tasks are bound |
| to allocated CPUs (by using \-\-cpu\-bind) and all nodes allocated to the job |
| have the same \fBCpuBind\fR option, the node \fBCpuBind\fR option will control |
| how tasks are bound to allocated resources. Partition definitions are used next |
| if the node definition(s) can't be used, followed by \fBTaskPluginParam\fR as a |
| last resort, with the default being no binding. Supported values for |
| CpuBind are \fBnone\fR, \fBsocket\fR, \fBldom\fR (NUMA), \fBcore\fR and |
| \fBthread\fR. |
| .IP |
| |
| .TP |
| \fBCPUs\fR |
| Number of logical processors on the node (e.g. "2"). |
| It can be set to the total |
| number of sockets (supported only by select/linear), cores or threads. |
| This can be useful when you want to schedule only the cores on a hyper\-threaded |
| node. If \fBCPUs\fR is omitted, its default will be set equal to the product of |
| \fBBoards\fR, \fBSockets\fR, \fBCoresPerSocket\fR, and \fBThreadsPerCore\fR. |
| .IP |
| |
| .TP |
| \fBCpuSpecList\fR |
| A comma\-delimited list of Slurm abstract CPU IDs reserved for system use. |
| The list will be expanded to include all other CPUs, if any, on the same cores. |
| Depending upon the \fBTaskPluginParam\fR option of \fBSlurmdOffSpec\fR, |
| the Slurm daemon slurmd may either be confined to these |
| resources (the default) or prevented from using these resources. |
| If cgroup/v1 is used, the same applies to the slurmstepd processes. |
| Isolation of slurmd from user jobs may improve application performance. |
| A job can use these cores if AllowSpecResourcesUsage=yes and the user |
| explicitly requests less than the number of CPUs in this list. |
| If this option and \fBCoreSpecCount\fR are both designated for a node, |
| an error is generated. |
| This option has no effect unless cgroup job confinement is also configured |
| (i.e. the \fItask/cgroup\fR \fBTaskPlugin\fR is enabled and |
| \fBConstrainCores=yes\fR is set in cgroup.conf). |
| .IP |
| |
| .TP |
| \fBFeatures\fR |
| A comma\-delimited list of arbitrary strings indicative of some |
| characteristic associated with the node. |
| There is no value or count associated with a feature at this time, a node |
| either has a feature or it does not. |
| A desired feature may contain a numeric component indicating, |
| for example, processor speed but this numeric component will be considered to |
| be part of the feature string. Features are intended to be used to filter nodes |
| eligible to run jobs via the \fB\-\-constraint\fR argument. |
| By default a node has no features. |
| Also see \fBGres\fR for being able to have more control such as types and |
| count. Using features is faster than scheduling against GRES but is limited to |
| Boolean operations. |
| |
| \fBNOTE\fR: The hostlist function \fBfeature{myfeature}\fR expands to all nodes |
| with the specified feature. This may be used in place of or alongside regular |
| hostlist expressions in commands or configuration files that interact with the |
| slurmctld. |
| For example: \fBscontrol update node=feature{myfeature} state=resume\fR or |
| \fBPartitionName=p1 Nodes=feature{myfeature}\fR. |
| .IP |
| |
| .TP |
| \fBGres\fR |
| A comma\-delimited list of generic resources specifications for a node. |
| The format is: "<name>[:<type>][:no_consume]:<number>[K|M|G]". |
| The first field is the resource name, which matches the GresType configuration |
| parameter name. |
| The optional type field might be used to identify a model of that generic |
| resource. |
| It is forbidden to specify both an untyped GRES and a typed GRES with the same |
| <name>. |
| The optional no_consume field allows you to specify that a |
| generic resource does not have a finite number of that resource that gets |
| consumed as it is requested. The no_consume field is a GRES specific setting |
| and applies to the GRES, regardless of the type specified. |
| It should not be used with a GRES that has a dedicated plugin. If you're |
| looking for a way to overcommit GPUs to multiple processes at the same time, |
| you may be interested in using the "shard" GRES instead. |
| The final field must specify a generic resources count. |
| A suffix of "K", "M", "G", "T" or "P" may be used to multiply the number by |
| 1024, 1048576, 1073741824, etc. respectively. |
| (e.g."Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_consume:4G"). |
| By default a node has no generic resources and its maximum count is |
| that of an unsigned 64bit integer. |
| Also see \fBFeatures\fR for Boolean flags to filter nodes using job constraints. |
| .IP |
| |
| .TP |
| \fBMemSpecLimit\fR |
| Amount of \fBRealMemory\fR, in megabytes, reserved for system use and not |
| available for user allocations. Must be less than the amount defined for |
| \fBRealMemory\fR. |
| If the task/cgroup plugin is configured and that plugin constrains memory |
| allocations (i.e. the \fItask/cgroup\fR \fBTaskPlugin\fR is enabled and |
| \fBConstrainRAMSpace=yes\fR is set in cgroup.conf), then the slurmd will be |
| allocated the specified memory limit. If cgroup/v1 is used the slurmstepd will |
| also be allocated the specified memory limit. If cgroup/v2 is used, the |
| slurmstepd's consumption is completely dependent on the topology of the job. |
| Note that memory must be configured as a consumable resource through one of |
| the Memory options in \fBSelectTypeParameters\fR for this option to work. |
| The daemons will not be killed if they exhaust the memory allocation |
| (i.e. the Out\-Of\-Memory Killer is disabled for the daemon's memory cgroup). |
| If the task/cgroup plugin is not configured, the specified memory will only be |
| unavailable for user allocations. |
| .IP |
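| For example, to keep two cores and 2048 MB of memory aside for system use on a |
| set of nodes (values are illustrative): |
| .br |
| NodeName=lx[0\-15] CoreSpecCount=2 MemSpecLimit=2048 RealMemory=64000 |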
| |
| .TP |
| \fBPort\fR |
| The port number that the Slurm compute node daemon, \fBslurmd\fR, listens |
| to for work on this particular node. By default there is a single port number |
| for all \fBslurmd\fR daemons on all compute nodes as defined by the |
| \fBSlurmdPort\fR configuration parameter. Use of this option is not generally |
| recommended except for development or testing purposes. If multiple |
| \fBslurmd\fR daemons execute on a node this can specify a range of ports. |
| |
| \fBNOTE\fR: On Cray systems, Realm\-Specific IP Addressing (RSIP) will |
| automatically try to interact with anything opened on ports 8192\-60000. |
| Configure Port to use a port outside of the configured SrunPortRange and |
| RSIP's port range. |
| .IP |
| |
| .TP |
| \fBProcs\fR |
| See \fBCPUs\fR. |
| .IP |
| |
| .TP |
| \fBRealMemory\fR |
| Size of real memory on the node in megabytes (e.g. "2048"). |
| The default value is 1. Lowering RealMemory with the goal of setting |
| aside some amount for the OS, unavailable for job allocations, |
| will not work as intended if Memory is not set as a consumable |
| resource in \fBSelectTypeParameters\fR, so one of the *_Memory |
| options needs to be enabled for that goal to be accomplished. |
| Also see \fBMemSpecLimit\fR. |
| .IP |
| |
| .TP |
| \fBReason\fR |
| Identifies the reason for a node being in state "DOWN", "DRAINED", |
| "DRAINING", "FAIL" or "FAILING". |
| Use quotes to enclose a reason having more than one word. |
| .IP |
| |
| .TP |
| \fBRestrictedCoresPerGPU\fR |
| Number of cores per GPU restricted for only GPU use. If a job does not request a |
| GPU it will not have access to these cores. The node's GPUs must either be |
| autodetected or have valid cores configured in \fBgres.conf\fR(5). |
| |
| \fBNOTE\fR: Configuring multiple GPU types on overlapping sockets can result in |
| erroneous GPU type and restricted core pairings in allocations requesting gpus |
| without specifying a type. |
| .IP |
| |
| .TP |
| \fBSockets\fR |
| Number of physical processor sockets/chips on the node (e.g. "2"). |
| If Sockets is omitted, it will be inferred from |
| \fBCPUs\fR, \fBCoresPerSocket\fR, and \fBThreadsPerCore\fR. |
| \fBNOTE\fR: If you have multi\-core processors, you will likely |
| need to specify these parameters. |
| Sockets and SocketsPerBoard are mutually exclusive. |
| If Sockets is specified when Boards is also used, |
| Sockets is interpreted as SocketsPerBoard rather than total sockets. |
| The default value is 1. |
| .IP |
| |
| .TP |
| \fBSocketsPerBoard\fR |
| Number of physical processor sockets/chips on a baseboard. |
| Sockets and SocketsPerBoard are mutually exclusive. |
| The default value is 1. |
| .IP |
| |
| .TP |
| \fBState\fR |
| State of the node with respect to the initiation of user jobs. |
| Acceptable values are \fICLOUD\fR, \fIDOWN\fR, \fIDRAIN\fR, \fIFAIL\fR, |
| \fIFAILING\fR, \fIFUTURE\fR and \fIUNKNOWN\fR. |
| Node states of \fIBUSY\fR and \fIIDLE\fR should not be specified in the node |
| configuration, but set the node state to \fIUNKNOWN\fR instead. |
| Setting the node state to \fIUNKNOWN\fR will result in the node state being |
| set to \fIBUSY\fR, \fIIDLE\fR or other appropriate state based upon recovered |
| system state information. |
| The default value is \fIUNKNOWN\fR. |
| Also see the \fBDownNodes\fR parameter below. |
| .IP |
| .RS |
| .TP 10 |
| \fBCLOUD\fP |
| Indicates the node exists in the cloud. |
| Its initial state will be treated as powered down. |
| The node will be available for use after its state is recovered from Slurm's |
| state save file or the slurmd daemon starts on the compute node. |
| .IP |
| |
| .TP |
| \fBDOWN\fP |
| Indicates the node failed and is unavailable to be allocated work. |
| .IP |
| |
| .TP |
| \fBDRAIN\fP |
| Indicates the node is unavailable to be allocated work. |
| .IP |
| |
| .TP |
| \fBFAIL\fP |
| Indicates the node is expected to fail soon, has |
| no jobs allocated to it, and will not be allocated |
| to any new jobs. |
| .IP |
| |
| .TP |
| \fBFAILING\fP |
| Indicates the node is expected to fail soon, has |
| one or more jobs allocated to it, but will not be allocated |
| to any new jobs. |
| .IP |
| |
| .TP |
| \fBFUTURE\fP |
| Indicates the node is defined for future use and need not |
| exist when the Slurm daemons are started. These nodes can be made available |
| for use simply by updating the node state using the scontrol command rather |
| than restarting the slurmctld daemon. After these nodes are made available, |
| change their \fBState\fR in the slurm.conf file. Until these nodes are made |
| available, they will not be seen by any Slurm commands, nor will any attempt |
| be made to contact them. FUTURE nodes retain their non\-FUTURE state on |
| restart; use scontrol to put a node back into the FUTURE state. |
| |
| .IP |
| .RS |
| .TP |
| \fBDynamic Future Nodes\fR |
| A \fBslurmd\fR started with \-F[<feature>] will be associated with a FUTURE |
| node that matches the same configuration (sockets, cores, threads) as reported |
| by \fBslurmd\fR \-C. The node's NodeAddr and NodeHostname will automatically be |
| retrieved from the \fBslurmd\fR and will be cleared when set back to the FUTURE |
| state. |
| .RE |
| .IP |
| |
| .TP |
| \fBUNKNOWN\fP |
| Indicates the node's state is undefined but will be established |
| (set to \fIBUSY\fR or \fIIDLE\fR) when the \fBslurmd\fR daemon on that |
| node registers. \fIUNKNOWN\fR is the default state. |
| .RE |
| .IP |
| |
| .TP |
| \fBThreadsPerCore\fR |
| Number of logical threads in a single physical core (e.g. "2"). |
| Note that Slurm can allocate resources to jobs down to the |
| resolution of a core. If your system is configured with more than |
| one thread per core, execution of a different job on each thread |
| is not supported unless you configure \fBSelectTypeParameters=CR_CPU\fR |
| plus \fBCPUs\fR; do not configure \fBSockets\fR, \fBCoresPerSocket\fR or |
| \fBThreadsPerCore\fR. |
| A job can execute one task per thread from within one job step or |
| execute a distinct job step on each of the threads. |
| Note also that if you are running with more than 1 thread per core and using |
| the select/cons_tres plugin, you will want to set the |
| \fBSelectTypeParameters\fR |
| variable to something other than CR_CPU to avoid unexpected results. |
| The default value is 1. |
| .IP |
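| For example, to schedule individual hardware threads on two\-socket, |
| eight\-core, two\-thread nodes as described above (values are illustrative): |
| .br |
| SelectTypeParameters=CR_CPU |
| .br |
| NodeName=ht[0\-7] CPUs=32 RealMemory=64000 |
| .br |
| Note that \fBSockets\fR, \fBCoresPerSocket\fR and \fBThreadsPerCore\fR are |
| deliberately omitted here. |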
| |
| .TP |
| \fBTmpDisk\fR |
| Total size of temporary disk storage in \fBTmpFS\fR in megabytes |
| (e.g. "16384"). \fBTmpFS\fR (for "Temporary File System") |
| identifies the location which jobs should use for temporary storage. |
| Note this does not indicate the amount of free |
| space available to the user on the node, only the total file |
| system size. The system administrator should ensure this file |
| system is purged as needed so that user jobs have access to |
| most of this space. |
| The Prolog and/or Epilog programs (specified in the configuration file) |
| might be used to ensure the file system is kept clean. |
| The default value is 0. |
| .IP |
| |
| .TP |
| \fBTopology\fR |
| Comma-separated list of pairs in the format |
| \fR\fI<topology_name>\fR\fB:\fR\fI<topology_unit>\fR. |
| Where <\fItopology_unit\fR> is the block name or the name of a leaf switch. |
| Intermediate switch names (':' delimited) can be provided and will be |
| created if needed (e.g. Topology=topo-tree:sw_root:s1:s2). |
| This setting overrides the node topology affiliation configuration specified |
| in \fBtopology.conf\fR(5) and \fBtopology.yaml\fR(5). |
| |
| .IP |
| |
| .TP |
| \fBWeight\fR |
| The priority of the node for scheduling purposes. |
| All things being equal, jobs will be allocated the nodes with |
| the lowest weight which satisfies their requirements. |
| For example, a heterogeneous collection of nodes might |
| be placed into a single partition for greater system |
| utilization, responsiveness and capability. It would be |
| preferable to allocate smaller memory nodes rather than larger |
| memory nodes if either will satisfy a job's requirements. |
| The units of weight are arbitrary, but larger weights |
| should be assigned to nodes with more processors, memory, |
| disk space, higher processor speed, etc. |
| Note that if a job allocation request can not be satisfied |
| using the nodes with the lowest weight, the set of nodes |
| with the next lowest weight is added to the set of nodes |
| under consideration for use (repeat as needed for higher |
| weight values). If you absolutely want to minimize the number |
| of higher weight nodes allocated to a job (at a cost of higher |
| scheduling overhead), give each node a distinct \fBWeight\fR |
| value and they will be added to the pool of nodes being |
| considered for scheduling individually. |
| |
| The default value is 1. |
| |
| \fBNOTE\fR: Node weights are first considered among currently available |
| nodes. For example, POWERED_DOWN and POWERING_UP nodes with lower weights will |
| not be evaluated before a powered up IDLE node. |
| .IP |
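| For example, to prefer smaller\-memory nodes when either type would satisfy a |
| job (node names and sizes are illustrative): |
| .br |
| NodeName=small[1\-32] RealMemory=64000 Weight=10 |
| .br |
| NodeName=big[1\-8] RealMemory=512000 Weight=50 |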
| |
| .SH "DOWN NODE CONFIGURATION" |
| The \fBDownNodes=\fR parameter permits you to mark certain nodes as in a |
| \fIDOWN\fR, \fIDRAIN\fR, \fIFAIL\fR, \fIFAILING\fR or \fIFUTURE\fR state |
| without altering the permanent configuration information listed under a |
| \fBNodeName=\fR specification. |
| |
| .TP |
| \fBDownNodes\fR |
| Any node name, or list of node names, from the \fBNodeName=\fR specifications. |
| .IP |
| |
| .TP |
| \fBReason\fR |
| Identifies the reason for a node being in state \fIDOWN\fR, \fIDRAIN\fR, |
| \fIFAIL\fR, \fIFAILING\fR or \fIFUTURE\fR. |
| Use quotes to enclose a reason having more than one word. |
| .IP |
| |
| .TP |
| \fBState\fR |
| State of the node with respect to the initiation of user jobs. |
| Acceptable values are \fIDOWN\fR, \fIDRAIN\fR, \fIFAIL\fR, \fIFAILING\fR |
| and \fIFUTURE\fR. |
| For more information about these states see the descriptions under \fBState\fR |
| in the \fBNodeName=\fR section above. |
| The default value is \fIDOWN\fR. |
| .IP |
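| For example, to mark three nodes as drained with a reason (node names are |
| illustrative): |
| .br |
| DownNodes=lx[10\-12] State=DRAIN Reason="faulty DIMM" |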
| |
| .SH "NODESET CONFIGURATION" |
| The nodeset configuration allows you to define a name for a specific set of |
| nodes which can be used to simplify the partition configuration section, |
| especially for heterogeneous or condo\-style systems. Each nodeset may be defined |
| by an explicit list of nodes, and/or by filtering the nodes by a particular |
| configured feature. If both \fBFeature=\fR and \fBNodes=\fR are used the |
| nodeset shall be the union of the two subsets. |
| Note that the nodesets are only used to simplify the partition definitions |
| at present, and are not usable outside of the partition configuration. |
| |
| .TP |
| \fBFeature\fR |
| All nodes with this feature will be included as part of this nodeset. Only a |
| single feature is allowed. |
| .IP |
| |
| .TP |
| \fBNodes\fR |
| List of nodes in this set. |
| .IP |
| |
| .TP |
| \fBNodeSet\fR |
| Unique name for a set of nodes. Must not overlap with any NodeName definitions. |
| .IP |
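| For example (names and the feature are illustrative): |
| .br |
| NodeSet=gpunodes Feature=gpu |
| .br |
| PartitionName=gpu Nodes=gpunodes Default=NO |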
| |
| .SH "PARTITION CONFIGURATION" |
| The partition configuration permits you to establish different job |
| limits or access controls for various groups (or partitions) of nodes. |
| Nodes may be in more than one partition, making partitions serve |
| as general purpose queues. |
| For example one may put the same set of nodes into two different |
| partitions, each with different constraints (time limit, job sizes, |
| groups allowed to use the partition, etc.). |
| Jobs are allocated resources within a single partition. |
| Default values can be specified with a record in which |
| \fBPartitionName\fR is "DEFAULT". |
| The default entry values will apply only to lines following it in the |
| configuration file and the default values can be reset multiple times |
| in the configuration file with multiple entries where "PartitionName=DEFAULT". |
| The "PartitionName=" specification must be placed on every line |
| describing the configuration of partitions. |
| Each line where \fBPartitionName\fR is "DEFAULT" will replace or add to previous |
| default values and not reinitialize the default values. |
| A single partition name can not appear as a PartitionName value in more than |
| one line (duplicate partition name records will be ignored). |
| If a partition that is in use is deleted from the configuration and slurm |
| is restarted or reconfigured (scontrol reconfigure), jobs using the partition |
| are canceled. |
| \fBNOTE\fR: Put all parameters for each partition on a single line. |
| Each line of partition configuration information should |
| represent a different partition. |
| The partition configuration file contains the following information: |
| |
| .TP |
| \fBAllocNodes\fR |
| Comma\-separated list of nodes from which users can submit jobs in the |
| partition. |
| Node names may be specified using the node range expression syntax |
| described above. |
| The default value is "ALL". |
| .IP |
| |
| .TP |
| \fBAllowAccounts\fR |
| Comma\-separated list of accounts which may execute jobs in the partition. |
| The default value is "ALL". This list is hierarchical, meaning subaccounts |
| are included automatically. |
| \fBNOTE\fR: If AllowAccounts is used then DenyAccounts will not be enforced. |
| Also refer to DenyAccounts. |
| .IP |
| |
| .TP |
| \fBAllowGroups\fR |
| Comma\-separated list of group names which may execute jobs in this |
| partition. |
| A user will be permitted to submit a job to this partition if |
| AllowGroups has \fBat least one\fR group associated with the user. |
| Jobs executed as user root or as user SlurmUser will be allowed to |
| use any partition, regardless of the value of AllowGroups. In addition, a Slurm |
| Admin or Operator will be able to view any partition, regardless of the value |
| of AllowGroups. |
| If user root attempts to execute a job as another user (e.g. using |
| srun's \-\-uid option), then the job will be subject to AllowGroups as if it |
| were submitted by that user. |
| By default, AllowGroups is unset, meaning all groups are allowed to use this |
| partition. The special value 'ALL' is equivalent to this. |
| Users who are not members of the specified group will not see information |
| about this partition by default. However, this should not be treated as a |
| security mechanism, since job information will be returned if a user requests |
| details about the partition or a specific job. See the \fBPrivateData\fR |
| parameter to restrict access to job information. |
| \fBNOTE\fR: For performance reasons, Slurm maintains a list of user IDs |
| allowed to use each partition and this is checked at job submission time. |
| This list of user IDs is updated when the \fBslurmctld\fR daemon is restarted, |
| reconfigured (e.g. "scontrol reconfig") or the partition's \fBAllowGroups\fR |
| value is reset, even if its value is unchanged |
| (e.g. "scontrol update PartitionName=name AllowGroups=group"). |
| For a user's access to a partition to change, both their group membership must |
| change and Slurm's internal user ID list must be updated using one of the |
| methods described above. |
| .IP |
| |
| .TP |
| \fBAllowQos\fR |
| Comma\-separated list of Qos which may execute jobs in the partition. |
| Jobs executed as user root can use any partition without regard to |
| the value of AllowQos. |
| The default value is "ALL". |
| \fBNOTE\fR: If AllowQos is used then DenyQos will not be enforced. |
| Also refer to DenyQos. |
| .IP |
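| For example, a partition restricted to one group and two QOS (names are |
| illustrative): |
| .br |
| PartitionName=restricted Nodes=lx[0\-7] AllowGroups=research AllowQos=high,normal |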
| |
| .TP |
| \fBAlternate\fR |
| Partition name of alternate partition to be used if the state of this partition |
| is "DRAIN" or "INACTIVE." |
| .IP |
| |
| .TP |
| \fBCpuBind\fR |
| If a job step request does not specify an option to control how tasks are bound |
| to allocated CPUs (by using \-\-cpu\-bind) and all nodes allocated to the job |
| do not have the same \fBCpuBind\fR option for the node, then the partition's |
| \fBCpuBind\fR option will control how tasks are bound to allocated resources. |
| The \fBTaskPluginParam\fR will be used as a last resort, with the default being |
| no binding. Supported values for CpuBind are \fBnone\fR, \fBsocket\fR, |
| \fBldom\fR (NUMA), \fBcore\fR and \fBthread\fR. |
| .IP |
| |
| .TP |
| \fBDefault\fR |
| If this keyword is set, jobs submitted without a partition |
| specification will utilize this partition. |
| Possible values are "YES" and "NO". |
| The default value is "NO". |
| .IP |
| |
| .TP |
| \fBDefaultTime\fR |
| Run time limit used for jobs that don't specify a value. If not set |
| then MaxTime will be used. |
| Format is the same as for MaxTime. |
| .IP |
| |
| .TP |
| \fBDefCpuPerGPU\fR |
| Default count of CPUs allocated per allocated GPU. This value is used only if |
the job specified neither \-\-cpus\-per\-task nor \-\-cpus\-per\-gpu.
| .IP |
| |
| .TP |
| \fBDefMemPerCPU\fR |
| Default real memory size available per allocated CPU in megabytes. |
| Used to avoid over\-subscribing memory and causing paging. |
| \fBDefMemPerCPU\fR would generally be used if individual processors |
| are allocated to jobs (\fBSelectType=select/cons_tres\fR). |
| If not set, the \fBDefMemPerCPU\fR value for the entire cluster will be used. |
| Also see \fBDefMemPerGPU\fR, \fBDefMemPerNode\fR and \fBMaxMemPerCPU\fR. |
| \fBDefMemPerCPU\fR, \fBDefMemPerGPU\fR and \fBDefMemPerNode\fR are mutually |
| exclusive. |
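
As an illustration, a partition might default jobs to 2048 MB per allocated
CPU while capping them at 4096 MB per CPU (names and values are examples
only):
.nf
# Illustrative values
PartitionName=batch Nodes=dev[9\-17] DefMemPerCPU=2048 MaxMemPerCPU=4096
.fi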
| .IP |
| |
| .TP |
| \fBDefMemPerGPU\fR |
| Default real memory size available per allocated GPU in megabytes. |
| Please note a best effort attempt is made to predict which GPUs on the system |
| will be used, but this could change between job submission and start time, |
| causing \fBMaxMemPerNode\fR to potentially not work as expected for |
| heterogeneous jobs. |
| Also see \fBDefMemPerCPU\fR, \fBDefMemPerNode\fR and \fBMaxMemPerCPU\fR. |
| \fBDefMemPerCPU\fR, \fBDefMemPerGPU\fR and \fBDefMemPerNode\fR are mutually |
| exclusive. |
| .IP |
| |
| .TP |
| \fBDefMemPerNode\fR |
| Default real memory size available per allocated node in megabytes. |
| Used to avoid over\-subscribing memory and causing paging. |
| \fBDefMemPerNode\fR would generally be used if whole nodes |
| are allocated to jobs (\fBSelectType=select/linear\fR) and |
| resources are over\-subscribed (\fBOverSubscribe=yes\fR or |
| \fBOverSubscribe=force\fR). |
| If not set, the \fBDefMemPerNode\fR value for the entire cluster will be used. |
| Also see \fBDefMemPerCPU\fR, \fBDefMemPerGPU\fR and \fBMaxMemPerCPU\fR. |
| \fBDefMemPerCPU\fR, \fBDefMemPerGPU\fR and \fBDefMemPerNode\fR are mutually |
| exclusive. |
| .IP |
| |
| .TP |
| \fBDenyAccounts\fR |
| Comma\-separated list of accounts which may not execute jobs in the partition. |
| By default, no accounts are denied access. This list is hierarchical, |
| meaning subaccounts are included automatically. |
| \fBNOTE\fR: If AllowAccounts is used then DenyAccounts will not be enforced. |
| Also refer to AllowAccounts. |
| .IP |
| |
| .TP |
| \fBDenyQos\fR |
| Comma\-separated list of Qos which may not execute jobs in the partition. |
By default, no QOS are denied access.
| \fBNOTE\fR: If AllowQos is used then DenyQos will not be enforced. |
Also refer to AllowQos.
| .IP |
| |
| .TP |
| \fBDisableRootJobs\fR |
| If set to "YES" then user root will be prevented from running any jobs |
| on this partition. |
| The default value will be the value of \fBDisableRootJobs\fR set |
| outside of a partition specification (which is "NO", allowing user |
| root to execute jobs). |
| .IP |
| |
| .TP |
| \fBExclusiveTopo\fR |
| If set to "YES," then only one job may be run on a single topology segment. |
| This capability is also available on a per\-job basis by using the |
| \fB\-\-exclusive=topo\fR option. |
| .IP |
| |
| .TP |
| \fBExclusiveUser\fR |
| If set to "YES" then nodes will be exclusively allocated to users. |
| Multiple jobs may be run for the same user, but only one user can be active |
| at a time. |
| This capability is also available on a per\-job basis by using the |
| \fB\-\-exclusive=user\fR option. |
| .IP |
| |
| .TP |
| \fBGraceTime\fR |
| Specifies, in units of seconds, the preemption grace time |
| to be extended to a job which has been selected for preemption. |
| This parameter only takes effect when \fBPreemptType=partition_prio\fR. |
The default value is zero, meaning no preemption grace time is allowed on
| this partition. |
| Once a job has been selected for preemption, its end time is set to the current |
| time plus GraceTime. The job's tasks are immediately sent SIGCONT and SIGTERM |
| signals in order to provide notification of its imminent termination. |
| This is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence upon |
| reaching its new end time. This second set of signals is sent to both the |
| tasks \fBand\fR the containing batch script, if applicable. |
| See also the global \fBKillWait\fR configuration parameter. |
| .br |
| \fBNOTE\fR: This parameter does not apply to \fBPreemptMode=SUSPEND\fR. |
| For setting the preemption grace time when using \fBPreemptMode=SUSPEND\fR, |
| see \fBPreemptParameters=suspend_grace_time\fR. |
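
An illustrative configuration granting preempted jobs two minutes to exit;
the partition names and values are examples only and assume
\fBPreemptType=preempt/partition_prio\fR as noted above:
.nf
# Illustrative values
PreemptType=preempt/partition_prio
PartitionName=low Nodes=dev[0\-8] PriorityTier=1 PreemptMode=CANCEL GraceTime=120
PartitionName=high Nodes=dev[0\-8] PriorityTier=10
.fi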
| .IP |
| |
| .TP |
| \fBHidden\fR |
| Specifies if the partition and its jobs are to be hidden by default. |
| Hidden partitions will by default not be reported by the Slurm APIs or commands. |
| Possible values are "YES" and "NO". |
| The default value is "NO". |
| Note that partitions that a user lacks access to by virtue of the |
| \fBAllowGroups\fR parameter will also be hidden by default. |
| .IP |
| |
| .TP |
| \fBLLN\fR |
| Schedule resources to jobs on the least loaded nodes (based upon the number |
| of idle CPUs). This is generally only recommended for an environment with |
| serial jobs as idle resources will tend to be highly fragmented, resulting |
| in parallel jobs being distributed across many nodes. |
| Note that node \fBWeight\fR takes precedence over how many idle resources are |
| on each node. |
| Also see the \fBSelectTypeParameters\fR configuration parameter \fBCR_LLN\fR to |
| use the least loaded nodes in every partition. |
| .IP |
| |
| .TP |
| \fBMaxCPUsPerNode\fR |
| Maximum number of CPUs on any node available to all jobs from this partition. |
| This can be especially useful to schedule GPUs. For example a node can be |
| associated with two Slurm partitions (e.g. "cpu" and "gpu") and the |
| partition/queue "cpu" could be limited to only a subset of the node's CPUs, |
| ensuring that one or more CPUs would be available to jobs in the "gpu" |
| partition/queue. |
| Also see \fBMaxCPUsPerSocket\fR. |
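
For example, on nodes shared by a "cpu" and a "gpu" partition, some CPUs can
be held back for GPU jobs (node names and counts are illustrative):
.nf
# Illustrative values: 4 of the 16 CPUs remain for the "gpu" partition
NodeName=dev[0\-3] CPUs=16 Gres=gpu:2
PartitionName=cpu Nodes=dev[0\-3] MaxCPUsPerNode=12
PartitionName=gpu Nodes=dev[0\-3]
.fi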
| .IP |
| |
| .TP |
| \fBMaxCPUsPerSocket\fR |
Maximum number of CPUs on any node available to all jobs from this
partition on any single socket. This can be especially useful to schedule GPUs.
| Also see \fBMaxCPUsPerNode\fR. |
| .IP |
| |
| .TP |
| \fBMaxMemPerCPU\fR |
| Maximum real memory size available per allocated CPU in megabytes. |
| Used to avoid over\-subscribing memory and causing paging. |
| \fBMaxMemPerCPU\fR would generally be used if individual processors |
| are allocated to jobs (\fBSelectType=select/cons_tres\fR). |
| If not set, the \fBMaxMemPerCPU\fR value for the entire cluster will be used. |
| Also see \fBDefMemPerCPU\fR and \fBMaxMemPerNode\fR. |
| \fBMaxMemPerCPU\fR and \fBMaxMemPerNode\fR are mutually exclusive. |
| .IP |
| |
| .TP |
| \fBMaxMemPerNode\fR |
| Maximum real memory size available per allocated node in a job allocation in |
| megabytes. Used to avoid over\-subscribing memory and causing paging. |
| \fBMaxMemPerNode\fR would generally be used if whole nodes |
| are allocated to jobs (\fBSelectType=select/linear\fR) and |
| resources are over\-subscribed (\fBOverSubscribe=yes\fR or |
| \fBOverSubscribe=force\fR). |
| If not set, the \fBMaxMemPerNode\fR value for the entire cluster will be used. |
| Also see \fBDefMemPerNode\fR and \fBMaxMemPerCPU\fR. |
| \fBMaxMemPerCPU\fR and \fBMaxMemPerNode\fR are mutually exclusive. |
| .IP |
| |
| .TP |
| \fBMaxNodes\fR |
| Maximum count of nodes which may be allocated to any single job. |
| The default value is "UNLIMITED", which is represented internally as \-1. |
| .IP |
| |
| .TP |
| \fBMaxTime\fR |
| Maximum run time limit for jobs. |
| Format is minutes, minutes:seconds, hours:minutes:seconds, |
| days\-hours, days\-hours:minutes, days\-hours:minutes:seconds or |
| "UNLIMITED". |
| Time resolution is one minute and second values are rounded up to |
| the next minute. |
| The job TimeLimit may be updated by root, SlurmUser or an Operator to a |
| value higher than the configured MaxTime after job submission. |
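
For example, using the days\-hours:minutes:seconds format (values are
illustrative):
.nf
# Illustrative values: 30 minute default, 2 day maximum
PartitionName=batch Nodes=dev[9\-17] DefaultTime=30 MaxTime=2\-00:00:00
.fi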
| .IP |
| |
| .TP |
| \fBMinNodes\fR |
| Minimum count of nodes which may be allocated to any single job. |
| The default value is 0. |
| .IP |
| |
| .TP |
| \fBNodes\fR |
| Comma\-separated list of nodes or nodesets which are associated with this |
| partition. |
| Node names may be specified using the node range expression syntax |
| described above. A blank list of nodes |
| (i.e. Nodes="") can be used if one wants a partition to exist, |
| but have no resources (possibly on a temporary basis). |
| A value of "ALL" is mapped to all nodes configured in the cluster. |
| .IP |
| |
| .TP |
| \fBOverSubscribe\fR |
| Controls the ability of the partition to execute more than one job at a |
| time on each resource (node, socket or core depending upon the value |
| of \fBSelectTypeParameters\fR). |
| If resources are to be over\-subscribed, avoiding memory over\-subscription |
| is very important. |
| \fBSelectTypeParameters\fR should be configured to treat |
| memory as a consumable resource and the \fB\-\-mem\fR option |
| should be used for job allocations. |
| Sharing of resources is typically useful only when using gang scheduling |
| (\fBPreemptMode=suspend,gang\fR). |
| Possible values for \fBOverSubscribe\fR are "EXCLUSIVE", "FORCE", "YES", and "NO". |
| Note that a value of "YES" or "FORCE" can negatively impact performance |
| for systems with many thousands of running jobs. |
| The default value is "NO". |
| For more information see the following web pages: |
| .br |
| .na |
| \fIhttps://slurm.schedmd.com/cons_tres.html\fR |
| .br |
| \fIhttps://slurm.schedmd.com/cons_tres_share.html\fR |
| .br |
| \fIhttps://slurm.schedmd.com/gang_scheduling.html\fR |
| .br |
| \fIhttps://slurm.schedmd.com/preempt.html\fR |
| .ad |
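
A sketch of the gang\-scheduling case described above, with memory treated as
a consumable resource and each core time\-sliced between at most two jobs
(partition and node names are illustrative):
.nf
# Illustrative values
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
PreemptMode=suspend,gang
PartitionName=shared Nodes=dev[0\-8] OverSubscribe=FORCE:2
.fi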
| .IP |
| .RS |
| .TP 12 |
| \fBEXCLUSIVE\fR |
| Allocates entire nodes to jobs even with \fBSelectType=select/cons_tres\fR |
| configured. |
| Jobs that run in partitions with \fBOverSubscribe=EXCLUSIVE\fR will have |
| exclusive access to all allocated nodes. |
| These jobs are allocated all CPUs and GRES on the nodes, but they are only |
| allocated as much memory as they ask for. This is by design to support gang |
| scheduling, because suspended jobs still reside in memory. To request all the |
| memory on a node, use \fB\-\-mem=0\fR at submit time. |
| .IP |
| |
| .TP |
| \fBFORCE\fR |
| Makes all resources (except GRES) in the partition available for |
| oversubscription without any means for users to disable it. |
| May be followed with a colon and maximum number of jobs in |
| running or suspended state. |
| For example \fBOverSubscribe=FORCE:4\fR enables each node, socket or |
| core to oversubscribe each resource four ways. |
| Recommended only for systems using \fBPreemptMode=suspend,gang\fR. |
| |
| \fBNOTE\fR: \fBOverSubscribe=FORCE:1\fR is a special case that is not exactly |
| equivalent to \fBOverSubscribe=NO\fR. \fBOverSubscribe=FORCE:1\fR disables |
| the regular oversubscription of resources in the same partition but it will |
| still allow oversubscription due to preemption or on overlapping partitions |
| with the same PriorityTier. Setting \fBOverSubscribe=NO\fR |
| will prevent oversubscription from happening in all cases. |
| |
| \fBNOTE\fR: If using \fBPreemptType=preempt/qos\fR you can specify a value for |
| \fBFORCE\fR that is greater than 1. For example, \fBOverSubscribe=FORCE:2\fR |
| will permit two jobs per resource normally, but a third job can be started |
| only if done so through preemption based upon QOS. |
| |
| \fBNOTE\fR: If \fBOverSubscribe\fR is configured to \fBFORCE\fR or \fBYES\fR |
| in your slurm.conf and the system is not configured to use preemption |
| (\fBPreemptMode=OFF\fR) accounting can easily grow to values greater than |
| the actual utilization. It may be common on such systems to get error messages |
| in the slurmdbd log stating: "We have more allocated time than is possible." |
| .IP |
| |
| .TP |
| \fBYES\fR |
| Makes all resources (except GRES) in the partition available for sharing upon |
| request by the job. |
| Resources will only be over\-subscribed when explicitly requested |
| by the user using the "\-\-oversubscribe" option on job submission. |
| May be followed with a colon and maximum number of jobs in |
| running or suspended state. |
| For example "OverSubscribe=YES:4" enables each node, socket or |
| core to execute up to four jobs at once. |
| Recommended only for systems running with gang scheduling |
| (\fBPreemptMode=suspend,gang\fR). |
| .IP |
| |
| .TP |
| \fBNO\fR |
| Selected resources are allocated to a single job. No resource will be |
| allocated to more than one job. |
| |
| \fBNOTE\fR: Even if you are using \fBPreemptMode=suspend,gang\fR, setting |
| \fBOverSubscribe=NO\fR will disable preemption on that partition. Use |
| \fBOverSubscribe=FORCE:1\fR if you want to disable normal oversubscription |
| but still allow suspension due to preemption. |
| .RE |
| .IP |
| |
| .TP |
| \fBOverTimeLimit\fR |
| Number of minutes by which a job can exceed its time limit before |
| being canceled. |
| Normally a job's time limit is treated as a \fIhard\fR limit and the job will be |
| killed upon reaching that limit. |
| Configuring \fBOverTimeLimit\fR will result in the job's time limit being |
| treated like a \fIsoft\fR limit. |
| Adding the \fBOverTimeLimit\fR value to the \fIsoft\fR time limit provides a |
| \fIhard\fR time limit, at which point the job is canceled. |
This is particularly useful for backfill scheduling, which bases its decisions upon
| each job's soft time limit. |
| If not set, the \fBOverTimeLimit\fR value for the entire cluster will be used. |
| May not exceed 65533 minutes. |
| A value of "UNLIMITED" is also supported. |
| .IP |
| |
| .TP |
| \fBPartitionName\fR |
| Name by which the partition may be referenced (e.g. "Interactive"). |
| This name can be specified by users when submitting jobs. |
| If the \fBPartitionName\fR is "DEFAULT", the values specified |
| with that record will apply to subsequent partition specifications |
| unless explicitly set to other values in that partition record or |
| replaced with a different set of default values. |
| Each line where \fBPartitionName\fR is "DEFAULT" will replace or add to previous |
| default values and not reinitialize the default values. |
| .IP |
| |
| .TP |
| \fBPowerDownOnIdle\fR |
| If set to "YES" and power saving is enabled for the partition, then nodes |
| allocated from this partition will be requested to power down after being |
| allocated at least one job. |
| These nodes will not power down until they transition from COMPLETING to IDLE. |
| If set to "NO" then power saving will operate as configured for the partition. |
| The default value is "NO". |
| See <https://slurm.schedmd.com/power_save.html> and |
| <https://slurm.schedmd.com/elastic_computing.html> for more details. |
| |
| \fBNOTE\fR: |
| The following will cause a transition from COMPLETING to IDLE: |
| .br |
| Completing all running jobs without additional jobs being allocated. |
| .br |
| ExclusiveUser=YES and after all running jobs complete but before another user's |
| job is allocated. |
| .br |
| OverSubscribe=EXCLUSIVE and after the running job completes but before another |
| job is allocated. |
| |
| \fBNOTE\fR: |
| Nodes are still subject to powering down when being IDLE for \fBSuspendTime\fR |
when PowerDownOnIdle is set to NO.
| |
| Also see \fBSuspendTime\fR. |
| .IP |
| |
| .TP |
| \fBPreemptMode\fR |
| Mechanism used to preempt jobs or enable gang scheduling for this |
| partition when \fBPreemptType=preempt/partition_prio\fR is configured. |
| This partition\-specific \fBPreemptMode\fR configuration parameter will |
| override the cluster\-wide \fBPreemptMode\fR for this partition. |
| It can be set to OFF to disable preemption and gang scheduling for this |
| partition. |
| See also \fBPriorityTier\fR and the above description of the cluster\-wide |
| \fBPreemptMode\fR parameter for further details. |
| .br |
| The \fBGANG\fR option is used to enable gang scheduling independent of |
| whether preemption is enabled (i.e. independent of the \fBPreemptType\fR |
| setting). It can be specified in addition to a \fBPreemptMode\fR setting with |
| the two options comma separated (e.g. \fBPreemptMode=SUSPEND,GANG\fR). |
| .br |
| See <https://slurm.schedmd.com/preempt.html> and |
| <https://slurm.schedmd.com/gang_scheduling.html> for more details. |
| |
| \fBNOTE\fR: |
| For performance reasons, the backfill scheduler reserves whole nodes for jobs, |
| not partial nodes. If during backfill scheduling a job preempts one or more |
| other jobs, the whole nodes for those preempted jobs are reserved for the |
| preemptor job, even if the preemptor job requested fewer resources than that. |
| These reserved nodes aren't available to other jobs during that backfill |
| cycle, even if the other jobs could fit on the nodes. Therefore, jobs may |
| preempt more resources during a single backfill iteration than they requested. |
| .br |
| \fBNOTE\fR: |
For a heterogeneous job to be considered for preemption all components
| must be eligible for preemption. When a heterogeneous job is to be preempted |
| the first identified component of the job with the highest order PreemptMode |
| (\fBSUSPEND\fR (highest), \fBREQUEUE\fR, \fBCANCEL\fR (lowest)) will be |
| used to set the PreemptMode for all components. The \fBGraceTime\fR and user |
| warning signal for each component of the heterogeneous job remain unique. |
| Heterogeneous jobs are excluded from GANG scheduling operations. |
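
An illustrative two\-partition setup in which jobs from the higher
PriorityTier partition suspend jobs in the lower one (all names are examples):
.nf
# Illustrative values
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
PartitionName=high Nodes=dev[0\-8] PriorityTier=10
PartitionName=low Nodes=dev[0\-8] PriorityTier=1
.fi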
| .IP |
| .RS |
| .TP 12 |
| \fBOFF\fR |
| Disables job preemption and gang scheduling. |
| .IP |
| |
| .TP |
| \fBCANCEL\fR |
| The preempted job will be cancelled. |
| .IP |
| |
| .TP |
| \fBGANG\fR |
| Enables gang scheduling (time slicing) of jobs in the same partition, and |
| allows the resuming of suspended jobs. |
| |
| \fBNOTE\fR: |
| Gang scheduling is performed independently for each partition, so |
| if you only want time\-slicing by \fBOverSubscribe\fR, without any preemption, |
| then configuring partitions with overlapping nodes is not recommended. |
| On the other hand, if you want to use \fBPreemptType=preempt/partition_prio\fR |
| to allow jobs from higher PriorityTier partitions to Suspend jobs from lower |
| PriorityTier partitions you will need overlapping partitions, and |
| \fBPreemptMode=SUSPEND,GANG\fR to use the Gang scheduler to resume the suspended |
job(s).
| In any case, time\-slicing won't happen between jobs on different partitions. |
| .br |
| \fBNOTE\fR: |
| Heterogeneous jobs are excluded from GANG scheduling operations. |
| .IP |
| |
| .TP |
| \fBREQUEUE\fR |
| Preempts jobs by requeuing them (if possible) or canceling them. |
| For jobs to be requeued they must have the \-\-requeue sbatch option set |
| or the cluster wide JobRequeue parameter in slurm.conf must be set to \fB1\fR. |
| .IP |
| |
| .TP |
| \fBSUSPEND\fR |
| The preempted jobs will be suspended, and later the Gang scheduler will resume |
| them. Therefore the \fBSUSPEND\fR preemption mode always needs the \fBGANG\fR |
| option to be specified at the cluster level. Also, because the suspended jobs |
| will still use memory on the allocated nodes, Slurm needs to be able to track |
| memory resources to be able to suspend jobs. |
| |
| If the preemptees and preemptor are on different partitions then the preempted |
| jobs will remain suspended until the preemptor ends. |
| .br |
| \fBNOTE\fR: Because gang scheduling is performed independently for each |
| partition, if using \fBPreemptType=preempt/partition_prio\fR then jobs in |
| higher PriorityTier partitions will suspend jobs in lower PriorityTier |
partitions to run on the released resources. Only when the preemptor job ends
will the suspended jobs be resumed by the Gang scheduler.
| .br |
| \fBNOTE\fR: Suspended jobs will not release GRES. Higher priority jobs will not |
| be able to preempt to gain access to GRES. |
| .RE |
| .IP |
| |
| .TP |
| \fBPriorityJobFactor\fR |
| Partition factor used by priority/multifactor plugin in calculating job priority. |
| Defaults to 1. A value of 0 prevents this partition from adding to the job's |
| priority. The value may not exceed 65533. |
| Also see PriorityTier. |
| .IP |
| |
| .TP |
| \fBPriorityTier\fR |
| Jobs submitted to a partition with a higher \fBPriorityTier\fR value will be |
| evaluated by the scheduler before pending jobs in a partition with a lower |
| \fBPriorityTier\fR value. They will also be considered for preemption of running |
| jobs in partition(s) with lower \fBPriorityTier\fR values if |
| \fIPreemptType=preempt/partition_prio\fR. |
| The value may not exceed 65533. |
| Also see PriorityJobFactor. |
| .IP |
| |
| .TP |
| \fBQOS\fR |
| Used to extend the limits available to a QOS on a partition. Jobs will not be |
| associated to this QOS outside of being associated to the partition. They |
| will still be associated to their requested QOS. |
| By default, no QOS is used. |
| Additional details are in the QOS documentation at |
| <https://slurm.schedmd.com/qos.html>, including special conditions |
| when a relative QOS is used for this parameter. |
| \fBNOTE\fR: If a limit is set in both the Partition's QOS and the Job's QOS, |
| the Partition QOS limit will be honored unless the Job's QOS has the |
| \fBOverPartQOS\fR flag set, in which case the Job's QOS limit will take |
| precedence. |
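
For example, to attach a partition QOS (the QOS name is hypothetical and must
already exist in the accounting database):
.nf
# Illustrative values
PartitionName=batch Nodes=dev[9\-17] QOS=part_batch
.fi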
| .IP |
| |
| .TP |
| \fBReqResv\fR |
| Specifies users of this partition are required to designate a reservation |
| when submitting a job. This option can be useful in restricting usage |
| of a partition that may have higher priority or additional resources to be |
| allowed only within a reservation. |
| Possible values are "YES" and "NO". |
| The default value is "NO". |
| .IP |
| |
| .TP |
| \fBResumeTimeout\fR |
| Maximum time permitted (in seconds) between when a node resume request |
| is issued and when the node is actually available for use. |
| Nodes which fail to respond in this time frame will be marked DOWN and |
| the jobs scheduled on the node requeued if possible. |
| Nodes which reboot after this time frame will be marked DOWN with a reason of |
| "Node unexpectedly rebooted." |
| For nodes that are in multiple partitions with this option set, |
| the highest time will take effect. If not set on any partition, the node will |
| use the \fBResumeTimeout\fR value set for the entire cluster. The maximum value |
| is either 65533 or INFINITE. |
| .IP |
| |
| .TP |
| \fBRootOnly\fR |
| Specifies if only user ID zero (i.e. user \fIroot\fR) may allocate resources |
| in this partition. User root may allocate resources for any other user, |
| but the request must be initiated by user root. |
| This option can be useful for a partition to be managed by some |
| external entity (e.g. a higher\-level job manager) and prevents |
| users from directly using those resources. |
| Possible values are "YES" and "NO". |
| The default value is "NO". |
| .IP |
| |
| .TP |
| \fBSelectTypeParameters\fR |
| Partition\-specific resource allocation type. |
| This option replaces the global \fBSelectTypeParameters\fR value. |
| Supported values are \fBCR_Core\fR, \fBCR_Core_Memory\fR, \fBCR_Socket\fR and |
| \fBCR_Socket_Memory\fR. |
| Use requires the system\-wide \fBSelectTypeParameters\fR value be set to |
| any of the four supported values previously listed; otherwise, the |
| partition\-specific value will be ignored. |
| .IP |
| |
| .TP |
| \fBShared\fR |
| The \fBShared\fR configuration parameter has been replaced by the |
| \fBOverSubscribe\fR parameter described above. |
| .IP |
| |
| .TP |
| \fBState\fR |
| State of partition or availability for use. Possible values |
| are "UP", "DOWN", "DRAIN" and "INACTIVE". The default value is "UP". |
| See also the related "Alternate" keyword. |
| .IP |
| .RS |
| .TP 10 |
| \fBUP\fP |
| Designates that new jobs may be queued on the partition, and that |
| jobs may be allocated nodes and run from the partition. |
| .IP |
| |
| .TP |
| \fBDOWN\fP |
| Designates that new jobs may be queued on the partition, but |
| queued jobs may not be allocated nodes and run from the partition. Jobs |
| already running on the partition continue to run. The jobs |
| must be explicitly canceled to force their termination. |
| .IP |
| |
| .TP |
| \fBDRAIN\fP |
| Designates that no new jobs may be queued on the partition (job |
| submission requests will be denied with an error message), but jobs |
| already queued on the partition may be allocated nodes and run. |
| See also the "Alternate" partition specification. |
| .IP |
| |
| .TP |
| \fBINACTIVE\fP |
| Designates that no new jobs may be queued on the partition, |
| and jobs already queued may not be allocated nodes and run. |
| See also the "Alternate" partition specification. |
| .RE |
| .IP |
| |
| .TP |
| \fBSuspendTime\fR |
| Nodes which remain idle or down for this number of seconds will be placed into |
| power save mode by \fBSuspendProgram\fR. |
| For nodes that are in multiple partitions with this option set, |
| the highest time will take effect. If not set on any partition, the node will |
| use the \fBSuspendTime\fR value set for the entire cluster. |
| Setting \fBSuspendTime\fR to INFINITE will disable suspending of nodes in this |
| partition. |
| Setting \fBSuspendTime\fR to anything but INFINITE (or \-1) will enable power |
| save mode. |
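
An illustrative configuration that disables power saving for one partition's
nodes while letting the other partition's nodes suspend after ten idle
minutes (names and values are examples):
.nf
# Illustrative values
PartitionName=debug Nodes=dev[0\-8] SuspendTime=INFINITE
PartitionName=batch Nodes=dev[9\-17] SuspendTime=600
.fi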
| .IP |
| |
| .TP |
| \fBSuspendTimeout\fR |
| Maximum time permitted (in seconds) between when a node suspend request |
is issued and when the node is shut down.
| At that time the node must be ready for a resume request to be issued |
| as needed for new work. |
| For nodes that are in multiple partitions with this option set, |
| the highest time will take effect. If not set on any partition, the node will |
| use the \fBSuspendTimeout\fR value set for the entire cluster. |
| .IP |
| |
| .TP |
| \fBTopology\fR |
| Name of the topology, defined in \fItopology.yaml\fR, used by jobs in this |
| partition. |
| .IP |
| |
| .TP |
| \fBTRESBillingWeights\fR |
| TRESBillingWeights is used to define the billing weights of each tracked TRES |
| type (see \fBAccountingStorageTRES\fR) that |
| will be used in calculating the usage of a job. The calculated usage is used |
| when calculating fairshare and when enforcing the TRES billing limit on jobs. |
| |
| Billing weights are specified as a comma\-separated list of |
| \fI<TRES Type>\fR=\fI<TRES Billing Weight>\fR pairs. |
| |
| Any TRES Type is available for billing. Note that the base unit for memory and |
| burst buffers is megabytes. |
| |
| By default the billing of TRES is calculated as the sum of all TRES types |
| multiplied by their corresponding billing weight. |
| |
| The weighted amount of a resource can be adjusted by adding a suffix of K,M,G,T |
| or P after the billing weight. For example, a memory weight of "mem=.25" on a |
| job allocated 8GB will be billed 2048 (8192MB *.25) units. A memory weight of |
| "mem=.25G" on the same job will be billed 2 (8192MB * (.25/1024)) units. |
| |
| Negative values are allowed. |
| |
| When a job is allocated 1 CPU and 8 GB of memory on a partition configured with |
| TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0,license/licA=1.5", the |
| billable TRES will be: (1*1.0) + (8*0.25) + (0*2.0) + (0*1.5) = 3.0. |
| |
| If PriorityFlags=MAX_TRES is configured, the billable TRES is calculated as the |
| MAX of individual TRESs on a node (e.g. cpus, mem, gres) plus the sum of all |
| global TRESs (e.g. licenses). Using the same example above the billable TRES |
| will be MAX(1*1.0, 8*0.25, 0*2.0) + (0*1.5) = 2.0. |
| |
| If TRESBillingWeights is not defined then the job is billed against the total |
| number of allocated CPUs. |
| |
| \fBNOTE\fR: TRESBillingWeights doesn't affect job priority directly as it is |
| currently not used for the size of the job. If you want TRESs to play a role in |
| the job's priority then refer to the PriorityWeightTRES option. |
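
The configuration line corresponding to the example above would look like the
following (the partition and node names are illustrative):
.nf
# Illustrative values
PartitionName=batch Nodes=dev[9\-17] TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0,license/licA=1.5"
.fi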
| .RE |
| .IP |
| |
| .SH "PROLOG AND EPILOG SCRIPTS" |
| There are a variety of prolog and epilog program options that |
| execute with various permissions and at various times. |
| The four options most likely to be used are: |
| \fBProlog\fR and \fBEpilog\fR (executed once on each compute node |
| for each job) plus \fBPrologSlurmctld\fR and \fBEpilogSlurmctld\fR |
| (executed once on the \fBControlMachine\fR for each job). |
| |
| \fBNOTE\fR: Standard output and error messages are normally not preserved. |
| Explicitly write output and error messages to an appropriate location |
| if you wish to preserve that information. |
| |
| \fBNOTE\fR: By default the Prolog script is ONLY run on any individual |
| node when it first sees a job step from a new allocation. It does not |
| run the Prolog immediately when an allocation is granted. If no job steps |
| from an allocation are run on a node, it will never run the Prolog for that |
| allocation. This Prolog behavior can be changed by the |
| \fBPrologFlags\fR parameter. The Epilog, on the other hand, always |
| runs on every node of an allocation when the allocation is released. |
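
For example, to run the Prolog on every allocated node as soon as the
allocation is granted, the \fBAlloc\fR flag can be added to \fBPrologFlags\fR
(a sketch; the Prolog path is illustrative):
.nf
# Illustrative values
Prolog=/usr/local/slurm/prolog
PrologFlags=Alloc
.fi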
| |
| If the Epilog fails (returns a non\-zero exit code), this will result in the |
| node being set to a DRAIN state. |
| If the EpilogSlurmctld fails (returns a non\-zero exit code), this will only |
| be logged. |
| If the Prolog fails (returns a non\-zero exit code), this will result in the |
| node being set to a DRAIN state and the job being requeued. The job will be |
| placed in a held state unless \fBnohold_on_prolog_fail\fR is configured in |
| \fBSchedulerParameters\fR. |
| If the PrologSlurmctld fails (returns a non\-zero exit code), this will result |
| in the job being requeued to be executed on another node if possible. Only |
| batch jobs can be requeued. |
| Interactive jobs (salloc and srun) will be cancelled if the |
| PrologSlurmctld fails. |
| If slurmctld is stopped while either PrologSlurmctld or EpilogSlurmctld is |
| running, the script will be killed with SIGKILL. The script will restart when |
| slurmctld restarts. |
| |
| Information about the job is passed to the script using environment |
| variables. For a full list of environment variables please see the Prolog |
| and Epilog Guide <https://slurm.schedmd.com/prolog_epilog.html>. |
| |
| .SH "UNKILLABLE STEP PROGRAM SCRIPT" |
| This program can be used to take special actions to clean up the unkillable |
| processes and/or notify system administrators. |
| The program will be run as \fBSlurmdUser\fR (usually "root") on the compute |
| node where \fBUnkillableStepTimeout\fR was triggered. |
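
A minimal illustrative configuration (the script path and timeout value are
hypothetical):
.nf
# Illustrative values
UnkillableStepProgram=/usr/local/slurm/unkillable_cleanup
UnkillableStepTimeout=180
.fi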
| |
| Information about the unkillable job step is passed to the script using |
| environment variables. |
| |
| .TP |
| \fBSLURM_JOB_ID\fR |
| Job ID. |
| .IP |
| |
| .TP |
| \fBSLURM_STEP_ID\fR |
| Job Step ID. Note that the special steps "batch", "interactive", and "extern" |
| are reported not by name but with integer Step IDs 4294967291, 4294967290, and |
| 4294967292 respectively. |
| .IP |
| |
| .SH "NETWORK TOPOLOGY" |
| Slurm is able to optimize job allocations to minimize network contention. |
| Special Slurm logic is used to optimize allocations on systems with a |
three\-dimensional interconnect, and information about configuring those
systems is available on the Slurm web site: <https://slurm.schedmd.com/>.
| For a hierarchical network, Slurm needs to have detailed information |
| about how nodes are configured on the network switches. |
| .LP |
| The \fBTopologyPlugin\fR parameter controls which plugin is used to |
| collect network topology information. |
| The only values presently supported are |
| "topology/flat" (best\-fit logic over one\-dimensional topology), |
| "topology/tree", and "topology/block" (both determine the network topology |
| based upon information contained in a topology.conf or topology.yaml file, |
| see "man topology.conf" and "man topology.yaml" for more information). |
| Future plugins may gather topology information directly from the network. |
| The topology information is optional. |
| If not provided, Slurm will perform a best\-fit algorithm assuming the |
| nodes are in a one\-dimensional array as configured and the communications |
| cost is related to the node distance in this array. |
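
For example, to select the tree plugin, with switch information supplied in a
separate topology.conf file (a sketch):
.nf
# Illustrative value; see "man topology.conf"
TopologyPlugin=topology/tree
.fi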
| |
| .SH "RELOCATING CONTROLLERS" |
| If the cluster's computers used for the primary or backup controller |
| will be out of service for an extended period of time, it may be |
| desirable to relocate them. |
| In order to do so, follow this procedure: |
| .LP |
| 1. Stop the Slurm daemons on the old controller and nodes. |
| .br |
| 2. Modify the slurm.conf file appropriately. |
| .br |
| 3. Copy the files from the StateSaveLocation to the new controller or ensure |
| that they are accessible to the new controller via a shared drive. |
| .br |
| 4. Distribute the updated slurm.conf file to all nodes. |
| .br |
| 5. Restart the Slurm daemons on the new controller and nodes. |
| .LP |
| There should be no loss of any pending jobs. Any running jobs will get the |
| updated host info and finish normally. |
| Ensure that any nodes added to the cluster have the current |
| slurm.conf file installed. |
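.LP
For step 2 above, the change typically amounts to updating the
\fBSlurmctldHost\fR entries; the host names and addresses below are
illustrative:
.nf
# Illustrative values
SlurmctldHost=newctl0(12.34.56.80)  # New primary server
SlurmctldHost=newctl1(12.34.56.81)  # New backup server
.fi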
| .LP |
| \fBCAUTION:\fR If two nodes are simultaneously configured as the |
| primary controller (two nodes on which \fBSlurmctldHost\fR specify |
| the local host and the \fBslurmctld\fR daemon is executing on each), |
| system behavior will be destructive. |
| If a compute node has an incorrect \fBSlurmctldHost\fR |
| parameter, that node may be rendered |
| unusable, but no other harm will result. |
| |
| .SH "EXAMPLE" |
| .nf |
| # |
| # Sample /etc/slurm.conf for dev[0\-25].llnl.gov |
| # Author: John Doe |
| # Date: 11/06/2001 |
| # |
| SlurmctldHost=dev0(12.34.56.78) # Primary server |
| SlurmctldHost=dev1(12.34.56.79) # Backup server |
| # |
| AuthType=auth/munge |
| Epilog=/usr/local/slurm/epilog |
| Prolog=/usr/local/slurm/prolog |
| FirstJobId=65536 |
| InactiveLimit=120 |
| JobCompType=jobcomp/filetxt |
| JobCompLoc=/var/log/slurm/jobcomp |
| KillWait=30 |
| MaxJobCount=10000 |
| MinJobAge=300 |
| PluginDir=/usr/local/lib:/usr/local/slurm/lib |
| ReturnToService=0 |
| SchedulerType=sched/backfill |
| SlurmctldLogFile=/var/log/slurm/slurmctld.log |
| SlurmdLogFile=/var/log/slurm/slurmd.log |
| SlurmctldPort=7002 |
| SlurmdPort=7003 |
| SlurmdSpoolDir=/var/spool/slurmd.spool |
| StateSaveLocation=/var/spool/slurm.state |
| TmpFS=/tmp |
| WaitTime=30 |
| # |
| # Node Configurations |
| # |
| NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=64000 |
| NodeName=DEFAULT State=UNKNOWN |
| NodeName=dev[0\-25] NodeAddr=edev[0\-25] Weight=16 |
| # Update records for specific DOWN nodes |
| DownNodes=dev20 State=DOWN Reason="power,ETA=Dec25" |
| # |
| # Partition Configurations |
| # |
| PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP |
| PartitionName=debug Nodes=dev[0\-8,18\-25] Default=YES |
| PartitionName=batch Nodes=dev[9\-17] MinNodes=4 |
| PartitionName=long Nodes=dev[9\-17] MaxTime=120 AllowGroups=admin |
| .fi |
| |
| .SH "INCLUDE MODIFIERS" |
| The "include" key word can be used with modifiers within the specified |
| pathname. These modifiers would be replaced with cluster name or other |
| information depending on which modifier is specified. If the included file |
| is not an absolute path name (i.e. it does not start with a slash), it will |
| searched for in the same directory as the slurm.conf file. |
| |
| .TP |
| \fB%c\fR |
| Cluster name specified in the slurm.conf will be used. |
| .IP |
| |
| .LP |
| \fBEXAMPLE\fR |
| .nf |
| ClusterName=linux |
| include /home/slurm/etc/%c_config |
| # Above line interpreted as |
| # "include /home/slurm/etc/linux_config" |
| .fi |
| |
| .SH "FILE AND DIRECTORY PERMISSIONS" |
| There are three classes of files: |
| Files used by \fBslurmctld\fR must be accessible by user \fBSlurmUser\fR |
| and accessible by the primary and backup control machines. |
| Files used by \fBslurmd\fR must be accessible by user root and |
| accessible from every compute node. |
| A few files need to be accessible by normal users on all login and |
| compute nodes. |
| While many files and directories are listed below, most of them will |
| not be used with most configurations. |
| |
| .TP |
| \fBEpilog\fR |
| Must be executable by user root. |
| It is recommended that the file be readable by all users. |
| The file must exist on every compute node. |
| .IP |
| |
| .TP |
| \fBEpilogSlurmctld\fR |
| Must be executable by user \fBSlurmUser\fR. |
| It is recommended that the file be readable by all users. |
| The file must be accessible by the primary and backup control machines. |
| .IP |
| |
| .TP |
| \fBHealthCheckProgram\fR |
| Must be executable by user root. |
| It is recommended that the file be readable by all users. |
| The file must exist on every compute node. |
| .IP |
| |
| .TP |
| \fBJobCompLoc\fR |
| If this specifies a file, it must be writable by user \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| .IP |
| |
| .TP |
| \fBMailProg\fR |
| Must be executable by user \fBSlurmUser\fR. |
| Must not be writable by regular users. |
| The file must be accessible by the primary and backup control machines. |
| .IP |
| |
| .TP |
| \fBProlog\fR |
| Must be executable by user root. |
| It is recommended that the file be readable by all users. |
| The file must exist on every compute node. |
| .IP |
| |
| .TP |
| \fBPrologSlurmctld\fR |
| Must be executable by user \fBSlurmUser\fR. |
| It is recommended that the file be readable by all users. |
| The file must be accessible by the primary and backup control machines. |
| .IP |
| |
| .TP |
| \fBResumeProgram\fR |
| Must be executable by user \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| .IP |
| |
| .TP |
| \fBslurm.conf\fR |
| Readable to all users on all nodes. |
| Must not be writable by regular users. |
| .IP |
| |
| .TP |
| \fBSlurmctldLogFile\fR |
| Must be writable by user \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| .IP |
| |
| .TP |
| \fBSlurmctldPidFile\fR |
| Must be writable by user root. |
| Preferably writable and removable by \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| .IP |
| |
| .TP |
| \fBSlurmdLogFile\fR |
| Must be writable by user root. |
| A distinct file must exist on each compute node. |
| .IP |
| |
| .TP |
| \fBSlurmdPidFile\fR |
| Must be writable by user root. |
| A distinct file must exist on each compute node. |
| .IP |
| |
| .TP |
| \fBSlurmdSpoolDir\fR |
| Must be writable by user root. Permissions must be set to 755 so that |
| job scripts can be executed from this directory. |
A distinct directory must exist on each compute node.
| .IP |
| |
| .TP |
| \fBSrunEpilog\fR |
| Must be executable by all users. |
| The file must exist on every login and compute node. |
| .IP |
| |
| .TP |
| \fBSrunProlog\fR |
| Must be executable by all users. |
| The file must exist on every login and compute node. |
| .IP |
| |
| .TP |
| \fBStateSaveLocation\fR |
| Must be writable by user \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| .IP |
| |
| .TP |
| \fBSuspendProgram\fR |
| Must be executable by user \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| .IP |
| |
| .TP |
| \fBTaskEpilog\fR |
| Must be executable by all users. |
| The file must exist on every compute node. |
| .IP |
| |
| .TP |
| \fBTaskProlog\fR |
| Must be executable by all users. |
| The file must exist on every compute node. |
| .IP |
| |
| .TP |
| \fBUnkillableStepProgram\fR |
| Must be executable by user \fBSlurmdUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| .IP |
| |
| .SH "LOGGING" |
| .LP |
| Note that while Slurm daemons create log files and other files as needed, |
they treat the lack of parent directories as a fatal error.
| This prevents the daemons from running if critical file systems are |
| not mounted and will minimize the risk of cold\-starting (starting |
| without preserving jobs). |
| .LP |
| Log files and job accounting files |
| may need to be created/owned by the "SlurmUser" uid to be successfully |
| accessed. Use the "chown" and "chmod" commands to set the ownership |
| and permissions appropriately. |
| See the section \fBFILE AND DIRECTORY PERMISSIONS\fR for information |
| about the various files and directories used by Slurm. |
| .LP |
| It is recommended that the logrotate utility be used to ensure that |
| various log files do not become too large. |
| This also applies to text files used for accounting, |
| process tracking, and the slurmdbd log if they are used. |
| .LP |
| Here is a sample logrotate configuration. Make appropriate site modifications |
| and save as /etc/logrotate.d/slurm on all nodes. |
| See the \fBlogrotate\fR man page for more details. |
| .LP |
| .nf |
| ## |
| # Slurm Logrotate Configuration |
| ## |
| /var/log/slurm/*.log { |
| compress |
| missingok |
| nocopytruncate |
| nodelaycompress |
| nomail |
| notifempty |
| noolddir |
| rotate 5 |
| sharedscripts |
| size=5M |
| create 640 slurm root |
| postrotate |
| pkill \-x \-\-signal SIGUSR2 slurmctld |
| pkill \-x \-\-signal SIGUSR2 slurmd |
| pkill \-x \-\-signal SIGUSR2 slurmdbd |
| exit 0 |
| endscript |
| } |
| .fi |
| |
| .SH "COPYING" |
| Copyright (C) 2002\-2007 The Regents of the University of California. |
| Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER). |
| .br |
| Copyright (C) 2008\-2010 Lawrence Livermore National Security. |
| .br |
| Copyright (C) 2010\-2022 SchedMD LLC. |
| .LP |
| This file is part of Slurm, a resource management program. |
| For details, see <https://slurm.schedmd.com/>. |
| .LP |
| Slurm is free software; you can redistribute it and/or modify it under |
| the terms of the GNU General Public License as published by the Free |
| Software Foundation; either version 2 of the License, or (at your option) |
| any later version. |
| .LP |
| Slurm is distributed in the hope that it will be useful, but WITHOUT ANY |
| WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS |
| FOR A PARTICULAR PURPOSE. See the GNU General Public License for more |
| details. |
| |
| .SH "FILES" |
| /etc/slurm.conf |
| |
| .SH "SEE ALSO" |
| .LP |
| \fBcgroup.conf\fR(5), \fBgetaddrinfo\fR(3), |
| \fBgetrlimit\fR(2), \fBgres.conf\fR(5), \fBgroup\fR(5), \fBhostname\fR(1), |
| \fBscontrol\fR(1), \fBslurmctld\fR(8), \fBslurmd\fR(8), |
| \fBslurmdbd\fR(8), \fBslurmdbd.conf\fR(5), \fBsrun\fR(1), |
| \fBspank\fR(8), \fBsyslog\fR(3), \fBtopology.conf\fR(5) |