| .TH "slurm.conf" "5" "August 2010" "slurm.conf 2.2" "Slurm configuration file" |
| |
| .SH "NAME" |
| slurm.conf \- Slurm configuration file |
| .SH "DESCRIPTION" |
| \fB/etc/slurm.conf\fP is an ASCII file which describes general SLURM |
| configuration information, the nodes to be managed, information about |
| how those nodes are grouped into partitions, and various scheduling |
| parameters associated with those partitions. This file should be |
| consistent across all nodes in the cluster. |
| .LP |
| You can use the \fBSLURM_CONF\fR environment variable to override the built\-in |
| location of this file. The SLURM daemons also allow you to override |
| both the built\-in and environment\-provided location using the "\-f" |
| option on the command line. |
| .LP |
Note that while SLURM daemons create log files and other files as needed,
they treat the lack of parent directories as a fatal error.
| This prevents the daemons from running if critical file systems are |
| not mounted and will minimize the risk of cold\-starting (starting |
| without preserving jobs). |
| .LP |
| The contents of the file are case insensitive except for the names of nodes |
| and partitions. Any text following a "#" in the configuration file is treated |
| as a comment through the end of that line. |
| The size of each line in the file is limited to 1024 characters. |
| Changes to the configuration file take effect upon restart of |
| SLURM daemons, daemon receipt of the SIGHUP signal, or execution |
| of the command "scontrol reconfigure" unless otherwise noted. |
| .LP |
| If a line begins with the word "Include" followed by whitespace |
| and then a file name, that file will be included inline with the current |
| configuration file. |
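For example, node definitions might be kept in a separate file and
pulled in with a line such as the following (the file name shown is
only illustrative):

.nf
Include /etc/slurm/nodes.conf
.fi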
| .LP |
| Note on file permissions: |
| .LP |
| The \fIslurm.conf\fR file must be readable by all users of SLURM, since it |
| is used by many of the SLURM commands. Other files that are defined |
| in the \fIslurm.conf\fR file, such as log files and job accounting files, |
| may need to be created/owned by the "SlurmUser" uid to be successfully |
| accessed. Use the "chown" and "chmod" commands to set the ownership |
| and permissions appropriately. |
| See the section \fBFILE AND DIRECTORY PERMISSIONS\fR for information |
| about the various files and directories used by SLURM. |
| |
| .SH "PARAMETERS" |
| .LP |
| The overall configuration parameters available include: |
| |
| .TP |
| \fBAccountingStorageBackupHost\fR |
| The name of the backup machine hosting the accounting storage database. |
| If used with the accounting_storage/slurmdbd plugin, this is where the backup |
| slurmdbd would be running. |
| Only used for database type storage plugins, ignored otherwise. |
| |
| .TP |
| \fBAccountingStorageEnforce\fR |
| This controls what level of association\-based enforcement to impose |
| on job submissions. Valid options are any combination of |
| \fIassociations\fR, \fIlimits\fR, \fIqos\fR, and \fIwckeys\fR, or |
| \fIall\fR for all things. If limits, qos, or wckeys are set, |
| associations will automatically be set. In addition, if wckeys is |
| set, TrackWCKey will automatically be set. By enforcing Associations |
| no new job is allowed to run unless a corresponding association exists |
| in the system. If limits are enforced users can be limited by |
| association to whatever job size or run time limits are defined. With |
| qos and/or wckeys enforced jobs will not be scheduled unless a valid |
| qos and/or workload characterization key is specified. When |
| \fBAccountingStorageEnforce\fR is changed, a restart of the slurmctld |
| daemon is required (not just a "scontrol reconfig"). |
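For example, to require that every job map to a known association and
honor the limits defined for it, one might configure:

.nf
AccountingStorageEnforce=associations,limits
.fi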
| |
| .TP |
| \fBAccountingStorageHost\fR |
| The name of the machine hosting the accounting storage database. |
| Only used for database type storage plugins, ignored otherwise. |
| Also see \fBDefaultStorageHost\fR. |
| |
| .TP |
| \fBAccountingStorageLoc\fR |
| The fully qualified file name where accounting records are written |
| when the \fBAccountingStorageType\fR is "accounting_storage/filetxt" |
| or else the name of the database where accounting records are stored when the |
| \fBAccountingStorageType\fR is a database. |
| Also see \fBDefaultStorageLoc\fR. |
| |
| .TP |
| \fBAccountingStoragePass\fR |
| The password used to gain access to the database to store the |
| accounting data. Only used for database type storage plugins, ignored |
| otherwise. In the case of SLURM DBD (Database Daemon) with MUNGE |
| authentication this can be configured to use a MUNGE daemon |
| specifically configured to provide authentication between clusters |
| while the default MUNGE daemon provides authentication within a |
| cluster. In that case, \fBAccountingStoragePass\fR should specify the |
| named port to be used for communications with the alternate MUNGE |
| daemon (e.g. "/var/run/munge/global.socket.2"). The default value is |
| NULL. Also see \fBDefaultStoragePass\fR. |
| |
| .TP |
| \fBAccountingStoragePort\fR |
| The listening port of the accounting storage database server. |
| Only used for database type storage plugins, ignored otherwise. |
| Also see \fBDefaultStoragePort\fR. |
| |
| .TP |
| \fBAccountingStorageType\fR |
| The accounting storage mechanism type. Acceptable values at |
| present include "accounting_storage/filetxt", |
| "accounting_storage/mysql", "accounting_storage/none", |
| "accounting_storage/pgsql", and "accounting_storage/slurmdbd". The |
| "accounting_storage/filetxt" value indicates that accounting records |
| will be written to the file specified by the |
| \fBAccountingStorageLoc\fR parameter. The "accounting_storage/mysql" |
| value indicates that accounting records will be written to a MySQL |
| database specified by the \fBAccountingStorageLoc\fR parameter. The |
| "accounting_storage/pgsql" value indicates that accounting records |
| will be written to a PostgreSQL database specified by the |
| \fBAccountingStorageLoc\fR parameter. The |
| "accounting_storage/slurmdbd" value indicates that accounting records |
| will be written to the SLURM DBD, which manages an underlying MySQL or |
| PostgreSQL database. See "man slurmdbd" for more information. The |
| default value is "accounting_storage/none" and indicates that account |
records are not maintained. Note: the PostgreSQL plugin is not
complete and should not be used if you want to use associations. It
will, however, work with basic accounting of jobs and job steps. If
you are interested in completing it, please email slurm-dev@lists.llnl.gov. Also
| see \fBDefaultStorageType\fR. |
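For example, a minimal SlurmDBD\-based accounting sketch might resemble
the following (the host and cluster names are illustrative; 6819 is
the customary slurmdbd port):

.nf
# "dbhost" and "mycluster" are placeholders for site\-specific values
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbhost
AccountingStoragePort=6819
ClusterName=mycluster
.fi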
| |
| .TP |
| \fBAccountingStorageUser\fR |
| The user account for accessing the accounting storage database. |
| Only used for database type storage plugins, ignored otherwise. |
| Also see \fBDefaultStorageUser\fR. |
| |
| .TP |
| \fBAuthType\fR |
| The authentication method for communications between SLURM |
| components. |
| Acceptable values at present include "auth/none", "auth/authd", |
| and "auth/munge". |
| The default value is "auth/munge". |
| "auth/none" includes the UID in each communication, but it is not verified. |
| This may be fine for testing purposes, but |
| \fBdo not use "auth/none" if you desire any security\fR. |
| "auth/authd" indicates that Brett Chun's authd is to be used (see |
| "http://www.theether.org/authd/" for more information. Note that |
| authd is no longer actively supported). |
| "auth/munge" indicates that LLNL's MUNGE is to be used |
| (this is the best supported authentication mechanism for SLURM, |
| see "http://munge.googlecode.com/" for more information). |
| All SLURM daemons and commands must be terminated prior to changing |
| the value of \fBAuthType\fR and later restarted (SLURM jobs can be |
| preserved). |
| |
| .TP |
| \fBBackupAddr\fR |
| The name that \fBBackupController\fR should be referred to in |
| establishing a communications path. This name will |
| be used as an argument to the gethostbyname() function for |
| identification. For example, "elx0000" might be used to designate |
| the Ethernet address for node "lx0000". |
| By default the \fBBackupAddr\fR will be identical in value to |
| \fBBackupController\fR. |
| |
| .TP |
| \fBBackupController\fR |
| The name of the machine where SLURM control functions are to be |
| executed in the event that \fBControlMachine\fR fails. This node |
| may also be used as a compute server if so desired. It will come into service |
| as a controller only upon the failure of ControlMachine and will revert |
| to a "standby" mode when the ControlMachine becomes available once again. |
| This should be a node name without the full domain name. I.e., the hostname |
| returned by the \fIgethostname()\fR function cut at the first dot (e.g. use |
| "tux001" rather than "tux001.my.com"). |
| While not essential, it is recommended that you specify a backup controller. |
| See the \fBRELOCATING CONTROLLERS\fR section if you change this. |
| |
| .TP |
| \fBBatchStartTimeout\fR |
The maximum time (in seconds) allowed for a batch job to begin
execution before it is considered missing and its allocation released.
The default value is 10 (seconds). Larger values may be
| required if more time is required to execute the \fBProlog\fR, load |
| user environment variables (for Moab spawned jobs), or if the slurmd |
| daemon gets paged from memory. |
| |
| .TP |
| \fBCacheGroups\fR |
If set to 1, the slurmd daemon will cache /etc/group entries.
| This can improve performance for highly parallel jobs if NIS servers |
| are used and unable to respond very quickly. |
| The default value is 0 to disable caching group data. |
| |
| .TP |
| \fBCheckpointType\fR |
| The system\-initiated checkpoint method to be used for user jobs. |
| The slurmctld daemon must be restarted for a change in \fBCheckpointType\fR |
| to take effect. |
| Supported values presently include: |
| .RS |
| .TP 18 |
| \fBcheckpoint/aix\fR |
| for AIX systems only |
| .TP |
| \fBcheckpoint/blcr\fR |
| Berkeley Lab Checkpoint Restart (BLCR). |
| NOTE: If a file is found at sbin/scch (relative to the SLURM installation |
| location), it will be executed upon completion of the checkpoint. This can |
| be a script used for managing the checkpoint files. |
| .TP |
| \fBcheckpoint/none\fR |
| no checkpoint support (default) |
| .TP |
| \fBcheckpoint/ompi\fR |
| OpenMPI (version 1.3 or higher) |
| .TP |
| \fBcheckpoint/xlch\fR |
| XLCH (requires that SlurmUser be root) |
| .RE |
| |
| .TP |
| \fBClusterName\fR |
| The name by which this SLURM managed cluster is known in the |
accounting database. This is needed to distinguish accounting records
| when multiple clusters report to the same database. |
| |
| .TP |
| \fBCompleteWait\fR |
| The time, in seconds, given for a job to remain in COMPLETING state |
| before any additional jobs are scheduled. |
| If set to zero, pending jobs will be started as soon as possible. |
| Since a COMPLETING job's resources are released for use by other |
| jobs as soon as the \fBEpilog\fR completes on each individual node, |
| this can result in very fragmented resource allocations. |
| To provide jobs with the minimum response time, a value of zero is |
| recommended (no waiting). |
| To minimize fragmentation of resources, a value equal to \fBKillWait\fR |
| plus two is recommended. |
| In that case, setting \fBKillWait\fR to a small value may be beneficial. |
| The default value of \fBCompleteWait\fR is zero seconds. |
| The value may not exceed 65533. |
| |
| .TP |
| \fBControlAddr\fR |
Name by which \fBControlMachine\fR should be referred to in
| establishing a communications path. This name will |
| be used as an argument to the gethostbyname() function for |
| identification. For example, "elx0000" might be used to designate |
| the Ethernet address for node "lx0000". |
| By default the \fBControlAddr\fR will be identical in value to |
| \fBControlMachine\fR. |
| |
| .TP |
| \fBControlMachine\fR |
| The short hostname of the machine where SLURM control functions are |
| executed (i.e. the name returned by the command "hostname \-s", use |
| "tux001" rather than "tux001.my.com"). |
| This value must be specified. |
| In order to support some high availability architectures, multiple |
| hostnames may be listed with comma separators and one \fBControlAddr\fR |
must be specified. The high availability system must ensure that the
| slurmctld daemon is running on only one of these hosts at a time. |
| See the \fBRELOCATING CONTROLLERS\fR section if you change this. |
| |
| .TP |
| \fBCryptoType\fR |
| The cryptographic signature tool to be used in the creation of |
| job step credentials. |
| The slurmctld daemon must be restarted for a change in \fBCryptoType\fR |
| to take effect. |
| Acceptable values at present include "crypto/munge" and "crypto/openssl". |
| The default value is "crypto/munge". |
| |
| .TP |
| \fBDebugFlags\fR |
| Defines specific subsystems which should provide more detailed event logging. |
| Multiple subsystems can be specified with comma separators. |
| Valid subsystems available today (with more to come) include: |
| .RS |
| .TP 17 |
| \fBBackfill\fR |
| Backfill scheduler details |
| .TP |
| \fBBGBlockAlgo\fR |
| BlueGene block selection details |
| .TP |
| \fBBGBlockAlgoDeep\fR |
| BlueGene block selection, more details |
| .TP |
| \fBBGBlockPick\fR |
| BlueGene block selection for jobs |
| .TP |
| \fBBGBlockWires\fR |
| BlueGene block wiring (switch state details) |
| .TP |
| \fBCPU_Bind\fR |
| CPU binding details for jobs and steps |
| .TP |
| \fBGres\fR |
| Generic resource details |
| .TP |
| \fBGang\fR |
| Gang scheduling details |
| .TP |
| \fBNO_CONF_HASH\fR |
Do not log when the slurm.conf file differs between SLURM daemons
| .TP |
\fBPriority\fR
| Job prioritization |
| .TP |
\fBReservation\fR
| Advanced reservations |
| .TP |
| \fBSelectType\fR |
| Resource selection plugin |
| .TP |
| \fBSteps\fR |
| Slurmctld resource allocation for job steps |
| .TP |
| \fBTriggers\fR |
| Slurmctld triggers |
| .TP |
| \fBWiki\fR |
| Sched/wiki and wiki2 communications |
| .RE |
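For example, to log extra detail from the backfill scheduler and the
job prioritization logic:

.nf
DebugFlags=Backfill,Priority
.fi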
| |
| .TP |
| \fBDefMemPerCPU\fR |
| Default real memory size available per allocated CPU in MegaBytes. |
| Used to avoid over\-subscribing memory and causing paging. |
| \fBDefMemPerCPU\fR would generally be used if individual processors |
| are allocated to jobs (\fBSelectType=select/cons_res\fR). |
| The default value is 0 (unlimited). |
| Also see \fBDefMemPerNode\fR and \fBMaxMemPerCPU\fR. |
| \fBDefMemPerCPU\fR and \fBDefMemPerNode\fR are mutually exclusive. |
| NOTE: Enforcement of memory limits currently requires enabling of |
| accounting, which samples memory use on a periodic basis (data need |
| not be stored, just collected). |
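For example, a site allocating individual processors might pair this
parameter with the consumable resource plugin (the memory values shown
are illustrative):

.nf
SelectType=select/cons_res
DefMemPerCPU=1024   # default of 1 GB per allocated CPU
MaxMemPerCPU=2048   # hard cap of 2 GB per allocated CPU
.fi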
| |
| .TP |
| \fBDefMemPerNode\fR |
| Default real memory size available per allocated node in MegaBytes. |
| Used to avoid over\-subscribing memory and causing paging. |
| \fBDefMemPerNode\fR would generally be used if whole nodes |
| are allocated to jobs (\fBSelectType=select/linear\fR) and |
| resources are shared (\fBShared=yes\fR or \fBShared=force\fR). |
| The default value is 0 (unlimited). |
| Also see \fBDefMemPerCPU\fR and \fBMaxMemPerNode\fR. |
| \fBDefMemPerCPU\fR and \fBDefMemPerNode\fR are mutually exclusive. |
| NOTE: Enforcement of memory limits currently requires enabling of |
| accounting, which samples memory use on a periodic basis (data need |
| not be stored, just collected). |
| |
| .TP |
| \fBDefaultStorageHost\fR |
| The default name of the machine hosting the accounting storage and |
| job completion databases. |
| Only used for database type storage plugins and when the |
| \fBAccountingStorageHost\fR and \fBJobCompHost\fR have not been |
| defined. |
| |
| .TP |
| \fBDefaultStorageLoc\fR |
| The fully qualified file name where accounting records and/or job |
| completion records are written when the \fBDefaultStorageType\fR is |
| "filetxt" or the name of the database where accounting records and/or job |
| completion records are stored when the \fBDefaultStorageType\fR is a |
| database. |
| Also see \fBAccountingStorageLoc\fR and \fBJobCompLoc\fR. |
| |
| .TP |
| \fBDefaultStoragePass\fR |
| The password used to gain access to the database to store the |
| accounting and job completion data. |
| Only used for database type storage plugins, ignored otherwise. |
| Also see \fBAccountingStoragePass\fR and \fBJobCompPass\fR. |
| |
| .TP |
| \fBDefaultStoragePort\fR |
| The listening port of the accounting storage and/or job completion |
| database server. |
| Only used for database type storage plugins, ignored otherwise. |
| Also see \fBAccountingStoragePort\fR and \fBJobCompPort\fR. |
| |
| .TP |
| \fBDefaultStorageType\fR |
| The accounting and job completion storage mechanism type. Acceptable |
| values at present include "filetxt", "mysql", "none", "pgsql", and |
| "slurmdbd". The value "filetxt" indicates that records will be |
| written to a file. The value "mysql" indicates that accounting |
| records will be written to a mysql database. The default value is |
| "none", which means that records are not maintained. The value |
| "pgsql" indicates that records will be written to a PostgreSQL |
| database. The value "slurmdbd" indicates that records will be written |
| to the SLURM DBD, which maintains its own database. See "man slurmdbd" |
| for more information. |
| Also see \fBAccountingStorageType\fR and \fBJobCompType\fR. |
| |
| .TP |
| \fBDefaultStorageUser\fR |
| The user account for accessing the accounting storage and/or job |
| completion database. |
| Only used for database type storage plugins, ignored otherwise. |
| Also see \fBAccountingStorageUser\fR and \fBJobCompUser\fR. |
| |
| .TP |
| \fBDisableRootJobs\fR |
| If set to "YES" then user root will be prevented from running any jobs. |
| The default value is "NO", meaning user root will be able to execute jobs. |
| \fBDisableRootJobs\fR may also be set by partition. |
| |
| .TP |
| \fBEnforcePartLimits\fR |
| If set to "YES" then jobs which exceed a partition's size and/or time limits |
| will be rejected at submission time. If set to "NO" then the job will be |
| accepted and remain queued until the partition limits are altered. |
| The default value is "NO". |
| |
| .TP |
| \fBEpilog\fR |
| Fully qualified pathname of a script to execute as user root on every |
| node when a user's job completes (e.g. "/usr/local/slurm/epilog"). This may |
| be used to purge files, disable user login, etc. |
| By default there is no epilog. |
| See \fBProlog and Epilog Scripts\fR for more information. |
| |
| .TP |
| \fBEpilogMsgTime\fR |
| The number of microseconds that the slurmctld daemon requires to process |
an epilog completion message from the slurmd daemons. This parameter can
| be used to prevent a burst of epilog completion messages from being sent |
| at the same time which should help prevent lost messages and improve |
| throughput for large jobs. |
| The default value is 2000 microseconds. |
| For a 1000 node job, this spreads the epilog completion messages out over |
| two seconds. |
| |
| .TP |
| \fBEpilogSlurmctld\fR |
| Fully qualified pathname of a program for the slurmctld to execute |
| upon termination of a job allocation (e.g. |
| "/usr/local/slurm/epilog_controller"). |
| The program executes as SlurmUser, which gives it permission to drain |
| nodes and requeue the job if a failure occurs or cancel the job if appropriate. |
| The program can be used to reboot nodes or perform other work to prepare |
| resources for use. |
| See \fBProlog and Epilog Scripts\fR for more information. |
| |
| .TP |
| \fBFastSchedule\fR |
| Controls how a node's configuration specifications in slurm.conf are used. |
| If the number of node configuration entries in the configuration file |
| is significantly lower than the number of nodes, setting FastSchedule to |
| 1 will permit much faster scheduling decisions to be made. |
| (The scheduler can just check the values in a few configuration records |
| instead of possibly thousands of node records.) |
| Note that on systems with hyper\-threading, the processor count |
| reported by the node will be twice the actual processor count. |
| Consider which value you want to be used for scheduling purposes. |
| .RS |
| .TP 5 |
| \fB1\fR (default) |
| Consider the configuration of each node to be that specified in the |
| slurm.conf configuration file and any node with less than the |
| configured resources will be set DOWN. |
| .TP |
| \fB0\fR |
| Base scheduling decisions upon the actual configuration of each individual |
| node except that the node's processor count in SLURM's configuration must |
| match the actual hardware configuration if \fBSchedulerType=sched/gang\fR |
| or \fBSelectType=select/cons_res\fR are configured (both of those plugins |
| maintain resource allocation information using bitmaps for the cores in the |
| system and must remain static, while the node's memory and disk space can |
| be established later). |
| .TP |
| \fB2\fR |
| Consider the configuration of each node to be that specified in the |
| slurm.conf configuration file and any node with less than the |
| configured resources will \fBnot\fR be set DOWN. |
| This can be useful for testing purposes. |
| .RE |
| |
| .TP |
| \fBFirstJobId\fR |
The job id to be used for the first job submitted to SLURM without a
specific requested value. Job id values generated will be incremented by 1
| for each subsequent job. This may be used to provide a meta\-scheduler |
| with a job id space which is disjoint from the interactive jobs. |
| The default value is 1. |
| |
| .TP |
| \fBGetEnvTimeout\fR |
Used for Moab scheduled jobs only. Controls how long a job should wait
| in seconds for loading the user's environment before attempting to |
| load it from a cache file. Applies when the srun or sbatch |
| \fI\-\-get\-user\-env\fR option is used. If set to 0 then always load |
| the user's environment from the cache file. |
| The default value is 2 seconds. |
| |
| .TP |
| \fBGresTypes\fR |
| A comma delimited list of generic resources to be managed. |
| These generic resources may have an associated plugin available to provide |
| additional functionality. |
| No generic resources are managed by default. |
Ensure this parameter is consistent across all nodes in the cluster for
| proper operation. |
| The slurmctld daemon must be restarted for changes to this parameter to become |
| effective. |
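For example, to manage GPUs as a generic resource (the node name and
count here are illustrative; each node definition then carries its own
count):

.nf
GresTypes=gpu
# on the node definition lines, e.g.: NodeName=tux001 Gres=gpu:2
.fi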
| |
| .TP |
| \fBGroupUpdateForce\fR |
| If set to a non\-zero value, then information about which users are members |
| of groups allowed to use a partition will be updated periodically, even when |
| there have been no changes to the /etc/group file. |
| Otherwise group member information will be updated periodically only after the |
/etc/group file is updated.
The default value is 0.
| Also see the \fBGroupUpdateTime\fR parameter. |
| |
| .TP |
| \fBGroupUpdateTime\fR |
| Controls how frequently information about which users are members of groups |
| allowed to use a partition will be updated. |
| The time interval is given in seconds with a default value of 600 seconds and |
| a maximum value of 4095 seconds. |
| A value of zero will prevent periodic updating of group membership information. |
| Also see the \fBGroupUpdateForce\fR parameter. |
| |
| .TP |
| \fBHealthCheckInterval\fR |
| The interval in seconds between executions of \fBHealthCheckProgram\fR. |
| The default value is zero, which disables execution. |
| |
| .TP |
| \fBHealthCheckProgram\fR |
| Fully qualified pathname of a script to execute as user root periodically |
| on all compute nodes that are not in the NOT_RESPONDING state. This may be |
| used to verify the node is fully operational and DRAIN the node or send email |
| if a problem is detected. |
| Any action to be taken must be explicitly performed by the program |
| (e.g. execute |
| "scontrol update NodeName=foo State=drain Reason=tmp_file_system_full" |
| to drain a node). |
| The interval is controlled using the \fBHealthCheckInterval\fR parameter. |
| Note that the \fBHealthCheckProgram\fR will be executed at the same time |
| on all nodes to minimize its impact upon parallel programs. |
This program will be killed if it does not terminate normally within
| 60 seconds. |
| By default, no program will be executed. |
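For example, to run a site\-provided check script every five minutes
(the script path is illustrative):

.nf
HealthCheckProgram=/usr/sbin/slurm_node_check
HealthCheckInterval=300
.fi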
| |
| .TP |
| \fBInactiveLimit\fR |
| The interval, in seconds, after which a non\-responsive job allocation |
| command (e.g. \fBsrun\fR or \fBsalloc\fR) will result in the job being |
| terminated. If the node on which the command is executed fails or the |
| command abnormally terminates, this will terminate its job allocation. |
| This option has no effect upon batch jobs. |
| When setting a value, take into consideration that a debugger using \fBsrun\fR |
| to launch an application may leave the \fBsrun\fR command in a stopped state |
| for extended periods of time. |
| This limit is ignored for jobs running in partitions with the |
| \fBRootOnly\fR flag set (the scheduler running as root will be |
| responsible for the job). |
| The default value is unlimited (zero) and may not exceed 65533 seconds. |
| |
| .TP |
| \fBJobAcctGatherType\fR |
| The job accounting mechanism type. |
| Acceptable values at present include "jobacct_gather/aix" (for AIX operating |
| system), "jobacct_gather/linux" (for Linux operating system) and "jobacct_gather/none" |
| (no accounting data collected). |
| The default value is "jobacct_gather/none". |
| In order to use the \fBsstat\fR tool, "jobacct_gather/aix" or "jobacct_gather/linux" |
| must be configured. |
| |
| .TP |
| \fBJobAcctGatherFrequency\fR |
| The job accounting sampling interval. |
| For jobacct_gather/none this parameter is ignored. |
For jobacct_gather/aix and jobacct_gather/linux the parameter is the number of
seconds between samplings of job state.
The default value is 30 seconds.
A value of zero disables the periodic job sampling and provides accounting
| information only on job termination (reducing SLURM interference with the job). |
| Smaller (non\-zero) values have a greater impact upon job performance, but |
| a value of 30 seconds is not likely to be noticeable for applications having |
| less than 10,000 tasks. |
| Users can override this value on a per job basis using the \fB\-\-acctg\-freq\fR |
| option when submitting the job. |
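For example, to collect Linux accounting data with a 60 second
sampling interval:

.nf
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=60
.fi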
| |
| .TP |
| \fBJobCheckpointDir\fR |
| Specifies the default directory for storing or reading job checkpoint |
| information. The data stored here is only a few thousand bytes per job |
and includes the information needed to resubmit the job request, not the job's
memory image. The directory must be readable and writable by
\fBSlurmUser\fR, but not writable by regular users. The job memory images
may be in a different location, as specified by the \fB\-\-checkpoint\-dir\fR
option at job submit time or scontrol's \fBImageDir\fR option.
| |
| .TP |
| \fBJobCompHost\fR |
| The name of the machine hosting the job completion database. |
| Only used for database type storage plugins, ignored otherwise. |
| Also see \fBDefaultStorageHost\fR. |
| |
| .TP |
| \fBJobCompLoc\fR |
| The fully qualified file name where job completion records are written |
| when the \fBJobCompType\fR is "jobcomp/filetxt" or the database where |
| job completion records are stored when the \fBJobCompType\fR is a |
| database. |
| Also see \fBDefaultStorageLoc\fR. |
| |
| .TP |
| \fBJobCompPass\fR |
| The password used to gain access to the database to store the job |
| completion data. |
| Only used for database type storage plugins, ignored otherwise. |
| Also see \fBDefaultStoragePass\fR. |
| |
| .TP |
| \fBJobCompPort\fR |
| The listening port of the job completion database server. |
| Only used for database type storage plugins, ignored otherwise. |
| Also see \fBDefaultStoragePort\fR. |
| |
| .TP |
| \fBJobCompType\fR |
| The job completion logging mechanism type. |
| Acceptable values at present include "jobcomp/none", "jobcomp/filetxt", |
| "jobcomp/mysql", "jobcomp/pgsql", and "jobcomp/script"". |
| The default value is "jobcomp/none", which means that upon job completion |
| the record of the job is purged from the system. If using the accounting |
| infrastructure this plugin may not be of interest since the information |
| here is redundant. |
| The value "jobcomp/filetxt" indicates that a record of the job should be |
| written to a text file specified by the \fBJobCompLoc\fR parameter. |
| The value "jobcomp/mysql" indicates that a record of the job should be |
| written to a mysql database specified by the \fBJobCompLoc\fR parameter. |
| The value "jobcomp/pgsql" indicates that a record of the job should be |
| written to a PostgreSQL database specified by the \fBJobCompLoc\fR parameter. |
| The value "jobcomp/script" indicates that a script specified by the |
| \fBJobCompLoc\fR parameter is to be executed with environment variables |
| indicating the job information. |
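For example, to append a text record for each completed job (the file
name is illustrative):

.nf
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/job_completions
.fi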
| |
| .TP |
| \fBJobCompUser\fR |
| The user account for accessing the job completion database. |
| Only used for database type storage plugins, ignored otherwise. |
| Also see \fBDefaultStorageUser\fR. |
| |
| .TP |
| \fBJobCredentialPrivateKey\fR |
| Fully qualified pathname of a file containing a private key used for |
| authentication by SLURM daemons. |
| This parameter is ignored if \fBCryptoType=crypto/munge\fR. |
| |
| .TP |
| \fBJobCredentialPublicCertificate\fR |
| Fully qualified pathname of a file containing a public key used for |
| authentication by SLURM daemons. |
| This parameter is ignored if \fBCryptoType=crypto/munge\fR. |
| |
| .TP |
| \fBJobFileAppend\fR |
| This option controls what to do if a job's output or error file |
exists when the job is started.
| If \fBJobFileAppend\fR is set to a value of 1, then append to |
| the existing file. |
| By default, any existing file is truncated. |
| |
| .TP |
| \fBJobRequeue\fR |
| This option controls what to do by default after a node failure. |
| If \fBJobRequeue\fR is set to a value of 1, then any batch job running |
| on the failed node will be requeued for execution on different nodes. |
| If \fBJobRequeue\fR is set to a value of 0, then any job running |
| on the failed node will be terminated. |
| Use the \fBsbatch\fR \fI\-\-no\-requeue\fR or \fI\-\-requeue\fR |
| option to change the default behavior for individual jobs. |
| The default value is 1. |
| |
| .TP |
| \fBJobSubmitPlugins\fR |
| A comma delimited list of job submission plugins to be used. |
| The specified plugins will be executed in the order listed. |
| These are intended to be site\-specific plugins which can be used to set |
| default job parameters and/or logging events. |
| Sample plugins available in the distribution include "cnode", "defaults", |
| "logging", "lua", and "partition". |
| See the SLURM code in "src/plugins/job_submit" and modify the code to satisfy |
| your needs. |
| No job submission plugins are used by default. |
| |
| .TP |
| \fBKillOnBadExit\fR |
If set to 1, the job will be terminated immediately when one of its
processes crashes or aborts. With the default value of 0, if one of
the processes crashes or aborts the other processes will continue
to run.
| |
| .TP |
| \fBKillWait\fR |
| The interval, in seconds, given to a job's processes between the |
| SIGTERM and SIGKILL signals upon reaching its time limit. |
| If the job fails to terminate gracefully in the interval specified, |
| it will be forcibly terminated. |
| The default value is 30 seconds. |
| The value may not exceed 65533. |
| |
| .TP |
| \fBLicenses\fR |
| Specification of licenses (or other resources available on all |
| nodes of the cluster) which can be allocated to jobs. |
| License names can optionally be followed by an asterisk |
and a count, with a default count of one.
| Multiple license names should be comma separated (e.g. |
| "Licenses=foo*4,bar"). |
| Note that SLURM prevents jobs from being scheduled if their |
| required license specification is not available. |
| SLURM does not prevent jobs from using licenses that are |
| not explicitly listed in the job submission specification. |
| |
| .TP |
| \fBMailProg\fR |
| Fully qualified pathname to the program used to send email per user request. |
| The default value is "/bin/mail". |
| |
| .TP |
| \fBMaxJobCount\fR |
| The maximum number of jobs SLURM can have in its active database |
| at one time. Set the values of \fBMaxJobCount\fR and \fBMinJobAge\fR |
to ensure the slurmctld daemon does not exhaust its memory or other
| resources. Once this limit is reached, requests to submit additional |
| jobs will fail. The default value is 10000 jobs. This value may not |
| be reset via "scontrol reconfig". It only takes effect upon restart |
| of the slurmctld daemon. |
| |
| .TP |
| \fBMaxMemPerCPU\fR |
| Maximum real memory size available per allocated CPU in MegaBytes. |
| Used to avoid over\-subscribing memory and causing paging. |
| \fBMaxMemPerCPU\fR would generally be used if individual processors |
| are allocated to jobs (\fBSelectType=select/cons_res\fR). |
| The default value is 0 (unlimited). |
| Also see \fBDefMemPerCPU\fR and \fBMaxMemPerNode\fR. |
| \fBMaxMemPerCPU\fR and \fBMaxMemPerNode\fR are mutually exclusive. |
| NOTE: Enforcement of memory limits currently requires enabling of |
| accounting, which samples memory use on a periodic basis (data need |
| not be stored, just collected). |
| |
| .TP |
| \fBMaxMemPerNode\fR |
| Maximum real memory size available per allocated node in MegaBytes. |
| Used to avoid over\-subscribing memory and causing paging. |
| \fBMaxMemPerNode\fR would generally be used if whole nodes |
| are allocated to jobs (\fBSelectType=select/linear\fR) and |
| resources are shared (\fBShared=yes\fR or \fBShared=force\fR). |
| The default value is 0 (unlimited). |
| Also see \fBDefMemPerNode\fR and \fBMaxMemPerCPU\fR. |
| \fBMaxMemPerCPU\fR and \fBMaxMemPerNode\fR are mutually exclusive. |
| NOTE: Enforcement of memory limits currently requires enabling of |
| accounting, which samples memory use on a periodic basis (data need |
| not be stored, just collected). |
| |
| .TP |
| \fBMaxTasksPerNode\fR |
| Maximum number of tasks SLURM will allow a job step to spawn |
| on a single node. The default \fBMaxTasksPerNode\fR is 128. |
| |
| .TP |
| \fBMessageTimeout\fR |
| Time permitted for a round\-trip communication to complete |
| in seconds. Default value is 10 seconds. For systems with |
| shared nodes, the slurmd daemon could be paged out and |
| necessitate higher values. |
| |
| .TP |
| \fBMinJobAge\fR |
| The minimum age of a completed job before its record is purged from |
| SLURM's active database. Set the values of \fBMaxJobCount\fR and |
\fBMinJobAge\fR to ensure the slurmctld daemon does not exhaust
| its memory or other resources. The default value is 300 seconds. |
| A value of zero prevents any job record purging. |
| May not exceed 65533. |
| |
| .TP |
| \fBMpiDefault\fR |
| Identifies the default type of MPI to be used. |
| Srun may override this configuration parameter in any case. |
| Currently supported versions include: |
| \fBlam\fR, |
| \fBmpich1_p4\fR, |
| \fBmpich1_shmem\fR, |
| \fBmpichgm\fR, |
| \fBmpichmx\fR, |
| \fBmvapich\fR, |
| \fBnone\fR (default, which works for many other versions of MPI) and |
| \fBopenmpi\fR. |
| More information about MPI use is available here |
| <https://computing.llnl.gov/linux/slurm/mpi_guide.html>. |
| |
| .TP |
| \fBMpiParams\fR |
| MPI parameters. |
| Used to identify ports used by OpenMPI only and the input format is |
| "ports=12000\-12999" to identify a range of communication ports to be used. |
| |
| .TP |
| \fBOverTimeLimit\fR |
| Number of minutes by which a job can exceed its time limit before |
| being canceled. |
| The configured job time limit is treated as a \fIsoft\fR limit. |
| Adding \fBOverTimeLimit\fR to the \fIsoft\fR limit provides a \fIhard\fR |
| limit, at which point the job is canceled. |
This is particularly useful for backfill scheduling, which bases its
decisions upon each job's soft time limit.
| The default value is zero. |
May not exceed 65533 minutes.
| A value of "UNLIMITED" is also supported. |
| |
| .TP |
| \fBPluginDir\fR |
| Identifies the places in which to look for SLURM plugins. |
| This is a colon\-separated list of directories, like the PATH |
| environment variable. |
| The default value is "/usr/local/lib/slurm". |
| |
| .TP |
| \fBPlugStackConfig\fR |
| Location of the config file for SLURM stackable plugins that use |
| the Stackable Plugin Architecture for Node job (K)control (SPANK). |
| This provides support for a highly configurable set of plugins to |
| be called before and/or after execution of each task spawned as |
| part of a user's job step. Default location is "plugstack.conf" |
| in the same directory as the system slurm.conf. For more information |
| on SPANK plugins, see the \fBspank\fR(8) manual. |
| |
| .TP |
| \fBPreemptMode\fR |
| Enables gang scheduling and/or controls the mechanism used to preempt |
| jobs. When the \fBPreemptType\fR parameter is set to enable |
| preemption, the \fBPreemptMode\fR selects the mechanism used to |
| preempt the lower priority jobs. The \fBGANG\fR option is used to |
| enable gang scheduling independent of whether preemption is enabled |
| (the \fBPreemptType\fR setting). The \fBGANG\fR option can be |
| specified in addition to a \fBPreemptMode\fR setting with the two |
| options comma separated. The \fBSUSPEND\fR option requires that gang |
scheduling be enabled (i.e. "PreemptMode=SUSPEND,GANG").
| .RS |
| .TP 12 |
| \fBOFF\fR |
| is the default value and disables job preemption and gang scheduling. |
| This is the only option compatible with \fBSchedulerType=sched/wiki\fR |
| or \fBSchedulerType=sched/wiki2\fR (used by Maui and Moab respectively, |
| which provide their own job preemption functionality). |
| .TP |
| \fBCANCEL\fR |
| always cancel the job. |
| .TP |
| \fBCHECKPOINT\fR |
| preempts jobs by checkpointing them (if possible) or canceling them. |
| .TP |
| \fBGANG\fR |
| enables gang scheduling (time slicing) of jobs in the same partition. |
| .TP |
| \fBREQUEUE\fR |
| preempts jobs by requeuing them (if possible) or canceling them. |
| .TP |
| \fBSUSPEND\fR |
| preempts jobs by suspending them. |
| A suspended job will resume execution once the high priority job |
| preempting it completes. |
The \fBSUSPEND\fR option may only be used with the \fBGANG\fR option
| (the gang scheduler module performs the job resume operation). |
| .RE |
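For example, to let jobs in higher priority partitions suspend jobs in
lower priority partitions by means of gang scheduling:

.nf
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
.fi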
| |
| .TP |
| \fBPreemptType\fR |
| This specifies the plugin used to identify which jobs can be |
| preempted in order to start a pending job. |
| .RS |
| .TP |
| \fBpreempt/none\fR |
| Job preemption is disabled. |
| This is the default. |
| .TP |
| \fBpreempt/partition_prio\fR |
| Job preemption is based upon partition priority. |
| Jobs in higher priority partitions (queues) may preempt jobs from lower |
| priority partitions. |
| .TP |
| \fBpreempt/qos\fR |
| Job preemption rules are specified by Quality Of Service (QOS) specifications |
in the SLURM database.
| This is not compatible with \fBPreemptMode=OFF\fR or \fBPreemptMode=SUSPEND\fR |
| (i.e. preempted jobs must be removed from the resources). |
| .RE |
| |
| .TP |
| \fBPriorityDecayHalfLife\fR |
This controls how long prior resource use is considered when determining
how over\- or under\-serviced an association (user, bank account and
cluster) is in computing job priority. If set to 0 no decay will be applied.
| This is helpful if you want to enforce hard time limits per association. If |
| set to 0 \fBPriorityUsageResetPeriod\fR must be set to some interval. |
| Applicable only if PriorityType=priority/multifactor. |
| The unit is a time string (i.e. min, hr:min:00, days\-hr:min:00, |
| or days\-hr). The default value is 7\-0 (7 days). |
| |
| .TP |
| \fBPriorityCalcPeriod\fR |
| The period of time in minutes in which the half-life decay will be |
| re-calculated. |
| Applicable only if PriorityType=priority/multifactor. |
| The default value is 5 (minutes). |
| |
| .TP |
| \fBPriorityFavorSmall\fR |
| Specifies that small jobs should be given preferential scheduling priority. |
| Applicable only if PriorityType=priority/multifactor. |
| Supported values are "YES" and "NO". The default value is "NO". |
| |
| .TP |
| \fBPriorityMaxAge\fR |
| Specifies the job age which will be given the maximum age factor in computing |
priority. For example, a value of 30 minutes would result in all jobs over
30 minutes old getting the same age\-based priority.
| Applicable only if PriorityType=priority/multifactor. |
| The unit is a time string (i.e. min, hr:min:00, days\-hr:min:00, |
| or days\-hr). The default value is 7\-0 (7 days). |
| |
| .TP |
| \fBPriorityUsageResetPeriod\fR |
| At this interval the usage of associations will be reset to 0. This is used |
| if you want to enforce hard limits of time usage per association. If |
| PriorityDecayHalfLife is set to be 0 no decay will happen and this is the |
| only way to reset the usage accumulated by running jobs. By default this is |
turned off and use of the PriorityDecayHalfLife option is advised instead,
to avoid a situation in which nothing can run on your cluster. If your scheme
only allots fixed amounts of time on your system, however, this is the way to do it.
| Applicable only if PriorityType=priority/multifactor. |
| .RS |
| .TP 12 |
| \fBNONE\fR |
| Never clear historic usage. The default value. |
| .TP |
| \fBNOW\fR |
| Clear the historic usage now. |
| Executed at startup and reconfiguration time. |
| .TP |
| \fBDAILY\fR |
| Cleared every day at midnight. |
| .TP |
| \fBWEEKLY\fR |
| Cleared every week on Sunday at time 00:00. |
| .TP |
| \fBMONTHLY\fR |
| Cleared on the first day of each month at time 00:00. |
| .TP |
| \fBQUARTERLY\fR |
| Cleared on the first day of each quarter at time 00:00. |
| .TP |
| \fBYEARLY\fR |
| Cleared on the first day of each year at time 00:00. |
| .RE |
| |
| .TP |
| \fBPriorityType\fR |
| This specifies the plugin to be used in establishing a job's scheduling |
| priority. Supported values are "priority/basic" (jobs are prioritized |
| by order of arrival, also suitable for sched/wiki and sched/wiki2) and |
| "priority/multifactor" (jobs are prioritized based upon size, age, |
| fair\-share of allocation, etc). |
| The default value is "priority/basic". |
| |
| .TP |
| \fBPriorityWeightAge\fR |
| An integer value that sets the degree to which the queue wait time |
| component contributes to the job's priority. |
| Applicable only if PriorityType=priority/multifactor. |
| The default value is 0. |
| |
| .TP |
| \fBPriorityWeightFairshare\fR |
| An integer value that sets the degree to which the fair-share |
| component contributes to the job's priority. |
| Applicable only if PriorityType=priority/multifactor. |
| The default value is 0. |
| |
| .TP |
| \fBPriorityWeightJobSize\fR |
| An integer value that sets the degree to which the job size |
| component contributes to the job's priority. |
| Applicable only if PriorityType=priority/multifactor. |
| The default value is 0. |
| |
| .TP |
| \fBPriorityWeightPartition\fR |
| An integer value that sets the degree to which the node partition |
| component contributes to the job's priority. |
| Applicable only if PriorityType=priority/multifactor. |
| The default value is 0. |
| |
| .TP |
| \fBPriorityWeightQOS\fR |
| An integer value that sets the degree to which the Quality Of Service |
| component contributes to the job's priority. |
| Applicable only if PriorityType=priority/multifactor. |
| The default value is 0. |
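The weights above are normally used together. For example, a site
emphasizing fair\-share with a modest age component might configure the
following (all values are illustrative):

.nf
PriorityType=priority/multifactor
PriorityDecayHalfLife=7\-0
PriorityWeightFairshare=10000
PriorityWeightAge=1000
PriorityWeightJobSize=100
PriorityWeightPartition=100
PriorityWeightQOS=100
.fi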
| |
| .TP |
| \fBPrivateData\fR |
| This controls what type of information is hidden from regular users. |
| By default, all information is visible to all users. |
| User \fBSlurmUser\fR and \fBroot\fR can always view all information. |
| Multiple values may be specified with a comma separator. |
| Acceptable values include: |
| .RS |
| .TP |
| \fBaccounts\fR |
| (NON-SLURMDBD ACCOUNTING ONLY) prevents users from viewing any account |
| definitions unless they are coordinators of them. |
| .TP |
| \fBjobs\fR |
| prevents users from viewing jobs or job steps belonging |
| to other users. (NON-SLURMDBD ACCOUNTING ONLY) prevents users from viewing |
| job records belonging to other users unless they are coordinators of |
| the association running the job when using sacct. |
| .TP |
| \fBnodes\fR |
| prevents users from viewing node state information. |
| .TP |
| \fBpartitions\fR |
| prevents users from viewing partition state information. |
| .TP |
| \fBreservations\fR |
| prevents regular users from viewing reservations. |
| .TP |
| \fBusage\fR |
| (NON-SLURMDBD ACCOUNTING ONLY) prevents users from viewing |
| usage of any other user. This applies to sreport. |
| .TP |
| \fBusers\fR |
| (NON-SLURMDBD ACCOUNTING ONLY) prevents users from viewing |
information about any user other than themselves. This also means users can
only see the associations they deal with.
| Coordinators can see associations of all users they are coordinator of, |
| but can only see themselves when listing users. |
| .RE |
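For example, to hide job, usage and user information from other
regular users:

.nf
PrivateData=jobs,usage,users
.fi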
| |
| .TP |
| \fBProctrackType\fR |
| Identifies the plugin to be used for process tracking. |
| The slurmd daemon uses this mechanism to identify all processes |
| which are children of processes it spawns for a user job. |
| The slurmd daemon must be restarted for a change in ProctrackType |
| to take effect. |
| NOTE: "proctrack/linuxproc" and "proctrack/pgid" can fail to |
| identify all processes associated with a job since processes |
| can become a child of the init process (when the parent process |
| terminates) or change their process group. |
| To reliably track all processes, one of the other mechanisms |
| utilizing kernel modifications is preferable. |
| NOTE: "proctrack/linuxproc" is not compatible with "switch/elan." |
| Acceptable values at present include: |
| .RS |
| .TP 20 |
| \fBproctrack/aix\fR |
| which uses an AIX kernel extension and is the default for AIX systems |
| .TP |
| \fBproctrack/cgroup\fR |
| which uses linux cgroups to constrain and track processes. |
| NOTE: see "man cgroup.conf" for configuration details |
| .TP |
| \fBproctrack/linuxproc\fR |
which uses the Linux process tree and parent process IDs
| .TP |
| \fBproctrack/rms\fR |
| which uses Quadrics kernel patch and is the default if "SwitchType=switch/elan" |
| .TP |
| \fBproctrack/sgi_job\fR |
| which uses SGI's Process Aggregates (PAGG) kernel module, |
| see \fIhttp://oss.sgi.com/projects/pagg/\fR for more information |
| .TP |
| \fBproctrack/pgid\fR |
| which uses process group IDs and is the default for all other systems |
| .RE |
| |
| .TP |
| \fBProlog\fR |
| Fully qualified pathname of a program for the slurmd to execute |
| whenever it is asked to run a job step from a new job allocation (e.g. |
| "/usr/local/slurm/prolog"). The slurmd executes the script before starting |
| the first job step. This may be used to purge files, enable user login, etc. |
| By default there is no prolog. Any configured script is expected to |
| complete execution quickly (in less time than \fBMessageTimeout\fR). |
| See \fBProlog and Epilog Scripts\fR for more information. |
| |
| .TP |
| \fBPrologSlurmctld\fR |
| Fully qualified pathname of a program for the slurmctld to execute |
| before granting a new job allocation (e.g. |
| "/usr/local/slurm/prolog_controller"). |
| The program executes as SlurmUser, which gives it permission to drain |
| nodes and requeue the job if a failure occurs or cancel the job if appropriate. |
| The program can be used to reboot nodes or perform other work to prepare |
| resources for use. |
While this program is running, the nodes associated with the job will
have a POWER_UP/CONFIGURING flag set in their state, which can be readily
| viewed. |
| A non\-zero exit code will result in the job being requeued (where possible) |
| or killed. |
| See \fBProlog and Epilog Scripts\fR for more information. |
| |
| .TP |
| \fBPropagatePrioProcess\fR |
| Controls the scheduling priority (nice value) of user spawned tasks. |
| .RS |
| .TP 5 |
| \fB0\fR |
| The tasks will inherit the scheduling priority from the slurm daemon. |
| This is the default value. |
| .TP |
| \fB1\fR |
| The tasks will inherit the scheduling priority of the command used to |
| submit them (e.g. \fBsrun\fR or \fBsbatch\fR). |
| Unless the job is submitted by user root, the tasks will have a scheduling |
| priority no higher than the slurm daemon spawning them. |
| .TP |
| \fB2\fR |
| The tasks will inherit the scheduling priority of the command used to |
| submit them (e.g. \fBsrun\fR or \fBsbatch\fR) with the restriction that |
their nice value will always be one higher than that of the slurm daemon (i.e.
the tasks' scheduling priority will be lower than the slurm daemon's).
| .RE |
| |
| .TP |
| \fBPropagateResourceLimits\fR |
| A list of comma separated resource limit names. |
| The slurmd daemon uses these names to obtain the associated (soft) limit |
values from the user's process environment on the submit node.
| These limits are then propagated and applied to the jobs that |
| will run on the compute nodes. |
| This parameter can be useful when system limits vary among nodes. |
| Any resource limits that do not appear in the list are not propagated. |
| However, the user can override this by specifying which resource limits |
to propagate with the srun command's "\-\-propagate" option.
| If neither of the 'propagate resource limit' parameters are specified, then |
| the default action is to propagate all limits. |
| Only one of the parameters, either |
| \fBPropagateResourceLimits\fR or \fBPropagateResourceLimitsExcept\fR, |
| may be specified. |
| The following limit names are supported by SLURM (although some |
| options may not be supported on some systems): |
| .RS |
| .TP 10 |
| \fBALL\fR |
| All limits listed below |
| .TP |
| \fBNONE\fR |
| No limits listed below |
| .TP |
| \fBAS\fR |
The maximum address space for a process
| .TP |
| \fBCORE\fR |
| The maximum size of core file |
| .TP |
| \fBCPU\fR |
| The maximum amount of CPU time |
| .TP |
| \fBDATA\fR |
| The maximum size of a process's data segment |
| .TP |
| \fBFSIZE\fR |
| The maximum size of files created |
| .TP |
| \fBMEMLOCK\fR |
| The maximum size that may be locked into memory |
| .TP |
| \fBNOFILE\fR |
| The maximum number of open files |
| .TP |
| \fBNPROC\fR |
| The maximum number of processes available |
| .TP |
| \fBRSS\fR |
| The maximum resident set size |
| .TP |
| \fBSTACK\fR |
| The maximum stack size |
| .RE |
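For example, to propagate all limits except the locked\-memory limit
(which might be lower on a login node than the compute nodes require):

.nf
PropagateResourceLimitsExcept=MEMLOCK
.fi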
| |
| .TP |
| \fBPropagateResourceLimitsExcept\fR |
| A list of comma separated resource limit names. |
By default, all resource limits will be propagated (as described by
| the \fBPropagateResourceLimits\fR parameter), except for the limits |
| appearing in this list. The user can override this by specifying which |
resource limits to propagate with the srun command's "\-\-propagate" option.
| See \fBPropagateResourceLimits\fR above for a list of valid limit names. |
| |
| .TP |
| \fBResumeProgram\fR |
| SLURM supports a mechanism to reduce power consumption on nodes that |
| remain idle for an extended period of time. |
| This is typically accomplished by reducing voltage and frequency or powering |
| the node down. |
| \fBResumeProgram\fR is the program that will be executed when a node |
| in power save mode is assigned work to perform. |
| For reasons of reliability, \fBResumeProgram\fR may execute more than once |
| for a node when the \fBslurmctld\fR daemon crashes and is restarted. |
| If \fBResumeProgram\fR is unable to restore a node to service, it should |
requeue any job associated with the node and set the node state to DRAIN.
| The program executes as \fBSlurmUser\fR. |
| The argument to the program will be the names of nodes to |
| be removed from power savings mode (using SLURM's hostlist |
| expression format). |
| By default no program is run. |
| Related configuration options include \fBResumeTimeout\fR, \fBResumeRate\fR, |
| \fBSuspendRate\fR, \fBSuspendTime\fR, \fBSuspendTimeout\fR, \fBSuspendProgram\fR, |
| \fBSuspendExcNodes\fR, and \fBSuspendExcParts\fR. |
| More information is available at the SLURM web site |
| (https://computing.llnl.gov/linux/slurm/power_save.html). |
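For example, a simple power saving sketch might resemble the following
(the program paths are illustrative):

.nf
SuspendProgram=/usr/sbin/slurm_suspend   # powers nodes down
ResumeProgram=/usr/sbin/slurm_resume     # powers nodes back up
SuspendTime=1800    # idle seconds before suspending a node
SuspendRate=60      # nodes suspended per minute
ResumeRate=300      # nodes resumed per minute
ResumeTimeout=60    # seconds to wait for a resumed node
.fi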
| |
| .TP |
| \fBResumeRate\fR |
| The rate at which nodes in power save mode are returned to normal |
| operation by \fBResumeProgram\fR. |
The value is the number of nodes per minute and can be used to prevent
| power surges if a large number of nodes in power save mode are |
| assigned work at the same time (e.g. a large job starts). |
| A value of zero results in no limits being imposed. |
| The default value is 300 nodes per minute. |
| Related configuration options include \fBResumeTimeout\fR, \fBResumeProgram\fR, |
| \fBSuspendRate\fR, \fBSuspendTime\fR, \fBSuspendTimeout\fR, \fBSuspendProgram\fR, |
| \fBSuspendExcNodes\fR, and \fBSuspendExcParts\fR. |
| |
| .TP |
| \fBResumeTimeout\fR |
Maximum time permitted (in seconds) between when a node resume request
| is issued and when the node is actually available for use. |
| Nodes which fail to respond in this time frame may be marked DOWN and |
| the jobs scheduled on the node requeued. |
| The default value is 60 seconds. |
| Related configuration options include \fBResumeProgram\fR, \fBResumeRate\fR, |
| \fBSuspendRate\fR, \fBSuspendTime\fR, \fBSuspendTimeout\fR, \fBSuspendProgram\fR, |
| \fBSuspendExcNodes\fR and \fBSuspendExcParts\fR. |
| More information is available at the SLURM web site |
| (https://computing.llnl.gov/linux/slurm/power_save.html). |
| |
| .TP |
| \fBResvOverRun\fR |
| Describes how long a job already running in a reservation should be |
| permitted to execute after the end time of the reservation has been |
| reached. |
| The time period is specified in minutes and the default value is 0 |
| (kill the job immediately). |
| The value may not exceed 65533 minutes, although a value of "UNLIMITED" |
| is supported to permit a job to run indefinitely after its reservation |
| is terminated. |
| |
| .TP |
| \fBReturnToService\fR |
| Controls when a DOWN node will be returned to service. |
| The default value is 0. |
| Supported values include |
| .RS |
| .TP 4 |
| \fB0\fR |
| A node will remain in the DOWN state until a system administrator |
| explicitly changes its state (even if the slurmd daemon registers |
| and resumes communications). |
| .TP |
| \fB1\fR |
| A DOWN node will become available for use upon registration with a |
| valid configuration only if it was set DOWN due to being non\-responsive. |
| If the node was set DOWN for any other reason (low memory, prolog failure, |
| epilog failure, silently rebooting, etc.), its state will not automatically |
| be changed. |
| .TP |
| \fB2\fR |
| A DOWN node will become available for use upon registration with a |
| valid configuration. The node could have been set DOWN for any reason. |
| .RE |
| |
| .TP |
| \fBSallocDefaultCommand\fR |
| Normally, \fBsalloc\fR(1) will run the user's default shell when |
| a command to execute is not specified on the \fBsalloc\fR command line. |
| If \fBSallocDefaultCommand\fR is specified, \fBsalloc\fR will instead |
| run the configured command. The command is passed to '/bin/sh \-c', so |
| shell metacharacters are allowed, and commands with multiple arguments |
| should be quoted. For instance: |
| |
| .nf |
| SallocDefaultCommand = "$SHELL" |
| .fi |
| |
would run the shell specified in the user's $SHELL environment variable,
and
| |
| .nf |
| SallocDefaultCommand = "xterm \-T Job_$SLURM_JOB_ID" |
| .fi |
| |
| would run \fBxterm\fR with the title set to the SLURM jobid. |
| |
| .TP |
| \fBSchedulerParameters\fR |
| The interpretation of this parameter varies by \fBSchedulerType\fR. |
| Multiple options may be comma separated. |
| .RS |
| .TP |
| \fBdefault_queue_depth=#\fR |
| The default number of jobs to attempt scheduling (i.e. the queue depth) when a |
| running job completes or other routine actions occur. The full queue will be |
| tested on a less frequent basis. The default value is 100. |
| In the case of large clusters (more than 1000 nodes), configuring a relatively |
| small value may be desirable. |
| .TP |
| \fBdefer\fR |
| Setting this option will avoid attempting to schedule each job |
| individually at job submit time, but defer it until a later time when |
| scheduling multiple jobs simultaneously may be possible. |
| This option may improve system responsiveness when large numbers of jobs |
| (many hundreds) are submitted at the same time, but it will delay the |
| initiation time of individual jobs. Also see \fBdefault_queue_depth\fR above. |
| .TP |
| \fBbf_interval=#\fR |
| The number of seconds between iterations. |
| Higher values result in less overhead and less responsiveness. |
| The default value is 30 seconds. |
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| .TP |
| \fBbf_window=#\fR |
| The number of minutes into the future to look when considering jobs to schedule. |
| Higher values result in more overhead and less responsiveness. |
| The default value is 1440 minutes (one day). |
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| .TP |
| \fBmax_job_bf=#\fR |
| The maximum number of jobs to attempt backfill scheduling for |
| (i.e. the queue depth). |
| Higher values result in more overhead and less responsiveness. |
| Until an attempt is made to backfill schedule a job, its expected |
| initiation time value will not be set. |
| The default value is 50. |
| In the case of large clusters (more than 1000 nodes) configured with |
| \fBSelectType=select/cons_res\fR, configuring a relatively small value may be |
| desirable. |
| This option applies only to \fBSchedulerType=sched/backfill\fR. |
| .RE |
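| |
| For example (a minimal sketch), backfill behavior might be tuned with: |
| |
| .nf |
| SchedulerType=sched/backfill |
| SchedulerParameters=defer,bf_interval=60,bf_window=2880,max_job_bf=100 |
| .fi |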
| |
| .TP |
| \fBSchedulerPort\fR |
| The port number on which slurmctld should listen for connection requests. |
| This value is only used by the Maui Scheduler (see \fBSchedulerType\fR). |
| The default value is 7321. |
| |
| .TP |
| \fBSchedulerRootFilter\fR |
| Identifies whether or not \fBRootOnly\fR partitions should be filtered from |
| any external scheduling activities. If set to 0, then \fBRootOnly\fR partitions |
| are treated like any other partition. If set to 1, then \fBRootOnly\fR |
| partitions are exempt from any external scheduling activities. The |
| default value is 1. Currently only used by the built\-in backfill |
| scheduling module "sched/backfill" (see \fBSchedulerType\fR). |
| |
| .TP |
| \fBSchedulerTimeSlice\fR |
| Number of seconds in each time slice when gang scheduling is enabled |
| (\fBPreemptMode=GANG\fR). |
| The default value is 30 seconds. |
| |
| .TP |
| \fBSchedulerType\fR |
| Identifies the type of scheduler to be used. |
| Note the \fBslurmctld\fR daemon must be restarted for a change in |
| scheduler type to become effective (reconfiguring a running daemon has |
| no effect for this parameter). |
| The \fBscontrol\fR command can be used to manually change job priorities |
| if desired. |
| Acceptable values include: |
| .RS |
| .TP |
| \fBsched/builtin\fR |
| for the built\-in FIFO (First In First Out) scheduler. |
| This is the default. |
| .TP |
| \fBsched/backfill\fR |
| for a backfill scheduling module to augment the default FIFO scheduling. |
| Backfill scheduling will initiate lower\-priority jobs if doing |
| so does not delay the expected initiation time of any higher |
| priority job. |
| Effectiveness of backfill scheduling is dependent upon users specifying |
| job time limits, otherwise all jobs will have the same time limit and |
| backfilling is impossible. |
| See the documentation for the \fBSchedulerParameters\fR option above. |
| .TP |
| \fBsched/gang\fR |
| Defunct option. See \fBPreemptType\fR and \fBPreemptMode\fR options. |
| .TP |
| \fBsched/hold\fR |
| to hold all newly arriving jobs if the file "/etc/slurm.hold" |
| exists; otherwise the built\-in FIFO scheduler is used |
| .TP |
| \fBsched/wiki\fR |
| for the Wiki interface to the Maui Scheduler |
| .TP |
| \fBsched/wiki2\fR |
| for the Wiki interface to the Moab Cluster Suite |
| .RE |
| |
| .TP |
| \fBSelectType\fR |
| Identifies the type of resource selection algorithm to be used. |
| Acceptable values include |
| .RS |
| .TP |
| \fBselect/linear\fR |
| for allocation of entire nodes assuming a |
| one\-dimensional array of nodes in which sequentially ordered |
| nodes are preferable. |
| This is the default value for non\-BlueGene systems. |
| .TP |
| \fBselect/cons_res\fR |
| The resources within a node are individually allocated as |
| consumable resources. |
| Note that whole nodes can be allocated to jobs for selected |
| partitions by using the \fIShared=Exclusive\fR option. |
| See the partition \fBShared\fR parameter for more information. |
| .TP |
| \fBselect/bluegene\fR |
| for a three\-dimensional BlueGene system. |
| The default value is "select/bluegene" for BlueGene systems. |
| .RE |
| |
| .TP |
| \fBSelectTypeParameters\fR |
| The permitted values of \fBSelectTypeParameters\fR depend upon the |
| configured value of \fBSelectType\fR. |
| \fBSelectType=select/bluegene\fR supports no \fBSelectTypeParameters\fR. |
| The only supported options for \fBSelectType=select/linear\fR are |
| \fBCR_ONE_TASK_PER_CORE\fR and |
| \fBCR_Memory\fR, which treats memory as a consumable resource and |
| prevents memory over\-subscription with job preemption or gang scheduling. |
| The following values are supported for \fBSelectType=select/cons_res\fR |
| (see the example following this list): |
| .RS |
| .TP |
| \fBCR_CPU\fR |
| CPUs are consumable resources. |
| There is no notion of sockets, cores or threads; |
| do not define those values in the node specification. If these |
| are defined, unexpected results will occur when hyper\-threading |
| is enabled; Procs= should be used instead. |
| On a multi\-core system, each core will be considered a CPU. |
| On a multi\-core and hyper\-threaded system, each thread will be |
| considered a CPU. |
| On single\-core systems, each core will be considered a CPU. |
| .TP |
| \fBCR_CPU_Memory\fR |
| CPUs and memory are consumable resources. |
| There is no notion of sockets, cores or threads; |
| do not define those values in the node specification. If these |
| are defined, unexpected results will occur when hyper\-threading |
| is enabled; Procs= should be used instead. |
| Setting a value for \fBDefMemPerCPU\fR is strongly recommended. |
| .TP |
| \fBCR_Core\fR |
| Cores are consumable resources. |
| On nodes with hyper\-threads, each thread is counted as a CPU to |
| satisfy a job's resource requirement, but multiple jobs are not |
| allocated threads on the same core. |
| .TP |
| \fBCR_Core_Memory\fR |
| Cores and memory are consumable resources. |
| On nodes with hyper\-threads, each thread is counted as a CPU to |
| satisfy a job's resource requirement, but multiple jobs are not |
| allocated threads on the same core. |
| Setting a value for \fBDefMemPerCPU\fR is strongly recommended. |
| .TP |
| \fBCR_ONE_TASK_PER_CORE\fR |
| Allocate one task per core by default. |
| Without this option, by default one task will be allocated per |
| thread on nodes with more than one \fBThreadsPerCore\fR configured. |
| .TP |
| \fBCR_CORE_DEFAULT_DIST_BLOCK\fR |
| Allocate cores using block distribution by default. |
| This default behavior can be overridden by specifying a particular |
| "\-m" parameter with srun/salloc/sbatch. |
| Without this option, cores will be allocated cyclically across the sockets. |
| .TP |
| \fBCR_Socket\fR |
| Sockets are consumable resources. |
| On nodes with multiple cores, each core or thread is counted as a CPU |
| to satisfy a job's resource requirement, but multiple jobs are not |
| allocated resources on the same socket. |
| Note that jobs requesting one CPU will only be allocated |
| that one CPU, but no other job will share the socket. |
| .TP |
| \fBCR_Socket_Memory\fR |
| Memory and sockets are consumable resources. |
| On nodes with multiple cores, each core or thread is counted as a CPU |
| to satisfy a job's resource requirement, but multiple jobs are not |
| allocated resources on the same socket. |
| Note that jobs requesting one CPU will only be allocated |
| that one CPU, but no other job will share the socket. |
| Setting a value for \fBDefMemPerCPU\fR is strongly recommended. |
| .TP |
| \fBCR_Memory\fR |
| Memory is a consumable resource. |
| NOTE: This implies \fIShared=YES\fR or \fIShared=FORCE\fR for all partitions. |
| Setting a value for \fBDefMemPerCPU\fR is strongly recommended. |
| .RE |
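| |
| As an illustrative sketch (the memory value is hypothetical), a cluster |
| allocating individual cores and memory might be configured with: |
| |
| .nf |
| SelectType=select/cons_res |
| SelectTypeParameters=CR_Core_Memory |
| DefMemPerCPU=512 |
| .fi |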
| |
| .TP |
| \fBSlurmUser\fR |
| The name of the user that the \fBslurmctld\fR daemon executes as. |
| For security purposes, a user other than "root" is recommended. |
| This user must exist on all nodes of the cluster for authentication |
| of communications between SLURM components. |
| The default value is "root". |
| |
| .TP |
| \fBSlurmdUser\fR |
| The name of the user that the \fBslurmd\fR daemon executes as. |
| This user must exist on all nodes of the cluster for authentication |
| of communications between SLURM components. |
| The default value is "root". |
| |
| .TP |
| \fBSlurmctldDebug\fR |
| The level of detail to provide in the \fBslurmctld\fR daemon's logs. |
| Values from 0 to 9 are legal, with `0' being "quiet" operation and `9' |
| being insanely verbose. |
| The default value is 3. |
| |
| .TP |
| \fBSlurmctldLogFile\fR |
| Fully qualified pathname of a file into which the \fBslurmctld\fR daemon's |
| logs are written. |
| The default value is none (performs logging via syslog). |
| |
| .TP |
| \fBSlurmctldPidFile\fR |
| Fully qualified pathname of a file into which the \fBslurmctld\fR daemon |
| may write its process id. This may be used for automated signal processing. |
| The default value is "/var/run/slurmctld.pid". |
| |
| .TP |
| \fBSlurmctldPort\fR |
| The port number that the SLURM controller, \fBslurmctld\fR, listens |
| to for work. The default value is SLURMCTLD_PORT as established at system |
| build time. If none is explicitly specified, it will be set to 6817. |
| \fBSlurmctldPort\fR may also be configured to support a range of port |
| numbers in order to accept larger bursts of incoming messages by specifying |
| two numbers separated by a dash (e.g. \fBSlurmctldPort=6817\-6818\fR). |
| NOTE: If the \fBslurmctld\fR and \fBslurmd\fR daemons execute on the |
| same nodes, the values of \fBSlurmctldPort\fR and |
| \fBSlurmdPort\fR must be different. |
| |
| .TP |
| \fBSlurmctldTimeout\fR |
| The interval, in seconds, that the backup controller waits for the |
| primary controller to respond before assuming control. |
| The default value is 120 seconds. |
| May not exceed 65533. |
| |
| .TP |
| \fBSlurmdDebug\fR |
| The level of detail to provide in the \fBslurmd\fR daemon's logs. |
| Values from 0 to 9 are legal, with `0' being "quiet" operation and `9' being |
| insanely verbose. |
| The default value is 3. |
| |
| .TP |
| \fBSlurmdLogFile\fR |
| Fully qualified pathname of a file into which the \fBslurmd\fR daemon's |
| logs are written. |
| The default value is none (performs logging via syslog). |
| Any "%h" within the name is replaced with the hostname on which the |
| \fBslurmd\fR is running. |
| |
| .TP |
| \fBSlurmdPidFile\fR |
| Fully qualified pathname of a file into which the \fBslurmd\fR daemon may write |
| its process id. This may be used for automated signal processing. |
| The default value is "/var/run/slurmd.pid". |
| |
| .TP |
| \fBSlurmdPort\fR |
| The port number that the SLURM compute node daemon, \fBslurmd\fR, listens |
| to for work. The default value is SLURMD_PORT as established at system |
| build time. If none is explicitly specified, its value will be 6818. |
| NOTE: If the slurmctld and slurmd daemons execute |
| on the same nodes, the values of \fBSlurmctldPort\fR and \fBSlurmdPort\fR |
| must be different. |
| |
| .TP |
| \fBSlurmdSpoolDir\fR |
| Fully qualified pathname of a directory into which the \fBslurmd\fR |
| daemon's state information and batch job script information are written. This |
| must be a common pathname for all nodes, but should represent a directory which |
| is local to each node (reference a local file system). The default value |
| is "/var/spool/slurmd." \fBNOTE\fR: This directory is also used to store |
| \fBslurmd\fR's |
| shared memory lockfile, and \fBshould not be changed\fR unless the system |
| is being cleanly restarted. If the location of \fBSlurmdSpoolDir\fR is |
| changed and \fBslurmd\fR is restarted, the new daemon will attach to a |
| different shared memory region and lose track of any running jobs. |
| |
| .TP |
| \fBSlurmdTimeout\fR |
| The interval, in seconds, that the SLURM controller waits for \fBslurmd\fR |
| to respond before configuring that node's state to DOWN. |
| A value of zero indicates the node will not be tested by \fBslurmctld\fR to |
| confirm the state of \fBslurmd\fR, the node will not be automatically set to |
| a DOWN state indicating a non\-responsive \fBslurmd\fR, and some other tool |
| will take responsibility for monitoring the state of each compute node |
| and its \fBslurmd\fR daemon. |
| SLURM's hierarchical communication mechanism is used to ping the \fBslurmd\fR |
| daemons in order to minimize system noise and overhead. |
| The default value is 300 seconds. |
| The value may not exceed 65533 seconds. |
| |
| .TP |
| \fBSlurmSchedLogFile\fR |
| Fully qualified pathname of the scheduling event logging file. |
| The syntax of this parameter is the same as for \fBSlurmctldLogFile\fR. |
| In order to configure scheduler logging, set both the \fBSlurmSchedLogFile\fR |
| and \fBSlurmSchedLogLevel\fR parameters. |
| |
| .TP |
| \fBSlurmSchedLogLevel\fR |
| The initial level of scheduling event logging, similar to the |
| \fBSlurmctldDebug\fR parameter used to control the initial level of |
| \fBslurmctld\fR logging. |
| Valid values for \fBSlurmSchedLogLevel\fR are "0" (scheduler logging |
| disabled) and "1" (scheduler logging enabled). |
| If this parameter is omitted, the value defaults to "0" (disabled). |
| In order to configure scheduler logging, set both the \fBSlurmSchedLogFile\fR |
| and \fBSlurmSchedLogLevel\fR parameters. |
| The scheduler logging level can be changed dynamically using \fBscontrol\fR. |
| |
| .TP |
| \fBSrunEpilog\fR |
| Fully qualified pathname of an executable to be run by srun following |
| the completion of a job step. The command line arguments for the |
| executable will be the command and arguments of the job step. This |
| configuration parameter may be overridden by srun's \fB\-\-epilog\fR |
| parameter. Note that while the other "Epilog" executables (e.g., |
| TaskEpilog) are run by slurmd on the compute nodes where the tasks are |
| executed, the \fBSrunEpilog\fR runs on the node where the "srun" is |
| executing. |
| |
| .TP |
| \fBSrunProlog\fR |
| Fully qualified pathname of an executable to be run by srun prior to |
| the launch of a job step. The command line arguments for the |
| executable will be the command and arguments of the job step. This |
| configuration parameter may be overridden by srun's \fB\-\-prolog\fR |
| parameter. Note that while the other "Prolog" executables (e.g., |
| TaskProlog) are run by slurmd on the compute nodes where the tasks are |
| executed, the \fBSrunProlog\fR runs on the node where the "srun" is |
| executing. |
| |
| .TP |
| \fBStateSaveLocation\fR |
| Fully qualified pathname of a directory into which the SLURM controller, |
| \fBslurmctld\fR, saves its state (e.g. "/usr/local/slurm/checkpoint"). |
| SLURM state will be saved here to recover from system failures. |
| \fBSlurmUser\fR must be able to create files in this directory. |
| If you have a \fBBackupController\fR configured, this location should be |
| readable and writable by both systems. |
| Since all running and pending job information is stored here, the use of |
| a reliable file system (e.g. RAID) is recommended. |
| The default value is "/tmp". |
| If any slurm daemons terminate abnormally, their core files will also be written |
| into this directory. |
| |
| .TP |
| \fBSuspendExcNodes\fR |
| Specifies the nodes which are not to be placed in power save mode, even |
| if the node remains idle for an extended period of time. |
| Use SLURM's hostlist expression to identify nodes. |
| By default no nodes are excluded. |
| Related configuration options include \fBResumeTimeout\fR, \fBResumeProgram\fR, |
| \fBResumeRate\fR, \fBSuspendProgram\fR, \fBSuspendRate\fR, \fBSuspendTime\fR, |
| \fBSuspendTimeout\fR, and \fBSuspendExcParts\fR. |
| |
| .TP |
| \fBSuspendExcParts\fR |
| Specifies the partitions whose nodes are not to be placed in power save |
| mode, even if the node remains idle for an extended period of time. |
| Multiple partitions can be identified and separated by commas. |
| By default no nodes are excluded. |
| Related configuration options include \fBResumeTimeout\fR, \fBResumeProgram\fR, |
| \fBResumeRate\fR, \fBSuspendProgram\fR, \fBSuspendRate\fR, \fBSuspendTime\fR, |
| \fBSuspendTimeout\fR, and \fBSuspendExcNodes\fR. |
| |
| .TP |
| \fBSuspendProgram\fR |
| \fBSuspendProgram\fR is the program that will be executed when a node |
| remains idle for an extended period of time. |
| This program is expected to place the node into some power save mode. |
| This can be used to reduce the frequency and voltage of a node or |
| completely power the node off. |
| The program executes as \fBSlurmUser\fR. |
| The argument to the program will be the names of nodes to |
| be placed into power savings mode (using SLURM's hostlist |
| expression format). |
| By default, no program is run. |
| Related configuration options include \fBResumeTimeout\fR, \fBResumeProgram\fR, |
| \fBResumeRate\fR, \fBSuspendRate\fR, \fBSuspendTime\fR, \fBSuspendTimeout\fR, |
| \fBSuspendExcNodes\fR, and \fBSuspendExcParts\fR. |
| |
| .TP |
| \fBSuspendRate\fR |
| The rate at which nodes are placed into power save mode by \fBSuspendProgram\fR. |
| The value is the number of nodes per minute and it can be used to prevent |
| a large drop in power consumption (e.g. after a large job completes). |
| A value of zero results in no limits being imposed. |
| The default value is 60 nodes per minute. |
| Related configuration options include \fBResumeTimeout\fR, \fBResumeProgram\fR, |
| \fBResumeRate\fR, \fBSuspendProgram\fR, \fBSuspendTime\fR, \fBSuspendTimeout\fR, |
| \fBSuspendExcNodes\fR, and \fBSuspendExcParts\fR. |
| |
| .TP |
| \fBSuspendTime\fR |
| Nodes which remain idle for this number of seconds will be placed into |
| power save mode by \fBSuspendProgram\fR. |
| A value of \-1 disables power save mode and is the default. |
| Related configuration options include \fBResumeTimeout\fR, \fBResumeProgram\fR, |
| \fBResumeRate\fR, \fBSuspendProgram\fR, \fBSuspendRate\fR, \fBSuspendTimeout\fR, |
| \fBSuspendExcNodes\fR, and \fBSuspendExcParts\fR. |
| |
| .TP |
| \fBSuspendTimeout\fR |
| Maximum time permitted (in seconds) between when a node suspend request |
| is issued and when the node shuts down. |
| At that time the node must be ready for a resume request to be issued |
| as needed for new work. |
| The default value is 30 seconds. |
| Related configuration options include \fBResumeProgram\fR, \fBResumeRate\fR, |
| \fBResumeTimeout\fR, \fBSuspendRate\fR, \fBSuspendTime\fR, \fBSuspendProgram\fR, |
| \fBSuspendExcNodes\fR and \fBSuspendExcParts\fR. |
| More information is available at the SLURM web site |
| (https://computing.llnl.gov/linux/slurm/power_save.html). |
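| |
| As an illustrative sketch of how the power saving options fit together |
| (the program paths and node names are hypothetical): |
| |
| .nf |
| SuspendProgram=/usr/local/sbin/node_suspend |
| ResumeProgram=/usr/local/sbin/node_resume |
| SuspendTime=600 |
| SuspendRate=20 |
| ResumeRate=20 |
| SuspendTimeout=30 |
| ResumeTimeout=60 |
| SuspendExcNodes=login[0\-1] |
| .fi |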
| |
| .TP |
| \fBSwitchType\fR |
| Identifies the type of switch or interconnect used for application |
| communications. |
| Acceptable values include |
| "switch/none" for switches not requiring special processing for job launch |
| or termination (Myrinet, Ethernet, and InfiniBand), |
| "switch/elan" for Quadrics Elan 3 or Elan 4 interconnect. |
| The default value is "switch/none". |
| All SLURM daemons, commands and running jobs must be restarted for a |
| change in \fBSwitchType\fR to take effect. |
| If running jobs exist at the time \fBslurmctld\fR is restarted with a new |
| value of \fBSwitchType\fR, records of all jobs in any state may be lost. |
| |
| .TP |
| \fBTaskEpilog\fR |
| Fully qualified pathname of a program to be executed as the slurm job's |
| owner after termination of each task. |
| See \fBTaskProlog\fR for execution order details. |
| |
| .TP |
| \fBTaskPlugin\fR |
| Identifies the type of task launch plugin, typically used to provide |
| resource management within a node (e.g. pinning tasks to specific |
| processors). |
| Acceptable values include |
| "task/none" for systems requiring no special handling and |
| "task/affinity" to enable the \-\-cpu_bind and/or \-\-mem_bind |
| srun options. |
| The default value is "task/none". |
| If you "task/affinity" and encounter problems, it may be due to |
| the variety of system calls used to implement task affinity on |
| different operating systems. |
| If that is the case, you may want to use Portable Linux |
| Process Affinity (PLPA, see http://www.open-mpi.org/software/plpa), |
| which is supported by SLURM. |
| |
| .TP |
| \fBTaskPluginParam\fR |
| Optional parameters for the task plugin. |
| Multiple options should be comma separated; an example follows the list below. |
| If \fBNone\fR, \fBSockets\fR, \fBCores\fR, \fBThreads\fR, |
| and/or \fBVerbose\fR are specified, they will override |
| the \fB\-\-cpu_bind\fR option specified by the user |
| in the \fBsrun\fR command. |
| \fBNone\fR, \fBSockets\fR, \fBCores\fR and \fBThreads\fR are mutually |
| exclusive and since they decrease scheduling flexibility are not generally |
| recommended (select no more than one of them). |
| \fBCpusets\fR and \fBSched\fR |
| are mutually exclusive (select only one of them). |
| |
| .RS |
| .TP 10 |
| \fBCores\fR |
| Always bind to cores. |
| Overrides user options or automatic binding. |
| .TP |
| \fBCpusets\fR |
| Use cpusets to perform task affinity functions. |
| By default, \fBSched\fR task binding is performed. |
| .TP |
| \fBNone\fR |
| Perform no task binding. |
| Overrides user options or automatic binding. |
| .TP |
| \fBSched\fR |
| Use \fIsched_setaffinity\fR or \fIplpa_sched_setaffinity\fR |
| (if available) to bind tasks to processors. |
| .TP |
| \fBSockets\fR |
| Always bind to sockets. |
| Overrides user options or automatic binding. |
| .TP |
| \fBThreads\fR |
| Always bind to threads. |
| Overrides user options or automatic binding. |
| .TP |
| \fBVerbose\fR |
| Verbosely report binding before tasks run. |
| Overrides user options. |
| .RE |
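| |
| For example (a minimal sketch), task affinity using cpusets with verbose |
| binding reports might be configured with: |
| |
| .nf |
| TaskPlugin=task/affinity |
| TaskPluginParam=Cpusets,Verbose |
| .fi |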
| |
| .TP |
| \fBTaskProlog\fR |
| Fully qualified pathname of a program to be executed as the slurm job's |
| owner prior to initiation of each task. |
| Besides the normal environment variables, this has SLURM_TASK_PID |
| available to identify the process ID of the task being started. |
| Standard output from this program can be used to control the environment |
| variables and output for the user program, as illustrated by the |
| example following this list. |
| .RS |
| .TP 20 |
| \fBexport NAME=value\fR |
| Will set environment variables for the task being spawned. |
| Everything after the equal sign to the end of the |
| line will be used as the value for the environment variable. |
| Exporting of functions is not currently supported. |
| .TP |
| \fBprint ...\fR |
| Will cause that line (without the leading "print ") |
| to be printed to the job's standard output. |
| .TP |
| \fBunset NAME\fR |
| Will clear environment variables for the task being spawned. |
| .TP |
| The order of task prolog/epilog execution is as follows: |
| .TP |
| \fB1. pre_launch()\fR |
| Function in TaskPlugin |
| .TP |
| \fB2. TaskProlog\fR |
| System\-wide per task program defined in slurm.conf |
| .TP |
| \fB3. user prolog\fR |
| Job step specific task program defined using |
| \fBsrun\fR's \fB\-\-task\-prolog\fR option or \fBSLURM_TASK_PROLOG\fR |
| environment variable |
| .TP |
| \fB4.\fR Execute the job step's task |
| .TP |
| \fB5. user epilog\fR |
| Job step specific task program defined using |
| \fBsrun\fR's \fB\-\-task\-epilog\fR option or \fBSLURM_TASK_EPILOG\fR |
| environment variable |
| .TP |
| \fB6. TaskEpilog\fR |
| System\-wide per task program defined in slurm.conf |
| .TP |
| \fB7. post_term()\fR |
| Function in TaskPlugin |
| .RE |
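| |
| As an illustration, a minimal \fBTaskProlog\fR script (entirely |
| hypothetical) might emit the directives described above on its |
| standard output: |
| |
| .nf |
| #!/bin/sh |
| # Lines written to standard output are interpreted by SLURM as above. |
| echo "export SCRATCH=/tmp/job_$SLURM_JOB_ID" |
| echo "print task $SLURM_TASK_PID starting" |
| echo "unset TMPDIR" |
| .fi |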
| |
| .TP |
| \fBTmpFS\fR |
| Fully qualified pathname of the file system available to user jobs for |
| temporary storage. This parameter is used in establishing a node's \fBTmpDisk\fR |
| space. |
| The default value is "/tmp". |
| |
| .TP |
| \fBTopologyPlugin\fR |
| Identifies the plugin to be used for determining the network topology |
| and optimizing job allocations to minimize network contention. |
| See \fBNETWORK TOPOLOGY\fR below for details. |
| Additional plugins may be provided in the future which gather topology |
| information directly from the network. |
| Acceptable values include: |
| .RS |
| .TP 21 |
| \fBtopology/3d_torus\fR |
| default for Sun Constellation |
| systems, best\-fit logic over three\-dimensional topology |
| .TP |
| \fBtopology/node_rank\fR |
| default for Cray computers, orders nodes based upon information in the |
| ALPS database and then performs a best\-fit algorithm over those |
| ordered nodes |
| .TP |
| \fBtopology/none\fR |
| default for other systems, best\-fit logic over one\-dimensional topology |
| .TP |
| \fBtopology/tree\fR |
| used for a hierarchical network as described in a \fItopology.conf\fR file |
| .RE |
| |
| .TP |
| \fBTrackWCKey\fR |
| Boolean yes or no. Enables the display and tracking of the Workload |
| Characterization Key. Must be set to "yes" for wckey usage to be tracked. |
| |
| .TP |
| \fBTreeWidth\fR |
| \fBSlurmd\fR daemons use a virtual tree network for communications. |
| \fBTreeWidth\fR specifies the width of the tree (i.e. the fanout). |
| The default value is 50, meaning each slurmd daemon can communicate |
| with up to 50 other slurmd daemons and over 2500 nodes can be contacted |
| with two message hops. |
| The default value will work well for most clusters. |
| Optimal system performance can typically be achieved if \fBTreeWidth\fR |
| is set to the square root of the number of nodes in the cluster for |
| systems having no more than 2500 nodes or the cube root for larger |
| systems. |
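| |
| For example, on a hypothetical cluster of 400 nodes, the square root rule |
| suggests: |
| |
| .nf |
| TreeWidth=20 |
| .fi |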
| |
| .TP |
| \fBUnkillableStepProgram\fR |
| If the processes in a job step are determined to be unkillable for a period |
| of time specified by the \fBUnkillableStepTimeout\fR variable, the program |
| specified by \fBUnkillableStepProgram\fR will be executed. |
| This program can be used to take special actions to clean up the unkillable |
| processes and/or notify computer administrators. |
| The program will be run as \fBSlurmdUser\fR (usually "root"). |
| By default no program is run. |
| |
| .TP |
| \fBUnkillableStepTimeout\fR |
| The length of time, in seconds, that SLURM will wait before deciding that |
| processes in a job step are unkillable (after they have been signaled with |
| SIGKILL) and execute \fBUnkillableStepProgram\fR as described above. |
| The default timeout value is 60 seconds. |
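| |
| For example (the program path is hypothetical): |
| |
| .nf |
| UnkillableStepProgram=/usr/local/sbin/unkillable_notify |
| UnkillableStepTimeout=120 |
| .fi |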
| |
| .TP |
| \fBUsePAM\fR |
| If set to 1, PAM (Pluggable Authentication Modules for Linux) will be enabled. |
| PAM is used to establish the upper bounds for resource limits. With PAM support |
| enabled, local system administrators can dynamically configure system resource |
| limits. Changing the upper bound of a resource limit will not alter the limits |
| of running jobs, only jobs started after a change has been made will pick up |
| the new limits. |
| The default value is 0 (not to enable PAM support). |
| Remember that PAM also needs to be configured to support SLURM as a service. |
| For sites using PAM's directory based configuration option, a configuration |
| file named \fBslurm\fR should be created. The module\-type, control\-flags, and |
| module\-path names that should be included in the file are: |
| .br |
| auth required pam_localuser.so |
| .br |
| auth required pam_shells.so |
| .br |
| account required pam_unix.so |
| .br |
| account required pam_access.so |
| .br |
| session required pam_unix.so |
| .br |
| For sites configuring PAM with a general configuration file, the appropriate |
| lines (see above), where \fBslurm\fR is the service\-name, should be added. |
| |
| .TP |
| \fBVSizeFactor\fR |
| Memory specifications in job requests apply to real memory size (also known |
| as resident set size). It is possible to enforce virtual memory limits for |
| both jobs and job steps by limiting their virtual memory to some percentage |
| of their real memory allocation. The \fBVSizeFactor\fR parameter specifies |
| the job's or job step's virtual memory limit as a percentage of its real |
| memory limit. For example, if a job's real memory limit is 500MB and |
| VSizeFactor is set to 101 then the job will be killed if its real memory |
| exceeds 500MB or its virtual memory exceeds 505MB (101 percent of the |
| real memory limit). |
| The default value is 0, which disables enforcement of virtual memory limits. |
| The value may not exceed 65533 percent. |
| |
| .TP |
| \fBWaitTime\fR |
| Specifies how many seconds the srun command should by default wait after |
| the first task terminates before terminating all remaining tasks. The |
| "\-\-wait" option on the srun command line overrides this value. |
| If set to 0, this feature is disabled. |
| May not exceed 65533 seconds. |
| |
| .LP |
| The configuration of nodes (or machines) to be managed by SLURM is |
| also specified in \fB/etc/slurm.conf\fR. |
| Changes in node configuration (e.g. adding nodes, changing their |
| processor count, etc.) require restarting the slurmctld daemon. |
| Only the NodeName must be supplied in the configuration file. |
| All other node configuration information is optional. |
| It is advisable to establish baseline node configurations, |
| especially if the cluster is heterogeneous. |
| Nodes which register to the system with less than the configured resources |
| (e.g. too little memory) will be placed in the "DOWN" state to |
| avoid scheduling jobs on them. |
| Establishing baseline configurations will also speed SLURM's |
| scheduling process by permitting it to compare job requirements |
| against these (relatively few) configuration parameters and |
| possibly avoid having to check job requirements |
| against every individual node's configuration. |
| The resources checked at node registration time are: Procs, |
| RealMemory and TmpDisk. |
| While baseline values for each of these can be established |
| in the configuration file, the actual values upon node |
| registration are recorded and these actual values may be |
| used for scheduling purposes (depending upon the value of |
| \fBFastSchedule\fR in the configuration file). |
| .LP |
| Default values can be specified with a record in which |
| "NodeName" is "DEFAULT". |
| The default entry values will apply only to lines following it in the |
| configuration file and the default values can be reset multiple times |
| in the configuration file with multiple entries where "NodeName=DEFAULT". |
| The "NodeName=" specification must be placed on every line |
| describing the configuration of nodes. |
| In fact, it is generally possible and desirable to define the |
| configurations of all nodes in only a few lines. |
| This convention permits significant optimization in the scheduling |
| of larger clusters. |
| In order to support the concept of jobs requiring consecutive nodes |
| on some architectures, |
| node specifications should be placed in this file in consecutive order. |
| No single node name may be listed more than once in the configuration |
| file. |
| Use "DownNodes=" to record the state of nodes which are temporarily |
| in a DOWN, DRAIN or FAILING state without altering permanent |
| configuration information. |
| A job step's tasks are allocated to nodes in the order the nodes appear |
| in the configuration file. There is presently no capability within |
| SLURM to arbitrarily order a job step's tasks. |
| .LP |
| Multiple node names may be comma separated (e.g. "alpha,beta,gamma") |
| and/or a simple node range expression may optionally be used to |
| specify numeric ranges of nodes to avoid building a configuration |
| file with large numbers of entries. |
| The node range expression can contain one pair of square brackets |
| with a sequence of comma separated numbers and/or ranges of numbers |
| separated by a "\-" (e.g. "linux[0\-64,128]", or "lx[15,18,32\-33]"). |
| Note that the numeric ranges can include one or more leading |
| zeros to indicate the numeric portion has a fixed number of digits |
| (e.g. "linux[0000\-1023]"). |
| Up to two numeric ranges can be included in the expression |
| (e.g. "rack[0\-63]_blade[0\-41]"). |
| If one or more numeric expressions are included, one of them |
| must be at the end of the name (e.g. "unit[0\-31]rack" is invalid), |
| but arbitrary names can always be used in a comma separated list. |
| .LP |
| On BlueGene systems only, the square brackets should contain |
| pairs of three digit numbers separated by a "x". |
| These numbers indicate the boundaries of a rectangular prism |
| (e.g. "bgl[000x144,400x544]"). |
| See BlueGene documentation for more details. |
| The node configuration specifies the following information: |
| |
| .TP |
| \fBNodeName\fR |
| Name that SLURM uses to refer to a node (or base partition for |
| BlueGene systems). |
| Typically this would be the string that "/bin/hostname \-s" returns. |
| It may also be the fully qualified domain name as returned by "/bin/hostname \-f" |
| (e.g. "foo1.bar.com"), or any valid domain name associated with the host |
| through the host database (/etc/hosts) or DNS, depending on the resolver |
| settings. Note that if the short form of the hostname is not used, it |
| may prevent use of hostlist expressions (the numeric portion in brackets |
| must be at the end of the string). |
| Only short hostname forms are compatible with the |
| switch/elan and switch/federation plugins at this time. |
| It may also be an arbitrary string if \fBNodeHostname\fR is specified. |
| If the \fBNodeName\fR is "DEFAULT", the values specified |
| with that record will apply to subsequent node specifications |
| unless explicitly set to other values in that node record or |
| replaced with a different set of default values. |
| For architectures in which the node order is significant, |
| nodes will be considered consecutive in the order defined. |
| For example, if the configuration for "NodeName=charlie" immediately |
| follows the configuration for "NodeName=baker" they will be |
| considered adjacent in the computer. |
| |
| .TP |
| \fBNodeHostname\fR |
| Typically this would be the string that "/bin/hostname \-s" returns. |
| It may also be the fully qualified domain name as returned by "/bin/hostname \-f" |
| (e.g. "foo1.bar.com"), or any valid domain name associated with the host |
| through the host database (/etc/hosts) or DNS, depending on the resolver |
| settings. Note that if the short form of the hostname is not used, it |
| may prevent use of hostlist expressions (the numeric portion in brackets |
| must be at the end of the string). |
| Only short hostname forms are compatible with the |
| switch/elan and switch/federation plugins at this time. |
| A node range expression can be used to specify a set of nodes. |
| If an expression is used, the number of nodes identified by |
| \fBNodeHostname\fR on a line in the configuration file must |
| be identical to the number of nodes identified by \fBNodeName\fR. |
| By default, the \fBNodeHostname\fR will be identical in value to |
| \fBNodeName\fR. |
| |
| .TP |
| \fBNodeAddr\fR |
| Name by which a node should be referred to in establishing |
| a communications path. |
| This name will be used as an |
| argument to the gethostbyname() function for identification. |
| If a node range expression is used to designate multiple nodes, |
| they must exactly match the entries in the \fBNodeName\fR |
| (e.g. "NodeName=lx[0\-7] NodeAddr="elx[0\-7]"). |
| \fBNodeAddr\fR may also contain IP addresses. |
| By default, the \fBNodeAddr\fR will be identical in value to |
| \fBNodeName\fR. |
| |
| .TP |
| \fBCoresPerSocket\fR |
| Number of cores in a single physical processor socket (e.g. "2"). |
| The CoresPerSocket value describes physical cores, not the |
| logical number of processors per socket. |
| \fBNOTE\fR: If you have multi\-core processors, you will likely |
| need to specify this parameter in order to optimize scheduling. |
| The default value is 1. |
| |
| .TP |
| \fBFeature\fR |
| A comma delimited list of arbitrary strings indicative of some |
| characteristic associated with the node. |
| There is no value associated with a feature at this time; a node |
| either has a feature or it does not. |
| If desired a feature may contain a numeric component indicating, |
| for example, processor speed. |
| By default a node has no features. |
| Also see \fBGres\fR. |
| |
| .TP |
| \fBGres\fR |
| A comma delimited list of generic resources specifications for a node. |
| Each resource specification consists of a name followed by an optional |
| colon with a numeric value (default value is one) |
| (e.g. "Gres=bandwidth:10000,gpus:2"). |
| A suffix of "K", "M" or "G" may be used to multiply the number by 1024, |
| 1048576 or 1073741824 respectively (e.g. "Gres=bandwidth:4G,gpus:4"). |
| By default a node has no generic resources. |
| Also see \fBFeature\fR. |
| |
| .TP |
| \fBPort\fR |
| The port number that the SLURM compute node daemon, \fBslurmd\fR, listens |
| to for work on this particular node. By default there is a single port number |
| for all \fBslurmd\fR daemons on all compute nodes as defined by the |
| \fBSlurmdPort\fR configuration parameter. Use of this option is not generally |
| recommended except for development or testing purposes. |
| |
| .TP |
| \fBProcs\fR |
| Number of logical processors on the node (e.g. "2"). |
| If \fBProcs\fR is omitted, it will be set equal to the product of |
| \fBSockets\fR, \fBCoresPerSocket\fR, and \fBThreadsPerCore\fR. |
| The default value is 1. |
| |
| .TP |
| \fBRealMemory\fR |
| Size of real memory on the node in MegaBytes (e.g. "2048"). |
| The default value is 1. |
| |
| .TP |
| \fBReason\fR |
| Identifies the reason for a node being in state "DOWN", "DRAINED", |
| "DRAINING", "FAIL" or "FAILING". |
| Use quotes to enclose a reason having more than one word. |
| |
| .TP |
| \fBSockets\fR |
| Number of physical processor sockets/chips on the node (e.g. "2"). |
| If Sockets is omitted, it will be inferred from |
| \fBProcs\fR, \fBCoresPerSocket\fR, and \fBThreadsPerCore\fR. |
| \fBNOTE\fR: If you have multi\-core processors, you will likely |
| need to specify these parameters. |
| The default value is 1. |
| |
| .TP |
| \fBState\fR |
| State of the node with respect to the initiation of user jobs. |
| Acceptable values are "DOWN", "DRAIN", "FAIL", "FAILING" and "UNKNOWN". |
| "DOWN" indicates the node failed and is unavailable to be allocated work. |
| "DRAIN" indicates the node is unavailable to be allocated work. |
| "FAIL" indicates the node is expected to fail soon, has |
| no jobs allocated to it, and will not be allocated |
| to any new jobs. |
| "FAILING" indicates the node is expected to fail soon, has |
| one or more jobs allocated to it, but will not be allocated |
| to any new jobs. |
| "UNKNOWN" indicates the node's state is undefined (BUSY or IDLE), |
| but will be established when the \fBslurmd\fR daemon on that node |
| registers. |
| The default value is "UNKNOWN". |
| Also see the \fBDownNodes\fR parameter below. |
| |
| .TP |
| \fBThreadsPerCore\fR |
| Number of logical threads in a single physical core (e.g. "2"). |
| Note that SLURM can allocate resources to jobs down to the |
| resolution of a core. If your system is configured with more than |
| one thread per core, execution of a different job on each thread |
| is not supported unless you configure \fBSelectTypeParameters=CR_CPU\fR |
| plus \fBProcs\fR; do not configure \fBSockets\fR, \fBCoresPerSocket\fR or |
| \fBThreadsPerCore\fR. |
| A job can execute one task per thread from within one job step or |
| execute a distinct job step on each of the threads. |
| Note also if you are running with more than 1 thread per core and running |
| the select/cons_res plugin you will want to set the SelectTypeParameters |
| variable to something other than CR_CPU to avoid unexpected results. |
| The default value is 1. |
| |
| .TP |
| \fBTmpDisk\fR |
| Total size of temporary disk storage in \fBTmpFS\fR in MegaBytes |
| (e.g. "16384"). \fBTmpFS\fR (for "Temporary File System") |
| identifies the location which jobs should use for temporary storage. |
| Note this does not indicate the amount of free |
| space available to the user on the node, only the total file |
| system size. The system administrator should ensure this file |
| system is purged as needed so that user jobs have access to |
| most of this space. |
| The Prolog and/or Epilog programs (specified in the configuration file) |
| might be used to ensure the file system is kept clean. |
| The default value is 0. |
| |
| .TP |
| \fBWeight\fR |
| The priority of the node for scheduling purposes. |
| All things being equal, jobs will be allocated the nodes with |
| the lowest weight which satisfies their requirements. |
| For example, a heterogeneous collection of nodes might |
| be placed into a single partition for greater system |
| utilization, responsiveness and capability. It would be |
| preferable to allocate smaller memory nodes rather than larger |
| memory nodes if either will satisfy a job's requirements. |
| The units of weight are arbitrary, but larger weights |
| should be assigned to nodes with more processors, memory, |
| disk space, higher processor speed, etc. |
| Note that if a job allocation request can not be satisfied |
| using the nodes with the lowest weight, the set of nodes |
| with the next lowest weight is added to the set of nodes |
| under consideration for use (repeat as needed for higher |
| weight values). If you absolutely want to minimize the number |
| of higher weight nodes allocated to a job (at a cost of higher |
| scheduling overhead), give each node a distinct \fBWeight\fR |
| value and they will be added to the pool of nodes being |
| considered for scheduling individually. |
| The default value is 1. |
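| |
| .LP |
| As an illustrative sketch combining several of the node configuration |
| parameters above (all node names and hardware values are hypothetical): |
| |
| .nf |
| NodeName=DEFAULT Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN |
| NodeName=lx[0\-31] RealMemory=2048 TmpDisk=16384 Weight=1 |
| NodeName=lx[32\-47] RealMemory=8192 TmpDisk=16384 Weight=8 Feature=bigmem |
| .fi |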
| |
| .LP |
| The "DownNodes=" configuration permits you to mark certain nodes as in a |
| DOWN, DRAIN, FAIL, or FAILING state without altering the permanent |
| configuration information listed under a "NodeName=" specification. |
| |
| .TP |
| \fBDownNodes\fR |
| Any node name, or list of node names, from the "NodeName=" specifications. |
| |
| .TP |
| \fBReason\fR |
| Identifies the reason for a node being in state "DOWN", "DRAIN", |
| "FAIL" or "FAILING. |
| \Use quotes to enclose a reason having more than one word. |
| |
| .TP |
| \fBState\fR |
| State of the node with respect to the initiation of user jobs. |
| Acceptable values are "BUSY", "DOWN", "DRAIN", "FAIL", |
| "FAILING, "IDLE", and "UNKNOWN". |
| .RS |
| .TP 10 |
| \fBDOWN\fP |
| Indicates the node failed and is unavailable to be allocated work. |
| .TP |
| \fBDRAIN\fP |
| Indicates the node is unavailable to be allocated work. |
| .TP |
| \fBFAIL\fP |
| Indicates the node is expected to fail soon, has |
| no jobs allocated to it, and will not be allocated |
| to any new jobs. |
| .TP |
| \fBFAILING\fP |
| Indicates the node is expected to fail soon, has |
| one or more jobs allocated to it, but will not be allocated |
| to any new jobs. |
| .TP |
| \fBFUTURE\fP |
| Indicates the node is defined for future use and need not |
| exist when the SLURM daemons are started. These nodes can be made available |
| for use simply by updating the node state using the scontrol command rather |
| than restarting the slurmctld daemon. After these nodes are made available, |
| change their \fBState\fR in the slurm.conf file. Until these nodes are made |
| available, they will not be seen by any SLURM commands, nor will |
| any attempt be made to contact them. |
| .TP |
| \fBUNKNOWN\fP |
| Indicates the node's state is undefined (BUSY or IDLE), |
| but will be established when the \fBslurmd\fR daemon on that node |
| registers. |
| The default value is "UNKNOWN". |
| .RE |
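| |
| .LP |
| For example (the node names are hypothetical), temporarily failed nodes |
| might be recorded as: |
| |
| .nf |
| DownNodes=lx[13\-14] State=DOWN Reason="power supply failure" |
| .fi |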
| |
| .LP |
| The partition configuration permits you to establish different job |
| limits or access controls for various groups (or partitions) of nodes. |
| Nodes may be in more than one partition, making partitions serve |
| as general purpose queues. |
| For example one may put the same set of nodes into two different |
| partitions, each with different constraints (time limit, job sizes, |
| groups allowed to use the partition, etc.). |
| Jobs are allocated resources within a single partition. |
| Default values can be specified with a record in which |
| "PartitionName" is "DEFAULT". |
| The default entry values will apply only to lines following it in the |
| configuration file and the default values can be reset multiple times |
| in the configuration file with multiple entries where "PartitionName=DEFAULT". |
| The "PartitionName=" specification must be placed on every line |
| describing the configuration of partitions. |
| \fBNOTE:\fR Put all parameters for each partition on a single line. |
| Each line of partition configuration information should |
| represent a different partition. |
| The partition configuration file contains the following information: |
| |
| .TP |
| \fBAllocNodes\fR |
| Comma separated list of nodes from which users can execute jobs in the |
| partition. |
| Node names may be specified using the node range expression syntax |
| described above. |
| The default value is "ALL". |
| |
| .TP |
| \fBAllowGroups\fR |
| Comma separated list of group IDs which may execute jobs in the partition. |
| If at least one group associated with the user attempting to execute the |
| job is in AllowGroups, he will be permitted to use this partition. |
| Jobs executed as user root can use any partition without regard to |
| the value of AllowGroups. |
| If user root attempts to execute a job as another user (e.g. using |
| srun's \-\-uid option), this other user must be in one of groups |
| identified by AllowGroups for the job to successfully execute. |
| The default value is "ALL". |
| \fBNOTE:\fR For performance reasons, SLURM maintains a list of user IDs |
| allowed to use each partition and this is checked at job submission time. |
| This list of user IDs is updated when the \fBslurmctld\fR daemon is restarted, |
| reconfigured (e.g. "scontrol reconfig") or the partition's \fBAllowGroups\fR |
| value is reset, even if its value is unchanged |
| (e.g. "scontrol update PartitionName=name AllowGroups=group"). |
| For a user's access to a partition to change, both his group membership must |
| change and SLURM's internal user ID list must be updated using one of the |
| methods described above. |
| |
| .TP |
| \fBAlternate\fR |
| Partition name of alternate partition to be used if the state of this partition |
| is "DRAIN" or "INACTIVE." |
| |
| .TP |
| \fBDefault\fR |
| If this keyword is set, jobs submitted without a partition |
| specification will utilize this partition. |
| Possible values are "YES" and "NO". |
| The default value is "NO". |
| |
| .TP |
| \fBDefaultTime\fR |
| Run time limit used for jobs that don't specify a value. If not set |
| then MaxTime will be used. |
| Format is the same as for MaxTime. |
| |
| .TP |
| \fBDisableRootJobs\fR |
| If set to "YES" then user root will be prevented from running any jobs |
| on this partition. |
| The default value will be the value of \fBDisableRootJobs\fR set |
| outside of a partition specification (which is "NO", allowing user |
| root to execute jobs). |
| |
| .TP |
| \fBHidden\fR |
| Specifies if the partition and its jobs are to be hidden by default. |
| Hidden partitions will by default not be reported by the SLURM APIs or commands. |
| Possible values are "YES" and "NO". |
| The default value is "NO". |
| Note that partitions that a user lacks access to by virtue of the |
| \fBAllowGroups\fR parameter will also be hidden by default. |
| |
| .TP |
| \fBMaxNodes\fR |
| Maximum count of nodes which may be allocated to any single job. |
| For BlueGene systems this will be a c\-nodes count and will be converted |
| to a midplane count with a reduction in resolution. |
| The default value is "UNLIMITED", which is represented internally as \-1. |
| This limit does not apply to jobs executed by SlurmUser or user root. |
| |
| .TP |
| \fBMaxTime\fR |
| Maximum run time limit for jobs. |
| Format is minutes, minutes:seconds, hours:minutes:seconds, |
| days\-hours, days\-hours:minutes, days\-hours:minutes:seconds or |
| "UNLIMITED". |
| Time resolution is one minute and second values are rounded up to |
| the next minute. |
| This limit does not apply to jobs executed by SlurmUser or user root. |
| |
| .TP |
| \fBMinNodes\fR |
| Minimum count of nodes which may be allocated to any single job. |
| For BlueGene systems this will be a c\-nodes count and will be converted |
| to a midplane count with a reduction in resolution. |
| The default value is 1. |
| This limit does not apply to jobs executed by SlurmUser or user root. |
| |
| .TP |
| \fBNodes\fR |
| Comma separated list of nodes (or base partitions for BlueGene systems) |
| which are associated with this partition. |
| Node names may be specified using the node range expression syntax |
| described above. A blank list of nodes |
| (i.e. "Nodes= ") can be used if one wants a partition to exist, |
| but have no resources (possibly on a temporary basis). |
| |
| .TP |
| \fBPartitionName\fR |
| Name by which the partition may be referenced (e.g. "Interactive"). |
| This name can be specified by users when submitting jobs. |
| If the \fBPartitionName\fR is "DEFAULT", the values specified |
| with that record will apply to subsequent partition specifications |
| unless explicitly set to other values in that partition record or |
| replaced with a different set of default values. |
| |
| .TP |
| \fBPreemptMode\fR |
| Mechanism used to preempt jobs from this partition when |
| \fBPreemptType=preempt/partition_prio\fR is configured. |
| This partition specific \fBPreemptMode\fR configuration parameter will override |
| the \fBPreemptMode\fR configuration parameter set for the cluster as a whole. |
| The cluster\-level \fBPreemptMode\fR must include the GANG option if |
| \fBPreemptMode\fR is configured to SUSPEND for any partition. |
| The cluster\-level \fBPreemptMode\fR must not be OFF if \fBPreemptMode\fR |
| is enabled for any partition. |
| See the description of the cluster\-level \fBPreemptMode\fR configuration |
| parameter above for further information. |
| |
| .TP |
| \fBPriority\fR |
| Jobs submitted to a higher priority partition will be dispatched |
| before pending jobs in lower priority partitions and if possible |
| they will preempt running jobs from lower priority partitions. |
| Note that a partition's priority takes precedence over a job's |
| priority. |
| The value may not exceed 65533. |
| |
| |
| .TP |
| \fBRootOnly\fR |
| Specifies if only user ID zero (i.e. user \fIroot\fR) may allocate resources |
| in this partition. User root may allocate resources for any other user, |
| but the request must be initiated by user root. |
| This option can be useful for a partition to be managed by some |
| external entity (e.g. a higher\-level job manager) and prevents |
| users from directly using those resources. |
| Possible values are "YES" and "NO". |
| The default value is "NO". |
| |
| .TP |
| \fBShared\fR |
| Controls the ability of the partition to execute more than one job at a |
| time on each resource (node, socket or core depending upon the value |
| of \fBSelectTypeParameters\fR). |
| If resources are to be shared, avoiding memory over\-subscription |
| is very important. |
| \fBSelectTypeParameters\fR should be configured to treat |
| memory as a consumable resource and the \fB\-\-mem\fR option |
| should be used for job allocations. |
| Sharing of resources is typically useful only when using gang scheduling |
| (\fBPreemptMode=suspend\fR or \fBPreemptMode=kill\fR). |
| Possible values for \fBShared\fR are "EXCLUSIVE", "FORCE", "YES", and "NO". |
| The default value is "NO". |
| For more information see the following web pages: |
| .br |
| .na |
| \fIhttps://computing.llnl.gov/linux/slurm/cons_res.html\fR, |
| .br |
| \fIhttps://computing.llnl.gov/linux/slurm/cons_res_share.html\fR, |
| .br |
| \fIhttps://computing.llnl.gov/linux/slurm/gang_scheduling.html\fR, and |
| .br |
| \fIhttps://computing.llnl.gov/linux/slurm/preempt.html\fR. |
| .ad |
| |
| .RS |
| .TP 12 |
| \fBEXCLUSIVE\fR |
| Allocates entire nodes to jobs even with select/cons_res configured. |
| Jobs that run in partitions with "Shared=EXCLUSIVE" will have |
| exclusive access to all allocated nodes. |
| .TP |
| \fBFORCE\fR |
| Makes all resources in the partition available for sharing |
| without any means for users to disable it. |
| May be followed with a colon and maximum number of jobs in |
| running or suspended state. |
| For example "Shared=FORCE:4" enables each node, socket or |
| core to execute up to four jobs at once. |
| Recommended only for BlueGene systems configured with |
| small blocks or for systems running |
| with gang scheduling (\fBPreemptMode=GANG\fR). |
| .TP |
| \fBYES\fR |
| Makes all resources in the partition available for sharing, |
| but honors a user's request for dedicated resources. |
| If \fBSelectType=select/cons_res\fR, then resources will be |
| over\-subscribed unless explicitly disabled in the job submit |
| request using the "\-\-exclusive" option. |
| With \fBSelectType=select/bluegene\fR or \fBSelectType=select/linear\fR, |
| resources will only be over\-subscribed when explicitly requested |
| by the user using the "\-\-share" option on job submission. |
| May be followed with a colon and maximum number of jobs in |
| running or suspended state. |
| For example "Shared=YES:4" enables each node, socket or |
| core to execute up to four jobs at once. |
| Recommended only for systems running with gang scheduling |
| (\fBPreemptMode=GANG\fR). |
| .TP |
| \fBNO\fR |
| Selected resources are allocated to a single job. No resource will be |
| allocated to more than one job. |
| .RE |
| |
| .TP |
| \fBState\fR |
| State of partition or availability for use. Possible values |
| are "UP", "DOWN", "DRAIN" and "INACTIVE". The default value is "UP". |
| See also the related "Alternate" keyword. |
| .RS |
| .TP 10 |
| \fBUP\fP |
| Designates that new jobs may be queued on the partition, and that |
| jobs may be allocated nodes and run from the partition. |
| .TP |
| \fBDOWN\fP |
| Designates that new jobs may be queued on the partition, but |
| queued jobs may not be allocated nodes and run from the partition. Jobs |
| already running on the partition continue to run. The jobs |
| must be explicitly canceled to force their termination. |
| .TP |
| \fBDRAIN\fP |
| Designates that no new jobs may be queued on the partition (job |
| submission requests will be denied with an error message), but jobs |
| already queued on the partition may be allocated nodes and run. |
| See also the "Alternate" partition specification. |
| .TP |
| \fBINACTIVE\fP |
| Designates that no new jobs may be queued on the partition, |
| and jobs already queued may not be allocated nodes and run. |
| See also the "Alternate" partition specification. |
| .RE |
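| |
| .LP |
| As an illustrative sketch combining several of the partition parameters |
| above (the partition and node names are hypothetical): |
| |
| .nf |
| PartitionName=DEFAULT State=UP Nodes=lx[0\-47] |
| PartitionName=debug Default=YES MaxTime=30 MaxNodes=2 |
| PartitionName=batch MaxTime=UNLIMITED Priority=10 AllowGroups=research |
| .fi |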
| |
| .SH "Prolog and Epilog Scripts" |
| There are a variety of prolog and epilog program options that |
| execute with various permissions and at various times. |
| The four options most likely to be used are: |
| \fBProlog\fR and \fBEpilog\fR (executed once on each compute node |
| for each job) plus \fBPrologSlurmctld\fR and \fBEpilogSlurmctld\fR |
| (executed once on the \fBControlMachine\fR for each job). |
| |
| NOTE: Standard output and error messages are normally not preserved. |
| Explicitly write output and error messages to an appropriate location |
| if you wish to preserve that information. |
| |
| NOTE: The Prolog script is ONLY run on any individual |
| node when it first sees a job step from a new allocation; it does not |
| run the Prolog immediately when an allocation is granted. If no job steps |
| from an allocation are run on a node, it will never run the Prolog for that |
| allocation. The Epilog, on the other hand, always runs on every node of an |
| allocation when the allocation is released. |
| |
| Information about the job is passed to the script using environment |
| variables. |
| Unless otherwise specified, these environment variables are available |
| to all of the programs. |
| .TP |
| \fBBASIL_RESERVATION_ID\fR |
| Basil reservation ID. |
| Available on Cray XT systems only. |
| .TP |
| \fBMPIRUN_PARTITION\fR |
| BlueGene partition name. |
| Available on BlueGene systems only. |
| .TP |
| \fBSLURM_JOB_ACCOUNT\fR |
| Account name used for the job. |
| Available in \fBPrologSlurmctld\fR and \fBEpilogSlurmctld\fR only. |
| .TP |
| \fBSLURM_JOB_CONSTRAINTS\fR |
| Features required to run the job. |
| Available in \fBPrologSlurmctld\fR and \fBEpilogSlurmctld\fR only. |
| .TP |
| \fBSLURM_JOB_DERIVED_EC\fR |
| The highest exit code of all of the job steps. |
| Available in \fBEpilogSlurmctld\fR only. |
| .TP |
| \fBSLURM_JOB_EXIT_CODE\fR |
| The exit code of the job script (or salloc). |
| Available in \fBEpilogSlurmctld\fR only. |
| .TP |
| \fBSLURM_JOB_GID\fR |
| Group ID of the job's owner. |
| Available in \fBPrologSlurmctld\fR and \fBEpilogSlurmctld\fR only. |
| .TP |
| \fBSLURM_JOB_GROUP\fR |
| Group name of the job's owner. |
| Available in \fBPrologSlurmctld\fR and \fBEpilogSlurmctld\fR only. |
| .TP |
| \fBSLURM_JOB_ID\fR |
| Job ID. |
| .TP |
| \fBSLURM_JOB_NAME\fR |
| Name of the job. |
| Available in \fBPrologSlurmctld\fR and \fBEpilogSlurmctld\fR only. |
| .TP |
| \fBSLURM_JOB_NODELIST\fR |
| Nodes assigned to job. A SLURM hostlist expression. |
| "scontrol show hostnames" can be used to convert this to a |
| list of individual host names. |
| Available in \fBPrologSlurmctld\fR and \fBEpilogSlurmctld\fR only. |
| .TP |
| \fBSLURM_JOB_PARTITION\fR |
| Partition that job runs in. |
| Available in \fBPrologSlurmctld\fR and \fBEpilogSlurmctld\fR only. |
| .TP |
| \fBSLURM_JOB_UID\fR |
| User ID of the job's owner. |
| .TP |
| \fBSLURM_JOB_USER\fR |
| User name of the job's owner. |
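| .LP |
| As an illustration, a minimal \fBEpilog\fR script might use these |
| variables to record each completed job. The log file location here is |
| only an example; as noted above, output must be written explicitly to |
| be preserved. |
| .nf |
| #!/bin/sh |
| # Runs as root on each compute node when the allocation is released. |
| echo "`date` job=$SLURM_JOB_ID user=$SLURM_JOB_USER" >> /var/log/slurm/epilog.log |
| exit 0 |
| .fi |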
| |
| .SH "NETWORK TOPOLOGY" |
| SLURM is able to optimize job allocations to minimize network contention. |
| Special SLURM logic is used to optimize allocations on systems with a |
| three\-dimensional interconnect (BlueGene, Sun Constellation, etc.) |
| and information about configuring those systems is available at |
| <https://computing.llnl.gov/linux/slurm/>. |
| For a hierarchical network, SLURM needs to have detailed information |
| about how nodes are configured on the network switches. |
| .LP |
| Given network topology information, SLURM allocates all of a job's |
| resources onto a single leaf switch of the network (if possible) |
| using a best\-fit algorithm. |
| Otherwise it will allocate a job's resources onto multiple leaf switches |
| so as to minimize the use of higher\-level switches. |
| The \fBTopologyPlugin\fR parameter controls which plugin is used to |
| collect network topology information. |
| The only values presently supported are |
| "topology/3d_torus" (default for IBM BlueGene, Sun Constellation and |
| Cray XT systems, performs best\-fit logic over three\-dimensional topology), |
| "topology/none" (default for other systems, |
| best\-fit logic over one\-dimensional topology), and |
| "topology/tree" (determines the network topology based |
| upon information contained in a topology.conf file, |
| see "man topology.conf" for more information). |
| Future plugins may gather topology information directly from the network. |
| The topology information is optional. |
| If not provided, SLURM will perform a best\-fit algorithm assuming the |
| nodes are in a one\-dimensional array as configured and the communications |
| cost is related to the node distance in this array. |
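| .LP |
| For example, a simple two\-level tree network might be described as |
| follows (the switch and node names are only illustrative; see |
| \fBtopology.conf\fR(5) for details): |
| .nf |
| # In slurm.conf: |
| TopologyPlugin=topology/tree |
| # In topology.conf: |
| SwitchName=leaf1 Nodes=dev[0\-12] |
| SwitchName=leaf2 Nodes=dev[13\-25] |
| SwitchName=spine Switches=leaf[1\-2] |
| .fi |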
| |
| .SH "RELOCATING CONTROLLERS" |
| If the cluster's computers used for the primary or backup controller |
| will be out of service for an extended period of time, it may be |
| desirable to relocate them. |
| In order to do so, follow this procedure: |
| .LP |
| 1. Stop the SLURM daemons |
| .br |
| 2. Modify the slurm.conf file appropriately |
| .br |
| 3. Distribute the updated slurm.conf file to all nodes |
| .br |
| 4. Restart the SLURM daemons |
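| .LP |
| For example, the procedure might look as follows. This is only a |
| sketch: the file distribution and daemon restart commands are |
| site\-specific, and the \fBpdcp\fR and \fBpdsh\fR utilities are |
| merely assumed to be available. |
| .nf |
| scontrol shutdown                  # 1. stop the SLURM daemons |
| vi /etc/slurm.conf                 # 2. update ControlMachine, etc. |
| pdcp \-w dev[0\-25] /etc/slurm.conf /etc/slurm.conf  # 3. distribute |
| slurmctld                          # 4. restart on the control machine |
| pdsh \-w dev[0\-25] slurmd         # 4. restart on each compute node |
| .fi |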
| .LP |
| There should be no loss of any running or pending jobs. |
| Ensure that any nodes added to the cluster have the current |
| slurm.conf file installed. |
| .LP |
| \fBCAUTION:\fR If two nodes are simultaneously configured as the |
| primary controller (that is, \fBControlMachine\fR specifies the |
| local host on both nodes and the \fBslurmctld\fR daemon is executing |
| on each), system behavior will be destructive. |
| If a compute node has an incorrect \fBControlMachine\fR or |
| \fBBackupController\fR parameter, that node may be rendered |
| unusable, but no other harm will result. |
| |
| .SH "EXAMPLE" |
| .LP |
| # |
| .br |
| # Sample /etc/slurm.conf for dev[0\-25].llnl.gov |
| .br |
| # Author: John Doe |
| .br |
| # Date: 11/06/2001 |
| .br |
| # |
| .br |
| ControlMachine=dev0 |
| .br |
| ControlAddr=edev0 |
| .br |
| BackupController=dev1 |
| .br |
| BackupAddr=edev1 |
| .br |
| # |
| .br |
| AuthType=auth/munge |
| .br |
| Epilog=/usr/local/slurm/epilog |
| .br |
| Prolog=/usr/local/slurm/prolog |
| .br |
| FastSchedule=1 |
| .br |
| FirstJobId=65536 |
| .br |
| InactiveLimit=120 |
| .br |
| JobCompType=jobcomp/filetxt |
| .br |
| JobCompLoc=/var/log/slurm/jobcomp |
| .br |
| KillWait=30 |
| .br |
| MaxJobCount=10000 |
| .br |
| MinJobAge=3600 |
| .br |
| PluginDir=/usr/local/lib:/usr/local/slurm/lib |
| .br |
| ReturnToService=0 |
| .br |
| SchedulerType=sched/backfill |
| .br |
| SlurmctldLogFile=/var/log/slurm/slurmctld.log |
| .br |
| SlurmdLogFile=/var/log/slurm/slurmd.log |
| .br |
| SlurmctldPort=7002 |
| .br |
| SlurmdPort=7003 |
| .br |
| SlurmdSpoolDir=/usr/local/slurm/slurmd.spool |
| .br |
| StateSaveLocation=/usr/local/slurm/slurm.state |
| .br |
| SwitchType=switch/elan |
| .br |
| TmpFS=/tmp |
| .br |
| WaitTime=30 |
| .br |
| JobCredentialPrivateKey=/usr/local/slurm/private.key |
| .br |
| .na |
| JobCredentialPublicCertificate=/usr/local/slurm/public.cert |
| .ad |
| .br |
| # |
| .br |
| # Node Configurations |
| .br |
| # |
| .br |
| NodeName=DEFAULT Procs=2 RealMemory=2000 TmpDisk=64000 |
| .br |
| NodeName=DEFAULT State=UNKNOWN |
| .br |
| NodeName=dev[0\-25] NodeAddr=edev[0\-25] Weight=16 |
| .br |
| # Update records for specific DOWN nodes |
| .br |
| DownNodes=dev20 State=DOWN Reason="power,ETA=Dec25" |
| .br |
| # |
| .br |
| # Partition Configurations |
| .br |
| # |
| .br |
| PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP |
| .br |
| PartitionName=debug Nodes=dev[0\-8,18\-25] Default=YES |
| .br |
| PartitionName=batch Nodes=dev[9\-17] MinNodes=4 |
| .br |
| PartitionName=long Nodes=dev[9\-17] MaxTime=120 AllowGroups=admin |
| |
| .SH "FILE AND DIRECTORY PERMISSIONS" |
| There are three classes of files: |
| Files used by \fBslurmctld\fR must be accessible by user \fBSlurmUser\fR |
| and accessible by the primary and backup control machines. |
| Files used by \fBslurmd\fR must be accessible by user root and |
| accessible from every compute node. |
| A few files need to be accessible by normal users on all login and |
| compute nodes. |
| While many files and directories are listed below, most of them will |
| not be used with most configurations. |
| .TP |
| \fBAccountingStorageLoc\fR |
| If this specifies a file, it must be writable by user \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| It is recommended that the file be readable by all users from login and |
| compute nodes. |
| .TP |
| \fBEpilog\fR |
| Must be executable by user root. |
| It is recommended that the file be readable by all users. |
| The file must exist on every compute node. |
| .TP |
| \fBEpilogSlurmctld\fR |
| Must be executable by user \fBSlurmUser\fR. |
| It is recommended that the file be readable by all users. |
| The file must be accessible by the primary and backup control machines. |
| .TP |
| \fBHealthCheckProgram\fR |
| Must be executable by user root. |
| It is recommended that the file be readable by all users. |
| The file must exist on every compute node. |
| .TP |
| \fBJobCheckpointDir\fR |
| Must be writable by user \fBSlurmUser\fR and no other users. |
| The file must be accessible by the primary and backup control machines. |
| .TP |
| \fBJobCompLoc\fR |
| If this specifies a file, it must be writable by user \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| .TP |
| \fBJobCredentialPrivateKey\fR |
| Must be readable only by user \fBSlurmUser\fR and writable by no other users. |
| The file must be accessible by the primary and backup control machines. |
| .TP |
| \fBJobCredentialPublicCertificate\fR |
| Must be readable by all users on all nodes. |
| Must not be writable by regular users. |
| .TP |
| \fBMailProg\fR |
| Must be executable by user \fBSlurmUser\fR. |
| Must not be writable by regular users. |
| The file must be accessible by the primary and backup control machines. |
| .TP |
| \fBProlog\fR |
| Must be executable by user root. |
| It is recommended that the file be readable by all users. |
| The file must exist on every compute node. |
| .TP |
| \fBPrologSlurmctld\fR |
| Must be executable by user \fBSlurmUser\fR. |
| It is recommended that the file be readable by all users. |
| The file must be accessible by the primary and backup control machines. |
| .TP |
| \fBResumeProgram\fR |
| Must be executable by user \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| .TP |
| \fBSallocDefaultCommand\fR |
| Must be executable by all users. |
| The file must exist on every login and compute node. |
| .TP |
| \fBslurm.conf\fR |
| Must be readable by all users on all nodes. |
| Must not be writable by regular users. |
| .TP |
| \fBSlurmctldLogFile\fR |
| Must be writable by user \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| .TP |
| \fBSlurmctldPidFile\fR |
| Must be writable by user root. |
| Preferably writable and removable by \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| .TP |
| \fBSlurmdLogFile\fR |
| Must be writable by user root. |
| A distinct file must exist on each compute node. |
| .TP |
| \fBSlurmdPidFile\fR |
| Must be writable by user root. |
| A distinct file must exist on each compute node. |
| .TP |
| \fBSlurmdSpoolDir\fR |
| Must be writable by user root. |
| A distinct directory must exist on each compute node. |
| .TP |
| \fBSrunEpilog\fR |
| Must be executable by all users. |
| The file must exist on every login and compute node. |
| .TP |
| \fBSrunProlog\fR |
| Must be executable by all users. |
| The file must exist on every login and compute node. |
| .TP |
| \fBStateSaveLocation\fR |
| Must be writable by user \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| .TP |
| \fBSuspendProgram\fR |
| Must be executable by user \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
| .TP |
| \fBTaskEpilog\fR |
| Must be executable by all users. |
| The file must exist on every compute node. |
| .TP |
| \fBTaskProlog\fR |
| Must be executable by all users. |
| The file must exist on every compute node. |
| .TP |
| \fBUnkillableStepProgram\fR |
| Must be executable by user \fBSlurmUser\fR. |
| The file must be accessible by the primary and backup control machines. |
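| .LP |
| For example, assuming "SlurmUser=slurm" and the file locations from |
| the EXAMPLE section above (both assumptions are for illustration |
| only), appropriate ownership and permissions might be set as follows: |
| .nf |
| chown slurm /usr/local/slurm/slurm.state |
| chmod 0700 /usr/local/slurm/slurm.state |
| chown slurm /var/log/slurm/slurmctld.log |
| chmod 0644 /var/log/slurm/slurmctld.log |
| chown slurm /usr/local/slurm/private.key |
| chmod 0600 /usr/local/slurm/private.key |
| chmod 0444 /usr/local/slurm/public.cert |
| .fi |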
| |
| .SH "COPYING" |
| Copyright (C) 2002\-2007 The Regents of the University of California. |
| Copyright (C) 2008\-2010 Lawrence Livermore National Security. |
| Portions Copyright (C) 2010 SchedMD <http://www.sched\-md.com>. |
| Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER). |
| CODE\-OCEC\-09\-009. All rights reserved. |
| .LP |
| This file is part of SLURM, a resource management program. |
| For details, see <https://computing.llnl.gov/linux/slurm/>. |
| .LP |
| SLURM is free software; you can redistribute it and/or modify it under |
| the terms of the GNU General Public License as published by the Free |
| Software Foundation; either version 2 of the License, or (at your option) |
| any later version. |
| .LP |
| SLURM is distributed in the hope that it will be useful, but WITHOUT ANY |
| WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS |
| FOR A PARTICULAR PURPOSE. See the GNU General Public License for more |
| details. |
| |
| .SH "FILES" |
| /etc/slurm.conf |
| |
| .SH "SEE ALSO" |
| .LP |
| \fBbluegene.conf\fR(5), \fBcgroup.conf\fR(5), \fBgethostbyname\fR(3), |
| \fBgetrlimit\fR(2), \fBgres.conf\fR(5), \fBgroup\fR(5), \fBhostname\fR(1), |
| \fBscontrol\fR(1), \fBslurmctld\fR(8), \fBslurmd\fR(8), |
| \fBslurmdbd\fR(8), \fBslurmdbd.conf\fR(5), \fBsrun\fR(1), |
| \fBspank\fR(8), \fBsyslog\fR(2), \fBtopology.conf\fR(5), \fBwiki.conf\fR(5) |