| .TH "gres.conf" "5" "Slurm Configuration File" "December 2024" "Slurm Configuration File" |
| |
| .SH "NAME" |
| gres.conf \- Slurm configuration file for Generic RESource (GRES) management. |
| |
| .SH "DESCRIPTION" |
| \fBgres.conf\fP is an ASCII file which describes the configuration |
| of Generic RESource(s) (GRES) on each compute node. |
| If the GRES information in the slurm.conf file does not fully describe those |
| resources, then a gres.conf file should be included on each compute node. For |
| cloud nodes, a gres.conf file that includes all the cloud nodes must be on |
| all cloud nodes and the controller. The file will always be located in the same |
| directory as \fBslurm.conf\fR. |
| |
| .LP |
| If the GRES information in the slurm.conf file fully describes those resources |
| (i.e. no "Cores", "File" or "Links" specification is required for that GRES |
| type or that information is automatically detected), that information may be |
| omitted from the gres.conf file and only the configuration information in the |
| slurm.conf file will be used. |
| The gres.conf file may be omitted completely if the configuration information |
| in the slurm.conf file fully describes all GRES. |
| |
| .LP |
| If using the \fBgres.conf\fR file to describe the resources available to nodes, |
| the first parameter on the line should be \fBNodeName\fR. If configuring |
| Generic Resources without specifying nodes, the first parameter on the line |
| should be \fBName\fR. |
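For example (an illustrative sketch; node and device names are hypothetical):
.nf
# With nodes specified:
NodeName=tux[0\-7] Name=gpu File=/dev/nvidia[0\-3]
# Without nodes specified:
Name=gpu File=/dev/nvidia[0\-3]
.fi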
| |
| .LP |
| Parameter names are case insensitive. |
| Any text following a "#" in the configuration file is treated |
| as a comment through the end of that line. |
| Changes to the configuration file take effect upon restart of |
| Slurm daemons, daemon receipt of the SIGHUP signal, or execution |
| of the command "scontrol reconfigure" unless otherwise noted. |
| |
| .LP |
| \fBNOTE\fR: Slurm support for gres/[mps|shard] requires the use of the |
| select/cons_tres plugin. For more information on how to configure MPS, see |
| \fIhttps://slurm.schedmd.com/gres.html#MPS_Management\fR. |
| For more information on how to configure Sharding, see |
| \fIhttps://slurm.schedmd.com/gres.html#Sharding\fR. |
| |
| .LP |
| For more information on GRES scheduling in general, see |
| \fIhttps://slurm.schedmd.com/gres.html\fR. |
| |
| .LP |
| The overall configuration parameters available include: |
| |
| .TP |
| \fBAutoDetect\fR |
| The hardware detection mechanisms to enable for automatic GRES configuration. |
| Currently, the options are: |
| .IP |
| .RS |
| .TP |
| \fBnrt\fR |
| Automatically detect AWS Trainium/Inferentia devices. |
| .IP |
| |
| .TP |
| \fBnvidia\fR |
Automatically detect NVIDIA GPUs. No library is required, but this mechanism
does not detect MIG instances or NVLink connectivity. Added in Slurm 24.11.
| .IP |
| |
| .TP |
| \fBnvml\fR |
| Automatically detect NVIDIA GPUs. Requires the NVIDIA Management Library (NVML). |
| .IP |
| |
| .TP |
| \fBoff\fR |
| Do not automatically detect any GPUs. Used to override other options. |
| .IP |
| |
| .TP |
| \fBoneapi\fR |
| Automatically detect Intel GPUs. Requires the Intel Graphics Compute Runtime for |
| oneAPI Level Zero and OpenCL Driver (oneapi). |
| .IP |
| |
| .TP |
| \fBrsmi\fR |
| Automatically detect AMD GPUs. Requires the ROCm System Management Interface |
| (ROCm SMI) Library. |
| .RE |
| .IP |
| \fIAutoDetect\fR can be on a line by itself, in which case it will globally |
| apply to all lines in gres.conf by default. In addition, \fIAutoDetect\fR can be |
| combined with \fBNodeName\fR to only apply to certain nodes. Node\-specific |
\fIAutoDetect\fRs will override the global \fIAutoDetect\fR. A node\-specific
| \fIAutoDetect\fR only needs to be specified once per node. If specified multiple |
| times for the same nodes, they must all be the same value. To unset |
| \fIAutoDetect\fR for a node when a global \fIAutoDetect\fR is set, simply set it |
| to "off" in a node\-specific GRES line. |
| E.g.: \fINodeName=tux3 AutoDetect=off Name=gpu File=/dev/nvidia[0\-3]\fR. |
| \fIAutoDetect\fR cannot be used with cloud nodes. |
| |
| |
\fIAutoDetect\fR will automatically detect files, cores, links, and any other
hardware. If a parameter such as \fBFile\fR, \fBCores\fR, or \fBLinks\fR is
specified when \fIAutoDetect\fR is used, then the specified values are used to
sanity check the auto\-detected values. If there is a mismatch, then the node's
state is set to invalid and the node is drained.
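For example, an illustrative sketch (hypothetical node and device names) that
pairs autodetection with explicit values used only as sanity checks:
.nf
AutoDetect=nvml
NodeName=tux[0\-15] Name=gpu File=/dev/nvidia[0\-3]
.fi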
| .IP |
| |
| .TP |
| \fBCount\fR |
| Number of resources of this name/type available on this node. |
| The default value is set to the number of \fBFile\fR values specified (if any), |
| otherwise the default value is one. A suffix of "K", "M", "G", "T" or "P" may be |
| used to multiply the number by 1024, 1048576, 1073741824, etc. respectively. |
| For example: "Count=10G". |
| .IP |
| |
| .TP |
| \fBCores\fR |
| Optionally specify the core index numbers matching the specific sockets* which |
| can use this resource. |
| |
| If this option is used, all the cores in a socket* must be specified together. |
While Slurm can track and assign resources at the CPU or thread level, the
scheduling algorithms used to co\-allocate GRES devices with CPUs operate at a
socket* level for job allocations.
| Therefore, it is not possible to preferentially assign GRES with different |
| specific CPUs on the same socket*. |
| |
| *Sockets may be substituted for NUMA nodes with |
| SlurmdParameters=numa_node_as_socket or l3cache with |
| SlurmdParameters=l3cache_as_socket. |
| |
| Multiple cores may be specified using a comma\-delimited list or a range may be |
| specified using a "\-" separator (e.g. "0,1,2,3" or "0\-3"). |
| If a job specifies \fB\-\-gres\-flags=enforce\-binding\fR, then only the |
| identified cores can be allocated with each generic resource. This will tend to |
| improve performance of jobs, but delay the allocation of resources to them. |
If specified and a job is \fInot\fR submitted with the
\fB\-\-gres\-flags=enforce\-binding\fR option, the identified cores will be
preferred for scheduling with each generic resource.
| |
| If \fB\-\-gres\-flags=disable\-binding\fR is specified, then any core can be |
| used with the resources, which also increases the speed of Slurm's |
| scheduling algorithm but can degrade the application performance. |
| The \fB\-\-gres\-flags=disable\-binding\fR option is currently required to use |
| more CPUs than are bound to a GRES (e.g. if a GPU is bound to the CPUs on one |
| socket, but resources on more than one socket are required to run the job). |
If any core can be effectively used with the resources, then do not specify the
\fBCores\fR option for improved speed in the Slurm scheduling logic.
| A restart of the slurmctld is needed for changes to the Cores option to take |
| effect. |
| |
| \fBNOTE\fR: Since Slurm must be able to perform resource management on |
| heterogeneous clusters having various processing unit numbering schemes, |
| a logical core index must be specified instead of the physical core index. |
| That logical core index might not correspond to your physical core index number. |
| Core 0 will be the first core on the first socket, while core 1 will be the |
| second core on the first socket. |
| This numbering coincides with the logical core number (Core L#) seen |
| in "lstopo \-l" command output. |
| .IP |
| |
| .TP |
| \fBFile\fR |
| Fully qualified pathname of the device files associated with a resource. |
| The name can include a numeric range suffix to be interpreted by Slurm |
| (e.g. \fIFile=/dev/nvidia[0\-3]\fR). |
| |
| |
| This field is generally required if enforcement of generic resource |
| allocations is to be supported (i.e. prevents users from making |
| use of resources allocated to a different user). |
| Enforcement of the file allocation relies upon Linux Control Groups (cgroups) |
| and Slurm's task/cgroup plugin, which will place the allocated files into |
| the job's cgroup and prevent use of other files. |
| Please see Slurm's Cgroups Guide for more |
| information: \fIhttps://slurm.schedmd.com/cgroups.html\fR. |
| |
| If \fBFile\fR is specified then \fBCount\fR must be either set to the number |
| of file names specified or not set (the default value is the number of files |
| specified). |
The exception to this is MPS/Sharding. For either of these GRES, each GPU
would be identified by device file using the \fBFile\fR parameter and
\fBCount\fR would specify the number of entries that would correspond to that
GPU. For MPS, this is typically 100 or some multiple of 100. For Sharding,
this is typically the maximum number of jobs that could simultaneously share
that GPU.
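
A sketch of this exception for a single GPU (hypothetical counts):
.nf
Name=gpu File=/dev/nvidia0
Name=mps Count=100 File=/dev/nvidia0
.fi
A Sharding configuration would look similar, e.g.
\fIName=shard Count=8 File=/dev/nvidia0\fR, allowing up to 8 jobs to share
that GPU simultaneously.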
| |
If using a card with Multi\-Instance GPU (MIG) functionality, use
\fBMultipleFiles\fR instead. \fBFile\fR and \fBMultipleFiles\fR are mutually
exclusive.
| |
| \fBNOTE\fR: \fBFile\fR is required for all \fIgpu\fR typed GRES. |
| |
| \fBNOTE\fR: If you specify the \fBFile\fR parameter for a resource on some node, |
| the option must be specified on all nodes and Slurm will track the assignment |
| of each specific resource on each node. Otherwise Slurm will only track a |
| count of allocated resources rather than the state of each individual device |
| file. |
| |
| \fBNOTE\fR: Drain a node before changing the count of records with \fBFile\fR |
| parameters (e.g. if you want to add or remove GPUs from a node's configuration). |
| Failure to do so will result in any job using those GRES being aborted. |
| |
| \fBNOTE\fR: When specifying \fBFile\fR, \fBCount\fR is limited in size |
| (currently 1024) for each node. |
| .IP |
| |
| .TP |
| \fBFlags\fR |
| Optional flags that can be specified to change configured behavior of the GRES. |
| |
| Allowed values at present are: |
| .IP |
| .RS |
| .TP 20 |
| \fBCountOnly\fR |
Do not attempt to load a plugin of the GRES type as this GRES will only be
used to track counts of GRES used. This avoids attempting to load a
non\-existent plugin, which can affect filesystems with high\-latency metadata
operations for non\-existent files.
| |
| \fBNOTE\fR: If a gres has this flag configured it is global, so all other nodes |
| with that gres will have this flag implied. |
| .IP |
| |
| .TP |
| \fBexplicit\fR |
If this flag is set, the GRES is not allocated to the job as part of a whole
node allocation (\fB\-\-exclusive\fR or \fBOverSubscribe=EXCLUSIVE\fR set on
the partition) unless it was explicitly requested by the job.
| |
| \fBNOTE\fR: If a gres has this flag configured it is global, so all other nodes |
| with that gres will have this flag implied. |
| .IP |
| |
| .TP |
| \fBone_sharing\fR |
To be used on a shared gres. If using a shared gres (mps) on top of a sharing
gres (gpu), only allow one of the sharing gres to be used by the shared gres.
| This is the default for MPS. |
| |
| \fBNOTE\fR: If a gres has this flag configured it is global, so all other nodes |
| with that gres will have this flag implied. This flag is not compatible with |
| all_sharing for a specific gres. |
| .IP |
| |
| .TP |
| \fBall_sharing\fR |
| To be used on a shared gres. This is the opposite of one_sharing and can be |
| used to allow all sharing gres (gpu) on a node to be used for shared gres (mps). |
| |
| \fBNOTE\fR: If a gres has this flag configured it is global, so all other nodes |
| with that gres will have this flag implied. This flag is not compatible with |
| one_sharing for a specific gres. |
| .IP |
| |
| .TP |
| \fBnvidia_gpu_env\fR |
| Set environment variable \fICUDA_VISIBLE_DEVICES\fR for all GPUs on the |
| specified node(s). |
| .IP |
| |
| .TP |
| \fBamd_gpu_env\fR |
| Set environment variable \fIROCR_VISIBLE_DEVICES\fR for all GPUs on the |
| specified node(s). |
| .IP |
| |
| .TP |
| \fBintel_gpu_env\fR |
| Set environment variable \fIZE_AFFINITY_MASK\fR for all GPUs on the |
| specified node(s). |
| .IP |
| |
| .TP |
| \fBopencl_env\fR |
| Set environment variable \fIGPU_DEVICE_ORDINAL\fR for all GPUs on the |
| specified node(s). |
| .IP |
| |
| .TP |
| \fBno_gpu_env\fR |
Set no GPU\-specific environment variables. This is mutually exclusive with
all other environment\-related flags.
| .RE |
| .IP |
| If no environment\-related flags are specified, then \fInvidia_gpu_env\fR, |
| \fIamd_gpu_env\fR, \fIintel_gpu_env\fR, and \fIopencl_env\fR will be |
| implicitly set by default. |
| If \fBAutoDetect\fR is used and environment\-related flags are not specified, |
| then \fIAutoDetect=nvml\fR or \fIAutoDetect=nvidia\fR will set |
| \fInvidia_gpu_env\fR, \fIAutoDetect=rsmi\fR will set \fIamd_gpu_env\fR, |
| and \fIAutoDetect=oneapi\fR will set \fIintel_gpu_env\fR. |
| Conversely, specified environment\-related flags will always override |
| \fBAutoDetect\fR. |
| |
| Environment\-related flags set on one GRES line will be inherited by the GRES |
| line directly below it if no environment\-related flags are specified on that |
| line and if it is of the same node, name, and type. Environment\-related flags |
| must be the same for GRES of the same node, name, and type. |
| |
| Note that there is a known issue with the AMD ROCm runtime where |
| \fIROCR_VISIBLE_DEVICES\fR is processed first, and then |
| \fICUDA_VISIBLE_DEVICES\fR is processed. To avoid the issues caused by this, set |
| \fIFlags=amd_gpu_env\fR for AMD GPUs so only \fIROCR_VISIBLE_DEVICES\fR is set. |
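
A sketch combining this workaround with the inheritance rule above
(hypothetical node name and device paths; the second line inherits
\fIFlags=amd_gpu_env\fR from the first):
.nf
NodeName=tux8 Name=gpu Type=mi100 File=/dev/dri/renderD128 Flags=amd_gpu_env
NodeName=tux8 Name=gpu Type=mi100 File=/dev/dri/renderD129
.fi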
| .IP |
| |
| .TP |
| \fBLinks\fR |
| A comma\-delimited list of numbers identifying the number of connections |
| between this device and other devices to allow coscheduling of |
| better connected devices. |
| This is an ordered list in which the number of connections this specific |
| device has to device number 0 would be in the first position, the number of |
| connections it has to device number 1 in the second position, etc. |
| A \-1 indicates the device itself and a 0 indicates no connection. |
| If specified, then this line can only contain a single GRES device (i.e. can |
| only contain a single file via \fBFile\fR). |
| |
| |
| This is an optional value and is usually automatically determined if |
| \fBAutoDetect\fR is enabled. |
| A typical use case would be to identify GPUs having NVLink connectivity. |
| Note that for GPUs, the minor number assigned by the OS and used in the device |
| file (i.e. the X in \fI/dev/nvidiaX\fR) is not necessarily the same as the |
| device number/index. The device number is created by sorting the GPUs by PCI bus |
| ID and then numbering them starting from the smallest bus ID. |
See \fIhttps://slurm.schedmd.com/gres.html#GPU_Management\fR.
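
For example, a sketch of four hypothetical GPUs in which devices 0 and 1 are
connected to each other by two links, as are devices 2 and 3:
.nf
Name=gpu File=/dev/nvidia0 Links=\-1,2,0,0
Name=gpu File=/dev/nvidia1 Links=2,\-1,0,0
Name=gpu File=/dev/nvidia2 Links=0,0,\-1,2
Name=gpu File=/dev/nvidia3 Links=0,0,2,\-1
.fi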
| .IP |
| |
| .TP |
| \fBMultipleFiles\fR |
| Fully qualified pathname of the device files associated with a resource. |
Graphics cards using Multi\-Instance GPU (MIG) technology will present multiple
device files that should be managed as a single generic resource. The file
names can be given as a comma\-separated list, or the name can include a numeric
range suffix (e.g. \fIMultipleFiles=/dev/nvidia[0\-3]\fR).
| |
| Drain a node before changing the count of records with the \fBMultipleFiles\fR |
| parameter, such as when adding or removing GPUs from a node's configuration. |
| Failure to do so will result in any job using those GRES being aborted. |
| |
| When not using GPUs with MIG functionality, use \fBFile\fR instead. |
| \fBMultipleFiles\fR and \fBFile\fR are mutually exclusive. |
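
An illustrative sketch (the Type and device paths are hypothetical; the actual
set of device files depends on the system's MIG configuration):
.nf
Name=gpu Type=1g.10gb MultipleFiles=/dev/nvidia0,/dev/nvidia\-caps/nvidia\-cap21
.fi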
| .IP |
| |
| .TP |
| \fBName\fR |
| Name of the generic resource. Any desired name may be used. |
| The name must match a value in \fBGresTypes\fR in \fIslurm.conf\fR. |
| Each generic resource has an optional plugin which can provide |
| resource\-specific functionality. |
| Generic resources that currently include an optional plugin are: |
| .IP |
| .RS |
| .TP |
| \fBgpu\fR |
| Graphics Processing Unit |
| .IP |
| |
| .TP |
| \fBmps\fR |
| CUDA Multi\-Process Service (MPS) |
| .IP |
| |
| .TP |
| \fBnic\fR |
| Network Interface Card |
| .IP |
| |
| .TP |
| \fBshard\fR |
Shards of a GPU
| .IP |
| .RE |
| |
| .TP |
| \fBNodeName\fR |
| An optional NodeName specification can be used to permit one gres.conf file to |
| be used for all compute nodes in a cluster by specifying the node(s) that each |
| line should apply to. |
| The NodeName specification can use a Slurm hostlist specification as shown in |
| the example below. |
| .IP |
| |
| .TP |
| \fBType\fR |
| An optional arbitrary string identifying the type of generic resource. |
| For example, this might be used to identify a specific model of GPU, which users |
| can then specify in a job request. |
For changes to the \fBType\fR option to take effect with "scontrol reconfigure",
all affected \fBslurmd\fR daemons must be responding to the \fBslurmctld\fR.
Otherwise a restart of the \fBslurmctld\fR and \fBslurmd\fR daemons is required.
| |
| \fBNOTE\fR: If using autodetect functionality and defining the Type in your |
| gres.conf file, the Type specified should match or be a substring of the value |
| that is detected, using an underscore in lieu of any spaces. |
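
For example, if the detected value were (hypothetically)
"tesla_v100\-sxm2\-16gb", a \fBType\fR such as "v100" would match as a
substring:
.nf
NodeName=tux[0\-3] AutoDetect=nvml Name=gpu Type=v100 Count=4
.fi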
| .IP |
| |
| .SH "EXAMPLES" |
| .nf |
| ################################################################## |
| # Slurm's Generic Resource (GRES) configuration file |
| # Define GPU devices with MPS support, with AutoDetect sanity checking |
| ################################################################## |
| AutoDetect=nvml |
Name=gpu Type=gtx560 File=/dev/nvidia0 Cores=0,1
Name=gpu Type=tesla File=/dev/nvidia1 Cores=2,3
Name=mps Count=100 File=/dev/nvidia0 Cores=0,1
Name=mps Count=100 File=/dev/nvidia1 Cores=2,3
| .fi |
| |
| .nf |
| ################################################################## |
| # Slurm's Generic Resource (GRES) configuration file |
| # Overwrite system defaults and explicitly configure three GPUs |
| ################################################################## |
Name=gpu Type=tesla File=/dev/nvidia[0\-1] Cores=0,1
# Name=gpu Type=tesla File=/dev/nvidia[2\-3] Cores=2,3
# NOTE: nvidia2 device is out of service
Name=gpu Type=tesla File=/dev/nvidia3 Cores=2,3
| .fi |
| |
| .nf |
| ################################################################## |
| # Slurm's Generic Resource (GRES) configuration file |
| # Use a single gres.conf file for all compute nodes \- positive method |
| ################################################################## |
| ## Explicitly specify devices on nodes tux0\-tux15 |
| # NodeName=tux[0\-15] Name=gpu File=/dev/nvidia[0\-3] |
| # NOTE: tux3 nvidia1 device is out of service |
| NodeName=tux[0\-2] Name=gpu File=/dev/nvidia[0\-3] |
| NodeName=tux3 Name=gpu File=/dev/nvidia[0,2\-3] |
| NodeName=tux[4\-15] Name=gpu File=/dev/nvidia[0\-3] |
| .fi |
| |
| .nf |
| ################################################################## |
| # Slurm's Generic Resource (GRES) configuration file |
| # Use NVML to gather GPU configuration information |
| # for all nodes except one |
| ################################################################## |
| AutoDetect=nvml |
| NodeName=tux3 AutoDetect=off Name=gpu File=/dev/nvidia[0\-3] |
| .fi |
| |
| .nf |
| ################################################################## |
| # Slurm's Generic Resource (GRES) configuration file |
| # Specify some nodes with NVML, some with RSMI, and some with no AutoDetect |
| ################################################################## |
| NodeName=tux[0\-7] AutoDetect=nvml |
| NodeName=tux[8\-11] AutoDetect=rsmi |
| NodeName=tux[12\-15] Name=gpu File=/dev/nvidia[0\-3] |
| .fi |
| |
| .nf |
| ################################################################## |
| # Slurm's Generic Resource (GRES) configuration file |
| # Define 'bandwidth' GRES to use as a way to limit the |
| # resource use on these nodes for workflow purposes |
| ################################################################## |
| NodeName=tux[0\-7] Name=bandwidth Type=lustre Count=4G Flags=CountOnly |
.fi
| |
| .SH "COPYING" |
| Copyright (C) 2010 The Regents of the University of California. |
| Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER). |
| .br |
| Copyright (C) 2010\-2022 SchedMD LLC. |
| .LP |
| This file is part of Slurm, a resource management program. |
| For details, see <https://slurm.schedmd.com/>. |
| .LP |
| Slurm is free software; you can redistribute it and/or modify it under |
| the terms of the GNU General Public License as published by the Free |
| Software Foundation; either version 2 of the License, or (at your option) |
| any later version. |
| .LP |
| Slurm is distributed in the hope that it will be useful, but WITHOUT ANY |
| WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS |
| FOR A PARTICULAR PURPOSE. See the GNU General Public License for more |
| details. |
| |
| .SH "SEE ALSO" |
| .LP |
| \fBslurm.conf\fR(5) |