|  | <!--#include virtual="header.txt"--> | 
|  |  | 
|  | <h1>Control Group in Slurm</h1> | 
|  |  | 
|  | <h2 id="contents">Contents | 
|  | <a class="slurm_link" href="#contents"></a> | 
|  | </h2> | 
|  |  | 
|  | <ul> | 
|  | <li><a href="#overview">Control Group Overview</a></li> | 
|  | <li><a href="#cgroup_design">Slurm cgroup plugins design</a></li> | 
|  | <li><a href="#use">Use of cgroup in Slurm</a></li> | 
|  | <li><a href="#configuration">Slurm Cgroup Configuration Overview</a></li> | 
|  | <li><a href="#Plugins">Currently Available Cgroup Plugins</a> | 
|  | <ul> | 
|  | <li><a href="#proctrack">proctrack/cgroup plugin</a></li> | 
|  | <li><a href="#task">task/cgroup plugin</a></li> | 
|  | <li><a href="#jobacct_gather">jobacct_gather/cgroup plugin</a></li> | 
|  | </ul> | 
|  | </li> | 
|  | <li><a href="#Specialization">Use of cgroup for Resource Specialization</a></li> | 
|  | <li><a href="#cgroupplugins">Slurm cgroup plugins</a> | 
|  | <ul> | 
|  | <li><a href="#differences">Main differences between cgroup/v1 and cgroup/v2</a></li> | 
|  | <li><a href="#interfaces">Main differences between controller interfaces</a></li> | 
|  | <li><a href="#generalities">Other generalities</a></li> | 
|  | </ul> | 
|  | </li> | 
|  | </ul> | 
|  |  | 
|  | <h2 id="overview">Control Group Overview | 
|  | <a class="slurm_link" href="#overview"></a> | 
|  | </h2> | 
|  | <p>Control Group is a mechanism provided by the kernel to organize processes | 
|  | hierarchically and distribute system resources along the hierarchy in a | 
|  | controlled and configurable manner. Slurm can use cgroups to constrain | 
|  | different resources to jobs, steps and tasks, and to gather accounting data | 
|  | about these resources.</p> | 
|  |  | 
|  | <p>A cgroup provides different controllers (formerly "subsystems") for different | 
|  | resources. Slurm plugins can use several of these controllers, e.g.: <i>memory, | 
|  | cpu, devices, freezer, cpuset, cpuacct</i>. Each enabled controller | 
|  | gives the ability to constrain resources to a set of processes. If one | 
|  | controller is not available on the system, then Slurm cannot constrain the | 
|  | associated resources through a cgroup.</p> | 
|  |  | 
|  | <p>"cgroup" stands for "control group" and is never capitalized. The singular | 
|  | form is used to designate the whole feature and also as a qualifier as in | 
|  | "cgroup controllers". When explicitly referring to multiple individual control | 
|  | groups, the plural form "cgroups" is used.</p> | 
|  |  | 
|  | <p>Slurm supports two cgroup modes: Legacy mode (cgroup v1) and Unified mode | 
|  | (cgroup v2). Hybrid mode, where controllers from both version 1 and version 2 | 
|  | are mixed in a system, is not supported.</p> | 
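<p>As a rough illustration of how the modes differ on disk, the following
Python sketch guesses the mode from the layout of the cgroup mount point. It is
not how Slurm detects the mode; the function name and the heuristics (presence
of <i>cgroup.controllers</i> at the root for Unified mode, a <i>unified</i>
subdirectory for the hybrid layout commonly set up by systemd) are assumptions
made for this example.</p>

```python
from pathlib import Path


def detect_cgroup_mode(root="/sys/fs/cgroup"):
    """Guess the cgroup mode from the layout of the cgroup mount point.

    Heuristic sketch only; detect_cgroup_mode is a hypothetical helper.
    """
    root = Path(root)
    # In Unified mode (v2) the root of the hierarchy exposes cgroup.controllers.
    if (root / "cgroup.controllers").is_file():
        return "unified"
    # A hybrid layout typically mounts a v2 hierarchy under <root>/unified.
    if (root / "unified" / "cgroup.controllers").is_file():
        return "hybrid"
    # In Legacy mode (v1) every controller has its own directory.
    if root.is_dir() and any(p.is_dir() for p in root.iterdir()):
        return "legacy"
    return "unknown"
```

<p>Calling <i>detect_cgroup_mode()</i> without arguments inspects
/sys/fs/cgroup; a result of "hybrid" would indicate a layout that Slurm does
not support.</p>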
|  |  | 
|  | <p><b>NOTE</b>: The cgroup/v1 plugin is deprecated and will not be supported in | 
|  | future Slurm versions. Newer GNU/Linux distributions are dropping, or have | 
|  | dropped, support for cgroup v1 and may not even provide kernel support for the | 
|  | required cgroup v1 interfaces. Systemd has also deprecated cgroup v1. Starting | 
|  | with Slurm version 25.05, no new features will be added to cgroup v1. Fixes for | 
|  | critical bugs will be provided until its final removal.</p> | 
|  |  | 
|  | <p>See the kernel.org documentation for a more comprehensive description of | 
|  | cgroup:</p> | 
|  |  | 
|  | <ul> | 
|  | <li><a href="https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt"> | 
|  | Kernel's Cgroup v1 documentation</a> | 
|  | </li> | 
|  | <li><a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html"> | 
|  | Kernel's Cgroup v2 documentation </a> | 
|  | </li> | 
|  | </ul> | 
|  |  | 
|  | <h2 id="cgroup_design">Slurm cgroup plugins design | 
|  | <a class="slurm_link" href="#cgroup_design"></a> | 
|  | </h2> | 
|  |  | 
|  | <p>For extended information on Slurm's internal cgroup plugin, read:</p> | 
|  | <ul> | 
|  | <li><a href="cgroup_v2.html">cgroup/v2 plugin documentation</a> </li> | 
|  | </ul> | 
|  |  | 
|  | <h2 id="use">Use of cgroup in Slurm <a class="slurm_link" href="#use"></a> </h2> | 
|  | <p>Slurm provides cgroup versions of a number of plugins.</p> | 
|  | <ul> | 
|  | <li>proctrack/cgroup (for process tracking and management)</li> | 
|  | <li>task/cgroup (for constraining resources at step and task level)</li> | 
|  | <li>jobacct_gather/cgroup (for gathering statistics)</li> | 
|  | </ul> | 
|  |  | 
|  | <p>cgroups can also be used for resource specialization (constraining daemons to | 
|  | cores or memory).</p> | 
|  |  | 
|  | <h2 id="configuration">Slurm Cgroup Configuration Overview | 
|  | <a class="slurm_link" href="#configuration"></a> | 
|  | </h2> | 
|  |  | 
|  | <p>There are several sets of configuration options for Slurm cgroups:</p> | 
|  |  | 
|  | <ul> | 
|  | <li><a href="slurm.conf.html">slurm.conf</a> provides options to enable the | 
|  | cgroup plugins. Each plugin may be enabled or disabled independently of the | 
|  | others. | 
|  | </li> | 
|  | <li><a href="cgroup.conf.html">cgroup.conf</a> provides general options that are | 
|  | common to all cgroup plugins, plus additional options that apply only to | 
|  | specific plugins. | 
|  | </li> | 
|  | <li>System-level resource specialization is enabled using node configuration | 
|  | parameters. | 
|  | </li> | 
|  | </ul> | 
|  |  | 
|  | <h2 id="Plugins">Currently Available Cgroup Plugins | 
|  | <a class="slurm_link" href="#Plugins"></a> | 
|  | </h2> | 
|  |  | 
|  | <h3 id="proctrack">proctrack/cgroup plugin | 
|  | <a class="slurm_link" href="#proctrack"></a> | 
|  | </h3> | 
|  |  | 
|  | <p>The proctrack/cgroup plugin is an alternative to other proctrack plugins such | 
|  | as proctrack/linux for process tracking and suspend/resume capability. | 
|  | </p> | 
|  |  | 
|  | <p> | 
|  | proctrack/cgroup uses the freezer controller to keep track of all pids of a | 
|  | job. It stores the pids in a specific hierarchy in the cgroup tree and takes | 
|  | care of signaling these pids when instructed. For example, if a user decides | 
|  | to cancel a job, Slurm will execute this order internally by calling the | 
|  | proctrack plugin and asking it to send a SIGTERM to the job. Since proctrack | 
|  | maintains a hierarchy of all Slurm-related pids in cgroup, it knows exactly | 
|  | which ones need to be signaled. | 
|  | <br> | 
|  | Proctrack can also respond to queries for getting a list of all the pids of a | 
|  | job or a step. | 
|  | <br> | 
|  | When using proctrack/cgroup, pids are stored by cgroup in a single file | 
|  | (cgroup.procs) which is read by the plugin to get all the pids of a part of | 
|  | the hierarchy. For example, a single step has its own cgroup.procs file, so | 
|  | getting the pids of the step only requires reading that file. In contrast, | 
|  | proctrack/linux needs to read /proc recursively to get all the descendants of | 
|  | a parent pid. | 
|  | </p> | 
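<p>The difference between the two lookups can be sketched in Python as follows.
This is only an illustration, not Slurm's implementation: the function names
are hypothetical, and the pid-to-parent mapping stands in for what a
/proc-based tracker would have to rebuild by scanning every process on the
node.</p>

```python
from pathlib import Path


def pids_from_cgroup(cgroup_dir):
    """Read every pid attached to one cgroup directory (cgroup-style lookup)."""
    text = (Path(cgroup_dir) / "cgroup.procs").read_text()
    return sorted(int(pid) for pid in text.split())


def descendants(parent_of, root_pid):
    """Collect all descendants of root_pid from a pid -> parent pid mapping.

    A /proc-based tracker would first have to build parent_of by reading
    /proc/<pid>/stat for every process (hypothetical input for this sketch).
    """
    children = {}
    for pid, ppid in parent_of.items():
        children.setdefault(ppid, []).append(pid)
    found, stack = [], [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return sorted(found)
```

<p>The first function is a single file read per step, while the second needs a
full process table before it can answer anything.</p>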
|  |  | 
|  | <p>To enable this plugin, configure the following option in slurm.conf:</p> | 
|  | <pre>ProctrackType=proctrack/cgroup</pre> | 
|  |  | 
|  | <p>There are no specific options for this plugin in cgroup.conf, but the general | 
|  | options apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page for | 
|  | details.</p> | 
|  |  | 
|  | <h3 id="task">task/cgroup plugin<a class="slurm_link" href="#task"></a></h3> | 
|  |  | 
|  | <p>The task/cgroup plugin allows constraining resources to a job, a step, or a | 
|  | task. This is the only plugin that can ensure that the boundaries of an | 
|  | allocation are not violated. | 
|  | The only alternative, jobacct_gather/linux, offers a very simplistic mechanism | 
|  | for constraining memory to a job, but it is not reliable (there is a window of | 
|  | time in which a job can exceed its limits) and is intended only for the very | 
|  | rare systems where cgroup is not available.</p> | 
|  |  | 
|  | <p>task/cgroup provides the following features:</p> | 
|  |  | 
|  | <ul> | 
|  | <li>Confine jobs and steps to their allocated cpuset.</li> | 
|  | <li>Confine jobs and steps to specific memory resources.</li> | 
|  | <li>Confine jobs, steps and tasks to their allocated gres, including gpus.</li> | 
|  | </ul> | 
|  |  | 
|  | <p>The task/cgroup plugin uses the cpuset, memory and devices controllers.</p> | 
|  |  | 
|  | <p>To enable this plugin, add <i>task/cgroup</i> to the TaskPlugin configuration | 
|  | parameter in slurm.conf:</p> | 
|  |  | 
|  | <pre>TaskPlugin=task/cgroup</pre> | 
|  |  | 
|  | <p>There are many specific options for this plugin in cgroup.conf. The general | 
|  | options also apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page | 
|  | for details.</p> | 
|  |  | 
|  | <p>This plugin can be stacked with other task plugins, for example with | 
|  | <i>task/affinity</i>. Stacking them constrains resources to a job while also | 
|  | providing the advantages of the affinity plugin (order doesn't matter):</p> | 
|  |  | 
|  | <pre>TaskPlugin=task/cgroup,task/affinity</pre> | 
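<p>As an illustration, a minimal configuration that asks task/cgroup to
constrain cores, memory, swap and devices might look like the fragment below.
The choice of values is an assumption made for this example; check the
<a href="cgroup.conf.html">cgroup.conf</a> man page for the defaults and exact
semantics of each option before using it.</p>

```text
# slurm.conf
TaskPlugin=task/cgroup,task/affinity

# cgroup.conf (example values, not a recommendation)
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes
```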
|  |  | 
|  | <h3 id="jobacct_gather">jobacct_gather/cgroup plugin | 
|  | <a class="slurm_link" href="#jobacct_gather"></a> | 
|  | </h3> | 
|  |  | 
|  | <p> | 
|  | The <i>jobacct_gather/cgroup</i> plugin is an alternative to the | 
|  | <i>jobacct_gather/linux</i> plugin for the collection of accounting statistics | 
|  | for jobs, steps and tasks. | 
|  | <br> | 
|  | <i>jobacct_gather/cgroup</i> uses the cpuacct and memory cgroup controllers. | 
|  | </p> | 
|  |  | 
|  | <p>The cpu and memory statistics collected by this plugin do not represent the | 
|  | same resources as the cpu and memory statistics collected by the | 
|  | <i>jobacct_gather/linux</i> plugin. While the cgroup plugin just reads a | 
|  | cgroup.stats file and similar containing the information for the entire | 
|  | subtree of pids, the linux plugin gets information from /proc/pid/stat for | 
|  | every pid and then does the calculations, making it slightly less efficient | 
|  | than the cgroup one (though the difference is not noticeable in practice).</p> | 
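<p>To give an idea of the per-pid work involved in the /proc approach, the
Python sketch below extracts the user and system CPU ticks from
/proc/pid/stat lines and sums them over a set of processes. It is not Slurm's
code; the function names are hypothetical, and the field positions (utime and
stime as fields 14 and 15) follow the proc(5) man page.</p>

```python
def cpu_ticks_from_proc_stat(stat_line):
    """Return (utime, stime) in clock ticks from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split on the last ')' before tokenizing the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11] and rest[12].
    return int(rest[11]), int(rest[12])


def total_cpu_ticks(stat_lines):
    """Sum user and system ticks over every pid, one stat line per process."""
    user = system = 0
    for line in stat_lines:
        utime, stime = cpu_ticks_from_proc_stat(line)
        user += utime
        system += stime
    return user, system
```

<p>A cgroup-based gatherer avoids this loop because the kernel already
aggregates the counters for the whole subtree.</p>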
|  |  | 
|  | <p>To enable this plugin, configure the following option in slurm.conf:</p> | 
|  | <pre>JobacctGatherType=jobacct_gather/cgroup</pre> | 
|  |  | 
|  | <p>There are no specific options for this plugin in cgroup.conf, but the general | 
|  | options apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page for | 
|  | details.</p> | 
|  |  | 
|  | <h2 id="Specialization">Use of cgroup for Resource Specialization | 
|  | <a class="slurm_link" href="#Specialization"></a> | 
|  | </h2> | 
|  |  | 
|  | <p>Resource Specialization may be used to reserve a subset of cores or a | 
|  | specific amount of memory on each compute node for exclusive use by the Slurm | 
|  | compute node daemon, slurmd.</p> | 
|  |  | 
|  | <p>If cgroup/v1 is used, the reserved resources will also be used by the | 
|  | slurmstepd processes. If cgroup/v2 is used, slurmstepd is not constrained by | 
|  | this resource specialization. Instead, slurmstepd is constrained to the | 
|  | resources allocated to the job, since it is considered part of the job and its | 
|  | consumption depends entirely on the topology of the job. For example, an | 
|  | MPI job can initialize many ranks with PMI and make slurmstepd consume more | 
|  | memory.</p> | 
|  |  | 
|  | <p>System-level resource specialization is enabled with special node | 
|  | configuration parameters. Read <a href="slurm.conf.html">slurm.conf</a> and core | 
|  | specialization in <a href="core_spec.html">core_spec.html</a> for more | 
|  | information.</p> | 
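<p>For illustration only, a node definition that reserves two cores and some
memory for slurmd could resemble the following fragment. The node names and
values are hypothetical; <i>CoreSpecCount</i> and <i>MemSpecLimit</i> are
described in <a href="slurm.conf.html">slurm.conf</a>, which should be
consulted for their exact meaning and units.</p>

```text
# slurm.conf (hypothetical node names and values)
NodeName=node[01-04] CoreSpecCount=2 MemSpecLimit=4096
```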
|  |  | 
|  | <h2 id="cgroupplugins">Slurm cgroup plugins | 
|  | <a class="slurm_link" href="#cgroupplugins"></a> | 
|  | </h2> | 
|  |  | 
|  | <p> | 
|  | The cgroup v1 and v2 plugins organize their hierarchies very differently and | 
|  | respond to different design constraints. The design of each cgroup version is | 
|  | the responsibility of the kernel maintainers. | 
|  | </p> | 
|  |  | 
|  | <h3 id="differences">Main differences between cgroup/v1 and cgroup/v2 | 
|  | <a class="slurm_link" href="#differences"></a> | 
|  | </h3> | 
|  |  | 
|  | <p>The main differences between v1 and v2 are:</p> | 
|  |  | 
|  | <ul> | 
|  | <li><b>Unified mode in v2</b><br> | 
|  |  | 
|  | <p>In <i>cgroup/v1</i> there's a separate hierarchy for each controller, which | 
|  | means the job structure must be replicated and managed for every enabled | 
|  | controller. For instance, if the same job uses the <i>memory</i> and | 
|  | <i>freezer</i> controllers, we need to create the same | 
|  | slurm/uid/job_id/step_id/ hierarchy in both controllers' directories:</p> | 
|  |  | 
|  | <pre>/sys/fs/cgroup/memory/slurm/uid_1000/job_1/step_0/</pre> | 
|  | <pre>/sys/fs/cgroup/freezer/slurm/uid_1000/job_1/step_0/</pre> | 
|  |  | 
|  | <p>In <i>cgroup/v2</i> we have a <i>Unified</i> hierarchy, where controllers are | 
|  | enabled at the same level and presented to the user as different files:</p> | 
|  |  | 
|  | <pre>/sys/fs/cgroup/system.slice/slurmstepd.scope/job_1/step_0/</pre> | 
|  |  | 
|  | </li> | 
|  |  | 
|  | <li><b>Top-down constraint in v2</b><br> | 
|  | <p>Resources are distributed top-down and a cgroup can further distribute a | 
|  | resource only if the resource has been distributed to it from the parent. | 
|  | The controllers available to a cgroup are listed in its | 
|  | <i>cgroup.controllers</i> file, and the controllers enabled for its children | 
|  | are listed in <i>cgroup.subtree_control</i>.</p> | 
|  | </li> | 
|  |  | 
|  | <li><b>No-Internal-Process constraint in v2</b><br> | 
|  | <p>In <i>cgroup/v1</i> the hierarchy is free, which means one can create any | 
|  | directory in the tree and put pids in it. In <i>cgroup/v2</i> there's a kernel | 
|  | restriction which prevents adding a pid to non-leaf directories.</p> | 
|  | </li> | 
|  |  | 
|  | <li><b>Systemd dependency on cgroup/v2 - separation of slurmd and stepds | 
|  | </b><p>This is not a kernel limitation but a systemd decision, which imposes an | 
|  | important restriction on services that decide to use <i>Delegate=yes</i>. | 
|  | Systemd, with pid 1, decided to be the complete owner of the cgroup | 
|  | hierarchy, <i>/sys/fs/cgroup</i>, trying to impose a <i>single-writer</i> | 
|  | design. This means that everything related to cgroup must be under control of | 
|  | systemd. If one decides to manually modify the cgroup tree, creating directories | 
|  | and moving pids around, it is possible that at some point systemd may decide to | 
|  | enable or disable controllers on the entire tree, or move pids around. It has | 
|  | been observed that a</p> | 
|  |  | 
|  | <pre>systemctl daemon-reload</pre> | 
|  |  | 
|  | <p>or a</p> | 
|  |  | 
|  | <pre>systemctl reset-failed</pre> | 
|  |  | 
|  | <p>removed controllers, at any level and directory of the tree, if there was no | 
|  | "systemd unit" making use of them and there was no started "systemd unit" with | 
|  | "Delegate=yes" on the system. This is because systemd wants to clean up | 
|  | the cgroup tree and match it against its internal unit database. In fact, | 
|  | looking at the code of systemd one can see how cgroup directories related to | 
|  | units with the "Delegate=yes" flag are ignored, while any other cgroup | 
|  | directories are modified.</p> | 
|  |  | 
|  | <p>This makes it mandatory to start slurmd and slurmstepd processes | 
|  | under a unit with "Delegate=yes", which means we need to start, stop and restart | 
|  | slurmd with systemd. If we do that though, since we may have previously modified | 
|  | the tree where slurmd belongs (e.g. adding job directories), systemd will not be | 
|  | able to restart slurmd because of the <i>No-Internal-Process constraint</i> | 
|  | mentioned earlier: it will not be able to put the new slurmd pid into the root | 
|  | cgroup, which is now a non-leaf. This forces us to separate the cgroup | 
|  | hierarchies of slurmstepd from the slurmd ones. Since we need to inform | 
|  | systemd about it and put slurmstepd into a new unit, we do a dbus call to | 
|  | systemd to create a new scope for slurmstepds. See | 
|  | <a href="https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/"> | 
|  | systemd ControlGroupInterface</a> for more information.</p> | 
|  | </li> | 
|  | </ul> | 
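<p>The replication required by v1, compared with the single v2 directory, can
be sketched in Python as follows. The helper names are hypothetical and the
paths simply reproduce the layout of the examples in the list above; the real
locations depend on the system and on the Slurm configuration.</p>

```python
from pathlib import PurePosixPath

CGROUP_ROOT = PurePosixPath("/sys/fs/cgroup")


def v1_step_dirs(controllers, uid, job_id, step_id):
    """One directory per controller: the job tree is replicated in each."""
    tail = PurePosixPath("slurm") / f"uid_{uid}" / f"job_{job_id}" / f"step_{step_id}"
    return [str(CGROUP_ROOT / controller / tail) for controller in controllers]


def v2_step_dir(job_id, step_id, scope="system.slice/slurmstepd.scope"):
    """A single directory in the unified hierarchy, shared by all controllers."""
    return str(CGROUP_ROOT / scope / f"job_{job_id}" / f"step_{step_id}")
```

<p>With two controllers, <i>v1_step_dirs</i> returns two directories that must
be created and cleaned up separately, while <i>v2_step_dir</i> always returns
one.</p>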
|  |  | 
|  | <p>The following differences shouldn't affect how other plugins interact with | 
|  | the cgroup plugins; they only reflect internal functional differences.</p> | 
|  |  | 
|  | <ul> | 
|  | <li>A controller in <i>cgroup/v2</i> is enabled by writing to the parent's | 
|  | <i>cgroup.subtree_control</i> file, while in <i>cgroup/v1</i> a new mount point | 
|  | must be mounted with filesystem type <i>"-t cgroup"</i> and corresponding | 
|  | options, e.g. <i>"-o freezer"</i>. | 
|  | </li> | 
|  |  | 
|  | <li>In <i>cgroup/v2</i> freezer functionality is inherently present through the | 
|  | <i>cgroup.freeze</i> interface. In <i>cgroup/v1</i> it is a specific and | 
|  | separate controller which needs to be mounted. | 
|  | </li> | 
|  |  | 
|  | <li>The devices controller does not exist in cgroup/v2; instead, an eBPF | 
|  | program must be inserted into the kernel. | 
|  | </li> | 
|  |  | 
|  | <li>In <i>cgroup/v2</i>, the memory.stat file has changed, and we now sum | 
|  | anon+swapcached+anon_thp to match the RSS concept in v1. | 
|  | </li> | 
|  |  | 
|  | <li>In <i>cgroup/v2</i>, cpu.stat provides metrics in microseconds while | 
|  | cpuacct.stat in <i>cgroup/v1</i> provides metrics in USER_HZ. | 
|  | </li> | 
|  |  | 
|  | </ul> | 
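<p>The memory and cpu accounting differences in the list above can be
illustrated with a short Python sketch. It is not Slurm's code: the function
names are hypothetical, and the default of 100 ticks per second for USER_HZ is
an assumption (the actual value can be queried with
<i>os.sysconf("SC_CLK_TCK")</i>).</p>

```python
def parse_flat_keyed(text):
    """Parse a cgroup file made of 'key value' lines, such as memory.stat."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def rss_from_v2_memory_stat(text):
    """Approximate the v1 RSS figure from a cgroup v2 memory.stat file."""
    stats = parse_flat_keyed(text)
    # Keys that are absent are counted as zero (an assumption of this sketch).
    return sum(stats.get(key, 0) for key in ("anon", "swapcached", "anon_thp"))


def user_hz_to_seconds(ticks, user_hz=100):
    """Convert a cpuacct.stat value, expressed in USER_HZ ticks, to seconds."""
    return ticks / user_hz
```

<p>Normalizing both versions to a common unit, as done here for the v1 ticks,
is what allows the accounting results to be compared across cgroup
versions.</p>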
|  |  | 
|  | <h3 id="interfaces">Main differences between controller interfaces | 
|  | <a class="slurm_link" href="#interfaces"></a> | 
|  | </h3> | 
|  | <table style="page-break-inside: avoid; font-family: Arial,Helvetica,sans-serif;" border="1" bordercolor="#000000" cellpadding="3" cellspacing="0" width="100%"> | 
|  | <tr bgcolor="#e0e0e0"> | 
|  | <td><u><b>cgroup/v1</b></u></td> | 
|  | <td><u><b>cgroup/v2</b></u></td> | 
|  | </tr> | 
|  | <tr> | 
|  | <td>memory.limit_in_bytes</td> | 
|  | <td>memory.max</td> | 
|  | </tr> | 
|  | <tr> | 
|  | <td>memory.soft_limit_in_bytes</td> | 
|  | <td>memory.high</td> | 
|  | </tr> | 
|  | <tr> | 
|  | <td>memory.memsw.limit_in_bytes</td> | 
|  | <td>memory.swap.max</td> | 
|  | </tr> | 
|  | <tr> | 
|  | <td>memory.swappiness</td> | 
|  | <td>none</td> | 
|  | </tr> | 
|  | <tr> | 
|  | <td>freezer.state</td> | 
|  | <td>cgroup.freeze</td> | 
|  | </tr> | 
|  | <tr> | 
|  | <td>cpuset.cpus</td> | 
|  | <td>cpuset.cpus.effective and cpuset.cpus</td> | 
|  | </tr> | 
|  | <tr> | 
|  | <td>cpuset.mems</td> | 
|  | <td>cpuset.mems.effective and cpuset.mems</td> | 
|  | </tr> | 
|  | <tr> | 
|  | <td>cpuacct.stat</td> | 
|  | <td>cpu.stat</td> | 
|  | </tr> | 
|  | <tr> | 
|  | <td>devices.*</td> | 
|  | <td>eBPF program</td> | 
|  | </tr> | 
|  | </table> | 
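<p>For scripts that need to handle both versions, the table can be expressed
as a lookup structure. The Python sketch below is only an illustration with
hypothetical names; file names are spelled as in the kernel interfaces, the
pairs are the closest counterparts rather than exact equivalents, and the
devices entry is left out because v2 has no interface file for it.</p>

```python
# v1 interface file -> closest v2 counterpart, following the table above.
# None marks a v1 file without a v2 counterpart.
V1_TO_V2 = {
    "memory.limit_in_bytes": "memory.max",
    "memory.soft_limit_in_bytes": "memory.high",
    "memory.memsw.limit_in_bytes": "memory.swap.max",
    "memory.swappiness": None,
    "freezer.state": "cgroup.freeze",
    "cpuset.cpus": "cpuset.cpus",
    "cpuset.mems": "cpuset.mems",
    "cpuacct.stat": "cpu.stat",
}


def v2_counterpart(v1_file):
    """Return the v2 file for a v1 interface file, or None if there is none."""
    if v1_file not in V1_TO_V2:
        raise KeyError(f"no mapping recorded for {v1_file!r}")
    return V1_TO_V2[v1_file]
```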
|  |  | 
|  |  | 
|  | <h3 id="generalities">Other generalities | 
|  | <a class="slurm_link" href="#generalities"></a> | 
|  | </h3> | 
|  |  | 
|  | <ul> | 
|  | <li>When using cgroup/v1, some configurations can exclude the swap cgroup | 
|  | accounting. This accounting is part of the features provided by the memory | 
|  | controller. If this feature is disabled in the kernel configuration or through | 
|  | boot parameters, trying to enable swap constraints will produce an error. If | 
|  | swap constraints are required, add the following parameters to the kernel | 
|  | command line: | 
|  |  | 
|  | <pre>cgroup_enable=memory swapaccount=1</pre> | 
|  |  | 
|  | This can usually be placed in /etc/default/grub inside | 
|  | the <i>GRUB_CMDLINE_LINUX</i> variable. A command such as <i>update-grub</i> | 
|  | must be run after updating the file. This feature can also be disabled in the | 
|  | kernel configuration with the parameter: | 
|  |  | 
|  | <pre>CONFIG_MEMCG_SWAP=</pre></li> | 
|  |  | 
|  | <li>In some Linux distributions, it was possible to use the systemd parameter | 
|  | JoinControllers, which is now deprecated. This parameter allowed multiple | 
|  | controllers to be mounted in a single hierarchy in <i>cgroup/v1</i>, more or | 
|  | less trying to emulate the behavior of <i>cgroup/v2</i> in "Unified" mode. | 
|  | However, Slurm does not work correctly with this configuration, so please make | 
|  | sure your system.conf does not use JoinControllers and that all your cgroup | 
|  | controllers are under separate directories when using | 
|  | <i>cgroup/v1</i> legacy mode. | 
|  | </li> | 
|  | </ul> | 
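<p>Whether the swap accounting parameters mentioned above were passed at boot
can be checked against the kernel command line. The following Python sketch
uses a hypothetical helper name and only looks for the two parameters quoted
in this section; it does not verify the kernel configuration itself.</p>

```python
def swap_accounting_requested(cmdline):
    """Check a kernel command line for the cgroup v1 swap accounting parameters."""
    params = cmdline.split()
    return "cgroup_enable=memory" in params and "swapaccount=1" in params
```

<p>On a running node the input would typically come from /proc/cmdline, e.g.
<i>swap_accounting_requested(open("/proc/cmdline").read())</i>.</p>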
|  |  | 
|  | <p style="text-align:center;">Last modified 5 May 2025</p> | 
|  |  | 
|  | <!--#include virtual="footer.txt"--> |