| <!--#include virtual="header.txt"--> |
| |
| <h1>Control Group in Slurm</h1> |
| |
| <h2 id="contents">Contents |
| <a class="slurm_link" href="#contents"></a> |
| </h2> |
| |
| <ul> |
| <li><a href="#overview">Control Group Overview</a></li> |
| <li><a href="#cgroup_design">Slurm cgroup plugins design</a></li> |
| <li><a href="#use">Use of cgroup in Slurm</a></li> |
| <li><a href="#configuration">Slurm Cgroup Configuration Overview</a></li> |
| <li><a href="#Plugins">Currently Available Cgroup Plugins</a> |
| <ul> |
| <li><a href="#proctrack">proctrack/cgroup plugin</a></li> |
| <li><a href="#task">task/cgroup plugin</a></li> |
| <li><a href="#jobacct_gather">jobacct_gather/cgroup plugin</a></li> |
| </ul> |
| </li> |
| <li><a href="#Specialization">Use of cgroup for Resource Specialization</a></li> |
| <li><a href="#cgroupplugins">Slurm cgroup plugins</a> |
| <ul> |
| <li><a href="#differences">Main differences between cgroup/v1 and cgroup/v2</a></li> |
| <li><a href="#interfaces">Main differences between controller interfaces</a></li> |
| <li><a href="#generalities">Other generalities</a></li> |
| </ul> |
| </li> |
| </ul> |
| |
| <h2 id="overview">Control Group Overview |
| <a class="slurm_link" href="#overview"></a> |
| </h2> |
| <p>Control Group is a mechanism provided by the kernel to organize processes |
| hierarchically and distribute system resources along the hierarchy in a |
| controlled and configurable manner. Slurm can make use of cgroups to constrain |
| different resources to jobs, steps and tasks, and to get accounting about these |
| resources.</p> |
| |
| <p>A cgroup provides different controllers (formerly "subsystems") for different |
| resources. Slurm plugins can use several of these controllers, e.g.: <i>memory, |
| cpu, devices, freezer, cpuset, cpuacct</i>. Each enabled controller |
| gives the ability to constrain resources to a set of processes. If one |
| controller is not available on the system, then Slurm cannot constrain the |
| associated resources through a cgroup.</p> |
| |
| <p>"cgroup" stands for "control group" and is never capitalized. The singular |
| form is used to designate the whole feature and also as a qualifier as in |
| "cgroup controllers". When explicitly referring to multiple individual control |
| groups, the plural form "cgroups" is used.</p> |
| |
| <p>Slurm supports two cgroup modes, Legacy mode (cgroup v1) and Unified Mode |
| (cgroup v2). Hybrid mode where controllers from both version 1 and version 2 are |
| mixed in a system is not supported.</p> |
| |
| <p><b>NOTE</b>: The cgroup/v1 plugin is deprecated and will not be supported in |
| future Slurm versions. Newer GNU/Linux distributions are dropping, or have |
| dropped, support for cgroup v1 and may even not provide kernel support for the |
| required cgroup v1 interfaces. Systemd also deprecated cgroup v1. Starting with |
| Slurm version 25.05, no new features will be added to cgroup v1. Support for |
| critical bugs will be provided until its final removal.</p> |
| |
| <p>See the kernel.org documentation for a more comprehensive description of |
| cgroup:</p> |
| |
| <ul> |
| <li><a href="https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt"> |
| Kernel's Cgroup v1 documentation</a> |
| </li> |
| <li><a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html"> |
| Kernel's Cgroup v2 documentation </a> |
| </li> |
| </ul> |
| |
| <h2 id="cgroup_design">Slurm cgroup plugins design |
| <a class="slurm_link" href="#cgroup_design"></a> |
| </h2> |
| |
| For extended information on Slurm's internal Cgroup plugin read: |
| <ul> |
| <li><a href="cgroup_v2.html">cgroup/v2 plugin documentation</a> </li> |
| </ul> |
| |
| <h2 id="use">Use of cgroup in Slurm <a class="slurm_link" href="#use"></a> </h2> |
| <p>Slurm provides cgroup versions of a number of plugins.</p> |
| <ul> |
| <li>proctrack/cgroup (for process tracking and management)</li> |
| <li>task/cgroup (for constraining resources at step and task level)</li> |
| <li>jobacct_gather/cgroup (for gathering statistics)</li> |
| </ul> |
| |
| <p>cgroups can also be used for resource specialization (constraining daemons to |
| cores or memory).</p> |
| |
| <h2 id="configuration">Slurm Cgroup Configuration Overview |
| <a class="slurm_link" href="#configuration"></a> |
| </h2> |
| |
| <p>There are several sets of configuration options for Slurm cgroups:</p> |
| |
| <ul> |
| <li><a href="slurm.conf.html">slurm.conf</a> provides options to enable the |
| cgroup plugins. Each plugin may be enabled or disabled independently of the |
| others. |
| </li> |
| <li><a href="cgroup.conf.html">cgroup.conf</a> provides general options that are |
| common to all cgroup plugins, plus additional options that apply only to |
| specific plugins. |
| </li> |
| <li>System-level resource specialization is enabled using node configuration |
| parameters. |
| </li> |
| </ul> |
| |
| <h2 id="Plugins">Currently Available Cgroup Plugins |
| <a class="slurm_link" href="#Plugins"></a> |
| </h2> |
| |
| <h3 id="proctrack">proctrack/cgroup plugin |
| <a class="slurm_link" href="#proctrack"></a> |
| </h3> |
| |
| <p>The proctrack/cgroup plugin is an alternative to other proctrack plugins such |
| as proctrack/linux for process tracking and suspend/resume capability. |
| </p> |
| |
| <p> |
| proctrack/cgroup uses the freezer controller to keep track of all pids of a |
| job. It basically stores the pids in a specific hierarchy in the cgroup tree and |
| takes cares of signaling these pids when instructed. For example, if a user |
| decides to cancel a job, Slurm will execute this order internally by calling the |
| proctrack plugin and asking it to send a SIGTERM to the job. Since proctrack |
| maintains a hierarchy of all Slurm-related pids in cgroup, it will easily know |
| which ones will need to be signaled. |
| <br> |
| Proctrack can also respond to queries for getting a list of all the pids of a |
| job or a step. |
| <br> |
| Alternatively, when using proctrack/linux, pids are stored by cgroup in a |
| single file (cgroup.procs) which is read by the plugin to get all the pids of a |
| part of the hierarchy. For example, when using proctrack/cgroup, a single step |
| has its own cgroup.procs file, so getting the pids of the step is instantaneous. |
| In proctrack/linux, we need to read recursively /proc to get all the descendants |
| of a parent pid. |
| </p> |
| |
| <p>To enable this plugin, configure the following option in slurm.conf: |
| <pre>ProctrackType=proctrack/cgroup</pre> |
| </p> |
| |
| <p>There are no specific options for this plugin in cgroup.conf, but the general |
| options apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page for |
| details.</p> |
| |
| <h3 id="task">task/cgroup plugin<a class="slurm_link" href="#task"></a></h3> |
| |
| <p>The task/cgroup plugin allows constraining resources to a job, a step, or a |
| task. This is the only plugin that can ensure that the boundaries of an |
| allocation are not violated. |
| Only jobacctgather/linux offers a very simplistic mechanism for |
| constraining memory to a job but it is not reliable (there's a window of time |
| where jobs can exceed its limits) and only for very rare systems where cgroup is |
| not available.</p> |
| |
| <p>task/cgroup provides the following features:</p> |
| |
| <ul> |
| <li>Confine jobs and steps to their allocated cpuset.</li> |
| <li>Confine jobs and steps to specific memory resources.</li> |
| <li>Confine jobs, steps and tasks to their allocated gres, including gpus.</li> |
| </ul> |
| |
| <p>The task/cgroup plugin uses the cpuset, memory and devices subsystems.</p> |
| |
| <p>To enable this plugin, add <i>task/cgroup</i> to the TaskPlugin configuration |
| parameter in slurm.conf:</p> |
| |
| <pre>TaskPlugin=task/cgroup</pre> |
| |
| <p>There are many specific options for this plugin in cgroup.conf. The general |
| options also apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page |
| for details.</p> |
| |
| <p>This plugin can be stacked with other task plugins, for example with |
| <i>task/affinity</i>. This will allow it to constrain resources to a job plus |
| getting the advantage of the affinity plugin (order doesn't matter):</p> |
| |
| <pre>TaskPlugin=task/cgroup,task/affinity</pre> |
| |
| <h3 id="jobacct_gather">jobacct_gather/cgroup plugin |
| <a class="slurm_link" href="#jobacct_gather"></a> |
| </h3> |
| |
| <p> |
| The <i>jobacct_gather/cgroup</i> plugin is an alternative to the |
| <i>jobacct_gather/linux</i> plugin for the collection of accounting statistics |
| for jobs, steps and tasks. |
| <br> |
| <i>jobacct_gather/cgroup</i> uses the cpuacct and memory cgroup controllers. |
| </p> |
| |
| <p>The cpu and memory statistics collected by this plugin do not represent the |
| same resources as the cpu and memory statistics collected by the |
| <i>jobacct_gather/linux</i>. While the cgroup plugin just reads a cgroup.stats |
| file and similar containing the information for the entire subtree of pids, the |
| linux plugin gets information from /proc/pid/stat for every pid and then does |
| the calculations, thus becoming a bit less efficient (thought not noticeable in |
| the practice) than the cgroup one.</p> |
| |
| <p>To enable this plugin, configure the following option in slurm.conf: |
| <pre>JobacctGatherType=jobacct_gather/cgroup</pre> |
| </p> |
| |
| <p>There are no specific options for this plugin in cgroup.conf, but the general |
| options apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page for |
| details.</p> |
| |
| <h2 id="Specialization">Use of cgroup for Resource Specialization |
| <a class="slurm_link" href="#Specialization"></a> |
| </h2> |
| |
| <p>Resource Specialization may be used to reserve a subset of cores or a |
| specific amount of memory on each compute node for exclusive use by the Slurm |
| compute node daemon, slurmd.</p> |
| |
| <p>If cgroup/v1 is used the reserved resources will also be used by the |
| slurmstepd processes. If cgroup/v2 is used, slurmstepd is not constrained by |
| this resource specialization. Instead the slurmstepd is constrained to the |
| resources allocated to the job, since it is considered part of the job and its |
| consumption is completely dependent on the topology of the job. For example an |
| MPI job can initialize many ranks with PMI and make slurmstepd consume more |
| memory.</p> |
| |
| <p>System-level resource specialization is enabled with special node |
| configuration parameters. Read <a href="slurm.conf.html">slurm.conf</a> and core |
| specialization in <a href="core_spec.html">core_spec.html</a> for more |
| information.</p> |
| |
| <h2 id="cgroupplugins">Slurm cgroup plugins |
| <a class="slurm_link" href="#cgroupplugins"></a> |
| </h2> |
| |
| <p> |
| Both cgroup v1 and v2 plugins have very different ways of organizing their |
| hierarchies and respond to different design constraints. The design is the |
| responsibility of the kernel maintainers. |
| </p> |
| |
| <h3 id="differences">Main differences between cgroup/v1 and cgroup/v2 |
| <a class="slurm_link" href="#differences"></a> |
| </h3> |
| |
| <p>The three main differences between v1 and v2 are:</p> |
| |
| <ul> |
| <li><b>Unified mode in v2</b><br> |
| |
| <p>In <i>cgroup/v1</i> there's a separate hierarchy for each controller, which |
| means the job structure must be replicated and managed for every enabled |
| controller. For example, for the same job, if using |
| <i>memory</i> and <i>freezer</i> controllers, we will need to create the same |
| slurm/uid/job_id/step_id/ hierarchy in both controller's directories. For |
| example: |
| |
| <pre>/sys/fs/cgroup/memory/slurm/uid_1000/job_1/step_0/</pre> |
| <pre>/sys/fs/cgroup/freezer/slurm/uid_1000/job_1/step_0/</pre> |
| |
| <p>In <i>cgroup/v2</i> we have a <i>Unified</i> hierarchy, where controllers are |
| enabled at the same level and presented to the user as different files.</p> |
| |
| <pre>/sys/fs/cgroup/system.slice/slurmstepd.scope/job_1/step_0/</pre></p> |
| |
| </li> |
| |
| <li><b>Top-down constraint in v2</b><br> |
| <p>Resources are distributed top-down and a cgroup can further distribute a |
| resource only if the resource has been distributed to it from the parent. |
| Enabled controllers are listed in the <i>cgroup.controllers</i> file and |
| enabled controllers in a subtree are listed in <i>cgroup.subtree_control</i>. |
| </li> |
| |
| <li><b>No-Internal-Process constraint in v2</b><br> |
| <p>In <i>cgroup/v1</i> the hierarchy is free, which means one can create any |
| directory in the tree and put pids in it. In <i>cgroup/v2</i> there's a kernel |
| restriction which impedes adding a pid to non-leaf directories.</p> |
| </li> |
| |
| <li><b>Systemd dependency on cgroup/v2 - separation of slurmd and stepds |
| </b><p> This is not a kernel limitation but a systemd decision, which imposes an |
| important restriction on services that decide to use <i>Delegate=yes</i>. |
| Systemd, with pid 1, decided to be the complete owner of the cgroup |
| hierarchy, <i>/sys/fs/cgroup</i>, trying to impose a <i>single-writer</i> |
| design. This means that everything related to cgroup must be under control of |
| systemd. If one decides to manually modify the cgroup tree, creating directories |
| and moving pids around, it is possible that at some point systemd may decide to |
| enable or disable controllers on the entire tree, or move pids around. It's been |
| experienced that a |
| |
| <pre>systemd reload</pre> |
| |
| or a |
| |
| <pre>systemd reset-failed</pre> |
| |
| removed controllers, at any level and directory of the tree, if there was not |
| any "systemd unit" making use of it and there were not any "Delegate=Yes" |
| started "systemd unit" on the system. This is because systemd wants to cleanup |
| the cgroup tree and match it against its internal unit database. In fact, |
| looking at the code of systemd one can see how cgroup directories related to |
| units with "Delegate=yes" flag are ignored, while any other cgroup directories |
| are modified. This makes it mandatory to start slurmd and slurmstepd processes |
| under a unit with "Delegate=yes". This means we need to start, stop and restart |
| slurmd with systemd. If we do that though, since we may have previously modified |
| the tree where slurmd belongs (e.g. adding job directories) systemd will not be |
| able to restart slurmd because of the <i>Top-down constraint</i> mentioned |
| earlier. It will not be able to put the new slurmd pid into the root cgroup |
| which is now a non-leaf. This forces us to separate the cgroup hierarchies of |
| slurmstepd from the slurmd ones, and since we need to inform systemd about it |
| and put slurmstepd into a new unit, we will do a dbus call to systemd to create |
| a new scope for slurmstepds. See |
| <a href="https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/"> |
| systemd ControlGroupInterface</a> for more information.</p> |
| </li> |
| </ul> |
| |
| <p>The following differences shouldn't affect how other plugins interact with |
| cgroup plugins, but instead they only show internal functional differences.</p> |
| |
| <ul> |
| <li>A controller in <i>cgroup/v2</i> is enabled by writing to |
| <i>cgroup.controllers</i>, while in <i>cgroup/v1</i> a new mount point must be |
| mounted with filesystem type <i>"-t cgroup"</i> and corresponding options, |
| e.g.<i>"-o freezer"</i>. |
| </li> |
| |
| <li>In <i>cgroup/v2</i> the freezer controller is inherently present in the |
| <i>cgroup.freeze</i> interface. In <i>cgroup/v1</i> it is a specific and |
| separate controller which needs to be mounted. |
| </li> |
| |
| <li>The devices controller does not exist in cgroup/v2, instead a new eBPF |
| program must be inserted in the kernel. |
| </li> |
| |
| <li>In <i>cgroup/v2</i>, memory.stat file has changed and now we do the sum of |
| anon+swapcached+anon_thp to match the RSS concept in v1. |
| </li> |
| |
| <li>In <i>cgroup/v2</i>, cpu.stat provides metrics in milis while puacct.stat |
| in <i>cgroup/v1</i> provides metrics in USER_HZ. |
| </li> |
| |
| </ul> |
| |
| <h3 id="interfaces">Main differences between controller interfaces |
| <a class="slurm_link" href="#interfaces"></a> |
| </h3> |
| <table style="page-break-inside: avoid; font-family: Arial,Helvetica,sans-serif;" border="1" bordercolor="#000000" cellpadding="3" cellspacing="0" width="100%"> |
| <tr bgcolor="#e0e0e0"> |
| <td><u><b>cgroup/v1</b></u></td> |
| <td><u><b>cgroup/v2</b></u></td> |
| </tr> |
| <tr> |
| <td>memory.limit_in_bytes</td> |
| <td>memory.max</td> |
| </tr> |
| <tr> |
| <td>memory.soft_limit_in_bytes</td> |
| <td>memory.high</td> |
| </tr> |
| <tr> |
| <td>memory.memsw_limit_in_bytes</td> |
| <td>memory.swap.max</td> |
| </tr> |
| <tr> |
| <td>memory.swappiness</td> |
| <td>none</td> |
| </tr> |
| <tr> |
| <td>freezer.state</td> |
| <td>cgroup.freeze</td> |
| </tr> |
| <tr> |
| <td>cpuset.cpus</td> |
| <td>cpuset.cpus.effective and cpuset.cpus</td> |
| </tr> |
| <tr> |
| <td>cpuset.mems</td> |
| <td>cpuset.mems.effective and cpuset.mems</td> |
| </tr> |
| <tr> |
| <td>cpuacct.stat</td> |
| <td>cpu.stat</td> |
| </tr> |
| <tr> |
| <td>device.*</td> |
| <td>ebpf program</td> |
| </tr> |
| </table> |
| |
| |
| <h3 id="generalities">Other generalities |
| <a class="slurm_link" href="#generalities"></a> |
| </h3> |
| |
| <ul> |
| <li>When using cgroup/v1, some configurations can exclude the swap cgroup |
| accounting. This accounting is part of the features provided by the memory |
| controller. If this feature is disabled from the kernel or boot parameters, |
| trying to enable swap constraints will produce an error. If this is required, |
| add the following parameters to the kernel command line: |
| |
| <pre>cgroup_enable=memory swapaccount=1</pre> |
| |
| This can usually be placed in /etc/default/grub inside |
| the <i>GRUB_CMDLINE_LINUX</i> variable. A command such as <i>update-grub</i> |
| must be run after updating the file. This feature can be disabled also at kernel |
| config with the parameter: |
| |
| <pre>CONFIG_MEMCG_SWAP=</pre></li> |
| |
| <li>In some Linux distributions, it was possible to use the systemd parameter |
| JoinControllers, which is actually deprecated. This parameter allowed multiple |
| controllers to be mounted in a single hierarchy in <i>cgroup/v1</i>, more or |
| less trying to emulate the behavior of <i>cgroup/v2</i> in "Unified" mode. |
| However, Slurm does not work correctly with this configuration, so please make |
| sure your system.conf does not use JoinControllers and that all your cgroup |
| controllers are under separate directories when using |
| <i>cgroup/v1</i> legacy mode. |
| </li> |
| </ul> |
| |
| <p style="text-align:center;">Last modified 5 May 2025</p> |
| |
| <!--#include virtual="footer.txt"--> |