<!--#include virtual="header.txt"-->
<h1>Control Group in Slurm</h1>
<h2 id="contents">Contents
<a class="slurm_link" href="#contents"></a>
</h2>
<ul>
<li><a href="#overview">Control Group Overview</a></li>
<li><a href="#cgroup_design">Slurm cgroup plugins design</a></li>
<li><a href="#use">Use of cgroup in Slurm</a></li>
<li><a href="#configuration">Slurm Cgroup Configuration Overview</a></li>
<li><a href="#Plugins">Currently Available Cgroup Plugins</a>
<ul>
<li><a href="#proctrack">proctrack/cgroup plugin</a></li>
<li><a href="#task">task/cgroup plugin</a></li>
<li><a href="#jobacct_gather">jobacct_gather/cgroup plugin</a></li>
</ul>
</li>
<li><a href="#Specialization">Use of cgroup for Resource Specialization</a></li>
<li><a href="#cgroupplugins">Slurm cgroup plugins</a>
<ul>
<li><a href="#differences">Main differences between cgroup/v1 and cgroup/v2</a></li>
<li><a href="#interfaces">Main differences between controller interfaces</a></li>
<li><a href="#generalities">Other generalities</a></li>
</ul>
</li>
</ul>
<h2 id="overview">Control Group Overview
<a class="slurm_link" href="#overview"></a>
</h2>
<p>Control Group is a mechanism provided by the kernel to organize processes
hierarchically and distribute system resources along the hierarchy in a
controlled and configurable manner. Slurm can make use of cgroups to constrain
different resources to jobs, steps and tasks, and to get accounting about these
resources.</p>
<p>The cgroup mechanism provides different controllers (formerly called
"subsystems") for different resources. Slurm plugins can use several of these
controllers, e.g. <i>memory, cpu, devices, freezer, cpuset, cpuacct</i>. Each
enabled controller gives the ability to constrain a resource for a set of
processes. If a controller is not available on the system, then Slurm cannot
constrain the associated resource through a cgroup.</p>
<p>"cgroup" stands for "control group" and is never capitalized. The singular
form is used to designate the whole feature and also as a qualifier as in
"cgroup controllers". When explicitly referring to multiple individual control
groups, the plural form "cgroups" is used.</p>
<p>Slurm supports two cgroup modes: Legacy mode (cgroup v1) and Unified mode
(cgroup v2). Hybrid mode, where controllers from both version 1 and version 2
are mixed on a system, is not supported.</p>
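<p>One common way to check which mode a node is running is to inspect the
filesystem type mounted on /sys/fs/cgroup. A minimal check (output is
illustrative):</p>
<pre>
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
</pre>
<p>A result of <i>cgroup2fs</i> indicates Unified mode (cgroup v2), while
<i>tmpfs</i> indicates Legacy mode (cgroup v1).</p>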
<p><b>NOTE</b>: The cgroup/v1 plugin is deprecated and will not be supported in
future Slurm versions. Newer GNU/Linux distributions are dropping, or have
already dropped, support for cgroup v1 and may not even provide kernel support
for the required cgroup v1 interfaces. Systemd has also deprecated cgroup v1.
Starting with Slurm version 25.05, no new features will be added to cgroup v1.
Fixes for critical bugs will be provided until its final removal.</p>
<p>See the kernel.org documentation for a more comprehensive description of
cgroup:</p>
<ul>
<li><a href="https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt">
Kernel's Cgroup v1 documentation</a>
</li>
<li><a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html">
Kernel's Cgroup v2 documentation </a>
</li>
</ul>
<h2 id="cgroup_design">Slurm cgroup plugins design
<a class="slurm_link" href="#cgroup_design"></a>
</h2>
<p>For extended information on the design of Slurm's internal cgroup plugins,
read:</p>
<ul>
<li><a href="cgroup_v2.html">cgroup/v2 plugin documentation</a> </li>
</ul>
<h2 id="use">Use of cgroup in Slurm <a class="slurm_link" href="#use"></a> </h2>
<p>Slurm provides cgroup versions of a number of plugins.</p>
<ul>
<li>proctrack/cgroup (for process tracking and management)</li>
<li>task/cgroup (for constraining resources at step and task level)</li>
<li>jobacct_gather/cgroup (for gathering statistics)</li>
</ul>
<p>cgroups can also be used for resource specialization (constraining daemons to
cores or memory).</p>
<h2 id="configuration">Slurm Cgroup Configuration Overview
<a class="slurm_link" href="#configuration"></a>
</h2>
<p>There are several sets of configuration options for Slurm cgroups:</p>
<ul>
<li><a href="slurm.conf.html">slurm.conf</a> provides options to enable the
cgroup plugins. Each plugin may be enabled or disabled independently of the
others.
</li>
<li><a href="cgroup.conf.html">cgroup.conf</a> provides general options that are
common to all cgroup plugins, plus additional options that apply only to
specific plugins.
</li>
<li>System-level resource specialization is enabled using node configuration
parameters.
</li>
</ul>
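<p>As an illustration, a cluster that uses all three cgroup plugins described
below could carry the following lines in its slurm.conf:</p>
<pre>
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobacctGatherType=jobacct_gather/cgroup
</pre>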
<h2 id="Plugins">Currently Available Cgroup Plugins
<a class="slurm_link" href="#Plugins"></a>
</h2>
<h3 id="proctrack">proctrack/cgroup plugin
<a class="slurm_link" href="#proctrack"></a>
</h3>
<p>The proctrack/cgroup plugin is an alternative to other proctrack plugins such
as proctrack/linux for process tracking and suspend/resume capability.
</p>
<p>
proctrack/cgroup uses the freezer controller to keep track of all the pids of a
job. It basically stores the pids in a specific hierarchy in the cgroup tree and
takes care of signaling these pids when instructed. For example, if a user
decides to cancel a job, Slurm carries out this request internally by calling
the proctrack plugin and asking it to send a SIGTERM to the job. Since proctrack
maintains a hierarchy of all Slurm-related pids in cgroup, it can easily
determine which ones need to be signaled.
<br>
Proctrack can also respond to queries for getting a list of all the pids of a
job or a step.
<br>
Unlike with proctrack/linux, the pids of a part of the hierarchy are stored by
the kernel in a single file (cgroup.procs) which the plugin simply reads. For
example, when using proctrack/cgroup, a single step has its own cgroup.procs
file, so getting the pids of the step is instantaneous. With proctrack/linux,
/proc must be read recursively to find all the descendants of a parent pid.
</p>
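<p>As an illustration, with cgroup/v1 the pids of a given step can be listed by
reading its cgroup.procs file directly; the uid, job and step ids below are
hypothetical:</p>
<pre>
$ cat /sys/fs/cgroup/freezer/slurm/uid_1000/job_1/step_0/cgroup.procs
12345
12346
</pre>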
<p>To enable this plugin, configure the following option in slurm.conf:
<pre>ProctrackType=proctrack/cgroup</pre>
</p>
<p>There are no specific options for this plugin in cgroup.conf, but the general
options apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page for
details.</p>
<h3 id="task">task/cgroup plugin<a class="slurm_link" href="#task"></a></h3>
<p>The task/cgroup plugin allows constraining resources to a job, a step, or a
task. This is the only plugin that can ensure that the boundaries of an
allocation are not violated.
The jobacct_gather/linux plugin offers only a very simplistic mechanism for
constraining memory to a job, but it is not reliable (there is a window of time
during which a job can exceed its limits) and is intended only for the very rare
systems where cgroup is not available.</p>
<p>task/cgroup provides the following features:</p>
<ul>
<li>Confine jobs and steps to their allocated cpuset.</li>
<li>Confine jobs and steps to specific memory resources.</li>
<li>Confine jobs, steps and tasks to their allocated gres, including gpus.</li>
</ul>
<p>The task/cgroup plugin uses the cpuset, memory and devices controllers.</p>
<p>To enable this plugin, add <i>task/cgroup</i> to the TaskPlugin configuration
parameter in slurm.conf:</p>
<pre>TaskPlugin=task/cgroup</pre>
<p>There are many specific options for this plugin in cgroup.conf. The general
options also apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page
for details.</p>
<p>This plugin can be stacked with other task plugins, for example with
<i>task/affinity</i>. This allows resources to be constrained to a job while
also gaining the advantages of the affinity plugin (order doesn't matter):</p>
<pre>TaskPlugin=task/cgroup,task/affinity</pre>
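<p>A minimal sketch of a cgroup.conf enabling core, memory and device
confinement for this plugin might look as follows; see the
<a href="cgroup.conf.html">cgroup.conf</a> man page for the authoritative list
of options:</p>
<pre>
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes
</pre>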
<h3 id="jobacct_gather">jobacct_gather/cgroup plugin
<a class="slurm_link" href="#jobacct_gather"></a>
</h3>
<p>
The <i>jobacct_gather/cgroup</i> plugin is an alternative to the
<i>jobacct_gather/linux</i> plugin for the collection of accounting statistics
for jobs, steps and tasks.
<br>
<i>jobacct_gather/cgroup</i> uses the cpuacct and memory cgroup controllers.
</p>
<p>The cpu and memory statistics collected by this plugin do not represent the
same resources as the cpu and memory statistics collected by
<i>jobacct_gather/linux</i>. While the cgroup plugin just reads stat files
(e.g. memory.stat) containing the information for the entire subtree of pids,
the linux plugin gets the information from /proc/&lt;pid&gt;/stat for every pid
and then does the calculations itself, making it slightly less efficient (though
not noticeably so in practice) than the cgroup one.</p>
<p>To enable this plugin, configure the following option in slurm.conf:
<pre>JobacctGatherType=jobacct_gather/cgroup</pre>
</p>
<p>There are no specific options for this plugin in cgroup.conf, but the general
options apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page for
details.</p>
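<p>For example, the statistics gathered by this plugin for a running job can be
inspected with sstat (the job id is hypothetical):</p>
<pre>
$ sstat --format=JobID,AveCPU,MaxRSS,MaxVMSize -j 1234
</pre>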
<h2 id="Specialization">Use of cgroup for Resource Specialization
<a class="slurm_link" href="#Specialization"></a>
</h2>
<p>Resource Specialization may be used to reserve a subset of cores or a
specific amount of memory on each compute node for exclusive use by the Slurm
compute node daemon, slurmd.</p>
<p>If cgroup/v1 is used, the reserved resources will also be used by the
slurmstepd processes. If cgroup/v2 is used, slurmstepd is not constrained by
this resource specialization. Instead, slurmstepd is constrained to the
resources allocated to the job, since it is considered part of the job and its
consumption depends entirely on the topology of the job. For example, an MPI
job can initialize many ranks with PMI and make slurmstepd consume more
memory.</p>
<p>System-level resource specialization is enabled with special node
configuration parameters. Read <a href="slurm.conf.html">slurm.conf</a> and the
core specialization documentation in <a href="core_spec.html">core_spec.html</a>
for more information.</p>
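<p>A minimal sketch of such a node definition in slurm.conf, reserving two cores
and 2048 MB of memory for system use (the node name and sizes are hypothetical):
</p>
<pre>
NodeName=node001 CPUs=32 RealMemory=64000 CoreSpecCount=2 MemSpecLimit=2048
</pre>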
<h2 id="cgroupplugins">Slurm cgroup plugins
<a class="slurm_link" href="#cgroupplugins"></a>
</h2>
<p>
The cgroup v1 and v2 plugins organize their hierarchies in very different ways
and respond to different design constraints. These constraints come from the
kernel, whose maintainers are responsible for the cgroup design.
</p>
<h3 id="differences">Main differences between cgroup/v1 and cgroup/v2
<a class="slurm_link" href="#differences"></a>
</h3>
<p>The main differences between v1 and v2 are:</p>
<ul>
<li><b>Unified mode in v2</b><br>
<p>In <i>cgroup/v1</i> there's a separate hierarchy for each controller, which
means the job structure must be replicated and managed for every enabled
controller. For example, for the same job, if using the
<i>memory</i> and <i>freezer</i> controllers, the same
slurm/uid/job_id/step_id/ hierarchy must be created in both controllers'
directories:</p>
<pre>/sys/fs/cgroup/memory/slurm/uid_1000/job_1/step_0/</pre>
<pre>/sys/fs/cgroup/freezer/slurm/uid_1000/job_1/step_0/</pre>
<p>In <i>cgroup/v2</i> there is a <i>Unified</i> hierarchy, where controllers
are enabled at the same level and presented to the user as different files.</p>
<pre>/sys/fs/cgroup/system.slice/slurmstepd.scope/job_1/step_0/</pre>
</li>
<li><b>Top-down constraint in v2</b><br>
<p>Resources are distributed top-down, and a cgroup can further distribute a
resource only if that resource has been distributed to it from its parent.
The controllers available to a cgroup are listed in its
<i>cgroup.controllers</i> file, and the controllers enabled for its subtree are
listed in <i>cgroup.subtree_control</i> (see the shell sketch after this
list).</p>
</li>
<li><b>No-Internal-Process constraint in v2</b><br>
<p>In <i>cgroup/v1</i> the hierarchy is free-form, which means one can create
any directory in the tree and put pids in it. In <i>cgroup/v2</i> there's a
kernel restriction which prevents pids from being added to non-leaf
directories.</p>
</li>
<li><b>Systemd dependency in cgroup/v2 - separation of slurmd and stepds
</b><p>This is not a kernel limitation but a systemd decision, which imposes an
important restriction on services that use <i>Delegate=yes</i>.
Systemd, running as pid 1, claims to be the sole owner of the cgroup
hierarchy, <i>/sys/fs/cgroup</i>, imposing a <i>single-writer</i>
design. This means that everything related to cgroup must be under the control
of systemd. If one manually modifies the cgroup tree, creating directories
and moving pids around, it is possible that at some point systemd will decide to
enable or disable controllers on the entire tree, or move pids around. It has
been observed that a
<pre>systemctl daemon-reload</pre>
or a
<pre>systemctl reset-failed</pre>
removed controllers, at any level and directory of the tree, if no systemd unit
was making use of them and no started unit on the system had "Delegate=yes"
set. This is because systemd wants to clean up the cgroup tree and match it
against its internal unit database. In fact, looking at the systemd code one can
see how cgroup directories related to units with the "Delegate=yes" flag are
ignored, while any other cgroup directories are modified. This makes it
mandatory to start slurmd and slurmstepd processes under a unit with
"Delegate=yes", which means slurmd must be started, stopped and restarted with
systemd. If we do that, though, since we may have previously modified the tree
where slurmd lives (e.g. adding job directories), systemd would not be able to
restart slurmd: it cannot put the new slurmd pid into its old cgroup, which is
now a non-leaf, because of the <i>No-Internal-Process constraint</i> mentioned
earlier. This forces us to separate the cgroup hierarchy of slurmstepd from
that of slurmd, and since we need to inform systemd about it and put slurmstepd
into a new unit, a dbus call is made to systemd to create a new scope for
slurmstepds. See
<a href="https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/">
systemd ControlGroupInterface</a> for more information.</p>
</li>
</ul>
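<p>The top-down constraint can be observed directly from a shell. The output
below is only illustrative and will vary between systems:</p>
<pre>
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids
$ cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
# A controller can only be enabled in a child cgroup if it is
# already enabled in the parent's cgroup.subtree_control:
$ echo "+memory" > /sys/fs/cgroup/system.slice/cgroup.subtree_control
</pre>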
<p>The following differences shouldn't affect how other plugins interact with
the cgroup plugins; they only reflect internal functional differences.</p>
<ul>
<li>A controller in <i>cgroup/v2</i> is enabled by writing its name to the
parent's <i>cgroup.subtree_control</i> file, while in <i>cgroup/v1</i> a new
mount point must be mounted with filesystem type <i>"-t cgroup"</i> and the
corresponding options, e.g. <i>"-o freezer"</i>.
</li>
<li>In <i>cgroup/v2</i> the freezer functionality is inherently provided through
the <i>cgroup.freeze</i> interface. In <i>cgroup/v1</i> it is a specific and
separate controller which needs to be mounted.
</li>
<li>The devices controller does not exist in cgroup/v2; instead, an eBPF
program must be loaded into the kernel.
</li>
<li>In <i>cgroup/v2</i>, the memory.stat file has changed, so Slurm now sums
anon + swapcached + anon_thp to match the RSS concept from v1 (see the sketch
after this list).
</li>
<li>In <i>cgroup/v2</i>, cpu.stat provides metrics in microseconds, while
cpuacct.stat in <i>cgroup/v1</i> provides metrics in USER_HZ.
</li>
</ul>
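<p>As a sketch of that v2 RSS calculation, the relevant memory.stat fields can
be summed from a shell; the step path is hypothetical and field availability
depends on the kernel version:</p>
<pre>
$ cd /sys/fs/cgroup/system.slice/slurmstepd.scope/job_1/step_0
$ awk '$1=="anon" || $1=="swapcached" || $1=="anon_thp" {sum+=$2} END {print sum}' memory.stat
</pre>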
<h3 id="interfaces">Main differences between controller interfaces
<a class="slurm_link" href="#interfaces"></a>
</h3>
<table style="page-break-inside: avoid; font-family: Arial,Helvetica,sans-serif;" border="1" bordercolor="#000000" cellpadding="3" cellspacing="0" width="100%">
<tr bgcolor="#e0e0e0">
<td><u><b>cgroup/v1</b></u></td>
<td><u><b>cgroup/v2</b></u></td>
</tr>
<tr>
<td>memory.limit_in_bytes</td>
<td>memory.max</td>
</tr>
<tr>
<td>memory.soft_limit_in_bytes</td>
<td>memory.high</td>
</tr>
<tr>
<td>memory.memsw.limit_in_bytes</td>
<td>memory.swap.max</td>
</tr>
<tr>
<td>memory.swappiness</td>
<td>none</td>
</tr>
<tr>
<td>freezer.state</td>
<td>cgroup.freeze</td>
</tr>
<tr>
<td>cpuset.cpus</td>
<td>cpuset.cpus.effective and cpuset.cpus</td>
</tr>
<tr>
<td>cpuset.mems</td>
<td>cpuset.mems.effective and cpuset.mems</td>
</tr>
<tr>
<td>cpuacct.stat</td>
<td>cpu.stat</td>
</tr>
<tr>
<td>devices.*</td>
<td>ebpf program</td>
</tr>
</table>
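<p>For instance, setting an 8 GiB hard memory limit differs between the two
versions only in the interface file written to (the paths are hypothetical):</p>
<pre>
# cgroup/v1
$ echo 8589934592 > /sys/fs/cgroup/memory/slurm/uid_1000/job_1/memory.limit_in_bytes
# cgroup/v2
$ echo 8589934592 > /sys/fs/cgroup/system.slice/slurmstepd.scope/job_1/memory.max
</pre>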
<h3 id="generalities">Other generalities
<a class="slurm_link" href="#generalities"></a>
</h3>
<ul>
<li>When using cgroup/v1, some kernel configurations exclude swap accounting,
which is part of the features provided by the memory controller. If this
feature is disabled in the kernel or by boot parameters, trying to enable swap
constraints will produce an error. If swap accounting is required, add the
following parameters to the kernel command line (a verification sketch follows
this list):
<pre>cgroup_enable=memory swapaccount=1</pre>
These can usually be placed in /etc/default/grub inside
the <i>GRUB_CMDLINE_LINUX</i> variable. A command such as <i>update-grub</i>
must be run after updating the file. The feature can also be disabled in the
kernel configuration with the parameter:
<pre>CONFIG_MEMCG_SWAP=</pre></li>
<li>In some Linux distributions it was possible to use the systemd parameter
JoinControllers, which is now deprecated. This parameter allowed multiple
controllers to be mounted in a single hierarchy in <i>cgroup/v1</i>, more or
less emulating the behavior of <i>cgroup/v2</i> in "Unified" mode.
However, Slurm does not work correctly with this configuration, so please make
sure your system.conf does not use JoinControllers and that all your cgroup
controllers are under separate directories when using
<i>cgroup/v1</i> legacy mode.
</li>
</ul>
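<p>As referenced in the first item above, whether swap accounting is active on a
cgroup/v1 system can be verified from a shell (output is illustrative):</p>
<pre>
$ tr ' ' '\n' &lt; /proc/cmdline | grep -E 'cgroup_enable|swapaccount'
cgroup_enable=memory
swapaccount=1
$ ls /sys/fs/cgroup/memory/memory.memsw.limit_in_bytes
/sys/fs/cgroup/memory/memory.memsw.limit_in_bytes
</pre>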
<p style="text-align:center;">Last modified 5 May 2025</p>
<!--#include virtual="footer.txt"-->