blob: 8b74e3f852b78a3d551e0d995a705c1f9da3bab6 [file] [log] [blame]
<!--#include virtual="header.txt"-->
<h1>Control Group v2 plugin</h1>
<h2 id="contents">Contents
<a class="slurm_link" href="#contents"></a>
</h2>
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#conversion">Conversion from cgroup v1</a>
<ul>
<li><a href="#reconfigure_systemd">Reconfigure SystemD</a></li>
<li><a href="#general_conversion">General conversion</a></li>
</ul>
</li>
<li><a href="#v2_rules">Following cgroup v2 rules</a>
<ul>
<li><a href="#top_down">Top-down Constraint</a></li>
<li><a href="#no_internal_process">No Internal Process Constraint</a></li>
</ul>
</li>
<li><a href="#systemd_rules">Following systemd rules</a>
<ul>
<li><a href="#real_sysd_prob">The real problem: systemd + restarting slurmd</a></li>
<li><a href="#consequences_nosysd">Consequences of not following systemd rules</a></li>
<li><a href="#distro_no_sysd">What happens with Linux distros without systemd?</a></li>
</ul>
</li>
<li><a href="#v2_overview">cgroup/v2 overview</a>
<ul>
<li><a href="#slurmd_startup">slurmd startup</a></li>
<li><a href="#slurmd_restart">slurmd restart</a></li>
<li><a href="#stepd_start">slurmstepd start</a></li>
<li><a href="#term_clean">Termination and cleanup</a></li>
<li><a href="#manual_startup">Special case - manual startup</a></li>
<li><a href="#troubleshooting_startup">Troubleshooting startup</a></li>
</ul>
</li>
<li><a href="#hierarchy_overview">Hierarchy overview</a></li>
<li><a href="#task_level">Working at the task level</a></li>
<li><a href="#ebpf_controller">The eBPF based devices controller</a></li>
<li><a href="#diff_ver">Running different nodes with different cgroup versions</a></li>
<li><a href="#configuration">Configuration</a>
<ul>
<li><a href="#cgroup_plugin">Cgroup Plugin</a></li>
<li><a href="#dev_options">Developer options</a></li>
<li><a href="#ignored_params">Ignored parameters</a></li>
</ul>
</li>
<li><a href="#requirements">Requirements</a></li>
<li><a href="#pam_slurm_adopt">PAM Slurm Adopt plugin on cgroup v2</a></li>
<li><a href="#limitations">Limitations</a></li>
</ul>
<h2 id="overview">Overview
<a class="slurm_link" href="#overview"></a>
</h2>
<p>Slurm provides support for systems with Control Group v2.<br>
Documentation for this cgroup version can be found in kernel.org
<a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html">
Control Cgroup v2 Documentation</a>.</p>
<p>The <i>cgroup/v2</i> plugin is an internal Slurm API used by other plugins,
like <i>proctrack/cgroup</i>, <i>task/cgroup</i> and
<i>jobacctgather/cgroup</i>. This document gives an overview of how it
is designed, with the aim of getting a better idea of what is happening on the
system when Slurm constrains resources with this plugin.</p>
<p>Before reading this document we assume you have read the cgroup v2 kernel
documentation and you are familiar with most of the concepts and terminology.
It is equally important to read systemd's
<a href="https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface">
Control Group Interfaces Documentation</a> since <i>cgroup/v2</i> needs to
interact with systemd and a lot of concepts will overlap. Finally, it is
recommended that you understand the concept of
<a href="https://ebpf.io/what-is-ebpf">eBPF technology</a>, since in cgroup v2
the device cgroup controller is eBPF-based.</p>
<h2 id="conversion">Conversion from cgroup v1
<a class="slurm_link" href="#conversion"></a></h2>
<p>Existing Slurm installations may be running with Slurm's cgroup/v1 plugin.
Sites that wish to use the new features of cgroup/v2 can convert their nodes
to run with cgroup v2 if it is supported by the OS. Slurm supports compute
nodes running a mix of cgroup/v1 and cgroup/v2 plugins.</p>
<h3 id="reconfigure_systemd">Reconfigure Systemd
<a class="slurm_link" href="#reconfigure_systemd"></a></h3>
<p>In certain circumstances, it may be necessary to make some changes to the
systemd configuration to support cgroup v2. You will need to complete the
procedure in this section if either of these conditions apply:
<ol>
<li>Systemd version is less than 252</li>
<li>The file <code>/proc/1/cgroup</code> contains multiple lines or the first
line starts with a non-zero value. For example:
<ul>
<li>Systemd needs to be reconfigured:
<pre>
12:cpuset:/
11:hugetlb:/
10:perf_event:/
. . .
</pre>
</li>
<li>Ready for cgroup v2 (skip to the <a href="#general_conversion">
next section</a>):
<pre>
0::/init.scope
</pre>
</li>
</ul>
</li>
</ol>
</p>
<p>The following procedure will reconfigure such systems for cgroup v2:
<ol>
<li>Swap kernel commandline options for systemd to cgroup v2 support:
<pre>systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0 cgroup_no_v1=all</pre>
Example commands for <b>Debian</b> based systems:
<pre>sed -e 's@^GRUB_CMDLINE_LINUX=@#GRUB_CMDLINE_LINUX=@' -i /etc/default/grub
echo 'GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0 cgroup_no_v1=all"' >> /etc/default/grub
update-grub</pre>
Example command for <b>Red Hat</b> based systems:
<pre>grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0 cgroup_no_v1=all"</pre>
</li>
<li>Reboot to apply new kernel command line options.</li>
<li>Verify kernel has correct command line options:
<pre>grep -o -e systemd.unified_cgroup_hierarchy=. -e systemd.legacy_systemd_cgroup_controller=. /proc/cmdline
systemd.unified_cgroup_hierarchy=1
systemd.legacy_systemd_cgroup_controller=0</pre>
If the output does not match exactly, then repeat prior steps
and verify kernel is given correct command line options.
</li>
<li>Verify that there is not any <a href="https://docs.kernel.org/admin-guide/cgroup-v1/index.html">cgroup v1</a>
controller mounted and <a
href="https://github.com/systemd/systemd/blob/main/docs/CGROUP_DELEGATION.md#three-different-tree-setups-">
that your system is not running in hybrid mode</a><br>Example of hybrid mode:
<pre>$ grep -v ^0: /proc/self/cgroup
8:net_cls,net_prio:/
6:name=systemd:/</pre>
If there are any entries, then a reboot is required. If there are entries after
a reboot then there is a process actively mounting Cgroup v1 mounts that will
need to be stopped.</li>
</ol>
</p>
<h3 id="general_conversion">General conversion
<a class="slurm_link" href="#general_conversion"></a></h3>
<p>The following procedure is required when switching from cgroup v1 to v2:
<ol>
<li>Modify Slurm configuration to allow cgroup/v2 plugin:<br>
<b>/etc/slurm/cgroup.conf</b>:
<ul>
<li>Remove line starting with:<pre>CgroupAutomount=</pre></li>
<li>Remove line starting with:<pre>CgroupMountpoint=</pre></li>
<li>Remove line if present:<pre>CgroupPlugin=cgroup/v1</pre></li>
<li>Add line:<pre>CgroupPlugin=autodetect</pre></li>
</ul>
</li>
<li>Restart Slurm daemons per normal startup procedure</li>
</ol>
</p>
<h2 id="v2_rules">Following cgroup v2 rules
<a class="slurm_link" href="#v2_rules"></a>
</h2>
<p>Kernel's Control Group v2 has two particularities that affect how Slurm
needs to structure its internal cgroup tree.</p>
<h3 id="top_down">Top-down Constraint
<a class="slurm_link" href="#top_down"></a>
</h3>
<p>Resources are distributed top-down to the tree, so a controller is only
available on a cgroup directory if the parent node has it listed in its
<i>cgroup.controllers</i> file and added to its <i>cgroup.subtree_control</i>.
Also, a controller activated in the subtree cannot be disabled if one or more
children has them enabled. For Slurm, this implies that we need to do this
kind of management over our hierarchy by modifying <i>cgroup.subtree_control</i>
and enabling the required controllers for the child.</p>
<h3 id="no_internal_process">No Internal Process Constraint
<a class="slurm_link" href="#no_internal_process"></a>
</h3>
<p>Except for the root cgroup, parent cgroups (really called domain cgroups) can
only enable controllers for their children if they do not have any process at
their own level. This means we can create a subtree inside a cgroup directory,
but before writing to <i>cgroup.subtree_control</i>, all the pids listed in the
parent's <i>cgroup.procs</i> must be migrated to the child. This requires that
all processes must live on the leaves of the tree and so it will not be possible
to have pids in non-leaf directories.</p>
<h2 id="systemd_rules">Following systemd rules
<a class="slurm_link" href="#systemd_rules"></a>
</h2>
<p>Systemd is currently the most widely used init mechanism. For this reason
Slurm needs to find a way to coexist with the rules of systemd. The designers of
systemd have conceived a new rule called the "single-writer" rule, which implies
that every cgroup has one single owner and nobody else should write to it. Read
more about this in <a href="https://systemd.io/CGROUP_DELEGATION">systemd.io
Cgroup Delegation Documentation</a>. In practice this means that the systemd
daemon, started when the kernel boots and which takes pid 1, will consider
itself the absolute owner and single writer of the entire cgroup tree.
This means that systemd expects that no other process should be modifying any
cgroup directly, nor should another process be creating directories or moving
pids around, without systemd being aware of it.</p>
<p>There's one method that allows Slurm to work without issues, which is to
start Slurm daemons in a systemd <i>Unit</i> with the special systemd option
<i>Delegate=yes</i>. Starting slurmd within a systemd Unit, will give Slurm a
"delegated" cgroup subtree in the filesystem where it will be able to create
directories, move pids, and manage its own hierarchy. In practice, what
happens is that systemd registers a new <i>Unit</i> in its internal database and
relates the cgroup directory to it. Then for any future "intrusive" actions of
the cgroup tree, systemd will effectively ignore the "delegated" directories.
</p>
<p>This is similar to what happened in cgroup v1, since this is not a
kernel rule, but a systemd rule. But this fact combined with the new cgroup v2
rules, forces Slurm to choose a design which coexists with both.</p>
<h3 id="real_sysd_prob">The real problem: systemd + restarting slurmd
<a class="slurm_link" href="#real_sysd_prob"></a>
</h3>
<p>When designing the cgroup/v2 plugin for Slurm, the initial idea was to let
slurmd setup the required hierarchy in its own root cgroup directory. It
would create a specific directory for itself and then place jobs and steps in
other corresponding directories. This would guarantee the
<a href="#no_internal_process">no internal process constraint</a> rule.</p>
<p>This worked fine until we needed to restart slurmd. Since the entire
hierarchy was already created starting at the slurmd cgroup, the slurmd restart
would terminate the slurmd process and then start a new one, which would be
put into the root of the original group tree. Since this directory was now what
is called a "domain controller" (it contained sub-directories) and not a leaf
anymore, the <a href="#no_internal_process">no internal process constraint</a>
rule would be broken and systemd would fail to start the daemon.</p>
<p>Lacking any mechanism in systemd to tackle this situation, this left us with
no other choice but to separate slurmd and forked slurmstepds into separate
subtree directories. Because of the design rule of systemd about being the
single-writer on the tree, it was not possible to just do a "mkdir" from
slurmd or the slurmstepd itself and then move the stepd process into a new and
separate directory, that would mean this directory would not be controlled by
systemd and would cause problems.</p>
<p>The only way that a "mkdir" could work was if it was done inside a
"delegated" cgroup subtree, so we needed to find a way to find a Unit with
"Delegate=yes", different from the slurmd one, which would guarantee our
independence. So, we really needed to start a new unit for user jobs.</p>
<p>Actually, in systemd there are two types of Units that can get the
"Delegate=yes" parameter and that are directly related to a cgroup directory.
One is a "Service" and the other is a "Scope". We are interested the "scope":
<ul>
<li><b>A Systemd Scope:</b> systemd takes a pid as an argument, creates a cgroup
directory and then adds the provided pid to the directory. The scope will remain
until this pid is gone.</li>
</ul>
<p>It is worth noting that a discussion with main systemd developers raised
the <i>RemainAfterExit</i> systemd parameter. This parameter is intended to keep
the unit alive even if all the processes on it are gone. This option is only
valid for "Services" and not for "Scopes". This would be a very interesting
option to have if it was included also for Scopes. They stated
that its functionality could be extended to not only keep the unit, but
to also keep the cgroup directories until the unit was manually terminated.
Currently, the unit remains alive but the cgroup is cleaned anyway.
</p>
<p>With all this background, we're ready to show which solution was used to make
Slurm get away from the problem of the slurmd restart.</p>
<ul>
<li>Create a new Scope on slurmd startup for hosting new slurmstepd processes.
It does one single call at the <b>first</b> slurmd startup. Slurmd prepares a
scope for future slurmstepd pids, and the stepd itself moves itself there when
starting. This comes without any performance issue, and conceptually is just
like a slower "mkdir" + informing systemd from slurmd only at the first startup.
Moving processes from one delegated unit to another delegated unit was approved
by systemd developers. The only downside is that the scope needs processes
inside or it will terminate and cleanup the cgroup, so slurmd needed to create a
"sleep" infinity process, which we encoded into the "slurmstepd infinity"
process, which will live forever in the scope. In the future, if the
<i>RemainAfterExit</i> parameter is extended to scopes and allows the cgroup
tree to not be destroyed, the need for this infinity process would be
eliminated.
</li>
</ul>
<p>Finally we ended up with separating slurmd from slurmstepds, using a scope
with "Delegate=yes" option.</p>
<h3 id="consequences_nosysd">Consequences of not following systemd rules
<a class="slurm_link" href="#consequences_nosysd"></a>
</h3>
<p>There is a known issue where systemd can decide to cleanup the cgroup
hierarchy with the intention of making it match with its internal database.
For example, if there are no units in the system with "Delegate=yes",
it will go through the tree and possibly deactivate all the controllers which
it thinks are not in use. In our testing we stopped all our units with
"Delegate=yes", issued a "systemd reload" or a
"systemd reset-failed" and witnessed how the <i>cpuset</i> controller
disappeared from our "manually" created directories deep in the cgroup tree.
There are other situations, and the fact that systemd developers and
documentation claim that they are the unique single-writer to the tree, made
SchedMD decide to be on the safe side and have Slurm coexist with systemd.
</p>
<p>It is worth noting that we added <i>IgnoreSystemd</i> and
<i>IgnoreSystemdOnFailure</i> as cgroup.conf parameters which will avoid any
contact with systemd, and will just use a regular "mkdir" to create the same
directory structures. These parameters are for development and testing
purposes only.</p>
<h3 id="distro_no_sysd">What happens with Linux distros without systemd?
<a class="slurm_link" href="#distro_no_sysd"></a>
</h3>
<p>Slurm does not support them, but they can still work. The only
requirements are to have libdbus, ebpf and systemd packages installed in
the system to compile slurm. Then you can set the <i>IgnoreSystemd</i>
parameter in cgroup.conf to manually create the
<i>/sys/fs/cgroup/system.slice/</i> directory. With these requirements met,
Slurm should work normally.</p>
<h2 id="v2_overview">cgroup/v2 overview
<a class="slurm_link" href="#v2_overview"></a>
</h2>
<p>We will explain briefly this plugin's workflow.</p>
<h3 id="slurmd_startup">slurmd startup
<a class="slurm_link" href="#slurmd_startup"></a>
</h3>
<p>Fresh system: slurmd is started. Some plugins (proctrack, jobacctgather or
task) which use cgroup, call init() function of cgroup/v2 plugin. What happens
immediately is that slurmd does a call to dbus using libdbus, and creates
a new systemd "Scope". The scope name is predefined and set depending on an
internal constant SYSTEM_CGSCOPE under SYSTEM_CGSLICE. It basically ends up
with the name "slurmstepd.scope" or "nodename_slurmstepd.scope" depending on
whether Slurm is compiled with <i>--enable-multiple-slurmd</i> (prefixes node
name) or not. The cgroup directory associated with this scope will be fixed as:
"/sys/fs/cgroup/system.slice/slurmstepd.scope" or
"/sys/fs/cgroup/system.slice/nodename_slurmstepd.scope".
</p>
<p>Since the call to dbus "startTransientUnit" requires a pid as a parameter,
slurmd needs to fork a "slurmstepd infinity" and use this parameter as the
argument.</p>
<p>The call to dbus is asynchronous, so slurmd delivers the message to the Dbus
bus and then starts an active wait, waiting for the scope directory to show up.
If the directory doesn't show up within a hard-coded timeout, it fails.
Otherwise it continues and slurmd then creates a directory for new slurmstepds
and for the infinity pid in the recently created scope directory, called
"system". It moves the infinity process into there and then enables all the
required controllers in the new cgroup directories.
</p>
<p>As this is a regular systemd Unit, the scope will show up in
"systemctl list-unit-files" and other systemd commands, for example:</p>
<pre>
]$ systemctl cat gamba1_slurmstepd.scope
# /run/systemd/transient/gamba1_slurmstepd.scope
# This is a transient unit file, created programmatically via the systemd API. Do not edit.
[Scope]
Delegate=yes
TasksMax=infinity
]$ systemctl list-unit-files gamba1_slurmstepd.scope
UNIT FILE STATE VENDOR PRESET
gamba1_slurmstepd.scope transient -
1 unit files listed.
]$ systemctl status gamba1_slurmstepd.scope
● gamba1_slurmstepd.scope
Loaded: loaded (/run/systemd/transient/gamba1_slurmstepd.scope; transient)
Transient: yes
Active: active (abandoned) since Wed 2022-04-06 14:17:46 CEST; 2h 47min ago
Tasks: 1
Memory: 1.6M
CPU: 258ms
CGroup: /system.slice/gamba1_slurmstepd.scope
└─system
└─113094 /home/lipi/slurm/master/inst/sbin/slurmstepd infinity
apr 06 14:17:46 llit systemd[1]: Started gamba1_slurmstepd.scope.
</pre>
<p>Another action of slurmd init will be to detect which controllers are
available in the system (in /sys/fs/cgroup), and recursively enable the
needed ones until reaching its level. It will enable them for the recently
created slurmstepd scope.</p>
<pre>
]$ cat /sys/fs/cgroup/system.slice/gamba1_slurmstepd.scope/cgroup.controllers
cpuset cpu io memory pids
]$ cat /sys/fs/cgroup/system.slice/gamba1_slurmstepd.scope/cgroup.subtree_control
cpuset cpu memory
</pre>
<p>If resource specialization is enabled, slurmd will set its memory and/or
cpu constraints at its own level too.</p>
<h3 id="slurmd_restart">slurmd restart
<a class="slurm_link" href="#slurmd_restart"></a>
</h3>
<p>Slurmd restarts as usual. When restarted, it will detect if the "scope"
directory already exists, and will do nothing if it does. Otherwise it will
try to setup the scope again.</p>
<h3 id="stepd_start">slurmstepd start
<a class="slurm_link" href="#stepd_start"></a>
</h3>
<p>When a new step needs to be created, whether part of a new job or as part of
an existing job, slurmd will fork the slurmstepd process in its own cgroup
directory. Instantly slurmstepd will start initializing and (if cgroup plugins
are enabled) it will infer the scope directory and will move itself into the
"waiting" area, which is the
<i>/sys/fs/cgroup/system.slice/slurmstepd_nodename.scope/system</i> directory.
Immediately it will initialize the job and step cgroup directories and will move
itself into them, setting the subtree_controllers as required.</p>
<h3 id="term_clean">Termination and cleanup
<a class="slurm_link" href="#term_clean"></a>
</h3>
<p>When a job ends, slurmstepd will take care of removing all the created
directories. The slurmstepd.scope directory will <b>never</b> be removed or
stopped by Slurm, and the "slurmstepd infinity" process will never be killed by
Slurm.</p>
<p>When slurmd ends (since on supported systems it has been started by systemd)
its cgroup will just be cleaned up by systemd.</p>
<h3 id="manual_startup">Special case - manual startup
<a class="slurm_link" href="#manual_startup"></a>
</h3>
<p>Starting slurmd from systemd creates the slurmd unit with its own cgroup.
Then slurmd starts the slurmstepd.scope which in turn creates a new cgroup
tree. Any new process spawned for a job is migrated into this scope. If,
instead of starting slurmd from systemd, one starts slurmd manually from the
command line, things are different. The slurmd will be spawned into the same
terminal's cgroup and will share the cgroup tree with the terminal process
itself (and possibly with other user processes).</p>
<p>This situation is detected by slurmd by reading the <b>INVOCATION_ID</b>
environment variable. This variable is normally set by systemd when it starts
a process and is a way to determine if slurmd has been started in its own
cgroup or started manually into a shared cgroup. In the first case slurmd
doesn't try to move itself to any other cgroup. In the second case, where
<b>INVOCATION_ID</b> is not set, it will try to move itself to a new
subdirectory inside the slurmstepd.scope cgroup.</p>
<p>A problem arises when <b>INVOCATION_ID</b> is set in your environment and
you try to start slurmd manually. slurmd will think it is in its own cgroup
and won't try to migrate itself and, if MemSpecLimit or CoreSpecLimit are set,
slurmd will apply memory or core limits into this cgroup, indirectly limiting
your terminal or other processes. For example, starting slurmd in your terminal
with low memory in MemSpecLimit, sending it to the background, and then trying
to run any program that consumes memory, might end up with your processes
being OOMed.</p>
<p>To avoid this situation we recommend you unset <b>INVOCATION_ID</b> before
starting Slurm, in situations where this environment variable is set.</p>
<p>Another problem related to this is when not all controllers are enabled in
your terminal's cgroup, which is what typically happens in the systemd
<i>user.slice</i>. Then slurmd will fail to initialize because it won't detect
the required controllers, and will display errors similar to these:</p>
<pre>
]# slurmd -Dv
slurmd: error: Controller cpuset is not enabled!
slurmd: error: Controller cpu is not enabled!
...
slurmd: slurmd version 23.11.0-0rc1 started
slurmd: error: cpu cgroup controller is not available.
slurmd: error: There's an issue initializing memory or cpu controller
slurmd: error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init() callback failed
slurmd: error: cannot create jobacct_gather context for jobacct_gather/cgroup
slurmd: fatal: Unable to initialize jobacct_gather
</pre>
<p>One workaround is to set <i>EnableControllers=yes</i> in cgroup.conf, but
note that this won't save you from possibly having other processes have OOM
errors, as mentioned previously. Moreover, that will modify your entire cgroup
tree from the root <i>/sys/fs/cgroup</i>. So the real solution is to either
start slurmd through a unit file, or unset the <b>INVOCATION_ID</b>
environment variable.</p>
<p><b>NOTE</b>: Be aware that this doesn't only happen when starting slurmd
manually. It may happen if you use custom scripts to start slurmd, even if the
scripts are run with systemd. We encourage you to use our provided
slurmd.service file or at least to unset the <b>INVOCATION_ID</b> explicitly
in your startup scripts.</p>
<h3 id="troubleshooting_startup">Troubleshooting startup
<a class="slurm_link" href="#troubleshooting_startup"></a>
</h3>
<p> As the integration with systemd has some degree of complexity, and due to
different configurations or changes in OS setups, we encourage you to set the
debug flags in slurm.conf in order to diagnose what is going on if slurm doesn't
start in cgroup/v2:</p>
<pre>
DebugFlags=cgroup
SlurmdDebug=debug
</pre>
<p>If slurmd starts but throws cgroup errors, it is advisable to look at which
cgroup slurmd has been started in. For example, this shows slurmd started in
the user slice cgroup, which is generally wrong, and has possibly been started
manually from the terminal with <b>INVOCATION_ID</b> set:</p>
<pre>
[root@llagosti ~]# cat /proc/$(pidof slurmd)/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-gnome-tmaster-47247.scope
[root@llagosti ~]# grep -i INVOCATION_ID= /proc/47279/environ
grep: /proc/47279/environ: binary file matches
</pre>
<p>Instead, when slurmd is manually and correctly started:</p>
<pre>
[root@llagosti ~]# cat /proc/$(pidof slurmd)/cgroup
0::/system.slice/gamba1_slurmstepd.scope/slurmd
</pre>
<p>Finally, if slurmd is started by systemd you should see it living in its own
cgroup:</p>
<pre>
[root@llagosti ~]# cat /proc/$(pidof slurmd)/cgroup
0::/system.slice/slurmd.service
</pre>
<h2 id="hierarchy_overview">Hierarchy overview
<a class="slurm_link" href="#hierarchy_overview"></a>
</h2>
Hierarchy will take this form:
<div class="figure">
<img src="cg_hierarchy.jpg"></img>
<br>
Figure 1. Slurm cgroup v2 hierarchy.
</div>
<p>On the left side we have the slurmd service, started with systemd and living
alone in its own delegated cgroup.</p>
<p>On the right side we see the slurmstepd scope, a directory in the cgroup tree
also delegated where all slurmstepd and user jobs will reside. The slurmstepd
is migrated initially in the waiting area for new stepds, <i>system</i>
directory, and immediately, when it initializes the job hierarchy, it will move
itself into the corresponding <i>job_x/step_y/slurm_processes</i> directory.
</p>
<p>User processes will be spawned by slurmstepd and moved into the appropriate
task directory.</p>
<p>At this point it should be possible to check which processes
are running in a slurmstepd scope by issuing this command:</p>
<pre>
]$ systemctl status slurmstepd.scope
● slurmstepd.scope
Loaded: loaded (/run/systemd/transient/slurmstepd.scope; transient)
Transient: yes
Active: active (abandoned) since Wed 2022-04-06 14:17:46 CEST; 2min 47s ago
Tasks: 24
Memory: 18.7M
CPU: 141ms
CGroup: /system.slice/slurmstepd.scope
├─job_3385
│ ├─step_0
│ │ ├─slurm
│ │ │ └─113630 slurmstepd: [3385.0]
│ │ └─user
│ │ └─task_0
│ │ └─113635 /usr/bin/sleep 123
│ ├─step_extern
│ │ ├─slurm
│ │ │ └─113565 slurmstepd: [3385.extern]
│ │ └─user
│ │ └─task_0
│ │ └─113569 sleep 100000000
│ └─step_interactive
│ ├─slurm
│ │ └─113584 slurmstepd: [3385.interactive]
│ └─user
│ └─task_0
│ ├─113590 /bin/bash
│ ├─113620 srun sleep 123
│ └─113623 srun sleep 123
└─system
└─113094 /home/lipi/slurm/master/inst/sbin/slurmstepd infinity
</pre>
<p><b>NOTE</b>: If running on a development system with
<i>--enable-multiple-slurmd</i>, the slurmstepd.scope will have the nodename
prepended to it.</p>
<h2 id="task_level">Working at the task level
<a class="slurm_link" href="#task_level"></a>
</h2>
<p>There is a directory called <i>task_special</i> in the user job hierarchy.
The <i>jobacctgather/cgroup</i> and <i>task/cgroup</i> plugins respectively get
statistics and constrain resources at the task level. Other plugins like
<i>proctrack/cgroup</i> just work at the step level. To unify the hierarchy and
make it work for all different plugins, when a plugin asks to add a pid to a
step but not to a task, the pid will be put into a special directory called
<i>task_special</i>. If another plugin adds this pid to a task, it will be
migrated from there. Normally this happens with the proctrack plugin when a call
is done to add a pid to a step with <i>proctrack_g_add_pid</i>.</p>
<h2 id="ebpf_controller">The eBPF based devices controller
<a class="slurm_link" href="#ebpf_controller"></a>
</h2>
<p>In Control Group v2, the devices controller interfaces has been removed.
Instead of controlling it through files, now it is required to create a bpf
program of type BPF_PROG_TYPE_CGROUP_DEVICE and attach it to the desired
cgroup. This program is created by slurmstepd dynamically and inserted into
the kernel with a bpf syscall, and describes which devices are allowed or
denied for the job, step and task.</p>
<p>The only devices that are managed are the ones described in the
gres.conf file.</p>
<p>The insertion and removal of such programs will be logged in the system
log:</p>
<pre>
apr 06 17:20:14 node1 audit: BPF prog-id=564 op=LOAD
apr 06 17:20:14 node1 audit: BPF prog-id=565 op=LOAD
apr 06 17:20:14 node1 audit: BPF prog-id=566 op=LOAD
apr 06 17:20:14 node1 audit: BPF prog-id=567 op=LOAD
apr 06 17:20:14 node1 audit: BPF prog-id=564 op=UNLOAD
apr 06 17:20:14 node1 audit: BPF prog-id=567 op=UNLOAD
apr 06 17:20:14 node1 audit: BPF prog-id=566 op=UNLOAD
apr 06 17:20:14 node1 audit: BPF prog-id=565 op=UNLOAD
</pre>
<h2 id="diff_ver">Running different nodes with different cgroup versions
<a class="slurm_link" href="#diff_ver"></a>
</h2>
<p>The cgroup version to be used is entirely dependent on the node. Because of
this, it is possible to run the same job on different nodes with different
cgroup plugins. The configuration is done per node in cgroup.conf.</p>
<p>What can not be done is to swap the version of cgroup plugin in cgroup.conf
without rebooting and configuring the node. Since we do not support "hybrid"
systems with mixed controller versions, a node must be booted with one specific
cgroup version.</p>
<h2 id="configuration">Configuration
<a class="slurm_link" href="#configuration"></a>
</h2>
<p>In terms of configuration, setup does not differ much from the previous
<i>cgroup/v1</i> plugin, but the following considerations must be taken into
account when configuring the cgroup plugin in <i>cgroup.conf</i>:</p>
<h3 id="cgroup_plugin">Cgroup Plugin
<a class="slurm_link" href="#cgroup_plugin"></a>
</h3>
<p>This option allows the sysadmin to specify which cgroup version will be run
on the node. It is recommended to use <i>autodetect</i> and forget about it, but
this can be forced to the plugin version too.</p>
<p><b>CgroupPlugin=[autodetect|cgroup/v1|cgroup/v2]</b></p>
<h3 id="dev_options">Developer options
<a class="slurm_link" href="#dev_options"></a>
</h3>
<ul>
<li><b>IgnoreSystemd=[yes|no]</b>: This option is used to avoid any call to dbus
for contacting systemd. Instead of requesting the creation of a new scope when
slurmd starts up, it will only use "mkdir" to prepare the cgroup directories for
the slurmstepds. Use of this option in production systems with systemd is not
supported for the reasons mentioned <a href="#consequences_nosysd">above</a>.
This option can be useful for systems without systemd though.
</li>
<li><b>IgnoreSystemdOnFailure=[yes|no]</b>: This option will fallback to manual
mode for creating the cgroup directories without creating a systemd "scope".
This is only if a call to dbus returned an error, as it would be with
<b>IgnoreSystemd</b>.
</li>
<li><b>EnableControllers=[yes|no]</b>: When set, slurmd will check all the
available controllers in <i>/sys/fs/cgroup/cgroup.controllers</i> and will
enable them recursively in the cgroup.subtree_control file until it reaches
the slurmd level. This is generally required in RHEL8/Rocky8, some containers,
or with systemd &lt; 244.
</li>
<li><b>CgroupMountPoint=/path/to/mount/point</b>: In most cases with cgroup v2,
this parameter should not be used because <i>/sys/fs/cgroup</i> will be the only
cgroup directory.
</li>
</ul>
<h3 id="ignored_params">Ignored parameters
<a class="slurm_link" href="#ignored_params"></a>
</h3>
<p>Since Cgroup v2 doesn't provide the swappiness interface anymore in
the memory controller, the following parameter in cgroup.conf will be ignored:
</p>
<pre>
MemorySwappiness=
</pre>
<h2 id="requirements">Requirements
<a class="slurm_link" href="#requirements"></a>
</h2>
<p>For building <i>cgroup/v2</i> there are two required libraries checked at
configure time. Look at your config.log when configuring to see if they
were correctly detected on your system.</p>
<table style="page-break-inside: avoid; font-family: Arial,Helvetica,sans-serif;" border="1" bordercolor="#000000" cellpadding="3" cellspacing="0" width="100%">
<colgroup>
<col width="5%">
<col width="20%">
<col width="20%">
<col width="15%">
<col width="30%">
</colgroup>
<tr bgcolor="#e0e0e0">
<td><u><b>Library</b></u></td>
<td><u><b>Header file</b></u></td>
<td><u><b>Package provides</b></u></td>
<td><u><b>Configure option</b></u></td>
<td><u><b>Purpose</b></u></td>
</tr>
<tr>
<td>eBPF</td>
<td>include/linux/bpf.h</td>
<td>kernel-headers (>= 5.7)</td>
<td>--with-bpf=</td>
<td>Constrain devices to a job/step/task</td>
</tr>
<tr>
<td>dBus</td>
<td>dbus-1.0/dbus/dbus.h</td>
<td>dbus-devel (>= 1.11.16)</td>
<td>n/a</td>
<td>dBus API for contacting systemd</td>
</tr>
</table>
<br>
<p><b>NOTE</b>: In systems without systemd, these libraries are also needed to
compile Slurm. If some other requirement exists, like not including the dbus
or systemd package requirement, the configure files would have to be modified.
</p>
<p>In order to use <i>cgroup/v2</i>, a valid cgroup namespace, mount namespace
and process namespace, plus its respective mounts are required. This typically
applies to containerized environments where depending on the configuration,
namespaces are created but related mountpoints are not mounted. This may happen
in certain configurations of Docker or Kubernetes.</p>
<p>The default behaviour of Kubernetes has been tested and found it uses a
correct cgroup setup compatible with Slurm. Regarding Docker, either use the
host cgroup namespace or create a private one by using
<i>--cgroupns=private</i>. Note that you will need <i>--privileged</i>,
otherwise the container will not have write permissions on the cgroup.
To use the host cgroup namespace ensure that the container is created
inside a child cgroup, you can specify this mode of operation with the option
<i>--cgroupns=host</i> together with <i>--cgroup-parent</i> to specify the
parent cgroup of the container.</p>
<h2 id="pam_slurm_adopt">PAM Slurm Adopt plugin on cgroup v2
<a class="slurm_link" href="#pam_slurm_adopt"></a>
</h2>
<p>The <a href="pam_slurm_adopt.html">pam_slurm_adopt plugin</a> has a
dependency with the API of <i>cgroup/v1</i> because in some situations it relied
on the job's cgroup creation time for choosing which job id should be picked to
add your sshd pid into. With v2 we wanted to remove this dependency and not
rely on the cgroup filesystem, but simply on the job id. This won't guarantee
that the sshd session is inserted into the youngest job, but will guarantee it
will be put into the largest job id. Thanks to this we removed the dependency of
the plugin against the specific cgroup hierarchy.
</p>
<h2 id="limitations">Limitations
<a class="slurm_link" href="#limitations"></a>
</h2>
<p>
The <i>cgroup/v2</i> plugin can provide all the accounting statistics for
CPU and Memory that the kernel cgroup interface offers. This does not
include virtual memory, so expect a value of 0 for metrics such as <i>AveVMSize,
MaxVMSize, MaxVMSizeNode, MaxVMSizeTask</i> and <i>vmem</i> in
<i>TRESUsageInTot</i> when <i>jobacct_gather/cgroup</i> is used in combination
with <i>cgroup/v2</i>.</p>
<p>In what regards to real stack size (RSS), this plugin provides cgroup's
<i>memory.current</i> value from the memory interface, which is not equal to the
RSS value provided by procfs. Nevertheless it is the same value that the kernel
uses in its OOM killer logic.
</p>
<p>RHEL8 / Rocky8: According to its release notes, support for cgroups v2
started as a technology preview in RHEL8.0 and the features were backported to
the 4.18 kernel. In RHEL8.2 the notes say cgroups v2 was fully supported, but
they emit a warning that not all features are implemented. We recommend
contacting Red Hat for the status of their support for cgroups v2, which should
be tracked in their ticket: BZ#1401552. This release also comes with systemd
239, which does not support the cpuset interface.
</p>
<p>Systemd &lt; 244: Prior to this version, systemd did not support the cpuset
controller, and in old kernels the cpu controller is not enabled by default.
The cpu controller can be enabled in system.conf by setting
<code>DefaultCpuAccounting=yes</code>. For the cpuset controller, you need to
set <code>EnableControllers=yes</code> in cgroup.conf.
</p>
<p style="text-align:center;">Last modified 11 October 2024</p>
<!--#include virtual="footer.txt"-->