<!--#include virtual="header.txt"-->
<h1>Control Group in Slurm</h1>
<h2 id="contents">Contents
<a class="slurm_link" href="#contents"></a>
</h2>
<ul>
<li><a href="#overview">Control Group Overview</a></li>
<li><a href="#cgroup_design">Slurm cgroup plugins design</a></li>
<li><a href="#use">Use of cgroup in Slurm</a></li>
<li><a href="#configuration">Slurm Cgroup Configuration Overview</a></li>
<li><a href="#Plugins">Currently Available Cgroup Plugins</a>
<ul>
<li><a href="#proctrack">proctrack/cgroup plugin</a></li>
<li><a href="#task">task/cgroup plugin</a></li>
<li><a href="#jobacct_gather">jobacct_gather/cgroup plugin</a></li>
</ul>
</li>
<li><a href="#Specialization">Use of cgroup for Resource Specialization</a></li>
<li><a href="#cgroupplugins">Slurm cgroup plugins</a>
<ul>
<li><a href="#differences">Main differences between cgroup/v1 and cgroup/v2</a></li>
<li><a href="#interfaces">Main differences between controller interfaces</a></li>
<li><a href="#generalities">Other generalities</a></li>
</ul>
</li>
</ul>
<h2 id="overview">Control Group Overview
<a class="slurm_link" href="#overview"></a>
</h2>
<p>Control Group is a mechanism provided by the kernel to organize processes
hierarchically and distribute system resources along the hierarchy in a
controlled and configurable manner. Slurm can make use of cgroups to constrain
different resources to jobs, steps and tasks, and to get accounting about these
resources.</p>
<p>The cgroup mechanism provides different controllers (formerly called
"subsystems") for different resources. Slurm plugins can use several of these
controllers, e.g. <i>memory, cpu, devices, freezer, cpuset, cpuacct</i>. Each
enabled controller gives the ability to constrain a resource for a set of
processes. If a controller is not available on the system, then Slurm cannot
constrain the associated resource through a cgroup.</p>
<p>"cgroup" stands for "control group" and is never capitalized. The singular
form is used to designate the whole feature and also as a qualifier as in
"cgroup controllers". When explicitly referring to multiple individual control
groups, the plural form "cgroups" is used.</p>
<p>Slurm supports two cgroup modes: Legacy mode (cgroup v1) and Unified mode
(cgroup v2). Hybrid mode, where controllers from both version 1 and version 2
are mixed on a system, is not supported.</p>
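<p>One common way to check which mode a node is running is to inspect the
filesystem type mounted on /sys/fs/cgroup. A minimal check (output is
illustrative):</p>
<pre>
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
</pre>
<p>A result of <i>cgroup2fs</i> indicates Unified mode (cgroup v2), while
<i>tmpfs</i> indicates Legacy mode (cgroup v1).</p>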
<p><b>NOTE</b>: The cgroup/v1 plugin is deprecated and will not be supported in
future Slurm versions. Newer GNU/Linux distributions are dropping, or have
already dropped, support for cgroup v1 and may not even provide kernel support
for the required cgroup v1 interfaces. Systemd has also deprecated cgroup v1.
Starting with Slurm version 25.05, no new features will be added to cgroup v1.
Fixes for critical bugs will be provided until its final removal.</p>
<p>See the kernel.org documentation for a more comprehensive description of
cgroup:</p>
<ul>
<li><a href="https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt">
Kernel's Cgroup v1 documentation</a>
</li>
<li><a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html">
Kernel's Cgroup v2 documentation </a>
</li>
</ul>
<h2 id="cgroup_design">Slurm cgroup plugins design
<a class="slurm_link" href="#cgroup_design"></a>
</h2>
<p>For extended information on the design of Slurm's internal cgroup plugins,
read:</p>
<ul>
<li><a href="cgroup_v2.html">cgroup/v2 plugin documentation</a> </li>
</ul>
<h2 id="use">Use of cgroup in Slurm <a class="slurm_link" href="#use"></a> </h2>
<p>Slurm provides cgroup versions of a number of plugins.</p>
<ul>
<li>proctrack/cgroup (for process tracking and management)</li>
<li>task/cgroup (for constraining resources at step and task level)</li>
<li>jobacct_gather/cgroup (for gathering statistics)</li>
</ul>
<p>cgroups can also be used for resource specialization (constraining daemons to
cores or memory).</p>
<h2 id="configuration">Slurm Cgroup Configuration Overview
<a class="slurm_link" href="#configuration"></a>
</h2>
<p>There are several sets of configuration options for Slurm cgroups:</p>
<ul>
<li><a href="slurm.conf.html">slurm.conf</a> provides options to enable the
cgroup plugins. Each plugin may be enabled or disabled independently of the
others.
</li>
<li><a href="cgroup.conf.html">cgroup.conf</a> provides general options that are
common to all cgroup plugins, plus additional options that apply only to
specific plugins.
</li>
<li>System-level resource specialization is enabled using node configuration
parameters.
</li>
</ul>
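<p>As an illustration, a cluster that uses all three cgroup plugins described
below could carry the following lines in its slurm.conf:</p>
<pre>
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobacctGatherType=jobacct_gather/cgroup
</pre>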
<h2 id="Plugins">Currently Available Cgroup Plugins
<a class="slurm_link" href="#Plugins"></a>
</h2>
<h3 id="proctrack">proctrack/cgroup plugin
<a class="slurm_link" href="#proctrack"></a>
</h3>
<p>The proctrack/cgroup plugin is an alternative to other proctrack plugins such
as proctrack/linux for process tracking and suspend/resume capability.
</p>
<p>
proctrack/cgroup uses the freezer controller to keep track of all the pids of a
job. It basically stores the pids in a specific hierarchy in the cgroup tree and
takes care of signaling these pids when instructed. For example, if a user
decides to cancel a job, Slurm carries out this request internally by calling
the proctrack plugin and asking it to send a SIGTERM to the job. Since proctrack
maintains a hierarchy of all Slurm-related pids in cgroup, it can easily
determine which ones need to be signaled.
<br>
Proctrack can also respond to queries for getting a list of all the pids of a
job or a step.
<br>
Unlike with proctrack/linux, the pids of a part of the hierarchy are stored by
the kernel in a single file (cgroup.procs) which the plugin simply reads. For
example, when using proctrack/cgroup, a single step has its own cgroup.procs
file, so getting the pids of the step is instantaneous. With proctrack/linux,
/proc must be read recursively to find all the descendants of a parent pid.
</p>
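<p>As an illustration, with cgroup/v1 the pids of a given step can be listed by
reading its cgroup.procs file directly; the uid, job and step ids below are
hypothetical:</p>
<pre>
$ cat /sys/fs/cgroup/freezer/slurm/uid_1000/job_1/step_0/cgroup.procs
12345
12346
</pre>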
<p>To enable this plugin, configure the following option in slurm.conf:
<pre>ProctrackType=proctrack/cgroup</pre>
</p>
<p>There are no specific options for this plugin in cgroup.conf, but the general
options apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page for
details.</p>
<h3 id="task">task/cgroup plugin<a class="slurm_link" href="#task"></a></h3>
<p>The task/cgroup plugin allows constraining resources to a job, a step, or a
task. This is the only plugin that can ensure that the boundaries of an
allocation are not violated.
The jobacct_gather/linux plugin offers only a very simplistic mechanism for
constraining memory to a job, but it is not reliable (there is a window of time
during which a job can exceed its limits) and is intended only for the very rare
systems where cgroup is not available.</p>
<p>task/cgroup provides the following features:</p>
<ul>
<li>Confine jobs and steps to their allocated cpuset.</li>
<li>Confine jobs and steps to specific memory resources.</li>
<li>Confine jobs, steps and tasks to their allocated gres, including gpus.</li>
</ul>
<p>The task/cgroup plugin uses the cpuset, memory and devices controllers.</p>
<p>To enable this plugin, add <i>task/cgroup</i> to the TaskPlugin configuration
parameter in slurm.conf:</p>
<pre>TaskPlugin=task/cgroup</pre>
<p>There are many specific options for this plugin in cgroup.conf. The general
options also apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page
for details.</p>
<p>This plugin can be stacked with other task plugins, for example with
<i>task/affinity</i>. This allows resources to be constrained to a job while
also gaining the advantages of the affinity plugin (order doesn't matter):</p>
<pre>TaskPlugin=task/cgroup,task/affinity</pre>
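<p>A minimal sketch of a cgroup.conf enabling core, memory and device
confinement for this plugin might look as follows; see the
<a href="cgroup.conf.html">cgroup.conf</a> man page for the authoritative list
of options:</p>
<pre>
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes
</pre>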
<h3 id="jobacct_gather">jobacct_gather/cgroup plugin
<a class="slurm_link" href="#jobacct_gather"></a>
</h3>
<p>
The <i>jobacct_gather/cgroup</i> plugin is an alternative to the
<i>jobacct_gather/linux</i> plugin for the collection of accounting statistics
for jobs, steps and tasks.
<br>
<i>jobacct_gather/cgroup</i> uses the cpuacct and memory cgroup controllers.
</p>
<p>The cpu and memory statistics collected by this plugin do not represent the
same resources as the cpu and memory statistics collected by
<i>jobacct_gather/linux</i>. While the cgroup plugin just reads stat files
(e.g. memory.stat) containing the information for the entire subtree of pids,
the linux plugin gets the information from /proc/&lt;pid&gt;/stat for every pid
and then does the calculations itself, making it slightly less efficient (though
not noticeably so in practice) than the cgroup one.</p>
<p>To enable this plugin, configure the following option in slurm.conf:
<pre>JobacctGatherType=jobacct_gather/cgroup</pre>
</p>
<p>There are no specific options for this plugin in cgroup.conf, but the general
options apply. See the <a href="cgroup.conf.html">cgroup.conf</a> man page for
details.</p>
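<p>For example, the statistics gathered by this plugin for a running job can be
inspected with sstat (the job id is hypothetical):</p>
<pre>
$ sstat --format=JobID,AveCPU,MaxRSS,MaxVMSize -j 1234
</pre>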
<h2 id="Specialization">Use of cgroup for Resource Specialization
<a class="slurm_link" href="#Specialization"></a>
</h2>
<p>Resource Specialization may be used to reserve a subset of cores or a
specific amount of memory on each compute node for exclusive use by the Slurm
compute node daemon, slurmd.</p>
<p>If cgroup/v1 is used, the reserved resources will also be used by the
slurmstepd processes. If cgroup/v2 is used, slurmstepd is not constrained by
this resource specialization. Instead, slurmstepd is constrained to the
resources allocated to the job, since it is considered part of the job and its
consumption depends entirely on the topology of the job. For example, an MPI
job can initialize many ranks with PMI and make slurmstepd consume more
memory.</p>
<p>System-level resource specialization is enabled with special node
configuration parameters. Read <a href="slurm.conf.html">slurm.conf</a> and the
core specialization documentation in <a href="core_spec.html">core_spec.html</a>
for more information.</p>
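<p>A minimal sketch of such a node definition in slurm.conf, reserving two cores
and 2048 MB of memory for system use (the node name and sizes are hypothetical):
</p>
<pre>
NodeName=node001 CPUs=32 RealMemory=64000 CoreSpecCount=2 MemSpecLimit=2048
</pre>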
<h2 id="cgroupplugins">Slurm cgroup plugins
<a class="slurm_link" href="#cgroupplugins"></a>
</h2>
<p>
The cgroup v1 and v2 plugins organize their hierarchies in very different ways
and respond to different design constraints. These constraints come from the
kernel, whose maintainers are responsible for the cgroup design.
</p>
<h3 id="differences">Main differences between cgroup/v1 and cgroup/v2
<a class="slurm_link" href="#differences"></a>
</h3>
<p>The main differences between v1 and v2 are:</p>
<ul>
<li><b>Unified mode in v2</b><br>
<p>In <i>cgroup/v1</i> there's a separate hierarchy for each controller, which
means the job structure must be replicated and managed for every enabled
controller. For example, for the same job, if using the
<i>memory</i> and <i>freezer</i> controllers, the same
slurm/uid/job_id/step_id/ hierarchy must be created in both controllers'
directories:</p>
<pre>/sys/fs/cgroup/memory/slurm/uid_1000/job_1/step_0/</pre>
<pre>/sys/fs/cgroup/freezer/slurm/uid_1000/job_1/step_0/</pre>
<p>In <i>cgroup/v2</i> there is a <i>Unified</i> hierarchy, where controllers
are enabled at the same level and presented to the user as different files.</p>
<pre>/sys/fs/cgroup/system.slice/slurmstepd.scope/job_1/step_0/</pre>
</li>
<li><b>Top-down constraint in v2</b><br>
<p>Resources are distributed top-down, and a cgroup can further distribute a
resource only if that resource has been distributed to it from its parent.
The controllers available to a cgroup are listed in its
<i>cgroup.controllers</i> file, and the controllers enabled for its subtree are
listed in <i>cgroup.subtree_control</i> (see the shell sketch after this
list).</p>
</li>
<li><b>No-Internal-Process constraint in v2</b><br>
<p>In <i>cgroup/v1</i> the hierarchy is free-form, which means one can create
any directory in the tree and put pids in it. In <i>cgroup/v2</i> there's a
kernel restriction which prevents pids from being added to non-leaf
directories.</p>
</li>
<li><b>Systemd dependency in cgroup/v2 - separation of slurmd and stepds
</b><p>This is not a kernel limitation but a systemd decision, which imposes an
important restriction on services that use <i>Delegate=yes</i>.
Systemd, running as pid 1, claims to be the sole owner of the cgroup
hierarchy, <i>/sys/fs/cgroup</i>, imposing a <i>single-writer</i>
design. This means that everything related to cgroup must be under the control
of systemd. If one manually modifies the cgroup tree, creating directories
and moving pids around, it is possible that at some point systemd will decide to
enable or disable controllers on the entire tree, or move pids around. It has
been observed that a
<pre>systemctl daemon-reload</pre>
or a
<pre>systemctl reset-failed</pre>
removed controllers, at any level and directory of the tree, if no systemd unit
was making use of them and no started unit on the system had "Delegate=yes"
set. This is because systemd wants to clean up the cgroup tree and match it
against its internal unit database. In fact, looking at the systemd code one can
see how cgroup directories related to units with the "Delegate=yes" flag are
ignored, while any other cgroup directories are modified. This makes it
mandatory to start slurmd and slurmstepd processes under a unit with
"Delegate=yes", which means slurmd must be started, stopped and restarted with
systemd. If we do that, though, since we may have previously modified the tree
where slurmd lives (e.g. adding job directories), systemd would not be able to
restart slurmd: it cannot put the new slurmd pid into its old cgroup, which is
now a non-leaf, because of the <i>No-Internal-Process constraint</i> mentioned
earlier. This forces us to separate the cgroup hierarchy of slurmstepd from
that of slurmd, and since we need to inform systemd about it and put slurmstepd
into a new unit, a dbus call is made to systemd to create a new scope for
slurmstepds. See
<a href="https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/">
systemd ControlGroupInterface</a> for more information.</p>
</li>
</ul>
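<p>The top-down constraint can be observed directly from a shell. The output
below is only illustrative and will vary between systems:</p>
<pre>
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids
$ cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
# A controller can only be enabled in a child cgroup if it is
# already enabled in the parent's cgroup.subtree_control:
$ echo "+memory" > /sys/fs/cgroup/system.slice/cgroup.subtree_control
</pre>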
<p>The following differences shouldn't affect how other plugins interact with
the cgroup plugins; they only reflect internal functional differences.</p>
<ul>
<li>A controller in <i>cgroup/v2</i> is enabled by writing its name to the
parent's <i>cgroup.subtree_control</i> file, while in <i>cgroup/v1</i> a new
mount point must be mounted with filesystem type <i>"-t cgroup"</i> and the
corresponding options, e.g. <i>"-o freezer"</i>.
</li>
<li>In <i>cgroup/v2</i> the freezer functionality is inherently provided through
the <i>cgroup.freeze</i> interface. In <i>cgroup/v1</i> it is a specific and
separate controller which needs to be mounted.
</li>
<li>The devices controller does not exist in cgroup/v2; instead, an eBPF
program must be loaded into the kernel.
</li>
<li>In <i>cgroup/v2</i>, the memory.stat file has changed, so Slurm now sums
anon + swapcached + anon_thp to match the RSS concept from v1 (see the sketch
after this list).
</li>
<li>In <i>cgroup/v2</i>, cpu.stat provides metrics in microseconds, while
cpuacct.stat in <i>cgroup/v1</i> provides metrics in USER_HZ.
</li>
</ul>
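<p>As a sketch of that v2 RSS calculation, the relevant memory.stat fields can
be summed from a shell; the step path is hypothetical and field availability
depends on the kernel version:</p>
<pre>
$ cd /sys/fs/cgroup/system.slice/slurmstepd.scope/job_1/step_0
$ awk '$1=="anon" || $1=="swapcached" || $1=="anon_thp" {sum+=$2} END {print sum}' memory.stat
</pre>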
<h3 id="interfaces">Main differences between controller interfaces
<a class="slurm_link" href="#interfaces"></a>
</h3>
<table style="page-break-inside: avoid; font-family: Arial,Helvetica,sans-serif;" border="1" bordercolor="#000000" cellpadding="3" cellspacing="0" width="100%">
<tr bgcolor="#e0e0e0">
<td><u><b>cgroup/v1</b></u></td>
<td><u><b>cgroup/v2</b></u></td>
</tr>
<tr>
<td>memory.limit_in_bytes</td>
<td>memory.max</td>
</tr>
<tr>
<td>memory.soft_limit_in_bytes</td>
<td>memory.high</td>
</tr>
<tr>
<td>memory.memsw.limit_in_bytes</td>
<td>memory.swap.max</td>
</tr>
<tr>
<td>memory.swappiness</td>
<td>none</td>
</tr>
<tr>
<td>freezer.state</td>
<td>cgroup.freeze</td>
</tr>
<tr>
<td>cpuset.cpus</td>
<td>cpuset.cpus.effective and cpuset.cpus</td>
</tr>
<tr>
<td>cpuset.mems</td>
<td>cpuset.mems.effective and cpuset.mems</td>
</tr>
<tr>
<td>cpuacct.stat</td>
<td>cpu.stat</td>
</tr>
<tr>
<td>devices.*</td>
<td>ebpf program</td>
</tr>
</table>
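<p>For instance, setting an 8 GiB hard memory limit differs between the two
versions only in the interface file written to (the paths are hypothetical):</p>
<pre>
# cgroup/v1
$ echo 8589934592 > /sys/fs/cgroup/memory/slurm/uid_1000/job_1/memory.limit_in_bytes
# cgroup/v2
$ echo 8589934592 > /sys/fs/cgroup/system.slice/slurmstepd.scope/job_1/memory.max
</pre>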
<h3 id="generalities">Other generalities
<a class="slurm_link" href="#generalities"></a>
</h3>
<ul>
<li>When using cgroup/v1, some kernel configurations exclude swap accounting,
which is part of the features provided by the memory controller. If this
feature is disabled in the kernel or by boot parameters, trying to enable swap
constraints will produce an error. If swap accounting is required, add the
following parameters to the kernel command line (a verification sketch follows
this list):
<pre>cgroup_enable=memory swapaccount=1</pre>
These can usually be placed in /etc/default/grub inside
the <i>GRUB_CMDLINE_LINUX</i> variable. A command such as <i>update-grub</i>
must be run after updating the file. The feature can also be disabled in the
kernel configuration with the parameter:
<pre>CONFIG_MEMCG_SWAP=</pre></li>
<li>In some Linux distributions it was possible to use the systemd parameter
JoinControllers, which is now deprecated. This parameter allowed multiple
controllers to be mounted in a single hierarchy in <i>cgroup/v1</i>, more or
less emulating the behavior of <i>cgroup/v2</i> in "Unified" mode.
However, Slurm does not work correctly with this configuration, so please make
sure your system.conf does not use JoinControllers and that all your cgroup
controllers are under separate directories when using
<i>cgroup/v1</i> legacy mode.
</li>
</ul>
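<p>As referenced in the first item above, whether swap accounting is active on a
cgroup/v1 system can be verified from a shell (output is illustrative):</p>
<pre>
$ tr ' ' '\n' &lt; /proc/cmdline | grep -E 'cgroup_enable|swapaccount'
cgroup_enable=memory
swapaccount=1
$ ls /sys/fs/cgroup/memory/memory.memsw.limit_in_bytes
/sys/fs/cgroup/memory/memory.memsw.limit_in_bytes
</pre>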
<p style="text-align:center;">Last modified 5 May 2025</p>
<!--#include virtual="footer.txt"-->