| <!--#include virtual="header.txt"--> |
| |
| <h1>Control Group v2 plugin</h1> |
| |
| <h2 id="contents">Contents |
| <a class="slurm_link" href="#contents"></a> |
| </h2> |
| |
| <ul> |
| <li><a href="#overview">Overview</a></li> |
| <li><a href="#conversion">Conversion from cgroup v1</a> |
| <ul> |
| <li><a href="#reconfigure_systemd">Reconfigure SystemD</a></li> |
| <li><a href="#general_conversion">General conversion</a></li> |
| </ul> |
| </li> |
| <li><a href="#v2_rules">Following cgroup v2 rules</a> |
| <ul> |
| <li><a href="#top_down">Top-down Constraint</a></li> |
| <li><a href="#no_internal_process">No Internal Process Constraint</a></li> |
| </ul> |
| </li> |
| <li><a href="#systemd_rules">Following systemd rules</a> |
| <ul> |
| <li><a href="#real_sysd_prob">The real problem: systemd + restarting slurmd</a></li> |
| <li><a href="#consequences_nosysd">Consequences of not following systemd rules</a></li> |
| <li><a href="#distro_no_sysd">What happens with Linux distros without systemd?</a></li> |
| </ul> |
| </li> |
| <li><a href="#v2_overview">cgroup/v2 overview</a> |
| <ul> |
| <li><a href="#slurmd_startup">slurmd startup</a></li> |
| <li><a href="#slurmd_restart">slurmd restart</a></li> |
| <li><a href="#stepd_start">slurmstepd start</a></li> |
| <li><a href="#term_clean">Termination and cleanup</a></li> |
| <li><a href="#manual_startup">Special case - manual startup</a></li> |
| <li><a href="#troubleshooting_startup">Troubleshooting startup</a></li> |
| </ul> |
| </li> |
| <li><a href="#hierarchy_overview">Hierarchy overview</a></li> |
| <li><a href="#task_level">Working at the task level</a></li> |
| <li><a href="#ebpf_controller">The eBPF based devices controller</a></li> |
| <li><a href="#diff_ver">Running different nodes with different cgroup versions</a></li> |
| <li><a href="#configuration">Configuration</a> |
| <ul> |
| <li><a href="#cgroup_plugin">Cgroup Plugin</a></li> |
| <li><a href="#dev_options">Developer options</a></li> |
| <li><a href="#ignored_params">Ignored parameters</a></li> |
| </ul> |
| </li> |
| <li><a href="#requirements">Requirements</a></li> |
| <li><a href="#pam_slurm_adopt">PAM Slurm Adopt plugin on cgroup v2</a></li> |
| <li><a href="#limitations">Limitations</a></li> |
| </ul> |
| |
| <h2 id="overview">Overview |
| <a class="slurm_link" href="#overview"></a> |
| </h2> |
| |
<p>Slurm provides support for systems with Control Group v2.<br>
Documentation for this cgroup version can be found on kernel.org in the
<a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html">
Control Group v2 documentation</a>.</p>
| |
<p>The <i>cgroup/v2</i> plugin is an internal Slurm API used by other plugins,
like <i>proctrack/cgroup</i>, <i>task/cgroup</i> and
<i>jobacct_gather/cgroup</i>. This document gives an overview of how it
is designed, with the aim of giving a better idea of what happens on the
system when Slurm constrains resources with this plugin.</p>
| |
<p>This document assumes you have read the cgroup v2 kernel documentation
and are familiar with most of its concepts and terminology.
It is equally important to read systemd's
<a href="https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface">
Control Group Interfaces Documentation</a>, since <i>cgroup/v2</i> needs to
interact with systemd and a lot of concepts overlap. Finally, it is
recommended that you understand the concept of
<a href="https://ebpf.io/what-is-ebpf">eBPF technology</a>, since in cgroup v2
the device cgroup controller is eBPF-based.</p>
| |
| <h2 id="conversion">Conversion from cgroup v1 |
| <a class="slurm_link" href="#conversion"></a></h2> |
| |
| <p>Existing Slurm installations may be running with Slurm's cgroup/v1 plugin. |
| Sites that wish to use the new features of cgroup/v2 can convert their nodes |
| to run with cgroup v2 if it is supported by the OS. Slurm supports compute |
| nodes running a mix of cgroup/v1 and cgroup/v2 plugins.</p> |
| |
| <h3 id="reconfigure_systemd">Reconfigure Systemd |
| <a class="slurm_link" href="#reconfigure_systemd"></a></h3> |
| |
<p>In certain circumstances, it may be necessary to make some changes to the
systemd configuration to support cgroup v2. You will need to complete the
procedure in this section if either of these conditions applies:
| <ol> |
| <li>Systemd version is less than 252</li> |
| <li>The file <code>/proc/1/cgroup</code> contains multiple lines or the first |
| line starts with a non-zero value. For example: |
| <ul> |
| <li>Systemd needs to be reconfigured: |
| <pre> |
| 12:cpuset:/ |
| 11:hugetlb:/ |
| 10:perf_event:/ |
| . . . |
| </pre> |
| </li> |
| <li>Ready for cgroup v2 (skip to the <a href="#general_conversion"> |
| next section</a>): |
| <pre> |
| 0::/init.scope |
| </pre> |
| </li> |
| </ul> |
| </li> |
| </ol> |
| </p> |
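<p>As an additional quick check (not part of the official procedure), the
filesystem type mounted at <i>/sys/fs/cgroup</i> reveals which mode the node
is running in:</p>
<pre>
# Prints "cgroup2fs" on a pure cgroup v2 (unified) system,
# and "tmpfs" on a cgroup v1 or hybrid system.
$ stat -fc %T /sys/fs/cgroup
cgroup2fs
</pre>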
| |
| <p>The following procedure will reconfigure such systems for cgroup v2: |
| |
| <ol> |
<li>Set the kernel command line options that switch systemd to cgroup v2 support:
| <pre>systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0 cgroup_no_v1=all</pre> |
| Example commands for <b>Debian</b> based systems: |
| <pre>sed -e 's@^GRUB_CMDLINE_LINUX=@#GRUB_CMDLINE_LINUX=@' -i /etc/default/grub |
| echo 'GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0 cgroup_no_v1=all"' >> /etc/default/grub |
| update-grub</pre> |
| Example command for <b>Red Hat</b> based systems: |
| <pre>grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0 cgroup_no_v1=all"</pre> |
| </li> |
| <li>Reboot to apply new kernel command line options.</li> |
| <li>Verify kernel has correct command line options: |
| <pre>grep -o -e systemd.unified_cgroup_hierarchy=. -e systemd.legacy_systemd_cgroup_controller=. /proc/cmdline |
| systemd.unified_cgroup_hierarchy=1 |
| systemd.legacy_systemd_cgroup_controller=0</pre> |
If the output does not match exactly, then repeat the prior steps
and verify that the kernel is given the correct command line options.
| </li> |
<li>Verify that no <a href="https://docs.kernel.org/admin-guide/cgroup-v1/index.html">cgroup v1</a>
controllers are mounted and <a
href="https://github.com/systemd/systemd/blob/main/docs/CGROUP_DELEGATION.md#three-different-tree-setups-">
that your system is not running in hybrid mode</a>.<br>Example of hybrid mode:
| <pre>$ grep -v ^0: /proc/self/cgroup |
| 8:net_cls,net_prio:/ |
| 6:name=systemd:/</pre> |
If there are any entries, then a reboot is required. If entries remain after
a reboot, then a process is actively creating cgroup v1 mounts and will
need to be stopped.</li>
| </ol> |
| </p> |
| |
| <h3 id="general_conversion">General conversion |
| <a class="slurm_link" href="#general_conversion"></a></h3> |
| |
| <p>The following procedure is required when switching from cgroup v1 to v2: |
| |
| <ol> |
| <li>Modify Slurm configuration to allow cgroup/v2 plugin:<br> |
| <b>/etc/slurm/cgroup.conf</b>: |
| <ul> |
| <li>Remove line starting with:<pre>CgroupAutomount=</pre></li> |
| <li>Remove line starting with:<pre>CgroupMountpoint=</pre></li> |
| <li>Remove line if present:<pre>CgroupPlugin=cgroup/v1</pre></li> |
| <li>Add line:<pre>CgroupPlugin=autodetect</pre></li> |
| </ul> |
| </li> |
| <li>Restart Slurm daemons per normal startup procedure</li> |
| </ol> |
| </p> |
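<p>As an illustration only (a real site will likely have more parameters), a
minimal cgroup.conf after the conversion could look like this:</p>
<pre>
# Example cgroup.conf after converting to cgroup v2
CgroupPlugin=autodetect
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
</pre>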
| |
| <h2 id="v2_rules">Following cgroup v2 rules |
| <a class="slurm_link" href="#v2_rules"></a> |
| </h2> |
<p>The kernel's Control Group v2 has two particularities that affect how Slurm
needs to structure its internal cgroup tree.</p>
| |
| <h3 id="top_down">Top-down Constraint |
| <a class="slurm_link" href="#top_down"></a> |
| </h3> |
<p>Resources are distributed top-down in the tree, so a controller is only
available in a cgroup directory if the parent has it listed in its
<i>cgroup.controllers</i> file and added to its <i>cgroup.subtree_control</i>.
Also, a controller enabled in the subtree cannot be disabled while one or more
children have it enabled. For Slurm, this implies that we need to manage our
hierarchy by modifying <i>cgroup.subtree_control</i> at each level and enabling
the required controllers for the children.</p>
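<p>As a minimal sketch of this top-down management (the scope name below is
purely illustrative), enabling controllers for the children of a delegated
cgroup looks like this:</p>
<pre>
# Controllers this cgroup inherited from its parent:
$ cat /sys/fs/cgroup/system.slice/example.scope/cgroup.controllers
cpuset cpu io memory pids

# Make cpuset, cpu and memory available to the children of this cgroup:
$ echo "+cpuset +cpu +memory" > /sys/fs/cgroup/system.slice/example.scope/cgroup.subtree_control
</pre>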
| |
| <h3 id="no_internal_process">No Internal Process Constraint |
| <a class="slurm_link" href="#no_internal_process"></a> |
| </h3> |
<p>Except for the root cgroup, parent cgroups (really called domain cgroups) can
only enable controllers for their children if they do not have any processes at
their own level. This means we can create a subtree inside a cgroup directory,
but before writing to <i>cgroup.subtree_control</i>, all the pids listed in the
parent's <i>cgroup.procs</i> must be migrated to a child. In practice, all
processes must live in the leaves of the tree, and it is not possible to have
pids in non-leaf directories.</p>
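<p>Continuing the illustrative example above, a pid sitting at the parent level
has to be migrated to a leaf before the parent can enable controllers for its
children:</p>
<pre>
# A process currently lives at the parent level:
$ cat /sys/fs/cgroup/system.slice/example.scope/cgroup.procs
113094

# Create a leaf and migrate the pid there first...
$ mkdir /sys/fs/cgroup/system.slice/example.scope/leaf
$ echo 113094 > /sys/fs/cgroup/system.slice/example.scope/leaf/cgroup.procs

# ...otherwise this write would fail with "Device or resource busy":
$ echo "+memory" > /sys/fs/cgroup/system.slice/example.scope/cgroup.subtree_control
</pre>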
| |
| <h2 id="systemd_rules">Following systemd rules |
| <a class="slurm_link" href="#systemd_rules"></a> |
| </h2> |
| <p>Systemd is currently the most widely used init mechanism. For this reason |
| Slurm needs to find a way to coexist with the rules of systemd. The designers of |
| systemd have conceived a new rule called the "single-writer" rule, which implies |
| that every cgroup has one single owner and nobody else should write to it. Read |
| more about this in <a href="https://systemd.io/CGROUP_DELEGATION">systemd.io |
| Cgroup Delegation Documentation</a>. In practice this means that the systemd |
| daemon, started when the kernel boots and which takes pid 1, will consider |
| itself the absolute owner and single writer of the entire cgroup tree. |
| This means that systemd expects that no other process should be modifying any |
| cgroup directly, nor should another process be creating directories or moving |
| pids around, without systemd being aware of it.</p> |
| |
<p>There's one method that allows Slurm to work without issues, which is to
start Slurm daemons in a systemd <i>Unit</i> with the special systemd option
<i>Delegate=yes</i>. Starting slurmd within such a systemd Unit gives Slurm a
"delegated" cgroup subtree in the filesystem where it is able to create
directories, move pids, and manage its own hierarchy. In practice, what
happens is that systemd registers a new <i>Unit</i> in its internal database and
relates the cgroup directory to it. For any future "intrusive" actions on the
cgroup tree, systemd will then effectively ignore the "delegated" directories.
</p>
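<p>For reference, this is roughly what such a delegated unit looks like. This is
only a sketch, not the exact unit file shipped with Slurm, and the ExecStart
path is an assumption:</p>
<pre>
[Service]
Type=simple
ExecStart=/usr/sbin/slurmd -D
# Hand the cgroup subtree of this unit over to the service itself
Delegate=yes
</pre>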
| |
<p>This is similar to what happened in cgroup v1, since this is not a
kernel rule but a systemd rule. This fact, combined with the new cgroup v2
rules, forces Slurm to choose a design which coexists with both.</p>
| |
| <h3 id="real_sysd_prob">The real problem: systemd + restarting slurmd |
| <a class="slurm_link" href="#real_sysd_prob"></a> |
| </h3> |
<p>When designing the cgroup/v2 plugin for Slurm, the initial idea was to let
slurmd set up the required hierarchy in its own root cgroup directory. It
would create a specific directory for itself and then place jobs and steps in
other corresponding directories. This would guarantee the
<a href="#no_internal_process">no internal process constraint</a> rule.</p>
| |
<p>This worked fine until we needed to restart slurmd. Since the entire
hierarchy was already created starting at the slurmd cgroup, restarting slurmd
would terminate the slurmd process and then start a new one, which would be
put at the root of the original cgroup tree. Since this directory was now what
is called a "domain" cgroup (it contained sub-directories) and not a leaf
anymore, the <a href="#no_internal_process">no internal process constraint</a>
rule would be broken and systemd would fail to start the daemon.</p>
| |
<p>Lacking any mechanism in systemd to tackle this situation, we had no other
choice but to separate slurmd and the forked slurmstepds into separate
subtree directories. Because of systemd's design rule of being the
single writer on the tree, it was not possible to just do a "mkdir" from
slurmd or slurmstepd itself and then move the stepd process into a new and
separate directory; that directory would not be controlled by systemd and
would cause problems.</p>
| |
<p>The only way a "mkdir" could work was if it was done inside a
"delegated" cgroup subtree, so we needed a Unit with "Delegate=yes",
different from the slurmd one, which would guarantee our
independence. In short, we really needed to start a new unit for user jobs.</p>
| |
<p>In systemd there are two types of Units that can get the
"Delegate=yes" parameter and that are directly related to a cgroup directory.
One is a "Service" and the other is a "Scope". We are interested in the "Scope":
<ul>
<li><b>A Systemd Scope:</b> systemd takes a pid as an argument, creates a cgroup
directory and then adds the provided pid to it. The scope will remain
until this pid is gone.</li>
</ul>
<p>It is worth noting that a discussion with the main systemd developers raised
the <i>RemainAfterExit</i> systemd parameter. This parameter is intended to keep
the unit alive even if all the processes in it are gone, but it is only
valid for "Services" and not for "Scopes". It would be a very interesting
option to have if it were also available for Scopes. The systemd developers
stated that its functionality could be extended to not only keep the unit, but
to also keep the cgroup directories until the unit was manually terminated.
Currently, the unit remains alive but the cgroup is cleaned up anyway.
</p>
<p>With all this background, we're ready to show the solution used to make
Slurm avoid the problem of the slurmd restart.</p>
<ul>
<li>Create a new Scope on slurmd startup for hosting new slurmstepd processes.
Slurmd makes one single call to systemd at its <b>first</b> startup, preparing a
scope for future slurmstepd pids; each slurmstepd then moves itself there when
starting. This has no performance impact and conceptually is just like a slower
"mkdir" plus informing systemd, done from slurmd only at the first startup.
Moving processes from one delegated unit to another delegated unit was approved
by the systemd developers. The only downside is that the scope needs processes
inside it or it will terminate and clean up the cgroup, so slurmd needs to
create a "sleep infinity"-like process, which we encoded as the
"slurmstepd infinity" process that will live forever in the scope. In the
future, if the <i>RemainAfterExit</i> parameter is extended to scopes and allows
the cgroup tree to not be destroyed, the need for this infinity process would be
eliminated.
</li>
</ul>
<p>Finally, we ended up separating slurmd from the slurmstepds, using a scope
with the "Delegate=yes" option.</p>
| |
| <h3 id="consequences_nosysd">Consequences of not following systemd rules |
| <a class="slurm_link" href="#consequences_nosysd"></a> |
| </h3> |
<p>There is a known issue where systemd can decide to clean up the cgroup
hierarchy with the intention of making it match its internal database.
For example, if there are no units in the system with "Delegate=yes",
it will go through the tree and possibly deactivate all the controllers which
it thinks are not in use. In our testing we stopped all our units with
"Delegate=yes", issued a "systemctl daemon-reload" or a
"systemctl reset-failed", and witnessed how the <i>cpuset</i> controller
disappeared from our "manually" created directories deep in the cgroup tree.
There are other such situations, and the fact that the systemd developers and
documentation claim that systemd is the unique single writer to the tree, made
SchedMD decide to be on the safe side and have Slurm coexist with systemd.
</p>
<p>It is worth noting that we added <i>IgnoreSystemd</i> and
<i>IgnoreSystemdOnFailure</i> as cgroup.conf parameters, which avoid any
contact with systemd and just use a regular "mkdir" to create the same
directory structures. These parameters are for development and testing
purposes only.</p>
| |
| <h3 id="distro_no_sysd">What happens with Linux distros without systemd? |
| <a class="slurm_link" href="#distro_no_sysd"></a> |
| </h3> |
<p>Slurm does not support them, but they can still work. The only
requirements are to have the libdbus, ebpf and systemd packages installed on
the system in order to compile Slurm. Then you can set the <i>IgnoreSystemd</i>
parameter in cgroup.conf, which makes Slurm create the
<i>/sys/fs/cgroup/system.slice/</i> directory with a regular "mkdir". With
these requirements met, Slurm should work normally.</p>
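<p>A minimal sketch of such a configuration (the plugin could also be left on
autodetect) could be:</p>
<pre>
# cgroup.conf on a node running without systemd
CgroupPlugin=cgroup/v2
IgnoreSystemd=yes
</pre>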
| |
| <h2 id="v2_overview">cgroup/v2 overview |
| <a class="slurm_link" href="#v2_overview"></a> |
| </h2> |
| |
<p>We will briefly explain this plugin's workflow.</p>
| |
| <h3 id="slurmd_startup">slurmd startup |
| <a class="slurm_link" href="#slurmd_startup"></a> |
| </h3> |
<p>Fresh system: slurmd is started. The plugins which use cgroup (proctrack,
jobacct_gather or task) call the init() function of the cgroup/v2 plugin.
slurmd then immediately makes a call to dbus using libdbus and creates
a new systemd "Scope". The scope name is predefined, set from the internal
constant SYSTEM_CGSCOPE and placed under SYSTEM_CGSLICE. It ends up
with the name "slurmstepd.scope" or "nodename_slurmstepd.scope", depending on
whether Slurm is compiled with <i>--enable-multiple-slurmd</i> (which prefixes
the node name) or not. The cgroup directory associated with this scope will be
fixed as: "/sys/fs/cgroup/system.slice/slurmstepd.scope" or
"/sys/fs/cgroup/system.slice/nodename_slurmstepd.scope".
</p>
<p>Since the dbus "StartTransientUnit" call requires a pid as a parameter,
slurmd needs to fork a "slurmstepd infinity" process and use its pid as the
argument.</p>
<p>The call to dbus is asynchronous, so slurmd delivers the message to the dbus
bus and then starts an active wait for the scope directory to show up.
If the directory doesn't show up within a hard-coded timeout, it fails.
Otherwise slurmd continues and creates a directory, called "system", for new
slurmstepds and for the infinity pid inside the recently created scope
directory. It moves the infinity process there and then enables all the
required controllers in the new cgroup directories.
</p>
| <p>As this is a regular systemd Unit, the scope will show up in |
| "systemctl list-unit-files" and other systemd commands, for example:</p> |
| <pre> |
| ]$ systemctl cat gamba1_slurmstepd.scope |
| # /run/systemd/transient/gamba1_slurmstepd.scope |
| # This is a transient unit file, created programmatically via the systemd API. Do not edit. |
| [Scope] |
| Delegate=yes |
| TasksMax=infinity |
| |
| ]$ systemctl list-unit-files gamba1_slurmstepd.scope |
| UNIT FILE STATE VENDOR PRESET |
| gamba1_slurmstepd.scope transient - |
| |
| 1 unit files listed. |
| |
| ]$ systemctl status gamba1_slurmstepd.scope |
| ● gamba1_slurmstepd.scope |
| Loaded: loaded (/run/systemd/transient/gamba1_slurmstepd.scope; transient) |
| Transient: yes |
| Active: active (abandoned) since Wed 2022-04-06 14:17:46 CEST; 2h 47min ago |
| Tasks: 1 |
| Memory: 1.6M |
| CPU: 258ms |
| CGroup: /system.slice/gamba1_slurmstepd.scope |
| └─system |
| └─113094 /home/lipi/slurm/master/inst/sbin/slurmstepd infinity |
| |
| apr 06 14:17:46 llit systemd[1]: Started gamba1_slurmstepd.scope. |
| </pre> |
| |
<p>Another action of slurmd init is to detect which controllers are
available in the system (in /sys/fs/cgroup) and to recursively enable the
needed ones down to its own level. It will also enable them for the recently
created slurmstepd scope.</p>
| |
| <pre> |
| ]$ cat /sys/fs/cgroup/system.slice/gamba1_slurmstepd.scope/cgroup.controllers |
| cpuset cpu io memory pids |
| |
| ]$ cat /sys/fs/cgroup/system.slice/gamba1_slurmstepd.scope/cgroup.subtree_control |
| cpuset cpu memory |
| </pre> |
| |
| <p>If resource specialization is enabled, slurmd will set its memory and/or |
| cpu constraints at its own level too.</p> |
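<p>Resource specialization itself is configured in slurm.conf; a hypothetical
node definition reserving memory and cores for slurmd could look like this:</p>
<pre>
# slurm.conf (node definition); values are purely illustrative
NodeName=node01 CPUs=32 RealMemory=128000 MemSpecLimit=2048 CoreSpecCount=2
</pre>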
| |
| <h3 id="slurmd_restart">slurmd restart |
| <a class="slurm_link" href="#slurmd_restart"></a> |
| </h3> |
<p>Slurmd restarts as usual. When restarted, it detects whether the "scope"
directory already exists and does nothing if it does. Otherwise it will
try to set up the scope again.</p>
| |
| <h3 id="stepd_start">slurmstepd start |
| <a class="slurm_link" href="#stepd_start"></a> |
| </h3> |
<p>When a new step needs to be created, whether part of a new job or part of
an existing job, slurmd forks the slurmstepd process in its own cgroup
directory. slurmstepd immediately starts initializing and (if cgroup plugins
are enabled) infers the scope directory and moves itself into the
"waiting" area, which is the
<i>/sys/fs/cgroup/system.slice/nodename_slurmstepd.scope/system</i> directory.
It then initializes the job and step cgroup directories and moves
itself into them, setting <i>cgroup.subtree_control</i> as required.</p>
| |
| <h3 id="term_clean">Termination and cleanup |
| <a class="slurm_link" href="#term_clean"></a> |
| </h3> |
| <p>When a job ends, slurmstepd will take care of removing all the created |
| directories. The slurmstepd.scope directory will <b>never</b> be removed or |
| stopped by Slurm, and the "slurmstepd infinity" process will never be killed by |
| Slurm.</p> |
| <p>When slurmd ends (since on supported systems it has been started by systemd) |
| its cgroup will just be cleaned up by systemd.</p> |
| |
| <h3 id="manual_startup">Special case - manual startup |
| <a class="slurm_link" href="#manual_startup"></a> |
| </h3> |
| <p>Starting slurmd from systemd creates the slurmd unit with its own cgroup. |
| Then slurmd starts the slurmstepd.scope which in turn creates a new cgroup |
| tree. Any new process spawned for a job is migrated into this scope. If, |
| instead of starting slurmd from systemd, one starts slurmd manually from the |
| command line, things are different. The slurmd will be spawned into the same |
| terminal's cgroup and will share the cgroup tree with the terminal process |
| itself (and possibly with other user processes).</p> |
| |
| <p>This situation is detected by slurmd by reading the <b>INVOCATION_ID</b> |
| environment variable. This variable is normally set by systemd when it starts |
| a process and is a way to determine if slurmd has been started in its own |
| cgroup or started manually into a shared cgroup. In the first case slurmd |
| doesn't try to move itself to any other cgroup. In the second case, where |
| <b>INVOCATION_ID</b> is not set, it will try to move itself to a new |
| subdirectory inside the slurmstepd.scope cgroup.</p> |
| |
| <p>A problem arises when <b>INVOCATION_ID</b> is set in your environment and |
| you try to start slurmd manually. slurmd will think it is in its own cgroup |
| and won't try to migrate itself and, if MemSpecLimit or CoreSpecLimit are set, |
| slurmd will apply memory or core limits into this cgroup, indirectly limiting |
| your terminal or other processes. For example, starting slurmd in your terminal |
| with low memory in MemSpecLimit, sending it to the background, and then trying |
| to run any program that consumes memory, might end up with your processes |
| being OOMed.</p> |
| |
<p>To avoid this situation, we recommend that you unset <b>INVOCATION_ID</b>
before starting Slurm in situations where this environment variable is set.</p>
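<p>For example, when starting slurmd by hand for testing:</p>
<pre>
# Make sure slurmd does not inherit INVOCATION_ID from the current shell
$ unset INVOCATION_ID
$ slurmd -D -v
</pre>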
| |
<p>Another related problem arises when not all controllers are enabled in
your terminal's cgroup, which is what typically happens in the systemd
<i>user.slice</i>. slurmd will then fail to initialize because it won't detect
the required controllers, and will display errors similar to these:</p>
| |
| <pre> |
| ]# slurmd -Dv |
| slurmd: error: Controller cpuset is not enabled! |
| slurmd: error: Controller cpu is not enabled! |
| ... |
| slurmd: slurmd version 23.11.0-0rc1 started |
| slurmd: error: cpu cgroup controller is not available. |
| slurmd: error: There's an issue initializing memory or cpu controller |
| slurmd: error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init() callback failed |
| slurmd: error: cannot create jobacct_gather context for jobacct_gather/cgroup |
| slurmd: fatal: Unable to initialize jobacct_gather |
| </pre> |
| |
<p>One workaround is to set <i>EnableControllers=yes</i> in cgroup.conf, but
note that this won't prevent other processes from hitting OOM errors, as
mentioned previously. Moreover, it will modify your entire cgroup
tree from the root <i>/sys/fs/cgroup</i>. So the real solution is to either
start slurmd through a unit file, or unset the <b>INVOCATION_ID</b>
environment variable.</p>
| |
| <p><b>NOTE</b>: Be aware that this doesn't only happen when starting slurmd |
| manually. It may happen if you use custom scripts to start slurmd, even if the |
| scripts are run with systemd. We encourage you to use our provided |
| slurmd.service file or at least to unset the <b>INVOCATION_ID</b> explicitly |
| in your startup scripts.</p> |
| |
| <h3 id="troubleshooting_startup">Troubleshooting startup |
| <a class="slurm_link" href="#troubleshooting_startup"></a> |
| </h3> |
<p>As the integration with systemd has some degree of complexity, and due to
different configurations or changes in OS setups, we encourage you to set the
following debug flags in slurm.conf in order to diagnose what is going on if
Slurm doesn't start with cgroup/v2:</p>
| <pre> |
| DebugFlags=cgroup |
| SlurmdDebug=debug |
| </pre> |
| |
| <p>If slurmd starts but throws cgroup errors, it is advisable to look at which |
| cgroup slurmd has been started in. For example, this shows slurmd started in |
| the user slice cgroup, which is generally wrong, and has possibly been started |
| manually from the terminal with <b>INVOCATION_ID</b> set:</p> |
| <pre> |
| [root@llagosti ~]# cat /proc/$(pidof slurmd)/cgroup |
| 0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-gnome-tmaster-47247.scope |
| [root@llagosti ~]# grep -i INVOCATION_ID= /proc/47279/environ |
| grep: /proc/47279/environ: binary file matches |
| </pre> |
| |
| <p>Instead, when slurmd is manually and correctly started:</p> |
| <pre> |
| [root@llagosti ~]# cat /proc/$(pidof slurmd)/cgroup |
| 0::/system.slice/gamba1_slurmstepd.scope/slurmd |
| </pre> |
| |
| <p>Finally, if slurmd is started by systemd you should see it living in its own |
| cgroup:</p> |
| <pre> |
| [root@llagosti ~]# cat /proc/$(pidof slurmd)/cgroup |
| 0::/system.slice/slurmd.service |
| </pre> |
| |
| <h2 id="hierarchy_overview">Hierarchy overview |
| <a class="slurm_link" href="#hierarchy_overview"></a> |
| </h2> |
The hierarchy will take this form:
<div class="figure">
<img src="cg_hierarchy.jpg">
| <br> |
| Figure 1. Slurm cgroup v2 hierarchy. |
| </div> |
<p>On the left side we have the slurmd service, started by systemd and living
alone in its own delegated cgroup.</p>
<p>On the right side we see the slurmstepd scope, an also-delegated directory in
the cgroup tree where all slurmstepds and user jobs will reside. A slurmstepd
is initially migrated into the waiting area for new stepds, the <i>system</i>
directory, and immediately after, when it initializes the job hierarchy, it
moves itself into the corresponding <i>job_x/step_y/slurm_processes</i>
directory.
</p>
<p>User processes are spawned by slurmstepd and moved into the appropriate
task directory.</p>
| <p>At this point it should be possible to check which processes |
| are running in a slurmstepd scope by issuing this command:</p> |
| <pre> |
| ]$ systemctl status slurmstepd.scope |
| ● slurmstepd.scope |
| Loaded: loaded (/run/systemd/transient/slurmstepd.scope; transient) |
| Transient: yes |
| Active: active (abandoned) since Wed 2022-04-06 14:17:46 CEST; 2min 47s ago |
| Tasks: 24 |
| Memory: 18.7M |
| CPU: 141ms |
| CGroup: /system.slice/slurmstepd.scope |
| ├─job_3385 |
| │ ├─step_0 |
| │ │ ├─slurm |
| │ │ │ └─113630 slurmstepd: [3385.0] |
| │ │ └─user |
| │ │ └─task_0 |
| │ │ └─113635 /usr/bin/sleep 123 |
| │ ├─step_extern |
| │ │ ├─slurm |
| │ │ │ └─113565 slurmstepd: [3385.extern] |
| │ │ └─user |
| │ │ └─task_0 |
| │ │ └─113569 sleep 100000000 |
| │ └─step_interactive |
| │ ├─slurm |
| │ │ └─113584 slurmstepd: [3385.interactive] |
| │ └─user |
| │ └─task_0 |
| │ ├─113590 /bin/bash |
| │ ├─113620 srun sleep 123 |
| │ └─113623 srun sleep 123 |
| └─system |
| └─113094 /home/lipi/slurm/master/inst/sbin/slurmstepd infinity |
| </pre> |
| <p><b>NOTE</b>: If running on a development system with |
| <i>--enable-multiple-slurmd</i>, the slurmstepd.scope will have the nodename |
| prepended to it.</p> |
| |
| <h2 id="task_level">Working at the task level |
| <a class="slurm_link" href="#task_level"></a> |
| </h2> |
<p>There is a directory called <i>task_special</i> in the user job hierarchy.
The <i>jobacct_gather/cgroup</i> and <i>task/cgroup</i> plugins respectively
gather statistics and constrain resources at the task level. Other plugins like
<i>proctrack/cgroup</i> just work at the step level. To unify the hierarchy and
make it work for all the different plugins, when a plugin asks to add a pid to a
step but not to a task, the pid is put into a special directory called
<i>task_special</i>. If another plugin adds this pid to a task, it is
migrated from there. Normally this happens with the proctrack plugin, when a
call is made to add a pid to a step with <i>proctrack_g_add_pid</i>.</p>
| |
| <h2 id="ebpf_controller">The eBPF based devices controller |
| <a class="slurm_link" href="#ebpf_controller"></a> |
| </h2> |
<p>In Control Group v2, the devices controller interface has been removed.
Instead of controlling devices through files, it is now required to create a
bpf program of type BPF_PROG_TYPE_CGROUP_DEVICE and attach it to the desired
cgroup. This program is created dynamically by slurmstepd and inserted into
the kernel with a bpf() syscall, and it describes which devices are allowed or
denied for the job, step and task.</p>
| <p>The only devices that are managed are the ones described in the |
| gres.conf file.</p> |
| <p>The insertion and removal of such programs will be logged in the system |
| log:</p> |
| <pre> |
| apr 06 17:20:14 node1 audit: BPF prog-id=564 op=LOAD |
| apr 06 17:20:14 node1 audit: BPF prog-id=565 op=LOAD |
| apr 06 17:20:14 node1 audit: BPF prog-id=566 op=LOAD |
| apr 06 17:20:14 node1 audit: BPF prog-id=567 op=LOAD |
| apr 06 17:20:14 node1 audit: BPF prog-id=564 op=UNLOAD |
| apr 06 17:20:14 node1 audit: BPF prog-id=567 op=UNLOAD |
| apr 06 17:20:14 node1 audit: BPF prog-id=566 op=UNLOAD |
| apr 06 17:20:14 node1 audit: BPF prog-id=565 op=UNLOAD |
| </pre> |
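<p>If the <i>bpftool</i> utility is installed, the programs attached to the job
hierarchy can also be inspected directly (the scope path below is just an
example):</p>
<pre>
# List BPF programs attached to each cgroup under the slurmstepd scope
$ bpftool cgroup tree /sys/fs/cgroup/system.slice/slurmstepd.scope
</pre>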
| |
| <h2 id="diff_ver">Running different nodes with different cgroup versions |
| <a class="slurm_link" href="#diff_ver"></a> |
| </h2> |
<p>The cgroup version to be used is entirely dependent on the node. Because of
this, it is possible to run the same job on different nodes with different
cgroup plugins. The configuration is done per node in cgroup.conf.</p>
<p>What cannot be done is to swap the cgroup plugin version in cgroup.conf
without reconfiguring and rebooting the node. Since we do not support "hybrid"
systems with mixed controller versions, a node must be booted with one specific
cgroup version.</p>
| |
| <h2 id="configuration">Configuration |
| <a class="slurm_link" href="#configuration"></a> |
| </h2> |
| <p>In terms of configuration, setup does not differ much from the previous |
| <i>cgroup/v1</i> plugin, but the following considerations must be taken into |
| account when configuring the cgroup plugin in <i>cgroup.conf</i>:</p> |
| |
| <h3 id="cgroup_plugin">Cgroup Plugin |
| <a class="slurm_link" href="#cgroup_plugin"></a> |
| </h3> |
<p>This option allows the sysadmin to specify which cgroup version will be run
on the node. It is recommended to use <i>autodetect</i> and forget about it, but
a specific plugin version can also be forced.</p>
| <p><b>CgroupPlugin=[autodetect|cgroup/v1|cgroup/v2]</b></p> |
| |
| <h3 id="dev_options">Developer options |
| <a class="slurm_link" href="#dev_options"></a> |
| </h3> |
| <ul> |
<li><b>IgnoreSystemd=[yes|no]</b>: This option avoids any call to dbus
for contacting systemd. Instead of requesting the creation of a new scope when
slurmd starts up, it will only use "mkdir" to prepare the cgroup directories for
the slurmstepds. Use of this option on production systems with systemd is not
supported for the reasons mentioned <a href="#consequences_nosysd">above</a>.
This option can be useful for systems without systemd, though.
</li>
<li><b>IgnoreSystemdOnFailure=[yes|no]</b>: This option falls back to manual
mode for creating the cgroup directories without creating a systemd "scope",
but only if a call to dbus has returned an error; it then behaves as if
<b>IgnoreSystemd</b> were set.
</li>
| <li><b>EnableControllers=[yes|no]</b>: When set, slurmd will check all the |
| available controllers in <i>/sys/fs/cgroup/cgroup.controllers</i> and will |
| enable them recursively in the cgroup.subtree_control file until it reaches |
| the slurmd level. This is generally required in RHEL8/Rocky8, some containers, |
| or with systemd < 244. |
| </li> |
| <li><b>CgroupMountPoint=/path/to/mount/point</b>: In most cases with cgroup v2, |
| this parameter should not be used because <i>/sys/fs/cgroup</i> will be the only |
| cgroup directory. |
| </li> |
| </ul> |
| |
| <h3 id="ignored_params">Ignored parameters |
| <a class="slurm_link" href="#ignored_params"></a> |
| </h3> |
<p>Since cgroup v2 no longer provides the swappiness interface in
the memory controller, the following parameter in cgroup.conf will be ignored:
</p>
| <pre> |
| MemorySwappiness= |
| </pre> |
| |
| <h2 id="requirements">Requirements |
| <a class="slurm_link" href="#requirements"></a> |
| </h2> |
| <p>For building <i>cgroup/v2</i> there are two required libraries checked at |
| configure time. Look at your config.log when configuring to see if they |
| were correctly detected on your system.</p> |
| <table style="page-break-inside: avoid; font-family: Arial,Helvetica,sans-serif;" border="1" bordercolor="#000000" cellpadding="3" cellspacing="0" width="100%"> |
| <colgroup> |
| <col width="5%"> |
| <col width="20%"> |
| <col width="20%"> |
| <col width="15%"> |
| <col width="30%"> |
| </colgroup> |
| <tr bgcolor="#e0e0e0"> |
| <td><u><b>Library</b></u></td> |
| <td><u><b>Header file</b></u></td> |
| <td><u><b>Package provides</b></u></td> |
| <td><u><b>Configure option</b></u></td> |
| <td><u><b>Purpose</b></u></td> |
| </tr> |
| <tr> |
| <td>eBPF</td> |
| <td>include/linux/bpf.h</td> |
| <td>kernel-headers (>= 5.7)</td> |
| <td>--with-bpf=</td> |
| <td>Constrain devices to a job/step/task</td> |
| </tr> |
| <tr> |
| <td>dBus</td> |
| <td>dbus-1.0/dbus/dbus.h</td> |
| <td>dbus-devel (>= 1.11.16)</td> |
| <td>n/a</td> |
| <td>dBus API for contacting systemd</td> |
| </tr> |
| </table> |
| <br> |
<p><b>NOTE</b>: On systems without systemd, these libraries are still needed to
compile Slurm. If a different requirement exists, such as building without the
dbus or systemd packages, the configure scripts would have to be modified.
</p>
| |
<p>In order to use <i>cgroup/v2</i>, a valid cgroup namespace, mount namespace
and process namespace, plus their respective mounts, are required. This
typically matters in containerized environments where, depending on the
configuration, namespaces are created but the related mountpoints are not
mounted. This may happen in certain configurations of Docker or Kubernetes.</p>
<p>The default behaviour of Kubernetes has been tested and found to use a
correct cgroup setup compatible with Slurm. Regarding Docker, either use the
host cgroup namespace or create a private one by using
<i>--cgroupns=private</i>. Note that you will need <i>--privileged</i>,
otherwise the container will not have write permissions on the cgroup.
To use the host cgroup namespace, ensure that the container is created
inside a child cgroup; you can specify this mode of operation with the option
<i>--cgroupns=host</i> together with <i>--cgroup-parent</i> to specify the
parent cgroup of the container.</p>
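<p>A hypothetical Docker invocation following these recommendations (the image
name and cgroup parent are placeholders) could be:</p>
<pre>
# Private cgroup namespace, with write access to the cgroup tree
$ docker run --privileged --cgroupns=private my-slurm-node-image

# Or: host cgroup namespace, placing the container in its own child cgroup
$ docker run --privileged --cgroupns=host --cgroup-parent=slurm-container.slice my-slurm-node-image
</pre>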
| |
| <h2 id="pam_slurm_adopt">PAM Slurm Adopt plugin on cgroup v2 |
| <a class="slurm_link" href="#pam_slurm_adopt"></a> |
| </h2> |
| <p>The <a href="pam_slurm_adopt.html">pam_slurm_adopt plugin</a> has a |
| dependency with the API of <i>cgroup/v1</i> because in some situations it relied |
| on the job's cgroup creation time for choosing which job id should be picked to |
| add your sshd pid into. With v2 we wanted to remove this dependency and not |
| rely on the cgroup filesystem, but simply on the job id. This won't guarantee |
| that the sshd session is inserted into the youngest job, but will guarantee it |
| will be put into the largest job id. Thanks to this we removed the dependency of |
| the plugin against the specific cgroup hierarchy. |
| </p> |
| |
| <h2 id="limitations">Limitations |
| <a class="slurm_link" href="#limitations"></a> |
| </h2> |
| <p> |
| The <i>cgroup/v2</i> plugin can provide all the accounting statistics for |
| CPU and Memory that the kernel cgroup interface offers. This does not |
| include virtual memory, so expect a value of 0 for metrics such as <i>AveVMSize, |
| MaxVMSize, MaxVMSizeNode, MaxVMSizeTask</i> and <i>vmem</i> in |
| <i>TRESUsageInTot</i> when <i>jobacct_gather/cgroup</i> is used in combination |
| with <i>cgroup/v2</i>.</p> |
<p>Regarding the resident set size (RSS), this plugin provides the cgroup's
<i>memory.current</i> value from the memory interface, which is not equal to the
RSS value provided by procfs. Nevertheless, it is the same value that the kernel
uses in its OOM killer logic.
</p>
| |
<p>RHEL8 / Rocky8: According to the release notes, support for cgroup v2
started as a technology preview in RHEL 8.0, with the features backported to
the 4.18 kernel. The RHEL 8.2 notes say cgroup v2 is fully supported, but
they warn that not all features are implemented. We recommend
contacting Red Hat for the status of their support for cgroup v2, which should
be tracked in their ticket BZ#1401552. This release also comes with systemd
239, which does not support the cpuset interface.
</p>
| |
<p>Systemd < 244: Prior to this version, systemd did not support the cpuset
controller, and on old kernels the cpu controller is not enabled by default.
The cpu controller can be enabled in system.conf by setting
<code>DefaultCPUAccounting=yes</code>. For the cpuset controller, you need to
set <code>EnableControllers=yes</code> in cgroup.conf.
</p>
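<p>In that scenario, the two settings mentioned above would be (the file
locations are the usual defaults and may differ on your system):</p>
<pre>
# /etc/systemd/system.conf
DefaultCPUAccounting=yes

# /etc/slurm/cgroup.conf
EnableControllers=yes
</pre>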
| <p style="text-align:center;">Last modified 11 October 2024</p> |
| |
| <!--#include virtual="footer.txt"--> |