INTERFACE DIFFERENCES BETWEEN V1 and V2
---------------------------------------

Old Interface                       Need to work with
'''''''''''''                       '''''''''''''''''
memory.limit_in_bytes               memory.max
memory.soft_limit_in_bytes          memory.high
memory.memsw_limit_in_bytes         memory.swap.max
memory.swappiness                   none
freezer.state                       cgroup.freeze
cpuset.expected_usage_in_bytes      none (HAVE_NATIVE_CRAY only)
cpuset.cpus                         cpuset.cpus.effective and cpuset.cpus
cpuset.mems                         cpuset.mems.effective and cpuset.mems
cpuacct.stat                        cpu.stat
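
For example, where with v1 one would write a memory limit like this (the
job directory below is only an illustrative layout, not the exact path
Slurm uses):

]# echo 4294967296 > /sys/fs/cgroup/memory/slurm/job_123/memory.limit_in_bytes

with v2 and a unified mount the equivalent write is:

]# echo 4294967296 > /sys/fs/cgroup/slurm/job_123/memory.max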

OTHER CHANGES
-------------
- The freezer controller is always implicit in the cgroup.freeze interface.
- The devices controller is now an eBPF program.
- We now always work at the task level, even if no jobacctgather/cgroup plugin
  is set. Adding a pid to a step adds it to a special task directory,
  "task_special".
- The memory.stat file has changed, and we now compute the sum of
  anon+swapcached+anon_thp, which should be equivalent to the rss concept in
  v1 (see the first sketch after this list). There are many other
  possibilities that could be the object of future study, such as using
  memory.current or PSI metrics.
- The cpu.stat interface provides metrics in microseconds (the *_usec fields)
  while cpuacct.stat provided metrics in USER_HZ ticks (see the second sketch
  after this list). New logic in commit:
  "Recognize different cpu accounting units"
- The slurmstepd daemons are put into their own cgroup, but they are not
  constrained by the step limits. The memory used by the step is accounted
  globally as part of the job. A slurmstepd can sometimes grow in memory, for
  example during pmi initialization when the user initializes many ranks in an
  mpi job and the mpi stack consumes memory. There are pros and cons to
  accounting the slurmstepd consumption as part of the step vs. part of the
  job. On one hand it seems reasonable to compute it as part of the job,
  because the profiling will be more consistent regardless of changes in
  slurmstepd. On the other hand, an uncontrolled slurmstepd can terminate the
  job completely instead of only the step. In both cases, if user processes
  consume too much memory and cause the OOM killer to act in this cgroup, it
  can also kill slurmstepd, which would prevent it from doing the proper
  cleanup of the job. This is worked around as in v1 by setting oom_score_adj
  to -1000 to make slurmstepd unkillable (see the third sketch after this
  list). Using -999 instead could be discussed, to make it killable but less
  likely to be chosen.
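
As a first sketch, the v1-like rss value can be recomputed from memory.stat
with awk; the field names are the ones from the kernel's memory.stat, and the
cgroup path is just an example:

]$ awk '$1 == "anon" || $1 == "swapcached" || $1 == "anon_thp" { sum += $2 }
        END { print sum }' /sys/fs/cgroup/slurm/job_123/memory.stat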
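
As a second sketch, both accounting units can be normalized to seconds from
the shell, assuming the standard field layouts and a v1 cpuacct mount at the
usual place (cpuacct.stat reports USER_HZ ticks, cpu.stat reports *_usec
fields in microseconds):

]$ awk -v hz=$(getconf CLK_TCK) '{ print $1, $2 / hz, "s" }' \
     /sys/fs/cgroup/cpuacct/cpuacct.stat
]$ awk '/_usec/ { print $1, $2 / 1000000, "s" }' /sys/fs/cgroup/slurm/cpu.stat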
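
As a third sketch, the oom_score_adj workaround looks like this from the shell
(slurmstepd applies it internally; the pgrep selection is only for
illustration):

]# stepd=$(pgrep -f slurmstepd | head -n 1)
]# echo -1000 > /proc/$stepd/oom_score_adj
]# cat /proc/$stepd/oom_score_adj
-1000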

HOW TO START SLURM
------------------

You can start slurmctld and slurmdbd as usual.

slurmd needs to be started through systemd because the cgroup v2 API has
changed and the kernel delegates control of the cgroup tree to a single
writer, which is PID 1, the systemd pid. For other pids to be allowed to
manage parts of the tree, they must be started by systemd itself with
Delegate=yes set in their unit. Then systemd won't touch this subtree.
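
One can verify the delegation with systemctl; "Delegate" and
"DelegateControllers" are standard systemd unit properties, and the output
below is illustrative:

]$ systemctl show -p Delegate -p DelegateControllers slurmd-master-gamba1
Delegate=yes
DelegateControllers=cpu cpuset io memory pids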

But a more important reason is that if we e.g. start slurmd from our terminal,
slurmd itself will reside in the same cgroup as the terminal, for example on
Gnome:

/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-fbd45fd9-2c1f-4dce-b177-da71ad46be3b.scope

Then, if slurmd tries to modify the cgroup hierarchy under this tree, it will
find pids other than itself in cgroup.procs, and all changes will affect these
other pids and vice versa. So, for example, creating a subdirectory, attaching
itself to this new directory and then changing subtree_control for the parent
will fail: subtree_control cannot be set if there are processes at this level.
In that case slurmd would have to decide what to do with these unrelated pids
(gnome-terminal or whatever), which is not its responsibility.
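
One can check where a hand-started slurmd ended up by looking at its cgroup
membership (reusing the Gnome example path from above):

]$ cat /proc/$(pgrep -x slurmd)/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-fbd45fd9-2c1f-4dce-b177-da71ad46be3b.scope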

Another problem can happen if we have CoreSpec* or MemSpec* set. slurmd will
then constrain these values in its cgroup, also affecting the unrelated pids.

One way to work around this would be for slurmd's cgroupv2 plugin to create a
new cgroup and attach itself there, but for this it needs to talk to systemd
to unregister its pid from the current unit; otherwise systemd's accounting
will no longer match reality, with unforeseeable consequences, like systemd
reclaiming this pid back.

It is also important to set the proper limits for slurmd; especially for
cgroup v2, a low MEMLOCK limit can prevent the eBPF program for the devices
controller from being loaded into the kernel.

Here is an example unit file for one node called "gamba1":

]$ systemctl cat slurmd-master-gamba1.service
# /usr/lib/systemd/system/slurmd-master-gamba1.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/home/lipi/slurm/master/inst/sbin/slurmd -D -s -N gamba1 $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity

[Install]
WantedBy=multi-user.target

.-.-.-.-
For testing, one can create a template unit file:

]$ cat ../slurmd-master@.service
[Unit]
Description=Slurm node daemon %i
After=munge.service network.target remote-fs.target

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/home/lipi/slurm/master/inst/sbin/slurmd -D -s -N %i $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity

[Install]
WantedBy=multi-user.target
.-.-.-
One can then start an instance with "systemctl start slurmd-master@gamba1".

Even more conveniently, slurmd can be started as a transient unit with
systemd-run. The interaction can look like this:

]# systemd-run -G -p Delegate=yes -p LimitMEMLOCK=infinity .... -u sgamba1 slurmd -D -s -N gamba1
]# systemctl status sgamba1
]# systemctl stop sgamba1

DEVELOPER NOTES
---------------
- Systemd has three kinds of cgroup hierarchies, namely the legacy, hybrid and
  unified cgroup hierarchies.

- All controllers which support v2 and are not bound to a v1 hierarchy are
  automatically bound to the v2 hierarchy and show up at the root. Controllers
  which are not in active use in the v2 hierarchy can be bound to other
  hierarchies. This allows mixing the v2 hierarchy with the legacy v1 multiple
  hierarchies in a fully backward compatible way. In any case we are not going
  to support hybrid (mixed) hierarchies, because they have no future, are not
  supported by much software, and are simply not convenient.

- cgroup.procs contains the list of pids for this cgroup, but it is not 100%
  reliable for a pid count:

  The same PID may show up more than once if the process got moved to another
  cgroup and then back, or if the PID got recycled while reading.
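
  For a robust count, deduplicate the list before counting; a minimal sketch
  (the path is just an example):

  sort -u /sys/fs/cgroup/slurm/job_123/cgroup.procs | wc -l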

- "/proc/$PID/cgroup" lists a process's cgroup membership. On v1 this contains
  multiple lines, one for each hierarchy. In v2 the entry always takes the
  format "0::$PATH".

- In v2, controllers which support thread mode are called threaded
  controllers. The ones which don't are called domain controllers. Threaded
  controllers allow thread granularity. See the "cgroup.type" file.

- Each non-root cgroup has a "cgroup.events" file which contains a "populated"
  field indicating whether the cgroup's sub-hierarchy has live processes in
  it. poll and [id]notify events are triggered when the value changes. This
  can be used, for example, to start a clean-up operation after all processes
  of a given sub-hierarchy have exited.
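
  A minimal sketch of such a trigger, using inotifywait from the
  inotify-tools package (the cgroup path is just an example):

  ]# until grep -q "populated 0" /sys/fs/cgroup/slurm/job_123/cgroup.events; do
         inotifywait -qq -e modify /sys/fs/cgroup/slurm/job_123/cgroup.events
     done; echo "sub-hierarchy empty, safe to clean up"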

- There are no mount options for a cgroup2 mountpoint.

- No controller is enabled by default. Controllers can be enabled and disabled
  by writing to the "cgroup.subtree_control" file:

  echo "+cpu +memory -io" > cgroup.subtree_control

- Only controllers which are listed in "cgroup.controllers" can be enabled.

- Non-root cgroups can distribute domain resources to their children only when
  they don't have any processes of their own. So, in step_0->task_0, there
  couldn't be processes in step_0, only in the leaf task_0.

  Note that the restriction doesn't get in the way if there is no enabled
  controller in the cgroup's "cgroup.subtree_control". This is important, as
  otherwise it wouldn't be possible to create children of a populated cgroup.
  To control resource distribution of a cgroup, the cgroup must create
  children and transfer all its processes to the children before enabling
  controllers in its "cgroup.subtree_control" file.

  This also means that a cgroup which has "cgroup.subtree_control" enabled is
  not intended to be a *leaf*, so it cannot host processes: if one tries to
  attach a pid to its cgroup.procs, the write will fail with EBUSY.
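
  The constraint can be observed from the shell (using the slurm cgroup from
  the MANUAL EXAMPLES section below; the exact error string may vary, but on
  the systems we tried it is EBUSY):

  ]# mkdir /sys/fs/cgroup/slurm/demo
  ]# echo $$ > /sys/fs/cgroup/slurm/demo/cgroup.procs
  ]# echo "+memory" > /sys/fs/cgroup/slurm/demo/cgroup.subtree_control
  bash: echo: write error: Device or resource busy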

- A cgroup can be delegated in two ways.

  First, to a less privileged user, by granting write access to the directory
  and its "cgroup.procs", "cgroup.threads" and "cgroup.subtree_control" files.

  Second, via the "nsdelegate" mount option, which automatically delegates to
  a cgroup namespace on namespace creation.

  A delegated sub-hierarchy is contained in the sense that processes can't be
  moved into or out of the sub-hierarchy by the delegatee.
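
  A sketch of the first method, delegating a new subtree to user "lipi" (the
  directory name is arbitrary):

  ]# mkdir /sys/fs/cgroup/delegated
  ]# chown lipi /sys/fs/cgroup/delegated \
                /sys/fs/cgroup/delegated/cgroup.procs \
                /sys/fs/cgroup/delegated/cgroup.threads \
                /sys/fs/cgroup/delegated/cgroup.subtree_control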

- cgroup v2 doesn't have the devices controller; it uses an eBPF-based device
  controller instead (since kernel 4.15). It needs privileged containers
  (root).

  eBPF stands for extended Berkeley Packet Filter.

  https://speakerdeck.com/kentatada/cgroup-v2-internals?slide=5
  https://medium.com/nttlabs/cgroup-v2-596d035be4d7

- https://systemd.io/CGROUP_DELEGATION/

- If we want OOM to kill an entire step, we can set memory.oom.group = 1 in
  the step cgroup. This is an option for the future, if we want to implement
  this possibility.
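
  A sketch of what that would look like, with an illustrative step path:

  ]# echo 1 > /sys/fs/cgroup/slurm/job_123/step_0/memory.oom.group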

- Enabling the cpu controller may fail with an access error:

  echo "+cpu" > cgroup.subtree_control
  bash: cgroup.subtree_control: Access Denied.

  See https://www.kernel.org/doc/Documentation/cgroup-v2.txt:

  WARNING: cgroup2 doesn't yet support control of realtime processes and
  the cpu controller can only be enabled when all RT processes are in
  the root cgroup. Be aware that system management software may already
  have placed RT processes into nonroot cgroups during the system boot
  process, and these processes may need to be moved to the root cgroup
  before the cpu controller can be enabled.

  Use this to see the running RT apps:
  ps ax -L -o 'pid tid cls rtprio comm' | grep RR
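
  And, if it is safe on the system in question, a minimal sketch for moving
  those processes to the root cgroup (in v2, migrating a pid migrates all of
  its threads; FF is included since SCHED_FIFO tasks are realtime too):

  ps ax -L -o 'pid cls' | awk '$2 == "RR" || $2 == "FF" { print $1 }' |
      sort -u | while read pid; do echo $pid > /sys/fs/cgroup/cgroup.procs; done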


MANUAL EXAMPLES
---------------

1. Check the available controllers in the root:

   cat /sys/fs/cgroup/cgroup.controllers
   cpuset cpu io memory hugetlb pids

2. Check/enable subtree_control for the required ones:

   cat /sys/fs/cgroup/cgroup.subtree_control
   memory pids
   echo "+cpuset +cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
   cat /sys/fs/cgroup/cgroup.subtree_control
   cpuset cpu memory pids

3. Create the slurm dir and enable subtree_control in it too:

   mkdir /sys/fs/cgroup/slurm
   cat /sys/fs/cgroup/slurm/cgroup.subtree_control
   <empty>
   echo "+cpuset +cpu +memory +pids" > /sys/fs/cgroup/slurm/cgroup.subtree_control
   cat /sys/fs/cgroup/slurm/cgroup.subtree_control
   cpuset cpu memory pids
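
4. As an extra illustrative step (the directory names are just an example,
   not the exact layout Slurm creates), build a job/step/task hierarchy that
   respects the no-internal-process rule and attach a shell to the leaf:

   mkdir /sys/fs/cgroup/slurm/job_123
   echo "+memory +pids" > /sys/fs/cgroup/slurm/job_123/cgroup.subtree_control
   mkdir /sys/fs/cgroup/slurm/job_123/step_0
   echo "+memory +pids" > /sys/fs/cgroup/slurm/job_123/step_0/cgroup.subtree_control
   mkdir /sys/fs/cgroup/slurm/job_123/step_0/task_0
   echo $$ > /sys/fs/cgroup/slurm/job_123/step_0/task_0/cgroup.procs
   echo 1073741824 > /sys/fs/cgroup/slurm/job_123/step_0/memory.max

   Note that the shell is attached only to the leaf task_0, and that
   subtree_control is enabled in each level before any process lives there.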