| # Kernel Module Loading |
| |
The RDMA subsystem relies on the kernel, udev and systemd to load modules on
demand when RDMA hardware is present. It is unique in that the optional RDMA
hardware modules are not loaded unless the system has the rdma-core package
installed.
| |
This avoids enabling RDMA on systems that are not using it, for instance when
a system has a multi-protocol ethernet adapter but is only using the net stack
interface.
| |
| ## Boot ordering with systemd |
| |
systemd assumes everything is hot pluggable and runs in an event-driven
manner. This creates a chain of hot plug events as each part of the system
autoloads based on earlier parts. The first step in the process is udev
loading the physical hardware driver.
| |
| This can happen in several spots along the bootup: |
| |
| - From the initrd or built into the kernel. If hardware modules are present |
| in the initrd then they are loaded into the kernel before booting the |
| system. This is done largely synchronously with the boot process. |
| |
- From udev when it auto-detects PCI hardware or otherwise.
  This happens asynchronously in the boot process; systemd does not wait for
  udev to finish loading modules before it continues on.
| |
| This path makes it very likely the system will experience an RDMA 'hot plug' |
| scenario. |
| |
- From systemd's fixed module loader systemd-modules-load.service, e.g. from
  the list in /etc/modules-load.d/. In this case module loading happens
  synchronously within systemd, and systemd will hold off sysinit.target until
  the modules are loaded (see the sketch after this list).
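
For example, an admin could list an RDMA hardware module in a modules-load.d
fragment. The file name and module chosen below are illustrative only:

```
# /etc/modules-load.d/rdma.conf (example file name)
# Load an RDMA hardware driver at boot; substitute the module for your adapter
mlx5_ib
```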
| |
| Once the hardware module is loaded it may be necessary to load a protocol |
| module, e.g. to enable RDMA support on an ethernet device. |
| |
| This is triggered automatically by udev rules that match the master devices |
| and load the protocol module with udev's module loader. This happens |
| asynchronously to the rest of the systemd startup. |
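
The rules shipped by rdma-core follow a pattern roughly like the sketch below;
the driver and module names here are hypothetical and the real rules differ in
detail:

```
# Illustrative udev rule: when a netdev driven by the hypothetical "foo_en"
# driver appears, ask udev's kmod builtin to load the matching "foo_ib"
# RDMA protocol module.
ACTION=="add", SUBSYSTEM=="net", ENV{ID_NET_DRIVER}=="foo_en", RUN{builtin}+="kmod load foo_ib"
```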
| |
Once an RDMA device is created by the kernel, udev will cause systemd to
schedule ULP module loading services (e.g. rdma-load-modules@.service) specific
to the plugged hardware. If sysinit.target has not yet been passed then these
loaders will defer sysinit.target until they complete; otherwise this is a hot
plug event and things will load asynchronously to the boot up process.
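
The instantiated loader reads a per-protocol list of ULP modules. Assuming the
typical rdma-core layout where each instance reads a module list from
`/etc/rdma/modules/%i.conf`, an admin can add ULPs to that list; the modules
below are just examples:

```
# /etc/rdma/modules/infiniband.conf (assumed default path)
# ULP modules to load whenever InfiniBand hardware is present
ib_ipoib
ib_srp
```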
| |
Finally udev will cause systemd to start RDMA specific daemons like
srp_daemon, rdma-ndd and iwpmd. These daemons are started when the first RDMA
hardware is detected, and they internally handle hot plug events for other
hardware.
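
The mechanism behind this is udev's systemd integration: a device rule can tag
the new RDMA device for systemd and request units via SYSTEMD_WANTS. The rule
below is only an illustrative sketch, not the exact rule shipped by rdma-core:

```
# Illustrative sketch: when an RDMA device appears, let systemd track it and
# pull in the RDMA hardware target, which in turn starts the wanted daemons.
ACTION=="add", SUBSYSTEM=="infiniband", TAG+="systemd", ENV{SYSTEMD_WANTS}+="rdma-hw.target"
```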
| |
| ## Hot Plug compatible services |
| |
| Services using RDMA need to have device specific systemd dependencies in their |
| unit files, either created by hand by the admin or by using udev rules. |
| |
| For instance, a service that uses /dev/infiniband/umad0 requires: |
| |
| ``` |
| After=dev-infiniband-umad0.device |
| BindsTo=dev-infiniband-umad0.device |
| ``` |
| |
These directives ensure the service will not run until the required umad
device appears, and that it will be stopped if the umad device is unplugged.
| |
| This is similar to how systemd handles mounting filesystems and configuring |
| ethernet devices. |
| |
| ## Interaction with legacy non-hotplug services |
| |
Services that cannot handle hot plug must be ordered after
systemd-udev-settle.service, which will wait for udev to complete loading
modules and scheduling systemd services. This ensures that all RDMA hardware
present at boot is set up before proceeding to run the legacy service.
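
A minimal ordering stanza for such a legacy unit could look like this. Note
that systemd-udev-settle.service only runs when something pulls it in, hence
the Wants= line:

```
[Unit]
# Wait for udev to finish processing the coldplug events before starting this
# legacy, non-hotplug-aware service.
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service
```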
| |
Admins using legacy services can also place their RDMA hardware modules
(e.g. mlx4_ib) directly in /etc/modules-load.d/ or in their initrd, which will
cause systemd to defer passing sysinit.target until all RDMA hardware is set
up. This is usually sufficient for legacy services, and is probably the
default behavior in many configurations.
| |
| # Systemd Ordering |
| |
Within rdma-core we have a series of units which run in the pre `basic.target`
world to set up kernel services:
| |
| - `iwpmd` |
| - `rdma-ndd` |
| - `rdma-load-modules@.service` |
| - `ibacmd.socket` |
| |
These special units use DefaultDependencies=no and are ordered before any
other unit that uses DefaultDependencies=yes. This ordering holds even in the
hot plug case.
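
A sketch of the ordering stanza such an early unit typically carries is shown
below; the units actually shipped by rdma-core may contain additional
directives:

```
[Unit]
# Run in early boot, before the ordering that DefaultDependencies=yes would add
DefaultDependencies=no
Before=sysinit.target
# Still stop cleanly as part of the normal shutdown sequence
Conflicts=shutdown.target
Before=shutdown.target
```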
| |
| Units for normal rdma-using daemons should use DefaultDependencies=yes, and |
| either this pattern for 'any RDMA device': |
| |
| ``` |
| [Unit] |
| # Order after rdma-hw.target has become active and setup the kernel services |
| Requires=rdma-hw.target |
| After=rdma-hw.target |
| |
| [Install] |
| # Autostart when RDMA hardware is present |
| WantedBy=rdma-hw.target |
| ``` |
| |
| Or this pattern for a specific RDMA device: |
| |
| ``` |
| [Unit] |
| # Order after RDMA services are setup |
| After=rdma-hw.target |
| # Run only while a specific umad device is present |
| After=dev-infiniband-umad0.device |
| BindsTo=dev-infiniband-umad0.device |
| |
| [Install] |
| # Schedule the unit to be runnable when RDMA hardware is present, but |
| # it will only start once the requested device actually appears. |
| WantedBy=rdma-hw.target |
| ``` |
| |
Note that the above explicitly references `After=rdma-hw.target` even though
all the current constituents of that target order before
`sysinit.target`. This is to provide greater flexibility in the future.
| |
| ## rdma-hw.target |
| |
| This target is Wanted automatically by udev as soon as any RDMA hardware is |
| plugged in or becomes available at boot. |
| |
This may be used to pull in RDMA management daemons dynamically when RDMA
hardware is found. Such daemons should use:
| |
| ``` |
| [Install] |
| WantedBy=rdma-hw.target |
| ``` |
| |
in their unit files.
| |
`rdma-hw.target` also acts as a synchronization point: it is ordered after the
low level, pre-`sysinit.target` RDMA related units, so it only becomes active
once they have been started.
| |
| # Stable names |
| |
The library provides a general utility and udev rule to automatically perform
stable IB device name assignment, so users will always see names based on
topology/GUID information. This naming scheme has the big advantage that the
names are fully automatic and fully predictable, they stay fixed even if
hardware is added or removed (i.e. no re-enumeration takes place), and broken
hardware can be replaced seamlessly.
| |
The name is a combination of the link type (InfiniBand, RoCE, iWARP, OPA or
USNIC) and the chosen naming policy: NAME_KERNEL, NAME_PCI, NAME_GUID,
NAME_ONBOARD or NAME_FALLBACK. These naming policies are controlled by a udev
rule and can be overridden by placing your own rename policy udev rules in the
/etc/udev/rules.d/ directory (see the example rule after the list below).
| |
* NAME_KERNEL - don't change names and rely on the kernel assignment. This
  will keep RDMA names as before. Example: "mlx5_0".
* NAME_PCI - read the PCI location and topology as the source for stable names,
  which won't change on any software event (reset, PCI probe, etc.).
  Example: "ibp0s12f4".
* NAME_GUID - read the node GUID information, in a similar manner to the
  netdev MAC naming policy. Example: "rocex525400c0fe123455".
* NAME_ONBOARD - read the firmware/BIOS provided index numbers for on-board
  devices. Example: "ibo3".
* NAME_FALLBACK - automatic fallback: NAME_ONBOARD -> NAME_PCI -> NAME_KERNEL
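
As an example of overriding the policy, a rule can be dropped into
/etc/udev/rules.d/ that renames new RDMA devices with the desired scheme. The
sketch below assumes the rdma_rename helper shipped with rdma-core and simply
selects NAME_GUID instead of the default; the file name is only a suggestion:

```
# /etc/udev/rules.d/60-rdma-persistent-naming.rules (example override)
# Rename every new RDMA device according to the NAME_GUID policy.
ACTION=="add", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_GUID"
```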
| |
No doubt the new names are harder to read than the "mlx5_0" everybody is used
to, but consistency in scripts is much more important.
| |
| There is a distinction between real devices and virtual ones like RXE or SIW. |
| For real devices, the naming policy is NAME_FALLBACK, while virtual devices keep |
| their kernel name. |
| |
Similar to netdev naming, the NAME_GUID scheme does not participate in the
fallback mechanism and needs to be enabled explicitly by the user.
| |
Types of names:
| |
| * o<index> - on-board device index number |
| * s<slot>[f<function>] - hotplug slot index number |
| * x<GUID> - Node GUID |
| * [P<domain>]p<bus>s<slot>[f<function>] - PCI geographical location |
| |
| Notes: |
| |
| * All multi-function PCI devices will carry the [f<function>] number in the |
| device name, including the function 0 device. |
* When using PCI geography, the PCI domain is only prepended when it is not 0.
| * SR-IOV virtual devices are named based on the name of the parent interface, |
| with a suffix of "v<N>", where <N> is the virtual device number. |
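
As a worked example (the device and address are chosen purely for
illustration), a RoCE port at PCI address 0000:03:00.1 on a multi-function
adapter named under NAME_PCI combines the "roce" link type prefix with the
PCI geography:

```
# domain 0000 is omitted, bus 03 -> p3, slot 00 -> s0, function 1 -> f1
rocep3s0f1
```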