| SCSI RDMA Protocol (SRP) Target driver for Linux |
| ================================================= |
| |
| The SRP target driver has been designed to work on top of the Linux RDMA |
| kernel drivers -- either the RDMA drivers included with a Linux distribution |
| or the OFED RDMA drivers. For more information about using the SRP target |
| driver in combination with OFED, see also README.ofed. |
| |
| The SRP target driver has been implemented as an SCST driver. This |
| makes it possible to support a lot of I/O modes on real and virtual |
| devices. A few examples of supported device handlers are: |
| |
| 1. scst_disk. This device handler implements transparent pass-through |
| of SCSI commands and allows SRP to access and to export real |
| SCSI devices, i.e. disks, hardware RAID volumes, tape libraries |
| as SRP LUNs. |
| |
| 2. scst_vdisk, either in fileio or in blockio mode. This device handler |
| allows to export software RAID volumes, LVM volumes, IDE disks, and |
| normal files as SRP LUNs. |
| |
| 3. nullio. The nullio device handler allows to measure the performance |
| of the SRP target implementation without performing any actual I/O. |
| |
| |
| Installation |
| ------------ |
| |
| Building and installing the SRP target driver is possible as follows: |
| |
| cd ${SCST_DIR} |
| if type -p rpm >/dev/null; then |
| make -s rpm |
| sudo rpm -U rpmbuilddir/RPMS/*/*rpm scstadmin/rpmbuilddir/RPMS/*/*rpm |
| else |
| make -s scst_clean srpt_clean scst srpt scstadmin |
| sudo make -s scst_install srpt_install scstadm_install |
| fi |
| |
| The ib_srpt kernel module supports the following parameters: |
| |
| * rdma_cm_port (number) |
| A 16-bit number that specifies the port number to be registered via the |
| RDMA/CM. Must be specified to make communication over RoCE or iWARP |
| possible. If this parameter is zero (the default value) the SRP target |
| driver does not register with the RDMA/CM. |
| * srp_max_req_size (number) |
| Maximum size of an SRP control message in bytes. Examples of SRP control |
| messages are: login request, logout request, data transfer request, ... |
| The larger this parameter, the more scatter/gather list elements can be |
| sent at once. Use the following formula to compute an appropriate value |
| for this parameter: 68 + 16 * (sg_tablesize). The default value of |
| this parameter is 4148, which corresponds to an sg table size of 255. |
| * srp_max_rsp_size (number) |
| Maximum size of an SRP response message in bytes. Sense data is sent back |
| via these messages towards the initiator. The default size is 256 bytes. |
| With this value there remains (256-36) = 220 bytes for sense data. |
| * srp_max_rdma_size (number) |
| Maximum number of bytes that may be transferred at once via RDMA. Defaults |
| to 65536 bytes, which is sufficient to use the full bandwidth of low-latency |
| HCAs. Increasing this value may decrease latency for applications |
| transferring large amounts of data at once. |
| * srpt_srq_size (number, default 4095) |
| ib_srpt uses a shared receive queue (SRQ) for processing incoming SRP |
| requests. This number may have to be increased when a large number of |
| initiator systems is accessing a single SRP target system. |
| * srpt_sq_size (number, default 256) |
| Per-channel InfiniBand send queue size. Depending on the queue depth, |
| changing this parameter to a smaller value may cause RDMA requests to be |
| retried and hence may slow down data transfer severely. |
| * trace_flag (unsigned integer, only available in debug builds) |
| The individual bits of the trace_flag parameter define which categories of |
| trace messages should be sent to the kernel log and which ones not. |
| |
| |
| Configuring the SRP Target System |
| --------------------------------- |
| |
| When using RoCE or iWARP the first step is to enable support for these |
| protocols in the target driver by setting the rdma_cm_port kernel module |
| parameter to a non-zero value. An example: |
| |
| echo options ib_srpt rdma_cm_port=5000 > /etc/modprobe.d/ib_srpt.conf |
| |
| Next, create the file /etc/scst.conf. You can create this file with |
| the scstadmin tool as follows: |
| |
| /etc/init.d/scst stop |
| /etc/init.d/scst start |
| |
| Now configure SCST using scstadmin - see also the scstadmin documentation for |
| further information. Once finished, save the configuration to /etc/scst.conf: |
| |
| scstadmin -write_config /etc/scst.conf |
| |
| One can verify the contents of scst.conf e.g. as follows: |
| |
| cat /etc/scst.conf |
| |
| Now verify that loading the configuration from file works correctly: |
| |
| /etc/init.d/scst reload |
| |
| Note: when using InfiniBand loading the ib_ipoib kernel module and assigning |
| an IP address to each IPoIB interface is only needed when using the RDMA/CM. |
| When using the IB/CM however, it is allowed but not necessary to load the |
| ib_ipoib kernel module. |
| |
| |
| Configuring the SRP Initiator System |
| ------------------------------------ |
| |
| First of all, load the SRP kernel module as follows: |
| |
| modprobe ib_srp |
| |
| Next, when using InfiniBand, discover the new SRP target by running the |
| srp_daemon command: |
| |
| for d in /dev/infiniband/umad*; do srp_daemon -oacd$d; done |
| |
| If you want to let the initiator system log in to all SRP targets available |
| in the same InfiniBand subnet that is possible as follows (-e = execute): |
| |
| for d in /dev/infiniband/umad*; do srp_daemon -oecd$d; done |
| |
| If you want to let the initiator log in to a specific target you can do that |
| e.g. as follows: |
| |
| echo "id_ext=0002c903000f1366,ioc_guid=0002c903000f1366,dgid=fe800000000000000002c903000f1367,pkey=ffff,service_id=0002c903000f1366" > /sys/class/infiniband_srp/${SRP_HCA_NAME}/add_target; done |
| |
| The meaning of the parameters in the above command is as follows: |
| * id_ext: must match ioc_guid. |
| * ioc_guid: see also the documentation of the ib_srpt ioc_guid parameter. |
| * dgid: target HCA port GID to connect to. |
| * pkey: IB partition key (P_Key) of the target to connect to. |
| * service_id: must match ioc_guid. |
| |
| When using RoCE or iWARP, log in to the target system to determine the id_ext |
| and ioc_guid parameters and use these to log in. An example: |
| |
| [ target system ] |
| # sed 's/tid_ext=/id_ext=/;s/,\(pkey\|dgid\|service_id\)=[^,]*//g' $(find /sys/kernel/scst_tgt/targets/ib_srpt -name login_info) | uniq |
| id_ext=0002c90300a34270,ioc_guid=0002c90300a34270 |
| |
| [ initiator system ] |
| echo dest=192.168.5.1:5000,id_ext=0002c90300a34270,ioc_guid=0002c90300a34270 |
| >/sys/class/infiniband_srp/srp-mlx4_0-1/add_target |
| echo dest=192.168.6.1:5000,id_ext=0002c90300a34270,ioc_guid=0002c90300a34270 |
| >/sys/class/infiniband_srp/srp-mlx4_0-2/add_target |
| |
| Initiator port GIDs can be queried e.g. via sysfs: |
| |
| $ for f in /sys/devices/*/*/*/infiniband/*/ports/*/gids/0; do echo $f; \ |
| cat $f | sed 's/://g'; done |
| /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/infiniband/mlx4_0/ports/1/gids/0 |
| fe800000000000000002c9030005f34b |
| /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/infiniband/mlx4_0/ports/2/gids/0 |
| fe800000000000000002c9030005f34c |
| /sys/devices/pci0000:00/0000:00:1c.0/0000:05:00.0/infiniband/mlx4_1/ports/1/gids/0 |
| fe800000000000000002c9030003cca7 |
| /sys/devices/pci0000:00/0000:00:1c.0/0000:05:00.0/infiniband/mlx4_1/ports/2/gids/0 |
| fe800000000000000002c9030003cca8 |
| |
| Finally run lsscsi to display the details of the newly discovered SCSI disks: |
| |
| lsscsi |
| |
| SRP targets can be recognized in the output of lsscsi by looking for |
| the disk names assigned on the SCST target ("disk01" in the example below): |
| |
| [8:0:0:0] disk SCST_FIO disk01 102 /dev/sdb |
| |
| |
| Target names |
| ------------ |
| |
| The name assigned by the ib_srpt target driver to an SCST target is either |
| ib_srpt_target_<n>, the node GUID of a HCA in hexadecimal form with a colon |
| after every fourth digit or the port GID with a colon afer every fourth |
| digit. The HCA node GUID and the port GIDs can be obtained via the |
| ibv_devinfo command. An example: |
| |
| # ibv_devinfo -v | grep -E '[^a-z]port:|guid|GID' |
| node_guid: 0002:c903:0005:f34e |
| sys_image_guid: 0002:c903:0005:f351 |
| port: 1 |
| GID[0]: fe80:0000:0000:0000:0002:c903:0005:f34f |
| port: 2 |
| GID[0]: fe80:0000:0000:0000:0002:c903:0005:f350 |
| |
| Once the ib_srpt driver has been loaded the available SCST targets can be |
| queried as follows: |
| |
| # (cd /sys/kernel/scst_tgt/targets/ib_srpt && ls -d [0-9a-f]*) |
| fe80:0000:0000:0000:0002:c903:0005:f34f |
| fe80:0000:0000:0000:0002:c903:0005:f350 |
| |
| |
| Session names |
| ------------- |
| |
| The ib_srpt target driver uses the source port GID as session name. |
| |
| An example: |
| |
| [ INITIATOR ] |
| |
| $ for f in /sys/devices/*/*/*/infiniband/*/ports/*/gids/0; do echo |
| f; cat $f; done |
| /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/infiniband/mlx4_0/ports/1/gids/0 |
| fe80:0000:0000:0000:0002:c903:0005:f34b |
| /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/infiniband/mlx4_0/ports/2/gids/0 |
| fe80:0000:0000:0000:0002:c903:0005:f34c |
| /sys/devices/pci0000:00/0000:00:1c.0/0000:05:00.0/infiniband/mlx4_1/ports/1/gids/0 |
| fe80:0000:0000:0000:0002:c903:0003:cca7 |
| /sys/devices/pci0000:00/0000:00:1c.0/0000:05:00.0/infiniband/mlx4_1/ports/2/gids/0 |
| fe80:0000:0000:0000:0002:c903:0003:cca8 |
| |
| [ TARGET, after login ] |
| |
| $ (cd /sys/kernel/scst_tgt/targets/ib_srpt/[0-9a-f]* && ls -d sessions/*) |
| sessions/fe80:0000:0000:0000:0002:c903:0003:cca7 |
| sessions/fe80:0000:0000:0000:0002:c903:0003:cca8 |
| sessions/fe80:0000:0000:0000:0002:c903:0005:f34b |
| sessions/fe80:0000:0000:0000:0002:c903:0005:f34c |
| |
| |
| LUN masking |
| ----------- |
| |
| In a straightforward configuration every LUN is visible to every initiator. |
| It is possible however to make a different set of LUNs visible to each |
| initiator by using the LUN masking feature of SCST. SRP initiators are |
| identified by their session name (see above). An example of an scst.conf |
| file using LUN masking for ib_srpt: |
| |
| TARGET_DRIVER ib_srpt { |
| TARGET fe80:0000:0000:0000:0002:c903:0005:f34b { |
| enabled 1 |
| rel_tgt_id 1 |
| |
| # LUNs visible by all initiators not listed below |
| LUN 0 disk01 |
| |
| GROUP grp1 { |
| # LUNs visible by initiator system 1 |
| LUN 0 disk02 |
| |
| INITIATOR fe80:0000:0000:0000:0002:c903:0005:f34b |
| } |
| |
| GROUP grp2 { |
| # LUNs visible by initiator system 2 |
| LUN 0 disk03 |
| |
| INITIATOR fe80:0000:0000:0000:0002:c903:0005:f34c |
| } |
| } |
| } |
| |
| |
| Adding and Removing LUNs Dynamically |
| ------------------------------------ |
| |
| It is possible to add and/or remove LUNs on the target without restarting |
| target or initiator. This can be done either via scstadmin or directly via the |
| sysfs interface. Although the SCST core will notify the initiator about LUN |
| changes, Linux initiators will ignore these notifications. In order to bring a |
| Linux initiator again in sync after a LUN change, the initiator has to be told |
| to rescan SCSI devices. Rescanning SCSI devices is e.g. possible via the |
| rescsan-scsi-bus.sh script that can be found here: |
| http://www.garloff.de/kurt/linux/#rescan-scsi. An example: |
| $ rescan-scsi-bus --hosts=${srp_host_id} --channels=0 --ids=0 --luns=0-31 |
| |
| |
| InfiniBand Partitions |
| --------------------- |
| |
| Just like a VLAN allows to segment traffic on an Ethernet network partitions |
| allow to segment traffic on an InfiniBand network. Each InfiniBand partition |
| is identified by a partition key which is a 16-bit number. During fabric |
| initialization the subnet manager assigns one or more partition keys to |
| each InfiniBand port. For opensm partitions are defined in |
| /etc/opensm/partitions.conf. ib_srpt uses the partition with index 0. Which |
| partition key corresponds to index 0 can be found out by querying sysfs: |
| |
| $ head /sys/class/infiniband/*/ports/*/pkeys/0 |
| ==> /sys/class/infiniband/mlx4_0/ports/1/pkeys/0 <== |
| 0xffff |
| |
| ==> /sys/class/infiniband/mlx4_0/ports/2/pkeys/0 <== |
| 0xffff |
| |
| |
| High availability |
| ----------------- |
| |
| If there are redundant paths in the IB network between initiator and target, |
| automatic path failover can be set up on the initiator as follows: |
| * Edit /etc/infiniband/openib.conf to load the SRP driver and SRP HA daemon |
| automatically: set SRP_LOAD=yes and SRPHA_ENABLE=yes. |
| * To set up and use the high availability feature you need the dm-multipath |
| driver and multipath tool. |
| * Please refer to the OFED-1.x user manual for more detailed instructions |
| on how to enable and how to use the HA feature. See e.g. |
| http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED%20_Linux_user_manual_1_5_1_2.pdf. |
| |
| A setup with automatic failover between redundant targets is possible by |
| installing and configuring DRBD on both targets. If the initiator system |
| supports mirroring (e.g. Linux), you can use the following approach: |
| * Configure DRBD in Active/Active mode. |
| * Configure the initiator(s) for mirroring between the redundant targets. |
| If the initiator system does not support mirroring (e.g. VMware ESX), you |
| can use the following approach: |
| * Configure DRBD in Active/Passive mode and enable STONITH mode in the |
| Heartbeat software. |
| |
| For more information, see also: |
| * http://www.drbd.org/ |
| * http://www.linux-ha.org/wiki/Main_Page |
| |
| |
| Performance Notes - Target Side |
| ------------------------------- |
| |
| * Building the SCST core and the ib_srpt target driver in release mode |
| improves performance compared to debug mode. |
| |
| * When using high-latency storage devices (hard disks), the default value |
| chosen by SCST for DEVICE.threads_num should be fine. When using |
| low-latency storage devices though (SSDs), DEVICE.threads_num should be set |
| to 1 or 2 in /etc/scst.conf in order to reach optimal performance for small |
| block sizes (e.g. 4 KB). |
| |
| * When multiple InfiniBand HCA's are present in a target system the Linux |
| kernel by default will assign the associated interrupt handlers to CPU 0. |
| Even irqbalance will often assign the interrupt handlers of multiple HCA's |
| to the same CPU. That is unfortunate because it leads to unfair handling of |
| SRP sessions. The solution is to assign InfiniBand HCA interrupts manually |
| to different CPU's. That's possible by writing looking up the InfiniBand |
| interrupt numbers in /proc/interrupts and by writing proper bitmasks into |
| /proc/irq/<n>/smp_affinity. |
| |
| |
| Performance Notes - Initiator Side |
| ---------------------------------- |
| |
| * Using multiple RDMA connections between initiator and target results in a |
| significant performance improvement. To benefit from this feature, use |
| kernel 3.19 or later at the initiator side and enable scsi-mq either by |
| setting SCSI_MQ_DEFAULT=y in the kernel config or via the following command: |
| |
| echo Y > /sys/module/scsi_mod/parameters/use_blk_mq |
| |
| If the HCA model in your initiator system supports multiple MSI-X interrupts |
| the next step is either to stop the irqbalance service or to write a policy |
| script that stops irqbalance from modifying the IB interrupt CPU |
| affinity. |
| |
| For more information about scsi-mq see also Michael Larabel, SCSI |
| Multi-Queue Performance Appears Great For Linux 3.17, Phoronix, June 18, |
| 2014 (http://www.phoronix.com/scan.php?page=news_item&px=MTcyMjk). |
| |
| * Choose a proper value for the ib_srp kernel module parameter |
| cmd_sg_entries. The default value 12 works well for buffered reads while |
| the throughput for write-dominated workloads improves by changing this value |
| into 255. One way to set this kernel module parameter is as follows: |
| |
| echo options ib_srp cmd_sg_entries=255 >/etc/modprobe.d/ib_srp.conf |
| |
| * For multithreaded workloads using small block sizes changing rq_affinity |
| into 2 improves IOPS significantly (Linux kernel 3.1 and later; see also |
| commit 5757a6d76cdf6dda2a492c09b985c015e86779b1). |
| |
| * For latency sensitive applications, using the noop scheduler at the initiator |
| side can give significantly better results than with other schedulers. |
| |
| * The SRP initiator limits by default the queue depth to 64 commands. If your |
| workload benefits from a larger queue depth, enlarge the queue depth by |
| setting the max_cmd_per_lun and queue_size parameters in the SRP login |
| string. |
| |
| * The following parameters have a small but measurable impact on SRP |
| performance: |
| * /sys/class/block/${dev}/queue/rotational |
| * /sys/class/block/${dev}/queue/rq_affinity |
| * /proc/irq/${ib_int_no}/smp_affinity |
| |
| |
| Performance Notes - Both Sides |
| ------------------------------ |
| |
| * Disabling CONFIG_SCHED_DEBUG and CONFIG_SCHEDSTATS in the kernel config |
| improves performance. |
| |
| * Disable CONFIG_IRQSOFF_TRACER such that CONFIG_TRACE_IRQFLAGS is disabled. |
| |
| * Consider which memory allocator to use. With recent kernels using the SLUB |
| memory allocator instead of SLAB may help. On multi-socket systems the SLAB |
| memory allocator may result in better performance. Please note that SLAB is |
| tunable while SLUB is not. See also http://lkml.org/lkml/2010/7/9/264 and |
| http://www.ibm.com/developerworks/linux/library/l-linux-slab-allocator/. |
| |
| |
| Frequently Asked Questions |
| -------------------------- |
| |
| Q: Every now and then "SRP abort called" and "SRP reset_device called" |
| messages are logged at the initiator side. Around the same time I see the |
| following message in the target log: "ib_srpt: ***ERROR***: Command ...: IB |
| completion for idx ... has not been received in time (SRPT command state |
| ...)". What is the meaning of these messages mean and how can I fix this ? |
| |
| A: This means that a timeout occurred while a HCA was waiting for an |
| acknowledge message. Check the IB network for bad IB cables, bad HCA's |
| and/or bad switch ports. Also make sure that the HCA firmware is up to |
| date. |
| |
| Q: Loading the kernel module ib_srpt triggers a kernel panic with a call trace |
| like the one below. What is the cause of this and how can this be solved ? |
| |
| Call Trace: |
| [<ffffffffa02f2a50>] srpt_alloc_ioctx+0x60/0xb0 [ib_srpt] |
| [<ffffffffa02f2f0a>] srpt_alloc_ioctx_ring+0xea/0x1e0 [ib_srpt] |
| [<ffffffffa02f32e9>] srpt_add_one+0x2e9/0x670 [ib_srpt] |
| [<ffffffffa015a480>] ib_register_client+0x80/0xa0 [ib_core] |
| [<ffffffffa02421eb>] srpt_init_module+0x1eb/0x235 [ib_srpt] |
| [<ffffffff81000344>] do_one_initcall+0x34/0x1a0 |
| [<ffffffff8107a63c>] sys_init_module+0xdc/0x260 |
| [<ffffffff81002e3b>] system_call_fastpath+0x16/0x1b |
| |
| A: This means that you are using a system on which OFED has been installed but |
| that ib_srpt has been compiled against the in-tree kernel headers instead |
| of the OFED kernel headers. You can fix this by rebuilding ib_srpt against |
| the OFED kernel headers. The ib_srpt makefile should detect the OFED kernel |
| headers automatically - at least if ib_srpt is built after OFED has been |
| installed. |
| |
| |
| Feedback |
| -------- |
| |
| Send questions about this driver to scst-devel@lists.sourceforge.net. |