| Generic SCSI target mid-level for Linux (SCST) |
| ============================================== |
| |
| Version 3.5.0, 21 December 2020 |
| ---------------------------- |
| |
| SCST is designed to provide a unified, consistent interface between SCSI |
| target drivers and the Linux kernel and to simplify target driver |
| development as much as possible. A detailed description of SCST's |
| features and internals can be found on its web page, |
| http://scst.sourceforge.net. |
| |
| SCST supports the following I/O modes: |
| |
| * Pass-through mode with a one-to-many relationship, i.e. multiple |
| initiators can connect to the exported pass-through devices, for |
| the following SCSI device types: disks (type 0), tapes (type 1), |
| processors (type 3), CDROMs (type 5), MO disks (type 7), medium |
| changers (type 8) and RAID controllers (type 0xC). |
| |
| * FILEIO mode, which allows files on file systems or block devices to |
| be used as virtual, remotely available SCSI disks or CDROMs with the |
| benefits of the Linux page cache. |
| |
| * BLOCKIO mode, which performs direct block I/O with a block device, |
| bypassing the page cache for all operations. This mode works best with |
| high-end storage HBAs and for applications that either do not need |
| caching between the application and the disk or need large-block |
| throughput. |
| |
| * User space mode using the scst_user device handler, which allows |
| high-performance virtual SCSI devices to be implemented in user |
| space. Compared with fully in-kernel dev handlers, this mode has |
| very low overhead (a few percent). |
| |
| * "Performance" device handlers, which provide, in a pseudo |
| pass-through mode, a way to measure performance directly without the |
| overhead of actually transferring data from/to the underlying SCSI |
| device. |
| |
| In addition, SCST supports advanced per-initiator access and device |
| visibility management, so different initiators can see different sets |
| of devices with different access permissions. See below for details. |
| |
| The full list of SCST features and a comparison with other Linux |
| targets can be found at http://scst.sourceforge.net/comparison.html. |
| |
| |
| Installation |
| ------------ |
| |
| Only vanilla kernels from kernel.org and RHEL/CentOS 5.2 kernels are |
| supported, but SCST should work on other (vendor) kernels if you |
| manage to compile it on them. The main problem with vendor kernels is |
| that they often contain patches that will appear only in the next |
| version of the vanilla kernel, so such changes are quite hard to |
| track. Thus, if during compilation for a vendor kernel your compiler |
| complains about redefinition of some symbol, you should either switch |
| to a vanilla kernel or add or change, as necessary, the |
| "#if LINUX_VERSION_CODE" statement corresponding to that symbol. |
| |
| Kernel versions 2.6.26 and higher are supported. |
| |
| First, make sure that the link "/lib/modules/`your_kernel_version`/build" |
| points to the source code for your currently running kernel. |
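| |
| As a minimal sketch, the link can be checked with a small shell |
| function like the one below. The function name and its optional second |
| argument (an alternative modules root, handy for testing) are |
| illustrative assumptions and not part of SCST: |
| |
```shell
# Check that <modules root>/<kernel release>/build resolves to a directory.
# $1 - kernel release (default: the running kernel, as printed by uname -r)
# $2 - modules root (default: /lib/modules); hypothetical knob for testing
scst_check_build_link() {
    local release root link
    release="${1:-$(uname -r)}"
    root="${2:-/lib/modules}"
    link="$root/$release/build"
    if [ -d "$link" ]; then
        # -d follows symlinks, so a dangling link is reported as an error
        echo "$link -> $(readlink -f "$link")"
        return 0
    fi
    echo "$link does not point to a directory" >&2
    return 1
}
```
| |
| Calling 'scst_check_build_link' without arguments checks the link for |
| the running kernel. It only verifies that the link resolves to a |
| directory, not that the sources match the kernel. |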
| |
| Then consider applying the necessary kernel patches. SCST has the |
| following patches for the kernel in the "kernel" subdirectory. All of |
| them are optional, so if you don't need the corresponding |
| functionality, you don't have to apply them. |
| |
| 1. readahead-2.6.X.patch. This patch fixes a problem in the Linux readahead |
| subsystem and greatly improves performance for software RAIDs. See the |
| http://sourceforge.net/mailarchive/forum.php?thread_name=a0272b440906030714g67eabc5k8f847fb1e538cc62%40mail.gmail.com&forum_name=scst-devel |
| thread for more details. It is included in the mainline kernels 2.6.33 |
| and 2.6.32.11. |
| |
| 2. readahead-context-2.6.X.patch. This is a backport of the context |
| readahead patch from 2.6.31, http://lkml.org/lkml/2009/4/12/9; big |
| thanks to Wu Fengguang. This is a performance improvement patch. It is |
| included in the mainline kernel 2.6.31. |
| |
| Then, to compile SCST, type 'make scst'. This builds SCST itself and its |
| device handlers. To install them, type 'make scst_install'. The driver |
| modules will be installed in '/lib/modules/`your_kernel_version`/extra'. |
| In addition, scst.h, scst_debug.h as well as Module.symvers or |
| Modules.symvers will be copied to '/usr/local/include/scst'. The first |
| file contains all of SCST's public data definitions, which are used by |
| target drivers. The other ones support debug message logging and the |
| build process. |
| |
| Then you can load any module by typing 'modprobe module_name'. The |
| module names are: |
| |
| - scst - SCST itself |
| - scst_disk - device handler for disks (type 0) |
| - scst_tape - device handler for tapes (type 1) |
| - scst_processor - device handler for processors (type 3) |
| - scst_cdrom - device handler for CDROMs (type 5) |
| - scst_modisk - device handler for MO disks (type 7) |
| - scst_changer - device handler for medium changers (type 8) |
| - scst_raid - device handler for storage array controllers (e.g. RAID) (type 0xC) |
| - scst_vdisk - device handler for virtual disks (file, device or ISO CD image). |
| - scst_user - user space device handler |
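| |
| A minimal sketch of loading the core module followed by a chosen set |
| of device handlers is shown below. The function name and the MODPROBE |
| override (useful for a dry run) are illustrative assumptions: |
| |
```shell
# Load the SCST core and then the given device handler modules.
# MODPROBE is a hypothetical override, e.g. MODPROBE=echo for a dry run.
scst_load_modules() {
    local m
    for m in scst "$@"; do
        # Stop at the first module that fails to load
        ${MODPROBE:-modprobe} "$m" || return 1
    done
}
```
| |
| For example, 'scst_load_modules scst_vdisk scst_disk' loads scst, |
| scst_vdisk and scst_disk, in that order. |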
| |
| Then, to see your devices remotely, you need to add a corresponding LUN |
| for each of them (see below how). By default, no local devices are seen |
| remotely. There must be a LUN 0 in each LUN set (security group), i.e. |
| LUN numbering must not start from, e.g., 1. Otherwise you will see no |
| devices on remote initiators and the SCST core will write this message |
| into the kernel log: |
| "tgt_dev for LUN 0 not found, command to unexisting LU?" |
| |
| It is highly recommended to use the scstadmin utility for configuring |
| devices and security groups. |
| |
| The flow of SCST initialization should be as follows: |
| |
| 1. Load the SCST modules with the necessary module parameters, if needed. |
| |
| 2. Configure targets, devices, LUNs, etc. using either scstadmin |
| (recommended), or the sysfs interface directly as described below. |
| |
| If you experience problems while loading or running the modules, check |
| your kernel logs (or run the dmesg command for the most recent messages). |
| |
| IMPORTANT: Without the appropriate device handler loaded, the |
| ========= corresponding devices will be invisible to remote initiators, |
| which could lead to holes in the LUN addressing, so automatic |
| device scanning by the remote SCSI mid-level might not notice |
| the devices. In that case you will have to add them manually via |
| 'echo "- - -" >/sys/class/scsi_host/hostX/scan', |
| where X is the host number. |
| |
| IMPORTANT: Running the target and the initiator on the same host is |
| ========= supported, except in the following 2 cases: swap over a target |
| exported device and a writable mmap over a file from a target |
| exported device. The latter means you can't mount a file |
| system over a target exported device. In other words, you can |
| freely use any sg, sd, st, etc. devices imported from the target |
| on the same host, but you can't mount file systems or put |
| swap on them. This is a limitation of the Linux memory/cache |
| manager, because in this case a memory allocation deadlock is |
| possible: the system needs some memory -> it decides to |
| clear some cache -> the cache needs to be written to a |
| target exported device -> the initiator sends a request to the |
| target located on the same system -> the target needs memory |
| -> the system needs even more memory -> deadlock. |
| |
| IMPORTANT: In the current version, simultaneous access to local SCSI devices |
| ========= via standard high-level SCSI drivers (sd, st, sg, etc.) and |
| SCST's target drivers is unsupported. This is especially |
| important for commands executed via sg and st that change |
| the state of devices and their parameters, because that could |
| lead to data corruption. If any such command is executed, at |
| least the related device handler(s) must be restarted. For block |
| devices, READ/WRITE commands using the direct disk handler are |
| generally safe. |
| |
| To uninstall, type 'make scst_uninstall'. |
| |
| |
| Creating a kernel patch or patched kernel |
| ----------------------------------------- |
| |
| You can use the generate-kernel-patch or generate-patched-kernel scripts |
| in the scripts/ subdirectory to convert the SCST source tree, as it |
| exists in the Subversion repository, to a Linux kernel patch or to |
| generate a kernel source tree with the SCST patches applied, |
| respectively. This subdirectory exists only in the SVN tree. |
| |
| An example of how to use generate-kernel-patch can be found in "How To |
| install SCST on Ubuntu 15.04 with in-tree kernel patches", |
| https://gist.github.com/chrwei/42f8bbb687290b04b598, thanks to Chris Weiss. |
| |
| |
| Migration from the obsolete proc interface |
| ------------------------------------------ |
| |
| The sysfs-enabled scstadmin supports the old procfs config file format, |
| so with it you can migrate your proc-based configuration to the sysfs |
| interface as follows: |
| |
| 1. Load SCST modules |
| |
| 2. Run "scstadmin -config old_config_file" |
| |
| 3. Run "scstadmin -write_config new_config_file" |
| |
| 4. Check new_config_file and make sure it has everything written |
| properly. |
| |
| 5. Start using "scstadmin -config new_config_file" to configure SCST. |
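| |
| Steps 2 and 3 above can be sketched as a small shell function. The |
| function name and the SCSTADMIN override are illustrative assumptions; |
| the SCST modules must already be loaded (step 1), and the resulting |
| file still has to be reviewed by hand (step 4): |
| |
```shell
# Convert a procfs-format config file into a sysfs-format one.
# $1 - old (procfs-format) config file, $2 - new config file to write
# SCSTADMIN is a hypothetical override of the scstadmin binary.
scst_migrate_config() {
    local old="$1" new="$2"
    if [ -z "$old" ] || [ -z "$new" ]; then
        echo "usage: scst_migrate_config OLD_CONFIG NEW_CONFIG" >&2
        return 2
    fi
    # Step 2: apply the old configuration to the running SCST
    ${SCSTADMIN:-scstadmin} -config "$old" || return 1
    # Step 3: write the running configuration in the new format
    ${SCSTADMIN:-scstadmin} -write_config "$new" || return 1
    echo "Review $new before using it with: scstadmin -config $new"
}
```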
| |
| |
| Usage in failover mode |
| ---------------------- |
| |
| It is recommended to use the TEST UNIT READY ("tur") command to check |
| whether an SCST target is alive in MPIO configurations. |
| |
| |
| Device handlers |
| --------------- |
| |
| Device specific drivers (device handlers) are plugins for SCST that |
| help SCST analyze incoming requests and determine parameters |
| specific to various types of devices. If an appropriate device handler |
| for a SCSI device type isn't loaded, SCST doesn't know how to handle |
| devices of this type, so they will be invisible to remote initiators |
| (more precisely, a "LUN not supported" sense code will be returned). |
| |
| In addition to device handlers for real devices, there are VDISK, user |
| space and "performance" device handlers. |
| |
| The VDISK device handler works over files on file systems and turns |
| them into virtual, remotely available SCSI disks or CDROMs. In addition, |
| it can work directly over a block device, e.g. a local IDE or SCSI disk |
| or even a disk partition, where there is no file system overhead. |
| Compared to sending SCSI commands directly to the SCSI mid-level via |
| scsi_do_req()/scsi_execute_async(), using block devices has the |
| advantage that data are transferred via the system cache, so it is |
| possible to fully benefit from the caching and read-ahead performed by |
| Linux's VM subsystem. The only disadvantage is that in FILEIO mode |
| there is superfluous data copying between the cache and SCST's buffers. |
| This issue is going to be addressed in one of the future releases. |
| Virtual CDROMs are useful for remote installation. See below for details |
| on how to set up and use the VDISK device handler. |
| |
| The SCST user space device handler provides an interface between SCST |
| and user space, which allows pure user space devices to be created. |
| The simplest example of where this is useful is writing a VTL: with |
| scst_user it can be written purely in user space. Another example is |
| processing of the passed data that is too sophisticated for kernel |
| space, like encrypting the data or making snapshots. |
| |
| "Performance" device handlers for disks, MO disks and tapes skip |
| (pretend to execute) all READ and WRITE operations in their exec() |
| method and thus provide a way to measure link performance directly, |
| without the overhead of actually transferring data from/to the |
| underlying SCSI device. |
| |
| NOTE: Since "perf" device handlers don't touch the commands' data |
| ==== buffer on READ operations, it is returned to remote initiators as it |
| was allocated, without even being zeroed. Thus, "perf" device |
| handlers pose a security risk, so use them with caution. |
| |
| |
| Compilation options |
| ------------------- |
| |
| The following compilation options can be commented in/out in the |
| Makefile and scst.h: |
| |
| - CONFIG_SCST_DEBUG - if defined, turns on some debugging code, |
| including some logging. Makes the driver considerably bigger and slower |
| and produces a large amount of log data. |
| |
| - CONFIG_SCST_TRACING - if defined, turns on the ability to log events. |
| Makes the driver considerably bigger and leads to some performance loss. |
| |
| - CONFIG_SCST_EXTRACHECKS - if defined, adds extra validity checks in |
| various places. |
| |
| - CONFIG_SCST_USE_EXPECTED_VALUES - if not defined (default), the |
| initiator supplied expected data transfer length and direction are |
| used only for verification, to return an error or warn if one of |
| them is invalid; the values decoded locally from the SCSI command |
| are used instead. This is necessary for security reasons, because |
| otherwise a faulty initiator can crash the target by supplying an |
| invalid value in one of those parameters. This is especially |
| important in pass-through mode. If CONFIG_SCST_USE_EXPECTED_VALUES |
| is defined, the initiator supplied expected data transfer length and |
| direction override the locally decoded values. This might be |
| necessary if the internal SCST command translation table doesn't |
| contain a SCSI command that is used in your environment. You can |
| tell that this is the case if you enable the "minor" trace level, |
| see messages like "Unknown opcode XX for YY. Should you update |
| scst_scsi_op_table?" in your kernel log, and your initiator returns |
| an error. Please also report those messages to the SCST mailing list, |
| scst-devel@lists.sourceforge.net. Note that not all SCSI transports |
| support supplying expected values. You should try enabling this |
| option if you have a pass-through device that doesn't work with |
| SCST, for instance, a SATA CDROM. |
| |
| - CONFIG_SCST_DEBUG_TM - if defined, turns on debugging of task |
| management functions: on LUN 6 some of the commands will be delayed |
| for about 60 sec., making the remote initiator send TM functions, |
| e.g. ABORT TASK and TARGET RESET. Also define the |
| CONFIG_SCST_TM_DBG_GO_OFFLINE symbol in the Makefile if you want |
| the device to eventually become completely unresponsive; otherwise |
| it will keep cycling through the ABORT and RESET code. Needs |
| CONFIG_SCST_DEBUG turned on. |
| |
| - CONFIG_SCST_DEBUG_SYSFS_EAGAIN - if defined, makes three out of four |
| reads of sysfs attributes fail with -EAGAIN and also makes every sysfs |
| write fail with -EAGAIN. This is useful to test -EAGAIN handling in user |
| space tools like e.g. scstadmin. See also the documentation of the |
| last_sysfs_mgmt_res sysfs attribute for more information. |
| |
| - CONFIG_SCST_STRICT_SERIALIZING - if defined, makes SCST send all commands to |
| the underlying SCSI device synchronously, one after another. This makes |
| task management more reliable at the cost of some performance. This |
| is mostly relevant for stateful SCSI devices like tapes, where the |
| result of a command's execution depends on device settings defined |
| by previous commands. Disk and RAID devices are stateless in most |
| cases. The current SCSI core in Linux doesn't allow all commands to be |
| aborted reliably if they were sent asynchronously to a stateful device. |
| Turned off by default; turn it on if you use stateful device(s) and |
| need as much error recovery reliability as possible. As a side effect |
| of CONFIG_SCST_STRICT_SERIALIZING, on kernels below 2.6.30 no kernel |
| patching is necessary for pass-through device handlers (scst_disk, |
| etc.). |
| |
| - CONFIG_SCST_TEST_IO_IN_SIRQ - if defined, allows SCST to submit selected |
| SCSI commands (TUR and READ/WRITE) from soft-IRQ context (tasklets). |
| Enabling it decreases the number of context switches and slightly |
| improves performance. The goal of this option is to make it possible |
| to measure the overhead of the context switches. If, after enabling |
| this option, vmstat on the target does not show a significant |
| decrease in the number of context switches under load, then your |
| target driver doesn't submit commands to SCST in IRQ context. For |
| instance, iSCSI-SCST doesn't do that, but qla2x00t with |
| CONFIG_QLA_TGT_DEBUG_WORK_IN_THREAD disabled does. This option is |
| designed to be used with the vdisk NULLIO backend. |
| |
| WARNING! Enabling this option with any backend other than vdisk |
| NULLIO is unsafe and can lead to a kernel crash! |
| |
| - CONFIG_SCST_STRICT_SECURITY - if defined, makes SCST zero allocated data |
| buffers. Leaving it undefined (default) considerably improves |
| performance and eases CPU load, but could create a security hole |
| (information leakage), so enable it if you have strict security |
| requirements. |
| |
| - CONFIG_SCST_ABORT_CONSIDER_FINISHED_TASKS_AS_NOT_EXISTING - if defined, |
| when the TASK MANAGEMENT function ABORT TASK tries to abort a |
| command that has already finished, the remote initiator that sent the |
| ABORT TASK request will receive a TASK NOT EXIST (or ABORT FAILED) |
| response. This is the more logical response, because the command has |
| finished and so the attempt to abort it failed, but some initiators, |
| particularly the VMware iSCSI initiator, treat a TASK NOT EXIST |
| response as a sign that the target is malfunctioning and try to |
| RESET it, and then sometimes start malfunctioning themselves. So this |
| option is disabled by default. |
| |
| - CONFIG_SCST_DIF_INJECT_CORRUPTED_TAGS - if defined, allows injection |
| of corrupted DIF tags according to the Oracle specification. This |
| functionality works only if dif_mode doesn't contain dev_store |
| and dif_type is 1. |
| |
| - CONFIG_SCST_NO_TOTAL_MEM_CHECKS - disables checks of allocated |
| memory, see scst_max_cmd_mem below. Avoids 2 global variables on |
| the fast path and hence gives better multi-queue performance. |
| |
| HIGHMEM kernel configurations are fully supported, but not recommended |
| for performance reasons, except for scst_user, where they are not |
| supported, because this module deals with user supplied memory in a |
| zero-copy manner. If you would otherwise need HIGHMEM enabled, consider |
| changing the VMSPLIT option or using a 64-bit system configuration |
| instead. |
| |
| To change the VMSPLIT option (CONFIG_VMSPLIT to be precise), set the |
| following variables in "make menuconfig": |
| |
| - General setup->Configure standard kernel features (for small systems): ON |
| |
| - General setup->Prompt for development and/or incomplete code/drivers: ON |
| |
| - Processor type and features->High Memory Support: OFF |
| |
| - Processor type and features->Memory split: according to the amount of |
| memory you have. If it is less than 800MB, you don't need to touch |
| this option at all. |
| |
| |
| Module parameters |
| ----------------- |
| |
| Module scst supports the following parameters: |
| |
| - scst_threads - sets the number of SCST's threads. By default it |
| is the CPU count. |
| |
| - scst_max_cmd_mem - sets the maximum amount of memory in MB allowed |
| to be consumed by the SCST commands for data buffers at any given |
| time. By default it is approximately TotalMem/4. |
| |
| - scst_max_dev_cmd_mem - sets the maximum amount of memory in MB allowed |
| to be consumed by all SCSI commands of a device at any given time. By |
| default, it is approximately 2/5 of scst_max_cmd_mem. |
| |
| - auto_cm_assignment - enables automatic registration in the copy manager. |
| If a device is not registered in the copy manager, it cannot be the |
| source or target of EXTENDED COPY commands. Enabled by default. |
| Disable it if you want to control the copy manager registration |
| manually or need to change a device, e.g. a DM cache device, with an |
| SCST LUN on top of it, to avoid the extra reference the copy manager |
| holds on this device. In the latter case you can also remove this |
| reference by manually deleting the corresponding copy manager LUN via |
| the sysfs interface |
| (/sys/kernel/scst_tgt/targets/copy_manager/copy_manager_tgt/luns/mgmt). |
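| |
| To get a feeling for the default memory limits described above, the |
| following sketch estimates them from the total memory. It only |
| mirrors the "approximately" figures quoted in this section; the exact |
| values computed by the kernel module may differ. The function name is |
| an illustrative assumption: |
| |
```shell
# Estimate the default limits (in MB) from the total memory in MB.
# $1 - total memory in MB (default: MemTotal from /proc/meminfo)
scst_default_mem_limits() {
    local total_mb="$1" max_cmd_mem max_dev_cmd_mem
    if [ -z "$total_mb" ]; then
        # MemTotal is reported in kB
        total_mb=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
    fi
    max_cmd_mem=$((total_mb / 4))             # approximately TotalMem/4
    max_dev_cmd_mem=$((max_cmd_mem * 2 / 5))  # approximately 2/5 of the above
    echo "scst_max_cmd_mem=$max_cmd_mem scst_max_dev_cmd_mem=$max_dev_cmd_mem"
}
```
| |
| Explicit values are passed in the usual way for module parameters, for |
| instance 'modprobe scst scst_threads=8 scst_max_cmd_mem=2048' (the |
| numbers here are arbitrary examples). |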
| |
| |
| SCST sysfs interface |
| -------------------- |
| |
| Starting from version 2.0.0, SCST has a sysfs interface. It supports only |
| kernels 2.6.26 and higher, because in 2.6.26 the kernel's internal sysfs |
| interface had a major change, which made it heavily incompatible with |
| pre-2.6.26 versions. |
| |
| The SCST sysfs interface is designed to be self-descriptive and |
| self-contained. This means that a high level management tool for it can |
| be written once and automatically support any future sysfs interface |
| changes (attribute additions or removals, new target drivers and dev |
| handlers, etc.) without any modifications. Scstadmin is an example of |
| such a management tool. |
| |
| To achieve that, a management tool should be implemented not around |
| drivers and their attributes, but around the common rules those drivers |
| and attributes follow. You can find those rules in the SysfsRules file. |
| For instance, each SCST sysfs file (attribute) can contain the mark |
| "[key]" in its last line. It is automatically added to allow scstadmin |
| and other management tools to see which attributes they should save in |
| the config file. If you are manipulating attributes manually, you can |
| ignore this mark. |
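| |
| As an illustration of the "[key]" rule, a management script could |
| separate the mark from the value roughly as follows. This is only a |
| sketch: the function names are assumptions, and it does not deal with |
| the EAGAIN handling described later in this file: |
| |
```shell
# Succeed if the last line of attribute file $1 is the "[key]" mark.
scst_attr_is_key() {
    [ "$(tail -n 1 "$1")" = "[key]" ]
}

# Print the content of attribute file $1 without a trailing "[key]" line.
scst_attr_value() {
    # Delete the last line only if it consists of exactly "[key]"
    sed '${/^\[key\]$/d;}' "$1"
}
```
| |
| A tool that saves the configuration would then store the output of |
| scst_attr_value only for those attributes for which scst_attr_is_key |
| succeeds. |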
| |
| Root of SCST sysfs interface is /sys/kernel/scst_tgt. It has the |
| following entries: |
| |
| - devices - this is a root subdirectory for all SCST devices |
| |
| - handlers - this is a root subdirectory for all SCST dev handlers |
| |
| - max_tasklet_cmd - specifies the maximum number of commands that can be |
| queued in the SCST core simultaneously on a single CPU from all |
| connected initiators while still allowing commands on this CPU to be |
| processed in soft-IRQ context in tasklets. If the number of commands |
| exceeds this value, all of them will be processed only in SCST |
| threads. This is to prevent possible starvation, under heavy load, of |
| processes on the CPUs serving soft IRQs and in some cases to improve |
| performance by spreading the load more evenly over the available CPUs. |
| |
| - measure_latency - whether or not to enable latency measurements. |
| Enabling latency measurements has a small impact on performance but |
| makes detailed information available about how much time is needed |
| to process SCSI commands. The structure of the paths to files with |
| latency information is as follows: |
| |
| /sys/kernel/scst_tgt/targets/${target_driver_name}/${target_port_name}/sessions/${initiator_name}/latency/${io_type}${io_size} |
| |
| ${io_type} is n, r, w or b. 'n' means that no data buffer was |
| associated with the command, 'r' stands for read, 'w' for write |
| and 'b' for bidirectional. ${io_size} is a power of two between 512 |
| and 524288. Each file contains statistics for I/O requests with a |
| size up to ${io_size} and that exceed a smaller I/O size. The files |
| for ${io_size} 524288 are an exception because these also include |
| data for all larger requests. |
| |
| Here is an example of the data produced by this infrastructure (edited for |
| clarity): |
| |
| $ echo 1 >/sys/kernel/scst_tgt/measure_latency |
| $ sleep 10 # Wait until an initiator has submitted multiple I/O requests |
| $ (cd /sys/kernel/scst_tgt/targets && |
| find -name latency | xargs grep -raH .) |
| state count min max avg stddev |
| PARSE 219 1.3 26.6 2.2 2.5 us |
| PREPARE_SPACE 219 0.9 10.3 1.1 0.6 us |
| RDY_TO_XFER 219 0.7 1.7 0.7 0.2 us |
| TGT_PRE_EXEC 219 0.7 11.0 0.8 0.9 us |
| EXEC_CHECK_SN 219 0.7 1.7 0.8 0.2 us |
| PRE_DEV_DONE 219 11.3 3445.7 39.6 276.4 us |
| DEV_DONE 219 0.7 11.0 0.9 0.7 us |
| PRE_XMIT_RESP1 219 1.2 58.4 1.6 3.8 us |
| CSW2 219 0.7 1.6 0.8 0.1 us |
| PRE_XMIT_RESP2 219 0.7 1.5 0.7 0.1 us |
| XMIT_RESP 219 0.7 1.5 0.7 0.1 us |
| INIT_WAIT 219 1.0 57.3 2.1 4.4 us |
| INIT 219 0.9 27.4 1.6 2.4 us |
| CSW1 219 15.0 3856.1 74.2 264.8 us |
| EXEC_CHECK_BLOCKING 219 1.3 10.8 1.7 0.9 us |
| LOCAL_EXEC 219 0.7 1.8 0.7 0.1 us |
| REAL_EXEC 219 0.6 1.5 0.7 0.1 us |
| EXEC_WAIT 219 40.6 1021.7 54.4 68.7 us |
| XMIT_WAIT 219 6.4 1682.0 50.6 228.1 us |
| total 219 - - 236.9 2012.1 us |
| |
| PRE_DEV_DONE refers to internal checks done after execution of a command |
| finished. CSW1 is the context switch that happens after the transport |
| driver received a command and before processing of a command starts. |
| EXEC_WAIT is the time spent in the device handler .exec() method. |
| |
| - sgv - this is a root subdirectory for all SCST SGV caches |
| |
| - targets - this is a root subdirectory for all SCST targets |
| |
| - setup_id - allows reading and writing the SCST setup ID. This ID can |
| be used when the same SCST configuration should be installed on |
| several targets, but the devices exported from those targets should |
| have different IDs and SNs. For instance, the VDISK dev handler uses |
| this ID to generate the T10 vendor specific identifier and SN of the |
| devices. |
| |
| - poll_us - if polling is desired, sets how many microseconds each SCST |
| thread keeps polling its queue after it became empty, in the hope that |
| a new command arrives. In some cases polling can significantly |
| increase IOPS, especially if CPU low power states are not disabled, |
| because at high IOPS polling can be cheaper than spending significant |
| time on entering and then exiting CPU low power states plus the |
| corresponding context switches. Disabled, i.e. set to 0, by default. |
| |
| - suspend - globally suspends or releases all SCSI activities on all |
| devices. Useful for mass management, like adding or deleting LUNs. |
| Writing a value v to it: |
| |
| * v > 0 - suspends activities, but waits no more than v seconds |
| |
| * v = 0 - suspends activities, waits indefinitely |
| |
| * v < 0 - releases activities. |
| |
| Reading from this attribute returns the number of previous suspend |
| requests. |
| |
| - threads - allows reading and setting the number of global SCST I/O |
| threads. Those threads are used with async. dev handlers, for |
| instance, vdisk BLOCKIO or NULLIO. |
| |
| - trace_cmds - shows current SCST commands up to size of the sysfs |
| buffer (4KB) |
| |
| - trace_mcmds - shows current SCST management commands up to size of |
| the sysfs buffer (4KB) |
| |
| - trace_level - allows enabling and disabling various tracing |
| facilities. See the content of this file for help on how to use it. See |
| also section "Dealing with massive logs" for more info on how to make |
| correct logs when you have enabled trace levels producing a lot of log |
| data. |
| |
| - version - read-only attribute, which shows the version of |
| SCST and the enabled optional features. |
| |
| - force_global_sgv_pool - if not set, buffers for SCSI commands are |
| allocated from the per-CPU SGV pool. Otherwise, the global SGV pool is |
| used. |
| |
| - last_sysfs_mgmt_res - read-only attribute returning the completion |
| status of the last management command. In the sysfs implementation |
| there are some conflicts between internal sysfs and internal SCST |
| locking. To avoid them, in some cases sysfs calls can return an error |
| with errno EAGAIN. This doesn't mean the operation failed. It only |
| means that the operation has been queued and has not yet completed. To |
| wait for it to complete, a management tool should poll this file. If |
| the operation hasn't completed yet, reading it will also return EAGAIN. |
| Once the operation has completed, it returns the result of the operation |
| (0 for success or -errno for error). The following two shell functions |
| show how to do this: |
| |
| # Read the SCST sysfs attribute $1. See also scst/README for more information. |
| scst_sysfs_read() { |
| local EAGAIN val |
| |
| EAGAIN="Resource temporarily unavailable" |
| while true; do |
| if val="$(LC_ALL=C cat "$1" 2>&1)"; then |
| echo -n "${val%\[key\]}" |
| return 0 |
| elif [ "${val/*: }" != "$EAGAIN" ]; then |
| return 1 |
| fi |
| sleep 1 |
| done |
| } |
| |
| # Write $1 into the SCST sysfs attribute $2. See also scst/README for more |
| # information. |
| scst_sysfs_write() { |
| local EAGAIN status |
| |
| EAGAIN="Resource temporarily unavailable" |
| if status="$(LC_ALL=C; (echo -n "$1" > "$2") 2>&1)"; then |
| return 0 |
| elif [ "${status/*: }" != "$EAGAIN" ]; then |
| return 1 |
| fi |
| scst_sysfs_read /sys/kernel/scst_tgt/last_sysfs_mgmt_res >/dev/null |
| } |
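| |
| The "suspend" attribute described above lends itself to a small |
| wrapper that suspends all activity, runs a batch of management |
| commands and then releases the activity again. The sketch below is an |
| illustration only: the function name and the SCST_SYSFS_ROOT override |
| are assumptions, and it uses plain writes, so in practice |
| scst_sysfs_write from above would be the safer choice: |
| |
```shell
# Run a command while all SCST activity is suspended, then release it.
# $1 - suspend timeout in seconds (0 waits indefinitely), rest - command
# SCST_SYSFS_ROOT is a hypothetical override of /sys/kernel/scst_tgt.
scst_with_suspend() {
    local attr timeout rc
    attr="${SCST_SYSFS_ROOT:-/sys/kernel/scst_tgt}/suspend"
    timeout="$1"
    shift
    # v >= 0 suspends; give up if the write is rejected
    printf '%s\n' "$timeout" > "$attr" || return 1
    "$@"
    rc=$?
    # v < 0 releases the suspended activities
    printf '%s\n' -1 > "$attr"
    return "$rc"
}
```
| |
| The wrapper returns the exit status of the wrapped command, so a |
| failed batch is still visible to the caller after the release. |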
| |
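| The naming of the measure_latency files described above can be |
| sketched as follows. The functions pick the ${io_size} bucket for a |
| request size and build the file path from the template given there. |
| The function names are assumptions, as is the reading that sizes are |
| in bytes and that a request exactly equal to a bucket size belongs to |
| that bucket: |
| |
```shell
# Print the ${io_size} bucket (512..524288) that a request size falls into.
scst_latency_bucket() {
    local size="$1" bucket=512
    # Double the bucket until it covers the size; 524288 also takes
    # all larger requests
    while [ "$bucket" -lt "$size" ] && [ "$bucket" -lt 524288 ]; do
        bucket=$((bucket * 2))
    done
    echo "$bucket"
}

# Build the path of a latency file.
# $1 - target driver, $2 - target port, $3 - initiator,
# $4 - I/O type (n, r, w or b), $5 - request size
scst_latency_file() {
    echo "/sys/kernel/scst_tgt/targets/$1/$2/sessions/$3/latency/$4$(scst_latency_bucket "$5")"
}
```
| |
| With these assumptions, a read of 3000 bytes would be accounted in the |
| file whose name ends in "r4096". |
| |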
| The "Devices" subdirectory contains a subdirectory for each SCST device. |
| |
| The content of each device's subdirectory is dev handler specific. See |
| the documentation for your dev handlers for more info about it, as well |
| as the SysfsRules file for more info about the rules common to all dev |
| handlers. SCST dev handlers can have the following common entries: |
| |
| - block - allows temporarily blocking and unblocking this device. See below. |
| |
| - exported - subdirectory containing links to all LUNs where this |
| device was exported. |
| |
| - handler - if a dev handler is determined for this device, this link |
| points to it. The handler may be unset for pass-through devices. |
| |
| - threads_num - shows and sets the number of threads in this device's |
| thread pool. If 0, no threads will be created and the global SCST |
| thread pool will be used. If <0, creation of the thread pool is |
| prohibited. |
| |
| - threads_pool_type - shows and sets the thread pool type. |
| Possible values: "per_initiator" and "shared". When the value is |
| "per_initiator" (default), each session from each initiator will use |
| a separate dedicated pool of threads. When the value is "shared", all |
| sessions from all initiators will share the same per-device pool of |
| threads. Valid only if the threads_num attribute is >0. |
| |
| - dump_prs - dumps persistent reservation information to the |
| kernel log. |
| |
| - type - SCSI type of this device |
| |
| - max_tgt_dev_commands - maximum number of SCSI commands any session to |
| this device can have in flight. |
| |
| - numa_node_id - ID of the NUMA node this device physically belongs to. |
| SCST NUMA handling assumes that the NUMA memory allocation policy used |
| in the system is to always allocate from the current node. |
| |
| The "block" attribute allows temporarily blocking and unblocking this |
| device. "Blocking" means that no new commands for this device will go |
| into the execution stage, but will instead be suspended just before it. |
| The blocked state is not reached until the queue of the corresponding |
| device is completely drained. You can also call this state "frozen". It |
| is useful in many cases, like consistent snapshots and graceful shutdown. |
| |
| On write, the "block" entry accepts the following 3 types of parameters: |
| |
| - 1 - block the device synchronously, i.e. don't return until this |
| device becomes blocked, i.e. until its queue is completely drained. Can |
| be called as many times as needed. |
| |
| - 11 params - block the device asynchronously, i.e. return immediately. |
| Notification of completion is delivered using the |
| SCST_EVENT_EXT_BLOCKING_DONE event. "Params" is delivered to it as is |
| in the "data" payload. Can be called as many times as needed. |
| Alternatively, the status of blocking can be polled by reading this |
| attribute until the second number reaches 0 (see below). |
| |
| - 0 - unblock this device. |
| |
| Reading from the "block" entry returns two numbers separated by a space: |
| |
| 1. How many times this device was blocked, i.e. how many times writing |
| "0" to it is needed to unblock this device. |
| |
| 2. A boolean (0 or 1) indicating whether blocking, if any, is done (0) or |
| still pending (1). |
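| |
| The polling alternative mentioned for asynchronous blocking can be |
| sketched as a shell function that reads the two numbers until the |
| second one becomes 0. The function name, the 1-second interval and the |
| default limit of 60 polls are assumptions made for this example, and |
| EAGAIN handling is left out for brevity: |
| |
```shell
# Wait until the asynchronous blocking of a device has completed.
# $1 - path to the device's "block" attribute
# $2 - maximum number of 1-second polls (default 60; an assumption)
scst_wait_blocked() {
    local attr="$1" tries="${2:-60}" count pending
    while [ "$tries" -gt 0 ]; do
        # First number: block count; second number: 1 while still pending
        read -r count pending < "$attr" || [ -n "$count" ] || return 1
        if [ "$pending" = "0" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```
| |
| For a hypothetical device "disk1", one could write "11 snap1" to |
| /sys/kernel/scst_tgt/devices/disk1/block and then call |
| 'scst_wait_blocked /sys/kernel/scst_tgt/devices/disk1/block' before |
| taking a snapshot, followed by writing "0" to unblock the device. |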
| |
| See below for more information about other entries of this subdirectory |
| of the standard SCST dev handlers. |
| |
| "Handlers" subdirectory contains subdirectories for each SCST dev |
| handler. |
| |
| Content of each handler's subdirectory is dev handler specific. See |
| documentation for your dev handlers for more info about it as well as |
| SysfsRules file for more info about rules common to all dev handlers. |
| SCST dev handlers can have the following common entries: |
| |
| - mgmt - this entry allows to create virtual devices and their |
| attributes (for virtual devices dev handlers) or assign/unassign real |
| SCSI devices to/from this dev handler (for pass-through dev |
| handlers). |
| |
| - trace_level - allows enabling and disabling various tracing |
| facilities. See content of this file for help on how to use it. See |
| also section "Dealing with massive logs" for more info on how to get |
| correct logs when the enabled trace levels produce a lot of log data. |
| |
| - type - SCSI type of devices served by this dev handler. |
| |
| See below for more information about other entries of this subdirectory |
| of the standard SCST dev handlers. |
| |
| "Sgv" subdirectory contains statistic information of SCST SGV caches. It |
| has the following entries: |
| |
| - None, one or more subdirectories for each existing SGV cache. |
| |
| - global_stats - file containing global SGV caches statistics. |
| |
| Each SGV cache's subdirectory has the following item: |
| |
| - stats - file containing statistics for this SGV cache. |
| |
| "Targets" subdirectory contains subdirectories for each SCST target. |
| |
| Content of each target's subdirectory is target specific. See |
| documentation for your target for more info about it as well as |
| SysfsRules file for more info about rules common to all targets. |
| Every target should have at least the following entries: |
| |
| - ini_groups - subdirectory, which contains and allows to define |
| initiator-oriented access control information, see below. |
| |
| - luns - subdirectory, which contains list of available LUNs in the |
| target-oriented access control and allows to define it, see below. |
| |
| - sessions - subdirectory containing sessions connected to this target. |
| |
| - comment - this attribute can be used to store any human readable info |
| to help identify the target, for instance, the target's mapping to the |
| corresponding hardware port. It is not used by SCST in any way. |
| |
| - enabled - using this attribute you can enable or disable this target. |
| It allows you to finish configuring the target before it starts |
| accepting new connections. 0 by default. |
| |
| - addr_method - used LUNs addressing method. Possible values: |
| "Peripheral", "Flat" or "LUN". Most initiators work well with |
| Peripheral addressing method (default), but some (HP-UX, for instance) |
| may require the Flat method or the LUN method (e.g. IBM systems). This |
| attribute is also available in the initiators security groups, so you |
| can assign the addressing method on per-initiator basis. See also the |
| "Logical unit addressing (LUN)" section in SAM-5 for more information. |
| |
| - black_hole - if set, all LUNs in the corresponding initiator group, |
| default target group in this case, start "swallowing" requests from |
| initiators. Possible values are: |
| |
| * 0 - disable black hole mode |
| |
| * 1 - immediately abort all coming SCSI commands, i.e. all SCSI commands |
| are dropped and TM requests return that they completed. It is |
| supposed to simulate lost front end responses. |
| |
| * 2 - immediately abort all coming SCSI commands and drop all coming TM |
| commands. It is supposed to simulate logical target hang, when the |
| target stops responding, but on the HW/TCP connection level still |
| appears to be online. |
| |
| * 3 - immediately abort all coming data transfer SCSI commands, i.e. |
| only data transfer SCSI commands are dropped, while commands like |
| INQUIRY and TEST UNIT READY pass well. It is supposed to simulate |
| flaky front end connectivity, when responses for small commands |
| pass well, but big data transfers fail. |
| |
| * 4 - immediately abort all coming data transfer SCSI commands and |
| drop all coming TM commands. It is supposed to simulate really |
| flaky front end connectivity, when TM requests or responses are |
| also lost. |
| |
| Modes 3 and 4 are the most evil ones, because they are not too well |
| handled by many initiator OS'es, including Linux, so they may never |
| recover from it. |
| |
| Note that dropping TM commands, i.e. not sending responses to them, is |
| not implemented for all target drivers. You can find out whether it is |
| implemented for your particular target driver by checking traces or |
| the target driver's source code. |
| |
| - dif_capabilities - if this target supports T10-PI, returns which |
| exact DIF capabilities this target supports. |
| |
| - dif_checks_failed - if this target supports T10-PI, returns |
| statistics how many DIF errors have been detected on the |
| corresponding processing stages on this target. It returns 3 rows of |
| numbers with 3 numbers in each row: for target driver stage, for SCST |
| stage and for dev handler stage. Numbers in each row: how many errors |
| detected checking application, reference and guard tags |
| correspondingly. Writing to this attribute resets the numbers. |
| |
| - cpu_mask - defines CPU affinity mask for threads serving this target. |
| For threads serving LUNs it is used only for devices with |
| threads_pool_type "per_initiator". |
| |
| - io_grouping_type - defines how I/O from sessions to this target is |
| grouped together. This I/O grouping is very important for |
| performance. By setting this attribute to the right value, you can |
| considerably increase performance of your setup. This grouping is |
| performed only if you use CFQ I/O scheduler on the target and for |
| devices with threads_num >= 0 and, if threads_num > 0, with |
| threads_pool_type "per_initiator". Possible values: |
| "this_group_only", "never", "auto", or I/O group number >0. When the |
| value is "this_group_only" all I/O from all sessions in this target |
| will be grouped together. When the value is "never", I/O from |
| different sessions will not be grouped together, i.e. all sessions in |
| this target will have separate dedicated I/O groups. When the value |
| is "auto" (default), all I/O from initiators with the same name |
| (iSCSI initiator name, for instance) in all targets will be grouped |
| together with a separate dedicated I/O group for each initiator name. |
| For iSCSI this mode works well, but other transports usually use |
| different initiator names for different sessions, so using such |
| transports in MPIO configurations you should either use value |
| "this_group_only", or an explicit I/O group number. This attribute is |
| also available in the initiators security groups, so you can assign |
| the I/O grouping on per-initiator basis. See below for more info how |
| to use this attribute. |
| |
| - rel_tgt_id - allows reading or writing the SCSI Relative Target Port |
| Identifier attribute. This identifier is used to identify SCSI Target |
| Ports by some SCSI commands, mainly by Persistent Reservations |
| commands. This identifier must be unique among all SCST targets, but |
| for convenience SCST allows disabled targets to have a non-unique |
| rel_tgt_id. In this case SCST will not allow enabling this target |
| until rel_tgt_id becomes unique. By default SCST initializes this |
| attribute to a unique value. |
| |
| - forward_src - if set, this target port is a forwarding source. This means |
| that commands like COMPARE AND WRITE, EXTENDED COPY and RECEIVE COPY |
| RESULTS are submitted to the SCSI device instead of being handled inside |
| the SCST core. PERSISTENT RESERVE IN and OUT commands are processed by the |
| SCST core, whether or not this mode is enabled. The name 'forward_src' |
| refers to the use case where SCSI passthrough is used to send SCSI commands |
| to another H.A. node. |
| |
| - forward_dst - if set, this target port is a forwarding destination. This means |
| that it does not check any local SCSI events (reservations, etc.). Those |
| events are supposed to be checked at the forwarding source side. |
| |
| - forwarding - obsolete synonym for forward_dst. |
| |
| - *count*, e.g. read_io_count_kb, - statistics about executed |
| commands and transferred data. Those attributes have self-descriptive |
| names built from parts: |
| |
| 1. Data transfer direction |
| |
| 2. Alignment type: not specified or unaligned (on 4K boundaries) |
| |
| 3. Type: IO (commands) count or amount of transferred data |
| |
| 4. For transferred data: measurement units |
| |
| For instance, read_unaligned_cmd_count is the number of read commands |
| that were not aligned on 4K boundaries. |
| |
| A target driver may have also the following entries: |
| |
| - "hw_target" - if the target driver supports both hardware and virtual |
| targets (for instance, an FC adapter supporting NPIV, which has |
| hardware targets for its physical ports as well as virtual NPIV |
| targets), this read only attribute for all hardware targets will |
| exist and contain value 1. |
| |
| Subdirectory "sessions" contains one subdirectory for each connected |
| session with name equal to name of the connected initiator with the |
| following entries: |
| |
| - initiator_name - contains initiator name |
| |
| - force_close - optional write-only attribute, which allows to force |
| close this session. |
| |
| - active_commands - contains number of active SCSI commands in this |
| session, i.e. commands not yet executed or currently being executed. |
| |
| - commands - contains overall number of SCSI commands in this session. |
| |
| - dif_checks_failed - if target of this session supports T10-PI, returns |
| statistics how many DIF errors have been detected on the |
| corresponding processing stages on all DIF-enabled LUNs in this |
| session. It returns 3 rows of numbers with 3 numbers in each row: for |
| target driver stage, for SCST stage and for dev handler stage. |
| Numbers in each row: how many errors detected checking application, |
| reference and guard tags correspondingly. Writing to this attribute |
| resets the numbers. Similar statistics returned in attribute with the |
| same name for each LUN in this session in this LUN's subdirectory, if |
| its device configured with dif_type > 0. |
| |
| - read_cmd_count - number of READ SCSI commands received since beginning |
| or last reset (writing 0 in this attribute) |
| |
| - read_io_count_kb - amount of data in KB read by the initiator since |
| beginning or last reset (writing 0 in this attribute) |
| |
| - write_cmd_count - number of WRITE SCSI commands received since |
| beginning or last reset (writing 0 in this attribute) |
| |
| - write_io_count_kb - amount of data in KB written by the initiator |
| since beginning or last reset (writing 0 in this attribute) |
| |
| - bidi_cmd_count - number of BIDI SCSI commands received since |
| beginning or last reset (writing 0 in this attribute) |
| |
| - bidi_io_count_kb - amount of data in KB transferred by the |
| initiator since beginning or last reset (writing 0 in this attribute) |
| |
| - none_cmd_count - number of not transferring data SCSI commands |
| (e.g. INQUIRY or TEST UNIT READY) received since beginning or last |
| reset (writing 0 in this attribute) |
| |
| - unknown_cmd_count - number of unknown SCSI commands received since |
| beginning or last reset (writing 0 in this attribute) |
| |
| - *count*, e.g. read_io_count_kb, - statistics about executed |
| commands and transferred data. See above for more details. |
| |
| - luns - a link pointing to the LUNs set (security group) this session |
| is attached to. |
| |
| - One or more "lunX" subdirectories, where 'X' is a number, for each LUN |
| this session has (see below). |
| |
| - other target driver specific attributes and subdirectories. |
| |
| See below description of the VDISK's sysfs interface for samples. |
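| |
| The per-session entries listed above can be read with a short loop. |
| The following is a sketch only: list_sessions and the overridable |
| SCST_ROOT variable are hypothetical names introduced for illustration. |
| |
```shell
# Root of the SCST sysfs tree; overridable for experiments (assumption).
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

# list_sessions DRIVER TARGET
# Print one "<initiator_name> <active_commands>" line per session.
list_sessions() {
    for sess in "$SCST_ROOT/targets/$1/$2/sessions"/*/; do
        # With no sessions the glob stays unexpanded, so skip such entries.
        [ -r "${sess}initiator_name" ] || continue
        printf '%s %s\n' \
            "$(cat "${sess}initiator_name")" \
            "$(cat "${sess}active_commands")"
    done
}
```
| |
| For instance: list_sessions iscsi iqn.2006-10.net.vlnb:tgt1 |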
| |
| |
| Each sessions/<sess>/lun<X> subdirectory contains the following entries: |
| |
| - active_commands - contains number of active SCSI commands for lun<X> |
| in session <sess>, i.e. commands not yet executed or currently being |
| executed. |
| |
| - thread_pid - contains a single line with all the process identifiers |
| (PIDs) of the kernel threads that process SCSI commands intended for |
| lun<X> in session <sess>. |
| |
| - thread_index - thread index assigned by scst_add_threads(). |
| Can be used to look up which export thread is serving which target |
| since this index also appears in the export thread name. This |
| information then could be used to set CPU affinity for those threads |
| to improve performance. Has a value in the range 0..n-1 for |
| threads_pool_type per_initiator or -1 when using a shared thread pool |
| per LUN or the global thread pool. |
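| |
| As noted above, the thread information can be used to set CPU |
| affinity. A minimal sketch follows, assuming the PIDs in thread_pid |
| are separated by whitespace and that taskset from util-linux is |
| installed. The function names and SCST_ROOT are hypothetical, and |
| whether the kernel permits changing the affinity of these threads is |
| not verified here. |
| |
```shell
# Root of the SCST sysfs tree; overridable for experiments (assumption).
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

# lun_thread_pids DRIVER TARGET SESSION LUN
# Print the PIDs of the kernel threads serving lun<LUN> in SESSION.
lun_thread_pids() {
    cat "$SCST_ROOT/targets/$1/$2/sessions/$3/lun$4/thread_pid"
}

# pin_lun_threads CPULIST DRIVER TARGET SESSION LUN
# Pin every thread returned by lun_thread_pids to CPULIST (e.g. "0-3").
pin_lun_threads() {
    cpus=$1; shift
    for pid in $(lun_thread_pids "$@"); do
        taskset -cp "$cpus" "$pid" || return 1
    done
}
```
| |
| For instance: pin_lun_threads 0-3 iscsi iqn.2006-10.net.vlnb:tgt1 \ |
| some_initiator 0, where some_initiator is a placeholder session name. |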
| |
| |
| Access and devices visibility management (LUN masking) |
| ------------------------------------------------------ |
| |
| Access and devices visibility management allows for an initiator or |
| group of initiators to see different devices with different LUNs |
| with necessary access permissions. |
| |
| SCST supports two modes of access control: |
| |
| 1. Target-oriented. In this mode you define for each target a default |
| set of LUNs, which are accessible to all initiators connected to that |
| target. This is the regular access control mode, which people usually |
| have in mind when thinking about access control in general. For |
| instance, in IET this is the only supported mode. |
| |
| 2. Initiator-oriented. In this mode you define which LUNs are accessible |
| for each initiator. For each set of one or more initiators that should |
| access the same set of devices with the same LUNs, you create a |
| separate security group, then add to it the devices and the names of |
| the allowed initiator(s). |
| |
| Both modes can be used simultaneously. In this case the |
| initiator-oriented mode has higher priority than the target-oriented |
| one, i.e. initiators are first searched in all security groups defined |
| for this target and, if none matches, the target's default set of LUNs |
| is used. This set of LUNs might be empty, in which case the initiator |
| will not see any LUNs from the target. |
| |
| You can at any time find out which set of LUNs each session is assigned |
| to by looking where link |
| /sys/kernel/scst_tgt/targets/target_driver/target_name/sessions/initiator_name/luns |
| points to. |
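| |
| Resolving that link can be scripted. The sketch below assumes a |
| readlink that supports the -f option (as in GNU coreutils). |
| session_lun_set and SCST_ROOT are hypothetical names used only for |
| illustration. |
| |
```shell
# Root of the SCST sysfs tree; overridable for experiments (assumption).
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

# session_lun_set DRIVER TARGET SESSION
# Print the resolved path of the LUN set the session is assigned to.
session_lun_set() {
    readlink -f "$SCST_ROOT/targets/$1/$2/sessions/$3/luns"
}
```
| |
| The printed path shows whether the session uses a security group or |
| the default set of LUNs of the target. |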
| |
| To configure the target-oriented access control SCST provides the |
| following interface. Each target's sysfs subdirectory |
| (/sys/kernel/scst_tgt/targets/target_driver/target_name) has "luns" |
| subdirectory. This subdirectory contains the list of already defined |
| target-oriented access control LUNs for this target as well as file |
| "mgmt". This file has the following commands, which you can send to it, |
| for instance, using "echo" shell command. You can always get a small |
| help about supported commands by looking inside this file. "Parameters" |
| are one or more param_name=value pairs separated by ';'. |
| |
| - "add H:C:I:L lun [parameters]" - adds a pass-through device with |
| host:channel:id:lun with LUN "lun". Optionally, the device could be |
| marked as read only by using parameter "read_only". The recommended |
| way to find out H:C:I:L numbers is use of lsscsi utility. |
| |
| - "replace H:C:I:L lun [parameters]" - replaces the device currently |
| exported as LUN "lun" with the pass-through device at |
| host:channel:id:lun and generates an INQUIRY DATA HAS CHANGED Unit |
| Attention. If the old device doesn't exist, this command acts as the |
| "add" command. Optionally, the device can be marked as read only by |
| using parameter "read_only". The recommended way to find out H:C:I:L |
| numbers is to use the lsscsi utility. |
| |
| - "add VNAME lun [parameters]" - adds a virtual device with name VNAME |
| with LUN "lun". Optionally, the device could be marked as read only |
| by using parameter "read_only". |
| |
| - "replace VNAME lun [parameters]" - replaces the device currently |
| exported as LUN "lun" with the virtual device named VNAME and |
| generates an INQUIRY DATA HAS CHANGED Unit Attention. If the old |
| device doesn't exist, this command acts as the "add" command. |
| Optionally, the device can be marked as read only by using parameter |
| "read_only". |
| |
| - "del lun" - deletes LUN lun |
| |
| - "clear" - clears the list of devices |
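| |
| The commands above are plain strings written to the "mgmt" file, so a |
| small wrapper can reject misspelled verbs before they reach the |
| kernel. This is a sketch under that assumption; scst_lun_cmd is a |
| hypothetical helper, not an SCST tool. |
| |
```shell
# scst_lun_cmd LUNS_DIR VERB [ARGS...]
# Send one of the documented commands to LUNS_DIR/mgmt.
scst_lun_cmd() {
    luns_dir=$1; shift
    case $1 in
        add|replace|del|clear) ;;
        *) echo "unsupported command: $1" >&2; return 1 ;;
    esac
    # "$*" joins the verb and its arguments with single spaces.
    printf '%s\n' "$*" > "$luns_dir/mgmt"
}
```
| |
| For instance: scst_lun_cmd \ |
| /sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/luns \ |
| add disk1 0 read_only=1 |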
| |
| To configure the initiator-oriented access control SCST provides the |
| following interface. Each target's sysfs subdirectory |
| (/sys/kernel/scst_tgt/targets/target_driver/target_name) has "ini_groups" |
| subdirectory. This subdirectory contains the list of already defined |
| security groups for this target as well as file "mgmt". This file has |
| the following commands, which you can send to it, for instance, using |
| "echo" shell command. You can always get a small help about supported |
| commands by looking inside this file. |
| |
| - "create GROUP_NAME" - creates a new security group. |
| |
| - "del GROUP_NAME" - deletes security group GROUP_NAME. |
| |
| Each security group's subdirectory contains 2 subdirectories, |
| "initiators" and "luns", as well as the following attributes: |
| addr_method, cpu_mask, io_grouping_type and black_hole. See their |
| description above. |
| |
| Each "initiators" subdirectory contains the list of initiators added to |
| this group as well as file "mgmt". This file has the following |
| commands, which you can send to it, for instance, using "echo" shell |
| command. You can always get a small help about supported commands by |
| looking inside this file. |
| |
| - "add INITIATOR_NAME" - adds initiator with name INITIATOR_NAME to the |
| group. |
| |
| - "del INITIATOR_NAME" - deletes initiator with name INITIATOR_NAME |
| from the group. |
| |
| - "move INITIATOR_NAME DEST_GROUP_NAME" - moves initiator with name |
| INITIATOR_NAME from the current group to group with name |
| DEST_GROUP_NAME. |
| |
| - "clear" - deletes all initiators from this group. |
| |
| For "add" and "del" commands INITIATOR_NAME can be a simple DOS-type |
| pattern containing '*' and '?' symbols. '*' matches any sequence of |
| symbols, '?' matches exactly one symbol. For instance, "blah.xxx" |
| matches "bl?h.*". Additionally, you can use the negation sign '!' to |
| invert the result of the pattern. For instance, "ah.xxx" matches |
| "!bl?h.*". |
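| |
| To check a pattern before adding it to a group, the matching rules |
| above can be approximated in the shell. This is a sketch, not the |
| matcher used by SCST: ini_matches is a hypothetical helper, and shell |
| globs additionally treat '[' and '\' specially, which the DOS-type |
| patterns described here do not mention. |
| |
```shell
# ini_matches NAME PATTERN
# Succeed if NAME matches PATTERN, where '*' matches any run of
# characters, '?' matches one character and a leading '!' inverts
# the result.
ini_matches() {
    name=$1
    pattern=$2
    negate=0
    case $pattern in
        '!'*) negate=1; pattern=${pattern#?} ;;
    esac
    matched=1
    # An unquoted variable used as a case pattern is treated as a glob.
    case $name in
        $pattern) matched=0 ;;
    esac
    if [ "$negate" -eq 1 ]; then
        [ "$matched" -ne 0 ]
    else
        [ "$matched" -eq 0 ]
    fi
}
```
| |
| For instance, ini_matches blah.xxx 'bl?h.*' succeeds, and so does |
| ini_matches ah.xxx '!bl?h.*'. |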
| |
| Each "luns" subdirectory contains the list of already defined LUNs for |
| this group as well as file "mgmt". Content of this file as well as the |
| list of commands available in it is identical to the "luns" |
| subdirectory of the target-oriented access control. |
| |
| Examples: |
| |
| - echo "create INI" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/mgmt - |
| creates security group INI for target iqn.2006-10.net.vlnb:tgt1. |
| |
| - echo "add 2:0:1:0 11" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI/luns/mgmt - |
| adds a pass-through device sitting on host 2, channel 0, ID 1, LUN 0 |
| to group with name INI as LUN 11. |
| |
| - echo "add disk1 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI/luns/mgmt - |
| adds a virtual disk with name disk1 to group with name INI as LUN 0. |
| |
| - echo "add 21:*:e0:?b:83:*" >/sys/kernel/scst_tgt/targets/21:00:00:a0:8c:54:52:12/ini_groups/INI/initiators/mgmt - |
| adds a pattern to group with name INI to Fibre Channel target with |
| WWN 21:00:00:a0:8c:54:52:12, which matches WWNs of Fibre Channel |
| initiator ports. |
| |
| Consider you need to have an iSCSI target with name |
| "iqn.2007-05.com.example:storage.disk1.sys1.xyz", which should export |
| virtual device "dev1" with LUN 0 and virtual device "dev2" with LUN 1, |
| but initiator with name |
| "iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" should see only |
| virtual device "dev2" read only with LUN 0. To achieve that you should |
| run the following commands: |
| |
| # echo "add_target iqn.2007-05.com.example:storage.disk1.sys1.xyz" >/sys/kernel/scst_tgt/targets/iscsi/mgmt |
| # echo "add dev1 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/luns/mgmt |
| # echo "add dev2 1" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/luns/mgmt |
| # echo "create SPEC_INI" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/ini_groups/mgmt |
| # echo "add dev2 0 read_only=1" \ |
| >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/ini_groups/SPEC_INI/luns/mgmt |
| # echo "add iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" \ |
| >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/ini_groups/SPEC_INI/initiators/mgmt |
| |
| For Fibre Channel or SAS in the above example you should use target's |
| and initiator ports WWNs instead of iSCSI names. |
| |
| It is highly recommended to use the scstadmin utility instead of the |
| low level interface described in this section. |
| |
| IMPORTANT |
| ========= |
| |
| There must be LUN 0 in each set of LUNs, i.e. LU numbering must not |
| start from, e.g., 1. Otherwise you will see no devices on remote |
| initiators and the SCST core will write the following message into the |
| kernel log: "tgt_dev for LUN 0 not found, command to unexisting LU?" |
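| |
| The LUN 0 rule can be checked from a script. The sketch below assumes |
| that every defined LUN shows up in a "luns" subdirectory as an entry |
| named by its number. lun_set_ok is a hypothetical helper. An empty set |
| is accepted, since an empty default set of LUNs is allowed. |
| |
```shell
# lun_set_ok LUNS_DIR
# Succeed if LUNS_DIR is empty or contains LUN 0; fail if LUNs are
# defined but LUN 0 is missing.
lun_set_ok() {
    [ -e "$1/0" ] && return 0
    for entry in "$1"/[0-9]*; do
        # Reached only when LUN 0 is absent: any numbered entry is an error.
        [ -e "$entry" ] && return 1
    done
    return 0
}
```
| |
| For instance, run it over the "luns" subdirectory of a target and of |
| each of its security groups before enabling the target. |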
| |
| IMPORTANT |
| ========= |
| |
| All the access control must be fully configured BEFORE the corresponding |
| target is enabled. When you enable a target, it will immediately start |
| accepting new connections, hence creating new sessions, and those new |
| sessions will be assigned to security groups according to the |
| *currently* configured access control settings. For instance, a session |
| will be assigned to the target's default set of LUNs instead of the |
| "HOST004" group you intended, because "HOST004" doesn't exist yet. So |
| you must configure all the security groups before new connections from |
| the initiators are created, i.e. before the target is enabled. |
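| |
| One way to enforce this ordering in a script is to write to the |
| "enabled" attribute only after the configuration step has succeeded. |
| This is a sketch; enable_after is a hypothetical helper and the |
| configuration command passed to it is supplied by the caller. |
| |
```shell
# enable_after TARGET_DIR COMMAND...
# Run COMMAND (the access control setup) and enable the target only if
# it succeeded; otherwise leave the target untouched.
enable_after() {
    tgt_dir=$1; shift
    if ! "$@"; then
        echo "configuration failed, $tgt_dir not enabled" >&2
        return 1
    fi
    echo 1 > "$tgt_dir/enabled"
}
```
| |
| For instance: enable_after \ |
| /sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1 \ |
| my_setup_groups, where my_setup_groups is a placeholder for your own |
| function that creates the security groups. |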
| |
| |
| VDISK device handler |
| -------------------- |
| |
| Starting from 2.0.0 VDISK device handler uses sysfs interface. |
| |
| VDISK has 4 built-in dev handlers: vdisk_fileio, vdisk_blockio, |
| vdisk_nullio and vcdrom. Roots of their sysfs interface are |
| /sys/kernel/scst_tgt/handlers/handler_name, e.g. for vdisk_fileio: |
| /sys/kernel/scst_tgt/handlers/vdisk_fileio. Each root has the following |
| entries: |
| |
| - None, one or more links to devices with name equal to names |
| of the corresponding devices. |
| |
| - trace_level - allows enabling and disabling various tracing |
| facilities. See content of this file for help on how to use it. See |
| also section "Dealing with massive logs" for more info on how to get |
| correct logs when the enabled trace levels produce a lot of log data. |
| |
| - mgmt - main management entry, which allows to add/delete VDISK |
| devices with the corresponding type. |
| |
| The "mgmt" file has the following commands, which you can send to it, |
| for instance, using "echo" shell command. You can always get a small |
| help about supported commands by looking inside this file. "Parameters" |
| are one or more param_name=value pairs separated by ';'. |
| |
| - echo "add_device device_name [parameters]" - adds a virtual device |
| with name device_name and specified parameters (see below) |
| |
| - echo "del_device device_name" - deletes a virtual device with name |
| device_name. |
| |
| Handler vdisk_fileio provides FILEIO mode to create virtual devices. |
| This mode uses files as backend and accesses them using regular |
| read()/write() file calls. This allows using the full power of the |
| Linux page cache. The following parameters are possible for |
| vdisk_fileio: |
| |
| - filename - specifies path and file name of the backend file. The path |
| must be absolute. |
| |
| - blocksize - specifies block size used by this virtual device. The |
| block size must be power of 2 and >= 512 bytes. Default is 512. |
| |
| - opt_trans_len - specifies the optimal transfer length data in the block |
| limits VPD page. Value is in bytes, and must be a multiple of the block |
| size. Default is 524288. Setting this parameter to a multiple of the |
| optimal transfer length below 4 MB may improve performance. Setting this |
| parameter to a value above 4 MB hurts performance because the SGV cache |
| only supports buffers up to 4 MB. |
| |
| - write_through - disables write back caching. Note, this option |
| makes sense only if you also *manually* disable write-back cache in |
| *all* your backstorage devices and make sure it's actually disabled, |
| since many devices are known to lie about this mode to get better |
| benchmark results. Default is 0. |
| |
| - read_only - read only. Default is 0. |
| |
| - async - submit I/O asynchronously to the device handler. This mode |
| allows concurrent processing of SCSI commands even when using only |
| a single SCST command thread. This mode is only supported for kernel |
| version 4.1 and later. RHEL 8 is the first RHEL version that supports |
| in-kernel asynchronous file I/O. |
| |
| - o_direct - disables both read and write caching if asynchronous |
| I/O is used. This mode bypasses the page cache and hence improves |
| performance. |
| |
| - nv_cache - enables "non-volatile cache" mode. In this mode it is |
| assumed that the target has a GOOD UPS with the ability to cleanly |
| shut down the target in case of power failure and that it is free of |
| software/hardware bugs, i.e. all data from the target's cache are |
| guaranteed sooner or later to go to the media. Hence all operations |
| synchronizing data with the media, like SYNCHRONIZE_CACHE, are ignored |
| in order to gain performance. Also in this mode the target reports to |
| initiators that the corresponding device has write-through cache to |
| disable all write-back cache workarounds used by initiators. Use with |
| extreme caution, since in this mode after a crash of the target |
| journaled file systems don't guarantee consistency after journal |
| recovery, therefore manual fsck MUST be run. Note that since the |
| journal barrier protection (see "IMPORTANT" note below) is usually |
| turned off, enabling NV_CACHE could change nothing from the data |
| protection point of view, since no data synchronization operations |
| will come from the initiator. This option overrides the |
| "write_through" option. Disabled by default. |
| |
| - thin_provisioned - enables thin provisioning facility, when remote |
| initiators can unmap blocks of storage, if they don't need them |
| anymore. Backend storage also must support this facility. |
| |
| - tst - allows to specify TST control mode page field. It specifies |
| the type of task set in the device. Possible values are: 0 - the |
| device maintains one task set for all I_T nexuses and 1 - the device |
| maintains separate task sets for each I_T nexus. Default - 1. |
| |
| - removable - with this flag set the device is reported to remote |
| initiators as removable. |
| |
| - rotational - if set, this device is reported as rotational. Otherwise, |
| it is reported as non-rotational (SSD, etc.) |
| |
| - zero_copy - ignored. For zero-copy I/O, set the async flag and |
| possibly also the o_direct flag and use Linux kernel v4.10 or later. |
| |
| - dif_mode - specifies which T10-PI, or DIF, mode this device will use. |
| See SCSI standards for more info about T10-PI. Available DIF modes |
| (can be combined using '|'): |
| |
| * tgt - DIF tags are checked on the target hardware, if supported |
| |
| * scst - DIF tags are checked inside SCST core |
| |
| * dev_check - DIF tags are checked inside backend device. No DIF |
| tags storing is required, but optionally possible. |
| |
| * dev_store - DIF tags are stored inside backend device on the WRITE |
| path and read from it on the READ path. No DIF tags checking is |
| required, but optionally possible. |
| |
| For instance, if only tgt DIF mode is specified, then the target |
| driver serving this device will check DIF tags inside hardware, then |
| STRIP them from SCSI commands on the WRITE path, and generate, then |
| INSERT DIF tags into SCSI commands on the READ path, so neither the |
| SCST core nor the dev handler will see them. |
| |
| Similarly, if only scst DIF mode specified, then target driver will |
| PASS DIF tags into SCST core, which then check/STRIP/generate/INSERT |
| them, so dev handler will not see them. |
| |
| If only dev_check DIF mode specified, then both target driver and |
| SCST core will PASS DIF tags into the dev handler, which is then |
| responsible to check them in the backend hardware. If only dev_store |
| specified, then DIF tags will only be stored by the dev handler in |
| the backend hardware without checking at any level. |
| |
| If all of "tgt|scst|dev_check|dev_store" DIF modes are specified, then |
| the target driver, SCST core and dev handler will all check DIF tags, |
| then the dev handler will store them in the backend hardware. |
| |
| - dif_type - specifies which DIF SCSI type this device will use. |
| |
| - dif_static_app_tag - specifies fixed (static) DIF application tag for |
| this device. |
| |
| - dif_filename - specifies full path to filename, where DIF tags will |
| be stored. |
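| |
| The numeric constraints stated above for blocksize (a power of 2, at |
| least 512) and opt_trans_len (a multiple of the block size) can be |
| checked before a device is created. This is a sketch; both function |
| names are hypothetical and the checks mirror only what this section |
| states. |
| |
```shell
# valid_blocksize BYTES
# Succeed if BYTES is a decimal number, >= 512 and a power of 2.
valid_blocksize() {
    bs=$1
    case $bs in
        ''|*[!0-9]*) return 1 ;;
    esac
    # A power of 2 has exactly one bit set, so bs & (bs - 1) is 0.
    [ "$bs" -ge 512 ] && [ $(( bs & (bs - 1) )) -eq 0 ]
}

# valid_opt_trans_len BYTES BLOCKSIZE
# Succeed if BYTES is a positive multiple of BLOCKSIZE.
valid_opt_trans_len() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 0 ] && [ $(( $1 % $2 )) -eq 0 ]
}
```
| |
| For instance, valid_blocksize 4096 succeeds while valid_blocksize 1000 |
| fails, and valid_opt_trans_len 524288 4096 succeeds. |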
| |
| Handler vdisk_blockio provides BLOCKIO mode to create virtual devices. |
| This mode performs direct block I/O with a block device, bypassing the |
| page cache for all operations. This mode works ideally with high-end |
| storage HBAs and for applications that either do not need caching |
| between application and disk or need the large block throughput. See |
| below for more info. |
| |
| The following parameters are possible for vdisk_blockio: filename, |
| blocksize, nv_cache, read_only, removable, rotational, thin_provisioned, |
| tst, dif_mode, dif_type, dif_static_app_tag, dif_filename. See |
| vdisk_fileio above for description of those parameters. |
| |
| vdisk_blockio devices have the following two additional attributes: |
| |
| - active - if this flag is set (the default), the backing block device |
| will be opened when the SCST device is added/opened. If a SCST device |
| is opened with active=0 then the backing block device will not be |
| opened, allowing for an active/passive SCST configuration. In addition, |
| this attribute is writable via sysfs allowing the user to open/close the |
| backing block device on the fly, or via a script. |
| |
| - bind_alua_state - if this flag is set (the default), when the device is |
| associated with an ALUA device group, and a target group ALUA state |
| changes to the active/nonoptimized state, the active attribute will be |
| set to 1 which attempts to open the backing block device. If the target |
| group ALUA state changes to a value other than active/nonoptimized, the |
| backing device will be closed (active=0). If bind_alua_state=0 for a |
| device the ALUA state changes have NO effect on the active attribute, |
| it is left up to the user to use a script, or manually set the active |
| attribute to open/close the backing block device. |
| |
| Handler vdisk_nullio provides NULLIO mode to create virtual devices. In |
| this mode no real I/O is done, but success is returned to initiators. |
| It is intended to be used for performance measurements in the same way |
| as "*_perf" handlers. The following parameters are possible for |
| vdisk_nullio: blocksize, read_only, removable, tst. See vdisk_fileio |
| above for description of those parameters. |
| |
| vdisk_nullio devices have the following two additional attributes: |
| |
| - dummy - if this flag is set, LUNs corresponding to this device will |
| not appear at the initiator side. This is because SCST will set the |
| PERIPHERAL QUALIFIER field to 1 (not connected) and the |
| PERIPHERAL DEVICE TYPE to 0x1f (no device) in the INQUIRY response. |
| See also SPC-4 for more information. It is designed to be used as a |
| "dummy" placeholder on LUN 0, if LUN 0 is not desired. |
| |
| - read_zero - if this flag is set, reading from a vdisk_nullio device |
| returns a buffer filled with byte 0x00. If this flag is cleared |
| (which is the default behavior), the buffer returned to the |
| initiator is not cleared. Although this results in slightly faster |
| operation, it is a security hole because any data present in kernel |
| memory can be returned to the initiator. |
| |
| Handler vcdrom allows emulation of a virtual CDROM device using an ISO |
| file as backend. It has only a single parameter: tst. |
| |
| For example: |
| |
| echo "add_device disk1 filename=/disk1; blocksize=4096; nv_cache=1" >/sys/kernel/scst_tgt/handlers/vdisk_fileio/mgmt |
| |
| will create a FILEIO virtual device disk1 with backend file /disk1 |
| with block size 4K and NV_CACHE enabled. |
| |
| Each vdisk_fileio device has the following attributes in |
| /sys/kernel/scst_tgt/devices/device_name: |
| |
| - filename - contains path and file name of the backend file. |
| |
| - blocksize - contains block size used by this virtual device. |
| |
| - opt_trans_len - contains the optimal transfer length used by this virtual |
| device. |
| |
| - write_through - contains status of write back caching of this virtual |
| device. |
| |
| - sync - writing into this attribute causes the page cache contents to |
| be flushed to disk. |
| |
| - read_only - contains read only status of this virtual device. |
| |
| - o_direct - contains O_DIRECT status of this virtual device. |
| |
| - inq_vend_specific - Vendor specific data that will be reported via |
| either bytes 36..55 or bytes 96..256 of the INQUIRY response, depending |
| on whether this field is <= 20 or > 20 bytes long. |
| |
| - nv_cache - contains NV_CACHE status of this virtual device. |
| |
| - prod_id - PRODUCT IDENTIFICATION as reported via the INQUIRY response. |
| The default value for this field is the SCST device name. |
| |
| - prod_rev_lvl - PRODUCT REVISION LEVEL as reported via the INQUIRY |
| response. The default value for this field is " 300". |
| |
| - scsi_device_name - optional name of the SCSI target device to which |
| this SCST device belongs (in SCSI terminology all SCST devices are |
| called Logical Units). See SPC for more info. |
| |
| - tst - contains TST field of SCSI Control mode page. See SPC-4 for |
| more details about this field. |
| |
| - thin_provisioned - contains thin provisioning status of this virtual |
| device. |
| |
| - gen_tp_soft_threshold_reached_UA - for thin provisioned devices, |
| writing anything into this write-only attribute will generate a THIN |
| PROVISIONING SOFT THRESHOLD REACHED Unit Attention for all initiators |
| connected to this device. |
| |
| - removable - contains removable status of this virtual device. |
| |
| - rotational - contains rotational status of this virtual device. |
| |
| - size_mb - contains size of this virtual device in MB. |
| |
| - pr_file_name - Full path of the file or block device in which to store |
| persistent reservation information. The default value for this attribute is |
| /var/lib/scst/pr/${device_name}. Writing a new value into this sysfs |
| attribute is only allowed if the device is not exported. Modifying this |
| sysfs attribute causes the persistent reservation state to be reloaded. |
| |
| - t10_dev_id - contains the T10 vendor specific identifier for the |
| Device Identification VPD page (0x83) of INQUIRY data and allows |
| setting it. By default the VDISK handler generates t10_dev_id for |
| every newly created device at creation time, based on the device name |
| and the scst_vdisk_ID scst_vdisk.ko module parameter for procfs (see |
| below) or the SCST setup_id when using the sysfs interface (see above). |
| Note: some initiators, e.g. VMware's ESXi or MS Hyper-V, only look at |
| the first eight characters of t10_dev_id. You have to make sure that |
| these first eight characters are unique, or such initiators will |
| consider these devices identical. |
| |
| - eui64_id - allows setting the EUI-64 based device identifier in the |
| SCSI device identification VPD page (83h). This identifier must be 8, |
| 12 or 16 bytes long and must be specified in hexadecimal format (EUI = |
| Extended Unique Identifier). A leading "0x" is allowed but is not |
| required. Writing a newline into this attribute discards the EUI-64 |
| identifier. If neither eui64_id nor naa_id have been set the first |
| eight bytes of the t10_dev_id are used as the EUI-64 ID. If naa_id has |
| been set but eui64_id has not been set no EUI-64 identifier is |
| reported in the SCSI device identification VPD page. If eui64_id has |
| been set the value of this attribute is reported as the EUI-64 ID. The |
| first three bytes of an EUI-64 ID are a so-called organizationally |
| unique identifier (OUI). The remaining bytes may be chosen by the |
| organization that owns the OUI. For more information about OUIs, see |
| also http://standards.ieee.org/develop/regauth/oui/public.html. |
| |
| - naa_id - allows setting the NAA ID in the SCSI INQUIRY response (NAA = |
| Network Address Authority). This identifier must be 8 or 16 bytes long |
| and must be specified in hex format. A leading "0x" is allowed but is |
| not required. Writing a newline into this attribute discards the NAA |
| ID. If this ID is set it is reported in the SCSI VPD device |
| identification page (83h). More information about NAA identifiers can |
| be found in the following documents: |
| * ANSI T11 committee, Fibre Channel Framing and Signaling Interface - 4 |
| (FC-FS-4) rev 0.50, May 2014 (http://www.t11.org/). |
| * IETF, RFC 3980 - T11 Network Address Authority (NAA) Naming Format for |
| iSCSI Node Names, February 2005 (https://tools.ietf.org/html/rfc3980). |
| |
| - t10_vend_id - Contents of the T10 VENDOR IDENTIFICATION field of the |
| INQUIRY response. The default value for this field is "SCST_BIO" for |
| vdisk_blockio devices and "SCST_FIO" for vdisk_fileio devices. |
| |
| - usn - contains the virtual device's serial number of INQUIRY data. It |
| is created at the device creation time based on the device name and |
| scst_vdisk_ID scst_vdisk.ko module parameter for procfs (see below) |
| or the SCST setup_id when using the sysfs interface (see above). |
| |
| - type - contains SCSI type of this virtual device. |
| |
| - resync_size - write-only attribute, which makes vdisk_fileio rescan |
| the size of the backend file. It is useful if the size has changed, |
| for instance, after you resized the file. |
| |
| - vend_specific_id - Vendor specific ID as reported via the Device |
| Identification VPD page (83h). The default value for this attribute |
| is the value of the t10_dev_id attribute. |
| |
| For example: |
| |
| /sys/kernel/scst_tgt/devices/disk1 |
| |-- block |
| |-- blocksize |
| |-- opt_trans_len |
| |-- exported |
| | |-- export0 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt/luns/0 |
| | |-- export1 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt/ini_groups/INI/luns/0 |
| | |-- export2 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt1/luns/0 |
| | |-- export3 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI1/luns/0 |
| | |-- export4 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI2/luns/0 |
| |-- filename |
| |-- handler -> ../../handlers/vdisk_fileio |
| |-- nv_cache |
| |-- o_direct |
| |-- read_only |
| |-- removable |
| |-- resync_size |
| |-- rotational |
| |-- size_mb |
| |-- t10_dev_id |
| |-- thin_provisioned |
| |-- threads_num |
| |-- threads_pool_type |
| |-- tst |
| |-- type |
| |-- usn |
| `-- write_through |
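| |
| A minimal sketch for inspecting such a directory from a script follows. |
| The show_attrs name and the SCST_ROOT variable are hypothetical; the |
| function prints only the first line of each readable attribute file and |
| skips subdirectories and symlinks such as exported and handler. |
```shell
# Root of the SCST sysfs tree (SCST_ROOT is a name invented for this sketch).
: "${SCST_ROOT:=/sys/kernel/scst_tgt}"

# show_attrs DEVICE
# Print "name: value" for every readable regular attribute file of DEVICE,
# where value is the first line of the file.
show_attrs() {
    sh_dir="$SCST_ROOT/devices/$1"
    if [ ! -d "$sh_dir" ]; then
        echo "show_attrs: no such device: $1" >&2
        return 1
    fi
    for sh_file in "$sh_dir"/*; do
        # Skip subdirectories and symlinks (exported/, handler, ...).
        [ -f "$sh_file" ] && [ ! -L "$sh_file" ] || continue
        # Skip attributes that cannot be read (write-only ones).
        [ -r "$sh_file" ] || continue
        sh_value=
        IFS= read -r sh_value 2>/dev/null <"$sh_file" || :
        printf '%s: %s\n' "${sh_file##*/}" "$sh_value"
    done
}

# Usage (disk1 is a hypothetical device name):
#   show_attrs disk1
```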
| |
| Each vdisk_blockio device has the following attributes in |
| /sys/kernel/scst_tgt/devices/device_name: blocksize, filename, nv_cache, |
| read_only, removable, resync_size, rotational, size_mb, t10_dev_id, |
| thin_provisioned, gen_tp_soft_threshold_reached_UA, threads_num, |
| threads_pool_type, tst, type, usn. See the descriptions of those |
| attributes above. |
| |
| Each vdisk_nullio device has the following attributes in |
| /sys/kernel/scst_tgt/devices/device_name: blocksize, read_only, |
| removable, size_mb, t10_dev_id, threads_num, threads_pool_type, type, |
| tst, usn, dummy. See the descriptions of those attributes above. |
| |
| Each vcdrom device has the following attributes in |
| /sys/kernel/scst_tgt/devices/device_name: filename, size_mb, |
| t10_dev_id, threads_num, threads_pool_type, type, usn, tst. See the |
| descriptions of those attributes above. The exception is the filename |
| attribute, which is writable for vcdrom. Writing to it virtually |
| inserts or changes the CD media in the virtual CDROM device. For |
| example: |
| |
| - echo "/image.iso" >/sys/kernel/scst_tgt/devices/cdrom/filename - will |
| insert file /image.iso as virtual media to the virtual CDROM cdrom. |
| |
| - echo "" >/sys/kernel/scst_tgt/devices/cdrom/filename - will remove |
| "media" from the virtual CDROM cdrom. |
| |
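| The two commands above can be wrapped in small helpers, sketched here |
| under the assumption that the virtual CDROM is an existing SCST device. |
| The names change_media, eject_media and SCST_ROOT are hypothetical. |
```shell
# Root of the SCST sysfs tree (SCST_ROOT is a name invented for this sketch).
: "${SCST_ROOT:=/sys/kernel/scst_tgt}"

# eject_media CDROM
# Remove the virtual medium by writing an empty line to "filename".
eject_media() {
    echo "" >"$SCST_ROOT/devices/$1/filename"
}

# change_media CDROM ISO_PATH
# Insert ISO_PATH as the virtual medium, after checking that it is readable.
change_media() {
    if [ ! -r "$2" ]; then
        echo "change_media: cannot read image: $2" >&2
        return 1
    fi
    echo "$2" >"$SCST_ROOT/devices/$1/filename"
}

# Usage:
#   change_media cdrom /image.iso
#   eject_media cdrom
```
| |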
| Additionally, the VDISK handler has the module parameter "num_threads", |
| which specifies the number of I/O threads for each FILEIO VDISK or |
| VCDROM device. If you have a workload that tends to produce rather |
| random accesses (e.g. DB-like), you should increase this count to a |
| bigger value, like 32. If you have a rather sequential workload, you |
| should decrease it to a lower value, like the number of CPUs on the |
| target or even 1. Due to some limitations of the Linux I/O subsystem, |
| increasing the number of I/O threads too much leads to a drop in |
| sequential performance, especially with the deadline scheduler, so |
| decreasing it can improve sequential performance. The default provides |
| a good compromise between random and sequential accesses. |
| |
| You shouldn't be afraid of having too many VDISK I/O threads if you have |
| many VDISK devices. Kernel threads consume very few resources (several |
| KBs) and SCST only uses the threads it needs, so the threads will not |
| thrash your system. |
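| |
| The rule of thumb above can be written down as follows. This is only an |
| illustration: suggest_num_threads is a hypothetical helper, the values |
| 32 and "number of CPUs" are the ones suggested above, and passing the |
| parameter to the scst_vdisk module via modprobe is an assumption about |
| how the module is loaded on your system. |
```shell
# suggest_num_threads random|sequential
# Print a starting value for the num_threads module parameter. It must be
# run on the target, because the sequential case uses the local CPU count.
suggest_num_threads() {
    case "$1" in
        random)
            echo 32 ;;
        sequential)
            # Number of online CPUs; fall back to 1 if it is not available.
            getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1 ;;
        *)
            echo "usage: suggest_num_threads random|sequential" >&2
            return 2 ;;
    esac
}

# Example, when loading the module:
#   modprobe scst_vdisk num_threads="$(suggest_num_threads random)"
```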
| |
| CAUTION: If you partitioned/formatted your device with block size X, *NEVER* |
| ======== ever try to export and then mount it (even accidentally) with another |
| block size. Otherwise you can *instantly* damage it pretty |
| badly as well as all your data on it. Messages on the initiator |
| like "attempt to access beyond end of device" are a sign of |
| such damage. |
| |
| Moreover, if you want to compare how well different block sizes |
| work for you, you **MUST** EVERY TIME AFTER CHANGING BLOCK SIZE |
| **COMPLETELY** **WIPE OFF** ALL THE DATA FROM THE DEVICE. In |
| other words, THE **WHOLE** DEVICE **MUST** HAVE ONLY **ZEROS** |
| AS THE DATA AFTER YOU SWITCH TO NEW BLOCK SIZE. Switching block |
| sizes isn't like switching between FILEIO and BLOCKIO: after |
| changing block size, all data previously written with another |
| block size MUST BE ERASED. Otherwise you will have a full set of |
| very weird behaviors, because block addressing will have |
| changed, but initiators in most cases will have no way to |
| detect that old addresses written on the device, e.g. in the |
| partition table, no longer refer to what they were intended to |
| refer to. |
| |
| IMPORTANT: Some disk and partition table management utilities don't support |
| ========= block sizes >512 bytes, therefore make sure that your favorite one |
| supports it. Currently cfdisk is the only utility known to work |
| only with 512-byte blocks; other utilities, like fdisk on Linux |
| or the standard disk manager on Windows, have proved to work |
| well with non-512-byte blocks. Note that if you export a disk |
| file or device with a block size different from the one with |
| which it was already partitioned, you could get various weird |
| things like utilities hanging or other unexpected behavior. |
| Hence, to be sure, zero the exported file or device before |
| the first access to it from the remote initiator with another |
| block size. On a Windows initiator make sure you "Set Signature" |
| in the disk manager on the drive imported from the target |
| before doing any other partitioning on it. After you have |
| successfully mounted a file system over a device with a |
| non-512-byte block size, the block size stops mattering: any |
| program will work with files on such a file system. |
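| |
| One way to check the "only zeros" requirement before exporting a file or |
| device with a new block size is sketched below. is_all_zeros is a |
| hypothetical helper; it assumes GNU cmp (for the -n option) and, for |
| block devices, the blockdev utility. It reads the whole file or device, |
| so it can take a long time on large disks. |
```shell
# is_all_zeros PATH
# Succeed only if every byte of the file or block device PATH is 0x00.
is_all_zeros() {
    if [ -b "$1" ]; then
        # Size of a block device in bytes.
        iz_bytes=$(blockdev --getsize64 "$1") || return 2
    else
        # Size of a regular file in bytes.
        iz_bytes=$(wc -c <"$1") || return 2
    fi
    # Normalize the number (some wc implementations pad it with spaces).
    iz_bytes=$((iz_bytes + 0))
    # Compare exactly that many bytes against an endless stream of zeros.
    cmp -s -n "$iz_bytes" "$1" /dev/zero
}

# Usage (/disk1 is the backend file from the example above):
#   is_all_zeros /disk1 && echo "safe to export with a new block size"
```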
| |
| |
| Dealing with massive logs |
| ------------------------- |
| |
| If you want to enable, via the "trace_level" file, logging levels that |
| produce a lot of events, like "debug", then to avoid losing logged |
| events you should also: |
| |
| * Increase the CONFIG_LOG_BUF_SHIFT variable in the .config of your |
| kernel to a much bigger value, then recompile the kernel. For example, |
| value 25 will provide good protection from logging overflow even under |
| a high volume of logging events. To use it you will need to modify the |
| maximum allowed value for CONFIG_LOG_BUF_SHIFT in the corresponding |
| Kconfig file to 25 as well. |
| |
| * Change your /etc/syslog.conf, or the config file of your favorite |
| logging program, to store kernel logs asynchronously. For example, |
| you can add the line "kern.info -/var/log/kernel" to rsyslog.conf and |
| add "kern.none" to the line for /var/log/messages, so that the |
| resulting line looks like: |
| |
| "*.info;kern.none;mail.none;authpriv.none;cron.none /var/log/messages" |
| |
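| To get a feeling for the first item above: the kernel log buffer holds |
| 2^CONFIG_LOG_BUF_SHIFT bytes. The hypothetical helper below only does |
| this arithmetic; it does not read or change any kernel setting. |
```shell
# log_buf_bytes SHIFT
# Print the kernel log buffer size in bytes for a given
# CONFIG_LOG_BUF_SHIFT value (the size is 2 to the power of SHIFT).
log_buf_bytes() {
    echo $((1 << $1))
}

# Usage:
#   log_buf_bytes 25    # prints 33554432, i.e. 32 MiB
```
| |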
| |
| Persistent Reservations |
| ----------------------- |
| |
| SCST implements Persistent Reservations with the full set of |
| capabilities, including "Persistence Through Power Loss". |
| |
| The "Persistence Through Power Loss" data are saved in /var/lib/scst/pr |
| in files named after the corresponding devices. This directory also |
| contains backup versions of those files with the suffix ".1". The |
| backup files are used after a power or other failure to protect |
| Persistent Reservation information from corruption during an update. |
| It is safe to assume that each of those files can be up to 1KB in size. |
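| |
| For backup or inspection purposes, the files belonging to one device |
| can be listed as sketched below. pr_files and SCST_PR_DIR are |
| hypothetical names; the sketch assumes the default location and does |
| not consult the pr_file_name attribute. |
```shell
# Default directory for persistent reservation data (SCST_PR_DIR is a name
# invented for this sketch).
: "${SCST_PR_DIR:=/var/lib/scst/pr}"

# pr_files DEVICE
# Print the persistent reservation file of DEVICE and its ".1" backup, one
# path per line, for whichever of the two currently exist.
pr_files() {
    for pf_path in "$SCST_PR_DIR/$1" "$SCST_PR_DIR/$1.1"; do
        if [ -e "$pf_path" ]; then
            echo "$pf_path"
        fi
    done
}

# Usage (disk1 is a hypothetical device name):
#   pr_files disk1
```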
| |
| Persistent Reservations are available on all transports that implement |
| the get_initiator_port_transport_id() callback. Transports that do not |
| implement this callback will act in one of 2 possible scenarios ("all |
| or nothing"): |
| |
| 1. If a device has such transport connected and doesn't have persistent |
| reservations, it will refuse Persistent Reservations commands as if it |
| doesn't support them. |
| |
| 2. If a device has persistent reservations, all initiators newly |
| connecting via such transports will not see this device. After all |
| persistent reservations from this device are released, upon reconnect |
| the initiators will see it. |
| |
| |
| ALUA Support |
| ------------ |
| |
| SCST supports both implicit and explicit asymmetric logical unit access |
| (ALUA). ALUA is a feature defined by the ANSI T10 SCSI committee. It |
| allows a target to tell the initiator which path to use in a multipath |
| setup and, in the explicit case, lets the initiator control the state |
| of each path via the SET TARGET PORT GROUPS SCSI command. The redundant |
| paths between initiator and target can be used either for redundancy or |
| for load sharing purposes. The target can either be a single target |
| system running SCST with multiple communication interfaces or two target |
| systems each running SCST and configured in a high availability setup. |
| |
| In the SPC-4 standard the following concepts are defined related to ALUA: |
| * Relative target port ID. A number between 1 and 65535 that uniquely |
| identifies a target port. These numbers must be unique over the target as |
| a whole, even if that target consists of multiple systems each running SCST. |
| * Target port group asymmetric access state. One of active/optimized, |
| active/non-optimized, standby, unavailable, logical block dependent or |
| offline. The access state of a port defines which (if any) SCSI commands |
| will be processed by the target port. |
| * Target port preference indicator. This indicator is additional information |
| next to the asymmetric access state that is provided by the target to an |
| initiator and that may impact the decision taken by the initiator about |
| which path will be chosen. |
| |
| More detailed information about ALUA can be found in section 5.11.2 of the |
| ANSI T10 standard called SPC-4. |
| |
| ALUA support in SCST |
| .................... |
| |
| SCST allows ALUA settings to be defined for each unique combination of |
| SCST device and SCST target. An initiator, however, queries ALUA settings by |
| sending an appropriate SCSI command to a specific LUN of an SCST target. |
| Each such LUN maps uniquely to an SCST device. For hardware SCST target |
| drivers, e.g. ib_srpt, there is a one-to-one correspondence between SCST |
| target and SCSI target port. With other SCST targets, e.g. iSCSI-SCST, |
| by default the only relationship between SCST targets and SCSI target |
| ports is that all SCST targets defined on a system are visible via all |
| SCSI target ports. See also the iSCSI-SCST documentation about the |
| allowed_portal attribute for information about how to associate iSCSI |
| targets with a single physical interface. |
| |
| Notes: |
| - In a H.A. setup it is the responsibility of the user to synchronize ALUA |
| information between the individual systems running SCST. There are no |
| provisions in SCST to exchange ALUA information automatically between |
| individual systems. |
| - In order to support H.A. setups it is possible to let one SCST system |
| report information about target ports present in other SCST systems. |
| - With SCST, and certainly in a H.A. setup, it is possible to configure ALUA |
| such that an initiator receives information that is not standard compliant, |
| e.g. setting all target ports in the offline state. It is the responsibility |
| of the user to make sure that the information queried by an initiator is |
| consistent independent of the LUN and the target port used by the initiator |
| to query this information. |
| - Before building a H.A. setup consisting of two or more SCST systems one |
| should evaluate whether it's acceptable that persistent reservation commands, |
| SCSI task management commands and MODE SELECT commands will only be processed |
| by a single node instead of being processed by all nodes. |
| |
| Configuring ALUA in SCST |
| ........................ |
| |
| SCST allows the following ALUA-related settings to be configured |
| for each unique combination of SCST target and virtual SCST device |
| (vdisk_fileio, vdisk_blockio, vcdrom, ...): |
| * The target port group asymmetric access state. SCST supports all ALUA port |
| states except logical block dependent. |
| * The preference indicator for a target port group. |
| * The relative target port ID associated with the SCST target. |
| |
| It is possible to configure the following ALUA-related information via the |
| sysfs interface of SCST: |
| * Device groups, where each device group has a name and contains zero or more |
| SCST devices. If a device group contains only a single SCST device, the name |
| of the group may be identical to the device name. See also |
| /sys/kernel/scst_tgt/device_groups/mgmt. |
| * Which devices are inside a device group. See also |
| /sys/kernel/scst_tgt/device_groups/<device group name>/devices/mgmt. |
| * Target groups, where each target group has a name and contains zero or more |
| SCST target names. See also |
| /sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/mgmt. |
| * Target port group identifier. This is a number in the range 0..65535 and is |
| called the TARGET PORT GROUP in SPC-4. See also |
| /sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/<target |
| group name>/group_id. |
| * Target port group preference indicator. This is a boolean value called the |
| PREF bit in SPC-4. See also /sys/kernel/scst_tgt/device_groups/<device group |
| name>/target_groups/<target group name>/preferred. |
| * Target port group state name. One of active, nonoptimized, standby, |
| unavailable, offline or transitioning. See also |
| /sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/<target |
| group name>/state. |
| * Target group contents - zero or more target names. The target names either |
| exist on the local system or on a remote system in a H.A. setup. For target |
| names that refer to SCST targets on another system only the relative target |
| port identifier matters, not the assigned name. See also |
| /sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/<target |
| group name>/mgmt. |
| * Relative target identifier. See also |
| /sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/<target |
| group name>/<target name>/rel_tgt_id. |
| |
| The steps involved in configuring ALUA are: |
| * Identify the SCST devices that will always share the same ALUA settings and |
| state. Assign a name to each such group of SCST devices. If a device group |
| only contains a single device, the group name may be identical to the device |
| name. |
| * Configure that device group in SCST via sysfs. |
| * Identify the SCSI target ports that will always share the same ALUA settings |
| and state. Assign a name, a group ID and preference indicator to each such |
| SCSI target port group. |
| * Configure the target port group information in SCST via sysfs. |
| * Identify all SCST targets that can be accessed via a target port group. |
| * Assign all these SCST target names to the target group via sysfs. |
| * Assign a relative target port identifier to each target. |
| |
| As an example, in a H.A. setup with two systems each having one InfiniBand |
| HCA controlled by the ib_srpt driver and where each system exports two LUNs |
| the following configuration can be used in scst.conf on both systems: |
| |
| DEVICE_GROUP dgroup1 { |
|     DEVICE disk01 |
| |
|     TARGET_GROUP tgroup1 { |
|         group_id 256 |
|         preferred 1 |
|         state active |
|         TARGET fe80:0000:0000:0000:0002:c903:00fa:b7e1 { |
|             rel_tgt_id 1 |
|         } |
|     } |
|     TARGET_GROUP tgroup2 { |
|         group_id 257 |
|         state standby |
|         TARGET fe80:0000:0000:0000:0002:c903:00fa:b7f2 { |
|             rel_tgt_id 2 |
|         } |
|     } |
| } |
| |
| DEVICE_GROUP dgroup2 { |
|     DEVICE disk02 |
| |
|     TARGET_GROUP tgroup1 { |
|         group_id 258 |
|         state standby |
|         TARGET fe80:0000:0000:0000:0002:c903:00fa:b7e1 { |
|             rel_tgt_id 1 |
|         } |
|     } |
|     TARGET_GROUP tgroup2 { |
|         group_id 259 |
|         preferred 1 |
|         state active |
|         TARGET fe80:0000:0000:0000:0002:c903:00fa:b7f2 { |
|             rel_tgt_id 2 |
|         } |
|     } |
| } |
| |
| Note: if you are using the "active" BLOCKIO device attribute to prevent |
| the backend block device from being opened on the passive node, it is |
| not recommended to set both active ("active", "nonoptimized") and |
| passive ("standby", etc.) ALUA states for the same device if |
| "bind_alua_state=1" is used, as shown above, in order to keep the |
| internal "active" state of the BLOCKIO device consistent. |
| |
| If you are using the "active" BLOCKIO device attribute and multiple |
| target groups exist per device on an SCST instance, then |
| "bind_alua_state=0" should be used and it is left up to the user to |
| modify the "active" attribute value. |
| |
| Explicit ALUA |
| ............. |
| |
| To enable explicit ALUA you need, in addition to the above settings, to |
| set the expl_alua device attribute to 1 (by default it is 0). You also |
| need to run stpgd and supply it with the path to a script or program |
| that will perform the actual path state switching on a SET TARGET PORT |
| GROUPS command, for instance, by calling drbdadm. For more information |
| see the stpgd README as well as the sample script scst_on_stpg. |
| |
| DRBD and other replication/failover SW compatibility |
| .................................................... |
| |
| DRBD, like other replication/failover SW, does not allow its device to |
| be opened on the secondary and does not allow a primary to secondary |
| transition while this device is open. |
| |
| The SCST BLOCKIO handler has the necessary support for such behavior: |
| |
| 1. If you need to prevent an SCST BLOCKIO device from opening its block |
| device, you need to create it with parameter "active=0". In case of DRBD |
| this is done automatically, so you don't have to use the "active" |
| attribute. |
| |
| 2. By default, if you write a new ALUA state into the "state" attribute |
| and "bind_alua_state=1" for the device, the SCST BLOCKIO handler closes |
| the open handles on all affected SCST devices before the transition and |
| reopens them after the transition if the new state is active or |
| nonoptimized. Alternatively, set "bind_alua_state=0" for SCST BLOCKIO |
| devices; ALUA state changes will then not open/close the backing block |
| device, and the user will need to handle this manually or via a cluster |
| RA in an HA setup. |
| |
| Thus, the recommended implicit ALUA state change procedure for primary |
| to secondary transition is: |
| |
| 1. Block all involved SCST devices using the "block" sysfs attribute |
| (see above). Wait until the blocking has finished. |
| |
| 2. Change the ALUA state to "transitioning". At this moment all open |
| file handles will be closed. |
| |
| 3. Perform the DRBD or other replication/failover SW state transition. |
| |
| 4. Change the ALUA state to your desired secondary state. |
| |
| 5. Unblock the devices blocked in step 1. |
| |
| Optionally, if your initiators support the Transitioning ALUA state, |
| the blocked devices can be unblocked immediately after step (2) for |
| more responsive behavior. However, not all initiators behave correctly |
| if they receive ASYMMETRIC STATE TRANSITION sense. |
| |
| The procedure for the secondary to primary transition is similar. |
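| |
| A sketch of steps 1 to 5 for a single device follows. It assumes that |
| writing 1 to the "block" attribute blocks the device and returns once |
| blocking has finished, and that writing 0 unblocks it; verify this |
| against the description of the "block" attribute. alua_transition and |
| SCST_ROOT are hypothetical names, and the role change command is |
| passed in by the caller. |
```shell
# Root of the SCST sysfs tree (SCST_ROOT is a name invented for this sketch).
: "${SCST_ROOT:=/sys/kernel/scst_tgt}"

# alua_transition DEVICE DGROUP TGROUP NEW_STATE CMD [ARG...]
# Block DEVICE, set the target group state to "transitioning", run CMD
# (the replication role change), set NEW_STATE and unblock DEVICE.
# ASSUMPTION: "echo 1 >block" blocks and "echo 0 >block" unblocks.
alua_transition() {
    if [ "$#" -lt 5 ]; then
        echo "usage: alua_transition DEVICE DGROUP TGROUP NEW_STATE CMD [ARG...]" >&2
        return 2
    fi
    at_block="$SCST_ROOT/devices/$1/block"
    at_state="$SCST_ROOT/device_groups/$2/target_groups/$3/state"
    at_new=$4
    shift 4
    at_rc=0
    # Step 1: block the device.
    echo 1 >"$at_block" || return 1
    # Steps 2 to 4: stop at the first failure.
    if ! { echo transitioning >"$at_state" && "$@" && echo "$at_new" >"$at_state"; }; then
        at_rc=1
    fi
    # Step 5: unblock, even if an earlier step failed.
    echo 0 >"$at_block" || at_rc=1
    return "$at_rc"
}

# Usage (r0 is a hypothetical DRBD resource name):
#   alua_transition disk01 dgroup1 tgroup1 standby drbdadm secondary r0
```
| |
| If the role change fails, this sketch leaves the ALUA state at |
| "transitioning" and still unblocks the device; whether that is the |
| right reaction depends on your cluster setup. |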
| |
| In case of explicit ALUA, SCST automatically performs the necessary |
| device blocking around sending the SCST_EVENT_STPG_USER_INVOKE event. |
| |
| Checking the Target Configuration |
| ................................. |
| |
| One way to verify the ALUA configuration from a Linux initiator is via |
| the commands provided in the sg3_utils package. The first step is to |
| verify whether for a certain LUN ALUA has been configured on the target. |
| This is possible by checking whether the TPGS=1 text appears in the |
| sg_inq output, where /dev/sdb is a device node created by the ib_srp |
| initiator: |
| |
| # sg_inq /dev/sdb |
| standard INQUIRY: |
| PQual=0 Device_type=0 RMB=0 version=0x05 [SPC-3] |
| [AERC=0] [TrmTsk=0] NormACA=0 HiSUP=1 Resp_data_format=2 |
| SCCS=0 ACC=0 TPGS=1 3PC=0 Protect=0 BQue=0 |
| EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=1 |
| [RelAdr=0] WBus16=0 Sync=0 Linked=0 [TranDis=0] CmdQue=1 |
| [SPI: Clocking=0x0 QAS=0 IUS=0] |
| length=66 (0x42) Peripheral device type: disk |
| Vendor identification: SCST_FIO |
| Product identification: disk01 |
| Product revision level: 300 |
| Unit serial number: 27cddc71 |
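| |
| In a script the same check can be done as sketched below. tpgs_mode is |
| a hypothetical helper that reads sg_inq output on standard input; the |
| meaning of the TPGS values (0 = no ALUA, 1 = implicit, 2 = explicit, |
| 3 = both) is taken from SPC. |
```shell
# tpgs_mode
# Read "sg_inq" output on stdin and print the ALUA mode advertised by the
# TPGS field of the standard INQUIRY data.
tpgs_mode() {
    tm_tpgs=$(sed -n 's/.*TPGS=\([0-3]\).*/\1/p' | head -n 1)
    case "$tm_tpgs" in
        0) echo "none" ;;
        1) echo "implicit" ;;
        2) echo "explicit" ;;
        3) echo "implicit+explicit" ;;
        *) echo "tpgs_mode: no TPGS field found" >&2; return 1 ;;
    esac
}

# Usage:
#   sg_inq /dev/sdb | tpgs_mode
```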
| |
| The next step is to verify the target group configuration. That is possible |
| by verifying whether the output of the sg_rtpg command matches the values |
| configured on the target: |
| |
| # sg_rtpg /dev/sdb |
| Report target port groups: |
| target port group id : 0x100 , Pref=1 |
| target port group asymmetric access state : 0x00 |
| T_SUP : 0, O_SUP : 0, LBD_SUP : 0, U_SUP : 1, S_SUP : 1, AN_SUP : 1, AO_SUP : 1 |
| status code : 0x02 |
| vendor unique status : 0x00 |
| target port count : 01 |
| Relative target port ids: |
| 0x01 |
| target port group id : 0x101 , Pref=0 |
| target port group asymmetric access state : 0x02 |
| T_SUP : 0, O_SUP : 0, LBD_SUP : 0, U_SUP : 1, S_SUP : 1, AN_SUP : 1, AO_SUP : 1 |
| status code : 0x02 |
| vendor unique status : 0x00 |
| target port count : 01 |
| Relative target port ids: |
| 0x02 |
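| |
| The numeric access states in this output can be translated to names |
| with a filter like the following sketch. rtpg_states is a hypothetical |
| helper; it assumes the two-digit "0x.." state format shown above and |
| uses the state codes defined in SPC. |
```shell
# rtpg_states
# Read "sg_rtpg" output on stdin and print one line per target port group:
# "<group id> <access state name>".
rtpg_states() {
    awk '
        function hexfield(line) {
            # Return the first 0x... token of the line in lower case.
            if (match(line, /0x[0-9a-fA-F]+/))
                return tolower(substr(line, RSTART, RLENGTH))
            return ""
        }
        BEGIN {
            name["0x00"] = "active/optimized"
            name["0x01"] = "active/non-optimized"
            name["0x02"] = "standby"
            name["0x03"] = "unavailable"
            name["0x04"] = "logical block dependent"
            name["0x0e"] = "offline"
            name["0x0f"] = "transitioning"
        }
        /target port group id/ { id = hexfield($0) }
        /asymmetric access state/ {
            code = hexfield($0)
            print id, ((code in name) ? name[code] : "unknown " code)
        }
    '
}

# Usage:
#   sg_rtpg /dev/sdb | rtpg_states
```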
| |
| The relative target port ID and the target port group ID for a certain path |
| can be queried e.g. as follows: |
| |
| # sg_vpd -p di /dev/sdb |
| Device Identification VPD page: |
| Addressed logical unit: |
| designator type: T10 vendor identification, code set: ASCII |
| vendor id: SCST_FIO |
| vendor specific: 27cddc71-disk01 |
| designator type: EUI-64 based, code set: Binary |
| 0x3237636464633731 |
| Target port: |
| designator type: Relative target port, code set: Binary |
| Relative target port: 0x1 |
| designator type: Target port group, code set: Binary |
| Target port group: 0x100 |
| |
| |
| Initiator Support |
| ................. |
| |
| On Linux systems ALUA support is provided by the scsi_dh_alua kernel |
| driver in combination with the user space multipathd daemon. You will |
| have to modify at least the following in /etc/multipath.conf to enable |
| ALUA: |
| |
| * hardware_handler "1 alua" |
| * prio alua |
| * path_grouping_policy group_by_prio |
| * path_checker tur |
| |
| Notes: |
| - Newer versions of multipathd support a parameter called |
| "detect_prio". It can be more convenient to enable this parameter instead of |
| setting the parameter "prio" to "alua" for only those LUNs that support ALUA. |
| - Older versions of multipathd (e.g. RHEL 5 and SLES 10 SP1) need |
| 'prio_callout "/sbin/mpath_prio_alua /dev/%n"' instead of 'prio alua'. |
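| |
| One possible way to apply the four settings listed above to SCST LUNs |
| only is a device section in multipath.conf. The hypothetical helper |
| below prints such a section for a given INQUIRY vendor string; whether |
| a device section or the defaults section is the right place depends on |
| your setup. |
```shell
# alua_device_stanza VENDOR
# Print a multipath.conf "devices" section that applies the ALUA settings
# to all LUNs whose INQUIRY vendor identification matches VENDOR.
alua_device_stanza() {
    cat <<EOF
devices {
    device {
        vendor "$1"
        product ".*"
        hardware_handler "1 alua"
        prio alua
        path_grouping_policy group_by_prio
        path_checker tur
    }
}
EOF
}

# Usage (SCST_FIO is the default vendor string of vdisk_fileio devices):
#   alua_device_stanza SCST_FIO
```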
| |
| With ALUA enabled, the output of "multipath -ll" looks e.g. as follows: |
| |
| # multipath -ll |
| 23237636464633731 dm-3 SCST_FIO,disk01 |
| size=1.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw |
| |-+- policy='service-time 0' prio=1 status=active |
| | `- 10:0:0:0 sdd 8:48 active ready running |
| `-+- policy='service-time 0' prio=130 status=enabled |
| `- 11:0:0:0 sde 8:64 active ready running |
| 23133326137346538 dm-4 SCST_FIO,disk02 |
| size=1.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw |
| |-+- policy='service-time 0' prio=130 status=active |
| | `- 10:0:0:2 sdn 8:208 active ready running |
| `-+- policy='service-time 0' prio=1 status=enabled |
| `- 11:0:0:2 sdp 8:240 active ready running |
| |
| The following information can be derived from the above output: |
| * That the hardware handler (hwhandler) has been set to "1 alua". |
| * That multipathd created two priority groups - one with priority 1 and one |
| with priority 130. |
| * That the SRP path with SCSI host number 10 will be used for communication |
| with LUN "disk01" and that the SRP path with SCSI host number 11 will be used |
| for communication with LUN "disk02". |
| |
| More information about how to configure the device mapper and the scsi_dh_alua |
| driver can be found in the manual of your Linux distribution ("man |
| multipath.conf", "man multipath" and "man multipathd"). |
| |
| Windows initiator systems support ALUA from Windows Server 2008 on. For more |
| information about ALUA support in Windows Server, see also: |
| * Microsoft, Windows Server 2008 R2 Multipath I/O Overview, MSDN |
| (http://technet.microsoft.com/en-us/library/cc725907.aspx). |
| * Microsoft, Multipathing Support in Windows Server 2008, July 2008, MSDN |
| (http://blogs.msdn.com/b/san/archive/2008/07/27/multipathing-support-in-windows-server-2008.aspx). |
| * Microsoft, ALUA MPIO Logo Test, MSDN |
| (http://msdn.microsoft.com/en-us/library/gg607458%28v=vs.85%29.aspx). |
| |
| Active/Non-Optimized via internal redirection |
| ............................................. |
| |
| The Active-Standby configuration is simple to understand and to set up. |
| However, it might cause serious interoperability issues because not all |
| initiators handle the ALUA 'standby' state correctly. For instance, |
| some versions of VMware are reported to have such issues. The same |
| holds for Windows. |
| |
| It is better to use the 'nonoptimized' state on the passive node instead |
| of 'standby', with internal redirection of commands to the active node. |
| This is what the vast majority of storage vendors are doing. This is |
| actually the reason why the 'standby' and 'unavailable' states have all |
| those initiator interoperability issues: these states have received too |
| little testing because they are only marginally used. |
| |
| SCST has the necessary support for such redirection; it just needs to be |
| configured correctly. It takes a little bit of effort, especially to |
| understand how it is going to function, but then it works MUCH more |
| reliably for the full range of initiators. Even poor initiators that |
| have no idea about ALUA (e.g. boot from SAN) will work. The following |
| diagram illustrates this approach: |
| |
| ................................................................. |
| .                               .                               . |
| .         Initiator A           .           Initiator B         . |
| .              |                .                |              . |
| ................................................................. |
| .              |                .                |              . |
| .        target port C          .          target port D        . |
| .              |                .                |              . |
| .             SCST              .              SCST             . |
| .          Instance E - target  .  target - Instance F          . |
| .           /      \    port G  .  port H    /      \           . |
| .          /        \          \./          /        \          . |
| .         /          \         /.\         /          \         . |
| .  vdisk_blockio  dev_disk    / . \    dev_disk  vdisk_blockio  . |
| .     handler     handler    /  .  \    handler     handler     . |
| .        |           |      /   .   \      |           |        . |
| .  block device     SCSI   /    .        SCSI     block device  . |
| .        I       initiator      .      initiator       J        . |
| .        |        node K        .        node L        |        . |
| .        |______________________.______________________|        . |
| ................................................................. |
| The link between block devices I and J stands for synchronous replication. |
| |
| |
| Such a setup can be configured as follows: |
| |
| 1. Build SCST. |
| |
| 2. Set up an internal redirect target on the active node. It is going to |
| accept redirected commands from the passive node and must be visible |
| only to the passive node. |
| |
| 3. Set "forward_dst" attribute for this target to 1. This is necessary to |
| correctly handle PRs. |
| |
| 4. Export through this target the SAME backend SCST device as is being |
| served to the initiator(s) (for simplicity, assume that there is only |
| one served device). |
| |
| 5. Connect to this SCST device through this internal target from the |
| passive node, for instance, using iSCSI. Now you have a local SCSI |
| device on the passive side pointing to the active node. |
| |
| 6. Export this local device to the initiator(s) using the SCST |
| *pass-through* handler (scst_disk). Pass-through is needed to redirect |
| non-block commands as well: ATS, XCOPY, etc. |
| |
| 7. Set the ALUA state of this target to "nonoptimized". Set the |
| forward_src attribute to one. |
| |
| That's it on the normal path. Now the initiator(s) would see 2 paths: |
| OPTIMIZED going to the active node and NON-OPTIMIZED going to the |
| passive node, then redirected to the active node. |
| |
| On failover (i.e. switching active and passive states): |
| |
| 1. Set up a similar redirect target on the new active node. |
| |
| 2. Set up connectivity to that new redirect target from the new passive |
| node. |
| |
| 3. Start the ALUA change (see above) on both nodes. |
| |
| 4. !! In the sysfs security group(s) for the initiator(s), switch the |
| *LUN* from the old SCST device to the new one (blockio -> pass-through on |
| the new passive and pass-through -> blockio on the new active) using the |
| "replace_no_ua" SCST command. You need to do it directly in the sysfs |
| interface; scstadmin can't do it. |
| |
| 5. Set ALUA states to "active" on the new active node and "nonoptimized" |
| on the new passive node. |
| |
| 6. Finish ALUA states change. |
| |
| Example using direct sysfs interface could look like: |
| |
| Active-Optimized node: |
| |
| modprobe scst |
| modprobe scst_disk |
| modprobe scst_vdisk |
| |
| # Main device, DRBD primary here |
| echo "add_device aa filename=/dev/drbd1" >/sys/kernel/scst_tgt/handlers/vdisk_blockio/mgmt |
| |
| # Redirect device, not used here. Coming from connecting via iSCSI to the |
| # corresponding redirect target on the other side. |
| DEVICE=10:0:0:0 |
| echo add_device $DEVICE >/sys/kernel/scst_tgt/handlers/dev_disk/mgmt |
| |
| service iscsi-scst start |
| |
| # This is a regular, user-visible target |
| echo "add_target iqn.2006-10.net.v:tgt" >/sys/kernel/scst_tgt/targets/iscsi/mgmt |
| echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt/rel_tgt_id |
| echo "add aa 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt/luns/mgmt |
| |
| # This is redirect target, 192.168.9.x is the redirect network |
| echo "add_target iqn.2006-10.net.v:tgtR" >/sys/kernel/scst_tgt/targets/iscsi/mgmt |
| echo 2 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/rel_tgt_id |
| echo "add_target_attribute iqn.2006-10.net.v:tgtR allowed_portal 192.168.9.1" >/sys/kernel/scst_tgt/targets/iscsi/mgmt |
| echo "1" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/forwarding |
| echo "add aa 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/luns/mgmt |
| |
| echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt/enabled |
| echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/enabled |
| |
| echo 1 >/sys/kernel/scst_tgt/targets/iscsi/enabled |
| |
| # ALUA config |
| |
| echo create aa >/sys/kernel/scst_tgt/device_groups/mgmt |
| echo add aa >/sys/kernel/scst_tgt/device_groups/aa/devices/mgmt |
| |
| echo add tgt_a >/sys/kernel/scst_tgt/device_groups/aa/target_groups/mgmt |
| echo add iqn.2006-10.net.v:tgt >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_a/mgmt |
| echo 1 >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_a/group_id |
| echo active >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_a/state |
| |
| echo add tgt_n >/sys/kernel/scst_tgt/device_groups/aa/target_groups/mgmt |
| echo add iqn.2006-10.net.v:tgt1 >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_n/mgmt |
| echo 2 >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_n/iqn.2006-10.net.v:tgt1/rel_tgt_id |
| echo 2 >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_n/group_id |
| echo nonoptimized >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_n/state |
| |
| Active-Non-Optimized node: |
| |
| modprobe scst |
| modprobe scst_disk |
| modprobe scst_vdisk |
| |
| # Main device, DRBD secondary, not used here |
| echo "add_device aa filename=/dev/drbd1" >/sys/kernel/scst_tgt/handlers/vdisk_blockio/mgmt |
| |
| # Redirect device. Coming from connecting via iSCSI to the |
| # corresponding redirect target on the other side. |
| DEVICE=10:0:0:0 |
| echo add_device $DEVICE >/sys/kernel/scst_tgt/handlers/dev_disk/mgmt |
| |
| service iscsi-scst start |
| |
| echo "add_target iqn.2006-10.net.v:tgt1" >/sys/kernel/scst_tgt/targets/iscsi/mgmt |
| echo 2 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/rel_tgt_id |
| echo "add $DEVICE 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/luns/mgmt |
| |
| # Redirect target, 192.168.9.x is the redirect network |
| echo "add_target iqn.2006-10.net.v:tgtR" >/sys/kernel/scst_tgt/targets/iscsi/mgmt |
| echo 2 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/rel_tgt_id |
| echo "add_target_attribute iqn.2006-10.net.v:tgtR allowed_portal 192.168.9.2" >/sys/kernel/scst_tgt/targets/iscsi/mgmt |
| echo "1" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/forwarding |
| echo "add aa 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/luns/mgmt |
| |
| echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/enabled |
| |
| echo 1 >/sys/kernel/scst_tgt/targets/iscsi/enabled |
| echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/enabled |
| |
| # ALUA config |
| |
| echo create $DEVICE >/sys/kernel/scst_tgt/device_groups/mgmt |
| echo add $DEVICE >/sys/kernel/scst_tgt/device_groups/$DEVICE/devices/mgmt |
| |
| echo add tgt_a >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/mgmt |
| echo add iqn.2006-10.net.v:tgt >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/mgmt |
| echo 1 >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/iqn.2006-10.net.v:tgt/rel_tgt_id |
| echo 1 >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/group_id |
| echo active >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/state |
| |
| echo add tgt_n >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/mgmt |
| echo add iqn.2006-10.net.v:tgt1 >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/mgmt |
| echo 1 >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/group_id |
| echo nonoptimized >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/state |
| |
| ALUA state switch after DRBD primary-secondary transition: |
| |
| Ex-Optimized: |
| |
| echo "replace_no_ua $DEVICE 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/luns/mgmt |
| echo nonoptimized >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/state |
| echo active >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/state |
| |
| Ex-Non-Optimized: |
| |
| echo "replace_no_ua aa 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/luns/mgmt |
| echo nonoptimized >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/state |
| echo active >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/state |
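| |
| After such a switch it can help to compare the states seen on both nodes. |
| The following is a minimal sketch, not part of SCST: the helper name |
| show_alua_states and the SCST_SYSFS override are illustrative assumptions, |
| and it assumes that the first line of each "state" file holds the state |
| name. |
| |

```shell
# Hypothetical helper, not part of SCST.
# SCST_SYSFS is an assumed override for the sysfs root, handy for dry runs.
SCST_SYSFS="${SCST_SYSFS:-/sys/kernel/scst_tgt}"

# Print "target_group state" for every target group of a device group.
# Only the first line of each state file is used.
show_alua_states() {   # usage: show_alua_states DEVICE_GROUP
    for tg in "$SCST_SYSFS/device_groups/$1/target_groups"/*/; do
        [ -r "${tg}state" ] || continue
        printf '%s %s\n' "$(basename "$tg")" "$(head -n 1 "${tg}state")"
    done
}

# Example: show_alua_states aa
```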
| |
| If you have any questions, please read the text above at least 3 times |
| before asking. It might be tricky to understand :-) |
| |
| |
| VAAI |
| ---- |
| |
| SCST supports all 3 VAAI SCSI commands: WRITE SAME, COMPARE AND WRITE |
| (ATS) and EXTENDED COPY. Additionally, it supports Thin Provisioning |
| capabilities that are not directly related to VAAI, in particular the |
| UNMAP SCSI command, WRITE SAME with the UNMAP bit, and the thin |
| provisioning related sysfs attributes of devices (see above). |
| |
| In some cases dev handlers should perform some manual actions to fully |
| benefit from the SCST VAAI implementation. Those actions are described in |
| the implementation notes below. For the vdisk and fileio_tgt handlers |
| they have already been implemented. |
| |
| IMPORTANT: To use the EXTENDED COPY command between LUNs (datastores), they |
| ========= all MUST have the same PRODUCT IDENTIFICATION INQUIRY field. By |
| default, to simplify identification of remote devices, SCST |
| uses vdisk names as PRODUCT IDENTIFICATION, so SCST devices |
| look different to the initiators. However, for some reason, |
| VMware does not use EXTENDED COPY between LUNs with different |
| PRODUCT IDENTIFICATION. Thus, to be able to use full VAAI in |
| your VMware setups you must manually set PRODUCT |
| IDENTIFICATION for all your VMware LUNs to the same value, |
| for instance, "SCST", using the "prod_id" attribute. It can |
| be done either by adding the "prod_id" attribute to the |
| scstadmin scst.conf, or by writing directly to the SCST sysfs |
| attribute. For example: |
| |
| HANDLER vdisk_blockio { |
|     DEVICE blockio1 { |
|         filename /dev/sda5 |
|         prod_id SCST |
|     } |
| } |
| |
| or |
| |
| echo SCST >/sys/kernel/scst_tgt/devices/blockio1/prod_id |
| correspondingly. |
| |
| Note, this prod_id modification must be done on all |
| datastores BEFORE VMware connects to them. |
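| |
| With several datastores, the sysfs variant can be applied in one pass. |
| This is a minimal sketch under stated assumptions: the helper name |
| set_prod_id and the SCST_SYSFS override are illustrative and not part of |
| SCST; only the devices/NAME/prod_id path comes from the text above. |
| |

```shell
# Hypothetical helper, not part of SCST: write one PRODUCT IDENTIFICATION
# value to the prod_id attribute of every listed device.
# SCST_SYSFS is an assumed override for the sysfs root, handy for dry runs.
SCST_SYSFS="${SCST_SYSFS:-/sys/kernel/scst_tgt}"

set_prod_id() {   # usage: set_prod_id VALUE DEVICE...
    prod_id="$1"
    shift
    for dev in "$@"; do
        attr="$SCST_SYSFS/devices/$dev/prod_id"
        if [ ! -e "$attr" ]; then
            echo "no such attribute: $attr" >&2
            return 1
        fi
        echo "$prod_id" >"$attr" || return 1
    done
}

# Example: set_prod_id SCST blockio1 blockio2
```

| |
| The helper stops at the first missing device, so a typo in a device name |
| does not go unnoticed before VMware connects. |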
| |
| |
| Implementation notes |
| .................... |
| |
| WRITE SAME |
| ~~~~~~~~~~ |
| |
| WRITE SAME command supports 2 modes: |
| |
| 1. Manual writing mode. In this mode WRITE SAME generates a set of |
| internal WRITE(16) SCSI commands to perform requested writing. |
| |
| 2. Remap mode. In this mode a dev handler that supports it can remap the |
| blocks being written to a single block and then tell SCST to manually |
| write those parts of the requested area that for some reason cannot be |
| remapped. |
| |
| In both cases dev handlers should call the scst_write_same() function |
| from their WRITE SAME command handler. As its second argument this |
| function takes an array of descriptors telling where to write the |
| requested block of data. The last element of this array must have len 0. |
| If this argument is NULL, the whole area will be manually written by |
| SCST. NULL should be used by dev handlers that do not support remapping |
| blocks. |
| |
| User space dev handlers should use SCST_EXEC_REPLY_DO_WRITE_SAME |
| reply_type of SCST_USER_EXEC subcommand. See scst_user doc for more |
| info. |
| |
| |
| COMPARE AND WRITE |
| ~~~~~~~~~~~~~~~~~ |
| |
| COMPARE AND WRITE is implemented by SCST as a set of read, compare and |
| write actions performed atomically with respect to the affected blocks |
| as well as to regular RESERVE SCSI commands. In particular, COMPARE AND |
| WRITE doesn't need any queue flushing, and an unlimited number of |
| COMPARE AND WRITE commands on different blocks can be executed |
| simultaneously. |
| |
| The read and write actions are implemented by generating internal |
| READ(16) and WRITE(16) SCSI commands. |
| |
| The COMPARE AND WRITE command is completely transparent to dev handlers |
| (they only see the corresponding READ(16) and WRITE(16) commands), so it |
| doesn't require any manual actions from them. |
| |
| |
| EXTENDED COPY |
| ~~~~~~~~~~~~~ |
| |
| SCST implements EXTENDED COPY via internal Copy Manager target. This |
| target has the following specific attribute in its sysfs: |
| |
| - allow_not_connected_copy - if not set (default), an initiator can |
| perform copy only between devices it has direct access to via any |
| target/session. If set, any initiator can copy between any devices in |
| the system. |
| |
| The Copy Manager has access only to those devices for which it has LUNs |
| in /sys/kernel/scst_tgt/targets/copy_manager/copy_manager_tgt/luns/. |
| Devices from the scst_vdisk dev handler are added to it automatically |
| upon registration, but for other devices you need to add LUNs there |
| manually, the same way as for any target driver. You can also remove any |
| device from the Copy Manager's visibility at any time by deleting the |
| corresponding LUN from the sysfs. It might be useful during ALUA state |
| switching. |
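| |
| The LUN handling described above can be sketched as follows. The helper |
| names and the SCST_SYSFS override are illustrative assumptions; the |
| "add DEVICE LUN" form mirrors the luns/mgmt examples earlier in this |
| document, and the "del LUN" form is assumed to match what reading |
| luns/mgmt reports on your system. |
| |

```shell
# Hypothetical helpers, not part of SCST.
# SCST_SYSFS is an assumed override for the sysfs root, handy for dry runs.
SCST_SYSFS="${SCST_SYSFS:-/sys/kernel/scst_tgt}"
CM_LUNS="targets/copy_manager/copy_manager_tgt/luns"

# Make a device visible to the Copy Manager under the given LUN number.
cm_add_lun() {   # usage: cm_add_lun DEVICE LUN
    echo "add $1 $2" >"$SCST_SYSFS/$CM_LUNS/mgmt"
}

# Hide a device from the Copy Manager again, e.g. during an ALUA switch.
# The "del LUN" command form is an assumption; check the mgmt help text.
cm_del_lun() {   # usage: cm_del_lun LUN
    echo "del $1" >"$SCST_SYSFS/$CM_LUNS/mgmt"
}

# Example: cm_add_lun 10:0:0:0 5
#          cm_del_lun 5
```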
| |
| Internally, SCST implements EXTENDED COPY by generating sets of internal |
| READ(16) and WRITE(16) SCSI commands. Dev handlers don't need any manual |
| actions to use it. |
| |
| SCST also gives dev handlers the possibility to remap blocks instead of |
| copying them, if they support this feature. It allows them to perform |
| the EXTENDED COPY command much faster with just a metadata update of |
| their backend storage, which is supposed to be nearly instantaneous. |
| |
| To use this feature, a dev handler should set up the ext_copy_remap() |
| callback in its struct scst_dev_type. This callback is called by SCST |
| during EXTENDED COPY command processing to let the dev handler try to |
| remap the affected blocks first. |
| |
| When finished, the dev handler should call scst_ext_copy_remap_done(). In |
| case of an error, the dev handler should set the corresponding sense in |
| cmd and then also call scst_ext_copy_remap_done(cmd, NULL, 0). |
| |
| If the dev handler is unable to remap some part of the segment, it |
| should kmalloc() an array, fill it with all the leftover subsegments and |
| supply it to scst_ext_copy_remap_done(). SCST will then copy the |
| subsegments using the internal copy machine and kfree() the supplied |
| array. If the dev handler is unable to remap the segment at all, it can |
| simply supply the original segment directly to |
| scst_ext_copy_remap_done(). |
| |
| It is highly recommended that in normal circumstances dev handlers call |
| scst_ext_copy_remap_done() from a thread context other than the one in |
| which the ext_copy_remap() callback was originally called, because |
| otherwise there could be recursion in the segments processing. |
| Fortunately, such a thread context switch is natural for an operation as |
| potentially long as EXTENDED COPY. |
| |
| |
| VMware and Ceph RBD space reclaim |
| --------------------------------- |
| |
| VMware with the VMFS5 filesystem ignores UNMAP alignment, so if you use |
| 4MB Ceph RBD objects and VMFS5, only some discards will reclaim RBD |
| space, because a 1MB discard rarely hits the tail of an object. |
| |
| Thus, for efficient ESXi space reclamation with RBD and VMFS5, it is |
| recommended to use a 1 MB object size in Ceph. |
| |
| See https://sourceforge.net/p/scst/mailman/message/35287598 thread for |
| details. |
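| |
| The alignment effect can be illustrated with a little arithmetic. The |
| sketch below is a simplified model, not Ceph code: it assumes, as |
| described above, that space is reclaimed only when a discard reaches the |
| end of an object, and both function names are illustrative. |
| |

```shell
# Returns success (0) if a discard of LEN_MB starting at OFF_MB reaches the
# end of an object of OBJ_MB, i.e. touches or crosses an object boundary.
# All arguments are whole megabytes; this models alignment only, not Ceph.
discard_reaches_tail() {   # usage: discard_reaches_tail OBJ_MB OFF_MB LEN_MB
    obj="$1"; off="$2"; len="$3"
    [ $(( (off + len) / obj )) -gt $(( off / obj )) ]
}

# Count how many 1 MB discards over the first 8 MB reach an object tail.
count_reclaiming_discards() {   # usage: count_reclaiming_discards OBJ_MB
    hits=0
    off=0
    while [ "$off" -lt 8 ]; do
        if discard_reaches_tail "$1" "$off" 1; then
            hits=$((hits + 1))
        fi
        off=$((off + 1))
    done
    echo "$hits"
}

# Example: count_reclaiming_discards 4   # 4 MB objects
#          count_reclaiming_discards 1   # 1 MB objects
```

| |
| In this model only 2 of the 8 discards reach an object tail with 4 MB |
| objects, while all 8 do with 1 MB objects. |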
| |
| |
| Caching |
| ------- |
| |
| By default, for performance reasons, VDISK FILEIO devices use a write back |
| caching policy. |
| |
| Generally, write back caching is safe to use and its danger is greatly |
| overestimated, because most modern (especially Enterprise level) |
| applications are well prepared to work with write back cached storage. |
| In particular, all transaction-based applications are. Those |
| applications flush the cache to completely avoid ANY data loss on a |
| crash or power failure. For instance, journaled file systems flush the |
| cache on each metadata update, so they survive power/hardware/software |
| failures pretty well. |
| |
| Since write back caching is always on locally on initiators, an |
| application that cares about its data consistency flushes the cache when |
| necessary, or on every write if it opens files with O_SYNC. If it doesn't |
| care, it doesn't flush the cache. As long as the cache flushes are |
| propagated to the storage, write back caching on the storage doesn't |
| make any difference. If an application doesn't flush the cache, it is |
| doomed to lose data in case of a crash or power failure, no matter where |
| the cache is located, locally or on the storage. |
| |
| To illustrate that, consider, for example, a user who wants to copy the |
| /src directory to the /dst directory reliably, i.e. so that after the |
| copy has finished no power failure or software/hardware crash could lead |
| to a loss of the data in /dst. There are 2 ways to achieve this. Let's |
| suppose for simplicity that cp opens files for writing with the O_SYNC |
| flag, hence bypassing the local cache. |
| |
| 1. Slow. Make the device behind /dst work in write through caching mode |
| and then run "cp -a /src /dst". |
| |
| 2. Fast. Let the device behind /dst work in write back caching mode and |
| then run "cp -a /src /dst; sync". The reliability of the result is the |
| same, but it's much faster than (1). Nobody would care if a crash |
| happened during the copy, because after recovery the leftovers from the |
| incomplete attempt would simply be deleted and the operation would be |
| restarted from the very beginning. |
| |
| So, you can see that in (2) there is no danger of ANY data loss from the |
| write back caching. Moreover, since in practice cp doesn't open files |
| for writing with the O_SYNC flag, the sync command must be called after |
| cp anyway to get the copy done reliably, so enabling write back caching |
| wouldn't make any difference for reliability. |
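| |
| Variant (2) can be wrapped in a small helper. This is only a sketch; the |
| function name reliable_copy is illustrative, and it uses "&&" instead of |
| ";" so that a failed cp is not masked by a successful sync. |
| |

```shell
# Copy SRC to DST and return success only after cp succeeded and sync has
# been run. sync flushes all dirty data system-wide, so it may take a while.
reliable_copy() {   # usage: reliable_copy SRC DST
    cp -a "$1" "$2" && sync
}

# Example: reliable_copy /src /dst
```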
| |
| You can also consider it from another side. Modern HDDs have at least |
| 16MB of cache working in write back mode by default, so for a 10-drive |
| RAID that is 160MB of write back cache. How many people are happy with |
| it and how many have disabled the write back cache of their HDDs? Almost |
| all and almost nobody, respectively. Moreover, many HDDs lie about the |
| state of their cache and report write through while working in write |
| back mode. They are also used successfully. |
| |
| Note that the Linux I/O subsystem guarantees to propagate cache flushes to |
| the storage only when data protection barriers are used, and those are |
| usually turned off by default (see http://lwn.net/Articles/283161). |
| Without barriers enabled, Linux doesn't guarantee that after |
| sync()/fsync() all written data have really hit permanent storage. They |
| can be stored in the cache of your backstorage devices and, hence, lost |
| on a power failure. Thus, even with write-through cache mode, you still |
| either need to enable barriers on your backend file system on the target |
| (for direct /dev/sdX devices this is, indeed, impossible), or need a good |
| UPS to protect yourself from losing uncommitted data. Some information |
| about barriers from the XFS point of view can be found at |
| http://xfs.org/index.php/XFS_FAQ#Write_barrier_support. On Linux |
| initiators, barrier protection for the Ext3 and ReiserFS file systems can |
| be turned on using the "barrier=1" and "barrier=flush" mount options, |
| respectively. You can check whether barriers are turned on or off by |
| looking in /proc/mounts. Windows and, AFAIK, other UNIX'es don't need any |
| special explicit options and perform the necessary barrier actions on |
| write-back caching devices by default. |
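| |
| The /proc/mounts check can be scripted. A minimal sketch follows; the |
| function name show_barrier_opts and its optional second argument are |
| illustrative assumptions. Empty output only means that no barrier-related |
| option is listed, in which case the file system default applies. |
| |

```shell
# Print the barrier-related mount options of one mount point, one per line.
# The second argument defaults to /proc/mounts; overriding it is only
# useful for testing against a saved copy.
show_barrier_opts() {   # usage: show_barrier_opts MOUNTPOINT [MOUNTS_FILE]
    awk -v mp="$1" '$2 == mp {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /barrier/)
                print opts[i]
    }' "${2:-/proc/mounts}"
}

# Example: show_barrier_opts /mnt/storage
```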
| |
| To limit this data loss with write back caching you can use files in |
| /proc/sys/vm to limit amount of unflushed data in the system cache. |
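| |
| For example, the dirty_* files in /proc/sys/vm control how much unflushed |
| data may accumulate. The sketch below is illustrative only: the function |
| names, the VM_DIR override and the byte values are assumptions, not |
| recommendations. Note that writing a *_bytes file makes the kernel ignore |
| the corresponding *_ratio file. |
| |

```shell
# Show the writeback limits and optionally cap dirty data at absolute sizes.
# VM_DIR defaults to /proc/sys/vm; overriding it is only useful for testing.
VM_DIR="${VM_DIR:-/proc/sys/vm}"

show_dirty_limits() {
    for f in dirty_background_bytes dirty_bytes \
             dirty_background_ratio dirty_ratio; do
        printf '%s = %s\n' "$f" "$(cat "$VM_DIR/$f")"
    done
}

# Background writeback starts at BACKGROUND_BYTES of dirty data; writers are
# throttled once MAX_BYTES is reached. Requires root on a real system.
set_dirty_limits() {   # usage: set_dirty_limits BACKGROUND_BYTES MAX_BYTES
    echo "$1" >"$VM_DIR/dirty_background_bytes" &&
    echo "$2" >"$VM_DIR/dirty_bytes"
}

# Example: set_dirty_limits 67108864 268435456   # 64 MB / 256 MB
```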
| |
| If for some reason you have to use VDISK FILEIO devices in write through |
| caching mode, don't forget to disable internal caching on their backend |
| devices or make sure they have an additional battery or supercapacitor |
| power supply on board. Otherwise, on a power failure you would still |
| lose all the not yet saved data in the devices' internal cache. |
| |
| Note that on some real-life workloads write through caching might perform |
| better than write back caching with the barrier protection turned on. |
| |
| |
| Errors caching |
| .............. |
| |
| When using a virtual device in FILEIO mode, the Linux page cache comes |
| into the picture. The negative side of it is that it sometimes also |
| caches errored pages. That is, if the underlying file experiences IO |
| errors, those errors might be cached by the Linux page cache. As a |
| result, even when the underlying file recovers and stops failing IOs, |
| the initiator may still hit IO errors returned by the Linux page cache, |
| until the cache re-reads the errored pages (usually it happens pretty |
| soon, but not immediately). To make sure that cached pages are dropped, |
| one of the following can be done: |
| |
| - Detach the SCSI virtual device (del_device) and re-attach it |
| (add_device). This should evict all the cached pages, unless somebody |
| else holds the same "filename" opened. |
| |
| - Issue a BLKFLSBUF ioctl to the same "filename" you provided for "add_device". |
| |
| For the second option, some rudimentary C code is required: |
| |
| fd = open(filename, O_RDWR); |
| if (fd < 0) { |
| err = errno; |
| ... |
| } else { |
| err = ioctl(fd, BLKFLSBUF); |
| if (err < 0) { |
| err = errno; |
| ... |
| } |
| close(fd); |
| } |
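| |
| When "filename" is a block device, the same ioctl can also be issued |
| without writing C code, because "blockdev --flushbufs" from util-linux |
| performs BLKFLSBUF. The wrapper below is a sketch; the function name |
| flush_device_cache is illustrative. |
| |

```shell
# Ask the kernel to flush and drop the cached buffers of a block device.
# blockdev --flushbufs issues the BLKFLSBUF ioctl; it needs root privileges.
flush_device_cache() {   # usage: flush_device_cache /dev/XXX
    if [ ! -b "$1" ]; then
        echo "$1 is not a block device" >&2
        return 2
    fi
    blockdev --flushbufs "$1"
}

# Example: flush_device_cache /dev/drbd1
```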
| |
| |
| BLOCKIO VDISK mode |
| ------------------ |
| |
| This module works best for these types of scenarios: |
| |
| 1) Data that are not aligned to 4K sector boundaries and block sizes <4K |
| are used, which is normally found in virtualization environments where |
| operating systems start partitions on odd sectors (Windows and its |
| sector 63). |
| |
| 2) Large block data transfers normally found in database loads/dumps and |
| streaming media. |
| |
| 3) Advanced relational database systems that perform their own caching |
| which prefer or demand direct IO access and, because of the nature of |
| their data access, can actually see worse performance with |
| non-discriminate caching. |
| |
| 4) Multiple layers of targets where the secondary and above layers need |
| to have a consistent view of the primary targets in order to preserve |
| data integrity which a page cache backed IO type might not provide |
| reliably. |
| |
| Also it has an advantage over FILEIO that it doesn't copy data between |
| the system cache and the commands data buffers, so it saves a |
| considerable amount of CPU power and memory bandwidth. |
| |
| IMPORTANT: Since data in BLOCKIO and FILEIO modes are not consistent with |
| ========= each other, if you try to use a device in both those modes |
| simultaneously, you will almost instantly corrupt your data |
| on that device. |
| |
| IMPORTANT: Some kernels starting from 2.6.32 have a problem, which |
| ========= prevents BLOCKIO from working correctly with RAID5/DM. See |
| http://lkml.org/lkml/2010/7/28/315. That problem was fixed in |
| 2.6.32.19, 2.6.34.4, 2.6.35.2 and 2.6.36-rc1. It is strongly |
| recommended to not use affected kernels with BLOCKIO. |
| |
| IMPORTANT: In SCST 1.x BLOCKIO worked by default in NV_CACHE mode, in which |
| ========= each device was reported to remote initiators as having write |
| through caching. But if your backend block device has internal |
| write back caching, this creates a possibility of losing the |
| data held in that internal cache in case of a power failure. |
| Starting from SCST 2.0, BLOCKIO works by default in |
| non-NV_CACHE mode, in which each device is reported to remote |
| initiators as having write back caching and synchronizes the |
| internal device's cache on each SYNCHRONIZE_CACHE command |
| from the initiators. It might lead to some *PERFORMANCE LOSS*, |
| so if you are sure of your power supply and want to restore |
| the 1.x behavior, you should recreate your BLOCKIO devices in |
| NV_CACHE mode. |
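| |
| Recreating a device in NV_CACHE mode could look like the sketch below. It |
| assumes that the vdisk_blockio "mgmt" file accepts "del_device NAME" and |
| an "nv_cache=1" parameter after "filename=", separated by ";". Please |
| verify this against the help text returned by reading the "mgmt" file. |
| The function name and the SCST_SYSFS override are illustrative. |
| |

```shell
# Hypothetical helper, not part of SCST.
# SCST_SYSFS is an assumed override for the sysfs root, handy for dry runs.
SCST_SYSFS="${SCST_SYSFS:-/sys/kernel/scst_tgt}"

# Delete a BLOCKIO device and add it again with nv_cache=1 (assumed
# parameter name). Do this only while no initiator is using the device;
# LUNs referring to it may have to be added again afterwards.
recreate_blockio_nv_cache() {   # usage: recreate_blockio_nv_cache NAME FILE
    mgmt="$SCST_SYSFS/handlers/vdisk_blockio/mgmt"
    echo "del_device $1" >"$mgmt" &&
    echo "add_device $1 filename=$2; nv_cache=1" >"$mgmt"
}

# Example: recreate_blockio_nv_cache aa /dev/drbd1
```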
| |
| |
| Pass-through mode |
| ----------------- |
| |
| In the pass-through mode (i.e. using the pass-through device handlers |
| scst_disk, scst_tape, etc) SCSI commands, coming from remote initiators, |
| are passed to local SCSI devices on target as is, without any |
| modifications. |
| |
| SCST supports 1 to many pass-through, where several initiators can safely |
| connect to a single pass-through device (a tape, for instance). For such |
| cases SCST emulates all the necessary functionality. |
| |
| In the sysfs interface all real SCSI devices are listed in |
| /sys/kernel/scst_tgt/devices in the form of host:channel:id:lun numbers, |
| for instance 1:0:0:0. The recommended way to match those numbers to your |
| devices is to use the lsscsi utility. |
| |
| Each pass-through dev handler has a "mgmt" file in its root subdirectory |
| /sys/kernel/scst_tgt/handlers/handler_name, e.g. |
| /sys/kernel/scst_tgt/handlers/dev_disk. It accepts the following |
| commands, which can be sent to it using, e.g., the echo command. |
| |
| - "add_device" - this command assigns SCSI device with |
| host:channel:id:lun numbers to this dev handler. |
| |
| echo "add_device 1:0:0:0" >/sys/kernel/scst_tgt/handlers/dev_disk/mgmt |
| |
| will assign SCSI device 1:0:0:0 to this dev handler. |
| |
| - "del_device" - this command unassigns SCSI device with |
| host:channel:id:lun numbers from this dev handler. |
| |
| As usual, on read the "mgmt" file returns a short help about the available |
| commands. |
| |
| You need to manually assign each of your real SCSI devices to the |
| corresponding pass-through dev handler using the "add_device" command, |
| otherwise the real SCSI devices will not be visible remotely. The |
| assignment isn't done automatically, because it could lead to load and |
| initialization problems for the pass-through dev handlers if any of the |
| local real SCSI devices are malfunctioning. |
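| |
| Both commands can be wrapped as follows. This is a sketch: the helper |
| names pt_assign and pt_unassign and the SCST_SYSFS override are |
| illustrative assumptions, while the command strings are the ones |
| described above. |
| |

```shell
# Hypothetical helpers, not part of SCST.
# SCST_SYSFS is an assumed override for the sysfs root, handy for dry runs.
SCST_SYSFS="${SCST_SYSFS:-/sys/kernel/scst_tgt}"

# Assign a local SCSI device (H:C:I:L) to a pass-through dev handler.
pt_assign() {     # usage: pt_assign HANDLER H:C:I:L
    echo "add_device $2" >"$SCST_SYSFS/handlers/$1/mgmt"
}

# Unassign a local SCSI device (H:C:I:L) from a pass-through dev handler.
pt_unassign() {   # usage: pt_unassign HANDLER H:C:I:L
    echo "del_device $2" >"$SCST_SYSFS/handlers/$1/mgmt"
}

# Example: lsscsi                        # find the H:C:I:L of the device
#          pt_assign dev_disk 1:0:0:0
#          pt_unassign dev_disk 1:0:0:0
```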
| |
| Like any other hardware, the local SCSI hardware cannot handle commands |
| whose amount of data and/or number of segments in the scatter-gather |
| array exceeds certain values. Therefore, when using the pass-through mode |
| you should note that the values for the maximum number of segments and |
| the maximum amount of transferred data (max_sectors) for each SCSI |
| command on devices on initiators cannot be bigger than the corresponding |
| values of the corresponding SCSI devices on the target. Otherwise you |
| will see symptoms like small transfers working well but large ones |
| stalling, and messages like "Unable to complete command due to SG IO |
| count limitation" printed in the kernel logs. |
| |
| You can't control the limit of scatter-gather segments from user space, |
| but for block devices it is usually sufficient to set |
| /sys/block/DEVICE_NAME/queue/max_sectors_kb on the initiators to the same |
| or a lower value than /sys/block/DEVICE_NAME/queue/max_hw_sectors_kb of |
| the corresponding devices on the target. |
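| |
| On an initiator this adjustment can be sketched as below. The function |
| name cap_max_sectors_kb and the SYS_BLOCK override are illustrative |
| assumptions; LIMIT_KB is the max_hw_sectors_kb value that you have read |
| on the target beforehand. |
| |

```shell
# Lower max_sectors_kb of an initiator-side disk to LIMIT_KB if it is higher.
# SYS_BLOCK defaults to /sys/block; overriding it is only useful for testing.
SYS_BLOCK="${SYS_BLOCK:-/sys/block}"

cap_max_sectors_kb() {   # usage: cap_max_sectors_kb DEVICE_NAME LIMIT_KB
    attr="$SYS_BLOCK/$1/queue/max_sectors_kb"
    cur="$(cat "$attr")" || return 1
    # Leave the value alone when it is already within the limit.
    if [ "$cur" -gt "$2" ]; then
        echo "$2" >"$attr"
    fi
}

# Example: cap_max_sectors_kb sdb 512
```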
| |
| For non-block devices SCSI commands are usually generated directly by |
| applications, so, if you experience stalls of large transfers, you should |
| check the documentation of your application for how to limit the |
| transfer sizes. |
| |
| Another way to solve this issue is to build SG entries with more than 1 |
| page each. See the following patch as an example: |
| http://scst.sourceforge.net/sgv_big_order_alloc.diff |
| |
| |
| User space mode using scst_user dev handler |
| ------------------------------------------- |
| |
| The user space program fileio_tgt uses the interface of the scst_user dev |
| handler and lets you see how it works in various modes. Fileio_tgt |
| provides mostly the same functionality as the scst_vdisk handler, with |
| the most noticeable difference being that it supports O_DIRECT mode. |
| O_DIRECT mode is basically the same as BLOCKIO, but also supports files, |
| so for some loads it could be significantly faster than regular FILEIO |
| access. Everything said about BLOCKIO above applies to O_DIRECT as well. |
| See fileio_tgt's README file for more details. |
| |
| |
| Performance |
| ----------- |
| |
| From the very beginning SCST has been designed and implemented to |
| provide the best possible performance. Since there is no "one size fits |
| all" best-performance configuration for different setups and loads, SCST |
| provides an extensive set of settings that allow tuning it for the best |
| performance in each particular case. You don't necessarily have to use |
| those settings. If you don't, SCST will do a very good job of autotuning |
| for you, so the resulting performance will, on average, be better |
| (sometimes much better) than with other SCSI targets. But in some cases |
| you can improve it even more by manual tuning. |
| |
| Before doing any performance measurements, note that performance results |
| depend very much on your type of load, so it is crucial that you choose |
| the access mode (FILEIO, BLOCKIO, O_DIRECT, pass-through) that suits your |
| needs best. |
| |
| In order to get the maximum performance you should: |
| |
| 1. For SCST: |
| |
| - Disable in Makefile and scst.h CONFIG_SCST_STRICT_SERIALIZING, |
| CONFIG_SCST_EXTRACHECKS, CONFIG_SCST_TRACING, CONFIG_SCST_DEBUG*, |
| CONFIG_SCST_STRICT_SECURITY. |
| |
| 2. For target drivers: |
| |
| - Disable in Makefiles CONFIG_SCST_EXTRACHECKS, CONFIG_SCST_TRACING, |
| CONFIG_SCST_DEBUG* |
| |
| 3. For device handlers, including VDISK: |
| |
| - Disable in Makefile CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG. |
| |
| Note that by disabling CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG you |
| disable many useful SCST diagnostic messages, which can significantly |
| help in many troubleshooting cases. So you may consider keeping |
| CONFIG_SCST_TRACING; its performance impact is very limited. |
| |
| IMPORTANT: The development version of SCST in the SVN is optimized for |
| ========= development and bug hunting, not for performance. This means |
| it is MUCH slower (multiple times). To reconfigure SCST for |
| release you should run "make 2release" command in the root of |
| your source code (e.g. trunk/). It will set the above options |
| as needed. The only option it doesn't set is |
| CONFIG_SCST_TEST_IO_IN_SIRQ, so, if needed, you should change |
| it manually. There is also a so-called "performance" build |
| mode, which you can activate with the "make 2perf" command. |
| Its only difference from the release build mode is that the |
| CONFIG_SCST_TRACING option is disabled. Because of that, you |
| won't be able to see many important SCST run time logging |
| messages. This mode is intended for evaluating the impact of |
| CONFIG_SCST_TRACING on performance and is not recommended for |
| production. |
| |
| IMPORTANT: You can't use debug SCST drivers with non-debug SCST core. |
| ========= So, after disabling both CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG |
| for SCST core you have to disable them for all SCST drivers |
| you are using as well. |
| |
| 4. Make sure you have io_grouping_type option set correctly, especially |
| in the following cases: |
| |
| - Several initiators share your target's backstorage. It can be a |
| shared LU using some cluster FS, like VMFS, as well as can be |
| different LUs located on the same backstorage (RAID array). For |
| instance, if you have 3 initiators and each of them using its own |
| dedicated FILEIO device file from the same RAID-6 array on the |
| target. |
| |
| In this case, for the best performance you should have the |
| io_grouping_type option set to "never" in the targets and security |
| groups of all the LUNs. |
| |
| - Your initiator is connected to your target in MPIO mode. In this case, |
| for the best performance you should: |
| |
| * Either connect all the sessions from the initiator to a single |
| target or security group and set the io_grouping_type option to |
| "this_group_only" in that target or security group, |
| |
| * Or, if it isn't possible to connect all the sessions from the |
| initiator to a single target or security group, assign the same |
| numeric io_grouping_type value to each target/security group this |
| initiator is connected to. The exact value itself doesn't matter; |
| it is only important that all the targets/security groups use the |
| same value. |
| |
| Don't forget, io_grouping_type makes sense only if you use the CFQ I/O |
| scheduler on the target and for devices with threads_num >= 0 and, if |
| threads_num > 0, with threads_pool_type "per_initiator". |
| |
| You can check whether io_grouping_type is set correctly in your setup, |
| as well as whether the "auto" io_grouping_type value works for you, |
| with tests like the following: |
| |
| - For the non-MPIO case you can run a single-threaded sequential read, |
| e.g. using buffered dd, from one initiator, then run the same |
| single-threaded sequential read from a second initiator in parallel. |
| If io_grouping_type is set correctly, the aggregate throughput measured |
| on the target should decrease only slightly and all initiators |
| should have a nearly equal share of it. If io_grouping_type is not set |
| correctly, the aggregate throughput and/or the throughput on any |
| initiator will decrease significantly, by a factor of 2 or even more. |
| For instance, suppose you have 80MB/s single-threaded sequential reads |
| from the target on any initiator. When both initiators then read in |
| parallel, you should see an aggregate throughput on the target of |
| something like 70-75MB/s with a correct io_grouping_type, and |
| something like 35-40MB/s, or 8-10MB/s on any one initiator, with an |
| incorrect one. |
| |
| - For the MPIO case it's much easier. With an incorrect io_grouping_type |
| you simply won't see a performance increase from adding the second |
| session (assuming your hardware is capable of transferring data through |
| both sessions in parallel), or you can even see a performance decrease. |
| |
| 5. If you are going to use your target in a VM environment, for |
| instance as shared storage with VMware, make sure all your VMs are |
| connected to the target via *separate* sessions. For instance, for iSCSI |
| it means that each VM has its own connection to the target, rather than |
| all VMs sharing a single connection. You can check it using the SCST |
| sysfs interface. For other transports you should use the available |
| facilities, like NPIV for Fibre Channel, to make separate sessions for |
| each VM. If you miss it, you can lose a lot of performance on parallel |
| access to your target from different VMs. This doesn't apply if your |
| VMs are using the same shared storage, like with VMFS, for instance. In |
| this case all your VM hosts will be connected to the target via separate |
| sessions, which is enough. |
| |
| 6. For other target and initiator software parts: |
| |
| - Make sure you have applied all available SCST patches to your kernel. |
| If a patch doesn't exist for your kernel version, it is strongly |
| recommended to upgrade your kernel to a version for which the patch |
| exists. |
| |
| - Don't enable debug/hacking features in the kernel, i.e. leave them as |
| they are by default. |
| |
| - The default kernel read-ahead and queuing settings are optimized |
| for locally attached disks, so they are not optimal for remotely |
| attached ones (the SCSI target case), which can sometimes lead to |
| unexpectedly low throughput. You should increase the read-ahead size |
| to at least 512KB on all initiators and the target. |
| |
| You should also limit the maximum number of sectors per SCSI command |
| on all initiators. This tuning is also recommended on targets with |
| large read-ahead values. To do it on Linux, run: |
| |
| echo 64 > /sys/block/sdX/queue/max_sectors_kb |
| |
| where X is the letter of the device imported from the target, |
| like 'b', i.e. sdb. |
| |
| To increase read-ahead size on Linux, run: |
| |
| blockdev --setra N /dev/sdX |
| |
| where N is a read-ahead number in 512-byte sectors and X is a device |
| letter like above. |
| |
| Note: you need to set the read-ahead setting for device sdX again after |
| you change the maximum number of sectors per SCSI command for that |
| device. |
| |
| Note2: you need to restart SCST after you change read-ahead settings |
| on the target. It is a limitation of the Linux read-ahead |
| implementation: it reads the RA values for each file only when the file |
| is opened and does not update them when the global RA parameters change. |
| Hence the need for vdisk to reopen all its files/devices. |
| |
| - You may need to increase the number of requests that the OS on the |
| initiator sends to the target device. To do it on Linux initiators, run: |
| |
| echo 64 > /sys/block/sdX/queue/nr_requests |
| |
| where X is a device letter like above. |
| |
| You may also experiment with other parameters in the /sys/block/sdX |
| directory; they also affect performance. If you find the best values, |
| please share them with us. |
| |
| - On the target use the CFQ IO scheduler. In most cases it has a |
| performance advantage over other IO schedulers, sometimes huge (2+ times |
| aggregate throughput increase). |
| |
| - It is recommended to turn the kernel preemption off, i.e. set |
| the kernel preemption model to "No Forced Preemption (Server)". |
| |
| - XFS appears to be the best filesystem on the target for storing device |
| files, because it allows considerably better linear write throughput |
| than ext3. |
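| |
| The initiator-side read-ahead and max_sectors_kb tuning from item 6 can |
| be collected into a small script. The following is only a sketch: the |
| function names and the example values are illustrative assumptions, not |
| part of SCST, and tune_dev must be run as root on a real imported device. |
| |

```shell
#!/bin/sh
# Sketch only: helper names below are illustrative, not part of SCST.

# Convert a read-ahead size in KB to the 512-byte sectors that
# "blockdev --setra" expects (1 KB = 2 sectors).
ra_kb_to_sectors() {
    echo $(( $1 * 2 ))
}

# Apply the tuning to one device, e.g. "tune_dev sdb 64 512".
# max_sectors_kb is written first, because the read-ahead setting has
# to be set again after max_sectors_kb changes.
tune_dev() {
    dev=$1 max_kb=$2 ra_kb=$3
    echo "$max_kb" > "/sys/block/$dev/queue/max_sectors_kb" &&
    blockdev --setra "$(ra_kb_to_sectors "$ra_kb")" "/dev/$dev"
}
```

| |
| For example, "tune_dev sdb 64 512" limits commands on sdb to 64KB and |
| then sets a 512KB read-ahead, i.e. 1024 sectors. |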
| |
| 7. For hardware on target. |
| |
| - Make sure that your target hardware (e.g. target FC or network card) |
| and underlying IO hardware (e.g. an IO card, like SATA, SCSI or RAID, to |
| which your disks are connected) don't share the same PCI bus. You can |
| check it using the lspci utility. They have to work in parallel, so it |
| will be better if they don't compete for the bus. The problem is not |
| only the bandwidth, which they have to share, but also the |
| interaction between the cards during that competition. This is very |
| important, because in some cases, if the target and backend storage |
| controllers share the same PCI bus, performance can be up to 5-10 times |
| lower than expected. Moreover, some motherboards (by |
| Supermicro, particularly) have serious stability issues if there are |
| several high speed devices on the same bus working in parallel. If |
| you have no choice but PCI bus sharing, set the PCI latency in the BIOS |
| as low as possible. |
| |
| 8. If you use the VDISK IO module in FILEIO mode, the NV_CACHE option will |
| provide you the best performance. But when using it, make sure you use a |
| good UPS with the ability to shut down the target on a power failure. |
| |
| You can find baseline performance numbers in these measurements: |
| http://lkml.org/lkml/2009/3/30/283. |
| |
| IMPORTANT: If you use some versions of Windows (at least W2K) on the initiator, |
| ========= you can't get good write performance for VDISK FILEIO devices with |
| the default 512-byte block size. You could get about 10% of the |
| expected performance. This is because of the partition alignment, which |
| is (simplifying) incompatible with how the Linux page cache |
| works, so for each write the corresponding block must be read |
| first. Use a 4096-byte block size for VDISK devices and you |
| will have the expected write performance. Actually, any OS on |
| initiators, not only Windows, will benefit from a block size of |
| max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS), where PAGE_SIZE |
| is the page size and BLOCK_SIZE_ON_UNDERLYING_FS is the block size |
| of the underlying FS on which the device file is located, or 0 |
| if a device node is used. Both values are from the target. |
| See also the important notes about setting block sizes >512 bytes |
| for VDISK FILEIO devices above. |
| |
| |
| 9. In some cases, for instance when working with SSD devices that consume |
| 100% of a single CPU for data transfers in their internal threads, |
| it may be necessary to assign dedicated CPUs to those threads to |
| maximize IOPS. Consider using the cpu_mask attribute for devices with |
| threads_pool_type "per_initiator" or the Linux CPU affinity facilities |
| for other threads_pool_types. No IRQ processing should be done on those |
| CPUs. Check that using /proc/interrupts. See the taskset command and |
| Documentation/IRQ-affinity.txt in your kernel's source tree for how to |
| assign CPU affinity to tasks and IRQs. |
| |
| The reason for that is that processing of incoming commands in SIRQ |
| context might be done on the same CPUs as the SSD devices' threads doing |
| data transfers. As a result, those threads won't receive all the |
| processing power of those CPUs and will perform worse. |
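| |
| The /proc/interrupts check from item 9 can be sketched as follows. The |
| irqs_on_cpu helper is an illustrative assumption, not part of SCST; it |
| assumes that CPU numbering in the header row is contiguous (no offline |
| CPUs in between). |
| |

```shell
#!/bin/sh
# Sketch only: irqs_on_cpu is an illustrative helper, not part of SCST.

# Print the labels of all /proc/interrupts rows that have a nonzero
# count on the given CPU. Counts are cumulative since boot, so compare
# two runs taken under load to see which IRQs are still firing.
# Usage: irqs_on_cpu CPU_NUMBER [FILE]
irqs_on_cpu() {
    awk -v cpu="$1" '
        NR == 1 { ncpus = NF; next }          # header row: CPU0 CPU1 ...
        cpu < ncpus {
            col = cpu + 2                     # field 1 is the IRQ label
            if ($col ~ /^[0-9]+$/ && $col > 0) {
                sub(/:$/, "", $1)             # strip the trailing colon
                print $1
            }
        }
    ' "${2:-/proc/interrupts}"
}
```

| |
| For example, "irqs_on_cpu 2" lists the IRQs that have been handled on |
| CPU 2. A thread can then be pinned to that CPU with a command like |
| "taskset -pc 2 PID", where PID is the ID of the thread in question. |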
| |
| 10. If your storage is capable of operating at the level of hundreds of |
| thousands of IOPS, you can use the poll_us sysfs attribute to set how |
| many microseconds each SCST thread polls its queue after it becomes |
| empty, in the hope that a new command will arrive. In some cases, polling |
| can significantly increase IOPS, especially if low power states on the |
| CPU are not disabled, because at high IOPS polling can be cheaper than |
| spending significant time on entering and then exiting CPU low power |
| states plus the corresponding context switches. Polling is disabled by |
| default. The recommended value to start from is 5-10 us. Then you can |
| increase or decrease it to see whether your IOPS increase or decrease. |
| |
| |
| Commands suspending takes too long |
| ---------------------------------- |
| |
| SCST suspends commands during some management activities like |
| adding/deleting LUNs or devices. It is done to have lockless LUN |
| translation on the hot command processing path. This brings a significant |
| performance advantage. You will see a message like "Waiting for X active |
| commands to complete" when this wait starts. |
| |
| But the downside is that no new commands start executing until the older |
| ones, which had started before the suspending began, have finished. This |
| wait cannot be any longer than the worst command latency any of your |
| initiators is seeing at this particular time. |
| |
| So, if this wait takes too long, in the majority of cases it means that |
| you are overloading your storage. Proper storage should have a worst-case |
| latency below a few hundred milliseconds. In this case the SCST |
| suspending will finish in a few hundred milliseconds at worst. |
| |
| Another case when suspending can take too long is a hung user space |
| device (i.e. scst_user device) not responding to any command. In this |
| case you should kill the corresponding user space program to finish |
| suspending. |
| |
| |
| Work if target's backstorage or link is too slow |
| ------------------------------------------------ |
| |
| Under high I/O load, when your target's backstorage gets overloaded, or |
| when working over a slow link between initiator and target that |
| can't serve all the queued commands on time, you can experience I/O |
| stalls or see abort or reset messages in the kernel log. |
| |
| First, consider the case of a too slow target backstorage. On some |
| seek intensive workloads even fast disks or RAIDs, which are able to serve |
| a continuous data stream at 500+ MB/s, can be as slow as 0.3 MB/s. |
| Another possible cause can be MD/LVM/RAID on your target, as in |
| http://lkml.org/lkml/2008/2/27/96 (check the whole thread as well). |
| |
| Thus, in such situations processing of one or more commands simply takes |
| too long, hence the initiator decides that they are stuck on the target |
| and tries to recover. In particular, it is known that the default number |
| of simultaneously queued commands (48) is sometimes too high if you do |
| intensive writes from VMware on a target disk which uses LVM in |
| snapshot mode. In this case a value like 16 or even 8-10, depending on |
| your backstorage speed, could be more appropriate. |
| |
| There are 6 possible actions you can take to work around or fix such |
| issues: |
| |
| 1. Ignore incoming task management (TM) commands. It's fine if there are |
| not too many of them, so average performance isn't hurt and the |
| corresponding device isn't put offline, i.e. if the backstorage |
| isn't way too slow. |
| |
| 2. Decrease /sys/block/sdX/device/queue_depth on the initiator if |
| it's Linux (see below how) and/or the SCST_MAX_TGT_DEV_COMMANDS constant |
| in the scst_priv.h file until you stop seeing incoming TM commands. |
| The iSCSI-SCST driver also has its own iSCSI-specific parameter for that, |
| see its README file. |
| |
| To decrease the device queue depth on Linux initiators you can run: |
| |
| # echo Y >/sys/block/sdX/device/queue_depth |
| |
| where Y is the new number of simultaneously queued commands and X is |
| your imported device letter, like 'a' for the sda device. There are no |
| special limitations on the Y value; it can be any value from 1 to the |
| possible maximum (usually 32), so start by dividing the current value by |
| 2, i.e. set 16 if /sys/block/sdX/device/queue_depth contains 32. |
| |
| 3. Increase the corresponding timeout on the initiator. For Linux it is |
| located in |
| /sys/devices/platform/host*/session*/target*:0:0/*:0:0:1/timeout. It can |
| be done automatically by a udev rule. For instance, the following |
| rule will increase it to 300 seconds: |
| |
| SUBSYSTEM=="scsi", KERNEL=="[0-9]*:[0-9]*", ACTION=="add", ATTR{type}=="0|7|14", ATTR{timeout}="300" |
| |
| By default, this timeout is 30 or 60 seconds, depending on your distribution. |
| |
| 4. Try to avoid such seek intensive workloads. |
| |
| 5. Increase speed of the target's backstorage. |
| |
| 6. Implement QoS in SCST, so that the queue depth on the target is |
| dynamically adjusted and the worst-case latencies seen by initiators |
| are controlled. |
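| |
| Action 2 above, halving the initiator's queue depth, can be sketched as |
| a small script. The helper names are illustrative assumptions, not part |
| of SCST, and halve_queue_depth must be run as root on a Linux initiator. |
| |

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of SCST.

# Half of the given queue depth, but never less than 1.
half_depth() {
    half=$(( $1 / 2 ))
    [ "$half" -ge 1 ] || half=1
    echo "$half"
}

# Halve the queue depth of one imported device, e.g. "halve_queue_depth sdb".
halve_queue_depth() {
    f="/sys/block/$1/device/queue_depth"
    cur=$(cat "$f") || return 1
    new=$(half_depth "$cur")
    echo "$new" > "$f" && echo "$1: queue_depth $cur -> $new"
}
```

| |
| Repeat the call until incoming TM commands stop appearing in the |
| target's kernel log. |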
| |
| Next, consider the case of a too slow link between initiator and target, |
| when the initiator tries to simultaneously push N commands to the target |
| over it. In this case the time to serve those commands, i.e. send or |
| receive data for them over the link, can be longer than the timeout for |
| any single command, hence one or more commands at the tail of the queue |
| cannot be served before the timeout expires, so the initiator will decide |
| that they are stuck on the target and will try to recover. |
| |
| To work around or fix this issue you can use ways 1, 2, 3 above |
| or (7): increase the speed of the link between target and initiator. |
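| |
| The slow link case can be illustrated with a rough estimate. The sketch |
| below is an assumption-laden approximation, not part of SCST: it ignores |
| protocol overhead and assumes that all queued commands carry the same |
| amount of data; the numbers in the example are made up. |
| |

```shell
#!/bin/sh
# Sketch only: a rough estimate that ignores protocol overhead.

# Seconds needed to move N queued commands of SIZE_KB each over a link
# that sustains LINK_KBPS kilobytes per second (rounded up).
# Usage: queue_drain_seconds N SIZE_KB LINK_KBPS
queue_drain_seconds() {
    total_kb=$(( $1 * $2 ))
    echo $(( (total_kb + $3 - 1) / $3 ))
}
```

| |
| For instance, "queue_drain_seconds 32 1024 1024" prints 32: with 32 queued |
| commands of 1MB each on a 1MB/s link, the last command completes after |
| about 32 seconds, which is more than a 30 second timeout. With a queue |
| depth of 16 the same estimate gives 16 seconds. |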
| |
| Note that logged messages about QUEUE_FULL status are quite different |
| by nature. This is normal operation, just SCSI flow control in action. |
| Simply don't enable the "mgmt_minor" logging level, or, alternatively, if |
| you are confident in the worst case performance of your back-end storage |
| or initiator-target link, you can increase SCST_MAX_TGT_DEV_COMMANDS in |
| scst_priv.h to 64. Usually initiators don't try to push more commands to |
| the target. |
| |
| IMPORTANT |
| ========= |
| |
| There must be LUN 0 in each security group, i.e. LU numbering must not |
| start from, e.g., 1. Otherwise you will see no devices on remote |
| initiators and the SCST core will write this message into the kernel log: |
| "tgt_dev for LUN 0 not found, command to unexisting LU?" |
| |
| IMPORTANT |
| ========= |
| |
| All the access control must be fully configured BEFORE loading the |
| corresponding target driver! When you load a target driver or enable |
| target mode in it, as for the qla2x00t driver, it will immediately start |
| accepting new connections, hence creating new sessions, and those new |
| sessions will be assigned to security groups according to the |
| *currently* configured access control settings. For instance, to the |
| "Default" group, instead of "HOST004" as you may need, because "HOST004" |
| doesn't exist yet. So, one must configure all the security groups before |
| new connections from the initiators are created, i.e. before target |
| drivers are loaded. |
| |
| Access controls can be altered after the target driver is loaded as long |
| as the target session doesn't exist yet. Even if the |
| session already exists, changes are still possible, but they won't be |
| reflected on the initiator side. |
| |
| So, the safest choice is to configure all the access control before |
| loading any target driver and then only add new devices to new groups for |
| new initiators or add new devices to old groups, without altering |
| existing LUNs in them. |
| |
| |
| Credits |
| ------- |
| |
| Thanks to: |
| |
| * Mark Buechler <mark.buechler@gmail.com> for a lot of useful |
| suggestions, bug reports and help in debugging. |
| |
| * Ming Zhang <mingz@ele.uri.edu> for fixes and comments. |
| |
| * Nathaniel Clark <nate@misrule.us> for fixes and comments. |
| |
| * Calvin Morrow <calvin.morrow@comcast.net> for testing and useful |
| suggestions. |
| |
| * Hu Gang <hugang@soulinfo.com> for the original version of the |
| LSI target driver. |
| |
| * Erik Habbinga <erikhabbinga@inphase-tech.com> for fixes and support |
| of the LSI target driver. |
| |
| * Ross S. W. Walker <rswwalker@hotmail.com> for BLOCKIO inspiration |
| and Vu Pham <huongvp@yahoo.com> who implemented it for VDISK dev handler. |
| |
| * Alessandro Premoli <a.premoli@andxor.it> for fixes |
| |
| * Terry Greeniaus <tgreeniaus@yottayotta.com> for fixes. |
| |
| * Krzysztof Blaszkowski <kb@sysmikro.com.pl> for many fixes and bug reports. |
| |
| * Jianxi Chen <pacers@users.sourceforge.net> for fixing problem with |
| devices >2TB in size |
| |
| * Bart Van Assche <bvanassche@acm.org> for a lot of help |
| |
| * University of New Hampshire Interoperability Labs (UNH IOL, http://www.iol.unh.edu) |
| for UNH-iSCSI project (http://www.iol.unh.edu/consortiums/iscsi/index.html) |
| on which interface between SCST core and target drivers was based. |
| |
| * Daniel Debonzi <debonzi@linux.vnet.ibm.com> for a big part of the |
| initial SCST sysfs tree implementation |
| |
| |
| Vladislav Bolkhovitin <vst@vlnb.net>, http://scst.sourceforge.net |