| .\" |
| .\" CDDL HEADER START |
| .\" |
| .\" The contents of this file are subject to the terms of the |
| .\" Common Development and Distribution License (the "License"). |
| .\" You may not use this file except in compliance with the License. |
| .\" |
| .\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE |
| .\" or http://www.opensolaris.org/os/licensing. |
| .\" See the License for the specific language governing permissions |
| .\" and limitations under the License. |
| .\" |
| .\" When distributing Covered Code, include this CDDL HEADER in each |
| .\" file and include the License file at usr/src/OPENSOLARIS.LICENSE. |
| .\" If applicable, add the following below this CDDL HEADER, with the |
| .\" fields enclosed by brackets "[]" replaced with your own identifying |
| .\" information: Portions Copyright [yyyy] [name of copyright owner] |
| .\" |
| .\" CDDL HEADER END |
| .\" |
| .\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved. |
| .\" Copyright (c) 2012, 2018 by Delphix. All rights reserved. |
| .\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved. |
| .\" Copyright (c) 2017 Datto Inc. |
| .\" Copyright (c) 2018 George Melikov. All Rights Reserved. |
| .\" Copyright 2017 Nexenta Systems, Inc. |
| .\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved. |
| .\" |
| .Dd June 2, 2021 |
| .Dt ZPOOLCONCEPTS 7 |
| .Os |
| . |
| .Sh NAME |
| .Nm zpoolconcepts |
| .Nd overview of ZFS storage pools |
| . |
| .Sh DESCRIPTION |
| .Ss Virtual Devices (vdevs) |
| A "virtual device" describes a single device or a collection of devices |
| organized according to certain performance and fault characteristics. |
| The following virtual devices are supported: |
| .Bl -tag -width "special" |
| .It Sy disk |
| A block device, typically located under |
| .Pa /dev . |
| ZFS can use individual slices or partitions, though the recommended mode of |
| operation is to use whole disks. |
| A disk can be specified by a full path, or it can be a shorthand name |
| .Po the relative portion of the path under |
| .Pa /dev |
| .Pc . |
| A whole disk can be specified by omitting the slice or partition designation. |
| For example, |
| .Pa sda |
| is equivalent to |
| .Pa /dev/sda . |
| When given a whole disk, ZFS automatically labels the disk, if necessary. |
| .It Sy file |
| A regular file. |
| The use of files as a backing store is strongly discouraged. |
| It is designed primarily for experimental purposes, as the fault tolerance of a |
| file is only as good as the file system on which it resides. |
| A file must be specified by a full path. |
| .It Sy mirror |
| A mirror of two or more devices. |
| Data is replicated in an identical fashion across all components of a mirror. |
| A mirror with |
| .Em N No disks of size Em X No can hold Em X No bytes and can withstand Em N-1 |
| devices failing without losing data. |
| .It Sy raidz , raidz1 , raidz2 , raidz3 |
| A variation on RAID-5 that allows for better distribution of parity and |
| eliminates the RAID-5 |
| .Qq write hole |
| .Pq in which data and parity become inconsistent after a power loss . |
Data and parity are striped across all disks within a raidz group.
| .Pp |
| A raidz group can have single, double, or triple parity, meaning that the |
| raidz group can sustain one, two, or three failures, respectively, without |
| losing any data. |
| The |
| .Sy raidz1 |
| vdev type specifies a single-parity raidz group; the |
| .Sy raidz2 |
| vdev type specifies a double-parity raidz group; and the |
| .Sy raidz3 |
| vdev type specifies a triple-parity raidz group. |
| The |
| .Sy raidz |
| vdev type is an alias for |
| .Sy raidz1 . |
| .Pp |
| A raidz group with |
| .Em N No disks of size Em X No with Em P No parity disks can hold approximately |
| .Em (N-P)*X No bytes and can withstand Em P No devices failing without losing data. |
| The minimum number of devices in a raidz group is one more than the number of |
| parity disks. |
| The recommended number is between 3 and 9 to help increase performance. |
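.Pp
For example, a double-parity raidz group of six disks
.Pq device names are illustrative
could be created with:
.Dl # Nm zpool Cm create Ar pool Sy raidz2 Ar sda sdb sdc sdd sde sdf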
| .It Sy draid , draid1 , draid2 , draid3 |
A variant of raidz that provides integrated distributed hot spares, allowing
for faster resilvering while retaining the benefits of raidz.
| A dRAID vdev is constructed from multiple internal raidz groups, each with |
| .Em D No data devices and Em P No parity devices. |
| These groups are distributed over all of the children in order to fully |
| utilize the available disk performance. |
| .Pp |
| Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with |
| zeros) to allow fully sequential resilvering. |
This fixed stripe width significantly affects both usable capacity and IOPS.
| For example, with the default |
.Em D=8 No and Em 4kB No disk sectors, the minimum allocation size is Em 32kB .
| If using compression, this relatively large allocation size can reduce the |
| effective compression ratio. |
When using ZFS volumes and dRAID, the default value of the
| .Sy volblocksize |
| property is increased to account for the allocation size. |
| If a dRAID pool will hold a significant amount of small blocks, it is |
| recommended to also add a mirrored |
| .Sy special |
| vdev to store those blocks. |
| .Pp |
In regard to I/O, performance is similar to raidz, since for any read all
| .Em D No data disks must be accessed. |
| Delivered random IOPS can be reasonably approximated as |
| .Sy floor((N-S)/(D+P))*single_drive_IOPS . |
| .Pp |
Like raidz, a dRAID can have single, double, or triple parity.
| The |
| .Sy draid1 , |
| .Sy draid2 , |
| and |
| .Sy draid3 |
| types can be used to specify the parity level. |
| The |
| .Sy draid |
| vdev type is an alias for |
| .Sy draid1 . |
| .Pp |
| A dRAID with |
| .Em N No disks of size Em X , D No data disks per redundancy group, Em P |
| .No parity level, and Em S No distributed hot spares can hold approximately |
| .Em (N-S)*(D/(D+P))*X No bytes and can withstand Em P |
| devices failing without losing data. |
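.Pp
As an illustrative example, a
.Sy draid2:8d:1s
vdev built from 32 children of size
.Em X
can hold approximately
.Em (32-1)*(8/10)*X No = Em 24.8*X
bytes and deliver approximately
.Em floor((32-1)/(8+2)) No = Em 3
times the random IOPS of a single drive.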
| .It Sy draid Ns Oo Ar parity Oc Ns Oo Sy \&: Ns Ar data Ns Sy d Oc Ns Oo Sy \&: Ns Ar children Ns Sy c Oc Ns Oo Sy \&: Ns Ar spares Ns Sy s Oc |
| A non-default dRAID configuration can be specified by appending one or more |
| of the following optional arguments to the |
| .Sy draid |
| keyword: |
| .Bl -tag -compact -width "children" |
| .It Ar parity |
| The parity level (1-3). |
| .It Ar data |
| The number of data devices per redundancy group. |
| In general, a smaller value of |
| .Em D No will increase IOPS, improve the compression ratio, |
| and speed up resilvering at the expense of total usable capacity. |
| Defaults to |
| .Em 8 , No unless Em N-P-S No is less than Em 8 . |
| .It Ar children |
| The expected number of children. |
| Useful as a cross-check when listing a large number of devices. |
| An error is returned when the provided number of children differs. |
| .It Ar spares |
| The number of distributed hot spares. |
| Defaults to zero. |
| .El |
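.Pp
For example, a dRAID vdev with double parity, four data disks per redundancy
group, eight children, and one distributed spare
.Pq device names are illustrative
could be specified as:
.Dl # Nm zpool Cm create Ar pool Sy draid2:4d:8c:1s Ar sda sdb sdc sdd sde sdf sdg sdh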
| .It Sy spare |
| A pseudo-vdev which keeps track of available hot spares for a pool. |
| For more information, see the |
| .Sx Hot Spares |
| section. |
| .It Sy log |
| A separate intent log device. |
| If more than one log device is specified, then writes are load-balanced between |
| devices. |
| Log devices can be mirrored. |
| However, raidz vdev types are not supported for the intent log. |
| For more information, see the |
| .Sx Intent Log |
| section. |
| .It Sy dedup |
| A device dedicated solely for deduplication tables. |
| The redundancy of this device should match the redundancy of the other normal |
| devices in the pool. |
| If more than one dedup device is specified, then |
| allocations are load-balanced between those devices. |
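.Pp
For example, a mirrored pool with a mirrored dedup vdev
.Pq device names are illustrative
could be created with:
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy dedup mirror Ar sdc sdd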
| .It Sy special |
| A device dedicated solely for allocating various kinds of internal metadata, |
| and optionally small file blocks. |
| The redundancy of this device should match the redundancy of the other normal |
| devices in the pool. |
| If more than one special device is specified, then |
| allocations are load-balanced between those devices. |
| .Pp |
| For more information on special allocations, see the |
| .Sx Special Allocation Class |
| section. |
| .It Sy cache |
| A device used to cache storage pool data. |
| A cache device cannot be configured as a mirror or raidz group. |
| For more information, see the |
| .Sx Cache Devices |
| section. |
| .El |
| .Pp |
| Virtual devices cannot be nested, so a mirror or raidz virtual device can only |
| contain files or disks. |
| Mirrors of mirrors |
| .Pq or other combinations |
| are not allowed. |
| .Pp |
| A pool can have any number of virtual devices at the top of the configuration |
| .Po known as |
| .Qq root vdevs |
| .Pc . |
| Data is dynamically distributed across all top-level devices to balance data |
| among devices. |
| As new virtual devices are added, ZFS automatically places data on the newly |
| available devices. |
| .Pp |
| Virtual devices are specified one at a time on the command line, |
| separated by whitespace. |
| Keywords like |
| .Sy mirror No and Sy raidz |
| are used to distinguish where a group ends and another begins. |
| For example, the following creates a pool with two root vdevs, |
| each a mirror of two disks: |
| .Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy mirror Ar sdc sdd |
| . |
| .Ss Device Failure and Recovery |
| ZFS supports a rich set of mechanisms for handling device failure and data |
| corruption. |
| All metadata and data is checksummed, and ZFS automatically repairs bad data |
| from a good copy when corruption is detected. |
| .Pp |
| In order to take advantage of these features, a pool must make use of some form |
| of redundancy, using either mirrored or raidz groups. |
| While ZFS supports running in a non-redundant configuration, where each root |
| vdev is simply a disk or file, this is strongly discouraged. |
| A single case of bit corruption can render some or all of your data unavailable. |
| .Pp |
| A pool's health status is described by one of three states: |
| .Sy online , degraded , No or Sy faulted . |
| An online pool has all devices operating normally. |
| A degraded pool is one in which one or more devices have failed, but the data is |
| still available due to a redundant configuration. |
| A faulted pool has corrupted metadata, or one or more faulted devices, and |
| insufficient replicas to continue functioning. |
| .Pp |
| The health of the top-level vdev, such as a mirror or raidz device, |
| is potentially impacted by the state of its associated vdevs, |
| or component devices. |
| A top-level vdev or component device is in one of the following states: |
| .Bl -tag -width "DEGRADED" |
| .It Sy DEGRADED |
| One or more top-level vdevs is in the degraded state because one or more |
| component devices are offline. |
| Sufficient replicas exist to continue functioning. |
| .Pp |
| One or more component devices is in the degraded or faulted state, but |
| sufficient replicas exist to continue functioning. |
| The underlying conditions are as follows: |
| .Bl -bullet -compact |
| .It |
| The number of checksum errors exceeds acceptable levels and the device is |
| degraded as an indication that something may be wrong. |
| ZFS continues to use the device as necessary. |
| .It |
| The number of I/O errors exceeds acceptable levels. |
| The device could not be marked as faulted because there are insufficient |
| replicas to continue functioning. |
| .El |
| .It Sy FAULTED |
| One or more top-level vdevs is in the faulted state because one or more |
| component devices are offline. |
| Insufficient replicas exist to continue functioning. |
| .Pp |
| One or more component devices is in the faulted state, and insufficient |
| replicas exist to continue functioning. |
| The underlying conditions are as follows: |
| .Bl -bullet -compact |
| .It |
| The device could be opened, but the contents did not match expected values. |
| .It |
| The number of I/O errors exceeds acceptable levels and the device is faulted to |
| prevent further use of the device. |
| .El |
| .It Sy OFFLINE |
| The device was explicitly taken offline by the |
| .Nm zpool Cm offline |
| command. |
| .It Sy ONLINE |
| The device is online and functioning. |
| .It Sy REMOVED |
| The device was physically removed while the system was running. |
| Device removal detection is hardware-dependent and may not be supported on all |
| platforms. |
| .It Sy UNAVAIL |
| The device could not be opened. |
| If a pool is imported when a device was unavailable, then the device will be |
| identified by a unique identifier instead of its path since the path was never |
| correct in the first place. |
| .El |
| .Pp |
| Checksum errors represent events where a disk returned data that was expected |
| to be correct, but was not. |
| In other words, these are instances of silent data corruption. |
| The checksum errors are reported in |
| .Nm zpool Cm status |
| and |
| .Nm zpool Cm events . |
| When a block is stored redundantly, a damaged block may be reconstructed |
| (e.g. from raidz parity or a mirrored copy). |
| In this case, ZFS reports the checksum error against the disks that contained |
| damaged data. |
| If a block is unable to be reconstructed (e.g. due to 3 disks being damaged |
| in a raidz2 group), it is not possible to determine which disks were silently |
| corrupted. |
| In this case, checksum errors are reported for all disks on which the block |
| is stored. |
| .Pp |
| If a device is removed and later re-attached to the system, |
ZFS attempts to online the device automatically.
| Device attachment detection is hardware-dependent |
| and might not be supported on all platforms. |
| . |
| .Ss Hot Spares |
| ZFS allows devices to be associated with pools as |
| .Qq hot spares . |
| These devices are not actively used in the pool, but when an active device |
| fails, it is automatically replaced by a hot spare. |
| To create a pool with hot spares, specify a |
| .Sy spare |
| vdev with any number of devices. |
| For example, |
| .Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy spare Ar sdc sdd |
| .Pp |
| Spares can be shared across multiple pools, and can be added with the |
| .Nm zpool Cm add |
| command and removed with the |
| .Nm zpool Cm remove |
| command. |
| Once a spare replacement is initiated, a new |
| .Sy spare |
| vdev is created within the configuration that will remain there until the |
| original device is replaced. |
| At this point, the hot spare becomes available again if another device fails. |
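.Pp
For example, a hot spare
.Pq illustrative device name
can be added to, and later removed from, an existing pool with:
.Dl # Nm zpool Cm add Ar pool Sy spare Ar sde
.Dl # Nm zpool Cm remove Ar pool sde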
| .Pp |
If a pool has a shared spare that is currently being used, the pool cannot be
exported, since other pools may use this shared spare, which may lead to
data corruption.
| .Pp |
| Shared spares add some risk. |
| If the pools are imported on different hosts, |
| and both pools suffer a device failure at the same time, |
| both could attempt to use the spare at the same time. |
| This may not be detected, resulting in data corruption. |
| .Pp |
| An in-progress spare replacement can be cancelled by detaching the hot spare. |
| If the original faulted device is detached, then the hot spare assumes its |
| place in the configuration, and is removed from the spare list of all active |
| pools. |
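.Pp
For example, assuming
.Ar sdc
is the hot spare currently in use, the in-progress replacement can be
cancelled with:
.Dl # Nm zpool Cm detach Ar pool sdc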
| .Pp |
| The |
| .Sy draid |
| vdev type provides distributed hot spares. |
| These hot spares are named after the dRAID vdev they're a part of |
| .Po Sy draid1 Ns - Ns Ar 2 Ns - Ns Ar 3 No specifies spare Ar 3 No of vdev Ar 2 , |
| .No which is a single parity dRAID Pc |
| and may only be used by that dRAID vdev. |
| Otherwise, they behave the same as normal hot spares. |
| .Pp |
| Spares cannot replace log devices. |
| . |
| .Ss Intent Log |
| The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous |
| transactions. |
| For instance, databases often require their transactions to be on stable storage |
| devices when returning from a system call. |
| NFS and other applications can also use |
| .Xr fsync 2 |
| to ensure data stability. |
| By default, the intent log is allocated from blocks within the main pool. |
| However, it might be possible to get better performance using separate intent |
| log devices such as NVRAM or a dedicated disk. |
| For example: |
| .Dl # Nm zpool Cm create Ar pool sda sdb Sy log Ar sdc |
| .Pp |
| Multiple log devices can also be specified, and they can be mirrored. |
| See the |
| .Sx EXAMPLES |
| section for an example of mirroring multiple log devices. |
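.Pp
For instance, a mirrored log
.Pq device names are illustrative
could be specified at pool creation time:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log mirror Ar sdc sdd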
| .Pp |
| Log devices can be added, replaced, attached, detached and removed. |
| In addition, log devices are imported and exported as part of the pool |
| that contains them. |
| Mirrored devices can be removed by specifying the top-level mirror vdev. |
| . |
| .Ss Cache Devices |
| Devices can be added to a storage pool as |
| .Qq cache devices . |
| These devices provide an additional layer of caching between main memory and |
| disk. |
| For read-heavy workloads, where the working set size is much larger than what |
| can be cached in main memory, using cache devices allows much more of this |
| working set to be served from low latency media. |
Using cache devices provides the greatest performance improvement for random
read workloads of mostly static content.
| .Pp |
| To create a pool with cache devices, specify a |
| .Sy cache |
| vdev with any number of devices. |
| For example: |
| .Dl # Nm zpool Cm create Ar pool sda sdb Sy cache Ar sdc sdd |
| .Pp |
| Cache devices cannot be mirrored or part of a raidz configuration. |
| If a read error is encountered on a cache device, that read I/O is reissued to |
| the original storage pool device, which might be part of a mirrored or raidz |
| configuration. |
| .Pp |
The content of the cache devices is persistent across reboots, and is restored
asynchronously to L2ARC when the pool is imported (persistent L2ARC).
| This can be disabled by setting |
| .Sy l2arc_rebuild_enabled Ns = Ns Sy 0 . |
| For cache devices smaller than |
| .Em 1GB , |
the metadata structures required for rebuilding the L2ARC are not written,
in order not to waste space.
| This can be changed with |
| .Sy l2arc_rebuild_blocks_min_l2size . |
| The cache device header |
| .Pq Em 512B |
| is updated even if no metadata structures are written. |
| Setting |
| .Sy l2arc_headroom Ns = Ns Sy 0 |
| will result in scanning the full-length ARC lists for cacheable content to be |
| written in L2ARC (persistent ARC). |
| If a cache device is added with |
| .Nm zpool Cm add |
its label and header will be overwritten and its contents will not be
restored in L2ARC, even if the device was previously part of the pool.
| If a cache device is onlined with |
| .Nm zpool Cm online |
| its contents will be restored in L2ARC. |
| This is useful in case of memory pressure |
| where the contents of the cache device are not fully restored in L2ARC. |
The user can offline and online the cache device when there is less memory
pressure, in order to fully restore its contents to L2ARC.
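.Pp
For example, assuming
.Ar sdc
is a cache device in the pool, its contents can be fully restored to L2ARC
by offlining and then onlining it:
.Dl # Nm zpool Cm offline Ar pool sdc
.Dl # Nm zpool Cm online Ar pool sdc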
| . |
| .Ss Pool checkpoint |
| Before starting critical procedures that include destructive actions |
| .Pq like Nm zfs Cm destroy , |
| an administrator can checkpoint the pool's state and in the case of a |
| mistake or failure, rewind the entire pool back to the checkpoint. |
| Otherwise, the checkpoint can be discarded when the procedure has completed |
| successfully. |
| .Pp |
| A pool checkpoint can be thought of as a pool-wide snapshot and should be used |
| with care as it contains every part of the pool's state, from properties to vdev |
| configuration. |
| Thus, certain operations are not allowed while a pool has a checkpoint. |
These include vdev removal/attach/detach, mirror splitting, and
changing the pool's GUID.
| Adding a new vdev is supported, but in the case of a rewind it will have to be |
| added again. |
| Finally, users of this feature should keep in mind that scrubs in a pool that |
| has a checkpoint do not repair checkpointed data. |
| .Pp |
| To create a checkpoint for a pool: |
| .Dl # Nm zpool Cm checkpoint Ar pool |
| .Pp |
| To later rewind to its checkpointed state, you need to first export it and |
| then rewind it during import: |
| .Dl # Nm zpool Cm export Ar pool |
| .Dl # Nm zpool Cm import Fl -rewind-to-checkpoint Ar pool |
| .Pp |
| To discard the checkpoint from a pool: |
| .Dl # Nm zpool Cm checkpoint Fl d Ar pool |
| .Pp |
| Dataset reservations (controlled by the |
| .Sy reservation No and Sy refreservation |
| properties) may be unenforceable while a checkpoint exists, because the |
| checkpoint is allowed to consume the dataset's reservation. |
| Finally, data that is part of the checkpoint but has been freed in the |
| current state of the pool won't be scanned during a scrub. |
| . |
| .Ss Special Allocation Class |
| Allocations in the special class are dedicated to specific block types. |
| By default this includes all metadata, the indirect blocks of user data, and |
| any deduplication tables. |
| The class can also be provisioned to accept small file blocks. |
| .Pp |
| A pool must always have at least one normal |
| .Pq non- Ns Sy dedup Ns /- Ns Sy special |
| vdev before |
| other devices can be assigned to the special class. |
| If the |
| .Sy special |
| class becomes full, then allocations intended for it |
| will spill back into the normal class. |
| .Pp |
| Deduplication tables can be excluded from the special class by unsetting the |
| .Sy zfs_ddt_data_is_special |
| ZFS module parameter. |
| .Pp |
| Inclusion of small file blocks in the special class is opt-in. |
| Each dataset can control the size of small file blocks allowed |
| in the special class by setting the |
| .Sy special_small_blocks |
| property to nonzero. |
| See |
| .Xr zfsprops 7 |
| for more info on this property. |
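.Pp
For example, a raidz pool with a mirrored special vdev
.Pq device names are illustrative
could be created with:
.Dl # Nm zpool Cm create Ar pool Sy raidz Ar sda sdb sdc Sy special mirror Ar sdd sde
.Pp
Small file blocks up to, for instance, 64 KiB could then be stored in the
special class for a given dataset with:
.Dl # Nm zfs Cm set Sy special_small_blocks Ns = Ns Ar 64K pool/dataset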