| .\" |
| .\" CDDL HEADER START |
| .\" |
| .\" The contents of this file are subject to the terms of the |
| .\" Common Development and Distribution License (the "License"). |
| .\" You may not use this file except in compliance with the License. |
| .\" |
| .\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE |
| .\" or http://www.opensolaris.org/os/licensing. |
| .\" See the License for the specific language governing permissions |
| .\" and limitations under the License. |
| .\" |
| .\" When distributing Covered Code, include this CDDL HEADER in each |
| .\" file and include the License file at usr/src/OPENSOLARIS.LICENSE. |
| .\" If applicable, add the following below this CDDL HEADER, with the |
| .\" fields enclosed by brackets "[]" replaced with your own identifying |
| .\" information: Portions Copyright [yyyy] [name of copyright owner] |
| .\" |
| .\" CDDL HEADER END |
| .\" |
| .\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved. |
| .\" Copyright (c) 2012, 2018 by Delphix. All rights reserved. |
| .\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved. |
| .\" Copyright (c) 2017 Datto Inc. |
| .\" Copyright (c) 2018 George Melikov. All Rights Reserved. |
| .\" Copyright 2017 Nexenta Systems, Inc. |
| .\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved. |
| .\" |
| .Dd June 2, 2021 |
| .Dt ZPOOLCONCEPTS 7 |
| .Os |
| . |
| .Sh NAME |
| .Nm zpoolconcepts |
| .Nd overview of ZFS storage pools |
| . |
| .Sh DESCRIPTION |
| .Ss Virtual Devices (vdevs) |
| A "virtual device" describes a single device or a collection of devices |
| organized according to certain performance and fault characteristics. |
| The following virtual devices are supported: |
| .Bl -tag -width "special" |
| .It Sy disk |
| A block device, typically located under |
| .Pa /dev . |
| ZFS can use individual slices or partitions, though the recommended mode of |
| operation is to use whole disks. |
| A disk can be specified by a full path, or it can be a shorthand name |
| .Po the relative portion of the path under |
| .Pa /dev |
| .Pc . |
| A whole disk can be specified by omitting the slice or partition designation. |
| For example, |
| .Pa sda |
| is equivalent to |
| .Pa /dev/sda . |
| When given a whole disk, ZFS automatically labels the disk, if necessary. |
| .It Sy file |
| A regular file. |
| The use of files as a backing store is strongly discouraged. |
| It is designed primarily for experimental purposes, as the fault tolerance of a |
| file is only as good as the file system on which it resides. |
| A file must be specified by a full path. |
| .It Sy mirror |
| A mirror of two or more devices. |
| Data is replicated in an identical fashion across all components of a mirror. |
| A mirror with |
| .Em N No disks of size Em X No can hold Em X No bytes and can withstand Em N-1 |
| devices failing without losing data. |
| .It Sy raidz , raidz1 , raidz2 , raidz3 |
| A variation on RAID-5 that allows for better distribution of parity and |
| eliminates the RAID-5 |
| .Qq write hole |
| .Pq in which data and parity become inconsistent after a power loss . |
Data and parity are striped across all disks within a raidz group.
| .Pp |
| A raidz group can have single, double, or triple parity, meaning that the |
| raidz group can sustain one, two, or three failures, respectively, without |
| losing any data. |
| The |
| .Sy raidz1 |
| vdev type specifies a single-parity raidz group; the |
| .Sy raidz2 |
| vdev type specifies a double-parity raidz group; and the |
| .Sy raidz3 |
| vdev type specifies a triple-parity raidz group. |
| The |
| .Sy raidz |
| vdev type is an alias for |
| .Sy raidz1 . |
| .Pp |
| A raidz group with |
| .Em N No disks of size Em X No with Em P No parity disks can hold approximately |
| .Em (N-P)*X No bytes and can withstand Em P No devices failing without losing data. |
| The minimum number of devices in a raidz group is one more than the number of |
| parity disks. |
| The recommended number is between 3 and 9 to help increase performance. |
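.Pp
For example, a double-parity raidz group of six disks
.Pq device names are illustrative
could be created with:
.Dl # Nm zpool Cm create Ar pool Sy raidz2 Ar sda sdb sdc sdd sde sdf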
| .It Sy draid , draid1 , draid2 , draid3 |
A variant of raidz that provides integrated distributed hot spares, allowing
for faster resilvering while retaining the benefits of raidz.
| A dRAID vdev is constructed from multiple internal raidz groups, each with |
| .Em D No data devices and Em P No parity devices. |
| These groups are distributed over all of the children in order to fully |
| utilize the available disk performance. |
| .Pp |
| Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with |
| zeros) to allow fully sequential resilvering. |
This fixed stripe width significantly affects both usable capacity and IOPS.
| For example, with the default |
.Em D=8 No and Em 4kB No disk sectors, the minimum allocation size is Em 32kB .
| If using compression, this relatively large allocation size can reduce the |
| effective compression ratio. |
When using ZFS volumes and dRAID, the default value of the
| .Sy volblocksize |
| property is increased to account for the allocation size. |
| If a dRAID pool will hold a significant amount of small blocks, it is |
| recommended to also add a mirrored |
| .Sy special |
| vdev to store those blocks. |
| .Pp |
In regard to I/O, performance is similar to raidz, since for any read all
| .Em D No data disks must be accessed. |
| Delivered random IOPS can be reasonably approximated as |
| .Sy floor((N-S)/(D+P))*single_drive_IOPS . |
| .Pp |
Like raidz, a dRAID can have single, double, or triple parity.
| The |
| .Sy draid1 , |
| .Sy draid2 , |
| and |
| .Sy draid3 |
| types can be used to specify the parity level. |
| The |
| .Sy draid |
| vdev type is an alias for |
| .Sy draid1 . |
| .Pp |
| A dRAID with |
| .Em N No disks of size Em X , D No data disks per redundancy group, Em P |
| .No parity level, and Em S No distributed hot spares can hold approximately |
| .Em (N-S)*(D/(D+P))*X No bytes and can withstand Em P |
| devices failing without losing data. |
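.Pp
As an illustrative example, a
.Sy draid2:8d:1s
vdev built from 32 children of size
.Em X
can hold approximately
.Em (32-1)*(8/10)*X No = Em 24.8*X
bytes and deliver approximately
.Em floor((32-1)/(8+2)) No = Em 3
times the random IOPS of a single drive.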
| .It Sy draid Ns Oo Ar parity Oc Ns Oo Sy \&: Ns Ar data Ns Sy d Oc Ns Oo Sy \&: Ns Ar children Ns Sy c Oc Ns Oo Sy \&: Ns Ar spares Ns Sy s Oc |
| A non-default dRAID configuration can be specified by appending one or more |
| of the following optional arguments to the |
| .Sy draid |
| keyword: |
| .Bl -tag -compact -width "children" |
| .It Ar parity |
| The parity level (1-3). |
| .It Ar data |
| The number of data devices per redundancy group. |
| In general, a smaller value of |
| .Em D No will increase IOPS, improve the compression ratio, |
| and speed up resilvering at the expense of total usable capacity. |
| Defaults to |
| .Em 8 , No unless Em N-P-S No is less than Em 8 . |
| .It Ar children |
| The expected number of children. |
| Useful as a cross-check when listing a large number of devices. |
| An error is returned when the provided number of children differs. |
| .It Ar spares |
| The number of distributed hot spares. |
| Defaults to zero. |
| .El |
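.Pp
For example, a dRAID vdev with double parity, four data disks per redundancy
group, eight children, and one distributed spare
.Pq device names are illustrative
could be specified as:
.Dl # Nm zpool Cm create Ar pool Sy draid2:4d:8c:1s Ar sda sdb sdc sdd sde sdf sdg sdh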
| .It Sy spare |
| A pseudo-vdev which keeps track of available hot spares for a pool. |
| For more information, see the |
| .Sx Hot Spares |
| section. |
| .It Sy log |
| A separate intent log device. |
| If more than one log device is specified, then writes are load-balanced between |
| devices. |
| Log devices can be mirrored. |
| However, raidz vdev types are not supported for the intent log. |
| For more information, see the |
| .Sx Intent Log |
| section. |
| .It Sy dedup |
| A device dedicated solely for deduplication tables. |
| The redundancy of this device should match the redundancy of the other normal |
| devices in the pool. |
| If more than one dedup device is specified, then |
| allocations are load-balanced between those devices. |
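.Pp
For example, a mirrored pool with a mirrored dedup vdev
.Pq device names are illustrative
could be created with:
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy dedup mirror Ar sdc sdd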
| .It Sy special |
| A device dedicated solely for allocating various kinds of internal metadata, |
| and optionally small file blocks. |
| The redundancy of this device should match the redundancy of the other normal |
| devices in the pool. |
| If more than one special device is specified, then |
| allocations are load-balanced between those devices. |
| .Pp |
| For more information on special allocations, see the |
| .Sx Special Allocation Class |
| section. |
| .It Sy cache |
| A device used to cache storage pool data. |
| A cache device cannot be configured as a mirror or raidz group. |
| For more information, see the |
| .Sx Cache Devices |
| section. |
| .El |
| .Pp |
| Virtual devices cannot be nested, so a mirror or raidz virtual device can only |
| contain files or disks. |
| Mirrors of mirrors |
| .Pq or other combinations |
| are not allowed. |
| .Pp |
| A pool can have any number of virtual devices at the top of the configuration |
| .Po known as |
| .Qq root vdevs |
| .Pc . |
| Data is dynamically distributed across all top-level devices to balance data |
| among devices. |
| As new virtual devices are added, ZFS automatically places data on the newly |
| available devices. |
| .Pp |
| Virtual devices are specified one at a time on the command line, |
| separated by whitespace. |
| Keywords like |
| .Sy mirror No and Sy raidz |
| are used to distinguish where a group ends and another begins. |
| For example, the following creates a pool with two root vdevs, |
| each a mirror of two disks: |
| .Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy mirror Ar sdc sdd |
| . |
| .Ss Device Failure and Recovery |
| ZFS supports a rich set of mechanisms for handling device failure and data |
| corruption. |
| All metadata and data is checksummed, and ZFS automatically repairs bad data |
| from a good copy when corruption is detected. |
| .Pp |
| In order to take advantage of these features, a pool must make use of some form |
| of redundancy, using either mirrored or raidz groups. |
| While ZFS supports running in a non-redundant configuration, where each root |
| vdev is simply a disk or file, this is strongly discouraged. |
| A single case of bit corruption can render some or all of your data unavailable. |
| .Pp |
| A pool's health status is described by one of three states: |
| .Sy online , degraded , No or Sy faulted . |
| An online pool has all devices operating normally. |
| A degraded pool is one in which one or more devices have failed, but the data is |
| still available due to a redundant configuration. |
| A faulted pool has corrupted metadata, or one or more faulted devices, and |
| insufficient replicas to continue functioning. |
| .Pp |
| The health of the top-level vdev, such as a mirror or raidz device, |
| is potentially impacted by the state of its associated vdevs, |
| or component devices. |
| A top-level vdev or component device is in one of the following states: |
| .Bl -tag -width "DEGRADED" |
| .It Sy DEGRADED |
| One or more top-level vdevs is in the degraded state because one or more |
| component devices are offline. |
| Sufficient replicas exist to continue functioning. |
| .Pp |
| One or more component devices is in the degraded or faulted state, but |
| sufficient replicas exist to continue functioning. |
| The underlying conditions are as follows: |
| .Bl -bullet -compact |
| .It |
| The number of checksum errors exceeds acceptable levels and the device is |
| degraded as an indication that something may be wrong. |
| ZFS continues to use the device as necessary. |
| .It |
| The number of I/O errors exceeds acceptable levels. |
| The device could not be marked as faulted because there are insufficient |
| replicas to continue functioning. |
| .El |
| .It Sy FAULTED |
| One or more top-level vdevs is in the faulted state because one or more |
| component devices are offline. |
| Insufficient replicas exist to continue functioning. |
| .Pp |
| One or more component devices is in the faulted state, and insufficient |
| replicas exist to continue functioning. |
| The underlying conditions are as follows: |
| .Bl -bullet -compact |
| .It |
| The device could be opened, but the contents did not match expected values. |
| .It |
| The number of I/O errors exceeds acceptable levels and the device is faulted to |
| prevent further use of the device. |
| .El |
| .It Sy OFFLINE |
| The device was explicitly taken offline by the |
| .Nm zpool Cm offline |
| command. |
| .It Sy ONLINE |
| The device is online and functioning. |
| .It Sy REMOVED |
| The device was physically removed while the system was running. |
| Device removal detection is hardware-dependent and may not be supported on all |
| platforms. |
| .It Sy UNAVAIL |
| The device could not be opened. |
| If a pool is imported when a device was unavailable, then the device will be |
| identified by a unique identifier instead of its path since the path was never |
| correct in the first place. |
| .El |
| .Pp |
| Checksum errors represent events where a disk returned data that was expected |
| to be correct, but was not. |
| In other words, these are instances of silent data corruption. |
| The checksum errors are reported in |
| .Nm zpool Cm status |
| and |
| .Nm zpool Cm events . |
| When a block is stored redundantly, a damaged block may be reconstructed |
| (e.g. from raidz parity or a mirrored copy). |
| In this case, ZFS reports the checksum error against the disks that contained |
| damaged data. |
| If a block is unable to be reconstructed (e.g. due to 3 disks being damaged |
| in a raidz2 group), it is not possible to determine which disks were silently |
| corrupted. |
| In this case, checksum errors are reported for all disks on which the block |
| is stored. |
| .Pp |
| If a device is removed and later re-attached to the system, |
ZFS attempts to online the device automatically.
| Device attachment detection is hardware-dependent |
| and might not be supported on all platforms. |
| . |
| .Ss Hot Spares |
| ZFS allows devices to be associated with pools as |
| .Qq hot spares . |
| These devices are not actively used in the pool, but when an active device |
| fails, it is automatically replaced by a hot spare. |
| To create a pool with hot spares, specify a |
| .Sy spare |
| vdev with any number of devices. |
| For example, |
| .Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy spare Ar sdc sdd |
| .Pp |
| Spares can be shared across multiple pools, and can be added with the |
| .Nm zpool Cm add |
| command and removed with the |
| .Nm zpool Cm remove |
| command. |
| Once a spare replacement is initiated, a new |
| .Sy spare |
| vdev is created within the configuration that will remain there until the |
| original device is replaced. |
| At this point, the hot spare becomes available again if another device fails. |
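.Pp
For example, a hot spare
.Pq illustrative device name
can be added to, and later removed from, an existing pool with:
.Dl # Nm zpool Cm add Ar pool Sy spare Ar sde
.Dl # Nm zpool Cm remove Ar pool sde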
| .Pp |
If a pool has a shared spare that is currently being used, the pool cannot be
exported, since other pools may use this shared spare, which may lead to
data corruption.
| .Pp |
| Shared spares add some risk. |
| If the pools are imported on different hosts, |
| and both pools suffer a device failure at the same time, |
| both could attempt to use the spare at the same time. |
| This may not be detected, resulting in data corruption. |
| .Pp |
| An in-progress spare replacement can be cancelled by detaching the hot spare. |
| If the original faulted device is detached, then the hot spare assumes its |
| place in the configuration, and is removed from the spare list of all active |
| pools. |
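.Pp
For example, assuming
.Ar sdc
is the hot spare currently in use, the in-progress replacement can be
cancelled with:
.Dl # Nm zpool Cm detach Ar pool sdc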
| .Pp |
| The |
| .Sy draid |
| vdev type provides distributed hot spares. |
| These hot spares are named after the dRAID vdev they're a part of |
| .Po Sy draid1 Ns - Ns Ar 2 Ns - Ns Ar 3 No specifies spare Ar 3 No of vdev Ar 2 , |
| .No which is a single parity dRAID Pc |
| and may only be used by that dRAID vdev. |
| Otherwise, they behave the same as normal hot spares. |
| .Pp |
| Spares cannot replace log devices. |
| . |
| .Ss Intent Log |
| The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous |
| transactions. |
| For instance, databases often require their transactions to be on stable storage |
| devices when returning from a system call. |
| NFS and other applications can also use |
| .Xr fsync 2 |
| to ensure data stability. |
| By default, the intent log is allocated from blocks within the main pool. |
| However, it might be possible to get better performance using separate intent |
| log devices such as NVRAM or a dedicated disk. |
| For example: |
| .Dl # Nm zpool Cm create Ar pool sda sdb Sy log Ar sdc |
| .Pp |
| Multiple log devices can also be specified, and they can be mirrored. |
| See the |
| .Sx EXAMPLES |
| section for an example of mirroring multiple log devices. |
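.Pp
For instance, a mirrored log
.Pq device names are illustrative
could be specified at pool creation time:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log mirror Ar sdc sdd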
| .Pp |
| Log devices can be added, replaced, attached, detached and removed. |
| In addition, log devices are imported and exported as part of the pool |
| that contains them. |
| Mirrored devices can be removed by specifying the top-level mirror vdev. |
| . |
| .Ss Cache Devices |
| Devices can be added to a storage pool as |
| .Qq cache devices . |
| These devices provide an additional layer of caching between main memory and |
| disk. |
| For read-heavy workloads, where the working set size is much larger than what |
| can be cached in main memory, using cache devices allows much more of this |
| working set to be served from low latency media. |
Using cache devices provides the greatest performance improvement for random
read workloads of mostly static content.
| .Pp |
| To create a pool with cache devices, specify a |
| .Sy cache |
| vdev with any number of devices. |
| For example: |
| .Dl # Nm zpool Cm create Ar pool sda sdb Sy cache Ar sdc sdd |
| .Pp |
| Cache devices cannot be mirrored or part of a raidz configuration. |
| If a read error is encountered on a cache device, that read I/O is reissued to |
| the original storage pool device, which might be part of a mirrored or raidz |
| configuration. |
| .Pp |
The content of the cache devices is persistent across reboots, and is restored
asynchronously to L2ARC when the pool is imported (persistent L2ARC).
| This can be disabled by setting |
| .Sy l2arc_rebuild_enabled Ns = Ns Sy 0 . |
| For cache devices smaller than |
| .Em 1GB , |
the metadata structures required for rebuilding the L2ARC are not written,
in order not to waste space.
| This can be changed with |
| .Sy l2arc_rebuild_blocks_min_l2size . |
| The cache device header |
| .Pq Em 512B |
| is updated even if no metadata structures are written. |
| Setting |
| .Sy l2arc_headroom Ns = Ns Sy 0 |
| will result in scanning the full-length ARC lists for cacheable content to be |
| written in L2ARC (persistent ARC). |
| If a cache device is added with |
| .Nm zpool Cm add |
its label and header will be overwritten and its contents will not be
restored in L2ARC, even if the device was previously part of the pool.
| If a cache device is onlined with |
| .Nm zpool Cm online |
| its contents will be restored in L2ARC. |
| This is useful in case of memory pressure |
| where the contents of the cache device are not fully restored in L2ARC. |
The user can offline and online the cache device when there is less memory
pressure, in order to fully restore its contents to L2ARC.
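.Pp
For example, assuming
.Ar sdc
is a cache device in the pool, its contents can be fully restored to L2ARC
by offlining and then onlining it:
.Dl # Nm zpool Cm offline Ar pool sdc
.Dl # Nm zpool Cm online Ar pool sdc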
| . |
| .Ss Pool checkpoint |
| Before starting critical procedures that include destructive actions |
| .Pq like Nm zfs Cm destroy , |
| an administrator can checkpoint the pool's state and in the case of a |
| mistake or failure, rewind the entire pool back to the checkpoint. |
| Otherwise, the checkpoint can be discarded when the procedure has completed |
| successfully. |
| .Pp |
| A pool checkpoint can be thought of as a pool-wide snapshot and should be used |
| with care as it contains every part of the pool's state, from properties to vdev |
| configuration. |
| Thus, certain operations are not allowed while a pool has a checkpoint. |
These include vdev removal/attach/detach, mirror splitting, and
changing the pool's GUID.
| Adding a new vdev is supported, but in the case of a rewind it will have to be |
| added again. |
| Finally, users of this feature should keep in mind that scrubs in a pool that |
| has a checkpoint do not repair checkpointed data. |
| .Pp |
| To create a checkpoint for a pool: |
| .Dl # Nm zpool Cm checkpoint Ar pool |
| .Pp |
| To later rewind to its checkpointed state, you need to first export it and |
| then rewind it during import: |
| .Dl # Nm zpool Cm export Ar pool |
| .Dl # Nm zpool Cm import Fl -rewind-to-checkpoint Ar pool |
| .Pp |
| To discard the checkpoint from a pool: |
| .Dl # Nm zpool Cm checkpoint Fl d Ar pool |
| .Pp |
| Dataset reservations (controlled by the |
| .Sy reservation No and Sy refreservation |
| properties) may be unenforceable while a checkpoint exists, because the |
| checkpoint is allowed to consume the dataset's reservation. |
| Finally, data that is part of the checkpoint but has been freed in the |
| current state of the pool won't be scanned during a scrub. |
| . |
| .Ss Special Allocation Class |
| Allocations in the special class are dedicated to specific block types. |
| By default this includes all metadata, the indirect blocks of user data, and |
| any deduplication tables. |
| The class can also be provisioned to accept small file blocks. |
| .Pp |
| A pool must always have at least one normal |
| .Pq non- Ns Sy dedup Ns /- Ns Sy special |
| vdev before |
| other devices can be assigned to the special class. |
| If the |
| .Sy special |
| class becomes full, then allocations intended for it |
| will spill back into the normal class. |
| .Pp |
| Deduplication tables can be excluded from the special class by unsetting the |
| .Sy zfs_ddt_data_is_special |
| ZFS module parameter. |
| .Pp |
| Inclusion of small file blocks in the special class is opt-in. |
| Each dataset can control the size of small file blocks allowed |
| in the special class by setting the |
| .Sy special_small_blocks |
| property to nonzero. |
| See |
| .Xr zfsprops 7 |
| for more info on this property. |
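.Pp
For example, a raidz pool with a mirrored special vdev
.Pq device names are illustrative
could be created with:
.Dl # Nm zpool Cm create Ar pool Sy raidz Ar sda sdb sdc Sy special mirror Ar sdd sde
.Pp
Small file blocks up to, for instance, 64 KiB could then be stored in the
special class for a given dataset with:
.Dl # Nm zfs Cm set Sy special_small_blocks Ns = Ns Ar 64K pool/dataset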