| '\" te |
| .\" Copyright (c) 2013 by Turbo Fredriksson <turbo@bayour.com>. All rights reserved. |
| .\" Copyright (c) 2019 by Delphix. All rights reserved. |
| .\" Copyright (c) 2019 Datto Inc. |
| .\" The contents of this file are subject to the terms of the Common Development |
| .\" and Distribution License (the "License"). You may not use this file except |
| .\" in compliance with the License. You can obtain a copy of the license at |
| .\" usr/src/OPENSOLARIS.LICENSE or http://www.opensolaris.org/os/licensing. |
| .\" |
| .\" See the License for the specific language governing permissions and |
| .\" limitations under the License. When distributing Covered Code, include this |
| .\" CDDL HEADER in each file and include the License file at |
| .\" usr/src/OPENSOLARIS.LICENSE. If applicable, add the following below this |
| .\" CDDL HEADER, with the fields enclosed by brackets "[]" replaced with your |
| .\" own identifying information: |
| .\" Portions Copyright [yyyy] [name of copyright owner] |
| .TH ZFS-MODULE-PARAMETERS 5 "Feb 15, 2019" |
| .SH NAME |
| zfs\-module\-parameters \- ZFS module parameters |
| .SH DESCRIPTION |
| .sp |
| .LP |
| Description of the different parameters to the ZFS module. |
| |
| .SS "Module parameters" |
| .sp |
| .LP |
| |
| .sp |
| .ne 2 |
| .na |
| \fBdbuf_cache_max_bytes\fR (ulong) |
| .ad |
| .RS 12n |
| Maximum size in bytes of the dbuf cache. When \fB0\fR this value will default |
| to \fB1/2^dbuf_cache_shift\fR (1/32) of the target ARC size, otherwise the |
| provided value in bytes will be used. The behavior of the dbuf cache and its |
| associated settings can be observed via the \fB/proc/spl/kstat/zfs/dbufstats\fR |
| kstat. |
| .sp |
| Default value: \fB0\fR. |
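| .sp |
| As a rough illustration of the rule above, the following Python sketch |
| computes the effective limit from a target ARC size. The function name and |
| the 8 GiB target are hypothetical, and any additional clamping the module |
| may apply is not modeled. |
| .sp |
| .nf |
```python
def effective_dbuf_cache_bytes(arc_target_bytes, max_bytes=0, shift=5):
    """Return the dbuf cache limit in bytes.

    A nonzero max_bytes is used as given; 0 selects arc_target / 2^shift.
    """
    if max_bytes != 0:
        return max_bytes
    return arc_target_bytes >> shift


arc_target = 8 * 1024**3  # hypothetical 8 GiB target ARC size
print(effective_dbuf_cache_bytes(arc_target))             # shift 5: 1/32
print(effective_dbuf_cache_bytes(arc_target, shift=6))    # shift 6: 1/64
print(effective_dbuf_cache_bytes(arc_target, 64 * 1024**2))  # explicit value
```
| .fi |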
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBdbuf_metadata_cache_max_bytes\fR (ulong) |
| .ad |
| .RS 12n |
| Maximum size in bytes of the metadata dbuf cache. When \fB0\fR this value will |
| default to \fB1/2^dbuf_metadata_cache_shift\fR (1/64) of the target ARC size, otherwise |
| the provided value in bytes will be used. The behavior of the metadata dbuf |
| cache and its associated settings can be observed via the |
| \fB/proc/spl/kstat/zfs/dbufstats\fR kstat. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBdbuf_cache_hiwater_pct\fR (uint) |
| .ad |
| .RS 12n |
| The percentage over \fBdbuf_cache_max_bytes\fR when dbufs must be evicted |
| directly. |
| .sp |
| Default value: \fB10\fR%. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBdbuf_cache_lowater_pct\fR (uint) |
| .ad |
| .RS 12n |
| The percentage below \fBdbuf_cache_max_bytes\fR when the evict thread stops |
| evicting dbufs. |
| .sp |
| Default value: \fB10\fR%. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBdbuf_cache_shift\fR (int) |
| .ad |
| .RS 12n |
| Set the size of the dbuf cache, \fBdbuf_cache_max_bytes\fR, to a log2 fraction |
| of the target ARC size. |
| .sp |
| Default value: \fB5\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBdbuf_metadata_cache_shift\fR (int) |
| .ad |
| .RS 12n |
| Set the size of the dbuf metadata cache, \fBdbuf_metadata_cache_max_bytes\fR, |
| to a log2 fraction of the target ARC size. |
| .sp |
| Default value: \fB6\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBdmu_prefetch_max\fR (int) |
| .ad |
| .RS 12n |
| Limit the amount we can prefetch with one call to this amount (in bytes). |
| This helps to limit the amount of memory that can be used by prefetching. |
| .sp |
| Default value: \fB134,217,728\fR (128MB). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBignore_hole_birth\fR (int) |
| .ad |
| .RS 12n |
| This is an alias for \fBsend_holes_without_birth_time\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBl2arc_feed_again\fR (int) |
| .ad |
| .RS 12n |
| Turbo L2ARC warm-up. When the L2ARC is cold the fill interval will be set as |
| fast as possible. |
| .sp |
| Use \fB1\fR for yes (default) and \fB0\fR to disable. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBl2arc_feed_min_ms\fR (ulong) |
| .ad |
| .RS 12n |
| Min feed interval in milliseconds. Requires \fBl2arc_feed_again=1\fR and only |
| applicable in related situations. |
| .sp |
| Default value: \fB200\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBl2arc_feed_secs\fR (ulong) |
| .ad |
| .RS 12n |
| Seconds between L2ARC writes. |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBl2arc_headroom\fR (ulong) |
| .ad |
| .RS 12n |
| How far through the ARC lists to search for L2ARC cacheable content, expressed |
| as a multiplier of \fBl2arc_write_max\fR. |
| .sp |
| Default value: \fB2\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBl2arc_headroom_boost\fR (ulong) |
| .ad |
| .RS 12n |
| Scales \fBl2arc_headroom\fR by this percentage when L2ARC contents are being |
| successfully compressed before writing. A value of 100 disables this feature. |
| .sp |
| Default value: \fB200\fR%. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBl2arc_noprefetch\fR (int) |
| .ad |
| .RS 12n |
| Do not write buffers to L2ARC if they were prefetched but not used by |
| applications. |
| .sp |
| Use \fB1\fR for yes (default) and \fB0\fR to disable. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBl2arc_norw\fR (int) |
| .ad |
| .RS 12n |
| Do not read from an L2ARC device while it is being written to. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBl2arc_write_boost\fR (ulong) |
| .ad |
| .RS 12n |
| Cold L2ARC devices will have \fBl2arc_write_max\fR increased by this amount |
| while they remain cold. |
| .sp |
| Default value: \fB8,388,608\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBl2arc_write_max\fR (ulong) |
| .ad |
| .RS 12n |
| Maximum number of bytes written to the L2ARC per feed interval. |
| .sp |
| Default value: \fB8,388,608\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBmetaslab_aliquot\fR (ulong) |
| .ad |
| .RS 12n |
| Metaslab granularity, in bytes. This is roughly similar to what would be |
| referred to as the "stripe size" in traditional RAID arrays. In normal |
| operation, ZFS will try to write this amount of data to a top-level vdev |
| before moving on to the next one. |
| .sp |
| Default value: \fB524,288\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBmetaslab_bias_enabled\fR (int) |
| .ad |
| .RS 12n |
| Enable metaslab group biasing based on its vdev's over- or under-utilization |
| relative to the pool. |
| .sp |
| Use \fB1\fR for yes (default) and \fB0\fR for no. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBmetaslab_force_ganging\fR (ulong) |
| .ad |
| .RS 12n |
| Make some blocks above a certain size be gang blocks. This option is used |
| by the test suite to facilitate testing. |
| .sp |
| Default value: \fB16,777,217\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_metaslab_segment_weight_enabled\fR (int) |
| .ad |
| .RS 12n |
| Enable/disable segment-based metaslab selection. |
| .sp |
| Use \fB1\fR for yes (default) and \fB0\fR for no. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_metaslab_switch_threshold\fR (int) |
| .ad |
| .RS 12n |
| When using segment-based metaslab selection, continue allocating |
| from the active metaslab until \fBzfs_metaslab_switch_threshold\fR |
| worth of buckets have been exhausted. |
| .sp |
| Default value: \fB2\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBmetaslab_debug_load\fR (int) |
| .ad |
| .RS 12n |
| Load all metaslabs during pool import. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBmetaslab_debug_unload\fR (int) |
| .ad |
| .RS 12n |
| Prevent metaslabs from being unloaded. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBmetaslab_fragmentation_factor_enabled\fR (int) |
| .ad |
| .RS 12n |
| Enable use of the fragmentation metric in computing metaslab weights. |
| .sp |
| Use \fB1\fR for yes (default) and \fB0\fR for no. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBmetaslab_df_max_search\fR (int) |
| .ad |
| .RS 12n |
| Maximum distance to search forward from the last offset. Without this limit, |
| fragmented pools can see >100,000 iterations and metaslab_block_picker() |
| becomes the performance limiting factor on high-performance storage. |
| |
| With the default setting of 16MB, we typically see less than 500 iterations, |
| even with very fragmented, ashift=9 pools. The maximum number of iterations |
| possible is: \fBmetaslab_df_max_search / (2 * (1<<ashift))\fR. |
| With the default setting of 16MB this is 16*1024 (with ashift=9) or 2048 |
| (with ashift=12). |
| .sp |
| Default value: \fB16,777,216\fR (16MB). |
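| .sp |
| The iteration bound given above can be checked with a short Python sketch. |
| The function name is hypothetical; the formula is the one stated in the text. |
| .sp |
| .nf |
```python
def max_search_iterations(max_search_bytes, ashift):
    """Upper bound on iterations: max_search / (2 * (1 << ashift))."""
    return max_search_bytes // (2 * (1 << ashift))


default_max_search = 16 * 1024 * 1024  # 16MB default
for ashift in (9, 12):
    print(ashift, max_search_iterations(default_max_search, ashift))
```
| .fi |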
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBmetaslab_df_use_largest_segment\fR (int) |
| .ad |
| .RS 12n |
| If we are not searching forward (due to metaslab_df_max_search, |
| metaslab_df_free_pct, or metaslab_df_alloc_threshold), this tunable controls |
| what segment is used. If it is set, we will use the largest free segment. |
| If it is not set, we will use a segment of exactly the requested size (or |
| larger). |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_default_ms_count\fR (int) |
| .ad |
| .RS 12n |
| When a vdev is added, target this number of metaslabs per top-level vdev. |
| .sp |
| Default value: \fB200\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_min_ms_count\fR (int) |
| .ad |
| .RS 12n |
| Minimum number of metaslabs to create in a top-level vdev. |
| .sp |
| Default value: \fB16\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBvdev_ms_count_limit\fR (int) |
| .ad |
| .RS 12n |
| Practical upper limit of total metaslabs per top-level vdev. |
| .sp |
| Default value: \fB131,072\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBmetaslab_preload_enabled\fR (int) |
| .ad |
| .RS 12n |
| Enable metaslab group preloading. |
| .sp |
| Use \fB1\fR for yes (default) and \fB0\fR for no. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBmetaslab_lba_weighting_enabled\fR (int) |
| .ad |
| .RS 12n |
| Give more weight to metaslabs with lower LBAs, assuming they have |
| greater bandwidth as is typically the case on a modern constant |
| angular velocity disk drive. |
| .sp |
| Use \fB1\fR for yes (default) and \fB0\fR for no. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBsend_holes_without_birth_time\fR (int) |
| .ad |
| .RS 12n |
| When set, the hole_birth optimization will not be used, and all holes will |
| always be sent on zfs send. This is useful if you suspect your datasets are |
| affected by a bug in hole_birth. |
| .sp |
| Use \fB1\fR for on (default) and \fB0\fR for off. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBspa_config_path\fR (charp) |
| .ad |
| .RS 12n |
| SPA config file |
| .sp |
| Default value: \fB/etc/zfs/zpool.cache\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBspa_asize_inflation\fR (int) |
| .ad |
| .RS 12n |
| Multiplication factor used to estimate actual disk consumption from the |
| size of data being written. The default value is a worst case estimate, |
| but lower values may be valid for a given pool depending on its |
| configuration. Pool administrators who understand the factors involved |
| may wish to specify a more realistic inflation factor, particularly if |
| they operate close to quota or capacity limits. |
| .sp |
| Default value: \fB24\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBspa_load_print_vdev_tree\fR (int) |
| .ad |
| .RS 12n |
| Whether to print the vdev tree in the debugging message buffer during pool import. |
| Use 0 to disable and 1 to enable. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBspa_load_verify_data\fR (int) |
| .ad |
| .RS 12n |
| Whether to traverse data blocks during an "extreme rewind" (\fB-X\fR) |
| import. Use 0 to disable and 1 to enable. |
| |
| An extreme rewind import normally performs a full traversal of all |
| blocks in the pool for verification. If this parameter is set to 0, |
| the traversal skips non-metadata blocks. It can be toggled once the |
| import has started to stop or start the traversal of non-metadata blocks. |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBspa_load_verify_metadata\fR (int) |
| .ad |
| .RS 12n |
| Whether to traverse blocks during an "extreme rewind" (\fB-X\fR) |
| pool import. Use 0 to disable and 1 to enable. |
| |
| An extreme rewind import normally performs a full traversal of all |
| blocks in the pool for verification. If this parameter is set to 0, |
| the traversal is not performed. It can be toggled once the import has |
| started to stop or start the traversal. |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBspa_load_verify_shift\fR (int) |
| .ad |
| .RS 12n |
| Sets the maximum number of bytes to consume during pool import to the log2 |
| fraction of the target ARC size. |
| .sp |
| Default value: \fB4\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBspa_slop_shift\fR (int) |
| .ad |
| .RS 12n |
| Normally, we don't allow the last 3.125% (1/(2^spa_slop_shift)) of space |
| in the pool to be consumed. This ensures that we don't run the pool |
| completely out of space, due to unaccounted changes (e.g. to the MOS). |
| It also limits the worst-case time to allocate space. If we have |
| less than this amount of free space, most ZPL operations (e.g. write, |
| create) will return ENOSPC. |
| .sp |
| Default value: \fB5\fR. |
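| .sp |
| A minimal Python sketch of the fraction described above, assuming a |
| hypothetical 1 TiB pool. It models only the 1/(2^spa_slop_shift) term; any |
| minimum or maximum the module may place on the reserved space is not modeled. |
| .sp |
| .nf |
```python
def slop_space_bytes(pool_bytes, spa_slop_shift=5):
    """Space held back from normal allocation: pool size / 2^spa_slop_shift."""
    return pool_bytes >> spa_slop_shift


def slop_percent(spa_slop_shift=5):
    """The same fraction expressed as a percentage of the pool."""
    return 100.0 / (1 << spa_slop_shift)


pool = 1024**4  # hypothetical 1 TiB pool
print(slop_space_bytes(pool), slop_percent())
```
| .fi |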
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBvdev_removal_max_span\fR (int) |
| .ad |
| .RS 12n |
| During top-level vdev removal, chunks of data are copied from the vdev |
| which may include free space in order to trade bandwidth for IOPS. |
| This parameter determines the maximum span of free space (in bytes) |
| which will be included as "unnecessary" data in a chunk of copied data. |
| |
| The default value here was chosen to align with |
| \fBzfs_vdev_read_gap_limit\fR, which is a similar concept when doing |
| regular reads (but there's no reason it has to be the same). |
| .sp |
| Default value: \fB32,768\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzap_iterate_prefetch\fR (int) |
| .ad |
| .RS 12n |
| If this is set, when we start iterating over a ZAP object, zfs will prefetch |
| the entire object (all leaf blocks). However, this is limited by |
| \fBdmu_prefetch_max\fR. |
| .sp |
| Use \fB1\fR for on (default) and \fB0\fR for off. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfetch_array_rd_sz\fR (ulong) |
| .ad |
| .RS 12n |
| If prefetching is enabled, disable prefetching for reads larger than this size. |
| .sp |
| Default value: \fB1,048,576\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfetch_max_distance\fR (uint) |
| .ad |
| .RS 12n |
| Max bytes to prefetch per stream (default 8MB). |
| .sp |
| Default value: \fB8,388,608\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfetch_max_streams\fR (uint) |
| .ad |
| .RS 12n |
| Max number of streams per zfetch (prefetch streams per file). |
| .sp |
| Default value: \fB8\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfetch_min_sec_reap\fR (uint) |
| .ad |
| .RS 12n |
| Minimum time, in seconds, before an active prefetch stream can be reclaimed. |
| .sp |
| Default value: \fB2\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_abd_scatter_min_size\fR (uint) |
| .ad |
| .RS 12n |
| This is the minimum allocation size that will use scatter (page-based) |
| ABDs. Smaller allocations will use linear ABDs. |
| .sp |
| Default value: \fB1536\fR (512B and 1KB allocations will be linear). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_dnode_limit\fR (ulong) |
| .ad |
| .RS 12n |
| When the number of bytes consumed by dnodes in the ARC exceeds this number of |
| bytes, try to unpin some of it in response to demand for non-metadata. This |
| value acts as a ceiling on the amount of dnode metadata. It defaults to 0, |
| which indicates that the limit is instead computed as |
| \fBzfs_arc_dnode_limit_percent\fR percent of the ARC meta buffers. |
| |
| See also \fBzfs_arc_meta_prune\fR which serves a similar purpose but is used |
| when the amount of metadata in the ARC exceeds \fBzfs_arc_meta_limit\fR rather |
| than in response to overall demand for non-metadata. |
| |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_dnode_limit_percent\fR (ulong) |
| .ad |
| .RS 12n |
| Percentage of ARC meta buffers that can be consumed by dnodes. |
| .sp |
| See also \fBzfs_arc_dnode_limit\fR which serves a similar purpose but has a |
| higher priority if set to a nonzero value. |
| .sp |
| Default value: \fB10\fR%. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_dnode_reduce_percent\fR (ulong) |
| .ad |
| .RS 12n |
| Percentage of ARC dnodes to try to scan in response to demand for non-metadata |
| when the number of bytes consumed by dnodes exceeds \fBzfs_arc_dnode_limit\fR. |
| |
| .sp |
| Default value: \fB10\fR% of the number of dnodes in the ARC. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_average_blocksize\fR (int) |
| .ad |
| .RS 12n |
| The ARC's buffer hash table is sized based on the assumption of an average |
| block size of \fBzfs_arc_average_blocksize\fR (default 8K). This works out |
| to roughly 1MB of hash table per 1GB of physical memory with 8-byte pointers. |
| For configurations with a known larger average block size this value can be |
| increased to reduce the memory footprint. |
| |
| .sp |
| Default value: \fB8192\fR. |
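| .sp |
| The sizing estimate above can be reproduced with a short Python sketch. It |
| assumes one pointer-sized bucket per average-sized block of physical memory; |
| the function name is hypothetical and any rounding done by the module is not |
| modeled. |
| .sp |
| .nf |
```python
def arc_hash_table_bytes(physical_memory_bytes,
                         average_blocksize=8192,
                         pointer_bytes=8):
    """Estimate the hash table size: one pointer per average-sized block."""
    buckets = physical_memory_bytes // average_blocksize
    return buckets * pointer_bytes


one_gib = 1024**3
print(arc_hash_table_bytes(one_gib))                           # 8K blocks
print(arc_hash_table_bytes(one_gib, average_blocksize=65536))  # 64K blocks
```
| .fi |
| .sp |
| With the default of 8192 this gives 1MB per 1GB of memory, and a larger |
| average block size shrinks the table proportionally. |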
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_evict_batch_limit\fR (int) |
| .ad |
| .RS 12n |
| Number of ARC headers to evict per sub-list before proceeding to another sub-list. |
| This batch-style operation prevents entire sub-lists from being evicted at once |
| but comes at a cost of additional unlocking and locking. |
| .sp |
| Default value: \fB10\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_grow_retry\fR (int) |
| .ad |
| .RS 12n |
| If set to a nonzero value, it will replace the arc_grow_retry value with this value. |
| The arc_grow_retry value (default 5) is the number of seconds the ARC will wait before |
| trying to resume growth after a memory pressure event. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_lotsfree_percent\fR (int) |
| .ad |
| .RS 12n |
| Throttle I/O when free system memory drops below this percentage of total |
| system memory. Setting this value to 0 will disable the throttle. |
| .sp |
| Default value: \fB10\fR%. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_max\fR (ulong) |
| .ad |
| .RS 12n |
| Maximum size of the ARC in bytes. If set to 0, the ARC may consume up to 1/2 of |
| system RAM. This value must be at least 67108864 (64 megabytes). |
| .sp |
| This value can be changed dynamically with some caveats. It cannot be set back |
| to 0 while running and reducing it below the current ARC size will not cause |
| the ARC to shrink without memory pressure to induce shrinking. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_meta_adjust_restarts\fR (ulong) |
| .ad |
| .RS 12n |
| The number of restart passes to make while scanning the ARC attempting |
| to free buffers in order to stay below the \fBzfs_arc_meta_limit\fR. |
| This value should not need to be tuned but is available to facilitate |
| performance analysis. |
| .sp |
| Default value: \fB4096\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_meta_limit\fR (ulong) |
| .ad |
| .RS 12n |
| The maximum size in bytes that meta data buffers are allowed to |
| consume in the ARC. When this limit is reached meta data buffers will |
| be reclaimed even if the overall arc_c_max has not been reached. This |
| value defaults to 0, which indicates that \fBzfs_arc_meta_limit_percent\fR |
| percent of the ARC may be used for meta data. |
| .sp |
| This value may be changed dynamically, except that it cannot be set back to 0 |
| to select a percentage of the ARC; it must be set to an explicit value. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_meta_limit_percent\fR (ulong) |
| .ad |
| .RS 12n |
| Percentage of ARC buffers that can be used for meta data. |
| |
| See also \fBzfs_arc_meta_limit\fR which serves a similar purpose but has a |
| higher priority if set to a nonzero value. |
| |
| .sp |
| Default value: \fB75\fR%. |
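| .sp |
| The fallback from an explicit byte limit to a percentage, as described for |
| \fBzfs_arc_meta_limit\fR and \fBzfs_arc_dnode_limit\fR, can be sketched in |
| Python as follows. The helper name and the 4 GiB ARC maximum are hypothetical, |
| and using the ARC maximum as the base of the meta data percentage is an |
| assumption. |
| .sp |
| .nf |
```python
def effective_limit(explicit_bytes, percent, base_bytes):
    """A nonzero explicit limit wins; 0 falls back to percent of base."""
    if explicit_bytes != 0:
        return explicit_bytes
    return base_bytes * percent // 100


arc_max = 4 * 1024**3  # hypothetical ARC maximum (assumed base)

# zfs_arc_meta_limit=0, zfs_arc_meta_limit_percent=75
meta_limit = effective_limit(0, 75, arc_max)

# zfs_arc_dnode_limit=0, zfs_arc_dnode_limit_percent=10
dnode_limit = effective_limit(0, 10, meta_limit)

print(meta_limit, dnode_limit)
```
| .fi |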
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_meta_min\fR (ulong) |
| .ad |
| .RS 12n |
| The minimum size in bytes that meta data buffers may consume in |
| the ARC. This value defaults to 0, which disables the floor on the amount |
| of the ARC devoted to meta data. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_meta_prune\fR (int) |
| .ad |
| .RS 12n |
| The number of dentries and inodes to be scanned looking for entries |
| which can be dropped. This may be required when the ARC reaches the |
| \fBzfs_arc_meta_limit\fR because dentries and inodes can pin buffers |
| in the ARC. Increasing this value will cause the dentry and inode caches |
| to be pruned more aggressively. Setting this value to 0 will disable |
| pruning the inode and dentry caches. |
| .sp |
| Default value: \fB10,000\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_meta_strategy\fR (int) |
| .ad |
| .RS 12n |
| Define the strategy for ARC meta data buffer eviction (meta reclaim strategy). |
| A value of 0 (META_ONLY) will evict only the ARC meta data buffers. |
| A value of 1 (BALANCED) indicates that additional data buffers may be evicted if |
| that is required in order to evict the required number of meta data buffers. |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_min\fR (ulong) |
| .ad |
| .RS 12n |
| Minimum size of the ARC in bytes. If set to 0 then arc_c_min will default to |
| the larger of 32M or 1/32 of total system memory. |
| .sp |
| Default value: \fB0\fR. |
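| .sp |
| A minimal Python sketch of the default described above. The function name |
| and the example memory sizes are hypothetical. |
| .sp |
| .nf |
```python
def effective_arc_min(total_memory_bytes, zfs_arc_min=0):
    """A nonzero zfs_arc_min is used as given; 0 selects the default floor."""
    if zfs_arc_min != 0:
        return zfs_arc_min
    return max(32 * 1024**2, total_memory_bytes // 32)


print(effective_arc_min(512 * 1024**2))  # small system: the 32M floor applies
print(effective_arc_min(16 * 1024**3))   # 16 GiB system: 1/32 of memory
```
| .fi |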
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_min_prefetch_ms\fR (int) |
| .ad |
| .RS 12n |
| Minimum time prefetched blocks are locked in the ARC, specified in ms. |
| A value of \fB0\fR will default to 1000 ms. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_min_prescient_prefetch_ms\fR (int) |
| .ad |
| .RS 12n |
| Minimum time "prescient prefetched" blocks are locked in the ARC, specified |
| in ms. These blocks are meant to be prefetched fairly aggressively ahead of |
| the code that may use them. A value of \fB0\fR will default to 6000 ms. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_max_missing_tvds\fR (int) |
| .ad |
| .RS 12n |
| Number of missing top-level vdevs which will be allowed during |
| pool import (only in read-only mode). |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_multilist_num_sublists\fR (int) |
| .ad |
| .RS 12n |
| To allow more fine-grained locking, each ARC state contains a series |
| of lists for both data and meta data objects. Locking is performed at |
| the level of these "sub-lists". This parameter controls the number of |
| sub-lists per ARC state, and also applies to other uses of the |
| multilist data structure. |
| .sp |
| Default value: \fB4\fR or the number of online CPUs, whichever is greater. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_overflow_shift\fR (int) |
| .ad |
| .RS 12n |
| The ARC size is considered to be overflowing if it exceeds the current |
| ARC target size (arc_c) by a threshold determined by this parameter. |
| The threshold is calculated as a fraction of arc_c using the formula |
| "arc_c >> \fBzfs_arc_overflow_shift\fR". |
| |
| The default value of 8 causes the ARC to be considered to be overflowing |
| if it exceeds the target size by 1/256th (0.4%) of the target size. |
| |
| When the ARC is overflowing, new buffer allocations are stalled until |
| the reclaim thread catches up and the overflow condition no longer exists. |
| .sp |
| Default value: \fB8\fR. |
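| .sp |
| The threshold formula above can be sketched in Python. The function names and |
| the 4 GiB target are hypothetical, and treating the comparison as strict is |
| an assumption. |
| .sp |
| .nf |
```python
def arc_overflow_threshold(arc_c, overflow_shift=8):
    """Bytes above the target size (arc_c) that count as overflow."""
    return arc_c >> overflow_shift


def arc_is_overflowing(arc_size, arc_c, overflow_shift=8):
    # Strict comparison is an assumption; the text only says "exceeds".
    return arc_size > arc_c + arc_overflow_threshold(arc_c, overflow_shift)


arc_c = 4 * 1024**3  # hypothetical 4 GiB target size
print(arc_overflow_threshold(arc_c))
print(arc_is_overflowing(arc_c + 32 * 1024**2, arc_c))
```
| .fi |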
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| |
| \fBzfs_arc_p_min_shift\fR (int) |
| .ad |
| .RS 12n |
| If set to a nonzero value, this will update arc_p_min_shift (default 4) |
| with the new value. |
| arc_p_min_shift is used as a shift of arc_c when calculating both the minimum |
| and maximum arc_p. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_p_dampener_disable\fR (int) |
| .ad |
| .RS 12n |
| Disable the arc_p adapt dampener. |
| .sp |
| Use \fB1\fR to disable the dampener (default) and \fB0\fR to leave it enabled. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_shrink_shift\fR (int) |
| .ad |
| .RS 12n |
| If set to a nonzero value, this will update arc_shrink_shift (default 7) |
| with the new value. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_pc_percent\fR (uint) |
| .ad |
| .RS 12n |
| Percent of pagecache to reclaim the ARC to. |
| |
| This tunable allows the ZFS ARC to play more nicely with the kernel's LRU |
| pagecache. It can guarantee that the ARC size won't collapse under scanning |
| pressure on the pagecache, yet still allows the ARC to be reclaimed down to |
| \fBzfs_arc_min\fR if necessary. This value is specified as percent of pagecache |
| size (as measured by NR_FILE_PAGES) where that percent may exceed 100. This |
| only operates during memory pressure/reclaim. |
| .sp |
| Default value: \fB0\fR% (disabled). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_arc_sys_free\fR (ulong) |
| .ad |
| .RS 12n |
| The target number of bytes the ARC should leave as free memory on the system. |
| Defaults to the larger of 1/64 of physical memory or 512K. Setting this |
| option to a non-zero value will override the default. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_autoimport_disable\fR (int) |
| .ad |
| .RS 12n |
| Disable pool import at module load by ignoring the cache file (typically \fB/etc/zfs/zpool.cache\fR). |
| .sp |
| Use \fB1\fR for yes (default) and \fB0\fR for no. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_checksums_per_second\fR (int) |
| .ad |
| .RS 12n |
| Rate limit checksum events to this many per second. Note that this should |
| not be set below the zed thresholds (currently 10 checksums over 10 sec) |
| or else zed may not trigger any action. |
| .sp |
| Default value: \fB20\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_commit_timeout_pct\fR (int) |
| .ad |
| .RS 12n |
| This controls the amount of time that a ZIL block (lwb) will remain "open" |
| when it isn't "full", and it has a thread waiting for it to be committed to |
| stable storage. The timeout is scaled based on a percentage of the last lwb |
| latency to avoid significantly impacting the latency of each individual |
| transaction record (itx). |
| .sp |
| Default value: \fB5\fR%. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_condense_indirect_vdevs_enable\fR (int) |
| .ad |
| .RS 12n |
| Enable condensing indirect vdev mappings. When set to a non-zero value, |
| attempt to condense indirect vdev mappings if the mapping uses more than |
| \fBzfs_condense_min_mapping_bytes\fR bytes of memory and if the obsolete |
| space map object uses more than \fBzfs_condense_max_obsolete_bytes\fR |
| bytes on-disk. The condensing process is an attempt to save memory by |
| removing obsolete mappings. |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_condense_max_obsolete_bytes\fR (ulong) |
| .ad |
| .RS 12n |
| Only attempt to condense indirect vdev mappings if the on-disk size |
| of the obsolete space map object is greater than this number of bytes |
| (see \fBzfs_condense_indirect_vdevs_enable\fR). |
| .sp |
| Default value: \fB1,073,741,824\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_condense_min_mapping_bytes\fR (ulong) |
| .ad |
| .RS 12n |
| Minimum size vdev mapping to attempt to condense (see |
| \fBzfs_condense_indirect_vdevs_enable\fR). |
| .sp |
| Default value: \fB131,072\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_dbgmsg_enable\fR (int) |
| .ad |
| .RS 12n |
| Internally ZFS keeps a small log to facilitate debugging. By default the log |
| is disabled; to enable it, set this option to 1. The contents of the log can |
| be accessed by reading the \fB/proc/spl/kstat/zfs/dbgmsg\fR file. Writing 0 to |
| this proc file clears the log. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_dbgmsg_maxsize\fR (int) |
| .ad |
| .RS 12n |
| The maximum size in bytes of the internal ZFS debug log. |
| .sp |
| Default value: \fB4M\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_dbuf_state_index\fR (int) |
| .ad |
| .RS 12n |
| This feature is currently unused. It is normally used for controlling what |
| reporting is available under /proc/spl/kstat/zfs. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_deadman_enabled\fR (int) |
| .ad |
| .RS 12n |
| When a pool sync operation takes longer than \fBzfs_deadman_synctime_ms\fR |
| milliseconds, or when an individual I/O takes longer than |
| \fBzfs_deadman_ziotime_ms\fR milliseconds, then the operation is considered to |
| be "hung". If \fBzfs_deadman_enabled\fR is set then the deadman behavior is |
| invoked as described by the \fBzfs_deadman_failmode\fR module option. |
| By default the deadman is enabled and configured to \fBwait\fR which results |
| in "hung" I/Os only being logged. The deadman is automatically disabled |
| when a pool gets suspended. |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_deadman_failmode\fR (charp) |
| .ad |
| .RS 12n |
| Controls the failure behavior when the deadman detects a "hung" I/O. Valid |
| values are \fBwait\fR, \fBcontinue\fR, and \fBpanic\fR. |
| .sp |
| \fBwait\fR - Wait for a "hung" I/O to complete. For each "hung" I/O a |
| "deadman" event will be posted describing that I/O. |
| .sp |
| \fBcontinue\fR - Attempt to recover from a "hung" I/O by re-dispatching it |
| to the I/O pipeline if possible. |
| .sp |
| \fBpanic\fR - Panic the system. This can be used to facilitate an automatic |
| fail-over to a properly configured fail-over partner. |
| .sp |
| Default value: \fBwait\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_deadman_checktime_ms\fR (int) |
| .ad |
| .RS 12n |
| Check time in milliseconds. This defines the frequency at which we check |
| for hung I/O and potentially invoke the \fBzfs_deadman_failmode\fR behavior. |
| .sp |
| Default value: \fB60,000\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_deadman_synctime_ms\fR (ulong) |
| .ad |
| .RS 12n |
| Interval in milliseconds after which the deadman is triggered and also |
| the interval after which a pool sync operation is considered to be "hung". |
| Once this limit is exceeded the deadman will be invoked every |
| \fBzfs_deadman_checktime_ms\fR milliseconds until the pool sync completes. |
| .sp |
| Default value: \fB600,000\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_deadman_ziotime_ms\fR (ulong) |
| .ad |
| .RS 12n |
| Interval in milliseconds after which the deadman is triggered and an |
| individual I/O operation is considered to be "hung". As long as the I/O |
| remains "hung" the deadman will be invoked every \fBzfs_deadman_checktime_ms\fR |
| milliseconds until the I/O completes. |
| .sp |
| Default value: \fB300,000\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_dedup_prefetch\fR (int) |
| .ad |
| .RS 12n |
| Enable prefetching of deduplicated blocks. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR to disable (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_delay_min_dirty_percent\fR (int) |
| .ad |
| .RS 12n |
| Start to delay each transaction once there is this amount of dirty data, |
| expressed as a percentage of \fBzfs_dirty_data_max\fR. |
| This value should be >= \fBzfs_vdev_async_write_active_max_dirty_percent\fR. |
| See the section "ZFS TRANSACTION DELAY". |
| .sp |
| Default value: \fB60\fR%. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_delay_scale\fR (int) |
| .ad |
| .RS 12n |
| This controls how quickly the transaction delay approaches infinity. |
| Larger values cause longer delays for a given amount of dirty data. |
| .sp |
| For the smoothest delay, this value should be about 1 billion divided |
| by the maximum number of operations per second. This will smoothly |
| handle between 10x and 1/10th this number. |
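| .sp |
| As a worked example of this rule of thumb, assume a pool expected to sustain |
| roughly 2,000 operations per second (an illustrative figure, not a measurement): |
| .sp |
| .nf |
| 1,000,000,000 / 2,000 = 500,000 |
| .fi |
| .sp |
| Under that assumption the suggested value coincides with the default, and by the |
| statement above it would be expected to behave smoothly from about 200 to |
| 20,000 operations per second. |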
| .sp |
| See the section "ZFS TRANSACTION DELAY". |
| .sp |
| Note: \fBzfs_delay_scale\fR * \fBzfs_dirty_data_max\fR must be < 2^64. |
| .sp |
| Default value: \fB500,000\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_slow_io_events_per_second\fR (int) |
| .ad |
| .RS 12n |
| Rate limit delay zevents (which report slow I/Os) to this many per second. |
| .sp |
| Default value: \fB20\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_unlink_suspend_progress\fR (uint) |
| .ad |
| .RS 12n |
| When enabled, files will not be asynchronously removed from the list of pending |
| unlinks and the space they consume will be leaked. Once this option has been |
| disabled and the dataset is remounted, the pending unlinks will be processed |
| and the freed space returned to the pool. |
| This option is used by the test suite to facilitate testing. |
| .sp |
| Use \fB0\fR (default) to allow progress and \fB1\fR to pause progress. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_delete_blocks\fR (ulong) |
| .ad |
| .RS 12n |
| This is used to define a large file for the purposes of deletion. Files |
| containing more than \fBzfs_delete_blocks\fR blocks will be deleted asynchronously |
| while smaller files are deleted synchronously. Decreasing this value will |
| reduce the time spent in an unlink(2) system call at the expense of a longer |
| delay before the freed space is available. |
| .sp |
| Default value: \fB20,480\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_dirty_data_max\fR (int) |
| .ad |
| .RS 12n |
| Determines the dirty space limit in bytes. Once this limit is exceeded, new |
| writes are halted until space frees up. This parameter takes precedence |
| over \fBzfs_dirty_data_max_percent\fR. |
| See the section "ZFS TRANSACTION DELAY". |
| .sp |
| Default value: \fB10\fR% of physical RAM, capped at \fBzfs_dirty_data_max_max\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_dirty_data_max_max\fR (int) |
| .ad |
| .RS 12n |
| Maximum allowable value of \fBzfs_dirty_data_max\fR, expressed in bytes. |
| This limit is only enforced at module load time, and will be ignored if |
| \fBzfs_dirty_data_max\fR is later changed. This parameter takes |
| precedence over \fBzfs_dirty_data_max_max_percent\fR. See the section |
| "ZFS TRANSACTION DELAY". |
| .sp |
| Default value: \fB25\fR% of physical RAM. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_dirty_data_max_max_percent\fR (int) |
| .ad |
| .RS 12n |
| Maximum allowable value of \fBzfs_dirty_data_max\fR, expressed as a |
| percentage of physical RAM. This limit is only enforced at module load |
| time, and will be ignored if \fBzfs_dirty_data_max\fR is later changed. |
| The parameter \fBzfs_dirty_data_max_max\fR takes precedence over this |
| one. See the section "ZFS TRANSACTION DELAY". |
| .sp |
| Default value: \fB25\fR%. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_dirty_data_max_percent\fR (int) |
| .ad |
| .RS 12n |
| Determines the dirty space limit, expressed as a percentage of all |
| memory. Once this limit is exceeded, new writes are halted until space frees |
| up. The parameter \fBzfs_dirty_data_max\fR takes precedence over this |
| one. See the section "ZFS TRANSACTION DELAY". |
| .sp |
| Default value: \fB10\fR%, subject to \fBzfs_dirty_data_max_max\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_dirty_data_sync_percent\fR (int) |
| .ad |
| .RS 12n |
| Start syncing out a transaction group if there's at least this much dirty data |
| as a percentage of \fBzfs_dirty_data_max\fR. This should be less than |
| \fBzfs_vdev_async_write_active_min_dirty_percent\fR. |
| .sp |
| Default value: \fB20\fR% of \fBzfs_dirty_data_max\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_fletcher_4_impl\fR (string) |
| .ad |
| .RS 12n |
| Select a fletcher 4 implementation. |
| .sp |
| Supported selectors are: \fBfastest\fR, \fBscalar\fR, \fBsse2\fR, \fBssse3\fR, |
| \fBavx2\fR, \fBavx512f\fR, and \fBaarch64_neon\fR. |
| All of the selectors except \fBfastest\fR and \fBscalar\fR require instruction |
| set extensions to be available and will only appear if ZFS detects that they are |
| present at runtime. If multiple implementations of fletcher 4 are available, |
| the \fBfastest\fR will be chosen using a micro benchmark. Selecting \fBscalar\fR |
| results in the original CPU-based calculation being used. Selecting any option |
| other than \fBfastest\fR and \fBscalar\fR results in vector instructions from |
| the respective CPU instruction set being used. |
| .sp |
| Default value: \fBfastest\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_free_bpobj_enabled\fR (int) |
| .ad |
| .RS 12n |
| Enable/disable the processing of the free_bpobj object. |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_async_block_max_blocks\fR (ulong) |
| .ad |
| .RS 12n |
| Maximum number of blocks freed in a single txg. |
| .sp |
| Default value: \fB100,000\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_override_estimate_recordsize\fR (ulong) |
| .ad |
| .RS 12n |
| Record size calculation override for zfs send estimates. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_async_read_max_active\fR (int) |
| .ad |
| .RS 12n |
| Maximum asynchronous read I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB3\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_async_read_min_active\fR (int) |
| .ad |
| .RS 12n |
| Minimum asynchronous read I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_async_write_active_max_dirty_percent\fR (int) |
| .ad |
| .RS 12n |
| When the pool has more than |
| \fBzfs_vdev_async_write_active_max_dirty_percent\fR dirty data, use |
| \fBzfs_vdev_async_write_max_active\fR to limit active async writes. If |
| the dirty data is between min and max, the active I/O limit is linearly |
| interpolated. See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB60\fR%. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_async_write_active_min_dirty_percent\fR (int) |
| .ad |
| .RS 12n |
| When the pool has less than |
| \fBzfs_vdev_async_write_active_min_dirty_percent\fR dirty data, use |
| \fBzfs_vdev_async_write_min_active\fR to limit active async writes. If |
| the dirty data is between min and max, the active I/O limit is linearly |
| interpolated. See the section "ZFS I/O SCHEDULER". |
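| .sp |
| As an illustrative calculation only, assuming the default limits of this and |
| the related parameters (30% and 60% dirty data, 2 and 10 active I/Os), a pool |
| with 45% dirty data would interpolate to: |
| .sp |
| .nf |
| 2 + (45 - 30) / (60 - 30) * (10 - 2) = 6 active async writes |
| .fi |
| .sp |
| The exact rounding applied by the scheduler is not described here and may differ. |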
| .sp |
| Default value: \fB30\fR%. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_async_write_max_active\fR (int) |
| .ad |
| .RS 12n |
| Maximum asynchronous write I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB10\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_async_write_min_active\fR (int) |
| .ad |
| .RS 12n |
| Minimum asynchronous write I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Lower values are associated with better latency on rotational media but poorer |
| resilver performance. The default value of 2 was chosen as a compromise. A |
| value of 3 has been shown to improve resilver performance further at a cost of |
| further increasing latency. |
| .sp |
| Default value: \fB2\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_initializing_max_active\fR (int) |
| .ad |
| .RS 12n |
| Maximum initializing I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_initializing_min_active\fR (int) |
| .ad |
| .RS 12n |
| Minimum initializing I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_max_active\fR (int) |
| .ad |
| .RS 12n |
| The maximum number of I/Os active to each device. Ideally, this will be >= |
| the sum of each queue's max_active. It must be at least the sum of each |
| queue's min_active. See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB1,000\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_removal_max_active\fR (int) |
| .ad |
| .RS 12n |
| Maximum removal I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB2\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_removal_min_active\fR (int) |
| .ad |
| .RS 12n |
| Minimum removal I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_scrub_max_active\fR (int) |
| .ad |
| .RS 12n |
| Maximum scrub I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB2\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_scrub_min_active\fR (int) |
| .ad |
| .RS 12n |
| Minimum scrub I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_sync_read_max_active\fR (int) |
| .ad |
| .RS 12n |
| Maximum synchronous read I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB10\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_sync_read_min_active\fR (int) |
| .ad |
| .RS 12n |
| Minimum synchronous read I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB10\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_sync_write_max_active\fR (int) |
| .ad |
| .RS 12n |
| Maximum synchronous write I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB10\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_sync_write_min_active\fR (int) |
| .ad |
| .RS 12n |
| Minimum synchronous write I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB10\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_trim_max_active\fR (int) |
| .ad |
| .RS 12n |
| Maximum trim/discard I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB2\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_trim_min_active\fR (int) |
| .ad |
| .RS 12n |
| Minimum trim/discard I/Os active to each device. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_queue_depth_pct\fR (int) |
| .ad |
| .RS 12n |
| Maximum number of queued allocations per top-level vdev expressed as |
| a percentage of \fBzfs_vdev_async_write_max_active\fR which allows the |
| system to detect devices that are more capable of handling allocations |
| and to allocate more blocks to those devices. It allows for dynamic |
| allocation distribution when devices are imbalanced as fuller devices |
| will tend to be slower than empty devices. |
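| .sp |
| For example, assuming the defaults documented in this page (a value of 1000% |
| and \fBzfs_vdev_async_write_max_active\fR of 10), the limit works out to: |
| .sp |
| .nf |
| 10 * 1000 / 100 = 100 queued allocations per top-level vdev |
| .fi |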
| .sp |
| See also \fBzio_dva_throttle_enabled\fR. |
| .sp |
| Default value: \fB1000\fR%. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_expire_snapshot\fR (int) |
| .ad |
| .RS 12n |
| Seconds to expire .zfs/snapshot |
| .sp |
| Default value: \fB300\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_admin_snapshot\fR (int) |
| .ad |
| .RS 12n |
| Allow the creation, removal, or renaming of entries in the .zfs/snapshot |
| directory to cause the creation, destruction, or renaming of snapshots. |
| When enabled this functionality works both locally and over NFS exports |
| which have the 'no_root_squash' option set. This functionality is disabled |
| by default. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_flags\fR (int) |
| .ad |
| .RS 12n |
| Set additional debugging flags. The following flags may be bitwise-or'd |
| together. |
| .sp |
| .TS |
| box; |
| rB lB |
| lB lB |
| r l. |
| Value Symbolic Name |
| Description |
| _ |
| 1 ZFS_DEBUG_DPRINTF |
| Enable dprintf entries in the debug log. |
| _ |
| 2 ZFS_DEBUG_DBUF_VERIFY * |
| Enable extra dbuf verifications. |
| _ |
| 4 ZFS_DEBUG_DNODE_VERIFY * |
| Enable extra dnode verifications. |
| _ |
| 8 ZFS_DEBUG_SNAPNAMES |
| Enable snapshot name verification. |
| _ |
| 16 ZFS_DEBUG_MODIFY |
| Check for illegally modified ARC buffers. |
| _ |
| 64 ZFS_DEBUG_ZIO_FREE |
| Enable verification of block frees. |
| _ |
| 128 ZFS_DEBUG_HISTOGRAM_VERIFY |
| Enable extra spacemap histogram verifications. |
| _ |
| 256 ZFS_DEBUG_METASLAB_VERIFY |
| Verify space accounting on disk matches in-core range_trees. |
| _ |
| 512 ZFS_DEBUG_SET_ERROR |
| Enable SET_ERROR and dprintf entries in the debug log. |
| _ |
| 1024 ZFS_DEBUG_INDIRECT_REMAP |
| Verify split blocks created by device removal. |
| _ |
| 2048 ZFS_DEBUG_TRIM |
| Verify TRIM ranges are always within the allocatable range tree. |
| .TE |
| .sp |
| * Requires debug build. |
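| .sp |
| As a sketch of combining flags, enabling both \fBZFS_DEBUG_DPRINTF\fR (1) and |
| \fBZFS_DEBUG_SET_ERROR\fR (512) gives a bitwise-or of 513. Assuming the module |
| options are set through a modprobe configuration file (the file name below is |
| only a common convention, not something this page mandates), the line would be: |
| .sp |
| .nf |
| # /etc/modprobe.d/zfs.conf (assumed location) |
| options zfs zfs_flags=513 |
| .fi |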
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_free_leak_on_eio\fR (int) |
| .ad |
| .RS 12n |
| If destroy encounters an EIO while reading metadata (e.g. indirect |
| blocks), space referenced by the missing metadata can not be freed. |
| Normally this causes the background destroy to become "stalled", as |
| it is unable to make forward progress. While in this stalled state, |
| all remaining space to free from the error-encountering filesystem is |
| "temporarily leaked". Set this flag to cause it to ignore the EIO, |
| permanently leak the space from indirect blocks that can not be read, |
| and continue to free everything else that it can. |
| .sp |
| The default, "stalling" behavior is useful if the storage partially |
| fails (i.e. some but not all I/Os fail), and then later recovers. In |
| this case, we will be able to continue pool operations while it is |
| partially failed, and when it recovers, we can continue to free the |
| space, with no leaks. However, note that this case is actually |
| fairly rare. |
| .sp |
| Typically pools either (a) fail completely (but perhaps temporarily, |
| e.g. a top-level vdev going offline), or (b) have localized, |
| permanent errors (e.g. disk returns the wrong data due to bit flip or |
| firmware bug). In case (a), this setting does not matter because the |
| pool will be suspended and the sync thread will not be able to make |
| forward progress regardless. In case (b), because the error is |
| permanent, the best we can do is leak the minimum amount of space, |
| which is what setting this flag will do. Therefore, it is reasonable |
| for this flag to normally be set, but we chose the more conservative |
| approach of not setting it, so that there is no possibility of |
| leaking space in the "partial temporary" failure case. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_free_min_time_ms\fR (int) |
| .ad |
| .RS 12n |
| During a \fBzfs destroy\fR operation using \fBfeature@async_destroy\fR a minimum |
| of this much time, in milliseconds, will be spent working on freeing blocks per txg. |
| .sp |
| Default value: \fB1,000\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_immediate_write_sz\fR (long) |
| .ad |
| .RS 12n |
| Largest data block to write to zil. Larger blocks will be treated as if the |
| dataset being written to had the property setting \fBlogbias=throughput\fR. |
| .sp |
| Default value: \fB32,768\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_initialize_value\fR (ulong) |
| .ad |
| .RS 12n |
| Pattern written to vdev free space by \fBzpool initialize\fR. |
| .sp |
| Default value: \fB16,045,690,984,833,335,022\fR (0xdeadbeefdeadbeee). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_lua_max_instrlimit\fR (ulong) |
| .ad |
| .RS 12n |
| The maximum execution time limit that can be set for a ZFS channel program, |
| specified as a number of Lua instructions. |
| .sp |
| Default value: \fB100,000,000\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_lua_max_memlimit\fR (ulong) |
| .ad |
| .RS 12n |
| The maximum memory limit that can be set for a ZFS channel program, specified |
| in bytes. |
| .sp |
| Default value: \fB104,857,600\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_max_dataset_nesting\fR (int) |
| .ad |
| .RS 12n |
| The maximum depth of nested datasets. This value can be tuned temporarily to |
| fix existing datasets that exceed the predefined limit. |
| .sp |
| Default value: \fB50\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_max_recordsize\fR (int) |
| .ad |
| .RS 12n |
| We currently support block sizes from 512 bytes to 16MB. The benefits of |
| larger blocks, and thus larger I/O, need to be weighed against the cost of |
| COWing a giant block to modify one byte. Additionally, very large blocks |
| can have an impact on I/O latency, and also potentially on the memory |
| allocator. Therefore, we do not allow the recordsize to be set larger than |
| \fBzfs_max_recordsize\fR (default 1MB). Larger blocks can be created by changing |
| this tunable, and pools with larger blocks can always be imported and used, |
| regardless of this setting. |
| .sp |
| Default value: \fB1,048,576\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_metaslab_fragmentation_threshold\fR (int) |
| .ad |
| .RS 12n |
| Allow metaslabs to keep their active state as long as their fragmentation |
| percentage is less than or equal to this value. An active metaslab that |
| exceeds this threshold will no longer keep its active status allowing |
| better metaslabs to be selected. |
| .sp |
| Default value: \fB70\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_mg_fragmentation_threshold\fR (int) |
| .ad |
| .RS 12n |
| Metaslab groups are considered eligible for allocations if their |
| fragmentation metric (measured as a percentage) is less than or equal to |
| this value. If a metaslab group exceeds this threshold then it will be |
| skipped unless all metaslab groups within the metaslab class have also |
| crossed this threshold. |
| .sp |
| Default value: \fB95\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_mg_noalloc_threshold\fR (int) |
| .ad |
| .RS 12n |
| Defines a threshold at which metaslab groups should be eligible for |
| allocations. The value is expressed as a percentage of free space |
| beyond which a metaslab group is always eligible for allocations. |
| If a metaslab group's free space is less than or equal to the |
| threshold, the allocator will avoid allocating to that group |
| unless all groups in the pool have reached the threshold. Once all |
| groups have reached the threshold, all groups are allowed to accept |
| allocations. The default value of 0 disables the feature and causes |
| all metaslab groups to be eligible for allocations. |
| .sp |
| This parameter allows one to deal with pools having heavily imbalanced |
| vdevs such as would be the case when a new vdev has been added. |
| Setting the threshold to a non-zero percentage will stop allocations |
| from being made to vdevs that aren't filled to the specified percentage |
| and allow lesser filled vdevs to acquire more allocations than they |
| otherwise would under the old \fBzfs_mg_alloc_failures\fR facility. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_ddt_data_is_special\fR (int) |
| .ad |
| .RS 12n |
| If enabled, ZFS will place DDT data into the special allocation class. |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_user_indirect_is_special\fR (int) |
| .ad |
| .RS 12n |
| If enabled, ZFS will place user data (both file and zvol) indirect blocks |
| into the special allocation class. |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_multihost_history\fR (int) |
| .ad |
| .RS 12n |
| Historical statistics for the last N multihost updates will be available in |
| \fB/proc/spl/kstat/zfs/<pool>/multihost\fR. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_multihost_interval\fR (ulong) |
| .ad |
| .RS 12n |
| Used to control the frequency of multihost writes which are performed when the |
| \fBmultihost\fR pool property is on. This is one factor used to determine the |
| length of the activity check during import. |
| .sp |
| The multihost write period is \fBzfs_multihost_interval / leaf-vdevs\fR |
| milliseconds. On average a multihost write will be issued for each leaf vdev |
| every \fBzfs_multihost_interval\fR milliseconds. In practice, the observed |
| period can vary with the I/O load and this observed value is the delay which is |
| stored in the uberblock. |
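| .sp |
| For instance, with the default of 1000 and a hypothetical pool of 4 leaf vdevs: |
| .sp |
| .nf |
| 1000 ms / 4 leaf vdevs = 250 ms between multihost writes |
| .fi |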
| .sp |
| Default value: \fB1000\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_multihost_import_intervals\fR (uint) |
| .ad |
| .RS 12n |
| Used to control the duration of the activity test on import. Smaller values of |
| \fBzfs_multihost_import_intervals\fR will reduce the import time but increase |
| the risk of failing to detect an active pool. The total activity check time is |
| never allowed to drop below one second. |
| .sp |
| On import the activity check waits a minimum amount of time determined by |
| \fBzfs_multihost_interval * zfs_multihost_import_intervals\fR, or the same |
| product computed on the host which last had the pool imported (whichever is |
| greater). The activity check time may be further extended if the value of mmp |
| delay found in the best uberblock indicates actual multihost updates happened |
| at longer intervals than \fBzfs_multihost_interval\fR. A minimum value of |
| \fB100ms\fR is enforced. |
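| .sp |
| With the defaults on both hosts, and ignoring any extension from the mmp |
| delay, this product is: |
| .sp |
| .nf |
| 1000 ms * 20 = 20,000 ms (20 seconds) |
| .fi |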
| .sp |
| A value of 0 is ignored and treated as if it was set to 1. |
| .sp |
| Default value: \fB20\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_multihost_fail_intervals\fR (uint) |
| .ad |
| .RS 12n |
| Controls the behavior of the pool when multihost write failures or delays are |
| detected. |
| .sp |
| When \fBzfs_multihost_fail_intervals = 0\fR, multihost write failures or delays |
| are ignored. The failures will still be reported to the ZED which depending on |
| its configuration may take action such as suspending the pool or offlining a |
| device. |
| .sp |
| When \fBzfs_multihost_fail_intervals > 0\fR, the pool will be suspended if |
| \fBzfs_multihost_fail_intervals * zfs_multihost_interval\fR milliseconds pass |
| without a successful mmp write. This guarantees the activity test will see |
| mmp writes if the pool is imported. A value of 1 is ignored and treated as |
| if it was set to 2. This is necessary to prevent the pool from being suspended |
| due to normal, small I/O latency variations. |
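| .sp |
| With the default values documented on this page the suspension threshold is: |
| .sp |
| .nf |
| 10 * 1000 ms = 10,000 ms (10 seconds) without a successful mmp write |
| .fi |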
| .sp |
| Default value: \fB10\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_no_scrub_io\fR (int) |
| .ad |
| .RS 12n |
| Set for no scrub I/O. This results in scrubs not actually scrubbing data and |
| simply doing a metadata crawl of the pool instead. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_no_scrub_prefetch\fR (int) |
| .ad |
| .RS 12n |
| Set to disable block prefetching for scrubs. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_nocacheflush\fR (int) |
| .ad |
| .RS 12n |
| Disable cache flush operations on disks when writing. Setting this will |
| cause pool corruption on power loss if a volatile out-of-order write cache |
| is enabled. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_nopwrite_enabled\fR (int) |
| .ad |
| .RS 12n |
| Enable NOP writes. |
| .sp |
| Use \fB1\fR for yes (default) and \fB0\fR to disable. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_dmu_offset_next_sync\fR (int) |
| .ad |
| .RS 12n |
| Enable forcing a txg sync to find holes. When enabled, ZFS acts like prior |
| versions when the SEEK_HOLE or SEEK_DATA flags are used: if a dnode is |
| dirty, txgs are synced so that this data can be found. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR to disable (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_pd_bytes_max\fR (int) |
| .ad |
| .RS 12n |
| The number of bytes which should be prefetched during a pool traversal |
| (e.g. \fBzfs send\fR or other data crawling operations). |
| .sp |
| Default value: \fB52,428,800\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_per_txg_dirty_frees_percent\fR (ulong) |
| .ad |
| .RS 12n |
| Tunable to control percentage of dirtied indirect blocks from frees allowed |
| into one TXG. After this threshold is crossed, additional frees will wait until |
| the next TXG. |
| A value of zero will disable this throttle. |
| .sp |
| Default value: \fB5\fR, set to \fB0\fR to disable. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_prefetch_disable\fR (int) |
| .ad |
| .RS 12n |
| This tunable disables predictive prefetch. Note that it leaves "prescient" |
| prefetch (e.g. prefetch for zfs send) intact. Unlike predictive prefetch, |
| prescient prefetch never issues I/Os that end up not being needed, so it |
| can't hurt performance. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_qat_checksum_disable\fR (int) |
| .ad |
| .RS 12n |
| This tunable disables qat hardware acceleration for sha256 checksums. It |
| may be set after the zfs modules have been loaded to initialize the qat |
| hardware as long as support is compiled in and the qat driver is present. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_qat_compress_disable\fR (int) |
| .ad |
| .RS 12n |
| This tunable disables qat hardware acceleration for gzip compression. It |
| may be set after the zfs modules have been loaded to initialize the qat |
| hardware as long as support is compiled in and the qat driver is present. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_qat_encrypt_disable\fR (int) |
| .ad |
| .RS 12n |
| This tunable disables qat hardware acceleration for AES-GCM encryption. It |
| may be set after the zfs modules have been loaded to initialize the qat |
| hardware as long as support is compiled in and the qat driver is present. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_read_chunk_size\fR (long) |
| .ad |
| .RS 12n |
| Bytes to read per chunk. |
| .sp |
| Default value: \fB1,048,576\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_read_history\fR (int) |
| .ad |
| .RS 12n |
| Historical statistics for the last N reads will be available in |
| \fB/proc/spl/kstat/zfs/<pool>/reads\fR. |
| .sp |
| Default value: \fB0\fR (no data is kept). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_read_history_hits\fR (int) |
| .ad |
| .RS 12n |
| Include cache hits in read history. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_reconstruct_indirect_combinations_max\fR (int) |
| .ad |
| .RS 12n |
| If an indirect split block contains more than this many possible unique |
| combinations when being reconstructed, consider it too computationally |
| expensive to check them all. Instead, try at most |
| \fBzfs_reconstruct_indirect_combinations_max\fR randomly-selected |
| combinations each time the block is accessed. This allows all segment |
| copies to participate fairly in the reconstruction when all combinations |
| cannot be checked and prevents repeated use of one bad copy. |
| .sp |
| Default value: \fB4096\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_recover\fR (int) |
| .ad |
| .RS 12n |
| Set to attempt to recover from fatal errors. This should only be used as a |
| last resort, as it typically results in leaked space, or worse. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_removal_ignore_errors\fR (int) |
| .ad |
| .RS 12n |
| .sp |
| Ignore hard IO errors during device removal. When set, if a device encounters |
| a hard IO error during the removal process the removal will not be cancelled. |
| This can result in a normally recoverable block becoming permanently damaged |
| and is not recommended. This should only be used as a last resort when the |
| pool cannot be returned to a healthy state prior to removing the device. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_removal_suspend_progress\fR (int) |
| .ad |
| .RS 12n |
| .sp |
| This is used by the test suite so that it can ensure that certain actions |
| happen while in the middle of a removal. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_remove_max_segment\fR (int) |
| .ad |
| .RS 12n |
| .sp |
| The largest contiguous segment that we will attempt to allocate when removing |
| a device. This can be no larger than 16MB. If there is a performance |
| problem with attempting to allocate large blocks, consider decreasing this. |
| .sp |
| Default value: \fB16,777,216\fR (16MB). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_resilver_min_time_ms\fR (int) |
| .ad |
| .RS 12n |
| Resilvers are processed by the sync thread. While resilvering it will spend |
| at least this much time working on a resilver between txg flushes. |
| .sp |
| Default value: \fB3,000\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_scan_ignore_errors\fR (int) |
| .ad |
| .RS 12n |
| If set to a nonzero value, remove the DTL (dirty time list) upon |
| completion of a pool scan (scrub) even if there were unrepairable |
| errors. It is intended to be used during pool repair or recovery to |
| stop resilvering when the pool is next imported. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_scrub_min_time_ms\fR (int) |
| .ad |
| .RS 12n |
| Scrubs are processed by the sync thread. While scrubbing it will spend |
| at least this much time working on a scrub between txg flushes. |
| .sp |
| Default value: \fB1,000\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_scan_checkpoint_intval\fR (int) |
| .ad |
| .RS 12n |
| To preserve progress across reboots the sequential scan algorithm periodically |
| needs to stop metadata scanning and issue all the verification I/Os to disk. |
| The frequency of this flushing is determined by the |
| \fBzfs_scan_checkpoint_intval\fR tunable. |
| .sp |
| Default value: \fB7200\fR seconds (every 2 hours). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_scan_fill_weight\fR (int) |
| .ad |
| .RS 12n |
| This tunable affects how scrub and resilver I/O segments are ordered. A higher |
| number indicates that we care more about how filled in a segment is, while a |
| lower number indicates we care more about the size of the extent without |
| considering the gaps within a segment. This value is only tunable upon module |
| insertion. Changing the value afterwards will have no effect on scrub or |
| resilver performance. |
| .sp |
| Default value: \fB3\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_scan_issue_strategy\fR (int) |
| .ad |
| .RS 12n |
| Determines the order that data will be verified while scrubbing or resilvering. |
| If set to \fB1\fR, data will be verified as sequentially as possible, given the |
| amount of memory reserved for scrubbing (see \fBzfs_scan_mem_lim_fact\fR). This |
| may improve scrub performance if the pool's data is very fragmented. If set to |
| \fB2\fR, the largest mostly-contiguous chunk of found data will be verified |
| first. By deferring scrubbing of small segments, we may later find adjacent data |
| to coalesce and increase the segment size. If set to \fB0\fR, zfs will use |
| strategy \fB1\fR during normal verification and strategy \fB2\fR while taking a |
| checkpoint. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_scan_legacy\fR (int) |
| .ad |
| .RS 12n |
| A value of 0 indicates that scrubs and resilvers will gather metadata in |
| memory before issuing sequential I/O. A value of 1 indicates that the legacy |
| algorithm will be used where I/O is initiated as soon as it is discovered. |
| Changing this value to 0 will not affect scrubs or resilvers that are already |
| in progress. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_scan_max_ext_gap\fR (int) |
| .ad |
| .RS 12n |
| Indicates the largest gap in bytes between scrub / resilver I/Os that will still |
| be considered sequential for sorting purposes. Changing this value will not |
| affect scrubs or resilvers that are already in progress. |
| .sp |
| Default value: \fB2097152 (2 MB)\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_scan_mem_lim_fact\fR (int) |
| .ad |
| .RS 12n |
| Maximum fraction of RAM used for I/O sorting by sequential scan algorithm. |
| This tunable determines the hard limit for I/O sorting memory usage. |
| When the hard limit is reached we stop scanning metadata and start issuing |
| data verification I/O. This is done until we get below the soft limit. |
| .sp |
| Default value: \fB20\fR which is 5% of RAM (1/20). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_scan_mem_lim_soft_fact\fR (int) |
| .ad |
| .RS 12n |
| The fraction of the hard limit used to determine the soft limit for I/O sorting |
| by the sequential scan algorithm. When we cross this limit from below no action |
| is taken. When we cross this limit from above it is because we are issuing |
| verification I/O. In this case (unless the metadata scan is done) we stop |
| issuing verification I/O and start scanning metadata again until we get to the |
| hard limit. |
| .sp |
| Default value: \fB20\fR which is 5% of the hard limit (1/20). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_scan_vdev_limit\fR (int) |
| .ad |
| .RS 12n |
| Maximum amount of data that can be concurrently issued for scrubs and |
| resilvers per leaf device, given in bytes. |
| .sp |
| Default value: \fB41943040\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_send_corrupt_data\fR (int) |
| .ad |
| .RS 12n |
| Allow sending of corrupt data (ignore read/checksum errors when sending data) |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_send_unmodified_spill_blocks\fR (int) |
| .ad |
| .RS 12n |
| Include unmodified spill blocks in the send stream. Under certain circumstances |
| previous versions of ZFS could incorrectly remove the spill block from an |
| existing object. Including unmodified copies of the spill blocks creates a |
| backwards compatible stream which will recreate a spill block if it was |
| incorrectly removed. |
| .sp |
| Use \fB1\fR for yes (default) and \fB0\fR for no. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_send_queue_length\fR (int) |
| .ad |
| .RS 12n |
| The maximum number of bytes allowed in the \fBzfs send\fR queue. This value |
| must be at least twice the maximum block size in use. |
| .sp |
| Default value: \fB16,777,216\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_recv_queue_length\fR (int) |
| .ad |
| .RS 12n |
| The maximum number of bytes allowed in the \fBzfs receive\fR queue. This value |
| must be at least twice the maximum block size in use. |
| .sp |
| Default value: \fB16,777,216\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_sync_pass_deferred_free\fR (int) |
| .ad |
| .RS 12n |
| Flushing of data to disk is done in passes. Defer frees starting in this pass |
| .sp |
| Default value: \fB2\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_spa_discard_memory_limit\fR (int) |
| .ad |
| .RS 12n |
| Maximum memory used for prefetching a checkpoint's space map on each |
| vdev while discarding the checkpoint. |
| .sp |
| Default value: \fB16,777,216\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_special_class_metadata_reserve_pct\fR (int) |
| .ad |
| .RS 12n |
| Only allow small data blocks to be allocated on the special and dedup vdev |
| types when the available free space percentage on these vdevs exceeds this |
| value. This ensures reserved space is available for pool meta data as the |
| special vdevs approach capacity. |
| .sp |
| Default value: \fB25\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_sync_pass_dont_compress\fR (int) |
| .ad |
| .RS 12n |
| Starting in this sync pass, we disable compression (including of metadata). |
| With the default setting, in practice, we don't have this many sync passes, |
| so this has no effect. |
| .sp |
| The original intent was that disabling compression would help the sync passes |
| to converge. However, in practice disabling compression increases the average |
| number of sync passes, because when we turn compression off, many blocks' |
| sizes will change and thus we have to re-allocate (not overwrite) them. It |
| also increases the number of 128KB allocations (e.g. for indirect blocks and |
| spacemaps) because these will not be compressed. The 128K allocations are |
| especially detrimental to performance on highly fragmented systems, which may |
| have very few free segments of this size, and may need to load new metaslabs |
| to satisfy 128K allocations. |
| .sp |
| Default value: \fB8\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_sync_pass_rewrite\fR (int) |
| .ad |
| .RS 12n |
| Rewrite new block pointers starting in this pass |
| .sp |
| Default value: \fB2\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_sync_taskq_batch_pct\fR (int) |
| .ad |
| .RS 12n |
| This controls the number of threads used by the dp_sync_taskq. The default |
| value of 75% will create a maximum of one thread per cpu. |
| .sp |
| Default value: \fB75\fR%. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_trim_extent_bytes_max\fR (unsigned int) |
| .ad |
| .RS 12n |
| Maximum size of TRIM command. Ranges larger than this will be split into |
| chunks no larger than \fBzfs_trim_extent_bytes_max\fR bytes before being |
| issued to the device. |
| .sp |
| Default value: \fB134,217,728\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_trim_extent_bytes_min\fR (unsigned int) |
| .ad |
| .RS 12n |
| Minimum size of TRIM commands. TRIM ranges smaller than this will be skipped |
| unless they're part of a larger range which was broken into chunks. This is |
| done because it's common for these small TRIMs to negatively impact overall |
| performance. This value can be set to 0 to TRIM all unallocated space. |
| .sp |
| Default value: \fB32,768\fR. |
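| .sp |
| The interaction between \fBzfs_trim_extent_bytes_max\fR and |
| \fBzfs_trim_extent_bytes_min\fR can be sketched as follows. This is a minimal |
| illustration of the behaviour described above, not the module's code; the |
| function name plan_trims and its arguments are hypothetical. |
| .nf |
```python
def plan_trims(start, length, extent_max=134217728, extent_min=32768):
    """Return the (offset, size) TRIM commands for one free range.

    Hypothetical helper: it only mirrors the documented rules.
    """
    # A whole range smaller than the minimum is skipped.
    if length < extent_min:
        return []
    chunks = []
    offset, end = start, start + length
    while offset < end:
        # No chunk may be larger than the maximum extent size.
        size = min(extent_max, end - offset)
        # A small tail of a larger range is still issued.
        chunks.append((offset, size))
        offset += size
    return chunks
```
| .fi |
| In this sketch a lone 4096 byte range is skipped with the default minimum, |
| while the 4096 byte tail of a range slightly larger than the maximum is kept. |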
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_trim_metaslab_skip\fR (unsigned int) |
| .ad |
| .RS 12n |
| Skip uninitialized metaslabs during the TRIM process. This option is useful |
| for pools constructed from large thinly-provisioned devices where TRIM |
| operations are slow. As a pool ages an increasing fraction of the pool's |
| metaslabs will be initialized, progressively degrading the usefulness of |
| this option. This setting is stored when starting a manual TRIM and will |
| persist for the duration of the requested TRIM. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_trim_queue_limit\fR (unsigned int) |
| .ad |
| .RS 12n |
| Maximum number of queued TRIMs outstanding per leaf vdev. The number of |
| concurrent TRIM commands issued to the device is controlled by the |
| \fBzfs_vdev_trim_min_active\fR and \fBzfs_vdev_trim_max_active\fR module |
| options. |
| .sp |
| Default value: \fB10\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_trim_txg_batch\fR (unsigned int) |
| .ad |
| .RS 12n |
| The number of transaction groups worth of frees which should be aggregated |
| before TRIM operations are issued to the device. This setting represents a |
| trade-off between issuing larger, more efficient TRIM operations and the |
| delay before the recently trimmed space is available for use by the device. |
| .sp |
| Increasing this value will allow frees to be aggregated for a longer time. |
| This will result in larger TRIM operations and potentially increased memory |
| usage. Decreasing this value will have the opposite effect. The default |
| value of 32 was determined to be a reasonable compromise. |
| .sp |
| Default value: \fB32\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_txg_history\fR (int) |
| .ad |
| .RS 12n |
| Historical statistics for the last N txgs will be available in |
| \fB/proc/spl/kstat/zfs/<pool>/txgs\fR |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_txg_timeout\fR (int) |
| .ad |
| .RS 12n |
| Flush dirty data to disk at least every N seconds (maximum txg duration) |
| .sp |
| Default value: \fB5\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_aggregate_trim\fR (int) |
| .ad |
| .RS 12n |
| Allow TRIM I/Os to be aggregated. This is normally not helpful because |
| the extents to be trimmed will have already been aggregated by the |
| metaslab. This option is provided for debugging and performance analysis. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_aggregation_limit\fR (int) |
| .ad |
| .RS 12n |
| Max vdev I/O aggregation size |
| .sp |
| Default value: \fB1,048,576\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_aggregation_limit_non_rotating\fR (int) |
| .ad |
| .RS 12n |
| Max vdev I/O aggregation size for non-rotating media |
| .sp |
| Default value: \fB131,072\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_cache_bshift\fR (int) |
| .ad |
| .RS 12n |
| Shift size to inflate reads to |
| .sp |
| Default value: \fB16\fR (effectively 65536). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_cache_max\fR (int) |
| .ad |
| .RS 12n |
| Inflate reads smaller than this value to meet the \fBzfs_vdev_cache_bshift\fR |
| size (default 64k). |
| .sp |
| Default value: \fB16384\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_cache_size\fR (int) |
| .ad |
| .RS 12n |
| Total size of the per-disk cache in bytes. |
| .sp |
| Currently this feature is disabled as it has been found to not be helpful |
| for performance and in some cases harmful. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_mirror_rotating_inc\fR (int) |
| .ad |
| .RS 12n |
| A number by which the balancing algorithm increments the load calculation, for |
| the purpose of selecting the least busy mirror member, when an I/O immediately |
| follows its predecessor on rotational vdevs. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_mirror_rotating_seek_inc\fR (int) |
| .ad |
| .RS 12n |
| A number by which the balancing algorithm increments the load calculation for |
| the purpose of selecting the least busy mirror member when an I/O lacks |
| locality as defined by the zfs_vdev_mirror_rotating_seek_offset. I/Os within |
| this that are not immediately following the previous I/O are incremented by |
| half. |
| .sp |
| Default value: \fB5\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_mirror_rotating_seek_offset\fR (int) |
| .ad |
| .RS 12n |
| The maximum distance from the last queued I/O within which the balancing |
| algorithm considers an I/O to have locality. |
| See the section "ZFS I/O SCHEDULER". |
| .sp |
| Default value: \fB1048576\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_mirror_non_rotating_inc\fR (int) |
| .ad |
| .RS 12n |
| A number by which the balancing algorithm increments the load calculation for |
| the purpose of selecting the least busy mirror member on non-rotational vdevs |
| when I/Os do not immediately follow one another. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_mirror_non_rotating_seek_inc\fR (int) |
| .ad |
| .RS 12n |
| A number by which the balancing algorithm increments the load calculation for |
| the purpose of selecting the least busy mirror member when an I/O lacks |
| locality as defined by the zfs_vdev_mirror_rotating_seek_offset. I/Os within |
| this that are not immediately following the previous I/O are incremented by |
| half. |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_read_gap_limit\fR (int) |
| .ad |
| .RS 12n |
| Aggregate read I/O operations if the gap on-disk between them is within this |
| threshold. |
| .sp |
| Default value: \fB32,768\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_write_gap_limit\fR (int) |
| .ad |
| .RS 12n |
| Aggregate write I/O operations if the gap on-disk between them is within this |
| threshold. |
| .sp |
| Default value: \fB4,096\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_vdev_raidz_impl\fR (string) |
| .ad |
| .RS 12n |
| Parameter for selecting raidz parity implementation to use. |
| |
| Options marked (always) below may be selected on module load as they are |
| supported on all systems. |
| The remaining options may only be set after the module is loaded, as they |
| are available only if the implementations are compiled in and supported |
| on the running system. |
| |
| Once the module is loaded, the content of |
| /sys/module/zfs/parameters/zfs_vdev_raidz_impl will show available options |
| with the currently selected one enclosed in []. |
| Possible options are: |
| fastest - (always) implementation selected using built-in benchmark |
| original - (always) original raidz implementation |
| scalar - (always) scalar raidz implementation |
| sse2 - implementation using SSE2 instruction set (64bit x86 only) |
| ssse3 - implementation using SSSE3 instruction set (64bit x86 only) |
| avx2 - implementation using AVX2 instruction set (64bit x86 only) |
| avx512f - implementation using AVX512F instruction set (64bit x86 only) |
| avx512bw - implementation using AVX512F & AVX512BW instruction sets (64bit x86 only) |
| aarch64_neon - implementation using NEON (Aarch64/64 bit ARMv8 only) |
| aarch64_neonx2 - implementation using NEON with more unrolling (Aarch64/64 bit ARMv8 only) |
| .sp |
| Default value: \fBfastest\fR. |
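| .sp |
| As a hedged illustration, the currently selected implementation can be |
| extracted from the parameter file by looking for the option enclosed in []. |
| The helper names selected_impl and read_raidz_impl are hypothetical, and the |
| exact layout of the file beyond the bracket convention is an assumption. |
| .nf |
```python
def selected_impl(text):
    """Return the option enclosed in [] in the parameter contents, or None."""
    start = text.find("[")
    end = text.find("]", start + 1)
    if start < 0 or end < 0:
        return None
    return text[start + 1:end]


def read_raidz_impl(path="/sys/module/zfs/parameters/zfs_vdev_raidz_impl"):
    """Read the parameter file (module must be loaded) and parse it."""
    with open(path) as f:
        return selected_impl(f.read())
```
| .fi |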
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_zevent_cols\fR (int) |
| .ad |
| .RS 12n |
| When zevents are logged to the console use this as the word wrap width. |
| .sp |
| Default value: \fB80\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_zevent_console\fR (int) |
| .ad |
| .RS 12n |
| Log events to the console |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_zevent_len_max\fR (int) |
| .ad |
| .RS 12n |
| Max event queue length. A value of 0 will result in a calculated value which |
| increases with the number of CPUs in the system (minimum 64 events). Events |
| in the queue can be viewed with the \fBzpool events\fR command. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_zil_clean_taskq_maxalloc\fR (int) |
| .ad |
| .RS 12n |
| The maximum number of taskq entries that are allowed to be cached. When this |
| limit is exceeded transaction records (itxs) will be cleaned synchronously. |
| .sp |
| Default value: \fB1048576\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_zil_clean_taskq_minalloc\fR (int) |
| .ad |
| .RS 12n |
| The number of taskq entries that are pre-populated when the taskq is first |
| created and are immediately available for use. |
| .sp |
| Default value: \fB1024\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzfs_zil_clean_taskq_nthr_pct\fR (int) |
| .ad |
| .RS 12n |
| This controls the number of threads used by the dp_zil_clean_taskq. The default |
| value of 100% will create a maximum of one thread per cpu. |
| .sp |
| Default value: \fB100\fR%. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzil_maxblocksize\fR (int) |
| .ad |
| .RS 12n |
| This sets the maximum block size used by the ZIL. On very fragmented pools, |
| lowering this (typically to 36KB) can improve performance. |
| .sp |
| Default value: \fB131072\fR (128KB). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzil_nocacheflush\fR (int) |
| .ad |
| .RS 12n |
| Disable the cache flush commands that are normally sent to the disk(s) by |
| the ZIL after an LWB write has completed. Setting this will cause ZIL |
| corruption on power loss if a volatile out-of-order write cache is enabled. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzil_replay_disable\fR (int) |
| .ad |
| .RS 12n |
| Disable intent log replay. Replay can be disabled to recover from a |
| corrupted ZIL. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzil_slog_bulk\fR (ulong) |
| .ad |
| .RS 12n |
| Limit SLOG write size per commit executed with synchronous priority. |
| Any writes above that will be executed with lower (asynchronous) priority |
| to limit potential SLOG device abuse by single active ZIL writer. |
| .sp |
| Default value: \fB786,432\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzio_deadman_log_all\fR (int) |
| .ad |
| .RS 12n |
| If non-zero, the zio deadman will produce debugging messages (see |
| \fBzfs_dbgmsg_enable\fR) for all zios, rather than only for leaf |
| zios possessing a vdev. This is meant to be used by developers to gain |
| diagnostic information for hang conditions which don't involve a mutex |
| or other locking primitive; typically conditions in which a thread in |
| the zio pipeline is looping indefinitely. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzio_decompress_fail_fraction\fR (int) |
| .ad |
| .RS 12n |
| If non-zero, this value represents the denominator of the probability that zfs |
| should induce a decompression failure. For instance, for a 5% decompression |
| failure rate, this value should be set to 20. |
| .sp |
| Default value: \fB0\fR. |
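| .sp |
| The relation between the setting and the induced failure rate is simple |
| arithmetic, sketched here with hypothetical helper names: |
| .nf |
```python
def failure_probability(fraction):
    """Probability of an induced decompression failure for a given setting."""
    # A value of zero means no failures are induced.
    if fraction == 0:
        return 0.0
    return 1.0 / fraction


def fraction_for_rate(rate):
    """Setting that approximates a desired failure rate (0 < rate <= 1)."""
    return round(1.0 / rate)
```
| .fi |
| For a 5% rate, fraction_for_rate(0.05) gives the value 20 quoted above. |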
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzio_slow_io_ms\fR (int) |
| .ad |
| .RS 12n |
| When an I/O operation takes more than \fBzio_slow_io_ms\fR milliseconds to |
| complete it is marked as a slow I/O. Each slow I/O causes a delay zevent. Slow |
| I/O counters can be seen with "zpool status -s". |
| |
| .sp |
| Default value: \fB30,000\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzio_dva_throttle_enabled\fR (int) |
| .ad |
| .RS 12n |
| Throttle block allocations in the I/O pipeline. This allows for |
| dynamic allocation distribution when devices are imbalanced. |
| When enabled, the maximum number of pending allocations per top-level vdev |
| is limited by \fBzfs_vdev_queue_depth_pct\fR. |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzio_requeue_io_start_cut_in_line\fR (int) |
| .ad |
| .RS 12n |
| Prioritize requeued I/O |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzio_taskq_batch_pct\fR (uint) |
| .ad |
| .RS 12n |
| Percentage of online CPUs (or CPU cores, etc) which will run a worker thread |
| for I/O. These workers are responsible for I/O work such as compression and |
| checksum calculations. Fractional number of CPUs will be rounded down. |
| .sp |
| The default value of 75 was chosen to avoid using all CPUs which can result in |
| latency issues and inconsistent application performance, especially when high |
| compression is enabled. |
| .sp |
| Default value: \fB75\fR. |
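| .sp |
| A rough sketch of the rounding described above, using a hypothetical helper. |
| It does not model any lower bound the module may apply on small systems. |
| .nf |
```python
def worker_threads(online_cpus, batch_pct=75):
    """Worker thread count, with a fractional number of CPUs rounded down."""
    return online_cpus * batch_pct // 100
```
| .fi |
| With the default of 75, 8 online CPUs give 6 threads, and 6 online CPUs give |
| 4 threads (4.5 rounded down). |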
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzvol_inhibit_dev\fR (uint) |
| .ad |
| .RS 12n |
| Do not create zvol device nodes. This may slightly improve startup time on |
| systems with a very large number of zvols. |
| .sp |
| Use \fB1\fR for yes and \fB0\fR for no (default). |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzvol_major\fR (uint) |
| .ad |
| .RS 12n |
| Major number for zvol block devices |
| .sp |
| Default value: \fB230\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzvol_max_discard_blocks\fR (ulong) |
| .ad |
| .RS 12n |
| Discard (aka TRIM) operations done on zvols will be done in batches of this |
| many blocks, where block size is determined by the \fBvolblocksize\fR property |
| of a zvol. |
| .sp |
| Default value: \fB16,384\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzvol_prefetch_bytes\fR (uint) |
| .ad |
| .RS 12n |
| When adding a zvol to the system prefetch \fBzvol_prefetch_bytes\fR |
| from the start and end of the volume. Prefetching these regions |
| of the volume is desirable because they are likely to be accessed |
| immediately by \fBblkid(8)\fR or by the kernel scanning for a partition |
| table. |
| .sp |
| Default value: \fB131,072\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzvol_request_sync\fR (uint) |
| .ad |
| .RS 12n |
| When processing I/O requests for a zvol submit them synchronously. This |
| effectively limits the queue depth to 1 for each I/O submitter. When set |
| to 0 requests are handled asynchronously by a thread pool. The number of |
| requests which can be handled concurrently is controlled by \fBzvol_threads\fR. |
| .sp |
| Default value: \fB0\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzvol_threads\fR (uint) |
| .ad |
| .RS 12n |
| Max number of threads which can handle zvol I/O requests concurrently. |
| .sp |
| Default value: \fB32\fR. |
| .RE |
| |
| .sp |
| .ne 2 |
| .na |
| \fBzvol_volmode\fR (uint) |
| .ad |
| .RS 12n |
| Defines zvol block devices behaviour when \fBvolmode\fR is set to \fBdefault\fR. |
| Valid values are \fB1\fR (full), \fB2\fR (dev) and \fB3\fR (none). |
| .sp |
| Default value: \fB1\fR. |
| .RE |
| |
| .SH ZFS I/O SCHEDULER |
| ZFS issues I/O operations to leaf vdevs to satisfy and complete I/Os. |
| The I/O scheduler determines when and in what order those operations are |
| issued. The I/O scheduler divides operations into five I/O classes |
| prioritized in the following order: sync read, sync write, async read, |
| async write, and scrub/resilver. Each queue defines the minimum and |
| maximum number of concurrent operations that may be issued to the |
| device. In addition, the device has an aggregate maximum, |
| \fBzfs_vdev_max_active\fR. Note that the sum of the per-queue minimums |
| must not exceed the aggregate maximum. If the sum of the per-queue |
| maximums exceeds the aggregate maximum, then the number of active I/Os |
| may reach \fBzfs_vdev_max_active\fR, in which case no further I/Os will |
| be issued regardless of whether all per-queue minimums have been met. |
| .sp |
| For many physical devices, throughput increases with the number of |
| concurrent operations, but latency typically suffers. Further, physical |
| devices typically have a limit at which more concurrent operations have no |
| effect on throughput or can actually cause it to decrease. |
| .sp |
| The scheduler selects the next operation to issue by first looking for an |
| I/O class whose minimum has not been satisfied. Once all are satisfied and |
| the aggregate maximum has not been hit, the scheduler looks for classes |
| whose maximum has not been satisfied. Iteration through the I/O classes is |
| done in the order specified above. No further operations are issued if the |
| aggregate maximum number of concurrent operations has been hit or if there |
| are no operations queued for an I/O class that has not hit its maximum. |
| Every time an I/O is queued or an operation completes, the I/O scheduler |
| looks for new operations to issue. |
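| .sp |
| The selection order described above can be sketched as follows. This is an |
| illustration of the documented rules rather than the scheduler's code; the |
| function next_class, the class labels and the dictionary layout are all |
| hypothetical. |
| .nf |
```python
# I/O classes in priority order, as listed above.
CLASSES = ["sync_read", "sync_write", "async_read", "async_write", "scrub"]


def next_class(queued, active, limits, max_active):
    """Return the class to issue from next, or None if nothing may be issued.

    queued, active: dict of class name to operation count
    limits: dict of class name to (min_active, max_active)
    max_active: aggregate per-device maximum
    """
    # Nothing is issued once the aggregate maximum has been hit.
    if sum(active.values()) >= max_active:
        return None
    # First pass: queued classes whose minimum is not yet satisfied.
    for c in CLASSES:
        if queued[c] > 0 and active[c] < limits[c][0]:
            return c
    # Second pass: queued classes still below their own maximum.
    for c in CLASSES:
        if queued[c] > 0 and active[c] < limits[c][1]:
            return c
    return None
```
| .fi |
| A caller would invoke this each time an I/O is queued or an operation |
| completes, issuing from the returned class until None is returned. |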
| .sp |
| In general, smaller max_active's will lead to lower latency of synchronous |
| operations. Larger max_active's may lead to higher overall throughput, |
| depending on underlying storage. |
| .sp |
| The ratio of the queues' max_actives determines the balance of performance |
| between reads, writes, and scrubs. E.g., increasing |
| \fBzfs_vdev_scrub_max_active\fR will cause the scrub or resilver to complete |
| more quickly, but reads and writes to have higher latency and lower throughput. |
| .sp |
| All I/O classes have a fixed maximum number of outstanding operations |
| except for the async write class. Asynchronous writes represent the data |
| that is committed to stable storage during the syncing stage for |
| transaction groups. Transaction groups enter the syncing state |
| periodically so the number of queued async writes will quickly burst up |
| and then bleed down to zero. Rather than servicing them as quickly as |
| possible, the I/O scheduler changes the maximum number of active async |
| write I/Os according to the amount of dirty data in the pool. Since |
| both throughput and latency typically increase with the number of |
| concurrent operations issued to physical devices, reducing the |
| burstiness in the number of concurrent operations also stabilizes the |
| response time of operations from other -- and in particular synchronous |
| -- queues. In broad strokes, the I/O scheduler will issue more |
| concurrent operations from the async write queue as there's more dirty |
| data in the pool. |
| .sp |
| Async Writes |
| .sp |
| The number of concurrent operations issued for the async write I/O class |
| follows a piece-wise linear function defined by a few adjustable points. |
| .nf |
| |
| | o---------| <-- zfs_vdev_async_write_max_active |
| ^ | /^ | |
| | | / | | |
| active | / | | |
| I/O | / | | |
| count | / | | |
| | / | | |
| |-------o | | <-- zfs_vdev_async_write_min_active |
| 0|_______^______|_________| |
| 0% | | 100% of zfs_dirty_data_max |
| | | |
| | `-- zfs_vdev_async_write_active_max_dirty_percent |
| `--------- zfs_vdev_async_write_active_min_dirty_percent |
| |
| .fi |
| Until the amount of dirty data exceeds a minimum percentage of the dirty |
| data allowed in the pool, the I/O scheduler will limit the number of |
| concurrent operations to the minimum. As that threshold is crossed, the |
| number of concurrent operations issued increases linearly to the maximum at |
| the specified maximum percentage of the dirty data allowed in the pool. |
| .sp |
| Ideally, the amount of dirty data on a busy pool will stay in the sloped |
| part of the function between \fBzfs_vdev_async_write_active_min_dirty_percent\fR |
| and \fBzfs_vdev_async_write_active_max_dirty_percent\fR. If it exceeds the |
| maximum percentage, this indicates that the rate of incoming data is |
| greater than the rate that the backend storage can handle. In this case, we |
| must further throttle incoming writes, as described in the next section. |
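| .sp |
| The piece-wise linear function can be sketched as below. The function name |
| and argument names are hypothetical, and the numbers in the usage note are |
| illustrative assumptions rather than documented defaults. |
| .nf |
```python
def async_write_max_active(dirty, dirty_max, min_active, max_active,
                           min_dirty_pct, max_dirty_pct):
    """Concurrent async write limit for a given amount of dirty data."""
    lo = dirty_max * min_dirty_pct / 100
    hi = dirty_max * max_dirty_pct / 100
    # Flat at the minimum until the lower threshold is crossed.
    if dirty <= lo:
        return min_active
    # Flat at the maximum beyond the upper threshold.
    if dirty >= hi:
        return max_active
    # Linear interpolation on the sloped part of the function.
    slope = (max_active - min_active) / (hi - lo)
    return int(min_active + (dirty - lo) * slope)
```
| .fi |
| For example, assuming min_active=2, max_active=10 and thresholds of 30% and |
| 60%, dirty data at 45% of \fBzfs_dirty_data_max\fR yields a limit of 6. |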
| |
| .SH ZFS TRANSACTION DELAY |
| We delay transactions when we've determined that the backend storage |
| isn't able to accommodate the rate of incoming writes. |
| .sp |
| If there is already a transaction waiting, we delay relative to when |
| that transaction will finish waiting. This way the calculated delay time |
| is independent of the number of threads concurrently executing |
| transactions. |
| .sp |
| If we are the only waiter, wait relative to when the transaction |
| started, rather than the current time. This credits the transaction for |
| "time already served", e.g. reading indirect blocks. |
| .sp |
| The minimum time for a transaction to take is calculated as: |
| .nf |
| min_time = zfs_delay_scale * (dirty - min) / (max - dirty) |
| min_time is then capped at 100 milliseconds. |
| .fi |
| .sp |
| The delay has two degrees of freedom that can be adjusted via tunables. The |
| percentage of dirty data at which we start to delay is defined by |
| \fBzfs_delay_min_dirty_percent\fR. This should typically be at or above |
| \fBzfs_vdev_async_write_active_max_dirty_percent\fR so that we only start to |
| delay after writing at full speed has failed to keep up with the incoming write |
| rate. The scale of the curve is defined by \fBzfs_delay_scale\fR. Roughly speaking, |
| this variable determines the amount of delay at the midpoint of the curve. |
| .sp |
| .nf |
| delay |
| 10ms +-------------------------------------------------------------*+ |
| | *| |
| 9ms + *+ |
| | *| |
| 8ms + *+ |
| | * | |
| 7ms + * + |
| | * | |
| 6ms + * + |
| | * | |
| 5ms + * + |
| | * | |
| 4ms + * + |
| | * | |
| 3ms + * + |
| | * | |
| 2ms + (midpoint) * + |
| | | ** | |
| 1ms + v *** + |
| | zfs_delay_scale ----------> ******** | |
| 0 +-------------------------------------*********----------------+ |
| 0% <- zfs_dirty_data_max -> 100% |
| .fi |
| .sp |
| Note that since the delay is added to the outstanding time remaining on the |
| most recent transaction, the delay is effectively the inverse of IOPS. |
| Here the midpoint of 500us translates to 2000 IOPS. The shape of the curve |
| was chosen such that small changes in the amount of accumulated dirty data |
| in the first 3/4 of the curve yield relatively small differences in the |
| amount of delay. |
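| .sp |
| The formula and the midpoint figure can be checked numerically with a small |
| sketch. It assumes that \fBzfs_delay_scale\fR is expressed in nanoseconds and |
| is set to 500000 (consistent with the 500us midpoint), and that delay starts |
| at 60% dirty data; both values, like the function name, are assumptions made |
| for illustration. |
| .nf |
```python
def min_delay_ns(dirty, dirty_max, delay_scale_ns, min_dirty_pct):
    """Minimum transaction time in nanoseconds for an amount of dirty data."""
    cap_ns = 100 * 1000 * 1000  # the 100 millisecond cap
    start = dirty_max * min_dirty_pct / 100
    # No delay below the starting threshold.
    if dirty <= start:
        return 0
    # At the limit the formula diverges, so only the cap applies.
    if dirty >= dirty_max:
        return cap_ns
    delay = delay_scale_ns * (dirty - start) / (dirty_max - dirty)
    return min(delay, cap_ns)
```
| .fi |
| Halfway between the threshold and the limit, (dirty - min) equals |
| (max - dirty), so the delay equals the scale: 500000 ns, or 2000 IOPS. |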
| .sp |
| The effects can be easier to understand when the amount of delay is |
| represented on a log scale: |
| .sp |
| .nf |
| delay |
| 100ms +-------------------------------------------------------------++ |
| + + |
| | | |
| + *+ |
| 10ms + *+ |
| + ** + |
| | (midpoint) ** | |
| + | ** + |
| 1ms + v **** + |
| + zfs_delay_scale ----------> ***** + |
| | **** | |
| + **** + |
| 100us + ** + |
| + * + |
| | * | |
| + * + |
| 10us + * + |
| + + |
| | | |
| + + |
| +--------------------------------------------------------------+ |
| 0% <- zfs_dirty_data_max -> 100% |
| .fi |
| .sp |
| Note here that only as the amount of dirty data approaches its limit does |
| the delay start to increase rapidly. The goal of a properly tuned system |
| should be to keep the amount of dirty data out of that range by first |
| ensuring that the appropriate limits are set for the I/O scheduler to reach |
| optimal throughput on the backend storage, and then by changing the value |
| of \fBzfs_delay_scale\fR to increase the steepness of the curve. |