− | =btrfs-man5(5) manual page=
| |
{{GeneratedManpage
|name=btrfs-man5}}
− |
| |
− | ==NAME==
| |
− | btrfs-man5 - topics about the BTRFS filesystem (mount options, supported file attributes and other)
| |
− |
| |
− | ==DESCRIPTION==
| |
− |
| |
− | <p>This document describes topics related to BTRFS that are not specific to the
| |
− | tools. Currently covers:</p>
| |
− | <ol>
| |
− | <li>
| |
− | <p>
| |
− | mount options
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | filesystem features
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | checksum algorithms
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | compression
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | filesystem exclusive operations
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | filesystem limits
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | bootloader support
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | file attributes
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | zoned mode
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | control device
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | filesystems with multiple block group profiles
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | seeding device
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | raid56 status and recommended practices
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | storage model
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | hardware considerations
| |
− | </p>
| |
− | </li>
| |
− | </ol>
| |
− | ==MOUNT OPTIONS==
| |
− |
| |
− | <p>This section describes mount options specific to BTRFS. For the generic mount
| |
− | options please refer to [http://man7.org/linux/man-pages/man8/mount.8.html mount(8)] manpage. The options are sorted alphabetically
| |
− | (discarding the <em>no</em> prefix).</p>
| |
− | <blockquote><b>Note:</b>
| |
− | most mount options apply to the whole filesystem and only options in the
| |
− | first mounted subvolume will take effect. This is due to lack of implementation
| |
− | and may change in the future. This means that (for example) you can’t set
| |
− | per-subvolume <em>nodatacow</em>, <em>nodatasum</em>, or <em>compress</em> using mount options. This
| |
− | should eventually be fixed, but it has proved to be difficult to implement
| |
− | correctly within the Linux VFS framework.</blockquote>
| |
− | <p>Mount options are processed in order; only the last occurrence of an option
| |
− | takes effect, and it may disable other options due to constraints (see e.g.
| |
− | <em>nodatacow</em> and <em>compress</em>). The output of <em>mount</em> command shows which options
| |
− | have been applied.</p>
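<p>The "last occurrence wins" rule can be illustrated with a small Python sketch. It is only a model of the ordering rule described above, not of the kernel parser: the function name and the <tt>BOOLEAN_OPTIONS</tt> set are assumptions made for this example, and cross-option constraints (such as <em>nodatacow</em> disabling compression) are deliberately not modelled.</p>

```python
# Hypothetical helper: models only "last occurrence of an option wins".
# BOOLEAN_OPTIONS is an illustrative subset of on/off options from this page.
BOOLEAN_OPTIONS = {
    "acl", "autodefrag", "barrier", "datacow", "datasum",
    "enospc_debug", "flushoncommit", "treelog",
}


def resolve_mount_options(option_string):
    """Return a dict of option name -> final state for a comma-separated list."""
    effective = {}
    for raw in option_string.split(","):
        raw = raw.strip()
        if not raw:
            continue
        name, _, value = raw.partition("=")
        if value:
            # name=value form: a later value replaces an earlier one
            effective[name] = value
        elif name.startswith("no") and name[2:] in BOOLEAN_OPTIONS:
            # "no" prefix switches the matching boolean option off
            effective[name[2:]] = False
        else:
            effective[name] = True
    return effective


if __name__ == "__main__":
    print(resolve_mount_options("autodefrag,noautodefrag,commit=30,commit=120"))
```

<p>On a real system the applied options should still be read from the output of <em>mount</em>, as noted above.</p>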
| |
− | <dl>
| |
− | <dt>
| |
− | <b>acl</b>
| |
− | <dt>
| |
− | <b>noacl</b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: on)
| |
− | </p>
| |
− | <p>Enable/disable support for Posix Access Control Lists (ACLs). See the
| |
− | [http://man7.org/linux/man-pages/man5/acl.5.html acl(5)] manual page for more information about ACLs.</p>
| |
− | <p>The support for ACL is build-time configurable (BTRFS_FS_POSIX_ACL) and
| |
− | mount fails if <em>acl</em> is requested but the feature is not compiled in.</p>
| |
− |
| |
− | <dt>
| |
− | <b>autodefrag</b>
| |
− | <dt>
| |
− | <b>noautodefrag</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 3.0, default: off)
| |
− | </p>
| |
− | <p>Enable automatic file defragmentation.
| |
− | When enabled, small random writes into files (in a range of tens of kilobytes,
| |
− | currently it’s 64K) are detected and queued up for the defragmentation process.
| |
− | Not well suited for large database workloads.</p>
| |
− | <p>The read latency may increase due to reading the adjacent blocks that make up the
| |
− | range for defragmentation; a successive write will merge the blocks in the new
| |
− | location.</p>
| |
− | <blockquote><b>Warning:</b>
| |
− | Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2 as
| |
− | well as with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12 or
| |
− | ≥ 3.13.4 will break up the reflinks of COW data (for example files
| |
− | copied with <tt>cp --reflink</tt>, snapshots or de-duplicated data).
| |
− | This may cause considerable increase of space usage depending on the
| |
− | broken up reflinks.</blockquote>
| |
− |
| |
− | <dt>
| |
− | <b>barrier</b>
| |
− | <dt>
| |
− | <b>nobarrier</b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: on)
| |
− | </p>
| |
− | <p>Ensure that all IO write operations make it through the device cache and are stored
| |
− | permanently when the filesystem is at its consistency checkpoint. This
| |
− | typically means that a flush command is sent to the device that will
| |
− | synchronize all pending data and ordinary metadata blocks, then writes the
| |
− | superblock and issues another flush.</p>
| |
− | <p>The write flushes incur a slight performance hit and also prevent the IO block
| |
− | scheduler from reordering requests in a more effective way. Disabling barriers gets
| |
− | rid of that penalty but will most certainly lead to a corrupted filesystem in
| |
− | case of a crash or power loss. The ordinary metadata blocks could be yet
| |
− | unwritten at the time the new superblock is stored permanently, expecting that
| |
− | the block pointers to metadata were stored permanently before.</p>
| |
− | <p>On a device with a volatile battery-backed write-back cache, the <em>nobarrier</em>
| |
− | option will not lead to filesystem corruption as the pending blocks are
| |
− | supposed to make it to the permanent storage.</p>
| |
− |
| |
− | <dt>
| |
− | <b>check_int</b>
| |
− | <dt>
| |
− | <b>check_int_data</b>
| |
− | <dt>
| |
− | <b>check_int_print_mask=<em>value</em></b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 3.0, default: off)
| |
− | </p>
| |
− | <p>These debugging options control the behavior of the integrity checking
| |
− | module (the BTRFS_FS_CHECK_INTEGRITY config option required). The main goal is
| |
− | to verify that all blocks from a given transaction period are properly linked.</p>
| |
− | <p><em>check_int</em> enables the integrity checker module, which examines all
| |
− | block write requests to ensure on-disk consistency, at a large
| |
− | memory and CPU cost.</p>
| |
− | <p><em>check_int_data</em> includes extent data in the integrity checks, and
| |
− | implies the <em>check_int</em> option.</p>
| |
− | <p><em>check_int_print_mask</em> takes a bitmask of BTRFSIC_PRINT_MASK_* values
| |
− | as defined in <em>fs/btrfs/check-integrity.c</em>, to control the integrity
| |
− | checker module behavior.</p>
| |
− | <p>See comments at the top of <em>fs/btrfs/check-integrity.c</em>
| |
− | for more information.</p>
| |
− |
| |
− | <dt>
| |
− | <b>clear_cache</b>
| |
− | <dd>
| |
− | <p>
| |
− | Force clearing and rebuilding of the disk space cache if something
| |
− | has gone wrong. See also: <em>space_cache</em>.
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>commit=<em>seconds</em></b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 3.12, default: 30)
| |
− | </p>
| |
− | <p>Set the interval of periodic transaction commit when data are synchronized
| |
− | to permanent storage. Higher interval values lead to larger amount of unwritten
| |
− | data, which has obvious consequences when the system crashes.
| |
− | The upper bound is not forced, but a warning is printed if it’s more than 300
| |
− | seconds (5 minutes). Use with care.</p>
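<p>As a minimal sketch of the 300-second threshold described above: the function name and the message text are hypothetical, and only the threshold itself comes from this page. The value is accepted either way, since the upper bound is not enforced.</p>

```python
# Threshold taken from the text above: more than 300 seconds triggers a warning.
COMMIT_WARN_THRESHOLD = 300


def commit_interval_warning(seconds):
    """Return a warning string for large commit intervals, else None.

    The wording of the message is made up for this example.
    """
    if seconds > COMMIT_WARN_THRESHOLD:
        return "commit interval of %d seconds exceeds %d seconds" % (
            seconds, COMMIT_WARN_THRESHOLD)
    return None


if __name__ == "__main__":
    for value in (30, 300, 600):
        print(value, commit_interval_warning(value))
```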
| |
− |
| |
− | <dt>
| |
− | <b>compress</b>
| |
− | <dt>
| |
− | <b>compress=<em>type[:level]</em></b>
| |
− | <dt>
| |
− | <b>compress-force</b>
| |
− | <dt>
| |
− | <b>compress-force=<em>type[:level]</em></b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: off, level support since: 5.1)
| |
− | </p>
| |
− | <p>Control BTRFS file data compression. Type may be specified as <em>zlib</em>,
| |
− | <em>lzo</em>, <em>zstd</em> or <em>no</em> (for no compression, used for remounting). If no type
| |
− | is specified, <em>zlib</em> is used. If <em>compress-force</em> is specified,
| |
− | then compression will always be attempted, but the data may end up uncompressed
| |
− | if the compression would make them larger.</p>
| |
− | <p>Both <em>zlib</em> and <em>zstd</em> (since version 5.1) expose the compression level as a
| |
− | tunable knob with higher levels trading speed and memory (<em>zstd</em>) for higher
| |
− | compression ratios. This can be set by appending a colon and the desired level.
| |
− | Zlib accepts the range [1, 9] and zstd accepts [1, 15]. If no level is set,
| |
− | both currently use a default level of 3. The value 0 is an alias for the
| |
− | default level.</p>
| |
− | <p>Otherwise some simple heuristics are applied to detect an incompressible file.
| |
− | If the first blocks written to a file are not compressible, the whole file is
| |
− | permanently marked to skip compression. As this is too simple, the
| |
− | <em>compress-force</em> is a workaround that will compress most of the files at the
| |
− | cost of some wasted CPU cycles on failed attempts.
| |
− | Since kernel 4.15, a set of heuristic algorithms have been improved by using
| |
− | frequency sampling, repeated pattern detection and Shannon entropy calculation
| |
− | to avoid that.</p>
| |
− | <blockquote><b>Note:</b>
| |
− | If compression is enabled, <em>nodatacow</em> and <em>nodatasum</em> are disabled.</blockquote>
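<p>The <em>type[:level]</em> syntax and the level rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the kernel's parser: the function name is hypothetical, and rejecting out-of-range levels or a level for <em>lzo</em> is a choice made for this sketch, since the text does not say how such values are handled.</p>

```python
# Level ranges and default level as documented above.
LEVEL_RANGES = {"zlib": (1, 9), "zstd": (1, 15)}
DEFAULT_LEVEL = 3
KNOWN_TYPES = ("zlib", "lzo", "zstd", "no")


def parse_compress(value=""):
    """Parse the value of compress= into (type, level).

    level is None for types without a tunable level (lzo, no).
    """
    if value == "":
        value = "zlib"          # no type given: zlib is used
    ctype, _, level_text = value.partition(":")
    if ctype not in KNOWN_TYPES:
        raise ValueError("unknown compression type: %r" % ctype)
    if ctype not in LEVEL_RANGES:
        if level_text:
            # Sketch-only choice: reject a level where none is documented.
            raise ValueError("%s does not take a level" % ctype)
        return (ctype, None)
    level = int(level_text) if level_text else 0
    if level == 0:
        return (ctype, DEFAULT_LEVEL)   # 0 is an alias for the default
    low, high = LEVEL_RANGES[ctype]
    if not low <= level <= high:
        # Sketch-only choice: reject instead of guessing kernel behaviour.
        raise ValueError("%s level must be in [%d, %d]" % (ctype, low, high))
    return (ctype, level)


if __name__ == "__main__":
    for text in ("", "zstd", "zstd:0", "zlib:9", "lzo"):
        print(repr(text), parse_compress(text))
```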
| |
− |
| |
− | <dt>
| |
− | <b>datacow</b>
| |
− | <dt>
| |
− | <b>nodatacow</b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: on)
| |
− | </p>
| |
− | <p>Enable data copy-on-write for newly created files.
| |
− | <em>Nodatacow</em> implies <em>nodatasum</em>, and disables <em>compression</em>. All files created
| |
− | under <em>nodatacow</em> also have the NOCOW file attribute set (see [http://man7.org/linux/man-pages/man1/chattr.1.html chattr(1)]).</p>
| |
− | <blockquote><b>Note:</b>
| |
− | If <em>nodatacow</em> or <em>nodatasum</em> are enabled, compression is disabled.</blockquote>
| |
− | <p>Updates in-place improve performance for workloads that do frequent overwrites,
| |
− | at the cost of potential partial writes, in case the write is interrupted
| |
− | (system crash, device failure).</p>
| |
− |
| |
− | <dt>
| |
− | <b>datasum</b>
| |
− | <dt>
| |
− | <b>nodatasum</b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: on)
| |
− | </p>
| |
− | <p>Enable data checksumming for newly created files.
| |
− | <em>Datasum</em> implies <em>datacow</em>, ie. the normal mode of operation. All files created
| |
− | under <em>nodatasum</em> inherit the "no checksums" property, however there’s no
| |
− | corresponding file attribute (see [http://man7.org/linux/man-pages/man1/chattr.1.html chattr(1)]).</p>
| |
− | <blockquote><b>Note:</b>
| |
− | If <em>nodatacow</em> or <em>nodatasum</em> are enabled, compression is disabled.</blockquote>
| |
− | <p>There is a slight performance gain when checksums are turned off, because the
| |
− | corresponding metadata blocks holding the checksums do not need to be updated.
| |
− | The cost of checksumming of the blocks in memory is much lower than the IO,
| |
− | modern CPUs feature hardware support of the checksumming algorithm.</p>
| |
− |
| |
− | <dt>
| |
− | <b>degraded</b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: off)
| |
− | </p>
| |
− | <p>Allow mounts with fewer devices than the RAID profile constraints
| |
− | require. A read-write mount (or remount) may fail when there are too many devices
| |
− | missing, for example if a stripe member is completely missing from RAID0.</p>
| |
− | <p>Since 4.14, the constraint checks have been improved and are verified on the
| |
− | chunk level, not at the device level. This allows degraded mounts of
| |
− | filesystems with mixed RAID profiles for data and metadata, even if the
| |
− | device number constraints would not be satisfied for some of the profiles.</p>
| |
− | <p>Example: metadata — raid1, data — single, devices — /dev/sda, /dev/sdb</p>
| |
− | <p>Suppose the data are completely stored on <em>sda</em>, then missing <em>sdb</em> will not
| |
− | prevent the mount, even if 1 missing device would normally prevent (any)
| |
− | <em>single</em> profile to mount. In case some of the data chunks are stored on <em>sdb</em>,
| |
− | then the constraint of single/data is not satisfied and the filesystem
| |
− | cannot be mounted.</p>
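<p>The chunk-level check in the example above can be modelled with a short Python sketch. It is a simplified illustration, not the kernel logic: the function name and data layout are hypothetical, and the per-profile tolerance table is an assumption of this sketch based on the commonly documented number of missing devices each profile survives.</p>

```python
# Assumed tolerances: how many missing devices a chunk of each profile survives.
TOLERATED_MISSING = {
    "single": 0, "dup": 0, "raid0": 0,
    "raid1": 1, "raid10": 1, "raid5": 1, "raid6": 2,
}


def can_mount_degraded(chunks, present_devices):
    """Check every chunk individually against its profile's tolerance.

    chunks: iterable of (profile, devices) pairs, one per chunk.
    present_devices: devices that are available at mount time.
    """
    present = set(present_devices)
    for profile, devices in chunks:
        missing = len(set(devices) - present)
        if missing > TOLERATED_MISSING[profile]:
            return False
    return True


if __name__ == "__main__":
    metadata = ("raid1", ["sda", "sdb"])
    # All data on sda: losing sdb does not prevent the degraded mount.
    print(can_mount_degraded([metadata, ("single", ["sda"])], ["sda"]))
    # One data chunk on sdb: the single/data constraint is violated.
    print(can_mount_degraded(
        [metadata, ("single", ["sda"]), ("single", ["sdb"])], ["sda"]))
```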
| |
− |
| |
− | <dt>
| |
− | <b>device=<em>devicepath</em></b>
| |
− | <dd>
| |
− | <p>
| |
− | Specify a path to a device that will be scanned for BTRFS filesystem during
| |
− | mount. This is usually done automatically by a device manager (like udev) or
| |
− | using the <b>btrfs device scan</b> command (eg. run from the initial ramdisk). In
| |
− | cases where this is not possible the <em>device</em> mount option can help.
| |
− | </p>
| |
− | <blockquote><b>Note:</b>
| |
− | booting eg. a RAID1 system may fail even if all filesystem’s <em>device</em>
| |
− | paths are provided as the actual device nodes may not be discovered by the
| |
− | system at that point.</blockquote>
| |
− |
| |
− | <dt>
| |
− | <b>discard</b>
| |
− | <dt>
| |
− | <b>discard=sync</b>
| |
− | <dt>
| |
− | <b>discard=async</b>
| |
− | <dt>
| |
− | <b>nodiscard</b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: off, async support since: 5.6)
| |
− | </p>
| |
− | <p>Enable discarding of freed file blocks. This is useful for SSD devices, thinly
| |
− | provisioned LUNs, or virtual machine images; however, every storage layer must
| |
− | support discard for it to work.</p>
| |
− | <p>In the synchronous mode (<em>sync</em> or without option value), lack of asynchronous
| |
− | queued TRIM on the backing device can severely degrade performance,
| |
− | because a synchronous TRIM operation will be attempted instead. Queued TRIM
| |
− | requires newer than SATA revision 3.1 chipsets and devices.</p>
| |
− | <p>The asynchronous mode (<em>async</em>) gathers extents in larger chunks before sending
| |
− | them to the devices for TRIM. The overhead and performance impact should be
| |
− | negligible compared to the previous mode and it’s supposed to be the preferred
| |
− | mode if needed.</p>
| |
− | <p>If it is not necessary to immediately discard freed blocks, then the <tt>fstrim</tt>
| |
− | tool can be used to discard all free blocks in a batch. Scheduling a TRIM
| |
− | during a period of low system activity will prevent latent interference with
| |
− | the performance of other operations. Also, a device may ignore the TRIM command
| |
− | if the range is too small, so running a batch discard has a greater probability
| |
− | of actually discarding the blocks.</p>
| |
− |
| |
− | <dt>
| |
− | <b>enospc_debug</b>
| |
− | <dt>
| |
− | <b>noenospc_debug</b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: off)
| |
− | </p>
| |
− | <p>Enable verbose output for some ENOSPC conditions. It’s safe to use but can
| |
− | be noisy if the system reaches near-full state.</p>
| |
− |
| |
− | <dt>
| |
− | <b>fatal_errors=<em>action</em></b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 3.4, default: bug)
| |
− | </p>
| |
− | <p>Action to take when encountering a fatal error.</p>
| |
− | <dl>
| |
− | <dt>
| |
− | <b>bug</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>BUG()</em> on a fatal error, the system will stay in the crashed state and may be
| |
− | still partially usable, but reboot is required for full operation
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>panic</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>panic()</em> on a fatal error, depending on other system configuration, this may
| |
− | be followed by a reboot. Please refer to the documentation of kernel boot
| |
− | parameters, eg. <em>panic</em>, <em>oops</em> or <em>crashkernel</em>.
| |
− | </p>
| |
− |
| |
− | </dl>
| |
− |
| |
− | <dt>
| |
− | <b>flushoncommit</b>
| |
− | <dt>
| |
− | <b>noflushoncommit</b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: off)
| |
− | </p>
| |
− | <p>This option forces any data dirtied by a write in a prior transaction to commit
| |
− | as part of the current commit, effectively a full filesystem sync.</p>
| |
− | <p>This makes the committed state a fully consistent view of the file system from
| |
− | the application’s perspective (i.e. it includes all completed file system
| |
− | operations). This was previously the behavior only when a snapshot was
| |
− | created.</p>
| |
− | <p>When off, the filesystem is consistent but buffered writes may last more than
| |
− | one transaction commit.</p>
| |
− |
| |
− | <dt>
| |
− | <b>fragment=<em>type</em></b>
| |
− | <dd>
| |
− | <p>
| |
− | (depends on compile-time option BTRFS_DEBUG, since: 4.4, default: off)
| |
− | </p>
| |
− | <p>A debugging helper to intentionally fragment given <em>type</em> of block groups. The
| |
− | type can be <em>data</em>, <em>metadata</em> or <em>all</em>. This mount option should not be used
| |
− | outside of debugging environments and is not recognized if the kernel config
| |
− | option <em>BTRFS_DEBUG</em> is not enabled.</p>
| |
− |
| |
− | <dt>
| |
− | <b>nologreplay</b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: off, even read-only)
| |
− | </p>
| |
− | <p>The tree-log contains pending updates to the filesystem until the full commit.
| |
− | The log is replayed on next mount, this can be disabled by this option. See
| |
− | also <em>treelog</em>. Note that <em>nologreplay</em> is the same as <em>norecovery</em>.</p>
| |
− | <blockquote><b>Warning:</b>
| |
− | currently, the tree log is replayed even with a read-only mount! To
| |
− | disable that behaviour, mount also with <em>nologreplay</em>.</blockquote>
| |
− |
| |
− | <dt>
| |
− | <b>max_inline=<em>bytes</em></b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: min(2048, page size) )
| |
− | </p>
| |
− | <p>Specify the maximum amount of space that can be inlined in
| |
− | a metadata B-tree leaf. The value is specified in bytes, optionally
| |
− | with a K suffix (case insensitive). In practice, this value
| |
− | is limited by the filesystem block size (named <em>sectorsize</em> at mkfs time),
| |
− | and memory page size of the system. In case of sectorsize limit, there’s
| |
− | some space unavailable due to leaf headers. For example, with a 4k sectorsize, the
| |
− | maximum size of inline data is about 3900 bytes.</p>
| |
− | <p>Inlining can be completely turned off by specifying 0. This will increase data
| |
− | block slack if file sizes are much smaller than block size but will reduce
| |
− | metadata consumption in return.</p>
| |
− | <blockquote><b>Note:</b>
| |
− | the default value has changed to 2048 in kernel 4.6.</blockquote>
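<p>A minimal Python sketch of the default and of the value syntax described above. Both function names are hypothetical, and the additional limit imposed by the sectorsize (including the leaf header overhead) is not modelled here.</p>

```python
def default_max_inline(page_size):
    """Default documented above: min(2048, page size)."""
    return min(2048, page_size)


def parse_max_inline(text):
    """Parse a max_inline value: bytes, optionally with a K suffix.

    The suffix is case insensitive; "0" turns inlining off.
    """
    text = text.strip()
    if text[-1:].lower() == "k":
        return int(text[:-1]) * 1024
    return int(text)


if __name__ == "__main__":
    print(default_max_inline(4096))
    print(parse_max_inline("2k"), parse_max_inline("0"))
```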
| |
− |
| |
− | <dt>
| |
− | <b>metadata_ratio=<em>value</em></b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: 0, internal logic)
| |
− | </p>
| |
− | <p>Specifies that 1 metadata chunk should be allocated after every <em>value</em> data
| |
− | chunks. Default behaviour depends on internal logic, some percent of unused
| |
− | metadata space is attempted to be maintained but is not always possible if
| |
− | there’s not enough space left for chunk allocation. The option could be useful to
| |
− | override the internal logic in favor of the metadata allocation if the expected
| |
− | workload is supposed to be metadata intense (snapshots, reflinks, xattrs,
| |
− | inlined files).</p>
| |
− |
| |
− | <dt>
| |
− | <b>norecovery</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 4.5, default: off)
| |
− | </p>
| |
− | <p>Do not attempt any data recovery at mount time. This disables <em>logreplay</em>
| |
− | and avoids other write operations. Note that this option is the same as
| |
− | <em>nologreplay</em>.</p>
| |
− | <blockquote><b>Note:</b>
| |
− | The opposite option <em>recovery</em> used to have different meaning but was
| |
− | changed for consistency with other filesystems, where <em>norecovery</em> is used for
| |
− | skipping log replay. BTRFS does the same and in general will try to avoid any
| |
− | write operations.</blockquote>
| |
− |
| |
− | <dt>
| |
− | <b>rescan_uuid_tree</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 3.12, default: off)
| |
− | </p>
| |
− | <p>Force check and rebuild procedure of the UUID tree. This should not
| |
− | normally be needed.</p>
| |
− |
| |
− | <dt>
| |
− | <b>rescue</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 5.9)
| |
− | </p>
| |
− | <p>Modes allowing mount with damaged filesystem structures.</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | <em>usebackuproot</em> (since: 5.9, replaces standalone option <em>usebackuproot</em>)
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | <em>nologreplay</em> (since: 5.9, replaces standalone option <em>nologreplay</em>)
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | <em>ignorebadroots</em>, <em>ibadroots</em> (since: 5.11)
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | <em>ignoredatacsums</em>, <em>idatacsums</em> (since: 5.11)
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | <em>all</em> (since: 5.9)
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− |
| |
− | <dt>
| |
− | <b>skip_balance</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 3.3, default: off)
| |
− | </p>
| |
− | <p>Skip automatic resume of an interrupted balance operation. The operation can
| |
− | later be resumed with <b>btrfs balance resume</b>, or the paused state can be
| |
− | removed with <b>btrfs balance cancel</b>. The default behaviour is to resume an
| |
− | interrupted balance immediately after a volume is mounted.</p>
| |
− |
| |
− | <dt>
| |
− | <b>space_cache</b>
| |
− | <dt>
| |
− | <b>space_cache=<em>version</em></b>
| |
− | <dt>
| |
− | <b>nospace_cache</b>
| |
− | <dd>
| |
− | <p>
| |
− | (<em>nospace_cache</em> since: 3.2, <em>space_cache=v1</em> and <em>space_cache=v2</em> since 4.5, default: <em>space_cache=v1</em>)
| |
− | </p>
| |
− | <p>Options to control the free space cache. The free space cache greatly improves
| |
− | performance when reading block group free space into memory. However, managing
| |
− | the space cache consumes some resources, including a small amount of disk
| |
− | space.</p>
| |
− | <p>There are two implementations of the free space cache. The original
| |
− | one, referred to as <em>v1</em>, is the safe default. The <em>v1</em> space cache can be
| |
− | disabled at mount time with <em>nospace_cache</em> without clearing.</p>
| |
− | <p>On very large filesystems (many terabytes) and certain workloads, the
| |
− | performance of the <em>v1</em> space cache may degrade drastically. The <em>v2</em>
| |
− | implementation, which adds a new B-tree called the free space tree, addresses
| |
− | this issue. Once enabled, the <em>v2</em> space cache will always be used and cannot
| |
− | be disabled unless it is cleared. Use <em>clear_cache,space_cache=v1</em> or
| |
− | <em>clear_cache,nospace_cache</em> to do so. If <em>v2</em> is enabled, kernels without <em>v2</em>
| |
− | support will only be able to mount the filesystem in read-only mode.</p>
| |
− | <p>The [[Manpage/btrfs-check|btrfs-check(8)]] and [[Manpage/mkfs.btrfs|mkfs.btrfs(8)]] commands have full <em>v2</em> free space
| |
− | cache support since v4.19.</p>
| |
− | <p>If a version is not explicitly specified, the default implementation will be
| |
− | chosen, which is <em>v1</em>.</p>
| |
− |
| |
− | <dt>
| |
− | <b>ssd</b>
| |
− | <dt>
| |
− | <b>ssd_spread</b>
| |
− | <dt>
| |
− | <b>nossd</b>
| |
− | <dt>
| |
− | <b>nossd_spread</b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: SSD autodetected)
| |
− | </p>
| |
− | <p>Options to control SSD allocation schemes. By default, BTRFS will
| |
− | enable or disable SSD optimizations depending on status of a device with
| |
− | respect to rotational or non-rotational type. This is determined by the
| |
− | contents of <em>/sys/block/DEV/queue/rotational</em>. If it is 0, the <em>ssd</em> option is
| |
− | turned on. The option <em>nossd</em> will disable the autodetection.</p>
| |
− | <p>The optimizations make use of the absence of the seek penalty that’s inherent
| |
− | for the rotational devices. The blocks can be typically written faster and
| |
− | are not offloaded to separate threads.</p>
| |
− | <blockquote><b>Note:</b>
| |
− | Since 4.14, the block layout optimizations have been dropped. This used
| |
− | to help with first generations of SSD devices. Their FTL (flash translation
| |
− | layer) was not effective and the optimization was supposed to improve the wear
| |
− | by better aligning blocks. This is no longer true with modern SSD devices and
| |
− | the optimization had no real benefit. Furthermore it caused increased
| |
− | fragmentation. The layout tuning has been kept intact for the option
| |
− | <em>ssd_spread</em>.</blockquote>
| |
− | <p>The <em>ssd_spread</em> mount option attempts to allocate into bigger and aligned
| |
− | chunks of unused space, and may perform better on low-end SSDs. <em>ssd_spread</em>
| |
− | implies <em>ssd</em>, enabling all other SSD heuristics as well. The option <em>nossd</em>
| |
− | will disable all SSD options while <em>nossd_spread</em> only disables <em>ssd_spread</em>.</p>
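<p>The autodetection input can be inspected from user space. The Python sketch below reads the same sysfs attribute named above; the function name and the <tt>sys_block</tt> parameter (added so the path can be redirected for testing) are assumptions of this example, and the result only mirrors the documented rule rather than querying the filesystem itself.</p>

```python
from pathlib import Path


def ssd_autodetected(dev, sys_block="/sys/block"):
    """Return True if the rotational attribute of the device reads 0.

    dev is a block device name such as "sda" (without /dev/).
    """
    path = Path(sys_block) / dev / "queue" / "rotational"
    return path.read_text().strip() == "0"


if __name__ == "__main__":
    import sys
    # Usage: python3 script.py sda
    for name in sys.argv[1:]:
        print(name, ssd_autodetected(name))
```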
| |
− |
| |
− | <dt>
| |
− | <b>subvol=<em>path</em></b>
| |
− | <dd>
| |
− | <p>
| |
− | Mount subvolume from <em>path</em> rather than the toplevel subvolume. The
| |
− | <em>path</em> is always treated as relative to the toplevel subvolume.
| |
− | This mount option overrides the default subvolume set for the given filesystem.
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>subvolid=<em>subvolid</em></b>
| |
− | <dd>
| |
− | <p>
| |
− | Mount subvolume specified by a <em>subvolid</em> number rather than the toplevel
| |
− | subvolume. You can use <b>btrfs subvolume list</b> or <b>btrfs subvolume show</b> to see
| |
− | subvolume ID numbers.
| |
− | This mount option overrides the default subvolume set for the given filesystem.
| |
− | </p>
| |
− | <blockquote><b>Note:</b>
| |
− | if both <em>subvolid</em> and <em>subvol</em> are specified, they must point at the
| |
− | same subvolume, otherwise the mount will fail.</blockquote>
| |
− |
| |
− | <dt>
| |
− | <b>thread_pool=<em>number</em></b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: min(NRCPUS + 2, 8) )
| |
− | </p>
| |
− | <p>The number of worker threads to start. NRCPUS is number of on-line CPUs
| |
− | detected at the time of mount. A small number leads to less parallelism in
| |
− | processing data and metadata, higher numbers could lead to a performance hit
| |
− | due to increased locking contention, process scheduling, cache-line bouncing or
| |
− | costly data transfers between local CPU memories.</p>
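<p>The default formula above is easy to evaluate directly. In this Python sketch the function name is hypothetical, and <tt>os.cpu_count()</tt> is used only as an approximation of the number of on-line CPUs the kernel would see at mount time.</p>

```python
import os


def default_thread_pool(online_cpus=None):
    """Default documented above: min(NRCPUS + 2, 8)."""
    if online_cpus is None:
        # Approximation: the CPU count reported to Python, not the kernel's
        # own count of on-line CPUs at mount time.
        online_cpus = os.cpu_count() or 1
    return min(online_cpus + 2, 8)


if __name__ == "__main__":
    for cpus in (1, 2, 4, 16):
        print(cpus, default_thread_pool(cpus))
```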
| |
− |
| |
− | <dt>
| |
− | <b>treelog</b>
| |
− | <dt>
| |
− | <b>notreelog</b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: on)
| |
− | </p>
| |
− | <p>Enable the tree logging used for <em>fsync</em> and <em>O_SYNC</em> writes. The tree log
| |
− | stores changes without the need of a full filesystem sync. The log operations
| |
− | are flushed at sync and transaction commit. If the system crashes between two
| |
− | such syncs, the pending tree log operations are replayed during mount.</p>
| |
− | <blockquote><b>Warning:</b>
| |
− | currently, the tree log is replayed even with a read-only mount! To
| |
− | disable that behaviour, also mount with <em>nologreplay</em>.</blockquote>
| |
− | <p>The tree log could contain new files/directories, these would not exist on
| |
− | a mounted filesystem if the log is not replayed.</p>
| |
− |
| |
− | <dt>
| |
− | <b>usebackuproot</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 4.6, default: off)
| |
− | </p>
| |
− | <p>Enable autorecovery attempts if a bad tree root is found at mount time.
| |
− | Currently this scans a backup list of several previous tree roots and tries to
| |
− | use the first readable. This can be used with read-only mounts as well.</p>
| |
− | <blockquote><b>Note:</b>
| |
− | This option has replaced <em>recovery</em>.</blockquote>
| |
− |
| |
− | <dt>
| |
− | <b>user_subvol_rm_allowed</b>
| |
− | <dd>
| |
− | <p>
| |
− | (default: off)
| |
− | </p>
| |
− | <p>Allow subvolumes to be deleted by their respective owner. Otherwise, only the
| |
− | root user can do that.</p>
| |
− | <blockquote><b>Note:</b>
| |
− | historically, any user could create a snapshot even if they were not the owner
| |
− | of the source subvolume, the subvolume deletion has been restricted for that
| |
− | reason. The subvolume creation has been restricted but this mount option is
| |
− | still required. This is a usability issue.
| |
− | Since 4.18, the [http://man7.org/linux/man-pages/man2/rmdir.2.html rmdir(2)] syscall can delete an empty subvolume just like an
| |
− | ordinary directory. Whether this is possible can be detected at runtime, see
| |
− | <em>rmdir_subvol</em> feature in <em>FILESYSTEM FEATURES</em>.</blockquote>
| |
− |
| |
− | </dl>
| |
− | ===DEPRECATED MOUNT OPTIONS===
| |
− |
| |
− | <p>List of mount options that have been removed, kept for backward compatibility.</p>
| |
− | <dl>
| |
− | <dt>
| |
− | <b>recovery</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 3.2, default: off, deprecated since: 4.5)
| |
− | </p>
| |
− | <blockquote><b>Note:</b>
| |
− | this option has been replaced by <em>usebackuproot</em> and should not be used
| |
− | but will work on 4.5+ kernels.</blockquote>
| |
− |
| |
− | <dt>
| |
− | <b>inode_cache</b>
| |
− | <dt>
| |
− | <b>noinode_cache</b>
| |
− | <dd>
| |
− | <p>
| |
− | (removed in: 5.11, since: 3.0, default: off)
| |
− | </p>
| |
− | <blockquote><b>Note:</b>
| |
− | the functionality has been removed in 5.11, any stale data created by
| |
− | previous use of the <em>inode_cache</em> option can be removed by <b>btrfs check
| |
− | --clear-ino-cache</b>.</blockquote>
| |
− |
| |
− | </dl>
| |
− | ===NOTES ON GENERIC MOUNT OPTIONS===
| |
− |
| |
− | <p>Some of the general mount options from [http://man7.org/linux/man-pages/man8/mount.8.html mount(8)] that affect BTRFS and are
| |
− | worth mentioning.</p>
| |
− | <dl>
| |
− | <dt>
| |
− | <b>noatime</b>
| |
− | <dd>
| |
− | <p>
| |
− | under read intensive work-loads, specifying <em>noatime</em> significantly improves
| |
− | performance because no new access time information needs to be written. Without
| |
− | this option, the default is <em>relatime</em>, which only reduces the number of
| |
− | inode atime updates in comparison to the traditional <em>strictatime</em>. The worst
| |
− | case for atime updates under <em>relatime</em> occurs when many files are read whose
| |
− | atime is older than 24 h and which are freshly snapshotted. In that case the
| |
− | atime is updated <em>and</em> COW happens - for each file - in bulk. See also
| |
− | https://lwn.net/Articles/499293/ - <em>Atime and btrfs: a bad combination? (LWN, 2012-05-31)</em>.
| |
− | </p>
| |
− | <p>Note that <em>noatime</em> may break applications that rely on atime updates like
| |
− | the venerable Mutt (unless you use maildir mailboxes).</p>
| |
− |
| |
− | </dl>
| |
− | ==FILESYSTEM FEATURES==
| |
− |
| |
− | <p>The basic set of filesystem features gets extended over time. The backward
| |
− | compatibility is maintained; the features are optional and must be
| |
− | explicitly requested, so accidental use will not create incompatibilities.</p>
| |
− | <p>There are several classes and the respective tools to manage the features:</p>
| |
− | <dl>
| |
− | <dt>
| |
− | at mkfs time only
| |
− | <dd>
| |
− | <p>
| |
− | This is namely for core structures, like the b-tree nodesize or checksum
| |
− | algorithm, see [[Manpage/mkfs.btrfs|mkfs.btrfs(8)]] for more details.
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | after mkfs, on an unmounted filesystem
| |
− | <dd>
| |
− | <p>
| |
− | Features that may optimize internal structures or add new structures to support
| |
− | new functionality, see [[Manpage/btrfstune|btrfstune(8)]]. The command <b>btrfs inspect-internal
| |
− | dump-super device</b> will dump a superblock, you can map the value of
| |
− | <em>incompat_flags</em> to the features listed below
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | after mkfs, on a mounted filesystem
| |
− | <dd>
| |
− | <p>
| |
− | The features of a filesystem (with a given UUID) are listed in
| |
− | <tt>/sys/fs/btrfs/UUID/features/</tt>, one file per feature. The status is stored
| |
− | inside the file. The value <em>1</em> is for enabled and active, while <em>0</em> means the
| |
− | feature was enabled at mount time but turned off afterwards.
| |
− | </p>
| |
− | <p>Whether a particular feature can be enabled on a mounted filesystem can be found
| |
− | in the directory <tt>/sys/fs/btrfs/features/</tt>, one file per feature. The value <em>1</em>
| |
− | means the feature can be enabled.</p>
| |
− |
| |
− | </dl>
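<p>A minimal sketch of reading the sysfs feature files described above. The function name is hypothetical, and the directory is passed as an argument so the same helper works for <tt>/sys/fs/btrfs/features/</tt> and for a per-filesystem <tt>/sys/fs/btrfs/UUID/features/</tt> directory.</p>

```shell
# Hypothetical helper: print "<feature> <value>" for each file in a
# btrfs sysfs features directory (defaults to the global one).
list_btrfs_features() {
    dir=${1:-/sys/fs/btrfs/features}
    for f in "$dir"/*; do
        # skip anything that is not a regular file (or an unmatched glob)
        [ -f "$f" ] || continue
        printf '%s %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}
```

<p>Passing the directory explicitly keeps the helper usable on systems where several btrfs filesystems are mounted.</p>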
| |
− | <p>List of features (see also [[Manpage/mkfs.btrfs|mkfs.btrfs(8)]] section <em>FILESYSTEM FEATURES</em>):</p>
| |
− | <dl>
| |
− | <dt>
| |
− | <b>big_metadata</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 3.4)
| |
− | </p>
| |
− | <p>the filesystem uses <em>nodesize</em> for metadata blocks, this can be bigger than the
| |
− | page size</p>
| |
− |
| |
− | <dt>
| |
− | <b>compress_lzo</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 2.6.38)
| |
− | </p>
| |
− | <p>the <em>lzo</em> compression has been used on the filesystem, either as a mount option
| |
− | or via <b>btrfs filesystem defrag</b>.</p>
| |
− |
| |
− | <dt>
| |
− | <b>compress_zstd</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 4.14)
| |
− | </p>
| |
− | <p>the <em>zstd</em> compression has been used on the filesystem, either as a mount option
| |
− | or via <b>btrfs filesystem defrag</b>.</p>
| |
− |
| |
− | <dt>
| |
− | <b>default_subvol</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 2.6.34)
| |
− | </p>
| |
− | <p>the default subvolume has been set on the filesystem</p>
| |
− |
| |
− | <dt>
| |
− | <b>extended_iref</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 3.7)
| |
− | </p>
| |
− | <p>increased hardlink limit per file in a directory to 65536, older kernels
| |
− | supported a varying number of hardlinks depending on the sum of all file name
| |
− | sizes that can be stored into one metadata block</p>
| |
− |
| |
− | <dt>
| |
− | <b>free_space_tree</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 4.5)
| |
− | </p>
| |
− | <p>free space representation using a dedicated b-tree, successor of v1 space cache</p>
| |
− |
| |
− | <dt>
| |
− | <b>metadata_uuid</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 5.0)
| |
− | </p>
| |
− | <p>the main filesystem UUID is the metadata_uuid, which stores the new UUID only
| |
− | in the superblock while all metadata blocks still have the UUID set at mkfs
| |
− | time, see [[Manpage/btrfstune|btrfstune(8)]] for more</p>
| |
− |
| |
− | <dt>
| |
− | <b>mixed_backref</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 2.6.31)
| |
− | </p>
| |
− | <p>the last major disk format change, improved backreferences, now default</p>
| |
− |
| |
− | <dt>
| |
− | <b>mixed_groups</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 2.6.37)
| |
− | </p>
| |
− | <p>mixed data and metadata block groups, ie. the data and metadata are not
| |
− | separated and occupy the same block groups, this mode is suitable for small
| |
− | volumes as there are no constraints how the remaining space should be used
| |
− | (compared to the split mode, where empty metadata space cannot be used for data
| |
− | and vice versa)</p>
| |
− | <p>on the other hand, the final layout is quite unpredictable and possibly highly
| |
− | fragmented, which means worse performance</p>
| |
− |
| |
− | <dt>
| |
− | <b>no_holes</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 3.14)
| |
− | </p>
| |
− | <p>improved representation of file extents where holes are not explicitly
| |
− | stored as an extent, saves a few percent of metadata if sparse files are used</p>
| |
− |
| |
− | <dt>
| |
− | <b>raid1c34</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 5.5)
| |
− | </p>
| |
− | <p>extended RAID1 mode with copies on 3 or 4 devices respectively</p>
| |
− |
| |
− | <dt>
| |
− | <b>raid56</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 3.9)
| |
− | </p>
| |
− | <p>the filesystem contains or contained a raid56 profile of block groups</p>
| |
− |
| |
− | <dt>
| |
− | <b>rmdir_subvol</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 4.18)
| |
− | </p>
| |
− | <p>indicates that the [http://man7.org/linux/man-pages/man2/rmdir.2.html rmdir(2)] syscall can delete an empty subvolume just like an
| |
− | ordinary directory. Note that this feature only depends on the kernel version.</p>
| |
− |
| |
− | <dt>
| |
− | <b>skinny_metadata</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 3.10)
| |
− | </p>
| |
− | <p>reduced-size metadata for extent references, saves a few percent of metadata</p>
| |
− |
| |
− | <dt>
| |
− | <b>send_stream_version</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 5.10)
| |
− | </p>
| |
− | <p>number of the highest supported send stream version</p>
| |
− |
| |
− | <dt>
| |
− | <b>supported_checksums</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 5.5)
| |
− | </p>
| |
− | <p>list of checksum algorithms supported by the kernel module, the respective
| |
− | modules or built-in implementing the algorithms need to be present to mount
| |
− | the filesystem, see <em>CHECKSUM ALGORITHMS</em></p>
| |
− |
| |
− | <dt>
| |
− | <b>supported_sectorsizes</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 5.13)
| |
− | </p>
| |
− | <p>list of values that are accepted as sector sizes (<b>mkfs.btrfs --sectorsize</b>) by
| |
− | the running kernel</p>
| |
− |
| |
− | <dt>
| |
− | <b>supported_rescue_options</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 5.11)
| |
− | </p>
| |
− | <p>list of values for the mount option <em>rescue</em> that are supported by the running
| |
− | kernel, see [[Manpage/btrfs|btrfs(5)]]</p>
| |
− |
| |
− | <dt>
| |
− | <b>zoned</b>
| |
− | <dd>
| |
− | <p>
| |
− | (since: 5.12)
| |
− | </p>
| |
− | <p>zoned mode is allocation/write friendly to host-managed zoned devices,
| |
− | allocation space is partitioned into fixed-size zones that must be updated
| |
− | sequentially, see <em>ZONED MODE</em></p>
| |
− |
| |
− | </dl>
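<p>As a rough sketch of mapping an <em>incompat_flags</em> value to the feature names above: the bit positions below are an assumption taken from the kernel's on-disk format definitions, not from this page, and the function name is hypothetical. Recent versions of <b>btrfs inspect-internal dump-super</b> already print the decoded names, so this only illustrates the mapping.</p>

```shell
# Hypothetical helper: decode an incompat_flags value (decimal or 0x hex).
# Assumed bit order, lowest bit first; entries such as free_space_tree
# are not incompat bits and are therefore not listed here.
decode_incompat() {
    flags=$(( $1 ))
    bit=0
    out=""
    for name in mixed_backref default_subvol mixed_groups compress_lzo \
                compress_zstd big_metadata extended_iref raid56 \
                skinny_metadata no_holes metadata_uuid raid1c34 zoned; do
        if [ $(( (flags >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out }$name"
        fi
        bit=$(( bit + 1 ))
    done
    printf '%s\n' "$out"
}
```

<p>For example, <tt>decode_incompat 0x161</tt> decodes the bits 0, 5, 6 and 8.</p>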
| |
− | ===SWAPFILE SUPPORT===
| |
− |
| |
− | <p>The swapfile is supported since kernel 5.0. Use [http://man7.org/linux/man-pages/man8/swapon.8.html swapon(8)] to activate the
| |
− | swapfile. There are some limitations of the implementation in btrfs and linux
| |
− | swap subsystem:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | filesystem - must be only single device
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | filesystem - must have only <em>single</em> data profile
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | swapfile - the containing subvolume cannot be snapshotted
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | swapfile - must be preallocated
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | swapfile - must be nodatacow (ie. also nodatasum)
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | swapfile - must not be compressed
| |
− | </p>
| |
− | </li>
| |
− | </ul>
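<p>The attribute-related limitations in this list (nodatacow set, no compression) can be checked from the attribute string printed by <tt>lsattr</tt>. The helper below is a hypothetical sketch; it does not check preallocation, the data profile or the number of devices.</p>

```shell
# Hypothetical helper: takes the attribute string printed by lsattr
# (the first column) and checks the swapfile-relevant bits.
swapfile_flags_ok() {
    case "$1" in
        *c*) return 1 ;;   # 'c' (compress) is set: not usable for swap
        *C*) return 0 ;;   # 'C' (nodatacow) is set and 'c' is not
        *)   return 1 ;;   # nodatacow is missing
    esac
}
```

<p>A possible invocation is <tt>swapfile_flags_ok "$(lsattr swapfile | cut -d' ' -f1)"</tt>.</p>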
| |
− | <p>The limitations come mainly from the COW-based design and mapping layer of
| |
− | blocks that allows the advanced features like relocation and multi-device
| |
− | filesystems. However, the swap subsystem expects simpler mapping and no
| |
− | background changes of the file blocks once they’ve been attached to swap.</p>
| |
− | <p>With active swapfiles, the following whole-filesystem operations will skip
| |
− | swapfile extents or may fail:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | balance - block groups with swapfile extents are skipped and reported, the rest will be processed normally
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | resize grow - unaffected
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | resize shrink - works as long as the extents are outside of the shrunk range
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | device add - a new device does not interfere with existing swapfile and this operation will work, though no new swapfile can be activated afterwards
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | device delete - if the device has been added as above, it can be also deleted
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | device replace - ditto
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | <p>When there are no active swapfiles and a whole-filesystem exclusive operation
| |
− | is running (ie. balance, device delete, shrink), the swapfiles cannot be
| |
− | temporarily activated. The operation must finish first.</p>
| |
− | <p>To create and activate a swapfile run the following commands:</p>
| |
− | <pre># truncate -s 0 swapfile
| |
− | # chattr +C swapfile
| |
− | # fallocate -l 2G swapfile
| |
− | # chmod 0600 swapfile
| |
− | # mkswap swapfile
| |
− | # swapon swapfile</pre>
| |
− | <p>Please note that the UUID returned by the <em>mkswap</em> utility identifies the swap
| |
− | "filesystem" and because it’s stored in a file, it’s not generally visible and
| |
− | usable as an identifier unlike if it was on a block device.</p>
| |
− | <p>The file will appear in <em>/proc/swaps</em>:</p>
| |
− | <pre># cat /proc/swaps
| |
− | Filename Type Size Used Priority
| |
− | /path/swapfile file 2097152 0 -2</pre>
| |
− | <p>The swapfile can be created as a one-time operation or, once properly created,
| |
− | activated on each boot by the <em>swapon -a</em> command (usually started by the
| |
− | service manager). Add the following entry to <em>/etc/fstab</em>, assuming the
| |
− | filesystem that provides the <em>/path</em> has been already mounted at this point.
| |
− | Additional mount options relevant for the swapfile can be set too (like
| |
− | priority, not the btrfs mount options).</p>
| |
− | <pre>/path/swapfile none swap defaults 0 0</pre>
| |
− | ==CHECKSUM ALGORITHMS==
| |
− |
| |
− | <p>There are several checksum algorithms supported. The default and backward
| |
− | compatible is <em>crc32c</em>. Since kernel 5.5 there are three more with different
| |
− | characteristics and trade-offs regarding speed and strength. The following
| |
− | list may help you to decide which one to select.</p>
| |
− | <dl>
| |
− | <dt>
| |
− | <b>CRC32C</b> (32bit digest)
| |
− | <dd>
| |
− | <p>
| |
− | default, best backward compatibility, very fast, modern CPUs have
| |
− | instruction-level support, not collision-resistant but still good error
| |
− | detection capabilities
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>XXHASH</b> (64bit digest)
| |
− | <dd>
| |
− | <p>
| |
− | can be used as CRC32C successor, very fast, optimized for modern CPUs utilizing
| |
− | instruction pipelining, good collision resistance and error detection
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>SHA256</b> (256bit digest)
| |
− | <dd>
| |
− | <p>
| |
− | a cryptographic-strength hash, relatively slow but with possible CPU
| |
− | instruction acceleration or specialized hardware cards, FIPS certified and
| |
− | in wide use
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>BLAKE2b</b> (256bit digest)
| |
− | <dd>
| |
− | <p>
| |
− | a cryptographic-strength hash, relatively fast with possible CPU acceleration
| |
− | using SIMD extensions, not standardized but based on BLAKE which was a SHA3
| |
− | finalist, in wide use, the algorithm used is BLAKE2b-256 that’s optimized for
| |
− | 64bit platforms
| |
− | </p>
| |
− |
| |
− | </dl>
| |
− | <p>The <em>digest size</em> affects overall size of data block checksums stored in the
| |
− | filesystem. The metadata blocks have a fixed area up to 256bits (32 bytes), so
| |
− | there’s no increase. Each data block has a separate checksum stored, with
| |
− | additional overhead of the b-tree leaves.</p>
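<p>To get a feeling for how the digest size affects the space taken by data checksums, the raw checksum payload can be estimated as one digest per data block. This is a back-of-the-envelope sketch: the function name is hypothetical, a 4KiB data block is assumed by default, and the b-tree leaf overhead mentioned above is ignored.</p>

```shell
# Hypothetical helper: raw checksum bytes needed for a given amount of data.
# Arguments: data size in bytes, digest size in bytes, block size (default 4096).
csum_bytes() {
    data_bytes=$1 digest_bytes=$2 block=${3:-4096}
    # one digest per (possibly partial) data block
    echo $(( (data_bytes + block - 1) / block * digest_bytes ))
}
```

<p>With the digest sizes listed above (4 bytes for CRC32C, 8 for XXHASH, 32 for SHA256 and BLAKE2b), 1TiB of data needs roughly 1GiB, 2GiB and 8GiB of checksum payload respectively under these assumptions.</p>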
| |
− | <p>Approximate relative performance of the algorithms, measured against CRC32C
| |
− | using reference software implementations on a 3.5GHz Intel CPU:</p>
| |
− | <div>
| |
− | <table rules="all"
| |
− | width="50%"
| |
− | frame="border"
| |
− | cellspacing="0" cellpadding="4">
| |
− | <tbody>
| |
− | <tr>
| |
− | <td align="center" width="25%" valign="top"><p><strong>Digest</strong></p></td>
| |
− | <td align="right" width="25%" valign="top"><p><strong>Cycles/4KiB</strong></p></td>
| |
− | <td align="right" width="25%" valign="top"><p><strong>Ratio</strong></p></td>
| |
− | <td align="right" width="25%" valign="top"><p><strong>Implementation</strong></p></td>
| |
− | </tr>
| |
− | <tr>
| |
− | <td align="center" width="25%" valign="top"><p>CRC32C</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>1700</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>1.00</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>CPU instruction</p></td>
| |
− | </tr>
| |
− | <tr>
| |
− | <td align="center" width="25%" valign="top"><p>XXHASH</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>2500</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>1.44</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>reference impl.</p></td>
| |
− | </tr>
| |
− | <tr>
| |
− | <td align="center" width="25%" valign="top"><p>SHA256</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>105000</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>61</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>reference impl.</p></td>
| |
− | </tr>
| |
− | <tr>
| |
− | <td align="center" width="25%" valign="top"><p>SHA256</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>36000</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>21</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>libgcrypt/AVX2</p></td>
| |
− | </tr>
| |
− | <tr>
| |
− | <td align="center" width="25%" valign="top"><p>SHA256</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>63000</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>37</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>libsodium/AVX2</p></td>
| |
− | </tr>
| |
− | <tr>
| |
− | <td align="center" width="25%" valign="top"><p>BLAKE2b</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>22000</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>13</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>reference impl.</p></td>
| |
− | </tr>
| |
− | <tr>
| |
− | <td align="center" width="25%" valign="top"><p>BLAKE2b</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>19000</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>11</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>libgcrypt/AVX2</p></td>
| |
− | </tr>
| |
− | <tr>
| |
− | <td align="center" width="25%" valign="top"><p>BLAKE2b</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>19000</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>11</p></td>
| |
− | <td align="right" width="25%" valign="top"><p>libsodium/AVX2</p></td>
| |
− | </tr>
| |
− | </tbody>
| |
− | </table>
| |
− | </div>
| |
− | <p>Many kernels are configured with SHA256 as built-in and not as a module.
| |
− | The accelerated versions are however provided by the modules and must be loaded
| |
− | explicitly (<b>modprobe sha256</b>) before mounting the filesystem to make use of
| |
− | them. You can check in <em>/sys/fs/btrfs/FSID/checksum</em> which one is used. If you
| |
− | see <em>sha256-generic</em>, then you may want to unmount and mount the filesystem
| |
− | again; changing that on a mounted filesystem is not possible.
| |
− | Check the file <em>/proc/crypto</em>: when the implementation is built-in, you’d find</p>
| |
− | <pre>name : sha256
| |
− | driver : sha256-generic
| |
− | module : kernel
| |
− | priority : 100
| |
− | ...</pre>
| |
− | <p>while accelerated implementation is e.g.</p>
| |
− | <pre>name : sha256
| |
− | driver : sha256-avx2
| |
− | module : sha256_ssse3
| |
− | priority : 170
| |
− | ...</pre>
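<p>The lookup in <em>/proc/crypto</em> can be scripted. The following sketch lists all registered <em>sha256</em> drivers with the highest priority first; the function name is hypothetical and the parser assumes the <tt>name</tt>, <tt>driver</tt>, <tt>priority</tt> field order shown in the excerpts above.</p>

```shell
# Hypothetical helper: print "<priority> <driver>" for every sha256
# implementation, highest priority first. Reads /proc/crypto by default.
sha256_drivers() {
    awk -F' *: *' '
        $1 == "name"     { name = $2 }
        $1 == "driver"   { driver = $2 }
        $1 == "priority" && name == "sha256" { print $2, driver }
    ' "${1:-/proc/crypto}" | sort -rn
}
```

<p>If only <em>sha256-generic</em> shows up, the accelerated module has probably not been loaded.</p>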
| |
− | ==COMPRESSION==
| |
− |
| |
− | <p>Btrfs supports transparent file compression. There are three algorithms
| |
− | available: ZLIB, LZO and ZSTD (since v4.14). Basically, compression is on a file
| |
− | by file basis. You can have a single btrfs mount point that has some files that
| |
− | are uncompressed, some that are compressed with LZO, some with ZLIB, for
| |
− | instance (though you may not want it that way, it is supported).</p>
| |
− | <p>To enable compression, mount the filesystem with options <em>compress</em> or
| |
− | <em>compress-force</em>. Please refer to section <em>MOUNT OPTIONS</em>. Once compression is
| |
− | enabled, all new writes will be subject to compression. Some files may not
| |
− | compress very well, and these are typically not recompressed but still written
| |
− | uncompressed.</p>
| |
− | <p>Each compression algorithm has different speed/ratio trade-offs. The levels
| |
− | can be selected by a mount option and affect only the resulting size (ie.
| |
− | no compatibility issues).</p>
| |
− | <p>Basic characteristics:</p>
| |
− | <table cellpadding="4">
| |
− | <tr valign="top">
| |
− | <td>
| |
− | ZLIB
| |
− | <br>
| |
− | </td>
| |
− | <td>
| |
− | <p>
| |
− | slower, higher compression ratio
| |
− | </p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | levels: 1 to 9, mapped directly, default level is 3
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | good backward compatibility
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | </td>
| |
− | </tr>
| |
− | <tr valign="top">
| |
− | <td>
| |
− | LZO
| |
− | <br>
| |
− | </td>
| |
− | <td>
| |
− | <p>
| |
− | faster compression and decompression than zlib, worse compression ratio, designed to be fast
| |
− | </p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | no levels
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | good backward compatibility
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | </td>
| |
− | </tr>
| |
− | <tr valign="top">
| |
− | <td>
| |
− | ZSTD
| |
− | <br>
| |
− | </td>
| |
− | <td>
| |
− | <p>
| |
− | compression comparable to zlib with higher compression/decompression speeds and different ratio
| |
− | </p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | levels: 1 to 15
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | since 4.14, levels since 5.1
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | </td>
| |
− | </tr>
| |
− | </table>
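<p>The level ranges from the table can be captured in a small validator for an <em>algorithm[:level]</em> string, for instance before building a <em>compress=</em> mount option. This is a hypothetical sketch that mirrors only the table above, not the kernel's own option parser.</p>

```shell
# Hypothetical helper: accept "zlib", "zlib:1".."zlib:9", "lzo",
# "zstd", "zstd:1".."zstd:15"; reject everything else.
valid_compress_spec() {
    alg=${1%%:*}
    level=""
    case "$1" in *:*) level=${1#*:} ;; esac
    case "$alg" in
        zlib) max=9 ;;
        zstd) max=15 ;;
        lzo)  [ -z "$level" ]; return ;;   # lzo has no levels
        *)    return 1 ;;
    esac
    # no level given: the default level applies
    [ -z "$level" ] && return 0
    # the level must be a plain decimal number within the range
    case "$level" in *[!0-9]*) return 1 ;; esac
    [ "$level" -ge 1 ] && [ "$level" -le "$max" ]
}
```

<p>A validated value could then be used as <tt>mount -o compress=zstd:3 ...</tt>.</p>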
| |
− | <p>The differences depend on the actual data set and cannot be expressed by a
| |
− | single number or recommendation. Higher levels consume more CPU time and may
| |
− | not bring a significant improvement, lower levels are close to real time.</p>
| |
− | <p>The algorithms could be mixed in one file as they’re stored per extent. The
| |
− | compression can be changed on a file by <b>btrfs filesystem defrag</b> command,
| |
− | using the <em>-c</em> option, or by <b>btrfs property set</b> using the <em>compression</em>
| |
− | property. Setting compression by <em>chattr +c</em> utility will set it to zlib.</p>
| |
− | ===INCOMPRESSIBLE DATA===
| |
− |
| |
− | <p>Files with already compressed data, or with data that won’t compress well
| |
− | within the CPU and memory constraints of the kernel implementations, are handled
| |
− | by a simple decision logic. If the first portion of data being compressed is not
| |
− | smaller than the original, the compression of the file is disabled, unless the
| |
− | filesystem is mounted with <em>compress-force</em>. In that case compression will
| |
− | always be attempted on the file, only for the result to be discarded later. This
| |
− | is not optimal and is subject to optimizations and further development.</p>
| |
− | <p>If a file is identified as incompressible, a flag is set (NOCOMPRESS) and it’s
| |
− | sticky. On that file compression won’t be performed unless forced. The flag
| |
− | can be also set by <em>chattr +m</em> (since e2fsprogs 1.46.2) or by properties with
| |
− | value <em>no</em> or <em>none</em>. Empty value will reset it to the default that’s currently
| |
− | applicable on the mounted filesystem.</p>
| |
− | <p>There are two ways to detect incompressible data:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | actual compression attempt - data are compressed, if the result is not smaller,
| |
− | it’s discarded, so this depends on the algorithm and level
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | pre-compression heuristics - a quick statistical evaluation on the data is
| |
− | performed and based on the result either compression is performed or skipped,
| |
− | the NOCOMPRESS bit is not set just by the heuristic, only if the compression
| |
− | algorithm does not make an improvement
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | ===PRE-COMPRESSION HEURISTICS===
| |
− |
| |
− | <p>The heuristics aim to do a few quick statistical tests on the data to be
| |
− | compressed in order to avoid potentially costly compression that would turn out to be
| |
− | inefficient. Compression algorithms could have internal detection of
| |
− | incompressible data too but this leads to more overhead as the compression is
| |
− | done in another thread and has to write the data anyway. The heuristic is
| |
− | read-only and can utilize cached memory.</p>
| |
− | <p>The tests performed are based on the following: data sampling, long repeated
| |
− | pattern detection, byte frequency, Shannon entropy.</p>
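<p>To illustrate the last two tests, the sketch below computes the byte frequency of a file and from it the Shannon entropy in bits per byte. It is a userspace illustration only: the kernel heuristic works on samples of the data and uses its own approximations, and the function name is hypothetical.</p>

```shell
# Hypothetical helper: Shannon entropy of a file in bits per byte
# (0.00 = a single repeated byte, 8.00 = uniformly distributed bytes).
byte_entropy() {
    od -An -v -tu1 "$1" | awk '
        # count how often each byte value occurs
        { for (i = 1; i <= NF; i++) { count[$i]++; n++ } }
        END {
            for (b in count) {
                p = count[b] / n
                h -= p * log(p) / log(2)
            }
            printf "%.2f\n", h
        }'
}
```

<p>Data with entropy close to 8 bits per byte, such as already compressed or encrypted files, is unlikely to shrink further, which is the case the heuristic tries to detect cheaply.</p>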
| |
− | ===COMPATIBILITY WITH OTHER FEATURES===
| |
− |
| |
− | <p>Compression is done using the COW mechanism so it’s incompatible with
| |
− | <em>nodatacow</em>. Direct IO works on compressed files but will fall back to buffered
| |
− | writes. Currently <em>nodatasum</em> and compression don’t work together.</p>
| |
− | ==FILESYSTEM EXCLUSIVE OPERATIONS==
| |
− |
| |
− | <p>There are several operations that affect the whole filesystem and cannot be run
| |
− | in parallel. An attempt to start one while another is running will fail.</p>
| |
− | <p>Since kernel 5.10 the currently running operation can be obtained from
| |
− | <tt>/sys/fs/btrfs/UUID/exclusive_operation</tt> with the following values and operations:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | balance
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | device add
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | device delete
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | device replace
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | resize
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | swapfile activate
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | none
| |
− | </p>
| |
− | </li>
| |
− | </ul>
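<p>A script that wants to wait for or report a running exclusive operation could read the sysfs file directly. The sketch below assumes the file lives under <tt>/sys/fs/btrfs/UUID/</tt>, like the other per-filesystem sysfs files on this page; the function name and the <em>unknown</em> fallback are hypothetical.</p>

```shell
# Hypothetical helper: print the running exclusive operation of the
# filesystem with the given UUID, or "unknown" if the file is missing
# (for example on kernels older than 5.10).
btrfs_exclusive_op() {
    uuid=$1 sysroot=${2:-/sys/fs/btrfs}
    f="$sysroot/$uuid/exclusive_operation"
    [ -r "$f" ] || { echo "unknown"; return 1; }
    cat "$f"
}
```

<p>A value of <em>none</em> means that a new exclusive operation can be started.</p>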
| |
− | <p>Enqueuing is supported for several btrfs subcommands so they can be started
| |
− | at once and then serialized.</p>
| |
− | ==FILESYSTEM LIMITS==
| |
− |
| |
− | <dl>
| |
− | <dt>
| |
− | maximum file name length
| |
− | <dd>
| |
− | <p>
| |
− | 255
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | maximum symlink target length
| |
− | <dd>
| |
− | <p>
| |
− | depends on the <em>nodesize</em> value, for 4k it’s 3949 bytes, for larger nodesize
| |
− | it’s 4095 due to the system limit PATH_MAX
| |
− | </p>
| |
− | <p>The symlink target may not be a valid path, ie. the path name components
| |
− | can exceed the limits (NAME_MAX), there’s no content validation at [http://man7.org/linux/man-pages/man3/symlink.3.html symlink(3)]
| |
− | creation.</p>
| |
− |
| |
− | <dt>
| |
− | maximum number of inodes
| |
− | <dd>
| |
− | <p>
| |
− | 2<sup>64</sup> but depends on the available metadata space as the inodes are created
| |
− | dynamically
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | inode numbers
| |
− | <dd>
| |
− | <p>
| |
− | minimum number: 256 (for subvolumes), regular files and directories: 257
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | maximum file length
| |
− | <dd>
| |
− | <p>
| |
− | inherent limit of btrfs is 2<sup>64</sup> (16 EiB) but the linux VFS limit is 2<sup>63</sup> (8 EiB)
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | maximum number of subvolumes
| |
− | <dd>
| |
− | <p>
| |
− | the subvolume ids can go up to 2<sup>64</sup> but the number of actual subvolumes
| |
− | depends on the available metadata space, the space consumed by all subvolume
| |
− | metadata, including bookkeeping of shared extents, can be large (MiB, GiB)
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | maximum number of hardlinks of a file in a directory
| |
− | <dd>
| |
− | <p>
| |
− | 65536 when the <tt>extref</tt> feature is turned on during mkfs (default), roughly
| |
− | 100 otherwise
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | minimum filesystem size
| |
− | <dd>
| |
− | <p>
| |
− | the minimal size of each device depends on the <em>mixed-bg</em> feature, without that
| |
− | (the default) it’s about 109MiB, with mixed-bg it’s 16MiB
| |
− | </p>
| |
− |
| |
− | </dl>
| |
− | ==BOOTLOADER SUPPORT==
| |
− |
| |
− | <p>GRUB2 (https://www.gnu.org/software/grub) has the most advanced support of
| |
− | booting from BTRFS with respect to features.</p>
| |
− | <p>U-boot (https://www.denx.de/wiki/U-Boot/) has decent support for booting but
| |
− | not all BTRFS features are implemented, check the documentation.</p>
| |
− | <p>EXTLINUX (from the https://syslinux.org project) can boot but does not support
| |
− | all features. Please check the upstream documentation before you use it.</p>
| |
− | <p>The first 1MiB on each device is unused with the exception of primary
| |
− | superblock that is on the offset 64KiB and spans 4KiB.</p>
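<p>The fixed location of the primary superblock makes it possible to probe a device or image file without any btrfs tools. The sketch below relies on two details that are not stated on this page and are taken from the on-disk format: the magic string <tt>_BHRfS_M</tt> and its position 64 bytes into the superblock. The function name is hypothetical.</p>

```shell
# Hypothetical helper: returns 0 if the primary superblock at offset
# 64KiB carries the btrfs magic string.
has_btrfs_magic() {
    # 64KiB superblock offset + 64 bytes into the superblock = 65600
    magic=$(dd if="$1" bs=1 skip=65600 count=8 2>/dev/null | tr -d '\000')
    [ "$magic" = "_BHRfS_M" ]
}
```

<p>Reading a real block device this way usually requires root privileges; tools such as <tt>blkid</tt> perform a similar check.</p>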
| |
− | ==FILE ATTRIBUTES==
| |
− |
| |
− | <p>The btrfs filesystem supports setting file attributes or flags. Note there are
| |
− | old and new interfaces, with confusing names. The following list should clarify
| |
− | that:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | <em>attributes</em>: [http://man7.org/linux/man-pages/man1/chattr.1.html chattr(1)] or [http://man7.org/linux/man-pages/man1/lsattr.1.html lsattr(1)] utilities (the ioctls are
| |
− | FS_IOC_GETFLAGS and FS_IOC_SETFLAGS), due to the ioctl names the attributes are
| |
− | also called flags
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | <em>xflags</em>: to distinguish from the previous, it’s extended flags, with tunable
| |
− | bits similar to the attributes but extensible and new bits will be added in the
| |
− | future (the ioctls are FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR but they are not
| |
− | related to extended attributes that are also called xattrs), there’s no standard
| |
− | tool to change the bits, there’s support in [http://man7.org/linux/man-pages/man8/xfs_io.8.html xfs_io(8)] as command <b>xfs_io -c
| |
− | chattr</b>
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | ===ATTRIBUTES===
| |
− |
| |
− | <dl>
| |
− | <dt>
| |
− | <b>a</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>append only</em>, new writes are always written at the end of the file
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>A</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>no atime updates</em>
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>c</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>compress data</em>, all data written after this attribute is set will be compressed.
| |
− | Please note that compression is also affected by the mount options or the parent
| |
− | directory attributes.
| |
− | </p>
| |
− | <p>When set on a directory, all newly created files will inherit this attribute.
| |
− | This attribute cannot be set with <em>m</em> at the same time.</p>
| |
− |
| |
− | <dt>
| |
− | <b>C</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>no copy-on-write</em>, file data modifications are done in-place
| |
− | </p>
| |
− | <p>When set on a directory, all newly created files will inherit this attribute.</p>
| |
− | <blockquote><b>Note:</b>
| |
− | due to implementation limitations, this flag can be set/unset only on
| |
− | empty files.</blockquote>
| |
− |
| |
− | <dt>
| |
− | <b>d</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>no dump</em>, makes sense with 3rd party tools like [http://man7.org/linux/man-pages/man8/dump.8.html dump(8)], on BTRFS the
| |
− | attribute can be set/unset but no other special handling is done
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>D</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>synchronous directory updates</em>, for more details search [http://man7.org/linux/man-pages/man2/open.2.html open(2)] for <em>O_SYNC</em>
| |
− | and <em>O_DSYNC</em>
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>i</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>immutable</em>, no file data and metadata changes allowed even to the root user as
| |
− | long as this attribute is set (obviously the exception is unsetting the attribute)
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>m</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>no compression</em>, permanently turn off compression on the given file. Any
| |
− | compression mount options will not affect this file. (<tt>chattr</tt> support added in
| |
− | 1.46.2)
| |
− | </p>
| |
− | <p>When set on a directory, all newly created files will inherit this attribute.
| |
− | This attribute cannot be set with <em>c</em> at the same time.</p>
| |
− |
| |
− | <dt>
| |
− | <b>S</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>synchronous updates</em>, for more details search [http://man7.org/linux/man-pages/man2/open.2.html open(2)] for <em>O_SYNC</em> and
| |
− | <em>O_DSYNC</em>
| |
− | </p>
| |
− |
| |
− | </dl>
| |
− | <p>No other attributes are supported. For the complete list please refer to the
| |
− | [http://man7.org/linux/man-pages/man1/chattr.1.html chattr(1)] manual page.</p>
| |
− | ===XFLAGS===
| |
− |
| |
− | <p>There’s overlap of letters assigned to the bits with the attributes, this list
| |
− | refers to what [http://man7.org/linux/man-pages/man8/xfs_io.8.html xfs_io(8)] provides:</p>
| |
− | <dl>
| |
− | <dt>
| |
− | <b>i</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>immutable</em>, same as the attribute
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>a</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>append only</em>, same as the attribute
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>s</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>synchronous updates</em>, same as the attribute <em>S</em>
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>A</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>no atime updates</em>, same as the attribute
| |
− | </p>
| |
− |
| |
− | <dt>
| |
− | <b>d</b>
| |
− | <dd>
| |
− | <p>
| |
− | <em>no dump</em>, same as the attribute
| |
− | </p>
| |
− |
| |
− | </dl>
| |
− | ==ZONED MODE==
| |
− |
| |
− | <p>Since version 5.12 btrfs supports so-called <em>zoned mode</em>. This is a special
| |
− | on-disk format and allocation/write strategy that’s friendly to zoned devices.
| |
− | In short, a device is partitioned into fixed-size zones and each zone can be
| |
− | updated in an append-only manner, or reset. As btrfs has no fixed data structures,
| |
− | except the super blocks, the zoned mode only requires block placement that
| |
− | follows the device constraints. You can learn about the whole architecture at
| |
− | https://zonedstorage.io .</p>
| |
− | <p>The devices are also called SMR/ZBC/ZNS, in <em>host-managed</em> mode. Note that
| |
− | there are devices that appear as non-zoned but actually are, this is
| |
− | <em>drive-managed</em> and using zoned mode won’t help.</p>
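<p>Whether the kernel sees a block device as zoned can be checked in sysfs. The sketch below reads the block layer attribute <tt>queue/zoned</tt>, which is not described on this page; the function name and the <em>unknown</em> fallback are hypothetical.</p>

```shell
# Hypothetical helper: print the zoned model of a block device as
# reported by the block layer (e.g. none, host-aware, host-managed).
zoned_model() {
    dev=$1 sysblock=${2:-/sys/block}
    f="$sysblock/$dev/queue/zoned"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown"
    fi
}
```

<p>A drive-managed device reports <em>none</em> here, as it hides its zones from the host.</p>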
| |
− | <p>The zone size depends on the device, typical sizes are 256MiB or 1GiB. In
| |
− | general it must be a power of two. Emulated zoned devices like <em>null_blk</em> allow
| |
− | various zone sizes to be set.</p>
| |
− | ===REQUIREMENTS, LIMITATIONS===
| |
− |
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | all devices must have the same zone size
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | maximum zone size is 8GiB
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | mixing zoned and non-zoned devices is possible, the zone writes are emulated,
| |
− | but this is mainly for testing
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | the super block is handled in a special way and is at different locations
| |
− | than on a non-zoned filesystem:
| |
− | </p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | primary: 0B (and the next two zones)
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | secondary: 512GiB (and the next two zones)
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | tertiary: 4TiB (4096GiB, and the next two zones)
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | </li>
| |
− | </ul>
| |
− | ===INCOMPATIBLE FEATURES===
| |
− |
| |
− | <p>The main constraint of zoned devices is the lack of in-place updates of data.
| |
− | This is inherently incompatible with some features:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | nodatacow - overwrite in-place, cannot create such files
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | fallocate - preallocating space for in-place first write
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | mixed-bg - unordered writes to data and metadata, fixing that means using
| |
− | separate data and metadata block groups
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | booting - the zone at offset 0 contains superblock, resetting the zone would
| |
− | destroy the bootloader data
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | <p>Initial support lacks some features but they’re planned:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | only single profile is supported
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | fstrim - due to dependency on free space cache v1
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | ===SUPER BLOCK===
| |
− |
| |
− | <p>As said above, the super block is handled in a special way. In order to be crash
| |
− | safe, at least one zone in a known location must contain a valid superblock.
| |
− | This is implemented as a ring buffer in two consecutive zones, starting from
| |
− | known offsets 0, 512GiB and 4TiB. The values are different than on non-zoned
| |
− | devices. Each new super block is appended to the end of the zone; once it’s
| |
− | filled, the zone is reset and writes continue to the next one. Looking up the
| |
− | latest super block requires reading the offsets of both zones and determining the last
| |
− | written version.</p>
| |
− | <p>The amount of space reserved for super block depends on the zone size. The
| |
− | secondary and tertiary copies are at distant offsets as the capacity of the
| |
− | devices is expected to be large, tens of terabytes. Maximum zone size supported
| |
− | is 8GiB, which would mean that eg. offset 0-16GiB would be reserved just for
| |
− | the super block on a hypothetical device of that zone size. This is wasteful
| |
− | but required to guarantee crash safety.</p>
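<p>The reserved ranges described above can be worked out with a little arithmetic. The following is a minimal sketch, not btrfs code: the function name <tt>reserved_ranges</tt> and the choice to leave out copies that do not fit on the device are assumptions made for illustration. It assumes two consecutive zones per copy, as in the ring buffer description.</p>

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

# Known starting offsets of the three super block copies in zoned mode,
# as listed in the text: 0, 512GiB and 4TiB.
SUPER_BLOCK_OFFSETS = {
    "primary": 0,
    "secondary": 512 * GIB,
    "tertiary": 4 * TIB,
}

# Assumption for this sketch: each copy occupies two consecutive zones
# (the ring buffer described in the text).
ZONES_PER_COPY = 2


def reserved_ranges(zone_size, device_size):
    """Return {copy name: (start, end)} byte ranges reserved for super blocks.

    Simplification of this sketch: copies whose range would not fit on
    the device are left out of the result.
    """
    if zone_size <= 0 or zone_size & (zone_size - 1):
        raise ValueError("zone size must be a power of two")
    if zone_size > 8 * GIB:
        raise ValueError("maximum supported zone size is 8GiB")
    ranges = {}
    for name, start in SUPER_BLOCK_OFFSETS.items():
        end = start + ZONES_PER_COPY * zone_size
        if end <= device_size:
            ranges[name] = (start, end)
    return ranges


# Hypothetical 20TiB device with the maximum zone size of 8GiB.
for name, (start, end) in reserved_ranges(8 * GIB, 20 * TIB).items():
    print(f"{name}: {start // GIB}GiB - {end // GIB}GiB")
```

<p>With 8GiB zones the primary copy occupies offsets 0 to 16GiB, which matches the worst case mentioned in the text.</p>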
| |
− | ==CONTROL DEVICE==
| |
− |
| |
− | <p>There’s a character special device <tt>/dev/btrfs-control</tt> with major and minor
| |
− | numbers 10 and 234 (the device can be found under the <em>misc</em> category).</p>
| |
− | <pre>$ ls -l /dev/btrfs-control
| |
− | crw------- 1 root root 10, 234 Jan 1 12:00 /dev/btrfs-control</pre>
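<p>The same check that <tt>ls -l</tt> shows above can be scripted. This is a small sketch using only the Python standard library; the helper names <tt>is_btrfs_control</tt> and <tt>check_control_device</tt> are made up for this example, and the major and minor numbers are the ones given in the text.</p>

```python
import os
import stat

BTRFS_CONTROL_MAJOR = 10   # "misc" character devices
BTRFS_CONTROL_MINOR = 234


def is_btrfs_control(st_mode, st_rdev):
    """True if the stat fields describe a character device 10:234."""
    return (
        stat.S_ISCHR(st_mode)
        and os.major(st_rdev) == BTRFS_CONTROL_MAJOR
        and os.minor(st_rdev) == BTRFS_CONTROL_MINOR
    )


def check_control_device(path="/dev/btrfs-control"):
    """Return True if path exists and looks like the btrfs control device."""
    try:
        st = os.stat(path)
    except OSError:
        # Missing node or no permission to stat it.
        return False
    return is_btrfs_control(st.st_mode, st.st_rdev)


if __name__ == "__main__":
    print("control device present:", check_control_device())
```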
| |
− | <p>The device accepts some ioctl calls that can perform the following actions on the
| |
− | filesystem module:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | scan devices for btrfs filesystem (ie. to let multi-device filesystems mount
| |
− | automatically) and register them with the kernel module
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | similar to scan, but also wait until the device scanning process is finished
| |
− | for a given filesystem
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | get the supported features (can be also found under <em>/sys/fs/btrfs/features</em>)
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | <p>The device is created when btrfs is initialized, either as a module or a
| |
− | built-in functionality and makes sense only in connection with that. Running
| |
− | eg. mkfs without the module loaded will not register the device and will
| |
− | probably warn about that.</p>
| |
− | <p>In rare cases when the module is loaded but the device is not present (most
| |
− | likely accidentally deleted), it’s possible to recreate it by</p>
| |
− | <pre># mknod --mode=600 /dev/btrfs-control c 10 234</pre>
| |
− | <p>or (since 5.11) by a convenience command</p>
| |
− | <pre># btrfs rescue create-control-device</pre>
| |
− | <p>The control device is not strictly required but the device scanning will not
| |
− | work and a workaround would need to be used to mount a multi-device filesystem.
| |
− | The mount option <em>device</em> can trigger the device scanning during mount, see
| |
− | also <b>btrfs device scan</b>.</p>
| |
− | ==FILESYSTEM WITH MULTIPLE PROFILES==
| |
− |
| |
− | <p>It is possible that a btrfs filesystem contains multiple block group profiles
| |
− | of the same type. This could happen when a profile conversion using balance
| |
− | filters is interrupted (see [[Manpage/btrfs-balance|btrfs-balance(8)]]). Some <em>btrfs</em> commands perform
| |
− | a test to detect this kind of condition and print a warning like this:</p>
| |
− | <pre>WARNING: Multiple block group profiles detected, see 'man btrfs(5)'.
| |
− | WARNING: Data: single, raid1
| |
− | WARNING: Metadata: single, raid1</pre>
| |
− | <p>The corresponding output of <b>btrfs filesystem df</b> might look like:</p>
| |
− | <pre>WARNING: Multiple block group profiles detected, see 'man btrfs(5)'.
| |
− | WARNING: Data: single, raid1
| |
− | WARNING: Metadata: single, raid1
| |
− | Data, RAID1: total=832.00MiB, used=0.00B
| |
− | Data, single: total=1.63GiB, used=0.00B
| |
− | System, single: total=4.00MiB, used=16.00KiB
| |
− | Metadata, single: total=8.00MiB, used=112.00KiB
| |
− | Metadata, RAID1: total=64.00MiB, used=32.00KiB
| |
− | GlobalReserve, single: total=16.25MiB, used=0.00B</pre>
| |
− | <p>There’s more than one line for type <em>Data</em> and <em>Metadata</em>, while the profiles
| |
− | are <em>single</em> and <em>RAID1</em>.</p>
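<p>The condition can also be spotted by a script that reads the <b>btrfs filesystem df</b> output. The sketch below only parses text in the format shown above; the function names are hypothetical and the parser assumes each data line starts with <em>Type, profile:</em>.</p>

```python
from collections import defaultdict

SAMPLE = """\
WARNING: Multiple block group profiles detected, see 'man btrfs(5)'.
WARNING:   Data: single, raid1
WARNING:   Metadata: single, raid1
Data, RAID1: total=832.00MiB, used=0.00B
Data, single: total=1.63GiB, used=0.00B
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=8.00MiB, used=112.00KiB
Metadata, RAID1: total=64.00MiB, used=32.00KiB
GlobalReserve, single: total=16.25MiB, used=0.00B
"""


def profiles_by_type(df_output):
    """Map each block group type to the list of profiles reported for it."""
    profiles = defaultdict(list)
    for line in df_output.splitlines():
        line = line.strip()
        if not line or line.startswith("WARNING:"):
            continue
        # Data lines look like "Data, RAID1: total=832.00MiB, used=0.00B".
        head, sep, _ = line.partition(":")
        if not sep or "," not in head:
            continue
        bg_type, profile = (part.strip() for part in head.split(",", 1))
        profiles[bg_type].append(profile)
    return dict(profiles)


def multiple_profiles(df_output):
    """Return only the types that are listed with more than one profile."""
    return {t: p for t, p in profiles_by_type(df_output).items() if len(p) > 1}


print(multiple_profiles(SAMPLE))
```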
| |
− | <p>This state of the filesystem is OK but most likely needs the user/administrator to
| |
− | take action and finish the interrupted tasks. This cannot easily be done
| |
− | automatically, and it is the user who knows the expected final profiles.</p>
| |
− | <p>In the example above, the filesystem started as a single device and <em>single</em>
| |
− | block group profile. Then another device was added, followed by balance with
| |
− | <em>convert=raid1</em>, which for some reason hasn’t finished. Restarting the balance
| |
− | with <em>convert=raid1</em> will continue and end up with a filesystem with all block
| |
− | group profiles <em>RAID1</em>.</p>
| |
− | <blockquote><b>Note:</b>
| |
− | If you’re familiar with balance filters, you can use
| |
− | <em>convert=raid1,profiles=single,soft</em>, which will take only the unconverted
| |
− | <em>single</em> profiles and convert them to <em>raid1</em>. This may speed up the conversion
| |
− | as it would not try to rewrite the already converted <em>raid1</em> profiles.</blockquote>
| |
− | <p>Having just one profile is desired as this also clearly defines the profile of
| |
− | newly allocated block groups, otherwise this depends on internal allocation
| |
− | policy. When there are multiple profiles present, the order of selection is
| |
− | RAID6, RAID5, RAID10, RAID1, RAID0 as long as the device number constraints are
| |
− | satisfied.</p>
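<p>The selection order can be written down as a short lookup. This is only an illustration of the rule stated above, not the kernel's allocation code. The minimum device counts are passed in by the caller because they are not given in this section; the values used in the example are placeholders and should be checked against mkfs.btrfs(8).</p>

```python
# Order of selection as stated in the text.
SELECTION_ORDER = ["RAID6", "RAID5", "RAID10", "RAID1", "RAID0"]


def pick_profile(present, num_devices, min_devices):
    """Return the first profile in SELECTION_ORDER that is present on the
    filesystem and whose minimum device count is met, or None if none of
    the listed profiles qualifies."""
    present_upper = {p.upper() for p in present}
    for profile in SELECTION_ORDER:
        if profile in present_upper and num_devices >= min_devices[profile]:
            return profile
    return None


# Placeholder minimums for illustration only (an assumption of this sketch).
EXAMPLE_MIN_DEVICES = {"RAID6": 3, "RAID5": 2, "RAID10": 4, "RAID1": 2, "RAID0": 2}

# The interrupted conversion from the example: single and raid1 on 2 devices.
print(pick_profile({"single", "raid1"}, 2, EXAMPLE_MIN_DEVICES))
```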
| |
− | <p>The commands that print the warning were chosen so that it is brought to the user’s
| |
− | attention when the filesystem state is being changed in that regard. These are:
| |
− | <em>device add</em>, <em>device delete</em>, <em>balance cancel</em>, <em>balance pause</em>. Commands
| |
− | that report space usage: <em>filesystem df</em>, <em>device usage</em>. The command
| |
− | <em>filesystem usage</em> provides a line in the overall summary:</p>
| |
− | <pre> Multiple profiles: yes (data, metadata)</pre>
| |
− | ==SEEDING DEVICE==
| |
− |
| |
− | <p>The COW mechanism and multiple devices under one hood enable an interesting
| |
− | concept, called a seeding device: extending a read-only filesystem on a single
| |
− | device with another device that captures all writes. For example,
| |
− | imagine an immutable golden image of an operating system enhanced with another
| |
− | device that allows using the data from the golden image while allowing normal operation.
| |
− | This idea originated with CD-ROMs holding a base OS that could be used for live
| |
− | systems, but this became obsolete. There are technologies providing similar
| |
− | functionality, like <em>unionmount</em>, <em>overlayfs</em> or <em>qcow2</em> image snapshot.</p>
| |
− | <p>The seeding device starts as a normal filesystem; once the contents are ready,
| |
− | <b>btrfstune -S 1</b> is used to flag it as a seeding device. Mounting such a device
| |
− | will not allow any writes, except adding a new device by <b>btrfs device add</b>.
| |
− | Then the filesystem can be remounted as read-write.</p>
| |
− | <p>Given that the filesystem on the seeding device is always recognized as
| |
− | read-only, it can be used to seed multiple filesystems, at the same time. The
| |
− | UUID that is normally attached to a device is automatically changed to a random
| |
− | UUID on each mount.</p>
| |
− | <p>Once the seeding device is mounted, it needs the writable device. After adding
| |
− | it, something like <em>mount -o remount,rw /path</em> makes the filesystem at
| |
− | <em>/path</em> ready for use. The simplest usecase is to throw away all changes by
| |
− | unmounting the filesystem when convenient.</p>
| |
− | <p>Alternatively, deleting the seeding device from the filesystem can turn it into
| |
− | a normal filesystem, provided that the writable device can also contain all the
| |
− | data from the seeding device.</p>
| |
− | <p>The seeding device flag can be cleared again by <b>btrfstune -f -S 0</b>, e.g.
| |
− | to allow updating it with newer data, but please note that this will invalidate
| |
− | all existing filesystems that use this particular seeding device. This works
| |
− | for some usecases, not for others, and a forcing flag to the command is
| |
− | mandatory to avoid accidental mistakes.</p>
| |
− | <p>Example how to create and use one seeding device:</p>
| |
− | <pre># mkfs.btrfs /dev/sda
| |
− | # mount /dev/sda /mnt/mnt1
| |
− | # ... fill mnt1 with data
| |
− | # umount /mnt/mnt1
| |
− | # btrfstune -S 1 /dev/sda
| |
− | # mount /dev/sda /mnt/mnt1
| |
− | # btrfs device add /dev/sdb /mnt/mnt1
| |
− | # mount -o remount,rw /mnt/mnt1
| |
− | # ... /mnt/mnt1 is now writable</pre>
| |
− | <p>Now <em>/mnt/mnt1</em> can be used normally. The device <em>/dev/sda</em> can be mounted
| |
− | again with another writable device:</p>
| |
− | <pre># mount /dev/sda /mnt/mnt2
| |
− | # btrfs device add /dev/sdc /mnt/mnt2
| |
− | # mount -o remount,rw /mnt/mnt2
| |
− | # ... /mnt/mnt2 is now writable</pre>
| |
− | <p>The writable device (<em>/dev/sdb</em>) can be decoupled from the seeding device and
| |
− | used independently:</p>
| |
− | <pre># btrfs device delete /dev/sda /mnt/mnt1</pre>
| |
− | <p>As the contents originated in the seeding device, it’s possible to turn
| |
− | <em>/dev/sdb</em> to a seeding device again and repeat the whole process.</p>
| |
− | <p>A few things to note:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | it’s recommended to use only a single device for the seeding device; it works
| |
− | for multiple devices but the <em>single</em> profile must be used in order to make
| |
− | the seeding device deletion work
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | block group profiles <em>single</em> and <em>dup</em> support the usecases above
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | the label is copied from the seeding device and can be changed by <b>btrfs filesystem label</b>
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | each new mount of the seeding device gets a new random UUID
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | ==RAID56 STATUS AND RECOMMENDED PRACTICES==
| |
− |
| |
− | <p>The RAID56 feature provides striping and parity over several devices, same as
| |
− | the traditional RAID5/6. There are some implementation and design deficiencies
| |
− | that make it unreliable for some corner cases and the feature <b>should not be
| |
− | used in production, only for evaluation or testing</b>. The power failure safety
| |
− | for metadata with RAID56 is not 100%.</p>
| |
− | ===Metadata===
| |
− |
| |
− | <p>Do not use <em>raid5</em> nor <em>raid6</em> for metadata. Use <em>raid1</em> or <em>raid1c3</em>
| |
− | respectively.</p>
| |
− | <p>The substitute profiles provide the same guarantees against loss of 1 or 2
| |
− | devices, and in some respect can be an improvement. Recovering from one
| |
− | missing device will only need to access the remaining 1st or 2nd copy, that in
| |
− | general may be stored on some other devices due to the way RAID1 works on
| |
− | btrfs, unlike on a striped profile (similar to <em>raid0</em>) that would need all
| |
− | devices all the time.</p>
| |
− | <p>The space allocation pattern and consumption is different (eg. on N devices):
| |
− | for <em>raid5</em> as an example, a 1GiB chunk is reserved on each device, while with
| |
− | <em>raid1</em> each 1GiB chunk is stored on 2 devices. The consumption of each
| |
− | 1GiB of used metadata is then <em>N * 1GiB</em> for <em>raid5</em> vs <em>2 * 1GiB</em> for <em>raid1</em>. Using <em>raid1</em>
| |
− | is also more convenient for balancing/converting to another profile due to the lower
| |
− | requirement on the available chunk space.</p>
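<p>The comparison above is simple arithmetic. The sketch below restates it for a chosen number of devices; the function name is hypothetical and the 1GiB chunk size is the one used in the text.</p>

```python
GIB = 1024 ** 3


def raw_space_per_chunk(profile, num_devices, chunk_size=GIB):
    """Raw space taken on all devices when one metadata chunk is allocated,
    following the allocation pattern described in the text."""
    if num_devices < 2:
        raise ValueError("both profiles need at least 2 devices")
    if profile == "raid5":
        # One chunk is reserved on each of the N devices.
        return num_devices * chunk_size
    if profile == "raid1":
        # Each chunk is stored on exactly 2 devices.
        return 2 * chunk_size
    raise ValueError(f"profile not covered by this sketch: {profile}")


# Example with N = 6 devices.
for profile in ("raid5", "raid1"):
    print(profile, raw_space_per_chunk(profile, 6) // GIB, "GiB")
```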
| |
− | ===Missing/incomplete support===
| |
− |
| |
− | <p>When RAID56 is on the same filesystem with different raid profiles, the space
| |
− | reporting is inaccurate, eg. <em>df</em>, <em>btrfs filesystem df</em> or <em>btrfs filesystem
| |
− | usage</em>. When there’s only one profile per block group type (eg. raid5 for data)
| |
− | the reporting is accurate.</p>
| |
− | <p>When scrub is started on a RAID56 filesystem, it’s started on all devices at once, which
| |
− | degrades the performance. The workaround is to start it on each device
| |
− | separately. Due to that the device stats may not match the actual state and
| |
− | some errors might get reported multiple times.</p>
| |
− | <p>The <em>write hole</em> problem.</p>
| |
− | ==STORAGE MODEL==
| |
− |
| |
− | <p><em>A storage model is a model that captures key physical aspects of data
| |
− | structure in a data store. A filesystem is the logical structure organizing
| |
− | data on top of the storage device.</em></p>
| |
− | <p>The filesystem assumes several features or limitations of the storage device
| |
− | and utilizes them or applies measures to guarantee reliability. BTRFS in
| |
− | particular is based on a COW (copy on write) mode of writing, ie. not updating
| |
− | data in place but rather writing a new copy to a different location and then
| |
− | atomically switching the pointers.</p>
| |
− | <p>In an ideal world, the device does what it promises. The filesystem assumes
| |
− | that this may not be true so additional mechanisms are applied to either detect
| |
− | misbehaving hardware or get valid data by other means. The devices may (and do)
| |
− | apply their own detection and repair mechanisms but we won’t assume any.</p>
| |
− | <p>The following assumptions about storage devices are considered (sorted by
| |
− | importance, numbers are for further reference):</p>
| |
− | <ol>
| |
− | <li>
| |
− | <p>
| |
− | atomicity of reads and writes of blocks/sectors (the smallest unit of data
| |
− | the device presents to the upper layers)
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | there’s a flush command that instructs the device to forcibly order writes
| |
− | before and after the command; alternatively there’s a barrier command that
| |
− | facilitates the ordering but may not flush the data
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | data sent to write to a given device offset will be written without further
| |
− | changes to the data and to the offset
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | writes can be reordered by the device, unless explicitly serialized by the
| |
− | flush command
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | reads and writes can be freely reordered and interleaved
| |
− | </p>
| |
− | </li>
| |
− | </ol>
| |
− | <p>The consistency model of BTRFS builds on these assumptions. The logical data
| |
− | updates are grouped into a generation, written on the device, serialized by
| |
− | the flush command and then the super block is written ending the generation.
| |
− | All logical links among metadata comprising a consistent view of the data may
| |
− | not cross the generation boundary.</p>
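<p>The ordering described above (write new data, flush, then write the super block) can be imitated on an ordinary file. This is a toy model under stated assumptions, not the btrfs on-disk format: the 24-byte "super block" layout and the function names are invented for the example, and <tt>os.fsync</tt> stands in for the device flush command.</p>

```python
import os
import struct
import tempfile

SUPER_FMT = "<QQQ"                        # generation, payload offset, payload length
SUPER_SIZE = struct.calcsize(SUPER_FMT)   # toy super block lives at offset 0


def commit(fd, generation, payload):
    """Write one generation: new data first, flush, then the super block."""
    # 1. Copy on write: place the new data after everything written so far,
    #    never on top of the previous copy.
    offset = max(os.fstat(fd).st_size, SUPER_SIZE)
    os.pwrite(fd, payload, offset)
    # 2. Flush so the data is stable before anything points at it.
    os.fsync(fd)
    # 3. Switch the pointer by rewriting the super block, then flush again.
    os.pwrite(fd, struct.pack(SUPER_FMT, generation, offset, len(payload)), 0)
    os.fsync(fd)


def read_latest(fd):
    """Follow the super block to the data of the last committed generation."""
    generation, offset, length = struct.unpack(
        SUPER_FMT, os.pread(fd, SUPER_SIZE, 0)
    )
    return generation, os.pread(fd, length, offset)


with tempfile.TemporaryFile() as f:
    commit(f.fileno(), 1, b"first tree")
    commit(f.fileno(), 2, b"second tree")
    print(read_latest(f.fileno()))
```

<p>If the process stops between steps 2 and 3, the super block still points at the previous generation, whose data were never overwritten.</p>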
| |
− | ===WHEN THINGS GO WRONG===
| |
− |
| |
− | <p><b>No or partial atomicity of block reads/writes (1)</b></p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | <em>Problem</em>: partial block contents are written (<em>torn write</em>), eg. due to a
| |
− | power glitch or other electronics failure during the read/write
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | <em>Detection</em>: checksum mismatch on read
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | <em>Repair</em>: use another copy or rebuild from multiple blocks using some encoding
| |
− | scheme
| |
− | </p>
| |
− | </li>
| |
− | </ul>
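<p>Detection and repair from another copy can be sketched as follows. This is an illustration only: it uses the plain CRC32 from Python's <tt>zlib</tt> module as a stand-in checksum, keeps the copies in a list instead of on devices, and the function names are hypothetical.</p>

```python
import zlib


def checksum(block):
    """Stand-in checksum for this sketch (plain CRC32 from zlib)."""
    return zlib.crc32(block)


def read_with_repair(copies, expected_checksum):
    """Return (data, repaired_indexes).

    The first copy whose checksum matches is used as the result, and every
    copy that fails the checksum is overwritten with the good data.
    """
    good = next((c for c in copies if checksum(c) == expected_checksum), None)
    if good is None:
        raise IOError("no copy matches the checksum")
    repaired = [i for i, c in enumerate(copies) if checksum(c) != expected_checksum]
    for i in repaired:
        copies[i] = good
    return good, repaired


# A torn write: the tail of the first copy never reached the medium.
original = b"A" * 4092 + b"TAIL"
torn = original[:-4] + b"\x00" * 4
copies = [torn, original]
data, repaired = read_with_repair(copies, checksum(original))
print(len(data), repaired)
```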
| |
− | <p><b>The flush command does not flush (2)</b></p>
| |
− | <p>This is perhaps the most serious problem and impossible to mitigate by the
| |
− | filesystem without limitations and design restrictions. What could happen in
| |
− | the worst case is that writes from one generation bleed to another one, while
| |
− | still letting the filesystem consider the generations isolated. A crash at any
| |
− | point would leave data on the device in an inconsistent state without any hint
| |
− | of what exactly got written and what is missing, leading to stale metadata link
| |
− | information.</p>
| |
− | <p>Devices usually honor the flush command, but for performance reasons may do
| |
− | internal caching, where the flushed data are not yet persistently stored. A
| |
− | power failure could lead to a similar scenario as above, although it’s less
| |
− | likely that later writes would be written before the cached ones. This is
| |
− | beyond what a filesystem can take into account. Devices or controllers are
| |
− | usually equipped with batteries or capacitors to write the cache contents even
| |
− | after power is cut. (<em>Battery backed write cache</em>)</p>
| |
− | <p><b>Data get silently changed on write (3)</b></p>
| |
− | <p>Such a thing should not happen frequently, but still can happen spuriously due to
| |
− | the complex internal workings of devices or physical effects of the storage
| |
− | media itself.</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | <em>Problem</em>: while the data are written atomically, the contents get changed
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | <em>Detection</em>: checksum mismatch on read
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | <em>Repair</em>: use another copy or rebuild from multiple blocks using some
| |
− | encoding scheme
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | <p><b>Data get silently written to another offset (3)</b></p>
| |
− | <p>This would be another serious problem as the filesystem has no information
| |
− | when it happens. For that reason, measures have to be taken ahead of time.
| |
− | This problem is also commonly called <em>ghost write</em>.</p>
| |
− | <p>The metadata blocks have the checksum embedded in the blocks, so a correct
| |
− | atomic write would not corrupt the checksum. It’s likely that after reading
| |
− | such block the data inside would not be consistent with the rest. To rule that
| |
− | out there’s an embedded block number in the metadata block. It’s the logical
| |
− | block number because this is what the logical structure expects and verifies.</p>
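<p>The role of the embedded block number can be shown with a toy block format. The layout below (a 4-byte CRC32 followed by an 8-byte logical block number and the payload) is invented for this sketch and is not the btrfs metadata header; it only demonstrates why a checksum alone does not catch a block that was written intact but to the wrong place.</p>

```python
import struct
import zlib


def make_block(logical, payload):
    """Build a toy metadata block: checksum, logical block number, payload."""
    body = struct.pack("<Q", logical) + payload
    return struct.pack("<I", zlib.crc32(body)) + body


def verify_block(raw, expected_logical):
    """Check the checksum first, then the embedded logical block number."""
    (stored_csum,) = struct.unpack_from("<I", raw, 0)
    body = raw[4:]
    if zlib.crc32(body) != stored_csum:
        return "checksum mismatch"
    (logical,) = struct.unpack_from("<Q", body, 0)
    if logical != expected_logical:
        return "block number mismatch"
    return "ok"


block = make_block(4096, b"leaf contents")
print(verify_block(block, 4096))   # block found where it is expected
print(verify_block(block, 8192))   # same intact block found at another offset
```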
| |
− | ==HARDWARE CONSIDERATIONS==
| |
− |
| |
− | <p>The following is based on information publicly available, user feedback,
| |
− | community discussions or bug report analyses. It’s not complete and further
| |
− | research is encouraged when in doubt.</p>
| |
− | ===MAIN MEMORY===
| |
− |
| |
− | <p>The data structures and raw data blocks are temporarily stored in computer
| |
− | memory before they get written to the device. It is critical that memory is
| |
− | reliable because even simple bit flips can have vast consequences and lead to
| |
− | damaged structures, not only in the filesystem but in the whole operating
| |
− | system.</p>
| |
− | <p>Based on experience in the community, memory bit flips are more common than one
| |
− | would think. When it happens, it’s reported by the tree-checker or by a checksum
| |
− | mismatch after reading blocks. There are some very obvious instances of bit
| |
− | flips that happen, e.g. in an ordered sequence of keys in metadata blocks. We can
| |
− | easily infer from the other data what values get damaged and how. However, fixing
| |
− | that is not straightforward and would require cross-referencing data from the
| |
− | entire filesystem to see the scope.</p>
| |
− | <p>If available, ECC memory should lower the chances of bit flips, but this
| |
− | type of memory is not available in all cases. A memory test should be performed
| |
− | in case there’s a visible bit flip pattern, though this may not detect a faulty
| |
− | memory module because the actual load of the system could be the factor making
| |
− | the problems appear. In recent years attacks on how the memory modules operate
| |
− | have been demonstrated (<em>rowhammer</em>), causing specific bits to be flipped.
| |
− | While these were targeted, this shows that a series of reads or writes can
| |
− | affect unrelated parts of memory.</p>
| |
− | <p>Further reading:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | https://en.wikipedia.org/wiki/Row_hammer
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | <p>What to do:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | run <em>memtest</em>, note that sometimes memory errors happen only when the system
| |
− | is under heavy load that the default memtest cannot trigger
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | memory errors may appear as the filesystem going read-only due to the "pre write"
| |
− | checks, which verify metadata before they get written and reject blocks that fail basic
| |
− | consistency checks
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | ===DIRECT MEMORY ACCESS (DMA)===
| |
− |
| |
− | <p>Another class of errors is related to DMA (direct memory access) performed
| |
− | by device drivers. While this could be considered a software error, the
| |
− | data transfers that happen without CPU assistance may accidentally corrupt
| |
− | other pages. Storage devices utilize DMA for performance reasons, the
| |
− | filesystem structures and data pages are passed back and forth, making
| |
− | errors possible in case page life time is not properly tracked.</p>
| |
− | <p>There are lots of quirks (device-specific workarounds) in Linux kernel
| |
− | drivers (regarding not only DMA) that are added when found. The quirks
| |
− | may avoid specific errors or disable some features to avoid worse problems.</p>
| |
− | <p>What to do:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | use an up-to-date kernel (recent releases or maintained long term support versions)
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | as this may be caused by faulty drivers, keep the systems up-to-date
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | ===ROTATIONAL DISKS (HDD)===
| |
− |
| |
− | <p>Rotational HDDs typically fail at the level of individual sectors or small clusters.
| |
− | Read failures are caught on the levels below the filesystem and are returned to
| |
− | the user as <em>EIO - Input/output error</em>. Reading the blocks repeatedly may
| |
− | return the data eventually, but this is better done by specialized tools and
| |
− | the filesystem takes the result of the lower layers. Rewriting the sectors may
| |
− | trigger internal remapping but this inevitably leads to data loss.</p>
| |
− | <p>Disk firmware is technically software but from the filesystem perspective is
| |
− | part of the hardware. IO requests are processed, and caching or various
| |
− | other optimizations are performed, which may lead to bugs under high load or
| |
− | unexpected physical conditions or unsupported use cases.</p>
| |
− | <p>Disks are connected by cables with two ends, both of which can cause problems
| |
− | when not attached properly. Data transfers are protected by checksums and the
| |
− | lower layers try hard to transfer the data correctly or not at all. The errors
| |
− | from badly-connecting cables may manifest as a large number of failed read or
| |
− | write requests, or as short error bursts depending on physical conditions.</p>
| |
− | <p>What to do:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | check <em>smartctl</em> for potential issues
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | ===SOLID STATE DRIVES (SSD)===
| |
− |
| |
− | <p>The mechanism of information storage is different from HDDs and this affects
| |
− | the failure mode as well. The data are stored in cells grouped in large blocks
| |
− | with limited number of resets and other write constraints. The firmware tries
| |
− | to avoid unnecessary resets and performs optimizations to maximize the storage
| |
− | media lifetime. The known techniques are deduplication (blocks with same
| |
− | fingerprint/hash are mapped to same physical block), compression or internal
| |
− | remapping and garbage collection of used memory cells. Due to the additional
| |
− | processing there are measures to verify the data e.g. by ECC codes.</p>
| |
− | <p>The observations of failing SSDs show that the whole electronics fail at once
| |
− | or the failure affects a lot of data (eg. stored on one chip). Recovering such data
| |
− | may need specialized equipment, and reading data repeatedly does not help the way
| |
− | it can with HDDs.</p>
| |
− | <p>There are several technologies of the memory cells with different
| |
− | characteristics and price. The lifetime is directly affected by the type and
| |
− | frequency of data written. Writing "too much" distinct data (e.g. encrypted)
| |
− | may render the internal deduplication ineffective and lead to a lot of rewrites
| |
− | and increased wear of the memory cells.</p>
| |
− | <p>There are several technologies and manufacturers so it’s hard to describe them
| |
− | but there are some that exhibit similar behaviour:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | an expensive SSD will use more durable memory cells and is optimized
| |
− | for reliability and high load
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | a cheap SSD is designed for a lower load ("desktop user") and is optimized for
| |
− | cost, it may employ the optimizations and/or extended error reporting partially
| |
− | or not at all
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | <p>It’s not possible to reliably determine the expected lifetime of an SSD due to
| |
− | lack of information about how it works or due to lack of reliable stats provided
| |
− | by the device.</p>
| |
− | <p>Metadata writes tend to be the biggest component of lifetime writes to an SSD,
| |
− | so there is some value in reducing them. Depending on the device class (high
| |
− | end/low end) the features like DUP block group profiles may affect the
| |
− | reliability in both ways:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | <em>high end</em> are typically more reliable and using <em>single</em> for data and metadata
| |
− | could be suitable to reduce device wear
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | <em>low end</em> could lack ability to identify errors so an additional
| |
− | redundancy at the filesystem level (checksums, <em>DUP</em>) could help
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | <p>Only users who consume 50 to 100% of the SSD’s actual lifetime writes need to be
| |
− | concerned by the write amplification of btrfs DUP metadata. Most users will be
| |
− | far below 50% of the actual lifetime, or will write the drive to death and
| |
− | discover how many writes 100% of the actual lifetime was. SSD firmware often
| |
− | adds its own write multipliers that can be arbitrary and unpredictable and
| |
− | dependent on application behavior, and these will typically have far greater
| |
− | effect on SSD lifespan than DUP metadata. It’s more or less impossible to
| |
− | predict when an SSD will run out of lifetime writes to within a factor of two, so
| |
− | it’s hard to justify wear reduction as a benefit.</p>
| |
− | <p>Further reading:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | https://www.snia.org/educational-library/ssd-and-deduplication-end-spinning-disk-2012
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | https://www.snia.org/educational-library/realities-solid-state-storage-2013-2013
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | https://www.snia.org/educational-library/ssd-performance-primer-2013
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | https://www.snia.org/educational-library/how-controllers-maximize-ssd-life-2013
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | <p>What to do:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | run <em>smartctl</em> or self-tests to look for potential issues
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | keep the firmware up-to-date
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | ===NVM EXPRESS, NON-VOLATILE MEMORY (NVMe)===
| |
− |
| |
− | <p>NVMe is a type of persistent memory usually connected over a system bus (PCIe)
| |
− | or similar interface and the speeds are an order of magnitude faster than SSD.
| |
− | It is also a non-rotating type of storage, and is not typically connected by a
| |
− | cable. It’s not a SCSI type device either but rather a complete specification
| |
− | for logical device interface.</p>
| |
− | <p>In a way the errors could be compared to a combination of SSD class and regular
| |
− | memory. Errors may exhibit as random bit flips or IO failures. There are tools
| |
− | to access the internal log (<em>nvme log</em> and <em>nvme-cli</em>) for a more detailed
| |
− | analysis.</p>
| |
− | <p>There are separate error detection and correction steps performed e.g. on the
| |
− | bus level and in most cases never making it to the filesystem level. Once this
| |
− | happens it could mean there’s some systematic error like overheating or bad
| |
− | physical connection of the device. You may want to run self-tests (using
| |
− | <em>smartctl</em>).</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | https://en.wikipedia.org/wiki/NVM_Express
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | https://www.smartmontools.org/wiki/NVMe_Support
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | ===DRIVE FIRMWARE===
| |
− |
| |
− | <p>Firmware is technically still software but embedded into the hardware. As all
| |
− | software has bugs, so does firmware. Storage devices can update the firmware
| |
− | and fix known bugs. In some cases it’s possible to avoid certain bugs by
| |
− | quirks (device-specific workarounds) in the Linux kernel.</p>
| |
− | <p>A faulty firmware can cause a wide range of corruptions from small and localized
| |
− | to large affecting lots of data. Self-repair capabilities may not be sufficient.</p>
| |
− | <p>What to do:</p>
| |
− | <ul>
| |
− | <li>
| |
− | <p>
| |
− | check for firmware updates in case there are known problems, note that
| |
− | updating firmware can be risky in itself
| |
− | </p>
| |
− | </li>
| |
− | <li>
| |
− | <p>
| |
− | use an up-to-date kernel (recent releases or maintained long term support versions)
| |
− | </p>
| |
− | </li>
| |
− | </ul>
| |
− | ===SD FLASH CARDS===
| |
− |
| |
− | <p>There are a lot of devices with low power consumption and thus using storage
| |
− | media based on low power consumption too, typically flash memory stored on
| |
− | a chip enclosed in a detachable card package. An improperly inserted card may be
| |
− | damaged by electrical spikes when the device is turned on or off. The chips
| |
− | storing data in turn may be damaged permanently. All types of flash memory
| |
− | have a limited number of rewrites, so the data are internally translated by FTL
| |
− | (flash translation layer). This is implemented in firmware (technically
| |
− | software) and prone to bugs that manifest as hardware errors.</p>
| |
− | <p>Adding redundancy like using DUP profiles for both data and metadata can help
| |
− | in some cases but a full backup might be the best option once problems appear
| |
− | and replacing the card could be required as well.</p>
| |
− | ===HARDWARE AS THE MAIN SOURCE OF FILESYSTEM CORRUPTIONS===
| |
− |
| |
− | <p><b>If you use unreliable hardware and don’t know about that, don’t blame the
| |
− | filesystem when it tells you.</b></p>
| |
− | ==SEE ALSO==
| |
− |
| |
− | <p>[http://man7.org/linux/man-pages/man5/acl.5.html acl(5)],
| |
− | [[Manpage/btrfs|btrfs(8)]],
| |
− | [http://man7.org/linux/man-pages/man1/chattr.1.html chattr(1)],
| |
− | [http://man7.org/linux/man-pages/man8/fstrim.8.html fstrim(8)],
| |
− | [http://man7.org/linux/man-pages/man2/ioctl.2.html ioctl(2)],
| |
− | [[Manpage/mkfs.btrfs|mkfs.btrfs(8)]],
| |
− | [http://man7.org/linux/man-pages/man8/mount.8.html mount(8)],
| |
− | [http://man7.org/linux/man-pages/man8/swapon.8.html swapon(8)]</p>
| |
− | [[Category:Manpage]]
| |