Gotchas

From btrfs Wiki

This page lists some problems one might face when trying btrfs. Some of these are not really bugs, but rather inconveniences caused by features that are not yet implemented or by design decisions that are not yet documented.

The page covers issues relevant to the few stable kernel releases that seem to be in use, currently 4.14, 4.9 and 4.4. The list will be updated once a new stable release is announced.

Please note that this page only documents known issues. Ask your kernel vendor for backported fixes or support.

Please note that most of this page was not written by the btrfs developer community and may be entirely inaccurate.


Affecting all versions

Block-level copies of devices

Do NOT

  • make a block-level copy of a Btrfs filesystem to another block device...
  • use LVM snapshots, or any other kind of block level snapshots...
  • turn a copy of a filesystem that is stored in a file into a block device with the loopback driver...

... and then try to mount either the original or the snapshot while both are visible to the same kernel.

Why?

If there are multiple block devices visible at the same time, and those block devices have the same filesystem UUID, then they're treated as part of the same filesystem.

If they are actually copies of each other (copied by dd or LVM snapshot, or any other method), then mounting either one of them could cause data corruption in one or both of them.

If, for example, you make an LVM snapshot of a btrfs filesystem, you can't mount either the LVM snapshot or the original. The kernel gets confused because it thinks it's mounting a Btrfs filesystem that consists of two disks, and then it runs into two devices that have the same device number.
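As a rough illustration of the condition described above, the following Python sketch looks for btrfs filesystem UUIDs that show up on more than one block device. It assumes its input is the JSON printed by `lsblk --json -o NAME,FSTYPE,UUID`; the function name `duplicate_btrfs_uuids` is made up for this example. A legitimate multi-device filesystem also shares one UUID across its member devices, so a hit is a reason to look closer, not proof that a copy exists.

```python
import json
from collections import defaultdict


def duplicate_btrfs_uuids(lsblk_json):
    """Map each btrfs UUID seen on 2+ devices to the sorted device names.

    Assumption of this sketch: `lsblk_json` is the output of
    `lsblk --json -o NAME,FSTYPE,UUID`.
    """
    by_uuid = defaultdict(set)

    def walk(nodes):
        for node in nodes:
            if node.get("fstype") == "btrfs" and node.get("uuid"):
                by_uuid[node["uuid"]].add(node["name"])
            # Partitions and LVM volumes are nested under "children".
            walk(node.get("children") or [])

    walk(json.loads(lsblk_json).get("blockdevices", []))
    return {
        uuid: sorted(names)
        for uuid, names in by_uuid.items()
        if len(names) > 1
    }
```

Running this only reads a device listing; it does not scan or mount anything itself.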

Is there no way out of this?

While it's technically possible to have block device copies around as long as you don't try to use mount, you have to be extremely careful with this. In most distributions udev runs btrfs device scan automatically when a block device is discovered. Also, programs like os-prober exist which will try to look into filesystems when you might not expect it. So, don't leave anything to chance when trying this.

Some options you have to hide filesystem copies are:

  • Copy the filesystem into a file. This is harmless in itself, because the file is not visible as a block device.
  • Remove one copy from the system (physically, or by deleting the block device or the filesystem) before mounting the other copy.
  • When using LVM, use lvchange -a n <vg>/<lv> to make the block device disappear temporarily. Beware that it can be auto-activated again.

Fragmentation

  • Files with a lot of random writes can become heavily fragmented (10000+ extents), causing thrashing on HDDs and excessive multi-second spikes of CPU load on systems with an SSD or a large amount of RAM.
    • On servers and workstations this affects databases and virtual machine images.
      • The nodatacow mount option may be of use here, with associated gotchas.
    • On desktops this primarily affects application databases (including Firefox and Chromium profiles, GNOME Zeitgeist, Ubuntu Desktop Couch, Banshee, and Evolution's datastore.)
      • Workarounds include manually defragmenting your home directory using btrfs fi defragment. Auto-defragment (mount option autodefrag) should solve this problem in 3.0.
    • Symptoms include btrfs-transacti and btrfs-endio-wri taking up a lot of CPU time (in spikes, possibly triggered by syncs). You can use filefrag to locate heavily fragmented files (may not work correctly with compression).
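To make the filefrag suggestion above concrete, here is a minimal Python sketch that picks heavily fragmented files out of filefrag's summary output. It assumes each line has the form `<path>: <N> extents found`; the helper name `heavily_fragmented` is hypothetical, and the default threshold is taken from the 10000+ figure mentioned above. As noted, the extent counts may not be accurate for compressed files.

```python
import re

# Summary line as printed by filefrag, e.g. "/var/lib/db.sqlite: 12345 extents found".
_FILEFRAG_LINE = re.compile(r"^(?P<path>.*): (?P<count>\d+) extents? found$")


def heavily_fragmented(filefrag_output, threshold=10000):
    """Return (path, extents) pairs at or above `threshold`, worst first."""
    hits = []
    for line in filefrag_output.splitlines():
        match = _FILEFRAG_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a summary line
        extents = int(match.group("count"))
        if extents >= threshold:
            hits.append((match.group("path"), extents))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

The files it reports are candidates for btrfs fi defragment or for the nodatacow approach, with the caveats listed above.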

8TiB limit on 32-bit systems

Because of various implementation limitations on 32-bit systems:

  • It is possible to create Btrfs volumes larger than 8TiB.
  • However, various Btrfs tools, among them btrfs check and btrfs receive, do not support Btrfs volumes larger than 8TiB on 32-bit systems.

It is also possible that, on 32-bit systems, the limited per-process address space prevents the tools from handling very complex volumes (many inodes, many subvolumes, many hard links, ...) even when those volumes are smaller than 8TiB, because the tools run out of memory.

Parity RAID

  • Currently the raid5 and raid6 profiles have flaws, and their use is strongly discouraged, as per the Status page.
    • In less recent releases the parity of resynchronized blocks was not calculated correctly. This has been fixed in recent releases (TBD).
    • If a crash happens while a raid5/raid6 volume is being written, this can result in a "transid" mismatch, as in transid verify failed.
    • The resulting corruption cannot currently be fixed.

Some kernels had problems with "snapshot-aware defrag"

  • Problems were found with snapshot-aware defrag, so it has been turned off in the following kernels:
    • 3.10.31, 3.12.12, 3.13.4
    • all newer than 3.14


For stable kernel versions 4.14.x, 4.9.x, 4.4.x

Having many subvolumes can be very slow

The cost of several operations, including currently balance, device delete and fs resize (shrinking), is proportional to the number of subvolumes, including snapshots, and (slightly super-linearly) the number of extents in the subvolumes.

This is "obvious" for "pure" subvolumes, as each is an independent file tree and has independent extents anyhow (except for ref-linked ones). In the case of snapshots, however, metadata and extents are (usually) largely ref-linked with the ancestor subvolume, so a full scan of the snapshot should not be necessary. Currently it happens anyway.

This means that subvolumes with more than a dozen snapshots can greatly slow down balance and device delete. The multiple tree walks involve both high CPU and IOPS usage. Schemes that snapshot a volume periodically should therefore set a low upper limit on the number of snapshots that are retained.
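A retention limit of this kind can be sketched in a few lines of Python. The function below only decides which snapshots fall outside the limit; it does not delete anything. The name `snapshots_to_delete`, the (name, timestamp) input shape, and the default limit of 12 (taken from the "dozen" mentioned above) are assumptions made for the example, not a recommendation from the btrfs developers.

```python
def snapshots_to_delete(snapshots, keep=12):
    """Return the names of snapshots beyond the `keep` newest ones.

    `snapshots` is an iterable of (name, created) pairs, where `created`
    is any sortable timestamp (a datetime or an ISO 8601 string, say).
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    newest_first = sorted(snapshots, key=lambda snap: snap[1], reverse=True)
    # Everything after the first `keep` entries is over the limit.
    return [name for name, _created in newest_first[keep:]]
```

A periodic snapshot job could call this after each new snapshot and remove the returned names with whatever tool it already uses.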

For stable kernel versions 4.9.x, 4.4.x

raid1 volumes only mountable once RW if degraded

Even if there are no single profile chunks, a raid1 volume that becomes degraded may be mounted read-write only once, with the options -o degraded,rw.

Notes:

  • This does not happen (reportedly) when there are more than 2 devices.
  • This does not happen with raid10 profile volumes.
  • This is often due to the "Incomplete chunk conversion" issue, where there are single chunks left.

Possible recoveries:

  • If it is still read-write, you can convert the chunks from profile raid1 to profile single (or profile dup), if you have enough space.
  • If it is still read-write, you can btrfs device replace the missing device.
  • When a raid1 volume is stuck as read-only for either reason it can only be recovered by dumping its contents, recreating it and restoring the contents.

This mailing list message describes some recovery patches that might help avoid dumping, recreating and restoring, at your own risk.

Implicit conversion dup to raid1 after adding a block device

When all chunks of a single device volume are "allocated" a balance operation cannot start, and the suggested solution is to add a temporary block device of at least 1GiB (better 2-4GiB).

The subsequent balance turns existing dup profile chunks, usually for metadata, into raid1 profile chunks. This prevents the removal of the temporary block device. Workarounds:

  • Start the balance operation with -musage=0 so metadata chunks are not balanced, so they are not converted to raid1.
  • Start the balance operation with *any* value of -musage=, even 99 or 100, as the implicit conversion does not happen along with compaction if there is such a value.
  • If converted to raid1, convert them explicitly to single, remove the temporary block device, and convert them back to dup.
  • Use btrfs-progs version 4.7.2 or newer, as this allows converting back raid1 chunks to dup even on a multidevice volume.

Note: by itself adding a new device does not cause the conversion from dup to raid1 profile, it is the balance without options that does it implicitly.
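The first workaround can be written down as an ordered plan. The Python sketch below only builds the command lines as argument lists and runs nothing. The function name, the example paths, and the data filter value `-dusage=50` are illustrative assumptions; only the `-musage=0` filter comes from the text above.

```python
def temporary_device_balance_plan(mount_point, temp_device, data_usage=50):
    """Return the commands for the -musage=0 workaround as argv lists.

    Nothing is executed here; the caller decides whether to run them.
    """
    return [
        # Add the temporary device so that balance has room to work.
        ["btrfs", "device", "add", temp_device, mount_point],
        # -musage=0 limits the metadata side to empty chunks, so existing
        # dup metadata chunks are not rewritten as raid1.
        ["btrfs", "balance", "start",
         f"-dusage={data_usage}", "-musage=0", mount_point],
        # Remove the temporary device again once space has been freed.
        ["btrfs", "device", "delete", temp_device, mount_point],
    ]
```

For example, `temporary_device_balance_plan("/mnt", "/dev/sdx")` returns three commands to review before running any of them by hand.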

Direct IO and CRCs

Direct IO writes to Btrfs files can result in checksum warnings. This can happen with other filesystems, but most don't have checksums, so a mismatch between (updated) data and (out-of-date) checksum cannot arise.

This is the issue described in this email: "where the application will modify the page while it's inflight" (see also this article on stable writes). This results in checksum verification messages that are warnings instead of errors, as in for example:

BTRFS warning (device dm-1): csum failed ino 252784 off 62910578688 csum 802263983 expected csum 4110970844

Details of affected versions TBD.
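When many such warnings show up in a log, it can help to extract the inode and offset from each one. The following Python sketch parses a line in exactly the format of the example above. The function name `parse_csum_warning` is hypothetical, and other kernel versions may word the message differently, in which case the function returns None.

```python
import re

# Matches only the message format shown in the example above.
_CSUM_WARNING = re.compile(
    r"BTRFS warning \(device (?P<device>[^)]+)\): csum failed "
    r"ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<csum>\d+) expected csum (?P<expected>\d+)"
)


def parse_csum_warning(line):
    """Return the fields of a csum warning as a dict, or None if no match."""
    match = _CSUM_WARNING.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Keep the device name as text; convert the numeric fields to integers.
    return {
        key: value if key == "device" else int(value)
        for key, value in fields.items()
    }
```

The inode number can then be mapped back to a path, for example with btrfs inspect-internal inode-resolve, to find out which application is doing the direct IO.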

Conversion from ext4 may not be undoable

  • In kernels 4.0+, empty block groups are reclaimed automatically. This can have the following effect:
    • a converted filesystem may not be able to do a rollback because of the removed block groups

ssd mount option

Using the ssd mount option with kernels older than 4.14 has a negative impact on the usability and lifetime of modern SSDs. This is fixed in 4.14; see this commit for more information: [1]

With 4.14+ it is safe and recommended again to use the ssd mount option for non-rotational storage.
