Gotchas


This page lists problems one might face when trying btrfs. Some of these are not really bugs, but rather inconveniences about things not yet implemented or design decisions that are not yet documented.

Please add new items below, and don't forget to note the version on which you observed the problem.


Issues

Affecting all versions

Block-level copies of devices

Do NOT...

  • make a block-level copy of a Btrfs filesystem to another block device...
  • use LVM snapshots, or any other kind of block level snapshots...
  • turn a copy of a filesystem that is stored in a file into a block device with the loopback driver...

... and then try to mount either the original or the snapshot while both are visible to the same kernel.

Why?

If there are multiple block devices visible at the same time, and those block devices have the same filesystem UUID, then they're treated as part of the same filesystem.

If they are actually copies of each other (copied by dd or LVM snapshot, or any other method), then mounting either one of them could cause data corruption in one or both of them.

If you make an LVM snapshot of a Btrfs filesystem, for example, you can't mount either the snapshot or the original: the kernel gets confused because it thinks it is mounting a Btrfs filesystem that consists of two disks, and then runs into two devices that have the same device number.

Is there no way out of this?

While it's technically possible to keep block device copies around as long as you never mount them, you have to be extremely careful. In most distributions udev runs btrfs device scan automatically when a block device is discovered, and programs like os-prober will look into filesystems when you might not expect it. So don't leave anything to chance when trying this.

Some options you have to hide filesystem copies are:

  • Copy the filesystem into a file instead of onto a block device; this is harmless in itself, because the file will not be visible as a block device.
  • Remove one copy from the system (physically, or by deleting the block device or the filesystem) before mounting the other copy.
  • When using LVM, lvchange -a n <vg>/<lv> can be used to make the block device disappear temporarily (see the sketch below). But beware that it can be auto-activated again.
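
For illustration, a minimal sketch of hiding an LVM snapshot and checking what the kernel can see; the volume group vg0 and logical volume snap are placeholder names:

# lvchange -a n vg0/snap      # deactivate the snapshot so its block device disappears
# blkid -t TYPE=btrfs         # list visible btrfs block devices; only one per filesystem UUID should remain
# btrfs filesystem show       # confirm which devices the kernel associates with each filesystem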

Fragmentation

  • Files with a lot of random writes can become heavily fragmented (10000+ extents), causing thrashing on HDDs and excessive multi-second spikes of CPU load on systems with an SSD or a large amount of RAM.
    • On servers and workstations this affects databases and virtual machine images.
      • The nodatacow mount option may be of use here, with associated gotchas.
    • On desktops this primarily affects application databases (including Firefox and Chromium profiles, GNOME Zeitgeist, Ubuntu Desktop Couch, Banshee, and Evolution's data store).
      • Workarounds include manually defragmenting your home directory using btrfs fi defragment, as sketched below. Auto-defragment (mount option autodefrag) should solve this problem in 3.0.
    • Symptoms include btrfs-transacti and btrfs-endio-wri taking up a lot of CPU time (in spikes, possibly triggered by syncs). You can use filefrag to locate heavily fragmented files (it may not work correctly with compression).
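
For illustration only (the profile path is a placeholder), checking and defragmenting one such file might look roughly like:

# filefrag ~/.mozilla/firefox/example.default/places.sqlite                      # reports the number of extents
# btrfs filesystem defragment -v ~/.mozilla/firefox/example.default/places.sqlite
# mount -o remount,autodefrag /home                                               # or enable autodefrag for the whole mount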

8TiB limit on 32-bit systems

Because of various implementation limitations on 32-bit systems:

  • It is possible to create Btrfs volumes larger than 8TiB.
  • However, various Btrfs tools, among them btrfs check and btrfs receive, do not support Btrfs volumes larger than 8TiB on 32-bit systems.

It is also possible that, because of the limited per-process address space on 32-bit systems, the tools run out of memory on very complex volumes (many inodes, many subvolumes, many hard links, ...) even if those volumes are smaller than 8TiB.

Version specific

Parity RAID

  • Currently the raid5 and raid6 profiles have flaws that make them strongly not recommended, as noted on the Status page.
    • In less recent releases the parity of resynchronized blocks was not calculated correctly; this has been fixed in recent releases (TBD).
    • If a crash happens while a raid5/raid6 volume is being written, this can result in a "transid" mismatch, as in transid verify failed.
    • The resulting corruption cannot currently be fixed.

Free space cache

  • Currently the free space cache (both v1 and v2) sometimes loses track of free space, and a volume can be reported as having no free space when it obviously does.
    • Fix: disable use of the free space cache with the mount option nospace_cache.
    • Fix: remount the volume with -o remount,clear_cache.
    • Fix: switch to the new free space tree (see the sketch below).
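
A hedged sketch of these options; the device and mount point are placeholders, and the exact way to enable the free space tree may depend on kernel and btrfs-progs versions:

# mount -o nospace_cache /dev/sdb6 /mnt                  # disable the free space cache
# mount -o remount,clear_cache /mnt                      # clear and rebuild the v1 cache
# mount -o clear_cache,space_cache=v2 /dev/sdb6 /mnt     # on a fresh mount: switch to the free space tree (v2)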

Having many subvolumes can be very slow

The cost of several operations, including currently balance and device delete, is proportional to the number of subvolumes, including snapshots, and (slightly super-linearly) the number of extents in the subvolumes.

This is "obvious" for "pure" subvolumes, as each is an independent file tree and has independent extents anyhow (except for ref-linked ones). But in the case of snapshots metadata and extents are (usually) largely ref-linked with the ancestor subvolume, so the full scan of the snapshot need not happen, but currently this happens.

This means that subvolumes with more than a dozen snapshots can greatly slow down balance and device delete. Some relevant quotes from the IRC channel:

  • "when you do device removes on file systems with a lot of snapshots, it is unbelievably slow ... took nearly a week to move 20GB of FS data from one device to the other using that method"
  • "a balance on 2TB of data that was heavily snapshotted - it took 3 months"
  • "try to avoid having more than 8 snapshots due to the overhead"
  • "when I have to do balances ... I delete all the snapshots and allow a few months for the balance to finish"

The multiple tree walks involve both high CPU and IOPS usage. There is a commit in kernel version 4.10-rc1 that greatly reduces CPU usage, but high IOPS usage remains.

This means that schemes that snapshot a volume periodically should set a low upper limit on the number of those snapshots that are retained.
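
As a quick way to check how many snapshots a retention scheme has accumulated (the mount point is a placeholder):

# btrfs subvolume list -s /mnt | wc -l        # count the snapshot subvolumes on the volume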

Incomplete chunk conversion

Volumes created with or converted to various profiles can become read-only on the next mount if they become degraded, and they will then be stuck like that.

This can be detected by running btrfs filesystem df and checking that all data and metadata chunks have the same profile. An incomplete conversion looks like:

#  btrfs fi df /mnt/sdb6
Data, RAID10: total=174.00GiB, used=172.38GiB
Data, single: total=8.00MiB, used=0.00B
System, RAID10: total=96.00MiB, used=48.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID10: total=3.00GiB, used=397.95MiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=144.00MiB, used=0.00B
  • This is because older versions of btrfs-progs mistakenly leave behind one chunk in the old profile on creation or conversion; if a block device then goes missing and that old profile has no redundancy, Btrfs considers the volume fatally damaged.
    • This creation or conversion issue is solved in recent versions of btrfs-progs.
    • It can be fixed by running (possibly repeatedly) the following, as in the worked example after this list:
btrfs balance start -mconvert=$MPROFILE,soft -dconvert=$DPROFILE,soft ...
    • If the volume has already gone read-only, dump, recreate and restore it.
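
For the example volume above, where the intended profile is RAID10 for both metadata and data, the repeated balance might look like this (adjust the profiles and mount point to your own volume):

# btrfs balance start -mconvert=raid10,soft -dconvert=raid10,soft /mnt/sdb6
# btrfs fi df /mnt/sdb6        # verify that no single data or metadata chunks remain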

raid1 volumes only mountable once RW if degraded

Even if there are no single profile chunks, raid1 volumes that become degraded may only be mounted read-write once, using the options -o degraded,rw.

Notes:

  • This does not happen (reportedly) when there are more than 2 devices.
  • This does not happen with raid10 profile volumes.
  • This is often due to the "Incomplete chunk conversion" issue, where there are single chunks left.

Possible recoveries:

  • If it is still read-write, you can convert the chunks from profile raid1 to profile single (or to profile dup, if you have enough space); see the sketch after this list.
  • If it is still read-write, you can btrfs device replace the missing device.
  • When a raid1 volume is stuck as read-only for either reason, it can only be recovered by dumping its contents, recreating it and restoring the contents.
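
Hedged sketches of the first two recoveries; the device names, the devid 2 and the mount point are placeholders:

# mount -o degraded,rw /dev/sdb /mnt                          # the single read-write mount you get
# btrfs balance start -dconvert=single -mconvert=dup /mnt     # drop to profiles that need only one device
# btrfs device remove missing /mnt                            # then remove the record of the missing device

or, if a replacement disk is available:

# btrfs replace start 2 /dev/sdc /mnt                         # replace the missing devid 2 with the new disk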

Implicit conversion dup to raid1 after adding a block device

When all chunks of a single-device volume are "allocated", a balance operation cannot start; the suggested solution is to add a temporary block device of at least 1GiB (better 2-4GiB).

The subsequent balance turns existing dup profile chunks, usually for metadata, into raid1 profile chunks. This prevents the removal of the temporary block device. Workarounds:

  • Start the balance operation with -musage=0 so metadata chunks are not balanced, so they are not converted to raid1.
  • Start the balance operation with *any* value of -musage=, even 99 or 100, as the implicit conversion does not happen along with compaction if there is such a value.
  • If converted to raid1, convert them explicitly to single, remove the temporary block device, and convert them back to dup.
  • Use btrfs-progs version 4.7.2 or newer, as this allows converting back raid1 chunks to dup even on a multidevice volume.

Note: adding a new device does not by itself cause the conversion from dup to raid1 profile; it is the balance run without options that does it implicitly. A sketch of the temporary-device sequence follows.
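
A minimal sketch of that sequence, assuming /dev/sdc is the spare device and /mnt the full volume (both placeholders; the -dusage value is arbitrary):

# btrfs device add /dev/sdc /mnt                    # temporary device so the balance has room to work
# btrfs balance start -dusage=75 -musage=0 /mnt     # compact data chunks; metadata is left in dup, as described above
# btrfs device remove /dev/sdc /mnt                 # remove the temporary device again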

Direct IO including NFS access

Direct IO reads and writes to Btrfs files don't fully implement checksum semantics, particularly in the case of concurrent writes. This is the issue described in this email: "where the application will modify the page while it's inflight" (see also this article on stable pages). This results in checksum verification messages that are warnings instead of errors, for example:

BTRFS warning (device dm-1): csum failed ino 252784 off 62910578688 csum 802263983 expected csum 4110970844

Details of affected versions TBD.

The NFS kernel server accesses files by a mechanism equivalent to direct IO, so checksum semantics for non-direct IO access on an NFS client are not the same as for local access.

Conversion from ext4 may not be undoable

  • In kernels 4.0+, empty block groups are reclaimed automatically, which can affect the following:
    • a converted filesystem may not be able to do a rollback because of the removed block groups

Some kernels had problems with "snapshot-aware defrag"

  • Problems were found with snapshot-aware defrag, which has been turned off in the following kernels:
    • 3.10.31, 3.12.12, 3.13.4
    • all newer than 3.14

Historical references

List of issues going back 18 months from current release (kernels: 3.14+, date: Mar 2014). Older issues will be moved to a separate page.

  • Stable kernel version 4.0.6 fixes a regression in raid1 conversion; the conversion works fine on 3.19 and 4.1
    • conversion from e.g. single or raid0 profiles to raid1 made no change to the filesystem
  • Stable kernel version 3.19.1+ can cause a deadlock at mount time
    • Fixed in 3.19.5, 3.14.39
    • workaround: boot with an older kernel, or run btrfs-zero-log to clear the log. This will lose up to the last 30 seconds of writes to the filesystem. You will have to reboot after running the btrfs-zero-log command to clear the jammed locks.
    • fix: scheduled for 3.19.5, or apply 9c4f61f01d269815bb7c37.
    • also affected: 3.14.35+, 3.18.9+
  • bcache + btrfs was not stable with old kernels but is apparently OK with 3.19+
  • Versions from 3.15 up to 3.16.1 suffer from a deadlock that was observed during heavy rsync workloads with compression on; it's recommended to use 3.16.2 or newer