From btrfs Wiki
Revision as of 17:21, 4 April 2011


Important Questions

I see a warning in dmesg about barriers being disabled when mounting my filesystem. What does that mean?

Your hard drive has been detected as not supporting barriers. This is a severe condition, which can result in full file-system corruption, not just losing or corrupting data that was being written at the time of the power cut or crash. There is only one certain way to work around this:

Note: Disable the write cache on each drive by running hdparm -W0 /dev/sda against each drive on every boot.

Failure to perform this can result in massive and possibly irrecoverable corruption (especially in the case of encrypted filesystems).
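A boot-time sketch of the workaround (hypothetical: assumes your drives appear as /dev/sd[a-z]; adjust the list and call it from e.g. /etc/rc.local):

```shell
#!/bin/sh
# Turn off the write cache on each drive passed as an argument.
# The drive list in the commented call below is an assumption --
# substitute the devices actually present on your system.
disable_write_caches() {
    for dev in "$@"; do
        hdparm -W0 "$dev"
    done
}

# run on every boot, e.g. from /etc/rc.local:
# disable_write_caches /dev/sd[a-z]
```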

Help! I ran out of disk space!

Help! Btrfs claims I'm out of space, but it looks like I should have lots left!

Free space is a tricky concept in Btrfs. This is especially apparent when running low on it. Read "Why are there so many ways to check the amount of free space?" below for the blow-by-blow.

If you're on 2.6.32 or older

You should consider upgrading. The error behaviour of Btrfs has significantly improved, such that you get a nice proper ENOSPC instead of an OOPS or worse. There may be backports of Btrfs eventually, but it currently relies on infrastructure and patches outside of the fs tree which make a backport trickier to manage without compromising the stability of your stable kernel.

If your device is small

E.g., on a 4GB flash card, your main problem is the large block allocation size, which doesn't allow for much breathing room. A btrfs fi balance may get you working again, but it's probably only a short-term fix, as the metadata-to-data ratio probably won't match the block allocations.

If you can afford to delete files, you can clobber a file via echo > /path/to/file, which will recover that space without requiring a new metadata allocation (which would otherwise ENOSPC again).

You might consider remounting with -o compress, and either rewrite particular files in-place, or run btrfs fi defragment to recompress everything. This may take a while.

Next, depending on whether your metadata block group or the data block group filled up, you can recreate your filesystem and mount it with metadata_ratio=, setting the value up or down from the default of 8 (i.e., 4 if metadata ran out first, 12 if data ran out first). This can be changed at any time by remounting, but will only affect new block allocations.

Finally, the best solution is to upgrade to 2.6.37 and recreate the filesystem to take advantage of mixed block groups, which avoid effectively-fixed allocation sizes on small devices. Note that this incurs a fragmentation overhead, and currently cannot be converted back to normal split metadata/data groups without recreating the partition. Using mixed block groups is currently (kernel 2.6.37) only recommended for filesystems of 1GiB or smaller.

If your device is large (>16GB)

sudo btrfs fi show /dev/device should show no free space on any drive.

It may show unallocated space if you're using raid1 with two drives of different sizes (and possibly something similar with larger drives). This is normal in itself, as Btrfs will not write both copies to the same device, but you still have an ENOSPC condition.

btrfs fi df /mountpoint will probably report available space in both metadata and data. The problem here is that one particular 256MB or 1GB block is full, and wants to allocate another whole block. The easy fix is to run btrfs fi balance /mountpoint. This will take a while (although the system is otherwise usable during this time), but when completed, you should be able to use most of the remaining space. We know this isn't ideal, and there are plans to improve the behavior. Running close to empty is rarely the ideal case, but we can get far closer to full than we do.

In a more time-critical situation, you can reclaim space by clobbering a file via echo > /path/to/file. This will delete the contents, allowing the space to be reclaimed, but without requiring a metadata allocation. Get out of the tight spot, and then balance as above.
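The clobber trick in script form (a sketch; `: >` truncates without writing even a newline, and the commented paths are placeholders):

```shell
#!/bin/sh
# Truncate an expendable file in place: its extents are freed without
# needing a fresh metadata allocation (which could itself hit ENOSPC).
clobber() {
    : > "$1"
}

# clobber /path/to/expendable/file
# btrfs fi balance /mountpoint    # then rebalance once out of the tight spot
```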

Performance vs Correctness

Does Btrfs have data=ordered mode like Ext3?

In v0.16, Btrfs waits until data extents are on disk before updating metadata. This ensures that stale data isn't exposed after a crash, and that file data is consistent with the checksums stored in the btree after a crash.

Note that you may get zero-length files after a crash, see the next questions for more info.

Btrfs does not force all dirty data to disk on every fsync or O_SYNC operation; fsync is designed to be fast.

What are the crash guarantees of overwrite-by-rename?

Overwriting an existing file using a rename is atomic. That means that either the old content of the file is there or the new content. A sequence like this:

echo "oldcontent" > file
sync  # make sure oldcontent is on disk

echo "newcontent" > file.tmp
mv -f file.tmp file

# *crash*
Will give either
  1. file contains "newcontent"; file.tmp does not exist
  2. file contains "oldcontent"; file.tmp may contain "newcontent", be zero-length, or not exist at all.

What are the crash guarantees of rename?

Renames NOT overwriting existing files do not give additional guarantees. This means that a sequence like

echo "content" > file.tmp
mv file.tmp file

# *crash*
will most likely give you a zero-length "file". The sequence can give you either
  1. Neither file nor file.tmp exists
  2. Either file.tmp or file exists, and is zero-length or contains "content"

For more info see this thread: http://thread.gmane.org/gmane.comp.file-systems.btrfs/5599/focus=5623

Can the data=ordered mode be turned off in Btrfs?

No, it is an important part of keeping data and checksums consistent. The Btrfs data=ordered mode is very fast and turning it off is not required for good performance.

What checksum function does Btrfs use?

Currently Btrfs uses crc32c for data and metadata. The disk format has room for 256 bits of checksum for metadata and up to a full leaf block (roughly 4k or more) for data blocks. Over time we'll add support for more checksum alternatives.

Can data checksumming be turned off?

Yes, you can disable it by mounting with -o nodatasum.

Can copy-on-write be turned off for data blocks?

Yes, you can disable it by mounting with -o nodatacow. This implies -o nodatasum as well. COW may still happen if a snapshot is taken.

Common questions

How do I do...?

See also the UseCases page.

I have converted my ext4 partition into Btrfs, how do I delete the ext2_saved folder?

Note: the new command btrfs is replacing the old one btrfsctl. In the examples below both commands are shown.

Use "btrfs subvolume delete" or "btrfsctl -D", with btrfs-progs from Git.

When will Btrfs have an fsck-like tool?

Check back soon.

(2011-01-12) A "scanning fsck" (which will be able to do things like find and fix missing superblocks) is "finally almost ready", according to cmason on IRC.

Why does df show incorrect free space for my RAID volume?

Why are there so many ways to check the amount of free space?

Because there are so many ways to answer the question.

Free space in Btrfs is a tricky concept, owing partly to the features it provides, and owing partly to the difficulty in sorting out what exactly you want to know at the moment you ask. Eventually somebody will figure out a sane solution that doesn't grossly misrepresent the situation depending on the phase of the moon, but until then...

Currently, the system's df command will suffice for purposes that don't require precision. It's almost completely sufficient, as long as you're aware of the raid level in use, so that you can multiply or divide the numbers accordingly. (This will become oh so much fun once we support different raid levels per-subvolume). Anything more precise needs some background on how Btrfs manages space.

Space in Btrfs exists in a pool of the sum total of all the drives/partitions included in the fs. From this pool, large blocks are allocated to the metadata block group (in 256MB allocations, or the remainder) and the data block group (in ~1GB allocations, or the remainder) as necessary. When a file is written, space from two metadata groups (by default) and one data group is required (or more, depending straightforwardly on the raid level). So, any given block has an amount of free space not currently allocated to files or metadata, and in addition the pool itself has free space that hasn't been allocated to a block group. Nearly all of the variation between the tools comes from these distinctions, and each tool has its quirks.

du (os builtin)

Included for completeness, du is very roughly equivalent to the "used" size reported by the other tools. Specifically, it will report the space that would be required if the folder was written to an uncompressed tarball. It will not reflect metadata or compression, and will double-count snapshots (causing the total to potentially be massively high). As such, there is no straightforward way to take its total and convert it to the "used" bytes reported by the other tools.

df (os builtin)

user@machine:~$ df -h /
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             894G  311G  583G  35% /

df will return:

  • The device name used to mount the filesystem
  • The sum total of space available across drives in the filesystem (currently the only place to directly read this)
  • The total bytes used to store files (including metadata).
  • The total free bytes available to this mountpoint, which includes unused bytes within the relevant block groups as well as bytes that haven't yet been allocated to a block group.

btrfs filesystem df /mountpoint

user@machine:~$ btrfs fi df /
Metadata: total=18.00GB, used=6.10GB
Data: total=358.00GB, used=298.37GB
System: total=12.00MB, used=40.00KB

Btrfs' native df will display:

  • The total number of bytes storable in metadata block group, taking raid level into account
  • The amount of that space which is in use
  • The raid level (2.6.37 or later) of that group
  • The total number of bytes storable in data block group, taking raid level into account
  • The amount of that space which is in use
  • The raid level (2.6.37 or later) of that group

Notably absent is the total amount of space available and total in the pool; we should probably add this.

btrfs filesystem show /dev/deviceName

sudo btrfs fi show /dev/sda1                                 #root required!
Label: none  uuid: 12345678-1234-5678-1234-1234567890ab
	Total devices 2 FS bytes used 304.48GB
	devid    1 size 427.24GB used 197.01GB path /dev/sda1
	devid    2 size 465.76GB used 197.01GB path /dev/sdc1

Btrfs' device summary will display:

  • uuid of the filesystem
  • Total number of devices
  • Total bytes used
  • A list of each device including
    • Device size
    • Bytes used on that device
    • Path to the device

Notably absent again is the total amount of space available and total in the pool, although those can be calculated in this case.

Also note that root is required.

Two relations to note

user@machine:~$ df -h /
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             894G  311G  583G  35% /
user@machine:~$ btrfs fi df /
Metadata: total=18.00GB, >>used=6.10GB<<  *2=  12.20GB
Data: total=358.00GB, >>used=298.37GB<<   *1= 298.37GB
System: total=12.00MB, >>used=40.00KB<<   *1=   0.00GB
                                           == 310.57GB 
                                           ~~ 311   GB
user@machine:~$ sudo btrfs fi show /dev/sda1
Label: none  uuid: 12345678-1234-5678-1234-1234567890ab
	Total devices 2 FS bytes used 304.48GB
	devid    1 size 427.24GB used >>197.01GB<< path /dev/sda1
	devid    2 size 465.76GB used >>197.01GB<< path /dev/sdc1

user@machine:~$ btrfs fi df /
Metadata: >>total=18.00GB<<, used=6.10GB   *2=  36.00GB
Data: >>total=358.00GB<<, used=298.37GB    *1= 358.00GB
System: >>total=12.00MB<<, used=40.00KB    *2=   0.02GB
                                            == 394.02GB

Yes, this is more complicated than it needs to be. Yes, we'll fix it, as soon as we've sorted out how other features play into this.

Why is there so much space overhead?

There are several things meant by this. One is the out-of-space issues discussed above; this is a known deficiency, which can be worked around, and will eventually be worked around properly. The other meaning is the size of the metadata block group compared to the data block group. Note that you should not compare the size of the allocations, but rather the used space within the allocations.

There are several considerations:

  • The default raid level for the metadata group is dup on single drive systems, and raid1 on multi drive systems. The meaning is the same in both cases: there's two copies of everything in that group. This can be disabled at mkfs time, and it will eventually be possible to migrate raid levels online.
  • There is an overhead to maintaining the checksums [XXX: percentage?]
  • Small files are also written inline into the metadata group. If you have several GB of very small files, this will add up.

[incomplete; disabling features, etc]

What is a snapshot?

  • A snapshot is a frozen image of all the files and directories. For example, if you have two files ("a" and "b"), you take a snapshot, and you delete "b", the file you just deleted is still available in the snapshot you took. The great thing about Btrfs snapshots is that you can operate on any files or directories, whereas an LVM snapshot always covers the whole logical volume.
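For instance (the paths are made-up examples):

```shell
#!/bin/sh
# Snapshot a subvolume.  Afterwards, deleting "b" from the source
# leaves it intact inside the snapshot.  Paths are placeholders.
take_snapshot() {
    btrfs subvolume snapshot "$1" "$2"
}

# take_snapshot /home /home_today_00
```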

snapshot example

  • Since restores from tape are a pain, here are the thoughts of a lazy sysadmin who creates the home directory as a Btrfs filesystem for their users; let's try some fancy network-attached storage ideas.
    • /home
      • There could then be a snapshot every 6 hours via cron:
        • /home_today_00, /home_today_06, /home_today_12, /home_today_18

The logic, run from cron at midnight, would look something like this for a rolling 3-day rotation:

  • delete /home_backday_3
  • rename /home_backday_2 to /home_backday_3
  • rename /home_backday_1 to /home_backday_2
  • rename /home_today_00 to /home_backday_1, and likewise rename /home_today_06 to /home_backday_06 (and so on for all hours 06..18)
  • create a symbolic link /home_backday_00 that points to the real directory /home_backday_1
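The midnight rotation above can be sketched as a shell script (BASE and the snapshot names follow the hypothetical layout in this example, and the btrfs subvolume delete call assumes the oldest snapshot is a subvolume):

```shell
#!/bin/sh
# Rolling 3-day rotation, intended to run from cron at midnight.
# BASE and the names below are assumptions from the example layout.
BASE=${BASE:-/}

rotate() {
    # drop the oldest day first, then shift the survivors down
    btrfs subvolume delete "$BASE/home_backday_3" 2>/dev/null
    mv "$BASE/home_backday_2" "$BASE/home_backday_3" 2>/dev/null
    mv "$BASE/home_backday_1" "$BASE/home_backday_2" 2>/dev/null
    # yesterday's midnight snapshot becomes backday_1
    mv "$BASE/home_today_00" "$BASE/home_backday_1" 2>/dev/null
}

# rotate
```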

What is a subvolume?

A subvolume is like a directory - it has a name, there's nothing in it when it is created, and it can hold files and other directories. There's at least one subvolume in every Btrfs filesystem, the "root" subvolume.

The equivalent in Ext4 would be a filesystem. Each subvolume behaves as an individual filesystem. The difference is that in Ext4 you create each filesystem in its own partition; in Btrfs, however, all the storage is in the 'pool', and subvolumes are created from the pool, so you don't need to partition anything. You can create as many subvolumes as you want, as long as you have storage capacity.
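For example (the mountpoint and name are placeholders):

```shell
#!/bin/sh
# Create a subvolume straight out of the pool -- no partitioning step.
# The path is an example; any location inside the Btrfs mount works.
new_subvolume() {
    btrfs subvolume create "$1"
}

# new_subvolume /mnt/projects
```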

Resizing partitions (shrink/grow)

Note: the new command btrfs is replacing the old one btrfsctl. In the examples below both commands are shown.

In order to demonstrate and test the back references, the Btrfs devel team has added an online resizer, which can both grow and shrink the filesystem via the btrfsctl or btrfs commands.

Mount the filesystem

mount -t btrfs /dev/xxx /mnt

Add 2GB to the FS

btrfs filesystem resize +2G /mnt


btrfsctl -r +2g /mnt

Shrink the FS by 4GB

btrfs filesystem resize -4g /mnt


btrfsctl -r -4g /mnt

Explicitly set the FS size

btrfsctl -r 20g /mnt


btrfs filesystem resize 20g /mnt

Use 'max' to grow the FS to the limit of the device

btrfs filesystem resize max /mnt


btrfsctl -r max /mnt

Can I use RAID[56] on my Btrfs filesystem?

Not yet. There are some patches from Dave Woodhouse on the mailing list, but they are unfinished and not yet committed to Git. Rumour has it 2.6.39 will be the magic number.

(The patches didn't make it to the 2.6.38 merge window after 2.6.37, so the earliest they'll make it is 2.6.39, now).

Is Btrfs optimized for SSD?

There are some optimizations for SSD drives, and you can enable them by mounting with -o ssd. As of 2.6.31-rc1, this mount option will be enabled if Btrfs is able to detect non-rotating storage. SSD is going to be a big part of future storage, and the Btrfs developers plan on tuning for it heavily.

What is the difference between mount -o ssd and mount -o ssd_spread?

Mounting with -o ssd_spread is more strict about finding a large unused region of the disk for new allocations, which tends to fragment the free space more over time. It is often faster on the less expensive SSD devices. The default for autodetected SSD devices is mount -o ssd.

Will Btrfs be in the mainline Linux Kernel?

Btrfs is already in the mainline Linux kernel. It was merged on 9th January 2009, and was available in the Linux 2.6.29 release.

Does Btrfs run with older kernels?

v0.16 of Btrfs maintains compatibility with kernels back to 2.6.18. Kernels older than that will not work.

The current Btrfs unstable repositories only work against the mainline kernel. Once Btrfs is in mainline a backport repository will be created again.

How long will the Btrfs disk format keep changing?

The Btrfs disk format is not finalized, but it won't change unless a critical bug is found and no workarounds are possible. Not all the features have been implemented, but the current format is extensible enough to add those features later without requiring users to reformat.

How do I upgrade to the 2.6.31 format?

The 2.6.31 kernel can read and write Btrfs filesystems created by older kernels, but it writes a slightly different format for the extent allocation trees. Once you have mounted with 2.6.31, the stock Btrfs in 2.6.30 and older kernels will not be able to mount your filesystem.

We don't want to force people onto 2.6.31 only, and so the new-format code is available against 2.6.30 as well. All fixes will also be maintained against 2.6.30. For details on downloading, see the Btrfs source repositories.

About the project

Does the Btrfs multi-device support make it a "rampant layering violation"?

Yes and no. Device management is a complex subject, and there are many different opinions about the best way to do it. Internally, the Btrfs code separates out components that deal with device management and maintains its own layers for them. The vast majority of filesystem metadata has no idea there are multiple devices involved.

Many advanced features such as checking alternate mirrors for good copies of a corrupted block are meant to be used with RAID implementations below the FS.

What is CRFS? Is it related to Btrfs?

[CRFS] is a network file system protocol. It was designed at around the same time as Btrfs. Its wire format uses some Btrfs disk formats, and crfsd, a CRFS server implementation, uses Btrfs to store data on disk. More information can be found at http://oss.oracle.com/projects/crfs/ and http://en.wikipedia.org/wiki/CRFS

Will Btrfs become a clustered file system?

No. Btrfs' main goal right now is to be the best non-cluster file system.

If you want a cluster file system, there are many production choices, which can be found in the Distributed file systems section on Wikipedia. Keep in mind that each file system has its own pluses and minuses, so find the best fit for your environment. Most have a set maximum cluster size, and whether that would work in your environment is the question you have to answer.

The closest cluster file system that uses Btrfs as its underlying file system is Ceph
