Revision as of 18:27, 25 April 2013
I have a problem with my btrfs filesystem!
See the Problem FAQ for commonly-encountered problems and solutions.
Please report bugs and issues to the mailing list (you are not required to subscribe). Optionally, you can use the bugzilla on kernel.org. Never use the bugzilla on the "original" Btrfs project page at Oracle.
I see a warning in dmesg about barriers being disabled when mounting my filesystem. What does that mean?
Your hard drive has been detected as not supporting barriers. This is a severe condition, which can result in full file-system corruption, not just losing or corrupting data that was being written at the time of the power cut or crash. There is only one certain way to work around this:
Failure to perform this can result in massive and possibly irrecoverable corruption (especially in the case of encrypted filesystems).
Help! I ran out of disk space!
Help! Btrfs claims I'm out of space, but it looks like I should have lots left!
Free space is a tricky concept in Btrfs. This is especially apparent when running low on it. Read "Why are there so many ways to check the amount of free space?" below for the blow-by-blow.
if you're on 2.6.32 or older
You should upgrade your kernel, right now. The error behaviour of Btrfs has significantly improved, such that you get a nice proper ENOSPC instead of an OOPS or worse. There may be backports of Btrfs eventually, but it currently relies on infrastructure and patches outside of the fs tree which make a backport trickier to manage without compromising the stability of your stable kernel.
if your device is small
i.e., a 4GiB flash card: your main problem is the large block allocation size, which doesn't allow for much breathing room. A btrfs fi balance may get you working again, but it's probably only a short term fix, as the metadata to data ratio probably won't match the block allocations.
If you can afford to delete files, you can clobber a file via echo > /path/to/file, which will recover that space without requiring a new metadata allocation (which would otherwise ENOSPC again).
You might consider remounting with -o compress, and either rewrite particular files in-place, or run btrfs fi defragment to recompress everything. This may take a while.
Next, depending on whether your metadata block group or the data block group is filled up, you can recreate your filesystem and mount it with metadata_ratio=, setting the value up or down from the default of 8 (i.e., 4 if metadata ran out first, 12 if data ran out first). This can be changed at any time by remounting, but will only affect new block allocations.
Finally, the best solution is to upgrade to at least 2.6.37 (or the latest stable kernel) and recreate the filesystem to take advantage of mixed block groups, which avoid effectively-fixed allocation sizes on small devices. Note that this incurs a fragmentation overhead, and currently cannot be converted back to normal split metadata/data groups without recreating the partition. Using mixed block groups is recommended for filesystems of 1GiB or smaller and mkfs.btrfs will force mixed block groups automatically in that case.
if your device is large (>16GiB)
sudo btrfs fi show /dev/device should show no free space on any drive.
It may show unallocated space if you're using raid1 with two drives of different sizes, and possibly similar with larger drives. This is normal in itself, as Btrfs will not write both copies to the same device, but you still have an ENOSPC condition.
btrfs fi df /mountpoint will probably report available space in both metadata and data. The problem here is that one particular 256MiB or 1GiB block is full, and wants to allocate another whole block. The easy fix is to run btrfs fi balance /mountpoint -dusage=5. This may take a while (although the system is otherwise usable during this time), but when completed, you should be able to use most of the remaining space. We know this isn't ideal, and there are plans to improve the behavior. Running close to empty is rarely the ideal case, but we can get far closer to full than we do.
In a more time-critical situation, you can reclaim space by clobbering a file via true > /path/to/file. This will delete the contents, allowing the space to be reclaimed, but without requiring a metadata allocation. Get out of the tight spot, and then balance as above.
If clobbering the file does not work, mount with the 'nodatacow' option and try again (tried with the 3.2.20 kernel for Ubuntu Precise). The reason is that in some cases the file is already snapshotted in a non-obvious way (like a file on a converted ext4 filesystem). With 'nodatacow' you can be sure that no new metadata is allocated when the file is overwritten.
Significant improvements in the way that btrfs handles ENOSPC are incorporated in most new kernel releases, so you should also upgrade to the latest kernel if you are not already using it.
If you've tried btrfs fi balance /mountpoint -dusage=5 and it takes an abnormal amount of time (a normal time is about 20 hours for 1 TB), it may never finish. If mounting with the 'nodatacow' option does not help either, mount with both the 'skip_balance' and 'nodatacow' options, and then try the methods described above.
Are btrfs changes backported to stable kernel releases?
The stable kernel releases do not receive regular btrfs backports unless it's a serious bugfix.
Performance vs Correctness
Does Btrfs have data=ordered mode like Ext3?
In v0.16, Btrfs waits until data extents are on disk before updating metadata. This ensures that stale data isn't exposed after a crash, and that file data is consistent with the checksums stored in the btree after a crash.
Note that you may get zero-length files after a crash, see the next questions for more info.
Btrfs does not force all dirty data to disk on every fsync or O_SYNC operation; fsync is designed to be fast.
What are the crash guarantees of overwrite-by-rename?
Overwriting an existing file using a rename is atomic. That means that either the old content of the file is there or the new content. A sequence like this:
echo "oldcontent" > file
# make sure oldcontent is on disk
sync
echo "newcontent" > file.tmp
mv -f file.tmp file
# *crash*
Will give either
- file contains "newcontent"; file.tmp does not exist
- file contains "oldcontent"; file.tmp may contain "newcontent", be zero-length or not exist at all.
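The overwrite-by-rename pattern above can also be sketched from a program. This is only an illustration: the function name replace_file_contents is hypothetical, and the fsync call is an added portable precaution of ours, not something the guarantee described above requires.

```python
import os
import tempfile


def replace_file_contents(path: str, data: bytes) -> None:
    """Write data to a temporary file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file is created next to the target so that the rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Assumption: flushing the new data first is a conservative
            # extra step, not part of the guarantee described in the text.
            os.fsync(tmp.fileno())
        # Rename over the existing file: readers see either the old
        # content or the new content.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

After a successful call the target holds the new content and no temporary file is left behind in the directory.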
Why do I experience poor performance during file access on the filesystem?
By default the file system is mounted with the relatime flag, which means it must update a file's metadata on the first access each day. Since updates to metadata are done as COW, visiting a lot of files results in massive and scattered write operations on the underlying media.
You need to mount the file system with the noatime flag to prevent this from happening.
More details are in Mount_options#Performance
What are the crash guarantees of rename?
Renames NOT overwriting existing files do not give additional guarantees. This means, a sequence like
echo "content" > file.tmp
mv file.tmp file
# *crash*
will most likely give you a zero-length "file". The sequence can give you either
- Neither file nor file.tmp exists
- Either file.tmp or file exists and is 0-size or contains "content"
For more info see this thread: http://thread.gmane.org/gmane.comp.file-systems.btrfs/5599/focus=5623
Can the data=ordered mode be turned off in Btrfs?
No, it is an important part of keeping data and checksums consistent. The Btrfs data=ordered mode is very fast and turning it off is not required for good performance.
What checksum function does Btrfs use?
Currently Btrfs uses crc32c for data and metadata. The disk format has room for 256bits of checksum for metadata and up to a full leaf block (roughly 4k or more) for data blocks. Over time we'll add support for more checksum alternatives.
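For readers unfamiliar with crc32c, here is a minimal bitwise sketch of the generic CRC-32C (Castagnoli) algorithm. It only illustrates the checksum function itself; how Btrfs seeds the checksum, which bytes it covers, and how it stores the result are not modelled here, and the function name is our own.

```python
def crc32c(data: bytes) -> int:
    """Generic CRC-32C (Castagnoli), bit-by-bit, reflected form."""
    crc = 0xFFFFFFFF  # standard initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                # 0x82F63B78 is the reflected Castagnoli polynomial.
                crc = (crc >> 1) ^ 0x82F63B78
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF  # standard final inversion


# The conventional check string for CRC algorithms is "123456789".
print(hex(crc32c(b"123456789")))
```

A table-driven or hardware-accelerated implementation would be much faster; the bitwise loop is used here only because it is short.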
Can data checksumming be turned off?
Yes, you can disable it by mounting with -o nodatasum
Can copy-on-write be turned off for data blocks?
Yes, there are several ways to do that.
Disable it by mounting with nodatacow. This implies nodatasum as well. COW may still happen if a snapshot is taken. However COW will still be maintained for existing files, because the COW status can be modified only for empty or newly created files.
For an empty file, add the NOCOW file attribute (use the chattr utility with +C), or create a new file in a directory with the NOCOW attribute set (the new file will then inherit this attribute).
Now copy the original data into the pre-created file, delete original and rename back.
There is a script you can use to do this 
(See also the Project ideas page)
When will Btrfs have a fsck like tool?
The first detailed report on what comprises "btrfsck"
The btrfsck tool in the git master branch for btrfs-progs is now capable of repairing some types of filesystem breakage. It is not well-tested in real-life situations yet. If you have a broken filesystem, it is probably better to use btrfsck with advice from one of the btrfs developers, just in case something goes wrong. (But even if it does go badly wrong, you've still got your backups, right?)
Note that there is also a recovery tool in the btrfs-progs git repository which can often be used to copy essential files out of broken filesystems.
What's the difference between btrfsck and fsck.btrfs?
- btrfsck is the actual utility that is able to check and repair a filesystem
- fsck.btrfs is a utility that should exist for any filesystem type and is called during system setup when the corresponding /etc/fstab entries contain non-zero value for fs_passno. (See fstab(5) for more.)
Traditional filesystems need to run their respective fsck utility in case the filesystem was not unmounted cleanly and the log needs to be replayed before mount. This is not needed for btrfs. You should set fs_passno to 0.
Note that if the fsck.btrfs utility is in fact btrfsck, then the filesystem is unnecessarily checked upon every boot, which slows down the whole operation. It is safe, and recommended, to turn fsck.btrfs into a no-op, e.g. by cp /bin/true /sbin/fsck.btrfs.
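To spot fstab entries that would trigger fsck.btrfs at boot, something like the following sketch could be used. It assumes the standard six-field fstab layout described in fstab(5), where a missing fs_passno counts as 0; the function name is hypothetical.

```python
from typing import List


def btrfs_entries_with_fsck(fstab_text: str) -> List[str]:
    """Return mount points of btrfs entries whose fs_passno is non-zero."""
    flagged = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # Fields: fs_spec fs_file fs_vfstype fs_mntops fs_freq fs_passno
        if len(fields) < 3 or fields[2] != "btrfs":
            continue
        passno = int(fields[5]) if len(fields) > 5 else 0
        if passno != 0:
            flagged.append(fields[1])
    return flagged


if __name__ == "__main__":
    with open("/etc/fstab") as fstab:
        for mount_point in btrfs_entries_with_fsck(fstab.read()):
            print("fs_passno should be 0 for", mount_point)
```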
Can I use RAID-5/6 on my Btrfs filesystem?
The first drop of the RAID-5/6 code, in experimental form, is in the 3.9 kernel. This is currently only suitable for testing, as it is known to not be crash-safe, and many important (or even vital) features are missing. The current status (summarised from the announcement mail) is:
- Parity may be inconsistent after a crash
- RAID-5 (many devices, one parity)
- RAID-6 (many devices, two parity)
- The algorithm uses as many devices as are available (see note, below)
- No scrub support for fixing checksum errors
- No support in btrfs-progs for forcing parity rebuild
- No support for discard
- No support yet for n-way mirroring
- No support for a fixed-width stripe (see note, below)
Using as many devices as are available means that there will be a performance issue for filesystems with large numbers of devices. It also means that filesystems with different-sized devices will end up with differing-width stripes as the filesystem fills up, and some space may be wasted when the smaller devices are full. Both of these issues can be addressed by specifying a fixed-width stripe, always running over exactly the same number of devices.
RAID-5 was due to arrive in 3.5, but didn't make it in time because of a serious bug. The feature also missed 3.6, because two other large and important features also had to go in, and there wasn't time to complete the full testing programme for all three features before the 3.6 merge window.
(From the 3.7 pull request):
- "I'm cooking more unrelated RAID code, but I wanted to make sure [the rest of the pull request] makes it in. The largest updates here are relatively old and have been in testing for some time."
(From the 3.8 pull request):
- "raid5/6 is being rebased against the device replacement code. I'll have it posted this Friday along with a nice series of benchmarks."
- -- It didn't make it into the pull for 3.8.
Is Btrfs optimized for SSD?
There are some optimizations for SSD drives, and you can enable them by mounting with -o ssd. As of 2.6.31-rc1, this mount option will be enabled if Btrfs is able to detect non-rotating storage. SSD is going to be a big part of future storage, and the Btrfs developers plan on tuning for it heavily. Note that -o ssd will not enable TRIM/discard.
Does Btrfs support TRIM/discard?
There are two ways to apply discard:
- during normal operation on any space that's going to be freed, enabled by mount option discard
- on demand via the command fstrim
"-o discard" can have negative performance consequences on some SSDs (whether it adds worthwhile performance at all is up for debate, depending on who you ask). It also makes undeletion/recovery nearly impossible, and it is a security problem if you use dm-crypt underneath (see http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html ). It is therefore not enabled by default. You are welcome to run your own benchmarks and post them here, with the caveat that they'll be very SSD firmware specific.
The fstrim way is more flexible, as it allows trim to be applied to a specific block range, or to be scheduled for a time when a drop in filesystem performance is not critical.
Does btrfs support encryption?
There are several different ways in which a filesystem can interoperate with encryption to keep your data secure:
- It can operate on top of an encrypted partition (dm-crypt / LUKS) scheme.
- It can be used as a component of a stacked approach (eg. ecryptfs) where a layer above the filesystem transparently provides the encryption.
- It can natively attempt to encrypt file data and associated information such as the file name.
There are advantages and disadvantages to each method, and care should be taken to make sure that the encryption protects against the right threat. In some situations, more than one approach may be needed.
Typically, partition (or entire disk) encryption is used to protect data in case a computer is stolen. This sort of method requires a password for the computer to boot, but the system operates normally after that. All data (except the boot loader and kernel) is encrypted. Btrfs works safely with partition encryption (luks/dm-crypt) since Linux 3.2. Earlier kernels will start up in this mode, but are known to be unsafe and may corrupt due to problems with dm-crypt write barrier support.
Partition encryption does not protect data accessed by a running system -- after boot, a user sees the computer normally, without having to enter extra passwords. There may also be some performance impact since all IO must be encrypted, not just important files. For this reason, it's often preferable to encrypt individual files or folders, so that important files can't be accessed without the right password while the system is online. If the computer might also be stolen, it may be preferable to use partition encryption as well as file encryption.
Btrfs does not support native file encryption (yet), and there's nobody actively working on it. It could conceivably be added in the future.
As an alternative, it is possible to use a stacked filesystem (eg. ecryptfs) with btrfs. In this mode, the stacked encryption layer is mounted over a portion of a btrfs volume and transparently applies the security before the data is sent to btrfs. Another similar option is to use the fuse-based filesystem encfs as an encrypting layer on top of btrfs.
Note that a stacked encryption layer (especially using fuse) may be slow, and because the encryption happens before btrfs sees the data, btrfs compression won't save space (encrypted data is too scrambled). From the point of view of btrfs, the user is just writing files full of noise.
Also keep in mind that if you use partition level encryption and btrfs RAID on top of multiple encrypted partitions, the partition encryption will have to individually encrypt each copy. This may result in somewhat reduced performance compared to a traditional RAID setup where the encryption might be done on top of RAID. Whether the encryption has a significant impact depends on the workload, and note that many newer CPUs have hardware encryption support.
Does Btrfs work on top of dm-crypt?
This is deemed safe since 3.2 kernels. Corruption has been reported before that, so you want a recent kernel. The reason was improper passing of device barriers that are a requirement of the filesystem to guarantee consistency.
Does btrfs support deduplication?
Deduplication is supported, with some limitations. See Deduplication.
Does btrfs support swap files?
Currently no. Just making a file NOCOW does not help: swap file support relies on one function that btrfs intentionally does not implement due to potential corruption. The swap implementation relies on some assumptions which may not hold in btrfs, such as fixed block numbers in the swap file, whereas btrfs has a different block number mapping in the case of multiple devices. For more details have a look at the Project_ideas page.
A workaround, albeit with poor performance, is to mount a swap file via a loop device.
Does grub support btrfs?
Yes. GRUB 2.00 supports many btrfs configurations (including zlib and lzo compression, and RAID0/1/10 multi-dev filesystems). If your distribution only provides older versions of GRUB, you'll have to build it for yourself.
How do I do...?
See also the UseCases page.
Is btrfs stable?
Short answer: No, it's still considered experimental.
Long answer: Nobody is going to magically stick a label on the btrfs code and say "yes, this is now stable and bug-free". Different people have different concepts of stability: a home user who wants to keep their ripped CDs on it will have a different requirement for stability than a large financial institution running their trading system on it. If you are concerned about stability in commercial production use, you should test btrfs on a testbed system under production workloads to see if it will do what you want of it. In any case, you should join the mailing list (and hang out in IRC) and read through problem reports and follow them to their conclusion to give yourself a good idea of the types of issues that come up, and the degree to which they can be dealt with. Whatever you do, we recommend keeping good, tested, off-system (and off-site) backups.
Pragmatic answer: (2012-12-19) Many of the developers and testers run btrfs as their primary filesystem for day-to-day usage, or with various forms of "real" data. With reliable hardware and up-to-date kernels, we see very few unrecoverable problems showing up. As always, keep backups, test them, and be prepared to use them.
I have converted my ext4 partition into Btrfs, how do I delete the ext2_saved folder?
The folder is a normal btrfs subvolume and you can delete it with the command
btrfs subvolume delete /path/to/btrfs/ext2_saved
Why does df show incorrect free space for my RAID volume?
Aaargh! My filesystem is full, and I've put almost nothing into it!
Why are there so many ways to check the amount of free space?
Free space in Btrfs is a tricky concept from a traditional viewpoint, owing partly to the features it provides and partly to the difficulty in making too many assumptions about the exact information you need to know at the time. We will eventually figure out a more intuitive solution.
To understand the different ways that btrfs's tools report filesystem usage and free space, you need to know how it allocates and uses space.
Raw disk usage
Btrfs starts with a pool of raw storage. This is what you see when you run btrfs fi show:
$ sudo btrfs fi show /dev/sda1
Label: none uuid: 12345678-1234-5678-1234-1234567890ab
    Total devices 2 FS bytes used 304.48GB
    devid 1 size 427.24GB used 197.01GB path /dev/sda1
    devid 2 size 465.76GB used 197.01GB path /dev/sdc1
The "devid" lines show the total raw bytes available and allocated on each disk, whether containing redundant data or not. As the filesystem needs space for data or metadata, it allocates chunks of raw storage from the disks, typically 1GB (data) and 256MB (metadata) at a time. This allocation of data is known as a block group.
The way that the above allocation occurs depends on the RAID/replication in use, and the type of information it is attempting to store:
- single - data usage matches the raw block group usage on a single device (data = raw; 1GB of data requires 1GB of disk)
- DUP - data is duplicated across a single disk, mostly used for metadata (data * 2 = raw; 1GB of data requires 2GB of disk)
- RAID-1 - data usage will match with two equal chunks on two different devices (data * 2 = raw; 1GB of data requires 2GB of disk)
- RAID-10 - as with RAID-1 however will require four devices (data * 2 = raw; 1GB of data requires 2GB of disk)
- RAID-0 - data usage matches the raw data usage on multiple devices (data = raw; 1GB of data requires 1GB of disk)
- RAID-5 - similar to RAID-0 but with one extra raw block reserved for parity (where you have n disks: data * n = raw * (n-1) ; 6 disks, 5GB of data requires 6GB of disk)
- RAID-6 - as with RAID-5 except with two blocks reserved for "parity" (where you have n disks: data * n = raw * (n-2) ; 8 disks, 6 GB of data requires 8GB of disk)
^ The RAID-5 and RAID-6 configurations are not yet available
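The ratios in the list above can be written down as a small calculation. This is a sketch of the arithmetic only, using the formulas as given in the list; the function names and profile strings are our own and do not correspond to any btrfs tool.

```python
from fractions import Fraction


def raw_per_data(profile: str, n_devices: int = 1) -> Fraction:
    """Raw bytes consumed per byte of data for a replication profile."""
    profile = profile.lower()
    if profile in ("single", "raid0"):
        return Fraction(1)  # data = raw
    if profile in ("dup", "raid1", "raid10"):
        return Fraction(2)  # data * 2 = raw
    if profile == "raid5":
        parity = 1
    elif profile == "raid6":
        parity = 2
    else:
        raise ValueError("unknown profile: " + profile)
    if n_devices <= parity:
        raise ValueError("not enough devices for " + profile)
    # data * n = raw * (n - parity)
    return Fraction(n_devices, n_devices - parity)


def raw_needed(data, profile: str, n_devices: int = 1):
    """Raw storage needed to hold `data` units under the given profile."""
    return data * raw_per_data(profile, n_devices)


print(raw_needed(5, "raid5", 6))  # 6 disks, 5GB of data
print(raw_needed(6, "raid6", 8))  # 8 disks, 6GB of data
```

Fractions are used instead of floats so that the two examples from the list come out as exact whole numbers.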
In the above example 304.48GB of storage has been used for data and metadata within the filesystem, however sda1 and sdc1 each have 197.01GB of "raw" disk allocated. Because of the differing replication schemes (single/DUP/RAID-x) above, and because storage is often allocated but unused, these two numbers will always differ, in this case 304.48GB vs 2x 197.01GB (394.02GB).
When allocating new block groups, for example with a new empty btrfs file system using RAID-1 for data, it will allocate two chunks of 1GiB each, which between them have 1GiB of storage capacity. You will see 2GiB of raw space used in "btrfs fi show", 1GiB from each of two devices. You will also see 1GiB of free space appear in "btrfs fi df" as "Data, RAID1". As you write files to it, that 1GiB will get used up at the rate you'd expect (i.e., write 1MiB to it, and 1MiB gets used -- in "btrfs fi df" output). When that 1GiB is used up, another 1GiB is allocated and used.
The total space allocated from the raw pool is shown with btrfs fi show. If you want to see the types and quantities of space allocated, and what they can store, the command is btrfs fi df <mountpoint>:
$ btrfs fi df /
Metadata: total=18.00GB, used=6.10GB
Data: total=358.00GB, used=298.37GB
System: total=12.00MB, used=40.00KB
This shows how much data has been allocated for each data type and replication type, and how much has been used. The values shown are data rather than raw bytes, so if you're using RAID-1 or RAID-10, the amount of raw storage used is double the values you can see here.
Why is free space so complicated?
You might think, "My whole disk is RAID-1, so why can't you just divide everything by 2 and give me a sensible value in df?"
If everything is RAID-1 (or RAID-0, or in general all the same RAID level), then yes, we could give a sane and consistent value from df. However, we have plans to allow per-subvolume and per-file RAID levels. In this case, it becomes impossible to give a sensible estimate as to how much space there is left.
For example, if you have one subvolume as "single", and one as RAID-1, then the first subvolume will consume raw storage at the rate of one byte for each byte of data written. The second subvolume will take two bytes of raw data for each byte of data written. So, if we have 30GiB of raw space available, we could store 30GiB of data on the first subvolume, or 15GiB of data on the second, and there is no way of knowing which it will be until the user writes that data.
So, in general, it is impossible to give an accurate estimate of the amount of free space on any btrfs filesystem. Yes, this sucks. If you have a really good idea for how to make it simple for users to understand how much space they've got left, please do let us know, but also please be aware that the finest minds in btrfs development have been thinking about this problem for at least a couple of years, and we haven't found a simple solution yet.
Why is there so much space overhead?
There are several things meant by this. One is the out-of-space issues discussed above; this is a known deficiency, which can be worked around, and will eventually be worked around properly. The other meaning is the size of the metadata block group, compared to the data block group. Note that you should not compare the sizes of the allocations, but rather the used space within the allocations.
There are several considerations:
- The default raid level for the metadata group is dup on single drive systems, and raid1 on multi drive systems. The meaning is the same in both cases: there's two copies of everything in that group. This can be disabled at mkfs time, and it will eventually be possible to migrate raid levels online.
- There is an overhead to maintaining the checksums (approximately 0.1%: 4 bytes for each 4k block)
- Small files are also written inline into the metadata group. If you have several gigabytes of very small files, this will add up.
[incomplete; disabling features, etc]
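The checksum overhead figure in the list above can be checked with a quick calculation. This sketch assumes exactly 4 checksum bytes per 4096-byte block, as stated in the list, and ignores any other metadata needed to store the checksums; the names are our own.

```python
CHECKSUM_BYTES = 4   # bytes of checksum per block, as stated above
BLOCK_BYTES = 4096   # "4k" block size


def checksum_bytes(data_bytes: int) -> int:
    """Bytes of checksum needed to cover data_bytes of file data."""
    blocks = -(-data_bytes // BLOCK_BYTES)  # ceiling division
    return blocks * CHECKSUM_BYTES


one_gib = 1 << 30
print(checksum_bytes(one_gib))                   # checksum bytes for 1GiB of data
print(100.0 * CHECKSUM_BYTES / BLOCK_BYTES)      # overhead as a percentage
```

Under these assumptions 1GiB of data needs 1MiB of checksums, which is where the "approximately 0.1%" comes from.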
How much space do I get with unequal devices in RAID-1 mode?
The general rule of thumb is that if your largest device is bigger than all of the others put together, then you will get as much space as all the smaller devices added together. Otherwise, you get half of the space of all of your devices added together. This is always true if the smaller devices are the same size.
For example, if you have disks of size 3TB, 1TB, 1TB, your largest disk is 3TB and the sum of the rest is 2TB. In this case, your largest disk is bigger than the sum of the rest, and you will get 2TB of usable space.
If you have disks of size 3TB, 2TB, 2TB, then your largest disk is 3TB and the sum of the rest is 4TB. In this case, your largest disk is smaller than the sum of the rest, and you will get (3+2+2)/2 = 3.5TB of usable space.
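The basic rule of thumb above can be sketched as a short function. It models only the two cases of the rule as stated (it does not attempt the case of unequal smaller devices), and the function name is hypothetical.

```python
def raid1_usable(sizes):
    """Estimate usable RAID-1 space using the basic rule of thumb."""
    if not sizes:
        return 0
    total = sum(sizes)
    largest = max(sizes)
    rest = total - largest
    if largest > rest:
        # The largest device can only be mirrored against the others.
        return rest
    # Otherwise every byte is stored twice, so half the total is usable.
    return total / 2


print(raid1_usable([3, 1, 1]))  # sizes in TB
print(raid1_usable([3, 2, 2]))
```

The two calls reproduce the 3TB, 1TB, 1TB and 3TB, 2TB, 2TB examples worked through above.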
If the smaller disks are not the same size, the above holds true for the first case (largest device is bigger than all the others combined), but might not be true if the sum of the rest is larger. In this case, you can apply the rule multiple times.
For example, if you have disks of size 2TB, 1.5TB, 1TB, then the largest disk is 2TB and the sum is 2.5TB, but the smaller devices aren't equal, so we'll apply the rule of thumb twice. First, consider the 2TB and the 1.5TB. This set will give us 1.5TB usable and 500GB left over. Now consider the 500GB left over with the 1TB. This set will give us 500GB usable and 500GB left over. Our total set (2TB, 1.5TB, 1TB) will thus yield 2TB usable.
Another example is 3TB, 2TB, 1TB, 1TB. In this, the largest is 3TB and the sum of the rest is 4TB. Applying the rule of thumb twice, we consider the 3TB and the 2TB and get 2TB usable with 1TB left over. We then consider the 1TB left over with the 1TB and the 1TB and get 1.5TB usable with nothing left over. Our total is 3.5TB of usable space.
What does "balance" do?
btrfs filesystem balance is an operation which simply takes all of the data and metadata on the filesystem, and re-writes it in a different place on the disks, passing it through the allocator algorithm on the way. It was originally designed for multi-device filesystems, to spread data more evenly across the devices (i.e. to "balance" their usage). This is particularly useful when adding new devices to a nearly-full filesystem.
Due to the way that balance works, it also has some useful side-effects:
- If there is a lot of allocated but unused data or metadata chunks, a balance may reclaim some of that allocated space. This is the main reason for running a balance on a single-device filesystem.
- On a filesystem with damaged replication (e.g. a RAID-1 FS with a dead and removed disk), it will force the FS to rebuild the missing copy of the data on one of the currently active devices, restoring the RAID-1 capability of the filesystem.
Does a balance operation make the internal B-trees better/faster?
No, balance has nothing at all to do with the B-trees used for storing all of btrfs's metadata. The B-tree implementation used in btrfs is effectively self-balancing, and won't lead to imbalanced trees. See the question above for what balance does (and why it's called "balance").
Do I need to run a balance regularly?
In general usage, no. A full unfiltered balance typically takes a long time, and will rewrite huge amounts of data unnecessarily. You may wish to run a balance on metadata only (see Balance_Filters) if you find you have very large amounts of metadata space allocated but unused, but this should be a last resort. At some point, this kind of clean-up will be made an automatic background process.
What is a subvolume?
A subvolume is like a directory - it has a name, there's nothing on it when it is created, and it can hold files and other directories. There's at least one subvolume in every Btrfs filesystem, the top-level subvolume.
The equivalent in Ext4 would be a filesystem. Each subvolume behaves as an individual filesystem. The difference is that in Ext4 you create each filesystem in a partition, whereas in Btrfs all the storage is in the 'pool' and subvolumes are created from the pool, so you don't need to partition anything. You can create as many subvolumes as you want, as long as you have storage capacity.
How do I find out which subvolume is mounted?
A specific subvolume can be mounted with the -o subvol=/path/to/subvol option, but reading that path directly from /proc/mounts is currently not implemented. If the filesystem is mounted via an /etc/fstab entry, then the output of the mount command will show the subvol path, as it reads it from /etc/mtab.
A generally working way to read the path, as for bind mounts, is from /proc/self/mountinfo:
27 21 0:19 /subv1 /mnt/ker rw,relatime - btrfs /dev/loop0 rw,space_cache
           ^^^^^^
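Reading the subvolume path from /proc/self/mountinfo can be scripted. The sketch below relies on the mountinfo layout in which the fourth field is the root of the mount within the filesystem, the fifth is the mount point, and the filesystem type follows the " - " separator; the function names are our own.

```python
def parse_mountinfo_line(line: str) -> dict:
    """Extract root, mount point and fs type from one mountinfo line."""
    # Optional fields may precede the " - " separator, so split on it first.
    left, _, right = line.partition(" - ")
    fields = left.split()
    return {
        "root": fields[3],         # path within the filesystem (the subvolume)
        "mount_point": fields[4],
        "fstype": right.split()[0],
    }


def btrfs_subvol_for(mount_point: str, mountinfo_text: str):
    """Return the subvolume path mounted at mount_point, or None."""
    for line in mountinfo_text.splitlines():
        if " - " not in line:
            continue
        entry = parse_mountinfo_line(line)
        if entry["fstype"] == "btrfs" and entry["mount_point"] == mount_point:
            return entry["root"]
    return None


sample = "27 21 0:19 /subv1 /mnt/ker rw,relatime - btrfs /dev/loop0 rw,space_cache"
print(btrfs_subvol_for("/mnt/ker", sample))
```

On a live system the text would come from reading /proc/self/mountinfo instead of the sample line.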
What is a snapshot?
A snapshot is a frozen image of all the files and directories of a subvolume. For example, if you have two files ("a" and "b") in a subvolume, you take a snapshot and you delete "b", the file you just deleted is still available in the snapshot you took. The great thing about Btrfs snapshots is that you can operate on any files or directories, whereas with LVM a snapshot covers the whole logical volume.
Note that a snapshot is not a backup: Snapshots work by use of btrfs's copy-on-write behaviour. A snapshot and the original it was taken from initially share all of the same data blocks. If that data is damaged in some way (cosmic rays, bad disk sector, accident with dd to the disk), then the snapshot and the original will both be damaged. Snapshots are useful to have local online "copies" of the filesystem that can be referred back to, or to implement a form of deduplication, or to fix the state of a filesystem for making a full backup without anything changing underneath it. They do not in themselves make your data any safer.
Since backups from tape are a pain, here are the thoughts of a lazy sysadmin who creates the home directories for their users on a Btrfs filesystem. Let's try some fancy network-attached storage ideas.
- Then there could be a snapshot every 6 hours via cron
The logic would look something like this for a rolling 3-day rotation run from cron at midnight:
- rename /home_today_00 to /home_backday_1
- create a symbolic link for /home_backDay_00 that points to real dir of /home_backday_1
- rename /home_today_06, /home_backDay_06 , Need to do this for all hours (06..18)
- delete the /home_backday_3
- rename /home_backday_2 to /home_backday_3 day
- rename /home_backday_1 to /home_backday_2 day
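The daily part of this rotation can be sketched as a small planning function. This is only an illustration under stated assumptions: the function name and the (action, source, destination) tuples are hypothetical, the steps are ordered oldest-first so that no rename lands on a name that is still in use, and the sketch only returns the plan without touching the filesystem. The symbolic link and the per-hour renames are left out.

```python
def midnight_rotation_plan(days=3, prefix="/home_backday_", today="/home_today_00"):
    """Return the ordered steps for a rolling rotation kept for `days` days."""
    # Drop the oldest snapshot first to free its name.
    plan = [("delete", prefix + str(days), None)]
    for n in range(days - 1, 0, -1):
        # Shift each remaining day one slot older, oldest first.
        plan.append(("rename", prefix + str(n), prefix + str(n + 1)))
    # Today's midnight snapshot becomes day 1.
    plan.append(("rename", today, prefix + "1"))
    return plan


for step in midnight_rotation_plan():
    print(step)
```

Returning a plan rather than executing it makes the ordering easy to inspect before wiring it to real delete and rename commands.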
What is the difference between mount -o ssd and mount -o ssd_spread?
Mount -o ssd_spread is more strict about finding a large unused region of the disk for new allocations, which tends to fragment the free space more over time. Mount -o ssd_spread is often faster on the less expensive SSD devices. The default for autodetected SSD devices is mount -o ssd.
Will Btrfs be in the mainline Linux Kernel?
Btrfs is already in the mainline Linux kernel. It was merged on 9th January 2009, and was available in the Linux 2.6.29 release.
Does Btrfs run with older kernels?
v0.16 of (out-of-tree) Btrfs maintains compatibility with kernels back to 2.6.18. Kernels older than that will not work.
btrfs made it into mainline in 2.6.29, and development and bugfixes since then have gone directly into the main kernel. Backporting btrfs from a newer kernel to an earlier one may be a difficult process due to changes in the VFS or block layer APIs; there are no known projects or people doing this on a regular basis.
We strongly recommend that you keep up-to-date with the latest released kernels from kernel.org -- we try to maintain a list of sources that make that task easier for most major distributions.
How long will the Btrfs disk format keep changing?
The Btrfs disk format is not finalized, but it won't change unless a critical bug is found and no workarounds are possible. Not all the features have been implemented, but the current format is extensible enough to add those features later without requiring users to reformat.
How do I upgrade to the 2.6.31 format?
The 2.6.31 kernel can read and write Btrfs filesystems created by older kernels, but it writes a slightly different format for the extent allocation trees. Once you have mounted with 2.6.31, the stock Btrfs in 2.6.30 and older kernels will not be able to mount your filesystem.
We don't want to force people into 2.6.31 only, and so the newformat code is available against 2.6.30 as well. All fixes will also be maintained against 2.6.30. For details on downloading, see the Btrfs source repositories.
Can I find out compression ratio of a file?
Currently no. There is a patch (https://patchwork.kernel.org/patch/117782/) adding the kernel part (an ioctl). However, the size obtained by this ioctl is not exact: it is rounded up to the block size (4KB). The real number of compressed bytes is not reported or recorded by the filesystem in its structures (only the block count is). It is saved in the disk blocks but processed solely by the compression code.
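To illustrate why a block-rounded size limits the precision of any ratio, the following sketch rounds a byte count up to a 4KB block and computes a ratio from it. The function names and the example byte counts are made up for illustration; nothing here calls the ioctl from the patch.

```python
BLOCK_SIZE = 4096  # the 4KB block size mentioned above


def round_up_to_block(nbytes, block_size=BLOCK_SIZE):
    """Round a byte count up to the next multiple of the block size."""
    return -(-nbytes // block_size) * block_size


def apparent_ratio(uncompressed, compressed_exact):
    """Ratio computed from the block-rounded compressed size."""
    return round_up_to_block(compressed_exact) / uncompressed


# Hypothetical file: 16384 bytes of data that compress to 5000 bytes.
print(round_up_to_block(5000))       # 8192
print(apparent_ratio(16384, 5000))   # 0.5
```

In this made-up case the exact ratio would be about 0.31, but a block-rounded size can only report 0.5.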
I'm running btrfs 0.19...
This is, unfortunately, almost meaningless. Almost all of the "interesting" code in btrfs is in the kernel, so the main thing you should be reporting is the version of the kernel you're running.
Even if you want to report a problem with the btrfs userspace tools, the main version number (which is usually 0.19) is useless, because it hasn't been updated in at least 18 months. If you have installed from your distribution's package manager, then the version number of the package will usually include a date that will indicate when your btrfs tools were compiled; it is this package version that you should tell people about if you have a problem. If you have built your btrfs-progs tools from git, please tell us what git commit ID was the head when you built your tools. A recent version of the btrfs-progs tools should report the commit ID as part of the version number when you run them:
hrm@ruthven:~ $ btrfs --help
Usage:
[...]
Btrfs v0.19-116-g13eced9
                ^^^^^^^^ this is the git commit ID
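When collecting this information in a script, the commit ID can be pulled out of the version string. A minimal sketch, assuming the string follows the git describe layout seen above (tag, number of commits since the tag, then "g" followed by the abbreviated commit hash); the function name is hypothetical.

```python
import re

# Matches e.g. "v0.19-116-g13eced9": tag, commit count, "g" + abbreviated hash.
VERSION_RE = re.compile(r"v(?P<tag>[0-9.]+)-(?P<commits>\d+)-g(?P<commit>[0-9a-f]+)")


def commit_id_from_version(text):
    """Return the abbreviated commit hash, or None if the string has none."""
    match = VERSION_RE.search(text)
    return match.group("commit") if match else None


print(commit_id_from_version("Btrfs v0.19-116-g13eced9"))   # 13eced9
print(commit_id_from_version("Btrfs v0.19"))                # None
```

A result of None corresponds to the bare main version number, which is the case where the package version should be reported instead.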
Can I mount subvolumes with different mount options?
- nodev, nosuid and probably all the generic ones
- compress/compress-force — possible, but not implemented
- ro — via bind mount
- the rest like space_cache, inode_cache, discard, autodefrag, ssd, ...
Can I change metadata block size without recreating the filesystem?
No, the value passed to mkfs.btrfs -n SIZE cannot be changed once the filesystem is created. A backup/restore is needed.
Note that this will likely never be implemented, because it would require major updates to the core functionality.
Interaction with partitions, device managers and logical volumes
Btrfs has subvolumes, does this mean I don't need a logical volume manager and I can create a big Btrfs filesystem on a raw partition?
There is not a single answer to this question. Here are the issues to think about when you choose raw partitions or LVM:
- raw partitions are slightly faster than logical volumes
- btrfs does write optimisation (sequential writes) across a filesystem
- subvolume write performance will benefit from this algorithm
- creating multiple btrfs filesystems, each on a different LV, means that the algorithm can be ineffective (although the kernel will still perform some optimization at the block device level)
- Online resizing and relocating the filesystem across devices:
- the pvmove command from LVM allows filesystems to move between devices while online
- raw partitions can only be moved to a different starting cylinder while offline
- raw partitions can only be made bigger if there is free space after the partition, while LVM can expand an LV onto free space anywhere in the volume group - and it can do the resize online
- subvolume/logical volume size constraints
- LVM is convenient for creating fixed size logical volumes (e.g. 10MB for each user, 20GB for each virtual machine image, etc)
- subvolumes don't currently enforce such rigid size constraints, although the upcoming qgroups feature will address this issue
Based on the above, all of the following are valid strategies, depending upon whether your priority is performance or flexibility:
- create a raw partition on each device, and create btrfs on top of the partition (or combine several such partitions into btrfs raid1)
- create subvolumes within btrfs (e.g. for /home/user1, /home/user2, /home/media, /home/software)
- in this case, any one subvolume could grow to use up all the space, leaving none for other subvolumes
- create a single volume group, with two logical volumes (LVs), each backed by separate devices
- create a btrfs raid1 across the two LVs
- create subvolumes within btrfs (e.g. for /home/user1, /home/user2, /home/media, /home/software)
- in this case, any one subvolume could grow to use up all the space, leaving none for other subvolumes
- however, it performs well and is convenient
- create a single volume group, create several pairs of logical volumes
- create several btrfs raid1 filesystems, each spanning a pair of LVs
- mount each filesystem on a distinct mount point (e.g. for /home/user1, /home/user2, /home/media, /home/software)
- in this case, each mount point has a fixed size, so one user can't use up all the space
Does the Btrfs multi-device support make it a "rampant layering violation"?
Yes and no. Device management is a complex subject, and there are many different opinions about the best way to do it. Internally, the Btrfs code separates out components that deal with device management and maintains its own layers for them. The vast majority of filesystem metadata has no idea there are multiple devices involved.
Many advanced features such as checking alternate mirrors for good copies of a corrupted block are meant to be used with RAID implementations below the FS.
What are the differences among MD-RAID / device mapper / btrfs raid?
MD-RAID supports RAID-0, RAID-1, RAID-10, RAID-5, and RAID-6. As of Linux 3.8, btrfs does not yet support RAID-5 and RAID-6.
MD-RAID operates directly on the devices. RAID-1 is defined as "data duplicated to all devices", so a raid with 3 1TB drives will have 1TB of usable space but there will be 3 copies of the data.
Likewise, RAID-0 is defined as "data striped across all devices", so a raid with 3 1TB drives will have 3TB usable space, but to read/write a stripe all 3 disks must be spun up, as part of the stripe is on each disk.
RAID-10 requires at least 4 devices and is constructed as a stripe across 2 mirrors. So a raid with 4 1TB drives yields 2TB usable and 2 copies of the data. A raid with 6 1TB drives yields 3TB usable space with 2 copies of all the data (3 mirrors of 1TB each, striped).
btrfs supports RAID-0, RAID-1, and RAID-10. As of Linux 3.8, btrfs does not yet support RAID-5 and RAID-6.
btrfs combines all the drives into a storage pool first, and then duplicates the chunks as file data is created. RAID-1 is currently defined as "2 copies of all the data on different disks". This differs from MD-RAID and dmraid, in that those make exactly n copies for n disks. In a btrfs RAID-1 on 3 1TB drives we get 1.5TB of usable space. Because each block is only copied to 2 drives, writing a given block requires exactly 2 drives to spin up, and reading requires only 1 drive to spin up.
RAID-0 is similarly defined, with the stripe split among exactly 2 disks. 3 1TB drives yield 3TB usable space, but reading a given stripe only requires 2 disks.
RAID-10 is built on top of these definitions. Every stripe is split across exactly 2 RAID-1 sets, and those RAID-1 sets are written to exactly 2 disks (hence the 4-disk minimum). A btrfs RAID-10 volume with 6 1TB drives will yield 3TB usable space with 2 copies of all data, but only 4
When RAID-5 and RAID-6 support are added to btrfs, support for RAID-1 and RAID-10 with configurable redundancy (up to the number of disks) will also be added.
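The capacity figures quoted in this answer can be reproduced with a few lines of arithmetic. A rough sketch, assuming all drives are the same size and using only the definitions given above; the function names are hypothetical, and unequal drive sizes are deliberately not handled.

```python
def md_usable_tb(level, drives, size_tb):
    """Usable space for MD-RAID with `drives` equal-sized devices."""
    if level == "raid0":
        return drives * size_tb          # striped across all devices
    if level == "raid1":
        return size_tb                   # every device holds a full copy
    if level == "raid10":
        if drives < 4:
            raise ValueError("RAID-10 needs at least 4 devices")
        return drives * size_tb / 2      # stripe across 2-way mirrors
    raise ValueError("unsupported level: " + level)


def btrfs_usable_tb(level, drives, size_tb):
    """Usable space for btrfs with `drives` equal-sized devices."""
    if level == "raid0":
        return drives * size_tb          # no redundancy
    if level in ("raid1", "raid10"):
        if level == "raid10" and drives < 4:
            raise ValueError("RAID-10 needs at least 4 devices")
        return drives * size_tb / 2      # exactly 2 copies of all data
    raise ValueError("unsupported level: " + level)


print(md_usable_tb("raid1", 3, 1))       # 1
print(btrfs_usable_tb("raid1", 3, 1))    # 1.5
print(md_usable_tb("raid10", 6, 1))      # 3.0
print(btrfs_usable_tb("raid10", 6, 1))   # 3.0
```

The only case where the two implementations differ for equal drives is RAID-1, because MD-RAID keeps one copy per device while btrfs keeps exactly two copies.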
About the project
CRFS is a network file system protocol. It was designed at around the same time as Btrfs. Its wire format uses some Btrfs disk formats and crfsd, a CRFS server implementation, uses Btrfs to store data on disk. More information can be found at http://oss.oracle.com/projects/crfs/ and http://en.wikipedia.org/wiki/CRFS.
Will Btrfs become a clustered file system?
No. Btrfs's main goal right now is to be the best non-cluster file system.
If you want a cluster file system, there are many production choices that can be found in the Distributed file systems section on Wikipedia. Keep in mind that each file system has its own benefits and limitations, so find the best fit for your environment.
The closest cluster file system that uses Btrfs as its underlying file system is Ceph.