Project ideas

From btrfs Wiki

Revision as of 16:52, 14 October 2019

This page collects project ideas that could be implemented in btrfs. As it is open to anybody to edit, not everything listed here will necessarily be implemented. The status of a project might be obsolete, or the entry might lack detail about the use case or implementation hints.


Project idea pool

Note that some of the ideas may not be up to date with the current state of the btrfs implementation, and some projects are claimed but show no visible progress. Always ask on IRC or on the mailing list before starting an implementation.

Multiple Devices

IO stripe size and other optimizations

Not claimed — no patches yet — Not in kernel yet

The multi-device code includes a number of IO parameters that do not currently get used. These need tunables from userland and they need to be honored by the allocation and IO routines.

Take device with heavy IO errors offline or mark as "unreliable"

Not claimed — no patches yet — Not in kernel yet

Devices should be taken offline after they reach a given threshold of IO errors. Jeff Mahoney works on handling EIO errors (among others); this project can build on top of that work.
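
A minimal sketch of the threshold idea, written in Python for readability rather than as kernel code. The Device class, the record_io_error helper and the threshold value are hypothetical names chosen for illustration; nothing here reflects an existing btrfs interface.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical per-device state: an error counter and an online flag."""
    name: str
    io_errors: int = 0
    online: bool = True


def record_io_error(dev: Device, threshold: int) -> bool:
    """Count one IO error and take the device offline at the threshold.

    Returns True while the device is still online.
    """
    dev.io_errors += 1
    if dev.online and dev.io_errors >= threshold:
        dev.online = False
    return dev.online


if __name__ == "__main__":
    disk = Device("sdb")
    for _ in range(3):
        record_io_error(disk, threshold=3)
    print(disk)
```

A real implementation would also have to decide whether the counter is persistent and whether it decays over time; the sketch leaves both open.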

"Enhanced" (ala RAID 5E / RAID5EE / RAID6E) support

Not claimed — no patches yet — Not in kernel yet

Support a Hot Spare device that is active within the fs but which does not contribute toward capacity. This would result in better performance and less rebalance/rebuild time in the event of a disk failure.

Facility to allow automatic conversion and rebalance when short on reliable/spare storage

Not claimed — no patches yet — Not in kernel yet

If a device is marked as unreliable and there is not enough spare storage to take over the failing device's data automatically (ala hotswap), have a facility available to convert the data to a less-replicated state. Such a facility MUST NOT be "silent" - appropriate warnings/safeguards to prevent a "silently degraded" state should be in place.

False alarm on bad disk - rebuild mitigation

Not claimed — no patches yet — Not in kernel yet

After a device is marked as unreliable, maintain the device within the FS in order to confirm the issue persists. The device will still contribute toward fs performance but will not be treated as if contributing towards replication/reliability. If the device shows that the given errors were a once-off issue then the device can be marked as reliable once again. This will mitigate further unnecessary rebalance. See - "[Drive Resurrection]" as an example of where this is a significant feature for storage vendors.

Better data balancing over multiple devices for raid1/10 (read)

Not claimed — no patches yet — Not in kernel yet

Currently (4.15) the mirrors are selected based on the process id of the endio-metadata worker threads. This may lead to a pathological case where only one mirror is ever used (the more common case of uneven mirror use is not that bad in practice).
A first attempt with random mirror selection proved to be very wrong; the heuristic has to be improved (more comments), possibly in connection with the readahead infrastructure within btrfs.
A second attempt balances reads based on the queue size of each device.
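
The difference between a pid-based and a queue-based policy can be sketched as follows. This is an illustration in Python, not the kernel code: it assumes the pid-based choice is roughly "pid modulo number of mirrors", and both function names and the queue-depth input are hypothetical.

```python
from typing import Sequence


def pick_mirror_by_pid(pid: int, num_mirrors: int) -> int:
    """Static policy: the reader's pid alone decides the mirror."""
    return pid % num_mirrors


def pick_mirror_by_queue(queue_depths: Sequence[int]) -> int:
    """Load-based policy: pick the mirror with the fewest queued requests.

    Ties go to the lowest mirror index.
    """
    return min(range(len(queue_depths)), key=lambda i: queue_depths[i])


if __name__ == "__main__":
    # Readers that all happen to have even pids share a single mirror.
    print([pick_mirror_by_pid(pid, 2) for pid in (100, 102, 104)])
    # The queue-based policy steers the next read to the idle mirror.
    print(pick_mirror_by_queue([7, 0]))
```

The pid-based policy needs no shared state, which is its appeal; the queue-based one needs an up-to-date per-device counter.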

dm_cache or bcache like cache e.g. on a SSD

Not claimed — no patches yet — Not in kernel yet

dm_cache or bcache can add a cache on an SSD to speed up slower spindles. Using this with a multi-device btrfs with mirroring or RAID5/6 would be inefficient, as the (write-through) cache would not be aware that identical extents are being written to different disks, and it doesn't make much sense to have them in the cache twice. This could be solved by adding a new feature to btrfs that allows adding a device used for caching only.

Verify data at end of device replace

Not claimed — no patches yet — Not in kernel yet

Verify copied data before declaring a replace operation as successfully finished. Explicitly ask the user whether to ignore or whether to proceed when errors are detected.

Parallelism for Scrub and Balance

Not claimed — no patches yet — Not in kernel yet

When doing a scrub or a balance, especially in consideration of the "Limits on number of stripes (stripe width)" feature, the system currently only actions a scrub/balance to a single stripe of data or metadata. With multiple disks this often means that some disks sit idle while a single thread waits on a separate disk or group of disks to complete their operation. This behaviour can be desirable (due to load) or undesirable (due to it being inefficient). This can be fixed by determining whether or not a new scrub/balance thread can operate on the next extent/group without conflicting with the already-active operation.

multiple copies per disk

Not claimed — no patches yet — Not in kernel yet

Just as with metadata, it should be possible to have multiple copies per file per disk. The user should be able to select how many copies there should be, either on a per-file basis or via some option for all newly created files. It would be nice if this could be combined with a RAID level, e.g. with two drives, having one copy on drive A and two copies on drive B. If that is possible, it would be nice if btrfs were smart enough to spread these copies as "far from each other" as possible, e.g. on an HDD on different spindles at different positions (in order to protect against head crashes) and on SSDs on different chips.

control rebuild behaviour

Not claimed — no patches yet — Not in kernel yet

btrfs should allow setting how rebuilds are done (for both spare disks and replacement disks). The idea is to allow specifying how much read and write bandwidth should be spent on the rebuild (e.g. as a percentage or in kB/s) and how much on normal reads/writes. It is not clear whether it makes sense to have this per device or just globally per filesystem. For people for whom data security is most important, it may even make sense to set this to something very close to 100%; perhaps one could add rules so that processes running under EUID=0 are exempt from this.

RAID level combinations

Not claimed — no patches yet — Not in kernel yet

btrfs should perhaps allow RAID level combinations (e.g. 60 or 50) like MD RAID does by stacking MD devices. Another possibly important part i

ignored / non-independent block devices in a multiple device setup

Not claimed — no patches yet — Not in kernel yet

It can easily happen that multiple block devices from the same underlying physical device (e.g. multiple partitions of one disk) are added to a btrfs filesystem. If btrfs doesn't know about this, the idea of protecting against device failures is basically gone. It should be possible to mark block devices of a btrfs filesystem as being dependent on some other block device(s) of that filesystem, and btrfs should then make sure to distribute the redundant chunks correctly. Of course one could add heuristics to auto-detect whether a block device added to a btrfs belongs to the same physical device (e.g. for partitions sdaX it would be easy), but this might not be enough, since there could be arbitrary block layer levels or things like nbd in between, which cannot be detected. Hugo Mills worked on something related to that.

Allow device-replace accept smaller device than original

Not claimed — no patches yet — Not in kernel yet

This has been requested. We don't need to insist on the same device size as long as the moved data fits on the new device and the RAID constraints are satisfied. This has been verified to work; we need to extend the checks and the command line UI, e.g. add a warning and only allow a replace to a smaller device with an extra option.

Device IO Priorities

Not claimed — submitted — Not in kernel yet

The disk format includes fields to record different performance characteristics of the device. This should be honored during allocations to find the most appropriate device. The allocator should also prefer to allocate from less busy drives to spread IO out more effectively.

Dedicated metadata drives

Not claimed — submitted — Not in kernel yet

We're able to split data and metadata IO very easily. Metadata tends to be dominated by seeks and for many applications it makes sense to put the metadata onto faster SSDs.

Chunk allocation groups

Not claimed — no patches yet — Not in kernel yet

If you have a RAID-1 filesystem spanning multiple controllers, you may want to ensure that each copy of your data goes to a different controller, not just a different device. We can fairly easily modify the chunk allocator so that we can specify this.
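
A minimal sketch of controller-aware placement, in Python for illustration only. It assumes each device carries a group label (here a controller name) supplied by the administrator; the Device tuple and pick_raid1_devices are hypothetical names, and preferring the device with the most free space is an assumption borrowed from the current allocator's general behaviour.

```python
from typing import List, NamedTuple, Sequence


class Device(NamedTuple):
    name: str
    controller: str   # allocation group label, assumed to be user-provided
    free_bytes: int


def pick_raid1_devices(devices: Sequence[Device], copies: int = 2) -> List[Device]:
    """Pick one device per controller, preferring the most free space."""
    chosen: List[Device] = []
    used_controllers = set()
    for dev in sorted(devices, key=lambda d: d.free_bytes, reverse=True):
        if dev.controller in used_controllers:
            continue  # a copy already lives behind this controller
        chosen.append(dev)
        used_controllers.add(dev.controller)
        if len(chosen) == copies:
            return chosen
    raise ValueError("not enough independent controllers for all copies")


if __name__ == "__main__":
    pool = [
        Device("sda", "ctrl0", 100),
        Device("sdb", "ctrl0", 90),
        Device("sdc", "ctrl1", 50),
    ]
    print([d.name for d in pick_raid1_devices(pool)])
```

The same grouping could express other failure domains, such as partitions that share one physical disk.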

Limits on number of stripes (stripe width)

Not claimed — no patches yet — Not in kernel yet

With very large numbers of devices, or an unequal distribution of device sizes, the default allocator policy of using as many (or as few) devices as possible for striped RAID levels (0, 10, 5, 6) is problematic. The allocator can be modified to limit the number of devices it will stripe across.
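
As a rough illustration of such a limit, the Python sketch below caps the number of devices used for one striped chunk. The function name, the max_stripes and min_stripes parameters, and the preference for devices with the most free space are all assumptions made for the example, not existing btrfs tunables.

```python
from typing import Dict, List


def pick_stripe_devices(free_bytes: Dict[str, int], max_stripes: int,
                        min_stripes: int = 2) -> List[str]:
    """Pick at most max_stripes devices for one striped chunk.

    Devices with the most free space are preferred; names break ties.
    """
    usable = [name for name, free in free_bytes.items() if free > 0]
    usable.sort(key=lambda name: (-free_bytes[name], name))
    chosen = usable[:max_stripes]
    if len(chosen) < min_stripes:
        raise ValueError("not enough devices with free space")
    return chosen


if __name__ == "__main__":
    pool = {"sda": 10, "sdb": 30, "sdc": 20, "sdd": 0}
    print(pick_stripe_devices(pool, max_stripes=2))
```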

Linear chunk allocation mode

Not claimed — no patches yet — Not in kernel yet

Some people would like to use single data storage and lose the minimum amount of data possible when a device dies. One prerequisite for this is to allocate chunks linearly, filling one device entirely before moving on to subsequent ones. This should be made a chunk-type option. (The file/extent allocator will probably need to be modified to deal with this use-case as well, but this is a start).

Better data balancing over multiple devices for raid1/10 (allocation)

Not claimed — no patches yet — Not in kernel yet

The chunk allocator logic (up to 3.6 at least) allocates new chunks from devices with largest amount of free space. This guarantees (almost provably) that the raid1/mirroring guarantees hold while maximizing the available space (see FAQ for more details). On the other hand, when there are devices with unequal size (not uncommon while mixing terabyte-sized devices), the largest devices are excessively used while the others may not get used at all.

It is possible to fix the allocation logic so that it is more even while (almost provably) not breaking the mirroring guarantees.
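
The effect of the "largest free space first" policy can be seen in a small simulation. This is a simplified Python model, not the kernel allocator: it assumes fixed-size chunks, exactly two copies per chunk, and ties broken by device index. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def allocate_raid1_chunks(free: Sequence[int],
                          chunk_size: int) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Allocate raid1 chunks on the two devices with the most free space.

    Returns the list of (device, device) placements and the remaining
    free space per device.
    """
    if len(free) < 2:
        raise ValueError("raid1 needs at least two devices")
    remaining = list(free)
    placements = []
    while True:
        order = sorted(range(len(remaining)), key=lambda i: (-remaining[i], i))
        first, second = order[0], order[1]
        if remaining[second] < chunk_size:
            break  # no second device can hold another copy
        remaining[first] -= chunk_size
        remaining[second] -= chunk_size
        placements.append((first, second))
    return placements, remaining


if __name__ == "__main__":
    placements, left = allocate_raid1_chunks([10, 10, 2], chunk_size=1)
    print(placements)
    print(left)
```

With devices of size 10, 10 and 2, the small device receives no chunk until the two large ones have dropped to its level, which is the uneven use described above; all space is still consumed in the end.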

Different sector sizes

Not claimed — initial WIP — Not in kernel yet

The extent_io code makes some assumptions about the page size and the underlying FS sectorsize or blocksize. These need to be cleaned up, especially for the extent_buffer code.

Hot spare support

Anand Jain — Global hotspare — Not in kernel yet

It should be possible to add a drive and flag it as a hot spare.

Core updates

Track link count for directories

Praveen — no patches yet — Not in kernel yet

The link count for directories is traditionally used to count the number of subdirectories iff the link count is >= 2. Btrfs sets it to 1 and does not track the link count at all. The link count could be used by some utilities to optimize traversal of the file hierarchy (at least find does that).

It seems that the link count can be tracked the way other filesystems do it. This would even be backward compatible:

  • for new directories and subvolumes, set the initial link count to 2
  • a mkdir/rmdir/move/snapshot will update the link count accordingly iff the current link count is not 1
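
The two rules above can be sketched like this. It is a toy in-memory model in Python, not btrfs code: the Dir class and the mkdir/rmdir helpers are hypothetical, and rmdir skips the usual "directory must be empty" check to keep the example short.

```python
LEGACY_NLINK = 1  # value btrfs reports today, meaning "link count not tracked"


class Dir:
    """Toy directory: a link count plus named child directories."""

    def __init__(self, nlink: int = 2) -> None:
        self.nlink = nlink
        self.children = {}


def mkdir(parent: Dir, name: str) -> Dir:
    """Create a child with link count 2; bump the parent only if it is tracked."""
    child = Dir(nlink=2)
    parent.children[name] = child
    if parent.nlink != LEGACY_NLINK:
        parent.nlink += 1
    return child


def rmdir(parent: Dir, name: str) -> None:
    """Remove a child; decrement the parent only if it is tracked."""
    del parent.children[name]
    if parent.nlink != LEGACY_NLINK:
        parent.nlink -= 1


if __name__ == "__main__":
    new_root = Dir()                    # created after the change
    mkdir(new_root, "a")
    mkdir(new_root, "b")
    print(new_root.nlink)

    old_root = Dir(nlink=LEGACY_NLINK)  # created before the change
    mkdir(old_root, "x")
    print(old_root.nlink)
```

Directories that already have link count 1 keep it forever, so tools that treat 1 as "unknown" continue to work on old directories.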

More checksumming algorithms

Not claimed — no patches yet — Not in kernel yet

Currently crc32c is used; we may offer more alternatives with different speed/strength characteristics. The ones considered are xxHash and SHA256.

Btree lock contention

Not claimed — no patches yet — Not in kernel yet

The btree locks, especially on the root block, can be very hot. We need to improve this, especially in read-mostly workloads.

Make helper threads NUMA aware

Not claimed — no patches yet — Not in kernel yet

The helper threads are used to offload computation-heavy tasks and access lots of memory. This causes a performance hit on NUMA machines, where pages from other nodes are accessed. Extend the workers to be more NUMA-aware for the most exposed tasks, namely checksumming.

Note: this may be obsoleted by recent patches that replace home-grown worker management with workqueues available in kernel. (patchset)

RBtree lock contention

Liu Bo — no patches yet — Not in kernel yet

Btrfs uses a number of rbtrees to index in-memory data structures. Some of these are dominated by reads, and the lock contention from searching them is showing up in profiles. We need to look into an RCU and sequence counter combination to allow lockless reads.

Distinguish EIO and EUCLEAN types of errors

ongoing — n/a — In kernel n/a

Right now we use EIO as error code even for errors that are not directly related to IO failures from the lower layers. Other filesystems use EUCLEAN for corruptions or inconsistencies not caused by genuine IO errors. We'd like to gradually start using EUCLEAN as well where applicable. The symbolic name can be also synced up with other filesystems that use EFSCORRUPTION and maybe other more fine grained errors.

Improve self-tests and error messages

Not claimed — no patches yet — Not in kernel yet

There are self-tests run at module load time that verify bitmap operations, extent tree manipulations, etc. The error messages are not always useful and should be enhanced, e.g.

  • switch all error case messages from test_msg to test_err that will also print the line and stack (use WARN_ON)
  • enhance messages with the values of last failed condition
  • drop ifdefs from generic code that's used only from tests/*.c and move it there

Get rid of buffer_head usage in superblock writing/reading

Not claimed — no patches yet — Not in kernel yet

The structure buffer_head is used for reading and writing the superblocks. The interface is easy to use: it just says 'read/write this offset and give me back the data'. The rest of btrfs uses the more modern 'bio' structures, which are also used to implement the buffer_heads. So this is merely lifting the bio calls up to the superblock IO. There is some page and mapping handling involved, but otherwise it should be easy to convert.

New features

Online fsck

Not claimed — initial patch submitted — Not in kernel yet

Online fsck includes a number of difficult decisions around races and coherency. Given that back references allow us to limit the total amount of memory required to verify a data structure, we should consider simply implementing fsck in the kernel.

Initial work by Li Zefan.

Filesystem UUID change - on-line

Not claimed — no patches yet — Not in kernel yet

Allow changing the UUID of a filesystem at mount time. The filesystem UUID is stored in all metadata blocks, so it is necessary to rewrite all of them. Additionally, the device items also contain the filesystem UUID and need to be updated.

The process goes in three phases. First, the new UUID replaces the old one, which is backed up somewhere. This also sets a temporary incompatibility bit that means the filesystem now accepts two UUIDs in the metadata blocks.

The second phase is to rewrite all metadata blocks. Scrub is used for that, i.e. it rewrites the UUID of each block that still has the old one.

The third phase is to drop the incompat bit after verifying that the filesystem is fully converted.

The intermediate state of "2 UUIDs" interacts with other features:

  • device scanning - we need to update the device-stored UUIDs as soon as possible, consistently
  • seeding devices - the seeding devices do some UUID conversion magic, so the uuid= option will probably be forbidden here
  • drop the incompat bit - extend scrub with an option to drop the bit after a full scrub runs without problems
  • possibly others
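
The three phases can be modelled as a small state machine. The Python sketch below is only an illustration of the flow described above: the Filesystem class, its field names and the per-block list are hypothetical, and real metadata blocks, device items and the on-disk incompat bit are reduced to plain attributes.

```python
from typing import List, Optional


class Filesystem:
    """Toy model: one current UUID, an optional old one, and per-block UUIDs."""

    def __init__(self, fsid: str, num_blocks: int) -> None:
        self.fsid = fsid
        self.old_fsid: Optional[str] = None    # backup of the previous UUID
        self.two_uuids_incompat = False        # temporary incompat bit
        self.block_fsids: List[str] = [fsid] * num_blocks

    def begin_change(self, new_fsid: str) -> None:
        """Phase 1: install the new UUID, keep the old one, set the bit."""
        self.old_fsid = self.fsid
        self.fsid = new_fsid
        self.two_uuids_incompat = True

    def block_is_valid(self, index: int) -> bool:
        """While the bit is set, blocks with either UUID are accepted."""
        accepted = {self.fsid}
        if self.two_uuids_incompat:
            accepted.add(self.old_fsid)
        return self.block_fsids[index] in accepted

    def scrub_rewrite(self, limit: Optional[int] = None) -> int:
        """Phase 2: rewrite blocks that still carry the old UUID."""
        rewritten = 0
        for index, fsid in enumerate(self.block_fsids):
            if limit is not None and rewritten >= limit:
                break  # simulates an interrupted scrub
            if fsid != self.fsid:
                self.block_fsids[index] = self.fsid
                rewritten += 1
        return rewritten

    def finish_change(self) -> bool:
        """Phase 3: drop the bit only if every block has the new UUID."""
        if any(fsid != self.fsid for fsid in self.block_fsids):
            return False
        self.two_uuids_incompat = False
        self.old_fsid = None
        return True


if __name__ == "__main__":
    fs = Filesystem("old-uuid", num_blocks=4)
    fs.begin_change("new-uuid")
    fs.scrub_rewrite(limit=2)      # interrupted scrub
    print(fs.finish_change())      # bit must stay set
    fs.scrub_rewrite()
    print(fs.finish_change())
```

The point of the intermediate state is that an interrupted scrub leaves a mountable filesystem, at the cost of the interactions listed above.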

block devices 'btrvols'

Not claimed — no patches yet — Not in kernel yet

Likely to be rejected: layering violation, can be replaced by scsi target backed by file

Allow block devices to be allocated from the filesystem. They take advantage of COW, snapshots, etc but can be formatted and used by other filesystems. This is similar to ZFS's zvols. See for an idea.

Bad block tracking

Not claimed — no patches yet — Not in kernel yet

Currently btrfs doesn't keep track of bad blocks, disk blocks that are very likely to lose data written to them. Btrfs should accept a list in badblocks' output format, store it in a new btree (or maybe in the current extent tree, with a new flag), relocate whatever data the blocks contain, and reserve these blocks so they can't be used for future allocations. Additionally, scrub could be taught to test for bad blocks when a checksum error is found. This would make scrub much more useful; checksum errors are generally caused by the disk, but while scrub detects afflicted files, which in a backup scenario gives the opportunity to recreate them, the next file to reuse the bad blocks will just start getting errors instead. These two items would match an ext4 feature (used through e2fsck).
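
A minimal sketch of the "accept a list and reserve the blocks" part, in Python. It assumes the list has one decimal block number per line and that block numbers map directly to allocatable units; both function names are hypothetical, and relocation of existing data is not modelled.

```python
from typing import Iterable, List, Set


def parse_badblocks(text: str) -> Set[int]:
    """Parse a list of bad block numbers, one decimal number per line."""
    return {int(line) for line in text.splitlines() if line.strip()}


def allocate_blocks(total_blocks: int, count: int,
                    in_use: Iterable[int], bad: Iterable[int]) -> List[int]:
    """Return the first `count` blocks that are neither in use nor bad."""
    unavailable = set(in_use) | set(bad)
    picked = []
    for block in range(total_blocks):
        if block in unavailable:
            continue
        picked.append(block)
        if len(picked) == count:
            return picked
    raise ValueError("not enough usable blocks")


if __name__ == "__main__":
    bad = parse_badblocks("3\n4\n\n17\n")
    print(allocate_blocks(total_blocks=32, count=4, in_use={0, 1}, bad=bad))
```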

Clear unallocated space

Not claimed — alpha prototype, not posted yet — Not in kernel yet

This is similar to TRIM on SSD devices, but for any device. Simply go through unallocated space and overwrite it with zeros (or maybe with some poison pattern so we could recognize when data from free blocks ends up being used). The trim code could be enhanced to either submit a TRIM command or write a zeroed block to the disk. As trim is supported by more filesystems, a new REQ_ flag could be introduced in the block layer to perform the zeroing, so other filesystems can also enhance their trim support to clear the free space.

Note: Preliminary patches implementing zeroing exist, not yet posted. The interface needs a cleanup, but it basically works.

Compressed file size

David Sterba — [1] — Not in kernel yet

Find the actual size of a compressed file. This has evolved into a more generic solution: the compressed size will be returned via the FIEMAP ioctl call instead of a btrfs-specific ioctl.

Allow to access/fixup/delete damaged files from the filesystem

Not claimed — no patches yet — Not in kernel yet

A recovery mode that would make it possible to delete the remainders of damaged files without the need to copy the data out and recreate the filesystem. Such damage may arise from a lost device in 'single' or 'raid0' modes. This operation could be implemented as a special mode of fsck or as an operation on a mounted filesystem. The level of damage and the potential recovery vary, for example wiping the files completely, removing the broken extents, ignoring checksum mismatches or forcing a checksum rewrite.

Content based storage

Not claimed — no patches yet — Not in kernel yet

Content based storage would index data extents by a large (256bit at least) checksum of the data contents. This index would be stored in one or more dedicated btrees and new file writes would be checked to see if they matched extents already in the content btree.

There are a number of use cases where even large hashes can have security implications, and content based storage is not suitable for use by default. Options to mitigate this include verifying contents of the blocks before recording them as a duplicate (which would be very slow) or simply not using this storage mode.

There are some use cases where verifying equality of the blocks may have an acceptable performance impact. If hash collisions are recorded, it may be possible to later use idle time on the disks to verify equality. It may also be possible to verify equality immediately if another instance of the file is cached. For example, in the case of a mass web host, there are likely to be many identical instances of common software, and constant use is likely to keep these files cached. In that case, not only would disk space be saved, it may also be possible for a single instance of the data in cache to be used by all instances of the file.

If hashes match, a reference is taken against the existing extent instead of creating a new one.

If the checksum isn't already indexed, a new extent is created and the content tree takes a reference against it.

When extents are freed, if the checksum tree is the last reference holder, the extent is either removed from the checksum tree or kept for later use (configurable).

Another configurable is reading the existing block to compare with any matches or just trusting the checksum.
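
The write and free paths described above can be sketched as an in-memory model. This Python example is an assumption-laden illustration, not a proposed on-disk design: SHA-256 stands in for the "large checksum", dictionaries stand in for the btrees, and the ContentStore class with its verify and keep_unreferenced switches is hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-addressed extent store with reference counts."""

    def __init__(self, verify: bool = True, keep_unreferenced: bool = False) -> None:
        self.verify = verify                        # compare data on a hash match
        self.keep_unreferenced = keep_unreferenced  # keep extents only the index holds
        self.index = {}      # digest -> extent id (the "content tree")
        self.extents = {}    # extent id -> data
        self.refs = {}       # extent id -> reference count
        self._next_id = 0

    def write(self, data: bytes) -> int:
        digest = hashlib.sha256(data).digest()
        existing = self.index.get(digest)
        if existing is not None:
            if not self.verify or self.extents[existing] == data:
                self.refs[existing] += 1            # reuse the existing extent
                return existing
        extent_id = self._next_id
        self._next_id += 1
        self.extents[extent_id] = data
        self.refs[extent_id] = 1                    # the writer's reference
        if existing is None:
            self.index[digest] = extent_id
            self.refs[extent_id] += 1               # the index's reference
        return extent_id

    def free(self, extent_id: int) -> None:
        self.refs[extent_id] -= 1
        digest = hashlib.sha256(self.extents[extent_id]).digest()
        indexed = self.index.get(digest) == extent_id
        if indexed and self.refs[extent_id] == 1 and not self.keep_unreferenced:
            del self.index[digest]                  # index was the last holder
            self.refs[extent_id] -= 1
        if self.refs[extent_id] == 0:
            del self.extents[extent_id]
            del self.refs[extent_id]


if __name__ == "__main__":
    store = ContentStore()
    first = store.write(b"same contents")
    second = store.write(b"same contents")
    print(first == second, len(store.extents))
```

With verify=False the store trusts the checksum alone, which is the faster but riskier of the two configurations mentioned above.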

This work is related to stripe granular IO, which would make it possible to configure the size of the extent indexed.

Hybrid Storage

Not claimed — no patches yet — Not in kernel yet

It should be possible to use very fast devices as a front end to traditional storage. The most used blocks should be kept hot on the fast device and they should be pushed out to slower storage in large sequential chunks.

The latest generation of SSD drives can achieve high IOPS rates for both reading and writing. They will be very effective front end caches for slower (and less expensive) spinning media. A caching layer could be added to Btrfs to store the hottest blocks on faster devices to achieve better read and write throughput. This cache could also make use of other spindles in the existing spinning storage; for example, why not store frequently used random-heavy data mirrored on all drives if space is available? A similar mechanism could store frequent random read patterns (such as booting a system) as a series of sequential blocks in this cache.

Upconvert new directories to subvolumes

Not claimed — no patches yet — Not in kernel yet

For several reasons it would be really convenient if there was a way to mark a btrfs directory such that creating and removing directories inside the marked directory would automatically be converted to subvolume creation and destruction.

  • NFS4 particularly pivots on file system boundaries, and it seems to treat subvolumes-in-place as such boundaries
  • doing this to /home is another opportunity if you have transient accounts created by scripts/programs you cannot easily change
  • Other uses include creating virtual machine sets via tarballs and such

Original proposal: [2]

The core logic would be to upconvert any legal rmdir to a subvol delete if it's applied to a subvol. Yes, this _would_ remove non-empty subvols, that would be the point. Then any mkdir in that directory would create a subvol instead of a directory.

  • Normal files in the directory would be unchanged
  • And a normal directory moved into the directory would remain a normal directory for obvious reasons
  • And a subvol moved out of the directory (can you even do that?) would remain a subvol for equally obvious reasons
  • It's implicit that the non-superuser create/remove subvol operation would be legal for such a directory

Proposed implementation:

Use the 'T' bit of file attributes (GETFLAGS ioctl): when it is applied to a directory, mkdir becomes subvolume creation and rmdir becomes subvolume destruction.

Enhancements to existing features

Scrub free space

Not claimed — no patches yet — Not in kernel yet

Currently only disk blocks that are allocated by the filesystem and in use are checked. Checking for read errors on unallocated blocks can help identify hardware that is going to fail in the near future.

This could be merged with the 'Clear unallocated space' project as a special case.

Per-subvolume mount options

David Sterba — no patches yet — Not in kernel yet

Allow to specify mount options that apply only to the given subvolume.

Block group reclaim

Not claimed — no patches yet — Not in kernel yet

The split between data and metadata block groups means that we sometimes have mostly empty block groups dedicated to only data or metadata. As files are deleted, we should be able to reclaim these and put the space back into the free space pool.

(See also #Fine-grained balances, #Block_group_reclaim)

Take recursive snapshots

Not claimed — no patches yet — Not in kernel yet

Snapshots of volumes don't include nested subvolumes; allowing this would make it easier to make sure that a snapshot contains everything the source appears to contain.

Even with userspace help, it isn't currently possible to do recursive snapshots that are atomic or read-only. A new ioctl would solve that.

Compression enhancements

David Sterba — no patches yet — Not in kernel yet

  • LZMA -- very slow compression speed with high compression ratio, aimed for backups and infrequently accessed data; higher ZSTD levels are supposed to address this usecase though
  • enhanced container format -- currently a page-sized chunk is compressed at a time, enhance the container header with version and flags to store the chunk length or the actual way how the chunks were stored
  • longer compression chunk -- now it's 128k to limit the read time of random data in the middle of the file

Related to that, all the bootloaders that support btrfs should support the enhanced container format and compression algorithms.

Per-object default mount-options / btrfs-properties / chattr(1) attributes and reasonable userland defaults

The following ideas more or less belong together or have common motivations. They are also partially based on the assumption that we have the following two classes:

  • mount options apply to anything below that respective mountpoint only, so for example if the same subvol is mounted ro as /foo and rw as /bar, one can write on it via /bar, but not via /foo,... and that these options would (as far as possible/implemented) also apply to any nested subvolumes below the one mounted.


  • properties (as in btrfs-property(8)) however are really per object (subvols, filesystems, device, inode), so for example a subvol having been marked ro with btrfs-property and then mounted ro as /foo and rw as /bar (or indirectly as being a nested subvol that is mounted like that), would in both cases be ro... and that these btrfs-properties are not "inherited" by child subvols of the one where the property was set upon.
  • attributes (as in chattr(1)) are again really per object (this time: inodes only).

Both classes are quite useful in different cases, therefore the following separate ideas may be worth being implemented:

Handle (or at least document) conflict cases

Right now it is completely undocumented, AFAICS, what happens if different mechanisms specify different options, e.g.

  • mount option: compress lzo
  • btrfs property: compression no
  • chattr attribute: compression yes

Especially chattr attributes and btrfs properties seem to overlap quite a bit, so perhaps a first part of a solution is to do away with chattr attributes altogether, or just make them set the respective properties, if this isn't already the case anyway; it should then also be documented. The second part would be to properly document what actually happens in case of a conflict (which one wins?).

Whether the following really makes sense or not would need to be thoroughly analysed, but it may make sense to somehow allow controlling what happens in case of conflicting properties and mount options. For example:

  • A subvolume is marked compress=no via a property, but mounted compress=zlib via mount option.
  • Similar example with ro flag.

Possible solutions:

  • Simply let one of the two always override (probably the btrfs-properties).
  • Add a (possibly even per subvol) mountoption that allows to specify which of the two should win, maybe even per option, e.g. something like:
meaning that for the mountpoint where /foo is mounted, the mountoptions override in the case of the compress option, the properties override in case of the ro option.
Alternatively it would perhaps be even better to set the override rules via properties, which would e.g. allow to set fallback-defaults for the whole filesystem, or a mandatory-default for the whole filesystem, basically to control inheritance.
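
A per-option precedence rule of this kind can be sketched as a small lookup. The Python below is purely illustrative: resolve_option, the "mount"/"property" labels and the dictionaries are hypothetical and do not correspond to any existing mount option or property syntax.

```python
from typing import Any, Dict


def resolve_option(name: str,
                   mount_opts: Dict[str, Any],
                   properties: Dict[str, Any],
                   winner: Dict[str, str],
                   default_winner: str = "property") -> Any:
    """Return the effective value of one option.

    `winner` maps an option name to "mount" or "property"; it only
    matters when both sources set the option.
    """
    in_mount = name in mount_opts
    in_props = name in properties
    if in_mount and in_props:
        source = winner.get(name, default_winner)
        return mount_opts[name] if source == "mount" else properties[name]
    if in_mount:
        return mount_opts[name]
    if in_props:
        return properties[name]
    return None  # option not set anywhere


if __name__ == "__main__":
    mount_opts = {"compress": "zlib", "ro": False}
    properties = {"compress": "no", "ro": True}
    rules = {"compress": "mount", "ro": "property"}
    for option in ("compress", "ro"):
        print(option, resolve_option(option, mount_opts, properties, rules))
```

Storing the rules themselves as properties, as suggested above, would only change where the `winner` mapping comes from.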

As you can see, this would all need to be properly thought through, which I, the proposer of this, haven't done here for that part of the whole feature request.

Support inheritance of properties

E.g., AFAIU, chattr sets the attributes only per file, so newly added files won't get e.g. "noatime" set. Properties, in a way, have inheritance, e.g. by setting the property for the subvolume it would be applied to everything below (again the question: what if an object below sets another property?).

So maybe some inheritance mechanism would make sense, eventually. E.g. setting "noatime" on a dir, and have that inherited by all newly created files below. As well as ways to "break out" of inheritance.

Add more btrfs-properties

Right now we have ro, label and compression. We also have chattr(1) attributes, A (for noatime), C (nodatacow) and a few more (see btrfs-mount(5) at the very bottom). As mentioned above, this is partially ambiguous with what btrfs-properties have.

(Some) mount options and attributes should be supported as properties, for example noatime, nodatacow and nodatasum. Maybe even autodefrag/noautodefrag, if these would work for e.g. certain subvolumes/dirs/files only and not just for the fs as a whole; discard/nodiscard could make sense on a per-device basis, and so on.

Using chattr attributes or mount options doesn't really scale or work in all cases. E.g. consider that you want to make sure a snapshot is noatime. You could of course mount the snapshot noatime, but then, AFAIU, it would be noatime only at that specific mountpoint, not e.g. when accessed from another mountpoint or as a nested subvolume via some parent subvolume (e.g. via the mounted top-level subvolume). chattr in turn would require setting the A attribute on all of the snapshot's files, and that probably remains set when sending/receiving that snapshot or making another rw snapshot of it, which may or may not be desired. It would seem much better if one could simply mark the subvol as noatime via a btrfs-property.

Easy to add, easy to remove, fast, and effective at all places the subvolume appears (at mountpoints, or as nested svol).

One can easily construct similar examples for other options (like said, nodatacow, nodatasum, etc.)

I would further propose, to add properties that match the behaviour of the nodev,noexec,nosuid general mount options.

This is based on an idea brought to my attention by Duncan on the mailing list: when snapshots are made of the system (i.e. /usr/, /dev/, /etc/ and so on), these snapshots may be accessible by any user (e.g. at /snapshots/), which may be intentional and not just by accident. The problem is that these snapshots may contain old files that pose at least the following (security) issues:

  • Old and insecure permissions/owners/ACLs on any file. E.g. /etc/shadow may accidentally have been world-readable; this was noticed and corrected in the "main" file, but it is still open in the already existing snapshots.
  • Even though udev typically mounts a devtmpfs on /dev, there may still be systems not using udev, or systems that do use udev but have manually created some important device files in /dev below the devtmpfs mount. Same as above: if any of these had bad permissions, they would have been snapshotted and may be abused by any user that can access the snapshot.
  • Or take setuid binaries: these may have seen security updates in the "main" version of the file, but the snapshot would still hold the vulnerable one. If accessible by users, that would be bad.
  • Or take systems that have taken measures so that users cannot run/create their own binaries. If an undesired binary is removed on such a system, it would still be in the snapshot, where people could execute it.

Having nodev,noexec,nosuid properties would at least help against all of the above except the first one. Still better than nothing. Again, the main idea of having them as properties and not just as mount options is to make sure that wherever the subvolume snapshot appears (mountpoint, nested), it's "protected".

Having these three options as subvolume properties might even be useful in other scenarios. Consider e.g. areas (which could be made subvolumes) where less trusted people should be allowed to write e.g. executable files, but where one doesn't want to allow them to actually execute them.

Another idea would be to allow for user/nouser and user= options to be set at subvolumes. On systems with many subvolumes this might allow to keep /etc/fstab small, while still allowing ordinary users to mount "their" subvols.

Make reasonable use of newly added btrfs properties

If the above gets implemented, e.g. as noatime, noexec, nodev and nosuid properties, one should consider whether it makes sense to have these applied automatically in some cases.

The use case I'm thinking of is snapshots, especially ro snapshots, maybe even rw ones. For the reasons outlined above, these could be dangerous in terms of security, so maybe btrfs subvolume snapshot would by default set the snapshot's subvolume to be noexec,nodev,nosuid, and it would perhaps offer an extra switch not to do so.

Further, there's the well-known issue that when there are many snapshots and noatime is not in place, accessing them causes atime updates, followed by a lot of metadata writes that could potentially eat up quite some space. For that reason it may make sense to have snapshot subvolumes marked with a noatime property by default, again offering a switch not to do so. Whether it makes sense for ro snapshots depends on whether ro snapshots have atime updates at all.

Default [subvolume] mount options

This refers to what e.g. ext2/3/4 have in the form of tune2fs' -E mount_opts=string and the -o option.

The idea is that if mount(8) mounts a btrfs fs|svol, the default mount options are read and used unless overridden somehow for that mountpoint.

There are some motivations for this:

  • Mount options, AFAIU, are "inherited" by any nested subvolume. So unless property inheritance (as proposed above) gets implemented, this could be kind of abused for something similar.
  • Similar to the motivation for the noatime, nodev, noexec, nosuid, etc. properties above (i.e. making sure that wherever the object appears, it has these properties), it would allow one to make sure that when subvol foo is mounted (wherever) it gets e.g. nodev set. Unlike properties, however, it would still allow one to make intentional exceptions by mounting with the "opposite" mount option, thereby overriding the default mount option.
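The override behaviour described above could be modelled roughly as follows. This is only a sketch of the intended semantics, not an existing btrfs or mount(8) interface: the function name effective_options and the idea of a stored per-subvolume default list are assumptions made for illustration.

```python
# Hypothetical model: default options stored with a subvolume are merged
# with the options given explicitly at mount time; explicit options win.

# Boolean mount options and their opposites, as understood by mount(8).
OPPOSITES = {
    "nodev": "dev", "dev": "nodev",
    "noexec": "exec", "exec": "noexec",
    "nosuid": "suid", "suid": "nosuid",
    "noatime": "atime", "atime": "noatime",
}


def effective_options(defaults, explicit):
    """Return the options in effect: defaults first, later options override."""
    result = []
    for opt in list(defaults) + list(explicit):
        opposite = OPPOSITES.get(opt)
        if opposite in result:
            result.remove(opposite)  # a later option cancels its opposite
        if opt not in result:
            result.append(opt)
    return result


stored = ["nodev", "noexec", "nosuid"]      # assumed stored defaults
print(effective_options(stored, []))        # defaults apply unchanged
print(effective_options(stored, ["exec"]))  # intentional exception
```

With a property, by contrast, the second call would have no way to drop noexec, which is exactly the difference the second bullet points out.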

Cancellable operations

Not claimed — no patches yet — Not in kernel yet

There are a few operations that may take a long time, cause umount to stall or slow down the filesystem. It is possible, and would be nice, to add support for cancelling to device del or filesystem balance.

There are two ways to cancel an operation:

  • synchronous – when the operation is called from userspace and all the processing is done in the context of this process (as is the case for btrfs fi defrag FILE), pressing Ctrl-C will raise a signal, and this will be checked inside the defrag loop. It should be discussed whether to allow Ctrl-C or only kill -9.
  • asynchronous – when the processing is done in a kernel thread, this would need the same kind of command support that scrub and balance have

There are more places where a check whether the filesystem is being unmounted would improve responsiveness, like during free space cache writeout. However, one has to be sure that cancelling such operations is safe.
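The synchronous case boils down to a work loop that checks for a pending cancel request between units of work. The following is a userspace analogy only, assuming a hypothetical CancelFlag helper and a placeholder for the per-item work; the kernel would check for pending signals inside its own loop instead.

```python
import signal


class CancelFlag:
    """Records that a cancel request (here: SIGINT from Ctrl-C) has arrived."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        self.requested = True


def process_extents(extents, flag):
    """Process items one by one, checking for cancellation between items.

    Returns the number of items that were fully processed.
    """
    done = 0
    for _extent in extents:
        if flag.requested:
            break  # stop at a safe point, never in the middle of an item
        done += 1  # placeholder for the real per-item work
    return done


if __name__ == "__main__":
    flag = CancelFlag()
    signal.signal(signal.SIGINT, flag.handler)  # Ctrl-C only sets the flag
    print(process_extents(range(1000), flag))
```

Checking only between items is the point of the design: the operation stops at a boundary where the on-disk state is consistent.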

Unlimited extended attributes

Not claimed — no patches yet — Not in kernel yet

Currently the size of an extended attribute's value must fit into the inline space (~3900 bytes with a 4k leaf size), while other filesystems do not limit the size. Add a new b-tree item to hold the xattr value in extents.


Not claimed — no patches yet — Not in kernel yet

Implement an encryption scheme similar to that of ZFS, which features:

  • Encryption is integrated with the btrfs command set. Like other btrfs operations, encryption operations such as key changes and rekey are performed online.
  • You can use your existing storage pools as long as they are upgraded. You have the flexibility of encrypting specific file systems.
  • Encryption is inheritable to descendent file systems. Key management can be delegated through delegated administration.
  • Data is encrypted using the ciphers and block modes implemented in the kernel.
  • Escrow passphrase support so it can be used for enterprise desktop computers and laptops.

The encryption capability is embedded into the I/O pipeline. During writes a block may be compressed, encrypted, checksummed and then deduplicated in that order. The policy for encryption is set at the dataset level when datasets (file systems or VOLs) are created.

The wrapping keys provided by the user/administrator can be changed at any time without taking the file system off line. The default behaviour is for the wrapping key to be inherited by any child data sets. The data encryption keys are randomly generated at dataset creation time. Only descendant datasets (snapshots and clones) share data encryption keys. A command to switch to a new data encryption key for the clone or at any time is provided — this does not re-encrypt already existing data, instead utilising an encrypted master-key mechanism.
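The reason a wrapping key can be changed without re-encrypting data is that only the small data encryption key is stored encrypted under it. A toy sketch of that relationship follows. The cipher here is a deliberately simple SHA-256 keystream XOR chosen so the example is self-contained; it is NOT secure and is not what ZFS or the kernel use, and the Dataset class and its method names are invented for illustration.

```python
import hashlib
import os


def keystream_xor(key, data):
    """Toy stream cipher: XOR data with a SHA-256 based keystream.

    For illustration only; this is NOT a secure construction.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(a ^ b for a, b in zip(chunk, block))
    return bytes(out)


class Dataset:
    """Holds a random data key, stored only in wrapped (encrypted) form."""

    def __init__(self, wrapping_key):
        data_key = os.urandom(32)  # generated at dataset creation time
        self.wrapped_key = keystream_xor(wrapping_key, data_key)

    def _data_key(self, wrapping_key):
        return keystream_xor(wrapping_key, self.wrapped_key)

    def encrypt(self, wrapping_key, plaintext):
        return keystream_xor(self._data_key(wrapping_key), plaintext)

    decrypt = encrypt  # XOR is its own inverse

    def change_wrapping_key(self, old_key, new_key):
        """Re-wrap the data key; existing ciphertext is left untouched."""
        self.wrapped_key = keystream_xor(new_key, self._data_key(old_key))


ds = Dataset(b"old passphrase")
blob = ds.encrypt(b"old passphrase", b"file contents")
ds.change_wrapping_key(b"old passphrase", b"new passphrase")
print(ds.decrypt(b"new passphrase", blob))
```

After the key change, the old ciphertext still decrypts because the data key itself never changed; only its wrapped form did.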

Random write performance

Not claimed — somehow done by the autodefrag mount option — Not in kernel yet

Random writes introduce small extents and fragmentation. We need new file layout code to improve this and defrag the files as they are being changed.

Improve subvolume usability

Accept arbitrary directories with the mount-time subvol flag

subvol= is a mount option to mount a subvolume other than the default. It currently only allows subvolumes; but vfsmounts can start at any path, allowing to mount any directory.

Note from kdave: this is intentionally disabled, see the patch that added the subvol test. It makes the snapshotting semantics unclear.
Gabriel: Thanks for the info. The snapshot ioctl, and other volume ioctls, need to validate that they have a subvolume rather than any vfsmount. Currently this check is in btrfs-progs but missing on the kernel side; the subvol= check didn't address the root cause of the bogus semantics and the snapshot ioctl is still problematic with bind mounts.

Snapshot arbitrary directories

Currently the most efficient way to snapshot a non-subvolume is either:

  • to snapshot its parent volume and remove the extra bits.
  • to make a reflink copy

Neither option is as efficient as it could be. copy_root should be updated to copy from a non-root directory (copy_root_at_dir).

Hide the subvolume/directory distinction

That distinction in data structures makes snapshotting efficient, but it may not be necessary to expose it to userspace. Transparent snapshots would encode the subvolume rootid (which non-transparent snapshots expose in st_dev) by reserving bits from the inode number. rename() and link() would reuse copy_root_at_dir when crossing a subvolume boundary.
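Reserving bits of the inode number for the subvolume root id could look like the sketch below. The 16/48 bit split is purely an assumption for the example (the text does not propose concrete widths), and encode_ino/decode_ino are hypothetical names.

```python
# Assumed split of a 64-bit inode number; the proposal does not specify one.
ROOTID_BITS = 16
OBJECTID_BITS = 64 - ROOTID_BITS
OBJECTID_MASK = (1 << OBJECTID_BITS) - 1


def encode_ino(rootid, objectid):
    """Pack a subvolume root id and a per-subvolume object id into one number."""
    if not 0 <= rootid < (1 << ROOTID_BITS) or not 0 <= objectid <= OBJECTID_MASK:
        raise ValueError("value does not fit in the reserved bits")
    return (rootid << OBJECTID_BITS) | objectid


def decode_ino(ino):
    """Split a packed inode number back into (rootid, objectid)."""
    return ino >> OBJECTID_BITS, ino & OBJECTID_MASK


ino = encode_ino(rootid=257, objectid=1234)
print(hex(ino), decode_ino(ino))
```

Any such split trades the maximum number of subvolumes against the maximum number of inodes per subvolume, which is the main design question the idea leaves open.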

(Atomically) convert directories into subvolumes and vice versa

Probably not the most important feature, but if it is easy to implement (or even technically possible at all), it may make systems easier to administer: directories could be converted atomically into a subvolume and vice versa (that is, a subvolume being "merged" atomically into its parent subvolume). In particular, this would avoid stopping any service, moving the dir/subvol, creating a subvol/dir of its name, reflink-moving the data from/to the subvol, and removing the dir/subvol. An example would be a system with one subvolume, including /var/lib/postgresql, where the administrator decides to have that directory moved into a separate subvolume at that location (for whatever reason, e.g. to exclude it from snapshots or to set nodatacow on that subvolume), ideally without requiring any downtime of PostgreSQL.

Limiting btree failure domains

Not claimed — no patches yet — Not in kernel yet

One way to limit the time required for filesystem rebuilds is to limit the number of places that might reference a given area of the drive. One recent discussion of these issues is Val Henson's chunkfs work.

There are a few cases where chunkfs still breaks down to O(N) rebuild, and without making completely separate filesystems it is very difficult to avoid these in general. Btrfs has a number of options that fit in the current design:

  • Tie chunks of space to Subvolumes and snapshots of those subvolumes
  • Tie chunks of space to tree roots inside a subvolume

But these still allow a single large subvolume to span huge amounts of data. We can either place responsibility for limiting failure domains on the admin, or we can implement a key range restriction on allocations.

The key range restriction would consist of a new btree that provided very coarse grained indexing of chunk allocations against key ranges. This could provide hints to fsck in order to limit consistency checking.

Feature parity, upstream changes

Projects that bring to btrfs the generic changes and new interfaces provided by VFS, MM or other filesystems.

Implement new FALLOC_FL_* modes

Not claimed — no patches yet — Not in kernel yet

Note: Depends on the generic implementation that is on its way to the kernel. An extension to fallocate that chops the given range from a file without leaving a hole; the offsets to the right of the range are shifted.

  • FALLOC_FL_ZERO_RANGE -- done in 4.16
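The "chop a range without leaving a hole" behaviour described in the note corresponds to what Linux calls FALLOC_FL_COLLAPSE_RANGE. Its effect on file contents can be modelled on a byte string as below. This is a model of the semantics only, not a call into fallocate(2); the real operation also has alignment and range restrictions that are ignored here, and both function names are invented.

```python
def collapse_range(data, offset, length):
    """Remove data[offset:offset+length]; later bytes shift left, no hole."""
    if offset < 0 or length < 0 or offset + length > len(data):
        raise ValueError("range must lie inside the file")
    return data[:offset] + data[offset + length:]


def punch_hole(data, offset, length):
    """For contrast: zero the range but keep the file size and offsets."""
    if offset < 0 or length < 0 or offset + length > len(data):
        raise ValueError("range must lie inside the file")
    return data[:offset] + bytes(length) + data[offset + length:]


content = b"AAAABBBBCCCC"
print(collapse_range(content, 4, 4))  # b'AAAACCCC'
print(punch_hole(content, 4, 4))      # b'AAAA\x00\x00\x00\x00CCCC'
```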

Use fs/crypto API

Not claimed — no patches yet — Not in kernel yet

A generic VFS-level crypto API has been available since 4.6; port btrfs to it.

Support SHUTDOWN ioctl

Not claimed — no patches yet — Not in kernel yet

Add the ioctl that will forcibly shut down the filesystem.

Integration with userspace tools

Rsync integration

Not claimed — no patches yet — Not in kernel yet

Now that we have code to efficiently find newly updated files, we need to tie it into tools such as rsync and dirvish. (For bonus points, we can even allow rsync to use btrfs's builtin checksums and, when a file has changed, tell rsync _which blocks_ inside that file have changed. Would need to work with the rsync developers on that one.)

Update rsync to preserve NOCOW file status.

Reference to other backup tools

Not exactly an implementation of this, but I (A. v. Bidder) have patched dirvish to use btrfs snapshots instead of hardlinked directory trees. Discussion archived on the Dirvish mailing list.

Snapshot-aware updatedb/locate

Not claimed — no patches yet — Not in kernel yet

It is desirable to be able to locate content inside snapshots. At present, it happens often that the daily updatedb has a multiplied amount of work as a result of the number of snapshots. This is at least until the administrator configures updatedb to ignore the snapshots entirely. Some systems have hundreds of snapshots resulting in updatedb requiring a lot more than one day to complete. If the updatedb tool were aware of the snapshots (also whether or not they are read-only) then, perhaps utilising the btrfs-send logic, it could simply query the changed content, greatly reducing the amount of work required to update the database. An optional flag for updatedb could allow it to ignore a volume's snapshots. An optional flag for locate could allow a user to specify that they also want results from a volume's snapshots.

Make bootloaders aware of incompat bits and features

Not claimed — no patches yet — Not in kernel yet

Bootloaders (grub2, syslinux) that support btrfs do not check the incompatibility bits and boot may fail due to lack of support, eg. compression, skinny metadata, no-extent-hole, etc. The bootloader should verify that there are no unknown bits and at least issue a warning.
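The check itself is a mask operation on the superblock's incompat flags. A minimal sketch follows, written in Python for readability even though a bootloader would do this in C. The bit positions in the table are given for illustration and should be taken from the btrfs on-disk format definitions rather than from this example; the function names are hypothetical.

```python
# Illustrative subset of incompat feature bits; a real bootloader must take
# the values from the btrfs on-disk format headers.
INCOMPAT = {
    "compress-lzo": 1 << 3,
    "skinny-metadata": 1 << 8,
    "no-holes": 1 << 9,
}


def unsupported_bits(incompat_flags, supported_mask):
    """Return the incompat bits set on disk that the reader does not know."""
    return incompat_flags & ~supported_mask


def check_bootable(incompat_flags, supported_mask):
    """Return a warning string if unknown bits are present, else 'ok'."""
    unknown = unsupported_bits(incompat_flags, supported_mask)
    if unknown:
        return f"warning: unsupported incompat bits {unknown:#x}, boot may fail"
    return "ok"


supported = INCOMPAT["compress-lzo"] | INCOMPAT["skinny-metadata"]
on_disk = INCOMPAT["skinny-metadata"] | INCOMPAT["no-holes"]
print(check_bootable(on_disk, supported))
```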

Audit btrfs code in bootloaders

Not claimed — no patches yet — Not in kernel yet

The bootloaders do not share code with the Linux source tree and usually lack support for new features, or the checks that would at least warn that booting could fail; nor do they regularly backport fixes from Linux. The consequences are obvious.

The bootloaders in question:

  • grub2

Things to audit/fix:

  • bugfixes to code touching the core datastructures
  • check that new features like skinny-metadata or no-holes are handled properly, or
  • ... add warnings if the bootloader does not understand the incompat bits

Send notifications about important events

Marcos Paulo de Souza — no patches yet — Not in kernel yet

Use kobject_uevents to notify userspace about important events. Uevents are what udev uses, so there's an established way that is easier (extra data passed a la shell variables) than raw netlink with a custom protocol. The events are something like:

  • state changes
    • filesystem size changed
    • enospc encountered
    • subvolume created, deleted, cleaned
    • transaction aborted, filesystem readonly
  • whole filesystem jobs
    • scrub started, stopped
    • balance started, stopped
  • recovery actions
    • checksum mismatch detected, and repaired
    • checksum mismatch detected

The environment of the event would contain filesystem UUID, subvolume, path to file, device id, etc where applicable.
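On the receiving side, a uevent arrives as an "action@devpath" header followed by NUL-separated KEY=value strings. A sketch of how a listener might parse such a message is shown below. The parsing reflects the general uevent format, but the sample event and every BTRFS_* key in it are hypothetical, since no such events are defined yet.

```python
def parse_uevent(payload):
    """Split a NUL-separated uevent message into (header, environment)."""
    fields = [field for field in payload.split(b"\0") if field]
    if not fields:
        raise ValueError("empty uevent payload")
    header = fields[0].decode()
    env = {}
    for field in fields[1:]:
        key, _, value = field.decode().partition("=")
        env[key] = value
    return header, env


# Hypothetical event; none of the BTRFS_* keys below exist in the kernel.
sample = (
    b"change@/fs/btrfs/01234567-89ab-cdef-0123-456789abcdef\0"
    b"ACTION=change\0"
    b"BTRFS_EVENT=scrub_started\0"
    b"BTRFS_FSID=01234567-89ab-cdef-0123-456789abcdef\0"
    b"BTRFS_DEVID=1\0"
)
header, env = parse_uevent(sample)
print(header)
print(env["BTRFS_EVENT"], env["BTRFS_DEVID"])
```

Because the payload is plain key/value text, the same data is also usable from udev rules without any btrfs-specific parser.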

Userspace tools projects

Collection of ideas or small tasks for btrfs-progs or other relevant utilities.


  • merge functionality of btrfstune
    • for the reasons stated under [3], the functionality should be implemented with the properties rather than a separate subcommand
  • audit code for use of backup superblocks and change it to read only the first unless told otherwise by a command-line option
  • write a mount.btrfs helper to scan devices on the fly (and use libblkid and libmount for that)
  • (claimed) introduce subcommand debug and move there functionality from separate debugging utilities
    • map-logical (not sure if this is not for inspect-internal)
    • dump-image from btrfs-image
  • improve error handling
    • start with the easy ones that are not in the code shared with kernel
    • similar to what's done in kernel code, but beware of differences
  • subvolume listing: overhaul, see Hugo's mail
  • check: make more verbose about the phases, more verbosity levels
  • balance: allow to run it in background (fork) and report status periodically

Provide a library covering 'btrfs' functionality

Omar Sandoval — no patches yet — Not in kernel yet

It would be nice to have a library for manipulating a btrfs filesystem the way the btrfs tool does, and to make it available to other programs or via language bindings.

For Python, there's python-btrfs, a Python library to inspect the internals of an existing filesystem. This library only works with filesystems that are online and mounted, and consists of helper functions to call the IOCTL kernel API and wrapper classes to easily work with the data that is returned.

bedup has a module exposing some of the core btrfs functionality to Python.

There is also a btrfs library for haskell.

Time slider

Not claimed — no patches yet — Not in kernel yet

Auto snapshot and its management tool.

A tool partly covering this idea is snapper. Earlier work from Anand Jain on this is available at GitHub.

Audit tools for usecases with security implications

Some tools have to run under root but still may need some level of restrictions or confinement. An example is receive that could use chroot at some points. Search for more.


  • this very wiki needs updates and removal of obsolete information, work here is highly appreciated as this is one of the frequently consulted sources of information about btrfs
  • document send stream format, RFC, validator, dumper; see far-progs


Not claimed — no patches yet — Not in kernel yet

Add support for btrfs-specific ioctls. Currently raw numbers are printed. Teach strace the btrfs ioctl names, how to parse btrfs ioctl structs and how to print human-readable output for them.
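The first step, turning a raw request number into a name, is a matter of splitting the number into its fields. The sketch below uses the generic Linux ioctl number layout (some architectures use a different one) and a two-entry name table as an excerpt; a real decoder would cover every btrfs ioctl and then parse the argument structs. The function name decode_ioctl is invented.

```python
# Generic Linux layout of an ioctl request number: nr, type, size, direction.
# Some architectures use different field widths; this sketch ignores them.
NR_BITS, TYPE_BITS, SIZE_BITS, DIR_BITS = 8, 8, 14, 2
TYPE_SHIFT = NR_BITS
SIZE_SHIFT = TYPE_SHIFT + TYPE_BITS
DIR_SHIFT = SIZE_SHIFT + SIZE_BITS

BTRFS_IOCTL_MAGIC = 0x94
DIRECTIONS = {0: "none", 1: "write", 2: "read", 3: "read/write"}

# Small excerpt of a name table, keyed by the command number.
NAMES = {1: "BTRFS_IOC_SNAP_CREATE", 8: "BTRFS_IOC_SYNC"}


def decode_ioctl(request):
    """Decode a raw ioctl number; return None if it is not a btrfs ioctl."""
    nr = request & ((1 << NR_BITS) - 1)
    magic = (request >> TYPE_SHIFT) & ((1 << TYPE_BITS) - 1)
    size = (request >> SIZE_SHIFT) & ((1 << SIZE_BITS) - 1)
    direction = (request >> DIR_SHIFT) & ((1 << DIR_BITS) - 1)
    if magic != BTRFS_IOCTL_MAGIC:
        return None
    return {
        "name": NAMES.get(nr, f"unknown btrfs ioctl {nr}"),
        "direction": DIRECTIONS[direction],
        "size": size,
    }


print(decode_ioctl(0x9408))
print(decode_ioctl(0x50009401))
```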

Cleanup projects

This is a short collection of possible cleanups, that would make the code easier to read and to maintain. Please note that cleanups may interfere with patches in-flight doing some real work and merging may be postponed or you may be asked to refresh them on top of other changes.

Please note that pure whitespace and style reformatting changes are not really necessary at this phase of development. They get fixed along with regular changes. Once in a while, a patch that fixes many if not all whitespace errors could work, but otherwise such changes are considered noise.

Trivial changes

Trivial changes that are obviously correct are also accepted, see SubmittingPatches, end of section 5 for more specific information about the whole trivial patch workflow. It's recommended to CC linux-btrfs mailinglist even for the trivial patches that have visible impact.


  • typos in comments: spelling fixes, typos in variable names referenced in comments
  • typos in error messages, or other wording changes that make things more readable and understandable

Helpers for tree enumeration

Not claimed — no patches yet — Not in kernel yet

Writing tree enumeration code requires deep knowledge of the underlying functions and makes some assumptions about possible results. Build some generic helpers or enumeration functions instead, to make the code shorter, more readable, and easier to write.

Use the kernel code in user mode

Not claimed — no patches yet — Not in kernel yet

The user mode utilities have a local copy of the kernel code, with some small adjustments for running in user mode. The fork of the sources was quite a while ago, so many kernel features are now missing in user mode. Also maintaining two copies is burdensome. Expand the wrappers instead so the true kernel code can be used. These helpers could also generically make use of the upcoming readahead API.

Remove unused parameters (in general)

Not claimed — no patches yet — Not in kernel yet

Eg. like this [1] or [2] . To find more, enable -Wunused-parameter in scripts/Makefile.extrawarn and run

make W=1 fs/btrfs/

The output is noisy; you can comment out the other warnings or filter it with grep.

Make fewer functions inline to help identifying them on the stack

Not claimed — no patches yet — Not in kernel yet (at the end)

As an aside, this is why XFS uses noinline for most of its static functions - so that stack traces are accurate when a problem occurs. Debuggability of complex code paths is far more important than the small speed improvement automatic inlining of static functions gives...

Good candidates appear when one tries to analyze a stack trace and does not see a function being called although it appears on the stack. Then it's necessary to look into all such functions (and maybe repeat the whole exercise from there). Short wrappers or small simple helpers are not good candidates.

The functions should be tagged with noinline_for_stack.

Move rcu_string out of btrfs to lib/

Omar Sandoval — patch in development — Not in kernel yet

rcu-string.h implements helpers and wrappers around RCU-friendly strings. This is a generic piece of code and should live in the generic library (linux.git/lib/). The task involves documenting API and pushing through lkml and updating according to the feedback.

Unbound or ongoing projects


The userbase is growing, we need to improve the documentation. The project ("official") sources comprise:

  • this wiki
  • btrfs-progs.git/Documentation - commandline tool man pages
  • btrfs-progs.git/ - help texts in the commandline tools, error messages

The documentation has moderate visibility and low impact on stability, so the patches get merged quickly. Sending patches the same way as for the code is preferred, as it keeps track of the author (credits) and makes it easy for the maintainers to add the patches to git trees.

Ideas for fixes:

  • spelling, wording, clarifying
  • enhancing terse texts
  • usage examples (wiki, manpages)

Note: keeping consistent look of the documentation may not be easy, but let's try it.

Offline fsck

  • Introduce semantic checks for filesystem data structures
  • Limit memory usage by triggering a back-reference only mode when too many extents are pending
  • Add verification of the generation number in btree metadata

Test Suite

  • new features need support in xfstests
  • bugfixes that are accompanied with a reproducer should be converted to a xfstests
  • userspace testsuite tasks
  • add more tests for existing features
    • qgroups
    • send/receive

Static checkers

The idea of this project is to run static checkers on the btrfs source code and identify issues to fix.

There are several static source code checkers that may point out code deficiencies that need fixing. Some of them can be used for the Linux kernel:

  • sparse - provides a set of annotations designed to convey semantic information about types, such as what address space pointers point to, or what locks a function acquires or releases
  • smatch - similar to sparse, enhanced set of capabilities
  • clang static analyzer - not really tailored for linux kernel, but still usable

Please note that the level of false positives varies and not every issue reported is an actual bug. A review is always required.


Free inode number cache

Li Zefan — complete — In kernel 3.0

As the filesystem fills up, finding a free inode number will become expensive. This should be cached the same way we do free blocks.

NFS support

Josef Bacik — complete — In kernel 3.5

Btrfs currently has a sequence number the NFS server can use to detect changes in the file. This should be wired into the NFS support code.

Changing RAID levels

Ilya Dryomov — complete — In kernel 3.3

We need ioctls to change between different raid levels. Some of these are quite easy -- e.g. for RAID0 to RAID1, we just halve the available bytes on the fs, then queue a rebalance.

Contains "Balance operation progress" project.

Device IO Error recording

Stefan Behrens — complete — In kernel 3.5

Items should be inserted into the device tree to record the location and frequency of IO errors, including checksumming errors and misplaced writes.

Forced readonly mounts on errors

various people — complete — In kernel 3.4

The sources have a number of BUG() statements that could easily be replaced with code to force the filesystem readonly. This is the first step in being more fault tolerant of disk corruptions. The first step is to add a framework for generating errors that should result in filesystems going readonly, and the conversion from BUG() to that framework can happen incrementally.

Backref walking utilities

various people — complete — In kernel 3.2

Given a block number on a disk, the Btrfs metadata can find all the files and directories that use or care about that block. Some utilities to walk these back refs and print the results would help debug corruptions.

Given an inode, the Btrfs metadata can find all the directories that point to the inode. We should have utils to walk these back refs as well.

Snapshot aware defrag

Li Zefan, Liu Bo — complete — In kernel 3.9

As we defragment files, we break any sharing from other snapshots. The balancing code will preserve the sharing, and defrag needs to grow this as well.

Drive swapping

Stefan Behrens — complete — In kernel 3.8

Right now when we replace a drive, we do so with a full FS balance. If we are inserting a new drive to remove an old one, we can do a much less expensive operation where we just put valid copies of all the blocks onto the new drive.

Support different disk types in the same filesystem

Josef Bacik — commit de1ee92a, different approach, but fixes the problem — In kernel 3.8

Currently the situation is that for I/O write bios, the bio is prepared using latest_dev. bio_add_page() applies all checks against that device. Before submission of the bio, in btrfs_map_bio() such a bio is cloned for each additional RAID mirror to write. The bi_bdev member of such cloned bios is updated. When one of the devices supports only a lower number of pages per bio than the device that was initially used to build the bio, the submission of the bio will cause "bio too big" errors and kernel log messages. The write operation will fail in this case. One possible solution could be to use bio_get_nr_vecs() initially for each device to find the max number of pages per bio for each device. The minimum of these values could then be used to limit the size of bios in submit_extent_page().
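The "minimum over all devices" idea from the last two sentences can be illustrated with a few lines of arithmetic. This is a model of the proposed solution only (the merged fix took a different approach, as the status line says); the device names, limits and both function names are made up.

```python
def common_bio_limit(max_pages_by_device):
    """Largest bio size, in pages, that every device in the mapping accepts."""
    if not max_pages_by_device:
        raise ValueError("at least one device is required")
    return min(max_pages_by_device.values())


def split_into_bios(total_pages, limit):
    """Split a write of total_pages into bio sizes no larger than limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    sizes = []
    remaining = total_pages
    while remaining > 0:
        size = min(remaining, limit)
        sizes.append(size)
        remaining -= size
    return sizes


limits = {"sda": 256, "sdb": 128, "sdc": 64}  # made-up per-device limits
limit = common_bio_limit(limits)
print(limit, split_into_bios(200, limit))
```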

Set/change file system label

Jeff Liu — complete — In kernel 3.9

Set the file system label via ioctl(2); the user can get and set the Btrfs label through btrfs filesystem label [label]

Filesystem object properties

Filipe Manana — mailinglist — Not in kernel yet

Interface to set/get properties of object types like filesystem, subvolume, device, file. The properties are e.g. compression, raid type, cow/nocow status.

Some initial work started by Alexander Block in the past:

Implement O_TMPFILE support

Filipe Manana — complete — In kernel 3.16

There's a special open() flag O_TMPFILE that creates a temporary file in a safe way [4]. Filesystem-specific support is needed.


Chris Mason — complete — In kernel 3.9

The multi-device code needs a raid6 implementation, and perhaps a raid5 implementation. This involves a few different topics including extra code to make sure writeback happens in stripe sized widths, and stripe aligning file allocations.

Metadata blocks are currently clustered together, but extra code will be needed to ensure stripe efficient IO on metadata. Another option is to leave metadata in raid1 mirrors and only do raid5/6 for data.

The existing raid0/1/10 code will need minor refactoring to provide dedicated chunk allocation and lookup functions per raid level. It currently happens via a collection of if statements.

Scrub with RAID 5/6

Miao Xie — complete — In kernel 3.19

The goal of scrubbing is to find and repair disk errors while causing minimal impact to the performance of the filesystem. Therefore a goal is to avoid seek operations, to read the blocks from each device in sorted order. Another goal is to perform the scrubbing quickly, therefore currently for each disk one thread is spawned that deals with the disk independent of the other disks thus without waiting for the other disks. Things are a little bit different in case of RAID 5/6 since you need to read from multiple devices in order to be able to check the parity information. A strategy needs to be found how to scrub RAID 5/6 filesystems efficiently, afterwards this needs to be implemented.
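Why several devices have to be read together is easiest to see with single-parity (RAID5-style) stripes, where the parity block is the byte-wise XOR of the data blocks. The sketch below shows only that check and the matching rebuild; RAID6's second parity uses different arithmetic and is not covered, and the function names are invented for the example.

```python
from functools import reduce
from operator import xor


def xor_parity(blocks):
    """Byte-wise XOR of equally sized blocks (single parity, RAID5 style)."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("blocks must be non-empty and equally sized")
    return bytes(reduce(xor, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """A stripe passes scrub if the stored parity matches the recomputed one."""
    return xor_parity(data_blocks) == parity_block


def rebuild_block(surviving_blocks, parity_block):
    """Recover one missing data block from the survivors and the parity."""
    return xor_parity(list(surviving_blocks) + [parity_block])


d0, d1, d2 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"
parity = xor_parity([d0, d1, d2])
print(stripe_is_consistent([d0, d1, d2], parity))
print(rebuild_block([d0, d2], parity) == d1)
```

Every block of the stripe is needed for the check, so the per-device scrub threads can no longer work fully independently of each other.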

Device replace with RAID 5/6

Miao Xie — complete — In kernel 3.19

See "Scrub with RAID 5/6" since the replace code makes use of the scrub code.

Filesystem UUID change - off-line

Qu Wenruo — complete — In kernel / btrfs-progs 4.1

Change the filesystem UUID on a given filesystem image. This is easier than the on-line variant. Go through all the metadata blocks reachable from the superblock, verify and rewrite the UUID.

Implement new RENAME_* modes

Dan Fuhry — complete — In kernel 4.11

There are new modes of rename syscall.



Yonghong Song — complete — In kernel 4.13

Implement support for the new statx() syscall

Extended file attributes FS_XFLAG_*

David Sterba — complete — In kernel 4.19

Defined in include/uapi/linux/fs.h in the namespace FS_XFLAG_. Extended range of the file attributes, similar to what FS_IOC_GETFLAGS does.

Swap file support

Not claimed — complete — In kernel 5.0

Implement swapfile support on top of the swap-over-nfs infrastructure that has been merged in 3.7. Use the exported API to manage the extents.

There is a patchset (swap-over-nfs) which enhances the swapfile API and btrfs could build swap support on top of the infrastructure. The patchset has been merged into 3.6.

Obsolete requests

The following ideas are no longer needed, or have been subsumed within another piece of work.

Set mount options permanently

Filipe Manana, Hidetoshi Seto — no patches yet — Obsoleted in favour of Project_ideas#Filesystem_object_properties

Set mount options permanently (for ex: compress) like ext4 "tune2fs -O". Two different implementations and approaches were proposed so far:

And related to this, ability to remember compression algorithm and forcefulness per inode:

Background balancing

Ilya Dryomov — no patches yet — Obsoleted in favour of Project_ideas#Block_group_reclaim

A background thread could check in regular intervals if there is enough room to balance the smallest chunk for each RAID type into the existing ones and do so. This would also handle the 'Block group reclaim'-case.

Extend btrfstune to be able to tune more parameters

Sanjeev Mk, Shravan Aras, Gautam Akiwate — no patches yet — Obsoleted in favour of Project_ideas#Filesystem_object_properties

btrfstune currently can be used to update the seeding value. This project would add on to that and make btrfstune a generic tool to tune various FS parameters.

Atomic write API

Chris Mason, Josef Bacik — Atomic IO — Not in kernel yet

The Btrfs implementation of data=ordered only updates metadata to point to new data blocks when the data IO is finished. This makes it easy for us to implement atomic writes of an arbitrary size. Some hardware is coming out that can support this down in the block layer as well.


Not claimed — no patches yet — Not in kernel yet

POSIX.1e ACLs (which btrfs supports already) are rather limited. So it would be nice if btrfs could eventually support RichACLs / NFS4 ACLs, if any specific support for that is needed from the btrfs side. See also RichACLs project site.

Note: the richacl feature is unlikely to get merged
