Project ideas

From btrfs Wiki

Revision as of 10:04, 27 October 2015


Unclaimed projects

If you are actually going to implement an idea/feature, read the notes at the end of this page.

Note that some of the ideas may not be up to date with the current state of the btrfs implementation, and some projects are claimed but show no visible progress. Always ask on IRC or on the mailing list before starting an implementation.

Multiple Devices

IO stripe size and other optimizations

Not claimed — no patches yet — Not in kernel yet

The multi-device code includes a number of IO parameters that do not currently get used. These need tunables from userland and they need to be honored by the allocation and IO routines.

Take device with heavy IO errors offline or mark as "unreliable"

Not claimed — no patches yet — Not in kernel yet

Devices should be taken offline after they reach a given threshold of IO errors. Jeff Mahoney is working on handling EIO errors (among others); this project can build on top of that work.
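The threshold policy can be sketched as a small standalone model. This is only an illustration in Python, not kernel code: the class `DeviceHealth`, the `threshold` tunable and the state names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DeviceHealth:
    """Toy model of one device; not a btrfs structure."""
    name: str
    threshold: int          # hypothetical tunable: errors tolerated before action
    io_errors: int = 0
    state: str = "online"   # "online" or "unreliable"

    def record_io_error(self) -> str:
        """Count one IO error and demote the device once the threshold is reached."""
        self.io_errors += 1
        if self.state == "online" and self.io_errors >= self.threshold:
            self.state = "unreliable"
        return self.state


dev = DeviceHealth("sdb", threshold=3)
states = [dev.record_io_error() for _ in range(4)]
print(states)
```

A real implementation would also have to decide whether the counter decays over time and whether "unreliable" means offline or merely excluded from new allocations.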

Hot spare support

Not claimed — no patches yet — Not in kernel yet

It should be possible to add a drive and flag it as a hot spare.

"Enhanced" (ala RAID 5E / RAID5EE / RAID6E) support

Not claimed — no patches yet — Not in kernel yet

Support a Hot Spare device that is active within the fs but which does not contribute toward capacity. This would result in better performance and less rebalance/rebuild time in the event of a disk failure.

Facility to allow automatic conversion and rebalance when short on reliable/spare storage

Not claimed — no patches yet — Not in kernel yet

If a device is marked as unreliable and there is not enough spare storage to take over the failing device's data automatically (ala hotswap), have a facility available to convert the data to a less-replicated state. Such a facility MUST NOT be "silent" - appropriate warnings/safeguards to prevent a "silently degraded" state should be in place.

False alarm on bad disk - rebuild mitigation

Not claimed — no patches yet — Not in kernel yet

After a device is marked as unreliable, maintain the device within the FS in order to confirm the issue persists. The device will still contribute toward fs performance but will not be treated as if contributing towards replication/reliability. If the device shows that the given errors were a once-off issue then the device can be marked as reliable once again. This will mitigate further unnecessary rebalance. See - "[Drive Resurrection]" as an example of where this is a significant feature for storage vendors.

Better data balancing over multiple devices for raid1/10 (read)

Not claimed — no patches yet — Not in kernel yet

Currently (3.6) the mirrors are selected based on the process id of the endio-metadata worker threads. This may lead to a pathological case where only one mirror is ever used (the more common case of uneven mirror use is not that bad in practice).
A first attempt with random mirror selection proved to be very wrong. The heuristic has to be improved (more comments), possibly in connection with the readahead infrastructure within btrfs.
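The pathological case is easy to reproduce in a toy model. The sketch below assumes the selection is "pid modulo number of mirrors"; the text above only says that the process id is used, so the exact formula and the function name are assumptions.

```python
from collections import Counter


def pick_mirror_by_pid(pid: int, num_mirrors: int) -> int:
    """Assumed policy: map the reading thread's pid onto a mirror index."""
    return pid % num_mirrors


# Worker pids that all happen to be even: with two mirrors,
# every read lands on mirror 0 and mirror 1 stays idle.
even_pids = [1200, 1202, 1204, 1206]
usage = Counter(pick_mirror_by_pid(pid, 2) for pid in even_pids)
print(dict(usage))
```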

dm_cache or bcache like cache e.g. on a SSD

Not claimed — no patches yet — Not in kernel yet

dm_cache or bcache can add a cache on an SSD to increase the speed of slower spindles. Using this with a multi-device btrfs with mirroring or RAID5/6 would be inefficient, as the (write-through) cache would not be aware of identical extents to be put on different disks, and it doesn't make much sense to have them twice in the cache. This could be solved by adding a new feature into btrfs which allows adding a device used for caching only.

Verify data at end of device replace

Not claimed — no patches yet — Not in kernel yet

Verify copied data before declaring a replace operation as successfully finished. Explicitly ask the user whether to ignore or whether to proceed when errors are detected.

Parallelism for Scrub and Balance

Not claimed — no patches yet — Not in kernel yet

When doing a scrub or a balance, especially in consideration of the "Limits on number of stripes (stripe width)" feature, the system currently only actions a scrub/balance to a single stripe of data or metadata. With multiple disks this often means that some disks sit idle while a single thread waits on a separate disk or group of disks to complete their operation. This behaviour can be desirable (due to load) or undesirable (due to it being inefficient). This can be fixed by determining whether or not a new scrub/balance thread can operate on the next extent/group without conflicting with the already-active operation.
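One way to picture the conflict check is a greedy planner that only runs two block groups at the same time when they share no device. This is a hedged sketch: the function name, the representation of a block group as a set of device names, and the greedy strategy are all assumptions made for illustration.

```python
from typing import List, Set


def plan_rounds(groups: List[Set[str]]) -> List[List[int]]:
    """Pack block groups into rounds in which no two groups share a device."""
    rounds = []  # each entry: (devices busy in this round, indices of its groups)
    for idx, devices in enumerate(groups):
        for busy, members in rounds:
            if busy.isdisjoint(devices):   # no conflict with the active work
                busy.update(devices)
                members.append(idx)
                break
        else:
            rounds.append((set(devices), [idx]))
    return [members for _, members in rounds]


groups = [{"sda", "sdb"}, {"sdc", "sdd"}, {"sda", "sdc"}]
print(plan_rounds(groups))
```

Groups 0 and 1 touch disjoint disks and can run together; group 2 overlaps both and has to wait for a later round.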

multiple copies per disk

Not claimed — no patches yet — Not in kernel yet

Just as with metadata, it should be possible to have multiple copies per file per disk. The user should be able to select how many copies there should be, either on a per-file basis or via some option for all newly created files. It would be nice if this could be combined with a RAID level, e.g. with two drives, having one copy on drive A and two copies on drive B. If that's somehow possible, it would be nice if btrfs were smart enough to spread these copies as "far from each other" as possible, e.g. on an HDD on different spindles at different positions (in order to protect against head crashes) and on SSDs on different chips.

control rebuild behaviour

Not claimed — no patches yet — Not in kernel yet

btrfs should allow setting how rebuilds are done (for both spare disks and replacement disks). The idea is to allow specifying how much read and write bandwidth should be spent on the rebuild (e.g. some percent value or kB/s) and how much on normal reads/writes. It is not clear whether it makes sense to have this per device or just globally per filesystem. For people for whom data security is most important, it may even make sense to have this set to something very close to 100% ... perhaps one could add rules that processes under EUID=0 are exempt from this.

RAID level combinations

Not claimed — no patches yet — Not in kernel yet

btrfs should perhaps allow RAID level combinations (e.g. 60 or 50) like MD RAID does by stacking MD devices. Another possibly important part i

ignored / non-independent block devices in a multiple device setup

Not claimed — no patches yet — Not in kernel yet

It can easily happen that multiple block devices from the same underlying physical device (e.g. multiple partitions from a disk) are added to a btrfs filesystem. If btrfs doesn't know about this, the idea of protecting against device failures is basically gone. It should be possible to mark block devices of a btrfs filesystem as being dependent on some other block device(s) of that fs, and btrfs should make sure to distribute the redundant chunks correctly. Of course one could add some heuristics to auto-detect whether a block device added to a btrfs belongs to the same physical device (e.g. for partitions sdaX it would be easy), but this might not be enough, since there could be arbitrary block layer levels or things like nbd in between, which cannot be detected. Hugo Mills worked on something related to that.

Other projects

extent_io state ranges

Not claimed — no patches yet — Not in kernel yet

The extent_io locking code works on [start, end] tuples. This should be changed to [start, length] tuples and all users should be updated.
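The conversion itself is simple arithmetic, but off-by-one mistakes are the main risk when updating all users. The sketch below assumes that `end` in the current tuples is inclusive (as the square brackets suggest); the helper names are made up for illustration.

```python
from typing import Tuple


def end_to_length(start: int, end: int) -> Tuple[int, int]:
    """Convert an inclusive [start, end] range to (start, length)."""
    if end < start:
        raise ValueError("end must not precede start")
    return start, end - start + 1


def length_to_end(start: int, length: int) -> Tuple[int, int]:
    """Convert (start, length) back to an inclusive [start, end] range."""
    if length <= 0:
        raise ValueError("length must be positive")
    return start, start + length - 1


print(end_to_length(0, 4095))
print(length_to_end(4096, 4096))
```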

Limiting btree failure domains

Not claimed — no patches yet — Not in kernel yet

One way to limit the time required for filesystem rebuilds is to limit the number of places that might reference a given area of the drive. One recent discussion of these issues is Val Henson's chunkfs work.

There are a few cases where chunkfs still breaks down to O(N) rebuild, and without making completely separate filesystems it is very difficult to avoid these in general. Btrfs has a number of options that fit in the current design:

  • Tie chunks of space to Subvolumes and snapshots of those subvolumes
  • Tie chunks of space to tree roots inside a subvolume

But these still allow a single large subvolume to span huge amounts of data. We can either place responsibility for limiting failure domains on the admin, or we can implement a key range restriction on allocations.

The key range restriction would consist of a new btree that provided very coarse grained indexing of chunk allocations against key ranges. This could provide hints to fsck in order to limit consistency checking.

Content based storage

Not claimed — no patches yet — Not in kernel yet

Content based storage would index data extents by a large (256bit at least) checksum of the data contents. This index would be stored in one or more dedicated btrees and new file writes would be checked to see if they matched extents already in the content btree.

There are a number of use cases where even large hashes can have security implications, and content based storage is not suitable for use by default. Options to mitigate this include verifying contents of the blocks before recording them as a duplicate (which would be very slow) or simply not using this storage mode.

There are some use cases where verifying equality of the blocks may have an acceptable performance impact. If hash collisions are recorded, it may be possible to later use idle time on the disks to verify equality. It may also be possible to verify equality immediately if another instance of the file is cached. For example, in the case of a mass web host, there are likely to be many identical instances of common software, and constant use is likely to keep these files cached. In that case, not only would disk space be saved, it may also be possible for a single instance of the data in cache to be used by all instances of the file.

If hashes match, a reference is taken against the existing extent instead of creating a new one.

If the checksum isn't already indexed, a new extent is created and the content tree takes a reference against it.

When extents are freed, if the checksum tree is the last reference holder, the extent is either removed from the checksum tree or kept for later use (configurable).

Another configurable is reading the existing block to compare with any matches or just trusting the checksum.

This work is related to stripe granular IO, which would make it possible to configure the size of the extent indexed.
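The write path described above can be modelled in a few lines. This is a toy in-memory sketch, not a proposal for the on-disk format: `ContentStore`, its fields and the choice of SHA-256 are assumptions, and the reference held by the content index itself is not counted here.

```python
import hashlib
from typing import Dict, List


class ContentStore:
    """Toy content-addressed extent store; all names are illustrative."""

    def __init__(self, verify: bool = True) -> None:
        self.verify = verify                 # compare bytes on a checksum match
        self.index: Dict[bytes, int] = {}    # checksum -> extent id
        self.extents: List[bytes] = []       # extent id -> data
        self.refs: List[int] = []            # extent id -> file reference count

    def write(self, data: bytes) -> int:
        """Return the extent id holding `data`, sharing an existing extent if possible."""
        digest = hashlib.sha256(data).digest()
        extent_id = self.index.get(digest)
        if extent_id is not None and (not self.verify or self.extents[extent_id] == data):
            self.refs[extent_id] += 1        # duplicate: take a reference
            return extent_id
        # Unknown checksum (or a verified mismatch): create a new extent.
        self.extents.append(data)
        self.refs.append(1)
        new_id = len(self.extents) - 1
        self.index.setdefault(digest, new_id)
        return new_id


store = ContentStore()
first = store.write(b"A" * 4096)
second = store.write(b"A" * 4096)
other = store.write(b"B" * 4096)
print(first == second, len(store.extents))
```

The `verify` switch corresponds to the configurable choice between reading the existing block for comparison and simply trusting the checksum.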

Rsync integration

Not claimed — no patches yet — Not in kernel yet

Now that we have code to efficiently find newly updated files, we need to tie it into tools such as rsync and dirvish. (For bonus points, we can even allow rsync to use btrfs's builtin checksums and, when a file has changed, tell rsync _which blocks_ inside that file have changed. Would need to work with the rsync developers on that one.)

Update rsync to preserve NOCOW file status.

Reference to other backup tools

Not exactly an implementation of this, but I (A. v. Bidder) have patched dirvish to use btrfs snapshots instead of hardlinked directory trees. Discussion archived on the Dirvish mailing list.

Snapshot-aware updatedb/locate

Not claimed — no patches yet — Not in kernel yet

It is desirable to be able to locate content inside snapshots. At present, it happens often that the daily updatedb has a multiplied amount of work as a result of the number of snapshots. This is at least until the administrator configures updatedb to ignore the snapshots entirely. Some systems have hundreds of snapshots resulting in updatedb requiring a lot more than one day to complete. If the updatedb tool were aware of the snapshots (also whether or not they are read-only) then, perhaps utilising the btrfs-send logic, it could simply query the changed content, greatly reducing the amount of work required to update the database. An optional flag for updatedb could allow it to ignore a volume's snapshots. An optional flag for locate could allow a user to specify that they also want results from a volume's snapshots.

Random write performance

Chris Mason — somehow done by the autodefrag mount option — Not in kernel yet

Random writes introduce small extents and fragmentation. We need new file layout code to improve this and defrag the files as they are being changed.

Clear unallocated space

Not claimed — alpha prototype, not posted yet — Not in kernel yet

This is similar to TRIM on SSD devices, but for any device. Simply go through unallocated space and rewrite it with zeros (or maybe with some poison pattern, so we could recognize when data from free blocks ends up being used). The trim code could be enhanced to submit either a TRIM command or a write of a zeroed block to the disk. As trim is supported by more filesystems, a new REQ_ flag could be introduced to the block layer to perform the zeroing, so other filesystems can enhance their trim support to clear the free space as well.

Note: Preliminary patches implementing zeroing exist, not yet posted. The interface needs a cleanup, but it basically works.

Cancellable operations

Not claimed — no patches yet — Not in kernel yet

There are a few operations that may take a long time, cause umount to stall, or slow down the filesystem. It is possible, and would be nice, to add support for cancelling to device del or filesystem balance.

There are two ways how to cancel an operation:

  • synchronous – when the operation is called from userspace and all the processing is done from the context of this process (like the case of btrfs fi defrag FILE), then pressing Ctrl-C will raise a signal and this will be checked inside the defrag loop. It should be discussed whether to allow Ctrl-C or only kill -9.
  • asynchronous – when the processing is done in a kernel thread, this would need same command support like scrub or balance have
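The synchronous case boils down to a cancellation point inside the processing loop. The sketch below models this in userspace Python with a `threading.Event` standing in for the pending-signal check; the function name and the per-extent loop are hypothetical.

```python
import threading
from typing import Iterable


def defrag_extents(extents: Iterable[int], cancel: threading.Event) -> int:
    """Process extents one by one, checking for cancellation between steps.

    Returns the number of extents processed before stopping.
    """
    done = 0
    for _ in extents:
        if cancel.is_set():      # cancellation point, checked once per iteration
            break
        done += 1                # stand-in for the real per-extent work
    return done


cancel = threading.Event()
completed = defrag_extents(range(5), cancel)    # runs to completion
cancel.set()                                    # e.g. set from a SIGINT handler
interrupted = defrag_extents(range(5), cancel)  # stops before doing any work
print(completed, interrupted)
```

The important design point is that the check happens only between units of work, so each unit is either fully done or not started when the loop exits.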

There are more places where a check whether the filesystem is being unmounted would improve responsiveness, like during free space cache writeout. However, one has to be sure that cancelling such operations is safe.

Unlimited extended attributes

Not claimed — no patches yet — Not in kernel yet

Currently the size of the value of an extended attribute must fit into inline space (~3900 bytes on 4k leaf size), while other filesystems do not limit the size. Add a new b-tree item to hold the xattr value in extents.

Swap file support

Omar Sandoval — patch in development — Not in kernel yet

Implement swapfile support on top of the swap-over-nfs infrastructure, using the exported API to manage the extents.

The swap-over-nfs patchset enhances the swapfile API, and btrfs could build swap support on top of that infrastructure. The patchset has been merged into 3.6.


Bad block tracking

Not claimed — no patches yet — Not in kernel yet

Currently btrfs doesn't keep track of bad blocks, disk blocks that are very likely to lose data written to them. Btrfs should accept a list in badblocks' output format, store it in a new btree (or maybe in the current extent tree, with a new flag), relocate whatever data the blocks contain, and reserve these blocks so they can't be used for future allocations. Additionally, scrub could be taught to test for bad blocks when a checksum error is found. This would make scrub much more useful; checksum errors are generally caused by the disk, but while scrub detects afflicted files, which in a backup scenario gives the opportunity to recreate them, the next file to reuse the bad blocks will just start getting errors instead. These two items would match an ext4 feature (used through e2fsck).
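As a rough illustration of the userspace side, the sketch below parses a list of block numbers, one per line (which is how badblocks writes its output list), and checks whether a byte range touches any of them. The function names and the choice of a 4096-byte block size in the example are assumptions.

```python
from typing import Iterable, Set


def parse_badblocks(lines: Iterable[str]) -> Set[int]:
    """Parse a list of block numbers, one per line; blank lines are skipped."""
    return {int(line) for line in lines if line.strip()}


def extent_hits_bad_block(start: int, length: int, bad: Set[int], block_size: int) -> bool:
    """True if the byte range [start, start + length) covers any bad block."""
    first = start // block_size
    last = (start + length - 1) // block_size
    return any(first <= block <= last for block in bad)


bad = parse_badblocks(["12\n", "\n", "40\n"])
hit = extent_hits_bad_block(8 * 4096, 8 * 4096, bad, 4096)   # blocks 8..15
miss = extent_hits_bad_block(16 * 4096, 4096, bad, 4096)     # block 16 only
print(hit, miss)
```

An allocator-side check would use the same overlap test to refuse, or relocate, any extent for which it returns true.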

Hybrid Storage

Not claimed — no patches yet — Not in kernel yet

It should be possible to use very fast devices as a front end to traditional storage. The most used blocks should be kept hot on the fast device and they should be pushed out to slower storage in large sequential chunks.

The latest generation of SSD drives can achieve high IOPS rates for both reading and writing. They will be very effective front end caches for slower (and less expensive) spinning media. A caching layer could be added to Btrfs to store the hottest blocks on faster devices to achieve better read and write throughput. This cache could also make use of other spindles in the existing spinning storage, for example by storing frequently used random-heavy data mirrored on all drives if space is available. A similar mechanism could store frequent random read patterns (such as booting a system) as a series of sequential blocks in this cache.


Encryption

Not claimed — no patches yet — Not in kernel yet

Implement a similar encryption scheme to that of ZFS which features

  • Encryption is integrated with the btrfs command set. Like other btrfs operations, encryption operations such as key changes and rekey are performed online.
  • You can use your existing storage pools as long as they are upgraded. You have the flexibility of encrypting specific file systems.
  • Encryption is inheritable to descendent file systems. Key management can be delegated through delegated administration.
  • Data is encrypted using the ciphers and block modes implemented in the kernel.
  • Escrow passphrase support so it can be used for enterprise desktop computers and laptops.

The encryption capability is embedded into the I/O pipeline. During writes a block may be compressed, encrypted, checksummed and then deduplicated in that order. The policy for encryption is set at the dataset level when datasets (file systems or VOLs) are created.

The wrapping keys provided by the user/administrator can be changed at any time without taking the file system off line. The default behaviour is for the wrapping key to be inherited by any child data sets. The data encryption keys are randomly generated at dataset creation time. Only descendant datasets (snapshots and clones) share data encryption keys. A command to switch to a new data encryption key for the clone or at any time is provided — this does not re-encrypt already existing data, instead utilising an encrypted master-key mechanism.

Allow to access/fixup/delete damaged files from the filesystem

Not claimed — no patches yet — Not in kernel yet

A recovery mode that would make it possible to delete the remainders of damaged files without the need to copy the data out and recreate the filesystem. Such damage may arise from a lost device in 'single' or 'raid0' modes. This operation could be implemented as a special mode of fsck or as an operation on a mounted filesystem. The level of damage and the potential recovery vary, for example wiping the files completely, removing the broken extents, ignoring checksum mismatches or forcing a checksum rewrite.

block devices 'btrvols'

Not claimed — no patches yet — Not in kernel yet

Likely to be rejected: layering violation, can be replaced by scsi target backed by file

Allow block devices to be allocated from the filesystem. They take advantage of COW, snapshots, etc but can be formatted and used by other filesystems. This is similar to ZFS's zvols. See for an idea.

Make helper threads NUMA aware

Not claimed — no patches yet — Not in kernel yet

The helper threads are used to offload computation-heavy tasks and access lots of memory. This incurs a performance hit on NUMA machines where pages from other nodes are accessed. Extend the workers to be more NUMA-aware for the most exposed tasks, namely checksumming.

Note: this may be obsoleted by recent patches that replace home-grown worker management with workqueues available in kernel. (patchset)

Scrub free space

Not claimed — no patches yet — Not in kernel yet

Currently only disk blocks that are allocated by the filesystem and in use are checked. Checking for read errors on unallocated blocks can help identify hardware that is going to fail in the near future.

This could be merged with the 'Clear unallocated space' project as a special case.

Btree lock contention

Not claimed — no patches yet — Not in kernel yet

The btree locks, especially on the root block can be very hot. We need to improve this, especially in read mostly workloads.

Improve subvolume usability

Accept arbitrary directories with the mount-time subvol flag

subvol= is a mount option to mount a subvolume other than the default. It currently only allows subvolumes; but vfsmounts can start at any path, allowing to mount any directory.

Note from kdave: this is intentionally disabled, see the patch that added the subvol test. It makes the snapshotting semantics unclear.
Gabriel: Thanks for the info. The snapshot ioctl, and other volume ioctls, need to validate that they have a subvolume rather than any vfsmount. Currently this check is in btrfs-progs but missing on the kernel side; the subvol= check didn't address the root cause of the bogus semantics and the snapshot ioctl is still problematic with bind mounts.

Snapshot arbitrary directories

Currently the most efficient way to snapshot a non-subvolume is either:

  • to snapshot its parent volume and remove the extra bits.
  • to make a reflink copy

Neither option is as efficient as it could be. copy_root should be updated to copy from a non-root directory (copy_root_at_dir).

Take recursive snapshots

Snapshots of volumes don't include nested subvolumes; allowing this would make it easier to make sure that a snapshot contains everything the source appears to contain.

Even with userspace help, it isn't currently possible to do recursive snapshots that are atomic or read-only. A new ioctl would solve that.

Hide the subvolume/directory distinction

That distinction in data structures makes snapshotting efficient, but it may not be necessary to expose it to userspace. Transparent snapshots would encode the subvolume rootid (which non-transparent snapshots expose in st_dev) by reserving bits from the inode number. rename() and link() would reuse copy_root_at_dir when crossing a subvolume boundary.
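Reserving bits of the inode number for the subvolume rootid is plain bit packing. The sketch below assumes a 64-bit inode number split into 16 bits of rootid and 48 bits of per-subvolume inode number; the split and the function names are hypothetical and chosen only to show the mechanics.

```python
ROOTID_BITS = 16   # hypothetical: bits reserved for the subvolume rootid
INO_BITS = 48      # hypothetical: bits left for the per-subvolume inode number
INO_MASK = (1 << INO_BITS) - 1


def encode_ino(rootid: int, ino: int) -> int:
    """Pack (rootid, ino) into a single 64-bit inode number."""
    if not 0 <= rootid < (1 << ROOTID_BITS):
        raise ValueError("rootid does not fit in the reserved bits")
    if not 0 <= ino <= INO_MASK:
        raise ValueError("inode number does not fit in the remaining bits")
    return (rootid << INO_BITS) | ino


def decode_ino(value: int):
    """Split a packed inode number back into (rootid, ino)."""
    return value >> INO_BITS, value & INO_MASK


packed = encode_ino(257, 1234)
print(hex(packed), decode_ino(packed))
```

The trade-off is visible in the constants: every bit given to the rootid halves the number of inodes a single subvolume can address.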

More checksumming algorithms

Not claimed — no patches yet — Not in kernel yet

Currently crc32c is used. We may offer more alternatives with different speed/strength characteristics; the ones considered are xxHash and SHA1.

Make bootloaders aware of incompat bits and features

Not claimed — no patches yet — Not in kernel yet

Bootloaders (grub2, syslinux) that support btrfs do not check the incompatibility bits and boot may fail due to lack of support, eg. compression, skinny metadata, no-extent-hole, etc. The bootloader should verify that there are no unknown bits and at least issue a warning.

Implement new FALLOC_FL_* modes

Not claimed — no patches yet — Not in kernel yet

Note: Depends on the generic implementation that is on its way to the kernel. This is an extension to fallocate that chops the given range from a file and does not leave a hole; the offsets to the right of the range are shifted.
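The intended semantics can be shown on an in-memory byte string. This is only a model of the behaviour described above, which appears to correspond to what Linux exposes as FALLOC_FL_COLLAPSE_RANGE; it ignores block alignment and any other restrictions a real implementation imposes, and the function name is made up.

```python
def collapse_range(data: bytes, offset: int, length: int) -> bytes:
    """Remove data[offset:offset + length]; later bytes shift left, no hole remains."""
    if offset < 0 or length <= 0 or offset + length > len(data):
        raise ValueError("range must lie inside the data")
    return data[:offset] + data[offset + length:]


before = b"AAAABBBBCCCC"
after = collapse_range(before, 4, 4)
print(after)
```

Unlike hole punching, the result is shorter than the input: the bytes that followed the range now start at `offset`.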

Online fsck

Not claimed — initial patch submitted — Not in kernel yet

Online fsck includes a number of difficult decisions around races and coherency. Given that back references allow us to limit the total amount of memory required to verify a data structure, we should consider simply implementing fsck in the kernel.

Initial work by Li Zefan.


RichACLs / NFS4 ACLs

Not claimed — no patches yet — Not in kernel yet

POSIX.1e ACLs (which btrfs supports already) are rather limited. So it would be nice if btrfs could eventually support RichACLs / NFS4 ACLs, if any specific support for that is needed from the btrfs side. See also RichACLs project site.

Audit btrfs code in bootloaders

Not claimed — no patches yet — Not in kernel yet

The bootloaders do not share code with the linux source tree. They usually lack support for new features, lack the checks that would at least warn that booting could fail, and do not regularly backport fixes from linux. The consequences are obvious.

The bootloaders in question:

  • grub2
  • LILO (maybe)

Things to audit/fix:

  • bugfixes to code touching the core datastructures
  • check that new features like skinny-metadata or no-holes are handled properly, or
  • ... add warnings if the bootloader does not understand the incompat bits

Filesystem UUID change - on-line

Not claimed — no patches yet — Not in kernel yet

Allow changing the UUID of a filesystem at mount time. The filesystem UUID is stored in all metadata blocks, so it's necessary to rewrite all of them. Additionally, the device items also contain the filesystem UUID and need to be updated.

The process goes in three phases. First the new UUID replaces the old one that is backed up somewhere. This also sets a temporary incompatibility bit that means that the filesystem now accepts two UUIDs from the metadata blocks.

Second phase is to rewrite all metadata blocks. Scrub is used for that, ie. rewrite UUID of each block that still has the old one.

Third phase is to drop the incompat bit after verification that the filesystem is fully converted.

The intermediate state of "2 UUIDs" interacts with other features:

  • device scanning - we need to update the device-stored UUIDs as soon as possible, consistently
  • seeding devices - the seeding devices do some UUID conversion magic, so the uuid= option will be probably forbidden here
  • drop the incompat bit - extend scrub with an option to drop the bit after a full scrub runs without problems
  • possibly others
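The three phases can be modelled as a tiny state machine. This is a sketch under stated assumptions: the class, its fields and the flag name are hypothetical, metadata blocks are reduced to the UUID they carry, and the scrub pass is a simple loop.

```python
from typing import List, Optional


class Filesystem:
    """Toy model: each metadata block is represented by the UUID stamped in it."""

    def __init__(self, uuid: str, nblocks: int) -> None:
        self.uuid = uuid
        self.old_uuid: Optional[str] = None
        self.two_uuid_incompat = False          # hypothetical temporary incompat bit
        self.blocks: List[str] = [uuid] * nblocks

    def begin_change(self, new_uuid: str) -> None:
        """Phase 1: install the new UUID, back up the old one, set the bit."""
        self.old_uuid, self.uuid = self.uuid, new_uuid
        self.two_uuid_incompat = True

    def rewrite_pass(self) -> int:
        """Phase 2: rewrite every block that still carries the old UUID."""
        rewritten = 0
        for i, block_uuid in enumerate(self.blocks):
            if block_uuid == self.old_uuid:
                self.blocks[i] = self.uuid
                rewritten += 1
        return rewritten

    def finish_change(self) -> bool:
        """Phase 3: drop the bit only if the conversion is verified complete."""
        if all(block_uuid == self.uuid for block_uuid in self.blocks):
            self.two_uuid_incompat = False
            self.old_uuid = None
            return True
        return False


fs = Filesystem("old-uuid", nblocks=4)
fs.begin_change("new-uuid")
too_early = fs.finish_change()      # blocks not rewritten yet
rewritten = fs.rewrite_pass()
finished = fs.finish_change()
print(too_early, rewritten, finished)
```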

Implement new RENAME_* modes

Not claimed — no patches yet — Not in kernel yet

There are new modes of rename syscall.


Track link count for directories

Not claimed — no patches yet — Not in kernel yet

The link count for directories is traditionally used to count the number of subdirectories iff the link count is >= 2. Btrfs sets this to 1 and does not track the link count at all. The link count could be used by some utilities to do optimizations when traversing the file hierarchy (at least find does that).

It seems that the link count can be tracked like the other filesystems do. This will be even backward compatible:

  • for new directories and subvolumes, set the initial link count to 2
  • a mkdir/rmdir/move/snapshot will update the link count accordingly iff the current link count is not 1
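The two rules can be checked against a small model. The sketch below is illustrative only: the `Directory` class is hypothetical, and it covers just the mkdir/rmdir cases, not move or snapshot.

```python
from typing import Optional


class Directory:
    """Toy directory: nlink == 1 means 'link count not tracked' (existing behaviour)."""

    def __init__(self, tracked: bool = True) -> None:
        self.nlink = 2 if tracked else 1

    def add_subdir(self) -> None:
        if self.nlink != 1:          # untracked directories stay at 1
            self.nlink += 1

    def remove_subdir(self) -> None:
        if self.nlink > 2:           # never drop below the base count of 2
            self.nlink -= 1

    def subdir_count(self) -> Optional[int]:
        """Number of subdirectories, or None if the count is not tracked."""
        return None if self.nlink == 1 else self.nlink - 2


new_dir = Directory()
new_dir.add_subdir()
new_dir.add_subdir()
new_dir.remove_subdir()

old_dir = Directory(tracked=False)   # created before link counts were tracked
old_dir.add_subdir()
print(new_dir.subdir_count(), old_dir.subdir_count())
```

Keeping 1 as the "unknown" value is what makes the scheme backward compatible: tools that use the link count as an optimization already have to treat 1 as "cannot tell".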

Userspace tools projects

Collection of ideas or small tasks for btrfsprogs or other relevant utilities.


  • show meaning of various error codes, eg. for the incompatibility bits
  • error messages need to be cleaned up: wording, spelling, consistent format, error codes where missing
  • merge functionality of btrfstune
    • for the reasons stated under [1], the functionality should be implemented with the properties rather than a separate subcommand
  • audit code for use of backup superblocks and change it to read only the first unless told otherwise by a command-line option
  • write a mount.btrfs helper to scan devices on the fly (and use libblkid and libmount for that)
  • introduce subcommand debug and move there functionality from separate debugging utilities
    • dump-tree from btrfs-debug-tree
    • dump-super from btrfs-show-super (with functionality of btrfs-dump-super as an option)
    • tree-stats from calc-size
    • map-logical (not sure if this is not for inspect-internal)
    • dump-image from btrfs-image
  • add API for tree search ioctl
  • add heuristics to defrag
    • use libmagic to guess file type and skip incompressible files -- the library dependency should be probably only run-time, not compile-time
  • improve error handling
    • start with the easy ones that are not in the code shared with kernel
    • similar to what's done in kernel code, but beware of differences
  • scrub status: print percentage completed (use get_df for total size)
  • allow to undelete a subvolume that is still left intact on the device
    • constraints: on an unmounted filesystem, drop key is 0, must find a free directory/name
  • scrub: print ETA guessed on reported scrubbed bytes and time
  • subvolume listing: overhaul, see Hugo's mail
  • check: make more verbose about the phases, more verbosity levels
  • (prototype exists) print block groups aka chunks in detail
  • use libsmartcols -- backward compatibility issues but would be a big improvement all over
  • alternate output formats -- eg. json, structured data, tables etc; libsmartcols knows that already but we'd need to cover output that's not only tables
  • balance: allow to run it in background (fork) and report status periodically
  • (maybe) provide a utility to do dd and uuid change in one go; it should be easy to copy a modified superblock with the new uuid first, then simply dd the rest, and then run btrfstune to finish the uuid conversion
  • convert custom code for formatting columns to use libsmartcols
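For the two scrub status items in the list above (percentage completed and ETA), the arithmetic could look like the following. This is a sketch assuming a constant scrub rate; the function name and parameters are hypothetical, and a real tool would take the byte counts from the scrub progress and the total size from the filesystem.

```python
from typing import Optional, Tuple


def scrub_progress(scrubbed: int, total: int, elapsed: float) -> Tuple[float, Optional[float]]:
    """Return (percent_done, eta_seconds); eta is None until some progress exists."""
    if total <= 0:
        raise ValueError("total must be positive")
    percent = 100.0 * scrubbed / total
    if scrubbed <= 0 or elapsed <= 0:
        return percent, None
    rate = scrubbed / elapsed                 # bytes per second so far
    return percent, (total - scrubbed) / rate


print(scrub_progress(250, 1000, 50.0))
print(scrub_progress(0, 1000, 10.0))
```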



  • optimize data structures and decrease memory consumption on large filesystems


  • pre-create a subvolume and put there contents of the root instead of filling the toplevel subvolume
  • convert from other filesystems (reiserfs)
  • convert from md configurations (ie, ext4-on-md converted into btrfs without md)
  • make more verbose about the phases, make it possible to identify at which inode it failed


  • pre-create a subvolume and put there contents of --rootdir instead of filling the toplevel subvolume
  • enable quotas during mkfs
  • allow to compress the files with --rootdir option


Add more regression tests for the userspace tools.

  • all: option coverage
  • mkfs: option combinations
  • fsck: add more broken filesystem images

Provide a library covering 'btrfs' functionality

Not claimed — no patches yet — Not in kernel yet

It would be nice to have a library for manipulating a btrfs filesystem in the way that the btrfs tool does, and to make it available to other programs or via language bindings.

bedup has a module exposing some of the core btrfs functionality to Python. There is also a btrfs library for haskell.

Time slider

Not claimed — no patches yet — Not in kernel yet

Auto snapshot and its management tool.

A tool partly covering this idea is snapper. Earlier work from Anand Jain on this is available at GitHub.

Audit tools for usecases with security implications

Some tools have to run under root but still may need some level of restrictions or confinement. An example is receive that could use chroot at some points. Search for more.


  • update man page of mount to contain btrfs options (use Mount_options as a source)
  • this very wiki needs updates and removal of obsolete information, work here is highly appreciated as this is one of the frequently consulted sources of information about btrfs
  • document ioctls: parameters, return values, expected effects
  • document send stream format, RFC, validator, dumper; see far-progs

Cleanup projects

This is a short collection of possible cleanups, that would make the code easier to read and to maintain. Please note that cleanups may interfere with patches in-flight doing some real work and merging may be postponed or you may be asked to refresh them on top of other changes.

Please note that pure whitespace and style reformatting changes are not really necessary at this phase of development. They get fixed along with regular changes. Once in a while a patch that fixes many if not all whitespace errors could work, but otherwise it's considered noise.

Pass fs_info instead of root

Daniel Dressler — Patches in flight — Not in kernel yet

Lots of functions get a root passed, but only need the fs_info part of it. This is confusing, as it is unclear if a specific root is needed or if any root will do.

Helpers for tree enumeration

Not claimed — no patches yet — Not in kernel yet

Writing tree enumeration code requires deep knowledge of the underlying functions and makes some assumptions about possible results. Build some generic helpers or enumeration functions instead, to make the code shorter, more readable, and easier to write.

Use the kernel code in user mode

Eric Sandeen has started working on this (Apr 2013) — no patches yet — Not in kernel yet

The user mode utilities have a local copy of the kernel code, with some small adjustments for running in user mode. The fork of the sources was quite a while ago, so many kernel features are now missing in user mode. Also maintaining two copies is burdensome. Expand the wrappers instead so the true kernel code can be used. These helpers could also generically make use of the upcoming readahead API.

Remove unused parameters (in general)

Not claimed — no patches yet — Not in kernel yet

E.g. like this [1] or [2]. To find more, enable -Wunused-parameter in scripts/ and run

make W=1 fs/btrfs/

The output is noisy. You can comment out the other warnings, or filter the output with grep.
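As an alternative to grep, the build output can be filtered with a few lines of script. This is only a sketch: it assumes the usual gcc/clang diagnostic format (path:line:col: warning: message [-Wflag]), and the file name in the sample is made up for illustration.

```python
import re

# Assumed diagnostic format: path:line:col: warning: message [-Wunused-parameter]
WARNING_RE = re.compile(
    r"^(?P<path>[^:\s]+):(?P<line>\d+):(?:\d+:)? warning: "
    r"(?P<msg>.*) \[-Wunused-parameter\]$"
)


def unused_parameters(build_output):
    """Return (path, line, message) for every -Wunused-parameter warning."""
    hits = []
    for text in build_output.splitlines():
        match = WARNING_RE.match(text.strip())
        if match:
            hits.append((match["path"], int(match["line"]), match["msg"]))
    return hits


# Hypothetical sample of build output; foo.c does not exist in the tree.
sample = """\
fs/btrfs/foo.c:120:41: warning: unused parameter 'root' [-Wunused-parameter]
fs/btrfs/foo.c:300:9: warning: comparison of integer expressions of different signedness [-Wsign-compare]
"""

for path, line, msg in unused_parameters(sample):
    print(f"{path}:{line}: {msg}")
```

Only the first sample line is reported; the other warning classes are dropped.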

Make fewer functions inline to help identifying them on the stack

Not claimed — no patches yet — Not in kernel yet (at the end)

As an aside, this is why XFS uses noinline for most of its static functions - so that stack traces are accurate when a problem occurs. Debuggability of complex code paths is far more important than the small speed improvement automatic inlining of static functions gives...

Good candidates appear when one tries to analyze a stack trace and does not see a function being called although it appears on the stack. It is then necessary to look into all such functions (and maybe repeat the whole exercise from there). Short wrappers or small simple helpers are not good candidates.

The functions should be tagged with noinline_for_stack.

Move nodesize/leafsize/sectorsize/stripesize to fs_info

Not claimed — no patches yet — Not in kernel yet


There are some items in btrfs_root (and others) that are unnecessarily duplicated but one definition is enough.

Move from btrfs_root to btrfs_fs_info:

  • nodesize, sectorsize, stripesize


  • kill leafsize and unify usage to just nodesize, eg. btrfs_level_size
  • struct btrfs_block_group_cache also duplicates sectorsize, replace it with fs_info::sectorsize

Note: this may increase the .ko size and may add some runtime penalty due to increased pointer chasing

Unbound or ongoing projects


The userbase is growing, we need to improve the documentation. The project ("official") sources comprise:

The documentation has moderate visibility and low impact on stability, so the patches get merged quickly. Sending patches the same way as for the code is preferred as it keeps track of the author (credits) and makes it easy for the maintainers to add the patches to git trees.

Ideas for fixes:

  • spelling, wording, clarifying
  • enhancing terse texts
  • usage examples (wiki, manpages)

Note: keeping a consistent look of the documentation may not be easy, but let's try it.

Offline fsck

  • Introduce semantic checks for filesystem data structures
  • Limit memory usage by triggering a back-reference only mode when too many extents are pending
  • Add verification of the generation number in btree metadata
  • Project_ideas#fsck

Test Suite

  • new features need support in xfstests
  • bugfixes that are accompanied by a reproducer should be converted to an xfstests test
  • userspace testsuite tasks
  • add more tests for existing features
    • qgroups
    • send/receive

Static checkers

The idea of this project is to run static checkers on the btrfs source code and identify issues to fix.

There are several static source code checkers that may point out code deficiencies that need fixing. Some of them can be used for the Linux kernel:

  • sparse - provides a set of annotations designed to convey semantic information about types, such as what address space pointers point to, or what locks a function acquires or releases
  • smatch - similar to sparse, enhanced set of capabilities
  • clang static analyzer - not really tailored for linux kernel, but still usable

Please note that the level of false positives varies and not every issue reported is an actual bug. A review is always required.

Projects claimed and in progress

Projects that are under development. Patches may exist, but have not been pulled into the mainline kernel.

Multiple Devices

Device IO Priorities

Jan Schmidt and Arne Jansen — submitted — Not in kernel yet

The disk format includes fields to record different performance characteristics of the device. This should be honored during allocations to find the most appropriate device. The allocator should also prefer to allocate from less busy drives to spread IO out more effectively.

Dedicated metadata drives

Jan Schmidt and Arne Jansen — submitted — Not in kernel yet

We're able to split data and metadata IO very easily. Metadata tends to be dominated by seeks and for many applications it makes sense to put the metadata onto faster SSDs.


Chris Mason — patch developed, needs updating and integration — Not in kernel yet

The multi-device code needs a raid6 implementation, and perhaps a raid5 implementation. This involves a few different topics including extra code to make sure writeback happens in stripe sized widths, and stripe aligning file allocations.

Metadata blocks are currently clustered together, but extra code will be needed to ensure stripe efficient IO on metadata. Another option is to leave metadata in raid1 mirrors and only do raid5/6 for data.

The existing raid0/1/10 code will need minor refactoring to provide dedicated chunk allocation and lookup functions per raid level. It currently happens via a collection of if statements.

Better data balancing over multiple devices for raid1/10 (allocation)

Hugo Mills — no patches yet — Not in kernel yet

The chunk allocator logic (up to 3.6 at least) allocates new chunks from devices with largest amount of free space. This guarantees (almost provably) that the raid1/mirroring guarantees hold while maximizing the available space (see FAQ for more details). On the other hand, when there are devices with unequal size (not uncommon while mixing terabyte-sized devices), the largest devices are excessively used while the others may not get used at all.

It is possible to fix the allocation logic so it is more even while (almost almost provably) does not break the mirroring guarantees.
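The behaviour described above can be illustrated with a toy simulation. This is a minimal sketch, not the kernel allocator: devices are modelled as plain free-space counters in units of one chunk, and the function name is made up for illustration.

```python
def allocate_raid1_chunks(free, max_chunks=None, chunk_size=1):
    """Greedy RAID1 model: each chunk goes to the two devices with most free space.

    `free` needs at least two entries. Returns (chunks allocated, per-device usage).
    """
    free = list(free)
    used = [0] * len(free)
    chunks = 0
    while max_chunks is None or chunks < max_chunks:
        # Rank devices by remaining free space, largest first.
        order = sorted(range(len(free)), key=lambda d: free[d], reverse=True)
        first, second = order[0], order[1]
        if free[second] < chunk_size:
            break  # fewer than two devices can hold another copy
        for dev in (first, second):
            free[dev] -= chunk_size
            used[dev] += chunk_size
        chunks += 1
    return chunks, used


# Two large and two small devices, only three chunks written so far.
print(allocate_raid1_chunks([10, 10, 2, 2], max_chunks=3))
# Unequal devices filled until no mirrored chunk fits any more.
print(allocate_raid1_chunks([4, 2, 1, 1]))
```

In the first call the two small devices stay untouched, which is the uneven usage mentioned above. In the second call all space ends up used, which is the reason the greedy policy was chosen.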

Chunk allocation groups

Hugo Mills — no patches yet — Not in kernel yet

If you have a RAID-1 filesystem spanning multiple controllers, you may want to ensure that each copy of your data goes to a different controller, not just a different device. We can fairly easily modify the chunk allocator so that we can specify this.
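A possible shape of such a policy, as a sketch only: the device tuples, the controller labels and the function name are all assumptions made for illustration, and nothing here reflects an existing interface.

```python
def pick_mirror_pair(devices):
    """Pick two devices on different controllers for one RAID-1 chunk.

    `devices` is a list of (name, controller, free_space) tuples.
    Returns a pair of device names, or None if all devices share a controller.
    """
    ranked = sorted(devices, key=lambda dev: dev[2], reverse=True)
    if not ranked:
        return None
    first = ranked[0]
    for candidate in ranked[1:]:
        if candidate[1] != first[1]:
            return first[0], candidate[0]
    return None


# Hypothetical layout: two disks on controller c0, one on c1.
layout = [("sda", "c0", 100), ("sdb", "c0", 90), ("sdc", "c1", 50)]
print(pick_mirror_pair(layout))
```

A plain largest-free-space policy would pair sda with sdb here, putting both copies behind the same controller.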

Limits on number of stripes (stripe width)

Hugo Mills — no patches yet — Not in kernel yet

With very large numbers of devices, or an unequal distribution of device sizes, the default allocator policy of using as many (or as few) devices as possible for striped RAID levels (0, 10, 5, 6) is problematic. The allocator can be modified to limit the number of devices it will stripe across.

Linear chunk allocation mode

Hugo Mills — no patches yet — Not in kernel yet

Some people would like to use single data storage and lose the minimum amount of data possible when a device dies. One prerequisite for this is to allocate chunks linearly, filling one device entirely before moving on to subsequent ones. This should be made a chunk-type option. (The file/extent allocator will probably need to be modified to deal with this use-case as well, but this is a start).
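To make the intended placement concrete, here is a small sketch of linear allocation. It models devices as capacities counted in chunks; the function name and the model are assumptions for illustration, not a proposed implementation.

```python
def allocate_linear(capacity, chunks, chunk_size=1):
    """Place single-copy chunks, filling each device completely before the next.

    Returns the device index chosen for every chunk, in allocation order.
    """
    free = list(capacity)
    placement = []
    dev = 0
    for _ in range(chunks):
        # Skip devices that cannot hold another chunk.
        while dev < len(free) and free[dev] < chunk_size:
            dev += 1
        if dev == len(free):
            raise ValueError("no space left for another chunk")
        free[dev] -= chunk_size
        placement.append(dev)
    return placement


print(allocate_linear([3, 3, 3], 5))
```

With five chunks on three devices of three chunks each, the last device holds nothing yet, so losing it costs no data, and losing the second device costs only the two chunks placed there.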

Other projects

RBtree lock contention

Liu Bo — no patches yet — Not in kernel yet

Btrfs uses a number of rbtrees to index in-memory data structures. Some of these are dominated by reads, and the lock contention from searching them is showing up in profiles. We need to look into an RCU and sequence counter combination to allow lockless reads.

Chunk tree backups

Wu Bo — on mailinglist — Not in kernel yet

The chunk tree is critical to mapping logical block numbers to physical locations on the drive. We need to make the mappings discoverable via a block device scan so that we can recover from corrupted chunk trees.

Different sector sizes

Mingming Cao & Wade Cline — initial WIP — Not in kernel yet

The extent_io code makes some assumptions about the page size and the underlying FS sectorsize or blocksize. These need to be cleaned up, especially for the extent_buffer code.

The HDD industry currently supports 512 byte blocks. We can expect HDDs in the future to support 4K byte blocks.

Compressed file size

David Sterba[2] — Not in kernel yet

Find actual size of a compressed file. This has evolved into a more generic solution, the compressed size will be returned via FIEMAP ioctl call instead of a btrfs-specific ioctl.

Block group reclaim

Ilya Dryomov — no patches yet — Not in kernel yet

The split between data and metadata block groups means that we sometimes have mostly empty block groups dedicated to only data or metadata. As files are deleted, we should be able to reclaim these and put the space back into the free space pool.

(See also #Fine-grained balances, #Block_group_reclaim)

Per-subvolume mount options

David Sterba — no patches yet — Not in kernel yet

Allow specifying mount options that apply only to the given subvolume.

Filesystem object properties

Filipe Manana — mailinglist — Not in kernel yet

Interface to set/get properties of object types like filesystem, subvolume, device, file. The properties are e.g. compression, raid type, cow/nocow status.

Some initial work started by Alexander Block in the past:

Compression enhancements

David Sterba — no patches yet — Not in kernel yet

This started with simply adding more compression algorithms, like LZ4, which by itself does not bring a significant improvement without other changes.

  • add more compression algorithms
    • LZ4 -- fast compression, very fast decompression; for general use
    • LZ4HC -- slow mode with better compression ratio, decompressor stays the same, the expected usecase is for write-once/read-many files
    • LZMA -- very slow compression with high compression ratio, aimed for backups and infrequently accessed data
  • enhanced container format -- currently a page-sized chunk is compressed at a time, enhance the container header with version and flags to store the chunk length or the actual way how the chunks were stored
  • longer compression chunk -- now it's 128k to limit the read time of random data in the middle of the file
  • heuristics -- try to learn in a simple way how well the file data compress, or not
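The heuristics item can be sketched as follows: compress a small sample and skip compression for the whole chunk if the sample does not shrink. This is an illustration under stated assumptions, with zlib standing in for whatever algorithm the filesystem uses; the sample size, the threshold and the function name are invented here.

```python
import os
import zlib


def looks_compressible(data, sample_size=4096, threshold=0.9):
    """Guess compressibility from a small prefix of the data.

    Returns True when the compressed sample is smaller than
    `threshold` times the original sample size.
    """
    sample = data[:sample_size]
    if not sample:
        return False
    # Level 1 keeps the probe cheap; it only has to detect redundancy.
    ratio = len(zlib.compress(sample, 1)) / len(sample)
    return ratio < threshold


repetitive = b"btrfs " * 20000          # highly redundant input
random_like = os.urandom(128 * 1024)    # stands in for already compressed data
print(looks_compressible(repetitive), looks_compressible(random_like))
```

A real heuristic would probably sample several places in the file rather than only the beginning, since many formats start with a compressible header.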

Related to that, all the bootloaders that support btrfs should support the enhanced container format and compression algorithms.

Atomic write API

Chris Mason, Josef Bacik — Atomic IO — Not in kernel yet

The Btrfs implementation of data=ordered only updates metadata to point to new data blocks when the data IO is finished. This makes it easy for us to implement atomic writes of an arbitrary size. Some hardware is coming out that can support this down in the block layer as well.



Filipe Brandenburger — patch in development — Not in kernel yet

Add support for btrfs-specific ioctls. Currently raw numbers are printed, teach strace btrfs ioctl names, how to parse btrfs ioctl structs and how to print a human readable output for them.

Cleanup projects

Move rcu_string out of btrfs to lib/

Omar Sandoval — patch in development — Not in kernel yet

rcu-string.h implements helpers and wrappers around RCU-friendly strings. This is a generic piece of code and should live in the generic library (linux.git/lib/). The task involves documenting API and pushing through lkml and updating according to the feedback.


Free inode number cache

Li Zefan — complete — In kernel 3.0

As the filesystem fills up, finding a free inode number will become expensive. This should be cached the same way we do free blocks.

NFS support

Josef Bacik — complete — In kernel 3.5

Btrfs currently has a sequence number the NFS server can use to detect changes in the file. This should be wired into the NFS support code.

Changing RAID levels

Ilya Dryomov — complete — In kernel 3.3

We need ioctls to change between different raid levels. Some of these are quite easy -- e.g. for RAID0 to RAID1, we just halve the available bytes on the fs, then queue a rebalance.

Contains "Balance operation progress" project.

Device IO Error recording

Stefan Behrens — complete — In kernel 3.5

Items should be inserted into the device tree to record the location and frequency of IO errors, including checksumming errors and misplaced writes.

Forced readonly mounts on errors

various people — complete — In kernel 3.4

The sources have a number of BUG() statements that could easily be replaced with code to force the filesystem readonly. This is the first step in being more fault tolerant of disk corruptions. The first step is to add a framework for generating errors that should result in filesystems going readonly, and the conversion from BUG() to that framework can happen incrementally.

Backref walking utilities

various people — complete — In kernel 3.2

Given a block number on a disk, the Btrfs metadata can find all the files and directories that use or care about that block. Some utilities to walk these back refs and print the results would help debug corruptions.

Given an inode, the Btrfs metadata can find all the directories that point to the inode. We should have utils to walk these back refs as well.

Snapshot aware defrag

Li Zefan, Liu Bo — complete — In kernel 3.9

As we defragment files, we break any sharing from other snapshots. The balancing code will preserve the sharing, and defrag needs to grow this as well.

Drive swapping

Stefan Behrens — complete — In kernel 3.8

Right now when we replace a drive, we do so with a full FS balance. If we are inserting a new drive to remove an old one, we can do a much less expensive operation where we just put valid copies of all the blocks onto the new drive.

Support different disk types in the same filesystem

Josef Bacik — commit de1ee92a, different approach, but fixes the problem — In kernel 3.8

Currently the situation is that for I/O write bios, the bio is prepared using latest_dev. bio_add_page() applies all checks against that device. Before submission of the bio, in btrfs_map_bio() such a bio is cloned for each additional RAID mirror to write. The bi_bdev member of such cloned bios is updated. When one of the devices supports only a lower number of pages per bio than the device that was initially used to build the bio, the submission of the bio will cause "bio too big" errors and kernel log messages. The write operation will fail in this case. One possible solution could be to use bio_get_nr_vecs() initially for each device to find the max number of pages per bio for each device. The minimum of these values could then be used to limit the size of bios in submit_extent_page().
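The proposed fix boils down to taking a minimum over the devices. A sketch of the arithmetic, with the per-device limits given as plain integers (the device names and numbers are invented, and the helper names are not kernel functions):

```python
def bio_page_limit(max_pages_per_device):
    """Largest bio size, in pages, that every device in the set accepts."""
    return min(max_pages_per_device.values())


def split_into_bios(total_pages, limit):
    """Split a write of `total_pages` into bio sizes no larger than `limit`."""
    full, rest = divmod(total_pages, limit)
    return [limit] * full + ([rest] if rest else [])


# Hypothetical per-device limits, as a per-device query would report them.
limits = {"sda": 256, "sdb": 128, "sdc": 64}
limit = bio_page_limit(limits)
print(limit, split_into_bios(200, limit))
```

A bio built against sda alone could be 200 pages long and would be rejected by sdc; with the common limit of 64 pages every clone is acceptable to every mirror.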

Set/change file system label

Jeff Liu — complete — In kernel 3.9

Set the file system label via ioctl(2). Users can read or change the Btrfs label through btrfs filesystem label [label]

Implement O_TMPFILE support

Filipe Manana — complete — In kernel 3.16

There's a special open() flag, O_TMPFILE, that creates a temporary file in a safe way [3]. Filesystem-specific support is needed.
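From userspace the flag is used like this. The sketch assumes Linux and a filesystem with O_TMPFILE support in the chosen directory; the function name is made up, and it returns None instead of failing when the flag is unavailable or unsupported.

```python
import os
import tempfile


def write_unnamed_file(directory, payload):
    """Write `payload` to an unnamed file in `directory` and read it back.

    Returns the bytes read, or None when O_TMPFILE is not available.
    """
    o_tmpfile = getattr(os, "O_TMPFILE", None)  # only defined on Linux
    if o_tmpfile is None:
        return None
    try:
        # The path names the directory; the file itself never gets a name.
        fd = os.open(directory, o_tmpfile | os.O_RDWR, 0o600)
    except OSError:
        return None  # kernel or filesystem without O_TMPFILE support
    try:
        os.write(fd, payload)
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(payload))
    finally:
        os.close(fd)  # the data is freed here, nothing is left behind


with tempfile.TemporaryDirectory() as scratch:
    result = write_unnamed_file(scratch, b"scratch data")
    leftovers = os.listdir(scratch)
print(result, leftovers)
```

The directory listing stays empty the whole time, so no other process can open the file by name and no cleanup is needed after a crash.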

Scrub with RAID 5/6

Miao Xie — complete — In kernel 3.19

The goal of scrubbing is to find and repair disk errors while causing minimal impact to the performance of the filesystem. Therefore a goal is to avoid seek operations, to read the blocks from each device in sorted order. Another goal is to perform the scrubbing quickly, therefore currently for each disk one thread is spawned that deals with the disk independent of the other disks thus without waiting for the other disks. Things are a little bit different in case of RAID 5/6 since you need to read from multiple devices in order to be able to check the parity information. A strategy needs to be found how to scrub RAID 5/6 filesystems efficiently, afterwards this needs to be implemented.
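Why RAID 5 scrubbing needs blocks from several devices at once can be shown with the parity arithmetic itself. This is a sketch of the XOR check for a single stripe with made-up two-byte blocks; RAID 6 additionally needs a second, non-XOR syndrome, which is not shown.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub_stripe(data_blocks, parity):
    """True when the stored parity matches the data blocks of one stripe."""
    return xor_blocks(data_blocks) == parity


def rebuild_block(surviving_blocks, parity):
    """Recover the single missing data block from the others plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# One stripe: three data blocks, each living on a different device.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)
print(scrub_stripe(data, parity))
print(rebuild_block([data[0], data[2]], parity) == data[1])
```

The check cannot be done per device: every block of the stripe has to be read before the parity can be verified, which is what conflicts with the one-thread-per-disk model described above.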

Device replace with RAID 5/6

Miao Xie — complete — In kernel 3.19

See "Scrub with RAID 5/6" since the replace code makes use of the scrub code.

Filesystem UUID change - off-line

Qu Wenruo — complete — In kernel / btrfs-progs 4.1

Change the filesystem UUID on a given filesystem image. This is easier than the on-line variant. Go through all the metadata blocks reachable from the superblock, verify and rewrite the UUID.

Obsolete requests

The following ideas are no longer needed, or have been subsumed within another piece of work.

Set mount options permanently

Filipe Manana, Hidetoshi Seto — no patches yet — Obsoleted in favour of Project_ideas#Filesystem_object_properties

Set mount options permanently (for ex: compress) like ext4 "tune2fs -O". Two different implementations and approaches were proposed so far:

And related to this, ability to remember compression algorithm and forcefulness per inode:

Background balancing

Ilya Dryomov — no patches yet — Obsoleted in favour of Project_ideas#Block_group_reclaim

A background thread could check in regular intervals if there is enough room to balance the smallest chunk for each RAID type into the existing ones and do so. This would also handle the 'Block group reclaim'-case.

Extend btrfstune to be able to tune more parameters

Sanjeev Mk, Shravan Aras, Gautam Akiwate — no patches yet — Obsoleted in favour of Project_ideas#Filesystem_object_properties

btrfstune currently can be used to update the seeding value. This project would add on to that and make btrfstune a generic tool to tune various FS parameters.

Development notes, please read

It's quite normal that several features are being developed at once, and some of them are used through an ioctl call, identified by a number. Please check that your feature does not use an already claimed number.

Tentative list:

Ioctl range  Feature                     Owner         Notes
21           free
51           compressed file size        David Sterba
55           in-band dedup               Liu Bo
56           exclusive operation status  Anand Jain
58           first free
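The numbers in the list are the nr field of the ioctl request code. As a sketch of how such a number turns into the value seen by ioctl(2) or in strace output, the following assumes the asm-generic encoding (some architectures use a different bit layout) and the btrfs magic 0x94; the helper names are made up. The label ioctl, number 49 with a 256 byte argument, serves as a known example.

```python
# Assumed asm-generic layout: bits 0-7 nr, bits 8-15 magic,
# bits 16-29 argument size, bits 30-31 direction.
IOC_WRITE = 1
IOC_READ = 2
BTRFS_IOCTL_MAGIC = 0x94

# Claimed numbers, copied from the tentative list above.
CLAIMED = {
    51: "compressed file size",
    55: "in-band dedup",
    56: "exclusive operation status",
}


def ioc(direction, nr, size, magic=BTRFS_IOCTL_MAGIC):
    """Build an ioctl request code from its four fields."""
    return (direction << 30) | (size << 16) | (magic << 8) | nr


def nr_of(request):
    """Extract the ioctl number from a request code."""
    return request & 0xFF


get_label = ioc(IOC_READ, 49, 256)  # read-direction ioctl, char[256] argument
print(hex(get_label), nr_of(get_label), nr_of(get_label) in CLAIMED)
```

Two features that pick the same number with the same direction and argument size end up with identical request codes, which is why the numbers need to be coordinated.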