Project ideas

From btrfs Wiki

Revision as of 16:00, 4 April 2014

Unclaimed projects

If you are actually going to implement an idea/feature, read the notes at the end of this page.

Note that some of the ideas may be out of date with the current state of the btrfs implementation, and some projects are claimed but show no visible progress. Always ask on IRC or on the mailing list before starting an implementation.

Multiple Devices

IO stripe size and other optimizations

Not claimed — no patches yet — Not in kernel yet

The multi-device code includes a number of IO parameters that do not currently get used. These need tunables from userland and they need to be honored by the allocation and IO routines.

Take device with heavy IO errors offline or mark as "unreliable"

Not claimed — no patches yet — Not in kernel yet

Devices should be taken offline after they reach a given threshold of IO errors. Jeff Mahoney is working on handling EIO errors (among others); this project can build on top of that work.

Hot spare support

Not claimed — no patches yet — Not in kernel yet

It should be possible to add a drive and flag it as a hot spare.

"Enhanced" (ala RAID) support

Not claimed — no patches yet — Not in kernel yet

Support a Hot Spare device that is active within the fs but which does not contribute toward capacity. This would result in better performance and less rebalance/rebuild time in the event of a disk failure.

Facility to allow automatic conversion and rebalance when short on reliable/spare storage

Not claimed — no patches yet — Not in kernel yet

If a device is marked as unreliable and there is not enough spare storage to take over the failing device's data automatically (ala hotswap), have a facility available to convert the data to a less-replicated state. Such a facility MUST NOT be "silent" - appropriate warnings/safeguards to prevent a "silently degraded" state should be in place.

False alarm on bad disk - rebuild mitigation

Not claimed — no patches yet — Not in kernel yet

After a device is marked as unreliable, maintain the device within the FS in order to confirm the issue persists. The device will still contribute toward fs performance but will not be treated as if contributing towards replication/reliability. If the device shows that the given errors were a once-off issue then the device can be marked as reliable once again. This will mitigate further unnecessary rebalance. See http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/ - "[Drive Resurrection]" as an example of where this is a significant feature for storage vendors.

Better data balancing over multiple devices for raid1/10 (allocation)

Not claimed — no patches yet — Not in kernel yet

The chunk allocator logic (up to 3.6 at least) allocates new chunks from the devices with the largest amount of free space. This guarantees (almost provably) that the raid1/mirroring guarantees hold while maximizing the available space (see FAQ for more details). On the other hand, when the devices are of unequal size (not uncommon when mixing terabyte-sized devices), the largest devices are used excessively while the others may not get used at all.
It is possible to fix the allocation logic so that it is more even while it (almost almost provably) does not break the mirroring guarantees.
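
A minimal sketch of the behaviour described above, assuming a simplified model in which every raid1 chunk takes one fixed-size unit from each of the two devices that currently have the most free space. The function name, the unit sizes and the tie-breaking rule are hypothetical and are not taken from the kernel allocator.

```python
def allocate_raid1_chunks(free, n_chunks, chunk=1):
    """Simulate greedy raid1 allocation (simplified model, not kernel code).

    free     -- free space per device, in arbitrary units
    n_chunks -- how many mirrored chunks to try to allocate
    Returns the number of units used on each device.
    """
    free = list(free)
    used = [0] * len(free)
    for _ in range(n_chunks):
        # Order devices by free space, largest first (ties: lowest index).
        order = sorted(range(len(free)), key=lambda i: (-free[i], i))
        first, second = order[0], order[1]
        if free[second] < chunk:
            break  # no second device can hold the mirror copy
        for dev in (first, second):
            free[dev] -= chunk
            used[dev] += chunk
    return used


# Three chunks on two large and two small devices: only the large ones are used.
print(allocate_raid1_chunks([4, 4, 1, 1], 3))
# One large and three small devices: all space ends up mirrored.
print(allocate_raid1_chunks([3, 1, 1, 1], 10))
```

In this model the small devices stay untouched until the large ones have drained down to their level, which is the uneven use the project wants to smooth out.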

Better data balancing over multiple devices for raid1/10 (read)

Not claimed — no patches yet — Not in kernel yet

Currently (3.6) the mirrors are selected based on the process id of the endio-metadata worker threads. This may lead to a pathological case where only one mirror is ever used (the more common case, uneven mirror use, is not that bad in practice).
A first attempt with random mirror selection proved to be very wrong. The heuristic has to be improved (more comments), possibly in connection with the readahead infrastructure within btrfs.
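
A small illustration of how pid-based selection can degenerate, assuming the mirror is chosen as the pid modulo the number of mirrors. That modulo rule and the function names are assumptions made for illustration; the text above only says that selection is based on the process id.

```python
from collections import Counter


def pick_mirror(pid, num_mirrors):
    """Hypothetical selection rule: pid modulo the number of mirrors."""
    return pid % num_mirrors


def mirror_usage(pids, num_mirrors=2):
    """Count how many of the given worker pids land on each mirror."""
    counts = Counter(pick_mirror(pid, num_mirrors) for pid in pids)
    return [counts.get(mirror, 0) for mirror in range(num_mirrors)]


# Workers that all happen to have even pids always read from mirror 0.
print(mirror_usage([1000, 1002, 1004, 1006]))
# Consecutive pids spread evenly.
print(mirror_usage([1000, 1001, 1002, 1003]))
```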

dm_cache or bcache like cache e.g. on a SSD

Not claimed — no patches yet — Not in kernel yet

dm_cache or bcache can add a cache on an SSD to increase the speed of slower spindles. Using this with a multi-device btrfs with mirroring or RAID5/6 would be inefficient, as the (write-through) cache would not be aware that identical extents are being put on different disks, and it doesn't make much sense to have them in the cache twice. This could be solved by adding a new feature to btrfs that allows adding a device used for caching only.

Scrub with RAID 5/6

Not claimed — no patches yet — Not in kernel yet

Scrubbing a RAID 5/6 system is currently not supported. The goal of scrubbing is to find and repair disk errors while causing minimal impact to the performance of the filesystem. One goal is therefore to avoid seek operations by reading the blocks from each device in sorted order. Another goal is to scrub quickly, so currently one thread is spawned per disk, and each thread handles its disk independently without waiting for the other disks. Things are a little different for RAID 5/6, since multiple devices must be read in order to check the parity information. A strategy for scrubbing RAID 5/6 filesystems efficiently needs to be found first, and then implemented.

Device replace with RAID 5/6

Not claimed — no patches yet — Not in kernel yet

See "Scrub with RAID 5/6" since the replace code makes use of the scrub code.

Verify data at end of device replace

Not claimed — no patches yet — Not in kernel yet

Verify copied data before declaring a replace operation as successfully finished. Explicitly ask the user whether to ignore or whether to proceed when errors are detected.

Chunk allocation groups

Not claimed — no patches yet — Not in kernel yet

If you have a RAID-1 filesystem spanning multiple controllers, you may want to ensure that each copy of your data goes to a different controller, not just a different device. We can fairly easily modify the chunk allocator so that we can specify this.

Limits on number of stripes (stripe width)

Not claimed — no patches yet — Not in kernel yet

With very large numbers of devices, or an unequal distribution of device sizes, the default allocator policy of using as many (or as few) devices as possible for striped RAID levels (0, 10, 5, 6) is problematic. The allocator can be modified to limit the number of devices it will stripe across.

Linear chunk allocation mode

chandan — no patches yet — Not in kernel yet

Some people would like to use single data storage and lose the minimum amount of data possible when a device dies. One prerequisite for this is to allocate chunks linearly, filling one device entirely before moving on to subsequent ones. This should be made a chunk-type option. (The file/extent allocator will probably need to be modified to deal with this use-case as well, but this is a start).
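
A minimal sketch of the linear policy, assuming a toy model where each chunk occupies one unit on a single device. The function name and the return format are hypothetical.

```python
def allocate_linear(free, n_chunks, chunk=1):
    """Place chunks on the first device that still has room, so that one
    device is filled entirely before the next one is touched.
    Returns the device index chosen for each allocated chunk."""
    free = list(free)
    placement = []
    for _ in range(n_chunks):
        for dev, space in enumerate(free):
            if space >= chunk:
                free[dev] -= chunk
                placement.append(dev)
                break
        else:
            break  # every device is full
    return placement


print(allocate_linear([2, 2], 3))
```

With this placement, losing the second device in the example affects only the last chunk, which is the property the project is after.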

Parallelism for Scrub and Balance

Not claimed — no patches yet — Not in kernel yet

When doing a scrub or a balance, especially in consideration of the "Limits on number of stripes (stripe width)" feature, the system currently only actions a scrub/balance to a single stripe of data or metadata. With multiple disks this often means that some disks sit idle while a single thread waits on a separate disk or group of disks to complete their operation. This behaviour can be desirable (due to load) or undesirable (due to it being inefficient). This can be fixed by determining whether or not a new scrub/balance thread can operate on the next extent/group without conflicting with the already-active operation.

Other projects

extent_io state ranges

Not claimed — no patches yet — Not in kernel yet

The extent_io locking code works on [start, end] tuples. This should be changed to [start, length] tuples and all users should be updated.
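
For illustration, the two conventions can be converted as sketched below, assuming that end in the [start, end] form is inclusive. The helper names are hypothetical.

```python
def end_to_len(start, end):
    """Convert an inclusive [start, end] range to (start, length)."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return start, end - start + 1


def len_to_end(start, length):
    """Convert (start, length) back to an inclusive [start, end] range."""
    if length <= 0:
        raise ValueError("length must be positive")
    return start, start + length - 1


print(end_to_len(0, 4095))
print(len_to_end(4096, 4096))
```

The off-by-one in both directions is the detail every updated caller would have to get right.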

Limiting btree failure domains

Not claimed — no patches yet — Not in kernel yet

One way to limit the time required for filesystem rebuilds is to limit the number of places that might reference a given area of the drive. One recent discussion of these issues is Val Henson's chunkfs work.

There are a few cases where chunkfs still breaks down to O(N) rebuild, and without making completely separate filesystems it is very difficult to avoid these in general. Btrfs has a number of options that fit in the current design:

  • Tie chunks of space to Subvolumes and snapshots of those subvolumes
  • Tie chunks of space to tree roots inside a subvolume

But these still allow a single large subvolume to span huge amounts of data. We can either place responsibility for limiting failure domains on the admin, or we can implement a key range restriction on allocations.

The key range restriction would consist of a new btree that provided very coarse grained indexing of chunk allocations against key ranges. This could provide hints to fsck in order to limit consistency checking.

Content based storage

Not claimed — no patches yet — Not in kernel yet

Content based storage would index data extents by a large (256bit at least) checksum of the data contents. This index would be stored in one or more dedicated btrees and new file writes would be checked to see if they matched extents already in the content btree.

There are a number of use cases where even large hashes can have security implications, and content based storage is not suitable for use by default. Options to mitigate this include verifying contents of the blocks before recording them as a duplicate (which would be very slow) or simply not using this storage mode.

There are some use cases where verifying equality of the blocks may have an acceptable performance impact. If hash collisions are recorded, it may be possible to later use idle time on the disks to verify equality. It may also be possible to verify equality immediately if another instance of the file is cached. For example, in the case of a mass web host, there are likely to be many identical instances of common software, and constant use is likely to keep these files cached. In that case, not only would disk space be saved, it may also be possible for a single instance of the data in cache to be used by all instances of the file.

If hashes match, a reference is taken against the existing extent instead of creating a new one.

If the checksum isn't already indexed, a new extent is created and the content tree takes a reference against it.

When extents are freed, if the checksum tree is the last reference holder, the extent is either removed from the checksum tree or kept for later use (configurable).

Another configurable is reading the existing block to compare with any matches or just trusting the checksum.

This work is related to stripe granular IO, which would make it possible to configure the size of the extent indexed.
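
The lookup and reference rules above can be sketched in memory as follows. This is only a model under stated assumptions: a Python dict stands in for the content btree, SHA-256 is used as one possible 256-bit checksum, and the class and option names (ContentStore, verify, keep_unreferenced) are hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-indexed extent store (dict instead of a btree)."""

    def __init__(self, verify=True, keep_unreferenced=False):
        self.verify = verify                        # compare data on a match
        self.keep_unreferenced = keep_unreferenced  # keep extents nobody uses
        self.index = {}                             # digest -> [data, file refs]

    def write(self, data):
        """Store data, sharing an existing extent when the checksum matches."""
        key = hashlib.sha256(data).digest()
        entry = self.index.get(key)
        if entry is None:
            self.index[key] = [data, 1]             # new extent, first reference
        else:
            if self.verify and entry[0] != data:
                raise RuntimeError("checksum collision: contents differ")
            entry[1] += 1                           # reference existing extent
        return key

    def free(self, key):
        """Drop one file reference; remove or keep the extent when unused."""
        entry = self.index[key]
        if entry[1] == 0:
            raise ValueError("extent has no file references")
        entry[1] -= 1
        if entry[1] == 0 and not self.keep_unreferenced:
            del self.index[key]


store = ContentStore()
a = store.write(b"x" * 4096)
b = store.write(b"x" * 4096)
print(a == b, len(store.index))
```

The verify flag corresponds to the configurable choice between reading the existing block and trusting the checksum.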

Rsync integration

Not claimed — no patches yet — Not in kernel yet

Now that we have code to efficiently find newly updated files, we need to tie it into tools such as rsync and dirvish. (For bonus points, we can even allow rsync to use btrfs's builtin checksums and, when a file has changed, tell rsync _which blocks_ inside that file have changed. Would need to work with the rsync developers on that one.)

Update rsync to preserve NOCOW file status.

Reference to other backup tools

Not exactly an implementation of this, but I (A. v. Bidder) have patched dirvish to use btrfs snapshots instead of hardlinked directory trees. Discussion archived on the Dirvish mailing list.

Snapshot-aware updatedb/locate

Not claimed — no patches yet — Not in kernel yet

It is desirable to be able to locate content inside snapshots. At present, the daily updatedb run often has its work multiplied by the number of snapshots, at least until the administrator configures updatedb to ignore the snapshots entirely. Some systems have hundreds of snapshots, resulting in updatedb requiring a lot more than one day to complete. If the updatedb tool were aware of the snapshots (and of whether or not they are read-only) then, perhaps utilising the btrfs-send logic, it could simply query the changed content, greatly reducing the amount of work required to update the database. An optional flag for updatedb could allow it to ignore a volume's snapshots. An optional flag for locate could allow a user to specify that they also want results from a volume's snapshots.

Random write performance

Chris Mason — somehow done by the autodefrag mount option — Not in kernel yet

Random writes introduce small extents and fragmentation. We need new file layout code to improve this and defrag the files as they are being changed.

Clear unallocated space

Not claimed — alpha prototype, not posted yet — Not in kernel yet

This is similar to TRIM on SSD devices, but for any device. Simply go through unallocated space and rewrite it with zeros (or maybe with some poison pattern, so we could recognize when data from free blocks ends up being used). The trim code could be enhanced to either submit a TRIM command or write a zeroed block to the disk. As trim is supported by more filesystems, a new REQ_ flag could be introduced to the block layer to perform the zeroing, so other filesystems can also enhance their trim support to clear the free space.

Note: Preliminary patches implementing zeroing exist, not yet posted. The interface needs a cleanup, but it basically works. http://repo.or.cz/w/linux-2.6/btrfs-unstable.git/shortlog/refs/heads/dev/clear-free-space-v0

Cancellable operations

Not claimed — no patches yet — Not in kernel yet

There are a few operations that may take a long time, cause umount to stall or slow down the filesystem. It is possible, and would be nice, to add cancellation support to device del or filesystem balance.

There are two ways to cancel an operation:

  • synchronous – when the operation is called from userspace and all the processing is done in the context of this process (as with btrfs fi defrag FILE), pressing Ctrl-C raises a signal that is checked inside the defrag loop. It should be discussed whether to allow Ctrl-C or only kill -9.
  • asynchronous – when the processing is done in a kernel thread, this would need the same kind of command support that scrub and balance have

There are more places where a check for whether the filesystem is being unmounted would improve responsiveness, such as during free space cache writeout. However, one has to be sure that cancelling such operations is safe.
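
The common pattern behind both variants is a long loop that checks a cancellation flag between units of work. A userspace sketch, assuming a threading.Event as the flag; the function name and the per-extent work are placeholders.

```python
import threading


def process_extents(extents, cancel):
    """Process extents one by one, stopping early once `cancel` is set.

    The check sits between units of work, so each extent is either fully
    processed or not touched at all."""
    done = []
    for extent in extents:
        if cancel.is_set():
            break
        done.append(extent)  # placeholder for the real per-extent work
    return done


cancel = threading.Event()
print(process_extents([1, 2, 3], cancel))
cancel.set()
print(process_extents([1, 2, 3], cancel))
```

Checking only at unit boundaries is one way to keep cancellation safe, at the cost of waiting for the current unit to finish.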

Unlimited extended attributes

Not claimed — no patches yet — Not in kernel yet

Currently the value of an extended attribute must fit into the inline space (~3900 bytes with a 4k leaf size), while other filesystems do not limit the size. Add a new b-tree item to hold the xattr value in extents.

Swap file support

Not claimed — no patches yet — Not in kernel yet

Implement swapfile support on top of the swap-over-nfs infrastructure. That patchset enhances the swapfile API and has been merged into 3.6; btrfs could build swap support on top of it and use the exported API to manage the extents.

Hybrid Storage

Not claimed — no patches yet — Not in kernel yet

It should be possible to use very fast devices as a front end to traditional storage. The most used blocks should be kept hot on the fast device and they should be pushed out to slower storage in large sequential chunks.

The latest generation of SSD drives can achieve high IOPS rates for both reading and writing. They will be very effective front end caches for slower (and less expensive) spinning media. A caching layer could be added to Btrfs to store the hottest blocks on faster devices to achieve better read and write throughput. This cache could also make use of other spindles in the existing spinning storage; for example, frequently used random-heavy data could be stored mirrored on all drives if space is available. A similar mechanism could allow frequent random read patterns (such as booting a system) to be stored as a series of sequential blocks in this cache.

Encryption

Not claimed — no patches yet — Not in kernel yet

Implement a similar encryption scheme to that of ZFS which features

  • Encryption is integrated with the btrfs command set. Like other btrfs operations, encryption operations such as key changes and rekey are performed online.
  • You can use your existing storage pools as long as they are upgraded. You have the flexibility of encrypting specific file systems.
  • Encryption is inheritable to descendent file systems. Key management can be delegated through delegated administration.
  • Data is encrypted using the ciphers and block modes implemented in the kernel.
  • Escrow passphrase support so it can be used for enterprise desktop computers and laptops.

The encryption capability is embedded into the I/O pipeline. During writes a block may be compressed, encrypted, checksummed and then deduplicated in that order. The policy for encryption is set at the dataset level when datasets (file systems or VOLs) are created.

The wrapping keys provided by the user/administrator can be changed at any time without taking the file system off line. The default behaviour is for the wrapping key to be inherited by any child data sets. The data encryption keys are randomly generated at dataset creation time. Only descendant datasets (snapshots and clones) share data encryption keys. A command to switch to a new data encryption key for the clone or at any time is provided — this does not re-encrypt already existing data, instead utilising an encrypted master-key mechanism.

Allow to access/fixup/delete damaged files from the filesystem

Not claimed — no patches yet — Not in kernel yet

A recovery mode that would make it possible to delete the remainders of damaged files without copying the data out and recreating the filesystem. Such damage may arise from a lost device in 'single' or 'raid0' modes. This operation could be implemented as a special mode of fsck or as an operation on a mounted filesystem. The level of damage and the potential recovery vary: for example wiping the files completely, removing the broken extents, ignoring checksum mismatches or forcing a checksum rewrite.

block devices 'btrvols'

Not claimed — no patches yet — Not in kernel yet

Allow block devices to be allocated from the filesystem. They take advantage of COW, snapshots, etc but can be formatted and used by other filesystems. This is similar to ZFS's zvols. See http://pthree.org/2012/12/21/zfs-administration-part-xiv-zvols/ for an idea.

Make helper threads NUMA aware

Not claimed — no patches yet — Not in kernel yet

The helper threads are used to offload computation-heavy tasks and access lots of memory. This brings a performance hit on NUMA machines when pages from other nodes are accessed. Extend the workers to be more NUMA-aware for the most exposed tasks, namely checksumming.

Note: this may be obsoleted by recent patches that replace home-grown worker management with workqueues available in kernel. (patchset)

Scrub free space

Not claimed — no patches yet — Not in kernel yet

Currently only disk blocks that are allocated by the filesystem and in use are checked. Checking for read errors on unallocated blocks can help identify hardware that is going to fail in the near future.

This could be merged with the 'Clear unallocated space' project as a special case.

Btree lock contention

Not claimed — no patches yet — Not in kernel yet

The btree locks, especially on the root block, can be very hot. We need to improve this, especially in read-mostly workloads.

Improve subvolume usability

Accept arbitrary directories with the mount-time subvol flag

subvol= is a mount option to mount a subvolume other than the default. It currently only allows subvolumes, but vfsmounts can start at any path, which would allow mounting any directory.

Note from kdave: this is intentionally disabled, see the patch that added the subvol test. It makes the snapshotting semantics unclear.
Gabriel: Thanks for the info. The snapshot ioctl, and other volume ioctls, need to validate that they have a subvolume rather than any vfsmount. Currently this check is in btrfs-progs but missing on the kernel side; the subvol= check didn't address the root cause of the bogus semantics and the snapshot ioctl is still problematic with bind mounts.

Snapshot arbitrary directories

Currently the most efficient way to snapshot a non-subvolume is either:

  • to snapshot its parent volume and remove the extra bits.
  • to make a reflink copy

Neither option is as efficient as it could be. copy_root should be updated to copy from a non-root directory (copy_root_at_dir).

Take recursive snapshots

Snapshots of volumes don't include nested subvolumes; allowing this would make it easier to make sure that a snapshot contains everything the source appears to contain.

Even with userspace help, it isn't currently possible to do recursive snapshots that are atomic or read-only. A new ioctl would solve that.

Hide the subvolume/directory distinction

That distinction in data structures makes snapshotting efficient, but it may not be necessary to expose it to userspace. Transparent snapshots would encode the subvolume rootid (which non-transparent snapshots expose in st_dev) by reserving bits from the inode number. rename() and link() would reuse copy_root_at_dir when crossing a subvolume boundary.
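
The bit reservation could look roughly like the sketch below. The 24/40 split of a 64-bit inode number is purely hypothetical, as are the helper names; the text above does not say how many bits would be reserved.

```python
ROOT_BITS = 24             # hypothetical number of bits reserved for the rootid
INO_BITS = 64 - ROOT_BITS  # bits left for the per-subvolume inode number


def pack_ino(rootid, ino):
    """Combine a subvolume rootid and an inode number into one 64-bit value."""
    if not 0 <= rootid < (1 << ROOT_BITS):
        raise ValueError("rootid does not fit in the reserved bits")
    if not 0 <= ino < (1 << INO_BITS):
        raise ValueError("inode number does not fit in the remaining bits")
    return (rootid << INO_BITS) | ino


def unpack_ino(packed):
    """Split a packed value back into (rootid, ino)."""
    return packed >> INO_BITS, packed & ((1 << INO_BITS) - 1)


print(unpack_ino(pack_ino(257, 12345)))
```

Any such split trades the maximum number of subvolumes against the maximum number of inodes per subvolume.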

More checksumming algorithms

Not claimed — no patches yet — Not in kernel yet

Currently crc32c is used; we may offer more alternatives with different speed/strength characteristics. The ones considered are xxHash and SHA1.
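
To give a feel for the trade-off, here is a sketch of a pluggable checksum table with a bitwise CRC-32C (Castagnoli polynomial, standard initial value and final XOR) and SHA1 from Python's hashlib. The table layout and names are hypothetical, the CRC routine is written for readability rather than speed, and it makes no claim about how btrfs seeds or stores its checksums. xxHash is left out because it is not in the Python standard library.

```python
import hashlib


def crc32c(data, crc=0):
    """Bitwise CRC-32C (reflected polynomial 0x82F63B78)."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Hypothetical registry: algorithm name -> function returning the digest bytes.
CHECKSUMS = {
    "crc32c": lambda data: crc32c(data).to_bytes(4, "little"),
    "sha1": lambda data: hashlib.sha1(data).digest(),
}

for name, func in CHECKSUMS.items():
    print(name, len(func(b"123456789")), "bytes")
```

A 4-byte CRC is cheap but only detects accidental corruption, while a 20-byte SHA1 digest costs more CPU and more space per block.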

Make bootloaders aware of incompat bits and features

Not claimed — no patches yet — Not in kernel yet

Bootloaders (grub2, syslinux) that support btrfs do not check the incompatibility bits, and boot may fail due to lack of support for, e.g., compression, skinny metadata, no-extent-hole, etc. The bootloader should verify that there are no unknown bits and at least issue a warning.

Implement FALLOC_FL_COLLAPSE_RANGE

Not claimed — http://oss.sgi.com/archives/xfs/2013-07/msg00907.html and followups — Not in kernel yet

Note: Depends on the generic implementation that is on its way into the kernel. This is an extension to fallocate that removes the given range from a file without leaving a hole; the offsets to the right of the range are shifted down by the length of the range.
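
The intended semantics can be modelled on an in-memory byte string, as sketched below. This is only a model of the effect, not a call into the kernel interface: the function name is hypothetical, any alignment constraints a real filesystem would impose are ignored, and the model rejects ranges that reach the end of the data, where a truncate would be the natural operation.

```python
def collapse_range(data, offset, length):
    """Remove `length` bytes at `offset` and shift the tail down, leaving
    no hole. Pure in-memory model of the collapse semantics."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    if offset + length >= len(data):
        raise ValueError("range must end before the end of the data")
    return data[:offset] + data[offset + length:]


print(collapse_range(b"AAAABBBBCCCC", 4, 4))
```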

Online fsck

Not claimed — initial patch submitted — Not in kernel yet

Online fsck includes a number of difficult decisions around races and coherency. Given that back references allow us to limit the total amount of memory required to verify a data structure, we should consider simply implementing fsck in the kernel.

Initial work by Li Zefan.

Implement O_TMPFILE support

Filipe Manana — https://patchwork.kernel.org/patch/3925691/ — Not in kernel yet

There is a special open() flag, O_TMPFILE, that creates a temporary file in a safe way [http://lwn.net/Articles/559921/]. Filesystem-specific support is needed.
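
From userspace the flag is used roughly as sketched below, here through Python's os module. The helper name is hypothetical, and the fallback to tempfile.TemporaryFile is an addition for platforms or filesystems where O_TMPFILE is missing or rejected, which is the situation this project is meant to fix for btrfs.

```python
import os
import tempfile


def write_unnamed_tempfile(directory, payload):
    """Write `payload` to an unnamed temporary file in `directory` and
    read it back. The file never gets a name in the directory."""
    flag = getattr(os, "O_TMPFILE", None)  # only defined on Linux
    if flag is not None:
        try:
            # With O_TMPFILE the path names the directory, not the file.
            fd = os.open(directory, flag | os.O_RDWR, 0o600)
        except OSError:
            fd = None  # kernel or filesystem without O_TMPFILE support
        if fd is not None:
            with os.fdopen(fd, "r+b") as f:
                f.write(payload)
                f.seek(0)
                return f.read()
    # Fallback: let the standard library pick the best available method.
    with tempfile.TemporaryFile(dir=directory) as f:
        f.write(payload)
        f.seek(0)
        return f.read()


scratch = tempfile.mkdtemp()
print(write_unnamed_tempfile(scratch, b"hello"))
print(os.listdir(scratch))
```

Because the file has no directory entry, it disappears automatically when the descriptor is closed, even if the program crashes first.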

Userspace tools projects

Collection of ideas or small tasks for btrfs-progs or other relevant utilities.

btrfs

  • sort devices in btrfs fi show by name
  • show meaning of various error codes, eg. for the incompatibility bits
  • error messages need to be cleaned up: wording, spelling, consistent format, error codes where missing
  • merge functionality of btrfstune, eg. under btrfs dev set-seed /dev/ (discuss the command name though)
  • audit code for use of backup superblocks and change it to read only the first unless told otherwise by a command-line option
  • write a mount.btrfs helper to scan devices on the fly (and use libblkid and libmount for that)
  • introduce subcommand debug and move there functionality from separate debugging utilities
    • dump-tree from btrfs-debug-tree
    • dump-super from btrfs-show-super (with functionality of btrfs-dump-super as an option)
    • tree-stats from calc-size
    • map-logical (not sure if this is not for inspect-internal)
    • dump-image from btrfs-image
  • add API for tree search ioctl
  • add heuristics to defrag
    • use libmagic to guess file type and skip incompressible files -- the library dependency should be probably only run-time, not compile-time
  • shell completion (for bash 04/2012 needs completing)
  • improve error handling
    • start with the easy ones that are not in the code shared with kernel
    • similar to what's done in kernel code, but beware of differences

btrfs-restore

fsck

  • optimize data structures and decrease memory consumption on large filesystems

btrfs-convert

  • [patch] report progress
  • [patch] add option to transfer label from the original filesystem
  • [draft] allow to set leafsize/nodesize of the new filesystem (like mkfs)
  • pre-create a subvolume and put there contents of the root instead of filling the toplevel subvolume
  • convert from other filesystems (reiserfs)
  • convert from md configurations (ie, ext4-on-md converted into btrfs without md)

mkfs.btrfs

  • pre-create a subvolume and put there contents of --rootdir instead of filling the toplevel subvolume
  • give nicer overview of the created filesystem (block/node/fs sizes, raid profiles, features)
  • enable quotas during mkfs

Testsuite

Add more regression tests for the userspace tools.

  • all: option coverage
  • mkfs: option combinations
  • fsck: add more broken filesystem images

Provide a library covering 'btrfs' functionality

Not claimed — no patches yet — Not in kernel yet

It would be nice to have a library for manipulating btrfs filesystem in a way that the btrfs tool does and make it available to other programs, or via language bindings.

bedup has a module exposing some of the core btrfs functionality to Python.

Time slider

Not claimed — no patches yet — Not in kernel yet

Auto snapshot and its management tool.

A tool partly covering this idea is snapper. Earlier work from Anand Jain on this is available at GitHub.

Documentation

  • update man page of mount to contain btrfs options (use Mount_options as a source)
  • this very wiki needs updates and removal of obsolete information; work here is highly appreciated, as this is one of the most frequently consulted sources of information about btrfs

Unbound or ongoing projects

Offline fsck

  • Introduce semantic checks for filesystem data structures
  • Limit memory usage by triggering a back-reference only mode when too many extents are pending
  • Add verification of the generation number in btree metadata
  • Project_ideas#fsck

Test Suite

  • new features need support in xfstests
  • bugfixes that are accompanied by a reproducer should be converted to an xfstests test
  • userspace testsuite tasks
  • add more tests for existing features
    • qgroups
    • send/receive

Static checkers

The idea of this project is to run static checkers on the btrfs source code and identify issues to fix.

There are several static source code checkers that may point out code deficiencies that need fixing. Some can be used for the linux kernel:

  • sparse - provides a set of annotations designed to convey semantic information about types, such as what address space pointers point to, or what locks a function acquires or releases
  • smatch - similar to sparse, enhanced set of capabilities
  • clang static analyzer - not really tailored for linux kernel, but still usable

Please note that the level of false positives varies and not every issue reported is an actual bug. A review is always required.

Projects claimed and in progress

Projects that are under development. Patches may exist, but have not been pulled into the mainline kernel.

Multiple Devices

Device IO Priorities

Jan Schmidt and Arne Jansen — submitted — Not in kernel yet

The disk format includes fields to record different performance characteristics of the device. This should be honored during allocations to find the most appropriate device. The allocator should also prefer to allocate from less busy drives to spread IO out more effectively.

Dedicated metadata drives

Jan Schmidt and Arne Jansen — submitted — Not in kernel yet

We're able to split data and metadata IO very easily. Metadata tends to be dominated by seeks and for many applications it makes sense to put the metadata onto faster SSDs.

Raid5/6

Chris Mason — patch developed, needs updating and integration — Not in kernel yet

The multi-device code needs a raid6 implementation, and perhaps a raid5 implementation. This involves a few different topics including extra code to make sure writeback happens in stripe sized widths, and stripe aligning file allocations.

Metadata blocks are currently clustered together, but extra code will be needed to ensure stripe efficient IO on metadata. Another option is to leave metadata in raid1 mirrors and only do raid5/6 for data.

The existing raid0/1/10 code will need minor refactoring to provide dedicated chunk allocation and lookup functions per raid level. It currently happens via a collection of if statements.

Other projects

RBtree lock contention

Liu Bo — no patches yet — Not in kernel yet

Btrfs uses a number of rbtrees to index in-memory data structures. Some of these are dominated by reads, and the lock contention from searching them is showing up in profiles. We need to look into an RCU and sequence counter combination to allow lockless reads.

Chunk tree backups

Wu Bo — on mailinglist — Not in kernel yet

The chunk tree is critical to mapping logical block numbers to physical locations on the drive. We need to make the mappings discoverable via a block device scan so that we can recover from corrupted chunk trees.

Different sector sizes

Mingming Cao & Wade Cline — initial WIP — Not in kernel yet

The extent_io code makes some assumptions about the page size and the underlying FS sectorsize or blocksize. These need to be cleaned up, especially for the extent_buffer code.

The HDD industry currently supports 512 byte blocks. We can expect HDDs in the future to support 4K byte blocks.

Compressed file size

David Sterba — [2] — Not in kernel yet

Find actual size of a compressed file. This has evolved into a more generic solution, the compressed size will be returned via FIEMAP ioctl call instead of a btrfs-specific ioctl.

Block group reclaim

Ilya Dryomov — no patches yet — Not in kernel yet

The split between data and metadata block groups means that we sometimes have mostly empty block groups dedicated to only data or metadata. As files are deleted, we should be able to reclaim these and put the space back into the free space pool.

(See also #Fine-grained balances, #Block_group_reclaim)

Per-subvolume mount options

David Sterba — no patches yet — Not in kernel yet

Allow to specify mount options that apply only to the given subvolume.

Filesystem object properties

Filipe Manana — mailinglist — Not in kernel yet

Interface to set/get properties of object types like filesystem, subvolume, device, file. The properties are e.g. compression, raid type, cow/nocow status.

Some initial work started by Alexander Block in the past:

http://thread.gmane.org/gmane.comp.file-systems.btrfs/18287

Compression enhancements

David Sterba — no patches yet — Not in kernel yet

This started with simply adding more compression algorithms, like LZ4, which by itself does not bring a significant improvement without other changes.

  • add more compression algorithms
    • LZ4 -- fast compression, very fast decompression; for general use
    • LZ4HC -- slow mode with better compression ratio, decompressor stays the same, the expected usecase is for write-once/read-many files
    • LZMA -- very slow compression with high compression ratio, aimed for backups and infrequently accessed data
  • enhanced container format -- currently a page-sized chunk is compressed at a time; enhance the container header with version and flags to store the chunk length or the way the chunks were stored
  • longer compression chunk -- currently it's 128k, to limit the read time of random data in the middle of the file
  • heuristics -- try to learn in a simple way whether the file data compresses well or not
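As an illustration of the heuristics item, one simple signal is the byte-level Shannon entropy of a sample of the data. This is only a sketch of one possible approach, not the method proposed for btrfs; `byte_entropy`, `looks_compressible` and the 7.0 bits-per-byte threshold are assumptions.

```python
import math
from collections import Counter


def byte_entropy(sample):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not sample:
        return 0.0
    total = len(sample)
    counts = Counter(sample)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_compressible(sample, threshold=7.0):
    """Guess that data is worth compressing if its entropy is below threshold."""
    return byte_entropy(sample) < threshold


zeros = bytes(4096)              # highly redundant data
uniform = bytes(range(256)) * 16  # every byte value equally frequent
```

Data whose sample scores close to 8 bits per byte (already compressed or encrypted content, for instance) is unlikely to shrink, so the compression attempt could be skipped.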

Related to that, all the bootloaders that support btrfs should support the enhanced container format and compression algorithms.

Atomic write API

Chris Mason, Josef Bacik — Atomic IO — Not in kernel yet

The Btrfs implementation of data=ordered only updates metadata to point to new data blocks when the data IO is finished. This makes it easy for us to implement atomic writes of an arbitrary size. Some hardware is coming out that can support this down in the block layer as well.
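The ordering described above (data first, metadata switch last) can be modelled with a toy in-memory file. `ToyCowFile` and its methods are hypothetical, a `None` payload stands in for a failed data IO, and nothing here reflects the proposed kernel API.

```python
class ToyCowFile:
    """Toy model: new data goes to new blocks, metadata is switched last."""

    def __init__(self):
        self.blocks = {}      # block number -> bytes (the "disk")
        self.extent_map = {}  # file offset -> block number (the metadata)
        self._next_block = 0

    def _write_block(self, data):
        if data is None:
            raise OSError("simulated data IO failure")
        number = self._next_block
        self._next_block += 1
        self.blocks[number] = data
        return number

    def atomic_write(self, updates):
        """Apply {file offset: bytes}; either every offset changes or none."""
        staged = {}
        for offset, data in updates.items():
            staged[offset] = self._write_block(data)  # data IO first
        self.extent_map.update(staged)  # metadata switch only after all IO

    def read(self, offset):
        return self.blocks[self.extent_map[offset]]


f = ToyCowFile()
f.atomic_write({0: b"old-a", 4096: b"old-b"})
try:
    f.atomic_write({0: b"new-a", 4096: None})  # second data write fails
except OSError:
    pass
```

After the failed call the file still reads back the old contents at both offsets, because the extent map was never touched.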

Userspace

strace

Filipe Brandenburger — patch in development — Not in kernel yet

Add support for btrfs-specific ioctls. Currently raw numbers are printed. Teach strace the btrfs ioctl names, how to parse btrfs ioctl structs, and how to print human-readable output for them.
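The first part of that job, turning a raw request number into something readable, can be sketched as follows. The bit layout assumed is the asm-generic one (used on x86 and ARM; some architectures differ), 0x94 is the btrfs ioctl magic, and `KNOWN_NAMES` is a deliberately incomplete, illustrative table. `decode_ioctl` is a hypothetical helper, not strace code.

```python
# asm-generic ioctl number layout: dir(2) | size(14) | type(8) | nr(8)
IOC_NRSHIFT, IOC_TYPESHIFT, IOC_SIZESHIFT, IOC_DIRSHIFT = 0, 8, 16, 30
IOC_NRMASK, IOC_TYPEMASK, IOC_SIZEMASK, IOC_DIRMASK = 0xFF, 0xFF, 0x3FFF, 0x3

BTRFS_IOCTL_MAGIC = 0x94
DIRECTIONS = {0: "_IO", 1: "_IOW", 2: "_IOR", 3: "_IOWR"}

# Illustrative, partial name table keyed by ioctl number.
KNOWN_NAMES = {1: "BTRFS_IOC_SNAP_CREATE"}


def decode_ioctl(request):
    """Split a raw request number into its fields; None if not btrfs."""
    ioc_type = (request >> IOC_TYPESHIFT) & IOC_TYPEMASK
    if ioc_type != BTRFS_IOCTL_MAGIC:
        return None
    nr = (request >> IOC_NRSHIFT) & IOC_NRMASK
    return {
        "name": KNOWN_NAMES.get(nr, "unknown btrfs ioctl"),
        "dir": DIRECTIONS[(request >> IOC_DIRSHIFT) & IOC_DIRMASK],
        "nr": nr,
        "size": (request >> IOC_SIZESHIFT) & IOC_SIZEMASK,
    }


decoded = decode_ioctl(0x50009401)
```

Parsing the argument structs and printing their fields would build on top of this lookup.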

Finished

Free inode number cache

Li Zefan — complete — In kernel 3.0

As the filesystem fills up, finding a free inode number will become expensive. This should be cached the same way we do free blocks.

NFS support

Josef Bacik — complete — In kernel 3.5

Btrfs currently has a sequence number the NFS server can use to detect changes in the file. This should be wired into the NFS support code.

Changing RAID levels

Ilya Dryomov — complete — In kernel 3.3

We need ioctls to change between different raid levels. Some of these are quite easy -- e.g. for RAID0 to RAID1, we just halve the available bytes on the fs, then queue a rebalance.
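The RAID0 to RAID1 arithmetic mentioned above works out as below. This is a rough sketch that assumes equally sized devices and counts only the number of copies each profile keeps; `COPIES` and `usable_bytes` are hypothetical names.

```python
# Number of copies of each block kept by the profile.
COPIES = {"raid0": 1, "raid1": 2}


def usable_bytes(raw_bytes, profile):
    """Approximate usable capacity for a given raw capacity and profile."""
    return raw_bytes // COPIES[profile]


TIB = 1024 ** 4
raw = 2 * TIB  # e.g. two 1 TiB devices
```

Here `usable_bytes(raw, "raid0")` is 2 TiB and `usable_bytes(raw, "raid1")` is 1 TiB, which is the halving the conversion has to account for before the rebalance is queued.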

Contains "Balance operation progress" project.

Device IO Error recording

Stefan Behrens — complete — In kernel 3.5

Items should be inserted into the device tree to record the location and frequency of IO errors, including checksumming errors and misplaced writes.

Forced readonly mounts on errors

various people — complete — In kernel 3.4

The sources have a number of BUG() statements that could easily be replaced with code to force the filesystem readonly. This is the first step toward being more tolerant of disk corruption. It begins with adding a framework for generating errors that should result in filesystems going readonly; the conversion from BUG() to that framework can then happen incrementally.

Backref walking utilities

various people — complete — In kernel 3.2

Given a block number on a disk, the Btrfs metadata can find all the files and directories that use or care about that block. Some utilities to walk these back refs and print the results would help debug corruptions.

Given an inode, the Btrfs metadata can find all the directories that point to the inode. We should have utils to walk these back refs as well.

Snapshot aware defrag

Li Zefan, Liu Bo — complete — In kernel 3.9

As we defragment files, we break any sharing from other snapshots. The balancing code will preserve the sharing, and defrag needs to grow this as well.

Drive swapping

Stefan Behrens — complete — In kernel 3.8

Right now when we replace a drive, we do so with a full FS balance. If we are inserting a new drive to remove an old one, we can do a much less expensive operation where we just put valid copies of all the blocks onto the new drive.

Support different disk types in the same filesystem

Josef Bacik — commit de1ee92a, different approach, but fixes the problem — In kernel 3.8

Currently, for I/O write bios, the bio is prepared using latest_dev, and bio_add_page() applies all checks against that device. Before submission, btrfs_map_bio() clones the bio for each additional RAID mirror to write and updates the bi_bdev member of each clone. When one of the devices supports a lower number of pages per bio than the device that was initially used to build the bio, submitting the bio causes "bio too big" errors and kernel log messages, and the write operation fails. One possible solution is to call bio_get_nr_vecs() for each device up front to find its maximum number of pages per bio; the minimum of these values could then be used to limit the size of bios in submit_extent_page().
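The "minimum over all devices" idea from the possible solution can be sketched in a few lines. This models only the arithmetic, not the kernel code; `bio_page_limit`, `split_into_bios` and the per-device numbers are hypothetical.

```python
def bio_page_limit(per_device_max_pages):
    """Largest bio, in pages, that every device in the filesystem accepts."""
    return min(per_device_max_pages.values())


def split_into_bios(num_pages, limit):
    """Sizes of the bios needed to write num_pages without exceeding limit."""
    return [min(limit, num_pages - done) for done in range(0, num_pages, limit)]


# Hypothetical limits as they might be reported per device.
devices = {"sda": 256, "sdb": 128}
limit = bio_page_limit(devices)
```

Building bios against `limit` instead of the first device's own maximum means each cloned bio is acceptable to every mirror.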

Set/change file system label

Jeff Liu — complete — In kernel 3.9

Set the file system label via ioctl(2); the user can work with the Btrfs label through btrfs filesystem label [label].

Obsolete requests

The following ideas are no longer needed, or have been subsumed within another piece of work.

Set mount options permanently

Filipe Manana, Hidetoshi Seto — no patches yet — Obsoleted in favour of Project_ideas#Filesystem_object_properties

Set mount options permanently (for example: compress), like ext4 "tune2fs -O". Two different implementations and approaches have been proposed so far:

http://thread.gmane.org/gmane.comp.file-systems.btrfs/19757

http://thread.gmane.org/gmane.comp.file-systems.btrfs/27406

And related to this, ability to remember compression algorithm and forcefulness per inode:

http://thread.gmane.org/gmane.comp.file-systems.btrfs/27452


Background balancing

Ilya Dryomov — no patches yet — Obsoleted in favour of Project_ideas#Block_group_reclaim

A background thread could check in regular intervals if there is enough room to balance the smallest chunk for each RAID type into the existing ones and do so. This would also handle the 'Block group reclaim'-case.

Extend btrfstune to be able to tune more parameters

Sanjeev Mk, Shravan Aras, Gautam Akiwate — no patches yet — Obsoleted in favour of Project_ideas#Filesystem_object_properties

btrfstune currently can be used to update the seeding value. This project would add on to that and make btrfstune a generic tool to tune various FS parameters.

(Would be better to have this under 'btrfs tune' or something like that)

Contact:
email : sanjeevmk4890@gmail.com IRC Nick: s-mk
email: 123.shravan@gmail.com IRC Nick: shravan
email: gautam.akiwate@gmail.com IRC Nick: gakiwate


Development notes, please read

It's quite normal for several features to be in development at the same time, and some of them are exposed through an ioctl call, identified by a number. Please check that your feature does not use an already claimed number.

Tentative list:

Ioctl range  Feature                     Owner         Notes
21           free
51           compressed file size        David Sterba
55           in-band dedup               Liu Bo
56           exclusive operation status  Anand Jain
57           feature bits                Jeff Mahoney
58           first free