Project ideas

From btrfs Wiki
Revision as of 15:16, 31 August 2011


Unclaimed projects

If you are actually going to implement an idea/feature, read the notes at the end of this page.

NFS support

Btrfs currently has a sequence number the NFS server can use to detect changes in the file. This should be wired into the NFS support code.

Multiple Devices

Hybrid Storage

It should be possible to use very fast devices as a front end to traditional storage. The most used blocks should be kept hot on the fast device and they should be pushed out to slower storage in large sequential chunks.

The latest generation of SSDs can achieve high IOPS rates for both reads and writes, making them very effective front-end caches for slower (and less expensive) spinning media. A caching layer could be added to Btrfs to store the hottest blocks on faster devices for better read and write throughput. This cache could also make use of other spindles in the existing spinning storage: for example, frequently used, random-heavy data could be mirrored on all drives if space is available. A similar mechanism could store frequent random read patterns (such as booting a system) as a series of sequential blocks in this cache.

IO stripe size and other optimizations

The multi-device code includes a number of IO parameters that do not currently get used. These need tunables from userland and they need to be honored by the allocation and IO routines.

IO Error recording

Items should be inserted into the device tree to record the location and frequency of IO errors, including checksumming errors and misplaced writes. Devices should be taken offline after they reach a given threshold.

There should be an offline application that tries to rehab a drive by overwriting the bad sectors and exercising it.

If a checksum failure or missed write is corrected by reading from an alternate mirror, the extent should be scheduled for relocation to correct the bad data.
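In rough terms, the per-device bookkeeping could look like the following sketch (class name and threshold are hypothetical; the real implementation would store these records as items in the device tree):

```python
# Hypothetical sketch of per-device IO error accounting with an
# offline threshold; the real work would record these as items
# in the btrfs device tree, not in memory.
from collections import Counter

class DeviceErrorLog:
    def __init__(self, offline_threshold=10):
        self.offline_threshold = offline_threshold
        self.errors = Counter()   # (sector, kind) -> count
        self.offline = False

    def record(self, sector, kind):
        """kind: 'csum', 'io', or 'misplaced-write'."""
        self.errors[(sector, kind)] += 1
        if sum(self.errors.values()) >= self.offline_threshold:
            self.offline = True   # stop sending IO to this device

log = DeviceErrorLog(offline_threshold=3)
log.record(4096, "csum")
log.record(8192, "io")
assert not log.offline
log.record(4096, "csum")      # threshold reached
assert log.offline
```

Keeping both the location and the kind of error preserves enough detail for the offline rehab tool to know which sectors to overwrite and exercise.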

Hot spare support

It should be possible to add a drive and flag it as a hot spare.


Different sector sizes

The extent_io code makes some assumptions about the page size and the underlying FS sectorsize or blocksize. These need to be cleaned up, especially for the extent_buffer code.

The HDD industry currently supports 512-byte blocks; we can expect future HDDs to support 4 KB blocks.

extent_io state ranges

The extent_io locking code works on [start, end] tuples. This should be changed to [start, length] tuples and all users should be updated.
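The conversion between the two forms is simple, but the off-by-one is exactly what such a change tends to get wrong, since the [start, end] tuples are inclusive. A quick sketch (helper names hypothetical):

```python
# btrfs's [start, end] tuples are inclusive, so a one-byte range
# has start == end. Hypothetical conversion helpers:
def to_start_len(start, end):
    """Inclusive [start, end] -> [start, length]."""
    return (start, end - start + 1)

def to_start_end(start, length):
    """[start, length] -> inclusive [start, end]."""
    return (start, start + length - 1)

assert to_start_len(0, 4095) == (0, 4096)        # one 4K page
assert to_start_end(4096, 4096) == (4096, 8191)  # the next page
```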

Limiting btree failure domains

One way to limit the time required for filesystem rebuilds is to limit the number of places that might reference a given area of the drive. One recent discussion of these issues is Val Henson's chunkfs work.

There are a few cases where chunkfs still breaks down to O(N) rebuild, and without making completely separate filesystems it is very difficult to avoid these in general. Btrfs has a number of options that fit in the current design:

  • Tie chunks of space to Subvolumes and snapshots of those subvolumes
  • Tie chunks of space to tree roots inside a subvolume

But these still allow a single large subvolume to span huge amounts of data. We can either place responsibility for limiting failure domains on the admin, or we can implement a key range restriction on allocations.

The key range restriction would consist of a new btree that provided very coarse grained indexing of chunk allocations against key ranges. This could provide hints to fsck in order to limit consistency checking.

Online fsck

Online fsck includes a number of difficult decisions around races and coherency. Given that back references allow us to limit the total amount of memory required to verify a data structure, we should consider simply implementing fsck in the kernel.

Content based storage

Content based storage would index data extents by a large (256bit at least) checksum of the data contents. This index would be stored in one or more dedicated btrees and new file writes would be checked to see if they matched extents already in the content btree.

There are a number of use cases where even large hashes can have security implications, and content based storage is not suitable for use by default. Options to mitigate this include verifying contents of the blocks before recording them as a duplicate (which would be very slow) or simply not using this storage mode.

There are some use cases where verifying equality of the blocks may have an acceptable performance impact. If hash collisions are recorded, it may be possible to later use idle time on the disks to verify equality. It may also be possible to verify equality immediately if another instance of the file is cached. For example, in the case of a mass web host, there are likely to be many identical instances of common software, and constant use is likely to keep these files cached. In that case, not only would disk space be saved, it may also be possible for a single instance of the data in cache to be used by all instances of the file.

If hashes match, a reference is taken against the existing extent instead of creating a new one.

If the checksum isn't already indexed, a new extent is created and the content tree takes a reference against it.

When extents are freed, if the checksum tree is the last reference holder, the extent is either removed from the checksum tree or kept for later use (configurable).

Another configurable is reading the existing block to compare with any matches or just trusting the checksum.

This work is related to stripe granular IO, which would make it possible to configure the size of the extent indexed.
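Putting the mechanics above together, a minimal in-memory sketch of the content index might look like this (all names hypothetical; the real index would live in one or more btrees, and true hash-collision handling is omitted for brevity):

```python
# Hypothetical sketch of a content-indexed extent store. The real
# design would keep the index in dedicated btrees rather than a dict.
import hashlib

class ContentStore:
    def __init__(self, verify=True, keep_unreferenced=False):
        self.verify = verify                    # compare bytes on a hash match
        self.keep_unreferenced = keep_unreferenced
        self.index = {}                         # digest -> (data, refcount)

    def write(self, data):
        digest = hashlib.sha256(data).digest()  # a 256-bit content hash
        hit = self.index.get(digest)
        if hit is not None and (not self.verify or hit[0] == data):
            # take another reference against the existing extent
            self.index[digest] = (hit[0], hit[1] + 1)
            return digest
        # not indexed yet: create a new extent with one reference
        # (separate handling for genuine collisions is omitted here)
        self.index[digest] = (data, 1)
        return digest

    def free(self, digest):
        data, refs = self.index[digest]
        if refs > 1:
            self.index[digest] = (data, refs - 1)
        elif not self.keep_unreferenced:
            del self.index[digest]              # last reference dropped

store = ContentStore()
a = store.write(b"common block")
b = store.write(b"common block")
assert a == b and store.index[a][1] == 2   # deduplicated, two refs
```

The `verify` and `keep_unreferenced` flags correspond to the two configurables discussed above: reading back the existing block versus trusting the checksum, and dropping versus retaining extents whose last reference goes away.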

Block group reclaim

The split between data and metadata block groups means that we sometimes have mostly empty block groups dedicated to only data or metadata. As files are deleted, we should be able to reclaim these and put the space back into the free space pool.

(See also #Fine-grained balances)

RBtree lock contention

Btrfs uses a number of rbtrees to index in-memory data structures. Some of these are dominated by reads, and the lock contention from searching them is showing up in profiles. We need to look into an RCU and sequence counter combination to allow lockless reads.
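The read side of that combination can be sketched with a plain sequence counter (a simplified, single-threaded model of the kernel's seqcount pattern; names hypothetical):

```python
# Simplified model of the seqcount pattern: writers bump the counter
# to odd before updating and back to even after; readers retry if the
# counter was odd or changed underneath them. No real locking here --
# this only demonstrates the retry protocol.
class SeqCount:
    def __init__(self):
        self.seq = 0

    def write_begin(self):
        self.seq += 1          # odd: a write is in progress

    def write_end(self):
        self.seq += 1          # even: structure is stable again

    def read_begin(self):
        return self.seq

    def read_retry(self, start):
        return bool(start & 1) or self.seq != start

sc = SeqCount()
start = sc.read_begin()
assert not sc.read_retry(start)    # no writer ran: snapshot is valid
sc.write_begin(); sc.write_end()   # a write intervenes
assert sc.read_retry(start)        # reader must retry its lookup
```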

Chunk tree backups

The chunk tree is critical to mapping logical block numbers to physical locations on the drive. We need to make the mappings discoverable via a block device scan so that we can recover from corrupted chunk trees.

Rsync integration

Now that we have code to efficiently find newly updated files, we need to tie it into tools such as rsync and dirvish. (For bonus points, we can even tell rsync _which blocks_ inside a file have changed. Would need to work with the rsync developers on that one.)

Not exactly an implementation of this, but I (A. v. Bidder) have patched dirvish to use btrfs snapshots instead of hardlinked directory trees. Discussion archived on the Dirvish mailing list.

Atomic write API

The Btrfs implementation of data=ordered only updates metadata to point to new data blocks when the data IO is finished. This makes it easy for us to implement atomic writes of an arbitrary size. Some hardware is coming out that can support this down in the block layer as well.

IO error tracking

As we get bad csums or IO errors from drives, we should track the failures and kick out the drive if it is clearly going bad.

Random write performance

Random writes introduce small extents and fragmentation. We need new file layout code to improve this and defrag the files as they are being changed.

Snapshot aware defrag

As we defragment files, we break any sharing from other snapshots. The balancing code will preserve the sharing, and defrag needs to grow this as well.

Btree lock contention

The btree locks, especially on the root block can be very hot. We need to improve this, especially in read mostly workloads.

DISCARD utilities

For SSDs with discard support, we could use a scrubber that goes through the fs and performs discard on anything that is unused. You could first use the balance operation to compact data to the front of the drive, then discard the rest.

Transparent compression as partition property

Currently, compression is enabled with a mount option. To ensure compression on external hard disks that are attached to more than one computer, the mount option would have to be specified on every machine. It would be more convenient if the partition could be marked as compressed, so that it is mounted with compression automatically, without specifying the mount option explicitly.

Projects claimed and in progress

Projects that are under development. Patches may exist, but have not been pulled into the mainline kernel.

Multiple Devices

Device IO Priorities

The disk format includes fields to record different performance characteristics of the device. This should be honored during allocations to find the most appropriate device. The allocator should also prefer to allocate from less busy drives to spread IO out more effectively.

(under development by Jan Schmidt and Arne Jansen, patch submitted)


Raid5 and Raid6

(Under development by cmason)

The multi-device code needs a raid6 implementation, and perhaps a raid5 implementation. This involves a few different topics including extra code to make sure writeback happens in stripe sized widths, and stripe aligning file allocations.

Metadata blocks are currently clustered together, but extra code will be needed to ensure stripe efficient IO on metadata. Another option is to leave metadata in raid1 mirrors and only do raid5/6 for data.

The existing raid0/1/10 code will need minor refactoring to provide dedicated chunk allocation and lookup functions per raid level. It currently happens via a collection of if statements.

Balancing status reports

Ioctls are needed to fetch the status of current balancing operations.

(HugoMills has posted patches for this; expected in mainline for version 3.1)

Fine-grained balances

We also need rebalancing ioctls that focus only on specific raid levels. (Hugo Mills has posted patches for kernel and userspace; expected in mainline for version 3.1)

Online raid config

Raid allocation options need to be configurable per snapshot, per subvolume and per file. It should also be possible to set some IO parameters on a directory and have all files inside that directory inherit the config.

btrfs-vol -b should take a parameter to change the raid config as it rebalances.

(Being developed in a GSoC project)

Changing RAID levels

We need ioctls to change between different raid levels. Some of these are quite easy -- e.g. for RAID0 to RAID1, we just halve the available bytes on the fs, then queue a rebalance.

(Being developed in a GSoC project)
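The RAID0-to-RAID1 arithmetic mentioned above is just a copies-per-block calculation; a minimal sketch (assuming only these two levels, names hypothetical):

```python
# Worked example of the free-space adjustment for a RAID0 -> RAID1
# conversion: with two copies of every block, available bytes halve.
def available_bytes(raw_bytes, level):
    copies = {"raid0": 1, "raid1": 2}[level]
    return raw_bytes // copies

assert available_bytes(2 * 10**12, "raid0") == 2 * 10**12  # 2 TB usable
assert available_bytes(2 * 10**12, "raid1") == 10**12      # halved
```

After adjusting the accounting, the actual data movement is handled by queueing a rebalance, as the text above describes.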

Offline fsck

(Currently being developed by cmason)

  • Introduce semantic checks for filesystem data structures
  • Limit memory usage by triggering a back-reference only mode when too many extents are pending
  • Add verification of the generation number in btree metadata

Forced readonly mounts on errors

(liubo has submitted patches for the first step [framework] and they are now merged into mainline)

The sources have a number of BUG() statements that could easily be replaced with code to force the filesystem readonly. This is the first step toward being more tolerant of disk corruption: add a framework for generating errors that force the filesystem readonly, then convert the BUG() calls to that framework incrementally.

Dedicated metadata drives

We're able to split data and metadata IO very easily. Metadata tends to be dominated by seeks and for many applications it makes sense to put the metadata onto faster SSDs.

(Under development by Jan Schmidt and Arne Jansen, patch submitted)

Drive swapping

(Under development by Ilya Dryomov)

Right now when we replace a drive, we do so with a full FS balance. If we are inserting a new drive to remove an old one, we can do a much less expensive operation where we just put valid copies of all the blocks onto the new drive.

Free inode number cache

(Under development by Li Zefan)

As the filesystem fills up, finding a free inode number will become expensive. This should be cached the same way we do free blocks.
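A minimal sketch of such a cache, keeping free inode numbers as (start, count) runs the way the free space cache keeps extents of free blocks (names hypothetical):

```python
# Hypothetical free inode number cache: free numbers are stored as
# (start, count) runs, and allocation pops from the lowest run.
class FreeInoCache:
    def __init__(self, runs):
        self.runs = sorted(runs)       # list of (start, count) runs

    def alloc(self):
        start, count = self.runs[0]
        if count == 1:
            self.runs.pop(0)           # run exhausted
        else:
            self.runs[0] = (start + 1, count - 1)
        return start

    def free(self, ino):
        self.runs.append((ino, 1))     # coalescing of adjacent runs
        self.runs.sort()               # omitted for brevity

cache = FreeInoCache([(256, 2), (1000, 5)])
assert cache.alloc() == 256
assert cache.alloc() == 257
assert cache.alloc() == 1000   # first run exhausted, move to the next
```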

Backref walking utilities

(Under development by Liu Bo; patches at RFC stage)

Given a block number on a disk, the Btrfs metadata can find all the files and directories that use or care about that block. Some utilities to walk these back refs and print the results would help debug corruptions.

Given an inode, the Btrfs metadata can find all the directories that point to the inode. We should have utils to walk these back refs as well.

Test Suite

(patch submitted by Anand Jain)

Currently the xfs test suite is used to validate the base functionality of btrfs. It would be good to extend it to test btrfs-specific functions like snapshot creation/deletion, balancing and relocation.

There is a GSoC project with a similar goal, done by Adytia Dani.

Time slider

(Under development by Anand Jain)

Automatic snapshots and a tool to manage them.

Development notes, please read

It is quite normal for several features to be in development at once, some of which are driven by an ioctl call identified by a number. Please check that your feature does not use an already claimed number.

Tentative list:

Ioctl range | Feature                                      | Owner        | Notes
32-34       | Restriper, raid level convertor              | Ilya Dryomov |
35-36       | inspect-internal commands, inode -> filename | Jan Schmidt  |
37          | send command                                 | Jan Schmidt  | might need more than one
40-49       | Subvolume Quota implementation               | Arne Jansen  | might not need all