Project ideas

From btrfs Wiki

Revision as of 15:19, 23 July 2012


Unclaimed projects

If you are actually going to implement an idea/feature, read the notes at the end of this page.

Multiple Devices

Not claimed — no patches yet — Not in kernel yet

It should be possible to use very fast devices as a front end to traditional storage. The most used blocks should be kept hot on the fast device and they should be pushed out to slower storage in large sequential chunks.

The latest generation of SSD drives can achieve high IOPS rates for both reads and writes. They would make very effective front-end caches for slower (and less expensive) spinning media. A caching layer could be added to Btrfs to store the hottest blocks on faster devices and achieve better read and write throughput. This cache could also make use of other spindles in the existing spinning storage; for example, frequently used, randomly accessed data could be mirrored across all drives when space is available. A similar mechanism could store frequent random read patterns (such as booting a system) as a series of sequential blocks in this cache.
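The promotion policy described above can be modelled with a simple access counter. This is an illustrative sketch only, not btrfs code; the class and names are invented, and the threshold stands in for whatever "hot" heuristic a real implementation would use.

```python
from collections import Counter

class HotBlockCache:
    """Toy model of a fast-device front-end cache: blocks accessed at
    least `threshold` times are promoted to the fast device; everything
    else stays on slow storage."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.hits = Counter()
        self.fast = set()   # block numbers resident on the fast device

    def access(self, block):
        self.hits[block] += 1
        if self.hits[block] >= self.threshold:
            self.fast.add(block)          # promote a hot block
        return "fast" if block in self.fast else "slow"

cache = HotBlockCache(threshold=3)
for _ in range(3):
    tier = cache.access(100)   # third access promotes block 100
```

A real cache would also demote cold blocks and push them back to slow storage in large sequential chunks, which this sketch omits.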


Not claimed — no patches yet — Not in kernel yet

The multi-device code includes a number of IO parameters that do not currently get used. These need tunables from userland and they need to be honored by the allocation and IO routines.


Not claimed — no patches yet — Not in kernel yet

Devices should be taken offline after they reach a given threshold of IO errors. Jeff Mahoney is working on handling EIO (and other) errors; this project can build on top of that work.


Not claimed — no patches yet — Not in kernel yet

It should be possible to add a drive and flag it as a hot spare.

extent_io

Not claimed — no patches yet — Not in kernel yet

The extent_io code makes some assumptions about the page size and the underlying FS sectorsize or blocksize. These need to be cleaned up, especially for the extent_buffer code.

The HDD industry currently uses 512-byte blocks; future drives can be expected to use 4 KB blocks.


Not claimed — no patches yet — Not in kernel yet

The extent_io locking code works on [start, end] tuples. This should be changed to [start, length] tuples and all users should be updated.
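The two interval conventions can be related with a pair of one-line conversions. This is a sketch of the arithmetic only; the function names are invented for illustration.

```python
def end_to_length(start, end):
    """Convert an inclusive [start, end] byte range, as the current
    extent_io locking code uses, to a [start, length] pair."""
    return (start, end - start + 1)

def length_to_end(start, length):
    """Inverse conversion back to an inclusive end offset."""
    return (start, start + length - 1)

# A 4096-byte range starting at offset 0 ends at byte 4095 inclusive.
assert end_to_length(0, 4095) == (0, 4096)
assert length_to_end(0, 4096) == (0, 4095)
```

The off-by-one in the `+ 1`/`- 1` terms is exactly the class of error the [start, length] convention is meant to avoid spreading through callers.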

Other projects

Not claimed — no patches yet — Not in kernel yet

One way to limit the time required for filesystem rebuilds is to limit the number of places that might reference a given area of the drive. One recent discussion of these issues is Val Henson's chunkfs work.

There are a few cases where chunkfs still breaks down to O(N) rebuild, and without making completely separate filesystems it is very difficult to avoid these in general. Btrfs has a number of options that fit in the current design:

  • Tie chunks of space to Subvolumes and snapshots of those subvolumes
  • Tie chunks of space to tree roots inside a subvolume

But these still allow a single large subvolume to span huge amounts of data. We can either place responsibility for limiting failure domains on the admin, or we can implement a key range restriction on allocations.

The key range restriction would consist of a new btree that provided very coarse grained indexing of chunk allocations against key ranges. This could provide hints to fsck in order to limit consistency checking.


Not claimed — no patches yet — Not in kernel yet

Content based storage would index data extents by a large (256bit at least) checksum of the data contents. This index would be stored in one or more dedicated btrees and new file writes would be checked to see if they matched extents already in the content btree.

There are a number of use cases where even large hashes can have security implications, and content based storage is not suitable for use by default. Options to mitigate this include verifying contents of the blocks before recording them as a duplicate (which would be very slow) or simply not using this storage mode.

There are some use cases where verifying equality of the blocks may have an acceptable performance impact. If hash collisions are recorded, it may be possible to later use idle time on the disks to verify equality. It may also be possible to verify equality immediately if another instance of the file is cached. For example, in the case of a mass web host, there are likely to be many identical instances of common software, and constant use is likely to keep these files cached. In that case, not only would disk space be saved, it may also be possible for a single instance of the data in cache to be used by all instances of the file.

If hashes match, a reference is taken against the existing extent instead of creating a new one.

If the checksum isn't already indexed, a new extent is created and the content tree takes a reference against it.

When extents are freed, if the checksum tree is the last reference holder, the extent is either removed from the checksum tree or kept for later use (configurable).

Another configurable is reading the existing block to compare with any matches or just trusting the checksum.

This work is related to stripe granular IO, which would make it possible to configure the size of the extent indexed.
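The index-by-checksum and reference-counting flow described above can be sketched as follows. This is an illustrative model, not btrfs code: the class is invented, SHA-256 stands in for the "256-bit at least" checksum, and the content comparison shows the configurable verify-before-trusting option.

```python
import hashlib

class ContentStore:
    """Toy sketch of content-based extent indexing: extents are keyed
    by a SHA-256 digest of their contents, and matching writes take a
    reference instead of allocating a new extent."""

    def __init__(self):
        self.extents = {}   # digest -> (data, refcount)

    def write(self, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.extents:
            stored, refs = self.extents[digest]
            # Configurable: compare contents before trusting the hash,
            # guarding against collisions at the cost of a read.
            assert stored == data
            self.extents[digest] = (stored, refs + 1)
        else:
            # New content: create the extent with one reference.
            self.extents[digest] = (data, 1)
        return digest

    def free(self, digest):
        data, refs = self.extents[digest]
        if refs == 1:
            # Last reference holder: drop the extent (a real system
            # could instead keep it for later reuse, configurably).
            del self.extents[digest]
        else:
            self.extents[digest] = (data, refs - 1)
```

A duplicate write therefore costs only a hash lookup and a refcount bump, which is the disk-space win the project is after.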


Not claimed — no patches yet — Not in kernel yet

Now that we have code to efficiently find newly updated files, we need to tie it into tools such as rsync and dirvish. (For bonus points, we can even tell rsync _which blocks_ inside a file have changed. Would need to work with the rsync developers on that one.)

Not exactly an implementation of this, but I (A. v. Bidder) have patched dirvish to use btrfs snapshots instead of hardlinked directory trees. Discussion archived on the Dirvish mailing list.


Not claimed — no patches yet — Not in kernel yet

The Btrfs implementation of data=ordered only updates metadata to point to new data blocks when the data IO is finished. This makes it easy for us to implement atomic writes of an arbitrary size. Some hardware is coming out that can support this down in the block layer as well.


Chris Mason — somehow done by the autodefrag mount option — Not in kernel yet

Random writes introduce small extents and fragmentation. We need new file layout code to improve this and defrag the files as they are being changed.


Not claimed — no patches yet — Not in kernel yet

(is this the new locking scheme in 3.1?)

The btree locks, especially on the root block can be very hot. We need to improve this, especially in read mostly workloads.


Not claimed — no patches yet — Not in kernel yet

Btrfs currently has a sequence number the NFS server can use to detect changes in the file. This should be wired into the NFS support code.


Not claimed — no patches yet — Not in kernel yet

It would be nice to have a library for manipulating a btrfs filesystem in the way the btrfs tool does, and to make it available to other programs, or via language bindings.


Not claimed — no patches yet — Not in kernel yet

This is similar to TRIM on SSD devices, but for any device. Simply go through unallocated space and rewrite it with zeros (or perhaps with some poison pattern, so we could recognize when data from a free block ends up being used). The trim code could be enhanced to either submit a TRIM command or write a zeroed block to the disk. As trim is supported by more filesystems, a new REQ_ flag could be introduced to the block layer to perform the zeroing, so that other filesystems can extend their trim support to clear free space as well.
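The core loop is trivial; the sketch below models it against an in-memory byte array. Everything here is illustrative (the function, the toy "device"); a real implementation would iterate the free space tree and submit TRIM or zero-write bios.

```python
def zero_free_ranges(device, free_ranges, fill=b"\x00"):
    """Overwrite every unallocated (offset, length) range of a toy
    in-memory 'device' with a fill byte -- zeros, or a poison pattern
    so reuse of freed blocks can be spotted later."""
    for offset, length in free_ranges:
        device[offset:offset + length] = fill * length
    return device

disk = bytearray(b"ABCDEFGH")
zero_free_ranges(disk, [(2, 3)])   # clears bytes 2..4
```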


Not claimed — no patches yet — Not in kernel yet

There are a few operations that may take a long time, stall umount, or slow down the filesystem. It is possible, and would be useful, to add cancellation support to filesystem defrag (stalls were observed e.g. during defrag of the root subvolume directory) and to device del.

There are two ways to cancel an operation:

  • synchronous – when the operation is called from userspace and all processing is done in the context of that process (the case of btrfs fi defrag FILE), pressing Ctrl-C raises a signal that is checked inside the defrag loop. Whether to honor plain Ctrl-C or only kill -9 should be discussed.
  • asynchronous – when the processing is done in a kernel thread, this needs the same cancel command support that scrub and balance have

There are more places where a check for an in-progress unmount would improve responsiveness, such as during free space cache writeout. However, one has to be sure that cancelling such operations is safe.
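The synchronous case above can be sketched as a flag set by a signal handler and polled between work items. This is an illustrative userspace model only; the names are invented, and a kernel loop would check for pending signals rather than a Python global.

```python
import signal

cancel_requested = False

def request_cancel(signum, frame):
    """Signal handler for the synchronous case: just set a flag; the
    worker loop notices it at its next cancellation point."""
    global cancel_requested
    cancel_requested = True

def defrag_files(files, should_cancel=lambda: cancel_requested):
    """Toy defrag loop with a cancellation check between work items,
    mirroring a signal check inside a kernel defrag loop."""
    done = []
    for f in files:
        if should_cancel():
            break              # cancellation point between files
        done.append(f)
    return done

# Pressing Ctrl-C sets the flag and lets the loop stop cleanly.
signal.signal(signal.SIGINT, request_cancel)
```

The asynchronous case would instead expose a cancel command (as scrub and balance do) that sets an equivalent flag on the kernel thread.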


Not claimed — no patches yet — Not in kernel yet

Set mount options persistently (e.g. compress), similar to ext4's "tune2fs -O".

Userspace tools projects

Collection of ideas and small tasks for btrfsprogs or other relevant utilities.

btrfs

  • sort devices in btrfs fi show by name or id
  • show the meaning of various error codes, e.g. for the incompatibility bits
  • error messages need polishing
  • enhance btrfs subvol list to show read-only snapshots (or read-only subvolumes in general)
  • clear the device signature so it does not appear in the btrfs fi show list
  • merge functionality of btrfstune, e.g. under btrfs dev set-seed /dev/ (discuss the command name though)

btrfs-convert

  • report progress
  • [patch sent] add option to transfer label from the original filesystem

mkfs.btrfs

strace

Add support for btrfs-specific ioctls. Currently raw numbers are printed; teach strace to print human-readable output.

Projects claimed and in progress

Projects that are under development. Patches may exist, but have not been pulled into the mainline kernel.

Multiple Devices

Jan Schmidt and Arne Jansen — submitted — Not in kernel yet

The disk format includes fields to record different performance characteristics of the device. This should be honored during allocations to find the most appropriate device. The allocator should also prefer to allocate from less busy drives to spread IO out more effectively.


Chris Mason — patch developed, needs updating and integration — Not in kernel yet

The multi-device code needs a raid6 implementation, and perhaps a raid5 implementation. This involves a few different topics including extra code to make sure writeback happens in stripe sized widths, and stripe aligning file allocations.

Metadata blocks are currently clustered together, but extra code will be needed to ensure stripe efficient IO on metadata. Another option is to leave metadata in raid1 mirrors and only do raid5/6 for data.

The existing raid0/1/10 code will need minor refactoring to provide dedicated chunk allocation and lookup functions per raid level. It currently happens via a collection of if statements.


Being developed in a GSoC project (who? Ilya Dryomov?) — no patches yet — Not in kernel yet

Raid allocation options need to be configurable per snapshot, per subvolume and per file. It should also be possible to set some IO parameters on a directory and have all files inside that directory inherit the config.


Ilya Dryomov — on mailing list — In kernel 3.3

We need ioctls to change between different raid levels. Some of these are quite easy -- e.g. for RAID0 to RAID1, we just halve the available bytes on the fs, then queue a rebalance.
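The easy RAID0-to-RAID1 accounting mentioned above is just an arithmetic adjustment before the rebalance. The sketch below is illustrative only; the function is invented and other conversions need a real rebalance plan, not a formula.

```python
def available_after_conversion(avail_bytes, old_level, new_level):
    """Toy accounting for the easy case mentioned above: converting
    RAID0 to RAID1 mirrors every block, halving user-visible space.
    The actual data movement is left to a queued rebalance."""
    if (old_level, new_level) == ("raid0", "raid1"):
        return avail_bytes // 2
    raise NotImplementedError("other conversions need a full rebalance plan")

# A 2 TiB raid0 filesystem reports 1 TiB available once mirrored.
assert available_after_conversion(2 * 2**40, "raid0", "raid1") == 2**40
```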

Contains "Balance operation progress" project.


Stefan Behrens — submitted — Not in kernel yet

Items should be inserted into the device tree to record the location and frequency of IO errors, including checksumming errors and misplaced writes.

Other projects

Chris Mason — no patches yet — Not in kernel yet

  • Introduce semantic checks for filesystem data structures
  • Limit memory usage by triggering a back-reference only mode when too many extents are pending
  • Add verification of the generation number in btree metadata


Liu Bo — framework patches accepted — Not in kernel yet

The sources have a number of BUG() statements that could easily be replaced with code to force the filesystem readonly. This is the first step in being more fault tolerant of disk corruption. The initial work is a framework for generating errors that force the filesystem readonly; the conversion from BUG() to that framework can then happen incrementally.


Jan Schmidt and Arne Jansen — submitted — Not in kernel yet

We're able to split data and metadata IO very easily. Metadata tends to be dominated by seeks and for many applications it makes sense to put the metadata onto faster SSDs.


Stefan Behrens — no patches yet — Not in kernel yet

Right now when we replace a drive, we do so with a full FS balance. If we are inserting a new drive to remove an old one, we can do a much less expensive operation where we just put valid copies of all the blocks onto the new drive.


Liu Bo — patches at RFC stage — Not in kernel yet

Given a block number on a disk, the Btrfs metadata can find all the files and directories that use or care about that block. Some utilities to walk these back refs and print the results would help debug corruptions.

Given an inode, the Btrfs metadata can find all the directories that point to the inode. We should have utils to walk these back refs as well.


Anand Jain — no patches yet — Not in kernel yet

Currently the xfs test suite is used to validate the base functionality of btrfs. It would be good to extend it to test btrfs-specific functions like snapshot creation/deletion, balancing and relocation.

There is a GSoC project with a similar goal, done by Aditya Dani: https://github.com/adityadani/xfstests/


Anand Jain — no patches yet — Not in kernel yet

Automatic snapshotting and a tool to manage it.

A tool partly covering this idea is snapper


Li Zefan — initial patch submitted — Not in kernel yet

Online fsck includes a number of difficult decisions around races and coherency. Given that back references allow us to limit the total amount of memory required to verify a data structure, we should consider simply implementing fsck in the kernel.


Jeff Liu — submitted — Not in kernel yet

Set the filesystem label via ioctl(2); users can get or set the label through btrfs filesystem label [label].


Sanjeev Mk, Shravan Aras, Gautam Akiwate — no patches yet — Not in kernel yet

btrfstune currently can be used to update the seeding value. This project would add on to that and make btrfstune a generic tool to tune various FS parameters.

(Would be better to have this under 'btrfs tune' or something like that)

Contact:

  • sanjeevmk4890@gmail.com, IRC nick: s-mk
  • 123.shravan@gmail.com, IRC nick: shravan
  • gautam.akiwate@gmail.com, IRC nick: gakiwate


Liu Bo — no patches yet — Not in kernel yet

Btrfs uses a number of rbtrees to index in-memory data structures. Some of these are dominated by reads, and the lock contention from searching them is showing up in profiles. We need to look into an RCU and sequence counter combination to allow lockless reads.
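The sequence counter half of that combination can be sketched as follows: writers bump the counter to odd while updating and back to even when done, and readers retry if the counter was odd or changed during the read. This is an illustrative userspace model (the class is invented); the kernel primitive it mimics is the seqcount/seqlock, paired with RCU for safe traversal.

```python
import threading

class SeqCount:
    """Toy sequence counter giving lockless reads over a shared
    structure: only writers take a lock, readers just retry."""

    def __init__(self):
        self.seq = 0
        self._wlock = threading.Lock()

    def write(self, update):
        with self._wlock:
            self.seq += 1        # odd: update in progress
            update()
            self.seq += 1        # even: update complete

    def read(self, snapshot):
        while True:
            start = self.seq
            if start % 2:        # writer active, retry
                continue
            value = snapshot()
            if self.seq == start:
                return value     # nothing changed underneath us

tree = {"root": 1}
sc = SeqCount()
sc.write(lambda: tree.__setitem__("root", 2))
```

Read-mostly workloads win because the common path is two counter loads and no atomic read-modify-write, which is what removes the rbtree lock contention from the profiles.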


Li Zefan — initial on mailinglist — Not in kernel yet

As we defragment files, we break any sharing from other snapshots. The balancing code will preserve the sharing, and defrag needs to grow this as well.


Wu Bo — on mailinglist — Not in kernel yet

The chunk tree is critical to mapping logical block numbers to physical locations on the drive. We need to make the mappings discoverable via a block device scan so that we can recover from corrupted chunk trees.


David Sterba — submitted — Not in kernel yet

Find the actual on-disk size of a compressed file.


Ilya Dryomov — no patches yet — Not in kernel yet

The split between data and metadata block groups means that we sometimes have mostly empty block groups dedicated to only data or metadata. As files are deleted, we should be able to reclaim these and put the space back into the free space pool.

(See also #Fine-grained balances)


Ilya Dryomov — no patches yet — Not in kernel yet

A background thread could check at regular intervals whether there is enough room to balance the smallest chunk of each RAID type into the existing ones, and do so. This would also handle the 'Block group reclaim' case.


David Sterba — no patches yet — Not in kernel yet

Currently, compression is enabled by using a mount option. To ensure compression on external hard disks that are attached to more than one computer, one would have to specify the mount option everywhere. It would be more convenient if the partition could be marked as 'compressed', so that it is mounted with compression automatically, without the need to specify the mount option explicitly.

Finished

Li Zefan — complete — In kernel 3.0

As the filesystem fills up, finding a free inode number will become expensive. This should be cached the same way we do free blocks.
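Caching free inode numbers the way free blocks are cached can be sketched like this. The sketch is illustrative only (the class and starting number are invented); the real implementation reuses the free-space-cache machinery.

```python
class FreeInoCache:
    """Toy cache of free inode numbers, analogous to caching free
    blocks: freed numbers are recycled before the high-water mark of
    never-used numbers is advanced, so allocation stays O(1) even on
    a full filesystem."""

    def __init__(self, first=256):
        self.next_new = first   # never-used numbers start here
        self.freed = []         # recycled numbers, served first

    def alloc(self):
        if self.freed:
            return self.freed.pop()
        ino = self.next_new
        self.next_new += 1
        return ino

    def free(self, ino):
        self.freed.append(ino)
```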

Development notes, please read

It's quite normal that several features are being developed at once, and some of them are exercised via an ioctl call identified by a number. Please check that your feature does not use an already claimed number.

Tentative list:

Ioctl range   Feature                                    Owner            Notes
21            free
37-39         send command                               Jan Schmidt
40-49         Subvolume Quota implementation             Arne Jansen      might not need all
50            Set/Change label command                   Jeff Liu
51            compressed file size                       David Sterba
52-53         device IO error stats get/reset commands   Stefan Behrens
54-58         drive swapping commands                    Stefan Behrens
unassigned    Online fsck                                Li Zefan