Btrfs currently has a sequence number the NFS server can use to detect changes in the file. This should be wired into the NFS support code.
Device IO Priorities
The disk format includes fields to record different performance characteristics of the device. This should be honored during allocations to find the most appropriate device. The allocator should also prefer to allocate from less busy drives to spread IO out more effectively.
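One possible selection policy is sketched below. The Device fields (speed_class, inflight_ios, free_bytes) and the ordering rule are assumptions made for illustration and are not taken from the Btrfs disk format or allocator.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    speed_class: int    # hypothetical: lower value means a faster device
    inflight_ios: int   # hypothetical measure of how busy the device is
    free_bytes: int


def pick_device(devices, needed_bytes):
    """Return the preferred device with enough free space, or None."""
    candidates = [d for d in devices if d.free_bytes >= needed_bytes]
    if not candidates:
        return None
    # Assumed policy: prefer the faster class, then the less busy device.
    return min(candidates, key=lambda d: (d.speed_class, d.inflight_ios))
```

A real allocator would likely weigh these factors rather than sort on them strictly, so that a fast but saturated device does not always win.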
It should be possible to use very fast devices as a front end to traditional storage. The most used blocks should be kept hot on the fast device and they should be pushed out to slower storage in large sequential chunks.
Balancing Extents Across Drives
The code to balance extents duplicates extents found in multiple snapshots by forcing a COW write in each individual snapshot.
When multiple references are found to a data extent, the extent should be copied to a new location and all of the references should be updated to point to the new location. This will be faster and waste less space.
Extra care must be taken to avoid races with data writeback and new writes into the file. Most of the difficult problems can be avoided by checking the transaction id of the extent back reference. If the transaction id is older than the running transaction, those bytes are guaranteed not to be modified in place.
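The transaction id check can be sketched as follows. The function name and the (owner, transid) pair layout are hypothetical; only the comparison against the running transaction comes from the text above.

```python
def split_back_refs(back_refs, running_transid):
    """Split (owner, transid) back references into safe and risky owners.

    A reference written by a transaction older than the running one will
    not be modified in place, so it can be copied without extra locking.
    """
    safe, risky = [], []
    for owner, transid in back_refs:
        if transid < running_transid:
            safe.append(owner)
        else:
            # Written by the running transaction: may still change in place.
            risky.append(owner)
    return safe, risky
```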
The btree inode pages can be used to store a temporary copy of the extent while it is being copied to a new location on the disk.
The multi-device code needs a raid6 implementation, and perhaps a raid5 implementation. This involves a few different topics including extra code to make sure writeback happens in stripe sized widths, and stripe aligning file allocations.
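As a rough illustration of the two pieces mentioned above, the sketch below rounds a write out to full stripe boundaries and computes a raid5-style XOR parity block. The function names and parameters (stripe_len, data_stripes) are assumptions for the example, and the second parity block that raid6 needs is not covered.

```python
def full_stripe_range(offset, length, stripe_len, data_stripes):
    """Expand [offset, offset + length) to whole-stripe boundaries."""
    width = stripe_len * data_stripes
    start = (offset // width) * width
    end = -(-(offset + length) // width) * width  # round up to a full stripe
    return start, end


def xor_parity(blocks):
    """Return the XOR of equally sized blocks (raid5-style parity)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)
```

Because XOR is its own inverse, a lost data block can be rebuilt by XORing the parity block with the surviving data blocks.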
Metadata blocks are currently clustered together, but extra code will be needed to ensure stripe efficient IO on metadata. Another option is to leave metadata in raid1 mirrors and only do raid5/6 for data.
The existing raid0/1/10 code will need minor refactoring to provide dedicated chunk allocation and lookup functions per raid level. It currently happens via a collection of if statements.
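One shape this refactoring could take is a table of per-level entries instead of a chain of if statements. The sketch below is only an illustration: the RaidProfile type, the field names and the numeric values are assumptions and are not taken from the Btrfs source.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RaidProfile:
    name: str
    min_devices: int                      # illustrative value
    data_stripes: Callable[[int], int]    # devices -> stripes holding unique data


# Hypothetical table; adding a level means adding one entry here.
PROFILES = {
    "raid0": RaidProfile("raid0", 2, lambda n: n),
    "raid1": RaidProfile("raid1", 2, lambda n: 1),
    "raid10": RaidProfile("raid10", 4, lambda n: n // 2),
}


def data_stripes(level, num_devices):
    """Look up the level once, then call its own function."""
    profile = PROFILES[level]
    if num_devices < profile.min_devices:
        raise ValueError(f"{level} needs at least {profile.min_devices} devices")
    return profile.data_stripes(num_devices)
```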
IO stripe size and other optimizations
The multi-device code includes a number of IO parameters that do not currently get used. These need tunables from userland and they need to be honored by the allocation and IO routines.
Balancing status reports
Ioctls are needed to fetch the status of current balancing operations.
IO Error recording
Items should be inserted into the device tree to record the location and frequency of IO errors, including checksumming errors and misplaced writes. Devices should be taken offline after they reach a given threshold.
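A minimal in-memory sketch of the bookkeeping is shown below. The class name, the error kinds and the single global threshold are assumptions for illustration; the real records would be items in the device tree.

```python
from collections import Counter


class DeviceErrorLog:
    """Count IO errors per (kind, sector) and flag the device when too many occur."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.errors = Counter()   # (kind, sector) -> number of occurrences
        self.online = True

    def record(self, kind, sector):
        """Record one error and return whether the device is still online."""
        self.errors[(kind, sector)] += 1
        if sum(self.errors.values()) >= self.threshold:
            self.online = False
        return self.online
```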
There should be an offline application that tries to rehab a drive by overwriting the bad sectors and exercising it.
If a checksum failure or missed write is corrected by reading from an alternate mirror, the extent should be scheduled for relocation to correct the bad data.
Hot spare support
It should be possible to add a drive and flag it as a hot spare.
A background scrubbing process should kick in while the drive is idle and verify checksums on both file data and metadata.
Online raid config
Raid allocation options need to be configurable per snapshot, per subvolume and per file. It should also be possible to set some IO parameters on a directory and have all files inside that directory inherit the config.
btrfs-vol -b should take a parameter to change the raid config as it rebalances.
Different sector sizes
The extent_io code makes some assumptions about the page size and the underlying FS sectorsize or blocksize. These need to be cleaned up, especially for the extent_buffer code.
The extent_io code is able to lock ranges in the file without pages being present, which means it can provide a number of improvements over the generic O_DIRECT code.
- No need to hold i_sem during O_DIRECT writes
- No need for strange hole creation semaphores
- Many fewer races with the page cache code
Btrfs O_DIRECT code should have two modes: one where O_DIRECT still does COW (including checksumming), and one where it simply writes directly over existing extents in the file.
The existing APIs for IO and checksumming assume existing pages are kernel pages, and so extra work will be required to make them operate on user pages.
extent_io state ranges
The extent_io locking code works on [start, end] tuples. This should be changed to [start, length] tuples and all users should be updated.
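The conversion itself is small but easy to get wrong by one. The helpers below are a sketch that assumes the end offset in the current tuples is inclusive; the function names are hypothetical.

```python
def end_to_length(start, end):
    """Convert an inclusive [start, end] range to (start, length)."""
    if end < start:
        raise ValueError("end must not be before start")
    return start, end - start + 1


def length_to_end(start, length):
    """Convert (start, length) back to an inclusive [start, end] range."""
    if length <= 0:
        raise ValueError("length must be positive")
    return start, start + length - 1
```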
An efficient btree walk to remove snapshots needs to be implemented, along with code to find dentries that would make a snapshot directory tree busy (with the same rules as unmount). The tree walk already exists in the transaction commit code, but some glue is needed to implement rmdir on the tree of tree roots and do the dentry busy checks.
Root pointers inside a subvolume
Root pointers only exist today inside the tree of tree roots. One way to address scalability problems in the btree is to allow a given directory to be created as a new tree root.
This will require some changes to properly update root pointers in the transaction subsystem.
Multiple chunk trees and extent allocation trees
The current code only supports a single chunk tree and a single extent allocation tree. We may need to implement more fine grained extent allocation trees, and it may make sense to create one extent allocation tree per chunk.
This is a fairly large change because the code makes many assumptions about there being a single extent allocation tree.
Limiting btree failure domains
One way to limit the time required for filesystem rebuilds is to limit the number of places that might reference a given area of the drive. One recent discussion of these issues is Val Henson's chunkfs work.
There are a few cases where chunkfs still breaks down to O(N) rebuild, and without making completely separate filesystems it is very difficult to avoid these in general. Btrfs has a number of options that fit in the current design:
- Tie chunks of space to subvolumes and snapshots of those subvolumes
- Tie chunks of space to tree roots inside a subvolume
But these still allow a single large subvolume to span huge amounts of data. We can either place responsibility for limiting failure domains on the admin, or we can implement a key range restriction on allocations.
The key range restriction would consist of a new btree that provided very coarse grained indexing of chunk allocations against key ranges. This could provide hints to fsck in order to limit consistency checking.
- Introduce semantic checks for filesystem data structures
- Limit memory usage by triggering a back-reference only mode when too many extents are pending
- Add verification of the generation number in btree metadata
Online fsck includes a number of difficult decisions around races and coherency. Given that back references allow us to limit the total amount of memory required to verify a data structure, we should consider simply implementing fsck in the kernel.
Content based storage
Content based storage would index data extents by a large (at least 256-bit) checksum of the data contents. This index would be stored in one or more dedicated btrees, and new file writes would be checked to see if they matched extents already in the content btree.
There are a number of use cases where even large hashes can have security implications, and content based storage is not suitable for use by default. Options to mitigate this include verifying contents of the blocks before recording them as a duplicate (which would be very slow) or simply not using this storage mode.
There are some use cases where verifying equality of the blocks may have an acceptable performance impact. If hash collisions are recorded, it may be possible to later use idle time on the disks to verify equality. It may also be possible to verify equality immediately if another instance of the file is cached. For example, in the case of a mass web host, there are likely to be many identical instances of common software, and constant use is likely to keep these files cached. In that case, not only would disk space be saved, it may also be possible for a single instance of the data in cache to be used by all instances of the file.
If hashes match, a reference is taken against the existing extent instead of creating a new one.
If the checksum isn't already indexed, a new extent is created and the content tree takes a reference against it.
When extents are freed, if the checksum tree is the last reference holder, the extent is either removed from the checksum tree or kept for later use (configurable).
Another configurable is reading the existing block to compare with any matches or just trusting the checksum.
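The reference counting rules above can be sketched with an in-memory model. Everything here is an assumption for illustration: the ContentStore class, the choice of SHA-256 as the 256-bit checksum, and the two flags standing in for the configurables.

```python
import hashlib


class ContentStore:
    def __init__(self, verify=False, keep_unreferenced=False):
        self.verify = verify                        # compare bytes on a hash match
        self.keep_unreferenced = keep_unreferenced  # keep extents only the index holds
        self.index = {}    # digest -> extent id (stands in for the content btree)
        self.data = {}     # extent id -> bytes
        self.refs = {}     # extent id -> reference count
        self.next_id = 0

    def write(self, data):
        digest = hashlib.sha256(data).digest()
        eid = self.index.get(digest)
        if eid is not None and (not self.verify or self.data[eid] == data):
            self.refs[eid] += 1          # hash matched: reuse the extent
            return eid
        eid = self.next_id
        self.next_id += 1
        self.data[eid] = data
        self.refs[eid] = 1               # the file's reference
        if digest not in self.index:
            self.index[digest] = eid
            self.refs[eid] += 1          # the content tree's reference
        return eid

    def free(self, eid):
        self.refs[eid] -= 1
        digest = hashlib.sha256(self.data[eid]).digest()
        indexed = self.index.get(digest) == eid
        if indexed and self.refs[eid] == 1 and not self.keep_unreferenced:
            del self.index[digest]       # the index was the last holder
            self.refs[eid] -= 1
        if self.refs[eid] == 0:
            del self.data[eid]
            del self.refs[eid]
```

With keep_unreferenced set, an extent whose only remaining holder is the index stays available, so a later write of the same contents reuses it.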
This work is related to stripe granular IO, which would make it possible to configure the size of the extent indexed.
Generation numbers in the btree allow efficient walking to discover extents that have changed since a given transaction. This information needs to be exported to userland in two different ways:
- A simple list of files and directories that have been updated since the generation requested. This should be suitable for feeding into rsync.
- A key range system for synchronizing between two mounts. This will end up looking a lot like crfs.
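The reason the walk is efficient can be shown with a toy tree. The sketch assumes that a copy-on-write update stamps every block on the path to the root with the new generation, so a block's generation is at least as new as anything below it. The Node type and its fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    generation: int
    items: list = field(default_factory=list)      # leaf entries: (name, generation)
    children: list = field(default_factory=list)   # child Node objects


def changed_since(node, generation):
    """Return names of items changed after `generation`, pruning old subtrees."""
    if node.generation <= generation:
        return []   # nothing below this block was rewritten since then
    found = [name for name, gen in node.items if gen > generation]
    for child in node.children:
        found.extend(changed_since(child, generation))
    return found
```

Whole subtrees that have not been touched are skipped after reading a single block, which is what makes the output cheap enough to feed into a tool such as rsync.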
Disk format issues: The generation number in the inode is the generation number for file creation. This should be used in the traditional NFS generation number checking. A new transaction id needs to be added for the last transaction to change the inode.
A conversion utility from Ext3 already exists. It should be expanded to convert from Ext4 as well.