Btrfs currently has a sequence number the NFS server can use to detect changes in the file. This should be wired into the NFS support code.
Device IO Priorities
The disk format includes fields to record different performance characteristics of the device. These fields should be honored during allocations to find the most appropriate device. The allocator should also prefer to allocate from less busy drives to spread IO out more effectively.
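As a rough illustration of the selection policy described above, the sketch below picks a device for an allocation. The `Device` class, its rating fields, and the use of queue depth as the "busy" signal are assumptions made for the example, not the actual Btrfs allocator interface.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    seek_speed: int    # hypothetical rating, higher is faster
    bandwidth: int     # hypothetical rating, higher is faster
    inflight_ios: int  # current queue depth, used here as the "busy" signal
    free_bytes: int


def pick_device(devices, size):
    """Return the preferred device with room for `size` bytes, or None."""
    candidates = [d for d in devices if d.free_bytes >= size]
    if not candidates:
        return None
    # Prefer the least busy device; break ties in favor of the faster one.
    return min(
        candidates,
        key=lambda d: (d.inflight_ios, -(d.seek_speed + d.bandwidth)),
    )
```

Ordering by load first and speed second is only one possible policy; a real allocator would probably weight the two against each other.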
It should be possible to use very fast devices as a front end to traditional storage. The most used blocks should be kept hot on the fast device and they should be pushed out to slower storage in large sequential chunks.
The latest generation of SSD drives can achieve high IOPS rates for both reading and writing. They will be very effective front end caches for slower (and less expensive) spinning media. A caching layer could be added to Btrfs to store the hottest blocks on faster devices to achieve better read and write throughput. This cache could also make use of other spindles in the existing spinning storage. For example, frequently used random-heavy data could be mirrored on all drives if space is available. A similar mechanism could store frequent random read patterns (such as booting a system) as a series of sequential blocks in this cache.
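One way to picture the hot/cold split is the toy cache below. It counts accesses per block and, when the fast device is full, evicts the coldest blocks in a single batch sorted by block number so the writeback to slow storage is as sequential as possible. The class name, the access-count heuristic, and the batch size are all assumptions for illustration; no such layer exists in Btrfs today.

```python
from collections import Counter


class HotBlockCache:
    def __init__(self, capacity, evict_batch):
        self.capacity = capacity        # blocks that fit on the fast device
        self.evict_batch = evict_batch  # blocks pushed out per eviction
        self.hits = Counter()           # block number -> access count
        self.cached = set()

    def access(self, block):
        """Record an access; return the blocks flushed to slow storage."""
        self.hits[block] += 1
        self.cached.add(block)
        if len(self.cached) <= self.capacity:
            return []
        # Pick the least-used blocks, then flush them in block order.
        coldest = sorted(self.cached, key=lambda b: (self.hits[b], b))
        coldest = coldest[: self.evict_batch]
        self.cached.difference_update(coldest)
        return sorted(coldest)
```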
(Under development by cmason)
The multi-device code needs a raid6 implementation, and perhaps a raid5 implementation. This involves a few different topics including extra code to make sure writeback happens in stripe sized widths, and stripe aligning file allocations.
Metadata blocks are currently clustered together, but extra code will be needed to ensure stripe efficient IO on metadata. Another option is to leave metadata in raid1 mirrors and only do raid5/6 for data.
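For readers unfamiliar with parity striping, the single-parity (raid5-style) case can be sketched as below: the parity block is the XOR of the data blocks in a stripe, and any one lost block is the XOR of the survivors. This is a generic illustration, not Btrfs code; raid6 needs a second, differently computed syndrome that is not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by their parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Recompute the block at lost_index from the remaining blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)
```

The example also shows why full-stripe writeback matters: updating a single data block requires reading or recomputing the parity for the whole stripe.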
The existing raid0/1/10 code will need minor refactoring to provide dedicated chunk allocation and lookup functions per raid level. It currently happens via a collection of if statements.
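The refactoring could follow a table-driven shape along these lines, with one function per raid level looked up by name instead of a chain of if statements. The function names and the simplified capacity model (two copies for raid1 and raid10) are assumptions for the sketch, not the existing chunk allocation code.

```python
def raid0_usable(num_devices, stripe_bytes):
    # Striping only: every device contributes usable space.
    return num_devices * stripe_bytes


def raid1_usable(num_devices, stripe_bytes):
    # Two copies of the same stripe: one stripe of usable space.
    return stripe_bytes


def raid10_usable(num_devices, stripe_bytes):
    # Striped mirrors: half of the devices hold unique data.
    return (num_devices // 2) * stripe_bytes


RAID_OPS = {
    "raid0": raid0_usable,
    "raid1": raid1_usable,
    "raid10": raid10_usable,
}


def usable_bytes(level, num_devices, stripe_bytes):
    """Dispatch to the per-level function registered in RAID_OPS."""
    try:
        op = RAID_OPS[level]
    except KeyError:
        raise ValueError(f"unsupported raid level: {level}") from None
    return op(num_devices, stripe_bytes)
```

Adding raid5 or raid6 would then mean registering new entries in the table rather than touching every caller.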
IO stripe size and other optimizations
The multi-device code includes a number of IO parameters that do not currently get used. These need tunables from userland and they need to be honored by the allocation and IO routines.
Balancing status reports
Ioctls are needed to fetch the status of current balancing operations.
IO Error recording
Items should be inserted into the device tree to record the location and frequency of IO errors, including checksumming errors and misplaced writes. Devices should be taken offline after they reach a given threshold.
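The bookkeeping could look roughly like the following in-memory model, which records each error with its kind and location and marks a device offline once a threshold is reached. The class, the error kinds, and the simple count-based threshold are assumptions; the real version would store items in the device tree.

```python
from collections import defaultdict


class DeviceErrorLog:
    def __init__(self, threshold):
        self.threshold = threshold
        self.errors = defaultdict(list)  # device id -> [(kind, byte offset)]
        self.offline = set()

    def record(self, devid, kind, offset):
        """Log one error; return True if the device is now offline."""
        self.errors[devid].append((kind, offset))
        if len(self.errors[devid]) >= self.threshold:
            self.offline.add(devid)
        return devid in self.offline
```

A production policy would likely weigh error kinds differently and age out old entries; the flat count here is only the simplest possible threshold.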
There should be an offline application that tries to rehab a drive by overwriting the bad sectors and exercising it.
If a checksum failure or missed write is corrected by reading from an alternate mirror, the extent should be scheduled for relocation to correct the bad data.
Hot spare support
It should be possible to add a drive and flag it as a hot spare.
A background scrubbing process should kick in while the drive is idle and verify checksums on both file data and metadata.
Online raid config
Raid allocation options need to be configurable per snapshot, per subvolume and per file. It should also be possible to set some IO parameters on a directory and have all files inside that directory inherit the config.
btrfs-vol -b should take a parameter to change the raid config as it rebalances.
Different sector sizes
The extent_io code makes some assumptions about the page size and the underlying FS sectorsize or blocksize. These need to be cleaned up, especially for the extent_buffer code.
The HDD industry currently supports 512 byte blocks. We can expect HDDs in the future to support 4K byte blocks.
extent_io state ranges
The extent_io locking code works on [start, end] tuples. This should be changed to [start, length] tuples and all users should be updated.
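The conversion between the two conventions is small but easy to get wrong by one. The helpers below assume the `end` offset is inclusive (the last byte of the range); the function names are made up for the example.

```python
def to_length(start, end):
    """Convert an inclusive [start, end] range to (start, length)."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return start, end - start + 1


def to_end(start, length):
    """Convert (start, length) back to an inclusive [start, end] range."""
    if length <= 0:
        raise ValueError("length must be positive")
    return start, start + length - 1
```

For instance, the first 4K page is `[0, 4095]` in the inclusive form and `(0, 4096)` in the length form.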
Limiting btree failure domains
One way to limit the time required for filesystem rebuilds is to limit the number of places that might reference a given area of the drive. One recent discussion of these issues is Val Henson's chunkfs work.
There are a few cases where chunkfs still breaks down to O(N) rebuild, and without making completely separate filesystems it is very difficult to avoid these in general. Btrfs has a number of options that fit in the current design:
- Tie chunks of space to Subvolumes and snapshots of those subvolumes
- Tie chunks of space to tree roots inside a subvolume
But these still allow a single large subvolume to span huge amounts of data. We can either place responsibility for limiting failure domains on the admin, or we can implement a key range restriction on allocations.
The key range restriction would consist of a new btree that provided very coarse grained indexing of chunk allocations against key ranges. This could provide hints to fsck in order to limit consistency checking.
(Currently being developed by cmason)
- Introduce semantic checks for filesystem data structures
- Limit memory usage by triggering a back-reference only mode when too many extents are pending
- Add verification of the generation number in btree metadata
Online fsck includes a number of difficult decisions around races and coherency. Given that back references allow us to limit the total amount of memory required to verify a data structure, we should consider simply implementing fsck in the kernel.
Content based storage
Content based storage would index data extents by a large (at least 256-bit) checksum of the data contents. This index would be stored in one or more dedicated btrees, and new file writes would be checked to see if they match extents already in the content btree.
There are a number of use cases where even large hashes can have security implications, and content based storage is not suitable for use by default. Options to mitigate this include verifying contents of the blocks before recording them as a duplicate (which would be very slow) or simply not using this storage mode.
There are some use cases where verifying equality of the blocks may have an acceptable performance impact. If hash collisions are recorded, it may be possible to later use idle time on the disks to verify equality. It may also be possible to verify equality immediately if another instance of the file is cached. For example, in the case of a mass web host, there are likely to be many identical instances of common software, and constant use is likely to keep these files cached. In that case, not only would disk space be saved, it may also be possible for a single instance of the data in cache to be used by all instances of the file.
If hashes match, a reference is taken against the existing extent instead of creating a new one.
If the checksum isn't already indexed, a new extent is created and the content tree takes a reference against it.
When extents are freed, if the checksum tree is the last reference holder, the extent is either removed from the checksum tree or kept for later use (configurable).
Another configurable is reading the existing block to compare with any matches or just trusting the checksum.
This work is related to stripe granular IO, which would make it possible to configure the size of the extent indexed.
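The write and free paths described in this section can be modeled with a small in-memory index. The sketch uses SHA-256 as the 256-bit checksum and a dictionary in place of the content btree; both choices, along with the class and method names, are assumptions. It also simplifies the reference counting: the index itself does not hold a separate reference, so an extent is dropped as soon as its last user frees it.

```python
import hashlib


class ContentIndex:
    def __init__(self, verify=True):
        self.verify = verify  # compare contents on a match, or trust the hash
        self.extents = {}     # digest -> [data, reference count]

    def write(self, data):
        """Store data; return (digest, True) if a new extent was created."""
        digest = hashlib.sha256(data).digest()
        entry = self.extents.get(digest)
        if entry is None:
            self.extents[digest] = [data, 1]
            return digest, True
        if self.verify and entry[0] != data:
            raise RuntimeError("hash collision: contents differ")
        entry[1] += 1  # duplicate: take a reference on the existing extent
        return digest, False

    def free(self, digest):
        """Drop one reference; remove the extent when none remain."""
        entry = self.extents[digest]
        entry[1] -= 1
        if entry[1] == 0:
            del self.extents[digest]
```

The `verify` flag corresponds to the configurable choice between reading the existing block to compare it and simply trusting the checksum.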
Generation numbers in the btree allow efficient walking to discover extents that have changed since a given transaction. This information needs to be exported to userland in two different ways:
- A simple list of files and directories that have been updated since the generation requested. This should be suitable for feeding into rsync.
- A key range system for synchronizing between two mounts. This will end up looking a lot like crfs.
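The reason generation numbers make this walk efficient is that, with copy-on-write, a node that has not been rewritten since the requested generation cannot have newer descendants, so whole subtrees can be skipped. The toy tree below illustrates the pruning; the `Node` layout and function name are assumptions and do not mirror the on-disk format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    generation: int
    items: list = field(default_factory=list)     # leaf payload: (name, generation)
    children: list = field(default_factory=list)  # child nodes, if any


def changed_since(node, generation):
    """Return names of items modified after the given generation."""
    if node.generation <= generation:
        return []  # untouched subtree: nothing below can be newer
    found = [name for name, gen in node.items if gen > generation]
    for child in node.children:
        found.extend(changed_since(child, generation))
    return found
```

The resulting list is the kind of output that could be fed to a tool such as rsync.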
Disk format issues: The generation number in the inode is the generation number for file creation. This should be used in the traditional NFS generation number checking. A new transaction id needs to be added for the last transaction to change the inode.
(Li Zefan has submitted patches for this)
gzip is slow, and can overwhelm the CPU, making that the bottleneck instead of the storage device.
LZO is a much faster algorithm which, even though it compresses less, would allow almost everybody to enable compression.