Btrfs currently has a sequence number the NFS server can use to detect changes in the file. This should be wired into the NFS support code.
Device IO Priorities
The disk format includes fields to record different performance characteristics of the device. This should be honored during allocations to find the most appropriate device. The allocator should also prefer to allocate from less busy drives to spread IO out more effectively.
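The "prefer less busy drives" heuristic can be sketched in a few lines. This is an illustrative userspace sketch only: the `dev_info` fields and the notion of an `inflight` counter are assumptions for the example, not existing btrfs structures.

```c
#include <stdint.h>

/* Illustrative allocator hint: prefer the least-busy device that still
 * has room. The fields and the "inflight" counter are assumptions for
 * this sketch, not existing btrfs structures. */
struct dev_info {
        uint64_t free_bytes;
        uint32_t inflight;      /* IOs currently queued to the device */
};

/* Return the index of the least-busy device with at least `need`
 * bytes free, or -1 if nothing fits. */
static int pick_device(const struct dev_info *devs, int n, uint64_t need)
{
        int best = -1;

        for (int i = 0; i < n; i++) {
                if (devs[i].free_bytes < need)
                        continue;
                if (best < 0 || devs[i].inflight < devs[best].inflight)
                        best = i;
        }
        return best;
}
```

A real allocator would also weigh the per-device performance fields from the disk format, not just queue depth.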
(under development by Jan Schmidt and Arne Jansen)
It should be possible to use very fast devices as a front end to traditional storage. The most used blocks should be kept hot on the fast device and they should be pushed out to slower storage in large sequential chunks.
The latest generation of SSD drives can achieve high iops rates at both reading and writing. They will be very effective front end caches for slower (and less expensive) spinning media. A caching layer could be added to Btrfs to store the hottest blocks on faster devices to achieve better read and write throughput. This cache could also make use of other spindles in the existing spinning storage; for example, frequently used, random-heavy data could be mirrored on all drives if space is available. A similar mechanism could allow frequent random read patterns (such as booting a system) to be stored as a series of sequential blocks in this cache.
(Under development by cmason)
The multi-device code needs a raid6 implementation, and perhaps a raid5 implementation. This involves a few different topics including extra code to make sure writeback happens in stripe sized widths, and stripe aligning file allocations.
Metadata blocks are currently clustered together, but extra code will be needed to ensure stripe efficient IO on metadata. Another option is to leave metadata in raid1 mirrors and only do raid5/6 for data.
The existing raid0/1/10 code will need minor refactoring to provide dedicated chunk allocation and lookup functions per raid level. It currently happens via a collection of if statements.
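The stripe-width requirement above is just alignment arithmetic: a write avoids a read-modify-write cycle only when it covers whole stripes. A minimal sketch, with an illustrative geometry (the stripe length and device count here are example values, not btrfs defaults):

```c
#include <stdint.h>

/* Hypothetical stripe geometry: values are illustrative, not btrfs's. */
#define STRIPE_LEN   (64 * 1024)   /* bytes per device per stripe */
#define NR_DATA_DEVS 4             /* data devices in the raid5/6 set */
#define STRIPE_WIDTH ((uint64_t)STRIPE_LEN * NR_DATA_DEVS)

/* Round an allocation request down/up to a full stripe boundary. */
static uint64_t stripe_align_down(uint64_t offset)
{
        return offset - (offset % STRIPE_WIDTH);
}

static uint64_t stripe_align_up(uint64_t offset)
{
        return stripe_align_down(offset + STRIPE_WIDTH - 1);
}

/* A write avoids read-modify-write only if it covers whole stripes. */
static int is_full_stripe_write(uint64_t start, uint64_t len)
{
        return start % STRIPE_WIDTH == 0 && len % STRIPE_WIDTH == 0;
}
```

Writeback and file allocation would both use helpers like these to batch IO into full-stripe units.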
IO stripe size and other optimizations
The multi-device code includes a number of IO parameters that do not currently get used. These need tunables from userland and they need to be honored by the allocation and IO routines.
Balancing status reports
ioctls are needed to fetch the status of current balancing operations.
(HugoMills has posted patches for this)
IO Error recording
Items should be inserted into the device tree to record the location and frequency of IO errors, including checksumming errors and misplaced writes. Devices should be taken offline after they reach a given threshold.
There should be an offline application that tries to rehab a drive by overwriting the bad sectors and exercising it.
If a checksum failure or misplaced write is corrected by reading from an alternate mirror, the extent should be scheduled for relocation to correct the bad data.
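The per-device error accounting and offline threshold described above might look roughly like this. Everything here is illustrative: the record fields and the threshold value are assumptions for the sketch, not the on-disk device tree format.

```c
#include <stdint.h>

/* Illustrative only: per-device error counters and an offline
 * threshold. Field names and the threshold value are assumptions,
 * not btrfs's disk format. */
struct device_error_record {
        uint64_t read_errors;
        uint64_t write_errors;
        uint64_t csum_errors;
        uint64_t misplaced_writes;
};

#define DEV_ERROR_THRESHOLD 64  /* arbitrary example value */

/* Decide whether a device has accumulated enough errors to be
 * taken offline (and, with hot spare support, replaced). */
static int device_should_go_offline(const struct device_error_record *rec)
{
        uint64_t total = rec->read_errors + rec->write_errors +
                         rec->csum_errors + rec->misplaced_writes;
        return total >= DEV_ERROR_THRESHOLD;
}
```

A real policy would likely weight error types differently and decay counts over time rather than using a flat lifetime total.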
Hot spare support
It should be possible to add a drive and flag it as a hot spare.
A background scrubbing process should kick in while the drive is idle and verify checksums on both file data and metadata.
Online raid config
Raid allocation options need to be configurable per snapshot, per subvolume and per file. It should also be possible to set some IO parameters on a directory and have all files inside that directory inherit the config.
btrfs-vol -b should take a parameter to change the raid config as it rebalances.
Different sector sizes
The extent_io code makes some assumptions about the page size and the underlying FS sectorsize or blocksize. These need to be cleaned up, especially for the extent_buffer code.
The HDD industry currently uses 512-byte blocks; we can expect future HDDs to use 4KB blocks.
extent_io state ranges
The extent_io locking code works on [start, end] tuples. This should be changed to [start, length] tuples and all users should be updated.
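The difference between the two conventions is small but easy to get wrong: with an inclusive [start, end] tuple, length is end - start + 1 and an empty range cannot be represented. A sketch of the conversions (struct names are illustrative):

```c
#include <stdint.h>
#include <assert.h>

/* With [start, end] the end offset is inclusive, so
 * length = end - start + 1, and a zero-length range has no
 * representation. [start, len] makes both conversions explicit. */
struct range_end { uint64_t start, end; };   /* inclusive end */
struct range_len { uint64_t start, len; };

static struct range_len to_len(struct range_end r)
{
        return (struct range_len){ r.start, r.end - r.start + 1 };
}

static struct range_end to_end(struct range_len r)
{
        assert(r.len > 0);  /* [start, end] cannot express empty */
        return (struct range_end){ r.start, r.start + r.len - 1 };
}
```

Converting all users in one pass avoids a long period where off-by-one bugs can hide at the boundary between the two conventions.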
Limiting btree failure domains
One way to limit the time required for filesystem rebuilds is to limit the number of places that might reference a given area of the drive. One recent discussion of these issues is Val Henson's chunkfs work.
There are a few cases where chunkfs still breaks down to O(N) rebuild, and without making completely separate filesystems it is very difficult to avoid these in general. Btrfs has a number of options that fit in the current design:
- Tie chunks of space to Subvolumes and snapshots of those subvolumes
- Tie chunks of space to tree roots inside a subvolume
But these still allow a single large subvolume to span huge amounts of data. We can either place responsibility for limiting failure domains on the admin, or we can implement a key range restriction on allocations.
The key range restriction would consist of a new btree that provided very coarse grained indexing of chunk allocations against key ranges. This could provide hints to fsck in order to limit consistency checking.
(Currently being developed by cmason)
- Introduce semantic checks for filesystem data structures
- Limit memory usage by triggering a back-reference only mode when too many extents are pending
- Add verification of the generation number in btree metadata
Online fsck includes a number of difficult decisions around races and coherency. Given that back references allow us to limit the total amount of memory required to verify a data structure, we should consider simply implementing fsck in the kernel.
Content based storage
Content based storage would index data extents by a large (256bit at least) checksum of the data contents. This index would be stored in one or more dedicated btrees and new file writes would be checked to see if they matched extents already in the content btree.
There are a number of use cases where even large hashes can have security implications, and content based storage is not suitable for use by default. Options to mitigate this include verifying contents of the blocks before recording them as a duplicate (which would be very slow) or simply not using this storage mode.
There are some use cases where verifying equality of the blocks may have an acceptable performance impact. If hash collisions are recorded, it may be possible to later use idle time on the disks to verify equality. It may also be possible to verify equality immediately if another instance of the file is cached. For example, in the case of a mass web host, there are likely to be many identical instances of common software, and constant use is likely to keep these files cached. In that case, not only would disk space be saved, it may also be possible for a single instance of the data in cache to be used by all instances of the file.
If hashes match, a reference is taken against the existing extent instead of creating a new one.
If the checksum isn't already indexed, a new extent is created and the content tree takes a reference against it.
When extents are freed, if the checksum tree is the last reference holder, the extent is either removed from the checksum tree or kept for later use (configurable).
Another configurable is reading the existing block to compare with any matches or just trusting the checksum.
This work is related to stripe granular IO, which would make it possible to configure the size of the extent indexed.
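The lookup-or-insert flow described above can be sketched with a toy in-memory index. A real implementation would use a 256-bit content hash and a dedicated btree; here a small array and an opaque hash stand in, and all names are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Toy sketch of the content-based write path. A real implementation
 * would use a 256-bit hash and one or more btrees; a flat array
 * stands in here. All names are illustrative. */
#define HASH_LEN    32   /* 256-bit content hash */
#define MAX_EXTENTS 128

struct content_entry {
        uint8_t  hash[HASH_LEN];
        uint64_t extent_start;   /* logical address of the extent */
        uint32_t refs;           /* refs, including the tree's own */
};

static struct content_entry table[MAX_EXTENTS];
static int nr_entries;

/* Return the extent to reference for this content, inserting a new
 * entry (with the content tree's own reference) on a miss. */
static uint64_t content_lookup_or_insert(const uint8_t hash[HASH_LEN],
                                         uint64_t new_extent)
{
        for (int i = 0; i < nr_entries; i++) {
                if (memcmp(table[i].hash, hash, HASH_LEN) == 0) {
                        table[i].refs++;        /* dedup hit */
                        return table[i].extent_start;
                }
        }
        struct content_entry *e = &table[nr_entries++];
        memcpy(e->hash, hash, HASH_LEN);
        e->extent_start = new_extent;
        e->refs = 2;    /* one for the file, one for the content tree */
        return new_extent;
}
```

The configurable behaviors above slot in around this core: an optional byte-compare of the candidate extent before taking the reference, and a choice of dropping or keeping the entry when the content tree becomes the last reference holder.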
(Li Zefan has submitted patches for this and they are now merged into mainline)
gzip is slow, and can overwhelm the CPU, making that the bottleneck instead of the storage device.
LZO is a much faster algorithm which, even though it compresses less, would allow almost everybody to enable compression.
Block group reclaim
The split between data and metadata block groups means that we sometimes have mostly empty block groups dedicated to only data or metadata. As files are deleted, we should be able to reclaim these and put the space back into the free space pool.
We also need rebalancing ioctls that focus only on specific raid levels.
RBtree lock contention
Btrfs uses a number of rbtrees to index in-memory data structures. Some of these are dominated by reads, and the lock contention from searching them is showing up in profiles. We need to look into an RCU and sequence counter combination to allow lockless reads.
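The sequence counter half of that combination follows a standard pattern: writers bump a counter to an odd value while updating, and readers retry if the counter was odd or changed under them. A single-threaded userspace sketch (the kernel provides this as seqlock_t with read_seqbegin()/read_seqretry(); real concurrent code also needs memory barriers, which are omitted here):

```c
#include <stdint.h>

/* Sketch of the seqcount pattern. Real concurrent code needs memory
 * barriers/atomics; this is the control flow only. */
struct seq_protected {
        volatile uint32_t seq;
        uint64_t value;          /* stands in for an rbtree result */
};

static void writer_update(struct seq_protected *s, uint64_t v)
{
        s->seq++;                /* odd: update in progress */
        s->value = v;
        s->seq++;                /* even: update complete */
}

static uint64_t reader_read(struct seq_protected *s)
{
        uint32_t start;
        uint64_t v;

        do {
                while ((start = s->seq) & 1)
                        ;        /* writer active, wait */
                v = s->value;
        } while (s->seq != start);
        return v;
}
```

RCU would protect the rbtree nodes themselves so that readers never dereference freed memory while retrying.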
Forced readonly mounts on errors
(liubo has submitted patches for the first step [framework] and they are now merged into mainline)
The sources have a number of BUG() statements that could easily be replaced with code to force the filesystem readonly. This is the first step toward being more tolerant of disk corruption: add a framework for generating errors that should result in the filesystem going readonly, then convert the BUG() calls to that framework incrementally.
Dedicated metadata drives
(Under development by Arne Jansen)
We're able to split data and metadata IO very easily. Metadata tends to be dominated by seeks and for many applications it makes sense to put the metadata onto faster SSDs.
(Li Zefan has submitted patches for this and they are now merged into mainline)
Btrfs snapshots are read/write by default. A small number of checks would allow us to make readonly snapshots instead.
Per file / directory controls for COW and compression
Data compression and data cow are controlled across the entire FS by mount options right now. ioctls are needed to set this on a per file or per directory basis. This has been proposed previously, but VFS developers wanted us to use generic ioctls rather than btrfs-specific ones. Can we use some of the same ioctls that ext4 uses? This task is mostly organizational rather than technical.
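For the cow side, the generic inode-flags ioctl pair that ext4 and other filesystems use is FS_IOC_GETFLAGS/FS_IOC_SETFLAGS, and linux/fs.h defines an FS_NOCOW_FL bit. A sketch of setting it on a file (whether a given filesystem honors the flag, and under what conditions, is up to that filesystem):

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

/* Set the NOCOW flag via the generic inode-flags ioctl.
 * FS_IOC_GETFLAGS/FS_IOC_SETFLAGS and FS_NOCOW_FL come from
 * linux/fs.h; honoring FS_NOCOW_FL is up to the filesystem. */
static int set_nocow(const char *path)
{
        int fd = open(path, O_RDONLY);
        int flags, ret = -1;

        if (fd < 0)
                return -1;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
                flags |= FS_NOCOW_FL;
                ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);
        }
        close(fd);
        return ret;
}
```

Per-directory inheritance would then fall out naturally, since new files typically inherit inode flags from their parent directory.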
Chunk tree backups
The chunk tree is critical to mapping logical block numbers to physical locations on the drive. We need to make the mappings discoverable via a block device scan so that we can recover from corrupted chunk trees.
Now that we have code to efficiently find newly updated files, we need to tie it into tools such as rsync and dirvish. (For bonus points, we can even tell rsync _which blocks_ inside a file have changed. Would need to work with the rsync developers on that one.)
Not exactly an implementation of this, but I (A. v. Bidder) have patched dirvish to use btrfs snapshots instead of hardlinked directory trees. Discussion archived on the Dirvish mailing list.
Atomic write API
The Btrfs implementation of data=ordered only updates metadata to point to new data blocks when the data IO is finished. This makes it easy for us to implement atomic writes of an arbitrary size. Some hardware is coming out that can support this down in the block layer as well.
Backref walking utilities
Given a block number on a disk, the Btrfs metadata can find all the files and directories that use or care about that block. Some utilities to walk these back refs and print the results would help debug corruptions.
Given an inode, the Btrfs metadata can find all the directories that point to the inode. We should have utils to walk these back refs as well.
We need a periodic daemon that can walk the filesystem and verify that the contents of all copies of all allocated blocks are correct. This is mostly equivalent to "find | xargs cat >/dev/null", but with the constraint that we don't want to thrash the page cache, so direct I/O should be used instead.
If we find a bad copy during this process, and we're using RAID, we should queue up an overwrite of the bad copy with a good one. The overwrite can happen in-place.
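The cache-friendly read pass can be done from userspace with O_DIRECT. A sketch of reading one file that way (the 4096-byte alignment is a safe common value here, but a real tool should query the device's logical block size):

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read a file with O_DIRECT so the scrub pass does not pollute the
 * page cache. Direct IO needs the buffer aligned to the logical
 * block size; 4096 is assumed here for the sketch. */
static int scrub_read_file(const char *path)
{
        const size_t align = 4096, bufsz = 1 << 20;
        void *buf;
        int fd, ret = 0;

        if (posix_memalign(&buf, align, bufsz))
                return -1;
        fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) {
                free(buf);
                return -1;
        }
        while (1) {
                ssize_t n = read(fd, buf, bufsz);
                if (n < 0)
                        ret = -1;       /* an IO/csum error to record */
                if (n <= 0)
                        break;
        }
        close(fd);
        free(buf);
        return ret;
}
```

A kernel-side scrubber could do better still, since it can walk extents directly and verify every mirror rather than whichever copy the read path happens to choose.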
Right now when we replace a drive, we do so with a full FS balance. If we are inserting a new drive to remove an old one, we can do a much less expensive operation where we just put valid copies of all the blocks onto the new drive.
IO error tracking
As we get bad csums or IO errors from drives, we should track the failures and kick out the drive if it is clearly going bad.
Random write performance
Random writes introduce small extents and fragmentation. We need new file layout code to improve this and defrag the files as they are being changed.
Free inode number cache
(Under development by Li Zefan)
As the filesystem fills up, finding a free inode number will become expensive. This should be cached the same way we do free blocks.
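Caching free inode numbers the same way as free blocks amounts to keeping a bitmap over a window of the inode number space. A toy sketch (sizes and names are illustrative, not the actual free-space-cache format):

```c
#include <stdint.h>

/* Toy bitmap cache of free inode numbers over one contiguous
 * window, analogous to the free block cache. All sizes and names
 * are illustrative. */
#define CACHE_BITS 1024

struct ino_cache {
        uint64_t base;                    /* first ino covered */
        uint64_t map[CACHE_BITS / 64];    /* bit set = in use */
};

static void ino_cache_mark_used(struct ino_cache *c, uint64_t ino)
{
        uint64_t bit = ino - c->base;

        c->map[bit / 64] |= 1ULL << (bit % 64);
}

/* Return a free inode number from the cached window, or 0 when the
 * window is exhausted (a real implementation would then fall back
 * to a slow btree search and load a new window). */
static uint64_t ino_cache_alloc(struct ino_cache *c)
{
        for (uint64_t bit = 0; bit < CACHE_BITS; bit++) {
                if (!(c->map[bit / 64] & (1ULL << (bit % 64)))) {
                        c->map[bit / 64] |= 1ULL << (bit % 64);
                        return c->base + bit;
                }
        }
        return 0;
}
```

The cache would be populated once per window by scanning the inode items in the btree, after which allocations stay in memory.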
Snapshot aware defrag
As we defragment files, we break any sharing from other snapshots. The balancing code will preserve the sharing, and defrag needs to grow this as well.
Btree lock contention
The btree locks, especially on the root block can be very hot. We need to improve this, especially in read mostly workloads.
Changing RAID levels
(Under development by Hugo Mills)
We need ioctls to change between different raid levels. Some of these are quite easy -- e.g. for RAID0 to RAID1, we just halve the available bytes on the fs, then queue a rebalance.
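The "halve the available bytes" step generalizes to simple capacity arithmetic per raid level. An illustrative sketch over n equal-sized devices (the RAID1 case assumes two copies regardless of device count, matching btrfs's two-copy RAID1):

```c
#include <stdint.h>

/* Rough usable capacity per raid level over n equal devices of
 * dev_bytes each. Illustrative arithmetic only: e.g. going from
 * RAID0 to RAID1 halves the available bytes. */
enum raid_level { RAID0, RAID1, RAID10, RAID5, RAID6 };

static uint64_t usable_bytes(enum raid_level level, int n,
                             uint64_t dev_bytes)
{
        switch (level) {
        case RAID0:  return (uint64_t)n * dev_bytes;
        case RAID1:  return (uint64_t)n * dev_bytes / 2;  /* 2 copies */
        case RAID10: return (uint64_t)n * dev_bytes / 2;
        case RAID5:  return (uint64_t)(n - 1) * dev_bytes;
        case RAID6:  return (uint64_t)(n - 2) * dev_bytes;
        }
        return 0;
}
```

The conversion ioctl would check that the target level's usable bytes still cover the allocated data before queuing the rebalance.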
For SSDs with discard support, we could use a scrubber that goes through the fs and performs discard on anything that is unused. You could first use the balance operation to compact data to the front of the drive, then discard the rest.
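The second half of that idea already has a generic kernel interface: the FITRIM ioctl (the mechanism behind fstrim(8)) asks the filesystem to discard unused space in a given range. A sketch of trimming a whole mounted filesystem:

```c
#include <stdint.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

/* Discard all unused space on a mounted filesystem via FITRIM.
 * struct fstrim_range and FITRIM come from linux/fs.h. */
static int trim_fs(const char *mountpoint, uint64_t minlen)
{
        struct fstrim_range range = {
                .start  = 0,
                .len    = UINT64_MAX,  /* whole filesystem */
                .minlen = minlen,      /* skip runs smaller than this */
        };
        int fd = open(mountpoint, O_RDONLY);
        int ret;

        if (fd < 0)
                return -1;
        ret = ioctl(fd, FITRIM, &range);
        close(fd);
        return ret;
}
```

Running a balance first, as suggested above, compacts data toward the front of the device so that the trim pass covers larger contiguous unused runs.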