Zoned
OBSOLETE CONTENT
This wiki has been archived and the content is no longer updated.
Contents |
Overview
Since version 5.12 btrfs supports zoned mode. This is a special on-disk format and allocation/write strategy that's friendly to zoned devices. In short, a device is partitioned into fixed-size zones and each zone can be updated by append-only manner, or reset. As btrfs has no fixed data structures, except the super blocks, the zoned mode only requires block placement that follows the device constraints. You can learn about the whole architecture at https://zonedstorage.io.
Because of the append-only write, several btrfs features are not supported:
- nodatacow -- overwrite in-place
- fallocate -- preallocating space for in-place first write
- mixed-bg -- unordered writes to data and metadata
- booting -- resetting the first zone after superblock rotation would remove the boot loader data
The above is really incompatible because it expects an in-place write at some point. There are some features that are not part of the first implementation but is planned to be implemented in the future:
- only 'single' profile works, TBD: synchronize zone write pointers among the block groups
- fitrim, TBD: do not depend on free space cache
Creating a zoned filesystem needs btrfs-progs 5.12. Proper detection of zoned devices with btrfs needs util-linux 2.38+
Status:
- 5.12
- first release
- 5.13
- background zone reclaim added
Devices
Real hardware
The WD Ultrastar series 600 advertises HM-SMR, ie. the host-managed zoned mode. There are two more: DA (device managed, no zoned information exported to the system), HA (host aware, can be used as regular disk but zoned writes improve performance). There are not many devices available at the moment, the information about exact zoned mode is hard to find, check data sheets or community sources gathering information from real devices.
Note: zoned mode won't work with DM-SMR disks.
- Ultrastar® DC ZN540 NVMe ZNS SSD (product brief)
Emulated: null_blk
The driver null_blk provides memory backed device and is suitable for testing. There are some quirks setting up the devices. The module must be loaded with nr_devices=0 or the numbering of device nodes will be offset. The configfs must be mounted at /sys/kernel/config and the administration of the null_blk devices is done in /sys/kernel/config/nullb. The device nodes are named like /dev/nullb0 and are numbered sequentially. NOTE: the device name may be different than the named directory in sysfs!
Setup:
modprobe configfs modprobe null_blk nr_devices=0
Create a device mydev, assuming no other previously created devices, size is 2048MiB, zone size 256MiB. There are more tunable parameters, this is a minimal example taking defaults:
cd /sys/kernel/config/nullb/ mkdir mydev cd mydev echo 2048 > size echo 1 > zoned echo 1 > memory_backed echo 256 > zone_size echo 1 > power
This will create a device /dev/nullb0 and the value of file index will match the ending number of the device node.
Remove the device:
rmdir /sys/kernel/config/nullb/mydev
Then continue with mkfs.btrfs /dev/nullb0, the zoned mode is auto-detected.
For convenience, there's a script wrapping the basic null_blk management operations https://github.com/kdave/nullb.git
Emulated: TCMU runner
TCMU is a framework to emulate SCSI devices in userspace, providing various backends for the storage, with zoned support as well. A file-backed zoned device can provide more options for larger storage and zone size. Pleae follow the instructions at https://zonedstorage.io/projects/tcmu-runner/ .
Compatiblity, incompatibility
- the feature sets an incompat bit and requires new kernel to access the filesystem (for both read and write)
- superblock needs to be handled in a special way, there are still 3 copies but at different offsets (0, 512GiB, 4TiB) and the 2 consecutive zones are a ring buffer of the superblocks, finding the latest one needs read it from the write pointer or do a full scan of the zones
- mixing zoned and non zoned devices is possible (zones are emulated) but is recommended only for testing
- mixing zoned devices with different zone sizes is not possible
- zone sizes must be power of two, zone sizes of real devices are eg. 256MiB or 1GiB, larger size is expected, maximum zone size supported by btrfs is 8GiB
Status, stability, reporting bugs
The zoned mode has been released in 5.12 and there are still some rough edges and corner cases one can hit during testing. Please report bugs to https://github.com/naota/linux/issues/ .
References
- https://zonedstorage.io
- https://zonedstorage.io/projects/libzbc/ -- libzbc is library and set of tools to directly manipulate devices with ZBC/ZAC support
- https://zonedstorage.io/projects/libzbd/ -- libzbd uses the kernel provided zoned block device interface based on the ioctl() system calls
- https://hddscan.com/blog/2020/hdd-wd-smr.html -- some details about exact device types
- https://lwn.net/Articles/853308/ -- Btrfs on zoned block devices
- https://www.usenix.org/conference/vault20/presentation/bjorling -- Zone Append: A New Way of Writing to Zoned Storage