Due to its copy-on-write nature, BTRFS is able to copy files (e.g. with cp --reflink) or subvolumes (with btrfs subvolume snapshot) without actually copying the data. New copies of the data are created only when one of the files or subvolumes is updated, and only for the parts that change.
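As a rough illustration of what a reflink copy does under the hood, the following Python sketch issues the Linux FICLONE ioctl and falls back to an ordinary byte copy when the filesystem cannot share extents. The helper name reflink_or_copy is hypothetical, and the sketch assumes a Linux system; it is not how cp itself is implemented.

```python
import fcntl
import shutil

# FICLONE from <linux/fs.h>: _IOW(0x94, 9, int)
FICLONE = 0x40049409


def reflink_or_copy(src, dst):
    """Make dst a copy of src, sharing extents when the filesystem allows it.

    Returns "reflink" if the clone ioctl succeeded, otherwise "copy".
    (Hypothetical helper, for illustration only.)
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Ask the kernel to make dst share all of src's extents.
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return "reflink"
        except OSError:
            # Not supported here (e.g. ext4, tmpfs): copy the bytes instead.
            shutil.copyfileobj(fsrc, fdst)
            return "copy"
```

On a BTRFS mount the call should report "reflink" and complete almost instantly regardless of file size; elsewhere it degrades to a normal copy.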
Deduplication takes this a step further by actively identifying when the same data has been written twice, and retrospectively combining the copies into a single extent with the same copy-on-write semantics.
Out of band / batch deduplication is deduplication done outside of the write path. We've sometimes called it offline deduplication, but that can confuse people: btrfs dedup involves the kernel and always happens on mounted filesystems. To use out-of-band deduplication, you run a tool which searches your filesystem for identical blocks, and then deduplicates them.
Dedicated btrfs deduplicators
Duperemove is a simple tool for finding duplicated extents and submitting them for deduplication. When given a list of files it will hash their contents on a block by block basis and compare those hashes to each other, finding and categorizing blocks that match each other. When given the -d option, duperemove will submit those extents for deduplication using the Linux kernel extent-same ioctl.
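The block-by-block hashing described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not duperemove's actual code: the function name, the SHA-256 hash and the 128 KiB default block size are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 128 * 1024  # assumed block size for this sketch


def find_duplicate_blocks(paths, block_size=BLOCK_SIZE):
    """Return groups of (path, offset) pairs whose blocks hash identically.

    Only full-sized blocks are considered; a short tail block is skipped.
    """
    seen = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            offset = 0
            while True:
                block = f.read(block_size)
                if not block:
                    break
                if len(block) == block_size:
                    digest = hashlib.sha256(block).digest()
                    seen[digest].append((path, offset))
                offset += len(block)
    # Keep only hashes that occur at more than one location.
    return [locations for locations in seen.values() if len(locations) > 1]
```

Matching hashes only nominate candidates. When the candidates are submitted, the kernel compares the ranges byte for byte before sharing them, so a hash collision cannot corrupt data.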
bedup implements incremental whole-file batch deduplication for Btrfs. It integrates deeply with btrfs so that scans are incremental and low-impact. It uses the btrfs clone ioctl to do the deduplication, rather than the extent-same ioctl, due to concerns regarding kernel crashes with the latter as of kernel 4.2.
btrfs-dedupe is a Rust library which implements incremental whole-file batch deduplication for Btrfs. It is written in Rust for safety and performance, and offloads the actual deduplication to the kernel through the kernel ioctls. It maintains state for efficient regular operation: on every run it scans file metadata, hashes the contents of files with new metadata, hashes the file extent maps of files with new contents, and defragments and deduplicates files with matching content but non-matching extent maps.
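The first two steps of that incremental scheme (check metadata on every run, rehash only files whose metadata changed) might look roughly like the Python sketch below. It is an assumption-laden illustration, not btrfs-dedupe's implementation: the state layout, the use of size and mtime as "metadata", and the function name scan are all invented for the example, and the extent-map hashing step is left out.

```python
import hashlib
import os


def scan(paths, state):
    """Refresh state in place and return the paths that had to be rehashed.

    state maps path -> ((size, mtime_ns), content_digest). A file is
    rehashed only when its size or modification time differs from the
    cached values.
    """
    rehashed = []
    for path in paths:
        st = os.stat(path)
        meta = (st.st_size, st.st_mtime_ns)
        cached = state.get(path)
        if cached is not None and cached[0] == meta:
            continue  # metadata unchanged: trust the stored digest
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        state[path] = (meta, digest)
        rehashed.append(path)
    return rehashed
```

After a scan, files that share a digest are whole-file deduplication candidates, and a second run over unchanged files does no hashing at all.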
bees is a block-oriented userspace dedup agent, designed to avoid scalability problems on large filesystems. It works as a daemon and does not store any information about filesystem structure (it keeps only some extent info and hashes in a simple mmap'ed database), retrieving such information on demand through the btrfs SEARCH_V2 and LOGICAL_INO ioctls. It is very useful on large storage systems that are not write-heavy, such as backup servers. See more info at Bees GitHub.
dduper is a block-level offline deduplication tool. It relies on the built-in BTRFS csum-tree: by reusing the checksums stored there, dduper avoids CPU-intensive operations such as fetching each file data block and computing its checksum. It is pretty fast; for example, dduper took 13.8 seconds to dedupe two 10GB files with the same data.
Duplicate file finders with btrfs support
While any duplicate file finder utility (e.g. fdupes, fslint) can find files for deduplication with another tool (e.g. duperemove), the following duplicate file finders have built-in btrfs deduplication capabilities:
rmlint is a duplicate file finder with btrfs support. To find and reflink duplicate files:
$ rmlint -T df --config=sh:handler=clone [paths...]
# finds duplicates under paths and creates a batch file 'rmlint.sh' for post-processing
# ...review contents of rmlint.sh, then:
$ ./rmlint.sh
# clones/reflinks duplicates (if possible)
Note that if reflinking read-only snapshots, rmlint.sh must be run with the -r option and with root privileges, e.g.:
$ sudo ./rmlint.sh -r
jdupes is a fork of fdupes which includes support for BTRFS deduplication when it identifies duplicate files.
Now that the deduplication ioctl has been lifted to the VFS layer, rather than being a BTRFS-specific function, deduplication functionality can be implemented in a filesystem-independent way.
As such, xfs_io is able to perform deduplication on a BTRFS file system, and provides a simple way to invoke the deduplication function from the command line on any filesystem which supports the ioctl.
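For readers curious what such a tool submits to the kernel, here is a hedged Python sketch of a single-destination request to the VFS-level FIDEDUPERANGE ioctl. The constant and struct layouts follow linux/fs.h as far as we know; the helper names pack_request and dedupe_range are hypothetical, and the sketch assumes Linux.

```python
import fcntl
import struct

# FIDEDUPERANGE from <linux/fs.h>: _IOWR(0x94, 54, struct file_dedupe_range)
FIDEDUPERANGE = 0xC0189436

# struct file_dedupe_range: src_offset, src_length, dest_count, reserved1, reserved2
HEADER = struct.Struct("=QQHHI")
# struct file_dedupe_range_info: dest_fd, dest_offset, bytes_deduped, status, reserved
INFO = struct.Struct("=qQQiI")


def pack_request(src_offset, length, dest_fd, dest_offset):
    """Build the ioctl argument for one source range and one destination."""
    return bytearray(
        HEADER.pack(src_offset, length, 1, 0, 0)
        + INFO.pack(dest_fd, dest_offset, 0, 0, 0)
    )


def dedupe_range(src_fd, src_offset, length, dest_fd, dest_offset):
    """Ask the kernel to share one range; return (status, bytes_deduped).

    status 0 means the ranges were identical and are now shared, 1 means
    the data differed, and a negative value is a negated errno. Raises
    OSError if the filesystem does not support the ioctl at all.
    """
    buf = pack_request(src_offset, length, dest_fd, dest_offset)
    fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)  # kernel writes results into buf
    _, _, bytes_deduped, status, _ = INFO.unpack_from(buf, HEADER.size)
    return status, bytes_deduped
```

Because the kernel verifies that both ranges hold identical bytes before sharing them, a caller can submit candidates found by hashing without risking data loss.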
Inband / synchronous / inline deduplication is deduplication done in the write path, so it happens as data is written to the filesystem. This typically requires large amounts of RAM to store the lookup table of known block hashes. Patches are currently being worked on.
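To get a feel for why the lookup table is so memory-hungry, a back-of-the-envelope estimate helps. The figures below (4 KiB blocks, a 32-byte digest and an 8-byte block address per entry, no other overhead) are assumptions chosen for the arithmetic, not the parameters of any actual BTRFS patch set.

```python
def hash_table_bytes(data_bytes, block_size=4096, digest_bytes=32, pointer_bytes=8):
    """Estimate the size of a table holding one hash entry per data block.

    All defaults are illustrative assumptions; real designs add overhead
    for the table structure itself or shrink entries to save memory.
    """
    blocks = -(-data_bytes // block_size)  # ceiling division
    return blocks * (digest_bytes + pointer_bytes)


TIB = 2 ** 40
GIB = 2 ** 30
# One TiB of unique data: 2**28 blocks at 40 bytes each.
print(hash_table_bytes(TIB) / GIB)  # prints 10.0
```

Under these assumptions every TiB of unique data costs about 10 GiB of table, which is why inline designs typically limit the table size or use larger blocks.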