Deduplication

From btrfs Wiki

Revision as of 10:41, 18 June 2020

Due to its copy-on-write nature, BTRFS is able to copy files (e.g. with cp --reflink) or subvolumes (with btrfs subvolume snapshot) without actually copying the data. New data is written only when one of the files or subvolumes is modified, and only for the parts that change.
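
As a rough illustration of the reflink copy described above, the following sketch wraps cp in a small helper. The function name reflink_copy is hypothetical, and the sketch assumes GNU coreutils cp. With --reflink=auto the copy shares extents where the filesystem supports it and silently falls back to an ordinary copy elsewhere.

```shell
# reflink_copy SRC DST (hypothetical helper name)
# Copies SRC to DST, sharing extents with SRC when the filesystem allows it.
# --reflink=auto falls back to a normal copy on filesystems without reflink
# support; use --reflink=always to fail instead of falling back.
reflink_copy() {
    cp --reflink=auto -- "$1" "$2"
}
```

Calling reflink_copy big.img big-copy.img on a BTRFS mount should return almost immediately regardless of file size, because no data blocks are duplicated until one of the two files is changed.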

Deduplication takes this a step further, by actively identifying when the same data has been written twice, and retroactively combining the copies into a single shared extent with the same copy-on-write semantics.

Out of band / batch deduplication

Out of band / batch deduplication is deduplication done outside of the write path. We've sometimes called it offline deduplication, but that can confuse people: btrfs dedup involves the kernel and always happens on mounted filesystems. To use out-of-band deduplication, you run a tool which searches your filesystem for identical blocks, and then deduplicates them.

Deduplication in BTRFS is mainly supported by ioctl_fideduperange(2), a compare-and-share operation, although some other tools may use the clone-oriented APIs instead.

There are multiple tools that take different approaches to deduplication, offer additional features or make trade-offs. The following table lists tools that are known to be up-to-date, maintained and widely used. More tools exist, but not all of them meet these criteria, and some have been removed from the table. The projects are third-party; please check their status before you decide to use them.

Batch deduplicators for BTRFS

Name | Block-based | Works on other FS (XFS, OCFS2) | Incremental | Notes
duperemove | Yes | Yes | Yes | SQLite database for csum. Runs by extent boundary by default, but has an option to compare more carefully.
btrfs-dedupe | No | Yes | Yes | Written in Rust. Maintains state with metadata.
bees | Yes | No | Yes | Runs as a daemon. Very light database, useful for large, colder storage such as backup servers. Uses SEARCH_V2 and LOGICAL_INO. Has workarounds for kernel bugs.
dduper | Yes | No | Yes | Uses the built-in BTRFS csum-tree, so it is extremely fast and lightweight (13.8 seconds for identical 10GB files). Requires a btrfs-progs patch for csum access.


Duplicate file finders with btrfs support

While any duplicate file finder utility (e.g. fdupes, fslint, etc.) can find files for deduplication using another tool (e.g. duperemove), the following duplicate file finders have built-in btrfs deduplication capabilities:

  • rmlint is a duplicate file finder with btrfs support. To find and reflink duplicate files:
$ rmlint -T df --config=sh:handler=clone [paths...]   # finds duplicates under paths and creates a batch file 'rmlint.sh' for post-processing
                                                      # ...review contents of rmlint.sh, then:
$ ./rmlint.sh                                         # clones/reflinks duplicates (if possible)

Note that when reflinking read-only snapshots, rmlint.sh must be run with the -r option and with root privileges, e.g.:

$ sudo ./rmlint.sh -r
  • jdupes is a fork of fdupes which includes support for BTRFS deduplication when it identifies duplicate files.
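
The hand-off mentioned at the start of this section, where a generic duplicate finder feeds a separate deduplication tool, can be sketched in shell. The helper below is a minimal stand-in for a real duplicate finder, not a replacement for one: the name list_duplicate_groups is hypothetical, it assumes GNU find, sha256sum and awk, and it does not handle paths containing newlines or backslashes. It prints groups of paths with identical checksums, separated by blank lines, which is the layout fdupes uses.

```shell
# list_duplicate_groups PATH... (hypothetical helper name)
# Prints groups of files whose SHA-256 checksums match, one path per line,
# with a blank line after each group.
list_duplicate_groups() {
    find "$@" -type f -exec sha256sum -- {} + |
        LC_ALL=C sort |
        awk '
            {
                hash = substr($0, 1, 64)   # hex digest
                path = substr($0, 67)      # skip digest and 2-char separator
                count[hash]++
                paths[hash] = paths[hash] path "\n"
                if (count[hash] == 1) order[++n] = hash
            }
            END {
                # Emit only hashes seen more than once.
                for (i = 1; i <= n; i++)
                    if (count[order[i]] > 1) printf "%s\n", paths[order[i]]
            }'
}
```

Assuming the installed duperemove provides its fdupes mode, a list in this layout could then be passed on with something like list_duplicate_groups /mnt/data | duperemove --fdupes. A matching checksum is only a hint; the compare-and-share ioctl still checks the actual bytes before sharing anything.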

Other tools

Now that the ioctl has been lifted to the VFS layer, rather than being a BTRFS-specific function, deduplication functionality can be implemented in a filesystem-independent way.

As such, xfs_io(8) is able to perform deduplication on a BTRFS file system, and provides a simple way to invoke the deduplication function from the command line, on any filesystem which supports the ioctl.

Example for deduplicating two identical files:

if cmp -s file1 file2; then
  size=$(stat --format="%s" -- file1)
  xfs_io -c "dedupe -C file2 0 0 $size" file1
fi
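
The same steps can be wrapped in a reusable function. This is only a sketch: the name dedupe_pair and the DRY_RUN switch are illustrative additions rather than anything provided by xfs_io, and it assumes GNU cmp and stat and paths without whitespace (xfs_io splits its command string on spaces).

```shell
# dedupe_pair SRC DST (hypothetical helper name)
# Shares DST's extents with SRC if both files are byte-for-byte identical.
# Set DRY_RUN=1 to print the xfs_io command instead of running it.
dedupe_pair() {
    src=$1
    dst=$2
    # Skip pairs that differ; only matching ranges can be shared.
    cmp -s -- "$src" "$dst" || return 1
    size=$(stat --format="%s" -- "$src") || return 1
    # Nothing to share in empty files.
    [ "$size" -gt 0 ] || return 0
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf 'xfs_io -c "dedupe -C %s 0 0 %s" %s\n' "$src" "$size" "$dst"
    else
        # SRC is the range source; DST is the file xfs_io opens.
        xfs_io -c "dedupe -C $src 0 0 $size" "$dst"
    fi
}
```

Running DRY_RUN=1 dedupe_pair file2 file1 first shows the command that would be issued, which is a cheap way to check the argument order before touching real data.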

Inband

Inband / synchronous / inline deduplication is deduplication done in the write path, so it happens as data is written to the filesystem. This typically requires large amounts of RAM to store the lookup table of known block hashes. Patches for this are still being worked on and have been in development since at least 2014. See the "User notes on dedupe" page for more details.
