Design notes on Send/Receive

From btrfs Wiki
Revision as of 12:29, 20 January 2023 by Kdave (Talk | contribs)




This page is for documentation related to btrfs send/receive.

Btrfs send/receive design

Only the send side runs in the kernel; receive runs in user space. The kernel generates a stream of instructions meant to be replayed 1:1 on the receiving side. These instructions correspond to the most important VFS calls, for example create, mkdir, link, symlink, rename, unlink, and write.
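To illustrate what "replayed 1:1" means, here is a minimal sketch of a receiver loop. It is not the btrfs-progs receive code: the tuple-based instruction format and the `replay` function are hypothetical, and only a handful of operations are covered.

```python
import os
import tempfile

def replay(instructions, root):
    """Apply hypothetical (op, *args) tuples below root, one filesystem call each."""
    def p(rel):
        return os.path.join(root, rel)

    for op, *args in instructions:
        if op == "mkdir":
            os.mkdir(p(args[0]))
        elif op == "mkfile":
            # "x" fails if the file already exists, like an exclusive create
            open(p(args[0]), "xb").close()
        elif op == "write":
            path, offset, data = args
            with open(p(path), "r+b") as f:
                f.seek(offset)
                f.write(data)
        elif op == "rename":
            os.rename(p(args[0]), p(args[1]))
        elif op == "unlink":
            os.unlink(p(args[0]))
        else:
            raise ValueError("unknown instruction: %r" % (op,))

# Usage: the file is created under a temporary name and moved into place later.
demo_root = tempfile.mkdtemp()
replay([
    ("mkdir", "dir"),
    ("mkfile", "tmpname"),
    ("write", "tmpname", 0, b"hello"),
    ("rename", "tmpname", "dir/file"),
], demo_root)
```

The receiver needs no knowledge of the source filesystem; it only applies each instruction in the order it arrives.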

Determining differences

btrfs_compare_trees (ctree.c) is used to find the differences between two trees. It invokes a supplied callback for every tree item that differs between the two trees. See changed_cb in send.c for the actions taken for changed items.
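The callback-driven comparison can be pictured as a merge walk over two key-sorted item lists. The sketch below is a simplification, not the kernel algorithm: it works on flat Python lists rather than B-trees, it does not skip shared subtrees, and the names `compare_trees` and the result strings are assumptions made for illustration.

```python
def compare_trees(left, right, changed_cb):
    """Walk two lists of (key, value) sorted by key and report differences.

    left is the tree being sent, right is the parent it is compared against.
    changed_cb(key, result) is called with result "new", "deleted" or "changed".
    """
    i = j = 0
    while i < len(left) or j < len(right):
        if j >= len(right) or (i < len(left) and left[i][0] < right[j][0]):
            changed_cb(left[i][0], "new")        # only in the left tree
            i += 1
        elif i >= len(left) or right[j][0] < left[i][0]:
            changed_cb(right[j][0], "deleted")   # only in the right tree
            j += 1
        else:
            if left[i][1] != right[j][1]:
                changed_cb(left[i][0], "changed")
            i += 1
            j += 1

# Usage with made-up (objectid, type, offset) keys.
events = []
new_tree = [((256, 1, 0), "a"), ((257, 1, 0), "b2"), ((259, 1, 0), "d")]
old_tree = [((256, 1, 0), "a"), ((257, 1, 0), "b"), ((258, 1, 0), "c")]
compare_trees(new_tree, old_tree, lambda key, result: events.append((key[0], result)))
```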

Generating the instruction stream

Based on the differences found by btrfs_compare_trees, we generate a stream of instructions. The stream is ordered by inode number (not in the order a normal fs tree iteration would give!). This means that in most cases we can't create files/dirs in their real/final places, as the whole path may not exist yet. To handle this, we work on inode refs instead of directory items to determine where an inode should lie. There are many situations where we can't link/move the inode into its final destination, for example when parts of the destination path do not exist yet. For these cases we have "orphan inodes": inodes that get a temporary name in the root of the fs and wait to be linked/moved later. Orphan inodes are described later.
(Need to stop writing here...will continue later)
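The orphan inode idea described in this section can be sketched as follows. This is a simplified model rather than what send.c does: it only handles newly created inodes with a single name, the orphan name `o<ino>` is a placeholder, and `generate` and the instruction tuples are hypothetical.

```python
import posixpath

def generate(inodes, root_ino=256):
    """inodes maps ino -> (parent_ino, name, kind); kind is "dir" or "file".

    Returns a list of instructions, produced in inode-number order.
    """
    paths = {root_ino: ""}   # ino -> final path of directories already in place
    waiting = {}             # parent ino -> [(ino, name, orphan_name), ...]
    out = []

    def place_children(dir_ino):
        # Move every orphan that was waiting for this directory to exist.
        for ino, name, orphan in waiting.pop(dir_ino, []):
            final = posixpath.join(paths[dir_ino], name)
            out.append(("rename", orphan, final))
            if inodes[ino][2] == "dir":
                paths[ino] = final
                place_children(ino)

    for ino in sorted(inodes):
        parent, name, kind = inodes[ino]
        op = "mkdir" if kind == "dir" else "mkfile"
        if parent in paths:
            final = posixpath.join(paths[parent], name)
            out.append((op, final))
            if kind == "dir":
                paths[ino] = final
                place_children(ino)
        else:
            # Parent path does not exist yet: create under a temporary name.
            orphan = "o%d" % ino
            out.append((op, orphan))
            waiting.setdefault(parent, []).append((ino, name, orphan))
    return out

# Usage: inodes 257 and 258 live in directory 259, which is created after them.
instructions = generate({
    257: (259, "b.txt", "file"),
    258: (259, "sub", "dir"),
    259: (256, "top", "dir"),
    260: (258, "c.txt", "file"),
})
```

In this example the first two inodes are created as orphans in the root and renamed once `top` exists, while inode 260 can be created directly in its final place.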

Send stream format

The Btrfs send stream has its own format. All data is stored in little-endian byte order.

Overall structure

Here is an example of a minimal send output (figure: Send stream minimal example.png).

The example output can be generated by the following commands:

# btrfs subvolume create /mnt/btrfs/subv1
# btrfs subvolume snapshot -r /mnt/btrfs/subv1 /mnt/btrfs/ro_snap
# btrfs send -f output /mnt/btrfs/ro_snap

The stream consists of the following parts:

  1. Send stream header
  2. One or more send commands

Send stream header

Btrfs send header
Off  Size  Type  Description
0    13    char  Send magic header, "btrfs-stream" with terminating "\0"
13   4     u32   Send version, 0x1 for the v1 stream described here
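As a minimal sketch of reading this header, the following uses Python's `struct` module with the offsets and sizes from the table. The function name `parse_stream_header` is an assumption for illustration, not part of any btrfs tool.

```python
import struct

SEND_STREAM_MAGIC = b"btrfs-stream\x00"   # 13 bytes including the terminating NUL
STREAM_HEADER_FMT = "<13sI"               # little endian: 13-byte magic, u32 version

def parse_stream_header(buf):
    """Return (version, header_size) or raise ValueError on a bad magic."""
    magic, version = struct.unpack_from(STREAM_HEADER_FMT, buf, 0)
    if magic != SEND_STREAM_MAGIC:
        raise ValueError("not a btrfs send stream")
    return version, struct.calcsize(STREAM_HEADER_FMT)

# Usage on a hand-built header; a real stream would be read from the file
# written by "btrfs send -f".
version, header_size = parse_stream_header(b"btrfs-stream\x00" + b"\x01\x00\x00\x00")
```

The first send command starts at offset `header_size` in the stream.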

Send commands

A send command consists of 2 parts:

  1. Send command header
  2. Send command data, which is one or more send TLV objects

Send command header

Offsets are relative to the start of the structure.

Btrfs send command header
Off  Size  Type  Description
0    4     u32   Command size, excluding the command header itself
4    2     u16   Command type; see btrfs_send_command in kernel send.h for all types
6    4     u32   CRC32C checksum over the whole command including the header, computed with the checksum field set to 0
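The following sketch reads one command and verifies its checksum. It assumes the checksum is CRC32C (Castagnoli polynomial) computed with seed 0 and no final inversion, which is how the kernel's crc32c() helper is understood to behave; verify this against send.c before relying on it. The names `crc32c_raw`, `build_command` and `parse_command` are hypothetical.

```python
import struct

CMD_HEADER_FMT = "<IHI"                              # u32 len, u16 cmd, u32 crc
CMD_HEADER_SIZE = struct.calcsize(CMD_HEADER_FMT)    # 10 bytes

def _make_table():
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ 0x82F63B78 if c & 1 else c >> 1
        table.append(c)
    return table

_TABLE = _make_table()

def crc32c_raw(data, seed=0):
    """CRC32C update loop without initial or final inversion."""
    crc = seed
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc

def build_command(cmd, payload):
    """Build a command for testing; the crc field is 0 while checksumming."""
    zeroed = struct.pack(CMD_HEADER_FMT, len(payload), cmd, 0)
    crc = crc32c_raw(zeroed + payload)
    return struct.pack(CMD_HEADER_FMT, len(payload), cmd, crc) + payload

def parse_command(buf, offset=0):
    """Return (cmd, payload, next_offset); raise ValueError on bad input."""
    length, cmd, crc = struct.unpack_from(CMD_HEADER_FMT, buf, offset)
    end = offset + CMD_HEADER_SIZE + length
    if end > len(buf):
        raise ValueError("truncated command")
    raw = bytearray(buf[offset:end])
    raw[6:10] = b"\x00\x00\x00\x00"      # zero the crc field before checking
    if crc32c_raw(raw) != crc:
        raise ValueError("checksum mismatch")
    return cmd, bytes(raw[CMD_HEADER_SIZE:]), end

# Usage: round-trip a command with a made-up type and payload.
blob = build_command(1, b"payload")
cmd, payload, next_offset = parse_command(blob)
```

Calling `parse_command` repeatedly with the returned `next_offset` walks the commands of a stream one after another.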

Send TLV header

TLV means Type Length Value. Offsets are relative to the start of the structure.

TLV header
Off  Size      Type      Description
0    2         u16       TLV type; see the anonymous enum of send attributes in kernel send.h for all types
2    2         u16       TLV length, excluding the TLV header itself
4    Variable  Variable  Data, little endian
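Splitting command data into its TLV objects could look like the sketch below, which follows the v1 layout from the table. The function name `parse_tlvs` and the attribute type numbers in the usage example are made up for illustration; the real values are in kernel send.h.

```python
import struct

TLV_HEADER_FMT = "<HH"                               # u16 type, u16 length
TLV_HEADER_SIZE = struct.calcsize(TLV_HEADER_FMT)    # 4 bytes

def parse_tlvs(data):
    """Split command data into a list of (tlv_type, value_bytes)."""
    attrs = []
    pos = 0
    while pos < len(data):
        if pos + TLV_HEADER_SIZE > len(data):
            raise ValueError("truncated TLV header")
        tlv_type, tlv_len = struct.unpack_from(TLV_HEADER_FMT, data, pos)
        pos += TLV_HEADER_SIZE
        if pos + tlv_len > len(data):
            raise ValueError("truncated TLV value")
        attrs.append((tlv_type, data[pos:pos + tlv_len]))
        pos += tlv_len
    return attrs

# Usage: two attributes with arbitrary type numbers 1 and 2.
sample = (struct.pack("<HH", 1, 4) + b"subv"
          + struct.pack("<HH", 2, 8) + struct.pack("<Q", 7))
attrs = parse_tlvs(sample)
```

How each value is interpreted (path string, u64, UUID, timespec) depends on the TLV type, so the sketch returns the raw bytes.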

Command example

Figure: Send subvol command.png


  • Currently, incremental sends where large directories get deleted are extremely slow. The reason is that we can't rmdir the directory until all its items are deleted. In most cases, these items get deleted after we process the directory inode, so for every deleted directory item that we encounter later we need to check whether a rmdir is possible. To check this, we currently iterate the whole dir and check if the items found there were already processed. If all have been processed, we know that we can do a rmdir now. This is extremely slow on large directories. A possible solution would be a "rmdir cache" which holds a "pending unlink count" for each directory that is waiting for rmdir. We then decrement the count every time we encounter an inode that gets deleted from the pending dir. If it reaches 0, we can safely send a rmdir instruction.
  • The same could be done for newly created directories. If we encounter an inode with an inum smaller than the inum of its parent, we need to iterate the parent dir to find out if the parent was already created out of order. We could speed this up with a cache.
  • For the full send case, we could optimize sending of preallocated extents and dummy extents (disknr=0) by simply skipping them. Following writes/truncates will then create the same preallocated extents on the receiving side. We could probably do the same for incremental sends, but I don't know if there is a syscall to zero out the middle of a file and replace the extents with preallocated extents.
  • Sendshots
  • More output formats? One idea was to add a shar-like output format.
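The "rmdir cache" proposed in the first item of this list could be sketched as follows. This is only an illustration of the pending unlink count idea, not existing kernel code; the class and method names are hypothetical.

```python
class RmdirCache:
    """Tracks, per deleted directory, how many entries still await unlink."""

    def __init__(self):
        self.pending = {}   # directory ino -> pending unlink count

    def defer_rmdir(self, dir_ino, remaining_items):
        """Register a deleted directory. Returns True if rmdir can be sent now."""
        if remaining_items == 0:
            return True
        self.pending[dir_ino] = remaining_items
        return False

    def item_deleted(self, dir_ino):
        """Note one unlink from dir_ino. Returns True when rmdir may be sent."""
        if dir_ino not in self.pending:
            return False
        self.pending[dir_ino] -= 1
        if self.pending[dir_ino] == 0:
            del self.pending[dir_ino]
            return True
        return False

# Usage: directory 300 is deleted while two of its entries are still unprocessed.
cache = RmdirCache()
results = [
    cache.defer_rmdir(300, 2),   # cannot rmdir yet
    cache.item_deleted(300),     # one entry left
    cache.item_deleted(300),     # count reached 0, rmdir may be sent
]
```

Each later unlink then costs one dictionary lookup instead of a full iteration of the directory.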

Send stream v3 draft

Send version 2 has been implemented in 6.1.

Missing from v2:

  • send extent holes in files efficiently, use hole punching (was not available during v1 time)
  • send preallocated extents properly
  • extent clones within one file
  • send progress

Implemented in v3, scheduled for merge:

  • fsverity support

Unrelated to protocol:

  • optionally generate uid/gid number:string mappings and provide them on receiving side to apply