On-disk Format
This document describes the Btrfs on‐disk format.
Overview
Aside from the superblock, Btrfs consists entirely of several trees. The trees use copy-on-write.
Btrfs makes a distinction between logical and physical addresses. Logical addresses are used in the filesystem structures, while physical addresses are simply byte offsets on a disk. One logical address may correspond to physical addresses on any number of disks, depending on RAID settings. The chunk tree is used to convert from logical addresses to physical addresses; the dev tree can be used for the reverse. For bootstrapping purposes, the superblock contains a subset of the chunk tree.
Subvolumes and snapshots.
Basic Structures
Note that the fields are unsigned, so object ID −1 will be treated as ffffffffffffffff and sorted to the end of the tree. Since Btrfs uses little‐endian, a simple byte‐by‐byte comparison of KEYs will not work.
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | Object ID. Each tree has its own set of Object IDs. |
8 | 1 | UINT | Item type. |
9 | 8 | UINT | Offset. The meaning depends on the item type. |
11 |
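The following is a minimal Python sketch of decoding and comparing KEYs, assuming keys are ordered by object ID, then by the one-byte field at offset 8 (taken here to be the item type), then by offset. The names `KEY_FORMAT`, `parse_key`, and `pack_key` are illustrative only and not part of any Btrfs tool.

```python
import struct

# KEY layout from the table above: 8-byte object ID, 1-byte item type,
# 8-byte offset, all little-endian and unsigned (0x11 = 17 bytes in total).
KEY_FORMAT = struct.Struct("<QBQ")

def parse_key(buf, pos=0):
    """Return (objectid, item_type, offset) decoded from buf at pos."""
    return KEY_FORMAT.unpack_from(buf, pos)

def pack_key(objectid, item_type, offset):
    """Encode a key; negative IDs such as -1 wrap to ffffffffffffffff."""
    mask = 0xFFFFFFFFFFFFFFFF
    return KEY_FORMAT.pack(objectid & mask, item_type, offset & mask)

# Comparing the decoded tuples gives the assumed tree order;
# comparing the raw little-endian bytes does not.
a = pack_key(0x100, 0x01, 0)
b = pack_key(0x01, 0x54, 0)
assert parse_key(b) < parse_key(a)  # object ID 1 sorts before object ID 0x100
assert a < b                        # the raw bytes disagree
```

Decoding into a tuple first keeps the comparison independent of byte order, which is the point made in the note above.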
Btrfs uses Unix time.
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | SINT | Number of seconds since 1970-01-01T00:00:00Z. |
8 | 4 | UINT | Number of nanoseconds since the beginning of the second. |
c |
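As an illustration, a TIME value could be decoded like this in Python. The helper name `parse_time` is hypothetical; the sketch returns the raw nanosecond count alongside the `datetime`, because `datetime` only has microsecond resolution.

```python
import struct
from datetime import datetime, timedelta, timezone

# TIME layout from the table above: signed 64-bit seconds followed by
# unsigned 32-bit nanoseconds, little-endian (0xc = 12 bytes in total).
TIME_FORMAT = struct.Struct("<qI")

def parse_time(buf, pos=0):
    """Return (UTC datetime, nanoseconds) for the TIME stored at buf[pos:]."""
    seconds, nanoseconds = TIME_FORMAT.unpack_from(buf, pos)
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Nanoseconds are truncated to microseconds inside the datetime object.
    stamp = epoch + timedelta(seconds=seconds, microseconds=nanoseconds // 1000)
    return stamp, nanoseconds
```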
Superblock
The primary superblock is located at 0x1 0000 (64 KiB). Mirror copies of the superblock are located at physical addresses 0x400 0000 (64 MiB), 0x40 0000 0000 (256 GiB), and 0x4 0000 0000 0000 (1 PiB), if these locations are valid. Btrfs normally updates all superblocks, but in SSD mode it will update only one at a time. The superblock with the highest generation is used when reading.
Note that btrfs only recognizes disks with a valid 0x1 0000 superblock; otherwise, there would be confusion with other filesystems.
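A small Python sketch of the location rules follows. It assumes that a location is "valid" when a whole 0x1000-byte superblock fits on the device at that offset; this reading, and the names `superblock_offsets` and `pick_superblock`, are assumptions made for illustration.

```python
SUPERBLOCK_SIZE = 0x1000

# Primary superblock first, then the three mirror locations.
SUPERBLOCK_OFFSETS = (0x10000, 0x4000000, 0x4000000000, 0x4000000000000)

def superblock_offsets(device_size):
    """List the superblock offsets that fit on a device of the given size."""
    return [off for off in SUPERBLOCK_OFFSETS
            if off + SUPERBLOCK_SIZE <= device_size]

def pick_superblock(candidates):
    """Pick the (generation, offset) pair with the highest generation.

    candidates is an iterable of (generation, offset) pairs for the
    superblock copies that were read successfully.
    """
    return max(candidates, key=lambda pair: pair[0])
```

For example, a 128 MiB device has room for the primary superblock and the first mirror only.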
TODO
Off | Size | Type | Description |
---|---|---|---|
0 | 20 | CSUM | Checksum of everything past this field (from 20 to 1000) |
20 | 10 | UUID | FS UUID |
30 | 8 | UINT | physical address of this block (different for mirrors) |
38 | 8 | | flags |
40 | 8 | ASCII | magic ("_BHRfS_M") |
48 | 8 | | generation |
50 | 8 | | logical address of the root tree root |
58 | 8 | | logical address of the chunk tree root |
60 | 8 | | logical address of the log tree root |
68 | 8 | | log_root_transid |
70 | 8 | | total_bytes |
78 | 8 | | bytes_used |
80 | 8 | | root_dir_objectid (usually 6) |
88 | 8 | | num_devices |
90 | 4 | | sectorsize |
94 | 4 | | nodesize |
98 | 4 | | leafsize |
9c | 4 | | stripesize |
a0 | 4 | | n |
a4 | 8 | | chunk_root_generation |
ac | 8 | | compat_flags |
b4 | 8 | | compat_ro_flags - only implementations that support the flags can write to the filesystem |
bc | 8 | | incompat_flags - only implementations that support the flags can use the filesystem |
c4 | 2 | | csum_type - Btrfs currently uses the CRC32c little-endian hash function with seed -1. |
c6 | 1 | | root_level |
c7 | 1 | | chunk_root_level |
c8 | 1 | | log_root_level |
c9 | 62 | | DEV_ITEM data for this device |
12b | 100 | | label (may not contain '/' or '\\') |
22b | 100 | | reserved |
32b | 800 | | (n bytes valid) Contains (KEY, CHUNK_ITEM) pairs for all SYSTEM chunks. This is needed to bootstrap the mapping from logical addresses to physical. |
b2b | 4d5 | | Currently unused |
1000 |
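The sketch below reads a few superblock fields at the offsets listed above and checks the checksum. It rests on several assumptions that the table does not state: csum_type 0 denotes CRC32c, the 32-bit result is stored little-endian in the first four bytes of the CSUM field, "seed -1" means the usual CRC-32C convention (initial value ffffffff, final inversion), and the label is NUL-padded. The function names are illustrative.

```python
import struct

def crc32c(data):
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF                      # seed -1
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF               # final inversion (assumed)

def parse_superblock(block):
    """Decode selected fields from a 0x1000-byte superblock."""
    if len(block) < 0x1000:
        raise ValueError("superblock must be 0x1000 bytes")
    if block[0x40:0x48] != b"_BHRfS_M":
        raise ValueError("bad magic")
    generation, root, chunk_root, log_root = struct.unpack_from("<4Q", block, 0x48)
    sectorsize, nodesize, leafsize, stripesize, n = struct.unpack_from("<5I", block, 0x90)
    (csum_type,) = struct.unpack_from("<H", block, 0xc4)
    (stored_csum,) = struct.unpack_from("<I", block, 0)
    label = bytes(block[0x12b:0x22b]).split(b"\0", 1)[0]
    return {
        "generation": generation,
        "root": root,
        "chunk_root": chunk_root,
        "log_root": log_root,
        "sectorsize": sectorsize,
        "nodesize": nodesize,
        "label": label.decode("utf-8", "replace"),
        # Only the first n bytes of the 0x800-byte array are valid.
        "sys_chunk_array": bytes(block[0x32b:0x32b + n]),
        # Checksum covers everything past the CSUM field (0x20 to 0x1000).
        "csum_ok": csum_type == 0 and stored_csum == crc32c(block[0x20:0x1000]),
    }
```

The bitwise CRC loop is slow but short; a table-driven or hardware-accelerated CRC32c would be the usual choice for real data volumes.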
Header
TODO
Off | Size | Type | Description |
---|---|---|---|
0 | 20 | CSUM | Checksum of everything after this field (from 20 to the end of the node) |
20 | 10 | UUID | FS UUID |
30 | 8 | UINT | Logical address of this node |
38 | 7 | FIELD | Flags |
3f | 1 | UINT | Backref. Rev.: always 1 (MIXED) for new filesystems; 0 (OLD) indicates an old filesystem. |
40 | 10 | UUID | Chunk tree UUID |
50 | 8 | UINT | Generation |
58 | 8 | UINT | The ID of the tree that contains this node |
60 | 4 | UINT | Number of items |
64 | 1 | UINT | Level (0 for leaf nodes) |
65 |
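A node header can be unpacked in one step with Python's `struct` module, as in the sketch below. The dictionary keys and the name `parse_header` are chosen for illustration; the 7-byte flags field is decoded as a little-endian integer, which is an assumption consistent with the rest of the format.

```python
import struct

# Field order and sizes follow the header table above (0x65 bytes in total).
HEADER_FORMAT = struct.Struct("<32s16sQ7sB16sQQIB")

def parse_header(node):
    """Decode the 0x65-byte header at the start of a tree node."""
    (csum, fs_uuid, logical, flags, backref_rev, chunk_tree_uuid,
     generation, tree_id, nritems, level) = HEADER_FORMAT.unpack_from(node, 0)
    return {
        "csum": csum,
        "fs_uuid": fs_uuid,
        "logical": logical,
        "flags": int.from_bytes(flags, "little"),
        "backref_rev": backref_rev,
        "chunk_tree_uuid": chunk_tree_uuid,
        "generation": generation,
        "tree_id": tree_id,
        "nritems": nritems,
        "level": level,           # 0 means this node is a leaf
    }
```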
Internal Node
In internal nodes, the node header is followed by a number of key pointers.
Off | Size | Type | Description |
---|---|---|---|
0 | 11 | KEY | key |
11 | 8 | UINT | block number |
19 | 8 | UINT | generation |
21 |
header | key ptr | key ptr | key ptr | ... | free space |
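The layout above can be walked with a short Python generator. This is a sketch only: `iter_key_pointers` is a hypothetical name, and the item count and level are read directly from header offsets 60 and 64.

```python
import struct

HEADER_SIZE = 0x65
# Key pointer: KEY (object ID, type, offset), block number, generation.
KEY_PTR = struct.Struct("<QBQQQ")   # 0x21 bytes

def iter_key_pointers(node):
    """Yield (key, block_number, generation) for each key pointer in a node."""
    nritems, level = struct.unpack_from("<IB", node, 0x60)
    if level == 0:
        raise ValueError("level 0 is a leaf node, not an internal node")
    for i in range(nritems):
        objectid, item_type, offset, block, generation = KEY_PTR.unpack_from(
            node, HEADER_SIZE + i * KEY_PTR.size)
        yield (objectid, item_type, offset), block, generation
```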
Leaf Node
In leaf nodes, the node header is followed by a number of items. The items' data is stored at the end of the node.
Off | Size | Type | Description |
---|---|---|---|
0 | 11 | KEY | key |
11 | 4 | UINT | data offset relative to end of header (65) |
15 | 4 | UINT | data size |
19 |
header | item 0 | item 1 | ... | item N | free space | data N | ... | data 1 | data 0 |
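A sketch of reading a leaf follows. The important detail is that each item's data offset is relative to the end of the header, so 0x65 has to be added before slicing. The name `iter_leaf_items` is illustrative.

```python
import struct

HEADER_SIZE = 0x65
# Item: KEY (object ID, type, offset), data offset, data size.
ITEM = struct.Struct("<QBQII")   # 0x19 bytes

def iter_leaf_items(node):
    """Yield (key, data_bytes) for each item in a leaf node."""
    nritems, level = struct.unpack_from("<IB", node, 0x60)
    if level != 0:
        raise ValueError("not a leaf node")
    for i in range(nritems):
        objectid, item_type, offset, data_off, data_size = ITEM.unpack_from(
            node, HEADER_SIZE + i * ITEM.size)
        start = HEADER_SIZE + data_off   # offset is relative to end of header
        yield (objectid, item_type, offset), bytes(node[start:start + data_size])
```

Because item headers grow from the front and data grows from the back, the free space is whatever lies between the last item header and the lowest data offset.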
Object Types
TODO
Objects
Root tree (1)
The root tree holds ROOT_ITEMs, ROOT_REFs, and ROOT_BACKREFs for every tree other than itself. It is used to find the other trees and to determine the subvolume structure. It also holds the items for the root tree directory. The logical address of the root tree is stored in the superblock.
EXTENT tree (2)
TODO
- Holds EXTENT_ITEMs, BLOCK_GROUP_ITEMs
- Pointed to by ROOT
EMPTY_SUBVOL dir (2)
TODO
Chunk tree (3)
The chunk tree holds all DEV_ITEMs and CHUNK_ITEMs, making it possible to determine the device(s) and physical address(es) corresponding to a given logical address. It is therefore crucial for access to the contents of the filesystem.
The chunk tree resides entirely in SYSTEM block groups, and will therefore be accessible from the CHUNK_ITEM array in the Superblock. It also has an entry in the ROOT tree.
Dev tree (4)
The dev tree holds all DEV_EXTENTs, making it possible to determine the logical address corresponding to a given physical address. This is necessary when shrinking or removing devices. The dev tree has an entry in the root tree.
FS tree (5)
TODO
- Holds INODE_ITEMs, INODE_REFs, DIR_ITEMs, DIR_INDEXen, XATTR_ITEMs, EXTENT_DATAs for a filesystem
- Pointed to by ROOT
- TODO: ".."
Root tree directory
The root tree directory is stored in the root tree. It has an INODE_ITEM and a DIR_ITEM with name "default" pointing to the FS tree. There is also a corresponding INODE_REF, but no DIR_INDEX. The objectid of the root tree directory is stored in the superblock, but is currently always 6.
Checksum tree (7)
The checksum tree contains all the EXTENT_CSUMs. It has an entry in the root tree.
ORPHAN (-5)
TODO
TREE_LOG (-6)
TODO
TREE_LOG_FIXUP (-7)
TODO
TREE_RELOC (-8)
TODO
- Just a copy of another tree
DATA_RELOC tree (-9)
TODO
- Holds 100 INODE_ITEM 0
- Holds 100 INODE_REF 100 0:'..'
- Pointed to by ROOT
EXTENT_CSUM (-a)
TODO
MULTIPLE_OBJECTIDS (-100)
TODO
Item Types
INODE_ITEM (01)
(objectid, 01, 0)
Contains the stat information for an inode; see stat(2).
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | generation (TODO) |
8 | 8 | UINT | transid that last touched this (TODO) |
10 | 8 | UINT | st_size. For a directory, this is twice the total number of characters in all the entries' filenames. |
18 | 8 | UINT | st_blocks, but in bytes. This is the sum of the offset fields of all EXTENT_DATA items for this inode. For a directory, this is 0. |
20 | 8 | | Block group (TODO) |
28 | 4 | UINT | st_nlink. This is the number of INODE_REF entries for the inode. For trees and other objects with no INODE_REFs, this is 1. |
2c | 4 | UINT | st_uid |
30 | 4 | UINT | st_gid |
34 | 4 | UINT | st_mode |
38 | 8 | UINT | st_rdev. The lower 20 bits are the minor number, and the higher 44 bits are the major number. |
40 | 8 | UINT | Flags (TODO) |
48 | 8 | UINT | Sequence (for NFS compatibility): starts at 0 and increments whenever st_mtime is updated. |
50 | 20 | | Reserved |
70 | c | TIME | st_atime |
7c | c | TIME | st_ctime. Also updated when xattrs change. |
88 | c | TIME | st_mtime |
94 | c | TIME | "otime" (reserved) |
a0 |
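The following sketch decodes a handful of INODE_ITEM fields, including the split of st_rdev into major and minor numbers described in the table. The function names and the selection of fields are illustrative; the remaining fields can be read the same way from their listed offsets.

```python
import struct

def split_rdev(st_rdev):
    """Return (major, minor): minor is the low 20 bits, major the rest."""
    return st_rdev >> 20, st_rdev & 0xFFFFF

def parse_inode_item(buf):
    """Decode selected stat fields from a 0xa0-byte INODE_ITEM."""
    size, nbytes = struct.unpack_from("<QQ", buf, 0x10)
    nlink, uid, gid, mode = struct.unpack_from("<4I", buf, 0x28)
    (rdev,) = struct.unpack_from("<Q", buf, 0x38)
    mtime = struct.unpack_from("<qI", buf, 0x88)   # (seconds, nanoseconds)
    return {
        "size": size,
        "nbytes": nbytes,     # st_blocks, but in bytes
        "nlink": nlink,
        "uid": uid,
        "gid": gid,
        "mode": mode,
        "rdev": split_rdev(rdev),
        "mtime": mtime,
    }
```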
INODE_REF (0c)
(inode_id, directory_id) TODO
From an inode to a name in a directory.
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | index in the directory |
8 | 2 | UINT | (n) |
a | n | ASCII | name in the directory |
a+n |
This structure can be repeated...?
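If the structure does repeat, a parser might look like the sketch below. It assumes that entries are stored back to back until the item's data is exhausted; that assumption, and the name `parse_inode_refs`, are not confirmed by the text above.

```python
import struct

REF_HEAD = struct.Struct("<QH")   # index in the directory, name length (n)

def parse_inode_refs(buf):
    """Return [(index, name), ...] for the entries packed into buf."""
    refs = []
    pos = 0
    while pos < len(buf):
        index, n = REF_HEAD.unpack_from(buf, pos)
        pos += REF_HEAD.size                 # 0xa bytes of fixed fields
        refs.append((index, bytes(buf[pos:pos + n])))
        pos += n                             # next entry starts at a+n
    return refs
```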
INODE_EXTREF (0d)
(inode_id, hash of name [using directory object ID as seed]) TODO
From an inode to a name in a directory. Used when the corresponding INODE_REF item has run out of space. This item requires the EXTENDED_IREF feature.
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | directory object ID |
8 | 8 | UINT | index in the directory |
10 | 2 | UINT | (n) |
12 | n | ASCII | name in the directory |
12+n |
This structure can be repeated...?
XATTR_ITEM (18)
(inode_id, hash of xattr name) TODO
From an inode to extended attribute(s) by name. Same contents as DIR_ITEM; the name and data are the xattr name and data, and location is NULL.
ORPHAN_ITEM (30)
(-5, objid of orphan inode) TODO
Empty.
DIR_LOG_ITEM (3c)
(directory_id, first offset) TODO
The log is considered authoritative for the range from first offset to end offset.
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | end offset |
DIR_LOG_INDEX (48)
(directory_id, first offset) TODO
Same as DIR_LOG_ITEM.
DIR_ITEM (54)
(parent objectid, 54, hash of name)
Allows looking up a directory item by name.
Off | Size | Type | Description |
---|---|---|---|
0 | 11 | KEY | location of child |
11 | 8 | UINT | transid |
19 | 2 | UINT | (m) |
1b | 2 | UINT | (n) |
1d | 1 | UINT | type of child (0=Unknown, 1=Regular File, 2=Directory, 3=character device, 4=block device, 5=FIFO, 6=socket, 7=symbolic link, 8=extended attribute) |
1e | n | Text | name of item in directory |
1e+n | m | | data of item in directory (empty for normal directory items) |
1e+n+m |
This structure can be repeated multiple times within one DIR_ITEM if multiple items have the same hash.
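Parsing a DIR_ITEM, including the repeated entries that arise from hash collisions, could be sketched as follows. The names `DIR_ENTRY`, `CHILD_TYPES`, and `parse_dir_items` are illustrative; the field order and sizes come from the table above, with n bytes of name followed by m bytes of data.

```python
import struct

# location KEY (object ID, type, offset), transid, m, n, type of child
DIR_ENTRY = struct.Struct("<QBQQHHB")   # 0x1e bytes

CHILD_TYPES = {0: "unknown", 1: "regular file", 2: "directory",
               3: "character device", 4: "block device", 5: "FIFO",
               6: "socket", 7: "symbolic link", 8: "extended attribute"}

def parse_dir_items(buf):
    """Return a list of entries; more than one means the name hashes collide."""
    entries = []
    pos = 0
    while pos < len(buf):
        loc_id, loc_type, loc_off, transid, m, n, child = DIR_ENTRY.unpack_from(buf, pos)
        pos += DIR_ENTRY.size
        name = bytes(buf[pos:pos + n])          # n bytes of name first
        data = bytes(buf[pos + n:pos + n + m])  # then m bytes of data
        pos += n + m
        entries.append({"location": (loc_id, loc_type, loc_off),
                        "transid": transid,
                        "type": CHILD_TYPES.get(child, "invalid"),
                        "name": name,
                        "data": data})
    return entries
```

Since XATTR_ITEM shares this layout, the same parser would apply there, with the xattr value found in `data`.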
DIR_INDEX (60)
(parent objectid, 60, index in parent)
Allows looking up an item in a directory by index. Indices start at 2 (because of "." and ".."); removed files can cause "holes" in the index space. DIR_INDEXen have the same contents as DIR_ITEMs, but may contain only one entry.
EXTENT_DATA (6c)
(inode id, 6c, offset in file) TODO
The contents of a file.
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | generation |
8 | 8 | UINT | (n) size of decoded extent |
10 | 1 | UINT | compression (0=none, 1=zlib, 2=LZO) |
11 | 1 | UINT | encryption (0=none) |
12 | 2 | UINT | other encoding (0=none) |
14 | 1 | UINT | type (0=inline, 1=regular, 2=prealloc) |
15 |
If the extent is inline, the remaining item bytes are the data bytes (n bytes in case no compression/encryption/other encoding is used).
Otherwise, the structure continues:
Off | Size | Type | Description |
---|---|---|---|
15 | 8 | UINT | (ea) logical address of extent. If this is zero, the extent is sparse and consists of all zeroes. |
1d | 8 | UINT | (es) size of extent |
25 | 8 | UINT | (o) offset within the extent |
2d | 8 | UINT | (s) logical number of bytes in file |
35 |
ea and es must exactly match an EXTENT_ITEM. If the es bytes of data at logical address ea are decoded, n bytes will result. The file's data contains the s bytes at offset o within the decoded bytes. In the simplest, uncompressed case, o=0 and n=es=s, so the file's data simply contains the n bytes at logical address ea.
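The relationship between n, ea, es, o, and s can be made concrete with a short sketch. `parse_extent_data` and `file_bytes` are hypothetical helpers; decoding (decompression) of the on-disk bytes is left out, so `file_bytes` expects the already decoded n bytes.

```python
import struct

EXTENT_HEAD = struct.Struct("<QQBBHB")   # 0x15 bytes of common fields

def parse_extent_data(buf):
    """Decode an EXTENT_DATA item into a dictionary."""
    generation, n, compression, encryption, other, kind = EXTENT_HEAD.unpack_from(buf, 0)
    info = {"generation": generation, "n": n, "compression": compression,
            "encryption": encryption, "other_encoding": other, "type": kind}
    if kind == 0:
        # Inline extent: the remaining item bytes are the (encoded) data.
        info["inline_data"] = bytes(buf[0x15:])
    else:
        ea, es, o, s = struct.unpack_from("<4Q", buf, 0x15)
        info.update(ea=ea, es=es, o=o, s=s, sparse=(ea == 0))
    return info

def file_bytes(decoded, o, s):
    """Select the s bytes at offset o within the n decoded extent bytes."""
    return decoded[o:o + s]
```

In the simplest uncompressed case (o=0, n=es=s), `file_bytes` returns its input unchanged.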
EXTENT_CSUM (80)
(-a, logical address?) TODO
Contains one or more checksums of the type in the superblock for adjacent blocks starting at logical address (blocksize).
ROOT_ITEM (84)
(?, transaction id) TODO
or (-7, -7) for LOG tree
Off | Size | Type | Description |
---|---|---|---|
0 | a0 | INODE_ITEM | (gen=1 size=3 nlink=1 nbytes=leafsize mode=40755 other fields 0) |
a0 | 8 | UINT | expected generation |
a8 | 8 | UINT | Object ID in this tree of this tree's root directory (always 100) |
b0 | 8 | UINT | block number of the root node |
b8 | 8 | UINT | byte_limit (always 0) |
c0 | 8 | UINT | bytes_used (can be negative?) |
c8 | 8 | UINT | The last generation a snapshot was taken of (0 for none) |
d0 | 8 | UINT | flags (can be negative?) |
d8 | 4 | UINT | Number of references |
dc | 11 | KEY | drop_progress (always 0:00:0) |
ed | 1 | UINT | drop_level (always 0) |
ee | 1 | UINT | Level of the root of the tree |
ROOT_BACKREF (90)
(subtree id, 90, tree id) TODO
Same content as ROOT_REF.
ROOT_REF (9c)
(tree id, subtree id) TODO
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | ID of directory in [tree id] that contains the subtree |
8 | 8 | UINT | Sequence (index in tree) (even, starting at 2?) |
10 | 2 | UINT | (n) |
12 | n | ASCII | name |
EXTENT_ITEM (a8)
(logical address, a8, size in bytes) TODO
Maps logical extents to their contents.
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | reference count |
8 | 8 | UINT | generation |
10 | 8 | UINT | flags (1=DATA, 2=TREE_BLOCK) |
18 | 11 | KEY | key of first entry in tree? (TREE_BLOCK only) |
29 | 1 | UINT | level of node (TREE_BLOCK only) |
Then inline refs (one for each in reference count) sorted as they would be in a tree (by type, then for example by EXTENT_DATA_REF's hash):
Off | Size | Type | Description |
---|---|---|---|
0 | 1 | UINT | type (must be TREE_BLOCK_REF, EXTENT_DATA_REF, SHARED_BLOCK_REF, or SHARED_DATA_REF) |
1 | | | contents |
TREE_BLOCK_REF (b0)
(logical address, b0, root object id) TODO
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | offset (the object ID of the tree) |
EXTENT_DATA_REF (b2)
(logical address, b2, hash of first three fields) TODO
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | root objectid (id of tree contained in) |
8 | 8 | UINT | object id (owner) |
10 | 8 | UINT | offset (in the file data) |
18 | 4 | UINT | count (always 1?) |
EXTENT_REF_V0 (b4)
TODO
SHARED_BLOCK_REF (b6)
(logical address, b6, parent) TODO
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | offset |
8 |
SHARED_DATA_REF (b8)
(logical address, b8, parent) TODO
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | offset |
8 | 4 | UINT | count (always 1?) |
c |
BLOCK_GROUP_ITEM (c0)
(logical address, c0, size in bytes)
A block group: an allocated area of disk containing EXTENT_ITEMs. Block groups are oriented towards one single type of content (e.g. metadata). Found in the EXTENT tree.
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | Used amount: the total size in bytes of EXTENT_ITEMs in this block group. |
8 | 8 | OBJID | chunk tree id (always 256) |
10 | 8 | FIELD | flags: 1 = DATA oriented; 2 = SYSTEM oriented (the beginning of the disk, the superblock, and everything in tree 3); 4 = METADATA oriented; 8 = RAID0; 10 = RAID1 (only exactly 2 disks); 20 = has a duplicate (exactly 2 copies, not possible with RAID1); 40 = RAID10 |
18 |
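The flags field is a bitmask, so several of the values listed above can be set at once. A minimal decoding sketch follows; the short names in `BLOCK_GROUP_FLAGS` (in particular "DUP" for "has a duplicate") and the function names are chosen here for illustration.

```python
import struct

BLOCK_GROUP_FLAGS = {
    0x01: "DATA",
    0x02: "SYSTEM",
    0x04: "METADATA",
    0x08: "RAID0",
    0x10: "RAID1",
    0x20: "DUP",      # "has a duplicate" in the table above
    0x40: "RAID10",
}

def decode_flags(flags):
    """Return the names of all known bits set in a block group flags value."""
    return [name for bit, name in BLOCK_GROUP_FLAGS.items() if flags & bit]

def parse_block_group_item(buf):
    """Decode the 0x18-byte BLOCK_GROUP_ITEM."""
    used, chunk_tree_id, flags = struct.unpack_from("<QQQ", buf, 0)
    return {"used": used, "chunk_tree_id": chunk_tree_id,
            "flags": decode_flags(flags)}
```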
DEV_EXTENT (cc)
(device id, cc, physical address) TODO
Maps from physical address to logical.
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | chunk tree (always 3) |
8 | 8 | OBJID | chunk oid (always 256?) |
10 | 8 | UINT | logical address |
18 | 8 | UINT | size in bytes |
20 | 10 | UUID | chunk tree UUID |
30 |
DEV_ITEM (d8)
(1, device id) TODO
Contains information about one device.
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | device id |
8 | 8 | UINT | number of bytes |
10 | 8 | UINT | number of bytes used |
18 | 4 | UINT | optimal I/O align |
1c | 4 | UINT | optimal I/O width |
20 | 4 | UINT | minimal I/O size (sector size) |
24 | 8 | UINT | type |
2c | 8 | UINT | generation |
34 | 8 | UINT | start offset |
3c | 4 | UINT | dev group |
40 | 1 | UINT | seek speed |
41 | 1 | UINT | bandwidth |
42 | 10 | UUID | device UUID |
52 | 10 | UUID | FS UUID |
62 |
CHUNK_ITEM (e4)
(100, logical address) TODO
Maps logical address to physical.
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | UINT | size of chunk (bytes) |
8 | 8 | OBJID | root referencing this chunk (2) |
10 | 8 | UINT | stripe length |
18 | 8 | UINT | type (same as flags for block group?) |
20 | 4 | UINT | optimal io alignment |
24 | 4 | UINT | optimal io width |
28 | 4 | UINT | minimal io size (sector size) |
2c | 2 | UINT | number of stripes |
2e | 2 | UINT | sub stripes |
30 |
Stripes follow (for each number of stripes):
Off | Size | Type | Description |
---|---|---|---|
0 | 8 | OBJID | device id |
8 | 8 | UINT | offset |
10 | 10 | UUID | device UUID |
20 |
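The sketch below parses CHUNK_ITEMs, walks the (KEY, CHUNK_ITEM) pairs of the superblock's SYSTEM chunk array, and performs a simplified logical-to-physical lookup. The lookup assumes that every stripe holds a full copy of the chunk (single or duplicated data); striped profiles such as RAID0 and RAID10 need additional arithmetic with the stripe length that is not shown. All function names are illustrative.

```python
import struct

KEY = struct.Struct("<QBQ")           # 0x11 bytes
CHUNK = struct.Struct("<QQQQIIIHH")   # 0x30 bytes of fixed fields
STRIPE = struct.Struct("<QQ16s")      # 0x20 bytes per stripe

def parse_chunk_item(buf, pos=0):
    """Return (chunk dict, position just past the last stripe)."""
    (size, root, stripe_len, chunk_type, io_align, io_width,
     sector_size, num_stripes, sub_stripes) = CHUNK.unpack_from(buf, pos)
    pos += CHUNK.size
    stripes = []
    for _ in range(num_stripes):
        devid, offset, _dev_uuid = STRIPE.unpack_from(buf, pos)
        stripes.append((devid, offset))
        pos += STRIPE.size
    return {"size": size, "stripe_len": stripe_len, "type": chunk_type,
            "stripes": stripes}, pos

def parse_sys_chunk_array(array):
    """Return [(logical start, chunk dict), ...] from the bootstrap array."""
    chunks = []
    pos = 0
    while pos < len(array):
        _objectid, _item_type, logical = KEY.unpack_from(array, pos)
        chunk, pos = parse_chunk_item(array, pos + KEY.size)
        chunks.append((logical, chunk))
    return chunks

def logical_to_physical(chunks, logical):
    """Return [(device id, physical address), ...], one per copy.

    Assumption: each stripe is a complete copy of the chunk.
    """
    for start, chunk in chunks:
        if start <= logical < start + chunk["size"]:
            return [(devid, offset + (logical - start))
                    for devid, offset in chunk["stripes"]]
    raise LookupError("logical address is not covered by any known chunk")
```

Once the SYSTEM chunks are known, the chunk tree root named in the superblock can be read, and the remaining CHUNK_ITEMs found there extend the same list.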
STRING_ITEM (fd)
(anything, 0)
Contains a string; used for testing only.
(Page contents used to be on user page, moved to own during 2012 migration.)