Tree-checker

From btrfs Wiki

Latest revision as of 17:01, 16 January 2021

Summary

Starting with kernel version 4.18, btrfs has a new verification layer, tree-checker, which provides a centralized verification service so that other parts of btrfs no longer need to worry about random corruption.

The design principle is: detect and reject, with comprehensive checks.

  • Detect
For the read time tree-checker, the check happens when btrfs reads a tree block from disk: after basic checks such as the csum, tree-checker verifies the content.
For the write time tree-checker, the check happens before btrfs writes a tree block to disk: after the csum is calculated, tree-checker verifies the content.
  • Reject
The read time tree-checker rejects the tree block just as if it had failed the csum check, so btrfs will still try to read the other mirrors.
The write time tree-checker rejects the tree block so that it never reaches disk. This aborts the current transaction, so the fs is not corrupted any further.
  • Comprehensive check
In theory, tree-checker verifies every member of the on-disk data.
Compromises are sometimes made to accept the behavior of older kernels, but if that older behavior breaks the definition of the on-disk format, tree-checker will reject it.

Starting with kernel version 5.2, tree-checker is also applied to tree blocks before they are written to disk, which detects possible runtime memory bitflips or corruption.
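
The detect-and-reject flow described above can be sketched roughly as follows. This is a toy model, not kernel code: TreeBlock, content_ok(), read_tree_block(), write_tree_block() and TransactionAborted are hypothetical names, zlib.crc32 merely stands in for the real btrfs checksum, and the "starts with OK" rule stands in for the real content checks.

```python
import zlib
from dataclasses import dataclass


class TransactionAborted(Exception):
    """Raised when a tree block is rejected at write time."""


@dataclass
class TreeBlock:
    payload: bytes  # stand-in for the tree block content
    csum: int       # checksum stored alongside the block


def csum_ok(block):
    # zlib.crc32 is only a stand-in for the real btrfs checksum.
    return zlib.crc32(block.payload) == block.csum


def content_ok(block):
    # Hypothetical content rule standing in for the real tree-checker.
    return block.payload.startswith(b"OK")


def read_tree_block(mirrors):
    """Return the first copy that passes both the csum and content checks."""
    for copy in mirrors:
        if not csum_ok(copy):
            continue  # bad csum: try the next mirror
        if not content_ok(copy):
            continue  # a tree-checker rejection is handled like a csum failure
        return copy
    return None  # every mirror was rejected


def write_tree_block(payload, disk):
    """Calculate the csum, verify the content, then write or abort."""
    block = TreeBlock(payload, zlib.crc32(payload))
    if not content_ok(block):
        # The block never reaches disk; the transaction is aborted instead.
        raise TransactionAborted("tree block rejected before write")
    disk.append(block)
```

In this model a read succeeds as long as at least one mirror holds a good copy, while a rejected write leaves the disk list untouched.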

Implementation

The btrfs tree-checker rejects any suspicious or corrupted tree block before passing it to the core btrfs code.

One example is check_block_group_item() in fs/btrfs/tree-checker.c. It checks the following members (that is, all of them):

  • key
The key of a block group item includes its start bytenr and length.
The length should never be 0.
  • item size
A block group item has a fixed size, so any other size is invalid.
  • block group item
      • chunk objectid
    Fixed value
      • used bytes
    Should never exceed the block group size
      • flags
    Only certain combinations are allowed

With such comprehensive checks, we ensure that every tree block (struct extent_buffer) has valid structure and data, so later users of struct extent_buffer no longer need to check for things like bad key order or unaligned bytenr.
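
As a rough illustration of the block group item checks listed above, here is a simplified Python model. It is not the kernel function: the BlockGroupItem class and the error strings are hypothetical, and the constants (item size, chunk objectid, flag bits) are reproduced from memory of the kernel headers, so treat them as assumptions to verify.

```python
from dataclasses import dataclass

# Assumed values, believed to match the kernel headers; verify before use.
BLOCK_GROUP_ITEM_SIZE = 24        # three u64 fields: used, chunk objectid, flags
FIRST_CHUNK_TREE_OBJECTID = 256   # the "fixed value" for chunk objectid
BG_DATA, BG_SYSTEM, BG_METADATA = 1 << 0, 1 << 1, 1 << 2
BG_PROFILE_MASK = 0xFF << 3       # assumed range of the RAID/DUP profile bits


@dataclass
class BlockGroupItem:
    start: int           # key: start bytenr
    length: int          # key: block group length
    item_size: int
    used: int
    chunk_objectid: int
    flags: int


def check_block_group_item(bg):
    """Return a list of problems; an empty list means the item passed."""
    errors = []
    if bg.length == 0:
        errors.append("block group length is 0")
    if bg.item_size != BLOCK_GROUP_ITEM_SIZE:
        errors.append("unexpected item size")
    if bg.chunk_objectid != FIRST_CHUNK_TREE_OBJECTID:
        errors.append("unexpected chunk objectid")
    if bg.used > bg.length:
        errors.append("used bytes exceed block group size")
    # At most one profile bit may be set.
    if bin(bg.flags & BG_PROFILE_MASK).count("1") > 1:
        errors.append("more than one profile bit set")
    # The type must be data, system, metadata, or mixed data+metadata.
    bg_type = bg.flags & (BG_DATA | BG_SYSTEM | BG_METADATA)
    if bg_type not in (BG_DATA, BG_SYSTEM, BG_METADATA, BG_DATA | BG_METADATA):
        errors.append("invalid block group type")
    return errors
```

Returning a list of problems rather than stopping at the first one is a choice made for this sketch only; it makes the example easier to experiment with.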

Limitation

Tree-checker works at the level of a single tree block, so it can't check the key sequence across leaf/node boundaries.

One example of this limitation:

node level 1:
(EXTENT_CSUM EXTENT_CSUM 1M) block X
(EXTENT_CSUM EXTENT_CSUM 2M) block Y
(EXTENT_CSUM EXTENT_CSUM 3M) block Z

leaf X: nritems 1
(EXTENT_CSUM EXTENT_CSUM 1M)

leaf Y: nritems 2
(EXTENT_CSUM EXTENT_CSUM 2M)
(EXTENT_CSUM EXTENT_CSUM 4M) <<< Larger than the first key of next leaf

leaf Z:
(EXTENT_CSUM EXTENT_CSUM 3M)

So tree-checker does not cover 100% of cases, but it is still very useful and has handled a lot of fuzzed images pretty well.
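
The example above can be reproduced in a small Python model to show why a per-block check misses it. Everything here is illustrative: keys are modeled as plain tuples with placeholder strings, and keys_ordered() and cross_leaf_violations() are hypothetical helpers, not kernel functions.

```python
MiB = 1 << 20


def csum_key(offset_mib):
    # Keys are modeled as (objectid, type, offset) tuples; the strings are
    # placeholders for the numeric values stored on disk.
    return ("EXTENT_CSUM", "EXTENT_CSUM", offset_mib * MiB)


leaves = {
    "X": [csum_key(1)],
    "Y": [csum_key(2), csum_key(4)],  # 4M is larger than the first key of Z
    "Z": [csum_key(3)],
}

# Parent node: (first key of child, child name) pointers, in order.
node = [(csum_key(1), "X"), (csum_key(2), "Y"), (csum_key(3), "Z")]


def keys_ordered(keys):
    """Single-block check: keys must be strictly increasing."""
    return all(a < b for a, b in zip(keys, keys[1:]))


def cross_leaf_violations(node, leaves):
    """Cross-block check, which a per-block checker cannot perform."""
    bad = []
    for (_, cur), (next_key, nxt) in zip(node, node[1:]):
        # The last key of a leaf must be smaller than the next leaf's first key.
        if leaves[cur][-1] >= next_key:
            bad.append((cur, nxt))
    return bad
```

Every block passes keys_ordered() on its own, including leaf Y, while cross_leaf_violations(node, leaves) flags the Y/Z pair. Only the second check needs to look at more than one block at a time.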

For end users

How to determine if it's caused by tree-checker

Tree-checker reports errors like:

[13234.185509] BTRFS error (device dm-4): corrupt leaf, root=2 block=30769152 slot=0 bg_start=16777216 bg_len=0, invalid block size 0

The "corrupt leaf" or "corrupt node" text is common to all tree-checker error reports.

Furthermore, on kernel v5.2 and newer, it also includes the following message to show when the corruption was detected:

[13234.185509] BTRFS error (device dm-4): block=30769152 read time tree block corruption detected
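
For scanning saved kernel logs, a small filter along the following lines may help. It is a sketch based only on the two sample lines above: the regular expressions and the classify() helper are hypothetical, the exact wording may differ between kernel versions, and the "write time" variant of the message is assumed to mirror the read time one.

```python
import re

# Matches the "corrupt leaf" / "corrupt node" report and extracts the block number.
CORRUPT_RE = re.compile(
    r"BTRFS \w+ \(device (?P<dev>[^)]+)\): "
    r"corrupt (?P<kind>leaf|node)\b.*?\bblock=(?P<block>\d+)"
)

# Matches the timing message; "write" is an assumed counterpart of "read".
TIMING_RE = re.compile(
    r"BTRFS \w+ \(device (?P<dev>[^)]+)\): "
    r"block=(?P<block>\d+) (?P<when>read|write) time tree block corruption detected"
)


def classify(line):
    """Return a dict describing a tree-checker message, or None."""
    m = CORRUPT_RE.search(line)
    if m:
        return {"type": "report", "dev": m["dev"],
                "kind": m["kind"], "block": int(m["block"])}
    m = TIMING_RE.search(line)
    if m:
        return {"type": "timing", "dev": m["dev"],
                "when": m["when"], "block": int(m["block"])}
    return None
```

Lines for which classify() returns None are not tree-checker messages as far as this sketch can tell.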

How to handle such an error

Please report to the btrfs mailing list <linux-btrfs@vger.kernel.org> first.

  • If it's write time corruption
Normally this means runtime memory corruption: either the memory is unreliable, or some other kernel memory corruption is causing the problem.
Reporting to the mailing list will help the end user pin down the cause to some extent.
Because the write was prevented, the fs is not corrupted any further. Running btrfs check --readonly is still recommended to make sure the fs is OK.
  • If it's read time corruption
This needs to be determined case by case.
If it's a false alert, developers will fix it; until then, using an older kernel should be OK.
If it's a real corruption, then depending on the solution provided, the user may need to salvage the data from the corrupted image, either by mounting it RO or with "btrfs-restore".

Please do *NOT* use btrfs check --repair until instructed by a developer.
