libvirt-passthrough/btrfs_cow_and_extents_explanation.md
2025-08-16 12:22:38 +02:00

# Btrfs extents, NOCOW, reflinks — the mental model
**What an extent is**
Btrfs stores file data in variable-length extents. The allocator chooses sizes based on free space and context; in practice they can be **very large (hundreds of MiB up to multiple GiB)**. Don't assume a small fixed maximum; inspect with `filefrag` if you care.
**What NOCOW really means**
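Setting `chattr +C` on a file (or creating it in a `+C` dir) disables CoW for **future, unshared** writes: data is written in place (no checksums, no compression). Set the flag on an empty file or on the directory before the image is created; applied to a file that already holds data, it is not reliable. As long as an extent is **unshared**, small writes **do not** create new extents.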
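
A minimal sketch of setting up a NOCOW directory and checking that the flag took. The helper names `flags_have_nocow` and `make_nocow_dir` and the example path are illustrative assumptions, not part of any tool; only `mkdir`, `chattr`, `lsattr`, and `awk` are real commands here.

```bash
# Return success if an lsattr flag string contains the No_COW flag ("C").
# Lowercase "c" is the unrelated "compressed" flag, so the match is case-sensitive.
flags_have_nocow() {
    case "$1" in
        *C*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Create a directory whose new files inherit NOCOW, then verify the flag.
# Only meaningful on Btrfs; chattr +C is expected to fail on most other filesystems.
make_nocow_dir() {
    local dir="$1" flags
    mkdir -p "$dir" || return 1
    chattr +C "$dir" || return 1
    # lsattr -d prints "<flags> <path>"; the first field is the flag string.
    flags=$(lsattr -d "$dir" | awk '{print $1}')
    flags_have_nocow "$flags"
}

# Example (hypothetical path):
#   make_nocow_dir /var/lib/libvirt/images/nocow && echo "NOCOW directory ready"
```

Files created inside that directory afterwards inherit the flag; files that already existed elsewhere have to be copied in (a full copy, not a reflink) to pick it up.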
**Where fragmentation actually comes from**
* **CoW files:** random small writes naturally accumulate many small extents.
* **NOCOW files:** they stay “unfragmented” **until you share them**. The moment you **clone/reflink/snapshot**, extents are shared between files. Any later write to a shared range must CoW **only the modified sub-range**, producing **small private extents** → fragmentation over time.
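
To see whether an image currently shares extents (and is therefore exposed to the CoW-on-write behaviour described above), one option is to count the rows that `filefrag -v` marks with the `shared` flag. This is a sketch under the assumption that your `filefrag` prints the usual colon-separated extent rows; `count_shared_extents` is a made-up helper name.

```bash
# Count extents that filefrag -v flags as "shared" (reads filefrag -v output on stdin).
# Extent rows start with "<index>:" and carry their flags in the last colon-separated field.
count_shared_extents() {
    awk -F: '
        /^[[:space:]]*[0-9]+:/ {
            total++
            if ($NF ~ /shared/) shared++
        }
        END { printf "%d of %d extents shared\n", shared, total }
    '
}

# Usage: filefrag -v vm.img | count_shared_extents
```

A non-zero shared count on an active image means later writes to those ranges will be redirected into new private extents.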
**Reflinks in one sentence**
A reflink is instant, metadata-only sharing of extents. It succeeds only if the **data checksum class matches** (COW↔COW or NOCOW↔NOCOW); COW↔NOCOW attempts typically fail with `EINVAL`.
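
A hedged sketch of attempting a reflink without ever falling back to a full copy, assuming GNU `cp` (which provides `--reflink=always`). The wrapper name `try_reflink` is an assumption for illustration; it cannot tell a checksum-class mismatch apart from a filesystem that lacks reflink support, it only reports that the clone was refused.

```bash
# Attempt a reflink copy; never fall back to copying the data.
# Returns 0 on success, 1 if the clone was refused, 2 if dst already exists.
try_reflink() {
    local src="$1" dst="$2"
    if [ -e "$dst" ]; then
        echo "refusing to overwrite $dst" >&2
        return 2
    fi
    if cp --reflink=always "$src" "$dst" 2>/dev/null; then
        echo "reflinked: $src -> $dst"
    else
        # Some cp versions leave an empty destination behind on failure; clean it up.
        rm -f "$dst"
        echo "reflink refused (checksum class mismatch or no reflink support)" >&2
        return 1
    fi
}

# Usage: try_reflink vm.img backups/vm.img || echo "needs a different backup path"
```

Using `--reflink=always` rather than `--reflink=auto` is deliberate: `auto` silently degrades to a full copy, which hides exactly the mismatch this section describes.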
**Defragmentation, what it does and doesn't**
`btrfs filesystem defragment -t <SIZE> file` rewrites extents **smaller than `<SIZE>`** into larger runs (up to what the allocator will give on your FS). It **breaks sharing** (backups/snapshots bloat) and generates heavy write I/O, but improves I/O by reducing metadata churn. Running it **again with the same threshold** right after a pass is basically a **no-op**.
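
One way to confirm what a pass achieved is to compare the extent count before and after. The sketch below parses the summary line that plain `filefrag` prints (`<file>: N extents found`); `extents_found` is a hypothetical helper name, and the `sync` in the usage comment is an assumption that some rewritten data may still be waiting for writeback.

```bash
# Extract N from filefrag's summary line "<file>: N extents found" (reads stdin).
# Also accepts the singular form "1 extent found".
extents_found() {
    sed -n 's/.*: \([0-9][0-9]*\) extents\{0,1\} found.*/\1/p'
}

# Usage:
#   before=$(filefrag vm.img | extents_found)
#   btrfs filesystem defragment -t 256M vm.img
#   sync    # assumption: flush pending writes before measuring again
#   after=$(filefrag vm.img | extents_found)
#   echo "extents: $before -> $after"
```

If a second pass with the same threshold leaves the count essentially unchanged, that matches the no-op behaviour described above.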
**Practical VM workflow (why this all matters)**
* Keep **active** VM images in a `+C` area (no reflinks of those) so writes stay in place.
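* Make **backups** via reflink or snapshot (space-efficient), but don't keep running the VM on a reflinked copy. A plain reflink from a `+C` image into a checksummed COW file typically fails with `EINVAL` (see the reflink rule above), so the backup target needs to be NOCOW too, or the backup has to be a snapshot of the containing subvolume.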
* If you ever create fragmentation (e.g., you had a reflink while running), defrag the **active** image with a sensible target (e.g., `-t 128M…256M`), accept the write load, and avoid defragging the backups.
**Quick checks you'll actually use later**
Largest extents (MiB):
```bash
# Field 4 of each extent row is the length in blocks; assumes 4096-byte blocks.
filefrag -v file | awk -F: '/^[[:space:]]*[0-9]+:/{print ($4+0)*4096/1048576}' | sort -nr | head -5
```
How much is “small” (<256 MiB):
```bash
# Count extents shorter than 256 MiB and the data they hold (assumes 4096-byte blocks).
filefrag -v file | awk -F: '/^[[:space:]]*[0-9]+:/{len=$4+0; if(len*4096<256*1024*1024){n++; s+=len}} END{printf "extents<256MiB=%d, bytes≈%.2fGiB\n",n,s*4096/1073741824}'
```
**Progress reality check**
Defrag has **no built-in % progress**. Infer via `iostat/iotop` or run in chunks:
```bash
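# One chunk: only the first 16 GiB (-s = start offset, -l = length); repeat with a higher -s.
btrfs filesystem defragment -s 0 -l 16G -t 256M file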
```
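
Running in chunks can be scripted with the `-s` (start) and `-l` (length) options of `btrfs filesystem defragment`, so that every finished window acts as a progress tick. This is a sketch, not a tuned tool: `defrag_in_chunks` and its `DRY_RUN` switch are assumptions made for illustration, and extents cannot be merged across window boundaries, so the chunk size should be a large multiple of the `-t` target.

```bash
# Defragment a file in fixed-size windows and report progress after each one.
# Args: file, file size in bytes, chunk size in bytes, optional -t target (default 256M).
# Set DRY_RUN=1 to print the commands instead of running them.
defrag_in_chunks() {
    local file="$1" size_bytes="$2" chunk_bytes="$3" target="${4:-256M}"
    local start=0 len
    if [ "$chunk_bytes" -le 0 ]; then
        echo "chunk size must be positive" >&2
        return 2
    fi
    while [ "$start" -lt "$size_bytes" ]; do
        # The last window may be shorter than chunk_bytes.
        len=$(( size_bytes - start ))
        if [ "$len" -gt "$chunk_bytes" ]; then
            len=$chunk_bytes
        fi
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "btrfs filesystem defragment -s $start -l $len -t $target $file"
        else
            btrfs filesystem defragment -s "$start" -l "$len" -t "$target" "$file" || return 1
        fi
        start=$(( start + len ))
        echo "progress: $(( start * 100 / size_bytes ))%" >&2
    done
}

# Usage (16 GiB windows):
#   defrag_in_chunks vm.img "$(stat -c %s vm.img)" $((16 * 1024 * 1024 * 1024))
```

A dry run first (`DRY_RUN=1 defrag_in_chunks ...`) shows the exact commands and window boundaries before any data is rewritten.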