Btrfs extents, NOCOW, reflinks — the mental model
What an extent is
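Btrfs stores file data in variable-length extents. The allocator chooses sizes based on free space and context. In current kernels a single uncompressed data extent is capped at 128 MiB (128 KiB for compressed extents), but physically contiguous extents are reported by filefrag as one run, so the runs you see can be very large (hundreds of MiB up to multiple GiB). Don't assume a small fixed maximum; inspect with filefrag if you care.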
What NOCOW really means
Setting chattr +C on a file (or creating it in a +C dir) disables CoW for future, unshared writes: data is written in place (no checksums, no compression). The flag is only reliable on new or empty files, so set it before any data is written. As long as an extent is unshared, small writes do not create new extents.
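Whether a file actually carries the flag can be checked with lsattr, which shows C in the flags field for NOCOW files. A minimal sketch, assuming the usual "FLAGS PATH" layout of an lsattr output line; the helper name has_nocow_flag and the paths in the usage comment are made up for illustration:

```shell
# Return 0 if the flags field of one `lsattr` output line contains C (NOCOW).
# Only the first space-delimited field is inspected, so a "C" in the path
# does not cause a false positive.
has_nocow_flag() {
    flags=${1%% *}
    case "$flags" in
        *C*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Hypothetical usage on a Btrfs mount (paths are placeholders):
#   mkdir -p /mnt/vm/images && chattr +C /mnt/vm/images
#   touch /mnt/vm/images/disk.raw      # new, empty file inherits +C
#   has_nocow_flag "$(lsattr -d /mnt/vm/images/disk.raw)" && echo "NOCOW"
```

Parsing the text output keeps the check independent of the filesystem it runs on; on a real system the input comes straight from lsattr -d.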
Where fragmentation actually comes from
- CoW files: random small writes naturally accumulate many small extents.
- NOCOW files: they stay “unfragmented” until you share them. The moment you clone/reflink/snapshot, extents are shared between files. Any later write to a shared range must CoW only the modified sub-range, producing small private extents → fragmentation over time.
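One way to watch this happen is to record the extent count before and after a period of writes. A minimal sketch, assuming the "N extents found" summary line that filefrag prints at the end of its output; the helper name extent_count is hypothetical:

```shell
# Read `filefrag` output on stdin and print the extent count taken from
# its summary line ("FILE: N extents found" or "FILE: 1 extent found").
extent_count() {
    sed -n 's/.*: \([0-9][0-9]*\) extents\{0,1\} found$/\1/p'
}

# Hypothetical usage (disk.raw is a placeholder):
#   before=$(filefrag disk.raw | extent_count)
#   ... run the VM for a while with a reflinked copy present ...
#   after=$(filefrag disk.raw | extent_count)
#   echo "extents: $before -> $after"
```

A count that keeps climbing on a +C image is a hint that some of its extents are shared.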
Reflinks in one sentence
A reflink is instant, metadata-only sharing of extents. It succeeds only if source and destination agree on data checksumming (COW↔COW or NOCOW↔NOCOW); COW↔NOCOW attempts typically fail with EINVAL.
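With GNU cp, a reflink can be requested explicitly via --reflink=always, which fails instead of silently falling back to a full copy. A minimal sketch under that assumption; the wrapper name try_reflink is made up, and cp's exit status alone does not say whether the cause was EINVAL or a filesystem without reflink support:

```shell
# Try a reflink copy of SRC to DST. On failure, remove any partial DST and
# return cp's exit status. Refuses to touch an existing DST.
try_reflink() {
    src=$1
    dst=$2
    if [ -e "$dst" ]; then
        echo "refusing to overwrite existing $dst" >&2
        return 2
    fi
    status=0
    cp --reflink=always -- "$src" "$dst" 2>/dev/null || status=$?
    if [ "$status" -ne 0 ]; then
        rm -f -- "$dst"
        echo "reflink failed (cp exit status $status): COW/NOCOW mismatch or no reflink support?" >&2
    fi
    return "$status"
}

# Hypothetical usage (paths are placeholders):
#   try_reflink /mnt/vm/images/disk.raw /mnt/vm/backups/disk.raw
```

Using --reflink=always rather than --reflink=auto is deliberate here: a silent fallback to a full copy would hide exactly the mismatch this section describes.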
Defragmentation, what it does and doesn’t
btrfs filesystem defragment -t <SIZE> file rewrites extents smaller than <SIZE> into larger runs (up to what the allocator will give on your FS). It breaks sharing (reflinked backups and snapshots bloat) and generates heavy write I/O, but can improve later I/O by reducing the extent count and the metadata that goes with it. Running it again with the same threshold right after a pass is basically a no-op.
Practical VM workflow (why this all matters)
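- Keep active VM images in a +C area (no reflinks of those) so writes stay in place.
- Make backups via reflink (space-efficient), but don't keep running the VM on a reflinked copy. Because COW↔NOCOW reflinks typically fail, the backup target has to be +C as well; a plain copy into a COW area works but is not space-efficient.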
- If you ever create fragmentation (e.g., you had a reflink while running), defrag the active image with a sensible target (e.g., -t 128M…256M), accept the write load, and avoid defragging the backups.
Quick checks you’ll actually use later
Largest extents (MiB):
filefrag -v file | awk -F: '/^[[:space:]]*[0-9]+:/{print ($4+0)*4096/1048576}' | sort -nr | head -5
How much is “small” (<256 MiB):
filefrag -v file | awk -F: '/^[[:space:]]*[0-9]+:/{len=$4+0; if(len*4096<256*1024*1024){n++; s+=len}} END{printf "extents<256MiB=%d, bytes≈%.2fGiB\n",n,s*4096/1073741824}'
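The two one-liners above hard-code a 4096-byte block size. A minimal sketch of a reusable variant that reads the block size from the "(N blocks of B bytes)" header of filefrag -v and falls back to 4096 if that line is missing; the function name small_extent_summary and its output format are made up for illustration:

```shell
# Summarise `filefrag -v` output read from stdin.
# $1: threshold in MiB (default 256). Prints the total extent count, the
# number of extents below the threshold, and their combined size in MiB.
small_extent_summary() {
    awk -F: -v limit_mib="${1:-256}" '
        # Header line, e.g. "File size of f is 12582912 (3072 blocks of 4096 bytes)"
        /blocks? of [0-9]+ bytes/ {
            s = $0
            sub(/.*blocks? of /, "", s)
            bs = s + 0
        }
        # Extent rows start with an index and a colon; field 4 is the length in blocks.
        /^[[:space:]]*[0-9]+:/ {
            if (bs == 0) bs = 4096      # assumed default when no header was seen
            bytes = ($4 + 0) * bs
            total++
            if (bytes < limit_mib * 1048576) { n++; small += bytes }
        }
        END {
            printf "extents=%d small=%d small_mib=%.1f\n", total, n, small / 1048576
        }'
}

# Hypothetical usage:
#   filefrag -v disk.raw | small_extent_summary 256
```

The small_mib figure is roughly how much data a defragment pass with the same threshold would have to rewrite.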
Progress reality check
Defrag has no built-in % progress. Infer it from iostat/iotop, or run it in chunks by limiting each pass to a byte range with -s <START> and -l <LEN>:
btrfs filesystem defragment -s <START> -l <LEN> -t 256M file
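Running in chunks can be scripted with the -s (start offset) and -l (length) options of btrfs filesystem defragment, so that each finished chunk doubles as a progress marker. A minimal sketch, assuming GNU stat and byte-valued arguments; the function names chunk_ranges and defrag_in_chunks and the 8 GiB chunk size in the usage comment are made up for illustration:

```shell
# Print "START LEN" pairs (in bytes) that cover SIZE bytes in CHUNK-byte steps.
chunk_ranges() {
    size=$1
    chunk=$2
    start=0
    while [ "$start" -lt "$size" ]; do
        len=$chunk
        if [ $((start + len)) -gt "$size" ]; then
            len=$((size - start))       # last chunk may be shorter
        fi
        echo "$start $len"
        start=$((start + chunk))
    done
}

# Defragment FILE one chunk at a time, printing the byte range before each pass.
# $1: file, $2: chunk size in bytes, $3: target extent size (default 256M).
defrag_in_chunks() {
    file=$1
    total=$(stat -c %s -- "$file") || return 1   # GNU stat: file size in bytes
    chunk_ranges "$total" "$2" | while read -r start len; do
        echo "defragmenting bytes $start..$((start + len - 1)) of $total"
        btrfs filesystem defragment -s "$start" -l "$len" -t "${3:-256M}" "$file" \
            </dev/null || exit 1
    done
}

# Hypothetical usage: 8 GiB chunks, 256M target
#   defrag_in_chunks disk.raw $((8 * 1024 * 1024 * 1024)) 256M
```

The range arithmetic lives in its own function so it can be checked without touching a filesystem; only defrag_in_chunks needs Btrfs and the btrfs tool.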