libvirt-passthrough/btrfs_cow_and_extents_explanation.md
2025-08-16 12:22:38 +02:00
# Btrfs extents, NOCOW, reflinks: the mental model

## What an extent is

Btrfs stores file data in variable-length extents. The allocator chooses sizes based on free space and context; in practice extents can be very large (hundreds of MiB up to multiple GiB). Don't assume a small fixed maximum; inspect with `filefrag` if you care.

## What NOCOW really means

Setting `chattr +C` on a file (or creating it in a `+C` directory) disables CoW for future, unshared writes: data is written in place, with no checksums and no compression. As long as an extent is unshared, small writes do not create new extents.
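A minimal sketch of setting this up. The path is illustrative; the flag only sticks on Btrfs, hence the fallback message, and `+C` must be in place before the file gets any data:

```shell
# Create a directory whose new files are born NOCOW (+C).
d="${TMPDIR:-/tmp}/nocow-demo"
mkdir -p "$d"
chattr +C "$d" 2>/dev/null || echo "note: +C unsupported on this filesystem"
# +C is inherited at creation time, before any blocks are written:
truncate -s 1G "$d/disk.img"          # sparse placeholder, no data yet
lsattr -d "$d" 2>/dev/null || true    # a 'C' among the flags = NOCOW inherits
```

Note the ordering: applying `chattr +C` to a file that already has data does not convert existing extents, which is why the directory gets the flag first.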

## Where fragmentation actually comes from

- CoW files: random small writes naturally accumulate many small extents.
- NOCOW files: they stay unfragmented until you share them. The moment you clone/reflink/snapshot, extents are shared between files. Any later write to a shared range must CoW only the modified sub-range, producing small private extents → fragmentation over time.

## Reflinks in one sentence

A reflink is instant, metadata-only sharing of extents. It succeeds only if the data checksum class matches (COW↔COW or NOCOW↔NOCOW); COW↔NOCOW attempts typically fail with `EINVAL`.
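A quick sketch of what "instant, metadata-only" looks like in practice. Paths are illustrative; the clone only succeeds on a filesystem with reflink support (Btrfs, XFS) and matching checksum classes, so the failure branch is expected elsewhere:

```shell
# Attempt a metadata-only clone of a placeholder file.
d="${TMPDIR:-/tmp}/reflink-demo"; mkdir -p "$d"; cd "$d"
truncate -s 16M source.img
if cp --reflink=always source.img clone.img 2>/dev/null; then
  echo "reflink succeeded: clone shares source's extents, no data copied"
else
  echo "reflink failed here (no FICLONE support, or checksum-class mismatch)"
fi
```

`--reflink=always` is the honest mode for testing: unlike `--reflink=auto`, it refuses to fall back to a full data copy, so the `if` tells you whether sharing actually happened.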

## Defragmentation: what it does and doesn't

`btrfs filesystem defragment -t <SIZE> file` rewrites extents smaller than `<SIZE>` into larger runs (up to whatever the allocator will give on your filesystem). It breaks sharing (so backups/snapshots bloat) and generates heavy write I/O, but it improves later I/O by reducing metadata churn. Running it again with the same threshold right after a pass is basically a no-op.
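A before/after measurement sketch: `filefrag`'s summary line reads "N extents found", so comparing it around a pass shows whether the threshold did anything. The file name is a placeholder, and the defrag step only takes effect on Btrfs (it is guarded so the snippet is safe to run elsewhere):

```shell
# Count extents, defrag (Btrfs only), count again.
d="${TMPDIR:-/tmp}/defrag-demo"; mkdir -p "$d"; cd "$d"
truncate -s 64M disk.img                       # stand-in image file
filefrag disk.img 2>/dev/null || echo "filefrag/FIEMAP unavailable here"
if command -v btrfs >/dev/null; then
  btrfs filesystem defragment -t 256M disk.img 2>/dev/null || true
fi
filefrag disk.img 2>/dev/null || true
```

If the second count matches the first, every remaining extent was already at or above the threshold, which is exactly the "re-running is a no-op" behavior described above.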

## Practical VM workflow (why this all matters)

- Keep active VM images in a `+C` area (and never reflink those) so writes stay in place.
- Make backups via reflink into a COW area (space-efficient), but don't keep running the VM on a reflinked copy.
- If you ever create fragmentation (e.g., you had a reflink while the VM was running), defrag the active image with a sensible target (e.g., `-t` 128M…256M), accept the write load, and avoid defragging the backups.
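The layout above can be sketched as follows. All paths and names (`vmstore`, `guest.img`) are illustrative assumptions; in real use the base directory sits on a Btrfs mount, and the backup step is left commented because, per the reflink rule above, a NOCOW source needs a NOCOW target:

```shell
# Active area gets +C; backups live in a separate directory for clones.
BASE="${TMPDIR:-/tmp}/vmstore"
ACTIVE="$BASE/active"    # +C: in-place writes, never reflinked while running
BACKUP="$BASE/backups"   # clone targets go here
mkdir -p "$ACTIVE" "$BACKUP"
chattr +C "$ACTIVE" 2>/dev/null || echo "note: +C needs Btrfs"
truncate -s 1G "$ACTIVE/guest.img"   # new file inherits +C on Btrfs
# Backup step (create the target +C first when the source is NOCOW):
# cp --reflink=always "$ACTIVE/guest.img" "$BACKUP/guest-$(date +%F).img"
```

The point of the split is discipline, not mechanism: nothing stops you from reflinking out of the active area, but keeping clones in their own directory makes it obvious which image the VM must never run on.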

## Quick checks you'll actually use later

Largest extents (MiB):

```shell
filefrag -v file | awk -F: '/^[[:space:]]*[0-9]+:/{print ($4+0)*4096/1048576}' | sort -nr | head -5
```

How much is "small" (<256 MiB):

```shell
filefrag -v file | awk -F: '/^[[:space:]]*[0-9]+:/{len=$4+0; if(len*4096<256*1024*1024){n++; s+=len}} END{printf "extents<256MiB=%d, bytes≈%.2fGiB\n",n,s*4096/1073741824}'
```

## Progress reality check

Defrag has no built-in % progress. Infer progress via `iostat`/`iotop`, or run it in chunks:

```shell
btrfs filesystem defragment -t 256M file
```
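One way to run "in chunks" with at least coarse progress is per-file iteration over a directory of images, printing a counter as you go. A sketch with stand-in files; the real defrag line is commented so the loop is safe to run off-Btrfs, and the directory name is hypothetical:

```shell
# Coarse progress: defrag one image at a time, counting files.
d="${TMPDIR:-/tmp}/defrag-chunks"; mkdir -p "$d"; cd "$d"
for n in 1 2 3; do truncate -s 8M "img$n.img"; done   # stand-in images
total=0; for f in *.img; do total=$((total+1)); done
i=0
for f in *.img; do
  i=$((i+1))
  printf '[%d/%d] %s\n' "$i" "$total" "$f"
  # btrfs filesystem defragment -t 256M "$f"
done
```

Each file finishing is one visible step; pair it with `iostat` in another terminal to see how far through the current file the write load is.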