From 43742d012328c081739fa17fa0efdcd28ec8b788 Mon Sep 17 00:00:00 2001
From: Paul Schulze <mail@paul.network>
Date: Sat, 16 Aug 2025 12:22:38 +0200
Subject: [PATCH] btrfs: CoW and extents - explanations

---
 btrfs_cow_and_extents_explanation.md | 46 ++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)
 create mode 100644 btrfs_cow_and_extents_explanation.md
diff --git a/btrfs_cow_and_extents_explanation.md b/btrfs_cow_and_extents_explanation.md
new file mode 100644
index 0000000..03d7084
--- /dev/null
+++ b/btrfs_cow_and_extents_explanation.md
@@ -0,0 +1,46 @@
+# Btrfs extents, NOCOW, reflinks — the mental model
+
+**What an extent is**
+Btrfs stores file data in variable-length extents. The allocator chooses sizes based on free space and context; in practice the contiguous runs that `filefrag` reports can be **very large (hundreds of MiB up to multiple GiB)**. Don’t assume a small fixed maximum; inspect with `filefrag` if you care.
+
+**What NOCOW really means**
+Setting `chattr +C` on a file (or creating it in a `+C` dir) disables CoW for **future, unshared** writes: data is written in place (no checksums, no compression). The flag is only reliable on new or empty files, so set it before any data is written. As long as an extent is **unshared**, small writes **do not** create new extents.
+
+**Where fragmentation actually comes from**
+
+* **CoW files:** random small writes naturally accumulate many small extents.
+* **NOCOW files:** they stay “unfragmented” **until you share them**. The moment you **clone/reflink/snapshot**, extents are shared between files. Any later write to a shared range must CoW **only the modified sub-range**, producing **small private extents** → fragmentation over time.
+
+**Reflinks in one sentence**
+A reflink is instant, metadata-only sharing of extents. It succeeds only if the **data checksum class matches** (COW↔COW or NOCOW↔NOCOW); COW↔NOCOW attempts typically fail with `EINVAL`.
+
+**Defragmentation, what it does and doesn’t**
+`btrfs filesystem defragment -t <SIZE> file` rewrites extents **smaller than `<SIZE>`** into larger runs (up to what the allocator will give on your FS). It **breaks sharing** (backups/snapshots bloat) and generates heavy write I/O, but it can improve subsequent I/O because fewer extents mean less metadata to walk. Running it **again with the same threshold** right after a pass is basically a **no-op**.
+
+**Practical VM workflow (why this all matters)**
+
+* Keep **active** VM images in a `+C` area (no reflinks of those) so writes stay in place.
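+* Make **backups** via reflink or snapshot (space-efficient). Because a reflink needs matching checksum classes, a reflink copy of a `+C` image has to land in a `+C` location as well; a plain COW destination will typically be rejected with `EINVAL`. Don’t keep running the VM on a reflinked copy.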
+* If you ever create fragmentation (e.g., a reflink existed while the VM was running), defrag the **active** image with a sensible target (e.g., somewhere between `-t 128M` and `-t 256M`), accept the write load, and avoid defragging the backups.
+
+**Quick checks you’ll actually use later**
+
+Largest extents (MiB), assuming `filefrag` reports lengths in 4096-byte blocks:
+
+```bash
+filefrag -v file | awk -F: '/^[[:space:]]*[0-9]+:/{print ($4+0)*4096/1048576}' | sort -nr | head -5
+```
+
+How much is “small” (<256 MiB):
+
+```bash
+filefrag -v file | awk -F: '/^[[:space:]]*[0-9]+:/{len=$4+0; if(len*4096<256*1024*1024){n++; s+=len}} END{printf "extents<256MiB=%d, bytes≈%.2fGiB\n",n,s*4096/1073741824}'
+```
+
+**Progress reality check**
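+Defrag has **no built-in % progress**. Infer via `iostat`/`iotop`, or run in chunks with `-s <START>` and `-l <LEN>` so each pass covers a bounded byte range:
+
+```bash
+btrfs filesystem defragment -s 0 -l 16G -t 256M file   # next chunk: -s 16G -l 16G, and so on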
+```
+