lanspread/crates/lanspread-peer/ARCHITECTURE.md
ddidderr ce51d92df0 refactor(peer): tighten listener-addr handshake invariant
Follow-up hardening for 348a02c, where `listen_addr` was added to Hello and
HelloAck as `Option<SocketAddr>`. Code review surfaced three concrete problems
that the previous commit left open:

1. Cold-start asymmetry. Discovery and the QUIC/mDNS advertiser are spawned
   concurrently. If discovery saw a cached peer advertisement before our own
   advertiser had written `ctx.local_peer_addr`, our outbound Hello carried
   `listen_addr: None`. The receiver's `peer_record_addr` then returned `None`
   and silently dropped the Hello while we still recorded their HelloAck, so
   peer A learned about peer B but B never learned about A until a later
   handshake happened to win the race.

2. Duplicate game-list pipeline. The previous commit added
   `refresh_peer_games`, which post-handshake issued a `ListGames` to fetch
   `peer.games`. The library-sync path (`LibrarySnapshot`) already populates
   the same field. Both could race on first contact and overwrite each other.
   Worse, `refresh_peer_games` was misnamed: a `peer_game_count > 0` guard
   turned it into a fetch-once-then-no-op helper, while
   `handle_library_summary` independently re-triggered a full handshake when
   `previous_count == 0` was observed, producing a redundant ping-pong on
   every first contact.

3. Argument explosion. `perform_handshake_with_peer`, `spawn_library_resync`,
   and `after_peer_library_recorded` had grown to 6-8 individual parameters
   and acquired `#[allow(clippy::too_many_arguments)]` opt-outs. Every caller
   was destructuring the same fields out of `Ctx`/`PeerCtx`.

Changes (all in one commit because they jointly enforce the same invariant:
"a peer is only ever recorded by its listener address, and the local
listener address must exist before we participate in the protocol"):

- `Hello.listen_addr` and `HelloAck.listen_addr` are now `SocketAddr`, not
  `Option<SocketAddr>`. Wire-incompatible, but PROTOCOL_VERSION already moved
  to 3 in 348a02c so no additional version bump is needed.
- `required_listen_addr` reads `ctx.local_peer_addr` and returns an
  `eyre::Result`; `build_hello_from_state` and `build_hello_ack` both call
  it, so an outbound or inbound Hello can no longer be constructed before
  the local QUIC listener is bound. The inbound path maps this into a
  `Response::InternalPeerError` so the remote peer fails cleanly instead of
  seeing a malformed HelloAck.
- `run_peer_discovery` blocks on `wait_for_local_peer_addr` (25 ms poll,
  shutdown-aware) before subscribing to the mDNS browser. This closes the
  cold-start race for outbound handshakes at the source.
- `refresh_peer_games`, `request_game_list_from_peer`, and the
  `previous_count == 0` re-handshake trigger are removed. The post-handshake
  flow now relies solely on `LibrarySummary`/`LibrarySnapshot`/`LibraryDelta`
  for peer-library state; `ListGames` survives only for the
  `request_game_details_*` paths that fetch per-game file descriptions on
  demand.
- New `HandshakeCtx` (with `from_ctx` and `from_peer_ctx` constructors)
  replaces the long argument lists. All `too_many_arguments` allow-attrs in
  `handshake.rs` are gone, and call sites in `handlers.rs`, `discovery.rs`,
  and `stream.rs` collapse to a single clone.
- `handle_library_delta` no longer acquires a read lock on the apply path:
  the `peer_addr` lookup moved into the `else` resync branch where it is
  actually needed.
- `accept_inbound_hello`'s `remote_addr` parameter is renamed to
  `transport_addr`. It is now used only for warn-log formatting, and the
  new name signals that this is the ephemeral QUIC source port, never the
  authoritative listener address that gets recorded.

User-visible effect: on cold start, peers can no longer end up with an
asymmetric view of each other ("A sees B but B never sees A"). First-contact
library sync now does one handshake plus one snapshot/delta exchange instead
of the previous handshake + ListGames + redundant follow-up handshake. The
direct-connect CLI path (`handle_connect_peer_command`) now fails fast with
"local peer listener address is not ready" if invoked before the QUIC server
has bound; this is intentional - the previous behaviour would have sent a
Hello that the receiver had to silently discard.

Test Plan:
- just fmt
- just clippy
- just test (80 peer + 3 cli + 5 tauri tests pass)
- just build
- Manual: bring up `just peer-cli-alpha`/`bravo`/`charlie`, confirm symmetric
  peer discovery and that games show up on every side after one library
  digest cycle, with no duplicated ListGames traffic in trace logs.

Refs: Review feedback on commit 348a02c (listener-address handshake fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:21:19 +02:00


lanspread-peer proposed protocol and architecture

This document proposes a tighter, more fault-tolerant protocol while keeping the current idea: mDNS discovery, QUIC transport, on-demand metadata, and chunked file transfers.

Goals (unchanged)

  • Local LAN discovery via mDNS.
  • QUIC + JSON messages for control, raw streams for file data.
  • UI drives operations through PeerCommand, peers remain headless.
  • Peers can appear/disappear at any time without data loss.

Peer lifecycle and message flow

1) Startup and advertise

  • Start QUIC server.
  • Advertise via mDNS with TXT records:
    • peer_id (stable ID, not tied to IP)
    • proto_ver
    • library_rev (monotonic local library revision)
    • optional hostname

2) Discovery and handshake

When a peer is discovered:

  1. Connect and send Hello { peer_id, proto_ver, listen_addr, library_rev, library_digest, features }. listen_addr is mandatory; the QUIC source port is only a temporary transport port and must not be recorded as the peer's listener.
  2. Receive HelloAck { peer_id, proto_ver, listen_addr, library_rev, library_digest, features }.
  3. If the remote peer_id is already known but the address changed, update it.
  4. If protocol versions are incompatible, drop the peer (and keep mDNS watching).
  5. If library digests match, do nothing else.
  6. If digests differ:
    • If we have a known library_rev for that peer, request LibraryDelta.
    • Otherwise request LibrarySnapshot.
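Steps 4 to 6 above can be sketched as a pure decision function. This is a minimal sketch, not the crate's actual code: the type and function names (`HandshakeOutcome`, `RemoteHello`, `KnownPeerState`, `decide_after_handshake`) are hypothetical, "incompatible" is assumed to mean "not equal", and the digest comparison is assumed to be between the digest last recorded for that peer and the one carried in its Hello/HelloAck.

```rust
/// Follow-up after a completed Hello/HelloAck exchange (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
pub enum HandshakeOutcome {
    /// Protocol versions differ: drop the peer, mDNS keeps watching.
    DropPeer,
    /// Digests match: nothing else to do.
    InSync,
    /// A revision is already known for this peer: ask for changes since it.
    RequestDelta { from_rev: u64 },
    /// Nothing known about this peer's library: ask for the full snapshot.
    RequestSnapshot,
}

/// The fields of the remote Hello/HelloAck that matter for the decision.
pub struct RemoteHello<'a> {
    pub proto_ver: u32,
    pub library_digest: &'a str,
}

/// What we already recorded for this peer_id, if anything.
pub struct KnownPeerState<'a> {
    pub library_rev: Option<u64>,
    pub library_digest: Option<&'a str>,
}

pub fn decide_after_handshake(
    local_proto_ver: u32,
    remote: &RemoteHello<'_>,
    known: &KnownPeerState<'_>,
) -> HandshakeOutcome {
    // Assumption: versions are compatible only when they are equal.
    if remote.proto_ver != local_proto_ver {
        return HandshakeOutcome::DropPeer;
    }
    if known.library_digest == Some(remote.library_digest) {
        return HandshakeOutcome::InSync;
    }
    match known.library_rev {
        Some(from_rev) => HandshakeOutcome::RequestDelta { from_rev },
        None => HandshakeOutcome::RequestSnapshot,
    }
}
```

Keeping the decision free of I/O makes the delta-versus-snapshot choice easy to unit-test without a QUIC connection.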

3) Steady state

  • Any message updates last_seen.
  • Pings run only when idle (or on a longer interval), not every 5 seconds.
  • Library updates are pushed as deltas, debounced and coalesced.

4) Shutdown

  • Optional Goodbye { peer_id } lets others remove the peer quickly.
  • If a peer vanishes without goodbye, stale timeout + ping removal handle it.
  • Goodbye is a hint, never required for correctness.
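The liveness rules from the steady-state and shutdown sections can be sketched as a small state holder. This is an illustration under assumptions: the names (`Liveness`, `LivenessAction`) and both durations are hypothetical, since the document does not fix an idle threshold or a stale timeout.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum LivenessAction {
    Nothing,
    SendPing,
    RemovePeer,
}

/// Per-peer liveness tracking (hypothetical type; thresholds are assumptions).
pub struct Liveness {
    last_seen: Instant,
    idle_before_ping: Duration,
    stale_after: Duration,
}

impl Liveness {
    pub fn new(now: Instant, idle_before_ping: Duration, stale_after: Duration) -> Self {
        Self { last_seen: now, idle_before_ping, stale_after }
    }

    /// Any message from the peer refreshes last_seen, not only pings.
    pub fn on_message(&mut self, now: Instant) {
        self.last_seen = now;
    }

    /// Periodic check: ping only when the peer has been idle, and remove it
    /// once the stale timeout has passed, whether or not a Goodbye arrived.
    pub fn check(&self, now: Instant) -> LivenessAction {
        let idle = now.saturating_duration_since(self.last_seen);
        if idle >= self.stale_after {
            LivenessAction::RemovePeer
        } else if idle >= self.idle_before_ping {
            LivenessAction::SendPing
        } else {
            LivenessAction::Nothing
        }
    }
}
```

A received Goodbye would simply remove the peer immediately; because the stale timeout still applies, losing that message changes only how quickly the peer disappears.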

Library sync protocol

Summary and snapshot

  • LibrarySummary { peer_id, summary: { library_rev, library_digest, game_count } }
  • LibrarySnapshot { peer_id, snapshot: { library_rev, games: Vec<GameSummary> } }

Delta updates

  • LibraryDelta { peer_id, delta: { from_rev, to_rev, added, updated, removed } }
  • removed is a list of game IDs.
  • Deltas are idempotent; ignore if to_rev <= known rev.
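A minimal sketch of applying a delta under these rules follows. The struct layouts are assumptions for illustration (the real `GameSummary` carries more fields, and `PeerLibrary` and `DeltaOutcome` are hypothetical names); only the revision checks are taken from the text above.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GameSummary {
    pub id: String,
    pub name: String,
}

pub struct LibraryDelta {
    pub from_rev: u64,
    pub to_rev: u64,
    pub added: Vec<GameSummary>,
    pub updated: Vec<GameSummary>,
    /// Game IDs only.
    pub removed: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DeltaOutcome {
    Applied,
    /// to_rev <= known rev: already seen, safe to drop.
    IgnoredStale,
    /// from_rev does not line up with the known rev: fall back to a snapshot.
    NeedSnapshot,
}

/// Our view of one remote peer's library (hypothetical type).
#[derive(Default)]
pub struct PeerLibrary {
    pub rev: u64,
    pub games: BTreeMap<String, GameSummary>,
}

impl PeerLibrary {
    pub fn apply_delta(&mut self, delta: &LibraryDelta) -> DeltaOutcome {
        if delta.to_rev <= self.rev {
            return DeltaOutcome::IgnoredStale;
        }
        if delta.from_rev != self.rev {
            return DeltaOutcome::NeedSnapshot;
        }
        // Added and updated entries are both upserts keyed by game ID.
        for game in delta.added.iter().chain(delta.updated.iter()) {
            self.games.insert(game.id.clone(), game.clone());
        }
        for id in &delta.removed {
            self.games.remove(id);
        }
        self.rev = delta.to_rev;
        DeltaOutcome::Applied
    }
}
```

Because the stale check runs first, replaying the same delta leaves the library untouched, which is what makes the update idempotent.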

GameSummary (concept)

  • id, name, eti_version, size, downloaded, installed
  • manifest_hash (hash of file list + sizes)
  • availability (e.g., ready, downloading, local_only)

When peers broadcast their game list

  • Only on changes, not on a timer.
  • Filesystem events are gated per game ID instead of time-debounced:
    • an active operation lock drops events for that game;
    • a rescan already running for the ID sets a rescan-pending flag;
    • the running rescan loops once more when that flag was set.
  • Send LibraryDelta to known peers; send LibrarySummary on new connections.
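The per-game gating of filesystem events can be sketched as follows. `RescanGate` and `GateDecision` are hypothetical names, and the sketch is single-threaded for clarity; a real implementation would presumably keep this state behind a lock.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq, Eq)]
pub enum GateDecision {
    /// An operation holds the game: the event is dropped.
    Dropped,
    /// A rescan is already running: remember that another pass is needed.
    MarkedPending,
    /// Nothing in flight: the caller should start a rescan now.
    StartRescan,
}

/// Gate for filesystem events, keyed by game ID (hypothetical type).
#[derive(Default)]
pub struct RescanGate {
    active_ops: HashSet<String>,
    /// Game ID -> rescan-pending flag; an entry exists only while a rescan runs.
    running: HashMap<String, bool>,
}

impl RescanGate {
    pub fn begin_operation(&mut self, id: &str) {
        self.active_ops.insert(id.to_owned());
    }

    pub fn end_operation(&mut self, id: &str) {
        self.active_ops.remove(id);
    }

    pub fn on_fs_event(&mut self, id: &str) -> GateDecision {
        if self.active_ops.contains(id) {
            return GateDecision::Dropped;
        }
        if self.running.contains_key(id) {
            self.running.insert(id.to_owned(), true);
            return GateDecision::MarkedPending;
        }
        self.running.insert(id.to_owned(), false);
        GateDecision::StartRescan
    }

    /// Called when one rescan pass ends. Returns true if the pass should loop
    /// once more because events arrived while it was running.
    pub fn on_rescan_finished(&mut self, id: &str) -> bool {
        let pending = self.running.get(id).copied().unwrap_or(false);
        if pending {
            self.running.insert(id.to_owned(), false);
            true
        } else {
            self.running.remove(id);
            false
        }
    }
}
```

Any number of events during a running rescan collapse into a single extra pass, so a burst of writes costs at most two scans per game.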

Local game scanning: fast and low cost

Strategy

  1. Maintain a persistent on-disk index (per game):
    • manifest_hash, total size, file list (optional), and a fingerprint (root-level version.ini mtime, root-level .eti mtime/size, and local/ directory presence).
  2. Use filesystem watchers to update only changed games.
  3. Keep a 300-second fallback scan to recover from missed events.

Fast-path scanning

  • On startup, list only top-level game directories.
  • For each game, read a cheap fingerprint:
    • root-level .eti file names, sizes, and mtimes
    • root-level version.ini mtime
    • presence of local/ as a directory
  • If fingerprint unchanged, reuse cached size and manifest hash.
  • Only run a recursive scan for new or changed games.
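The fast path amounts to comparing a cheap fingerprint against the cached one before deciding to recurse. The sketch below assumes hypothetical type names (`EtiEntry`, `Fingerprint`, `CachedGame`, `ScanPlan`) and leaves out reading the fingerprint from disk; it only shows the reuse-or-rescan decision.

```rust
use std::collections::HashMap;
use std::time::SystemTime;

/// One root-level .eti file as seen by the cheap scan.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EtiEntry {
    pub name: String,
    pub size: u64,
    pub mtime: SystemTime,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Fingerprint {
    /// Assumed to be sorted by name so that equality is order-independent.
    pub eti_files: Vec<EtiEntry>,
    pub version_ini_mtime: Option<SystemTime>,
    pub local_is_dir: bool,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CachedGame {
    pub fingerprint: Fingerprint,
    pub size: u64,
    pub manifest_hash: String,
}

pub enum ScanPlan<'a> {
    /// Fingerprint unchanged: reuse cached size and manifest hash.
    Reuse(&'a CachedGame),
    /// New or changed game: run the recursive scan.
    FullScan,
}

pub fn plan_scan<'a>(
    index: &'a HashMap<String, CachedGame>,
    game_id: &str,
    current: &Fingerprint,
) -> ScanPlan<'a> {
    match index.get(game_id) {
        Some(cached) if cached.fingerprint == *current => ScanPlan::Reuse(cached),
        _ => ScanPlan::FullScan,
    }
}
```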

Local State and Recovery

Downloaded and installed are independent predicates:

  • downloaded is true only when <game_root>/version.ini exists as a regular file. The sentinel is written last, through .version.ini.tmp and an atomic rename. An interrupted replacement does not restore the old sentinel, because archive bytes may already have changed.
  • installed is true when <game_root>/local/ is a directory. The contents of local/ are user-owned and are skipped by manifests, fingerprints, and file serving.
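Both predicates, and the write-last sentinel, can be sketched with the standard library alone. The function names are hypothetical, and treating a symlink as "not a regular file / not a directory" is an assumption of this sketch rather than something the document states.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// downloaded: <game_root>/version.ini exists as a regular file.
/// symlink_metadata does not follow symlinks, so a symlink does not count
/// (an assumption of this sketch).
pub fn is_downloaded(game_root: &Path) -> bool {
    fs::symlink_metadata(game_root.join("version.ini"))
        .map(|m| m.is_file())
        .unwrap_or(false)
}

/// installed: <game_root>/local/ is a directory.
pub fn is_installed(game_root: &Path) -> bool {
    fs::symlink_metadata(game_root.join("local"))
        .map(|m| m.is_dir())
        .unwrap_or(false)
}

/// Write the sentinel as the final step of a download: fill the scratch
/// file, flush it to disk, then rename it over version.ini in one step.
pub fn write_sentinel(game_root: &Path, contents: &[u8]) -> io::Result<()> {
    let tmp = game_root.join(".version.ini.tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(contents)?;
    file.sync_all()?;
    fs::rename(&tmp, game_root.join("version.ini"))
}
```

If the process dies before the rename, only .version.ini.tmp is left behind, which is one of the scratch files that startup recovery sweeps.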

Reserved per-game paths:

  • .version.ini.tmp and .version.ini.discarded are download transaction scratch files and are swept during startup recovery.
  • .local.installing/ is extraction staging.
  • .local.backup/ holds the previous install while an update or uninstall is in flight.
  • .lanspread.json is the atomic per-game intent log.
  • .lanspread_owned inside .local.* directories proves Lanspread ownership when the current intent is None.

Recovery reads .lanspread.json and combines the recorded intent with the observed local/, .local.installing/, and .local.backup/ state. Intent states Installing, Updating, and Uninstalling prove ownership of the corresponding reserved directories even if the marker was not flushed before a crash. With intent None, markerless .local.* directories are left untouched.
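The ownership rule in the paragraph above reduces to a small predicate. This sketch uses hypothetical names (`Intent`, `owns_reserved_dir`) and deliberately does not model which reserved directory corresponds to which intent, since the text does not spell that mapping out.

```rust
/// Intent recorded in .lanspread.json (variant names follow the text above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Intent {
    None,
    Installing,
    Updating,
    Uninstalling,
}

/// May recovery treat a reserved `.local.*` directory as Lanspread-owned?
/// `has_owned_marker` is true when `.lanspread_owned` exists inside it.
pub fn owns_reserved_dir(intent: Intent, has_owned_marker: bool) -> bool {
    match intent {
        // An in-flight intent proves ownership even if the marker was never
        // flushed before the crash.
        Intent::Installing | Intent::Updating | Intent::Uninstalling => true,
        // Without an intent, only the marker proves ownership; markerless
        // directories are left untouched.
        Intent::None => has_owned_marker,
    }
}
```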

Result

Most scans become O(number of game dirs), with full recursion only when needed.

File manifests and downloads

  • Keep GetGame/manifest requests, but keyed by manifest_hash so repeated calls can be skipped when unchanged.
  • Downloads remain chunked QUIC streams with the existing integrity checks.
  • A game is transferable only when its ID is in the catalog, no operation is active for that ID, and the root-level version.ini sentinel exists.
  • local/ paths are never served, even if a stale or malicious manifest request asks for them.
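The last two rules can be sketched as two checks performed before serving any file. The names (`GameState`, `is_transferable`, `is_servable`) are hypothetical; the path check is written conservatively and also rejects absolute paths and any `.` or `..` component, which goes slightly beyond what the text requires.

```rust
use std::ffi::OsStr;
use std::path::{Component, Path};

/// Inputs to the transferability rule (hypothetical struct).
pub struct GameState {
    pub in_catalog: bool,
    pub operation_active: bool,
    pub sentinel_exists: bool,
}

pub fn is_transferable(state: &GameState) -> bool {
    state.in_catalog && !state.operation_active && state.sentinel_exists
}

/// Decide whether a requested path, relative to the game root, may be served.
/// Anything under the root-level local/ directory is refused, as is any path
/// that is absolute or contains `.` or `..` components.
pub fn is_servable(relative: &Path) -> bool {
    let mut components = relative.components();
    match components.next() {
        Some(Component::Normal(first)) if first != OsStr::new("local") => {
            components.all(|c| matches!(c, Component::Normal(_)))
        }
        _ => false,
    }
}
```

Running the path check on the server side, independently of the manifest, is what keeps a stale or malicious request from reaching user-owned files.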

Fault tolerance rules

  • Every peer is keyed by peer_id, not by IP address.
  • Peer addresses are listener addresses from mDNS or Hello/HelloAck, never ephemeral QUIC source ports.
  • library_rev is monotonic and guards against out-of-order updates.
  • Any mismatch or missing delta falls back to LibrarySnapshot.
  • Loss of goodbye is harmless; stale timeout is authoritative.
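The first two rules suggest a peer table keyed by peer_id whose value is the advertised listener address. The following is a minimal sketch with hypothetical names (`PeerTable`, `PeerUpdate`); the real table holds more per-peer state.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

#[derive(Debug, PartialEq, Eq)]
pub enum PeerUpdate {
    New,
    AddressChanged { old: SocketAddr },
    Unchanged,
}

/// Peers keyed by stable peer_id, never by IP address (hypothetical type).
#[derive(Default)]
pub struct PeerTable {
    peers: HashMap<String, SocketAddr>,
}

impl PeerTable {
    /// `listen_addr` must come from mDNS or from Hello/HelloAck. The source
    /// address of the QUIC connection is an ephemeral port and is never
    /// passed in here.
    pub fn record(&mut self, peer_id: &str, listen_addr: SocketAddr) -> PeerUpdate {
        match self.peers.insert(peer_id.to_owned(), listen_addr) {
            None => PeerUpdate::New,
            Some(old) if old != listen_addr => PeerUpdate::AddressChanged { old },
            Some(_) => PeerUpdate::Unchanged,
        }
    }

    pub fn addr_of(&self, peer_id: &str) -> Option<SocketAddr> {
        self.peers.get(peer_id).copied()
    }
}
```

A peer that comes back with a new IP address keeps its identity and any library state stored under the same peer_id; only the address entry changes.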

Roadmap from current design to this one

  1. Protocol updates in lanspread-proto:
    • Define Hello, HelloAck, LibrarySummary, LibrarySnapshot, LibraryDelta, and optional Goodbye messages.
    • Thread peer_id, library_rev, and manifest_hash through all library and manifest-bearing types.
    • Make Hello and HelloAck carry the sender's listen_addr, library_rev, and library_digest so both sides can record stable listener addresses and immediately select LibraryDelta vs LibrarySnapshot.
  2. Peer identity:
    • Persist a stable peer_id (UUID) in the peer config and inject it into PeerInfo and PeerGameDB at startup.
    • Track peer_id -> SocketAddr in the discovery table and update the address on any incoming handshake or mDNS refresh.
  3. Discovery handshake:
    • Publish peer_id and library_rev in mDNS TXT records to avoid immediate QUIC round trips when nothing changed.
    • Add a lightweight handshake in run_peer_discovery that exchanges Hello/HelloAck before any library sync.
    • Ignore peers that do not advertise the current protocol version.
  4. Library revisioning:
    • Store a monotonic library_rev locally and increment only after a successful index refresh completes.
    • Apply LibraryDelta when its from_rev matches the known library_rev; ignore stale deltas (to_rev <= known rev), and request LibrarySnapshot when the revisions do not line up.
    • Cache the last accepted manifest_hash per peer to short-circuit manifest requests when unchanged.
  5. Local index + scan optimizations:
    • Introduce a cached index file (e.g., .lanspread/index.json) that stores per-root fingerprints and computed manifests.
    • Use filesystem watchers with a debounce window to collect changes and incrementally update the cache.
    • Schedule a low-frequency full scan to reconcile missed watcher events.
  6. Announce updates:
    • Broadcast LibraryDelta updates keyed by library_rev.
    • Send LibrarySummary on new connections to seed the delta flow.
  7. File manifest caching:
    • Store per-game manifest_hash and only fetch details when changed.
  8. Liveness:
    • Reduce ping frequency; update last_seen on any message.
    • Add optional Goodbye on shutdown paths.
  9. Tests:
    • Delta apply/merge, rev ordering, manifest hashing, and scan cache behavior.