lanspread

Author	SHA1	Message	Date
ddidderr	ce51d92df0	refactor(peer): tighten listener-addr handshake invariant Follow-up hardening for `348a02c`, where `listen_addr` was added to Hello and HelloAck as `Option<SocketAddr>`. Code review surfaced three concrete problems that the previous commit left open: 1. Cold-start asymmetry. Discovery and the QUIC/mDNS advertiser are spawned concurrently. If discovery saw a cached peer advertisement before our own advertiser had written `ctx.local_peer_addr`, our outbound Hello carried `listen_addr: None`. The receiver's `peer_record_addr` then returned `None` and silently dropped the Hello while we still recorded their HelloAck, so peer A learned about peer B but B never learned about A until a later handshake happened to win the race. 2. Duplicate game-list pipeline. The previous commit added `refresh_peer_games`, which post-handshake issued a `ListGames` to fetch `peer.games`. The library-sync path (`LibrarySnapshot`) already populates the same field. Both could race on first contact and overwrite each other. Worse, `refresh_peer_games` was misnamed: a `peer_game_count > 0` guard turned it into a fetch-once-then-no-op helper, while `handle_library_summary` independently re-triggered a full handshake when `previous_count == 0` was observed, producing a redundant ping-pong on every first contact. 3. Argument explosion. `perform_handshake_with_peer`, `spawn_library_resync`, and `after_peer_library_recorded` had grown to 6-8 individual parameters and acquired `#[allow(clippy::too_many_arguments)]` opt-outs. Every caller was destructuring the same fields out of `Ctx`/`PeerCtx`. Changes (all in one commit because they jointly enforce the same invariant: "a peer is only ever recorded by its listener address, and the local listener address must exist before we participate in the protocol"): - `Hello.listen_addr` and `HelloAck.listen_addr` are now `SocketAddr`, not `Option<SocketAddr>`. 
Wire-incompatible, but PROTOCOL_VERSION already moved to 3 in `348a02c` so no additional version bump is needed. - `required_listen_addr` reads `ctx.local_peer_addr` and returns an `eyre::Result`; `build_hello_from_state` and `build_hello_ack` both call it, so an outbound or inbound Hello can no longer be constructed before the local QUIC listener is bound. The inbound path maps this into a `Response::InternalPeerError` so the remote peer fails cleanly instead of seeing a malformed HelloAck. - `run_peer_discovery` blocks on `wait_for_local_peer_addr` (25 ms poll, shutdown-aware) before subscribing to the mDNS browser. This closes the cold-start race for outbound handshakes at the source. - `refresh_peer_games`, `request_game_list_from_peer`, and the `previous_count == 0` re-handshake trigger are removed. The post-handshake flow now relies solely on `LibrarySummary`/`LibrarySnapshot`/`LibraryDelta` for peer-library state; `ListGames` survives only for the `request_game_details_*` paths that fetch per-game file descriptions on demand. - New `HandshakeCtx` (with `from_ctx` and `from_peer_ctx` constructors) replaces the long argument lists. All `too_many_arguments` allow-attrs in `handshake.rs` are gone, and call sites in `handlers.rs`, `discovery.rs`, and `stream.rs` collapse to a single clone. - `handle_library_delta` no longer acquires a read lock on the apply path: the `peer_addr` lookup moved into the `else` resync branch where it is actually needed. - `accept_inbound_hello`'s `remote_addr` parameter is renamed to `transport_addr`. It is now used only for warn-log formatting, and the new name signals that this is the ephemeral QUIC source port, never the authoritative listener address that gets recorded. User-visible effect: on cold start, peers can no longer end up with an asymmetric view of each other ("A sees B but B never sees A"). 
First-contact library sync now does one handshake plus one snapshot/delta exchange instead of the previous handshake + ListGames + redundant follow-up handshake. The direct-connect CLI path (`handle_connect_peer_command`) now fails fast with "local peer listener address is not ready" if invoked before the QUIC server has bound; this is intentional - the previous behaviour would have sent a Hello that the receiver had to silently discard. Test Plan: - just fmt - just clippy - just test (80 peer + 3 cli + 5 tauri tests pass) - just build - Manual: bring up `just peer-cli-alpha`/`bravo`/`charlie`, confirm symmetric peer discovery and that games show up on every side after one library digest cycle, with no duplicated ListGames traffic in trace logs. Refs: Review feedback on commit `348a02c` (listener-address handshake fix). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 18:21:19 +02:00
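The listener-address invariant described in commit `ce51d92df0` can be pictured with a minimal sketch. This is not the repository's code: `Ctx` and `Hello` are reduced to the fields the message mentions, `build_hello` is a hypothetical stand-in for `build_hello_from_state`, and a plain `String` error replaces the `eyre::Result` the commit names so the example needs only the standard library.

```rust
use std::net::SocketAddr;
use std::sync::RwLock;

/// Hypothetical stand-in for the runtime context; only the field the
/// commit names (`local_peer_addr`) is modelled here.
pub struct Ctx {
    pub local_peer_addr: RwLock<Option<SocketAddr>>,
}

/// Simplified Hello: the listener address is mandatory, not `Option`.
#[derive(Debug, PartialEq)]
pub struct Hello {
    pub peer_id: String,
    pub listen_addr: SocketAddr,
}

/// Returns the bound listener address, or an error while the listener
/// has not published it yet. (Assumption: String error instead of eyre.)
pub fn required_listen_addr(ctx: &Ctx) -> Result<SocketAddr, String> {
    let guard = ctx.local_peer_addr.read().map_err(|e| e.to_string())?;
    let addr = *guard;
    addr.ok_or_else(|| "local peer listener address is not ready".to_string())
}

/// A Hello cannot be constructed before the listener address exists.
pub fn build_hello(ctx: &Ctx, peer_id: &str) -> Result<Hello, String> {
    Ok(Hello {
        peer_id: peer_id.to_string(),
        listen_addr: required_listen_addr(ctx)?,
    })
}
```

Because `listen_addr` is no longer optional, the "Hello without an address" state cannot be represented at all; the only remaining failure mode is the explicit error from `required_listen_addr`.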
ddidderr	348a02c35f	fix(peer): record listener addresses during handshakes Peers discovered over mDNS could still attribute later library sync traffic to temporary QUIC source ports. In a real GUI LAN run this made Host B try to push its library to Host A's outbound port instead of Host A's advertised listener, so Host A discovered the peer but never saw its games. Carry the stable listener address in Hello and HelloAck, and key library sync messages by peer_id instead of inferring identity from the transport source address. The handshake path now explicitly refreshes an empty peer library from the known listener address, matching the reliability of the direct-connect CLI path without overwriting richer snapshot state when it already arrived. This changes the current wire protocol, so PROTOCOL_VERSION is bumped to 3 and all peers must be rebuilt together. The architecture note now documents that listener addresses come from mDNS or Hello/HelloAck, never from ephemeral QUIC source ports. Test Plan: - just fmt - just test - just clippy - just build - git diff --check Refs: Local Linux/Win11 GUI LAN test logs from 2026-05-18.	2026-05-18 17:27:15 +02:00
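The idea in commit `348a02c35f` of keying library sync by `peer_id` and recording only the advertised listener address might look roughly like the following. `PeerTable`, `PeerRecord`, `record_hello`, and `apply_library_summary` are hypothetical names chosen for illustration; the real peer database is not shown in the log.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Hypothetical per-peer record: the stable listener address plus a
/// game count standing in for the peer's library state.
pub struct PeerRecord {
    pub listen_addr: SocketAddr,
    pub game_count: usize,
}

/// Hypothetical peer table keyed by peer id, never by socket address.
#[derive(Default)]
pub struct PeerTable {
    peers: HashMap<String, PeerRecord>,
}

impl PeerTable {
    /// Record a peer from its Hello. `_transport_addr` is the ephemeral
    /// QUIC source address; it is deliberately not stored.
    pub fn record_hello(
        &mut self,
        peer_id: &str,
        listen_addr: SocketAddr,
        _transport_addr: SocketAddr,
    ) {
        self.peers
            .entry(peer_id.to_string())
            .and_modify(|rec| rec.listen_addr = listen_addr)
            .or_insert(PeerRecord { listen_addr, game_count: 0 });
    }

    /// Apply a library summary identified by peer id. Returns the
    /// listener address to sync against, or `None` for an unknown peer.
    pub fn apply_library_summary(
        &mut self,
        peer_id: &str,
        game_count: usize,
    ) -> Option<SocketAddr> {
        let rec = self.peers.get_mut(peer_id)?;
        rec.game_count = game_count;
        Some(rec.listen_addr)
    }
}
```

With this shape, a library message arriving from any source port resolves to the same record, and follow-up traffic is always directed at the advertised listener.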
ddidderr	e711cf3454	fix(peer): settle current-protocol local state cleanup The follow-up backlog had drifted into three settled peer/runtime issues: the legacy game-list fallback contradicted the one-wire-version policy, the Tauri shell still re-derived local install state from disk after peer snapshots, and `Availability::Downloading` existed even though active operations are already reported through a separate operation table. Remove the legacy `AnnounceGames` request and fallback service. Discovery now ignores peers that do not advertise the current protocol and a peer id, and library changes are sent through the current delta path only. This keeps the runtime aligned with the documented current-build-only interoperability model. Make peer `LocalGamesUpdated` snapshots authoritative for local fields in the Tauri database. The GUI-side catalog still owns static metadata such as names, sizes, and descriptions, but downloaded, installed, local version, and availability now come from the peer runtime instead of a second whole-library filesystem scan. Snapshot reconciliation also pins the missing-begin and missing-finish lifecycle cases in tests. Collapse availability back to the settled `Ready` and `LocalOnly` states. Aggregation now counts only `Ready` peers as download sources, and the frontend no longer carries a dead `Downloading` enum value. The core peer also exposes the small non-GUI hooks needed by scripted callers: startup options for state and mDNS, a local-ready event, direct connection, peer snapshots, and an explicit post-download install policy. Those hooks reuse the same current protocol path and do not add compatibility shims. Test Plan: - `git diff --check` - `just fmt` - `just clippy` - `just test` Refs: BACKLOG.md, FINDINGS.md, IMPL_DECISIONS.md	2026-05-16 18:32:24 +02:00
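As a small illustration of the availability change in commit `e711cf3454`, here is a sketch in which only `Ready` peers count as download sources. The `Ready` and `LocalOnly` variant names come from the commit message; the function name `download_source_count` and the slice-based signature are assumptions.

```rust
/// The two settled availability states named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Availability {
    Ready,
    LocalOnly,
}

/// Hypothetical aggregation helper: counts the peers that can serve a
/// game. `LocalOnly` peers hold the game but are not download sources.
pub fn download_source_count(peers: &[Availability]) -> usize {
    peers
        .iter()
        .filter(|a| **a == Availability::Ready)
        .count()
}
```

In-flight downloads are not represented here at all, matching the message's point that active operations are reported through a separate operation table.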
ddidderr	2bbd2ac869	refactor(peer): adopt structured concurrency with supervised shutdown Replace the detached tokio::spawn pattern in the peer runtime with a supervised model built on tokio_util's CancellationToken and TaskTracker. Long-lived services and child tasks now have an explicit parent, a cancellation path, and a join point. Tauri can request a clean shutdown on app exit instead of leaking work into process termination. Background ~~~~~~~~~~ start_peer() previously returned only a command sender. The four startup services (QUIC server, mDNS discovery, peer liveness, local library monitor) and their child tasks (ping workers, handshake jobs, download workers, announcement fan-outs, connection/stream handlers) were spawned with raw tokio::spawn and detached. Closing the command channel sent Goodbye notifications but did not stop those services. The mDNS blocking worker had no cancellation path at all. Active downloads were stored as JoinHandle<()> and force-aborted, which could interrupt file writes mid-chunk. Supervisor ~~~~~~~~~~ The runtime now owns a CancellationToken and a TaskTracker, threaded through Ctx and PeerCtx. Each long-lived service is spawned through a small supervisor (spawn_supervised_service) that wraps the service in catch_unwind and enforces an explicit SupervisionPolicy: QuicServer: Required (fatal; cancels the runtime if it dies) Discovery: Restart(5s) (matches the prior self-restart loop) Liveness: Restart(5s) LocalMonitor: BestEffort (logs and exits, no restart) A Required failure emits a new RuntimeFailed { component, error } event to the UI and cancels the runtime; the command loop and goodbye notifications still run to completion. The Tauri layer forwards the event as "peer-runtime-failed" so a future UI can surface it. mDNS cancellation ~~~~~~~~~~~~~~~~~ MdnsBrowser previously blocked on receiver.recv() forever. 
It now exposes next_service_timeout(Duration) returning an MdnsServicePoll enum (Service/Timeout/Closed) via recv_timeout(). The discovery worker polls at 250ms and checks the shutdown flag between ticks, so cancellation reaches the blocking thread within one poll interval instead of waiting for the next mDNS event. Downloads ~~~~~~~~~ active_downloads is now HashMap<String, CancellationToken>. Each download gets a child token of the runtime shutdown, checked at chunk and peer-attempt boundaries (never inside file writes). When all peers with a game disappear, liveness cancels the token and emits DownloadGameFilesAllPeersGone; the download exits Ok(()) without emitting a duplicate Failed event. DownloadStateGuard (context.rs) is held inside the download task and clears downloading_games + active_downloads on Drop, covering the happy path, error returns, cancellation, and task abort. Drop falls back to spawning the cleanup if write-lock contention prevents try_write. Public API and Tauri integration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ start_peer() now returns PeerRuntimeHandle exposing: fn sender(&self) -> UnboundedSender<PeerCommand> fn shutdown(&self) async fn wait_stopped(&mut self) The Tauri layer stores the handle in managed state and switches its main loop from .run(ctx) to .build(ctx).run(|h, e| ...). On RunEvent::Exit it calls handle.shutdown() and blocks up to 2s on wait_stopped(), giving services time to cancel and Goodbye packets time to flush over a healthy LAN while staying short enough not to delay process exit noticeably on a dead network. The command loop distinguishes graceful shutdown from unexpected channel closure: if recv() returns None and shutdown.is_cancelled() is set, the loop returns Ok(()) silently. Only an unexpected close (no cancellation observed) still emits RuntimeFailed. This avoids a spurious failure event on every normal app close. 
User-visible behavior changes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - Closing the app no longer leaks services into process termination; Goodbye notifications are reliably attempted before exit. - Downloads cancel cleanly (between chunks) instead of force-aborting mid-write. - A new "peer-runtime-failed" Tauri event fires when a Required service cannot recover. No frontend handler exists yet — that is a follow-up. Tradeoffs ~~~~~~~~~ - Workspace tokio-util now requires the "rt" feature for TaskTracker. - The mDNS worker still runs in spawn_blocking and may stay parked briefly between 250ms polls — acceptable for a desktop app. - The 2s shutdown timeout on app exit is a deliberate compromise. Tests ~~~~~ New unit tests: - DownloadStateGuard clears tracking on completion, cancellation, and parent-task abort (context.rs). - Required failure cancels the runtime and emits RuntimeFailed (startup.rs). - Restart policy restarts until shutdown is requested (startup.rs). - PeerRuntimeHandle.shutdown() observable via wait_stopped() (startup.rs). - Peers-gone cancellation emits only PeersGone, no duplicate Failed (services/liveness.rs). Test plan ~~~~~~~~~ cargo test --workspace cargo clippy --workspace --all-targets Manual smoke test on two peers on the same LAN: 1. Start a download, verify chunks transfer. 2. Close the receiving app mid-download — verify the sending peer logs a Goodbye, not a connection-reset error. 3. Stop the sending peer mid-download — verify the receiver emits DownloadGameFilesAllPeersGone, not Failed. Follow-ups ~~~~~~~~~~ - Frontend handler for "peer-runtime-failed". - Consider exposing the runtime handle's stopped watch to the frontend for a reconnecting indicator on Required failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 07:53:51 +02:00
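The mDNS cancellation scheme from commit `2bbd2ac869` (a timed receive plus a shutdown check between ticks) can be sketched with the standard library alone. The names `next_service_timeout` and `MdnsServicePoll` follow the commit message, but everything else is an assumption: `std::sync::mpsc` stands in for whatever channel the real mDNS browser exposes, an `AtomicBool` stands in for the shutdown flag, and `run_discovery` is a hypothetical worker loop.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Outcome of one timed poll, mirroring the Service/Timeout/Closed
/// variants named in the commit message.
#[derive(Debug, PartialEq)]
pub enum MdnsServicePoll<T> {
    Service(T),
    Timeout,
    Closed,
}

/// Wait at most `timeout` for the next service event.
pub fn next_service_timeout<T>(rx: &Receiver<T>, timeout: Duration) -> MdnsServicePoll<T> {
    match rx.recv_timeout(timeout) {
        Ok(service) => MdnsServicePoll::Service(service),
        Err(RecvTimeoutError::Timeout) => MdnsServicePoll::Timeout,
        Err(RecvTimeoutError::Disconnected) => MdnsServicePoll::Closed,
    }
}

/// Hypothetical blocking worker: collects services until shutdown is
/// requested or the channel closes. The flag is re-checked after every
/// poll, so cancellation is observed within one poll interval.
pub fn run_discovery<T>(rx: &Receiver<T>, shutdown: &AtomicBool, poll: Duration) -> Vec<T> {
    let mut seen = Vec::new();
    while !shutdown.load(Ordering::Relaxed) {
        match next_service_timeout(rx, poll) {
            MdnsServicePoll::Service(service) => seen.push(service),
            MdnsServicePoll::Timeout => continue,
            MdnsServicePoll::Closed => break,
        }
    }
    seen
}
```

The trade-off is the one the commit's Tradeoffs section notes: the worker may stay parked for up to one poll interval (250ms in the commit) after shutdown is requested, in exchange for never blocking indefinitely on the next mDNS event.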
ddidderr	b4585b663a	ChatGPT Codex 5.5 xhigh refactored even more	2026-05-02 15:31:37 +02:00

5 Commits