refactor(peer): adopt structured concurrency with supervised shutdown

Replace the detached tokio::spawn pattern in the peer runtime with a
supervised model built on tokio_util's CancellationToken and TaskTracker.
Long-lived services and child tasks now have an explicit parent, a
cancellation path, and a join point. Tauri can request a clean shutdown
on app exit instead of leaking work into process termination.

Background
~~~~~~~~~~

start_peer() previously returned only a command sender. The four startup
services (QUIC server, mDNS discovery, peer liveness, local library
monitor) and their child tasks (ping workers, handshake jobs, download
workers, announcement fan-outs, connection/stream handlers) were spawned
with raw tokio::spawn and detached. Closing the command channel sent
Goodbye notifications but did not stop those services. The mDNS blocking
worker had no cancellation path at all. Active downloads were stored as
JoinHandle<()> and force-aborted, which could interrupt file writes
mid-chunk.

Supervisor
~~~~~~~~~~

The runtime now owns a CancellationToken and a TaskTracker, threaded
through Ctx and PeerCtx. Each long-lived service is spawned through a
small supervisor (spawn_supervised_service) that wraps the service in
catch_unwind and enforces an explicit SupervisionPolicy:

  QuicServer:    Required     (fatal; cancels the runtime if it dies)
  Discovery:     Restart(5s)  (matches the prior self-restart loop)
  Liveness:      Restart(5s)
  LocalMonitor:  BestEffort   (logs and exits, no restart)

A Required failure emits a new RuntimeFailed { component, error } event
to the UI and cancels the runtime; the command loop and goodbye
notifications still run to completion. The Tauri layer forwards the
event as "peer-runtime-failed" so a future UI can surface it.
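
As a rough illustration of the policy table above, the decision the
supervisor makes after a service returns can be sketched as a pure
function. This is a std-only sketch, not the code in startup.rs: NextStep
and next_step are hypothetical names, and the error is simplified to a
String in place of eyre::Report.

```rust
use std::time::Duration;

/// Mirrors the policy enum introduced by this commit.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SupervisionPolicy {
    Required,
    Restart { backoff: Duration },
    BestEffort,
}

/// Hypothetical summary of what the supervisor does next.
#[derive(Debug, PartialEq)]
pub enum NextStep {
    /// Emit RuntimeFailed with this message and cancel the runtime.
    FailRuntime(String),
    /// Wait for the backoff (or shutdown), then run the service again.
    RestartAfter(Duration),
    /// Log the result and let the supervisor task end.
    Stop,
}

/// Decides the next step once a service has returned `result`.
pub fn next_step(
    policy: SupervisionPolicy,
    result: Result<(), String>,
    shutdown_requested: bool,
) -> NextStep {
    // A service that stops because shutdown was requested is not a failure.
    if shutdown_requested {
        return NextStep::Stop;
    }
    match policy {
        // Even a clean exit is fatal for a Required service.
        SupervisionPolicy::Required => NextStep::FailRuntime(
            result
                .err()
                .unwrap_or_else(|| "component exited unexpectedly".to_string()),
        ),
        SupervisionPolicy::Restart { backoff } => NextStep::RestartAfter(backoff),
        SupervisionPolicy::BestEffort => NextStep::Stop,
    }
}
```

Keeping the decision separate from the spawning code would make each
policy testable without a runtime; the actual supervisor inlines the same
logic in its loop.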

mDNS cancellation
~~~~~~~~~~~~~~~~~

MdnsBrowser previously blocked on receiver.recv() forever. It now
exposes next_service_timeout(Duration) returning an MdnsServicePoll
enum (Service/Timeout/Closed) via recv_timeout(). The discovery worker
polls every 250ms and checks the shutdown flag between polls, so
cancellation reaches the blocking thread within one poll interval
instead of waiting for the next mDNS event.
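
The bounded-poll idea can be sketched with the standard library alone.
This is an illustration under assumptions, not the MdnsBrowser code: it
uses std::sync::mpsc as a stand-in for the real mDNS receiver and an
AtomicBool as the shutdown flag, and run_discovery_worker is a
hypothetical name.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Outcome of one bounded poll; variant names follow the text above.
#[derive(Debug, PartialEq)]
pub enum MdnsServicePoll<T> {
    Service(T),
    Timeout,
    Closed,
}

/// Waits at most `timeout` for the next item instead of blocking forever.
pub fn next_service_timeout<T>(rx: &Receiver<T>, timeout: Duration) -> MdnsServicePoll<T> {
    match rx.recv_timeout(timeout) {
        Ok(service) => MdnsServicePoll::Service(service),
        Err(RecvTimeoutError::Timeout) => MdnsServicePoll::Timeout,
        Err(RecvTimeoutError::Disconnected) => MdnsServicePoll::Closed,
    }
}

/// Hypothetical worker loop: collects services until shutdown or close.
pub fn run_discovery_worker<T>(
    rx: &Receiver<T>,
    shutdown: &AtomicBool,
    poll: Duration,
) -> Vec<T> {
    let mut seen = Vec::new();
    // The flag is re-checked after every poll, so the loop exits at most
    // one poll interval after shutdown is requested.
    while !shutdown.load(Ordering::SeqCst) {
        match next_service_timeout(rx, poll) {
            MdnsServicePoll::Service(service) => seen.push(service),
            MdnsServicePoll::Timeout => {} // nothing arrived; check the flag again
            MdnsServicePoll::Closed => break,
        }
    }
    seen
}
```

The poll interval trades shutdown latency against wakeups: a shorter
interval reacts faster but wakes the blocking thread more often.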

Downloads
~~~~~~~~~

active_downloads is now HashMap<String, CancellationToken>. Each
download gets a child token of the runtime shutdown token, checked at
chunk and peer-attempt boundaries (never inside file writes). When all
peers with a game disappear, liveness cancels the token and emits
DownloadGameFilesAllPeersGone; the download returns Ok(()) without
emitting a duplicate Failed event.
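
A minimal sketch of cancellation at chunk boundaries, assuming an
AtomicBool in place of the CancellationToken and an in-memory Vec in
place of the file. DownloadOutcome, download_chunks, and the after_chunk
hook are hypothetical names used only for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical outcome type for this sketch.
#[derive(Debug, PartialEq)]
pub enum DownloadOutcome {
    Completed,
    Cancelled,
}

/// Appends each chunk to `sink`, checking `cancelled` only between chunks.
/// `after_chunk` runs once a chunk has been written in full.
pub fn download_chunks(
    chunks: &[&[u8]],
    cancelled: &AtomicBool,
    sink: &mut Vec<u8>,
    mut after_chunk: impl FnMut(usize),
) -> DownloadOutcome {
    for (index, chunk) in chunks.iter().enumerate() {
        // Cancellation is observed here, before a write starts.
        if cancelled.load(Ordering::SeqCst) {
            return DownloadOutcome::Cancelled;
        }
        // The write itself always runs to completion.
        sink.extend_from_slice(chunk);
        after_chunk(index);
    }
    DownloadOutcome::Completed
}
```

If the flag is set while the second of three chunks is being handled, the
sink ends up holding exactly two complete chunks and the function reports
Cancelled rather than an error.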

DownloadStateGuard (context.rs) is held inside the download task and
clears downloading_games + active_downloads on Drop, covering the happy
path, error returns, cancellation, and task abort. If try_write fails
because the write lock is contended, Drop spawns the cleanup instead of
blocking.
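
The guard pattern can be sketched as follows. This is a simplified,
std-only illustration: it tracks a single HashSet behind std::sync::RwLock
and defers contended cleanup to a plain thread, whereas the guard in
context.rs works with the async runtime's locks and clears both maps. The
field layout and constructor here are assumptions.

```rust
use std::collections::HashSet;
use std::sync::{Arc, RwLock};

/// Sketch of a guard that untracks a game id when it is dropped.
pub struct DownloadStateGuard {
    downloading_games: Arc<RwLock<HashSet<String>>>,
    game_id: String,
}

impl DownloadStateGuard {
    /// Registers `game_id` as downloading and returns the guard.
    pub fn new(downloading_games: Arc<RwLock<HashSet<String>>>, game_id: &str) -> Self {
        downloading_games
            .write()
            .expect("lock poisoned")
            .insert(game_id.to_string());
        Self {
            downloading_games,
            game_id: game_id.to_string(),
        }
    }
}

impl Drop for DownloadStateGuard {
    fn drop(&mut self) {
        match self.downloading_games.try_write() {
            // Uncontended: clean up synchronously.
            Ok(mut games) => {
                games.remove(&self.game_id);
            }
            // Contended: defer the cleanup rather than block inside Drop.
            Err(_) => {
                let games = Arc::clone(&self.downloading_games);
                let game_id = self.game_id.clone();
                std::thread::spawn(move || {
                    if let Ok(mut games) = games.write() {
                        games.remove(&game_id);
                    }
                });
            }
        }
    }
}
```

Because the cleanup lives in Drop, every exit path of the owning scope
(normal return, early error return, or unwinding) clears the entry
without explicit calls at each return site.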

Public API and Tauri integration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

start_peer() now returns PeerRuntimeHandle exposing:

  fn sender(&self) -> UnboundedSender<PeerCommand>
  fn shutdown(&self)
  async fn wait_stopped(&mut self)

The Tauri layer stores the handle in managed state and switches its
main loop from .run(ctx) to .build(ctx).run(|h, e| ...). On
RunEvent::Exit it calls handle.shutdown() and blocks up to 2s on
wait_stopped(), giving services time to cancel and Goodbye packets time
to flush over a healthy LAN while staying short enough not to delay
process exit noticeably on a dead network.

The command loop distinguishes graceful shutdown from unexpected
channel closure: if recv() returns None and shutdown.is_cancelled()
returns true, the loop returns Ok(()) silently. Only an unexpected close
(no cancellation observed) still emits RuntimeFailed. This avoids a
spurious failure event on every normal app close.
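
The distinction can be illustrated with a synchronous stand-in. This
sketch assumes std::sync::mpsc instead of the tokio channel and an
AtomicBool instead of the CancellationToken; run_command_loop and its
return type are hypothetical, and the real loop reports the unexpected
case through a RuntimeFailed event rather than an Err.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::Receiver;

/// Drains commands until the channel closes. Returns the number of
/// commands handled on a graceful close, or an error message when the
/// channel closed without a shutdown request.
pub fn run_command_loop(
    rx: &Receiver<String>,
    shutdown: &AtomicBool,
) -> Result<usize, String> {
    let mut handled = 0;
    loop {
        match rx.recv() {
            Ok(_command) => handled += 1,
            // Closed after shutdown was requested: expected, stay silent.
            Err(_) if shutdown.load(Ordering::SeqCst) => return Ok(handled),
            // Closed with no shutdown request: treat as a failure.
            Err(_) => return Err("command channel closed unexpectedly".to_string()),
        }
    }
}
```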

User-visible behavior changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Closing the app no longer leaks services into process termination;
  Goodbye notifications are reliably attempted before exit.
- Downloads cancel cleanly (between chunks) instead of force-aborting
  mid-write.
- A new "peer-runtime-failed" Tauri event fires when a Required service
  cannot recover. No frontend handler exists yet; that is a follow-up.

Tradeoffs
~~~~~~~~~

- Workspace tokio-util now requires the "rt" feature for TaskTracker.
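- The mDNS worker still runs in spawn_blocking and may stay parked for
  up to one 250ms poll interval after cancellation, which is acceptable
  for a desktop app.
- The 2s shutdown timeout on app exit is a deliberate compromise: long
  enough for Goodbye packets to flush on a healthy LAN, short enough not
  to delay exit noticeably on a dead network.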

Tests
~~~~~

New unit tests:
  - DownloadStateGuard clears tracking on completion, cancellation, and
    parent-task abort (context.rs).
  - Required failure cancels the runtime and emits RuntimeFailed
    (startup.rs).
  - Restart policy restarts until shutdown is requested (startup.rs).
  - PeerRuntimeHandle::shutdown() is observable via wait_stopped()
    (startup.rs).
  - Peers-gone cancellation emits only DownloadGameFilesAllPeersGone,
    no duplicate Failed (services/liveness.rs).

Test plan
~~~~~~~~~

  cargo test --workspace
  cargo clippy --workspace --all-targets

Manual smoke test on two peers on the same LAN:
  1. Start a download, verify chunks transfer.
  2. Close the receiving app mid-download and verify that the sending
     peer logs a Goodbye, not a connection-reset error.
  3. Stop the sending peer mid-download and verify that the receiver
     emits DownloadGameFilesAllPeersGone, not Failed.

Follow-ups
~~~~~~~~~~

- Frontend handler for "peer-runtime-failed".
- Consider exposing the runtime handle's stopped watch to the frontend
  for a reconnecting indicator on Required failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:53:51 +02:00
parent 84665cacf0
commit 2bbd2ac869
18 changed files with 1104 additions and 239 deletions
+366 -23
@@ -1,16 +1,29 @@
//! Peer runtime task startup and shutdown orchestration.
use std::{net::SocketAddr, path::PathBuf, sync::Arc, time::Duration};
use std::{
any::Any,
future::Future,
net::SocketAddr,
panic::AssertUnwindSafe,
path::PathBuf,
sync::Arc,
time::Duration,
};
use futures::FutureExt as _;
use tokio::sync::{
RwLock,
mpsc::{UnboundedReceiver, UnboundedSender},
watch,
};
use tokio_util::{sync::CancellationToken, task::TaskTracker};
use crate::{
PeerCommand,
PeerEvent,
PeerRuntimeComponent,
context::Ctx,
events,
network::send_goodbye,
peer_db::PeerGameDB,
run_peer,
@@ -22,19 +35,86 @@ use crate::{
},
};
/// Handle to a running peer runtime.
///
/// Holds the command sender plus the runtime's shutdown token and a `stopped`
/// signal so callers can request a clean shutdown and wait for goodbye
/// notifications to flush.
pub struct PeerRuntimeHandle {
tx: UnboundedSender<PeerCommand>,
shutdown: CancellationToken,
stopped: watch::Receiver<bool>,
}
impl PeerRuntimeHandle {
/// Returns a clone of the command channel sender.
#[must_use]
pub fn sender(&self) -> UnboundedSender<PeerCommand> {
self.tx.clone()
}
/// Signals the runtime to shut down. Idempotent.
pub fn shutdown(&self) {
self.shutdown.cancel();
}
/// Resolves once the runtime task has fully stopped (services drained,
/// goodbye notifications sent). Returns even if the runtime stopped
/// without an explicit shutdown request.
pub async fn wait_stopped(&mut self) {
let _ = self.stopped.wait_for(|stopped| *stopped).await;
}
}
#[derive(Clone, Copy, Debug)]
pub(crate) enum SupervisionPolicy {
Required,
Restart { backoff: Duration },
BestEffort,
}
pub(crate) fn spawn_peer_runtime(
tx_control: UnboundedSender<PeerCommand>,
rx_control: UnboundedReceiver<PeerCommand>,
tx_notify_ui: UnboundedSender<PeerEvent>,
peer_game_db: Arc<RwLock<PeerGameDB>>,
peer_id: String,
game_dir: PathBuf,
) {
) -> PeerRuntimeHandle {
let shutdown = CancellationToken::new();
let task_tracker = TaskTracker::new();
let (tx_stopped, stopped) = watch::channel(false);
let runtime_shutdown = shutdown.clone();
let runtime_tracker = task_tracker.clone();
tokio::spawn(async move {
if let Err(err) = run_peer(rx_control, tx_notify_ui, peer_game_db, peer_id, game_dir).await
if let Err(err) = run_peer(
rx_control,
tx_notify_ui,
peer_game_db,
peer_id,
game_dir,
runtime_shutdown.clone(),
runtime_tracker.clone(),
)
.await
{
log::error!("Peer system failed: {err}");
}
runtime_shutdown.cancel();
runtime_tracker.close();
runtime_tracker.wait().await;
if tx_stopped.send(true).is_err() {
log::debug!("Peer runtime stopped after handle was dropped");
}
});
PeerRuntimeHandle {
tx: tx_control,
shutdown,
stopped,
}
}
pub(crate) fn spawn_startup_services(ctx: &Ctx, tx_notify_ui: &UnboundedSender<PeerEvent>) {
@@ -60,21 +140,43 @@ fn spawn_quic_server(ctx: &Ctx, tx_notify_ui: &UnboundedSender<PeerEvent>) {
let server_addr = SocketAddr::from(([0, 0, 0, 0], 0));
let peer_ctx = ctx.to_peer_ctx(tx_notify_ui.clone());
let tx_notify_ui = tx_notify_ui.clone();
let supervisor_tx = tx_notify_ui.clone();
tokio::spawn(async move {
if let Err(err) = run_server_component(server_addr, peer_ctx, tx_notify_ui).await {
log::error!("Server component error: {err}");
}
});
spawn_supervised_service(
&ctx.task_tracker,
&ctx.shutdown,
&supervisor_tx,
PeerRuntimeComponent::QuicServer,
SupervisionPolicy::Required,
move || {
let peer_ctx = peer_ctx.clone();
let tx_notify_ui = tx_notify_ui.clone();
async move { run_server_component(server_addr, peer_ctx, tx_notify_ui).await }
},
);
}
fn spawn_peer_discovery_service(ctx: &Ctx, tx_notify_ui: &UnboundedSender<PeerEvent>) {
let ctx = ctx.clone();
let tx_notify_ui = tx_notify_ui.clone();
let task_tracker = ctx.task_tracker.clone();
let shutdown = ctx.shutdown.clone();
let supervisor_tx = tx_notify_ui.clone();
tokio::spawn(async move {
run_peer_discovery(tx_notify_ui, ctx).await;
});
spawn_supervised_service(
&task_tracker,
&shutdown,
&supervisor_tx,
PeerRuntimeComponent::Discovery,
SupervisionPolicy::Restart {
backoff: Duration::from_secs(5),
},
move || {
let ctx = ctx.clone();
let tx_notify_ui = tx_notify_ui.clone();
async move { run_peer_discovery(tx_notify_ui, ctx).await }
},
);
}
fn spawn_peer_liveness_service(ctx: &Ctx, tx_notify_ui: &UnboundedSender<PeerEvent>) {
@@ -82,25 +184,59 @@ fn spawn_peer_liveness_service(ctx: &Ctx, tx_notify_ui: &UnboundedSender<PeerEve
let peer_game_db = ctx.peer_game_db.clone();
let downloading_games = ctx.downloading_games.clone();
let active_downloads = ctx.active_downloads.clone();
let shutdown = ctx.shutdown.clone();
let task_tracker = ctx.task_tracker.clone();
let supervisor_tx = tx_notify_ui.clone();
tokio::spawn(async move {
run_ping_service(
tx_notify_ui,
peer_game_db,
downloading_games,
active_downloads,
)
.await;
});
spawn_supervised_service(
&ctx.task_tracker,
&ctx.shutdown,
&supervisor_tx,
PeerRuntimeComponent::Liveness,
SupervisionPolicy::Restart {
backoff: Duration::from_secs(5),
},
move || {
let tx_notify_ui = tx_notify_ui.clone();
let peer_game_db = peer_game_db.clone();
let downloading_games = downloading_games.clone();
let active_downloads = active_downloads.clone();
let shutdown = shutdown.clone();
let task_tracker = task_tracker.clone();
async move {
run_ping_service(
tx_notify_ui,
peer_game_db,
downloading_games,
active_downloads,
shutdown,
task_tracker,
)
.await
}
},
);
}
fn spawn_local_library_monitor(ctx: &Ctx, tx_notify_ui: &UnboundedSender<PeerEvent>) {
let ctx = ctx.clone();
let tx_notify_ui = tx_notify_ui.clone();
let task_tracker = ctx.task_tracker.clone();
let shutdown = ctx.shutdown.clone();
let supervisor_tx = tx_notify_ui.clone();
tokio::spawn(async move {
run_local_game_monitor(tx_notify_ui, ctx).await;
});
spawn_supervised_service(
&task_tracker,
&shutdown,
&supervisor_tx,
PeerRuntimeComponent::LocalMonitor,
SupervisionPolicy::BestEffort,
move || {
let ctx = ctx.clone();
let tx_notify_ui = tx_notify_ui.clone();
async move { run_local_game_monitor(tx_notify_ui, ctx).await }
},
);
}
async fn send_goodbye_notification(peer_addr: SocketAddr, peer_id: String) {
@@ -110,3 +246,210 @@ async fn send_goodbye_notification(peer_addr: SocketAddr, peer_id: String) {
Err(_) => log::warn!("Timed out sending Goodbye to {peer_addr}"),
}
}
fn spawn_supervised_service<F, Fut>(
task_tracker: &TaskTracker,
shutdown: &CancellationToken,
tx_notify_ui: &UnboundedSender<PeerEvent>,
component: PeerRuntimeComponent,
policy: SupervisionPolicy,
mut make_service: F,
) where
F: FnMut() -> Fut + Send + 'static,
Fut: Future<Output = eyre::Result<()>> + Send + 'static,
{
let task_tracker = task_tracker.clone();
let shutdown = shutdown.clone();
let tx_notify_ui = tx_notify_ui.clone();
task_tracker.spawn(async move {
loop {
if shutdown.is_cancelled() {
break;
}
let result = match AssertUnwindSafe(make_service()).catch_unwind().await {
Ok(result) => result,
Err(payload) => Err(eyre::eyre!(
"component panicked: {}",
panic_payload_to_string(&payload)
)),
};
if shutdown.is_cancelled() {
break;
}
match policy {
SupervisionPolicy::Required => {
let error = match result {
Ok(()) => "component exited unexpectedly".to_string(),
Err(err) => err.to_string(),
};
report_required_service_failure(&tx_notify_ui, component, error, &shutdown);
break;
}
SupervisionPolicy::Restart { backoff } => {
match result {
Ok(()) => log::warn!("{component:?} exited; restarting in {backoff:?}"),
Err(err) => {
log::error!("{component:?} failed: {err}; restarting in {backoff:?}");
}
}
tokio::select! {
() = shutdown.cancelled() => break,
() = tokio::time::sleep(backoff) => {}
}
}
SupervisionPolicy::BestEffort => {
match result {
Ok(()) => log::warn!("{component:?} exited"),
Err(err) => log::error!("{component:?} failed: {err}"),
}
break;
}
}
}
});
}
fn report_required_service_failure(
tx_notify_ui: &UnboundedSender<PeerEvent>,
component: PeerRuntimeComponent,
error: String,
shutdown: &CancellationToken,
) {
log::error!("{component:?} failed: {error}");
events::send(tx_notify_ui, PeerEvent::RuntimeFailed { component, error });
shutdown.cancel();
}
fn panic_payload_to_string(payload: &(dyn Any + Send)) -> String {
if let Some(message) = payload.downcast_ref::<&'static str>() {
return (*message).to_string();
}
if let Some(message) = payload.downcast_ref::<String>() {
return message.clone();
}
"unknown panic payload".to_string()
}
#[cfg(test)]
mod tests {
use std::{
sync::{
Arc,
atomic::{AtomicUsize, Ordering},
},
time::Duration,
};
use tokio_util::{sync::CancellationToken, task::TaskTracker};
use super::{SupervisionPolicy, spawn_supervised_service};
use crate::{PeerRuntimeComponent, startup::PeerRuntimeHandle};
#[tokio::test]
async fn required_service_failure_cancels_runtime_and_emits_event() {
let tracker = TaskTracker::new();
let shutdown = CancellationToken::new();
let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel();
spawn_supervised_service(
&tracker,
&shutdown,
&tx,
PeerRuntimeComponent::QuicServer,
SupervisionPolicy::Required,
|| async { Err(eyre::eyre!("bind failed")) },
);
let event = tokio::time::timeout(Duration::from_secs(1), rx.recv())
.await
.expect("runtime failure event should arrive")
.expect("event channel should stay open");
assert!(shutdown.is_cancelled());
assert!(matches!(
event,
crate::PeerEvent::RuntimeFailed {
component: PeerRuntimeComponent::QuicServer,
..
}
));
tracker.close();
tokio::time::timeout(Duration::from_secs(1), tracker.wait())
.await
.expect("supervisor task should stop");
}
#[tokio::test]
async fn restart_service_restarts_until_shutdown() {
let tracker = TaskTracker::new();
let shutdown = CancellationToken::new();
let (tx, _rx) = tokio::sync::mpsc::unbounded_channel();
let attempts = Arc::new(AtomicUsize::new(0));
spawn_supervised_service(
&tracker,
&shutdown,
&tx,
PeerRuntimeComponent::Discovery,
SupervisionPolicy::Restart {
backoff: Duration::from_millis(10),
},
{
let attempts = attempts.clone();
move || {
let attempts = attempts.clone();
async move {
attempts.fetch_add(1, Ordering::SeqCst);
Err(eyre::eyre!("discovery worker stopped"))
}
}
},
);
tokio::time::timeout(Duration::from_secs(1), async {
loop {
if attempts.load(Ordering::SeqCst) >= 2 {
break;
}
tokio::task::yield_now().await;
}
})
.await
.expect("restartable service should run more than once");
shutdown.cancel();
tracker.close();
tokio::time::timeout(Duration::from_secs(1), tracker.wait())
.await
.expect("restart supervisor should stop after shutdown");
}
#[tokio::test]
async fn runtime_handle_can_shutdown_and_await_stopped() {
let (tx, _rx) = tokio::sync::mpsc::unbounded_channel();
let shutdown = CancellationToken::new();
let (tx_stopped, stopped) = tokio::sync::watch::channel(false);
let mut handle = PeerRuntimeHandle {
tx,
shutdown: shutdown.clone(),
stopped,
};
tokio::spawn(async move {
shutdown.cancelled().await;
let _ = tx_stopped.send(true);
});
handle.shutdown();
tokio::time::timeout(Duration::from_secs(1), handle.wait_stopped())
.await
.expect("runtime handle should observe stopped");
}
}