test(peer-cli): harden S1-S47 scenario suite against vacuous and flaky checks #29

Merged
ddidderr merged 1 commits from test/peer-cli-scenario-audit into main 2026-06-21 10:28:02 +02:00
Owner

An adversarial audit of the headless peer-to-peer scenario suite
(crates/lanspread-peer-cli/scripts/run_extended_scenarios.py, driven by
just peer-cli-tests) found assertions that passed even when the behavior
they claim to test was not happening, plus timing races and doc-vs-code
divergences. A full baseline run (S1-S47) passed beforehand, confirming these
were test-quality gaps, not peer regressions; the baseline output itself
exposed the worst offenders -- e.g. S14 chunk totals {128 MiB, 1 MiB} (a
two-chunk file whose "balanced within one chunk" check can never fail) and
S16/S18 serving the whole ~120 MiB alienswarm.eti from a single source, so
fanout and retry were never exercised.

Test-correctness fixes (a broken behavior could previously pass green):

  • S18: the "no download-failed" check was dead -- it reused a LineWaiter
    already advanced past download-finished, so it scanned an empty tail.
    Replaced with assert_no_event_since over the whole window. Switched to a
    4*CHUNK_SIZE sparse archive so both peers get chunks; the test now proves the
    download SURVIVES a mid-download source kill (every byte delivered, survivor
    served part, no download-failed, clean diff). Retry-onto-survivor is the
    mechanism but is not asserted: the kill/serve race against docker rm -f
    cannot be forced, so asserting an exact split would be flaky.
  • S7: the only check was a diff against byte-identical ggoo fixtures, so it was
    source-agnostic. Added assertions that the download committed exactly once,
    every chunk came from the validated two-peer set, both peers served, and no
    chunk was fetched twice.
  • S14: enlarged to 4*CHUNK_SIZE so the balance check can fail under a 3+1
    imbalance; asserts an exact 2+2 split summing to the file size.
  • S16: inflated the .eti to 2*CHUNK_SIZE so it fans out across both
    catalog-version peers (the stock 120 MiB fixture is a single chunk).
  • S37: validate the throughput rate fields (positive, self-consistent
    mbit/mib == 8.388608, mib_per_s == bytes/duration), not just the byte count.
  • S35: assert the source actually advertises the unknown game before checking
    it is filtered, so "absent" means "filtered" and not "never sent".
  • S15: cross-check each peer's raw advertised eti_version via list-peers; the
    list-games eti_game_version is synthesized from the catalog and can only ever
    equal the asserted value.
  • S2: poll for library convergence and verify the bidirectional exchange
    (bravo sees alpha's 3 games, not just alpha seeing bravo's 4).
  • S12/S28: require the gating unit test to appear as " ... ok" so an
    #[ignore]d (un-run) test no longer satisfies the check.
  • S24/S25: assert the requested install=false final state.
  • S34: assert exactly 21 coherent chunks (20 files + version.ini), 21 distinct
    paths, no duplicates, instead of a >= 21 floor.

Flake fixes:

  • S19: force-kill the sole source right after download-begin on a 4*CHUNK_SIZE
    file and accept download-failed or download-peers-gone. The old graceful
    shutdown on a single-chunk file could let the transfer finish first, turning
    the expected failure into a download-finished. A chunk may complete before
    the kill lands, but the full transfer cannot, so the failure is deterministic.
  • S26: use a large sparse source so the first operation is reliably still
    active when the duplicate request is issued (TOCTOU on active_operations);
    also assert the active operation == "Downloading".
  • S11: drop the "listener address must change" assertion -- it tested the OS
    ephemeral-port allocator and could fail spuriously; keep the same-identity /
    no-duplicate invariant.

Coverage and determinism:

  • S27: add handshake::tests::inbound_hello_from_self_is_ignored for the
    protocol-level self guard. The CLI scenario only exercises the CLI
    string-compare guard, which short-circuits before any network call, so the
    peer-crate guard had no test.
  • find_fixture_game now iterates sorted(FIXTURES) so the ambiguous cnctw
    (fixture-bravo/multi/solid) resolves deterministically to fixture-bravo.

Reviewed and deliberately left as-is (documented in the run log): S20, S21,
S30, S32/S39/S44 absence checks, S42 IP-order precondition, S45.

PEER_CLI_SCENARIOS.md rows S2, S11, S14, S16, S18, S19, S27 are updated to
match the harness, and a dated run-log entry records the audit, the fixes, the
accepted items, and the live-run evidence.

Test Plan:

  • just peer-cli-tests (rebuilds the image, runs S1-S47 in Docker): baseline
    passed; post-fix passed; a final run on the exact committed code passed
    47/47. Evidence: S14 {268435456, 268435456} balanced 2+2; S16 .eti split
    across B and C {134217728, 134217728}; S18 all 536870912 bytes delivered with
    no download-failed; S19 deterministic download-failed; S37 ~874 MiB/s.
  • just test (incl. inbound_hello_from_self_is_ignored), just clippy
    (-D warnings, all-targets), and just fmt all pass.

Refs: PEER_CLI_SCENARIOS.md scenario matrix and 2026-06-21 run-log entry.

An adversarial audit of the headless peer-to-peer scenario suite (crates/lanspread-peer-cli/scripts/run_extended_scenarios.py, driven by `just peer-cli-tests`) found assertions that passed even when the behavior they claim to test was not happening, plus timing races and doc-vs-code divergences. A full baseline run (S1-S47) passed beforehand, confirming these were test-quality gaps, not peer regressions; the baseline output itself exposed the worst offenders -- e.g. S14 chunk totals {128 MiB, 1 MiB} (a two-chunk file whose "balanced within one chunk" check can never fail) and S16/S18 serving the whole ~120 MiB alienswarm.eti from a single source, so fanout and retry were never exercised. Test-correctness fixes (a broken behavior could previously pass green): - S18: the "no download-failed" check was dead -- it reused a LineWaiter already advanced past download-finished, so it scanned an empty tail. Replaced with assert_no_event_since over the whole window. Switched to a 4*CHUNK_SIZE sparse archive so both peers get chunks; the test now proves the download SURVIVES a mid-download source kill (every byte delivered, survivor served part, no download-failed, clean diff). Retry-onto-survivor is the mechanism but is not asserted: the kill/serve race against `docker rm -f` cannot be forced, so asserting an exact split would be flaky. - S7: the only check was a diff against byte-identical ggoo fixtures, so it was source-agnostic. Added assertions that the download committed exactly once, every chunk came from the validated two-peer set, both peers served, and no chunk was fetched twice. - S14: enlarged to 4*CHUNK_SIZE so the balance check can fail under a 3+1 imbalance; asserts an exact 2+2 split summing to the file size. - S16: inflated the .eti to 2*CHUNK_SIZE so it fans out across both catalog-version peers (the stock 120 MiB fixture is a single chunk). - S37: validate the throughput rate fields (positive, self-consistent mbit/mib == 8.388608, mib_per_s == bytes/duration), not just the byte count. - S35: assert the source actually advertises the unknown game before checking it is filtered, so "absent" means "filtered" and not "never sent". - S15: cross-check each peer's raw advertised eti_version via list-peers; the list-games eti_game_version is synthesized from the catalog and can only ever equal the asserted value. - S2: poll for library convergence and verify the bidirectional exchange (bravo sees alpha's 3 games, not just alpha seeing bravo's 4). - S12/S28: require the gating unit test to appear as "<name> ... ok" so an #[ignore]d (un-run) test no longer satisfies the check. - S24/S25: assert the requested install=false final state. - S34: assert exactly 21 coherent chunks (20 files + version.ini), 21 distinct paths, no duplicates, instead of a >= 21 floor. Flake fixes: - S19: force-kill the sole source right after download-begin on a 4*CHUNK_SIZE file and accept download-failed or download-peers-gone. The old graceful shutdown on a single-chunk file could let the transfer finish first, turning the expected failure into a download-finished. A chunk may complete before the kill lands, but the full transfer cannot, so the failure is deterministic. - S26: use a large sparse source so the first operation is reliably still active when the duplicate request is issued (TOCTOU on active_operations); also assert the active operation == "Downloading". - S11: drop the "listener address must change" assertion -- it tested the OS ephemeral-port allocator and could fail spuriously; keep the same-identity / no-duplicate invariant. Coverage and determinism: - S27: add handshake::tests::inbound_hello_from_self_is_ignored for the protocol-level self guard. The CLI scenario only exercises the CLI string-compare guard, which short-circuits before any network call, so the peer-crate guard had no test. - find_fixture_game now iterates sorted(FIXTURES) so the ambiguous cnctw (fixture-bravo/multi/solid) resolves deterministically to fixture-bravo. Reviewed and deliberately left as-is (documented in the run log): S20, S21, S30, S32/S39/S44 absence checks, S42 IP-order precondition, S45. PEER_CLI_SCENARIOS.md rows S2, S11, S14, S16, S18, S19, S27 are updated to match the harness, and a dated run-log entry records the audit, the fixes, the accepted items, and the live-run evidence. Test Plan: - `just peer-cli-tests` (rebuilds the image, runs S1-S47 in Docker): baseline passed; post-fix passed; a final run on the exact committed code passed 47/47. Evidence: S14 {268435456, 268435456} balanced 2+2; S16 .eti split across B and C {134217728, 134217728}; S18 all 536870912 bytes delivered with no download-failed; S19 deterministic download-failed; S37 ~874 MiB/s. - `just test` (incl. inbound_hello_from_self_is_ignored), `just clippy` (-D warnings, all-targets), and `just fmt` all pass. Refs: PEER_CLI_SCENARIOS.md scenario matrix and 2026-06-21 run-log entry.
ddidderr added 1 commit 2026-06-21 10:27:53 +02:00
An adversarial audit of the headless peer-to-peer scenario suite
(crates/lanspread-peer-cli/scripts/run_extended_scenarios.py, driven by
`just peer-cli-tests`) found assertions that passed even when the behavior
they claim to test was not happening, plus timing races and doc-vs-code
divergences. A full baseline run (S1-S47) passed beforehand, confirming these
were test-quality gaps, not peer regressions; the baseline output itself
exposed the worst offenders -- e.g. S14 chunk totals {128 MiB, 1 MiB} (a
two-chunk file whose "balanced within one chunk" check can never fail) and
S16/S18 serving the whole ~120 MiB alienswarm.eti from a single source, so
fanout and retry were never exercised.

Test-correctness fixes (a broken behavior could previously pass green):
- S18: the "no download-failed" check was dead -- it reused a LineWaiter
  already advanced past download-finished, so it scanned an empty tail.
  Replaced with assert_no_event_since over the whole window. Switched to a
  4*CHUNK_SIZE sparse archive so both peers get chunks; the test now proves the
  download SURVIVES a mid-download source kill (every byte delivered, survivor
  served part, no download-failed, clean diff). Retry-onto-survivor is the
  mechanism but is not asserted: the kill/serve race against `docker rm -f`
  cannot be forced, so asserting an exact split would be flaky.
- S7: the only check was a diff against byte-identical ggoo fixtures, so it was
  source-agnostic. Added assertions that the download committed exactly once,
  every chunk came from the validated two-peer set, both peers served, and no
  chunk was fetched twice.
- S14: enlarged to 4*CHUNK_SIZE so the balance check can fail under a 3+1
  imbalance; asserts an exact 2+2 split summing to the file size.
- S16: inflated the .eti to 2*CHUNK_SIZE so it fans out across both
  catalog-version peers (the stock 120 MiB fixture is a single chunk).
- S37: validate the throughput rate fields (positive, self-consistent
  mbit/mib == 8.388608, mib_per_s == bytes/duration), not just the byte count.
- S35: assert the source actually advertises the unknown game before checking
  it is filtered, so "absent" means "filtered" and not "never sent".
- S15: cross-check each peer's raw advertised eti_version via list-peers; the
  list-games eti_game_version is synthesized from the catalog and can only ever
  equal the asserted value.
- S2: poll for library convergence and verify the bidirectional exchange
  (bravo sees alpha's 3 games, not just alpha seeing bravo's 4).
- S12/S28: require the gating unit test to appear as "<name> ... ok" so an
  #[ignore]d (un-run) test no longer satisfies the check.
- S24/S25: assert the requested install=false final state.
- S34: assert exactly 21 coherent chunks (20 files + version.ini), 21 distinct
  paths, no duplicates, instead of a >= 21 floor.

Flake fixes:
- S19: force-kill the sole source right after download-begin on a 4*CHUNK_SIZE
  file and accept download-failed or download-peers-gone. The old graceful
  shutdown on a single-chunk file could let the transfer finish first, turning
  the expected failure into a download-finished. A chunk may complete before
  the kill lands, but the full transfer cannot, so the failure is deterministic.
- S26: use a large sparse source so the first operation is reliably still
  active when the duplicate request is issued (TOCTOU on active_operations);
  also assert the active operation == "Downloading".
- S11: drop the "listener address must change" assertion -- it tested the OS
  ephemeral-port allocator and could fail spuriously; keep the same-identity /
  no-duplicate invariant.

Coverage and determinism:
- S27: add handshake::tests::inbound_hello_from_self_is_ignored for the
  protocol-level self guard. The CLI scenario only exercises the CLI
  string-compare guard, which short-circuits before any network call, so the
  peer-crate guard had no test.
- find_fixture_game now iterates sorted(FIXTURES) so the ambiguous cnctw
  (fixture-bravo/multi/solid) resolves deterministically to fixture-bravo.

Reviewed and deliberately left as-is (documented in the run log): S20, S21,
S30, S32/S39/S44 absence checks, S42 IP-order precondition, S45.

PEER_CLI_SCENARIOS.md rows S2, S11, S14, S16, S18, S19, S27 are updated to
match the harness, and a dated run-log entry records the audit, the fixes, the
accepted items, and the live-run evidence.

Test Plan:
- `just peer-cli-tests` (rebuilds the image, runs S1-S47 in Docker): baseline
  passed; post-fix passed; a final run on the exact committed code passed
  47/47. Evidence: S14 {268435456, 268435456} balanced 2+2; S16 .eti split
  across B and C {134217728, 134217728}; S18 all 536870912 bytes delivered with
  no download-failed; S19 deterministic download-failed; S37 ~874 MiB/s.
- `just test` (incl. inbound_hello_from_self_is_ignored), `just clippy`
  (-D warnings, all-targets), and `just fmt` all pass.

Refs: PEER_CLI_SCENARIOS.md scenario matrix and 2026-06-21 run-log entry.
ddidderr merged commit 2d53848e0c into main 2026-06-21 10:28:02 +02:00
ddidderr deleted branch test/peer-cli-scenario-audit 2026-06-21 10:28:02 +02:00
Sign in to join this conversation.