Drill: strict-signature quarantine spike
Quarterly drill. Validates the audit chain integrity guarantee
under signature failure: when a producer’s signing key rotates
without the verifier’s trust store updating in lockstep, the
canonical_ingest verifier MUST either reject every affected row
into audit_signature_quarantine (strict mode) or admit it and bump
the admit counters (non-strict mode). In either case it must
NEVER drop or silently re-encode the bytes.
This is the live counterpart to the unit tests in
services/canonical_ingest/src/verifier.rs::tests::* and the
metrics tests in services/canonical_ingest/src/metrics.rs::tests.
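The contract above can be read as a small routing table. The sketch below only illustrates that table; it is not the logic in verifier.rs, and the function name route_row and its argument shapes are hypothetical. It covers the two failure reasons whose non-strict counters this runbook names, and reports anything else in non-strict mode as unspecified.

```shell
#!/usr/bin/env bash
# Illustrative contract table only; NOT the verifier.rs implementation.
# route_row STRICT RESULT
#   STRICT: "true" or "false" (the STRICT_SIGNATURES setting)
#   RESULT: "ok", "unknown_key", or "invalid_signature"
# Prints the table the row lands in; non-strict admits also print the
# counter that is bumped. The bytes are never dropped in any branch.
route_row() {
  local strict="$1" result="$2"
  if [ "$result" = "ok" ]; then
    echo "canonical_events"
    return 0
  fi
  case "$strict:$result" in
    true:*)
      # Strict mode: every failed verification is quarantined.
      echo "audit_signature_quarantine" ;;
    false:unknown_key)
      echo "canonical_events unknown_key_admitted_total" ;;
    false:invalid_signature)
      echo "canonical_events invalid_signature_admitted_total" ;;
    *)
      # Non-strict handling of other reasons is not stated in this runbook.
      echo "unspecified"
      return 1 ;;
  esac
}
```

For example, route_row true unknown_key prints audit_signature_quarantine, while route_row false unknown_key prints canonical_events followed by the counter name.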
What this drill exercises
- Strict mode: alert A5 SpendGuardCanonicalRejectsHigh fires.
- Non-strict mode (PR #2 round 1 P2#3 fix in eec0404): the unknown_key_admitted_total and invalid_signature_admitted_total counters bump, but rows still land in canonical_events so the audit chain isn’t broken during a rolling key rotation.
- The S7 key registry (signing_keys + signing_key_revocations tables in canonical-ingest migrations 0008/0009): quarantine reasons differentiate key_expired / key_revoked / key_not_yet_valid / unknown_key / invalid_signature.
Symptoms (what on-call sees)
- Alert A5 SpendGuardCanonicalRejectsHigh firing.
- audit_signature_quarantine row count climbing.
- canonical_events count growing slower (or flat in strict mode).
- audit_outbox.pending_forward = TRUE count climbing: the forwarder keeps re-attempting the same rejected rows.
- User-visible: NO immediate impact on producers (sidecar / ledger / webhook still write rows). Audit consumers see growing quarantine; compliance sees a gap.
First check
# 1. Quarantine breakdown by reason (Phase 5 S7 + S8 schema):
psql -h $CANONICAL_PG_HOST -U spendguard -d spendguard_canonical -c "
  SELECT reason, count(*), max(quarantined_at) AS most_recent
  FROM audit_signature_quarantine
  WHERE quarantined_at > now() - interval '1 hour'
  GROUP BY reason
  ORDER BY count DESC;"

# 2. Which signing keys are involved?
# (ORDER BY must use the alias; there is no output column named "count" here.)
psql -h $CANONICAL_PG_HOST -U spendguard -d spendguard_canonical -c "
  SELECT signing_key_id, count(*) AS quarantined_rows
  FROM audit_signature_quarantine
  WHERE quarantined_at > now() - interval '1 hour'
  GROUP BY signing_key_id
  ORDER BY quarantined_rows DESC;"

# 3. Compare signing keys claimed by producers vs trust store:
psql -h $CANONICAL_PG_HOST -U spendguard -d spendguard_canonical -c "
  SELECT key_id, valid_from, valid_until, revoked_at IS NOT NULL AS is_revoked
  FROM signing_keys
  ORDER BY valid_from DESC
  LIMIT 10;"

# 4. Strict mode check (different remediation for strict vs non-strict):
kubectl exec <canonical-ingest-pod> -- env | grep STRICT_SIGNATURES
# true → strict (rows rejected); false → non-strict (admitted + counted)
Mitigation (short-term unblock)
The route depends on which reason dominates in step 1:
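As a rough aid for picking the route, the sketch below reads the step 1 breakdown in psql's unaligned, tuples-only form (psql -At, fields separated by |) and prints the dominant reason plus the matching section of this runbook. The function names dominant_reason and mitigation_route are hypothetical helpers, not part of the SpendGuard tooling.

```shell
#!/usr/bin/env bash
# dominant_reason: reads "reason|count[|...]" lines on stdin, the shape that
# `psql -At` produces for the step 1 query, and prints the reason with the
# highest count. Prints nothing when the input is empty.
dominant_reason() {
  sort -t'|' -k2,2nr | head -n 1 | cut -d'|' -f1
}

# mitigation_route REASON: maps a quarantine reason to a section of this runbook.
mitigation_route() {
  case "$1" in
    unknown_key)
      echo "unknown_key: update trust store, then replay" ;;
    invalid_signature)
      echo "invalid_signature: halt producer, preserve evidence" ;;
    key_expired|key_revoked)
      echo "key_expired/key_revoked: rotate producer key" ;;
    *)
      # e.g. key_not_yet_valid, which has no dedicated section here.
      echo "no dedicated section in this runbook"
      return 1 ;;
  esac
}
```

A possible use is to pipe the step 1 query (run with -At instead of the default aligned output) into dominant_reason and pass the result to mitigation_route.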
unknown_key dominates
The producer is using a key the verifier doesn’t recognise. Likely cause: key rotation deployed to producers ahead of the trust-store update on canonical-ingest.
- Identify the new key (step 2 + the producer’s recent logs).
- Add it to the trust store:
  kubectl edit secret spendguard-signing-trust-store
  # Append the new public key + valid_from window
  kubectl rollout restart deployment canonical-ingest
- Replay the quarantined rows: the PR #2 round 1 quarantine keeps the original bytes verbatim. After the trust store update, manually re-ingest from the audit_signature_quarantine table into canonical_events (an S8-followup feature; today this requires manual SQL).
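Before any manual replay, it may help to size the job with a read-only query. The sketch below only builds that query text; it uses the reason, signing_key_id, and quarantined_at columns that appear in the first-check queries and does not attempt the re-ingest itself, whose SQL this runbook does not specify. The helper name replay_candidates_sql and the allowed key-id character set are assumptions.

```shell
#!/usr/bin/env bash
# replay_candidates_sql KEY_ID
# Prints a read-only query that counts quarantined unknown_key rows for one
# signing key. Key ids outside a conservative character set (an assumption
# about the id format) are refused so the value is safe to embed in SQL text.
replay_candidates_sql() {
  local key_id="$1"
  case "$key_id" in
    ''|*[!A-Za-z0-9._:-]*)
      echo "refusing unsafe or empty key id" >&2
      return 1 ;;
  esac
  printf "SELECT count(*) AS replay_candidates, min(quarantined_at) AS oldest\n"
  printf "FROM audit_signature_quarantine\n"
  printf "WHERE reason = 'unknown_key' AND signing_key_id = '%s';\n" "$key_id"
}
```

The output can be passed to the same psql invocation used in the first-check steps, for example with -c "$(replay_candidates_sql <key-id>)".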
invalid_signature dominates
This is more serious: the bytes don’t match the claimed signature. Possibilities:
- Producer code regression (signing the wrong canonical bytes)
- Active tampering on the wire (mTLS misconfiguration?)

- Halt the affected producer immediately until root cause is known:
  kubectl scale deployment <producer-name> --replicas=0
- Diff the producer image vs known-good for changes to canonical-form serialization.
- Do NOT drop or replay quarantine rows until tampering is ruled out: the bytes are forensic evidence.
key_expired / key_revoked dominates
S7 validity-window enforcement. The producer is signing with a key
past its valid_until or after revoked_at.
- Rotate the producer’s signing material to a current key.
- Audit the gap: rows signed with the expired key inside the valid_from-to-valid_until window are still legitimate (signed by a then-valid key); rows signed AFTER the window represent a producer config bug.
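The gap audit comes down to comparing each row's signing time against the key's validity window. The sketch below shows that comparison under stated assumptions: a signing timestamp is available per row (the parameter signed_at is hypothetical), all timestamps are UTC in the same fixed-width ISO-8601 form, and both window bounds are treated as inclusive.

```shell
#!/usr/bin/env bash
# classify_signed_at SIGNED_AT VALID_FROM VALID_UNTIL
# All three arguments are UTC timestamps in one fixed-width ISO-8601 form
# (e.g. 2024-05-01T00:00:00Z), so plain string comparison orders them
# chronologically. Bounds are inclusive (an assumption).
classify_signed_at() {
  local signed_at="$1" valid_from="$2" valid_until="$3"
  if [ "$signed_at" \< "$valid_from" ]; then
    # Signed before the key became valid.
    echo "not_yet_valid"
  elif [ "$signed_at" \> "$valid_until" ]; then
    # Signed after the window closed: producer config bug per this runbook.
    echo "producer_config_bug"
  else
    # Signed by a then-valid key: still legitimate.
    echo "legitimate"
  fi
}
```

Rows that come back as legitimate are candidates for the same manual replay path as unknown_key rows; rows flagged producer_config_bug point at the producer's key configuration.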
Escalation
- 5 minutes sustained spike → page platform oncall.
- 15 minutes without diagnosis → page sidecar/ledger team oncall (depending on which producer is affected).
- invalid_signature > 0 rows → page the security team immediately (potential tampering).
- 30+ minutes sustained quarantine in strict mode → consider switching to non-strict temporarily (operator decision, requires Helm gate ack; this trades audit-chain completeness for availability while you fix the root cause).
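The thresholds above can be written down as a small checklist function, which is handy when rehearsing the drill. This is a sketch only: the function name escalation_targets and the output labels are hypothetical, and it assumes the root cause is still undiagnosed at the 15-minute mark.

```shell
#!/usr/bin/env bash
# escalation_targets MINUTES INVALID_SIG_ROWS STRICT
#   MINUTES:          whole minutes the spike has been sustained
#   INVALID_SIG_ROWS: count of quarantined rows with reason invalid_signature
#   STRICT:           "true" or "false" (the STRICT_SIGNATURES setting)
# Prints one escalation step per line; prints nothing if none applies yet.
escalation_targets() {
  local minutes="$1" invalid_rows="$2" strict="$3"
  if [ "$invalid_rows" -gt 0 ]; then
    echo "security"
  fi
  if [ "$minutes" -ge 5 ]; then
    echo "platform-oncall"
  fi
  if [ "$minutes" -ge 15 ]; then
    # Assumes no diagnosis yet; pick sidecar or ledger by affected producer.
    echo "producer-team-oncall"
  fi
  if [ "$minutes" -ge 30 ] && [ "$strict" = "true" ]; then
    echo "consider-non-strict"
  fi
}
```

For instance, a six-minute spike with no invalid_signature rows yields only platform-oncall, while any invalid_signature row adds security regardless of duration.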
Rehearsal
# 1. Bring up demo with strict mode enabled (default for
# production profile).
make demo-up DEMO_MODE=invoice

# 2. Generate a few audit rows.
make demo-up DEMO_MODE=decision

# 3. Inject a "key rotation" scenario by replacing one
# producer's signing key WITHOUT updating the verifier's trust
# store. Easiest via re-running pki-init with a new key, then
# restarting the sidecar:
docker exec spendguard-pki-init /generate.sh --rotate-sidecar
docker restart spendguard-sidecar

# 4. Generate more audit traffic.
make demo-up DEMO_MODE=decision

# 5. Confirm quarantine row appears with reason='unknown_key'.
docker exec spendguard-postgres psql -U spendguard -d spendguard_canonical -c "
  SELECT reason, count(*)
  FROM audit_signature_quarantine
  GROUP BY reason;"
# Expected: unknown_key reason with at least 1 row.

# 6. Mitigation rehearsal: update the trust store + restart
# canonical-ingest, then verify new rows land in canonical_events
# (old rows stay in quarantine for the manual replay step).

make demo-down
Related
- L5 SLO definition: docs/site/docs/operations/slos.md row L5
- Alert: A5 SpendGuardCanonicalRejectsHigh in deploy/observability/prometheus-rules.yaml
- D3 in slos.md (signature-failure handling): high-level version
- PR #2 round 1 commit a4dea4b: non-strict admit counters
- PR #2 round 7+8 commits 409c220, d019e94: SP-side literal-pin relaxations that let real signed rows through