Drill: audit chain forwarder backlog
Quarterly drill. Validates the audit chain end-to-end: when
outbox-forwarder is paused or slow, the audit_outbox queue
grows but no rows are lost; once the forwarder resumes, the queue
drains and canonical_events catches up.
This is the L4 SLO drill (audit-outbox-forward-lag p99 < 60s within a 24h window).
What this drill exercises
Section titled “What this drill exercises”- The L4 SLO target.
- Alert A4
SpendGuardOutboxLagHighfiring at the documented threshold + clearing on recovery. - The transactional invariant from S8: rows are NEVER dropped
silently. They sit in
audit_outboxwithpending_forward = TRUEuntil either forwarded successfully (→canonical_events) or moved toaudit_signature_quarantine(signature mismatch / unknown key). - The forwarder’s idempotency: re-processing the same backlog
twice is a no-op against
canonical_events.audit_outbox_idUNIQUE.
Symptoms (what on-call sees)
Section titled “Symptoms (what on-call sees)”- Alert A4
SpendGuardOutboxLagHighpaged on Prometheus. - Dashboard panel
audit_outbox_pending_secondsshows growing oldest-pending age. SELECT count(*) FROM audit_outbox WHERE pending_forward = TRUE→ growing number, not draining.- Dashboard’s
canonical_events countpanel flat or growing slower thanaudit_outboxtotal count. - User-visible impact: NONE for the producer side (sidecar / ledger / webhook continue to write audit rows). Audit-consuming flows (downstream BI, compliance exports) see staleness.
First check
Section titled “First check”# 1. Confirm the forwarder pod is still running.kubectl get pods -l app.kubernetes.io/component=outbox-forwarder -o wide# All Running? Continue. Any CrashLoopBackOff? Tail logs:kubectl logs <forwarder-pod> --tail=200
# 2. Pending count + oldest age (single SQL query):psql -h $LEDGER_PG_HOST -U spendguard -d spendguard_ledger -c " SELECT count(*) AS pending, max(now() - recorded_at) AS oldest_age, min(recorded_at) AS oldest_recorded_at FROM audit_outbox WHERE pending_forward = TRUE;"
# 3. Are forwards still landing? Compare canonical_events count# now vs 1 minute ago:psql -h $CANONICAL_PG_HOST -U spendguard -d spendguard_canonical -c " SELECT count(*) FROM canonical_events WHERE recorded_at > now() - interval '1 minute';"# 0 = forwarder is stalled. Non-zero = forwarder is processing,# just behind on backlog.
# 4. Check the forwarder's lease state (validates "lease lost# mid-batch" isn't actually the parent incident):psql -h $LEDGER_PG_HOST -U spendguard -d spendguard_ledger -c " SELECT holder_workload_id, expires_at, expires_at < now() AS expired FROM coordination_leases WHERE lease_name = 'outbox-forwarder';"If step 4 shows expired = TRUE AND no holder, the parent incident
is “lease lost mid-batch” — see that drill instead.
Mitigation (short-term unblock)
Section titled “Mitigation (short-term unblock)”Route depends on what step 3 shows:
Forwarder stalled (step 3 returns 0)
Section titled “Forwarder stalled (step 3 returns 0)”Options in order of escalation:
- Restart the forwarder pod:
New pod acquires lease via leader election + resumes from
Terminal window kubectl delete pod <forwarder-pod>audit_outbox.pending_forward = TRUErows ordered byrecorded_at. The forward loop is idempotent so partial replays are safe. - If multiple replicas and only one is stuck: that pod is in a bad state — kill it; standby takes over.
- If all replicas show the same stall: the canonical_ingest
side is rejecting → check
audit_signature_quarantinefor recent rows. The “strict-signature-quarantine-spike” drill covers that scenario.
Forwarder processing but backed up (step 3 non-zero)
Section titled “Forwarder processing but backed up (step 3 non-zero)”- Increase forwarder replica count temporarily (operator
ack required):
Note: only the leader processes work — extra replicas just make takeover faster. To actually parallelise, the workload must be partitioned (per-tenant scope_id, not a single global lease). This is a known scope limit.
Terminal window kubectl scale deployment outbox-forwarder --replicas=2 - Tune
OUTBOX_FORWARDER_BATCH_SIZE(env var; default 100): higher values trade per-batch latency for throughput. Bump to 500–1000 during recovery, restore default after lag clears.
Escalation
Section titled “Escalation”- 15 minutes of growing pending count → page platform oncall.
- 60 minutes with no recovery → page engineering manager; consider a manual SQL drain (very risky — only if you understand the dedup invariants).
- 24 hours lag → SLO violation; document in
docs/site/docs/operations/drill-log.md.
Rehearsal
Section titled “Rehearsal”# 1. Bring up demo with full chain.make demo-up DEMO_MODE=invoice
# 2. Pause the forwarder.docker pause spendguard-outbox-forwarder
# 3. Generate audit traffic by re-running the demo a few times.for i in 1 2 3; do make demo-up DEMO_MODE=decisiondone
# 4. Confirm pending count grows.docker exec spendguard-postgres psql -U spendguard -d spendguard_ledger \ -c "SELECT count(*) AS pending FROM audit_outbox WHERE pending_forward = TRUE;"# Expected: > 5 rows pending.
# 5. Resume the forwarder.docker unpause spendguard-outbox-forwardersleep 10
# 6. Confirm drain.docker exec spendguard-postgres psql -U spendguard -d spendguard_ledger \ -c "SELECT count(*) AS pending FROM audit_outbox WHERE pending_forward = TRUE;"# Expected: pending count back to baseline (0 or close).
# 7. Confirm canonical_events caught up.docker exec spendguard-postgres psql -U spendguard -d spendguard_canonical \ -c "SELECT count(*) FROM canonical_events;"# Expected: count matches the total audit rows generated.
make demo-downRelated
Section titled “Related”- L4 SLO definition:
docs/site/docs/operations/slos.mdrow L4 - Alert: A4
SpendGuardOutboxLagHighindeploy/observability/prometheus-rules.yaml - Sister drill:
lease-lost-mid-batch.md— when the forwarder stops because its lease went stale, not because it’s slow