Drill: approval TTL wave
Quarterly drill. Validates the approval state machine + sweeper under burst load: when many approvals expire simultaneously (e.g. a tenant configured a short TTL contract rule and the approver team is offline overnight), the TTL sweeper MUST process the burst correctly without:
- Dropping or double-expiring rows.
- Bypassing the round-9 atomic TTL guard (resolve_approval_request SP: user-driven approve/deny on a TTL-expired pending row → 409 CONFLICT).
- Letting any approval get stuck in pending after TTL.
- Missing the round-5 system-actor injection (rows must end up with resolved_by_subject = 'system:ttl-sweeper' / resolved_by_issuer = 'system:spendguard').
What this drill exercises
- Migration 0030 (round 5): TTL sweeper system-actor injection.
- Migration 0033 (round 9): atomic TTL guard inside the resolve_approval_request SP.
- Migration 0035 (followup #4): notification outbox writes for the expired transition.
- The S14 sweeper helper expire_pending_approvals_due() under load.
Symptoms (what on-call sees)
- Alert A8 SpendGuardApprovalLatencyHigh (or its expired-tail variant) firing.
- approval_requests query: many rows with state='pending' AND ttl_expires_at < now().
- approval_events query: burst of to_state='expired' events in a short window.
- approval_notifications query: corresponding burst of transition_kind='expired' rows pending dispatch.
- User-visible impact: adapters waiting on ResumeAfterApproval get the typed error indicating the approval lapsed → caller raises typed exception.
First check
# 1. Snapshot pending approvals past TTL.
psql -h $LEDGER_PG_HOST -U spendguard -d spendguard_ledger -c "
  SELECT count(*) AS stuck_pending,
         min(ttl_expires_at) AS oldest_overdue
  FROM approval_requests
  WHERE state = 'pending' AND ttl_expires_at < now();"
# Non-zero stuck count means sweeper isn't keeping up.

# 2. Recent expire-event rate.
psql -h $LEDGER_PG_HOST -U spendguard -d spendguard_ledger -c "
  SELECT count(*) AS expired_in_5min
  FROM approval_events
  WHERE to_state = 'expired'
    AND occurred_at > now() - interval '5 minutes';"

# 3. Sweeper liveness: when did it last act?
psql -h $LEDGER_PG_HOST -U spendguard -d spendguard_ledger -c "
  SELECT max(occurred_at) AS last_sweep_action
  FROM approval_events
  WHERE actor_subject = 'system:ttl-sweeper';"

# 4. Notification outbox backlog from the burst.
psql -h $LEDGER_PG_HOST -U spendguard -d spendguard_ledger -c "
  SELECT count(*) AS pending_dispatch
  FROM approval_notifications
  WHERE transition_kind = 'expired' AND pending_dispatch = TRUE;"

Mitigation (short-term unblock)
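The mitigation branches that follow key off two of the first-check readings: the stuck_pending count from step 1 and how long ago step 3's last_sweep_action was. As a rough sketch only (the function name and the printed labels are hypothetical and not part of SpendGuard), that mapping could be scripted like this:

```shell
#!/usr/bin/env bash
# Hypothetical triage helper; the name and output labels are illustrative.
#   $1 = stuck_pending count from step 1
#   $2 = seconds elapsed since last_sweep_action from step 3
triage_ttl_wave() {
  local stuck_pending="$1" sweep_age_s="$2"
  if [ "$stuck_pending" -eq 0 ]; then
    # Nothing is overdue, so an idle sweeper is not a problem.
    echo "caught-up"
  elif [ "$sweep_age_s" -gt 60 ]; then
    # Overdue rows exist and the last sweep action is > 1 minute old.
    echo "sweeper-stalled"
  else
    # Sweeper acted recently but overdue rows remain.
    echo "burst-too-big"
  fi
}
```

For example, triage_ttl_wave 50 300 prints sweeper-stalled, and triage_ttl_wave 50 10 prints burst-too-big. The caught-up case assumes an idle sweeper is expected when no rows are overdue.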
Sweeper stalled (step 3 shows last action > 1 minute ago)
The TTL sweeper isn’t running. Either:
- Pod down: run kubectl get pods -l app.kubernetes.io/component=ttl-sweeper and restart the pod if it is not Running.
- Lease lost: as in the lease-lost-mid-batch.md drill, sweeper logs show “lease expired locally” warnings. Same remediation: restart the pod.
Sweeper running but burst too big
The sweeper batches via expire_pending_approvals_due() which
processes all overdue rows in one SP call. If the burst is huge
(thousands of rows), the SP can take a long time inside a single
transaction.
- Check sweeper logs for the SP completion line: expired N approvals via sweeper.
- If the SP looks stuck: check Postgres for long-running transactions:
    SELECT pid, now() - xact_start AS duration, query
    FROM pg_stat_activity
    WHERE state <> 'idle'
      AND xact_start IS NOT NULL
    ORDER BY xact_start;
  The sweeper holding the lock prevents new approvals from being created/resolved (they’d wait on the row-level lock).
- Operator-supervised batch limit (S14-followup): a future migration would cap the SP’s per-call batch size; for now the sweeper either completes or operators wait it out.
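To confirm that other sessions really are queued behind the sweeper's transaction, it can help to list who is blocked by whom. The sketch below is an assumption-laden illustration: the helper name is hypothetical, and it relies only on pg_stat_activity and PostgreSQL's built-in pg_blocking_pids() (available in 9.6 and later). It prints the query so it can be piped into the same psql connection used in the first check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a query that pairs each blocked session with
# the pid blocking it, oldest blocking transaction first.
blocked_sessions_sql() {
  cat <<'SQL'
SELECT blocked.pid                AS blocked_pid,
       blocker.pid                AS blocking_pid,
       now() - blocker.xact_start AS blocking_xact_age,
       left(blocked.query, 80)    AS blocked_query
FROM pg_stat_activity AS blocked
CROSS JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid)
JOIN pg_stat_activity AS blocker ON blocker.pid = b.pid
ORDER BY blocker.xact_start;
SQL
}

# Usage (same connection settings as the first-check commands):
#   blocked_sessions_sql | psql -h "$LEDGER_PG_HOST" -U spendguard -d spendguard_ledger
```

If the sweeper's pid appears as blocking_pid for many rows, waiting it out (or the future batch limit) is the relevant path; an empty result suggests the slowdown lies elsewhere.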
Mass-expire was intended (e.g. test tenant)
No action required. Rows correctly expired with system actor + notifications enqueued.
Escalation
- 5 minutes of growing stuck-pending count → page approver oncall (responsible for the contract that set short TTLs).
- 15 minutes without sweeper progress → page platform oncall; TTL sweeper service may need restart.
- 30+ minutes with adapters waiting → page engineering manager. Adapters’ typed “approval lapsed” exceptions imply user impact.
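The three thresholds above are cumulative, so a long incident can warrant paging more than one group. A minimal sketch of that rule follows; the function name and the target labels are hypothetical, and only the 5/15/30-minute thresholds come from this runbook.

```shell
#!/usr/bin/env bash
# Hypothetical helper; labels are illustrative, thresholds are the ones
# listed in the Escalation section.
#   $1 = minutes the stuck-pending count has been growing
#   $2 = minutes without sweeper progress
#   $3 = minutes adapters have been waiting
escalation_targets() {
  local growing_min="$1" no_progress_min="$2" waiting_min="$3"
  local targets=""
  if [ "$growing_min" -ge 5 ]; then targets="$targets approver-oncall"; fi
  if [ "$no_progress_min" -ge 15 ]; then targets="$targets platform-oncall"; fi
  if [ "$waiting_min" -ge 30 ]; then targets="$targets engineering-manager"; fi
  targets="${targets# }"      # drop the leading separator, if any
  echo "${targets:-none}"     # print "none" when no threshold is met
}
```

For example, escalation_targets 20 16 0 prints "approver-oncall platform-oncall", and escalation_targets 0 0 0 prints "none".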
Rehearsal
# 1. Bring up demo with TTL=5s so we can create a burst quickly.
SIDECAR_TTL_SECONDS=5 make demo-up DEMO_MODE=ttl_sweep
# (The PR #6 ttl_sweep mode wires this end-to-end.)

# 2. Generate a burst of pending approvals via direct SQL
# (workaround: real adapter-driven approvals in burst would
# require a test contract with REQUIRE_APPROVAL rules).
docker exec spendguard-postgres psql -U spendguard -d spendguard_ledger -c "
  INSERT INTO approval_requests
    (approval_id, tenant_id, decision_id, audit_decision_event_id,
     state, ttl_expires_at, approver_policy,
     requested_effect, decision_context)
  SELECT gen_random_uuid(),
         '00000000-0000-4000-8000-000000000001',
         gen_random_uuid(), gen_random_uuid(),
         'pending',
         clock_timestamp() + interval '500 ms',
         '{}'::jsonb, '{}'::jsonb, '{}'::jsonb
  FROM generate_series(1, 50);"

# 3. Wait past TTL.
sleep 2

# 4. Trigger the sweeper SP directly (it would normally be
# called by the ttl-sweeper service):
docker exec spendguard-postgres psql -U spendguard -d spendguard_ledger -c "
  SELECT expire_pending_approvals_due() AS expired_count;"
# Expected: 50.

# 5. Verify all rows are now expired with system actor.
docker exec spendguard-postgres psql -U spendguard -d spendguard_ledger -c "
  SELECT state, resolved_by_subject, count(*)
  FROM approval_requests
  WHERE state IN ('expired', 'pending')
  GROUP BY state, resolved_by_subject;"
# Expected: state=expired, resolved_by_subject=system:ttl-sweeper, count=50.

# 6. Verify notification rows landed (followup #4).
docker exec spendguard-postgres psql -U spendguard -d spendguard_ledger -c "
  SELECT transition_kind, count(*)
  FROM approval_notifications
  GROUP BY transition_kind;"
# Expected: transition_kind=expired with N rows for the tenant
# IFF that tenant has a row in tenant_notification_config; 0
# rows if no config (followup #4 default behavior).

make demo-down

Related
- L8 SLO definition: docs/site/docs/operations/slos.md row L8
- PR #2 round 5 commit c084a26: TTL sweeper SP fix (system actor injection)
- PR #2 round 9 commit 8810c14: atomic TTL guard
- PR #14 commit 6f8d4d5 (followup #4): notification outbox writes
- Sister drill: lease-lost-mid-batch.md covers the sweeper pod losing its lease (parent incident pattern)