SLOs, alerts, and incident drills (Phase 5 S23)
This page is the production operating contract: every numeric target an operator commits to, every alert mapped to a remediation runbook, every incident-drill scenario.
The alert rules in `deploy/observability/prometheus-rules.yaml` evaluate
these SLOs in Prometheus; the dashboard at
`deploy/observability/grafana-dashboard.json` renders them.
SLO summary
| ID | Name | Target | Window | Owner |
|---|---|---|---|---|
| L1 | Decision latency (p99) | < 250 ms | 30 days | sidecar team |
| L2 | Decision availability | ≥ 99.9% | 30 days | sidecar team |
| L3 | Ledger commit success | ≥ 99.95% | 30 days | ledger team |
| L4 | Audit outbox forward lag (p99) | < 60 s | 24 h | platform team |
| L5 | Canonical ingest reject rate | < 0.5% | 24 h | platform team |
| L6 | Pricing snapshot age (p99) | < 24 h | 24 h | pricing team |
| L7 | Provider reconciliation lag | < 4 h | 24 h | platform team |
| L8 | Approval latency (p99) | < 5 min (business hours) | 30 days | approver oncall |
| L9 | Fencing lease takeover rate | < 1 / pod / hour | 24 h | sidecar team |
The numeric targets above are a first cut. Each owning team revisits
its targets quarterly, and every change requires an audit row in a
`pricing_overrides_audit`-style change log (TBD: either repurpose that
table or add a separate `slo_changes` table; S23-followup).
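To make the 30-day targets concrete: a 99.9% availability target leaves an error budget of 0.1% of decisions, and 99.95% leaves 0.05%. At uniform traffic, 0.1% of a 30-day window (43,200 minutes) is about 43 minutes. The following is a minimal sketch of recording rules that track budget consumption for L2. The group name and the `spendguard:` rule names are hypothetical, and the sketch assumes `spendguard_decision_total{status}` is already being scraped.

```yaml
groups:
  - name: spendguard-slo-budget            # hypothetical group name
    rules:
      # Fraction of decisions that errored over the 30-day SLO window.
      - record: spendguard:decision_error_ratio:rate30d      # hypothetical rule name
        expr: |
          sum(rate(spendguard_decision_total{status="error"}[30d]))
          /
          sum(rate(spendguard_decision_total[30d]))
      # Share of the L2 error budget already spent (1.0 = budget exhausted).
      # 0.001 is the budget implied by the 99.9% target.
      - record: spendguard:decision_error_budget_used:ratio  # hypothetical rule name
        expr: spendguard:decision_error_ratio:rate30d / 0.001
```

A 30-day range query only works if Prometheus retention covers at least 30 days, so treat the window as an assumption about the deployment.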
Required metrics
Source: `deploy/observability/prometheus-rules.yaml` references
these names. ✓ = shipped (in the listed source slice); ↻ = S23-followup.
| Metric | Source slice | Status |
|---|---|---|
| `spendguard_decision_latency_seconds` | S23 | ↻ |
| `spendguard_decision_total{status}` | S23 | ↻ |
| `spendguard_ledger_transaction_total{outcome,code}` | S23 | ↻ |
| `spendguard_ledger_lease_age_seconds{lease_name}` | S1 | ✓ |
| `spendguard_outbox_pending_seconds{tenant}` | S23 | ↻ |
| `spendguard_ingest_events_quarantined_total{reason}` | S8 | ✓ |
| `spendguard_ingest_events_accepted_total{route}` | S8 | ✓ |
| `spendguard_ingest_events_rejected_invalid_signature_total{route}` | S8 | ✓ |
| `spendguard_pricing_snapshot_age_seconds{provider}` | S13 | ↻ |
| `spendguard_provider_reconciliation_lag_seconds{provider}` | S10 | ↻ |
| `spendguard_approval_latency_seconds{outcome}` | S20 | ↻ |
| `spendguard_sidecar_fencing_acquire_total{action}` | S4 | ↻ |
The ↻ rows are pending wiring work: the emit-side code lives in the
relevant service crate, but the metric is not yet published on a
`/metrics` endpoint. The `/metrics` endpoint in canonical_ingest (S8) is
the reference implementation; replicate its `IngestMetrics` + HTTP server pattern.
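Once a service publishes `/metrics`, Prometheus still needs a scrape job before the rules can see the series. A minimal sketch of such a job is shown here; the job name, target address, and interval are hypothetical placeholders, and a Kubernetes deployment would more likely use service discovery than `static_configs`.

```yaml
scrape_configs:
  - job_name: spendguard-sidecar                  # hypothetical job name
    metrics_path: /metrics                        # endpoint path named in this doc
    scrape_interval: 15s                          # assumption; align with the global default
    static_configs:
      - targets: ["sidecar.spendguard.svc:9090"]  # hypothetical host:port
```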
Alert rules (sample)
The full set lives in `deploy/observability/prometheus-rules.yaml`.
The excerpts below are reproduced for the on-call playbook.
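Each excerpt is a single rule body. In a Prometheus rule file, a rule sits inside a `groups[].rules[]` list, roughly as sketched here using the A7 rule from this page. The group name is hypothetical; the actual grouping in `prometheus-rules.yaml` may differ.

```yaml
groups:
  - name: spendguard-slo-alerts   # hypothetical group name
    rules:
      - alert: SpendGuardProviderReconciliationLag
        expr: spendguard_provider_reconciliation_lag_seconds > 14400
        for: 1h
        labels: { severity: warn, slo: L7 }
        annotations:
          summary: "Provider reconciliation > 4h behind for 1h"
          runbook: "docs/operations/runbooks/L7-recon-lag.md"
```

Running `promtool check rules deploy/observability/prometheus-rules.yaml` validates the file's syntax before it is loaded.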
A1. Decision latency p99 above target
```yaml
alert: SpendGuardDecisionLatencyHigh
expr: histogram_quantile(0.99, rate(spendguard_decision_latency_seconds_bucket[5m])) > 0.25
for: 10m
labels: { severity: page, slo: L1 }
annotations:
  summary: "Decision p99 > 250ms for 10m"
  runbook: "docs/operations/runbooks/L1-decision-latency.md"
```
Page condition: 10 minutes sustained.
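As written, `histogram_quantile` over an unaggregated `rate()` yields one p99 per label set, so the rule evaluates each series (for example, each pod) separately. If the histogram carries per-instance labels (an assumption; the label set is not listed on this page) and a fleet-wide p99 is wanted instead, the buckets can be summed by `le` first. A sketch of that variant, with a hypothetical alert name:

```yaml
alert: SpendGuardDecisionLatencyHighFleet   # hypothetical variant name
# sum by (le) merges buckets across all series before the quantile is computed,
# producing a single fleet-wide p99 instead of one value per series.
expr: |
  histogram_quantile(
    0.99,
    sum by (le) (rate(spendguard_decision_latency_seconds_bucket[5m]))
  ) > 0.25
for: 10m
labels: { severity: page, slo: L1 }
```

The per-series form catches a single slow pod; the aggregated form tracks the SLO as users experience it. Which one should page is a design choice for the owning team.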
A2. Decision unavailable
```yaml
alert: SpendGuardDecisionUnavailable
# sum() both sides: without aggregation the error series is matched against
# itself (same labels) and the ratio is always 1.
expr: sum(rate(spendguard_decision_total{status="error"}[5m])) / sum(rate(spendguard_decision_total[5m])) > 0.001
for: 5m
labels: { severity: page, slo: L2 }
annotations:
  summary: "Decision error rate > 0.1% for 5m"
  runbook: "docs/operations/runbooks/L2-decision-availability.md"
```
A3. Ledger commit failures
```yaml
alert: SpendGuardLedgerCommitFailing
# sum() both sides: without aggregation the error series is matched against
# itself (same labels) and the ratio is always 1.
expr: sum(rate(spendguard_ledger_transaction_total{outcome="error"}[5m])) / sum(rate(spendguard_ledger_transaction_total[5m])) > 0.0005
for: 5m
labels: { severity: page, slo: L3 }
annotations:
  summary: "Ledger commit error rate > 0.05% for 5m"
  runbook: "docs/operations/runbooks/L3-ledger-commit.md"
```
A4. Audit outbox lag
```yaml
alert: SpendGuardOutboxLag
expr: histogram_quantile(0.99, rate(spendguard_outbox_pending_seconds_bucket[15m])) > 60
for: 15m
labels: { severity: page, slo: L4 }
annotations:
  summary: "Audit outbox p99 lag > 60s for 15m"
  runbook: "docs/operations/runbooks/L4-outbox-lag.md"
```
A5. Canonical ingest reject rate
```yaml
alert: SpendGuardCanonicalIngestRejecting
expr: rate(spendguard_ingest_events_rejected_invalid_signature_total[10m]) > 0.5
for: 10m
labels: { severity: page, slo: L5 }
annotations:
  summary: "Canonical ingest rejecting > 0.5 events/sec for 10m"
  runbook: "docs/operations/runbooks/L5-canonical-rejects.md"
```
A6. Pricing snapshot stale
```yaml
alert: SpendGuardPricingStale
expr: (time() - spendguard_pricing_snapshot_age_seconds) > 86400
for: 30m
labels: { severity: page, slo: L6 }
annotations:
  summary: "Latest pricing_version > 24h old"
  runbook: "docs/operations/runbooks/L6-pricing-stale.md"
```
This must page BEFORE the bundle-build fail-closed gate fires.
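The A6 expression is correct only if the gauge exports the Unix timestamp of the latest snapshot. If `spendguard_pricing_snapshot_age_seconds` instead exports an age in seconds, as its name suggests, then `time() - age` is itself a Unix timestamp and always exceeds 86400, so the rule would fire permanently. Under that assumption, a sketch of the direct comparison, plus an earlier warning to leave headroom before the fail-closed gate, might look like this. The alert names and the 20-hour threshold are hypothetical.

```yaml
# Variant assuming the gauge already holds an age in seconds.
- alert: SpendGuardPricingStaleByAge        # hypothetical variant name
  expr: max by (provider) (spendguard_pricing_snapshot_age_seconds) > 86400
  for: 30m
  labels: { severity: page, slo: L6 }
# Early warning at 20h (72000 s); the threshold is an illustrative assumption.
- alert: SpendGuardPricingAging             # hypothetical rule
  expr: max by (provider) (spendguard_pricing_snapshot_age_seconds) > 72000
  for: 30m
  labels: { severity: warn, slo: L6 }
```

Confirm what the emit-side code actually writes to the gauge before choosing between the two forms.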
A7. Provider reconciliation lag
```yaml
alert: SpendGuardProviderReconciliationLag
expr: spendguard_provider_reconciliation_lag_seconds > 14400
for: 1h
labels: { severity: warn, slo: L7 }
annotations:
  summary: "Provider reconciliation > 4h behind for 1h"
  runbook: "docs/operations/runbooks/L7-recon-lag.md"
```
A8. Approval latency
```yaml
alert: SpendGuardApprovalLatency
expr: histogram_quantile(0.99, rate(spendguard_approval_latency_seconds_bucket[1h])) > 300
for: 30m
labels: { severity: warn, slo: L8 }
annotations:
  summary: "Approval p99 > 5m for 30m"
  runbook: "docs/operations/runbooks/L8-approval-latency.md"
```
A9. Fencing takeover storm
```yaml
alert: SpendGuardFencingTakeoverStorm
expr: increase(spendguard_sidecar_fencing_acquire_total{action="promote"}[1h]) > 1
for: 5m
labels: { severity: page, slo: L9 }
annotations:
  summary: "Fencing takeovers > 1 / hour; likely lease flap"
  runbook: "docs/operations/runbooks/L9-fencing-storm.md"
```
Incident drill scenarios
Drills rotate quarterly. The drill log at
`docs/operations/drill-log.md` (the template is an S23-followup) records
the results.
Per-drill deep-dive runbooks
These full-text runbooks (followup #12) walk through symptoms, first checks, mitigation, escalation, and a compose-based rehearsal for each drill. Read them before going primary on-call:
- Lease lost mid-batch: validates the round-9 `is_leader_now()` gating in outbox-forwarder and ttl-sweeper.
- Audit chain forwarder backlog: validates the L4 SLO (audit-outbox forward lag) and the forwarder's idempotency.
- Strict-signature quarantine spike: covers the high-level D3 below with the full triage tree for the `unknown_key` / `invalid_signature` / `key_expired` / `key_revoked` reasons.
- Approval TTL wave: sweeper burst handling and the round-9 atomic TTL guard.
The high-level D1–D4 entries below stay as the executive summary; the per-drill docs above are what on-call actually reads.
D1. Ledger failover
Steps:
- `kubectl delete pod <ledger-pod>` (or simulate a Postgres primary failover).
- Verify A3 fires within 5 minutes.
- Verify the sidecar fail policy (S22 matrix) blocks new monetary decisions per `failPolicy.overrides`.
- Verify ledger-replica promotion and that the new ledger pod becomes leader.
- Verify post-recovery: A3 clears; in-flight reservations either commit cleanly or release via TTL.
Acceptance:
- No `audit_outbox_global_keys` UNIQUE violations during the failover.
- The `audit_outbox.pending_forward = TRUE` count returns to baseline within 10 minutes of recovery.
D2. Stale fencing lease handling
Steps:
- Manually expire the active sidecar's fencing lease (UPDATE `fencing_scopes` in the test env, or wait for the natural TTL on a killed pod).
- Verify the counter behind A9 (`spendguard_sidecar_fencing_acquire_total{action="promote"}`) increments by exactly 1.
- Verify the takeover sidecar's first decision uses `fencing_epoch = N+1`.
- Verify the prior pod's in-flight commit (if any) gets `FENCING_EPOCH_STALE` from the SP.
Acceptance:
- A `fencing_scope_events.action='promote'` row appears.
- No `audit_outbox_global_keys` collisions.
D3. Signature failure handling
Steps:
- Rotate one producer's Ed25519 key WITHOUT updating the verifier's trust store (`keys.json`).
- Verify A5 increments and the canonical_ingest log shows a `key_revoked` / `unknown_key` quarantine reason.
- Verify the rows land in `audit_signature_quarantine` with the correct `claimed_canonical_bytes` preserved.
- Update the verifier's trust store (rolling restart).
- Verify A5 returns to baseline.
Acceptance:
- The pre-rotation rows ARE in `canonical_events` (signed with the old key).
- The mid-rotation rows are in `audit_signature_quarantine`.
- The post-rotation rows ARE in `canonical_events` (signed with the new key).
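This drill uses A5 as its signal. The A5 excerpt thresholds an absolute rate (0.5 events/sec), while the L5 target is a percentage (< 0.5%), so on a low-traffic test environment the absolute threshold may never trip. A ratio-based rule closer to L5 might look like the sketch below. The alert name is hypothetical, and the denominator (accepted plus rejected events) is an assumption about how the reject rate is defined; quarantined events are not counted here.

```yaml
alert: SpendGuardCanonicalIngestRejectRatio   # hypothetical rule name
# Rejected events as a fraction of all events seen (assumed: accepted + rejected).
expr: |
  sum(rate(spendguard_ingest_events_rejected_invalid_signature_total[10m]))
  /
  (
    sum(rate(spendguard_ingest_events_accepted_total[10m]))
    + sum(rate(spendguard_ingest_events_rejected_invalid_signature_total[10m]))
  )
  > 0.005
for: 10m
labels: { severity: page, slo: L5 }
```

When there is no traffic at all, the ratio is 0/0 (NaN) and the comparison does not fire, which is usually the desired behavior for an idle environment.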
D4. Pricing outage
Steps:
- Disable pricing-sync (set the crontab to empty, or pause the pricing-sync worker).
- Wait 24 hours.
- Verify A6 fires (the rule's `for: 30m` means it fires about 30 minutes after the 24-hour mark).
- Continue waiting until `bundle-build` refuses to cut new bundles (S13-followup wires this).
- Re-enable pricing-sync.
- Verify A6 clears; bundle-build resumes.
Acceptance:
- The `pricing_sync_attempts.outcome` log shows the gap.
- No spurious budget enforcement during the freshness gap (existing bundles continue using their frozen pricing tuple).
Owner page (per spec review standard)
| Component | Page owner | Backup |
|---|---|---|
| Sidecar | sidecar oncall | platform oncall |
| Ledger | ledger oncall | platform oncall |
| Canonical Ingest | platform oncall | sidecar oncall |
| Outbox forwarder | platform oncall | platform oncall |
| TTL sweeper | platform oncall | platform oncall |
| Webhook receiver | platform oncall | provider oncall |
| Control Plane | platform oncall | sre |
| Dashboard | platform oncall | sre |
Each runbook listed above MUST be filled in before GA. The S23 doc ships the structure; the per-alert deep dives are the next chunk.