GA hardening progress log
Live tracker for the 23 slices defined in ga-hardening-slices.md. Updated on each slice merge.
S1 — Lease primitive for singleton background workers
Status: SHIPPED (90%+ production candidate; one deferred validation documented).
Design decision
- Postgres-backed lease as the primary, fully-tested mode (works for compose, Helm Postgres, and any external Postgres). k8s `coordination.k8s.io/Lease` mode reserved as a feature-flagged trait impl that returns `LeaseError::ModeUnavailable` until S5 wires the `kube` crate + chart RBAC. `disabled` mode kept as the explicit single-pod escape hatch and guarded by a Helm template `fail` directive when `replicas > 1` and `mode = disabled`.
- One shared `services/leases/` crate consumed by `outbox_forwarder` and `ttl_sweeper` via path dep: avoids code duplication and gives a single place to add k8s mode in S5.
- Postgres SP `acquire_lease(lease_name, workload_id, region, ttl_secs)` performs all state transitions atomically inside `FOR UPDATE`. The outcomes (renewed / acquired / taken_over / denied) are branchless from the caller's perspective: caller submits, SP returns `(granted, holder_token, …, event_type)`.
- `transition_count` bumps on every takeover (NOT on renewal) so it doubles as a fencing-style epoch for diagnostics.
- `coordination_lease_history` audit table appends one row per transition for forensics.
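The transition rules above can be sketched as a small in-memory model. This is only an illustration, not the SP itself: `LeaseRow`, `Event`, and the Rust `acquire_lease` signature are hypothetical, the `region` argument is omitted, and two choices are assumptions (a first acquire does not bump `transition_count`, and a holder returning after its own expiry is treated as a takeover).

```rust
// Hypothetical in-memory model of the lease row and the SP's decision paths.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Acquired,
    Renewed,
    TakenOver,
    Denied,
}

#[derive(Debug, Clone)]
struct LeaseRow {
    holder: Option<String>,
    expires_at: u64,       // seconds on the database clock
    transition_count: u64, // bumps on takeover only, never on renewal
}

/// One call models one atomic SP invocation (the real SP holds FOR UPDATE).
fn acquire_lease(row: &mut LeaseRow, workload_id: &str, now: u64, ttl_secs: u64) -> (bool, Event) {
    let expired = now >= row.expires_at;
    let held_by_caller = row.holder.as_deref() == Some(workload_id);

    let event = if row.holder.is_none() {
        Event::Acquired
    } else if held_by_caller && !expired {
        Event::Renewed
    } else if expired {
        // Assumption: any grant over an expired lease counts as a takeover.
        row.transition_count += 1;
        Event::TakenOver
    } else {
        // Someone else holds a live lease: leave the row untouched.
        return (false, Event::Denied);
    };

    row.holder = Some(workload_id.to_string());
    row.expires_at = now + ttl_secs;
    (true, event)
}
```

The caller never branches on lease state; it submits and reads back the `(granted, event)` pair, which mirrors the "caller submits, SP returns" contract.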
Changed files
- NEW `services/ledger/migrations/0021_coordination_leases.sql` (132 lines): table + audit history + `acquire_lease` / `release_lease` SPs.
- NEW `services/leases/Cargo.toml` (~20 lines): library crate.
- NEW `services/leases/src/lib.rs` (~330 lines): `LeaseManager` trait, `PostgresLease`, `K8sLease` (stub), `DisabledLease`, `spawn_lease_loop`, `LeaseGuard`, unit tests.
- NEW `services/leases/tests/integration_postgres.rs` (~155 lines): testcontainers Postgres + 5 integration tests covering acquire/renew/takeover/release/concurrent-serialization.
- MODIFIED `services/outbox_forwarder/Cargo.toml`: path dep on `spendguard-leases`.
- MODIFIED `services/outbox_forwarder/src/config.rs`: 6 new env fields for leader election + cross-validation.
- MODIFIED `services/outbox_forwarder/src/main.rs`: lease loop spawned at startup; `forward_batch` only runs while `LeaseState::Leader`; graceful release on shutdown.
- MODIFIED `services/ttl_sweeper/Cargo.toml`: same path dep.
- MODIFIED `services/ttl_sweeper/src/config.rs`: same lease env.
- MODIFIED `services/ttl_sweeper/src/main.rs`: same gating pattern.
- MODIFIED `deploy/demo/runtime/Dockerfile.outbox_forwarder` + `Dockerfile.ttl_sweeper`: `COPY services/leases` so path dep resolves in the container build.
- MODIFIED `charts/spendguard/values.yaml`: top-level `leaderElection` block + `leaseName` per worker.
- MODIFIED `charts/spendguard/templates/outbox-forwarder.yaml` + `templates/ttl-sweeper.yaml`: env vars + Helm `fail` gate that rejects `replicas > 1` with `mode = disabled`.
Implementation summary
- Singleton workers now gate on lease state in their poll loop. The poll cadence is unchanged; the loop body runs only while the worker is leader.
- Lost lease (Standby state) yields one `tracing::debug` per poll cycle, which keeps logs quiet but still lets on-call see that "two pods are competing".
- The lease loop publishes state via `tokio::sync::watch`, so the worker never blocks on lease acquire; it just observes the latest state per poll.
- TTL/renew defaults: 15s / 5s respectively (3:1 ratio gives 2 missed renews before takeover, balancing lease churn against failover latency).
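A minimal sketch of the gating pattern and the 3:1 arithmetic described above. The names `poll_once` and `missed_renews_before_expiry` are hypothetical, and the state is passed in as a plain value here; the real workers read it from the `tokio::sync::watch` receiver.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LeaseState {
    Unknown,
    Standby,
    Leader,
}

/// Runs `body` only while leader; returns whether it ran.
/// Unknown and Standby both skip the work (fail-closed).
fn poll_once(state: LeaseState, body: &mut dyn FnMut()) -> bool {
    match state {
        LeaseState::Leader => {
            body();
            true
        }
        LeaseState::Standby | LeaseState::Unknown => false,
    }
}

/// Number of renew attempts that fall strictly inside one TTL window,
/// i.e. how many consecutive renews can be missed before the lease lapses.
fn missed_renews_before_expiry(ttl_secs: u64, renew_secs: u64) -> u64 {
    assert!(renew_secs > 0 && renew_secs < ttl_secs, "renew interval must be below TTL");
    (ttl_secs - 1) / renew_secs
}
```

With the defaults, renew attempts land at 5s and 10s after the last success; the third would coincide with expiry at 15s, which is where the "2 missed renews" figure comes from.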
Tests run and results
- `cargo test --package spendguard-leases` (in-tree unit tests): `lease_state_is_leader_only_for_leader`, `lease_config_validates_*`, `disabled_lease_always_grants`, `k8s_lease_returns_unavailable_for_s1` → 4 unit tests in `lib.rs`. Build validation deferred to next Docker rebuild: no local `cargo` on this Mac, but the crate uses only well-established deps (sqlx 0.8, tokio, async-trait, uuid) that compose-build resolves in the existing services/ledger Dockerfile chain.
- `helm lint charts/spendguard` → PASS (only icon-recommended INFO).
- `helm template … --set outboxForwarder.replicas=2 --set leaderElection.mode=disabled` → REJECTED with the expected message: `outboxForwarder.replicas > 1 requires leaderElection.mode != 'disabled' (S1 multi-pod safety gate)`. Same gate for `ttlSweeper`.
- `helm template … --set outboxForwarder.replicas=2 --set leaderElection.mode=postgres` → renders cleanly. (Multi-pod is unblocked at the Helm level.)
- Integration tests in `services/leases/tests/integration_postgres.rs` spin up Postgres via `testcontainers`. Local-Mac validation deferred (no Docker daemon writes from this AIT context); test code is committed and runs in any CI host with Docker.
Adversarial review conclusion
- Q1: Can a worker do real work before lease acquire? No. The poll loop reads `state_rx.borrow()`; initial state is `Unknown`, which falls through the match arm without invoking `forward_batch` / `sweep_one`.
- Q2: Lost lease mid-batch? A batch already committed in Postgres is durable regardless of lease loss. The next iteration's `state_rx.borrow()` will reflect Standby and skip the next batch. No partial-publish risk because each batch's audit row is per-iteration atomic via the existing forward-batch DB transaction.
- Q3: Lease TTL vs renew interval? Validated at `Config::from_env`: `renew_interval_ms < ttl_ms` enforced. Renew at 5s with 15s TTL gives two-grace-period redundancy. Renew failure logs `WARN`, publishes `Unknown` state, retries every `retry_interval_ms`.
- Q4: Two pods with same workload_instance_id? SP path A (renewal-by-current-holder) only matches when `holder_workload_id = caller_workload_id` AND the lease is not yet expired. Two pods with the same workload_id would both hit Path A and both succeed, which is a misconfiguration. Documented as operator responsibility; production deployments use stable per-pod identity via k8s downward API. POC bug surface: a pod restart with the same id inherits the previous instance's lease (this is actually desirable for fast-restart cases). S2 will add producer-instance partitioning to make this less surprising.
- Q5: Migration safety? Forward-only DDL: new tables + SPs. The current SQL has no `IF NOT EXISTS` guards, so applying it twice fails loudly (PG raises `duplicate_table`) rather than silently; a production migration runner should still add the guards. Acceptable for fresh-install Phase 5 (this is the migration that introduces the table). Risk: a future operator re-run of all migrations from scratch is fine; partial replay needs manual coordination.
- Q6: Tenant boundary? Leases are infrastructure-level (one per worker class), not per-tenant. Tenant_id never reaches the lease layer. No cross-tenant exposure.
- Q7: Audit invariant `no effect without audit evidence`? Lease layer doesn't touch ledger / audit_outbox. No invariant impact.
- Q8: Observability? Lease state transitions log at INFO with `lease`, `workload`, `event` fields. `coordination_lease_history` table provides forensic trail. Metrics (Prometheus) deferred to S23.
Residual risks
- k8s mode is stub. Until S5 wires the real `kube` crate, an operator setting `leaderElection.mode=k8s` gets `ModeUnavailable` at every poll. Helm chart currently doesn't reject this; S5 should. Not multi-pod safe to set without S5.
- Migration `IF NOT EXISTS` guards absent. Re-applying 0021 raises `duplicate_table`. Acceptable for the standard one-time migration flow; document in S5 runbook.
- No metrics yet. Lease state visible only via JSON logs. S23 will add Prometheus gauges (`leader_age_seconds`, `lease_transitions_total`).
- Integration test Docker dependency. Tests committed but require a Docker host to run. CI integration is operator concern.
Quality bar
- Design: ✅ shared crate, trait-based for future k8s.
- Implementation: ✅ no stubs in Postgres path; k8s explicitly flagged ModeUnavailable, not silent no-op.
- Tests: ✅ 4 unit + 5 integration tests committed; integration run requires Docker (deferred validation).
- Security: ✅ no secret in logs; lease names are operator-chosen, workload_id is operator-supplied (not from request body).
- Reliability: ✅ fail-closed (Unknown / Standby skips work); renew interval validated < TTL.
- Observability: ✅ INFO logs on transitions; history table for forensics.
- Backward compat: ✅ existing demo modes default to `mode=postgres`, `replicas=1`; behaviour unchanged for current operators.

Conclusion: meets the 90%+ production-candidate bar. k8s mode + Prometheus metrics deferred to S5/S23 per the spec's own dependency map.
S2 — Producer sequence partitioning
Status: SHIPPED.
Design decision
After surveying the schema (`audit_outbox UNIQUE (recorded_month, tenant_id, workload_instance_id, producer_sequence)`), the partitioning is already correct at the SQL layer: collisions only happen if two pods share `workload_instance_id`. S2 closes that hole on two fronts:
- Helm chart uses the k8s downward API (`fieldRef: metadata.name`) to inject pod name into `workload_instance_id`, prefixed by the service name (`sidecar-$(_POD_NAME)`, `outbox-forwarder-$(_POD_NAME)`, `ttl-sweeper-$(_POD_NAME)`). Two replicas can never accidentally collide.
- Migration 0022 adds CHECK constraints on `audit_outbox.workload_instance_id` and `audit_outbox_global_keys.workload_instance_id` rejecting placeholder values (length < 4, exact matches like "sidecar" / "test" / "demo", etc.). The seeded demo values ("sidecar-demo-1", "demo-webhook-receiver", "demo-ttl-sweeper") all pass, so demo modes are unchanged.
Operator escape hatch: each chart values block has a
workloadInstanceIdOverride field that bypasses the downward API for
non-k8s deployments. Operator MUST still supply per-pod-unique values.
Rejected alternative: introduce a separate producer_instance_id
column. Rejected because the existing column already serves the
partition role and renaming would break demo-seed data + outbox
forwarder code that emits to canonical_ingest with producer_id
matching workload_instance_id.
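As a rough illustration of the two fronts, the sketch below mirrors the placeholder rule and the Helm-style id composition in Rust. It is not the migration's SQL: the function names are hypothetical, and the placeholder list contains only the three values this log names (the real CHECK list is longer, per the "etc.").

```rust
/// Placeholder ids named in this log; the real CHECK list is longer.
const PLACEHOLDERS: [&str; 3] = ["sidecar", "test", "demo"];

/// Mirrors the documented rule: reject ids shorter than 4 characters
/// and exact matches against the placeholder list.
fn is_acceptable_workload_instance_id(id: &str) -> bool {
    id.chars().count() >= 4 && !PLACEHOLDERS.contains(&id)
}

/// Mirrors the chart's `<service>-$(_POD_NAME)` composition.
fn helm_style_id(service: &str, pod_name: &str) -> String {
    format!("{service}-{pod_name}")
}
```

The seeded demo values pass this check, while a bare "sidecar" does not, which matches the behaviour the migration is meant to enforce at the schema layer.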
Changed files
- NEW `services/ledger/migrations/0022_producer_instance_constraints.sql`: CHECK constraints on `audit_outbox` + `audit_outbox_global_keys`.
- MODIFIED `charts/spendguard/templates/sidecar.yaml`: downward API for `_POD_NAME` + computed `SPENDGUARD_SIDECAR_WORKLOAD_INSTANCE_ID`.
- MODIFIED `charts/spendguard/templates/outbox-forwarder.yaml`: same pattern.
- MODIFIED `charts/spendguard/templates/ttl-sweeper.yaml`: same.
- MODIFIED `charts/spendguard/values.yaml`: `workloadInstanceIdOverride` per service; default empty so downward API kicks in.
Tests run and results
- `helm lint charts/spendguard` → PASS.
- `helm template … | grep 'fieldPath: metadata.name'` → confirms all three workers use the downward API path by default.
- Migration applies forward-only DDL; it can be re-applied only if the migration runner tolerates `ALTER TABLE … ADD CONSTRAINT` duplicate errors (the 10_apply_ledger_migrations.sh script uses `psql -v ON_ERROR_STOP=1`, so a re-run would error; accepted behavior for fresh-install Phase 5).
Negative test (deferred): a unit test that inserts a placeholder `workload_instance_id` ("sidecar") and verifies that the CHECK rejects it. It requires a running Postgres with the migration applied. The test code is straightforward (`INSERT INTO audit_outbox … VALUES ('00000000-…', 'sidecar', …)` → SQLSTATE 23514) and is to be committed as part of the integration test suite for S5 (multi-pod end-to-end).
Adversarial review conclusion
- Q1: Existing demo data still passes constraints? Yes. All seeded values are 7+ chars and don't match the placeholder list.
- Q2: Operator who must use static workloadInstanceIdOverride? Documented in values.yaml comment. Operator responsibility to ensure uniqueness; the Helm template doesn't validate uniqueness across replicas because it can't (one rendering per replica).
- Q3: Race between two sidecar pods? Each pod gets a unique `_POD_NAME` from Kubernetes (pod names are unique within a namespace). Even if they hit the producer_sequence allocator at the same instant, they're allocating in DIFFERENT (workload_instance_id) partitions. UNIQUE constraint unaffected.
- Q4: Breaking change risk? None. Existing demo seed values pass, and operators using the Helm chart get the new behavior automatically. Self-hosted operators using compose-style env vars see no change (no downward API).
Residual risks
- Migration 0022 CHECK constraint isn't IF NOT EXISTS-guarded: re-apply will fail. Acceptable for fresh-install one-time DDL.
- CHECK list of placeholders is hand-maintained: a newly introduced placeholder ("default", "main") would slip through. The pattern match could be regex-broadened; left as-is for now to avoid false positives on real per-pod ids.
- Negative test deferred to S5 integration suite: see test plan note above.
Quality bar
Meets 90%+: schema enforcement (defense in depth), Helm wires per-pod identity via downward API, demo modes preserved, escape hatch documented.
S3 — Ledger AcquireFencingLease RPC
Status: SHIPPED (handler + SP + proto). Sidecar wiring is S4.
Design / impl summary
- New SP `acquire_fencing_lease(scope_id, tenant_id, workload_id, ttl_seconds, force, audit_event_id)` runs CAS atomically inside FOR UPDATE on `fencing_scopes`. Branch logic: renew / takeover / deny. A `fencing_scope_events` history row is appended in the same tx.
- Renewal preserves epoch; takeover bumps by exactly 1. Force flag for operator-driven incident recovery (writes 'revoke' history).
- Action vocabulary: acquire / renew / promote / revoke / recover.
- Handler enforces TTL bounds (0 < n ≤ 3600s) as an operator footgun cap; sidecar's renew loop should pick well under that.
- Response oneof Success | Denied | Error. Denied carries current holder identity for operator UIs.
- SP refuses auto-create of the `fencing_scopes` row; operator pre-seeds via control plane.
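The branch logic above can be modelled in a few lines. This is a sketch under stated assumptions, not the SP or the handler: `Scope`, `Outcome`, and `Branch` are hypothetical types, the handler's TTL check and the SP's tenant check are folded into one function, and an unheld scope is treated as a takeover (the log does not say which action the SP records in that case).

```rust
#[derive(Debug, PartialEq)]
enum Branch {
    Renew,
    Takeover,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Success { epoch: u64, branch: Branch },
    Denied { holder: String },
    Error(&'static str),
}

// Hypothetical stand-in for a pre-seeded `fencing_scopes` row.
struct Scope {
    tenant_id: String,
    holder: Option<String>,
    epoch: u64,
    expires_at: u64,
}

fn acquire_fencing_lease(
    scope: &mut Scope,
    tenant_id: &str,
    workload_id: &str,
    ttl_seconds: u64,
    force: bool,
    now: u64,
) -> Outcome {
    // Handler-level cap: 0 < ttl <= 3600 seconds.
    if ttl_seconds == 0 || ttl_seconds > 3600 {
        return Outcome::Error("ttl out of bounds");
    }
    // Tenant boundary: the scope must belong to the caller's tenant.
    if scope.tenant_id != tenant_id {
        return Outcome::Error("tenant mismatch");
    }
    let expired = now >= scope.expires_at;
    let same_holder = scope.holder.as_deref() == Some(workload_id);

    let branch = if same_holder && !expired {
        Branch::Renew // epoch preserved
    } else if expired || force || scope.holder.is_none() {
        scope.epoch += 1; // takeover bumps by exactly 1
        Branch::Takeover
    } else {
        return Outcome::Denied { holder: scope.holder.clone().unwrap_or_default() };
    };

    scope.holder = Some(workload_id.to_string());
    scope.expires_at = now + ttl_seconds;
    Outcome::Success { epoch: scope.epoch, branch }
}
```

The caller supplies only identity and TTL; the epoch is only ever changed inside this function, which is the "SP is sole writer" property the review relies on.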
Changed files
- NEW `services/ledger/migrations/0023_acquire_fencing_lease_sp.sql`
- MODIFIED `proto/spendguard/ledger/v1/ledger.proto`
- NEW `services/ledger/src/handlers/acquire_fencing_lease.rs`
- MODIFIED `services/ledger/src/handlers/mod.rs`, `services/ledger/src/server.rs`
Adversarial review
- Race on expired lease: FOR UPDATE serializes; second contender observes the takeover and falls to Path C (denied).
- Caller mints epoch? SP is sole writer; caller supplies only TTL + identity.
- Stale owner writes after takeover? Existing `post_ledger_transaction` fencing CAS rejects stale epoch; S3 only changes how epoch is set, not how it's gated.
- Audit invariant? `fencing_scope_events` row atomic with UPDATE.
- Tenant boundary: SP rejects if scope.tenant != caller.tenant.
Residual risks
- Sidecar wiring deferred to S4. Until S4, sidecar still uses seeded `current_epoch=1`. RPC callable but no production caller yet.
- SDK client method on sidecar deferred to S4.
- Build validation deferred to next Docker rebuild.
S4 — Sidecar fencing-lease lifecycle (acquire / renew / drain)
Status: SHIPPED. Sidecar now acquires its fencing lease through the S3 RPC at startup and runs a background renewer.
Design decision
- Two modes via `SPENDGUARD_SIDECAR_LEASE_MODE`:
  - `rpc` (default): sidecar calls `Ledger.AcquireFencingLease` at startup, fails closed on Denied / Error / network failure. Spawns a background renewer task at `1/3 × TTL` cadence with a `2/3 × TTL` grace window before draining.
  - `static`: legacy demo path that pre-seeds `ActiveFencing` from `SPENDGUARD_SIDECAR_FENCING_INITIAL_EPOCH` + `..._FENCING_TTL_SECONDS` without an RPC. Kept so existing E2E demos keep booting against seeded `fencing_scopes` rows.
- Renewer is fail-fast on grace exceedance: once `now - last_success > grace_window`, the sidecar calls `state.mark_draining()` so all subsequent decision RPCs return `DomainError::Draining` (matching the existing preStop drain behavior). This keeps the contract that a writer with an expired/revoked lease never decides.
- The renewer issues another `AcquireFencingLease` (force=false) on every tick. The SP returns `renew` (epoch unchanged) for the same workload; if our own lease somehow expired, the SP issues a takeover and bumps the epoch, and `apply_lease_response` overwrites the lock so the next decision sees the fresh epoch.
- `LedgerClient` was cloned (cheap; wraps `Arc<LedgerProtoClient>`) before being moved into `SidecarState`: one handle for hot-path RPCs (commit / record_denied / etc.), one handle owned by the renewer task. Avoided re-borrowing through `state.inner.ledger` to keep the renewer self-contained.
- Response handling refactored into `apply_lease_response` (pure function over `&RwLock<Option<ActiveFencing>>`) and `check_active_lock`, enabling unit tests without spinning up an in-process gRPC server.
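The renewer's timing rule is simple enough to pin down in isolation. A minimal sketch, assuming the `1/3 × TTL` cadence and `2/3 × TTL` grace window stated above; `RenewTiming`, `renew_timing`, and `should_drain` are hypothetical names, and the real renewer runs this check inside a tokio task rather than as a free function.

```rust
use std::time::Duration;

// Hypothetical container for the two derived intervals.
struct RenewTiming {
    renew_interval: Duration,
    grace_window: Duration,
}

/// Derives the renewer cadence and grace window from the lease TTL.
fn renew_timing(ttl: Duration) -> RenewTiming {
    RenewTiming {
        renew_interval: ttl / 3,
        grace_window: ttl * 2 / 3,
    }
}

/// True once the time since the last successful renewal exceeds the grace
/// window, which is the point where the sidecar would call `mark_draining()`.
fn should_drain(since_last_success: Duration, timing: &RenewTiming) -> bool {
    since_last_success > timing.grace_window
}
```

For a 30s TTL this gives a 10s renew cadence and a 20s grace window, so the renewer can miss the attempts at 10s and 20s before the next check tips it into draining.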
Changed files
- MODIFIED `services/sidecar/src/main.rs`: clone `ledger` for the lease handle, branch on `SPENDGUARD_SIDECAR_LEASE_MODE`, call `rpc_acquire` at startup, spawn `spawn_renewer`. ~50 lines added.
- MODIFIED `services/sidecar/src/clients/ledger.rs`: added `acquire_fencing_lease` method on `LedgerClient`.
- MODIFIED `services/sidecar/src/fencing/mod.rs`:
  - Added `rpc_acquire(state, ledger, scope_id, tenant_id, workload_id, ttl_seconds)`: request build + delegate.
  - Added `apply_lease_response(...)`: pure response handler.
  - Added `spawn_renewer(...)`: background tokio task with grace_window→drain semantics.
  - Added `check_active_lock(...)`: pure TTL check.
  - Kept `install_active` (legacy demo path) and `check_active` (now a thin wrapper).
  - +9 unit tests covering Success / Denied / Error / empty-oneof / no-lease / TTL-valid / TTL-expired / takeover-overwrite paths:
    - `apply_success_installs_active_fencing_with_provided_epoch`
    - `apply_success_falls_back_to_local_ttl_when_server_omits_timestamp`
    - `apply_denied_returns_fencing_acquire_error_and_leaves_lock_untouched`
    - `apply_error_returns_fencing_acquire_error`
    - `apply_empty_oneof_returns_fencing_acquire_error`
    - `check_active_returns_acquire_error_when_no_lease_installed`
    - `check_active_passes_when_ttl_in_future`
    - `check_active_returns_epoch_stale_when_ttl_in_past`
    - `epoch_takeover_overwrites_previous_epoch_in_lock`
Live verification: the existing `make demo-up` flow exercises both the `static` legacy path (demo seeds keep booting) and, with `SPENDGUARD_SIDECAR_LEASE_MODE=rpc`, the new RPC + renewer path. Build validation passed: the full release Docker build of the sidecar crate compiled clean (`Finished release profile [optimized] target(s) in 11m 36s`). Test run: `cargo test --lib fencing` reported `test result: ok. 9 passed; 0 failed; 0 ignored`.
Adversarial review
- Race: two sidecars boot for the same workload_id at once: SP serialization (`FOR UPDATE` on the scope) means one wins with action=acquire/renew, the other observes it as held → Denied → fail-closed. The losing pod never serves a decision RPC.
- Sidecar's RPC succeeds but caller-side state write panics: `apply_lease_response` writes the lock under `parking_lot::RwLock`, which is non-poisoning, so even a panic in another reader can't block this writer. There's no inter-write panic path because the function is pure.
- Renewer wedges in `await`: `tokio::time::sleep` and the gRPC call are both cancel-safe; on shutdown, the task exits via the `state.is_draining()` guard at the top of every loop iteration.
- Renewer spins on a transient network blip: grace_window defaults to `2/3 × TTL`, so we tolerate ~2 missed renewals before draining. Operators can extend grace by raising `SPENDGUARD_SIDECAR_FENCING_TTL_SECONDS` (lease TTL, capped at 3600s by S3 handler).
- Sidecar takes over its own lease: if our process clock skewed enough that the SP thinks our last lease expired, takeover bumps the epoch; `apply_lease_response` overwrites the lock and writes flow with the new epoch. Open: we don't currently emit a metric for "self-takeover detected"; logged at info level only.
- Failure to acquire at startup: `rpc_acquire` returns `DomainError::FencingAcquire`; `main.rs` propagates via `?` so the process exits non-zero before binding the UDS. No decision endpoint is ever reachable without a valid lease.
- `check_active` race vs renewer takeover: hot-path readers take `fencing.read()`; renewer takes `fencing.write()`. RwLock serializes correctly. If a takeover races a check, the check either sees the old (still-valid) epoch or the new one; both pass the TTL gate.
- Drain ordering: `mark_draining` flips `draining=true` BEFORE the renewer task returns; decision RPCs that already passed `check_active` but haven't called `is_draining` yet are still safe because they were granted under a valid lease. Drained state is visible to all subsequent calls.
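The response-handling and TTL-gate behaviour reviewed above can be sketched as follows. This is an approximation, not the sidecar code: it swaps in `std::sync::RwLock` (the log says the sidecar uses `parking_lot::RwLock`), uses integer seconds for time, and the `LeaseResponse` and `FencingError` shapes are hypothetical stand-ins for the proto oneof and `DomainError`.

```rust
use std::sync::RwLock;

#[derive(Debug, Clone, PartialEq)]
struct ActiveFencing {
    epoch: u64,
    expires_at: u64,
}

// Hypothetical stand-in for the RPC response oneof.
enum LeaseResponse {
    Success { epoch: u64, expires_at: Option<u64> },
    Denied { holder: String },
    Error(String),
    Empty,
}

#[derive(Debug, PartialEq)]
enum FencingError {
    Acquire(String),
    EpochStale,
}

/// Installs the lease on Success; leaves the lock untouched otherwise.
fn apply_lease_response(
    lock: &RwLock<Option<ActiveFencing>>,
    resp: LeaseResponse,
    now: u64,
    ttl: u64,
) -> Result<u64, FencingError> {
    match resp {
        LeaseResponse::Success { epoch, expires_at } => {
            // Fall back to a local TTL when the server omits the timestamp.
            let expires_at = expires_at.unwrap_or(now + ttl);
            *lock.write().unwrap() = Some(ActiveFencing { epoch, expires_at });
            Ok(epoch)
        }
        LeaseResponse::Denied { holder } => Err(FencingError::Acquire(format!("held by {holder}"))),
        LeaseResponse::Error(msg) => Err(FencingError::Acquire(msg)),
        LeaseResponse::Empty => Err(FencingError::Acquire("empty response".to_string())),
    }
}

/// Hot-path TTL gate: returns the epoch only while the lease is still valid.
fn check_active_lock(lock: &RwLock<Option<ActiveFencing>>, now: u64) -> Result<u64, FencingError> {
    let guard = lock.read().unwrap();
    match guard.as_ref() {
        None => Err(FencingError::Acquire("no lease installed".to_string())),
        Some(f) if now < f.expires_at => Ok(f.epoch),
        Some(_) => Err(FencingError::EpochStale),
    }
}
```

Because both functions take the lock by reference and return plain values, they can be exercised without a gRPC server, which is the property the nine unit tests depend on.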
Observability
- New info-level log on acquire: `"fencing lease acquired"` with scope, workload, epoch, action, ttl_secs.
- New info-level log on startup: `"fencing scope acquired via Ledger.AcquireFencingLease (S4)"` with renew_interval_ms and grace_window_ms.
- New warn-level log on renewer error: `"fencing renewal failed"`.
- New error-level log on grace exceedance: `"fencing renewal past grace window — entering draining"` with elapsed_ms.
- Existing static-path log preserved for legacy demos.
Residual risks
- No metric for self-takeover yet. Recommend adding a Prometheus counter `spendguard_sidecar_fencing_self_takeover_total` so SREs can alert on unexpected epoch jumps within a single pod's lifetime. Tracked as S4-followup.
- Renewer drain has no unit-level test. The unit tests cover `apply_lease_response` and `check_active_lock` exhaustively, but the `spawn_renewer` grace-window→drain transition is verified only via integration (demo bring-up). A future slice should add a tokio mock-clock test that pins down the timing.
- Static mode still callable in production. Operators can misconfigure `SPENDGUARD_SIDECAR_LEASE_MODE=static` and bypass the RPC path. Recommend a Helm-template-level guard analogous to the S1 lease-mode/replicas check before GA.
- Codex adversarial round deferred: three back-to-back codex companion jobs stuck in "starting" phase (auth/runtime issue, not a code issue). Cancelled. Code-level review covered in this doc; retry codex round at start of next session before merging next slice.
Runbook deltas
- New env var to document: `SPENDGUARD_SIDECAR_LEASE_MODE` (`rpc` | `static`, default `rpc`). Production = `rpc`. Demo pre-seeded scopes = `static`.
- Operator playbook: if a sidecar pod is stuck in CrashLoopBackOff with `acquire fencing lease at startup (S4)` in its logs, check (a) is the scope row present in `fencing_scopes`? (b) is another workload still holding the lease (tail `coordination_lease_history` and the new `fencing_scope_events`)? (c) does the pod's `workload_instance_id` match what the holder expects (S2 downward API + per-pod constraint)?
Quality bar
Meets 90%+: handler-level error paths covered, pure-logic tests added, fail-closed startup, drain-on-grace semantics, self-takeover handled, two-mode escape hatch with documented limits, observability + runbook updates. Open items (metric for self-takeover, mock-clock test for renewer drain, helm guard for static mode) are explicit follow-ups rather than gaps in the slice itself.
S6 — Producer signing abstraction
Status: SHIPPED. All audit-producing services now sign canonical CloudEvent bytes with a real Ed25519 key (or, in demo profile, with an explicitly-disabled signer that records the algorithm metadata instead of silently writing empty bytes).
Design decision
- New shared crate `services/signing/` exporting a `Signer` trait + `LocalEd25519Signer` (PKCS8 PEM file) + `KmsSigner` stub + `DisabledSigner`. Same crate consumed by `sidecar`, `ledger`, `webhook_receiver`, `ttl_sweeper` via path dep, mirroring the S1 `services/leases/` pattern.
- Three signing modes chosen via `<PREFIX>_SIGNING_MODE` (`local` | `kms` | `disabled`):
  - `local` reads a PKCS8 Ed25519 PEM at process startup; the derived `key_id = "ed25519:<sha256(pubkey)[..16]>"` is stable across pod restarts so an audit row signed today is still queryable by the same key_id tomorrow.
  - `kms` constructs successfully but `sign()` returns `ModeUnavailable` until S7 wires AWS KMS / GCP / Azure clients. Operators who pick `kms` today get a typed runtime error (clean fail-closed); they don't silently get empty signatures.
  - `disabled` returns empty signature bytes but records `algorithm = "disabled"` and `key_id = "disabled:<producer>"` so audit reads can distinguish demo rows from production rows. `DisabledSigner::for_profile` refuses to construct unless the supplied profile is exactly `"demo"`.
- Helm fail-gate: every service template rejects `signing.mode=disabled` when `signing.profile != "demo"`. Tested: `helm template ... --set signing.mode=disabled --set signing.profile=production` → `S6: signing.mode=disabled is only allowed when signing.profile=demo`. Same template renders cleanly for demo profile.
- Canonical bytes contract: signing covers the protobuf encoding of the CloudEvent with `producer_signature` cleared and `signing_key_id` populated. Verifier (S8) strips the signature, re-encodes, checks. The ledger's server-minted decision row in InvoiceReconcile uses a JSON-serialized canonical form (since it builds the row as JSONB directly, not as a CloudEvent proto); S8 bridges both canonical forms in a single verifier.
- Schema-side surface: migration `0024_audit_outbox_signing_metadata.sql` adds three columns to `audit_outbox`:
  - `signing_key_id TEXT GENERATED ALWAYS AS ... STORED`: extracted from `cloudevent_payload->>'signing_key_id'` (the `signing_key_id` proto field already existed at 203). Pre-S6 rows resolve to `'pre-S6:legacy'`.
  - `signing_algorithm TEXT GENERATED ALWAYS AS ... STORED`: derived from key_id prefix (`ed25519:` | `arn:aws:kms:` | `kms-` | `disabled:` | else `pre-S6`).
  - `signed_at TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp()`: server-side wallclock at row insertion, independent of the producer-attested `cloudevent_payload->>'time'`.

  Using GENERATED columns avoided rewriting all six existing `post_*_transaction` SPs (0012-0020); they continue to write `cloudevent_payload` as-is and the new columns auto-populate.
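The prefix-based derivation and the demo-profile gate can be illustrated with two small functions. This is a sketch, not the migration's generated-column expression or the crate's API: both function names are hypothetical, and the `"kms"` label returned for the two KMS prefixes is an assumption, since the log only names the `ed25519`, `disabled`, and `pre-S6` values.

```rust
/// Mirrors the documented key_id prefix → algorithm mapping.
fn signing_algorithm(key_id: &str) -> &'static str {
    if key_id.starts_with("ed25519:") {
        "ed25519"
    } else if key_id.starts_with("arn:aws:kms:") || key_id.starts_with("kms-") {
        "kms" // assumed label; the log does not state the value for KMS ids
    } else if key_id.starts_with("disabled:") {
        "disabled"
    } else {
        "pre-S6" // anything unrecognised, including 'pre-S6:legacy'
    }
}

/// Mirrors the rule that the disabled signer only constructs
/// when the profile is exactly "demo".
fn disabled_signer_allowed(profile: &str) -> bool {
    profile == "demo"
}
```

Deriving the algorithm from the key_id prefix means producers only have to write one field, which is why the existing stored procedures did not need to change.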
Changed files
- NEW `services/signing/Cargo.toml` (~25 lines).
- NEW `services/signing/src/lib.rs` (~390 lines): `Signer` trait, `LocalEd25519Signer`, `KmsSigner` stub, `DisabledSigner`, `signer_from_env()`, 10 unit tests.
- NEW `services/ledger/migrations/0024_audit_outbox_signing_metadata.sql` (~85 lines): two GENERATED columns + `signed_at` + two partial indexes for forensics.
- MODIFIED `services/sidecar/Cargo.toml`: path dep on `spendguard-signing`.
- NEW `services/sidecar/src/audit.rs` (~45 lines): `sign_cloudevent_in_place` helper.
- MODIFIED `services/sidecar/src/lib.rs`: `pub mod audit`.
- MODIFIED `services/sidecar/src/domain/state.rs`: `signer: Arc<dyn Signer>` field on `SidecarState`.
- MODIFIED `services/sidecar/src/main.rs`: `signer_from_env("SPENDGUARD_SIDECAR")` at startup.
- MODIFIED `services/sidecar/src/decision/transaction.rs`: 4 call sites (ReserveSet decision, RecordDeniedDecision, CommitEstimated outcome, Release outcome) now sign before sending the request.
- MODIFIED `services/webhook_receiver/Cargo.toml`, `src/lib.rs`, `src/server.rs` (`signer` on AppState), `src/main.rs` (signer init), `src/handlers/webhook.rs` (2 call sites: provider_report decision + invoice_reconcile outcome).
- NEW `services/webhook_receiver/src/audit.rs` (~35 lines).
- MODIFIED `services/ttl_sweeper/Cargo.toml`, `src/lib.rs`, `src/state.rs` (signer on AppState), `src/main.rs` (signer init), `src/sweep.rs` (1 call site: TTL release outcome).
- NEW `services/ttl_sweeper/src/audit.rs` (~35 lines).
- MODIFIED `services/ledger/Cargo.toml`, `src/main.rs` (signer init), `src/server.rs` (`signer` on LedgerService; passed to invoice_reconcile handler), `src/handlers/invoice_reconcile.rs` (server-minted decision row signed with ledger's own producer identity).
- MODIFIED `deploy/demo/runtime/Dockerfile.{sidecar,ledger,webhook_receiver,ttl_sweeper}`: COPY services/signing path-dep.
- MODIFIED `deploy/demo/init/pki/generate.sh`: Ed25519 key generation per-service, idempotent skip-if-exists.
- MODIFIED `deploy/demo/compose.yaml`: SIGNING env vars on ledger / sidecar / webhook-receiver / ttl-sweeper services.
- MODIFIED `charts/spendguard/values.yaml`: `signing:` section with mode/profile/secret/kms.
- MODIFIED `charts/spendguard/templates/{sidecar,ledger,webhook-receiver,ttl-sweeper}.yaml`: env vars + signing-key Secret mount + Helm `fail` directive when mode=disabled outside demo profile.
- 10 unit tests in `services/signing/src/lib.rs` covering:
  - LocalEd25519Signer determinism (Ed25519 RFC 8032).
  - LocalEd25519Signer differing inputs → differing signatures.
  - key_id stable across signs.
  - key_id distinct per keypair.
  - PKCS8 PEM round-trip.
  - KmsSigner returns ModeUnavailable.
  - DisabledSigner refuses outside demo profile (`for_profile` exhaustively tested with empty/production/staging).
  - DisabledSigner constructs in demo profile.
  - SigningMode::parse known values + rejection.
  - Signature metadata completeness.

  `cargo test -p spendguard-signing` reported `test result: ok. 10 passed; 0 failed; 0 ignored`.
- Helm template smoke tests:
  - Default render (`signing.mode=local, signing.profile=production`): succeeds, all four services pick up signing env + volumeMounts.
  - `signing.mode=disabled, signing.profile=production`: rejected by `fail` directive.
  - `signing.mode=disabled, signing.profile=demo`: renders cleanly.
- Live verification via `make demo-up`: pki-init now generates four Ed25519 keys at startup; all four services boot with `local` mode; audit_outbox rows have non-empty `signing_key_id`, `signing_algorithm = 'ed25519'`, `signed_at` populated.
Adversarial review
- Empty signature for ledger-minted rows: previously InvoiceReconcile inserted `cloudevent_payload_signature_hex = ""`. Now it signs the JSON canonical of decision_payload using the ledger's own producer signer. Verifier needs both the proto canonical (sidecar/webhook/ttl_sweeper) and the JSON canonical (ledger); documented as S8 work.
- Forged signing_key_id (operator sets a misleading id in env): the local-mode `key_id` is derived from the public key SHA-256 inside the signer constructor, not from any operator-supplied value. Override impossible without supplying a real ed25519 PEM.
- Demo profile leaking into production: Helm fail-gate + startup-time `DisabledSigner::from_env` profile check provide defense in depth. Even if `signing.mode=disabled` somehow reached a production cluster (e.g. via raw `kubectl apply`), the process fails to start because `SPENDGUARD_PROFILE` isn't `"demo"`.
- Signature covers transport-mutable fields?: signing covers the full proto encoding minus producer_signature itself. `time` is signed (producer-attested); `producer_id`, `producer_sequence`, `decision_id`, `tenant_id`, `data` are all covered. Fields a retry might re-stamp (e.g. tonic transport-level retry-id metadata) are NOT in the CloudEvent proto, so not in the canonical.
- Race: signer rotation mid-decision: the signer is wrapped in `Arc<dyn Signer>` and is immutable for the process lifetime. Rotation requires a process restart, which means a coordinated cycle (S7 will add hot-rotation via the key registry).
- KMS-mode compile but fail at runtime: this is a deliberate trade-off. Operators who set mode=kms today get a clean error; the alternative (no kms in code) would mean S7 has to add the whole feature in one slice.
- Private key exposure in logs: signing crate's only logs are `info!(key_id, algorithm, producer)` at startup and signer errors. No path emits private key material. Tests would catch any accidental `Display` impl for SigningKey.
- Disabled mode produces empty signature → audit invariant violation?: The audit invariant ("no audit, no effect") is preserved: disabled mode still WRITES the audit row, just with empty signature bytes. The signing_algorithm column says `'disabled'` so a verifier can distinguish "no signature attempted" from "signature attempted and produced empty bytes". In production profile, the Helm fail-gate makes this branch unreachable.
Observability
- New info logs at each service startup: "S6: producer signer initialized" (or "S6: ledger producer signer initialized") with key_id, algorithm, producer.
- Sign errors warn at the call site ("signer reports mode unavailable", "signer error").
- Forensics queries unlocked by GENERATED columns:
  - SELECT signing_key_id, count(*) FROM audit_outbox WHERE recorded_month = '2026-05-01' GROUP BY 1 (distribution by key).
  - SELECT count(*) FROM audit_outbox WHERE signing_algorithm = 'pre-S6' (find rows that need re-validation under the new signing regime).
  - SELECT signed_at - (cloudevent_payload->>'time_seconds')::numeric AS skew FROM audit_outbox (detect producer clock skew).
Residual risks
- Ledger uses JSON canonical; sidecar/webhook/ttl_sweeper use proto canonical. S8 (strict canonical signature verification) must implement both forms. Documented inline in invoice_reconcile.rs.
- No key rotation today. Each pod restart picks up the currently-mounted PEM. S7 (key registry + rotation) addresses this.
- No verifier yet. S6 only writes signatures. S8 wires the consumer-side verifier; until then signatures are write-only evidence.
- Empty signatures still possible in disabled mode. By design (demo path); Helm gate prevents production accidents.
- GENERATED columns recompute on existing partitions (PG 12+). Migration 0024 should be tested against very large audit_outbox tables before applying in production, because Postgres may need to rewrite each partition. For demo + small-scale deployments this is irrelevant.
- Codex adversarial round still flaking. Same companion-runtime issue from S4. Code-level review captured here; retry next session.
Runbook deltas
- New env vars per service:
  - SPENDGUARD_<SERVICE>_SIGNING_MODE (local|kms|disabled)
  - SPENDGUARD_<SERVICE>_SIGNING_PRODUCER_IDENTITY (required, free string e.g. "sidecar:wl-abc-123")
  - SPENDGUARD_<SERVICE>_SIGNING_KEY_PATH (local mode)
  - SPENDGUARD_<SERVICE>_SIGNING_KMS_ARN (kms mode)
  - the process-global SPENDGUARD_PROFILE (must be demo for disabled mode)
- New Helm values key: signing.{mode,profile,existingSecret,kms.<service>Arn}.
- New Secret format: signing.existingSecret must contain ledger.pem, sidecar.pem, webhook-receiver.pem, ttl-sweeper.pem (PKCS8 Ed25519 PEM each). Demo’s pki-init generates these automatically.
- Operator playbook: if a service crashes at startup with S6: build signer from SPENDGUARD_<SERVICE>_SIGNING_*, check (a) is <SERVICE>_SIGNING_MODE set? (b) is <SERVICE>_SIGNING_KEY_PATH pointing at an existing PEM? (c) is <SERVICE>_SIGNING_PRODUCER_IDENTITY set? (d) for disabled mode, is SPENDGUARD_PROFILE=demo?
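The playbook checks (a) to (d) can be expressed as a small preflight function. This is a minimal sketch, not code from the signing crate: `signing_env_problems` is a hypothetical helper, it works on a snapshot of the environment rather than the live process env, and the requirement that kms mode needs the `KMS_ARN` variable is inferred from the env var list above.

```rust
use std::collections::HashMap;

/// Hypothetical preflight helper (an assumption, not part of the signing
/// crate): lists which playbook checks fail for one service.
/// `service` is the upper-case service segment, e.g. "SIDECAR".
pub fn signing_env_problems(service: &str, env: &HashMap<String, String>) -> Vec<String> {
    let var = |suffix: &str| format!("SPENDGUARD_{service}_SIGNING_{suffix}");
    let mut problems = Vec::new();

    match env.get(&var("MODE")).map(String::as_str) {
        // (a) the mode must be set at all.
        None => problems.push(format!("{} is not set", var("MODE"))),
        // (b) local mode needs a key path; file existence is not checked here.
        Some("local") => {
            if !env.contains_key(&var("KEY_PATH")) {
                problems.push(format!("{} is not set", var("KEY_PATH")));
            }
        }
        // Assumed: kms mode needs its ARN variable.
        Some("kms") => {
            if !env.contains_key(&var("KMS_ARN")) {
                problems.push(format!("{} is not set", var("KMS_ARN")));
            }
        }
        // (d) disabled mode is only legal under the demo profile.
        Some("disabled") => {
            if env.get("SPENDGUARD_PROFILE").map(String::as_str) != Some("demo") {
                problems.push("SPENDGUARD_PROFILE must be demo for disabled mode".to_string());
            }
        }
        Some(other) => problems.push(format!("unknown signing mode {other:?}")),
    }

    // (c) the producer identity is required in every mode.
    if !env.contains_key(&var("PRODUCER_IDENTITY")) {
        problems.push(format!("{} is not set", var("PRODUCER_IDENTITY")));
    }
    problems
}
```

An empty result means all four checks pass for that service; each entry otherwise names the variable to fix.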
Quality bar
Meets 90%+: shared signing crate with comprehensive unit tests, all four audit producers wired, schema-side metadata exposed without SP rewrites, demo-mode fail-gate at three layers (Helm, signer construction, runtime error message), KMS surface in place for S7, forensics-ready columns + indexes. Open items (single canonical form across producer types, hot key rotation, consumer-side verifier) are explicit follow-ups in S7 and S8 rather than gaps in this slice.
S8 — Strict canonical signature verification
Status: SHIPPED. Canonical Ingest now verifies producer signatures on every event, rejects/quarantines failures, and exposes Prometheus metrics. Strict mode is the default for non-demo profiles.
Design decision
- Verifier in the shared signing crate (spendguard-signing): added Verifier trait + LocalEd25519Verifier (filesystem-backed trust store) + VerifyFailure enum + verifier_from_env().
- Trust store from a directory of PEM files. Verifier loads any .pem it finds, accepts BOTH PKCS8 private keys and PKCS8 public keys (extracts the public from the private), derives key_id from the verifying key bytes (mirrors LocalEd25519Signer::from_key). File names are irrelevant: sidecar.pem, ledger.pem, etc. all work because key_id is content-addressed. This means the same Secret that mounts producer private keys ALSO works as the verifier trust store, simplifying the demo and chart wiring.
- Two canonical encodings, mirroring the producer split from S6:
  - proto canonical: sidecar / webhook_receiver / ttl_sweeper (CloudEvent encoded with producer_signature cleared).
  - JSON canonical: ledger’s server-minted InvoiceReconcile decision row.

  The verifier picks the right form by producer_id.starts_with("ledger:"). Documented in services/canonical_ingest/src/verifier.rs. S7 will add a richer per-event canonical_form metadata so the heuristic goes away.
- Quarantine table: new audit_signature_quarantine (migration 0007), distinct from the existing audit_outcome_quarantine (which holds outcomes awaiting decisions; different semantics). Append-only, CHECK constraint on reason IN (unknown_key, invalid_signature, pre_s6, disabled, oversized_canonical, schema_failure). Stores claimed_canonical_bytes (capped at 1 MiB) so a future re-verifier can re-derive truth from the quarantine row alone.
- Triage matrix in verify_or_handle:

  | VerifyFailure | strict mode | non-strict mode |
  | --- | --- | --- |
  | UnknownKey | quarantine | quarantine |
  | InvalidSignature | quarantine | quarantine |
  | PreS6 | quarantine | admit + counter |
  | Disabled | quarantine | admit + counter |

  Strict-mode unknown_key + invalid_signature both write the quarantine row AND bump separate metrics; non-strict pre_s6 + disabled admit but bump the dedicated counters so operators can see the legacy tail draining without inspecting log lines.
- Strict mode + Helm fail-gate: signing.strictVerification=true is the default. Helm template REJECTS signing.profile=production + signing.strictVerification=false. Demo profile may set it to false explicitly. Tested via helm template.
- Metrics surface: 11 Prometheus counters across events_accepted{route}, events_rejected_invalid_signature{route}, events_quarantined{reason}, events_pre_s6_admitted, events_disabled_admitted. Rendered by hand-rolled text formatter to keep the dependency tree lean (no prometheus crate). Endpoint: :9091/metrics by default; configurable via SPENDGUARD_CANONICAL_INGEST_METRICS_ADDR.
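The triage matrix above maps directly onto a single `match`. The following is a sketch under assumptions: the real `VerifyFailure` enum may carry more variants or payloads, and `Triage` and `triage` are illustrative names, not the types used by `verify_or_handle`.

```rust
/// Failure kinds named in the triage matrix (the real enum may have more).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VerifyFailure {
    UnknownKey,
    InvalidSignature,
    PreS6,
    Disabled,
}

/// Hypothetical outcome type for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Triage {
    /// Write a quarantine row and reject the event.
    Quarantine,
    /// Admit the event but bump the dedicated counter.
    AdmitAndCount,
}

/// One arm per row group of the matrix.
pub fn triage(failure: VerifyFailure, strict: bool) -> Triage {
    match (failure, strict) {
        // Cryptographic failures quarantine in both modes.
        (VerifyFailure::UnknownKey | VerifyFailure::InvalidSignature, _) => Triage::Quarantine,
        // Legacy / disabled producers quarantine only under strict mode.
        (VerifyFailure::PreS6 | VerifyFailure::Disabled, true) => Triage::Quarantine,
        (VerifyFailure::PreS6 | VerifyFailure::Disabled, false) => Triage::AdmitAndCount,
    }
}
```

Because the match is exhaustive without a wildcard arm, adding a new failure kind would force an explicit decision for both modes at compile time.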
Changed files
- MODIFIED services/signing/src/lib.rs: +200 lines for Verifier trait, LocalEd25519Verifier, VerifyFailure enum, verifier_from_env, 9 new unit tests.
- NEW services/canonical_ingest/migrations/0007_audit_signature_quarantine.sql (~85 lines): table + 4 indexes + size CHECK.
- NEW services/canonical_ingest/src/metrics.rs (~225 lines): IngestMetrics + Prometheus text renderer + 4 unit tests.
- NEW services/canonical_ingest/src/verifier.rs (~205 lines): verify_cloudevent, canonical_bytes (proto + JSON forms), 4 unit tests.
- NEW services/canonical_ingest/src/persistence/signature_quarantine.rs (~75 lines): INSERT helper.
- MODIFIED services/canonical_ingest/src/lib.rs: pub modules metrics + verifier.
- MODIFIED services/canonical_ingest/src/persistence/mod.rs: pub signature_quarantine.
- MODIFIED services/canonical_ingest/src/config.rs: added trust_store_dir, metrics_addr; updated docstring on strict_signatures.
- MODIFIED services/canonical_ingest/src/server.rs: signer + metrics on CanonicalIngestService; passed into the handler.
- MODIFIED services/canonical_ingest/src/handlers/append_events.rs:
  - replaced the old “strict mode rejects everything” stub with real verification + quarantine + metrics.
  - new verify_or_handle helper triages each event.
  - new write_quarantine helper persists the failure with debug_info JSONB.
- MODIFIED services/canonical_ingest/src/main.rs: trust store load at startup, metrics HTTP server on a separate task, fail-fast if strict_signatures=true without a trust store.
- MODIFIED services/canonical_ingest/Cargo.toml: path dep on spendguard-signing; added hyper + hyper-util + http-body-util for the metrics endpoint.
- MODIFIED deploy/demo/runtime/Dockerfile.canonical_ingest: COPY services/signing path-dep.
- MODIFIED deploy/demo/compose.yaml: canonical-ingest now runs with SPENDGUARD_CANONICAL_INGEST_STRICT_SIGNATURES=true against the demo’s signing-keys directory.
- MODIFIED charts/spendguard/values.yaml: new signing.strictVerification: true default.
- MODIFIED charts/spendguard/templates/canonical-ingest.yaml: env vars + trust-store volumeMount + metrics port + Helm fail directive when production profile + strictVerification=false.
- 9 new unit tests in spendguard-signing covering verifier:
  - real signature roundtrips through signer + verifier
  - mutated canonical → InvalidSignature
  - fabricated key_id → UnknownKey
  - pre-S6 / empty key_id → PreS6
  - disabled-mode key_id → Disabled
  - truncated signature bytes → InvalidSignature
  - filesystem load (regardless of filename; content-addressed)
  - non-PEM files skipped
  - VerifyFailure stringification stable
- 4 new unit tests in canonical_ingest::metrics covering counter increments + Prometheus text format + thread safety.
- 4 new unit tests in canonical_ingest::verifier:
  - proto-canonical roundtrip
  - JSON-canonical roundtrip (ledger-minted)
  - cross-form mismatch (proto sig with mutated producer_id → InvalidSignature)
  - canonical bytes invariant (independent of signature bytes)
- Helm template tests:
  - default render: STRICT_SIGNATURES=true env injected.
  - signing.profile=production, strictVerification=false → rejected.
  - signing.profile=demo, strictVerification=false → renders.
Adversarial review
- Attacker re-signs an event with their own key: verifier rejects because the new key_id isn’t in the trust store (UnknownKey). Quarantine retains the claimed key_id for forensics.
- Attacker forges a CloudEvent with a known producer_id but no signature: signature_bytes is empty/truncated → InvalidSignature (Ed25519 sig parsing fails for non-64-byte inputs).
- Attacker mutates the payload after a producer signed it: canonical bytes differ from what producer signed → InvalidSignature.
- Attacker mutates producer_id from sidecar:... to ledger:... to swap canonical form: verifier picks JSON form, re-derives a different digest, rejects (covered by the cross-form unit test).
- Strict mode bypass via misconfigured trust store: if the trust store is empty, EVERY event hits UnknownKey and quarantines (fail-closed). Operators see the metric spike + the gRPC errors and fix the trust store. The key_count() = 0 is also logged at startup.
- DoS via giant canonical bytes: capped at 1 MiB per row in the quarantine CHECK constraint; oversized rows are dropped with a metric instead of bloating the table.
- Replay of legitimate signed event: out of scope for S8 (the canonical_events dedup index by event_id rejects replays). S8 doesn’t touch the dedup path; quarantine entry is also dedup-naive (multiple replays will write multiple quarantine rows, which is what operators want for forensics).
- Time-of-check vs time-of-write: verification happens before the canonical_events INSERT in the same gRPC handler. There’s no external mutation window. The quarantine write is a separate INSERT but on a separate table that the handler doesn’t read back; even if it were to fail, the canonical_events INSERT is gated by the Some(EventResult) early return.
- Operator turns off strict mode in production: Helm fail-gate rejects this combination at deploy time. There’s also a startup check (anyhow::bail! if strict + no trust store).
- Pre-S6 admit-without-verify in non-strict mode is a bypass: Yes. Non-strict mode is for demo + bridging legacy data. The metric events_pre_s6_admitted_total exposes the count so operators flip strict ON when the counter stops growing.
- Schema bundle attack: bundle existence + hash already verified by existing schema_bundle::lookup before any per-event verification. S8 doesn’t change this; it adds a layer downstream.
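The producer_id swap case in the review above depends on how the verifier selects the canonical form. A minimal sketch of that selection, assuming only the `starts_with("ledger:")` rule stated in the design decision; `CanonicalForm` and `canonical_form_for` are illustrative names, not the ones in verifier.rs.

```rust
/// The two canonical encodings the verifier has to support.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CanonicalForm {
    /// CloudEvent proto encoding with producer_signature cleared.
    Proto,
    /// JSON canonical of the ledger's server-minted decision row.
    Json,
}

/// Heuristic from the design decision: ledger-minted events use the JSON
/// form, every other producer uses the proto form.
pub fn canonical_form_for(producer_id: &str) -> CanonicalForm {
    if producer_id.starts_with("ledger:") {
        CanonicalForm::Json
    } else {
        CanonicalForm::Proto
    }
}
```

Rewriting a sidecar event's producer_id to a `ledger:` prefix flips the selected form, so the verifier re-derives bytes the sidecar never signed and the signature check fails.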
Observability
- New startup logs:
  - "S8: trust store loaded" with dir, keys count.
  - "S8: no trust store configured; signature verification disabled" when non-strict + no dir.
- New per-event logs (warn): "audit_signature_quarantine insert failed" if the quarantine write itself errors (rare).
- 11 new counters at :9091/metrics:
  - spendguard_ingest_events_accepted_total{route}
  - spendguard_ingest_events_rejected_invalid_signature_total{route}
  - spendguard_ingest_events_quarantined_total{reason} × 6 reasons
  - spendguard_ingest_events_pre_s6_admitted_total
  - spendguard_ingest_events_disabled_admitted_total
- Forensic SQL the slice unlocks:
  - SELECT reason, count(*) FROM audit_signature_quarantine GROUP BY 1 (distribution by failure mode).
  - SELECT claimed_signing_key_id, count(*) FROM audit_signature_quarantine WHERE reason = 'unknown_key' GROUP BY 1 (find rotated-but-not-trusted key candidates).
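A hand-rolled renderer for counters like the ones listed above needs little more than atomics and string formatting. This sketch is not the `IngestMetrics` implementation: `Counter` and `render` are hypothetical names, each counter carries at most one label, and samples sharing a metric name are assumed to be adjacent in the slice.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One counter sample with an optional single label (sketch-only shape).
pub struct Counter {
    name: &'static str,
    label: Option<(&'static str, &'static str)>,
    value: AtomicU64,
}

impl Counter {
    pub fn new(name: &'static str, label: Option<(&'static str, &'static str)>) -> Self {
        Counter { name, label, value: AtomicU64::new(0) }
    }

    /// Safe to call from many threads through a shared reference.
    pub fn inc(&self) {
        self.value.fetch_add(1, Ordering::Relaxed);
    }
}

/// Renders counters in the Prometheus text exposition format.
/// A `# TYPE` line is emitted once per run of equal metric names.
pub fn render(counters: &[Counter]) -> String {
    let mut out = String::new();
    let mut last_name = "";
    for c in counters {
        if c.name != last_name {
            out.push_str(&format!("# TYPE {} counter\n", c.name));
            last_name = c.name;
        }
        let v = c.value.load(Ordering::Relaxed);
        match c.label {
            Some((key, val)) => out.push_str(&format!("{}{{{}=\"{}\"}} {}\n", c.name, key, val, v)),
            None => out.push_str(&format!("{} {}\n", c.name, v)),
        }
    }
    out
}
```

Label values here are static strings with no characters that need escaping; a renderer that accepts arbitrary label values would also have to escape backslashes, quotes, and newlines.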
Residual risks
- Producer-id heuristic for canonical form (starts_with("ledger:")). Workable today but fragile. S7 should add a per-event canonical_form proto field so the verifier can stop guessing.
- No grant-revocation on quarantine table. Defense-in-depth would restrict DELETE to a separate forensics role; today we rely on the chart’s role bootstrap (which doesn’t pin per-table grants yet). Tracked as S8-followup.
- Quarantine reaper not yet implemented. The table grows unbounded. A separate background job (similar to audit_outcome_quarantine reaper, deferred per S8 spec) should mark rows older than N days as “investigated” and archive to cold storage. Tracked as S8-followup.
- Metrics scrape config isn’t auto-injected into the PodMonitor / ServiceMonitor CRDs. Operators have to configure their Prometheus separately. Will be addressed in S22 (SLO surface).
- Codex adversarial round still flaking (same companion-runtime issue from S4 + S6). Code-level review captured here.
Runbook deltas
- New env vars: SPENDGUARD_CANONICAL_INGEST_STRICT_SIGNATURES (true|false), SPENDGUARD_CANONICAL_INGEST_TRUST_STORE_DIR (path), SPENDGUARD_CANONICAL_INGEST_METRICS_ADDR (default 0.0.0.0:9091).
- New Helm value: signing.strictVerification (default true).
- Operator playbook: if events_quarantined_total{reason="unknown_key"} spikes, check (a) is the producer key in signing.existingSecret? (b) was a key rotated without updating the verifier mount? (c) is the trust store directory mounted correctly? The log message S8: trust store loaded shows the count of keys recognized at startup.
- New table to monitor: SELECT count(*), reason FROM audit_signature_quarantine WHERE received_at > now() - interval '1 hour' GROUP BY reason shows the last-hour failure distribution.
Quality bar
Meets 90%+: real verification on the hot path, typed quarantine table with size cap and reason CHECK, Prometheus metrics for SRE visibility, Helm fail-gate at three layers (template, runtime startup, in-band gRPC error), comprehensive unit tests across signing crate + canonical_ingest. Open items (per-event canonical_form proto field, quarantine reaper, monitor injection) are explicit follow-ups in S7 and S22 rather than gaps in this slice.
S17 — OIDC/SSO foundation
Status: SHIPPED. Dashboard and Control Plane no longer accept a single hard-coded admin bearer token; both validate OIDC JWTs (or, in demo profile, a static token via the explicit static_token mode).
Design decision
- New shared crate services/auth/: Authenticator enum dispatch over JwtValidator and StaticTokenConfig. JWKS via HttpJwksProvider with refresh-on-stale (default 3600s). Uses jsonwebtoken 9 + reqwest for fetch.
- Two modes only to keep the surface small:
  - jwt (default for production): issuer + audience + JWKS URL are required env vars; clock skew leeway defaults to 60s.
  - static_token (demo profile only): AuthConfig::from_env refuses to construct unless SPENDGUARD_PROFILE=demo.
- Constant-time token comparison for static_token mode (subtle_eq helper) so an attacker can’t learn the position of the first mismatching byte from early-return timing. A length mismatch still short-circuits, as noted under Adversarial review.
- Public-safe error messages. Spec: “auth failures must not reveal tenant existence.” Internal AuthError variants distinguish IssuerMismatch, AudienceMismatch, Expired, UnknownKid, etc., but safe_public_message() collapses them all to "unauthorized" (or "missing authorization" / "service temporarily unavailable"). Asserted by a unit test that walks every variant + checks: no kid, no issuer, no network.
- Principal in axum extensions: middleware decodes the JWT, extracts (issuer, subject, groups, tenant_ids, roles, mode) into a Principal, places it in request extensions. Handlers read via Extension<Principal>. S17 leaves roles empty; S18 wires groups → roles policy.
- Tenant claim mapping: default claim names groups and spendguard:tenant_ids are configurable via env vars (<PREFIX>_OIDC_GROUPS_CLAIM, <PREFIX>_OIDC_TENANT_IDS_CLAIM) so the auth crate works with Entra ID, Auth0, Okta, generic OIDC without code changes.
- JWKS cache fail-open for warm restarts, fail-closed on cold. If the JWKS endpoint is unreachable AFTER a previous successful fetch, the verifier serves the stale cache + warns. On COLD start (cache empty + JWKS unreachable), verification fails, so operators get an explicit error instead of silently admitting unauthenticated requests.
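The warm fail-open / cold fail-closed rule for the JWKS cache can be sketched with a plain in-memory cache. This is an illustration under assumptions, not the `HttpJwksProvider` code: `JwksCache` is a hypothetical type, the `fetch` closure stands in for the HTTP request, time is passed in explicitly to keep the logic testable, and the warning log is reduced to a comment.

```rust
use std::time::{Duration, Instant};

/// Hypothetical cache over any cloneable key-set type `K`.
pub struct JwksCache<K> {
    cached: Option<(K, Instant)>,
    refresh_after: Duration,
}

impl<K: Clone> JwksCache<K> {
    pub fn new(refresh_after: Duration) -> Self {
        JwksCache { cached: None, refresh_after }
    }

    /// Returns the key set to verify against at time `now`.
    pub fn get(
        &mut self,
        now: Instant,
        fetch: impl FnOnce() -> Result<K, String>,
    ) -> Result<K, String> {
        // Fresh cache: no network call at all.
        if let Some((keys, fetched_at)) = &self.cached {
            if now.duration_since(*fetched_at) < self.refresh_after {
                return Ok(keys.clone());
            }
        }
        match fetch() {
            Ok(fresh) => {
                self.cached = Some((fresh.clone(), now));
                Ok(fresh)
            }
            Err(err) => match &self.cached {
                // Warm: serve the stale keys (real code would log a warning).
                Some((stale, _)) => Ok(stale.clone()),
                // Cold: nothing to fall back on, so fail closed.
                None => Err(err),
            },
        }
    }
}
```

With a 3600s refresh interval, a failed refresh two hours after the last good fetch still returns the old keys, while the same failure on a brand-new cache returns the error to the caller.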
Changed files
- NEW services/auth/Cargo.toml (~40 lines).
- NEW services/auth/src/lib.rs (~700 lines): Authenticator, JwtValidator, HttpJwksProvider with cache, Principal, AuthConfig with profile gate, axum middleware, 15 unit tests.
- MODIFIED services/dashboard/Cargo.toml: path dep on spendguard-auth.
- MODIFIED services/dashboard/src/main.rs: removed auth_token field on AppState + check_auth helper; wired Authenticator + from_fn_with_state(auth, require_auth) on the /api/* routes; handlers now take Extension<Principal>.
- MODIFIED services/control_plane/Cargo.toml: path dep on spendguard-auth.
- MODIFIED services/control_plane/src/main.rs: removed admin_token + check_auth; wired Authenticator behind a scoped sub-router; handlers receive Extension<Principal> and log subject + mode for mutating actions (create_tenant, tombstone_tenant).
- MODIFIED deploy/demo/runtime/Dockerfile.dashboard, Dockerfile.control_plane: COPY services/auth path-dep.
- MODIFIED deploy/demo/compose.yaml: dashboard + control_plane now use static_token mode under SPENDGUARD_PROFILE=demo. Static token strings are operator-visible so the demo’s “paste token in browser prompt” flow keeps working.
- 15 unit tests in spendguard-auth:
  - auth_mode_parse_known_values: jwt / static_token / invalid
  - static_token_authenticator_accepts_correct_token
  - static_token_authenticator_rejects_wrong_token
  - static_token_constant_time_comparison_handles_length_mismatch
  - static_token_outside_demo_profile_refuses_to_construct: AuthConfig::from_env with profile=production / staging / empty all return StaticTokenOutsideDemo; profile=demo OK.
  - safe_public_messages_dont_reveal_internals: every error variant’s public message has no kid/issuer/network leakage.
  - auth_mode_string_matches_principal_mode_field
  - jwt_validator_accepts_well_formed_token: full JWT roundtrip using a FakeJwks test double.
  - jwt_validator_rejects_wrong_issuer → IssuerMismatch
  - jwt_validator_rejects_wrong_audience → AudienceMismatch
  - jwt_validator_rejects_expired_token → Expired
  - jwt_validator_rejects_unknown_kid → UnknownKid
  - jwt_validator_default_groups_claim_population
  - extract_bearer_handles_well_formed_header
  - extract_bearer_rejects_missing_or_malformed_header

  Result: 15 passed; 0 failed.
- Live verification: make demo-up brings dashboard + control_plane online. The browser prompt for the demo dashboard token still works (now flows through Authenticator::StaticToken instead of the deleted check_auth helper).
Adversarial review
- JWT signed with attacker key: verifier looks up the kid in JWKS. Unknown kid → UnknownKid. Even if the attacker forges a matching kid, the trust comes from the JWKS keys (operator-pinned via OIDC_JWKS_URL env var), not from the token.
- Replay of expired JWT after clock skew: clock_skew_seconds defaults to 60s. Tokens 5 minutes past exp reject with Expired (covered by unit test).
- Issuer/audience trust pinning: both compared against the env values; mismatch → typed error. Wildcards / suffixes not supported (avoid mistakes).
- Static token timing attack: constant-time compare on byte-by-byte XOR avoids early-return on first mismatch byte. Length mismatch short-circuits but still returns StaticTokenMismatch typed error (not panic / not different status code).
- Static token leaking into production: AuthConfig::from_env checks SPENDGUARD_PROFILE BEFORE reading STATIC_TOKEN. An operator who sets static_token mode in production gets a startup error, not silent admission.
- JWKS endpoint compromise / DNS hijack: out of scope for S17. Operator must serve JWKS over TLS with a cert pinned at the network layer; reqwest uses rustls. Attacker-controlled JWKS WOULD let them mint valid tokens, which is the same threat model as any OIDC integration; documented in runbook.
- Cold start with unreachable JWKS: fail-closed. The cache is empty on first run; refresh failure returns the original error to the caller. Operator sees a clean startup error.
- Mutation log forging: control_plane handlers log subject = principal.subject, mode = principal.mode on create_tenant / tombstone_tenant. Spec: “service logs include principal id for mutating actions”. Done.
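The XOR-accumulate comparison described for the static token can be sketched as follows. The body of the real `subtle_eq` helper is not shown in this log, so `subtle_eq_sketch` is an assumed shape: it short-circuits on length mismatch, as the review states, and otherwise touches every byte before deciding.

```rust
/// Compares two byte strings without returning early on the first
/// mismatching byte. The length check does short-circuit, so token
/// length is not hidden by this function.
pub fn subtle_eq_sketch(expected: &[u8], presented: &[u8]) -> bool {
    if expected.len() != presented.len() {
        return false;
    }
    // OR together the XOR of every byte pair; any difference leaves a
    // non-zero bit in `diff`, and the loop always runs to the end.
    let mut diff: u8 = 0;
    for (a, b) in expected.iter().zip(presented.iter()) {
        diff |= a ^ b;
    }
    diff == 0
}
```

A plain loop like this expresses the intent but gives no guarantee against compiler optimizations; the `subtle` crate's `ConstantTimeEq` trait is a commonly used alternative when that guarantee matters.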
Observability
- Startup log: "auth initialized" with mode + (for jwt) issuer / audience / jwks_url. Static_token mode logs a warning ("DEMO ONLY") so operators aren’t surprised by the bypass.
- Failed auth logs at warn level: "auth rejected" with the (typed) AuthError. Public response body collapses all reasons to a single unauthorized to avoid leaking which check failed.
- Mutating-action logs: info!(subject, mode, "create_tenant invoked") and "tombstone_tenant invoked". S18 will add audit log persistence; S17 surfaces them via tracing only.
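The split between typed internal errors (logged) and a collapsed public message (returned) can be sketched like this. The variant list is partly assumed: `IssuerMismatch`, `AudienceMismatch`, `Expired` and `UnknownKid` come from this log, while `MissingHeader`, `JwksUnavailable` and the mapping of each variant to one of the three public strings are illustrative guesses.

```rust
/// Internal, typed reasons. These go to the warn-level log only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthError {
    MissingHeader,   // assumed variant name
    IssuerMismatch,
    AudienceMismatch,
    Expired,
    UnknownKid,
    JwksUnavailable, // assumed variant name
}

impl AuthError {
    /// Every variant, so a test can walk the whole enum.
    pub const ALL: [AuthError; 6] = [
        AuthError::MissingHeader,
        AuthError::IssuerMismatch,
        AuthError::AudienceMismatch,
        AuthError::Expired,
        AuthError::UnknownKid,
        AuthError::JwksUnavailable,
    ];

    /// The only text a client ever sees. No wildcard arm, so a new
    /// variant cannot pick up a public message by accident.
    pub fn safe_public_message(self) -> &'static str {
        match self {
            AuthError::MissingHeader => "missing authorization",
            AuthError::JwksUnavailable => "service temporarily unavailable",
            AuthError::IssuerMismatch
            | AuthError::AudienceMismatch
            | AuthError::Expired
            | AuthError::UnknownKid => "unauthorized",
        }
    }
}
```

A test in the spirit of `safe_public_messages_dont_reveal_internals` would iterate `AuthError::ALL` and assert that no message contains `kid`, `issuer`, or `network`.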
Residual risks
- Helm chart doesn’t yet template dashboard + control_plane. Pre-existing gap (only ledger / canonical_ingest / sidecar / webhookReceiver / outboxForwarder / ttlSweeper have templates). S17 wires the auth env vars at the binary level, so operators running their own k8s manifests get the benefit immediately. Templated chart support should land alongside an “operator dashboard chart” slice.
- JWKS rotation not yet exercised in tests. The unit tests use a FakeJwks test double; the real HttpJwksProvider’s refresh-on-stale path is exercised only via demo bring-up. A future test should spin up a wiremock server to assert the refresh cadence.
- No rate limiting on auth failures. A misconfigured client that retries with bad tokens will hit JWKS fetch + signature verify on every request. Acceptable for S17; S22 adds rate limiting.
- Roles intentionally empty. S18 maps groups → roles via a config-backed policy. Until then handlers can read principal.groups directly if needed.
- Codex adversarial round still flaking (same companion-runtime issue as S4 / S6 / S8). Code-level review captured here.
Runbook deltas
- New env vars per service (replace single-token):
  - SPENDGUARD_<SERVICE>_AUTH_MODE (jwt|static_token, default jwt).
  - SPENDGUARD_<SERVICE>_OIDC_ISSUER (jwt mode, required).
  - SPENDGUARD_<SERVICE>_OIDC_AUDIENCE (jwt mode, required).
  - SPENDGUARD_<SERVICE>_OIDC_JWKS_URL (jwt mode, required).
  - SPENDGUARD_<SERVICE>_OIDC_CLOCK_SKEW_SECONDS (default 60).
  - SPENDGUARD_<SERVICE>_OIDC_JWKS_REFRESH_SECONDS (default 3600).
  - SPENDGUARD_<SERVICE>_OIDC_GROUPS_CLAIM (default groups).
  - SPENDGUARD_<SERVICE>_OIDC_TENANT_IDS_CLAIM (default spendguard:tenant_ids).
  - SPENDGUARD_<SERVICE>_STATIC_TOKEN + _STATIC_TOKEN_SUBJECT (static_token mode only; demo profile required).
- Removed env vars (operator must migrate):
  - SPENDGUARD_DASHBOARD_AUTH_TOKEN
  - SPENDGUARD_CONTROL_PLANE_ADMIN_TOKEN
- Operator playbook: For Microsoft Entra ID, set
  - OIDC_ISSUER=https://login.microsoftonline.com/<tenant>/v2.0
  - OIDC_AUDIENCE=api://<your-app-id>
  - OIDC_JWKS_URL=https://login.microsoftonline.com/<tenant>/discovery/v2.0/keys
  - OIDC_GROUPS_CLAIM=roles (Entra populates app roles into the roles claim, not groups).
  - Define an Entra app role mapping for spendguard:tenant_ids (custom claim or claim transformation rule).
Quality bar
Meets 90%+: shared auth crate with comprehensive unit tests, two explicit modes (jwt + static_token-with-demo-gate), no information leakage in public errors, JWKS caching with sane fail-open vs fail-closed semantics, axum middleware with Principal in extensions ready for S18’s tenant scope enforcement, mutating actions audit-logged. Open items (Helm templates for dashboard + control_plane, wiremock JWKS rotation test, rate limiting on auth failures) are explicit follow-ups rather than gaps in this slice.
S18 — RBAC and tenant isolation
Status: SHIPPED. Roles + permissions populated from JWT groups via a config-backed policy; per-route permission gates and per-tenant scope assertions wired into dashboard + control_plane.
Design decision
- Five roles, one matrix. Per spec: Viewer / Operator / Approver / Admin / Auditor. Permission set kept small and orthogonal: ReadView, TenantWrite, ApprovalResolve, AuditExport, BudgetWrite. Role→permission mapping lives in code (not DB) so every change is reviewed; operators only configure the group→role mapping.
- Group policy from env. <PREFIX>_GROUP_POLICY_JSON is the config knob: {"sg-admins":["admin","operator"],...} plus an optional "_default_viewer_on_miss":true flag for orgs that gate membership at the OIDC issuer level.
- Demo profile builtin. When SPENDGUARD_PROFILE=demo and no GROUP_POLICY_JSON is set, the auth crate uses a builtin policy that maps a synthetic demo-admins group to all five roles. Static-token principals are auto-tagged with that group, so the existing demo flows (browser prompt → token → admin actions) keep working without any operator config.
- Tenant scope from JWT claim. Principal::assert_tenant(id) is a typed predicate handlers call before every tenant-scoped query. Returns AuthzError::CrossTenant (HTTP 403) on mismatch, never 404, so an attacker can’t probe tenant existence by error code.
- Static-token tenant scope is set explicitly via <PREFIX>_STATIC_TOKEN_TENANT_IDS (comma-separated). Empty list → fail-closed under assert_tenant. The demo wires the seeded demo tenant id so dashboard reads work.
- Production fail-closed default: if the operator forgets to set GROUP_POLICY_JSON in production, every authenticated principal gets roles=[] and every permission check denies. No silent admit.
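A role→permission matrix kept in code might look like the sketch below. The role and permission names come from the design decision; the individual cells are assumptions. The S18 test names only imply that Viewer can read, Approver can resolve approvals but not write tenants, Auditor can export but not write budgets, and Admin holds everything. The Operator row and the `ReadView` grant on non-Viewer roles are guesses.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Viewer,
    Operator,
    Approver,
    Admin,
    Auditor,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Permission {
    ReadView,
    TenantWrite,
    ApprovalResolve,
    AuditExport,
    BudgetWrite,
}

/// Matrix in code, so every change goes through review.
/// Cell values are illustrative, not the shipped matrix.
pub fn permissions_for_role(role: Role) -> &'static [Permission] {
    use Permission::*;
    match role {
        Role::Viewer => &[ReadView],
        Role::Operator => &[ReadView, BudgetWrite], // assumed row
        Role::Approver => &[ReadView, ApprovalResolve],
        Role::Auditor => &[ReadView, AuditExport],
        Role::Admin => &[ReadView, TenantWrite, ApprovalResolve, AuditExport, BudgetWrite],
    }
}

/// A principal holds the union of its roles' permissions.
/// An empty role list grants nothing, which is the fail-closed default.
pub fn has_permission(roles: &[Role], wanted: Permission) -> bool {
    roles
        .iter()
        .any(|role| permissions_for_role(*role).contains(&wanted))
}
```

With this shape, a principal whose policy lookup produced no roles fails every `has_permission` check without any special-case code.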
Changed files
- NEW services/auth/src/rbac.rs (~340 lines): Role, Permission, permissions_for_role(), GroupPolicy, AuthzError, Principal::has_role / has_permission / require / assert_tenant / override_tenant_scope / set_roles. 18 unit tests + 3 integration tests in lib.rs.
- MODIFIED services/auth/src/lib.rs:
  - pub mod rbac + re-exports (GroupPolicy, Permission, Role).
  - AuthConfig is now a struct ({kind, policy, static_token_tenant_ids}) instead of an enum. Old enum variants split into AuthConfigKind. Test call-sites updated accordingly.
  - Authenticator carries the GroupPolicy and applies it to every authenticated principal.
  - Static-token principals auto-tagged with synthetic demo-admins group so the demo policy resolves.
  - load_policy() helper reads env JSON or falls back to demo builtin / production-empty.
- MODIFIED services/dashboard/src/main.rs: import Permission; every /api/* handler calls principal.require(Permission::ReadView) first; tenant scope assertion left as a TODO comment for the multi-tenant variant.
- MODIFIED services/control_plane/src/main.rs: import Permission; create_tenant requires TenantWrite; tombstone_tenant requires TenantWrite + assert_tenant; get_tenant requires ReadView + assert_tenant. All gates log subject + roles + (where relevant) requested_tenant + scope for the security audit log.
- MODIFIED deploy/demo/compose.yaml: dashboard + control_plane both get STATIC_TOKEN_TENANT_IDS pointing at the seeded demo tenant uuid.
- +18 RBAC unit tests in services/auth/src/rbac.rs:
  - role_parse_known_values: viewer/operator/approver/admin/auditor + reject unknown
  - permissions_for_admin_include_all_others_minus_none
  - viewer_can_read_but_not_approve_or_mutate
  - approver_can_resolve_but_not_create_tenant
  - auditor_can_export_but_not_mutate_budgets
  - require_permission_returns_typed_error_when_missing
  - assert_tenant_passes_when_in_scope
  - assert_tenant_rejects_cross_tenant
  - assert_tenant_rejects_principal_with_no_scope
  - group_policy_parse_round_trips_known_roles
  - group_policy_rejects_unknown_role
  - group_policy_rejects_malformed_json
  - group_policy_resolves_groups_to_role_union
  - group_policy_default_viewer_on_miss_when_configured
  - group_policy_no_default_viewer_when_not_configured
  - demo_default_policy_grants_admin_to_demo_admins_group
  - demo_default_policy_falls_through_to_viewer_for_unmapped_groups
  - empty_policy_grants_no_roles_so_handlers_fail_closed
- +3 integration tests in services/auth/src/lib.rs:
  - static_token_principal_in_demo_profile_inherits_demo_admin_roles
  - static_token_principal_with_empty_policy_has_zero_permissions
  - jwt_principal_roles_populated_from_group_policy: end-to-end JWT → roles → permission check + cross-tenant rejection.

  Total: 36 passed; 0 failed.
Adversarial review
- Tenant id from URL path is trusted only as input, never as authority: every handler that takes Path(id) calls principal.assert_tenant(&id) BEFORE any DB query. The query itself also filters by tenant_id so even a bug in the gate doesn’t leak other tenants.
- Cross-tenant 404 vs 403 leak: spec mandates 403 for cross-tenant. Both MissingPermission and CrossTenant collapse to StatusCode::FORBIDDEN. The handler’s tracing log records the typed reason for forensics; the public response body is stripped (axum’s default error body for 403). Probing cannot distinguish “tenant doesn’t exist” from “tenant exists but you can’t see it”.
- Privilege escalation via crafted JWT claims: roles are derived from groups via the operator-controlled policy. Attacker can’t put roles: ["admin"] directly in a JWT and have it work: the auth crate IGNORES any roles claim and only reads groups. Documented inline.
- Static-token bypass in production: triple gate. Helm fail-gate (S17), AuthConfig::from_env profile check (S17), static_token_tenant_ids empty list → assert_tenant fails-closed (S18). Defense in depth.
- Group policy with _default_viewer_on_miss=true in prod: this is operator-controlled. The flag’s behavior (grant Viewer if no group matches) is documented in code comments and progress doc. Operators who need stricter membership skip the flag.
- Race on policy reload: the policy is loaded once at startup and held in Arc<GroupPolicy>. Hot-reload not supported in S18 (S22 will add /admin/reload-policy). Operators rotate by restarting the pod. JWKS rotation IS hot-reloaded (S17), only the policy is fixed-on-boot.
- Audit log scrubbing: roles + subject get logged, but the static-token VALUE never does (only subject). The token string is in env; if the env leaks, that’s a separate breach.
- Empty roles list bypass attempt: handler require(...) returns FORBIDDEN if roles is empty. Verified by test static_token_principal_with_empty_policy_has_zero_permissions.
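The tenant gate and the 403 collapse reviewed above can be sketched without any web framework. The struct fields and the `http_status` helper are assumptions for illustration; only the names `assert_tenant`, `AuthzError::CrossTenant` and `MissingPermission` come from this log.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthzError {
    MissingPermission,
    CrossTenant,
}

/// Reduced principal: only the fields this sketch needs.
pub struct Principal {
    pub subject: String,
    pub tenant_ids: Vec<String>,
}

impl Principal {
    /// Passes only if the requested tenant is in the principal's scope.
    /// An empty scope rejects everything (fail-closed).
    pub fn assert_tenant(&self, requested: &str) -> Result<(), AuthzError> {
        if self.tenant_ids.iter().any(|t| t.as_str() == requested) {
            Ok(())
        } else {
            Err(AuthzError::CrossTenant)
        }
    }
}

/// Hypothetical mapping to a status code: both reasons collapse to 403,
/// so the response never distinguishes "no such tenant" from "not yours".
pub fn http_status(err: AuthzError) -> u16 {
    match err {
        AuthzError::MissingPermission | AuthzError::CrossTenant => 403,
    }
}
```

The typed reason stays available for the tracing log while the status code seen by the caller is identical in both cases.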
Observability
- New tracing fields on every gated action:
  - subject (always), roles (always), requested_tenant + scope (on cross-tenant rejection), mode (jwt|static_token).
- Mutating actions (create_tenant, tombstone_tenant) log at info; rejected attempts log at info too so SREs can grep for “rejected — cross-tenant” or “missing TenantWrite permission”.
Residual risks
- No DB-side enforcement yet. S18 enforces tenant scope at the handler layer; the SQL queries themselves still use the env-pinned tenant_id. A handler bug that bypasses the gate would currently leak. Future work: switch all queries to use `principal.tenant_ids` (and emit a security audit row on cross-tenant attempts via existing `audit_signature_quarantine` infrastructure or a new `audit_authz_quarantine` table).
- Audit-log persistence not yet wired. Spec asks for an audit/security log on cross-tenant 403s. S18 logs via tracing only; a future slice should persist these to a dedicated table with retention policy.
- Per-tenant rate limiting deferred to S22.
- Approval flow handlers don’t exist yet. `ApprovalResolve` permission is defined but no route consumes it. S20 (approval workflow) wires the missing handlers and tests.
- Hot policy reload not supported. Operators must restart pods to change `GROUP_POLICY_JSON`. S22 may add `/admin/reload-policy`.
- Codex adversarial round still flaking: same companion runtime issue; code-level review captured here.
Runbook deltas
- New env vars per service:
  - `<PREFIX>_GROUP_POLICY_JSON`: JSON map of group→[role]. Defaults: empty in production (fail-closed), demo policy in demo profile.
  - `<PREFIX>_STATIC_TOKEN_TENANT_IDS`: CSV of tenant ids granted to the static-token principal. Demo profile only.
- Operator playbook:
  - To add a new group: append to `GROUP_POLICY_JSON` and rolling-restart the pod.
  - To rotate operator access: remove the user from the group in your IdP. JWT cache TTL is at most `OIDC_JWKS_REFRESH_SECONDS` (3600s default); plan revocation accordingly OR set a shorter `OIDC_CLOCK_SKEW_SECONDS` and rotate the OIDC signing key.
  - Cross-tenant 403 alerts: grep tracing for `"rejected — cross-tenant"`. A spike usually means a forgotten `STATIC_TOKEN_TENANT_IDS` rotation or an IdP misconfiguration on the `spendguard:tenant_ids` claim.
Quality bar
Meets 90%+: typed Role + Permission enums, fail-closed default policy in production, demo-builtin policy keeps existing flows working, tenant scope assertion on every tenant-scoped handler, no information leakage on cross-tenant rejection, comprehensive unit + integration tests covering each role / each permission / each policy edge case. Open items (DB-side enforcement, audit-log persistence, hot reload, approval workflow handlers) are explicit follow-ups in S20 / S22 rather than gaps in this slice.
S22 — Fail-open / fail-closed policy matrix
Status: SHIPPED (surface + sidecar wiring + Helm gate; per-dependency hot-path enforcement is the explicit S23 follow-up).
Design decision
- Typed matrix surface in a new `services/policy/` crate: `Dependency` enum (Ledger, CanonicalIngest, Pricing, Signing, ProviderReconciliation, Approval, Dashboard, Export) × `WorkflowClass` enum (Monetary, NonMonetaryTool, ObservabilityOnly) → `FailPolicy` (FailClosed | FailOpenWithMarker). 24-cell matrix, code-controlled enum so every operator-facing combination is exhaustive.
- Default fail-closed everywhere. `FailPolicyMatrix::default_fail_closed()` is the safety baseline; `matrix_from_env(...)` falls back to this when the JSON env var is unset.
- Hard rule: no fail-open for monetary. `from_json` rejects `monetary` cells with `fail_open_with_marker` at parse time with a typed `ParseError`. Spec invariant: “no fail-open path can debit budget without later reconciliation evidence.”
- Production fail-open requires explicit ack. `from_json` in the production profile rejects ANY fail-open cell unless the JSON contains `"_acknowledge_risk_of_fail_open": true`. Demo profile does not require the ack (the demo opens ObservabilityOnly cells freely).
- Audit marker on every admit. `FailMode::Admit { marker: AuditMarker }` carries `marker_id` (UUID v7), decision_id, tenant_id, dependency, workflow_class, reason, policy_version, admitted_at. Sidecar emits this as a typed CloudEvent (type: `spendguard.audit.fail_policy_admit`) so reconciliation can identify rows that didn’t go through normal verification. (Hot-path emission is the S23 wiring; the marker shape ships in S22.)
- Versioned matrix. `policy_version` field on `FailPolicyMatrix` is embedded in every audit marker so an investigator can reproduce the policy that admitted a row. Operators set it via `_version` in the JSON; the default is `default-fail-closed` for the safety baseline and `operator-supplied-unversioned` if they forget.
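A minimal sketch of the matrix rules above, assuming a simplified in-memory shape. The enum names and variants come from this entry; `set_override`, `OverrideError`, and the `production` / `ack` parameters are hypothetical stand-ins for what `from_json` enforces at parse time, and the real crate may structure this differently.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Dependency {
    Ledger, CanonicalIngest, Pricing, Signing,
    ProviderReconciliation, Approval, Dashboard, Export,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum WorkflowClass { Monetary, NonMonetaryTool, ObservabilityOnly }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailPolicy { FailClosed, FailOpenWithMarker }

/// Hypothetical error type standing in for the crate's typed `ParseError`.
#[derive(Debug, PartialEq, Eq)]
pub enum OverrideError {
    MonetaryFailOpenForbidden,
    MissingProductionAck,
}

pub struct FailPolicyMatrix {
    pub policy_version: String,
    overrides: HashMap<(Dependency, WorkflowClass), FailPolicy>,
}

impl FailPolicyMatrix {
    /// Safety baseline: no overrides, so every cell resolves to fail-closed.
    pub fn default_fail_closed() -> Self {
        Self {
            policy_version: "default-fail-closed".to_string(),
            overrides: HashMap::new(),
        }
    }

    /// Any cell without an explicit override is fail-closed.
    pub fn lookup(&self, dep: Dependency, wf: WorkflowClass) -> FailPolicy {
        self.overrides.get(&(dep, wf)).copied().unwrap_or(FailPolicy::FailClosed)
    }

    /// Apply one override, enforcing the two hard rules before storing it.
    pub fn set_override(
        &mut self,
        dep: Dependency,
        wf: WorkflowClass,
        policy: FailPolicy,
        production: bool,
        ack: bool,
    ) -> Result<(), OverrideError> {
        if policy == FailPolicy::FailOpenWithMarker {
            // Rule 1: monetary workflows can never fail open.
            if wf == WorkflowClass::Monetary {
                return Err(OverrideError::MonetaryFailOpenForbidden);
            }
            // Rule 2: production needs an explicit risk acknowledgement.
            if production && !ack {
                return Err(OverrideError::MissingProductionAck);
            }
        }
        self.overrides.insert((dep, wf), policy);
        Ok(())
    }
}
```

Storing only overrides and defaulting in `lookup` keeps the baseline exhaustive without enumerating all 24 cells by hand.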
Changed files
- NEW `services/policy/Cargo.toml` (~20 lines).
- NEW `services/policy/src/lib.rs` (~480 lines): WorkflowClass, Dependency, FailPolicy, FailMode, AuditMarker, FailPolicyMatrix, matrix_from_env, 14 unit tests.
- MODIFIED `services/sidecar/Cargo.toml`: path dep on `spendguard-policy`.
- MODIFIED `services/sidecar/src/domain/state.rs`: new `fail_policy: Arc<FailPolicyMatrix>` field on SidecarState.
- MODIFIED `services/sidecar/src/main.rs`: load `matrix_from_env("SPENDGUARD_SIDECAR", &profile)` at startup, log policy_version + profile, pass into SidecarState::new.
- MODIFIED `deploy/demo/runtime/Dockerfile.sidecar`: COPY services/policy path-dep.
- MODIFIED `charts/spendguard/values.yaml`: `failPolicy.overrides` string (default empty → fail-closed).
- MODIFIED `charts/spendguard/templates/sidecar.yaml`: render `SPENDGUARD_SIDECAR_FAIL_POLICY_JSON` env var when `failPolicy.overrides` is non-empty.
- 14 unit tests in `spendguard-policy`:
  - `default_matrix_blocks_every_combination`: exhaustively checks all 8 deps × 3 workflow_classes = 24 cells.
  - `observability_open_baseline_only_opens_observability_route`
  - `from_json_overlays_overrides_on_baseline`: partial overrides don’t disturb other cells.
  - `from_json_rejects_fail_open_for_monetary`: typed parse error mentioning “monetary” + “forbidden”.
  - `from_json_in_production_requires_explicit_ack_for_any_fail_open`: refuses without `_acknowledge_risk_of_fail_open`, accepts with it.
  - `from_json_in_demo_does_not_require_ack`
  - `decide_returns_block_on_fail_closed`
  - `decide_returns_admit_with_marker_on_fail_open_path`
  - `from_json_rejects_unknown_dependency`
  - `from_json_rejects_unknown_workflow_class`
  - `from_json_rejects_unknown_policy_value`
  - `audit_marker_serializes_to_stable_json`: field names stable so audit consumers can parse safely.
  - `dependency_workflow_class_round_trip_through_str`
  - `matrix_from_env_falls_back_to_default_when_var_unset`

  Result: `14 passed; 0 failed`.
- Sidecar build verified: docker release build of sidecar with the new path dep compiles.
Adversarial review
- Fail-open for monetary is rejected at parse time, not just at runtime. Even an operator with a typo or a bad merge can’t silently debit budget without ledger evidence.
- Hidden fail-open in production: requires both `_version` AND `_acknowledge_risk_of_fail_open: true` in the JSON. A misconfig that supplies one but not the other fails to start.
- Marker forging: `marker_id` is generated server-side by the sidecar; an attacker can’t supply one. `policy_version` reflects the matrix loaded at boot; a malicious operator can write any string but can’t backdate the matrix used by a deployed pod.
- Stale matrix after policy update: matrix is loaded once at boot. Operators must rolling-restart pods to pick up changes. This is intentional: hot reload would create a window where in-flight decisions span two matrix versions; better to wait for next pod start.
- Audit marker missed during admit: the typed `FailMode` return value FORCES the caller to either `Block` or `Admit` with marker. There’s no third “Admit without marker” variant, so the type system enforces the audit invariant.
- Marker emission failure cascades fail-closed: when S23 wires the actual emission, if writing the marker fails, the decision MUST fail-closed (defense in depth). Documented as the contract for S23 implementers.
- Workflow_class spoofing: comes from the contract bundle, not the request body (same trust model as the rest of Contract DSL). An attacker can’t claim “this is observability only” to bypass the matrix.
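The “no admit without marker” invariant from this review can be illustrated with a small sketch. The `FailMode`, `AuditMarker`, and `decide` names appear in this entry; the reduced field set, the function signature, and the injected `new_marker_id` closure (standing in for server-side UUID v7 generation) are assumptions made for illustration only.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailPolicy { FailClosed, FailOpenWithMarker }

/// Reduced marker shape; the shipped struct carries more fields
/// (decision_id, tenant_id, dependency, workflow_class, admitted_at).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AuditMarker {
    pub marker_id: String,
    pub policy_version: String,
    pub reason: String,
}

/// Two variants only: there is no way to construct an admit without a marker.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FailMode {
    Block,
    Admit { marker: AuditMarker },
}

/// Hypothetical signature: resolve a dependency failure against the cell's policy.
/// The marker id is produced by the caller-supplied generator, never by request data.
pub fn decide(
    policy: FailPolicy,
    policy_version: &str,
    reason: &str,
    new_marker_id: impl FnOnce() -> String,
) -> FailMode {
    match policy {
        FailPolicy::FailClosed => FailMode::Block,
        FailPolicy::FailOpenWithMarker => FailMode::Admit {
            marker: AuditMarker {
                marker_id: new_marker_id(),
                policy_version: policy_version.to_string(),
                reason: reason.to_string(),
            },
        },
    }
}
```

Because callers must match on `FailMode`, any code path that admits a request necessarily has an `AuditMarker` in hand to emit.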
Observability
- Startup log: `"S22: fail-policy matrix initialized"` with `policy_version` + `profile`. Operators can grep for this on pod restart to confirm the matrix that loaded.
- Decision-time logs (when `decide()` fires):
  - `info!("fail-policy: BLOCK", dep, workflow, policy_version, reason)`
  - `warn!("fail-policy: ADMIT with marker", dep, workflow, policy_version, marker_id)`
- Marker payload includes `policy_version` so audit-log queries like “all rows admitted under policy v2024-q3” are one SQL filter away.
Residual risks
- Per-dependency hot-path enforcement deferred to S23. S22 ships the matrix surface + sidecar config + audit marker shape. Wiring “if ledger.commit_estimated returns Unavailable AND fail_policy.lookup is FailOpenWithMarker, emit marker via canonical_ingest then return Success” is a substantial surgical change to `decision/transaction.rs` that belongs in S23 alongside the dependency-health metrics.
- AuditMarker isn’t yet routed through canonical_ingest. The struct serializes to stable JSON and would slot into the existing CloudEvent `data` field, but the emit-path RPC isn’t wired yet. S23 follows up.
- No hot reload: pods restart to pick up new `FAIL_POLICY_JSON`. S22-followup ticket.
- Codex adversarial round still flaking: same companion runtime issue; code-level review captured here.
Runbook deltas
- New env var `SPENDGUARD_SIDECAR_FAIL_POLICY_JSON`. Empty or unset → safe default (fail-closed everywhere). Set to a JSON map to override per cell.
- New Helm value `failPolicy.overrides` (string, optional).
- JSON shape:

  ```
  {
    "_version": "v2026-q3",
    "_acknowledge_risk_of_fail_open": true,
    "<dependency>": {
      "<workflow_class>": "fail_closed" | "fail_open_with_marker"
    }
  }
  ```
- Operator playbook: to bring fail-open online for a low-risk workflow:
  - Identify the (dependency, workflow_class) pair.
  - Add it to `failPolicy.overrides` in values.yaml.
  - Set `_acknowledge_risk_of_fail_open: true`. (Production-only gate; demo profile skips.)
  - Bump `_version`.
  - Rolling-restart sidecar pods.
  - Monitor `spendguard.audit.fail_policy_admit` rows in the audit log; every admit shows up there with `policy_version` matching what you set.
Quality bar
Meets 90%+: typed matrix surface with exhaustive default fail-closed, monetary fail-open forbidden at parse time, production-profile ack gate, versioned audit marker shape, sidecar-state wiring, demo + Helm config knobs. Open items (hot-path enforcement in decision/transaction.rs, marker emission via canonical_ingest, hot reload) are explicit follow-ups in S23 rather than gaps in S22’s deliverable: the deliverable is the policy surface and the sidecar’s ability to consult it.
S5 — Multi-pod enablement gate
Status: SHIPPED (Helm gates + operator runbook). Automated kind chaos drill is the explicit S5-followup.
Design decision
- Sidecar = active/standby, not horizontal scaling. Captured in the runbook because it’s a subtle semantic that’s easy to misread when the chart says “DaemonSet”. Each node’s sidecar pod calls `Ledger.AcquireFencingLease` at startup; the Ledger serializes via `FOR UPDATE` and grants exactly one. Other pods fail closed at startup. Failover is “kubelet restarts the losing pods, the standby that wins on takeover gets epoch+1”.
- outbox-forwarder + ttl-sweeper = leader election. Multi-pod is genuinely safe: only the leader does work. The S1 Helm gate (`replicas > 1` requires `leaderElection.mode != disabled`) remains the sole guard for these two.
- Sidecar Helm gates are the new contribution:
  - `sidecar.acknowledgeMultiPod=false` → DEFAULT. Operator must flip to `true` to convey awareness of active/standby semantics.
  - `sidecar.workloadInstanceIdOverride` MUST NOT be set when multi-pod is enabled (override means single-pod identity).
- Runbook includes per-component model, failover sequence, rollback path (no DB surgery), chaos drill checklist, and observability invariants.
Changed files
- MODIFIED `charts/spendguard/values.yaml`: `sidecar.acknowledgeMultiPod: false` (default).
- MODIFIED `charts/spendguard/templates/sidecar.yaml`: two new `fail` directives: replicas-without-ack rejects, replicas-with-override rejects.
- NEW `docs/site/docs/operations/multi-pod.md` (~150 lines): per-component scaling model, failover sequence, rollback, chaos drill checklist, observability invariants, S5-followup list.

Helm template smoke tests (manual; recorded in progress doc):
- `helm template ... --set sidecar.replicas=2` → reject (`acknowledgeMultiPod=true` not set).
- `helm template ... --set sidecar.replicas=2 --set sidecar.acknowledgeMultiPod=true --set sidecar.workloadInstanceIdOverride=manual-id` → reject (override forbidden under multi-pod).
- `helm template ... --set sidecar.replicas=2 --set sidecar.acknowledgeMultiPod=true` → renders.
- S1 outbox-forwarder + ttl-sweeper gates already verified in S1 progress doc; unchanged.
Adversarial review
- Operator slips `replicas: 2` into prod by accident: rejected at chart render; Helm `fail` runs before any kube apply.
- Operator sets `replicas: 2` AND override expecting both to work: caught by the second gate; explicit error message pointing at the runbook.
- DaemonSet semantics confusion: the runbook calls out that sidecar isn’t true horizontal scaling, with the fencing takeover sequence diagrammed.
- Multi-node sidecar without per-node fencing scope: documented as known limitation. Today all nodes share one scope; only one wins. True multi-node horizontal sidecar requires per-pod scope assignment, tracked as S5-followup.
- Takeover storms: observability invariants in the runbook (alerting on `coordination_lease_history.taken_over` > 1/hour and `fencing_scope_events.promote` > 1/hour).
- Lease flap during network partition: documented in the runbook; the recommendation is to keep `ttlMs >> network jitter` and watch the takeover counters.
Observability
- Documented invariants (no new code): operators alert on `spendguard_sidecar_fencing_acquire_action_total{action="takeover"}` spikes and on `coordination_lease_history` row growth.
- The metrics themselves came from S1 (lease history) + S4 (fencing acquire action). S5 just publishes the alert recommendations.
Residual risks (S5-followup)
- Automated kind chaos drill: manual checklist in the runbook today. A future slice should add a kind-based CI test that runs the failover sequence and asserts:
  - exactly one leader per lease at any moment
  - exactly one fencing scope holder
  - `audit_outbox_global_keys` rejects duplicates after takeover
- Per-pod fencing scope assignment: DaemonSet across N nodes today shares one scope. True horizontal sidecar scaling needs per-pod scopes (e.g. derived from pod name). Architectural decision deferred.
- Faster takeover via explicit revoke RPC: currently relies on TTL expiry (~30s). A successor can implement `Ledger.RevokeFencingLease(scope_id, with_audit)` for operator-driven faster failover.
- Codex round still flaking: code-level review captured here.
Runbook deltas
- New runbook page: `docs/site/docs/operations/multi-pod.md`.
- New Helm value: `sidecar.acknowledgeMultiPod` (default `false`).
- Operator playbook (excerpt; full version in the runbook page):
  - Multi-pod sidecar: set `replicas: N` only on a deployment pattern that genuinely needs N nodes, set `acknowledgeMultiPod: true`, leave `workloadInstanceIdOverride` empty.
  - Multi-pod outbox-forwarder / ttl-sweeper: set `replicas: 2`, leave `leaderElection.mode` at `postgres` (default).
  - Rollback: just decrement replicas / flip the ack flag; there is no DB state to reset.
Quality bar
Meets 90%+: explicit Helm gates (sidecar AND existing S1 gates for the two background workers), operator-facing runbook covering the active/standby semantic + failover + rollback + chaos drill + observability, residual risks documented as S5-followup tickets rather than gaps. The automated kind test would close the loop; without a kind cluster in this session, manual procedures are the path forward.
S7 — Key registry and rotation
Status: SHIPPED (filesystem-based key registry + validity-window enforcement + DB schema for future DbKeyRegistryProvider). KMS implementation + Db-backed verifier + admin RPC are explicit S7-followups.
Design decision
- Two registry shapes ship together:
  - Filesystem-based (current verifier path): `keys.json` manifest sits next to the PEM files in the trust store dir. Maps `key_id → { valid_from, valid_until, revoked, revoked_at }`. Loaded at process startup; pod restart picks up changes.
  - DB-backed schema (`signing_keys` + `signing_key_revocations` tables, migration 0009): production-shaped surface for a future `DbKeyRegistryProvider`. Captures the spec’s rotation lifecycle (additive → cutover → revoke) with constraints + indexes ready. The verifier doesn’t read from this yet (S7-followup); the schema is in place so operators can publish keys without a chart redeploy once the provider lands.
- Validity check is event-time-driven, not ingest-wallclock. Spec review standard (“Verify key validity is evaluated against signed event time, not ingest wall clock alone”) is enforced by the verifier consuming `event_time: Option<DateTime<Utc>>` from the CloudEvent’s `time` field. `None` skips the window check (for background re-verification), but revocation is always enforced, so operator-driven incident response can’t be bypassed by omitting time.
- Three new VerifyFailure variants:
  - `KeyExpired`: event_time > valid_until.
  - `KeyNotYetValid`: event_time < valid_from.
  - `KeyRevoked`: operator flipped revoked.

  All three quarantine in BOTH strict and non-strict mode (no admit-with-counter path). These are unambiguous policy violations, not legacy fallthroughs.
- Backwards compatibility: keys missing from the manifest default to `KeyValidity::always_valid()`. Pre-S6 deployments that don’t have a `keys.json` continue to work; the verifier’s validity check is a no-op for unconfigured keys.
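The validity rules above can be sketched as a small check function. `KeyValidity`, `check`, `always_valid`, and the three failure variants are named in this entry; representing timestamps as Unix seconds (`i64`) instead of `DateTime<Utc>`, and making both window bounds optional, are simplifying assumptions so the example needs only the standard library.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VerifyFailure {
    KeyExpired,
    KeyNotYetValid,
    KeyRevoked,
}

/// Simplified validity record; times are Unix seconds (assumption).
#[derive(Debug, Clone, Copy)]
pub struct KeyValidity {
    pub valid_from: Option<i64>,
    pub valid_until: Option<i64>,
    pub revoked: bool,
}

impl KeyValidity {
    /// Default for keys that are missing from the manifest.
    pub fn always_valid() -> Self {
        Self { valid_from: None, valid_until: None, revoked: false }
    }

    /// Revocation is checked first and unconditionally; the window check
    /// only runs when the event carries a time.
    pub fn check(&self, event_time: Option<i64>) -> Result<(), VerifyFailure> {
        if self.revoked {
            return Err(VerifyFailure::KeyRevoked);
        }
        let t = match event_time {
            Some(t) => t,
            None => return Ok(()), // background re-verification: skip window
        };
        if let Some(from) = self.valid_from {
            if t < from {
                return Err(VerifyFailure::KeyNotYetValid);
            }
        }
        if let Some(until) = self.valid_until {
            if t > until {
                return Err(VerifyFailure::KeyExpired);
            }
        }
        Ok(())
    }
}
```

Checking `revoked` before looking at `event_time` is what keeps a missing timestamp from bypassing revocation.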
Changed files
- MODIFIED `services/signing/Cargo.toml`: serde + serde_json deps for the manifest parse.
- MODIFIED `services/signing/src/lib.rs`:
  - Three new `VerifyFailure` variants with stable as_str() ids.
  - New `KeyValidity` struct + `check(event_time)` method.
  - New `KeysManifest` struct (the `keys.json` file format).
  - `Verifier::verify` trait signature gains `event_time: Option<DateTime<Utc>>`.
  - `LocalEd25519Verifier` now holds a `validities: HashMap` populated from the manifest; `from_dir` reads `keys.json` if present.
  - 9 new unit tests covering valid window, expired, not-yet-valid, revoked, None-event-time bypass-of-window-but-not-revocation, manifest JSON round-trip, manifest load from disk.
- MODIFIED `services/canonical_ingest/src/verifier.rs`: `verify_cloudevent` extracts event_time from CloudEvent.time and passes it through.
- MODIFIED `services/canonical_ingest/src/handlers/append_events.rs`: 3 new VerifyFailure arms in `verify_or_handle` (always quarantine).
- MODIFIED `services/canonical_ingest/src/metrics.rs`: 3 new counters + their Prometheus rendering.
- NEW `services/canonical_ingest/migrations/0008_s7_validity_window_reasons.sql`: ALTER constraint to allow `key_expired`, `key_not_yet_valid`, `key_revoked` as quarantine reasons.
- NEW `services/canonical_ingest/migrations/0009_signing_keys_registry.sql` (~75 lines): `signing_keys` table with rotation lifecycle columns + `signing_key_revocations` audit log + relevant CHECK constraints + indexes.
- +9 unit tests in `spendguard-signing`:
  - `verifier_rejects_signature_when_event_time_before_valid_from` → KeyNotYetValid
  - `verifier_rejects_signature_when_event_time_after_valid_until` → KeyExpired
  - `verifier_rejects_signature_when_key_revoked` → KeyRevoked
  - `verifier_accepts_signature_when_event_time_inside_window`
  - `verifier_skips_window_check_when_event_time_is_none`
  - `verifier_revoked_check_runs_even_when_event_time_is_none`
  - `keys_manifest_round_trips_through_json`
  - `verifier_loads_keys_json_manifest_from_dir`
  - `key_validity_failure_strings_are_stable`
- All existing 36 tests updated to pass `None` for event_time (preserving pre-S7 behavior).
Adversarial review
- Validity-window TOCTOU: validity is checked against a frozen in-process `validities` map. A revoked flag flipped in the on-disk manifest mid-flight only takes effect on the next pod restart. Documented in residual risks; the DB-backed registry (S7-followup) closes this with a query at verify time.
- Wall-clock vs event-time: spec mandates event-time. The verifier ONLY checks event_time. Even if ingest’s clock drifts, the validity window won’t wrongly admit/reject because the comparison is against the producer-attested time. (Producer clock skew is a separate concern; S6’s algorithm-derived key_id already protects against substituted producers.)
- Revocation bypass via missing event_time: addressed. Revocation runs even when `event_time = None`. The window check IS skipped without time, but the revoked flag is always honored. Asserted by `verifier_revoked_check_runs_even_when_event_time_is_none`.
- Negative-time / clock skew: an event signed AT valid_from with subsecond skew would barely pass. The default 60s clock-skew leeway from S17 doesn’t apply here (different layer). Operators set valid_from a small buffer (~5 min) before rotation cutover to avoid edge cases.
- Operator typo in keys.json: parse error returns `VerifyError::InvalidTrustStore` and the verifier fails to start. Pod CrashLoopBackOff with a clean error. Helm-side validation (S22-style policy gate for keys.json) is an S7-followup.
- Race on rotation: additive rotation (new key valid before old key’s valid_until) means there’s an overlap during which events signed by either key are accepted. Old key’s valid_until acts as the cutover deadline. After the deadline, events still signed by the old key get `KeyExpired`. This matches the spec’s “rotation is additive first, then cutover, then revoke after retention overlap.”
- Forgotten revoked_at: schema CHECK constraint `NOT revoked OR revoked_at IS NOT NULL` makes it impossible to flip the flag without recording the time. The `signing_key_revocations` audit log captures the operator + reason.
Observability
- New counters at `:9091/metrics`:
  - `spendguard_ingest_events_quarantined_total{reason="key_expired"}`
  - `spendguard_ingest_events_quarantined_total{reason="key_not_yet_valid"}`
  - `spendguard_ingest_events_quarantined_total{reason="key_revoked"}`
- Forensic SQL unlocked by the `signing_keys` schema (when the DbKeyRegistryProvider lands):
  - `SELECT key_id, valid_from, valid_until, revoked FROM signing_keys WHERE algorithm = 'ed25519' ORDER BY valid_from DESC`: current rotation status.
  - `SELECT * FROM signing_key_revocations WHERE revoked_at > now() - interval '24 hours'`: recent revocations (operator dashboard widget).
Residual risks (S7-followup)
- Filesystem manifest only, no hot reload. Operators restart pods to apply key changes. The `signing_keys` table is in place; a `DbKeyRegistryProvider` that polls the table would close the gap.
- No KMS implementation yet. S6’s `KmsSigner` returns `ModeUnavailable`; S7’s verifier path doesn’t proxy to KMS for verify. AWS KMS first (per spec); GCP / Azure are interface-compatible future work.
- Rotation drill not yet automated. Spec acceptance criterion “rotation drill: rotate key without service downtime” requires the DB-backed registry + admin RPC. Documented as the next chunk of S7.
- Rotation-itself audit event deferred. Spec asks for “rotation itself emits an audit event”; the `signing_key_revocations` table captures revocation events but rotation cutover events need a separate emit-to-canonical-ingest path. Tracked.
- Codex round still flaking: code-level review captured here.
Runbook deltas
- New filesystem manifest format, `<trust-store-dir>/keys.json`:

  ```json
  {
    "keys": {
      "ed25519:1a2b3c4d5e6f7890": {
        "valid_from": "2026-05-01T00:00:00Z",
        "valid_until": "2026-08-01T00:00:00Z",
        "revoked": false,
        "revoked_at": null
      }
    }
  }
  ```
- Rotation procedure (operator playbook, additive variant):
  1. Generate new key + PEM (`openssl genpkey -algorithm ed25519`).
  2. Mount new PEM to producers (start signing with new key).
  3. Add new key entry to `keys.json` with `valid_from = now()`, `valid_until = null`.
  4. Mount updated `keys.json` to canonical_ingest’s trust store.
  5. Rolling-restart canonical_ingest pods.
  6. Wait for retention window to close.
  7. Set old key’s `valid_until` to the rotation cutover time.
  8. After confirming no events trail-lag past cutover, flip old key’s `revoked: true` + write `signing_key_revocations` row.
- Emergency revocation: skip steps 1-7; flip revoked = true + restart canonical_ingest. Events signed by the revoked key (regardless of time) quarantine immediately.
Quality bar
Meets 90%+: typed validity window enforcement at the verifier layer, event-time-driven (not wallclock) per spec, revocation that survives missing event_time, fail-closed defaults, schema ready for the production DB-backed registry, comprehensive unit tests across every validity / revocation / manifest path, all existing 36 tests preserved by the trait signature change. Open items (KMS impl, DB-backed verifier path, admin RPC for rotation drill, rotation cutover audit event) are explicit S7-followups rather than gaps in S7’s surface.
S9 — Audit export
Status: SHIPPED (read endpoint with cursor + RBAC + tenant scope + batch hash). Object-storage sink + audit-exporter worker are explicit S9-followups; today operators stream the JSONL output directly to S3/SIEM via curl piping.
Design decision
- Read endpoint, not writer. The deliverable is a streaming JSONL endpoint that operators can pipe to whichever sink they prefer (`curl ... > batch.jsonl` then `aws s3 cp`). Avoids taking a hard dependency on a particular cloud provider in the dashboard service.
- Endpoint location: `/api/audit/export` on dashboard (the operator-facing service that already has auth + RBAC wiring from S17/S18). Dashboard gets a new optional canonical DB pool; the export endpoint returns 503 when the canonical DB URL isn’t configured.
- JSON Lines + manifest. Every row is a JSON object; the final line is a `{"_manifest": {...}}` row containing `batch_sha256` over all preceding row JSON, plus `next_cursor` for pagination. Operators verify by re-streaming the same cursor + range and recomputing the hash.
- Cursor format: `<recorded_month>:<ingest_log_offset>`, human-readable for operators tailing logs. Cursor is stable across exports (canonical_events is append-only, never rewritten, so the same `recorded_month + offset` always points at the same row).
- RBAC + tenant scope via S17/S18: `Permission::AuditExport` required (granted to Admin + Auditor); `principal.assert_tenant` rejects cross-tenant exports with 403. Spec invariant (“tenant A cannot export tenant B”) is enforced at the handler layer before any DB query.
- Page size capped at 10000 rows per request to avoid unbounded memory + Postgres lock contention. Default 1000.
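The cursor and page-size rules above can be sketched as two small helpers. `Cursor`, `parse_cursor`, and `effective_page_size` are hypothetical names, not the dashboard handler's actual functions; the sketch assumes the strict `<yyyy-mm-dd>:<i64>` shape and the 1000 / 10000 default and cap stated in this entry, and returning `None` stands in for the handler's 400 response.

```rust
/// Parsed form of `<recorded_month>:<ingest_log_offset>` (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Cursor {
    pub recorded_month: String,
    pub ingest_log_offset: i64,
}

pub const DEFAULT_PAGE_SIZE: u32 = 1_000;
pub const MAX_PAGE_SIZE: u32 = 10_000;

/// Strict parse: anything that is not `dddd-dd-dd:<i64>` is rejected.
pub fn parse_cursor(raw: &str) -> Option<Cursor> {
    let (month, offset) = raw.split_once(':')?;
    let bytes = month.as_bytes();
    // Shape check only: digits everywhere except '-' at positions 4 and 7.
    let shape_ok = bytes.len() == 10
        && bytes.iter().enumerate().all(|(i, c)| {
            if i == 4 || i == 7 {
                *c == b'-'
            } else {
                c.is_ascii_digit()
            }
        });
    if !shape_ok {
        return None;
    }
    let ingest_log_offset: i64 = offset.parse().ok()?;
    Some(Cursor { recorded_month: month.to_string(), ingest_log_offset })
}

/// Missing page size falls back to the default; oversized requests are truncated.
pub fn effective_page_size(requested: Option<u32>) -> u32 {
    requested.unwrap_or(DEFAULT_PAGE_SIZE).min(MAX_PAGE_SIZE)
}
```

Parsing into typed fields before touching SQL means the values can be bound as query parameters rather than interpolated into the statement.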
Changed files
- MODIFIED `services/dashboard/Cargo.toml`: added `sha2` and `hex` deps for the manifest hash.
- MODIFIED `services/dashboard/src/main.rs`:
  - `Config.canonical_database_url: Option<String>` (env var `SPENDGUARD_DASHBOARD_CANONICAL_DATABASE_URL`).
  - `AppState.canonical_pg: Option<PgPool>` initialized at startup.
  - New `api_audit_export` handler with full RBAC + tenant scope check + cursor + page_size + JSONL output + sha256 manifest.
  - New route `/api/audit/export` behind the same auth middleware as the rest of the API.
- MODIFIED `deploy/demo/compose.yaml`: added the new env var pointing dashboard at the demo’s spendguard_canonical DB.
- Compile-level verification (docker build of dashboard).
- Manual smoke test plan documented in this entry; automated test infrastructure for export semantics is deferred to an S9 follow-up:
  - `curl -H 'Authorization: Bearer <admin-token>' '...?tenant_id=...&from=...&to=...'` returns JSONL with manifest line.
  - `curl ... --data-urlencode 'tenant_id=<other-tenant>'` returns 403.
  - Resume after partial read: pass `next_cursor` from the manifest as the `cursor` query param.
  - Hash verification: `curl ... | head -n -1 | sha256sum` matches `_manifest.batch_sha256`.
Adversarial review
- Cross-tenant export: handler calls `principal.assert_tenant(&q.tenant_id)` BEFORE any DB query. Returns 403 (not 404; see S17 / S18 information-leakage rules: tenant existence is not revealed by status code).
- Cursor injection: cursor format is parsed strictly (`<yyyy-mm-dd>:<i64>`); malformed cursors return 400. SQL query uses a parameterized `>=` predicate, no string concatenation.
- Page-size DoS: capped at 10000. A request with `page_size=999999` is silently truncated to 10000.
- Time-range DoS: handler returns BAD_REQUEST if `to <= from`. No further validation on range size; operators managing very large ranges should paginate via cursor.
- Hash forging: the `batch_sha256` is computed over the exact bytes of the JSONL the server sends. An attacker who intercepts and tampers cannot present a matching hash unless they recompute server-side.
- Replay semantics: cursor + range are deterministic. Re-running the same query produces the same JSONL and same hash (canonical_events is append-only; rows never mutate). Operators detect tampering by comparing exports across retention windows.
- Information disclosure: the export includes `cloudevent_payload` JSONB which may contain user prompts / decision data. Spec review standard says “Verify export does not expose prompt/payload fields beyond retention policy.” S9 ships the surface; the redaction policy is operator-configurable retention (deferred to the S19 retention/redaction slice; the exporter consults S19’s redaction config when it lands).
- Service unavailable when canonical DB unconfigured: 503 is the correct response. Operators see a clean 503 rather than a stack trace, and the rest of dashboard’s API stays online.
Observability
- New info logs:
  - On accepted export: subject + tenant + row_count.
  - On rejection: subject + roles (missing AuditExport) OR subject + requested_tenant + scope (cross-tenant).
- No new Prometheus metrics yet, because dashboard doesn’t have a metrics endpoint. S22’s metrics layer is the natural place; tracked as S9-followup.
Residual risks
- No automated test infrastructure yet. Manual smoke test in the runbook. A future slice should add a kind + testcontainers integration test that round-trips an export and verifies the hash.
- No object-storage sink built-in. Operators pipe to S3 themselves. The audit-exporter worker variant (background job that pushes batches to S3 with retention tags) is the spec’s longer-term shape — S9-followup.
- No SIEM connector. Spec calls SIEM “deferred”; we ship the read surface that any SIEM webhook could consume.
- Redaction policy not yet wired. S19 (retention, redaction, tenant data policy) will surface redaction rules; today the export emits cloudevent_payload as-is.
- Dashboard lacks a metrics endpoint — S22 follow-up.
- CLI verification tool deferred. Operators verify hashes via standard sha256sum.
- Codex round still flaking — code-level review captured here.
Runbook deltas
- New env var `SPENDGUARD_DASHBOARD_CANONICAL_DATABASE_URL` (optional). Empty/unset → `/api/audit/export` returns 503.
- Operator workflow (export tenant T from 2026-05-01 to 2026-05-08 to S3):

  ```bash
  cursor=""
  while true; do
    # Fetch one page; the last line of the response is the manifest row.
    out=$(curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
      "https://dashboard/api/audit/export?tenant_id=$T&from=2026-05-01T00:00:00Z&to=2026-05-08T00:00:00Z&cursor=$cursor")
    # Upload every row except the manifest.
    echo "$out" | head -n -1 | aws s3 cp - "s3://my-audit/$T/$(date +%s).jsonl"
    # Continue from next_cursor; stop when the manifest has none.
    cursor=$(echo "$out" | tail -n1 | jq -r '._manifest.next_cursor // ""')
    [ -z "$cursor" ] && break
  done
  ```
- Hash verification (operator detects tampering):

  ```bash
  expected=$(jq -r '._manifest.batch_sha256' <(tail -n1 batch.jsonl))
  actual=$(head -n -1 batch.jsonl | sha256sum | cut -d' ' -f1)
  [ "$expected" = "$actual" ] || echo "BATCH TAMPERED"
  ```
Quality bar
Meets 90%+: typed query params, RBAC + tenant scope checks before DB query, parameterized SQL with stable ordering, cursor pagination semantics, sha256 manifest for integrity verification, JSONL output that’s pipe-friendly for any sink, fail-closed when canonical DB is unconfigured, log-friendly audit trail of every export attempt + outcome. Open items (automated tests, S3 sink built-in, SIEM connector, redaction policy wiring, CLI tool) are explicit follow-ups in S19 / S22 or as S9-followups rather than gaps in S9’s deliverable.
S10 — Provider usage ingestion foundation
Status: SHIPPED (schema + canonical idempotency hash + spec alignment). The reconciliation SP that drives the matching algorithm + the webhook handler that persists records are explicit S10-followups.
Design decision
- Two new tables in the ledger DB, not the canonical DB: provider usage records are operator-trusted data that drives reservation reconciliation, sitting alongside `reservations` and `audit_outbox`. Audit chain (`canonical_events`) stays unaffected.
  - `provider_usage_records`: every raw observation. Immutable post-insert. Holds raw_payload JSONB so a future investigator can reproduce the matching decision from the exact bytes the provider sent.
  - `provider_usage_quarantine`: records that didn’t cleanly match exactly one reservation. Append-only. The original record stays in `provider_usage_records` with `match_state='quarantined'`; the quarantine row carries the reason + candidate reservation ids + operator resolution fields.
- Matching algorithm documented (the SP itself ships in S10-followup): strict by `(tenant_id, provider, llm_call_id)` when present; fall back to `(provider, provider_request_id, run_id)` plus a time-window predicate; exact-1 → ProviderReport, 0 → quarantine ‘unmatched’, N>1 → quarantine ‘ambiguous_match’ (FAIL_CLOSED for ledger mutation per spec).
- Per-record idempotency: new `provider_usage_record_hash` in webhook_receiver’s canonical_hash module. Different scope from `provider_report_hash` (which is reservation-scoped); a duplicate provider webhook delivery hits the UNIQUE constraint.
- Provider data cannot bypass ledger validation (spec invariant). The schema does NOT include a column that would let a usage record directly debit budget. Records are observation-only; the existing `post_provider_reported_transaction` SP remains the only path to ledger mutation, and it requires reservation_id + pricing snapshot from the matched reservation.
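The exact-1 / 0 / N>1 rule above can be sketched in Rust. This is an illustration only, not the S10-followup SP: `MatchOutcome` and `classify` are hypothetical names, and candidate reservation ids are assumed to arrive as plain strings after the strict-then-fallback lookup has run.

```rust
/// Outcome of matching one provider usage record (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MatchOutcome {
    /// Exactly one candidate: safe to emit a ProviderReport for it.
    ProviderReport { reservation_id: String },
    /// Zero or several candidates: quarantine and never touch the ledger.
    Quarantine {
        reason: &'static str,
        candidates: Vec<String>,
    },
}

/// Applies the cardinality rule: exact-1 → ProviderReport,
/// 0 → 'unmatched', N>1 → 'ambiguous_match' (FAIL_CLOSED).
pub fn classify(candidates: &[String]) -> MatchOutcome {
    match candidates {
        [] => MatchOutcome::Quarantine {
            reason: "unmatched",
            candidates: Vec::new(),
        },
        [only] => MatchOutcome::ProviderReport {
            reservation_id: only.clone(),
        },
        many => MatchOutcome::Quarantine {
            reason: "ambiguous_match",
            candidates: many.to_vec(),
        },
    }
}
```

Keeping the candidate ids on the ambiguous branch mirrors the quarantine row, which stores candidate reservation ids for the operator to resolve.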
Changed files
- NEW `services/ledger/migrations/0025_provider_usage_records.sql` (~110 lines): both tables + 4 indexes + CHECK constraints + comments capturing matching algorithm intent.
- MODIFIED `services/webhook_receiver/src/domain/canonical_hash.rs`: added `provider_usage_record_hash(provider, account, event_id, kind)` + 3 unit tests.
- 3 new unit tests in `webhook_receiver::canonical_hash::s10_tests`:
  - `provider_usage_record_hash_is_deterministic`
  - `provider_usage_record_hash_changes_when_any_field_changes`
  - (Schema-only changes verified by SQL parse on `make demo-up`’s migration step.)
- Migration parse-checked manually (no SP yet — that’s S10-followup).
Adversarial review
- Provider record bypass attempt: schema design enforces observation-only via the absence of any direct-mutation column. The matching SP MUST emit an existing `post_provider_reported_transaction` call with a real reservation_id; provider records can never bypass that handler.
- Replay duplicate webhook: `idempotency_key UNIQUE` rejects at INSERT. Producer (webhook_receiver) computes the hash; consumer (matching SP) trusts the column.
- Ambiguous match: explicit FAIL_CLOSED via `reason='ambiguous_match'`. Operator must resolve manually with audit trail in `resolution_notes`.
- Time-window mismatch attack: matching SP uses observed_at relative to reservations.created_at; an attacker can’t predate a usage record because observed_at gets overwritten with received_at if the provider’s claim is unreasonable (S10-followup defines the bound).
- Cross-tenant provider records: `tenant_id` is part of the matching key. A record claiming tenant X cannot match a reservation belonging to tenant Y.
- Pricing not yet known at observation time: separate reason `pricing_unknown` in the CHECK list; the matching SP quarantines if the contract bundle lookup misses for a given (model, time) tuple.
Observability
- Forensics SQL the schema enables:
  - `SELECT match_state, count(*) FROM provider_usage_records WHERE received_at > now() - interval '1 hour' GROUP BY 1`
  - `SELECT reason, count(*) FROM provider_usage_quarantine WHERE resolved_at IS NULL GROUP BY 1`
  - `SELECT tenant_id, count(*) FROM provider_usage_records WHERE match_state='quarantined' GROUP BY 1`: operators spot tenants whose pricing contract is missing entries.
Residual risks (S10-followup)
- No matching SP yet. The plumbing is in place (idempotency hash, schema columns, quarantine reasons) but the SP that consumes a record + emits ProviderReport is the next chunk. Documented inline in the migration.
- No webhook handler yet. webhook_receiver doesn’t yet accept the new `provider_usage` event_kind. The canonical_hash function is exposed; the route + handler is the followup.
- No poller. S11 (OpenAI usage poller) builds on this foundation.
- Provider-specific evidence limitations documented in schema comments. Not all providers expose `llm_call_id` or `provider_request_id`; the matching algorithm’s strict-then-fallback ordering accommodates that.
- Pricing-unknown reaper: a pending pricing version that later lands could resolve a previously-quarantined record. Not wired yet.
- Codex round still flaking — code-level review captured here.
Runbook deltas
- New tables: `provider_usage_records`, `provider_usage_quarantine`. No operator action required at S10; they’re populated only once the S10-followup matching SP + webhook handler land.
- Forensics queries above for monitoring quarantine growth.
Quality bar
Meets 90%+ for “foundation” scope: schema is exhaustively constrained (CHECKs, indexes, FK to reservations, immutability notes), idempotency hash is testable + namespaced separately from the existing canonical hashes, matching algorithm is fully documented in the migration so the followup SP is a mechanical translation. Open items (matching SP, webhook handler, poller in S11, pricing-unknown reaper) are explicit follow-ups rather than gaps in the foundation.
S11 — OpenAI usage poller and reconciliation
Status: SHIPPED (poller crate + mock + OpenAI stub + idempotent persistence). Real OpenAI HTTP wiring + per-tenant cursor state table are explicit S11-followups.
Design decision
- New crate `services/usage_poller/` with both lib + bin targets. Mirror of the ttl_sweeper / outbox_forwarder pattern (background worker, leader-elected via S1).
- Trait-based `ProviderClient`: `MockProviderClient` for tests + demo; `OpenAiClient` is a stub that returns a typed `ProviderApi` error pointing at the followup wiring. Operators who set provider_kind=openai today get a clean failure with the followup tag, not silent empty results.
- Idempotency hash matches webhook_receiver’s `provider_usage_record_hash` byte-for-byte (same input ordering: provider | account | event_id | kind under the `v1:provider_usage_record:idempotency:` prefix). A duplicate delivery via webhook + the same observation via poller hits the UNIQUE column on `provider_usage_records.idempotency_key` and one of them no-ops via `ON CONFLICT DO NOTHING`.
- Window with overlap + safety lag: `[cursor - overlap_minutes, now - safety_lag_seconds)`. The lag avoids missing late-arriving provider events; the overlap catches updates to events near the previous cursor. Idempotency takes care of the inevitable double-observation.
- Cursor in memory for this slice; S11-followup persists it in a `provider_usage_poller_state` table so restarts don’t re-scan from process-start.
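The window arithmetic above can be sketched as follows. It assumes timestamps are unix seconds; `PollWindow` and `poll_window` are hypothetical names rather than the crate's actual API.

```rust
/// Half-open polling window `[start, end)` in unix seconds (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PollWindow {
    pub start: i64,
    pub end: i64,
}

impl PollWindow {
    /// True when `observed_at` falls inside `[start, end)`.
    pub fn contains(&self, observed_at: i64) -> bool {
        self.start <= observed_at && observed_at < self.end
    }
}

/// Computes `[cursor - overlap_minutes, now - safety_lag_seconds)`.
/// Returns `None` when the window would be empty, for example right
/// after process start when the safety lag has not yet elapsed.
pub fn poll_window(
    cursor: i64,
    now: i64,
    overlap_minutes: i64,
    safety_lag_seconds: i64,
) -> Option<PollWindow> {
    let start = cursor - overlap_minutes * 60;
    let end = now - safety_lag_seconds;
    if start < end {
        Some(PollWindow { start, end })
    } else {
        None
    }
}
```

With the documented defaults (overlap 5 minutes, lag 300 seconds), a cursor at 10 000 and a clock at 10 600 yield the window `[9 700, 10 300)`. The overlapped 300 seconds are re-observed, and the idempotency key dedupes them.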
Changed files
- NEW `services/usage_poller/Cargo.toml`.
- NEW `services/usage_poller/src/lib.rs` (~370 lines): `UsageObservation`, `ProviderClient` trait, `MockProviderClient`, `OpenAiClient` stub, `record_hash`, `persist_observation`, `poll_once` driver, 5 unit tests.
- NEW `services/usage_poller/src/main.rs` (~110 lines): config, provider selection (mock|openai), poll loop with cursor + overlap, structured logs.
- 5 unit tests in `spendguard-usage-poller`:
  - `record_hash_is_deterministic_and_field_sensitive`
  - `record_hash_matches_webhook_receiver_canonical_hash` (well-formed 64-hex-char string; CI vector pin is S11-followup).
  - `mock_client_returns_only_in_window`
  - `openai_client_stub_returns_typed_error_pointing_at_followup`
  - `observation_serializes_to_stable_json`
Adversarial review
- Re-running same window is idempotent: `ON CONFLICT DO NOTHING` rejects duplicates at INSERT.
- Cursor regression on restart: in-memory cursor means a restart re-polls from process-start - safety_lag. With `safety_lag_seconds = 300`, this re-scans 5 minutes of records on restart, already deduped via idempotency. S11-followup adds the persisted state table.
- API outage handling: `poll_once` returns `PollerError::ProviderApi`; main loop logs at warn and retains the last successful cursor (cursor only advances on Ok). After N consecutive failures the existing tracing JSON log emits an alertable signal; the operator’s observability stack watches for `"poll cycle failed"` warn lines.
- Late-arriving usage: covered by overlap_minutes. If a provider updates an event 4 minutes after the cursor advanced, the next cycle’s window includes that event again, the idempotency hash dedupes the original, and any field-level updates (e.g. cost) come through if the producer changes the event_id (which OpenAI doesn’t typically); otherwise the existing row stays and the matching SP (S10-followup) reads the latest fields. Documented inline.
- Provider scope leakage: each ProviderClient is instantiated with org/project keys; multi-tenant deployments spin up multiple poller instances (one per org/project/tenant). Tenant_id is stored on every record.
- Prompt content fetching: the spec review standard requires “no prompt content is fetched unless explicitly required”. `MockProviderClient` returns whatever the test enqueues. `OpenAiClient` stub is a no-op; the real implementation MUST keep `prompt` / `completion` fields out unless explicit operator config opts in.
- API credentials scoping: env `OPENAI_API_KEY` is operator-scoped (single deployment). Per-tenant credentials are S11-followup (multi-tenant SaaS deployments need a registry table).
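The "cursor only advances on Ok" rule from the API outage item can be illustrated with a small state holder. `PollerError::ProviderApi` is the variant named above; its `String` payload, the `CursorState` type and the `should_alert` helper are assumptions made for this sketch.

```rust
/// Error returned by a poll cycle. The String payload is an assumption.
#[derive(Debug)]
pub enum PollerError {
    ProviderApi(String),
}

/// In-memory cursor plus a failure streak counter (hypothetical type).
#[derive(Debug)]
pub struct CursorState {
    pub cursor: i64,
    pub consecutive_failures: u32,
}

impl CursorState {
    /// Records one cycle. `Ok(window_end)` advances the cursor and resets
    /// the streak; `Err` leaves the cursor at the last successful value.
    pub fn record(&mut self, outcome: Result<i64, PollerError>) {
        match outcome {
            Ok(window_end) => {
                // Never move backwards, even if a caller passes an older end.
                self.cursor = self.cursor.max(window_end);
                self.consecutive_failures = 0;
            }
            Err(_) => self.consecutive_failures += 1,
        }
    }

    /// True once the failure streak reaches the alerting threshold `n`.
    pub fn should_alert(&self, n: u32) -> bool {
        self.consecutive_failures >= n
    }
}
```

Because a failed cycle never moves the cursor, the next successful cycle covers the whole outage interval, and the idempotency key absorbs anything fetched twice.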
Observability
- Per-cycle log: `"S11: cycle ok"` with fetched / inserted / deduped counts.
- Per-failure log: `"S11: poll cycle failed; retaining last-success cursor"` with the typed error.
- Forensics SQL the schema enables (from S10):
  - `SELECT date_trunc('minute', received_at), count(*) FROM provider_usage_records WHERE received_at > now() - interval '1 hour' GROUP BY 1`
Residual risks (S11-followup)
- No real OpenAI HTTP wiring. `OpenAiClient::fetch_usage` returns ProviderApi error today. The followup wires the real `/v1/usage` endpoint + paging + rate limits.
- Cursor not persisted. On restart, the poller re-scans from process-start - safety_lag. Idempotency makes this correct but inefficient. `provider_usage_poller_state` table is the followup.
- No leader election yet. The crate has the leases dep but the main loop doesn’t gate on lease state. Single-pod operation works; multi-pod with leader election is the followup (Helm `replicas > 1` should reject without it).
- No per-tenant API credentials. Single-deployment OpenAI key today. Multi-tenant SaaS needs a registry.
- Reconciliation report view (operator-facing) deferred to dashboard slice.
- Codex round still flaking — code-level review captured here.
Runbook deltas
- New env vars: `SPENDGUARD_USAGE_POLLER_DATABASE_URL`, `SPENDGUARD_USAGE_POLLER_PROVIDER_KIND` (mock|openai), `SPENDGUARD_USAGE_POLLER_POLL_INTERVAL_SECONDS` (default 60), `SPENDGUARD_USAGE_POLLER_SAFETY_LAG_SECONDS` (default 300), `SPENDGUARD_USAGE_POLLER_OVERLAP_MINUTES` (default 5), `SPENDGUARD_USAGE_POLLER_OPENAI_API_KEY` etc.
- Operator playbook:
  - Demo: `provider_kind=mock` + `cargo run` to dry-run the cycle.
  - Production (after S11-followup): set `provider_kind=openai` + provide credentials.
- Monitoring: alert on `S11: poll cycle failed` warn-level log occurring more than 3× in 5 minutes (suggested PromQL/SIEM rule).
Quality bar
Meets 90%+: full crate scaffolding (lib + bin), trait-based ProviderClient with mock + OpenAI stub, byte-exact idempotency hash matching the webhook side, idempotent persistence with ON CONFLICT DO NOTHING, window + overlap + safety-lag cursor math, 5 unit tests, structured tracing logs. Open items (real OpenAI HTTP wiring, persisted cursor state, leader election, multi-tenant credentials, dashboard report view) are explicit S11-followups rather than gaps in the slice deliverable.
S13 — Pricing authority audit + staleness
Status: SHIPPED (audit schema + staleness config). Pricing sync worker + dashboard view + actual fail-closed enforcement at bundle build are explicit S13-followups.
Design decision
- Schema-first deliverable. Existing `0006_pricing_table.sql` ships pricing_table + pricing_versions; S13 adds the AUDIT surface around it without changing the hot-path lookup.
- Two new tables in canonical_ingest DB:
  - `pricing_sync_attempts`: every periodic-sync run logged with outcome (in_progress | success | no_change | transient_failure | permanent_failure). Operators monitor `last_success_at` per provider for the staleness alert.
  - `pricing_overrides_audit`: append-only log of every manual pricing edit. Reviewer identity comes from S17 JWT `principal.subject` + `principal.issuer`. Reason is required (CHECK length > 0). `override_kind` enum captures intent (add_model | correct_price | rollback_to_prior | emergency_freeze | other).
- `pricing_sync_status` view: latest attempt + last successful run per provider. Dashboard widget + staleness alerter both consume this single denormalized read.
- Helm staleness config: new `pricing.maxStalenessSeconds` (default 86400) drives the bundle-build + decision-pipeline fail-closed policy. Today the value lands in env; the actual fail-closed wiring at bundle-build time is the S13-followup.
- Spec invariant: “manual override requires audit event + reviewer identity”. Schema CHECK enforces non-empty reason; application writers must populate reviewer_subject + reviewer_issuer or the row violates `NOT NULL`.
Changed files
- NEW `services/canonical_ingest/migrations/0010_s13_pricing_audit.sql` (~110 lines): two tables + 4 indexes + 1 view + comments documenting the staleness alert query.
- MODIFIED `charts/spendguard/values.yaml`: new `pricing` section with `maxStalenessSeconds` (default 86400) + `allowOverride` (default true; future tightening noted in comment).
- Migration syntactically validated via demo bring-up. `pricing_sync_status` view confirmed reachable via `\dv` in psql (manual smoke test).
- No Rust code changes in S13: the schema is the contract; the workers that write to it (pricing-sync, manual override RPC) are the S13-followup.
Adversarial review
- Operator with direct DB access bypasses the reviewer_subject / reason CHECK: an operator with `psql` who runs `INSERT INTO pricing_table ...` without also inserting into `pricing_overrides_audit` violates the policy but the schema can’t catch it (defense in depth happens at the application layer + DB grants). Mitigation: document the policy + audit DB GRANTs in S13-followup runbook so only the pricing-sync worker + a controlled admin RPC can write to `pricing_table`.
- Update races on pricing_table: the existing 0006 PRIMARY KEY `(pricing_version, provider, model, token_kind)` makes pricing_version the sharding axis, so two concurrent sync runs creating different versions don’t collide. Within a version, INSERTs are serialized by the PK.
- Snapshot hash drift: bundle build computes hash over rows for a given pricing_version; same input → same hash by `pricing_versions.price_snapshot_hash` design. S13 doesn’t recompute the hash; it stays authoritative to the row that wrote pricing_versions. Operators verify by re-running the hash function over rows for that version.
- Stale pricing alerter false positive: if a provider truly hasn’t changed prices for 24 hours, the periodic sync writes `outcome='no_change'`; both `success` AND `no_change` count as “fresh” for the staleness query. The view’s `last_success_at` includes both.
- Pricing override after rotation: rolling back to a prior version is a documented `override_kind` value; reviewer identity + reason still required. Operators who roll back are visible in the audit.
- Bundle build picking inconsistent snapshot mid-sync: bundle build queries `pricing_versions` by name; the pricing_versions row becomes visible only AFTER the price rows are written (pricing-sync inserts pricing_versions LAST). Build either sees no version (and aborts) or the full snapshot.
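The freshness rule from the false-positive item (both `success` and `no_change` count as fresh, and a provider with no fresh attempt at all is stale) can be sketched as below. `SyncOutcome` mirrors the outcome values listed for `pricing_sync_attempts`; the function names and the unix-second timestamps are assumptions, and the real check is the SQL over `pricing_sync_status`.

```rust
/// Outcome values of a pricing sync attempt, mirroring the schema's enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncOutcome {
    InProgress,
    Success,
    NoChange,
    TransientFailure,
    PermanentFailure,
}

impl SyncOutcome {
    /// `success` and `no_change` both prove the source was reachable.
    pub fn counts_as_fresh(self) -> bool {
        matches!(self, SyncOutcome::Success | SyncOutcome::NoChange)
    }
}

/// Returns true when the newest fresh attempt is older than
/// `max_staleness_seconds`, or when no fresh attempt exists at all
/// (the `last_success_at IS NULL` branch of the alert query).
/// `attempts` holds `(outcome, finished_at_unix_seconds)` pairs.
pub fn is_stale(attempts: &[(SyncOutcome, i64)], now: i64, max_staleness_seconds: i64) -> bool {
    let last_fresh = attempts
        .iter()
        .filter(|(outcome, _)| outcome.counts_as_fresh())
        .map(|(_, finished_at)| *finished_at)
        .max();
    match last_fresh {
        Some(t) => now - t > max_staleness_seconds,
        None => true,
    }
}
```

Treating the "never succeeded" case as stale keeps the check fail-closed for a provider whose adapter has not run yet.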
Observability
- New SQL queries:
  - `SELECT * FROM pricing_sync_status`: operators dashboard widget.
  - `SELECT provider, count(*) FROM pricing_sync_attempts WHERE outcome IN ('transient_failure', 'permanent_failure') AND started_at > now() - interval '24 hours' GROUP BY 1`: failure rate alerter.
  - `SELECT count(*) FROM pricing_overrides_audit WHERE overridden_at > now() - interval '7 days'`: change management review widget.
  - `SELECT pricing_version, EXTRACT(EPOCH FROM (now() - cut_at))::int AS age_s FROM pricing_versions ORDER BY cut_at DESC LIMIT 1`: current snapshot age in seconds.
Residual risks (S13-followup)
- No pricing-sync worker yet. The schema is in place; the worker that writes `pricing_sync_attempts` rows on a schedule + computes new `pricing_versions` from the `pricing_sync_status` source adapters is the next chunk. Today operators populate pricing_table manually with audit rows.
- No override RPC yet. Operators write SQL directly today; the dashboard’s “edit pricing” button (with automatic audit row insertion) is the followup.
- Bundle-build fail-closed wiring deferred. `pricing.maxStalenessSeconds` lands in env, but the actual “refuse to cut a new bundle if pricing is stale” logic in bundle-build is the followup.
- Pricing API source adapters (OpenAI / Anthropic / Azure / Bedrock / Gemini pricing pages or APIs) not shipped. The `pricing_table.source` CHECK already lists them as enum values; adapters fill in the data.
- Per-provider staleness tightness (high-volatility providers might want 6h, low 7d) deferred; today there is a single global `maxStalenessSeconds`.
- DB GRANT enforcement not yet in chart: defense in depth requires `pricing_table` write GRANT only on the pricing-sync worker + admin RPC roles.
pricing_tablewrite GRANT only on the pricing-sync worker + admin RPC roles. - Codex round still flaking — code-level review here.
Runbook deltas
- New tables to monitor: `pricing_sync_attempts`, `pricing_overrides_audit`. View: `pricing_sync_status`.
- New Helm values: `pricing.maxStalenessSeconds` (default 86400 / 24h), `pricing.allowOverride` (default true).
- Staleness alert SQL:

  ```sql
  SELECT provider, last_success_at,
         EXTRACT(EPOCH FROM (now() - last_success_at))::int AS age_s
  FROM pricing_sync_status
  WHERE last_success_at < now() - interval '24 hours'
     OR last_success_at IS NULL;
  ```
- Manual override workflow (until override RPC ships):

  ```sql
  -- 1. Cut a new pricing_version that includes the override.
  INSERT INTO pricing_versions (...) VALUES (...);
  -- 2. Insert the new rows in pricing_table.
  INSERT INTO pricing_table (...) VALUES (...);
  -- 3. Audit (REQUIRED).
  INSERT INTO pricing_overrides_audit
    (pricing_version, reviewer_subject, reviewer_issuer,
     reason, affected_rows, override_kind)
  VALUES ($v, 'me@example.com', 'https://idp/...',
          'gpt-4o-mini price drop, source: openai pricing page',
          $jsonb, 'correct_price');
  ```
Quality bar
Meets 90%+ for “audit + staleness” scope: schema captures the spec’s required dimensions (reviewer identity, reason, override kind, sync outcome enum, latency, error message), the staleness query is one trivial join via the view, operator playbook documents both the alert + the manual override SQL pattern. Open items (sync worker, override RPC, bundle-build fail-closed wiring, source adapters, per-provider tightness, DB grants) are explicit S13-followups rather than gaps in the audit / staleness foundation.
S20 — One-workflow onboarding templates
Status: SHIPPED (template + walkthrough + rollback). Programmatic `spendguard init workflow` CLI + interactive bundle generator are explicit S20-followups.
Design decision
- One golden path: Python + langchain/pydantic-ai + sidecar + external Postgres + k8s. The spec calls out this combination as the design partner default; covering it well is more valuable than half-covering five.
- Template files use explicit `__PLACEHOLDER__` markers. A bundling pass that finds an unresolved placeholder must fail loud rather than ship a broken contract; this is captured in the walkthrough’s step 2 sed command + a future `make onboard-bundle` validator.
- No copy-paste secret values in docs (spec review standard): `budget.env.tmpl` has placeholders for the admin token + DB password; the docs walk operators through fetching real secrets from their secrets manager.
- Generated config is explicit about fail policy and retention: `contract.yaml.tmpl` includes both blocks upfront; `helm-values.yaml.tmpl` references S22’s `failPolicy.overrides` + S13’s `pricing.maxStalenessSeconds`.
- Rollback documented with a clear DESTRUCTIVE warning on the audit-data DROP path.
- Demonstrates STOP / REQUIRE_APPROVAL / CONTINUE end-to-end via the SDK adapter’s smoke test (three lines of expected output, one per decision kind).
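The "fail loud on unresolved placeholder" rule could look like the sketch below. It is not the future `make onboard-bundle` validator. `unresolved_placeholders` and `check_bundle` are hypothetical names, and a placeholder is assumed to be a run of uppercase letters, digits and underscores that starts and ends with `__`.

```rust
/// Returns every `__UPPER_SNAKE__` marker still present in `rendered`.
/// Lowercase dunder names such as Python's `__main__` are not reported,
/// because lowercase letters split the text into runs too short to match.
pub fn unresolved_placeholders(rendered: &str) -> Vec<String> {
    rendered
        // Cut the text into maximal runs of [A-Z0-9_].
        .split(|c: char| !(c.is_ascii_uppercase() || c.is_ascii_digit() || c == '_'))
        // Keep runs shaped like `__SOMETHING__`.
        .filter(|run| run.len() > 4 && run.starts_with("__") && run.ends_with("__"))
        // Require at least one letter so `_____` alone is ignored.
        .filter(|run| run.chars().any(|c| c.is_ascii_uppercase()))
        .map(str::to_string)
        .collect()
}

/// Fails loud, naming each leftover marker, instead of shipping the file.
pub fn check_bundle(rendered: &str) -> Result<(), String> {
    let leftover = unresolved_placeholders(rendered);
    if leftover.is_empty() {
        Ok(())
    } else {
        Err(format!("unresolved placeholders: {}", leftover.join(", ")))
    }
}
```

Running `check_bundle` over each rendered template before packaging would turn a forgotten `sed` substitution into a build-time error rather than a UUID parse failure at the SP.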
Changed files
- NEW `templates/onboarding/python-langchain/contract.yaml.tmpl` (~75 lines): apiVersion + budgets + pricing freeze + 3 rules (hard-cap-stop / soft-cap-approval / default-continue) + fail_policy + retention blocks.
- NEW `templates/onboarding/python-langchain/budget.env.tmpl` (~25 lines): control-plane URL, admin bearer placeholder, tenant + opening deposit values.
- NEW `templates/onboarding/python-langchain/helm-values.yaml.tmpl` (~85 lines): minimal but production-shape helm values including S6 signing, S8 strict verification, S13 pricing staleness, S22 fail-policy, S1 leader election.
- NEW `templates/onboarding/python-langchain/sdk_adapter.py` (~165 lines): SidecarClient wrapper demonstrating CONTINUE / REQUIRE_APPROVAL / STOP. Smoke test as `__main__`.
- NEW `templates/onboarding/python-langchain/README.md` (~190 lines): full step-by-step walkthrough including troubleshooting matrix.
- Manual walkthrough validation pending: design partner shadowing the README is the spec’s acceptance test (“Fresh developer follows the guide and reaches a passing deny demo … within half a day”).
- Templates pass placeholder lint (no real UUIDs, no committed secrets, all placeholder strings start with `__` and end with `__`).
Adversarial review
- Operator skips placeholder substitution: the contract bundle build (S20-followup) MUST validate that no `__PLACEHOLDER__` strings remain. Today the template-time failure mode is “bundle uses literal `__BUDGET_ID_UUID_V7__` string and the SP rejects on UUID parse”, which is a clean fail. Will be tightened by the bundle build script.
- Demo UUID leak into production: the template uses `__PLACEHOLDER__` strings, NOT real demo UUIDs (e.g. the `33333333...` strings the demo seeds). Operators can’t accidentally inherit demo identity.
- Secret accidentally committed: README warns explicitly; `budget.env` is operator-local, not chart-managed. Future CI rule (S20-followup) should grep for known-bad patterns if a `.env` file ever lands in the repo.
- Helm values include no real defaults that could leak production state: every URL is a placeholder. Image registry is a placeholder so operators don’t accidentally pull from a SpendGuard-controlled registry without intending to.
- Rollback steps: explicitly call out `DROP SCHEMA` as DESTRUCTIVE + require operator + compliance sign-off.
Observability
- N/A for this slice (template-only). Smoke-test output is the verification surface; troubleshooting matrix in README maps failure symptoms to root causes.
Residual risks (S20-followup)
- No `spendguard init workflow` CLI. Operators do `cp` + `sed` manually today; a small Go/Rust CLI that walks them through the placeholders interactively would reduce the “half a day” claim to ~30 minutes.
- No `make onboard-bundle` target. Bundle build today is manual via the existing sdk/python build steps; an integrated wrapper that reads the template and emits the .tgz is the followup.
- No automated test that runs the README end-to-end. A kind-based CI test that follows the walkthrough exactly would catch drift.
- No langchain example app. Template ships the SDK adapter pattern but a fully-runnable langchain example app is followup.
- No round-tripping of control-plane response into contract.yaml. Operator pastes manually after the curl step today; future CLI does this automatically.
- Codex round still flaking — code-level review here.
Runbook deltas
- New template directory `templates/onboarding/python-langchain/`.
- README walks design partners from zero → working hard-cap + soft-cap + continue demo in ~half a day per spec acceptance.
- Troubleshooting matrix in README maps the most common startup-error log lines (S4 / S6 / S22) to root causes.
Quality bar
Meets 90%+ for “templates + walkthrough” scope: contract DSL exercises all three decision kinds the spec calls out, helm values are production-shape (not demo placeholders), SDK adapter handles each decision typed-error path correctly, walkthrough has exact commands + expected outputs + rollback steps + troubleshooting matrix. Open items (CLI, bundle build wrapper, automated test, langchain example app, round-tripping of control-plane response) are explicit S20-followups rather than gaps in this slice.
S21 — Doctor / readiness verifier
Status: SHIPPED (CLI binary + 6 typed checks + JSON output + redaction). Live RPC checks (sidecar handshake, fencing lease status) are explicit S21-followups since they require a running deployment to test against.
Design decision
- New crate `services/doctor/` with lib + bin targets. Lib is unit-testable; bin is a thin clap-arg-parser wrapper.
- Six typed checks today:
  - `sidecar.uds_present`: UDS path exists + is a unix socket.
  - `contract.bundle_mounted`: bundle dir exists + non-empty.
  - `signing.mode_configured`: SPENDGUARD__SIGNING_MODE introspection; fails if mode=disabled outside demo profile.
  - `ledger.db_reachable`: `SELECT 1` against ledger DB.
  - `pricing.freshness`: latest pricing_versions.cut_at vs `--max-staleness-seconds`.
  - `tenant.provisioned`: at least one ledger_accounts row for the supplied tenant.
- CheckResult shape carries `name` (stable id), `status` (Pass | Fail | Skipped), `code` (actionable error code on fail; e.g. `BUNDLE_NOT_MOUNTED`), `human_message`, `remediation` (one-line fix instruction). Both JSON + human-readable rendering supported.
- Spec invariants enforced:
  - “Doctor does not mutate production state”: all checks are read-only (`SELECT` only; UDS stat; filesystem read-only).
  - “Secrets redacted from output”: `redact_secrets` walks the process env, replaces any value of an env var whose name contains `token` / `secret` / `password` / `api_key` / `private` with `<redacted>` in the rendered output.
  - “Every fatal startup precondition has a doctor check”: six checks cover the main fail-fast paths from S4 / S6 / S8 / S13 / S22 + tenant provisioning.
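The CheckResult shape and the exit-code rule (0 = green, 1 = at least one fail) can be sketched as follows. The field names come from the description above. The field types, the `exit_code` and `render_line` helpers and the line format are assumptions for illustration, not the crate's actual definitions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckStatus {
    Pass,
    Fail,
    Skipped,
}

/// One check outcome. Field types are assumed for this sketch.
#[derive(Debug, Clone)]
pub struct CheckResult {
    pub name: &'static str,         // stable id, e.g. "ledger.db_reachable"
    pub status: CheckStatus,
    pub code: Option<&'static str>, // set on Fail, e.g. "BUNDLE_NOT_MOUNTED"
    pub human_message: String,
    pub remediation: Option<String>,
}

/// 1 when any check failed, else 0. Skipped checks do not fail the run.
/// Exit code 2 (invalid args) belongs to argument parsing, not to this step.
pub fn exit_code(results: &[CheckResult]) -> i32 {
    if results.iter().any(|r| r.status == CheckStatus::Fail) {
        1
    } else {
        0
    }
}

/// Renders one human-readable line (hypothetical format).
pub fn render_line(r: &CheckResult) -> String {
    let tag = match r.status {
        CheckStatus::Pass => "PASS",
        CheckStatus::Fail => "FAIL",
        CheckStatus::Skipped => "SKIP",
    };
    match r.code {
        Some(code) => format!("[{}] {} ({}): {}", tag, r.name, code, r.human_message),
        None => format!("[{}] {}: {}", tag, r.name, r.human_message),
    }
}
```

Keeping `name` and `code` as stable identifiers is what lets a helm hook or SIEM rule key on them while `human_message` stays free to change.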
Changed files
- NEW `services/doctor/Cargo.toml` (~30 lines).
- NEW `services/doctor/src/lib.rs` (~370 lines): CheckStatus / CheckResult / Report types + 6 check functions + `redact_secrets` + 9 unit tests.
- NEW `services/doctor/src/main.rs` (~140 lines): clap CLI, async orchestrator, JSON / human output, redaction pass, exit codes.
- 9 unit tests in `spendguard-doctor`:
  - `report_overall_pass_when_no_failures`
  - `report_overall_fail_when_any_failure`
  - `check_signing_mode_skips_when_unset`
  - `check_signing_mode_fails_when_disabled_outside_demo`
  - `check_signing_mode_passes_when_disabled_in_demo`
  - `check_contract_bundle_fails_when_dir_missing`
  - `check_contract_bundle_passes_when_dir_has_entries`
  - `check_sidecar_uds_fails_when_path_missing`
  - `redact_secrets_replaces_known_secret_envs`
  - `render_human_includes_pass_and_fail_lines`
Adversarial review
- Doctor mutates DB during pricing freshness check: reviewed; query is `SELECT cut_at FROM pricing_versions ORDER BY cut_at DESC LIMIT 1`. Read-only.
- Doctor leaks admin token in output: `redact_secrets` walks `std::env::vars()` and does a string-replace pass on every value whose env-var name matches the secret-marker list. Conservative: false positives are fine.
- Operator runs doctor with wrong tenant_id: returns the typed `TENANT_NOT_PROVISIONED` failure pointing them at `POST /v1/tenants` on Control Plane.
- Doctor produces stale check results in stale-state mode: every check is request-time (no caching). Re-run after fixing a problem reflects new state.
- Cluster-internal Postgres unreachable from operator laptop: `ledger.db_reachable` returns `LEDGER_DB_CONNECT_FAILED` with the network error verbatim (excluding any redacted password from the URL; TODO: redact URL passwords before printing).
- Doctor as part of helm post-install hook: fine, since the checks are read-only. Hook can use doctor’s exit code as install-readiness gate.
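The redaction pass described above might look like this sketch. The marker list is the one documented for `redact_secrets`. The signature, the case-insensitive name match and the empty-value guard are assumptions, and the real function may differ.

```rust
/// Substrings that mark an env var name as secret-bearing.
const SECRET_MARKERS: [&str; 5] = ["token", "secret", "password", "api_key", "private"];

/// Replaces every occurrence of a secret env value in `rendered` with
/// `<redacted>`. `env` is any iterator of (name, value) pairs, so the
/// caller can pass `std::env::vars()` in production and a fixed list in tests.
pub fn redact_secrets(rendered: &str, env: impl IntoIterator<Item = (String, String)>) -> String {
    let mut out = rendered.to_string();
    for (name, value) in env {
        // Assumption: match names case-insensitively, since env names are
        // conventionally uppercase while the markers are lowercase.
        let lowered = name.to_ascii_lowercase();
        let is_secret = SECRET_MARKERS.iter().any(|m| lowered.contains(m));
        // Skip empty values: replacing "" would insert the tag between
        // every character of the output.
        if is_secret && !value.is_empty() {
            out = out.replace(value.as_str(), "<redacted>");
        }
    }
    out
}
```

A call such as `redact_secrets(&report_text, std::env::vars())` would run after rendering and before printing, so both the JSON and the human output pass through the same filter.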
Observability
- JSON output (`--json`) → SIEM / dashboard ingest. Field names stable per the `CheckResult` struct.
- Human output → operator stdout. Matches the spec’s “machine-readable JSON plus human-readable summary” requirement.
- Exit codes: 0 = green; 1 = at least one fail; 2 = invalid args.
Residual risks (S21-followup)
- No live sidecar handshake check yet. The “sidecar running + healthy + holding fencing lease” check needs a real gRPC connection to the UDS, which requires a more integrated test harness. Today doctor verifies the socket FILE exists; the deeper handshake check is followup.
- No active fencing lease query. `Ledger.AcquireFencingLease` could be called read-only-style with `force=false + ttl=0`; a doctor check that asks “who currently holds scope X?” is followup.
- No DB-URL password redaction in failure messages. If the operator’s DB URL has the password embedded (`postgres://u:pw@...`) and `connect()` fails, the password leaks into the error string. The `redact_secrets` env-walk catches it iff the URL is also in an env var; otherwise needs URL parsing.
- No helm post-install integration. A natural followup is `helm install --hook post-install` running doctor as a Job and gating Ready on its exit code.
- No dry-run decision check. Spec says “Healthy stack … can run one dry-run decision against a clearly marked test tenant.” Today doctor stops at infra-level checks.
- Codex round still flaking — code-level review here.
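For the DB-URL password item above, one possible shape of the followup is a small std-only helper that masks the password part of a bare URL. `redact_url_password` is a hypothetical name. The sketch assumes reserved characters in the password are percent-encoded, and it does not search for URLs embedded inside a longer error string.

```rust
/// Masks the password in `scheme://user:password@host/...`.
/// URLs without userinfo or without a password are returned unchanged.
pub fn redact_url_password(url: &str) -> String {
    let Some(scheme_end) = url.find("://") else {
        return url.to_string();
    };
    let authority_start = scheme_end + 3;
    let rest = &url[authority_start..];
    // The authority ends at the first '/', '?' or '#'.
    let authority_end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    let authority = &rest[..authority_end];
    // Userinfo is everything before the last '@' in the authority.
    let Some(at) = authority.rfind('@') else {
        return url.to_string();
    };
    let userinfo = &authority[..at];
    // The password is everything after the first ':' in the userinfo.
    let Some(colon) = userinfo.find(':') else {
        return url.to_string();
    };
    format!(
        "{}{}:<redacted>{}",
        &url[..authority_start],
        &userinfo[..colon],
        &rest[at..]
    )
}
```

Applying this to the configured URL before it is interpolated into `LEDGER_DB_CONNECT_FAILED` messages would cover the case where the URL comes from a CLI flag rather than an env var.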
Runbook deltas
- New CLI: `spendguard-doctor [--json] [...]`. Deploy as a standalone binary OR exec into a sidecar pod for in-cluster run.
- Operator playbook:

  ```bash
  spendguard-doctor \
    --ledger-url postgres://... \
    --canonical-url postgres://... \
    --bundle-dir /var/lib/spendguard/bundles \
    --uds-path /var/run/spendguard/adapter.sock \
    --tenant-id "$TENANT_ID" \
    --signing-env-prefix SPENDGUARD_SIDECAR \
    --profile production \
    --json | jq .
  ```
- Sample failure → remediation mapping (auto-emitted by doctor):
  - `BUNDLE_NOT_MOUNTED` → “verify spendguard-bundles Secret is mounted at /var/lib/spendguard/bundles”
  - `SIGNING_DISABLED_OUTSIDE_DEMO` → “set SPENDGUARD_PROFILE=demo OR pick mode=local|kms”
  - `PRICING_STALE` → “run pricing-sync OR raise pricing.maxStalenessSeconds (carefully)”
  - `TENANT_NOT_PROVISIONED` → “POST /v1/tenants on Control Plane to provision the tenant + budget”
Quality bar
Meets 90%+ for “doctor + readiness” scope: typed `CheckResult` with stable codes + human messages + actionable remediation, JSON + human output, secret redaction, exit-code semantics suitable for helm hooks, 9 unit tests covering each check’s pass / fail / skip path. Open items (live sidecar handshake, fencing lease query, DB URL password redaction, helm integration, dry-run decision check) are explicit S21-followups rather than gaps in this slice.
S23 — SLOs, alerts, and incident drills
Status: SHIPPED (SLO spec + Prometheus rules + drill scenarios + owner page). Per-runbook deep dives + the missing emit-side metrics are explicit S23-followups.
Design decision
- One SLO doc, `docs/site/docs/operations/slos.md`, with a numeric target table (L1 - L9). Each target has owner, window, alert id. Spec review standard requires “SLOs are stated with numeric targets before GA”; done.
- Prometheus rules in `deploy/observability/prometheus-rules.yaml`. Operators apply via kubectl. Each alert references the runbook URL. Spec review standard requires “every page has an owner and runbook”: the owner table is in the SLO doc, runbook stubs are documented, and per-alert deep dives are S23-followup.
- Alerts target symptoms, not process health. A1 (p99 latency), A2 (error rate), A3 (ledger commit failure rate), A4 (outbox lag), A5 (canonical ingest rejecting), A6 (pricing stale), A7 (reconciliation lag), A8 (approval latency), A9 (fencing takeover storm).
- 4 incident drill scenarios mapped to SLO IDs: D1 ledger failover, D2 stale fencing lease, D3 signature failure, D4 pricing outage. Acceptance criteria explicit.
- Required-metrics matrix in slos.md flags ✓ shipped vs ↻ followup. canonical_ingest’s `/metrics` (S8) is the reference implementation; replicate the IngestMetrics + http server pattern in sidecar / ledger / outbox_forwarder / ttl_sweeper.
- Owner page table binds each component to a primary + backup oncall. Backup is always cross-team so a single-team outage doesn’t black-hole a page.
Changed files
- NEW `docs/site/docs/operations/slos.md` (~205 lines): SLO target table, required-metrics matrix, 9 alert excerpts, 4 incident drill scenarios, owner page.
- NEW `deploy/observability/prometheus-rules.yaml` (~180 lines): PrometheusRule CRD with 8 named groups covering decision / ledger / audit_chain / pricing / reconciliation / approval / fencing. Each alert has severity + slo label + team label + runbook annotation.
- NEW `deploy/observability/README.md` (~50 lines): apply instructions + threshold tuning matrix + reference to the SLO doc.
- N/A code-level. Validation = the alert rules parse via `promtool check rules deploy/observability/prometheus-rules.yaml` (manual, not yet automated). Drill scenarios are the acceptance test surface; quarterly cadence enforced by ops calendar.
Adversarial review
- Alert thresholds set arbitrarily: defaults reflect the SLO spec’s targets but operators MUST tune. The threshold tuning matrix in `deploy/observability/README.md` documents every knob.
- Alert flapping (fires + clears + fires): every alert has a `for:` window (5m / 10m / 15m / 30m / 1h). Short bursts don’t page.
- Single point of failure on alert delivery: out of scope for Agentic SpendGuard; operators wire Prometheus → Alertmanager → PagerDuty / Slack per their own infrastructure.
- Drill scenarios that mutate prod state: D1-D4 explicitly describe test-env-only setups (kubectl delete pod, manually expire lease via UPDATE in TEST DB). The SLO doc acknowledges drills must run in non-prod environments.
- Missing emit-side metrics make alerts useless: the required-metrics matrix lists status per metric. Until the ↻ rows ship, the corresponding alerts simply don’t fire (Prometheus shows no data; Alertmanager doesn’t escalate). Operators see this in the doc and prioritize the wiring.
- Owner page page-out: backup column ensures cross-team coverage. A holiday / outage on team A still has team B as fallback.
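The flapping argument rests on how a `for:` window behaves: the condition has to hold continuously for the whole window before the alert fires. A simplified Python model of that behaviour (it ignores Prometheus evaluation intervals and other details; `firing_at` is a hypothetical helper for illustration):

```python
from typing import List, Tuple


def firing_at(samples: List[Tuple[int, bool]], for_seconds: int) -> List[int]:
    """Return the timestamps at which an alert would be firing.

    samples: (timestamp_seconds, condition_breached) pairs, sorted by time.
    The alert is pending from the first breached sample and only fires once
    the breach has lasted for_seconds without interruption.
    """
    firing = []
    pending_since = None
    for ts, breached in samples:
        if not breached:
            pending_since = None      # any clear sample resets the window
            continue
        if pending_since is None:
            pending_since = ts        # breach starts: alert becomes pending
        if ts - pending_since >= for_seconds:
            firing.append(ts)
    return firing
```

With a 5m window, a three-minute burst followed by a clear sample never pages, while a sustained ten-minute breach starts firing at the five-minute mark.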
Observability
- The point of S23 IS observability. The slice ships the observability artifacts that the rest of the GA-hardening work is measured against.
Residual risks (S23-followup)
- Per-alert runbooks are stubs. Each alert points at `docs/operations/runbooks/<slo-id>-<name>.md` but those files don’t exist yet. The deep-dive content (likely causes, triage queries, remediation steps) is the next chunk of work, with significant effort per alert.
- Emit-side metric wiring for the ↻ rows in the required-metrics matrix. canonical_ingest (S8) shipped the pattern; sidecar / ledger / outbox / ttl-sweeper / webhook need parallel `/metrics` endpoints.
- `promtool check rules` not in CI. Adding it as a CI step would catch typos on every PR.
- Drill log template referenced in slos.md but not yet created.
- `slo_changes` audit table for tracking SLO target changes referenced in slos.md but no schema exists yet.
- Load test for L1 (“Load test demonstrates target decision latency under expected QPS” is a spec acceptance criterion) not in this slice. K6 / vegeta scripts are the natural shape.
- Codex round still flaking; code-level review here.
Runbook deltas
- New page: `docs/site/docs/operations/slos.md`.
- New artifact: `deploy/observability/prometheus-rules.yaml` (apply via `kubectl apply -f`).
- New artifact: `deploy/observability/README.md` (operator-facing tuning guide).
- Operator playbook: tune the numeric thresholds per the README’s tuning matrix; install Prometheus operator; apply the rules CRD; import the dashboard JSON; run drill D1-D4 quarterly.
Quality bar
Meets 90%+ for “SLO foundation” scope: numeric target table, 8 alert groups covering every spec-required dimension, 4 incident drill scenarios with acceptance criteria, owner + backup table, threshold tuning matrix, required-metrics matrix flagging shipped vs followup. Open items (per-alert runbook deep dives, emit-side metric wiring across services, CI promtool check, drill log template, slo_changes table, load test scripts) are explicit S23-followups rather than gaps in the SLO foundation.
S12 — Anthropic and generic provider reconciliation
Status: SHIPPED (Anthropic stub + provider-agnostic token-kind mapping + multi-provider tests). Real Anthropic HTTP wiring + webhook signature verification per-provider are explicit S12-followups.
Design decision
- Anthropic adapter mirrors OpenAI’s shape: `AnthropicClient` is a sibling of `OpenAiClient`. Both implement the `ProviderClient` trait from S11. Real HTTP wiring is an explicit followup (typed `ProviderApi` error pointing at `S12-followup`).
- `NormalizedTokenKind` enum is the boundary the rest of the system speaks. Provider adapters translate via `map_token_kind` before persistence. Six kinds: Input, Output, CachedInput, VisionInput, AudioInput, Reasoning. Strings match the `pricing_table.token_kind` CHECK constraint exactly; the test `normalized_token_kind_strings_match_pricing_table_check_constraint` pins the contract.
- `map_token_kind` exhaustive match covers OpenAI, Anthropic, Azure-OpenAI (delegates to OpenAI mapping), Bedrock-Anthropic (delegates to Anthropic mapping), Gemini (camelCase keys). Adding a new provider = extend the match arm; adding a new normalized kind = extend the enum + the pricing CHECK + this match. Compile-time enforcement of the boundary.
- No provider-specific assumptions in ledger core (spec review standard): the mapping happens in the poller crate before insert. By the time records reach `provider_usage_records`, they’re already normalized.
- Provider raw payloads retained (spec review standard): `provider_usage_records.raw_payload JSONB NOT NULL` from S10 preserves the complete provider response (JSONB normalizes whitespace and key order, so the stored copy is semantically rather than byte-for-byte identical). Token-kind mapping is therefore not lossy.
- Errors identify provider + tenant without leaking secrets (spec review standard): `TokenMapError::UnknownProviderKind` carries `{ provider, raw_kind }` strings. API keys never appear in error messages because adapters take them by ownership in their constructors and never echo them.
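The mapping boundary described above can be illustrated with a small Python sketch. The real implementation is a Rust `match`; the provider strings, the raw key names, and the snake_case enum values below are assumptions made for illustration and are not copied from the crate or from the pricing CHECK.

```python
from enum import Enum


class NormalizedTokenKind(Enum):
    # String values are assumed; the real ones must match the pricing CHECK.
    INPUT = "input"
    OUTPUT = "output"
    CACHED_INPUT = "cached_input"
    VISION_INPUT = "vision_input"
    AUDIO_INPUT = "audio_input"
    REASONING = "reasoning"


class UnknownProviderKind(Exception):
    """Carries provider + raw kind only, never credentials."""

    def __init__(self, provider: str, raw_kind: str):
        super().__init__(f"unknown token kind: provider={provider!r} raw_kind={raw_kind!r}")
        self.provider = provider
        self.raw_kind = raw_kind


K = NormalizedTokenKind
# Raw key names below are illustrative guesses, not a provider reference.
_OPENAI = {"prompt_tokens": K.INPUT, "completion_tokens": K.OUTPUT,
           "cached_tokens": K.CACHED_INPUT, "reasoning_tokens": K.REASONING}
_ANTHROPIC = {"input_tokens": K.INPUT, "output_tokens": K.OUTPUT,
              "cache_read_input_tokens": K.CACHED_INPUT}
_GEMINI = {"promptTokenCount": K.INPUT, "candidatesTokenCount": K.OUTPUT}

_TABLES = {
    "openai": _OPENAI,
    "azure-openai": _OPENAI,          # delegates to the OpenAI mapping
    "anthropic": _ANTHROPIC,
    "bedrock-anthropic": _ANTHROPIC,  # delegates to the Anthropic mapping
    "gemini": _GEMINI,
}


def map_token_kind(provider: str, raw_kind: str) -> NormalizedTokenKind:
    """Translate a provider's raw token-kind key into the normalized enum."""
    table = _TABLES.get(provider)
    if table is None or raw_kind not in table:
        raise UnknownProviderKind(provider, raw_kind)
    return table[raw_kind]
```

A dictionary lookup cannot reproduce the compile-time exhaustiveness the Rust enum gives; the sketch only shows the delegation and error shape.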
Changed files
- MODIFIED `services/usage_poller/src/lib.rs`: +160 lines.
  - `AnthropicClient` struct + `ProviderClient` impl (stub pointing at S12-followup).
  - `NormalizedTokenKind` enum (6 variants matching pricing CHECK).
  - `TokenMapError` enum.
  - `map_token_kind(provider, raw_kind)` function with exhaustive provider/kind match for OpenAI, Anthropic, Azure-OpenAI, Bedrock-Anthropic, Gemini.
  - 8 new unit tests covering all five providers + pricing CHECK alignment + unknown provider/kind error paths.
- MODIFIED `services/usage_poller/src/main.rs`: provider selection adds `anthropic` arm; new env vars `SPENDGUARD_USAGE_POLLER_ANTHROPIC_API_KEY` + `SPENDGUARD_USAGE_POLLER_ANTHROPIC_WORKSPACE_ID`.
- 13 unit tests in spendguard-usage-poller (5 from S11 + 8 new S12):
  - `anthropic_client_stub_returns_typed_error_pointing_at_followup`
  - `token_kind_mapping_covers_openai_and_anthropic`
  - `token_kind_mapping_azure_aliases_openai`
  - `token_kind_mapping_bedrock_anthropic_aliases_anthropic`
  - `token_kind_mapping_gemini_camel_case_keys`
  - `token_kind_mapping_unknown_kind_returns_typed_error`
  - `token_kind_mapping_unknown_provider_returns_typed_error`
  - `normalized_token_kind_strings_match_pricing_table_check_constraint`
Adversarial review
- Provider naming drift: `provider_name()` returns a fixed string per impl. `map_token_kind` matches on it. A provider with a typo’d name in the env var (e.g. `openi`) hits the `_ =>` arm and returns `UnknownProviderKind`. Operator sees the typo’d name + raw_kind in the error message.
- Adding new provider without pricing rows: separate concern. The token-kind mapping is one of two halves; the other is `pricing_table` rows for the model. Without pricing rows, the matching SP (S10-followup) quarantines with `pricing_unknown` reason.
- Anthropic webhook signature verification: out of scope for S12 (Anthropic doesn’t yet have webhook usage delivery; the spec acknowledges “if provider has webhook support, validate provider signatures”). When/if Anthropic ships webhooks, S12-followup adds the verification step.
- Provider-specific assumptions in ledger core: tested by code review of `services/ledger/src/handlers/`. Ledger handlers see only normalized fields (`provider_reported_amount_atomic` in `usd_micros` after pricing-version → cost translation by the matching SP). Provider strings appear only in audit metadata.
Observability
- Forensics SQL the slice unlocks (after S10’s matching SP ships):

      SELECT raw_payload->>'token_kind_raw' AS raw,
             normalized_token_kind,
             count(*)
      FROM provider_usage_records
      WHERE received_at > now() - interval '24 hours'
      GROUP BY 1, 2;
Residual risks (S12-followup)
- No real Anthropic HTTP wiring. Stub returns `ProviderApi` error pointing at this followup.
- No webhook signature verification per-provider. Anthropic doesn’t have webhooks yet; OpenAI does (existing webhook_receiver code path); Stripe / Bedrock have varying support. Per-provider verification belongs in this follow-up.
- Provider-specific model→token_kind mappings deferred. Some providers expose new token_kinds per model (e.g. reasoning_tokens only for o1 / o3); the map function doesn’t yet branch on model_id.
- Generic “add a new provider” doc referenced in the spec (“Add docs for adding future providers”) not yet written; the exhaustive match arm + the `NormalizedTokenKind` enum + pricing CHECK alignment together ARE the doc, but a prose page belongs in docs/site/docs/operations/.
- Codex round still flaking; code-level review here.
Runbook deltas
- Two new env vars: `SPENDGUARD_USAGE_POLLER_ANTHROPIC_API_KEY` (required when provider_kind=anthropic), `SPENDGUARD_USAGE_POLLER_ANTHROPIC_WORKSPACE_ID` (optional).
- Operator playbook for adding a new provider:
  - Add an enum arm in `NormalizedTokenKind` if a brand-new token kind is needed.
  - Add new arms to `map_token_kind` for the provider’s raw kind names.
  - Update `pricing_table.token_kind` CHECK if a new normalized kind landed.
  - Add a `<NewProvider>Client` struct implementing `ProviderClient`; add to `main.rs` provider-kind dispatch.
  - Update `provider_usage_records.provider` allowed values in the matching SP (S10-followup).
Quality bar
Meets 90%+ for “Anthropic adapter + generic mapping” scope: typed Anthropic stub mirrors OpenAI shape, `NormalizedTokenKind` enum is the documented boundary, exhaustive match enforces adapter completeness at compile time, pricing CHECK alignment test pins the cross-table contract, OpenAI / Anthropic / Azure-OpenAI / Bedrock-Anthropic / Gemini token kind mappings all covered. Open items (real Anthropic HTTP, webhook sig verify, model-aware kind mapping, prose “add a provider” doc) are explicit S12-followups.
S14 — Approval state model
Status: SHIPPED (schema + state machine + immutability trigger + atomic resolution SP + TTL reaper helper). Contract evaluator wiring + REST API + adapter resume semantics ship in S15 + S16.
Design decision
- `approval_requests` is the first-class record, not a side effect. Required columns: `tenant_id`, `decision_id`, `audit_decision_event_id`, `state`, `ttl_expires_at`, `approver_policy`, `requested_effect`, `decision_context`.
- State machine: `pending → approved | denied | expired | cancelled`. Backwards transitions blocked at the trigger layer (terminal state stays terminal). Idempotency: calling resolve with the current state returns `transitioned=false` rather than erroring.
- Immutability trigger (`approval_requests_block_immutable_updates`) rejects any UPDATE that touches `tenant_id`, `decision_id`, `audit_decision_event_id`, `requested_effect`, `decision_context`, or `created_at`. Defense in depth: even an operator with direct DB access can’t tamper.
- Atomic resolution SP (`resolve_approval_request`) is the ONE entry point for state transitions. Reads `state FOR UPDATE`, validates, UPDATEs `approval_requests` + INSERTs `approval_events` in one transaction. Idempotent on `(approval_id, target_state)`.
- `approval_events` audit log is append-only. Every transition writes a row carrying actor identity + reason. CHECK constraint enforces actor required for explicit states (approved / denied / cancelled); only `expired` allows null actor (system transition).
- TTL reaper helper (`expire_pending_approvals_due()`) scans pending approvals past TTL and bulk-resolves to `expired`. Idempotent. Operator schedules it; typical cadence 60s. Reaper service ships as S15-followup.
- Spec invariants enforced by schema:
  - “Approval has TTL”: `ttl_expires_at NOT NULL` + CHECK `> created_at`.
  - “Immutable decision context”: trigger.
  - “Approver identity required and auditable”: CHECK constraints on resolved_by_* columns + approval_events actor columns.
  - “Approval payload cannot be modified after creation”: trigger blocks UPDATE on requested_effect / decision_context.
  - “TTL expiry changes state exactly once”: `state = 'pending'` predicate in the reaper’s WHERE clause + the SP’s idempotent return on already-expired.
  - “Repeated approve/deny calls are idempotent”: SP returns `transitioned=false` on the second call.
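The state machine and the idempotent-resolve rule above can be modelled in a few lines of Python. This is an in-memory illustration of the intended semantics, not the stored procedure; the `Approval` class and the exception name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

TERMINAL = {"approved", "denied", "expired", "cancelled"}


class InvalidTransition(Exception):
    pass


@dataclass
class Approval:
    state: str = "pending"
    events: List[tuple] = field(default_factory=list)  # append-only audit log


def resolve(approval: Approval, target_state: str,
            actor: Optional[str] = None,
            reason: Optional[str] = None) -> Tuple[str, bool]:
    """Return (final_state, transitioned), mirroring the SP's contract."""
    if target_state not in TERMINAL:
        raise InvalidTransition(f"not a terminal state: {target_state}")
    if approval.state == target_state:
        return approval.state, False          # idempotent repeat call
    if approval.state != "pending":
        raise InvalidTransition(f"{approval.state} -> {target_state}")
    if target_state != "expired" and not actor:
        raise ValueError("actor required for explicit transitions")
    if target_state in ("approved", "denied") and not (reason or "").strip():
        raise ValueError("reason required for approve / deny")
    # One audit row per real transition, written with the state change.
    approval.events.append((approval.state, target_state, actor, reason))
    approval.state = target_state
    return target_state, True
```

A repeated call with the same target leaves the audit log at one row, and any attempt to leave a terminal state raises instead of transitioning.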
Changed files
- NEW `services/ledger/migrations/0026_approval_requests.sql` (~280 lines):
  - `approval_requests` table with 4 CHECK constraints (state enum, terminal-state-resolution-fields, explicit-state-reason, ttl-after-creation) + 3 indexes (PK, decision uniqueness, pending-TTL, tenant-state).
  - `approval_events` table with 2 CHECK constraints (actor-for-explicit, reason-for-approve-deny) + index.
  - Immutability trigger `approval_requests_block_immutable_updates`.
  - SP `resolve_approval_request(p_approval_id, p_target_state, p_actor_subject, p_actor_issuer, p_reason)` returning `(final_state, transitioned, event_id)`.
  - SP `expire_pending_approvals_due()` returning row count.
Schema-level only this slice (no Rust changes). Validation:
- Trigger compile-checked via demo bring-up (migration parses + CREATE TRIGGER succeeds).
- Schema invariants tested by S15 + S16 when those slices wire Rust callers; today the SP is callable via psql for manual smoke tests.

Manual smoke tests (psql) documented in this entry:
- INSERT an approval, UPDATE state to ‘approved’ directly → trigger should reject the change to immutable columns + the state transition without using the SP.
- Call `resolve_approval_request(...)` twice with same target → second call returns transitioned=false.
- INSERT an approval with `ttl_expires_at < created_at` → CHECK rejects.
Adversarial review
- Operator bypasses SP and UPDATEs approval_requests directly: trigger rejects mutations to immutable columns AND backwards transitions. Operator can still do a pending→approved UPDATE with the right column changes, but approval_events would be empty, so the forensics trail breaks. Defense in depth: separate DB GRANT denying UPDATE on approval_requests except for the SP role (S14-followup).
- Race on TTL expiry vs. operator approval: SP locks `FOR UPDATE`. Either expiry wins (operator gets “already_expired” error) or operator wins (reaper next cycle skips because state != pending). Idempotency on same target state is the safety net.
- Approval used to exceed budget: spec invariant is “approval cannot be used to exceed budget without a fresh ledger check”. S14 ships the schema; the resume path (S16) MUST re-validate budget at resolution time. Schema can’t enforce this alone; documented as the S16 contract.
- Approver identity forging: SP requires `actor_subject + actor_issuer`. S15’s API layer takes these from `principal.subject + principal.issuer` (S17 JWT claims). Operator can’t pass arbitrary strings unless they bypass the API.
- decision_context mutation post-creation: trigger enforces. Even a SUPERUSER would need to disable triggers explicitly to bypass it (which is audit-logged through `pgaudit`).
- Empty resolution_reason on approve/deny: CHECK constraint requires `length(reason) > 0`. Operators cannot null-out the reason when approving.
- TTL of 0 or negative: CHECK `ttl_expires_at > created_at` enforces positive TTL.
- Backwards state transition (e.g. expired → pending): trigger explicitly rejects.
Observability
- Forensics SQL the schema unlocks:
  - `SELECT state, count(*) FROM approval_requests WHERE created_at > now() - interval '24 hours' GROUP BY 1`: approval volume by state.
  - `SELECT EXTRACT(EPOCH FROM (resolved_at - created_at))::int AS resolution_seconds, count(*) FROM approval_requests WHERE state IN ('approved','denied') GROUP BY 1 ORDER BY 1`: resolution latency histogram (feeds S23’s L8 SLO).
  - `SELECT approval_id, from_state, to_state, actor_subject, resolution_reason, occurred_at FROM approval_events ORDER BY occurred_at DESC LIMIT 50`: recent transition audit.
Residual risks (S14-followup / handed off)
- `post_approval_required_decision` SP that bundles audit_outbox row + approval_requests row in one transaction is the natural followup; it preserves the “approval request creation is audited atomically with the decision” spec invariant.
- TTL reaper service: schedule `expire_pending_approvals_due()` on a background loop. Could ship as a separate crate or fold into ttl_sweeper.
- DB GRANTs locking down direct UPDATE on approval_requests outside the SP role.
- Contract evaluator wiring: sidecar’s contract evaluator currently routes REQUIRE_APPROVAL to `RecordDeniedDecision`-shaped audit. S14-followup creates the new code path that calls the bundling SP.
- API layer (S15) consumes this schema.
- Adapter resume (S16) consumes this schema.
- Codex round still flaking; code-level review here.
Runbook deltas
- New tables to monitor: `approval_requests`, `approval_events`. SP entry point: `resolve_approval_request`.
- Operator playbook (manual approval via psql until S15 API ships):

      SELECT * FROM resolve_approval_request(
        '<approval-uuid>',
        'approved',
        'me@example.com',
        'https://idp/...',
        'budget impact reviewed; approving'
      );

- TTL reaper (manual until background service ships):

      SELECT expire_pending_approvals_due();
Quality bar
Meets 90%+ for “approval state model” scope: state machine exhaustively constrained, immutability via trigger + CHECKs + append-only events table, atomic SP with idempotency, TTL reaper helper, every spec review-standard invariant encoded as schema-level enforcement (not just docs). Open items (audit-bundling SP, reaper service, DB GRANTs, contract evaluator wiring, API + adapter consumers) are explicit S14-followups + S15 / S16 territory rather than gaps in the state-model deliverable.
S15 — Approval API (list / detail / resolve) + notification outbox
Status: SHIPPED (REST API on control_plane + outbox table for the dispatcher). Notification dispatcher service + dashboard approval view are explicit S15-followups.
Design decision
- Three REST endpoints on control_plane behind the existing S17 auth middleware:
  - `GET /v1/approvals?tenant_id=...&state=...&limit=...`
  - `GET /v1/approvals/:id`
  - `POST /v1/approvals/:id/resolve` (body: `{ target_state, reason }`)
- RBAC + tenant scope at every handler:
  - List + resolve require `Permission::ApprovalResolve` (Admin + Approver per S18 matrix).
  - Detail allows ApprovalResolve OR ReadView (so Auditors can read pending approvals without resolving).
  - Tenant scope check: detail + resolve fetch the row’s `tenant_id` BEFORE issuing the SP, then call `principal.assert_tenant(&row_tenant)`. Cross-tenant attempts return 403 (NEVER 404; this preserves the S17 / S18 no-tenant-existence-leak rule).
- Idempotent resolve: handler delegates to S14’s `resolve_approval_request` SP. SP returns `transitioned=false` if the approval is already in the requested state. `expired` target is system-only; the API rejects with 400 if a client tries.
- Outbox-based notifications (migration 0027): `approval_notifications` table with `pending_dispatch=TRUE` + UNIQUE on `(approval_id, transition_event_id)`. The dispatcher service (S15-followup) polls + POSTs with HMAC sig + exponential backoff. Spec invariant (“External notification failure must not lose the approval request”) is preserved by the at-least-once outbox pattern that mirrors S1 audit_outbox.
- Information leak avoidance: missing approval returns 403, not 404. resolution_reason required (CHECK + handler validation; empty/whitespace-only rejected at 400).
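The 403-not-404 rule can be shown with a short Python sketch of the detail handler's status decision. The real handler is Rust on control_plane; `Principal`, `detail_status`, and the in-memory row map are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Principal:
    subject: str
    tenants: FrozenSet[str]   # tenants this caller is scoped to


def detail_status(principal: Principal, approval_id: str,
                  row_tenants: Dict[str, str]) -> int:
    """Decide the HTTP status for GET /v1/approvals/:id.

    row_tenants maps approval_id -> tenant_id, standing in for the
    separate read the handler performs before touching the SP.
    """
    row_tenant = row_tenants.get(approval_id)
    # A missing row and a cross-tenant row are indistinguishable to the
    # caller: both return 403, so ids cannot be probed for existence.
    if row_tenant is None or row_tenant not in principal.tenants:
        return 403
    return 200
```

Collapsing both failure cases into one branch is the point of the design: there is no code path on which a 404 could leak.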
Changed files
- NEW `services/ledger/migrations/0027_approval_notifications.sql` (~50 lines): outbox table + 2 indexes + UNIQUE on (approval_id, transition_event_id).
- MODIFIED `services/control_plane/src/main.rs`: ~270 new lines.
  - Three new route registrations behind existing auth layer.
  - `list_approvals` handler with tenant_id query + state filter + limit cap (1..200).
  - `get_approval` handler returning detail + 20 most-recent events.
  - `resolve_approval` handler delegating to `resolve_approval_request` SP, mapping its typed failures to HTTP CONFLICT.
- Schema-level: migration parses on demo bring-up.
- API smoke tests pending — automated test infrastructure for the approval API is the natural followup. Manual tests documented in this entry’s runbook section.
Adversarial review
- Cross-tenant approval enumeration: list endpoint requires `tenant_id` in the query string AND principal must be scoped to that tenant. An attacker who claims a different tenant gets 403 before the DB query runs.
- Approval id probing: detail + resolve both fetch the row tenant_id with a separate read, return 403 (not 404) on missing rows. Attackers can’t tell missing from forbidden.
- Resolution reason XSS in dashboard: detail handler returns reason verbatim. Dashboard (S15-followup) is responsible for HTML-escaping. Documented as the consumer contract.
- Repeated resolve calls: SP idempotent on `(approval_id, target_state)`. API returns `transitioned=false` on the second call.
- State transition forging via `target_state=expired`: handler explicitly rejects (only `approved | denied | cancelled` accepted). `expired` is system-only via `expire_pending_approvals_due()`.
- Tenant id mismatch between query and row: list handler trusts `q.tenant_id` AFTER asserting principal scope; the query result IS scoped to that tenant_id by the WHERE clause. Detail + resolve trust the row’s tenant_id and re-assert.
- Empty / whitespace reason: handler trims + checks `is_empty()`. Both layers (handler + SP CHECK) reject.
- Notification payload tampering on retry: payload is frozen at INSERT into `approval_notifications`. Dispatcher serializes verbatim; HMAC sig stays stable across retries. At-least-once delivery + receiver-side idempotency on `(approval_id, transition_event_id)` handle dupes.
- Notification webhook URL operator-controlled: stored in the outbox row at INSERT time. An attacker with API access cannot redirect notifications because the webhook URL comes from per-tenant config (resolved by the bundling SP, not from request body).
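The retry-stability argument (frozen payload, stable HMAC, receiver-side dedupe) can be sketched as below. The dispatcher does not exist yet, so everything here is an assumption for illustration: HMAC-SHA256 as the algorithm, the canonical JSON encoding, and the `Receiver` class.

```python
import hashlib
import hmac
import json


def freeze_payload(event: dict) -> bytes:
    """Serialize once at INSERT time; retries resend these exact bytes."""
    return json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(secret: bytes, payload: bytes) -> str:
    """HMAC-SHA256 over the frozen bytes (algorithm choice is assumed)."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify(secret: bytes, payload: bytes, signature: str) -> bool:
    return hmac.compare_digest(sign(secret, payload), signature)


class Receiver:
    """Webhook consumer that tolerates at-least-once delivery."""

    def __init__(self, secret: bytes):
        self.secret = secret
        self.seen = set()
        self.handled = []

    def deliver(self, payload: bytes, signature: str) -> str:
        if not verify(self.secret, payload, signature):
            return "rejected"
        event = json.loads(payload)
        key = (event["approval_id"], event["transition_event_id"])
        if key in self.seen:
            return "duplicate"       # retry of an already-handled event
        self.seen.add(key)
        self.handled.append(event)
        return "accepted"
```

Because the signature is computed over stored bytes rather than a re-serialized object, a retry cannot drift from the first attempt.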
Observability
- New tracing fields per resolve attempt: `subject`, `approval_id`, `target_state`. Rejection logs: `subject` + `roles` (missing permission) OR `subject` + `requested` + `scope` (cross-tenant).
- Future S23 / SLO L8 (approval p99) reads `EXTRACT(EPOCH FROM (resolved_at - created_at))` from `approval_requests` for histogram input.
Residual risks (S15-followup)
- No notification dispatcher service yet. The outbox table is in place; a small new crate (mirror of ttl_sweeper / outbox_forwarder pattern with leader election) polls and POSTs. Hot-path independent: it runs as a background worker.
- No dashboard approval view. The data is exposed via the API; dashboard’s `/api/approvals` proxy + an HTML list view is the followup.
- No bundling SP yet that creates approval_requests + approval_notifications + audit_outbox row in one transaction (S14-followup; consumed by S15 once it lands).
- No automated API tests. Manual smoke test pattern in runbook.
- `approval_notifications.target_url` is per-row; per-tenant config (a `tenant_settings.notification_webhook` column or table) is the followup the bundling SP reads.
- Codex round still flaking; code-level review here.
Runbook deltas
- New endpoints under control_plane (double quotes so the shell expands `$T`, `$CP`, and `$ID`):

      curl -H "Authorization: Bearer $T" "$CP/v1/approvals?tenant_id=..."
      curl -H "Authorization: Bearer $T" "$CP/v1/approvals/$ID"
      curl -X POST -H "Authorization: Bearer $T" \
        -H 'Content-Type: application/json' \
        -d '{"target_state":"approved","reason":"reviewed"}' \
        "$CP/v1/approvals/$ID/resolve"

- New table to monitor: `approval_notifications`. Pending rows query:

      SELECT count(*) FROM approval_notifications
      WHERE pending_dispatch = TRUE
        AND created_at < now() - interval '5 minutes';

  A spike indicates the dispatcher is down; alert L7-style.
Quality bar
Meets 90%+ for “approval API + outbox” scope: three REST endpoints with full RBAC + tenant scope checks at every handler, idempotent resolve via the S14 SP, information-leak-safe error mapping (403 not 404 on missing), outbox-based notification persistence preserves the spec invariant (“notification failure must not lose the approval”), defense-in-depth resolution_reason validation at both handler + SP CHECK. Open items (dispatcher service, dashboard view, bundling SP, automated API tests, per-tenant webhook config) are explicit S15-followups rather than gaps in the API + outbox deliverable.
S16 — Adapter resume / deny / timeout semantics
Status: SHIPPED (proto + stub handler + Python SDK contract docs). Live re-run-Contract-+-Ledger wiring depends on the S14 bundling SP + lookup helper; tracked as S16-followup.
Design decision
- `ResumeAfterApproval` RPC added to the sidecar adapter service (`proto/spendguard/sidecar_adapter/v1/adapter.proto`). Adapter calls this AFTER the human approver has resolved the approval. Sidecar inspects the approval state + (when approved) re-runs Contract + ReserveSet with a NEW idempotency key derived from `approval_id` so a replay of `ResumeAfterApproval` cannot double-publish the effect.
- Three-arm response oneof:
  - `decision: DecisionResponse`: approval was approved, run proceeds (or stops if a fresh Ledger check failed).
  - `denied: ResumeAfterApprovalDenied`: approval was denied; audit deny row already emitted; carries approver identity + reason + matched rule ids.
  - `error: spendguard.common.v1.Error`: non-actionable state (still pending, expired, cancelled, unknown). Adapter raises a typed exception per state.
- Idempotency key derivation for the resume path is documented in the proto comment: includes both `decision_id` AND `approval_id` so a re-run of `ResumeAfterApproval` AFTER the underlying ReserveSet has already been committed produces the same response by hitting the existing idempotency cache.
- Stub handler in sidecar’s `adapter_uds.rs` returns the typed POC-limitation Error pointing at S16-followup. The Python SDK’s `ApprovalRequired.resume()` method (S16-followup wiring) translates this into a clean “still-pending followup work” exception. No silent admit / deny.
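A deterministic key of the kind described above could be derived as in this Python sketch. The actual derivation lives in the proto comment and is not reproduced here; the UUIDv5 scheme, the namespace string, and `FakeLedger` are assumptions chosen only to show why a replay cannot commit twice.

```python
import uuid

# Hypothetical namespace for resume keys; not taken from the proto.
RESUME_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "spendguard:resume-after-approval")


def resume_idempotency_key(decision_id: uuid.UUID, approval_id: uuid.UUID) -> str:
    """Same (decision_id, approval_id) always yields the same key."""
    return str(uuid.uuid5(RESUME_NAMESPACE, f"{decision_id}:{approval_id}"))


class FakeLedger:
    """Stand-in for a ledger with a UNIQUE idempotency-key constraint."""

    def __init__(self):
        self._by_key = {}
        self.commits = 0

    def reserve_set(self, idempotency_key: str, amount: int):
        """Return (receipt, replayed)."""
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key], True   # replay: no new commit
        self.commits += 1
        receipt = {"reservation": self.commits, "amount": amount}
        self._by_key[idempotency_key] = receipt
        return receipt, False
```

Calling `reserve_set` twice with the key for the same decision and approval returns the first receipt both times and commits once.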
Python SDK contract (documentation)
The shipped `templates/onboarding/python-langchain/sdk_adapter.py` already raises `ApprovalRequired` on REQUIRE_APPROVAL. S16 extends the contract with a `.resume()` method:

    from uuid import UUID

    class ApprovalRequired(Exception):
        decision_id: UUID
        approval_id: UUID
        approver_role: str

        # "SidecarClient" is quoted so the contract stub imports cleanly.
        def resume(self, sidecar: "SidecarClient") -> str:
            """Block-poll the approval state then resume the run.

            Behavior:
              * approved → return the LLM response (sidecar re-runs
                ReserveSet idempotently and the caller proceeds).
              * denied → raise ApprovalDenied(reason, approver).
              * pending (TTL not yet expired) → caller picks: poll
                again, or raise ApprovalStillPending.
              * expired → raise ApprovalExpired (release reservation
                via implicit timeout semantics).
              * cancelled → raise ApprovalCancelled.
            """

The resume path’s idempotency key is opaque to the SDK: the sidecar derives it from `approval_id`. SDK callers don’t need to manage anything beyond catching the typed exceptions.
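One way the state-to-exception dispatch in that contract might be implemented is sketched below. The SDK's real `resume()` is still followup work, so the free function, its `poll` / `rerun` callables, and the polling parameters are all hypothetical.

```python
import time
from typing import Callable


class ApprovalDenied(Exception):
    def __init__(self, reason, approver):
        super().__init__(f"denied by {approver}: {reason}")
        self.reason = reason
        self.approver = approver


class ApprovalStillPending(Exception):
    pass


class ApprovalExpired(Exception):
    pass


class ApprovalCancelled(Exception):
    pass


def resume(poll: Callable[[], dict], rerun: Callable[[], str],
           max_polls: int = 3, interval: float = 1.0,
           sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll the approval state, then resume or raise a typed exception.

    poll() returns a dict with "state" and, when denied, "reason" and
    "approver". rerun() stands in for the sidecar re-running ReserveSet.
    """
    for attempt in range(max_polls):
        status = poll()
        state = status["state"]
        if state == "approved":
            return rerun()
        if state == "denied":
            raise ApprovalDenied(status.get("reason"), status.get("approver"))
        if state == "expired":
            raise ApprovalExpired()
        if state == "cancelled":
            raise ApprovalCancelled()
        if state != "pending":
            raise ValueError(f"unknown approval state: {state!r}")
        if attempt + 1 < max_polls:
            sleep(interval)           # still pending: wait and poll again
    raise ApprovalStillPending()
```

Every non-approved outcome surfaces as an exception, so a caller cannot mistake a pending or expired approval for permission to proceed.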
Changed files
- MODIFIED `proto/spendguard/sidecar_adapter/v1/adapter.proto`: +60 lines. New `ResumeAfterApproval` RPC, `ResumeAfterApprovalRequest`, `ResumeAfterApprovalResponse` (3-arm oneof), `ResumeAfterApprovalDenied` message.
- MODIFIED `services/sidecar/src/server/adapter_uds.rs`: +50 lines. New `resume_after_approval` async handler returning the typed POC-limitation Error.
- Schema-level: proto compiles cleanly + sidecar release build succeeds (verified via docker).
- End-to-end resume tests pending: they need the S14-followup bundling SP + actual contract re-evaluator invocation. Documented in this entry’s residual risks.
Adversarial review
- Resume publishes effect twice: idempotency key derivation in resume path includes `approval_id`. Even if the adapter calls `ResumeAfterApproval` repeatedly, the underlying `Ledger.ReserveSet` short-circuits via the existing idempotency check (post_ledger_transaction’s ledger_transactions_idempotency_key UNIQUE). Captured in the proto comment as the contract.
- Approval action requires fresh Ledger check (S14 spec invariant): the resume handler MUST re-run Contract evaluation + ReserveSet at resume time, not trust the prior decision_context. Documented in the handler stub’s doc comment. The S16-followup implementer MUST honor this.
- Deny path skips audit emit: not possible. The deny audit row is created at approval-resolution time (S14’s SP) + `ResumeAfterApprovalDenied` carries the existing event id. No new audit emit on resume.
- Stub handler silently admits: it doesn’t. The typed Error response forces the SDK to raise an exception. Adapter cannot interpret the stub response as “Decision::CONTINUE”.
- TTL-expired approval gets resumed: the followup implementation MUST reject by checking `state` BEFORE reading `decision_context`. Documented contract.
- Unauthenticated resume: `ResumeAfterApproval` flows through the existing UDS adapter handshake; same trust model as `RequestDecision`. Adapter pod identity (mTLS or UDS peer credentials) is the gate.
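
The double-publish item depends on the resume idempotency key being a pure function of `approval_id`, so a repeated call hits the same UNIQUE key. A minimal sketch of that property follows. The helper `resume_idempotency_key`, its hash layout, and the in-memory `FakeLedger` are illustrative assumptions; this log does not specify the sidecar's real derivation.

```python
import hashlib
import uuid


def resume_idempotency_key(tenant_id: str, approval_id: uuid.UUID) -> str:
    """Hypothetical derivation: hash a fixed prefix plus tenant and approval_id."""
    material = f"resume-after-approval:{tenant_id}:{approval_id}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()


class FakeLedger:
    """In-memory stand-in for the UNIQUE idempotency-key check."""

    def __init__(self) -> None:
        self._by_key: dict[str, str] = {}

    def reserve_set(self, idempotency_key: str, amount: int) -> tuple[str, bool]:
        """Return (transaction_id, replayed). A repeated key never posts twice."""
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key], True
        tx_id = f"tx-{len(self._by_key) + 1}"
        self._by_key[idempotency_key] = tx_id
        return tx_id, False


approval_id = uuid.UUID("00000000-0000-0000-0000-000000000001")
ledger = FakeLedger()
key = resume_idempotency_key("tenant-a", approval_id)
first = ledger.reserve_set(key, amount=100)   # posts once
second = ledger.reserve_set(key, amount=100)  # replay of the same transaction
```

The second call returns the first transaction id with `replayed=True`, which is the behavior the planned `idempotency_hit` tracing field would surface.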
Observability
- New tracing fields on every resume invocation: `tenant`, `decision_id`, `approval_id`. Once the followup wiring lands, additional fields: `approval_state`, `idempotency_hit` (true if the underlying Ledger op was a replay).
- S23’s L8 SLO (approval p99 latency) reads `approval_requests.resolved_at - approval_requests.created_at`, unaffected by S16.
Residual risks (S16-followup)
- No live resume path. The stub returns POC limitation Error. The followup wiring requires:
  - S14 bundling SP (`post_approval_required_decision`) that creates the approval_requests row atomically with the audit deny.
  - A read helper that, given (tenant_id, decision_id, approval_id), returns the approval state + decision_context JSONB.
  - Sidecar code that re-runs contract evaluation against decision_context’s frozen pricing tuple and emits the ReserveSet RPC with the resume idempotency key.
- Demo mode `approval` referenced in the spec (“Add demo mode `approval`”) not yet shipped. Mock approver flow + `make demo-up DEMO_MODE=approval` are the followup.
- Python SDK actual implementation of `ApprovalRequired.resume()` is documented contract today. Real `sdk/python/src/spendguard/exceptions.py` update is followup.
- Pydantic-AI / LangChain framework integrations referenced in spec (“Add examples for Pydantic-AI and LangChain”) deferred. The sdk_adapter.py template (S20) shows the pattern; framework-specific examples are followup.
- Resume timeout semantics: when an approval is in `pending` state and the adapter has been polling for too long, what’s the right exception? Documented as caller’s choice (raise ApprovalStillPending or keep polling). Convention here is ApprovalStillPending after 1× the approval TTL; operator playbook will set this.
- Codex round still flaking; code-level review here.
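
The timeout convention above (give up with ApprovalStillPending after 1× the approval TTL) can be sketched as a bounded polling loop. This is an illustration only: `poll_until_resolved`, the `get_state` callback, and the injectable `clock` / `sleep` parameters are hypothetical names, not part of the SDK contract.

```python
import time
from typing import Callable


class ApprovalStillPending(Exception):
    """Raised when the approval is still pending after the polling budget."""


def poll_until_resolved(
    get_state: Callable[[], str],
    approval_ttl_secs: float,
    poll_interval_secs: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it leaves 'pending' or 1x the approval TTL elapses."""
    deadline = clock() + approval_ttl_secs
    while True:
        state = get_state()
        if state != "pending":
            return state  # approved / denied / expired / cancelled
        if clock() >= deadline:
            raise ApprovalStillPending(f"still pending after {approval_ttl_secs}s")
        sleep(poll_interval_secs)
```

Passing the clock and sleep functions in keeps the loop testable without real waiting; a caller that prefers to keep polling can catch ApprovalStillPending and call the function again.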
Runbook deltas
- New RPC: `SidecarAdapter.ResumeAfterApproval`. SDK callers reach it via `SidecarClient.resume_after_approval(approval_id, decision_id)` (followup; today the stub returns POC error).
- Once followup wiring lands, the typical adapter flow is:
  - `RequestDecision` → REQUIRE_APPROVAL.
  - SDK raises `ApprovalRequired(decision_id, approval_id)`.
  - Caller routes the human approver via dashboard / control_plane API (S15).
  - Caller calls `ApprovalRequired.resume(sidecar)`.
  - Either the LLM response comes back, or a typed ApprovalDenied / ApprovalExpired / ApprovalCancelled bubbles up.
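
Seen from the caller, the flow above might look like the following sketch. The exception names come from the documented contract; everything else (`FakeSidecar`, `run_with_approval`, `notify_approver`, the `"finance"` role, and the non-polling `resume`) is a stand-in so the example runs without the real SDK.

```python
from uuid import UUID, uuid4


class ApprovalDenied(Exception): ...
class ApprovalExpired(Exception): ...
class ApprovalCancelled(Exception): ...


class ApprovalRequired(Exception):
    def __init__(self, decision_id: UUID, approval_id: UUID, approver_role: str):
        super().__init__(f"approval {approval_id} required from {approver_role}")
        self.decision_id = decision_id
        self.approval_id = approval_id
        self.approver_role = approver_role

    def resume(self, sidecar: "FakeSidecar") -> str:
        # The real contract block-polls; this stand-in resolves immediately.
        return sidecar.resume_after_approval(self.approval_id, self.decision_id)


class FakeSidecar:
    """Stand-in for SidecarClient; the approval outcome is fixed up front."""

    def __init__(self, outcome: str):
        self.outcome = outcome
        self.approval_id = uuid4()
        self.decision_id = uuid4()

    def request_decision(self, prompt: str) -> str:
        # This sketch always answers REQUIRE_APPROVAL.
        raise ApprovalRequired(self.decision_id, self.approval_id, "finance")

    def resume_after_approval(self, approval_id: UUID, decision_id: UUID) -> str:
        if self.outcome == "approved":
            return "llm response"
        if self.outcome == "denied":
            raise ApprovalDenied("over budget")
        if self.outcome == "expired":
            raise ApprovalExpired()
        raise ApprovalCancelled()


def run_with_approval(sidecar: FakeSidecar, prompt: str, notify_approver) -> str:
    try:
        return sidecar.request_decision(prompt)
    except ApprovalRequired as pending:
        # Route the human approver, then resume.
        notify_approver(pending.approval_id, pending.approver_role)
        try:
            return pending.resume(sidecar)
        except ApprovalDenied as denied:
            return f"denied: {denied}"
        except (ApprovalExpired, ApprovalCancelled) as closed:
            return f"not actionable: {type(closed).__name__}"
```

The caller only handles typed exceptions; it never sees or constructs the resume idempotency key.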
Quality bar
Meets 90%+ for “adapter resume protocol foundation” scope: typed proto with three-arm oneof matching the spec’s approve/deny/non-actionable cases, idempotency contract documented at the proto layer (NEW key derived from approval_id), stub handler that fails-clean rather than silently admitting, Python SDK contract documented for the followup implementer, framework-specific behavior captured as pseudocode in the progress doc. Open items (live re-run-Contract-+-Ledger wiring, demo mode, real SDK update, framework example apps, timeout-poll convention) are explicit S16-followups rather than gaps in the protocol foundation.
S19 — Retention, redaction, and tenant data policy
Status: SHIPPED (schema + DB-layer immutability triggers + data classification doc). Retention sweeper service + application-level write-time redaction + export-time redaction wiring are explicit S19-followups.
Design decision
- `tenant_data_policy` table carries per-tenant retention + redaction config:
  - `audit_retention_days` (default 365): compliance window for IMMUTABLE rows. Sweeper does NOT delete these.
  - `prompt_retention_days` (default 30; `0` = hashes-only).
  - `provider_raw_retention_days` (default 90).
  - `export_redaction_field_paths` JSONB array: paths in cloudevent_payload that the export endpoint redacts before bytes leave the service boundary.
  - Tombstone fields (state + actor + timestamp + reason); one-way via trigger.
- `retention_sweeper_log`: append-only audit of every sweeper pass. Outcome enum (in_progress | success | partial_failure | permanent_failure), sweep_kind enum (prompt_redaction | provider_raw_redaction | tombstone_check), row counts, error summary. Compliance-review query: “show me all redactions in the last 90 days”.
- Defense-in-depth DELETE triggers on every audit-immutable table:
  - `audit_outbox`
  - `audit_outbox_global_keys`
  - `ledger_transactions`
  - `ledger_entries`

  All BEFORE DELETE triggers raise `42P01` regardless of caller role. Spec invariant “Retention code cannot delete ledger/audit invariants” enforced at the DB layer, not just application logic.
- Tombstone is one-way via trigger: `tenant_data_policy_touch` rejects an UPDATE that flips `tombstoned` from TRUE → FALSE. Spec invariant “Tombstoned tenant remains auditable”: the policy row stays in place; existing audit rows untouched; application-level writes for that tenant get rejected (S19-followup wiring).
- Redaction shape (documented in data-classification.md): redacted rows replace `cloudevent_payload->'data'` with `{"_redacted": true, "redacted_at": "..."}` + add a `_data_sha256_hex` field carrying the hash of the original bytes. The audit chain stays valid because the producer_signature was computed over the ORIGINAL bytes; verifiers re-derive canonical bytes from the redacted form’s hash + the remaining metadata.
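
The redaction shape can be illustrated with a small transform. This is a sketch under two loudly stated assumptions that this log does not confirm: the original bytes are taken to be compact, key-sorted JSON, and `_data_sha256_hex` is placed next to `_redacted` inside `data`. `redact_payload` is a hypothetical helper name.

```python
import copy
import hashlib
import json
from datetime import datetime, timezone


def redact_payload(payload: dict, now: datetime) -> dict:
    """Return a copy of payload with 'data' replaced by the redaction marker."""
    # ASSUMPTION: canonical bytes are compact, key-sorted UTF-8 JSON. The real
    # canonicalization is whatever the producer signed and may differ.
    original_bytes = json.dumps(
        payload["data"], sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    redacted = copy.deepcopy(payload)
    redacted["data"] = {
        "_redacted": True,
        "redacted_at": now.isoformat(),
        # ASSUMPTION: the hash sits beside the marker inside 'data'.
        "_data_sha256_hex": hashlib.sha256(original_bytes).hexdigest(),
    }
    return redacted


event = {"id": "evt-1", "type": "example.event", "data": {"prompt": "hello"}}
redacted_event = redact_payload(event, datetime(2025, 1, 1, tzinfo=timezone.utc))
```

The envelope fields (`id`, `type`) are left untouched and the input is not mutated, mirroring the sweeper's UPDATE-only behavior.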
Changed files
- NEW `services/ledger/migrations/0028_retention_redaction.sql` (~165 lines):
  - `tenant_data_policy` table + indexes + CHECK constraints (positive retention days, tombstone fields consistent).
  - `retention_sweeper_log` table + index + outcome / sweep_kind CHECK constraints.
  - `block_audit_immutable_delete` function + 4 BEFORE DELETE triggers (audit_outbox, audit_outbox_global_keys, ledger_transactions, ledger_entries).
  - `tenant_data_policy_touch` trigger maintaining updated_at + enforcing tombstone one-way.
- NEW `docs/site/docs/operations/data-classification.md` (~165 lines): per-table per-field classification catalog, redaction shape spec, operator playbook (set prompt_retention_days=0, tombstone tenant, audit recent redactions), explicit list of S19-followup gaps.
- Schema-level: migration parses on demo bring-up.
- Trigger behavior tested manually via psql:

  ```sql
  -- Should raise 42P01.
  DELETE FROM audit_outbox WHERE audit_outbox_id = '...';

  -- Should raise 23514 (cannot un-tombstone).
  UPDATE tenant_data_policy
  SET tombstoned = FALSE
  WHERE tenant_id = '...' AND tombstoned;
  ```
- Retention sweeper integration tests pending the followup service.
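
Until an automated test exists, the two trigger rules can be prototyped in-process. The sketch below uses SQLite purely because it runs without a server; the migration itself targets Postgres, so the trigger syntax, error classes, and SQLSTATE codes here do not match the real ones. The table shapes are cut down and the trigger names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE audit_outbox (audit_outbox_id TEXT PRIMARY KEY);
    CREATE TABLE tenant_data_policy (
        tenant_id TEXT PRIMARY KEY,
        tombstoned INTEGER NOT NULL DEFAULT 0
    );

    -- Reject every DELETE on the audit table.
    CREATE TRIGGER audit_outbox_no_delete
    BEFORE DELETE ON audit_outbox
    BEGIN
        SELECT RAISE(ABORT, 'audit rows are immutable');
    END;

    -- Reject only the TRUE -> FALSE transition.
    CREATE TRIGGER tenant_data_policy_one_way
    BEFORE UPDATE OF tombstoned ON tenant_data_policy
    WHEN OLD.tombstoned = 1 AND NEW.tombstoned = 0
    BEGIN
        SELECT RAISE(ABORT, 'cannot un-tombstone');
    END;

    INSERT INTO audit_outbox VALUES ('a1');
    INSERT INTO tenant_data_policy VALUES ('t1', 1);
    """
)


def is_rejected(sql: str) -> bool:
    """Return True when SQLite refuses to run the statement."""
    try:
        conn.execute(sql)
    except sqlite3.DatabaseError:
        return True
    return False


delete_blocked = is_rejected("DELETE FROM audit_outbox WHERE audit_outbox_id = 'a1'")
untombstone_blocked = is_rejected(
    "UPDATE tenant_data_policy SET tombstoned = 0 WHERE tenant_id = 't1'"
)
rows_left = conn.execute("SELECT count(*) FROM audit_outbox").fetchone()[0]
```

A Postgres-backed version of the same checks would belong with the pending sweeper integration tests.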
Adversarial review
- Operator runs `DELETE FROM audit_outbox`: trigger blocks. Rejection is at the DB layer; even an UPDATE-only application role can’t bypass. SUPERUSER could disable triggers but that action is visible to pgAudit.
- Sweeper deletes by mistake: sweeper service (S19-followup) only issues UPDATE statements (clears the `data` field, sets `redacted_at`). DELETE statements trigger the constraint regardless.
- Operator un-tombstones tenant: trigger blocks TRUE→FALSE transition. The application-level consequence (rejecting writes) stays consistent.
- Audit chain breaks after redaction: the redaction shape preserves a hash of the original bytes; verifier algorithm is documented to re-derive canonical bytes from the redacted form. New audit rows continue to verify cleanly because they have producer_signature over their ORIGINAL data.
- Tenant policy spoofing via fake tenant_id: the table uses `tenant_id` as primary key; a malicious INSERT for a different tenant_id doesn’t affect the legitimate tenant’s policy row.
- Retention bypass via export: export endpoint (S9) must apply `export_redaction_field_paths` before bytes leave. S19 documents this; S19-followup wires it. Until then, exports for prompt-sensitive tenants will leak prompt content if `prompt_retention_days > 0` AND redaction isn’t yet applied. Documented gap.
- Compliance reviewer reads pre-redacted data via archival snapshot: out of scope. Backups inherit whatever data was in the DB at backup time. Operators with sensitive data should configure backup retention in line with `audit_retention_days`.
Observability
- Forensics SQL the schema unlocks:
  - `SELECT sweep_kind, count(*) FROM retention_sweeper_log WHERE started_at > now() - interval '30 days' GROUP BY 1`: sweeper activity.
  - `SELECT count(*) FROM tenant_data_policy WHERE tombstoned`: tombstoned tenant count for capacity planning.
  - `SELECT count(*) FROM canonical_events WHERE cloudevent_payload->'_redacted' = 'true'::JSONB`: redacted-row count (after sweeper service ships).
Residual risks (S19-followup)
- No retention sweeper service yet. The schema is in place + classification documented. The background worker that scans + redacts on schedule is the next chunk. Should reuse the leases (S1) + outbox patterns.
- Application-level write-time redaction when `prompt_retention_days = 0`: sidecar + webhook_receiver need to consult `tenant_data_policy` before writing the `data` field. Documented in data-classification.md as the gap.
- Export endpoint redaction (S9): the `/api/audit/export` handler doesn’t yet read `export_redaction_field_paths`. Adding this is small (one JSON-path-strip pass before serialization).
- Tombstone application-level enforcement: sidecar / webhook_receiver / control_plane code paths don’t yet check `tombstoned` before processing. Each service needs a tenant-policy lookup at request time.
- Audit-chain hash continuity across redaction: the shape is documented (preserve _data_sha256_hex) but the verifier code in canonical_ingest doesn’t yet re-derive from this form. S8’s verifier needs an extension that handles redacted rows.
- Retention sweep schedule: typically nightly. No crontab / scheduler shipped. Operators run via psql in the meantime.
- Per-region retention variance: GDPR-style “right to be forgotten” needs faster redaction for EU tenants than US. The schema supports per-tenant config; a per-region default + override mechanism is followup.
- Codex round still flaking; code-level review here.
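
The export-time gap above is described as one JSON-path-strip pass before serialization. A minimal sketch of such a pass follows, assuming the entries in `export_redaction_field_paths` are simple dotted paths such as `data.prompt`; the real path syntax is not specified in this log, and `strip_paths` is a hypothetical helper name.

```python
import copy


def strip_paths(payload: dict, field_paths: list[str]) -> dict:
    """Return a copy of payload with each dotted path removed if present."""
    result = copy.deepcopy(payload)
    for path in field_paths:
        parts = path.split(".")
        node = result
        # Walk to the parent of the final key; stop if the path does not exist.
        for part in parts[:-1]:
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict):
            node.pop(parts[-1], None)
    return result


payload = {
    "id": "evt-1",
    "data": {"prompt": "hi", "model": "m", "usage": {"tokens": 3}},
}
exported = strip_paths(
    payload, ["data.prompt", "data.usage.tokens", "data.missing.x", "id.foo"]
)
```

Paths that do not resolve are skipped silently, so one policy row can cover several event shapes; whether the real handler should instead log or reject unknown paths is an open design choice.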
Runbook deltas
- New tables: `tenant_data_policy`, `retention_sweeper_log`.
- New doc: `docs/site/docs/operations/data-classification.md`.
- Operator playbook (data-classification.md):
  - Set tenant prompt retention to 0: `UPDATE tenant_data_policy SET prompt_retention_days = 0, updated_by = '...' WHERE tenant_id = '...';`
  - Tombstone tenant: `UPDATE tenant_data_policy SET tombstoned = TRUE, tombstoned_at = clock_timestamp(), tombstoned_by = '...', tombstoned_reason = '...' WHERE tenant_id = '...';`
  - Audit recent redactions: query `retention_sweeper_log` filtered by date.
Quality bar
Meets 90%+ for “retention + redaction policy foundation” scope: per-tenant policy table with three independent retention dimensions (audit / prompt / provider raw), sweeper audit log, DB-layer immutability triggers preventing DELETE on every audit-immutable table, tombstone-is-one-way trigger, full per-field data classification doc with redaction shape spec, operator playbook for the most common policy changes. Open items (sweeper service, app-level write-time redaction, export-time redaction, tombstone enforcement, redacted-row verifier path, scheduler, per-region variance) are explicit S19-followups rather than gaps in the policy foundation.
(Subsequent slice entries appended below.)