The reservation pattern for LLM budgets
The auth/capture pattern is forty years old in payments and a few years old in feature flags. It’s not yet a primitive for LLM spend, but it should be — the same shape solves the same class of problem. This page explains the pattern from first principles and shows how Agentic SpendGuard implements it.
Why the standard answer doesn’t work
The naive design for “control LLM spend” is a counter:
    budget_remaining = 1000.00

    on each LLM call:
        budget_remaining -= actual_cost   # arrives after the call
        if budget_remaining < 0:
            alert

This fails because the actual cost arrives after the call. The provider returns a usage record that tells you how many tokens it charged for. By the time you decrement the counter, you’ve already spent the money. The counter is a record-keeping device, not a gate.
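A minimal sketch of that failure, using made-up per-call costs. The function name and the numbers are illustrative only; the point is that the check fires after the overdrawing call has already been paid for.

```python
from decimal import Decimal


def run_with_counter(budget: Decimal, call_costs: list[Decimal]) -> tuple[Decimal, int]:
    """Naive post-hoc counter: decrement after each call, stop once negative.

    Returns (final_balance, calls_made). All costs are hypothetical.
    """
    remaining = budget
    calls_made = 0
    for actual_cost in call_costs:
        # The provider call happens here; its cost is only known afterwards.
        calls_made += 1
        remaining -= actual_cost
        if remaining < 0:
            # Too late: the money for this call is already spent.
            break
    return remaining, calls_made


costs = [Decimal("0.40")] * 4
balance, calls = run_with_counter(Decimal("1.00"), costs)
print(balance, calls)
```

The third call is the one that overdraws the budget, and it still goes through. The counter only prevents the fourth.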
To gate, you need to know the cost before the call. But LLM cost isn’t fixed — output tokens depend on what the model generates, which depends on… the call you haven’t made yet.
This is the problem auth/capture solves in payments. The Visa network doesn’t know the final amount of a hotel charge when you check in: the hotel could add bar tabs, room service, or damages. So the hotel authorizes an estimated amount (the auth hold), which reserves funds without charging them. At checkout, the hotel captures the actual amount, and the unused portion of the hold is released.
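The LLM analogue of the hotel's estimated hold is a worst-case cost bound: the input token count is known before the call, and the output is capped by the request's maximum output tokens. The sketch below assumes a hypothetical pricing table and model name. It is one plausible way to produce an estimate, not a description of how SpendGuard computes its projected claims.

```python
from decimal import Decimal

# Hypothetical prices in USD per 1M tokens. Real values come from the
# provider's pricing table.
PRICING = {
    "example-model": {"input": Decimal("2.50"), "output": Decimal("10.00")},
}


def estimate_upper_bound(model: str, input_tokens: int, max_output_tokens: int) -> Decimal:
    """Worst-case cost: the known input tokens plus the full output cap."""
    price = PRICING[model]
    total = input_tokens * price["input"] + max_output_tokens * price["output"]
    return total / Decimal(1_000_000)


projected = estimate_upper_bound("example-model", input_tokens=4_000, max_output_tokens=3_000)
print(projected)
```

Because the bound assumes the model uses its entire output cap, the actual cost can only come in at or below the hold, as long as the pricing table is accurate.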
The pattern that does
The reservation pattern, mapped to LLM budgets:
    Phase 1: Estimate
      Given: the messages, the model, the pricing table.
      Output: a projected claim (e.g., "this call will cost ~$0.04").

    Phase 2: Auth (Reserve)
      Sidecar checks projected_claim against budget.
      Budget can cover?   → record a reservation entry, return RESERVED.
      Budget can't cover? → return STOP, the LLM call must not happen.

    Phase 3: Upstream LLM call
      Application makes the actual provider call.
      Provider returns actual_cost in the usage record.

    Phase 4: Capture (Commit)
      Sidecar receives actual_cost.
      Ledger: reservation → commit, freeing unused portion.

    Phase 5: Release on failure
      If Phase 3 throws / times out / crashes:
        Application calls sidecar.release(decision_id).
        Reservation rolls back, budget is restored.
      Otherwise, a TTL background sweeper auto-releases stale
      reservations after a configurable timeout.

Properties that emerge:
- Pre-call refusal is mechanical. The over-budget path raises before the LLM call. There’s no soft-warning branch.
- Auth-stage estimates can be conservative. When a model returns fewer tokens than estimated, the unused portion is freed at capture time. Budget is never permanently locked beyond actual usage.
- Idempotency is structural. Re-running the same decision (same decision_id, same idempotency_key) collapses onto the existing reservation. Retries don’t double-charge.
- Crash-safe. A pod that dies after auth but before capture loses the in-memory state, not the durable record. The ledger entry is still there; the TTL sweeper releases it eventually.
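The phases and properties above can be sketched as a small in-memory ledger. Everything here (`Ledger`, `Reservation`, the method names, the explicit `now` parameter used to make the demo deterministic) is hypothetical and exists only for illustration. A real implementation needs durable storage for the crash-safety property to hold.

```python
import time
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Reservation:
    amount: Decimal
    expires_at: float
    state: str = "RESERVED"  # RESERVED -> COMMITTED | RELEASED


class Ledger:
    """In-memory stand-in for a durable ledger; illustration only."""

    def __init__(self, budget: Decimal, ttl_seconds: float = 60.0):
        self.available = budget
        self.ttl_seconds = ttl_seconds
        self.entries: dict[str, Reservation] = {}

    def reserve(self, key: str, projected: Decimal, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        if key in self.entries:
            # Idempotent retry: collapse onto the existing reservation.
            return self.entries[key].state
        if projected > self.available:
            return "STOP"  # the caller must not make the LLM call
        self.available -= projected
        self.entries[key] = Reservation(projected, now + self.ttl_seconds)
        return "RESERVED"

    def commit(self, key: str, actual: Decimal) -> None:
        entry = self.entries[key]
        if entry.state != "RESERVED":
            return
        # Free the unused part of the hold (negative if the estimate was too low).
        self.available += entry.amount - actual
        entry.amount, entry.state = actual, "COMMITTED"

    def release(self, key: str) -> None:
        entry = self.entries[key]
        if entry.state == "RESERVED":
            self.available += entry.amount
            entry.state = "RELEASED"

    def sweep(self, now: Optional[float] = None) -> int:
        """Release reservations whose TTL has passed; return how many."""
        now = time.monotonic() if now is None else now
        stale = [k for k, e in self.entries.items()
                 if e.state == "RESERVED" and e.expires_at <= now]
        for key in stale:
            self.release(key)
        return len(stale)


ledger = Ledger(budget=Decimal("0.10"))
first = ledger.reserve("call-1", Decimal("0.04"), now=0.0)
retry = ledger.reserve("call-1", Decimal("0.04"), now=0.0)   # no second hold
ledger.commit("call-1", Decimal("0.03"))                     # frees 0.01
refused = ledger.reserve("call-2", Decimal("0.08"), now=1.0)  # over budget
ledger.reserve("call-3", Decimal("0.05"), now=1.0)           # never committed
swept = ledger.sweep(now=100.0)                              # TTL expired
```

Keying reservations by an idempotency key is what makes retries safe in this sketch: the second `reserve` for `call-1` finds the existing entry and holds nothing further.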
Show me the code
The pattern surfaces in the SDK as one decision call per LLM call:
    # Phase 1+2: estimate + auth
    outcome = await sg.request_decision(
        trigger="LLM_CALL_PRE",
        run_id=run_id,
        decision_id=decision_id,
        route="llm.call",
        projected_claims=[estimated_claim],
        idempotency_key=derive_key(...),
    )

    # Phase 3: upstream LLM call
    try:
        response = await openai.chat.completions.create(...)
    except Exception:
        # Phase 5: release on failure
        await sg.release(decision_id)
        raise

    # Phase 4: capture
    await sg.commit(
        decision_id=decision_id,
        actual_claims=[claim_from(response.usage)],
    )

The framework adapters bundle this into a single Model.request() override so application code is one wrap-the-model line, not a five-step protocol per call.
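To illustrate what such bundling might look like, here is a sketch of a wrapper that runs the whole protocol around an arbitrary awaitable. `guarded_call`, `BudgetExceeded`, and `FakeGuard` are hypothetical names. `FakeGuard` is a toy stand-in with a reduced `request_decision` signature and a string outcome, both of which are assumptions, and it is not the SDK client or the actual adapter code.

```python
import asyncio
from decimal import Decimal
from typing import Any, Awaitable, Callable


class BudgetExceeded(Exception):
    """Raised when the reservation is refused, before any provider call."""


class FakeGuard:
    """Hypothetical in-memory stand-in for the SDK client; not the real API."""

    def __init__(self, budget: Decimal):
        self.budget = budget
        self.log: list[str] = []

    async def request_decision(self, decision_id: str, projected_claims: list[Decimal]) -> str:
        outcome = "STOP" if sum(projected_claims) > self.budget else "RESERVED"
        self.log.append(outcome)
        return outcome

    async def commit(self, decision_id: str, actual_claims: list[Decimal]) -> None:
        self.budget -= sum(actual_claims)
        self.log.append("COMMIT")

    async def release(self, decision_id: str) -> None:
        self.log.append("RELEASE")


async def guarded_call(
    sg: Any,
    decision_id: str,
    estimated_claim: Decimal,
    call_llm: Callable[[], Awaitable[Any]],
    claim_from: Callable[[Any], Decimal],
) -> Any:
    # Phases 1+2: the estimate is passed in; ask for a reservation.
    outcome = await sg.request_decision(
        decision_id=decision_id, projected_claims=[estimated_claim]
    )
    if outcome != "RESERVED":
        raise BudgetExceeded(decision_id)
    try:
        response = await call_llm()        # Phase 3
    except Exception:
        await sg.release(decision_id)      # Phase 5
        raise
    await sg.commit(                       # Phase 4
        decision_id=decision_id, actual_claims=[claim_from(response)]
    )
    return response


async def demo() -> list[str]:
    sg = FakeGuard(budget=Decimal("0.05"))

    async def ok_call() -> dict:
        return {"cost": Decimal("0.03")}

    async def failing_call() -> dict:
        raise TimeoutError("provider timed out")

    def cost_of(response: dict) -> Decimal:
        return response["cost"]

    await guarded_call(sg, "d1", Decimal("0.04"), ok_call, cost_of)
    for decision_id, estimate, call in [
        ("d2", Decimal("0.02"), failing_call),  # reserved, then released
        ("d3", Decimal("0.04"), ok_call),       # refused before the call
    ]:
        try:
            await guarded_call(sg, decision_id, estimate, call, cost_of)
        except (TimeoutError, BudgetExceeded):
            pass
    return sg.log


log = asyncio.run(demo())
print(log)
```

The application-facing surface is a single call. The reserve, commit, and release steps stay inside the wrapper, so a caller cannot forget the release path.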
What this is not
- It is not a billing system. SpendGuard doesn’t generate invoices or settle with providers. It gates calls; the provider still bills you for the calls you make.
- It is not a usage analytics dashboard. It records every reservation and capture in an audit chain, but turning that into BI charts is a separate concern.
- It is not free. Each decision is a gRPC round-trip over a Unix domain socket (UDS), with p99 latency under 5 ms in the POC. For agents that make tens of LLM calls per second, this is negligible. For higher-frequency systems, measure first.
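For the "measure first" case, a small timing harness is usually enough. The sketch below computes a nearest-rank p99 over wall-clock samples. `time_calls` and `p99_ms` are hypothetical helpers, and the callable passed in is a placeholder for whatever performs one decision round-trip in your system.

```python
import time
from typing import Callable


def p99_ms(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of latency samples, in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n), 1-based rank
    return ordered[rank - 1]


def time_calls(fn: Callable[[], object], n: int = 1000) -> list[float]:
    """Call fn n times and return per-call wall-clock latency in ms."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


# Placeholder workload; substitute one decision round-trip here.
samples = time_calls(lambda: None, n=200)
print(f"p99 = {p99_ms(samples):.4f} ms over {len(samples)} calls")
```

Comparing that figure against your per-call latency budget tells you whether the gate is negligible at your call rate.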
Read more
- 6-layer architecture: where the reservation pattern fits in the larger SpendGuard runtime
- Decision lifecycle: the auth → capture → release state machine in detail
- Ledger storage spec: the Postgres schema that implements the audit chain
- Pre-call budget caps: the practical use-case framing of this pattern