Skip to content

Pre-call budget caps for LLM API requests

You want a cap that says “this agent can spend at most $X on gpt-4o per hour, and any call that would push it past the line must be refused before the request goes to the provider.” Token-usage dashboards and daily alerts give you the post-hoc number, not the gate. Here’s the pattern that gives you the gate.

Most LLM cost tooling is reconciliation, not control:

ApproachWhat it doesWhen you find out
Provider invoice / billing APITells you what you spentEnd of billing cycle
Usage dashboardAggregates token countsHours later, after the spend
Rate limit on the provider keyCaps requests per second/minuteNot by dollar — by count
Soft alert (“you’re at 80%“)Pings a webhookAfter the budget is mostly gone

None of these prevent the call. They tell you the bill, hopefully before the next bill. When an agent is in a retry loop or a tool-use loop, the gap between “spend the money” and “see the dashboard” is exactly when real damage happens.

A budget reservation sits in front of every LLM call. The reservation acts like a Stripe auth/capture:

agent → SDK wrapper
sidecar.request_decision(budget_id, projected_claim)
├── budget would be exceeded ───► STOP (raise, no LLM call)
├── budget can cover it ───► RESERVE (auth) ──┐
│ │
│ ▼
│ your LLM call goes out
│ │
├── provider response ──────► sidecar.commit (capture actual)
│ or sidecar.release (cancel auth)
└── crash / timeout ─► reservation auto-releases on TTL

Three properties that make this work:

  1. Pre-call refusal is mechanical. The over-budget path is a thrown exception, not a soft warning. Application code can’t accidentally ignore it.
  2. Reservations are accounted, not estimates. The ledger tracks reservations (auth-stage) and commits (capture-stage) separately, so an estimated 1,500 tokens reserved but actually 800 used releases 700 back to the budget.
  3. Idempotent on retry. A retried call with identical inputs collapses onto the original reservation instead of allocating a new one. Otherwise a 47-retry loop would burn 47x the reservation.

The reservation is one call. The Agentic SpendGuard SDK handles the auth/commit/release lifecycle:

from spendguard import SpendGuardClient, DecisionStopped
async with SpendGuardClient(socket_path="/var/run/spendguard/adapter.sock",
tenant_id=tenant_id) as sg:
await sg.handshake()
try:
outcome = await sg.request_decision(
trigger="LLM_CALL_PRE",
run_id=run_id, decision_id=decision_id,
route="llm.call",
projected_claims=[claim], # estimated USD or tokens
idempotency_key=derive_key(...), # stable across retries
)
# Reservation made. Make the LLM call now.
except DecisionStopped as e:
# Over budget. The LLM call must not happen.
raise

The framework adapters (Pydantic-AI / LangChain / OpenAI Agents / AGT) wrap this in a single Model.request() override so application code doesn’t change.