An AI feature with no spend cap is a feature that can bill you an unbounded amount. One runaway agent loop, one user who pastes a novel into the chat box, one prompt that makes the model ramble for 4,000 tokens — any of these quietly multiplies your invoice. The fix is two cheap, boring controls: bound every call with max_tokens, and track spend per user from the usage object so you can cut someone off before they cost more than they're worth.

None of this is hard. The hard part is trusting the numbers you cap against. A budget is only as good as the token counts feeding it — if your gateway pads usage or bills you for failed calls, your carefully computed $1/user/day ceiling is really $1.40, and you'd never know. So this post does two things: shows you the mechanics, and explains why honest billing is what makes the mechanics work at all.

Step 1 — bound every call with max_tokens

Output is the expensive half of most bills. On Claude Sonnet 4.6, output runs $12.75 per 1M tokens — five times the input rate of $2.55. An answer you expected to be three sentences that instead comes back as twelve paragraphs isn't a quality problem; it's a cost problem. max_tokens is the hard ceiling that makes the worst case knowable in advance.

cap_per_request.py

# Cap the cost of every single call with max_tokens.
# It bounds the *output* — the expensive half of most bills — to a
# hard ceiling. The model stops at the limit instead of running long.
from openai import OpenAI

client = OpenAI(
    api_key="sk-brievio-...",
    base_url="https://api.brievio.com/v1",
)

# Claude Sonnet 4.6: $2.55 in / $12.75 out per 1M tokens.
# Without a cap, a chatty answer can run 4,000+ output tokens.
# With max_tokens=500 you bound the output cost of THIS call to:
#   500 × $12.75 / 1M = $0.0064 — no matter what the model "wants" to say.
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    max_tokens=500,                 # honored natively — a true hard stop
    messages=[
        {"role": "system", "content": "Answer in at most 3 sentences."},
        {"role": "user", "content": user_question},
    ],
)

# Always check why it stopped. "length" means you hit the ceiling and
# the answer may be truncated — the signal to raise the cap for THIS
# task class, not globally.
if resp.choices[0].finish_reason == "length":
    print("Truncated at the cap — consider a higher max_tokens here.")

print(resp.choices[0].message.content)
print(resp.usage)  # honest input/output counts — see the next section

Brievio honors max_tokens natively against the genuine model — it's a true stop, not a hint. Set it per task class, not once globally: a classifier needs maybe 5 tokens, a chat reply 500, a long summary 2,000. The discipline is to pick a number for each call site and watch finish_reason. When it comes back "length", you clipped a real answer and should raise the cap for that path — not blanket-raise it everywhere and lose the protection.

Step 2 — price the usage object yourself

Every Brievio response carries a usage object with the real input and output token counts for that call, and each request is logged with its real tokens and cost. That means you don't have to wait for an end-of-month invoice to know what anything cost — you can price each call the instant it returns:

usage.prompt_tokens — input tokens actually sent.
usage.completion_tokens — output tokens actually generated (this is what max_tokens bounds).
usage.total_tokens — the sum, for a quick sanity check.

Multiply each count by the published per-model rate and you have the exact cost of the call, in your own code, in real time. Per-model rates are published on the pricing page against the official reference, so the constants in your code are auditable — not a number you have to take on faith.

Step 3 — track spend per user and cut off at a budget

Per-request caps protect you from a single bad call. Per-user budgets protect you from a single bad actor — the one account running thousands of calls a day, whether through enthusiasm, automation, or abuse. The pattern is one durable number per user that you check before the call and increment after:

cap_per_user.py

# Track spend per user from the usage object, then cut off at a budget.
# usage returns the REAL input/output token counts for the call. You
# price them locally against the published per-model rate and add to a
# running tally keyed by user.
from decimal import Decimal

# Published Brievio rates, USD per 1M tokens (≈15% under official list).
RATES = {
    "claude-sonnet-4-6": {"in": Decimal("2.55"), "out": Decimal("12.75")},
    "claude-haiku-4-5":  {"in": Decimal("0.85"), "out": Decimal("4.25")},
}

def cost_usd(model: str, usage) -> Decimal:
    r = RATES[model]
    million = Decimal("1000000")
    return (usage.prompt_tokens     * r["in"]  / million +
            usage.completion_tokens * r["out"] / million)

# Your store of choice — Redis counter, a SQL column, whatever. The point
# is one durable number per user. Failed 4xx/5xx calls are free on Brievio,
# so you only ever add cost for calls that actually returned a result.
DAILY_BUDGET = Decimal("1.00")  # $1/user/day

def chat_for_user(user_id: str, question: str, model="claude-sonnet-4-6"):
    spent = get_spent_today(user_id)            # Decimal, defaults to 0
    if spent >= DAILY_BUDGET:
        raise BudgetExceeded(f"{user_id} hit ${DAILY_BUDGET}/day")

    resp = client.chat.completions.create(
        model=model,
        max_tokens=500,
        messages=[{"role": "user", "content": question}],
    )

    add_spent_today(user_id, cost_usd(model, resp.usage))  # tally real cost
    return resp.choices[0].message.content

Pick the threshold from your unit economics: if a paid seat is $20/month and you want AI to stay under, say, 20% of revenue, that's a budget of roughly $0.13/user/day. Soft-warn at 80% (a banner, an email), hard-cut at 100% (the BudgetExceeded path above). For free-tier users, set the cap low and tier down the model — moving non-critical traffic from Sonnet to Haiku 4.5 at $0.85 in / $4.25 out cuts the per-token cost by about two-thirds while keeping the same honest metering.

Why honest billing makes the math trustworthy

Here's the part most cost-control guides skip. Both controls above — the per-call ceiling and the per-user budget — are arithmetic on token counts. If those counts are inflated, every number downstream is wrong by the same factor, silently. Two billing behaviors decide whether your caps mean what they say:

Failed calls are free. On Brievio, a 4xx or 5xx response costs nothing — you're billed for results, not attempts. That matters for budgets specifically: a retry storm against a flaky downstream shouldn't drain a user's daily allowance. Because you only tally cost on calls that returned a result, your spend curve tracks value delivered, not errors absorbed.
Counts aren't inflated. The usage object reflects the tokens the genuine model actually processed — no padded prompt, no phantom output added to the meter. The honest move is not to take that on trust: send a known fixed prompt, read back prompt_tokens, and confirm it matches what the tokenizer says it should be. We wrote a 20-line token-inflation self-test that does exactly this — run it against any gateway, including ours, before you trust its numbers.

The connection is direct: a budget computed from padded counts cuts users off too early and overstates your cost; one that includes failed calls punishes users for your infrastructure's bad days. Honest metering is what lets a $1/day ceiling actually mean a dollar. Brievio prices chat roughly 15% under official list, pay-as-you-go, with a balance that never expires — so the cap you set is the cap you get, and unused budget stays yours.

Putting it together

A production-ready spend posture is four small habits, none of which takes more than an afternoon:

Every call site sets max_tokens sized to its task. No unbounded outputs, anywhere.
Every response is priced from usage at the published rate and added to a per-user tally.
Every user has a budget with a soft warning and a hard cutoff drawn from your unit economics.
Cheaper traffic tiers down to a smaller model (Haiku, mini-class) instead of paying flagship rates for work that doesn't need them.

Do those four and your worst-case bill is a number you chose, not a number you discover. For the deeper levers — prompt caching, batching, model selection, prompt trimming — see the cost-optimization playbook, and the docs for the full usage schema and the OpenAI-compatible parameters Brievio supports. Cost control isn't a feature you bolt on after the bill scares you — it's two lines and a counter, and it works best on a meter you can trust.

Capping AI API spend — per-request and per-user cost control

Step 1 — bound every call with max_tokens

Step 2 — price the usage object yourself

Step 3 — track spend per user and cut off at a budget

Why honest billing makes the math trustworthy

Putting it together

$ ls ./related

Vision and document understanding with Claude and Gemini via one API

Structured output and JSON mode across Claude and Gemini

Embeddings and semantic search with the OpenAI SDK (RAG guide)

Rate limits, retries and backoff: production error handling for AI APIs