Every AI API bills you per token. You trust the usage object the gateway returns (prompt_tokens, completion_tokens) to be the real count. But that number is the one thing a reseller controls completely, and it's the easiest place to quietly overcharge you. The polite name for it is token inflation: the gateway reports more tokens than you actually sent or received, and you pay 5×, 10×, sometimes 25× the honest cost, whether or not the model behind it is genuine.
This is the failure mode Brievio was built against, so it's worth explaining plainly: how the padding works, a 20-line test you can run against any gateway (including ours), and how to read the result.
What token inflation actually is
The honest token count is what the provider itself would count for your messages — your system prompt, your user content, the model's reply. A trustworthy gateway passes that straight through. An inflated one pads it. Same request, very different bill:
# Honest usage — the count matches the text you sent, plus a tiny
# chat-template overhead (role markers, formatting tokens):
{"prompt_tokens": 24, "completion_tokens": 2, "total_tokens": 26}
# Inflated — you sent ~20 tokens of text but pay for 1,840:
{"prompt_tokens": 1840, "completion_tokens": 2, "total_tokens": 1842}
# ^ a ~1,800-token system prompt you never wrote was injected into the
# request and billed back to you. On a one-word question. Every call.
The most common trick is a hidden injected system prompt: the reseller prepends a few hundred to a few thousand tokens of its own text (a "safety" preamble, a routing wrapper, a fake persona) to every call. You never wrote it, you can't see it, but you pay for it on every single request. At Sonnet input rates, an 1,800-token phantom prefix is about $0.0055 of pure margin on a one-word question. Multiply by a million calls a month.
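To put a number on that, the arithmetic can be sketched as follows. The rate of $3 per million input tokens is an assumption for a Sonnet-class model (check the provider's current price list), and `phantom_cost` is a hypothetical helper name, not part of any SDK.

```python
# Cost of a hidden prefix. ASSUMPTION: $3.00 per million input tokens;
# substitute the current list price for the model you actually call.
INPUT_USD_PER_MILLION = 3.00


def phantom_cost(phantom_tokens: int, calls: int = 1) -> float:
    """USD billed for tokens you never wrote, across `calls` requests."""
    return phantom_tokens * calls * INPUT_USD_PER_MILLION / 1_000_000


per_call = phantom_cost(1_800)
per_month = phantom_cost(1_800, calls=1_000_000)
print(f"per call:  ${per_call:.4f}")
print(f"per month: ${per_month:,.0f}")
```

Under that assumed rate the prefix costs roughly half a cent per call, which becomes several thousand dollars over a million calls.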
The 20-line test
You don't have to trust anyone's word — including ours. Send a prompt whose size you know, then compare what the gateway billed you against what your text actually tokenizes to on your own machine:
# token_inflation_test.py
# Does your gateway report honest token counts? Send a prompt of known
# size, then compare the gateway's reported prompt_tokens against a local
# tokenizer count of the EXACT messages you sent.
import tiktoken
from openai import OpenAI
client = OpenAI(api_key="sk-brievio-...", base_url="https://api.brievio.com/v1")
messages = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "Reply with the single word: ok."},
]
# 1. What the gateway charges you for:
resp = client.chat.completions.create(
    model="claude-sonnet-4-6", messages=messages, max_tokens=5,
)
reported = resp.usage.prompt_tokens
# 2. What your messages actually tokenize to, locally:
enc = tiktoken.get_encoding("cl100k_base") # an approximation — see note below
local = sum(len(enc.encode(m["content"])) for m in messages)
print(f"gateway reported prompt_tokens: {reported}")
print(f"local token count of your text: {local}")
print(f"ratio: {reported / local:.1f}x")
# Normal → ratio ~1.1-1.6x (role markers + chat-template overhead)
# Inflation → ratio 2x / 5x / 25x (a hidden prompt is padding your input)
Reading it: a small fixed overhead is normal. Chat formats add a handful of tokens for role markers and message boundaries, so a ratio around 1.1–1.6× on a tiny prompt is fine and shrinks toward 1.0× as your prompt grows. A ratio of 2×, 5×, or 25× is not a rounding error; it's padding.
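The reading rule above can be written down as a small helper. This is only a sketch: `classify_ratio`, its labels, and the 8-token `OVERHEAD` are illustrative choices based on the ranges quoted in this post, not a standard.

```python
def classify_ratio(reported: int, local: int) -> str:
    """Label a reported/local prompt-token ratio using the ranges quoted above."""
    if local <= 0:
        raise ValueError("local token count must be positive")
    ratio = reported / local
    if ratio <= 1.6:
        return "normal"      # role markers + chat-template overhead
    if ratio < 2.0:
        return "borderline"  # re-test with a longer prompt before concluding
    return "inflated"        # a fixed honest overhead does not explain this


# A fixed overhead matters less and less as the prompt grows.
OVERHEAD = 8  # hypothetical chat-template overhead, in tokens
for local in (14, 100, 1000):
    reported = local + OVERHEAD
    print(local, f"{reported / local:.2f}x", classify_ratio(reported, local))
```

If a tiny prompt lands in the borderline band, repeating the test with a prompt of a few hundred tokens is a cheap way to settle it: an honest overhead fades, a large hidden prefix does not.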
One honest caveat: tiktoken's cl100k_base is OpenAI's tokenizer, and Claude or Gemini tokenize a little differently (typically within 10–20%). So treat the local count as an approximation, not a to-the-token audit. That tokenizer difference will never explain a 2× gap, let alone 25×. For an exact figure, use the provider's own tokenizer or count-tokens endpoint. The test is built to catch inflation, not to quibble over single tokens.
Check output and cache too
Input is the usual target, but the same padding can hide in output tokens and in caching. Two more quick checks:
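# Output and cache can be padded too. Two more 30-second checks,
# reusing `client` from the test above.
#
# (a) Output: ask for exactly one token and cap it.
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Reply with only: ok"}],
    max_tokens=2,
)
print(resp.usage.completion_tokens)  # honest: ~1-2. Padded: 50, 200, ...
#
# (b) Cache: send the same long prefix twice within ~5 min. The second call
# should bill most of the input at the reduced cache-read rate. If the
# "cached" field is always 0, you're paying full price for cache hits.
long_prefix = "Reference text for the cache check. " * 400  # long enough to be cacheable
cache_messages = [
    {"role": "system", "content": long_prefix},
    {"role": "user", "content": "Reply with only: ok"},
]
for attempt in (1, 2):
    resp = client.chat.completions.create(
        model="claude-sonnet-4-6", messages=cache_messages, max_tokens=2,
    )
    details = resp.usage.prompt_tokens_details  # may be None if the gateway omits it
    print(attempt, details.cached_tokens if details else None)
If you ask for one token and get billed for fifty, or if your cached_tokens is stuck at zero on an identical repeat call, the meter is wrong.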
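Some padding is visible in the usage object itself, with no tokenizer involved. A minimal sketch of such sanity checks follows. `audit_usage` is a hypothetical helper, and it assumes the OpenAI-style field names used in this post, passed in as a plain dict, on a call that was capped with `max_tokens`.

```python
def audit_usage(usage: dict, max_tokens: int) -> list[str]:
    """Return the internal inconsistencies found in an OpenAI-style usage dict."""
    problems = []
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    total = usage.get("total_tokens", prompt + completion)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0

    # The billed output should not exceed the cap you set on the request.
    if completion > max_tokens:
        problems.append(f"completion_tokens {completion} exceeds max_tokens {max_tokens}")
    # The total should be the sum of its parts.
    if total != prompt + completion:
        problems.append(f"total_tokens {total} != prompt + completion {prompt + completion}")
    # Cached tokens are a subset of the prompt, never more than it.
    if cached > prompt:
        problems.append(f"cached_tokens {cached} exceeds prompt_tokens {prompt}")
    return problems


honest = {"prompt_tokens": 24, "completion_tokens": 2, "total_tokens": 26}
padded = {"prompt_tokens": 24, "completion_tokens": 50, "total_tokens": 74}
print(audit_usage(honest, max_tokens=2))
print(audit_usage(padded, max_tokens=2))
```

With the OpenAI Python SDK, `resp.usage.model_dump()` should produce a dict in this shape. A clean result here only means the numbers are self-consistent; it says nothing about a hidden prefix, which still needs the local-count comparison.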
Where the inflation comes from
- Injected system prompts. A wrapper preamble added to every request — the single most common source. Big, invisible, billed.
- Re-wrapped "template proxy" models. Your prompt gets stuffed into a large fixed template before it reaches the model. The template tokens are real tokens — to the model and to your bill — but they're not yours.
- Fabricated usage numbers. The crudest version: the usage object simply doesn't match reality. The test above catches this immediately.
- Phantom output padding. Reported completion_tokens exceed the words actually returned.
None of this requires a fake model. A gateway can serve the genuine Claude and still inflate the meter — authenticity of the model and honesty of the bill are two separate promises, and you should check both.
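These sources leave different fingerprints if you run the test at two prompt sizes: a hidden prefix or template shows up as a large constant offset, while a multiplied count shows up as a slope well above 1. The sketch below assumes reported tokens are roughly linear in your local count. `fit_meter` is a hypothetical helper, and the sample numbers are made up for illustration.

```python
def fit_meter(small: tuple[int, int], large: tuple[int, int]) -> tuple[float, float]:
    """Fit reported = offset + slope * local through two (local, reported) points."""
    (l1, r1), (l2, r2) = small, large
    if l1 == l2:
        raise ValueError("the two prompts must have different local token counts")
    slope = (r2 - r1) / (l2 - l1)
    offset = r1 - slope * l1
    return offset, slope


# Illustrative numbers only, not measurements from any real gateway.
print(fit_meter((20, 28), (1000, 1008)))    # small offset, slope near 1
print(fit_meter((20, 1828), (1000, 2808)))  # large offset: hidden prefix or template
print(fit_meter((20, 140), (1000, 5040)))   # slope well above 1: multiplied count
```

Because the local tokenizer is only an approximation for non-OpenAI models, a slope of 1.1 or 1.2 is not evidence of anything; the signal is an offset in the hundreds or thousands, or a slope of 2 or more.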
The honest baseline
Brievio passes the provider's own token counts through unchanged, injects nothing into your requests, and logs the real input and output tokens plus the exact cost on every call, visible in your usage dashboard. Run the test above against Brievio and you should see reported ≈ local + small overhead — the way it should read everywhere. Our pricing shows each model against its official reference rate so the discount is auditable, and the usage docs spell out exactly which fields we return.
If a gateway is 80% under list, the first question isn't "is the model real" — it's "what does the meter say." Run the 20 lines. It costs less than a cent, and it's the cheapest due diligence you'll ever do on a vendor.