$ cat ./docs/caching.md
Prompt caching
Mark a stable prompt prefix once with cache_control. Every subsequent call that reuses the prefix bills it back at a fraction of input price — 10% on Anthropic, 20% on OpenAI and Gemini. Brievio's routing layer keeps the cache warm by pinning your conversation to the same upstream key.
On a 30K-token system prompt with Claude Sonnet 4.6, a cache read costs $0.27 / 1M instead of $2.70 / 1M. For an agent that re-sends the same context 50 times a session, the cache saves ~90% of the input bill.
#How to enable it
Add a cache_control: { type: "ephemeral" } breakpoint on any system block, message, or tool definition. The model caches the span up to that breakpoint on the first call; subsequent calls that share the same prefix bytes hit the cache.
from openai import OpenAI
client = OpenAI(
api_key="sk-brievio-...",
base_url="https://api.brievio.com/v1",
)
# Mark the large, stable prefix once with cache_control.
# Every subsequent call that sends the same prefix reads it back
# from cache at ~10% of the input rate on Anthropic models.
SYSTEM = [
{
"type": "text",
"text": open("long_system_prompt.txt").read(),
"cache_control": {"type": "ephemeral"},
},
]
resp = client.chat.completions.create(
model="claude-sonnet-4-6",
messages=[
{"role": "system", "content": SYSTEM},
{"role": "user", "content": "Summarize the doc above in 3 bullets."},
],
)
u = resp.usage
# OpenAI-compat fields populated by Brievio:
# - prompt_tokens_details.cached_tokens — read from cache
# - cache_creation_input_tokens — written to cache this call
print(u.prompt_tokens_details.cached_tokens,
getattr(u, "cache_creation_input_tokens", 0))Identical wire-format on the native Messages API — cache_control passes through to Claude untranslated:
# Native Anthropic Messages API — cache_control passes through untranslated.
curl https://api.brievio.com/v1/messages \
-H "Authorization: Bearer $BRIEVIO_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-6",
"system": [
{
"type": "text",
"text": "<your large stable system prompt>",
"cache_control": { "type": "ephemeral" }
}
],
"messages": [
{ "role": "user", "content": "Summarize the doc above in 3 bullets." }
],
"max_tokens": 1024
}'#Pricing
Cache rates derive from each model's input price. Anthropic cache reads bill at 0.10× input. OpenAI and Gemini cache reads bill at 0.20× input. Brievio does not pass through Anthropic's upstream cache-write surcharge — cache writes are billed to you at the plain input rate, so there's no penalty for warming up a new cache entry.
OpenAI / Gemini — 0.20× cache read
| Model | Input | Cache read |
|---|---|---|
| Claude Opus 4.7 | $4.25 | $0.85 |
| Claude Opus 4.6 | $4.25 | $0.85 |
| Claude Sonnet 4.6 | $2.55 | $0.51 |
| Claude Sonnet 4.5 | $2.55 | $0.51 |
| Claude Haiku 4.5 | $0.85 | $0.17 |
| Gemini 2.5 Pro | $1.06 | $0.2125 |
| Gemini 2.5 Flash | $0.255 | $0.051 |
max(catalog_cost, upstream × markup) per call — the floor is computed against the cached-input rate, not the fresh rate, so the savings reach you instead of being flattened.#Affinity routing keeps the cache warm
Upstream prompt caches are per-API-key. If every request lands on a random upstream key, the cache hit rate is ~1/N. Brievio derives a stable hash from your system prompt + first user message and routes all calls with the same prefix to the same upstream key — using weighted rendezvous hashing, so only ~1/N keys are remapped when the pool changes.
You don't need to configure anything. Affinity routing applies automatically to all chat calls. (Image / video / TTS use random routing — there is no per-key cache for those modalities.)
#Dashboard metrics
Once your calls start hitting the cache, /app/usage surfaces two rolled-up metrics for the selected window:
- Hit rate — cached input tokens ÷ total input tokens. Exact, computed from per-call token counts.
- Saved — USD saved this window, computed per model with its real cache-read rate (no flat-average approximations). This is an estimate at catalog rates — since real bills are floored at catalog and cached calls can land below the floor, the number is a tight upper bound on actual savings. For typical chat workloads the two are within a few percent.
Each call's detail page (/app/usage/<id>) shows the cached + cache-write token counts when they're non-zero, and the actual billed amount.
#Response fields
Brievio surfaces cache tokens in both response shapes. On the OpenAI-compatible /v1/chat/completions endpoint:
| Field | Meaning |
|---|---|
usage.prompt_tokens | Total input tokens (includes cached + cache-write). |
usage.prompt_tokens_details.cached_tokens | Subset of input served from cache. |
usage.cache_creation_input_tokens | Tokens written to cache this call (billed as fresh input). |
On the native Messages API at /v1/messages, the Anthropic-original fields pass through:
| Field | Meaning |
|---|---|
input_tokens | Fresh (uncached) input. |
cache_read_input_tokens | Served from cache (billed 0.10× input). |
cache_creation_input_tokens | Written to cache this call (billed at the plain input rate — Brievio doesn't pass through Anthropic's upstream write surcharge). |
#Where to go next
- Messages API reference — native Anthropic shape with worked cache examples.
- Billing & the ledger — how the cache-aware max() floor interacts with upstream cost.
- Full pricing table — fresh input / output prices for every chat model, with the cache sub-table at the bottom.