$ cat ./docs/caching.md

Prompt caching

Mark a stable prompt prefix once with cache_control. Every subsequent call that reuses the prefix bills it back at a fraction of input price — 10% on Anthropic, 20% on OpenAI and Gemini. Brievio's routing layer keeps the cache warm by pinning your conversation to the same upstream key.

On a 30K-token system prompt with Claude Sonnet 4.6, a cache read costs $0.27 / 1M instead of $2.70 / 1M. For an agent that re-sends the same context 50 times a session, the cache saves ~90% of the input bill.

#How to enable it

Add a cache_control: { type: "ephemeral" } breakpoint on any system block, message, or tool definition. The model caches the span up to that breakpoint on the first call; subsequent calls that share the same prefix bytes hit the cache.

cache_example.py

from openai import OpenAI

client = OpenAI(
    api_key="sk-brievio-...",
    base_url="https://api.brievio.com/v1",
)

# Mark the large, stable prefix once with cache_control.
# Every subsequent call that sends the same prefix reads it back
# from cache at ~10% of the input rate on Anthropic models.
SYSTEM = [
    {
        "type": "text",
        "text": open("long_system_prompt.txt").read(),
        "cache_control": {"type": "ephemeral"},
    },
]

resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Summarize the doc above in 3 bullets."},
    ],
)

u = resp.usage
# OpenAI-compat fields populated by Brievio:
# - prompt_tokens_details.cached_tokens — read from cache
# - cache_creation_input_tokens          — written to cache this call
print(u.prompt_tokens_details.cached_tokens,
      getattr(u, "cache_creation_input_tokens", 0))

Identical wire-format on the native Messages API — cache_control passes through to Claude untranslated:

messages_api.sh

# Native Anthropic Messages API — cache_control passes through untranslated.
curl https://api.brievio.com/v1/messages \
  -H "Authorization: Bearer $BRIEVIO_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "system": [
      {
        "type": "text",
        "text": "<your large stable system prompt>",
        "cache_control": { "type": "ephemeral" }
      }
    ],
    "messages": [
      { "role": "user", "content": "Summarize the doc above in 3 bullets." }
    ],
    "max_tokens": 1024
  }'

Stable bytes matter. The cache key is content-addressed. A single different character anywhere in the prefix invalidates the cache for that span. Put dynamic content (timestamps, user names, per-call IDs) after the cached block, not inside it.

#Pricing

Cache rates derive from each model's input price. Anthropic cache reads bill at 0.10× input. OpenAI and Gemini cache reads bill at 0.20× input. Brievio does not pass through Anthropic's upstream cache-write surcharge — cache writes are billed to you at the plain input rate, so there's no penalty for warming up a new cache entry.

OpenAI / Gemini — 0.20× cache read

Model	Input	Cache read
Claude Opus 4.7	$4.25	$0.85
Claude Opus 4.6	$4.25	$0.85
Claude Sonnet 4.6	$2.55	$0.51
Claude Sonnet 4.5	$2.55	$0.51
Claude Haiku 4.5	$0.85	$0.17
Gemini 2.5 Pro	$1.06	$0.2125
Gemini 2.5 Flash	$0.255	$0.051

▸

The catalog floor is cache-aware. Brievio bills max(catalog_cost, upstream × markup) per call — the floor is computed against the cached-input rate, not the fresh rate, so the savings reach you instead of being flattened.

The rates above are catalog prices. The real per-call bill is computed from the upstream credits the model actually consumed (with a markup), then floored at the catalog rate. On cache hits the upstream charges fewer credits — so the upstream-times-markup branch usually drops below catalog and the catalog branch wins. In short: the catalog rate is your worst-case price; cached calls may land below it but never above.

#Affinity routing keeps the cache warm

Upstream prompt caches are per-API-key. If every request lands on a random upstream key, the cache hit rate is ~1/N. Brievio derives a stable hash from your system prompt + first user message and routes all calls with the same prefix to the same upstream key — using weighted rendezvous hashing, so only ~1/N keys are remapped when the pool changes.

You don't need to configure anything. Affinity routing applies automatically to all chat calls. (Image / video / TTS use random routing — there is no per-key cache for those modalities.)

#Dashboard metrics

Once your calls start hitting the cache, /app/usage surfaces two rolled-up metrics for the selected window:

Hit rate — cached input tokens ÷ total input tokens. Exact, computed from per-call token counts.
Saved — USD saved this window, computed per model with its real cache-read rate (no flat-average approximations). This is an estimate at catalog rates — since real bills are floored at catalog and cached calls can land below the floor, the number is a tight upper bound on actual savings. For typical chat workloads the two are within a few percent.

Each call's detail page (/app/usage/<id>) shows the cached + cache-write token counts when they're non-zero, and the actual billed amount.

#Response fields

Brievio surfaces cache tokens in both response shapes. On the OpenAI-compatible /v1/chat/completions endpoint:

Field	Meaning
`usage.prompt_tokens`	Total input tokens (includes cached + cache-write).
`usage.prompt_tokens_details.cached_tokens`	Subset of input served from cache.
`usage.cache_creation_input_tokens`	Tokens written to cache this call (billed as fresh input).

On the native Messages API at /v1/messages, the Anthropic-original fields pass through:

Field	Meaning
`input_tokens`	Fresh (uncached) input.
`cache_read_input_tokens`	Served from cache (billed 0.10× input).
`cache_creation_input_tokens`	Written to cache this call (billed at the plain input rate — Brievio doesn't pass through Anthropic's upstream write surcharge).

#Where to go next

Messages API reference — native Anthropic shape with worked cache examples.
Billing & the ledger — how the cache-aware max() floor interacts with upstream cost.
Full pricing table — fresh input / output prices for every chat model, with the cache sub-table at the bottom.