"OpenAI-compatible" is the most overloaded phrase in the AI infra market. It can mean "you can point the OpenAI SDK at our URL and a basic chat call returns" — which is the easy 80% — or it can mean "every field, every streaming event, every tool-call round trip and every usage number behaves the way your code already expects." The gap between those two is where production incidents live. This post is the field guide: what genuinely has to match for your existing code to work unchanged, what behaves identically across upstreams, and what silently differs when the model behind the OpenAI shape is actually Claude (Anthropic) or Gemini (Google) instead of GPT.
Brievio is an OpenAI-compatible gateway in front of the genuine first-party models, so this is written from the translator's seat — the layer that has to make Anthropic's Messages API and Google's Vertex API both come out the OpenAI-shaped pipe. I'll be specific about where the abstraction is clean and where it leaks, because pretending it never leaks is how you get paged at 2am.
What "compatible" actually has to mean
Compatibility isn't a marketing checkbox; it's a contract with the SDK you already imported. The OpenAI Python and Node libraries make hard assumptions about the wire format. A gateway is only compatible if it honors all of them:
- The request schema. POST /v1/chat/completions with model, messages (a list of role/content objects), and the optional knobs: temperature, max_tokens, top_p, stop, tools, response_format. Unknown params should be accepted and ignored, not 400'd.
- The response envelope. An object with id, object: "chat.completion", model, choices[] (each with message, index, finish_reason), and a usage block. SDKs deserialize into typed objects; miss a field and resp.choices[0].message.content throws on someone else's machine.
- The streaming protocol. Server-Sent Events with the data: [DONE] sentinel and per-token delta objects. This is the single most common thing "compatible" gateways get subtly wrong.
- Error shapes and HTTP codes. A 429 has to look like a rate limit, a 400 has to carry an error object with a type and message. Retry and backoff logic in the SDK keys off these.
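The envelope and error-shape requirements above can be turned into a quick structural check. What follows is a minimal sketch, assuming the response bodies have already been decoded into plain dicts. The helper names missing_completion_fields and looks_like_openai_error are hypothetical, and the field lists cover only what this section names, not the full OpenAI schema.

```python
# Minimal structural checks for an OpenAI-shaped gateway response.
# Assumption: bodies are already JSON-decoded dicts. Helper names are
# hypothetical; the required fields are only the ones listed above.

REQUIRED_TOP_LEVEL = ("id", "object", "model", "choices", "usage")
REQUIRED_CHOICE = ("message", "index", "finish_reason")


def missing_completion_fields(body: dict) -> list[str]:
    """Return dotted paths of expected fields that are absent."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in body]
    if body.get("object") != "chat.completion":
        missing.append("object == 'chat.completion'")
    for i, choice in enumerate(body.get("choices") or []):
        for key in REQUIRED_CHOICE:
            if key not in choice:
                missing.append(f"choices[{i}].{key}")
    return missing


def looks_like_openai_error(status: int, body: dict) -> bool:
    """True if a 4xx/5xx body carries an error object with type and message."""
    error = body.get("error")
    return (
        status >= 400
        and isinstance(error, dict)
        and "type" in error
        and "message" in error
    )


good = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "model": "claude-sonnet-4-6",
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "hi"},
         "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 1, "total_tokens": 6},
}
bad = {"id": "x", "object": "chat.completion", "choices": [{"index": 0}]}

print(missing_completion_fields(good))
print(missing_completion_fields(bad))
```

Running a check like this against a captured response from each model you plan to ship is cheaper than finding the missing field through a deserialization error later.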
Here's the baseline — the part everyone gets right. Two lines change and the call returns a normal completion object:
# The whole point: change two lines, keep your code.
# Same SDK, same request shape, same response object — different model behind it.
from openai import OpenAI
client = OpenAI(
    api_key="sk-brievio-...",
    base_url="https://api.brievio.com/v1",  # was https://api.openai.com/v1
)
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",  # was gpt-4o
    messages=[
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "Summarize the CAP theorem in two sentences."},
    ],
    temperature=0.2,
    max_tokens=300,
)
print(resp.choices[0].message.content)
print(resp.usage) # prompt_tokens / completion_tokens / total_tokens — same fields
# resp.id, resp.model, resp.choices[0].finish_reason all present and shaped like OpenAI.

If a gateway can't do at least this, walk away. But this is table stakes, not the finish line. The interesting question is what happens when you turn on the features your real app uses.
Streaming: where compatibility quietly leaks
Streaming is the feature most likely to be technically present and practically broken. The SDK's streaming iterator expects three things: a text/event-stream content type, deltas arriving incrementally on choices[0].delta.content, and a literal data: [DONE] line to close the stream. Get any of them wrong and the symptom is maddening — works in your curl test, hangs in production.
# Streaming is where naive "compatibility" leaks. The contract you depend on:
# - Content-Type: text/event-stream
# - each event is "data: {json}\n\n", deltas arrive on choices[0].delta.content
# - the stream ends with a literal "data: [DONE]\n\n" sentinel
stream = client.chat.completions.create(
    model="gemini-2-5-pro",
    messages=[{"role": "user", "content": "Explain B-trees in one paragraph."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:  # the usage-only chunk (include_usage) has no choices
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
# Things that break clients if a gateway gets them wrong:
# - buffering the whole response then flushing it as one chunk (not real streaming)
# - omitting the [DONE] sentinel (some SDKs hang waiting for it)
# - usage only at the end: pass stream_options={"include_usage": True} to get it.

The most common "fake streaming" failure is a gateway that calls the upstream, waits for the entire response, then emits it as one or two big chunks. The SDK doesn't error; you just lose the whole point of streaming (time-to-first-token stays terrible). A real gateway holds the connection open to the upstream and forwards each token as it arrives. For Claude that means translating Anthropic's content_block_delta events into OpenAI chat.completion.chunk events on the fly; for Gemini, the same job against Vertex's streaming format. The output looks identical to your code, but the machinery underneath is doing real per-event translation.
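To make the per-event translation concrete, here is a minimal sketch of the Claude direction. It assumes the Anthropic events are already parsed into dicts and handles only text deltas; the function names, the caller-supplied chunk_id, created and model values are illustrative, and this is not a description of how Brievio implements it.

```python
# Sketch: map one Anthropic streaming event to an OpenAI-style chunk.
# Assumptions: events are already parsed into dicts; only text deltas are
# handled; chunk_id, created and model are supplied by the caller.
import json
from typing import Optional


def anthropic_event_to_chunk(event: dict, chunk_id: str, created: int,
                             model: str) -> Optional[dict]:
    """Return a chat.completion.chunk dict, or None if the event has no text."""
    if event.get("type") != "content_block_delta":
        return None
    delta = event.get("delta", {})
    if delta.get("type") != "text_delta":
        return None
    return {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "created": created,
        "model": model,
        "choices": [
            {"index": 0, "delta": {"content": delta["text"]},
             "finish_reason": None}
        ],
    }


def to_sse(chunk: dict) -> str:
    """Frame a chunk as one Server-Sent Event."""
    return f"data: {json.dumps(chunk)}\n\n"


event = {"type": "content_block_delta", "index": 0,
         "delta": {"type": "text_delta", "text": "Hello"}}
chunk = anthropic_event_to_chunk(event, "chatcmpl-demo", 0, "claude-sonnet-4-6")
print(to_sse(chunk), end="")
```

The important property is that each upstream event produces its own outgoing frame immediately; nothing in this path waits for the full response.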
One genuine difference to know: usage in streamed responses. OpenAI only includes the usage block if you pass stream_options={"include_usage": true}, and it arrives on one extra final chunk whose choices list is empty, so guard before indexing choices[0]. A good gateway honors that flag against every upstream so your token-accounting code doesn't have to special-case the model. See the full streaming contract in the chat completions docs.
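The client side of that contract fits in a few lines. This is a rough sketch, assuming you are reading already-decoded text lines from the HTTP body yourself rather than using the SDK; read_stream is a hypothetical name and the sample events are made up for illustration. It collects the text, picks up usage from the final usage-only chunk (which carries an empty choices list), and reports whether the [DONE] sentinel was seen.

```python
# Sketch: consume OpenAI-style SSE lines, collecting text and usage.
# Assumption: `lines` yields already-decoded text lines from the HTTP body.
import json
from typing import Iterable, Optional, Tuple


def read_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict], bool]:
    """Return (text, usage, saw_done) for one streamed completion."""
    parts = []
    usage = None
    saw_done = False
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            saw_done = True
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
        # The usage-only chunk has an empty choices list, so iterate
        # instead of indexing choices[0].
        for choice in chunk.get("choices") or []:
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts), usage, saw_done


sample = [
    'data: {"choices": [{"index": 0, "delta": {"content": "B-"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "trees"}}]}',
    "",
    'data: {"choices": [], "usage": {"prompt_tokens": 9, "completion_tokens": 2, "total_tokens": 11}}',
    "",
    "data: [DONE]",
]
text, usage, saw_done = read_stream(sample)
print(text, usage, saw_done)
```

A saw_done value of False after the connection closes is a useful signal in a compatibility test: it means the gateway ended the stream without the sentinel.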
Tools and function calling: same shape, different engine
Tool calling is the feature where the OpenAI abstraction earns its keep — because the three providers have completely different native formats, and a gateway hides all of it. You send OpenAI's tools array; you get back tool_calls on the message. What happens in between is a real translation:
# Tool / function calling: the request side matches OpenAI exactly.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)
# The model returns choices[0].message.tool_calls — a list, each with an id,
# .function.name and .function.arguments (a JSON *string* you must json.loads).
for call in resp.choices[0].message.tool_calls or []:
    print(call.id, call.function.name, call.function.arguments)
# You then append a {"role": "tool", "tool_call_id": call.id, "content": result}
# message and call again. That loop is identical whether the upstream is
# Claude or Gemini — the gateway maps each provider's native tool format
# to OpenAI's tool_calls on the way out, and back on the way in.

Under the hood, Anthropic returns tool_use content blocks with an input object; Gemini returns functionCall parts with args. The gateway maps both onto OpenAI's tool_calls[] shape, including the detail that OpenAI delivers arguments as a JSON string you must json.loads, not a parsed object. Your tool-execution loop (read the calls, run the functions, append role: "tool" messages, call again) is byte-for-byte the same regardless of which family you target. That is the whole value proposition: write the agent once, swap models with a string.
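The outbound half of that mapping might look roughly like the sketch below. It assumes the native blocks are already parsed dicts with the fields named above (tool_use with id, name, input; functionCall with name, args). The function names are hypothetical, and the synthesized call_ id for Gemini parts that arrive without one is an assumption made for illustration, not a documented gateway behavior.

```python
# Sketch: normalize native tool calls to OpenAI's tool_calls entries.
# Assumptions: inputs are parsed dicts; the synthesized id format used when
# a Gemini part has no id is hypothetical.
import json
import uuid


def _tool_call(call_id: str, name: str, arguments: dict) -> dict:
    return {
        "id": call_id,
        "type": "function",
        # OpenAI carries arguments as a JSON string, not a parsed object.
        "function": {"name": name, "arguments": json.dumps(arguments)},
    }


def from_anthropic(block: dict) -> dict:
    """Map a tool_use content block (id, name, input)."""
    return _tool_call(block["id"], block["name"], block.get("input", {}))


def from_gemini(part: dict) -> dict:
    """Map a functionCall part (name, args), generating an id if absent."""
    call = part["functionCall"]
    call_id = call.get("id") or f"call_{uuid.uuid4().hex[:12]}"
    return _tool_call(call_id, call["name"], call.get("args", {}))


a = from_anthropic({"type": "tool_use", "id": "toolu_01",
                    "name": "get_weather", "input": {"city": "Tokyo"}})
g = from_gemini({"functionCall": {"name": "get_weather",
                                  "args": {"city": "Tokyo"}}})
print(a["function"]["arguments"])
print(g["function"]["name"], g["id"].startswith("call_"))
```

Whatever id the gateway hands out has to be remembered, because the role: "tool" message you send back refers to it and must be mapped to the upstream's own result format on the way in.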
The honest caveats, because they exist:
- Parallel tool calls. All three families can request multiple tools in one turn, but they differ on how aggressively they do it for a given prompt. Don't assume the exact number or ordering ports across models — handle a list, not a fixed count.
- Strict / structured tool schemas. OpenAI's strict: true JSON-schema enforcement is an OpenAI-model feature. On Claude and Gemini the gateway passes your schema as the tool definition and the model adheres closely, but the guarantee is the upstream's, not something the gateway can fabricate.
- tool_choice nuances. auto and forcing a specific function are well supported everywhere; exotic combinations are worth a quick test on each model you actually ship.
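In practice, "handle a list, not a fixed count" and "the guarantee is the upstream's" both point at the same defensive loop. Here is a minimal sketch, assuming the tool calls have been converted to plain dicts in OpenAI's shape. REGISTRY and this local get_weather are hypothetical stand-ins for your own functions, and the returned forecast is a placeholder rather than real data.

```python
# Sketch: run every requested tool call and build the follow-up messages.
# Assumptions: tool_calls are plain dicts in OpenAI's shape; REGISTRY and
# get_weather are hypothetical stand-ins for your own functions.
import json


def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "placeholder"}  # no real lookup


REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Return one role='tool' message per call, in the order received."""
    messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        try:
            # arguments is a JSON string; without strict enforcement it may
            # be malformed or carry unexpected keys, so parse defensively.
            args = json.loads(call["function"]["arguments"] or "{}")
            result = REGISTRY[name](**args)
        except (KeyError, TypeError, json.JSONDecodeError) as exc:
            # Report the failure to the model instead of crashing the loop.
            result = {"error": f"{type(exc).__name__}: {exc}"}
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return messages


calls = [
    {"id": "call_1", "function": {"name": "get_weather",
                                  "arguments": '{"city": "Tokyo"}'}},
    {"id": "call_2", "function": {"name": "get_weather",
                                  "arguments": '{"city": "Osaka"}'}},
    {"id": "call_3", "function": {"name": "unknown_tool", "arguments": "{}"}},
]
for message in run_tool_calls(calls):
    print(message["tool_call_id"], message["content"])
```

Every call gets exactly one reply keyed by its id, which keeps the conversation valid whether the model asked for one tool or several.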
Vision and JSON mode: pass-through, with edges
Vision uses OpenAI's multimodal content-parts format — a list mixing text and image_url entries. Against a model that natively sees images (Gemini 2.5 Pro/Flash, the Claude family), the gateway forwards the image and the multimodal call just works. JSON mode — response_format: { type: "json_object" } — constrains the output to a parseable object:
# Vision: OpenAI's multimodal content-parts format, passed through to a
# model that natively supports images. URL or base64 data URI both work.
resp = client.chat.completions.create(
    model="gemini-2-5-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this chart? Give me the trend."},
            {"type": "image_url", "image_url": {
                "url": "https://example.com/q3-revenue.png",
            }},
        ],
    }],
)
print(resp.choices[0].message.content)
# JSON mode — ask for a guaranteed-parseable object:
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Extract name and email as JSON."}],
    response_format={"type": "json_object"},
)
import json
data = json.loads(resp.choices[0].message.content)  # parses, provided the reply was not cut off by max_tokens

Where the edges are: image input limits (max dimensions, max number of images per request, accepted MIME types) are set by each upstream, not invented by the gateway, so a 50MB TIFF that Gemini rejects will be rejected behind the OpenAI shape too, with a translated error. And json_object mode guarantees valid JSON, not JSON that matches your specific schema; if you need a particular structure, describe it in the prompt and validate after parsing. These aren't gateway bugs. They're the underlying model's contract showing through, which is exactly what you want a faithful translator to preserve.
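"Validate after parsing" can be as small as the sketch below. It assumes the prompt asked for an object with string fields name and email, matching the extraction example above; parse_contact is a hypothetical helper and the email check is deliberately crude.

```python
# Sketch: validate the shape of a json_object response after parsing.
# Assumption: the prompt asked for {"name": str, "email": str}; the
# function name and the email check are illustrative only.
import json


def parse_contact(raw: str) -> dict:
    """Parse model output and enforce the structure the prompt described."""
    data = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in ("name", "email"):
        if not isinstance(data.get(key), str) or not data[key].strip():
            raise ValueError(f"missing or empty field: {key}")
    if "@" not in data["email"]:
        raise ValueError("email does not look like an address")
    return {"name": data["name"].strip(), "email": data["email"].strip()}


print(parse_contact('{"name": "Ada Lovelace", "email": "ada@example.com"}'))
```

Because json.JSONDecodeError is a subclass of ValueError, a single except ValueError around the call covers both malformed JSON and a well-formed object with the wrong shape.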
Embeddings, and the things that genuinely don't port
Two more surfaces worth naming honestly. Embeddings (/v1/embeddings) are simple and stable — but vectors are not interchangeable across models. A Gemini embedding and an OpenAI embedding live in different spaces with different dimensionality; you cannot mix them in one index or compare their cosine similarities. Pick one embedding model and re-embed your whole corpus if you switch. The API is compatible; the math is not.
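One way to keep that rule from being broken by accident is to store the producing model next to every vector and refuse cross-model comparisons. A minimal sketch, where the Embedding class and the model names in the demo are hypothetical:

```python
# Sketch: refuse to compare embeddings that came from different models.
# Assumption: each vector is stored with the model name that produced it;
# the Embedding class and the model names below are hypothetical.
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Embedding:
    model: str
    vector: tuple[float, ...]


def cosine(a: Embedding, b: Embedding) -> float:
    """Cosine similarity, defined only within a single embedding model."""
    if a.model != b.model:
        raise ValueError(f"cannot compare {a.model} with {b.model}")
    if len(a.vector) != len(b.vector):
        raise ValueError("dimension mismatch")
    dot = sum(x * y for x, y in zip(a.vector, b.vector))
    norm = math.sqrt(sum(x * x for x in a.vector)) * math.sqrt(
        sum(y * y for y in b.vector))
    if norm == 0:
        raise ValueError("zero vector")
    return dot / norm


same = cosine(Embedding("model-a", (1.0, 0.0)), Embedding("model-a", (1.0, 0.0)))
print(same)
try:
    cosine(Embedding("model-a", (1.0, 0.0)), Embedding("model-b", (1.0, 0.0)))
except ValueError as exc:
    print("refused:", exc)
```

Note that the second comparison is refused even though the dimensions happen to match: equal length does not mean the two models share a vector space.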
And the leaks that no amount of compatibility shimming can paper over — the provider-specific features that simply have no OpenAI field to carry them:
- Anthropic prompt caching. The native cache_control breakpoints live on Anthropic's Messages API. Over the OpenAI shape you get OpenAI-style automatic prefix caching instead; to drive caching explicitly you use the native /v1/messages endpoint. (Both work on Brievio; see the API docs.)
- Tokenizers differ per family. "1,000 tokens" is not the same string length across GPT, Claude, and Gemini; each has its own tokenizer. So max_tokens budgets and your cost estimates shift when you swap models, even though the field name didn't. A good gateway reports each upstream's honest token counts in usage; it can't make three tokenizers agree, and you shouldn't trust one that pretends they do.
- Extended thinking / reasoning. Claude's extended thinking and Gemini's thinking modes surface differently than OpenAI reasoning. The content comes through; the exact field plumbing is model-specific, so don't hard-code one provider's reasoning shape across all of them.
- System-prompt semantics. All three accept a system message, but they weight and truncate it slightly differently. Behavior ports; it isn't bit-identical. Test your prompts per model.
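The tokenizer point has a practical consequence for accounting code: token counts from different families should be tallied separately, never summed into one number. A small sketch of that idea, where UsageLedger is a hypothetical helper and the counts in the demo are invented sample values, not measurements:

```python
# Sketch: keep token accounting separate per model, since each family
# counts tokens with its own tokenizer.
# Assumptions: UsageLedger is a hypothetical helper; the usage numbers in
# the demo are made-up sample values.
from collections import defaultdict


class UsageLedger:
    """Accumulate reported token counts per model, never across models."""

    def __init__(self) -> None:
        self._totals = defaultdict(
            lambda: {"prompt_tokens": 0, "completion_tokens": 0})

    def record(self, model: str, usage: dict) -> None:
        counts = self._totals[model]
        counts["prompt_tokens"] += usage.get("prompt_tokens", 0)
        counts["completion_tokens"] += usage.get("completion_tokens", 0)

    def totals(self) -> dict:
        return {model: dict(counts) for model, counts in self._totals.items()}


ledger = UsageLedger()
ledger.record("claude-sonnet-4-6", {"prompt_tokens": 120, "completion_tokens": 40})
ledger.record("gemini-2-5-pro", {"prompt_tokens": 100, "completion_tokens": 35})
ledger.record("claude-sonnet-4-6", {"prompt_tokens": 30, "completion_tokens": 10})
print(ledger.totals())
```

Cost estimates then apply each model's own price to its own row, which stays correct when you swap the model string.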
How a good gateway normalizes all this
The job of the compatibility layer is to be a faithful, lossless translator in the common path and an honest one at the edges. Concretely, that means: map the request schema both directions; translate streaming events token-by-token, sentinel included; convert each provider's native tool format to and from tool_calls; preserve finish_reason semantics; pass real images through to vision-capable models; and, the part that's easy to cheat on, report the upstream's actual token counts rather than a padded number. On Brievio the models behind the shape are the genuine first-party ones, traceable to AWS Bedrock and Google Vertex, so the behavior you're normalizing is the real model's behavior, not a cheaper stand-in. If you want to confirm that for yourself, the four tests in "Is your Claude really Claude" take about a minute.
Two principles fall out of all this for anyone building on a "compatible" endpoint. First, test the features you actually use: a passing chat call tells you nothing about whether streaming flushes incrementally or whether tool-call IDs round trip. Second, respect the leaks: tokenizers, embedding spaces, caching syntax, and reasoning shapes are upstream properties, and the gateway that's honest about them is the one you can trust in production. Compatibility is a spectrum, and the useful part of it is the part that survives your real workload, not the part that survives a demo.
The concrete takeaway
Point the OpenAI SDK at https://api.brievio.com/v1, change the model string, and run your existing test suite. Not a hello-world: your suite. Exercise streaming with include_usage, run one tool-call round trip, send one image, request one json_object. If all four pass on the model you intend to ship, the migration is genuinely two lines. Where you need a provider-specific feature (explicit Anthropic caching, native reasoning controls), drop to the native endpoints for that path and keep the OpenAI shape everywhere else. Want the step-by-step port from an existing OpenAI codebase? Start with "Calling Claude with the OpenAI SDK", then browse the model list to pick what runs behind the shape.
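If your suite doesn't cover those four paths yet, a harness along these lines is one place to start. It is a sketch under assumptions: create is any callable with the call shape of client.chat.completions.create, the prompts, tool schema and image URL are placeholders, and "more than one content chunk" is only a rough proxy for incremental flushing, not proof of it.

```python
# Sketch: the four checks above as one harness.
# Assumptions: `create` has the call shape of client.chat.completions.create;
# prompts, the tool schema and the image URL are placeholders.
import json
from typing import Callable

TOOLS = [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]},
}}]


def check_streaming(create: Callable, model: str) -> bool:
    stream = create(model=model, stream=True,
                    stream_options={"include_usage": True},
                    messages=[{"role": "user", "content": "Count to five."}])
    pieces, usage = 0, None
    for chunk in stream:
        usage = getattr(chunk, "usage", None) or usage
        if chunk.choices and chunk.choices[0].delta.content:
            pieces += 1
    # Several content chunks plus a usage block; a rough proxy only.
    return pieces > 1 and usage is not None


def check_tool_round_trip(create: Callable, model: str) -> bool:
    question = {"role": "user", "content": "What's the weather in Tokyo?"}
    first = create(model=model, tools=TOOLS, tool_choice="auto",
                   messages=[question])
    message = first.choices[0].message
    calls = message.tool_calls or []
    if not calls:
        return False
    replies = []
    for call in calls:
        json.loads(call.function.arguments)  # must be a valid JSON string
        replies.append({"role": "tool", "tool_call_id": call.id,
                        "content": json.dumps({"forecast": "placeholder"})})
    final = create(model=model, tools=TOOLS,
                   messages=[question, message, *replies])
    return bool(final.choices[0].message.content)


def check_image(create: Callable, model: str, image_url: str) -> bool:
    resp = create(model=model, messages=[{"role": "user", "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": image_url}},
    ]}])
    return bool(resp.choices[0].message.content)


def check_json_mode(create: Callable, model: str) -> bool:
    resp = create(model=model, response_format={"type": "json_object"},
                  messages=[{"role": "user",
                             "content": "Return an object with key ok as JSON."}])
    return isinstance(json.loads(resp.choices[0].message.content), dict)


def run_smoke_checks(create: Callable, model: str,
                     image_url: str = "https://example.com/q3-revenue.png") -> dict:
    return {
        "streaming": check_streaming(create, model),
        "tool_round_trip": check_tool_round_trip(create, model),
        "image": check_image(create, model, image_url),
        "json_object": check_json_mode(create, model),
    }
```

With the real SDK you would pass client.chat.completions.create as create and the model string you intend to ship; any exception raised inside a check is itself a finding worth reading rather than swallowing.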