Streaming is the difference between a chat box that sits dead for eight seconds and one that starts typing in under a second. The mechanics are the same whether the model behind your base_url is Claude, Gemini, or GPT: set stream=True, iterate the chunks, read the delta on each one, and stop at the [DONE] sentinel. Because Brievio speaks the OpenAI Chat Completions protocol, the exact same loop works across every genuine first-party model — you change the model string and nothing else.

This post covers how Server-Sent Events streaming actually works through an OpenAI-compatible endpoint, how to get accurate token usage on the final chunk with stream_options, the identical pattern in Python and Node, and the two silent failure modes that make a stream look fine while quietly betraying you: fake (buffered) streaming and missing usage.

What "streaming" means over HTTP

A non-streaming call is one request, one response: the server thinks for a few seconds, then hands you the whole completion at once. Streaming keeps the HTTP connection open and pushes the answer in pieces as the model generates it, using Server-Sent Events (SSE). On the wire, each piece arrives as a line that starts with data: followed by a JSON object, and the stream ends with a literal data: [DONE] line.
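As a rough illustration of that wire format, the sketch below parses a few hand-written SSE lines the way a raw client might. The sample payloads and the helper name iter_sse_payloads are made up for this example, not captured from a real response, and the parser only understands data: lines (the SSE format also allows comments, event names, and multi-line data, which are ignored here).

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON object from each `data:` line, stopping at [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # blank separators, comments, other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return                        # sentinel: the stream is finished
        yield json.loads(payload)


# Hand-written sample shaped like Chat Completions chunks (illustrative only).
SAMPLE_LINES = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
    'data: {"choices": [{"index": 0, "delta": {"content": "never read"}}]}',
]

payloads = list(iter_sse_payloads(SAMPLE_LINES))
print(len(payloads))  # 3: everything after [DONE] is left unread
```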

You almost never parse that text yourself; the SDK does it. What you get in code is an iterable of chunks. Each chunk looks like a normal completion object except the content lives in choices[0].delta instead of choices[0].message, and it holds only the fragment generated since the last chunk. Concatenate every delta.content in order and you have rebuilt the full message. The one metric that matters here is time to first token (TTFT): how long before the first non-empty delta shows up. That number is the entire reason to stream.
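To make the delta-versus-message distinction concrete, here is a minimal sketch that rebuilds a message from chunks represented as plain dicts. The chunk contents, the token counts, and the helper name rebuild_message are invented for illustration; real SDK chunks are objects with the same field names.

```python
from typing import Iterable


def rebuild_message(chunks: Iterable[dict]) -> str:
    """Concatenate every non-empty delta.content, in arrival order."""
    parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue                          # e.g. the usage-only chunk
        delta = choices[0].get("delta") or {}
        content = delta.get("content")
        if content:                           # skips None and "" (role-only chunk)
            parts.append(content)
    return "".join(parts)


# Illustrative chunks: role-only opener, two fragments, finish marker, usage.
chunks = [
    {"choices": [{"delta": {"role": "assistant", "content": ""}}]},
    {"choices": [{"delta": {"content": "Raft elects "}}]},
    {"choices": [{"delta": {"content": "a leader."}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
    {"choices": [], "usage": {"prompt_tokens": 12, "completion_tokens": 5, "total_tokens": 17}},
]

print(rebuild_message(chunks))  # Raft elects a leader.
```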

The Python pattern

Here is the whole thing — a real streaming loop that also captures usage. The only Brievio-specific line is the base_url:

stream.py

from openai import OpenAI

client = OpenAI(
    api_key="sk-brievio-...",
    base_url="https://api.brievio.com/v1",   # one base_url, real first-party models
)

# stream=True flips the response into a Server-Sent Events stream.
# You iterate the object; each element is one chunk with a partial "delta".
stream = client.chat.completions.create(
    model="claude-sonnet-4-6",               # or gemini-2.5-flash, gpt-..., etc.
    messages=[{"role": "user", "content": "Explain Raft consensus in 200 words."}],
    stream=True,
    stream_options={"include_usage": True},  # ask for usage on the FINAL chunk
)

usage = None
for chunk in stream:
    # The last data event before [DONE] carries usage and an empty choices list.
    if chunk.usage is not None:
        usage = chunk.usage
        continue
    if not chunk.choices:                    # guard: a chunk may arrive with no choices
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)     # render tokens as they land

print()
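# usage is populated only because include_usage was set. These are real counts.
# It stays None if the gateway ignores include_usage, so guard before reading it.
if usage is not None:
    print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
else:
    print("no usage chunk received")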

Three details that trip people up. First, the content delta can be None or empty on some chunks (the opening chunk often just sets the role), so guard before you print. Second, the chunk that carries usage comes after the content is done and has an empty choices list — that is why the example checks chunk.usage first and continues. Third, you do not look for [DONE] yourself; the SDK consumes that sentinel and ends the iterator for you. If you were calling the endpoint with raw requests or fetch, then you would split on newlines and break on [DONE] manually.

The same loop in Node

The Node SDK exposes the stream as an async iterable, so the structure is identical — for await ... of instead of for, and process.stdout.write instead of a flushing print:

stream.mjs

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "sk-brievio-...",
  baseURL: "https://api.brievio.com/v1",
});

// Same contract in Node: stream=true returns an async iterable of chunks.
const stream = await client.chat.completions.create({
  model: "claude-sonnet-4-6",
  messages: [{ role: "user", content: "Explain Raft consensus in 200 words." }],
  stream: true,
  stream_options: { include_usage: true },
});

let usage = null;
for await (const chunk of stream) {
  // Final event: choices is empty, usage is present.
  if (chunk.usage) {
    usage = chunk.usage;
    continue;
  }
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta); // flush each token to the terminal
}

console.log();
console.log(usage?.prompt_tokens, usage?.completion_tokens, usage?.total_tokens);

Note the optional chaining (chunk.choices[0]?.delta?.content). On the final usage-bearing chunk, choices is empty, so chunk.choices[0] is undefined, and reading .delta from it without the guard would throw a TypeError right at the finish line. The continue above already skips that chunk, but the guard also covers any other chunk that arrives with no choices. This is a common reason a Node streaming handler crashes on the last event after appearing to work perfectly for the whole response.

Getting real usage on the last chunk

By default a streaming response does not include token counts — there is no usage object on any chunk. That is a deliberate part of the OpenAI protocol, and it bites teams who stream in production and then can't reconcile their bill. The fix is one parameter:

Set stream_options={"include_usage": True} (Python) or stream_options: { include_usage: true } (Node).
The server then emits one extra chunk just before [DONE] whose choices is empty and whose usage holds prompt_tokens, completion_tokens, and total_tokens.
On Brievio those are the genuine counts reported by the model — the same numbers you would get from a non-streaming call, billed at roughly 15% under the official rate. There is no padded usage object and no injected system prompt inflating the prompt side.
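One way to check the "same numbers" claim is to run a prompt once streamed and once not, then compare the two usage objects. The sketch below assumes both objects expose prompt_tokens and completion_tokens as attributes, as the SDK's usage objects do; the helper name usage_problems and the sample numbers are hypothetical. Completion counts are only compared when the two responses produced the same text, since two generations of different length will legitimately differ.

```python
from types import SimpleNamespace
from typing import Any, List, Optional


def usage_problems(streamed: Optional[Any], plain: Any, same_text: bool) -> List[str]:
    """Return human-readable mismatches between streamed and non-streamed usage."""
    if streamed is None:
        return ["streamed call returned no usage (include_usage not honored?)"]
    problems = []
    if streamed.prompt_tokens != plain.prompt_tokens:
        problems.append(
            f"prompt_tokens differ: {streamed.prompt_tokens} streamed "
            f"vs {plain.prompt_tokens} non-streamed"
        )
    if same_text and streamed.completion_tokens != plain.completion_tokens:
        problems.append(
            f"completion_tokens differ for identical text: "
            f"{streamed.completion_tokens} vs {plain.completion_tokens}"
        )
    return problems


# Stand-ins for the usage objects of two real calls (numbers are invented).
plain_usage = SimpleNamespace(prompt_tokens=18, completion_tokens=240)
streamed_usage = SimpleNamespace(prompt_tokens=18, completion_tokens=240)
padded_usage = SimpleNamespace(prompt_tokens=418, completion_tokens=240)

print(usage_problems(streamed_usage, plain_usage, same_text=True))  # []
print(usage_problems(padded_usage, plain_usage, same_text=True))
```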

If you skip include_usage and still need a token estimate, your only option is to count locally with the model's tokenizer — which is approximate and a maintenance burden. Just set the flag.

The silent breaks: fake streaming and missing usage

Two failure modes pass a casual eyeball test and only show up under scrutiny, and a third, the mid-stream error that looks like a clean end, belongs on the same checklist. Each is worth a 20-second check before you trust a gateway with real traffic.

Buffered "fake" streaming. Some gateways accept stream=True, wait for the entire upstream completion, then replay it to you as a burst of chunks at the end. Your loop runs, deltas arrive, everything looks streamed, but time to first token is identical to a non-streaming call because nothing was sent until the model finished. The tell is simple: time the gap between sending the request and the first non-empty delta. On genuine streaming it typically lands in well under a second; on a buffered replay it equals the full generation time. If first-token latency tracks total latency, you are not streaming, you are watching a recording.
Missing or fabricated usage. A gateway that doesn't honor include_usage leaves you with no token counts on streamed calls, so you reconcile your invoice against thin air. Worse, a dishonest one can attach a usage object with inflated numbers, because on a stream the client rarely re-counts. Verify it the boring way: run the same prompt once streamed and once not, and compare the streamed final-chunk usage with the non-streamed usage. prompt_tokens should be identical. completion_tokens should match whenever the two responses are the same text, which is most likely with temperature set to 0 and a short answer.
Mid-stream errors that look like a clean end. If the upstream model errors halfway, a correct gateway surfaces it as an exception in your loop, not a silent truncation. Always check that you received a finish reason (or the usage chunk) before treating the text as complete — a stream that just stops is not the same as a stream that finished.
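The latency and completeness checks above can be folded into one small helper. This is a sketch under stated assumptions, not a definitive test: audit_stream, StreamReport, and fake_chunk are hypothetical names, the 0.8 ratio used to flag buffering is an arbitrary threshold, and the fake chunks only mimic the attributes the loop reads (choices, delta.content, finish_reason, usage).

```python
import time
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Callable, Iterable, Optional


@dataclass
class StreamReport:
    text: str
    first_token_s: Optional[float]  # request start -> first non-empty delta
    total_s: float                  # request start -> iterator exhausted
    finish_reason: Optional[str]
    usage: Any

    @property
    def complete(self) -> bool:
        # A stream that just stops never delivers a finish_reason.
        return self.finish_reason is not None

    @property
    def looks_buffered(self) -> bool:
        # Heuristic (assumed threshold): first token arrives near the very end.
        if self.first_token_s is None or self.total_s <= 0:
            return False
        return self.first_token_s / self.total_s > 0.8


def audit_stream(make_stream: Callable[[], Iterable[Any]],
                 clock: Callable[[], float] = time.perf_counter) -> StreamReport:
    start = clock()                 # start BEFORE the request is sent
    first, finish, usage, parts = None, None, None, []
    for chunk in make_stream():
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        if not chunk.choices:
            continue
        choice = chunk.choices[0]
        if getattr(choice, "finish_reason", None) is not None:
            finish = choice.finish_reason
        content = choice.delta.content
        if content:
            if first is None:
                first = clock() - start
            parts.append(content)
    return StreamReport("".join(parts), first, clock() - start, finish, usage)


def fake_chunk(content=None, finish_reason=None, usage=None, empty=False):
    """Stand-in for an SDK chunk, used only so the sketch runs offline."""
    choice = SimpleNamespace(delta=SimpleNamespace(content=content),
                             finish_reason=finish_reason)
    return SimpleNamespace(choices=[] if empty else [choice], usage=usage)


def truncated_stream():
    yield fake_chunk("Raft is ")
    yield fake_chunk("a consen")    # upstream died: no finish_reason, no usage


report = audit_stream(truncated_stream)
print(repr(report.text), report.complete)  # 'Raft is a consen' False
```

With the real SDK the call would be along the lines of audit_stream(lambda: client.chat.completions.create(..., stream=True, stream_options={"include_usage": True})). Passing a callable lets the clock start before the request goes out, so time spent waiting on a gateway that buffers before sending anything is counted against first-token latency.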

The takeaway

Streaming over an OpenAI-compatible endpoint is four moving parts: stream=True, iterate chunks, read each delta, and let the SDK handle [DONE]. Add stream_options={"include_usage": True} and you also get honest token counts on the final chunk. The same fifteen lines work unchanged across Claude Sonnet 4.6, Gemini 2.5 Flash, and the GPT family behind one base_url — swap the model string, keep the loop.

Before you ship, measure time to first token and diff streamed-vs-non-streamed usage. Real streaming gives you a sub-second first token and matching counts; a buffered replay gives itself away in the latency. On Brievio, failed 4xx/5xx calls aren't billed, so failed attempts while you run these checks cost nothing. See the Chat Completions reference for the full parameter list, the rest of the API docs for tools and vision over the same stream, the guide to calling Claude with the OpenAI SDK for the non-streaming basics, and the model catalog for every slug you can point this loop at.

Streaming Claude, Gemini and GPT with the OpenAI SDK (SSE)
