The demo worked on the first call. Production is the other 999,999 calls — the ones that hit a rate limit during a traffic spike, catch an upstream 503 mid-deploy, or hang on a socket that never answers. The difference between a toy integration and a reliable one is almost entirely in how you handle the unhappy path: what you retry, how long you wait, and when you give up. Get it wrong and a brief provider hiccup turns into a self-inflicted outage, as every one of your workers retries in perfect unison and finishes the job the rate limiter started.

This is a practical, production-grade walkthrough: classify errors so you only retry what's retryable, back off exponentially with jitter, honor Retry-After, set sane timeouts, make retries safe with idempotency, and stop hammering a dead dependency with a circuit breaker. The code is OpenAI-SDK Python against https://api.brievio.com/v1, but the rules are the same on any HTTP client and any language.

Retry the retryable — and nothing else

The single most common bug in AI-API error handling is retrying things that will never succeed. A 401 (bad key), a 400 (malformed request), a 422 (your prompt is too long) — these are deterministic. The second attempt fails exactly like the first, except now it's five attempts and several seconds later. Worse, you've hidden a real bug behind a retry loop. The only 4xx worth retrying is 429 (rate limited), because that one is transient.

Retry: 429, and 500 / 502 / 503 / 504 — these are transient server- or capacity-side conditions.
Retry: connection errors and read timeouts (the request may never have reached the model, or the model answered into a closed socket).
Never retry: any other 4xx — 400, 401, 403, 404, 422. Surface them, alert on them, fix the caller.

A useful instinct: a retry is a bet that the same request will get a different answer. That's only true when the failure was about timing or capacity, not about the request itself.
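The classification above fits in one small predicate. What follows is a minimal sketch rather than a definitive rule set: the name is_retryable and the convention of passing None when no HTTP response arrived are assumptions made for illustration.

```python
from typing import Optional

# Status codes treated as transient, taken from the list above.
RETRYABLE_STATUS = frozenset({429, 500, 502, 503, 504})


def is_retryable(status_code: Optional[int]) -> bool:
    """Return True when a second attempt has a real chance of succeeding.

    status_code is None when no HTTP response arrived at all
    (connection error or read timeout), which also counts as retryable.
    """
    if status_code is None:
        return True
    return status_code in RETRYABLE_STATUS


for code in (429, 503, None, 400, 401, 422):
    print(code, "retry" if is_retryable(code) else "fail fast")
```

Keeping the decision in one place means the retry loop, the metrics, and the alerting all agree on what counts as transient.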

Exponential backoff with jitter

Once you know a failure is retryable, the question is how long to wait. Retrying immediately is pointless — the condition that caused the 429 is still there a millisecond later. The standard answer is exponential backoff: wait roughly 0.5s, then 1s, 2s, 4s, each attempt doubling, capped at a ceiling so a long recovery doesn't stall you forever.

But pure exponential backoff has a vicious failure mode at scale. If 500 workers all get rate-limited at the same instant — which is exactly what happens during a spike — they all back off by the same amount and retry at the same instant, recreating the spike on a 2-second clock. This is the thundering herd. The fix is jitter: randomize each delay so the retries spread out. Full jitter (sleep a random amount between zero and the backoff ceiling) de-synchronizes the herd far better than adding a small random nudge to a fixed delay.
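To see the difference, here is a small simulation of 500 workers choosing a delay for their third attempt. It is an illustrative sketch: the helper names backoff_ceiling and full_jitter are made up for this example, and the seed exists only to make the run repeatable.

```python
import random

BASE_DELAY = 0.5   # seconds, matching the schedule described above
MAX_DELAY = 20.0


def backoff_ceiling(attempt: int) -> float:
    """Plain exponential backoff: 0.5s, 1s, 2s, 4s, ... capped at MAX_DELAY."""
    return min(MAX_DELAY, BASE_DELAY * (2 ** attempt))


def full_jitter(attempt: int, rng: random.Random) -> float:
    """Full jitter: a uniform draw between zero and the backoff ceiling."""
    return rng.uniform(0, backoff_ceiling(attempt))


rng = random.Random(42)   # seeded only so the demo is repeatable
workers = 500
attempt = 2               # third try, ceiling = 2.0s

# Collect the distinct delays chosen across the fleet.
no_jitter = {backoff_ceiling(attempt) for _ in range(workers)}
jittered = {round(full_jitter(attempt, rng), 1) for _ in range(workers)}

print(len(no_jitter))     # 1: every worker picks the same delay
print(len(jittered))      # delays spread across many 100 ms buckets
```

Without jitter the whole fleet wakes at the same moment; with full jitter the same 500 retries are spread across the two-second window.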

retry.py

# A retry wrapper you can actually ship. Two rules do most of the work:
#   1. Only retry what's retryable (429 + 5xx + connection errors). Never 400/401/422.
#   2. Back off exponentially AND add jitter, or every client retries in lockstep.
import random
import time

from openai import OpenAI, APIStatusError, APIConnectionError, APITimeoutError

client = OpenAI(
    api_key="sk-brievio-...",
    base_url="https://api.brievio.com/v1",
    timeout=30,          # hard cap per request — see "Timeouts" below
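    max_retries=0,       # the SDK retries on its own by default; turn that off so this wrapper is the only retry layer
)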

RETRYABLE_STATUS = {429, 500, 502, 503, 504}
MAX_ATTEMPTS = 5
BASE_DELAY = 0.5         # seconds
MAX_DELAY = 20.0         # cap the backoff so a slow recovery doesn't stall forever

def chat_with_retry(**kwargs):
    for attempt in range(MAX_ATTEMPTS):
        try:
            return client.chat.completions.create(**kwargs)
        except APIStatusError as e:
            # 4xx that isn't 429 is YOUR bug (bad params, bad key, too long).
            # Retrying it just burns latency — fail fast.
            if e.status_code not in RETRYABLE_STATUS:
                raise
            last_error = e
        except (APIConnectionError, APITimeoutError) as e:
            # Network blip or our timeout fired. Safe to retry a read.
            last_error = e

        if attempt == MAX_ATTEMPTS - 1:
            break

        # Exponential backoff with FULL jitter: sleep ∈ [0, base * 2**attempt].
        # Full jitter (not "backoff + small random") is what actually de-syncs
        # a thundering herd. See the AWS Architecture Blog on jittered backoff.
        ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
        time.sleep(random.uniform(0, ceiling))

    raise last_error

Note the cap on attempts (MAX_ATTEMPTS = 5) and the cap on delay (MAX_DELAY). Unbounded retries are how a transient blip becomes a backlog that never drains: requests pile up faster than they clear, latency climbs, and upstream callers time out and retry their requests too. Bound both, and let the failure be visible instead of buffered.
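It helps to work out what those bounds add up to. The sketch below computes a pessimistic upper limit for one call through the wrapper, assuming the wrapper is the only retry layer, that timeout acts as a hard per-attempt cap, and that every jittered sleep lands on its ceiling. The function name worst_case_seconds is hypothetical.

```python
def worst_case_seconds(max_attempts=5, base_delay=0.5, max_delay=20.0, timeout=30.0):
    """Upper bound on wall-clock time for one call through the retry wrapper.

    Assumes every attempt runs to its full timeout and every jittered
    sleep lands on its ceiling, which is the pessimistic case.
    """
    # Sleeps happen between attempts, so there are max_attempts - 1 of them.
    sleeps = sum(min(max_delay, base_delay * 2 ** a) for a in range(max_attempts - 1))
    return max_attempts * timeout + sleeps


print(worst_case_seconds())   # 157.5 = 5 * 30s of timeouts + (0.5 + 1 + 2 + 4)s of sleep
```

With these defaults the timeouts dominate the total, not the backoff. If a caller upstream gives up after 60 seconds, most of that worst case is work nobody is waiting for.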

Honor Retry-After — don't guess

Your backoff curve is a guess about when capacity frees up. When the server tells you the answer directly, use it. A 429 often carries a Retry-After header (delta-seconds or an HTTP date) that reflects the real reset window. Sleeping for that value is strictly better than any formula, because it's ground truth rather than a heuristic.

retry_after.py

# When the server tells you how long to wait, listen. A 429 (and sometimes a
# 503) carries a Retry-After header. Honoring it beats any backoff curve you
# invent, because it reflects the real reset window — not a guess.
import email.utils as eut
import random
import time

def retry_delay(resp_headers, attempt, base=0.5, cap=20.0):
    # 1. Prefer the server's instruction.
    ra = resp_headers.get("retry-after")
    if ra is not None:
        try:
            return max(0.0, float(ra))             # delta-seconds form: "2"
        except ValueError:
            pass                                   # not a number, try the date form
        try:
            when = eut.parsedate_to_datetime(ra)   # HTTP-date form
            return max(0.0, when.timestamp() - time.time())
        except (TypeError, ValueError):
            pass                                   # unparseable header, fall through

    # 2. No usable header? Fall back to exponential backoff with full jitter.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Notes:
#   - Some providers also send X-RateLimit-Reset; treat it the same way.
#   - Add a tiny floor (e.g. 50ms) so a "Retry-After: 0" doesn't hot-loop.
#   - Brievio surfaces Retry-After on 429 instead of silently stalling the
#     socket, so this path is reachable — you can actually back off on signal.

This only works if the gateway actually returns the header instead of absorbing the limit and stalling your socket for ninety seconds. Brievio fails fast and loud — a rate limit comes back as a clean 429 with Retry-After, not a hung connection — so the "listen to the server" path is reachable. The full error taxonomy, with which codes carry which headers, is in the error reference.
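A server-provided delay still deserves bounds on the client side. The sketch below applies the small floor mentioned in the notes above, plus an upper limit beyond which the caller gives up rather than parking a worker. The name bounded_delay and the 60-second max_wait are assumptions for illustration, not values taken from any provider.

```python
def bounded_delay(server_delay, floor=0.05, max_wait=60.0):
    """Clamp a server-provided Retry-After value, given in seconds.

    Returns None when the wait exceeds max_wait, signalling the caller to
    give up (or requeue the work) instead of sleeping in-line.
    """
    if server_delay > max_wait:
        return None
    # The floor keeps "Retry-After: 0" from turning into a hot loop.
    return max(floor, server_delay)


print(bounded_delay(0))      # 0.05
print(bounded_delay(2))      # 2
print(bounded_delay(3600))   # None
```

In the retry wrapper, the headers would come from the response attached to the exception (on the OpenAI SDK's APIStatusError, that should be e.response.headers). Feed them to retry_delay, then pass the result through a clamp like this one.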

Timeouts: the failure mode nobody tests

A retry policy is useless if the request never returns to be retried. An AI call with no timeout will, eventually, hang: a half-open connection, a stuck upstream, a load balancer that dropped the flow. Without a deadline, that one request holds a worker, a connection, and a slot in every queue behind it until the OS gives up minutes later. Set an explicit per-request timeout (the timeout=30 above) so a stuck call converts into an APITimeoutError you can retry, rather than dead weight.

Per-request timeout bounds a single attempt. For streaming, the meaningful budget is time-to-first-token plus an inter-chunk idle timeout, not a wall-clock total.
Overall deadline bounds the whole retry sequence. Track a budget (e.g. 60s end-to-end) and stop retrying once it's spent — a caller waiting on you has its own deadline.
Timeout > expected p99, not p50. Set it just above your real tail latency. Too tight and you'll cancel good requests that were about to succeed, manufacturing load out of impatience.
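The overall deadline is the piece the retry wrapper above does not implement. One way to add it, as a sketch under stated assumptions: call_with_deadline, DeadlineExceeded, and the injectable clock and sleep parameters are hypothetical names, and retry_on defaults to Python's built-in TimeoutError and ConnectionError rather than any SDK's exception types.

```python
import random
import time


class DeadlineExceeded(Exception):
    """Raised when the overall retry budget is spent (hypothetical name)."""


def call_with_deadline(fn, budget_s=60.0, max_attempts=5, base=0.5, cap=20.0,
                       retry_on=(TimeoutError, ConnectionError),
                       clock=time.monotonic, sleep=time.sleep):
    """Run fn() with retries, bounded by attempts AND by total elapsed time."""
    deadline = clock() + budget_s
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on as e:
            last_error = e
        if attempt == max_attempts - 1:
            break
        delay = random.uniform(0, min(cap, base * 2 ** attempt))
        # Stop if sleeping would carry us past the caller's deadline.
        if clock() + delay >= deadline:
            raise DeadlineExceeded(f"retry budget of {budget_s}s spent") from last_error
        sleep(delay)
    raise last_error


attempts = []

def flaky():
    # Simulated dependency: times out twice, then answers.
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("simulated hang")
    return "ok"

# sleep is stubbed out so the demo runs instantly.
print(call_with_deadline(flaky, sleep=lambda s: None))   # ok
```

A fuller version would also shrink each attempt's own timeout to the budget that remains, so the last attempt cannot overrun the deadline by itself.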

Idempotency: making retries safe

Retries introduce a subtle hazard. When a request times out, you don't know whether the server processed it — your read of the response failed, but the work may have completed. Retry blindly and you can double-charge a customer, send a notification twice, or write a duplicate row. Reads are naturally safe to retry. Side effects are not.

The defense is an idempotency key: a unique ID you attach to a logical operation so the server (or your own handler) collapses duplicates. For pure inference calls there's usually no external side effect to worry about — but the moment a completion triggers a DB write, a payment, or an outbound message, generate a stable key per logical unit of work and dedupe on it. The rule of thumb: if a retry could happen twice, design as if it will.
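A minimal sketch of that pattern follows, assuming the dedupe happens in your own handler. The names idempotency_key and OnceOnly are hypothetical, and the in-memory dict stands in for what would need to be a shared store with an atomic insert in a real deployment.

```python
import hashlib


def idempotency_key(operation: str, *parts: str) -> str:
    """Derive a stable key from what the operation IS, not from when it ran."""
    raw = "\x1f".join((operation, *parts))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class OnceOnly:
    """In-memory dedupe: run a side effect at most once per key.

    Illustration only; multiple workers need a shared store and an
    atomic "insert if absent" instead of this dict.
    """

    def __init__(self):
        self._results = {}

    def run(self, key, side_effect):
        if key in self._results:
            return self._results[key]   # duplicate: return the stored result
        result = side_effect()
        self._results[key] = result
        return result


sent = []

def send_email():
    sent.append("summary email")
    return "sent"

once = OnceOnly()
key = idempotency_key("send-summary-email", "user-123", "doc-456")
for _ in range(3):                      # the original call plus two retries
    once.run(key, send_email)

print(len(sent))                        # 1: the side effect ran once
```

The key is built from the identifiers of the logical unit of work, so every retry of the same operation produces the same key. A random ID generated per attempt would defeat the purpose.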

Circuit breakers: stop kicking a dead dependency

Backoff handles a single struggling request. It does nothing for a sustained outage — if an upstream is hard-down for two minutes, every request runs its full retry ladder, waits the maximum, and fails anyway, while your latency and queue depth explode. A circuit breaker short-circuits this: after N consecutive failures it "opens" and fails new calls immediately (or routes them to a fallback) for a cool-down window, then lets a single probe through to test recovery before closing again.

Closed: normal operation, requests flow, failures are counted.
Open: threshold tripped — reject fast for a cool-down period instead of piling up doomed retries.
Half-open: after cool-down, allow one trial request; success closes the breaker, failure re-opens it.
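The three states map onto a small class. This is a single-threaded sketch rather than a production implementation: CircuitBreaker, CircuitOpenError, and the default threshold and cool-down values are illustrative assumptions, and a real deployment would add locking and count only retryable failures.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0        # consecutive failures while closed
        self._opened_at = None    # None means the breaker is closed
        self._probing = False     # True while a half-open trial is in flight

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def call(self, fn):
        state = self.state
        if state == "open" or (state == "half-open" and self._probing):
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        if state == "half-open":
            self._probing = True   # allow exactly one trial request
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self._failures = 0
        self._opened_at = None     # close the breaker
        self._probing = False

    def _on_failure(self):
        self._failures += 1
        if self._probing or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()   # open, or re-open after a failed probe
        self._probing = False
```

Wrap the retrying call in breaker.call(...), not each individual attempt, so one exhausted retry ladder counts as one failure. Deterministic errors such as a 400 should bypass the failure count, since they say nothing about the dependency's health.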

Pair the breaker with a fallback path and the outage becomes a degradation instead of a hard failure. This is also where a gateway earns its keep: Brievio does cross-vendor failover, so a single provider going dark can route to a healthy one before your breaker ever needs to open. For how that fits into a real reliability budget, see how we engineer a 99.95% SLO.

A note on what retries cost

Aggressive retry tuning makes people nervous about the bill — every retry is another billable call, right? Not on a gateway that bills honestly. Brievio charges nothing on failed 4xx/5xx calls, so the 429 that triggered your backoff and the 503 you retried past are free. You pay for the attempt that actually returns a completion, not for the ones the system rejected. That means you can tune attempt counts and timeouts for reliability without a meter penalty for doing so — and if runaway retries are still a budget worry, cap spend directly as described in how to cap your AI API spend.

The takeaway

Production error handling for AI APIs is six rules, and you can ship all of them in an afternoon:

Retry only 429, 500 / 502 / 503 / 504, and connection errors or timeouts. Never retry other 4xx: they're your bug, surfaced.
Back off exponentially with full jitter, and cap both attempts and delay.
Honor Retry-After when the server sends it — it beats any formula.
Set explicit per-request timeouts and an overall deadline; never let a call hang.
Make side-effecting retries safe with an idempotency key.
Add a circuit breaker so a sustained outage degrades instead of melting down.

None of this is exotic — it's the boring infrastructure that keeps the boring promise of staying up. It also works best on a base layer that fails fast, surfaces real signals, and doesn't charge you for the failures, so your tuning is honest and free. If you're still choosing where to point your traffic, reliability behavior under failure is one of the things worth testing first — covered in how to choose an AI API gateway.

Rate limits, retries and backoff: production error handling for AI APIs
