Guide · June 4, 2026 · 7 min read

Vision and document understanding with Claude and Gemini via one API

Send images to Claude and Gemini through one OpenAI-compatible API: image_url URLs, base64 uploads, multi-image prompts, OCR, charts, and scanned-document understanding.

A surprising amount of "document AI" is just sending a picture to a chat model and reading what it says back. A receipt, a dashboard screenshot, a scanned PDF page, a photo of a whiteboard — modern Claude and Gemini models read all of it natively, no separate OCR engine required. The catch is usually plumbing: every provider has its own way to attach an image, and porting code from one to the other is annoying.

Through Brievio you use one request shape, the OpenAI Chat Completions content array with an image_url part, and the same request works against Claude Opus 4.7, Sonnet 4.6, Haiku 4.5, and Gemini 2.5 Pro / Flash. These are the genuine first-party models with native vision honored, so a JPEG that Claude reads well when called directly is read just as well through Brievio. This post covers image input (understanding), not image generation: URLs, base64, multi-image prompts, and the OCR / chart / scanned-document patterns that come up in real work.

The simplest case: an image by URL

If your image already lives at a public HTTPS URL, attach it as an image_url part next to your text. Swap claude-sonnet-4-6 for gemini-2.5-pro and the request body does not change — that portability is the whole point:

image_url.py
# Send an image by URL. Same OpenAI chat shape, against the genuine model.
from openai import OpenAI

client = OpenAI(
    api_key="sk-brievio-...",
    base_url="https://api.brievio.com/v1",
)

resp = client.chat.completions.create(
    model="claude-sonnet-4-6",     # or gemini-2.5-pro — same request shape
    max_tokens=500,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show? Give the trend in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/q3-revenue.png"},
                },
            ],
        }
    ],
)

print(resp.choices[0].message.content)
# Images are billed as INPUT tokens — read resp.usage.prompt_tokens to see the cost.

One thing to internalize early: images cost input tokens. A model doesn't see pixels for free — it tiles the image and each tile bills like text. Brievio reports honest token counts, so the image cost shows up in resp.usage.prompt_tokens exactly as the upstream provider charges it. A full-screen screenshot can run a few hundred to a couple thousand input tokens depending on resolution. Budget for it the way you'd budget a paragraph of context, not as a freebie.
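
To make that budget concrete, here is a rough estimator for Claude. It is a sketch under assumptions, not Brievio's or Anthropic's billing code. It uses the rule of thumb from Anthropic's vision documentation (about width × height / 750 tokens). The 1568 px long-edge downscale default is an assumption that may not hold for every model, and Gemini counts image tokens differently. The function name is hypothetical, and resp.usage.prompt_tokens remains the source of truth.

```python
import math


def estimate_claude_image_tokens(width_px: int, height_px: int,
                                 max_long_edge: int = 1568) -> int:
    """Rough input-token estimate for one image sent to a Claude model.

    Assumption: images whose long edge exceeds max_long_edge are scaled
    down proportionally before being tokenized.
    """
    long_edge = max(width_px, height_px)
    if long_edge > max_long_edge:
        scale = max_long_edge / long_edge
        width_px = round(width_px * scale)
        height_px = round(height_px * scale)
    # Rule of thumb: roughly one token per 750 pixels.
    return math.ceil(width_px * height_px / 750)


if __name__ == "__main__":
    for size in [(800, 600), (1000, 1000), (3000, 2000)]:
        print(size, estimate_claude_image_tokens(*size))
```

Compare the estimate against resp.usage.prompt_tokens on a few of your own images before relying on it for cost planning.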

Base64: the case you'll actually use

In production the image is rarely a public URL — it's a file a user just uploaded, a buffer from a scanner, a private S3 object. For those, inline the bytes as a base64 data: URL. The model can't tell the difference; your bytes never have to be publicly reachable:

base64_upload.py
# Most production images aren't public URLs. Inline them as base64 data URLs.
import base64
from openai import OpenAI

client = OpenAI(api_key="sk-brievio-...", base_url="https://api.brievio.com/v1")

def data_url(path: str, media_type: str = "image/png") -> str:
    with open(path, "rb") as f:
        b64 = base64.standard_b64encode(f.read()).decode("utf-8")
    return f"data:{media_type};base64,{b64}"

resp = client.chat.completions.create(
    model="gemini-2.5-flash",      # cheap + fast for OCR / receipts
    max_tokens=800,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract every line item and total as JSON. Keys: items[], total."},
                {"type": "image_url", "image_url": {"url": data_url("receipt.jpg", "image/jpeg")}},
            ],
        }
    ],
)

print(resp.choices[0].message.content)
# Tip: data URLs inflate the request body ~33% over the raw file. Keep images
# reasonably sized — a 2-3MP screenshot is plenty for text; you rarely need 12MP.

Two practical caveats. First, base64 adds about 33% overhead to your request body, and there are per-image size limits (Anthropic caps single images around 5 MB on the API; Gemini has its own ceiling). If a large scan 413s, downscale it — text stays legible at far lower resolution than you'd guess. Second, send the correct media_type (image/png, image/jpeg, image/webp); a mismatched type is a common cause of a silent decode failure. When a request does fail with a 4xx or 5xx on Brievio, you aren't billed for it — failed calls are free, so you can retry a downscaled image without paying twice.
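
Both caveats can be checked before the request leaves your machine. The sketch below is one way to do it with only the standard library. It sniffs the media type from the file's leading bytes instead of trusting the extension, and it refuses images over a size budget. The helper names are hypothetical, and the 5 MB default is an assumption taken from the Anthropic figure above, so check the per-model limits in the docs.

```python
import base64


def sniff_media_type(data: bytes) -> str:
    """Return the media type implied by the file's magic bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unrecognized image format")


def encoded_size(n_bytes: int) -> int:
    """Length of the padded base64 text for n_bytes of input (about +33%)."""
    return 4 * ((n_bytes + 2) // 3)


def safe_data_url(data: bytes, max_bytes: int = 5 * 1024 * 1024) -> str:
    """Build a data URL, failing fast on oversized or unknown images.

    Assumption: max_bytes applies to the raw file size.
    """
    if len(data) > max_bytes:
        raise ValueError(
            f"image is {len(data)} bytes, over the {max_bytes} byte budget; "
            "downscale it before sending"
        )
    media_type = sniff_media_type(data)
    b64 = base64.standard_b64encode(data).decode("ascii")
    return f"data:{media_type};base64,{b64}"
```

Failing locally is cheaper than a round trip, and the sniffed type rules out the mismatched media_type case entirely.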

Multi-image prompts and scanned documents

The content array takes as many image_url parts as you want, interleaved with text. That unlocks the genuinely useful workflows: compare a before/after screenshot, read a multi-page scanned document, or feed a sequence of charts and ask for the through-line. The trick that pays off on both model families is labeling each image with a small text anchor so the model can cite it:

multi_image.py
# Multiple images in one prompt: compare two screenshots, or read a 3-page scan.
# Reuses client and data_url() from base64_upload.py above.
content = [
    {"type": "text", "text": "These are pages 1-3 of a scanned contract. Summarize the parties, term, and termination clause. Cite the page number for each."},
]
for i, path in enumerate(["page1.png", "page2.png", "page3.png"], start=1):
    content.append({"type": "text", "text": f"--- Page {i} ---"})
    content.append({"type": "image_url", "image_url": {"url": data_url(path)}})

resp = client.chat.completions.create(
    model="claude-opus-4-7",       # strongest reasoning over dense documents
    max_tokens=1200,
    messages=[{"role": "user", "content": content}],
)

print(resp.choices[0].message.content)
# Interleaving a text label before each image ("--- Page 2 ---") gives the model
# an anchor to cite, and noticeably improves multi-image grounding on both families.

For long documents there's a model-choice tradeoff. Gemini 2.5 Flash is cheap and fast and is a great default for high-volume OCR, receipts, and form extraction. Claude Opus 4.7 reasons harder over dense, multi-page material — contracts, financial statements, anything where you need it to hold several pages in view and cross-reference them. Sonnet 4.6 and Gemini 2.5 Pro sit in between. You can route per task without changing any code but the model string; see the live list on /models.
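
Per-task routing can be as small as a lookup table. This is a minimal sketch: the task names, the default, and the helper functions are hypothetical, and the model strings are the ones used in the examples in this post. The mapping itself is a starting point to A/B against your own documents, not a recommendation.

```python
# Hypothetical task labels mapped to the model strings used in this post.
MODEL_BY_TASK = {
    "receipt_ocr": "gemini-2.5-flash",
    "form_extraction": "gemini-2.5-flash",
    "chart_summary": "claude-sonnet-4-6",
    "contract_review": "claude-opus-4-7",
}
DEFAULT_MODEL = "claude-sonnet-4-6"  # assumed middle-ground default


def pick_model(task: str) -> str:
    """Return the model string for a task, falling back to the default."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)


def build_request(task: str, content: list, max_tokens: int = 800) -> dict:
    """Assemble keyword arguments for client.chat.completions.create()."""
    return {
        "model": pick_model(task),
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": content}],
    }
```

Call it as client.chat.completions.create(**build_request("receipt_ocr", content)); only the model string varies between tasks.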

What these models are good (and not good) at

Native vision handles a wide range of real tasks well:

  • OCR and transcription — printed text, and surprisingly decent handwriting. No Tesseract pipeline to maintain.
  • Charts and dashboards — reading values off bar/line charts, summarizing a trend, sanity-checking a metric in a screenshot.
  • Structured extraction — receipts, invoices, forms, IDs to JSON. Pair it with a strict schema in your prompt for clean output.
  • UI and diagram understanding — describing a screen, reading an error dialog, explaining an architecture diagram.

And some honest limits. Models still occasionally misread a single digit or a dense table cell, so for anything where a wrong number is expensive, validate against a schema or a checksum (e.g., line items should sum to the stated total). Tiny text in a low-resolution image is the most common failure — give it a higher-res crop. And per-model behavior genuinely differs: a layout one model nails, another may stumble on, which is exactly why being able to swap models behind one API and A/B them on your own documents is worth more than any single benchmark.
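
The sum check mentioned above takes only a few lines. This sketch assumes the model returned JSON shaped like the receipt prompt earlier (items[] and total) and that each item carries an "amount" field. That field name and the function name are assumptions, so adjust them to your own schema.

```python
import json
from decimal import Decimal, InvalidOperation


def check_receipt(raw_json: str, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return a list of problems found in an extracted receipt (empty if OK)."""
    try:
        # Parse floats as Decimal so currency sums are exact.
        data = json.loads(raw_json, parse_float=Decimal)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    items = data.get("items")
    if not isinstance(items, list) or not items:
        return ["missing or empty items[]"]
    if "total" not in data:
        return ["missing total"]

    try:
        amounts = [Decimal(str(item["amount"])) for item in items]
        total = Decimal(str(data["total"]))
    except (KeyError, TypeError, InvalidOperation) as exc:
        return [f"bad amount or total: {exc!r}"]

    if abs(sum(amounts) - total) > tolerance:
        return [f"line items sum to {sum(amounts)} but stated total is {total}"]
    return []
```

When the list is non-empty, a reasonable next step is to retry with a higher-resolution crop or send the document to human review rather than trusting the numbers.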

Going further: vision plus tools

Vision composes with the rest of the API. You can hand the model an image and a set of tools, so it reads a screenshot and then calls a function with what it extracted — "read this invoice, then call create_expense(amount, vendor, date)." That tool-call layer is also uniform across Claude and Gemini through Brievio; the tool-use guide covers the shared shape. Combined with vision, it's most of a document-processing pipeline in one request.
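
As a sketch of what that request could look like, the snippet below builds a chat request that carries both an image and a create_expense tool in the standard OpenAI tools format, then parses the arguments the model sends back. The tool schema, parameter descriptions, and helper names are hypothetical; only the create_expense(amount, vendor, date) signature comes from the example above.

```python
import json

# Hypothetical JSON Schema for the create_expense tool named in this post.
CREATE_EXPENSE_TOOL = {
    "type": "function",
    "function": {
        "name": "create_expense",
        "description": "Record an expense extracted from an invoice image.",
        "parameters": {
            "type": "object",
            "properties": {
                "amount": {"type": "number", "description": "Invoice total"},
                "vendor": {"type": "string", "description": "Vendor name"},
                "date": {"type": "string", "description": "Invoice date, ISO 8601"},
            },
            "required": ["amount", "vendor", "date"],
        },
    },
}


def build_invoice_request(image_url: str, model: str = "claude-sonnet-4-6") -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "max_tokens": 500,
        "tools": [CREATE_EXPENSE_TOOL],
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Read this invoice, then call create_expense."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }


def parse_tool_arguments(arguments_json: str) -> dict:
    """Decode a tool call's JSON arguments and check the required keys."""
    args = json.loads(arguments_json)
    required = CREATE_EXPENSE_TOOL["function"]["parameters"]["required"]
    missing = [key for key in required if key not in args]
    if missing:
        raise ValueError(f"tool call is missing: {missing}")
    return args
```

With the OpenAI SDK, the arguments arrive as a JSON string at resp.choices[0].message.tool_calls[0].function.arguments. Pass that string to parse_tool_arguments before calling your real create_expense.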

The takeaway

For image understanding you do not need a separate OCR service or provider-specific SDKs. Attach an image_url part (a URL or a base64 data URL) to your normal chat request, label multiple images so the model can cite them, and route to the model that fits the task: Gemini Flash for cheap high-volume reads, Claude Opus for hard documents. Remember that images bill as input tokens (shown honestly in usage), keep resolution sane, validate extracted numbers, and use the fact that failed calls are free to retry with a downscaled image when a big scan bounces. The full request reference and per-model limits are in the docs. Start there and you'll have documents flowing through Claude or Gemini in an afternoon.