An agent is what you get when you put tool use in a loop. One tool call answers one question; an agent calls a tool, reads the result, decides what to do next, and keeps going until the task is actually done — search, then read the top hit, then look up a price, then write the answer. The mechanism is simple and it's the same four beats every time: the model asks for a tool, you run it, you feed the result back, you call the model again. The hard part isn't the loop. It's the guardrails that stop it from spinning forever or quietly billing you $40 on a single run.

This post builds a real, runnable agent loop against https://api.brievio.com/v1 with the OpenAI Python SDK, then wraps it in the four controls that make it safe to ship: a hard iteration cap, validated tool dispatch, a per-run cost budget read from honest token counts, and sane handling for the cases where the model misbehaves. The snippets build on each other: agent_loop.py relies on TOOLS and dispatch from dispatch.py, and budget_guard.py reuses both. Swap claude-sonnet-4-6 for gemini-2.5-pro and the same loop drives a different model; just add that model's rates to the budget guard's table first.

The loop, and why it needs a ceiling

Here is the whole engine. It's the tool-use loop you already know, with one addition that changes everything: for step in range(MAX_ITERS) instead of while True.

agent_loop.py

# The agent loop with a hard iteration cap. Model -> tool_calls -> run ->
# feed results back -> repeat, until the model answers in prose or we hit
# the ceiling. The cap is the difference between "agent" and "runaway bill".
from openai import OpenAI
import json

client = OpenAI(
    api_key="sk-brievio-...",
    base_url="https://api.brievio.com/v1",
)

MAX_ITERS = 8   # most tasks finish in 2-4 rounds; 8 is generous headroom.

def run_agent(question: str, model: str = "claude-sonnet-4-6") -> str:
    messages = [
        {"role": "system", "content": "You are a helpful research agent. "
         "Use the tools when you need live data. Answer directly when you "
         "already know enough."},
        {"role": "user", "content": question},
    ]

    for step in range(MAX_ITERS):
        resp = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
        )
        msg = resp.choices[0].message

        # No tool requested -> this is the final answer. Done.
        if not msg.tool_calls:
            return msg.content or ""   # content can be None; keep the str contract

        # Append the assistant turn EXACTLY as returned — it carries the
        # tool_call ids the next messages must reference.
        messages.append(msg)

        # Run every requested call and append one tool message per id.
        for call in msg.tool_calls:
            result = dispatch(call)              # see the next snippet
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,          # MUST match the call's id
                "content": json.dumps(result),
            })
        # Loop: the model now sees the tool output and continues.

    # Hit the cap without resolving. Fail loud — don't silently loop forever.
    raise RuntimeError(f"agent did not finish within {MAX_ITERS} iterations")

That bounded for is the single most important line in an agent. A capable model on a well-scoped task finishes in two to four rounds. But models get confused: they call the same search twice, chase a dead end, or — the classic failure — call a tool, get a result they don't like, and call it again with nearly identical arguments, forever. A while True turns that into an unbounded bill and a hung request. The cap converts "spins forever" into "fails after 8 tries with a clear error," which is something you can catch, log, and recover from. Pick the number from your task: a one-shot lookup needs 2, a multi-step research agent maybe 10. Set it deliberately; don't leave it unbounded.

Note the other guardrail hiding in plain sight: handle the model not calling a tool. When msg.tool_calls is empty, that's the model deciding it has enough to answer — that's your exit, not an error. A loop that assumes every turn produces a tool call will either crash or never terminate. Branch on both outcomes every iteration.
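Because the cap surfaces as an ordinary exception, the caller gets to decide what a failed run looks like. Here is a minimal sketch, assuming the caller would rather log the failure and return a canned reply than let the error propagate. answer_or_fallback and the two stand-in functions are hypothetical names used for illustration, not part of the code above.

```python
import logging
from typing import Callable

log = logging.getLogger("agent")

def answer_or_fallback(
    run: Callable[[str], str],
    question: str,
    fallback: str = "Sorry, I couldn't finish that request.",
) -> str:
    """Call an agent function; turn a cap failure into a logged fallback."""
    try:
        return run(question)
    except RuntimeError as exc:
        # run_agent raises RuntimeError when it reaches MAX_ITERS.
        log.warning("agent run aborted: %s", exc)
        return fallback

# Stand-ins so the sketch runs without a network call (hypothetical).
def finishes(question: str) -> str:
    return f"answer to: {question}"

def spins(question: str) -> str:
    raise RuntimeError("agent did not finish within 8 iterations")

print(answer_or_fallback(finishes, "weather in Oslo?"))
print(answer_or_fallback(spins, "weather in Oslo?"))
```

In real use you would pass run_agent as the run argument. Catching RuntimeError is broad; a dedicated exception class would be a reasonable refinement if other code paths raise it too.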

Tool dispatch: the model proposes, your code disposes

The model never touches your systems. It emits a function name and a JSON string of arguments and stops; your code decides whether to honor that. That boundary is the entire security story of an agent, so the dispatch function is where validation lives — not as a nicety, but because every argument is untrusted model output, exactly like a form field a stranger filled in.

dispatch.py

# Tool dispatch with validation. The model PROPOSES a call; your code
# DISPOSES of it. Every argument is untrusted model output — parse it,
# check the name is one you registered, and validate types before running.
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    if not isinstance(city, str) or not city.strip():
        raise ValueError("city must be a non-empty string")
    if unit not in ("celsius", "fahrenheit"):
        raise ValueError(f"unsupported unit: {unit!r}")
    return {"city": city, "temp": 18, "unit": unit, "sky": "clear"}

# Whitelist: a model can only call what you've explicitly registered.
TOOL_IMPLS = {"get_weather": get_weather}

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string",
                         "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

def dispatch(call) -> dict:
    name = call.function.name
    fn = TOOL_IMPLS.get(name)
    if fn is None:
        # The model hallucinated a tool. Don't crash the loop — hand the
        # error back as a tool result so the model can correct itself.
        return {"error": f"unknown tool: {name}"}

    try:
        args = json.loads(call.function.arguments)   # always a JSON STRING
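    except (json.JSONDecodeError, TypeError):
        # TypeError covers a missing (None) arguments field.
        return {"error": "arguments were not valid JSON"}
    if not isinstance(args, dict):
        # Valid JSON, but not an object we can unpack into keyword args.
        return {"error": "arguments must be a JSON object"}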

    try:
        return fn(**args)
    except (TypeError, ValueError) as e:
        # Bad args (wrong type, missing field, out of range). Feed the
        # message back; the model usually retries with a fixed call.
        return {"error": str(e)}

Three failure modes get handled here, and all of them feed the error back to the model instead of crashing the loop. A hallucinated tool name the model invented but you never registered — the whitelist catches it. Malformed JSON in the arguments string — rare on a genuine flagship, but you parse defensively anyway. And bad argument values — wrong type, missing required field, an enum the model made up. In every case, returning {"error": "..."} as the tool result is better than raising, because the model reads that message on the next turn and usually fixes its own call. An agent that can recover from its own mistakes is far more robust than one that dies on the first bad argument.

Keep the whitelist tight. TOOL_IMPLS.get(name) means a model — genuine or otherwise — can only ever invoke functions you explicitly registered. That single dict is your blast radius. If a tool deletes data, charges a card, or sends an email, gate it behind an explicit confirmation rather than letting the loop fire it autonomously.
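One way to express that confirmation gate is a second set alongside the whitelist. The sketch below is an assumption about how you might wire it, not a prescribed design: NEEDS_CONFIRMATION, gated_call, the confirm callback, and the send_email stand-in are all hypothetical names.

```python
from typing import Callable

# Hypothetical tool names that must never fire without approval.
NEEDS_CONFIRMATION = {"send_email", "charge_card"}

def gated_call(name: str, args: dict, impls: dict,
               confirm: Callable[[str, dict], bool]) -> dict:
    """Run a registered tool, asking `confirm` first for risky ones."""
    fn = impls.get(name)
    if fn is None:
        return {"error": f"unknown tool: {name}"}
    if name in NEEDS_CONFIRMATION and not confirm(name, args):
        # Refusal goes back to the model as a normal tool result.
        return {"error": f"{name} requires confirmation and was not approved"}
    try:
        return fn(**args)
    except (TypeError, ValueError) as e:
        return {"error": str(e)}

def send_email(to: str, body: str) -> dict:
    # Stand-in for illustration; it sends nothing.
    return {"sent": True, "to": to}

impls = {"send_email": send_email}
deny = lambda name, args: False    # e.g. the user clicked "cancel"
allow = lambda name, args: True    # e.g. the user clicked "approve"

print(gated_call("send_email", {"to": "a@b.c", "body": "hi"}, impls, deny))
print(gated_call("send_email", {"to": "a@b.c", "body": "hi"}, impls, allow))
```

The confirm callback is where a human prompt, an approval queue, or a policy check would plug in; the loop itself does not need to know which.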

The budget guard: loops re-send a growing context

The iteration cap bounds how many times you call the model. It does not bound how much each call costs — and in a loop, cost climbs every round. The reason is structural: each turn re-sends the entire conversation so far, plus every tool result appended to it. Turn one might be 800 input tokens; turn six, after five tool outputs have piled up, can be 6,000. Eight cheap rounds quietly add up to one not-cheap run. The fix is a second, independent ceiling on spend, computed from the real token counts each call returns:

budget_guard.py

# A per-run cost/token budget guard. Every loop turn re-sends a GROWING
# context (history + tool outputs), so cost climbs each round. Read the
# honest usage object after each call, price it, and stop when the run
# exceeds its budget — independent of the iteration cap.
# Reuses client, MAX_ITERS, TOOLS, dispatch and json from the snippets above.
from decimal import Decimal

# Published Brievio rates, USD per 1M tokens (~15% under official list).
RATES = {
    "claude-sonnet-4-6": {"in": Decimal("2.55"), "out": Decimal("12.75")},
    "claude-haiku-4-5":  {"in": Decimal("0.85"), "out": Decimal("4.25")},
}

def call_cost(model: str, usage) -> Decimal:
    r = RATES[model]
    m = Decimal("1000000")
    return usage.prompt_tokens * r["in"] / m + usage.completion_tokens * r["out"] / m

RUN_BUDGET = Decimal("0.10")   # 10 cents per agent run, hard ceiling.

def run_agent_budgeted(question: str, model: str = "claude-sonnet-4-6") -> str:
    messages = [{"role": "user", "content": question}]
    spent = Decimal("0")

    for step in range(MAX_ITERS):
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=TOOLS, tool_choice="auto",
        )
        spent += call_cost(model, resp.usage)   # tally REAL tokens, every turn
        if spent > RUN_BUDGET:
            raise RuntimeError(f"run exceeded ${RUN_BUDGET} (spent ${spent:.4f})")

        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content or ""   # content can be None; keep the str contract

        messages.append(msg)
        for call in msg.tool_calls:
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(dispatch(call))})

    raise RuntimeError(f"agent did not finish within {MAX_ITERS} iterations")

The key is that resp.usage on Brievio carries the honest input and output token counts the genuine model actually processed, so the running total is real money, not a guess. Reading usage after every turn and stopping at RUN_BUDGET means a confused agent that would otherwise burn through eight expensive rounds gets cut off on the first turn that pushes it past a dime, regardless of how many iterations that took. The check runs after the call returns, so the worst case is the budget plus the cost of one call rather than the budget exactly, and the guard also raises when the turn that crossed the line happened to be the final answer. Two ceilings, two different failure modes covered: the iteration cap stops infinite loops, the budget stops expensive ones. You want both, because a loop can be short and pricey or long and cheap, and neither alone protects you from the other.
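The growth the guard protects against is easy to put numbers on. The sketch below takes the 800-token first turn and the 6,000-token sixth turn from the example earlier in this section and assumes the context grows linearly between them (1,040 tokens per turn); that linear shape is an assumption for illustration, and output tokens are ignored. The input rate is the claude-sonnet-4-6 figure from RATES.

```python
from decimal import Decimal

INPUT_RATE = Decimal("2.55")   # USD per 1M input tokens (claude-sonnet-4-6)
FIRST_TURN = 800               # input tokens on turn one
GROWTH_PER_TURN = 1040         # assumed linear growth: (6000 - 800) / 5

def input_tokens_by_turn(turns: int) -> list:
    """Input tokens re-sent on each turn under the linear assumption."""
    return [FIRST_TURN + GROWTH_PER_TURN * i for i in range(turns)]

def cumulative_input_cost(turns: int) -> Decimal:
    """Total input spend across all turns, in USD."""
    total_tokens = sum(input_tokens_by_turn(turns))
    return total_tokens * INPUT_RATE / Decimal(1_000_000)

for n in (1, 4, 8):
    print(n, sum(input_tokens_by_turn(n)), cumulative_input_cost(n))
```

Under these assumptions eight turns re-send 35,520 input tokens in total, about nine cents of input alone, so a run like this would land near or past the 10-cent RUN_BUDGET once output tokens are added.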

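Worth knowing for the math: failed 4xx/5xx calls aren't billed on Brievio, and in the code above a failed request raises before usage is ever read, so a transient upstream error that you retry doesn't drain the run budget. You only tally cost for calls that actually returned a result. A failing tool is a different case: its error goes back to the model as a tool result, and the model call that reads it is billed like any other turn. That keeps the spend curve tracking work done, not errors absorbed. The full pattern for bounding spend per call and per user is in the capping API spend guide.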

Keeping the token bill down as the loop grows

Bounding cost is one thing; reducing it is another. Because every turn re-sends a growing prefix, the same context gets paid for again and again — which is exactly the shape prompt caching is built for. Mark the static parts of the request (the system prompt, the tool definitions) as cacheable, and from the second turn on you pay a fraction of the input rate on everything that didn't change. In a loop that re-sends the same multi-thousand-token tool catalog and system prompt on every single round, that's the single biggest lever on the bill.

A couple of practical habits help too. Keep the system prompt and tool definitions stable across the run: a tool added mid-loop or a timestamp in the system prompt changes the prefix, invalidates the cache, and puts those tokens back at the full input rate on every turn. And if a tool can return a wall of data (a full web page, a thousand-row query), summarize or truncate the result before appending it to messages; the model rarely needs all of it, and every byte you append is re-sent on every subsequent turn. The agent loop is especially sensitive to context bloat precisely because the context is re-sent N times, not once.
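Truncation can be as simple as a character cap applied where the tool result is serialized. This is a rough sketch rather than a recommendation: clip_tool_result and the 2,000-character default are hypothetical, and a character cap is only a proxy for tokens.

```python
import json

MAX_TOOL_CHARS = 2000   # hypothetical cap; tune to your tools

def clip_tool_result(result: dict, max_chars: int = MAX_TOOL_CHARS) -> str:
    """Serialize a tool result, replacing oversized ones with a preview."""
    text = json.dumps(result)
    if len(text) <= max_chars:
        return text
    # Wrap the cut-off text so the tool message is still valid JSON and
    # the model can see that it received a partial result. The wrapper
    # adds a little overhead on top of max_chars.
    return json.dumps({
        "truncated": True,
        "original_chars": len(text),
        "preview": text[:max_chars],
    })

small = clip_tool_result({"city": "Oslo", "temp": 18})
large = clip_tool_result({"page": "x" * 50_000})
print(len(small), len(large))
```

In the loop, this would take the place of the json.dumps(result) call that builds the tool message content. Summarizing with a cheap model call is the heavier alternative when the tail of the data matters.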

Conversation memory: what to carry between runs

Everything above is memory within a single run — the messages list is the agent's working memory, and appending to it is how the model remembers what it already looked up. For a multi-turn agent that talks to a user across several requests, you carry that list forward: persist messages per session (Redis, a database column, wherever), reload it on the next request, and append the new user turn. The loop is identical; only the starting state changes.
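A minimal sketch of that persistence step, using an in-memory dict as a stand-in for Redis or a database column. One detail to handle: the loop appends the SDK's assistant message object, not a dict, so it has to be converted before it can be serialized. The sketch assumes those objects expose Pydantic's model_dump(); save_session, load_session, and to_plain are hypothetical names.

```python
import json

_SESSIONS: dict = {}   # stand-in for Redis / a database column

def to_plain(message) -> dict:
    """Return a JSON-serializable dict for a dict or SDK message object."""
    if isinstance(message, dict):
        return message
    # Assumption: SDK message objects are Pydantic models.
    return message.model_dump(exclude_none=True)

def save_session(session_id: str, messages: list) -> None:
    _SESSIONS[session_id] = json.dumps([to_plain(m) for m in messages])

def load_session(session_id: str) -> list:
    raw = _SESSIONS.get(session_id)
    return json.loads(raw) if raw else []

# First request: run the loop, then persist what it accumulated.
history = [{"role": "user", "content": "What's the weather in Oslo?"},
           {"role": "assistant", "content": "Clear, 18 degrees."}]
save_session("sess-1", history)

# Next request: reload, append the new user turn, hand it to the loop.
messages = load_session("sess-1")
messages.append({"role": "user", "content": "And tomorrow?"})
print(len(messages))
```

To use this with the loop above, run_agent would need to accept a starting messages list instead of building one from a single question.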

The thing to manage is unbounded growth. A long-lived session accumulates history until it's expensive on every call and eventually overruns the context window. Two common strategies: keep a rolling window of the last N turns and drop the oldest, or periodically summarize the older history into a compact note and replace the raw turns with it. Both trade some fidelity for a bounded, predictable context size. Whichever you pick, the per-run budget guard from above still applies — it's the backstop that catches a session that grew larger than you planned for.
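The rolling window has one trap: a tool message must follow the assistant message that requested it, so cutting at an arbitrary index can orphan a tool result. A sketch that sidesteps this by only cutting at user-message boundaries, assuming the history is stored as plain dicts; trim_history is a hypothetical helper.

```python
def trim_history(messages: list, max_user_turns: int) -> list:
    """Keep system messages plus the last `max_user_turns` user turns.

    A "turn" is a user message and everything after it up to the next
    user message, so assistant tool_calls and their tool results are
    always kept or dropped together.
    """
    if max_user_turns < 1:
        raise ValueError("max_user_turns must be at least 1")
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    user_idx = [i for i, m in enumerate(rest) if m["role"] == "user"]
    if len(user_idx) <= max_user_turns:
        return system + rest
    cut = user_idx[-max_user_turns]   # start of the oldest turn we keep
    return system + rest[cut:]

history = [
    {"role": "system", "content": "You are a helpful research agent."},
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "c1", "type": "function",
                     "function": {"name": "get_weather",
                                  "arguments": "{\"city\": \"Oslo\"}"}}]},
    {"role": "tool", "tool_call_id": "c1", "content": "{\"temp\": 18}"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
]
print([m["role"] for m in trim_history(history, 1)])
```

Counting turns rather than tokens keeps the sketch simple; a token-based limit would bound cost more tightly.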

One key for an agent that escalates

A useful property of building this behind Brievio: an agent can change which model it uses mid-task without changing anything else. Run the cheap rounds on a smaller model and escalate to a flagship only when the task is hard — route easy tool dispatch through Haiku 4.5 at $0.85 in / $4.25 out, fall back to Sonnet or a different family for the reasoning-heavy final answer. Because one key covers every model behind a single base_url, that escalation is a one-line change to the model string inside the loop — no second SDK, no second auth scheme, no second billing relationship. The full request/response contract, including the tool fields, is in the Chat Completions docs, and the live model list with exact ids is on the models page.

It only matters, of course, if the model on the other end is genuine: an agent loop is unforgiving of a downgraded stand-in, because a model that fumbles tool arguments or ignores a tool will burn iterations and spend chasing its own mistakes. Brievio serves the genuine first-party models, honors native tool calling, and reports honest token counts — which is what makes both the loop and the budget math actually work.
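The escalation described above can be reduced to a small function that picks the model string each round. The policy here (escalate after a fixed number of steps or after any tool error) is purely an illustrative assumption, as are the names pick_model and escalate_after; the two model ids are the ones in the RATES table.

```python
CHEAP_MODEL = "claude-haiku-4-5"
STRONG_MODEL = "claude-sonnet-4-6"

def pick_model(step: int, tool_errors: int, escalate_after: int = 3) -> str:
    """Choose the model for this loop iteration.

    step:        zero-based iteration index from the for loop
    tool_errors: how many tool results so far contained an "error" key
    """
    if tool_errors > 0 or step >= escalate_after:
        # The task is proving hard: spend more per call to finish sooner.
        return STRONG_MODEL
    return CHEAP_MODEL

for step in range(5):
    print(step, pick_model(step, tool_errors=0))
```

Inside the loop, the result would be passed as model= on each call and also to call_cost, so every turn is priced at the rate of the model that actually served it.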

The takeaway: four guardrails, then ship

The loop itself is a dozen lines. What makes it production-ready is the boundary around it:

Cap the iterations. A bounded for over while True. Fail loud at the ceiling instead of spinning forever.
Handle the no-tool case. Empty tool_calls is the exit, not an error. Branch on it every turn.
Validate every dispatch. Whitelist tool names, parse arguments, check types — and feed errors back to the model instead of crashing.
Budget the run. Read usage each turn, price it against published rates, stop at a hard spend ceiling. Watch the growing context, and cache the static prefix to keep the re-sent cost down.

Get those four right and you have an agent that does real multi-step work, recovers from its own mistakes, and has a worst-case cost you chose rather than discovered on an invoice. Start from the tool-use guide if you need the single-call mechanics first, then wrap it in the loop and the four guardrails above.

Building an AI agent loop: tools, memory and safe iteration
