Build an effective agent harness

An agent is a loop around a model, and the loop is the easy part. This is a deep dive into the part that actually matters: the guardrails, tracing, and control flow that let you leave it running.

By Mohit Mittal · Jun 9, 2026 · intermediate · 16 min

Python · Anthropic API · stdlib

#agents #harness #patterns #reliability

📦 Runnable code for this recipe: mohitkmittal/applied-genai-cookbook/tree/main/src/agcookbook/harness

flowchart TD
  U[User goal] --> H{{Harness}}
  H --> M[Call model]
  M -->|final text| O[Answer + trace]
  M -->|tool_use| GR{Guardrail gate}
  GR -->|budget ok · valid · not looping| T[Dispatch tool]
  GR -->|tripwire| O
  T --> R[Tool result]
  R --> H
  classDef stop fill:#f5e7e3,stroke:#b1361e;
  class O stop;

The harness is a loop: call the model; if it wants a tool, pass through a guardrail gate before dispatching; feed the result back; stop on a final answer or a tripwire.

TL;DR

An “agent” is mostly a loop around a model that can call tools. The reliable core is about 80 lines of Python.
The loop is trivial. The value is in everything wrapped around it: a step budget, per-tool timeouts, bounded retries, loop detection, and a complete trace of every decision.
Build this yourself before you reach for a framework. The frameworks are generalizations of exactly this loop, and you’ll choose one far better once you’ve felt where the simple version hurts.

The problem: the demo-to-production cliff

Here is the demo everyone builds first. You give the model a calculator tool, ask “what’s 17% of 8,400?”, the model emits a tool_use block, you run the function, paste the result back, and it answers. It feels like magic and it took twenty minutes.

Then you point it at something real — “reconcile these two CSVs and flag the discrepancies” — and the wheels come off in ways that are individually obvious and collectively fatal:

It calls a tool with an argument that doesn’t match the schema, the function throws, and your whole process crashes.
It gets a confusing tool result, tries the same call again, gets the same result, tries again… forever. You notice when the API bill arrives.
It produces a beautiful plan that requires nine tool calls, and somewhere around call six it quietly forgets the original goal.
Something fails in the middle and you have no idea what the model saw, what it decided, or why — because nothing was recorded.

None of these are model-intelligence problems. They’re control problems. The gap between “calls a tool once” and “runs a multi-step task unattended” is precisely the harness, and it’s where most of the engineering in a real agent actually lives.

What an agent actually is

Strip away the vocabulary and an agent is a loop with three moves: ask the model what to do next, do it, tell the model what happened. Repeat until the model says it’s done.

flowchart LR
    A[Model proposes next action] --> B{Action type?}
    B -->|tool call| C[Harness executes tool]
    C --> D[Append result to conversation]
    D --> A
    B -->|final answer| E[Done]
    classDef done fill:#f5e7e3,stroke:#b1361e;
    class E done;

The agent loop as a cycle: the model proposes an action, the harness executes a tool, the result is appended to the conversation, and the model is called again.

That’s the whole conceptual model. The conversation is an append-only transcript: user goal, then alternating model turns and tool results, growing until the model stops asking for tools. Everything else in this article is about making that loop safe and observable — because the naive version above will absolutely run forever, crash on the first bad argument, and leave you blind when it does.
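
To make the append-only transcript concrete, here is a sketch of how the message list might grow for the calculator demo. The block shapes follow the Anthropic Messages API convention of `tool_use` and `tool_result` content blocks; the tool name `calculator`, the id `toolu_01`, and the `expression` argument are illustrative assumptions, not taken from the cookbook.

```python
# Hypothetical transcript for the calculator demo; ids and names are illustrative.
transcript = [
    {"role": "user", "content": "What's 17% of 8,400?"},
]

# Model turn: it asks for a tool instead of answering.
transcript.append({
    "role": "assistant",
    "content": [{"type": "tool_use", "id": "toolu_01", "name": "calculator",
                 "input": {"expression": "0.17 * 8400"}}],
})

# Harness turn: the tool result goes back as a user message tied to the call id.
transcript.append({
    "role": "user",
    "content": [{"type": "tool_result", "tool_use_id": "toolu_01",
                 "content": "1428"}],
})

# Model turn: no more tool requests, so the loop ends here.
transcript.append({"role": "assistant", "content": "17% of 8,400 is 1,428."})

roles = [m["role"] for m in transcript]
```

Nothing is ever edited in place: each turn only appends, which is what makes the transcript replayable later.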

The forces you’re balancing

Before writing code, name the tensions. Every design choice in a harness is a trade between four forces, and being explicit about them is what separates a toy from a tool.

Force	Pulls toward	In tension with
Autonomy	letting the model choose its own steps	predictability and safety
Control	hard limits, validation, stop conditions	flexibility and “just let it work”
Cost / latency	fewer steps, smaller context, cheaper models	thoroughness and reliability
Observability	record everything, replay anything	code simplicity and speed

A prompt can nudge the first force. Only the harness can enforce the rest. If your reliability strategy is “ask the model nicely in the system prompt to not loop forever,” you don’t have a reliability strategy.

Architecture

Four parts, with a gate in the middle. The model decides; the guardrail gate decides whether to honor that decision; the tool layer executes; the trace records all of it.

flowchart TD
    subgraph Harness
      L[Control loop]
      G[Guardrails: budget, timeout, retries, loop detection]
      TR[(Trace)]
    end
    L <-->|messages| MODEL[Model client: any provider]
    L --> G
    G --> REG[Tool registry: schema + callable]
    REG --> EXT[(Your code / APIs)]
    L -.records.-> TR
    G -.records.-> TR
    REG -.records.-> TR

Components: the loop calls the model, routes tool requests through guardrails to the tool registry, and writes every event to a trace.

The most important architectural decision is the one that’s easy to miss: the model never touches your code directly. It emits a request to call a tool by name; the harness owns the mapping from that name to an actual function, validates the arguments, and decides whether to run it. The model proposes, the harness disposes.
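
A minimal sketch of the "validates the arguments" step, assuming you want to stay stdlib-only. The helper name `validate_args` is hypothetical, and it checks only a small subset of JSON Schema (`required`, `properties`, and simple string `type` values); a full validator would cover far more. Rejecting unexpected arguments is a stricter-than-spec design choice made here on purpose.

```python
from typing import Any

# Maps JSON Schema type names to Python types (a deliberately small subset).
_TYPES = {"string": str, "integer": int, "number": (int, float),
          "boolean": bool, "array": list, "object": dict}

def validate_args(schema: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the args look valid."""
    problems = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in props:
            problems.append(f"unexpected argument: {name}")
            continue
        t = props[name].get("type")
        expected = _TYPES.get(t) if isinstance(t, str) else None
        if expected is None:
            continue  # no simple type constraint to check
        # bool is a subclass of int in Python, so reject it for numeric fields.
        is_bool_as_number = isinstance(value, bool) and expected is not bool
        if not isinstance(value, expected) or is_bool_as_number:
            problems.append(f"{name}: expected {t}")
    return problems

schema = {"type": "object",
          "properties": {"email": {"type": "string"},
                         "limit": {"type": "integer"}},
          "required": ["email"]}
```

Returning a list of problems instead of raising makes it easy to hand the whole list back to the model as a tool error, so it can correct every argument in one retry.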

Building it, piece by piece

1. Tools are data plus a callable

A tool is two things glued together: a JSON schema the model sees (so it knows the tool exists and how to call it) and a Python callable the harness owns (the thing that actually runs). Keep them together so they can’t drift apart.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    description: str
    input_schema: dict[str, Any]   # JSON Schema — what the MODEL sees
    fn: Callable[..., Any]          # the implementation — what WE run

    @property
    def schema(self) -> dict:
        return {"name": self.name, "description": self.description,
                "input_schema": self.input_schema}

The description and schema are prompt engineering in disguise: they’re the only thing the model knows about your tool. Vague descriptions produce vague tool use. “Search the database” invites garbage; “Search customer records by exact email; returns up to 10 matches” gets you precision.
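
Here is what a sharply described tool might look like in practice. The `Tool` dataclass is repeated so the block runs on its own; `search_customers`, its in-memory `fake_db`, and the schema contents are stand-ins invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:  # same shape as the Tool dataclass shown above
    name: str
    description: str
    input_schema: dict[str, Any]
    fn: Callable[..., Any]

    @property
    def schema(self) -> dict:
        return {"name": self.name, "description": self.description,
                "input_schema": self.input_schema}

def search_customers(email: str) -> list[dict]:
    """Stand-in implementation; a real one would query your datastore."""
    fake_db = [{"email": "a@b.com", "name": "Ada"}]
    return [row for row in fake_db if row["email"] == email][:10]

SEARCH = Tool(
    name="search_customers",
    description="Search customer records by exact email; returns up to 10 matches.",
    input_schema={
        "type": "object",
        "properties": {"email": {"type": "string",
                                 "description": "Exact email address to match"}},
        "required": ["email"],
    },
    fn=search_customers,
)

# The registry is just a dict keyed by tool name.
tools = {SEARCH.name: SEARCH}
```

Note that `schema` never exposes `fn`: the model sees the name, description, and input schema, and nothing about the implementation.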

2. The control loop

Here is the heart of it — the loop that turns a model into an agent. Read it once; the rest of the article is just hardening each line.

def run(goal, tools, model, *, guardrails=None, system=None):
    g = guardrails or Guardrails()
    trace = Trace()
    # The Anthropic Messages API takes the system prompt as a top-level
    # parameter, so the model client must lift this message out before sending.
    messages = ([{"role": "system", "content": system}] if system else [])
    messages.append({"role": "user", "content": goal})
    schemas = [t.schema for t in tools.values()]

    try:
        for step in range(g.max_steps + 1):
            g.check_step_budget(step)              # raises TripwireError once step == max_steps
            reply = model.call(messages, schemas)
            trace.record_model(reply)

            if reply.stop_reason != "tool_use":
                return Result(answer=reply.text, trace=trace)   # done

            messages.append(reply.as_assistant_message())
            for call in reply.tool_calls:
                result = dispatch(call, tools, g, trace)         # validated + guarded
                messages.append(tool_result(call, result))
    except TripwireError as e:
        # A tripwire ends the run but still hands back the trace for diagnosis.
        return Result(answer=None, trace=trace, stopped=str(e))

3. Guardrails — where reliability actually comes from

This is the section that matters. Each guardrail addresses one specific failure mode from the cliff above.

The step budget. The single most important line of code in any agent. Without it, one bad reasoning chain bills you for thousands of model calls.

def check_step_budget(self, step: int) -> None:
    if step >= self.max_steps:
        raise TripwireError(f"max_steps ({self.max_steps}) reached")
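
The method above assumes a guardrails object that carries the limits, plus a `TripwireError` exception. A minimal sketch of both follows. The field names match the ones used in this article's snippets; the default values are illustrative assumptions, not the cookbook's settings.

```python
from dataclasses import dataclass, field

class TripwireError(RuntimeError):
    """Raised when a guardrail decides the run must end."""

@dataclass
class Guardrails:
    max_steps: int = 8               # illustrative default
    tool_timeout_s: float = 10.0     # illustrative default
    max_retries: int = 2             # illustrative default
    retry_backoff_s: float = 0.5     # illustrative default
    loop_repeat_threshold: int = 2   # illustrative default
    _seen: dict = field(default_factory=dict, repr=False)  # call fingerprints

    def check_step_budget(self, step: int) -> None:
        if step >= self.max_steps:
            raise TripwireError(f"max_steps ({self.max_steps}) reached")

g = Guardrails(max_steps=3)
for step in range(3):
    g.check_step_budget(step)        # steps 0, 1, 2 are within budget

try:
    g.check_step_budget(3)           # step 3 is over budget
    tripped = None
except TripwireError as e:
    tripped = str(e)
```

Keeping every limit on one object means a run's safety envelope can be logged, compared, and overridden per call in a single place.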

Per-tool timeout and bounded retries. Tools call networks, databases, and other flaky things. A hung tool shouldn’t hang the agent, and a transient failure shouldn’t kill the run. Wrap each call:

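import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_tool(self, fn, args):
    last_err = None
    for attempt in range(self.max_retries + 1):
        # Deliberately not a `with` block: its exit calls shutdown(wait=True),
        # which would block on a hung tool and defeat the timeout.
        ex = ThreadPoolExecutor(max_workers=1)
        try:
            return ex.submit(lambda: fn(**args)).result(timeout=self.tool_timeout_s)
        except FutureTimeout:
            last_err = TimeoutError("tool exceeded timeout")
        except Exception as e:
            last_err = e
        finally:
            ex.shutdown(wait=False)   # abandon a hung worker; threads cannot be killed
        if attempt < self.max_retries:
            time.sleep(self.retry_backoff_s * (2 ** attempt))   # exponential backoff
    raise last_err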
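
One caveat with thread-based timeouts is worth seeing in isolation. `ThreadPoolExecutor` used as a context manager waits for its workers on exit, so a timeout only frees the caller if the executor is shut down with `wait=False`. The sketch below shows the timeout half as a free function; `call_with_timeout` and `slow_tool` are hypothetical names. The abandoned thread keeps running in the background and may still delay interpreter exit, so for tools that can truly hang, a subprocess is the safer boundary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def call_with_timeout(fn, args: dict, timeout_s: float):
    """Run fn(**args) in a worker thread; raise TimeoutError if it takes too long."""
    ex = ThreadPoolExecutor(max_workers=1)
    try:
        return ex.submit(lambda: fn(**args)).result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"tool exceeded {timeout_s}s timeout") from None
    finally:
        # wait=False returns immediately; the worker is abandoned, not killed.
        ex.shutdown(wait=False)

def slow_tool(seconds: float) -> str:
    """Stand-in for a flaky network or database call."""
    time.sleep(seconds)
    return "done"

start = time.monotonic()
try:
    call_with_timeout(slow_tool, {"seconds": 0.5}, timeout_s=0.05)
    timed_out = False
except TimeoutError:
    timed_out = True
elapsed = time.monotonic() - start   # close to the timeout, not the tool's 0.5s
```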

Loop detection. The subtlest failure. The model gets stuck calling the same tool with the same arguments, expecting a different result. Catch it by fingerprinting calls:

def check_loop(self, name, args):
    key = (name, repr(sorted(args.items())))
    self._seen[key] = self._seen.get(key, 0) + 1
    if self._seen[key] > self.loop_repeat_threshold:
        raise TripwireError("loop detected: same tool, same args")
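
To see the fingerprint trip, here is a standalone sketch of the same idea. The `LoopDetector` class is a hypothetical repackaging of `check_loop`, and it swaps `repr(sorted(...))` for `json.dumps(..., sort_keys=True)`, an assumption made so that nested argument dicts are also canonicalized.

```python
import json

class TripwireError(RuntimeError):
    """Raised when a guardrail decides the run must end."""

class LoopDetector:
    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self._seen: dict[tuple[str, str], int] = {}

    def check(self, name: str, args: dict) -> None:
        # sort_keys makes key order irrelevant, including inside nested dicts;
        # default=repr keeps non-JSON values from raising.
        key = (name, json.dumps(args, sort_keys=True, default=repr))
        self._seen[key] = self._seen.get(key, 0) + 1
        if self._seen[key] > self.threshold:
            raise TripwireError(
                f"loop detected: {name} called {self._seen[key]} times with the same args")

d = LoopDetector(threshold=2)
d.check("search", {"q": "x", "page": 1})   # 1st identical call: fine
d.check("search", {"page": 1, "q": "x"})   # 2nd, different key order: same fingerprint
d.check("search", {"q": "y", "page": 1})   # different args: separate counter

try:
    d.check("search", {"q": "x", "page": 1})   # 3rd identical call exceeds threshold
    loop_error = None
except TripwireError as e:
    loop_error = str(e)
```

With a threshold of 2, the model gets one legitimate retry of the same call before the third attempt ends the run.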

The dispatch path threads these together. Crucially, note how a tool error differs from a tripwire: an error is fed back to the model as a tool result so it can recover; a tripwire ends the run.

flowchart TD
    S[Tool call requested] --> K{Known tool?}
    K -->|no| FB[Feed error back, model retries]
    K -->|yes| LP{Loop / budget ok?}
    LP -->|no| TW[Tripwire: end run]
    LP -->|yes| EX[Execute with timeout + retries]
    EX -->|ok| RS[Return result]
    EX -->|raises| FB
    FB --> CONT[Continue loop]
    RS --> CONT
    classDef stop fill:#f5e7e3,stroke:#b1361e;
    class TW stop;

Dispatch decision flow: unknown tools and errors are fed back to the model to recover; budget and loop violations trip the wire and end the run.
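
The decision flow above can be sketched as a single function. This is an illustration under assumptions rather than the cookbook's `dispatch`: the return shape (`is_error` plus `content`) is invented here, and `_Trace` and `_Guard` are bare stand-ins so the block runs on its own.

```python
from types import SimpleNamespace

class TripwireError(RuntimeError):
    """Ends the run; never fed back to the model."""

def dispatch(call, tools, g, trace):
    """Return a result dict for the model; let TripwireError propagate."""
    tool = tools.get(call.name)
    if tool is None:
        # Unknown tool: recoverable, so tell the model what does exist.
        trace.record("tool_error", name=call.name, error="unknown tool")
        return {"is_error": True,
                "content": f"unknown tool {call.name!r}; available: {sorted(tools)}"}

    g.check_loop(call.name, call.args)   # outside the try: a tripwire must escape

    try:
        result = g.run_tool(tool.fn, call.args)
    except Exception as e:
        # Tool failure: recoverable, so feed a structured error back.
        trace.record("tool_error", name=call.name, error=repr(e))
        return {"is_error": True, "content": f"{type(e).__name__}: {e}"}

    trace.record("tool_result", name=call.name, result=result)
    return {"is_error": False, "content": result}

class _Trace:  # stand-in: collects events in a list
    def __init__(self):
        self.events = []
    def record(self, kind, **data):
        self.events.append((kind, data))

class _Guard:  # stand-in: no loop check, no timeout
    def check_loop(self, name, args):
        pass
    def run_tool(self, fn, args):
        return fn(**args)

tools = {"add": SimpleNamespace(fn=lambda a, b: a + b),
         "boom": SimpleNamespace(fn=lambda: 1 / 0)}
trace, g = _Trace(), _Guard()
ok = dispatch(SimpleNamespace(name="add", args={"a": 2, "b": 3}), tools, g, trace)
unknown = dispatch(SimpleNamespace(name="nope", args={}), tools, g, trace)
failed = dispatch(SimpleNamespace(name="boom", args={}), tools, g, trace)
```

The placement of `check_loop` outside the `try` is the design point: anything caught inside the `try` becomes feedback for the model, while a tripwire has to escape to the control loop.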

4. Tracing — your debugger and your eval set

Every model turn, tool call, argument, result, and timing goes into an append-only trace. This is non-negotiable for two reasons: when something goes wrong it’s the only way to see what happened, and over time your traces become the regression suite you test future changes against.

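trace.record("tool_call", name=call.name, args=call.args)
t0 = time.perf_counter()                              # needs `import time`
# ... execute the tool ...
elapsed = round((time.perf_counter() - t0) * 1000)    # wall-clock milliseconds
trace.record("tool_result", name=call.name, result=result, ms=elapsed)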

A pretty-printed trace from a real run reads like a transcript of the agent’s reasoning:

model -> calls: search_orders
  search_orders({'email': 'a@b.com'})
  -> [order #1021, #1044]   (84ms)
model -> calls: refund
  refund({'order_id': 1044, 'amount': 29.0})
  -> {'status': 'ok'}       (210ms)
model -> "I found two orders and refunded #1044 as requested."

When a customer says “the agent did something weird,” this trace is the difference between a five-minute diagnosis and an afternoon of guessing.
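
A trace does not need to be elaborate to be useful. The sketch below is one possible shape, assuming a list of dict events with a `record(kind, **data)` method like the calls shown above; `to_jsonl` and `pretty` are hypothetical helpers, and the cookbook's `trace.py` may differ.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Trace:
    events: list[dict[str, Any]] = field(default_factory=list)

    def record(self, kind: str, **data: Any) -> None:
        # Append-only: events are never edited or removed.
        self.events.append({"t": time.time(), "kind": kind, **data})

    def to_jsonl(self) -> str:
        # One JSON object per line; default=repr keeps odd values serializable.
        return "\n".join(json.dumps(e, default=repr) for e in self.events)

    def pretty(self) -> str:
        lines = []
        for e in self.events:
            if e["kind"] == "tool_call":
                lines.append(f"  {e['name']}({e['args']})")
            elif e["kind"] == "tool_result":
                lines.append(f"  -> {e['result']}   ({e['ms']}ms)")
            else:
                lines.append(e["kind"])
        return "\n".join(lines)

tr = Trace()
tr.record("tool_call", name="search_orders", args={"email": "a@b.com"})
tr.record("tool_result", name="search_orders", result=[1021, 1044], ms=84)
```

Writing `to_jsonl()` output to a file per run gives you both the debugging record and, over time, the raw material for regression cases.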

Testing a harness without spending a cent

Because the model is behind a ModelClient protocol, you can test all of your control logic deterministically with a scripted fake — no API key, no flakiness, no cost. This is how you write fast tests for agent behavior.

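# ADD is a Tool as defined earlier; nothing here touches the network.
ADD = Tool("add", "Add two integers and return the sum.",
           {"type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"]},
           lambda a, b: a + b)

script = [
    ModelReply("tool_use", tool_calls=[ToolCall("t1", "add", {"a": 2, "b": 3})]),
    ModelReply("end_turn", text="5"),
]
result = run("2+3?", {"add": ADD}, ScriptedModel(script))
assert result.answer == "5"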

You can script a model that loops forever and assert your loop-detection trips; script one that calls an unknown tool and assert the error is fed back, not fatal. The cookbook’s test suite does exactly this — every guardrail has a test, all green, all offline.
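
For readers who want to try the scripted-fake idea without the rest of the harness, here is a self-contained sketch. The field layout of `ModelReply` and `ToolCall` is inferred from the snippet above, and `mini_run` is a deliberately reduced stand-in for `run` (no trace, no loop detection, plain callables as tools), not the cookbook implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass
class ToolCall:
    id: str
    name: str
    args: dict[str, Any]

@dataclass
class ModelReply:
    stop_reason: str
    text: str = ""
    tool_calls: list[ToolCall] = field(default_factory=list)

class ModelClient(Protocol):
    def call(self, messages: list[dict], schemas: list[dict]) -> ModelReply: ...

class ScriptedModel:
    """Replays canned replies in order and records what it was shown."""
    def __init__(self, script: list[ModelReply]):
        self._script = list(script)
        self.seen: list[list[dict]] = []

    def call(self, messages, schemas):
        self.seen.append(list(messages))   # snapshot of the transcript so far
        if not self._script:
            raise AssertionError("script exhausted: model called too many times")
        return self._script.pop(0)

def mini_run(goal: str, tools: dict, model: ModelClient, max_steps: int = 5):
    """Reduced loop; returns (answer, stopped_reason)."""
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = model.call(messages, [])
        if reply.stop_reason != "tool_use":
            return reply.text, None
        messages.append({"role": "assistant", "content": reply.tool_calls})
        for call in reply.tool_calls:
            fn = tools.get(call.name)
            try:
                if fn is None:
                    raise KeyError(f"unknown tool: {call.name}")
                content, is_error = fn(**call.args), False
            except Exception as e:
                # Fed back to the model, not fatal.
                content, is_error = f"{type(e).__name__}: {e}", True
            messages.append({"role": "user", "tool_use_id": call.id,
                             "content": content, "is_error": is_error})
    return None, "max_steps"

add = lambda a, b: a + b
call_add = ModelReply("tool_use", tool_calls=[ToolCall("t1", "add", {"a": 2, "b": 3})])

happy = mini_run("2+3?", {"add": add},
                 ScriptedModel([call_add, ModelReply("end_turn", text="5")]))

confused = ScriptedModel([
    ModelReply("tool_use", tool_calls=[ToolCall("t1", "missing", {})]),
    ModelReply("end_turn", text="sorry"),
])
recovered = mini_run("?", {"add": add}, confused)

runaway = mini_run("?", {"add": add},
                   ScriptedModel([call_add, call_add, call_add]), max_steps=3)
```

Because `ScriptedModel` records every transcript it was shown, a test can assert on what the model saw (for example, that the unknown-tool error reached it) as well as on the final answer.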

Trade-offs: when to graduate to a framework

Rolling your own keeps the loop legible, debuggable, and dependency-free — which matters most early, when you’re still learning what your agent needs to do. But there’s a real point where a framework earns its complexity. Here’s how to think about the crossover:

Signal	Stay with your harness	Reach for a framework
Number of tools	a handful, hand-wired	dozens, needs routing
Run shape	single linear loop	branching, parallel sub-agents
Durability	completes in one process	must pause/resume across restarts
Team	you, reading the code	many contributors
State	conversation in memory	persisted, queryable, replayable

Pitfalls

The recurring mistakes, each a sentence so you can scan them:

No step budget — the number-one cause of runaway cost; set it before anything else.
Trusting tool arguments — always validate against the schema before executing, because the model will eventually send you garbage.
Swallowing tool errors silently — feed a structured error back so the model can recover, but count failures toward a tripwire so it can’t retry forever.
One giant tool that does everything — the model can’t reason about a tool whose behavior depends on a mode argument with twelve values; prefer several small, sharply-described tools.
No trace — without it, every production failure is unreproducible and every regression is invisible.
Letting context grow unbounded — long runs blow the context window; summarize or prune old tool results once they’re no longer needed.
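
The last pitfall is the least obvious to implement, so here is one possible pruning sketch. It assumes Anthropic-style messages in which tool results are content blocks of type `tool_result`; the function name `prune_tool_results`, the `keep_last` parameter, and the placeholder text are all hypothetical. It shrinks old results instead of deleting them, so every `tool_use` keeps its matching `tool_result`.

```python
def prune_tool_results(messages: list[dict], keep_last: int = 3,
                       placeholder: str = "[older tool result elided]") -> list[dict]:
    """Return a copy with all but the newest `keep_last` tool results stubbed out."""
    # Locate every tool_result block as (message index, block index).
    positions = [(i, j)
                 for i, m in enumerate(messages)
                 if isinstance(m.get("content"), list)
                 for j, b in enumerate(m["content"])
                 if isinstance(b, dict) and b.get("type") == "tool_result"]
    stale = set(positions[:-keep_last]) if keep_last > 0 else set(positions)

    pruned = []
    for i, m in enumerate(messages):
        if not any(pos[0] == i for pos in stale):
            pruned.append(m)               # untouched messages are reused as-is
            continue
        blocks = [dict(b, content=placeholder) if (i, j) in stale else b
                  for j, b in enumerate(m["content"])]
        pruned.append({**m, "content": blocks})
    return pruned

messages = [
    {"role": "user", "content": "goal"},
    {"role": "assistant", "content": [{"type": "tool_use", "id": "t1", "name": "f", "input": {}}]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "t1", "content": "AAAA"}]},
    {"role": "assistant", "content": [{"type": "tool_use", "id": "t2", "name": "f", "input": {}}]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "t2", "content": "BBBB"}]},
]
out = prune_tool_results(messages, keep_last=1)
```

The function returns a new list and leaves the original untouched, so the full transcript can still go into the trace while the model sees the pruned copy.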

How to adopt this

A concrete path from wherever you are today:

Wrap your existing single model call in the run loop above.
Move each tool behind a Tool (schema + callable) and write sharp descriptions.
Add a max_steps budget and a per-tool timeout — today, before anything else.
Add bounded retries with exponential backoff around tool execution.
Record a structured trace of every model turn and tool call.
Add loop detection (trip when the same tool is called with the same args more than loop_repeat_threshold times) before you let it run unattended.
Write one scripted-model test per guardrail so behavior can’t silently regress.

References

This recipe is implemented end-to-end in the harness module of the cookbook — types.py, harness.py, guardrails.py, trace.py, and a runnable example, all stdlib-only with a passing test suite. It pairs naturally with spec-driven development: use a spec to decide what the agent should accomplish, then let this harness execute it with the acceptance checks running inside the loop.

Mohit Mittal

Writes Applied GenAI — practical recipes for building with generative AI. Code lives in the cookbook.