Applied GenAI

Build an effective agent harness

An agent is a loop around a model, and the loop is the easy part. This is a deep dive into the part that actually matters: the guardrails, tracing, and control flow that let you leave it running.

Python · Anthropic API · stdlib

#agents #harness #patterns #reliability

flowchart TD
  U[User goal] --> H{{Harness}}
  H --> M[Call model]
  M -->|final text| O[Answer + trace]
  M -->|tool_use| GR{Guardrail gate}
  GR -->|budget ok · valid · not looping| T[Dispatch tool]
  GR -->|tripwire| O
  T --> R[Tool result]
  R --> H
  classDef stop fill:#f5e7e3,stroke:#b1361e;
  class O stop;
The harness is a loop: call the model; if it wants a tool, pass through a guardrail gate before dispatching; feed the result back; stop on a final answer or a tripwire.

TL;DR

The problem: the demo-to-production cliff

Here is the demo everyone builds first. You give the model a calculator tool, ask “what’s 17% of 8,400?”, the model emits a tool_use block, you run the function, paste the result back, and it answers. It feels like magic and it took twenty minutes.

Then you point it at something real — “reconcile these two CSVs and flag the discrepancies” — and the wheels come off in ways that are individually obvious and collectively fatal:

None of these are model-intelligence problems. They’re control problems. The gap between “calls a tool once” and “runs a multi-step task unattended” is precisely the harness, and it’s where most of the engineering in a real agent actually lives.

What an agent actually is

Strip away the vocabulary and an agent is a loop with three moves: ask the model what to do next, do it, tell the model what happened. Repeat until the model says it’s done.

flowchart LR
    A[Model proposes next action] --> B{Action type?}
    B -->|tool call| C[Harness executes tool]
    C --> D[Append result to conversation]
    D --> A
    B -->|final answer| E[Done]
    classDef done fill:#f5e7e3,stroke:#b1361e;
    class E done;
The agent loop as a cycle: the model proposes an action, the harness executes a tool, the result is appended to the conversation, and the model is called again.

That’s the whole conceptual model. The conversation is an append-only transcript: user goal, then alternating model turns and tool results, growing until the model stops asking for tools. Everything else in this article is about making that loop safe and observable, because the naive version above has no step limit and can run forever, crashes on the first bad argument, and leaves you blind when it does.

The forces you’re balancing

Before writing code, name the tensions. Every design choice in a harness is a trade between four forces, and being explicit about them is what separates a toy from a tool.

| Force | Pulls toward | In tension with |
| --- | --- | --- |
| Autonomy | letting the model choose its own steps | predictability and safety |
| Control | hard limits, validation, stop conditions | flexibility and “just let it work” |
| Cost / latency | fewer steps, smaller context, cheaper models | thoroughness and reliability |
| Observability | record everything, replay anything | code simplicity and speed |

A prompt can nudge the first force. Only the harness can enforce the rest. If your reliability strategy is “ask the model nicely in the system prompt to not loop forever,” you don’t have a reliability strategy.

Architecture

Four parts, with a gate in the middle. The model decides; the guardrail gate decides whether to honor that decision; the tool layer executes; the trace records all of it.

flowchart TD
    subgraph Harness
      L[Control loop]
      G[Guardrails: budget, timeout, retries, loop detection]
      TR[(Trace)]
    end
    L <-->|messages| MODEL[Model client: any provider]
    L --> G
    G --> REG[Tool registry: schema + callable]
    REG --> EXT[(Your code / APIs)]
    L -.records.-> TR
    G -.records.-> TR
    REG -.records.-> TR
Components: the loop calls the model, routes tool requests through guardrails to the tool registry, and writes every event to a trace.

The most important architectural decision is the one that’s easy to miss: the model never touches your code directly. It emits a request to call a tool by name; the harness owns the mapping from that name to an actual function, validates the arguments, and decides whether to run it. The model proposes, the harness disposes.

Building it, piece by piece

1. Tools are data plus a callable

A tool is two things glued together: a JSON schema the model sees (so it knows the tool exists and how to call it) and a Python callable the harness owns (the thing that actually runs). Keep them together so they can’t drift apart.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    description: str
    input_schema: dict[str, Any]   # JSON Schema — what the MODEL sees
    fn: Callable[..., Any]          # the implementation — what WE run

    @property
    def schema(self) -> dict:
        return {"name": self.name, "description": self.description,
                "input_schema": self.input_schema}

The description and schema are prompt engineering in disguise: they’re the only thing the model knows about your tool. Vague descriptions produce vague tool use. “Search the database” invites garbage; “Search customer records by exact email; returns up to 10 matches” gets you precision.
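
To make the pairing concrete, here is a minimal sketch of one tool and the registry that holds it. The `add` function, the `ADD` constant, the description wording, and the `TOOLS` dict are illustrative assumptions, not code taken from the cookbook; the `Tool` dataclass is repeated only so the block runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    description: str
    input_schema: dict[str, Any]   # JSON Schema: what the model sees
    fn: Callable[..., Any]         # implementation: what the harness runs

    @property
    def schema(self) -> dict:
        return {"name": self.name, "description": self.description,
                "input_schema": self.input_schema}


def add(a: float, b: float) -> float:
    """The implementation. The model never sees this body, only the schema."""
    return a + b


# Hypothetical tool: the description says exactly what goes in and what comes out.
ADD = Tool(
    name="add",
    description="Add two numbers and return their sum. Use for exact arithmetic.",
    input_schema={
        "type": "object",
        "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
        "required": ["a", "b"],
    },
    fn=add,
)

# Assumed registry shape: a plain dict keyed by tool name.
TOOLS = {ADD.name: ADD}
```

Note that `schema` deliberately leaves out `fn`: the callable stays on the harness side of the boundary.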

2. The control loop

Here is the heart of it — the loop that turns a model into an agent. Read it once; the rest of the article is just hardening each line.

def run(goal, tools, model, *, guardrails=None, system=None):
    g = guardrails or Guardrails()
    trace = Trace()
    # Harness-internal message format; the model client adapts it per provider
    # (the Anthropic Messages API, for one, takes the system prompt as a separate parameter).
    messages = ([{"role": "system", "content": system}] if system else [])
    messages.append({"role": "user", "content": goal})
    schemas = [t.schema for t in tools.values()]

    try:
        for step in range(g.max_steps + 1):
            g.check_step_budget(step)              # raises TripwireError once step == max_steps
            reply = model.call(messages, schemas)
            trace.record_model(reply)

            if reply.stop_reason != "tool_use":
                return Result(answer=reply.text, trace=trace)   # done

            messages.append(reply.as_assistant_message())
            for call in reply.tool_calls:
                result = dispatch(call, tools, g, trace)         # validated + guarded
                messages.append(tool_result(call, result))
    except TripwireError as err:
        # A tripwire ends the run, but the caller still gets the trace for diagnosis.
        return Result(answer=None, trace=trace, stopped=str(err))
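
The loop leans on a few small types that are not shown here. The following is a sketch of what they might look like, inferred from how the loop and the test script later in the article use them. The field names, the neutral message shape, and the `"tool"` role are assumptions; a provider adapter would translate them into whatever the real API expects.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    id: str                      # id assigned by the model, echoed back with the result
    name: str
    args: dict[str, Any]


@dataclass
class ModelReply:
    stop_reason: str             # "tool_use", or a terminal reason such as "end_turn"
    text: Optional[str] = None
    tool_calls: list[ToolCall] = field(default_factory=list)

    def as_assistant_message(self) -> dict:
        # Assumed provider-neutral shape, not any specific API's wire format.
        return {"role": "assistant", "content": self.text or "",
                "tool_calls": [asdict(c) for c in self.tool_calls]}


@dataclass
class Result:
    answer: Optional[str]
    trace: Any
    stopped: Optional[str] = None    # set when the run ended without a final answer


def tool_result(call: ToolCall, result: Any) -> dict:
    """Wrap a tool's output so the model can match it to the call it made."""
    return {"role": "tool", "tool_call_id": call.id, "content": str(result)}
```

Carrying the call `id` through to the result is what lets the model pair each result with its request when it asks for several tools in one turn.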

3. Guardrails — where reliability actually comes from

This is the section that matters. Each guardrail addresses one specific failure mode from the cliff above.

The step budget. The single most important line of code in any agent. Without it, one bad reasoning chain bills you for thousands of model calls.

def check_step_budget(self, step: int) -> None:
    if step >= self.max_steps:
        raise TripwireError(f"max_steps ({self.max_steps}) reached")

Per-tool timeout and bounded retries. Tools call networks, databases, and other flaky things. A hung tool shouldn’t hang the agent, and a transient failure shouldn’t kill the run. Wrap each call:

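import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_tool(self, fn, args):
    last_err = None
    for attempt in range(self.max_retries + 1):
        # No `with` block here: leaving it calls shutdown(wait=True), which would
        # block on a hung tool and defeat the timeout.
        ex = ThreadPoolExecutor(max_workers=1)
        try:
            return ex.submit(lambda: fn(**args)).result(timeout=self.tool_timeout_s)
        except FutureTimeout:
            last_err = TimeoutError("tool exceeded timeout")
        except Exception as e:
            last_err = e
        finally:
            ex.shutdown(wait=False)   # abandon the worker; Python cannot kill the thread
        if attempt < self.max_retries:   # no point sleeping after the final attempt
            time.sleep(self.retry_backoff_s * (2 ** attempt))   # exponential backoff
    raise last_err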
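
To see both behaviors in isolation, here is a standalone sketch of the same wrapper as a free function, exercised against two made-up tools. The names `run_with_guard`, `flaky`, and `slow`, and the tiny timeout and backoff values, are assumptions chosen so the demo finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_with_guard(fn, args, *, timeout_s=1.0, max_retries=2, backoff_s=0.01):
    """Call fn(**args) with a per-attempt timeout and bounded retries."""
    last_err = None
    for attempt in range(max_retries + 1):
        ex = ThreadPoolExecutor(max_workers=1)
        try:
            return ex.submit(lambda: fn(**args)).result(timeout=timeout_s)
        except FutureTimeout:
            last_err = TimeoutError("tool exceeded timeout")
        except Exception as e:
            last_err = e
        finally:
            ex.shutdown(wait=False)      # do not wait on a worker that may be hung
        if attempt < max_retries:
            time.sleep(backoff_s * (2 ** attempt))
    raise last_err


attempts = {"n": 0}


def flaky(x):
    """Hypothetical tool that fails once, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise ConnectionError("transient")
    return x * 2


def slow():
    """Hypothetical tool that takes longer than its timeout."""
    time.sleep(0.3)
    return "late"


recovered = run_with_guard(flaky, {"x": 21})          # succeeds on the second attempt
try:
    run_with_guard(slow, {}, timeout_s=0.05, max_retries=0)
    timed_out = False
except TimeoutError:
    timed_out = True
```

One caveat worth keeping in mind: a timed-out call is abandoned, not cancelled. The worker thread keeps running until the tool returns, so tools with side effects may still complete after the harness has moved on.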

Loop detection. The subtlest failure. The model gets stuck calling the same tool with the same arguments, expecting a different result. Catch it by fingerprinting calls:

def check_loop(self, name, args):
    key = (name, repr(sorted(args.items())))
    self._seen[key] = self._seen.get(key, 0) + 1
    if self._seen[key] > self.loop_repeat_threshold:
        raise TripwireError("loop detected: same tool, same args")
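
The guardrail methods above assume an object that holds their limits and state. One way to sketch it is a dataclass like the following. The default values are illustrative assumptions, not recommendations from the cookbook, and should be tuned per workload.

```python
from dataclasses import dataclass, field


class TripwireError(Exception):
    """Raised when a guardrail decides the run must end."""


@dataclass
class Guardrails:
    # All defaults below are assumed values for illustration.
    max_steps: int = 10
    max_retries: int = 2
    tool_timeout_s: float = 30.0
    retry_backoff_s: float = 0.5
    loop_repeat_threshold: int = 2
    _seen: dict = field(default_factory=dict)   # call fingerprint -> times seen

    def check_step_budget(self, step: int) -> None:
        if step >= self.max_steps:
            raise TripwireError(f"max_steps ({self.max_steps}) reached")

    def check_loop(self, name: str, args: dict) -> None:
        # Sorting the items makes the fingerprint independent of key order.
        key = (name, repr(sorted(args.items())))
        self._seen[key] = self._seen.get(key, 0) + 1
        if self._seen[key] > self.loop_repeat_threshold:
            raise TripwireError("loop detected: same tool, same args")


g = Guardrails(loop_repeat_threshold=2)
g.check_loop("search", {"q": "x", "n": 1})
g.check_loop("search", {"n": 1, "q": "x"})        # same call, different key order
try:
    g.check_loop("search", {"q": "x", "n": 1})    # third identical call trips
    tripped = False
except TripwireError:
    tripped = True
```

Because `_seen` accumulates counts, an instance like this is per-run state: creating a fresh `Guardrails` for each run avoids one run's history tripping the next.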

The dispatch path threads these together. Crucially, note how a tool error differs from a tripwire: an error is fed back to the model as a tool result so it can recover; a tripwire ends the run.

flowchart TD
    S[Tool call requested] --> K{Known tool?}
    K -->|no| FB[Feed error back, model retries]
    K -->|yes| LP{Loop / budget ok?}
    LP -->|no| TW[Tripwire: end run]
    LP -->|yes| EX[Execute with timeout + retries]
    EX -->|ok| RS[Return result]
    EX -->|raises| FB
    FB --> CONT[Continue loop]
    RS --> CONT
    classDef stop fill:#f5e7e3,stroke:#b1361e;
    class TW stop;
Dispatch decision flow: unknown tools and errors are fed back to the model to recover; budget and loop violations trip the wire and end the run.

4. Tracing — your debugger and your eval set

Every model turn, tool call, argument, result, and timing goes into an append-only trace. This is non-negotiable for two reasons: when something goes wrong it’s the only way to see what happened, and over time your traces become the regression suite you test future changes against.

trace.record("tool_call", name=call.name, args=call.args)
# ... after execution ...
trace.record("tool_result", name=call.name, result=result, ms=elapsed)

A pretty-printed trace from a real run reads like a transcript of the agent’s reasoning:

model -> calls: search_orders
  search_orders({'email': 'a@b.com'})
  -> [order #1021, #1044]   (84ms)
model -> calls: refund
  refund({'order_id': 1044, 'amount': 29.0})
  -> {'status': 'ok'}       (210ms)
model -> "I found two orders and refunded #1044 as requested."

When a customer says “the agent did something weird,” this trace is the difference between a five-minute diagnosis and an afternoon of guessing.
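
A trace does not need much machinery. As a rough sketch, an append-only list of event dicts plus a formatter is enough to produce output like the transcript above. The event field names and the `pretty` method are assumptions, not the cookbook's `trace.py`.

```python
import time
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Any


@dataclass
class Trace:
    events: list[dict[str, Any]] = field(default_factory=list)

    def record(self, kind: str, **data: Any) -> None:
        # Append-only: events are added, never edited or removed.
        self.events.append({"kind": kind, "t": time.time(), **data})

    def record_model(self, reply: Any) -> None:
        self.record("model", stop_reason=reply.stop_reason, text=reply.text,
                    calls=[c.name for c in reply.tool_calls])

    def pretty(self) -> str:
        lines = []
        for e in self.events:
            if e["kind"] == "model" and e["calls"]:
                lines.append("model -> calls: " + ", ".join(e["calls"]))
            elif e["kind"] == "model":
                lines.append(f'model -> "{e["text"]}"')
            elif e["kind"] == "tool_call":
                lines.append(f'  {e["name"]}({e["args"]})')
            elif e["kind"] == "tool_result":
                lines.append(f'  -> {e["result"]}   ({e.get("ms", "?")}ms)')
        return "\n".join(lines)


# Replaying a short, made-up run through the trace.
trace = Trace()
trace.record_model(SimpleNamespace(
    stop_reason="tool_use", text=None, tool_calls=[SimpleNamespace(name="refund")]))
trace.record("tool_call", name="refund", args={"order_id": 1044})
trace.record("tool_result", name="refund", result={"status": "ok"}, ms=210)
trace.record_model(SimpleNamespace(
    stop_reason="end_turn", text="Refunded #1044.", tool_calls=[]))
print(trace.pretty())
```

Keeping events as plain dicts means a trace can be dumped to JSON lines and reloaded later, which is what makes old runs usable as regression fixtures.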

Testing a harness without spending a cent

Because the model is behind a ModelClient protocol, you can test all of your control logic deterministically with a scripted fake — no API key, no flakiness, no cost. This is how you write fast tests for agent behavior.

script = [
    ModelReply("tool_use", tool_calls=[ToolCall("t1", "add", {"a": 2, "b": 3})]),
    ModelReply("end_turn", text="5"),
]
result = run("2+3?", {"add": ADD}, ScriptedModel(script))
assert result.answer == "5"

You can script a model that loops forever and assert your loop-detection trips; script one that calls an unknown tool and assert the error is fed back, not fatal. The cookbook’s test suite does exactly this — every guardrail has a test, all green, all offline.
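
The fakes themselves are only a few lines. Here is a sketch of what `ModelClient` and `ScriptedModel` might look like, plus a `LoopingModel` for exercising loop detection. The protocol signature is inferred from the `model.call(messages, schemas)` usage in the control loop; `LoopingModel` and the `seen` attribute are hypothetical additions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Protocol


@dataclass
class ToolCall:
    id: str
    name: str
    args: dict[str, Any]


@dataclass
class ModelReply:
    stop_reason: str
    text: Optional[str] = None
    tool_calls: list[ToolCall] = field(default_factory=list)


class ModelClient(Protocol):
    """Anything with this method can drive the harness: a real API or a fake."""
    def call(self, messages: list[dict], schemas: list[dict]) -> ModelReply: ...


class ScriptedModel:
    """Fake model that replays canned replies in order and logs what it was sent."""

    def __init__(self, script: list[ModelReply]):
        self._script = list(script)
        self.seen: list[list[dict]] = []      # snapshot of the messages on each call

    def call(self, messages, schemas):
        self.seen.append(list(messages))
        if not self._script:
            raise AssertionError("script exhausted: harness called the model too often")
        return self._script.pop(0)


class LoopingModel:
    """Fake model that requests the same tool call forever."""

    def call(self, messages, schemas):
        return ModelReply("tool_use",
                          tool_calls=[ToolCall("t", "add", {"a": 1, "b": 1})])
```

Recording `seen` lets a test assert on what the harness sent back, for example that an unknown-tool error actually reached the model as a tool result.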

Trade-offs: when to graduate to a framework

Rolling your own keeps the loop legible, debuggable, and dependency-free — which matters most early, when you’re still learning what your agent needs to do. But there’s a real point where a framework earns its complexity. Here’s how to think about the crossover:

| Signal | Stay with your harness | Reach for a framework |
| --- | --- | --- |
| Number of tools | a handful, hand-wired | dozens, needs routing |
| Run shape | single linear loop | branching, parallel sub-agents |
| Durability | completes in one process | must pause/resume across restarts |
| Team | you, reading the code | many contributors |
| State | conversation in memory | persisted, queryable, replayable |

Pitfalls

The recurring mistakes, each a sentence so you can scan them:

How to adopt this

A concrete path from wherever you are today:

References

This recipe is implemented end-to-end in the harness module of the cookbook — types.py, harness.py, guardrails.py, trace.py, and a runnable example, all stdlib-only with a passing test suite. It pairs naturally with spec-driven development: use a spec to decide what the agent should accomplish, then let this harness execute it with the acceptance checks running inside the loop.

Mohit Mittal
Writes Applied GenAI — practical recipes for building with generative AI. Code lives in the cookbook.