Spec-driven development with AI agents

Code is cheap now; intent is the bottleneck. Write a tight spec, let the agent implement against it, and gate the work with executable acceptance checks — so you review intent instead of archaeology.

By Mohit Mittal · Jun 9, 2026 · intermediate · 14 min

Markdown · Python · Any coding agent

#sdd #workflow #patterns #review

📦 Runnable code for this recipe: mohitkmittal/applied-genai-cookbook/tree/main/src/agcookbook/sdd

flowchart LR
  I[Intent] --> S[SPEC.md<br/>goal · non-goals<br/>acceptance criteria]
  S --> R1{Human reviews<br/>the spec}
  R1 -->|approved| P[Agent drafts plan]
  P --> R2{Human skims<br/>the plan}
  R2 -->|approved| C[Agent implements]
  C --> A{Acceptance<br/>checks green?}
  A -->|no| C
  A -->|yes| D[Done · spec becomes docs]
  classDef stop fill:#f5e7e3,stroke:#b1361e;
  class D stop;

Intent becomes a spec the human reviews, then a plan the human skims, then implementation that loops against executable acceptance checks until green, ending with the spec as living documentation.

TL;DR

When an agent can write code faster than you can read it, the artifact you review has to change. Review the spec, not the diff.
Encode acceptance criteria as executable checks so “done” is a green command, not a matter of opinion or a tired skim at 6pm.
The spec you write up front becomes the living documentation the moment the work ships. You get docs for free, as a side effect of doing the work correctly.

The problem: confident, plausible, wrong

Hand a capable coding agent a one-line task — “add rate limiting to the ingest endpoint” — and it will return a large, polished, plausible diff in under a minute. The code compiles. The variable names are good. It looks like work a senior engineer would be proud of.

Then you start reading, and the questions pile up. Is the limit per API key or global? It chose global; you needed per-key. Does it persist across restarts, or reset? It used an in-memory counter; you have three instances behind a load balancer. What happens on the 101st request — a 429, or a silent drop? You can’t tell without reading the implementation closely. None of these decisions were wrong, exactly — they were guessed, because you never stated them.

So now you’re doing the most expensive kind of work there is: reverse-engineering intent from an implementation. You’re reading 300 lines of code to reconstruct decisions that could have been three lines of spec. Review has collapsed into archaeology, and you’d have been faster writing it yourself. That’s the trap, and faster agents make it worse, not better.

The core insight: mismatched strengths

The reason spec-driven development works is that humans and agents are good at opposite things, and the naive workflow pits each against its weakness.

Task	Humans	Agents
Producing lots of correct code quickly	slow	fast
Judging whether intent is right	good	unreliable
Reading code to infer intent	slow, error-prone	fast but overconfident
Holding a precise spec and not deviating	good	good, if given one

The naive “just prompt and review the diff” workflow asks the human to do the thing they’re worst at (reading code to infer intent) and asks the agent to do the thing it’s worst at (guessing unstated intent). Spec-driven development flips both: the human reviews a short spec (their strength), the agent implements against it (its strength), and executable checks settle correctness without a human reading code at all.

flowchart TD
    subgraph Naive["Naive: prompt then diff"]
      direction TB
      N1[Human writes one-line prompt] --> N2[Agent guesses the details]
      N2 --> N3[Human reads 300 lines to infer intent]
    end
    subgraph SDD["Spec-driven"]
      direction TB
      D1[Human writes short spec] --> D2[Agent implements exactly that]
      D2 --> D3[Checks prove correctness]
    end

The naive flow asks each party to use its weakness; the spec-driven flow lets each play to its strength.

The four artifacts

Spec-driven development is just four artifacts produced in order, with a human review gate after the spec and another after the plan, the two cheap, high-leverage stages.

Intent is the rough idea in your head (“we keep getting hammered by one client”).

The spec turns intent into something reviewable: goal, non-goals, constraints, and acceptance criteria. One page, no code.

The plan is the agent’s reading of the spec, expressed as steps and files to touch. Reviewing the plan is where you catch misunderstandings before a single line is written — it’s the cheapest possible place to course-correct.

The implementation is the code, produced against the plan and gated by the acceptance checks.

Anatomy of a good spec

A spec is short by design. If it’s longer than a page, the task is too big and should be split. Four sections, each earning its place:

# SPEC: rate-limit the /ingest endpoint

## Goal
Reject more than 100 requests/min per API key with HTTP 429.

## Non-goals
- Distributed rate limiting across regions (a single Redis instance is fine for now).
- Configurable limits per customer tier (future work).

## Constraints
- No new infrastructure; use the existing Redis instance.
- Must add under 5ms p99 latency to the endpoint.

## Acceptance criteria
- [ ] 101st request in a minute for one key returns 429  (check: pytest -k test_rate_limit_trips)
- [ ] Limit is per-key, not global                       (check: pytest -k test_per_key_isolation)
- [ ] 429 response includes a Retry-After header          (check: pytest -k test_retry_after)

Non-goals are the secret weapon. They’re how you stop the agent from gold-plating — from building the distributed, per-tier, infinitely-configurable version when you needed the simple one. Stating what you don’t want is often more valuable than stating what you do.
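Each acceptance criterion in the spec above names a pytest selector. As a rough sketch of what those three checks might look like, the block below tests a hypothetical in-memory `FixedWindowLimiter` standing in for the Redis-backed limiter the spec calls for. The class, its `handle` method, the `(status, headers)` return shape, and the choice of a fixed window rather than a sliding one are all assumptions made for illustration. None of them come from the cookbook.

```python
import math
from collections import defaultdict

LIMIT = 100     # requests allowed per window, taken from the spec's goal
WINDOW_S = 60   # window length in seconds


class FixedWindowLimiter:
    """Hypothetical in-memory stand-in for a Redis-backed limiter."""

    def __init__(self, limit=LIMIT, window_s=WINDOW_S):
        self.limit = limit
        self.window_s = window_s
        self.counts = defaultdict(int)  # (api_key, window index) -> hits

    def handle(self, api_key, now):
        """Return (status, headers) for one request at time `now` in seconds."""
        window = int(now // self.window_s)
        self.counts[(api_key, window)] += 1
        if self.counts[(api_key, window)] > self.limit:
            # Seconds until the next window opens, rounded up.
            retry_after = math.ceil((window + 1) * self.window_s - now)
            return 429, {"Retry-After": str(retry_after)}
        return 200, {}


def test_rate_limit_trips():
    limiter = FixedWindowLimiter()
    statuses = [limiter.handle("key-a", now=0.0)[0] for _ in range(101)]
    assert statuses[:100] == [200] * 100
    assert statuses[100] == 429


def test_per_key_isolation():
    limiter = FixedWindowLimiter()
    for _ in range(101):
        limiter.handle("key-a", now=0.0)
    # key-a is exhausted, but key-b has its own counter.
    assert limiter.handle("key-b", now=0.0)[0] == 200


def test_retry_after():
    limiter = FixedWindowLimiter()
    for _ in range(100):
        limiter.handle("key-a", now=10.0)
    status, headers = limiter.handle("key-a", now=10.0)
    assert status == 429
    assert headers["Retry-After"] == "50"
```

Passing the clock in as `now` keeps the tests deterministic. A real check would more likely drive the HTTP endpoint and a test Redis instance.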

Making “done” executable

The acceptance criteria carry the weight, and the magic is the (check: ...) annotation. Each criterion can name a command that must exit zero. Now “done” isn’t a judgment call — it’s a green check. The cookbook’s sdd module parses these into structured data and runs them:

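from pathlib import Path

from agcookbook.sdd import parse_spec, check_spec

# Parse the hand-written spec into structured acceptance criteria.
spec = parse_spec(Path("SPEC.md").read_text(encoding="utf-8"))
spec = check_spec(spec)        # runs each (check: ...) command

# passed is True or False once a check has run; None is shown as "-".
for c in spec.acceptance:
    mark = {True: "PASS", False: "FAIL", None: "-"}[c.passed]
    print(mark, c.text)

print("DONE" if spec.is_done else "NOT DONE")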

The parser is deliberately tiny — a regex over checkbox lines — because the format is meant to be written by hand and read by humans first, machines second. is_done returns true only when every criterion that has a check has passed. Criteria without a check (like “documented in the README”) are tracked but don’t gate, so you can mix automated and manual gates honestly.
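To make the "regex over checkbox lines" idea concrete, here is a minimal sketch of how such a parser and the `is_done` rule could work. It is not the cookbook's implementation. The names `Criterion`, `Spec`, `CHECKBOX`, and `parse_checkboxes` are hypothetical, and the real module may structure its data differently.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# One checkbox line: "- [ ] some text  (check: some command)".
# The "(check: ...)" suffix is optional.
CHECKBOX = re.compile(
    r"^\s*-\s*\[[ xX]\]\s*(?P<text>.*?)\s*(?:\(check:\s*(?P<check>.+?)\)\s*)?$"
)


@dataclass
class Criterion:
    text: str
    check: Optional[str] = None    # command that must exit zero, if any
    passed: Optional[bool] = None  # None until a check has been run


@dataclass
class Spec:
    acceptance: List[Criterion] = field(default_factory=list)

    @property
    def is_done(self) -> bool:
        # Only criteria that name a check gate completion.
        gating = [c for c in self.acceptance if c.check is not None]
        return all(c.passed is True for c in gating)


def parse_checkboxes(markdown: str) -> Spec:
    """Collect every checkbox line in the document as a criterion."""
    spec = Spec()
    for line in markdown.splitlines():
        match = CHECKBOX.match(line)
        if match:
            spec.acceptance.append(
                Criterion(text=match.group("text"), check=match.group("check"))
            )
    return spec
```

Under this rule a spec with no checks at all counts as done, so a stricter variant might require at least one gating criterion.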

The workflow in practice

Here’s a full cycle as it actually runs, with the agent and the checks doing the loop and the human stepping in only at the two gates.

sequenceDiagram
    participant H as Human
    participant A as Agent
    participant T as Checks
    H->>A: SPEC.md (goal, non-goals, acceptance)
    A->>H: Plan (steps + files)
    H->>A: Approve or correct a misread
    loop until green
        A->>A: Implement next step
        A->>T: Run acceptance checks
        T-->>A: pass / fail per criterion
    end
    A->>H: Done — all checks green

Human writes spec, agent proposes plan, human approves, agent implements and runs checks repeatedly until green, then reports done.

The human’s total reading load in this flow is one page of spec and a short plan. The agent’s grind — implement, run checks, read failures, fix, repeat — happens without supervision because the checks are the supervision. This is also exactly where the agent harness earns its keep: the acceptance checks become a tool the agent calls inside its own loop, so “implement until green” is literally the loop’s termination condition.
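The "implement until green" loop can be sketched in a few lines. This version assumes two injected callables, `implement_step` (one round of agent work, told which criteria still fail) and `run_checks` (returns a pass/fail result per criterion). Both are hypothetical placeholders for whatever the agent harness provides. The `max_rounds` budget is also an assumption, added so that a check that can never pass does not loop forever.

```python
from typing import Callable, Dict


def implement_until_green(
    implement_step: Callable[[Dict[str, bool]], None],
    run_checks: Callable[[], Dict[str, bool]],
    max_rounds: int = 10,
) -> bool:
    """Alternate agent work and acceptance checks until every check passes.

    Returns True if all checks are green, False if the round budget ran out.
    """
    results = run_checks()
    for _ in range(max_rounds):
        if all(results.values()):
            return True
        implement_step(results)  # the agent sees which criteria still fail
        results = run_checks()
    return all(results.values())
```

The only exit conditions are "everything is green" and "budget exhausted", which keeps the human out of the loop until one of the two happens.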

Review leverage: where your attention goes

The whole method is an argument about where to spend scarce human attention. Compare the two regimes:

Aspect	Naive (review the diff)	Spec-driven (review the spec)
What you read	300 lines of code	1 page of spec + short plan
When you catch a wrong assumption	after it’s implemented	before any code exists
What proves correctness	your tired judgment	a green check command
What you have when done	code, intent undocumented	code + a spec that documents intent
Cost of a change in direction	re-read a new diff	edit a few lines of spec

You get documentation for free

Here’s the quiet payoff. The spec doesn’t get thrown away when the work ships — it goes in the repo next to the code. Six months later, when someone asks “wait, why is the rate limit per-key and not global?”, the answer is right there in SPEC.md, in the non-goals and constraints, written at the moment the decision was made. You didn’t write documentation as a chore; it fell out of doing the work in the right order.

Trade-offs and when not to bother

Spec-driven development is overhead, and for some work the overhead isn’t worth it.

Situation	Worth a spec?
Throwaway script, exploring an idea	No — just prompt and go
One-line fix with an obvious correct answer	No
Non-trivial feature that will be reviewed	Yes
Anything that will be maintained by others	Yes
Work where “correct” is contested or subtle	Yes — the spec is where you settle it

Pitfalls

Unwritable acceptance criteria — if you can’t name the command that checks it, rewrite the criterion until you can.
Skipping the plan review — most agent misunderstandings are visible in the plan, the cheapest place to catch them; don’t skip straight to implementation.
Specs that rot — when scope changes, update the spec, or it quietly stops being trustworthy documentation.
Over-specifying — a spec is goal and constraints, not pseudocode; if you’re writing the implementation in the spec, you’ve gone too far and removed the agent’s value.
No non-goals — without them the agent gold-plates; the non-goals section is how you say “stop here.”
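Several of these pitfalls can be caught mechanically before the spec reaches a reviewer. The following is a minimal sketch of a spec linter, assuming the four `##` section names used in the example spec earlier. The function name `lint_spec`, the warning wording, and the 60-line stand-in for "one page" are illustrative assumptions.

```python
import re

REQUIRED_SECTIONS = ("Goal", "Non-goals", "Constraints", "Acceptance criteria")
MAX_LINES = 60  # rough stand-in for "one page"; an arbitrary assumption


def lint_spec(markdown: str) -> list:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    sections = {}
    current = None
    for line in markdown.splitlines():
        heading = re.match(r"^##\s+(.*\S)\s*$", line)
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())

    # Every section must exist and contain at least one non-blank line.
    for name in REQUIRED_SECTIONS:
        if not sections.get(name):
            warnings.append(f"missing or empty section: {name}")

    # Flag criteria that do not name a command to run.
    for item in sections.get("Acceptance criteria", []):
        if item.startswith("- [") and "(check:" not in item:
            warnings.append(f"no check command: {item}")

    if len(markdown.splitlines()) > MAX_LINES:
        warnings.append(f"spec is longer than {MAX_LINES} lines; consider splitting")
    return warnings
```

These are warnings rather than hard failures, since a spec may legitimately carry a manual criterion with no check.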

How to adopt this

Add a SPEC.md template to your repo (the cookbook ships one).
For the next non-trivial task, write the spec first and review it before any code.
Write each acceptance criterion as a checkbox with a (check: command) annotation.
Have the agent produce a plan from the spec; review the plan, not just the spec.
Gate merge on all acceptance checks passing.
Keep the spec in the repo as documentation when the work ships.
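For the merge gate in the steps above, one possible shape is a small script that runs each check command and turns the result into a process exit code for CI. The sketch below uses only the standard library. The names `run_check` and `gate` and the five-minute default timeout are assumptions, and commands run without a shell, so pipes and `&&` chains would not work as written.

```python
import shlex
import subprocess
import sys


def run_check(command: str, timeout_s: float = 300.0) -> bool:
    """Run one check command; it passes only if it exits with status zero."""
    try:
        proc = subprocess.run(
            shlex.split(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, ValueError, subprocess.TimeoutExpired):
        # A missing executable, malformed quoting, or a hung check
        # all count as failure.
        return False
    return proc.returncode == 0


def gate(commands) -> int:
    """Return an exit code: 0 if every check passed, 1 otherwise."""
    failed = [cmd for cmd in commands if not run_check(cmd)]
    for cmd in failed:
        print(f"FAIL {cmd}", file=sys.stderr)
    return 1 if failed else 0
```

A CI step could then call `sys.exit(gate(commands))`, with `commands` collected from the `(check: ...)` annotations in SPEC.md.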

References

Implemented in the sdd module of the cookbook: a spec parser, an acceptance-criteria runner, and a SPEC_TEMPLATE.md to copy. Combine it with the effective agent harness so the agent runs your acceptance checks inside its own control loop — turning “implement until the spec is satisfied” into the literal stopping condition of the agent.

Mohit Mittal

Writes Applied GenAI — practical recipes for building with generative AI. Code lives in the cookbook.