Applied GenAI

Spec-driven development with AI agents

Code is cheap now; intent is the bottleneck. Write a tight spec, let the agent implement against it, and gate the work with executable acceptance checks — so you review intent instead of archaeology.

Markdown · Python · Any coding agent

#sdd #workflow #patterns #review

flowchart LR
  I[Intent] --> S[SPEC.md<br/>goal · non-goals<br/>acceptance criteria]
  S --> R1{Human reviews<br/>the spec}
  R1 -->|approved| P[Agent drafts plan]
  P --> R2{Human skims<br/>the plan}
  R2 -->|approved| C[Agent implements]
  C --> A{Acceptance<br/>checks green?}
  A -->|no| C
  A -->|yes| D[Done · spec becomes docs]
  classDef stop fill:#f5e7e3,stroke:#b1361e;
  class D stop;
Intent becomes a spec the human reviews, then a plan the human skims, then implementation that loops against executable acceptance checks until green, ending with the spec as living documentation.

TL;DR

The problem: confident, plausible, wrong

Hand a capable coding agent a one-line task — “add rate limiting to the ingest endpoint” — and it will return a large, polished, plausible diff in under a minute. The code compiles. The variable names are good. It looks like work a senior engineer would be proud of.

Then you start reading, and the questions pile up. Is the limit per API key or global? It chose global; you needed per-key. Does it persist across restarts, or reset? It used an in-memory counter; you have three instances behind a load balancer. What happens on the 101st request — a 429, or a silent drop? You can’t tell without reading the implementation closely. None of these decisions were wrong, exactly — they were guessed, because you never stated them.

So now you’re doing the most expensive kind of work there is: reverse-engineering intent from an implementation. You’re reading 300 lines of code to reconstruct decisions that could have been three lines of spec. Review has collapsed into archaeology, and you’d have been faster writing it yourself. That’s the trap, and it gets worse, not better, as agents get faster.

The core insight: mismatched strengths

The reason spec-driven development works is that humans and agents are good at opposite things, and the naive workflow pits each against its weakness.

| Task | Humans | Agents |
| --- | --- | --- |
| Producing lots of correct code quickly | slow | fast |
| Judging whether intent is right | good | unreliable |
| Reading code to infer intent | slow, error-prone | fast but overconfident |
| Holding a precise spec and not deviating | good | good, if given one |

The naive “just prompt and review the diff” workflow asks the human to do the thing they’re worst at (reading code to infer intent) and asks the agent to do the thing it’s worst at (guessing unstated intent). Spec-driven development flips both: the human reviews a short spec (their strength), the agent implements against it (its strength), and executable checks settle correctness without a human reading code at all.

flowchart TD
    subgraph Naive["Naive: prompt then diff"]
      direction TB
      N1[Human writes one-line prompt] --> N2[Agent guesses the details]
      N2 --> N3[Human reads 300 lines to infer intent]
    end
    subgraph SDD["Spec-driven"]
      direction TB
      D1[Human writes short spec] --> D2[Agent implements exactly that]
      D2 --> D3[Checks prove correctness]
    end
The naive flow asks each party to use its weakness; the spec-driven flow lets each play to its strength.

The four artifacts

Spec-driven development is just four artifacts produced in order, with a human review gate after the spec and another after the plan: the cheap, high-leverage stages.

Intent is the rough idea in your head (“we keep getting hammered by one client”).

The spec turns intent into something reviewable: goal, non-goals, constraints, and acceptance criteria. One page, no code.

The plan is the agent’s reading of the spec, expressed as steps and files to touch. Reviewing the plan is where you catch misunderstandings before a single line is written — it’s the cheapest possible place to course-correct.

The implementation is the code, produced against the plan and gated by the acceptance checks.

Anatomy of a good spec

A spec is short by design. If it’s longer than a page, the task is too big and should be split. Four sections, each earning its place:

# SPEC: rate-limit the /ingest endpoint

## Goal
Reject more than 100 requests/min per API key with HTTP 429.

## Non-goals
- Distributed rate limiting across regions (a single region is fine for now).
- Configurable limits per customer tier (future work).

## Constraints
- No new infrastructure; use the existing Redis instance.
- Must add under 5ms p99 latency to the endpoint.

## Acceptance criteria
- [ ] 101st request in a minute for one key returns 429  (check: pytest -k test_rate_limit_trips)
- [ ] Limit is per-key, not global                       (check: pytest -k test_per_key_isolation)
- [ ] 429 response includes a Retry-After header          (check: pytest -k test_retry_after)

Non-goals are the secret weapon. They’re how you stop the agent from gold-plating — from building the distributed, per-tier, infinitely-configurable version when you needed the simple one. Stating what you don’t want is often more valuable than stating what you do.
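To make the acceptance criteria concrete, here is a rough sketch of what the three named checks might look like. Everything except the test names is an assumption for illustration: `FixedWindowLimiter`, its `handle` method, and the injectable `clock` are hypothetical, and the in-memory dictionary stands in for the Redis counter the spec actually calls for.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

LIMIT = 100          # requests allowed per window, from the spec's Goal
WINDOW_SECONDS = 60  # one minute


@dataclass
class FixedWindowLimiter:
    """Hypothetical in-memory stand-in for a Redis-backed limiter."""
    clock: Callable[[], float] = time.time
    counts: Dict[Tuple[str, int], int] = field(default_factory=dict)

    def handle(self, api_key: str) -> Tuple[int, Dict[str, str]]:
        """Return (status, headers) for one request from api_key."""
        now = self.clock()
        window = int(now // WINDOW_SECONDS)
        bucket = (api_key, window)  # one counter per key per minute
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        if self.counts[bucket] > LIMIT:
            # Seconds until the current window ends, at least 1.
            retry_after = max(1, int((window + 1) * WINDOW_SECONDS - now))
            return 429, {"Retry-After": str(retry_after)}
        return 200, {}


def test_rate_limit_trips():
    limiter = FixedWindowLimiter(clock=lambda: 0.0)
    statuses = [limiter.handle("key-a")[0] for _ in range(101)]
    assert statuses[:100] == [200] * 100
    assert statuses[100] == 429


def test_per_key_isolation():
    limiter = FixedWindowLimiter(clock=lambda: 0.0)
    for _ in range(101):
        limiter.handle("key-a")
    # key-a is exhausted, but key-b has its own counter.
    assert limiter.handle("key-b")[0] == 200


def test_retry_after():
    limiter = FixedWindowLimiter(clock=lambda: 10.0)
    for _ in range(100):
        limiter.handle("key-a")
    status, headers = limiter.handle("key-a")
    assert status == 429
    assert headers["Retry-After"] == "50"
```

Each test maps to exactly one criterion, so a red check points straight at the line of the spec that is not yet satisfied.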

Making “done” executable

The acceptance criteria carry the weight, and the magic is the (check: ...) annotation. Each criterion can name a command that must exit zero. Now “done” isn’t a judgment call — it’s a green check. The cookbook’s sdd module parses these into structured data and runs them:

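from pathlib import Path

from agcookbook.sdd import parse_spec, check_spec

# Parse the hand-written spec into structured data.
spec = parse_spec(Path("SPEC.md").read_text(encoding="utf-8"))
spec = check_spec(spec)        # runs each (check: ...) command

# Map each result to a marker; None covers criteria with no result.
for c in spec.acceptance:
    mark = {True: "PASS", False: "FAIL", None: "-"}[c.passed]
    print(mark, c.text)

print("DONE" if spec.is_done else "NOT DONE")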

The parser is deliberately tiny — a regex over checkbox lines — because the format is meant to be written by hand and read by humans first, machines second. is_done returns true only when every criterion that has a check has passed. Criteria without a check (like “documented in the README”) are tracked but don’t gate, so you can mix automated and manual gates honestly.
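For a feel of how little machinery this needs, here is a minimal sketch of a checkbox parser and check runner along the lines described above. It is not the cookbook's actual implementation: `Criterion`, `parse_criteria`, `run_checks`, and this `is_done` are hypothetical names, and the regex assumes the `- [ ] text (check: command)` shape used in the example spec.

```python
import re
import shlex
import subprocess
from dataclasses import dataclass
from typing import List, Optional

# "- [ ] text (check: command)"; the check annotation is optional.
CRITERION_RE = re.compile(
    r"^- \[(?: |x)\] (?P<text>.*?)\s*(?:\(check: (?P<check>.+)\))?\s*$"
)


@dataclass
class Criterion:
    text: str
    check: Optional[str] = None    # command that must exit zero, if any
    passed: Optional[bool] = None  # None means "no check" or "not run yet"


def parse_criteria(markdown: str) -> List[Criterion]:
    """Collect checkbox lines under the '## Acceptance criteria' heading."""
    criteria: List[Criterion] = []
    in_section = False
    for line in markdown.splitlines():
        if line.startswith("## "):
            in_section = line.strip().lower() == "## acceptance criteria"
            continue
        match = CRITERION_RE.match(line.strip()) if in_section else None
        if match:
            criteria.append(Criterion(match["text"], match["check"]))
    return criteria


def run_checks(criteria: List[Criterion]) -> List[Criterion]:
    """Run each check command; exit code 0 counts as a pass."""
    for c in criteria:
        if c.check is not None:
            c.passed = subprocess.run(shlex.split(c.check)).returncode == 0
    return criteria


def is_done(criteria: List[Criterion]) -> bool:
    """True only when every criterion that has a check has passed."""
    return all(c.passed is True for c in criteria if c.check is not None)
```

Criteria without a check keep `passed` as `None` and are skipped by `is_done`, which mirrors the "tracked but don't gate" behaviour described above.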

The workflow in practice

Here’s a full cycle as it actually runs, with the agent and the checks doing the loop and the human stepping in only at the two gates.

sequenceDiagram
    participant H as Human
    participant A as Agent
    participant T as Checks
    H->>A: SPEC.md (goal, non-goals, acceptance)
    A->>H: Plan (steps + files)
    H->>A: Approve or correct a misread
    loop until green
        A->>A: Implement next step
        A->>T: Run acceptance checks
        T-->>A: pass / fail per criterion
    end
    A->>H: Done — all checks green
Human writes spec, agent proposes plan, human approves, agent implements and runs checks repeatedly until green, then reports done.

The human’s total reading load in this flow is one page of spec and a short plan. The agent’s grind — implement, run checks, read failures, fix, repeat — happens without supervision because the checks are the supervision. This is also exactly where the agent harness earns its keep: the acceptance checks become a tool the agent calls inside its own loop, so “implement until green” is literally the loop’s termination condition.
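The "implement until green" loop can be sketched in a few lines. This is an illustration under assumptions, not the harness's real interface: `implement_step` and `run_checks` are hypothetical callables standing in for the agent and the acceptance-check tool, and `max_iterations` is an added budget cap so a loop that never goes green still halts.

```python
from typing import Callable, Dict, List

# Hypothetical signatures: run_checks returns {criterion text: passed},
# and implement_step asks the agent to change code given the failures.
CheckRunner = Callable[[], Dict[str, bool]]
AgentStep = Callable[[List[str]], None]


def implement_until_green(
    implement_step: AgentStep,
    run_checks: CheckRunner,
    max_iterations: int = 10,
) -> bool:
    """Loop agent steps until every check passes or the budget runs out."""
    for _ in range(max_iterations):
        results = run_checks()
        failing = [text for text, passed in results.items() if not passed]
        if not failing:
            return True          # green: the loop's termination condition
        implement_step(failing)  # feed the failing criteria back to the agent
    # Budget exhausted: report whether the last step happened to finish the job.
    return all(run_checks().values())
```

Returning a boolean rather than raising lets the caller decide what a non-green result means, for example handing the failing criteria back to the human.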

Review leverage: where your attention goes

The whole method is an argument about where to spend scarce human attention. Compare the two regimes:

| | Naive (review the diff) | Spec-driven (review the spec) |
| --- | --- | --- |
| What you read | 300 lines of code | 1 page of spec + short plan |
| When you catch a wrong assumption | after it’s implemented | before any code exists |
| What proves correctness | your tired judgment | a green check command |
| What you have when done | code, intent undocumented | code + a spec that documents intent |
| Cost of a change in direction | re-read a new diff | edit a few lines of spec |

You get documentation for free

Here’s the quiet payoff. The spec doesn’t get thrown away when the work ships; it goes in the repo next to the code. Six months later, when someone asks “wait, why is the rate limit per-key and not global?”, the answer is right there in SPEC.md, in the goal and acceptance criteria, written at the moment the decision was made. You didn’t write documentation as a chore; it fell out of doing the work in the right order.

Trade-offs and when not to bother

Spec-driven development is overhead, and for some work the overhead isn’t worth it.

| Situation | Worth a spec? |
| --- | --- |
| Throwaway script, exploring an idea | No: just prompt and go |
| One-line fix with an obvious correct answer | No |
| Non-trivial feature that will be reviewed | Yes |
| Anything that will be maintained by others | Yes |
| Work where “correct” is contested or subtle | Yes: the spec is where you settle it |

Pitfalls

How to adopt this

References

Implemented in the sdd module of the cookbook: a spec parser, an acceptance-criteria runner, and a SPEC_TEMPLATE.md to copy. Combine it with the effective agent harness so the agent runs your acceptance checks inside its own control loop — turning “implement until the spec is satisfied” into the literal stopping condition of the agent.

Mohit Mittal
Writes Applied GenAI — practical recipes for building with generative AI. Code lives in the cookbook.