Autonomy is mostly knowing when to stop

By Eric Caskey · June 9, 2026 · 7 min read

AI claude-code claude-fable autonomous-agents orchestration spec-driven-development

A couple of weeks ago I wrote that an orchestration mode is only as good as its backlog. That mode multiplies effort within a single task. Deciding which tasks exist, in what order, and where to stop is a different layer that sits above it, and that layer was on me. So I built it, handed the work-list to Claude Fable 5, told it once that it could merge its own pull requests, and let it run on a self-paced loop.

It shipped seventeen items across five repositories in nineteen pull requests, all merged. The instinct is to read that number as the result. It is not. The number that mattered does not appear in the merged column at all. It is the count of items the agent refused to touch, and the reason it refused.

The thesis I keep landing on: when execution gets cheap, the engineering moves to the edges. Not what the agent builds, but what you have told it building means, what it should ignore, and where it has to stop. A capable model will grind through any well-formed list. The scarce input is judgment about boundaries, and the sharpest form of that judgment is the stop.

The file is the program

The setup is small. One file, BACKLOG.md, and the prompt that drives the run is a single line: work the next eligible item per the loop protocol in BACKLOG.md.

The protocol lives inside the file. The header is the rules, written as a short sequence of steps: take the lowest-numbered ready item whose dependencies are done, branch, do the work, verify, open a pull request, merge if you are authorized and the checks are green, update the status, repeat. Below it is the list of items, each with a status, its dependencies, and concrete acceptance criteria.
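A minimal Python sketch of the selection rule in that first step. The `Item` fields, the status strings, and the function name `next_eligible` are assumptions made for illustration; the actual BACKLOG.md format is not shown in this post.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Item:
    number: int
    status: str  # assumed values: "ready", "done", "blocked", "deferred"
    depends_on: list[int] = field(default_factory=list)


def next_eligible(items: list[Item]) -> Optional[Item]:
    """Return the lowest-numbered ready item whose dependencies are all done."""
    done = {item.number for item in items if item.status == "done"}
    candidates = [
        item
        for item in items
        if item.status == "ready" and all(dep in done for dep in item.depends_on)
    ]
    return min(candidates, key=lambda item: item.number, default=None)


backlog = [
    Item(1, "done"),
    Item(2, "ready", depends_on=[3]),  # still waiting on item 3
    Item(3, "ready", depends_on=[1]),  # its only dependency is done
    Item(4, "blocked"),
]
picked = next_eligible(backlog)
print(picked.number if picked else "nothing eligible")
```

Here item 2 has the lower number but is skipped because item 3 is not done yet, so item 3 is picked. Once item 3 is marked done, item 2 becomes eligible without anyone touching it, which is the same effect as a single merged dependency opening up several items at once.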

Because the rules are in the file, the loop re-reads its own instructions every iteration. There is no harness holding state in memory between runs. The state is the file. Edit it mid-run and the next iteration picks up the change. That is what makes it safe to leave alone, because there is exactly one place to look to know what it thinks it is doing.
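The "state is the file" idea can be illustrated with a toy loop in Python. This is only an analogy: the actual run is driven by a prompt, not by a script like this, and the one-line-per-item format and the helper names `run_loop` and `mark_first_ready_done` are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_loop(path: Path, step: Callable[[Path, str], bool], max_iterations: int = 50) -> int:
    """Call `step` until it reports nothing left, re-reading the file every time."""
    completed = 0
    for _ in range(max_iterations):
        # Fresh read on every pass: nothing is cached, so mid-run edits are picked up.
        text = path.read_text(encoding="utf-8")
        if not step(path, text):
            break
        completed += 1
    return completed


def mark_first_ready_done(path: Path, text: str) -> bool:
    """Stand-in for real work: flip the first 'ready' line to 'done' and write it back."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.endswith(": ready"):
            lines[i] = line[: -len("ready")] + "done"
            path.write_text("\n".join(lines) + "\n", encoding="utf-8")
            return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    backlog_file = Path(tmp) / "BACKLOG.md"
    backlog_file.write_text("item 1: ready\nitem 2: ready\nitem 3: blocked\n", encoding="utf-8")
    completed = run_loop(backlog_file, mark_first_ready_done)
    final_text = backlog_file.read_text(encoding="utf-8")

print(completed)
print(final_text)
```

The loop holds no record of what it has done beyond a counter; everything it knows about the work comes from the read at the top of each pass.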

One iteration, end to end

The discipline per item is deliberately boring. One item at a time, never a batch. The build can run as a subagent, but the loop does not take the subagent's word for it. It reruns the checks and reads the diff before merging, because a subagent reporting that tests pass is a claim, not a result.
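A minimal sketch of that "claim, not a result" rule in Python, assuming the checks are a single command whose exit code is the verdict. The function names are hypothetical, and the diff-reading step is not modeled here.

```python
import subprocess
import sys


def checks_pass(command: list[str]) -> bool:
    """Rerun the checks and trust only the exit code, not anyone's summary."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


def may_merge(reported_green: bool, command: list[str], merge_authorized: bool) -> bool:
    """Decide on merging from a fresh rerun; the subagent's report is only a claim."""
    verified_green = checks_pass(command)
    if reported_green and not verified_green:
        print("subagent reported green, but the rerun failed", file=sys.stderr)
    return merge_authorized and verified_green


# Stand-in check commands so the sketch runs anywhere; a real loop would use
# the repository's own test and lint command.
passing = [sys.executable, "-c", "pass"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(may_merge(True, failing, merge_authorized=True))
print(may_merge(False, passing, merge_authorized=True))
```

The reported status never contributes to the decision. It is only compared against the rerun so that a disagreement gets surfaced.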

A few of the items, to make it concrete. One was porting a spend-cap design I had already validated in a throwaway sandbox into the real platform, the kind of well-specified, tedious work a loop is made for. Another was a single routing change that was listed as a dependency for four other items, so the moment it merged, the planner, the monitor, and two more agents all became eligible and the graph opened up on its own. The one I liked most was an egress sandbox: Fable built the allowlisting proxy, then reported that it could not enforce the restriction at the kernel level without root on the host, documented the actual threat model, added a startup check that warns when egress is open, and did not oversell what it had built. A capable model that tells you the truth about its own work is the whole ballgame.

The boundaries you write down once

The first version flailed. Every iteration the agent re-derived the same facts. How do I run the tests here? Which branch is the default? Is this red test a regression, or has it been broken forever? That rediscovery is slow, and it is where wrong guesses get in.

So I added a block at the top of the file and made the first job of the run to fill it in by surveying the repos. The test and lint command. The default branch per repo, because one of my five uses main while the rest use master, and an agent that assumes the wrong one ends up branching off nothing. The pre-existing failures to ignore, including one integration test that is red on a clean checkout, so the loop never once mistook it for damage it had caused. Whether merging is authorized. The hard rules that are enforced nowhere in the test suite: the privacy invariants, the config pairs that must ship together, the things you only learn by breaking them.

Every line in that block is a boundary. It is me telling the agent what is true here so it does not have to guess, which is the same move as context architecture: decide what a task is allowed to assume, then let it run inside that. Get the block right and the loop stops investigating and starts executing.
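To make the block concrete, here is a hedged Python sketch of the same facts as a data structure, with two small helpers that use it. The field names, repository names, test names, and the `make test` command are all illustrative assumptions, not the contents of the real file.

```python
from dataclasses import dataclass, field


@dataclass
class Boundaries:
    test_command: str
    default_branch: dict[str, str]  # repo name -> default branch
    known_failures: set[str] = field(default_factory=set)
    merge_authorized: bool = False
    hard_rules: list[str] = field(default_factory=list)


def base_branch(repo: str, boundaries: Boundaries) -> str:
    """Look the branch up instead of guessing; an unknown repo raises KeyError."""
    return boundaries.default_branch[repo]


def regressions(failed_tests: set[str], boundaries: Boundaries) -> set[str]:
    """Failures not on the known-failure list, i.e. damage the run may have caused."""
    return failed_tests - boundaries.known_failures


bounds = Boundaries(
    test_command="make test",  # hypothetical command
    default_branch={"repo-a": "main", "repo-b": "master"},  # hypothetical repos
    known_failures={"test_integration_sync"},  # red on a clean checkout
    merge_authorized=True,
    hard_rules=["config pairs must ship together"],
)

print(base_branch("repo-a", bounds))
print(regressions({"test_integration_sync", "test_routing"}, bounds))
```

Failing loudly on an unknown repository is a deliberate choice in this sketch: a missing entry is a gap in the block to be filled in, not something to paper over with a default.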

The stops

Here is the part I came away thinking about.

One item was to implement the encryption helpers for a privacy-sensitive feature. The agent went looking for the spec that defined the algorithms and the wire format. It searched the repos, the history, every plausible directory. The spec was not there; it lives on another machine of mine and was never synced, and the stub comments left in the repo contradicted each other on the decisions that matter. A looser setup would have stitched something plausible together. Fable marked the item blocked, wrote down exactly what it had searched and what it would need, and moved on. Refusing to invent a cryptographic design from contradictory hints was the best decision it made all day.

Four more items carried a human gate: archiving a repository, spending money on cloud resources, deploying to production, anything that publishes outward. The loop is not allowed to start those. It surfaced each one, said why it was deferring, and took the next thing it could actually finish.

None of those were failures of the run. They were the run working. And every one of those stops was a boundary I had written into the file before it started: an acceptance criterion it could not honestly meet, a dependency it could not reach, a gate it was not permitted to cross. The agent did not decide to stop. I had decided in advance, and it held the line.
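A small Python sketch of how stops like these could be decided before any work starts. The gate names, the `Item` fields, and the `decide` function are hypothetical, chosen to mirror the categories described above rather than the real backlog format.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    WORK = "work"
    BLOCKED = "blocked"    # a required input, such as a spec, cannot be found
    DEFERRED = "deferred"  # the item needs a human to approve it first


# Assumed labels for the human-gated actions.
HUMAN_GATES = {"archive-repo", "spend-money", "deploy-production", "publish"}


@dataclass
class Item:
    number: int
    actions: set[str] = field(default_factory=set)
    missing_inputs: list[str] = field(default_factory=list)


def decide(item: Item) -> tuple[Decision, str]:
    """Return the decision and the reason, so every stop explains itself."""
    gated = sorted(item.actions & HUMAN_GATES)
    if gated:
        return Decision.DEFERRED, f"needs human approval for: {', '.join(gated)}"
    if item.missing_inputs:
        return Decision.BLOCKED, f"cannot find: {', '.join(item.missing_inputs)}"
    return Decision.WORK, "no boundary in the way"


for item in [
    Item(7, actions={"edit-code"}, missing_inputs=["encryption spec"]),
    Item(8, actions={"deploy-production"}),
    Item(9, actions={"edit-code"}),
]:
    decision, reason = decide(item)
    print(item.number, decision.value, reason)
```

Every branch returns a reason alongside the decision, which matches the behavior described: a stop is only useful if it says what was missing or which gate it ran into.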

This is what spec-driven development was always for, and it took an autonomous run to make me see it plainly. An acceptance criterion is a boundary. A human gate is a boundary. The known-failure line is a boundary. SDD reads like a method for telling an agent what to build. It is at least as much a method for telling it what not to, and where the work ends. When execution was the expensive part, that second half was a nicety. Now that execution is cheap, the second half is the job.

What I pulled out of it

Once the run was done the pattern was obviously reusable, so I extracted it into a small open tool: loop-harness. A versioned schema for the backlog and its loop protocol, and a skill that surveys a repository and writes the fitted backlog, boundary block and all. One level of generality up from a single project.

But the tool is downstream of the idea. The leverage was never the loop, which is a dozen lines of protocol I could have written in an afternoon. It was the set of edges I drew before I ever started it. The model was not the bottleneck. It has not been the bottleneck for a while. What I bring now is the boundaries.

A practical footnote, because it is the first question I get about a run like this: all of it happened on a regular Claude Code subscription, and I have not hit the plan's limits yet. Not during this run, not in the daily driving since. A self-paced loop working one item at a time turns out to live comfortably inside a subscription.

If you take one thing from this: do not measure an autonomous run by what it finished. Measure it by whether it stopped where you would have.

One more thread to pull, in a different direction. I extended a Perplexity trial for one specific job: a consultant view, a second model with its own finance APIs, pointed at my finance engine to cross-check the work and, where it earned it, influence the algorithms themselves. Why I pay for a second opinion when the first one is this good is the next post.

The setup this pays off: An orchestration mode is only as good as its backlog

The method underneath it: Context architecture beats documentation dumps

The tool: loop-harness, the backlog schema, the loop protocol, and the skill that generates both.

