Leverage by Subtraction
When you start building tooling around a coding agent, the default urge is to add. A new agent for code review. A skill for deploys. A clever prompt for that one recurring task. Each addition feels like progress, and within a month you have a drawer full of specialized tools, half of which you forget exist and the other half of which quietly disagree with each other.
I run four products and nine repositories as one person, and the thing that has kept the tooling layer sane is a rule that points the opposite direction: leverage by subtraction. Build as few moving parts as possible, define the shared ones once, and specialize only where domain judgment genuinely differs.
Two takeaways, and I will say both again at the end. First, there is a test that decides what a piece of work should even be, a script, a hook, a skill, or an agent, and applied honestly it routes most work away from agents. Second, guardrails do not belong in agents a human has to remember to run. They belong in hooks and CI, where forgetting is not an option. The named agents that survive that second cut are a short list, and the shortness is the point.
The temptation to add#
An agent is the most expensive tool in the box. It costs tokens, it costs latency, and worst of all it costs determinism: the same input can produce different output, which is exactly what you do not want for anything you intend to rely on. Yet an agent is also the easiest tool to reach for, because writing a prompt feels faster than writing a script. That mismatch, most expensive to run but cheapest to author, is the trap. Reaching for an agent should feel like a decision, not a reflex.
The determinism test#
So before anything gets built, it goes through one question: how much judgment does this actually require? The answer routes the work down a ladder, and most things land lower than they first appear.
| If the work is... | Build a... | Because |
|---|---|---|
| Mechanical and deterministic | Script | Faster, free, and correct every time. No model in the loop. |
| A rule that must always hold | Hook or CI check | The machine refuses the bad state. Nobody has to remember. |
| A fixed sequence with light judgment | Skill | Repeatable, but flexible enough for the small variations. |
| Open-ended judgment | Agent | The only tool that actually reasons. Reach for it last, not first. |
| Anything reaching a service or database | CLI or MCP server | A prompt-only skill cannot hold credentials or touch the network. |
The test does real work because it keeps demoting things. Most of what feels like it wants to be an agent is actually a script with good error messages, or a hook you have not written yet. The "code review agent" I almost built is a good example: run the test honestly and it falls apart into three jobs wearing one coat. The style checks are a linter, which is a script. The rule that a secret never reaches a diff is enforcement, which is a hook. Only the last part, the actual design-judgment call about whether the change is sound, was ever really an agent. Two of the three demoted themselves the moment I asked the question. Every rung you move down the ladder buys you speed, determinism, and one less thing that can drift.
Default to the general agent#
There is a second subtraction underneath the first. Even once something genuinely needs a model's judgment, that does not mean it needs its own named agent. A bespoke agent is a maintenance commitment: a prompt to keep current, a behavior to keep tested, a name to remember. The right default is the general-purpose agent, and you reach for a specialized one only when the same judgment-shaped task recurs often enough that defining it once pays for itself many times over. A named agent should earn its name by being used repeatedly, not by being theoretically tidy.
Guardrails are not agents#
The most important line in the whole rule is this: a guardrail you have to remember to run is not a guardrail. It is a suggestion.
If a rule must always hold, putting it inside an agent that a human invokes by hand is the wrong altitude. The day it matters most is the day someone is moving fast and skips it. So enforcement moves down into hooks and CI, where it runs whether or not anyone thought to ask. A pre-deploy check that blocks a bad stack, a hook that refuses a dirty deploy, a secret scan that fails the build: none of those should be a tool a human chooses to use. They should be the floor.
This is the cleanest dividing line I have found. Judgment that informs a human decision can live in an agent. Enforcement that protects production cannot, because it has to survive human forgetfulness, and agents do not.
The agents that earned their place#
After all that subtraction, the named agents that remain are the ones that pass every cut: real open-ended judgment, recurring often, informing a decision rather than enforcing a rule. In my fleet that is a short, stable list. A spec-conformance reviewer that reads a diff against the spec it claims to implement. A disclosure auditor that scans a draft for anything that should not be published. A spec-writer that drafts a new spec in the right place. A stack reviewer that reads a CDK diff before deploy. A branch-hygiene sweeper. An incident recorder that appends a lesson to the right file when something breaks. Each one survives because it is doing something a script cannot and a human should not do by hand fifty times a week.
What is striking is how few there are. Run the test honestly across everything that felt like it wanted to be an agent, and the great majority resolve into scripts and hooks. The agents are the residue, the irreducible core of actual judgment, and keeping that core small is what keeps the whole layer trustworthy.
Recap#
Two takeaways, as promised. First, before you build any piece of tooling, run the determinism test: mechanical work is a script, an always-true rule is a hook or a CI check, a fixed multi-step task is a skill, and only genuinely open-ended judgment is an agent. Applied honestly, it routes most work away from agents, which is a feature. Second, guardrails belong where forgetting is impossible, in hooks and CI, not in agents a human has to remember to run, and the named agents worth keeping are the short list of recurring, judgment-shaped tasks that survive that cut. The leverage was never in adding more. It was in being ruthless about how little you actually need.
Related:
- Spec-driven folder architecture, the structure this tooling operates on
- Autonomy is mostly knowing when to stop, the same restraint applied to the loop itself
Keep reading
Watch the agent write
A polish agent drafts an essay against a pre-approved topic.
Building an AI-Native Platform: A Retrospective
A year of building and operating a small fleet of finance and content products almost entirely through an AI coding agent. What worked, what was hard, the honest failures (including a flagship signal that measured nothing and an edge that vanished net of costs), and the lessons that transfer.
Hello Again, Opus
Four days after I said goodbye to Opus, an export-control directive pulled Fable 5 offline and the fallback became the workhorse again. What I shipped in the window, what it cost, and the model-tiering plan for when Fable comes back.
Ten days of June: the SDD velocity numbers, seven weeks in
In April I published one week of SDD production numbers. The same data trail rerun for June 1 through 10 shows the velocity curve: 309 PRs opened, 293 merged, about 185 production deploys, and one footnote about outrunning GitHub Actions' default limits.
Autonomy is mostly knowing when to stop
I handed a backlog to Claude Fable, told it once it could merge, and let it run. It shipped seventeen items across five repos. The line that mattered was not in the work it finished. It was in the work it refused to touch.
Goodbye Opus, Hello Fable
Anthropic shipped Claude Fable 5 and Mythos 5: same model, two names, one safeguard layer apart. What the new frontier model means for running agents in production.