← Back to Blog

Leverage by Subtraction

When you start building tooling around a coding agent, the default urge is to add. A new agent for code review. A skill for deploys. A clever prompt for that one recurring task. Each addition feels like progress, and within a month you have a drawer full of specialized tools, half of which you forget exist and the other half of which quietly disagree with each other.

I run four products and nine repositories as one person, and the thing that has kept the tooling layer sane is a rule that points the opposite direction: leverage by subtraction. Build as few moving parts as possible, define the shared ones once, and specialize only where domain judgment genuinely differs.

Two takeaways, and I will say both again at the end. First, there is a test that decides what a piece of work should even be, a script, a hook, a skill, or an agent, and applied honestly it routes most work away from agents. Second, guardrails do not belong in agents a human has to remember to run. They belong in hooks and CI, where forgetting is not an option. The named agents that survive that second cut are a short list, and the shortness is the point.

The temptation to add#

An agent is the most expensive tool in the box. It costs tokens, it costs latency, and worst of all it costs determinism: the same input can produce different output, which is exactly what you do not want for anything you intend to rely on. Yet an agent is also the easiest tool to reach for, because writing a prompt feels faster than writing a script. That mismatch, most expensive to run but cheapest to author, is the trap. Reaching for an agent should feel like a decision, not a reflex.

The determinism test#

So before anything gets built, it goes through one question: how much judgment does this actually require? The answer routes the work down a ladder, and most things land lower than they first appear.

If the work is... Build a... Because
Mechanical and deterministic Script Faster, free, and correct every time. No model in the loop.
A rule that must always hold Hook or CI check The machine refuses the bad state. Nobody has to remember.
A fixed sequence with light judgment Skill Repeatable, but flexible enough for the small variations.
Open-ended judgment Agent The only tool that actually reasons. Reach for it last, not first.
Anything reaching a service or database CLI or MCP server A prompt-only skill cannot hold credentials or touch the network.

The test does real work because it keeps demoting things. Most of what feels like it wants to be an agent is actually a script with good error messages, or a hook you have not written yet. The "code review agent" I almost built is a good example: run the test honestly and it falls apart into three jobs wearing one coat. The style checks are a linter, which is a script. The rule that a secret never reaches a diff is enforcement, which is a hook. Only the last part, the actual design-judgment call about whether the change is sound, was ever really an agent. Two of the three demoted themselves the moment I asked the question. Every rung you move down the ladder buys you speed, determinism, and one less thing that can drift.

Default to the general agent#

There is a second subtraction underneath the first. Even once something genuinely needs a model's judgment, that does not mean it needs its own named agent. A bespoke agent is a maintenance commitment: a prompt to keep current, a behavior to keep tested, a name to remember. The right default is the general-purpose agent, and you reach for a specialized one only when the same judgment-shaped task recurs often enough that defining it once pays for itself many times over. A named agent should earn its name by being used repeatedly, not by being theoretically tidy.

Guardrails are not agents#

The most important line in the whole rule is this: a guardrail you have to remember to run is not a guardrail. It is a suggestion.

If a rule must always hold, putting it inside an agent that a human invokes by hand is the wrong altitude. The day it matters most is the day someone is moving fast and skips it. So enforcement moves down into hooks and CI, where it runs whether or not anyone thought to ask. A pre-deploy check that blocks a bad stack, a hook that refuses a dirty deploy, a secret scan that fails the build: none of those should be a tool a human chooses to use. They should be the floor.

This is the cleanest dividing line I have found. Judgment that informs a human decision can live in an agent. Enforcement that protects production cannot, because it has to survive human forgetfulness, and agents do not.

The agents that earned their place#

After all that subtraction, the named agents that remain are the ones that pass every cut: real open-ended judgment, recurring often, informing a decision rather than enforcing a rule. In my fleet that is a short, stable list. A spec-conformance reviewer that reads a diff against the spec it claims to implement. A disclosure auditor that scans a draft for anything that should not be published. A spec-writer that drafts a new spec in the right place. A stack reviewer that reads a CDK diff before deploy. A branch-hygiene sweeper. An incident recorder that appends a lesson to the right file when something breaks. Each one survives because it is doing something a script cannot and a human should not do by hand fifty times a week.

What is striking is how few there are. Run the test honestly across everything that felt like it wanted to be an agent, and the great majority resolve into scripts and hooks. The agents are the residue, the irreducible core of actual judgment, and keeping that core small is what keeps the whole layer trustworthy.

Recap#

Two takeaways, as promised. First, before you build any piece of tooling, run the determinism test: mechanical work is a script, an always-true rule is a hook or a CI check, a fixed multi-step task is a skill, and only genuinely open-ended judgment is an agent. Applied honestly, it routes most work away from agents, which is a feature. Second, guardrails belong where forgetting is impossible, in hooks and CI, not in agents a human has to remember to run, and the named agents worth keeping are the short list of recurring, judgment-shaped tasks that survive that cut. The leverage was never in adding more. It was in being ruthless about how little you actually need.


Related:

Keep reading

Written by Eric Caskey. I build AI tools you can actually use. Explore the Tools or see the case studies.