I Don't Trust My Own Findings

By Eric Caskey · July 1, 2026 · 6 min read

The most dangerous result I produce is the one I was hoping for. A signal that looks predictive, a refactor that looks clean, a draft that looks safe to publish. I check it, and it passes my check, and that should worry me more than it does. Because I checked it the way someone hoping it was true would check it. The motivation that produced the finding also graded the exam.

So I have stopped trusting my own findings. Not as a pose of humility, as an operating procedure. Every claim my system makes, especially the ones I like, gets handed to an independent skeptic whose only job is to refute it before I am allowed to believe it.

The case has two parts. First, self-review is structurally compromised: the reviewer and the author share a brain, a context, and a motive, so passing your own check is weak evidence. Second, the fix is to institutionalize the skeptic, an independent verifier, with the burden of proof flipped to refute-or-it-does-not-ship, run as a gate on every claim rather than a discipline you apply on the days you remember to be rigorous.

The failure mode is motivated reasoning#

Here is the shape of the trap. You build something, you get a result that confirms what you set out to show, and you feel the small relief of being right. That relief is the bug. It is the exact moment your scrutiny relaxes, on the exact finding that most needs it. A result that contradicts you gets interrogated. A result that flatters you gets waved through. Your review budget is spent in inverse proportion to where it is needed.

This is not a character flaw you can will away. It is how motivated reasoning works, and knowing about it does not switch it off. I have caught myself nodding at a number because it was the number I wanted, and the only reason I caught it is that something external made me look again.

A finding I wanted to be true#

My sharpest lesson came from a result that looked like real edge. A signal that, on the surface, appeared strongly predictive, a big, exciting number of exactly the kind you hope to see. The version of me that wanted it to be real had plenty of reasons ready.

It was an artifact. The number came from a sample whose effective size was close to one, a handful of correlated names masquerading as a population. Adversarial review caught it before it changed a single decision: not by being smarter, but by starting from "this is probably fake, prove otherwise" instead of "this looks great." The finding did not survive that question. I have written about the trading-specific version of this in backtesting without fooling yourself; the general lesson is that a result you love deserves more suspicion, not less, and you are the worst-placed person to supply it.

The skeptic runs a checklist, not a vibe#

The reason an external skeptic beats good intentions is that it can run a fixed interrogation that does not flex with your mood. For evidence that something "works," there is a standard list of ways the claim is secretly lying, and a verifier checks every one regardless of how much you want the answer.

The claim you want to be true	The self-deception hiding in it
"This signal predicts returns"	Survivorship, or look-ahead leaking future data into the past
"It works on the universe that matters"	Wrong universe: tested where it happens to win
"The result is significant"	Naive significance, ignoring how many you tried
"I found a real effect"	Forking paths: enough cuts and noise looks like signal
"It beats the benchmark"	Gross, not net: the edge dies after costs
"It will hold up"	No out-of-sample: never tested on data you did not touch

The point of the table is not the specific sins. It is that the skeptic does not get to skip a row because the finding is exciting. The checklist is indifferent to your hopes, which is precisely the property your own judgment lacks at the moment of discovery.

Institutionalize the skeptic#

What turns this from a nice idea into a working method is making the skeptic a fixed part of the pipeline, not a burst of discipline you summon on good days. Across my system that takes a few concrete forms, and they share one design choice: the verifier is independent of what it checks.

A backtest result is not believed until a reviewer audits it against that list of sins. A code change is reviewed by an agent that did not write it. A draft is scanned for anything unpublishable by a checker that does not share the author's blind spot for what is sensitive. A deploy is checked against the traps that have burned past deploys before the button is pushed. None of these are steps I have to remember to run. They are gates the work passes through, and the burden of proof sits on the finding: survive the skeptic or you do not ship.

Two details make the difference between real verification and theater. The first is the flipped default. A verifier told to "evaluate this" will tend to confirm. A verifier told to "refute this, assume it is wrong until forced otherwise" finds the holes, because it is now motivated in the opposite direction from you. The second is diversity over redundancy: three skeptics looking through the same lens just agree three times, but a correctness lens, a security lens, and a does-it-reproduce lens each catch failures the others are blind to. When a finding has to survive several genuinely different attempts to kill it, surviving starts to mean something.

The deepest version is simply consulting something that does not share your context at all, a stronger reviewer that sees what you did but not why you wanted it to work. Half the time the value is not a correction. It is being made to look once more at the result I was already celebrating.

Distrust, made into a pipeline#

So I have stopped extending myself the benefit of the doubt, and the reason is structural, not modesty. My own review is compromised by construction: the author and the reviewer share a brain, a context, and a motive, which makes a flattering result that passes my own check the least trustworthy kind, not the most reassuring. The answer is not to try harder to be objective, because knowing about motivated reasoning does not switch it off. It is to hand every claim to an independent skeptic that starts from refute-by-default and points several genuinely different lenses at it, and to make that a standing gate rather than an act of willpower I summon on good days. I do not trust my own findings. That distrust, turned into a pipeline, is the most reliable quality control I have.

Related:

Backtesting without fooling yourself, the trading-specific version of this discipline
Leverage by subtraction, where these verifier agents sit in the tooling
Tell me everything that's wrong, asking for the refutation directly

Keep reading

Post

Leverage by Subtraction

The instinct with agentic tooling is to add: more agents, more skills, more clever prompts. The leverage runs the other way. Here is the test I use to decide whether a piece of work should be a script, a hook, a skill, or an agent, and why most of them should not be an agent at all.

Read

Post

Building an AI-Native Platform: A Retrospective

A year of building and operating a small fleet of finance and content products almost entirely through an AI coding agent. What worked, what was hard, the honest failures (including a flagship signal that measured nothing and an edge that vanished net of costs), and the lessons that transfer.

Read

Post

Hello Again, Opus

Four days after I said goodbye to Opus, an export-control directive pulled Fable 5 offline and the fallback became the workhorse again. What I shipped in the window, what it cost, and the model-tiering plan for when Fable comes back.

Read

Post

Goodbye Opus, Hello Fable

Anthropic shipped Claude Fable 5 and Mythos 5: same model, two names, one safeguard layer apart. What the new frontier model means for running agents in production.

Read

Post

The Orange Pi That Maintains Itself

A small ARM box that started as a local LLM experiment and ended up a self-governing node: private retrieval, a resident agent under a written constitution, a code-enforced safety fence, and a nightly job where it audits itself and files its own backlog.

Read

Post

When CI Costs More Than It Saves

GitHub Actions' default minute allowance is priced for a team that types at human speed. At agent velocity the bill breaks before the engineering does. Here is how a forced workaround, a local CI mirror plus local deploys, became the better default.

Read

Written by Eric Caskey. I build AI tools you can actually use. Explore the Tools or see the case studies.