
How to backtest without fooling yourself

I have a personal trading engine. It scores stocks on a multi-factor model and ranks them, and for a while I wanted to know the obvious thing: does it actually predict anything? So I set out to backtest it properly.

What I learned is that backtesting is not really about finding an edge. It is about not fooling yourself into believing in one. A backtest that looks great is the default outcome, not the exciting one, because almost every degree of freedom in the process bends toward a prettier number. The work is spending those degrees of freedom on honesty instead.

This is the toolkit I used, in the order the mistakes tend to bite. The running example is my own engine, and I will tell you now where it ends: the single best signal I found, the one that passed every in-sample test I could throw at it, turned out to be worth nothing. Getting to that conclusion cleanly was the whole point.

Score the dead, not just the survivors

The first lie a backtest tells you is built into your list of stocks. If you test on today's S&P 500 scored at past dates, you have already cheated, because today's index is the list of companies that survived. The ones that went bankrupt, got acquired, or fell out of the index are missing, and they are missing precisely because they did badly. Your universe is a winners' bracket.

The fix is point-in-time membership. I pulled the index's full add and remove history and reconstructed who was actually in it on each past date, then scored that set, dropouts included. Over my test window that recovered 73 names the naive approach would have silently deleted: failed banks, companies taken private, acquisition targets.
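The reconstruction can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: the `CHANGES` log, the tickers, and the function names are all hypothetical, and a real add and remove history would come from a data vendor or the index provider.

```python
from datetime import date

# Hypothetical change log: (effective date, ticker, "add" or "remove").
# The tickers and dates are made up for illustration.
CHANGES = [
    (date(2020, 1, 1), "AAA", "add"),
    (date(2020, 1, 1), "BBB", "add"),
    (date(2021, 6, 1), "CCC", "add"),
    (date(2023, 3, 15), "BBB", "remove"),  # e.g. a name that later failed
]


def members_as_of(changes, as_of):
    """Replay adds and removes up to and including as_of."""
    members = set()
    for effective, ticker, action in sorted(changes):
        if effective > as_of:
            break  # everything after this point is in the future
        if action == "add":
            members.add(ticker)
        else:
            members.discard(ticker)
    return members


def ever_members(changes, dates):
    """Every name that was a member on at least one of the test dates."""
    return set().union(*(members_as_of(changes, d) for d in dates))
```

Scoring `members_as_of(CHANGES, d)` on each date `d`, instead of today's list, is what keeps the dropouts in the sample. Comparing `ever_members` against the current membership gives the count of names the naive approach would have deleted.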

The result was the first of several surprises. Adding the dead names back did not make the engine look worse. It made it look slightly better, because the engine had correctly been bearish on several of the companies that later failed, and excluding them had been hiding that skill. Survivorship bias does not always flatter you in the direction you expect. The point is not which way it cuts. The point is that if you cannot say how your universe was constructed at each historical date, you do not know what your backtest measured.

Do not let the future leak in

The next lie is subtler: using information to score the past that did not exist in the past. This is look-ahead bias, and it hides in places you would never think to check.

Mine was the risk-free rate. The scoring code fetched the current 10-year Treasury yield, a perfectly reasonable thing to do live. But when I scored a stock as of early 2022, that fetch handed back today's rate of around 4.3 percent instead of the roughly 3 percent that actually prevailed in 2022, and that wrong rate fed straight into the valuation factor. Every historical score was contaminated by a number from the future.

The fix was to force the rate that prevailed at each as-of date. Mundane, except that the same change cut the run time from eighty minutes to ten, because the live fetch had also been a per-cell network call. Removing a look-ahead leak and a performance bug turned out to be the same edit, which is the kind of thing that happens once you start treating "what did this code actually know, and when" as the central question.
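One way to force the as-of rate is a lookup against a locally stored history, sketched below. The `RATE_HISTORY` values are placeholders rather than real market data, and `risk_free_rate` is a hypothetical name. The property that matters is that the function can only ever return an observation dated on or before the as-of date.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical yield history: (observation date, 10-year yield in percent).
# These numbers are placeholders, not real market data.
RATE_HISTORY = [
    (date(2022, 1, 3), 1.0),
    (date(2022, 6, 1), 2.0),
    (date(2023, 1, 3), 3.0),
]
_DATES = [d for d, _ in RATE_HISTORY]


def risk_free_rate(as_of):
    """Return the latest rate observed on or before as_of, never after it."""
    i = bisect_right(_DATES, as_of)  # count of observations dated <= as_of
    if i == 0:
        raise ValueError(f"no rate known on or before {as_of}")
    return RATE_HISTORY[i - 1][1]
```

Because the history is loaded once and searched in memory, the lookup also removes the per-cell network call, which is the same kind of speedup described above.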

Pick a universe that can answer the question

You can do everything above and still ask an unanswerable question. My engine is a ranker: its job is to say these stocks will beat those stocks. My first real test ran it on thirty large, familiar names, and the result was a flat nothing, an information coefficient indistinguishable from zero with confidence intervals wide enough to drive a truck through.

The mistake was the test set, not the engine. Thirty mega-caps all ride the same market and tech beta, and over 2022 to 2025 their returns were driven by a handful of AI winners, not by anything a fundamental ranker could sort. You cannot evaluate a ranker on names that mostly move together. There is nothing to rank. I checked that the scores themselves were varied and differentiated (they were), so the problem was that the population could not express the answer.

Moving to the full index, and then to a point-in-time mid-cap universe of around 1,300 names with real cross-sectional spread, changed the question from "is there an edge in the most efficiently priced corner of the market" to "is there an edge somewhere a small fund could actually find one." That is the question worth asking, and it needs a universe that can answer it.

Significance, honestly

Here is where most backtests quietly cross from analysis into self-deception. You compute an information coefficient per date, average it across dates, and slap a t-statistic on it. The naive t-statistic assumes your per-date measurements are independent. They are not. Scores are sticky month to month, whether a factor "works" runs in multi-month regimes, and at longer horizons the forward-return windows literally overlap. All of that makes consecutive measurements positively correlated, so the naive standard error understates the true one and the t-statistic comes out inflated, sometimes dramatically.

The correction is a Newey-West standard error, which accounts for that autocorrelation, and the difference is not academic. One of my factors showed a naive t-statistic of negative four, which looks like a screaming result. Its Newey-West t-statistic was negative 1.6, not significant at all. The "signal" was an artifact of treating forty-odd autocorrelated months as forty-odd independent facts.
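The correction is small enough to write out. This is a sketch of the textbook Newey-West estimator with Bartlett weights applied to the mean of a per-date IC series. It is not the engine's code, the function names are hypothetical, and the lag count is a choice the user has to make.

```python
from math import sqrt


def naive_tstat(x):
    """t-statistic of the mean, assuming independent observations."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / (n - 1)
    return mu / sqrt(var / n)


def newey_west_tstat(x, lags):
    """t-statistic of the mean using a Bartlett-kernel (Newey-West) variance."""
    n = len(x)
    mu = sum(x) / n
    d = [v - mu for v in x]
    # Start from the lag-0 autocovariance.
    long_run_var = sum(v * v for v in d) / n
    for k in range(1, lags + 1):
        gamma_k = sum(d[t] * d[t - k] for t in range(k, n)) / n
        weight = 1 - k / (lags + 1)  # Bartlett weight, shrinks with the lag
        long_run_var += 2 * weight * gamma_k
    return mu / sqrt(long_run_var / n)


# Made-up persistent series: the value runs in six-month regimes.
ic_series = [0.05] * 6 + [-0.01] * 6 + [0.05] * 6 + [-0.01] * 6
print(naive_tstat(ic_series), newey_west_tstat(ic_series, lags=3))
```

On a series that runs in regimes like this one, the Newey-West figure comes out smaller in magnitude than the naive one, because the positive autocovariances are added back into the variance. For overlapping forward-return windows, a common rule of thumb is to set `lags` to at least the horizon in periods minus one.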

The deeper lesson hiding in that number: your significance is bounded by your number of independent time periods, not your number of stocks. I could add thousands of names and tighten each date's measurement, but with roughly 45 monthly dates I had roughly 45 observations, and no amount of cross-sectional breadth changes that. More stocks make each dot more precise. Only more time gives you more dots.

Pre-register, or you will cherry-pick

By this point I had looked at a lot of cells: four horizons, several factors, a couple of universes. Somewhere in that grid, something always looks good, the same way some lottery ticket always wins. If you run the whole grid and then report the prettiest cell as your finding, you have not found an edge. You have found the expected maximum of many noisy draws, and dressed it up as a result.

The defense is to write the test down before you run it. Before the mid-cap run I wrote a short pre-registration: the hypothesis was that the momentum factor, which had been the most consistent thing across earlier runs, would show a positive information coefficient at the one-month and six-month horizons, with a magnitude of at least 0.03. I committed to that being the test, and to treating anything else that lit up as a new hypothesis to be tested later, not a confirmation. The bar I held new signals to was a t-statistic of 3, not the textbook 2, for exactly this reason: when you have searched many candidates, the bar has to rise to account for the search.
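One simple way to see why the bar rises is a Bonferroni correction, sketched here. It is a deliberately blunt stand-in for the argument, not the method used to arrive at the threshold of 3, and it uses a normal approximation to the t-distribution. The function name is hypothetical.

```python
from statistics import NormalDist


def bonferroni_threshold(n_tests, alpha=0.05):
    """Two-sided |t| cutoff (normal approximation) after looking at n_tests cells."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_tests))


# One pre-registered test versus a hypothetical grid of 24 cells.
print(bonferroni_threshold(1))
print(bonferroni_threshold(24))
```

With a single test the cutoff is the familiar value just under 2. With a couple of dozen cells it lands a little above 3, which is roughly where the stricter bar sits.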

A gross signal is not a tradeable one

Say a signal survives all of that. It still might not be worth anything, because an information coefficient is a correlation, not a profit. I built the momentum signal into an actual monthly-rebalanced long-short portfolio and charged it realistic mid-cap transaction costs. It survived the costs, with a break-even round-trip cost far above what mid-caps actually cost to trade.
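The break-even figure is simple arithmetic, sketched below under one stated assumption: costs are charged as turnover times a round-trip cost per unit traded. The function names and the example numbers are hypothetical, not the engine's results.

```python
def net_return(gross_return, turnover, round_trip_cost):
    """Per-period return after costs.

    turnover is the fraction of the book traded round trip in the period,
    round_trip_cost is the cost of one full round trip as a fraction of value.
    """
    return gross_return - turnover * round_trip_cost


def break_even_cost(gross_return, turnover):
    """Round-trip cost at which the net return is exactly zero."""
    return gross_return / turnover


# Hypothetical month: 1% gross return with half the book turned over.
print(break_even_cost(0.01, 0.5))
```

If the break-even cost sits well above what the names actually cost to trade, the signal survives costs. That says nothing yet about whether the gross return itself is distinguishable from zero.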

But then I made myself compute the one number I had been avoiding: the t-statistic of the portfolio's return itself. The Sharpe ratio was 0.42, which sounds fine until you realize that over fewer than four years, a Sharpe of 0.42 has a t-statistic of about 0.8. The strategy's actual returns were not distinguishable from zero. The information coefficient was real and the portfolio return was not, which sounds contradictory until you see why: the IC pools hundreds of names per month into a tight estimate, while the concentrated portfolio collapses them into two buckets whose monthly return is mostly noise. I had been about to call a coin flip a strategy.
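The conversion from Sharpe ratio to t-statistic is worth writing out, because it is one line. The sketch assumes roughly independent returns and takes the sample length to be 45 months, matching the count of monthly dates mentioned earlier. Both are approximations, and the function names are hypothetical.

```python
from math import sqrt


def sharpe_tstat(annual_sharpe, years):
    """Approximate t-statistic of the mean return implied by an annualized Sharpe."""
    return annual_sharpe * sqrt(years)


def years_needed(annual_sharpe, target_t=2.0):
    """Years of data required for that Sharpe to reach a given t-statistic."""
    return (target_t / annual_sharpe) ** 2


print(sharpe_tstat(0.42, 45 / 12))  # about 0.8
print(years_needed(0.42))           # more than two decades
```

The second function is the uncomfortable one: at a Sharpe of 0.42, even the lenient bar of 2 takes more than twenty years of returns to clear.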

The test that actually matters

Everything so far is in-sample: I formed my hypotheses on the same 2022-to-2025 data I was testing them on. The real test is whether a signal holds on data that played no part in finding it.

So I rebuilt the whole thing on 2015 to 2021, a period I had never looked at. I had to fetch deeper price history to do it, and I kept the bias in my favor on purpose: the older universe was tilted toward survivors, which should flatter the result.

The in-sample momentum signal had been my crown jewel. It was pre-registered. It was robust when I split the sample in half. It cleared the strict t-statistic-of-3 bar at the six-month horizon. It survived transaction costs. By every in-sample measure it looked like a real edge.

Out of sample, it was gone. The one-month information coefficient fell from 0.035 to 0.007. The six-month fell from 0.075 to 0.004. Both indistinguishable from zero, and zero even with a survivorship bias actively working in its favor. The signal that passed every test I knew how to run was a property of 2022 to 2025, not a property of the market. Had I shipped it after the in-sample results, I would have traded real money on noise and watched it evaporate, and I would have blamed the market instead of my method.

What the apparatus is for

Tell someone you spent weeks backtesting a trading engine and concluded it has no durable edge, and it sounds like a failure. It is the opposite. The entire value of an honest backtest is that it stops you from believing in an edge that is not there, and mine did exactly that, loudly, before any money was on the line. Finding nothing rigorously is a far better outcome than finding something falsely, because only one of those two costs you when it is wrong.

This is worth being blunt about, because the instinct after a null result is to feel like the compute was wasted. It was not. The alternative to this work was not finding an edge. It was shipping a backtest that looked great, putting a public track record behind it, and watching the edge evaporate in live trading with real money on the line. In quantitative finance the base rate is no edge: most people who test honestly find nothing, and most people who "find something" found an overfit artifact and pay tuition to learn that later. The rare and uncomfortable skill is running the out-of-sample test that kills your own best idea and then believing it.

What the work actually produced

A null result is not an empty hand. Three concrete things outlived the conclusion.

A reusable measurement rig. Point-in-time scoring, survivorship-free universe reconstruction, Newey-West significance, cost-aware portfolio simulation, an out-of-sample harness. None of it was specific to the signal that failed. It now sits ready to evaluate the next idea in an afternoon instead of a month, which is the difference between asking "does this work" and being able to answer it.

A false positive caught for the price of some compute. The momentum signal would have made a confident, well-argued, completely wrong product. Learning that now, instead of after a year of live underperformance and a damaged track record, is the cheapest that mistake will ever be.

The method itself, written down. The seven checks above are the real deliverable. The engine was just the example; the discipline transfers to any signal, any strategy, any claim that something predicts the future.

What runs next

The work is not finished. It has moved from history to the present. The one thing I cannot test by replaying the past is the part of the engine that reads news and forms a qualitative view, because there is no honest way to reconstruct what a language model would have concluded in 2018 without leaking the future into it. That can only be measured forward.

So the forward test is now live. A resolver logs every prediction the engine makes, dated and immutable, and grades it against realized returns when the future actually arrives, scoring the quantitative factors and the qualitative layer separately so I can finally see whether the judgment adds anything the numbers do not. It runs on a schedule and accumulates on its own.
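The shape of such a log can be sketched as follows. This is an illustration of the idea and not the resolver itself: the `Prediction` fields, the layer labels, and the `resolvable` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)  # frozen: a logged prediction cannot be edited afterward
class Prediction:
    made_on: date
    ticker: str
    layer: str         # "quant" or "qualitative", so the two are graded separately
    score: float
    horizon_days: int

    @property
    def resolves_on(self) -> date:
        """First date on which the realized return is fully known."""
        return self.made_on + timedelta(days=self.horizon_days)


def resolvable(log, today):
    """Predictions whose horizon has fully elapsed as of today."""
    return [p for p in log if p.resolves_on <= today]
```

The frozen dataclass only prevents accidental edits in memory. Real immutability needs an append-only store, and a scheduled job would call something like `resolvable` and grade each returned prediction against realized returns.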

The honest part is the timeline. Significance is bounded by independent observations in time, not by the number of stocks, so a modest edge needs years, not months, to confirm or rule out. What the first year buys is not a verdict. It is a credible, dated, honestly-resolved track record, which in a field full of fabricated ones is its own kind of asset. So the next step is the least glamorous and most underrated one in all of this: wait, measure, and let the future grade the guesses. If the qualitative layer predicts nothing either, I would rather find that out in daylight than sell it in the dark. And if it predicts something, I will have the one thing almost nobody selling an edge actually has, which is the receipts to prove it.


Written by Eric Caskey.