A Boring Design Let Me Run a Black Swan on a Tuesday

By Eric Caskey · May 28, 2026 · 8 min read

AI finance Python risk software-development side-projects

Two posts ago I made an argument that sounded a little puritanical: keep the scoring and the rules deterministic (same inputs, same outputs, no AI anywhere near the deciding seat) because the part of this that touches my money has to stay inspectable. The second post was that same discipline throwing out a scoring model after it graded one of the best companies in the market a D+. Both were about getting an ordinary day right.

This is the post where that bet paid off in a way I had not planned for. Because the engine is deterministic, I can do something you simply cannot do to a system that improvises: rebuild a market crash that never happened to me, feed it to the actual production code, and trust that what comes out is what a real crash would have produced. Nothing to reproduce, nothing hidden, no "I am pretty sure it behaves like this." So one Tuesday afternoon I ran a black swan, a rebuilt March 2020, straight through my portfolio reviewer.

It handed me a flaw. A real one, the kind that quietly loses money, except it lost that money on paper, in a crash I made up, instead of in a real one with my savings inside it. That is what the post is about, and it is not really about the flaw. It is that a boring, auditable design was the only reason I could go looking for the flaw at all, and find it somewhere it could not cost me anything.

What the reviewer actually grades#

A quick refresher, because the rest only makes sense if you know what the tool does. Every week it grades each holding I own, and it does not hand back a single number. It scores a stock on several separate dimensions (how good the business is, how cheap it is, how it has been trending, how solid the balance sheet looks) and shows me each of those before blending them into one letter grade.

That segmentation is the whole point. A blended B+ tells me nothing I can act on. A great business getting cheaper with collapsing momentum is a completely different situation from a weak business trading flat, even when both land on a B+. I want to see every side of a stock, not an average that hides them. The grade is the headline; the dimensions underneath it are the story.

The gate I was worried about#

On top of those grades sits a safety gate. When the market is falling hard and fear has stayed high for a while, the gate steps in and quiets most of the sell and trim signals. The reasoning sounds obviously right when you write it down: a grading system fed nothing but falling prices will want to sell everything at the bottom, and selling everything at the bottom is the most expensive mistake a normal investor makes. So the gate holds the system's hand in exactly the moment its instincts are worst.

I had tested that the gate fires. I had never tested whether it helps. Those are not the same question. Firing is one moment; helping is the whole crash and the recovery that comes after it, and I had only ever looked at the moment.

A crash I could run on purpose#

A black swan is, by definition, something you have not lived through yet, so I could not sit and wait for one. But I did not have to. I had an AI agent build me a replay: a rebuilt March 2020: twelve weeks, a third of the market gone by week five, fear spiking to levels you see maybe once a decade, then a sharp bounce back toward where it started. I ran a sample portfolio through it week by week, a spicy mix with a sleeve of speculative names, the kind of allocation that makes a crash interesting.

The one rule I cared about was that the test had to drive the real engine, not a copy of it, not my memory of how it works, but the actual production code that decides what to buy, sell, and trim. Each week the replay handed the genuine engine the same kind of data it sees in real life, took the calls it was confident enough to act on, and did the same for a do-nothing version that just held everything and never touched it. Then it compared the two. This is the move the deterministic design quietly makes possible: feed the real code a crash that never happened and trust the result, because the same inputs always produce the same outputs and there is no judgment call hiding in the middle to reproduce.

In the replay, doing nothing won#

In the replay, holding everything untouched through the crash and the bounce preserved more of the portfolio than running my tool did. The reviewer came out about two and a half percent behind a strategy that needs no tool, no grades, and no thought at all. The thing I built to protect me lost (on paper, in a crash I made up) to doing literally nothing.

The why is the part that stuck with me, because it is not what I expected. The gate is reactive by design: it watches for a drop that is already underway and fear that has already spiked, so it does not trip until the market is a week or two into falling. I assumed that lag would be its undoing, that it would kick in too late, after the system had already panic-sold near the bottom. The opposite happened. The gate tripped in week two and stayed on the whole way down, clamping most of the sell signals shut. At the very bottom, in week five, my tool was a full six percent ahead of doing nothing.

Six percent ahead, but not quite for the reason the gate gets to take credit for. The tool was ahead at the bottom because it had trimmed a little into cash in the first week, before the gate clamped down, and in a crash cash beats falling stocks. What the gate itself did was stop it from trimming the rest. It refused to let the system dump the whole portfolio into the panic, which is the one mistake it exists to prevent. On the way down, that is exactly the behavior I wanted.

Then the market recovered, and the system gave all of it back and then some. The gate clamps most sells, but a handful of trims are exempt, flagged important enough to fire even in a crash, and those fired every single week. Twelve weeks, twelve trims into the speculative sleeve. By the time the bounce came I was holding far less of exactly the stuff that bounced hardest. So the gate did its job on the way down, and then a side door I had left open quietly sold off my upside on the way back up.

That is worse than a bug. A bug I could fix. This was every piece working exactly as designed, and the pieces together producing something none of them intended. I had a passing test proving the gate fires when it should (drop and fear in, sell signals clamped out) and that firing was the only thing I had ever proven. What it costs over a full round trip, once the market climbs back out, I had simply assumed.

What I am not claiming#

I want to be honest about what this number is not. It is one crash, run once. I rebuilt March 2020; I have not yet run 2008 or the dot-com unwind. Roll the dice differently and the two and a half percent could shift a point or two either way. I would not bet anything on the exact figure.

A fan of simulated price paths hanging in 3D space, colored by drawdown depth, with fat-tailed runs dangling deep below the rest in yellow What rolling the dice looks like. This is the drawdown topology from the playground, switched to fat-tailed returns: every strand is one simulated run, and the yellow ones are the runs that dig deep. My replay is one strand of a fan like this.

And there is a bigger caveat. In real life the system makes itself wait: a signal has to show up two weeks running before it acts on it, which slows the whole thing down, especially early in a crash. My replay skipped some of that patience, so it almost certainly traded more aggressively than the real tool would. The takeaway is not the decimal. It is that a stress test surfaced a money-losing interaction between three features that each looked correct on their own, and it did it without my having to lose a real dollar in a real crash to find out.

What the boring design actually bought me#

This is the part I did not see coming when I made the original bet. Keeping the scoring and the rules deterministic (same inputs, same outputs, no AI in the deciding seat) was supposed to buy me inspectability on ordinary days: the ability to ask why a stock got the grade it got and get a real answer. It turns out it bought something larger. Because the engine is predictable, I could feed it a catastrophe and trust that it behaved exactly as it would in a real one. You cannot replay a system that improvises, because you cannot reproduce a judgment call. The boring, auditable design is the only reason I could run a black swan through my own code on a Tuesday afternoon and believe the answer.

It is also the role I keep wanting AI to play. The agent did not decide whether the gate was good or bad. It built the test rig, ran the crash, flagged where its own shortcuts fell short, and handed me the result to judge. The judgment about what to do with a tool that sells the recovery is mine, and it should be. The slow, careful plumbing is the part I am happy to hand off.

So I have not shipped a fix, and that is the same discipline talking. The fix is easy to describe: teach those exempt trims that they are firing into a recovery, the same way the gate already knows it is firing into a crash, so the system stops selling the bounce. I could write it this afternoon. What I cannot write this afternoon is proof that it helps. I have watched the current behavior fail against one crash, run once, and shipping a fix on that much evidence would be the exact mistake this whole project is built to avoid: confirming that the new behavior fires and calling that proof it is better. So the fix waits until I have run it against more than one crash (the 2008 grind and the dot-com unwind among them) with the dice rolled enough times that I trust the answer. The design that let me catch the problem is the same design that lets me test the fix properly. That is the entire point of building it boring.

Keep reading

Tool

Investment Committee

Score any stock across five weighted dimensions and get a letter grade with a written committee verdict.

Read

Tool

Market's Best

The top-graded stocks from the latest market scan. No sign-in needed.

Read

Demo

Grade my portfolio

Run a sample portfolio through the investor committee.

Read

Case study

Factor-First AI Investment Platform Narrated by a Six-Persona Committee

Grew a single-ticker grader into a full investment platform: a four-factor composite (Quality, Valuation, Momentum, Health) narrated by a six-persona committee, a nightly scan of ~400 large caps, portfolio and net-worth tracking, and a grade scale validated by a daily backtester.

Read

Post

Composite What You Trust, Watch What You Don't: A Trust Boundary for Data With Money Attached

Every system that fuses signals into one consequential number has a fault line: the data you trust enough to composite into a grade versus the data you only trust enough to watch. How I drew that boundary in my personal finance engine, and how a test keeps it honest.

Read

Post

Building a Personal Finance Reviewer: What Survived the Rewrite

A personal portfolio reviewer where the scoring is deterministic and the AI only narrates. The architecture that held up after I had to rewrite the model it was built on, and why that boundary is the whole point.

Read

Follow the work

New tools and writing as they ship — pick a channel.

RSS feed LinkedIn

Written by Eric Caskey. I build AI tools you can actually use. Explore the Tools or see the case studies.