Caskey Engineering

← Back to Blog

Building a Personal Finance Reviewer: What Survived the Rewrite

I run a personal portfolio reviewer. It takes a position export, scores every holding on a deterministic multi-factor model, flags allocation and risk breaches against hard thresholds, and produces a written review I can act on in a few minutes instead of a Saturday morning.

An earlier version of this post described that scoring model as a value-investing rubric. That was wrong, and not in a small way: the rubric graded one of the most dominant companies in the market a D+. I rebuilt the model. The story of how it was wrong, how I caught it, and what the rewrite cost is its own post, When the Spec Was Wrong, and I would rather link to it honestly than quietly edit the old claim out of existence.

This post is about the part that did not change. The model got replaced. The architecture around it did not, and the architecture is the part worth writing down, because it is the part that made the rewrite survivable.

The one boundary that matters

The system has three layers and a rule about which one is allowed to decide anything.

A deterministic scoring layer reads the positions and the market data and produces scores. Same inputs, same outputs, no model in the loop. A rules layer applies non-negotiable thresholds: concentration limits, sleeve caps, drift bounds. These are not advice; they fire or they do not. Only then does an LLM see the result, and its job is strictly to narrate: explain the scores, surface the breaches, write the review a careful investor would recognize as coherent. It does not compute a score. It does not overrule a threshold. It is told the numbers; it is never asked to derive them.

That boundary is the entire design. Everything load-bearing is deterministic and auditable; the model is a writing layer on top of facts it cannot change. When I had to throw out the scoring model, the blast radius stopped at the deterministic layer. The rules did not move. The narration did not move. The audit trail did not move. A rewrite that would have been frightening in a system where the AI decided things was a contained change in a system where it only describes them.

Never let a missing number become a confident one

The failure mode I care most about in a system with money attached is not a wrong opinion. It is a fabricated fact presented with the same confidence as a real one.

The concrete version: a price fetch fails for one holding. The lazy behavior is to fall back to zero, multiply it through, and emit a portfolio that looks complete and is quietly wrong, with a sell recommendation on a position that did not actually crater. The reviewer does not do that. A price fetch resolves to one of three states: a fresh value, a stale-but-real cached value with its age attached, or an explicit failure. A failed fetch is excluded from totals, weights, grades, and recommendations rather than silently coerced into a number. Stale-by-a-day, clearly labeled, is a far better signal to a human than a confident zero.

I will be honest about where this principle is still being enforced rather than finished. The sparse-fundamentals case, where a data provider returns nothing for a thinly covered instrument, is harder than the price case and is the edge I am still tightening. The principle is settled. The coverage is not yet complete, and I would rather say that here than imply otherwise.

The audit trail is a feature, not logging

Every review is written as an immutable record: the scored output, the recommendations, and hashes of the exact inputs and configuration that produced them. Reviews are not overwritten and not expired. If a recommendation looks wrong three months later, I can reconstruct precisely what data and what rules produced it.

This sounds like compliance theater for a one-user tool. It is not. It is what keeps the system honest with me over time. A reviewer I cannot reconstruct is a reviewer I have to trust on feel, and the entire reason this thing exists is that I do not trust my own recency bias with real money. The immutability is the same instinct as the deterministic boundary: keep the parts with consequences inspectable, and keep the AI on the side of the system where being inspectable is not required.

What the rewrite actually taught me

The lesson is not "I picked the wrong model first." Everyone picks the wrong model first. The lesson is that the cost of being wrong is set long before you are wrong, by where you drew the line between the parts that decide and the parts that describe.

Because the scoring layer was deterministic, isolated, and specified separately from the code, replacing it was a bounded edit with a written decision record behind it. Because the AI was a narrator and not a judge, none of the rewrite touched it. The expensive version of this mistake, the one where the model is fused into the recommendations and the recommendations are fused into the prompt, is a system you cannot rewrite without rebuilding. I did not have that system, by deliberate choice, and that choice was worth more than any single thing the original model got right.

If you are building anything that scores or grades where the output has consequences, the durable question is not which model. It is: when the model turns out to be wrong, and it will, how much of the system has to move with it? Keep that number small on purpose. Ship the deterministic version first, write down why, and assume the first model is a placeholder for the second one. The code is the cheap part. The boundary is the product.