When the Spec Was Wrong: Rewriting a Shipped Decision

Two weeks ago I wrote a post about a personal finance app I'd built. The part I was proudest of was the scoring engine: a six-dimension rubric based on Graham's value framework, with clear thresholds and transparent logic. The LLM only narrated the results; it didn't decide them. The argument was that structure mattered more than intelligence for a system with real money on the line.

A week later I ran the rubric against my actual portfolio. NVDA scored a D+.

That's not a tuning problem. A scoring system that grades the most dominant semiconductor company in the world a D+ is broken at the level of the rubric, not the thresholds. So I rewrote it. This post isn't about value investing; it's about what happens when a shipped spec meets reality.

The shipped spec accumulates assumptions you stop questioning

I chose Graham's framework because I'd read The Intelligent Investor and Graham was what I knew. The v1 rubric was a faithful operationalization of his principles: P/E thresholds, P/B as a primary signal, current-ratio screens, dividend continuity. Every grade reproducible.
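
For concreteness, here's roughly the shape that kind of engine takes. This is a minimal sketch, not the shipped rubric: the cutoffs are the classic Graham defensive-investor numbers used as placeholders, and the field names are mine.

```python
from dataclasses import dataclass

@dataclass
class Fundamentals:
    pe: float             # trailing price-to-earnings
    pb: float             # price-to-book
    current_ratio: float  # current assets / current liabilities
    dividend_years: int   # consecutive years of dividends paid

def graham_grade(f: Fundamentals) -> str:
    # Illustrative checks only; the real rubric had six dimensions and its own cutoffs.
    checks = [
        f.pe <= 15,              # absolute P/E ceiling
        f.pb <= 1.5,             # book value as a primary signal
        f.current_ratio >= 2,    # balance-sheet liquidity screen
        f.dividend_years >= 20,  # uninterrupted dividend record
    ]
    # Deterministic, reproducible mapping from passed checks to a letter grade.
    return {4: "A", 3: "B", 2: "C", 1: "D"}.get(sum(checks), "F")
```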

Then I ran it against modern stocks. Every one of them failed. Not just NVDA. The intangibles-heavy S&P names came in the same way: D's and F's where I expected B's. The rubric was reliably grading every modern stock as a failure, which is another way of saying it was grading them against 1949.

The previous post argued that the spec is the product, and I still believe that. But shipping a spec is also when you start defending it. The deterministic engine was the right call, and I'd ship it again. The deterministic rules inside it are a separate question, and that distinction is easy to miss once a system is live. Consistency feels like correctness. A rubric that produces the same output every time looks like it's working. But reproducible isn't the same as right; it just means the same answer every time, whether the answer is any good or not.

What was actually broken

Three structural problems, all of them obvious in retrospect.

The framework excluded most of my portfolio by design. Graham wrote in an industrial economy, and he explicitly bracketed companies where intangible assets dominate value. That's roughly the top 40% of the S&P 500 by market cap today. Using absolute P/E thresholds against that universe doesn't fix the mismatch; it bakes it in.

The rubric went only as deep as my own reading. Graham gave me a framework I could operationalize because I'd spent years reading him, but he has no factor models, no sector-relative scoring, none of the toolkit modern quants actually use. The v1 spec was a personal reading list dressed up as a spec, and nothing on the reading list was post-1949.

Two scoring systems had grown in parallel. A "committee" of investor personas had been added on top of the factor scorers, and each persona computed its own weighted score. Same position, different grades, no defensible way to reconcile them. The spec said "deterministic scoring," but the implementation had drifted into two deterministic systems that disagreed.
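
To make that drift concrete, here's the shape of it with invented names, weights, and numbers; none of this is the shipped code.

```python
# Two deterministic scorers over the same inputs, weighted differently.
factor_weights = {"value": 0.5, "quality": 0.3, "momentum": 0.2}
persona_weights = {
    "deep_value":  {"value": 0.8, "quality": 0.1, "momentum": 0.1},
    "growth_hand": {"value": 0.1, "quality": 0.4, "momentum": 0.5},
}

def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    return sum(scores[k] * w for k, w in weights.items())

scores = {"value": 35.0, "quality": 80.0, "momentum": 90.0}  # one hypothetical position

engine = composite(scores, factor_weights)                                 # 59.5
committee = {p: composite(scores, w) for p, w in persona_weights.items()}
# {'deep_value': 45.0, 'growth_hand': 80.5} -- same position, three different answers
```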

Each of these was a shipped assumption I'd stopped questioning. None of them needed new evidence to surface. They needed someone reading the spec who wasn't me.

Perplexity Computer did the literature review

Before I rewrote a single line of the rubric, I had Perplexity Computer run a deep-research pass on modern factor frameworks. AQR's quality-value-momentum work, MSCI's factor model documentation, JP Morgan's factor views, the Russell research notes: what institutional quants actually use to evaluate equities in 2026, sourced and synthesized into a report I could read in an evening.

This is the part I want to draw out for engineers, because it changes what spec-driven development is tractable for. The first version of the spec was Graham-aligned because Graham is what I'd read. The rewrite is grounded in modern factor research because an agent did the literature review for me. What would have been a week of reading on my own showed up as a synthesis I could absorb between dinner and bed, with sources cited so I could chase any claim I wanted to verify.

Computer wasn't writing the spec. I was. But the work that has to happen before you can write a defensible spec (surveying current practice, comparing approaches, finding the dead ends) is exactly what an agent is good at. It runs sub-agents in parallel across different sources, summarizes where they conflict, and hands you something concrete to argue with.

Spec-driven development assumes you can write specs. AI-assisted research is what makes it tractable to write specs about domains you're not already an expert in, and that's the loop that's going to drive every rewrite from here.

The rewrite, and the decision that made it cheap

The replacement is a three-factor composite: Quality, Value, Momentum. The personas got demoted from parallel scorers to factor narrators. Each persona owns one factor and explains that factor's score in its own voice instead of computing a competing one. That collapses the dual-scoring problem: the committee output now follows from the factor output, not against it.
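
A sketch of that shape, under stated assumptions: the factor names come from the rewrite, but the weights, persona names, and narrator wiring are illustrative, not the shipped implementation.

```python
FACTOR_WEIGHTS = {"quality": 0.4, "value": 0.35, "momentum": 0.25}  # illustrative weights

# Each persona owns exactly one factor and only narrates it.
FACTOR_NARRATORS = {"quality": "the operator",
                    "value": "the value investor",
                    "momentum": "the trend follower"}

def score_position(factor_scores: dict[str, float]) -> dict:
    # One deterministic composite; no second scoring path.
    composite = sum(factor_scores[f] * w for f, w in FACTOR_WEIGHTS.items())
    # Personas explain the factor they own; they no longer compute anything.
    narration_prompts = {
        persona: f"Explain the {factor} score of {factor_scores[factor]:.0f}/100 "
                 f"in {persona}'s voice."
        for factor, persona in FACTOR_NARRATORS.items()
    }
    return {"composite": composite, "narration_prompts": narration_prompts}
```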

The non-obvious decision in the rewrite was a negative one: keeping the implementation in Python.

The new spec used TypeScript pseudocode for readability, and there was real temptation to "modernize" the backend alongside the algorithmic change. Same blast radius, same deploy, why not. The answer is that none of the working pieces would have been better in another language: not the FMP client, not the caching layer, not the Lambda handlers, not the Pydantic models. The algorithm changes were independent of the host language, and combining the two would have doubled the surface area for new bugs. I'd have spent a week debugging serialization instead of validating scoring behavior.

The cheapest rewrite is the one that touches one axis at a time. Language migrations alongside algorithmic changes are the most expensive form of yak-shaving I know.

Then a review caught what I'd already convinced myself was right

The first version of the new rubric kept Growth as a separate fourth factor at 20%. A soundness review pass caught what I hadn't: Growth correlates with Quality through earnings persistence, and with Momentum through earnings momentum. Carrying it as its own bucket was scoring the same underlying signal up to three times.

So the rewrite got rewritten. Three factors, not four. Growth was deleted, and its sub-signals folded into Quality and Momentum where they actually belonged. The same review caught a second problem: the new valuation formula didn't actually solve the intangibles bias it was advertised to solve. It softened the bias; it didn't fix it. The replacement uses free cash flow and owner-earnings yield, which removes the GAAP-earnings distortion that breaks earnings-based screens for software, semis, and stock-comp-heavy tech.
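
A minimal sketch of what a cash-based valuation signal looks like; the function names and the owner-earnings definition here (the usual net income plus D&A minus maintenance capex form) are my assumptions, not the shipped formula.

```python
def fcf_yield(free_cash_flow: float, market_cap: float) -> float:
    return free_cash_flow / market_cap

def owner_earnings_yield(net_income: float, depreciation_amortization: float,
                         maintenance_capex: float, market_cap: float) -> float:
    # Owner earnings in the common form: reported earnings plus non-cash charges,
    # less the capex needed to maintain the business.
    owner_earnings = net_income + depreciation_amortization - maintenance_capex
    return owner_earnings / market_cap

# Cash-based yields sidestep the GAAP distortions (stock comp, intangible
# amortization) that make earnings-based screens punish modern tech names.
```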

Two structural problems, both shipped under "Accepted," both caught by an external read of the spec. The review pass is part of the spec-driven loop, not an afterthought to it. ADRs are cheap. Reviews of ADRs are what make them survive.

What this means for spec-driven development

Three things I'd carry over to any project where a deterministic spec drives a system with consequences.

Specs are versioned, not eternal. ADRs that supersede other ADRs aren't a failure mode; they're the system working. The original rubric is now superseded twice over: once by the pivot to the factor composite, again by the consolidation that folded Growth into Quality and Momentum. Each supersession is a record of what changed and why. The cost of writing those records is low. The cost of skipping them is silent drift between what the code does and what anyone thinks it does.

Have an agent do the literature review. The strongest single decision in the rewrite, before any code, was running Perplexity Computer against the current factor literature instead of writing the spec from memory. Specs grounded in "what I happen to have read" stop scaling the moment you cross out of your own depth. Specs grounded in "what an agent surveyed last week" can be written about almost anything you're willing to argue with.

Touch one axis at a time. The rewrite most likely to fail is the one that changes the algorithm, the language, and the deployment model in the same PR. Every axis you change multiplies the surface area for new bugs. The Python decision was worth more than any single algorithmic improvement in the rewrite.

The original post argued that the spec is the product. That's still true. The qualifier I'd add now is that the spec is the product only if you're willing to rewrite it. Shipped specs that don't get rewritten aren't specs; they're folklore.

If you're building anything that scores or grades with real consequences, ship the deterministic version first. That covers risk scores for transactions, severity grades for incidents, factor models for portfolios. Then commit to revisiting it the moment it touches reality. The first version will be wrong about something specific. The second version will be wrong about something different. That's the loop.