
Building a Personal Finance App with AI: Graham-Aligned Portfolio Review

I've been a Graham-style value investor for years. Defensive core, margin of safety, a small speculative sleeve for higher-conviction moonshots. For most of that time my "research workflow" was a mess — a rotating cast of AI assistants, Fidelity's web UI, and ad-hoc Python scripts I'd run from a Jupyter notebook when I remembered to update them.

It worked. Sort of. But it was slow, inconsistent, and prone to recency bias. I'd spend forty minutes reviewing a single position on a Saturday morning and still feel like I'd missed something.

So I built a tool to do it better. The result is a personal financial position reviewer — an AI system that continuously evaluates my portfolio against a Graham-aligned framework, scores every position, detects allocation drift, and generates plain-English rebalancing recommendations. It runs on the same stack as this site: Python Lambda on AWS, AI orchestration via Bedrock, CDK for infrastructure.

This is what I learned building it.

The Problem Is Structure, Not Intelligence

The instinct when you hear "AI-powered finance app" is to imagine some clever LLM reasoning through earnings reports and generating stock picks. That's not this.

The honest lesson I keep relearning — in enterprise systems at work and in personal projects at home — is that the hard part is rarely the AI. It's the structure around it. What data do you trust? What rules are non-negotiable? When does AI add value, and when does it add hallucination risk?

For a portfolio reviewer, those questions are especially sharp. Decisions here have financial consequences. I can't hand that over to a model and hope it reasons correctly about my actual holdings, my actual cost basis, and my actual risk tolerance.

So the architecture is deliberately layered:

  1. Deterministic scoring engine — Every position gets a score on a consistent rubric: valuation (P/E, P/B, Graham number), allocation drift (am I overweight?), beta contribution, margin of safety. Same inputs, same outputs. No LLM involved.
  2. Rules-based alerting — Hard thresholds: speculative sleeve over 5% triggers a flag. Single position over 8% of portfolio triggers a flag. Dry powder below target triggers a flag. These are not suggestions.
  3. LLM narrative layer — Once the deterministic layer has done its work, the LLM receives the scored output and writes a plain-English portfolio review. It explains, it contextualizes, it recommends. But it's working from structured facts, not raw market data.

This separation is the most important design decision in the system. The LLM is a narrator, not a judge.
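To make the separation concrete, here's a minimal sketch of the rules-based layer. The thresholds match the ones above (5% speculative sleeve, 8% single-position cap, dry-powder target); the `Position` model and field names are illustrative, not the production schema.

```python
from dataclasses import dataclass

@dataclass
class Position:
    symbol: str
    value: float   # current market value in dollars
    sleeve: str    # "core" or "speculative"

def check_flags(positions: list[Position], cash: float,
                dry_powder_target: float = 0.10) -> list[str]:
    """Rules-based alerting: hard thresholds, no model involved.
    Same inputs always produce the same flags."""
    total = sum(p.value for p in positions) + cash
    flags = []
    # Speculative sleeve over 5% of portfolio triggers a flag.
    spec = sum(p.value for p in positions if p.sleeve == "speculative")
    if spec / total > 0.05:
        flags.append(f"speculative sleeve {spec / total:.1%} exceeds 5% cap")
    # Any single position over 8% of portfolio triggers a flag.
    for p in positions:
        if p.value / total > 0.08:
            flags.append(f"{p.symbol} is {p.value / total:.1%} of portfolio (8% cap)")
    # Dry powder below target triggers a flag.
    if cash / total < dry_powder_target:
        flags.append(f"dry powder {cash / total:.1%} below "
                     f"{dry_powder_target:.0%} target")
    return flags
```

The LLM later receives these flag strings as facts; it never recomputes them.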

Data Ingestion: The Unglamorous Part

The first real problem was getting data in. Fidelity doesn't have a public API for individual accounts, so v1 uses CSV files — position exports pulled manually from the Fidelity UI and uploaded to the system.

This is not glamorous. But it's practical. The alternative is a brittle screen-scraper or a third-party aggregator with its own security surface area. For a personal tool with one user, a well-structured CSV pipeline is fast to build and easy to trust.

The ingestion Lambda parses the CSV into a typed position model, enriches it with current market data from a public API, and writes to DynamoDB. The whole round-trip — upload to scored portfolio — runs in under thirty seconds.
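The parsing step looks roughly like this. The column names (`Symbol`, `Quantity`, `Cost Basis`) are placeholders — a real Fidelity export has its own header row and needs its own mapping — but the shape of the work is the same: strings in, typed records out.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class RawPosition:
    symbol: str
    quantity: float
    cost_basis: float

def parse_positions(csv_text: str) -> list[RawPosition]:
    """Parse a position export into typed records.

    Column names here are illustrative; brokerage exports also tend to
    wrap numbers in "$" and thousands separators, so strip those before
    converting to float.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = []
    for row in reader:
        out.append(RawPosition(
            symbol=row["Symbol"].strip(),
            quantity=float(row["Quantity"]),
            cost_basis=float(row["Cost Basis"].replace("$", "").replace(",", "")),
        ))
    return out
```

After this step the records get enriched with market data and written to DynamoDB; everything downstream works with the typed model, never the raw CSV.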

The lesson here: don't let the ideal data pipeline block you from building the useful parts. CSV today, API tomorrow, if it turns out to matter enough.

Scoring Without Guessing

Graham's framework gives you principles: buy below intrinsic value, demand a margin of safety, avoid excessive debt, prefer earnings consistency. But turning those principles into a consistent scoring rubric requires specificity. What counts as "excessive" P/E in a rising rate environment? How do you score a biotech with no earnings?

I wrote out the rubric before I wrote any code. Every position scores on six dimensions. Each dimension has defined thresholds for A, B, C, D, and F grades. The overall portfolio gets a weighted composite score. The rubric is a markdown document that I can review, argue with, and update — not a black box inside a model.

This is spec-driven development applied to investment criteria. Before the code asked "how do I score this?", the spec answered "here is what the score means."
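Because the rubric lives in a document, the code that applies it stays mechanical. A sketch of one dimension plus the weighted composite — the P/E cutoffs below are placeholders, not the real rubric values:

```python
def grade_pe(pe: float) -> str:
    """Illustrative valuation grade for trailing P/E.
    The actual cutoffs live in the rubric document, not in code."""
    cutoffs = [(12, "A"), (15, "B"), (20, "C"), (25, "D")]
    for limit, letter in cutoffs:
        if pe <= limit:
            return letter
    return "F"

GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def composite(grades: dict[str, str], weights: dict[str, float]) -> float:
    """Weighted composite score on a 0-4 scale.
    `weights` should sum to 1.0 across the six dimensions."""
    return sum(GRADE_POINTS[g] * weights[dim] for dim, g in grades.items())
```

When the rubric document changes, the cutoff tables change with it; the logic doesn't.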

The speculative sleeve gets its own rubric — one that acknowledges that a biotech or a nuclear energy position isn't going to pass Graham's P/B screen. Speculative positions are scored on different criteria: catalyst timeline, conviction level, position sizing discipline.
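A rough sketch of that second rubric, under the same caveat — the point values and cutoffs are invented for illustration; only the criteria (catalyst timeline, conviction, sizing discipline) come from the actual rubric:

```python
def score_speculative(catalyst_months: int, conviction: int,
                      weight_pct: float) -> str:
    """Grade a speculative position on its own rubric.

    conviction: self-assessed, 1 (low) to 5 (high).
    weight_pct: position size as a percent of the total portfolio.
    All cutoffs below are placeholders.
    """
    points = 0
    points += 2 if catalyst_months <= 18 else 0            # near-term catalyst
    points += 2 if conviction >= 4 else 1 if conviction >= 3 else 0
    points += 2 if weight_pct <= 2.0 else 0                # sizing discipline
    return {6: "A", 5: "B", 4: "C", 3: "C", 2: "D"}.get(points, "F")
```

A pre-revenue biotech with a trial readout next year, high conviction, and a small position can grade well here even though it would fail every Graham screen in the core rubric.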

The LLM Prompt That Actually Works

Once the scoring engine runs, the LLM prompt is straightforward because the context is structured:

  • Current portfolio scores by position
  • Allocation drift from targets
  • Flagged breaches
  • Historical trend (last 30 days of scores, if available)

The prompt asks for a portfolio review in a specific format: executive summary, top three risks, top three opportunities, recommended actions with specific sizing guidance.

The key constraint: the LLM is told the scores and the flags. It is not asked to derive them. That's the deterministic layer's job. The model's job is to write a review a thoughtful investor would recognize as coherent — not to replace the investor's judgment.

Getting this boundary right took a few iterations. The first versions gave the LLM too much latitude. The reviews were plausible-sounding but untethered — recommendations that contradicted the scores, or generic observations about sectors the model had no position data on. Tightening the context and narrowing the output format made the results dramatically more useful.

What I Learned About AI Application Architecture

Three things stand out after building this:

Determinism before intelligence. Any decision with real consequences — financial, medical, legal — should flow through a rules engine first. The AI explains and narrates. It does not decide unilaterally. This isn't about distrust of models; it's about appropriate separation of concerns.

The spec is the product. Writing the scoring rubric, the alert thresholds, the recommendation format — that work was harder than the code and more valuable. If I had started with code, I would have a system that scores things in ways I can't explain or audit. Starting with the spec gave me a system I can read and reason about.

Structured context beats clever prompting. I wasted time early trying to write prompts that would guide the LLM to good conclusions from messy input. The real fix was upstream: clean, structured, scored context going into the prompt. Better inputs beat better prompts almost every time.

Where It's Going

The v1 system does what I built it to do: takes a position export, scores the portfolio, flags drift, and generates a review I can act on in under five minutes. That's the goal. Manual research that used to take forty minutes now takes five.

The roadmap includes a Fidelity API integration if one becomes accessible, automatic alerting when positions breach thresholds, and a dashboard to visualize portfolio trends over time. But v1 ships first. The speculative roadmap doesn't help if the core loop isn't proven.

I've embedded the FinanceStack into the same CDK infrastructure that runs this site. One cdk deploy and it's live. That's the infrastructure payoff for getting the CDK patterns right from the start — adding a new domain is straightforward once the foundation is solid.

If you're building something similar — a personal tool that applies structured investment criteria with an AI narrative layer — I'd be happy to trade notes.