Building an AI-Native Platform: A Retrospective

By Eric Caskey · June 26, 2026 · 13 min read

AI agents software-development AWS retrospective side-projects

Written for engineers about to start their own AI-native project. The concrete example behind it is a small fleet of finance and content web products built almost entirely by, and operated through, an AI coding agent over roughly a year. The lessons are written to transfer. If you are standing up something that continuously collects, scores, and reports on a fuzzy quality (code health, security posture, service maturity, content quality), most of what follows is aimed squarely at you.

TL;DR (the three things I would tell my past self)

1. The agent is the cheap part. Your judgment harness is the product. What made the difference was not the model writing code. It was the scaffolding around it: specs before code, reviewer gates before merge, a measurement discipline that assumed I was fooling myself, and a memory that survived between sessions. Build that harness first.
2. Measuring a fuzzy thing is the hard part, and it will lie to you. Most of the real failures were not bugs. They were measurement self-deception: scoring on too little data, mistaking an artifact for a signal, declaring an edge that did not survive costs. If your project is "collect and score a quality," this is your central risk, not a side concern.
3. Autonomy scales with the quality of your gates, not the quality of your model. Every increase in how much the agent did unattended was unlocked by adding a gate (a reviewer, a check, a spec conformance pass), never by trusting the model more.

The rest of this post expands those three, then gets specific about any system that collects, scores, and reports.

What I did, at a high level

A single operator ran a fleet of four small web products (finance analysis tooling, a couple of content sites, a self-modeling tool) plus several backing services: a decision/scoring engine, a few MCP servers exposing tools to external LLM clients, scheduled data collectors, and the AWS infrastructure under all of it. Essentially all code, specs, infra, and operations went through an AI coding agent.

The operating model that emerged:

Spec-first. A change started as a spec or ADR in a dedicated specs repo, not as code. Code referenced the spec it implemented.
A backlog the agent could drain. Work lived in a structured BACKLOG.md. The agent pulled one item at a time, implemented it in an isolated worktree, and opened a PR for it.
Reviewer agents as gates. Before a PR merged or a thing deployed, specialized read-only review agents checked it: does this match the spec, does this leak anything sensitive, is this infra change safe, does this trading signal actually have edge.
Persistent memory. A file-based memory with a loaded index let the agent carry hard-won facts and lessons across sessions that would otherwise reset to zero.
An operational-lessons ledger. Every production incident got appended as a numbered lesson, and those lessons became inputs to the reviewer gates.

That structure is the actual deliverable of the year. The features came and went. The harness compounded.

What went well

Context architecture, not documentation dumps. The thing that made everything else work was treating the agent's context as something to architect, not something to fill. The naive instinct is to dump every README, doc, and past decision into the prompt and trust the model to sort it out. That scales backwards: more text, worse signal, higher cost. What worked was the opposite: a small, curated, high-signal set of context the agent pulled from on demand, made up of a specs repo as the source of truth, a one-line-per-fact memory index, a numbered lessons ledger, and a single product map. Everything else in this section is an instance of this idea. If I had to name the single highest-order skill, it is this: deciding what the agent should see, in what shape, and what it should not.

Spec-first paid for itself immediately. Writing the spec first did two things. It forced the fuzzy idea to become concrete before any code existed, and it gave every later reviewer (human or agent) a fixed thing to check against. "Does this match the spec" is a tractable question. "Is this good" is not. The spec is what makes review automatable.

Reviewer agents are the highest-leverage thing I built. Read-only, single-purpose agents that render a go/no-go and never touch the code. A spec-conformance checker, a sensitive-content/leak auditor, an infra-diff reviewer, and for the finance side, a backtest auditor that checks a claimed edge against a checklist of self-deceptions. These caught real problems and, more importantly, they let me (or the orchestrating agent) trust the output enough to merge without re-reading every line. The pattern generalizes: for any property you care about, a narrow reviewer that only judges that property beats a generalist that judges everything.
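
The narrow-reviewer pattern can be sketched as a set of pure functions that each read a diff and return a verdict, with a merge gate that requires every verdict to be a go. This is a minimal illustration, not the reviewers described above: the `Verdict` type, the two secret patterns, and the "Spec: <id>" citation convention are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Verdict:
    reviewer: str
    go: bool
    reason: str


# A reviewer only reads the diff text and returns a verdict.
# It has no handle on the repository, so it cannot modify the change.
Reviewer = Callable[[str], Verdict]

# Hypothetical rule set for a leak auditor; a real one would be longer.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # shape of an AWS access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]


def leak_auditor(diff: str) -> Verdict:
    """Judge one property only: does the diff contain a known secret shape?"""
    for pattern in SECRET_PATTERNS:
        if pattern.search(diff):
            return Verdict("leak-auditor", False, f"matched {pattern.pattern}")
    return Verdict("leak-auditor", True, "no known secret pattern found")


def spec_reference_checker(diff: str) -> Verdict:
    """Judge one property only: does the change cite a spec (assumed 'Spec: <id>')?"""
    if re.search(r"Spec: \S+", diff):
        return Verdict("spec-reference", True, "spec reference present")
    return Verdict("spec-reference", False, "no 'Spec: <id>' reference in the change")


def merge_gate(diff: str, reviewers: List[Reviewer]) -> List[Verdict]:
    """Run every reviewer and return all verdicts, including the passing ones."""
    return [review(diff) for review in reviewers]


def may_merge(verdicts: List[Verdict]) -> bool:
    """A change merges only when every narrow reviewer said go."""
    return all(v.go for v in verdicts)
```

Because each reviewer judges a single property, adding a new property means adding one function to the list, without touching the others.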

Memory turned a goldfish into a colleague. The clearest instance of context architecture was persistent, indexed memory. Without it, every session re-learned the same gotchas (this deploy needs the prod env file, this config pair must move in lockstep, this metric is a known artifact not a bug). With it, those became one-line facts the agent recalled. How I got there matters more than the tool: not by saving everything, but by extracting only the non-obvious facts after each session (never code structure or git history, which are already durable), writing one fact per file, loading just a one-line index of them at startup, and enforcing a hard update-not-duplicate rule so the index stayed small. A memory that holds everything is just another documentation dump; the value was in what I chose to leave out.
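
The memory rules above (one fact per file, a one-line index loaded at startup, update rather than duplicate) can be sketched in a few lines. The file layout, the `INDEX.md` name, and the function names are hypothetical, chosen only to make the rules concrete.

```python
import re
from pathlib import Path


def slugify(topic: str) -> str:
    """Map a topic to a stable file name, so the same topic always hits the same file."""
    return re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")


def remember(memory_dir: Path, topic: str, summary: str, detail: str) -> Path:
    """Write one fact to one file and keep exactly one index line per topic."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    slug = slugify(topic)

    # One fact per file: rewriting the file replaces the old version of the fact.
    fact_path = memory_dir / f"{slug}.md"
    fact_path.write_text(f"# {topic}\n\n{detail}\n", encoding="utf-8")

    index_path = memory_dir / "INDEX.md"
    lines = index_path.read_text(encoding="utf-8").splitlines() if index_path.exists() else []
    prefix = f"- [{slug}] "
    # Update, don't duplicate: drop any existing index line for this topic first.
    lines = [line for line in lines if not line.startswith(prefix)]
    lines.append(prefix + summary)
    index_path.write_text("\n".join(sorted(lines)) + "\n", encoding="utf-8")
    return fact_path


def load_index(memory_dir: Path) -> str:
    """Only the index is loaded at startup; fact files are read on demand."""
    index_path = memory_dir / "INDEX.md"
    return index_path.read_text(encoding="utf-8") if index_path.exists() else ""
```

Recording the same topic twice leaves one file and one index line, which is what keeps the startup context small as lessons accumulate.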

Cost discipline through local mirrors. I blew through a CI minute budget once. The fix was a local CI mirror that ran the same validation the cloud would, so red checks were caught before spending a cent, and a standing mode of "test and deploy locally." If your project has a per-run cost (CI minutes, API tokens, compute), build the local-equivalent early. It changes how freely you can iterate.

Parallel fan-out for independent work. When tasks were genuinely independent, dispatching multiple agents at once (each in its own isolated worktree) was a real multiplier. The key word is independent.

What was difficult

Shared state was the recurring enemy. Almost every painful operational incident traced to two agents (or an agent and a scheduled job) touching the same checkout, the same backlog file, or the same spec at the same time. A worktree-cleanup sweep once deleted a live worktree out from under concurrent work. Loops racing on a shared backlog file produced conflicting edits. This is not an AI-specific problem. It is the same concurrency hazard large engineering teams already know well: merge conflicts, two people editing one config, a deploy stepping on another. The difference is only speed and volume: an agent fleet hits these collisions far faster than a team of humans does, so a problem you might paper over with a team of five becomes a daily event. The fix was the same one teams use, just enforced harder: more isolation. Build in a worktree off the remote's main, never the shared checkout; push before you open a PR; verify state on the remote, not on a possibly-stale local branch. If you run anything concurrently, design for isolation from day one. Retrofitting it hurts.

Some settings have to change together, and the agent kept changing only one. In a few places, two files had to move as a pair or production broke: an env var and the code that reads it, a frontend build setting and the deploy config that matches it. An agent works one file at a time, so by default it would fix one side and leave the other stale. The fix was to write these pairs down as a known list and add a check that fails if only one side changed. Lesson: list the settings that must move together and make "did both sides change?" an automatic check, because a one-file-at-a-time worker will break them otherwise.
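
A "did both sides change?" check can be as small as the sketch below. The file names in `LOCKSTEP_PAIRS` are hypothetical placeholders; the real list would name the project's own pairs.

```python
from typing import Iterable, List, Tuple

# Hypothetical pairs that must move together.
LOCKSTEP_PAIRS: List[Tuple[str, str]] = [
    (".env.example", "src/config.py"),
    ("frontend/build.config.json", "deploy/site.yaml"),
]


def lockstep_violations(
    changed: Iterable[str],
    pairs: List[Tuple[str, str]] = LOCKSTEP_PAIRS,
) -> List[str]:
    """Return one message per pair where exactly one side changed."""
    changed_set = set(changed)
    problems = []
    for left, right in pairs:
        # Both changed or neither changed is fine; exactly one is a violation.
        if (left in changed_set) != (right in changed_set):
            touched, stale = (left, right) if left in changed_set else (right, left)
            problems.append(f"{touched} changed but {stale} did not")
    return problems
```

In practice the `changed` list would come from something like the output of `git diff --name-only origin/main...HEAD`, and a non-empty result would fail the check.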

Deploy gotchas accumulated faster than they could be remembered. Empty env files baking into a build, a sync flag that clobbered files, stale caches, dirty trees deploying uncommitted work. None individually hard, collectively a minefield. This is what drove the pre-deploy reviewer and the lessons ledger. Operational knowledge is a real artifact. Write it down where the worker will see it, or you will relearn it in production.

Stale-branch confusion. Working repos sat on long-lived loop branches, so "this file doesn't exist" was sometimes wrong: the file existed on main. A standing rule emerged: before claiming something is absent, check the remote mainline, not your checkout. For an AI worker that confidently asserts, this class of confident-but-wrong claim is worth a specific guardrail.
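
One way to turn that rule into a guardrail is a helper that asks the remote mainline directly instead of the working tree. This is a sketch under assumptions: the remote is named `origin`, the mainline is `main`, and the injectable `run` parameter exists only so the logic can be exercised without a real repository.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_git(args: Sequence[str]) -> int:
    """Run a git command in the current directory and return its exit code."""
    return subprocess.run(["git", *args], capture_output=True).returncode


def exists_on_mainline(
    path: str,
    remote: str = "origin",
    branch: str = "main",
    run: Runner = run_git,
) -> bool:
    """True if `path` exists on the remote mainline, regardless of the local checkout."""
    # Refresh the remote-tracking ref so the answer is not based on stale data.
    if run(["fetch", "--quiet", remote, branch]) != 0:
        raise RuntimeError(f"could not fetch {remote}/{branch}; refusing to guess")
    # `git cat-file -e <rev>:<path>` exits 0 only if that object exists.
    return run(["cat-file", "-e", f"{remote}/{branch}:{path}"]) == 0
```

Raising on a failed fetch is deliberate: an unreachable remote should produce "unknown", not a confident "absent".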

Failures (the honest section)

The flagship measurement spent a long time measuring nothing. The finance engine's job is to rank stocks, and I kept checking whether its ranking actually predicted returns. For a long time the answer was "we can't tell," but the numbers looked like real answers. At one point a strongly negative score appeared and I briefly took it seriously, as if the engine were a good contrarian signal. It was not. I was only scoring 24 to 30 stocks, and that is far too few to tell skill from luck, so the number was just noise dressed up as a finding. The fix was not to change the model. It was to admit the test pool was too small to judge the ranker at all, and to refuse to report any score until enough names were in the pool to make the result mean something. This is the most important warning for any scoring project: you can build a polished scoring pipeline that outputs confident numbers that mean nothing, and nothing about the output will warn you. The failure is completely silent.
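
To see why 24 to 30 names is too few, assume for illustration that the score is a rank correlation between the engine's ranking and later returns (the post does not say which statistic was used). With no real skill, Spearman's rho has variance 1 / (n - 1), so a normal approximation gives the band below. The `min_names` threshold of 200 is a hypothetical value, not the one the project settled on.

```python
import math
from typing import Optional


def noise_band(n: int, z: float = 1.96) -> float:
    """Approximate 95% band for a rank correlation when there is no real skill.

    Under the null of no relationship, Spearman's rho has variance 1 / (n - 1),
    so the normal approximation gives a band of z / sqrt(n - 1).
    """
    if n < 2:
        raise ValueError("need at least two names")
    return z / math.sqrt(n - 1)


def gated_score(rho: float, n: int, min_names: int = 200) -> Optional[float]:
    """Return the score only if the pool is large enough; None means "can't tell"."""
    if n < min_names:
        return None
    return rho


if __name__ == "__main__":
    for n in (24, 30, 200):
        print(n, round(noise_band(n), 3))
```

Under these assumptions the band is roughly ±0.41 at 24 names and ±0.36 at 30, so even a strongly negative reading of that size is indistinguishable from luck. Returning `None` rather than a number forces every caller to handle the "can't tell" case explicitly.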

An edge that vanished net of costs. A signal looked good gross and failed once realistic costs were applied (a deflated performance ratio of essentially zero). Gross-of-cost evaluation is one of the classic self-deceptions. If your "score" drives any action that has a cost, evaluate net of that cost.
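
A small sketch of the gross-versus-net comparison, assuming a simple linear cost model where each period's cost is turnover times a flat rate. The cost model and function names are illustrative assumptions; this is not the deflated performance ratio mentioned above, only the subtraction that has to happen before any such ratio is computed.

```python
from statistics import mean, stdev
from typing import List, Sequence


def net_returns(
    gross: Sequence[float], turnover: Sequence[float], cost_rate: float
) -> List[float]:
    """Subtract trading cost from each period's gross return.

    cost_rate is the assumed cost per unit of turnover (0.001 means 10 bps).
    """
    if len(gross) != len(turnover):
        raise ValueError("gross and turnover must have the same length")
    return [g - t * cost_rate for g, t in zip(gross, turnover)]


def mean_over_std(returns: Sequence[float]) -> float:
    """Per-period mean divided by standard deviation (unannualized, Sharpe-style)."""
    return mean(returns) / stdev(returns)


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    gross = [0.004, -0.002, 0.003, 0.001]
    turnover = [1.5, 1.5, 1.5, 1.5]
    net = net_returns(gross, turnover, cost_rate=0.001)
    print("gross ratio:", mean_over_std(gross))
    print("net ratio:  ", mean_over_std(net))
```

With these toy inputs the gross series has a positive average return, while a 15 bps cost per period removes essentially all of it.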

A scope violation that had to be reverted. An attempt to wire richer context into the scoring path crossed a data boundary it was never supposed to cross, and the whole thing was reverted to baseline. The lesson that survived: draw your data-governance boundaries explicitly and enforce them, because the agent optimizing for "better answer" will happily pull in data it should not touch. I kept an impersonal-scoring stream and a personal-data stream strictly separate, and any drift across that line was a hard revert, not a discussion.

Cost overruns from automation I did not meter. Two of these. The first was real: an automated polling loop quietly burned through a month's CI-minute budget, and once the cap hit, every deploy started failing in seconds at setup, an outage caused entirely by spend, not by code. The second was a near-miss on a paid external data API. Scheduled collectors and a research path were calling metered third-party APIs (market data, an LLM research provider) on a loop, and the per-call cost was small enough to be invisible per run but added up fast across an unattended loop running all day. I caught it before it became a real bill and put a hard budget guard on the research path (a spend ceiling that stops the loop), but the lesson is the pattern, not the dollar amount: automation without a budget guard is a way to spend money in your sleep. Any loop that touches a metered resource (CI minutes, API tokens, compute) gets a meter and a hard ceiling before it runs unattended.
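
A budget guard of the kind described can be sketched as a small object that is charged before every metered call and stops the loop at the ceiling. The class and function names are hypothetical, and the per-call cost is assumed to be known in advance.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the ceiling."""


class BudgetGuard:
    """Tracks spend against a hard ceiling and refuses charges that would cross it."""

    def __init__(self, ceiling: float) -> None:
        self.ceiling = ceiling
        self.spent = 0.0

    def charge(self, cost: float) -> None:
        if self.spent + cost > self.ceiling:
            raise BudgetExceeded(
                f"spend {self.spent + cost:.2f} would exceed ceiling {self.ceiling:.2f}"
            )
        self.spent += cost


def run_metered(
    tasks: Iterable[T],
    cost_per_call: float,
    guard: BudgetGuard,
    call: Callable[[T], R],
) -> List[R]:
    """Run tasks until they are done or the budget guard stops the loop."""
    results: List[R] = []
    for task in tasks:
        try:
            # Charge before the call, so the ceiling is never crossed.
            guard.charge(cost_per_call)
        except BudgetExceeded:
            break
        results.append(call(task))
    return results
```

For example, with a ceiling of 1.00 and a cost of 0.25 per call, a loop over ten tasks completes four calls and then stops.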

Lessons for an AI-native project (the transferable core)

1. Build the judgment harness before the features. Specs, gates, memory, lessons ledger. The model improves on its own schedule; your harness is the only part you control, and it is what determines whether you can trust output enough to move fast.

2. Architect the context first; keep the agents thin. There is a popular argument that engineers fixate on building clever agents and underinvest in the context architecture of their project. My experience is strong evidence for it. Almost every agent that earned its place was thin: a read-only reviewer that compares one thing against one source of truth and returns go/no-go. The spec-conformance reviewer is only as good as the spec it reads. The leak auditor is only as good as its rule set. The backtest auditor is only as good as its checklist of self-deceptions. None of them is clever; each is valuable because the context beneath it is well built. So yes, build specific agents for specific jobs, with directed tools and tight prompts, but get the ordering right: the leverage lives in the context, and the agent is a thin, swappable layer on top. The tell that you have it backwards is finding yourself making the agent smarter to compensate for vague context. I did not need a fleet of autonomous generalists. I needed sharp context and a narrow agent per property, and with that in place the agents almost wrote themselves. Treat your project as a context-architecture problem that happens to use agents, not an agent-building problem that happens to need context.

3. Make every important property a narrow, automatable check. "Matches spec," "leaks nothing," "infra-diff is safe," "edge survives the seven self-deceptions." Each is a single-purpose reviewer. Generalist "is this good?" review does not scale and does not compose.

4. Treat measurement as the adversary. Assume your metric is fooling you until it survives a checklist: enough effective sample, no look-ahead, right universe, real (not naive) significance, no forking-paths cherry-pick, net of cost, out-of-sample. Encode that checklist as a reviewer that gates any claim of "this works."

5. Gate autonomy, do not grant it. Decide explicitly which actions the agent may take unattended (open a PR, merge a green PR) and which always require a human (anything that spends new money, crosses a data boundary, or is hard to reverse). Every expansion of autonomy should come with a new gate, not just more trust.

6. Isolate concurrent work physically. Separate worktrees, push-before-PR, verify on the remote. Shared mutable state between agents is where the worst incidents live.

7. Persist what was non-obvious, not what the repo already records. Memory should hold the gotchas, the "this metric is a known artifact," the lockstep pairs. Not code structure or git history, which are already durable. One fact per entry, an index, update-don't-duplicate.

8. Institutionalize incidents. A numbered lessons ledger that feeds your reviewer gates turns each production scar into a permanent check. This is how the system gets safer over time instead of repeating itself.

9. Meter your automation. Any loop that spends money or compute gets a budget guard. Build the local-equivalent of any costly cloud step so iteration is free.
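
Lesson 5 can be made concrete as an explicit, default-deny policy. The sketch below assumes each action arrives with three risk flags already determined; the action names and the `decide` function are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    UNATTENDED = "unattended"
    NEEDS_HUMAN = "needs_human"


# Explicit allowlist: anything not named here requires a human.
UNATTENDED_ACTIONS = {"open_pr", "merge_green_pr"}


def decide(
    action: str,
    spends_new_money: bool = False,
    crosses_data_boundary: bool = False,
    hard_to_reverse: bool = False,
) -> Decision:
    """Gate an action: risk flags always win, then the allowlist, then default deny."""
    if spends_new_money or crosses_data_boundary or hard_to_reverse:
        return Decision.NEEDS_HUMAN
    if action in UNATTENDED_ACTIONS:
        return Decision.UNATTENDED
    return Decision.NEEDS_HUMAN
```

Expanding autonomy then means adding a name to the allowlist in a reviewed change, alongside the gate that justifies it, not loosening the default.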

Specifically for a collect / score / report system

A lot of useful projects share one shape: ingest signals about many entities, score each entity on a fuzzy quality, and report. A code-health dashboard, a security-posture tracker, a service-maturity scorecard, a content-quality grader. Structurally that is the same shape as the finance engine, which means my failures are your roadmap of what to avoid.

Spec the scoring rubric before you collect anything. What does "good" mean, concretely and checkably, per signal? If the rubric is fuzzy, every downstream number is fuzzy. Write it as a spec, version it, and let the rubric itself be reviewable.
Beware the degenerate-sample trap, hard. My worst failure was scoring confidently off a sample too small to mean anything. For you this shows up as: scoring an entity on two or three signals and presenting a crisp grade. Gate any score behind a minimum coverage threshold, and surface "insufficient data to score" as a first-class, visible state. A loud "we can't tell yet" is infinitely better than a confident wrong grade, because the wrong grade is silent and people will act on it.
Separate the collector from the scorer from the reporter. Three stages, three concerns. The collector just gathers facts. The scorer applies the rubric. The reporter presents. Keeping them separate let me swap and audit each independently, and it is what made "the score is wrong" debuggable (was it bad collection or bad scoring?).
Classify, then drain in tiers. A pattern that worked well: an audit produced a classified backlog (safe-to-auto-fix vs. needs-human-review), and the safe tier could be drained automatically while the judgment tier became proposals for a human. For remediation this maps directly: a mechanical, low-risk fix might be auto-fixable; a judgment call is a human conversation. Classify findings by how safe the fix is, and only automate the safe tier.
Net-of-cost thinking applies to remediation too. A finding is only worth surfacing if acting on it is worth more than the noise it adds. Rank findings by impact, or you train people to ignore the report.
Data-governance boundaries up front. When you collect across many sources, decide early what the collector may read and what it may never read or expose, and enforce it as a hard gate. My one scope violation taught me that the system will cross a boundary in pursuit of a "better" answer unless the boundary is enforced, not just documented.
A lessons ledger for collection quirks. Every source is configured slightly differently, and you will discover per-source gotchas constantly (a nonstandard location for a signal, a signal that is absent for a legitimate reason rather than a real gap). Persist those as facts the collector recalls, or you will re-flag the same false positives forever and erode trust in the report.
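
Several of the points above (separate stages, a versionable rubric, a coverage threshold, and "insufficient data" as a visible state) fit in one small sketch. The rubric, its integer weights, the signal names, and the 0.67 threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical rubric: signal name -> integer weight.
RUBRIC: Dict[str, int] = {"has_tests": 4, "has_ci": 3, "has_owner": 3}
MIN_COVERAGE = 0.67  # assumed threshold; below it no grade is issued


@dataclass(frozen=True)
class Score:
    entity: str
    value: Optional[float]  # None means "insufficient data to score"
    coverage: float


def collect(entity: str, raw: Dict[str, Dict[str, bool]]) -> Dict[str, bool]:
    """Collector: return whatever facts exist for the entity; never judge them."""
    return dict(raw.get(entity, {}))


def score(entity: str, signals: Dict[str, bool]) -> Score:
    """Scorer: apply the rubric, but refuse to grade below the coverage threshold."""
    known = {name: ok for name, ok in signals.items() if name in RUBRIC}
    # A missing signal lowers coverage; a collected-but-False signal lowers the grade.
    covered_weight = sum(RUBRIC[name] for name in known)
    coverage = covered_weight / sum(RUBRIC.values())
    if coverage < MIN_COVERAGE:
        return Score(entity, None, coverage)
    earned = sum(RUBRIC[name] for name, ok in known.items() if ok)
    return Score(entity, earned / covered_weight, coverage)


def report(s: Score) -> str:
    """Reporter: present the score, making the no-grade state loud."""
    if s.value is None:
        return f"{s.entity}: INSUFFICIENT DATA (coverage {s.coverage:.0%})"
    return f"{s.entity}: {s.value:.0%} (coverage {s.coverage:.0%})"
```

Keeping "not collected" distinct from "collected and negative" is what lets a wrong score be traced to the collector or to the scorer.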

Recap

If you take three things into your project: build the judgment harness (specs, narrow reviewer gates, persistent memory, a lessons ledger) before you chase features; treat your central measurement as something actively trying to fool you and gate every "it works" claim behind a self-deception checklist, with insufficient-data as a loud first-class state; and expand autonomy only by adding gates, never by extending trust. The model is the cheap, improving part. Everything around it is the part you build, and it is what makes an AI-native project trustworthy enough to actually move fast.

Written by Eric Caskey. I build AI tools you can actually use. Explore the Tools or see the case studies.