Building an AI Marathon Coach: Deterministic Rules, LLM Narratives, and the 2026 NYC Marathon

I'm running the 2026 TCS New York City Marathon. The training cycle starts now. And I'm building the coaching tool as I go.

This is not an AI experiment in the abstract. It's a real system, for a real race, trained on my actual runs, with my actual injury history — including a calf issue that has derailed training blocks before. If the system makes bad recommendations, I show up to the start line underprepared or injured. That sharpens the design criteria considerably.

Here's what I built, how it's architected, and what the design decisions reveal about building AI systems where the stakes are personal.

The Problem With Generic Training Plans

Standard marathon training plans are reasonable starting points. Eighteen weeks, periodized volume, taper protocol. The problem is they don't know anything about you. They don't know that you ran sixty miles last week on strong legs. They don't know that you had two poor nights of sleep and your resting heart rate is elevated. They don't know your calf is talking to you.

Good human coaches know these things. They adjust. They read signals. They make calls that generic plans can't make.

AI is a plausible substitute for that coaching intelligence — but only if it's given the right data and the right constraints. An LLM alone won't do it. You need structure.

Architecture: Two Layers, One Pipeline

The system has two distinct layers that must not be confused:

Layer 1: Deterministic rules engine. This layer ingests training data — runs from Garmin FIT file exports, manual recovery signals (resting HR, sleep quality, soreness, pain flags) — and computes objective training metrics: Acute:Chronic Workload Ratio (ACWR), weekly volume, fatigue indicators, trend. It then evaluates fourteen defined guardrails. These guardrails produce a typed recommendation: PROCEED, REDUCE_INTENSITY, REDUCE_VOLUME, CROSS_TRAIN_OR_REST, FULL_REST, RECOVERY_RUN, or NEEDS_MORE_DATA.

The rules are hard. If a pain flag is set, the system returns FULL_REST — regardless of any other metric. If ACWR exceeds 1.8, the system returns CROSS_TRAIN_OR_REST or FULL_REST. If there are fewer than three activities in the past fourteen days, the system returns NEEDS_MORE_DATA rather than producing an unreliable estimate.
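The shape of this layer can be sketched in a few lines. This is illustrative, not the actual implementation: the post only specifies three of the fourteen guardrails (pain flag, ACWR > 1.8, fewer than three activities in fourteen days), so the 1.3 REDUCE_VOLUME threshold and the function signatures here are my assumptions.

```python
from enum import Enum

class Recommendation(Enum):
    PROCEED = "PROCEED"
    REDUCE_INTENSITY = "REDUCE_INTENSITY"
    REDUCE_VOLUME = "REDUCE_VOLUME"
    CROSS_TRAIN_OR_REST = "CROSS_TRAIN_OR_REST"
    FULL_REST = "FULL_REST"
    RECOVERY_RUN = "RECOVERY_RUN"
    NEEDS_MORE_DATA = "NEEDS_MORE_DATA"

def acwr(daily_loads: list[float]) -> float:
    """Acute:Chronic Workload Ratio: 7-day average load over 28-day average load."""
    acute = sum(daily_loads[-7:]) / 7
    chronic = sum(daily_loads[-28:]) / 28
    return acute / chronic if chronic else 0.0

def evaluate(pain_flag: bool, ratio: float, recent_activity_count: int) -> Recommendation:
    """Guardrails evaluated in priority order; pain always wins."""
    if pain_flag:
        return Recommendation.FULL_REST
    if recent_activity_count < 3:
        return Recommendation.NEEDS_MORE_DATA
    if ratio > 1.8:
        return Recommendation.CROSS_TRAIN_OR_REST
    if ratio > 1.3:  # hypothetical intermediate threshold, not from the post
        return Recommendation.REDUCE_VOLUME
    return Recommendation.PROCEED
```

Because the output is a plain enum from a pure function, every guardrail is unit-testable in isolation — which is the whole point of keeping this layer out of the LLM.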

Layer 2: LLM narrative engine. Once the rules engine has produced a typed recommendation and the guardrail outcomes, the LLM receives that structured output and writes a coaching narrative. Two to four sentences. Plain language. Actionable. Grounded in the data the rules engine processed.

The LLM does not choose the recommendation type. That is the rules engine's job. The LLM explains the recommendation in human terms.
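One way to enforce that boundary is in the prompt itself: the recommendation arrives as a fixed input the model is told not to alter. A minimal sketch, with the function name and prompt wording being my own assumptions rather than the system's actual prompt:

```python
def build_narrative_prompt(recommendation: str, guardrail_outcomes: dict, metrics: dict) -> str:
    """The recommendation is already decided upstream; the LLM only explains it."""
    fired = [name for name, outcome in guardrail_outcomes.items() if outcome]
    return (
        "You are a running coach. The training system has already decided "
        f"today's recommendation: {recommendation}. Do not change it.\n"
        f"Guardrails that fired: {', '.join(fired) or 'none'}.\n"
        f"Computed metrics: {metrics}.\n"
        "Write 2-4 plain-language, actionable sentences explaining this recommendation."
    )
```

Even if the model ignores the instruction, the worst case is a confusing narrative — the typed recommendation the system acts on was never in the model's hands.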

This separation is the load-bearing design decision. The boundary between what the LLM can do and what it must not do is explicit and enforced.

Why the LLM Must Not Decide Safety-Critical Outcomes

There's a category error that shows up in a lot of AI system designs: giving the LLM responsibility for decisions that require consistent, auditable reasoning and then hoping the model stays calibrated.

For a coaching system, that error is especially dangerous. Training load management is not a soft opinion domain. Overtraining injuries are real. Stress fractures are real. A recommendation to "push through" on the wrong day can end a training cycle.

Language models are not deterministic. They can reason well about training principles in general, but they will not reliably apply the same rules to the same inputs across sessions. They can be influenced by how context is framed. They can hallucinate confidence about edge cases.

Deterministic code has none of these problems. A rule that says "if pain flag is set, return FULL_REST" executes identically every time. You can test it. You can audit it. You can explain exactly why the system made the call it made.

The LLM's value is in the narrative: taking the correct, deterministic recommendation and explaining it in a way that a human athlete actually finds useful. That's a real capability — synthesizing multiple signals into a clear, encouraging, contextually aware explanation. That's what the model does.

Garmin FIT Files and the Unglamorous Data Layer

The first engineering problem is getting run data in. Garmin's full API requires developer program approval and user OAuth. For a personal MVP, that's a month of setup for one athlete. The practical choice is FIT file exports — Garmin lets you export any activity as a .fit file from Garmin Connect.

FIT files contain everything: GPS track, lap splits, cadence, heart rate, elevation. The fitparse Python library handles parsing. The Lambda reads the uploaded file, extracts the relevant training metrics, and writes an activity record to DynamoDB.
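The parsing step looks roughly like this. The fitparse calls (`FitFile`, `get_messages`, field `name`/`value` iteration) are the library's real API; the aggregation into a summary dict is a simplified stand-in for whatever the Lambda actually stores:

```python
# fitparse parses Garmin's binary .fit format (pip install fitparse).
try:
    from fitparse import FitFile
except ImportError:  # keep the pure aggregation usable without the library
    FitFile = None

def summarize_records(records: list[dict]) -> dict:
    """Aggregate per-sample 'record' messages into summary metrics."""
    hrs = [r["heart_rate"] for r in records if r.get("heart_rate") is not None]
    return {
        "avg_hr": sum(hrs) / len(hrs) if hrs else None,
        "max_hr": max(hrs) if hrs else None,
    }

def parse_fit(path: str) -> dict:
    """Read a .fit export and produce the activity summary for DynamoDB."""
    fit = FitFile(path)
    records = [{f.name: f.value for f in msg} for msg in fit.get_messages("record")]
    return summarize_records(records)
```

Splitting the file I/O from the pure aggregation keeps the metric logic testable without binary fixtures.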

Manual recovery signals — resting HR, sleep quality, soreness level, pain flags — are entered via an API endpoint. No automated wearable integration in v1. That's a deliberate constraint: I want the data ingestion path to be reliable and understood before I add the complexity of real-time API sync.

Garmin Connect API integration is on the roadmap for post-MVP. But MVP ships with file-based ingestion, and the system is useful immediately.

Designing for My Own Injury History

One of the more interesting design challenges: the system needs to know about my calf injury history and apply calf-specific caution in its guardrails.

In a generic coaching system, this is handled by user profile fields that parameterize the rules. My system does this, but it also surfaces calf-specific guardrail outputs in the narrative context — so when the LLM writes the coaching explanation, it has explicit signal to reference: "your recent history with the left calf warrants extra caution here."
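Concretely, a known injury site can simply get a stricter threshold than the generic default, and crossing it emits a note the narrative layer can quote. The thresholds, field names, and 0–5 soreness scale below are illustrative assumptions, not the system's actual values:

```python
from dataclasses import dataclass, field

@dataclass
class AthleteProfile:
    injury_sites: list[str] = field(default_factory=list)  # e.g. ["left_calf"]
    soreness_caution_threshold: int = 3  # generic default on a 0-5 scale (assumed)
    injury_site_threshold: int = 2       # stricter limit for known injury sites (assumed)

def soreness_guardrail(profile: AthleteProfile, soreness: dict[str, int]) -> list[str]:
    """Return narrative-context notes for any body site crossing its threshold."""
    notes = []
    for site, level in soreness.items():
        known = site in profile.injury_sites
        limit = profile.injury_site_threshold if known else profile.soreness_caution_threshold
        if level >= limit:
            notes.append(
                f"{site} soreness {level} >= threshold {limit}"
                + (" (known injury site)" if known else "")
            )
    return notes
```

The same soreness level that passes silently for a healthy quad produces an explicit caution note for the calf — and that note is exactly the signal the LLM gets to reference.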

This is a case where personalization isn't a nice-to-have. It's the point. A coaching system that doesn't know your injury history isn't a coach — it's a generic training calculator. The personal context is what makes the LLM output useful rather than generic.

The Audit Trail

Every recommendation pipeline run writes an audit record: inputs (activities, recovery signals, computed metrics), guardrail outcomes, final recommendation type, LLM prompt sent, LLM response received, timestamp.
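As a sketch, the record is a flat structure written once per pipeline run. The field names mirror the list above; the `table` parameter stands in for a boto3 DynamoDB `Table` resource (anything exposing `put_item` works), and the exact schema is my assumption:

```python
from dataclasses import dataclass, asdict

@dataclass
class AuditRecord:
    run_id: str
    timestamp: float
    inputs: dict              # activities, recovery signals, computed metrics
    guardrail_outcomes: dict  # which guardrails fired, and why
    recommendation: str       # the typed recommendation the rules engine produced
    llm_prompt: str
    llm_response: str

def write_audit(record: AuditRecord, table) -> None:
    """Persist one immutable audit record per pipeline run."""
    table.put_item(Item=asdict(record))
```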

This isn't optional infrastructure. It's foundational.

When the system tells me to rest and I disagree, I want to be able to look at exactly what data it had, exactly which guardrails fired, and exactly why. Transparency isn't just a nice property for AI systems. For any system that influences decisions you care about, the ability to inspect the reasoning is a requirement.

The audit records also let me evaluate recommendation quality over time. I journal my independent assessment alongside each recommendation. After four weeks of training, I can see whether the system's calls match what I would have chosen. That feedback loop is how you tune the guardrails.

Shipping It While Training

The interesting meta-challenge: I'm building this system while the training cycle I'm building it for has already started. That creates a useful constraint. I can't spend six months on perfect architecture and ship after the race. I need something useful now, even if it's incomplete.

This is a productive way to scope an MVP. What does the system need to do to be useful for this week's training? That's the scope. Everything else is v2.

For week one: ingest FIT files, compute ACWR, surface recovery signals, produce a typed recommendation with a narrative. That's it. The dashboard, the trend visualization, the automatic Garmin sync — those can wait.

The CoachStack is live in the same CDK infrastructure as the finance app and this site. The backend Lambda is deployed. The API is wired. The data is flowing.

Now I just have to train.

What Building This Teaches About AI System Design

Three generalizations that apply well beyond marathon coaching:

Separate what AI must not decide from what it should explain. Safety-critical, consistency-critical, auditability-critical decisions belong in deterministic code. The LLM earns its place in the narrative and synthesis layer, not in the judgment layer.

Personalization is architecture, not configuration. Building a system that adapts meaningfully to a specific person requires you to model that person's context explicitly — injury history, training goals, constraints. You can't parameterize your way to a genuinely personal system with a few user settings.

Audit everything. If you can't explain why the system made the recommendation it made, you can't trust it, evaluate it, or improve it. An audit trail isn't overhead — it's the foundation for everything useful you'll want to do with the system after it's built.

The 2026 NYC Marathon is in November. I'll be sharing updates on how the coaching system performs across the training cycle — where it nails the calls, where it misses, and how the architecture evolves.

If you're building something similar — personal coaching tools, adaptive recommendation systems, anything where AI-generated advice has real consequences — I'm genuinely interested in comparing notes.