← Back to Blog

Prompt caching is a prefix match, not a flag

Prompt caching lets a model skip work it has already done. When two requests begin with the same text, the model can store its processing of that shared opening and reuse it on the next request, charging about a tenth of the normal price for the reused tokens. (Tokens are the chunks of text a model reads and is billed by, roughly a few characters each.)

This matters more than a cheaper bill. The amount of text a model can take in one request is capped, so as a system grows that input becomes a scarce resource, and caching is one of the few levers on it. Yet the common pattern is to turn caching on, assume it works, and never notice when it silently does not, because it only works when the prompt is built a particular way and nothing tells you when you have built it wrong.

I applied it across four parts of my own system. Two needed real work, one was already handled for me, and one was not my decision to make. Below is where it paid, where it quietly did nothing, and the token counts I measured to confirm it.

The whole idea in one picture#

Every request to a model is one block of text. Caching cuts that block at a point you choose, called a breakpoint, and stores everything before it. The stored part is the prefix. On the next request, if the prefix is identical down to the byte, the model reuses the stored work instead of reprocessing the text, and charges roughly a tenth of the price for those tokens.

ONE REQUEST
┌────────────────────────────────────────────┬──────────────────┐
│  PREFIX  (identical on every request)        │  the part that   │
│  instructions + the data you are asking about │  changes: this   │
│                                              │  turn's question │
└────────────────────────────────────────────┴──────────────────┘
        ▲ stored, reused at ~1/10 the price        ▲ never stored,
                          ▲ breakpoint goes here      full price

So the design rule is simple to state: put the text that stays the same at the front, put the text that changes at the back, and place the breakpoint between them. The four parts of my system below are four versions of that same problem, and they did not all have the same answer.

The cache has three rules#

Turning caching on does nothing by itself. The prefix has to meet three conditions, and only the first one complains when it is not met.

The rule What it means Break it and
It matches bytes, not meaning the stored prefix is found by comparing exact characters from the start of the request to the breakpoint; a reworded or reordered prefix is a different prefix one changed byte before the breakpoint, anywhere, and the model reprocesses the whole thing at full price
The prefix has to be big enough caching only engages above a minimum size, about 2,048 tokens on Claude Sonnet and 4,096 on Claude Opus a breakpoint on a shorter prompt is ignored, and no error is raised
The prefix has to come back soon storing a prefix costs about 25% more than a normal read, reusing it costs about 90% less, and a stored prefix is dropped after five minutes a prefix you store but never reuse in time costs more than not caching at all

The first rule fails visibly: your bill simply does not drop. The other two fail invisibly, by storing nothing or by quietly costing more, and neither raises an error. The only way to know they are working is to read the token counts the API returns on every call. That is why every number below is measured rather than assumed.

Layer one: the agent, already handled#

The first layer is Claude Code, the coding agent I work in all day. It already caches its own system prompt and tool definitions, and it trims the conversation as it grows, so there was nothing for me to build.

The one part I control is its memory. Claude Code loads a Markdown file at the start of each session where I keep durable facts about my projects, so I do not re-explain them every time. Mine is an index file that points to a set of small single-fact notes. Because that file is part of the prefix the agent already caches, the right move was not to cache it again but to keep it short: a bloated memory file just makes every cached prefix larger for no benefit. The lesson at this layer is to check whether the platform already caches for you before you do anything yourself. Here, it did.

Layer two: my own backend calls, the real work#

The second layer is my backend calling the model directly. It has several call sites: a chat assistant that answers questions about a finished portfolio review, a widget that answers questions about a blog post, and a few one-shot text generators (single request in, single response out, no back-and-forth). I checked each one against the three rules.

The size rule ruled out most of them. A one-shot generator's instructions run a few hundred tokens, well under the 2,048 minimum, so a breakpoint there does nothing. The only call site that qualified was the chat assistant, because it resends the entire prior review on every turn of the conversation. That review is large, identical from turn to turn, and reused many times, which is exactly the shape caching rewards.

finance chat, every turn:
┌──────────────────────────────────────────┬─────────────────────┐
│ instructions + the entire prior review     │ this turn's question │
│ ≈ 5,231 tokens, stored after turn 1         │ ~60 tokens, full price│
└──────────────────────────────────────────┴─────────────────────┘

I turned caching on per call site rather than globally in the shared client, because of the third rule. A call site that is over the size floor but never reuses its prefix would pay the 25% storage premium and never earn it back. Caching everything by default would quietly tax every one-shot call to benefit the few that repeat.

The cached prefix contains the review encoded as JSON, and the database that stores the review does not guarantee a consistent field order when it is read back. Encoded without a fixed order, the JSON came out as a slightly different sequence of characters on every turn, which under the first rule is a different prefix every time. The cache would have matched nothing while the code looked entirely correct. The fix was to sort the JSON fields into a fixed order so the bytes are identical on every turn.

On the same call site I added a second, separate control. Caching protects the front of the request; it does nothing about the back, which grows as the conversation lengthens. Left unbounded, a long chat history eventually fills the model's input limit, the context window, and raises the cost of every turn. So the assistant now keeps only the last 25 turns of history: enough that no real conversation is cut short, capped so a runaway session cannot grow without limit.

Measured against a typical review, the cached prefix is 5,231 tokens. On the first turn the model stores those tokens. On the second turn it reuses all 5,231 and charges for 93 new input tokens instead of about 5,300. Every turn after the first costs roughly a tenth of what it otherwise would.

Layer three: retrieval, a reason not to build a vector store#

The third layer answers plain-language questions over the archive of past reviews. It loads the relevant reviews, trims them, and sends the whole set to the model. The trimmed archive is the large, stable part, so it is what should be cached. The problem was ordering: the code put the question first and the archive second, and caching only ever stores a prefix, never a suffix. Once the archive moved to the front, a second question over the same archive reused 2,541 stored tokens instead of paying for them again.

The more useful decision was what I chose not to build. The standard approach to retrieval is to convert documents into numeric vectors and store them in a vector database, so you can fetch only the few passages most similar to a question. But my archives still fit inside the model's context window, and at that size, sending the whole archive and caching it is more accurate, faster, and simpler than a vector database with its own moving parts. So instead of building that pipeline I wrote down an order of escalation: cache the whole archive now; add a keyword filter if it grows; reach for a vector database only when an archive genuinely stops fitting in the context window. A size gauge in the logs will tell me when that day arrives. It has not.

Layer four: a data server, not my decision#

The fourth layer is a private server that sends my own portfolio data to a separate model on request, over the Model Context Protocol (MCP, the open standard for exposing tools and data to a model). At this kind of server-to-model boundary, caching is the receiving side's decision, not the sending server's, so there was nothing for me to cache. What the server does control is how much data it sends back. So the useful lever was not caching but response size: cap the one field that could grow without limit, stop sending a duplicate copy of large results, and log the byte size of every response so an oversized one shows up. The goal is the same as the other three layers, protecting the model's context window, but the tool is different because the layer is different.

What it saved#

Surface Cached prefix New input charged, turn two Input it would have charged without caching
Finance chat 5,231 tokens 93 tokens ~5,300 tokens
Corpus retrieval 2,541 tokens a few dozen ~2,600 tokens

The finance row in plain terms: a follow-up question that would have been charged about 5,300 input tokens at full price is instead charged 93 at full price, with the other 5,231 reused from storage at a tenth of the rate. That is roughly ninety percent off the input cost of every turn after the first. Both prefixes are above the 2,048-token floor, which is the one fact that determined whether any of this did anything.

Two honest caveats. First, volume. This is a low-traffic system, so the actual dollars saved are small, and I am not going to dress a quiet personal project up as a serious cost cut. The savings are real and would scale with traffic, but the traffic is not there yet.

Second, and more important: I measured those reused prefixes with no delay between turns, one request sent the instant the previous one returned. A stored prefix lasts five minutes. In real use a person reads the answer, thinks, and then types the next question, and if that gap runs past five minutes the stored prefix is already gone and the next turn pays full price again. So my measurement proves the mechanism works, not that it works under real human timing. The token counter that will answer that is now running in the production logs. If the reuse is not happening, the fix is a longer storage window, which is a setting, not a redesign. I would rather read that number than guess it.

What held across all four#

Prompt caching is not a flag you flip to make calls cheaper. It is one mechanism, the reuse of a stored prompt prefix, governed by three rules: it matches exact bytes, the matched part has to clear a size floor, and it has to recur within five minutes. Where you sit in the system decides which rule matters. In the agent I use, the work was already done. In my own backend calls, it was real work, and most call sites did not qualify. In retrieval, it was a reason not to build a vector database yet. At a data-server boundary, it was not my decision to make, so I controlled response size instead. The common thread is that the context window, the model's limited input budget, is the scarce resource, and the only way to know whether you are spending it well is to measure, because the ways caching fails are mostly silent.

Keep reading

Post

How to backtest without fooling yourself

A backtest's job is not to find an edge. It is to stop you from believing in one that is not there. The toolkit I used to test my own trading engine, and the part where it killed my single best signal.

Read
Post

The caskeycoding.com tech stack at a glance

A high-level tour of the technologies running this site: Next.js on CloudFront, Python Lambdas behind API Gateway, DynamoDB plus S3, Anthropic's API with a Bedrock fallback, and AWS CDK wiring it together.

Read
Post

Hello Again, Opus

Four days after I said goodbye to Opus, an export-control directive pulled Fable 5 offline and the fallback became the workhorse again. What I shipped in the window, what it cost, and the model-tiering plan for when Fable comes back.

Read
Post

An orchestration mode is only as good as its backlog

Anthropic published a guide on building a session-level orchestration mode. I built it two ways, on the CLI and on the API, and then hit the part the guide does not cover: an orchestrator that fans out is useless without a backlog of real work to fan out over.

Read
Post

Building an AI-Native Platform: A Retrospective

A year of building and operating a small fleet of finance and content products almost entirely through an AI coding agent. What worked, what was hard, the honest failures (including a flagship signal that measured nothing and an edge that vanished net of costs), and the lessons that transfer.

Read
Post

Composite What You Trust, Watch What You Don't: A Trust Boundary for Data With Money Attached

Every system that fuses signals into one consequential number has a fault line: the data you trust enough to composite into a grade versus the data you only trust enough to watch. How I drew that boundary in my personal finance engine, and how a test keeps it honest.

Read
Written by Eric Caskey. I build AI tools you can actually use. Explore the Tools or see the case studies.