Welcome: Building Platforms for Scale

By Eric Caskey · May 15, 2025 · 3 min read

Hello, and welcome. Let me start with where this came from.

For most of my career I have worked on one problem in different shapes: how do you keep a large system honest about its own health, without a person having to remember to check? Monitoring, in other words, and the automation around it.

Enterprise monitoring at Prudential

At Prudential I owned the enterprise monitoring platform, which meant the health of more than 400,000 monitors across the company. I set the standard for how infrastructure got watched, chose the hardware it ran on, and led the migration off the legacy systems onto that standard. The lesson that stuck was about blast radius. When one team's decision sets the monitoring for a whole company, a good call quietly protects thousands of services and a bad one quietly exposes them, and you often do not find out which you made until something breaks.

Large-scale automation at Amazon

The next version was the same problem, an order of magnitude bigger. Infrastructure at Amazon came and went constantly, so the monitoring had to find new hosts and services on its own, attach the right alarms as they appeared, and let go as they disappeared. Writing the shared standards, alert policies, and deployment playbooks turned that from a chore every team repeated into something they got by default. I wrote about that work in more detail in a case study on fleet monitoring.

The payoff was not speed for its own sake. It was that a team could ship a service without first becoming a monitoring expert, and still trust that if the service got sick, someone would know.
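The find, attach, and let-go loop described above is at heart a reconciliation between what exists and what is watched. Here is a minimal sketch of that idea, not the system as it was built: the names `Plan`, `reconcile`, and `apply_plan` are hypothetical, and hosts are reduced to plain strings.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """What one reconciliation pass intends to change (hypothetical type)."""
    attach: frozenset[str]  # hosts that exist but have no alarms yet
    detach: frozenset[str]  # hosts that have alarms but no longer exist


def reconcile(discovered: set[str], monitored: set[str]) -> Plan:
    """Compare the hosts that exist now with the hosts already monitored."""
    return Plan(
        attach=frozenset(discovered - monitored),
        detach=frozenset(monitored - discovered),
    )


def apply_plan(plan: Plan, monitored: set[str]) -> set[str]:
    """Return the new monitored set after attaching and detaching."""
    return (monitored | plan.attach) - plan.detach


# Usage: one pass over an invented fleet.
plan = reconcile(discovered={"web-1", "web-2", "db-1"}, monitored={"web-2", "old-7"})
now_monitored = apply_plan(plan, {"web-2", "old-7"})
```

Computing the plan separately from applying it is a deliberate choice in this sketch: the plan can be logged, reviewed, or size-checked before anything touches a real alarm.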

The mishap

Not every deploy went cleanly. One afternoon we rolled monitors across a large slice of the fleet and pointed every one of their alerts at a single support queue. For about ten minutes that queue took an alarm from every host at once, thousands of them, until we aimed the flood back at ourselves and rolled the change back. It taught me the rule I still work by, in one sentence: when the automation is large, a small mistake does not stay small. It arrives everywhere at the same time.
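One common guard against that kind of everywhere-at-once failure is to widen a change in stages and stop as soon as a stage looks wrong. The sketch below illustrates the idea only; it is not the process we used at the time. The function `staged_rollout`, its callbacks `deploy` and `alerts_fired`, the stage fractions, and the alert threshold are all assumptions made up for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    deployed: list[str]  # hosts that received the change
    halted: bool         # True if a stage tripped the alert threshold


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    alerts_fired: Callable[[Sequence[str]], int],
    stages: Sequence[float] = (0.01, 0.10, 1.0),  # assumed fractions of the fleet
    max_alerts: int = 5,                          # assumed per-stage threshold
) -> RolloutResult:
    """Deploy in widening waves and halt when a wave raises too many alerts."""
    done = 0
    for fraction in stages:
        # Cumulative number of hosts that should be deployed after this stage.
        target = min(len(hosts), max(1, int(len(hosts) * fraction)))
        wave = hosts[done:target]
        for host in wave:
            deploy(host)
        done = max(done, target)
        if alerts_fired(wave) > max_alerts:
            # Stop here; the caller rolls back only the hosts in `deployed`.
            return RolloutResult(deployed=list(hosts[:done]), halted=True)
    return RolloutResult(deployed=list(hosts[:done]), halted=False)


# Usage: a bad change where every deployed host raises an alarm.
fleet = [f"host-{i}" for i in range(1000)]
touched: list[str] = []
result = staged_rollout(fleet, touched.append, alerts_fired=lambda wave: len(wave))
```

With a bad change like the one in the usage example, the first stage of ten hosts trips the threshold and the other 990 are never touched. The mistake still happens, but it stays small.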

What I write about here

That is the throughline, and it still runs through everything on this site. I build systems where the safe behavior is the structural default rather than a rule someone has to remember, and I write about how that goes, honestly, including the parts that fail.

These days the systems are my own: a finance engine that grades stocks on rules I can inspect, a playground of interactive market visualizations, and a running account of what it is like to build and ship software this way, with an AI agent doing the typing. The domain moved from enterprise infrastructure to a small fleet of my own products. The question did not: how do you build a system you can trust, and how do you prove that rather than hope it?

If you have ever watched a one-line mistake page an entire fleet, we will get along. Thanks for reading.

Eric Caskey

Keep reading

Case study

Standardized Enterprise Monitoring Across a Fortune 100 Infrastructure

Defined the org-wide monitoring standard and led a zero-disruption platform migration.

Case study

Standardized Infrastructure Monitoring Across Thousands of Services

Defined the monitoring standard across thousands of services and drove cross-team adoption.
