Caskey Engineering

The Orange Pi That Maintains Itself

I have a small ARM board on my desk running local language models. The interesting question was never whether it could run one. It can. The useful question is what it is actually for, and the answer surprised me: not a chatbot, but a private knowledge service that, over a few weeks, turned into a machine that takes care of itself.

Here is the whole arc in one place. What the box is genuinely good and bad at, the unglamorous work of keeping an always-on machine alive, and the part I find most interesting: giving it an agent that can act on real infrastructure without being trusted to behave. I will be straight about where it actually landed, too, because it matters. What I have built is a foundation I trust, not a finished autonomous worker. The containment is real and proven; the valuable autonomous work is still ahead. This is how it got there, and where it goes next.

The hardware is an Orange Pi 6 Plus: a twelve-core ARM chip, 32GB of memory, an NVMe drive, headless, on my local network, with no battery backup. That last detail matters more than it sounds, and we will come back to it.

What it is bad at, and what it is good at

Start with what it is bad at, because that part is quick. It is a poor interactive chatbot. Generation runs on the CPU at a few tokens a second. Watching a 14B model think at a token or two per second cures you of any idea that this replaces a hosted model for anything you are sitting there waiting on.

What it is good at is the work that does not depend on generation speed. The clearest win is retrieval. I pointed it at our own blog, seventeen posts, had it embed everything locally, and now I can ask questions and get answers grounded in our actual writing. "What is the rule about letting an LLM make decisions, and how is it enforced?" comes back citing the real posts that answer it. Embeddings are quick even on a CPU, and the slow step only has to write a short final answer, so the box's one weakness never lands on the workflow.

[Figure: A local RAG query answered from my own blog posts, with the sources it used.] A question answered from my own writing, with the sources it pulled. The retrieval runs on the box; the slow model only writes the short final paragraph.
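The retrieval step is cheap because it is only vector arithmetic. Here is a minimal sketch of ranking by meaning, assuming the posts have already been embedded. The three-dimensional vectors, the post names, and the `top_k` helper are made up for illustration; a real embedding model emits hundreds of dimensions, and this is not the code running on the box.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return the k (score, source) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), source) for source, vec in index.items()]
    return sorted(scored, reverse=True)[:k]

# Hypothetical embeddings keyed by post name.
index = {
    "post-a": [0.9, 0.1, 0.0],
    "post-b": [0.1, 0.9, 0.0],
    "post-c": [0.7, 0.3, 0.1],
}
hits = top_k([1.0, 0.0, 0.0], index, k=2)
sources = [source for _, source in hits]  # the posts handed to the slow model
```

Only the text of the top-ranked posts is passed to the generator, which is why the slow step stays short.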

That reframed the whole thing for me. This is not a chatbot. It is a private, always-on knowledge service: index our specs and our writing, search them by meaning, draft and label things in the background, all on hardware we own with nothing leaving the house. The 32GB means model size was never the limit, and always-on is exactly what patient, queued work wants. For the times a chat window is genuinely the right tool, there is an Open WebUI front door on the LAN, pointed at the local models. It is a convenience, not the point.

The honest engineering beat: I tried to make it faster and made it slower. I turned on KV-cache quantization, a setting that shrinks the cache's memory footprint and can pay off on memory-constrained GPUs. On this CPU-only box with memory to spare, it added overhead for a saving I did not need and cut generation speed by more than half. I only caught it because I measured before and after; turning it back off more than doubled the speed. A setting labeled "performance" is a hypothesis, not a result.
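Measuring before and after does not need much. A minimal sketch of the harness, assuming each configuration is wrapped in a callable that takes a prompt and returns the number of tokens it generated; `tokens_per_second` and `compare` are hypothetical names, not part of any runtime.

```python
import time

def tokens_per_second(generate, prompt, runs=3):
    """Time generate(prompt), which returns a token count, and keep the best rate."""
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, n_tokens / elapsed)
    return best

def compare(baseline, candidate, prompt):
    """Run both configurations on the same prompt and report the ratio."""
    before = tokens_per_second(baseline, prompt)
    after = tokens_per_second(candidate, prompt)
    return {"before": before, "after": after,
            "speedup": after / before if before else 0.0}
```

A `speedup` below 1.0 means the "performance" setting made things worse, which is the result I got.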

Keeping it alive

A box that is always on has to survive being always on. The failure mode I actually hit was not a crash, it was a wedge: under full multi-core load, the part of SSH that negotiates a new connection gets starved, and the machine answers a ping but will not let you log in. No screen, no battery, no out-of-band console. The only fix was to walk over and pull the cord.

So a chunk of the work is unglamorous reliability plumbing. The login service gets priority so it cannot be starved out again. A hardware watchdog reboots the box if it ever truly locks up. The one privileged action the box will take on its own is a reboot, nothing else. And because there is no UPS, the governing assumption for everything that writes to disk is that the power can disappear mid-write at any moment. Every state file is written to a temporary file and then renamed into place, with one previous good copy kept, so a yanked cord can never leave a half-written file that breaks the next start. None of this is exciting. All of it is the difference between a toy and something you can leave running.
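The write-then-rename pattern looks roughly like this. It is a minimal sketch, assuming JSON state files on a POSIX filesystem; `write_state`, `read_state`, and the `.bak` suffix are illustrative names, not the box's actual code.

```python
import json
import os
import tempfile

def write_state(path, state):
    """Write state so a power cut leaves a complete old or new file, never half of one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())           # force the bytes to disk before renaming
        if os.path.exists(path):
            os.replace(path, path + ".bak")  # keep one previous good copy
        os.replace(tmp_path, path)           # atomic rename into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    try:
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)                 # persist the renames themselves
        finally:
            os.close(dir_fd)
    except OSError:
        pass                                 # directory fsync is not supported everywhere

def read_state(path):
    """Load the current file, falling back to the previous good copy."""
    for candidate in (path, path + ".bak"):
        try:
            with open(candidate) as handle:
                return json.load(handle)
        except (FileNotFoundError, json.JSONDecodeError):
            continue
    return None
```

There is a brief window between the two renames where only the `.bak` copy exists, which is why the reader falls back to it instead of treating a missing file as fatal.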

From a tool to a resident

Here is where it stops being a server and starts being something stranger. I gave the box a resident agent: a long-lived process whose job is to keep the node healthy and useful, re-invoked across reboots, with the filesystem as its only memory between runs. The brain is Claude Code running headless. The agent lives as its own unprivileged user, with its own login, walled off from my account and my credentials. It cannot read what it does not own, and it is not in the sudoers file.

The agent is governed by a written constitution: a document it reads first on every run, before it touches anything. The constitution is not motivational. It is a list of hard facts about this specific box and the rules that follow from them. Power loss is the normal shutdown, so writes must be crash-safe. The human is usually gone, so "ask" can never mean "block," and silence is never consent. Observe before you mutate: run a read-only census before you install or change a single thing. The first time the agent ran, it did exactly that, and it caught the constitution being wrong about its own hardware, an assumption left over from an earlier draft, and corrected the record from what it actually measured. That was the moment I started trusting it.

A fence made of code, not prose

A constitution is words, and a language model is very good at talking its way around words. So the rules that actually matter are not left to the model's good behavior. They are enforced in code, underneath it, where no amount of clever prompting reaches.

Before any action the agent wants to take runs, it passes through a deterministic check. Read-only inspection is always allowed. Anything destructive, anything that reaches outward, anything that could lock the box out, is denied by default unless a specific approval exists for it. Deleting in bulk, formatting a disk, editing the firewall or the SSH config, pushing to a remote, touching cloud resources, reaching for credentials that are not the agent's: all blocked mechanically. A jailbroken or prompt-injected model still cannot get a denied action through, because the decision is a matter of pattern-matching the command, not of trusting the thing that asked. The model proposes. The fence disposes.
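In sketch form, a deny-by-default check can be very small. This one assumes commands arrive as single shell strings and that approvals are exact-match strings; the `READ_ONLY` allowlist, the `SENSITIVE` paths, and the `check` function are hypothetical stand-ins, and the real rule set is longer.

```python
import re
import shlex

# Hypothetical allowlist of inspection-only commands.
READ_ONLY = frozenset({"ls", "cat", "df", "free", "uptime", "ps", "stat"})
# Paths that hold credentials; even a read-only command may not touch them.
SENSITIVE = re.compile(r"(\.ssh/|\.aws/|\.gnupg/|/etc/shadow)")
# Shell metacharacters that could chain, redirect, or substitute commands.
SHELL_META = re.compile(r"[;&|<>`$\n]")

def check(command, approved=frozenset()):
    """Return 'allow' or 'deny' for a proposed command. The default is deny."""
    if command in approved:
        return "allow"            # a human approved this exact command
    if SENSITIVE.search(command) or SHELL_META.search(command):
        return "deny"
    try:
        argv = shlex.split(command)
    except ValueError:
        return "deny"             # unparseable input is never trusted
    if argv and argv[0] in READ_ONLY:
        return "allow"
    return "deny"                 # anything unrecognized falls through to no
```

For example, `check("df -h")` allows, while `check("ls; rm -rf /")` and `check("git push origin main")` deny unless that exact string appears in `approved`. Nothing in the function consults the model, so nothing the model says can change the answer.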

When the agent genuinely needs a human, it does not stop and wait, because there is usually no human there to wait for. It writes the request to a queue, sends a push notification to my phone, and moves on to other work. If I never approve it, it never happens. There is a small web console on the LAN, behind a password, where I can see the box's health, read what it has been doing, and approve or deny anything it has parked. The default, always, is that nothing risky happens without me. The agent's reach is exactly as long as I have explicitly allowed, and not one step longer.

[Figure: The agent control plane, showing two privileged actions the fence parked for approval.] The control plane. When the agent wants something privileged, the fence parks it here, with the reason, for me to approve or deny.
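The park-and-move-on behavior can be sketched as a directory of request files. This assumes one JSON file per request; `park`, `decide`, and `runnable` are illustrative names, the push notification is left out, and a real version would also write these files crash-safely.

```python
import json
import time
import uuid
from pathlib import Path

def park(queue_dir, command, reason):
    """Record a request for human approval and return its id without waiting."""
    queue = Path(queue_dir)
    queue.mkdir(parents=True, exist_ok=True)
    request_id = uuid.uuid4().hex
    record = {"id": request_id, "command": command, "reason": reason,
              "status": "pending", "created": time.time()}
    (queue / f"{request_id}.json").write_text(json.dumps(record))
    return request_id

def decide(queue_dir, request_id, approve):
    """Called from the human side (the console): mark a request approved or denied."""
    path = Path(queue_dir) / f"{request_id}.json"
    record = json.loads(path.read_text())
    record["status"] = "approved" if approve else "denied"
    path.write_text(json.dumps(record))

def runnable(queue_dir):
    """Commands that were explicitly approved. Pending and denied ones never run."""
    records = [json.loads(p.read_text()) for p in Path(queue_dir).glob("*.json")]
    approved = [r for r in records if r["status"] == "approved"]
    return [r["command"] for r in sorted(approved, key=lambda r: r["created"])]
```

Because `runnable` returns only approved entries, silence leaves a request pending forever, which is the "silence is never consent" rule expressed as a filter.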

It files its own backlog now

The most recent piece is the one that made me write this. Every night, a scheduled job wakes up, takes a read-only census of the Pi, compares what it finds against the written record of what the box is supposed to be, and reconciles the two. If something has drifted, it notes it. If a task that was open turns out to be done, it marks it done. If it finds a genuine new gap, it writes a new item into the backlog, with its own acceptance criterion, in the same format I would have used. Then it attempts exactly one safe fix, through the same governed path as everything else, and commits the updated record.
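The reconcile step at the heart of that job can be sketched as a pure function over three inputs. This assumes the record and the census are flat key-value mappings and that each backlog item carries a `check_key` and a `done_when` value as its acceptance criterion; those field names, the sample keys, and the `reconcile` function are hypothetical.

```python
def reconcile(record, census, backlog):
    """Compare the written record with a census; return drift, new facts, and a new backlog."""
    drift = [
        {"key": key, "expected": expected, "observed": census.get(key)}
        for key, expected in record.items()
        if census.get(key) != expected
    ]
    # Facts the census observed that the record never mentioned.
    new_facts = {key: value for key, value in census.items() if key not in record}

    updated = []
    for item in backlog:
        met = census.get(item["check_key"]) == item["done_when"]
        status = "done" if item["status"] == "open" and met else item["status"]
        updated.append({**item, "status": status})

    # File a new item, with its own acceptance criterion, for each untracked gap.
    tracked = {item["check_key"] for item in updated}
    for gap in drift:
        if gap["key"] not in tracked:
            updated.append({"title": f"Restore {gap['key']}", "check_key": gap["key"],
                            "done_when": gap["expected"], "status": "open"})
    return drift, new_facts, updated

# Hypothetical inputs for one night's run.
record = {"watchdog": "enabled", "ssh_priority": "high"}
census = {"watchdog": "disabled", "ssh_priority": "high", "swap": "off"}
backlog = [{"title": "Raise SSH priority", "check_key": "ssh_priority",
            "done_when": "high", "status": "open"}]
drift, new_facts, updated = reconcile(record, census, backlog)
```

The function only reads its inputs and returns new values, so the fix attempt and the commit stay separate steps that go through the governed path.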

The night I switched it on, it ran clean: no drift, one new fact recorded that I had not written down, two tasks correctly marked finished. The box now keeps its own to-do list. I review what it wrote in the morning the way I would review a careful junior's notes, which is to say: mostly nodding, occasionally correcting, never starting from scratch. The same code-enforced fence wraps this job too, so the worst an off night can do is propose a bad edit to a local file that I then decline to keep.

[Figure: One night's self-audit output: census, reconciled backlog items, and the commit the box made.] One night's run: it censuses the box, reconciles the record, and commits the change itself, then defers anything privileged to the approval queue.

What it adds up to

Step back and the through-line is simple. A small box that is a mediocre chatbot and an excellent private librarian, made reliable enough to leave alone, then handed an agent that can touch real systems only inside a fence made of code. The part worth keeping is not the Pi, it is that fence: the answer to how you let a language model act on real infrastructure is that, for the things that matter, you do not trust its judgment at all. You write the rules down, you enforce them in code underneath the model where prompting cannot reach, and you make the default answer no.

Now the honest part, because the whole point is to be straight about it. What I have is a foundation I trust, not a finished worker. The containment is proven, and the box already does one real job end to end: it keeps its own house in order, catching its own drift and filing its own work, and the worst a bad night can do is propose an edit I then decline. But what it can safely be trusted to do is still larger than what it actually does today. It notices and files far better than it finishes. That gap is the interesting part, not a letdown, and closing it is the whole point of what comes next.

What is next

Three things, in order of how much I trust them today.

First, close the loop. Right now the nightly agent is much better at noticing and filing work than at finishing it. The set of fixes it will do unattended is deliberately narrow, and anything privileged it correctly defers to me. The next step is widening what it can finish on its own without widening what it can break, which is a question about the approval system, not about a smarter model.

Second, the accelerators. The thing still not running is the thing the spec sheet leads with: a 45 TOPS NPU and a Mali GPU, both idle. Every token is computed on the ARM cores while two chips watch. The GPU is the reachable one: its Vulkan driver already works, so teaching a runtime to actually use it is the next concrete project, and the honest next post. The NPU is the real frontier, because it needs the vendor's own toolchain, which no local runtime targets.

Third, make the pattern portable. None of the safety machinery is specific to this board. The interesting version of this is not one box in my house, it is something I can drop onto any unattended machine and trust the same way.

The useful version of this box was never the demo where it answers a question in a chat window. It is the quiet one in the corner, on our own metal, doing real work, and lately keeping itself in order while I am not looking. The surprise was never that a small computer could run a model. It was that, fenced correctly, I would let it run itself.

Written by Eric Caskey. I build AI tools you can actually use.