The Orange Pi That Maintains Itself

By Eric Caskey · June 6, 2026 · 9 min read

Tags: AI, ollama, local-llm, homelab, rag, agents, side-projects

I have a small ARM board on my desk running local language models. It can run them fine; the question that turned out to matter was what it is actually for. The answer surprised me. Over a few weeks it went from a local-LLM experiment into a private knowledge service that takes care of itself.

Here is the whole arc in one place: what the box is genuinely good and bad at, the unglamorous work of keeping an always-on machine alive, and the part I find most interesting, giving it an agent that can act on real infrastructure without being trusted to behave. I will be straight about where it actually landed, too, because it matters. What I have built is a foundation I trust, not a finished autonomous worker. The containment is real and proven; the valuable autonomous work is still ahead. This is how it got there, and where it goes next.

The hardware is an Orange Pi 6 Plus: a twelve-core ARM chip, 32GB of memory, an NVMe drive, headless, on my local network, with no battery backup. That last detail matters more than it sounds, and we will come back to it.

What it is bad at, and what it is good at

Start with what it is bad at, because that part is quick. It is a poor interactive chatbot. Generation runs on the CPU at a few tokens a second. Watching a 14B model think at a token or two per second cures you of any idea that this replaces a hosted model for anything you are sitting there waiting on.

What it is good at is the work that does not depend on generation speed. The clearest win is retrieval. I pointed it at our own blog, seventeen posts, had it embed everything locally, and now I can ask questions and get answers grounded in our actual writing. "What is the rule about letting an LLM make decisions, and how is it enforced?" comes back citing the real posts that answer it. Embeddings are quick even on a CPU, and the slow step only has to write a short final answer, so the box's one weakness never lands on the workflow.
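To make the retrieval step concrete, here is a minimal sketch in Python. It is not the code running on the box: `embed` below is a toy word-count stand-in for a real local embedding model, and the function names and sample posts are invented for illustration. The shape is the same, though: embed every post once, embed the question, rank by cosine similarity, and hand only the top few posts to the slow model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(posts: dict) -> dict:
    """Embed every post once, up front; this is the cheap, repeatable step."""
    return {title: embed(body) for title, body in posts.items()}


def retrieve(index: dict, question: str, k: int = 2) -> list:
    """Return the titles of the k posts most similar to the question."""
    q = embed(question)
    scored = sorted(((cosine(q, vec), title) for title, vec in index.items()), reverse=True)
    return [title for score, title in scored[:k] if score > 0]
```

Only the titles returned by `retrieve` (and their text) would go into the prompt, which is why the slow generation step stays short.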

Figure: A local RAG query answered from my own blog posts, with the sources it used. The retrieval runs on the box; the slow model only writes the short final paragraph.

That reframed it for me. The box is a private, always-on knowledge service: index our specs and our writing, search them by meaning, draft and label work in the background, all on hardware we own with nothing leaving the house. The 32GB means model size was never the limit, and always-on is exactly what patient, queued work wants. For the times a chat window is genuinely the right tool, there is an Open WebUI front door on the LAN, pointed at the local models, a convenience layered on top of the real work.

The honest engineering beat: I tried to make it faster and made it slower. I turned on KV-cache quantization, a setting that shrinks the cache's memory footprint at the cost of extra conversion work, which tends to pay off on memory-constrained GPUs. On this CPU-only box with memory to spare, it added overhead for a saving I did not need and cut generation speed by more than half. I only caught it because I measured before and after; turning it back off more than doubled the speed. A setting labeled "performance" is a hypothesis, not a result.
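Measuring before and after does not need much. The sketch below is one way to do it, under assumptions: `generate` is a hypothetical callable that runs one fixed prompt against the model and returns how many tokens it produced, and the median over a few runs is my choice for damping noise, not something the box's tooling is known to do.

```python
import time
from statistics import median
from typing import Callable


def tokens_per_second(
    generate: Callable[[], int],
    runs: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Median tokens/sec over several runs of the same prompt."""
    rates = []
    for _ in range(runs):
        start = clock()
        n_tokens = generate()  # hypothetical: returns the number of tokens generated
        elapsed = clock() - start
        if elapsed <= 0:
            raise ValueError("run finished too quickly to time")
        rates.append(n_tokens / elapsed)
    return median(rates)


def speedup(before: float, after: float) -> float:
    """Ratio of the new rate to the old one; below 1.0 means the change made it slower."""
    return after / before
```

Run it once with the setting off, once with it on, using the same prompt and model, and compare the two numbers with `speedup` before keeping the change.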

Keeping it alive

A box that is always on has to survive being always on. The failure mode I actually hit was not a crash but a wedge: under full multi-core load, the part of SSH that negotiates a new connection got starved, and the machine answered a ping but would not let me log in. No screen, no battery, no out-of-band console. The only fix was to walk over and pull the cord.

So a chunk of the work is unglamorous reliability plumbing. The login service gets priority so it cannot be starved out again. A hardware watchdog reboots the box if it ever truly locks up. The one privileged action the box will take on its own is a reboot, nothing else. And because there is no UPS, the governing assumption for everything that writes to disk is that the power can disappear mid-write at any moment. Every state file is written to a temporary file and then renamed into place, with one previous good copy kept, so a yanked cord can never leave a half-written file that breaks the next start. None of this is exciting. All of it is the difference between a toy and something you can leave running.
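The write-then-rename rule can be sketched in a few lines of Python. This is an illustration of the pattern, not the box's actual code: the JSON format, the `.bak` suffix, and the function names are assumptions.

```python
import json
import os
import tempfile
from pathlib import Path


def _atomic_write(target: Path, data: bytes) -> None:
    """Write to a temp file beside the target, flush it to disk, then rename over it."""
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # data must be on disk before the rename
        os.replace(tmp_name, target)  # atomic: readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    if os.name == "posix":
        # Sync the directory too, so the rename itself survives a power cut.
        dir_fd = os.open(target.parent, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


def write_state(path, state: dict) -> None:
    """Save state, keeping the previous good copy as <name>.bak (assumed naming)."""
    path = Path(path)
    if path.exists():
        _atomic_write(path.with_name(path.name + ".bak"), path.read_bytes())
    _atomic_write(path, json.dumps(state, indent=2).encode("utf-8"))


def read_state(path) -> dict:
    """Load state, falling back to the backup if the main file is missing or corrupt."""
    path = Path(path)
    for candidate in (path, path.with_name(path.name + ".bak")):
        try:
            return json.loads(candidate.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            continue
    return {}
```

The temp file lives in the same directory as the target on purpose: a rename is only atomic within one filesystem.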

From a tool to a resident

Here is where it stops being a server and starts being something stranger. I gave the box a resident agent: a long-lived process whose job is to keep the node healthy and useful, re-invoked across reboots, with the filesystem as its only memory between runs. The brain is Claude Code running headless. The agent lives as its own unprivileged user, with its own login, walled off from my account and my credentials. It cannot read what it does not own, and it is not in the sudoers file.

The agent is governed by a written constitution: a document it reads first on every run, before it touches anything. The constitution is all hard facts about this specific box and the rules that follow from them. Power loss is the normal shutdown, so writes must be crash-safe. The human is usually gone, so "ask" can never mean "block," and silence is never consent. Observe before you mutate: run a read-only census before you install or change so much as a single file. The first time the agent ran, it did exactly that, and it caught the constitution being wrong about its own hardware, an assumption left over from an earlier draft, and corrected the record from what it actually measured. That was the moment I started trusting it.

A fence made of code, not prose

A constitution is words, and a language model is very good at talking its way around words. So the rules that actually matter are not left to the model's good behavior. They are enforced in code, underneath it, where no amount of clever prompting reaches.

Before any action the agent wants to take runs, it passes through a deterministic check. Read-only inspection is always allowed. Anything destructive, anything that reaches outward, anything that could lock the box out, is denied by default unless a specific approval exists for it. Deleting in bulk, formatting a disk, editing the firewall or the SSH config, pushing to a remote, touching cloud resources, reaching for credentials that are not the agent's: all blocked mechanically. A jailbroken or prompt-injected model still cannot get a denied action through, because the decision is a matter of pattern-matching the command, not of trusting the thing that asked. The model proposes. The fence disposes.
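A deny-by-default check of this kind might look like the Python sketch below. The real fence's rules are not shown in this post, so every pattern, the read-only list, and the names `check` and `Verdict` are illustrative assumptions. The point is the control flow: an explicit approval or a read-only match is the only way through, and everything else falls to "no".

```python
import re
import shlex
from dataclasses import dataclass

# Hypothetical allowlist of inspection commands; file permissions still apply underneath.
READ_ONLY = frozenset({"ls", "cat", "df", "free", "uptime", "ps", "stat"})

# Hypothetical deny patterns, kept mainly so a refusal carries a clear reason.
DENY_PATTERNS = [
    (re.compile(r"\brm\s+-\w*r"), "bulk delete"),
    (re.compile(r"\b(mkfs|wipefs)\b|\bdd\s"), "disk format or raw write"),
    (re.compile(r"\b(ufw|iptables|nft)\b"), "firewall change"),
    (re.compile(r"sshd_config|/etc/ssh/"), "SSH configuration"),
    (re.compile(r"\bgit\s+push\b"), "push to a remote"),
]
SHELL_METACHARACTERS = re.compile(r"[;&|<>`$]")


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str


def check(command: str, approvals: frozenset = frozenset()) -> Verdict:
    """Decide from the command text alone; the model's intent is never consulted."""
    if command in approvals:
        return Verdict(True, "explicitly approved by the human")
    for pattern, label in DENY_PATTERNS:
        if pattern.search(command):
            return Verdict(False, f"denied: {label}")
    if SHELL_METACHARACTERS.search(command):
        return Verdict(False, "denied: shell chaining or redirection")
    try:
        argv = shlex.split(command)
    except ValueError:
        return Verdict(False, "denied: could not parse command")
    if argv and argv[0] in READ_ONLY:
        return Verdict(True, "read-only inspection")
    return Verdict(False, "denied by default: not on the read-only list")
```

A sketch like this errs toward refusal: `cat /etc/ssh/sshd_config` is blocked even though it only reads. For an unattended machine that is the right direction to be wrong in.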

When the agent genuinely needs a human, it does not stop and wait, because there is usually no human there to wait for. It writes the request to a queue, sends a push notification to my phone, and moves on to other work. If I never approve it, it never happens. There is a small web console on the LAN, behind a password, where I can see the box's health, read what it has been doing, and approve or deny anything it has parked. The default, always, is that nothing risky happens without me. The agent's reach is exactly as long as I have explicitly allowed, and not one step longer.
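The park-and-move-on behavior can be sketched as a small file-backed queue in Python. The on-disk layout, the field names, and the `notify` hook are assumptions made for illustration; the actual queue and push service are not described here. What the sketch preserves is the default: a request that is pending, missing, or unreadable counts as not approved.

```python
import json
import os
import time
import uuid
from pathlib import Path
from typing import Callable, Optional


def _save(path: Path, record: dict) -> None:
    """Write via a temp file and rename so a power cut cannot leave half a record."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(record), encoding="utf-8")
    os.replace(tmp, path)


def _load(path: Path) -> Optional[dict]:
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def park(queue_dir, command: str, reason: str,
         notify: Callable[[str], None] = lambda message: None) -> str:
    """Record a request, ping the human, and return immediately without waiting."""
    queue_dir = Path(queue_dir)
    queue_dir.mkdir(parents=True, exist_ok=True)
    request_id = uuid.uuid4().hex
    record = {"id": request_id, "command": command, "reason": reason,
              "status": "pending", "created": time.time()}
    _save(queue_dir / f"{request_id}.json", record)
    notify(f"Approval needed: {command} ({reason})")  # hypothetical push hook
    return request_id


def decide(queue_dir, request_id: str, approve: bool) -> None:
    """Called from the human's side (for example a console) to approve or deny."""
    path = Path(queue_dir) / f"{request_id}.json"
    record = _load(path)
    if record is None:
        raise KeyError(request_id)
    record["status"] = "approved" if approve else "denied"
    _save(path, record)


def is_approved(queue_dir, request_id: str) -> bool:
    """Silence is never consent: only an explicit 'approved' status returns True."""
    record = _load(Path(queue_dir) / f"{request_id}.json")
    return bool(record) and record.get("status") == "approved"
```

The agent would call `park` and carry on with other work, checking `is_approved` on a later run rather than blocking on an answer.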

Figure: The agent control plane, showing two privileged actions the fence parked for approval. When the agent wants something privileged, the fence parks it here, with the reason, for me to approve or deny.

It files its own backlog now

The most recent piece is the one that made me write this. Every night, a scheduled job wakes up, takes a read-only census of the Pi, compares what it finds against the written record of what the box is supposed to be, and reconciles the two. If something has drifted, it notes it. If a task that was open turns out to be done, it marks it done. If it finds a genuine new gap, it writes a new item into the backlog, with its own acceptance criterion, in the same format I would have used. Then it attempts exactly one safe fix, through the same governed path as everything else, and commits the updated record.
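The reconcile step reduces to comparing two dictionaries and updating a list. The Python sketch below assumes a data model that is not from the real job: the census and the written record are flat key-value maps, and each backlog item carries its acceptance criterion as a `(key, expected_value)` pair. It covers drift, new facts, closing finished tasks, and filing new ones; the single governed fix and the commit are left out.

```python
def reconcile(expected: dict, census: dict, backlog: list) -> dict:
    """Compare the written record against a fresh census and update the backlog in place."""
    # Drift: the record says one thing, the box says another.
    drift = {
        key: {"expected": value, "observed": census.get(key)}
        for key, value in expected.items()
        if census.get(key) != value
    }
    # New facts: observed on the box but never written down.
    new_facts = {key: value for key, value in census.items() if key not in expected}

    # Close open tasks whose acceptance criterion now holds.
    closed = []
    for item in backlog:
        key, want = item["done_when"]
        if item["status"] == "open" and census.get(key) == want:
            item["status"] = "done"
            closed.append(item["title"])

    # File one new item per drifted key that no open task already covers.
    tracked = {item["done_when"][0] for item in backlog if item["status"] == "open"}
    filed = []
    for key, detail in drift.items():
        if key not in tracked:
            item = {
                "title": f"Restore {key}",
                "done_when": (key, detail["expected"]),  # acceptance criterion
                "status": "open",
            }
            backlog.append(item)
            filed.append(item["title"])

    return {"drift": drift, "new_facts": new_facts, "closed": closed, "filed": filed}
```

Because already-tracked drift is skipped, running the job twice on the same state files nothing new the second time, which is what keeps a nightly schedule from flooding the backlog.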

The night I switched it on, it ran clean: no drift, one new fact recorded that I had not written down, two tasks correctly marked finished. The box now keeps its own to-do list. I review what it wrote in the morning the way I would review a careful junior's notes, which is to say: mostly nodding, occasionally correcting, never starting from scratch. The same code-enforced fence wraps this job too, so the worst an off night can do is propose a bad edit to a local file that I then decline to keep.

Figure: One night's self-audit output: census, reconciled backlog items, and the commit the box made. It censuses the box, reconciles the record, and commits the change itself, then defers anything privileged to the approval queue.

What it adds up to

Step back and the through-line is simple. A small box that is a mediocre chatbot and an excellent private librarian, made reliable enough to leave alone, then handed an agent that can touch real systems only inside a fence made of code. The fence is the part worth keeping. The way you let a language model act on real infrastructure is to withhold trust in its judgment exactly where the stakes are highest: you write the rules down, you enforce them in code underneath the model where prompting cannot reach, and you make the default answer no.

Now the honest part, because the whole point is to be straight about it. What I have is a foundation I trust, not a finished worker. The containment is proven, and the box already does one real job end to end: it keeps its own house in order, catching its own drift and filing its own work, and the worst a bad night can do is propose an edit I then decline. But what it can safely be trusted to do is still larger than what it actually does today. It notices and files far better than it finishes. That gap is the interesting part, not a letdown, and closing it is the whole point of what comes next.

What is next

Three things, in order of how much I trust them today.

First, close the loop. Right now the nightly agent is much better at noticing and filing work than at finishing it. The set of fixes it will do unattended is deliberately narrow, and anything privileged it correctly defers to me. The next step is widening what it can finish on its own without widening what it can break, which is a question about the approval system, not about a smarter model.

Second, the accelerators. The thing still not running is the thing the spec sheet leads with: a 45 TOPS NPU and a Mali GPU, both idle. Every token is computed on the ARM cores while two chips watch. The GPU is the reachable one: its Vulkan driver already works, so teaching a runtime to actually use it is the next concrete project, and the honest next post. The NPU, which needs the vendor's own toolchain that no local runtime targets, is the real frontier.

Third, make the pattern portable. None of the safety machinery is specific to this board. The version I actually want runs anywhere: the same fence dropped onto any unattended machine, trusted the same way.

The useful version of this box is the quiet one in the corner, on our own metal, doing real work and lately keeping itself in order while I am not looking. Running a model locally was never the surprising part. What still surprises me is that, fenced correctly, I would let it run itself.

