Week notes: 27 March – 3 April 2026
There's a moment in any bootstrapping process where the thing you're building starts building things of its own. You expected it eventually, but not this week. This week I sat down to run some model evaluations and ended up watching an agent scaffold another agent from a PRD, then that agent built itself out over fifteen commits while I watched. The plan for the week dissolved somewhere around Tuesday and something better took its place.
The factory that built itself
The week was supposed to be about evaluation pipelines: finding the right balance of tool-call predictability, tok/s, and cost per million tokens across local and hosted models. axon-eval already had the evaluation framework. For local serving, the focus was BFCL accuracy and tok/s: Qwen 3.5 122B hit the sweet spot at 91.1% on BFCL. To get the best throughput without losing accuracy, I took an autotuning approach inspired by Karpathy's autoresearch: a tool called autotune (namechecking Melodyne, naturally) that tunes a local model's slots, workers, batch sizes, and cache parameters until you've squeezed out every token per second the hardware will give you. For hosted models (Anthropic, OpenRouter), the evaluation was about optimising cost per million tokens for equivalent capability.
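The core loop of that kind of tuner is just a search over serving parameters, keeping whichever configuration benchmarks fastest. Here's a minimal sketch in Go; the ServeConfig fields and the measureTokS stub are illustrative assumptions, not autotune's actual code (a real run would restart the server with each config and time generations against a fixed prompt set).

```go
package main

import "fmt"

// ServeConfig is a hypothetical bundle of the serving parameters the
// tuner sweeps: parallel slots, worker count, and batch size.
type ServeConfig struct {
	Slots, Workers, BatchSize int
}

// measureTokS stands in for a real benchmark run. Here it is a toy
// formula so the sketch is self-contained and deterministic.
func measureTokS(cfg ServeConfig) float64 {
	return float64(cfg.BatchSize*cfg.Workers) / float64(1+cfg.Slots)
}

// tune does a plain grid search and keeps the best configuration.
func tune(slots, workers, batches []int) (ServeConfig, float64) {
	var best ServeConfig
	bestTokS := -1.0
	for _, s := range slots {
		for _, w := range workers {
			for _, b := range batches {
				cfg := ServeConfig{s, w, b}
				if t := measureTokS(cfg); t > bestTokS {
					best, bestTokS = cfg, t
				}
			}
		}
	}
	return best, bestTokS
}

func main() {
	cfg, tokS := tune([]int{1, 2, 4}, []int{1, 2}, []int{8, 16, 32})
	fmt.Printf("best: %+v at %.1f tok/s\n", cfg, tokS)
}
```

Grid search is the naive strategy; anything smarter (successive halving, Bayesian search) slots into the same loop by replacing how candidate configs are generated.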
With the model dialled in, luthier was already able to turn a PRD into a build plan. So we prepared a PRD for luthier itself and ran multiple tests to see if it could scaffold itself from a spec and a catalogue of axon components. It worked. After a few iterations we wrote a PRD for maestro, a new agent that takes a plan from luthier and turns the scaffold into a working application. Luthier took that PRD and scaffolded a working agent: essentially a clone of the patterns it already uses, all axon components glued together, but with job-appropriate tools. The scaffold included an instruction for Claude to bootstrap maestro with read, write, and exec tools, then hand it its own PRD so maestro could complete the build by adding the remaining capabilities.
Then I sat there and watched it happen. Fifteen commits where it built out the remaining capabilities it needed to assemble axon apps. Then it was extracted into its own repo with further refinements.
Right now maestro delegates coding work to claude -p sessions running Sonnet with a constrained toolset and a system prompt in CLAUDE.md (which should really be AGENTS.md). Eventually we'll have a coder agent talking to something like a Qwen coder model. The architecture is designed for model switching: need an expert security review? Find the best model with the right prompt for that task. Need a routine CRUD scaffold? Use something cheaper and faster.
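The delegation itself is small: build a one-shot claude invocation per task and run it in the workbench. A sketch of that shape, with the caveat that the --allowedTools and --append-system-prompt flag names are my assumptions about the CLI (check claude --help), and this version only constructs the command rather than executing it:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// buildCoderCmd assembles a non-interactive claude invocation for one
// delegated coding task. Flag names here are assumptions, not a spec.
func buildCoderCmd(task string, tools []string, systemPrompt string) *exec.Cmd {
	args := []string{
		"-p", task, // print mode: run one task and exit
		"--allowedTools", strings.Join(tools, ","),
		"--append-system-prompt", systemPrompt,
	}
	return exec.Command("claude", args...)
}

func main() {
	cmd := buildCoderCmd(
		"Add a health endpoint to the scaffolded service",
		[]string{"Read", "Write", "Bash"},
		"You are a coder agent working on the werkhaus workbench.",
	)
	// A real orchestrator would set cmd.Dir to the workbench folder
	// and call cmd.Run(); here we just show the resulting command line.
	fmt.Println(cmd.Args)
}
```

The point of funnelling everything through one constructor is that the toolset and prompt become per-task policy: swap the model or tighten the tools without touching the orchestration loop.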
I realised the agents needed somewhere to work. They were dumping output into whatever directory they happened to have been started in, so we came up with a factory-floor analogy: a folder with a workbench where agents can set up builds and work on them. That led to a hunt for a suitable domain name, and we found werkhaus.dev. Werkhaus became the factory name. I also registered hauswerk.audio, because how could I not.
The build plan has six phases. Phases 0 and 1 are complete. The factory built axon-base and axon-lore — two real components — from PRDs, automatically. I don't have telemetry or the ledger wired up yet to measure output quality systematically. We're still in the "make it work" phase. But the unit economics are already clear: an MVP for just over a dollar in token spend, running Qwen 3.5 122B locally. The economics when using models from smaller labs are really... interesting.
Agents don't get SSH
Claude kept reaching for SSH, and I don't want agents hopping around my cluster. So aurelia went from v0.3.1 to v0.4.0 this week with a full mTLS PKI system: cert issuance via OpenBao, automatic renewal in the peer liveness loop, TLS hot reload via GetCertificate callbacks, and a new aurelia cert command for certificate management. That's over four thousand lines of new code to give agents a secure communication channel that isn't a shell session.
The security architecture across the three-node cluster is deliberate. Adyton (Mac Mini) is the secrets authority. Only a human can SSH to adyton and use aurelia set secret to store a new secret. Agents can use secrets through aurelia's secure channel but never see them. No secret in a log, no secret on disk where an agent might stumble across it while roaming the filesystem. Limen (MacBook Pro) is now mTLS-enabled and aligned with the cluster conventions, and hestia (server) got backup alerting via Signal, postgres backup as an aurelia-managed service, and a switch to vllm-mlx with Qwen 3.5 27B for infrastructure models.
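The "agents can use secrets but never see them" idea boils down to never exposing a read path: agents hand the broker a function to run, and the plaintext never crosses back over the boundary. A sketch of that pattern in Go, with the caveat that Broker, Set, and Use are illustrative names for the shape of the design, not aurelia's actual API:

```go
package main

import "fmt"

// Broker holds secrets on the authority node. There is deliberately
// no Get method; callers ask the broker to use a secret on their behalf.
type Broker struct {
	secrets map[string]string
}

func NewBroker() *Broker { return &Broker{secrets: map[string]string{}} }

// Set is the human-only write path (aurelia set secret over SSH).
func (b *Broker) Set(name, value string) { b.secrets[name] = value }

// Use runs fn with the secret value but never returns it, so nothing
// upstream can log it or write it to disk.
func (b *Broker) Use(name string, fn func(value string) error) error {
	v, ok := b.secrets[name]
	if !ok {
		return fmt.Errorf("unknown secret %q", name)
	}
	return fn(v)
}

func main() {
	b := NewBroker()
	b.Set("postgres/backup", "s3cret")

	// An agent authenticates a backup job without ever holding the value.
	err := b.Use("postgres/backup", func(v string) error {
		fmt.Println("connecting with a", len(v), "character password")
		return nil
	})
	fmt.Println("err:", err)
}
```

In the real cluster the callback boundary is a network hop over the mTLS channel rather than a function call, but the invariant is the same: the secret is consumed where it lives, not copied to where the agent runs.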
This connects directly to the maestro design: constrained toolsets, scoped permissions, agents that can do their job without being handed the keys to the kingdom.
A type for every concept
I've been so focused on plumbing and infrastructure that I hadn't established any patterns for writing business rules. Every software developer needs a way to express them, and I wanted something that felt better than longhand imperative Go while still being well and truly Go.
axon-rule started as a specification library but ended up focused on types for expressing business rules and combinators for wiring them up. The API went through several rapid iterations in a single week: Spec became Rule, IsSatisfiedBy became Check, Result became Violations, and the context type widened from map[string]any to any. That was me trying to build something with it, working through the ergonomics with Claude. Every time I saw a string I asked "what's that?" until we landed on a type, the same thing I'd do if I were pairing with another engineer. You end up with violation codes derived from context type names instead of magic strings.
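A minimal sketch of where that landed, using the names above. The exact signatures are my guess at the shape of axon-rule, not its actual definitions; the key move is that a violation's code comes from the context's type name via reflection, so there is no string to typo:

```go
package main

import (
	"fmt"
	"reflect"
)

// Violation carries a typed context; its code is derived from the
// context's type name rather than a magic string.
type Violation struct {
	Context any
}

func (v Violation) Code() string {
	return reflect.TypeOf(v.Context).Name()
}

type Violations []Violation

// Rule checks a subject and reports zero or more violations.
type Rule[T any] interface {
	Check(subject T) Violations
}

// All is a combinator: every rule runs and violations accumulate.
type All[T any] []Rule[T]

func (rules All[T]) Check(subject T) Violations {
	var out Violations
	for _, r := range rules {
		out = append(out, r.Check(subject)...)
	}
	return out
}

// Example domain: a journal entry must balance.
type Entry struct{ Debits, Credits int }

// BalanceMismatch is a typed violation context; its name is the code.
type BalanceMismatch struct{ Debits, Credits int }

type balanced struct{}

func (balanced) Check(e Entry) Violations {
	if e.Debits != e.Credits {
		return Violations{{Context: BalanceMismatch{e.Debits, e.Credits}}}
	}
	return nil
}

func main() {
	rules := All[Entry]{balanced{}}
	for _, v := range rules.Check(Entry{Debits: 100, Credits: 90}) {
		fmt.Println(v.Code()) // prints the type name: BalanceMismatch
	}
}
```

Because the context is a real struct, callers can also type-switch on it to get the debits and credits back out, instead of parsing a message string.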
The pattern was overdue. axon-book got journal entry validation with typed BalanceMismatch contexts. axon-synd got post lifecycle rules. axon-gate got approval resolution rules. All replacing ad-hoc imperative checks.
Making the pieces smaller
A refactoring wave rolled across the axon modules this week. The Ollama adapter got deprecated and replaced with an OpenAI-compatible adapter across lamina, vita, imago, and axon-talk. We'd proved that vllm-mlx worked well for local needs, so the heavy dependency could go. Packages got extracted: axon-push from axon's SSE infrastructure, axon-tape from axon/stream. And luthier.yaml manifests were added to nearly every axon module.
The manifests are what let luthier understand the component catalogue. Every axon module now describes itself in a way the factory can consume. But the real point of the refactoring is size. Most axon primitives are now under 1,000 lines of Go. Small modules don't bloat context windows, and they make the factory's output more predictable. An LLM can hold an entire module in context, understand what it does, and compose it with others. LLMs love patterns. A catalogue of small, consistent components that all follow the same conventions is exactly the kind of thing they're good at reasoning about.
The rest
We're watching mesh-llm closely, thinking about how we could offer factory floor services across a mesh. Really cool to see projects like this coming from Michael Neale. It shipped a CPU-only Linux build, an autoupdater workflow, integration smoke tests, and five releases this week. Worth keeping an eye on.
axon-chat got a full React frontend rewrite: chat page, agent editor, task drawer, conversation management. Three releases. It's still very much a work in progress and probably flaky in ways I haven't found yet, but the bones are there.
I also ran an experiment to see if the luthier pattern could scaffold non-axon apps. I pointed it at a known spec, a simple bookmarks app, and had it produce Rails and Phoenix versions. It worked well. The techniques luthier uses to compose axon components translated surprisingly directly to other frameworks. I don't have time to invest in that right now, but it confirmed something worth coming back to.
Phase 4 of the werkhaus build plan assigns luthier work to the factory itself. That's the moment I keep coming back to: the factory improving its own toolmaker. We're not there yet. There are bugs, there's no telemetry, the ledger isn't wired up. But the pipeline is end-to-end: PRD in, scaffold out, build executed, output on the workbench. The interesting question isn't whether it works. It's what happens when the thing that builds your tools starts deciding which tools it needs.