Week notes: 3–10 April 2026
GP//2026 Benjamin Askins
Generative Plane

You can't improve what you can't measure. But before you can measure, you have to build the instruments. This week was supposed to be about running the software factory and seeing what falls out. Instead it became about building the gauges, the traceability, the accounting — all the things you need to know whether what the factory produces is worth what it costs to produce.

Sixteen runs and one clear winner

The factory pipeline — werkhaus orchestrating work-lead → plan-lead → code-lead → code-hand → scan-lead — can now scaffold, build, QC, and sign an agent from a PRD. The next question is: how well does it actually do that?

I ran two tasks through the pipeline repeatedly, tweaking the harness, prompt, or model backend between each build. Nine runs of changelog-hand, seven runs of ship-lead. Each run produces a full repo as its artifact, so you can diff not just pass/fail but the actual code structure, the failure point, the reasoning path. The run names tell the story: gemini-scaffold-fail, gemini-parse-spiral, gemini-no-toolcalls, qwen-ratelimit, loop-bug, pattern-guard-bug.

I honestly thought I'd find a model competitive with Qwen 3.5 on both performance and price. I don't think it exists. Gemini 2.5 gave me a particularly hard time — it won't consistently call the tools available to it. There's a commit in axon-code that says "add tool_choice auto for Gemini, it won't call tools without it" and another for "profile-driven prompt composition per LLM family." You end up working around model-specific quirks at the infrastructure level. Qwen 3.5 is just a lot more predictable.
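As a sketch of what that kind of profile-driven composition might look like — the types, names, and model-matching rules here are invented for illustration, not the actual axon-code API:

```go
package main

import (
	"fmt"
	"strings"
)

// Profile captures per-LLM-family quirks the harness has to absorb.
// Illustrative fields only, not the real axon-code types.
type Profile struct {
	Family     string
	ToolChoice string // "auto" forced for Gemini, empty otherwise
}

// profileFor picks a request profile from the model identifier.
func profileFor(model string) Profile {
	switch {
	case strings.Contains(model, "gemini"):
		// Gemini won't call tools unless tool_choice is set explicitly.
		return Profile{Family: "gemini", ToolChoice: "auto"}
	case strings.Contains(model, "qwen"):
		return Profile{Family: "qwen"} // predictable defaults, no workarounds
	default:
		return Profile{Family: "generic"}
	}
}

func main() {
	fmt.Println(profileFor("google/gemini-2.5-pro").ToolChoice) // auto
}
```

Pushing the quirks into a profile table like this keeps the agent loop itself model-agnostic: the loop asks for a profile and never branches on model names directly.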

The best lessons come from trying again and again, tweaking parameters between runs. The ship-lead eval runs surfaced bugs in axon-loop itself — the agent got stuck in exact-match tool call cycles, and I had to add pattern-based loop detection and a hard kill to the loop infrastructure. The agent was literally spinning in circles until I fixed the plumbing.
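A minimal sketch of what pattern-based loop detection can look like, assuming the loop keeps a flat history of tool-call signatures — the function and its parameters are illustrative, not axon-loop's API:

```go
package main

import "fmt"

// detectLoop reports whether the tail of history repeats a pattern of
// length patLen at least minRepeats times, e.g. A,B,A,B,A,B. A hard
// kill in the agent loop would fire when this returns true.
func detectLoop(history []string, patLen, minRepeats int) bool {
	need := patLen * minRepeats
	if patLen == 0 || len(history) < need {
		return false
	}
	tail := history[len(history)-need:]
	for i := patLen; i < need; i++ {
		// Every element must match its counterpart one pattern earlier.
		if tail[i] != tail[i%patLen] {
			return false
		}
	}
	return true
}

func main() {
	calls := []string{"read_file", "build", "read_file", "build", "read_file", "build"}
	fmt.Println(detectLoop(calls, 2, 3)) // true: the read/build pair repeats three times
}
```

Exact-match detection only catches a single repeated call; checking for repeating *patterns* of calls is what catches the two-step and three-step cycles the agent actually fell into.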

Seven new axon modules

The eval work drove a rapid expansion of the shared platform. Seven new axon modules this week:

axon-code implements the shared coding tools — file editing, gofmt, go mod tidy, build diagnostics, and gopls-based LSP tools for interactive type feedback. The code-hand loop is built specifically for implementing features on a brand new scaffold, and axon-hand (also new) provides the CLI chassis every agent is built on. They need slightly different loops and contexts but the same set of tools, so axon-code is where those tools live for reuse.

The LSP integration gives coding agents something closer to what a human developer gets from an IDE — real-time type feedback, automatic formatting. I haven't run proper evals on the LSP impact yet, but the broader lesson from first principles is clear: AST and LSP tools improve agent iteration speed because they can validate correctness as they go. This won't be news for anyone building harnesses, but approaching it from first principles rather than cargo culting patterns from existing tools means I'm learning a lot more along the way.

axon-cost handles token and cost accounting with per-label and per-model breakdowns and budget alert callbacks. axon-scan is the QC pipeline with sequential and parallel execution modes, finding deduplication, and attestation generation. axon-sign implements Ed25519 signing, SSHSIG-format Git commit signing, and key rotation. axon-snip provides scaffold snippets with current module APIs. axon-lore records architecture decision records.
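To make the per-label cost accounting concrete, here is a minimal sketch of aggregation against a per-model rate table — the types are invented for illustration, not axon-cost's actual API, and the rates are made up:

```go
package main

import "fmt"

// Usage is one LLM call's token counts, tagged with its pipeline stage.
type Usage struct {
	Label            string // e.g. "code-hand"
	Model            string
	PromptTokens     int
	CompletionTokens int
}

// Rate holds USD prices per million tokens. Values are illustrative.
type Rate struct{ Prompt, Completion float64 }

// costUSD totals spend per label across all recorded usages.
func costUSD(usages []Usage, rates map[string]Rate) map[string]float64 {
	byLabel := map[string]float64{}
	for _, u := range usages {
		r := rates[u.Model]
		byLabel[u.Label] += float64(u.PromptTokens)/1e6*r.Prompt +
			float64(u.CompletionTokens)/1e6*r.Completion
	}
	return byLabel
}

func main() {
	rates := map[string]Rate{"qwen-3.5": {Prompt: 0.5, Completion: 1.5}}
	usages := []Usage{{Label: "code-hand", Model: "qwen-3.5",
		PromptTokens: 2_000_000, CompletionTokens: 1_000_000}}
	fmt.Printf("%.2f\n", costUSD(usages, rates)["code-hand"]) // 2.50
}
```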

Across the existing modules: axon-talk got token usage parsing for both OpenAI and Anthropic streaming and non-streaming responses, plus session ID support for OpenRouter request grouping. axon-base absorbed database handling from the core axon module, switching from golang-migrate to goose. axon-rule got guard-driven state machines. aurelia grew a secret caching layer and diagnostic tools.

Instruments and accountability

I don't need cost visibility as such. I just want the data. If I can reason about every step in a build — how much it cost, what tokens were consumed, at which stage — and compare that with the output, then I can weigh CoGS against the generated output and its value. Cost of goods sold, per generated agent.

That framing drove a stack of instrumentation work through werkhaus and the axon modules. Hierarchical session IDs flow through axon-hand with a card/state/role/step structure so every OpenRouter request is traceable to its place in the pipeline. JSON eval envelopes carry per-stage token accounting. axon-cost aggregates it all with per-model breakdowns.
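Composing that card/state/role/step hierarchy into a single identifier might look like this — the separator, zero-padding, and function name are assumptions, not the real axon-hand code:

```go
package main

import (
	"fmt"
	"strings"
)

// SessionID joins the card/state/role/step hierarchy into one
// identifier attached to each OpenRouter request, so every call
// is traceable to its place in the pipeline.
func SessionID(card, state, role string, step int) string {
	return strings.Join(
		[]string{card, state, role, fmt.Sprintf("step-%03d", step)}, "/")
}

func main() {
	fmt.Println(SessionID("changelog-hand", "build", "code-hand", 4))
	// changelog-hand/build/code-hand/step-004
}
```

With IDs structured like this, grouping spend by any level of the hierarchy is just a prefix match.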

And then there's the attestation chain. When scan-lead finishes QC, axon-scan produces an ATTESTATION.md and attestation.json. axon-sign signs the attestation with two keys: the agent's key, and the key of the human who delegated their accountability to the agent. Dual signing. The agent can prove what it checked. The human can't deny they authorised the work. It's a governance model — accountability that survives the handoff from human to machine.

Revue: Macrodata Refinement for code review

Revue started as an experiment in using RSVP — rapid serial visual presentation — to quickly detect smells in LLM-generated code. Rather than scrolling through a pull request and getting distracted, it presents focused chunks with optionally generated LLM commentary. As I was building it, it started feeling a little dystopian — a little "Macrodata Refinement" — so I leaned into it. The timer. The increasing urgency. Configurable review speed with four presets.

You can think of this as an art project and an experiment in DX to help humans keep up with the output of non-human agents. The heuristic mode works without an LLM at all — just rapid sequential presentation with pattern-based detection. The LLM commentary is an enhancement, not a dependency.
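A toy version of that heuristic mode is just a pattern table run over each review chunk, no model in the loop — the patterns and names here are invented for illustration, not Revue's actual rules:

```go
package main

import (
	"fmt"
	"strings"
)

// smells is a tiny pattern table standing in for the heuristic mode;
// a real tool's patterns would be more elaborate (regexes, AST checks).
var smells = map[string]string{
	"TODO":        "unfinished work",
	"interface{}": "untyped escape hatch",
	"_ = err":     "swallowed error",
}

// flag returns heuristic notes for one review chunk, LLM-free.
func flag(chunk string) []string {
	var notes []string
	for pat, why := range smells {
		if strings.Contains(chunk, pat) {
			notes = append(notes, pat+": "+why)
		}
	}
	return notes
}

func main() {
	fmt.Println(len(flag("resp, err := call(); _ = err // TODO handle"))) // 2
}
```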

Just an art project for now. I might introduce something without the drama and theatre — the revue, if you will — that helps me focus during code review and take it one chunk at a time.

Pitching, not performing

I've been interviewing for a role, and I asked the business if I could pitch them rather than going through a traditional interview process. I'm not a fan of the "gotcha" style of questioning that has crept into tech interviewing. A question like "what would engineers who've worked with you say sucks about working with you" is dismissive and disrespectful — it doesn't value the interviewee's time.

So I asked if I could pitch them on something instead. I built a complete ML training pipeline with autotuning and ONNX export, then wrapped it in a TUI app that combines Anthropic vision classification with ONNX tabular inference for machinery valuation. A full vertical slice: training pipeline, model export, inference engine, structured LLM prompting, and a terminal interface that ties it together. It went down well. I'm expecting an offer next week.


Approaching all of this from first principles — the agent harness, the eval methodology, the instrumentation — rather than reaching for existing frameworks means the learning compounds differently. Every failure mode I hit in the eval runs teaches me something about both the models and my own infrastructure. The Gemini tool-calling quirks, the loop bugs, the token accounting gaps: none of those would have surfaced the same way if I'd been configuring someone else's system. Sixteen runs of the same two tasks, each one slightly different, each one leaving a full repo behind as evidence. The factory is starting to produce legible output. Now I need to decide what that output is worth.