A day after publishing Renting Competence, I sat back down with the same question and pushed on it harder. The essay had a thesis: if you let an LLM hold the model of your system, you keep shipping but stop knowing. It named modes — AUTO, REVIEW, COACH, SOLO — and gave a research grid. It read well.
It also wouldn't survive contact with a real team on a real Monday morning.
This post is what happened when I tried to make it survive. I ran the framework against three reference scenarios with the new vocabulary, watched what changed and what didn't, and ended the day with something that looks less like an essay and more like an operating system: a kickoff brief, a recipe catalog, an introduction sequence for retrofit projects. None of it is finished. Some of it is wrong. But the shape is now something a tech lead could pick up on Monday and run, not a manifesto to nod at.
Here is the day's work, the patterns that emerged, and the parts I would change next.
The original essay's hidden assumption
Renting Competence assumed the four-mode dial — AUTO/REVIEW/COACH/SOLO — was the framework's load-bearing piece. Pick the right mode for the right work, hold the right border tests, and you preserve real expertise.
Three things broke when I pressure-tested it.
First, SOLO is unenforceable. Telling a team "for high-stakes work, don't use the LLM" produces either lying or mutiny. Engineers will open the LLM and not tell you. The label was honest about intent and dishonest about reality. The LLM is always present now. The honest replacement is what I now call DUO: the human produces the artifact, the LLM is a bounded second voice. Two structural constraints make it testable: the artifact you ship is the human's draft, not an LLM revision; and the questions to the LLM are bounded — "what would break if I did X?" — not "draft this for me."
Second, the mode dial does less work than I claimed. Renaming SOLO to DUO is honest, but the behavior change comes from what's underneath the label: the recipes, the artifacts, the rituals. You can run a team with no mode-dial vocabulary at all, as long as you have the recipes. You cannot run a team with the vocabulary and no recipes. The mode dial earns its slot only because it makes the recipes legible — "DUO mode means use the reasoning-probe recipe" is a useful shorthand. Strip the recipes away and the labels are decoration.
Third, the essay claimed too much. "Reviewing doesn't work" is the kind of line that gets shared on LinkedIn. It's also wrong. Passive consumption — reading code and approving it, taking the quiz alone afterward — produces poor retention. The research is solid on that. Active interrogation — a colleague asking "walk me through what happens at T+30 days" and requiring a written answer — is a different mechanism, and it does work. The difference is whether the reviewer is producing-with-external-check or just consuming. An essay that conflates the two will get cited to mean something I don't believe.
So the day's first move was demoting the dial and elevating two things the essay had buried: the recipe layer beneath the modes, and the kickoff brief that lives upstream of all of it.
What the day actually produced
By the end of Monday I had eight new files in docs/framework/ and three rerun analyses in docs/brainstorm/. The shape stabilized into ten components, of which the dial is one and not the most important.
The framework now claims a project's competence is not a property of who's smart on the team. It's the joint product of:
The work shape (greenfield product, vendor integration, decision-and-architecture, operational platform — six shapes total, each with different default touchpoints).
The stakes (which collapse the mode dial toward DUO at the high end, expand it toward AUTO at the low end — stakes is a multiplier, not an axis).
The mode per work stream (six modes: AUTO / REVIEW / COACH / DUO / SHADOW / OFF — and yes OFF earned its slot, on which more below).
The essentials each named owner must hold (3–7 per owner, derived from the touchpoints, stated as "answer cold to a colleague who pushes back, no LLM, in five minutes").
The touchpoints that surface those essentials in the team's normal operating rhythm.
The border tests that detect when the human is no longer holding the model.
The capture mechanisms for SHADOW mode, where the human teaches the LLM as they work and patterns accumulate in repo files (this is where claude-shadow-learn does the actual work).
The anti-sycophancy posture — quarterly adversarial probe to check that pattern files haven't drifted from facts into preferences.
The production-vs-learning trade-off declared per stream, not per project.
The handoff and LLM-down protocols — what survives when the team disperses, what the team does when the LLM is unavailable for four hours during the worst possible moment.
What this list does is refuse to be one big idea. Each row is its own artifact, owned by someone, run on a schedule, with a recipe; one row, made concrete, looks like the sketch below.
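For illustration only: the stream, owner, and essentials here are invented, but every field comes straight from the list above.

```
Stream:      payments-integration                          [invented]
Mode:        REVIEW (stakes high, collapsing toward DUO)
Owner:       the senior who approved the vendor choice
Essentials:  3-7 statements, e.g. "why idempotency keys over
             dedup-on-read, and what breaks without them"  [invented]
Touchpoint:  reasoning probe at PR review; monthly cold-read of the runbook
Border test: answer cold to a colleague who pushes back, no LLM, five minutes
Recipe:      reasoning-probe-code-review.md
```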
The three new modes
Three additions to the dial pulled their weight.
DUO replaces SOLO. The human holds the pen; the LLM is a bounded interlocutor. The artifact you commit must be the human's, and the LLM is asked closed questions, not open prompts. This sounds small. The reason it matters is the border test — a week later, can the human answer cold questions about what they wrote, with no LLM, to a colleague who pushes back? DUO passes the test if the human really drafted; it fails if the human surreptitiously had the LLM rewrite their draft. SOLO had no such test because nobody believed it.
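A sketch of the boundary in practice; the first bounded question is the post's own example, the others are mine:

```
Bounded (DUO): the human has already drafted
  - "What would break if I did X?"
  - "Which failure mode of this rollout order am I missing?"

Open (not DUO): the pen moves to the LLM
  - "Draft this for me."
  - "Rewrite my draft so it's tighter."
```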
SHADOW is where the human teaches the LLM while doing the work. This is what claude-shadow-learn operationalizes — patterns, entities, and playbooks that accumulate in repo files, with corrections in the form "Don't [X]. Instead [Y]. Because [Z]." The mode matters because the right answer to "I keep telling the LLM the same thing" is not "type harder" — it's "capture it as a pattern file the LLM reads." SHADOW is COACH with the roles reversed: instead of LLM teaching the human a domain the human doesn't know, the human teaches the LLM the team's specific conventions. Production output and learning are simultaneous.
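A minimal sketch of such a pattern file, assuming a repo-local path like patterns/api-conventions.md; the path and both corrections are invented, but the Don't/Instead/Because shape is the one described above:

```
# patterns/api-conventions.md
- Don't return raw database errors from handlers.
  Instead: map them to the error codes clients already key retries off.
  Because: downstream retry logic breaks silently on unknown codes.

- Don't add a new env var without a default.
  Instead: default to the staging value and log a warning at boot.
  Because: the deploy pipeline does not fail on missing vars.
```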
OFF is the one that surprised me. I thought OFF would be a fringe case — deep-expertise areas where LLM autocomplete actively hurts a senior. It is that, but the bigger use is operational. Incident-time decisions belong to a human, not because the LLM is bad at incidents, but because of accountability and latency. When the LLM is up but you're 12 minutes into a production incident, opening it for "should I rollback or hotfix?" is wrong on two counts: the LLM doesn't carry the consequences, and the round-trip costs you 30 seconds you don't have. Pager-Payments — the 2am incident scenario — surfaced OFF as the right incident-time mode even when the LLM is available. Same applies to CTO-stewardship questions where the LLM has no access to your team's three-year trajectory and would generate a plausible average-of-internet answer.
OFF is not "LLM is bad here." It's "the decision must be human-owned for accountability, latency, or weight-carrying reasons."
The kickoff brief is the load-bearing artifact
If I had to throw away nine of the ten components and keep one, I would keep the kickoff brief.
It's a 30-minute document filled in at project birth. Ten sections: work shape, stakes, conversations that will land in months 4–12, owners per conversation, essentials per owner, the rented-competence ledger (what we are choosing not to retain, with the cost of each gap named), default mode per work stream, first-instance protocol for anyone doing something new, LLM-down protocol, anti-sycophancy posture for projects using SHADOW.
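As a sketch, the skeleton is ten headings and a set of brackets. The section names are from the list above; the bracketed prompts are my paraphrase.

```
# Kickoff brief: [project name]
1.  Work shape:   [one of the six shapes]
2.  Stakes:       [and which way they push the mode dial]
3.  Predicted conversations, months 4-12:
                  [five actual sentences someone will say, not topic areas]
4.  Owners:       [a name per conversation, not a role]
5.  Essentials:   [3-7 per owner, answerable cold in five minutes]
6.  Rented-competence ledger:
                  [what we choose not to retain, each gap's cost named]
7.  Default mode per work stream:
                  [AUTO/REVIEW/COACH/DUO/SHADOW/OFF, plus the
                   production-vs-learning call for that stream]
8.  First-instance protocol: [what anyone doing something new must do first]
9.  LLM-down protocol:       [what the team does during a four-hour outage]
10. Anti-sycophancy posture: [required for SHADOW; the reruns argue for
                              DUO decision work too]
```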
The act of filling it is the work. Vague stakes mean every downstream decision is vague. If the senior cannot write five sentences someone will say in six months — actual sentences, not topic areas — they don't yet know the project's shape and should not have started building.
I tested this against the Marketplace-PostHog scenario I had used as the canonical worked example for the original essay. The original analysis was retrospective: given the seven failures that surfaced in one week, what cluster of competence had been missed and what touchpoint would have caught it. Useful for diagnosis. Useless on Friday afternoon when the senior is approving the PR.
The brief inverts this. Had section 3 — predicted conversations — been filled in at the moment the senior approved the original PostHog PR, it would have surfaced five of the seven failure modes upfront. Once you've written "in six months a PM will ask: is this metric real or data drift?", you cannot also write "analytics isn't serious." The two sentences contradict each other.
Across all three reruns I did today (Marketplace-PostHog, Pager-Payments, Maugry-VectorDB), section 3 of the brief was the single most powerful intervention. It moved the framework from retrospective analysis to prospective prevention. Whatever else changes about the framework, the prediction test stays.
Recipes are the anti-ceremony layer
Naming a touchpoint — "we hold a quarterly cost review" — does nothing on its own. Within a quarter the touchpoint becomes ceremony: a calendar invite in which nobody pushes back, a Slack post nobody reads. The framework's response is the recipe: a one-page artifact a tech lead can run on Monday morning without further interpretation. Every recipe carries the same fields: trigger, owner, time cost, the interrogation (the exact questions asked), passing answer, failure response, and what it does NOT replace.
The first full recipe is reasoning-probe-code-review.md. It fires on PRs that touch a reviewer's essentials area or are someone's first instance in a domain. The reviewer picks two or three questions from a rotating list — "what does this code do at T+30 days that it doesn't do at T+30 minutes?", "walk me through what happens if [upstream service] returns [edge case] right here", "if [external party] asked you to defend this choice, what would you say?" — and requires written answers before approval. Written, not Slack-verbal. The writing is the work.
The marker of a passing answer is whether it could only have been written by someone who built the model. Generic answers fail because the LLM could write them; specific-to-our-context answers pass because they require the model. This is the same border-test logic, applied at PR time instead of weekly review.
What makes this recipe survive contact with reality is the failure-response field: "Not block the PR." The PR can ship if the work is non-essential to the reviewer's owned areas. The gap is logged for the developer's next 1:1 and counts toward "this is your first instance, we should pair on the next one." Failure is signal, not punishment. The recipe dies the first time it conflicts with shipping if it's a blocking gate; it survives if it's a detector.
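Assembled from those fields, the recipe reads roughly like this. It is condensed from the post's description of reasoning-probe-code-review.md; the time-cost and does-NOT-replace entries are placeholders the post doesn't specify.

```
# reasoning-probe-code-review.md
Trigger:    PR touches a reviewer's essentials area, or is someone's
            first instance in a domain
Owner:      the reviewer
Time cost:  [not specified in the post]
Interrogation (pick 2-3 from a rotating list):
  - "What does this code do at T+30 days that it doesn't at T+30 minutes?"
  - "Walk me through what happens if [upstream service] returns
    [edge case] right here."
  - "If [external party] asked you to defend this choice, what
    would you say?"
Passing answer:   written, and answerable only by someone who built the
                  model: specific to our context, not generic
Failure response: do NOT block the PR. Log the gap for the next 1:1; it
                  counts toward "this is your first instance, we should
                  pair on the next one."
Does NOT replace: [not specified in the post]
```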
Six more recipes are stubbed — cold-read runbook drill, first-instance COACH session, ADR with weighted dimensions, sycophancy adversarial probe, LLM-down chaos drill, end-of-project handoff. Each will follow the same format. The reruns established their priority order empirically; ADR-with-weights and the two operational drills were the highest-frequency gaps across three diverse scenarios.
Seven patterns from the reruns
I ran three scenarios through the new framework today and the patterns that emerged matter more than any individual rerun.
One: the strongest framework win is upstream, not downstream. In every rerun, section 3 of the kickoff brief — predicted conversations — did the heaviest lifting. Recipes operationalize what the prediction surfaced; the prediction itself is the load-bearing artifact.
Two: mode-dial relabeling is honest but not transformative. DUO/SHADOW/OFF are testable where SOLO was wishful, but they don't change behavior on their own. Behavior change comes from the recipes, the artifacts, and the first-instance protocol. If the framework had to drop a layer, it would be the dial.
Three: two recipe gaps surface in every rerun. ADR with weighted dimensions (every decision-bearing scenario) and the operational drills (cold-read runbook + LLM-down chaos drill — both Pager-relevant but also touched by Marketplace's GDPR runbook and Maugry's ops-readiness review). These two are now empirically the highest-priority recipes to draft.
Four: sycophancy bites differently than predicted. I had originally framed it as a SHADOW-mode-only failure (pattern files drift toward agreeable preferences). Maugry-VectorDB rerun showed it bites in ad-hoc DUO-mode decision work too: when the LLM is a "second voice" on a multi-step decision memo, it agrees more readily as the conversation proceeds. The kickoff brief's anti-sycophancy posture should apply to any project using DUO for decision work, not just SHADOW.
Five: production-vs-learning is per-stream, not per-project. Some streams in the same project are production-only (a one-off governance memo); others are learning-also (a schema spec that becomes a SHADOW pattern file). The brief handles this correctly at section 7; the framework just had to make the per-stream nature explicit.
Six: never-built versus atrophied is the most actionable single distinction. The two demand different recipes. Never-built means the person never had the model — first-instance COACH for them. Atrophied means the senior had it but hasn't refreshed it — cold-read drill for them. The original framework conflated these; the new framework prescribes different interventions.
Seven: OFF mode earns its slot for stewardship and incidents. Marketplace-PostHog has almost no OFF cases (analytics isn't deep-expertise for anyone). Pager-Payments has OFF as the right incident-time mode. Maugry-VectorDB has OFF as the right mode for CTO-stewardship questions. OFF is rarer than the other modes but irreplaceable where it applies.
What the framework still doesn't have
I want to be honest about what's missing, because the failure mode of frameworks is to keep adding components until they look complete on paper.
The recipes are mostly stubs. Six of the seven are titles in an INDEX.md with priority ordering and no body text. The reasoning-probe recipe is the only one that reads like a thing a tech lead could run unmodified.
The framework hasn't been tested on non-engineering work. The third cluster — stewardship — was supposed to generalize to teaching and writing, but I have not run a non-engineering scenario through the new vocabulary. The Lecture-AI-Workshop scenario is sitting in the brainstorm file waiting for a rerun.
The handoff problem is partially answered, not solved. Playbooks (in shadow-learn) live in the repo and persist past team dispersal. That's a start. But there's no recipe yet for "the engineer who built half the auth code is rolling off — who owns this now?" The end-of-project handoff recipe is the highest-priority unwritten one for projects that actually end.
The framework applies to itself, which means the modes apply to running the framework too. Mandate-shaped rollout — "we will run all seven recipes starting next sprint" — is a violation of the framework's own anti-ceremony stance. The 90-day introduction sequence I wrote for retrofit projects starts with one recipe, run for a quarter, then calibrate. This is slower than people will want. It is also what the framework's own logic requires.
What I would tell a CTO reading this
If you are starting a new project this week, fill in the kickoff brief. The 30 minutes is the entire investment for the first month. Section 3 — predicted conversations — is the part that does the work. If you cannot write five sentences someone will say in six months, you are not yet ready to start building. Push back on the start date.
If you have an existing project, do not roll out seven recipes at once. Inventory your existing touchpoints (sprint planning, code review, on-call, retros) and your actual fires from the last twelve months. Cross-reference them. Pick the single biggest gap. Write the recipe for that one thing. Run it for a quarter. Then add a second. After four quarters, you will have five active touchpoints all producing signal. It will have taken a year. There is no faster path that produces real change instead of mandate-shaped fluency drag.
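As a sketch, the sequence compresses to a checklist. The quarter boundaries follow the paragraph above; the recipe choices are whatever your own gap analysis produces.

```
Q1: Inventory existing touchpoints + last 12 months of fires.
    Cross-reference. Pick the single biggest gap. Run ONE recipe.
Q2: Calibrate the first recipe. Add a second.
Q3: Add a third. Retire anything that has drifted into ceremony.
Q4: Target state: five active touchpoints, all producing signal.
```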
Independent of either: distinguish never-built from atrophied when something goes wrong on your team. The same essentials gap demands different interventions depending on which one you have. A senior who has lost the model needs a cold-read drill; a junior who never built it needs first-instance COACH. Conflating them produces wrong-shape responses and lasting frustration.
Be honest with yourself about which streams in your project are production-only and which are learning-also. Three engineers each prototyping one tool produces an answer fast and concentrates the comparative knowledge in one head. Three engineers each prototyping all three tools produces the same answer slower and distributes the knowledge. Both are valid. Choose, and write the choice down.
For the love of god, do not assume your LLM provider's outage and your own incident are independent events. They are not. If you have any production system, you need a paper-printed runbook for the top three incident shapes, drilled before launch and once a quarter. The on-call who is paged at 2am while their LLM is down is not a thought experiment. It is the next quarter.
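A minimal sketch of one printed page, assuming your top three incident shapes come out of the same twelve-month fire inventory; every bracket is yours to fill:

```
# Printed runbook: [incident shape 1 of 3]
Symptoms:     [what the page actually says]
First 5 min:  [rollback or feature-flag off; which dashboard, by URL]
Decision:     [rollback vs hotfix criteria; the human on call
               decides — OFF mode]
Escalation:   [names and phone numbers, not Slack handles]
Last drilled: [date; before launch, then quarterly]
```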
Where this goes next
The next move is drafting the four highest-priority recipes — ADR with weighted dimensions, cold-read runbook drill, LLM-down chaos drill, sycophancy adversarial probe. This converts the framework from "names what's missing" to "provides what's missing." Most of the operational value is here.
After that, the Maugry-specific instantiation. I am the CTO who wrote this framework. The reruns showed the framework prescribes different things for me as CTO than for me as senior lecturer than for me as founder. Each role's essentials are different. Each role's touchpoints are different. The framework's claim is that I should be able to write down those essentials in 30 minutes per role. I have not done that yet. The fact that I haven't is itself signal: writing down what I should be able to answer cold is harder than reading about a framework that says I should be able to answer cold.
This is the work of November. I will report back.