Here is a question I have been asking developers a lot lately:
"Why did you use Postgres here instead of Clickhouse?"
The answer I get more and more often is some version of "I don't know. The LLM did that."
Another: a senior pulls up code we are discussing. "Wait, why does this break here?" "I'm not sure — Cursor wrote that part."
These are not bad developers. They are competent people who have been using AI tools daily for a year or two. And they are slowly losing the ability to talk about their own work at a level above the prompt.
I want to lay out a framework for this, because I think most teams using AI tools are quietly producing what I have come to call rented competence — the appearance of skill that disappears the moment the LLM leaves the chat — and they do not yet know they are doing it. The framework is not a competency rubric. It is a process-design lens. The lever is not "tell people to learn more." The lever is the touchpoints your team's build runs on.
The two-list intuition
The first cut at this is intuitive: keep two lists. One for skills you do not need to retain (outsource to AI). One for skills you do (do yourself, or have AI teach you while it works).
This is roughly right, but it collapses two different questions into one. Who does the work and whether you learn from it are independent questions. The first is productivity allocation. The second governs what you will be able to do six months from now.
A more useful cut is a dial with five positions:
- AUTO — AI does it, you do not look closely. Boilerplate, formatting, mechanical refactors.
- REVIEW — AI does it, you skim and approve. Strategic but not skill-building.
- COACH — AI does it with you watching and asking, narrating tradeoffs. Deliberately leveling up.
- SOLO — You do it; AI stays out. Your edge, your judgment, or doing it yourself is the point.
- OFF — You decline the tool entirely for this work. Sometimes the right call.
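To make the dial concrete, here is one way a team might record the choice per task. A minimal TypeScript sketch, not a prescribed format; the field names are illustrative:

```ts
type Mode = "AUTO" | "REVIEW" | "COACH" | "SOLO" | "OFF";

interface ModeDecision {
  task: string;
  mode: Mode;
  rationale: string;   // for AUTO/REVIEW: what makes delegation safe here?
  essential?: string;  // for COACH/SOLO: which essential is this building?
}

// Illustrative entries. The point is that the mode was chosen on purpose.
const thisWeek: ModeDecision[] = [
  { task: "rename Reservation -> Booking", mode: "AUTO",
    rationale: "mechanical refactor; stable surfaces covered by tests" },
  { task: "design the event taxonomy", mode: "SOLO",
    rationale: "we own the blast radius for years",
    essential: "analytics data model" },
];
```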
The two unusual modes are COACH and OFF. Pure delegation and pure DIY both already exist; the middle and the explicit refusal are what is missing from most teams' working vocabulary. Most teams I talk to assume that if they "use AI thoughtfully" they will end up in COACH by default, and they assume the right amount of OFF is zero. Both assumptions are wrong.
The middle ground that turns out not to exist
When you actually try to operationalize COACH mode, something interesting happens. The interventions that look like middle ground — "I'll ask the AI to explain as it works," "I'll read the diff carefully," "I'll have it generate a quiz" — mostly do not produce retention.
The educational psychology here is unambiguous and decades old. Retention requires production by the learner, with an external check on the result. Bjork calls these desirable difficulties. Kapur calls it productive failure. Kalyuga's expertise reversal effect explains why heavy scaffolding actively hurts experts. Across all of these, the structural feature is the same: the learner has to generate something, and something or someone has to check it against ground truth.
Reading carefully is not production. Reading the AI's explanation is exposure, not acquisition. The MIT Media Lab's 2025 EEG study found that LLM-assisted writers could not quote essays they had just produced — pattern-matching to "looks correct" passes for understanding without producing it. Bjork called this the fluency illusion: smooth processing mistaken for mastery.
Worse, all the consumption-shaped interventions produce a check-the-box artifact ("I reviewed it," "I took the quiz," "I read the docs") that prevents anyone — including you — from noticing the gap. You are now further from the truth than before, because the introspective signal is corrupted.
METR's 2025 randomized trial of experienced developers showed exactly this: they were 19% slower with AI tools on real maintenance tasks in their own large codebases — but believed they were 20% faster. A 39-point gap between perception and reality. Self-assessment is unreliable. You do not know whether AI is helping or hurting your retention; only an external check can tell you.
So the practical choice is closer to binary than the smooth dial suggests. Either you treat this as essential (and apply the heavy production-checked-externally protocol) or you treat it as non-essential (and accept that the competence is now rented and will not survive the LLM leaving the chat). The genuinely middle option — faded scaffolding, where AI does it the first time with explanation, you do it the second with reminders, you do it the third alone — exists, but it is time-extended sequencing, not a per-task efficiency move. It costs the same as essentials, just amortized.
Two failure modes, not one
Most discussion of "AI and skills" assumes a single failure mode: skills you used to have are atrophying. That is real. But the more dangerous case for teams hiring juniors right now is different.
Skill atrophy is the senior who used to debug an SMTP issue in fifteen minutes and now flounders because he hasn't done one without an LLM in two years. Bad, fixable, recognizable to the senior.
Skill never-built is the three-year-into-the-job developer who has shipped twelve auth flows, all with LLM help, and still does not understand SMTP, token entropy, or what their own reset endpoint actually does after the email is sent. The LLM produced working code from day one. The competence was never built because nothing forced the build. There is no "before" to atrophy from.
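To make one of those gaps concrete: "token entropy" is the difference between a guessable reset token and an unguessable one. A minimal Node sketch of the detail that never got built (illustrative, not this developer's actual code):

```ts
import { randomBytes, createHash } from "node:crypto";

// 32 bytes from a CSPRNG is 256 bits of entropy: infeasible to guess.
// A timestamp or Math.random() seed would be brute-forceable.
const resetToken = randomBytes(32).toString("base64url");

// Store only a digest, so a leaked database cannot redeem live tokens;
// the raw token exists exactly once, in the email.
const storedDigest = createHash("sha256").update(resetToken).digest("hex");
```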
Skill never-built is invisible by construction. The PRs ship. The code reviews pass — usually approved by another LLM-using engineer who reads "looks like canonical reset code" as "is correct." Tenure accumulates. In two or three years this developer is a "senior" by job title with junior-shaped knowledge, sitting on hiring panels evaluating juniors. The pipeline silently produces shallower seniors each cycle. The research community calls this deskilling; the term understates it. Unskilling would be more accurate.
The fix is not the same as for atrophy. Atrophy responds to exercise. Skill never-built responds only to scaffolded first-build — the developer's first instance in any new domain has to be COACH mode, no exceptions, with a senior reviewing the spec the developer wrote before the LLM was opened. If the first instance is AUTO or REVIEW, the second instance has the same gaps. The learning literature is consistent on this point.
The line is the entire game
If the choice is mostly binary, then the border-drawing decision is the entire framework. Mode selection is a footnote. What is on the essentials list is what the framework rises or falls on.
A consequence falls out immediately: essentials lists must be small. Maintaining an essential is expensive. A list of fifty is fiction; it would consume your week. Realistic scope is probably five to ten active essentials at any time, with explicit cycling as your role and projects shift. For teams: each role gets a small essentials set scaled to seniority. Junior backend dev, three or four essentials this quarter. Senior, five to seven. The framework cannot pretend everyone is good at everything.
This connects to the T-shape model of engineers. The horizontal stroke (broad shallow knowledge) is now free for anyone with AI access — you can produce working code in any unfamiliar area within an hour. The vertical stroke (deep knowledge) is what AI cannot give you and what determines whether you can debug, design, hire, or judge. Without ritual, the T flattens to an "I" (only what you already learned), then to a "." (only the prompt). The two-lists framework is what keeps the T honest in the AI era.
The structural reframe: essentials are derived from process
Here is the move that I think most teams have not yet made, and that changes the framework from a competency exercise into a process-design problem.
You do not author the essentials list. You design the process touchpoints the team's build has. What the touchpoints demand becomes essential for the touchpoint owner. What no touchpoint demands falls off the list — not because anyone chose, but because no detector ever fires.
To see this concretely, take a developer who shipped a PostHog analytics integration six months ago. The PR was 200 lines, mostly LLM-written, approved without much scrutiny. Now several things happen the same week:
- The growth lead asks why the conversion funnel dropped 8% in two weeks. Is the drop real, or an artifact of the data?
- The mobile release ships Friday; on Saturday night iOS events drop 90% in PostHog while Sentry shows no errors.
- A new hire reads the code and asks: "Is `price` in cents or dollars? Why are some events client-side and some server-side?"
- The PostHog bill arrives — $4,200/month, up from $900. The founder asks: "Cut this in half without losing anything important."
- A GDPR deletion request arrives explicitly naming PostHog. Legal asks what data is sent, where it lives, and how long it is retained.
- A competitor blogs about a metric they track. The founder asks: "Can we?"
- PostHog announces a minor SDK change. The codemod runs, tests pass. Two weeks later, one specific event isn't firing on Android. The dev has no idea why.
Each is a recognizable moment in any operating company. They are not edge cases — they are the natural touchpoints a real software business runs against the team. Read each one as the output of a process:
- The growth lead's question is what a metrics-review process surfaces.
- The Saturday-night page is what an on-call rotation surfaces.
- The new hire's question is what an onboarding rotation surfaces.
- The cost question is what a quarterly tooling-cost review surfaces.
- The GDPR letter is what a compliance request workflow surfaces.
- The competitor question is what a competitive review surfaces.
- The deprecation failure is what a dependency-upgrade process surfaces.
If the team has those processes, the essentials get demanded of the analytics owner. If the team does not have them, the moments still happen — but no one notices the gap. The company has no detector.
This is the structural inversion. The CTO's lever is process design, not list-writing. Want pipeline mental models retained? You need on-call rotation with a real runbook (read cold, not skimmed). Want schema discipline? Onboarding rotation through the code. Want governance retained? Quarterly cost review and a compliance liaison. Drop the touchpoint and the corresponding essential disappears — not because anyone decided, but because nothing tests for it.
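To make "schema discipline" concrete, here is the kind of artifact that would have answered the new hire's question before it was asked. A hedged sketch assuming a TypeScript codebase; the event shapes are hypothetical:

```ts
// events.ts — single source of truth for the analytics taxonomy.
// Read cold during onboarding rotation; CI rejects undeclared events.

export interface CheckoutCompleted {
  event: "checkout_completed";
  /** Integer cents, USD. Never dollars, never floats. */
  price_cents: number;
  /** Server-side only: revenue events must not depend on the client. */
  source: "server";
}

export interface PageViewed {
  event: "page_viewed";
  path: string;
  /** Client-side: subject to ad-blockers and the consent gate. */
  source: "client";
}

export type AnalyticsEvent = CheckoutCompleted | PageViewed;
```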
This also gives a clean answer to specific questions like "should the dev know the LangFuse API?" Reframed: which process touchpoints in our build involve LangFuse, and who owns them? If there is an LLM cost-review touchpoint and the dev owns it, knowing LangFuse's cost reporting is essential. If there is no such touchpoint, the question is moot — the team is not detecting cost anomalies anyway.
The three competence clusters
Across worked examples — an analytics integration, a blockchain payment gateway adding a new chain, a 2am payment incident, a Stripe billing integration, a Postgres RLS migration, a real-time class platform, a vector-DB choice, even preparing a university lecture — three abstract clusters of competence keep emerging. The names specialize per work shape; the structure is robust.
System mental model. A head-model of the system you are integrating with or operating: how the pieces connect, where data flows, where it can be lost or corrupted, how each junction can fail. For analytics: client → consent gate → SDK → network → ingestion → query layer. For blockchain: account model, finality semantics, fee market, signing flow, reorg behavior.
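As a sketch of what holding that model might look like written down, for the analytics case (stage names are illustrative, not any vendor's actual internals):

```ts
// Every hop an event takes, annotated with how data is lost at that hop.
const pipeline = [
  "client",        // never fires: crash, ad-blocker, tab closed early
  "consent_gate",  // dropped by design: user opted out
  "sdk_buffer",    // queued locally, lost on unload before flush
  "network",       // request fails: offline, DNS, blocked endpoint
  "ingestion",     // vendor rejects: schema mismatch, rate limit, auth
  "query_layer",   // stored but invisible: wrong filter, sampling, timezone
] as const;

// Diagnosing "events are missing" means bisecting this chain,
// not re-reading the SDK documentation.
type Stage = (typeof pipeline)[number];
```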
This cluster has hard variants. When the source of truth lives partly in a vendor (Stripe owns half your billing model), the mental model must extend across the boundary — and an LLM is worse at this than at your own code, because vendor docs are exactly what training data smooths over. When the system extends into physical reality (real-time media, IoT, robotics), an LLM cannot simulate WiFi or codecs; you must touch real failures. When you migrate from one safety model to another (RLS replacing application-layer tenant isolation), the mental model must hold both contracts and the seams between them; migration is dual-system coexistence, not replacement.
This cluster is built by doing the diagnosis exercise, not by reading. The touchpoint that builds it is running synthetic incidents without AI in the room. Aviation calls this "manual flying day" and has been mandating it for forty years; pilots' manual skills decay measurably within two months without practice. The same applies to system models in software.
Judgment that requires owning the consequences. Choices AI cannot make for you because making them requires owning the blast radius. Forward-looking: the event taxonomy you will live with for years; which abstractions hold across chains and which need replacement; which subscription state-transitions count as "the same" customer. Reactive: hotfix vs rollback vs page; what to tell customers when root cause isn't confirmed. Cross-cutting: how a new feature interacts with auth and recording and moderation and billing simultaneously. Security boundaries: which bypasses are safe under what policy.
This cluster is the one AI is structurally bad at because it does not carry the consequences and cannot see your existing invariants. The touchpoint is demanding the artifact that captures the judgment. ADRs for design judgment. Postmortems with explicit "why did you choose what you chose" sections for operational judgment. Spec docs for taxonomy. Bypass-policy frameworks for security work. The act of writing the artifact is the work; reading other people's artifacts does not transfer the judgment.
Stewardship. Cost, compliance, customer trust, hiring, the long-term shape of your stack. The things that flow outward from the engineering work and accumulate over years.
Stewardship can be partly externalized in documents that get consulted at the right moment. But the policy decisions behind the documents — what data is acceptable to send, what risk posture you are taking, how customers should hear about a problem, what it costs to add a third data system to your stack for the next five years — must be SOLO. These are the calls a regulator or a customer or a CEO will eventually make you justify, and you cannot justify what you did not choose.
A note on generalization: this same three-cluster decomposition holds outside engineering. Preparing a 90-minute university lecture on AI-assisted programming has a domain mental model (what you actually believe about the topic, deep enough to handle off-deck questions), pedagogical judgment (what students need to hear, in what order), and stewardship (the multi-year compounding of your own integrated understanding). The artifact and the competence collapse into the same thing — the writing is the integration. LLM-drafted material that is not owned through rewriting is rented competence in exactly the same way an LLM-drafted PR is.
Production versus learning is a separate trade-off
When three engineers each prototype with a different vector database to inform a CTO's choice, the team produces one decision and learns three tools narrowly — one per engineer. When three engineers each prototype with all three databases, the team produces the same decision and learns three tools across all three engineers. The second costs roughly 3x in engineering time. It produces a different team.
This is the production-versus-learning trade-off and it shows up everywhere LLMs accelerate output. Faster delivery of a feature can come with learning or instead of learning. The framework will not tell you which to choose — that depends on how often the team will face similar work — but it will tell you the choice is being made, every time, whether you decide explicitly or by default. Default is "learning didn't happen and nobody noticed." Explicit is at least honest.
The CTO version of this is owning that the team's learning trajectory is a separate deliverable from the team's output trajectory. They can be aligned (deliberate practice rotations, faded-scaffolding apprenticeships) or anti-aligned (mandate fast LLM-assisted output, watch retention silently degrade). The trajectory you do not measure is the one that goes wrong.
AUTO is real, with preconditions
The framework can sound like a one-way essentials ratchet — everything is important, never delegate, don't trust the tool. It is not.
Renaming Reservation to Booking across an 80,000-line TypeScript codebase is a real piece of work, and AUTO is right for it. LLM does it in twenty minutes including type updates and tests; you review the diff in fifteen. Done. The Cluster 2 (judgment) and Cluster 3 (stewardship) loads are essentially zero. Only Cluster 1 has any load: knowing what stable surfaces in the codebase must not be renamed (a serialization tag still on disk, a public API name third parties depend on).
That is AUTO's precondition: the stable-surface boundaries are documented and enforced by tests. Without the inventory and the enforcement, AUTO occasionally breaks contracts silently. With them, AUTO is exactly what makes the rest of the framework affordable. A framework with no AUTO is just "do everything yourself" and is rejected by reality.
The corollary: a meta-essential at the codebase level — knowing where the stable surfaces are — sits underneath every legitimate AUTO use. That is a Cluster 1 cost the team pays once and reuses everywhere. Worth paying.
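What "documented and enforced by tests" can look like for that rename, as a minimal sketch assuming Jest; the Booking type, its serializer, and the field names are hypothetical:

```ts
import { test, expect } from "@jest/globals";
import { serializeBooking, type Booking } from "./booking";

test("on-disk keys survive the Reservation -> Booking rename", () => {
  const booking: Booking = { id: "b_1", start: "2025-06-01", seats: 2 };
  // AUTO may rename every identifier in the codebase, but keys already
  // written to disk and read by third parties must not move.
  expect(Object.keys(JSON.parse(serializeBooking(booking)))).toEqual([
    "reservation_id", // legacy serialization tag, still on disk: keep
    "start",
    "seats",
  ]);
});
```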
The framework applies to its own deployment
There is a senior engineer with nine years in the same codebase whose weekly throughput drops 30% the quarter the new CTO mandates Cursor across the org. He cannot articulate why. The LLM keeps suggesting plausible-looking but subtly wrong things in code he could have just written; he has to recognize the wrongness, articulate it, and correct it — three steps that did not exist when he just wrote code. Net negative on this work. He considers turning the tool off but worries about being seen as "not a team player." This is the METR finding in the wild.
The framework's prediction here is unambiguous: for some (engineer × domain) combinations, OFF is the right mode. The mandate that removes OFF from the menu is a framework violation. Not because mandates are bad in general, but because this specific mandate produces systematic harm to the team's most experienced people in their highest-leverage work — exactly inverted from where you want the gains.
The deeper move: org-level tool policy is itself in scope of the framework. The CTO who mandates "everyone uses LLM" is doing the same thing as the senior who waved through the casual analytics PR by saying "analytics isn't serious" — misjudging the stakes of a decision that pre-determines downstream failures. Stakes assessment recurs at every layer. The CTO's stakes assessment about tool policy is its own SOLO essential, with REVIEW from team throughput data (not PR counts; throughput).
A working organization needs an opt-out path with no political cost, lightweight individual throughput tracking, and quarterly retros that distinguish fluency gain from throughput gain. Without those, the organization optimizes for visible-fluency metrics and silently regresses.
What this looks like in practice
For an individual developer or team lead:
Pick a small, explicit set of essentials. Five to ten things you are going to retain this quarter, tied to touchpoints you actually own — areas you will be on-call for, decisions you will make, audits you will defend, hires you will evaluate. Everything else, accept that you are renting competence. This is fine; what is not fine is pretending otherwise.
Write, do not read. When you are trying to retain something, the act of producing the artifact (the diagram, the ADR, the runbook, the spec) is the model-building. Reading what AI produced is not. AI is a great tool for enumerating options and explaining tradeoffs during the writing; it is a poor tool for producing the artifact you should be producing yourself.
Externalize self-checks. Border tests work only if someone other than you is checking. Self-administered tests have the same self-assessment problem METR exposed. Pair on the test with a colleague; record an explanation and listen to it weeks later; build it into peer review. Aviation has check-rides with examiners present — there is no cheating possible. Software needs the analog.
Refresh on a schedule. The aviation literature is clear that manual skills decay within two months without practice. Quarterly refresh — pick something you owned six months ago, see if you can still do it without AI. If not, either it is not essential anymore (drop it) or your touchpoints are not demanding it (add a touchpoint).
Use OFF when it is the right call. If LLM is making you slower in your own deep area, turn it off there. The mode exists; treat it as legitimate.
For a CTO or engineering leader:
Audit your touchpoints, not your competency framework. The things your team retains are the things your processes demand. List your touchpoints — on-call, code review, ADR culture, postmortems, audits, hiring panels, customer kickoffs, dependency upgrades, cost reviews, compliance workflows. Per touchpoint, ask: who owns it, and what competence does it demand of them? That is your real competency framework. The one in the wiki probably does not match.
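A minimal sketch of that audit as data; the touchpoints, owners, and demands here are invented:

```ts
interface Touchpoint {
  name: string;       // the process that fires: on-call, cost review...
  owner: string;      // who the process actually asks
  demands: string[];  // competence the touchpoint tests for
}

const audit: Touchpoint[] = [
  { name: "on-call (analytics)", owner: "dana",
    demands: ["event pipeline mental model", "runbook read cold"] },
  { name: "quarterly tooling-cost review", owner: "sam",
    demands: ["per-event pricing", "sampling tradeoffs"] },
];

// A person's real essentials list is derived from the audit, not authored:
const essentialsFor = (person: string) =>
  audit.filter(t => t.owner === person).flatMap(t => t.demands);
```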
Stakes assessment is your essential. The senior who looked at the analytics PR and said "analytics isn't serious" — that was the meta-failure that made every downstream gap possible. The senior who said "just add Solana to the chain interface" — same. Senior layers determine the stakes of work, which determines the mode, which determines what is retained. Get stakes wrong and every downstream protocol is wrong.
Make first-instance work COACH mandatory. Any developer's first instance in any new domain is COACH, regardless of stakes, with a senior reviewing the spec written before the LLM was opened. This is the only intervention that prevents skill never-built. It is expensive once per developer per domain. It is non-negotiable if you intend to grow seniors from juniors.
Measure throughput, not fluency. PR count and time-to-merge optimize for visible activity. Cycle time, time-to-incident-resolution, ramp time for new hires, hiring panel pass-through quality — these measure whether the team's competence is growing or eroding. If your dashboards show only the first kind, you cannot detect the failure mode you are most exposed to.
Make production-versus-learning explicit. When you assign work, decide and say which dimension you are optimizing. "Three engineers prototype one tool each, learn one each" or "three engineers prototype all three, learn all three at 3x cost." Default is whichever happens; default is rarely what you would have chosen.
Accept your gaps explicitly. You will not be able to put everything on the essentials list; therefore you will not have detectors for everything. Make the missing detectors visible. A good test: when something goes wrong in an area without a touchpoint, you should be able to point at the missing touchpoint as the predicted failure — not be surprised by it.
Design for LLM-down. Provider outages and your own incidents are correlated: the same widespread cloud event that pages your team can take the LLM provider down with it. Operational essentials must include an LLM-unavailable protocol, tested in chaos drills.
The honest part
The harder claim that follows from all of this is that most teams using AI tools today are quietly producing rented competence at scale, and they do not know it yet — because their interventions live on the consumption side: PR review without border tests, "reading the AI's output," after-the-fact docs nobody actually wrote (because the act of writing is what would have built the model). The MIT and METR data already shows this is happening at the individual level. The team-level version is harder to measure but structurally identical — the team's collective ability to handle questions one level above the prompt is degrading, slowly enough that no single moment exposes it. By the time hiring breaks, by the time an audit fails, by the time a customer's CTO asks a question your lead engineer cannot answer at a kickoff, the line was crossed months earlier.
This is not an anti-AI argument. AI is genuinely useful; productivity gains in greenfield work are real; AUTO mode is large and legitimate. The framework here is meant to help you use AI without paying a hidden cost you cannot see until it is too late to recover.
The good news, if there is any, is that the fix is not more discipline or more rigor in the abstract. It is better process design, which engineering leaders already know how to do. Design the touchpoints that demand the competences you actually need to retain. Be small and honest about the rest. Use AUTO where the boundaries are documented. Use OFF where the tool is making you worse. Use COACH on every first-instance domain. Treat the team's learning trajectory as a deliverable on par with the team's output trajectory. The framework that emerges is not extra work on top of how you already build — it is a different way to look at how you already build.
Sources referenced
- METR (2025), Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. arxiv.org/abs/2507.09089
- Kosmyna et al. (2025), Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. MIT Media Lab.
- Bjork & Bjork (2011), Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning.
- Kapur (2014), Productive Failure in Learning Math. Cognitive Science.
- Kalyuga (2007), Expertise Reversal Effect and Its Implications for Learner-Tailored Instruction. Educational Psychology Review.
- Bainbridge (1983), Ironies of Automation. Automatica. (The aviation manual-flying-day analog.)
- Casner et al. (2014), The Retention of Manual Flying Skills in the Automated Cockpit. Human Factors.
- Lee et al. (2025), The Impact of Generative AI on Critical Thinking. Microsoft Research / CHI 2025.
- GitClear (2025), AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones.
- Willison (2025), Not all AI-assisted programming is vibe coding (but vibe coding rocks). simonwillison.net