This Is the Lucky One

On feedback loops, falsification, and why the researcher closing her laptop at 2 AM has it better than you.

Prologue: The Experimenter Comes Down From the Mountain

It is two in the morning. The training run finishes and the number is below the baseline. Three weeks of work, and the result is a row in a CSV that does not flatter her. There is no chart to spin. There is no segment to cherry-pick. There is no launch tweet to draft. There is a number, and the number is lower than the one she started with.

She closes the laptop.

She has been alone with her experiments for a long time — long enough to learn that the experiments do not lie; that they cannot be made to lie; that the labor of three weeks can be cancelled by a tensor in an hour, and there is nothing to be done about it but begin again. She has learned to receive this without flinching. And the loop made her find out.

In the morning she comes down from the mountain.

The marketplace is a launch day. A feature is being shipped. The crowd has gathered to watch it cross — the tightrope walker, lit from below, balanced on the wire of a deadline. There are decks. There is a demo. There is a Slack channel with a rocket emoji and a name like #launch-aurora. The walker is dazzling. The walker is also, it turns out, falling — but the wire is so thin, and the lights so bright, and the cheer so loud, that no one in the crowd is looking at his feet. He hits the floor. The cheer does not stop. Someone retweets the demo. The metric that would have told them whether he crossed was never named, and so, in the way that matters very much and not at all, he crossed.

She tries to speak.

She says: the walker fell. She says: you have shipped a hypothesis you have not tested. She says: what number, measured how, by when, would have told you he fell? The crowd does not hear her. They are not refusing to hear her — they have no apparatus for the sentence. The sentence is in a language whose vowels are metrics and whose grammar is precommitment, and the marketplace speaks a different tongue: the tongue of we shipped, of the team is proud, of next quarter we'll measure better.

She picks up the corpse of the unmetricked feature and carries it off the stage.

She understands that she will not preach to the marketplace. The marketplace has no force that would compel it to hear her. Its applause is sincere; its ignorance is structural. It celebrates because nothing makes it weep. And no one made them find out.

She will speak instead to those who already suspect — to the engineer who has begun to dread the rewrite; to the product manager who can no longer name which features moved the number; to the researcher who has caught herself tuning on the test set.

To those, she will speak.

She begins.

Of the Three Metamorphoses (of the Knowledge Worker)

Three creatures live in every person who builds. They appear in order, and you cannot skip one.

The camel comes first. The camel kneels and accepts the load. The roadmap, the ticket, the brief — these are placed on its back and the camel says yes. The camel ships what it is told. It does not ask, when the shipping is done, whether the thing it shipped did what it was meant to do. To ask would be to refuse the next load, and the camel does not refuse loads. The camel is praised. The camel is reliable. The camel will be promoted to senior camel.

The lion comes next. The lion has read the dashboard for the first time in earnest and learned that half of what the camel shipped did nothing. The lion learns to say no. No, we will not build that yet. No, we will not call this a win. No, the chart is too noisy to support that claim. The lion kills features the camel would have shipped. The lion is necessary, and the lion is exhausting, and the lion cannot yet build — only refuse. Many become lions and stay there. A field full of lions ships nothing.

The child comes last and is rarest. The child does not refuse and does not bear; the child plays. The child forms a hypothesis, runs the experiment, reads the number, and — this is the part nobody believes until they have seen it — does not mind which way the number goes. The child has no ego in the hypothesis. The child has only the loop. She builds because the number told her to build, stops because the number told her to stop, and is no more wounded by the stopping than by the building. The child is what the experimenter on the mountain has become.

Most never reach the child. Most do not reach the lion. Most ship, and are praised, and go home. And no one makes them find out.

Of Green Tests

The engineer pushes the merge button. The CI is green. The reviewer approved. The diff is small and the commit message is clean.

He has not built a correct system. He has built a system that passes the tests he thought to write, reviewed by colleagues who share his blind spots, against the patterns the codebase already had. What was checked and what is true are not the same and have never been the same, but the green check mark is fast and the rot is slow, and a brain given a fast signal and a slow signal pointing in opposite directions will believe the fast one every time.

He knows this, abstractly. He has said it in design reviews. The tests don't catch architectural drift. The review process selects for what we already agree on. The metrics we have are not the metrics we want. He says it, and then he ships, because saying it does not generate a better instrument than the one he has, and the deadline is the deadline. The discipline of saying-it is not the discipline of doing-it.

Two years later someone is rewriting the module. The rewrite is not because the original was bad — it was the best he had, given what he knew to check — but because the dimensions the loop did not measure have, over two years, accumulated into a shape the product can no longer wear. Maintainability was not in the test suite. Right abstraction at this scale of team was not in the review checklist. Will this still be the right shape when the product pivots was a question with no instrument attached, and so was not asked, and so was not answered, and is being answered now, expensively, by someone else.

The engineer was not careless. He was well-served by his loop on the dimensions his loop measured, and abandoned by it on the dimensions it did not. The loop is not his enemy; the loop is doing exactly what it was built to do. The trouble is that he, and the team, and the field, have not yet built an instrument for the slow dimensions, and so the slow dimensions arrive as surprise, every time, and are received as failure, every time, by whoever is holding the codebase when the surprise arrives.

The tests were green. The design rotted in dimensions no one had named. And no one made him find out — until it was someone else's problem.

Of the Launch Day Glow

The product manager ships. The Slack channel celebrates. The VP retweets the demo. The dashboard, consulted in the second week, shows a number that has gone up — or down, or sideways; the chart is noisy enough to support any of these readings, and the deck supports the reading she has already chosen.

She is not lying. She is not, in any obvious way, even wrong. She believes the feature worked, and her belief is built from the available materials: the launch happened, the team was proud, customer success forwarded a kind email, a small slice of users adopted it. The slice is the wrong size to draw conclusions from, but the slice exists, and the existence of the slice is enough to fill the space where a conclusion would otherwise be. Other PMs, in other orgs, run pre-registered A/B tests with power calculations and guardrail metrics and would not be in this position. The essay is about what happens when the structure does not require the metric. Many product orgs have built structures that do; this essay is about the larger number that have not — the PM whose features are too coarse to A/B, whose users too few to detect the effect even if they could, whose dashboards lag the decision by a quarter.

For her, the question is not did the test reach significance. The question is whether, before she shipped, she named — in writing, somewhere a colleague could see — the number that would have told her she was wrong, the timeframe in which that number would be read, and the action she would take if the number arrived the way she did not want it to. She did not. Not because she is bad at her job, but because nothing in her role required her to. Her quarterly review will measure features shipped, not features that worked. Her promotion packet will list launches, not retractions. The structure has built her a loop that closes on shipping and stays open on outcome, and she has the loop the structure built.

Next quarter she will ship again. Next quarter the chart will be noisy, and the deck will support whichever reading the slide deadline demands, because the deck is the artifact the structure asks for, and the structure does not ask for the other artifact — the one that would say this did not work, and here is what I learned, and here is what I will not do again.

You do not rise to the level of your goals, James Clear writes; you fall to the level of your systems. She had the goal. She did not have the system.

The team celebrated. The metric was never named. And no one made her find out.

Of the P-Value, the Peer, and the Leaderboard

The academic submits the paper. Two of three reviewers recommend acceptance. The third dissents on methodology and is overruled by the editor. The paper appears. The citation count begins, slowly, to climb. Five years later a replication attempt by a graduate student in a different country finds the effect to be roughly a third the size reported, in the wrong direction, and only on a subsample the original paper did not preregister. The replication is published in a journal no one in his subfield reads. The original paper continues to be cited.

The academic is not a fraud. He believed his finding. He believed it because his p-value was below the threshold the field had agreed, by social convention, would mean true — a threshold Fisher proposed in 1925 as a rule of thumb for what was worth a second look, fused later with the Neyman-Pearson framework (a long-run error-rate machine, not a truth criterion), and inherited by his field as a verdict. He had performed the ritual the field had told him would deliver truth, and received what the ritual delivered, which was not truth but publishability.

What the loop measured was the willingness of two reviewers, on a Tuesday, to say fine. What he wanted it to measure was whether the thing he had written about the world was so. The two are correlated. They are not the same. They have come apart, in his field, by enough that the Open Science Collaboration's 2015 replication of one hundred psychology papers reproduced thirty-nine; the Camerer et al. social-science replications of 2018 fared somewhat better, at roughly three in five; preclinical cancer biology has done worse. The numbers vary by subfield and by what one counts as a successful replication, but the direction does not. Most published research findings are false, Ioannidis argued in 2005 on theoretical grounds, and the empirical follow-ups have not refuted him so much as told us where, and how badly.

He published. The field cited. The replication failed in a journal nobody read. And no one made him find out.

The experimenter, watching this from her terminal, does not get to feel superior for very long.

She has caught herself, more than once, doing a smaller version of the same thing. She ran the experiment on five seeds and reported the best three. She tuned a hyperparameter on the validation set and then, later, used the same validation set to claim a result. She defined the benchmark such that her method, just barely, won. She wrote the paper as if the hypothesis she ended with had been the hypothesis she started with, when in fact she had started with three and quietly dropped the two that failed. She knows what these moves are called. She has called other people out for making them. She has made them anyway, because the deadline was Friday and the number had to be the number she had hoped for, and the loop she ran — fast, cheap, hers — was, when she did this, no more honest than the academic's.

The mountain is not a place. It is a discipline. A researcher with a leaderboard and a seed she did not preregister is not on the mountain. She is in the marketplace too, with better lighting.

She went home, that Friday, with a number she had hoped for. And no one made her find out either.

Of the Eternal Recurrence of the Useless Feature

Imagine the demon comes to you in your loneliest loneliness and says: this feature you are about to ship — the launch you have already pictured, the one whose metric you have not named — you must ship it again, exactly so, forever. Every quarter for the rest of time. Knowing, as you now know, that it will not move the number. Would you ship it?

The camel would. The camel does not ask whether the load it bore last time was worth the bearing; the camel only kneels for the next load. Eternity for the camel is a thousand camels in a line, each bearing the same load, none of them looking back.

The lion would not. The lion would refuse the first repetition and the second and the thousandth, would roar at each demo, would die roaring, having killed every useless feature and built nothing in their place. Eternity for the lion is a thousand lions roaring at a thousand empty stages.

The child would not ship the same feature, but would ship something, every time differently, because she would have named the number first and would build only what the number told her to build. Her eternity is a thousand experiments, most of them failed, each of them honest, none of them the same. She can bear the recurrence because the work was answerable to something other than the meeting in which it was decided.

The test is not whether you can endure shipping the useless feature forever. The test is whether, naming the number now, before you build, you would still build the thing you are about to build. If yes, build it. If no, you have just been told what you should have been told eighteen months from now, and there is still time. And the loop made you find out.

The Experimenter Returns to the Mountain

She climbs back. The marketplace is loud behind her and the air thins as she goes. She has not converted the crowd. She did not, in the end, even try very hard; the marketplace had no apparatus for her sentence, and she had no obligation to invent one for it. She has spoken only to the three who already suspect, and to them she leaves a question — the question is the only thing portable enough to carry down off the mountain and into the office on Monday.

What number, measured how, by when, would tell you that you were wrong — and have you committed to it in a way you cannot wriggle out of?

If you can answer, you have a loop. The loop will be slower than hers and more political and less clean, and it will still, on the days it matters, force the truth on you in a way no meeting and no demo and no retweet can.

It will not save you on the days you choose to cheat it — and on enough Fridays, you will choose to cheat it. The experimenter who never has does not exist; she has cheated hers, more than once, and the loop did not stop her. But the system is what you fall to when the discipline gives out, and a loop you can sometimes cheat is, on the days you do not, the only thing standing between you and the marketplace — and the marketplace has no loop at all.

You will not always be grateful for it. You will sometimes hate it. You will, on a Friday at two in the morning, close the laptop on three weeks of work that did not pan out, and feel the particular kind of cold that comes only from having been told something true that you did not want to hear.

If you cannot answer, you have a career.

The experimenter reaches the top. The sky is what it always is. She opens the laptop. Tomorrow's hypothesis is already written down — written down before the training run, because that is the only place writing it down counts. The number, when it comes, will be the number. She will read it herself.

She did not go home.

This Is the Lucky One

Prologue: The Experimenter Comes Down From the Mountain

Of the Three Metamorphoses (of the Knowledge Worker)

Of Green Tests

Of the Launch Day Glow

Of the P-Value, the Peer, and the Leaderboard

Of the Eternal Recurrence of the Useless Feature

The Experimenter Returns to the Mountain

No comments yet

Continue reading

Call me Cassandra

Turning a face into a chibi: a tour of LAM, Gaussian Splatting, and what happens when you push on the mesh underneath

On PMs and LLMs: Forwarding the Work of Thinking