In 1973, Fischer Black and Myron Scholes published a formula for pricing options. The math was elegant: given a stock’s current price, its volatility, the time to expiration, and the risk-free interest rate, the formula produced a number. That number was supposed to be what the option was worth.
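The formula itself is compact enough to sketch. Below is a minimal implementation of the closed-form Black-Scholes price for a European call, using only the standard library; the example inputs at the bottom are arbitrary:

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call option.

    S: current stock price, K: strike price, T: years to expiration,
    r: risk-free interest rate, sigma: annualized volatility.
    """
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# Example: at-the-money call, one year out, 20% volatility, 5% rate.
price = black_scholes_call(100, 100, 1.0, 0.05, 0.2)
print(f"{price:.2f}")
```

The five inputs Black and Scholes identified (current price, strike, time, rate, volatility) are all the function needs; everything else about the stock is assumed away.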
It worked. Not perfectly, but well enough that traders adopted it, first as a guide, then as a standard, then as the price itself. The Chicago Board Options Exchange had opened the same year, and the new market needed a common language for valuation. Black-Scholes became that language. Within a decade, the formula wasn’t just describing the options market. It was organizing it.1
This is the part that gets strange. Before Black-Scholes, options prices deviated substantially from what the formula would have predicted. After its widespread adoption, they converged toward it. The model didn’t discover a pre-existing regularity. It helped produce one. Traders using the same formula arrived at similar prices, and those prices became the market. The map was redrawing the territory.
Donald MacKenzie, the sociologist who documented this most carefully, called it “performativity”: a theory that reshapes the world into closer alignment with itself, not because the theory was always true, but because enough people used it as if it were. Black-Scholes didn’t just model volatility. It tamed it, temporarily, by giving everyone a shared frame.
Then came October 19, 1987. The Dow dropped 22% in a single day. Portfolio insurance strategies, built on Black-Scholes assumptions about continuous trading and liquid markets, amplified the crash by automatically selling into a falling market. The models assumed you could always trade; when everyone tried to trade at once, you couldn’t. The territory had been quietly diverging from the map, and on Black Monday, the divergence became visible all at once.2
After the crash, options markets changed permanently. Volatility surfaces developed a persistent “smile” that Black-Scholes in its original form couldn’t produce. The model still mattered, but it mattered differently: less as a literal price and more as a shared coordinate system, a baseline everyone adjusted from. The map had reshaped the territory, and the territory had reshaped the map back.
This is a general pattern. It shows up wherever a model becomes authoritative enough to guide decisions at scale.
The pattern
A model is built to measure something. The model is adopted. Because the model is trusted, people begin making decisions based on it. Those decisions change the thing being measured. The model is now partly describing a world it helped create.
The philosopher Ian Hacking called a version of this “making up people”: when a classification (say, a psychiatric diagnosis) changes the behavior of the people classified, which changes the evidence for the classification, which changes the classification itself.3 Goodhart’s Law captures the cynical version: when a measure becomes a target, it ceases to be a good measure. But Goodhart’s Law implies corruption, someone gaming a metric. The deeper phenomenon doesn’t require gaming. It just requires that the model be used.
The map eats the territory not through malice but through authority. A model that sits on a shelf changes nothing. A model that enters the decision-making loop changes everything, including the conditions under which it was validated.
The score that built the school
In 1926, the College Board administered the first Scholastic Aptitude Test to a few thousand students. The test was designed to measure something that already existed: academic readiness, independent of which school you attended or how much your family could spend on preparation.4
For a while, it arguably did. The early SAT was a selection tool for a small number of elite institutions. It influenced a few thousand admissions decisions per year. It sat lightly on the system it measured.
Then the test scaled. By the 1960s, the SAT was a mass institution. By the 1990s, it was the gatekeeper to most of American higher education. And as the stakes rose, the territory began to bend.
Schools restructured curricula to match the test’s format. A test-preparation industry emerged, eventually worth billions of dollars. Students who could afford coaching learned the test’s patterns; students who couldn’t fell behind on a metric that was supposed to be pattern-resistant. The “aptitude” the SAT measured increasingly reflected familiarity with the SAT.5
The test didn’t just measure readiness. It defined readiness. Colleges used the score as a primary filter, so students optimized for the score, so schools optimized for students who optimized for the score. The thing being measured was no longer independent of the measurement.
This isn’t a story about a bad test. The SAT’s designers were thoughtful people working on a real problem. It’s a story about what happens to any measurement that becomes load-bearing for decisions: the decisions reshape the thing being measured, and the measurement slowly drifts from its original referent. Not because anyone cheated. Because the map was useful, and useful maps get heavy.
Where you draw the line on a body
In 2003, a committee of experts lowered the diagnostic threshold for “pre-diabetes” from a fasting glucose of 110 mg/dL to 100 mg/dL. Overnight, millions of people who had been healthy on Tuesday were pre-diabetic on Wednesday. Their blood sugar hadn’t changed. The map had.6
Diagnostic thresholds are lines drawn on a continuum. Blood pressure, cholesterol, blood glucose, BMI: the underlying biology is a smooth distribution, but medicine needs categories. You are hypertensive or you are not. You have pre-diabetes or you don’t. The threshold converts a gradient into a cliff.
The cliff changes behavior. Patients above the threshold receive treatment, monitoring, lifestyle interventions. Patients below it don’t. Pharmaceutical companies develop drugs that target populations near the threshold, because moving people from “diagnosed” to “not diagnosed” is a measurable outcome. Insurance coverage follows diagnostic codes. The entire apparatus of treatment, reimbursement, and research clusters around the line.
Over time, the line reshapes the population it describes. Screening programs catch more borderline cases. Treatment expands. The prevalence of the condition rises, not necessarily because more people are sick, but because more people have been sorted to the sick side of a line that was drawn for administrative reasons. The threshold was supposed to reflect the natural history of disease. It now partly creates the clinical landscape.7
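The arithmetic of a threshold move is easy to see in a toy simulation. The glucose distribution below is invented (mean 95 mg/dL, standard deviation 12; real population data differ); the point is only that a small shift in the line sweeps a large fraction of a roughly bell-shaped population across it:

```python
import random

random.seed(1)

# Purely illustrative population of fasting glucose values (mg/dL).
population = [random.gauss(95, 12) for _ in range(100_000)]

def prevalence(threshold):
    """Fraction of the population at or above the diagnostic line."""
    return sum(g >= threshold for g in population) / len(population)

old = prevalence(110)  # pre-2003 cutoff
new = prevalence(100)  # post-2003 cutoff
print(f"flagged at 110 mg/dL: {old:.1%}")
print(f"flagged at 100 mg/dL: {new:.1%}")
print(f"reclassified by moving the line: {new - old:.1%}")
```

Because the line sits near the bulk of the distribution, moving it 10 mg/dL reclassifies far more people than moving it 10 mg/dL out in the tail would.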
This is not an argument against diagnosis. Lines must be drawn somewhere. But recognizing that the line reshapes what it divides is a different posture than treating the line as a transparent window onto nature. The map is eating the territory, one threshold revision at a time.
The model that should eat the territory
Climate science complicates the pattern.
A climate model is also an authoritative representation that enters the decision-making loop. Policymakers read the projections, pass legislation, and the legislation changes emissions, which changes the climate, which changes what the models predict. The map eats the territory.
But here, that’s the point. The entire purpose of climate modeling is to produce a map alarming enough that people act to prevent the mapped future from arriving. A climate model that doesn’t change behavior has failed. A climate model that changes behavior so much that its worst projections never materialize has succeeded, even though it now looks, in hindsight, like it was “wrong.”8
This is the inversion that keeps the pattern honest. In finance, the model eating the territory created fragility (everyone relying on the same assumptions). In testing, it created distortion (optimizing the score instead of the skill). In medicine, it created overdiagnosis (expanding the boundary of disease). In climate, it creates the possibility of prevention: the model’s authority is the mechanism through which catastrophe is averted.
So the question can’t simply be “is the map eating the territory?” It always is, once the map is authoritative. The question is whether the eating is doing something you’d endorse if you could see the full feedback loop. Whether the co-evolution of map and territory is pushing toward a place you’d choose, or drifting toward one you wouldn’t notice until it’s too late.
Co-evolution
What all these cases share is a phase transition. The model starts as a passive description. It becomes an active force. The moment of transition is rarely visible from the inside.
Black-Scholes was a formula before it was a market institution. The SAT was a selection tool before it was a curriculum. A fasting glucose cutoff was a clinical guideline before it was an industry. Climate projections were science before they were policy. In each case, the model crossed a threshold of adoption beyond which it stopped merely describing and started organizing. And in each case, the people using the model continued to treat it as a description long after it had become something else.
This is the hard part. We have good frameworks for evaluating models as descriptions: accuracy, calibration, predictive power, out-of-sample performance. We have almost no frameworks for evaluating models as forces. What does it mean for a model to be “good” when the model changes what “good” refers to? How do you validate a map when the territory is following the map?
One answer is to track the gap between what a model was built to measure and what it is currently measuring. The SAT was built to measure aptitude; it now partly measures test preparation. That drift is a signal. Black-Scholes was built to model existing volatility; it partly produced the volatility patterns it described. That feedback is a signal too. The signals are different in kind from prediction error. They’re about the model’s relationship to its own conditions of validity, which shift as the model is used.
Another answer is to build models that account for their own influence, the way a thermostat accounts for the fact that turning on the heater will change the temperature reading. Control theory does this routinely. Social science does it rarely. The difference may be that thermostats operate in systems simple enough to model the feedback. Markets, schools, and health systems are not.
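A thermostat makes the point concrete. The sketch below uses made-up physics constants; what matters is that the controller's own action (heating) feeds back into the very reading it acts on, and the system is simple enough that the feedback is fully modeled:

```python
def simulate(setpoint=20.0, outside=5.0, steps=200):
    """Bang-bang thermostat in a leaky room (illustrative constants)."""
    temp = outside
    history = []
    for _ in range(steps):
        heater_on = temp < setpoint          # act on the current reading
        heating = 1.0 if heater_on else 0.0  # degrees gained per step
        leakage = 0.05 * (temp - outside)    # heat lost to the outside
        temp += heating - leakage            # the action changes the reading
        history.append(temp)
    return history

history = simulate()
print(f"final temperature: {history[-1]:.1f}")
```

The thermostat never mistakes its map for an independent territory: it expects the temperature to respond to its own decisions, and its design depends on that expectation being right.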
The most honest answer may be vigilance without resolution. Every model that becomes load-bearing will reshape what it measures. The reshaping is not always bad and is sometimes the goal. But the default assumption, that the model is a transparent window and the territory is independent, is almost always wrong once the model is in use. The hardest discipline is not building better models. It is continuing to look at the territory after you have a map you trust.
Footnotes
1. Donald MacKenzie, An Engine, Not a Camera: How Financial Models Shape Markets (MIT Press, 2006). MacKenzie’s core argument is that financial economics is “performative”: it doesn’t just observe markets but actively shapes them. The Black-Scholes case is his central example, documented through interviews with traders and statistical analysis of options pricing convergence after the formula’s adoption.
2. The role of portfolio insurance in the 1987 crash is documented in the Brady Commission report (Report of the Presidential Task Force on Market Mechanisms, January 1988). The cascade mechanism: portfolio insurance programs automatically sold futures as markets dropped, which pushed markets further down, which triggered more selling. The models assumed continuous liquidity; the crash proved that assumption wrong precisely when it mattered most.
3. Ian Hacking, “Making Up People,” in Reconstructing Individualism, ed. Heller et al. (Stanford University Press, 1986). Hacking’s insight is that social classifications create “looping effects”: the classified change their behavior in response to the classification, which changes the evidence for the classification. His examples include multiple personality disorder and childhood autism.
4. The SAT was developed by Carl Brigham, initially for the College Board, building on Army IQ tests from World War I. The stated aim was meritocratic selection independent of school quality. Nicholas Lemann’s The Big Test: The Secret History of the American Meritocracy (1999) traces the social history.
5. The test-preparation industry in the U.S. was estimated at over $1.1 billion annually before the pandemic. Research on whether test prep improves scores beyond what content learning provides is mixed, but the existence of the industry itself is evidence of the loop: when a metric determines life outcomes, investment flows toward optimizing the metric.
6. The American Diabetes Association lowered the impaired fasting glucose threshold from 110 to 100 mg/dL in 2003. This reclassified an estimated 46 million additional Americans as pre-diabetic. The clinical utility of the lower threshold remains debated; some evidence suggests increased diagnosis without proportional improvement in outcomes.
7. Ray Moynihan and colleagues have written extensively on “disease mongering” and diagnostic expansion. See Moynihan, Heath, and Henry, “Selling sickness: the pharmaceutical industry and disease mongering,” BMJ 324 (2002). The argument is not that all threshold changes are wrong, but that lowering thresholds systematically expands the treatable population in ways that benefit some stakeholders more than others.
8. This is sometimes called the “prophet’s dilemma” or the “paradox of successful prediction.” If a warning succeeds in preventing the warned-about outcome, the warning looks alarmist in retrospect. Y2K preparation is a smaller example: extensive remediation prevented widespread failure, after which many concluded the threat had been overstated. The paradox makes it structurally difficult to credit successful prevention.