Reading the Unwritten

December 10, 2025 · 17 min read

A pathologist puts a tissue sample under a microscope. She sees structure: clusters of cells, boundaries between regions, patterns of density and color. Some patterns she recognizes immediately. Others require staining, preparation, comparison with reference images. But after years of training, she can read this tissue. She can say: this is healthy, this is inflamed, this is something to worry about.

The cells didn’t design themselves to be legible. They aren’t trying to communicate. But generations of pathologists built a vocabulary anyway: terms for structures, names for patterns, theories that link what’s visible to what it means. Over more than a century, from Virchow’s cellular pathology to molecular diagnostics, they developed a language for reading tissue.1 It’s still being refined. But it exists, and because it exists, we can act on what we see.

Now imagine looking inside a neural network.

You see structure: layers of numbers, patterns of activation, clusters that light up for certain inputs. Some patterns seem meaningful. Show the network a picture of a dog and certain neurons activate; show it a cat and others fire. But what do those activations mean? Are they detecting edges, shapes, concepts, something else entirely? The network works. It classifies images, answers questions, writes code. But working isn’t the same as being understood.2

This is the interpretability problem: whether we can think clearly about machines, whether we can be intelligent about neural networks rather than merely impressed by them.

The project isn’t purely aspirational. Researchers have begun to find structure. In language models, there are “induction heads,” circuits that learn to copy patterns from earlier in a text, enabling in-context learning.3 In simple templates like “When Mary and John went to the store, John gave a drink to,” models use identifiable circuits to track who is who and predict “Mary.”4 These are early results, limited to specific behaviors in relatively small models. But they show that the interior isn’t featureless noise. There’s something there to read.5
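The induction-head behavior is simple enough to state as an ordinary algorithm, which is part of why the circuit was identifiable at all. A toy sketch (plain Python expressing the copying rule, not the learned attention mechanism itself):

```python
def induction_predict(tokens):
    """Toy induction rule: if the current token appeared earlier in the
    sequence, predict the token that followed it last time (else None).
    This mimics what induction heads learn, written as a plain algorithm."""
    current = tokens[-1]
    # Scan backwards for the most recent earlier occurrence of the token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# Pattern "A B ... A" -> predict "B":
print(induction_predict(["the", "cat", "sat", "the"]))  # -> "cat"
```

The real circuit implements something like this rule with a pair of attention heads; the point of the sketch is only that the behavior has a crisp algorithmic description, which gives researchers something concrete to look for in the weights.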

The stakes are practical. AI systems need to be auditable, correctable, legible to the people they affect. But you can’t regulate what you can’t inspect. You can’t trust what you can’t verify. You can’t fix what you can’t locate. Legibility requires a language, and for neural networks, that language doesn’t yet exist.

The problem of emerged structure

The pathologist’s advantage is that cells, despite not being designed for legibility, have structure that maps onto function in ways we’ve learned to recognize. A cell membrane is a boundary. A nucleus contains genetic material. Mitochondria produce energy. The parts have roles, and the roles are stable enough that we can name them.6

Neural networks have structure too, but the mapping to function is less clear. A “neuron” in a network isn’t a cell; it’s a mathematical operation. A “layer” isn’t a tissue; it’s a stage of computation. The terms are borrowed from biology, but the borrowing is metaphorical. We don’t actually know whether the parts have stable roles.7

Worse, the structure wasn’t designed. It emerged. When you train a neural network, you don’t specify what each neuron should do. You specify a goal (predict the next word, classify this image, minimize this loss function) and let gradient descent find parameters that achieve it. The result is a system that works, but whose internal organization is a side effect of optimization rather than a product of intention.8

This is the core difficulty. Designed systems are often more legible because they come with intended abstractions (modules, invariants, specifications) and the structure is at least partly organized around them. Emerged systems come with only the objective and whatever organization optimization happened to find.9 Chris Olah has compared the task to reverse-engineering a compiled binary: the source code had structure, but the compiled output has been transformed into something optimized for execution, not comprehension.10

When we look inside a neural network, we’re not reading a blueprint. We’re doing archaeology on a system that was never designed to be excavated.

Three levels of not-understanding

It helps to be precise about what we don’t understand. The confusion operates at multiple levels.11

The representations are opaque. Neural networks encode information in patterns of activation across many neurons. What are these patterns? Early layers of image models seem to detect edges and textures, which makes intuitive sense. But deeper layers encode something harder to name: features that combine in unexpected ways. A pattern that activates for “dogs in grass” and also “certain textures” and also “images with this statistical property.” The features are real in the sense that they’re stable and predictive, but they don’t carve reality at joints we recognize.12

A major reason this is hard is superposition: models pack many features into shared neurons, producing polysemantic activations that don’t map to single human concepts.13 A neuron that fires for “academic citations” might also fire for “legal disclaimers” and “DNA sequences.” It’s not confused; it’s efficiently encoding multiple features in overlapping directions. But that efficiency makes the representations harder to read.
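Superposition can be seen in a toy numerical sketch (invented dimensions and directions, not from any real model): pack four features into two activation dimensions, and reading out one feature necessarily picks up interference from the others.

```python
import numpy as np

# Toy superposition: 4 sparse "features" packed into a 2-dimensional
# activation space, so each dimension necessarily mixes several features.
rng = np.random.default_rng(0)

n_features, n_dims = 4, 2
# Each feature gets a (generally non-orthogonal) unit direction.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

def encode(feature_values):
    """Activation = sum of active features along their directions."""
    return feature_values @ directions

# Input where only feature 0 is active.
x = np.array([1.0, 0.0, 0.0, 0.0])
act = encode(x)

# Read out each feature by dot product with its direction:
readout = directions @ act
# readout[0] is ~1.0, but the other features read nonzero too --
# interference from sharing 2 dimensions among 4 features.
print(readout)
```

In a model, the same geometry means a probe along one feature's direction responds partly to unrelated features, which is exactly why single neurons look polysemantic.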

The algorithms remain untraceable. Knowing what features a network detects doesn’t tell you how it uses them. When a language model answers a question, what pattern of internal computations produces that answer? Is it retrieving memorized facts? Combining patterns? Reasoning through steps? The computation happens, but we can’t follow its logic the way we’d trace the execution of a program we wrote.14

The failures defy prediction. Neural networks make mistakes, and the mistakes are often baffling. A model that correctly identifies thousands of dogs misclassifies one because of an imperceptible change to the pixels. A language model that reasons correctly about most problems hallucinates confidently on others. If we understood the system, we could anticipate where it would fail. We can’t. Adversarial examples are proof: we still cannot reliably predict brittleness in advance.15
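The adversarial failure mode is easy to reproduce on a hand-built linear scorer, in the spirit of Goodfellow et al.'s fast-gradient-sign construction (weights and inputs below are invented for illustration, not a real trained model):

```python
import numpy as np

# A linear "classifier" in 1000 dimensions with small weights.
d = 1000
signs = np.ones(d)
signs[:490] = -1.0            # 510 positive weights, 490 negative
w = 0.1 * signs
x = np.full(d, 0.5)           # a benign input

score = w @ x                 # 0.1 * 0.5 * (510 - 490) = 1.0 -> "positive"

# Nudge every coordinate by an imperceptible epsilon, each in the
# direction that hurts the score most (the sign of the gradient, which
# for a linear model is just the sign of w).
eps = 0.05
x_adv = x - eps * np.sign(w)

adv_score = w @ x_adv         # drops by eps * sum|w| = 0.05 * 100 = 5.0
print(score, adv_score)       # 1.0 vs -4.0: the classification flips
```

The per-coordinate change is tiny, but in high dimensions the effects accumulate: the score shifts by eps times the L1 norm of the weights. That accumulation across thousands of dimensions is the heart of the linearity hypothesis mentioned in footnote 15.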

These three confusions compound. We might guess what features are involved in a decision, but without knowing how those features are combined, and without predicting when the combination will break, our understanding remains shallow. Useful, perhaps, but shallow.16

The vocabulary we’re building

Interpretability research is the project of building a vocabulary anyway, constructing it piece by piece the way pathologists constructed theirs.17

One approach: find the features. Recent work has shown that neural networks represent many features as directions in activation space. “Dog” might correspond to a particular direction; “blue” to another. Techniques like sparse autoencoders can extract these directions, decomposing the network’s activations into more interpretable components.18 The features aren’t always concepts we’d name, but some are. And even the unnameable ones seem to be something: stable patterns that the network treats as meaningful units.19
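The shape of a sparse autoencoder is simple to sketch. The weights below are random placeholders (a real SAE is trained to reconstruct activations under an L1 sparsity penalty, as in Bricken et al. 2023); the sketch only shows the decomposition step.

```python
import numpy as np

# Skeleton of a sparse autoencoder (SAE) forward pass. An overcomplete
# dictionary (more features than dimensions) lets the SAE pull
# superposed features apart into separate coefficients.
rng = np.random.default_rng(0)

d_model, d_features = 64, 256
W_enc = rng.normal(0, 0.1, (d_model, d_features))
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_enc = np.zeros(d_features)

def sae(activation):
    """Decompose an activation vector into nonnegative feature coefficients."""
    f = np.maximum(0.0, activation @ W_enc + b_enc)   # ReLU -> sparse-able code
    recon = f @ W_dec                                  # reconstruction from features
    return f, recon

act = rng.normal(size=d_model)        # stand-in for a model activation
features, recon = sae(act)

# Training minimizes reconstruction error plus an L1 penalty, pushing
# most entries of `features` to zero so each active one can be inspected.
recon_loss = ((act - recon) ** 2).sum()
sparsity_loss = np.abs(features).sum()
```

After training, each column of `W_dec` is a candidate feature direction: a unit of vocabulary extracted from the network rather than imposed on it.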

Another approach: trace the circuits. If features are the vocabulary, circuits are the grammar. A circuit is a path through the network: this feature in layer three connects to that feature in layer five, which influences the output in this way. Researchers have mapped circuits for specific behaviors, including how a model copies patterns from context (induction heads), how it resolves indirect objects in simple templates (the IOI circuit), and how it balances constraints like rhyme and meaning when generating poetry.20 The circuits are often baroque, involving unexpected features in combinations we wouldn’t have guessed. But they’re there, and we can find them.21

A third approach: read the residual stream. Transformer-based models pass information through a “residual stream,” a running tally that each layer reads from and writes to. Instead of asking what each layer does, we can ask what information is present in the stream at each point. This lets us track how the model builds up its representation of a problem, where it retrieves relevant knowledge, when it commits to an answer.22
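The read/write picture can be sketched in a few lines. The "layers" below are random linear maps standing in for attention and MLP blocks (an assumption for illustration); the point is the additive accumulation and the snapshots a probe would see.

```python
import numpy as np

# Toy residual stream: each layer reads the running vector and adds its
# contribution back in, so information accumulates rather than being
# overwritten.
rng = np.random.default_rng(0)
d = 16
layers = [rng.normal(0, 0.1, (d, d)) for _ in range(4)]

stream = rng.normal(size=d)           # stand-in for a token embedding
snapshots = [stream.copy()]           # what a probe sees at each depth

for W in layers:
    stream = stream + stream @ W      # read from the stream, write back additively
    snapshots.append(stream.copy())

# The interpretability question: what is linearly decodable from
# snapshots[k]? Probing each snapshot tracks how the representation
# builds up layer by layer.
```

Because every layer writes into the same shared vector, asking "what is in the stream at depth k" is often more tractable than asking "what does layer k do".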

None of this adds up to understanding in the way we understand a system we designed. But it’s starting to add up to something: a vocabulary, a set of tools, the beginnings of a language for describing what’s inside.23

The biology analogy, revisited

How much can the pathologist’s example guide us?

The optimistic reading: biological systems weren’t designed for legibility either, and we learned to read them anyway. It took time (decades for cell biology, longer for neuroscience) but patient observation built a vocabulary that tracks real structure. Neural networks are complex, but they’re not magic. If there’s structure, we’ll find it.24

The pessimistic reading: biological systems have structure because evolution built them from modular parts under physical constraints. Cells have membranes because membranes are useful across many contexts. Neural networks face no such constraints. Gradient descent can find any parameterization that works, however tangled. There’s no guarantee the resulting structure is modular, stable, or interpretable in human terms.25

My view is somewhere between. Neural networks do seem to develop modular structure, because modularity often helps with the training objective. Features that detect edges are useful for many tasks, so edge detectors emerge. Circuits that copy tokens are useful for language modeling, so copying circuits emerge. The structure isn’t guaranteed, but it’s not absent either.26

The deeper disanalogy is time. Pathology had more than a century. We have years, maybe a decade, before AI systems are powerful enough that understanding them becomes urgent. The language is being built in a hurry. We’re learning to read a text that’s being written faster than we can study it.27

Why legibility matters

Every proposal to govern AI systems quietly depends on interpretability.

If you want to grant capabilities only as systems earn trust, you need to verify trustworthiness. Behavioral testing has gaps. A system could pass every test while harboring goals you didn’t check. Real trust requires some ability to look inside, to confirm that the internals match the behavior.

If you want legible receipts for what an AI did and why, you need to verify the receipt. A system that explains itself in language we can’t cross-check is just producing more text. Self-reported explanations are only as good as our ability to audit them.

If you care about dispositions, not just behavior, you need to read them. A system can behave loyally while being disposed to defect; it can follow rules while searching for loopholes. Surface behavior is exactly what adversarial examples show us is fragile.

Interpretability is the tool that turns black-box monitoring into something closer to auditing. Without it, governance is aspirational. With it, governance might actually work.

What understanding would look like

It’s worth being concrete about the goal. What would it mean to understand a neural network?28

Not: being able to predict every output for every input. That’s intractable, and anyway prediction isn’t understanding. A lookup table predicts perfectly but understands nothing.

Not: being able to state simple rules that explain behavior. Neural networks are complex because they need to be. Simple rules won’t capture what they do.

Maybe: being able to identify the features and circuits involved in a given behavior, and to predict how changes to those features and circuits would change the behavior. This is the standard we have for engineered systems. I can’t predict every output of a car engine, but I know which parts do what, and I know that if you remove the spark plugs it won’t run.29
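The intervention standard has a concrete form: if we claim a feature direction carries a behavior, ablating that direction should change the output in the predicted way. A minimal sketch, with all vectors invented for illustration:

```python
import numpy as np

# Hypothesized feature direction, an output head that mostly reads it,
# and an activation in which the feature is present.
feature_dir = np.array([1.0, 0.0, 0.0, 0.0])   # unit vector
readout = np.array([2.0, 0.1, 0.0, 0.0])
act = np.array([3.0, 1.0, -1.0, 0.5])

def ablate(v, direction):
    """Remove the component of v along a unit direction."""
    return v - (v @ direction) * direction

before = act @ readout                        # 3*2 + 1*0.1 = 6.1
after = ablate(act, feature_dir) @ readout    # feature removed -> 0.1

# The causal claim "this direction carries the behavior" makes a
# testable prediction: the readout collapses from 6.1 to 0.1.
```

This is the spark-plug test in miniature: identify the part, remove it, and check that the predicted function disappears.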

Maybe: being able to distinguish between behaviors that arise from robust internal structure and behaviors that arise from brittle coincidence. The model gets this question right because it genuinely encodes the relevant knowledge, versus because of a spurious pattern in the training data. This distinction matters enormously for trust.30

Maybe: being able to locate the representations of particular facts, values, or goals, and to verify that they match what the system’s behavior suggests. The model says it wants to be helpful; can we find where helpfulness is represented and check that it’s real? This is the interpretability that alignment requires.31

We’re not there yet. We’re not close. But we’re closer than we were five years ago, and the pace of progress is increasing.32

The humility required

I should be honest about what we don’t know.

The features we find may not be the “real” features, the ones the network is “actually” using; they may be artifacts of our analysis methods. The map might not match the territory. Even evaluating whether an explanation is faithful is nontrivial; proposed criteria exist, but faithfulness evaluation remains an active research area.33

Current techniques may not scale. They work on small models and specific behaviors. Larger models might be qualitatively different, not just quantitatively harder. The vocabulary we’re building might not extend.34

It’s even possible that understanding is not achievable in principle. Perhaps sufficiently complex systems are simply opaque, not because we lack tools, but because opacity is intrinsic to their operation. This seems unlikely to me, but I can’t rule it out.35

What I do know is that we have to try. Not because success is guaranteed, but because the alternative, governing systems we can’t inspect, is worse. The pathologist doesn’t understand every molecular detail of cancer. But she understands enough to help patients. Maybe that’s the right standard: not complete understanding, but enough to act wisely.36

Reading without a dictionary

There’s a library in the network. Representations of facts, concepts, relationships, procedures. It was written in a language that evolved under pressure to predict, without any expectation of being read.

We’re trying to read it anyway.

The first attempts are crude: probe here, measure there, name patterns that seem stable. The vocabulary is thin and probably wrong in places. But it’s growing. Each year we can say more about what’s inside.

The pathologist looks at tissue and sees meaning that took generations to learn to see. Her vocabulary is rich, precise, tested against thousands of cases. It lets her act with confidence, distinguish health from disease, guide treatment.

We’re in the early years of building a vocabulary like that for minds that weren’t designed to be read. We’re building stains and atlases first, tools that reveal stable structure, then developing the vocabulary that can name what the tools reveal. The work is slow and uncertain. But there’s no shortcut. You can’t govern what you can’t see, and you can’t see without a language for seeing.

The question isn’t whether the machine is intelligent. The question is whether we can be intelligent about the machine.

Footnotes

  1. Modern cellular pathology traces to Rudolf Virchow’s Die Cellularpathologie (1858), which established disease as cellular malfunction. The vocabulary developed over the late 19th and 20th centuries through advances in microscopy, staining techniques, and molecular biology. Molecular diagnostics and computational pathology remain active frontiers.

  2. The “black box” framing is slightly misleading. We can examine every weight and activation (when we have access); inspection just doesn’t automatically yield understanding. The box is transparent; we simply can’t read the language.

  3. Induction heads are attention heads that learn to complete patterns by copying from earlier in the sequence. They’re a key mechanism for in-context learning. See Olsson et al. (2022), “In-context Learning and Induction Heads.”

  4. The Indirect Object Identification (IOI) task: given “When Mary and John went to the store, John gave a drink to,” predict “Mary.” Wang et al. (2023) reverse-engineered the circuit GPT-2 small uses for this. It remains one of the most complete mechanistic analyses of a natural-language behavior.

  5. These results are encouraging but limited. We can interpret specific behaviors in specific models, not yet arbitrary behaviors in arbitrary models. The gap between “we found a circuit” and “we understand the model” is large.

  6. This is partly what “How Cells Decide” explored: cells as organized systems whose parts have roles. The organization makes them comprehensible. The question is whether neural networks have analogous organization.

  7. The neural network terminology borrowed from neuroscience has been criticized as misleading. A network “neuron” shares almost nothing with a biological neuron except a vague mathematical analogy. But the terminology is entrenched.

  8. This is the central theme: structure that emerges from optimization rather than design. The same dynamic appears in evolution, markets, and other complex adaptive systems.

  9. The designed/emerged distinction isn’t absolute; evolution “designs” in some sense, and human designs often have emergent properties. But the difference in starting legibility is real.

  10. Chris Olah has developed this analogy in detail. See “Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases” (2022). The key insight: we’re not reading source code but reverse-engineering optimized binaries.

  11. This taxonomy (representations, algorithms, failures) maps loosely onto Marr’s levels, but the concerns are more practical than philosophical.

  12. The field sometimes calls these “features” or “concepts” or “directions.” The terminology is unstable because we’re not sure what we’re talking about. Are these atoms of meaning? Convenient approximations? Artifacts of analysis? Probably all three, depending on the case.

  13. Superposition is why individual neurons are hard to interpret: each neuron participates in encoding multiple features, and each feature is distributed across multiple neurons. Sparse autoencoders try to find directions that correspond to individual features. See Elhage et al. (2022), “Toy Models of Superposition.”

  14. The “mechanistic interpretability” research program aims to reverse-engineer the algorithms from the weights. Progress has been real but limited to specific behaviors in relatively small models.

  15. Adversarial examples are the canonical case: inputs designed to cause misclassification by exploiting the model’s internal structure. We have partial explanations (e.g., Goodfellow et al.’s linearity hypothesis), but we still can’t reliably predict which inputs will break which models.

  16. This is why behavioral testing alone isn’t enough. You can test extensively and still miss failures, because you don’t know where to look. Understanding the internals tells you where to look.

  17. The field goes by various names: interpretability, explainability, mechanistic interpretability, transparency research. I use “interpretability” as the umbrella term.

  18. Sparse autoencoders (SAEs) learn to reconstruct a model’s activations using a sparse set of features. The sparsity encourages decomposition into interpretable units. See Bricken et al. (2023), “Towards Monosemanticity.”

  19. The finding that features are often “linear” (corresponding to directions in activation space) is both surprising and convenient. Surprising because there’s no obvious reason for linearity; convenient because linear structure is easier to analyze.

  20. The “circuits” research program was pioneered by Olah et al. See the Transformer Circuits thread for induction heads, IOI, and other case studies. Anthropic’s “Tracing the thoughts of a large language model” (2025) extends this to larger models.

  21. “Baroque” is the right word. The circuits are complex, with redundancy and unexpected interactions. This isn’t a design flaw; it’s what optimization produces.

  22. The residual stream framework was developed by Elhage et al. (2021), “A Mathematical Framework for Transformer Circuits.” It provides a unified view of how information flows through transformers.

  23. Progress is real but easy to overstate. We can interpret specific behaviors in specific models, not yet arbitrary behaviors in arbitrary models.

  24. This is the position of much interpretability research: given enough effort, understanding is achievable. The success of biology suggests it’s not naive.

  25. This is the “bitter lesson” concern (Sutton, 2019): perhaps scale and compute matter more than human-legible structure. I don’t fully agree, but the concern deserves acknowledgment.

  26. Evidence for modularity includes the success of feature-finding methods and the existence of interpretable circuits. But we don’t have a theory predicting when modularity will emerge, so extrapolation is risky.

  27. The time pressure is real. See Amodei (April 2025), “The Urgency of Interpretability.” GPT-2 was somewhat interpretable; GPT-4 is much harder. The research community is racing to build tools before the systems become too complex to understand.

  28. Different research communities have different goals: explaining individual decisions (for legal compliance), verifying safety properties, editing or controlling model behavior. These overlap but aren’t identical.

  29. This is the “gears-level model” standard from epistemology: understanding a system means knowing how the parts interact, well enough to predict the effects of interventions.

  30. This is the “shortcut learning” problem. Models often learn spurious correlations rather than robust features. See Geirhos et al. (2020), “Shortcut learning in deep neural networks.”

  31. This is the alignment case for interpretability. If we can locate where values and goals are represented, we can check whether they’re what we intended. Without this, alignment verification is purely behavioral.

  32. Progress in the last few years: sparse autoencoders, circuit analysis, attribution methods, probing classifiers, feature visualization. The field is young and accelerating.

  33. This is a real concern. SAE features might be decompositions that our methods find, not decompositions that the model “uses.” Proposed faithfulness criteria exist (e.g., completeness, minimality in the IOI analysis), but faithfulness evaluation remains an active research area.

  34. Scaling is the central open question. Current techniques work on models with millions to billions of parameters. Frontier models have hundreds of billions. It’s unclear whether the techniques extend.

  35. Some complexity theorists have argued that certain systems are irreducibly complex. I’m skeptical (historical precedent suggests understanding usually catches up) but I can’t prove it.

  36. The pathologist analogy again: understanding enough to act, even without complete understanding. This may be the realistic target.