The Ornamental and the Load-Bearing

January 15, 2026 · 14 min read

A stonemason working at Chartres in the early thirteenth century would have had no trouble telling you that flying buttresses mattered. Remove them and the vault pushes the walls outward until they bow, crack, and fail. He knew that in the way you know a tool: not as a theorem, but as a fact of life.

Ask him about everything else (the thickness of a particular pier, the pattern of stone courses, the tracery in a window, the carved moldings) and the certainty dissolves into experience, custom, and taste. Some choices were structural. Some were decorative. Some were a little of both. And some were neither, except in the sense that a craft accumulates habits that make success more likely when you can’t yet separate necessity from tradition.[1]

Eight centuries later, an engineer can simulate a cathedral with vocabulary and machinery the medieval builders lacked: stress paths, safety factors, boundary conditions. The building is the same; our ability to read it has changed. What used to be “that looks strong enough” becomes a map of compressive forces and tensile vulnerabilities. Whole categories of detail migrate from mystery to explanation.

This is a general pattern. We build systems that work before we know why they work. Understanding arrives late, and when it does, it often arrives as a sorting process: which parts were actually holding the system up, and which parts were there because we didn’t yet know how to leave them out.

By load-bearing, I mean: causally necessary for the system to do the job we care about, under the conditions we care about. By ornamental, I mean: present (sometimes helpful historically), but removable without breaking that job, at least in those conditions.

The definitions matter because they hide a trap. “Load-bearing” is not an intrinsic property of a component. It is always load-bearing for something: for stability, for ease of construction, for resilience to damage, for comprehensibility, for social legitimacy, for beauty. Change the goal and the categories can swap.

The history of technology is, in part, the history of learning to tell the difference.

The arc of simplification

In the early days of powered flight, the airplane wore its uncertainty on its body.

The Wright Flyer was braced like a nervous thought: wings supported by struts and wires, control achieved by twisting the wing itself. Many early aircraft looked similarly baroque: redundant bracing, exposed structure, control schemes that read like experiments because they were. When you don’t yet know how much strength you need, you add strength. When you don’t yet know which control mechanism will be stable, you bring options.

Over the next few decades, aircraft design got cleaner. External bracing retreated into cantilever wings. Stressed-skin construction made the surface do structural work. Ailerons replaced wing warping as the standard way to control roll. In most dimensions, the airplane learned to look less like scaffolding and more like a single coherent object.[2]

Medicine has a parallel story. For a long time, surgery meant opening the body because the surgeon’s senses (eyes, hands) were the instrument. The incision wasn’t the point; it was the access fee. Laparoscopy demonstrated that the fee could be paid differently: small ports, a camera, specialized tools. Robotic systems moved the surgeon’s hands even farther from the site without losing precision. In some cases, the arc continues past surgery altogether: what once required an operation becomes a drug regimen, targeted radiation, or a minimally invasive procedure. The “essential” incision turns out to have been an expensive workaround for limited visibility and control.

Chemistry tells the same story in a more abstract dialect. The first synthesis of a complex molecule is often a thicket: protecting groups, detours around selectivity problems, steps that exist to undo damage done by earlier steps. Later syntheses are leaner. They don’t change what needs to happen (the molecule still requires certain bonds, certain stereochemistry), but they change how much scaffolding you carry along the way. Organic chemistry even has a kind of moral ideal for this: the route where every step does necessary work, with nothing wasted.[3]

In each case, simplification is not the absence of intelligence; it is what intelligence looks like after a craft has learned where it can safely stop being paranoid.

The transformer, as a moving example

Machine learning is living through this arc at high speed, with the extra twist that we can watch it happen in public.

The original Transformer architecture (2017) was already a simplification: it discarded recurrence, the thing many assumed language models needed, and made attention the primary mechanism. But it also arrived with a collection of choices that were, in the authors’ own framing, pragmatic: head counts, dimensional ratios, normalization placement, positional encoding scheme. Many of these details were not derived from first principles; they were the result of “try what works,” because the field did not yet know what it could safely remove.[4]

Since then, researchers have done what engineers do once something works: ask which parts are structural.

Attention heads are an unusually vivid case. At inference time, many trained models can lose a large fraction of their heads with surprisingly little degradation. That does not mean the heads were pointless; it means the final network contains redundancy. And redundancy can be doing different jobs at different times. A head that looks ornamental at the end may have helped during training: providing gradient pathways, stabilizing early representations, acting as a temporary crutch that later became unnecessary. The cheap conclusion (“heads don’t matter”) is wrong; the interesting conclusion is that some kinds of structure are load-bearing for learning even if they are not load-bearing for prediction.[5]
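To make the intervention concrete, here is a toy sketch in NumPy. All dimensions and weights are illustrative, not drawn from any real model; the point is only the shape of the experiment: zero out a head’s output at inference and measure how far the result moves.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_qkv, w_out, n_heads, head_mask=None):
    """Toy self-attention. head_mask[h] == False zeroes head h's
    output before the final projection: inference-time ablation."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    outs = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(d_head)
        head_out = softmax(scores) @ v[:, sl]
        if head_mask is not None and not head_mask[h]:
            head_out = np.zeros_like(head_out)  # the ablation itself
        outs.append(head_out)
    return np.concatenate(outs, axis=-1) @ w_out

rng = np.random.default_rng(0)
seq, d_model, n_heads = 6, 16, 4
x = rng.standard_normal((seq, d_model))
w_qkv = 0.1 * rng.standard_normal((d_model, 3 * d_model))
w_out = 0.1 * rng.standard_normal((d_model, d_model))

full = multi_head_attention(x, w_qkv, w_out, n_heads)
mask = np.array([True, True, True, False])  # drop the last head
ablated = multi_head_attention(x, w_qkv, w_out, n_heads, head_mask=mask)
drift = np.abs(full - ablated).mean()  # how much the output moved
```

In a trained model you would measure the drift in task loss rather than in raw activations, and across many inputs; but the logic is the same three steps: remove, re-run, compare.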

The feedforward block is another case. The original Transformer used a particular expansion ratio (the famous “4×” hidden dimension), and that ratio became a default partly because it was a good bet, not because it was sacred. Later work explored different shapes and allocations: smaller MLPs, rearranged compute, sparse activation via mixture-of-experts. The question is not “can we remove the FFN?” so much as “what computation is the FFN doing, and how cheaply can we get it?” A structural element doesn’t become ornamental just because you can reorganize it. Sometimes you’re not eliminating load; you’re moving it to a different beam.
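A sketch of what the “4×” choice actually costs, in toy NumPy with illustrative sizes (biases and layer norm omitted): the ratio enters the parameter count linearly, which is why reshaping this block is such a tempting place to move load.

```python
import numpy as np

def feedforward(x, w1, w2):
    """Position-wise FFN: expand, apply ReLU, project back down."""
    return np.maximum(x @ w1, 0.0) @ w2

def ffn_params(d_model, ratio):
    """Weight count for a given expansion ratio (biases omitted):
    two d_model x (ratio * d_model) matrices."""
    return 2 * ratio * d_model * d_model

rng = np.random.default_rng(0)
d_model, ratio = 512, 4                    # the classic "4x" default
w1 = 0.02 * rng.standard_normal((d_model, ratio * d_model))
w2 = 0.02 * rng.standard_normal((ratio * d_model, d_model))
x = rng.standard_normal((10, d_model))

y = feedforward(x, w1, w2)                 # shape preserved: (10, 512)
cost_4x = ffn_params(512, 4)               # halving the ratio halves this
cost_2x = ffn_params(512, 2)
```

Nothing in the forward pass forces ratio = 4; it is a dial, and mixture-of-experts designs can be read as turning the same dial per-token rather than globally.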

Then there’s precision, where the story is almost embarrassingly clear. For years, training in 32-bit floating point was treated as the natural way to do things, less because anyone had proven it necessary than because it was what hardware supported and what software assumed. Mixed precision showed that 16-bit training can work. Quantization showed that 8-bit inference often works. Then came robust 4-bit inference, and even 4-bit–style fine-tuning approaches that keep most weights quantized while learning small adapters. The lesson isn’t “precision doesn’t matter.” It’s that our early “requirements” were often defaults inherited from tools, not constraints imposed by the problem.[6]
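The core idea is small enough to show. Here is a minimal sketch of symmetric per-tensor int8 weight quantization (real systems add per-channel scales, outlier handling, and calibration; this is the idea, not the engineering): store integers plus one float scale, and accept a bounded rounding error.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map weights to int8 with a
    single scale chosen so the largest weight lands at 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction; error is at most half a step."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()  # bounded by s / 2
```

The surprise of the last few years is not that this works at 8 bits; it is how far the same bargain, fewer bits for a controlled error, can be pushed before the model notices.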

Training itself is full of hedge-like components: warmup schedules, batch-size rituals, optimizer folklore. Some are genuinely load-bearing; some are workarounds for instability created elsewhere; some are simply reliable habits that survive until a better explanation arrives. We keep them because they make progress possible when understanding is incomplete, in the same way a medieval builder keeps thickness in a pier because he cannot compute the exact minimum.[7]
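Warmup is a good example because the hedge fits on one line. The original Transformer paper’s learning-rate rule, reproduced here as a sketch (the constants are the paper’s defaults), ramps up linearly for a fixed number of steps and then decays as the inverse square root of the step count:

```python
def lr_schedule(step, d_model=512, warmup_steps=4000):
    """Linear warmup then inverse-square-root decay, as in the 2017
    Transformer recipe. A stabilizing hedge that became a default."""
    step = max(step, 1)  # avoid step 0
    return d_model ** -0.5 * min(step ** -0.5,
                                 step * warmup_steps ** -1.5)

early = lr_schedule(100)    # still ramping up
peak = lr_schedule(4000)    # the two branches meet here
late = lr_schedule(40000)   # decaying
```

Why 4000 steps and not 400? The honest answer, then and often now, is: because it worked. That is the pier-thickness situation in miniature.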

If you want one sentence for the whole phenomenon: early on, we pay for redundancy in order to buy momentum.

When progress adds complexity

But the story does not always end with elegance.

Sometimes we strip away “unnecessary” detail and then discover, through failure, measurement, or new questions, that we removed something we were relying on without realizing it.

Molecular biology offers the cleanest cautionary tale. The early textbook image (the central dogma, DNA → RNA → protein) wasn’t wrong so much as it was a deliberately narrowed lens. Reality contained feedback and exceptions: reverse transcription, regulatory layers, heritable changes in gene expression, and a vast universe of non-coding transcripts that refuse to behave like noise. Some of these complexities were always there; our early models simply didn’t need them to answer the questions we were asking. Then the questions sharpened, and the “ornament” turned out to be structure.[8]

Climate science traces a similar arc. Energy-balance models (sunlight in, infrared out) are powerful, and for some questions they are enough. But prediction is a jealous standard. If you want regional forecasts, extremes, decadal variability, monsoon behavior, cloud feedbacks, ice dynamics, carbon-cycle coupling, aerosol effects, then the system demands that your model contain more of the world. The simple model wasn’t “wrong.” It was incomplete in the directions that mattered once we tried to use it for more demanding ends.[9]

Security is the most sobering case because the hidden structure isn’t only nature’s; it’s your adversary’s imagination. Every time we declare a system “secure,” we are usually declaring that we have not yet met the attack that will teach us otherwise. Classes of vulnerability become legible in waves (memory corruption, injection, cross-site attacks, side channels, speculative execution, supply-chain compromise) and each wave makes yesterday’s simplicity look less like elegance and more like naïveté.[10]

In the simplification arc, we over-built because we were uncertain, and later understanding lets us remove hedges. In the complexification arc, we under-built because we were ignorant, and later reality forces us to admit what we left out.

Both are progress. They just move in opposite directions.

How understanding arrives

The stonemasons of Chartres could not do finite element analysis, but they still produced structures that stood for centuries. Their understanding lived in procedures: where to place mass, how to distribute thrust, which proportions survived winter and wind. It was knowledge without explicit mechanism.

We still use that kind of understanding, even in domains that pretend to be purely theoretical. And we gain understanding through several channels:

  • Theory: a derivation that tells you which relationships must hold, revealing why an element matters.
  • Experiment: systematic interventions. Remove the component, change the parameter, see what breaks.
  • Accident and failure: the collapse that teaches you what the safety factor was buying, or the weird shortcut that works and forces you to rethink what was essential.
  • Instrumentation: new ways of seeing (better sensors, better logging, better interpretability tools) that reveal structure you were previously blind to.

Ablation (the habit of “take it out and measure the change”) is especially interesting because it produces causal knowledge without requiring an explanatory story. You learn that a component matters before you learn what it does. This is useful. It is also unsettling. It tells you what to keep, but not what you are keeping it for.
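The procedure is simple enough to write down in full. Here is a toy harness with a stand-in system (every name here is illustrative, chosen to echo the cathedral): remove each component in turn, re-run, and record the score drop.

```python
def ablation_study(components, run_system):
    """Measure each component's marginal contribution by removing it
    and re-running: causal knowledge without a mechanistic story."""
    baseline = run_system(set(components))
    return {c: baseline - run_system(set(components) - {c})
            for c in components}

# A toy stand-in: one part carries the load for this metric,
# the other is (for this metric) ornamental.
def toy_system(active):
    return 1.0 + (2.0 if "buttress" in active else 0.0)

drops = ablation_study(["buttress", "gargoyle_face"], toy_system)
# drops["buttress"] == 2.0, drops["gargoyle_face"] == 0.0
```

The harness tells you the buttress matters and the face doesn’t, for this one metric. It says nothing about why, and nothing about other metrics, which is exactly the limitation worth keeping in mind.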

The deepest confidence arrives when channels converge: when ablation says “this matters” and a mechanistic story says “here’s the computation it performs” and the two line up across contexts. When they don’t line up, something is usually wrong in a way worth studying: the story is too simple, the measurement is missing a confound, or the system is compensating with redundancy you didn’t notice.

Understanding is not a single revelation. It is an accumulation of ways to be less surprised.

The economics of hedging

When you don’t know what is load-bearing, you have two basic options: move slowly, or build insurance into the structure.

Most fields, especially those under competitive pressure, choose insurance.

Thicker piers. Extra bracing. Higher precision than seems necessary. Warmup schedules that stabilize training. More attention heads than the final model will need. A second safety interlock. A third. These aren’t necessarily mistakes; they’re rational hedges against ignorance. They’re what it looks like to pursue results while admitting that the theory is incomplete.

But insurance has a cost. It can hide the true structure by letting everything “work” without clarifying why. It can become tradition. It can harden into doctrine: this is how you do it, even when the real reason is merely this is how we stopped it from falling down while we were still learning.

So there is also an opposite skill: knowing when to strip.

Stripping is not the same as optimizing. It is a form of epistemic tightening. You remove parts not only to make the system cheaper, but to make the explanation sharper, because every unnecessary component is another place your understanding can hide.

The risk, of course, is stripping too early. Sometimes the redundancy you are eager to eliminate is not waste; it is robustness you have not learned to name. Biology is full of apparent duplication that turns out to be resilience to perturbation. Security is full of “paranoia” that turns out to be the minimum required to survive an adaptive opponent. Even in machine learning, a component that seems useless for loss can be crucial for stability, interpretability, fairness, or out-of-distribution behavior: properties you might not be measuring until it’s too late.

The art is in timing: hedge when you don’t understand; strip when you truly do; and remain suspicious of the moment you start feeling certain.

Coda: the gargoyle’s job

A gargoyle is not just a monster bolted to a cathedral. It is a drainage system: a stone throat that throws water away from the wall so the masonry lasts longer.

And then it has a face.

The face is not necessary for moving water. A plain spout would work. If you define “load-bearing” as “keeps the building standing,” the face looks ornamental.

But cathedrals are not only physical structures. They are also social ones. They must be funded, defended, maintained, and remembered. Ornament is one way a building recruits human attention and care. A grotesque that makes you look up is also a mechanism for ensuring that someone notices the crack, the leak, the weathering, the need for repair.

Meaning can be a kind of structural support: not for stone, but for the human systems that keep stone from returning to rubble.

So the final question isn’t “what can we remove?” It’s “what are we trying to hold up?”

The flying buttress is load-bearing for stability. The gargoyle’s spout is load-bearing for drainage. The gargoyle’s face may be load-bearing for meaning, and indirectly for maintenance, continuity, and survival.

In machine learning we often optimize one number and call everything that doesn’t improve it “ornament.” But the number is not the whole job. Systems also need to be understandable, robust, fair, and governable. The “ornamental” component might be the one that makes behavior legible, or the one that makes the system resilient in cases your benchmark never tests.

We learn what we need by learning what we are building for. And we learn that late, usually after we’ve already built something that works.

That is why so much of progress feels like renovation: tearing away scaffolding, discovering hidden beams, and realizing that some of the decoration was holding up the world.

Footnotes

  1. Medieval builders had sophisticated practical knowledge (geometry, proportional rules, embodied guild heuristics) without a modern framework for deriving them from first principles. When you can’t calculate margins precisely, you build margins into the stone.

  2. This was not a straight line. Some simplifications in airframes came alongside new complexity elsewhere (jets, avionics, fly-by-wire, redundancy management). But the visible “overbuilding” of early structures is a good example of insurance against uncertainty becoming internalized as understanding improves.

  3. Retrosynthetic analysis formalized the “work backwards from the target” mindset: identify the key transformations and design a route around them. The refinements that follow (step economy, avoiding protecting groups when possible) are often best understood as stripping away concessions to ignorance, limitations, or convenience.

  4. The Transformer paper used sinusoidal positional encodings in its main setup and reported learned variants as working similarly. The broader point still holds: positional information had to be injected somehow, but the “right” way was not settled by theory.

  5. Multiple lines of work suggest that many heads are redundant at test time. The harder question is whether you can train from scratch with far fewer heads and reliably reach the same performance across tasks and scales. “Redundant at inference” and “unnecessary for learning” are different claims.

  6. “4-bit” is best thought of as a family: weight-only quantization for inference, quantization-aware techniques in some settings, and approaches like adapter-based fine-tuning where the base model remains quantized. The direction of travel is real even if the mechanisms vary.

  7. Learning-rate warmup is a good example of a hedge that became standard because it improved reliability. Whether it is always treating a fundamental need or sometimes treating symptoms depends on the surrounding training recipe. The deeper point: we often stabilize systems before we can explain the stabilization.

  8. The “central dogma” was always a statement about certain kinds of information flow, but it became reified as a one-sentence law in popular understanding. Many “exceptions” are better understood as discoveries that the one-sentence version was never a complete map.

  9. Simple models can capture global mean behavior surprisingly well. But many policy- and planning-relevant questions depend on spatial structure, feedbacks, and coupled subsystems that simple models intentionally compress away.

  10. Security is special because the environment is adversarial. What is “load-bearing” is partly determined by attackers adapting to the system, not only by the system’s internal logic.