Why Systems That Make Sense Still Fail
A quiet mathematical reason why locally correct systems produce globally wrong behavior — traced through Foucault's pendulum, the 2003 blackout, Knight Capital's forty-five minutes, and the night before Challenger.
- essays
- systems
- geometry
There is a particular kind of failure that does not feel like a bug.
Nothing is obviously broken. Every component is behaving correctly. Every local decision makes sense. And yet, zoomed out, the system as a whole is wrong. If you have worked on a distributed system, a control loop, a supply chain, or a company, you have seen this. It is not dramatic. It is subtle. A number comes back slightly off. A decision made upstream does not quite reconcile downstream. You patch it. It reappears somewhere else. Eventually something snaps.
On August 14, 2003, a little after four in the afternoon, the high-voltage grid of the northeastern United States and Ontario collapsed. No plant had exploded. No one had attacked it. An alarm on a control server in Ohio silently failed, a sagging line in Walton Hills brushed a tree, and a handful of ordinary events — each one of which the region's operators had handled a hundred times before — composed, in the wrong order around the wrong loop, into hours of darkness, stretching into days in places, for some fifty million people. The federal post-mortem is admirably calm about this. Each local decision, it notes, was reasonable given what the operator knew. The system failed anyway.
We have vocabulary for the aftermath. Cascading failure. Race condition. Organizational drift. What we mostly do not have, at least in working engineering culture, is a name for the shape of the problem — for the thing that was wrong about the composition of the parts even though nothing was wrong about the parts. There is one, though. It comes from geometry, and once you notice it, you start seeing it everywhere.
A walk around the world
Imagine carrying a small arrow around a square drawn on a flat table. It starts pointing north. You walk east to the next corner, keeping the arrow aimed the same way — do not twist it, just carry it. Then north, still pointing north. Then west. Then south, back to the corner you began at. The arrow still points north. This is the boring and expected case: go around a closed loop, come back unchanged. It is what our intuition, trained on flat tabletops, expects all motion in all spaces to do.
Now do the same thing on the surface of the Earth. Start on the equator, somewhere off the coast of Ecuador, arrow pointing north. Walk east along the equator until you are in Gabon, ten thousand kilometers later, arrow still pointing north. Turn and walk straight north up a meridian to the pole. Then walk straight back down a different meridian to the equator, and along the equator home to Ecuador.
You did not twist the arrow. At every step you kept it pointing as straight ahead as the local ground allowed. And yet when you return to your starting point, it is aimed somewhere else. The loop rotated it — purely because of the shape of the space you walked on.
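If you do not trust the geometry, you can check it numerically. Here is a minimal sketch in Python: it discretizes the loop into small steps and carries the arrow by projecting it onto each new tangent plane, which is the discrete version of never twisting it. For the quarter-equator loop above, the arrow should come back rotated by about ninety degrees — the spherical excess of the triangle it enclosed. The place names are just labels for the three corners.

```python
import numpy as np

def great_circle_arc(a, b, steps=2000):
    """Discretize the great-circle arc from unit vector a to unit vector b."""
    theta = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)

def transport(path, v):
    """Carry v along the path, projecting onto each new tangent plane."""
    for p in path:
        v = v - np.dot(v, p) * p      # drop the component normal to the sphere
        v = v / np.linalg.norm(v)     # keep unit length
    return v

ecuador = np.array([1.0, 0.0, 0.0])   # on the equator
gabon   = np.array([0.0, 1.0, 0.0])   # ninety degrees of longitude east
pole    = np.array([0.0, 0.0, 1.0])   # the north pole

loop = np.vstack([
    great_circle_arc(ecuador, gabon),
    great_circle_arc(gabon, pole),
    great_circle_arc(pole, ecuador),
])

north = np.array([0.0, 0.0, 1.0])     # the arrow starts pointing north
arrow = transport(loop, north)

angle = np.degrees(np.arccos(np.clip(np.dot(north, arrow), -1.0, 1.0)))
print(f"arrow came back rotated by ~{angle:.1f} degrees")   # ~90.0
```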
Mathematicians call this holonomy: the failure of a state to return to itself after traveling around a loop. If you stand under the dome of the Panthéon in Paris, you can watch a version of it happen in slow motion. A sixty-seven-meter pendulum, first hung there by Léon Foucault in 1851, swings in what appears, to an observer on the marble floor, as a plane that gradually rotates through the day. Nothing is pushing the pendulum sideways. What is happening is that the Earth is carrying it around a closed loop on a curved manifold, and the plane comes back rotated. It is, essentially, the same arrow. Just much bigger and much slower.
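The back-of-envelope version of this is standard: the swing plane precesses by 360° × sin(latitude) per sidereal day, a direct consequence of the holonomy of the latitude circle the building traces. A quick check for the Panthéon:

```python
import math

latitude = math.radians(48.85)        # the Pantheon, Paris
sidereal_day = 23.934                 # hours
rate = 360 * math.sin(latitude) / sidereal_day
print(f"{rate:.1f} deg/hour; full turn in {360 / rate:.1f} hours")
# ~11.3 deg/hour, about 32 hours per rotation -- slow enough to watch
```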
Holonomy is not a defect in the arrow or in the walker or in the pendulum. It is a feature of the space they are moving through. Loops make curvature visible.
From geometry to systems
The claim I want to make is this: the same phenomenon appears in almost every large engineered system we build, and we have not been teaching ourselves to look for it.
Replace “arrow” with anything a system carries around. A state. A belief. A price. A configuration. An invariant. A commitment. Replace “walking along a path” with propagation — data moving between services, decisions moving between teams, power moving across a network. Replace “loop” with feedback. A write that triggers a read that triggers an update. A strategy that sets a priority that shapes a metric that rewrites the strategy.
A system can go through a perfectly reasonable sequence of local steps — and not come back to where it started.
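Here is the smallest demonstration of that I know how to write, a toy in Python: two services converting money at an agreed rate, each one — defensibly, by its own contract — truncating to whole cents. The rate, the service boundaries, and the truncation policy are all invented for illustration.

```python
from decimal import Decimal, ROUND_FLOOR

CENT = Decimal("0.01")
RATE = Decimal("0.9132")   # a made-up exchange rate

def usd_to_eur(usd):
    # service A truncates to cents, per its own (reasonable) contract
    return (usd * RATE).quantize(CENT, rounding=ROUND_FLOOR)

def eur_to_usd(eur):
    # service B also truncates to cents, per its own (reasonable) contract
    return (eur / RATE).quantize(CENT, rounding=ROUND_FLOOR)

amount = Decimal("10.00")
for trip in range(1, 6):
    amount = eur_to_usd(usd_to_eur(amount))
    print(trip, amount)    # 9.99, 9.98, 9.97, ... a cent of drift per loop
```

Each service honors its contract exactly. The drift does not live in either component; it lives in the loop.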
Knight Capital, forty-five minutes
Knight Capital Group was, in 2012, one of the largest market-makers on the New York Stock Exchange — the quiet counterparty on a meaningful fraction of every retail trade in America. On the morning of August 1, they pushed a software deployment to eight servers that handled their order-routing code. Seven of the eight got the new version. One did not. A feature flag that had been dormant for nearly a decade, repurposed in the new code to mean something different, got switched on.
For forty-five minutes, that one server sent orders into the market according to the old meaning of the flag, while the rest of the system priced and reported them according to the new one. Every local component was doing exactly what it had been told. Every contract between services was respected. No process crashed. And during those forty-five minutes, Knight Capital bought high and sold low, in enormous volume, across a hundred and fifty symbols. They lost roughly four hundred and sixty million dollars — about four times their net income for the prior year — and were acquired within a few months.
Nothing in the system was broken, in the sense of being unable to do what it was asked. It was the composition of the parts, traversed around the specific loop of order in → match → fill → report → adjust, that did not close. The state did not come back. Every distributed system sits on top of latency differences, caches, retries, eventual consistency; each of those is defensible alone; each was chosen deliberately. But they compose, and they compose around loops, and when the loop does not close, the state drifts. The only reason you do not see a Knight Capital every week is that the drift per loop is usually small, and that most loops get interrupted — by a timeout, by a human, by a cache eviction — before the error grows. Most. Not all.
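To be concrete about the mechanism — and this is a deliberately toy reconstruction, not Knight's actual code — the failure needs nothing more exotic than one shared flag, two code versions, and a deploy that misses a machine:

```python
def behavior(flag_set, runs_new_code):
    """What one server does with the shared flag, under its own codebase."""
    if not flag_set:
        return "idle"
    # The same bit means two different things on the two code versions:
    return "route via new logic" if runs_new_code else "replay old test logic"

fleet = [True] * 7 + [False]    # eight servers; the deploy missed one
print({behavior(flag_set=True, runs_new_code=v) for v in fleet})
# {'route via new logic', 'replay old test logic'} -- one flag, two
# behaviors, and not a single crashed process
```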
Power that balances locally and does not globally
Electric power systems have been studied as a source of failure for longer than software has existed. At the physical layer, power flow obeys Kirchhoff's laws: currents sum to zero at every node, voltage drops sum to zero around every loop. Locally, beautifully, unambiguously conservative. On top of the physical layer we run a market. In the United States most grid regions use something called locational marginal pricing — the price at each bus is the marginal cost of supplying one more megawatt there, given the current set of physical and contractual constraints. Locally, this is also rational. Every price reflects a real cost somewhere.
Now route power around the network, reapply congestion, let constraints bind and unbind. Trace a loop through the price graph. You can find cases where the edge-by-edge price spreads, going around, do not sum to zero. The local relationships are all consistent. The loop is not. That nonzero sum is literal holonomy — and it is exactly what gives financial transmission rights their value, which is why traders care intensely about it. It is not a bug in the market. It is the geometry of the market. The 2003 blackout I opened with is the physical-layer cousin of this story: plenty of loops, plenty of reasonable local decisions, no single wrong operator, a system with accumulated curvature that nobody was tracking.
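The check itself is almost embarrassingly simple. In the toy below — three buses, invented numbers — each edge carries the locally quoted spread for moving a megawatt along it. If those spreads were differences of a single per-bus price, every closed loop would sum to zero. This one does not:

```python
# Invented spreads for illustration; each is defensible pairwise.
spreads = {("A", "B"): 4.0, ("B", "C"): 1.5, ("C", "A"): -2.0}

loop = [("A", "B"), ("B", "C"), ("C", "A")]
holonomy = sum(spreads[edge] for edge in loop)
print(holonomy)   # 3.5, not 0: no single price field explains these edges
```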
Rooms full of reasonable people
The third example is uncomfortable because it is the most familiar. On the night of January 27, 1986, the evening before the launch of the Space Shuttle Challenger, Allan McDonald — a senior engineer at Morton Thiokol — refused to sign the launch recommendation. The O-rings in the solid rocket boosters had never been tested below fifty-three degrees Fahrenheit, and the morning forecast at the Cape was twenty-nine. He and his colleagues had data showing the seal's behavior degraded badly with cold. NASA asked Thiokol to reconsider. Thiokol's management, aware of the contractual and political stakes, reversed the recommendation over the engineers' objections. Challenger launched the next morning at 11:38, and came apart at seventy-three seconds.
Read the Rogers Commission report and you will find, at every individual step, people behaving defensibly given the information they had and the loop they were in. The engineers flagged the risk. Their management weighed the risk against schedule pressure. NASA weighed the recommendation against a history of successful launches. The organizational feedback loop — risk goes up, pressure goes up, tolerance re-calibrates, risk goes up — completed once, and the state did not come back.
Every large organization has loops like this. A product team optimizes for growth. A finance team optimizes for margin. Leadership tells a story about the long term. Strategy sets priorities, teams execute, metrics roll up, leadership rewrites strategy. Each individual step is coherent. Nobody is being irrational. And yet run the loop enough times and priorities drift, incentives misalign, commitments made early stop reconciling with decisions made later. That is not moral failure. It is curvature.
Local correctness is not enough
The Normal Accidents argument, due to the sociologist Charles Perrow in 1984, is that in tightly coupled, complex systems, certain failures are not only possible but structural. The usual gloss is that complexity produces randomness. That is not quite right. What is actually happening is that the system is accumulating holonomy along loops nobody is watching, and periodically the accumulated mismatch has to be paid off. Perrow's famous examples — Three Mile Island, chemical plants, the early commercial airline industry — look, in retrospect, like systems whose curvature was being hidden by the speed of local corrections until it wasn't. When the correction is too slow or the loop is too fast, the drift resolves catastrophically.
Most engineering mental models still assume: if all the parts are correct, the whole is correct. Holonomy is what that assumption costs you. Even if every local step is valid, the composition of those steps is not guaranteed to be.
Where curvature comes from
In practice, you do not start a system with holonomy. You add it, layer by layer, usually on purpose and usually for excellent reasons. Every time you introduce a new mechanism for the system to coordinate across itself — consensus protocols, markets, regulatory policies, shared caches, APIs between teams, abstraction layers between subsystems — you are doing something subtle. You are giving the system a new way to close loops. Each new coordination layer brings new degrees of freedom, new constraints, and new opportunities for those freedoms and constraints to disagree on what they mean when you compose them. One such layer is easy to reason about. Five of them, interacting, is effectively a non-Euclidean space whose curvature nobody has ever computed.
Not a bug. A feature.
Here is the uncomfortable part. Holonomy is not just the problem. It is also why the system does anything interesting.
A market without it cannot discover prices; a spot price in Des Moines that is forced to agree, instantly and always, with a spot price in Boston cannot allocate generation. A distributed system without it cannot scale, because strictly eliminating drift means serializing everything. An organization without it cannot adapt, because adaptation is exactly the willingness of different parts to reconcile in different ways in different contexts. A perfectly flat system — zero holonomy, total global consistency at every step — would be rigid, centralized, and almost certainly useless. The goal, then, is not to eliminate curvature. The goal is to see it forming before it bites.
The question to ask next time
The engineering question shifts. Instead of the default — is each component correct — you learn to ask:
- Where are the loops in my system?
- What is being transported around them — what state, belief, price, or invariant?
- When the state comes back, is it the same? If not, how fast is it drifting, in which direction, and under what perturbations?
These are concrete engineering questions. Answering them is not easier than debugging a single component, but it is a different kind of work. It replaces the hunt for the broken part with the mapping of the loops, and “the system is drifting” is a much more actionable diagnosis than “something is off.” A distributed-systems team that has mapped its loops can reason about how much consistency error the business can tolerate before the next Knight Capital episode. A grid operator who understands price holonomy can hedge it. An organization that can see its own feedback loops can notice when a decision is about to be made by a version of itself that no longer matches the version that made the earlier one.
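What answering them can look like in code, as a minimal sketch — the service names, the edge transforms, and the probe value here are all hypothetical placeholders: model the system as a graph whose edges transform whatever is being transported, enumerate the cycles, and push a probe around each one to see whether it comes back.

```python
def find_cycles(graph):
    """Enumerate simple cycles by DFS; fine for small service graphs."""
    cycles = []
    def dfs(node, path):
        for nxt in graph.get(node, []):
            if nxt == path[0] and len(path) > 1:
                cycles.append(path[:])
            elif nxt not in path:
                dfs(nxt, path + [nxt])
    for start in graph:
        dfs(start, [start])
    return [c for c in cycles if c[0] == min(c)]   # drop rotated duplicates

# Hypothetical edges: each maps the transported state (cents) onward.
edges = {
    ("billing", "ledger"):    lambda cents: cents,
    ("ledger", "reporting"):  lambda cents: round(cents, -2),  # rounds to dollars
    ("reporting", "billing"): lambda cents: cents,
}

graph = {}
for a, b in edges:
    graph.setdefault(a, []).append(b)

for cycle in find_cycles(graph):
    probe = state = 12345   # cents
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        state = edges[(a, b)](state)
    print(" -> ".join(cycle), "| drift per loop:", state - probe)
# billing -> ledger -> reporting | drift per loop: -45
```

The point is not the fifty lines; it is that “drift per loop” becomes a number you can put on a dashboard and watch.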
One last thought
For most of human history, geometry was how we understood space. Euclid, a compass, a straightedge. Then in the nineteenth century Gauss and Riemann noticed that the geometry we grew up with was only one choice, and that curvature was a thing a space could intrinsically have. Half a century later Einstein observed that the curvature they had been studying mathematically is gravity — that what looks to us like a force pulling things down is actually the straight-line motion of objects in a curved manifold. An entire concept that had hidden in plain sight as a fact about space turned out to explain how things move in it.
I think there is a similar move waiting for us in how we understand systems. Not that holonomy is literally gravity — that would be very online of me — but that the intuition we have built for how curved spaces work is more or less the right one for how composed systems behave. Curvature is emergence. Failures in large systems are not random, and they are not merely complex. They are the loops of a curved space finally being traversed.
Once you see it, it gets hard to unsee. The next time something drifts in a system you maintain — a few cents of inventory, a dashboard that will not reconcile, a decision that felt right in every meeting and wrong in its aftermath — try asking which loop just closed. Most of the time, the answer is interesting.