Good Systems Know Where to Stop Being One System

Sometimes the right architectural move is not another coordination layer but topological surgery — a walk from the Titanic's bulkheads to ERCOT's February 2021 rolling blackouts to Bezos's 2002 API mandate.

February 28, 2026 · 11 min read

essays
systems
architecture

There is a point in the life of many systems where the corrections start to outnumber the operations.

At 1:51 a.m. on February 15, 2021, the frequency of the Texas grid reached 59.302 Hz. Nominal is 60. Nine minutes below 59.4 and the automatic underfrequency relays on the interconnection would have fired in a sequence no operator could stop, and ERCOT — the grid serving roughly ninety percent of the state — could have come apart into uncontrolled islands that would have taken weeks, possibly months, to resynchronize. The cold had already done most of its damage. What was left to decide, in the thirty minutes before the relays started to trip on their own, was whether the grid would collapse or be deliberately broken.

Starting at 1:20 a.m., operators began issuing manual load-shed directives to the transmission companies. By morning, more than four and a half million customers were without power, and would remain so for days. Later estimates put the death toll above two hundred. The FERC and NERC final report, published in November 2021, is careful about this point. The rolling blackouts were not the failure. They were what prevented the failure. The system was saved by being broken in the right way.

That is a strange sentence, and it is the subject of this essay.

Corrections are evidence

Most systems do not get to stage-manage their own breakage that cleanly. What they do instead is accumulate corrections. A report gets reconciled after it is generated. A transaction gets retried after it times out. A nurse calls a unit clerk to fix what the electronic record technically allowed but clinically did not mean. A spreadsheet acquires a final_final_v3 tab whose only purpose is to unwind the last two. Most organizations treat this as background noise. Big systems need corrections; life is messy.

That is true, as far as it goes. But there is a more interesting interpretation. Corrections are not noise. They are the marks left by loops that did not close cleanly.

In Why Systems That Make Sense Still Fail I argued that many system failures are best understood through the geometric idea of holonomy: carry a state around a loop and it comes back not quite the same. Locally everything made sense; globally, the composition did not. The companion piece, Most Systems Don't Hide Their Loops, was about the practical work: how to surface those cycles with a spreadsheet and a discipline. Together they answer what is wrong and where to look.

They do not answer the next question, which is the one that matters most to anyone who actually has to run something. Once you can see the loops, what do you do with them?

The third move

Two answers are obvious. Instrument the loops better, so the drift becomes measurable before it is expensive. Or tolerate them, and hedge. Both are fine. Both are what most competent teams do.

There is a third move, older than software and more physical than most architecture diagrams admit: you break the system on purpose. Not because you are giving up. Because you are trying to save it.

The formal name for this, borrowed from physics and topology, is loop dimensional reduction. The working name is better. I have started calling it loop shedding, and the rest of this essay is about what it looks like when it works.

Good architecture, quite often, is not the elimination of inconsistency. It is the demotion of inconsistency into forms a human can see.

The ship with one compartment

The physical example is almost embarrassingly simple. A ship whose hull is one open volume is, narrowly, a beautifully integrated system. Water that enters any part of the hull can move anywhere. Pressure equalizes. Nothing is duplicated. It is also a terrible idea.

On the night of April 14–15, 1912, the RMS Titanic struck an iceberg in the North Atlantic. The ship had sixteen watertight compartments. The compartments did not extend high enough. Water rising in the forward compartments spilled over the tops of the bulkheads into the next compartment, and the next, and the next. The design treated the loop as if a breach in one place would stay in that place. The ocean did not agree.

The first Safety of Life at Sea convention was signed in London on January 20, 1914, roughly twenty-one months after the sinking. Among its provisions: watertight bulkheads on passenger ships would extend to the bulkhead deck. The regulation did not make ships more integrated. It mandated that they become, in a specific and inconvenient way, less so. Doors, seals, inspection burdens, new failure points. A global failure mode was converted into several local ones the crew could see and fight.

That is loop shedding in steel, and the century of maritime safety that followed is, in part, a century of refusing to let any one breach propagate into every compartment.

The grid that saves itself by becoming several grids

Return to ERCOT at 1:20 a.m. A large synchronous AC grid is powerful precisely because it is tightly coupled. It shares generation, smooths demand, and wrings efficiency out of infrastructure that would otherwise sit idle. That same coupling is what lets a disturbance travel astonishing distances. Frequency excursions, line trips, bad control actions — none of them politely remain local.

What the operators did that morning has a name in transmission engineering: load shedding, coordinated in this case with the logic of islanding, which deliberately splits a grid into smaller electrically coherent pieces during a severe event. Nobody does this because it is aesthetically pleasing. It is done because the larger loop structure has become too dangerous to maintain. You sacrifice efficiency, market coherence, and some amount of service to prevent a far worse global collapse. The Western Interconnection carries Remedial Action Schemes that encode this logic in firmware, pre-wired to execute topological surgery faster than a human could reach a switch.

The same engineering instinct shows up as load shedding, microgrids, breaker coordination, and protection zones. Each is, in part, a way of saying the same thing. This system has become too globally entangled. Let us cut along the right seams before physics cuts for us.

Software already knows this, under poorer names

Software engineering is full of loop shedding, though it usually gets smuggled in under milder language. Bounded contexts are loop shedding. So are bulkheads. So are circuit breakers, which are exactly what they sound like. So is the decision to stop pretending that one giant shared database with cross-domain transactions is cleaner than a system with separate domains and a reconciliation boundary.
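For readers who have not written one, here is roughly what the smallest of those patterns looks like. This is a minimal sketch of a circuit breaker, not any particular library's implementation; the class name, the `max_failures` and `reset_after` parameters, and their defaults are all assumptions chosen for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the dependency at all."""


class CircuitBreaker:
    """Stops calling a failing dependency so its failure stays local.

    Illustrative sketch: thresholds and names are assumptions, not a standard.
    """

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open before a retry
        self.clock = clock              # injectable so the breaker is testable
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # The cut: fail fast instead of joining the failing loop.
                raise CircuitOpenError("circuit open; call not attempted")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

The interesting line is the early `raise`. The caller gives up a possibly successful request in exchange for a failure that is immediate, local, and visible, which is the same trade the operators in Texas made at a much larger scale.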

The canonical corporate version of this is a single directive, issued by Jeff Bezos around 2002 and made widely known a decade later when Steve Yegge accidentally posted it publicly. It read, in substance: all teams will expose their data and functionality through service interfaces, no team will be permitted to communicate with another team's data by any other means, and anyone who does not comply will be fired. The memo was terse and more than a little rude. It was also, structurally, topological surgery. It forbade a whole class of cross-team composition. The loops that had been running through shared tables and shared assumptions had to either become an explicit interface or cease to exist. The SOA migration that followed took years and produced, as a side effect, Amazon Web Services.

None of these patterns are novel as practices. What is uncommon is seeing them as the same move.

That move is not simply decoupling. Decoupling is too vague a word; it suggests a generic preference for modularity. Loop shedding is sharper. It says there are specific feedback cycles in this system that are producing unbounded or unobservable drift, and we are going to restructure the system so those cycles either no longer exist or no longer carry the same state around them.

A team can spend years modularizing a system and never touch the real loops. Plenty of pristine architectures push the mess into interfaces and then congratulate themselves on cleaner boxes. Real loop shedding has a test. After the cut, are the remaining loops smaller and more visible than the original one? If the answer is no, the geometry has not been simplified. The dirt has been moved.
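One rough way to make that test concrete is to treat the system as a directed graph of dependencies and compare the largest group of mutually reachable nodes before and after a proposed cut. The sketch below does only that. The component names are hypothetical, and group size is an assumed proxy for "smaller"; it says nothing about whether the remaining loops are more visible, which stays a human judgment.

```python
def reachable(graph, start):
    """Return every node reachable from start by following one or more edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def loops(graph):
    """Group nodes that sit on a common cycle (fine for small graphs)."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    reach = {n: reachable(graph, n) for n in nodes}
    groups, assigned = [], set()
    for n in sorted(nodes):
        if n in assigned or n not in reach[n]:
            continue  # already grouped, or n is on no cycle at all
        group = {m for m in reach[n] if n in reach[m]}
        assigned |= group
        groups.append(group)
    return groups


def without_edge(graph, src, dst):
    """Copy of the graph with the single edge src -> dst removed."""
    return {n: [t for t in targets if (n, t) != (src, dst)]
            for n, targets in graph.items()}


def cut_simplifies(graph, src, dst):
    """True if removing src -> dst leaves only smaller loops than before."""
    before = max((len(g) for g in loops(graph)), default=0)
    after = max((len(g) for g in loops(without_edge(graph, src, dst))), default=0)
    return after < before


# Hypothetical system: one large loop with a smaller one hanging off it.
system = {
    "billing": ["ledger", "archive"],
    "ledger": ["reporting"],
    "reporting": ["billing", "forecast"],
    "forecast": ["reporting"],
}
real_cut = cut_simplifies(system, "reporting", "billing")  # shrinks the loop
moved_dirt = cut_simplifies(system, "billing", "archive")  # touches no loop
```

In this toy graph, removing the edge from reporting back to billing leaves only the two-node reporting and forecast loop, while removing the edge to archive changes nothing about the loops at all. The second cut is the kind that produces cleaner boxes and the same geometry.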

When to reach for it

The operational heuristic here is worth stating plainly. The signal is not failure. Failure is lagging. The signal is the number and diversity of corrections the system has quietly accumulated to remain upright. Retries. Reconciliations. Compensating transactions. Manual overrides. Shadow spreadsheets. Exception queues. Quality meetings whose real purpose is to make last week's output mean what everyone hoped it would mean.

Those are not independent annoyances. They are observable holonomy. Better still, the diagnostic is not merely how many corrections exist. It is how many different kinds. If the same problem is being corrected by an automated retry, a downstream reconciliation table, a manual supervisory sign-off, and a monthly finance true-up, then the system does not have one unhappy loop. It has several overlapping ones, carrying related but non-identical states.
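If corrections are already logged somewhere, even in a spreadsheet, the diversity count is a few lines of work. The following is a sketch under assumptions: the log format (pairs of problem and correction kind), the labels, and the threshold of three kinds are all invented for illustration, not a recommended standard.

```python
from collections import defaultdict


def correction_diversity(log):
    """Map each problem to the set of distinct correction kinds applied to it."""
    kinds = defaultdict(set)
    for problem, kind in log:
        kinds[problem].add(kind)
    return dict(kinds)


def surgery_candidates(log, min_kinds=3):
    """Problems corrected in at least min_kinds different ways, most diverse first.

    The default threshold is an arbitrary assumption; tune it to your system.
    """
    diversity = correction_diversity(log)
    flagged = [(p, sorted(k)) for p, k in diversity.items() if len(k) >= min_kinds]
    return sorted(flagged, key=lambda item: (-len(item[1]), item[0]))


# Hypothetical correction log: (problem, kind of correction applied).
log = [
    ("invoice_total", "automated_retry"),
    ("invoice_total", "reconciliation_table"),
    ("invoice_total", "manual_signoff"),
    ("invoice_total", "finance_trueup"),
    ("invoice_total", "automated_retry"),
    ("shipping_eta", "automated_retry"),
    ("shipping_eta", "automated_retry"),
]
candidates = surgery_candidates(log)
```

Here `shipping_eta` is retried often but always in the same way, so it is not flagged. `invoice_total` is corrected in four different ways, which suggests several overlapping loops rather than one noisy step.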

When corrections become structural rather than incidental, your architecture wants surgery.

Not more oversight. Not a better dashboard. Not another coordination meeting. A cut. Somewhere the system is trying to remain one connected thing, and the cost of that unity — paid in drift, reconciliation, and ambient dread — has overtaken the cost of becoming several.

What you give up

The uncomfortable part is that loop shedding often looks, from the outside, like a step backward. It is worth being honest about the ledger.

Some efficiency. Redundancy costs capacity; buffers cost latency; separate domains duplicate work the unified system only did once.
Some elegance. The aesthetic satisfaction of a beautifully integrated design gives way to something blunter: a system with visible seams, on purpose.
Some comfort. The fantasy of a single source of truth retreats; you are left with several sources and an explicit story about how they disagree.

Engineers often resist this because it offends the part of us that likes unified things. Managers sometimes resist it because it looks like duplication. Almost everyone has, at some point in the life of a large system, hoped the answer would be one more coordination layer. Usually it is not. One more coordination layer is one more way for local truths to fail to compose globally. The price is geometry, and the bill arrives later.

The mathematical cousin

There is a direct mathematical smell to this, even if the practice is older than the vocabulary. In topology one simplifies spaces by cutting them or changing how they are glued. In control theory one decomposes a tightly coupled system into faster local loops and a slower global one. In gauge physics one coarse-grains away troublesome modes. In network flow one separates path flows from cycle flows. These are not identical to loop shedding, but they rhyme strongly enough to be useful.

The plainest statement is that you are reducing the kinds of loops the system is permitted to support. The formal phrase loop dimensional reduction promises nothing more than that. The system loses some higher-order loop structure and with it some of the weird global behavior that structure enabled. In exchange it becomes more fragmented, and more understandable, and — this is the trade — no longer isometric to what it was. The old geometry does not survive. That is the point.

The rude question

The modern instinct, especially in software, is to ask whether two things should be integrated. That is not a terrible question. It is just an incomplete one.

A better question, and one worth keeping in your pocket: what new loops will this integration create, and who will perform the corrections when they fail to close?

That question is rude in exactly the right way. It forces future cost into view before it is incurred. It turns the usual integration pitch inside out. The issue is not only what capabilities the new connection unlocks. It is what geometry it introduces. Once you start asking it, a lot of ambient weirdness stops looking weird. A forecast that never quite reconciles. A team whose priorities rotate between planning cycles without anyone deciding to rotate them. A report that is consistently off in one direction for reasons nobody can name. You are watching the loops of the system finally be traversed in front of you.
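The question can be asked mechanically, at least in a first pass. In a dependency graph, a proposed connection from one component to another creates exactly one new loop for every existing path that already runs back the other way. The sketch below enumerates those loops and names who would own the corrections. The graph, the team names, and the `UNOWNED` marker are hypothetical.

```python
def simple_paths(graph, start, goal, _path=None):
    """Yield every path from start to goal that visits no node twice."""
    path = (_path or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in graph.get(start, ()):
        if nxt not in path:
            yield from simple_paths(graph, nxt, goal, path)


def new_loops(graph, src, dst):
    """Cycles that would exist only because of a proposed src -> dst edge.

    Each existing path from dst back to src is closed by the new edge.
    """
    return [path + [dst] for path in simple_paths(graph, dst, src)]


def loop_owners(loop, owners):
    """Teams responsible for the nodes on a loop; gaps are marked UNOWNED."""
    return sorted({owners.get(node, "UNOWNED") for node in loop})


# Hypothetical system and a proposed integration: ledger -> crm.
graph = {"crm": ["billing"], "billing": ["ledger"], "ledger": []}
owners = {"crm": "sales-ops", "billing": "finance"}

created = new_loops(graph, "ledger", "crm")
who_corrects = [loop_owners(loop, owners) for loop in created]
```

In this example the integration closes a single loop through crm, billing, and ledger, and one of the three nodes on it has no owner. That unowned node is where the corrections will quietly accumulate.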

Knowing where to stop

A ship survives because it is not one room. A grid survives because, at the right moment, it is willing to become several grids. A company survives because not every decision must propagate instantly and everywhere. Architectural maturity shows up as the discipline of refusing to connect two things that are tempting to join but better left apart.

A great many systems are not failing because their parts are bad. They are failing because they are too globally eager to remain one thing. The art is knowing where to let them stop.

The next time you catch yourself reaching for one more coordination layer — one more service mesh rule, one more steering committee, one more spreadsheet of record — try asking which loop it would close, and whether that loop is one the system can afford.