This article examines the categories of risk associated with the size of the power grid and possible ways to mitigate them. The researchers believe that engineers must stop looking at the electrical grid solely as an aggregation of components. Instead, they must also view the grid as an entity whose very size and complexity make it behave differently from other types of systems. A systems perspective can help explain why large blackouts are inevitable and why they occur more frequently than we might expect. It also enables us to consider new ways to reduce risk by restructuring the grid. An analysis using a simplified grid model shows that 500 to 700 nodes might be the sweet spot: large enough to suppress many small and medium-size outages, yet small enough to avoid the increased probability of massive blackouts. The trade-off is that dividing the grid into independent modules would produce more small and medium-size blackouts. While these events are quickly remedied, they would occur frequently enough for utility customers to notice and complain. Modularization would also make the grid less efficient for operators.
We often talk about banks as being “too big to fail.” But what if the very size of the financial system not only made individual failures inevitable, but made massive failures more likely?
That is exactly what my colleagues and I believe is happening with the electrical grid. There is something about the grid’s very size and structure that increases the risk of cascading failures, where a single event can touch off a chain reaction of failures.
It happened in 2003, when a power line touched an overgrown tree branch. It short-circuited. Software failed to warn operators of the problem. Within minutes, outages had spread to four states. Ultimately, 55 million people in eight states and Canada were left in the dark. Most got power back within a day or two, but others had to wait weeks.
Night falls on New York City during the 2003 regional blackout.
Such massive blackouts are rare. Yet they are extremely costly and disruptive. Recovery takes much longer than smaller events. Because of their size and duration, massive blackouts represent the largest economic risk posed by grid failure.
For the past 15 years, my colleagues—Benjamin Carreras, a principal scientist at the physics and systems consulting firm BACV Solutions in Oak Ridge, Tenn., and Ian Dobson, an electrical engineer at Iowa State University in Ames—and I have been analyzing blackout data. We have found that large blackouts occur more frequently than you might think. More surprisingly, some engineering responses to outages may actually increase the likelihood of larger, more devastating blackouts.
To understand why, we believe engineers must stop looking at the electrical grid solely as an aggregation of components. Instead, they must also view the grid as an entity whose very size and complexity make it behave differently from other types of systems.
A systems perspective can help us explain why large blackouts are inevitable and why they occur more frequently than we might expect. It also enables us to consider new ways to reduce risk by restructuring the grid. And it lets us ask whether the same mechanisms that make the electrical grid prone to massive failure exist in other large, complex infrastructure systems.
Sand Piles and Grids
To understand grid behavior, let’s start with a sand pile or, better yet, a mound of rice.
Dribble grains onto it and each new grain will subtly alter the position of the grains around it. The mound will grow taller and steeper, and more unstable.
Once it gets to a certain point, adding just one more grain dislodges some of the surrounding grains. They, in turn, cause other grains to tumble. This chain reaction creates an avalanche that carries away large slabs of the pile.
The mound now has a broader base, but it is still driven by the same dynamic. If we keep dribbling grains onto it, it will grow higher and steeper, and eventually collapse again.
This is an example of something complexity theorists call self-organized criticality. In this type of system, the critical point is the position where system properties change abruptly. As long as we keep dribbling sand or rice onto the pile, it will automatically organize itself near that critical point every time. And every time, it will come tumbling down.
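The sand pile dynamic can be sketched in a few lines of code. What follows is a minimal illustration based on the textbook Bak-Tang-Wiesenfeld sand pile, not the model the authors built; the grid size, the toppling threshold of four grains, and the function names `drop_grain` and `run_sandpile` are all assumptions chosen for the example.

```python
import random


def drop_grain(grid, row, col, threshold=4):
    """Add one grain at (row, col), relax the pile, and return the number of topplings."""
    size = len(grid)
    grid[row][col] += 1
    unstable = [(row, col)]
    topplings = 0
    while unstable:
        r, c = unstable.pop()
        if grid[r][c] < threshold:
            continue  # already relaxed by an earlier toppling
        grid[r][c] -= threshold
        topplings += 1
        if grid[r][c] >= threshold:
            unstable.append((r, c))
        # Pass one grain to each neighbor; grains pushed past the edge are lost.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < size and 0 <= nc < size:
                grid[nr][nc] += 1
                if grid[nr][nc] >= threshold:
                    unstable.append((nr, nc))
    return topplings


def run_sandpile(size=20, grains=20000, seed=0):
    """Drop grains at random sites and record the avalanche size caused by each one."""
    rng = random.Random(seed)
    grid = [[0] * size for _ in range(size)]
    return [
        drop_grain(grid, rng.randrange(size), rng.randrange(size))
        for _ in range(grains)
    ]


if __name__ == "__main__":
    sizes = run_sandpile()
    for cutoff in (1, 10, 100):
        count = sum(1 for s in sizes if s >= cutoff)
        print(f"avalanches with at least {cutoff} topplings: {count}")
```

Nothing in the code tunes the pile toward its critical state. The steady dribble of grains does that on its own, which is the point of the analogy.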
The power grid also tends to organize itself near the critical point. The long-term drivers in this case are economic pressures and engineering responses to failure.
On one hand, demand for electricity keeps rising about 2 percent every year. To keep up, power companies build new generators, transmission lines, substations, and other components. At the same time, those companies want to maximize profits, so they run their infrastructure near capacity. This pushes the system ever closer to the critical point.
Most of the time, the system works remarkably well. But perhaps, during a heat wave, when every office and home air conditioner is running flat out, grid capacity utilization edges closer to the brink.
Then something happens. A branch might ground out a transmission line (as one did in 2003). A squirrel may short-circuit a transformer (which happened more than 50 times this past summer). Or maybe a generator has to take an unscheduled outage.
In large, complex systems like the grid, repeated disruptions are a certainty. Most of the time, even when the grid is operating close to capacity, it adjusts gracefully to these events. But sometimes, simply because of the random fluctuation of demand and current, it cannot.
So a small part of the grid goes down. Operators compensate by rerouting additional power over other transmission lines to make up for the loss. But if those lines are running close to capacity, thanks to random fluctuations in demand, shifting additional current onto them trips their circuit breakers and shuts them down. As the grid tries to compensate, more lines shut down. Failure cascades through the system, causing a massive blackout.
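A toy calculation shows how rerouting can turn one trip into many. This is only a sketch under strong simplifying assumptions, not the authors' simulation: every line shares the same capacity, lost load is spread equally over the surviving lines, and the function name `simulate_cascade` is hypothetical.

```python
def simulate_cascade(loads, capacity, first_failure):
    """Return how many lines end up out of service after one initial trip.

    loads: starting load on each line (same units as capacity).
    capacity: trip threshold shared by every line (a simplifying assumption).
    first_failure: index of the line that trips first.
    """
    loads = list(loads)
    failed = {first_failure}
    newly_failed = [first_failure]
    while newly_failed:
        survivors = [i for i in range(len(loads)) if i not in failed]
        if not survivors:
            break  # nothing left to carry the load: total blackout
        # Spread the load of the lines that just tripped over the survivors.
        shifted = sum(loads[i] for i in newly_failed) / len(survivors)
        for i in newly_failed:
            loads[i] = 0.0
        for i in survivors:
            loads[i] += shifted
        # Any survivor pushed past its capacity trips in the next round.
        newly_failed = [i for i in survivors if loads[i] > capacity]
        failed.update(newly_failed)
    return len(failed)


if __name__ == "__main__":
    for utilization in (0.5, 0.7, 0.9):
        lost = simulate_cascade([utilization] * 5, capacity=1.0, first_failure=0)
        print(f"utilization {utilization:.0%}: {lost} of 5 lines lost")
```

With five lines running at 50 percent of capacity, the initial trip stays contained and only one line is lost. At 90 percent, the same single trip takes down all five.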
Blind economic forces drove the grid closer to its critical point. Random variations in demand maxed out capacity. Then, one day, someone dropped that last grain of sand, and it all came crashing down.
Ben Carreras and I first began to model sand piles at Oak Ridge National Laboratory 20 years ago. We were not looking at the electrical grid then. Instead, we wanted to understand nuclear fusion.
To achieve fusion, we need to contain a hot, dense plasma confined by a magnetic field. Like a sand pile, the system contains the plasma near the critical state until random fluctuations trigger a cascade of events that sends a burst of energy through the field.
That’s when we saw a report on a blackout caused by cascading failures. I contacted Ian Dobson, who, as an electrical engineer, could better judge the similarities between the two paradigms. Together, the three of us created a simple sand pile-like model to see if we could explain blackouts.
First, we analyzed 1984-1998 power blackout data from the North American Electric Reliability Corp. We wanted to see if it showed the characteristic signatures of a complex system near its critical point.
The first thing that jumped out from the data is that small blackouts are far more likely than large ones. As the number of transmission lines affected by an outage grows, the likelihood of the outage falls exponentially, up until about 10 lines.
For blackouts of 10 lines or more, the exponential rate of decline flattens out. Rather than a waterfall-shaped curve, we get something that looks like the tail of a Jurassic sauropod. Instead of an exponential drop, we get a slower decline based on a power law. This is a characteristic of complex systems near their critical point. Researchers found similar patterns when they analyzed blackout data from China, New Zealand, Norway, and Sweden.
The exponential drop-off in small and moderate-size outages makes sense. Every so often, a small section of the grid fails. Each failure begins as an isolated incident in one component. The likelihood of this is small, but significant, like the likelihood of flipping a coin and getting all heads several times in a row.
For the outage to spread, a second component must go down. For example, if operators reroute power through another transmission line to make up for the outage, that line can usually handle it. But if that line has a weak link or is running too close to capacity because of random variations in current, it might shut down too. It is like flipping our coin again and getting all heads. It can happen, but it is less likely. In fact, it is exponentially less likely.
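The difference between the two kinds of decline can be written out explicitly. The symbols below are illustrative assumptions rather than values fitted to the blackout data: p is the chance that each additional line trips, and α is the exponent of the power-law tail.

```latex
% p and \alpha are illustrative symbols, not values fitted by the authors.
% Independent failures: each further line trips with probability p < 1.
P_{\mathrm{exp}}(n) \propto p^{\,n} = e^{-n \ln(1/p)}

% Power-law tail, of the kind seen beyond roughly 10 lines:
P_{\mathrm{pow}}(n) \propto n^{-\alpha}, \qquad \alpha > 0

% For large outages the exponential always ends up far below the power law:
\lim_{n \to \infty} \frac{p^{\,n}}{n^{-\alpha}}
  = \lim_{n \to \infty} n^{\alpha} \, p^{\,n} = 0
```

As a purely hypothetical example that ignores normalization constants, take p = 1/2 and α = 2. For a 20-line outage the exponential term is 2^{-20}, about 1 in a million, while the power-law term is 20^{-2}, or 1 in 400. That gap is what a fat tail means in practice.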
Yet large blackouts are more common than this initial dropoff in probability would suggest. Why? We think this is because the grid, like our rising sand pile, moves closer to its critical point as it grows. This is a function of size, and we see it very clearly in the data.
To test this thesis, we built a model of the Western United States electrical grid. We did not try to model the grid realistically or predict actual blackouts. Instead, we wanted to see how it would evolve over time.
To do this, we greatly simplified our model. We included only the transmission grid and ignored distribution within towns and cities. We used a DC load-flow model and eliminated transformers. We simplified transmission lines, generators, load variations, and nodes (where the power goes).
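For readers unfamiliar with the DC load-flow approximation, here is a minimal sketch. It uses the standard textbook formulation, in which the flow on a line equals the voltage-angle difference across it divided by the line's reactance, with one bus fixed as the angle reference. The three-bus network, its reactances, and the function names are hypothetical and are not taken from the authors' model.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def dc_load_flow(num_buses, lines, injections, slack=0):
    """Return the flow on each line under the DC approximation.

    lines: list of (from_bus, to_bus, reactance) tuples.
    injections: net power at each bus (generation minus load).
    slack: reference bus whose angle is fixed at zero.
    """
    keep = [bus for bus in range(num_buses) if bus != slack]
    index = {bus: i for i, bus in enumerate(keep)}
    # Build the susceptance matrix with the slack row and column removed.
    b_matrix = [[0.0] * len(keep) for _ in keep]
    for i, j, reactance in lines:
        y = 1.0 / reactance
        for p, q in ((i, j), (j, i)):
            if p != slack:
                b_matrix[index[p]][index[p]] += y
                if q != slack:
                    b_matrix[index[p]][index[q]] -= y
    angles = solve_linear(b_matrix, [injections[bus] for bus in keep])
    theta = [0.0] * num_buses
    for bus, i in index.items():
        theta[bus] = angles[i]
    return [(theta[i] - theta[j]) / reactance for i, j, reactance in lines]


if __name__ == "__main__":
    # Hypothetical triangle: generator at bus 0, load at bus 2.
    triangle = [(0, 1, 0.1), (1, 2, 0.1), (0, 2, 0.1)]
    injections = [1.0, 0.0, -1.0]
    print("all lines in service:", dc_load_flow(3, triangle, injections))
    print("direct line 0-2 out: ", dc_load_flow(3, triangle[:2], injections))
```

In this triangle, two-thirds of the power takes the direct line and one-third takes the two-line detour. Remove the direct line and the detour must carry everything, which triples the flow on each of its lines.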
The model followed rules that reflected the actual grid. For example, we increased power demand at a constant rate, about 2 percent per year, and let daily loads fluctuate randomly (within parameters consistent with variations in the real system). To meet rising demand, we built more generators and upgraded transmission lines.
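The demand rule is simple to sketch. The version below is an assumption-laden illustration, not the authors' code: it compounds 2 percent annual growth day by day and multiplies each day's load by a uniform random factor. The 5 percent noise band, the uniform distribution, and the function name `simulate_demand` are placeholders, since the article does not give the actual fluctuation parameters.

```python
import random


def simulate_demand(base_load=1.0, years=10, annual_growth=0.02,
                    daily_noise=0.05, seed=0):
    """Return a list of daily loads: a compounding trend plus random fluctuation.

    daily_noise is the half-width of a uniform band around the trend;
    it is a placeholder value, not a parameter from the article.
    """
    rng = random.Random(seed)
    # Daily factor that compounds to the annual growth rate over 365 days.
    daily_factor = (1.0 + annual_growth) ** (1.0 / 365)
    trend = base_load
    loads = []
    for _ in range(years * 365):
        trend *= daily_factor
        loads.append(trend * (1.0 + rng.uniform(-daily_noise, daily_noise)))
    return loads


if __name__ == "__main__":
    loads = simulate_demand(years=35)
    print(f"peak daily load in year 1:  {max(loads[:365]):.3f}")
    print(f"peak daily load in year 35: {max(loads[-365:]):.3f}")
```

At 2 percent a year, demand doubles in roughly 35 years, so infrastructure that never grew would be overwhelmed by the trend alone, well before any random peak is counted.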
We used our NERC data to determine the probability of outages. When they occurred, we modeled the engineering response to failures. For example, we used the same type of linear programming methods that utilities use to redispatch generators and shed load to prevent blackouts. In some cases this worked; in others it did not. After “virtual” blackouts, we upgraded failed transmission lines to carry more power, just as utilities do following real outages.
We validated our results against NERC data from the Western U.S. grid. Our model’s basic dynamics and statistics showed excellent agreement with the NERC data.
Then we did something we could not do in real life: We let the model run for hundreds of virtual years and watched how the grid evolved. This is especially important when investigating massive blackouts. Because these events occur so infrequently, we needed to let the model run for prolonged periods to get enough examples to analyze statistically.
Some interesting patterns emerged. As our initial analysis of NERC data showed, massive blackouts occur more often than we might expect. We believe this is due to the grid’s self-organized criticality.
More surprisingly, long-term engineering efforts to prevent failures usually fail to reduce the likelihood of large blackouts. In fact, they can often make the problem worse.
This seems counterintuitive. Upgrading generators, transmission lines, and other components seems like it should improve grid stability. Yet these upgrades do not address the factors that push the system to the edge of criticality.
That is because the forces shaping the grid (rising demand and the push for profits) are part of society and not part of the grid itself. Improving components may move the system to a new dynamic equilibrium, but that equilibrium will still move towards the critical point every single time.
We can see this clearly in our rice mound. Grid improvements are like wetting the rice to increase its stickiness. It lets us build the mound higher and steeper. Yet it does not change the fundamental structure of the pile. Eventually, the mound will grow too high and too steep and collapse, just as it did before.
Similarly, upgrading grid components lets the system carry more power, but does nothing to address the fundamental push-pull among demand, profits, and engineering responses to failure that pushes the system towards its critical point.
Efforts to mitigate small disruptions may actually increase the frequency of larger disruptions. This also appears counterintuitive, but it is a characteristic of many complex systems.
Complexity theorists have studied how this works in several systems, including forest fires. Suppressing small forest fires seems like a good way to keep them from growing into large fires. True, sending out fire crews to put out small blazes initially lowers the number of small and large fires significantly.
Yet fires are part of the natural forest ecosystem. Extinguishing them changes the forest in unexpected ways. Without periodic fires, trees grow denser and there are more dead trees on the ground. These serve as fuel for the next fire. Sometimes, a crew does not spot smoke early, or cannot respond to all the fires started by lightning storms. With more fuel, these fires spread rapidly. While big fires are rare, they are still more frequent, and more destructive, than they would have been if smaller fires had thinned the forest naturally.
2003 Northeast Blackout: Timeline of Events
The blackout that hit the Northeast United States and the Canadian province of Ontario struck without warning on August 14, 2003. According to a report by the New York Independent System Operator (NYISO), which runs the electrical grid in New York state, the power system was operating normally: almost all the transmission lines were in operation and there was a 3,000 MW surplus in generating capacity.
A series of events began in the late afternoon, following the shorting out of a high-voltage power line in northeast Ohio:
Beginning at around 4:06 p.m., a sustained power surge north toward Cleveland overloads three 138 kV lines.
Minutes later, NYISO notes a larger power swing, approximately 700 MW, out to Ontario, as well as a coincident swing of similar proportion from the PJM grid operating system in the Mid-Atlantic region into the NYISO.
Starting at 4:10, transmission lines trip out, first in Michigan and then in Ohio. The eastward flow of power around the south shore of Lake Erie and into New York is blocked. Cut off from demand, generating stations go offline, creating a huge power deficit. To compensate, power surges in from the east, overloading power plants in New York, Ontario, and elsewhere, which then go offline as a protective measure.
Within minutes, a cascading failure disconnects various parts of the interconnected grid. The NYISO system, for instance, separates into two electrical islands, and western New York separates from the Ontario system.
The grid instability took more than 250 power plants offline. More than 80 percent of those plants went offline after the grid separations occurred, most due to the action of automatic protective controls. For instance, severe frequency oscillations in the grid caused large nuclear and combined cycle units to trip off. Some of the fossil generation tripped by relay protection, and in other cases operators took the units off-line because they were becoming thermally unstable.
The blackout left an estimated 10 million people in Ontario and 45 million people in eight U.S. states without power.
Parts of the affected region had power restored by that evening. Other areas, including New York City, saw the lights go back on the next day.
Still other areas didn't have full power until August 16, and electricity was limited in many parts of Ontario for days afterward.
One estimate put the total cost of the 2003 blackout at $6 billion. The event contributed to at least 11 deaths.
So, how does this play out in the grid? We considered different mitigation strategies. First, we adjusted the model so that no blackouts occurred until a minimum number of lines overloaded. This was our way of representing operator actions to resolve overloads by redistributing power. When we did this and ran our simulations, we found this actually increased the number of large events over time.
We also upgraded the emergency ratings of our virtual transmission lines so they could carry more power to increase component reliability. This reduced the number of small blackouts, but increased the number of large blackouts.
Finally, we increased generator capacity. If the excess power was large enough, it reduced blackouts—but only if we upgraded the transmission lines to carry additional power. Otherwise, additional capacity alone results in fewer but larger blackouts.
Finding the Sweet Spot
If the grid is prone to massive blackouts and improving its components actually makes the problems worse, how can we reduce risk and improve reliability?
We discovered a clue in “On Being the Right Size,” an essay written in 1928 by famed biologist J.B.S. Haldane. In it, Haldane surveyed the link between an animal’s size and its biological systems. An insect, for example, gets all the oxygen it needs through diffusion, while larger mammals need a complex circulatory system to supply interior cells.
The essay struck a chord. After all, the grid, like animals, evolved over time in ways that were not pre-engineered. Perhaps, like animals, the grid had a “right” size that allowed it to thrive while minimizing risk.
North America’s power grid grew to its present size to give operators more resources to deal with power supply, demand fluctuations, and potential outages. We certainly want to preserve that flexibility.
At the same time, the grid’s size and self-organized criticality make it highly susceptible to cascading failures. While it does a good job of preventing small outages, its very success appears to make massive blackouts more likely.
To see if there was a “right” size that gave us flexibility without more risk, we modeled grids of different sizes. We compared the behavior of 200-, 400-, 800-, and 1,600-node grids with systems of the same size built from unconnected 100-node modules.
We found that smaller 200- and 400-node systems behaved similarly to even smaller systems made up of 100-node modules. The number of blackouts dropped exponentially with the size of the outage, and none of the networks had fat tails.
The 800- and 1,600-node networks behaved very differently. They had the resources to prevent small and medium-size blackouts, and had fewer of them than same-size modular systems. On the other hand, the larger networks had fat tails and much greater likelihoods of large blackouts than in the modular networks.
Our own analysis (using our simplified grid model) shows that a grid with 500 to 700 nodes might be the sweet spot for a grid. It is large enough to let us suppress many small and medium-size outages, yet small enough to avoid the increased probability of massive blackouts.
Utilities may not find this suggestion palatable. On one hand, it reduces the likelihood of massive blackouts that last days or even weeks and that pose the greatest risk to society. Yet such massive outages happen only rarely.
On the other hand, dividing the grid into independent modules would produce more small and medium-size blackouts. While these events are quickly remedied, they would occur frequently enough to notice and draw complaints from utility customers. It would also make the grid less efficient for operators.
Still, we think this is a promising avenue of research. If we start thinking of the grid as a self-organizing system that will always move towards the critical point, we may find better ways to balance stability and risk. There may be better ways to share resources and secure smaller electrical networks without endangering the entire system.
We also think our research into self-organized criticality has broad implications for other types of infrastructure. Many are driven by the same economic forces. They might include our vast and tightly interconnected financial, air transportation, communications, and Internet systems.
By acknowledging that failure is inevitable and rethinking acceptable risk, perhaps we can reduce the cost to society when things do go wrong.