Book Review

Normal Accidents

by

Charles Perrow





Normal Accidents: Living with High-Risk Technologies, by Charles Perrow, Basic Books, NY, 1984.

This book review has been prepared for Tier III 415 A, "Entropy and Human Activity." It includes much of the material presented in class and further commentary.



Introduction

Perrow investigates normal accidents in high-risk systems. He uses these and a number of other words in a technical sense that we will understand better after reviewing the whole book, but let us start now:


High-Risk

This term encompasses risks "for the operators, passengers, innocent bystanders, and for future generations." He applies it to "enterprises [that] have catastrophic potential, the ability to take the lives of hundreds of people in one blow, or to shorten or cripple the lives of thousands or millions more." This means that although he does include chemical plant and refinery accidents, he is explicitly excluding from his focus the primary harmful impacts of fossil-fuel burning (greenhouse gases and toxic combustion products released into the atmosphere), since those effects are diffuse and happen by design, not as an accident.


Normal Accidents

Perrow uses this term in part as a synonym for "inevitable accidents." This categorization is based on a combination of features of such systems: interactive complexity and tight coupling. Normal accidents in a particular system may be common or rare ("It is normal for us to die, but we only do it once."), but the system's characteristics make it inherently vulnerable to such accidents, hence their description as "normal."


Discrete Failures

A single, specific, isolated failure is referred to as a "discrete" failure.


Redundant Sub-systems

Redundant sub-systems provide a backup, an alternate way to control a process or accomplish a task, that will work in the event that the primary method fails. This avoids the "single-point" failure modes.


Interactive Complexity

A system in which two or more discrete failures can interact in unexpected ways is described as "interactively complex." In many cases, these unexpected interactions can affect supposedly redundant sub-systems. A sufficiently complex system can be expected to have many such unanticipated failure mode interactions, making it vulnerable to normal accidents.
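
To see why interactive complexity undermines redundancy, consider a minimal reliability sketch (not from Perrow): two nominally redundant pumps look extremely reliable if their failures are independent, but even a modest chance of a shared cause dominates the calculation. The failure probability and "beta-factor" below are illustrative assumptions.

```python
# Illustrative sketch (not from Perrow): why redundancy alone does not
# eliminate system accidents.  All numbers are assumptions for illustration.

p = 1e-3      # assumed failure probability of one pump per demand
beta = 0.05   # assumed fraction of failures arising from a shared cause
              # (the "beta-factor" model used in reliability engineering)

independent_both_fail = p * p        # what a naive redundancy analysis promises
common_cause_both_fail = beta * p    # both units disabled by one shared event

print(f"independent estimate : {independent_both_fail:.1e}")   # 1.0e-06
print(f"with common cause    : {common_cause_both_fail:.1e}")  # 5.0e-05
# The shared-cause term dominates: the redundant pair is about fifty times
# less reliable than the independent calculation suggests.
```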


Tight Coupling

The sub-components of a tightly coupled system have prompt and major impacts on each other. If what happens in one part has little impact on another part, or if everything happens slowly (in particular, slowly on the scale of human thinking times), the system is not described as "tightly coupled." Tight coupling also raises the odds that operator intervention will make things worse, since the true nature of the problem may well not be understood correctly.


Incomprehensibility

A normal accident typically involves interactions that are "not only unexpected, but are incomprehensible for some critical period of time." The people involved just don't figure out quickly enough what is really going wrong.

A normal accident occurs in a complex system, one that has so many parts that it is likely that something is wrong with more than one of them at any given time. A well-designed complex system will include redundancy, so that each fault by itself does not prevent proper operation. However, unexpected interactions, especially with tight coupling, may lead to system failure.

System operators must make decisions, even with ambiguous information. The process of making a tentative choice also creates a mental model of the situation. When following through on the initial choice, the visible results are compared to those expected on the basis of that initial mental model. Provided that the first few steps' results are consistent, the fact that the mental model was tentative is likely to be forgotten, even if later results contradict it. They become "mysterious" or "incomprehensible" rather than functioning as clues to the falsity of the earlier tentative choice. This is simply the way the human mind works, and systems designed with contrary expectations of their operators are especially vulnerable to system accidents.


Operator Error

It is indeed the case that people sometimes do really stupid things, but when most of the accidents in a particular type of system (airplane, chemical plant, etc.) are blamed on the operator, that is a symptom that the operators may be confronted with an impossible task, that there is a system design problem. In a typical normal accident, the operator's actions may contribute to the problem, or even initiate the sequence of events, but the characteristics of tight coupling and interactive complexity also make their contributions.


For want of a nail ...

The old parable about the kingdom lost because of a thrown horseshoe has its parallel in many normal accidents: the initiating event is often, taken by itself, seemingly quite trivial. Because of the system's complexity and tight coupling, however, events cascade out of control to create a catastrophic outcome.


Transformation

Perhaps best illustrated by counter-example: the alternatives to transformation processes are additive or fabricating ones. Chemical reactions, turbulent fluid flow, and so on are examples of transformation processes, typical in that we cannot actually see what is going on. Empirically designed (perhaps that is an oxymoron) transformation processes are known to work, but may well not be completely understood. Assembly and fabrication processes, such as the construction of large buildings or bridges, do have accidents, but there is usually a better opportunity to learn from those accidents, and they typically exhibit far less complexity and coupling.


Organizations

Organizational issues routinely confront the analyst of normal accidents. Because the interactions among subsystems are not predictable, the operators must be able to take prompt and independent action. Because of the tight coupling, in which operators of one part of the system influence the tasks confronting operators of other parts of the system, centralized control is required. These two conflicting requirements cannot be readily resolved, and the organizational attributes that work well for one side are likely to be dysfunctional for the other.



Normal Accident at Three Mile Island

The accident at Three Mile Island ("TMI") Unit 2 on March 28, 1979, was a system accident, involving four distinct failures whose interaction was catastrophic.

  1. The Cooling System

    The primary cooling system is a high-pressure system using water to extract heat from the nuclear core. This heated water (so hot that it would be steam at atmospheric pressure) circulates to a heat exchanger (like a radiator) that turns the water in the secondary cooling system to steam. The secondary system is also pressurized, but at a lower pressure.

    The water in the primary system contains many radioactive nuclei, fission products and neutron-activation products (including tritium, produced when one hydrogen atom absorbs first one and then another neutron). The water in the secondary system is not radioactive, but its chemical and mechanical purity is critical, because the high pressure, high temperature steam it turns into will be sprayed directly against the precisely machined turbine blades, to drive the turbine that turns the electrical generator. The "condensate polisher system" is responsible for removing impurities from the secondary coolant.

    The accident started when two secondary cooling system pumps stopped operating (probably because of a false control signal that was caused by leakage of perhaps a cup of water from seals in the condensate polisher system). With no secondary cooling system flow, the turbines shut down, and heat stopped being removed from the secondary system by the turbines and cooling tower, and therefore, heat stopped being removed from the primary system through the heat exchanger.

    The emergency feedwater pumps activated, to remove the heat that was building up in the reactor, now that the secondary cooling system was no longer removing it through the turbines, cooling tower, and heat exchanger. These emergency pumps circulate water in the secondary cooling system, which boils off because the energy is not removed by the turbine, and draw in replacement water from the emergency water storage tank.

  2. Valves Closed

    Both emergency feedwater pumps were operating against closed valves: they had been closed during maintenance two days earlier. The operators did verify that the pumps were running, but did not know that the pumps were accomplishing nothing because of the closed valves. One of the two indicator lights on the control panel that might have alerted them to the closed valves was obscured by a repair tag hanging on the switch above it. It was only eight minutes later that this problem was discovered.

    With no secondary circulation, no more heat was being removed from the reactor core, its temperature started to rise, and the automatic emergency shutdown procedure, known as a "scram," was started. This involves the rapid insertion of control rods whose composition includes a large percentage of neutron-absorbing materials. This absorbs most of the fission neutrons before they have a chance to initiate a new fission event, stopping the chain reaction. It does not immediately stop the release of heat in the reactor core. Because many of the fission products are unstable nuclei, with half-lives ranging from fractions of a second to days, heat continues to be released in a nuclear reactor core for quite some time after the chain reaction itself is stopped by the scram. Because no heat was being removed through the secondary coolant system, temperatures and therefore also pressures rose within the core and primary coolant system.

  3. Pilot Operated Relief Valve ("PORV") Sticks Open

    The PORV is designed to valve off enough coolant from the primary system to keep pressures at safe levels. It initially opened because of the pressure rise resulting from the cooling failure. Once the pressure had dropped, it was instructed to close, so that the remaining pressure would keep any steam bubbles in the core squeezed small. It did not close, and therefore the radioactive primary coolant continued to drain into the sump, and the bubbles in the core grew larger and larger as coolant turned to steam at the reduced pressure. Steam is much less effective than water at conducting heat away from the reactor fuel rods, so their temperatures rose even faster, reaching values that permitted them to resume fissioning.

  4. PORV Position Indicator Falsely Reads as Closed

    As soon as the pressure had been adequately reduced, a signal was sent to the PORV to close again. The control panel included an indicator light that showed that this signal had been sent. Unfortunately, despite the indicator light showing that the valve was being told to close, it did not in fact close. The primary cooling system stayed open for 140 minutes, venting 32,000 gallons, one third of the core capacity, and keeping the pressure in the core at a much lower level than it would have been with the PORV properly seated.

    The reduced pressure caused steam bubbles to form, with four results:

    1. It reduced the effectiveness of the cooling wherever those bubbles were in direct contact with the fuel rods.

    2. It impeded the flow of coolant through the core and the pipes.

    3. The fuel rod temperatures rose.

    4. The chain reaction resumed, releasing even more heat.

    The High Pressure Injection (HPI) pumps activated (one automatically, one by operator intervention) to flood the core with cold water. Reactor pressure vessels are made of steel, and in operation are exposed to large amounts of radiation, especially neutrons. The steel becomes brittle with age, prone to shattering. The thermal shock of HPI operation with its cold water is therefore a risk to be traded off against the risk of letting the core heat up.

    To reduce the risks of high pressure operation, and in particular, the risks of hydraulic shock waves traveling through the plumbing, the reactor is designed with a large surge tank. This tank, known as the "pressurizer," normally has its bottom half filled with water and its top half with steam. Heating or cooling the steam permits the pressure in the reactor core and primary cooling system to be controlled. Compression of the steam absorbs any hydraulic shocks that reach the pressurizer. If the pressurizer fills up with water, as it will if the steam all condenses, then hydraulic shock waves moving through the plumbing (caused, for example, by opening or closing of valves, or the starting or stopping of pumps) will not be absorbed, and may therefore cause pipes or pipe-joints to break.

    In order to prevent this well-recognized risk to a safety system, the operators followed standard procedures and reduced High Pressure Injection when the pressurizer level indicator rose toward the point that would indicate that the pressurizer was about to fill completely.

    The reduced pressure also caused cavitation in the reactor coolant pumps, which could erode the moving metal parts of the pump, distributing fragments throughout the coolant (where they would destroy other pumps and valves in the system), so the reactor coolant pumps had to be shut down, further reducing coolant flow.


    All four of these failures took place within the first thirteen seconds, and none of them are things the operators could have been reasonably expected to be aware of.



  5. The Hydrogen Bubble

    With the loss of one-third of the coolant and the sustained operation at reduced pressure but high temperature, the zirconium cladding of some of the fuel rods reacted with the water, oxidizing the zirconium and leaving free hydrogen behind (the chemistry is sketched just after this list). The hydrogen accumulated in the reactor, forming pockets that prevented water from reaching parts of the core to cool it.

    In the reactor core, this hydrogen is inert, because there is no oxygen available to burn with it. During a loss of coolant incident, however, it is likely that some of the hydrogen will be carried out along with the coolant. It will then collect in the containment building, where it will have oxygen available. Since the building containing the reactor is full of large-scale electrically powered equipment (motor-driven pumps, etc.) it is only a matter of time until the hydrogen-air mixture is ignited by a stray spark. [Reviewer's note: So far as I know, it is not a standard precaution to continually have multiple open flames in the containment building to make sure that any hydrogen-air ignition occurs before too much hydrogen has accumulated, so that the resulting explosion is comparatively gentle.]

    At TMI, the hydrogen-air explosion took place 33 hours into the accident and produced an over-pressure fully half of the design strength of the building! The danger from hydrogen-air explosions also includes the fact that they are likely to spread debris and shrapnel throughout the building, possibly cutting control or instrumentation wiring and breaking holes in cooling water or compressed-air control lines.
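
The chemistry behind the hydrogen bubble described in item 5 is the oxidation of the hot zirconium cladding by steam, a strongly exothermic reaction; the equation below is a standard summary of that reaction, not something quoted from Perrow.

```latex
% Zirconium-steam reaction: the oxygen stays bound in the oxide layer on the
% cladding, and the hydrogen is released as gas.
\mathrm{Zr} + 2\,\mathrm{H_2O} \;\longrightarrow\; \mathrm{ZrO_2} + 2\,\mathrm{H_2} + \text{heat}
```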



Nuclear Power as a High-Risk System

In 1984, Perrow asked, "Why haven't we had more catastrophic nuclear power reactor accidents?" We now know, of course, that we have, most spectacularly at Chernobyl. The simple answer, which Perrow argues is in fact an oversimplification, is that the redundant safety systems limit the severity of the consequences of any malfunction. They might, perhaps, if malfunctions happened alone. The more complete answer is that we just haven't been using large nuclear power reactor systems long enough, that we must expect more catastrophic accidents in the future.


Operating Experience

The primary reason that Perrow claims we do not yet have enough operating experience to really know the true long-term risks of nuclear power generation is that the many different power reactors that have been built and operated are just that, many different power reactors. At the time of TMI, there were 500 "reactor-years" of operating experience, but fewer than 35 of those "reactor-years" were with systems of the large size of TMI. In the years since, we have, of course, accumulated substantially more experience, especially with the roughly 1 GW capacity systems.


Construction

Large-scale industrial construction is routinely an activity in which the people building the structure confront situations that appear to call for a deviation from the original plans. Some of these deviations occur because it seems too difficult to build the structure as planned, others because of variations in materials or external conditions. There are also pressures to complete the job on time and within budget. It is clearly not realistic to expect that nuclear power plants will actually be built exactly as designed, even when the designs are very good, and it is equally unrealistic to expect that the deviations from plan during construction will be enhancements! Perrow provides a number of illustrative cases that came to light in the aftermath of TMI.


Safer Designs?

Can safer nuclear power systems be designed? Almost certainly. But, Perrow leads us to ask, "Would they be significantly less vulnerable to system accidents?" That is, would they be less tightly coupled, less interactively complex, slower to respond (giving the operators more time to think during an accident)? Perrow says that it seems very likely that nuclear power systems could be designed that would be somewhat better on these grounds, especially considering that the dominant U.S. reactor designs are essentially just scaled-up versions of submarine reactor systems that were necessarily designed to be both lightweight and responsive.

There are a number of factors that limit the enhancement of intrinsic safety, however. The primary limiting factor is the basic transformation process that is involved in using a fission chain reaction for large-scale power production. The fission products decay with a timetable that is essentially identical for all power reactors. Even after a successful "scram" shuts down the chain reaction, and even assuming that the design and the nature of the accident are such that no part of the fuel resumes chain reacting, an enormous amount of heat must be removed from the core for an extended period of time (days, not hours or minutes).
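
To put a rough scale on that decay heat, here is a minimal sketch using the Way-Wigner approximation, a common rough fit for fission-product decay power. The reactor's thermal rating and operating history below are illustrative assumptions, not figures from Perrow.

```python
# Rough illustration of decay heat after a scram, using the Way-Wigner fit
# P/P0 ~= 0.066 * [t^-0.2 - (t + T)^-0.2], with t and T in seconds.
# The 3000 MW thermal rating and one year of prior operation are assumptions
# chosen only to show the scale of the problem; this is not a design tool.

P0_thermal_MW = 3000.0          # assumed full thermal power of a large reactor
T_operation = 365 * 86400.0     # assumed prior operating time: about one year

def decay_heat_MW(t_seconds):
    """Approximate decay heat (MW) at t seconds after shutdown."""
    return P0_thermal_MW * 0.066 * (t_seconds**-0.2 - (t_seconds + T_operation)**-0.2)

for label, t in [("10 seconds", 10.0), ("1 hour", 3600.0),
                 ("1 day", 86400.0), ("1 week", 7 * 86400.0)]:
    print(f"{label:>10}: about {decay_heat_MW(t):4.0f} MW still being released")
# Roughly 120 MW ten seconds after shutdown, and still on the order of
# 10 MW a day later -- which is why cooling must continue for days.
```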

The theoretical availability of safer designs was in many ways a moot point in the U.S., because (for the reasons explored in Bernard Cohen's book, The Nuclear Energy Option) no new nuclear power plants were built in the U.S. for twenty years. The resumption of construction of nuclear power plants in the U.S. appears more and more likely, so this issue is now one that must be addressed in the U.S.


Defense in Depth

Nuclear power systems are indeed safer as a result of their redundant subsystems and other design features. TMI has shown us, however, that it is possible to encounter situations in which the redundant subsystems fail at the same time. What are the primary safety features?


Containment Buildings

Many of the earlier Soviet designs do not have containment buildings, but all U.S. commercial power reactors do, and it is reasonable to expect that any future systems will, also. The TMI hydrogen explosion came within a factor of two of breaching the containment building, a containment building whose design had been made unusually strong because of the proximity to Harrisburg's airport. Supposedly, it is strong enough to withstand the impact of a crashing jetliner. As the world learned on September 11, 2001, the risk of a crashing jetliner is not just from the mechanical impact, but also from the heat generated by the burning fuel.

In 1971, near Charlevoix, MI, a B-52 crashed while headed directly toward a nuclear power installation on the shores of Lake Michigan, impacting about two miles from the reactor. The B-52 is a subsonic bomber, with a maximum cruising speed around 600 miles per hour, which normally bombs from a high altitude, at high speed. At low altitude, it would have been likely to have been traveling between 200 and 300 miles per hour (according to public reports, stall, takeoff, and landing speeds are around 150 to 200 miles per hour). The newspaper reports describe the plane as having disintegrated in flight during a simulated bombing run, so a higher speed is at least plausible. Depending on the speed, two miles would amount to 12 to 36 seconds of flight time, quite consistent with the "twenty seconds" stated by Perrow [p. 41]: it was a very close call! An earlier version of this review was based on a mis-transcription of this section of Perrow's book; in fact, both the 1984 and the 1999 editions state the time as twenty seconds; the reviewer thanks Osvaldo Maccari for bringing this mistake to his attention.
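
For the record, the flight-time arithmetic quoted above works out as follows, using only the speeds and the two-mile distance already given in this review.

```python
# Time to cover two miles at the plausible range of B-52 speeds quoted above.
distance_miles = 2.0
for speed_mph in (200, 300, 600):
    seconds = distance_miles / speed_mph * 3600
    print(f"{speed_mph} mph -> {seconds:.0f} seconds of flight time")
# 200 mph -> 36 s, 300 mph -> 24 s, 600 mph -> 12 s,
# bracketing Perrow's "twenty seconds".
```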

The containment building is usually constructed of reinforced concrete. If the concrete is poured too rapidly, bubbles will not have time to collapse, resulting in "voids" that reduce the strength below the design value.


Siting

Nuclear power installations are usually located outside major metropolitan areas. They cannot be placed too far away, however, because of the losses of power caused by the non-zero resistance of the power transmission wiring. Their location is further constrained by the need to get rid of the waste heat.


Remember the Entropy Law: any heat engine will transform only part of the heat it extracts from the high temperature reservoir (the reactor core) into useful mechanical work (driving the turbine so that its shaft will turn the electrical generator). The rest of the heat must be delivered to a lower temperature reservoir, the surrounding environment.


In particular, the only efficient methods known for removing the rejected heat of a large power installation, whether nuclear or fossil fuel powered, require a lot of water for the cooling towers, and therefore require the installations to be located near large rivers, lakes, or oceans, places where people naturally congregate. This makes it even more difficult to locate nuclear power installations at great distances from population centers.
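
A back-of-the-envelope sketch makes the cooling-water requirement concrete. The plant size, efficiency, and allowed temperature rise below are assumed round numbers for illustration, not values taken from Perrow.

```python
# Why big power plants need big bodies of water: a sketch under assumed,
# illustrative numbers.

electrical_output_W = 1.0e9        # assume a 1 GW(e) plant
thermal_efficiency = 0.33          # assumed overall thermal efficiency
allowed_temp_rise_K = 10.0         # assumed permissible cooling-water warming
c_water = 4186.0                   # specific heat of water, J/(kg K)

thermal_power_W = electrical_output_W / thermal_efficiency
rejected_heat_W = thermal_power_W - electrical_output_W    # roughly 2 GW wasted

water_flow_kg_per_s = rejected_heat_W / (c_water * allowed_temp_rise_K)
print(f"waste heat: {rejected_heat_W / 1e9:.1f} GW")
print(f"cooling water needed: about {water_flow_kg_per_s:,.0f} kg/s "
      f"(~{water_flow_kg_per_s / 1000:.0f} cubic meters per second)")
# On the order of 50 cubic meters of water per second -- hence the rivers,
# lakes, and oceans.
```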

Finally, there is the further limitation that the ideal locations are downwind of all nearby population centers. With the prevailing winds in the United States generally from the West to the East, radioactive debris released by a major accident anywhere in the interior or on the West coast would be likely to land on one city or another. The East coast shoreline includes locations that would be downwind of major cities much of the year, but those locations would be vulnerable to hurricanes in the short term and beach erosion in the long term, and the cooling water would be much more corrosive than the fresh water of lakes and rivers. Furthermore, the mountains and the various weather systems from the arctic and from the Gulf of Mexico interact to make every direction of air flow occur part of the time at any location in North America. There is no location that can be predicted to be always downwind of nearby population centers.


Emergency Core Cooling System ("ECCS")

TMI clearly demonstrated that the ECCS is a weak reed to rely on. It probably does provide a net enhancement of safety - there really are believable scenarios where the standard cooling might fail and the ECCS still function as designed. Unfortunately, we have also seen that there are believable scenarios where both would be compromised.


Trivial Events in Non-trivial Systems

Perrow reports on several cases of routine mischance that have occurred in the nuclear power industry:

  • A janitor's shirt catching on a circuit breaker handle, shutting off power to the control rod mechanism [1980, North Anna Number 1, Virginia Electric and Power Company]. The reactor automatically scrammed, but the ensuing shutdown cost consumers several hundred thousand dollars!

  • Clam larvae are not removed by the filters on the intake water for the cooling towers, and consequently the pipes have to be scraped clean from time to time.

  • An operator changing a light bulb in a control panel dropped it, creating a short circuit in some sensors and controls [1978, Rancho Seco 1, Clay Station, California]. After the automatic scram, the disabled sensors prevented the operators from knowing the condition of the reactor, and there was substantial thermal stress on the steel vessel, with the inside much colder than the outside layers of the steel. The vessel was relatively new and did not crack.

  • An indicator light showing high water in sumps was considered to be a malfunction for several days, allowing 100,000 gallons of water to accumulate (covering the lowest nine feet of the reactor vessel) [1980, Indian Point Number 2, New York, Consolidated Edison]. It was only detected when another failure required technicians to enter the building. One of the two sump pumps had a blown fuse, the other had a stuck float mechanism. Redundant systems both failed at once.

The point of these examples is simply that nuclear power is a human endeavor, and when people operate on an industrial scale, things go wrong from time to time. When the industry in question is nuclear power, the possibility of catastrophic consequences is more obvious. Other large-scale transformation processes present similar levels of risk (petrochemical plants, for one example).


Learning from Mistakes

The Nuclear Regulatory Commission publishes a journal called Nuclear Safety, which recounts various safety-related events. Perrow reports on a few of these, describing further the ways in which they exemplify system accidents.

  • Humboldt Bay, California (Pacific Gas and Electric) in 1970 lost off-site power and automatically scrammed, as it should have, but then several malfunctions followed, including stuck valves, a pipe joint that ruptured, and the release of 24,000 pounds of primary cooling water.

  • A plant (not identified by Perrow) in which, "even after a seven-month shutdown for repair of primary coolant piping, an important motor broke, and sixty-three valves malfunctioned - 35 percent of those tested prior to start-up."

  • Radioactive gases were released into the containment building of another reactor by service workers who opened a faucet trying to get demineralized water for cleaning purposes. The various other valves had not yet been set correctly, at the time they tried the faucet, and as a result there was an open path through the piping and valves to permit the radioactive gases to vent through the faucet.

  • The Dresden 2 nuclear plant, outside of Chicago, Illinois, is operated by Commonwealth Edison. At the date of this incident, Commonwealth Edison was rated as one of the two best organized and managed utilities in the country. The reactor scrammed following a steam valve malfunction. During the subsequent events, a stuck indicator misled the operators into believing that coolant was low. As a result they increased the feedwater flow, overfilling the reactor so that water got into the steam lines. When they briefly opened a relief valve and closed it, the "water hammer" (hydraulic shock wave) popped safety valves open to relieve the transient high pressure. Unfortunately, these safety valves were mis-designed and stayed open, instead of closing after the pressure surge. The high-temperature water and steam venting into the containment building raised the pressure in the building. The building is normally kept at negative pressure relative to the outside air, that is, at slightly less than 1 atmosphere (to prevent the leakage of radioactive gases). When the pressure exceeds 1.14 atmospheres, water is supposed to be sprayed onto the reactor to cool it and reduce the pressure, "but operators blocked this safety action because that would have cold-shocked some equipment and thereby damaged it." The pressure reached 2.4 atmospheres during the incident. The design limit for that building was 5.1 atmospheres.


Fermi

Perrow presents the example of the Fermi breeder reactor near Monroe, Michigan (on Lake Erie, near Detroit), not as a system accident, but rather as a component failure accident that does still illustrate the interactive complexity and tight coupling of nuclear power reactor systems.

The Fermi reactor was a breeder reactor, designed to produce fissionable plutonium from the U-238 in the fuel. It was designed to operate at such high temperatures that the coolant was liquid (molten) sodium metal. In the accident, four of the fuel subassemblies were damaged; two had welded themselves together.

The cause of the accident was eventually identified as a piece of zirconium sheet metal that had been added as a "safety device" on the insistence of the Advisory Reactor Safety Committee. It did not appear on the blueprints. After it was ripped loose by the flow of the liquid sodium coolant, it migrated to a position where it blocked the flow of coolant, permitting part of the core to overheat and melt the uranium fuel elements.

Perrow highlights these points about the Fermi accident:

  1. The problem started with a safety device.

  2. Poor design and negligent construction contributed to the accident. The zirconium sheet metal parts were not adequately secured against the forces generated by the flow of the sodium coolant, and their presence was not properly documented on the blueprints.

  3. Operator error was suggested in some of the post-accident analyses, despite the fact that there were no clearly pre-established procedures for such an incident. If the reactor had been scrammed sooner, the resulting thermal shock might have produced worse damage.

  4. Post accident accounts displayed "positive thinking," such as citing changes made in the design (including adding the ability, not included in the original design, to drain the sodium coolant out of the reactor vessel).


The Fuel Cycle as a System

Nuclear power necessarily includes the complete fuel cycle:

  • prospecting to find uranium ore deposits;

  • mining of uranium ores;

  • refining, enriching, processing and fabricating into the usable fuel subassemblies;

  • burning in reactors to produce heat (this is the step we have been concentrating on, reactor operation);

  • re-processing of used fuel subassemblies to separate both plutonium and still-usable uranium from the fission products and other contaminants (such as neutron-activated trace elements in the cladding);

  • fabricating new fuel subassemblies on the one hand and packaging the radioactive wastes on the other hand;

  • long-term storage of the wastes;

  • not mentioned by Perrow in this context, but still an intrinsic part of the nuclear power process: de-commissioning the reactor after the end of its useful life, including separating radioactive components, packaging them, and storing them.

Mining is routinely dangerous, and uranium mining has both mechanical and radiation hazards. The various processing and re-processing steps are transformation processes performed on dangerous materials. Like many chemical processing plants, these activities are prone to system accidents, too. There has been at least one published case of a major (loss of several lives) criticality accident in a processing plant at Tokaimura, Japan; see, for example, http://ublib.buffalo.edu/libraries/projects/cases/tokaimura/tokaimura.html, and the succeeding parts, linked from the bottom of each page, in turn.



Complexity, Coupling, and Catastrophe

In this chapter, Perrow lays the analytical groundwork for the rest of the book.

Defining Accidents

An "accident" is an event that is unintended, unfortunate, damages people or objects, affects the functioning of the system of interest, and is non-trivial. The primary alternative description for such an event is "incident"; the distinction is primarily the level at which they occur:

  • First level items are "parts," such as valves or meters. Second level items are units, collections of parts that perform a particular function within a subsystem, such as steam generators. Events whose disruptions are restricted to a single part or unit are usually called "incidents."

  • Third level items are subsystems, made up of a number of units. For example, the secondary cooling subsystem includes the steam generator, the condensate polisher, pumps, piping, etc. Fourth level items are the complete system, composed of perhaps one or two dozen subsystems, for example, the nuclear power installation. Events whose disruptions extend to the subsystem or system are called "accidents."

An incident may require shutdown or reduced output operation of the system as a whole. The distinction is that an accident will involve failure or damage at the subsystem or system level. Design safety features (e.g., redundant pumps or valves) are often found at the boundary between units and subsystems, and consequently failures or unanticipated behaviors of such safety features will often play a significant role in whether an event is an incident or an accident, as Perrow uses those terms.
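
Perrow's part/unit/subsystem/system vocabulary and his incident/accident distinction amount to a small classification rule. The sketch below simply restates the definitions above in code; the level names follow the text, but the code itself is only an illustration.

```python
# Minimal sketch of Perrow's four levels and the incident/accident distinction.

from enum import IntEnum

class Level(IntEnum):
    PART = 1        # e.g., a valve or a meter
    UNIT = 2        # e.g., a steam generator
    SUBSYSTEM = 3   # e.g., the secondary cooling system
    SYSTEM = 4      # e.g., the whole nuclear power installation

def classify(highest_level_disrupted: Level) -> str:
    """Disruptions confined to parts or units are incidents; disruptions
    reaching a subsystem or the whole system are accidents."""
    return "incident" if highest_level_disrupted <= Level.UNIT else "accident"

print(classify(Level.UNIT))       # incident
print(classify(Level.SUBSYSTEM))  # accident
```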


Victims

The people who may be injured or killed in an incident or accident should sometimes be viewed as a "part" (e.g., a miner or a space shuttle payload specialist) and sometimes as a "subsystem" (e.g., a space shuttle pilot or a nuclear reactor operator); in other cases they will be external to the system, part of the consequences of the failure, but not part of the failing system itself (e.g., the populations downwind of Chernobyl).

Perrow divides the victims into four groups:

  1. First-party victims are the operators of the system; Perrow defines this group broadly to include those with explicit control responsibility as well as other workers who are on-site. Most industrial accidents involve one operator, who is usually blamed for the accident. Operators are routinely subjected to "production pressures": it may not be possible (or the operator may not believe it is possible) to both follow all safety regulations and produce enough for the operator to keep his or her job.

  2. Second-party victims are the nonoperating personnel or system users such as the passengers on a ship, or the truck driver delivering materials to the chemical plant. This category includes the users of the system and those workers who exercise no control of its operation. In general there is some level of voluntary participation in the risks of the system, although unemployed people may accept a job without feeling the freedom to refuse.

  3. Third-party victims are innocent bystanders, such as people on the ground where an airliner crashes. Airports and nuclear power plants are located fairly close to all major metropolitan regions, so it is not reasonable to argue that people could choose to live elsewhere.

  4. Fourth-party victims are fetuses and future generations. The mechanism in these cases is usually toxic chemicals or radiation. Perrow excludes "run-of-the-mill pollution," because accidents and incidents are at issue, not normal operating releases. This class of victims is particularly troublesome because they receive no benefit from the risky activities, but do bear a large part of the burden.


Accident Definitions

"Component failure accidents involve one or more component failures (part, unit, or subsystem) that are linked in an anticipated sequence."

"System accidents involve the unanticipated interaction of multiple failures."

Both component failure accidents and system accidents start with the failure of a part. System accidents are characterized by the progression of the accident involving multiple failures and those failures interacting in ways that are not anticipated by or comprehensible to the designers and properly trained operators.

Perrow also excludes from his analysis what he calls "final accidents," such as a wing or tail breaking off an airplane in flight or an earthquake shattering a dam: they are not interesting from an analytical point of view because there is nothing that the operator can do to influence the course of events.

In the nuclear power industry, roughly 3,000 "Licensee Event Reports" are filed each year. Perrow estimates that 90% of these are incidents and only 10% are accidents. Of the accidents, he estimates that perhaps 5% or 10% are system accidents. So far as we know, all of the accidents in U.S. nuclear power plants have had only first-party victims, and very few of those.
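
Spelled out, those estimates imply a non-trivial number of system accidents every year across the industry; the arithmetic below simply multiplies the figures quoted above.

```python
# The arithmetic implied by Perrow's estimates of Licensee Event Reports.
reports_per_year = 3000
accident_fraction = 0.10                   # ~10% of reports are accidents
system_accident_fraction = (0.05, 0.10)    # of those, perhaps 5% to 10%

accidents = reports_per_year * accident_fraction
low, high = (accidents * f for f in system_accident_fraction)
print(f"roughly {low:.0f} to {high:.0f} system accidents per year industry-wide")
# roughly 15 to 30 system accidents per year
```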


Complex and Linear Interactions

Linear interactions comprise the vast majority of situations. Systems that are particularly subject to system accidents are those that have perhaps 90% linear interactions instead of 99%. In other words, a modest increase in the degree to which interactions are complex, or non-linear, will have a major impact on the probability of system accidents.

One source of complex or non-linear interactions occurs when a unit links one or more subsystems. Failure of a heat-exchanger that removes heat from one subsystem and transfers it to another will disrupt both subsystems simultaneously. These "common-mode" failures are intrinsically more difficult for operators to cope with, because they will be confronted with two intermingled sets of symptoms. System designs that include such features are routinely more efficient (e.g., using waste heat from one subsystem as input to another one, instead of burning extra fuel to provide the heat input).

A large class of safety features in many designs are specifically intended to reduce opportunities for common-mode failures. These devices themselves become sources of failures. Perrow cites the illustrative example of a check-valve designed to prevent back-flow between two tanks. Because the system normally operates with the first tank at a higher pressure, the check-valve spends most of its life passing flow in the intended direction. It may then not function when a pressure difference reverses the flow (because of debris blocking its motion, corrosion, or weakening of a spring held in a compressed position for too long), or once actuated, it may not release to permit the resumption of normal flow.

Unintended or unanticipated interactions may result from ordinary physical proximity. If two subsystems are located next to each other, they can both be rendered inoperable by the same explosion.

The complexity or non-linearity of the interactions has to do with their being a part of the normal operational or maintenance sequence, with their visibility, and with their comprehensibility. Interactions that are unusual, unexpected, hidden, or not immediately comprehensible deserve the description of "complex" or "non-linear." Interactions that are normal or that are visible (even if unplanned) should be described as "linear" or "simple" in this sense, even though they may involve many different parts or units.


Hidden Interactions

The control panel for a system provides a major clue to the complexity of the system's interactions. The more complex the interactions, the more different ways in which the operators will need to monitor and control what happens. System designers must choose the degree to which the operators' tasks will be simplified by automation of the subsidiary interactions. This necessarily reduces the information and control that the operators have, but it also keeps their task within the capability of a human being!

Large control panels are difficult to design: their layout is a compromise among ease of assembly, ease of repair, and ease of use. Ease of use issues include the functional grouping of indicators and controls (difficult to choose when subsystems may interact in many ways), the uniqueness or the uniformity of indicators and controls (e.g., turning gauges so that in normal operation all the needles point horizontally), the coding of indicators (valves may always be shown with one color for open, whether or not their normal condition is open, or they may be shown with one color for the normal condition, whether that is open or closed; shape, or angle, or color, or all of those characteristics may convey information).

Complex non-linear systems are often made still more difficult to control because critical information cannot be directly obtained, but must instead be inferred from indirect measurements or observations. At TMI, for example, there was no direct measurement of the level of the coolant within the reactor.


Transformation Processes

Systems that fabricate or assemble their raw materials are less often irretrievably complex than those that transform their raw materials. Besides nuclear power plants, chemical plants are the most common example of a transformation process.

Complex systems are characterized by:

  • proximity of parts or units that are not in a production sequence;

  • many common-mode connections between components (parts, units, or subsystems) not in a production sequence;

  • unfamiliar or unintended feedback loops;

  • many control parameters with potential interactions;

  • indirect or inferential information sources; and

  • limited understanding of some processes.


Linear Systems

Perrow comments on three characteristics that distinguish linear systems.

  1. They tend to be spatially spread out. There is no physical proximity of the units and subsystems that are not sequential in the production process. This is much more likely of assembly and fabrication processes than of transformation processes.

  2. The people working in linear systems tend to have less specialized jobs, and therefore are more likely to be able to fill in for each other, to understand each other's work, and to recognize interactions when they do occur. Similarly, materials in linear systems tend to have less stringent requirements, and therefore to permit substitution more readily.

  3. The feedback loops in linear systems tend to be local rather than global in scope, permitting decentralized controls with fewer interactions. The information used to control linear processes is more likely to be direct and therefore to be accurate.


Choosing Between Complex and Linear Systems

As remarked earlier, complex system designs are often the result of striving for improved process efficiency under normal operating conditions. While there do seem to have been some examples (Perrow cites air traffic control as one) where an initially complex system can be re-designed to be more linear, complex systems seem frequently to be the only way that some processes can be organized.


Tight and Loose Coupling

The concepts of tight and loose coupling originated in engineering, but have been used in similar ways by organizational sociologists. Loosely coupled systems can accommodate shocks, failures, and pressures for change without destabilization. Tightly coupled systems respond more rapidly to perturbations, but the response may be disastrous.

For linear systems, tight coupling seems to be the most efficient arrangement: an assembly line, for example, must respond promptly to a breakdown or maladjustment at any stage, in order to prevent a long series of defective products.


System Coupling Characteristics

There are four primary characteristics of coupling. These characteristics can be observed in both complex and linear systems.

  1. Tightly coupled systems have more time-dependent processes. Chemical reactions will proceed at their own pace; there may be no opportunity to store intermediate stages' output; the material may not permit cooling and subsequent re-heating; etc. Loosely coupled systems are more forgiving of delays. There may be standby modes; intermediate stages' output may be stored readily for variable periods; etc.

  2. In tightly coupled systems the sequence of steps in the process exhibits little or no variation. A senior-level course in an academic major will typically have as prerequisites one or more intermediate-level courses in the discipline, and those in turn will have one or more introductory courses as their prerequisites. The courses in the academic major are fairly tightly coupled. By contrast, it typically doesn't make much difference when, or in what order, the distributional requirements are met; those are loosely coupled (both to each other and to the major).

  3. In tightly coupled systems, the overall design of the process allows only one way to reach the production goal. You can produce a car in many different ways, contracting out the production of a subsystem or substituting different materials or methods. A hydroelectric dam or a chemical plant provides far fewer alternatives in its operation.

  4. Tightly coupled systems have little slack. Everything has to be done and to go just right, or it doesn't work at all. Loosely coupled systems permit degraded operation to continue, with correction or rejection of substandard product as needed, without having to shut down the whole operation.


Recovery from Failure

Tightly coupled systems can survive failures, provided that the failure has been anticipated and provided for in the original design of the process. Loosely coupled systems can often accommodate failures through the use of spur-of-the-moment responses.

All systems need to be able to survive part and unit failures, so that incidents do not become accidents. Loosely coupled systems have the advantage that not all of the mechanisms for such survival have to be planned ahead of time. In many cases, the designers of loosely coupled systems have more than used up their advantage by not designing in even quite obvious, simple, safety features. Designers of tightly coupled systems must invest a great deal of effort and ingenuity in anticipating failure modes and providing safety features to permit survival and prompt recovery with minimal impact.


Organizations, Complexity, and Coupling

Perrow presents a graph in which the horizontal dimension describes the interactions, from extremes of linear to complex, and the vertical dimension describes the coupling, from loose to tight. He then places a variety of systems on the chart, according to their characteristics. For example:

  • airways and junior colleges are near the center

  • the post office is near the linear interaction - loose coupling corner

  • research and development firms are near the complex interaction - loose coupling corner

  • nuclear power plants are near the complex interaction - tight coupling corner

  • dams are near the linear interaction - tight coupling corner.
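
Written as data, the placements listed above amount to a small lookup table; the quadrant labels below are a simplification of Perrow's continuous axes, added here only for illustration.

```python
# Perrow's chart placements, recorded as data.  The labels are ordinal
# ("where on the axis"), not measurements.
placements = {
    #  system                  (interactions,   coupling)
    "post office":            ("linear",       "loose"),
    "dams":                   ("linear",       "tight"),
    "R&D firms":              ("complex",      "loose"),
    "nuclear power plants":   ("complex",      "tight"),
    "airways":                ("intermediate", "intermediate"),
    "junior colleges":        ("intermediate", "intermediate"),
}

# Perrow's argument: the dangerous quadrant is complex interactions
# combined with tight coupling.
at_risk = [name for name, (interactions, coupling) in placements.items()
           if interactions == "complex" and coupling == "tight"]
print(at_risk)   # ['nuclear power plants']
```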



Petrochemical Plants

Nuclear power is not all that large an industry, it is new, many plants do not seem to be well-run, and the companies know that they will be able to pass most of the costs of accidents on to ratepayers. The petrochemical industry, by contrast, is large, mature, for the most part well-run, and has strong economic incentives toward safe operations; yet it, too, is subject to system accidents because of the complex, tightly coupled transformation processes it involves.

Perrow's book was published in 1984. On December 3, 1984, a Union Carbide pesticide plant in Bhopal, India, suffered a catastrophic accident that released a cloud of toxic gas, causing 2,000 immediate deaths, perhaps 8,000 delayed fatalities, and several hundred thousand injuries. Additional information about the Bhopal accident is available on-line at

The nature of chemical plants, with large capital investments, elaborate structures with complicated plumbing, highly automated operations, etc., usually places a small number of skilled operators in a central control location. Most system accidents will therefore not create many first- or second-party victims. The catastrophic risk is to third- and fourth-party victims, if promptly toxic, carcinogenic, or mutagenic materials are dispersed beyond the plant boundaries.

As Perrow explains, the private and relatively unregulated nature of the industry limits the availability of information, both narrative and statistical, about chemical industry accidents. Oil refineries and ammonia plants are somewhat more thoroughly studied, and their experiences are not comforting: an average of more than one fire per year per plant.

Perrow describes briefly the Texas City, Texas, fertilizer explosion aboard ships in the harbor in 1947, and the chemical plant explosion in 1969. The latter was a system accident with no fatalities or serious injuries.

Perrow describes the 1974 disaster at Flixborough, England, in a chemical plant that was manufacturing an ingredient for nylon. There were 28 immediate fatalities and over a hundred injuries. The situation illustrates what Perrow describes as "production pressure" -- the desire to sustain normal operations for as much of the time as possible, and to get back to normal operations as soon as possible after a disruption.

Should chemical plants be designed on the assumption that there will be fires? The classical example is the gunpowder mills in the first installations that the DuPont family built along the Brandywine River: they have very strongly built (still standing) masonry walls forming a wide "U" with the opening toward the river. The roof (sloping down from the tall back wall toward the river), and the front wall along the river, were built of thin wood. Thus, whenever the gunpowder exploded while being ground down from large lumps to the desired granularity, the debris was extinguished when it landed in the river water, and the masonry walls prevented the spread of fire or explosion damage to the adjacent mill buildings or to the finished product in storage sheds behind them. As Perrow points out, this approach is difficult to emulate on the scale of today's chemical industry plants and their proximity to metropolitan areas.



Aircraft and Airways

[See page 123 and following.]



Marine Accidents

[See page 170 and following.]



Earthbound Systems

[See page 232 and following.]



Exotic Systems

[See page 256 and following.]



Living with High-Risk Systems

Perrow considers bases on which his analysis might be opposed and attempts to counter those arguments. He addresses the following issues:


Risk Assessment

Insurance company actuaries can make quite reasonable estimates of the probability of various kinds of unfortunate events that may occur in the course of traditional activities. The process of quantitatively estimating risks of new activities is necessarily far less precise.

The practice of formal risk assessment (cost-benefit analysis) has developed into an elaborate process that includes mathematical models, numerical simulations, statistical analysis of the results of surveys of the opinions of large numbers of supposedly well-informed individuals, and so on. These trappings of "scientific" analysis can readily mislead the practitioners and the public into placing unwarranted faith in the accuracy of their results.

A classic case of unrealistic risk assessment comes from the early Space Shuttle program. As described by Richard Feynman in his book, What Do You Care What Other People Think?, the managers all thought the risk of a catastrophic failure during a shuttle mission was on the order of one in 100,000 or safer. The engineers and technicians who worked more closely with the equipment thought it was on the order of one in 100 or one in 200. The Challenger tragedy brought this discrepancy to light, but even then it might not have been made publicly obvious if Feynman's service on the investigating committee had followed the traditional bureaucratic approach.

The same problem (of believing that if a risk did not result in a catastrophe yet, it must therefore be a small risk) appears to have been involved in the 2003 loss of the Space Shuttle Columbia: the several prior cases of insulation chunks smashing into shuttles had not resulted in their loss, but that only showed the risk was "probably safer than one-in-five," not that it was "safer than the rest of the shuttle operation" (which would be necessary if it were not to significantly degrade overall safety).
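
The statistical point can be made with a short sketch. The number of prior foam-strike flights below is an assumption chosen to match the "one-in-five" figure above; Laplace's rule of succession and the "rule of three" are standard, if crude, ways to bound a risk from a short run of failure-free trials.

```python
# A hedged sketch: a handful of foam strikes that happened not to destroy an
# orbiter says very little about the true risk.  The count of prior strikes
# is assumed for illustration.

prior_strikes_without_loss = 5   # assumed number of failure-free prior cases

# Laplace's rule of succession: after n successes and 0 failures, a simple
# estimate of the failure probability on the next trial is 1 / (n + 2).
laplace_estimate = 1 / (prior_strikes_without_loss + 2)

# "Rule of three": a ~95% confidence upper bound on the per-event risk after
# n failure-free trials is roughly 3 / n.
upper_bound_95 = 3 / prior_strikes_without_loss

print(f"point estimate  ~ 1 in {1 / laplace_estimate:.0f}")    # ~ 1 in 7
print(f"95% upper bound ~ 1 in {1 / upper_bound_95:.1f}")      # ~ 1 in 1.7
# Nothing in such a short record supports anything like "1 in 100,000".
```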

As Perrow points out, risk assessment traditionally regards all lives as equivalent. It makes no difference in calculations of Loss of Life Expectancy whether 50 people die in traffic accidents in a state during a holiday week-end, or if 50 people die in a town of 100 downwind from a nuclear power plant catastrophe. Obviously the impact on the survivors is very different if their own community is essentially intact, rather than if it is devastated. In comparing the costs and benefits, risk assessors routinely gloss over the question of whose risks, and whose benefits.

Under what circumstances is it right for an automobile company, for example, to make choices that increase or decrease the risks for their customers, trading off profits and lives? Those choices must be made by someone. Could society be organized in such a way that they would be made by the customers, or by the government, instead of by the company?


Rational Decision Making

Perrow categorizes rational decision making into three forms:

  • "Absolute rationality, which is enjoyed primarily by economists and engineers," and which is the basis, for example, of Cohen's analysis advocating nuclear power.

  • "Limited rationality, which a growing wing of risk assessors emphasize," using the results of cognitive psychology studies that "tell us how unmotivated subjects (largely undergraduates in college psychology courses) take shortcuts to solve unrealistic and often extremely complicated problems." This work does emphasize the importance of the "context" in decision making: the operator's behavior is likely to be quite rational in the context that he or she has identified, whether it closely approximates or diverges widely from reality. The public is sensible to wonder after TMI whether it is simply a rare event that happened by chance early in the life of the reactor, or whether the experts' calculations of the probability of such an event are grossly optimistic, and that a whole string of such accidents should be anticipated in future years.

  • "Social and cultural rationality, which is what most of us live by." Perrow argues that different people simply think differently. Some people are more adept at quantitative analysis, some people are more adept at recognizing the interplay of personal relations, and so on. Most real-world problems benefit from the application of a combination of talents; this is a strong basis for sustained social bonding.

As Perrow points out, "A technology that raises even unreasonable, mistaken fears is to be avoided because unreasonable fears are nevertheless real fears."


Dreadful Possibilities

Should risks be judged based on the historical record, on how many deaths would result in a typical year? Should they be judged on the basis of how many deaths might result in a particularly bad year? Perrow discusses some polling research conducted by Decision Research and members of a Clark University group. Thirty activities were rated on such characteristics as the degree to which the activity's risks were:

  • voluntary
  • controllable
  • known to science
  • known to those exposed
  • familiar
  • dreaded
  • certain to be fatal
  • catastrophic
  • immediately manifested

The study compared the responses of "experts" and the general public on these various criteria, and on the overall estimation of riskiness for each activity. Although the experts and the public agreed on the ranking of the activities for the characteristics listed above, the experts and the public disagreed on the overall riskiness of the activities. In general, the experts seemed to be responding to the actuarial numbers, but the general public was much more sensitive to issues of uncertainty, catastrophe, and impacts on future generations.


Just Do It Right

In many technologies, we can readily believe that the application of sufficient attention and care will prevent, or drastically reduce, ill effects. For example, the elimination of all government subsidies for the tobacco industry and the outlawing of smoking indoors would dramatically reduce the harmful effects of smoking, both by reducing the second-hand smoke exposure of innocent bystanders and by reducing consumption through higher prices for tobacco products.

Is it reasonable to believe that nuclear power, for example, can be done safely? That is, can nuclear power plants be designed that are not complex, interactive, tightly-coupled systems? Could there be such a thing as a nuclear power plant that was not subject to "system accidents," in which incidents initiated by component failures or operator mistakes rapidly grow into sub-system or system failures? Perrow argues that it simply cannot be done. Not because nuclear power is especially scary, but because of its intrinsic characteristics.


To Err is Human

In particular, the nature of the organization that runs a nuclear power plant is subject to two contradictory requirements:

  1. The operators must avoid taking actions whose consequences have not been analyzed. This requires centralized control, in which the engineering staff develop Standard Operating Procedures, for the operators to follow.

  2. The operators must respond correctly and quickly to all incidents and failures. The real possibility of a scenario that the designers did not contemplate, and for which there is therefore no possibility of having a "Standard Operating Procedure," requires that the operators think and act independently, with decentralized control.

It is not reasonable to expect any organization to meet both of these criteria. Therefore, one way or another, we must expect that all nuclear power plants will be "badly run."


Policy Recommendations

Perrow creates a two-dimensional chart with the horizontal axis measuring the "Net Catastrophic Potential" (incorporating both the potential for system accidents and the potential for component failure accidents) and the vertical axis measuring the "Cost of Alternatives." He plots various systems on the chart, and then draws curves to split them into three categories.

  • Those in the corner with a high potential for catastrophe and a low cost for alternatives are those that he argues should simply be abandoned. We don't need them and can't afford to take the risks they entail. He includes nuclear power and nuclear weapons in this category.

  • The next region, surrounding that corner of the chart, contains the systems that should be restricted. These systems, although risky, are not quite so risky as those in the first category, and their alternatives are more expensive, though still cheap enough to be attractive. These are systems that we are unlikely to be able to do without, but which could be made less risky with considerable effort, and systems where the potential benefits are great enough to justify running some risks. He includes some marine transport and DNA research in this category.

  • All the rest of the chart consists of systems that should be tolerated and improved: their risks are low or the costs of alternatives are high, or both. These systems are somewhat self-correcting, and they could be improved with modest efforts. He includes chemical plants, air traffic control, mining, and highway safety in this category.


Dick Piccard revised this file (http://www.ohio.edu/people/piccard/entropy/perrow.html) on April 14, 2011.

Please E-Mail comments or suggestions to "piccard@ohio.edu".