Tag Archives: root cause analysis

Unintended Consequences, Serendipity, and Prawns

By ThinkReliability Staff

The Diama dam in Senegal was installed to create a freshwater reservoir. Unfortunately, that very dam also led to an outbreak of schistosomiasis. This was an unintended consequence: a negative result from something meant to be positive. Schistosomiasis, which weakens the immune system and impairs the operation of organs, is transmitted by parasitic flatworms. These parasitic flatworms are hosted by snails. When the dam was installed, the snails’ main predators lost a migration route and died off. Keeping the saltwater out of the river also allowed the algae and plants that feed the snails to flourish. A five-why analysis of the issue would go something like this: The safety goal is impacted. Why? Because of an outbreak of schistosomiasis. Why? Because of the increase in flatworms. Why? Because of the increase in snails. Why? Because of the lack of snail predators. Why? Because of the installation of the dam.
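For illustration only (this is not ThinkReliability’s tooling), that five-why chain can be written down as a single linear list, which makes it easy to see why a one-path analysis leaves information out:

```python
# A hypothetical sketch of the five-why chain above as a single linear path:
# each entry answers "why?" for the one before it.
five_whys = [
    "Safety goal impacted",
    "Outbreak of schistosomiasis",
    "Increase in parasitic flatworms",
    "Increase in snails",
    "Lack of snail predators",
    "Installation of the Diama dam",
]

for effect, cause in zip(five_whys, five_whys[1:]):
    print(f"{effect} <- why? <- {cause}")
```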

Clearly, there’s more to it. We can capture more details about this issue in a Cause Map, or visual form of root cause analysis. First, it’s important to capture the impact to the goals. In this case, the safety goal is impacted because of a serious risk to health and the environmental goal is impacted due to the spread of parasitic flatworms. The customer service goal (if we consider customers as all those who get water from the reservoir created by the dam) is impacted due to the outbreak of schistosomiasis.

Beginning with the safety goal, we can ask “why” questions. Instead of capturing a single chain, we include all the cause-and-effect relationships to create a map of the issue. The serious risk to health is caused by the villagers suffering from schistosomiasis, which can cause serious health impacts. The villagers are infected with schistosomiasis and do not receive effective treatment. Not all those infected are receiving drugs, due to cost and availability concerns. The drugs do not reverse the damage already done. And, most importantly, even those treated are quickly reinfected because they have little choice but to continue to use the contaminated water.

The outbreak of schistosomiasis is caused by the spread of parasitic flatworms, which carry the disease. The increase in flatworms is caused by the increased population of snails, which host the flatworms. The snail population increased after the installation of the dam killed off their predators and increased their food supply.

Many solutions to this issue were attempted and found to be less than desirable. Administering medication on its own wasn’t very effective, because (as described above) the villagers kept getting reinfected. The use of molluscicide killed off other animals in the reservoir as well. Introducing crayfish to eat the snails was derided by environmentalists because the crayfish would be an invasive species. But the idea was on the right track. Now, a team is studying the reintroduction of the prawns that once preyed on the snails. During the pilot study, rates of schistosomiasis decreased. In addition, the prawns will serve as a valuable food source. This win-win solution is an example of serendipity and should actually return money to the community. Says Michael Hsieh, the project’s principal investigator and an assistant professor of urology, “The broad potential of this project is validation of a sustainable economic solution that not only addresses a major neglected tropical disease, but also holds the promise of breaking the poverty cycle in affected communities.”

Introducing animals to get rid of other animals can be problematic, as Macquarie Island discovered when cats were introduced to eat an exploding rodent population that was preying on the native seabirds. (Click here to read more about Macquarie Island.) Further research is planned to ensure the project will continue to be a success. To learn more about the project, click here. Or, click “Download PDF” to view an overview of the Cause Map.

A single human error resulted in the deadly SpaceShipTwo crash

By Kim Smiley

The National Transportation Safety Board (NTSB) has issued a report on its investigation into the deadly SpaceShipTwo crash on October 31, 2014 during a test flight.  Investigators confirmed early suspicions that the space plane tore apart after the tail boom braking system was released too early, as discussed in a previous blog.  The tail booms are designed to feather to increase drag and slow down the space plane, but when the drag was applied earlier than expected, the additional aerodynamic forces ripped the space plane apart while it was at high altitude and high velocity.  Amazingly, one of the two pilots survived the accident.

Information from the newly released report can be used to expand the Cause Map from the previous blog.  The investigation determined that the pilot pulled the lever that released the braking system too early.  Even though the pilots did not command the tail booms into the braking position, aerodynamic forces pushed them into the feathered position once they were unlocked.  The space plane could not withstand the additional aerodynamic forces created by the feathered tail booms while it was still accelerating, and it broke apart around the pilots.

A Cause Map is built by asking “why” questions and documenting the answers in cause boxes to visually display the cause-and-effect relationships. So why did the pilot pull the lever too early?  A definitive answer to that may never be known since the pilot did not survive the crash, but it’s easy to understand how a mistake could be made in a high-stress environment while trying to recall multiple tasks from memory very quickly.  Additionally, the NTSB found that training did not emphasize the dangers of unlocking the tail booms too early so the pilot may not have been fully aware of the potential consequences of this particular error.

A more useful question to ask is how a single mistake could result in a deadly crash.  The design of the plane made it possible for a pilot pulling a lever too early to create a dangerous situation.  Ideally, no single mistake could create a deadly accident, and safeguards would have been built into the design to prevent the tail booms from feathering prematurely.  The NTSB determined the probable cause of this accident to be “failure to consider and protect against the possibility that a single error could result in a catastrophic hazard to the SpaceShipTwo vehicle.”  The investigation found that the design of the space plane assumed that the pilots would perform the correct actions every time.  Test pilots are highly trained and the best at what they do, but assuming human perfection is generally a dangerous proposition.
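To illustrate the principle in the abstract, a safeguard can require an independent condition to agree with the pilot’s command before it takes effect. The sketch below is purely hypothetical; the names and threshold are placeholders, not the actual SpaceShipTwo control logic:

```python
# Purely hypothetical sketch of a single-failure safeguard: a pilot command
# alone cannot feather the tail booms; an independently measured flight
# condition must also be satisfied.
SAFE_UNLOCK_MACH = 1.4  # placeholder threshold, assumed for illustration

def feather_unlock_permitted(pilot_commanded_unlock: bool, current_mach: float) -> bool:
    """Honor the unlock command only when the flight condition confirms it is safe."""
    return pilot_commanded_unlock and current_mach >= SAFE_UNLOCK_MACH

print(feather_unlock_permitted(True, 0.9))   # False: a premature command is ignored
print(feather_unlock_permitted(True, 1.5))   # True: both conditions are met
```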

The NTSB identified a few causes that contributed to the lack of safeguards in the SpaceShipTwo design.  Designing commercial spacecraft is a relatively new field; there is limited human factors guidance for commercial space operators and the flight database for commercial space mishaps is incomplete.  Additionally, there was insufficient review during the design process because it was never identified that a single error could cause a catastrophic failure.  To see the recommendations and more information on the investigation, view a synopsis of the NTSB’s report.

To see an updated Cause Map of this accident, click on “Download PDF” above.

Extensive Contingency Plans Prevent Loss of Pluto Mission

By ThinkReliability Staff

On July 14, 2015, the New Horizons probe began sending photos of Pluto back to Earth, much to the delight of the world (and social media).  The New Horizons probe was launched more than 9 years ago (on January 19, 2006) – so long ago that when it left, Pluto was still considered a planet. (It has since been downgraded to a dwarf planet.)  A mission that long isn’t without a few bumps in the road.  Most notably, just ten days before New Horizons’ Pluto flyby, mission control lost contact with the probe.

Loss of communication with the New Horizons probe while it was nearly 3 billion miles away could have resulted in the loss of the mission.  However, because of contingency and troubleshooting plans built into the design of the probe and the mission, communication was restored and the New Horizons probe continued on to Pluto.

The potential loss of a mission is a near miss. Analyzing near misses can provide important information and improvements for future issues and responses.  In this case, the mission goal is impacted by the potential loss of the mission (near miss).  The labor and time goal is impacted by the time for response and repair.  Because of the distance between mission control on Earth and the probe on its way to Pluto, the time required for troubleshooting was considerable, owing mainly to the communication delay: each signal had to travel nearly 3 billion miles, making a round trip about 9 hours.
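As a quick back-of-the-envelope check (our calculation, not from the mission documentation): radio signals travel at the speed of light, so a one-way distance of roughly 3 billion miles works out to a round-trip delay of about 9 hours.

```python
# Back-of-the-envelope check of the ~9-hour round-trip communication delay.
SPEED_OF_LIGHT_M_PER_S = 299_792_458
METERS_PER_MILE = 1_609.344

one_way_miles = 3e9   # approximate distance between Earth and the probe
one_way_hours = one_way_miles * METERS_PER_MILE / SPEED_OF_LIGHT_M_PER_S / 3600

print(f"One-way signal delay: {one_way_hours:.1f} hours")       # ~4.5 hours
print(f"Round-trip delay:     {2 * one_way_hours:.1f} hours")   # ~9 hours
```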

The potential loss of the mission was caused by the loss of communication between mission control and the probe.  Details on the error have not been released, but its description as a “hard to detect” error implies that it wasn’t noticed in testing prior to launch.  Because the particular command sequence that led to the loss of communication will not be repeated during the mission, once communication was restored there was no concern about a recurrence of this issue.

Not all causes are negative.  In this case, the “loss of mission” became a “potential loss of mission” because communication with the probe was restored.  This is due to the contingency and troubleshooting plans built into the design of the mission.  After the error, the probe automatically switched to a backup computer, per the contingency design.  Once communication was restored, the spacecraft automatically transmitted data back to mission control to aid in troubleshooting.
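The pattern described here, an automatic switch to a backup computer followed by automatic transmission of diagnostic data, can be sketched roughly as follows (an illustration of the pattern only, not New Horizons flight software):

```python
# Rough illustration of the contingency pattern: on an unrecoverable fault,
# fail over to the backup computer and queue diagnostic telemetry without
# waiting for ground commands, which would take hours to arrive.
class ProbeSketch:
    def __init__(self) -> None:
        self.active_computer = "primary"
        self.telemetry_queue: list[str] = []

    def handle_unrecoverable_fault(self, fault: str) -> None:
        self.active_computer = "backup"                       # automatic switchover
        self.telemetry_queue.append(f"diagnostics: {fault}")  # automatic downlink data

probe = ProbeSketch()
probe.handle_unrecoverable_fault("hard-to-detect command sequence fault")
print(probe.active_computer, probe.telemetry_queue)
```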

Of the mission, Alice Bowman, the Mission Operations Manager, says, “There’s nothing we could do but trust we’d prepared it well to set off on its journey on its own.”  Clearly, they did.

Small goldfish can grow into a large problem in the wild

By Kim Smiley

Believe it or not, the unassuming goldfish can cause big problems when released into the wild.  I personally would have assumed that a goldfish set loose into the environment would quickly become a light snack for a native species, but invasive goldfish have managed to survive and thrive in lakes and ponds throughout the world.  Goldfish will keep growing as long as the environment they are in supports it.  So while goldfish kept in an aquarium will generally remain small, without the constraints of a tank, goldfish the size of dinner plates are not uncommon in the wild. These large goldfish both compete with and prey on native species, dramatically impacting native fish populations.

This issue can be better understood by building a Cause Map, a visual format of root cause analysis, which intuitively lays out the cause-and-effect relationships that contributed to the problem.  A Cause Map is built by asking “why” questions and recording the answers as boxes on the Cause Map.  So why are invasive goldfish causing problems?  The problems are occurring because there are large populations of goldfish in the wild AND the goldfish are reducing native fish populations.  When two causes are both needed to produce an effect, as in this case, the causes are listed vertically on the Cause Map and separated by an “and”.  Keep asking “why” questions to continue building the Cause Map.
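As a rough sketch of that structure (hypothetical, and not ThinkReliability’s software), each effect can be mapped to the set of causes required to produce it, with multiple required causes standing in for the “and”:

```python
# Hypothetical sketch of the cause-and-effect structure described above.
# Each effect maps to the causes required to produce it; when more than
# one cause is listed, all are required (the "and" on the Cause Map).
cause_map = {
    "Invasive goldfish problem": [
        "Large populations of goldfish in the wild",    # AND
        "Goldfish reducing native fish populations",
    ],
    "Large populations of goldfish in the wild": [
        "Pet owners release unwanted goldfish",
    ],
    "Goldfish reducing native fish populations": [
        "Goldfish compete with native species for food",
        "Goldfish eat small fish and native species' eggs",
    ],
}

def ask_why(effect: str, depth: int = 0) -> None:
    """Print each effect's required causes, following the chain downward."""
    for cause in cause_map.get(effect, []):
        print("  " * depth + f"{effect} <- why? <- {cause}")
        ask_why(cause, depth + 1)

ask_why("Invasive goldfish problem")
```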

So why are there large populations of goldfish in the wild?  Goldfish are being introduced to the wild by pet owners who no longer want to care for them and don’t want to kill their fish.  The owners likely don’t understand the potential environmental impacts of dumping non-native fish into their local lakes and ponds.  Goldfish are also hardy and some may survive being flushed down a toilet and end up happily living in a lake if a pet owner chooses to try that method of fish disposal.

Why do goldfish have such a large impact on native species?  Goldfish can grow larger than many native species and compete with them for the same food sources.  In addition, goldfish eat small fish as well as the eggs of native species.  Invasive goldfish can also introduce new diseases into bodies of water that can spread to the native species.  The presence of a large number of goldfish can also change the environment in a body of water.  Goldfish stir up mud and other matter when they feed, which makes the water cloudier and impacts aquatic plants.  Some scientists also believe that large populations of goldfish can lead to algae blooms because goldfish feces are a potential food source for the algae.

Scientists are working to develop the most effective methods to deal with invasive goldfish.  In some cases, officials may drain a lake or use electroshocking to remove the goldfish.  As an individual, you can help by refraining from releasing pet fish into the wild.  It’s an understandable impulse to want to free an unwanted pet, but the consequences can be much larger than you might expect. If you need to get rid of aquarium fish, contact local pet stores; some will allow you to return the fish.

To view a Cause Map of this problem, click on “Download PDF” above.

Cause-and-Effect: Alcohol Consumption

By ThinkReliability Staff

The human body is a pretty amazing thing. Many of the processes that take place in our body on a regular basis – keeping us breathing, walking and playing video games or skydiving (or both, though hopefully not at the same time) – have not yet been replicated. They’re that complex.

Which of course raises a lot of questions: why do our bodies work the way they do? It also leads to the subset of questions, when x happens, why does y happen? If your question is, when I drink, why do I feel so great, then so lousy, science has the answers for you . . . and yes, we can capture them in a Cause Map!

If your goal for your body is to feel well and behave pretty consistently, then drinking alcohol is going to impact those goals. First, drinking is going to result in a decrease in control of your behavior. The specifics of how this manifests are legion, but I am sure you can supply your own examples. Your post-binge feelings are also going to be impacted: most likely your drinking is going to result in a hangover (generally awful feelings centered around your abdomen and head), dehydration and frequent urination. If your goal is not to eat everything in sight without any consideration of what it will do to your waistline, then your diet may also be impacted due to a desire for carbohydrates.

Beginning with one of these goals, we can ask our favorite question: Why? For example, our decrease in behavior control results from the hypothalamus, pituitary gland, and cerebellum being depressed. This decreases inhibitions and the ability to think clearly, and also releases a whole slew of hormones and dopamine. Additionally, alcohol impacts the neurotransmitters that direct emotions, actions and motor skills, so the combination may make you think you can dance on a table . . . but really you can barely walk.

Now about the ill after-effects. That lovely hangover results from your digestive system attempting to detoxify your body from alcohol, plus a pounding headache caused by dehydration. When your digestive system works to remove alcohol, the byproduct is acetaldehyde, and your body doesn’t like it at all. Most of the alcohol in your body is going to be flushed through your bladder. In order to speed its exit, your body redirects all the liquid it can to your bladder, leaving you dehydrated. (That’s also why you have to run to the bathroom so many times after drinking.) The whole process of removing alcohol from your body takes energy. In order to direct as much energy toward alcohol removal as possible, your brain shuts down most of your other functions (which doesn’t help with the ability to function). To get that energy back, your body craves food – carbs in particular (grease optional).

With all these bad effects, you may wonder why people drink at all. Well, when you drink, the alcohol depresses some systems as discussed above, resulting in the release of a bunch of hormones and dopamine. These make us feel good (or even fabulous!). That’s why we keep drinking. (There’s also a whole bunch of social pressures which I’m not going to go into here.)

Giving up drinking altogether is difficult, and many people don’t want to. There are, however, ways to minimize the ill effects of drinking. Food in your stomach helps absorb some of the alcohol, so eating before you drink can help. The headache portion of the hangover can be minimized by drinking a lot of water (though that won’t help with the frequent urination issue). AND OF COURSE, because drinking does a number on your fine motor control and general behavior, you should never, ever drink and drive or operate other heavy machinery.

To view the Cause Map of what happens when you drink, click on “Download PDF” above. The information used to create this blog is from:

“The Science of Getting Drunk” and

“Every Time You Get Drunk This Is What Happens To Your Body And Your Brain”

Deadly balcony collapse in Berkeley

By Kim Smiley

A 21st birthday celebration quickly turned into a nightmare when a fifth-story apartment balcony collapsed in Berkeley, California on June 16, 2015, killing 6 and injuring 7.  The apartment building was less than 10 years old and there were no obvious signs to the untrained eye that the balcony was unsafe prior to the accident.

The balcony was a cantilevered design, attached to the building on only one side by support beams.  A report by Berkeley’s Building and Safety Division stated that dry rot had significantly deteriorated the support beams, causing the balcony to catastrophically fail under the weight of 13 people.

Dry rot is decay caused by fungus and occurs when wood is exposed to water, especially in spaces that are not well ventilated. The building in question was built in 2007, and the extensive damage to the support beams indicates that there were likely problems with the waterproofing done during construction of the balcony.  Initial speculation is that the wood was not caulked and sealed properly when the balcony was built, which allowed the wood to be exposed to moisture and led to significant dry rot. However, the initial report by the Building and Safety Division did not identify any construction code violations, which raises obvious questions about whether the codes are adequate as written.

As a short-term solution to address potential safety concerns, the other balconies in the building were inspected to identify if they were at risk of a similar collapse so they could be repaired. As a potential longer-term solution to help reduce the risk of future balcony collapses in Berkeley as a whole, officials proposed new inspection and construction rules this week.  Among other things, the proposed changes would require balconies to include better ventilation and require building owners to perform more frequent inspections.  Only time will tell if proposed code changes will be approved by the Berkeley City Council, but something should be changed to help ensure public safety.

A reasonable long-term solution to this problem is needed because balconies and porches are naturally exposed to the weather and therefore susceptible to rot.  Deaths from balcony failures are not common, but there have been thousands of injuries.  Since 2003, only 29 deaths from collapsing balconies and porches have been reported in the United States (including this accident), but an estimated 6,500 people have been injured.

Click on “Download PDF” above to see a Cause Map, a visual format of root cause analysis, of this accident.  A Cause Map lays out all the causes that contributed to an issue to show the cause-and-effect relationships.

Rollercoaster Crash Under Investigation

By ThinkReliability Staff

A day at a resort/theme park ended in horror on June 2, 2015 when a carriage filled with passengers on the Smiler rollercoaster crashed into an empty car in front of it. The 16 people in the carriage were injured, 5 of them seriously (including limb amputations). While the incident is still under investigation by the Health and Safety Executive (HSE), information that is known can be collected in cause-and-effect relationships within a Cause Map, or visual root cause analysis.

The analysis begins with determining the impact to the goals. Clearly the most important goal affected in this case is the safety goal, impacted because of the 16 injuries. In addition to the safety impacts, customer service was impacted because passengers were stranded for hours in the air at a 45-degree angle. The HSE investigation and expected lawsuits are an impact to the regulatory goal. The park was closed completely for 6 days, at an estimated cost of £3 million. (The involved rollercoaster and others with similar safety concerns remain closed.) The damage to the rollercoaster and the response, rescue and investigation are impacts to the property and labor goals, respectively.

The Cause Map is built by laying out the cause-and-effect relationships starting with one of the impacted goals. In this case, the safety goal was impacted because of the 16 injuries. 16 passengers were injured due to the force on the carriage in which they were riding. The force was due to the speed of the carriage (estimated at 50 mph) when it collided with an empty carriage. According to a former park employee, the collision resulted from both a procedural and mechanical failure.

The passenger-filled carriage should not have been released while an empty car was still on the tracks, making a test run. It’s unclear what specifically went wrong to allow the release, but that information will surely be addressed in the HSE investigation and procedural improvements going forward. There is also believed to have been a mechanical failure. The former park employee stated, “Technically, it should be absolutely impossible for two cars to enter the same block, which is down to sensors run by a computer.” If this is correct, then it is clear that there was a failure with the sensors that allowed the cars to collide. This will also be a part of the investigation and potential improvements.
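The block-section protection the former employee describes is conceptually simple: a carriage may only be dispatched into a block whose sensors report it as empty. The sketch below is a hypothetical illustration of that concept, not the ride’s actual control system:

```python
# Hypothetical sketch of block-section protection: a carriage is only
# released into the next block if its occupancy sensor reports it as clear.
block_occupied = {"station": True, "lift_hill": False, "first_loop": False}

def release_carriage(current_block: str, next_block: str) -> bool:
    """Return True and advance the carriage only if the next block is clear."""
    if block_occupied[next_block]:
        return False                       # interlock holds the carriage back
    block_occupied[next_block] = True
    block_occupied[current_block] = False
    return True

print(release_carriage("station", "lift_hill"))  # True: block was clear
print(release_carriage("station", "lift_hill"))  # False: block is now occupied
```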

After the cause-and-effect relationships have been developed as far as possible (in this case, there is much information still to be added as the investigation continues), it’s important to ensure that all the impacted goals are included on the Cause Map. In this case, the passengers were stranded in the air because the carriage was stuck on the track due to the force upon it (as described above) and also due to the time required for rescue. According to data that has so far been released, it was 38 minutes before paramedics arrived on-scene, and even longer for fire crews to arrive with the necessary equipment to begin a rescue made very difficult by the design of the rollercoaster (the world record holder for most loops: 14). The park staff did not contact outside emergency services until 16 minutes after the accident – an inexcusably long time given the gravity of the incident. The delayed emergency response will surely be another area addressed by the investigation and continuing improvements.

Although the investigation is ongoing, the owners of the park are already making improvements, not only to the Smiler but to all its rollercoasters. In a statement released June 5, the owner group said “Today we are enhancing our safety standards by issuing an additional set of safety protocols and procedures that will reinforce the safe operation of our multi-car rollercoasters. These are effective immediately.” The Smiler and similar rollercoasters remain closed while these corrective actions are implemented.

Dr. Tony Cox, a former Health and Safety Executive (HSE) advisory committee chairman, hopes the improvements don’t stop there and issues a call to action for all rollercoaster operators. “If you haven’t had the accident yourself, you want all that information and you’re going to make sure you’ve dealt with it . . . They can just call HSE and say, ‘Is there anything we need to know?’ and HSE will . . . make sure the whole industry knows. That’s part of their role. It’s unthinkable that they wouldn’t do that.”

To view the information available thus far in a Cause Map, please click “Download PDF” above.

Live anthrax mistakenly shipped to as many as 24 labs

By Kim Smiley

The Pentagon recently announced that live anthrax samples were mistakenly shipped to as many as 24 laboratories in 11 different states and two foreign countries.  The anthrax samples were intended to be inert, but testing found that at least some of the samples still contained live anthrax.  There have been no reports of illness, but more than two dozen people have been treated for potential exposure.  Work has been disrupted at many labs during the investigation as testing and cleaning are performed to ensure that no unaccounted-for live anthrax remains.

The investigation is still ongoing, but the issues with anthrax samples appear to have been occurring for at least a year without being identified.  The fact that some of the samples containing live anthrax were transported via FedEx and other commercial shipping companies has heightened concern over possible implications for public safety.

Investigations are underway by both the Centers for Disease Control and Prevention and the Defense Department to figure out exactly what went wrong and to determine the full scope of the problem. Initial statements by officials indicated that there may be problems with the procedure used to inactivate the anthrax.  Investigators have so far indicated that the work procedure was followed, but that it may not have effectively killed 100 percent of the anthrax as intended.  Technicians believed that the samples were inert prior to shipping them out.

It may be tempting to call the issues with the work process used to inactivate the anthrax the “root cause” of this problem, but in reality more than one cause contributed to this issue, and more than one solution should be used to reduce the risk of future problems to acceptable levels.  Clearly, there is a problem if the procedure used to create inactive anthrax samples doesn’t kill all the bacteria present, and that will need to be addressed, but there is also a problem if there aren’t appropriate checks and tests in place to identify that live anthrax remains in samples.  When dealing with potentially deadly consequences, a work process should, where possible, be designed so that a single failure cannot create a dangerous situation.  An effective test for live anthrax prior to shipping the samples would have contained the problem to a single facility designed to handle live anthrax and drastically reduced the impact of the issue.  Additionally, another layer of protection could be added by requiring that a facility receiving anthrax samples test them upon receipt and handle them with additional precautions until they are determined to be fully inert.
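As a purely illustrative sketch (not an actual laboratory protocol), the layered checks described above amount to requiring two independent tests to pass before a sample is ever handled as inert:

```python
# Hypothetical sketch of layered verification: a sample is handled as inert
# only if independent viability tests pass both before shipping and on receipt.
# Function and field names are illustrative only.
def tested_inert_at_origin(sample: dict) -> bool:
    return sample.get("viability_test_at_origin") == "negative"

def tested_inert_on_receipt(sample: dict) -> bool:
    return sample.get("viability_test_at_destination") == "negative"

def safe_to_handle_as_inert(sample: dict) -> bool:
    # Both layers must pass; a failed or missing check means treat it as live.
    return tested_inert_at_origin(sample) and tested_inert_on_receipt(sample)

sample = {"viability_test_at_origin": "negative"}   # receipt test not yet performed
print(safe_to_handle_as_inert(sample))              # False: treat as live
```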

Building additional testing into a work process does add time and cost, but sometimes it is worth it to identify small problems before they become much larger ones.  If the issues with the process used to create inert anthrax samples had been identified the first time it failed to kill all the anthrax, they could have been dealt with long before they became headline news and people were unknowingly exposed to live anthrax. Testing both before shipping and after receipt of samples may be overkill in this case, but something more than just fixing the process for creating inert samples needs to be done, because inadvertently shipping live anthrax for more than a year indicates that issues are not being identified in a timely manner.

6/4/2015 Update: It was announced that anthrax samples suspected of inadvertently containing live anthrax were sent to 51 facilities in 17 states, the District of Columbia and 3 foreign countries (Australia, Canada and South Korea). Ten samples in 9 states have tested positive for live anthrax, and the number is expected to grow as more testing is completed. 31 people have received preventative treatment for potential exposure to anthrax, but there are still no reports of illness. Click here to read more.

Deadly Train Derailment Near Philadelphia

By Kim Smiley

On the evening of May 12, 2015, an Amtrak train derailed near Philadelphia, killing 8 and injuring more than 200.  The investigation is still ongoing with significant information about the accident still unknown, but changes are already being implemented to help reduce the risk of future rail accidents and improve investigations.

Data collected from the train’s onboard event recorder shows that the train sped up in the moments before the accident until it was traveling 106 mph in a 50 mph zone where the train track curved.  The excessive speed clearly played a role in the accident, but there has been little information released about why the train was traveling so fast going into a curve.  The engineer controlling the train suffered a head injury during the accident and has stated that he has no recollection of the accident. The engineer was familiar with the route and appears to have had all required training and qualifications.

As a result of this accident and the difficulty determining exactly what happened, Amtrak has announced that cameras will be installed inside locomotives to record the actions of engineers.  While the cameras may not directly reduce the risk of future accidents, the recorded data will help future investigations be more accurate and timely.

The excessive speed at the time of the accident is also fueling the ongoing debate about how trains should be controlled and the implementation of positive train control (PTC) systems that can automatically reduce speed.  There was no PTC system in place at the curve in the northbound direction where the derailment occurred, and experts have speculated that one would have prevented the accident. In 2008, Congress mandated nationwide installation and operation of positive train control systems by 2015.  Prior to the recent accident, the Association of American Railroads stated that more than 80 percent of the track covered by the mandate will not have functional PTC systems by the deadline. The installation of PTC systems requires a large commitment of funds and resources, as well as communication bandwidth that has been difficult to secure in some areas, and some think the end-of-year deadline is unrealistic. Congress is currently considering two different bills that would address some of these issues.  The recent deadly crash is sure to be front and center in those debates.
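Conceptually, a PTC system enforces the speed limit of the upcoming track segment regardless of what the engineer does. The sketch below is a minimal illustration of that idea, not an actual PTC implementation:

```python
# Minimal sketch of positive train control overspeed enforcement: if the
# train's speed exceeds the limit of the upcoming segment, the system
# intervenes with automatic braking, independent of the engineer.
def ptc_applies_brakes(current_speed_mph: float, segment_limit_mph: float) -> bool:
    """Return True when the system should intervene with automatic braking."""
    return current_speed_mph > segment_limit_mph

print(ptc_applies_brakes(106, 50))  # True: the derailment scenario, braking would engage
print(ptc_applies_brakes(45, 50))   # False: no intervention needed
```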

In response to the recent accident, the Federal Railroad Administration ordered Amtrak to submit plans for PTC systems at all curves on the main Northeast Corridor (running between Washington, D.C. and Boston) where the curve’s speed limit is 20 mph less than that of the track leading into it.  Only time will tell how quickly positive train control systems will be implemented on the Northeast Corridor, as well as the rest of the nation, and the debate on the best course of action will not be a simple one.

An initial Cause Map, a visual root cause analysis, can be created to capture the information that is known at this time.  Additional information can easily be incorporated into the Cause Map as it becomes available.  To view a high level initial Cause Map of this accident, click on “Download PDF”.

Indian Point Fire and Oil Leak

By Sarah Wrenn

At 5:50 PM on May 9, 2015, a fire ignited in one of the two main transformers for the Unit 3 reactor at Indian Point Energy Center. These transformers carry electricity from the main generator to the electrical grid. While the transformer is part of an electrical system external to the nuclear system, the reactor is designed to automatically shut down following a transformer failure. This system functioned as designed, and the reactor remains shut down during the ongoing investigation. Concurrently, oil (dielectric fluid) spilled from the damaged transformer into the plant’s discharge canal, and some amount was also released into the Hudson River. On May 19, Fred Dacimo, vice president for license renewal at Indian Point, and Bill Mohl, president of Entergy Wholesale Commodities, stated that the transformer holds more than 24,000 gallons of dielectric fluid. Inspections after the fire revealed that 8,300 gallons had been collected or were combusted during the fire. As a result, investigators are working to account for the remaining roughly 16,000 gallons of oil. Based on estimates from the Coast Guard, supported by NOAA, up to approximately 3,000 gallons may have gone into the Hudson River.
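The volume accounting in those statements reduces to simple arithmetic (rounded figures as reported):

```python
# Simple balance of the oil volumes reported in the public statements
# (all figures approximate).
total_dielectric_fluid_gal = 24_000   # "more than 24,000 gallons" in the transformer
collected_or_combusted_gal = 8_300    # accounted for after the fire

unaccounted_gal = total_dielectric_fluid_gal - collected_or_combusted_gal
print(f"Unaccounted for: ~{unaccounted_gal:,} gallons")   # ~15,700, reported as ~16,000

estimated_to_river_gal = 3_000        # Coast Guard/NOAA upper estimate
print(f"Estimated upper bound to the Hudson: ~{estimated_to_river_gal:,} gallons")
```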

The graphic located here provides details regarding the event, facility layout and response.

Step 1. Define the Problem

There are a few problems in this event. Certainly, the transformer failure and fire are major problems. The transformer is an integral component for transferring electricity from the power plant to the grid; without it, production has been halted. In addition, there is an inherent risk of injury with the fire response: the site’s fire brigade was dispatched to respond to the fire, and while there were no injuries, there was the potential for injury. Finally, the release of dielectric fluid and fire-retardant foam into the Hudson River is a problem. A moat around the transformer is designed to contain these fluids if they are released, but evidence shows that some amount reached the Hudson River.

As shown in the timeline and noted on our problem outline, the transformer failure and fire occurred at 5:50 PM, and the fire was officially declared out 2.25 hours later.

As for anything out of the ordinary or unusual when this event occurred, Unit 3 had just returned to operation after a shutdown on May 7 to repair a leak of clean steam from a pipe on the non-nuclear side of the plant. It was also noted that this is the third transformer failure in the past 8 years, a frequency that is considered unusual. The Wall Street Journal reported that the transformer that failed earlier this month replaced another transformer that malfunctioned and caught fire in 2007. Yet another transformer, which had been in operation for four years, failed in 2010.

Multiple organizational goals were negatively impacted by this event. As mentioned above, there was a risk of injury related to the fire response. There was also a negative impact to the environment due to the release of dielectric fluid and fire-retardant foam. The negative publicity from the event impacts the organization’s customer service goal. A notification to the NRC of an Unusual Event (the lowest of the 4 NRC emergency classifications) is a regulatory impact. For production/schedule, Unit 3 was shut down May 9 and remains shut down during the investigation. There was a loss of the transformer, which needs to be replaced. Finally, there is labor/time required to address and contain the release, repair the transformer, and investigate the incident.

Step 2. Identify the Causes (Analysis)

Now that we’ve defined the problem in relation to how the organization’s goals were negatively impacted, we want to understand why.

The Safety Goal was impacted due to the potential for injury. The risk of injury exists because of the transformer fire.

The Regulatory Goal was impacted due to the notification to the NRC. This notification was required because of the Unit 3 shutdown, which also impacts the Production/Schedule Goal. Unit 3 shut down because that is the designed response to this type of emergency. The shutdown is the designed response to the loss of the electrical transformer, which also impacts the Property/Equipment Goal. Why was the electrical transformer lost? Because of the transformer fire.

For the other impacted goals, the Customer Service Goal was impacted because of the negative publicity, which was caused by the time and effort spent on containment, repair, and investigation. This time and effort impacts the organization’s Labor/Time Goal and was required because of the release of dielectric fluid and fire-retardant foam. Why was there a release? Because the fluid and foam were able to reach the river.

Why were the fluid and foam able to reach the river?

The fire-retardant foam was introduced because the sprinkler system was ineffective. The transformer is located outside in the transformer yard, which is equipped with a sprinkler system. Reports indicate that the fire was originally extinguished by the sprinklers, but then reignited. Fire responders introduced fire-retardant foam and water to attack the fire more aggressively. Some questions we would ask here include: Why was the sprinkler system ineffective at completely controlling the fire? Alternatively, is the sprinkler system designed only to begin controlling the fire as an immediate response so that the fire brigade has time to respond? If so, did the sprinklers perform as expected and designed?

The transformer moat is designed to catch fluids but was unable to contain the fluid and the foam. When a containment is unable to hold the amount of fluid introduced, either there is a leak in the containment or the amount of fluid introduced is greater than the containment’s capacity. We want to investigate the integrity of the containment and whether there are any leak paths that would have allowed fluids to escape the moat. We also want to understand the volume of fluid that was introduced. The moat is capable of holding up to 89,000 gallons of fluid, and a transformer contains approximately 24,000 gallons of dielectric fluid. What we don’t know is how much fire-retardant foam and water were introduced. If that amount plus the transformer fluid is greater than the capacity of the moat, then the fluid will overflow and can reach the river. If this is the case, we also want to understand whether the moat capacity is sufficient: should it be larger? Also, is the moat designed such that an overflow goes to the discharge canal, and is this desired?
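The capacity reasoning above boils down to a simple comparison. Here is a sketch using the figures in the text; the foam-and-water volume is the unknown, so the values passed in below are placeholders for illustration only:

```python
# Sketch of the moat-capacity reasoning: the moat fails to contain the
# release if it leaks OR the total volume introduced exceeds its capacity.
MOAT_CAPACITY_GAL = 89_000
DIELECTRIC_FLUID_GAL = 24_000

def moat_overflows(foam_and_water_gal: float, leak_in_containment: bool = False) -> bool:
    """Return True if the release is not contained by the moat."""
    return leak_in_containment or (DIELECTRIC_FLUID_GAL + foam_and_water_gal) > MOAT_CAPACITY_GAL

print(moat_overflows(foam_and_water_gal=50_000))  # False: within capacity
print(moat_overflows(foam_and_water_gal=70_000))  # True: overflow toward the discharge canal
```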

Finally, dielectric fluid reached the river because the fluid was released from the transformer. Questions we would ask here are: Why was the fluid released, and why does a transformer contain dielectric fluid? Dielectric fluid is used to cool the transformers; other cooling methods, such as fans, are also in place. The causes of the fluid release and transformer failure are still being investigated, but in addition to determining these causes, we would also ask how the transformers are monitored and maintained. The Wall Street Journal provided a statement from Jerry Nappi, a spokesman for Entergy. Nappi said both of Unit 3’s transformers passed extensive electrical inspections in March. Transformers at Indian Point get these intensive inspections every two years, and aspects of the devices are also inspected daily.

Finally, we want to understand why there was a transformer fire. The transformer fire occurred because there was a heat source (ignition source), fuel, and oxygen. We want to investigate what the heat source was – was there a spark, a short in the wiring, a static electricity buildup? Also, where did the fuel come from, and was it expected to be there? The dielectric fluid is flammable, but are there other fuel sources present?

Step 3. Select the Best Solutions (Reduce the Risk)

What can be done? With the investigation ongoing, a lot of facts still need to be gathered to complete the analysis. Once that information is gathered, we want to consider what can be done to reduce the risk of this type of event occurring in the future. We would want to evaluate what can be done to address the transformer, implementing solutions to better maintain, monitor, and/or operate it, focusing on solutions that will minimize the risk of failure and fire. However, if a failure does occur, we want to consider solutions so that the failure and fire do not result in a release. Further, we can consider the immediate response: do these steps adequately contain the release? Identifying specific solutions to the causes identified will reduce the risk of future similar events.

Resources:

This Cause Map was built using publicly available information from the following resources.

De Avila, Joseph “New York State Calls for Tougher Inspections at Indian Point” http://www.wsj.com/articles/nuclear-regulatory-commission-opens-probe-at-indian-point-1432054561 Published 5/20/2015. Accessed 5/20/2015

“Entergy’s Response to the Transformer Failure at Indian Point Energy Center” http://www.safesecurevital.com/transformer_update/ Accessed 5/19/2015

“Entergy Plans Maintenance Shutdown of Indian Point Unit 3” http://www.safesecurevital.com/entergy-plans-maintenance-shutdown-of-indian-point-unit-3/ Published 5/7/2015. Accessed 5/19/2015

“Indian Point Unit 3 Safely Shutdown Following Failure of Transformer” http://www.safesecurevital.com/indian-point-unit-3-safely-shutdown-following-failure-of-transformer/ Published 5/9/2015. Accessed 5/19/2015

“Entergy Leading Response to Monitor and Mitigate Potential Impacts to Hudson River Following Transformer Failure at Indian Point Energy Center” http://www.safesecurevital.com/entergy-leading-response-to-monitor-and-mitigate-potential-impacts-to-hudson-river-following-transformer-failure-at-indian-point-energy-center/ Published 5/13/2015. Accessed 5/19/2015

“Entergy Continues Investigation of Failed Transformer, Spilled Dielectric Fluid at Indian Point Energy Center” http://www.safesecurevital.com/entergy-continues-investigation-of-failed-transformer-spilled-dielectric-fluid-at-indian-point-energy-center/ Published 5/15/2015. Accessed 5/19/2015

McGeehan, Patrick “Fire Prompts Renewed Calls to Close the Indian Point Nuclear Plant” http://www.nytimes.com/2015/05/13/nyregion/fire-prompts-renewed-calls-to-close-the-indian-point-nuclear-plant.html?_r=0 Published 5/12/2015. Accessed 5/19/2015

Screnci, Diane. “Indian Point Transformer Fire” http://public-blog.nrc-gateway.gov/2015/05/12/indian-point-transformer-fire/comment-page-2/#comment-1568543 Accessed 5/19/2015