Root Cause Analysis - Incident Investigation

Kitty Litter Cause of Radiological Leak?

May 30, 2014 ThinkReliability Staff

The rupture of a container filled with nuclear waste from Department of Energy (DOE) sites that resulted in the radiological contamination of 21 workers appears to have resulted from a heat-producing reaction, possibly between the nuclear waste and the kitty litter used to stabilize the waste.

Yes, you read that correctly. The same stuff you use for Fluffy’s “business” is also used to stabilize nuclear waste. However, the kitty litter typically used is clay. One of the sites that provides waste to the Waste Isolation Pilot Plant, where the release occurred, changed from clay kitty litter to organic kitty litter, which is made of plant material. Although the reaction that resulted in the container’s rupture has not yet been determined, it is possible that it was due to the change in litter.

We can look at this incident in a Cause Map, or visual root cause analysis, to lay out both the effects and causes. In this case, the effects were significant. Twenty-one workers were found to have internal radiological contamination, impacting the safety goal. A radiological release off-site impacted the environmental goal. The waste repository has been shut down and is not accepting shipments, impacting both the customer service and production goals. The release requires investigation by a formal Accident Investigation Board, impacting the regulatory and labor goals. Lastly, the damage to the container is an impact to the property goal.

The release was caused by the rupture of a container that stored radiological waste, including americium and plutonium. The release was able to leave the underground storage facility through a leak path in the ventilation system. The ventilation system was not designed for containment because the safety analysis assumed that a release within the storage facility would result from a roof panel fall, which was considered adequately prevented.

The rupture appears to have resulted from a heat-producing reaction. The constituents of that reaction have not yet been determined, but the change from clay to organic kitty litter has been identified as a possible cause. (A possible cause indicates a cause for which evidence is not yet available.) More research is being done to determine the actual reaction. This will also allow a determination of which other waste containers may be at risk for rupture.

A solution that has already been implemented is to seal the leaks in the ventilation system with foam to reduce the risk of leak-by. Other solutions that have been suggested are to add heavy-duty containment around the affected casks, reclassify the ventilation system as containment, and perform an independent review of the safety analysis of the site. Once appropriate solutions are determined and implemented, it’s hoped the site will be able to reopen.

To view the Outline and Cause Map, please click “Download PDF” above.

Smoke at FAA Facility Results in Major Flight Disruptions

May 22, 2014 Kim Smiley

A smoking bathroom fan resulted in the disruption of more than a thousand flights in the Chicago area on May 13, 2014 in a dramatic demonstration of real time cause-and-effect. This incident illuminates how a relatively small issue can quickly grow into an expensive and time-consuming problem. In an ideal world, a smoking bathroom fan wouldn’t result in national headlines.

So what happened? How did a smoking bathroom fan that wasn’t even at the airport delay so many flights? A Cause Map, a visual method for performing a root cause analysis, is a useful tool for understanding the causes that contributed to an issue. When building a Cause Map, causes are laid out based on cause-and-effect relationships to clearly show what led to the problem.

In this example, flights were delayed because there was limited support from air traffic control available and air traffic control support is necessary for safe operation. Air traffic control support was reduced because the Elgin FAA facility that monitors airports in the Chicago area was evacuated for several hours because the building was filled with smoke. The building had to be evacuated for personnel safety and it took some time to reestablish safe conditions. Emergency personnel had a difficult time pinpointing the source of the smoke because it spread through the space. The smoke was throughout the building because the source of the smoke, a bathroom fan, was part of the HVAC system.
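
The chain above can be sketched as a small directed graph that is walked by repeatedly asking "why". This is only a minimal illustration in Python, not the Cause Mapping tooling itself; the dictionary layout and the function name `why_chain` are assumptions made for this example.

```python
# Each effect maps to the causes the write-up gives for it.
# The layout and all names here are illustrative assumptions.
CAUSE_MAP = {
    "flights delayed": ["limited air traffic control support"],
    "limited air traffic control support": ["Elgin FAA facility evacuated"],
    "Elgin FAA facility evacuated": ["building filled with smoke"],
    "building filled with smoke": [
        "smoking bathroom fan",
        "fan was part of HVAC system",
    ],
}


def why_chain(effect, cause_map):
    """Return every cause reachable from `effect` by repeatedly asking "why"."""
    seen = []
    stack = [effect]
    while stack:
        current = stack.pop()
        for cause in cause_map.get(current, []):
            if cause not in seen:
                seen.append(cause)
                stack.append(cause)
    return seen


causes = why_chain("flights delayed", CAUSE_MAP)
for cause in causes:
    print(cause)
```

Storing the map as effect-to-causes keeps the "why" direction explicit, and an effect with more than one cause (here, the smoke-filled building) simply lists each of them.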

The media reports didn’t provide details about exactly why the bathroom fan was smoking in this particular case, but bathroom fans are a relatively common cause of building fires. Lint or dust can build up in the fan motor over time, eventually leading to the motor overheating. The situation can quickly become dangerous, particularly when a motor is left powered after it has seized, which is a common failure mode for this equipment.

A few fairly easy things can be done to reduce the risk of bathroom fan fires. Fans should be cleaned at least annually, and more frequently if they appear dirty or dusty. A motor that is making unusual noises should be turned off immediately and inspected by an electrician before being returned to service. Any fan that isn’t making the typical whizz sound should also be powered off and repaired or replaced prior to use, because a motor that isn’t rotating has a greater likelihood of overheating. Older models that aren’t thermally protected are most at risk for a fire, and replacing them with a newer model with thermal protection can significantly reduce the risk of fire.

To view a high level Cause Map, click on “Download PDF” above.

Hundreds of Flights Disrupted After Air-Traffic Control System Confused by U-2 Spy Plane

May 15, 2014 Kim Smiley

Hundreds of flights were disrupted in the Los Angeles area on April 30, 2014 when the En Route Automation Modernization air traffic control system, known as ERAM, crashed. It’s been reported that the presence of a U-2 spy plane played a role in the air traffic control issues.

This issue can be analyzed by building a Cause Map, a visual format for performing a root cause analysis. A Cause Map intuitively lays out the cause-and-effect relationships so that the problem can be better understood and a wider range of solutions considered. In order to build a Cause Map, the impacted goals are determined and “why” questions are asked to determine all the causes that contributed to the issue.

In this example, the schedule goal was clearly impacted because 50 flights were canceled and more than 400 were delayed. Why did this occur? The flight schedule was disrupted because planes were unable to land or depart safely because the air traffic control system used to monitor the landings was down. The computer system crashed because it became overwhelmed when it tried to reroute a large number of flights in a short period of time.

The system attempted to reroute so many flights at once because the system’s calculations showed that there was a risk of plane collisions because the system misinterpreted the flight path, specifically the altitude, of a U-2 on a routine training mission in the area. U-2s are designed for ultra-high altitude reconnaissance, and the plane is reported to have been flying above 60,000 feet, well above any commercial flights. The system didn’t realize that the U-2 was thousands of feet above any other aircraft so it frantically worked to reroute planes so they wouldn’t be in unsafe proximity.
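
To see why a misread altitude matters so much, consider a toy conflict check that flags two aircraft only when they are close both laterally and vertically. This is a hedged sketch in Python and not how ERAM works internally: the thresholds, the `Track` class, the callsigns, and the misread altitude value are all hypothetical, chosen only to illustrate the effect described above.

```python
import math
from dataclasses import dataclass

# Illustrative separation thresholds; assumptions, not taken from ERAM.
LATERAL_NM = 5.0
VERTICAL_FT = 1000.0


@dataclass
class Track:
    callsign: str
    x_nm: float
    y_nm: float
    altitude_ft: float


def in_conflict(a, b):
    """Two tracks conflict only if BOTH lateral and vertical separation are lost."""
    lateral = math.hypot(a.x_nm - b.x_nm, a.y_nm - b.y_nm)
    vertical = abs(a.altitude_ft - b.altitude_ft)
    return lateral < LATERAL_NM and vertical < VERTICAL_FT


def flights_to_reroute(target, traffic):
    """Return callsigns of traffic that the check believes conflict with `target`."""
    return [t.callsign for t in traffic if in_conflict(target, t)]


# Hypothetical commercial traffic near the U-2's ground position.
traffic = [
    Track("AIR1", 1.0, 0.0, 35000.0),
    Track("AIR2", 0.0, 2.0, 35500.0),
    Track("AIR3", 40.0, 0.0, 35000.0),
]

u2_actual = Track("U2", 0.0, 0.0, 60000.0)   # altitude as reported in the article
u2_misread = Track("U2", 0.0, 0.0, 35000.0)  # hypothetical misinterpretation

print(flights_to_reroute(u2_actual, traffic))   # []
print(flights_to_reroute(u2_misread, traffic))  # ['AIR1', 'AIR2']
```

With the correct altitude, nothing needs to move. With the altitude misread, every aircraft passing underneath becomes a candidate for rerouting, which is the kind of sudden workload the article describes.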

It took several hours to sort out the problem, but then the Federal Aviation Administration was able to implement a short term fix relatively quickly and get the ERAM system back online. The ERAM system is being evaluated to determine whether other fixes are needed to prevent a similar problem from occurring again. It’s also worth noting that ERAM is a relatively new system (implementation began in 2002) that is replacing the obsolete 1970s-era hardware and software system that had been in place previously. Hopefully there won’t be many more growing pains with the changeover to a new air traffic control system.

To see a high level Cause Map of this problem, click on “Download PDF” above.

1990 Cascading Long Distance Failure

May 9, 2014 ThinkReliability Staff

On January 15, 1990, a cascading failure left tens of thousands of people in the Northeast US without long distance service for up to 9 hours. Over 50 million calls were blocked, at an estimated loss of $60 million. (Remember, there weren’t really any other ways to quickly connect outside of the immediate area at the time.)

We can examine this historical incident in a Cause Map, or visual root cause analysis, to demonstrate what went wrong, and what was done to fix the problem. First, we begin with the impact to the goals. No impacts to the safety, environmental, or property goals were discussed in the resources I used, but it is possible they were impacted, so we’ll leave those as unknown. The customer service and production goals were clearly impacted by the loss of service, which was considerable and estimated to cost $60 million, not including time for troubleshooting and repairs.

Asking “Why” questions allows development of the cause-and-effect relationships that led to the impacted goals. In this case, the outage was due to a cascading switch failure: 114 switches crashed and rebooted over and over again. A switch would crash upon receiving a message from a neighboring switch. This message was meant to inform other switches that one switch was busy, to ensure that calls were routed elsewhere. (A Process Map demonstrating how long distance calls were connected is included on the downloadable PDF.) Unfortunately, instead of allowing the call to be redirected, the message caused the receiving switch to crash. This occurred because an errant line in the coding of the process allowed optional tasks to overwrite crucial communication data. The error was included in a software upgrade designed to increase throughput of messages.
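
The feedback loop described above can be illustrated with a toy simulation. This is a sketch under loud assumptions, not the actual switch software: the four-switch ring, the function name `simulate`, and the rule that a rebooted switch immediately announces its status to its neighbors are all hypothetical simplifications.

```python
from collections import deque


def simulate(neighbors, first_sender, buggy, max_messages=1000):
    """Deliver status messages in a toy switch network and count crashes.

    With buggy=True, receiving a status message crashes the receiver, which
    reboots and sends its own status message to each of its neighbors.
    With buggy=False, the receiver just handles the message.
    """
    crashes = {name: 0 for name in neighbors}
    queue = deque((first_sender, n) for n in neighbors[first_sender])
    delivered = 0
    while queue and delivered < max_messages:
        sender, receiver = queue.popleft()
        delivered += 1
        if buggy:
            crashes[receiver] += 1
            # After rebooting, the receiver announces its status in turn.
            queue.extend((receiver, n) for n in neighbors[receiver])
    return crashes, delivered


# Hypothetical ring of four switches, each with two neighbors.
RING = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "A"],
}

healthy_crashes, healthy_delivered = simulate(RING, "A", buggy=False)
buggy_crashes, buggy_delivered = simulate(RING, "A", buggy=True)

print(healthy_delivered, sum(healthy_crashes.values()))  # 2 0
print(buggy_delivered, sum(buggy_crashes.values()))      # 1000 1000
```

In the healthy case the two messages are absorbed and the network goes quiet. In the buggy case each delivery creates more messages than it consumes, so the simulation only stops because of the `max_messages` cap, which mirrors switches crashing and rebooting over and over.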

It’s not entirely clear how the error (one added line of code that would bring down a huge portion of the long distance network) was released. The line appears to have been added after testing was complete, during a busy holiday season. That a line of code was added after testing seems to indicate that the release process wasn’t followed.

In this case, a solution needed to be found quickly. The upgraded software was pulled and replaced with the previous version. Better testing was presumably used afterward, because a problem of this magnitude has rarely been seen since.

To view the Outline, Cause Map and Process Map, please click “Download PDF” above.

Hundreds Die When South Korean Ferry Capsizes

May 2, 2014 ThinkReliability Staff

The nation of South Korea was devastated after a ferry capsized off Byungpoong on April 16, 2014. While the ferry tipped over and sank quickly (within two hours), the evacuation orders came slowly (a half-hour after the first distress call). The combination resulted in over 300 people being trapped within the ship and killed. The Captain and much of the crew were able to escape.

There are a multitude of causes involved in this tragedy, which can be captured within a Cause Map. A Cause Map visually develops the cause-and-effect relationships that led to the impacts on an organization’s goals.

Clearly, the safety goal in this case was impacted, due to the large number of deaths (at the time of this blog, 226 bodies had been found and 73 people were still missing). In addition, legal action is being taken against the captain and members of the crew responsible for navigation for negligence and failure to assist passengers. The Captain has also been arrested for “undertaking an excessive change of course without slowing down”. The loss of the ship can be considered an impact to the property goal, and the massive rescue and recovery operations are an impact to the labor/time goal.

By asking “why” questions, the cause-and-effect relationships are developed. Most of the deaths resulted from passengers drowning when they were trapped in the ship as it capsized and sank. The ferry capsized because of a sharp turn and stability issues. The ship was turned too quickly at excess speed, possibly because the third mate in charge of navigation was inexperienced (this was her first time) and because of steering gear issues, which were reported two weeks prior to the accident and apparently not fixed. The ship had been recently modified to add more passenger cabins, which made it top heavy. As a result of the modifications, the recommended cargo weight was reduced. At the time of the accident, the ship was carrying three times the recommended cargo weight.

Passengers became trapped in the ferry prior to the evacuation order, which was issued thirty minutes after the first distress call (and which it appears not all passengers were able to hear). During this time, the ship had listed to a point that made it impossible to get out. The Captain was concerned about the safety of his passengers in the water and appears to have called the parent company to request permission to evacuate. Additionally, the ship’s life rafts could not be used. Photos show crew members unable to release life rafts. Only 2 of the 46 life rafts on the ship were successfully deployed. Lastly, the crew provided insufficient assistance, abandoning ship without making necessary efforts to free the passengers.

This tragic incident has been compared to the Titanic (due to the insufficient number of lifeboats and people being unable to leave the ship), the Valdez oil spill (because an inexperienced third mate was performing navigation while the Captain was in his cabin), and the Costa Concordia (when the Captain left the ship without supervising the evacuation effort). As long as lessons from other organizations (and even industries) are not understood by those performing similar work, these tragedies will continue to happen.

To view the Outline and Cause Map, please click “Download PDF” above.
