Category Archives: Root Cause Analysis – Incident Investigation

Passengers trapped in smoke-filled metro train

By Kim Smiley

A standard commute quickly turned into a terrifying ordeal for passengers on a metro train in Washington, DC the afternoon of January 12, 2015.  Shortly after leaving a station, the train abruptly stopped and then quickly filled with thick smoke. One passenger died as a result of the incident and 84 more were treated for injuries, predominantly smoke inhalation.

This incident can be analyzed by building a Cause Map, a visual root cause analysis.  A Cause Map visually lays out the cause-and-effect relationships to show all the causes that contributed to an issue.  The first step in the Cause Mapping process is to define the problem by filling in an Outline with the basic background information as well as documenting how the issue impacts the overall goals.  For this example, the safety goal is clearly impacted by the passenger death and injuries.  A number of other goals should also be considered such as the schedule goal which was impacted by significant metro delays.  (To view an Outline and initial Cause Map for this issue, click on “Download PDF” above.)

So why were passengers injured and killed?  Passengers were trapped on the train and it filled with smoke.  It is unclear why the train wasn’t able to back up to the nearby station once the smoke formed and investigators are working to learn more.  (Open issues can be documented on the Cause Map with a question mark to indicate that more evidence is needed.)  There are also questions about the time emergency workers took to reach the train to aid in evacuation of passengers, so this is another area that will require more information to fully understand. By some accounts, it took 40 minutes for firefighters to reach the trapped passengers.

Initial reports are that smoke was caused by an electrical arcing event, likely from the cables supporting the high voltage third rail used to power the trains. The specifics of what caused the arc are being investigated by the National Transportation Safety Board and will be released when the investigation is concluded.  What is known is that there was significant smoke caused by the arc, but no fire.  There have also been reports of water near the rails that may have been a factor in the arcing.

Eyewitness accounts of this incident are horrifying.  People had little information and didn’t know whether there was fire nearby at first.  They were told to remain on the train and await rescue, but the rescue took some time, which surely felt longer to the scared passengers.  It won’t be clear what solutions need to be implemented to prevent similar problems in the future until the investigation is complete, but I think we can agree that metro officials need to work to ensure passenger safety going forward.

Bad Weather Believed to Have Brought Down AirAsia Flight QZ8501

By ThinkReliability Staff

AirAsia flight QZ8501, and the 162 people on-board, was lost on December 28, 2014 while flying through high-altitude thunderstorms. Because of a delay in finding the plane and continuing bad weather in the area, the black box, which contains data that will give investigators more detail on why the plane went down, has not yet been recovered. Even without the black box’s data, experts believe that the terrible weather in the area was a likely cause of the crash.

“From our data it looks like the last location of the plane had very bad weather and it was the biggest factor in behind the crash. These icy conditions can stall the engines of the plane and freeze and damage the plane’s machinery,” says Edvin Aldrian, the head of Research at an Indonesian weather agency. Beyond the icing of engines, there are other theories on how weather-related issues may have brought down the plane.

Early speculation was that the plane was struck by lightning; while it may have been struck by lightning, experts say it’s unlikely it would have brought the plane down, because modern planes are fairly well-equipped to deal with direct lightning strikes. High levels of turbulence can also result in stalling due to a loss of airflow over the wings. There are also some who believe the plane (an Airbus A320) may have been pushed into a vertical climb past the limit for safe operation (to escape the weather) which resulted in a stall.

While the actual mechanism of how the weather (or an unrelated issue) brought the plane down is still to be determined, aviation safety organizations are already implementing some interventions to increase the safety of air travel in the area based on some specific areas of concern. (These areas of concern can be viewed visually in a Cause Map, or visual root cause analysis, by clicking on “Download PDF” above.)

AirAsia pilots relied on “self-briefings” regarding the weather. Pilots in other locations have expressed concern about the adequacy of weather information pilots obtain using this method. Direct pilot briefings with dispatchers based on detailed weather reporting are recommended to ensure that pilots have the information they need to safely traverse areas of poor weather (or stay out of them altogether).

Heavy air traffic in the area delayed approval to climb out of the storm. At 6:12 local time the flight crew requested to climb to a higher altitude to attempt to escape the storm. Air traffic control did not attempt to respond to the plane until 6:17, at which point it could no longer be contacted. Air traffic in the area was heavy, possibly because:

The plane did not have permission to fly the route it was on. AirAsia was licensed to fly the route it was taking at the time of the crash four days a week, but not the day of the crash. The takeoff airport used incorrect information in allowing the plane to take off in the first place (and the airline certainly used incorrect information in trying to fly the route as well). The selection of the route has been determined not to be a factor in the crash, but it certainly may have resulted in the overcrowding that led to the delayed response from air traffic control. It also resulted in the airline’s flights on that route being suspended.

It took almost three days to find the plane. The delay is renewing calls for universal tracking of aircraft or real-time streaming of flight data that were initially raised after the loss of Malaysia Airlines flight MH370, which is still missing ten months after losing radar contact. (See our previous blog on the difficulties finding it.) Not only would this reduce the suffering of families while waiting to hear their loved ones’ fates, it would reduce resources required to find lost aircraft and, in cases where survival is possible, increase the chance of survival of those on the plane.

Hundreds Saved by Arduous Helicopter Rescue From Ferry Fire

By Kim Smiley

In a grueling rescue effort, 427 people were saved from a passenger ferry, Norman Atlantic, which caught fire December 28, 2014 off the coast of Greece.  About 150 people managed to escape the fire in lifeboats, but the remaining passengers were lifted to safety via helicopter.  Gale force winds, heavy rain and darkness all combined to make a difficult rescue operation even more daunting. Ten people died as a result of the accident with few details known about what caused the fatalities.

A Cause Map, a visual root cause analysis, can be built to analyze this incident.  The investigation is just beginning and there are still many unknowns, but an initial Cause Map can be started and then easily expanded to incorporate new information as it becomes available.  Even the exact number of people onboard has been difficult to determine because there were several stowaways discovered during the rescue operations who weren’t listed on the ship’s manifest.

What is known is that the fire began early in the morning of December 28th and 427 people were rescued off the ferry. Early reports are that the fire started on the parking deck where there were tanker trucks filled with oil.  Witness accounts indicate that the fire spread fairly quickly, leading to speculation that the fire doors failed.  As the fire progressed, the ship lost power.  Once power was gone, the lifeboats were useless because they require electricity to be lowered.  The heat from the fire drove passengers to the top deck and bridge where they were bombarded by cold, rain and thick smoke for many miserable and likely terrifying hours.  Helicopters pulled passengers to safety one by one, working through the windy night with night vision goggles.

In stark contrast to the South Korea ferry that capsized off Byungpoong in April, the captain was the last person to leave the Norman Atlantic. The rescue effort was truly impressive.  As Greek Prime Minister Antonis Samaras said, the “massive and unprecedented operation saved the lives of hundreds of passengers following the fire on the ship in the Adriatic Sea under the most difficult circumstances.”

The Italian Transport Ministry has seized the vessel pending an investigation into the fire and thorough inspection of the ship.  Whenever a disaster of this magnitude occurs, it is worth understanding exactly what happened and reviewing what could be done better in the future.  There will be many lessons to learn from this incident, both in how to prevent and fight shipboard fires and how to perform helicopter rescues at sea.

To view a high level Cause Map of this incident, click on “Download PDF” above.

Dreamliner fire: firefighter injured when battery explodes

By ThinkReliability Staff

On January 7, 2013, smoke was discovered on a recently deplaned Boeing 787 Dreamliner. The recently released National Transportation Safety Board (NTSB) investigation found that an internal short circuit within a cell of the auxiliary power unit (APU) battery spread to adjacent cells and led to a thermal runaway which released fire and smoke aboard the aircraft. A firefighter responding to the fire was injured when the battery exploded. Only 9 days later, an incident involving the main battery, which is the same model as that used for the APU, resulted in an emergency landing of another Boeing 787. As a result of these two incidents, the entire Dreamliner fleet was grounded for 3 months for the ensuing investigation and incorporation of modifications. (See our previous blog about the grounding.) Before the fleet was allowed to resume operations, certain protective modifications were required to be implemented.

The investigation found that the cause of the internal short circuit, which provided the initial heat source for the fire within the battery cell, could not be definitively determined due to severe damage in the area, but that it was potentially related to defects discovered during the manufacturing process. (Defects that could result in this type of short circuit were found on similar components.) The investigation found issues within the manufacturing process and with the oversight of subcontractors by contractors, as opposed to the manufacturers themselves.

The high temperatures resulting from the battery fire allowed it to spread to adjacent cells. Localized temperatures were found to exceed allowable limits at times of maximum current discharge, such as the APU startup, which had recently occurred. The high temperatures were not detected by the monitoring system because temperatures were not monitored at individual cells, but only on two cell bus bars. (The impact could have been minimized had the issue come to light sooner.)

The systems were not prepared to deal with a spreading fire as the design of the aircraft assumed that a short circuit internal to the cell would not propagate. The NTSB determined that the guidance provided to determine key assumptions was ineffective and that the validation of these assumptions had failed. Likely related to this assumption, the safety assessment and testing on the battery system was ineffective. The rate of occurrence of cell venting (the spreading of fire from cell to cell) was calculated by the manufacturer to be 1 in 10 million flight hours. The two occurrences that resulted in the grounding both involved cell venting and occurred while the 787 fleet had less than 52,000 flight hours.
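The gap between the predicted and observed rates can be made concrete with a little arithmetic. The Python sketch below is illustrative only: it uses the figures quoted above (1 occurrence per 10 million flight hours, 2 occurrences, and 52,000 fleet hours as an upper bound) and assumes, purely for illustration, that occurrences follow a Poisson process. That modelling assumption and all variable names are ours, not the NTSB's.

```python
import math

# Figures quoted in the text above.
PREDICTED_RATE = 1 / 10_000_000   # occurrences per flight hour (manufacturer's estimate)
OBSERVED_EVENTS = 2               # the two battery incidents
FLEET_HOURS = 52_000              # upper bound; the fleet had fewer hours than this

# Observed rate, using the upper bound on hours (so this understates the true rate).
observed_rate = OBSERVED_EVENTS / FLEET_HOURS
ratio = observed_rate / PREDICTED_RATE

# Assumption (ours, for illustration): occurrences arrive as a Poisson process.
# Expected number of occurrences in the fleet's hours if the prediction were right:
expected_events = PREDICTED_RATE * FLEET_HOURS
# Probability of seeing two or more occurrences under that prediction:
p_two_or_more = 1 - math.exp(-expected_events) * (1 + expected_events)

print(f"Observed rate: about 1 occurrence per {1 / observed_rate:,.0f} flight hours")
print(f"Observed / predicted rate: about {ratio:,.0f}x")
print(f"Expected occurrences if the prediction held: {expected_events:.4f}")
print(f"Chance of 2+ occurrences if the prediction held: {p_two_or_more:.1e}")
```

Under these assumptions the observed rate is several hundred times the predicted one, which is the kind of mismatch that suggests the underlying assumption, rather than bad luck, was at fault.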

Immediate actions that were required by the NTSB prior to a return to flight were to enclose the battery case, vent from the interior of the enclosure containing the battery to the exterior of the plane (keeping smoke out of the occupied spaces), and modify the battery to minimize the most severe effects from an internal short circuit. The NTSB also made multiple safety recommendations to the manufacturer, subcontractor and the Federal Aviation Administration (FAA).

One of these recommendations was to ensure that assumptions are validated. According to the NTSB report, “Validation of assumptions related to failure conditions that can impact safety is a critical step in the development and certification of an aircraft. The validation process must employ a level of rigor that is consistent with the potential hazard to the aircraft in case an assumption is incorrect.” This statement is true for any object that’s manufactured. Just replace the word “aircraft” with whatever is being manufactured, such as “car” or “pacemaker”. (See another disaster that resulted from unvalidated assumptions: the collapse of the I-35 Bridge.)

Click on “Download PDF” above to view a high level Cause Map of this issue.

10,000 Pound Buoy Falls on Workers

By Kim Smiley

On December 10, 2014, a buoy that weighs close to 10,000 pounds fell onto workers at an inactive ship maintenance facility in Pearl Harbor. Two workers were killed and two others sustained injuries. While an object this large is an extreme example of the dangers of dropped objects, worker injuries and deaths from falling objects of all sizes are a significant safety concern. A US census report of fatal occupational injuries states that 245 workers were killed after being struck by falling objects in 2013 alone.

The case of the dropped buoy can be built into a Cause Map, a visual root cause analysis, to better understand what happened. Understanding the details of an accident is necessary to ensure that a wide range of solutions is considered and that any solutions implemented will be effective at preventing future incidents.

The investigation into the falling buoy is still underway so some information is not yet available, but it can easily be incorporated into the Cause Map once it is known. Any causes that need more information or evidence can be noted with a question mark to show that there is still an open question.

Exactly what caused the buoy to drop hasn’t been released yet, but it is known that the safety lines attached to the buoy failed. Both of these issues need to be investigated to ensure that solutions can be implemented to prevent further tragedies.

Additionally, there are open questions about why people were working under the path of the lift. The workers were wearing hard hats, but this is obviously inadequate protection against a 10,000 pound buoy. The contractors were working to strengthen mooring lines at the time of the accident, but no one should be where they could be crushed if such a large object were dropped, as it was in this case. As stated by Jeff Romeo, the Occupational Safety and Health Administration (OSHA) Honolulu area director, “We’re still looking at the facts to try to determine the exact locations of where these employees were located. If in fact, they were working directly underneath the load, then that would be an alarming situation.”
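The question-mark convention for open issues can be sketched in a few lines of code. This is an illustrative Python sketch only, not part of the Cause Mapping method itself; the `label` helper and the evidence flag are hypothetical, and the cause descriptions are paraphrased from this post.

```python
# Each entry: (cause description, whether more evidence is still needed).
causes = [
    ("Safety lines attached to buoy failed", False),
    ("Buoy dropped during lift (cause not yet released)", True),
    ("Workers were directly under the load", True),
    ("Hard hats inadequate against a 10,000 pound buoy", False),
]

def label(text: str, needs_evidence: bool) -> str:
    """Append a question mark to causes that are still open questions."""
    return f"{text} ?" if needs_evidence else text

# Print every cause, flagging the ones that still need evidence.
for text, needs_evidence in causes:
    print(label(text, needs_evidence))

# The open questions are the items to revisit as the investigation proceeds.
open_questions = [text for text, needs_evidence in causes if needs_evidence]
```

Keeping the flag separate from the description makes it easy to clear a question mark once evidence arrives, without rewriting the cause.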

The OSHA investigation is currently underway and is expected to take four to six months. Additionally, the Navy is launching a Safety Investigation Board to review the accident with findings expected to be released by February. Once the investigation is complete, work processes will need to be reviewed to see what changes need to be made to prevent any future injuries from falling objects.

To view an initial Cause Map of this incident, click on “Download PDF” above.

Chemical Release Kills Four Workers at Texas Pesticide Plant

By ThinkReliability Staff

In the early morning hours of November 15, 2014, a release of methyl mercaptan resulted in the deaths of four employees at a plant in Texas that manufactures pesticides. The investigation into the source of the leak is still ongoing, though persistent maintenance problems had been reported in the plant, which was shut down five days prior to the incident.

Even though the investigation has not been completed, there are some lessons learned that can be applied to this facility, and other facilities that handle chemicals, immediately.

Even “safer” chemicals are dangerous when not treated properly. The chemical released – methyl mercaptan – is stored as a safer alternative to methyl isocyanate (which was the chemical released in the Bhopal disaster). Although it’s “safer” than its alternatives, it is still lethal at concentrations above 150 parts per million. The company has stated that 23,000 pounds were released – in a room where complaints were made about insufficient ventilation. The workers were unable to escape – likely because they were quickly incapacitated by the levels of methyl mercaptan and did not have the necessary equipment to get out. (Only two air masks and oxygen tanks were found in the area where the employees were.)

A fast response is necessary for employee safety. Records show that 911 was not called for an hour after the employees were trapped. (One of the victims called his wife an hour prior to indicate there was an issue and he was attempting rescue.) The emergency industrial response group, which is trained to provide response in these sorts of situations, was never called by the plant. Medical personnel could not access the employees because they were not trained in the use of protective gear. Firefighters who responded did not have enough air to travel through the entire facility and did not have enough information on the layout to know where to go. It’s unclear whether a quicker response could have saved lives.

Providing timely, accurate information is necessary for public safety. The best way to determine the impact on the public is to measure the concentration of released chemicals at the fenceline (known as fenceline monitoring). Air monitoring was not performed for more than four hours after the release. Companies are not required to provide fenceline monitoring, although an Environmental Protection Agency (EPA) rule requiring monitoring systems for refineries is under review. (This rule would not have impacted this plant as it produced pesticides.) Until that monitoring, the only information available to the public was information provided by the company (which did not disclose the amount of chemical released until days later). In Texas, companies are required to disclose the presence of chemicals, but not the amount. A reverse 911 system was used to inform residents that an odor would be present, but did not discuss the risks.

What can you do? Ensure that all chemicals at your facility are known and stored carefully. Develop a response plan that ensures that your employees can get out safely, that responders can get in safely (and are apprised of risks they may face), and that the public has the necessary information to keep them safe. Make sure these plans are trained on and posted readily. Depending on the risk of public impact from your business, involving emergency responders and the public in your drills may be desired.

To see a high level Cause Map of this incident, click on “Download PDF” above.

Chocolate Makers Warn of Possible Shortage

By Kim Smiley

Chocolate is one of the most beloved foods, but it may be becoming a little too popular.  Major chocolate makers have warned of a possible chocolate shortage looming in the near future.  According to a recent article by the Washington Post, “The world’s biggest chocolate-maker says we’re running out of chocolate”, the world consumed about 70,000 metric tons more cocoa last year than it produced.  The chocolate deficit is also predicted to get worse.

The chocolate shortage is a classic example of supply and demand in action.  The demand for cocoa is rising at the same time that the supply is dropping.  The price consumers are paying for chocolate is already increasing and is likely to get significantly higher if these trends continue.

So why is demand increasing (beyond the obvious fact that chocolate is delicious)? Part of the answer is that it is trendy to include chocolate in a wider variety of foods such as savory gourmet dishes, liquor and breakfast cereal.  Even the already questionable potato chip has been covered in chocolate to the delight of many.  The increasing popularity of dark chocolate also comes into play because dark chocolate contains significantly more cocoa than typical chocolate. (An average chocolate bar is about 10% cocoa while dark chocolate bars are usually closer to 70%.)  The sheer number of people who are eating chocolate is also growing as chocolate is more widely available worldwide, particularly in Asia where chocolate consumption is increasing rapidly.

While demand continues to grow, supply is decreasing.  Drought in West Africa, where the majority of the world’s cocoa is grown, has impacted the cocoa supply.  The plants are also being attacked by diseases; the most noteworthy is a fungus called Frosty pod, which is reducing the crop further.  The nature of chocolate trees also makes responding to difficult or changing growing conditions challenging because it takes them years to mature.  With the difficulties facing chocolate trees, many farmers are turning to other crops that are more profitable, which reduces the production of cocoa.

The end result of higher demand for chocolate will likely be further increases in the price of chocolate.  It’s also likely that chocolate makers will continue to develop candy that includes non-chocolate ingredients such as nuts, raisins or nougats to meet the demand for treats while using less actual chocolate.  Additionally, farmers are working to develop new strains of cocoa that are resistant to disease and drought and/or produce more cocoa per plant, which would increase the supply of cocoa.

A Cause Map, a visual root cause analysis, can be used to show the causes that have contributed to the chocolate deficit. To view a high level Cause Map of this example, click on “Download PDF” above.

Investigation Into the Fatal Crash of Commercial Space Vehicle is Underway

By Kim Smiley

On October 31, 2014, Virgin Galactic’s commercial space vehicle, SpaceShipTwo, tore apart over the Mojave Desert in California during its fourth rocket-powered test flight. One pilot was killed and the other seriously injured. An investigation is underway to determine exactly what caused the crash, but initial data indicates that the tail booms used to slow down the vehicle moved into the feathered position prematurely, increasing the aerodynamic force. This disaster has the potential to impact the emerging commercial space industry as regulators and potential passengers are reminded of the inherent dangers of space travel.

This issue can be analyzed by building a Cause Map, a visual method for performing a root cause analysis. An initial Cause Map can be built using the information that is currently available and then easily expanded as more data is known. The first step is to fill in an Outline with the basic background information of the incident. Additionally, the impacts to the overall goals are listed on the Outline to determine the scope of the issue. The Cause Map is then built by asking “why” questions.

Starting with the safety goal in this example: one pilot was killed and another was injured because a space vehicle was destroyed and they were onboard. (When two causes both contribute to an effect, they are both listed on the Cause Map and joined with an “and”.) SpaceShipTwo is designed to hold passengers, but this was a test flight to assess a new fuel so the pilots were the only people onboard. The space vehicle tore apart because the stress on the vehicle was greater than the strength of the vehicle. The final report on the accident will not be available for many months, but the initial findings indicate that the space vehicle experienced greater aerodynamic forces than expected.
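The “and” relationship described above can be sketched as a small data structure. This is only an illustration in Python, not the Cause Mapping template itself; the `Cause` class and the `why` function are hypothetical names, and the cause wording is paraphrased from this post.

```python
from __future__ import annotations

from dataclasses import dataclass, field

@dataclass
class Cause:
    """One box on a Cause Map. All listed causes are joined with "and"."""
    text: str
    causes: list[Cause] = field(default_factory=list)

def why(effect: Cause, depth: int = 0) -> list[str]:
    """Walk the map by asking "why" at each effect; returns indented lines."""
    lines = ["  " * depth + effect.text]
    for i, cause in enumerate(effect.causes):
        sub = why(cause, depth + 1)
        if i > 0:
            # Mark every cause after the first as joined with "and".
            sub[0] = "  " * (depth + 1) + "AND " + sub[0].lstrip()
        lines.extend(sub)
    return lines

cause_map = Cause("Pilot killed, pilot injured (safety goal impacted)", [
    Cause("Space vehicle destroyed", [
        Cause("Stress on vehicle greater than strength of vehicle", [
            Cause("Greater aerodynamic forces than expected"),
        ]),
    ]),
    Cause("Pilots were onboard", [
        Cause("Test flight to assess a new fuel"),
    ]),
])

print("\n".join(why(cause_map)))
```

Each level of indentation answers one “why” question, and new findings from the final report could be added as further nested causes without changing the rest of the map.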

The space vehicle used tail booms that were shifted into a feathered position to increase drag and reduce speed prior to landing. Video shows the co-pilot releasing the lever that unlocked the tail booms earlier than expected while the vehicle was still accelerating. It’s unclear at this time why he released the lever. The tail booms were not designed to move when unlocked and a second lever controls movement, but investigators speculate that the aerodynamic forces on the space vehicle while it was still accelerating caused them to lift up into the feathered position once they were unlocked. The vehicle disintegrated seconds after the tail booms shifted position, likely because of the aerodynamic forces in play.

After the final report is released, the Cause Map can be expanded to include the additional information. To view a high level Cause Map of this accident, click on “Download PDF” above.

Safety Concerns Raised by 5 Railroad Accidents in 11 Months

By ThinkReliability Staff

The National Transportation Safety Board investigates major railroad accidents in the United States. It was not only the severity (6 deaths and 126 injuries) but the frequency (5 accidents over 11 months) of recent accidents on a railroad that led to an “in-depth special investigation”. Part of the purpose of the special investigation was to “examine the common elements that were found in each”.

When an organization sees a recurring issue – in this case, multiple accidents requiring investigation on the same railroad – there may be value in not only investigating the incidents separately but also in a common analysis. A root cause analysis that addresses more than one incident is known as a Cumulative Cause Map, and it visually captures much of the same information as a Failure Modes and Effects Analysis, or FMEA.

The information from the individual investigations of each of these accidents can be combined into one analysis, including an outline addressing the problems and impacts to the goals from the incidents as a whole. In this case, the problems addressed include issues on the Metro-North railroad in New York and Connecticut from May 2013 to March 2014. The five incidents during that time period resulted in 4 customer deaths and 126 injuries, 2 employee deaths, and over $23.8 million in property damage.

The analysis of the individual accidents can be combined in a Cumulative Cause Map to intuitively show the cause-and-effect relationships. The customer deaths and injuries, and the property damage, resulted from train derailments and a collision. The train collision resulted from a derailment. In two of the cases, the derailment was due to track damage that had either been missed on inspection or had maintenance deferred. In the third derailment (discussed in a previous blog), the train took a curve at an excessive rate of speed due to fatigue of the engineer. Inadequate track inspections and maintenance, and deferred maintenance were highlighted as recurring safety issues to the railroad.

Both of the employee fatalities resulted from workers being struck by a train while performing track maintenance. In one case, the worker was outside the designated protected area due to an inadequate job safety briefing. In the other, a student removed the block while working unsupervised, allowing a train to travel into the protected area. The NTSB also identified inadequate safety oversight and roadway worker protection procedures as areas needing improvement. While the NTSB already released recommendations with each of the individual investigations, it plans to issue more based on the cumulative investigation addressing all five incidents. View an overview of all 5 incidents by clicking “Download PDF” above.

Antares Cargo Rocket Explodes Seconds After Launch

By Kim Smiley

On October 28, 2014 an Antares cargo rocket bound for the International Space Station (ISS) catastrophically exploded seconds after launch.  The $200 million rocket was planned to be one of eight supply missions to the ISS that Orbital Sciences has a $1.9 billion contract to provide.  The investigation is still underway, but initial findings indicate that there may have been a problem with the engines, which were initially built in the 1960s and early 1970s by the Soviet space program.

Whenever NASA launches a rocket, safety personnel observe the launch and can command the rocket to self-destruct if it appears to be malfunctioning, in order to minimize potential injuries and property damage. Reports by NASA have indicated that this flight-termination system was engaged in this case because the rocket malfunctioned shortly after liftoff.

Video of the launch and the subsequent explosion shows the plume from one engine changing shape a second before the massive explosion.  The change in the plume has led to speculation that a turbopump failed shortly after liftoff and suggests that the engines were the source of the malfunction.  Investigators are currently reviewing the video of the launch, analyzing telemetry readings from the rocket, and studying the debris to learn as many details as possible about this failure.

The engines in question are NK-33 rocket engines that were initially built (not just designed, but actually manufactured) more than 4 decades ago. So how did engines from the Apollo era end up on a rocket decades later in 2014?  The one-word answer is money.

These engines were originally designed to support the Soviet space program which was disbanded in 1974.  For years, these engines were warehoused with no real purpose.  In 1990, these engines were sold to a company called Aerojet, reportedly for the bargain price of a cool million each.  The engines were refurbished and renamed Aerojet AJ-26s.  The cost of using these older engines was significantly less than developing a brand new rocket design.  In addition to being expensive, a new rocket design requires a significant time investment.  There are also limited alternatives available, partly due to NASA’s shrinking budget.

Orbital Sciences has announced that they will source a different engine and no longer use the AJ-26s, but it’s worth noting that these rockets have been used successfully in recent years. They have launched Cygnus supply spacecraft three times without incident.

To view a high level Cause Map, a visual root cause analysis, of this incident, click on “Download PDF” above.