Category Archives: Root Cause Analysis – Incident Investigation

Bluff Collapse Releases Coal Ash

By ThinkReliability Staff

On October 31, 2011, a bluff collapsed at a power plant on the shores of Lake Michigan.  The resulting mudslide took trailers, storage units, at least one truck and an unknown amount of coal ash into the lake, which provides drinking water for more than 40 million people.  Cleanup is ongoing, but the overall impact to the environment has not yet been determined.  Fortunately, no personnel were in the objects that ended up in the lake, so there were no injuries.

Although the safety goal was not impacted by this incident, there was the potential for personnel injury.  Additionally, the environmental, customer service, property and labor goals were impacted by the pollution of the lake, loss of property and necessary cleanup.  The causes for these impacts to the goals can be examined in a Cause Map, or visual root cause analysis.

The mudslide that took the objects and coal ash into the lake was caused by insufficient stability of a bluff overlooking the lake.  The bluff’s instability was caused by weak ground material, the presence of water and a lack of vegetation.  The vegetation had been removed for construction.  The ground in the area had been filled with coal ash – a practice allowed in previous decades.  Coal ash is less stable than soil, especially when it is exposed to water.  In this case, aerial images suggest that the water seeped into the area from a high water table or from an unlined retention pond used to store storm water.  Although a construction project was ongoing, an environmental impact study – which may have unearthed concerns about the stability of the area – was not considered necessary.

Steps are being taken to clean up the lake to the extent possible.  However, concerns about coal ash in this area and others are prompting a review by Congress to determine how coal ash can be safely dealt with.  Many say this incident suggests that stronger controls are needed.

To view the Outline and Cause Map, please click “Download PDF” above.  Or click here to read more.

BlackBerry’s Widespread Failure

By ThinkReliability Staff

BlackBerry faced yet another setback last month when service went down worldwide for multiple days.  Research in Motion (RIM), the company behind BlackBerry, was already facing stiff competition from other smartphone vendors.  It apologized profusely for the outage and vowed to woo back its customers.  What caused the extensive and possibly business-ending service outage?

A root cause analysis can help identify what occurred.  The first step is to outline the incident.  The service outage originated in Europe, then spread to four other continents over a 72-hour period.  Customers were furious with the service outage and the slow PR response from the company.  This outage impacted two major RIM goals – to generate revenue for shareholders and to maintain customer satisfaction.  Working backwards from these goals, the Cause Map shows what events led to the catastrophic failure and where further investigation is needed.

The company faces a potential loss of revenue if it loses customers.  The company may not have had to worry about the impact of such service outages in the past…except that now there are viable alternatives such as Apple and Android devices.  Continuing to work backwards, customers were upset because of a service outage.  At this point, it helps to examine the BlackBerry network architecture.

BlackBerry’s architecture is fundamentally different from that of Apple and Android.  All data is filtered through the company’s internal service network before being passed on to carrier networks such as Sprint and Verizon.  Apple and Android don’t provide processing in the middle.  When BlackBerry’s core switch failed in an English data center, a backup switch was supposed to take over.  The backup had been tested successfully.  Unfortunately, it didn’t work when needed, leading to a buildup of messages waiting to be processed.  That mountain of messages led to backlogs in other data centers worldwide.  When the switch failed, it also corrupted the database software managing all the messages within the network.
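The way a backlog builds during an outage, and why clearing it can take far longer than the outage itself, can be sketched with simple arithmetic.  The short Python sketch below is only an illustration: the function name and all of the rates are hypothetical assumptions, not figures from RIM.

```python
def backlog_over_time(arrival_rate, service_rate, outage_minutes, total_minutes):
    """Return the number of queued messages at the end of each minute.

    All rates are hypothetical, in messages per minute. No messages are
    processed during the outage; afterwards the network drains the queue
    at service_rate while new messages keep arriving.
    """
    backlog = 0
    history = []
    for minute in range(total_minutes):
        backlog += arrival_rate              # new messages keep arriving
        if minute >= outage_minutes:         # processing resumes after the outage
            backlog = max(0, backlog - service_rate)
        history.append(backlog)
    return history


# Assumed numbers: 100 messages/minute arrive, 120/minute can be processed,
# and processing stops for the first 60 minutes.
history = backlog_over_time(arrival_rate=100, service_rate=120,
                            outage_minutes=60, total_minutes=400)
peak = max(history)
cleared_at = history.index(0)  # first minute at which the queue is empty
print(f"Peak backlog: {peak} messages; queue empty at minute {cleared_at}")
```

With these assumed numbers, a one-hour stoppage takes roughly five more hours to clear, because the network has only a small margin of spare capacity above normal traffic.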

It turns out that this network architecture is both a liability and at the heart of the company’s business success.  By centrally processing all data messages – both compressing and encrypting them – RIM provides additional security and reduces the processing required at the user device, meaning lower energy use and a longer battery life.  Despite these strengths, RIM would be wise to find out why its network crashed.  As users store more data within the network – as with cloud computing – outages could cripple the system for even longer.

Severe Flooding in Thailand

By ThinkReliability Staff

Thailand is experiencing an unusually heavy monsoon season, but it’s management of the rains that is being blamed for the most severe flooding to occur in the area in decades.  Heavy rains resulting from the monsoon season and high tides are creating serious difficulties for officials in the area, who are having to make hard choices about where to divert water and are essentially “sacrificing” certain towns because there’s nowhere else for the water to go.  One of these decisions ended in a gunfight.  Tensions are high, and people are busying themselves attempting to protect their homes and towns with hundreds of thousands of sandbags.

We can examine the issues contributing to the risk to people and property in a Cause Map, or visual root cause analysis.  First, we define the problem within a problem outline.  In the bottom portion of the outline, we capture the impacts to the country’s goals.  More than 200 people have been reported killed as a result of the floods, an impact to the safety goal.  The floods themselves are an impact to the environmental goal.  If citizens can be considered customers, the decision to “sacrifice” some towns to save others can be considered an impact to the customer service goal.  The property goal is impacted by the destruction of towns and the labor goal is impacted by the flood preparations and rescue missions required to protect the population.

Beginning with these goals and asking “Why” questions, we can diagram the cause-and-effect relationships that contribute to the impacts discussed above.  The decision to “sacrifice” some towns to save others is caused by flooding due to heavy monsoon rains and high tides, and the fact that water had to be directed towards some towns, as there is nowhere else for the water to go.  Towns have been built in catchments and areas designed to be reservoirs. Natural waterways have been dammed and diverted.  Dams are full because insufficient water was discharged earlier in the season due to a miscalculation of water levels. Canals have been filled in or are blocked with garbage.  Insufficient control of development in the area has led to insufficient control of water flow, and lack of areas for water to gather – without endangering towns.

Thai officials are assisting with sandbags and building new flood barriers and drainage canals.  They’re admitting that this issue needs to be addressed.  According to the director of the National Disaster Warning Center, “If we don’t have integrated water management, we will face this problem again next year.”  Hopefully this is the first step in making changes that ensure loss of life and property is minimized during the annual rainy season.

Driver Death at Indy 300

By Kim Smiley

The racing world was saddened by the death of Dan Wheldon during the Indy 300 race in Las Vegas on October 16, 2011.  However, many race-car drivers were not shocked by the 15-car pileup that resulted in Wheldon’s death.  Specifically, these drivers note that the track – which was designed for NASCAR vehicles that travel at much slower speeds – was designed with high banks that allowed cars to accelerate heavily, reaching speeds of up to 225 miles per hour.  This also contributed to the cars remaining very close together, leaving little time or space for drivers to maneuver.  Although the track was shorter than other tracks (a 1.5-mile oval compared to the Indy 500’s 2.5-mile oval), it allowed four cars to race side by side, as was happening at the time of the crash.

Drivers say that the design of the track, the speed of the cars, and the unusually high number of competitors (34, when a full field is generally 26-28 cars) contributed to the crash.  Also, the open-wheel design of Indy cars means that the driver has less control when contacting other cars.  In fact, many drivers said they expected at least one spectacular crash to result, given the circumstances.  Although race cars do have special features that protect drivers in a crash, the cars used in the Indy races have open cockpits, providing less protection.  It also appears that the protective roll hoop was missing on Wheldon’s car, though more information on this has not been released.

Other drivers were also injured in the 15-car pileup, though their injuries were not critical and all of them have been released from the hospital.  Wheldon was said to have suffered “unsurvivable head injuries”.  After Wheldon’s death, the race – which had a $5 million prize offered in hopes of boosting ratings – was stopped.  This is the first fatality to occur in Indy racing since 2006.  It is hoped that new safety measures – which Wheldon had been involved with – will continue to make Indy racing safer.  However, there are some drivers who believe that regardless of the safety features in the cars, Indy racing should be done on street courses, not ovals.

Toxic Fumes on Aircraft

By ThinkReliability Staff

A lawsuit against an aircraft manufacturer, claiming that faulty design allowed toxic fumes to enter the cabin, was settled in early October 2011.  The settlement is the first of its kind in the U.S., but may not be the last.  A documentary entitled “Angel Without Wings” is attempting to bring more attention to the issue, which air safety advocates claim has affected the health and job-readiness of some airline crewmembers.

The aircraft manufacturing and operating industries maintain that the air in cabins is safe, that breaches are rare, and that the small amount of toxicity that may get into the cabin is not enough to affect human health.  Even so, the issue is expected to gain more attention, as some industry officials maintain that approximately one flight a day involves leakage of toxic fumes into the passenger cabin of an aircraft.  Although there is debate about the amount of fumes required to cause various health effects, allowing toxic fumes of any amount into a passenger cabin is an impact to both the safety and environmental goals.  Additionally, the lawsuit – and the potential of more to come – against the manufacturer is an impact to the customer service goal.  Although the suits have been brought by crew members, there is also a concern for the safety of passengers with respect to exposure to the contaminated air.

The toxic smoke and fumes enter the plane’s air conditioning system when engine oil leaks into the bleed-air system, which directs air bled from engine compressors into the cabin.  Because there is currently no effective way for crew members to determine that the air is contaminated – no detectors and insufficient training for these crew members to recognize the source and possible outcome of the fumes – the air continues to be fed to the cabin.  The creators of the documentary, and other air safety advocates, are requesting that better filters be installed to prevent the toxic fumes from entering the cabin, that less toxic oil be used so that the fumes from any leaking oil are less damaging to human health, that detectors be installed in air ducts to notify crew of potential toxicity in the air supply, and that crew members receive better education and training to help them identify the potential for exposure to toxic fumes.  However, the manufacturer’s newest design makes all this unnecessary by providing cabin air from electric compressors.  Given the length of time that aircraft remain in service, it will be decades before bleed-air systems are phased out.  In the meantime, advocates hope that other corrective actions will be implemented to decrease the potential of exposure to passengers and crew.

1982 Tylenol Tampering

By ThinkReliability Staff

In 1982, 31 million bottles of Tylenol were recalled after seven deaths from cyanide poisoning.  An investigation found that higher-than-lethal doses of cyanide had been inserted into bottles of Extra-Strength Tylenol capsules in retail stores in the Chicago area.  Tylenol’s manufacturer, Johnson & Johnson, immediately took action and recalled all Tylenol products.

Although the reason for the poisoning is unclear – the suspect has still not been caught, though interest in the case has recently been revived – what is clear is that the ability to tamper with a product in such a malicious way without the tampering being evident contributed to the deaths.  As a result of this issue, capsules (which are much easier to insert foreign objects into than solid pills) decreased in use, and tamper-evident packaging came into use for many products.

Although the manufacturing and packaging processes were not implicated in the poisonings (the adulterated packages were from different plants, but all came from stores within the Chicago area), there was concern that Tylenol would never again be popularly accepted.  However, Johnson & Johnson acted quickly and effectively, immediately recalling all products and running public relations campaigns urging people not to use products until the issue had been resolved.  Its response has been considered a playbook on how to conduct an effective recall and is believed to have directly contributed to the resurgence in the popularity of Tylenol shortly after the issue.  (See “How Effective Public Relations Saved Johnson and Johnson”.)

Even though this case hasn’t been resolved, and the killer still remains unknown, it is possible to examine the issue with a Cause Map.  Because this case has stretched over many years, a timeline can help to sort through information.  The outline contains the many impacts to the goals related to the issue, and the Cause Map sorts through causes – both “good” and “bad” – related to the issue.  Solutions implemented to decrease the ability to tamper with consumer products are also noted.

Power Outage Stretches from Arizona to California

By ThinkReliability Staff

On September 8, 2011, work on a faulty capacitor in Arizona began a series of events that resulted in the worst power outage in the Southwest in 15 years.  Although there were no injuries reported as a result of the power outage, there was a high potential for injuries and/or deaths, as hospitals shut down and at least one airport lost runway lighting.  Raw sewage leaked onto beaches and millions found themselves without power.  The economic losses from this incident are reported to be as high as $118 million.  The Federal Energy Regulatory Commission (FERC) will be conducting an investigation to determine how simple capacitor work resulted in an incident with such extreme effects.

The issues related to this power outage are complicated, and can be more clearly understood in a visual format, such as a Cause Map.  We can examine the cause-and-effect relationships that resulted in the impacted goals discussed above.  The potential for injury was caused by a loss of electrical power to hospitals and airports.  The loss of power was caused by a grid crash, resulting from insufficient power and high demand (at least partially due to a heat wave).  Power stations that normally provide electricity were automatically shut down when the current reversed direction (it normally runs from Arizona to California).  The reversal resulted from the loss of a transmission line, which in turn resulted from the capacitor work.  Although “operator error” has been mentioned as a potential cause, one operator’s error should not be able to cause such an extreme power outage.  The system should be designed to prevent this, and the investigation will hopefully address issues in the system that contributed to the extent of the outage.

In addition to losing power stations, insufficient base-load capacity in the area (long a source of concern) meant that standby plants could not be brought up fast enough to prevent the crashing of the grid.  Also, renewable wind and solar energy sources weren’t much help due to less than ideal weather conditions for production (cloudy with low wind).

The FERC’s investigation will determine causes that contributed to this power outage and will provide recommendations to limit these types of incidents in the future.  Specifically, they will determine what allowed a simple capacitor issue to result in an extensive power outage and will also consider the grid stability in the area.  However, in the meantime, some individual businesses discovered a boon in having their own generators.  Additionally, U.S. Navy ships in port in San Diego used their generators to supply power to the grid.  While these actions certainly helped lessen the effect of the outage (and brought in a lot of business to locations that did have generators), broader improvements are needed to prevent these types of issues in the future.

Crash Causes Deaths at Air Race

By ThinkReliability Staff

Sad news is nothing new for the National Championship Air Races – there have been 29 deaths associated with the races in their 47-year history.  However, the ten deaths and dozens of injuries (some extremely serious) resulting from a plane crash and explosion on September 16, 2011 have brought attention to the safety of air racing.

Although full details of the causes of the crash and explosion have not been determined by the National Transportation Safety Board, we can begin a comprehensive root cause analysis with the information available so far by building a Cause Map.  First, we capture the basic details (such as the date and time of the incident) in the Outline.  Then we record the impacts to the goals.  In this case, there was a significant impact to the safety goal, considering the high number of deaths and significant injuries.  The customer service goal can be considered to be impacted because the spectators at the show were not sufficiently protected from injury.  (The FAA grants approval to air shows based on safety of the spectators from a crash.)   The remaining days of the race were cancelled – an impact to the schedule goal.  The plane was destroyed, an impact to the property goal, and the resulting NTSB investigation will cause an impact to the labor goal because of the resources required to complete the investigation.

Once we have captured these impacts to the goals, we can use them to begin the analysis.  The injuries and deaths occurred from the plane crashing into the VIP section and the subsequent explosion which resulted in shrapnel injuries.  The pilot lost control of the plane and did not have sufficient time to recover (as evidenced by there being no indication that he made a distress call).  It’s unclear what exactly caused the loss of control; however, the plane had been modified to increase its speed, which would have impacted its stability in flight.  Additionally, photos taken just before the crash appear to indicate that a portion of the tail fell off, but the reason why has not yet been discovered.  What happened to the tail section, and how the modifications affected control of the plane, are questions the NTSB will examine in their report.

Because of the goal of an air race – traveling around a course at low altitudes and high speeds – it’s no surprise that the pilot did not have sufficient time to recover control before crashing.  Given that these conditions are expected during air races – and appear to be an acceptable risk to pilots, who continue to race even with the high number of crashes and fatalities that result – it appears that there needs to be more consideration of how spectators are protected from crashes and the shrapnel that can result from the destruction of a plane.

When more evidence is gathered, more information can be added to the Cause Map.  Once that occurs, the NTSB can examine the causes contributing to the deaths at the air race, and make recommendations on how future deaths can be avoided.

Explosion at Nuclear Waste Site Kills One

By Kim Smiley

An explosion at a nuclear waste processing site in France killed one worker and injured four others on September 12, 2011.  The investigation is ongoing, but it is still possible to create a Cause Map, a visual root cause analysis, that contains all known information on the incident.  As more information becomes available, the Cause Map can easily be expanded to incorporate all relevant details.  One advantage of Cause Mapping is that it can be used to document all information at each step of the investigation process in an intuitive way, in a single location.

When the word “nuclear” is involved, emotions and fears can run high, especially following the recent events at the Fukushima nuclear plant in Japan.  This incident is a good example of where providing clear information can help calm the situation.  The explosion in France happened when a furnace used to burn nuclear waste failed.  The cause of the explosion itself isn’t known at this time, but there is some relevant background information available that helps explain the potential ramifications of the explosion.

The key to understanding the impact of this incident is the type of nuclear waste that was being burned.  According to statements by the French government, the furnace involved was only used to burn waste with very low-level contamination.  It burned things such as gloves and overalls as well as metal waste like tools and pumps.  No objects that were part of a reactor were treated in the furnace.  There are also no reactors at the site that could potentially be damaged by an explosion.

No radiation leakage was detected, and the type of material being processed meant there was no potential for a large release of radiation.  It was a horrible accident that resulted in a death and severe injuries, but there was no risk to public health.

How France views nuclear power is also a bit of background worth knowing.  France is the world’s most nuclear-power-dependent country.  Fifty-eight reactors generate nearly three-fourths of France’s power.  France is also a major exporter of nuclear technology.  The public relations issues associated with a nuclear disaster in France would be very complicated.

Once the investigation into this incident is complete, solutions can be determined and implemented to help prevent any future occurrences.

Attempted Bombing of Flight 253

By ThinkReliability Staff

Despite constantly increasing airport security, a man suspected of terrorism was able to board a flight from Amsterdam to Detroit with ~80 grams of explosive and a liquid detonator. However, the device did not detonate, likely saving the plane.

Had the explosive detonated, it might have caused the loss of the plane, resulting in the deaths of all on board. Even though the loss of lives and the plane did not occur, the potential for it to happen is an impact to the safety goals.

The suspect was able to board the plane because despite warnings from his father, there was insufficient information to add him to the no-fly list (see process map) and his visa was not revoked.

Officials in the U.S. were unaware a visa had been issued by the U.S. embassy in London. Additionally, while the information from the suspect’s father was entered into TIDE (a terrorist intelligence database), there was no follow-up on the information. It’s unclear if there was no follow-up required, or if the follow-up was just not performed.

In an admitted failure of safety procedures, the explosives were not detected by airport security. The information about the suspect was considered not specific enough for the suspect to be put on the “selectee list”, which would have led to additional screening. The suspect was not passed through a body scanner, which may have detected the explosives, because body scanners are not used on passengers traveling to the U.S. due to privacy concerns. The ingredients were hidden in the suspect’s undergarments and so were not detected by security.

Want to learn more? Read a more detailed root cause analysis of the attempted bombing.