All posts by ThinkReliability Staff

ThinkReliability are specialists in applying root cause analysis to solve all types of problems. We investigate errors, defects, failures, losses, outages and incidents in a wide variety of industries. Our Cause Mapping method of root cause analysis captures the complete investigation, along with the best solutions, in an easy-to-understand format. ThinkReliability provides investigation services and root cause analysis training to clients around the world and is considered the trusted authority on the subject.

Root Cause Analysis - Incident Investigation

Bluff Collapse Releases Coal Ash

November 9, 2011 ThinkReliability Staff

By ThinkReliability Staff

On October 31, 2011, a bluff collapsed at a power plant on the shores of Lake Michigan. The resulting mudslide took trailers, storage units, at least one truck and an unknown amount of coal ash into the lake, which provides drinking water for more than 40 million people. Cleanup is ongoing, but the overall impact to the environment has not yet been determined. Fortunately, no one was inside the trailers, storage units or truck that ended up in the lake, so there were no injuries.

Although the safety goal was not impacted by this incident, there was the potential for personnel injury. Additionally, the environmental, customer service, property and labor goals were impacted by the pollution of the lake, loss of property and necessary cleanup. The causes for these impacts to the goals can be examined in a Cause Map, or visual root cause analysis.

The mudslide that took the objects and coal ash into the lake was caused by insufficient stability of a bluff overlooking the lake. The bluff’s instability was caused by unstable fill material, the presence of water and a lack of vegetation. The vegetation had been removed for construction. The ground in the area had been filled with coal ash – a practice allowed in previous decades. Coal ash is less stable than soil, especially when it is exposed to water. In this case, aerial images suggest that the water seeped into the area from a high water table or from an unlined retention pond used to store storm water. Although a construction project was ongoing, an environmental impact study – which may have unearthed concerns about the stability of the area – was not considered necessary.

Steps are being taken to clean up the lake to the extent possible. However, concerns about coal ash in this area and others are prompting a review by Congress to determine how coal ash can be safely dealt with. Many say this incident suggests that stronger controls are needed.

Root Cause Analysis - Incident Investigation

BlackBerry’s Widespread Failure

November 3, 2011 ThinkReliability Staff

BlackBerry faced yet another setback last month when service went down worldwide for multiple days. Research in Motion (RIM), the company behind BlackBerry and already facing stiff competition from other smartphone vendors, apologized profusely for the outage and vowed to woo back its customers. What caused the extensive and possibly business-ending service outage?

A root cause analysis can help identify what occurred. The first step is to outline the incident. The service outage originated in Europe, then spread to four other continents over a 72-hour period. Customers were furious with the service outage and the slow PR response from the company. This outage impacted two major RIM goals – to generate revenue for shareholders and maintain customer satisfaction. Working backwards from these goals, the Cause Map shows what events led to the catastrophic failure and where further investigation is needed.

The company faces a potential loss of revenue if it loses customers. The company may not have had to worry about the impact of such service outages in the past, but customers now have viable alternatives such as Apple and Android devices. Continuing to work backwards, customers were upset because of a service outage. At this point, it helps to examine the BlackBerry network architecture.

BlackBerry’s architecture is fundamentally different from that of Apple and Android. All data is filtered through the company’s internal service network before being passed on to carrier networks such as Sprint and Verizon. Apple and Android don’t provide processing in the middle. When BlackBerry’s core switch failed in an English data center, a backup switch was supposed to take over. The backup had been tested successfully. Unfortunately, it didn’t work when needed, leading to a buildup of messages waiting to be processed. That mountain of messages led to backlogs in other data centers worldwide. When the switch failed, it also corrupted the database software managing all the messages within the network.
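To see why a backlog like this can take so long to clear, a rough queueing sketch helps. The Python below is only an illustration, not RIM's actual system: the function names and all of the rates are hypothetical, chosen to show how messages pile up while nothing is processed and how slowly the pile drains once service resumes.

```python
def simulate_backlog(arrival_rate, service_rate, outage_hours, total_hours):
    """Return the number of queued messages at the end of each hour.

    Hypothetical model: messages arrive at a constant rate. For the first
    `outage_hours` nothing is processed; after that, up to `service_rate`
    messages are processed per hour.
    """
    backlog = 0
    history = []
    for hour in range(total_hours):
        backlog += arrival_rate
        if hour >= outage_hours:
            backlog = max(0, backlog - service_rate)
        history.append(backlog)
    return history


def hours_to_clear(history, outage_hours):
    """Hours after service resumes until the queue first empties, or None."""
    for hour, backlog in enumerate(history):
        if hour >= outage_hours and backlog == 0:
            return hour - outage_hours + 1
    return None


if __name__ == "__main__":
    # Illustrative numbers only: 100 messages/hour arriving,
    # 150 messages/hour of capacity, and a 3-hour outage.
    history = simulate_backlog(100, 150, outage_hours=3, total_hours=12)
    print(history)
    print(hours_to_clear(history, outage_hours=3))
```

With these made-up numbers, a 3-hour outage takes 6 hours to work off, because new messages keep arriving while the old ones are being cleared. The closer normal traffic runs to capacity, the longer the recovery.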

It turns out that this network architecture is both a liability and at the heart of the company’s business success. By centrally processing all data messages – both compressing and encrypting them – RIM provides additional security and reduces the processing required at the user device, meaning lower energy use and a longer battery life. Despite these strengths, RIM would be wise to find out why its network crashed. As users store more data within the network – as with cloud computing – outages could cripple the system for even longer.

Root Cause Analysis - Incident Investigation

Severe Flooding in Thailand

October 26, 2011 ThinkReliability Staff

Thailand is experiencing an unusually heavy monsoon season, but it is the management of the rains that is being blamed for the most severe flooding to occur in the area in decades. Heavy rains resulting from the monsoon season and high tides are creating serious difficulties for officials in the area, who are having to make hard choices about where to divert water and are essentially “sacrificing” certain towns because there’s nowhere else for the water to go. One of these decisions ended in a gunfight. Tensions are high, and people are busying themselves attempting to protect their homes and towns with hundreds of thousands of sandbags.

We can examine the issues contributing to the risk to people and property in a Cause Map, or visual root cause analysis. First, we define the problem within a problem outline. In the bottom portion of the outline, we capture the impacts to the country’s goals. More than 200 people have been reported killed as a result of the floods, an impact to the safety goal. The floods themselves are an impact to the environmental goal. If citizens can be considered customers, the decision to “sacrifice” some towns to save others can be considered an impact to the customer service goal. The property goal is impacted by the destruction of towns and the labor goal is impacted by the flood preparations and rescue missions required to protect the population.

Beginning with these goals and asking “Why” questions, we can diagram the cause-and-effect relationships that contribute to the impacts discussed above. The decision to “sacrifice” some towns to save others is caused by flooding due to heavy monsoon rains and high tides, and the fact that water had to be directed towards some towns, as there is nowhere else for the water to go. Towns have been built in catchments and areas designed to be reservoirs. Natural waterways have been dammed and diverted. Dams are full because insufficient water was discharged earlier in the season due to a miscalculation of water levels. Canals have been filled in or are blocked with garbage. Insufficient control of development in the area has led to insufficient control of water flow, and lack of areas for water to gather – without endangering towns.
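The cause-and-effect relationships described above can also be sketched as a simple data structure. The Python below is a minimal illustration, not the Cause Mapping method itself: the dictionary layout and the function names `ask_why` and `leaf_causes` are hypothetical, and the grouping of causes is one reading of the paragraph above. Each effect maps to the causes that answer the question “Why?”.

```python
# Each key is an effect; its list holds the causes that answer "Why?".
# The grouping below is one interpretation of the text, not an official map.
CAUSE_MAP = {
    "some towns sacrificed to save others": [
        "flooding",
        "nowhere else for the water to go",
    ],
    "flooding": ["heavy monsoon rains", "high tides"],
    "nowhere else for the water to go": [
        "towns built in catchments and reservoir areas",
        "natural waterways dammed and diverted",
        "dams full",
        "canals filled in or blocked with garbage",
    ],
    "dams full": ["insufficient water discharged earlier in the season"],
    "insufficient water discharged earlier in the season": [
        "miscalculation of water levels",
    ],
    "towns built in catchments and reservoir areas": [
        "insufficient control of development",
    ],
}


def ask_why(cause_map, effect, depth=0):
    """Return indented lines tracing an effect back through its causes.

    Assumes the map contains no cycles.
    """
    lines = ["  " * depth + effect]
    for cause in cause_map.get(effect, []):
        lines.extend(ask_why(cause_map, cause, depth + 1))
    return lines


def leaf_causes(cause_map, effect):
    """Return the causes for which no further "Why?" has been recorded yet."""
    causes = cause_map.get(effect, [])
    if not causes:
        return {effect}
    leaves = set()
    for cause in causes:
        leaves |= leaf_causes(cause_map, cause)
    return leaves


if __name__ == "__main__":
    print("\n".join(ask_why(CAUSE_MAP, "some towns sacrificed to save others")))
```

The leaf causes are simply the points where the analysis stops for now. Any of them can be expanded with another “Why?” as more information becomes available.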

Thai officials are assisting with sandbags and building new flood barriers and drainage canals. They’re admitting that this issue needs to be addressed. According to the director of the National Disaster Warning Center, “If we don’t have integrated water management, we will face this problem again next year.” Hopefully this is the first step in making changes that ensure loss of life and property is minimized during the annual rainy season.

Root Cause Analysis - Incident Investigation

Toxic Fumes on Aircraft

October 14, 2011 ThinkReliability Staff

In early October 2011, an aircraft manufacturer settled a claim that faulty design allowed toxic fumes to enter the cabin. The settlement is the first of its kind in the U.S., but may not be the last. A documentary entitled “Angel Without Wings” is attempting to bring more attention to the issue, which air safety advocates claim has affected the health and job-readiness of some airline crewmembers.

The aircraft manufacturing and operating industries maintain that the air in cabins is safe, that breaches are rare, and that the small amount of toxicity that may get into the cabin is not enough to affect human health. Even so, the issue is expected to gain more attention, as some industry officials maintain that approximately one flight a day involves leakage of toxic fumes into the passenger cabin of an aircraft. Although there is debate about the amount of fumes required to cause various health effects, allowing toxic fumes of any amount into a passenger cabin is an impact to both the safety and environmental goals. Additionally, the lawsuit against the manufacturer – and the potential for more to come – is an impact to the customer service goal. Although the suits have been brought by crew members, there is also a concern for the safety of passengers with respect to exposure to the contaminated air.

The toxic smoke and fumes enter the plane’s air conditioning system when fumes from leaking engine oil get into the bleed-air system, which directs air bled from engine compressors into the cabin. Because there is currently no effective way for crew members to determine that the air is contaminated – no detectors and insufficient training for these crew members to recognize the source and possible outcome of the fumes – the air continues to be fed to the cabin. The creators of the documentary, and other air safety advocates, are requesting that better filters be installed to prevent the toxic fumes from entering the cabin, that less toxic oil be used so that the fumes from any leaking oil are less damaging to human health, that detectors be installed in air ducts to notify crew of potential toxicity in the air supply, and that better education and training be provided to help crew members identify the potential for exposure to toxic fumes. However, the manufacturer’s newest aircraft design makes all this unnecessary by providing cabin air from electric compressors. Given the length of time that aircraft remain in service, it will be decades before the bleed-air system is phased out. In the meantime, advocates hope that other corrective actions will be implemented to decrease the potential of exposure to passengers and crew.

Root Cause Analysis - Incident Investigation

1982 Tylenol Tampering

October 5, 2011 ThinkReliability Staff

In 1982, 31 million bottles of Tylenol were recalled after seven deaths from cyanide poisoning. After an investigation, higher than lethal doses of cyanide were found to have been inserted into bottles of Extra-Strength Tylenol capsules in retail stores in the Chicago area. Tylenol’s manufacturer, Johnson & Johnson, immediately took action and recalled all Tylenol products.

Although the reason for the poisoning is unclear – the suspect has still not been caught, though interest in the case has recently been revived – what is clear is that the ability to tamper with a product in such a malicious way without the tampering being evident contributed to the deaths. As a result of this issue, capsules (which are much easier to insert foreign objects into than solid pills) decreased in use, and tamper-evident packaging came into use for many products.

Although the manufacturing and packaging processes were not implicated in the poisonings (the adulterated packages were from different plants, but all came from stores within the Chicago area), there was concern that Tylenol would never again be popularly accepted. However, Johnson & Johnson’s quick and effective action, including the immediate recall of all products and public relations campaigns urging people not to use products until the issue had been resolved, has come to be considered a playbook for conducting an effective recall. It is believed to have directly contributed to the resurgence in the popularity of Tylenol shortly after the issue. (See “How Effective Public Relations Saved Johnson and Johnson”.)

Even though this case hasn’t been resolved, and the killer still remains unknown, it is possible to examine the issue with a Cause Map. Because this case has stretched over many years, a timeline can help to sort through information. The outline contains the many impacts to the goals related to the issue, and the Cause Map sorts through causes – both “good” and “bad” – related to the issue. Solutions implemented to decrease the ability to tamper with consumer products are also noted.

Root Cause Analysis - Incident Investigation

Power Outage Stretches from Arizona to California

September 29, 2011 ThinkReliability Staff

On September 8, 2011, work on a faulty capacitor in Arizona began a series of events that resulted in the worst power outage in the Southwest in 15 years. Although there were no injuries reported as a result of the power outage, there was a high potential for injuries and/or deaths, as hospitals shut down and at least one airport lost runway lighting. Raw sewage leaked onto beaches and millions found themselves without power. The economic losses from this incident are reported to be as high as $118 million. The Federal Energy Regulatory Commission (FERC) will be conducting an investigation to determine how simple capacitor work resulted in an incident with such extreme effects.

The issues related to this power outage are complicated, and can be more clearly understood in a visual format, such as a Cause Map. We can examine the cause-and-effect relationships that resulted in the impacted goals discussed above. The potential for injury was caused by a loss of electrical power to hospitals and airports. The loss of power was caused by a grid crash, resulting from insufficient power and high demand (at least partially due to a heat wave). Power stations that normally provide electricity were automatically shut down when the loss of a transmission line, caused by the capacitor work, reversed the flow of current (which normally runs from Arizona to California). Although “operator error” has been mentioned as a potential cause, it’s undesirable that one operator’s error could cause such an extreme power outage. The system should be designed to prevent this, and the investigation will hopefully address issues in the system that contributed to the extent of the outage.

In addition to losing power stations, insufficient base-load capacity in the area (long a source of concern) meant that standby plants could not be brought up fast enough to prevent the crashing of the grid. Also, renewable wind and solar energy sources weren’t much help due to less than ideal weather conditions for production (cloudy with low wind).

The FERC’s investigation will determine causes that contributed to this power outage and will provide recommendations to limit these types of incidents in the future. Specifically, they will determine what allowed a simple capacitor issue to result in an extensive power outage and will also consider the grid stability in the area. However, in the meantime, some individual businesses discovered a boon in having their own generators. Additionally, U.S. Navy ships in port in San Diego used their generators to supply power to the grid. While these actions certainly helped lessen the effect of the outage (and brought in a lot of business to locations that did have generators), broader improvements are needed to prevent these types of issues in the future.

Root Cause Analysis - Incident Investigation

Crash Causes Deaths at Air Race

September 21, 2011 ThinkReliability Staff

Sad news is nothing new for the National Championship Air Races – there have been 29 deaths associated with the races in their 47-year history. However, the ten deaths and dozens of injuries (some extremely serious) resulting from a plane crash and explosion on September 16, 2011 have brought attention to the safety of air racing.

Although full details of the causes of the crash and explosion have not been determined by the National Transportation Safety Board, we can begin a comprehensive root cause analysis with the information available so far by building a Cause Map. First, we capture the basic details (such as the date and time of the incident) in the Outline. Then we record the impacts to the goals. In this case, there was a significant impact to the safety goal, considering the high number of deaths and significant injuries. The customer service goal can be considered to be impacted because the spectators at the show were not sufficiently protected from injury. (The FAA grants approval to air shows based on safety of the spectators from a crash.) The remaining days of the race were cancelled – an impact to the schedule goal. The plane was destroyed, an impact to the property goal, and the resulting NTSB investigation will cause an impact to the labor goal because of the resources required to complete the investigation.

Once we have captured these impacts to the goals, we can use them to begin the analysis. The injuries and deaths occurred from the plane crashing into the VIP section and the subsequent explosion which resulted in shrapnel injuries. The pilot lost control of the plane and did not have sufficient time to recover (as evidenced by there being no indication that he made a distress call). It’s unclear what exactly caused the loss of control; however, the plane had been modified to increase its speed, which would have impacted its stability in flight. Additionally, photos taken just before the crash appear to indicate that a portion of the tail fell off, but the reason why has not yet been discovered. What happened to the tail section, and how the modifications affected control of the plane, are questions the NTSB will examine in their report.

Because of the goal of an air race – traveling around a course at low altitudes and high speeds – it’s no surprise that the pilot did not have sufficient time to recover control before crashing. Given that these conditions are expected during air races – and appear to be an acceptable risk to pilots, who continue to race even with the high number of crashes and fatalities that result – it appears that there needs to be more consideration of how spectators are protected from crashes and the shrapnel that can result from the destruction of a plane.

When more evidence is gathered, more information can be added to the Cause Map. Once that occurs, the NTSB can examine the causes contributing to the deaths at the air race, and make recommendations on how future deaths can be avoided.

Root Cause Analysis - Incident Investigation

Attempted Bombing of Flight 253

September 8, 2011 ThinkReliability Staff

Despite constantly increasing airport security, a man suspected of terrorism was able to board a flight from Amsterdam to Detroit with ~80 grams of explosive and a liquid detonator. However, the device did not detonate, likely saving the plane.

Had the explosive detonated, it might have caused the loss of the plane, resulting in the deaths of all on board. Even though the loss of lives and the plane did not occur, the potential for it to happen is an impact to the safety goal.

The suspect was able to board the plane because, despite warnings from his father, there was insufficient information to add him to the no-fly list (see process map) and his visa was not revoked.

Officials in the U.S. were unaware a visa had been issued by the U.S. embassy in London. Additionally, while the information from the suspect’s father was entered into TIDE (a terrorist intelligence database), there was no follow-up on the information. It’s unclear if there was no follow-up required, or if the follow-up was just not performed.

In an admitted failure of safety procedures, the explosives were not detected by airport security. The information about the suspect was considered not specific enough for the suspect to be put on the “selectee list”, which would have led to additional screening. The suspect was not passed through a body scanner, which might have detected the explosives, because body scanners are not used on passengers traveling to the U.S. due to privacy concerns. The ingredients were hidden in the suspect’s undergarments and so were not detected by security.

Root Cause Analysis - Incident Investigation

International Space Station Supply Ship Crash

September 7, 2011 ThinkReliability Staff

On August 24, 2011, a supply ship heading to the International Space Station (ISS) crashed in Siberia, losing two tons of cargo. However, the impact of this loss was much more than the two tons of cargo – it may lead to an evacuation of the ISS, which would become unmanned for some unknown period of time.

The crash of the unmanned Progress 44 supply ship, which was on its way to resupply the ISS, was caused by the emergency deactivation of the Soyuz rocket when a gas generator malfunctioned. Until the specific causes of the malfunction are determined, manned Soyuz flights are grounded. That means that a new crew cannot get to the Space Station to relieve the current crew. Although the current crew has enough supplies for the time being, they cannot remain on the space station past December. The spacecraft already at the station (their “guaranteed ride home”) are only allowed in space for 200 days – due to limited battery life and concern for degradation of rubberized seals from contact with thruster fuel.

Because of a lack of funding, American shuttles are now all mothballed, leaving the Russian Soyuz rockets the only way to and from the space station. Finding another way to get there by December is unlikely, leaving the attempt to determine and fix the problems with Soyuz the only hope for continued manning of the ISS.

We can examine this incident in a Cause Map, beginning with the impacts to the goals. For example, although there were no safety goal impacts resulting from the crash of the unmanned ship, the customer service goal is impacted due to the potential of evacuating the ISS. The production goal is impacted because of the grounding of manned Soyuz flights, and the property goal is impacted due to the two tons of lost cargo meant for the space station. We begin our Cause Map with these impacts to the goals, asking “Why” questions to complete the analysis. The amount of detail in the map is determined by the impact to the goals. Because the crash may lead to the evacuation and continued unmanned operation of the space station, once specific causes are determined, this Cause Map would become quite detailed. For now, because the causes have not yet been determined, we begin with a simple map, which does capture the impacts to the goals and the basic information now known.

Root Cause Analysis - Incident Investigation

Rioting in England

August 19, 2011 ThinkReliability Staff

Rioting is defined as a violent public disorder caused by a group of persons. It is a unique phenomenon in that it is difficult to pinpoint exactly what is going to trigger and sustain a riot. Social scientists know that there is a tipping point at which participants no longer fear punishment (such as jail) as the number of gatherers increases. However, there are many common contributing factors. A Cause Map can help sort out what led to this month’s rioting in the United Kingdom.

It began on August 4th, following the police shooting of a 29-year-old in North London. The police claimed he was suspected of weapons possession and that they were attempting to execute a warrant. During the arrest, the suspect was shot and killed. However, questions arose regarding the circumstances of the arrest, and family and friends came to believe that the victim, Mark Duggan, was unarmed. This led to a peaceful protest of approximately 120 people, ending at the police station in Tottenham, North London. Protestors demanded answers, and police officials seemed unable to satisfy the crowd.

The crowd lingered while police stalled, and grew as disgruntled local youths began to arrive at dusk. At this point, things began to spiral out of control. Why did this unsatisfied, but otherwise quiet gathering turn into a multi-day riot across an entire country?

According to social scientists, rioting generally occurs when certain elements are present. Normally there have to be a lot of people. There also needs to be a low level of perceived risk that they will be punished for unacceptable behavior. The perception that the risk is low generally strengthens as the number of law enforcement officers falls and as the number of people grows. Those people generally are upset about something. There also needs to be a feeling that others are likely to join in. But even with all these elements, a riot will not start on its own. The final element is a “catalyst”. This is typically a person who has calculated that the risk of being targeted by law enforcement is sufficiently low, and acts out – such as by throwing a rock through a window.
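One way to picture these elements is as a checklist in which every item must be present at once. The Python sketch below is a deliberate simplification for illustration only: the class, field names and yes/no treatment of each element are hypothetical, and real crowd behavior is not this clear-cut.

```python
from dataclasses import dataclass, fields


@dataclass
class CrowdConditions:
    """Hypothetical yes/no flags for the elements described above."""
    large_crowd: bool
    low_perceived_risk: bool
    crowd_is_upset: bool
    others_expected_to_join: bool
    catalyst_present: bool


def riot_conditions_met(conditions):
    """True only when every element is present at the same time."""
    return all(getattr(conditions, f.name) for f in fields(conditions))


def missing_elements(conditions):
    """Names of the elements that are absent, in declaration order."""
    return [f.name for f in fields(conditions) if not getattr(conditions, f.name)]


if __name__ == "__main__":
    # A large, upset crowd with little fear of punishment, but no catalyst yet.
    before_catalyst = CrowdConditions(True, True, True, True, False)
    print(riot_conditions_met(before_catalyst))
    print(missing_elements(before_catalyst))
```

Because every element is required, removing any one of them is enough to change the outcome in this simplified picture.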

Examining the Cause Map reveals that these elements were present in the initial riot as well as in the general rioting that broke out across the country. It becomes evident that the rioting was cyclical – the initial riot led to more widespread rioting. And the same elements that were present in the initial riot were present in the widespread rioting as well.

After completing the Cause Map analysis, the next step is to determine how to prevent this from happening again. Everyone seems to have an opinion on what went wrong, and more importantly what needs to be done differently to prevent such costly and dangerous behavior. Returning to the Cause Map, we can look for opportunities to prevent future riots. Some of the elements that contribute to a riot can be controlled more easily than others. For instance, it is easier to limit mass gatherings than to control the emotions of a crowd. Hence, greater police presence and an ability to clear the street – through curfew or quick arrests – are usually the best solutions for limiting riots. A table of proposed solutions completes the analysis.