Root Cause Analysis - Incident Investigation

Loss of Network Cloud Compute Service

April 29, 2011 Angela Griffith

On April 21, 2011, some of Amazon’s Elastic Compute Cloud (EC2) customers experienced issues when a combination of events left their East region’s Elastic Block Store (EBS) unable to process read or write operations. This seriously impacted their customer service. Massive efforts were undertaken, and services and most data were restored within 3 days. Amazon has released its post-mortem analysis of these events. Using the information it has provided, we can begin a visual root cause analysis, or Cause Map, laying out the event.

We begin with the affected goal. Customer service was impacted because of the inability to process read or write operations. This ability was lost due to a degraded EBS cluster. (A cluster is a group of nodes, which are responsible for replicating data and processing read and write requests.) The cluster was degraded by the failure of some nodes, and a high number of nodes searching for replicas.

At this point, we’ll look into the process to explain what’s going on. When a user makes a request, a control plane accepts that request and passes it to an EBS cluster. The cluster elects a node to be the primary replica of this data. That node stores the data, and looks for other available nodes to make backup replicas. If the node-to-node connection is lost, the primary replica searches for another node. Once it has established connectivity with that node, the new node becomes another replica. This process is continuous.
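The replica search described above can be sketched in a few lines of Python. This is a simplified illustration only, not Amazon's implementation: the `Node` class, its fields, the capacity units and the `find_new_replica` function are all hypothetical, and the `network_up` flag simply stands in for node-to-node connectivity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a storage node in a cluster."""
    name: str
    free_capacity: int                      # spare storage, arbitrary units (assumed)
    replicas: list = field(default_factory=list)


def find_new_replica(primary: Node, cluster: list, size: int,
                     network_up: bool = True) -> Optional[Node]:
    """Search the cluster for a node that can hold a backup replica.

    Returns the chosen node, or None when no node can be reached
    or none has enough spare capacity.
    """
    if not network_up:
        # Nodes cannot talk to each other, so the search finds nothing.
        return None
    for candidate in cluster:
        if candidate is primary:
            continue
        if candidate.free_capacity >= size:
            candidate.free_capacity -= size
            candidate.replicas.append(primary.name)
            return candidate
    return None


cluster = [Node("a", 10), Node("b", 2), Node("c", 8)]
primary = cluster[0]

chosen = find_new_replica(primary, cluster, size=5)
print(chosen.name)   # "c": node "b" does not have enough spare capacity

# With the network unavailable, the search comes back empty.
print(find_new_replica(primary, cluster, size=5, network_up=False))
```

In this sketch a failed search simply returns `None`; in the incident, nodes kept repeating the search, which is what drove the load on the cluster.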

In this case, a high number of nodes were searching for replicas because they lost connection to the other nodes. Based on the process discussed above, the nodes then began a search for other nodes. However, they were unable to find any other nodes because the network was unavailable (so the nodes could not communicate with each other). The nodes had a long time-out period for searching for other nodes, so their search continued, and more nodes lost communication and began a search, increasing the volume of search requests.

The network communication was lost because data was shifted off the primary network. This was caused by an error during a network configuration change to upgrade the capacity of the primary network. The data should have been transferred to a redundant router on the primary network but was instead transferred to the secondary network. The secondary network did not have sufficient capacity to handle all the data and so was unable to maintain connectivity.

In addition to a large number of nodes searching for other nodes, the EBS cluster was impacted by node failures. Some nodes failed because of a race condition that caused a node to fail when it attempted to process multiple concurrent requests for replicas. These requests were caused by the situation above. Additionally, the nodes failing led to more nodes losing their replicas, compounding the difficulty of recovering from this event.

Service is back to normal, and Amazon has made some changes to prevent this type of issue from recurring. Immediately, the data was shifted back to the primary network and the error which caused the shifting was corrected. Additional capacity was added to prevent the EBS cluster from being overwhelmed. The retry logic which resulted in the nodes continuing to search for long periods of time has been modified, and the source of the race condition resulting in the failure of the nodes has been identified and repaired.
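One common way to keep a retry loop from flooding a struggling system is exponential backoff with a cap and random jitter. The sketch below shows that general technique as an illustration; it is an assumption, not a description of the specific change Amazon made. The function name and the parameter values are hypothetical.

```python
import random
from typing import Optional


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0,
                   rng: Optional[random.Random] = None) -> list:
    """Return the wait (in seconds) before each retry attempt.

    The upper bound doubles on every attempt up to `cap`, and the actual
    wait is drawn uniformly between 0 and that bound ("full jitter") so
    that many nodes retrying at once do not stay synchronized.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        bound = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, bound))
    return delays


# Seeded generator so the example is repeatable.
for i, delay in enumerate(backoff_delays(8, rng=random.Random(42))):
    print(f"retry {i}: wait up to {min(60.0, 2 ** i):.0f}s, chose {delay:.1f}s")
```

The cap keeps a long outage from pushing waits out indefinitely, and the jitter spreads retries out so that a large group of nodes does not search at the same instant.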

View the root cause analysis investigation of this event, including an outline, timeline, Cause Map, solutions and Process Map, by clicking “Download PDF” above.

Plane Clips Another While Taxiing at JFK Airport

April 19, 2011 Kim Smiley

By Kim Smiley

Around 8:30 pm on April 11, 2011, a large passenger airplane taxiing at John F. Kennedy Airport in New York clipped the wing of a smaller plane. The larger plane involved in the incident was an Airbus A380 carrying 485 passengers and 25 crew members. The smaller plane was a Bombardier CRJ carrying 52 passengers and 4 crew members at the time it was clipped.

At the time of the accident, the Airbus was taxiing to take off and the CRJ had recently landed and was waiting to park. The incident was caught on amateur video and it appears that the left wing tip of the Airbus struck the left horizontal stabilizer of the CRJ. No injuries were reported, but both planes sustained some damage.

After the planes made contact, the fire department responded as a precautionary measure. Passengers were deplaned from the Airbus so that the planes could be inspected and information could be gathered to support the investigation.

At this time there is limited information available about what caused this incident, but the National Transportation Safety Board (NTSB) has begun an investigation. The NTSB has requested flight recorders from both airplanes and also plans to review the air traffic control tapes and the ground movement radar data to determine how this happened.

Even though the investigation is just getting started, it is still possible to create a Cause Map based on what is known. The first step is to create an Outline of the event by determining the impact to the organization’s goals. In this example, the Safety Goal was impacted because there was the potential for injuries, the Customer Service Goal was impacted because the passengers were unable to reach their destination, the Production Schedule Goal was impacted because the flight was unable to depart, and the Material and Labor Goal was impacted because there was damage to both planes.

From this point, Causes can be added to the Cause Map by asking “why” questions. Missing information can be noted by adding a Cause box with a “?”. Any additional information can be added later. To see an initial Cause Map of this incident and the Outline, click on the “Download PDF” above.
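The structure described above, effects linked to their causes by “why” questions with “?” boxes marking gaps, can be represented as a simple tree. The sketch below is one possible representation and is not part of the Cause Mapping method itself; the `Cause` class and the `unknowns` helper are hypothetical names, and the entries are taken from this incident.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    """One box on a cause map; `causes` answers "why did this happen?"."""
    text: str
    causes: list = field(default_factory=list)

    def because(self, text: str) -> "Cause":
        """Attach a cause to this box and return the new box."""
        child = Cause(text)
        self.causes.append(child)
        return child


def unknowns(box: Cause) -> list:
    """Collect the effects whose cause is still marked "?"."""
    found = []
    for child in box.causes:
        if child.text == "?":
            found.append(box.text)
        found.extend(unknowns(child))
    return found


goal = Cause("Material and Labor Goal impacted")
damage = goal.because("Damage to both planes")
contact = damage.because("Airbus wing tip struck CRJ horizontal stabilizer")
contact.because("?")   # cause not yet known; fill in as the investigation proceeds

print(unknowns(goal))
```

Listing the “?” boxes this way gives a running set of open questions that can be closed out as more information becomes available.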

75 Year Old Woman Cuts Internet Service to Armenia With a Shovel

April 14, 2011 Kim Smiley

By Kim Smiley

On March 28, 2011, a 75-year-old woman out digging for scrap metal accidentally cut internet service to nearly all of Armenia. There were also service interruptions in Azerbaijan and part of Georgia. Some regions were able to switch to alternative internet suppliers within a few hours, but some areas were without internet service for 12 hours.

How did this happen? How could an elderly woman and a shovel cause such chaos without even trying?

A root cause analysis can be performed and a Cause Map built to show what contributed to this incident. Building a Cause Map begins with determining the impacts to the organizational goals. Then “why” questions are asked and causes are added to the map.

In this example, the Customer Service Goal is impacted because there was significant internet service interruption and the Production Schedule Goal was also impacted because of loss of worker productivity. The Material and Labor Goal also needs to be considered because of the cost of repairs.

Now causes are added to the Cause Map by asking “why” questions. Internet service was disrupted because a fiber optic cable was damaged by a shovel. In addition, this one cable provided 90 percent of Armenia’s internet so damaging it created a huge interruption in internet service.

Why would a 75-year-old woman be out digging for cables? The woman was looking for copper cable and accidentally hit the fiber optic cable. This happened because both types of cables are usually buried inside PVC conduit and can look similar. She was looking for copper cable because there is a market for scrap metal. Metal scavenging is a common practice in this region because there are many abandoned copper cables left in the ground. She was also able to hit the fiber optic cable because it was closer to the surface than intended, likely exposed by mudslides or heavy rains.

The woman, who has been dubbed the spade-hacker by local media, has been released from police custody. She is still waiting to hear if she faces any punishment, but police statements implied that the prosecutor won’t push for the maximum of three years in prison due to her age.

To see the Cause Map of this issue, click on the “Download the PDF” button above.

Grounding the 737’s: SWA Flight 812

April 8, 2011 ThinkReliability Staff

By ThinkReliability Staff

As new information comes to light, processes need to be reevaluated. A hole in the fuselage of a 15-year-old Boeing 737-300 led to the emergency descent of Southwest Airlines Flight 812. 737’s have been grounded as federal investigators determine why the hole appeared. At the moment, consensus is that a lap joint supporting the top of the fuselage cracked.

While the investigation is still in the early stages, it appears that stress fatigue caused a lap joint to fail. Stress fatigue is a well known phenomenon, caused in aircraft by the constant pressurization and depressurization occurring during takeoff and landing. Mechanical engineers designing the aircraft would have been well aware of this phenomenon. The S-N curve, which plots a metal’s expected lifespan vs. stress, has been used for well over a century.

Just as a car needs preventative maintenance, planes are inspected regularly for parts that are ready to fail. However, the crack in the lap joint wasn’t detected during routine maintenance. In fact, that joint wasn’t even checked. It wasn’t an oversight, however. Often the design engineers also set the maintenance schedule, because they hold the expertise needed to determine a reasonable procedure. The engineers didn’t expect the part to fail for at least 20,000 more flight hours. At the moment, it’s unclear why it failed so much earlier than expected.

In response to the incident, the FAA has grounded all similar aircraft and ordered inspections of aircraft nearing 30,000 flight hours. Cracks have been found in 5 of the 80 grounded aircraft so far. However, a looming concern is how to deal with 737’s not based in the United States, and therefore outside the FAA’s jurisdiction.

Air Traffic Controller Asleep On the Job

March 31, 2011 Kim Smiley

By Kim Smiley

At least three times over the past decade, air traffic controller fatigue has been investigated by the National Transportation Safety Board (NTSB) in near-miss airline accidents. Five years ago, controller fatigue was a significant factor in a Lexington, KY crash killing 49, the last fatal crash related to this problem. Again last week, controller fatigue was in the news when two aircraft had uncontrolled early-morning landings at Reagan National Airport near Washington, D.C. The controller, who had 20 years of experience, most of them at Reagan, was clearly well experienced. In fact, the controller was also a supervisor. But no level of experience can overcome the effects of fatigue. The relieved controller stated that he had worked the 10 p.m. to 6 a.m. shift four nights in a row.

Faced with harsh criticism over the latest incident, the FAA reacted by mandating a second controller at Reagan National Airport and reviewing traffic management policies at all single-person towers. Regional radar controllers are now required to check in with single-person towers during night shifts to ensure controllers are prepared to handle incoming traffic.

Controller fatigue is a well known problem, and multiple solutions have been suggested over the past two decades. It has been a part of the NTSB’s Most Wanted list since 1990. In 2007 following the Lexington crash, the NTSB urged the Federal Aviation Administration (FAA) to overhaul their controller schedules, claiming that the stressful work and hectic pace were putting passengers and crews at risk. The FAA responded, and is currently working with the National Air Traffic Controllers Association (NATCA) to develop “a science-based controller fatigue mitigation plan”.

In addition, from 2007 to 2011, more than 5,500 new air traffic controllers were hired. However, many of these simply replaced air traffic controllers who were retiring, resulting in no net gain in the pool of available labor. Air traffic controllers have a mandated retirement age of 56, with exceptions available up to age 61. Additionally, on-the-job training is extensive, requiring two to four years just to receive initial certification. Adding staffing therefore is more difficult than initially meets the eye.

Faced with an expected increase in air traffic and an aging infrastructure, the FAA has aggressively pursued a long-term modernization called NextGen. With the proposed modernization and staffing, the 2011 FAA budget request is now $1.14B, a $275M or 31% increase from 2010. While material and personnel changes are often necessary, sometimes simpler solutions are equally effective or quicker to implement.

The associated Cause Map reflects the multiple solutions suggested, and even implemented, to combat the problem of controller fatigue. As discussed, the FAA, NTSB and NATCA have pursued multiple paths to overcome the issue of controller fatigue. However, as the Cause Map shows, there are multiple contributing factors in this case. Controller fatigue isn’t the only reason those planes had an uncontrolled landing, and controller fatigue wasn’t caused by just four night shifts in a row. Because there are multiple reasons why this happened, it also means there are multiple opportunities to correct future problems. The key isn’t eliminating all of the causes, but rather eliminating the right one.

Issues at Fukushima Daiichi Unit 3

March 21, 2011 Angela Griffith

By ThinkReliability Staff

There are many complex events occurring with some of Japan’s nuclear power plants as a result of the earthquake and tsunami on March 11, 2011. Although the issues are still very much ongoing, it is possible to begin a root cause analysis of the events and issues. In order to clearly show one issue, our analysis within this blog is limited to the issues affecting Fukushima Daiichi Unit 3. This is not to minimize the issues occurring at the other plants and units, but rather to clearly demonstrate the cause-and-effect within one small piece of the overall picture.

The issues surrounding Unit 3 are extremely complex. In events such as these, where many events contribute to the issues, it can be helpful to make a timeline of events. A timeline of the events so far can be seen by clicking “Download PDF” above. A timeline can not only help to clarify the order of contributing events, it can also help create the Cause Map, or visual root cause analysis. To show how the events on the timeline fit into the Cause Map, some of the entries are denoted with numbers, which are matched to the same events on the Cause Map. Notice that in general, because Cause Maps build from right to left with time, earlier entries are found to the right of newer events. For example, the earthquake was the cause of the tsunami, so the earthquake is to the right of the tsunami on the map. Many of the timeline events are causes, but some are also solutions. For example, the venting of the reactor is a solution to the high pressure. (It also becomes a cause on the map.)

A similar analysis could be put together for all of the units affected by the earthquake, tsunami and resulting events. Parts of this cause map could be reused as many of the issues affecting the other plants and units are similar to the analysis shown here. It would also be possible to build a larger Cause Map including all impacts from the earthquake.

The impact to goals needs to be determined prior to building a Cause Map. As a direct result of the events at Unit 3, 7 workers were injured. This is an impact to the worker safety goal. There is the potential for health effects to the population, which is an impact to the public safety goal. The environmental goal was impacted due to the release of radioactivity into the environment. The customer service goal was impacted due to evacuations and rolling blackouts, caused by the loss of electrical production capacity, which is an impact to the production goal. The loss of capacity was caused by catastrophic damage to the plant, which is an impact to the property goal. Additionally, the massive effort to cool the reactor is an impact to the labor goal.

The worker safety and property goals were impacted because of a hydrogen explosion, which was caused by a buildup of pressure in the plant, caused by increasing reactor temperature. Heat continues to be generated by a nuclear reactor, even after it is shut down, as a natural part of the operating process. In this case, the normal cooling supply was lost when external power lines were knocked down by the tsunami (which was caused by the earthquake). The tsunami also apparently damaged the diesel generators which provided the emergency cooling system. The backup to the emergency cooling supply stopped automatically and was unable to be restarted, for reasons that are as yet unknown.

The outline, timeline and Cause Map shown on the PDF are extremely simplified. Part of this simplification is due to the fact that the event is still ongoing and not all information is known or has been released. Once more information becomes available, it can be added to the analysis, or the analysis can be revised.

To learn more about the reactor issues at Fukushima Daiichi, view our video summary. To see a blog about the impact of the fallout on the health of babies in the US, see our healthcare blog.

Two Killed in Barge/Tour Boat Collision

March 14, 2011 Angela Griffith

By ThinkReliability Staff

On July 7, 2010, a barge being propelled by a tug boat collided with a tour boat that had dropped anchor in the Delaware River. As a result of the collision, two passengers on the tour boat were killed and twenty-six were injured. The tour boat sank in 55 feet of water.

Detail regarding the incident has just been released in an updated NTSB report. We can use the information about this report to begin a Cause Map, or visual root cause analysis. The information in the report can also point us in the direction of important questions that remain to be answered to determine exactly what happened and, most importantly, how incidents like these can be prevented in the future.

In this case, a tour boat had dropped anchor to deal with mechanical problems. According to the tour boat crew’s testimony and radio recordings, the tour boat crew attempted to get in touch with the tug boat by yelling and making radio calls. Neither was answered or apparently noticed. The barge that was being propelled by the tug boat crashed into the tour boat, resulting in deaths, injuries and loss of the tour boat.

The lookout on the tug boat was inadequate (had it been adequate, the tug boat would have noticed the tour boat in time to avoid the collision). The report has determined that the tug boat master was off-duty and below-deck at the time of the collision. According to cell phone records, the mate who was on lookout duty was on a phone call at the time of the collision and had made several phone calls during his duty. The inadequate lookout combined with the inability of the tour boat to make contact with the tug boat resulted in the collision.

There are two obvious areas where more detail is needed in the Cause Map to determine what was going on that led to the issues on the tug boat. Specifically, why was the lookout on the cell phone and why wasn’t the tour boat able to contact the tug boat through the radio? Because of the strict requirements for lookouts on marine duty, there is also an ongoing criminal investigation into the lookout’s actions. When the final NTSB report is issued, and the criminal case is closed, these questions should be answered. More detail can be added to this Cause Map as the analysis continues. As with any investigation the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals.

San Francisco’s Stinking Sewers

March 9, 2011 ThinkReliability Staff

By ThinkReliability Staff

The Golden Gate City is well known for its ground-breaking, environmentally-friendly initiatives. In 2007 San Francisco outlawed the use of plastic bags at major grocery stores. The city also mandated compulsory recycling and composting programs in 2009. Both ordinances were the first laws of their kind in the nation, and criticized by some for being overly aggressive. Likewise San Francisco’s latest initiative, to reduce city water usage by encouraging the use of low-flow toilets, has faced harsh criticism.

Recently San Francisco began offering substantial rebates to homeowners and businesses to install high efficiency toilets (HETs). These toilets use 1.28 gallons or less per flush (gpf), down from the 1.6 gpf versions required today by federal law and the even older 3.4 gpf toilets from decades ago. That means that an average home user will save between 3,800 and 5,000 gallons of water per year per person. In dollars, that’s a savings of $90 annually for a family of four. This can quickly justify the cost of a new commode, since a toilet is expected to last 20 years.
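The size of the savings can be checked with a rough calculation. The gallons-per-flush figures below come from the text; the number of flushes per person per day is an assumption chosen for illustration and is not given in the original.

```python
HET_GPF = 1.28        # high efficiency toilet, gallons per flush (from the text)
CURRENT_GPF = 1.6     # current federal requirement (from the text)
OLD_GPF = 3.4         # older toilets (from the text)
FLUSHES_PER_DAY = 5   # ASSUMPTION: flushes per person per day


def annual_savings(old_gpf: float, new_gpf: float,
                   flushes_per_day: float = FLUSHES_PER_DAY) -> float:
    """Gallons saved per person per year by switching toilets."""
    return (old_gpf - new_gpf) * flushes_per_day * 365


print(round(annual_savings(OLD_GPF, HET_GPF)))      # replacing a 3.4 gpf toilet
print(round(annual_savings(CURRENT_GPF, HET_GPF)))  # replacing a 1.6 gpf toilet
```

Under this assumption, replacing a 3.4 gpf toilet saves roughly 3,900 gallons per person per year, near the low end of the quoted range, while replacing a 1.6 gpf toilet saves far less. The result is sensitive to the assumed number of flushes per day.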

Aside from cost savings, there are obvious environmental benefits to reduced water use. The city initially undertook the HET rebate initiative to decrease the amount of water used overall by the city and the amount of wastewater requiring treatment. They were successful, and water usage decreased. In fact, the city’s Public Utilities Commission stated that San Francisco residents reduced their water consumption by 20 million gallons of water last year. San Francisco last year used approximately 215 million gallons per day. This also met other goals the city had, such as reducing costs to consumers. Unintentionally though, the HET rebate initiative impacted a different goal – Customer Service.

As shown on the associated Cause Map, reduced water flow had a series of other effects. While water consumption – and presumably waste water disposal – shrank significantly, waste production has remained constant. Despite $100M in sewage systems upgrades over the past five years, current water flow rates are not high enough to keep things moving through the system. As a result sewage sludge builds up in sewer lines. As bacteria eat away at the organic matter in the sludge, hydrogen sulfide is released. Hydrogen sulfide is known for its characteristic “rotten egg” smell.

This creates an unfortunate situation. No one wants to walk through smelly streets. Further, slow sewage means a build-up of potentially harmful bacteria. However, everyone agrees San Francisco should strive to conserve water. Water is a scarce and increasingly expensive resource in California. What’s the next step in solving the stinking sewer problem?

San Francisco is not the first city to deal with this issue. There is substantial debate over the city’s current plan to purchase $14M in bleach to clean up the smell. Many parties are concerned about potential environmental impacts and potential contamination to drinking water. Other solutions have been proposed by environmental activists, but may have financial ramifications.

Cause Maps can help all parties come to agreement because they focus problem solvers on the goals, not the details of the problem. In this case, all parties are trying to protect the environment and reduce costs to city residents. Based on those goals and the Cause Map, potential solutions have been developed and placed with their corresponding causes. The next step is to proactively consider how these new actions might affect the stakeholders’ goals. Perhaps other goals could be impacted, such as the safety of drinking water and potential contamination of San Francisco Bay. Financial goals will surely be impacted to varying degrees with each solution. Revising the Cause Map can help identify the pros and cons of each approach and narrow down which solution best satisfies all parties.

Deadly Tiger Attack

February 28, 2011 Kim Smiley

By Kim Smiley

On December 25, 2007, a tiger escaped her enclosure at the San Francisco Zoo and attacked three people. One 17-year-old boy was killed and the other two were injured. The enclosure was built in the 1940s and had safely contained tigers for more than 60 years without incident.

So how did this happen? How did the tiger escape?

A Cause Map can be built using this example to help determine how this incident was able to occur. To begin a Cause Map, the impacts to the organizational goals are first determined and then “why” questions are asked to add causes to the map. In this case, there was obviously an impact to the safety goal because one zoo patron was killed and two were injured. The customer service goal was also impacted because the zoo was closed until January 3, 2008 following the incident. Why was a zoo patron killed? He was killed because he was mauled by a tiger. Why was he mauled? Because the tiger escaped her enclosure and she went after the victims.

Let’s focus on the question of how the tiger escaped her enclosure first. An investigation was conducted by the United States Department of Agriculture’s Animal and Plant Health Inspection Service, the government body that is charged with overseeing the nation’s zoos. Based on claw marks and other evidence at the scene, they determined that the tiger jumped from the bottom of a dry moat and was able to pull herself over the fence surrounding her enclosure. The investigation also determined the fence was lower than typically used around tiger enclosures. The Association of Zoos & Aquariums recommends that walls around a tiger exhibit be at least 16.4 feet high, and the fence around the San Francisco Zoo enclosure was only 12.5 feet high at the time.

The second question of why the tiger went after the boys is not as easy to answer. A few experts have stated that the tiger didn’t behave in a typical way. There has been significant speculation in the media that the victims taunted the tiger or provoked her in some way, but nothing has ever officially been determined.

This focus on the behavior of the victims is a good example of some of the issues that can come up during an investigation. It can be tempting to focus on assigning blame when investigating an incident. But the real question is “What should we do to prevent this from happening again?”. Whether or not the boys provoked the tiger, she should never have been able to escape her enclosure.

After the incident, the zoo extensively remodeled the tiger enclosure, adding a much higher fence with hotwire at the top to prevent any similar incidents from occurring.

The Phillips 66 Explosion: Planning for Emergencies

February 23, 2011 ThinkReliability Staff

By ThinkReliability Staff

All businesses strive to make their processes as efficient as possible and maximize productivity. Minimizing excess inventory only seems sensible, as does placing process equipment in a logical manner to minimize transit time between machines. However, when productivity consistently takes precedence over safety, seemingly insignificant decisions can snowball when it matters most.

Using the Phillips 66 explosion of 1989 as an example, it is easy to see how numerous efficiency-related decisions snowballed into a catastrophe. Examining different branches of the Cause Map highlights areas where those shortcuts played a role. Some branches focus on how the plant was laid out, how operations were run and how the firefighting system was designed. Arguably, all of these areas were maximized for production efficiency, but ended up being contributing factors in a terrible explosion and hampered subsequent emergency efforts.

For instance, the Cause Map shows that the high number of fatalities was caused not just by the initial explosion. The OSHA investigation following the explosion highlighted contributing factors regarding the building layout. The plant was cited for having process equipment located too closely together, in violation of generally accepted engineering practices. While this no doubt maximized plant capacity, it made escape from the plant difficult and did not allow adequate time for emergency shutdown procedures to complete. Additionally, high occupancy structures, such as the control room and administrative building, were located unnecessarily close to the reactors and storage vessels. Luckily, over 100 personnel were able to escape via alternate routes. But luck is certainly not a reliable emergency plan; the plant should have been designed with safety in mind too.

Nearby ignition sources also contributed to the speed of the initial explosion, estimated to be within 90 to 120 seconds of the valve opening. OSHA cited Phillips for not using due diligence in ensuring that potential sources of ignition were kept a safe distance from flammable materials or, alternatively, using testing procedures to ensure it was safe to bring such equipment into work zones. The original spark source will never be known, but the investigation identified multiple possibilities. These included a crane, forklift, catalyst activator, welding and cutting-torch equipment, vehicles and ordinary electrical gear. While undoubtedly such a large cloud of volatile gas would have eventually found a spark, a proactive approach might have provided precious seconds for workers to escape. All who died in the explosion were within 250 feet of the maintenance site.

Another factor contributing to the extensive plant damage was the inadequate water supply for fire fighting, as detailed in the Cause Map. When the plant was designed, the water system used in the HDPE process was the same one that was to be used in an emergency. There is no doubt a single water system was selected to keep costs down. Other shortcuts include placing regular-service fire system pump components above ground. Of course, the explosion sheared electrical cords and pipes controlling the system, rendering it unusable. Not only was the design of the fire system flawed, it wasn’t even adequately maintained. In the backup diesel pump system, only one of three pumps was operational; one was out of fuel and the other simply didn’t work. Because of these major flaws, emergency crews had to use hoses to pump water from remote sources. The fire was not brought under control until 10 hours after the initial explosion. As the Cause Map indicates, there may not have been such extensive damage had the water supply system been adequate.

There is a fine line between running processes at the utmost efficiency and taking short-cuts that can lead to dangerous situations. Clearly, this was an instance where that line was crossed.