Loss of Network Cloud Compute Service

By Angela Griffith

On April 21, 2011, some of Amazon’s Elastic Compute Cloud (EC2) customers experienced issues when a combination of events left the Elastic Block Store (EBS) in Amazon’s US East region unable to process read or write operations.  This seriously impacted customer service.  A massive recovery effort was undertaken, and services, along with most of the data, were restored within 3 days.  Amazon has released its post-mortem analysis of these events.  Using the information provided, we can begin a visual root cause analysis, or Cause Map, laying out the event.

We begin with the affected goal.  Customer service was impacted because of the inability to process read or write operations.  This ability was lost due to a degraded EBS cluster.  (A cluster is a group of nodes responsible for replicating data and processing read and write requests.)  The cluster was degraded by two factors: the failure of some nodes, and a large number of nodes searching for replicas.

At this point, we’ll look into the process to explain what’s going on.  When a user makes a request, a control plane accepts the request and routes it to an EBS cluster.  The cluster elects a node to be the primary replica of the data.  That node stores the data and looks for other available nodes to hold backup replicas.  If the node-to-node connection is lost, the primary replica searches for another node.  Once it has established connectivity with that node, the new node becomes another replica.  This process is continuous.
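The re-mirroring process described above can be sketched in a few lines of Python.  This is purely illustrative (the class and function names are invented, not Amazon’s actual code), but it shows the key behavior: a primary replica scans the cluster for any reachable peer, and when the network partition makes every peer unreachable, the search comes up empty and the node keeps looking.

```python
# Hypothetical sketch of the re-mirroring search described above.
# All names here are illustrative, not Amazon's actual implementation.

class Node:
    """A storage node that holds a replica and mirrors it to a peer."""
    def __init__(self, name):
        self.name = name
        self.reachable = True

class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes

    def find_replica_peer(self, primary):
        """The primary replica searches the cluster for any reachable peer."""
        for node in self.nodes:
            if node is not primary and node.reachable:
                return node
        return None  # no peer found -- in the real event, nodes kept searching

cluster = Cluster([Node("a"), Node("b"), Node("c")])
primary = cluster.nodes[0]
peer = cluster.find_replica_peer(primary)
print(peer.name)  # -> "b" while the network is healthy

# Once the network problem makes every other node unreachable,
# the search returns nothing, and the node retries:
for node in cluster.nodes[1:]:
    node.reachable = False
print(cluster.find_replica_peer(primary))  # -> None
```

In the actual event it was this empty-handed search, repeated by many nodes at once, that overwhelmed the cluster.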

In this case, a large number of nodes were searching for replicas because they had lost their connections to the other nodes.  Based on the process discussed above, the nodes then began searching for other nodes.  However, they were unable to find any because the network was unavailable, so the nodes could not communicate with each other.  The nodes had a long time-out period for the search, so their searches continued; meanwhile, more nodes lost communication and began searching, increasing the volume.

The network communication was lost because data was shifted off the primary network.  This was caused by an error during a network configuration change to upgrade the capacity of the primary network.  The data should have been transferred to a redundant router on the primary network but was instead transferred to the secondary network.  The secondary network did not have sufficient capacity to handle all the data and so was unable to maintain connectivity.

In addition to a large number of nodes searching for other nodes, the EBS cluster was impacted by node failures.  Some nodes failed because of a race condition in the node software, triggered when a node attempted to process multiple concurrent requests for replicas.  These concurrent requests were caused by the situation above.  Additionally, the failing nodes caused more nodes to lose their replicas, compounding the difficulty of recovering from this event.
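The kind of failure described — a node crashing when overlapping replica requests arrive — often comes down to a check-then-act pattern.  The sketch below is a generic Python illustration of that pattern (Amazon has not published the details of the actual race condition); the point is that a state check and the action based on it are separate steps, so a second request arriving in between can put the node in a state it was never designed to handle:

```python
# Generic check-then-act sketch, loosely modeled on the node failure
# described above.  This is illustrative Python, not EBS code.

class NodeCrashed(Exception):
    pass

class ReplicaNode:
    def __init__(self):
        self.handling_request = False

    def begin_request(self):
        # Check...
        if self.handling_request:
            # In a real multithreaded node, a second request slipping in
            # between the check and the assignment below is the race window.
            raise NodeCrashed("concurrent replica requests not supported")
        # ...then act.
        self.handling_request = True

    def end_request(self):
        self.handling_request = False

node = ReplicaNode()
node.begin_request()       # first replica request is accepted
crashed = False
try:
    node.begin_request()   # a second request arrives before the first finishes
except NodeCrashed:
    crashed = True
print(crashed)  # -> True
```

Under normal load, concurrent requests were rare enough that the bug stayed hidden; the search storm made them common, which is why the node failures appeared only during this event.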

Service is back to normal, and Amazon has made some changes to prevent this type of issue from recurring.   Immediately, the data was shifted back to the primary network and the error which caused the shifting was corrected.  Additional capacity was added to prevent the EBS cluster from being overwhelmed.  The retry logic which resulted in the nodes continuing to search for long periods of time has been modified, and the source of the race condition resulting in the failure of the nodes has been identified and repaired.
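Amazon has not published the exact retry logic it adopted, but a common way to keep retries from piling up is a capped exponential backoff, where each failed attempt waits roughly twice as long as the last, up to a ceiling.  A minimal sketch, with invented numbers:

```python
# Capped exponential backoff -- one common fix for runaway retry storms.
# The parameters here are illustrative, not Amazon's actual values.

def backoff_delays(attempts, base=1.0, cap=60.0):
    """Seconds to wait before each retry: 1, 2, 4, ... capped at 60."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]

print(backoff_delays(8))  # -> [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The effect is that a node which cannot find a peer backs off instead of hammering the network, giving the cluster room to recover.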

View the root cause analysis investigation of this event – including an outline, timeline, Cause Map, solutions and Process Map, by clicking “Download PDF” above.

Plane Clips Another While Taxiing at JFK Airport

By Kim Smiley

Around 8:30 pm on April 11, 2011, a large passenger airplane taxiing at John F. Kennedy Airport in New York clipped the wing of a smaller plane.  The larger plane involved in the incident was an Airbus A380 carrying 485 passengers and 25 crew members.  The smaller plane was a Bombardier CRJ carrying 52 passengers and 4 crew members at the time it was clipped.

At the time of the accident, the Airbus was taxiing to take off and the CRJ had recently landed and was waiting to park.  The incident was caught on amateur video and it appears that the left wing tip of the Airbus struck the left horizontal stabilizer of the CRJ. No injuries were reported, but both planes sustained some damage.

After the planes made contact, the fire department responded as a precautionary measure.  Passengers were deplaned from the Airbus so that the planes could be inspected and information could be gathered to support the investigation.

At this time there is limited information available about what caused this incident, but the National Transportation Safety Board (NTSB) has begun an investigation.  The NTSB has requested flight recorders from both airplanes and also plans to review the air traffic control tapes and the ground movement radar data to determine how this happened.

Even though the investigation is just getting started, it is still possible to create a Cause Map based on what is known.  The first step is to create an Outline of the event by determining the impact to the organization’s goals.  In this example, the Safety Goal was impacted because there was the potential for injuries, the Customer Service Goal was impacted because the passengers were unable to reach their destination, the Production Schedule Goal was impacted because the flight was unable to depart, and the Material and Labor Goal was impacted because there was damage to both planes.

From this point, Causes can be added to the Cause Map by asking “why” questions.  Missing information can be noted by adding a Cause box with a “?”.  Any additional information can be added later.  To see an initial Cause Map of this incident and the Outline, click on the “Download PDF” above.

75 Year Old Woman Cuts Internet Service to Armenia With a Shovel

By Kim Smiley

On March 28, 2011, a 75-year-old woman out digging for scrap metal accidentally cut internet service to nearly all of Armenia.  There were also service interruptions in Azerbaijan and part of Georgia.  Some regions were able to switch to alternative internet suppliers within a few hours, but some areas were without internet service for 12 hours.

How did this happen?  How could an elderly woman and a shovel cause such chaos without even trying?

A root cause analysis can be performed and a Cause Map built to show what contributed to this incident.  Building a Cause Map begins with determining the impacts to the organizational goals.  Then “why” questions are asked and causes are added to the map.

In this example, the Customer Service Goal is impacted because there was significant internet service interruption and the Production Schedule Goal was also impacted because of loss of worker productivity.  The Material Labor Goal also needs to be considered because of the cost of repairs.

Now causes are added to the Cause Map by asking “why” questions.  Internet service was disrupted because a fiber optic cable was damaged by a shovel.  In addition, this one cable provided 90 percent of Armenia’s internet, so damaging it created a huge interruption in internet service.
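The chain of “why” questions behind a Cause Map can be represented very simply as a linked list of cause-effect pairs.  The sketch below is only a toy illustration of the method (real Cause Maps branch and attach evidence to each box), using the causes from this incident:

```python
# Toy sketch of a linear "why" chain from this incident.
# Real Cause Maps branch; this shows only the method of reading one.

cause_map = [
    "Customer Service Goal impacted",
    "internet service interrupted",
    "fiber optic cable cut by a shovel",
    "woman digging for copper scrap hit the fiber cable",
    "fiber and copper cables are buried in similar-looking conduit",
]

# Each "why?" question moves one step down the chain:
for effect, cause in zip(cause_map, cause_map[1:]):
    print(f"Why '{effect}'?  Because: {cause}")
```

Each answer becomes the next question, which is how the map grows until the causes reach controllable conditions.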

Why would a 75-year-old woman be out digging for cables?  The woman was looking for copper cable and accidentally hit the fiber optic cable.  This happened because both types of cables are usually buried inside PVC conduit and can look similar.  The reason she was looking for copper cable is because there is a market for scrap metal.  Metal scavenging is a common practice in this region because there are many abandoned copper cables left in the ground.  She was also able to hit the fiber optic cable because it was closer to the surface than intended, likely exposed by mudslides or heavy rains.

The woman, who has been dubbed the spade-hacker by local media, has been released from police custody.  She is still waiting to hear if she faces any punishment, but police statements implied that the prosecutor won’t push for the maximum of three years in prison due to her age.

To see the Cause Map of this issue, click on the “Download the PDF” button above.

Grounding the 737s: SWA Flight 812

By ThinkReliability Staff

As new information comes to light, processes need to be reevaluated.  A hole in the fuselage of a 15-year-old Boeing 737-300 led to the emergency descent of Southwest Airlines Flight 812.  737s have been grounded as federal investigators determine why the hole appeared.  At the moment, the consensus is that a lap joint supporting the top of the fuselage cracked.

While the investigation is still in the early stages, it appears that stress fatigue caused a lap joint to fail.  Stress fatigue is a well-known phenomenon, caused in aircraft by the constant pressurization and depressurization occurring during takeoff and landing.  Mechanical engineers designing the aircraft would have been well aware of this phenomenon.  The S-N curve, which plots a metal’s expected lifespan vs. stress, has been used for well over a century.
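The high-cycle portion of the S-N curve is often summarized by Basquin’s equation, a standard textbook relationship (included here for background; it is not specific to this investigation):

```latex
\sigma_a = \sigma'_f \, (2N_f)^{b}
```

Here $\sigma_a$ is the stress amplitude of each pressurization cycle, $2N_f$ is the number of stress reversals to failure, $\sigma'_f$ is the material’s fatigue strength coefficient, and $b$ is its fatigue strength exponent (typically negative, so higher stress amplitudes mean fewer cycles before cracking).  Designers use relationships like this to predict how many flight cycles a joint should survive.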

Just as a car needs preventative maintenance, planes are inspected regularly for parts that are ready to fail.  However, the crack in the lap joint wasn’t detected during routine maintenance.  In fact, that joint wasn’t even checked.  It wasn’t an oversight, however.  Often the design engineers also set the maintenance schedule, because they hold the expertise needed to determine a reasonable procedure.  The engineers didn’t expect the part to fail for at least 20,000 more flight hours.  At the moment, it’s unclear why that is.

In response to the incident, the FAA has grounded all similar aircraft and ordered inspections of planes nearing 30,000 flight hours.   Cracks have been found in 5 of the 80 grounded aircraft so far.  However, a looming concern is how to deal with 737s not based in the United States, and therefore outside the FAA’s jurisdiction.