Root Cause Analysis - Incident Investigation

Small fire leads to thousands of canceled flights

August 19, 2016 Kim Smiley

Starting August 8, 2016, thousands of travelers were stranded worldwide after widespread cancelations and delays of Delta Air Lines flights. The disruptions continued over several days and the impacts lingered even longer. The flight issues made headlines around the globe and the financial impact to the company was significant.

So what happened? What caused this massive headache for so many travelers? The short answer is a small fire in an airline data center, but a much longer answer is needed to understand what caused this incident. A Cause Map, a visual format for performing a root cause analysis, can be used to analyze this issue. A Cause Map visually lays out all of the causes that contributed to an issue to show the cause-and-effect relationships between them. The Cause Map is built by asking “why” questions and adding the answers. For an effect with more than one cause, all of the causes that contributed to the effect are listed vertically and separated by an “and”. (Click on “Download PDF” to see an intermediate level Cause Map of this incident.)

So why were so many flights canceled and delayed? There was a system-wide computer outage and the airline depends on computer systems for everything from processing check-ins to assigning crews and gates. Bottom line, no flights leave on time without working computer systems. The issues originated at a single data center, but the design of the system led to cascading computer issues that impacted systems worldwide. The airline has not released any specific details about why exactly the issue spread, but this is certainly an area investigators would want to understand in order to create a solution to prevent a similar cascading failure in the future.

In a statement, the company indicated that an electrical component failed, causing a small fire at the data center. (Again, the specifics about what type of component and what caused the failure haven’t been released.) The fire caused a transformer to shut down which resulted in a loss of primary power to the data center. A secondary power system did kick on, but not all servers were connected to backup power. No details have been released about why some servers were not powered by the secondary power supply.
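The cause-and-effect chain described so far can be sketched as a simple tree, where each box lists the answers to its own “why” question. This is only an illustrative sketch: the class and function names are made up for this example, the wording of each box is a paraphrase of the public statements summarized above, and the published Cause Map may be organized differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    """One box on a Cause Map; `causes` holds the answers to "why?"."""
    text: str
    causes: List["Cause"] = field(default_factory=list)


def render(node: Cause, depth: int = 0) -> List[str]:
    """Return indented lines, with AND between causes of the same effect."""
    lines = ["  " * depth + node.text]
    for i, child in enumerate(node.causes):
        if i > 0:
            lines.append("  " * (depth + 1) + "AND")
        lines.extend(render(child, depth + 1))
    return lines


def open_questions(node: Cause) -> List[str]:
    """Collect the boxes for which no further "why" has been answered yet."""
    if not node.causes:
        return [node.text]
    found: List[str] = []
    for child in node.causes:
        found.extend(open_questions(child))
    return found


# Paraphrased from the incident description; not the official Cause Map.
delta_map = Cause("Thousands of flights canceled or delayed", [
    Cause("System-wide computer outage", [
        Cause("Issues cascaded from a single data center (details not released)"),
        Cause("Some servers lost power", [
            Cause("Primary power lost", [
                Cause("Transformer shut down", [
                    Cause("Small fire", [
                        Cause("Electrical component failed (details not released)"),
                    ]),
                ]),
            ]),
            Cause("Not all servers connected to backup power (reason not released)"),
        ]),
    ]),
    Cause("Airline depends on computer systems for check-ins, crews, and gates"),
])

if __name__ == "__main__":
    print("\n".join(render(delta_map)))
```

Calling `open_questions(delta_map)` lists the boxes where the investigation would keep asking “why”, which in this sketch are mostly the items the airline has not yet explained.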

Compounding the frustration for the impacted travelers is the fact that they were unable to get updated flight information. Flight status systems, including airport monitors, continued to show that all flights were on time during the period of the cancelations and delays.

Once a large number of flights are disrupted, it is difficult to return to a normal flight schedule. The rotation schedule for airplanes and pilots has to be redone, which can be time-consuming. Many commercial flights operate near capacity, so it can be difficult to find seats for all the passengers impacted by canceled and delayed flights. Delta has tried to compensate travelers impacted by this incident by offering refunds and $200 in travel vouchers to people whose flights were canceled or delayed at least three hours, but an incident of this magnitude will naturally impact customer confidence in the company.

This incident is a good reminder of the importance of building robust systems with functional backups; otherwise a small problem can spread and quickly become a big problem.

911 Outage in Baltimore

July 22, 2016 Kim Smiley

Nobody ever wants to find themselves in the position of dialing 911. But imagine how quickly a bad situation could get even worse if nobody answered your call for emergency help. That is exactly what happened on July 16, 2016 to people in Baltimore, Maryland. For about two hours, people dialing 911 in Baltimore got a busy signal.

This incident can be investigated by building a Cause Map, a visual root cause analysis. A Cause Map intuitively lays out the many causes that contributed to an issue to show all the cause-and-effect relationships. By focusing on the multiple causes, rather than a single root cause, the range of solutions considered is naturally widened.

The first step in the Cause Mapping process is to fill in an Outline with the basic background information for the incident. Additionally, the Outline is used to capture how the incident impacts the overall goals. This incident, like most incidents, impacted more than one goal. For example, the safety goal is impacted because of the delay in emergency help and the customer service goal is impacted because people were unable to reach 911 operators.

The bottom line on the Outline is used to note the frequency of similar incidents. This is important because an incident that has occurred 12 times before may warrant a different level of investigation than an isolated incident. For this example, newspapers reported a previous 911 outage in June in the Baltimore area. The outages appear to have been caused by different issues, but do raise questions about the overall stability of the 911 system in Baltimore. Investigators should determine if the multiple outages are related and indicative of bigger issues than just this one incident.

Once the Outline is completed, the Cause Map itself is built by asking “why” questions. So why was there a 911 outage for about two hours? Newspapers have reported that the outage occurred because of electrical power failures after both the main and back-up power systems shut down. The power systems shut down because of a malfunctioning air conditioning unit. No details have been released about exactly why the air conditioning unit malfunctioned, but additional information could quickly be added to the Cause Map as it becomes known.

The final step in the Cause Mapping process is to develop and implement solutions to reduce the risk of the problem reoccurring. The investigation into this incident is still ongoing and no information about potential long-term solutions has been announced. In the short term, callers were asked to dial 311 or call their closest fire station or police district station if they heard a busy signal or were otherwise unable to get through to 911. It is probably not a bad idea for all of us to have the numbers of our local fire and police stations on hand, just in case.

Software Error Causes 911 Outage

October 23, 2014 Kim Smiley

On April 9, 2014, more than 6,000 calls to 911 went unanswered. The problem was spread across seven states and went on for hours. Calling 911 is one of those things that every child is taught and every person hopes they will never need to do – and having emergency calls go unanswered has the potential to turn into a nightmare.

The Federal Communications Commission (FCC) investigated this 911 outage and has released a study detailing what went wrong on that day in April. The short answer is that a software error led to the unanswered calls, but there is nearly always more to the story than a single “root cause”. A Cause Map, an intuitive format for performing a root cause analysis, can be used to better understand this issue by visually laying out the causes (plural) that led to the outage.

There are three steps in the Cause Mapping process. The first is to define an issue by completing an Outline that documents the basic background information and how the problem impacts the overall goals. Most incidents impact more than one goal and this issue is no exception, but for simplicity let’s focus on the safety goal. The safety goal was impacted because there was the potential for deaths and injuries. Once the Outline is completed (including the impacts to the goals), the Cause Map is built by asking “why” questions.

The second step of the Cause Mapping process is to analyze the problem by building the Cause Map. Starting with the impacted safety goal – “why” was there the potential for deaths and injuries? This occurred because more than 6,000 911 calls were not answered. An automated system was designed to answer the calls and it wouldn’t accept new calls for hours. There was a bug in the automated system’s software AND the issue wasn’t identified for a significant period of time. The error occurred because the software used a counter with a pre-set limit to assign calls a tracking number. The counter hit the limit and couldn’t assign a tracking number so it quit accepting new calls.
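The counter failure described above can be illustrated with a short sketch. This is not the actual call-routing software, whose code has not been published here; the class name, the function name, and the tiny limit of 5 are all made up for the example. It only shows how a counter with a fixed ceiling silently turns into a refusal to accept new calls.

```python
from typing import Optional, Tuple


class TrackingCounter:
    """Hands out sequential tracking numbers up to a fixed, pre-set limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit      # hypothetical ceiling chosen at design time
        self.next_id = 1

    def assign(self) -> Optional[int]:
        """Return the next tracking number, or None once the limit is hit."""
        if self.next_id > self.limit:
            return None
        assigned = self.next_id
        self.next_id += 1
        return assigned


def route_calls(counter: TrackingCounter, incoming: int) -> Tuple[int, int]:
    """Try to route `incoming` calls; return (accepted, rejected).

    A call that cannot be given a tracking number is rejected, which is
    the behavior the investigation describes.
    """
    accepted = rejected = 0
    for _ in range(incoming):
        if counter.assign() is None:
            rejected += 1
        else:
            accepted += 1
    return accepted, rejected


if __name__ == "__main__":
    # With an illustrative limit of 5, the sixth and later calls are refused.
    print(route_calls(TrackingCounter(limit=5), incoming=8))
```

Nothing in this sketch raises an error or stops running when the limit is reached, which is part of why a failure like this can go unnoticed unless something is watching the rejection count.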

The delay in identifying the problem is also important to capture in the investigation, because the problem would have been much less severe if it had been found and corrected more quickly. Any 911 outage is a problem, but one that lasts 30 minutes is less alarming than one that plays out over 8 hours. In this example, the system identified the issue and issued alerts, but categorized them as “low level” so they were never flagged for human review.

The final step in the Cause Mapping process is to develop and implement solutions to reduce the risk of the problem recurring. In order to fix the issues with the software, the pre-set limit on the counter has been increased and will periodically be checked to ensure that the max isn’t hit again. Additionally, to help improve how quickly a problem is identified, an alert has been added to notify operators when the number of successful calls falls below a certain percentage.
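A success-rate alert of the kind described above might look something like the following sketch. The report does not give the actual percentage or how it is measured, so the 90% threshold, the function name, and the idea of checking one time window at a time are assumptions made purely for illustration.

```python
def needs_operator_alert(answered: int, attempted: int,
                         min_success_rate: float = 0.90) -> bool:
    """Return True when the share of successful calls drops below the threshold.

    `min_success_rate` is a hypothetical value; the real threshold
    has not been published.
    """
    if attempted == 0:
        # No calls in this window, so there is nothing to judge.
        return False
    return answered / attempted < min_success_rate


if __name__ == "__main__":
    # Example windows as (answered, attempted); the numbers are invented.
    for answered, attempted in [(99, 100), (40, 100), (0, 0)]:
        print(answered, attempted, needs_operator_alert(answered, attempted))
```

The design point is that the check measures the outcome people care about (calls being answered) rather than relying on the software to correctly rate the severity of its own internal errors.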

New issues will likely continue to crop up as emergency systems move toward internet-powered infrastructure, but hopefully the systems will become more robust as lessons are learned and solutions are implemented. I imagine there aren’t many experiences more frightening than frantically calling 911 for help and having no one answer.

To view a high level Cause Map of this issue, including a completed Outline, click on “Download PDF” above.

1990 Cascading Long Distance Failure

May 9, 2014 ThinkReliability Staff

On January 15, 1990, a cascading failure left tens of thousands of people in the Northeast US without long distance service for up to 9 hours. Over 50 million calls were blocked, at an estimated loss of $60 million. (Remember, there weren’t really any other ways to quickly connect outside of the immediate area at the time.)

We can examine this historical incident in a Cause Map, or visual root cause analysis, to demonstrate what went wrong, and what was done to fix the problem. First, we begin with the impact to the goals. No impacts to the safety, environmental, or property goals were discussed in the resources I used, but it is possible they were impacted, so we’ll leave those as unknown. The customer service and production goals were clearly impacted by the loss of service, which was considerable and estimated to cost $60 million, not including time for troubleshooting and repairs.

Asking “Why” questions allows development of the cause-and-effect relationships that led to the impacted goals. In this case, the outage was due to a cascading switch failure: 114 switches crashed and rebooted over and over again. A switch would crash upon receiving a message from a neighboring switch. This message was meant to inform other switches that one switch was busy to ensure messages were routed elsewhere. (A Process Map demonstrating how long distance calls were connected is included on the downloadable PDF.) Unfortunately, instead of allowing the call to be redirected, the message caused the receiving switch to crash. This occurred when an errant line of code allowed optional tasks to overwrite crucial communication data. The error was included in a software upgrade designed to increase throughput of messages.
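To see why a crash triggered by a status message keeps spreading, consider the toy simulation below. It does not reproduce the real switch software. The six-switch ring, the cap on delivered messages, and the rule that every crashed switch reboots and immediately announces its status to its neighbors are all simplifying assumptions made for this sketch.

```python
from collections import deque
from typing import Dict, List


def simulate(neighbors: Dict[int, List[int]], first: int, buggy: bool,
             max_messages: int = 1000) -> Dict[int, int]:
    """Count crashes per switch after switch `first` announces its status.

    `max_messages` only keeps the toy model from running forever.
    """
    crashes = {switch: 0 for switch in neighbors}
    # Each queue entry is a status message: (sender, receiver).
    queue = deque((first, n) for n in neighbors[first])
    delivered = 0
    while queue and delivered < max_messages:
        _sender, receiver = queue.popleft()
        delivered += 1
        if buggy:
            # Faulty handling: the message crashes the receiver, which
            # reboots and then announces its own status to its neighbors.
            crashes[receiver] += 1
            queue.extend((receiver, n) for n in neighbors[receiver])
        # Correct handling would simply note the status and reroute,
        # generating no further messages in this model.
    return crashes


if __name__ == "__main__":
    # Hypothetical ring of six switches, each linked to two neighbors.
    ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
    print("correct handling:", simulate(ring, first=0, buggy=False))
    print("buggy handling:  ", simulate(ring, first=0, buggy=True))
```

In this model the correct version settles after the first two messages, while the buggy version never runs out of messages: every crash creates new announcements, so every switch ends up crashing repeatedly until the cap stops the run.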

It’s not entirely clear how the error (one added line of code that would bring down a huge portion of the long distance network) was released. The line appears to have been added after testing was complete, during a busy holiday season. That a line of code was added after testing seems to indicate that the release process wasn’t followed.

In this case, a solution needed to be found quickly. The upgraded software was pulled and replaced with the previous version. Better testing was presumably used afterward, because a problem of this magnitude has rarely been seen since.

To view the Outline, Cause Map and Process Map, please click “Download PDF” above.