
Software Error Causes 911 Outage

By Kim Smiley

On April 9, 2014, more than 6,000 calls to 911 went unanswered.  The problem was spread across seven states and went on for hours.  Calling 911 is one of those things that every child is taught and every person hopes they will never need to do –  and having emergency calls go unanswered has the potential to turn into a nightmare.

The Federal Communications Commission (FCC) investigated this 911 outage and has released a study detailing what went wrong on that day in April.  The short answer is that a software error led to the unanswered calls, but there is nearly always more to the story than a single “root cause”.  A Cause Map, an intuitive format for performing a root cause analysis, can be used to better understand this issue by visually laying out the causes (plural) that led to the outage.

There are three steps in the Cause Mapping process. The first is to define an issue by completing an Outline that documents the basic background information and how the problem impacts the overall goals.  Most incidents impact more than one goal and this issue is no exception, but for simplicity let’s focus on the safety goal.  The safety goal was impacted because there was the potential for deaths and injuries.  Once the Outline is completed (including the impacts to the goals), the Cause Map is built by asking “why” questions.

The second step of the Cause Mapping process is to analyze the problem by building the Cause Map.  Starting with the impacted safety goal – “why” was there the potential for deaths and injuries?  This occurred because more than 6,000 calls to 911 were not answered.  An automated system was designed to answer the calls, and it stopped accepting new calls for hours.  Two causes combined to produce the outage: there was a bug in the automated system’s software AND the issue wasn’t identified for a significant period of time.  The error occurred because the software used a counter with a pre-set limit to assign each call a tracking number.  When the counter hit the limit, the software could no longer assign tracking numbers, so it quit accepting new calls.
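
The failure mode described above can be illustrated with a minimal sketch.  This is not the actual call-handling software; the class name, method name, and the limit of 3 are hypothetical and chosen only to make the behavior easy to see.

```python
class CallRouter:
    """Hypothetical router that tags each incoming call with a tracking number."""

    def __init__(self, max_tracking_number):
        # Pre-set limit on the counter (assumed value, for illustration only).
        self.max_tracking_number = max_tracking_number
        self.next_tracking_number = 1

    def accept_call(self, caller_id):
        # Once the counter has passed its limit, no tracking number can be
        # assigned, so every new call is rejected until someone intervenes.
        if self.next_tracking_number > self.max_tracking_number:
            return None
        tracking_number = self.next_tracking_number
        self.next_tracking_number += 1
        return tracking_number


router = CallRouter(max_tracking_number=3)
results = [router.accept_call(f"caller-{i}") for i in range(5)]
print(results)  # [1, 2, 3, None, None]
```

Nothing in this sketch crashes when the limit is reached, which is part of why this kind of fault can go unnoticed: the system keeps running and simply stops doing its job.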

The delay in identifying the problem is also important to capture in the investigation because the problem would have been much less severe if it had been found and corrected more quickly.  Any 911 outage is a problem, but one that lasts 30 minutes is less alarming than one that plays out over 8 hours.  In this example, the system detected the issue and issued alerts, but categorized them as “low level” so they were never flagged for human review.

The final step in the Cause Mapping process is to develop and implement solutions to reduce the risk of the problem recurring.  In order to fix the issues with the software, the pre-set limit on the counter has been increased and will be checked periodically to ensure that the maximum isn’t hit again.  Additionally, to help improve how quickly a problem is identified, an alert has been added to notify operators when the number of successful calls falls below a certain percentage.
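
One way such a success-rate alert might work is sketched below.  The window size, the 90% threshold, and all names are assumptions made for illustration; the actual alerting rules used by the 911 system are not described here.

```python
from collections import deque


class SuccessRateMonitor:
    """Hypothetical monitor that alerts when too few recent calls succeed."""

    def __init__(self, window_size, min_success_ratio):
        # Only the most recent `window_size` call outcomes are kept.
        self.outcomes = deque(maxlen=window_size)
        self.min_success_ratio = min_success_ratio

    def record(self, succeeded):
        self.outcomes.append(bool(succeeded))

    def should_alert(self):
        # Wait for a full window so a single early failure does not page anyone.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        ratio = sum(self.outcomes) / len(self.outcomes)
        return ratio < self.min_success_ratio


monitor = SuccessRateMonitor(window_size=10, min_success_ratio=0.9)
for _ in range(10):
    monitor.record(True)
alert_when_healthy = monitor.should_alert()   # False: 10 of 10 succeeded

monitor.record(False)
monitor.record(False)
alert_after_failures = monitor.should_alert()  # True: 8 of the last 10 succeeded
```

The design choice worth noting is that the alert is tied to the outcome operators care about (calls being answered) rather than to an internal severity label, which is what allowed the original alerts to be filed as “low level”.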

New issues will likely continue to crop up as emergency systems move toward internet-powered infrastructure, but hopefully the systems will become more robust as lessons are learned and solutions are implemented.  I imagine there aren’t many experiences more frightening than frantically calling 911 for help and having no one answer.

To view a high level Cause Map of this issue, including a completed Outline, click on “Download PDF” above.

App Takes Down National Weather Service Website

By Kim Smiley

The National Weather Service (NWS) website was down for hours on August 25, 2014.  Emergency weather alerts such as tornado warnings were still disseminated through other channels, but this issue raises questions about the robustness of a vital website.

This issue can be analyzed by building a Cause Map, a visual format for performing a root cause analysis.  Cause Maps are built by laying out all the causes that contributed to a problem to show the cause-and-effect relationships.  The idea is to identify all the causes (plural), not just THE one root cause.

This example is a good illustration of the potential danger of focusing on a single root cause.  The NWS website outage was caused by an abusive Android app that bogged the site down with excessive traffic.  The app was designed to provide current weather information and it pulled data directly from the forecast.weather.gov website.  The app inadvertently queried the website thousands of times a second because of a programming error and the website was essentially overwhelmed.  It was similar to the denial of service attacks that have been directed at websites such as Bank of America and Citigroup, but the spike in traffic in this case wasn’t deliberate.

It may be tempting to say that the app was the root cause. Or you could be more specific and say the programming error was the root cause.  But labeling either of these “the root cause” would imply that you solved the problem once you fix the software error. The root cause is gone, no more problem…right?  In order to address the issue, NWS installed a filter to block the excessive queries and worked with the app developer to ensure the error was fixed, but there are other factors that must be considered to effectively reduce the risk of a similar problem recurring.

One of the things that must be considered in this example is why a filter that blocked denial of service attacks wasn’t already in place.  Flooding a website with excessive traffic is a well-known strategy of hackers.  If an app could accidentally take the site down for hours, it is worrisome to consider what somebody with malicious intent could do.  The NWS is responsible for disseminating important safety information to the public and needs a reasonably robust website.  In order to reduce the impact of a similar issue in the future, the NWS needs to evaluate the protections it has in place for its website and see if any other safeguards should be implemented beyond the filter that addressed this specific issue.
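
As a rough illustration of what a traffic filter does, the sketch below limits each client to a fixed number of requests per time window.  It is not the filter NWS installed; the class, the client identifier, and the limit of 3 requests per second are all hypothetical, and a production site would more likely rely on protections at the network or load-balancer level.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Hypothetical sliding-window limiter: at most N requests per client per window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Recent request timestamps, tracked separately for each client.
        self.history = defaultdict(deque)

    def allow(self, client_id, now):
        timestamps = self.history[client_id]
        # Forget requests that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False  # over the limit: block this request
        timestamps.append(now)
        return True


limiter = RateLimiter(max_requests=3, window_seconds=1.0)

# A misbehaving client sends five requests within half a second.
burst = [limiter.allow("weather-app", now=i * 0.1) for i in range(5)]
print(burst)  # [True, True, True, False, False]

# The same client is served again once the window has passed.
later = limiter.allow("weather-app", now=1.5)
print(later)  # True
```

Because the limit applies per client, one runaway app is throttled without affecting other visitors, whether the excess traffic is accidental or deliberate.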

If the investigation were focused too narrowly on a single root cause, the entire discussion of cyber security could be missed.  Building a Cause Map of many causes ensures that a wider variety of solutions is considered, which can lead to more effective risk prevention.

To view a high level Cause Map of this issue, click on “Download PDF” above.