Software Error Causes 911 Outage

By Kim Smiley

On April 9, 2014, more than 6,000 calls to 911 went unanswered.  The problem was spread across seven states and went on for hours.  Calling 911 is one of those things that every child is taught and every person hopes they will never need to do –  and having emergency calls go unanswered has the potential to turn into a nightmare.

The Federal Communications Commission (FCC) investigated this 911 outage and has released a study detailing what went wrong on that day in April.  The short answer is that a software error led to the unanswered calls, but there is nearly always more to the story than a single “root cause”.  A Cause Map, an intuitive format for performing a root cause analysis, can be used to better understand this issue by visually laying out the causes (plural) that led to the outage.

There are three steps in the Cause Mapping process. The first is to define an issue by completing an Outline that documents the basic background information and how the problem impacts the overall goals.  Most incidents impact more than one goal and this issue is no exception, but for simplicity let’s focus on the safety goal.  The safety goal was impacted because there was the potential for deaths and injuries.  Once the Outline is completed (including the impacts to the goals), the Cause Map is built by asking “why” questions.

The second step of the Cause Mapping process is to analyze the problem by building the Cause Map.  Starting with the impacted safety goal – “why” was there the potential for deaths and injuries?  This occurred because more than 6,000 911 calls were not answered.   An automated system was designed to answer the calls and it wouldn’t accept new calls for hours.  There was a bug in the automated system’s software AND the issue wasn’t identified for a significant period of time.  The error occurred because the software used a counter with a pre-set limit to assign calls a tracking number.  The counter hit the limit and couldn’t assign a tracking number so it quit accepting new calls.

The delay in identification of the problem is also important to identify in the investigation because the problem would have been much less severe if it had been found and corrected more quickly.  Any 911 outage is a problem, but one that lasts 30 minutes is less alarming than one that plays out over 8hours.  In this example, the system identified the issue and issued alerts, but categorized them as “low level” so they were never flagged for human review.

The final step in the Cause Mapping process is to develop and implement solutions to reduce the risk of the problem recurring.  In order to fix the issues with the software, the pre-set limit on the timer has been increased and will periodically be checked to ensure that the max isn’t hit again.  Additionally, to help improve how quickly a problem is identified, an alert has been added to notify operators when the number of successful calls falls below a certain percentage.

New issues will likely continue to crop up as emergency systems move toward internet-powered infrastructure, but hopefully the systems will become more robust as lessons are learned and solutions are implemented.  I imagine there aren’t many experiences more frightening than frantically calling 911 for help and having no one answer.

To view a high level Cause Map of this issue, including a completed Outline, click on “Download PDF” above.