On January 15, 1990, a cascading failure left tens of thousands of people in the Northeast US without long distance service for up to 9 hours. Over 50 million calls were blocked, at an estimated loss of $60 million. (Remember, at the time there weren’t really any other ways to quickly connect with anyone outside the immediate area.)
We can examine this historical incident in a Cause Map, or visual root cause analysis, to demonstrate what went wrong, and what was done to fix the problem. First, we begin with the impact to the goals. No impacts to the safety, environmental, or property goals were discussed in the resources I used, but it is possible they were impacted, so we’ll leave those as unknown. The customer service and production goals were clearly impacted by the loss of service, which was considerable and estimated to cost $60 million, not including time for troubleshooting and repairs.
Asking “Why” questions allows development of the cause-and-effect relationships that led to the impacted goals. In this case, the outage was due to a cascading switch failure: 114 switches crashed and rebooted over and over again. A switch would crash upon receiving a message from a neighboring switch. This message was meant to inform other switches that the sending switch was busy, to ensure messages were routed elsewhere. (A Process Map demonstrating how long distance calls were connected is included on the downloadable PDF.) Unfortunately, instead of allowing the call to be redirected, the message caused the receiving switch to crash. This occurred because an errant line of code allowed optional tasks to overwrite crucial communication data. The error was introduced in a software upgrade designed to increase message throughput.
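To see why a single bad message handler can take down a whole network, here is a toy simulation of the cascade. It is only a sketch under stated assumptions, not the actual switch software: the function `simulate`, the four-switch `ring` topology, and the rule that a rebooting switch announces itself to its neighbors are all hypothetical, chosen to mirror the crash-and-reboot loop described above.

```python
from collections import deque


def simulate(neighbors, first_failure, buggy, max_events=1000):
    """Count crashes per switch after one switch fails.

    neighbors: dict mapping each switch to a list of adjacent switches.
    buggy: if True, receiving a status message crashes the receiver
           (assumed behavior of the faulty upgrade).
    max_events: cap on processed messages, since the buggy cascade
                never stops on its own.
    """
    crashes = {switch: 0 for switch in neighbors}
    crashes[first_failure] = 1

    # Status messages in flight, as (sender, receiver) pairs.
    queue = deque((first_failure, n) for n in neighbors[first_failure])
    events = 0

    while queue and events < max_events:
        sender, receiver = queue.popleft()
        events += 1
        if buggy:
            # The message crashes the receiver. Assumption: after it
            # reboots, it announces itself to its own neighbors.
            crashes[receiver] += 1
            queue.extend((receiver, n) for n in neighbors[receiver])
        # Healthy software: the receiver simply routes around the
        # sender, and no further messages are generated.

    return crashes


# Hypothetical four-switch ring: each switch has two neighbors.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

healthy = simulate(ring, first_failure=0, buggy=False)
faulty = simulate(ring, first_failure=0, buggy=True, max_events=50)

print("healthy crashes:", sum(healthy.values()))
print("faulty crashes:", sum(faulty.values()))
```

With the healthy handler, the initial failure stays contained. With the buggy handler, every processed message produces another crash and more messages, so the crash count is limited only by the `max_events` cap.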
It’s not entirely clear how the error (one added line of code that would bring down a huge portion of the long distance network) was released. The line appears to have been added after testing was complete, during a busy holiday season. That a line of code was added after testing suggests that the release process wasn’t followed.
In this case, a solution needed to be found quickly. The upgraded software was pulled and replaced with the previous version. Better testing was presumably used afterward, because a problem of this magnitude has rarely been seen since.
To view the Outline, Cause Map and Process Map, please click “Download PDF” above.