By Kim Smiley
The National Weather Service (NWS) website was down for hours on August 25, 2014. Emergency weather alerts such as tornado warnings were still disseminated through other channels, but this issue raises questions about the robustness of a vital website.
This issue can be analyzed by building a Cause Map, a visual format for performing a root cause analysis. Cause Maps are built by laying out all the causes that contributed to a problem to show the cause-and-effect relationships. The idea is to identify all the causes (plural), not just THE one root cause.
This example is a good illustration of the potential danger of focusing on a single root cause. The NWS website outage was caused by an abusive Android app that bogged the site down with excessive traffic. The app was designed to provide current weather information, and it pulled data directly from the forecast.weather.gov website. Because of a programming error, the app inadvertently queried the website thousands of times a second, and the website was essentially overwhelmed. The effect was similar to the denial-of-service attacks that have been directed at websites such as Bank of America and Citigroup, but the spike in traffic in this case wasn’t deliberate.
It may be tempting to say that the app was the root cause. Or you could be more specific and say the programming error was the root cause. But labeling either of these “the root cause” would imply that you solved the problem once you fix the software error. The root cause is gone, no more problem…right? To address the issue, the NWS installed a filter to block the excessive queries and worked with the app developer to ensure the error was fixed. But other factors must be considered to effectively reduce the risk of a similar problem recurring.
One of the things that must be considered in this example is why a filter that blocked denial-of-service attacks wasn’t already in place. Flooding a website with excessive traffic is a well-known strategy of hackers. If an app could accidentally take the site down for hours, it is worrisome to consider what somebody with malicious intent could do. The NWS is responsible for disseminating important safety information to the public and needs a reasonably robust website. To reduce the impact of a similar issue in the future, the NWS needs to evaluate the protections it has in place for its website and see if any other safeguards should be implemented beyond the filter that addressed this specific issue.
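To make the idea of a filter that blocks excessive queries concrete, here is a minimal sketch of one common approach, a per-client token bucket. It is an illustration only: how the NWS filter actually works has not been described here, and the names `RateLimiter`, `allow`, `capacity`, `refill_per_sec`, and the choice to key on a client identifier (such as an IP address or app ID) are all assumptions made for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class _Bucket:
    tokens: float       # requests the client may still make right now
    last_refill: float  # clock reading when tokens were last topped up


class RateLimiter:
    """Hypothetical per-client token-bucket filter (not the NWS implementation)."""

    def __init__(
        self,
        capacity: float = 10.0,        # assumed burst size per client
        refill_per_sec: float = 1.0,   # assumed sustained rate per client
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.buckets: Dict[str, _Bucket] = {}

    def allow(self, client_id: str) -> bool:
        """Return True if this request should be served, False if blocked."""
        now = self.clock()
        bucket = self.buckets.get(client_id)
        if bucket is None:
            # A client seen for the first time starts with a full bucket.
            bucket = _Bucket(tokens=self.capacity, last_refill=now)
            self.buckets[client_id] = bucket

        # Add tokens for the time elapsed, never exceeding the capacity.
        elapsed = now - bucket.last_refill
        bucket.tokens = min(self.capacity, bucket.tokens + elapsed * self.refill_per_sec)
        bucket.last_refill = now

        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False
```

A server would call `allow(client_id)` before handling each request and reject the request when it returns `False`. With a scheme like this, a single client sending thousands of queries a second exhausts its own bucket almost immediately, while other clients keep their own allowances. A filter of this kind addresses one cause only; it does not replace a broader review of the site's protections.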
If the investigation were focused too narrowly on a single root cause, the entire discussion of cyber security could be missed. Building a Cause Map of many causes ensures that a wider variety of solutions is considered, which can lead to more effective risk prevention.
To view a high-level Cause Map of this issue, click on “Download PDF” above.