By Kim Smiley
Starting August 8, 2016, thousands of travelers were stranded worldwide after widespread cancelations and delays of Delta Air Lines flights. The disruptions continued for several days, and the effects lingered even longer. The flight problems made headlines around the globe, and the financial impact on the company was significant.
So what happened? What caused this massive headache for so many travelers? The short answer is a small fire in an airline data center, but a much longer answer is needed to understand what caused this incident. A Cause Map, a visual format for performing a root cause analysis, can be used to analyze this issue. A Cause Map visually lays out all of the causes that contributed to an issue so that the cause-and-effect relationships are easy to follow. It is built by asking “why” questions and adding the answers. When an effect has more than one cause, all of the contributing causes are listed vertically and separated by an “and”. (Click on “Download PDF” to see an intermediate level Cause Map of this incident.)
So why were so many flights canceled and delayed? There was a system-wide computer outage, and the airline depends on computer systems for everything from processing check-ins to assigning crews and gates. Bottom line, no flights leave on time without working computer systems. The problems originated at a single data center, but the design of the system allowed them to cascade to computer systems worldwide. The airline has not released specific details about why the problem spread, but investigators would certainly want to understand this in order to prevent a similar cascading failure in the future.
In a statement, the company indicated that an electrical component failed, causing a small fire at the data center. (Again, specifics about the type of component and the cause of its failure haven’t been released.) The fire caused a transformer to shut down, which resulted in a loss of primary power to the data center. A secondary power system did come on, but not all servers were connected to backup power. No details have been released about why some servers were not powered by the secondary power supply.
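The “why” questioning described above can also be sketched in code. The following is a minimal illustration, not the Cause Mapping method itself: the class name `Cause` and the helpers `why_chain` and `leaf_causes` are hypothetical, and the box labels are paraphrased from this article (the “Some servers lost power” box is an inferred intermediate step, not a statement from the airline). Each effect holds a list of causes that are joined by “and”.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cause:
    """One box on the map: an effect plus the causes that jointly produced it."""
    description: str
    # Every cause in this list contributed to the effect (joined by "and").
    causes: list[Cause] = field(default_factory=list)


def why_chain(node: Cause, depth: int = 0) -> list[str]:
    """Walk from an effect to its causes, one indented line per box."""
    lines = ["  " * depth + node.description]
    for i, cause in enumerate(node.causes):
        if i > 0:
            lines.append("  " * (depth + 1) + "AND")
        lines.extend(why_chain(cause, depth + 1))
    return lines


def leaf_causes(node: Cause) -> list[str]:
    """Boxes with no recorded cause: where the 'why' questions currently stop."""
    if not node.causes:
        return [node.description]
    found: list[str] = []
    for cause in node.causes:
        found.extend(leaf_causes(cause))
    return found


# Labels paraphrased from the article; the structure is an illustrative reading.
component = Cause("Electrical component failed")
fire = Cause("Small fire at data center", [component])
transformer = Cause("Transformer shut down", [fire])
primary = Cause("Primary power lost", [transformer])
no_backup = Cause("Some servers not connected to backup power")
servers_down = Cause("Some servers lost power", [primary, no_backup])  # inferred step
cascade = Cause("System design allowed cascading failures")
outage = Cause("System-wide computer outage", [servers_down, cascade])
dependency = Cause("Flight operations depend on computer systems")
top = Cause("Flights canceled and delayed", [outage, dependency])

print("\n".join(why_chain(top)))
print("Still unexplained:", leaf_causes(top))
```

The leaf boxes are the points where no further detail has been released, so they mark where an investigation would keep asking “why”.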
Compounding the frustration, affected travelers were unable to get updated flight information. Flight status systems, including airport monitors, continued to show that all flights were on time during the period of the cancelations and delays.
Once a large number of flights are disrupted, it is difficult to return to a normal flight schedule. The rotation schedules for aircraft and pilots have to be redone, which can be time-consuming. Many commercial flights operate near capacity, so it can be difficult to find seats for all the passengers affected by canceled and delayed flights. Delta has tried to compensate travelers affected by this incident by offering refunds and $200 in travel vouchers to people whose flights were canceled or delayed at least three hours, but an incident of this magnitude will naturally affect customer confidence in the company.
This incident is a good reminder of the importance of building robust systems with functional backups; otherwise a small problem can spread and quickly become a big problem.