Root Cause Analysis - Incident Investigation

Boeing 747 “Dreamlifter” Cargo Jet Lands At Wrong Airport

November 27, 2013 Kim Smiley

On November 21, 2013, a massive Boeing 747 Dreamlifter cargo jet made national headlines after it landed at the wrong airport near Wichita, Kansas. For a time, the Dreamlifter looked to be stuck at the small airport with a relatively short runway, but it was able to take off safely the next day after some quick calculations and a little help turning around.

At the time of the airport mix-up, the Dreamlifter was on its way to McConnell Air Force Base to retrieve Dreamliner nose sections made by nearby Spirit Aerosystems. Dreamlifters are notably large because they are modified jumbo jets designed to haul pieces of Dreamliners between the different facilities that manufacture parts for the aircraft.

So how does an airplane land at the wrong airport? It’s not entirely clear yet how a mistake of this magnitude was made. The Federal Aviation Administration is planning to investigate the incident to determine what happened and to see whether any regulations were violated. What is known is that the airports have some similarities in layout that can be confusing from the air. First off, there are three airports in fairly close proximity in the region. The intended destination was McConnell Air Force Base, which has a runway configuration similar to Jabara airfield, where the Dreamlifter landed by mistake. Both runways run north-south and are nearly parallel. It can also be difficult to determine how long a runway is from the air, so the shorter length isn’t necessarily easy to see. Beyond the airport similarities, the details of how the plane landed at the wrong airport haven’t been released yet.

What is known can be captured by building an initial Cause Map, a visual format for performing a root cause analysis. One of the advantages of Cause Maps is that they can be easily expanded to incorporate more information as it becomes available. The first step in Cause Mapping is to fill in an Outline with the basic background information and to list how the issue impacts the overall goals. There are a number of goals impacted in this example. The potential for a plane crash means that there was an impact to both the safety and property goals because of the possibility of fatalities and damage to the jet. The effort needed to ensure that the jet could safely take off on a shorter runway is an impact to the labor goal, and the delay was an impact to the schedule goal. The negative publicity surrounding this incident can also be considered an impact to the customer service goal.

Once the Outline is completed, the Cause Map is built by asking “why” questions and intuitively laying out the answers until all the causes that contributed to the issue are documented. Click on “Download PDF” above to see an Outline and initial Cause Map of this issue.

Good luck with any air travel planned for this busy holiday week. And if your plane makes it to the right airport (even if it’s a little late), take a moment to be thankful because it’s apparently not the given I’ve generally assumed.

Can the Epidemic of Smartphone Thefts be Stopped?

November 21, 2013 Kim Smiley

About 1.6 million handheld devices were stolen in the United States in 2012, the majority of which were smartphones. In fact, the frequency at which the popular Apple devices are taken has given rise to a whole new term, “apple picking”. Stolen smartphones cost consumers nearly $30 billion a year. These thefts affect a significant number of smartphone owners with approximately 10 percent reporting that they have had a device stolen.

The problem of smartphone theft can be analyzed by building a Cause Map, a method for performing a visual root cause analysis. A Cause Map is built by completing an Outline by both filling in the basic background information and listing how the issue impacts the overall goals. The impacts to the goals from the Outline are then used as the first step in building the Cause Map. Causes are then added by asking “why” questions to determine what other causes contributed to an issue. (To view a high level Cause Map of this issue, click on “Download PDF” above.)

So why do so many smartphones get taken? Smartphones are a popular target because it is lucrative to resell them, they are relatively easy to steal, and many of the crimes go unpunished. Smartphones are fairly easy to steal because they are readily available since so many people carry them, and they are both small and lightweight. Many criminals who steal smartphones go unpunished because there are so many phones taken and it is difficult to locate the thieves. Many stolen smartphones are shipped overseas, which further complicates the situation.

The black market for smartphones is lucrative because the items are popular and relatively expensive to buy new. People buy stolen smartphones because they are cheaper and still work for the “new owner”, especially overseas, where the networks are different and phones deactivated in the US may still be usable.

One of the possible solutions suggested to reduce the number of smartphone thefts is to include a kill switch in smartphone software. This kill switch would essentially make the phone worthless because it would no longer function no matter where it was in the world. If smartphones no longer have resale value, then there would be little incentive to steal them and the number of thefts should dramatically decrease. While this idea is elegant in its simplicity, like most things there is more that needs to be considered.

The addition of a kill switch was recently rejected by cellphone carriers because of concerns about hacking and problems with reactivation. If hackers found a way to flip the kill switches they would have the ability to destroy a huge number of smartphones from anywhere in the world. Depending on how many users were targeted this could have a huge impact, which could be especially problematic for people who use their phones in an official capacity like law enforcement. It doesn’t take much imagination to see how this scenario could go horribly wrong. The proposed kill switch is also permanent so users won’t be able to reactivate their phones and any stolen phones that were recovered would be useless. Companies continue to work on a number of ideas to make it more difficult to resell smartphones, but there isn’t general agreement on the best approach yet. Only time will tell if the tide of smartphone thefts has peaked.

Pilot Response to Turbulence Leads to Crash

November 12, 2013 ThinkReliability Staff

All 260 people onboard Flight 587, plus 5 on the ground, were killed when the plane crashed into a residential area on November 12, 2001. Flight 587 took off shortly after another large aircraft and experienced turbulence. According to the NTSB, the pilot’s overuse of the rudder mechanism, which had been redesigned and as a result was unusually sensitive, resulted in such high stress that the vertical stabilizer separated from the body of the plane.

This event is an example of an Aircraft Pilot Coupling (APC) event. According to the National Research Council, “APC events are collaborations between the pilot and the aircraft in that they occur only when the pilot attempts to control what the aircraft does. For this reason, pilot error is often listed as the cause of accidents and incidents that include an APC event. However, the [NRC] committee believes that the most severe APC events attributed to pilot error are the result of the adverse APC that misleads the pilot into taking actions that contribute to the severity of the event. In these situations, it is often possible, after the fact, to analyze the event carefully and identify a sequence of actions the pilot could have taken to overcome the aircraft design deficiencies and avoid the event. However, it is typically not feasible for the pilot to identify and execute the required actions in real time.”

This crash is a case where it is tempting to chalk up the accident to pilot error and move on. However, a more thorough investigation of causes identifies multiple issues that contributed to the accident and, most importantly, multiple opportunities to increase safety for future pilots and passengers. The impacts to the goals, causes of these impacts, and possible solutions can be organized visually in cause-and-effect relationships by using a Cause Map. To view the Outline and Cause Map, please click “Download PDF” above.

The wake turbulence that initially affected the flight was due to the small separation between the flight and a large plane that took off 2 minutes earlier (the separation required by the FAA). This led to a recommendation to re-evaluate the separation standards, especially for extremely large planes. In the investigation, the NTSB found that the training provided to pilots on this particular type of aircraft was inadequate, especially because changes to the aircraft’s flight control system rendered the rudder control system extremely sensitive. This combination is believed to be what led to the overuse of the rudder system, leading to stress on the vertical stabilizer that resulted in its detachment from the plane. Specific formal training for pilots based on the flight control system for this particular plane was incorporated, as were an evaluation of the changes to the flight control system and a requirement for handling evaluations whenever design changes are made to the flight control systems of previously certified aircraft. A caution box related to rudder sensitivity was incorporated on these planes, as was a detailed inspection to verify the stabilizer-to-fuselage and rudder-to-stabilizer attachments. An additional inspection was required for planes that experience extreme in-flight lateral loading events. Lastly, the airplane upset recovery training aid was revised to assist pilots in recovering from upsets such as the one in this event.

Had this investigation been limited to a discussion of pilot error, revised training might have been developed, but the causes behind the other solutions recommended or implemented after this accident would likely never have been discussed. It’s important to ensure that incident investigations address all the causes, so that as many solutions as possible can be considered.

The Morris Worm: The First Significant Cyber Attack

November 7, 2013 Kim Smiley

In 1988 the world was introduced to the concept of a software worm when the Morris worm made headlines for significantly disrupting the fledgling internet. The mess left in the wake of the Morris worm took several days to clean up. Estimates of the cost of the Morris worm vary greatly, from $100,000 to $10,000,000, but even at the low end of the range the numbers are substantial.

A Cause Map, or visual root cause analysis, can be used to analyze this issue. A Cause Map is built by asking “why” questions and using the answers to visually lay out the causes that contributed to an issue to show the cause-and-effect relationships. In this example, a programmer was trying to build a “harmless” worm that could be used to gauge the size of the internet, but he made a mistake. The goal was to infect each computer one time, but the worm was designed to duplicate itself every seventh time a computer indicated it already had the worm to make the worm hard to defend against. The problem was that the speed of propagation was underestimated. Once released, the worm quickly reinfected computers over and over again until they were unable to function and the internet came crashing down. (To view a Cause Map of this example, click on “View PDF” above.)
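The reinfection rule described above can be sketched in a few lines of Python. This is only an illustrative model under simplifying assumptions, not the worm's actual code: the function name `copies_after_attempts` and the parameter `reinfect_every` are hypothetical, and the every-seventh-time rule is treated as a strict counter on a single computer.

```python
def copies_after_attempts(attempts, reinfect_every=7):
    """Count worm copies on one computer after repeated infection attempts.

    Assumption (simplified model): the first attempt always installs a copy.
    On every later attempt the computer reports it is already infected, and
    every `reinfect_every`-th such report is ignored, adding another copy.
    """
    if attempts <= 0:
        return 0
    already_infected_reports = attempts - 1
    return 1 + already_infected_reports // reinfect_every


# The number of copies keeps growing for as long as attempts keep arriving.
for attempts in (1, 8, 71, 701):
    print(attempts, "attempts ->", copies_after_attempts(attempts), "copies")
```

Even in this simplified model, the number of copies on a single computer has no upper limit, which is consistent with machines eventually becoming unable to function once the worm spread faster than expected.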

One of the lasting impacts from the Morris worm that is hard to quantify is the impact on cyber security. The worm exploited known bugs that no one had worried about enough to fix. At the time of the Morris worm, there was no commercial traffic on the internet, nor were there even Web sites. The people who had access to the internet were a small, elite group, and concerns about cyber security hadn’t really come up. If the first “hacker” attack had had malicious intent behind it and had come a little later, it’s likely that the damage would have been much more severe. While the initial impacts of the Morris worm were all negative, it’s a positive thing that it highlighted the need to consider cyber security relatively early in the development of the internet.

It’s also interesting to note that the programmer behind the Morris worm, Robert Tappan Morris, became the first person to be indicted under the 1986 Computer Fraud and Abuse Act. He was sentenced to a $10,050 fine, 400 hours of community service, and three years of probation. Morris was a 23-year-old graduate student at the time he released his infamous worm. After this initial hiccup, Morris went on to have a successful career and now works in the MIT Computer Science and Artificial Intelligence Laboratory.

16-Day Government Shutdown Affects Economy

November 1, 2013 Holly Maher

On October 1, 2013 at 12:01 AM, the beginning of the 2014 fiscal year, the federal government shut down all non-essential operations when Congress could not pass a continuing resolution to allow spending at current levels. The government shutdown lasted 16 days and, in addition to other impacts, closed the National Parks system (see our blog about the park closures), furloughed 800,000 federal employees, had the potential to impact payment of veterans’ benefits and negatively impacted the economy, both directly and indirectly.

So what caused the government shutdown? If you watched any TV during that 16-day period, you could certainly hear any number of experts (on both sides) explaining who was to blame. In keeping with the intent of the Cause Mapping methodology, this analysis of the government shutdown is not trying to identify the one person, the one group or the one reason to blame for the shutdown. Instead, we will identify all the causes required to produce this effect. This will allow us to identify many possible solutions for preventing it from happening again. We start by asking “why” questions and documenting the answers to visually lay out all the causes that contributed to the shutdown. The cause-and-effect relationships are laid out from left to right.

In this example, the government shutdown occurred because Congress could not pass a continuing resolution bill. A line item defunding the Affordable Care Act (ACA) had been added to the continuing resolution, and it could not be agreed upon. A continuing resolution was required because the Constitution gives the power to spend money to Congress, and since Congress had not passed a budget for fiscal year 2014, a continuing resolution was constitutionally required to continue operating the government after October 1. The defunding of the ACA was added to the continuing resolution bill because the ACA was about to go into effect and because such items can be added on a line-item basis. Congress was unable to compromise to reach an agreement to pass the continuing resolution.
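As a rough illustration, the cause-and-effect relationships in the paragraph above could be written down as a simple mapping from each effect to its causes. The sketch below is one possible reading of the text, not the published Cause Map: the names `cause_map` and `ask_why` are hypothetical, and the labels are paraphrased. Indentation in the output stands in for the left-to-right layout.

```python
# Hypothetical structure: each effect maps to the causes named in the text.
cause_map = {
    "Government shutdown": [
        "Continuing resolution not passed",
        "Continuing resolution required",
    ],
    "Continuing resolution not passed": [
        "Line item defunding the ACA added",
        "Congress unable to compromise",
    ],
    "Line item defunding the ACA added": [
        "ACA about to go into effect",
        "Items can be added on a line-item basis",
    ],
    "Continuing resolution required": [
        "Constitution gives Congress the power to spend",
        "No budget passed for fiscal year 2014",
    ],
}


def ask_why(effect, causes_by_effect, depth=0):
    """Return indented lines: the effect, then each of its causes, recursively."""
    lines = ["  " * depth + effect]
    for cause in causes_by_effect.get(effect, []):
        lines.extend(ask_why(cause, causes_by_effect, depth + 1))
    return lines


# Each extra level of indentation answers one more "why" question.
print("\n".join(ask_why("Government shutdown", cause_map)))
```

Because the map is just data, a newly identified cause can be added as another entry without changing the rest of the analysis.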

So why was Congress unable to reach an agreement? If the incentive to compromise had been greater than the incentive not to compromise, they would have compromised. So why is the incentive to compromise ineffective? One reason is that Congress’s pay is not affected when the government shuts down. Another reason is that there is significant incentive to maintain a position aligned with the party (either left or right). The desire to get re-elected (re-election is unlimited within Congress), the need for support in the primaries to get re-elected (based on the current primary system), and the need for campaign financing are all causes that support the incentive to maintain alignment with the party rather than compromise.

Once all the causes of the government shutdown have been identified, possible solutions to prevent the shutdown from happening again can be brainstormed. One possible solution would be to legally require a continuing resolution to be a “clean” bill, with no additional line items. This would make it less likely that future debates over hot-button items such as the ACA would result in a failure to pass the continuing resolution and therefore a government shutdown. Another possible solution would be to stop pay for Congress during a government shutdown. Other more global, systemic solutions might be to implement term limits in Congress or to provide government campaign financing to reduce the dependency on party financial support.

To view the Outline and Cause Map, please click “Download PDF” above.
