Root Cause Analysis - Incident Investigation

Why Can’t the Missing Malaysia Airliner be Found?

April 25, 2014 Holly Maher

On March 8, 2014, Malaysia Airlines Flight MH370 took off from Kuala Lumpur heading for Beijing, China. The aircraft had 239 passengers and crew aboard. Less than 1 hour into the flight, communication and radar contact with the aircraft were lost. Forty-nine days later, the location and fate of the aircraft are still unknown despite a massive international effort to locate the missing airliner. The search effort has dominated the news for the last month and the question is still out there: how, with today’s technology, can an entire aircraft go missing?

Since we may never know what happened to flight MH370, this analysis is intended to understand why we can’t find it and identify the causes required to produce this effect. This will allow us to identify many possible solutions for preventing it from happening again. We start by asking “why” questions and documenting the answers to visually lay out all the causes that contributed to this incident. The cause-and-effect relationships lay out from left to right.

In this example, the Customer Service Goal is impacted because we are missing 239 passengers and crew. This is caused by the fact that we can’t locate Malaysia Airlines Flight MH370. The inability to locate the airliner is a result of a number of causes over the 49-day period. One reason is that 3 days were initially spent looking in the wrong location, along the original flight path from Kuala Lumpur to Beijing, in the Gulf of Thailand and the South China Sea. The reason 3 days were mistakenly spent looking in this location is that the airliner had left the original flight path and officials were unaware of that fact. Why the aircraft left the original flight path is still unknown, but we can look at some of the causes that allowed the flight to leave the original flight path undetected.
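The cause-and-effect chain described above can be sketched as a small data structure. This is only an illustrative sketch, not the Cause Map tool itself: the dictionary layout, the `add_cause` and `why_chain` names, and the shortened node wording are all assumptions made for this example.

```python
# Illustrative sketch: a cause map stored as effect -> list of causes.
# Node wording is paraphrased from the article; the structure and the
# function names are hypothetical, not part of any Cause Mapping tool.
cause_map = {}

def add_cause(effect, cause):
    """Record that `cause` is one answer to 'why did `effect` happen?'."""
    cause_map.setdefault(effect, []).append(cause)

add_cause("Customer service goal impacted", "239 passengers and crew missing")
add_cause("239 passengers and crew missing", "Cannot locate flight MH370")
add_cause("Cannot locate flight MH370", "3 days spent searching the wrong location")
add_cause("3 days spent searching the wrong location", "Aircraft left original flight path")
add_cause("3 days spent searching the wrong location", "Officials unaware of the course change")

def why_chain(effect, depth=0):
    """Ask 'why?' repeatedly, returning one indented line per cause."""
    lines = ["  " * depth + effect]
    for cause in cause_map.get(effect, []):
        lines.extend(why_chain(cause, depth + 1))
    return lines

print("\n".join(why_chain("Customer service goal impacted")))
```

Each level of indentation in the printed output corresponds to one more “why” question, which mirrors reading the map from left to right.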

One of the reasons the aircraft was able to leave the original flight path undetected was that air traffic control was unable to track the airplane with radar. The transponder onboard the aircraft, which allows the ground control to track the aircraft using airspeed and altitude, was turned off less than one hour into the flight. We don’t know the reason the transponder was turned off; however, the fact that it is designed to be turned off manually is a cause of the transponder being turned off. It is designed to be manually turned off to reduce risk in the event of failure or fire, and to reduce radio traffic when the airplane is on the ground. After 9/11, when 3 out of the 4 hijacked airplanes had transponders that had been turned off, the airline industry debated the manual on/off design of the transponder, but aviation experts strongly supported the need for the pilots to be able to turn off the transponders, as needed, for the safety of the flight.

Another reason the aircraft left the original flight path undetected was that the flight crew outside the cockpit did not communicate distress or a change of route. This is because all communications from the airplane come from or through the cockpit. The aircraft is not currently equipped to allow for communication, specifically distress communication, from outside the cockpit.

Days into the investigation, radar data was identified which showed the change of course of the aircraft. This changed the area of the search away from the original flight path. However, this radar detection was not identified in real time, as the plane was moving away from the original flight path. This is also a cause of the aircraft being able to leave the flight path undetected.

Once the search area moved west, the size of the potential search area was incredibly large, another cause of being unable to locate the aircraft. At its largest, the search area was 2.96 million square miles. This was based on an analysis of how far the flight could have gotten with the amount of fuel on board. Further analysis of satellite data, or “handshakes” with the computer framework on board the aircraft, continued to refine the search area.

Many people have asked why no one on the flight made cell phone calls indicating distress (if this was an act of terrorism). The reason is that cell phones do not work above 2,000 ft, because at that altitude there is no direct line to a cellular tower.

Another cause of being unable to locate MH370 is being unable to locate the black box. The black box is made of aluminum and is very heavy, designed to withstand significant forces in the event of a crash. This causes the black box to sink, instead of float, making it difficult to locate. The depth of the ocean in which the search is occurring ranges from 4,000 to 23,000 ft, adding to the difficulty of finding the black box. Acoustic pings were last detected from the black box on April 8, 2014, 32 days into the search. This is because the battery life on the black box is ~30 days. This had been the battery design life criterion prior to the Air France Flight 447 crash in 2009. It took nearly 2 years to locate the black box and wreckage from Flight 447; as a result, the design criterion for the black box battery life was changed from 30 days to 90 days. This would allow search crews more time to locate the black box. Malaysia Airlines Flight MH370 still had a black box with a battery life of 30 days.
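The battery timeline above is simple date arithmetic, sketched here for illustration. It assumes the battery clock starts on the day of the flight, which is a simplification made for this example, and the `battery_expiry` helper is a hypothetical name.

```python
from datetime import date, timedelta

DEPARTURE = date(2014, 3, 8)   # flight date given in the article
LAST_PING = date(2014, 4, 8)   # last acoustic detection given in the article

def battery_expiry(start, battery_days):
    """Nominal date the pinger battery runs out (assumes it starts at `start`)."""
    return start + timedelta(days=battery_days)

elapsed = (LAST_PING - DEPARTURE).days
print("Days elapsed between departure and last ping:", elapsed)
print("Same span, counting the departure day as day 1:", elapsed + 1)

for battery_days in (30, 90):
    expiry = battery_expiry(DEPARTURE, battery_days)
    print(f"{battery_days}-day battery nominally lasts until {expiry.isoformat()}")
```

Under these assumptions, the 90-day criterion would have roughly tripled the window in which search crews could listen for the pinger.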

Once the analysis has broken down the incident into its causes, solutions can be identified to mitigate the risk of a similar incident in the future.

To view the Outline and Cause Map, please click “Download PDF” above.

Nearly 2.6 Million GM Vehicles Recalled, Costs Soar to $1.3 Billion

April 15, 2014 Kim Smiley

During the first quarter of 2014, General Motors (GM) recalled 2.6 million vehicles due to ignition switch issues tied to at least 13 deaths. Costs associated with the issue are estimated to be around $1.3 billion and, possibly even more damaging to the long-term health of the company, the company’s reputation has taken a beating.

The ignition switch issues are caused by a small, inexpensive part called a switch detent plunger. An ignition switch has four main positions (off, accessories, on, and start) and the switch detent plunger holds the ignition switch in position. In the accidents associated with the recent recall, the ignition switch slipped out of the on position and into the accessories position because the detent plunger didn’t provide enough torque to hold it in place. When the ignition is put into the accessories mode, the car loses both power steering and power braking, and the air bags won’t inflate. It’s easy to see how a situation that makes a car less safe and more difficult to control can quickly create a dangerous, or even deadly, situation. Additionally, it’s important to know the problem is most likely to occur when driving on a bumpy surface or if a heavy key ring is pulling on the key.

The other key element of this issue is how the problem has been handled by GM. There are a lot of hard questions being asked about what was known about the problem and when it was known. It is known that the faulty part was redesigned in 2006 to address the problem, but the new design of the part wasn’t given a new part number as would normally be done. Multiple federal inquiries are working to determine when it was known that the faulty parts posed a danger to drivers and why there was such a long delay before a recall was done. The fact that the redesigned part wasn’t assigned a new part number has also led to questions about whether there was an attempt to cover up the issue. GM is not civilly liable for deaths and injuries associated with the faulty ignition switches because of its 2009 bankruptcy, but the company could potentially be found criminally liable.

No company ever wants to recall a product, but it’s important to remember that how the recall is handled is just as important as getting the technical details right. Consumers need to believe that a company will do the right thing and that any safety concerns will be addressed quickly and openly. Once consumers lose faith in a company’s integrity, the cost will be far greater than the price of a recall.

If you drive a GM car, you can get more information about the recall here. The recalled models are Chevrolet Cobalts and Pontiac G5s from the 2005 through 2007 model years; Saturn Ion compacts from 2003 through 2007; and Chevrolet HHR SUVs, and Pontiac Solstice and Saturn Sky sports cars from 2006 and 2007.

To view the Outline and Cause Map showing the root cause analysis of this issue, please click “Download PDF” above.

Chicago O’Hare Commuter Train Derailment Injures 33

April 11, 2014 Sarah Wrenn

At 2:49 AM on March 24, 2014, a Blue Line commuter train entered the Chicago-O’Hare International Airport Station, collided with the track bumper post, and derailed, landing on an escalator and stairway. Thirty-two passengers and the train operator were injured and transported to nearby hospitals. Images showing the lead rail car perched on the escalator look like the train was involved in filming an action movie.

So what caused a Chicago Transit Authority (CTA) train, part of the nation’s second largest public transportation system, to derail? We can use the Cause Mapping process to analyze this specific incident with the following three steps: 1) Define the problem, 2) Conduct the analysis and 3) Identify the best solutions.

We start by defining the problem. In the problem outline, you’ll notice we’ve asked four questions: What is the problem? When did it happen? Where did it happen? And how did it impact the goals?

Next we’ll analyze the incident. We start with the impacted goals and begin asking “why” questions while documenting the answers to visually lay out all the causes that contributed to the incident. The cause and effect relationships lay out from left to right. As can be seen in the problem outline, this incident resulted in multiple goals being impacted.

In this incident, 33 people were injured when the train they were riding derailed in the O’Hare station, thereby affecting our safety goal of zero injuries. The injuries were caused by the train derailing, so let’s dig into why the train derailed. Let’s first ask why the train operator was unable to stop the train. Operator statements are crucial to understanding exactly what happened. Here, it is important to avoid blame by asking questions about the process followed by the operator. Interestingly, 45 seconds before the crash, the operator manually reduced the train speed. However, at some point, the train operator dozed off. The train operator’s schedule (working nearly 60 hours the previous week), length of shift, and time off are all possible causes of the lack of rest. Evidence that the operator was coming off of an 18-hour break allows us to eliminate insufficient time off between shifts as a cause. In addition, the train operator was relatively new (she qualified as a train operator in January 2014) and was an “extra-board” employee, meaning she substituted for other train operators who were out sick or on vacation.

Next, let’s ask why the train was unable to stop. An automatic braking system is installed at this station and the system activated when the train crossed the fixed trip stop. The train was unable to stop because there was insufficient stopping distance for the train’s speed. At the location of the trip stop, the train speed limit was 25 mph and the train was traveling 26 mph. While the emergency braking system functioned correctly, the limited distance and the speed of the train did not allow the train to stop.
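The link between speed and stopping distance can be illustrated with the standard constant-deceleration formula, d = v² / (2a). This is only a rough sketch: the deceleration value below is a hypothetical placeholder, not a figure from the investigation, and the function name is made up for this example. The 15 mph case is likewise just an illustrative lower speed.

```python
MPH_TO_FPS = 5280 / 3600  # feet per second in one mile per hour

def stopping_distance_ft(speed_mph, decel_fps2):
    """Distance needed to stop at constant deceleration: d = v^2 / (2a)."""
    v = speed_mph * MPH_TO_FPS
    return v * v / (2 * decel_fps2)

# Hypothetical braking rate in ft/s^2; NOT taken from the investigation.
ASSUMED_DECEL = 4.0

for mph in (26, 25, 15):
    distance = stopping_distance_ft(mph, ASSUMED_DECEL)
    print(f"{mph} mph needs about {distance:.0f} ft to stop")
```

Because distance grows with the square of speed, lowering the speed limit and moving the trip stop farther back both act on the same cause: the gap between the distance needed and the distance available.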

The train derailing impacted multiple organizational goals, but also the personal goal of the train operator, who was fired. During the investigation, we learned that the train operator failed to appear at a disciplinary hearing and had a previous safety violation in which she dozed off and overshot a station. These details reveal themselves on the Cause Map by asking “why” questions.

The final step of the investigation is to use the Cause Map to identify and select the best solutions that will reduce the risk of the incident recurring. On April 4, 2014, the CTA announced proposed changes to the train operator scheduling policy. In addition, the CTA changed the speed limit when entering a station and moved the trip stops to increase the stopping distance. Each of these identified solutions reduces the risk of a future incident by addressing many of the causes identified during the investigation.

Risks of Future Landslides – and Actual Past Landslides – Ignored

April 4, 2014 ThinkReliability Staff

Risk is determined by both the probability of a given issue occurring, and the consequence (impact) if it does. In the case of the mudslide that struck Oso, Washington on March 22, 2014, both the probability and consequence were unacceptably high.

Not only had the probability of a landslide in the area been well documented in reports as far back as 1951, but the same area where dozens were killed on March 22 had also experienced 5 prior landslides since 1949. The consequences of these prior landslides were less severe than those of the 2014 landslide, both because the 2014 landslide itself was more severe and because increased residential development meant more people were in harm’s way.
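One common way to combine the two factors is to multiply them, which can be sketched as follows. The multiplication rule is a widely used convention assumed here, not something stated in the reports, and every number below is a made-up placeholder chosen only to show how more development raises risk even when the probability stays the same.

```python
def risk_score(probability, consequence):
    """Assumed convention: risk = probability x consequence."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if consequence < 0:
        raise ValueError("consequence must not be negative")
    return probability * consequence

# Hypothetical values for illustration only; none come from the reports.
ANNUAL_PROBABILITY = 0.05          # placeholder chance of a slide in a year
scenarios = {
    "little development": 2,       # placeholder count of homes in the path
    "increased development": 40,   # placeholder count of homes in the path
}

for name, homes_at_risk in scenarios.items():
    score = risk_score(ANNUAL_PROBABILITY, homes_at_risk)
    print(f"{name}: expected homes affected per year = {score:.2f}")
```

With the probability held fixed, the score scales directly with the number of homes exposed, which is the effect of residential development described above.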

While the search for victims is still ongoing, the causes and impacts of the landslide are mostly known. This incident can be analyzed using a Cause Map, or visual root cause analysis, to show the cause-and-effect relationships that led to the tragic landslide.

First, we capture the background information and the impact to the goals in the problem outline, thereby defining the problem. The landslide (actually a reactivation of an existing landslide, according to Professor Dave Petley, in his blog) occurred around 10:40 a.m. on March 22, 2014 in an Oso, Washington residential area. As previously noted, there had been prior landslides in the area, and outdated boundaries were used for logging permissions (which we’ll talk more about later). The safety goal was impacted due to the 30 known deaths and 15 people missing. (Not all of the dead have been identified, so the known dead and missing numbers may overlap. However, at this point, there is little hope that any additional rescues will take place.) The environmental goal was impacted due to the landslide, and the customer service goal (insofar as the residents can be considered customers of their local area) was impacted due to the displacement of 30 families. Logging in an area that should have been protected impacts the regulatory goal. The estimated losses (of residences and belongings) are approximately $10 million, impacting the property goal, and the massive search and recovery effort impacts the labor goal.

Beginning with these impacted goals, asking “why” questions allows us to develop cause-and-effect relationships showing how the incident occurred. The safety goal was impacted because of the deaths and missing, which resulted from people being overcome by a landslide. In order for this to occur, the landslide had to occur, and the people had to be in the vicinity of the landslide.

As is known from history (see the timeline on the downloadable PDF), this area is prone to landslides. Previous reports identified the erosion of the area due to the proximity of the river as a cause of these landslides. An additional cause is water seepage in the area. Water seepage is increased when the water table rises from overly wet weather (as is typically found at the end of winter). Trees can help reduce water seepage by absorbing the water. When trees are removed, water seepage in an area can increase significantly. Because of this, removal of trees (for logging or other purposes) is generally restricted near areas prone to landslides. However, logging was permitted in what should have been a restricted area because the maps used to approve it were outdated; why the updated maps were not used is not yet known. Says the geologist who developed the new maps, “I suspect it just got lost in the shuffle somewhere.” Additionally, according to analysis by the Seattle Times, the logging went into the “old” restricted area as well. The State Forester is investigating the allegations and whether the logging played a role in the landslide.

Regardless of the magnitude of the impact of the logging and weather, the area was prone to landslides. Yet it was allowed to be developed, despite multiple reports warning of danger and five previous landslides. In fact, construction in the area resumed just three days after the last landslide in 2006. The 2006 landslide also interrupted a plan to divert the river farther from the landslide area. Despite all of this, the area was built up (with houses built as recently as 2009) and those residents were allowed to stay. (While buying out the residents was under consideration, it was apparently dismissed because the residents did not want to move.) While officials in the area maintain that they thought it was safe, a long history of reports and landslides suggests otherwise.

If a lack of knowledge of the risk of the area continues to be a concern, aerial scanning with advanced technology (lidar) could help. Use of lidar in nearby Seattle identified four times the number of landslide zones that were spotted with aerial surveying, which is more typically used.

To view a summary of the investigation, including a timeline, problem outline and Cause Map, please click “Download PDF” above.
