Tag Archives: root cause analysis

More on the Disappearance of Flight 188

By ThinkReliability Staff

In our previous blog about Flight 188 of Northwest Airlines, we discussed the first step of a root cause analysis investigation – defining the problem – and mentioned that a detailed Cause Map could be developed when more information regarding the incident was released.

The National Transportation Safety Board (NTSB) has recently released a report on what exactly happened to the flight. We can build off of the outline we already developed to put together the Cause Map, or visual root cause analysis.

First we begin with the impacts to the goals. Most importantly, the safety and property goals were impacted due to the potential danger to the flight. This was caused by the plane overshooting the destination. The pilots flew over the destination because they were distracted, warnings were not effectively delivered to them, and they couldn’t see their destination (Minneapolis-St. Paul), since it was after dark and cloudy.

The pilots were distracted by a non-operational activity. The two pilots were using the scheduling software on their laptops, both of which were open in the cockpit (possibly blocking some of the flight display). Both using personal laptops and participating in non-operational activities are prohibited by the airline.

Some may ask how it’s possible that two pilots who were flying a plane – with over a hundred passengers – could be spending all their energy on another activity. Well, the pilots did not actually have any active tasks to fly the plane. The plane was on auto-pilot, and the one task that pilots ordinarily did on a regular basis (which would have certainly alerted the pilots to their position) was sending a position report. However, a dispatcher for the airline had asked the pilots NOT to send a report, as the reports were burdensome and unnecessary.

Warnings did not effectively get through to the pilots by sight or by sound. By sight, either the flight display was physically blocked by the laptops or the pilots weren’t looking at it because they were distracted. By sound, the plane was not equipped to send audible messages (such as chimes or buzzers) to the pilots, text messages sent to them were not acknowledged, and the pilots did not hear calls for them on the radio. The air traffic controllers (who were different from the air traffic controllers who had first had contact with the plane) did not know which frequency the plane was on, so only some messages got through. Because the pilots were using the speaker instead of headsets and were, again, distracted, they missed the messages.

Both of the pilots involved had their licenses revoked. Several procedures were not followed in this instance, and the FAA and individual airlines are working on highlighting the importance of these procedures. Reading about this incident (and seeing that the pilots’ licenses were revoked) will probably do much to highlight the importance of the procedures. Luckily, nobody was hurt for this lesson to be learned.

View the root cause analysis investigation by clicking “Download PDF” above.

Toyota Recall: Problems, Interim Solutions and Permanent Solutions

by Kim Smiley

On September 29, 2009, Toyota/Lexus issued a safety advisory that some 2004-2010 model year vehicles could be prone to a rapid acceleration issue if the floor mat moved out of place and jammed the accelerator pedal. Although the recall is only applicable in the U.S. and Canada because of the type of floor mats used, over 4 million vehicles are affected by the recall.

Although all the solutions to this problem have not yet been implemented, we can look at the issue so far in a Cause Map, or visual root cause analysis. First we define the problem. Here we could consider the problem the recall, or the acceleration problems. We can list all the models and years that are affected by the recall, and that the recall is limited to the U.S. and Canada.

We define the problem with respect to the organization’s goals. There have been at least 5 fatalities addressed by the National Highway Traffic Safety Administration (NHTSA), though some media outlets have reported more. Additionally, the NHTSA has reported 17 accidents (again, some claim more) and has received at least 100 complaints. The fatalities and accidents are impacts to the safety goal. Complaints are impacts to the customer service goal. The recall of more than 4 million cars is an impact to the production/schedule goal, and the replacement of the accelerator pedals and floor mats as a result of the recall is estimated to cost $250 million, which is an impact to the property goal.

Once we’ve completed the outline, we can begin the Cause Map, or the analysis step of the process. The fatalities are caused by vehicle crashes resulting from a loss of control of the vehicle. The loss of control is caused by a sudden surge of acceleration, inability to brake, and sometimes an inability to shut down the engine of the car. Toyota says the sudden bursts of acceleration are caused by entrapment of the accelerator pedal due to interference from floor mats. Toyota disputes the possibility that there may be a malfunction in the electronic control system, saying it’s been ruled out by Toyota research.

The vehicles are unable to brake because the brake is non-functional when the accelerator pedal is engaged, as it is in these cases. Additionally, owners whose models are equipped with keyless ignition cannot quickly turn off their ignition. These models require the ignition button to be pressed for 3 seconds to prevent inadvertent engine stops, and the instructions are not posted on the dashboard, so owners who weren’t meticulous about reading (or remembering) instructions from the owners’ manual may not know how to turn off the car while moving at high speed.

When the Cause Map is complete to a sufficient level of detail, it’s time to explore some solutions. In this case, the permanent solutions (which will reduce the risk of these accidents most significantly) to be implemented by Toyota are to reconfigure the accelerator pedal, replace the floor mats, and install a brake override system which will allow the brakes to function even with the accelerator pedal engaged. However, designing and implementing these changes for more than 4 million cars will take some time, so Toyota owners need interim solutions. Interim solutions are those that do not sufficiently reduce the risk for long-term applicability but can be used as a stop-gap until permanent solutions are put in place. In this case, Toyota has asked owners to remove floor mats, and has put out guidance that drivers who are in an uncontrolled acceleration situation should shift the car into neutral, which will disengage the engine and allow the brake to stop the car.

View the high level summary of the investigation by clicking “Download PDF” above.

Learn more about the recall at the NHTSA website.

Airlink Incidents: Viewing Trends in Visual Form

By ThinkReliability Staff

Over the past three months, South Africa’s Airlink airline has had four incidents, ranging from embarrassing to fatal. Four incidents in such a short period start to point to a trend, which should be investigated to improve processes and increase safety. But how do we start the investigation?

In the Cause Mapping root cause analysis method, we begin by defining the problem. Here we can define four problems, which are the four incidents over the last three months. We can look at one incident at a time in a problem outline, the first step of the Cause Mapping process. We’ll start with the earliest incident first.

On September 24, 2009 at approximately 8 a.m. a Jetstream 41 crashed into a school yard in Durban Bluff just after take-off from Durban International Airport. This was a forced landing necessitated by the loss of an engine. The pilot was killed. There were also two serious injuries of the crew, and a minor injury of a person on the ground. There were no passengers on the plane, and the impact to Airlink’s schedule is unclear. However, the plane was lost.

We can capture this information more clearly and succinctly in an outline. For example, the above paragraph has more than 80 words. The outline, which records the same information, uses only 42 words in an easily understandable visual form. (The outlines for all four incidents can be viewed by clicking on “Download PDF” above.)

The second incident: On November 18, 2009 at 1:30 p.m. a BAE Systems Jetstream 41 aborted take-off for East London and slid off the runway at Port Elizabeth airport. There were high velocity cross winds, and the pilot may have been unable to establish directional control. There were no injuries, no environmental impact and damages to the plane are unknown. However, new travel arrangements had to be made by the airline for all the passengers. The frequency of Airlink incidents is now two in eight weeks. (Over 80 words; the outline has 49 words.)

The third incident: On November 24, 2009 at approximately 8 a.m. a flight en route to Harare carrying a Prime Minister was forced to return to Johannesburg Airport after it experienced a technical fault. There were no injuries, but it caused a delay in the Prime Minister’s schedule. The damage to the airplane is unclear. The frequency of Airlink incidents is now three in two months. (Over 60 words; the outline has 33 words.)

The fourth incident: On December 7, 2009 at approximately 11 a.m. an Embraer 135 commuter jet operated by regional airline SA Airlink hydroplaned and overshot the runway while landing at George Airport during rainy weather. There were five injuries, including a sprained ankle. This incident has led to a poor public perception of the airline and increased supervision from the authorities. We do not have a dollar amount on the property damage. The frequency of Airlink incidents is now four in ten weeks. (Over 70 words; the outline has 42 words.)

In addition to being briefer, the outline provides an easy visual comparison of the four incidents by showing them in a similar visual form. On one page, we can show the timeline and the outlines of the four incidents for easy comparison. This is especially useful as a briefing tool for busy managers.
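The running frequency figures quoted after each incident can be reproduced from the dates alone. The sketch below is only an illustration: the dates come from the text above, while the function name `running_frequency` and the tuple layout are assumptions made for this example and are not part of the Cause Mapping method.

```python
from datetime import date

# Incident dates as given in the text above.
incidents = [
    date(2009, 9, 24),   # forced landing near Durban
    date(2009, 11, 18),  # aborted take-off at Port Elizabeth
    date(2009, 11, 24),  # return to Johannesburg after technical fault
    date(2009, 12, 7),   # runway overshoot at George
]

def running_frequency(dates):
    """For each incident, return (incident count so far, days since the
    first incident, weeks since the first incident rounded to 1 decimal)."""
    first = dates[0]
    result = []
    for count, current in enumerate(dates, start=1):
        days = (current - first).days
        result.append((count, days, round(days / 7, 1)))
    return result

frequency = running_frequency(incidents)
for count, days, weeks in frequency:
    print(f"{count} incident(s) in {days} days (about {weeks} weeks)")
```

The elapsed spans come out to 55, 61, and 74 days after the first incident, which is consistent with the rounded "eight weeks", "two months", and "ten weeks" figures used in the text.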

Genesis Spacecraft Crash

By Kim Smiley

The mission of the Genesis spacecraft was to collect the first samples of the solar wind and return the samples to earth to be analyzed. The goal was to provide fundamental data to help scientists determine the composition of the sun and learn more about the formation of our solar system.

Unfortunately, during descent on September 8, 2004, the Genesis crashed into the earth at high velocity. Its descent was only slowed by air resistance and the collection capsule was damaged on impact.

What happened? What went wrong with the re-entry?

A root cause analysis can be performed to evaluate this incident. The investigation can be documented by building a Cause Map that collects all the information associated with the incident in a visual format that is easy to follow.

In this case, the main goal we’ll consider is the production goal. The production goal was impacted because the collection capsule was damaged, which had the potential to destroy all the physical data collected during the three-year mission.

The investigation can proceed by asking “why” questions and adding the causes to the Cause Map. In this scenario, the collection capsule was damaged because it impacted the earth at high velocity. This occurred because the parachute that was intended to slow the descent to allow for a midair recovery by helicopter failed to deploy.

Post-accident investigation determined that the parachute was never triggered to deploy because gravity switches were installed backwards. The backward installation occurred for several reasons: the design was flawed, the design review process didn’t detect the error and the testing performed didn’t detect the error.
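The chain of "why" questions above can be written down as a simple data structure. The following is a minimal sketch, assuming a plain dictionary that maps each effect to the causes named in this post; the wording of the entries is paraphrased from the text, and the helper names `ask_why` and `deepest_causes` are hypothetical, chosen only for this illustration. It is not the Cause Mapping tool itself.

```python
# Each effect maps to the causes given for it in the text
# (read each entry as "effect was caused by ...").
cause_map = {
    "Production goal impacted": ["Collection capsule damaged"],
    "Collection capsule damaged": ["Capsule hit the earth at high velocity"],
    "Capsule hit the earth at high velocity": ["Parachute failed to deploy"],
    "Parachute failed to deploy": ["Parachute never triggered"],
    "Parachute never triggered": ["Gravity switches installed backwards"],
    "Gravity switches installed backwards": [
        "Design was flawed",
        "Design review did not detect the error",
        "Testing did not detect the error",
    ],
}

def ask_why(effect, causes_of, depth=0):
    """Return indented lines that walk from an effect down through its causes."""
    lines = ["  " * depth + effect]
    for cause in causes_of.get(effect, []):
        lines.extend(ask_why(cause, causes_of, depth + 1))
    return lines

def deepest_causes(effect, causes_of):
    """Return the causes that have no further cause recorded beneath them."""
    causes = causes_of.get(effect, [])
    if not causes:
        return [effect]
    found = []
    for cause in causes:
        found.extend(deepest_causes(cause, causes_of))
    return found

print("\n".join(ask_why("Production goal impacted", cause_map)))
```

Calling `deepest_causes("Production goal impacted", cause_map)` returns the three contributing causes at the end of the chain, which is where a further round of "why" questions would start.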

Luckily, the impact to the production goal has been less significant than it might have been in this case. The collection capsule was cushioned somewhat by the soft ground and while desert dirt entered the capsule, liquid water did not. The solar wind particles were embedded in the collection materials and the contaminating dirt was able to be removed for the most part. NASA has been able to retrieve significant amounts of data from the mission.

NASA’s Mishap Report can be downloaded for free for additional information on the incident.

A one page PDF showing a high level Cause Map of the incident can be downloaded by clicking on the button above.

Grounding of the Empress of the North

By Kim Smiley

On May 14, 2007, the 300-foot cruise ship Empress of the North ran aground on rocks while rounding Rocky Island during a trip through Alaska’s Inside Passage. There was significant damage to the hull and the two starboard propellers needed to be replaced. Costs of repairs totaled more than $4.8 million. Luckily no one was injured, but over two hundred passengers had to be evacuated from the ship.

This is a common route for cruise ships and the rocks were a well-known hazard clearly marked on navigation charts.  So what happened?

A root cause analysis shows that there were many causes that contributed to the accident. One of the causes is that there were no lookouts at the time of the accident. The crew members who would have acted as lookouts were performing security rounds. This was in violation of regulations requiring lookouts at all times and appears to have been a common practice for the crew.

When determining causes it’s important to ask, what is different?  In this case, this was the first watch as Deck Officer for the officer in charge.  He had recently graduated, was newly licensed and inexperienced.  He was not familiar with the deck procedures and the equipment. There was a lot of confusion about watch team roles and he didn’t attempt to take charge of the ship’s navigation until seconds before the grounding occurred.  The National Transportation Safety Board (NTSB) found that the actions, or inaction as the case may be, of the Deck Officer were one of the major factors contributing to the accident.

It’s tempting to stop at this point, but a thorough investigation needs to go further than just identifying the actions of the Deck Officer as a cause. Why was he standing watch if he wasn’t fully qualified? Why wasn’t he prepared adequately prior to being given the responsibility?

The crew member originally assigned the watch was ill.  There are a limited number of possible replacements on a ship this size.  The Master of the ship believed the watch would be a good training watch because it was an easy watch with minimal course corrections needed.  It was also not the practice of the crew to have specific night orders for the overnight watches so the newly arrived junior third officer found himself standing the midnight to 4 am watch with minimal guidance.

Many investigations lead back to human error, but it’s important to ask questions beyond that point. Changing how people are trained, improving the environment, and providing specific written instructions can help prevent human errors in many cases.

(The photo above is an official Coast Guard photo.)