Root Cause Analysis - Incident Investigation

Pilot Locked in Bathroom Nearly Results in Terror Alert

November 30, 2011 Kim Smiley

For a flight to take off and land safely, many complex mechanical systems have to work properly. Pilots also need to be properly trained and proficient at their jobs, and airline processes have to run smoothly to ticket, screen and board all the passengers.

The number of things that have to work for a successful commercial airline flight is impressive. A recent incident highlighted that even the smallest hiccup, a broken bathroom lock for example, has the potential to cause big issues in the complex world of commercial flights.

On November 18, 2011, a pilot accidentally got locked inside a bathroom just prior to landing at LaGuardia. This incident almost resulted in an emergency being declared and a terrorist alert being issued. In order to understand this incident, a Cause Map can be built. A Cause Map is a visual root cause analysis that illustrates the cause-and-effect relationships between all the Causes that contribute to an event.

In this example, the copilot considered declaring an emergency because the pilot was gone from the cockpit longer than expected and an unknown man with an accent knocked on the cockpit door. The copilot was concerned that this might be a hijacking attempt. His concern was caused by the intended destination being NYC and the 9/11 attacks that occurred there 10 years earlier.

The pilot was taking longer than normal because the bathroom door lock had jammed when he tried to exit after a bathroom break. The unknown man was a well-intentioned passenger who had heard the pilot calling for help. The pilot had given him the password to access the cockpit because all other crew members were inside the cockpit. There were two reasons that all other crew members were inside the cockpit. First, regulations require that at least 2 crew members be inside the cockpit at all times. Second, this was a small airplane staffed with only 3 crew members. If the pilot or copilot needed to use the restroom, the only flight attendant had to enter the cockpit to meet the rules.
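The cause-and-effect relationships in this incident can also be written down as a simple data structure. The sketch below is only one possible reading of the causes described in this post: the wording of each cause, the grouping, and the helper name `leaf_causes` are assumptions made for illustration, not part of the published Cause Map.

```python
# Hypothetical encoding of the causes described in this post.
# Each effect maps to the list of causes that contributed to it.
CAUSE_MAP = {
    "copilot considered declaring an emergency": [
        "copilot feared a hijacking attempt",
    ],
    "copilot feared a hijacking attempt": [
        "pilot absent longer than expected",
        "unknown man knocked on cockpit door",
        "destination was NYC, site of 9/11 attacks",
    ],
    "pilot absent longer than expected": ["bathroom door lock jammed"],
    "unknown man knocked on cockpit door": [
        "passenger heard pilot calling for help",
        "pilot gave passenger the cockpit password",
    ],
    "passenger heard pilot calling for help": ["bathroom door lock jammed"],
    "pilot gave passenger the cockpit password": [
        "all other crew members were inside the cockpit",
    ],
    "all other crew members were inside the cockpit": [
        "regulations require 2 crew members in cockpit",
        "flight staffed with only 3 crew members",
    ],
}


def leaf_causes(effect, cause_map):
    """Follow the "why" chain from an effect down to causes with no recorded cause."""
    causes = cause_map.get(effect)
    if not causes:
        # Nothing further recorded for this cause, so the chain ends here.
        return {effect}
    leaves = set()
    for cause in causes:
        leaves |= leaf_causes(cause, cause_map)
    return leaves


leaves = leaf_causes("copilot considered declaring an emergency", CAUSE_MAP)
for cause in sorted(leaves):
    print(cause)
```

Walking the map this way shows that the jammed lock is only one of several causes at the end of the chain. The staffing level and the cockpit rule contributed as well, which is the point of mapping every cause rather than stopping at the first one.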

Luckily, the pilot was eventually able to free himself from the bathroom and return to the cockpit before anything too exciting happened. The plane landed as scheduled. The FBI and Port Authority cops met the plane, but after briefly talking to the passenger involved it was quickly determined that nothing suspicious had occurred.

First Airline Fine for Tarmac Delay

November 22, 2011 Kim Smiley

The Department of Transportation (DOT) recently issued the first fine for violating new rules that limit how long passengers can be kept onboard a plane waiting on the tarmac. The new regulations, commonly called the tarmac delay rule, state that passengers may not be kept onboard a plane waiting on the runway for more than 3 hours without being given the opportunity to deplane. The rules also require that airlines provide adequate food and drinking water for passengers within 2 hours of a plane being delayed on the tarmac and maintain operable lavatories. The tarmac delay rule, which went into effect in April 2010, was created following several incidents where passengers were kept onboard airplanes for long periods of time.

The incident that resulted in a fine is not the first violation of the 3 hour rule, but this is the first time the DOT has taken the step of issuing a fine. The potential fines for violating the rules are substantial. Airlines can be fined $27,500 per passenger when the tarmac delay is beyond 3 hours. This quickly adds up, especially if multiple flights are involved. In this example, 15 American Eagle flights were delayed beyond the 3 hour limit on May 29, 2011 at O’Hare International Airport in Chicago. A total of 608 passengers were affected and American Eagle was fined a whopping $900,000.
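To put the numbers in perspective, here is a quick back-of-the-envelope calculation using only the figures quoted in this post. It assumes, purely for illustration, that the maximum per-passenger fine could be applied to every affected passenger; the variable and function names are made up for this sketch.

```python
FINE_PER_PASSENGER = 27_500   # maximum fine per passenger quoted above, in dollars
PASSENGERS_AFFECTED = 608     # passengers on the 15 delayed flights
ACTUAL_FINE = 900_000         # fine actually issued, in dollars


def max_potential_fine(passengers, per_passenger=FINE_PER_PASSENGER):
    """Upper bound on the fine if every passenger is counted at the maximum rate."""
    return passengers * per_passenger


maximum = max_potential_fine(PASSENGERS_AFFECTED)
actual_per_passenger = ACTUAL_FINE / PASSENGERS_AFFECTED

print(f"Maximum potential fine: ${maximum:,}")
print(f"Actual fine per passenger: ${actual_per_passenger:,.2f}")
print(f"Actual fine as a share of the maximum: {ACTUAL_FINE / maximum:.1%}")
```

Under that assumption the maximum works out to $16,720,000, so the $900,000 fine, large as it is, is a small fraction of what the rule allows.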

What happened? How were so many flights on the tarmac so long?

This example can be analyzed by building a Cause Map, a method for performing a visual root cause analysis. A Cause Map is built by determining the cause-and-effect relationships between all the causes that contributed to an incident. Click on “Download PDF” above to view a high-level Cause Map of this incident.

As with many airline delays, inclement weather played a major role in this incident. Flights had been delayed taking off from O’Hare and planes that were scheduled to have departed were still sitting at the gates. Planes that landed had nowhere to go so they sat on the tarmac waiting for an open gate.

Passengers were not given an opportunity to deplane within 3 hours. The airline had procedures to get passengers off the planes even if the planes themselves were stuck waiting on the tarmac, but the procedures were not implemented within the 3 hour time limit. If there were no delay limit, an airline couldn’t violate it, so the recent creation of the tarmac delay rule is also a cause to consider in this incident.

It will be interesting to see how this large, first-of-its-kind fine affects the airline industry as a whole. Statistics show that the new rules have successfully reduced long tarmac delays. In the first year that the rule was in effect, airlines reported only 20 tarmac delays of more than 3 hours, but in the 12 months prior to the rule there were 693 delays of more than 3 hours. But this improvement may come at a high cost. Especially now that the DOT has shown that it is willing to issue fines, industry analysts are warning that a possible unintended consequence of the new tarmac delay rule will be more canceled flights. The fines are so hefty that airlines may cancel entire flights rather than risk violating the tarmac delay rule, which would obviously have an impact on travelers. Only time will tell how the new rules will affect airline travel.
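The size of the improvement is easy to quantify from the two figures quoted above. This is a minimal sketch; the variable names are chosen for illustration only.

```python
delays_before = 693   # delays over 3 hours in the 12 months before the rule
delays_after = 20     # delays over 3 hours in the first year under the rule

# Relative reduction in long tarmac delays.
reduction = (delays_before - delays_after) / delays_before
print(f"Reduction in long tarmac delays: {reduction:.1%}")
```

That is a drop of roughly 97 percent, which is why the rule is considered successful even while its possible side effects are still being debated.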

Plane Crash Kills Hockey Team

November 17, 2011 ThinkReliability Staff

Hockey fans were devastated when, on September 7, 2011, a Yak-42 plane carrying a Russian hockey team, including many former NHL players, crashed shortly after takeoff. A total of 44 people were killed, including 36 passengers and 8 crew members. One crew member survived the crash. This incident was the 7th fatal crash to occur in Russia since June, and resulted in the company that operated the plane losing its license.

Now that the Russian air safety organization has released results from its investigation, we can map the details of the crash into a Cause Map, or visual root cause analysis. The Cause Map begins with the impacts to the goals. The deaths of the crew and passengers are an impact to the safety goal. The company losing its operating license can be considered an impact to the organizational goal. The damage to the plane is an impact to the property goal. All these impacts to the goals were caused by the plane crashing into a riverbank shortly after takeoff.

We ask “Why” questions to add more detail to the map. It has been determined that the plane crashed because it had insufficient speed during takeoff, and the takeoff was not aborted. It is also possible that the pilot was attempting to make an emergency landing in the river, and missed. The plane had insufficient speed during takeoff because the brake was pressed. Studies determined that a foot had to be placed on the brake pedal in order for the brake to be activated. Because of the force being used on the control column, it is likely that one of the pilots was attempting to push down using his foot as a brace. The pilots who were flying the plane were more familiar with (and were simultaneously being trained on) another type of plane. This plane – the Yak-40 – has a foot rest where the Yak-42’s brake pedal is located. Normally pilots are only trained on one type of plane at a time to minimize this sort of confusion.

In addition, at some point during takeoff, the engine was idled. This would normally indicate that takeoff is being aborted. Once the engine was brought back into service, it took some time to regain takeoff power – and the speed had already dropped. Aviation experts say that takeoff could have been aborted and the crash would have been avoided. However, it does not appear that an abort attempt was made. Flight recordings indicate confusion and a lack of effective communication in the cockpit. Prior to the engine being idled, one of the pilots pushed the control stick forward, after which it was pulled back to resume takeoff. The crew on this plane had never trained together before, which is fairly typical and may be part of the reason for the recent poor safety record of planes in Russia. Additionally, the pilot had Phenobarbital in his system, which is known to slow reaction time. Recommendations to improve the safety of small planes of regional carriers in Russia have been under consideration since the recent rash of crashes. The loss of many popular hockey players may increase the urgency to implement these solutions.

To view the Outline and Cause Map, please click “Download PDF” above.

Bluff Collapse Releases Coal Ash

November 9, 2011 ThinkReliability Staff

On October 31, 2011, a bluff collapsed at a power plant on the shores of Lake Michigan. The resulting mudslide took trailers, storage units, at least one truck and an unknown amount of coal ash into the lake, which provides drinking water for more than 40 million people. Cleanup is ongoing, but the overall impact to the environment has not yet been determined. Fortunately, no one was inside any of the trailers, storage units or vehicles that ended up in the lake, so there were no injuries.

Although the safety goal was not impacted by this incident, there was the potential for personnel injury. Additionally, the environmental, customer service, property and labor goals were impacted by the pollution of the lake, loss of property and necessary cleanup. The causes for these impacts to the goals can be examined in a Cause Map, or visual root cause analysis.

The mudslide that took the objects and coal ash into the lake was caused by insufficient stability of a bluff overlooking the lake. The bluff’s instability was caused by a combination of unstable ground material, water and a lack of vegetation. The vegetation had been removed for construction. The ground in the area had been filled with coal ash – a practice allowed in previous decades. Coal ash is less stable than soil, especially when it is exposed to water. In this case, aerial images suggest that the water seeped into the area from a high water table or from an unlined retention pond used to store storm water. Although a construction project was ongoing, an environmental impact study – which may have unearthed concerns about the stability of the area – was not considered necessary.

Steps are being taken to clean up the lake to the extent possible. However, concerns about coal ash in this area and others are prompting a review by Congress to determine how coal ash can be safely dealt with. Many say this incident suggests that stronger controls are needed.

To view the Outline and Cause Map, please click “Download PDF” above.

BlackBerry’s Widespread Failure

November 3, 2011 ThinkReliability Staff

BlackBerry faced yet another setback last month when service went down worldwide for multiple days. Research in Motion (RIM), the company behind BlackBerry, was already facing stiff competition from other smartphone vendors. It apologized profusely for the outage and vowed to woo back its customers. What caused the extensive and possibly business-ending service outage?

A root cause analysis can help identify what occurred. The first step is to outline the incident. The service outage originated in Europe, then spread to four other continents over a 72-hour period. Customers were furious about the service outage and the slow PR response from the company. This outage impacted two major RIM goals – to generate revenue for shareholders and maintain customer satisfaction. Working backwards from these goals, the Cause Map shows what events led to the catastrophic failure and where further investigation is needed.

The company faces a potential loss of revenue if it loses customers. The company may not have had to worry about the impact of such service outages in the past…except that now there are viable alternatives such as Apple and Android devices. Continuing to work backwards, customers were upset because of a service outage. At this point, it helps to examine the BlackBerry network architecture.

BlackBerry’s architecture is fundamentally different from that of Apple and Android. All data is filtered through the company’s internal service network before being passed on to carrier networks such as Sprint and Verizon. Apple and Android don’t provide processing in the middle. When BlackBerry’s core switch failed in an English data center, a backup switch was supposed to take over. It had been tested successfully. Unfortunately, the backup didn’t work, leading to a buildup of messages waiting to be processed. That mountain of messages led to backlogs in other data centers worldwide. When the switch failed, it also corrupted the database software managing all the messages within the network.
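The reason a buildup of messages can outlast the failure itself can be illustrated with a toy queue model. The sketch below is not RIM's system and every number in it is hypothetical. It simply assumes messages keep arriving at a steady rate, nothing is processed while both the core and backup switches are down, and processing resumes at a fixed capacity afterwards.

```python
def simulate_backlog(arrivals_per_hour, capacity_per_hour, outage_hours, total_hours):
    """Return the number of queued messages at the end of each hour.

    All rates are hypothetical. During the outage no messages are processed;
    afterwards the network processes up to capacity_per_hour.
    """
    backlog = 0
    history = []
    for hour in range(total_hours):
        capacity = 0 if hour < outage_hours else capacity_per_hour
        backlog = max(0, backlog + arrivals_per_hour - capacity)
        history.append(backlog)
    return history


# Hypothetical example: a 3-hour outage with 50% spare capacity once service resumes.
history = simulate_backlog(
    arrivals_per_hour=100, capacity_per_hour=150, outage_hours=3, total_hours=10
)
print(history)
```

In this made-up scenario the queue is not cleared until several hours after processing resumes, even though the outage itself lasted only 3 hours. The less spare capacity a centralized network has, the longer users feel a failure after it has been fixed.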

It turns out that this network architecture is both a liability and at the heart of the company’s business success. By centrally processing all data messages – both compressing and encrypting them – RIM provides additional security and reduces the processing required at the user device, meaning lower energy use and a longer battery life. Despite these strengths, RIM would be wise to find out why its network crashed. As users store more data within the network – as with cloud computing – outages could cripple the system for even longer.
