Root Cause Analysis - Incident Investigation

Lexington Plane Crash 2006

March 28, 2008 Kim Smiley

Root cause analysis can be a very effective technique for analyzing a problem. But what if the evidence trail goes cold? Is creating a Cause Map still useful when unanswered questions remain after a thorough investigation? The crash of a Comair jet in Lexington, Kentucky on August 27, 2006 is a good example of this situation. The plane crashed during takeoff, killing 49 people. The flight crew mistakenly attempted to take off from the wrong runway, which was too short for the plane to reach the speed needed for liftoff. Even after a detailed investigation by the National Transportation Safety Board, it is still not clear why the flight crew used the wrong runway. By all accounts, the pilot and the first officer were competent professionals, and neither had a history of making errors of this magnitude.

Plane crashes are unusual in that a lot of data is available to investigators. The cockpit voice recorder (CVR) records conversations in the cockpit and the flight data recorder (FDR) records instrument readings. Usually the reasons behind a plane crash can be determined from this data. In this case, the information provided some useful insight, but no clear reason why the mistake occurred.

High Level Cause Map

Building a Cause Map of this accident does make one thing very clear: many events had to occur for this mistake to happen. One of the causes of the plane crash is clearly the error on the part of the flight crew, but another cause is the failure of the air traffic controller to catch and correct the error. There were two separate windows of time in which the controller had an opportunity to prevent the plane crash, but didn’t for a variety of reasons.

It’s tempting to say the plane crashed because the crew used the wrong runway and leave it at that. The main problem with this line of reasoning is that the conclusion doesn’t help prevent future crashes, especially since the error isn’t well understood. If all the focus is placed on why the wrong runway was used, an opportunity to improve the process and prevent future accidents is lost. In a case where information is missing, building a Cause Map can be useful because it helps the investigation explore all the causes and potential solutions. Only one cause needs to be eliminated to prevent the accident. For instance, the crew could have lined up on the wrong runway and the accident could still have been prevented if the controller had caught the mistake. Focusing on solutions that eliminate the better-understood causes provides a useful place to start.
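The point that eliminating any single cause prevents the accident can be sketched in a few lines of Python. This is a simplified illustration, not the actual Cause Map: the three cause names paraphrase this post rather than the NTSB report, and the names `REQUIRED_CAUSES`, `accident_occurs` and `preventing_causes` are invented for the example.

```python
# Hypothetical, simplified cause list for illustration only. The wording
# paraphrases the post; it is not taken from the NTSB report.
REQUIRED_CAUSES = frozenset({
    "crew lined up on the wrong runway",
    "controller did not catch the error",
    "runway too short to reach takeoff speed",
})


def accident_occurs(causes_present):
    """The effect happens only if every required cause is present (AND logic)."""
    return REQUIRED_CAUSES <= set(causes_present)


def preventing_causes():
    """Return the causes whose removal, on its own, prevents the accident."""
    return sorted(
        cause
        for cause in REQUIRED_CAUSES
        if not accident_occurs(REQUIRED_CAUSES - {cause})
    )


if __name__ == "__main__":
    for cause in preventing_causes():
        print("Removing this cause alone prevents the accident:", cause)
```

Because the causes are joined by AND, every entry in the list is a candidate solution, including the ones that are well understood even though the runway error itself is not.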

Learn more about the Lexington Plane Crash.

Levee Break – Fernley, NV

March 26, 2008 Kim Smiley

By Kim Smiley

Just after 4 a.m. on January 5th, 2008, about 600 homes began flooding in Fernley, Nevada, about 25 miles east of Reno. A 50-foot section of a canal embankment failed, flooding the adjacent area. The 32-mile canal carried water from the Truckee River south to Fallon-area farms. There were no injuries in the flooding, but it easily could have been very serious. Complete estimates for repairing the canal and the homes are not available at this time.

A U.S. Bureau of Reclamation report released March 20th concluded that the century-old irrigation canal failed due to burrowing rodents. A simple root cause analysis of this incident using the Cause Mapping method captures the tunneled holes in the embankment as one of the causes. Another cause is the increased water flow in the canal from the nearly 2 inches of rain that fell the day before. The annual rainfall for the area is about 5 inches.

The Cause Map shows that the canal obviously failed because the stress on the embankment was greater than the strength of the embankment. The increased water flow added to the stress on the embankment while the holes tunneled by the rodents reduced the strength. A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.
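The stress-versus-strength relationship can be written as a small Python sketch. Every number below is invented purely to show the logic and uses arbitrary units; none comes from the Bureau of Reclamation report, and the function name `embankment_fails` is hypothetical. The only figures taken from the post are the rainfall amounts.

```python
def embankment_fails(base_stress, added_stress, base_strength, strength_loss):
    """Failure occurs when total stress exceeds remaining strength.

    All arguments are in the same arbitrary units (illustrative only).
    """
    stress = base_stress + added_stress        # e.g. extra flow after heavy rain
    strength = base_strength - strength_loss   # e.g. holes tunneled by rodents
    return stress > strength


# Rainfall figures from the post: nearly 2 inches in a day vs. about 5 per year.
rain_fraction_of_annual = 2 / 5   # roughly 40% of a typical year's rain

if __name__ == "__main__":
    # Made-up numbers: neither cause alone is enough, but together they are.
    print(embankment_fails(50, 20, 100, 0))    # extra flow only
    print(embankment_fails(50, 0, 100, 40))    # rodent holes only
    print(embankment_fails(50, 20, 100, 40))   # both causes present
```

With these assumed values, only the combination of higher stress and reduced strength produces a failure, which is why both causes appear on the map.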

Since the canal is almost 100 years old, tunneling muskrats are not a surprise. If the holes had been identified earlier and filled, the risk of the breach would have been reduced significantly. The evidence that the inspection and maintenance of the canal was ineffective is the fact that the canal failed due to holes. An effective inspection program would have found the holes and addressed them – that’s the purpose of inspection and maintenance. Past inspections may have been conducted exactly as required, which simply means the previous inspection requirements were inadequate. Ineffective inspection is one of the causes of the canal failure that would need to be investigated further.

The attached PDF file contains an intermediate level root cause analysis of the canal failure. It includes causes that were considered in the Bureau of Reclamation report as well as some of the evidence and solutions. A more detailed Cause Map can be created from the specific information in the bureau’s report.

18 Sailors Trapped in Capsized Tugboat

March 23, 2008 Kim Smiley

By Kim Smiley

On Sunday, March 23, a Ukrainian tugboat collided with a Chinese-registered cargo ship. The tugboat capsized and sank in 115 feet of water, trapping 18 sailors inside the hull. All 25 passengers on the cargo ship and seven passengers on the tugboat were rescued. Experts believe the trapped sailors could still be alive if they were able to find air pockets inside the boat. Unfortunately, no signal or sound from within the capsized vessel has been detected during the nine rescue attempts made so far. Rescue efforts continue, but are hindered by low visibility and strong currents.

There is very little information currently available on how the collision happened. Even though the details are vague, it can be very useful to apply the root cause analysis method during this stage of an investigation. Knowing some of the basic causes that have to be present for each type of incident can help direct the investigation efforts. For example, if a fire occurs, you already know that a spark, oxygen and fuel were present, and you can start the investigation by considering each of these causes.

In the case of the tugboat collision, there are a number of causes that had to be present for the collision to occur, and they could be used as starting places for the investigation. Beyond the really basic ones, such as the fact that two ships had to be present, there are a few facts that can be assumed from the beginning. First, there are strict rules of the road that govern the path of ships, especially near land, similar to the laws that govern vehicle traffic. Somebody didn’t follow the rules, and figuring out who didn’t and why will go a long way toward explaining why the ships collided. Second, every ship should maintain situational awareness and avoid other ships (even if the other ship is doing something strange), and both ships failed to keep their distance from each other. Either this was a failure to properly monitor position or the methods used were inadequate. In this specific case, from the damage that both ships sustained, I’d also be willing to bet that somebody was going too fast too close to shore.

Each type of accident has fundamental causes that had to be present for it to occur. While many investigations lead far beyond the causes that can initially be assumed, those causes can be a helpful place to start. Performing a root cause analysis can help guide an investigation and ensure all the pertinent questions are asked and answered.
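One way to picture starting from the causes that must be present is a simple checklist keyed by incident type. The Python sketch below is only an illustration: the entries paraphrase the fire and collision examples in this post, and the names `REQUIRED_CAUSES` and `open_questions` are made up for the example.

```python
# Hypothetical checklist; entries paraphrase the examples in this post.
REQUIRED_CAUSES = {
    "fire": ["ignition source (spark)", "oxygen", "fuel"],
    "collision": [
        "two vessels present",
        "rules of the road not followed",
        "position monitoring failed or inadequate",
    ],
}


def open_questions(incident_type, confirmed):
    """Return required causes not yet backed by evidence, in checklist order."""
    confirmed = set(confirmed)
    return [c for c in REQUIRED_CAUSES[incident_type] if c not in confirmed]


if __name__ == "__main__":
    # Early in an investigation, only the most basic cause is confirmed.
    for cause in open_questions("collision", ["two vessels present"]):
        print("Still to investigate:", cause)
```

Each item left on the list is a question the investigation still has to answer with evidence.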

Tacoma Narrows Part 2: Failure of a Design

March 20, 2008 Kim Smiley

By Kim Smiley

The mechanics behind the failure of the Tacoma Narrows Bridge were discussed in a previous blog entry. There were many design issues with the bridge and the civil engineering community has done an excellent job of studying and incorporating lessons learned from the failure. But a question that may be more pertinent across all engineering disciplines is, “Why did the design process fail?”

How did a bridge get built that would fail in a little over four months? A root cause analysis of the bridge shows that factors that shaped the doomed bridge design are present in almost every engineering project. There is as much to learn from the failed process that led to the design as there is from the failed design.

The primary factor that led to the bridge design was cost reduction. The first design proposed for the Tacoma Narrows Bridge was a conventional suspension bridge that was estimated to cost $11 million. Funding was an issue for the bridge from the beginning, and the design that was finally approved was an elegant bridge with a narrow roadbed and shallow girders. In addition to being more aesthetically pleasing, the design carried an estimated price tag of $8 million that was nicer to look at as well. Another contributing factor was the engineer behind the second design, the very well-known civil engineer Leon Moisseiff. His credentials were impeccable, and he had previously consulted on the famed Golden Gate Bridge, the Bronx-Whitestone Bridge and others. Additionally, he helped develop some of the methods used throughout the world to calculate forces in suspension bridges.

In a tale that is probably repeating somewhere right now, a cheaper, flashier design was recommended by a well-respected engineer. Nobody wanted to listen to the voices of dissent among the less well-known engineers (and there were engineers who spoke out against the new bridge design, saying it was unsafe). The project then failed dramatically.

As engineers, there is a lot we can learn from studying how past projects have balanced cost and safety. There are stories where remarkable profits and success have been achieved by finding a cheaper way to do something. But sometimes, as in the case of the Tacoma Narrows Bridge, the cheap way costs more in the end.

Learn more about the failure of the Tacoma Narrows Bridge.

Deadly NYC Crane Accident

March 19, 2008 Angela Griffith

By ThinkReliability Staff

Unfortunately, an investigation into a deadly construction accident is currently underway in New York City. On Saturday, March 15, a 19-story crane collapsed. Four construction workers were killed and 18 others were injured. Emergency workers are still sorting through the rubble in an attempt to find any remaining survivors. The crane was being used at a high-rise construction site and was attached to the side of a skyscraper. Details as to why the crane fell are still vague, but eyewitnesses report that a piece of steel fell and severed at least one tie that held the crane to the building. Once the connection between the crane and the building was weakened, the crane toppled and split into two pieces. As it fell, the crane smashed a four-story townhouse and damaged parts of three other buildings.

What made the crane fall? Part of doing a root cause analysis is sorting the pertinent facts from all the information that is available. Is it relevant that neighbors had complained that the construction crews were working illegal hours and that the building seemed to be going up too quickly? City officials had issued 13 violations to the construction project, which at first glance seems like a red flag indicating a lack of attention to safety. But Mayor Bloomberg has said that this is a normal number of violations for a project this size. Additionally, the crane had been inspected on the day before the accident and no violations were issued. Did something change in 24 hours, or was the inspection inadequate? At the time the crane fell, it was being raised to enable work to begin on the next floor of the building. Did this contribute to the accident? Where did the piece of steel that reportedly fell come from? At this point in the investigation there are more questions than answers.

High Level Cause Map

There are many facts and theories that surface in the wake of any accident, and part of doing a root cause analysis is determining which are actually relevant. This is much easier said than done. The push to provide answers quickly can add to the pressure to produce a “cause” for the accident. But as anyone familiar with the concept of root cause analysis knows, there isn’t a single “cause”; there are many causes that contributed to the accident. The best approach is to record all possible causes and continue to gather evidence until you can eliminate the noise and are left with the true causes. Then the work of creating solutions that address the causes can begin.

Tacoma Narrows: Failure of a Bridge

March 16, 2008 Kim Smiley

By Kim Smiley

The power of performing a root cause analysis of a problem can be demonstrated by working through well-known engineering disasters. For example, creating a cause map for the failure of the Tacoma Narrows Bridge helps explain why the bridge collapsed and illustrates some of the lessons that can be learned.

The original Tacoma Narrows Bridge was opened for traffic on July 1, 1940. A little more than four months later, the bridge violently failed and a 600-foot span of roadbed fell into the water below. Why did the bridge tear itself apart? What made the bridge collapse on November 7th and not some previous day? One of the first questions asked when performing a root cause analysis is, “What is different about this issue?” The first difference to consider was that November 7th was a windy fall day. Construction of the bridge ended in the summer, so this was the first fall the new bridge had experienced. On the day the bridge failed, the wind was blowing across the roadbed at 42 mph. This was the strongest the wind had blown since the bridge was constructed. The second difference was the design of the bridge itself. The Tacoma Narrows Bridge was particularly narrow relative to its length, making the roadbed more flexible than those of other suspension bridges. Additionally, the bridge had shallow girders and was relatively weak in torsion compared to other suspension bridges built around the same time. The combination of fall winds and the slender bridge design resulted in the collapse of the bridge.

High Level Cause Map

As the wind struck the bridge, the force twisted the roadbed until it was constrained by the suspender cables, and then it twisted back in the other direction. Other suspension bridges of the time experienced similar twisting motions, but what made this bridge different was that the amplitude of the motion increased with each cycle, rather than dying out. The bridge was unable to dissipate the wind energy, and the twisting motion continued to grow until the suspender cables snapped and the roadbed dropped into the water below. The mathematical explanation of why the bridge collapsed is fairly complex, but simply put: the bridge did not have enough damping to dissipate the energy the wind fed into it, so the twisting oscillations grew rather than decayed with each cycle.
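The difference between motion that dies out and motion that grows can be illustrated with a toy single-degree-of-freedom oscillator. This is a sketch under strong assumptions, not the actual aeroelastic analysis of the bridge: `zeta` stands for a net damping ratio (structural damping minus the energy fed in by the wind), the values used are made up, and the function name `amplitude_after` is hypothetical. For a linear oscillator, the amplitude changes by a factor of exp(-delta) per cycle, where delta = 2*pi*zeta / sqrt(1 - zeta^2) is the logarithmic decrement.

```python
import math


def amplitude_after(cycles, initial_amplitude, zeta):
    """Peak amplitude of a linear oscillator after a number of full cycles.

    zeta is the net damping ratio (illustrative values only):
      zeta > 0  -> each cycle is smaller than the last (motion dies out)
      zeta < 0  -> each cycle is larger than the last (motion grows)
    """
    if abs(zeta) >= 1:
        raise ValueError("this formula applies only for |zeta| < 1")
    log_decrement = 2 * math.pi * zeta / math.sqrt(1 - zeta ** 2)
    return initial_amplitude * math.exp(-log_decrement * cycles)


if __name__ == "__main__":
    for zeta in (0.02, -0.02):  # made-up values, not measured bridge data
        print(zeta, [round(amplitude_after(n, 1.0, zeta), 2) for n in (0, 5, 10)])
```

With a small positive net damping the twist fades over a few cycles; with a small negative value the same model shows each cycle larger than the one before, which is the behavior described above.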

Learn more in Part 2 of the blog.

Problem Solvers are Specific

March 4, 2008 Mark Galley

By Mark Galley

Have you ever heard anyone say “the procedure is a piece of junk”? If you ask the person whether every step of the 40-step procedure is wrong, they will usually say, “No, not every step.” You can ask them to show you which step is wrong. When they point out that step 14 is wrong, you can ask, “Is every word in step 14 wrong?” They will usually say, “Well, no, not every word, but that 5 is supposed to be a 7.” You can then say, “I understand. That is an issue. Thanks for catching that. I’ll get it updated. These things have got to be clear and accurate.”

The original statement “the procedure is a piece of junk” is too general. It refers to the procedure as one thing, not 40 things. People who blame and complain speak in very general terms. They group things together and generalize. People who are very good at troubleshooting and solving problems naturally think and speak in very specific terms. Analyzing a problem means breaking it down into parts and getting more specific, so that very specific actions (the solutions) can be taken.

Terms like “human error”, “procedure not followed” and “training less than adequate” are used regularly by companies to explain why a particular problem occurred. These terms are too general. They inadvertently give the impression that the cause has been found during their root cause analysis. Knowing that someone didn’t follow a procedure is important, but is not the end of an investigation. We’re just getting to the good stuff. We’re just getting the specific information that created the incident in the first place.

Our interest is not limited to fixing the person who didn’t follow the procedure. We want to address how we developed, approved, utilized and updated this particular procedure so that the procedure process can be improved. It’s about improving how we capture and communicate the best work practices in our organization as a whole. This is the leverage within the organization. To solve problems effectively, be specific. Ask those who blame and complain to help us understand the issue by being more specific.

For more information about improving the problem solving skills within your organization, visit ThinkReliability – specializing in Cause Mapping – Effective Root Cause Analysis training.

Your Expert Root Cause Analysis Resource

Monthly Archives: March 2008
