Hubble Focusing Issues [ August 4th, 2008 ]

The Hubble Space Telescope was launched on April 24, 1990. Once it was in orbit, it quickly became clear that its images were blurred. An investigation revealed that Hubble’s primary mirror was not built to specification and couldn’t properly focus light. Specifically, the mirror was flattened too much away from the center, so light reflected from the edge of the mirror focused at a slightly different location than light reflected from the center. The primary mirror was off specification by only 2.3 micrometers, but the result for the $1.5 billion project was disastrous.

Solving Hubble’s focus issues was no small feat. How do you repair a mirror that can’t be replaced in orbit when it is cost prohibitive to bring it back to Earth for repair? The answer was to modify the lens (which met specifications) to work with the off-specification mirror. COSTAR (Corrective Optics Space Telescope Axial Replacement) was added to Hubble during the first servicing mission in December 1993. COSTAR is essentially eyeglasses for Hubble: an additional lens built with the same error as the mirror, but in the opposite direction, so that the effects of the off-specification mirror shape are canceled out. With the addition of COSTAR, Hubble met its original design goals.

The primary mirror was constructed with a flaw because the tool used to create the template that guided the shaping of the mirror, called a null corrector, was itself flawed. Null correctors use precisely located mirrors and lenses to determine the shape of a mirror. To assemble a null corrector, reflected light is used to measure the distance between the mirror and the lens inside the tool. When the null corrector used to shape Hubble’s primary mirror was assembled, a measurement error was made. A small amount of reflective coating had fallen off an internal piece of the instrument, and the laser used to perform the measurement reflected off the wrong location, resulting in a lens being 1.3 mm too far from the mirror. Null correctors are extremely precise and do not change once assembled, so the Hubble team used a single instrument to guide the mirror shape. A single flawed tool and inadequate quality controls resulted in a flawed mirror.

A visual representation of this root cause analysis has been created as a Cause Map that can be downloaded.

When a Cause isn’t a Cause: The Failure of Vytorin

Vytorin is a drug intended to treat heart disease. Millions of people already take it, or one of its components. Full results of its trial were released Sunday, March 30th. Although Vytorin successfully reduced three key risk factors, it did not improve heart disease, because it had no effect on reducing plaque. The three risk factors improved by Vytorin, and thought to lead to plaque buildup, which leads to heart disease, were LDL (low-density lipoprotein, or bad cholesterol), triglycerides (a form of fat found in the blood), and artery inflammation as measured by CRP (C-reactive protein, which is released into the blood due to inflammation). So, if we look at the root cause analysis, we have:

Root Cause Analysis Vytorin Failure

But if this is our Cause Map, and we reduce all three causes, we should reduce the result - plaque formation, which should reduce the occurrence of heart disease.  If we end up with the results we have here, which is no effect on plaque buildup despite proof that the three causes (called “key risk factors” in the medical world) have been reduced, it means there’s a problem with our root cause analysis.  This particular analysis gets even more confusing.  Some drugs, like statins, lower LDL and successfully reduce heart disease.  This implies that the cause-and-effect relationship of LDL and heart disease is valid.  But there was a drug that is no longer being advanced that successfully reduced cholesterol, but actually raised heart risks.  What does all this mean?  It means back to the drawing board on our cause map.  I don’t pretend to have the answers - I don’t think anybody does, or there would be a new drug out there right now - but it means that as you’re reading this, the smart folks developing new drugs are donning their lab coats and trying to figure out what went wrong.

March 31st, 2008

Lexington Plane Crash 2006

Incident Date: August 27, 2006

Root cause analysis can be a very effective technique to analyze a problem. But what if the evidence trail goes cold? Is creating a Cause Map still useful when unanswered questions remain after a thorough investigation? The crash of a Comair jet in Lexington, Kentucky on August 27, 2006 is a good example of this situation. The plane crashed during takeoff, killing 49 people. The flight crew mistakenly attempted to take off on the wrong runway, which was too short for the plane to reach the necessary speed for liftoff. Even after a detailed investigation by the National Transportation Safety Board, it is still not clear why the flight crew used the wrong runway. As an aside, the pilot and the first officer were competent professionals by all accounts, and there is no history of either making errors of this magnitude.

Plane crashes are unusual in that a lot of data is available to investigators. The cockpit voice recorder (CVR) records all conversations in the cockpit and the flight data recorder (FDR) records instrument readings. Usually the reason behind a plane crash can be determined using all this data. In this case, the information did provide some useful insight, but no clear reason why the mistake occurred.

Building a Cause Map of this accident does make one thing very clear. Many events had to occur for this mistake to happen. One of the causes of the plane crash is clearly the error on the part of the flight crew, but another cause is the failure of the air traffic controller to catch and correct the error. There were two separate windows of time in which the controller had an opportunity to prevent the plane crash, but didn’t for a variety of reasons.

It’s tempting to say the plane crashed because the crew used the wrong runway and leave it at that. The main problem with this line of reasoning is that the conclusion doesn’t help prevent future crashes, especially since the error isn’t well understood. If all the focus is placed on why the wrong runway was used, an opportunity to improve the process and prevent future accidents is lost. In a case where there is missing information, building a cause map can be useful because it helps the investigation explore all the causes and potential solutions. Only one cause needs to be eliminated to prevent the accident. For instance, the crew could have lined up on the wrong runway and the accident could still have been prevented if the controller had caught the mistake. Focusing on a solution to eliminate the better understood causes provides a useful place to start.
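
The point that removing any single required cause prevents the outcome can be sketched in a few lines of Python. This is only an illustration, not the Cause Mapping method itself: the `Cause` class and the two cause names are assumptions made for this example, and the real Cause Map contains many more causes than the two shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    """A node in a toy cause map. Every item in `requires` must be present (AND)."""
    name: str
    requires: list = field(default_factory=list)

    def occurs(self, eliminated):
        # A cause that has been eliminated cannot contribute to the outcome.
        if self.name in eliminated:
            return False
        # Otherwise it occurs only if all of its own required causes occur.
        return all(cause.occurs(eliminated) for cause in self.requires)


# Hypothetical, heavily simplified version of the Lexington map.
crash = Cause("plane crash", [
    Cause("crew used the wrong runway"),
    Cause("controller did not catch the error"),
])

print(crash.occurs(set()))                                    # both causes present
print(crash.occurs({"controller did not catch the error"}))   # one cause removed
```

With both causes present the crash occurs; removing either one is enough to prevent it, even if the other is never fully understood.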

A high-level Cause Map of the problem is below:

Root Cause Analysis

March 28th, 2008

Levee Break - Fernley, NV

Incident Date: January 5, 2008 

Just after 4 a.m. on January 5th, 2008, about 600 homes began flooding in Fernley, Nevada, about 25 miles east of Reno. A 50-foot section of a canal embankment failed, flooding the adjacent area. The 32-mile canal carried water from the Truckee River south to Fallon-area farms. There were no injuries in the flooding, but it easily could have been very serious. Complete estimates for repairing the canal and the homes are not available at this time.

A report issued by the U.S. Bureau of Reclamation on March 20th concluded that the century-old irrigation canal failed due to burrowing rodents. A simple root cause analysis for this incident using the Cause Mapping method captures the tunneled holes in the embankment as one of the causes. Another cause is the increased water flow in the canal caused by the nearly 2 inches of rain that fell the day before. The annual rainfall for the area is about 5 inches, so roughly 40 percent of a typical year’s rain fell in a single day.

The Cause Map shows that the canal obviously failed because the stress on the embankment was greater than the strength of the embankment.  The increased water flow added to the stress on the embankment while the holes tunneled by the rodents reduced the strength.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.

Since the canal is almost 100 years old, tunneling muskrats are not a surprise. If the holes had been identified earlier and filled, the risk of the breach would have been reduced significantly. The evidence that the inspection and maintenance of the canals was ineffective is the fact that the canal failed due to holes. An effective inspection program would have found the holes and addressed them - that’s the purpose of inspection and maintenance. Past inspections may have been conducted exactly as required, which simply means the previous inspection requirements were inadequate. Ineffective inspection is one of the causes of the canal failure that would need to be investigated further.

The attached PDF file contains an intermediate-level root cause analysis of the canal failure. It includes causes that were considered in the Bureau of Reclamation report as well as some of the evidence and solutions. A more detailed Cause Map can be created from the specific information in the bureau’s report.

March 26th, 2008

The Danger in Hazardous Chemicals: Arkansas Meat Packing Plant Explosion

Incident Date: March 23, 2008

On Sunday morning, March 23rd, 2008, there was an explosion at the Cargill Meat Solutions plant in Booneville, Arkansas.  Thankfully no injuries have been reported, but 180 people were evacuated due to the ensuing ammonia leak.  Although not much is known about the root causes of the explosion, we can do a very simple analysis.


March 24th, 2008

18 Sailors Trapped in Capsized Tugboat

Incident Date: March 23, 2008

On Sunday, March 23, a Ukrainian tugboat collided with a Chinese-registered cargo ship. The tugboat capsized and sank in 115 feet of water, trapping 18 sailors inside the hull. All 25 passengers on the cargo ship and seven passengers on the tugboat were rescued. Experts believe the trapped sailors could still be alive if they were able to find air pockets inside the boat. Unfortunately, no signal or sound coming from within the capsized ship has been detected during the nine rescue attempts that have occurred so far. Rescue efforts continue, but are hindered by low visibility and strong currents.

There is very little information currently available on how the collision happened. Even though the details are vague, it can be very useful to apply the root cause analysis method during this stage of an investigation. Knowing some of the basic causes that have to be present for each type of incident can help direct the investigation efforts. For example, if a fire occurs, you already know that a spark, oxygen and fuel were present, and you can start the investigation by considering each of these causes.

In the case of the tugboat collision, there are a number of causes that had to be present for the collision to occur, and they could be used as starting places for the investigation. Beyond the really basic, like the fact that two ships had to be present, there are a few facts that can be assumed from the beginning. First, there are strict rules of the road that govern the path of ships, especially near land, similar to the laws that govern vehicle traffic. Somebody didn’t follow the rules, and if you can figure out who didn’t and why, that will go a long way toward explaining why the ships collided. Second, every ship should have situational awareness and avoid other ships (even if that other ship is doing something strange), and both ships failed to keep their distance from the other ship. Either this was a failure to properly monitor position or the methods used were inadequate. In this specific case, from the damage that both ships sustained, I’d also be willing to bet that somebody was going too fast too close to shore.

Each type of accident has fundamental causes that had to be present for it to occur. While many investigations lead far beyond the causes that can initially be assumed, they can be a helpful place to start. Performing a root cause analysis can help guide an investigation and ensure all the pertinent questions are asked and answered.
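
One way to picture this "start from what must have been present" approach is a simple lookup table keyed by incident type. The sketch below is an assumption for illustration only: the table name, the function name and the wording of each entry are hypothetical, with the entries drawn from the fire and collision examples above.

```python
# Hypothetical lookup table: conditions that had to be present for each incident type.
BASIC_CAUSES = {
    "fire": ["spark", "oxygen", "fuel"],
    "ship collision": [
        "two ships in the same place at the same time",
        "rules of the road not followed by at least one ship",
        "neither ship kept its distance from the other",
    ],
}


def starting_questions(incident_type):
    """Turn each basic cause into a question for the investigation to answer."""
    causes = BASIC_CAUSES.get(incident_type, [])
    return [f"Why was this present: {cause}?" for cause in causes]


for question in starting_questions("ship collision"):
    print(question)
```

An incident type that is not in the table simply returns no starting questions, which is a reminder that the table is a starting place and not a substitute for the investigation.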

March 23rd, 2008

Tacoma Narrows Part 2: Failure of a Design

The mechanics behind the failure of the Tacoma Narrows Bridge were discussed in a previous blog entry. There were many design issues with the bridge, and the civil engineering community has done an excellent job of studying and incorporating lessons learned from the failure. But a question that may be more pertinent across all engineering disciplines is, “Why did the design process fail?”

How did a bridge get built that would fail in a little over four months? A root cause analysis of the bridge shows that the factors that shaped the doomed bridge design are present in almost every engineering project. There is as much to learn from the failed process that led to the design as there is from the failed design.

The primary factor that led to the bridge design was cost reduction. The first design proposed for the Tacoma Narrows Bridge was a conventional suspension bridge that was estimated to cost $11 million. Funding was an issue for the bridge from the beginning, and the design that was finally approved was an elegant bridge with a narrow roadbed and shallow girders. In addition to the design being more aesthetically pleasing, the estimated price tag of $8 million was nicer to look at as well. Another contributing factor is the engineer behind the second design, the very well-known civil engineer Leon Moisseiff. His credentials were impeccable, and he had previously consulted on the famed Golden Gate Bridge, the Bronx-Whitestone Bridge and others. Additionally, he helped develop some of the methods used throughout the world to calculate forces in suspension bridges.

In a tale that is probably repeating somewhere right now, a cheaper, flashier design was recommended by a well-respected engineer. Nobody wanted to listen to the voices of dissent among the less well-known engineers (and there were engineers who spoke out against the new bridge design, saying it was unsafe). The project then failed dramatically.

As engineers, we can learn a lot from studying how past projects have balanced cost and safety. There are stories where remarkable profits and success have been achieved by finding a cheaper way to do something. But sometimes, as in the case of the Tacoma Narrows Bridge, the cheap way costs more in the end.
 

March 20th, 2008

Deadly NYC Crane Accident

Incident Date: March 15, 2008

Unfortunately, an investigation into a deadly construction accident is currently underway in New York City. On Saturday, March 15, a 19-story crane collapsed. Four construction workers were killed and 18 others were injured. Emergency workers are still sorting through the rubble in an attempt to find any remaining survivors. The crane was being used at a high-rise construction site and was attached to the side of a skyscraper. Details as to why the crane fell are still vague, but eyewitnesses report that a piece of steel fell and severed at least one tie that held the crane onto the building. Once the connection between the crane and the building was weakened, the crane toppled and split into two pieces. As it fell, the crane smashed a four-story townhouse and damaged parts of three other buildings.

What made the crane fall?  Part of doing a root cause analysis is sorting the pertinent facts from all the information that is available.  Is it relevant that neighbors had complained that the construction crews were working illegal hours and it seemed like the building was going up too quickly?  City officials had issued 13 violations to the construction project, which at first glance seems like a red flag indicating a lack of attention to safety.  But Mayor Bloomberg has said that this is a normal number of violations for a project this size.  Additionally, the crane had been inspected on the day before the accident and no violations were issued.  Did something change in 24 hours or was the inspection inadequate?  At the time the crane fell, it was being raised to enable work to begin on the next floor of the building.  Did this contribute to the accident?  Where did the piece of steel come from that supposedly fell?  At this point in the investigation there are more questions than answers.

There are many facts and theories that surface in the wake of any accident, and part of doing a root cause analysis is determining which are actually relevant. This is a process that is much easier said than done. The push to provide answers quickly can add to the pressure to produce a “cause” for the accident. But as anyone familiar with the concept of root cause analysis knows, there isn’t a single “cause”; many causes contributed to the accident. The best approach is to record all possible causes and continue to gather evidence until you can eliminate all the noise and are left with the true causes. Then the work of creating solutions that address the causes can begin.

A very high level cause map of the crane accident is below:

Root Cause Analysis Crane Incident

March 19th, 2008

Tacoma Narrows: Failure of a Bridge

The power of performing a root cause analysis of a problem can be demonstrated by working through well-known engineering disasters.  For example, creating a cause map for the failure of the Tacoma Narrows Bridge helps explain why the bridge collapsed and illustrates some of the lessons that can be learned. 

The original Tacoma Narrows Bridge was opened for traffic on July 1, 1940. A little more than four months later, the bridge violently failed and a 600-foot span of roadbed fell into the water below. Why did the bridge tear itself apart? What made the bridge collapse on November 7th and not some previous day? One of the first questions asked when performing a root cause analysis is, “What is different about this issue?” The first difference to consider was that November 7th was a windy fall day. Construction of the bridge ended in the summer, so this was the first fall the new bridge had experienced. On the day the bridge failed, the wind was blowing across the roadbed at 42 mph. This was the strongest the wind had blown since the bridge was constructed. The second difference was the design of the bridge itself. The Tacoma Narrows Bridge was particularly narrow relative to its length, making the roadbed more flexible than those of other suspension bridges. Additionally, the bridge had shallow girders and was relatively weak in torsion compared to other suspension bridges built around the same time. The combination of fall winds and the slender bridge design resulted in the collapse of the bridge.

As the wind hit the bridge, the force twisted the roadbed until it reached the point where it was constrained by the suspender cables, and then it twisted back in the other direction. Other suspension bridges of the time experienced similar twisting motions, but what made this bridge different was that the amplitude of the motion increased with each cycle, rather than dying out. The bridge was unable to dissipate the wind energy, and the twisting motion continued to grow until the suspender cables snapped and the roadbed was dropped into the water below. The mathematical explanation of why the bridge collapsed is fairly complex, but simply put: the bridge had too little damping to shed the energy the wind added, so the twisting oscillations grew rather than decayed with each cycle.
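
The difference between twisting that dies out and twisting that grows can be illustrated with a toy single-mass oscillator. This is a sketch under stated assumptions, not a model of the actual bridge: the function name, the 0.2 Hz frequency and the damping values are all made up for illustration. The "net damping ratio" stands for the structure's own damping minus the energy the wind feeds in, so a negative value means the wind adds more energy per cycle than the structure can dissipate.

```python
import math


def peak_amplitudes(net_damping_ratio, cycles=5, natural_freq_hz=0.2, dt=0.001):
    """Integrate x'' + 2*zeta*w*x' + w^2*x = 0 and return the peak |x| in each cycle."""
    w = 2 * math.pi * natural_freq_hz            # angular frequency in rad/s
    x, v = 1.0, 0.0                              # start twisted by one unit, at rest
    steps_per_cycle = int(round(1.0 / (natural_freq_hz * dt)))
    peaks = []
    for _ in range(cycles):
        peak = 0.0
        for _ in range(steps_per_cycle):
            a = -2 * net_damping_ratio * w * v - w * w * x
            v += a * dt                          # semi-implicit Euler: velocity first,
            x += v * dt                          # then position using the new velocity
            peak = max(peak, abs(x))
        peaks.append(peak)
    return peaks


decaying = peak_amplitudes(0.05)    # structure sheds more energy than the wind adds
growing = peak_amplitudes(-0.05)    # wind adds more energy than the structure sheds

print([round(p, 2) for p in decaying])
print([round(p, 2) for p in growing])
```

With positive net damping each cycle's peak is smaller than the last; with negative net damping each peak is larger, which is the runaway behavior described above.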

A high level cause map of the failure of the Tacoma Narrows Bridge is below:

Root Cause Analysis Tacoma Narrows

March 16th, 2008

In honor of Joseph Juran

Sadly, Joseph Juran died in New York on February 28, 2008 from natural causes. He was an astounding 103 years old. Dr. Juran coined the phrase “Pareto principle” after Vilfredo Pareto, an economist who noted that 20% of Italians held 80% of Italy’s wealth. Dr. Juran applied this principle to quality, noting that 20% of causes are responsible for 80% of a problem.
As such, it seems that this would make root cause analysis easier - find 20% of the causes, and we’re 80% done with the problem! In practice, many root cause analyses just stop there. However, Dr. Juran himself recognized this problem, and began referring to the causes as “the vital few and the useful many.” He understood that there is still great value, and perhaps necessity, in determining 100% of causes, not just the “vital few” that are responsible for a disproportionate share of problems.
Doing a visual root cause analysis, or cause map, can assist us in finding the “useful many”. The map allows us to find 100% of the causes, not just those that are obvious or most responsible for the problem. Because the technique effectively draws out these causes, it ensures that we do not spend 80% of our time finding the most elusive 20% of causes!
Once we’ve found all the causes, we can then assign a solution to each cause.  At this point, your organization will prioritize the solutions using the Pareto principle.  Obviously in a world of limited resources, the solutions that should be applied first are those that can solve 80% of problems.  But it’s important that we ensure that all possible causes and all possible solutions are present in our analysis.   To successfully achieve a goal of “zero defects” or “zero injuries”, we’ll have to apply solutions to all the causes. 
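
The prioritization step can be sketched as a short ranking routine. Everything in the example is hypothetical: the function name, the cause names and the counts are invented for illustration and do not come from any real incident data. The routine ranks causes by how many problems each accounts for, marks the top causes that together reach the 80% threshold as the "vital few", and keeps the rest as the "useful many" instead of discarding them.

```python
def prioritize(problem_counts, threshold=0.8):
    """Split causes into the 'vital few' and the 'useful many' by cumulative share."""
    total = sum(problem_counts.values())
    ranked = sorted(problem_counts, key=problem_counts.get, reverse=True)
    vital_few, useful_many, covered = [], [], 0
    for cause in ranked:
        # Keep adding to the vital few until they cover the threshold share.
        if covered / total < threshold:
            vital_few.append(cause)
        else:
            useful_many.append(cause)
        covered += problem_counts[cause]
    return vital_few, useful_many


# Hypothetical counts of problems attributed to each cause.
counts = {
    "worn seal": 55,
    "missed inspection": 30,
    "operator fatigue": 8,
    "mislabeled valve": 5,
    "other": 2,
}

vital_few, useful_many = prioritize(counts)
print("Address first:", vital_few)
print("Still needs a solution:", useful_many)
```

The second list matters as much as the first: a goal of “zero defects” means every cause in it eventually gets a solution too.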


March 12th, 2008

Asleep at the wheel: Accidents caused by exhaustion

What happens when our root cause analysis of a problem leads to “operator tired” or “operator fell asleep”? If we stop there and blame the operator, we are missing an important opportunity to improve the safety of our organization and potentially prevent another problem from occurring.

One of the causes of the Exxon Valdez oil spill is that the Third Mate who was actually manning the bridge was exhausted due to long work hours and too little sleep. The collision of two metro trains in Washington, D.C. in 2003 was caused by an operator who had worked a double shift, from 8 a.m. to 11 p.m., then returned the next day to do it again. A few hours into his first shift, his train rolled backwards more than 2,000 feet into another train and caused millions of dollars’ worth of damage. The investigation team determined that the brake had never been pressed. Some recent studies show that people suffering from excessive sleep deprivation perform some tasks about as well as someone who is legally drunk. Naturally, this is a concern for anyone operating heavy machinery or performing surgery. Yet rotating shift work, excessive work hours and too little time between shifts continue to occur . . . and sometimes have tragic consequences.

Based on these concerns, some organizations are trying to alleviate fatigue problems caused by their work standards.  For example, the Accreditation Council for Graduate Medical Education’s common duty hour standards took effect on July 1, 2003.  These standards reduce the number of hours medical residents can work in a week, and require adequate rest between duty periods.

When we end up with a cause of “operator tired” or “operator fell asleep” in our root cause analysis, it is essential that we continue to ask “Why?” to determine the factors that led to exhaustion. Many times, regulations to ensure adequate rest before duty do not exist. In some cases, company policy encourages or requires a workload that does not allow for adequate sleep. If we do not continue our root cause analysis to determine the reason that the operator is tired, we run the risk of having the same problem - or worse - happen again.

March 6th, 2008
