Hubble Focusing Issues

The Hubble Space Telescope was launched on April 24, 1990.  Once it was in orbit, it was quickly discovered that the images from Hubble were blurred.  An investigation revealed that Hubble’s primary mirror had not been built to specification and could not properly focus the light.  Specifically, the mirror was ground too flat toward its edge, so light reflected from the edge of the mirror came to a focus at a slightly different location than light reflected from the center.  The primary mirror was off specification by only 2.3 micrometers, but the result for the $1.5 billion project was disastrous.

Solving Hubble’s focus issues was no small feat.  How do you repair a mirror that can’t be replaced in orbit when it is cost prohibitive to bring it back to Earth for repair?  The answer was to leave the mirror in place and use the rest of the optical system (which did meet specifications) to compensate for the off-specification mirror.  COSTAR (Corrective Optics Space Telescope Axial Replacement) was added to Hubble during the first servicing mission in December 1993.  COSTAR is essentially eyeglasses for Hubble: small corrective mirrors built with the same error as the primary mirror, but in the opposite direction, so that the effects of the off-specification mirror shape are canceled out.  With the addition of COSTAR, Hubble met its original design goals.

The primary mirror was constructed with a flaw because the tool used to create the template that guided the shaping of the mirror, called a null corrector, was itself flawed.  Null correctors use precisely positioned mirrors and lenses to measure the shape of a mirror.  When a null corrector is assembled, reflected light is used to measure the distance between the mirror and the lens inside the tool.  A measurement error was made when the null corrector used to shape Hubble’s primary mirror was assembled.  A small patch of non-reflective coating had flaked off an internal part of the instrument, so the laser used to perform the measurement reflected off the wrong spot, and a lens ended up 1.3 mm too far from the mirror.  Because null correctors are extremely precise and do not change once assembled, the Hubble team relied on this single instrument to guide the mirror shape.  A single flawed tool and inadequate quality controls resulted in a flawed mirror.

A visual representation of the root cause analysis has been created as a Cause Map that can be downloaded.

August 4th, 2008

Pet Food Contamination - UPDATE

On May 22, 2008, Menu Foods and other pet food manufacturers agreed to a settlement of the class action lawsuit resulting from last year’s pet food contamination.  As part of the settlement, they will set up a $24 million fund to reimburse owners for expenses related to pet deaths or injuries and screenings, and as compensation for food purchases.  This is in addition to $8 million that has already been paid to owners.  The manufacturers are also required to screen for melamine, which they say they are already doing.

The pet food manufacturers are bearing the brunt of the expense relating to the contamination issue.  But a root cause analysis shows that a significant portion of the blame lies in the regulatory process and dishonest raw material suppliers.  After all, the pet food manufacturers made pet food using raw materials that had been certified as meeting their requirements (which called for no foreign material contamination) and had not been flagged by the FDA. 

It has become increasingly clear that the FDA is not able to properly do its job given the increasingly global sourcing of U.S. foods and drugs.  The contaminated heparin found earlier this year shows that changes are being made too slowly.  And there is new evidence that private laboratory testing companies in the United States do the bidding of the foreign importers who hire them, not the FDA.  These labs have stated that testing results for food entering the United States, no matter what kind of contamination they show, belong to the company.  This means that results may be released to the FDA only when the company chooses, or once a favorable result has been obtained, no matter how many rounds of testing that requires.  Some labs have also claimed that importers “lab shop”, sending samples to lab after lab until they get the result they want.  Labs are not required to submit samples to the FDA.  So the FDA may be in the dark about companies whose food products are repeatedly contaminated.

Dr. David Acheson, the FDA’s assistant commissioner for food protection, supports congressional proposals that private labs be accredited by the FDA.  Hopefully action will be taken soon, before more tragedies occur.

July 7th, 2008

General Slocum Steamship Fire

On June 15, 1904, a church group headed out for an excursion through New York City’s East River on the Steamship General Slocum.  Approximately half an hour after the ship left the pier, it caught fire.  Although the ship was only a few hundred yards from shore, the Captain continued full speed ahead in hopes of beaching at North Brother Island, where a hospital was located.  This fanned the flames quickly over the entire highly flammable ship, killing many in the inferno.  The Captain did successfully beach the ship at North Brother Island, but most of those who were not killed by the fire drowned, because of the depth of the water and the lack of usable safety equipment.

To perform a root cause analysis of the General Slocum tragedy, we can use a Cause Map.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.  First we look at the impact to the goals.  On the General Slocum there were at least 1,021 fatalities among the passengers and crew aboard.  (However, only two of the crew were killed.)  Other goals were affected, but the magnitude of the loss of life makes them less significant.  The deaths and injuries were the impact to the safety goals.

Passengers drowned because they were in water over their heads with inadequate help or safety equipment.  Passengers were in the water either because they fell when the deck collapsed or because they jumped to escape the fire.  The water was too deep to stand in because only the bow was in shallow water, and the passengers could not reach the bow.  This was due to a poor decision by the Master: he beached the ship at a severe angle, with the bow in toward the island, instead of parallel to the island, where passengers would have been able to wade to shore.  Note that the Master himself (and most of the crew) was on the bow side of the ship and was able to (and did) jump off and wade to shore.  The safety equipment, including life preservers, lifeboats, and life rafts, was mostly unusable due to inadequate upkeep and inadequate inspections.

Passengers (and two crewmembers) were also killed by fire.  Once the fire started, it spread rapidly and was not put out.  The fire spread rapidly because the ship was highly flammable; when the ship was constructed, flammability was not considered.  Additionally, the current of air created by the vessel speeding ahead drove the fire across the ship.  The fact that an experienced Master would have allowed this situation was considered misconduct, negligence and inattention to duty, the charges of which the Master was later convicted.  The fire was not put out because of inadequate crew effort and insufficient fire-fighting equipment.  The crew effort was inadequate because of a lack of training.  The fire-fighting equipment was insufficient because of inadequate upkeep and inadequate inspections.  (Possibly you are noticing a theme here?)
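
To make that theme concrete, here is a minimal Python sketch of how part of this Cause Map could be represented as data, assuming a simple dictionary that maps each effect to the causes that answer "why?" for it. The node labels are paraphrased from the text above, the branches are simplified (for example, the deck collapse and jumping branches are left out), and the function names are hypothetical, not part of any Cause Mapping tool.

```python
from collections import Counter

# Simplified, hypothetical encoding of part of the General Slocum Cause Map.
# Each effect maps to the causes that answer "why?" for it.
CAUSE_MAP = {
    "fatalities": ["passengers drowned", "passengers killed by fire"],
    "passengers drowned": ["water over their heads", "safety equipment unusable"],
    "water over their heads": ["ship beached bow-first at a severe angle"],
    "safety equipment unusable": ["inadequate upkeep", "inadequate inspections"],
    "passengers killed by fire": ["fire spread rapidly", "fire not put out"],
    "fire spread rapidly": ["highly flammable ship", "air current from speeding ahead"],
    "fire not put out": ["inadequate crew effort", "insufficient fire-fighting equipment"],
    "inadequate crew effort": ["lack of training"],
    "insufficient fire-fighting equipment": ["inadequate upkeep", "inadequate inspections"],
}


def deepest_causes(effect: str, cause_map: dict[str, list[str]]) -> list[str]:
    """Ask "why?" repeatedly from `effect` and return every cause with no further causes listed.

    Assumes the map has no cycles; a cyclic map would recurse forever.
    """
    causes = cause_map.get(effect, [])
    if not causes:
        return [effect]
    found: list[str] = []
    for cause in causes:
        found.extend(deepest_causes(cause, cause_map))
    return found


def recurring_causes(effect: str, cause_map: dict[str, list[str]]) -> list[str]:
    """Return the deepest causes that appear on more than one branch below `effect`."""
    counts = Counter(deepest_causes(effect, cause_map))
    return sorted(cause for cause, n in counts.items() if n > 1)


print(recurring_causes("fatalities", CAUSE_MAP))
# ['inadequate inspections', 'inadequate upkeep']
```

A cause that shows up under more than one effect can be a good place to focus solutions, since a single fix (here, better upkeep and inspection of both the life-saving and the fire-fighting equipment) addresses several branches at once.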

Many people have not heard of the General Slocum tragedy, but many of its lessons have since been implemented to make ship travel safer, although many of the solutions were not implemented widely enough or in time to prevent the Titanic disaster eight years later.  (Although more people were actually killed on the General Slocum, the Titanic disaster is believed to be better known because the Titanic’s passengers were wealthy, while the General Slocum’s passengers were working class.  It is also surmised that sympathy for the largely German-American passengers aboard the General Slocum diminished as World War I began.)

In a macabre ending to a gruesome story, ships began replacing their outdated, decrepit life preservers after the investigation into the General Slocum.  It was later found that the company selling these new life preservers had hidden iron bars within the buoyant material in a dastardly attempt to raise their apparent weight.  Unfortunately, there were no adequate laws (then) against selling defective life-saving equipment.

June 28th, 2008

Blood Substitute Risk

A study recently published in the Journal of the American Medical Association reviewed clinical trials of hemoglobin-based blood substitutes.  The study showed that patients in these trials who received the blood substitutes had an increased risk of heart attack and death, with no clinical benefit.

We will examine this issue using the Cause Mapping process.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.

In clinical trials, one of the overall goals is to have zero injuries.  The blood substitute trials led to a 30% increase in the risk of death and a 2.7-fold increase in the risk of heart attack, with no clinical benefit.  The two goals impacted in the blood substitute example are the safety goal and the customer service goal.
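
Reading both figures as relative risks (an assumption about how the study expresses them), each compares the rate of an outcome among patients given a blood substitute with the rate among control patients. The 1% control rate in the last line is a hypothetical value chosen only to make the arithmetic concrete:

```latex
\begin{aligned}
\mathrm{RR} &= \frac{p_{\text{substitute}}}{p_{\text{control}}},
  \qquad \text{percent increase} = (\mathrm{RR} - 1) \times 100\% \\
\mathrm{RR}_{\text{death}} &\approx 1.3 \;\Rightarrow\; (1.3 - 1) \times 100\% = 30\%\ \text{higher risk of death} \\
\mathrm{RR}_{\text{heart attack}} &\approx 2.7 \;\Rightarrow\; p_{\text{control}} = 1\% \text{ would correspond to } p_{\text{substitute}} \approx 2.7\%
\end{aligned}
```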

In this example, all of the impacts to the goals are caused by the increased risk of heart attack (myocardial infarction).  Additionally, no clinical benefit was shown because the use of blood substitutes did not reduce the number of blood transfusions.

Why was there an increased risk of heart attack?  The increased risk of heart attack is caused by decreased blood flow, which is caused by blood vessel contraction (vasoconstriction).  This occurs because nitric oxide is responsible for blood vessel dilation, hemoglobin molecules scavenge nitric oxide, and the patient receives an infusion of hemoglobin.

The patient receives an infusion of hemoglobin because patients are unaware of the risk and because clinical trials of hemoglobin-based blood substitutes are ongoing.  These trials are ongoing because hemoglobin-based blood substitutes have been developed and because clinical trials are being performed.

The hemoglobin-based blood substitutes have been developed because blood substitutes are being developed and most of them are hemoglobin-based, since hemoglobin is seen as the most promising substitute.  Blood substitutes are being developed because they would be better suited to remote areas and more portable, would help deal with the shortage of blood, and would reduce problems from blood transfusions.

The clinical trials were performed because they were approved by the FDA;  there was no checking by scientists, review boards, or the public; and the companies continued clinical trials.  There was no checking, and the companies continued the trials, because there was a lack of information available.

The FDA and the blood companies are still trying to figure out how to go forward based on these new results.  Because of the potential usefulness of blood substitutes, especially in military applications, it’s likely we’ll continue to see progress on this issue.

June 12th, 2008

Reactor Vessel Head Degradation - Davis-Besse Nuclear Power Station

On March 7, 2002, during refueling, a cavity measuring approximately 4 x 6 inches was discovered that had eaten completely through the more than 6″ thick reactor pressure vessel head of the Unit #1 reactor at Davis-Besse Nuclear Power Station.  Fortunately, the thin stainless steel cladding layer had held the reactor pressure, although it was not designed to do so.  The loss of the vessel head was also a loss of a principal fission product barrier (one of the three responsible for ensuring that radioactive fission products remain within the pressure boundary).  This was an impact to the safety goal.  The NRC also considers the loss of a principal fission product barrier a “significant precursor to core damage”, which is another impact to the safety goals.  All told, the cavity resulted in nearly $300 million in repairs and upgrades and a two-year closure of the plant, during which electricity production at Davis-Besse was severely reduced.  These were impacts to the material, production, and customer service goals.

Let’s examine a high level root cause analysis and review some of the causes of the cavity.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.

The cavity was caused by continued boric acid corrosion.  The corrosion occurred when leaking coolant evaporated, leaving boric acid behind.  The leak came from a through-wall crack in a nozzle, caused by primary water stress corrosion cracking, that went undetected.  The corrosion also occurred because the leakage itself went undetected, due to delayed inspections and ineffective leakage detection methods.

The boric acid was not removed because it was not viewed as a safety concern.

The corrosion also occurred due to inadequate corrosion control.  The corrosion was not detected because the head was never fully inspected and because early signs of corrosion were ignored or missed.  The old corrosion products were not completely removed because they were difficult to remove and their removal was done on a “best-effort” basis.  Additionally, the control was inadequate because the corrosion rate was higher than expected.

Even more detail can be added to this Cause Map as the analysis continues. As with any investigation the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals.

June 2nd, 2008

DeHavilland Comet Accidents - 1954

Sir Geoffrey de Havilland built the first commercial jet airliner to reach production, the Comet.  The Comet design was finalized in 1945, as Britain was attempting to establish a commercial aircraft industry after World War II.  Prior to 1954, there had been some problems (a collision at take-off and a mid-air breakup) and some fixes to the hydraulic control system.  Then, on January 10, 1954, a Comet broke up in mid-air.  Flights were voluntarily suspended, then resumed.  On April 8, 1954, another Comet broke up in mid-air.  (Both flights had taken off from Rome.)  The lives of 56 passengers and crew were lost in these two incidents, as well as two planes.  Additionally, the prestige of the British aviation industry suffered a blow.  (I’ll consider the lost prestige of British aviation a customer service impact.)

Let’s look at this incident in a Cause Map.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.  Although there were two separate plane breakups, the Cause Maps are the same (based on the analysis and investigation performed after the accidents).  Essentially, the two planes were lost due to a structural failure of the cabin, caused by fatigue growth of a crack beyond the critical crack length (in essence, the crack length at which crack propagation becomes so rapid as to be uncontrollable).

The fatigue cracking of the cabin occurred because the actual number of pressure cycles exerted on the cabin exceeded the allowable number (the number at which cracking would be expected).  This was because the allowable pressure cycles were miscalculated, for several reasons.  The first was an inadequate test program.  There was no prototype, and the fatigue tests were misleading.  One test used a section that had effectively been pre-conditioned, extending its life.  In another test, the section tested was so small that the results were influenced by boundary conditions.

Next, the actual stress was above the predicted stress.  This occurred because 1) the square-shaped windows caused pressure stresses to be distributed unevenly, and 2) the actual stress increased in localized areas.  The local stress at rivet holes is far above the general stress (usually on the order of three times the general stress), and two rows of rivets were used to attach the window frame.
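
The rivet-hole effect can be written as a stress concentration factor, K_t, multiplying the general (nominal) stress. K_t ≈ 3 matches the textbook result for a small circular hole in a wide plate under uniaxial tension; the 100 MPa figure below is a hypothetical round number, not a Comet design value:

```latex
\begin{aligned}
\sigma_{\text{local}} &= K_t \, \sigma_{\text{general}}, \qquad K_t \approx 3 \\
\sigma_{\text{general}} = 100~\text{MPa} &\;\Rightarrow\; \sigma_{\text{local}} \approx 3 \times 100~\text{MPa} = 300~\text{MPa}
\end{aligned}
```

Because fatigue life typically falls steeply as stress rises, a threefold local increase in stress can shorten fatigue life by much more than a factor of three.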

While the Comet was being developed, there was a general lack of knowledge about fatigue.  Many designers (de Havilland included) thought that fatigue was associated with vibration, which was not seen as a concern for jet engines.  Additionally, the spread in fatigue results is large (some experts quote as high as 9:1), meaning that one plane could fail nine times sooner (or later) than another.  You can see how this is a problem with a small test sample.

A last problem was that the design of the Comet stretched the bounds of experience.  The Comet was designed to fly at twice the speed of other airliners, at twice the height, and at twice the cabin pressure (for passenger comfort).  As such, the design was a great extension of the existing body of knowledge in not just one, but three dimensions.

Probably the most important lesson to come from the de Havilland Comet accidents is the importance of proper testing.  Once the cause was discovered, the Comet was redesigned and flew successfully, although by then Boeing had taken most of the market.  It’s tragic that these accidents had to occur before the problem was solved.

May 28th, 2008

Brooklyn Bridge Turns 125

The Brooklyn Bridge marks its 125th birthday on May 24, 2008.  When performing root cause analysis, it is easy to spend a great deal of time focused on failures, but today engineers should take a moment to appreciate this truly amazing feat.  The bridge has been refurbished many times, but the towers, main cables, and main beams are original and are now 125 years old.

At the time the Brooklyn Bridge was constructed, the 6,000 ft long bridge was roughly six times as long as the longest bridge of its type that had previously been built.  The Brooklyn Bridge is one of the nation’s oldest and most treasured suspension bridges, and it has shaped the development of New York City.  When it was constructed, Brooklyn was largely rural, and the bridge helped spark a growth spurt that dramatically changed the face of Brooklyn.  Brooklyn’s population grew by 42 percent between 1880 and 1890.  At last count, in 2006, the bridge carried 126,000 cars per day.

Recent inspections have revealed some deterioration of the bridge, primarily in the newer approach ramps.  In a recent survey, state inspectors rated its condition as “poor”.  New York City plans to spend $250 million to $300 million to fix and repaint the bridge.  Hopefully these updates will return the bridge to good condition so that it can continue to safely serve the citizens of New York City for many decades to come.

May 23rd, 2008

Slips, Trips and Falls - A Root Cause Analysis Primer

Slips, trips and falls happen every day.  Falls are responsible for tens of thousands of deaths each year.  (Slips and trips are considered a subset of falls, and are included in these numbers.)  Falls on the job account for 12-15% of all workers’ comp costs.  The direct and indirect costs of workers injured and killed on the job are estimated to be billions of dollars each year, both in workers’ comp claims and in lost productivity.  In 1999, as an example, 5,100 workers were killed by falls and over 570,000 injuries were reported.  However, there are many things that can be done to prevent falls and lessen their impact.  Creating a Cause Map (a visual root cause analysis) will allow us to identify all the potential causes of falls.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.  Once we’ve done that, we can identify the best solutions.

A worker is injured during a fall because the worker strikes the floor or another object, the object contacted is hard, and the worker hits it in a way that causes injury.  When I say that workers are injured because they hit an object in a way that causes injury, what I am really talking about are factors that worsen a fall and make injury more likely.  The worker could land on a part of his or her body that is more easily injured.  Injuries can also be worsened if a worker falls farther than his or her own height (i.e., not a same-level fall).

The worker strikes the floor or other object because he or she falls, and there is no other support for the body, such as a handrail, or a harness.   There are four different ways to fall: slips, trips, the “step and fall” - where a person gets off-balance while stepping - and becoming unbalanced on moving equipment.

A worker slips when there is inadequate traction, either because the force of stepping off is too high, or the coefficient of friction is too low.  The force of stepping off can be higher than average if the worker is walking quickly or running, making a sudden change in direction, or if he or she has an awkward gait, from injury or old age, for example.  The coefficient of friction is a function of the traction provided by the shoes the worker is wearing and the “slipperiness” of the walking surface.  The coefficient of friction is too low if the traction of the worker’s shoes is inadequate and if the floor is slippery, because the surface is wet, icy and/or oily and does not have a non-skid coating.  Of course, for this to be an issue at all, the worker has to step into the slippery area. 
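
The two sides of the slip branch (the force of stepping off versus the coefficient of friction) can be sketched with a basic Coulomb friction model: a step demands a friction ratio equal to the horizontal foot force divided by the vertical foot force, and a slip is predicted when that demand exceeds what the shoe and floor can supply. This is a simplified illustration; the function names and all force and friction values are hypothetical, not measured data.

```python
def required_cof(horizontal_force_n: float, vertical_force_n: float) -> float:
    """Friction the step demands: horizontal (shear) foot force over vertical (normal) foot force."""
    if vertical_force_n <= 0:
        raise ValueError("vertical force must be positive")
    return horizontal_force_n / vertical_force_n


def slips(horizontal_force_n: float, vertical_force_n: float, available_cof: float) -> bool:
    """Predict a slip when the friction demanded exceeds the friction available.

    available_cof stands for the combined shoe/floor coefficient of friction.
    """
    return required_cof(horizontal_force_n, vertical_force_n) > available_cof


# Hypothetical values for illustration only.
vertical_n = 700.0  # roughly the body weight of a 70 kg worker
print(slips(140.0, vertical_n, available_cof=0.5))  # normal step, dry floor, good shoes: False
print(slips(140.0, vertical_n, available_cof=0.1))  # normal step, oily floor: True
print(slips(280.0, vertical_n, available_cof=0.3))  # sudden turn, marginal floor: True
```

Raising the horizontal force (walking quickly, turning sharply) and lowering the available friction (wet or oily floors, worn shoes) push toward the same outcome, which is why both appear as separate causes on the map.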

A worker can become off-balance by encountering an unexpected height difference (known as the “step and fall”).  This occurs in one of two ways.  Either the front foot lands on a surface lower than expected, or the ankle turns due to one side of the foot ending up higher than the other side, with footwear that inadequately supports the ankle.  These are both due to an unexpected height difference.

When a worker trips, it is because his or her toe is stopped but the upper body is not.  The upper body keeps moving because the worker is moving, and the toe is stopped because it encounters an object in the walking path, a rise in the walking path, or a difference in height between successive stairs.

Last but not least, falls can be caused by workers who become unbalanced on moving equipment.  For this to occur, the worker must be inadequately secured to the equipment while the equipment changes motion, either by turning, decelerating or stopping, or accelerating or starting to move.

Once we have finished our Cause Map and found all the potential causes, we can assign potential solutions to all appropriate causes.  The solutions that I have come up with are in green boxes, near the cause(s) they “solve”.  You can see that some of the solutions are the responsibility of the company, some are the responsibility of the worker, and some are both.  Although many of the responsibilities lie with the worker, it is in a company’s best interest to provide training on how to prevent, manage and mitigate falls.  Falls may seem like everyday, ordinary, minor occurrences.  While falls are everyday occurrences, the consequences can be anything but minor.

The attached PDF document shows the visual root cause analysis as a Cause Map.

May 22nd, 2008

SL-1 Explosion - January 3, 1961

The only fatal reactor accident in the United States occurred on January 3, 1961, when an Army prototype known as SL-1 (for Stationary Low-Power Reactor, Unit 1) exploded, killing the three operators who were present.  We’ll use the SL-1 tragedy as an example of how the Cause Mapping process can be applied to a specific incident.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.

The SL-1 tragedy killed the three operators present, which is an impact to the safety goal.  Another goal is that there be no damage to the vessel.  In the case of SL-1, the vessel sustained extensive damage.

The loss of life and vessel damage were both caused by the reactor exploding.  The reactor exploded because it went prompt critical (an uncontrollable, exponentially increasing fission reaction).  The reactor went prompt critical because withdrawal of the central rod can cause prompt criticality and because the rod was rapidly, manually lifted 26.4″ out of the core.

Withdrawal of the central rod can cause prompt criticality due to a lack of shutdown margin in the core, and inadequate safety criteria.

Because most of the evidence was so effectively destroyed, nobody really knows why the control rod was lifted out of the core.  There are two theories (disregarding the bizarre and improbable murder/suicide theory): 1) the control rod got stuck while being lifted to be attached to the drive mechanism and, as the operator exerted greater force on it, suddenly came free, resulting in a lift far greater than intended; or 2) rod-drop testing or exercising was performed improperly.

The control rod was stuck, and came free while being attached because it was required to be lifted 4″ out of the core and because control rods had been sticking.  The control rods had been sticking for one or more of the following reasons: 1) reduced clearances due to radiation damage (which can cause structural material to swell), 2) the passage was blocked due to loss of poison strips in the channel, caused by poor design and inadequate testing, or 3) lifting equipment not working properly due to inadequate lifting capacity of the lifting equipment.

Exercising/testing was potentially performed improperly.  This could have occurred because the operators chose to exercise/test the rods, attempting to ensure that they would perform properly, and because they didn’t realize what would happen, due to inadequate training and inadequate work instructions.  The testing itself may also have been done improperly due to inadequate work instructions.

On a positive note, the SL-1 incident did initiate some lasting changes in the nuclear industry.  Most notably, reactor design has improved and incorporated a “one-rod stuck” criterion, which specifies that a reactor can NOT go critical through the removal of any one control rod.  Additionally, procedures and training have become more rigorous and more formal, and planning for emergencies has increased.
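
The “one-rod stuck” idea can be expressed as a simple check. The sketch below assumes a core described by a single excess-reactivity number and a list of independent, additive rod worths (real rod worths interact, and real analyses rely on detailed reactor physics calculations); the function names and all numbers are hypothetical and are not SL-1’s actual values.

```python
def net_reactivity(excess_reactivity: float, rod_worths: list[float],
                   withdrawn: tuple[int, ...] = ()) -> float:
    """Net reactivity with every rod inserted except those whose indices are in `withdrawn`.

    Simplification: rod worths are treated as independent and additive.
    """
    inserted_worth = sum(w for i, w in enumerate(rod_worths) if i not in withdrawn)
    return excess_reactivity - inserted_worth


def meets_one_rod_stuck(excess_reactivity: float, rod_worths: list[float]) -> bool:
    """True if the core stays subcritical (net reactivity < 0) whichever single rod is withdrawn."""
    return all(
        net_reactivity(excess_reactivity, rod_worths, withdrawn=(i,)) < 0
        for i in range(len(rod_worths))
    )


# Hypothetical cores, reactivity in units of delta-k/k.
central_rod_core = [0.07, 0.01, 0.01, 0.01, 0.01]  # one dominant central rod
balanced_core = [0.03, 0.03, 0.03, 0.03, 0.03]     # worth spread evenly across rods

print(net_reactivity(0.06, central_rod_core) < 0)   # all rods in, subcritical: True
print(meets_one_rod_stuck(0.06, central_rod_core))  # central rod alone can make it critical: False
print(meets_one_rod_stuck(0.06, balanced_core))     # any single rod out stays subcritical: True
```

This checks only for criticality.  Prompt criticality, as at SL-1, requires a larger reactivity insertion, one that exceeds the delayed neutron fraction, so a design that passes this check also keeps a single rod from causing a prompt critical excursion.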

The attached PDF gives a visual representation of the intermediate level root cause analysis, the Cause Map.  It can be printed out to fit on one page.

May 21st, 2008

Train Derailment - Lafayette, Louisiana

At about 1:40 am on May 17, six rail cars derailed and overturned near Lafayette, Louisiana.  One of the cars was damaged and leaked about 11,000 gallons of hydrochloric acid.  Authorities evacuated people within 1 mile of the accident.  Approximately 3,000 people were affected, including a few small businesses and a nursing home.  Five people, including two rail workers, were taken to a hospital and treated for eye and skin irritation.  All affected people are being reimbursed for food and hotel costs by BNSF Railway, the railway company that operated the train.

There was potential for further release of chemicals because one of the other rail cars involved in the accident carried ethylene oxide, a flammable and dangerous chemical, and two of the remaining cars also carried hydrochloric acid.

The Louisiana State Police’s hazardous materials unit is overseeing clean-up of the accident site.  The spill is being neutralized with lime, and the contaminated material will be removed and disposed of.  The rail car containing ethylene oxide was removed from the site as quickly as possible to eliminate the potential for additional problems.

The cause of the derailment is not known at this time.  The Federal Railroad Administration will conduct an investigation of the accident.

The attached PDF file contains an intermediate level root cause analysis of the train derailment.  It was built using the facts that were available in media reports on the accident.  As more details become known, the Cause Map can be expanded.

May 20th, 2008
