How a Shuttle is Launched

By ThinkReliability Staff

The Space Shuttle Discovery is expected to be launched November 4th, assuming all goes well.  But what does “all going well” entail?  Some things are obvious and well-known, such as the need to ensure that the weather is acceptable for launch.  However, with an operation as complex and risky as launching a shuttle, there are a lot of steps to make sure that the launch goes off smoothly.

To show the steps involved in shuttle launch preparation, we can prepare a Process Map.  Although a Process Map looks like a Cause Map, its purpose is to show the steps that must be accomplished, in order, for successful completion of a process.  We can begin a Process Map with only one box, the process that we’ll be detailing.  Here, it’s the “Launch Preparation Process”.  We break up the process into more detailed steps in order to provide more useful information about a process.  Here the information used was from Wired Magazine and NASA’s Launch Blog (where they’ll be providing up-to-date details as the launch process begins).

Here we break down the Shuttle Launch Process into 9 steps, though we could continue to add more detail until we had hundreds of steps.  Some of the steps have been added (or updated) based on issues with previous missions.  For example, on Apollo 1, a fire in the cabin’s pure-oxygen atmosphere during a test killed the crew.  Now one of the first steps is an oxygen purge, where oxygen in the payload bay and aft compartments is replaced with nitrogen.  On Challenger, concerns about equipment integrity in extremely cold weather were not brought to higher-ups.  Now there’s a Launch Readiness Check, where more than 20 representatives of contractor organizations and departments within NASA are asked to verify their readiness for launch.  This allows all contributors to have a say regarding the launch.  One of the last steps is the weather check we mentioned above.

As with the Launch Readiness Check, we can add detail to the Launch Status Check.  This step can be further broken down to show the checks of systems and positions that must be completed before the Launch Status step can be considered complete.  Each step within each Process Map shown here can be broken down into even more detail, depending on the complexity of the process and the need for a detailed Process Map.  In the case of an extremely complex process such as this one, there may be several versions of the Process Map, such as an overview of the entire process (as we’ve shown here) and a detailed version of each step for the personnel who are performing and overseeing that portion of the process.  As you can see, a lot of planning and checking goes into the launch preparations!
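One way to picture the overview and detailed versions of a Process Map is as a nested list of ordered steps.  The Python sketch below is only an illustration: it includes just the four steps named in this post (not all 9), their relative order is assumed, and the two sub-steps under the Launch Status Check are hypothetical placeholders rather than NASA’s actual checks.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One box on a Process Map; substeps are the more detailed boxes beneath it."""
    name: str
    substeps: list["Step"] = field(default_factory=list)

    def outline(self, depth=0, max_depth=None):
        """Return step names in order, indented by level, down to max_depth."""
        lines = ["  " * depth + self.name]
        if max_depth is None or depth < max_depth:
            for sub in self.substeps:
                lines.extend(sub.outline(depth + 1, max_depth))
        return lines


# Partial map: only steps mentioned in the post; order is an assumption.
process = Step("Launch Preparation Process", [
    Step("Oxygen purge"),
    Step("Launch Readiness Check"),
    Step("Launch Status Check", [
        Step("Systems checks (placeholder)"),
        Step("Position checks (placeholder)"),
    ]),
    Step("Weather check"),
])

overview = process.outline(max_depth=1)  # top-level steps only
detailed = process.outline()             # every level of detail

print("\n".join(detailed))
```

The same structure serves both audiences: limiting the depth gives the overview, while the full outline gives the detailed version for the people working a particular step.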

Cement Failure Contributed to Deepwater Horizon Explosion

By ThinkReliability Staff

The investigation to determine the causes behind the April 20, 2010 explosion of the Deepwater Horizon oil rig and the resulting oil spill is still underway.  (A previous blog discussed the BP report and the initial findings from their investigation.)

The newest piece of information that has come to light is that cement used to seal the production casing wasn’t properly tested.  It was previously assumed that the cement must have failed; otherwise the hydrocarbons would not have been able to leak into the well and subsequently feed the massive explosion that destroyed the oil rig.  But more information is becoming available that explains why the cement failed.

The well was cemented with nitrogen foam cement supplied by a contractor. Investigation by the presidential commission on the oil spill has revealed that the cement was not properly tested prior to use.  More significantly, the cement was found to be unstable when tested.

The data indicates that the cement failure was a cause of the oil rig explosion, but was it the root cause?

It’s easy to see that the cement failure was not the only cause.  In addition to the failure of the cement, there were other things that had to occur for the accident to happen.  One of the most obvious is the failure of the blowout preventer.  Even if the cement failed and the hydrocarbons leaked into the well, a functioning blowout preventer would have blocked the leak path for the hydrocarbons and prevented this tragedy.

As with any incident of this magnitude, there is no single root cause.  Rather, there are a number of causes that contributed to the incident.  Determining all of them will allow better understanding of the incident, which will hopefully lead to development and implementation of better solutions to prevent similar accidents in the future.
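The idea that several causes were all required can be written down as a simple "AND" check.  The Python sketch below is only an illustration using the two causes named in this post; it is not a model of the full investigation, and the names in it are made up for the example.

```python
# Illustrative only: the two causes named in this post, not the full investigation.
REQUIRED_CAUSES = {
    "Oil rig explosion": ["Cement failed", "Blowout preventer failed"],
}


def effect_occurs(effect, causes_present):
    """An effect occurs only if every one of its required causes is present."""
    return all(cause in causes_present for cause in REQUIRED_CAUSES[effect])


# Both barriers fail: the incident occurs.
both_failed = effect_occurs(
    "Oil rig explosion", {"Cement failed", "Blowout preventer failed"}
)

# The cement fails but the blowout preventer works: the incident is prevented.
preventer_worked = effect_occurs("Oil rig explosion", {"Cement failed"})

print(both_failed, preventer_worked)
```

Removing any one required cause is enough to break the chain, which is why identifying all of the causes gives more options for solutions than stopping at a single root cause.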

Mine Deaths in China

By ThinkReliability Staff

The successful rescue of all 33 miners trapped in a Chilean mine has been followed by some unhappy mine news from China.  A gas blast in the early morning of October 16, 2010 is known to have killed 26 miners, and the 11 miners unaccounted for are believed dead.  In addition to these impacts to the safety goals, the environmental goal is impacted by the extremely high levels of methane gas, the customer service and production goals are impacted by the closure of the mine, and the property and labor goals are impacted by the rescue efforts that have been required.  Unfortunately this is not an uncommon occurrence.  It is estimated that 2,600 people were killed in Chinese mine accidents last year.

It is believed that most of the miners died of suffocation.  In addition to the lack of oxygen from the extremely high levels of methane (40% compared to the normal level of 1%), the miners were buried by coal dust released by the gas blast.  The miners were trapped in the mine by the gas blast, the cause of which is as yet unknown.  This is a question that additional investigation will try to answer.  More information is also needed about the high levels of methane.  The rescuers had difficulty reducing the levels of methane because coal dust was blocking an access shaft, but levels were high prior to the blast, for reasons that are unclear.

More detail can be added to this Cause Map as the analysis continues. As with any investigation, the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals.  Because of the high number of deaths (and the high frequency of this type of incident), the Cause Map should end up very detailed so that it offers as many possible solutions as it can and the best ones can be implemented to reduce these types of incidents.

Miners Rescued!

By ThinkReliability Staff

On October 13, 2010, after almost 70 days spent 688 meters underground, the 33 miners who were trapped in Chile’s San Jose Mine were brought to the surface in a small rescue capsule. Although the complexity of this rescue mission was unmatched in history, it seemed to go off without a hitch, and the rescue even proceeded more quickly than anticipated.

The primary concern throughout the rescue was the miners’ safety. Plans for the rescue focused on ensuring the safest possible environment for the miners and on making adjustments based on the ordeal they had been through. For example, there was concern about damage to the miners’ eyes, since they had not been exposed to natural light for almost 70 days. So the miners wore protective eyewear to prevent damage. In addition, medics and rescuers were sent down to the chamber where the miners had been trapped to prepare them for the trip up (in a rescue pod small enough to fit through a 60-cm diameter hole) and evaluate them for medical conditions. Once on the surface, the miners will receive 48 hours of medical observation by a team of specialists.

The preparations for this undertaking have been extremely methodical and detailed. An area near the mine exit was cleared for a helicopter landing as a backup plan in case the miners could not be transported to the medical facility by road.

Even less-immediate concerns have been considered. The company that owned the mine went bankrupt while the miners were trapped, meaning these brave men returned to the surface jobless. The Chilean government put out a notice, and has received more than a thousand job offers.

One of the biggest concerns is that the miners will suffer from post-traumatic stress disorder (PTSD). It’s unclear exactly what is being (or can be) done to reduce the impact, but the Chilean government has consulted with NASA about potential emotional and psychological issues the miners will face.

It seems that the rescuers really tried to think of everything that would make the rescue go smoothly – and the result of this planning showed in the faces of millions who watched the last miner safely pulled from the mine. A big Bravo Zulu out to all involved!

(You can see a timeline of the events starting from the mine collapse and a Cause Map that shows some of the worries the rescuers considered – and planned for – by clicking “Download PDF” above.)

Toxic Red Sludge Spill

By Kim Smiley

On Monday, October 4, 2010, a massive wave of red sludge flooded into four villages near Kolontár, Hungary when a storage reservoir burst.  Four people were killed and at least 150 have needed medical treatment for their injuries.  The most common injuries reported are burns and eye ailments.

Red sludge is a highly caustic material that is produced during the aluminum manufacturing process.  Reports indicate that the sludge had a pH of 13 while stored in the reservoir.  All life has been killed in a 25-mile stretch of river and 16 square miles of land have been covered by the pollution.  Best estimates are that 158 million to 184 million gallons of sludge were released.  This is the first large-scale release of red sludge in history.

Hungary’s top investigative agency is looking into the accident, but the cause for the reservoir barrier failure is not known at this time.

Even with the unknowns, a root cause analysis can be started by creating a Cause Map and documenting all available information.  Any new information can easily be incorporated into the existing Cause Map.

To build a Cause Map, we start with the impacted goals and ask “why” questions.  In this example, the two goals we will consider are the Safety goal and the Environmental goal.  Starting with the Safety goal we begin by asking – Why were people injured?  They were injured because they were exposed to caustic material because red sludge flooded into their villages.  Why?  Because red sludge was stored in a nearby reservoir and the barrier on the reservoir was breached.
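The chain of “why” questions above can be sketched as a small data structure, where each effect points to the causes that produced it.  This Python sketch is only an illustration of the idea and not ThinkReliability’s own tool; the wording of each box is paraphrased from this post, and the function names are made up for the example.

```python
# Each effect maps to the causes that answer "why?" for it.
# Boxes with no entry are where the current information stops.
cause_map = {
    "Safety goal impacted: people injured": ["Exposed to caustic material"],
    "Exposed to caustic material": ["Red sludge flooded villages"],
    "Red sludge flooded villages": [
        "Red sludge stored in nearby reservoir",
        "Reservoir barrier breached",
    ],
}


def ask_why(effect, depth=0):
    """Walk the map from an effect back through its causes, one 'why' at a time."""
    boxes = [(depth, effect)]
    for cause in cause_map.get(effect, []):
        boxes.extend(ask_why(cause, depth + 1))
    return boxes


def unexplained(effect):
    """Return the boxes that have no recorded cause yet."""
    return [name for _, name in ask_why(effect) if name not in cause_map]


chain = ask_why("Safety goal impacted: people injured")
open_questions = unexplained("Safety goal impacted: people injured")

for depth, name in chain:
    print("  " * depth + name)
```

When the investigation learns why the barrier failed, that finding becomes one more entry in the map and nothing already recorded needs to change.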

Why the barrier failed isn’t known, but we can still add information that might be useful.  We know that the red sludge reservoir was near the villages, and a little research reveals that this is common practice in the region and that there are a number of similar pools nearby.  This information may become relevant if the investigation determines that the other reservoirs are at risk for a similar failure, so it’s worth recording on our Cause Map at this point. There is also information available about the environmental impact that can be added.

The investigation is still incomplete, but the Cause Map can grow as more information becomes available.  Once the relevant information is added, the Cause Map can be used to develop solutions to help prevent similar accidents from occurring in the future.

Dig Deeper to get to the Causes of the Oil Spill

By ThinkReliability Staff

On Sunday (September 26th, 2010) the lead investigator for the Deepwater Horizon oil spill was questioned by a National Academy of Engineering committee.  The committee brought up concerns that the investigation that had been performed was not adequate to address all the causes of the spill.  Said the lead oil spill investigator: “It is clear that you could go further into the analysis . . . this does not represent a complete penetration into potentially deeper issues.”

Specifically, the committee was concerned that the study focused on decisions made on the rig (generally by personnel who worked for other companies) but did not adequately consider input from these companies.  The study also avoided organizational issues that may have contributed to the spill.

In circumstances such as this one, where an extremely complicated event requires an organization to spend most of its resources fixing the immediate problem, an interim report (which may not delve deeply into underlying organizational issues or obtain a full spectrum of interviews) may be appropriate.  However, it’s just an interim report and should not be treated as the final analysis of the causes relating to an issue.  The organizations involved need to ensure that after the immediate actions (stopping the spill, completing the cleanup, and compensating victims) are complete, an in-depth investigation commensurate with the impact of the issue is performed.

In instances such as these, causes relating to an incident need to be unearthed ruthlessly and distributed freely.  This is generally why a governmental organization will perform these in-depth reviews.  The personnel involved in the investigation must not be limited to only one organization, but rather all organizations that are involved in the incident.  Once action items that will improve safety and processes have been determined, they must be freely distributed to all other organizations participating in similar endeavors.  The alternative – to wait until similar disasters happen at other sites – is unacceptable.

Largest Egg Recall In US History

By Kim Smiley

Two Iowa farms have recently been at the center of the largest egg recall in US history.  Over half a billion eggs were recalled in August after more than 1,500 people were sickened by eggs tainted with salmonella.

How did this happen?  Where did the contamination come from?  How did tainted eggs make it onto supermarket shelves?

The investigation is still ongoing, but we can begin a root cause analysis of this problem by building a Cause Map.  A Cause Map provides a simple visual explanation of all the causes that were required to produce the incident.  A good place to start building a Cause Map is to identify the impacts to the organizational goals.  Causes are then added to the map by asking “why” questions.  (Click on the “Download PDF” button to view a Cause Map of this issue.)

In this example, we’ll consider the safety goal first.  The safety goal was impacted because roughly 1,500 people got sick because they consumed eggs that were contaminated with salmonella.  Why did they eat contaminated eggs?  Contaminated eggs were eaten because they were sold.  Why?  Because the eggs were contaminated at some point and there was inadequate regulation to prevent them from being sold.

Investigators are still determining the exact source of the contamination, but there is significant information available that can be added to the Cause Map.  The eggs were contaminated with salmonella because the hens laying the eggs were contaminated. (This strain of bacteria can be found inside a chicken’s ovaries and is passed on to eggs.)  The exact source that contaminated the hens is still being determined, but testing by the FDA has determined that the hens were likely contaminated after arriving at the farms.  FDA investigators have found a number of sanitation violations, including rodents, which are known carriers of salmonella.  Salmonella is not passed from hen to hen, but is typically passed from rodent droppings to chickens.

As more information becomes available we can add to the Cause Map.  Hopefully, the investigation will result in solutions that can be applied and prevent this situation from occurring again.

Golfer Burns Up the Course…Really

By Kim Smiley

On Saturday, August 28, 2010, a golfer at the Shady Canyon Golf Course in Irvine, California had a bad day on the course, a really bad day.

He literally burned the course up.

He swung his golf club and accidentally hit a rock.  This put into motion a classic example of cause and effect.  The metal-on-rock contact produced a spark, which landed on the dry brush in the area.  This tiny spark eventually grew into a 25-acre wildfire that took 150 firefighters, 38 trucks, 53 helicopter drops and 22,000 gallons of water to finally put out.  No one was injured and no homes were destroyed, but it was still an impressively bad day of golfing.

At first glance, this might seem like a freak accident that isn’t worth expending resources to investigate.

But what if this wasn’t the first time something like this happened?

The manager of the course stated that a similar incident happened a few years ago, but the golfer had been able to put the fire out before it could spread.

It seems like it might be worth at least considering possible solutions.

A root cause analysis can be performed by building a Cause Map using the information from this example. A Cause Map provides a simple visual explanation of all the causes that were required to produce the incident.  Cause Maps can be very detailed and include hundreds of causes or can be very high level.  It depends on the type of problem being investigated.  In this case, a fairly simple Cause Map should be adequate to brainstorm some possible solutions.  Click on the “Download PDF” button above to view a high-level Cause Map.

In this example, there are a number of possible solutions.  The course could be watered more often so that the brush isn’t quite so dry, fire extinguishers could be put on golf carts during the dry season to extinguish any fires before they have a chance to grow, more rocks could be removed from the course and surrounding areas, and so on.  Once the issue is clearly understood and the causes determined, the solutions that are most effective for their cost can be implemented.

A Serendipitous Solution

By Kim Smiley

Investigating the recent massive oil spill in the Gulf of Mexico is a tall order.  There are many contributing causes and a multitude of creative solutions are going to be needed to restore the environment.

During any investigation of this magnitude, there are guaranteed to be a few surprises, and the Deepwater Horizon oil spill is no exception.

Scientists have discovered a previously unknown type of oil-eating bacteria feasting on oil from the spill.

This microbe differs from previously studied varieties because it doesn’t consume large quantities of oxygen along with the oil.  Oxygen consumption is a concern because oxygen is needed in the sea to support life.

This microbe also thrives in cold water temperatures associated with the deep ocean, which might explain why it hasn’t been seen before.  Some scientists are theorizing that the microbe adapted in the deep ocean to consume the oil that naturally seeped from the ocean floor.  Since the huge influx of oil to the water, the bacteria populations have exploded.

Scientists disagree over how much oil remains in the Gulf, but there is no doubt that less is better.

This serendipitous solution is a welcome addition to the cleanup efforts.  Obviously, there are many other solutions that will be needed, but anything that safely reduces the overall amount of oil is a positive development.  Hopefully, with some additional research this microbe could be a potential solution to future incidents.

When performing an investigation, the unexpected sometimes happens.  The better understood the problem is, the easier it is to adapt to any new information. The Cause Mapping method of root cause analysis is an effective way to organize all information needed during an investigation.  Clearly understanding the causes that contribute to an incident will allow an organization to adapt as new information becomes available and make sure that resources are used in the most efficient ways when implementing solutions.

Washing Machine Failure

(This week, we are proud to announce a Cause Map by a guest blogger, Bill Graham.  Thanks, Bill!)

While completing household chores in the spring of 2010, a housewife found her front-load washing machine stopped with water standing in the clothing.  Inspection of the machine revealed that the washing machine’s drain pump had failed.  Because the washer was less than two years old, the family decided to repair the machine instead of replacing it.  A replacement pump was not locally available, so the family found and ordered a pump from an Internet dealer.  Delivery of the pump took approximately one week, during which time the household laundry chore could not be completed and some of the family’s favorite clothing could not be worn because it had not been laundered.  On receiving the new pump, Dad immediately removed the broken pump and found, to his chagrin, a small, thin guitar pick in the suction of the old pump.  Upon discovery of the guitar pick, the family’s children reported that the pick had been left in the pocket of the pants that were being washed at the time of the pump’s failure.  The new pump was installed and the laundry chore resumed for the household.

While most cause analysis programs would identify the guitar pick as the root cause of the washing machine’s failure, Cause Mapping unveils all of the event’s contributing factors and the most efficient and cost-effective measures that might be taken to avert a similar failure.  For example, if all the family’s children aspire to be guitar players, then a top-load washer may better suit their lifestyle while also averting the same mishap.  Or, maybe the family should consider wearing pocket-less clothing.  Or, maybe all family members should assume a bigger role in completing the household laundry chore.  Whichever solution is chosen, the impact of these and all contributing causes is easily understood when the event is Cause Mapped.