The Phillips 66 Explosion: Planning for Emergencies

By ThinkReliability Staff

All businesses strive to make their processes as efficient as possible and to maximize productivity.  Minimizing excess inventory seems only sensible, as does placing process equipment in a logical manner to minimize transit time between machines.  However, when productivity consistently takes precedence over safety, seemingly insignificant decisions can snowball when it matters most.

Using the Phillips 66 explosion of 1989 as an example, it is easy to see how numerous efficiency-related decisions snowballed into a catastrophe.  Examining different branches of the Cause Map highlights areas where those shortcuts played a role.  Some branches focus on how the plant was laid out, how operations were run and how the firefighting system was designed.  Arguably, all of these areas were optimized for production efficiency, but they ended up contributing to a terrible explosion and hampering subsequent emergency efforts.

For instance, the Cause Map shows that the high number of fatalities was caused not just by the initial explosion.  The OSHA investigation following the explosion highlighted contributing factors regarding the building layout.  The plant was cited for having process equipment located too close together, in violation of generally accepted engineering practices.  While this no doubt maximized plant capacity, it made escape from the plant difficult and did not allow adequate time for emergency shutdown procedures to complete.  Additionally, high-occupancy structures, such as the control room and administrative building, were located unnecessarily close to the reactors and storage vessels.  Luckily, more than 100 personnel were able to escape via alternate routes.  But luck is certainly not a reliable emergency plan; the plant should have been designed with safety in mind too.

Nearby ignition sources also contributed to the speed of the initial explosion, which occurred an estimated 90 to 120 seconds after the valve opened.  OSHA cited Phillips for not using due diligence in ensuring that potential sources of ignition were kept a safe distance from flammable materials or, alternatively, using testing procedures to ensure it was safe to bring such equipment into work zones.  The original spark source will never be known, but the investigation identified multiple possibilities.  These included a crane, a forklift, a catalyst activator, welding and cutting-torch equipment, vehicles and ordinary electrical gear.  While such a large cloud of volatile gas would undoubtedly have found a spark eventually, a proactive approach might have provided precious seconds for workers to escape.  All who died in the explosion were within 250 feet of the maintenance site.

Another factor contributing to the extensive plant damage was the inadequate water supply for firefighting, as detailed in the Cause Map.  When the plant was designed, the water system used in the HDPE process was the same one that was to be used in an emergency.  There is no doubt a single water system was selected to keep costs down.  Other shortcuts included placing regular-service fire system pump components above ground.  Of course, the explosion sheared electrical cords and pipes controlling the system, rendering it unusable.  Not only was the design of the fire system flawed, it wasn’t even adequately maintained.  In the backup diesel pump system, only one of three pumps was operational; one was out of fuel and the other simply didn’t work.  Because of these major flaws, emergency crews had to use hoses to pump water from remote sources.  The fire was not brought under control until 10 hours after the initial explosion.  As the Cause Map indicates, the damage might not have been so extensive had the water supply system been adequate.

There is a fine line between running processes at the utmost efficiency and taking shortcuts that can lead to dangerous situations.  Clearly, this was an instance where that line was crossed.

The Phillips 66 Explosion: The Rise of Process Safety Management in the Petrochemical Industry

By ThinkReliability Staff

Many of the industrial safety standards that we take for granted are the direct result of catastrophes of past decades.  Today there are strict regulations on asbestos handling, exposure limits for carcinogens, acceptable noise levels, the required use of personal protective equipment, and a slew of other safety issues.  The organization charged with enforcing those standards is the Occupational Safety and Health Administration, or OSHA for short.

OSHA was founded in 1970 in an effort to promote and enforce workplace safety, and its stated mission is to “assure safe and healthful working conditions for working men and women”.  However, there was considerable controversy during its early years as it began, somewhat unevenly, to enforce what were perceived as cumbersome and expensive regulations.  Notable events in the 1980s, such as the Bhopal and West Virginia Union Carbide industrial accidents, raised OSHA’s awareness that fundamental changes were needed to develop more effective safety management systems.

This awareness led to the rise of what is now known as Process Safety Management (PSM).  This discipline covers how industries safely manage highly hazardous chemicals.  OSHA’s PSM standard sets forth multiple requirements such as employee and contractor training, use of hot work permits, and emergency planning.  Unfortunately, PSM was still a work in progress during the fall of 1989.

On October 23, 1989, the Phillips 66 Petroleum Chemical Plant near Pasadena, Texas, then producing approximately 1.5 billion pounds of high-density polyethylene (HDPE) plastic each year, suffered a massive series of explosions.  Twenty-three people died and hundreds were injured in an explosion that measured at least 3.5 on the Richter scale and destroyed much of the plant.  Many of the deficiencies identified at the Phillips 66 plant violated OSHA’s PSM directives, which had been announced but not yet formally enacted.

Looking at the Phillips 66 Explosion Cause Map, one can see how a series of procedural errors occurred that fateful day.  Contract workers were busy performing a routine maintenance task of clearing out a blockage in a collection tank for the plastic pellets produced by the reactor.  The collection tank was removed, and work commenced that morning.  However, at some point just after lunch, the valve to the reactor system was opened, releasing an enormous gas cloud which ignited less than two minutes later.

The subsequent OSHA investigation highlighted numerous errors.  First, the air hoses used to activate the valve pneumatically were left near the maintenance site.  When the air hoses were connected backwards, this automatically opened the valve, releasing a huge volatile gas cloud into the atmosphere.  It is unknown why the air hoses were reconnected at all.  Second, a lockout device had been installed by Phillips personnel the previous evening, but was removed at some point prior to the accident.  A lockout device physically prevents someone from opening a valve.  Finally, in accordance with local plant policy but not Phillips policy, no blind flange insert was used as a backup.  The insert would have stopped the flow of gas into the atmosphere if the valve had been opened.  Had any of those three procedures been executed properly, there would not have been an explosion that day.  According to the investigation, contract workers had not been adequately trained in the procedures they were charged with performing.

Additionally, there were significant design flaws in the reactor/collector system.  The valve system used had no mechanical redundancies; the single Demco ball valve was the sole cut-off point between the highly pressurized reactor system and the atmosphere.  There was also a significant design flaw with the air hoses, as alluded to earlier.  Not only were the air hoses connected at the wrong time, but there was no physical barrier to prevent them from being connected the wrong way.  This is the same reason North American electrical plugs are mechanically keyed and can only be plugged in one way: it can be bad news if they are connected incorrectly!  Connecting the air hoses backward meant the valve went fully open instead of closed.  Both of these design flaws contributed to the gas release, and again, this incident would not have occurred if either flaw had been absent.
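
The reasoning about the three procedural safeguards (the air hoses, the lockout device and the blind flange insert) can be sketched in a few lines of Python.  This is only an illustration of the logic described in the investigation, not part of the Cause Map itself; the names `Barrier` and `release_occurs` are hypothetical, and the model assumes the three safeguards act independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Barrier:
    name: str
    effective: bool  # True if the safeguard was in place and working


def release_occurs(barriers):
    """In this simplified model, gas escapes only if every barrier fails."""
    return not any(b.effective for b in barriers)


# State on the day of the accident, as described in the text above.
actual = [
    Barrier("air hoses kept disconnected from the valve", False),
    Barrier("lockout device on the valve", False),
    Barrier("blind flange insert", False),
]

print(release_occurs(actual))  # True: all three safeguards failed

# Restoring any single barrier prevents the release in this model.
for i, barrier in enumerate(actual):
    what_if = list(actual)
    what_if[i] = Barrier(barrier.name, True)
    print(barrier.name, "->", release_occurs(what_if))
```

Each "what if" line prints `False`, which mirrors the point made above: any one of the three procedures, executed properly, would have been enough.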

In hindsight, one can see how multiple problems led to such devastating results.  To easily understand the underlying reasons behind the Phillips 66 Explosion of 1989, a high-level Cause Map provides a quick overview of the event.  Breaking a section of the Cause Map down further can provide significant insight into the multiple reasons the event occurred.  The associated PDF for this case shows how different levels of a Cause Map can provide just the right amount of detail for understanding a complex problem such as this one.

The Phillips 66 explosion was a tragedy that could have been avoided.  The industrial safety standards that OSHA is charged with enforcing aim to prevent future tragedies like this one.  While a gradual safety-oriented transformation has come with some pain and a price tag, few will argue that such standards are unnecessary.  Industrial workers deserve to work in an environment where risk to their health has been reduced to the lowest practical level.

Shuttle Launch May Be Delayed Again

By ThinkReliability Staff

NASA’s plan to launch Discovery on its final mission continues to face setbacks.  As discussed in last week’s blog, the launch of Discovery was delayed past the originally planned launch window that closed on November 5 as the result of four separate issues.

One of these issues was a crack in a stringer, one of the metal supports on the external fuel tank.  NASA engineers have identified additional stringer cracks that must also be repaired prior to launch.  These cracks are typically fixed by cutting out the cracked metal and bolting in new pieces of aluminum called doublers because they are twice as thick as the original stringers.  The foam insulation that covers the stringers must then be reapplied.  The foam needs four days to cure, which makes it difficult to perform repairs quickly.

Adding to the complexity of these repairs is the fact that this is the first time they have been attempted on the launch pad. Similar repairs have been made many times, but they were performed in the factory where the fuel tanks were built.

Yesterday, NASA stated that the earliest launch date would be the morning of December 3.  If Discovery isn’t ready by December 5, the launch window will close and the next opportunity to launch will be late February.

NASA has stated that as long as Discovery is launched during the early December window, the overall schedule for the final shuttle missions shouldn’t be affected.  Currently, Endeavour is scheduled to launch during the February window, and it will have to be delayed if the launch of Discovery slips until February.

In a situation like this, NASA needs to focus on the technical issues involved in the repairs, but it also needs to develop a work schedule that incorporates all the possible contingencies.  Just scheduling everything is no easy feat.  In addition to the schedule of the remaining shuttle flights, the timing of Discovery’s launch will affect the schedule of work at the International Space Station because Discovery’s mission includes delivering and installing a new module and delivering critical spare components.

When dealing with a complex process, it can help to build a Process Map to lay out all possible scenarios and ensure that resources are allocated in the most efficient way.  In the same way that a Cause Map can help the root cause analysis process run more smoothly and effectively, a Process Map that clearly lays out how a process should happen can help provide direction, especially during a work process with complicated choices and many possible contingencies.

How a Shuttle is Launched

By ThinkReliability Staff

The Space Shuttle Discovery is expected to be launched November 4th, assuming all goes well.  But what does “all going well” entail?  Some things are obvious and well-known, such as the need to ensure that the weather is acceptable for launch.  However, with an operation as complex and risky as launching a shuttle, there are a lot of steps to make sure that the launch goes off smoothly.

To show the steps involved in shuttle launch preparation, we can prepare a Process Map.  Although a Process Map looks like a Cause Map, its purpose is to show the steps that must be accomplished, in order, for successful completion of a process.  We can begin a Process Map with only one box, the process that we’ll be detailing.  Here, it’s the “Launch Preparation Process”.  We break up the process into more detailed steps in order to provide more useful information about a process.  Here the information used was from Wired Magazine and NASA’s Launch Blog (where they’ll be providing up-to-date details as the launch process begins).

Here we break down the Shuttle Launch Process into 9 steps, though we could continue to add more detail until we had hundreds of steps.  Some of the steps have been added (or updated) based on issues with previous missions.  For example, on Apollo 1, oxygen on board caught fire during a test and killed the crew.  Now one of the first steps is an oxygen purge, where oxygen in the payload bay and aft compartments is replaced with nitrogen.  On Challenger, concerns about equipment integrity in extremely cold weather were not brought to higher-ups.  Now there’s a Launch Readiness Check, where more than 20 representatives of contractor organizations and departments within NASA are asked to verify their readiness for launch.  This allows all contributors to have a say regarding the launch.  One of the last steps is the weather check we mentioned above.

Similar to the Launch Readiness Check, we can add additional detail to the Launch Status Check.  This step can be further broken down to show the checks of systems and positions that must be completed before the Launch Status step can be considered complete.  Each step within each Process Map shown here can be broken down into even more detail, depending on the complexity of the process and the need for a detailed Process Map.  In the case of an extremely complex process such as this one, there may be several versions of the Process Map, such as an overview of the entire process (like we’ve shown here) and a detailed version of each step to be provided to the personnel who are performing and overseeing that portion of the process.  As you can see, a lot of planning and checking goes into the launch preparations!
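
For readers who think in code, the idea of a Process Map whose steps break down into more detailed steps can be sketched as a small nested structure in Python.  This is a rough illustration, not the actual map: only the four steps named in this post are included (the other steps are omitted), their relative order beyond "first" and "last" is an assumption, and the two sub-steps under the Launch Status Check are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    substeps: list = field(default_factory=list)  # more detailed Steps, in order


def outline(step, depth=0):
    """Return the step and all of its sub-steps as indented lines."""
    lines = ["  " * depth + step.name]
    for sub in step.substeps:
        lines.extend(outline(sub, depth + 1))
    return lines


# Partial map: only steps mentioned in the post; ordering is assumed.
launch_prep = Step("Launch Preparation Process", [
    Step("Oxygen purge"),
    Step("Launch Readiness Check"),
    Step("Launch Status Check", [
        Step("Systems checks (placeholder)"),
        Step("Position checks (placeholder)"),
    ]),
    Step("Weather check"),
])

print("\n".join(outline(launch_prep)))
```

An overview version of the map would print only the top level, while the detailed version for one step would start `outline` from that step instead.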

Mine Deaths in China

By ThinkReliability Staff

The successful rescue of all 33 miners trapped in a Chilean mine has been followed by some unhappy mine news from China.  A gas blast in the early morning of October 16, 2010 is known to have killed 26 miners, and the 11 miners unaccounted for are believed dead.  In addition to these impacts to the safety goals, the environmental goal is impacted by the extremely high levels of methane gas, the customer service and production goals are impacted by the closure of the mine, and the property and labor goals are impacted by the rescue efforts that have been required.  Unfortunately, this is not an uncommon occurrence.  It is estimated that 2,600 people were killed in Chinese mine accidents last year.

It is believed that most of the miners died of suffocation.  In addition to the lack of oxygen from the extremely high levels of methane (40% compared to the normal level of 1%), the miners were buried by coal dust released by the gas blast.  The miners were trapped in the mine by the gas blast, the cause of which is as yet unknown.  This is a question that additional investigation will try to answer.  Additionally, more information is needed about the high levels of methane.  The rescuers had difficulty reducing the levels of methane because coal dust was blocking an access shaft, but levels were high prior to the blast, for reasons that are unclear.

More detail can be added to this Cause Map as the analysis continues.  As with any investigation, the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals.  Because of the high number of deaths (and the high frequency of this type of incident), the Cause Map should end up very detailed, so that it offers as many candidate solutions as possible and the best ones can be implemented to reduce these types of incidents.

Dig Deeper to get to the Causes of the Oil Spill

By ThinkReliability Staff

On Sunday (September 26th, 2010) the lead investigator for the Deepwater Horizon oil spill was questioned by a National Academy of Engineering committee.  The committee brought up concerns that the investigation that had been performed was not adequate to address all the causes of the spill.  Said the lead oil spill investigator: “It is clear that you could go further into the analysis . . . this does not represent a complete penetration into potentially deeper issues.”

Specifically, the committee was concerned that the study focused on decisions made on the rig (generally by personnel who worked for other companies) but did not adequately consider input from these companies.  The study also avoided organizational issues that may have contributed to the spill.

In circumstances such as this one, where an extremely complicated event requires an organization to spend most of its resources fixing the immediate problem, an interim report (which may not delve deeply into underlying organizational issues or obtain a full spectrum of interviews) may be appropriate.  However, it’s just an interim report and should not be treated as the final analysis of the causes relating to an issue.  The organizations involved need to ensure that once the immediate actions (stopping the spill, completing the cleanup, and compensating victims) are complete, an in-depth investigation commensurate with the impact of the issue is performed.

In instances such as these, causes relating to an incident need to be unearthed ruthlessly and distributed freely.  This is generally why a governmental organization will perform these in-depth reviews.  The personnel involved in the investigation must not be limited to only one organization, but rather all organizations that are involved in the incident.  Once action items that will improve safety and processes have been determined, they must be freely distributed to all other organizations participating in similar endeavors.  The alternative – to wait until similar disasters happen at other sites – is unacceptable.

Largest Egg Recall In US History

By Kim Smiley

Two Iowa farms have recently been at the center of the largest egg recall in US history.  Over half a billion eggs were recalled in August after more than 1,500 people were sickened by eggs tainted with salmonella.

How did this happen?  Where did the contamination come from?  How did tainted eggs make it onto supermarket shelves?

The investigation is still ongoing, but we can begin a root cause analysis of this problem by building a Cause Map.  A Cause Map provides a simple visual explanation of all the causes that were required to produce the incident.  A good place to start building a Cause Map is to identify the impacts to the organizational goals.  Causes are then added to the map by asking “why” questions.  (Click on the “Download PDF” button to view a Cause Map of this issue.)

In this example, we’ll consider the safety goal first.  The safety goal was impacted because about 1,500 people got sick after consuming eggs that were contaminated with salmonella.  Why did they eat contaminated eggs?  Contaminated eggs were eaten because they were sold.  Why?  Because the eggs were contaminated at some point and there was inadequate regulation to prevent them from being sold.
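
The chain of “why” questions above can also be written down as a simple data structure.  The Python sketch below is one possible way to record it, not the Cause Mapping method itself: the dictionary `cause_map` and the helpers `ask_why` and `open_questions` are hypothetical names, and only the causes stated in the paragraph above are included.

```python
# Each key is an effect; its list holds the causes found by asking "why?".
cause_map = {
    "Safety goal impacted": ["People got sick"],
    "People got sick": ["People ate contaminated eggs"],
    "People ate contaminated eggs": ["Contaminated eggs were sold"],
    "Contaminated eggs were sold": [
        "Eggs were contaminated",
        "Inadequate regulation to prevent sale",
    ],
}


def ask_why(effect, depth=0):
    """Return the cause chain under `effect` as indented lines."""
    lines = ["  " * depth + effect]
    for cause in cause_map.get(effect, []):
        lines.extend(ask_why(cause, depth + 1))
    return lines


def open_questions(effect):
    """Return the causes that have no recorded explanation yet."""
    causes = cause_map.get(effect, [])
    if not causes:
        return [effect]
    found = []
    for cause in causes:
        found.extend(open_questions(cause))
    return found


print("\n".join(ask_why("Safety goal impacted")))
print(open_questions("Safety goal impacted"))
```

The second call lists the boxes where the next “why” still needs to be asked; as the investigation turns up more information, new entries can be added to the dictionary without changing the helpers.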

Investigators are still determining the exact source of the contamination, but there is significant information available that can be added to the Cause Map.  The eggs were contaminated with salmonella because the hens laying the eggs were contaminated.  (This strain of bacteria can be found inside a chicken’s ovaries and is passed on to eggs.)  The exact source that contaminated the hens is still being determined, but testing by the FDA has determined that the hens were likely contaminated after arriving at the farms.  FDA investigators have found a number of sanitation violations, including the presence of rodents, which are known carriers of salmonella.  Salmonella is not passed from hen to hen, but is typically passed from rodent droppings to chickens.

As more information becomes available, we can add to the Cause Map.  Hopefully, the investigation will result in solutions that can be applied to prevent this situation from occurring again.

A Serendipitous Solution

By Kim Smiley

Investigating the recent massive oil spill in the Gulf of Mexico is a tall order.  There are many contributing causes and a multitude of creative solutions are going to be needed to restore the environment.

During any investigation of this magnitude, there are guaranteed to be a few surprises.  And the Deepwater Horizon oil spill is no exception.

Scientists have discovered a previously unknown type of oil-eating bacteria feasting on oil from the spill.

This microbe differs from previously studied varieties because it doesn’t consume large quantities of oxygen along with the oil.  Oxygen consumption is a concern because oxygen is needed in the sea to support life.

This microbe also thrives in cold water temperatures associated with the deep ocean, which might explain why it hasn’t been seen before.  Some scientists are theorizing that the microbe adapted in the deep ocean to consume the oil that naturally seeped from the ocean floor.  Since the huge influx of oil to the water, the bacteria populations have exploded.

Scientists are in disagreement over how much oil remains in the Gulf, but there is no doubt that less is better.

This serendipitous solution is a welcome addition to the cleanup efforts.  Obviously, there are many other solutions that will be needed, but anything that safely reduces the overall amount of oil is a positive development.  Hopefully, with some additional research, this microbe could be a potential solution to future incidents.

When performing an investigation, the unexpected sometimes happens.  The better understood the problem is, the easier it is to adapt to any new information.  The Cause Mapping method of root cause analysis is an effective way to organize all information needed during an investigation.  Clearly understanding the causes that contribute to an incident will allow an organization to adapt as new information becomes available and make sure that resources are used in the most efficient ways when implementing solutions.

Washing Machine Failure

(This week, we are proud to announce a Cause Map by a guest blogger, Bill Graham.  Thanks, Bill!)

While completing household chores in the spring of 2010, a housewife found her front-load washing machine stopped with water standing in the clothing.  Inspection of the machine revealed that the washing machine’s drain pump had failed.  Because the washer was less than two years old, the family decided to attempt to repair the machine instead of replacing it.  A replacement pump was not locally available, so the family found and ordered a pump from an Internet dealer.  Delivery time for the pump was approximately one week, during which time the household laundry chore could not be completed and some of the family’s favorite clothing could not be worn because it had not been laundered.  On receiving the new pump, Dad immediately removed the broken pump and found, to his chagrin, a small, thin guitar pick in the suction of the old pump.  Upon discovery of the guitar pick, the family’s children reported that the pick had been left in the pocket of the pants that were being washed at the time of the pump’s failure.  The new pump was installed and the laundry chore resumed for the household.

While most cause analysis programs would identify the guitar pick as the root cause of the washing machine’s failure, Cause Mapping unveils all of the event’s contributing factors and shows which efficient, cost-effective measures might be taken to avert a similar failure.  For example, if all the family’s children aspire to be guitar players, then a top-load washer may better suit their lifestyle while also averting the same mishap.  Or maybe the family should consider wearing pocket-less clothing.  Or maybe all family members should assume a bigger role in completing the household laundry chore.  Whichever solution is chosen, the impact of these and all contributing causes is easily understood when the event is Cause Mapped.

Spacewalk Delay for Ammonia Leak

By Kim Smiley

Astronauts at the International Space Station ran into problems during a planned replacement of a broken ammonia cooling pump on August 7, 2010.  In order to replace the broken pump, four ammonia hoses and five electrical cables needed to be disconnected.  One of the hoses could not be removed because of a jammed fitting.  When an astronaut was able to disconnect it by hitting the fitting with a hammer, it caused an ammonia leak.

Ammonia is toxic, so the leak impacted both the safety and environmental goals.  Because the broken pump kept one cooling system from working, there was a risk of having to evacuate the space station, should the other system (which was the same age) fail.  This can be considered an impact to the customer service goal.  The repair had to be delayed, which is an impact to the production/schedule goal.  The loss of a redundant system is an impact to the property/equipment goal.  The extended spacewalk is an impact to the labor/time goal.

Once we fill out the outline with the impacts to the goals and information regarding the problem, we can go on to the Cause Map.  The ammonia leak was caused by an unknown leak path and by the fitting being removed with a hammer.  The fitting was removed with a hammer because it was jammed and had to be disconnected in order for the broken pump to be replaced.  As we’re not aware of what caused the pump to break (this information will likely be discovered now that the pump has been removed), we leave a question mark on the map, to fill in later.

The failed cooling pump also caused the loss of one cooling system.  If the other system, which is near the end of its expected life, were to fail, this would require evacuation from the station.

To aid in our understanding of this incident, we can create a very simple process map of the pump replacement.  The red firework shows the step in the replacement that didn’t go well.  To view the outline, Cause Map and Process Map, click on “Download PDF” above.