All posts by ThinkReliability Staff

ThinkReliability specializes in applying root cause analysis to solve all types of problems.  We investigate errors, defects, failures, losses, outages and incidents in a wide variety of industries.  Our Cause Mapping method of root cause analysis captures the complete investigation, along with the best solutions, in an easy-to-understand format.  ThinkReliability provides investigation services and root cause analysis training to clients around the world and is considered the trusted authority on the subject.

The Phillips 66 Explosion: Planning for Emergencies

By ThinkReliability Staff

All businesses strive to make their processes as efficient as possible and to maximize productivity.  Minimizing excess inventory seems only sensible, as does placing process equipment in a logical manner to minimize transit time between machines.  However, when productivity consistently takes precedence over safety, seemingly insignificant decisions can snowball when it matters most.

Using the Phillips 66 explosion of 1989 as an example, it is easy to see how numerous efficiency-related decisions snowballed into a catastrophe.  Examining different branches of the Cause Map highlights areas where those shortcuts played a role.  Some branches focus on how the plant was laid out, how operations were run and how the firefighting system was designed.  Arguably, all of these areas were optimized for production efficiency, but they ended up contributing to a terrible explosion and hampering subsequent emergency efforts.

For instance, the Cause Map shows that the high number of fatalities was caused by more than just the initial explosion.  The OSHA investigation following the explosion highlighted contributing factors in the plant layout.  The plant was cited for having process equipment located too close together, in violation of generally accepted engineering practices.  While this no doubt maximized plant capacity, it made escape from the plant difficult and did not allow adequate time for emergency shutdown procedures to complete.  Additionally, high-occupancy structures, such as the control room and administrative building, were located unnecessarily close to the reactors and storage vessels.  Luckily, over 100 personnel were able to escape via alternate routes.  But luck is certainly not a reliable emergency plan; the plant should have been designed with safety in mind too.

Nearby ignition sources also contributed to the speed of the initial explosion, estimated to be within 90 to 120 seconds of the valve opening.  OSHA cited Phillips for not using due diligence in ensuring that potential sources of ignition were kept a safe distance from flammable materials or, alternatively, using testing procedures to ensure it was safe to bring such equipment into work zones.  The original spark source will never be known, but the investigation identified multiple possibilities.  These included a crane, forklift, catalyst activator, welding and cutting-torch equipment, vehicles and ordinary electrical gear.   While undoubtedly such a large cloud of volatile gas would have eventually found a spark, a proactive approach might have provided precious seconds for workers to escape.  All who died in the explosion were within 250 feet of the maintenance site.

Another factor contributing to the extensive plant damage was the inadequate water supply for firefighting, as detailed in the Cause Map.  When the plant was designed, the water system used in the HDPE process was the same one that was to be used in an emergency.  There is little doubt a single water system was selected to keep costs down.  Other shortcuts included placing regular-service fire system pump components above ground.  Of course, the explosion sheared the electrical cables and pipes controlling the system, rendering it unusable.  Not only was the design of the fire system flawed, it wasn’t even adequately maintained.  Of the three backup diesel pumps, only one was operational; one was out of fuel and the other simply didn’t work.  Because of these major flaws, emergency crews had to use hoses to pump water from remote sources.  The fire was not brought under control until 10 hours after the initial explosion.  As the Cause Map indicates, there may not have been such extensive damage had the water supply system been adequate.

There is a fine line between running processes at the utmost efficiency and taking shortcuts that can lead to dangerous situations.  Clearly, this was an instance where that line was crossed.

The Phillips 66 Explosion: The Rise of Process Safety Management in the Petrochemical Industry

By ThinkReliability Staff

Many of the industrial safety standards that we take for granted are the direct result of catastrophes of past decades.  Today there are strict regulations on asbestos handling, exposure limits for carcinogens, acceptable noise levels, the required use of personal protective equipment, and a slew of other safety issues.  The organization charged with enforcing those standards is the Occupational Safety and Health Administration – OSHA for short.

OSHA was founded in 1970 to promote and enforce workplace safety, and its stated mission is to “assure safe and healthful working conditions for working men and women.”  However, there was considerable controversy during its early years as it began spottily enforcing what were perceived as cumbersome and expensive regulations.  Notable events in the 1980s, such as the Union Carbide industrial accidents in Bhopal and West Virginia, made it clear to OSHA that fundamental changes were needed to develop more effective safety management systems.

This awareness led to the rise of what is now known as Process Safety Management (PSM).  This discipline covers how industries safely manage highly hazardous chemicals.  OSHA’s PSM standard sets forth multiple requirements, such as employee and contractor training, the use of hot work permits, and emergency planning.  Unfortunately, PSM was still a work in progress during the fall of 1989.

On October 23, 1989, the Phillips 66 chemical plant near Pasadena, Texas, then producing approximately 1.5 billion pounds of high-density polyethylene (HDPE) plastic each year, suffered a massive series of explosions.  Twenty-three workers died and hundreds were injured in an explosion that measured at least 3.5 on the Richter scale and destroyed much of the plant.  Many of the deficiencies identified at the Phillips 66 plant were in violation of OSHA’s PSM directives – directives which had been announced but had not yet been formally enacted.

Looking at the Phillips 66 Explosion Cause Map, one can see how a series of procedural errors occurred that fateful day.  Contract workers were busy performing a routine maintenance task of clearing out a blockage in a collection tank for the plastic pellets produced by the reactor.  The collection tank was removed, and work commenced that morning.  However, at some point just after lunch, the valve to the reactor system was opened, releasing an enormous gas cloud which ignited less than two minutes later.

The subsequent OSHA investigation highlighted numerous errors.  First, the air hoses used to activate the valve pneumatically were left near the maintenance site.  When the air hoses were connected backwards, they automatically opened the valve, releasing a huge cloud of volatile gas into the atmosphere.  It is unknown why the air hoses were reconnected at all.  Second, a lockout device, which physically prevents someone from opening a valve, had been installed by Phillips personnel the previous evening but was removed at some point prior to the accident.  Finally, consistent with local plant practice but not Phillips policy, no blind flange insert was used as a backup.  The insert would have stopped the flow of gas into the atmosphere if the valve had been opened.  Had any of those three procedures been executed properly, there would not have been an explosion that day.  According to the investigation, contract workers had not been adequately trained in the procedures they were charged with performing.

Additionally, there were significant design flaws in the reactor/collector system.  The valve system had no mechanical redundancies; the single Demco ball valve was the sole cut-off point between the highly pressurized reactor system and the atmosphere.  There was also a significant design flaw with the air hoses, as alluded to earlier.  Not only were the air hoses connected at the wrong time, but there was no physical barrier to prevent them from being connected the wrong way.  This is the same reason North American electrical plugs are mechanically keyed and can only be plugged in one way; connecting them incorrectly would be bad news.  Connecting the air hoses backward meant the valve went fully open instead of closed.  Both of these design flaws contributed to the gas release, and again, this incident would not have occurred had either flaw been absent.

In hindsight, one can see how multiple problems led to such devastating results.  To easily understand the underlying reasons behind the Phillips 66 Explosion of 1989, a high-level Cause Map provides a quick overview of the event.  Breaking a section of the Cause Map down further can provide significant insight into the multiple reasons the event occurred.  The associated PDF for this case shows how different levels of a Cause Map can provide just the right amount of detail for understanding a complex problem such as this one.

The Phillips 66 explosion was a tragedy that could have been avoided.  The industrial safety standards that OSHA is charged with enforcing aim to prevent future tragedies like this one.  While a gradual safety-oriented transformation has come with some pain and a price tag, few will argue that such standards are unnecessary.  Industrial workers deserve to work in an environment where risk to their health has been reduced to the most practical level.

Aging Natural Gas Pipeline Finally Fails

By ThinkReliability Staff

Few of us ever contemplate the complex system of utilities surrounding us.  The beauty of our modern standard of living is that there is usually little reason to think about those things.  The rare cases where power isn’t available at the flip of a switch, or fresh water at the turn of a faucet, usually make the local news.

Sadly, the community of San Bruno was faced with much more than simple inconvenience.  On September 9, 2010, an explosion ripped through the suburban community, ultimately killing eight people and destroying or damaging 100 homes.  The explosion was caused by a ruptured natural gas pipeline, and it appears that a slight increase in pipe pressure led to the final failure.  That change in pressure resulted from a glitch in maintenance procedures at a pipeline terminal.  While that glitch may have been the “straw that broke the camel’s back”, it is clear from the Cause Map analysis that the straw pile was already fairly high.

Based on National Transportation Safety Board reports, both poor pipe construction and inadequate record-keeping played a major role in the failure.  The pipes, at or near their life expectancy, were considered too thin even by the standards of the 1950s, when they were originally installed.  Furthermore, improper welding made the pipes susceptible to corrosion.  Compounding these issues was the fact that PG&E, the utility company responsible for the pipeline, wasn’t even aware that the San Bruno pipeline had such extensive welding.  This matters because allowable gas pressures are calculated from a number of inputs, including the construction of the pipeline.  Even that slight increase in pressure proved to be more than the aging pipe could handle.

Natural gas pipelines are fairly extensive in the United States, and with suburban sprawl many communities live close to these pipelines.  In fact, many states have already taken steps to prevent similar events from occurring in their communities.  Utility companies in Texas and Washington, for example, have been mandated to install newer pipelines.  Additionally, the federal government requires that newly constructed pipelines accommodate inspection by “smart pigs” – in-line devices that travel through a pipeline to inspect it from the inside.  However, modernizing this aging infrastructure will be expensive for many communities.

Perhaps there are easy, inexpensive interim solutions available.  The Cause Map analysis identifies all causes leading to the explosion, and then provides a systematic method for developing solutions.  Hopefully some of the solutions generated will prevent future disasters, like the one in San Bruno.

Printing Issues with New $100 Bill

By ThinkReliability Staff

In October, the U.S. government discovered that some of the newly redesigned $100 bills were coming off the printing press with blank spots caused by creases in the paper.  The problem appeared at both Bureau of Engraving and Printing sites, in Washington, D.C., and Fort Worth, Texas.  The government has recently announced that this will delay the introduction of these bills, which had been planned for the spring of 2011.

Additionally, the bills that have blank spots will have to be shredded and reprinted.  Because of complex new security features aimed at deterring counterfeiters (such as a 3-D security strip woven into the paper), each bill costs about $0.12 to print.  Hundreds of millions of bills have been printed, putting the possible cost of this issue in the millions of dollars.
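
As a rough, back-of-the-envelope illustration of that scale (the exact count of flawed notes has not been released, so the 100-million figure below is purely an assumption):

```python
# Illustrative cost estimate for the flawed $100 bills.
# Only the ~$0.12 per-note printing cost is cited above; the note count
# is an assumed, hypothetical figure used for illustration.
cost_per_note = 0.12          # dollars to print one redesigned $100 note
flawed_notes = 100_000_000    # assumed number of creased/blank notes

reprint_cost = cost_per_note * flawed_notes
print(f"Estimated reprinting cost: ${reprint_cost:,.0f}")  # about $12,000,000
```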

 Although issues with currency are expensive, they’re also rare. The last time that a printing issue caused a delay in the introduction of a new bill was 1987.  It’s unclear at this point when the bills will finally be released.

It’s also unclear what happened to cause the paper to crease, creating the blank spots in printing.  The added complexity of the new security features is being examined, as are issues with the paper and the printing machines.  However, because similar errors occurred at both printing sites, it’s unlikely that there is a problem specific to just one site’s machines.  Although the investigation into what caused the blank spots is ongoing, we can begin a root cause analysis with what is currently known.  Once more information is discovered, the Cause Map can be updated.

Because of the high potential financial losses from this issue, the investigation will likely go into great detail, and fully determining what happened will take some time.  The Cause Map and outline for the information known now can be viewed by clicking “Download PDF” above.

Cement Failure Contributed to Deepwater Horizon Explosion

By ThinkReliability Staff

The investigation to determine the causes behind the April 20, 2010 explosion of the Deepwater Horizon oil rig and the resulting oil spill is still underway.  (A previous blog discussed the BP report and the initial findings from their investigation.)

The newest piece of information that has come to light is that the cement used to seal the production casing wasn’t properly tested.  It was previously assumed that the cement must have failed, or the hydrocarbons would not have been able to leak into the well and subsequently feed the massive explosion that destroyed the oil rig.  Now more information is becoming available that explains why the cement failed.

The well was cemented with nitrogen foam cement supplied by a contractor. Investigation by the presidential commission on the oil spill has revealed that the cement was not properly tested prior to use.  More significantly, the cement was found to be unstable when tested.

 The data indicates that the cement failure was a cause of the oil rig explosion, but was it the root cause?

It’s easy to see that the cement failure was not the only cause.  In addition to the failure of the cement, there were other things that had to occur for the accident to happen.  One of the most obvious is the failure of the blowout preventer.  Even if the cement failed and the hydrocarbons leaked into the well, a functioning blowout preventer would have blocked the leak path for the hydrocarbons and prevented this tragedy.

As with any incident of this magnitude, there is no single root cause; rather, there are a number of causes that contributed to the incident.  Determining all of those causes will allow a better understanding of the incident, which will hopefully lead to the development and implementation of better solutions to prevent similar accidents in the future.

Dig Deeper to get to the Causes of the Oil Spill

By ThinkReliability Staff

On Sunday (September 26th, 2010) the lead investigator for the Deepwater Horizon oil spill was questioned by a National Academy of Engineering committee.  The committee brought up concerns that the investigation that had been performed was not adequate to address all the causes of the spill.  Said the lead oil spill investigator: “It is clear that you could go further into the analysis . . . this does not represent a complete penetration into potentially deeper issues.”

Specifically, the committee was concerned that the study focused on decisions made on the rig (generally by personnel who worked for other companies) but did not adequately consider input from these companies.  The study also avoided organizational issues that may have contributed to the spill.

In circumstances such as this one, where an extremely complicated event requires an organization to spend most of its resources fixing the immediate problem, an interim report that does not delve deeply into underlying organizational issues or obtain a full spectrum of interviews may be appropriate.  However, it is just an interim report and should not be treated as the final analysis of the causes relating to an issue.  The organizations involved need to ensure that after the immediate actions – stopping the spill, completing the cleanup, and compensating victims – are complete, an in-depth report commensurate with the impact of the issue is performed.

In instances such as these, causes relating to an incident need to be unearthed ruthlessly and distributed freely.  This is generally why a governmental organization will perform these in-depth reviews.  The personnel involved in the investigation must not be limited to only one organization, but rather all organizations that are involved in the incident.  Once action items that will improve safety and processes have been determined, they must be freely distributed to all other organizations participating in similar endeavors.  The alternative – to wait until similar disasters happen at other sites – is unacceptable.

Washing Machine Failure

(This week, we are proud to announce a Cause Map by a guest blogger, Bill Graham.  Thanks, Bill!)

While completing household chores in the spring of 2010, a homemaker found her front-load washing machine stopped with water standing in the clothing.  Inspection of the machine revealed that the washing machine’s drain pump had failed.  Because the washer was less than two years old, the family decided to attempt a repair instead of replacing the machine.  A replacement pump was not locally available, so the family found and ordered one from an Internet dealer.  Delivery time for the pump was approximately one week, during which the household laundry could not be done and some of the family’s favorite clothing could not be worn because it had not been laundered.  On receiving the new pump, Dad immediately removed the broken pump and found, to his chagrin, a small, thin guitar pick in the suction of the old pump.  Upon discovery of the guitar pick, the family’s children reported that the pick had been left in the pocket of the pants that were being washed at the time of the pump’s failure.  The new pump was installed and the laundry chore resumed for the household.

While most cause analysis programs would identify the guitar pick as the root cause of the washing machine’s failure, Cause Mapping unveils all of the event’s contributing factors and the most efficient, cost-effective measures that might be taken to avert a similar failure.  For example, if all the family’s children aspire to be guitar players, then a top-load washer may better suit their lifestyle while also averting the same mishap.  Or maybe the family should consider wearing pocketless clothing.  Or maybe all family members should assume a bigger role in completing the household laundry chore.  Whichever solution is chosen, the impact of these and all contributing causes is easily understood when the event is Cause Mapped.

Containment Cap Removed from Gulf Oil Leak

By ThinkReliability Staff

Last Wednesday, another setback occurred in the attempt to stem the flow of oil in the Gulf of Mexico from the wellhead that was damaged when the Deepwater Horizon oil rig exploded on April 20 and sank 36 hours later.

The containment cap used to siphon oil from the damaged wellhead for the last three weeks had to be temporarily removed for more than 11 hours.  Before being removed, the containment system was capturing about 29,000 gallons of oil an hour.
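
For a rough sense of what that removal meant, here is a simple back-of-the-envelope sketch; it assumes the well continued flowing at about the rate the cap had been capturing, which is only an approximation:

```python
# Rough estimate of oil not captured while the containment cap was off.
# Assumes the flow stayed near the ~29,000 gal/hr the cap had been
# collecting; the actual uncaptured volume was not reported.
capture_rate_gph = 29_000   # gallons per hour previously siphoned by the cap
hours_removed = 11          # the cap was off for more than 11 hours

uncaptured_gallons = capture_rate_gph * hours_removed
print(f"Oil not captured: roughly {uncaptured_gallons:,} gallons")  # ~319,000 gallons
```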

So what happened?  Why remove a containment cap that had been working successfully?

A root cause analysis of this problem can be built as a Cause Map.  A Cause Map is started by considering the impacts to the goals and asking “why” questions to add causes.  In this example, the first goal we will consider is the environmental goal.  Obviously, the environmental goal is impacted because additional oil was released to the environment while the cap was off.

Continuing to ask “why” questions, we can add additional causes.  The cap was removed because the ship connected to the containment cap system needed to be moved away from the well; it needed to move because there was a safety concern about the potential for an explosion.

There was an explosion concern because there was evidence that flammable gas was flowing up from the wellhead: liquid was being pushed out of a valve in the containment system.  The gas was getting into the containment cap system because an underwater vent had been bumped by one of the remote-controlled submersible robots being used to monitor the damaged well.

More detail could be added to the Cause Map by continuing to ask “why” questions.  The detailed Cause Map could then be used to develop solutions that could be implemented to help prevent the problem from recurring.
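
To make the “why” chain concrete, here is a minimal sketch of the causes described above, represented as simple effect/cause pairs.  This is only an illustration of the idea, not ThinkReliability’s Cause Mapping software:

```python
# Minimal sketch of a Cause Map as a chain of effect -> cause links,
# built by repeatedly asking "why" starting from the impacted goal.
cause_map = [
    ("environmental goal impacted", "additional oil released to the environment"),
    ("additional oil released to the environment", "containment cap removed"),
    ("containment cap removed", "ship needed to move away from the well"),
    ("ship needed to move away from the well", "safety concern: potential explosion"),
    ("safety concern: potential explosion", "flammable gas flowing up into the containment system"),
    ("flammable gas flowing up into the containment system", "underwater vent bumped by a submersible robot"),
]

for effect, cause in cause_map:
    print(f"{effect}  <-- why? --  {cause}")
```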

Click on the “Download PDF” button above to view an initial Cause Map.

The containment cap was put back into place around 9 pm on June 23.  The efforts to contain and clean up the oil spill will continue for months and possibly years to come, but at least this small issue has been fixed.

Oil Rig Explosion

By ThinkReliability Staff

On April 20, 2010, at about 10 p.m., a huge explosion rocked a semi-submersible drilling rig about 40 miles off the coast of Louisiana in the Gulf of Mexico.  The oil rig, the Deepwater Horizon, was owned by Transocean Ltd. and leased to BP through September 2013.

The oil rig burned for about 36 hours before sinking.  One hundred twenty-six people were on the rig at the time of the explosion; eleven are missing and presumed dead, and four were critically injured.  Oil continues to leak from the wellhead, more than a mile underwater on the ocean floor, at an estimated rate of 42,000 gallons a day.

Remotely operated submersible vehicles were used to examine the wellhead.  The vehicles were also used in an effort to manually trigger the blowout preventer, which would close the wellhead and prevent any further release of oil.  The blowout preventer is a 450-ton valve assembly installed at the wellhead that is designed to automatically shut to prevent oil leaks in the event of an accident.  Attempts to manually close the blowout preventer have not been successful.

The other containment options being explored are drilling a separate relief well nearby to plug the flow at a location below the blowout preventer and building underwater domes that would contain the oil until it could be safely pumped to the surface for disposal.  Both of these alternatives are being actively pursued and will take months to complete.  It is estimated that 4.2 million gallons of oil will be released if the blowout preventer cannot be closed.
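
As a quick consistency check on those figures (a sketch only, assuming the leak rate stays constant), the 4.2-million-gallon estimate corresponds to roughly 100 days of flow at the quoted rate, which lines up with the months-long timeline for the alternatives:

```python
# Back-of-the-envelope check of the projected release volume.
# Uses only the figures quoted above and assumes a constant leak rate.
leak_rate_gpd = 42_000        # estimated gallons leaking per day
projected_total = 4_200_000   # gallons estimated if the blowout preventer stays open

days_of_flow = projected_total / leak_rate_gpd
print(f"Implied duration at that rate: {days_of_flow:.0f} days")  # ~100 days
```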

The cause of the explosion is unknown at this time.  An investigation is underway by the Coast Guard and the Minerals Management Service.

A preliminary root cause analysis can be started using the information that is known, and details can be added as they become available.  The analysis can be documented using a Cause Map, which is a simple, intuitive format that visually lays out all known causes for an incident.  The first step in building a Cause Map is to determine how the organizational goals were impacted by the incident.  Causes for each impacted goal are then determined to begin building the Cause Map.

In this case, the safety goal was impacted because 11 people were killed and several were injured.  The environmental goal was impacted because there was a significant oil release.  The materials goal was impacted because the $700 million oil rig is a complete loss, and the production/schedule goal was impacted because the oil drilling operation is shut down.
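
A minimal sketch of that first step, recording the impacted goals before any “why” questions are asked, might look like this (illustrative only, using the figures quoted above):

```python
# Illustrative outline of the first Cause Mapping step: record how each
# organizational goal was impacted. Each impact then becomes the starting
# point of a "why" chain on the Cause Map.
impacted_goals = {
    "Safety": "11 people killed, several injured",
    "Environmental": "significant oil release (~42,000 gallons/day)",
    "Materials": "$700 million oil rig a complete loss",
    "Production/Schedule": "oil drilling operation shut down",
}

for goal, impact in impacted_goals.items():
    print(f"{goal} goal impacted: {impact}")
```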

Click on the “Download PDF” button above to view an initial Cause Map.

Water Pollution from Sewer

By ThinkReliability Staff

Thanks in part to the Clean Water Act, passed in 1972 and revised in 2000, most residents of the United States have continual access to clean, safe water.  However, extenuating circumstances may result in pathogens remaining in drinking water – or contaminating swimming water – resulting in potential illnesses.  In fact, researchers estimate that up to 20 million people per year become ill due to ingesting pathogens in water.  In addition to the environmental impact of untreated sewage reaching waterways, up to 400,000 basements and thousands of roads have been flooded with untreated sewage.

These floods generally occur when sewer systems are overwhelmed or clogged.  A clogged sewer system can result from a buildup of leaves or other debris, including debris from illegal dumping.

An overwhelmed sewer system is generally the result of a high volume of water passing through the system.  As the population increases, the strain on the system increases as well.  Since most municipalities do not have the funds available to upgrade or replace their systems, an aging, undersized system is left to carry the load.  Systems are generally able to keep up with demand, except during times of high rainfall.  Many sewer systems handle both waste and rainwater through the same pipes.  When a heavy rainfall occurs, the system is overwhelmed, resulting in an overflow of sewage.  This overflow is often directed into the waterways.  Dumping untreated or partially treated sewage into waterways is illegal, but fines are hardly ever levied.  The federal government may be unwilling to levy fines against municipalities for illegal dumping, especially because federal funding to maintain sewer systems has decreased significantly.  With municipal budgets already stretched, dealing with aging sewer systems just isn’t happening.

However, there are some things that municipalities can do.  Green spaces (as opposed to paved areas) absorb rainfall, decreasing the amount directed into the sewer system.  By planning more green space, or better drainage, the amount of rainfall that actually enters the system can be reduced.  Additionally, municipalities could redirect rainfall to keep it out of the waste portion of the sewer system.  The cost of doing this may make it infeasible; however, calls for federal stimulus money for sewer repairs may finally enable municipalities to upgrade their systems.