Tag Archives: Cause Mapping

Loss of Network Cloud Compute Service

By ThinkReliability Staff

On April 21, 2011, some of Amazon’s Elastic Compute Cloud (EC2) customers experienced issues when a combination of events left the East region’s Elastic Block Store (EBS) unable to process read or write operations.  This seriously impacted Amazon’s customer service.  Massive efforts were undertaken, and services and most data were restored within 3 days.  Amazon has released its post-mortem analysis of these events.  Using the information provided, we can begin a visual root cause analysis, or Cause Map, laying out the event.

We begin with the affected goal.  Customer service was impacted because of the inability to process read or write operations.  This ability was lost due to a degraded EBS cluster.  (A cluster is a group of nodes, which are responsible for replicating data and processing read and write requests.)  The cluster was degraded by the failure of some nodes, and a high number of nodes searching for replicas.

At this point, we’ll look into the process to explain what’s going on.  When a user makes a request, a control plane accepts and processes that request to an EBS cluster. The cluster elects a node to be the primary replica of this data.  That node stores the data, and looks for other available nodes to make backup replicas.  If the node-to-node connection is lost, the primary replica searches for another node.  Once it has established connectivity with that node, the new node becomes another replica.  This process is continuous.
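
The replication process described above can be sketched in a few lines of Python.  This is a toy model, not Amazon’s implementation: the `Node` class, the `find_peer` and `re_mirror` helpers, and the node names are all hypothetical, chosen only to illustrate the search-then-replicate loop.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical storage node (illustrative only, not Amazon's design)."""
    name: str
    reachable: bool = True                      # can other nodes connect to it?
    replicas: set = field(default_factory=set)  # names of peers holding a copy


def find_peer(primary: Node, cluster: list) -> Optional[Node]:
    """Return the first reachable node that is not already a replica."""
    for node in cluster:
        if node is primary or node.name in primary.replicas:
            continue
        if node.reachable:
            return node
    return None


def re_mirror(primary: Node, cluster: list) -> bool:
    """Drop unreachable replicas, then try to establish a new one.

    Returns True if the primary ends up with at least one replica.
    """
    by_name = {n.name: n for n in cluster}
    primary.replicas = {r for r in primary.replicas if by_name[r].reachable}
    if not primary.replicas:
        peer = find_peer(primary, cluster)
        if peer is not None:
            primary.replicas.add(peer.name)
    return bool(primary.replicas)


cluster = [Node("a"), Node("b"), Node("c")]
primary = cluster[0]
primary.replicas.add("b")

cluster[1].reachable = False            # connection to replica "b" is lost
healthy = re_mirror(primary, cluster)   # the primary finds "c" instead
```

If no other node is reachable, `re_mirror` returns `False` and the primary has to keep searching.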

In this case, a high number of nodes were searching for replicas because they had lost connection to the other nodes.  Based on the process discussed above, the nodes then began a search for other nodes.  However, they were unable to find any other nodes because the network was unavailable (so the nodes could not communicate with each other).  The nodes had a long time-out period for searching for other nodes, so their search continued, and more nodes lost communication and began a search, increasing the volume of search requests.

The network communication was lost because data was shifted off the primary network.  This was caused by an error during a network configuration change to upgrade the capacity of the primary network.  The data should have been transferred to a redundant router on the primary network but was instead transferred to the secondary network.  The secondary network did not have sufficient capacity to handle all the data and so was unable to maintain connectivity.

In addition to a large number of nodes searching for other nodes, the EBS cluster was impacted by node failures.  Some nodes failed because of a race condition that caused a node to fail when it attempted to process multiple concurrent requests for replicas.  These requests were caused by the situation above.  Additionally, the nodes failing led to more nodes losing their replicas, compounding the difficulty of recovering from this event.

Service is back to normal, and Amazon has made some changes to prevent this type of issue from recurring.  Immediately, the data was shifted back to the primary network and the error which caused the shifting was corrected.  Additional capacity was added to prevent the EBS cluster from being overwhelmed.  The retry logic which resulted in the nodes continuing to search for long periods of time has been modified, and the source of the race condition resulting in the failure of the nodes has been identified and repaired.
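
The details of the modified retry logic are not given here.  One common, general technique for keeping retries from piling up is capped exponential backoff with random jitter.  The sketch below illustrates that technique only.  It is not Amazon’s fix, and the function names and the `base` and `cap` values are assumptions made for the example.

```python
import random


def backoff_ceilings(attempts, base=1.0, cap=60.0):
    """Upper bound (seconds) on the wait before each retry: doubles, then caps."""
    return [min(cap, base * 2 ** attempt) for attempt in range(attempts)]


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Actual waits: a random value between 0 and each ceiling ("full jitter").

    The randomness spreads retries out so that many nodes which lost
    connectivity at the same moment do not all retry at the same moment.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, ceiling)
            for ceiling in backoff_ceilings(attempts, base, cap)]


ceilings = backoff_ceilings(8)   # base and cap are illustrative values
delays = backoff_delays(8, rng=random.Random(0))
```

With these illustrative values the ceiling doubles from 1 second up to the 60 second cap, so a node that cannot find a peer asks less and less often instead of searching continuously.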

View the root cause analysis investigation of this event, including an outline, timeline, Cause Map, solutions and Process Map, by clicking “Download PDF” above.

Plane Clips Another While Taxiing at JFK Airport

By Kim Smiley

Around 8:30 pm on April 11, 2011, a large passenger airplane taxiing at John F. Kennedy Airport in New York clipped the wing of a smaller plane.  The larger plane involved in the incident was an Airbus A380 carrying 485 passengers and 25 crew members.  The smaller plane was a Bombardier CRJ carrying 52 passengers and 4 crew members at the time it was clipped.

At the time of the accident, the Airbus was taxiing to take off and the CRJ had recently landed and was waiting to park.  The incident was caught on amateur video and it appears that the left wing tip of the Airbus struck the left horizontal stabilizer of the CRJ. No injuries were reported, but both planes sustained some damage.

After the planes made contact, the fire department responded as a precautionary measure.  Passengers were deplaned from the Airbus so that the planes could be inspected and information could be gathered to support the investigation.

At this time there is limited information available about what caused this incident, but the National Transportation Safety Board (NTSB) has begun an investigation.  The NTSB has requested flight recorders from both airplanes and also plans to review the air traffic control tapes and the ground movement radar data to determine how this happened.

Even though the investigation is just getting started, it is still possible to create a Cause Map based on what is known.  The first step is to create an Outline of the event by determining the impact to the organization’s goals.  In this example, the Safety Goal was impacted because there was the potential for injuries, the Customer Service Goal was impacted because the passengers were unable to reach their destination, the Production Schedule Goal was impacted because the flight was unable to depart, and the Material and Labor Goal was impacted because there was damage to both planes.

From this point, Causes can be added to the Cause Map by asking “why” questions.  Missing information can be noted by adding a Cause box with a “?”.  Any additional information can be added later.  To see an initial Cause Map of this incident and the Outline, click on “Download PDF” above.

Grounding the 737’s: SWA Flight 812

By ThinkReliability Staff

As new information comes to light, processes need to be reevaluated.  A hole in the fuselage of a 15-year-old Boeing 737-300 led to the emergency descent of Southwest Airlines Flight 812.  737’s have been grounded as federal investigators determine why the hole appeared.  At the moment, consensus is that a lap joint supporting the top of the fuselage cracked.

While the investigation is still in the early stages, it appears that stress fatigue caused a lap joint to fail.  Stress fatigue is a well-known phenomenon, caused in aircraft by the repeated pressurization and depressurization that occur with each takeoff and landing.  Mechanical engineers designing the aircraft would have been well aware of this phenomenon.  The S-N curve, which plots a metal’s expected lifespan vs. stress, has been used for well over a century.

Just as a car needs preventative maintenance, planes are inspected regularly for parts that are ready to fail.  However, the crack in the lap joint wasn’t detected during routine maintenance.  In fact, that joint wasn’t even checked.  It wasn’t an oversight, however.  Often the design engineers also set the maintenance schedule, because they hold the expertise needed to determine a reasonable procedure.  The engineers didn’t expect the part to fail for at least 20,000 more flight hours.  At the moment, it’s unclear why it failed so much earlier than expected.

In response to the incident, the FAA has grounded all similar aircraft and ordered inspections of aircraft nearing 30,000 flight hours.  Cracks have been found in 5 of the 80 grounded aircraft so far.  However, a looming concern is how to deal with 737’s not based in the United States, and therefore outside the FAA’s jurisdiction.

Issues at Fukushima Daiichi Unit 3

By ThinkReliability Staff

There are many complex events occurring with some of Japan’s nuclear power plants as a result of the earthquake and tsunami on March 11, 2011.  Although the issues are still very much ongoing, it is possible to begin a root cause analysis of the events and issues.  In order to clearly show one issue, our analysis within this blog is limited to the issues affecting Fukushima Daiichi Unit 3.  This is not to minimize the issues occurring at the other plants and units, but rather to clearly demonstrate the cause-and-effect within one small piece of the overall picture.

The issues surrounding Unit 3 are extremely complex.  In events such as these, where many events contribute to the issues, it can be helpful to make a timeline of events.  A timeline of the events so far can be seen by clicking “Download PDF” above.  A timeline can not only help to clarify the order of contributing events, it can also help create the Cause Map, or visual root cause analysis.  To show how the events on the timeline fit into the Cause Map, some of the entries are denoted with numbers, which are matched to the same events on the Cause Map.  Notice that in general, because Cause Maps build from right to left with time, earlier entries are found to the right of newer events.  For example, the earthquake was the cause of the tsunami, so the earthquake is to the right of the tsunami on the map.  Many of the timeline events are causes, but some are also solutions.  For example, the venting of the reactor is a solution to the high pressure.  (It also becomes a cause on the map.)

A similar analysis could be put together for all of the units affected by the earthquake, tsunami and resulting events.  Parts of this Cause Map could be reused, as many of the issues affecting the other plants and units are similar to those shown here.  It would also be possible to build a larger Cause Map including all impacts from the earthquake.

The impact to goals needs to be determined prior to building a Cause Map. As a direct result of the events at Unit 3, 7 workers were injured.  This is an impact to the worker safety goal.  There is the potential for health effects to the population, which is an impact to the public safety goal.  The environmental goal was impacted due to the release of radioactivity into the environment.  The customer service goal was impacted due to evacuations and rolling blackouts, caused by the loss of electrical production capacity, which is an impact to the production goal.  The loss of capacity was caused by catastrophic damage to the plant, which is an impact to the property goal.  Additionally, the massive effort to cool the reactor is an impact to the labor goal.

The worker safety and property goals were impacted because of a hydrogen explosion, which was caused by a buildup of pressure in the plant, caused by increasing reactor temperature.  Heat continues to be generated by a nuclear reactor, even after it is shut down, as a natural part of the operating process.  In this case, the normal cooling supply was lost when external power lines were knocked down by the tsunami (which was caused by the earthquake).  The tsunami also apparently damaged the diesel generators which powered the emergency cooling system.  The backup to the emergency cooling supply stopped automatically and was unable to be restarted, for reasons that are as yet unknown.

The outline, timeline and Cause Map shown on the PDF are extremely simplified.  Part of this simplification is due to the fact that the event is still ongoing and not all information is known or has been released.  Once more information becomes available, it can be added to the analysis, or the analysis can be revised.

To learn more about the reactor issues at Fukushima Daiichi, view our video summary.  To see a blog about the impact of the fallout on the health of babies in the US, see our healthcare blog.

Two Killed in Barge/Tour Boat Collision

By ThinkReliability Staff

On July 7, 2010, a barge being propelled by a tug boat collided with a tour boat that had dropped anchor in the Delaware River.  As a result of the collision, two passengers on the tour boat were killed and twenty-six were injured.  The tour boat sank in 55 feet of water.

Detail regarding the incident has just been released in an updated NTSB report.  We can use the information in this report to begin a Cause Map, or visual root cause analysis.  The information in the report can also point us in the direction of important questions that remain to be answered to determine exactly what happened and, most importantly, how incidents like these can be prevented in the future.

In this case, a tour boat had dropped anchor to deal with mechanical problems.  According to the tour boat crew’s testimony and radio recordings, the tour boat crew attempted to get in touch with the tug boat by yelling and making radio calls.  Neither was answered or apparently noticed.  The barge that was being propelled by the tug boat crashed into the tour boat, resulting in deaths, injuries and loss of the tour boat.

The lookout on the tug boat was inadequate (had it been adequate, the tug boat would have noticed the tour boat in time to avoid the collision).  The report has determined that the tug boat master was off-duty and below-deck at the time of the collision.  According to cell phone records, the mate who was on lookout duty was on a phone call at the time of the collision and had made several phone calls during his duty. The inadequate lookout combined with the inability of the tour boat to make contact with the tug boat resulted in the collision.

There are two obvious areas where more detail is needed in the Cause Map to determine what was going on that led to the issues on the tug boat.  Specifically, why was the lookout on the cell phone, and why wasn’t the tour boat able to contact the tug boat through the radio?  Because of the strict requirements for lookouts on marine duty, there is also an ongoing criminal investigation into the lookout’s actions.  When the final NTSB report is issued, and the criminal case is closed, these questions should be answered.  More detail can be added to this Cause Map as the analysis continues.  As with any investigation, the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals.

San Francisco’s Stinking Sewers

By ThinkReliability Staff

The Golden Gate City is well known for its ground-breaking, environmentally-friendly initiatives.  In 2007 San Francisco outlawed the use of plastic bags at major grocery stores.  The city also mandated compulsory recycling and composting programs in 2009.  Both ordinances were the first laws of their kind in the nation, and both were criticized by some for being overly aggressive.  Likewise, San Francisco’s latest initiative, to reduce city water usage by encouraging the use of low-flow toilets, has faced harsh criticism.

Recently San Francisco began offering substantial rebates to homeowners and businesses to install high efficiency toilets (HETs).  These types of toilet use 1.28 gallons or less per flush (gpf), down from the 1.6 gpf versions required today by federal law and the even older 3.4 gpf toilets from decades ago.  That means that an average home user will save between 3,800 and 5,000 gallons of water per year per person.  In dollars, that’s a savings of $90 annually for a family of four.  This can quickly justify the cost of a new commode, since a toilet is expected to last 20 years.
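
The per-person savings can be checked with simple arithmetic.  The sketch below uses the flush volumes quoted above; the figure of 5 flushes per person per day is an assumption made for illustration, not a number from the city.

```python
DAYS_PER_YEAR = 365


def annual_savings_gallons(old_gpf, new_gpf, flushes_per_day):
    """Gallons saved per person per year after replacing a toilet."""
    return (old_gpf - new_gpf) * flushes_per_day * DAYS_PER_YEAR


FLUSHES_PER_DAY = 5  # assumption for illustration, not from the article

# Replacing an older 3.4 gpf toilet with a 1.28 gpf HET
from_old = annual_savings_gallons(3.4, 1.28, FLUSHES_PER_DAY)

# Replacing a current-standard 1.6 gpf toilet with a 1.28 gpf HET
from_current = annual_savings_gallons(1.6, 1.28, FLUSHES_PER_DAY)
```

Under that assumption, replacing a 3.4 gpf toilet saves roughly 3,870 gallons per person per year, which is in line with the low end of the range quoted above.  Replacing a 1.6 gpf toilet saves far less, about 580 gallons, so the quoted range appears to describe households upgrading from the older models.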

Aside from cost savings, there are obvious environmental benefits to reduced water use.  The city initially undertook the HET rebate initiative to decrease the amount of water used overall by the city and the amount of wastewater requiring treatment.  They were successful, and water usage decreased.  In fact, the city’s Public Utilities Commission stated that San Francisco residents reduced their water consumption by 20 million gallons of water last year.  San Francisco last year used approximately 215 million gallons per day.  This also met other goals the city had, such as reducing costs to consumers.  Unintentionally though, the HET rebate initiative impacted a different goal – Customer Service.

As shown on the associated Cause Map, reduced water flow had a series of other effects.  While water consumption – and presumably waste water disposal – shrank significantly, waste production has remained constant.  Despite $100M in sewage systems upgrades over the past five years, current water flow rates are not high enough to keep things moving through the system.  As a result sewage sludge builds up in sewer lines.  As bacteria eat away at the organic matter in the sludge, hydrogen sulfide is released.  Hydrogen sulfide is known for its characteristic “rotten egg” smell.

This creates an unfortunate situation.  No one wants to walk through smelly streets.  Further, slow sewage means a build-up of potentially harmful bacteria.  However, everyone agrees San Francisco should strive to conserve water.  Water is a scarce and increasingly expensive resource in California.  What’s the next step in solving the stinking sewer problem?

San Francisco is not the first city to deal with this issue.  There is substantial debate over the city’s current plan to purchase $14M in bleach to clean up the smell.   Many parties are concerned about potential environmental impacts and potential contamination to drinking water.  Other solutions have been proposed by environmental activists, but may have financial ramifications.

Cause Maps can help all parties come to agreement because they focus problem solvers on the goals, not the details of the problem.  In this case, all parties are trying to protect the environment and reduce costs to city residents.  Based on those goals and the Cause Map, potential solutions have been developed and placed with their corresponding causes.  The next step is to proactively consider how these new actions might affect the stakeholders’ goals.  Perhaps other goals could be impacted, such as the safety of drinking water and potential contamination of San Francisco Bay.  Financial goals will surely be impacted to varying degrees with each solution.  Revising the Cause Map can help identify the pros and cons of each approach and narrow down which solution best satisfies all parties.

Deadly Tiger Attack

By Kim Smiley

On December 25, 2007, a tiger escaped her enclosure at the San Francisco Zoo and attacked three people.  One 17-year-old boy was killed and the other two were injured.  The enclosure was built in the 1940s and had safely contained tigers for more than 60 years without incident.

So how did this happen?  How did the tiger escape?

A Cause Map can be built using this example to help determine how this incident was able to occur. To begin a Cause Map, the impacts to the organizational goals are first determined and then “why” questions are asked to add causes to the map.  In this case, there was obviously an impact to the safety goal because one zoo patron was killed and two were injured.  The customer service goal was also impacted because the zoo was closed until January 3, 2008 following the incident.  Why was a zoo patron killed?  He was killed because he was mauled by a tiger.  Why was he mauled?  Because the tiger escaped her enclosure and she went after the victims.
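
The chain of “why” questions above can be written down as a small data structure.  The sketch below is only an illustration of the idea, not the Cause Mapping tool itself: the dictionary layout and the `why_chains` helper are hypothetical, and a “?” marks a cause that still needs to be investigated.

```python
# Each effect maps to the causes that answer "why did this happen?"
cause_map = {
    "Safety goal impacted": ["Zoo patron killed", "Two zoo patrons injured"],
    "Zoo patron killed": ["Mauled by tiger"],
    "Two zoo patrons injured": ["Mauled by tiger"],
    "Mauled by tiger": ["Tiger escaped enclosure", "Tiger went after victims"],
    "Tiger escaped enclosure": ["?"],    # still to be answered
    "Tiger went after victims": ["?"],   # still to be answered
}


def why_chains(effect, graph):
    """Return every chain of causes, from an effect down to its last known cause."""
    causes = graph.get(effect, [])
    if not causes:
        return [[effect]]
    chains = []
    for cause in causes:
        for chain in why_chains(cause, graph):
            chains.append([effect] + chain)
    return chains


chains = why_chains("Zoo patron killed", cause_map)
```

Each chain ends in a “?” box, which shows where the next “why” question needs to be asked.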

Let’s focus on the question of how the tiger escaped her enclosure first.  An investigation was conducted by the United States Department of Agriculture’s Animal and Plant Health Inspection Service, the government body which is charged with overseeing the nation’s zoos.  Based on claw marks and other evidence at the scene, they determined that the tiger jumped from the bottom of a dry moat and was able to pull herself over the fence surrounding her enclosure.  The investigation also determined the fence was lower than typically used around tiger enclosures.  The Association of Zoos & Aquariums recommends that walls around a tiger exhibit be at least 16.4 feet high, and the fence at the San Francisco Zoo was only 12.5 feet high at the time.

The second question of why the tiger went after the boys is not as easy to answer.  A few experts have stated that the tiger didn’t behave in a typical way.  There has been significant speculation in the media that the victims taunted the tiger or provoked her in some way, but nothing has ever officially been determined.

This focus on the behavior of the victims is a good example of some of the issues that can come up during an investigation.  It can be tempting to focus on assigning blame when investigating an incident.  But the real question is “What should we do to prevent this from happening again?”  Whether or not the boys provoked the tiger, she should never have been able to escape her enclosure.

After the incident, the zoo extensively remodeled the tiger enclosure, adding a much higher fence with hotwire at the top to prevent any similar incidents from occurring.

The Phillips 66 Explosion: Planning for Emergencies

By ThinkReliability Staff

All businesses strive to make their processes as efficient as possible and maximize productivity.  Minimizing excess inventory only seems sensible, as does placing process equipment in a logical manner to minimize transit time between machines.  However, when productivity consistently takes precedence over safety, seemingly insignificant decisions can snowball when it matters most.

Using the Phillips 66 explosion of 1989 as an example, it is easy to see how numerous efficiency-related decisions snowballed into a catastrophe.  Examining different branches of the Cause Map highlights areas where those shortcuts played a role.  Some branches focus on how the plant was laid out, how operations were run and how the firefighting system was designed.  Arguably, all of these areas were maximized for production efficiency, but ended up being contributing factors in a terrible explosion and hampered subsequent emergency efforts.

For instance, the Cause Map shows that the high number of fatalities was caused not just by the initial explosion.  The OSHA investigation following the explosion highlighted contributing factors regarding the building layout.  The plant was cited for having process equipment located too closely together, in violation of generally accepted engineering practices.  While this no doubt maximized plant capacity, it made escape from the plant difficult and did not allow adequate time for emergency shutdown procedures to complete.  Additionally, high-occupancy structures, such as the control room and administrative building, were located unnecessarily close to the reactors and storage vessels.  Luckily, over 100 personnel were able to escape via alternate routes.  But luck is certainly not a reliable emergency plan; the plant should have been designed with safety in mind too.

Nearby ignition sources also contributed to the speed of the initial explosion, estimated to be within 90 to 120 seconds of the valve opening.  OSHA cited Phillips for not using due diligence in ensuring that potential sources of ignition were kept a safe distance from flammable materials or, alternatively, using testing procedures to ensure it was safe to bring such equipment into work zones.  The original spark source will never be known, but the investigation identified multiple possibilities.  These included a crane, forklift, catalyst activator, welding and cutting-torch equipment, vehicles and ordinary electrical gear.   While undoubtedly such a large cloud of volatile gas would have eventually found a spark, a proactive approach might have provided precious seconds for workers to escape.  All who died in the explosion were within 250 feet of the maintenance site.

Another factor contributing to the extensive plant damage was the inadequate water supply for firefighting, as detailed in the Cause Map.  When the plant was designed, the water system used in the HDPE process was the same one that was to be used in an emergency.  There is no doubt a single water system was selected to keep costs down.  Other shortcuts include placing regular-service fire system pump components above ground.  Of course, the explosion sheared electrical cords and pipes controlling the system, rendering it unusable.  Not only was the design of the fire system flawed, it wasn’t even adequately maintained.  In the backup diesel pump system, only one of three pumps was operational; one was out of fuel and the other simply didn’t work.  Because of these major flaws, emergency crews had to use hoses to pump water from remote sources.  The fire was not brought under control until 10 hours after the initial explosion.  As the Cause Map indicates, there may not have been such extensive damage had the water supply system been adequate.

There is a fine line between running processes at the utmost efficiency and taking short-cuts that can lead to dangerous situations.  Clearly, this was an instance where that line was crossed.

The Phillips 66 Explosion: The Rise of Process Safety Management in the Petrochemical Industry

By ThinkReliability Staff

Many of the industrial safety standards that we take for granted are the direct result of catastrophes of past decades.  Today there are strict regulations on asbestos handling, exposure limits for carcinogens, acceptable noise levels, the required use of personal protective equipment, and a slew of other safety issues.  The organization charged with enforcing those standards is the Occupational Safety and Health Administration – OSHA for short.

OSHA was founded in 1970, in an effort to promote and enforce workplace safety, and its stated mission is to “assure safe and healthful working conditions for working men and women”.  However, there was considerable controversy during its early years as it began spottily enforcing what were perceived as cumbersome and expensive regulations.  Notable events in the 1980s, such as the Bhopal and West Virginia Union Carbide industrial accidents, raised OSHA’s awareness that fundamental changes were needed to develop more effective safety management systems.

This awareness led to the rise of what is now known as Process Safety Management (PSM).  This discipline covers how industries safely manage highly hazardous chemicals.  OSHA’s PSM standard lays forth multiple requirements such as employee and contractor training, use of hot work permits, and emergency planning.  Unfortunately PSM was still a work-in-progress during the fall of 1989.

On October 23, 1989, the Phillips 66 Petroleum Chemical Plant near Pasadena, Texas, then producing approximately 1.5 billion pounds of high-density polyethylene (HDPE) plastic each year, suffered a massive series of explosions.  Twenty-three people died and hundreds were injured in an explosion that measured at least 3.5 on the Richter scale and destroyed much of the plant.  Many of the deficiencies identified at the Phillips 66 plant were in violation of OSHA’s PSM directives; directives which had been announced, but had not yet been formally enacted.

Looking at the Phillips 66 Explosion Cause Map, one can see how a series of procedural errors occurred that fateful day.  Contract workers were busy performing a routine maintenance task of clearing out a blockage in a collection tank for the plastic pellets produced by the reactor.  The collection tank was removed, and work commenced that morning.  However, at some point just after lunch, the valve to the reactor system was opened, releasing an enormous gas cloud which ignited less than two minutes later.

The subsequent OSHA investigation highlighted numerous errors.  First, the air hoses used to activate the valve pneumatically were left near the maintenance site.  When the air hoses were connected backwards, this automatically opened the valve, releasing a huge volatile gas cloud into the atmosphere.  It is unknown why the air hoses were reconnected at all.  Second, a lockout device had been installed by Phillips personnel the previous evening, but was removed at some point prior to the accident.  A lockout device physically prevents someone from opening a valve.  Finally, in accordance with local plant policy but not Phillips policy, no blind flange insert was used as a backup.  The insert would have stopped the flow of gas into the atmosphere if the valve had been opened.  Had any of those three procedures been executed properly, there would not have been an explosion that day.  According to the investigation, contract workers had not been adequately trained in the procedures they were charged with performing.

Additionally, there were significant design flaws in the reactor/collector system.  The valve system used had no mechanical redundancies; the single Demco ball valve was the sole cut-off point between the highly-pressurized reactor system and the atmosphere.  There was also a significant design flaw with the air hoses, as alluded to earlier.  Not only were the air hoses connected at the wrong time, but there was no physical barrier to prevent them from being connected the wrong way.  This is the same reason North American electrical plugs are mechanically keyed and can only be plugged in one way.  It can be bad news if connected incorrectly!  Connecting the air hoses backward meant the valve went fully open, instead of closed.  Both of these design flaws contributed to the gas release, and again, this incident would not have occurred had either flaw been absent.

In hindsight, one can see how multiple problems led to such devastating results.  To easily understand the underlying reasons behind the Phillips 66 Explosion of 1989, a high-level Cause Map provides a quick overview of the event.  Breaking a section of the Cause Map down further can provide significant insight into the multiple reasons the event occurred.  The associated PDF for this case shows how different levels of a Cause Map can provide just the right amount of detail for understanding a complex problem such as this one.

The Phillips 66 explosion was a tragedy that could have been avoided.  The industrial safety standards that OSHA is charged with enforcing aim to prevent future tragedies like this one.  While a gradual safety-oriented transformation has come with some pain and a price tag, few will argue that such standards are unnecessary.  Industrial workers deserve to work in an environment where risk to their health has been reduced to the most practical level.

Aging Natural Gas Pipeline Finally Fails

By ThinkReliability Staff

Few ever contemplate the complex system of utilities surrounding us.  The beauty of our modern standard of living is that usually there is little reason to think about those things.  Those rare cases where power isn’t available at the flip of a switch, or fresh water at the turn of a faucet usually make the local news.

Sadly, the community of San Bruno was faced with much more than simple inconvenience.  On September 9, 2010, an explosion ripped through the suburban community, ultimately killing 8 and destroying or damaging 100 homes.  The explosion was caused by a ruptured natural gas pipeline, and it appears that a slight increase in pipe pressure led to the final failure.  That change in pressure resulted from a glitch in maintenance procedures at a pipeline terminal.  While ultimately that glitch may have been the “straw that broke the camel’s back”, it is clear from the Cause Map analysis that the straw pile was already fairly high.

Based on National Transportation Safety Board reports, both poor pipe construction and inadequate record-keeping played a major role in the failure.  The pipes, at or near their life expectancy, were already considered too thin by the 1950s’ standards when they were originally installed.  Furthermore, improperly done welding made the pipes susceptible to corrosion.  Compounding these issues was the fact that PG&E, the utilities company responsible for the pipeline, wasn’t even aware that the San Bruno pipeline had such extensive welding.  This matters because gas pressures are calculated based on a number of inputs, including the construction of the pipeline.  Even that slight increase in pressure proved to be more than the aging pipe could handle.

Natural gas pipelines are fairly extensive in the United States, and with suburban sprawl many communities live close to these pipelines.  In fact, many states have already taken steps to prevent similar events from occurring in their communities.  Multiple utilities companies have been mandated to install newer pipelines, as in Texas and Washington.  Additionally, the federal government requires that newly constructed pipelines be inspected by “smart pigs” – robots able to maintain and inspect pipeline systems.  However, modernizing this aging infrastructure will be expensive for many communities.

Perhaps there are easy, inexpensive interim solutions available.  The Cause Map analysis identifies all causes leading to the explosion, and then provides a systematic method for developing solutions.  Hopefully some of the solutions generated will prevent future disasters, like the one in San Bruno.