All posts by ThinkReliability Staff

ThinkReliability specializes in applying root cause analysis to solve all types of problems. We investigate errors, defects, failures, losses, outages and incidents in a wide variety of industries. Our Cause Mapping method of root cause analysis captures the complete investigation, along with the best solutions, in an easy-to-understand format. ThinkReliability provides investigation services and root cause analysis training to clients around the world and is considered the trusted authority on the subject.

Sugar Dust Explosion

By ThinkReliability Staff

On February 7, 2008, an explosion at the sugar refinery in Port Wentworth, Georgia resulted in the deaths of 14 workers.  It also injured 36 and caused significant damage to the refinery.  Immediately following the incident, we began a very simple root cause analysis, leaving the more detailed analysis for when the Chemical Safety Board (CSB) report was released and more detailed information could be found.  The CSB final draft report was recently issued, and with the information it contains, we can add more detail to our Cause Map.

We can begin our analysis with a goal that was impacted and use the “5-whys” approach.  The 14 deaths and 36 injuries were caused by the propagation of secondary explosions and fires.  The secondary explosions and fires were caused by a primary explosion, which was caused by an explosive concentration of sugar dust, which was caused by inadequate housekeeping.
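The 5-whys chain above can be written down as an ordered list, where each entry answers "why?" for the entry before it. The following is a minimal sketch in Python, offered only as an illustration; the names `sugar_dust_chain` and `render_whys` are hypothetical and are not part of the Cause Mapping method.

```python
# Each entry is the cause of the entry before it (a linear "5-whys" chain).
# The wording of each step is taken from the paragraph above.
sugar_dust_chain = [
    "14 deaths and 36 injuries",
    "propagation of secondary explosions and fires",
    "primary explosion",
    "explosive concentration of sugar dust",
    "inadequate housekeeping",
]


def render_whys(chain):
    """Return one 'effect. Why? Because of cause.' line per link in the chain."""
    lines = []
    # Pair every effect with the cause that follows it in the list.
    for effect, cause in zip(chain, chain[1:]):
        lines.append(f"{effect}. Why? Because of {cause}.")
    return lines


if __name__ == "__main__":
    for line in render_whys(sugar_dust_chain):
        print(line)
```

A plain list is enough here because a 5-whys chain has exactly one cause per effect; the more detailed map described in the following paragraphs branches and would need a richer structure.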

From here we can add more detail to our map.  For example, difficulty evacuating the plant was also a cause of the deaths and injuries.  The difficulty was caused by having no evacuation drills, and using cell phones and radios to communicate instead of an intercom or emergency alert system.

In order for the explosions to propagate, they needed additional fuel.  This was found in the accumulated sugar dust in open areas of the plant, due to inadequate housekeeping, and a dust removal system that was not functioning properly and had ducts filled with sugar dust.

Since “inadequate housekeeping” has now come up twice on our map, let’s expand on that a little.  There was a lack of awareness of the hazards of sugar dust.  The facility risk assessment did not address these hazards, there was very little training on dust hazards, and there was little regulatory oversight which might have created more awareness or cleanliness requirements.  OSHA’s hazardous dust safety standards were limited to grain, and the State of Georgia had no regulations addressing dust.  (Both of these issues are in the process of being fixed.)

Although the sugar dust accumulated due to lack of housekeeping, it also required containment to reach explosive levels.  The containment was provided by steel panels installed around the conveyor, which were designed to protect the sugar from contamination.  The dust also required an ignition source.  Due to the extensive damage, the CSB was not able to pinpoint the ignition source.

The CSB identified several solutions that would mitigate the risk of future incidents.  Some of these solutions are for Imperial Sugar to implement at this site, such as holding evacuation drills, increasing training on dust hazards, improving the housekeeping program, and installing (and using) an intercom system.  As discussed above, OSHA and the State of Georgia are implementing standards and regulations to decrease the chances of a dust explosion in their jurisdictions.  Also, the CSB has recommended that the company who performed the risk assessment at Imperial Sugar consider dust hazards as a risk.

Click on “Download PDF” above to see all the information discussed above in a visual form.

Learn more about dust explosions.

Chicago High Rise Fire

By ThinkReliability Staff

At approximately 5:00 p.m. on October 17, 2003, a fire began in a storage closet on the 12th floor of a Cook County Administration Building in Chicago. Since there were no Fire Safety Director personnel at the building, the building engineer decided to evacuate.  The Emergency Voice/Alarm Communications (EVAC) system was activated, informing personnel that they should evacuate the building using any set of stairs.  The Chicago Fire Department (CFD) was called and began fighting the fire from the southeast stairway on the 12th floor.

Personnel evacuating from above the 12th floor in the southeast stairway were stopped at the 12th floor by firefighters and told to go back.  When they did, they found all the doors locked up to the 27th floor.  However, before all the evacuees on the stairway could make it up to the 27th floor, the firefighters opened the stairway door to fight the fire.  This, combined with a smoke tower system that may not have been functioning correctly, led to the stairway filling with smoke and toxic gases, which overcame several people on the stairs.  Six of these people died.  The last body was found in the stairway approximately 90 minutes after the fire began.

A report commissioned by the Governor of Illinois found multiple issues that led to the deaths.  There was no sprinkler system, which allowed the fire to spread.  The stairway doors were locked, and the evacuees and CFD personnel were generally unaware that they’d be locked, since there was no evacuation procedure or mandatory fire drills in the building.  The building had a Fire Safety Director, who was not certified and was 40 minutes away from the building when the fire occurred, and no deputies.  The firefighters appeared to place a priority on fighting the fire over searching for trapped people, even after several 9-1-1 calls indicated there were personnel trapped on the stairs.  Miscommunication and a lack of leadership within the CFD meant that 90 minutes elapsed before victims were found in the stairway.  Had they been found sooner, more would have survived.  Additionally, the fire department did not follow certain procedures, such as breaking windows above and below the firefighting site to allow smoke to escape and searching the area before opening a door that was trapping smoke.

A thorough root cause analysis built as a Cause Map can capture all of these causes in a simple, intuitive format   that fits on one page.  To view the complete investigation in visual form, click on “Download PDF” above.

A Cause Map also captures proposed solutions.  A solution is tied to a particular cause on the map.  Solutions are placed directly above the causes they control.  Some of these solutions have already been implemented, and many are valid for any high rise building to consider implementing.

Learn more about the Cook County Administration Building fire.

San Francisco Transit: Planning Pays Off

By ThinkReliability Staff

San Francisco’s 73-year-old Bay Bridge partially collapsed during the Loma Prieta earthquake of 1989.  As a result, a seismic upgrade project was planned.  The bridge closed Thursday night, September 3rd, 2009, as part of the upgrade project.  Authorities conducted a thorough inspection of the bridge while it was closed.  During this inspection, an eyebar was found to be cracked about halfway through.

Unfortunately for San Francisco, “The crack is significant enough to have closed the bridge on its own,” says Caltrans spokesman Bart Ney.  Thus the area quickly made plans for repairing the bridge, which would necessitate closing it for longer than just Labor Day weekend, as planned.  However, commuters received a pleasant surprise when the bridge opened at about 6:30 on Tuesday morning, less than two hours later than originally planned (before the cracked eyebar was discovered).  Construction crews worked around the clock to get the bridge repaired and inspected before morning rush hour.

Was it worth the rush?  Ask the 260,000 commuters who normally cross the bridge every day.   However, local transit officials did not rely on the bridge opening on time.  Instead, they made other arrangements, including adding high-speed catamarans to the ferry line-up.

This is an excellent demonstration of the use of “Plan B”, or implementing multiple solutions for issues with great impacts to the goals.  In this case, the repairs were necessitated by the possible loss of the bridge – certainly an impact to the goals of a transit authority.  The accelerated repair schedule and additional transit options were necessitated by the potential loss of the bridge as a transportation route during high traffic-volume times, resulting in an impact to the customer service goal.

When the Power Goes Out . . .

Basing Contingency Plans on the Impacts to your Organization’s Goals

By ThinkReliability Staff

An excellent discussion resulted as part of our free Webinar series last week. An attendee asked the question, “What if there’s a cause you can’t control, like the weather?” That raised another question: “How do you prepare for those sorts of things?”

You can prepare for potential problems that may arise by using a Cause Map, just like you would after an actual problem occurred. We call the Cause Map of things that COULD happen a “proactive” Cause Map, while a Cause Map of something that DID happen is a “reactive” Cause Map. Typically you will see reactive Cause Maps, but a proactive analysis can be extremely useful for contingency planning, as well as to develop problem-solving skills.

To create a proactive (or COULD) Cause Map, follow the same steps normally used in a root cause investigation, trying to imagine the possibilities for impacts to the organization’s goals.  Then create the Cause Map and determine possible solutions (action items). The “cost” of the impacts to the goals will determine which solutions are reasonable to implement.

As an example, let’s look at a power outage from the perspective of a hospital. (View The Joint Commission’s Sentinel Event Alert on power outage.)  A power outage could lead to the deaths of patients, resulting in an impact to the safety goal. It could lead to the loss of life-saving equipment, resulting in an impact to the customer service goal. It could cause the facility to not be able to admit new patients, resulting in an impact to the production goal.  And it could result in material and labor costs from the transfer of patients to another facility.

Beginning with these impacts to the goals, we can create a Cause Map. (The Outline and Cause Map are shown on the downloadable PDF.) All the impacts to the goals lead back to a loss of electrical power, caused by both a power outage AND a lack of back-up electricity source.

When determining solutions, there are a few that come to mind, including transferring patients to another healthcare facility (which itself becomes an impact to the goals) and installing battery backups in equipment.  However, because of the severe impacts to the goals, a hospital will likely decide that the whole problem can be solved by installing an emergency generator.  Problem solved.  However, is installing an emergency generator always the right contingency plan for a power outage?

Let’s look at the same situation from the perspective of an office building. A power outage could cause some employees to get injured as they’re exiting the building, resulting in an impact to the safety goals. It will result in the loss of the business function of the office, resulting in an impact to the customer service and production goals. It may also result in paying employees for a non-work day, which is an impact to the labor goal.

The Cause Map looks similar to the hospital power outage Cause Map in that all the impacts lead back to a loss of electrical power, caused by a power outage and lack of back-up electricity source. So, we could put in an emergency generator just like the hospital did and have our problem solved. But given the lesser impacts to the goals, the effort and capital required to install an emergency generator are probably not worth it. Instead, some of the less expensive and less time-consuming solutions can be implemented, such as installing emergency lights and setting up remote work stations for employees.
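The comparison between the hospital and the office building can be sketched as a simple filter: a solution is worth considering only if it costs no more than the impact it prevents. The Python below is a rough illustration under that assumption. Every dollar figure is a made-up placeholder rather than data from either analysis, and the function name `reasonable_solutions` is hypothetical.

```python
# All dollar figures below are made-up placeholders for illustration only.
def reasonable_solutions(impact_cost, solutions):
    """Return the names of solutions whose cost does not exceed the impact cost."""
    return [name for name, cost in solutions.items() if cost <= impact_cost]


# Candidate solutions named in the text, with placeholder costs.
solutions = {
    "emergency generator": 500_000,
    "emergency lights": 5_000,
    "remote work stations": 20_000,
}

hospital_impact = 10_000_000  # placeholder: severe impacts to the goals
office_impact = 50_000        # placeholder: lesser impacts to the goals

print(reasonable_solutions(hospital_impact, solutions))
print(reasonable_solutions(office_impact, solutions))
```

With these placeholder numbers the generator passes the filter for the hospital but not for the office, which mirrors the reasoning above. A real analysis would also weigh the likelihood of an outage and how effective each solution is, which this sketch ignores.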

View the Outlines and Cause Maps for both the hospital and office building power outages by clicking “Download PDF” above.

Midair Aircraft/Helicopter Collision Over Hudson River

By ThinkReliability Staff

On August 8, 2009, a small airplane clipped the wing of a sightseeing helicopter and both aircraft crashed into the Hudson River, killing all nine people aboard.  The crowded corridor above the Hudson River was also the site of the successful crash landing of U.S. Airways Flight 1549 in January 2009.  The evidence from the crash is still being recovered from the accident site, so the investigation is ongoing.  However, just because we don’t have all the causes doesn’t mean we can’t start our root cause analysis.

A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.  To begin, we define the problem in an outline.  So far, we know the date and approximate time of the collision.  (We may be able to refine the time of the accident as more information is released.)  We know the location of the collision based on eyewitness accounts and the discovery of wreckage.  We also know the type of plane and helicopter involved, and what they were doing (the plane was in transit to Ocean City; the helicopter was on a sightseeing tour).

Next we define the problem with respect to the impact to the goals.  The safety goal was impacted because nine people were killed.  Both the airplane and helicopter were lost (or at the very least, severely damaged), which is an impact to the material goal.  Lastly, if we have the information, we can record the frequency of this type of incident.  The last helicopter/airplane collision in the New York City area was in 1983.

Once we’ve completed the outline, we can move on to the Cause Map.  We begin with the impacts to the goals and fill in the Cause Map by asking “Why” questions.  Both goals were impacted because the plane and helicopter crashed into the water.  We continue to ask “Why” questions.  Both aircraft fell into the water because the plane clipped the helicopter’s wing.  The pilot clipped the helicopter’s wing because the plane and the helicopter were in the same airspace.  And, it’s surmised that the pilot could not see the helicopter.  (We don’t have any solid evidence supporting this yet, so we’ll leave a question mark.)

The plane and the helicopter were in the same airspace because the area is crowded with sightseeing helicopters and small planes which are prohibited from flying above buildings or over 1,100 feet.  Around New York City, that pretty much leaves the river.  Pilots who are flying below 1,100 feet are free to choose their own route, and are not under the control of air traffic controllers.  Instead, they use the “see and avoid” method.

Unfortunately that method isn’t successful when a pilot can’t see an incoming helicopter.  Although small planes are not controlled by air traffic controllers, they are in communication with them.  However, the pilot of the plane had never contacted the Newark controllers.  The helicopter was ascending at the time of the crash, so it’s likely that it came from below the plane (where the pilot would be unable to see it).  The helicopter may have been unaware of the plane because it’s not required (though it is recommended) for pilots to announce their position.

As the NTSB investigation continues, more detail can be added to this Cause Map… As with any investigation the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals.

Saving Sharks from Extinction

By ThinkReliability Staff

In honor of the Discovery Channel’s “Shark Week”, we’ll use the problem of shark species at risk of extinction as a root cause analysis example. We’ll begin by building a Cause Map, which is a visual method of performing a root cause analysis.

We begin a root cause analysis with an impact to the goal.  Shark species being at risk for extinction is an impact to the environmental goals.  While I didn’t add this kind of detail, evidence has shown that a decrease in the number of sharks results in problems for the rest of the food chain.

We fill out the Cause Map by asking “Why” questions.  Shark species are at risk of extinction because the death rates of sharks are higher than the birth rates.  Sharks have low reproductive rates (they mature slowly, have long gestational periods, and birth few young) and increasing death rates.  The increasing death rate is due to overfishing (fishing without regard to population), injured sharks being left to die, and loss of habitat caused by pollution.  The combination of sharks being fished for sport, food, or products (which are rising in value; shark products are believed by some to cure cancer) and the lack of effective regulation has led to overfishing.  Many sharks are injured, either as “bycatch”, meaning sharks are brought up in fishing nets while fishing for something else, or by a practice known as “finning”, where a shark’s fin is cut off.  (Shark fin soup is very popular.)  In both cases, sharks are typically thrown back into the water injured and left to die.  Many countries have a ban on finning, but the ban is not always effectively enforced.
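The first "why" above, death rates higher than birth rates, can be illustrated with a very simple projection. The Python sketch below assumes constant per-year birth and death rates applied to the whole population, which is a simplification. The function name `project_population` and the rates used are hypothetical and are not measured values for any shark species.

```python
def project_population(start, birth_rate, death_rate, years):
    """Project a population forward using constant per-year rates.

    Each year the population is multiplied by (1 + birth_rate - death_rate).
    Returns a list with the starting value followed by one value per year.
    """
    population = start
    history = [population]
    for _ in range(years):
        population = population * (1 + birth_rate - death_rate)
        history.append(population)
    return history


if __name__ == "__main__":
    # Hypothetical rates: 5% births and 10% deaths per year.
    for year, size in enumerate(project_population(1000, 0.05, 0.10, 5)):
        print(year, round(size))
```

Whenever the death rate exceeds the birth rate the multiplier is below 1, so the population shrinks every year. This is why solutions on the Cause Map target the causes of the increasing death rate.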

Many countries around the world are trying to protect sharks.  Some of the solutions they have implemented are to create shark fishing quotas, increase enforcement of fishing quotas and finning bans, decrease the market for shark products and shark fin soup, and limit any fishing in known shark habitats.  Solutions can be shown on the Cause Map, directly above the cause they control.  Once solutions have been selected for implementation, as these have been, they are listed in the Action Items list.  (To see the Cause Map and Action Items list, click on “Download PDF” above.)

How an Unchecked Assumption Brought Down a Bridge

By ThinkReliability Staff

On August 1, 2007, the I-35 bridge over the Mississippi River in Minneapolis, Minnesota collapsed during evening rush hour, killing 13 and injuring at least 145.  During the National Transportation Safety Board’s investigation, it was discovered that the gusset plates (riveted metal plates that each join several structural members) were designed with inadequate load capacity.  At the time of the bridge collapse, the load on the gusset plate that failed was higher than usual, due to construction materials and equipment concentrated on the deck over the location of the gusset plate and rush hour traffic slowed by the construction.  In addition to these weights, the dead load (weight of the bridge structure) had increased by more than four million pounds due to improvements made to the bridge since it opened in 1967.

Bridges are inspected regularly, and go through a design review process . . . so how did the gusset plate design error get missed?    The design for the gusset plates was apparently supposed to be a preliminary design, which neglected shear stress.  Although the firm that designed the bridge required a review of all calculations before the final design, the procedure did not ensure that all calculations were rechecked, so the gusset plate calculations that ignored shear stress were overlooked.

The design was reviewed by the government, but their design review did not apply to gusset plates.  The gusset plate capacity was not calculated as part of the load rating calculations.  Gusset plates were not listed as a separate element to be inspected during a bridge inspection.  And the training for bridge inspectors contained very little information about gusset plates.  Why?  Because it was widely assumed that gusset plates are stronger than the members they join and so can be neglected in calculations in order to simplify the analysis.  In most cases, this assumption is true.  However, since these gusset plates were designed incorrectly and so were much weaker than typical, allowing this assumption to go unchecked on several different occasions proved disastrous.

Thanks to this tragedy, it’s unlikely the same problem will happen again.  Structural design and bridge inspection training material is being rewritten to include the lessons learned from this bridge collapse, and inspections are now considering the strength of gusset plates as part of their evaluation.  Assumptions are made all the time, but these assumptions need to be verified.

Click on “Download PDF” to see the NTSB’s root cause analysis investigation results visually displayed in a Cause Map.  A Cause Map can capture all of the causes from an investigation in a simple, intuitive format that fits on one page.

Click here for another example of a case where a minor item caused some major issues.

Learn more about the I-35 Bridge collapse.

Confined Space Asphyxiation

By ThinkReliability Staff

During the overnight shift on November 5, 2005, two workers at a refinery in Delaware City, Delaware died from asphyxiation.  Both workers had entered a confined space that was filled with nitrogen.  We will use information from the Chemical Safety Board’s root cause analysis investigation to create a Cause Map.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.

The first step in our analysis is to define the problem by filling out the outline.  The outline contains the what, when, where and impact to the goals.  The “what” is the problem; in this case two workers were asphyxiated.  The when is the overnight shift of November 5, 2005, and the where is the hydrocracker reactor of a Delaware City refinery.  The workers were apparently attempting to retrieve dropped tape.

Because two workers were killed, there was an impact to the safety goal.  There may have been impacts to other goals as well, but the loss of life makes other impacts less significant.

Once the outline is completed, we use the impacted goal to begin the Cause Map.  We begin with the impacted goal and ask “why” questions.  A good way to begin is using the “5-why” technique.  Begin with the impacted goal and ask “why” 5 times.  This will start the Cause Map.  For this incident: the safety goal was impacted. Why? Because two workers died.  Why? Because they were asphyxiated.  Why? Because they entered a confined space.  Why?  They were attempting to retrieve lost tape.  Why?  Because the tape was left in the reactor.

From the “5-why” Cause Map we can add more detail to the root cause analysis.  Additional causes can be added before, after and between the causes on the 5-why map.  For example, the workers were asphyxiated because they entered the confined space AND the space was filled with nitrogen.  The space being filled with nitrogen is added as an additional cause of asphyxiation, and is joined with “AND” because both causes had to be present for the asphyxiation to occur.
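The "AND" relationship described above can be made concrete with a small data structure. In the Python sketch below, each effect maps to the causes that must all be present for it to occur. This is only an illustration of the logic; the names `cause_map` and `occurs` are hypothetical and are not part of the Cause Mapping method or of the CSB investigation.

```python
# Cause map as a dict: each effect maps to the causes that must ALL be
# present (an "AND" relationship). Causes with no entry of their own are
# starting points whose presence is supplied directly by the caller.
cause_map = {
    "two workers died": ["workers were asphyxiated"],
    "workers were asphyxiated": [
        "workers entered confined space",
        "space was filled with nitrogen",
    ],
    "workers entered confined space": ["attempt to retrieve lost tape"],
    "attempt to retrieve lost tape": ["tape was left in reactor"],
}


def occurs(effect, present):
    """Return True if `effect` follows from the set of `present` starting causes."""
    causes = cause_map.get(effect)
    if causes is None:
        # A starting cause: it occurs only if it was given as present.
        return effect in present
    # An effect occurs only when every one of its causes occurs.
    return all(occurs(cause, present) for cause in causes)


if __name__ == "__main__":
    both = {"tape was left in reactor", "space was filled with nitrogen"}
    print(occurs("two workers died", both))
    print(occurs("two workers died", {"tape was left in reactor"}))
```

With both starting causes present the outcome follows; with the nitrogen removed it does not. This is the practical value of an "AND" on the map: controlling any one of the joined causes is enough to prevent the effect.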

Even more detail can be added to this Cause Map as the root cause analysis continues. As with any investigation the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals. The outline, “5-why” Cause Map and detailed Cause Map can be seen by clicking “Download PDF” above.

Yellow Fever Epidemic

By ThinkReliability Staff

With swine flu in the news lately, ‘epidemic’ has been on many minds. However, there is still much that isn’t understood about swine flu. There are other epidemics that we understand much better, such as yellow fever.  Yellow fever has been causing epidemics for a long, long time.

But how does it happen?  We can do a root cause analysis of a yellow fever epidemic to find out.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.

Since we are not looking at a specific event, but rather a general situation we will start with just one impacted goal. A yellow fever epidemic can result in the deaths of thousands of people, which we will consider an impact to the safety goal.

We begin the root cause analysis with this impacted goal and ask “why” questions.  Several thousand people may die because there is no cure for yellow fever, it has a high mortality rate, and several thousand people get infected.  The people get infected because they’re not vaccinated, and they are bitten by an infected mosquito in the epidemic zone.  (The endemic zone is areas of Africa and South America where a low level of yellow fever is always present.  The epidemic zone is an area outside the endemic zone to where yellow fever is spread and an epidemic occurs.)

People are not vaccinated because they don’t have access to the vaccine: either it costs too much, or the area is too isolated to receive vaccine. In order for someone to get bitten by an infected mosquito in the epidemic zone, the mosquito must be infected, and the person must have been exposed to a mosquito in the epidemic zone.  In order for a person to be exposed to a mosquito, the mosquito must have access to a person, and mosquitoes must exist, meaning they are able to breed, meaning breeding pools exist.

A mosquito gets infected by biting a person infected with yellow fever. For yellow fever to spread from the endemic zone to the epidemic zone, this means a person was infected with yellow fever in the endemic zone, and traveled to the epidemic zone.  The person gets infected with yellow fever by being bitten by a mosquito infected with yellow fever (in the endemic zone) without being vaccinated.  The person gets bitten by an infected mosquito because they are exposed to mosquitoes (for the same reasons listed above) that are infected, usually by biting monkeys who have been infected by yellow fever.

If you had trouble following all of that, you can see why a process map would be helpful.  On the downloadable PDF, both the Cause Map and process map are shown.

Pedestrian Bridge Collapse on July 4th

By ThinkReliability Staff

On the evening of July 4th, after watching fireworks, revelers at a park in Merrillville, Indiana headed back to their cars over a pedestrian bridge.  The bridge became overloaded and collapsed when two suspension cables snapped.  Somewhere between 50 and 120 people fell into the lake.  Although 25 were treated for injuries, nobody was killed, thanks to quick action by nearby lifeguards, police officers, firefighters and other rescuers who formed a human chain to help get everyone safely out of the water.  We’ll use this as a root cause analysis example.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.

First we complete the outline.  The problem is a bridge collapse.  It happened at 10:00 p.m. on the 4th of July, while there were large numbers of people on the bridge.  It was a pedestrian bridge in Merrillville, IN, and people were crossing it to return home after a party.

Once we have defined the problem we list the impacts to the goals.  People being injured is an impact to the safety goal, as is the potential for drowning.  People fell into the lake, which was an impact to the customer service goal.  Additionally, the loss of the bridge is an impact to the material and labor goals.

We begin our Cause Map by listing the impacted goals and asking “why” questions to fill out the Cause Map to the right.  Begin with 5 “why” questions to start the Cause Map.  This is known as the “5-whys” technique.  For example, the safety goal was impacted.  Why? The safety goal was impacted because people were injured.  Why? People were injured because they fell into the lake.  Why?  They fell into the lake because the bridge collapsed. Why?  The bridge collapsed because the suspension cables broke.  Why? The cables broke because the weight on the bridge exceeded the bridge capacity.

Even more detail can be added to this Cause Map as the analysis continues. As with any investigation the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals.  For this investigation, we can add some more detail to the “5-why” Cause Map to help our investigation.  For example, pedestrians fell into the lake because the bridge collapsed AND because pedestrians were on the bridge, returning to their cars after the 4th of July party.

There may have been additional stress on the bridge due to pedestrians jumping up and down, as reported by witnesses.  Additionally, we can add more detail after the “weight exceeded capacity” on the bridge.  The bridge was built to hold 40 people, but “at least twice that” were on the bridge when it collapsed.  So many people were on the bridge because they were returning to their cars (as discussed above), and because of ineffective crowd control.  There were too many people on the bridge despite officers stationed on either side.  Why was the crowd control ineffective?  It’s not known at this point, but we’ll put a question mark here.  The next step of the investigation will be to replace that question mark with reasons for the ineffective crowd control.  Once we’ve done that, we can come up with solutions that will keep an event like this one from occurring in the future.