
Chicago High Rise Fire

By ThinkReliability Staff

At approximately 5:00 p.m. on October 17, 2003, a fire began in a storage closet on the 12th floor of the Cook County Administration Building in Chicago. Since there were no Fire Safety Director personnel at the building, the building engineer decided to evacuate. The Emergency Voice/Alarm Communications (EVAC) system was activated, informing personnel that they should evacuate the building using any set of stairs. The Chicago Fire Department (CFD) was called and began fighting the fire from the southeast stairway on the 12th floor.

Personnel evacuating from above the 12th floor in the southeast stairway were stopped at the 12th floor by firefighters and told to go back.  When they did, they found all the doors locked up to the 27th floor.  However, before all the evacuees on the stairway could make it up to the 27th floor, the firefighters opened the stairway door to fight the fire.  This, combined with a smoke tower system that may not have been functioning correctly, led to the stairway filling with smoke and toxic gases, which overcame several people on the stairs.  Six of these people died.  The last body was found in the stairway approximately 90 minutes after the fire began.

A report commissioned by the Governor of Illinois found multiple issues that led to the deaths. There was no sprinkler system, which allowed the fire to spread. The stairway doors were locked, and the evacuees and CFD personnel were generally unaware that they’d be locked, since there were no evacuation procedures or mandatory fire drills in the building. The building had a Fire Safety Director, who was not certified and was 40 minutes away from the building when the fire occurred, and no deputies. The firefighters appeared to place a priority on fighting the fire over searching for trapped people, even after several 9-1-1 calls indicated there were personnel trapped on the stairs. Miscommunication and a lack of leadership within the CFD meant that 90 minutes elapsed before victims were found in the stairway. Had they been found sooner, more would have survived. Additionally, the fire department did not follow certain procedures, such as breaking windows above and below the firefighting site to allow smoke to escape and searching the area before opening a door that was trapping smoke.

A thorough root cause analysis built as a Cause Map can capture all of these causes in a simple, intuitive format that fits on one page. To view the complete investigation in visual form, click on “Download PDF” above.

A Cause Map also captures proposed solutions.  A solution is tied to a particular cause on the map.  Solutions are placed directly above the causes they control.  Some of these solutions have already been implemented, and many are valid for any high rise building to consider implementing.

Learn more about the Cook County Administration Building fire.

NASA Budget Realities

By Kim Smiley

A recent report by a White House panel of independent space experts says NASA’s current goal to return to the moon isn’t feasible with the current budget. The panel estimates that NASA would need about $3 billion extra a year beyond the current budget to continue with human space flight.

The budget shortfall is obviously a problem that may prevent NASA from meeting their overall organizational goals.  A root cause analysis built as a Cause Map can be created to understand how this issue developed.

In this case, the production goal is impacted because NASA is likely to be unable to meet the stated goal of a moon mission by 2020. This is caused by the high cost of a moon mission, other budget considerations (such as the cost of possibly extending the space shuttle program and the International Space Station) and the limited NASA budget. The causes of each of these can then be explored.

NASA has been working toward a return to the moon because five years ago then-President George W. Bush stated that NASA should work to return astronauts to the moon, with a proposed date of 2020.  NASA has already spent $7.7 billion working toward this goal, including the design and the construction of new rockets.

Part of the plan to pay for this venture was to retire the space shuttle in 2010 and deorbit the International Space Station in 2015, but the panel also recommended reevaluating these deadlines, which would add budget pressure.

The panel found that extending the life of the space station beyond 2015 would allow a better return on the billions of dollars invested into it.  The panel also felt the space shuttle should be evaluated for possible life extension as well in order to continue to service the space station, since there is no viable alternative that will be developed in the necessary time frame.

NASA’s budget continues to be limited as national budget constraints increase. In order to raise funds, the panel also recommended including other countries and private, for-profit firms in addition to increasing NASA’s budget.

This problem has no easy, clear solution.  Only time will tell how President Obama will choose to respond to these findings and if human space flight will continue to be a goal for NASA.

San Francisco Transit: Planning Pays Off

By ThinkReliability Staff

San Francisco’s 73-year-old Bay Bridge partially collapsed during the Loma Prieta earthquake of 1989. As a result, a seismic upgrade project was planned. The bridge closed Thursday night, September 3rd, 2009, as part of the upgrade project. Authorities conducted a thorough inspection of the bridge while it was closed. During this inspection, an eyebar was found to be cracked about halfway through.

Unfortunately for San Francisco, “The crack is significant enough to have closed the bridge on its own,” says Caltrans spokesman Bart Ney. Thus the area quickly made plans for repairing the bridge, which would necessitate closing it for longer than just Labor Day weekend, as planned. However, commuters received a pleasant surprise when the bridge opened at about 6:30 on Tuesday morning, less than two hours after originally planned (before the cracked eyebar was discovered). Construction crews worked around the clock to get the bridge repaired and inspected before morning rush hour.

Was it worth the rush?  Ask the 260,000 commuters who normally cross the bridge every day.   However, local transit officials did not rely on the bridge opening on time.  Instead, they made other arrangements, including adding high-speed catamarans to the ferry line-up.

This is an excellent demonstration of the use of “Plan B”, or implementing multiple solutions for issues with great impacts to the goals.  In this case, the repairs were necessitated by the possible loss of the bridge – certainly an impact to the goals of a transit authority.  The accelerated repair schedule and additional transit options were necessitated by the potential loss of the bridge as a transportation route during high traffic-volume times, resulting in an impact to the customer service goal.

Loan Payments and Unemployment

By Kim Smiley

In recent years, the cost of attending a college in the US has risen about three times faster than inflation and the average student loan amounts have increased accordingly. Combine this with the highest unemployment rates of college graduates since 1979 (4.8% in May, up from 2.1% at the start of the recession) and lower starting salaries during recessions, and there are many recent graduates struggling to make their loan payments.

As with any problem, it is possible to perform a root cause analysis of the issue. To begin, let’s assume the production goal of an individual considering college is to earn a comfortable living and the potential impact to this goal will be difficulty making loan payments.

The causes of this potential problem repaying loans could be unemployment after graduation, lower starting salaries for new hires during a recession and large loan payments.  The causes of each of these factors can then be explored.  Click on the Download PDF graphic above to view an intermediate level Cause Map with more causes added.

A recent USA Today article entitled “In a Recession, Is College Worth It? Fear of Debt Changes Plans” discussed how many students are rethinking their college plans. Enrollment at community colleges is soaring, and many students are choosing a less expensive option and skipping the big-name private institutions.

This makes sense when considering the potential difficulty repaying college loans because the only cause that the student has direct control over is the size of the loan payments.

The bottom line is that each individual needs to think through their particular situation, consider how much the college costs and how much the starting salary for their particular degree is projected to be.  There are very real dangers in amassing large student loans without calculating the monthly payments and ensuring that they are within a realistic budget.
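As a rough illustration of the calculation described above, the sketch below uses the standard fixed-payment (amortization) formula to estimate a monthly loan payment and compare it with a starting salary. The function names, the loan figures and the 10% budget share are assumptions made for this example only; they do not come from the article or from any lender.

```python
def monthly_payment(principal, annual_rate, years):
    """Fixed monthly payment on a fully amortizing loan.

    principal   -- amount borrowed, in dollars
    annual_rate -- nominal annual interest rate as a fraction (0.06 = 6%)
    years       -- repayment term in years
    """
    months = years * 12
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        # No interest: the principal is simply spread over the term.
        return principal / months
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -months)


def fits_budget(payment, annual_salary, max_share=0.10):
    """True if the payment is within max_share of gross monthly salary.

    max_share is a hypothetical rule of thumb, not a recommendation.
    """
    return payment <= (annual_salary / 12) * max_share


# Hypothetical figures for illustration only.
payment = monthly_payment(20_000, 0.06, 10)
print(f"Monthly payment: ${payment:.2f}")
print("Within budget on a $30,000 salary:", fits_budget(payment, 30_000))
```

Running the same calculation for a less expensive and a more expensive school, with the projected starting salary for a given degree, gives a concrete basis for the comparison.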

The reality is that some universities cost more and there is no guarantee that attending a more expensive college will result in a higher salary.  It may well be a smart decision to choose a less expensive option when selecting a college.

If you’re interested in reading an analysis of the 2009 Financial Mess, please click here.

Lessons from Three Mile Island

By Kim Smiley

The partial meltdown of a core at the nuclear power plant at Three Mile Island is one of the most well-known engineering disasters in US history. Luckily, no one was injured and there was no significant environmental impact, but the potential for major issues was very real. Three Mile Island also had a huge impact on the nuclear industry and required a major cleanup effort.

Performing a root cause analysis of historical incidents is useful because there are a number of lessons learned that can often be applied across a variety of industries.

As is true with any complex system, there were many causes that contributed to the Three Mile Island incident.  At the most simplified level, cooling water flow was stopped to the primary system (the nuclear portion).  The primary system then started to heat up, increasing the pressure to the point that a relief valve lifted.  The relief valve then failed to reseat and a large volume of coolant was lost.  The core eventually overheated because it was uncovered due to the loss of coolant.

Another factor that contributed significantly to the Three Mile Island incident was operator action during the casualty, which occurred over several shifts.  Had operators been able to understand the status of the plant in a timelier manner, the plant could have been put into a safe condition.

At first glance, it’s easy to stop at this point and use a term like “operator error”, but a thorough analysis requires more digging. Even if the technology being considered is radically different than a nuclear power plant, there are many lessons that can be learned from studying how the control room design impacted the operator actions during the incident.

The design of the control room significantly contributed to the operators’ inability to identify plant conditions. The control room was huge, with hundreds of instruments to monitor, some of which were on the back of the control panels and couldn’t be viewed from the normal watch standing locations. Dozens of alarms, both audible and flashing lights, went off in a very short period of time without any obvious priority. The alarms continued throughout the casualty and the sheer volume of information was nearly impossible to interpret accurately.

Many industries continue to benefit from the lessons learned from the design of the control room.

For more detailed information on the Three Mile Island accident, please see the NRC’s Three Mile Island fact sheet.

When the Power Goes Out . . .

Basing Contingency Plans on the Impacts to your Organization’s Goals

By ThinkReliability Staff

An excellent discussion resulted as part of our free Webinar series last week. An attendee asked the question, “What if there’s a cause you can’t control, like the weather?” That raised another question: “How do you prepare for those sorts of things?”

You can prepare for potential problems that may arise by using a Cause Map, just like you would after an actual problem occurred. We call the Cause Map of things that COULD happen a “proactive” Cause Map, while a Cause Map of something that DID happen is a “reactive” Cause Map. Typically you will see reactive Cause Maps, but a proactive analysis can be extremely useful for contingency planning, as well as to develop problem-solving skills.

To create a proactive (or COULD) Cause Map, follow the same steps normally used in a root cause investigation, trying to imagine the possibilities for impacts to the organization’s goals. Then create the Cause Map and determine possible solutions (action items). The “cost” of the impacts to the goals will determine which solutions are reasonable to implement.

As an example, let’s look at a power outage from the perspective of a hospital. (View The Joint Commission’s Sentinel Event Alert on power outage.) A power outage could lead to the deaths of patients, resulting in an impact to the safety goal. It could lead to the loss of life-saving equipment, resulting in an impact to the customer service goal. It could cause the facility to not be able to admit new patients, resulting in an impact to the production goal. And it could result in material and labor costs from the transfer of patients to another facility.

Beginning with these impacts to the goals, we can create a Cause Map. (The Outline and Cause Map are shown on the downloadable PDF.) All the impacts to the goals lead back to a loss of electrical power, caused by both a power outage AND the lack of a back-up electricity source.

When determining solutions, there are a few that come to mind, including transferring patients to another healthcare facility (which itself becomes an impact to the goals) and installing battery backups in equipment.  However, because of the severe impacts to the goals, a hospital will likely decide that the whole problem can be solved by installing an emergency generator.  Problem solved.  However, is installing an emergency generator always the right contingency plan for a power outage?

Let’s look at the same situation from the perspective of an office building. A power outage could cause some employees to get injured as they’re exiting the building, resulting in an impact to the safety goals. It will result in the loss of the business function of the office, resulting in an impact to the customer service and production goals. It may also result in paying employees for a non-work day, which is an impact to the labor goal.

The Cause Map looks similar to the hospital power outage Cause Map in that all the impacts lead back to a loss of electrical power, caused by a power outage and the lack of a back-up electricity source. So, we could put in an emergency generator just like the hospital did and have our problem solved. But given the lesser impacts to the goals, the effort and capital required to install an emergency generator are probably not worth it. Instead, some of the less expensive and less time-consuming solutions can be implemented, such as installing emergency lights and setting up remote work stations for employees.

View the Outlines and Cause Maps for both the hospital and office building power outages by clicking “Download PDF” above.

Midair Aircraft/Helicopter Collision Over Hudson River

By ThinkReliability Staff

On August 8, 2009, a small airplane clipped the wing of a sightseeing helicopter and both aircraft crashed into the Hudson River, killing all nine people aboard. The crowded corridor above the Hudson River was also the site of the successful crash landing of U.S. Airways Flight 1549 in January 2009. The evidence from the crash is still being recovered from the accident site, so the investigation is ongoing. However, just because we don’t have all the causes doesn’t mean we can’t start our root cause analysis.

A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.  To begin, we define the problem in an outline.  So far, we know the date and approximate time of the collision.  (We may be able to refine the time of the accident as more information is released.)  We know the location of the collision based on eyewitness accounts and the discovery of wreckage.  We also know the type of plane and helicopter involved, and what they were doing (the plane was in transit to Ocean City; the helicopter was on a sightseeing tour).

Next we define the problem with respect to the impact to the goals. The safety goal was impacted because nine people were killed. Both the airplane and helicopter were lost (or at the very least, severely damaged), which is an impact to the material goal. Lastly, if we have the information, we can record the frequency of this type of incident. The last helicopter/airplane collision in the New York City area was in 1983.

Once we’ve completed the outline, we can move on to the Cause Map.  We begin with the impacts to the goals and fill in the Cause Map by asking “Why” questions.  Both goals were impacted because the plane and helicopter crashed into the water.  We continue to ask “Why” questions.  Both aircraft fell into the water because the plane clipped the helicopter’s wing.  The pilot clipped the helicopter’s wing because the plane and the helicopter were in the same airspace.  And, it’s surmised that the pilot could not see the helicopter.  (We don’t have any solid evidence supporting this yet, so we’ll leave a question mark.)

The plane and the helicopter were in the same airspace because the area is crowded with sightseeing helicopters and small planes which are prohibited from flying above buildings or over 1,100 feet.  Around New York City, that pretty much leaves the river.  Pilots who are flying below 1,100 feet are free to choose their own route, and are not under the control of air traffic controllers.  Instead, they use the “see and avoid” method.

Unfortunately that method isn’t successful when a pilot can’t see an incoming helicopter.  Although small planes are not controlled by air traffic controllers, they are in communication with them.  However, the pilot of the plane had never contacted the Newark controllers.  The helicopter was ascending at the time of the crash, so it’s likely that it came from below the plane (where the pilot would be unable to see it).  The helicopter may have been unaware of the plane because it’s not required (though it is recommended) for pilots to announce their position.

As the NTSB investigation continues, more detail can be added to this Cause Map. As with any investigation, the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals.

Saving Sharks from Extinction

By ThinkReliability Staff

In honor of the Discovery Channel’s “Shark Week”, we’ll use the problem of shark species at risk of extinction as a root cause analysis example. We’ll begin by building a Cause Map, which is a visual method of performing a root cause analysis.

We begin a root cause analysis with an impact to the goal.  Shark species being at risk for extinction is an impact to the environmental goals.  While I didn’t add this kind of detail, evidence has shown that a decrease in the number of sharks results in problems for the rest of the food chain.

We fill out the Cause Map by asking “Why” questions. Shark species are at risk of extinction because the death rates of sharks are higher than the birth rates. Sharks have low reproductive rates (they mature slowly, have long gestational periods, and birth few young) and increasing death rates. The increasing death rate is due to overfishing (fishing without regard to population), injured sharks being left to die, and loss of habitat caused by pollution. The combination of sharks being fished for sport, food, or products (which are rising in value; sharks are thought to cure cancer) and the lack of effective regulation has led to overfishing. Many sharks are injured, either as “bycatch”, meaning sharks are brought up in fishing nets while fishing for something else, or by a practice known as “finning”, where a shark’s fin is cut off. (Shark fin soup is very popular.) In both cases, sharks are typically thrown back into the water injured and left to die. Many countries have a ban on finning, but the ban is not always effectively enforced.

Many countries around the world are trying to protect sharks. Some of the solutions they have implemented are to create shark fishing quotas, increase enforcement of fishing quotas and finning bans, decrease the market for shark products and shark fin soup, and limit any fishing in known shark habitats. Solutions can be shown on the Cause Map, directly above the cause they control. Once solutions have been selected for implementation, as these have been, they are listed in the Action Items list. (To see the Cause Map and Action Items list, click on “Download PDF” above.)

How an Unchecked Assumption Brought Down a Bridge

By ThinkReliability Staff

On August 1, 2007, the I-35 bridge over the Mississippi River in Minneapolis, Minnesota collapsed during evening rush hour, killing 13 and injuring at least 145. During the National Transportation Safety Board’s investigation, it was discovered that the gusset plates (riveted metal plates that join several structural members) were designed with inadequate load capacity. At the time of the bridge collapse, the load on the gusset plate that failed was higher than usual, due to construction materials and equipment concentrated on the deck over the location of the gusset plate and rush hour traffic slowed by the construction. In addition to these weights, the dead load (weight of the bridge structure) had increased by more than four million pounds due to improvements made to the bridge since it opened in 1967.

Bridges are inspected regularly, and go through a design review process . . . so how did the gusset plate design error get missed?    The design for the gusset plates was apparently supposed to be a preliminary design, which neglected shear stress.  Although the firm that designed the bridge required a review of all calculations before the final design, the procedure did not ensure that all calculations were rechecked, so the gusset plate calculations that ignored shear stress were overlooked.

The design was reviewed by the government, but their design review did not apply to gusset plates. The gusset plate capacity was not calculated as part of the load rating calculations. Gusset plates were not listed as a separate element to be inspected during a bridge inspection. And the training for bridge inspectors contained very little information about gusset plates. Why? Because it was widely assumed that gusset plates are stronger than the members they join and so can be neglected in calculations in order to simplify the analysis. In most cases, this assumption is true. However, these gusset plates were designed incorrectly and so were much weaker than typical. Allowing the assumption to go unchecked on several different occasions proved disastrous.

As a result of this tragedy, it’s unlikely the same problem will happen again. Structural design and bridge inspection training material is being rewritten to include the lessons learned from this bridge collapse, and inspections are now considering the strength of gusset plates as part of their evaluation. Assumptions are made all the time, but these assumptions need to be verified.

Click on “Download PDF” to see the NTSB’s root cause analysis investigation results visually displayed in a Cause Map. A Cause Map can capture all of the causes from an investigation in a simple, intuitive format that fits on one page.

Click here for another example of a case where a minor item caused some major issues.

Learn more about the I-35 Bridge collapse.

Confined Space Asphyxiation

By ThinkReliability Staff

During the overnight shift on November 5, 2005, two workers at a refinery in Delaware City, Delaware died from asphyxiation.  Both workers had entered a confined space that was filled with nitrogen.  We will use information from the Chemical Safety Board’s root cause analysis investigation to create a Cause Map.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.

The first step in our analysis is to define the problem by filling out the outline.  The outline contains the what, when, where and impact to the goals.  The “what” is the problem; in this case two workers were asphyxiated.  The when is the overnight shift of November 5, 2005, and the where is the hydrocracker reactor of a Delaware City refinery.  The workers were apparently attempting to retrieve dropped tape.

Because two workers were killed, there was an impact to the safety goal.  There may have been impacts to other goals as well, but the loss of life makes other impacts less significant.

Once the outline is completed, we use the impacted goal to begin the Cause Map. We begin with the impacted goal and ask “why” questions. A good way to begin is using the “5-why” technique. Begin with the impacted goal and ask “why” 5 times. This will start the Cause Map. For this incident: the safety goal was impacted. Why? Because two workers died. Why? Because they were asphyxiated. Why? Because they entered a confined space. Why? Because they were attempting to retrieve lost tape. Why? Because the tape was left in the reactor.

From the “5-why” Cause Map we can add more detail to the root cause analysis.  Additional causes can be added before, after and between the causes on the 5-why map.  For example, the workers were asphyxiated because they entered the confined space AND the space was filled with nitrogen.  The space being filled with nitrogen is added as an additional cause of asphyxiation, and is joined with “AND” because both causes had to be present for the asphyxiation to occur.
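For readers who think in code, the “5-why” chain and the added AND relationship can be sketched as a small tree. This is only an illustration of the idea, not part of the Cause Mapping method or of any ThinkReliability tool; the `Cause` class and the helper functions are hypothetical names chosen for this example, and the cause text is taken from the paragraphs above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    """One box on the map. Everything in `causes` is joined by AND:
    all of them had to be present for this effect to occur."""
    text: str
    causes: List["Cause"] = field(default_factory=list)


def first_chain(node: Cause) -> List[str]:
    """Follow the first cause at each level, like a simple 5-why chain."""
    chain = [node.text]
    while node.causes:
        node = node.causes[0]
        chain.append(node.text)
    return chain


def count_causes(node: Cause) -> int:
    """Count every cause below this node, across all AND branches."""
    return sum(1 + count_causes(c) for c in node.causes)


# Build the map from the last "why" back up to the impacted goal.
tape_left = Cause("Tape was left in the reactor")
retrieving = Cause("Workers were attempting to retrieve lost tape", [tape_left])
entered = Cause("Workers entered a confined space", [retrieving])
nitrogen = Cause("Space was filled with nitrogen")
# Two causes joined with AND: entering the space AND the nitrogen.
asphyxiated = Cause("Workers were asphyxiated", [entered, nitrogen])
died = Cause("Two workers died", [asphyxiated])
goal = Cause("Safety goal impacted", [died])

for step in first_chain(goal):
    print(step)
print("Causes on the detailed map so far:", count_causes(goal))
```

Adding more detail to the analysis corresponds to appending further `Cause` entries to any node's `causes` list.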

Even more detail can be added to this Cause Map as the root cause analysis continues. As with any investigation the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals. The outline, “5-why” Cause Map and detailed Cause Map can be seen by clicking “Download PDF” above.