Category Archives: Root Cause Analysis – Incident Investigation

Deadly Kansas City Walkway Collapse

By Kim Smiley

On July 17, 1981, the second and fourth floor suspended walkways collapsed at the newly opened Hyatt Regency of Kansas City, Missouri.  A dance contest had attracted a crowd and the atrium under the walkway was filled with people.  This accident killed 113 people and injured 186.

The hotel was newly constructed and the walkways were well maintained.  So how did this happen?

A root cause analysis of this accident shows that a number of causes contributed to the walkways collapsing.  Investigation into the accident shows that the structural design of the walkway was inadequate.  A weld failed, which allowed a support rod to pull through the box beam, and the walkways fell.

Additionally, the weld was under greater stress than normal at the time of the failure because a large crowd had gathered to watch a dance contest.  About 20 people were on the second floor walkway and about 40 were on the fourth floor walkway at the time of the accident.  The higher loading contributed to the walkway collapse.

Identifying the failure mechanism is important during an investigation, but a thorough root cause analysis needs to take the analysis further to really understand the causes.  The reason that an inadequate design was built needs to be determined.

In this case, it appears that the design was changed without approval of the structural engineer.  This resulted from a communication error between the fabricator and the structural engineer.  The structural engineer sent a sketch of a proposed walkway design to the fabricator, assuming that the fabricator would work out the details of the design himself.  The fabricator assumed the sketch was a finalized drawing.  The fabricator then picked standard parts to fit the sketch.  This resulted in a significant change from the original design and dramatically decreased the load bearing capacity of the walkways.

The original design called for continuous hanger rods (a non-standard part that would have needed to be manufactured) that passed through the fourth floor walkway box beam to the second floor walkway, resulting in the ceiling connection supporting the weight of both walkways.  The fabricator changed the design to use two shorter rods (standard parts), which resulted in the fourth floor walkway supporting the weight of the second floor walkway, which it wasn’t designed to handle.
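
The effect of the rod change can be illustrated with a little arithmetic.  The following is a minimal sketch, not the investigators' calculation: the unit load P stands in for the load one walkway places on a single hanger location, and the function and key names are invented for illustration.

```python
# Hypothetical illustration of how the hanger rod change shifted load.
# P is an assumed placeholder load per walkway at one hanger location;
# it is not a figure from the investigation.

def original_design_loads(p):
    """Continuous rod: each box beam connection carries only its own walkway."""
    return {
        "fourth_floor_beam_connection": p,   # fourth floor walkway only
        "second_floor_beam_connection": p,   # second floor walkway only
        "ceiling_connection": 2 * p,         # rod carries both walkways up to the ceiling
    }

def as_built_loads(p):
    """Two shorter rods: the second floor walkway hangs from the fourth floor beam."""
    return {
        "fourth_floor_beam_connection": 2 * p,  # own walkway plus the second floor walkway
        "second_floor_beam_connection": p,
        "ceiling_connection": 2 * p,            # unchanged: still both walkways
    }

P = 1.0  # arbitrary unit load
before = original_design_loads(P)
after = as_built_loads(P)
ratio = after["fourth_floor_beam_connection"] / before["fourth_floor_beam_connection"]
print(f"Load on the fourth floor beam connection increased by a factor of {ratio:.0f}")
```

Under these assumptions the load at the ceiling is the same in both designs, but the load on the fourth floor box beam connection doubles, which is the connection that failed.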

It’s important to investigate beyond the point of inadequate design to learn what failed in the design process to prevent future accidents from occurring.

Deadly Plane Crash in Lagos, Nigeria

by ThinkReliability Staff

A devastating air crash in Lagos, Nigeria killed all on board and at least 10 on the ground.  This was the first major commercial air disaster in Nigeria since 2006.  Safety efforts since that disaster resulted in the US Federal Aviation Administration (FAA) granting Nigerian airlines its top air-safety rating.  Now concerns about air safety in Nigeria have resurfaced.  As a result of the crash, according to Harold Demuren, head of Nigeria’s civil aviation body: “We have suspended the entire Dana fleet.  They will be grounded as long as it takes to carry out the necessary investigations into whether they are airworthy.”

We can examine this incident in a Cause Map, or a visual root cause analysis.  We begin with the goals that were impacted.  In this case, the safety goal was impacted due to the deaths of people on the plane and on the ground.  We then ask “Why” questions to put together a very simple cause-and-effect relationship.  In this case, after losing both engines, Dana Air flight 992 crashed into a residential building in a highly populated suburb of Lagos, Nigeria, killing all 153 people on board and at least 10 on the ground.

The investigation of the plane crash is still ongoing.  However, it is known that both engines of the plane lost power, causing the plane to rapidly lose altitude and crash into a highly populated area.  Some of the areas being investigated that may have contributed to the crash are:

1) a bird strike (bird remains were found in one engine),

2) poor maintenance (although the plane was regularly inspected, there were also reports of leaking hydraulics and a history of poor airline safety in Nigeria, which appeared to have been remedied in recent years as indicated by the US FAA’s granting of its top air-safety rating),

3) overworked planes, likely due to financial considerations (the plane that crashed was on its fourth trip of the day), and/or

4) the age of the airplane (at 22 years old, it was technically not permitted to fly in Nigeria, which bans the use of planes over 20 years old).

As more information is revealed during the investigation it can be added to the Cause Map.  As the investigation is concluded, there will likely be more changes to Nigerian requirements and oversight for air safety.

To view the Outline and Cause Map, please click “Download PDF” above.

Unticketed Man Bypasses Security, Boards Plane

by ThinkReliability Staff

On May 29, 2012, a man boarded a flight from the tarmac at San Diego International Airport.  The man did not have a ticket, and had not been through security.  The extra passenger was not noticed until a flight attendant’s head count was found to be off.

We can examine this issue in a Cause Map, which is a root cause analysis that visually represents the cause-and-effect relationships that result in impacts to the organization’s goals.  We begin by defining what those impacts to the goals were.  In this instance, although no one was hurt, there was the potential for a safety issue.  The customer service goal was impacted due to the deplaning required for the passengers already onboard.  The schedule goal was impacted because the plane was delayed due to the rechecks required of passengers.  The personnel time required for these rechecks is an impact to the labor goal.

Once we’ve determined the impacts to the goals, we ask “why” questions to determine the causes resulting in the impacted goals.  In this case, the rechecks were required because the flight attendant’s head count was off.  The flight attendant’s count was off because a man without a ticket had boarded the plane.  Because the man was able to board the plane from the tarmac without showing a ticket, tickets were presumably not checked at the aircraft door.  Likely they were instead checked at the door leading to the tarmac.  The unticketed man bypassed that check by reaching the tarmac through an emergency exit door in a public area.  Security did not realize that he had exited this way, either because there was no alarm associated with using the door, or because the notification from the alarm was inadequate to ensure that security was notified.

According to San Diego Harbor Police Chief John Bolduc, “He completely bypassed TSA screening.  He was in a public area and went out an emergency fire door, which gave him access to the tarmac.”

The airport is carefully scrutinizing its security to ensure that this never happens again.  One solution would be to install emergency exit alarms so that security personnel are notified that security bypass procedures should be initiated.  A solution for the plane operators is to check tickets at the door of the plane, in addition to or instead of at the exit to the tarmac.

To view the Outline, Cause Map and potential solutions, please click “Download PDF” above.

The Collapse of Agricultural Buildings

By ThinkReliability Staff

Every winter there are pockets of agricultural building collapses in areas that have seen heavy snow and ice accumulation.  The causes for these collapses can be examined in a Cause Map, or visual root cause analysis.  First, we begin by capturing the basic information about the issue.  In this case, there were three areas that suffered building collapses due to winter weather accumulation.  These were New York State in 1999, Wisconsin in 2010, and England in 2010 and 2011.  Important to note is that each of these areas experienced heavy snowfall during the periods of collapse and in each region, agricultural buildings were more likely than other types of buildings to have collapsed.  It is also important to note that in each of these areas, agricultural buildings were not regulated to the same level as other buildings.

To start our root cause analysis, we begin with the impacts to the goals.  The collapse of an agricultural building carries with it the risk of human injury or loss of life, as well as potential loss of livestock.  A building collapse results in property damage as well as time spent on cleanup, repairs, and anything else that needs to be done to get the facility up and running again.

To continue the analysis, we begin with the impacted goals and ask “why” questions.  These impacts to the goals are all related to the collapse of an agricultural building.  A building collapses when the stress (in this case, the structural load) exceeds the strength.  The structural loads in the case of the collapsed buildings generally result from accumulation of ice and snow, which may be unevenly distributed due to drifting, increasing the local load.  The strength may be too low because the building was improperly engineered.  Agricultural buildings are more likely to collapse under structural loads because they are exempt from codes in most of the US and unregulated in England.  Even when a building is properly engineered, it may later be scaled up or altered, resulting in changing loads and strengths, meaning the engineering review may no longer be valid to protect the building.  Although engineering is frequently skipped to cut costs, experts say that proper engineering can save money by ensuring that supports are put in only where they’re needed (and, of course, by reducing the risk of a collapse).

Generally the collapsed buildings are found to have inadequate bracing, which reduces the strength of the building to the point of collapse.  If the buildings are not properly engineered, bracing may be inadequate for the design of the building.  Another issue frequently seen is that the trusses are engineered, but are not reviewed with respect to the overall building design, leading to an insufficient analysis that does not take into account all of the factors that impact building loads and strength.
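
The stress-versus-strength relationship described above can be sketched as a simple comparison.  This is only an illustration under stated assumptions: the function name, the drift and bracing multipliers, and all of the numbers are hypothetical and are not taken from any of the collapses or from a building code.

```python
# Hypothetical stress-versus-strength check for a roof.
# All numbers are illustrative assumptions, not values from any of the collapses.

def roof_collapses(uniform_snow_load, drift_factor, design_strength, bracing_factor):
    """Return True when the local load exceeds the available strength.

    uniform_snow_load: evenly distributed snow and ice load (e.g. kPa)
    drift_factor:      multiplier >= 1 for uneven accumulation from drifting
    design_strength:   load the roof could carry if fully braced (same units)
    bracing_factor:    fraction (0 to 1] of that strength actually available
    """
    local_load = uniform_snow_load * drift_factor
    available_strength = design_strength * bracing_factor
    return local_load > available_strength

# Evenly distributed snow on a fully braced roof: load 1.0 vs strength 1.5
print(roof_collapses(1.0, 1.0, 1.5, 1.0))   # False
# Same snowfall with drifting and inadequate bracing: load 2.0 vs strength of about 0.9
print(roof_collapses(1.0, 2.0, 1.5, 0.6))   # True
```

The same snowfall gives opposite results in the two calls, which is why both the load side (drifting) and the strength side (bracing) need to be reviewed for the building as a whole.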

Although states and countries could elect to consider agricultural buildings in their codes, farmers don’t need to wait.  If you are building an agricultural building (or any building that may be exempt from code), ensure it’s adequately structurally engineered.  It may save a life.

To view the Outline and Cause Map, please click “Download PDF” above.  Or read more:

1999 collapses in NY State

2010 collapses in Wisconsin

2010 & 2011 collapses in England

$3 Bolt Causes $2.2 million in Damages to US Submarine

By ThinkReliability Staff

A $3 bolt was left in the main reduction gear of the USS Georgia after a routine inspection.  The extensive damage caused by the bolt resulted in 3 months in the shipyard for the submarine, causing it to miss deployment.  The propulsion shaft was left to operate for two days after sounds indicated that there was something wrong.  This may have increased the damage to the main reduction gear – damage which cost $2.2 million.

How did the bolt end up in the main reduction gear? Why was the propulsion shaft operated for 2 days after damage was suspected?

We can look at the causes that led to this incident in a Cause Map, a visual root cause analysis that clearly outlines cause-and-effect relationships that result in impacts to an organization’s goals.  The first step to building a Cause Map is to determine how the issue impacts the organization’s overall goals.  Here we can consider the US Navy as the organization.  The customer service goal (with the rest of the country as the “customers”) was impacted because the submarine was unavailable for deployment.  The production/schedule goal was impacted because the submarine was in the shipyard for three months.  The damage to the main reduction gear is an impact to the property goal, and the repairs are an impact to the labor/time goal.  The total cost resulting from this issue was estimated to be $2.2 million.  Once the impacts to the goals have been determined, we can ask “why” questions to put together the cause-and-effect relationships that led to these impacts.

The bolt was left behind after a routine, annual inspection.  Because of the great potential for damage when foreign objects remain within equipment, detailed procedures are used for these inspections and include a log of all equipment brought into the area and a protective tent to keep objects from falling in.  Details of what went wrong that resulted in the bolt falling into the main reduction gear were not released, but the inspection was reported to have “inadequate prep and oversight” which likely contributed to the issue.

After the propulsion shaft was turned back on, noise indicated that there was a problem.  However, the shaft was operated for two days in a failed attempt at troubleshooting.  It’s likely that this increased the damage to the main reduction gear.  It is unknown what procedures were – or should have been – in place for troubleshooting, but the actions taken as a result of this incident suggest that proper procedures were not followed once the damage was suspected.

In this case, members of the crew who were found to not have performed their jobs – possibly by not following proper procedure – were punished in varying ways.  It is likely that the investigation went into great detail about whether procedures were adequate, what steps were not followed, and why, and that the results were also used to improve procedures for the next inspection.

To view the Outline and Cause Map, please click “Download PDF” above.

Fire kills 146, Leads to Improved Working Conditions

By ThinkReliability Staff

146 workers were killed when a fire raced through the Triangle Waist Company, which occupied the top three floors of a skyscraper in New York City.  The workers were unable to escape the fire.  We can examine this incident using a Cause Map, a visual form of root cause analysis, which allows us to diagram the cause-and-effect relationships that led to organizational issues – in this case, the deaths of 146 workers.

On March 25, 1911, at approximately 4:40 p.m., a fire began on the 8th floor of a New York City skyscraper (one of three floors housing the Triangle Waist Company).  Although it’s not clear what sparked the fire (cigarettes and sewing machine motors are likely heat sources), a large amount of accumulated scraps (last picked up in January) provided plenty of fuel.  There were no sprinklers and the interior fire hose was not connected to a water source.  The fire spread quickly and burned for approximately half an hour before firefighters extinguished it.

During that half-hour, 146 workers, mostly young women, were killed.  Nearly all of these workers were from the 9th floor of the building.  Workers from the 8th and 10th floors were able to escape to the ground or roof using the stairs, but one of the access doors on the 9th floor was locked.  This left only one set of stairs and the elevators.  The elevators did rescue many, but they were overcrowded and the elevator machinery eventually failed due to heat.  Many attempted to escape using the fire escape, which was not built for quick escape (in fact, experts determined it would take 3 hours to reach the ground from the Triangle Waist Company floors) and eventually collapsed under the collective weight, killing those on it in the fall.

Many workers jumped from the 9th floor, but the force of the fall was too great for the fire nets, most of which broke, and the jumpers died.

People were horrified at the conditions in the factories that resulted in these deaths.  In the following years, public outcry resulted in many workers’ rights improvements, including many advances in regulations regarding fire protection and working conditions.  However, these types of issues continue in other countries that have not defined such requirements.

To view the Outline and Cause Map, please click “Download PDF” above.  Or click here to read more

113 Killed When a Plane Hit a Hill in Guadeloupe

By ThinkReliability Staff

Flying into a small airport surrounded by mountains at night, in a thunderstorm, with virtually no support from ground equipment proved to be too difficult for even an experienced pilot.

All 113 passengers and crew on Air France Flight 117 were killed when the plane crashed into a hill near the airport in Pointe-à-Pitre, Guadeloupe on June 22, 1962.  The crash occurred in the early morning hours, during a severe thunderstorm.  We can examine the causes of this tragedy in a Cause Map, a visual form of root cause analysis that shows the cause-and-effect relationships that led to an incident such as this one.  The VHF (very high frequency) omnidirectional range (VOR) indicator, which helps aircraft determine position and stay on course, at the airport in Guadeloupe was not functional.  (It’s not clear if the crew of the Air France flight was aware of this, or how long the equipment had been broken.)  The plane in question was a Boeing 707.

The safety goal was impacted because all people onboard the plane – passengers and crew – were killed.  The plane (valued at $5.5 million) was completely destroyed.  The lack of a working VOR, and the incorrect information provided by the automatic direction finder (ADF), can be considered impacts to the customer service goal.  Beginning with the impacted safety goal, we can ask “Why” questions to begin mapping cause-and-effect relationships.  The passengers and crew were killed (and the plane destroyed) when the plane crashed into a hill.

The plane crashed into a hill because the airport was surrounded by mountains, and the plane strayed off the let down track, which it should have used for its approach to the airport.  The pilot went off track because he was using a visual approach, probably because the VOR was not working and so was not providing data.  The pilot was unable to see the track due to low (10 km) visibility and darkness, since it was early morning (~4 a.m.).  In addition, the plane received incorrect position indication from the ADF, which appeared to malfunction as a result of the severe thunderstorm in the area.

This incident resulted in concern among pilots about substandard landing conditions at certain airports.  More care is now taken with take-off and landing during inclement weather, poor visibility, or conditions that result in landing with decreased equipment support.

To view the Outline and Cause Map, please click “Download PDF” above.

Deadly Sawmill Explosion

By ThinkReliability Staff

An explosion and subsequent fire at a sawmill in British Columbia has killed two workers and injured two dozen more.  Although the cause of the explosion is not known, there have been five explosions linked to wood dust in British Columbia since 2009.

A dust explosion results from the presence of combustible dust, such as that created by the sawmilling process.  In order for an explosion to occur, the dust must be dispersed into the air but confined by a structure in the presence of oxygen and a spark.  (Learn more about dust explosions.) 

To view all the causes that contributed to this tragic explosion, we can examine the incident in a Cause Map, or visual root cause analysis.  We begin with the impacts to the goals. The employee deaths and injuries are an impact to the safety goal.  This is the primary focus of any issue that results in human death or injury.  In addition, the environmental goal was impacted as the smoke migrated to the nearby town.  The production goal was impacted due to the shutdown of the facility.  The property goal was impacted due to destruction of the sawmill, log processing facility, and sorting facility.  Lastly, the investigation and cleanup will impact labor goals.

Once we have determined the impacts to the goals, we can ask “why” questions to determine the cause-and-effect relationships that led to the incident.  In this case, the injuries were due to the fire.  The fire may have been caused by a dust explosion (an explosion due to a natural gas leak has been ruled out).  In order for a dust explosion to occur, five factors are necessary: 1) combustible dust is present, 2) oxygen is present, 3) the dust is dispersed into the air, 4) the dust particles are confined, and 5) the mixture is ignited.
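
Because all five factors are required, removing any one of them prevents a dust explosion.  The short sketch below illustrates that logic only; the factor names and helper functions are hypothetical labels chosen for this example, not part of any standard or of the investigation.

```python
# The five conditions listed above; all must hold for a dust explosion.
# Factor names are illustrative labels, not terms from a standard.
FACTORS = ("combustible_dust", "oxygen", "dispersion", "confinement", "ignition")

def dust_explosion_possible(conditions):
    """conditions maps each factor name to True/False; missing factors count as False."""
    return all(conditions.get(factor, False) for factor in FACTORS)

def missing_factors(conditions):
    """List the factors that are absent, in the order given in FACTORS."""
    return [f for f in FACTORS if not conditions.get(f, False)]

# A facility where every factor is present
sawmill = dict.fromkeys(FACTORS, True)
print(dust_explosion_possible(sawmill))  # True

# Removing any single factor, e.g. through cleaning that removes the dust
cleaned = dict(sawmill, combustible_dust=False)
print(dust_explosion_possible(cleaned))  # False
print(missing_factors(cleaned))          # ['combustible_dust']
```

This is why solutions tend to target the factors that can practically be controlled, such as the dust and the ignition sources, rather than oxygen.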

In this case, the ignition source is not known and, due to the damage at the facility, may never be conclusively determined.  Similarly, the cause that resulted in the dust being dispersed may also not be known.  The oxygen must be present for worker safety and the dust is confined because it is held within a closed structure.  The dust is present because it is created during sawmilling operations.  What makes a dust combustible depends on the properties of the dust.  This mill was processing pine beetle wood, or wood that was ravaged by beetles.  This makes the wood drier, which results in a drier, finer, more combustible dust.  Thorough cleaning of any facility that creates potentially combustible dust is a necessity – inadequate cleaning (including dust that may gather on hard-to-access surfaces, such as the ceiling) increases the possibility of an explosion.  The union believes that cleaning has been reduced as a result of the economy.

Local government has begun inspections of sawmills and is also asking plants to examine potential dust and ignition sources.  Reducing dust and ignition sources is the most effective way to reduce the risk of dust explosions.  Other solutions being considered include adding water to the air to increase humidity and increasing ventilation, which can reduce the confinement of the dust and increase cleanliness.

To view the Outline and Cause Map, please click “Download PDF” above.

 

Is Lard The Misunderstood Fat?

By Kim Smiley

For many of us the word lard instantly evokes images of clogged arteries and heart disease.  A hundred years ago, lard was a staple item in nearly every pantry, but today few of us can imagine cooking with such an unhealthy substance.

But what if lard isn’t as bad as the collective knee-jerk reaction would lead us to believe?

While lard is certainly not olive oil, the reality is that lard is actually a relatively healthy option when a solid fat is needed.  This is true because most of the fat in lard is monounsaturated fat, which is healthier than saturated fat.  The fat in lard is 40 percent saturated compared with 60 percent saturated fat in butter.   The partially hydrogenated fats found in vegetable shortening are now considered to be the least healthy option.  While Crisco no longer contains trans-fats, lard has always been naturally trans-fat free.

So how did lard get such a bad reputation?  A Cause Map, a visual, intuitive root cause analysis format, can be built to explore this question.  To view a high level Cause Map of this example, click on “Download PDF” above.

A recent article from National Public Radio tried to answer the question – who killed lard?  The article claimed that a number of factors contributed to the fall of lard’s popularity.  The public became uneasy about the pork industry after the publication of Upton Sinclair’s The Jungle.  Included in the book was a disgusting scene that depicted workers falling into vats of lard and being sold along with it for human consumption, which understandably cooled the public’s appetite for lard.

A second major factor was that an alternative fat product became available that offered an option to a public queasy about the pork industry.  Crisco came on the market, armed with a massive marketing campaign, offering a fat option that wasn’t associated with the pork industry.  The creation of Crisco was possible because of the invention of hydrogenation and a surplus of cottonseed oil.  The oil had previously been used to manufacture candles, but the invention of the light bulb had dimmed the demand.

At about the same time Crisco was hitting shelves, scientists began asking questions about the saturated fat in lard.  Ironically the bad publicity about the health impacts of trans-fats (which were in shortening at the time) was years away, but the early findings that linked saturated fat to heart disease were another strike against the popularity of lard.

Today, lard is making a comeback with foodies, but it still isn’t widely used and it is difficult to find in stores.  Only time will tell if lard will once again become a popular pantry staple.

Deadly Stage Collapse at State Fair

By Kim Smiley

On August 13, 2011, a stage at the Indiana State Fair collapsed, killing seven and injuring dozens more.  The accident occurred just before 9 pm as a crowd waited to watch the popular country band Sugarland perform.

Why did the stage collapse?  What caused this tragic accident to occur?

This incident can be analyzed by building a Cause Map, an intuitive, visual format for performing a root cause analysis.  The first step when beginning a Cause Map is to determine what goals have been impacted.  In this example, the focus will be on the safety goal since there were fatalities and many injuries.  Once the impact is determined, the Cause Map is built by asking “why” questions to determine what causes contributed to the accident.

In this example, people were killed and injured because they were near the stage and the stage collapsed.  They were near the stage because they were waiting for a concert and the area had not been evacuated.  The area had not been evacuated because the decision to evacuate wasn’t made in time.  The decision didn’t happen in a timely manner because it wasn’t clear who had the authority to make the decision because there was not an adequate emergency plan in place.  The bad weather wasn’t a surprise.  The storm was being monitored and the National Weather Service had issued a warning, but the decision to evacuate wasn’t made until too late to prevent the tragedy.
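
The chain of “why” questions above can be represented as a simple data structure.  The sketch below is a hypothetical, simplified illustration of that idea and is not the Cause Map in the PDF: the dictionary wording, `ask_why`, and `leaf_causes` are all invented here for illustration.

```python
# Hypothetical, simplified representation of the cause-and-effect chain above.
# Each effect maps to the causes that answer "why?" for it.
cause_map = {
    "people killed and injured": ["people near the stage", "stage collapsed"],
    "people near the stage": ["waiting for a concert", "area not evacuated"],
    "area not evacuated": ["decision to evacuate not made in time"],
    "decision to evacuate not made in time": ["unclear who had authority to decide"],
    "unclear who had authority to decide": ["no adequate emergency plan"],
}

def ask_why(effect, depth=0):
    """Walk the map depth-first, returning indented lines, one per cause."""
    lines = ["  " * depth + effect]
    for cause in cause_map.get(effect, []):
        lines.extend(ask_why(cause, depth + 1))
    return lines

def leaf_causes(effect):
    """Return the causes for which no further "why" has been recorded yet."""
    causes = cause_map.get(effect, [])
    if not causes:
        return [effect]
    found = []
    for cause in causes:
        found.extend(leaf_causes(cause))
    return found

print("\n".join(ask_why("people killed and injured")))
print(leaf_causes("people killed and injured"))
```

The leaf entries are the places where more “why” questions can still be asked; as an investigation continues, new entries can be added to the dictionary under them.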

Recent findings by investigators determined that the stage collapsed because it wasn’t up to code.  The structure was required to be able to withstand winds up to 68 mph, but the stage collapsed at winds below this limit.  Investigators determined that the lateral supports were inadequate and the stage wasn’t strong enough to stand up to the wind.  The stage also wasn’t inspected because it was a temporary structure, and temporary structures are not required to be inspected.

On Tuesday (April 17, 2012), Indiana Governor Daniels reported that he had ordered temporary outdoor structures to be inspected by the Indiana Department of Homeland Security to help prevent a similar accident in the future.

To view a high level Cause Map of this incident, click “Download PDF” above.