Category Archives: Root Cause Analysis – Incident Investigation

Chernobyl Reactor Explosion

by ThinkReliability Staff

On April 26, 1986, reactor #4 at the Chernobyl Power Plant exploded, spreading radioactive contamination.  There is much debate about the effects, the magnitude of the effects, and the causes, but we can put together a summary of the root cause analysis here.

It is estimated that thousands (perhaps tens of thousands) of people will die from the aftereffects of Chernobyl.  More than 4,000 children have contracted thyroid cancer.  Additionally, between 50 and 250 million Curies of radioactivity were released, more than 350,000 residents have been resettled, a large area remains contaminated, and over 20 countries received radioactive fallout.

The radioactivity, which had built up in the reactor, was released by an explosion and a fire that occurred due to an uncontrolled power surge.  Inadequate containment resulted in the radioactivity spreading beyond the plant.  The power surge resulted from several actions that increased power and disabled safety systems, and from an unsafe reactor design.  (The reactor was designed so that increased steam production leads to an increase in power. US reactor designs are the opposite.)

The aftereffects of Chernobyl continue.  Lessons learned from the root cause analysis have been applied in many areas – nuclear power, evacuation planning, radiation health treatments, and food supply.  The only remaining reactors of this type are being shut down.  Hopefully this will not only ensure that another Chernobyl never occurs, but will also improve the safety of many other industries.

Finding Solutions

By ThinkReliability Staff

Once you’ve finished your root cause analysis, determined what the causes of a given incident are and built the Cause Map, now comes the really important part: how do you make sure it never happens again?  To keep an incident from happening again, an organization needs to implement solutions. The first step to implementing solutions is to find possible solutions.  We do this by brainstorming.  The brainstorming process is made easier by the root cause analysis, because instead of finding a solution for “person falls down stairs” we brainstorm solutions for very specific causes, such as “stairs were wet” and “handrail doesn’t extend far enough”.  There are many different methods for brainstorming, but the important point is: don’t discount any suggestions.  Write them down, and move on.  We’ll sort through them later.  Attach the solutions to the causes they control (for example, a solution to “stairs were wet” is “cover stairs from exposure to rain”).  Some causes won’t have any solutions, and some solutions will appear on more than one cause.

Have a wide variety of personnel available for brainstorming.  Sometimes it’s easier for someone farther from the work to see potential solutions, and sometimes the people who do the work every day will have great suggestions they’ve been waiting to bring up.  The more suggestions, the better!  Sometimes a seemingly crazy suggestion will lead to a very practical solution.  Allow people to add on to others’ suggestions.  This can result in a synergistic solution better than the original suggestion.

Once the brainstorming is complete, you’ll have a list of possible solutions.  There are as many ways to select solutions as there are to brainstorm, but I suggest something like the following.  First, make a list of the solutions.  Rate the effectiveness of each solution at preventing similar types of incidents (from 1 to 10, 1 being not very effective, 10 being very effective).   Then rate the ease of implementing the solution (from 1 to 10, 1 being not very easy to implement, 10 being very easy to implement).   Multiply the two together for each solution’s score.  Then, rank the solutions.  The solutions at the top will give you the most “bang for your buck”, or are the most easily-implemented, effective solutions.
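The scoring and ranking described above can be sketched in a few lines of Python.  This is only an illustration, not a prescribed tool: the class name, the function name, and the three example solutions with their ratings are made up for the stairs example, and your own team would supply the real ratings.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    """One brainstormed solution with its two 1-to-10 ratings."""
    name: str
    effectiveness: int  # 1 = not very effective, 10 = very effective
    ease: int           # 1 = not very easy to implement, 10 = very easy

    @property
    def score(self) -> int:
        # Multiply the two ratings together for the solution's score.
        return self.effectiveness * self.ease


def rank_solutions(solutions):
    """Return the solutions sorted from highest score to lowest."""
    for s in solutions:
        for rating in (s.effectiveness, s.ease):
            if not 1 <= rating <= 10:
                raise ValueError(f"ratings must be 1 to 10: {s.name}")
    return sorted(solutions, key=lambda s: s.score, reverse=True)


# Hypothetical ratings for the "person falls down stairs" example.
candidates = [
    Solution("Cover stairs from exposure to rain", effectiveness=8, ease=3),
    Solution("Extend the handrail", effectiveness=7, ease=6),
    Solution("Post a wet-stairs warning sign", effectiveness=3, ease=10),
]

ranked = rank_solutions(candidates)
for s in ranked:
    print(f"{s.score:3d}  {s.name}")
```

With these made-up ratings, extending the handrail comes out on top (7 × 6 = 42), ahead of the warning sign (30) and the stair cover (24).  Multiplying rather than adding the ratings penalizes any solution that is very weak on either measure.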

Grounding of the Empress of the North

by Kim Smiley

On May 14, 2007, the 300-foot cruise ship Empress of the North ran aground on rocks while rounding Rocky Island during a trip through Alaska’s Inside Passage.  There was significant damage to the hull and the two starboard propellers needed to be replaced.  Costs of repairs totaled more than $4.8 million.  Luckily no one was injured, but over two hundred passengers had to be evacuated from the ship.

This is a common route for cruise ships and the rocks were a well-known hazard clearly marked on navigation charts.  So what happened?

A root cause analysis shows that there were many causes that contributed to the accident.  One of the causes is that there were no lookouts at the time of the accident.  The crew members who would have acted as lookouts were performing security rounds.  This was in violation of regulations requiring lookouts at all times and appears to have been a common practice for the crew.

When determining causes it’s important to ask, what is different?  In this case, this was the first watch as Deck Officer for the officer in charge.  He had recently graduated, was newly licensed and inexperienced.  He was not familiar with the deck procedures and the equipment. There was a lot of confusion about watch team roles and he didn’t attempt to take charge of the ship’s navigation until seconds before the grounding occurred.  The National Transportation Safety Board (NTSB) found that the actions, or inaction as the case may be, of the Deck Officer were one of the major factors contributing to the accident.

It’s tempting to stop at this point, but the analysis needs to go farther than just identifying the actions of the Deck Officer as a cause to do a thorough investigation.  Why was he standing watch if he wasn’t fully qualified?  Why wasn’t he prepared adequately prior to being given the responsibility?

The crew member originally assigned the watch was ill.  There are a limited number of possible replacements on a ship this size.  The Master of the ship believed the watch would be a good training watch because it was an easy watch with minimal course corrections needed.  It was also not the practice of the crew to have specific night orders for the overnight watches so the newly arrived junior third officer found himself standing the midnight to 4 am watch with minimal guidance.

Many investigations lead back to human error, but it’s important to ask questions beyond that point.  Changing how people are trained, improving the environment, and providing specific written instructions can help prevent human errors in many cases.

(The photo above is an official Coast Guard photo.)

Preventing Dog Attacks

Download PDF

by ThinkReliability Staff

The occurrence of dog attacks is a significant ongoing problem.  An estimated 4.5 million people are attacked each year, of whom 800,000 seek medical care.  These statistics only include attacks that were significant enough to be reported, so the actual incidence is no doubt larger.  One action that has been taken to reduce the incidence of dog attacks is banning specific dog breeds associated with aggressive tendencies (mostly large breed dogs like Pit Bull Terriers, Boxer Dogs, and German Shepherd Dogs), known as Breed Specific Legislation (BSL).

Although BSL is gaining popularity, it does not address all the causes of dog attacks.  A root cause analysis of dog attacks identifies factors related to the dog (inherent temperament, socialization, protective tendencies, location and level of restraint), the owner (treatment and control of the dog) and the victim (behavior, location, age and experience with dogs).  The etiology of a dog attack is multifactorial and as such, should be dealt with in a broad and diverse approach.

Some suggested alternatives to BSL that take into account the complex nature of dog attacks and are targeted at preventing all dog attacks follow:

– Education about proper behavior around dogs would greatly decrease the potential for dog attacks.  Approximately 80% of attacks are by a known dog and more than half of attacks are against children under 12, suggesting that human behavior around a dog is an important trigger since children are more likely to engage in activities that may be perceived as threatening (such as loud noises, running, improper touching).

– Proper enforcement of existing legislation is a readily available method of reducing dog attacks, as many municipalities have restraint laws that are poorly enforced.  An attack cannot occur without the interaction of a dog and person.  Proper restraint on and off private property would reduce the potential for attacks.

– Stricter regulations and more frequent inspections of breeding operations could play a role in reducing improper treatment of young dogs.  Early socialization plays a large role: puppies that have little interaction or negative interaction with humans are more likely to develop aggressive tendencies.  In most cases this early interaction occurs within breeding operations.

– Encouragement of voluntary spaying and neutering takes advantage of a widely available procedure to reduce the potential for dog attacks.  One of the most significant predictors of attack is a sexually intact dog.  Outside of a breeding operation there is little reason for not spaying or neutering, and the procedure can have additional benefits for the health of the animal, help control the dog population, and reduce unwanted dogs.

To view the PDF file including the root cause analysis of a dog attack, please click “Download PDF” above.

Sinking of the Andrea Doria

By ThinkReliability Staff

On July 25, 1956, the Andrea Doria (an Italian luxury passenger liner) was struck off Nantucket by the Stockholm (a Swedish passenger liner).  Andrea Doria was struck by Stockholm’s bow, which was bad enough.  What made it even worse was that Stockholm was outfitted with a reinforced icebreaking bow for its travels in frigid waters.  If you look at the severe damage to Stockholm’s reinforced bow (estimated to be $1 million in 1956 dollars), it’s no surprise that Andrea Doria suffered fatal damage.

Although one lesson we can take from this is to never be arrogant enough to call your ship “unsinkable”, we can perform a root cause analysis into the tragedy to determine what else went wrong.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.

First, we look at the impact to the goals.  51 people were killed (46 on Andrea Doria, 5 on Stockholm).  This is an impact to the safety goal.  The $29 million (in 1956 dollars) Andrea Doria was a total loss, and  Stockholm suffered $1 million worth of damage.  These are both impacts to the material goal.

When Stockholm struck Andrea Doria, it ripped a 50×30 foot hole in Andrea Doria.  This compromised Andrea Doria’s watertight compartment system (one of the features that made it “unsinkable”), so it began to take on water.  Within 5 minutes of the collision, it was listing 20 degrees starboard.  It was designed to stay afloat with a 15 degree list (another “unsinkable” feature), but not as much as 20, so the ship sank.

Now, why did the Stockholm’s bow strike Andrea Doria’s side?  Stockholm turned starboard, trying to avoid Andrea Doria because they were on a collision course.  The turn was insufficient because of a delay in response time by Stockholm while they plotted the course of the oncoming vessel, which was standard procedure, and because their speed was not reduced.  Both the delay and the speed not being reduced were partially caused by an inexperienced watch – a 3rd mate was in charge and he was the only officer on deck.  It is also believed that the navigator on Stockholm was unaware of the fog.  (Note that although Andrea Doria was in extremely thick fog, Stockholm sailed in clear skies until just before the collision.)  Andrea Doria’s starboard side was exposed because they made a hard left turn, attempting to avoid Stockholm, which was also insufficient due to their speed, which was not reduced sufficiently because the ship was trying to make good time.  Operations in fog call for “moderate speed”, which is defined as the speed at which a ship could be stopped within its visibility distance.  Andrea Doria’s visibility was 1/2 mile, while its stopping distance was far greater.  (While Stockholm had not yet reached the fog, Andrea Doria was already in it, which would seem to be reason enough to reduce speed.)  We’ll also tie the fact that they were on a collision course as a reason for the impact.

How did the two ships get on a collision course?  Andrea Doria made an unexpected turn, to attempt to pass Stockholm starboard to starboard, despite the fact that ships normally pass port to port, per rules of the road.  They did this because they believed Stockholm was already to their starboard side.  They were unaware of Stockholm’s course because they did not plot it (possibly because the Captain was relying on his two state of the art radar systems).  Additionally, Stockholm was north of its recommended route, because the recommended route added distance and time, and was very crowded.

Stockholm turned starboard, to try and avoid Andrea Doria; however, Stockholm had miscalculated Andrea Doria’s position and course, partially due to ineffective navigation on Stockholm.  (Either Stockholm’s radar was providing incorrect data  or, as some experts believe, the radar data was being misinterpreted because the scale, which had to be manually set, was on the wrong setting.)

The ships also suffered from a lack of communication:  Stockholm was not using proper signals (its fog horn and turn signal).  There was no visual contact between the ships due to reduced visibility from fog and the fact that the ships were traveling at night.  Also, there were no radios to communicate between the ships (a fact that has thankfully been remedied).  The attached PDF, available for download, has a high-level visual root cause analysis (cause map) of the incident.  Even more detail can be added to this Cause Map as the analysis continues. As with any investigation the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals.  (In the case of Andrea Doria, the high level cause map has 16 boxes; the detailed map has more than 100.)

Northeast Blackout of 2003

By ThinkReliability Staff

On August 14, 2003, over 50 million people in the U.S. and Canada were without power, some for several days.  Damages from the loss of power – including damaged refrigerated items and looting – totalled approximately $6 billion (U.S.).  508 generating units shut down, resulting in the loss of border and port control systems.  After the blackout, a U.S.-Canada Power System Outage Task Force was appointed to investigate the cause.  We will use the data they obtained to perform a root cause analysis of the event.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.

The blackout was triggered by a shut-down cascade, unsustainable power surges in numerous transmission lines.  This occurred due to a supply/demand mismatch – a large decrease in available power without load shedding (where operators drop some consumers off the grid to prevent outages).  Operators did not shed loads because they weren’t warned of impending outages, due to a lack of communication from FirstEnergy, the company whose lines began shutting down first, and a lack of warning by the regional coordinator.

The decrease in available power was due to a key transmission line being shut down.  This happened because the line sagged under a power surge and contacted overgrown trees.  That surge occurred because other, smaller lines had already shut down after they too sagged and touched overgrown trees.  The lines originally sagged due to power surges caused by an automatic shutdown of a power generating unit.  The power surge could have been stopped by operators shedding loads, but they did not because they were not immediately aware of problems, thanks to a failure in their grid monitoring equipment, and due to a lack of training.

Due to the complexity of the event, it is possible to make a much more detailed Cause Map.  As with any investigation the level of detail in the root cause analysis is based on the impact of the incident on the organization’s overall goals.  For example, this map has 21 boxes.  The detailed map that includes the findings of the Task Force has more than 70 boxes, and is at a more appropriate detail to find solutions to ensure that this sort of energy reliability problem does not happen again.

Smoking – Effects and Causes

By ThinkReliability Staff

Currently, more than 43 million Americans smoke.  Why does this happen, and what effect does it have?  We will do a very simplistic root cause analysis.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.

Smoking leads to an estimated 440,000 premature deaths each year.  This includes deaths caused by smoking and by exposure to secondhand smoke.  Additionally, 8.6 million people suffer from smoking-related illnesses, and 900 infant deaths are caused annually by smoking during pregnancy.  These are all impacts to the safety goal.  The deaths and diseases are caused because smoking raises the risk of cancer, cardiovascular disease, and respiratory disease.  The first two are caused by exposure to tobacco smoke (including secondhand smoke) and the third is caused by inhalation of smoke.  Either way, the cause is that many people smoke cigarettes.

Why do people smoke?  Well, it’s because they start smoking and because it is extremely difficult to quit.  There are many reasons why it is difficult to quit.  Some of these reasons are: cigarettes are extremely addictive, severe withdrawal symptoms cause relapses, smokers have a lack of assistance in quitting, they are afraid of weight gain, and there is a lack of increase in the cost of cigarettes.  This last one sounds odd, but studies have shown that an increase in the cost of cigarettes decreases the number of smokers.  However, the cost of cigarettes does not reflect the true cost of cigarettes (based on health costs and productivity losses), and the small increase in taxes (which has not kept up with inflation) is offset by cigarette company promotions.

People start smoking because of the positive imagery of smoking – the heavy advertising and promotion of cigarettes, smoking in popular culture (mainly movies), and the lack of counter-advertising by federal organizations and anti-smoking campaigns.  Additionally, most smokers (90%) start as children (before the age of 18) because cigarettes entice children, there is a lack of counseling against their use, teens may suffer from peer pressure encouraging them to smoke, and teens are more susceptible to addiction than adults.

 

Hubble Focusing Issues

By Kim Smiley

The Hubble Space Telescope was launched on April 24, 1990.  Once in orbit, it was quickly discovered that the images from Hubble were blurred.  An investigation into the issue revealed that Hubble’s primary mirror was not built to specification and couldn’t properly focus the light.  Specifically, the mirror was flattened too much away from the center, which caused the light reflected from the edge of the mirror to focus on a slightly different location than the light reflected from the center.  The primary mirror in Hubble was only off specification by 2.3 micrometers, but the result for the $1.5 billion project was disastrous.

Solving Hubble’s focus issues was no small feat.  How do you repair a mirror that can’t be replaced in orbit when it is cost prohibitive to bring it back to Earth for repair?  The answer was to modify the lens (which met specifications) to work with the off-specification mirror.  COSTAR (Corrective Optics Space Telescope Axial Replacement) was added to Hubble during the first servicing mission in December 1993.  COSTAR is essentially eyeglasses for Hubble: additional lenses built with the same error as the mirror, but in the opposite direction, so that the effects of the off-specification mirror shape are canceled out.  With the addition of COSTAR, Hubble met its original design goals.

The primary mirror was constructed with a flaw because the tool used to create the template that guided the shaping of the mirror, called a null corrector, was itself flawed.  Null correctors use precisely located mirrors and lenses to determine the shape of a mirror.  In order to assemble null correctors, reflected light is used to measure the distance between the mirror and the lens inside the tool.  When the null corrector used to shape Hubble’s primary mirror was assembled, a measurement error was made.  A small amount of reflective coating had fallen off an internal piece of the instrument and the laser used to perform the measurement reflected off the wrong location, resulting in a lens being 1.3 mm too far from the mirror.  Null correctors are extremely precise and do not change once assembled, so the Hubble team used a single instrument to guide the mirror shape.  A single flawed tool and inadequate quality controls resulted in a flawed mirror.

A visual representation of the root cause analysis has been created as a Cause Map that can be downloaded.

View a video about the Hubble Telescope.

Brooklyn Bridge Turns 125

By Kim Smiley

Brooklyn Bridge marks its 125th birthday on May 24, 2008.  When performing a root cause analysis it is easy to spend a large amount of time focused on failures, but today engineers should take a moment to appreciate the accomplishment of this truly amazing feat.  The bridge has been refurbished many times, but the towers, main cables, and main beams are original and are now 125 years old.

At the time the Brooklyn Bridge was constructed, the 6,000 ft long bridge was roughly six times as long as the longest bridge of the type that had previously been built.  The Brooklyn Bridge is one of the nation’s oldest and most treasured suspension bridges.  It has shaped the development of New York City.  At the time it was constructed Brooklyn was largely rural, and the bridge helped spark a growth spurt that dramatically changed the face of Brooklyn.  Brooklyn’s population grew by 42 percent between 1880 and 1890.  At last count in 2006, the bridge carried 126,000 cars per day.

Recent inspections have revealed some deterioration of the bridge, primarily with the newer approach ramps.  In a recent survey, state inspections ranked its condition as “poor”.  New York City plans to spend $250 million to $300 million to fix and repaint the bridge.  Hopefully these updates will return the bridge to good condition and it will continue to safely serve the citizens of New York City for many decades to come.

Train Derailment – Lafayette, Louisiana

by Kim Smiley

About 1:40 am on May 17, six rail cars derailed and overturned near Lafayette, Louisiana.  One of the cars was damaged and leaked about 11,000 gallons of hydrochloric acid.  Five people, including two rail workers, were sent to a hospital and treated for eye and skin irritation.

Authorities evacuated people within 1 mile of the accident.  Approximately 3,000 people were affected, including a few small businesses and a nursing home.  All affected people are being reimbursed for food and hotel costs by the railway company that operated the train.

There was potential for further release of chemicals because one of the rail cars involved in the accident carried ethylene oxide, a flammable and dangerous chemical, and two of the remaining cars also carried hydrochloric acid.

The Louisiana State Police’s hazardous materials unit is overseeing clean-up of the accident site.  The spill is being neutralized with lime and the contaminated material will be removed and disposed of.  The rail car containing ethylene oxide was removed from the site quickly to reduce the potential for additional problems.

The cause of the derailment is not known at this time.  The Federal Railroad Administration will conduct an investigation of the accident.

The attached PDF file contains an intermediate level root cause analysis of the train derailment built using Cause Mapping, a visual form of root cause analysis.  It was built using the facts that were available in media reports on the accident.  As more details are known, the Cause Map can be expanded.