All posts by ThinkReliability Staff

ThinkReliability are specialists in applying root cause analysis to solve all types of problems. We investigate errors, defects, failures, losses, outages and incidents in a wide variety of industries. Our Cause Mapping method of root cause analysis captures the complete investigation, along with the best solutions, in an easy-to-understand format. ThinkReliability provides investigation services and root cause analysis training to clients around the world and is considered the trusted authority on the subject.

Interim Recommendations After Fatal Chemical Release

By ThinkReliability Staff

After a fatal chemical release on November 15, 2014 (see our previous blog for an initial analysis), the Chemical Safety Board (CSB) immediately sent an investigative team. The team spent seven months on-site. Prior to the release of the final report, the CSB has approved and released interim recommendations that will be addressed by the site as part of its restart.

Additional detail related to the causes of the incident was also released. As more information is obtained, the root cause analysis can be updated. The Cause Map, or visual root cause analysis, begins with the impacts to the organization’s goals. While multiple goals were impacted, in this update we’ll focus on the safety goal, which was impacted due to four fatalities.

Four workers died due to chemical asphyxiation. This occurred when methyl mercaptan was released and concentrated within a building. Two workers were in the building and were unable to get out. One of these workers made a distress call, to which four other workers responded. Two of the responding workers were also killed. (Details on the attempted rescue process, including personal protective equipment used, have not yet been released.) Although multiple gas detectors alarmed in the days prior to the incident, the building was not evacuated. The investigation found that the alarms were set above permissible exposure limits and did not provide effective warning to workers.

Methyl mercaptan was used at the facility to manufacture pesticide. Prior to the incident, water accessed the piping system. In the cold weather, the water and methyl mercaptan formed a solid, blocking the pipes. Just prior to the release, the blockage had been cleared. However, different workers, who were unaware the blockage had been cleared, opened valves in the system as previously instructed to deal with a pressure problem. Investigators found that the pressure relief system did not vent to a “safe” location but rather into the enclosed building. The CSB has recommended performing a site-wide pressure relief study to ensure compliance with codes and standards.

The building, which contained the methyl mercaptan piping, was enclosed and inadequately ventilated. The building had two ventilation fans, which were not operating. Even though these fans were designated PSM critical equipment (meaning their failure could result in a high-consequence event), an urgent work order written the month prior had not been fulfilled. Even with both fans operating, preliminary calculations performed as part of the investigation determined the ventilation would still not have been adequate. The CSB has recommended an evaluation of the building design and ventilation system.

Although the designs for processes involving methyl isocyanate were updated after the Bhopal incident, the processes involving methyl mercaptan were not. The investigation has found that there was a general issue with control of hazards, specifically because non-routine operations were not considered as part of hazard analyses. The CSB has recommended conducting and implementing a “comprehensive, inherently safer design review” as well as developing an expedited schedule for other “robust, more detailed” process hazard analyses (PHAs).

Other recommendations may follow in the CSB’s final report, but these interim recommendations are expected to be implemented prior to the site’s restart, in order to ensure that workers are protected from future similar events.

To view an updated Cause Map of the event, including the CSB’s interim recommendations, click “Download PDF” above. Click here to view information on the CSB’s ongoing investigation.

Runway Fire Forces Evacuation of Airplane

By ThinkReliability Staff

On September 8, 2015, an airplane caught fire during take-off from an airport in Las Vegas, Nevada. The pilot was able to stop the plane, reportedly in just 9 seconds after becoming aware of the fire. The crew then evacuated the 157 passengers, 27 of whom received minor injuries as a result of the evacuation by slide. Although the National Transportation Safety Board (NTSB) investigation is ongoing, information that is known, as well as potential causes that are under consideration, can be diagrammed in a Cause Map, or visual root cause analysis.

The first step of Cause Mapping is to define the problem by completing a problem outline. The problem outline captures the background information (what, when and where) of the problem, as well as the impact to the goals. In this case, the safety goal is impacted due to the passenger injuries. The evacuation of the airplane impacts the customer service goal. The NTSB investigation impacts the regulatory goal. The schedule goal is impacted by a temporary delay of flights in the area, and the property goal is impacted by the significant damage to the plane. The rescue, response and investigation are an impact to the labor goal.

The Cause Map is built by beginning with one of the impacted goals and asking “Why” questions to develop the cause-and-effect relationships that led to an issue.   In this case, the injuries were due to evacuation by slide (primarily abrasions, though some sources also said there were some injuries from smoke inhalation). These injuries were caused by the evacuation of the airplane. The airplane was evacuated due to an extensive fire. Another cause leading to the evacuation was that take-off was aborted.

The fact that take-off was able to be aborted, for which the pilot has been hailed as a hero, is actually a positive cause. Had the take-off been unable to be aborted, the result would likely have been far worse. In the case of the Concorde accident, a piece of debris on the runway ruptured a tire, which caused damage to the fuel tank, leading to a fire after the point where take-off could be aborted. Instead, the aircraft stalled and crashed into a hotel, killing all onboard the craft and 4 in the hotel. The pilot’s ability to quickly save the plane almost certainly saved many lives.

The fire is thought to have been initiated by an explosion in the left engine due to a catastrophic uncontained failure of the high-pressure compressor. This assessment is based on the compressor fragments that were found on the runway. The failure likely resulted from a bird strike (as happened in the case of US Airways flight 1549), a strike from other debris on the runway (as occurred with the Concorde), or fatigue failure of the engine components due to age. This is the first uncontained failure of this type of engine, so some consider fatigue failure to be less likely. (An airworthiness directive reportedly issued after cracks were detected in compressor weld joints applied to engines with different parts and a different compressor configuration.)

In this incident, the fire was unable to be put out without assistance from responding firefighters. This is potentially due to an ongoing leak of fuel if fuel lines were ruptured and the failure of the airplane’s fire suppression system, which reportedly deployed but did not extinguish the fire. Both the fuel lines and fire suppression system were likely damaged when the engine exploded. The engine’s outer casing is not strong enough to contain an engine explosion by design, based on the weight and cost of providing that strength.

The NTSB investigation is examining airplane parts and the flight data and cockpit voice recorders in order to provide a full accounting of what happened in the incident. Once these results are known, it will be determined whether this is considered an anomaly or whether operators of all planes using a similar design and configuration need to take action to prevent a similar event from recurring.

To view the initial investigation information on a one-page downloadable PDF, please click “Download PDF” above.

 

Explosions raise concern over hazardous material storage

By ThinkReliability Staff

On August 12, 2015, a fire began at a storage warehouse in Tianjin, China. More than a thousand firefighters were sent in to fight the fire. About an hour after the firefighters went in, two huge explosions registered on the earthquake measurement scale (2.3 and 2.9, respectively). Follow-on explosions continued and at least 114 firefighters, workers and area residents have been reported dead so far, with 57 still missing (at this point, most are presumed dead).

Little is known for sure about what caused the initial fire and continuing explosions. What is known is that the fire, explosions and release of hazardous chemicals that were stored on site have caused significant impacts to the surrounding population and rescuers. These impacts can be used to develop cause-and-effect relationships to determine the causes that contributed to an event. It’s particularly important in an issue like this – where so many were adversely affected – to find effective solutions to reduce the risk of a similar incident recurring in the future.

Even with so much information unavailable, an initial root cause analysis can identify many issues that led to an adverse event. In this case, the cause of the initial fire is still unknown, but the site was licensed to handle calcium carbide, which releases flammable gases when exposed to water. If the chemical was present on site, the fire would have continued to spread when firefighters attempted to fight it using water. Contract firefighters, who are described as being young and inexperienced, have said that they weren’t adequately trained for the hazards they faced. Once the fire started, it likely ignited explosive chemicals, including the 800 tons of ammonium nitrate and 500 tons of potassium nitrate stored on site.

Damage to the site released those and other hazardous chemicals. More than 700 tons of sodium cyanide were reported to be stored at the site, though it was only permitted 10 tons at a time. Sodium cyanide is a particular problem for human safety. Says David Leggett, a chemical risk consultant, “Sodium cyanide is a very toxic chemical. It would take about a quarter of teaspoon to kill you. Another problem with sodium cyanide is that it can change into prussic acid, which is even more deadly.”

But cleaning up the mess is necessary, especially because there are residents living within 2,000 ft. of the site, despite regulations requiring that hazardous sites be a minimum of 3,200 ft. away from residential areas. Developers who built an apartment building within the exclusion zone say they were told the site stored only common goods. Rain could make the situation worse, both by spreading the chemicals and because of the potential that the released chemicals will react with water.

The military has taken over the response and cleanup. Major General Shi Luze, chief of the general staff of the military region, said, “After on-site inspection, we have found several hundred tons of cyanide material at two locations. If the blasts have ripped the barrels open, we neutralize it with hydrogen peroxide or other even better methods. If a large quantity is already mixed with other debris, which may be dangerous, we have built 1-meter-high walls around it to contain the material — in case of chemical reactions if it rains. If we find barrels that remain intact, we collect them and have police transport them to the owners.”

In addition to sending in a team of hazardous materials experts to neutralize and/or contain the chemicals and keeping the public out of the area to limit further impact to public safety, the state media had said they were trying to prevent rain from falling, presumably using the same strategies developed to ensure clear skies for the 2008 Summer Olympics. Whether it worked or not hasn’t been said, but it did rain on August 18, nearly a week after the blast, leaving white foam that residents have said creates a burning or itchy sensation on contact.

View an initial Cause Map of the incident by clicking on “Download PDF” above.

Legionnaires’ Disease Outbreak Blamed on Contaminated Cooling Towers

By ThinkReliability Staff

An outbreak of Legionnaires’ disease has affected at least 115 and killed 12 in the South Bronx area of New York City. While Legionnaires’, a respiratory disease caused by breathing in vaporized Legionella bacteria, has struck the New York City area before, the magnitude of the current outbreak is catching the area by surprise. (Because the vaporization is required, drinking water is safe, as is home air conditioning.) It’s also galvanizing a call for actions to better regulate the causes of the outbreak.

It’s important when dealing with an outbreak that affects public health to fully analyze an issue to determine all the causes that contributed to the problem. In the case of the current Legionnaires’ outbreak, our analysis will be performed in the form of a Cause Map, or visual root cause analysis. We begin by capturing the basic information (what, when and where) about the issue in a problem outline. Because the issue unfolded over months, we will reference the timeline (to view the analysis including the timeline, click on “Download PDF”) to describe when the incident occurred. Some important differences to note – people with underlying medical conditions and smokers are at a higher risk from Legionnaires’, and Legionella bacteria are resistant to chlorine. Infection results from breathing in contaminated mist, which has been determined to have come from South Bronx area cooling towers (which are part of the air conditioning and heating systems of some large buildings).

Next we capture the impact to the goals. The safety goal is impacted due to the 12 deaths, and 115 who have been infected. The customer service goal is impacted by the outbreak of Legionnaires’. The environmental and property goals are impacted because at least eleven cooling towers in the area have been found to be contaminated with Legionella. The issue is resulting in increased regulation, an impact to the regulatory goal, and testing and disinfection, which is being performed by at least 350 workers and is an impact to the labor goal.

The analysis begins by asking “why” questions from one of the impacted goals. In this case, the deaths resulted from an outbreak of Legionnaires’ disease. The outbreak results from exposure to mist from one of the contaminated cooling towers. The design of some cooling towers allows exposure to the mist produced. It is common for water sources to contain Legionella (which again, is resistant to chlorine) but certain conditions allow the bacteria to “take root”: the damp warm environment found in cooling towers and insufficient cleaning/disinfection. The cost of cleaning is believed to be an issue – studies have found that impoverished areas, like the one affected by this outbreak, are more prone to these types of outbreaks. Additionally, there are insufficient regulations regarding cooling towers. The city does not regularly inspect cooling towers. According to the mayor and the city’s deputy commissioner for disease control, there just hasn’t been enough evidence to indicate that cooling towers are a potential source of Legionnaires’ outbreaks.

Evidence would indicate otherwise, however. A study that researched risk factors for Legionnaires’ in New York City from 2002-2011 specifically indicated that proximity to cooling towers was an environmental risk. A 2010 hearing on indoor air quality discussed Legionella after a failed resolution in 2000 to reduce outbreaks at area hospitals. New York City is no stranger to Legionnaires’; the first outbreak occurred in 1977, just after Legionnaires’ was identified. There have been two previous outbreaks of Legionnaires’ this year. Had there been a look at other outbreaks, such as the 2012 outbreak in Quebec City, cooling towers would have been identified as a definite risk factor.

For now, though the outbreak appears to be waning (no new cases have been reported since August 3), the city is playing catch-up. Though they are requiring all cooling towers to be disinfected by August 20 and plan to increase inspections, right now there isn’t even a list of all the cooling towers in the city. Echoing the frustrations of many, Bill Pearson, member of the committee that wrote standards to address the risk of Legionella in cooling towers, says “Hindsight is 20-20, but it’s not a new disease. And it’s not like we haven’t known about the risk of cooling towers, and it’s not like people in New York haven’t died of Legionnaires’ before.”

Ruben Diaz Jr., Bronx borough president, brings up a good point for the cities that may have Legionella risks from cooling towers, “Why, instead of doing a good job responding, don’t we do a good job proactively inspecting?” Let’s hope this outbreak will be a call for others to learn from these tragic deaths, and take a proactive approach to protecting their citizens from Legionnaires’ disease.

Unintended Consequences, Serendipity, and Prawns

By ThinkReliability Staff

The Diama dam in Senegal was installed to create a freshwater reservoir. Unfortunately, that very dam also led to an outbreak of schistosomiasis. This was an unintended consequence: a negative result from something meant to be positive.   Schistosomiasis, which weakens the immune system and impairs the operation of organs, is transmitted by parasitic flatworms. These parasitic flatworms are hosted by snails. When the dam was installed, the snails’ main predators lost a migration route and died off. Keeping the saltwater out of the river allowed algae and plants that feed the snails to flourish. The five why analysis of the issue would go something like this: The safety goal is impacted. Why? Because of an outbreak of schistosomiasis. Why? Because of the increase in flatworms. Why? Because of the increase in snails. Why? Because of the lack of snail predators. Why? Because of the installation of the dam.

Clearly, there’s more to it. We can capture more details about this issue in a Cause Map, or visual form of root cause analysis. First, it’s important to capture the impact to the goals. In this case, the safety goal is impacted because of a serious risk to health and the environmental goal is impacted due to the spread of parasitic flatworms. The customer service goal (if we consider customers as all those who get water from the reservoir created by the dam) is impacted due to the outbreak of schistosomiasis.

Beginning with the safety goal, we can ask why questions. Instead of including just one effect, we include all effects to create a map of the cause-and-effect relationships. The serious risk to health is caused by the villagers suffering from schistosomiasis, which can cause serious health impacts. The villagers are infected with schistosomiasis and do not receive effective treatment. Not all those infected are receiving drugs due to cost and availability concerns. The drugs do not reverse the damage already done. And, most importantly, even those treated are quickly reinfected as they have little choice but to continue to use the contaminated water.

The outbreak of schistosomiasis is caused by the spread of parasitic flatworms, which carry the disease. The increase in flatworms is caused by the increased population of snails, which host the flatworms. The snail population increased after the installation of the dam killed off their predators and increased their food supply.
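The relationships described above can be written down as a small directed graph, which makes the difference between the linear five-why path and the fuller Cause Map easy to see. The following is a minimal sketch only: the dictionary layout, the cause labels and the function names `why_chain` and `all_causes` are assumptions made for this illustration and are not part of the Cause Mapping method itself.

```python
# Each effect maps to the causes named in the text above.
# The data layout and labels are hypothetical choices for this sketch.
cause_map = {
    "serious risk to health": ["villagers suffer from schistosomiasis"],
    "villagers suffer from schistosomiasis": [
        "villagers infected with schistosomiasis",
        "no effective treatment",
    ],
    "no effective treatment": [
        "drug cost and availability",
        "drugs do not reverse damage",
        "quick reinfection from contaminated water",
    ],
    "villagers infected with schistosomiasis": ["spread of parasitic flatworms"],
    "spread of parasitic flatworms": ["increased snail population"],
    "increased snail population": [
        "snail predators died off",
        "increased snail food supply",
    ],
    "snail predators died off": ["installation of the dam"],
    "increased snail food supply": ["installation of the dam"],
}


def why_chain(effect, graph):
    """Follow only the first listed cause at each step (a linear why path)."""
    chain = [effect]
    while effect in graph:
        effect = graph[effect][0]
        chain.append(effect)
    return chain


def all_causes(effect, graph):
    """Collect every distinct cause reachable from an effect (the full map)."""
    found = []
    stack = [effect]
    while stack:
        current = stack.pop()
        for cause in graph.get(current, []):
            if cause not in found:
                found.append(cause)
                stack.append(cause)
    return found


linear = why_chain("serious risk to health", cause_map)
full = all_causes("serious risk to health", cause_map)
print(len(linear) - 1, "causes on the linear path")
print(len(full), "causes in the full map")
```

The linear path ends at the dam but never passes through the treatment and reinfection causes, which is where several of the attempted solutions were aimed.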

Many solutions to this issue were attempted and found to be less than desirable. Administering medication for treatment on its own wasn’t very effective, because (as described above) the villagers kept getting reinfected. The use of molluscicide killed off other animals in the reservoir as well. Introducing crayfish to eat the snails was derided by environmentalists as they were considered an invasive species. But they were on the right track. Now, a team is studying the reintroduction of the prawns which ate the snails. During the pilot study, the rates of schistosomiasis decreased. In addition, the prawns will serve as a valuable food source. This win-win solution is an example of serendipity and should actually return money to the community. Says Michael Hsieh, the project’s principal investigator and an assistant professor of urology, “The broad potential of this project is validation of a sustainable economic solution that not only addresses a major neglected tropical disease, but also holds the promise of breaking the poverty cycle in affected communities.”

Introducing animals to get rid of other animals can be problematic, as Macquarie Island discovered when they introduced cats to eat their exploding rodent population (who ate the native seabirds). (Click here to read more about Macquarie Island.) Further research is planned to ensure the project will continue to be a success. To learn more about the project, click here. Or, click “Download PDF” to view an overview of the Cause Map.

Fingertips Amputated After Slip on Ice

By ThinkReliability Staff

Information on a slip that severely injured an electrical contractor in Newcastle in August 2013 was recently released by Great Britain’s Health and Safety Executive (HSE). Though this incident didn’t make the front pages of the newspaper, it is representative of many of the injury investigations which we facilitate using the Cause Mapping method.

The first step in the Cause Mapping method of root cause analysis is to capture the what, when and where of the incident and the impacts to the organizational goals. In this case, the what (contractor slip and hand injury), when (August 30, 2013) and where (a moving conveyor at a baguette manufacturer in Leeds) are captured, as well as any differences and the task being performed at the time of the incident. There were two notable differences during the incident as compared to an “average” day that should also be noted: the safety guard had been removed from the conveyor and ice had accumulated on the floor. These differences may or may not be causally related to the incident. Additionally, the task being performed (cleaning up after contract electrical work) is captured as it, too, may be causally related to the incident.

The impacts to the goals are analogous to what stood in the way of a perfect day. A serious injury involving the partial amputation of two fingers and the injury of a third is an impact to the safety goal in this example. The £8,500 fine levied by the HSE is an impact to the regulatory goal. The worker had four weeks off work due to the injury, which is an impact to the labor goal. It is unclear if any other goals were impacted by this incident.

Once at least one impact to the goals has been determined, asking “why” questions helps us complete the second step, or analysis. In the analysis, we capture cause-and-effect relationships that map out the issues that led to the incident. In this case, the injury was caused by the contractor’s hand striking an unprotected drive chain on a moving conveyor. This occurred because the hand struck the area, the drive chain was unprotected, and the conveyor was moving. All three of these causes had to occur for the resulting injury.
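The statement that all three causes had to occur describes an AND relationship, and that is what makes each cause a possible place for a solution. The sketch below illustrates the idea under stated assumptions: the set-based representation and the function name `injury_occurs` are hypothetical and chosen only for this example.

```python
# Hypothetical representation for this sketch: the effect occurs only if
# every one of its required causes is present (an AND relationship).
required_causes = {
    "hand struck the area",
    "drive chain was unprotected",
    "conveyor was moving",
}


def injury_occurs(present_causes):
    """Return True only when all required causes are present."""
    return required_causes <= set(present_causes)


# Removing any single cause is enough to prevent the injury in this model.
for removed in sorted(required_causes):
    remaining = required_causes - {removed}
    print(f"without '{removed}': injury occurs = {injury_occurs(remaining)}")
```

In this simplified model, replacing the guard, stopping the conveyor, or preventing the slip would each break the chain on its own.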

The contractor’s hand struck the area because of a slip on an icy floor. Ice from an open freezer door (which appeared to be malfunctioning) had built up and had not been removed.   The drive chain was unprotected because the safety guard had been removed from the conveyor, which was moving likely due to normal operations.

According to Shuna Rank, the HSE inspector, “This worker’s injuries should not and need not have happened. This incident was easily preventable had Country Style Foods Ltd ensured safety guards were in place on the machinery. The company should also have taken steps to prevent the accumulation of ice on the freezer floor. Guards and safety systems are there for a reason, and companies have a legal duty of care to ensure they are properly fitted and working effectively at all times. Slips and trips are the biggest cause of major injuries in the food and drink industry with 37% of all major accidents in the industry being as a result of slips.”

The inspector’s quote clearly identifies the areas for improvement that could reduce the risk of similar incidents occurring. Namely, the manufacturer must ensure that damage resulting in ice buildup is fixed as soon as possible and that in the meantime, ice is regularly cleared away and the area is marked as a slip hazard. If a safety guard is removed for any reason, the conveyor should not be operating until it has been replaced properly. Ensuring that equipment is in proper working order is essential to reduce the risk of worker injuries such as those in this case.

To view the Outline and Cause Map, please click “Download PDF” above. Or click here to read more.

Kitty Litter Cause of Radiological Leak?

By ThinkReliability Staff

The rupture of a container filled with nuclear waste from Department of Energy (DOE) sites that resulted in the  radiological contamination of 21 workers appears to have resulted from a heat-producing reaction, possibly between the nuclear waste and the kitty litter used to stabilize the waste.

DOE photo of damaged container

Yes, you read that correctly. The same stuff you use for Fluffy’s “business” is also used to stabilize nuclear waste.  However, the kitty litter typically used is clay.  One of the sites that provides waste to the Waste Isolation Pilot Plant, where the release occurred, changed from clay kitty litter to organic kitty litter, which is made of plant material.  Although the reaction that resulted in the container’s rupture has not yet been determined, it is possible that it was due to the change in litter.

We can look at this incident in a Cause Map, or visual root cause analysis, to lay out both the effects and causes.  In this case, the effects were significant.  Twenty-one workers were found to have internal radiological contamination, impacting the safety goal.  A radiological release off-site impacted the environmental goal.  The waste repository has been shut down and is not accepting shipments, impacting both the customer service and production goals.  The release requires the investigation of a formal Accident Investigation Board, impacting the regulatory and labor goals.  Lastly, the damage to the container is an impact to the property goal.

The release was caused by the rupture of a container that stored radiological waste, including americium and plutonium. The release was able to leave the underground storage facility due to a leak path in the ventilation system. The ventilation system was not designed for containment, because the safety analysis assumed that a release within the storage facility would result from a roof panel fall, which was considered adequately prevented.

The rupture appears to have resulted from a heat-producing reaction. The constituents of that reaction have not yet been determined, but the change from clay to organic kitty litter has been identified as a possible cause.  (A possible cause indicates a cause for which evidence is not yet available.)  More research is being done to determine the actual reaction.  This will also allow a determination of which other waste containers may be at risk for rupture.

A solution that has already been implemented is to seal the leaks in the ventilation system with foam to reduce the risk of leak-by.  Other solutions that have been suggested are to add an additional heavy-duty containment around the affected casks, reclassify the ventilation system as containment, and perform an independent review of the safety analysis of the site.  Once appropriate solutions are determined and implemented, it’s hoped the site will be able to reopen.

To view the Outline and Cause Map, please click “Download PDF” above.

1990 Cascading Long Distance Failure

By ThinkReliability Staff

On January 15, 1990, a cascading failure left tens of thousands of people in the Northeast US without long distance service for up to 9 hours.  This resulted in over 50 million calls being blocked at an estimated loss of $60 million.  (Remember, there weren’t really any other ways to quickly connect outside of the immediate area at the time.)

We can examine this historical incident in a Cause Map, or visual root cause analysis, to demonstrate what went wrong, and what was done to fix the problem.  First, we begin with the impact to the goals.  No impacts to the safety, environmental, or property goals were discussed in the resources I used, but it is possible they were impacted, so we’ll leave those as unknown.  The customer service and production goals were clearly impacted by the loss of service, which was considerable and estimated to cost $60 million, not including time for troubleshooting and repairs.

Asking “Why” questions allows development of the cause-and-effect relationships that led to the impacted goals.  In this case, the outage was due to a cascading switch failure: 114 switches kept crashing and rebooting over and over again.  A switch would crash upon receiving a message from one of its neighbor switches.  This message was meant to inform other switches that one switch was busy to ensure messages were routed elsewhere.  (A Process Map demonstrating how long distance calls were connected is included on the downloadable PDF.)  Unfortunately, instead of allowing the call to be redirected, the message caused the receiving switch to crash.  This occurred when an errant line in the coding of the process allowed optional tasks to overwrite crucial communication data.  The error was included in a software upgrade designed to increase throughput of messages.
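To see why a single faulty message handler can take down a whole network, the cascade can be sketched as a toy simulation. This is an illustration under loud assumptions, not a reconstruction of the actual switch software: the ring topology, the rule that a rebooted switch sends its own status message to both neighbors, and the names `simulate`, `buggy` and `max_events` are all hypothetical.

```python
from collections import deque


def simulate(num_switches, buggy, max_events=1000):
    """Count crashes in a toy ring of switches after one status message.

    Assumed rule for this sketch: a buggy switch crashes when it receives a
    neighbor's status message, reboots, and then sends its own status message
    to both neighbors. A correct switch reroutes calls and sends nothing.
    """
    crashes = 0
    # Switch 0 announces its status to its two ring neighbors.
    pending = deque([1 % num_switches, (num_switches - 1) % num_switches])
    events = 0
    # The event cap stops the run, because the buggy cascade never ends itself.
    while pending and events < max_events:
        receiver = pending.popleft()
        events += 1
        if buggy:
            crashes += 1
            pending.append((receiver + 1) % num_switches)
            pending.append((receiver - 1) % num_switches)
    return crashes


print("crashes with correct software:", simulate(114, buggy=False))
print("crashes with buggy software:", simulate(114, buggy=True))
```

In this model the correct software handles the two initial messages and stops, while the buggy version keeps generating new messages until the event cap is reached, which mirrors the crash-and-reboot loop described above.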

It’s not entirely clear how the error (one added line of code that would bring down a huge portion of the long distance network) was released.  The line appears to have been added after testing was complete, during a busy holiday season. That a line of code was added after testing seems to indicate that the release process wasn’t followed.

In this case, a solution needed to be found quickly. The upgraded software was pulled and replaced with the previous version.  Better testing was surely used in the future because a problem of this magnitude has rarely been seen.

To view the Outline, Cause Map and Process Map, please click “Download PDF” above.  Or click here to read more.

Hundreds Die When South Korean Ferry Capsizes

By ThinkReliability Staff

The nation of South Korea was devastated after a ferry capsized off Byungpoong on April 16, 2014.  While the ferry tipped over and sank quickly (within two hours), the evacuation orders came slowly (a half-hour after the first distress call).  The combination resulted in over 300 being trapped within the ship and killed.  The Captain and much of the crew were able to escape.

There are a multitude of causes involved in this tragedy, which can be captured within a Cause Map.  A Cause Map visually develops the cause-and-effect relationships that led to organizational goals that were impacted.

Clearly, the safety goal in this case was impacted, due to the large number of deaths (at the time of this blog, 226 bodies had been found and 73 people were still missing).  In addition, legal action is being taken against the captain and members of the crew responsible for navigation for negligence and failure to assist passengers. The Captain has also been arrested for “undertaking an excessive change of course without slowing down”.  The loss of the ship can be considered an impact to the property goal, and the massive rescue and recovery operations are an impact to the labor/time goal.

By asking “Why” questions, the cause-and-effect relationships are developed.  Most of the deaths resulted from passengers drowning when they were trapped in the ship as it capsized and sank.  The ferry capsized because of a sharp turn and stability issues.  The ship was turned too quickly at excess speed, possibly because the third mate in charge of navigation was inexperienced (this was her first time) and because of steering gear issues, which were reported two weeks prior to the accident and apparently not fixed.  The ship had been recently modified to add more passenger cabins, which made it top heavy.  As a result of the modifications, the recommended cargo weight was reduced.  At the time of the accident, the ship was carrying three times the recommended cargo weight.

Passengers became trapped in the ferry prior to the evacuation order, which was issued thirty minutes after the first distress call (and which it appears not all passengers were able to hear).  During this time, the ship had listed to a point that made it impossible to get out.  The Captain was concerned about the safety of his passengers in the water and appears to have called the parent company to request permission to evacuate.  Additionally, the ship’s life rafts could not be used.  Photos show crew members unable to release life rafts.  Only 2 of the 46 life rafts on the ship were successfully deployed.  Lastly, the crew provided insufficient assistance, abandoning ship without making necessary efforts to free the passengers.

This tragic incident has been compared to the Titanic (due to the insufficient number of lifeboats and people being unable to leave the ship), the Valdez oil spill (because an inexperienced third mate was performing navigation while the Captain was in his cabin), and the Costa Concordia (because the Captain left the ship without supervising the evacuation effort).  As long as lessons from other organizations (and even industries) are not understood by those performing similar work, these tragedies will continue to happen.

To view the Outline and Cause Map, please click “Download PDF” above.

Risks of Future Landslides – and Actual Past Landslides – Ignored

By ThinkReliability Staff

Risk is determined by both the probability of a given issue occurring, and the consequence (impact) if it does. In the case of the mudslide that struck Oso, Washington on March 22, 2014, both the probability and consequence were unacceptably high.

Not only had the probability of a landslide in the area been well documented in reports as far back as 1951, but the same area where dozens were killed on March 22 had experienced 5 prior landslides since 1949. The consequences of these prior landslides were less than those of the 2014 landslide, both because the 2014 landslide was more severe and because increased residential development meant more people were in harm’s way.
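The relationship between probability and consequence can be made concrete with a small calculation. The Python sketch below is illustrative only: it assumes risk is scored as probability multiplied by consequence (one common convention, not necessarily the one used in any of the reports on this area), and every number in it is hypothetical rather than taken from the incident. It shows that, with the probability of a landslide held constant, putting more homes in harm’s way raises the risk in proportion.

```python
def risk_score(probability: float, consequence: float) -> float:
    """Score risk as probability times consequence (an assumed convention)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if consequence < 0:
        raise ValueError("consequence must be non-negative")
    return probability * consequence


# Hypothetical values for illustration only; none come from the incident.
landslide_probability = 0.1   # assumed chance of a slide in a given period
homes_before = 5              # assumed number of homes at risk, earlier
homes_after = 50              # assumed number of homes at risk, later

risk_before = risk_score(landslide_probability, homes_before)
risk_after = risk_score(landslide_probability, homes_after)

# Same probability, more exposure: risk grows with the consequence.
print(f"Risk grew by a factor of {risk_after / risk_before:.1f}")
```

In this simple scoring, reducing either factor reduces the risk: restricting development lowers the consequence, while measures that stabilize the slope lower the probability.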

While the search for victims is still ongoing, the causes and impacts of the landslide are mostly known. This incident can be analyzed using a Cause Map, or visual root cause analysis, to show the cause-and-effect relationships that led to the tragic landslide.

First, we capture the background information and the impact to the goals in the problem outline, thereby defining the problem. The landslide (actually a reactivation of an existing landslide, according to Professor Dave Petley, in his blog) occurred around 10:40 a.m. on March 22, 2014 in an Oso, Washington residential area. As previously noted, there had been prior landslides in the area, and there were outdated boundaries used for logging permissions (which we’ll talk more about later). The safety goal was impacted due to the 30 known deaths and 15 people missing. (Not all of the known dead have been identified, so the known dead and missing numbers may overlap. However, at this point, there is little hope that any additional rescues will take place.) The environmental goal was impacted due to the landslide, and the customer service goal (insofar as the residents can be considered customers of their local area) was impacted due to the displacement of 30 families. Logging in an area that should have been protected impacts the regulatory goal. The estimated losses (of residences and belongings) are approximately $10 million, impacting the property goal, and the massive search and recovery effort impacts the labor goal.

Beginning with these impacted goals, asking “Why” questions allows us to develop cause-and-effect relationships showing how the incident occurred. The safety goal was impacted because of the deaths and missing people, which resulted from people being overcome by a landslide. In order for this to occur, the landslide had to occur, and the people had to be in the vicinity of the landslide.

As is known from history (see the timeline on the downloadable PDF), this area is prone to landslides. Previous reports identified the erosion of the area due to the proximity of the river as a cause of these landslides. An additional cause is water seepage in the area. Water seepage is increased when the water table rises from overly wet weather (as is typically found at the end of winter). Trees can help reduce water seepage by absorbing the water. When trees are removed, water seepage in an area can increase significantly. Because of this, removal of trees (for logging or other purposes) is generally restricted near areas prone to landslides. However, logging was permitted in what should have been a restricted area because the maps used to approve it were outdated, for reasons as yet unknown. Says the geologist who developed the new maps, “I suspect it just got lost in the shuffle somewhere.” Additionally, according to analysis by the Seattle Times, the logging went into the “old” restricted area as well. The State Forester is investigating the allegations and whether the logging played a role in the landslide.

Regardless of the magnitude of the impact of the logging and weather, the area was prone to landslides. Yet it was allowed to be developed, despite multiple reports warning of danger and five previous landslides. In fact, construction in the area resumed just three days after the last landslide in 2006. The 2006 landslide also interrupted a plan to divert the river farther from the landslide area. Despite all of this, the area was built up (with houses built as recently as 2009) and those residents were allowed to stay. (While buying out the residents was under consideration, it was apparently dismissed because the residents did not want to move.) While officials in the area maintain that they thought it was safe, a long history of reports and landslides suggests otherwise.

If a lack of knowledge of the risk of the area continues to be a concern, aerial scanning with advanced technology (lidar) could help. Use of lidar in nearby Seattle identified four times the number of landslide zones that were spotted with aerial surveying, which is more typically used.

To view a summary of the investigation, including a timeline, problem outline and Cause Map, please click “Download PDF” above.