Tag Archives: Cause Mapping

Trading Glitch Loses Goldman Sachs Millions

By Kim Smiley

A Goldman Sachs trading glitch on August 20, 2013 caused a large number of erroneous single stock and ETF options trades.  About 80 percent of the errant trades were cancelled, but the financial damage is still speculated to be as much as one hundred million dollars. The company also finds itself once again in the uncomfortable position of making headlines for negative reasons, which is never good for business.

The glitch occurred during an update to an internal computer system that is used to determine where to price options.  The update changed the software so that the system began inadvertently misinterpreting non-binding indications of interest as actual bids and offers.  The system acted on these bids and executed a large volume of trades at errant prices that were out of touch with actual market prices.

This issue can be built into a Cause Map, an intuitive method for performing a root cause analysis.  One of the advantages of a Cause Map is that it visually lays out all the causes and the cause-and-effect relationships between them. Seeing all the causes can broaden the solutions that are considered.

In this example, a Cause Map can help illustrate the fact that the software glitch itself isn’t the only thing worth focusing on.  The lack of an effective test program also contributed to the problem, and testing may be the easiest place to implement an effective solution.  If the problem had been caught in testing, the only cost would have been the time and effort needed to fix the software.  The importance of a robust test program for software is difficult to overstate.  If the software is vital to whatever your company’s mission is, develop a way to test it.
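For readers who think in code, the cause-and-effect relationships described above can be sketched as a small graph.  This is only an illustration: the box wording, the arrangement of the boxes and the helper name `all_causes` are assumptions made for this sketch, not the contents of the actual Cause Map in the PDF.

```python
# Hypothetical, simplified Cause Map: each effect maps to the causes behind it.
# Box wording and arrangement are assumptions based on the description above.
cause_map = {
    "Financial loss": ["Erroneous options trades executed"],
    "Erroneous options trades executed": [
        "Indications of interest treated as bids and offers",
    ],
    "Indications of interest treated as bids and offers": [
        "Software update changed system behavior",
        "Problem not caught in testing",
    ],
    "Problem not caught in testing": ["Lack of an effective test program"],
}


def all_causes(cause_map, effect):
    """Return every cause found by repeatedly asking 'why?' of an effect."""
    found = []
    queue = list(cause_map.get(effect, []))
    while queue:
        cause = queue.pop(0)
        if cause not in found:
            found.append(cause)
            # Ask "why?" again for the cause we just recorded.
            queue.extend(cause_map.get(cause, []))
    return found


for cause in all_causes(cause_map, "Financial loss"):
    print("-", cause)
```

Listing every cause, rather than stopping at the software update, is what brings the test program into view as a place to apply a solution.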

To view a high level Cause Map of this issue, click on “Download PDF” above.  Click here to read about the loss of the Mars Climate Orbiter, another excellent example of a software error with huge consequences.

Thousands Injured Each Year From Falling Televisions

By Kim Smiley

Nearly all parents know about the dangers of watching too much television, but a new study shows that too few are aware of the risk of injury from televisions.  The number of television injuries is higher than most would guess, with more than 17,000 children visiting emergency rooms for television-related injuries each year.  Falling televisions have also caused hundreds of deaths, with 29 killed in 2011 alone.  The rate of injuries associated with televisions is also increasing at an alarming pace, jumping 126% since 1990.

The majority of victims were young children under five.  The accidents seem to be a potentially deadly combination of their lack of situational awareness and unanchored televisions set on unsafe surfaces.  The study didn’t include why the televisions were in unsafe locations, but one theory is that many older televisions are moved into secondary locations that aren’t as safe as families acquire bigger, fancier televisions.  The older televisions may be on dressers or night stands that were never meant to hold televisions.  Children climb the furniture either attempting to turn on the television or retrieve something off the top and the television tumbles down on top of them.  Dressers with drawers are particularly dangerous because children may figure out how to use the drawers as steps and manage to climb much higher than anticipated.

The rapid rate of technological advances may also play a role since typical families are buying new televisions more frequently than in previous decades and the number of televisions in an average home has increased.  The changing design of televisions is also relevant.  New thinner televisions have significantly smaller bases, making them top heavy and more likely to topple over.  Many families are also buying bigger televisions, which can amplify the danger if they topple.

Experts have suggested a few potential solutions to this problem.  First and foremost, parents need to be made more aware of the issue, possibly through a public awareness campaign.  A campaign to distribute anchoring devices has been discussed as well as providing them with new televisions at purchase.  Another option may be to add stability requirements to new designs so that televisions are less likely to topple.  It is also recommended that parents never store remote controls or toys on top of a television because they may entice children into climbing to reach them.  Only time will tell which solutions, if any, are implemented, but this study is a first step in raising public awareness about this issue.

To view a Cause Map, or visual root cause analysis, of this issue, click on “Download PDF” above.  A Cause Map visually lays out the causes that contribute to a problem to show the cause-and-effect relationships and can help clarify a situation.  The possible solutions are included on the Cause Map.

The loss of the Steamship General Slocum, June 15, 1904

By ThinkReliability Staff

On June 15, 1904, a church group headed out for an excursion through New York City’s East River on the Steamship General Slocum.  Approximately half an hour after the ship left the pier, it caught fire.  Although the ship was only hundreds of yards from shore, the Captain continued to go full speed ahead in hopes of beaching at North Brother Island, where a hospital was located.  This served to fan the flames quickly over the entire highly flammable ship, killing many in the inferno.  Most of those who were not killed by the fire drowned due to the depth of the water and lack of safety equipment, even though the Captain did successfully beach the ship at North Brother Island.

To perform a root cause analysis of the General Slocum tragedy, we can use a Cause Map.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.  First we look at the impact to the goals.  On the General Slocum there were at least 1,021 fatalities of the passengers and crew that were aboard.  (However, only two of the crew were killed.)  Additionally, 180 were injured.  There were other goals that were affected but the loss of life makes any other goals insignificant.  The deaths and injuries are impacts to the safety goals.

Passengers drowned because they were in water over their heads with inadequate help or safety equipment.  Passengers were in the water either because they fell when the deck collapsed, or because they jumped into the water trying to avoid the fire.  The water was too deep to stand because only the bow was in shallow water and the passengers could not reach the bow.  This was due to a poor decision on the Master’s part (namely his decision to beach the ship at a severe angle, with the bow in towards the island, instead of parallel to the island, where passengers would have been able to wade to shore).  Note that the Master himself (and most of the crew) were on the bow side of the ship and were able to (and did) jump off and wade to shore.  The safety equipment, including life preservers, life boats, and life rafts, was mostly unusable due to inadequate upkeep and inadequate inspections.

Passengers (and two crewmembers) were also killed by fire.  Once the fire was started, it spread rapidly and was not put out.    The fire spread rapidly because the ship was highly flammable.  When this ship was constructed, there was no consideration of flammability.  Additionally, the current of air created by the vessel speeding ahead drove the fire across the ship.  The fact that an experienced Master would have allowed this situation was considered misconduct, negligence and inattention to duty – charges for which the Master was later convicted.   The fire was not put out because of inadequate crew effort and insufficient fire-fighting equipment.  The crew effort was inadequate because of a lack of training.  The fire-fighting equipment was insufficient because of inadequate upkeep and inadequate inspections.  (Possibly you are noticing a theme here?)

Although many people have not heard of the General Slocum tragedy, many of its lessons learned have been implemented to make ship travel safer today.  However, many of the solutions were not implemented widely enough or in time to prevent the Titanic disaster from occurring eight years later.  (Although there were nearly as many people killed on the General Slocum, it is believed that the Titanic disaster is better known because the passengers on Titanic were wealthy, as opposed to the working class passengers on General Slocum.  It is also surmised that sympathy for the largely German population aboard General Slocum was diminished as World War I began.)

In a macabre ending to a gruesome story, ships began replacing their outdated, decrepit life preservers after the investigation on General Slocum.  It was later found that the company selling these new life preservers had hidden iron bars within the buoyant material, in a dastardly attempt to raise their apparent weight.  Unfortunately there were no adequate laws (then) against selling defective life-saving equipment.

Deadly Plane Crash at San Francisco Airport

By Kim Smiley

On July 6, 2013, Asiana Airlines Flight 214 crashed while attempting to land at the San Francisco International Airport. Three people have died as a result of the crash and around 180 others were injured, 13 critically. The cause of the crash is currently under investigation, but there were no obvious mechanical issues and the weather was near perfect.

Even though the investigation is still in its infancy, an initial Cause Map can be built to document what is known now about the accident and it can easily be expanded later as more information becomes available. A Cause Map is a visual format for performing a root cause analysis that intuitively lays out the different causes for an accident. The first step in the Cause Mapping process is to fill in an Outline with the basic background information for an issue. On the bottom half of the Outline there is space to document how the problem impacts the overall goals. This is useful because it helps everyone involved in the process understand the big picture and the issues with the more significant impacts can be prioritized first.

There is also space on the Outline to list anything that was different or unusual at the time the problem occurred. It’s important to note any differences because they are usually worth exploring during an investigation because they may have played a role in the accident. In this specific example, this was the first time the pilots had worked together and the two main pilots were both in unfamiliar roles. The pilot landing the plane had limited experience with Boeing 777s even though he was an experienced pilot and this was his first time landing this type of aircraft at the San Francisco airport. There was another pilot instructing him, but it was his first flight as an instructor.
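As a rough illustration, the Outline described above can be thought of as a simple record.  The field names below (`what`, `when`, `where`, `differences`, `impacts`) are assumptions made for this sketch and are not the layout of the actual Outline template; the values come from the details reported in this post.

```python
from dataclasses import dataclass, field


@dataclass
class Outline:
    """Hypothetical stand-in for a Cause Mapping Outline."""
    what: str
    when: str
    where: str
    differences: list = field(default_factory=list)  # anything unusual at the time
    impacts: dict = field(default_factory=dict)      # goal name -> impact


outline = Outline(
    what="Asiana Airlines Flight 214 crashed while attempting to land",
    when="July 6, 2013",
    where="San Francisco International Airport",
    differences=[
        "First time the pilots had worked together",
        "Pilot landing the plane had limited experience with Boeing 777s",
        "Instructor pilot was on his first flight as an instructor",
    ],
    impacts={"safety": "3 deaths, around 180 injured"},
)

# Each recorded difference is a candidate line of inquiry for the investigation.
for difference in outline.differences:
    print("Explore:", difference)
```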

Once the Outline is completed, the next step is to ask “why” questions and add the answers to the Cause Map. In this example, we know that the airplane was coming in too low and too slow to land safely, but it isn’t known why that happened. The NTSB has initiated an investigation and the results will be reported when the analysis is complete. Some of the early speculation is that there may have been an equipment failure, mismanagement of automated systems or ineffective communication in the cockpit. The fact that this crew was different than the typical staffing has been a focus of investigators, but it isn’t known what role this may have played in the crash.

Another piece of this puzzle is that one of the passengers who died at the crash scene appears to have been killed when she was run over by a fire engine. She was covered in foam on the ground and the firefighters were unaware of her location. Emergency response procedures will need to be reviewed as part of the investigation into this accident to ensure that first responders can do their jobs in the safest way possible.

To view an initial Cause Map of this issue, click on “Download PDF” above.

 

A Potentially Stinging Situation – Jellyfish Blooms

By Kim Smiley

Jellyfish are some of nature’s most impressive survivors.  They have been around since long before the dinosaurs roamed the earth and continue to thrive.  In some cases, they may even be thriving a little too successfully.  Massive jellyfish blooms can flourish in the right environment and can decimate other species and cause significant damage.

Naturally occurring jellyfish blooms have been around for ages and while they may be inconvenient at times, they aren’t particularly alarming.  The real concern is that manmade conditions may lead to the growth of jellyfish blooms at times or regions that wouldn’t normally see them.  Large numbers of jellyfish can cause a number of serious issues.  Safety is a concern because jellyfish stings are painful and can even be deadly.  Regions that depend on tourism can also be impacted because travelers may avoid areas with large numbers of jellyfish.  Jellyfish have caused damage to ships and buildings when they clog intake lines.  Populations of other species have also been decimated in some areas by jellyfish blooms which can affect commercial fishing operations.

What causes these jellyfish blooms can be explored by building a Cause Map or visual root cause analysis.  A Cause Map intuitively lays out causes that contribute to an issue and shows the cause-and-effect relationships between them.  In this example, the jellyfish blooms grow because jellyfish are well suited for life in low oxygen “dead zones” that are being created in the ocean.

It all starts with fertilizer containing nutrients running into the ocean.  An algae bloom forms as algae feed on the nutrients.  Eventually the nutrients are depleted and the algae die off, leading to the growth of a bacterial bloom as bacteria feed on the dead algae.  The bacterial bloom depletes the oxygen, making the region unsuitable for most species.  However, the opportunistic jellyfish can survive and even thrive in low oxygen levels.  Jellyfish are able to grow and reproduce quickly, so the population surges upward in an environment with few predators and little competition.

A few facts so that the reproductive abilities of jellyfish can be fully appreciated: a single female jellyfish can release tens of thousands of eggs per day, and jellyfish are able to double their weight in a single day if food is abundant.

Eating habits of jellyfish also make it very difficult for other species to move back into the region even if oxygen levels increase.  Jellyfish not only compete for the same food (plankton) as the larvae of other species, they are also fond of eating larvae and eggs.  It’s difficult to compete with a species that is both a predator and competitor.

Before anyone has nightmares of huge jellyfish causing wide scale destruction, I should note that researchers have not found evidence that jellyfish are in danger of overrunning the oceans.  But many scientists do believe that human activities have contributed to jellyfish blooms growing in localized areas.  It’s always worth trying to understand how human activities are impacting our environment, especially when a species so well equipped for survival is involved.

To view a high level Cause Map of this issue, click on “Download PDF” above.

Chemical Plant Explosion Kills 2 and Injures Dozens in LA

By Kim Smiley

On June 13, 2013, an explosion at a chemical plant in Louisiana killed two and injured more than seventy others.  The cause of the explosion is still unknown, but the federal Occupational Safety and Health Administration and the U.S. Chemical Safety Board are investigating the accident.

Even though the investigation is still ongoing, an initial Cause Map, or visual root cause analysis, can be built for this issue.  The initial Cause Map can document what is known at this point and can easily be expanded to incorporate more details as they become available.  The first step in the Cause Mapping process is to fill in an Outline with the basic background information for the accident (such as the location, time and date) as well as document what overall goals were impacted by the incident.

In this case, the safety goal was obviously impacted because of the fatalities and injuries.  The damage to the plant is an impact to the material goal and the time the plant is shut down is an impact to the schedule goal.  Once the Outline is complete, including the impacts to the goals, the Cause Map is built by asking “why” questions.  For example, we would ask “why” people were killed and injured and would add that there was an explosion at the chemical plant to the Cause Map.

What caused the explosion isn’t known, but every explosion requires oxygen, a spark and fuel, so these basic facts can be added to the Cause Map.  The plant housed a large amount of flammable material because it manufactures polymer grade propylene, which is used to make plastics.  If investigators are able to determine what created the spark, that information could be added as well as any other relevant information that comes to light.
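The three requirements form an AND relationship: all of them must be present for the explosion to occur.  A minimal sketch of how the known and unknown pieces might be tracked follows.  The dictionary layout and function names are assumptions for illustration, and the oxygen entry is itself an assumption, since the source of oxygen is not stated here.

```python
# Causes that must ALL be present for the explosion (an AND relationship).
# Evidence strings paraphrase this post; None means not yet known.
explosion_causes = {
    "fuel": "Plant housed flammable material (polymer grade propylene production)",
    "oxygen": "Air (assumed; the source is not stated)",
    "spark": None,  # still under investigation
}


def explosion_explained(causes):
    """True only when every required cause has supporting evidence."""
    return all(evidence is not None for evidence in causes.values())


def open_questions(causes):
    """List the required causes that still lack evidence."""
    return [name for name, evidence in causes.items() if evidence is None]


print("Fully explained:", explosion_explained(explosion_causes))
print("Still to investigate:", open_questions(explosion_causes))
```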

The Outline also has space to document anything that is different or unusual at the time of the accident.  Anything unusual about the situation when the accident occurred is often a good starting point in an investigation because it may have played a role in the accident.  In this example, the plant was being expanded at the time of the accident and there were many contract workers on site.  If this is found to have played a role in the accident, this information would be incorporated onto the Cause Map as well as the Outline.

The final step of the Cause Mapping process is to use the Cause Map to develop solutions that can be implemented to help prevent a similar problem from occurring in the future.  Once a final Cause Map is built that incorporates all the findings from the investigation, it will be helpful in understanding any lessons to be learned and discussing potential solutions.

To view a high level Cause Map and an Outline for this accident, click on “Download PDF” above.

Failure of the Teton Dam in 1976

by ThinkReliability Staff

On June 5, 1976, workers were called to the Teton Dam on the Teton River in Idaho to attempt to repair a leak.  Workers in bulldozers narrowly avoided being sucked into the dam with their equipment, and watched helplessly as the dam was breached.  The failure would kill 14 people and cause hundreds of millions of dollars in property and environmental damage.  To examine what went wrong, we can perform a visual root cause analysis, or Cause Map.

The Cause Mapping process begins by determining the impacts to an organization’s goals.  From the perspective of the government, specifically the Bureau of Reclamation, the safety goal is impacted because of the 14 deaths.  The environmental goal is impacted due to the severe impact the dam failure and subsequent flooding had on the ecosystem of the area.  The customer service goal was impacted due to the evacuation of three towns.  The production goal is impacted due to the abandonment of the dam – at a cost of approximately $50 million.  Additionally, property damage of at least $400 million (some estimates are much higher) is an impact to the property goal.  (There were also substantial claims related to the loss of property and livelihood from impacts to industries, particularly fishing.)

Once we have determined the impacts to the goals, we begin with an impacted goal, such as the safety goal, and ask “Why” questions to determine the cause-and-effect relationships that led to the impacted goals (also known as “problems”.)  In the case of the Teton Dam failure, people were killed due to a massive wave of water released from the dam (which was filled to capacity) when it failed.  The dam failure was also the cause of severe damage to the dam, which was never rebuilt, leading to the impacted production goal.

The failure of the dam was found to be caused by erosion and inadequate strength.  Due to the less than ideal geological conditions of the site (which was picked because there were no “good” sites available), unequal stress distribution and inadequate fill material (which was taken from the site) led to reduced strength.  Three factors led to the erosion: susceptible materials; seepage from leaks in the embankment, caused by joints that were not resistant to water pressure due to inadequate testing; and inadequate protection from water, due to an over-reliance on an ineffective curtain intended to stop flow.

Many geologists had predicted problems with the dam before it was built.  Specifically, in his book “Normal Accidents”, Charles Perrow states “The Bureau ignored its own data that rocks in the area were full of fissures, and in addition they filled the dam too fast . . . All it takes to bring a dam down is one crack, if that crack wets the soil within the interior portions of the dam, turning it into a quagmire.”

Although tragic and expensive, the failure of the Teton Dam did lead to many reforms in the Bureau of Reclamation, which is responsible for dam safety.  Detailed geological studies performed in order to determine the causes of the dam failure also provided additional insight into the strength provided by various types of earth, erosion and seepage.

Bridge Collapse In Washington Dumps Cars in River

By Kim Smiley

On May 23, 2013, a section of a four lane bridge over the Skagit River near Mount Vernon, Washington unexpectedly collapsed, sending two cars into the river.  No one was killed, but the bridge failure is going to take months and an estimated $15 million to repair.  Additionally, the bridge was one of Washington’s main arteries to Canada, with around 70,000 vehicles crossing it a day, and detours during the repairs are significantly impacting the region.

So what caused the bridge to fail and how can a similar collapse be prevented in the future? This issue can be analyzed by building a Cause Map, a visual root cause analysis.  A Cause Map intuitively shows the causes that contributed to an issue and the cause-and-effect relationships between them. The collapse occurred after the top of an oversized truck hit a steel girder.  The bridge was a ‘fracture critical’ design, meaning that the design had little redundancy and fracture of one critical component, in this case the overhead steel girder, caused the whole bridge to collapse.  This type of design was common when the bridge was built in the 1950s because it was relatively quick and cheap to build.  Newer designs typically incorporate more redundancy to prevent a single failure from causing significant damage, but the average bridge in the United States is 42 years old and there are thousands of fracture critical bridges across the nation.

So why did the truck impact the bridge?  This question is more complicated than it might appear on the surface.  The driver appears to have done his due diligence, but he had no warning that his truck was taller than the clearance.  The driver had a permit for hauling an oversized load on this stretch of highway.  The truck was also following a guide who gave no indication of potential clearance issues.  Additionally, there was no sign about low overhead clearance on the bridge because signage wasn’t required.  Signs are only required for overcrossings lower than 14 feet, and the lowest point on the bridge was higher than that.

The truck was traveling in the outside lane at the time it impacted the bridge.  The clearance over the outside lane of the bridge is lower than the inside lane because of the arch design of the bridge.  The truck’s load was 15 feet 9 inches high and the lowest clearance over the outside lane was 14 feet 7 inches, but the inside lane has about 17 feet of clearance.  Bottom line, if the truck had simply moved into the inside lane it should have had the clearance to safely cross over the bridge.
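The clearance arithmetic is easy to check.  The short sketch below converts the figures quoted above to inches and compares the load height against each lane; the helper names are made up for this example, and the inside lane figure is the approximate 17 feet given above.

```python
def to_inches(feet, inches=0):
    """Convert a feet-and-inches measurement to inches."""
    return feet * 12 + inches


load_height = to_inches(15, 9)        # truck's load: 15 ft 9 in
outside_clearance = to_inches(14, 7)  # lowest clearance, outside lane
inside_clearance = to_inches(17)      # approximate clearance, inside lane


def margin(clearance, load):
    """Positive means the load fits; negative means it strikes the bridge."""
    return clearance - load


print("Outside lane margin (inches):", margin(outside_clearance, load_height))
print("Inside lane margin (inches):", margin(inside_clearance, load_height))
```

In the outside lane the load was 14 inches too tall, while the inside lane would have left roughly 15 inches to spare.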

This incident is certainly a warning about the need for redundancy in designs, but it also illustrates the need for clear communication.  If the driver had been aware that there was a potential issue, he could have changed lanes (which is a free and relatively easy solution) and the bridge collapse wouldn’t have happened.  Something needs to be changed to ensure that drivers are aware of any potential clearance issues.  In an ideal world, all bridges would be the safest, most up to date designs available, but the reality is that there are thousands of “fracture critical” bridges in use throughout the United States and we’re going to have to find ways to use them as safely as possible for quite some time.

Click here to see a Cause Map of another bridge failure, the 2007 I-35 Bridge Collapse and here to see a Cause Map of the failure of the Tacoma Narrows Bridge.

Update: Cause of Death of Schoolchildren from Tornado in Moore, Oklahoma Not Drowning

by ThinkReliability Staff

Although they are sometimes treated as static objects, Cause Maps (and any root cause analysis) can – and should – change based on updated or corrected information.  A frequent question we get asked is “What if I make a mistake on my Cause Map?”  Well, you fix it.  Let me show you how.

First, a little background on my error.  Last week, I thought it would be important and useful to demonstrate what had happened in the aftermath of Moore, Oklahoma, after a category 5 tornado hit much of the town, including an elementary school.  (See the previous blog.)  Because there are certain expectations for public safety at an elementary school, I decided to focus the analysis on the children who died at the elementary school and the causes that led to their deaths, as well as information on the potential and implemented solutions to reduce that risk.

I researched how specifically the children had died – an unfortunate necessity to ensure that the solutions are working towards the correct causes – and discovered a statement from the Lt. Gov. of Oklahoma the morning after the tornado saying that the children who died had drowned in the basement due to a burst water main.

As you can imagine, sometimes information that is relayed in the immediate scene of a disaster is not entirely accurate.  In this case, the information that the children had drowned was incorrect.  Rather, the children who died were in a classroom and died from blunt force trauma and asphyxiation (suffocation) due to being struck or covered by debris from the tornado.

Once we have verified that our initial cause-and-effect relationship is incorrect, we can correct the Cause Map.  Rather than just erasing the “wrong” causes and adding in the new causes, we suggest crossing off the causes that have been disproved with evidence.  (Click on “Download PDF” above to see an example of a corrected Cause Map.)  This way anyone who may have seen an earlier version of the Cause Map, or heard the same initial erroneous information that was used to make it, will have a clear version of what did happen, including the evidence that verifies the correct information.
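The idea of crossing off a cause instead of erasing it can be sketched in a few lines.  The `Cause` class and its fields are hypothetical and invented for this illustration; the point is simply that the disproved cause stays visible, along with the evidence that disproved it.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    """Hypothetical Cause Map box that can be crossed off but never deleted."""
    text: str
    evidence: list = field(default_factory=list)
    disproved: bool = False

    def cross_off(self, reason):
        """Mark the cause as disproved and record the evidence against it."""
        self.disproved = True
        self.evidence.append(reason)

    def render(self):
        mark = "[X]" if self.disproved else "[ ]"
        notes = "; ".join(self.evidence) or "no evidence recorded"
        return f"{mark} {self.text} ({notes})"


drowning = Cause(
    "Children drowned in basement",
    ["Statement from Lt. Gov. the morning after the tornado"],
)
debris = Cause(
    "Children struck or covered by debris",
    ["Deaths attributed to blunt force trauma and asphyxiation"],
)

# Correct the map: the initial cause is crossed off, not erased.
drowning.cross_off("Corrected information: children who died were in a classroom")

for cause in (drowning, debris):
    print(cause.render())
```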

Obviously the fact that the children died is tragic, so some may wonder what difference it makes exactly how they died.  Generally people who are killed in tornadoes are killed by objects striking them.  This is why tornado survival drills focus on getting to spots where there is the least possible dangerous debris, or the least risk of the debris becoming dangerous flying objects. Windowless rooms are recommended, because glass can be broken and easily turn into shrapnel.  Basements are recommended because the strong winds associated with tornadoes have less access to underground areas.  Bathrooms are another option because most everything in a bathroom is secured to the walls and/or floors.  In a pinch, people seek protection under heavy pieces of furniture.  (Survivors from the affected school have said that they hid under their desks and held on for dear life.)

Because the basement is a recommended sheltering location, the possibility of drowning from equipment that may be damaged by a tornado meant that the basement needed to be reconsidered as a sheltering location.  Because the school did not have a designated safe room, during the 16-minute warning teachers got their students to anywhere they could, including, in many cases, under their own bodies for protection.  (Again, based on the extreme damage to the school the death toll, while tragic, demonstrates the remarkably quick and effective action taken by the teachers.  I can’t emphasize this enough.)  Because this protection was very likely causally related to the death toll (in that without the amazing response from the teachers the death toll may have been much higher), I added additional evidence to the cause of injury.

Be aware that changing the causes may impact the recommended solutions.  The solutions discussed in the previous blog are still valid, especially the recommendations for inclusion of storm shelters for schools in the area.  An additional clarification added in the update is that this has been required since 1999 (after this school was built).  All the schools being rebuilt as a result of the tornado damage will have storm shelters, as will schools built in the future.  Individual communities will still be faced with the choice of which buildings will and will not be required to have storm shelters, and any incentives that will be put into place to encourage their construction.

To view the Outline and Cause Map, please click “Download PDF” above.  Or click here to read more.

Children Killed When School Hit by Category 5 Tornado

by ThinkReliability Staff

A category 5 (the most destructive) tornado hit Moore, Oklahoma on May 20th, destroying the town and killing 24.  Of those killed, 7 were elementary school children, who drowned when water mains burst in the basement where they were sheltered.

Examining this tragedy can help provide lessons to reduce the risk of this issue happening again.  We can analyze the tornado impact at the most severely impacted elementary school in a Cause Map, in order to visually diagram the cause-and-effect relationships that led to the tragic deaths.

First, we determine the impacted goals.  In this case, all other goals are overshadowed by the deaths of seven elementary students, and injuries to dozens.  In addition, the school was completely devastated (demonstrating the unbelievable destructive power of the tornado), resulting in early school closure and intense rescue, recovery and cleanup.

To perform our root cause analysis, we begin with the safety goal and ask “Why” questions.  The deaths in this case are reportedly due to drowning, which occurred when children in the basement (a recommended sheltering location in the case of tornadoes) drowned due to water from bursting water mains.  The specific failure mechanism is not known (and may never be due to the extreme levels of damage) but is likely related to the direct strike of the tornado, which is common in the area (close to the center of tornado alley).

Students who were injured by crushing and asphyxia were in the hallways and bathrooms of the school.  (These are recommended sheltering locations for buildings that don’t have basements.)   It is remarkable that, despite the complete annihilation of the school, students who were sheltered in hallways and bathrooms all survived, thanks in many cases to teachers protecting them with their own bodies.  A 16-minute warning from the National Weather Service combined with carefully rehearsed crisis plans that were put into action, allowed the best possible protection for students in a school without a safe room or storm shelter.

This storm has reignited the discussion about expectations for safety shelters in public places that are prone to natural disasters.  The devastating loss at the school has also raised the safety issue of ensuring that the locations used for shelter are cleared of other potential hazards, such as water mains and fire risks.  Because of the relatively short warning time (16 minutes in this case, which is above average) before a tornado strikes, emphasis on tornado drills and safety plans should continue.

To view the Outline and Cause Map, please click “Download PDF” above.  Or click here to read more.