Tag Archives: root cause analysis

Root Cause Analysis - Incident Investigation

16-Day Government Shutdown Affects Economy

November 1, 2013 Holly Maher

On October 1, 2013 at 12:01 AM, the beginning of the 2014 fiscal year, the federal government shut down all non-essential operations when Congress could not pass a continuing resolution to allow spending at current levels. The government shutdown lasted 16 days and, in addition to other impacts, closed the National Parks system (see our blog about the park closures), furloughed 800,000 federal employees, had the potential to impact payment of veterans’ benefits and negatively impacted the economy, both directly and indirectly.

So what caused the government shutdown? If you watched any TV during that 16-day period, you could certainly hear any number of experts (on both sides) explaining who was to blame. In keeping with the Cause Mapping methodology, this analysis of the government shutdown is not trying to identify the one person, the one group or the one reason to blame for the shutdown. Instead, we will identify all the causes required to produce this effect. This will allow us to identify many possible solutions for preventing it from happening again. We start by asking “why” questions and documenting the answers to visually lay out all the causes that contributed to the shutdown. The cause-and-effect relationships are laid out from left to right.

In this example, the government shutdown occurred because Congress could not pass a continuing resolution bill, and a continuing resolution was required to keep the government operating. The bill could not be passed because a line item defunding the Affordable Care Act (ACA) had been added to it and could not be agreed upon. A continuing resolution was required because the Constitution gives the power to spend money to Congress, and since Congress had not passed a budget for fiscal year 2014, a continuing resolution was constitutionally required to continue operating the government after October 1. Defunding the ACA was added to the continuing resolution bill because the ACA was about to go into effect and because provisions can be added to the bill on a line item basis. Congress was unable to compromise to reach an agreement to pass the continuing resolution.
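
The “why” questions above can also be captured as a small data structure. The following is only a minimal sketch, not part of the Cause Mapping methodology itself: the `CauseMap` class, its method names and the short cause labels are our own illustrative assumptions, and the entries simply restate the causes described in the paragraph above.

```python
from collections import defaultdict, deque


class CauseMap:
    """A tiny cause-and-effect graph: each effect maps to the causes behind it.

    Hypothetical helper for illustration only; not an official Cause Mapping tool.
    """

    def __init__(self):
        self.causes_of = defaultdict(list)

    def add(self, effect, *causes):
        """Record that `effect` happened because of every cause listed."""
        self.causes_of[effect].extend(causes)

    def why(self, effect):
        """Answer one "why" question: the direct causes of `effect`."""
        return list(self.causes_of.get(effect, []))

    def all_causes(self, effect):
        """Follow every chain of "why" answers and return each cause once."""
        found = []
        queue = deque([effect])
        while queue:
            current = queue.popleft()
            for cause in self.why(current):
                if cause not in found:
                    found.append(cause)
                    queue.append(cause)
        return found


# Causes taken from the paragraph above; the wording of each label is ours.
shutdown = CauseMap()
shutdown.add("Government shutdown",
             "Continuing resolution not passed",
             "Continuing resolution required")
shutdown.add("Continuing resolution not passed",
             "ACA defunding line item added",
             "Congress unable to compromise")
shutdown.add("Continuing resolution required",
             "Constitution gives spending power to Congress",
             "No FY2014 budget passed")
shutdown.add("ACA defunding line item added",
             "ACA about to go into effect",
             "Line items can be added to the bill")

for cause in shutdown.all_causes("Government shutdown"):
    print(cause)
```

Listing every cause, rather than stopping at the first answer, is what keeps the analysis from collapsing into a single “root cause” to blame.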

So why was Congress unable to reach an agreement? If the incentive to compromise had been greater than the incentive not to compromise, Congress would have compromised. So why is the incentive to compromise ineffective? One reason is that Congress’s pay is not affected when the government shuts down. Another reason is that there is significant incentive to maintain a position aligned with the party (either left or right). The desire to get re-elected (the number of terms is unlimited in Congress), the need for support in the primaries to get re-elected (based on the current primary system), and the need for campaign financing are all causes that support the incentive to maintain alignment with the party versus compromise.

Once all the causes of the government shutdown have been identified, possible solutions to prevent the shutdown from happening again can be brainstormed. One possible solution would be to legally require a continuing resolution to be a “clean” bill, with no additional line items. This would make it less likely that future debates over current, hot button items, such as the ACA, would result in a failure to pass the continuing resolution and therefore a government shutdown. Another possible solution would be to stop pay for Congress during the government shutdown. Other more global, systemic solutions might be to implement term limits in Congress or provide government campaign financing to reduce the dependency on party financial support.

To view the Outline and Cause Map, please click “Download PDF” above.

“Ghost Train” Causes Head-On Collision in Chicago

October 24, 2013 Kim Smiley

On September 30, 2013, an unoccupied train collided head-on with another train in Chicago, sending 30 people to the hospital. In a nod to the season and the bizarre circumstances of the accident, the unoccupied train has been colorfully dubbed “the ghost train”.

So what caused the “ghost train” and how did it end up causing a dangerous train collision? Investigators from the National Transportation Safety Board (NTSB) are still reviewing the details of the accident, but some information is available. An initial Cause Map, or visual root cause analysis, can be built to capture what is already known and can be expanded to incorporate more information as the investigation progresses. A Cause Map is built by asking “why” questions and documenting the answers to visually lay out all the causes that contributed to an accident to show the cause-and-effect relationships from left to right.

In this example, the trains collided because an unoccupied train began moving and the safety systems in place did not stop the train. Investigators still haven’t determined exactly what caused the train cars to move, but a key piece of the puzzle is that there was still power to the cars while they were being stored in a repair terminal awaiting maintenance. The NTSB believes that it was common practice to leave the cars powered so that the lights could be used to illuminate the terminal. Workers used the lights to discourage graffiti and vandalism because the terminal was located in a high crime neighborhood.

Investigators will need to not only determine why the train started rolling, but also learn more about why the safety systems didn’t prevent the accident. Before colliding with another train, the unoccupied train traveled through five mechanical train-stop mechanisms, each of which should have stopped a train without a driver. Emergency brakes were applied at each train-stop, which caused the train to pause momentarily, but the setting on the master lever then caused the train to start moving again. Review of the safety systems will need to be part of the investigation to ensure that adequate protection is in place to prevent anything similar from occurring again.

The NTSB investigation is still ongoing, but the NTSB has stated that de-energizing propulsion power and using an alternate brake setting could help prevent unintended movement of unoccupied train cars. Additionally, the NTSB believes the use of a wheel chock and/or derail would ensure that a train stopped by a mechanical train-stop mechanism remains stopped. Based on the information already uncovered, the NTSB has issued an urgent safety recommendation to the Federal Transit Administration (FTA). The NTSB recommended that the FTA issue a safety advisory to all rail transit properties to review procedures for storing unoccupied train cars to ensure that they are left in a safe condition that won’t allow unintended movement and that there are redundant means of stopping any unintended movement. More information is needed to fully understand this accident, but these precautions would be effective solutions that can be quickly implemented to reduce the risk of train accidents.

The Salvage Process of Costa Concordia

October 3, 2013 ThinkReliability Staff

On September 16, 2013, the fatally stricken Costa Concordia was lifted upright (known as “parbuckling”) in a salvage operation that was the most expensive ever undertaken and involved the largest ship ever salvaged. The ship ran aground off the coast of Italy on January 13, 2012 (see our previous blog about the causes of the ship running aground) and had been lying on its side for the 20 months since.

The ship grounding had immediate, catastrophic impacts, including the death of 32 people. However, it also had longer term impacts, mainly pollution from the fuel, sewage and other hazardous materials stored aboard the ship. It was determined that the best way to minimize the leakage from the ship would be to return it upright and tow it to port, where the onboard waste could be emptied and disposed of, then the ship broken up for scrap.

Because a salvage operation of this magnitude (due to the size and location of the ship) had never been attempted, careful planning was necessary. Processes like this salvage operation can be described in a Process Map, which visually diagrams the steps that need to be taken for a process to be completed successfully. A Process Map differs from a Cause Map, which visually diagrams cause-and-effect relationships to show the causes that led to the impacts (such as the deaths and pollution). Whereas a Cause Map reads backwards in time (the impacted goals result from the causes, which generally must precede those impacts), a Process Map reads from left to right along with time. (Step 1 is to the left of, and must be performed before, Step 2.) In both cases, arrows indicate the direction of time.

Like Cause Maps, Process Maps can be built in varying levels of detail. In a complex process, many individual steps will consist of more detailed steps. Both a high level overview of a process and a more detailed breakdown can be useful when developing a process. Process Maps can be used as part of the analysis step of an incident investigation, to show which steps in a process did not go well, or as part of the solutions, to show how a process developed as a solution should be implemented.

In the example of the salvaging of the Costa Concordia, we use the Process Map for the latter. The salvaging process is part of the solutions – how to remove the ship while minimizing further damage and pollution. This task was not easy – uprighting the ship (only the first step in the salvage process) took 19 hours, involved 500 crewmembers from 26 countries and cost nearly $800 million. Other options used for similar situations included blowing up the ship or taking it apart on-site. Because of the hazardous substances onboard – and the belief that two bodies are still trapped under or inside the ship – these options were considered unacceptable.

Instead, a detailed plan was developed: prepare for leakage with oil booms that held sponges and skirts, then install an underwater platform and 12 turrets to aid in the parbuckling and hold the ship upright. The ship was winched upright using 36 cables and is being held steady on the platform with computer-controlled chains until spring, when the ship will be floated off the platform and delivered to Sicily to be taken apart.
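
The idea of a Process Map with varying levels of detail can be sketched in code as steps that may contain more detailed steps. This is only an illustration under stated assumptions: the `Step` class and `outline` function are hypothetical names of our own, and grouping the boom, platform and winching work under the uprighting step is our reading of the plan, not the salvage team’s actual breakdown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One step in a process; it may be broken down into more detailed steps."""
    name: str
    substeps: List["Step"] = field(default_factory=list)


def outline(steps, max_depth):
    """Return step names in time order, expanded down to `max_depth` levels."""
    lines = []

    def visit(step, depth):
        lines.append("  " * depth + step.name)
        if depth + 1 < max_depth:
            for sub in step.substeps:
                visit(sub, depth + 1)

    for step in steps:
        visit(step, 0)
    return lines


# Steps restated from the paragraph above; the grouping is an assumption.
salvage = [
    Step("Upright the ship (parbuckling)", [
        Step("Deploy oil booms with sponges and skirts"),
        Step("Install underwater platform and 12 turrets"),
        Step("Winch the ship upright with 36 cables"),
    ]),
    Step("Hold the ship steady on the platform until spring"),
    Step("Float the ship off the platform"),
    Step("Deliver the ship to Sicily"),
    Step("Take the ship apart"),
]

print("\n".join(outline(salvage, max_depth=1)))  # high level overview
print("\n".join(outline(salvage, max_depth=2)))  # more detailed breakdown
```

The same list serves both views: a depth of 1 gives the overview, and a depth of 2 expands the first step into its detailed steps.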

To view the Process Map in varying levels of detail, please click “Download PDF” above. Or, see the Cause Map about the grounding of the ship in our previous blog.

Sea Life Devastated by Molasses Spill

September 26, 2013 Kim Smiley

On September 9, 2013, a reported 1,400 tons of molasses was inadvertently spilled into Honolulu Harbor in Hawaii, devastating the sea life. When I think of ocean spills, pictures of oil-covered animals jump into my mind, but the molasses spill is proving to be potentially just as damaging to the environment.

This incident can be analyzed by building a Cause Map, an intuitive format for performing a root cause analysis. A Cause Map visually lays out the causes that contribute to an accident to show the cause-and-effect relationships between them so that it’s easier to understand the factors that led to the issue. Understanding all the causes and not just focusing on a single “root cause” helps broaden the potential solutions that are considered and can lead to a better long term solution. The first step in the Cause Mapping process is to define how the problem impacted the goals and then these impacts are used as the starting point for the Cause Map.

The most obvious impact from the molasses spill is that thousands of fish and other marine life were killed. They suffocated because the molasses sank and displaced the oxygen-containing seawater in the harbor. The density of molasses is what makes this spill so different from an oil spill. Oil is lighter than water and floats on top of the ocean while molasses sinks to the bottom, with devastating effects at all levels in the ocean. Divers investigating the molasses spill reported that there were no signs of life in the ocean near the spill; all bottom dwellers had been killed.

The fact that molasses sinks also means that there is no practical way to clean it up. One positive is that molasses, unlike oil, will mix with water. It sits on the bottom until it is diluted and ocean movements disperse it. Since the spill occurred in a protected harbor, the ocean movements are weaker and the time frame to move the molasses is longer than it would be in the open ocean, but nature will eventually return oxygen levels in the harbor to life-supporting levels.

The cause of the spill has been reported to be a leaking pipe. Molasses produced in Hawaii was being pumped into a ship for transportation to the mainland, where it was planned for use in animal feed. During the transfer, the molasses was accidentally pumped through a pipe with a leak and nobody noticed before the majority of the molasses had been released into the harbor. Details about what specifically caused the leak haven’t been released.

There are also other impacts from the spill that are worth considering. With any environmental issue, the cost of the investigation and any cleanup that needs to be done can be substantial. Many businesses in the area were also impacted by a drop in tourism because the harbor was closed for about two weeks after the accident, and normal tourism levels will probably not return until marine life in the area begins to recover. There was also a potential safety risk to any swimmers for a time after the accident because the presence of thousands of dead fish could attract predators.

To view an Outline and high level Cause Map of this accident, click on “Download PDF” above.

Is Your Emergency Response Plan Really Good Enough?

September 18, 2013 Kim Smiley

The deadly shooting at the Washington Navy Yard this week hit especially close to home. I live about 15 miles away from the Navy Yard and I also worked there for 5 years. And my husband, a Navy civilian, still does.

Among the thousands impacted by the shooting, my family was very lucky. My spouse came home safely while others did not. Additionally, we hit the jackpot from a logistics standpoint because he was in his office (which is not in the affected building) when the order to shelter in place was issued. He had access to his phone and internet (as well as a bathroom and his packed lunch). I had word almost immediately that he was safe and was able to communicate with him throughout the day. When word came that he could go home but his car couldn’t, we were able to coordinate and get him home as quickly as possible.

Like many people across the country, I was riveted by the news and was holding my breath as the information fluctuated by the hour. At ThinkReliability, we are generally called in to help investigate or document information after an incident, so the opportunity to watch an incident as it plays out in real time is fairly rare. I was throwing together a makeshift family emergency response as I was bombarded with calls and messages from concerned friends and family, while also trying to figure out when and how my husband would get home. And for somebody who works on processes and solutions for a living, my personal emergency response wasn’t very impressive. Take my word for it: the ideal time to discover your mother-in-law has your old cell phone number isn’t when your husband’s place of employment has just made national headlines.

A time like this is an excellent opportunity to review both your organization’s and your family’s emergency response plans. Is your organization ready to handle a shelter-in-place situation? Do you know which authorities to contact in case of emergencies? And, one piece that I think is often overlooked: how would you handle the flow of information? How do you pass word to families if something significant occurs, and do people know where to look for the information? Would you post the information on the website? Would an old-fashioned phone tree serve your needs? Do you have updated contact information and home phone numbers?

It’s also important to have a basic plan in place for your family in case something unforeseen happens. There was a flurry of activity on Monday as everyone worked to make sure that there was a plan for all the children of the people we knew on the Navy Yard to be picked up and potentially kept overnight. Thousands of people work on the Navy Yard and there were several cases where a single parent or both parents were stuck on lock-down for an indeterminate amount of time. Are you really ready to handle a situation like that? If your family or employees have any special needs, like requiring medication, I would recommend making a plan to deal with it. I also highly recommend taking a moment to make sure that any list with people allowed to pick up your children is up to date and includes a few folks who do not work in your building or even on the same side of town. Fairly simple precautions can make a tough situation go much more smoothly.

And don’t think you don’t need a basic plan if you have no dependents. Do you know how you would get home if you suddenly had to leave your car at work like many of the Navy Yard employees did? What if your wallet was left behind in a rushed evacuation? It might be a good idea to have enough money to cover cab fare in your car or in your badge holder if you wear one to work. How do you pass word to your parents that you’re okay, especially if you don’t have access to a phone? Would your mom think to check her email? Do you have a friend who has your parents’ or siblings’ phone numbers and could call them for you if they aren’t comfortable with social media or computers? Trust me; your families would be very interested in hearing that you’re okay.

I hope you never experience any crisis even remotely close to the tragedy at the Navy Yard. But if there is ever an emergency, you’ll be grateful if you made a plan beforehand.

NYT Website Disrupted for Hours

September 13, 2013 Kim Smiley

On Tuesday, August 27, 2013, the New York Times website went dark for several hours after being attacked by a well-known group of hackers. Reports of hacked websites are becoming increasingly common and the New York Times was just one of many recent victims.

A Cause Map, or visual root cause analysis, can be used to analyze the recent attack on the New York Times website. A Cause Map lays out the many causes that contribute to an issue in an intuitive format that illustrates the cause-and-effect relationships. A Cause Map is useful for understanding all the causes involved and can help when brainstorming solutions. To see a Cause Map of this example, click on “Download PDF” above.

Some details of how the attack was done have been released, as documented on the Cause Map. The New York Times website itself was not technically hacked, but traffic was redirected away from the legitimate website to another web domain. To pull off this feat, hackers changed the domain name records for the New York Times website after acquiring the user name and password of an employee at the domain name registrar company. The employee inadvertently provided the information to the hackers by responding to a phishing email asking for personal information.

The email sent by the hackers looked legitimate enough to fool the employee.

So why did hackers target the New York Times in the first place? The answer is that the New York Times is one of many western media outlets to be targeted by the Syrian Electronic Army (S.E.A.), which has claimed responsibility for the attack. The S.E.A. supports President Bashar al-Assad of Syria and is generally unhappy with the way the events in Syria have been portrayed in the West.

So the next logical question is how do you protect yourself from a phishing scheme? The first step is awareness. Pretty much everybody who uses email can expect to receive some suspicious emails. A few things to look out for: attachments, links, misspellings, and a mismatched “from” field or subject line. Also any alarming language should be a red flag. For example, an email from your credit card company warning you that your account will be closed unless you take immediate action is probably not the real deal. A good rule of thumb is to never respond to any email with personal information or to click on links in emails. If you think a request for action may be real, either call the company or open a new web browser window and type in the company’s web address. It’s best to delete any suspicious emails immediately.
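
The red flags listed above can be turned into a simple checklist. The sketch below is only an illustration under stated assumptions: the `Email` fields, the `red_flags` function and the `ALARM_PHRASES` list are hypothetical names of our own, and the misspelling check is left out because it would need a dictionary. A real mail filter would be far more thorough.

```python
import re
from dataclasses import dataclass

# Hypothetical examples of alarming language; not an exhaustive list.
ALARM_PHRASES = (
    "immediate action",
    "account will be closed",
    "verify your password",
)


@dataclass
class Email:
    sender_address: str   # the address the message actually came from
    claimed_domain: str   # the company domain the message claims to represent
    body: str
    has_attachment: bool = False


def red_flags(email):
    """Return the warning signs found in an email, in a fixed order."""
    flags = []
    if email.has_attachment:
        flags.append("attachment")
    if re.search(r"https?://", email.body, re.IGNORECASE):
        flags.append("link")
    actual_domain = email.sender_address.rsplit("@", 1)[-1].lower()
    if actual_domain != email.claimed_domain.lower():
        flags.append("mismatched from field")
    lowered = email.body.lower()
    if any(phrase in lowered for phrase in ALARM_PHRASES):
        flags.append("alarming language")
    return flags


suspicious = Email(
    sender_address="alerts@examp1e-card.test",
    claimed_domain="example-card.test",
    body="Your account will be closed unless you take immediate action: "
         "http://examp1e-card.test/login",
)
print(red_flags(suspicious))
```

A checklist like this only raises awareness. An email with no flags can still be a phishing attempt, which is why calling the company or typing its web address yourself remains the safer habit.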

This example is also a good reminder to be aware that websites can get hacked. A great example of this is when the S.E.A. hacked the Associated Press’s Twitter feed in April 2013 and used it to announce (falsely) that the White House had been bombed. That one tweet is estimated to have caused a $136 billion loss in the stock markets as people responded to the news. In general, it is probably good to be skeptical about anything shocking you read online until the information is confirmed.

What Happens When a Copy Isn’t a Copy?

September 6, 2013 Kim Smiley

Think of how many documents are scanned every day. Imagine how important some of these pieces of paper are, such as invoices, property records, and medical files. Now try to picture what might happen if the copies of these documents aren’t true copies. This is exactly the scenario that Xerox was recently facing.

It recently came to light that some copies of scanned documents were altered by the scanning process. Specifically, some scanner/copier machines changed numbers on documents. This issue can be analyzed by building a Cause Map, an intuitive, visual format for performing a root cause analysis. The first step in the Cause Mapping process is to fill in an Outline with the basic background information on an issue. Additionally, the impacts to the overall goals are documented on the Outline to help clarify the severity of any given issue. In this example, the customer service goal is impacted because the scanners weren’t operating as expected. There is also a potential impact to the overall economic goal because the altered documents could result in any number of issues. There is also an impact because of the labor needed to investigate and fix the problem.

After completing the Outline, the next step is to ask “why” questions to build the Cause Map. Why weren’t the scanners operating as expected? This happened because the scanners were changing some documents during the scanning process. Scanners use software to help interpret the original documents and Xerox has stated that the problem happened because of a software bug. Testing showed that the number substitutions were more likely to occur when the scanners were set to lower quality/higher compression because of the specific software used for these settings. Testing also showed that the error was more likely to occur when scanning documents that were more difficult to read, such as those with small fonts or that had already been copied multiple times.

Xerox had been aware of the potential for number substitution at lower quality settings, but didn’t appear to expect it to occur at factory settings (which was found to be very unlikely, but possible). A notice that stated that character substitutions were possible appeared on the scanners when lower resolution settings were selected and was included in some manuals, but this approach seems to have been ineffective since many users were caught unaware by this issue.

After a Cause Map has been built with enough detail to understand the issue, it can be used to help develop solutions. In this example, Xerox developed a software patch that corrected the error. Xerox also published several blog posts on its website to keep customers informed about the issue and worked with users to ensure that the patch was successful in correcting the error.

To see a high level Cause Map of this issue, click on “Download PDF” above.

Trading Glitch Loses Goldman Sachs Millions

August 30, 2013 Kim Smiley

A Goldman Sachs trading glitch on August 20, 2013 caused a large number of erroneous single stock and ETF options trades. About 80 percent of the errant trades were cancelled, but the financial damage is still speculated to be as much as one hundred million dollars. The company also finds itself once again in the uncomfortable position of making headlines for negative reasons, which is never good for business.

The glitch occurred during an update to an internal computer system that is used to determine where to price options. The update changed the software so that the system began inadvertently misinterpreting non-binding indications of interest as actual bids and offers. The system acted on these bids and executed a large volume of trades at errant prices that were out of touch with actual market prices.

This issue can be built into a Cause Map, an intuitive method for performing a root cause analysis. One of the advantages of a Cause Map is that it visually lays out all the causes and the cause-and-effect relationships between them. Seeing all the causes can broaden the solutions that are considered.

In this example, a Cause Map can help illustrate the fact that the software glitch itself isn’t the only thing worth focusing on. The lack of an effective test program also contributed to the problem and testing may be the easiest place to implement an effective solution. If the problem had been caught in testing, the only cost would have been the time and effort needed to fix the software. The importance of a robust test program for software is difficult to overstate. If the software is vital to whatever your company’s mission is, develop a way to test it.
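
To make the point about testing concrete, here is a minimal sketch of the kind of regression test that checks the behavior described above: non-binding indications of interest must never be treated as actual bids and offers. Everything in it is hypothetical. The `Kind`, `Message` and `executable_orders` names are our own inventions for illustration, and nothing here reflects how Goldman Sachs’s actual system is built.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    INDICATION_OF_INTEREST = "ioi"  # non-binding, must never be traded on
    ORDER = "order"                 # binding bid or offer


@dataclass(frozen=True)
class Message:
    kind: Kind
    symbol: str
    price: float


def executable_orders(messages):
    """Return only the binding orders from a feed of messages."""
    return [m for m in messages if m.kind is Kind.ORDER]


def test_indications_are_not_executed():
    """A software update must not turn indications of interest into orders."""
    feed = [
        Message(Kind.INDICATION_OF_INTEREST, "XYZ", 1.00),
        Message(Kind.ORDER, "XYZ", 2.50),
    ]
    result = executable_orders(feed)
    assert result == [feed[1]]
    assert all(m.kind is Kind.ORDER for m in result)


test_indications_are_not_executed()
```

Running a test like this before every update is cheap compared with the cost of finding the same problem in live trading.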

To view a high level Cause Map of this issue, click on “Download PDF” above. Click here to read about the loss of the Mars Climate Orbiter, another excellent example of a software error with huge consequences.

Train Derailment Kills 79 in Spain

August 2, 2013 Kim Smiley

On July 24, 2013, a train carrying 247 people violently derailed near Santiago de Compostela, Spain. Over 130 were injured and 79 were killed as a result of the accident. Many details are still unknown, but investigators have determined that the train was traveling about twice the posted speed over a curved section of track.

The derailment was the worst train accident Spain has suffered in 40 years. Obviously, an investigation is underway and authorities are eager to identify what caused the accident and are working to prevent anything similar from occurring in the future. One of the ways this accident can be analyzed is by building a Cause Map, a visual format for performing a root cause analysis. A Cause Map visually lays out the different causes that contributed to an accident in an intuitive format that shows the cause-and-effect relationships.

The Cause Mapping process begins by filling in the basic background information for an issue as well as identifying how the incident impacted the goals. In this example, the safety goal is clearly impacted because there were fatalities and injuries. The schedule, labor, and material goals were also impacted because of the time and resources needed to investigate and clean up the accident and the damage to the train. The negative publicity surrounding the accident can also be considered an impact to the customer service goal because people may be hesitant to ride trains if they have concerns about safety.

So why did the train derail? The train was going too fast to safely navigate a curved section of track. The train was going fast because it had previously been running on track designed for high speed trains, where high speeds were permitted, and it didn’t slow down as it entered a section of track where the posted speed was lower. Operator action was required to slow down the train and it appears that the operator failed to take action. Investigators are looking into whether there was a mechanical problem of some kind that prevented the train from reducing speed, but early indication is that the operator simply failed to brake and reduce the speed of the train.

A number of factors seem to have contributed to this deadly error by an experienced train operator who was familiar with this portion of track. The European Rail Traffic Management System (ERTMS) automatically controls braking and is installed on most of the track that high speed trains operate on in the region, but not on the track where the accident occurred. The accident occurred at the first potentially dangerous curve after the transition to track where operator action is necessary to brake the train. Based on statements by the driver, he missed the transition to the track where manual braking is required and didn’t realize that the train was in danger. It has also come to light that the train driver was on the phone with the train’s ticket inspector immediately prior to the derailment and this distraction likely played a role in the accident. The initial investigation findings led to the train’s driver being provisionally charged with multiple counts of homicide by professional recklessness on July 28, 2013.

Regardless of whether the driver is convicted on the charges, the automatic systems involved should be a focus of the investigation. The safety system sent a warning to the operator about the high speed prior to the accident, but it failed to prevent the accident. Investigators need to review the timing of the warning and determine whether it came too late. Other automatic systems such as the ERTMS have the ability to stop a train that is operating at unsafe speeds, which, given that the accident happened, raises the question of whether the safety systems used on this portion of track are adequate. Ideally, a single error by a train driver, for any reason, should not result in dozens of deaths.

To view a high level Cause Map of this incident, click on “Download PDF” above. Click here to view a video of the accident.

Deadly Plane Crash at San Francisco Airport

July 25, 2013 Kim Smiley

On July 6, 2013, Asiana Airlines Flight 214 crashed while attempting to land at the San Francisco International Airport. Three people have died as a result of the crash and around 180 others were injured, 13 critically. The cause of the crash is currently under investigation, but there were no obvious mechanical issues and the weather was near perfect.

Even though the investigation is still in its infancy, an initial Cause Map can be built to document what is known now about the accident and it can easily be expanded later as more information becomes available. A Cause Map is a visual format for performing a root cause analysis that intuitively lays out the different causes for an accident. The first step in the Cause Mapping process is to fill in an Outline with the basic background information for an issue. On the bottom half of the Outline there is space to document how the problem impacts the overall goals. This is useful because it helps everyone involved in the process understand the big picture and the issues with the more significant impacts can be prioritized first.

There is also space on the Outline to list anything that was different or unusual at the time the problem occurred. It’s important to note any differences because they may have played a role in the accident and are usually worth exploring during an investigation. In this specific example, this was the first time the pilots had worked together and the two main pilots were both in unfamiliar roles. The pilot landing the plane had limited experience with Boeing 777s even though he was an experienced pilot, and this was his first time landing this type of aircraft at the San Francisco airport. There was another pilot instructing him, but it was his first flight as an instructor.

Once the Outline is completed, the next step is to ask “why” questions and add the answers to the Cause Map. In this example, we know that the airplane was coming in too low and too slow to land safely, but it isn’t known why that happened. The NTSB has initiated an investigation and the results will be reported when the analysis is complete. Some of the early speculation is that there may have been an equipment failure, mismanagement of automated systems or ineffective communication in the cockpit. The fact that this crew differed from the typical staffing has been a focus of investigators, but it isn’t known what role this may have played in the crash.

Another piece of this puzzle is that one of the passengers who died at the crash scene appears to have been killed when she was run over by a fire engine. She was covered in foam on the ground and the firefighters were unaware of her location. Emergency response procedures will need to be reviewed as part of the investigation into this accident to ensure that first responders can do their jobs in the safest way possible.

To view an initial Cause Map of this issue, click on “Download PDF” above.