Category Archives: Root Cause Analysis – Incident Investigation

Root Cause Analysis - Incident Investigation

The Morris Worm: The First Significant Cyber Attack

November 7, 2013 Kim Smiley

In 1988 the world was introduced to the concept of a software worm when the Morris worm made headlines for significantly disrupting the fledgling internet. The mess left in the wake of the Morris worm took several days to clean up. Estimates of the cost of the Morris worm vary greatly, from $100,000 to $10,000,000, but even at the low end of the range the numbers are substantial.

A Cause Map, or visual root cause analysis, can be used to analyze this issue. A Cause Map is built by asking “why” questions and using the answers to visually lay out the causes that contributed to an issue to show the cause-and-effect relationships. In this example, a programmer was trying to build a “harmless” worm that could be used to gauge the size of the internet, but he made a mistake. The goal was to infect each computer one time, but the worm was designed to duplicate itself every seventh time a computer indicated it already had the worm to make the worm hard to defend against. The problem was that the speed of propagation was underestimated. Once released, the worm quickly reinfected computers over and over again until they were unable to function and the internet came crashing down. (To view a Cause Map of this example, click on “View PDF” above.)
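The reinfection rule described above can be sketched as a toy simulation. This is a minimal illustration, not the worm's actual code: the host count, the number of rounds, the random-target model, and every name in it are assumptions made for the example. Only the one-in-seven rule comes from the description above.

```python
import random

# From the post: duplicate on every seventh "already infected" reply.
REINFECT_EVERY = 7


def simulate(num_hosts=50, rounds=30, seed=0):
    """Toy model (assumed): each running copy probes one random host per round."""
    rng = random.Random(seed)
    copies = [0] * num_hosts      # worm copies running on each host
    refusals = [0] * num_hosts    # "already infected" replies seen per host
    copies[0] = 1                 # a single initial infection
    for _ in range(rounds):
        probes = sum(copies)      # every running copy makes one probe
        for _ in range(probes):
            target = rng.randrange(num_hosts)
            if copies[target] == 0:
                copies[target] = 1            # clean host: infect it once
            else:
                refusals[target] += 1
                if refusals[target] % REINFECT_EVERY == 0:
                    copies[target] += 1       # seventh refusal: duplicate anyway
    return copies


if __name__ == "__main__":
    result = simulate()
    print("hosts infected:", sum(1 for c in result if c > 0))
    print("total copies running:", sum(result))
    print("most copies on one host:", max(result))
```

In this sketch, once most hosts are infected nearly every probe hits an infected machine, so the "every seventh time" rule keeps adding copies and the total load keeps climbing. That is the effect described above: a rule meant to make the worm hard to defend against ends up overloading machines.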

One of the lasting impacts from the Morris worm that is hard to quantify is the impact on cyber security. The worm exploited known bugs that no one had worried about enough to fix. At the time of the Morris worm, there was no commercial traffic on the internet or even Web sites. The people who had access to the internet were a small, elite group and concerns about cyber security hadn’t really come up. If the first “hacker” attack had had malicious intent behind it and had come a little later, it’s likely that the damage would have been much more severe. While the initial impacts of the Morris worm were all negative, it’s a positive thing that it highlighted the need to consider cyber security relatively early in the development of the internet.

It’s also interesting to note that the programmer behind the Morris worm, Robert Tappan Morris, became the first person to be indicted under the 1986 Computer Fraud and Abuse Act. He was sentenced to a $10,050 fine, 400 hours of community service, and three years of probation. Morris was a 23-year-old graduate student at the time he released his infamous worm. After this initial hiccup, Morris went on to have a successful career and now works in the MIT Computer Science and Artificial Intelligence Laboratory.

16-Day Government Shutdown Affects Economy

November 1, 2013 Holly Maher

By Holly Maher

On October 1, 2013 at 12:01 AM, the beginning of the 2014 fiscal year, the federal government shut down all non-essential operations when Congress could not pass a continuing resolution to allow spending at current levels. The government shutdown lasted 16 days and, in addition to other impacts, closed the National Parks system (see our blog about the park closures), furloughed 800,000 federal employees, had the potential to impact payment of veterans’ benefits and negatively impacted the economy, both directly and indirectly.

So what caused the government shutdown? If you watched any TV during that 16-day period, you could certainly hear any number of experts (on both sides) explaining who was to blame. In keeping with the Cause Mapping methodology, this analysis of the government shutdown is not trying to identify the one person, the one group or the one reason to blame for the shutdown. Instead, we will identify all the causes required to produce this effect. This will allow us to identify many possible solutions for preventing it from happening again. We start by asking “why” questions and documenting the answers to visually lay out all the causes that contributed to the shutdown. The cause-and-effect relationships lay out from left to right.

In this example, the government shutdown occurred because Congress could not pass a continuing resolution bill. The bill could not be passed because a line item defunding the Affordable Care Act (ACA) had been added to it, and that item could not be agreed upon. A continuing resolution was required because the Constitution gives the power to spend money to Congress, and since Congress had not passed a budget for fiscal year 2014, a continuing resolution was constitutionally required to continue operating the government after October 1. Defunding the ACA was added to the continuing resolution bill because the ACA was about to go into effect and because it could be added on a line item basis. Congress was unable to compromise to reach an agreement to pass the continuing resolution.

So why was Congress unable to reach an agreement? If the incentive to compromise had been greater than the incentive not to compromise, they would have compromised. So why is the incentive to compromise ineffective? One reason is that Congress’s pay is not affected when the government shuts down. Another reason is that there is significant incentive to maintain a position aligned with the party (either left or right). The desire to get re-elected (which is unlimited within Congress), the need for support in the primaries to get re-elected (based on the current primary system), and the need for campaign financing are all causes that support the incentive to maintain alignment with the party versus compromise.

Once all the causes of the government shutdown have been identified, possible solutions to prevent the shutdown from happening again can be brainstormed. One possible solution would be to legally require a continuing resolution to be a “clean” bill, with no additional line items. This would make it less likely that future debates over current hot-button items, such as the ACA, would result in a failure to pass the continuing resolution and therefore cause a government shutdown. Another possible solution would be to stop pay for Congress during a government shutdown. Other more global, systemic solutions might be to implement term limits in Congress or provide government campaign financing to reduce the dependency on party financial support.

To view the Outline and Cause Map, please click “Download PDF” above.

“Ghost Train” Causes Head-On Collision in Chicago

October 24, 2013 Kim Smiley

By Kim Smiley

On September 30, 2013, an unoccupied train collided head-on with another train in Chicago, sending 30 people to the hospital. In a nod to the season and the bizarre circumstances of the accident, the unoccupied train has been colorfully dubbed “the ghost train”.

So what caused the “ghost train” and how did it end up causing a dangerous train collision? Investigators from the National Transportation Safety Board (NTSB) are still reviewing the details of the accident, but some information is available. An initial Cause Map, or visual root cause analysis, can be built to capture what is already known and can be expanded to incorporate more information as the investigation progresses. A Cause Map is built by asking “why” questions and documenting the answers to visually lay out all the causes that contributed to an accident to show the cause-and-effect relationships from left to right.

In this example, the trains collided because an unoccupied train began moving and the safety systems in place did not stop the train. Investigators still haven’t determined exactly what caused the train cars to move, but a key piece of the puzzle is that there was still power to the cars while they were being stored in a repair terminal awaiting maintenance. The NTSB believes that it was common practice to leave power on to the cars so that their lights could be used to illuminate the terminal. Workers used the lights to discourage graffiti and vandalism because the terminal was located in a high-crime neighborhood.

Investigators will need to not only determine why the train started rolling, but also learn more about why the safety systems didn’t prevent the accident. Before colliding with another train, the unoccupied train traveled through five mechanical train-stop mechanisms, each of which should have stopped a train without a driver. The emergency brakes were applied at each train-stop, causing the train to pause momentarily, but the train then started moving again because of the setting on the master lever. Review of the safety systems will need to be part of the investigation to ensure that adequate protection is in place to prevent anything similar from occurring again.

The NTSB investigation is still ongoing, but the NTSB has stated that de-energizing propulsion power and using an alternate brake setting could help prevent unintended movement of unoccupied train cars. Additionally, the NTSB believes the use of a wheel chock and/or derail would ensure that a train stopped by a mechanical train-stop mechanism remains stopped. Based on the information already uncovered, the NTSB has issued an urgent safety recommendation to the Federal Transit Administration (FTA). The NTSB recommended that the FTA issue a safety advisory to all rail transit properties to review procedures for storing unoccupied train cars to ensure that they are left in a safe condition that doesn’t allow unintended movement and to ensure that they have redundant means of stopping any unintended movement. More information is needed to fully understand this accident, but these precautions would be effective solutions that can be quickly implemented to reduce the risk of train accidents.

Utah Fights for National Parks

October 17, 2013 ThinkReliability Staff

By ThinkReliability Staff

Beginning on October 1, 2013 with the failure to approve spending, the US government entered a partial shutdown including the complete closure of the National Parks, as specified in the National Park Service Contingency Plan. While the government shutdown had far-reaching effects, both across industry and geographically, areas of Utah were hit particularly hard by the closure of multiple National Parks in the area. The shutdown finally ended on October 17 when the government reached a deal to reopen.

A large proportion of Utah businesses are dependent on revenue brought in from tourists visiting the multiple Federal lands in the state, which include National Parks, National Monuments and National Recreation Areas. A total of five counties in Utah declared a state of emergency, with the counties saying they were losing up to $300,000 a day. San Juan County, the last to declare a state of emergency, went a step further and decided it would reopen the parks itself, using local personnel to provide necessary emergency response and facilities for park visitors.

On October 10, the state of Utah came to an agreement with the Department of the Interior to pay for the Park Service to reopen the parks for up to 10 days at a cost of $166,572 a day. (It is possible, though not automatic, that the state will be reimbursed for these costs after funding is restored.) Luckily a “practical and temporary solution” (as described by Secretary of the Interior Sally Jewell) was found before county officials had to resort to what they described as “civil disobedience”. (Trespassing in a National Park can result in a citation that could lead to fines or jail terms.)

This situation mirrors one frequently found on a smaller scale in all workplaces. Concerned employees find themselves in circumstances that they believe are not in the best interest of their company or customers. If support for change is not provided by management, these employees will develop workarounds (like illegally reopening a National Park to allow tourists to enter). Sometimes workarounds are actually a more effective way of completing work tasks, but they can also lead to unintended consequences that can be disastrous.

This is why the most effective work processes are developed with the experience and insight of employees at all levels. Taking their concerns into account when procedures are developed, and on an ongoing basis afterward, will reduce the use of potentially risky workarounds and can help an organization meet all of its goals.

To view the Outline, Cause Map, and considered solutions, please click “Download PDF” above.

Rules on Inflight Electronics May be Changing Soon

October 10, 2013 Kim Smiley

By Kim Smiley

In welcome news to many airline passengers, it looks like the FAA may soon allow the use of personal electronic devices during the entire duration of flights, including takeoff and landing. The current restrictions on the use of personal electronics are being reviewed following a recent recommendation by an aviation advisory committee made up of pilots, mechanics, engineers and other aviation experts.

A Cause Map, a visual format for performing a root cause analysis, can be used to analyze this issue. A Cause Map is built by asking “why” questions and intuitively laying out the many causes that contributed to an issue to show the cause-and-effect relationships. The first step in the Cause Mapping process is to document the basic background information and to list how the issue impacts the goals in an Outline.

One of the major impacts in this example is the concern that use of personal electronic devices onboard aircraft may be dangerous and increase the risk of a plane crash. Currently, the use of personal electronics is allowed once a plane is above 10,000 feet, which is basically the whole flight except landing and takeoff, which are considered the most critical portions of the flight. These restrictions are in place because pilots depend on electronic systems, such as navigation and communications systems, to safely do their job and there is concern about the potential for interference with these vital systems.

How likely it is that dangerous interference could be an actual issue is debated. There were 75 reports by pilots of suspected electronic device interference between 2003 and 2009, according to the International Air Transport Association. However, it’s difficult to reproduce interference and it has never been cited as a cause in any airplane accident. The current ban on the use of electronics also seems to be loosely enforced, raising questions about its necessity and effectiveness. (A survey by the Consumer Electronics Association also found that nearly a third of airplane passengers said they left a portable electronic device on during a flight in the previous year.) There seems to be a general consensus that this is a low-risk issue, but the potentially high consequences if it occurs have made some reluctant to reduce the restrictions.

There are also some non-technical issues that need to be considered with the onboard use of electronics. There is concern that passengers enthralled with their devices will be distracted and miss important information during preflight safety briefs. There is also a concern that larger devices, such as laptops, could become a missile hazard and hurt passengers if the plane moves unexpectedly.

If the new recommendations are approved, passengers will be able to use any device that doesn’t transfer data throughout the entire flight, including takeoff and landing. Passengers would be able to leave all devices turned on, but they would need to set them to airplane mode so that no data is transmitted. So you wouldn’t be able to make calls on your smartphone or stream video, but you would be able to rock out to music already downloaded or read a book on a Kindle. Larger devices would still need to be stowed during takeoff and landing because nobody wants to be hit with a laptop, but smaller gadgets will be fair game if the new recommendations are adopted.

To see a Cause Map of this issue, click on “Download PDF” above.

The Salvage Process of Costa Concordia

October 3, 2013 ThinkReliability Staff

By ThinkReliability Staff

On September 16, 2013, the fatally stricken Costa Concordia was lifted upright (a process known as “parbuckling”) in salvage operations that were the most expensive ever and involved the largest ship ever salvaged. The ship ran aground off the coast of Italy on January 13, 2012 (see our previous blog about the causes of the ship running aground) and had been lying on its side for the 20 months since.

The ship grounding had immediate, catastrophic impacts, including the death of 32 people. However, it also had longer term impacts, mainly pollution from the fuel, sewage and other hazardous materials stored aboard the ship. It was determined that the best way to minimize the leakage from the ship would be to return it upright and tow it to port, where the onboard waste could be emptied and disposed of and the ship then broken up for scrap.

Because a salvage operation of this magnitude (due to the size and location of the ship) had never been attempted, careful planning was necessary. Processes like this salvage operation can be described in a Process Map, which visually diagrams the steps that need to be taken for a process to be completed successfully. A Process Map differs from a Cause Map, which visually diagrams cause-and-effect relationships to show the causes that led to the impacts (such as the deaths and pollution). Whereas a Cause Map reads backwards in time (the impacted goals result from the causes, which generally must precede those impacts), a Process Map reads from left to right along with time. (Step 1 is to the left of, and must be performed before, Step 2.) In both cases, arrows indicate the direction of time.

Like Cause Maps, Process Maps can be built in varying levels of detail. In a complex process, many individual steps will consist of more detailed steps. Both a high level overview of a process and a more detailed breakdown can be useful when developing a process. Process Maps can be used as part of the analysis step of an incident investigation – to show which steps in a process did not go well – or as part of the solutions – to show how a process developed as a solution should be implemented.

In the example of the salvaging of the Costa Concordia, we use the Process Map for the latter. The salvaging process is part of the solutions – how to remove the ship while minimizing further damage and pollution. This task was not easy – uprighting the ship (only the first step in the salvage process) took 19 hours, involved 500 crewmembers from 26 countries and cost nearly $800 million. Other options used for similar situations included blowing up the ship or taking it apart on-site. Because of the hazardous substances onboard – and the belief that two bodies are still trapped under or inside the ship – these options were considered unacceptable.

Instead, a detailed plan was developed: prepare for leakage with oil booms that held sponges and skirts, then install an underwater platform and 12 turrets to aid in the parbuckling and hold the ship upright. The ship was winched upright using 36 cables and is being held steady on the platform with computer-controlled chains until spring, when the ship will be floated off the platform and delivered to Sicily to be taken apart.

To view the Process Map in varying levels of detail, please click “Download PDF” above. Or, see the Cause Map about the grounding of the ship in our previous blog.

Sea Life Devastated by Molasses Spill

September 26, 2013 Kim Smiley

By Kim Smiley

On September 9, 2013, a reported 1,400 tons of molasses was inadvertently spilled into Honolulu Harbor in Hawaii, devastating the sea life. When I think of ocean spills, pictures of oil-covered animals jump into my mind, but the molasses spill is proving to be potentially just as damaging to the environment.

This incident can be analyzed by building a Cause Map, an intuitive format for performing a root cause analysis. A Cause Map visually lays out the causes that contribute to an accident to show the cause-and-effect relationships between them so that it’s easier to understand the factors that led to the issue. Understanding all the causes and not just focusing on a single “root cause” helps broaden the potential solutions that are considered and can lead to a better long term solution. The first step in the Cause Mapping process is to define how the problem impacted the goals and then these impacts are used as the starting point for the Cause Map.

The most obvious impact from the molasses spill is that thousands of fish and other marine life were killed. They suffocated because the molasses sank and displaced the oxygen-containing seawater in the harbor. The density of molasses is what makes this spill so different from an oil spill. Oil is lighter than water and floats on top of the ocean while molasses sinks to the bottom, with devastating effects at all levels in the ocean. Divers investigating the molasses spill reported that there were no signs of life in the ocean near the spill; all bottom dwellers had been killed.

The fact that molasses sinks also means that there is no practical way to clean it up. One positive about molasses is that molasses, unlike oil, will mix with water. It sits on the bottom until it is diluted and ocean movements disperse it. Since the spill occurred in a protected harbor, the ocean movements are weaker and the time frame to move the molasses is longer than it would be in the open ocean, but nature will eventually return oxygen levels in the harbor to life-supporting levels.

The cause of the spill has been reported to be a leaking pipe. Molasses produced in Hawaii was being pumped into a ship for transportation to the mainland, where it was planned for use in animal feed. During the transfer, the molasses was accidentally pumped through a pipe with a leak and nobody noticed before the majority of the molasses had been released into the harbor. Details about what specifically caused the leak haven’t been released.

There are also other impacts from the spill that are worth considering. With any environmental issue, the cost of the investigation and any cleanup that needs to be done is always substantial. Many businesses in the area were also impacted by a drop in tourism because the harbor was closed for about two weeks after the accident, and normal tourism levels will probably not return until marine life in the area begins to recover. There was also a potential safety risk to any swimmers for a time after the accident because the presence of thousands of dead fish could attract predators.

To view an Outline and high level Cause Map of this accident, click on “Download PDF” above.

Is Your Emergency Response Plan Really Good Enough?

September 18, 2013 Kim Smiley

By Kim Smiley

The deadly shooting at the Washington Navy Yard this week hit especially close to home. I live about 15 miles away from the Navy Yard and I also worked there for 5 years. And my husband, a Navy civilian, still does.

Among the thousands impacted by the shootings, my family was very lucky. My spouse came home safely while others did not. Additionally, we hit the jackpot from a logistics standpoint because he was in his office (which is not in the affected building) when the order to shelter in place was issued. He had access to his phone and internet (as well as a bathroom and his packed lunch). I had word almost immediately that he was safe and was able to communicate with him throughout the day. When word came that he could go home but his car couldn’t, we were able to coordinate and get him home as quickly as possible.

Like many people across the country, I was riveted by the news and was holding my breath as the information fluctuated by the hour. At ThinkReliability, we are generally called in to help investigate or document information after an incident, so the opportunity to watch an incident as it plays out in real time is fairly rare. I was throwing together a makeshift family emergency response as I was bombarded with calls and messages from concerned friends and family, all while trying to figure out when and how my husband would get home. And for somebody who works on processes and solutions for a living, my personal emergency response wasn’t very impressive. Take my word for it: the ideal time to discover your mother-in-law has your old cell phone number isn’t when your husband’s place of employment just made national headlines.

A time like this is an excellent opportunity to review both your organization’s and your family’s emergency response plans. Is your organization ready to handle a shelter-in-place situation? Do you know which authorities to contact in case of emergencies? And, one piece that I think is often overlooked: how would you handle the flow of information? How do you pass word to families if something significant occurs and do people know where to look for the information? Would you post the information on the website? Would an old-fashioned phone tree serve your needs? Do you have updated contact information and home phone numbers?

It’s also important to have a basic plan in place for your family in case something unforeseen happens. There was a flurry of activity on Monday as everyone worked to make sure that there was a plan for all the children of the people we knew on the Navy Yard to be picked up and potentially kept overnight. Thousands of people work on the Navy Yard and there were several cases where a single parent or both parents were stuck on lockdown for an indeterminate amount of time. Are you really ready to handle a situation like that? If your family or employees have any special needs, like requiring medication, I would recommend making a plan to deal with it. I also highly recommend taking a moment to make sure that any list of people allowed to pick up your children is up to date and includes a few folks who do not work in your building or even on the same side of town. Fairly simple precautions can make a tough situation go much more smoothly.

And don’t think you don’t need a basic plan if you have no dependents. Do you know how you would get home if you suddenly had to leave your car at work like many of the Navy Yard employees did? What if your wallet was left behind in a rushed evacuation? It might be a good idea to have enough money to cover cab fare in your car or in your badge holder if you wear one to work. How do you pass word to your parents that you’re okay, especially if you don’t have access to a phone? Would your mom think to check her email? Do you have a friend who has your parents’ or siblings’ phone numbers and could call them for you if they aren’t comfortable with social media or computers? Trust me; your families would be very interested in hearing that you’re okay.

I hope you never experience any crisis even remotely close to the tragedy at the Navy Yard. But if there is ever an emergency, you’ll be grateful if you made a plan beforehand.

NYT Website Disrupted for Hours

September 13, 2013 Kim Smiley

By Kim Smiley

On Tuesday, August 27, 2013 the New York Times website went dark for several hours after being attacked by a well-known group of hackers. Reports of hacked websites are becoming increasingly common and the New York Times was just one of many recent victims.

A Cause Map, or visual root cause analysis, can be used to analyze the recent attack on the New York Times website. A Cause Map lays out the many causes that contribute to an issue in an intuitive format that illustrates the cause-and-effect relationships. A Cause Map is useful for understanding all the causes involved and can help when brainstorming solutions. To see a Cause Map of this example, click on “Download PDF” above.

Some details of how the attack was done have been released, as documented on the Cause Map. The New York Times website itself was not technically hacked, but traffic was redirected away from the legitimate website to another web domain. To pull off this feat, hackers changed the domain name records for the New York Times website after acquiring the user name and password of an employee at the domain name registrar company. The employee inadvertently provided the information to the hackers by responding to a phishing email asking for personal information.

The email sent by the hackers looked legitimate enough to fool the employee.

So why did hackers target the New York Times in the first place? The answer is that the New York Times is one of many western media outlets to be targeted by the Syrian Electronic Army (S.E.A.), which has claimed responsibility for the attack. The S.E.A. supports President Bashar al-Assad of Syria and is generally unhappy with the way the events in Syria have been portrayed in the West.

So the next logical question is how do you protect yourself from a phishing scheme? The first step is awareness. Pretty much everybody who uses email can expect to receive some suspicious emails. A few things to look out for: attachments, links, misspellings, and a mismatched “from” field or subject line. Also any alarming language should be a red flag. For example, an email from your credit card company warning you that your account will be closed unless you take immediate action is probably not the real deal. A good rule of thumb is to never respond to any email with personal information or to click on links in emails. If you think a request for action may be real, either call the company or open a new web browser window and type in the company’s web address. It’s best to delete any suspicious emails immediately.
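The warning signs listed above can be sketched as a simple checklist in code. This is a rough illustration under stated assumptions, not a real spam filter: the `Email` fields, the word lists, and the function name `red_flags` are all hypothetical, and real phishing messages often evade checks this simple.

```python
from dataclasses import dataclass, field

# Hypothetical word lists, chosen for illustration only.
ALARM_PHRASES = ("immediate action", "account will be closed", "verify your password")
COMMON_MISSPELLINGS = ("acount", "pasword", "recieve", "verfy")


@dataclass
class Email:
    from_address: str      # address the message was actually sent from
    claimed_domain: str    # domain of the company the message claims to be from
    subject: str
    body: str
    attachments: list = field(default_factory=list)


def red_flags(msg: Email) -> list:
    """Return which of the warning signs from the post this message shows."""
    flags = []
    text = f"{msg.subject} {msg.body}".lower()
    if msg.attachments:
        flags.append("attachment")
    if "http://" in text or "https://" in text:
        flags.append("link")
    if any(word in text for word in COMMON_MISSPELLINGS):
        flags.append("misspelling")
    sender_domain = msg.from_address.rsplit("@", 1)[-1].lower()
    if sender_domain != msg.claimed_domain.lower():
        flags.append("mismatched from field")
    if any(phrase in text for phrase in ALARM_PHRASES):
        flags.append("alarming language")
    return flags


suspicious = Email(
    from_address="support@examp1e-card.test",
    claimed_domain="example-card.test",
    subject="Your acount will be closed",
    body="Take immediate action: https://examp1e-card.test/login",
)
print(red_flags(suspicious))
```

A message that raises no flags is not necessarily safe. The checklist only makes the rule of thumb concrete: the more of these signs a message shows, the stronger the case for deleting it and contacting the company directly.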

This example is also a good reminder to be aware that websites can get hacked. A great example of this is when the S.E.A. hacked the Associated Press’s Twitter feed last April and used it to announce (falsely) that the White House had been bombed. That one tweet is estimated to have caused a $136 billion loss in the stock markets as people responded to the news. In general, it is probably good to be skeptical about anything shocking you read online until the information is confirmed.

What Happens When a Copy Isn’t a Copy?

September 6, 2013 Kim Smiley

By Kim Smiley

Think of how many documents are scanned every day. Imagine how important some of these pieces of paper are, such as invoices, property records, and medical files. Now try to picture what might happen if the copies of these documents aren’t true copies. This is exactly the scenario that Xerox was recently facing.

It recently came to light that some copies of scanned documents were altered by the scanning process. Specifically, some scanner/copier machines changed numbers on documents. This issue can be analyzed by building a Cause Map, an intuitive, visual format for performing a root cause analysis. The first step in the Cause Mapping process is to fill in an Outline with the basic background information on an issue. Additionally, the impacts to the overall goals are documented on the Outline to help clarify the severity of any given issue. In this example, the customer service goal is impacted because the scanners weren’t operating as expected. There is also a potential impact to the overall economic goal because the altered documents could result in any number of issues. There is also an impact because of the labor needed to investigate and fix the problem.

After completing the Outline, the next step is to ask “why” questions to build the Cause Map. Why weren’t the scanners operating as expected? This happened because the scanners were changing some documents during the scanning process. Scanners use software to help interpret the original documents and Xerox has stated that the problem happened because of a software bug. Testing showed that the number substitutions were more likely to occur when the scanners were set to lower quality/higher compression because of the specific software used for these settings. Testing also showed that the error was more likely to occur when scanning documents that were more difficult to read, such as those with small fonts or those that had already been copied multiple times.
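The post does not describe the scanner software’s internals. Purely as a toy illustration of how higher compression could swap one digit for another, assume a scheme that stores each character shape once and reuses a stored shape whenever a new one looks similar enough. The bitmaps, the `max_diff` threshold, and the function names below are invented for the example.

```python
# Hypothetical 3x5 bitmaps for three digits ("1" marks a dark pixel).
GLYPHS = {
    "6": ("111", "100", "111", "101", "111"),
    "8": ("111", "101", "111", "101", "111"),
    "1": ("010", "110", "010", "010", "111"),
}


def pixel_diff(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def compress(text, max_diff):
    """Replace each glyph with the first stored glyph within max_diff pixels."""
    dictionary = []   # (label, bitmap) pairs stored so far
    output = []
    for ch in text:
        bitmap = GLYPHS[ch]
        for label, stored in dictionary:
            if pixel_diff(bitmap, stored) <= max_diff:
                output.append(label)          # reuse the stored shape
                break
        else:
            dictionary.append((ch, bitmap))   # no close match: store a new shape
            output.append(ch)
    return "".join(output)


print(compress("6818", max_diff=0))  # exact matching only
print(compress("6818", max_diff=1))  # looser matching, higher compression
```

With exact matching the digits come back unchanged, while the looser setting reuses the stored “6” for every “8” because the two shapes differ by a single pixel. In this toy model, small fonts and repeatedly copied pages would make shapes harder to tell apart, which is in line with the test results described above.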

Xerox had been aware of the potential for number substitution at lower quality settings, but didn’t appear to expect it to occur at factory settings (which was found to be very unlikely, but possible). A notice that stated that character substitutions were possible appeared on the scanners when lower resolution settings were selected and was included in some manuals, but this approach seems to have been ineffective since many users were caught unaware by this issue.

After a Cause Map has been built with enough detail to understand the issue, it can be used to help develop solutions. In this example, Xerox developed a software patch that corrected the error. Xerox also posted several blogs on their website to keep customers informed about the issue and worked with users to ensure that the patch was successful in correcting the error.

To see a high level Cause Map of this issue, click on “Download PDF” above.