All posts by ThinkReliability Staff

ThinkReliability specializes in applying root cause analysis to solve all types of problems. We investigate errors, defects, failures, losses, outages and incidents in a wide variety of industries. Our Cause Mapping method of root cause analysis captures the complete investigation, along with the best solutions, in an easy-to-understand format. ThinkReliability provides investigation services and root cause analysis training to clients around the world and is considered the trusted authority on the subject.

When You Call Yourself ThinkReliability…

By ThinkReliability Staff

While I was holding forth about the 1989 Valdez oil spill, one of those ubiquitous internet fairies decided that I did not really need the network connection at my remote office.  Sadly, this meant that the attendees on my Webinar had to listen to me speaking without seeing the pretty diagrams I had made for the occasion (after a short delay to switch audio modes).

Though I have all sorts of redundancies built into my Webinar presentations (seriously, I use a checklist every time), I had not prepared for the complete loss of network access, which is what happened during my March 20th, 2014 Webinar.  I’m not going to use the term “root cause”, because I still had another plan . . . (yep, that failed, too).

For our mutual amusement (and because I get asked for this all the time), here is a Cause Map, or visual root cause analysis – the very method I was demonstrating during the failure – of what happened.

First we start with the what, when and where.  No who because blame isn’t the point, though in this case I will provide full disclosure and clarify that I am, in fact, writing about myself.  The Webinar in question was presented on March 20, 2014 at 2:00 PM EST (although to my great relief the issues didn’t start until around 2:30 pm).  That little thorn in my side? It was the loss of a network connection at the Wisconsin remote office (where I typically present from).  I was using Citrix Online’s GoToWebinar© program to present a root cause analysis case study of the Valdez oil spill online.

Next we capture the impact to the organization’s (in this case, ThinkReliability’s) goals.  Luckily, in the grand scheme of things, the impacted goals were pretty minor.  I annoyed a bunch of customers who didn’t get to see my slides, and I had to schedule an additional Webinar.  I also spent some time following up with those who were impacted and writing this blog.

Next we start with the impacted goals and ask “Why” questions.  The customer service goal was impacted because of the interruption in the Webinar.  GoToWebinar© (as well as other online meeting programs) has two parts: audio and visual.  I temporarily lost audio as I was using the online option (VOIP), which I use as a default because I like my USB headset better than my wireless headset.  The other option is to dial in using the phone.  As soon as I figured out I had lost audio, I switched to phone and was able to maintain the audio connection until the end of the Webinar (and after, for those lucky enough to hear me venting my frustration at my office assistant).

In addition to losing audio, I lost the visual screen-sharing portion of the Webinar.   Unlike audio, there’s only one option for this.  Screen sharing occurs through an online connection to GoToWebinar©.  Loss of that connection means there’s a problem with the GoToWebinar© program, or my network connection.  (I’ve had really good luck with GoToWebinar; over the last 5 years I have used the program at least weekly with only two connection problems attributed to Citrix.)  At this point I started running through my troubleshooting checklist.  I was able to reconnect to audio, so it seemed the problem was not with GoToWebinar©.  I immediately changed from my wired router connection to wireless, which didn’t help.  Meanwhile my office assistant checked the router and determined that the router was not connected to the network.

You will quickly see that at this point I reached the end of my expertise.  I had my assistant restart the router, which didn’t work, at least not immediately.  At this point, my short-term connection attempts (“immediate solutions”) were over.  Router troubleshooting (beyond the restart) or a call to my internet provider were going to take far longer than I had on the Webinar.

Normally there would have been one other possibility to save the day.  For online presentations, I typically have other staff members online to assist with questions and connection issues, who have access to the slides I’m presenting.  That presenter (and we have done this before) could take over the screen sharing while I continued the audio presentation.  However, the main office in Houston was unusually short-staffed last week (which is to say most everyone was out visiting cool companies in exciting places).  And (yes, this was the wound that this issue rubbed salt in), I had been out sick until just prior to the Webinar.  I didn’t do my usual coordination of ensuring I had someone online as my backup.

Because my careful plans failed me so completely, I scheduled another Webinar on the same topic.  (Click the graphic below to register.)  I’ll have another staff member (at another location) ready online to take over the presentation should I experience another catastrophic failure (or a power outage, which did not occur last week but would also result in complete network loss to my location).   Also, as was suggested by an affected attendee, I’ll send out the slides ahead of time.  That way, even if this exact series of unfortunate events should recur, at least everyone can look at the slides while I keep talking.

To view my comprehensive analysis of a presentation that didn’t quite go as planned, please click “Download PDF” above.  To view one of our presentations that will be “protected” by my new redundancy plans, please see our upcoming Webinar schedule.

Microsoft Withdrawing Support for Windows XP, Still Used by 95% of World’s 2.2 Million ATMs

By ThinkReliability Staff

On April 8, 2014, Microsoft will withdraw support for its XP operating system.  While this isn’t new news (Microsoft made the announcement in 2007), it’s quickly becoming an issue for the world’s automated teller machines (ATMs).  Of the 2.2 million ATMs in the world, 95% run Windows XP.  Of these, only about a third will be upgraded by the April 8th deadline.

The banks that operate these ATMs then face a choice: upgrade to a newer operating system (which will have to be done eventually anyway), pay for extended support, or go it alone.  We can look at the potential consequences of each decision – and the reasons behind the choices – in a Cause Map, a visual form of root cause analysis.

First we look at the consequences, or the impacts to the goals.  The customer service goal is impacted by the potential exposure to security threats.  (According to Microsoft, it’s more than just potential.  Says Timothy Rains, Microsoft’s Director of Trustworthy Computing, “The probability of attackers using security updates for Windows 7, Windows 8, Windows Vista to attack Windows XP is about 100 per cent.”)  Required upgrades, estimated by security experts to cost each bank in the United Kingdom $100M (US), impact the production/schedule and property/equipment goals.   Lastly, if implemented, extended service/support contracts will impact the labor/time goal.  Though many banks have announced they will extend their contracts, the costs of such an extension are unclear and likely vary with particular circumstances.

As mentioned above, banks have a choice.  They can upgrade immediately, as will be required at some point anyway.  However, it’s estimated that most banks worldwide (about two-thirds) won’t make the deadline.  They will then continue to operate on XP, with or without an extended service/support contract.

Operating without an extended contract will create a high vulnerability to security risks – hackers and viruses.  It has been surmised that hackers will take security upgrades developed for other operating systems and reverse engineer them to find weaknesses in XP.  The downside of the extended contracts is the cost.

Given the risk of security issues with maintaining XP as an operating system, why haven’t more banks upgraded in the 7 years since Microsoft announced it would be withdrawing support?  There are multiple reasons.  First, because of the huge number of banks that still need to upgrade, experts available to assist with the upgrade are limited.  Many banks use proprietary software based on the operating system, so it’s not just the operating system that would need to be upgraded – so would many additional programs.

The many changes that banks have been dealing with as a result of the financial crisis may have also contributed to the delay.  (For more on the financial crisis, see our example page.)  Banks are having trouble implementing the many changes within the time periods specified.  Another potential cause is that banks may be trying to perform many upgrades together.  For example, some ATMs will move to a new operating system and begin accepting chip cards as part of the same upgrade.  (For more about the move towards chip cards, see our previous blog.)

Some banks are just concerned about such a substantial change.  “I ask these companies why they are using old software, they say ‘Come on, it works and we don’t want to touch that,'” says Jaime Blasco, a malware researcher for AlienVault.  The problem is, soon it won’t be working.

To view the Outline and Cause Map, please click “Download PDF” above.  Or click here to read more.

Cleaning up Fukushima Daiichi

By ThinkReliability Staff

The nuclear power plants at Fukushima Daiichi were damaged beyond repair during the earthquake and subsequent tsunami on March 11, 2011.  (Read more about the issues that resulted in the damage in our previous blog.)  The release of radioactivity as a result of these issues is ongoing and will end only after the plants have been decommissioned.  Decommissioning the nuclear power plants at Fukushima Daiichi will be a difficult and time-consuming process.  Both the process and the equipment being used are essentially being developed on the fly for this particular purpose.

Past nuclear incidents offer no help.  The reactor at Chernobyl which exploded was entombed in concrete, not dismantled as is the plan for the reactors at Fukushima Daiichi.  The reactor at Three Mile Island which overheated was defueled, but the pressure vessel and buildings in that case were not damaged, meaning the cleanup was of an entirely different magnitude.  Lake Barrett, the site director during the decommissioning process at Three Mile Island and a consultant on the Fukushima Daiichi cleanup, says that nothing like Fukushima has ever happened before.

An additional challenge?  Though the reactors have been shut down since March 2011, the radiation levels remain too high for human access (and will be for some time).  All access, including for inspection, has to be done by robot.

The decommissioning process involves 5 basic steps (though the completion of them will take decades).

First, an inspection of the site must be completed using robots.  These inspection robots aren’t your run-of-the-mill Roombas.  Because of the steel and concrete structures involved with nuclear power, wireless communication is difficult.  One type of robot used to survey got stuck in reactor 2 after its cable was entangled and damaged.   The next generation of survey robots unspools cable, takes up slack when it changes direction and plugs itself in for a recharge.  This last one is particularly important: not only can humans not access the reactor building, they can’t handle the robots after they’ve been in there.  The new robots should be able to perform about 100 missions before component failure, pretty impressive for access in a site where the hourly radiation dose can be the same as a cleanup worker’s annual limit (54 millisieverts an hour).

Second, internal surfaces will be decontaminated.  This requires even more robots, with different specialties.  One type of robot will clear a path for another type, which will be outfitted with water and dry ice to blast at surfaces, removing the outer layer and the radiation with it.  The robots will then vacuum up and remove the radioactive sludge from the building.  The resulting sludge will have to be stored, though the plan for that storage is not yet clear.

Third, spent fuel rods will be removed, further reducing the radiation within the buildings.  A shielded cask is lowered with a crane-like machine, which then packs the fuel assemblies into the cask.  The cask is then removed and transported to a common pool for storage.  (The fuel assemblies must remain in water due to the decay heat still being produced.)

Fourth, radioactive water must be contained.  An ongoing issue with the Fukushima Daiichi reactors is the flow of groundwater through contaminated buildings.  (Read more about the issues with water contamination in a previous blog.)  First, the flow of groundwater must be stopped.  The current plan is to freeze soil to create a wall of ice and put in a series of pumps to reroute the water.    Then, the leaks in the pressure vessels must be found and fixed.  If the leaks can’t be fixed, the entire system may be blocked off with concrete.

Another challenge is what to do with the radioactive water being collected.  So far, over 1,000 tanks have been installed.  But these tanks have had problems with leaks.    Public sentiment is against releasing the water into the ocean, though the contamination is low and of a form that poses a “negligible threat”.  The alternative would be using evaporation to dispose of the water over years, as was done after Three Mile Island.

Finally, the remaining damaged nuclear material must be removed.  More mapping is required to determine the location of the melted fuel.  This fuel must then be broken up using long drills capable of withstanding the radiation that will still be present.  The debris will then be placed in more shielded casks and taken to a storage facility, the location of which is yet to be determined.  The operator of the plant estimates this process will take at least 20 years.

To view the Process Map laid out visually, please click “Download PDF” above.  Or click here to read more.

Volunteer Killed in Helicopter Fall

By ThinkReliability Staff

On September 12, 2013, the California National Guard invited Shane Krogen, the executive director of the High Sierra Volunteer Trail Crew and the U.S. Forest Service’s Regional Forester’s Volunteer of the Year for 2012, to assist in the reclamation effort of a portion of the Sequoia National Forest where a marijuana crop had been removed three weeks earlier.  Because the terrain in the area was steep, the team was to be lowered from a helicopter into the area.

After Mr. Krogen left the helicopter to be lowered, an equipment failure caused the volunteer to fall 40 feet.  He later died from blunt force trauma injuries. The Air Force’s report on the incident, which was released in January, determined that Mr. Krogen had been improperly harnessed.  The report also found that he should have never been invited on the flight.

To show the combination of factors that resulted in the death of the volunteer, we can capture the information from the Air Force report in a Cause Map, or visual root cause analysis.  First it’s important to determine the impacts to the goals.  In this case, Mr. Krogen’s death is an impact to the safety goal, and of primary consideration.  Additionally, the improper harnessing can be considered an impact to the customer service goal, as Mr. Krogen was dependent on the expertise of National Guard personnel to ensure he was properly outfitted.  Because it was contrary to Air Force regulations, which say civilian volunteers cannot be passengers on counter-drug operations, the fact that Mr. Krogen was allowed on the flight can be considered an impact to the regulatory goal.  Lastly, the time and resources spent performing the investigation impact the labor goal.

Beginning with the impacted goal of primary concern – the safety goal – asking “Why” questions allows for the determination of the causes that resulted in the impacted goal (the end effect).   In this case, Mr. Krogen died of blunt force trauma injuries from falling 40 feet.  He fell 40 feet because he was being lowered from a helicopter and his rigging failed.  He was being lowered from a helicopter to aid in reclamation efforts, and because the terrain was too steep for the helicopter to land.

The rigging failure resulted from the failure of a D-ring which was used to connect the harness to the hoist.  Specifically, the D-ring was not strong enough to handle the weight of a person being lowered on it.  This is because the hoist was connected to Mr. Krogen’s personal, plastic D-ring instead of a government-issued, load-bearing metal D-ring.  After Mr. Krogen mistakenly connected the wrong D-ring, his rigging was checked by National Guard personnel.  The airman doing the checking didn’t notice the mistake, likely because of the proximity of the two D-rings and the fact that Mr. Krogen was wearing his own tactical vest, loaded with equipment, over the harness to which the metal D-ring was connected.

I think Mark Thompson sums up the incident best in his article for Time:   “The death of Shane Krogen, executive director of the High Sierra Volunteer Trail Crew, last summer in the Sequoia National Forest, just south of Yosemite National Park, was a tragedy. But it was an entirely preventable one.  It stands as a reminder of how dangerous military missions can be, and on the importance of a second set of eyes to make sure that potentially deadly errors, whenever possible, are reviewed and reversed before it is too late.”

To view the Outline and Cause Map, please click “Download PDF” above.  Or click here to read more.

Improper Fireplace Installation Results in Firefighter’s Death

By Mark Galley

While battling a fire in a mansion in Hollywood Hills, California on February 16, 2011, a firefighter was killed (and 5 others seriously injured) when the roof collapsed.  As a result of the firefighter’s death, the owner/ architect of the home was convicted of involuntary manslaughter.  He is scheduled to serve 6 months and then will be deported.

The fire wasn’t arson, but the owner/ architect was considered responsible due to the installation of an outdoor-only fireplace on the top floor of his home.  Because of the legal issues surrounding this case, it’s important to carefully determine and clearly present all of the causes that led to the fire and the firefighter’s death.

We can capture information related to this issue within a Cause Map, or visual root cause analysis.  A Cause Map begins with the impacted goals, allowing a clear accounting of the effects from the issue.  The firefighter’s death is an impact to the safety goal, as are the injuries to the other firefighters.  Impacts to the safety goal are the primary focus of any investigation, but we will capture the other impacted goals as well.  In this case, the regulatory goal was impacted due to the non-compliant fireplace, the non-compliance being missed during inspection, and the prison sentence for the architect/owner.  Additionally the loss of the home and the time and effort put into firefighting and the subsequent trial impact the property and labor/time goals.

Once the impacts to the goals are determined, asking “why” questions begins to develop the cause-and-effect relationships that resulted in those impacts.  A Cause Map can start simple – in this example, the safety goal was impacted due to the death of a firefighter.  Why? Because the ceiling collapsed.  Why? Because the house was on fire.  Why? Because heat ignited flammable building materials.

Though this analysis is accurate, it’s certainly not complete.  More detail can be added to the Cause Map until the issue is adequately understood and all causes are included in the analysis.  Detail can be added by asking more “why” questions – the heat ignited flammable building materials because an outdoor-only fireplace was improperly used inside the house.  Causes can also be added by identifying causes that both had to occur in order for the effect to happen.  The firefighter was killed when the ceiling collapsed AND the firefighter was beneath the ceiling, fighting the fire.  Had the ceiling collapsed but the firefighters not been inside, the firefighter would not have been killed by the ceiling collapse.

Detail can also be added between causes to provide more clarity.  In this case, the ceiling collapse was not directly caused by high heat.  Instead, the high heat activated and melted the sprinkler system, resulting in a buildup of water that caused the ceiling collapse.  The other goals that were impacted should also be added to the Cause Map, which may result in more causes.  In this case, the improperly installed fireplace was missed by the building inspector, which is an impact to the regulatory goal.  The reason it was missed was debated during the trial, but changes to the inspection process may result that would make this type of incident less likely, ideally reducing the risk to firefighters and home owners.
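The cause-and-effect structure described above can also be captured outside of diagramming software.  The following is a purely illustrative sketch (not part of the Cause Mapping method or any ThinkReliability tool, and with cause names simplified from the fireplace example): each effect maps to the causes that all had to occur, and a small function walks the “Why” chain from an impacted goal.

```python
# Illustrative sketch only: a minimal model of cause-and-effect relationships.
# Each effect maps to the causes that ALL had to occur (an "AND" relationship).
# Cause names are simplified from the fireplace example in this post.

cause_map = {
    "firefighter killed": ["ceiling collapsed", "firefighter beneath ceiling"],
    "ceiling collapsed": ["water buildup from melted sprinkler system"],
    "water buildup from melted sprinkler system": ["high heat from fire"],
    "high heat from fire": ["outdoor-only fireplace used indoors"],
    "firefighter beneath ceiling": ["fighting the fire inside the house"],
}

def ask_why(effect, depth=0):
    """Print the cause-and-effect chain by repeatedly asking 'Why?'."""
    print("  " * depth + effect)
    for cause in cause_map.get(effect, []):
        ask_why(cause, depth + 1)

ask_why("firefighter killed")
```

Reading the printed chain from the bottom up gives the same left-to-right flow as the visual Cause Map: each deeper line is an answer to “Why?” for the line above it.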

An incident analysis should have enough detail to lead to solutions that will reduce the risk of recurrence of the impacted goals.  As I mentioned previously, one solution from the perspective of the building inspectors may be to look specifically for fireplace issues that could lead to these types of fires.  Ideally, a way could be developed to determine whether a sprinkler system is malfunctioning and leading to water collection, which could reduce the risk to firefighters.  For homeowners, this incident should stand as a reminder that outdoor-only heat sources such as fireplaces are outdoor-only for a reason.

Department of Energy Cyber Breach Affects Thousands, Costs Millions

By ThinkReliability Staff

Personally identifiable information (PII), including social security numbers (SSNs) and banking information, for more than 104,000 individuals currently or formerly employed by the Department of Energy (DOE) was accessed by hackers from the Department’s Employee Data Repository database (DOEInfo) through the Department’s Management Information System (MIS).  A review by the DOE’s  Inspector General in a recently released special report analyzes the causes of the breach and provides recommendations for preventing or mitigating future breaches.

The report notes that, “While we did not identify a single point of failure that led to the MIS/DOEInfo breach, the combination of the technical and managerial problems we observed set the stage for individuals with malicious intent to access the system with what appeared to be relative ease.”  Because of the complex interactions between the systems, personnel and safety precautions (or lack thereof) that led to system access by hackers, a diagram showing the cause-and-effect relationships can be helpful.  Here those relationships – and the impacts they had on the DOE and DOE personnel – are captured within a Cause Map, a form of visual root cause analysis.

In this case, the report uncovered concerns that other systems were at risk for compromise – and that a breach of those systems could impact public health and safety.  The loss of PII for more than 104,000 personnel can be considered an impact to the customer service goal.  The event (combined with two other cyber breaches since May 2011) has resulted in a loss of confidence in cyber security at the Department, an impact to the mission goal.  Affected employees were given 4 hours of authorized leave to deal with potential impacts from the breach, impacting both the production and labor goals.  (Labor costs for recovery and lost productivity are estimated at $2.1 million.)  The Department has paid for credit monitoring and established a call center for the affected individuals, at an additional cost of $1.6 million, bringing the total cost of this event to $3.7 million.  With an average of one cyber breach a year for the past 3 years, the Department could be looking at multi-million dollar annual costs related to cyber breaches.

These impacts to the goals resulted from hackers gaining access to unencrypted PII.  Hackers were able to gain access to the system, which was unencrypted and contained significant amounts of PII, as this database was the central repository for current and former employees.  The PII within the database included SSNs, which were used as identifiers even though this is contrary to Federal guidance.  There appeared to have been no effort to remove SSNs as identifiers per a 5-year-old requirement, for reasons that are unknown.  The reasons for the system remaining unencrypted appear to have been based on performance concerns, though these were not well documented or understood.

Hackers were able to “access the system with what appeared to be relative ease” because the system had inadequate security controls (only a user name and password were required for access) and could be directly accessed from the internet, presumably in order to accomplish necessary tasks.   In the report, the ability to access the system was directly related to “continued operation with known vulnerabilities.”  This concept may be familiar to many at a time when most organizations are trying to do more with less.   The delay in addressing the vulnerabilities that contributed to the breach was attributed to a perceived lack of authority to restrict operation, unclear responsibility for applying patches, vulnerabilities that went unidentified because of the limited development, testing, troubleshooting and ongoing scanning of the system, and cost.

According to the report, “The Department should have considered costs associated with mitigating a system breach … We noted the Department procured the updated version in March 2013 for approximately $4,200. That amount coupled with labor costs associated with testing and installing the upgrade were significantly less than the cost to mitigate the affected system, notify affected individuals of the compromise of PII and rebuild the Department’s reputation.”

The updated system referred to was purchased in March 2013, though the system had not been updated since early 2011 and core support for the application upon which the system was built ended in July 2012.  Additionally, “the vulnerability exploited by the attacker was specifically identified by the vendor in January 2013.”  The update, though purchased in March, was not installed until after the breach occurred.  Officials stated that a decision to upgrade the system had not been made until December 2012 because it had not reached the end of its useful life.  The Inspector General’s note about considering the costs of mitigating a system breach is poignant, comparing the several-thousand-dollar cost of an on-time upgrade to the several-million-dollar cost of mitigating a breach.   However, like the DOE, many companies find themselves in the same situation, cutting costs on prevention and paying exponentially higher costs to deal with the inevitable problem that will arise.

To view the Outline, Cause Map and recommended solutions based on the DOE Inspector General’s report, please click “Download PDF” above.  Or click here to read more.

Pilot Response to Turbulence Leads to Crash

By ThinkReliability Staff

All 260 people onboard Flight 587, plus 5 on the ground, were killed when the plane crashed into a residential area on November 12, 2001.  Flight 587 took off shortly after another large aircraft and experienced wake turbulence.  According to the NTSB, the pilot’s overuse of the rudder mechanism, which had been redesigned and as a result was unusually sensitive, resulted in such high stress that the vertical stabilizer separated from the body of the plane.

This event is an example of an Aircraft Pilot Coupling (APC) event.  According to the National Research Council, “APC events are collaborations between the pilot and the aircraft in that they occur only when the pilot attempts to control what the aircraft does.  For this reason, pilot error is often listed as the cause of accidents and incidents that include an APC event.  However, the [NRC] committee believes that the most severe APC events attributed to pilot error are the result of the adverse APC that misleads the pilot into taking actions that contribute to the severity of the event.  In these situations, it is often possible, after the fact, to analyze the event carefully and identify a sequence of actions the pilot could have taken to overcome the aircraft design deficiencies and avoid the event.  However, it is typically not feasible for the pilot to identify and execute the required actions in real time.”

This crash is a case where it is tempting to chalk up the accident to pilot error and move on.  However, a more thorough investigation of causes identifies multiple issues that contributed to the accident and, most importantly, multiple opportunities to increase safety for future pilots and passengers.  The impacts to the goals, causes of these impacts, and possible solutions can be organized visually in cause-and-effect relationships by using a Cause Map.  To view the Outline and Cause Map, please click “Download PDF” above.

The wake turbulence that initially affected the flight was due to the small separation distance between the flight and a large plane that took off 2 minutes prior (the separation required by the FAA).  This led to a recommendation to re-evaluate the separation standards, especially for extremely large planes.  In the investigation, the NTSB found that the training provided to pilots on this particular type of aircraft was inadequate, especially because changes to the aircraft’s flight control system had rendered the rudder control system extremely sensitive.  This combination is believed to be what led to the overuse of the rudder system, producing the stress on the vertical stabilizer that resulted in its detachment from the plane.  Specific formal training for pilots based on the flight control system for this particular plane was incorporated, as was an evaluation of changes to the flight control system and a requirement for handling evaluations when design changes are made to flight control systems for previously certified aircraft.  A caution box related to rudder sensitivity was incorporated on these planes, as was a detailed inspection to verify stabilizer-to-fuselage and rudder-to-stabilizer attachments.  An additional inspection was required for planes that experience extreme in-flight lateral loading events.  Lastly, the airplane upset recovery training aid was revised to assist pilots in recovering from upsets such as the one in this event.

Had this investigation been limited to a discussion of pilot error, revised training may have been developed, but it’s likely that a discussion of the causes that led to the other solutions that were recommended and/or implemented as a result of this accident would not have been incorporated.  It’s important to ensure that incident investigations address all the causes, so that as many solutions as possible can be considered.

Utah Fights for National Parks

By ThinkReliability Staff

Beginning on October 1, 2013 with the failure to approve spending, the US government entered a partial shutdown that included the complete closure of the National Parks, as specified in the National Park Service Contingency Plan.  While the government shutdown had far-reaching effects, both across industries and geographically, areas of Utah were hit particularly hard by the closure of multiple National Parks in the area. The shutdown finally ended on October 17, when the government reached a deal to reopen.

A large proportion of Utah businesses are dependent on revenue brought in from tourists visiting the multiple Federal lands in the state, which include National Parks, National Monuments and National Recreation Areas.  A total of five counties in Utah declared a state of emergency, saying they were losing up to $300,000 a day.  San Juan County, the last to declare a state of emergency, went a step further and decided it would reopen the parks itself, using local personnel to provide necessary emergency response and facilities for park visitors.

On October 10, the state of Utah came to an agreement with the Department of the Interior to pay for the Park Service to reopen the parks for up to 10 days at a cost of $166,572 a day.  (It is possible, though not automatic, that the state will be reimbursed for these costs after funding is restored.)  Luckily, a “practical and temporary solution” (as described by Secretary of the Interior Sally Jewell) was found before county officials had to resort to what they described as “civil disobedience”.  (Trespassing in a National Park can result in a citation that could lead to fines or jail terms.)

This situation mirrors one frequently found on a smaller scale in all workplaces.  Concerned employees find themselves in circumstances that they believe are not in the best interest of their company or customers.  If support for change is not provided by management, these employees will develop workarounds (like illegally reopening a National Park to allow tourists to enter).  Sometimes workarounds are actually a more effective way of completing work tasks, but they can also lead to unintended consequences that can be disastrous.

This is why the most effective work processes are developed with the experience and insight of employees at all levels.  Taking their concerns into account when procedures are developed, and on an ongoing basis, will reduce the use of potentially risky workarounds and can increase the success of all of an organization’s goals.

To view the Outline, Cause Map, and considered solutions, please click “Download PDF” above.  Or click here to read more.

The Salvage Process of Costa Concordia

By ThinkReliability Staff

On September 16, 2013, the fatally stricken Costa Concordia was lifted upright (a process known as “parbuckling”) after the most expensive salvage operation ever undertaken, involving the largest ship ever salvaged. The ship ran aground off the coast of Italy on January 13, 2012 (see our previous blog about the causes of the ship running aground) and had been lying on its side for the 20 months since.

The ship’s grounding had immediate, catastrophic impacts, including the deaths of 32 people. However, it also had longer term impacts, mainly pollution from the fuel, sewage and other hazardous materials stored aboard the ship. It was determined that the best way to minimize the leakage from the ship would be to return it upright and tow it to port, where the onboard waste could be emptied and disposed of and the ship broken up for scrap.

Because a salvage operation of this magnitude (due to the size and location of the ship) had never been attempted, careful planning was necessary. Processes like this salvage operation can be described in a Process Map, which visually diagrams the steps that need to be taken for a process to be completed successfully. A Process Map differs from a Cause Map, which visually diagrams cause-and-effect relationships to show the causes that led to the impacts (such as the deaths and pollution). Whereas a Cause Map reads backwards in time (the impacted goals result from the causes, which generally must precede those impacts), a Process Map reads from left to right along with time. (Step 1 is to the left of, and must be performed before, Step 2.) In both cases, arrows indicate the direction of time.

Like a Cause Map, a Process Map can be built in varying levels of detail. In a complex process, many individual steps will consist of more detailed steps. Both a high-level overview of a process, as well as a more detailed breakdown, can be useful when developing a process. Process Maps can be used as part of the analysis step of an incident investigation – to show which steps in a process did not go well – or as part of the solutions – to show how a process developed as a solution should be implemented.
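To make the forward-in-time reading and the varying levels of detail concrete, here is a purely illustrative sketch (the step names are simplified from the salvage plan described in the following paragraphs, not an official process document): high-level steps are listed in time order, and any step can be expanded into more detailed sub-steps.

```python
# Illustrative sketch only: a Process Map reads forward in time,
# and each high-level step can be broken into more detailed sub-steps.
# Step names are simplified from the Costa Concordia salvage described below.

process_map = [
    ("Prepare the site for leakage", ["Deploy oil booms with sponges and skirts"]),
    ("Stabilize the wreck", ["Install underwater platform", "Install turrets"]),
    ("Parbuckle the ship upright", ["Attach cables", "Winch the hull onto the platform"]),
    ("Refloat the ship", []),
    ("Tow to port and dismantle", []),
]

for number, (step, sub_steps) in enumerate(process_map, start=1):
    print(f"Step {number}: {step}")
    for sub in sub_steps:
        print(f"    - {sub}")
```

Printed top to bottom, the steps read in the same left-to-right order as the visual Process Map, with the sub-steps providing the more detailed breakdown.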

In the example of the salvaging of the Costa Concordia, we use the Process Map for the latter. The salvaging process is part of the solutions – how to remove the ship while minimizing further damage and pollution. This task was not easy – uprighting the ship (only the first step in the salvage process) took 19 hours, involved 500 crewmembers from 26 countries and cost nearly $800 million. Other options used for similar situations included blowing up the ship or taking it apart on-site. Because of the hazardous substances onboard – and the belief that two bodies are still trapped under or inside the ship – these options were considered unacceptable.

Instead, a detailed plan was developed: prepare for leakage with oil booms holding sponges and skirts, then install an underwater platform and 12 turrets to aid in the parbuckling and hold the ship upright. The ship was winched upright using 36 cables and is being held steady on the platform with computer-controlled chains until spring, when the ship will be floated off the platform and delivered to Sicily to be taken apart.

To view the Process Map in varying levels of detail, please click “Download PDF” above. Or, see the Cause Map about the grounding of the ship in our previous blog.

The loss of the Steamship General Slocum, June 15, 1904

By ThinkReliability Staff

On June 15, 1904, a church group headed out for an excursion on New York City’s East River aboard the Steamship General Slocum.  Approximately half an hour after the ship left the pier, it caught fire.  Despite being only hundreds of yards from shore, the Captain continued at full speed in hopes of beaching at North Brother Island, where a hospital was located.  This served to fan the flames quickly over the entire highly flammable ship, killing many in the inferno.  Most of those who were not killed by the fire drowned, due to the depth of the water and the lack of usable safety equipment, even though the Captain did successfully beach the ship at North Brother Island.

To perform a root cause analysis of the General Slocum tragedy, we can use a Cause Map.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.  First we look at the impacts to the goals.  At least 1,021 of the passengers and crew aboard the General Slocum were killed (though only two of the dead were crew members), and an additional 180 were injured.  Other goals were affected, but the loss of life makes them insignificant by comparison.  The deaths and injuries are impacts to the safety goal.

Passengers drowned because they were in water over their heads with inadequate help or safety equipment.  Passengers were either in the water because they fell when the deck collapsed, or because they jumped into the water trying to avoid the fire.  The water was too deep to stand because only the bow was in shallow water and the passengers could not reach the bow.  This was due to a poor decision on the Master’s part (namely his decision to beach the ship at a severe angle, with the bow in towards the island, instead of parallel to the island, where passengers would have been able to wade to shore.)  Note that the Master himself (and most of the crew) were on the bow side of the ship and were able to (and did) jump off and wade to shore.  The safety equipment, including life preservers, life boats, and life rafts, was mostly unusable due to inadequate upkeep and inadequate inspections.

Passengers (and two crewmembers) were also killed by fire.  Once the fire was started, it spread rapidly and was not put out.    The fire spread rapidly because the ship was highly flammable.  When this ship was constructed, there was no consideration of flammability.  Additionally, the current of air created by the vessel speeding ahead drove the fire across the ship.  The fact that an experienced Master would have allowed this situation was considered misconduct, negligence and inattention to duty – charges for which the Master was later convicted.   The fire was not put out because of inadequate crew effort and insufficient fire-fighting equipment.  The crew effort was inadequate because of a lack of training.  The fire-fighting equipment was insufficient because of inadequate upkeep and inadequate inspections.  (Possibly you are noticing a theme here?)

Although many people have not heard of the General Slocum tragedy, many of its lessons learned have been implemented to make ship travel safer today.  However, many of the solutions were not implemented widely enough, or in time, to prevent the Titanic disaster from occurring eight years later.  (Although nearly as many people were killed on the General Slocum, it is believed that the Titanic disaster is better known because the passengers on Titanic were wealthy, as opposed to the working-class passengers on General Slocum.  It is also surmised that sympathy for the largely German population aboard General Slocum diminished as World War I began.)

In a macabre ending to a gruesome story, ships began replacing their outdated, decrepit life preservers after the investigation on General Slocum.  It was later found that the company selling these new life preservers had hidden iron bars within the buoyant material, in a dastardly attempt to raise their apparent weight.  Unfortunately there were no adequate laws (then) against selling defective life-saving equipment.