Washing Machine Failure

(This week, we are proud to present a Cause Map by a guest blogger, Bill Graham.  Thanks, Bill!)

While completing household chores in the spring of 2010, a housewife found her front-load washing machine stopped with water standing in the clothing.  Inspection of the machine revealed that the washing machine's drain pump had failed.  Because the washer was less than two years old, the family decided to attempt a repair instead of replacing the machine.  A replacement pump was not available locally, so the family found and ordered one from an Internet dealer.  Delivery time for the pump was approximately one week, during which the household laundry chore could not be completed and some of the family's favorite clothing could not be worn because it had not been laundered.  On receiving the new pump, Dad immediately removed the broken pump and found, to his chagrin, a small, thin guitar pick in the suction of the old pump.  Upon discovery of the guitar pick, the family's children reported that the pick had been left in the pocket of the pants that were being washed at the time of the pump's failure.  The new pump was installed and the laundry chore resumed for the household.

While most cause analysis programs would identify the guitar pick as the root cause of the washing machine's failure, Cause Mapping reveals all of the event's contributing factors and shows which measures would most efficiently and cost-effectively avert a similar failure.  For example, if all the family's children aspire to be guitar players, then a top-load washer may better suit their lifestyle while also averting the same mishap.  Or maybe the family should consider wearing pocket-less clothing.  Or maybe all family members should assume a bigger role in completing the household laundry chore.  Whichever solution is chosen, the impact of these and all contributing causes is easily understood when the event is Cause Mapped.

Dissecting Safety Incidents: Using root cause analysis to reveal culture issues

By ThinkReliability Staff

The objective of a root cause analysis investigation is prevention.  The causes of an incident are investigated so that solutions can be developed and implemented to reduce the risk of the same or a similar problem occurring.  The process sounds easy, but in practice it can become more involved.  For example, what do you do when one of the identified causes is "lack of safety culture"?  How exactly do you solve that?

This is the issue that the Washington DC Metrorail (Metro) is currently facing.  The National Transportation Safety Board (NTSB) recently released findings from the investigation of a DC Metro train crash that killed nine last June.  (See our previous blog for more details.)  Predictably, the NTSB findings include several technical issues, including failed track circuits and a lack of adequate testing, but the list of causes also includes items like lack of safety culture and ineffective oversight.

Fortunately, the NTSB also provided recommendations, such as developing a non-punitive safety reporting program, establishing periodic inspections and maintenance procedures for the equipment that failed during this accident, and reviewing the process used to pass along safety and technical information.  One of the important things to notice in this example is that the recommendations are fairly specific, even if the stated cause is a little vague.  Specific solutions are necessary if they are going to be effectively implemented.

If you find yourself at a point in your organization where a cause is identified as "lack of safety culture", it's a good idea to keep asking "why" questions until you identify the specific problems that are causing the issue.  Is the safety information lacking or incorrect?  Is the process that provides the information confusing?  Do the workers need better safety equipment?  Knowing all the details involved allows better solutions to be developed, and better solutions result in lower risks in the future.  Culture is the shared values and practices of the people in an organization, and the Cause Mapping method of root cause analysis gives an organization an effective way to identify "culture gaps" by thoroughly dissecting just one of its incidents.

Spacewalk Delay for Ammonia Leak

By Kim Smiley

Astronauts at the International Space Station ran into problems during a planned replacement of a broken ammonia cooling pump on August 7, 2010.  To remove the broken pump, four ammonia hoses and five electrical cables needed to be disconnected.  One of the hoses could not be removed because of a jammed fitting.  An astronaut was eventually able to disconnect it by hitting the fitting with a hammer, which caused an ammonia leak.

Ammonia is toxic, so the leak impacted both the safety and environmental goals.  Because the broken pump kept one cooling system from working, there was a risk of having to evacuate the space station, should the other system (which was the same age) fail.  This can be considered an impact to the customer service goal.  The repair had to be delayed, which is an impact to the production/schedule goal.  The loss of a redundant system is an impact to the property/equipment goal.  The extended spacewalk is an impact to the labor/time goal.

Once we fill out the outline with the impacts to the goals and information regarding the problem, we can go on to the Cause Map.  The ammonia leak was caused by an unknown leak path and the fitting being removed with a hammer.  The fitting was removed with a hammer because it was jammed and had to be disconnected in order for the broken pump to be replaced.  As we're not aware of what caused the pump to break (this information will likely be discovered now that the pump has been removed), we leave a question mark on the map to fill in later.

The failed cooling pump also caused the loss of one cooling system.  If the other system, which is near the end of its expected life, were to fail, this would require evacuation from the station.
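For readers who like to see the cause-and-effect logic written out, here is a minimal sketch, in plain Python rather than in the Cause Mapping template, of how the relationships described in the last two paragraphs might be recorded and traced.  The wording of the causes is just an illustration, not part of the actual analysis file.

```python
# Illustrative only: the cause-and-effect relationships described above,
# written as a simple effect-to-causes mapping. A "?" marks a cause that
# is not yet known and will be filled in later.
cause_map = {
    "ammonia leak": ["unknown leak path", "fitting hit with hammer"],
    "fitting hit with hammer": ["fitting jammed",
                                "hose had to be disconnected to replace pump"],
    "hose had to be disconnected to replace pump": ["cooling pump failed"],
    "possible evacuation of station": ["loss of one cooling system",
                                       "other (aging) system could also fail"],
    "loss of one cooling system": ["cooling pump failed"],
    "cooling pump failed": ["?"],   # to be determined once the pump is inspected
}

def trace(effect, depth=0):
    """Print each cause behind an effect, following the chain downward."""
    for cause in cause_map.get(effect, []):
        print("    " * depth + f"{effect}  <-  {cause}")
        trace(cause, depth + 1)

trace("ammonia leak")
trace("possible evacuation of station")
```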

To aid in our understanding of this incident, we can create a very simple process map of the pump replacement.  The red firework shows the step in the replacement that didn’t go well.  To view the outline, Cause Map and Process Map, click on “Download PDF” above.

Therac-25 Radiation Overdoses

By ThinkReliability Staff

The Therac-25 was a radiation therapy machine used during the mid-1980s.  It delivered two types of radiation beams, a low-power electron beam and a high-power x-ray beam, which provided the economic advantage of delivering two kinds of therapeutic radiation with one machine.  From June 1985 to January 1987, the Therac-25 delivered massive radiation overdoses to six people around the country.  We can look at the causes of these overdoses in a root cause analysis performed as a Cause Map.

The radiation overdoses were caused by delivery of the high-powered electron beam without attenuation.  In order for this to happen, the high-powered beam was delivered and the attenuation was not present.  The lower-powered beam did not require the attenuation provided by the beam spreader, so it was possible to operate the machine without it.  The machine did register an error when the high-powered beam was turned on without attenuation.  However, it was possible to operate the beam with the error present, and the warning was overridden by the operators.

The Therac-25 had two different responses to errors.  One was to pause the treatment, which allowed the operators to resume without any changes to the settings; the other was to reset the machine settings.  The error in this case, having the high-power beam without attenuation, resulted only in a treatment pause, allowing the operator to resume treatment with an override without changing any of the settings.  Researchers talking to the operators found that the Therac-25 frequently produced errors, so operators were accustomed to overriding them.  In this case, the error that resulted ("Malfunction 54") was ambiguous and not defined in any of the operating manuals.  (This code was apparently meant only for the manufacturer, not healthcare users.)
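To make the distinction concrete, here is a minimal sketch in Python (purely illustrative; the real Therac-25 software was written in assembly language, and the reset-type code below is a hypothetical placeholder) of an error handler that, like the one described above, lets a pause-type error be overridden without touching any of the settings:

```python
# Illustrative sketch only; not the real Therac-25 code. It shows the
# error-handling split described above: pause-type errors leave the settings
# exactly as they were and can be overridden, while reset-type errors wipe
# the settings so treatment must be set up again.
PAUSE_ERRORS = {"MALFUNCTION 54"}   # the ambiguous code operators actually saw
RESET_ERRORS = {"MALFUNCTION 99"}   # hypothetical example of a reset-type code

def handle_error(code, settings):
    if code in RESET_ERRORS:
        settings.clear()            # operator must re-enter everything
        return "treatment suspended; settings reset"
    if code in PAUSE_ERRORS:
        # The settings, including an unattenuated high-power beam, are left
        # untouched; a single override keystroke resumes the treatment.
        return "treatment paused; override to proceed"
    return "unknown error code"

settings = {"beam": "high power", "attenuator": "not in place"}
print(handle_error("MALFUNCTION 54", settings))
print(settings)   # the unsafe combination is still in place after the override
```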

The Therac-25 allowed the beam to be turned on in this circumstance without error (apart from the overridden warning).  The Therac-25 had no hardware protective circuits and depended solely on software for protection.  The safety analysis of the Therac-25 considered only hardware failures, not software errors, and thus did not discover the need for any sort of hardware protection.  The reasoning given for not including software errors was the "extensive testing" of the Therac-25, the fact that software, unlike hardware, does not degrade, and the general assumption that software is error-proof.  Software errors were assumed to be caused by hardware errors, and residual software errors were not included in the analysis.

Unfortunately, the code used in the Therac-25 was in part borrowed from a previous machine and contained a residual error.  This error had not been noticed in previous versions because hardware protective circuits prevented a similar error from occurring.  The residual error was a software error known as a "race condition": in short, the output of the code depended on the order and timing in which the variables were entered.  If an operator entered the variables for the treatment very quickly and not in the normal order (such as going back to correct a mistake), the machine would accept the settings before the change from the default setting had registered.  In some of these cases, this resulted in the error described here.  This error was not caught before the overdoses happened because software failures were not considered in the safety analysis (as described above), because the code was reused from a previous system that had hardware interlocks (and so had not shown these problems), and because the review of the software was inadequate.  The code was not independently reviewed, the design of the software did not consider failure modes, and the software was not tested with the hardware until installation.
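To make the idea of a race condition more concrete, here is a minimal, generic sketch in Python (not the actual Therac-25 code; the routine names and timings are invented).  The set-up routine reads the shared settings exactly once, so whether the operator's correction is actually used depends entirely on whether it lands before or after that read, even though the screen always shows the corrected value.

```python
# Generic race-condition sketch, not the actual Therac-25 code. The set-up
# routine reads the shared settings exactly once; an operator edit that lands
# after that read is shown on screen but never reaches the hardware set-up.
import threading
import time

settings = {"mode": "default"}

def set_up_beam(read_delay):
    time.sleep(read_delay)
    chosen = settings["mode"]        # the one and only read of the entry
    time.sleep(0.2)                  # stands in for the slow hardware set-up
    print("Beam configured for:", chosen)

def treatment(edit_delay):
    settings["mode"] = "default"     # fresh session
    setup = threading.Thread(target=set_up_beam, args=(0.1,))
    setup.start()
    time.sleep(edit_delay)
    settings["mode"] = "corrected"   # operator goes back and fixes the entry
    setup.join()
    print("Screen shows:", settings["mode"])

treatment(edit_delay=0.05)  # correction lands before the read: beam uses "corrected"
treatment(edit_delay=0.15)  # correction lands after the read: beam uses "default",
                            # yet the screen still shows "corrected"
```

The real machine's code paths were considerably more involved; the point of the sketch is only that the outcome depends on timing rather than on what the operator sees.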

This incident can teach us a lot about over-reliance on one part of a system and about re-using designs in a new way with inadequate testing and verification (as well as many other issues).  If we can learn from the mistakes of others, we are less likely to make those mistakes ourselves.  For more detail on this (extremely complicated) issue, please see Nancy Leveson and Clark Turner's "An Investigation of the Therac-25 Accidents."