All posts by Angela Griffith

I lead comprehensive investigations by collecting and organizing all related information into a coherent record of the issue. Let me solve a problem for you!

Mine Deaths in China

By ThinkReliability Staff

Following the successful rescue of all 33 miners trapped in a Chilean mine comes some unhappy mining news from China. An early-morning gas blast on October 16, 2010 is known to have killed 26 miners, and the 11 miners still unaccounted for are believed dead. In addition to these impacts to the safety goals, the environmental goal is impacted by the extremely high levels of methane gas, the customer service and production goals are impacted by the closure of the mine, and the property and labor goals are impacted by the rescue efforts that have been required. Unfortunately this is not an uncommon occurrence: an estimated 2,600 people were killed in Chinese mine accidents last year.

It is believed that most of the miners were killed by suffocation. In addition to the lack of oxygen caused by the extremely high methane levels (40%, compared to a normal level of about 1%), the miners were buried by coal dust released by the gas blast. The miners were trapped in the mine by the gas blast, whose cause is as yet unknown; this is a question that additional investigation will try to answer. More information is also needed about the high methane levels. Rescuers had difficulty reducing the methane because coal dust was blocking an access shaft, but levels were high even before the blast, for reasons that remain unclear.

More detail can be added to this Cause Map as the analysis continues. As with any investigation, the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals. Because of the high number of deaths (and the high frequency of this type of incident), the Cause Map should end up very detailed, providing as many potential solutions as possible so that the best ones can be selected and implemented to reduce these types of incidents.

Miners Rescued!

By ThinkReliability Staff

On October 13, 2010, after almost 70 days spent 688 meters underground, the 33 miners trapped in Chile’s San Jose Mine were brought to the surface in a small rescue capsule. Although the complexity of this rescue mission was unmatched in history, it went off essentially without a hitch and even proceeded more quickly than anticipated.

The primary concern throughout the rescue was the miners’ safety. Plans for the rescue focused on ensuring the safest possible environment for the miners – and on making adjustments based on the ordeal they had been through. For example, there was concern about damage to the miners’ eyes, which had not been exposed to natural light for more than two months, so the miners wore protective eyewear. In addition, medics and rescuers were sent down to the chamber where the miners had been trapped to prepare them for the trip up (in a rescue pod small enough to fit through a 60-cm-diameter hole) and to evaluate them for medical conditions. After reaching the surface, the miners were to receive 48 hours of medical observation by a team of specialists.

The preparations for this undertaking were extremely methodical and detailed. For example, an area near the mine exit was cleared for a helicopter landing – a backup plan in case the miners could not be transported to the medical facility by road.

Even less-immediate concerns have been considered. The company that owned the mine went bankrupt while the miners were trapped, meaning these brave men returned to the surface jobless. The Chilean government put out a notice, and has received more than a thousand job offers.

One of the biggest concerns is that the miners will suffer from post-traumatic stress disorder (PTSD). It’s unclear exactly what is being – or can be – done to reduce the impact, but the Chilean government has consulted with NASA about potential emotional and psychological issues the miners will face.

It seems that the rescuers really tried to think of everything that would make the rescue go smoothly – and the result of this planning showed in the faces of millions who watched the last miner safely pulled from the mine. A big Bravo Zulu out to all involved!

(You can see a timeline of the events starting from the mine collapse and a Cause Map that shows some of the worries the rescuers considered – and planned for – by clicking “Download PDF” above.)

Dissecting Safety Incidents: Using root cause analysis to reveal culture issues

By ThinkReliability Staff

The objective of a root cause analysis investigation is prevention. The causes of an incident are investigated so that solutions can be developed and implemented to reduce the risk of the same or a similar problem occurring again. The process sounds easy, but in practice it can become more involved. For example, what do you do when one of the identified causes is “lack of safety culture”? How exactly do you solve that?

This is the issue that the Washington DC Metrorail (Metro) is currently facing. The National Transportation Safety Board (NTSB) recently released findings from the investigation of a DC Metro train crash that killed nine last June. (See our previous blog for more details.) Predictably, the NTSB findings include several technical issues, including failed track circuits and a lack of adequate testing, but the list of causes also includes items like lack of safety culture and ineffective oversight.

Fortunately, the NTSB also provided recommendations, such as developing a non-punitive safety reporting program, establishing periodic inspection and maintenance procedures for the equipment that failed during this accident, and reviewing the process used to pass along safety and technical information. One of the important things to notice in this example is that the recommendations are fairly specific, even if the stated cause is a little vague. Solutions need to be specific if they are going to be effectively implemented.

If you find yourself at a point in your organization where a cause is identified as “lack of safety culture”, it’s a good idea to keep asking “why” questions until you identify the specific problems behind it. Is the safety information lacking or incorrect? Is the process that provides the information confusing? Do the workers need better safety equipment? Knowing all the details involved allows better solutions to be developed, and better solutions result in lower risks in the future. Culture is the shared values and practices of the people in an organization, and the Cause Mapping method of root cause analysis gives an organization an effective way to identify “culture gaps” by thoroughly dissecting just one of its incidents.

Therac-25 Radiation Overdoses

By ThinkReliability Staff

The Therac-25 was a radiation therapy machine used during the mid-1980s. It delivered two types of radiation beam – a low-power electron beam and a high-power X-ray beam – which provided the economic advantage of delivering two kinds of therapeutic radiation with one machine. From June 1985 to January 1987, the Therac-25 delivered massive radiation overdoses to six patients in the United States and Canada. We can look at the causes of these overdoses in a root cause analysis performed as a Cause Map.

The radiation overdoses were caused by delivery of the high-powered electron beam without attenuation. For this to happen, two things had to occur: the high-powered beam was delivered, and the attenuation was not in place. The lower-powered beam did not require the attenuation provided by the beam spreader, so it was possible to operate the machine without it. The machine did register an error when the high-powered beam was turned on without attenuation; however, it was possible to operate the beam despite the error, and the warning was overridden by the operators.

The Therac-25 had two different responses to errors: one paused the treatment, allowing the operators to resume without any changes to settings, and the other reset the machine settings. The error in this case – the high-power beam without attenuation – produced only a treatment pause, allowing the operator to resume treatment with an override, without changing any of the settings. Researchers talking to the operators found that the Therac-25 frequently produced errors, so operators were accustomed to overriding them. In this case, the error that resulted (“Malfunction 54”) was ambiguous and not defined in any of the operating manuals. (The code was apparently intended only for the manufacturer, not for healthcare users.)
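The difference between the two error responses can be sketched in a few lines of code. The snippet below is a simplified illustration only – the names, the Python language, and the second error code are our own assumptions, not the Therac-25 software (which was written in assembly) – but it shows why treating a safety-critical fault as a resumable pause is dangerous: nothing about the unsafe setup changes before the operator’s routine override.

    # Simplified illustration (hypothetical names), not actual Therac-25 code.
    RESPONSES = {
        "MALFUNCTION 54": "pause",     # ambiguous code, undefined in the operator manuals
        "MALFUNCTION 99": "suspend",   # hypothetical example of a non-resumable fault
    }

    def handle_error(code, settings):
        """Return True if the operator may resume treatment with a routine override."""
        if RESPONSES.get(code) == "suspend":
            settings.clear()   # full reset: the operator must re-enter all treatment data
            print(code, "- treatment suspended, settings reset")
            return False
        # "Pause": the setup is left exactly as it was, so an override keystroke
        # restarts treatment in the same state that raised the error.
        print(code, "- treatment paused; operator may override and proceed")
        return True

    settings = {"mode": "xray_high_power", "beam_spreader": "out"}   # the unsafe state
    if handle_error("MALFUNCTION 54", settings):
        print("Override accepted; beam fires with", settings)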

The Therac-25 allowed the beam to be turned on without error (apart from the overridden warning) in this circumstance. The Therac-25 had no hardware protective circuits and depended solely on software for protection. The safety analysis of the Therac-25 considered only hardware failures, not software errors, and thus did not identify the need for any sort of hardware protection. The reasoning given for not including software errors was the “extensive testing” of the Therac-25, the fact that software, unlike hardware, does not degrade, and the general assumption that software is error-free. Software errors were assumed to be caused by hardware errors, and residual software errors were not included in the analysis.

Unfortunately, the code used in the Therac-25 was in part borrowed from a previous machine and contained a residual error. The error had not been noticed in the previous machine because its hardware protective circuits prevented a similar failure from occurring. The residual error was a software error known as a “race condition”: in short, the program’s output depended on the order in which the variables were entered. If an operator entered the treatment variables very quickly and out of the normal order (such as going back to correct a mistake), the machine would accept the settings before the change from the default setting had registered. In some cases, this produced the error described here. The error was not caught before the overdoses happened because software failures were not considered in the safety analysis (as described above), the code was reused from a previous system that had hardware interlocks (and so had not exhibited these problems), and the review of the software was inadequate: the code was not independently reviewed, the software design did not consider failure modes, and the software was not tested with the hardware until installation.
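A race condition of this general type can be illustrated with a short sketch. Again, this is our own simplified Python example with invented names, not the Therac-25 code: the operator’s edit is applied by a background task while the treatment task reads the settings as soon as the entry screen is confirmed, so if confirmation wins the race, the beam fires with the stale defaults.

    # Simplified illustration of a race condition (hypothetical names and timing).
    import threading
    import time

    # Defaults left over before the operator's correction has been processed
    settings = {"mode": "xray_high_power", "beam_spreader": "out"}

    def apply_operator_edit():
        # Background task that applies the operator's correction to electron mode
        time.sleep(0.01)                         # simulated processing delay
        settings["mode"] = "electron_low_power"
        settings["beam_spreader"] = "in"

    def fire_beam():
        # Treatment task reads whatever settings are current when the operator confirms
        print("Firing with:", settings["mode"], "/ spreader", settings["beam_spreader"])

    edit = threading.Thread(target=apply_operator_edit)
    edit.start()
    fire_beam()     # operator confirms quickly -- the edit may not have been applied yet
    edit.join()

A hardware interlock that physically checks the beam spreader position would catch this regardless of what the software believes the settings to be – exactly the protection the Therac-25 lacked.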

This incident can teach us a lot about over-reliance on one part of a system and about reusing designs in a new way with inadequate testing and verification (as well as many other issues). If we can learn from the mistakes of others, we are less likely to make those mistakes ourselves. For more detail on this (extremely complicated) issue, please see Nancy Leveson and Clark Turner’s “An Investigation of the Therac-25 Accidents.”

UPDATE: Convictions Result from Bhopal Tragedy

By ThinkReliability Staff

In a previous blog, we outlined the two theories of the 1984 tragedy in Bhopal that resulted in approximately 15,000 deaths. On Monday, June 7, 2010 – more than 25 years after the incident – seven former senior Union Carbide employees were convicted of “death by negligence” by an Indian court. The sentence was two years in jail and a fine of 100,000 rupees (just over $2,000 U.S.). One former employee who was also charged has since died. The Union Carbide subsidiary that owned the plant at the time of the leak was also ordered to pay a fine of 500,000 rupees.

Memorial for those killed and disabled by the toxic gas release.

The charges against the company and the senior officials had been reduced from culpable homicide by India’s Supreme Court in 1996. The former head of Union Carbide has also been charged, but extradition requests remain outstanding.

The recent court case has highlighted not only the problems that led to the chemical leak, but also the problems that result in massive delays in the Indian court system. India’s Law Minister has acknowledged the delay, saying, “We need to address that.”

Although the court did not release additional information with its ruling, the convictions appear to support the theory that an equipment or safety system failure (and not employee sabotage) caused the leak.

Recreational Water Illnesses

By ThinkReliability Staff

Last year we wrote a blog about preventing pool injuries, specifically slipping and drowning. However, there’s a lesser-known risk from a pool: getting sick from swimming. This is officially known as “recreational water illness,” or RWI, and normally involves diarrhea. RWI is estimated to affect approximately 1,000 people a year (according to WebMD) and can cause death, especially in immune-compromised people.

We can perform a proactive root cause analysis to determine what causes these illnesses. Essentially, a person gets sick by ingesting pool water that contains germs. Pool water becomes contaminated when germs enter the pool from fecal matter. So please, keep fecal matter out of the pool. (Easier said than done – did you know that the average person is carrying about 0.14 grams of fecal matter?) Take a shower before you get in, and make sure your kids are using the bathroom regularly elsewhere. (Not surprisingly, kiddie pools are the ‘germiest’.)

Now, pools are treated to prevent these germs from proliferating. However, some pool chemicals take far too long to kill certain germs to be effective. (For example, cryptosporidium can take 7 days to be killed by chlorine.) Some pools don’t get enough chemicals due to inadequate maintenance. And there’s some stuff you can put in the pool – namely urine, sunscreen, and sweat – that reacts with chlorine and reduces the amount available to kill germs. So, even though urine itself doesn’t contain germs, don’t pee in the pool. And again, take a shower.

Our solutions to RWI: take a shower, don’t perform any bodily functions in the pool, and don’t swallow the pool water. That works for you and your family, but what about the unwashed masses in the pool? The CDC recommends buying your own water testing kit and testing the pool water before you get in. Make sure there’s a pool treatment plan and that it’s being followed, and that all ‘accidents’ are reported immediately. (Yep, even if they’re your fault.) Then lie back, relax, and enjoy your swim.

Tay Bridge Collapse

By ThinkReliability Staff

On December 28, 1879, the Tay Bridge in Dundee, Scotland collapsed as an express train was traveling across it. All 75 people on board were killed. The bridge had passed its Board of Trade inspection and had been open to traffic for only about 19 months before the collapse. The failure also resulted in the loss of the bridge itself (a replacement was later built nearby) and the temporary loss of a train route. Surprisingly, there was very little damage to the train, which was refurbished and placed back in service.

Although the bridge had passed its Board of Trade testing, problems quickly started to arise. Work crews on the bridge reported severe vibrations whenever a train crossed. An inspector noticed deficient joints but, rather than reporting them, decided he could repair them himself. (Unfortunately, while his repair reduced the vibrations, it further decreased the structural integrity of the bridge.) Another train crossed the bridge at 6:00 p.m. on the evening of the failure and reported a “very rough” journey – the train reportedly let off sparks as it swayed and rubbed against the guardrails.

Modern Tay Bridge

The train that began to cross the bridge at around 7:00 p.m. was much larger and heavier than the train that had crossed at 6:00 p.m. There was also severe weather, and witnesses on the shore reported an especially heavy gust of wind as the bridge began to collapse. The board of inquiry determined that the collapse was due to the failure of the tower lugs. These lugs were experiencing more stress than they were designed for, due to an increase in traffic, the heavy winds, and the particularly heavy train crossing at the time of failure. In addition, the lugs had been weakened by fatigue cracking caused by large lateral oscillations. The causes of the additional stress were also causes of the oscillations, along with a misalignment in the track. The defective joints – both from the original design and from the inspector’s “fix” – allowed the oscillations to increase over the (short) life of the bridge.

This bridge failure – still the most famous in the British Isles – led to additional insight into bridge construction, including some lessons still used today. By reviewing these types of failures, we can help ensure that 75 people don’t have to lose their lives again to teach a lesson about building structurally sound bridges. To view the Cause Map of this incident, please click on ‘Download PDF’ above.

The information used to build this Cause Map is from Failure Magazine.

Chinatown Fire NYC

By ThinkReliability Staff

On April 11, 2010, a fire broke out in a store on the first level of an apartment building on the 200 block of Grand Street in Chinatown, New York City. The fire would eventually reach seven alarms and require 250 firefighters to fight. Once firefighters were able to enter the building the next day, they found one body. Thirty-three people, including 29 firefighters, were injured, and approximately 200 were left homeless, as the blaze left three buildings needing to be demolished and at least two more severely damaged.

For years the buildings affected (which were more than a century old) had been neglected, with violations issued for missing smoke detectors and a boiler that released smoke into the buildings. At this point it’s unclear how these violations may have contributed to the fire and its aftermath. At the time of the fire, the buildings were for sale for over $9 million, although no offers had been made. Many goals were impacted by the fire, but the loss of human life and the number of injuries are the focus of our investigation.

The injuries (many from smoke inhalation) were caused by the seven-alarm fire. The fire reached seven alarms because it was able to spread quickly through the six-story building. In order for a fire to start, heat, fuel, and oxygen are required. There’s no shortage of fuel and oxygen in an apartment building, given everything needed for people to live there. The heat (or ignition source) may have been provided by the exposed wiring many residents had complained of, or by the boiler previously cited for neglect – or it may have been something else altogether. (However, arson is not suspected at this point.)

The fire was able to spread so quickly due to a large number of voids and shafts in the building – a function of its age. Another cause that may have contributed to the death was a possible lack of warning of the fire, due to the missing smoke detectors for which the building had also been previously cited.

Throughout an investigation there may be additional tools that help to clarify the incident.  Here we use a timeline to show the sequence of events.  A timeline is especially useful for complex events such as this.

A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.  In fact, the outline, Cause Map and timeline for this event easily fit on one page.  (View them by clicking “Download PDF” above.)   Even more detail can be added to this Cause Map as more information is released about the incident. As with any investigation the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals.

Oil Refinery Explosion Rocks Anacortes Washington

By ThinkReliability Staff

Early this morning (Friday, April 2, 2010), an explosion at an oil refinery rocked the town of Anacortes, Washington. The cause of the explosion is not yet known. However, even with very little information, a root cause analysis of the event can be started. It is extremely helpful to gather information regarding an incident as soon as possible after it occurs; more information can always be added as the investigation continues.

In this case, the date and approximate time are known. It’s not clear whether anything was different or unusual at the refinery this morning, so we’ll put a question mark there for now. Detailed information regarding the exact location of the incident has been released, so we can record that the explosion occurred at an oil refinery in Anacortes, Washington, at the catalytic reformer hydrotreater unit, while maintenance work was being performed.

We also know that some of the company’s goals have been impacted.  One worker was killed, four workers were seriously injured, and three workers are missing.  These are all impacts to the safety goal.  Because of the severe impact to the safety goal and the loss of human life, the other goals are far less important.  However, we can record the impacts for assistance in performing the analysis.

Reports of black smoke in the area indicate pollution, which is an impact to the environmental goal. There are reports of some damage to nearby buildings, which could be considered an impact to the customer service goal. The damage to the plant, and the possible resulting delay in production, are impacts to the production/schedule and property goals. Additionally, the emergency response is an impact to the labor goal.

The costs resulting from the impacts to the goals and the frequency of events such as these are not immediately known.  This is information that can be filled in as the root cause analysis continues.   As more information is released regarding the incident, we can continue our investigation.

Power Outage Chile

By ThinkReliability Staff

A power outage struck Chile less than a month after the February earthquake. The outage affected an area spanning nearly 2,000 kilometers and roughly 80% of Chile’s population. Power in most areas was restored within several hours. However, it was estimated that power to some in the Bio Bio region – which had received more severe infrastructure damage – might be out for the better part of a week.

A power outage is an impact to the customer service and production/schedule goals. The power outage was caused by the collapse of the Central Interconnected System (Sistema Interconectado Central). The grid collapse was due to a lack of backup power capability – a result of the grid being left fragile by the earthquake – combined with an interruption to the main power grid. This interruption was caused by a disruption at the biggest substation, due to a damaged transformer. It’s unclear what caused the damage to the transformer, but it is believed to be related to the earthquake that hit in February. We show this by adding a cause box with a question mark between “damaged transformer” and “earthquake on Feb. 27th”.

Repairs to the damaged transformer were required, which is an impact to the property and labor goals.

The Chilean government pledged to repair the transformer within 48 hours and to stabilize the transmission lines within a week. Interim solutions to get the electricity flowing were to isolate the damaged unit and install a reserve unit. Additionally, Chileans have been asked to conserve electricity to minimize the amount of power transmitted through the lines.

By clicking “Download PDF” above, you can see a thorough root cause analysis built as a Cause Map, which captures all of the currently known information in a simple, intuitive format that fits on one page.

Even more detail can be added to this Cause Map as the analysis continues. As with any investigation the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals.