Tag Archives: root cause analysis

Chemical Release Kills Four Workers at Texas Pesticide Plant

By ThinkReliability Staff

In the early morning hours of November 15, 2014, a release of methyl mercaptan resulted in the deaths of four employees at a plant in Texas that manufactures pesticides. The investigation into the source of the leak is still ongoing, though persistent maintenance problems had been reported in the plant, which was shut down five days prior to the incident.

Even though the investigation has not been completed, there are some lessons learned that can be applied to this facility, and other facilities that handle chemicals, immediately.

Even “safer” chemicals are dangerous when not treated properly. The chemical released – methyl mercaptan – is stored as a safer alternative to methyl isocyanate (which was the chemical released in the Bhopal disaster). Although it’s “safer” than its alternatives, it is still lethal at concentrations above 150 parts per million. The company has stated that 23,000 pounds were released – in a room where complaints were made about insufficient ventilation. The workers were unable to escape – likely because they were quickly incapacitated by the levels of methyl mercaptan and did not have the necessary equipment to get out. (Only two air masks and oxygen tanks were found in the area where the employees were.)

A fast response is necessary for employee safety. Records show that 911 was not called for an hour after the employees were trapped. (One of the victims called his wife an hour prior to indicate there was an issue and he was attempting rescue.) The emergency industrial response group, which is trained to provide response in these sorts of situations, was never called by the plant. Medical personnel could not access the employees because they were not trained in protective gear. Firefighters who responded did not have enough air to travel through the entire facility and did not have enough information on the layout to know where to go. It’s unclear whether a quicker response could have saved lives.

Providing timely, accurate information is necessary for public safety. The best way to determine the impact on the public is to measure the concentration of released chemicals at the fenceline (known as fenceline monitoring). Air monitoring was not performed for more than four hours after the release. Companies are not required to provide fenceline monitoring, although an Environmental Protection Agency (EPA) rule requiring monitoring systems for refineries is under review. (This rule would not have impacted this plant as it produced pesticides.) Until that monitoring, the only information available to the public was information provided by the company (which did not disclose the amount of chemical released until days later). In Texas, companies are required to disclose the presence of chemicals, but not the amount. A reverse 911 system was used to inform residents that an odor would be present, but did not discuss the risks.

What can you do? Ensure that all chemicals at your facility are known and stored carefully. Develop a response plan that ensures that your employees can get out safely, that responders can get in safely (and are apprised of risks they may face), and that the public has the necessary information to keep them safe. Make sure these plans are trained on and posted readily. Depending on the risk of public impact from your business, involving emergency responders and the public in your drills may be desired.

To see a high level Cause Map of this incident, click on “Download PDF” above.

Chocolate Makers Warn of Possible Shortage

By Kim Smiley

Chocolate is one of the most beloved foods, but it may be becoming a little too popular. Major chocolate makers have warned of a possible chocolate shortage looming in the near future. According to a recent article in the Washington Post, “The world’s biggest chocolate-maker says we’re running out of chocolate”, the world consumed about 70,000 metric tons more cocoa last year than it produced. The chocolate deficit is also predicted to get worse.

The chocolate shortage is a classic example of supply and demand in action.  The demand for cocoa is rising at the same time that the supply is dropping.  The price consumers are paying for chocolate is already increasing and is likely to get significantly higher if these trends continue.

So why is demand increasing (beyond the obvious fact that chocolate is delicious)? Part of the answer is that it is trendy to include chocolate in a wider variety of foods such as savory gourmet dishes, liquor and breakfast cereal.  Even the already questionable potato chip has been covered in chocolate to the delight of many.  The increasing popularity of dark chocolate also comes into play because dark chocolate contains significantly more cocoa than typical chocolate. (An average chocolate bar is about 10% cocoa while dark chocolate bars are usually closer to 70%.)  The sheer number of people who are eating chocolate is also growing as chocolate is more widely available worldwide, particularly in Asia where chocolate consumption is increasing rapidly.

While demand continues to grow, supply is decreasing. Drought in West Africa, where the majority of the world’s cocoa is grown, has impacted the cocoa supply. The plants are also being attacked by diseases; the most noteworthy is a fungus called Frosty pod, which is reducing the crop further. The nature of cocoa trees also makes responding to difficult or changing growing conditions challenging because it takes them years to mature. With the difficulties facing cocoa trees, many farmers are turning to other crops that are more profitable, which reduces the production of cocoa.

The end result of higher demand for chocolate will likely be further increases in the price of chocolate.  It’s also likely that chocolate makers will continue to develop candy that includes non-chocolate ingredients such as nuts, raisins or nougats to meet the demand for treats while using less actual chocolate.  Additionally, farmers are working to develop new strains of cocoa that are resistant to disease and drought and/or produce more cocoa per plant, which would increase the supply of cocoa.

A Cause Map, a visual root cause analysis, can be used to show the causes that have contributed to the chocolate deficit. To view a high level Cause Map of this example, click on “Download PDF” above.

Investigation Into the Fatal Crash of Commercial Space Vehicle is Underway

By Kim Smiley

On October 31, 2014, Virgin Galactic’s commercial space vehicle, SpaceShipTwo, tore apart over the Mojave Desert in California during its fourth rocket-powered test flight. One pilot was killed and the other seriously injured. An investigation is underway to determine exactly what caused the crash, but initial data indicates that the tail booms used to slow down the vehicle moved into the feathered position prematurely, increasing the aerodynamic force. This disaster has the potential to impact the emerging commercial space industry as regulators and potential passengers are reminded of the inherent dangers of space travel.

This issue can be analyzed by building a Cause Map, a visual method for performing a root cause analysis. An initial Cause Map can be built using the information that is currently available and then easily expanded as more data is known. The first step is to fill in an Outline with the basic background information of the incident. Additionally, the impacts to the overall goals are listed on the Outline to determine the scope of the issue. The Cause Map is then built by asking “why” questions.

Starting with the safety goal in this example: one pilot was killed and another was injured because a space vehicle was destroyed and they were onboard. (When two causes both contribute to an effect, they are both listed on the Cause Map and joined with an “and”.) SpaceShipTwo is designed to hold passengers, but this was a test flight to assess a new fuel so the pilots were the only people onboard. The space vehicle tore apart because the stress on the vehicle was greater than the strength of the vehicle. The final report on the accident will not be available for many months, but the initial findings indicate that the space vehicle experienced greater aerodynamic forces than expected.
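The “why” questioning described above can be sketched as a small tree. This is only an illustrative model, not ThinkReliability’s tooling: the `Cause` class and `render` function are hypothetical names, and the boxes are paraphrased from the paragraph above. Each box lists the answers to “why?”, and sibling causes that jointly produce an effect are joined with “AND”.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cause:
    """One box on a Cause Map; `causes` holds the answers to "why?"."""
    text: str
    causes: list[Cause] = field(default_factory=list)


def render(cause: Cause, depth: int = 0) -> list[str]:
    """Return the map as indented lines, joining sibling causes with AND."""
    lines = ["  " * depth + cause.text]
    for i, child in enumerate(cause.causes):
        if i > 0:
            # Two or more causes contributing to the same effect
            lines.append("  " * (depth + 1) + "AND")
        lines.extend(render(child, depth + 1))
    return lines


# Boxes paraphrased from the initial findings; not the final report.
safety_goal = Cause(
    "One pilot killed, one injured",
    [
        Cause(
            "Space vehicle destroyed",
            [
                Cause(
                    "Stress on vehicle exceeded its strength",
                    [Cause("Greater aerodynamic forces than expected")],
                )
            ],
        ),
        Cause("Pilots were onboard", [Cause("Test flight to assess a new fuel")]),
    ],
)

print("\n".join(render(safety_goal)))
```

Because every box is just another `Cause`, new findings can be appended to any box's `causes` list as the investigation continues, which mirrors how the map is expanded over time.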

The space vehicle used tail booms that were shifted into a feathered position to increase drag and reduce speed prior to landing. Video shows the co-pilot releasing the lever that unlocked the tail booms earlier than expected while the vehicle was still accelerating. It’s unclear at this time why he released the lever. The tail booms were not designed to move when unlocked and a second lever controls movement, but investigators speculate that the aerodynamic forces on the space vehicle while it was still accelerating caused them to lift up into the feathered position once they were unlocked. The vehicle disintegrated seconds after the tail booms shifted position, likely because of the aerodynamic forces in play.

After the final report is released, the Cause Map can be expanded to include the additional information. To view a high level Cause Map of this accident, click on “Download PDF” above.

Safety Concerns Raised by 5 Railroad Accidents in 11 Months

By ThinkReliability Staff

The National Transportation Safety Board investigates major railroad accidents in the United States. It was not only the severity (6 deaths and 126 injuries) but the frequency (5 accidents over 11 months) of recent accidents on a railroad that led to an “in-depth special investigation”. Part of the purpose of the special investigation was to “examine the common elements that were found in each”.

When an organization sees a recurring issue – in this case, multiple accidents on the same railroad requiring investigation – there may be value in not only investigating the incidents separately but also in a common analysis. A root cause analysis that addresses more than one incident is known as a Cumulative Cause Map, and it captures visually much of the same information as a Failure Modes and Effects Analysis, or FMEA.

The information from the individual investigations of each of these accidents can be combined into one analysis, including an outline addressing the problems and impacts to the goals from the incidents as a whole. In this case, the problems addressed include issues on the Metro-North railroad in New York and Connecticut from May 2013 to March 2014. The five incidents during that time period resulted in 4 customer deaths and 126 injuries, 2 employee deaths, and over $23.8 million in property damage.

The analysis of the individual accidents can be combined in a Cumulative Cause Map to intuitively show the cause-and-effect relationships. The customer deaths and injuries, and the property damage, resulted from train derailments and a collision. The train collision resulted from a derailment. In two of the cases, the derailment was due to track damage that had either been missed on inspection or had maintenance deferred. In the third derailment (discussed in a previous blog), the train took a curve at an excessive rate of speed due to fatigue of the engineer. Inadequate track inspections and maintenance, and deferred maintenance were highlighted as recurring safety issues to the railroad.

Both of the employee fatalities resulted from workers being struck by a train while performing track maintenance. In one case, the worker was outside the designated protected area due to an inadequate job safety briefing. In the other, a student removed the block while working unsupervised, allowing a train to travel into the protected area. The NTSB also identified inadequate safety oversight and roadway worker protection procedures as areas needing improvement. While the NTSB already released recommendations with each of the individual investigations, it plans to issue more based on the cumulative investigation addressing all five incidents. View an overview of all 5 incidents by clicking “Download PDF” above.

Years of Uncontrolled Leakage Lead to Fatal Mall Collapse

By ThinkReliability Staff

The problems that led to the collapse of a shopping mall’s parking structure were present over its thirty-plus year history, says the Report of the Elliot Lake Commission of Inquiry. Multiple opportunities to fix the problem were missed, culminating in the deaths of two on June 23, 2012. Says the report, “Although it was rust that defeated the structure of the Algo Mall, the real story behind the collapse is one of human, not material failure.”

Yes, corrosion of a connection supporting the parking garage decreased its strength to 13% of its original capacity, meaning that on that fateful day, one car driving over it resulted in its fatal collapse. But the more important story is that of how the corrosion was allowed to increase unchecked, due to leakage that had been noted since the opening of the mall.

Multiple causes were discovered resulting in the fatal collapse. The report that addresses them and suggests improvement is more than 1,000 pages long. Though the detail in the report is outstanding, an overview of the information from the report can be diagrammed in a Cause Map, or visual root cause analysis, allowing a one-page overview that clearly shows the cause-and-effect relationships.

It’s important to begin with the impact to the goals. Doing so gives a starting point – and focus – to the cause-and-effect questioning. In this case, the safety goal was impacted due to the 2 fatalities and 19 injuries caused by the collapse. The mall experienced severe damage, and the rescue and response efforts were comprehensive and time-consuming. Additionally, an engineer was criminally charged due to negligence from issues with the mall’s structural integrity.

The fatalities, property damage, and rescue efforts all resulted from the catastrophic collapse of the mall’s rooftop parking structure. The collapse was caused by the sudden failure of a connector. Material failure results from stress on an object overcoming the strength of the object. In this case the stress on the object was a single vehicle driving over the connection in question (evidenced by a video of the collapse). The strength of the connection had been significantly reduced due to corrosion, caused by the continuous ingress of water and chlorides on the unprotected beam.

The leakage was found to stem from a faulty initial design of the waterproofing system from construction of the mall in 1979. Specifically, the architect’s suggestions regarding waterproofing were ignored due to cost and land availability concerns, and the waterproofing system was installed during suboptimal weather because of construction delays. After construction, the architect signed off on the design without inspecting the site, the first in a long list of failings that would eventually cost two women their lives.

Over the years, there were multiple warnings (not least the need to use buckets to collect leaking water on a fairly constant basis) that were never resolved. According to the report, the problem was never fully addressed with maintenance and repairs but rather pushed off with cheap, ineffective repairs or by selling the structure (as happened twice in its history). For the most part, the local government did not investigate complaints or enforce building standards, apparently unwilling to interfere with the operation of a large source of local revenue and employment.

When the local government finally did get involved and issued an Order to Remedy in 2009, the building owner appeared to provide deliberately false information that suggested that repairs were underway, leading to a rescinding of the order later that year. After an anonymous complaint in late 2011, an engineer with a suspended license performed a visual-only inspection which had to be signed off by a licensed engineer. After it was signed, the engineer testified that he had changed the contents of the report at the request of the owner, leading to the criminal charges against him for negligence.

Although plenty of failings were discussed in the report, it states very clearly, “This Commission’s role is not to castigate or chastise; its only purpose in finding fault, if it must, is to seek to prevent recurrence. Criticism of prevailing practices serves only to suggest their improvement or, if necessary, elimination.” In the report, the Commission discusses multiple suggestions for improvement – specifically clarifying, enforcing, and providing public information regarding building standards. Hopefully, the lessons learned from this tragic accident will allow for implementation of these solutions to ensure that thirty years of negligence isn’t allowed to cause a fatal building collapse again.

Software Error Causes 911 Outage

By Kim Smiley

On April 9, 2014, more than 6,000 calls to 911 went unanswered.  The problem was spread across seven states and went on for hours.  Calling 911 is one of those things that every child is taught and every person hopes they will never need to do –  and having emergency calls go unanswered has the potential to turn into a nightmare.

The Federal Communications Commission (FCC) investigated this 911 outage and has released a study detailing what went wrong on that day in April.  The short answer is that a software error led to the unanswered calls, but there is nearly always more to the story than a single “root cause”.  A Cause Map, an intuitive format for performing a root cause analysis, can be used to better understand this issue by visually laying out the causes (plural) that led to the outage.

There are three steps in the Cause Mapping process. The first is to define an issue by completing an Outline that documents the basic background information and how the problem impacts the overall goals.  Most incidents impact more than one goal and this issue is no exception, but for simplicity let’s focus on the safety goal.  The safety goal was impacted because there was the potential for deaths and injuries.  Once the Outline is completed (including the impacts to the goals), the Cause Map is built by asking “why” questions.

The second step of the Cause Mapping process is to analyze the problem by building the Cause Map.  Starting with the impacted safety goal – “why” was there the potential for deaths and injuries?  This occurred because more than 6,000 911 calls were not answered.   An automated system was designed to answer the calls and it wouldn’t accept new calls for hours.  There was a bug in the automated system’s software AND the issue wasn’t identified for a significant period of time.  The error occurred because the software used a counter with a pre-set limit to assign calls a tracking number.  The counter hit the limit and couldn’t assign a tracking number so it quit accepting new calls.
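The counter failure described above can be illustrated with a minimal sketch. This is not the actual 911 routing software: `TrackingCounter`, `route_calls`, and the tiny limit of 5 are all hypothetical, chosen only to show how a counter with a pre-set ceiling starts rejecting every new call once the ceiling is reached.

```python
class TrackingCounter:
    """Hands out sequential tracking numbers up to a pre-set limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.next_id = 1

    def assign(self) -> int:
        # Once the limit is passed, no tracking number can be assigned.
        if self.next_id > self.limit:
            raise OverflowError("tracking number limit reached")
        assigned = self.next_id
        self.next_id += 1
        return assigned


def route_calls(counter: TrackingCounter, incoming: int) -> tuple:
    """Return (accepted, rejected) counts for a batch of incoming calls."""
    accepted = rejected = 0
    for _ in range(incoming):
        try:
            counter.assign()
            accepted += 1
        except OverflowError:
            # No tracking number means the call is never accepted.
            rejected += 1
    return accepted, rejected


counter = TrackingCounter(limit=5)  # hypothetical limit, for illustration only
print(route_calls(counter, 8))  # prints (5, 3)
```

Note that nothing in this sketch ever resets or rolls over the counter, so once the limit is reached every later call fails in the same way until someone intervenes.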

The delay in identification of the problem is also important to identify in the investigation because the problem would have been much less severe if it had been found and corrected more quickly. Any 911 outage is a problem, but one that lasts 30 minutes is less alarming than one that plays out over 8 hours. In this example, the system identified the issue and issued alerts, but categorized them as “low level” so they were never flagged for human review.

The final step in the Cause Mapping process is to develop and implement solutions to reduce the risk of the problem recurring. In order to fix the issues with the software, the pre-set limit on the counter has been increased and will periodically be checked to ensure that the max isn’t hit again. Additionally, to help improve how quickly a problem is identified, an alert has been added to notify operators when the number of successful calls falls below a certain percentage.
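The second solution, an alert tied to the share of successful calls, might look something like the sketch below. The function name `success_rate_alert` and the 95% default threshold are assumptions for illustration; the FCC study is cited here only for the general idea, not for any specific value.

```python
from typing import Optional


def success_rate_alert(answered: int, received: int,
                       threshold: float = 0.95) -> Optional[str]:
    """Return an alert message when the answered share drops below threshold.

    The 0.95 default is a hypothetical value, not one taken from the study.
    """
    if received == 0:
        return None  # nothing to measure yet
    rate = answered / received
    if rate < threshold:
        return f"ALERT: only {rate:.0%} of calls answered"
    return None


print(success_rate_alert(answered=99, received=100))  # prints None
print(success_rate_alert(answered=50, received=100))  # prints ALERT: only 50% of calls answered
```

Measuring the outcome (calls answered) rather than relying on an internal error being classified correctly means the alert still fires even when the underlying fault is logged as “low level”.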

New issues will likely continue to crop up as emergency systems move toward internet-powered infrastructure, but hopefully the systems will become more robust as lessons are learned and solutions are implemented.  I imagine there aren’t many experiences more frightening than frantically calling 911 for help and having no one answer.

To view a high level Cause Map of this issue, including a completed Outline, click on “Download PDF” above.

Lawsuit Questions the Safety of Guardrails

By Kim Smiley

A whistleblower lawsuit claims that tens of thousands of guardrails installed across the US may be unsafe.  The concern is that the specific design of the guardrail in question, the ET-Plus, can jam when hit and puncture cars, potentially causing injury, rather than curling away as intended.

This issue has more questions than answers at this point, but an initial Cause Map can be built to document what is currently known.  A question mark should be added to any cause that is suspected, but has not been proven with evidence.  As more information, both new causes and evidence, becomes available the Cause Map can easily be expanded to incorporate it.
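The question-mark convention can be sketched in a few lines. The `MappedCause` class and the evidence string are hypothetical placeholders: a cause with no supporting evidence gets a trailing “?”, and the mark disappears once evidence is attached.

```python
from dataclasses import dataclass, field


@dataclass
class MappedCause:
    """A cause on the map plus whatever evidence supports it."""
    text: str
    evidence: list = field(default_factory=list)

    def label(self) -> str:
        # Suspected causes (no evidence yet) are flagged with a question mark.
        return self.text if self.evidence else self.text + " ?"


suspected = MappedCause("Guardrail jams when hit and punctures car")
print(suspected.label())  # prints Guardrail jams when hit and punctures car ?

# Placeholder evidence entry; a real map would reference an actual document.
suspected.evidence.append("placeholder: crash investigation record")
print(suspected.label())  # prints Guardrail jams when hit and punctures car
```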

In this example, the primary concern about the guardrails, both from a safety and regulation standpoint, is centered on a design change made in 2005. The size of the energy-absorbing end terminal was changed from five inches to four. The modification was apparently made as a cost-saving measure. The lawsuit alleges that federal authorities were never alerted to the design change so it never received the required review and approval. It appears that federal authorities were not alerted until a patent case brought up the issue in 2012.

The reduction in the size of the end terminals may have affected how the guardrails function during auto accidents.  The lawsuit claims that five deaths and other injuries from at least 14 auto accidents can be attributed to the new design of guardrails.  The Federal Highway Administration has stated that the guardrails meet crash-test criteria, but three states (Missouri, Nevada and Massachusetts) are taking the concerns seriously enough to ban further installation of the guardrails pending completion of the investigation.

This issue is a classic proverbial can of worms. Up to a billion dollars could be at stake in the lawsuit and the man who filed the lawsuit could get a significant cut of the payout. There are potential testing requirement issues that need to be considered if the guardrails are passing crash tests, but causing injuries. There are concerns over whether the company properly informed the federal government about design changes, which is a particularly sensitive topic following the recent GM ignition switch issues. All in all, this should be a very interesting topic to follow as it plays out.

To view a high level Cause Map of this issue, click on “Download PDF” above.

Two Firefighters Killed by Rogue Welding

By ThinkReliability Staff

On March 26, 2014, two firefighters were killed when trapped in a basement by a quickly spreading, very dangerous fire in Boston, Massachusetts. These firefighters appear to have been the first to succumb to injuries directly caused by fire while on the job in 2014. The company that was found responsible for starting the fire has been fined by the Occupational Safety and Health Administration (OSHA) for failure to follow safety procedures. Says Brenda Gordon, OSHA’s director for Boston and southeastern Massachusetts, “This company’s failure to implement these required, common-sense safeguards put its own employees at risk and resulted in a needless, tragic fire.”

Every incident that results in a fatality should be carefully investigated. Investigations are used not only for liability and regulatory reasons, but also to develop solutions to reduce the risk of similar fatalities happening in the future. Investigating an incident such as this in a Cause Map, or visual root cause analysis, allows for better solutions by determining all the cause-and-effect relationships that led to the issue.

First it’s important to define how goals were impacted in order to define the scope of the problem. In this case, two firefighters were killed, which impacts the safety goal. In addition, the spread of the fire, damage of nearby buildings and associated civil lawsuits are also impacts to the goals. The OSHA fine of $58,000 for 10 violations of workplace safety regulations is an impact to the regulatory goal. The response to the fire, as well as the multiple investigations, are impacts to the labor/time goal.

Beginning with an impacted goal and asking “Why” questions develops cause-and-effect relationships that explain how the incident occurred. In this case, the firefighters perished when they were trapped by fire. The firefighters were in the basement of a residential building to rescue occupants from a fire, and the fire was so hot and dangerous that the firefighters could not exit, and other firefighters were unable to come to their rescue. Extremely windy conditions spread the fire caused by a welding spark that struck a nearby wood shed. OSHA investigators note that the company performing the welding did not follow safety precautions (including having a fire watcher and moving welding away from flammable objects) that would have reduced the risk for fire. They cited the lack of an effective fire prevention/protection program and a lack of training in workplace and fire safety. View the Cause Map by clicking “Download PDF” above.

Ideally the fine levied by OSHA will encourage the company involved to increase its methods of fire protection, not only to protect its own workers, but also to protect the public. In addition, the Boston Fire Department is conducting an internal review to improve firefighter safety. Says Steve MacDonald, spokesman, “What they’re doing is looking at policies and procedures. They’re reviewing everything, reviewing weather, radio communications, anything and everything having to do with the fire.”

On July 5th, another firefighter died after being trapped in a building while looking for occupants during a fire in Brooklyn, New York. On July 9th, a firefighter in Houston, Texas died of smoke inhalation inside a burning building. A firefighter died in a building collapse due to fire in New Carlisle, Indiana on August 5, 2014, making a total of 5 firefighters who have died as a direct result of smoke/fire injuries in the line of duty so far in 2014. In 2013, a total of 30 firefighters were killed on the job, most as the result of the Yarnell Hill fire in Arizona.

Fire at FAA Facility Sparks Flight Havoc

By Kim Smiley 

On Friday September 26, 2014, air traffic was grounded for hours in the Chicago region following a fire in a Federal Aviation Administration facility in Aurora, Illinois. The snarl of flight issues impacted thousands of travelers in the days following the fire as airports struggled to deal with the aftermath of more than 4,000 canceled flights and thousands more delayed.

A Cause Map, a format for performing a visual root cause analysis, can be used to analyze this issue.  To build a Cause Map, the first step is to define the problem by determining how the overall organizational goals are impacted.  In this example, there is a significant customer service impact because thousands of passengers had their travel plans disrupted. The flight cancelations and delays can be considered an impact to the production/schedule goal.  The amount of time and energy needed to address the flight disruptions along with the investigation into the issue would also be impacts to the labor goal.  Once the impacts to the goals are determined, the Cause Map is built by asking “why” questions and visually laying out the answers to show the cause-and-effect relationship.

Thousands of flights were canceled because air traffic control was unable to support them.  Air traffic control couldn’t perform their usual function because there was a fire in a building that provided air traffic support for a large portion of the upper Midwest and it wasn’t possible to quickly provide air traffic support from another location. Focusing on the fire itself first, the fire appears to have been intentionally set by a contractor who worked in the building.  He was able to bring in flammable materials and start a fire without anyone stopping him.  Police are still investigating his motives, but he has been charged with a felony. The building was evacuated once the fire was discovered and employees obviously couldn’t perform their usual duties during that time.  Additionally, the fire damaged equipment so air traffic control functionality could not be quickly restored once the initial crisis was addressed and it was safe to return to the building.

The second portion of the issue is that there wasn’t a way to support air traffic once the building was evacuated.  Once the fire occurred, all flights were grounded because there wasn’t air traffic control support and it was not possible to quickly get air traffic moving again.

The final step in the Cause Mapping process is to develop and implement solutions to reduce the risk of a similar problem. Lawmakers have called for an investigation into this issue to see if there is sufficient redundancy in the air traffic control system. In an ideal situation, a fire or other crisis at any single location would not cripple US air traffic to the extent that this issue did. The investigation is also looking into the fire and reviewing the security at the facility to see if there should be stricter restrictions put in place, such as ensuring that no employees work alone or searching bags as workers access the site.

This situation is also a strong reminder that organizations need to have a plan in place of what to do in case a failure occurs.  There was a previous fire scare at this same location earlier in 2014 when a smoking ceiling fan resulted in an evacuation and flight delays (see previous blog) that should have prompted some serious consideration of what the contingency plan should be if this facility was ever out of commission.

I was one of those people standing in line for hours at an airport on Friday morning after my flight was canceled. And I for one would love to see the air traffic control system become more robust and better able to deal with the inevitable hiccups that occur. It’s impossible to prevent every potential problem and another intentional fire in an FAA facility seems pretty farfetched, but it is possible to have a better plan in place to deal with issues that may arise. The potential consequences of any single failure can be limited with a good plan and quick implementation of that plan.

Can Airline Seats Get Even Smaller?

By Kim Smiley

Was the experience the last time you flew wonderful?  Did you enjoy all the luxurious amenities like ample elbow room, stretching out your legs, and turning around in the bathroom?  Me neither.  Comfort certainly hasn’t been the top priority as airlines have shrunk seats to cram more passengers onboard, but a new patent application by Airbus really takes things to a whole new level.

They say that a picture is worth a thousand words and I think that is particularly true in this case. This is a diagram from a patent application for a proposed seat design –

 

I’m not sure about the rest of you, but my backside is sore just thinking about an airplane seat that bears such a strong resemblance to a bicycle seat.

I attempted to build a Cause Map, a visual root cause analysis, in order to better understand how such a design could be proposed because I frankly find it mind-boggling. The basic idea is that airlines would like to maximize profits and that putting more people on each flight allows more tickets to be sold, resulting in more money made. The average airline seat width has already decreased to about 17 inches from the 18 inches typical for a long-haul airplane seat in the 1970s and 1980s. Compounding the impact on passengers is the fact that the size of the average passenger has increased during that same time frame. In general, larger bodies are being put in smaller seats, not a recipe for comfort.

I’m still having a hard time understanding how the correct answer to increasing airline profits is making seats even smaller.  I have to believe that passengers will balk at some point.  At some level of discomfort, a cheap ticket just won’t be cheap enough for me to be willing to endure a truly awful flight.  Even with electronic distractions and snacks, there has to be a point where people would just say no.

There also have to be a number of safety concerns that arise when the size of airplane seats is dramatically decreased. Survivability in a crash is greatly influenced by seat design because airplane seats are designed to absorb energy and provide head injury protection during an accident.

Just to be clear, there is no plan to actually use this seat design anytime in the near future. This is just a patent application. As Airbus spokeswoman Mary Anne Greczyn said, “Many, if not most, of these concepts will never be developed, but in case the future of commercial aviation makes one of our patents relevant, our work is protected. Right now these patent filings are simply conceptual.” But somebody somewhere still thought this was a good enough idea that it should be patented…just in case.