Tag Archives: Cause Mapping

Dreamliner fire: firefighter injured when battery explodes

By ThinkReliability Staff

On January 7, 2013, smoke was discovered on a recently deplaned Boeing 787 Dreamliner. The recently released National Transportation Safety Board (NTSB) investigation found that an internal short circuit within a cell of the auxiliary power unit (APU) battery spread to adjacent cells and led to a thermal runaway which released fire and smoke aboard the aircraft. A firefighter responding to the fire was injured when the battery exploded. Only 9 days later, an incident involving the main battery, which is the same model as that used for the APU, resulted in an emergency landing of another Boeing 787. As a result of these two incidents, the entire Dreamliner fleet was grounded for 3 months for the ensuing investigation and incorporation of modifications. (See our previous blog about the grounding.) Before the fleet was allowed to resume operations, certain protective modifications were required to be implemented.

The investigation determined that the cause of the internal short circuit, which provided the initial heat source for the fire within the battery cell, could not be definitively established due to severe damage in the area, but was potentially related to defects discovered during the manufacturing process. (Defects that could result in this type of short circuit were found on similar components.) The investigation found issues within the manufacturing process and with the oversight of subcontractors by contractors, as opposed to the manufacturers themselves.

The high temperatures resulting from the battery fire allowed it to spread to adjacent cells. Localized high temperatures were found greater than allowable at times of maximum current discharge, such as the APU startup, which had recently occurred. The high temperatures were not detected by the monitoring system (the impact could have been minimized had the issue come to light sooner), because temperatures were not monitored at individual cells, but only on two cell bus bars.

The systems were not prepared to deal with a spreading fire as the design of the aircraft assumed that a short circuit internal to the cell would not propagate. The NTSB determined that the guidance provided to determine key assumptions was ineffective and that the validation of these assumptions had failed. Likely related to this assumption, the safety assessment and testing on the battery system was ineffective. The rate of occurrence of cell venting (the spreading of fire from cell to cell) was calculated by the manufacturer to be 1 in 10 million flight hours. The two occurrences that resulted in the grounding both involved cell venting and occurred while the 787 fleet had less than 52,000 flight hours.
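The gap between the predicted and observed rates can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two figures quoted above; the variable names are illustrative, and because the fleet had logged less than 52,000 flight hours, the observed rate computed here is a lower bound.

```python
# Manufacturer's estimate quoted above: one cell-venting event
# per 10 million flight hours.
predicted_rate = 1 / 10_000_000

# Observed: two events while the fleet had logged fewer than 52,000 hours.
# Using 52,000 as the denominator gives a lower bound on the observed rate.
observed_events = 2
fleet_hours_upper_bound = 52_000
observed_rate_lower_bound = observed_events / fleet_hours_upper_bound

# How many times higher the observed rate is than the predicted rate.
ratio = observed_rate_lower_bound / predicted_rate

# How many events the predicted rate would imply over the same hours.
expected_events = predicted_rate * fleet_hours_upper_bound

print(f"Observed rate is at least {ratio:.0f} times the predicted rate")
print(f"Events expected at the predicted rate: {expected_events:.4f}")
```

A difference of this size suggests the assumption behind the estimate was wrong, rather than that the fleet was unlucky.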

Immediate actions that were required by the NTSB prior to a return to flight were to enclose the battery case, vent from the interior of the enclosure containing the battery to the exterior of the plane (keeping smoke out of the occupied spaces), and modify the battery to minimize the most severe effects from an internal short circuit. The NTSB also made multiple safety recommendations to the manufacturer, subcontractor and the Federal Aviation Administration (FAA).

One of these recommendations was to ensure that assumptions are validated. According to the NTSB report, “Validation of assumptions related to failure conditions that can impact safety is a critical step in the development and certification of an aircraft. The validation process must employ a level of rigor that is consistent with the potential hazard to the aircraft in case an assumption is incorrect.” This statement is true for any object that’s manufactured. Just replace the word “aircraft” with whatever is being manufactured, such as “car” or “pacemaker”. (See another disaster that resulted from not validated assumptions: the collapse of the I-35 Bridge.)

Click on “Download PDF” above to view a high level Cause Map of this issue.

10,000 Pound Buoy Falls on Workers

By Kim Smiley

On December 10, 2014, a buoy that weighs close to 10,000 pounds fell onto workers at an inactive ship maintenance facility in Pearl Harbor. Two workers were killed and two others sustained injuries. While an object this large is an extreme example of the dangers of dropped objects, worker injuries and deaths from falling objects of all sizes are a significant safety concern. A US census report of fatal occupational injuries states that 245 workers were killed after being struck by falling objects in 2013 alone.

The case of the dropped buoy can be built into a Cause Map, a visual root cause analysis, to better understand what happened. Understanding the details of an accident is necessary to ensure that a wide range of solutions is considered and that any solutions implemented will be effective at preventing future incidents.

The investigation into the falling buoy is still underway so some information is not yet available, but it can easily be incorporated into the Cause Map once it is known. Any causes that need more information or evidence can be noted with a question mark to show that there is still an open question.

Exactly what caused the buoy to drop hasn’t been released yet, but it is known that the safety lines attached to the buoy failed. Both of these issues need to be investigated to ensure that solutions can be implemented to prevent further tragedies.

Additionally, there are open questions about why people were working under the path of the lift. The workers were wearing hard hats, but this is obviously inadequate protection against a 10,000-pound buoy. The contractors were working to strengthen mooring lines at the time of the accident, but no one should be where they could be crushed if such a large object were dropped, as it was in this case. As stated by Jeff Romeo, the Occupational Safety and Health Administration (OSHA) Honolulu area director, “We’re still looking at the facts to try to determine the exact locations of where these employees were located. If in fact, they were working directly underneath the load, then that would be an alarming situation.”

The OSHA investigation is currently underway and is expected to take four to six months. Additionally, the Navy is launching a Safety Investigation Board to review the accident with findings expected to be released by February. Once the investigation is complete, work processes will need to be reviewed to see what changes need to be made to prevent any future injuries from falling objects.

To view an initial Cause Map of this incident, click on “Download PDF” above.

Chemical Release Kills Four Workers at Texas Pesticide Plant

By ThinkReliability Staff

In the early morning hours of November 15, 2014, a release of methyl mercaptan resulted in the deaths of four employees at a plant in Texas that manufactures pesticides. The investigation into the source of the leak is still ongoing, though persistent maintenance problems had been reported in the plant, which was shut down five days prior to the incident.

Even though the investigation has not been completed, there are some lessons learned that can be applied to this facility, and other facilities that handle chemicals, immediately.

Even “safer” chemicals are dangerous when not treated properly. The chemical released – methyl mercaptan – is stored as a safer alternative to methyl isocyanate (which was the chemical released in the Bhopal disaster). Although it’s “safer” than its alternatives, it is still lethal at concentrations above 150 parts per million. The company has stated that 23,000 pounds were released – in a room where complaints were made about insufficient ventilation. The workers were unable to escape – likely because they were quickly incapacitated by the levels of methyl mercaptan and did not have the necessary equipment to get out. (Only two air masks and oxygen tanks were found in the area where the employees were.)

A fast response is necessary for employee safety. Records show that 911 was not called for an hour after the employees were trapped. (One of the victims called his wife an hour prior to indicate there was an issue and he was attempting rescue.) The emergency industrial response group, which is trained to provide response in these sorts of situations, was never called by the plant. Medical personnel could not access the employees because they were not trained in protective gear. Firefighters who responded did not have enough air to travel through the entire facility and did not have enough information on the layout to know where to go. It’s unclear whether a quicker response could have saved lives.

Providing timely, accurate information is necessary for public safety. The best way to determine the impact on the public is to measure the concentration of released chemicals at the fenceline (known as fenceline monitoring). Air monitoring was not performed for more than four hours after the release. Companies are not required to provide fenceline monitoring, although an Environmental Protection Agency (EPA) rule requiring monitoring systems for refineries is under review. (This rule would not have impacted this plant as it produced pesticides.) Until that monitoring, the only information available to the public was information provided by the company (which did not disclose the amount of chemical released until days later). In Texas, companies are required to disclose the presence of chemicals, but not the amount. A reverse 911 system was used to inform residents that an odor would be present, but did not discuss the risks.

What can you do? Ensure that all chemicals at your facility are known and stored carefully. Develop a response plan that ensures that your employees can get out safely, that responders can get in safely (and are apprised of risks they may face), and that the public has the necessary information to keep them safe. Make sure these plans are trained on and posted readily. Depending on the risk of public impact from your business, involving emergency responders and the public in your drills may be desired.

To see a high level Cause Map of this incident, click on “Download PDF” above.

Chocolate Makers Warn of Possible Shortage

By Kim Smiley

Chocolate is one of the most beloved foods, but it may be becoming a little too popular.  Major chocolate makers have warned of a possible chocolate shortage looming in the near future.  According to a recent article by the Washington Post, “The world’s biggest chocolate-maker says we’re running out of chocolate”, the world consumed about 70,000 metric tons more cocoa last year than it produced.  The chocolate deficit is also predicted to get worse.

The chocolate shortage is a classic example of supply and demand in action.  The demand for cocoa is rising at the same time that the supply is dropping.  The price consumers are paying for chocolate is already increasing and is likely to get significantly higher if these trends continue.

So why is demand increasing (beyond the obvious fact that chocolate is delicious)? Part of the answer is that it is trendy to include chocolate in a wider variety of foods such as savory gourmet dishes, liquor and breakfast cereal.  Even the already questionable potato chip has been covered in chocolate to the delight of many.  The increasing popularity of dark chocolate also comes into play because dark chocolate contains significantly more cocoa than typical chocolate. (An average chocolate bar is about 10% cocoa while dark chocolate bars are usually closer to 70%.)  The sheer number of people who are eating chocolate is also growing as chocolate is more widely available worldwide, particularly in Asia where chocolate consumption is increasing rapidly.

While demand continues to grow, supply is decreasing.  Drought in West Africa, where the majority of the world’s chocolate is grown, has impacted the cocoa supply.  The plants are also being attacked by diseases; the most noteworthy is a fungus called Frosty pod, which is reducing the crop further.  The nature of chocolate trees also makes responding to difficult or changing growing conditions challenging because it takes them years to mature.  With the difficulties facing chocolate trees, many farmers are turning to other crops that are more profitable which reduces the production of cocoa.

The end result of higher demand for chocolate will likely be further increases in the price of chocolate.  It’s also likely that chocolate makers will continue to develop candy that includes non-chocolate ingredients such as nuts, raisins or nougats to meet the demand for treats while using less actual chocolate.  Additionally, farmers are working to develop new strains of cocoa that are resistant to disease and drought and/or produce more cocoa per plant, which would increase the supply of cocoa.

A Cause Map, a visual root cause analysis, can be used to show the causes that have contributed to the chocolate deficit. To view a high level Cause Map of this example, click on “Download PDF” above.

Investigation Into the Fatal Crash of Commercial Space Vehicle is Underway

By Kim Smiley

On October 31, 2014, Virgin Galactic’s commercial space vehicle, SpaceShipTwo, tore apart over the Mojave Desert in California during its fourth rocket-powered test flight. One pilot was killed and the other seriously injured. An investigation is underway to determine exactly what caused the crash, but initial data indicates that the tail booms used to slow down the vehicle moved into the feathered position prematurely, increasing the aerodynamic force. This disaster has the potential to impact the emerging commercial space industry as regulators and potential passengers are reminded of the inherent dangers of space travel.

This issue can be analyzed by building a Cause Map, a visual method for performing a root cause analysis. An initial Cause Map can be built using the information that is currently available and then easily expanded as more data is known. The first step is to fill in an Outline with the basic background information of the incident. Additionally, the impacts to the overall goals are listed on the Outline to determine the scope of the issue. The Cause Map is then built by asking “why” questions.

Starting with the safety goal in this example: one pilot was killed and another was injured because a space vehicle was destroyed and they were onboard. (When two causes both contribute to an effect, they are both listed on the Cause Map and joined with an “and”.) SpaceShipTwo is designed to hold passengers, but this was a test flight to assess a new fuel so the pilots were the only people onboard. The space vehicle tore apart because the stress on the vehicle was greater than the strength of the vehicle. The final report on the accident will not be available for many months, but the initial findings indicate that the space vehicle experienced greater aerodynamic forces than expected.
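The “why” questioning and the “and” relationship described above can be sketched as a small tree structure. This is only an illustrative sketch, not the Cause Mapping tool itself; the `Cause` class and `render` function are hypothetical names, and the cause text is taken from the paragraph above.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    """One box on a Cause Map (hypothetical structure for illustration)."""
    text: str
    # Causes that all had to be present to produce this effect ("and").
    causes: list["Cause"] = field(default_factory=list)


def render(node, depth=0):
    """Return indented lines, joining sibling causes with AND."""
    lines = ["  " * depth + node.text]
    for i, cause in enumerate(node.causes):
        if i > 0:
            lines.append("  " * (depth + 1) + "AND")
        lines.extend(render(cause, depth + 1))
    return lines


safety_goal = Cause(
    "Safety goal impacted: one pilot killed, one seriously injured",
    [
        Cause(
            "Space vehicle was destroyed",
            [Cause(
                "Stress on vehicle greater than strength of vehicle",
                [Cause("Greater aerodynamic forces than expected")],
            )],
        ),
        Cause(
            "Pilots were onboard",
            [Cause("Test flight to assess a new fuel")],
        ),
    ],
)

print("\n".join(render(safety_goal)))
```

Each level of indentation answers one “why” question, and new causes can be appended to any box as the investigation releases more data.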

The space vehicle used tail booms that were shifted into a feathered position to increase drag and reduce speed prior to landing. Video shows the co-pilot releasing the lever that unlocked the tail booms earlier than expected while the vehicle was still accelerating. It’s unclear at this time why he released the lever. The tail booms were not designed to move when unlocked and a second lever controls movement, but investigators speculate that the aerodynamic forces on the space vehicle while it was still accelerating caused them to lift up into the feathered position once they were unlocked. The vehicle disintegrated seconds after the tail booms shifted position, likely because of the aerodynamic forces in play.

After the final report is released, the Cause Map can be expanded to include the additional information. To view a high level Cause Map of this accident, click on “Download PDF” above.

Safety Concerns Raised by 5 Railroad Accidents in 11 Months

By ThinkReliability Staff

The National Transportation Safety Board investigates major railroad accidents in the United States. It was not only the severity (6 deaths and 126 injuries) but the frequency (5 accidents over 11 months) of recent accidents on a railroad that led to an “in-depth special investigation”. Part of the purpose of the special investigation was to “examine the common elements that were found in each”.

When an organization sees a recurring issue – in this case, multiple accidents requiring investigation from the same railroad – there may be value in not only investigating the incidents separately but also in a common analysis. A root cause analysis that addresses more than one incident is known as a Cumulative Cause Map, and it captures visually much of the same information as a Failure Modes and Effects Analysis, or FMEA.

The information from the individual investigations of each of these accidents can be combined into one analysis, including an outline addressing the problems and impacts to the goals from the incidents as a whole. In this case, the problems addressed include issues on the Metro-North railroad in New York and Connecticut from May 2013 to March 2014. The five incidents during that time period resulted in 4 customer deaths and 126 injuries, 2 employee deaths, and over $23.8 million in property damage.

The analysis of the individual accidents can be combined in a Cumulative Cause Map to intuitively show the cause-and-effect relationships. The customer deaths and injuries, and the property damage, resulted from train derailments and a collision. The train collision resulted from a derailment. In two of the cases, the derailment was due to track damage that had either been missed on inspection or had maintenance deferred. In the third derailment (discussed in a previous blog), the train took a curve at an excessive rate of speed due to fatigue of the engineer. Inadequate track inspections and maintenance, and deferred maintenance were highlighted as recurring safety issues to the railroad.

Both of the employee fatalities resulted from workers being struck by a train while performing track maintenance. In one case, the worker was outside the designated protected area due to an inadequate job safety briefing. In the other, a student removed the block while working unsupervised, allowing a train to travel into the protected area. The NTSB also identified inadequate safety oversight and roadway worker protection procedures as areas needing improvement. While the NTSB already released recommendations with each of the individual investigations, it plans to issue more based on the cumulative investigation addressing all five incidents. View an overview of all 5 incidents by clicking “Download PDF” above.

Antares Cargo Rocket Explodes Seconds After Launch

By Kim Smiley

On October 28, 2014, an Antares cargo rocket bound for the International Space Station (ISS) catastrophically exploded seconds after launch.  The $200 million rocket was planned to be one of eight supply missions to the ISS that Orbital Sciences has a $1.9 billion contract to provide.  The investigation is still underway, but initial findings indicate that there may have been a problem with the engines, which were initially built in the 1960s and early 1970s by the Soviet space program.

Whenever NASA launches a rocket, it is observed by safety personnel with the ability to cause the rocket to self-destruct if it appears to be malfunctioning to minimize potential injuries and property damage. Reports by NASA have indicated that this flight-termination system was engaged shortly after liftoff in this case because the rocket malfunctioned shortly after takeoff.

Video of the launch and the subsequent explosion show the plume from one engine changing shape a second before the massive explosion.  The change in the plume has led to speculation that a turbopump failed shortly after liftoff and suggests that the engines were the source of the malfunction.  Investigators are currently reviewing the video of the launch, telemetry readings from the rocket, and studying the debris to learn as many details as possible about this failure.

The engines in question are NK-33 rocket engines that were initially built (not just designed, but actually manufactured) more than 4 decades ago. So how did engines from the Apollo era end up on a rocket decades later in 2014?  The one-word answer is money.

These engines were originally designed to support the Soviet space program which was disbanded in 1974.  For years, these engines were warehoused with no real purpose.  In 1990, these engines were sold to a company called Aerojet, reportedly for the bargain price of a cool million each.  The engines were refurbished and renamed Aerojet AJ-26s.  The cost of using these older engines was significantly less than developing a brand new rocket design.  In addition to being expensive, a new rocket design requires a significant time investment.  There are also limited alternatives available, partly due to NASA’s shrinking budget.

Orbital Sciences has announced that they will source a different engine and no longer use the AJ-26s, but it’s worth noting that these rockets have been used successfully in recent years. They have launched Cygnus supply spacecraft three times without incident.

To view a high level Cause Map, a visual root cause analysis, of this incident, click on “Download PDF” above.

Years of Uncontrolled Leakage Lead to Fatal Mall Collapse

By ThinkReliability Staff

The problems that led to the collapse of a shopping mall’s parking structure were present throughout its thirty-plus year history, says the Report of the Elliot Lake Commission of Inquiry. Multiple opportunities to fix the problem were missed, culminating in the deaths of two on June 23, 2012. Says the report, “Although it was rust that defeated the structure of the Algo Mall, the real story behind the collapse is one of human, not material failure.”

Yes, corrosion of a connection supporting the parking garage decreased its strength to 13% of its original capacity, meaning that on that fateful day, one car driving over it resulted in its fatal collapse. But the more important story is that of how the corrosion was allowed to increase unchecked, due to leakage that had been noted since the opening of the mall.

Multiple causes were discovered resulting in the fatal collapse. The report that addresses them and suggests improvement is more than 1,000 pages long. Though the detail in the report is outstanding, an overview of the information from the report can be diagrammed in a Cause Map, or visual root cause analysis, allowing a one-page overview that clearly shows the cause-and-effect relationships.

It’s important to begin with the impact to the goals. Doing so gives a starting point – and focus – to the cause-and-effect questioning. In this case, the safety goal was impacted due to the 2 fatalities and 19 injuries caused by the collapse. The mall experienced severe damage, and the rescue and response efforts were comprehensive and time-consuming. Additionally, an engineer was criminally charged due to negligence from issues with the mall’s structural integrity.

The fatalities, property damage, and rescue efforts all resulted from the catastrophic collapse of the mall’s rooftop parking structure. The collapse was caused by the sudden failure of a connector. Material failure results from stress on an object overcoming the strength of the object. In this case the stress on the object was a single vehicle driving over the connection in question (evidenced by a video of the collapse). The strength of the connection had been significantly reduced due to corrosion, caused by the continuous ingress of water and chlorides on the unprotected beam.

The leakage was found to stem from a faulty initial design of the waterproofing system from construction of the mall in 1979. Specifically, the architect’s suggestions regarding waterproofing were ignored due to cost and land availability concerns, and the waterproofing system was installed during suboptimal weather because of construction delays. After construction, the architect signed off on the design without inspecting the site, beginning the first in a long list of failings that would eventually cost two women their lives.

Over the years, there were multiple warnings (not least the need to use buckets to collect leaking water on a fairly constant basis) that were never resolved. According to the report, the problem was never fully addressed with maintenance and repairs but rather pushed off with cheap, ineffective repairs or by selling the structure (as happened twice in its history). For the most part, the local government did not investigate complaints or enforce building standards, apparently unwilling to interfere with the operation of a large source of local revenue and employment.

When the local government finally did get involved and issued an Order to Remedy in 2009, the building owner appeared to provide deliberately false information that suggested that repairs were underway, leading to a rescinding of the order later that year. After an anonymous complaint in late 2011, an engineer with a suspended license performed a visual-only inspection which had to be signed off by a licensed engineer. After it was signed, the engineer testified that he had changed the contents of the report at the request of the owner, leading to the criminal charges against him for negligence.

Although plenty of failings were discussed in the report, it states very clearly, “This Commission’s role is not to castigate or chastise; its only purpose in finding fault, if it must, is to seek to prevent recurrence. Criticism of prevailing practices serves only to suggest their improvement or, if necessary, elimination.” In the report, the Commission discusses multiple suggestions for improvement – specifically clarifying, enforcing, and providing public information regarding building standards. Hopefully, the lessons learned from this tragic accident will allow for implementation of these solutions to ensure that thirty years of negligence isn’t allowed to cause a fatal building collapse again.

Software Error Causes 911 Outage

By Kim Smiley

On April 9, 2014, more than 6,000 calls to 911 went unanswered.  The problem was spread across seven states and went on for hours.  Calling 911 is one of those things that every child is taught and every person hopes they will never need to do –  and having emergency calls go unanswered has the potential to turn into a nightmare.

The Federal Communications Commission (FCC) investigated this 911 outage and has released a study detailing what went wrong on that day in April.  The short answer is that a software error led to the unanswered calls, but there is nearly always more to the story than a single “root cause”.  A Cause Map, an intuitive format for performing a root cause analysis, can be used to better understand this issue by visually laying out the causes (plural) that led to the outage.

There are three steps in the Cause Mapping process. The first is to define an issue by completing an Outline that documents the basic background information and how the problem impacts the overall goals.  Most incidents impact more than one goal and this issue is no exception, but for simplicity let’s focus on the safety goal.  The safety goal was impacted because there was the potential for deaths and injuries.  Once the Outline is completed (including the impacts to the goals), the Cause Map is built by asking “why” questions.

The second step of the Cause Mapping process is to analyze the problem by building the Cause Map.  Starting with the impacted safety goal – “why” was there the potential for deaths and injuries?  This occurred because more than 6,000 911 calls were not answered.   An automated system was designed to answer the calls and it wouldn’t accept new calls for hours.  There was a bug in the automated system’s software AND the issue wasn’t identified for a significant period of time.  The error occurred because the software used a counter with a pre-set limit to assign calls a tracking number.  The counter hit the limit and couldn’t assign a tracking number so it quit accepting new calls.
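The failure mode described above, a counter with a pre-set limit that stops accepting calls once the limit is reached, can be sketched in a few lines. This is a simplified illustration, not the actual 911 routing software; the class name, function name, and the tiny limit are all hypothetical.

```python
class TrackingCounter:
    """Hypothetical stand-in for the call-tracking counter described above."""

    def __init__(self, limit):
        self.limit = limit   # pre-set maximum number of tracking numbers
        self.issued = 0      # tracking numbers handed out so far

    def assign(self):
        """Return the next tracking number, or None once the limit is hit."""
        if self.issued >= self.limit:
            return None
        self.issued += 1
        return self.issued


def route_calls(counter, n_calls):
    """Count calls that receive a tracking number versus calls turned away."""
    answered = 0
    rejected = 0
    for _ in range(n_calls):
        if counter.assign() is None:
            rejected += 1  # no tracking number, so the call is not accepted
        else:
            answered += 1
    return answered, rejected


# A deliberately tiny limit so the behavior is easy to see.
counter = TrackingCounter(limit=5)
answered, rejected = route_calls(counter, 8)
print(f"answered={answered}, rejected={rejected}")
```

Once the limit is reached, every later call is turned away until someone intervenes, which is why the time to detect the problem matters as much as the bug itself.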

The delay in identification of the problem is also important to identify in the investigation because the problem would have been much less severe if it had been found and corrected more quickly.  Any 911 outage is a problem, but one that lasts 30 minutes is less alarming than one that plays out over 8 hours.  In this example, the system identified the issue and issued alerts, but categorized them as “low level” so they were never flagged for human review.

The final step in the Cause Mapping process is to develop and implement solutions to reduce the risk of the problem recurring.  In order to fix the issues with the software, the pre-set limit on the counter has been increased and will periodically be checked to ensure that the max isn’t hit again.  Additionally, to help improve how quickly a problem is identified, an alert has been added to notify operators when the number of successful calls falls below a certain percentage.
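The second solution, alerting operators when the share of successful calls drops, could look something like the sketch below. The function name and the 90% threshold are assumptions made for illustration; the FCC study as summarized here says only that the alert fires below “a certain percentage”.

```python
def should_alert(successful_calls, total_calls, min_success_rate=0.9):
    """Return True when the share of successful calls is below the threshold.

    min_success_rate is an assumed value chosen for illustration only.
    """
    if total_calls == 0:
        return False  # no traffic in this window, nothing to evaluate
    return successful_calls / total_calls < min_success_rate


# Example windows of call traffic: (successful, total)
for successful, total in [(99, 100), (50, 100), (0, 0)]:
    print(successful, total, should_alert(successful, total))
```

Unlike the original “low level” alerts, a check like this is tied directly to the outcome operators care about, which is whether calls are being answered.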

New issues will likely continue to crop up as emergency systems move toward internet-powered infrastructure, but hopefully the systems will become more robust as lessons are learned and solutions are implemented.  I imagine there aren’t many experiences more frightening than frantically calling 911 for help and having no one answer.

To view a high level Cause Map of this issue, including a completed Outline, click on “Download PDF” above.

Lawsuit Questions the Safety of Guardrails

By Kim Smiley

A whistleblower lawsuit claims that tens of thousands of guardrails installed across the US may be unsafe.  The concern is that the specific design of the guardrail in question, the ET-Plus, can jam when hit and puncture cars, potentially causing injury, rather than curling away as intended.

This issue has more questions than answers at this point, but an initial Cause Map can be built to document what is currently known.  A question mark should be added to any cause that is suspected, but has not been proven with evidence.  As more information, both new causes and evidence, becomes available the Cause Map can easily be expanded to incorporate it.

In this example, the primary concerns about the guardrails, both from a safety and regulation standpoint, are centered on a design change made in 2005.  The size of the energy-absorbing end terminal was changed from five inches to four.  The modification was apparently made as a cost-saving measure.   The lawsuit alleges that federal authorities were never alerted to the design change so it never received the required review and approval.  It appears that federal authorities were not alerted until a patent case brought up the issue in 2012.

The reduction in the size of the end terminals may have affected how the guardrails function during auto accidents.  The lawsuit claims that five deaths and other injuries from at least 14 auto accidents can be attributed to the new design of guardrails.  The Federal Highway Administration has stated that the guardrails meet crash-test criteria, but three states (Missouri, Nevada and Massachusetts) are taking the concerns seriously enough to ban further installation of the guardrails pending completion of the investigation.
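The question-mark convention for suspected causes can be sketched as a simple flag on each cause. This is an illustrative sketch only; the `CauseBox` class is a hypothetical name, and which causes are marked confirmed here simply follows how this post describes them (the design change is reported as fact, while its effects are still alleged).

```python
from dataclasses import dataclass


@dataclass
class CauseBox:
    """One cause on an initial Cause Map (hypothetical structure)."""
    text: str
    confirmed: bool = False  # True only once evidence supports the cause

    def label(self):
        # Suspected causes carry a question mark until evidence is added.
        return self.text if self.confirmed else self.text + " ?"


causes = [
    CauseBox("End terminal size changed from five inches to four",
             confirmed=True),
    CauseBox("Guardrail jams when hit instead of curling away"),
    CauseBox("Design change affects performance in auto accidents"),
]

for cause in causes:
    print(cause.label())
```

As the investigation produces evidence, flipping `confirmed` to True removes the question mark without restructuring the rest of the map.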

This issue is a classic proverbial can of worms.  Up to a billion dollars could be at stake in the lawsuit and the man who filed the lawsuit could get a significant cut of the payout.  There are potential testing requirement issues that need to be considered if the guardrails are passing crash tests, but causing injuries.  There are concerns over whether the company properly informed the federal government about design changes, which is a particularly sensitive topic following the recent GM ignition switch issues.  All in all, this should be a very interesting topic to follow as it plays out.

To view a high level Cause Map of this issue, click on “Download PDF” above.