Years of Uncontrolled Leakage Lead to Fatal Mall Collapse

By ThinkReliability Staff

The problems that led to the collapse of a shopping mall’s parking structure were present throughout its thirty-plus-year history, says the Report of the Elliot Lake Commission of Inquiry. Multiple opportunities to fix the problem were missed, culminating in the deaths of two people on June 23, 2012. Says the report, “Although it was rust that defeated the structure of the Algo Mall, the real story behind the collapse is one of human, not material failure.”

Yes, corrosion of a connection supporting the parking garage decreased its strength to 13% of its original capacity, meaning that on that fateful day, one car driving over it resulted in its fatal collapse. But the more important story is that of how the corrosion was allowed to increase unchecked, due to leakage that had been noted since the opening of the mall.

The inquiry identified multiple causes of the fatal collapse. The report that addresses them and suggests improvements is more than 1,000 pages long. Though the detail in the report is outstanding, its findings can be diagrammed in a Cause Map, or visual root cause analysis, giving a one-page overview that clearly shows the cause-and-effect relationships.

It’s important to begin with the impact to the goals. Doing so gives a starting point – and focus – to the cause-and-effect questioning. In this case, the safety goal was impacted due to the 2 fatalities and 19 injuries caused by the collapse. The mall experienced severe damage, and the rescue and response efforts were comprehensive and time-consuming. Additionally, an engineer was criminally charged with negligence related to the mall’s structural integrity.

The fatalities, property damage, and rescue efforts all resulted from the catastrophic collapse of the mall’s rooftop parking structure. The collapse was caused by the sudden failure of a connector. Material failure results from stress on an object overcoming the strength of the object. In this case the stress on the object was a single vehicle driving over the connection in question (evidenced by a video of the collapse). The strength of the connection had been significantly reduced due to corrosion, caused by the continuous ingress of water and chlorides on the unprotected beam.
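The stress-versus-strength reasoning above can be sketched in a few lines of Python. Only the 13% figure comes from the report; the function name, the load, and the original capacity are hypothetical values in arbitrary units, chosen purely for illustration.

```python
def connection_fails(load, original_capacity, remaining_fraction):
    """Return True when the applied load exceeds the remaining strength.

    All quantities are in the same arbitrary units. The numbers used
    below are illustrative assumptions, not measurements from the report.
    """
    remaining_capacity = original_capacity * remaining_fraction
    return load > remaining_capacity


# Hypothetical connection rated for 100 units, loaded by a 20-unit vehicle.
as_built = connection_fails(load=20, original_capacity=100, remaining_fraction=1.0)
corroded = connection_fails(load=20, original_capacity=100, remaining_fraction=0.13)
print(as_built, corroded)
```

Under these assumed numbers the same load is harmless on the as-built connection and fatal on the corroded one, which is why the analysis follows the corrosion rather than the car.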

The leakage was found to stem from a faulty initial design of the waterproofing system from construction of the mall in 1979. Specifically, the architect’s suggestions regarding waterproofing were ignored due to cost and land availability concerns, and the waterproofing system was installed during suboptimal weather because of construction delays. After construction, the architect signed off on the design without inspecting the site, the first in a long list of failings that would eventually cost two women their lives.

Over the years, there were multiple warnings (not least the need to use buckets to collect leaking water on a fairly constant basis) that were never resolved. According to the report, the problem was never fully addressed with maintenance and repairs but rather pushed off with cheap, ineffective repairs or by selling the structure (as happened twice in its history). For the most part, the local government did not investigate complaints or enforce building standards, apparently unwilling to interfere with the operation of a large source of local revenue and employment.

When the local government finally did get involved and issued an Order to Remedy in 2009, the building owner appeared to provide deliberately false information that suggested that repairs were underway, leading to a rescinding of the order later that year. After an anonymous complaint in late 2011, an engineer with a suspended license performed a visual-only inspection which had to be signed off by a licensed engineer. After it was signed, the engineer testified that he had changed the contents of the report at the request of the owner, leading to the criminal charges against him for negligence.

Although plenty of failings were discussed in the report, it states very clearly, “This Commission’s role is not to castigate or chastise; its only purpose in finding fault, if it must, is to seek to prevent recurrence. Criticism of prevailing practices serves only to suggest their improvement or, if necessary, elimination.” In the report, the Commission discusses multiple suggestions for improvement – specifically clarifying, enforcing, and providing public information regarding building standards. Hopefully, the lessons learned from this tragic accident will allow for implementation of these solutions to ensure that thirty years of negligence isn’t allowed to cause a fatal building collapse again.

Software Error Causes 911 Outage

By Kim Smiley

On April 9, 2014, more than 6,000 calls to 911 went unanswered.  The problem was spread across seven states and went on for hours.  Calling 911 is one of those things that every child is taught and every person hopes they will never need to do –  and having emergency calls go unanswered has the potential to turn into a nightmare.

The Federal Communications Commission (FCC) investigated this 911 outage and has released a study detailing what went wrong on that day in April.  The short answer is that a software error led to the unanswered calls, but there is nearly always more to the story than a single “root cause”.  A Cause Map, an intuitive format for performing a root cause analysis, can be used to better understand this issue by visually laying out the causes (plural) that led to the outage.

There are three steps in the Cause Mapping process. The first is to define an issue by completing an Outline that documents the basic background information and how the problem impacts the overall goals.  Most incidents impact more than one goal and this issue is no exception, but for simplicity let’s focus on the safety goal.  The safety goal was impacted because there was the potential for deaths and injuries.  Once the Outline is completed (including the impacts to the goals), the Cause Map is built by asking “why” questions.

The second step of the Cause Mapping process is to analyze the problem by building the Cause Map.  Starting with the impacted safety goal – “why” was there the potential for deaths and injuries?  This occurred because more than 6,000 911 calls were not answered.   An automated system was designed to answer the calls and it wouldn’t accept new calls for hours.  There was a bug in the automated system’s software AND the issue wasn’t identified for a significant period of time.  The error occurred because the software used a counter with a pre-set limit to assign calls a tracking number.  The counter hit the limit and couldn’t assign a tracking number so it quit accepting new calls.
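The counter failure described above can be illustrated with a minimal Python sketch. This is not the actual 911 routing software: the class and function names are hypothetical, and the limit of 3 is an arbitrarily small value chosen so the behavior is easy to see.

```python
class TrackingIdExhausted(Exception):
    """Raised when the counter has no tracking numbers left to assign."""


class TrackingCounter:
    """Hands out sequential tracking numbers up to a pre-set limit."""

    def __init__(self, limit):
        self.limit = limit
        self.next_id = 0

    def assign(self):
        if self.next_id >= self.limit:
            raise TrackingIdExhausted(f"limit of {self.limit} reached")
        tracking_id = self.next_id
        self.next_id += 1
        return tracking_id


def route_calls(counter, incoming_calls):
    """Accept calls while tracking numbers remain; reject the rest."""
    accepted, rejected = [], []
    for call in incoming_calls:
        try:
            accepted.append((counter.assign(), call))
        except TrackingIdExhausted:
            rejected.append(call)
    return accepted, rejected


counter = TrackingCounter(limit=3)  # hypothetical, deliberately tiny limit
accepted, rejected = route_calls(counter, ["a", "b", "c", "d", "e"])
print(len(accepted), "accepted;", rejected, "rejected")
```

Once the counter is exhausted, every later call is rejected until someone intervenes, which is why the delay in noticing the problem mattered as much as the bug itself.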

The delay in identifying the problem is also important to capture in the investigation because the problem would have been much less severe if it had been found and corrected more quickly. Any 911 outage is a problem, but one that lasts 30 minutes is less alarming than one that plays out over 8 hours. In this example, the system identified the issue and issued alerts, but categorized them as “low level” so they were never flagged for human review.

The final step in the Cause Mapping process is to develop and implement solutions to reduce the risk of the problem recurring. In order to fix the issues with the software, the pre-set limit on the counter has been increased and will be checked periodically to ensure that the maximum isn’t hit again. Additionally, to help improve how quickly a problem is identified, an alert has been added to notify operators when the number of successful calls falls below a certain percentage.
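The second solution, alerting on the share of successful calls, might look something like the sketch below. The function name and the 95% threshold are assumptions made for illustration; the FCC study as summarized here says only that the alert fires below "a certain percentage".

```python
def success_rate_alert(attempted, answered, min_success_ratio=0.95):
    """Return an alert message when too few calls are answered, else None.

    min_success_ratio is a hypothetical threshold, not the value used
    by the real system.
    """
    if attempted == 0:
        return None  # no traffic in this window, nothing to evaluate
    ratio = answered / attempted
    if ratio < min_success_ratio:
        return f"CRITICAL: only {ratio:.0%} of calls answered, notify operators"
    return None


print(success_rate_alert(attempted=200, answered=198))
print(success_rate_alert(attempted=200, answered=50))
```

The design point is that the alert is tied to the outcome operators care about (calls being answered) and is always escalated to a person, rather than depending on how the software classifies its own internal errors.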

New issues will likely continue to crop up as emergency systems move toward internet-powered infrastructure, but hopefully the systems will become more robust as lessons are learned and solutions are implemented.  I imagine there aren’t many experiences more frightening than frantically calling 911 for help and having no one answer.

To view a high level Cause Map of this issue, including a completed Outline, click on “Download PDF” above.

Lawsuit Questions the Safety of Guardrails

By Kim Smiley

A whistleblower lawsuit claims that tens of thousands of guardrails installed across the US may be unsafe.  The concern is that the specific design of the guardrail in question, the ET-Plus, can jam when hit and puncture cars, potentially causing injury, rather than curling away as intended.

This issue has more questions than answers at this point, but an initial Cause Map can be built to document what is currently known.  A question mark should be added to any cause that is suspected, but has not been proven with evidence.  As more information, both new causes and evidence, becomes available the Cause Map can easily be expanded to incorporate it.

In this example, the primary concern about the guardrails, from both a safety and a regulatory standpoint, is centered on a design change made in 2005. The size of the energy-absorbing end terminal was changed from five inches to four. The modification was apparently made as a cost-saving measure. The lawsuit alleges that federal authorities were never alerted to the design change so it never received the required review and approval. It appears that federal authorities were not alerted until a patent case brought up the issue in 2012.

The reduction in the size of the end terminals may have affected how the guardrails function during auto accidents.  The lawsuit claims that five deaths and other injuries from at least 14 auto accidents can be attributed to the new design of guardrails.  The Federal Highway Administration has stated that the guardrails meet crash-test criteria, but three states (Missouri, Nevada and Massachusetts) are taking the concerns seriously enough to ban further installation of the guardrails pending completion of the investigation.

This issue is a proverbial can of worms. Up to a billion dollars could be at stake in the lawsuit and the man who filed the lawsuit could get a significant cut of the payout. There are potential testing requirement issues that need to be considered if the guardrails are passing crash tests, but causing injuries. There are concerns over whether the company properly informed the federal government about design changes, which is a particularly sensitive topic following the recent GM ignition switch issues. All in all, this should be a very interesting topic to follow as it plays out.

To view a high level Cause Map of this issue, click on “Download PDF” above.

Two Firefighters Killed by Rogue Welding

By ThinkReliability Staff

On March 26, 2014, two firefighters were killed when trapped in a basement by a quickly spreading, very dangerous fire in Boston, Massachusetts. These firefighters appear to have been the first to succumb to injuries directly caused by fire while on the job in 2014. The company that was found responsible for starting the fire has been fined by OSHA for failure to follow safety procedures. Says Brenda Gordon, Occupational Safety and Health Administration (OSHA)’s director for Boston and southeastern Massachusetts, “This company’s failure to implement these required, common-sense safeguards put its own employees at risk and resulted in a needless, tragic fire.”

Every incident that results in a fatality should be carefully investigated. Investigations are used not only for liability and regulatory reasons, but also to develop solutions to reduce the risk of similar fatalities happening in the future. Investigating an incident such as this in a Cause Map, or visual root cause analysis, allows for better solutions by determining all the cause-and-effect relationships that led to the issue.

First it’s important to define how goals were impacted in order to define the scope of the problem. In this case, two firefighters were killed, which impacts the safety goal. In addition, the spread of the fire, damage of nearby buildings and associated civil lawsuits are also impacts to the goals. The OSHA fine of $58,000 for 10 violations of workplace safety regulations is an impact to the regulatory goal. The response to the fire, as well as the multiple investigations, are impacts to the labor/time goal.

Beginning with an impacted goal and asking “Why” questions develops cause-and-effect relationships that explain how the incident occurred. In this case, the firefighters perished when they were trapped by fire. The firefighters were in the basement of a residential building to rescue occupants from a fire, and the fire was so hot and dangerous that the firefighters could not exit, and other firefighters were unable to come to their rescue. The fire was started by a welding spark that struck a nearby wood shed, and extremely windy conditions spread it. OSHA investigators note that the company performing the welding did not follow safety precautions (including having a fire watcher and moving welding away from flammable objects) that would have reduced the risk of fire. They cited the lack of an effective fire prevention/protection program and a lack of training in workplace and fire safety. View the Cause Map by clicking “Download PDF” above.

Ideally the fine levied by OSHA will encourage the company involved to increase its methods of fire protection, not only to protect its own workers, but also to protect the public. In addition, the Boston Fire Department is conducting an internal review to improve firefighter safety. Says Steve MacDonald, spokesman, “What they’re doing is looking at policies and procedures. They’re reviewing everything, reviewing weather, radio communications, anything and everything having to do with the fire.”

On July 5th, another firefighter died after being trapped in a building while looking for occupants during a fire in Brooklyn, New York. On July 9th, a firefighter in Houston, Texas died of smoke inhalation inside a burning building. A firefighter died in a building collapse due to fire in New Carlisle, Indiana on August 5, 2014, making a total of 5 firefighters who have died as a direct result of smoke/fire injuries in the line of duty so far in 2014. In 2013, a total of 30 firefighters were killed on the job, most as the result of the Yarnell Hill fire in Arizona.

Fire at FAA Facility Sparks Flight Havoc

By Kim Smiley 

On Friday September 26, 2014, air traffic was grounded for hours in the Chicago region following a fire in a Federal Aviation Administration facility in Aurora, Illinois. The snarl of flight issues impacted thousands of travelers in the days following the fire as airports struggled to deal with the aftermath of more than 4,000 canceled flights and thousands more delayed.

A Cause Map, a format for performing a visual root cause analysis, can be used to analyze this issue.  To build a Cause Map, the first step is to define the problem by determining how the overall organizational goals are impacted.  In this example, there is a significant customer service impact because thousands of passengers had their travel plans disrupted. The flight cancelations and delays can be considered an impact to the production/schedule goal.  The amount of time and energy needed to address the flight disruptions along with the investigation into the issue would also be impacts to the labor goal.  Once the impacts to the goals are determined, the Cause Map is built by asking “why” questions and visually laying out the answers to show the cause-and-effect relationship.

Thousands of flights were canceled because air traffic control was unable to support them.  Air traffic control couldn’t perform their usual function because there was a fire in a building that provided air traffic support for a large portion of the upper Midwest and it wasn’t possible to quickly provide air traffic support from another location. Focusing on the fire itself first, the fire appears to have been intentionally set by a contractor who worked in the building.  He was able to bring in flammable materials and start a fire without anyone stopping him.  Police are still investigating his motives, but he has been charged with a felony. The building was evacuated once the fire was discovered and employees obviously couldn’t perform their usual duties during that time.  Additionally, the fire damaged equipment so air traffic control functionality could not be quickly restored once the initial crisis was addressed and it was safe to return to the building.

The second portion of the issue is that there wasn’t a way to support air traffic once the building was evacuated.  Once the fire occurred, all flights were grounded because there wasn’t air traffic control support and it was not possible to quickly get air traffic moving again.
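The cause-and-effect chain described above can also be written down as a simple data structure. The following is a rough sketch, not ThinkReliability's tooling: the dictionary layout and function names are hypothetical, and the cause labels are paraphrased from the text, with the suspected (not yet proven) cause marked as apparent.

```python
# Each effect maps to the causes that answer "why did this happen?".
CAUSE_MAP = {
    "Thousands of flights canceled": [
        "Air traffic control unable to support flights",
    ],
    "Air traffic control unable to support flights": [
        "Fire in FAA facility",
        "No quick way to provide support from another location",
    ],
    "Fire in FAA facility": [
        "Fire apparently set by a contractor",
    ],
    "Fire apparently set by a contractor": [
        "Flammable materials brought in without anyone stopping him",
    ],
}


def ask_why(cause_map, effect, depth=0):
    """Walk the map depth-first, returning (depth, cause) pairs."""
    chain = [(depth, effect)]
    for cause in cause_map.get(effect, []):
        chain.extend(ask_why(cause_map, cause, depth + 1))
    return chain


def leaf_causes(cause_map, effect):
    """Return the causes with no further 'why' answered yet."""
    causes = cause_map.get(effect, [])
    if not causes:
        return [effect]
    leaves = []
    for cause in causes:
        leaves.extend(leaf_causes(cause_map, cause))
    return leaves


for depth, cause in ask_why(CAUSE_MAP, "Thousands of flights canceled"):
    print("  " * depth + cause)
```

The two leaves of this small map correspond to the two portions of the issue, the fire itself and the lack of backup air traffic support, and each is a place where a solution can be attached.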

The final step in the Cause Mapping process is to develop and implement solutions to reduce the risk of a similar problem. Lawmakers have called for an investigation into this issue to see if there is sufficient redundancy in the air traffic control system. In an ideal situation, a fire or other crisis at any single location would not cripple US air traffic to the extent that this issue did. The investigation is also looking into the fire and reviewing the security at the facility to see if stricter restrictions should be put in place, such as ensuring that no employees work alone or searching bags as workers access the site.

This situation is also a strong reminder that organizations need to have a plan in place for what to do when a failure occurs. There was a previous fire scare at this same location earlier in 2014, when a smoking ceiling fan resulted in an evacuation and flight delays (see previous blog). That scare should have prompted some serious consideration of what the contingency plan should be if this facility were ever out of commission.

I was one of those people standing in line for hours at an airport on Friday morning after my flight was canceled. And I for one would love to see the air traffic control system become more robust and better able to deal with the inevitable hiccups that occur. It’s impossible to prevent every potential problem and another intentional fire in an FAA facility seems pretty farfetched, but it is possible to have a better plan in place to deal with issues that may arise. The potential consequences of any single failure can be limited with a good plan and quick implementation of that plan.