Category Archives: Root Cause Analysis – Incident Investigation

NTSB recommends increased oversight of DC Metro

By Kim Smiley

On September 30, 2015, the National Transportation Safety Board (NTSB) issued urgent safety recommendations calling for the Federal Railroad Administration to take over oversight of the Washington, DC Metro system. The NTSB has determined that the body presently charged with overseeing it, the Tri-State Oversight Committee, does not provide adequate independent safety oversight. Specifically, the committee lacks enforcement authority and has no regulatory power to issue orders or levy fines.

The recommendations resulted from findings from the ongoing investigation into a smoke and electrical arcing accident in a Metro tunnel that killed one passenger and sent 86 others to the hospital. (To learn more, read our previous blog "Passengers trapped in smoke-filled metro train".) The severity of the damage to the components involved in the arcing incident has made it difficult to identify exactly what caused the arcing to occur, but the investigation uncovered problems with other electrical connections in the system that could lead to similar issues if not fixed.

Investigators found that some electrical connections were improperly constructed and/or installed, leaving them vulnerable to moisture and contaminants that put them at risk of short circuiting. These issues were not identified prior to this investigation, which raises further questions about the Metro's inspection and maintenance programs. Although the final report on the incident has not been completed, the NTSB issued recommendations in June to address these electrical short circuit hazards because they required "immediate action" to ensure safety.

Investigators have found other issues with the aging DC Metro system, such as leaks allowing significant amounts of water into the tunnels, inadequate ventilation and questions about the adequacy of staff training. The final report on the deadly arcing incident will include recommendations that go far beyond fixing one electrical issue on one run of track.

This example is a great illustration of how digging into the details of one specific problem will often reveal information about how to improve reliability across an organization. It may seem overwhelming to tackle organization-wide improvements, but often the best way to start is to investigate one issue and dig down into the details.

5 killed and dozens injured when duck tour boat collides with bus

By Kim Smiley

Five people were killed and dozens more injured when an amphibious Ride the Ducks tour boat collided with a charter bus in Seattle on September 24, 2015. The circumstances of the accident were particularly unfortunate because both vehicles were large and were carrying tour groups across a busy bridge. Traffic was snarled for hours as emergency responders worked to treat the high number of victims, investigate the accident and clear the roadway.

The National Transportation Safety Board (NTSB) is investigating the accident to determine exactly what led to the collision and whether there are lessons learned that could help reduce the risk of a similar crash in the future. Potential issues with the duck boat are an early focus of the investigation. In case you are unfamiliar, duck boats are amphibious landing craft used by the U.S. Army during World War II that have been refurbished as tour vehicles; able to travel on both water and land, they give visitors a unique way to experience a city. Their military designation, DUKW, was changed to the more user-friendly "duck boat" moniker used by many tour companies throughout the world.

Eyewitnesses have reported that the duck boat unexpectedly swerved while crossing the bridge, slamming into the driver's side of the tour bus. Reports are that the left front wheel of the duck boat locked up and the driver lost control of the vehicle. NTSB investigators have stated that the duck boat did not receive an axle repair recommended in 2013, and they are working to determine whether this played a role in the accident.

Investigators are also looking into whether or not Seattle Ride the Ducks was notified of the repair.  Photos of the wrecked duck boat show that the front axle sheared and the left wheel popped off the vehicle, but it hasn’t been conclusively determined whether the damage was the cause of the accident or occurred during the accident.  The issues with the axle certainly seem like a smoking gun, but a thorough investigation still needs to be performed and the process will take up to a year.  If there was a mechanical failure on the duck boat unrelated to the already identified axle issue, that will need to be identified and reviewed to see if it applies to other duck tour vehicles.

The severity of this accident is raising concerns about the overall safety of duck tours. The duck boat involved underwent regular annual inspections and was found to meet federal standards. If a mechanical failure was in fact involved, hard questions about the adequacy of standards and inspections will need to be asked. The recommended repair that was never performed also raises questions about how such recommendations are passed along to companies running duck boat tours and incorporated into inspection standards.

Click on “Download PDF” above to see an outline and Cause Map of this issue.

Volkswagen admits to use of a ‘defeat device’

By Kim Smiley

The automotive industry was recently rocked by Volkswagen’s acknowledgement that the company knowingly cheated on emissions testing of several models of 4-cylinder diesel cars starting in 2009.  The diesel cars in question include software “defeat devices” that turn on full emissions control only during emissions testing.  Full emissions control is not activated during normal driving conditions and the cars have been shown to emit as much as 40 times the allowable pollution.   Customers are understandably outraged, especially since many of them purchased a “clean diesel” car in an effort to be greener.

The investigation into this issue is ongoing and many details aren't known yet, but an initial Cause Map, a visual format for performing a root cause analysis, can be created to document and analyze what is known. The first step in the Cause Mapping process is to fill in a Problem Outline with the basic background information and how the issue impacts the overall organizational goals. The "defeat device" issue is a complex problem and impacts many different organizational goals. The increased emissions obviously impact the environmental goal, and the potential health effects of those emissions are an impact to the safety goal. Some specific details are still unknown, such as the exact amount of the fines the company will face, but we can safely assume the company will be paying significant fines (on the order of billions) as a result of this blatant violation of the law. The Volkswagen stock price also took a major hit, dropping more than 20 percent following the announcement of the diesel emissions issues. It is difficult to quantify how much the loss of consumer confidence will hurt the company long-term, but being perceived as dishonest by many will certainly impact sales. A large recall that will be both time-consuming and costly is also in Volkswagen's future. Depending on the investigation findings, there is also the potential for criminal prosecution because of the intentional nature of this issue.

Once the overall impacts to the goals are defined, the actual Cause Map can be built by asking "why" questions. So why did these cars include "defeat devices" to cheat on emissions tests? The simple answer is increased profits. Designing cars that appeared to have much lower emissions than they did in reality allowed Volkswagen to market a more desirable car. Car design has always involved a trade-off between emissions and performance. Detailed information hasn't been released yet, but it is likely that the cars had better fuel economy and driving performance during normal driving conditions when full emissions control wasn't activated. Whoever was involved in the design of the "defeat device" also likely assumed the deception would never be discovered, which raises concerns about how emissions testing is performed.

The "defeat device" is believed to work by taking advantage of conditions unique to emissions testing. During normal driving, the steering column moves as the driver steers the car; during an emissions test, the wheels rotate but the steering column doesn't move. The "defeat device" software appears to have monitored the steering column and wheels to sense when conditions indicated an emissions test was occurring. When the wheels turned without corresponding steering wheel motion, the software turned the catalytic scrubber up to full power, reducing emissions and allowing the car to pass the test. Details on how the "defeat device" was developed and approved for inclusion in the design haven't been released, but hopefully the investigation will be insightful and help explain exactly how something this far over the line occurred.
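
To make the reported detection logic concrete, here is a minimal sketch in Python. It assumes hypothetical sensor inputs (wheel speed and steering angle) and made-up thresholds; the actual software and calibration values have not been made public.

```python
def emissions_test_detected(wheel_speed_kph, steering_angle_deg,
                            speed_threshold=10.0, steering_threshold=2.0):
    """Return True if conditions resemble a dynamometer emissions test.

    Hypothetical logic based on public reporting: during a dyno test the
    drive wheels turn while the steering wheel stays essentially still.
    Thresholds here are illustrative, not actual calibration values.
    """
    wheels_turning = wheel_speed_kph > speed_threshold
    steering_still = abs(steering_angle_deg) < steering_threshold
    return wheels_turning and steering_still


def select_emissions_mode(wheel_speed_kph, steering_angle_deg):
    # Full emissions control only when a test appears to be in progress;
    # otherwise run the calibration favoring performance and economy.
    if emissions_test_detected(wheel_speed_kph, steering_angle_deg):
        return "full_emissions_control"
    return "normal_driving_calibration"
```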

Only time will tell exactly how this issue impacts the overall health of the Volkswagen company, but the short-term effects are likely to be severe. This issue may also have far-reaching impacts on the diesel market as consumer confidence in the technology is shaken.

To view an Outline and initial Cause Map of this issue, click on “Download PDF” above.

Waste Released from Gold King Mine

By Renata Martinez

On August 5, 2015, over 3 million gallons of waste were released from Gold King Mine into Cement Creek, which then flowed into the Animas River. The orange-colored plume moved over 100 miles downstream from Silverton, Colorado through Durango, reaching the San Juan River in New Mexico and eventually making its way to Lake Powell in Utah (although the EPA stated that the leading edge of the plume was no longer visible by the time it reached Lake Powell a week after the release occurred).

Some of the impacts were immediate.  No workers at the mine site were hurt in the incident but the collapse of the mine opening and release of water can be considered a near miss because there was potential for injuries. After the release, there were also potential health risks associated with the waste itself since it contained heavy metals.

Water sources along the river were impacted and there’s potential that local wells could be contaminated with the waste.   To mitigate the impacts, irrigation ditches that fed crops and livestock were shut down.  Additionally, the short-term impacts include closure of the Animas River for recreation (impacting tourism in Southwest Colorado) from August 5-14.

The long-term environmental impacts will be evaluated over time, but it appears that the waste may damage ecosystems in and along the plume’s path. There are ongoing investigations to assess the impact to wildlife and aquatic organisms, but so far the health effects from skin contact or incidental ingestion of contaminated river water are not considered significant.

“Based on the data we have seen so far, EPA and the Agency for Toxic Substances and Disease Registry (ATSDR) do not anticipate adverse health effects from exposure to the metals detected in the river water samples from skin contact or incidental (unintentional) ingestion. Similarly, the risk of adverse effects to livestock that may have been exposed to metals detected in river water samples from ingestion or skin contact is low. We continue to evaluate water quality at locations impacted by the release.”

The release occurred while the EPA was working to stabilize the existing adit (a horizontal shaft into a mine used for access or drainage). The pressure from the pool of waste that had accumulated in the mine overcame the strength of the adit, releasing the water into the environment. The EPA's scope of work at Gold King Mine also included assessing the ongoing leaks from the mine to determine if the discharge could be diverted to retention ponds at the Red and Bonita sites.

The wastewater had been building up since the adit collapsed in 1995. Networks of tunnels allow water to flow easily between the estimated 22,000 mine sites in Colorado. As water flows through the sites, it reacts with pyrite and oxygen to form sulfuric acid. When the untreated acidic water contacts naturally occurring minerals such as zinc, lead, cadmium, copper and aluminum, it breaks down these heavy metals and carries them along in the waste. The mines involved in this incident were known to have been leaking waste for years. In the 1990s, the EPA agreed to postpone adding the site to the Superfund National Priorities List (NPL) so long as progress was made to improve the water quality of the Animas River. Water quality improved until about 2005, at which point it was re-assessed. Again in 2008, the EPA postponed efforts to include this area on the NPL. From the available information, it's unclear whether this area and the waste pool would have been treated if the site had been on the NPL.
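
For readers curious about the chemistry, a brief worked equation may help. This is the generic pyrite-oxidation reaction commonly cited for acid mine drainage, offered as an illustration rather than a measurement from this site:

```latex
% Generic pyrite oxidation driving acid mine drainage (illustrative)
2\,\mathrm{FeS_2} + 7\,\mathrm{O_2} + 2\,\mathrm{H_2O}
  \;\longrightarrow\; 2\,\mathrm{Fe^{2+}} + 4\,\mathrm{SO_4^{2-}} + 4\,\mathrm{H^+}
```

The resulting acidic water is what leaches heavy metals out of the surrounding rock as it moves through the mine workings.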

In response, the "EPA is working closely with first responders and local and state officials to ensure the safety of citizens to water contaminated by the spill." Additionally, retention ponds have been built below the mine site to treat the water and continued sampling is taking place to monitor the water.

So how do we prevent this from happening again? Mitigation efforts to prevent the release were unsuccessful, perhaps because the amount of water contained in the mine was underestimated. If the amount of water in the mine had been anticipated (and the risk made more obvious), perhaps the excavation work could have been planned differently to prevent the collapse of the tunnel. As a local resident, I'm especially curious to learn more facts about the specific incident (how and why it occurred) and how we are going to prevent it from recurring.

The EPA has additional information available (photos, sampling data, historic mine information) for reference: http://www2.epa.gov/goldkingmine

Spider in air monitoring equipment causes erroneously high readings

By Kim Smiley

Smoke drifting north from wildfires in Washington state has raised concerns about air quality in Calgary, but staff decided to check an air monitoring station after it reported an alarming rating of 28 on a 1-10 scale.  What they found was a bug, or rather a spider, in the system that was causing erroneously high readings.

The air monitoring station measures the amount of particulate matter in air by shining a beam of light through an air sample. The less light that makes it through the sample, the more particulates in the sample and the worse the air quality. You can see the problem that would arise if the beam of light were blocked by a spider.
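
As a rough illustration of this measurement principle (not the station's actual algorithm), the fraction of light transmitted can be converted to a particulate estimate using a Beer-Lambert-style relationship. The calibration constant below is made up for the example.

```python
import math

def particulate_estimate(transmitted_intensity, source_intensity,
                         calibration_constant=50.0):
    """Estimate relative particulate loading from light attenuation.

    Beer-Lambert-style relationship: concentration is proportional to
    -ln(I/I0). The calibration constant is illustrative only.
    """
    fraction_transmitted = transmitted_intensity / source_intensity
    return calibration_constant * -math.log(fraction_transmitted)

# A spider blocking most of the beam looks like extremely dirty air:
print(particulate_estimate(90, 100))   # light haze -> small number
print(particulate_estimate(5, 100))    # beam mostly blocked -> huge number
```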

This example is a great reminder not to rely solely on instrument readings. Instruments are obviously useful tools, but their output should always be run through a common-sense check. Does it make sense that the air quality would be so far off the scale? If there is any question about the accuracy of readings, the instrument should probably be checked, because the unexpected sometimes happens.

In this case, inaccurate readings of 10+ were reported by both Environment Canada and Alberta Environment before the issue was discovered and the air quality rating was adjusted down to a 4.  Ideally, the inaccurate readings would have been identified prior to posting potentially alarming information on public websites.  The timing of the spider’s visit was unfortunate because it coincided with smoky conditions that made the problem more difficult to identify, but extremely high readings should be verified before making them public if at all possible.

Adding a verification step for very high readings prior to publicly posting the information is one potential solution to reduce the risk of a similar problem recurring. A second air monitoring station could also create a built-in double check, because an error would be more obvious if the two stations didn't have similar readings.
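
A simple sketch of what such a check might look like is shown below; the thresholds and two-station comparison are hypothetical, not how Environment Canada or Alberta Environment actually publish their data. Readings are only released automatically when they are on scale and when two nearby stations roughly agree.

```python
def ready_to_publish(station_a_reading, station_b_reading,
                     scale_max=10.0, max_disagreement=3.0):
    """Decide whether air-quality readings can be posted automatically.

    Hypothetical gate: hold anything above the reporting scale, or any
    pair of nearby stations that disagree badly, for manual verification.
    """
    on_scale = max(station_a_reading, station_b_reading) <= scale_max
    stations_agree = abs(station_a_reading - station_b_reading) <= max_disagreement
    return on_scale and stations_agree

print(ready_to_publish(4.2, 5.0))   # True: plausible smoky-day readings
print(ready_to_publish(28.0, 4.0))  # False: off-scale and inconsistent -> check the instrument
```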

Depending on how often insects and spiders crawl into the air monitoring equipment, the equipment itself could be modified to reduce the risk of a similar problem recurring in the future.

To view a Cause Map, a visual root cause analysis, of this issue, click on “Download PDF” above.

Power grid near Google datacenter struck by lightning 4 times

By Kim Smiley

A small amount of data was permanently lost at a Google datacenter after lightning struck the nearby power grid four times on August 13, 2015. About five percent of the disks in Google's europe-west1-b cloud zone were impacted by the lightning strikes, but nearly all of the data was eventually recovered; less than 0.000001% of the stored data could not be recovered.

A Cause Map, or visual root cause analysis, can be built to analyze this issue. The first step in the Cause Mapping process is to fill in an Outline with the basic background information such as the date, time and specific equipment involved. The bottom of the Outline has a spot to list the impacted goals, which helps define the scope of an issue. The impacted goals, listed in red boxes, become the first cause boxes on the Cause Map. "Why" questions are then asked to expand the Cause Map and visually lay out the cause-and-effect relationships.

For this example, the customer service goal was impacted because some data was permanently lost. Why did this happen? Data was lost because datacenter equipment failed, the data was stored on a less stable storage system, and it wasn't yet duplicated in another location. Google has stated that the lost data was newly written data located on storage systems that were more susceptible to power failures. The datacenter equipment failed because the nearby power grid was struck by lightning four times and was damaged. Additionally, the automatic auxiliary power systems and backup battery were not able to prevent data loss after the lightning damage.

When more than one cause is required to produce an effect, all the causes are listed vertically and separated by an "and". You can click on "Download PDF" above to see a high-level Cause Map of this issue that shows how an "and" can be used to build a Cause Map. A more detailed Cause Map could be built to include all the technical details of exactly why the datacenter equipment failed; this would be useful to the engineers developing detailed solutions.
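
To illustrate how these "and" relationships might be captured in a simple data structure (a sketch for illustration, not ThinkReliability's software), each effect can be mapped to the list of causes that were all required for it to occur:

```python
# Simplified sketch of the cause-and-effect relationships described above.
# Each effect maps to a list of causes; multiple entries mean the causes
# were joined by an "and" (all were required to produce the effect).
cause_map = {
    "customer service goal impacted": ["data permanently lost"],
    "data permanently lost": [
        "datacenter storage equipment failed",
        "data stored on system more susceptible to power failure",
        "data not yet duplicated in another location",
    ],
    "datacenter storage equipment failed": [
        "nearby power grid struck by lightning four times",
        "auxiliary power and battery backup did not prevent the loss",
    ],
}

def print_causes(effect, depth=0):
    """Walk the map from an impacted goal down through its causes."""
    print("  " * depth + effect)
    for cause in cause_map.get(effect, []):
        print_causes(cause, depth + 1)

print_causes("customer service goal impacted")
```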

The final step in the Cause Mapping process is to develop solutions to reduce the risk of a problem recurring in the future. For this example, Google has stated that it is upgrading the datacenter equipment so that it is more robust against similar power disruptions in the future. Google also stated that customers should back up essential data so that it is stored in another physical location to improve reliability.

Few of us design datacenter storage systems, but this incident is a good reminder of the importance of having a backup. If data is essential to you or your business, make sure there is a backup stored in a physically separate location from the original. As the "unsinkable" Titanic showed, it is always a good idea to include enough lifeboats, or backups, in a design just in case something you didn't expect goes wrong. Sometimes lightning strikes four times, so it's best to be prepared.
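
As a minimal illustration of that advice (hypothetical paths, standard-library calls only), a backup copy can be written to a second location and verified by checksum before it is trusted:

```python
import hashlib
import shutil
from pathlib import Path

def backup_and_verify(source, backup_dir):
    """Copy a file to a separate location and confirm the copy matches.

    In practice the backup directory should live on physically separate
    hardware (another site or cloud region); the paths here are examples.
    """
    source = Path(source)
    destination = Path(backup_dir) / source.name
    shutil.copy2(source, destination)

    def sha256(path):
        return hashlib.sha256(path.read_bytes()).hexdigest()

    if sha256(source) != sha256(destination):
        raise RuntimeError(f"Backup of {source} failed verification")
    return destination

# Example usage (paths are placeholders):
# backup_and_verify("critical_data.db", "/mnt/offsite_backup")
```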

Explosions raise concern over hazardous material storage

By ThinkReliability Staff

On August 12, a fire began at a storage warehouse in Tianjin, China. More than a thousand firefighters were sent in to fight the fire. About an hour after the firefighters went in, two huge explosions registered on earthquake monitoring equipment as magnitude 2.3 and 2.9 events, respectively. Follow-on explosions continued, and at least 114 firefighters, workers and area residents have been reported dead so far, with 57 still missing (at this point, most are presumed dead).

Little is known for sure about what caused the initial fire and continuing explosions. What is known is that the fire, explosions and release of hazardous chemicals that were stored on site have caused significant impacts to the surrounding population and rescuers. These impacts can be used to develop cause-and-effect relationships to determine the causes that contributed to an event. It’s particularly important in an issue like this – where so many were adversely affected – to find effective solutions to reduce the risk of a similar incident recurring in the future.

Even with so much information unavailable, an initial root cause analysis can identify many issues that led to an adverse event. In this case, the cause of the initial fire is still unknown, but the site was licensed to handle calcium carbide, which releases flammable gases when exposed to water. If the chemical was present on site, the fire would have continued to spread when firefighters attempted to fight it using water. Contract firefighters, who are described as being young and inexperienced, have said that they weren’t adequately trained for the hazards they faced. Once the fire started, it likely ignited explosive chemicals, including the 800 tons of ammonium nitrate and 500 tons of potassium nitrate stored on site.
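
For context, the standard calcium carbide hydrolysis reaction (general chemistry, not data confirmed from this site) illustrates why applying water to this material can make a fire worse: it releases acetylene, a highly flammable gas.

```latex
% Calcium carbide reacting with water to release flammable acetylene
\mathrm{CaC_2} + 2\,\mathrm{H_2O} \;\longrightarrow\; \mathrm{C_2H_2}\uparrow + \mathrm{Ca(OH)_2}
```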

Damage to the site released those and other hazardous chemicals. More than 700 tons of sodium cyanide were reported to be stored at the site, though it was only permitted 10 tons at a time. Sodium cyanide is a particular problem for human safety. Says David Leggett, a chemical risk consultant, “Sodium cyanide is a very toxic chemical. It would take about a quarter of teaspoon to kill you. Another problem with sodium cyanide is that it can change into prussic acid, which is even more deadly.”

But cleaning up the mess is necessary, especially because there are residents living within 2,000 ft. of the site, despite regulations requiring hazardous sites to be a minimum of 3,200 ft. away from residential areas. Developers who built an apartment building within the exclusion zone say they were told the site stored only common goods. Rain could make the situation worse, both by spreading the chemicals and because of the potential for the released chemicals to react with water.

The military has taken over the response and cleanup. Major General Shi Luze, chief of the general staff of the military region, said, “After on-site inspection, we have found several hundred tons of cyanide material at two locations. If the blasts have ripped the barrels open, we neutralize it with hydrogen peroxide or other even better methods. If a large quantity is already mixed with other debris, which may be dangerous, we have built 1-meter-high walls around it to contain the material — in case of chemical reactions if it rains. If we find barrels that remain intact, we collect them and have police transport them to the owners.”

In addition to sending in a team of hazardous materials experts to neutralize and/or contain the chemicals and restricting public access to the area in hopes of limiting further impact to public safety, state media said authorities were trying to prevent rain from falling, presumably using the same strategies developed to ensure clear skies for the 2008 Summer Olympics. Whether it worked or not hasn't been said, but it did rain on August 18, nearly a week after the blast, leaving a white foam that residents have said creates a burning or itchy sensation on contact.

View an initial Cause Map of the incident by clicking on “Download PDF” above.

Legionnaires’ Disease Outbreak Blamed on Contaminated Cooling Towers

By ThinkReliability Staff

An outbreak of Legionnaires' disease has affected at least 115 people and killed 12 in the South Bronx area of New York City. While Legionnaires', a respiratory disease caused by breathing in vaporized Legionella bacteria, has struck the New York City area before, the magnitude of the current outbreak is catching the area by surprise. (Because vaporization is required, drinking water is safe, as is home air conditioning.) It's also galvanizing a call for actions to better regulate the causes of the outbreak.

It's important when dealing with an outbreak that affects public health to fully analyze the issue and determine all the causes that contributed to the problem. In the case of the current Legionnaires' outbreak, our analysis will be performed in the form of a Cause Map, or visual root cause analysis. We begin by capturing the basic information (what, when and where) about the issue in a problem outline. Because the issue unfolded over months, we will reference the timeline (to view the analysis including the timeline, click on "Download PDF") to describe when the incident occurred. Some important details to note: people with underlying medical conditions and smokers are at higher risk from Legionnaires', and Legionella bacteria are resistant to chlorine. Infection results from breathing in contaminated mist, which has been determined to have come from South Bronx area cooling towers (which are part of the air conditioning and heating systems of some large buildings).

Next we capture the impacts to the goals. The safety goal is impacted due to the 12 deaths and the 115 people infected. The customer service goal is impacted by the outbreak of Legionnaires'. The environmental and property goals are impacted because at least eleven cooling towers in the area have been found to be contaminated with Legionella. The issue is resulting in increased regulation, an impact to the regulatory goal, and in testing and disinfection being performed by at least 350 workers, an impact to the labor goal.

The analysis begins by asking "why" questions from one of the impacted goals. In this case, the deaths resulted from an outbreak of Legionnaires' disease. The outbreak results from exposure to mist from one of the contaminated cooling towers; the design of some cooling towers allows exposure to the mist they produce. It is common for water sources to contain Legionella (which, again, is resistant to chlorine), but certain conditions allow the bacteria to take root: the damp, warm environment found in cooling towers and insufficient cleaning and disinfection. The cost of cleaning is believed to be an issue; studies have found that impoverished areas, like the one affected by this outbreak, are more prone to these types of outbreaks. Additionally, there are insufficient regulations regarding cooling towers. The city does not regularly inspect cooling towers. According to the mayor and the city's deputy commissioner for disease control, there just hasn't been enough evidence to indicate that cooling towers are a potential source of Legionnaires' outbreaks.

Evidence would indicate otherwise, however. A study that researched risk factors for Legionnaires’ in New York City from 2002-2011 specifically indicated that proximity to cooling towers was an environmental risk. A 2010 hearing on indoor air quality discussed Legionella after a failed resolution in 2000 to reduce outbreaks at area hospitals. New York City is no stranger to Legionnaires’; the first outbreak occurred in 1977, just after Legionnaires’ was identified. There have been two previous outbreaks of Legionnaires’ this year. Had there been a look at other outbreaks, such as the 2012 outbreak in Quebec City, cooling towers would have been identified as a definite risk factor.

For now, though the outbreak appears to be waning (no new cases have been reported since August 3), the city is playing catch-up. Though it is requiring all cooling towers to be disinfected by August 20 and plans to increase inspections, right now there isn't even a list of all the cooling towers in the city. Echoing the frustrations of many, Bill Pearson, a member of the committee that wrote standards to address the risk of Legionella in cooling towers, says, "Hindsight is 20-20, but it's not a new disease. And it's not like we haven't known about the risk of cooling towers, and it's not like people in New York haven't died of Legionnaires' before."

Ruben Diaz Jr., Bronx borough president, brings up a good point for other cities that may have Legionella risks from cooling towers: "Why, instead of doing a good job responding, don't we do a good job proactively inspecting?" Let's hope this outbreak will be a call for others to learn from these tragic deaths and take a proactive approach to protecting their citizens from Legionnaires' disease.

Unintended Consequences, Serendipity, and Prawns

By ThinkReliability Staff

The Diama dam in Senegal was installed to create a freshwater reservoir. Unfortunately, that very dam also led to an outbreak of schistosomiasis. This was an unintended consequence: a negative result from something meant to be positive. Schistosomiasis, which weakens the immune system and impairs the operation of organs, is transmitted by parasitic flatworms. These parasitic flatworms are hosted by snails. When the dam was installed, the snails' main predators lost a migration route and died off. Keeping the saltwater out of the river allowed the algae and plants that feed the snails to flourish. A five-why analysis of the issue would go something like this: The safety goal is impacted. Why? Because of an outbreak of schistosomiasis. Why? Because of the increase in flatworms. Why? Because of the increase in snails. Why? Because of the lack of snail predators. Why? Because of the installation of the dam.

Clearly, there’s more to it. We can capture more details about this issue in a Cause Map, or visual form of root cause analysis. First, it’s important to capture the impact to the goals. In this case, the safety goal is impacted because of a serious risk to health and the environmental goal is impacted due to the spread of parasitic flatworms. The customer service goal (if we consider customers as all those who get water from the reservoir created by the dam) is impacted due to the outbreak of schistosomiasis.

Beginning with the safety goal, we can ask why questions. Instead of including just one effect, we include all effects to create a map of the cause-and-effect relationships. The serious risk to health is caused by the villagers suffering from schistosomiasis, which can cause serious health impacts. The villagers are infected with schistosomiasis and do not receive effective treatment. Not all those infected are receiving drugs due to cost and availability concerns. The drugs do not reverse the damage already done. And, most importantly, even those treated are quickly reinfected as they have little choice but to continue to use the contaminated water.

The outbreak of schistosomiasis is caused by the spread of parasitic flatworms, which carry the disease. The increase in flatworms is caused by the increased population of snails, which host the flatworms. The snail population increased after the installation of the dam killed off their predators and increased their food supply.

Many solutions to this issue were attempted and found to be less than desirable. Administering medication for treatment on its own wasn’t very effective, because (as described above) the villagers kept getting reinfected. The use of molluscicide killed off other animals in the reservoir as well. Introducing crayfish to eat the snails was derided by environmentalists as they were considered an invasive species. But they were on the right track. Now, a team is studying the reintroduction of the prawns which ate the snails. During the pilot study, the rates of schistosomiasis decreased. In addition, the prawns will serve as a valuable food source. This win-win solution is an example of serendipity and should actually return money to the community. Says Michael Hsieh, the project’s principal investigator and an assistant professor of urology, “The broad potential of this project is validation of a sustainable economic solution that not only addresses a major neglected tropical disease, but also holds the promise of breaking the poverty cycle in affected communities.”

Introducing animals to get rid of other animals can be problematic, as Macquarie Island discovered when cats, introduced to eat its exploding rodent population, also preyed on the native seabirds. (Click here to read more about Macquarie Island.) Further research is planned to ensure the project will continue to be a success. To learn more about the project, click here. Or, click "Download PDF" to view an overview of the Cause Map.

A single human error resulted in the deadly SpaceShipTwo crash

By Kim Smiley

The National Transportation Safety Board (NTSB) has issued a report on its investigation into the deadly SpaceShipTwo crash that occurred during a test flight on October 31, 2014. Investigators confirmed early suspicions that the space plane tore apart after the tail boom braking system was released too early, as discussed in a previous blog. The tail boom is designed to feather to increase drag and slow down the space plane, but when the drag was applied earlier than expected, the additional aerodynamic forces ripped the space plane apart while it was still at high speed and altitude. Amazingly, one of the two pilots survived the accident.

Information from the newly released report can be used to expand the Cause Map from the previous blog.  The investigation determined that the pilot pulled the lever that released the braking system too early.  Even though the pilots did not initiate a command to put the tail booms into the braking position, the forces on the tail booms forced them into the feathered position once they were unlocked.  The space plane could not withstand the additional aerodynamic forces created by the feathered tail booms while still accelerating and it tore apart around the pilots.

A Cause Map is built by asking “why” questions and documenting the answers in cause boxes to visually display the cause-and-effect relationships. So why did the pilot pull the lever too early?  A definitive answer to that may never be known since the pilot did not survive the crash, but it’s easy to understand how a mistake could be made in a high-stress environment while trying to recall multiple tasks from memory very quickly.  Additionally, the NTSB found that training did not emphasize the dangers of unlocking the tail booms too early so the pilot may not have been fully aware of the potential consequences of this particular error.

A more useful question to ask is how a single mistake could result in a deadly crash. The design of the plane made it possible for the pilot to pull a lever too early and create a dangerous situation. Ideally, no single mistake could cause a deadly accident, and safeguards would have been built into the design to prevent the tail booms from feathering prematurely. The NTSB determined the probable cause of this accident to be "failure to consider and protect against the possibility that a single error could result in a catastrophic hazard to the SpaceShipTwo vehicle." The investigation found that the design of the space plane assumed the pilots would perform the correct actions every time. Test pilots are highly trained and the best at what they do, but assuming human perfection is generally a dangerous proposition.
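
A hedged sketch of what such a safeguard could look like in software is shown below; the condition and threshold are hypothetical illustrations, not SpaceShipTwo's actual design requirements. The idea is simply that an early unlock command is refused rather than acted on.

```python
def feather_unlock_permitted(mach_number, safe_unlock_mach=1.4):
    """Hypothetical interlock: only allow the feather system to be unlocked
    once the vehicle is past the aerodynamic regime where premature
    feathering would tear it apart. The threshold is illustrative."""
    return mach_number >= safe_unlock_mach

def handle_unlock_command(mach_number):
    if feather_unlock_permitted(mach_number):
        return "feather unlocked"
    # A single early lever pull is caught here instead of becoming catastrophic.
    return "unlock inhibited: outside safe flight regime"

print(handle_unlock_command(0.8))  # early pull -> inhibited
print(handle_unlock_command(1.5))  # within the assumed safe regime -> unlocked
```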

The NTSB identified a few causes that contributed to the lack of safeguards in the SpaceShipTwo design. Designing commercial spacecraft is a relatively new field; there is limited human factors guidance for commercial space operators, and the flight database for commercial space mishaps is incomplete. Additionally, review during the design process was insufficient: it never identified that a single error could cause a catastrophic failure. To see the recommendations and more information on the investigation, view a synopsis from the NTSB's report.

To see an updated Cause Map of this accident, click on “Download PDF” above.