A single human error resulted in the deadly SpaceShipTwo crash

By Kim Smiley

The National Transportation Safety Board (NTSB) has issued a report on its investigation into the deadly SpaceShipTwo crash during a test flight on October 31, 2014.  Investigators confirmed early suspicions that the space plane tore apart after the tail boom braking system was released too early, as discussed in a previous blog.  The tail booms are designed to feather to increase drag and slow the space plane, but when the drag was applied earlier than expected, the additional aerodynamic forces ripped the space plane apart at high altitude and high velocity.  Amazingly, one of the two pilots survived the accident.

Information from the newly released report can be used to expand the Cause Map from the previous blog.  The investigation determined that the pilot pulled the lever that unlocked the braking system too early.  Even though the pilots did not command the tail booms into the braking position, aerodynamic loads pushed them into the feathered position once they were unlocked.  Still accelerating, the space plane could not withstand the additional aerodynamic forces created by the feathered tail booms, and it tore apart around the pilots.

A Cause Map is built by asking “why” questions and documenting the answers in cause boxes to visually display the cause-and-effect relationships. So why did the pilot pull the lever too early?  A definitive answer to that may never be known since the pilot did not survive the crash, but it’s easy to understand how a mistake could be made in a high-stress environment while trying to recall multiple tasks from memory very quickly.  Additionally, the NTSB found that training did not emphasize the dangers of unlocking the tail booms too early so the pilot may not have been fully aware of the potential consequences of this particular error.

A more useful question to ask would be how a single mistake could result in a deadly crash.  The plane had to be designed so that it was possible for the pilot to pull a lever too early and create a dangerous situation.  Ideally, no single mistake could create a deadly accident and there would have been safeguards built into the design to prevent the tail booms from feathering prematurely.  The NTSB determined the probable cause of this accident to be “failure to consider and protect against the possibility that a single error could result in a catastrophic hazard to the SpaceShipTwo vehicle.”  The investigation found that the design of the space plane assumed that the pilots would perform the correct actions every time.  Test pilots are highly trained and the best at what they do, but assuming human perfection is generally a dangerous proposition.
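One way to picture the kind of safeguard the NTSB is describing is a simple software interlock that refuses a feather-unlock command outside a safe speed window, so a single early lever pull cannot by itself create a catastrophic hazard.  This is a minimal, purely illustrative sketch, not actual flight software; the threshold value and function names are assumptions for illustration only.

```python
# Illustrative sketch (NOT flight software): an interlock that inhibits a
# feather-unlock command below an assumed safe Mach threshold, so that a
# single early lever pull cannot by itself feather the tail booms.
FEATHER_UNLOCK_MIN_MACH = 1.4  # assumed threshold, for illustration only

def request_feather_unlock(current_mach: float) -> bool:
    """Grant the unlock only when aerodynamic loads can no longer force
    the unlocked tail booms into the feathered position."""
    if current_mach < FEATHER_UNLOCK_MIN_MACH:
        return False  # inhibit: unlocking now would be hazardous
    return True

assert request_feather_unlock(0.9) is False  # early pull is inhibited
assert request_feather_unlock(1.5) is True   # unlock permitted in safe window
```

The design point is not the specific threshold but the principle: the system, rather than the pilot's memory alone, enforces the constraint.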

The NTSB identified a few causes that contributed to the lack of safeguards in the SpaceShipTwo design.  Designing commercial spacecraft is a relatively new field; there is limited human factors guidance for commercial space operators, and the flight database for commercial space mishaps is incomplete.  Additionally, there was insufficient review during the design process, which is why the possibility that a single error could cause a catastrophic failure was never identified.  To see the recommendations and more information on the investigation, view the synopsis of the NTSB’s report.

To see an updated Cause Map of this accident, click on “Download PDF” above.

Extensive Contingency Plans Prevent Loss of Pluto Mission

By ThinkReliability Staff

Beginning July 14, 2015, the New Horizons probe started sending photos of Pluto back to Earth, much to the delight of the world (and social media).  The New Horizons probe was launched more than 9 years ago (on January 19, 2006) – so long ago that when it left, Pluto was still considered a planet.  (It has since been downgraded to a dwarf planet.)  A mission that long isn’t without a few bumps in the road.  Most notably, just ten days before New Horizons’ Pluto flyby, mission control lost contact with the probe.

Loss of communication with the New Horizons probe while it was nearly 3 billion miles away could have resulted in the loss of the mission.  However, because contingency and troubleshooting plans were built into the design of the probe and the mission, communication was restored, and the New Horizons probe continued on to Pluto.

The potential loss of a mission is a near miss.  Analyzing near misses can provide important information and improvements for future issues and responses.  In this case, the mission goal is impacted by the potential loss of the mission (near miss).  The labor and time goal is impacted by the time required for response and repair.  Because of the distance between mission control on Earth and the probe on its way to Pluto, troubleshooting took considerable time, owing mainly to the communication delay: signals had to travel nearly 3 billion miles each way, a 9-hour round trip.
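The quoted 9-hour round trip can be checked with back-of-the-envelope arithmetic: radio signals travel at the speed of light, so the delay is simply twice the one-way distance divided by that speed.

```python
# Back-of-the-envelope check of the ~9-hour round-trip signal delay,
# using the speed of light and the ~3 billion mile one-way distance.
distance_miles = 3e9               # approximate Earth-to-probe distance
speed_of_light_mps = 186_282       # miles per second

one_way_hours = distance_miles / speed_of_light_mps / 3600
round_trip_hours = 2 * one_way_hours
print(round(round_trip_hours, 1))  # → 8.9
```

Every question from mission control thus cost roughly nine hours before an answer could even arrive, which is why each troubleshooting step had to be planned carefully in advance.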

The potential loss of the mission was caused by the loss of communication between mission control and the probe.  Details on the error have not been released, but its description as a “hard to detect” error implies that it wasn’t noticed in testing prior to launch.  Because the particular command sequence that led to the loss of communication would not be repeated later in the mission, once communication was restored there was no concern that the issue would recur.

Not all causes are negative.  In this case, the “loss of mission” became a “potential loss of mission” because communication with the probe was able to be restored.  This is due to the contingency and troubleshooting plans built into the design of the mission.  After the error, the probe automatically switched to a backup computer, per contingency design.  Once communication was restored, the spacecraft automatically transmitted data back to mission control to aid in troubleshooting.
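The contingency sequence described above — fail over to a backup computer when a fault occurs, then automatically stream diagnostic data once contact resumes — can be sketched in a few lines.  This is an invented illustration of the pattern, not NASA's actual flight software; all of the names are assumptions.

```python
# Illustrative sketch of the contingency pattern described above (invented
# names, not actual New Horizons software): fail over to a backup computer
# on a fault, queue diagnostics, and transmit them once contact is restored.
class Probe:
    def __init__(self):
        self.active_computer = "primary"
        self.telemetry_queue = []

    def on_fault(self, description: str) -> None:
        # Contingency design: switch to the backup computer automatically.
        self.active_computer = "backup"
        self.telemetry_queue.append(description)

    def on_communication_restored(self) -> list:
        # Automatically send queued diagnostic data to mission control.
        sent, self.telemetry_queue = self.telemetry_queue, []
        return sent

probe = Probe()
probe.on_fault("hard-to-detect command-sequence error")
assert probe.active_computer == "backup"
assert probe.on_communication_restored() == ["hard-to-detect command-sequence error"]
```

The key design choice is that both steps happen without waiting for instructions from the ground — essential when every command cycle takes hours.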

Of the mission, Alice Bowman, the Mission Operations Manager, says, “There’s nothing we could do but trust we’d prepared it well to set off on its journey on its own.”  Clearly, they did.

Trading Suspended on the NYSE for More Than 3 Hours

By ThinkReliability Staff

On July 8, 2015, trading was suspended on the New York Stock Exchange (NYSE) at 11:32 AM. According to NYSE president Tom Farley, “the root cause was determined to be a configuration issue.” This still leaves many questions unanswered. This issue can be examined in a Cause Map, a visual form of root cause analysis.

There are three steps to the Cause Mapping problem-solving method. First, the problem is defined with respect to the impact to the goals. The basic problem information is captured – the what, when, and where. In a case such as this, where the problem unfolded over hours, a timeline can be useful to provide an overview of the incident. Problems with the NYSE began when a system upgrade to meet timestamp requirements began on the evening of July 7. As traders attempted to connect to the system early the next morning, communication problems were found, and they worsened until the NYSE suspended trading. The system was restarted and full trading resumed at 3:10 PM.

The impacts to the goals are also documented as part of the basic problem information. In this case, there were no impacts to safety or the environment as a result of this issue. Additionally, there was no impact to customers, whose trades automatically shifted to other exchanges. However, an investigation by the Securities & Exchange Commission (SEC) and political hearings are expected as a result of the outage, impacting the regulatory goal. The outage itself is an impact to the production goal, and the time spent on response and repairs is an impact to the labor/time goal.

The cause-and-effect relationships that led to these impacts to the goals can be developed by asking “why” questions. This can be done even for positive impacts to the goals. For example, in this case customer service was NOT impacted adversely because customers were able to continue making trades even through the NYSE outage. This occurred because there are 13 exchanges, and current technology automatically transfers the trades to other exchanges. Because of this, the outage was nearly transparent to the general public.

In the case of the outage itself, as discussed above, the NYSE has stated it was due to a configuration issue. Specifically, the gateways were not loaded with the proper configuration for the upgrade that was rolled out July 7. However, information about what exactly the configuration issue was, or what checks failed such that the improper configuration was loaded, is not currently available. (Although some have said that the chance of this failure happening on the same date as two other large-scale outages could not be coincidental, the NYSE and government have ruled out hacking.) According to NYSE president Tom Farley, “We found what was wrong and we fixed what was wrong and we have no evidence whatsoever to suspect that it was external. Tonight and overnight starts the investigation of what exactly we need to change. Do we need to change those protocols? Absolutely. Exactly what those changes are I’m not prepared to say.”

Another concern is the backup plan in place for these types of issues. Says Harvey Pitt, SEC chairman from 2001 to 2003, “This kind of stuff is inevitable. But if it’s inevitable, that means you can plan for it. What confidence are we going to have that this isn’t going to happen anymore, or that what did happen was handled as good as anyone could have expected?” The backup plan in place appeared to be shifting operations to a disaster recovery center. This was not done because it was felt that requiring traders to reconnect would be disruptive. Other backup plans (if any) were not discussed. This has led some to question the oversight role of the SEC and its ability to prevent issues like this from recurring.

To view the investigation file, including the problem outline, Cause Map, and timeline, click on “Download PDF” above. To view the NYSE statement on the outage, click here.

Small goldfish can grow into a large problem in the wild

By Kim Smiley

Believe it or not, the unassuming goldfish can cause big problems when released into the wild.  I personally would have assumed that a goldfish set loose into the environment would quickly become a light snack for a native species, but invasive goldfish have managed to survive and thrive in lakes and ponds throughout the world.  Goldfish will keep growing as long as the environment they are in supports it.  So while goldfish kept in an aquarium will generally remain small, without the constraints of a tank, goldfish the size of dinner plates are not uncommon in the wild. These large goldfish both compete with and prey on native species, dramatically impacting native fish populations.

This issue can be better understood by building a Cause Map, a visual format of root cause analysis, which intuitively lays out the cause-and-effect relationships that contributed to the problem.  A Cause Map is built by asking “why” questions and recording each answer in a cause box on the Cause Map.  So why are invasive goldfish causing problems?  The problems are occurring because there are large populations of goldfish in the wild AND the goldfish are reducing native fish populations.  When two causes are needed to produce an effect, as in this case, both causes are listed vertically on the Cause Map and separated by an “and”.  Keep asking “why” questions to continue building the Cause Map.
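The structure just described can be sketched as a small data structure: each effect points to the causes recorded for it, and an effect that requires multiple causes joins them with an AND.  The representation below is invented purely to illustrate the idea, not part of any Cause Mapping software.

```python
# Minimal sketch of a Cause Map as data (invented representation): each
# effect maps to the list of causes recorded for it; multiple causes under
# one effect are joined by an AND, as described in the text.
cause_map = {
    "invasive goldfish causing problems": {
        "AND": [
            "large populations of goldfish in the wild",
            "goldfish are reducing native fish populations",
        ]
    },
    "large populations of goldfish in the wild": {
        "AND": ["pet owners release unwanted goldfish"]
    },
}

def why(effect: str) -> list:
    """Answer a 'why' question: return the causes recorded for an effect."""
    return cause_map.get(effect, {}).get("AND", [])

assert len(why("invasive goldfish causing problems")) == 2  # two causes, AND'ed
```

Asking `why()` repeatedly on each returned cause is the textual equivalent of extending the map box by box.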

So why are there large populations of goldfish in the wild?  Goldfish are being introduced to the wild by pet owners who no longer want to care for them and don’t want to kill their fish.  The owners likely don’t understand the potential environmental impacts of dumping non-native fish into their local lakes and ponds.  Goldfish are also hardy and some may survive being flushed down a toilet and end up happily living in a lake if a pet owner chooses to try that method of fish disposal.

Why do goldfish have such a large impact on native species?  Goldfish can grow larger than many native species and they compete with them for the same food sources.  In addition, goldfish eat small fish as well as eggs from native species.  Invasive goldfish can also introduce new diseases into bodies of water that can spread to the native species.  The presence of a large number of goldfish can also change the environment in a body of water.  Goldfish stir up mud and other matter when they feed, which makes the water cloudier and impacts aquatic plants.  Some scientists also believe that large populations of goldfish can lead to algae blooms, because goldfish feces are a potential food source for the algae.

Scientists are working to develop the most effective methods to deal with the invasive goldfish.  In some cases, officials may drain a lake or use electroshocking to remove the goldfish.  As an individual, you can help address the problem by refraining from releasing pet fish into the wild.  It’s an understandable impulse to want to free an unwanted pet, but the consequences can be much larger than might be expected. You can contact local pet stores if you need to get rid of aquarium fish; some will allow you to return the fish.

To view a Cause Map of this problem, click on “Download PDF” above.