Trading Glitch Loses Goldman Sachs Millions

By Kim Smiley

A Goldman Sachs trading glitch on August 20, 2013, caused a large number of erroneous single stock and ETF options trades.  About 80 percent of the errant trades were cancelled, but the financial damage is still speculated to be as much as one hundred million dollars.  The company also finds itself once again in the uncomfortable position of making headlines for negative reasons, which is never good for business.

The glitch occurred during an update to an internal computer system that is used to determine where to price options.  The update changed the software so that the system inadvertently began misinterpreting non-binding indications of interest as actual bids and offers.  The system acted on these supposed bids and offers and executed a large volume of trades at errant prices that were out of touch with actual market prices.

This issue can be built into a Cause Map, an intuitive method for performing a root cause analysis.  One of the advantages of a Cause Map is that it visually lays out all the causes and the cause-and-effect relationships between them. Seeing all the causes can broaden the solutions that are considered.

In this example, a Cause Map can help illustrate the fact that the software glitch itself isn’t the only thing worth focusing on.  The lack of an effective test program also contributed to the problem, and testing may be the easiest place to implement an effective solution.  If the problem had been caught in testing, the only cost would have been the time and effort needed to fix the software.  The importance of a robust test program for software is difficult to overstate.  If the software is vital to your company’s mission, develop a way to test it.
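As a rough illustration of the kind of check a test program could include, the sketch below asserts that non-binding indications of interest are never treated as executable bids or offers.  It is not Goldman Sachs code; the names MessageKind, Message and executable_orders, and the symbol and prices, are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from enum import Enum


class MessageKind(Enum):
    INDICATION_OF_INTEREST = "ioi"  # non-binding, must never be traded on
    FIRM_ORDER = "order"            # an actual bid or offer


@dataclass(frozen=True)
class Message:
    kind: MessageKind
    symbol: str      # hypothetical ticker
    price: float     # illustrative value only
    quantity: int


def executable_orders(messages):
    """Return only the messages that may be acted on as real bids or offers."""
    return [m for m in messages if m.kind is MessageKind.FIRM_ORDER]


def test_indications_of_interest_are_never_executed():
    messages = [
        Message(MessageKind.INDICATION_OF_INTEREST, "XYZ", 1.00, 500),
        Message(MessageKind.FIRM_ORDER, "XYZ", 2.50, 100),
    ]
    orders = executable_orders(messages)
    # Exactly one message is binding, and it is the firm order.
    assert len(orders) == 1
    assert all(m.kind is MessageKind.FIRM_ORDER for m in orders)


test_indications_of_interest_are_never_executed()
```

A test like this is cheap to run before every software update, which is the point made above: a failure caught here costs only the time needed to fix the code.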

To view a high level Cause Map of this issue, click on “Download PDF” above.  Click here to read about the loss of the Mars Climate Orbiter, another excellent example of a software error with huge consequences.

Sinkhole Forms Under Orlando Resort

By Kim Smiley

On Sunday, August 11, 2013, guests at a resort near Orlando, Florida woke to creaking sounds and breaking windows.  About 10 minutes after the disturbances began, a portion of the luxury resort villa was swallowed up by a sinkhole that had formed with little warning.  Luckily, there was enough time to evacuate the resort and no one was injured, but guests lost most of their luggage, including purses and wallets, and the resort obviously suffered significant damage.

This incident can be analyzed by building a Cause Map, or visual root cause analysis.  A Cause Map shows the cause-and-effect relationships between the different causes that contributed to an issue.  This can help guide an investigation and is useful when developing solutions to prevent similar problems from occurring in the future.  Cause Maps can also be used to help explain the issue to somebody who was not involved with the investigation.  To view a high level Cause Map of this specific issue, click on “Download PDF” above.

A sinkhole forms when a void is created underground and the earth ceiling over the void collapses into it.  Voids typically form when there is a limestone or similar rock deposit underground that is dissolved by slightly acidic groundwater.  This region in central Florida is well known for sinkholes because these types of underground deposits are relatively common in the area.  Overpumping of groundwater can also cause the ground to settle, possibly forming a void.  There is concern that the rapid pace of development in this area has had an impact on the groundwater and may be helping to fuel the formation of sinkholes.  There was also record rainfall in July in Orlando, and the additional water may have made the ceiling over the void heavier than normal and more likely to collapse.

There rarely are easy answers, but sinkholes seem to be a particularly tricky problem to solve.  They are unpredictable and there is typically little warning before they develop.  Prior to the resort being constructed, the site underwent geological testing and the ground was found to be stable, so something more than basic geological testing will be needed to solve this problem.  Well-planned development and careful management of groundwater may help limit the development and impact of sinkholes, but there will be strong economic pressure to develop more and more land in the booming Orlando area.  Insurance seems to be one of the best solutions to date.  Florida law requires insurance companies to cover “catastrophic ground collapse” so that property owners at least have an economic safety net in the event of a sinkhole.  The only good thing about sinkholes is that they generally take a little time to form, as the recent resort sinkhole did.  There will be property damage from sinkholes in the future, but hopefully there will also be time to evacuate everybody, as there was in this case.

Thousands Injured Each Year From Falling Televisions

By Kim Smiley

Nearly all parents know about the dangers of watching too much television, but a new study shows that too few are aware of the risk of injury from televisions.  The number of television injuries is higher than most would guess, with more than 17,000 children visiting emergency rooms for television-related injuries each year.  Falling televisions have also caused hundreds of deaths, with 29 killed in 2011 alone.  The rate of injuries associated with televisions is also increasing at an alarming pace, jumping 126% since 1990.

The majority of victims were young children under five.  The accidents seem to result from a potentially deadly combination of their lack of situational awareness and unanchored televisions set on unsafe surfaces.  The study didn’t include why the televisions were in unsafe locations, but one theory is that as families acquire bigger, fancier televisions, many older televisions are moved into secondary locations that aren’t as safe.  The older televisions may be on dressers or night stands that were never meant to hold televisions.  Children climb the furniture, attempting either to turn on the television or to retrieve something off the top, and the television tumbles down on top of them.  Dressers with drawers are particularly dangerous because children may figure out how to use the drawers as steps and manage to climb much higher than anticipated.

The rapid rate of technological advances may also play a role since typical families are buying new televisions more frequently than in previous decades and the number of televisions in an average home has increased.  The changing design of televisions is also relevant.  New thinner televisions have significantly smaller bases, making them top-heavy and more likely to topple over.  Many families are also buying bigger televisions, which can amplify the danger if they topple.

Experts have suggested a few potential solutions to this problem.  First and foremost, parents need to be made more aware of the issue, possibly through a public awareness campaign.  A campaign to distribute anchoring devices has been discussed, as well as providing them with new televisions at purchase.  Another option may be to add stability requirements to new designs so that televisions are less likely to topple.  It is also recommended that parents never store remote controls or toys on top of a television because they may entice children into climbing to reach them.  Only time will tell which solutions, if any, are implemented, but this study is a first step in raising public awareness about this issue.

To view a Cause Map, or visual root cause analysis, of this issue, click on “Download PDF” above.  A Cause Map visually lays out the causes that contribute to a problem to show the cause-and-effect relationships and can help clarify a situation.  The possible solutions are included on the Cause Map.

The Loss of the Steamship General Slocum, June 15, 1904

By ThinkReliability Staff

On June 15, 1904, a church group headed out for an excursion through New York City’s East River on the Steamship General Slocum.  Approximately half an hour after the ship left the pier, it caught fire.  Although the ship was only hundreds of yards from shore, the Captain continued to go full speed ahead in hopes of beaching at North Brother Island, where a hospital was located.  This served to fan the flames quickly over the entire highly flammable ship, killing many in the inferno.  The Captain did successfully beach the ship at North Brother Island, but most of those who were not killed by the fire drowned because of the depth of the water and the lack of safety equipment.

To perform a root cause analysis of the General Slocum tragedy, we can use a Cause Map.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.  First we look at the impact to the goals.  On the General Slocum there were at least 1,021 fatalities of the passengers and crew that were aboard.  (However, only two of the crew were killed.)  Additionally, 180 were injured.  There were other goals that were affected but the loss of life makes any other goals insignificant.  The deaths and injuries are impacts to the safety goals.

Passengers drowned because they were in water over their heads with inadequate help or safety equipment.  Passengers were in the water either because they fell when the deck collapsed or because they jumped in trying to avoid the fire.  The water was too deep to stand in because only the bow was in shallow water and the passengers could not reach the bow.  This was due to a poor decision on the Master’s part (namely his decision to beach the ship at a severe angle, with the bow in towards the island, instead of parallel to the island, where passengers would have been able to wade to shore).  Note that the Master himself and most of the crew were on the bow side of the ship and were able to (and did) jump off and wade to shore.  The safety equipment, including life preservers, life boats, and life rafts, was mostly unusable due to inadequate upkeep and inadequate inspections.

Passengers (and two crewmembers) were also killed by fire.  Once the fire started, it spread rapidly and was not put out.  The fire spread rapidly because the ship was highly flammable.  When this ship was constructed, there was no consideration of flammability.  Additionally, the current of air created by the vessel speeding ahead drove the fire across the ship.  The fact that an experienced Master allowed this situation was considered misconduct, negligence and inattention to duty, charges of which the Master was later convicted.  The fire was not put out because of inadequate crew effort and insufficient fire-fighting equipment.  The crew effort was inadequate because of a lack of training.  The fire-fighting equipment was insufficient because of inadequate upkeep and inadequate inspections.  (Possibly you are noticing a theme here?)

Although many people have not heard of the General Slocum tragedy, many of its lessons learned have been implemented to make ship travel safer today.  However, many of the solutions were not implemented widely enough or in time to prevent the Titanic disaster from occurring eight years later.  (Although there were nearly as many people killed on the General Slocum, it is believed that the Titanic disaster is better known because the passengers on Titanic were wealthy, as opposed to the working class passengers on General Slocum.  It is also surmised that sympathy for the largely German population aboard General Slocum was diminished as World War I began.)

In a macabre ending to a gruesome story, ships began replacing their outdated, decrepit life preservers after the investigation on General Slocum.  It was later found that the company selling these new life preservers had hidden iron bars within the buoyant material, in a dastardly attempt to raise their apparent weight.  Unfortunately there were no adequate laws (then) against selling defective life-saving equipment.

Train Derailment Kills 79 in Spain

By Kim Smiley

On July 24, 2013, a train carrying 247 people violently derailed near Santiago de Compostela, Spain.  Over 130 were injured and 79 were killed as a result of the accident.  Many details are still unknown, but investigators have determined that the train was traveling at about twice the posted speed over a curved section of track.

The derailment was the worst train accident Spain has suffered in 40 years.  Obviously, an investigation is underway and authorities are eager to identify what caused the accident and are working to prevent anything similar from occurring in the future. One of the ways this accident can be analyzed is by building a Cause Map, a visual format for performing a root cause analysis.  A Cause Map visually lays out the different causes that contributed to an accident in an intuitive format that shows the cause-and-effect relationships.

The Cause Mapping process begins by filling in the basic background information for an issue as well as identifying how the incident impacted the goals.  In this example, the safety goal is clearly impacted because there were fatalities and injuries.  The schedule, labor, and material goals were also impacted because of the time and resources needed to investigate and clean up the accident and the damage to the train.  The negative publicity surrounding the accident can also be considered an impact to the customer service goal because people may be hesitant to ride trains if they have concerns about safety.

So why did the train derail?  The train was going too fast to safely navigate a curved section of track.  The train was going fast because it had previously been running on track designed for high speed trains, where high speeds were permitted, and it didn’t slow down as it entered a section of track where the posted speed was lower.  Operator action was required to slow down the train and it appears that the operator failed to take action.  Investigators are looking into whether there was a mechanical problem of some kind that prevented the train from reducing speed, but early indication is that the operator simply failed to brake and reduce the speed of the train.
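The chain of “why” questions above can be sketched as a small data structure.  This is only an illustration of the cause-and-effect idea, not the Cause Mapping tool itself: the dictionary layout and the helper names why_chain and leaf_causes are assumptions made for this example, and the cause labels are paraphrased from the paragraphs above.

```python
# Each effect maps to the causes that produced it (paraphrased from the text).
cause_map = {
    "Safety goal impacted (fatalities and injuries)": ["Train derailed"],
    "Train derailed": ["Train too fast for curved section of track"],
    "Train too fast for curved section of track": [
        "Train had been running on high speed track",
        "Train did not slow down entering lower speed section",
    ],
    "Train did not slow down entering lower speed section": [
        "Operator action required to slow the train",
        "Operator failed to brake",
    ],
}


def why_chain(effect, cmap, depth=0):
    """Return the effect and its causes as indented lines, one level per 'why'."""
    lines = ["  " * depth + effect]
    for cause in cmap.get(effect, []):
        lines.extend(why_chain(cause, cmap, depth + 1))
    return lines


def leaf_causes(effect, cmap):
    """Return the causes that have no further causes recorded beneath them."""
    causes = cmap.get(effect, [])
    if not causes:
        return [effect]
    leaves = []
    for cause in causes:
        leaves.extend(leaf_causes(cause, cmap))
    return leaves


print("\n".join(why_chain("Safety goal impacted (fatalities and injuries)", cause_map)))
```

The leaf causes are the points where the investigation would keep asking “why”, and each of them is a place where a solution could be considered.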

A number of factors seem to have contributed to this deadly error by an experienced train operator who was familiar with this portion of track.  The European Rail Traffic Management System (ERTMS) automatically controls braking and is installed on most of the track that high speed trains operate on in the region, but not on the track where the accident occurred.  The accident occurred at the first potentially dangerous curve after the transition to track where operator action is necessary to brake the train.  Based on statements by the driver, he missed the transition to the track where manual braking is required and didn’t realize that the train was in danger.  It has also come to light that the train driver was on the phone with the train’s ticket inspector immediately prior to the derailment, and this distraction likely played a role in the accident.  The initial investigation findings led to the train’s driver being provisionally charged with multiple counts of homicide by professional recklessness on 28 July 2013.

Regardless of whether the driver is convicted on the charges, the automatic systems involved should be a focus of the investigation.  The safety system sent a warning to the operator about the high speed prior to the accident, but it failed to prevent the accident.  Investigators need to review the timing of the warning and determine whether it came too late.  Other automatic systems such as the ERTMS also have the ability to stop a train that is operating at unsafe speeds, which raises the question of whether the safety systems used on this portion of track are adequate, given that the accident happened.  Ideally, a single error by a train driver, for any reason, should not result in dozens of deaths.
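The difference between a system that only warns and one that can also intervene can be sketched in a few lines.  This is a simplified illustration, not a description of how ERTMS or the system on this track actually works; the function name supervise_speed, the warning margin and the example speeds are all hypothetical.

```python
def supervise_speed(speed_kmh, limit_kmh, warning_margin_kmh=5.0):
    """Decide what an automatic supervision layer should do at a given speed.

    Returns "ok", "warn" (alert the operator) or "brake" (intervene without
    waiting for the operator).  The thresholds are illustrative assumptions.
    """
    if speed_kmh > limit_kmh:
        return "brake"
    if speed_kmh > limit_kmh - warning_margin_kmh:
        return "warn"
    return "ok"


# Illustrative values only: a train at about twice a posted limit.
posted_limit = 80.0
print(supervise_speed(2 * posted_limit, posted_limit))
```

With only the "warn" branch, the outcome still depends on the operator reacting in time; the "brake" branch is what removes the single point of failure.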

To view a high level Cause Map of this incident, click on “Download PDF” above.  Click here to view a video of the accident.