What Happens When a Copy Isn’t a Copy?

By Kim Smiley

Think of how many documents are scanned every day. Imagine how important some of these pieces of paper are, such as invoices, property records, and medical files. Now try to picture what might happen if the copies of these documents aren’t true copies. This is exactly the scenario that Xerox was recently facing.

It recently came to light that some copies of scanned documents were altered by the scanning process. Specifically, some scanner/copier machines changed numbers on documents. This issue can be analyzed by building a Cause Map, an intuitive, visual format for performing a root cause analysis. The first step in the Cause Mapping process is to fill in an Outline with the basic background information on an issue. Additionally, the impacts to the overall goals are documented on the Outline to help clarify the severity of any given issue. In this example, the customer service goal is impacted because the scanners weren’t operating as expected. There is also a potential impact to the overall economic goal because the altered documents could result in any number of issues. There is also an impact because of the labor needed to investigate and fix the problem.

After completing the Outline, the next step is to ask “why” questions to build the Cause Map. Why weren’t the scanners operating as expected? This happened because the scanners were changing some documents during the scanning process. Scanners use software to help interpret the original documents, and Xerox has stated that the problem happened because of a software bug. Testing showed that the number substitutions were more likely to occur when the scanners were set to lower quality/higher compression because of the specific software used for these settings. Testing also showed that the error was more likely to occur when scanning documents that were more difficult to read, such as those with small fonts or those that had already been copied multiple times.

Xerox had been aware of the potential for number substitution at lower quality settings, but didn’t appear to expect it to occur at factory settings (which was found to be very unlikely, but possible). A notice that stated that character substitutions were possible appeared on the scanners when lower resolution settings were selected and was included in some manuals, but this approach seems to have been ineffective since many users were caught unaware by this issue.
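The chain of “why” questions described above can be sketched as a small data structure. This is only an illustrative sketch, not the actual Cause Map from the PDF: the dictionary layout, the function names, and the wording of each cause are assumptions, and treating the scanner settings and hard-to-read originals as causes (rather than as factors that made the error more likely) is a simplification.

```python
# Hypothetical sketch: a Cause Map stored as effect -> list of direct causes.
# The cause labels paraphrase the article; the structure itself is an assumption.
cause_map = {
    "Customer service goal impacted": ["Scanners not operating as expected"],
    "Scanners not operating as expected": ["Numbers changed during scanning"],
    "Numbers changed during scanning": [
        "Software bug",
        "Lower quality/higher compression setting",
        "Hard-to-read original",
    ],
}


def ask_why(effect, depth=0):
    """Return indented lines built by repeatedly asking 'why' of each effect."""
    lines = ["  " * depth + effect]
    for cause in cause_map.get(effect, []):
        lines.extend(ask_why(cause, depth + 1))
    return lines


def end_causes(effect):
    """Return the causes for which no further 'why' has been recorded yet."""
    causes = cause_map.get(effect, [])
    if not causes:
        return [effect]
    found = []
    for cause in causes:
        found.extend(end_causes(cause))
    return found


print("\n".join(ask_why("Customer service goal impacted")))
```

Each cause at the end of a branch is a place where the investigation could keep asking “why” or where a solution could be applied.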

After a Cause Map has been built with enough detail to understand the issue, it can be used to help develop solutions. In this example, Xerox developed a software patch that corrected the error. Xerox also posted several blogs on their website to keep customers informed about the issue and worked with users to ensure that the patch was successful in correcting the error.

To see a high level Cause Map of this issue, click on “Download PDF” above.

 

Trading Glitch Loses Goldman Sachs Millions

By Kim Smiley

A Goldman Sachs trading glitch on August 20, 2013 caused a large number of erroneous single stock and ETF options trades. About 80 percent of the errant trades were cancelled, but the financial damage is still speculated to be as much as one hundred million dollars. The company also finds itself once again in the uncomfortable position of making headlines for negative reasons, which is never good for business.

The glitch occurred during an update to an internal computer system that is used to determine where to price options. The update changed the software so that the system inadvertently began misinterpreting non-binding indications of interest as actual bids and offers. The system acted on these bids and executed a large volume of trades at errant prices that were out of touch with actual market prices.

This issue can be built into a Cause Map, an intuitive method for performing a root cause analysis.  One of the advantages of a Cause Map is that it visually lays out all the causes and the cause-and-effect relationships between them. Seeing all the causes can broaden the solutions that are considered.

In this example, a Cause Map can help illustrate the fact that the software glitch itself isn’t the only thing worth focusing on. The lack of an effective test program also contributed to the problem, and testing may be the easiest place to implement an effective solution. If the problem had been caught in testing, the only cost would have been the time and effort needed to fix the software. The importance of a robust test program for software is difficult to overstate. If the software is vital to whatever your company’s mission is, develop a way to test it.
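As a rough illustration of the kind of test that might catch this class of error, the sketch below checks that non-binding indications of interest are never treated as executable orders. Everything here is hypothetical: the `Message` class, the `binding` flag, and `orders_to_execute` are invented for the example and say nothing about how the actual trading system was built or tested.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical market message; not a model of any real trading system."""
    symbol: str
    price: float
    binding: bool  # False for a non-binding indication of interest


def orders_to_execute(messages):
    """Keep only binding bids and offers; indications of interest are ignored."""
    return [m for m in messages if m.binding]


def test_indications_of_interest_are_not_executed():
    messages = [
        Message("XYZ", 1.00, binding=False),  # indication of interest
        Message("XYZ", 1.05, binding=True),   # actual offer
    ]
    executed = orders_to_execute(messages)
    assert len(executed) == 1
    assert all(m.binding for m in executed)


test_indications_of_interest_are_not_executed()
```

A test like this costs little to write and run before an update goes live, which is the point the paragraph above makes about catching problems in testing.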

To view a high level Cause Map of this issue, click on “Download PDF” above.  Click here to read about the loss of the Mars Climate Orbiter, another excellent example of a software error with huge consequences.

Sinkhole Forms Under Orlando Resort

By Kim Smiley

On Sunday, August 11, 2013, guests at a resort near Orlando, Florida woke to creaking sounds and breaking windows. About 10 minutes after the disturbances began, a portion of the luxury resort villa was swallowed up by a sinkhole that had formed with little warning. Luckily, there was enough time to evacuate the resort and no one was injured, but guests lost most of their luggage, including purses and wallets, and the resort obviously suffered significant damage.

This incident can be analyzed by building a Cause Map, or visual root cause analysis.  A Cause Map shows the cause-and-effect relationships between the different causes that contributed to an issue.  This can help guide an investigation and is useful when developing solutions to prevent similar problems from occurring in the future.  Cause Maps can also be used to help explain the issue to somebody who was not involved with the investigation.  To view a high level Cause Map of this specific issue, click on “Download PDF” above.

A sinkhole forms when a void is created underground and the earth ceiling over the void collapses into it. Voids typically form when there is a limestone or similar rock deposit underground that is dissolved by slightly acidic groundwater. This region in central Florida is well known for sinkholes because these types of underground deposits are relatively common in the area. Overpumping of groundwater can also cause the ground to settle, possibly forming a void. There is concern that the rapid pace of development in this area has had an impact on the groundwater and may be helping fuel the formation of sinkholes. There was also record rainfall in July in Orlando, and the additional water may have caused the ceiling over the void to be heavier than normal and more likely to collapse.

There rarely are easy answers, but sinkholes seem to be a particularly tricky problem to solve. They are unpredictable and there is typically little warning before they develop. Prior to the resort being constructed, the site underwent geological testing and the ground was found to be stable, so something more than basic geological testing will be needed to solve this problem. Well-planned development and careful management of groundwater may help limit the development and impact of sinkholes, but there will be strong economic pressure to develop more and more land in the booming Orlando area. Insurance seems to be one of the best solutions to date. Florida law requires insurance companies to cover “catastrophic ground collapse” so that property owners at least have an economic safety net in the event of a sinkhole. The only good thing about sinkholes is that they generally take a little time to form, like the recent resort sinkhole. There will be property damage from sinkholes in the future, but hopefully there will also be time to evacuate everybody, as there was in this case.

Thousands Injured Each Year From Falling Televisions

By Kim Smiley

Nearly all parents know about the dangers of watching too much television, but a new study shows that too few are aware of the risk of injury from televisions. The number of television injuries is higher than most would guess, with more than 17,000 children visiting emergency rooms for television-related injuries each year. Falling televisions have also caused hundreds of deaths, with 29 killed in 2011 alone. The rate of injuries associated with televisions is also increasing at an alarming pace, jumping 126% since 1990.

The majority of victims were young children under five.  The accidents seem to be a potentially deadly combination of their lack of situational awareness and unanchored televisions set on unsafe surfaces.  The study didn’t include why the televisions were in unsafe locations, but one theory is that many older televisions are moved into secondary locations that aren’t as safe as families acquire bigger, fancier televisions.  The older televisions may be on dressers or night stands that were never meant to hold televisions.  Children climb the furniture either attempting to turn on the television or retrieve something off the top and the television tumbles down on top of them.  Dressers with drawers are particularly dangerous because children may figure out how to use the drawers as steps and manage to climb much higher than anticipated.

The rapid rate of technological advances may also play a role since typical families are buying new televisions more frequently than in previous decades and the number of televisions in an average home has increased. The changing design of televisions is also relevant. New thinner televisions have significantly smaller bases, making them top-heavy and more likely to topple over. Many families are also buying bigger televisions, which can amplify the danger if they topple.

Experts have suggested a few potential solutions to this problem. First and foremost, parents need to be made more aware of the issue, possibly through a public awareness campaign. A campaign to distribute anchoring devices has been discussed, as has providing them with new televisions at purchase. Another option may be to add stability requirements to new designs so that televisions are less likely to topple. It is also recommended that parents never store remote controls or toys on top of a television because they may entice children into climbing to reach them. Only time will tell which solutions, if any, are implemented, but this study is a first step in raising public awareness about this issue.

To view a Cause Map, or visual root cause analysis, of this issue, click on “Download PDF” above.  A Cause Map visually lays out the causes that contribute to a problem to show the cause-and-effect relationships and can help clarify a situation.  The possible solutions are included on the Cause Map.

The loss of the Steamship General Slocum, June 15, 1904

By ThinkReliability Staff

On June 15, 1904, a church group headed out for an excursion through New York City’s East River on the Steamship General Slocum. Approximately half an hour after the ship left the pier, it caught fire. Although the ship was only hundreds of yards from shore, the Captain continued to go full speed ahead in hopes of beaching at North Brother Island, where a hospital was located. This served to fan the flames quickly over the entire highly flammable ship, killing many in the inferno. Although the Captain did successfully beach the ship at North Brother Island, most of those who were not killed by the fire drowned because of the depth of the water and the lack of safety equipment.

To perform a root cause analysis of the General Slocum tragedy, we can use a Cause Map.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.  First we look at the impact to the goals.  On the General Slocum there were at least 1,021 fatalities of the passengers and crew that were aboard.  (However, only two of the crew were killed.)  Additionally, 180 were injured.  There were other goals that were affected but the loss of life makes any other goals insignificant.  The deaths and injuries are impacts to the safety goals.

Passengers drowned because they were in water over their heads with inadequate help or safety equipment. Passengers were in the water either because they fell when the deck collapsed or because they jumped into the water trying to avoid the fire. The water was too deep to stand in because only the bow was in shallow water and the passengers could not reach the bow. This was due to a poor decision on the Master’s part (namely his decision to beach the ship at a severe angle, with the bow in towards the island, instead of parallel to the island, where passengers would have been able to wade to shore). Note that the Master himself and most of the crew were on the bow side of the ship and were able to (and did) jump off and wade to shore. The safety equipment, including life preservers, life boats, and life rafts, was mostly unusable due to inadequate upkeep and inadequate inspections.

Passengers (and two crewmembers) were also killed by fire.  Once the fire was started, it spread rapidly and was not put out.    The fire spread rapidly because the ship was highly flammable.  When this ship was constructed, there was no consideration of flammability.  Additionally, the current of air created by the vessel speeding ahead drove the fire across the ship.  The fact that an experienced Master would have allowed this situation was considered misconduct, negligence and inattention to duty – charges for which the Master was later convicted.   The fire was not put out because of inadequate crew effort and insufficient fire-fighting equipment.  The crew effort was inadequate because of a lack of training.  The fire-fighting equipment was insufficient because of inadequate upkeep and inadequate inspections.  (Possibly you are noticing a theme here?)

Although many people have not heard of the General Slocum tragedy, many of its lessons learned have been implemented to make ship travel safer today. However, many of the solutions were not implemented widely enough or in time to prevent the Titanic disaster from occurring eight years later. (Although there were nearly as many people killed on the General Slocum, it is believed that the Titanic disaster is better known because the passengers on Titanic were wealthy, as opposed to the working class passengers on General Slocum. It is also surmised that sympathy for the largely German population aboard General Slocum was diminished as World War I began.)

In a macabre ending to a gruesome story, ships began replacing their outdated, decrepit life preservers after the investigation on General Slocum.  It was later found that the company selling these new life preservers had hidden iron bars within the buoyant material, in a dastardly attempt to raise their apparent weight.  Unfortunately there were no adequate laws (then) against selling defective life-saving equipment.

Train Derailment Kills 79 in Spain

By Kim Smiley

On July 24, 2013, a train carrying 247 people violently derailed near Santiago de Compostela, Spain. Over 130 were injured and 79 were killed as a result of the accident. Many details are still unknown, but investigators have determined that the train was traveling about twice the posted speed over a curved section of track.

The derailment was the worst train accident Spain has suffered in 40 years.  Obviously, an investigation is underway and authorities are eager to identify what caused the accident and are working to prevent anything similar from occurring in the future. One of the ways this accident can be analyzed is by building a Cause Map, a visual format for performing a root cause analysis.  A Cause Map visually lays out the different causes that contributed to an accident in an intuitive format that shows the cause-and-effect relationships.

The Cause Mapping process begins by filling in the basic background information for an issue as well as identifying how the incident impacted the goals.  In this example, the safety goal is clearly impacted because there were fatalities and injuries.  The schedule, labor, and material goals were also impacted because of the time and resources needed to investigate and clean up the accident and the damage to the train.  The negative publicity surrounding the accident can also be considered an impact to the customer service goal because people may be hesitant to ride trains if they have concerns about safety.

So why did the train derail? The train was going too fast to safely navigate a curved section of track. The train was going fast because it had previously been running on track designed for high speed trains where high speeds were permitted, and it didn’t slow down as it entered a section of track where the posted speed was lower. Operator action was required to slow down the train and it appears that the operator failed to take action. Investigators are looking into whether there was a mechanical problem of some kind that prevented the train from reducing speed, but the early indication is that the operator simply failed to brake and reduce the speed of the train.

A number of factors seem to have contributed to this deadly error by an experienced train operator who was familiar with this portion of track. The European Rail Traffic Management System (ERTMS) automatically controls braking and is installed on most of the track high speed trains operate on in the region, but not on the track where the accident occurred. The accident occurred at the first potentially dangerous curve after the transition to track where operator action is necessary to brake the train. Based on statements by the driver, he missed the transition to the track where manual braking is required and didn’t realize that the train was in danger. It has also come to light that the train driver was on the phone with the train’s ticket inspector immediately prior to the derailment, and this distraction likely played a role in the accident. The initial investigation findings led to the train’s driver being provisionally charged with multiple counts of homicide by professional recklessness on 28 July 2013.

Regardless of whether the driver is convicted on the charges, the automatic systems involved should be a focus of the investigation. The safety system sent a warning to the operator about the high speed prior to the accident, but it failed to prevent the accident. Investigators need to review the timing of the warning and determine whether it came too late. Other automatic systems such as the ERTMS also have the ability to stop a train that is operating at unsafe speeds, which raises the question of whether the safety systems used on this portion of track are adequate, given that the accident happened. Ideally, a single error by a train driver, for any reason, shouldn’t result in dozens of deaths.
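The difference between a system that only warns the operator and one that can stop the train can be shown with a very small sketch. This is a deliberately simplified illustration and not a description of how ERTMS or the system on this track actually works: the function name, the `enforce` flag, and the speeds are all made up for the example.

```python
def respond_to_speed(speed, limit, enforce):
    """Return the action a hypothetical speed supervision system would take.

    enforce=False models a warning-only system that relies on the operator;
    enforce=True models a system that can brake the train itself.
    """
    if speed <= limit:
        return "no action"
    return "apply brakes" if enforce else "warn operator"


# Illustrative numbers only: a train at twice an assumed posted speed of 100.
print(respond_to_speed(200, 100, enforce=False))
print(respond_to_speed(200, 100, enforce=True))
```

In the warning-only case the outcome still depends on a single person noticing and reacting in time, which is the weakness the paragraph above points to.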

To view a high level Cause Map of this incident, click on “Download PDF” above.  Click here to view a video of the accident.

Deadly Plane Crash at San Francisco Airport

By Kim Smiley

On July 6, 2013, Asiana Airlines Flight 214 crashed while attempting to land at the San Francisco International Airport. Three people have died as a result of the crash and around 180 others were injured, 13 critically. The cause of the crash is currently under investigation, but there were no obvious mechanical issues and the weather was near perfect.

Even though the investigation is still in its infancy, an initial Cause Map can be built to document what is known now about the accident and it can easily be expanded later as more information becomes available. A Cause Map is a visual format for performing a root cause analysis that intuitively lays out the different causes for an accident. The first step in the Cause Mapping process is to fill in an Outline with the basic background information for an issue. On the bottom half of the Outline there is space to document how the problem impacts the overall goals. This is useful because it helps everyone involved in the process understand the big picture and the issues with the more significant impacts can be prioritized first.

There is also space on the Outline to list anything that was different or unusual at the time the problem occurred. It’s important to note any differences because they may have played a role in the accident and are usually worth exploring during an investigation. In this specific example, this was the first time the pilots had worked together and the two main pilots were both in unfamiliar roles. The pilot landing the plane had limited experience with Boeing 777s even though he was an experienced pilot, and this was his first time landing this type of aircraft at the San Francisco airport. There was another pilot instructing him, but it was his first flight as an instructor.

Once the Outline is completed, the next step is to ask “why” questions and add the answers to the Cause Map. In this example, we know that the airplane was coming in too low and too slow to land safely, but it isn’t known why that happened. The NTSB has initiated an investigation and the results will be reported when the analysis is complete. Some of the early speculation is that there may have been an equipment failure, mismanagement of automated systems or ineffective communication in the cockpit. The fact that this crew was different than the typical staffing has been a focus of investigators, but it isn’t known what role this may have played in the crash.

Another piece of this puzzle is that one of the passengers who died at the crash scene appears to have been killed when she was run over by a fire engine. She was covered in foam on the ground and the firefighters were unaware of her location. Emergency response procedures will need to be reviewed as part of the investigation into this accident to ensure that first responders can do their jobs in the safest way possible.

To view an initial Cause Map of this issue, click on “Download PDF” above.

 

A Potentially Stinging Situation – Jellyfish Blooms

By Kim Smiley

Jellyfish are some of nature’s most impressive survivors.  They have been around since long before the dinosaurs roamed the earth and continue to thrive.  In some cases, they may even be thriving a little too successfully.  Massive jellyfish blooms can flourish in the right environment and can decimate other species and cause significant damage.

Naturally occurring jellyfish blooms have been around for ages and while they may be inconvenient at times, they aren’t particularly alarming.  The real concern is that manmade conditions may lead to the growth of jellyfish blooms at times or regions that wouldn’t normally see them.  Large numbers of jellyfish can cause a number of serious issues.  Safety is a concern because jellyfish stings are painful and can even be deadly.  Regions that depend on tourism can also be impacted because travelers may avoid areas with large numbers of jellyfish.  Jellyfish have caused damage to ships and buildings when they clog intake lines.  Populations of other species have also been decimated in some areas by jellyfish blooms which can affect commercial fishing operations.

What causes these jellyfish blooms can be explored by building a Cause Map or visual root cause analysis.  A Cause Map intuitively lays out causes that contribute to an issue and shows the cause-and-effect relationships between them.  In this example, the jellyfish blooms grow because jellyfish are well suited for life in low oxygen “dead zones” that are being created in the ocean.

It all starts with fertilizer containing nutrients running into the ocean. An algae bloom forms as algae feed on the nutrients. Eventually the nutrients are depleted and the algae die off, leading to the growth of a bacterial bloom as bacteria feed on the dead algae. The bacterial bloom depletes the oxygen, making the region unsuitable for most species. However, the opportunistic jellyfish can survive and even thrive in low oxygen levels. Jellyfish are able to grow and reproduce quickly, so the population surges upward in an environment with few predators and little competition.

A few facts so that the reproductive abilities of jellyfish can be fully appreciated: a single female jellyfish can release tens of thousands of eggs per day, and jellyfish are able to double their weight in a single day if food is abundant.

Eating habits of jellyfish also make it very difficult for other species to move back into the region even if oxygen levels increase. Jellyfish not only compete for the same food (plankton) as the larvae of other species, they are also fond of eating larvae and eggs. It’s difficult to compete with a species that is both a predator and a competitor.

Before anyone has nightmares of huge jellyfish causing wide scale destruction, I should note that researchers have not found evidence that jellyfish are in danger of overrunning the oceans.  But many scientists do believe that human activities have contributed to jellyfish blooms growing in localized areas.  It’s always worth trying to understand how human activities are impacting our environment, especially when a species so well equipped for survival is involved.

To view a high level Cause Map of this issue, click on “Download PDF” above.

50 Presumed Dead in Canadian Train Disaster

By ThinkReliability Staff

A tragic accident devastated the Canadian town of Lac-Mégantic, Quebec on July 6, 2013.  Much about the issue is still unknown.  When investigating an incident such as this, it can be helpful to identify what is known and information that still needs to be determined.

What is known: a 73-car train was parked in Nantes, Quebec, uphill from Lac-Mégantic. Of the cars, 72 contained crude oil. The train was left unattended and, late in the evening of July 5, 2013, a fire broke out in the locomotive. While the fire department of Nantes was putting out the fire, they turned off the train’s main engine. Less than two hours later, the train rolled down the track and derailed in Lac-Mégantic. After subsequent explosions and long-burning fires, 24 people have been confirmed dead, and 26 more are missing. Much of the town and the train, along with the evidence in it, is destroyed.

What is not known: The cause of the initial fire on the train is not known.  Whether or not the fire department should have explicitly notified the train engineer that the main engine had been shut off is not known.  What happened that allowed the train to roll downhill is unknown.

With this number of unknowns, it is helpful to visually lay out the cause-and-effect relationships that occurred, and what impact they had on those affected.  This can allow us to see the holes in our analysis and identify where more evidence is needed.  Once as much evidence as possible has been obtained, additional detail can be added to the cause-and-effect relationships.  Ensuring that all causes related to the incident are included will provide the largest number of solutions, allowing us to choose the most effective.  We can do all this in a Cause Map, or visual root cause analysis.

The first step in using any problem solving methodology is to determine the impact caused by the incident.  In this case, the deaths (and assumed deaths) are our most significant impact.  Also addressed should be the crude oil leakage (though much of it was likely burned off), the high potential for lawsuits, the possible impact on rail shipments, the destruction of the town and the train, and the response and cleanup efforts.  These form the initial “effects” for our cause-and-effect analysis.

Asking “Why” questions allows us to further develop the cause-and-effect relationships. We know that for the train to roll backwards down the hill, both sets of brakes had to be ineffective. The railway company has stated that the air brakes released because the main engine had been shut down. However, according to the New York Times, “since the 19th century, railways in North America have used an air-braking system that applies, rather than releases, freight car brakes as a safety measure when it loses pressure.” This certainly makes more sense than having brakes be dependent on engine power.

The hand brakes functioned as backup brakes. The number of cars (which, when on a hill, affects the force pulling on the train) determines the number of hand brakes required. In this case, the engineer claims to have set 11 hand brakes, but the rail company has now stated that they no longer believe this. No other information, or evidence that could help demonstrate what happened to either set of brakes, has been released.

Also of concern is the style of train cars, believed to be the same that the NTSB identified in a report on a previous train accident as “subject to damage and catastrophic loss of hazardous materials”.

In a tragedy such as this one, the first priority is to save and preserve human lives in every way possible.  However, once that mission is complete, evidence-gathering to determine what happened is the next priority.  As evidence becomes available it is added directly to the Cause Map, below the cause it supports or refutes.  Additional causes are added as necessary with the goal of determining all the cause-and-effect relationships to provide the largest supply of possible solutions to choose from.

The company involved has already stated it will no longer leave trains unattended.  That should be a big help but, given the consequences of this event, other solutions should be considered as well.

To view the Outline and Cause Map, please click “Download PDF” above.  Or click here to read more.

Train Derails After Collapse of Bridge in Calgary

By ThinkReliability Staff

Emergency crews in Calgary, already overworked from the heavy flooding in the area, have another potential emergency to mitigate.  In the very early morning of June 27, a rail bridge over the Bow River collapsed, just as the end of a long train went across.  The last six cars – five of which contained petroleum products, the last empty – derailed and are now precariously balanced on the collapsed bridge (but remain out of the river).

Emergency plans have been implemented to reduce the risk to the public in the area. Capturing the cause-and-effect relationships that are resulting in the risk to the area can show how the emergency actions are acting on causes and potential causes to keep the public and the environment safe. A visual representation of this type of root cause analysis is called a Cause Map. The Cause Mapping process has three steps.

First, we capture the what, when and where of the incident in a problem outline.  We also capture any differences such as the heavy recent flooding in the area.  Differences that we capture may or may not turn out to be causally related to the problem, but investigating the possibility is important.

The remainder of the outline is the impact to the goals.  Any “problem” is in fact an impact to one or more of an organization’s goals.  Framing the problem with respect to an organization’s goals ensures that solutions act to reduce these impacts and decrease disagreement about what the problem really is.  In this case, the safety goal is impacted due to the potential for injuries (though there have not been any associated with the derailment or response yet).  The potential leak of petroleum products is an impact to the environmental goal.  The area surrounding the train was evacuated, which can be considered an impact to the customer service goal.  (Frequently organizations will consider the surrounding communities as part of their customer base.)  The production/schedule goal is impacted because the bridge is unusable.  The damage to the bridge and potential damage to the cars (as yet unknown, because of their unstable position) are impacts to the property/materials goal. Lastly, the manpower being spent on emergency response and securing the train cars is an impact to the labor/time goal.

Once the impacts to the goals have been determined, the analysis begins with these impacted goals and is continued by asking “Why” questions.  For example, the risk of injury is caused by the potential leak of petroleum products.  The potential leak is due to the potential damage to the rail cars and the product contained within 5 of the cars.  Removing either of these causes reduces the risk of the effect.  In this case, plans are being made to remove the product from the trains to reduce the risk of both the safety and environmental impact.  The rail cars are being secured so that further bridge collapse will have less impact on the structural stability of the cars.
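The idea that removing either cause reduces the risk can be sketched as causes joined by AND: the effect is only possible when every contributing cause is present. This is a simplified, all-or-nothing model for illustration only (real risk is a matter of degree), and the dictionary and function names are assumptions, not part of the Cause Mapping method itself.

```python
def effect_possible(causes_present):
    """With AND-joined causes, the effect can occur only if all are present."""
    return all(causes_present.values())


# The two causes of a potential leak named in the paragraph above.
leak_causes = {
    "Potential damage to rail cars": True,
    "Petroleum product in the cars": True,
}

before = effect_possible(leak_causes)

# Removing the product from the cars takes away one of the causes.
after_removal = effect_possible(
    {**leak_causes, "Petroleum product in the cars": False}
)

print(before, after_removal)
```

Securing the cars works on the other cause in the same way, which is why either action helps on its own.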

The evacuation – removing the people from the area – also serves to reduce the risk of injury.  To reduce the potential environmental impact, booms have been installed down-river in case of product leak.  Of course, these are emergency, short-term solutions designed to reduce the impact of collapse and derailment.  Solutions to the issue of the bridge collapse and its causes will be looked at in more detail once the cars have been safely removed.

While the investigation of the issue is still ongoing, the information that is currently known can be added to the Cause Map. The structural failure that led to the bridge collapse appears to be caused by scouring of a support pier caused by the sudden influx of water related to the flooding in the area. How quickly the damage was discovered is also a concern. While investigations had been stepped up as a result of the flood, concerns from the city (which does not have jurisdiction for its own inspections) that the rail company’s inspections were insufficient will certainly be investigated. It’s also been determined that the bridge, unlike others in the area, was not built into the bedrock, decreasing its strength, though how much of a role that played in the collapse is yet to be determined. When more information is known, it can be added to the Cause Map.

To view the Outline and Cause Map, please click “Download PDF” above.