Tag Archives: Cause Mapping

Metro Train Derails in the Bronx, Killing 4 and Injuring More Than 60

By Kim Smiley

Four passengers were killed and dozens more sent to the hospital after a metro train derailed in the Bronx early Sunday, December 1, 2013.  At the time of the accident, the train was carrying about 150 passengers and was traveling to Grand Central Terminal in New York City. The aftermath of the accident was horrific with all seven cars of the commuter train derailing. Metro-North has been operating for more than 30 years and this was the first accident that resulted in passenger deaths.

A Cause Map, or visual root cause analysis, can be built to help analyze this accident.  There is still a lot of investigative work that needs to be done to understand what caused the derailment, but the information that is available can be used to create an initial Cause Map.  The Cause Map can easily be expanded later to incorporate more information as it becomes available.  The first step when building a Cause Map is to fill in an Outline with the basic background information.  The impacts to the goals are also documented on the bottom of the Outline.  The impacted goals are then used to begin building the Cause Map.

In this example, the safety goal is clearly impacted because there were four fatalities and over 60 people injured. The schedule goal is also significantly impacted because this portion of rail will be closed during most of the investigation. The National Transportation Safety Board has estimated that the investigation will take 7 to 10 days. The track closure is particularly disruptive because this is a major artery into New York City with a ridership of 15.9 million in 2012. Once the impacted goals are documented, the Cause Map itself is built by asking “why” questions.

So why did the train derail? The details aren’t known yet, but there is still some information that should be documented on the Cause Map. A question mark is included after a cause that may have contributed to an issue, but requires more evidence or investigation. It’s useful to document these open questions during an investigation to ensure that all the pertinent questions are asked and nothing is overlooked. (If it is determined that a cause didn’t play a role, it can be crossed out on the Cause Map to show that the cause was considered, but ruled out.) Two factors that likely played a role in the derailment are the speed of the train and the track design where the accident occurred. There is a sharp curve in the track where the derailment happened, and trains are required to reduce their speed before entering it. The latest reports from the investigation are that the train was traveling 82 mph in a 30 mph zone. The train operator has stated that the brakes malfunctioned and didn’t respond when he tried to reduce speed, and that the train was traveling too fast over the curved track.
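
The bookkeeping described here (a question mark for a cause that needs more evidence, a strike-through for a cause that has been ruled out) can be sketched as a small data structure. This is only an illustration: the `Cause` class, the `Status` names, and the text rendering are assumptions made for this example and are not part of the Cause Mapping method or any published tool. The causes listed are the ones mentioned in this post.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    CONFIRMED = "confirmed"    # supported by the available evidence
    OPEN = "open"              # possible cause, needs more evidence
    RULED_OUT = "ruled out"    # considered, then eliminated


@dataclass
class Cause:
    text: str
    status: Status = Status.CONFIRMED
    causes: list = field(default_factory=list)  # answers to "why?"

    def label(self) -> str:
        # Open questions get a trailing "?"; ruled-out causes stay visible.
        if self.status is Status.OPEN:
            return self.text + "?"
        if self.status is Status.RULED_OUT:
            return "[ruled out] " + self.text
        return self.text


def render(cause, depth=0):
    """Return the map as indented lines, one level per 'why'."""
    lines = ["  " * depth + cause.label()]
    for child in cause.causes:
        lines.extend(render(child, depth + 1))
    return lines


def open_questions(cause):
    """Collect every cause that still needs evidence."""
    found = [cause.text] if cause.status is Status.OPEN else []
    for child in cause.causes:
        found.extend(open_questions(child))
    return found


derailment = Cause("Train derailed", causes=[
    Cause("Sharp curve in track"),
    Cause("Train traveling 82 mph in 30 mph zone", causes=[
        Cause("Brakes malfunctioned", Status.OPEN),
    ]),
])

print("\n".join(render(derailment)))
print("Still to investigate:", open_questions(derailment))
```

Keeping ruled-out causes in the structure, rather than deleting them, mirrors the practice of crossing a cause out so that the map shows it was considered.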

Investigators have recovered the data recorder from the train, which will provide more information, including whether there was a problem with the brakes. Investigators will also interview all the relevant personnel and determine what happened to cause this deadly crash. Once the investigation is completed, any necessary solutions can be implemented to reduce the risk that a similar accident occurs in the future.

To view a completed Outline and initial Cause Map of this incident, click on “Download PDF” above.

Boeing 747 “Dreamlifter” Cargo Jet Lands At Wrong Airport

By Kim Smiley

On November 21, 2013, a massive Boeing 747 Dreamlifter cargo jet made national headlines after it landed at the wrong airport near Wichita, Kansas.  For a time, the Dreamlifter looked to be stuck at the small airport with a relatively short runway, but it was able to take off safely the next day after some quick calculations and a little help turning around.

At the time of the airport mix-up, the Dreamlifter was on its way to McConnell Air Force Base to retrieve Dreamliner nose sections made by nearby Spirit AeroSystems. Dreamlifters are notably large because they are modified jumbo jets designed to haul pieces of Dreamliners between the different facilities that manufacture parts for the aircraft.

So how does an airplane land at the wrong airport? It’s not entirely clear yet how a mistake of this magnitude was made. The Federal Aviation Administration is planning to investigate the incident to determine what happened and to see whether any regulations were violated. What is known is that the airports have some similarities in layout that can be confusing from the air. First off, there are three airports in fairly close proximity in the region. The intended destination was McConnell Air Force Base, which has a runway configuration similar to Jabara airfield, where the Dreamlifter landed by mistake. Both runways run north-south and are nearly parallel. It can also be difficult to determine how long a runway is from the air, so the shorter length isn’t necessarily easy to see. Beyond the airport similarities, the details of how the plane landed at the wrong airport haven’t been released yet.

What is known can be captured by building an initial Cause Map, a visual format for performing a root cause analysis. One of the advantages of Cause Maps is they can be easily expanded to incorporate more information as it becomes available. The first step in Cause Mapping is to fill in an Outline with the basic background information and to list how the issue impacts the overall goals. There are a number of goals impacted in this example. The potential for a plane crash means that there was an impact to both the safety and property goals because of the possibility of fatalities and damage to the jet. The effort needed to ensure that the jet could safely take off on a shorter runway is an impact to the labor goal, and the delay was an impact to the schedule goal. The negative publicity surrounding this incident can also be considered an impact to the customer service goal.

Once the Outline is completed, the Cause Map is built by asking “why” questions and intuitively laying out the answers until all the causes that contributed to the issue are documented.  Click on “Download PDF” above to see an Outline and initial Cause Map of this issue.

Good luck with any air travel planned for this busy holiday week.  And if your plane makes it to the right airport (even if it’s a little late), take a moment to be thankful because it’s apparently not the given I’ve generally assumed.

Can the Epidemic of Smartphone Thefts be Stopped?

By Kim Smiley

About 1.6 million handheld devices were stolen in the United States in 2012, the majority of which were smartphones.  In fact, the frequency at which the popular Apple devices are taken has given rise to a whole new term, “apple picking”.  Stolen smartphones cost consumers nearly $30 billion a year.  These thefts affect a significant number of smartphone owners with approximately 10 percent reporting that they have had a device stolen.

The problem of smartphone theft can be analyzed by building a Cause Map, a method for performing a visual root cause analysis. The first step is to complete an Outline by filling in the basic background information and listing how the issue impacts the overall goals. The impacts to the goals from the Outline are then used as the starting point for the Cause Map. Causes are then added by asking “why” questions to determine what other causes contributed to an issue. (To view a high level Cause Map of this issue, click on “Download PDF” above.)

So why do so many smartphones get taken? Smartphones are a popular target because it is lucrative to resell them, they are relatively easy to steal, and many of the crimes go unpunished. Smartphones are fairly easy to steal because they are readily available since so many people carry them, and they are both small and lightweight. Many criminals who steal smartphones go unpunished because there are so many of them taken and it is difficult to locate the thieves. Many stolen smartphones are shipped overseas, which further complicates the situation.

The black market for smartphones is lucrative because the items are popular and relatively expensive to buy new. People buy stolen smartphones because they are cheaper and can still be used by the “new owner”, especially overseas, where the networks are different and phones deactivated in the US may still work.

One of the possible solutions suggested to reduce the number of smartphone thefts is to include a kill switch in smartphone software.  This kill switch would essentially make the phone worthless because it would no longer function no matter where it was in the world.  If smartphones no longer have resale value, then there would be little incentive to steal them and the number of thefts should dramatically decrease.  While this idea is elegant in its simplicity, like most things there is more that needs to be considered.

The addition of a kill switch was recently rejected by cellphone carriers because of concerns about hacking and problems with reactivation.  If hackers found a way to flip the kill switches they would have the ability to destroy a huge number of smartphones from anywhere in the world.  Depending on how many users were targeted this could have a huge impact, which could be especially problematic for people who use their phones in an official capacity like law enforcement. It doesn’t take much imagination to see how this scenario could go horribly wrong. The proposed kill switch is also permanent so users won’t be able to reactivate their phones and any stolen phones that were recovered would be useless.  Companies continue to work on a number of ideas to make it more difficult to resell smartphones, but there isn’t general agreement on the best approach yet.  Only time will tell if the tide of smartphone thefts has peaked.

Pilot Response to Turbulence Leads to Crash

By ThinkReliability Staff

All 260 people onboard Flight 587, plus 5 on the ground, were killed when the plane crashed into a residential area on November 12, 2001. Flight 587 took off shortly after another large aircraft and experienced turbulence. According to the NTSB, the pilot’s overuse of the rudder mechanism, which had been redesigned and as a result was unusually sensitive, resulted in such high stress that the vertical stabilizer separated from the body of the plane.

This event is an example of an Aircraft Pilot Coupling (APC) event.  According to the National Research Council, “APC events are collaborations between the pilot and the aircraft in that they occur only when the pilot attempts to control what the aircraft does.  For this reason, pilot error is often listed as the cause of accidents and incidents that include an APC event.  However, the [NRC] committee believes that the most severe APC events attributed to pilot error are the result of the adverse APC that misleads the pilot into taking actions that contribute to the severity of the event.  In these situations, it is often possible, after the fact, to analyze the event carefully and identify a sequence of actions the pilot could have taken to overcome the aircraft design deficiencies and avoid the event.  However, it is typically not feasible for the pilot to identify and execute the required actions in real time.”

This crash is a case where it is tempting to chalk up the accident to pilot error and move on.  However, a more thorough investigation of causes identifies multiple issues that contributed to the accident and, most importantly, multiple opportunities to increase safety for future pilots and passengers.  The impacts to the goals, causes of these impacts, and possible solutions can be organized visually in cause-and-effect relationships by using a Cause Map.  To view the Outline and Cause Map, please click “Download PDF” above.

The wake turbulence that initially affected the flight was due to the small separation between the flight and a large plane that took off 2 minutes prior (the separation required by the FAA). This led to a recommendation to re-evaluate the separation standards, especially for extremely large planes. In the investigation, the NTSB found that the training provided to pilots on this particular type of aircraft was inadequate, especially because changes to the aircraft’s flight control system rendered the rudder control system extremely sensitive. This combination is believed to be what led to the overuse of the rudder system, leading to stress on the vertical stabilizer that resulted in its detachment from the plane. Specific formal training for pilots based on the flight control system for this particular plane was incorporated, as was evaluation of changes to the flight control system and a requirement for handling evaluations when design changes are made to flight control systems for previously certified aircraft. A caution box related to rudder sensitivity was incorporated on these planes, as was a detailed inspection to verify stabilizer-to-fuselage and rudder-to-stabilizer attachments. An additional inspection was required for planes that experience extreme in-flight lateral loading events. Lastly, the airplane upset recovery training aid was revised to assist pilots in recovering from upsets such as the one in this event.

Had this investigation been limited to a discussion of pilot error, revised training might have been developed, but the causes behind the other solutions that were recommended and/or implemented as a result of this accident would likely never have been discussed. It’s important to ensure that incident investigations address all the causes, so that as many solutions as possible can be considered.

The Morris Worm: The First Significant Cyber Attack

By Kim Smiley

In 1988 the world was introduced to the concept of a software worm when the Morris worm made headlines for significantly disrupting the fledgling internet. The mess left in the wake of the Morris worm took several days to clean up. The estimates for the cost of the Morris worm vary greatly, from $100,000 to $10,000,000, but even at the lower end of the range the numbers are still substantial.

A Cause Map, or visual root cause analysis, can be used to analyze this issue.  A Cause Map is built by asking “why” questions and using the answers to visually lay out the causes that contributed to an issue to show the cause-and-effect relationships.  In this example, a programmer was trying to build a “harmless” worm that could be used to gauge the size of the internet, but he made a mistake.  The goal was to infect each computer one time, but the worm was designed to duplicate itself every seventh time a computer indicated it already had the worm to make the worm hard to defend against.  The problem was that the speed of propagation was underestimated. Once released, the worm quickly reinfected computers over and over again until they were unable to function and the internet came crashing down.  (To view a Cause Map of this example, click on “View PDF” above.)
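
The reinfection rule described here can be sketched in a few lines to show why copies pile up on a single machine. This is a simplified model and not the worm's actual code: the function name and parameters are made up for this example, and it uses a strict every-seventh counter as the post describes it.

```python
def copies_after(attempts: int, reinfect_every: int = 7) -> int:
    """Count worm copies on one host after a number of infection attempts."""
    copies = 0
    already_infected_replies = 0
    for _ in range(attempts):
        if copies == 0:
            copies = 1  # first arrival: the host is clean, so install
            continue
        # The host reports that it already has the worm.
        already_infected_replies += 1
        if already_infected_replies % reinfect_every == 0:
            copies += 1  # install anyway, despite the "already infected" reply
    return copies


for attempts in (1, 8, 100, 1000):
    print(attempts, "attempts ->", copies_after(attempts), "copies")
```

In this model the number of copies on a host keeps growing for as long as infection attempts keep arriving, and every new copy generates more attempts against other machines. That feedback is consistent with the underestimated propagation speed described in the post.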

One of the lasting impacts from the Morris worm that is hard to quantify is the impact on cyber security. The worm exploited known bugs that no one had worried about enough to fix. At the time of the Morris worm, there was no commercial traffic on the internet or even Web sites. The people who had access to the internet were a small, elite group and concerns about cyber security hadn’t really come up. If the first “hacker” attack had had malicious intent behind it and had come a little later, it’s likely that the damage would have been much more severe. While the initial impacts of the Morris worm were all negative, it’s a positive thing that it highlighted the need to consider cyber security relatively early in the development of the internet.

It’s also interesting to note that the programmer behind the Morris worm, Robert Tappan Morris, became the first person to be indicted under the 1986 Computer Fraud and Abuse Act. He was sentenced to a $10,050 fine, 400 hours of community service, and three years of probation. Morris was a 23-year-old graduate student at the time he released his infamous worm. After this initial hiccup, Morris went on to have a successful career and now works in the MIT Computer Science and Artificial Intelligence Laboratory.

16-Day Government Shutdown Affects Economy

By Holly Maher

On October 1, 2013 at 12:01 AM, the beginning of the 2014 fiscal year, the federal government shut down all non-essential operations when Congress could not pass a continuing resolution to allow spending at current levels. The government shutdown lasted 16 days and, in addition to other impacts, closed the National Parks system (see our blog about the park closures), furloughed 800,000 federal employees, had the potential to impact payment of veterans’ benefits and negatively impacted the economy, both directly and indirectly.

So what caused the government shutdown? If you watched any TV during that 16-day period, you could certainly hear any number of experts (on both sides) explaining who was to blame. In keeping with the Cause Mapping methodology, this analysis of the government shutdown is not trying to identify the one person, the one group or the one reason to blame for the shutdown. Instead, we will identify all the causes required to produce this effect. This will allow us to identify many possible solutions for preventing it from happening again. We start by asking “why” questions and documenting the answers to visually lay out all the causes that contributed to the shutdown. The cause-and-effect relationships are laid out from left to right.

In this example, the government shutdown occurred because Congress could not pass a continuing resolution bill: a line item defunding the Affordable Care Act (ACA) had been added to the continuing resolution, and it could not be agreed upon. A continuing resolution was required because the Constitution gives the power to spend money to Congress, and since Congress had not passed a budget for fiscal year 2014, a continuing resolution was constitutionally required to continue operating the government after October 1. Defunding the ACA was added to the continuing resolution bill because the ACA was about to go into effect and because it can be added on a line item basis. Congress was unable to compromise to reach an agreement to pass the continuing resolution.

So why was Congress unable to reach an agreement? If the incentive to compromise was greater than the incentive to not compromise, they would have compromised. So why is the incentive to compromise ineffective? One reason is that Congress’s pay is not affected when the government shuts down. Another reason is that there is significant incentive to maintain a position aligned with the party (either left or right). The desire to get re-elected (the number of terms is unlimited in Congress), the need for support in the primaries to get re-elected (based on the current primary system), and the need for campaign financing are all causes that support the incentive to maintain alignment with the party versus compromise.

Once all the causes of the government shutdown have been identified, possible solutions to prevent the shutdown from happening again can be brainstormed. One possible solution would be to legally require a continuing resolution to be a “clean” bill, with no additional line items. This would make it more likely in the future, when there are debates or discussions over current, hot button items, such as the ACA, that the result would not be a failure to pass the continuing resolution and therefore cause a government shutdown. Another possible solution would be to stop pay for Congress during the government shutdown. Other more global, systemic solutions might be to implement term limits in Congress or provide government campaign financing to reduce the dependency on party financial support.

To view the Outline and Cause Map, please click “Download PDF” above.

Utah Fights for National Parks

By ThinkReliability Staff

Beginning on October 1, 2013 with the failure to approve spending, the US government entered a partial shutdown including the complete closure of the National Parks, as specified in the National Park Service Contingency Plan. While the government shutdown had far-reaching effects, both across industry and geographically, areas of Utah were hit particularly hard by the closure of multiple National Parks in the area. The shutdown finally ended on October 17 when the government reached a deal to reopen.

A large proportion of Utah businesses are dependent on revenue brought in from tourists visiting the multiple Federal lands in the state, which include National Parks, National Monuments and National Recreation Areas. A total of five counties in Utah declared a state of emergency, with the counties saying they were losing up to $300,000 a day. San Juan County, the last to declare a state of emergency, went a step further and decided it would reopen the parks itself, using local personnel to provide necessary emergency response and facilities for park visitors.

On October 10, the state of Utah came to an agreement with the Department of the Interior to pay for the Park Service to reopen the parks for up to 10 days at a cost of $166,572 a day. (It is possible, though not automatic, that the state will be reimbursed for these costs after funding is restored.) Luckily a “practical and temporary solution” (as described by Secretary of the Interior Sally Jewell) was found before county officials had to resort to what they described as “civil disobedience”. (Trespassing in a National Park can result in a citation that could lead to fines or jail terms.)

This situation mirrors one frequently found on a smaller scale in all workplaces. Concerned employees find themselves in circumstances that they believe are not in the best interest of their company or customers. If support for change is not provided by management, these employees will develop workarounds (like illegally reopening a National Park to allow tourists to enter). Sometimes workarounds are actually a more effective way of completing work tasks, but they can also lead to unintended consequences that can be disastrous.

This is why the most effective work processes are developed with the experience and insight of employees at all levels. Taking their concerns into account during the development of procedures and on an ongoing basis will reduce the use of potentially risky workarounds, and can help an organization meet all of its goals.

To view the Outline, Cause Map, and considered solutions, please click “Download PDF” above.

Rules on Inflight Electronics May be Changing Soon

By Kim Smiley

In welcome news to many airline passengers, it looks like the FAA may soon allow the use of personal electronic devices during the entire duration of flights, including takeoff and landing. The current restrictions on the use of personal electronics are being reviewed following a recent recommendation by an aviation advisory committee made up of pilots, mechanics, engineers and other aviation experts.

A Cause Map, a visual format for performing a root cause analysis, can be used to analyze this issue. A Cause Map is built by asking “why” questions and intuitively laying out the many causes that contributed to an issue to show the cause-and-effect relationships. The first step in the Cause Mapping process is to document the basic background information, as well as list how the issue impacts the goals, in an Outline.

One of the major impacts for this example is the concern that use of personal electronic devices onboard aircraft may be dangerous and increase the risk of a plane crash. Currently, the use of personal electronics is allowed once a plane is above 10,000 feet, which is basically the whole flight except landing and takeoff, which are considered the most critical portions of the flight. These restrictions are in place because pilots depend on electronic systems, such as navigation and communications systems, to safely do their jobs and there is concern about the potential for interference with these vital systems.

How likely it is that dangerous interference could be an actual issue is debated. There were 75 reports by pilots of suspected electronic device interference between 2003 and 2009, according to the International Air Transport Association. However, it’s difficult to reproduce interference and it has never been cited as a cause in any airplane accident. The current ban on the use of electronics also seems to be loosely enforced, raising questions about its necessity and effectiveness. (A survey by the Consumer Electronics Association also found that nearly a third of airplane passengers said they left on a portable electronic device on a flight during the previous year.) There seems to be a general consensus that this is a low-risk issue, but the potentially high consequences if it occurs have made some reluctant to reduce the restrictions.

There are also some non-technical issues that need to be considered with the onboard use of electronics.  There is concern that passengers enthralled with their devices will be distracted and miss important information during preflight safety briefs.  There is also a concern that larger devices, such as laptops, could become a missile hazard and hurt passengers if the plane moves unexpectedly.

If the new recommendations are approved, passengers will be able to use any device that doesn’t transfer data, and to do so for the entire flight, including takeoff and landing. Passengers would be able to leave all devices turned on, but they would need to set them to airplane mode so that no data is transmitted. So you won’t be able to make calls on your smartphone or stream video, but you would be able to rock out to music already downloaded or read a book on a Kindle. Larger devices will still need to be stowed during takeoff and landing because nobody wants to be hit with a laptop, but smaller gadgets will be fair game if the new recommendations are adopted.

To see a Cause Map of this issue, click on “Download PDF” above.

NYT Website Disrupted for Hours

By Kim Smiley

On Tuesday, August 27, 2013 the New York Times website went dark for several hours after being attacked by a well-known group of hackers.   Reports of hacked websites are becoming increasingly common and the New York Times was just one of many recent victims.

A Cause Map, or visual root cause analysis, can be used to analyze the recent attack on the New York Times website.  A Cause Map lays out the many causes that contribute to an issue in an intuitive format that illustrates the cause-and-effect relationships.   A Cause Map is useful for understanding all the causes involved and can help when brainstorming solutions.  To see a Cause Map of this example, click on “Download PDF” above.

Some details of how the attack was done have been released, as documented on the Cause Map. The New York Times website itself was not technically hacked, but traffic was redirected away from the legitimate website to another web domain.   To pull off this feat, hackers changed the domain name records for the New York Times website after acquiring the user name and password of an employee at the domain name registrar company.  The employee inadvertently provided the information to the hackers by responding to a phishing email asking for personal information.

The email sent by the hackers looked legitimate enough to fool the employee.

So why did hackers target the New York Times in the first place? The answer is that the New York Times is one of many western media outlets to be targeted by the Syrian Electronic Army (S.E.A.), which has claimed responsibility for the attack. The S.E.A. supports President Bashar al-Assad of Syria and is generally unhappy with the way the events in Syria have been portrayed in the West.

So the next logical question is how do you protect yourself from a phishing scheme?  The first step is awareness.  Pretty much everybody who uses email can expect to receive some suspicious emails.  A few things to look out for:  attachments, links, misspellings, and a mismatched “from” field or subject line.  Also any alarming language should be a red flag.  For example, an email from your credit card company warning you that your account will be closed unless you take immediate action is probably not the real deal.  A good rule of thumb is to never respond to any email with personal information or to click on links in emails. If you think a request for action may be real, either call the company or open a new web browser window and type in the company’s web address.  It’s best to delete any suspicious emails immediately.
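
The warning signs listed here can be turned into a rough checklist in code. The sketch below is only an illustration of that checklist, not a real spam filter: the function name, the phrase list, and the way the "from" mismatch is detected are all assumptions made for this example, and the addresses are invented. Misspelling detection is left out to keep the example short.

```python
import re

# Hypothetical phrases standing in for "alarming language".
ALARM_PHRASES = (
    "immediate action",
    "account will be closed",
    "verify your password",
)


def red_flags(sender_name, sender_address, body, attachments=()):
    """Return the warning signs found in one email, as a list of labels."""
    flags = []
    if attachments:
        flags.append("attachment")
    if re.search(r"https?://", body, re.IGNORECASE):
        flags.append("link")
    lowered = body.lower()
    if any(phrase in lowered for phrase in ALARM_PHRASES):
        flags.append("alarming language")
    # Crude mismatch test: the displayed sender name, squeezed down to
    # letters and digits, should appear in the sending domain.
    domain = sender_address.rsplit("@", 1)[-1].lower()
    domain = domain.replace("-", "").replace(".", "")
    claimed = re.sub(r"[^a-z0-9]", "", sender_name.lower())
    if claimed and claimed not in domain:
        flags.append("mismatched from field")
    return flags


suspicious = red_flags(
    "Example Bank",
    "alerts@examp1e-bank.test",
    "Your account will be closed unless you take immediate action: "
    "http://examp1e-bank.test/login",
)
print(suspicious)

routine = red_flags(
    "Example Bank",
    "news@examplebank.test",
    "Your monthly statement is ready.",
)
print(routine)
```

A checklist like this will miss well-crafted messages and will flag plenty of legitimate ones, so the advice in the post still applies: when in doubt, contact the company through a channel you look up yourself.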

This example is also a good reminder to be aware that websites can get hacked. A great example of this is when the S.E.A. hacked the Associated Press’s Twitter feed last April and used it to announce (falsely) that the White House had been bombed. That one tweet is estimated to have caused a $136 billion loss in the stock markets as people responded to the news. In general, it is probably good to be skeptical about anything shocking you read online until the information is confirmed.

What Happens When a Copy Isn’t a Copy?

By Kim Smiley

Think of how many documents are scanned every day. Imagine how important some of these pieces of paper are, such as invoices, property records, and medical files. Now try to picture what might happen if the copies of these documents aren’t true copies. This is exactly the scenario that Xerox was recently facing.

It recently came to light that some copies of scanned documents were altered by the scanning process. Specifically, some scanner/copier machines changed numbers on documents. This issue can be analyzed by building a Cause Map, an intuitive, visual format for performing a root cause analysis. The first step in the Cause Mapping process is to fill in an Outline with the basic background information on an issue. Additionally, the impacts to the overall goals are documented on the Outline to help clarify the severity of any given issue. In this example, the customer service goal is impacted because the scanners weren’t operating as expected. There is also a potential impact to the overall economic goal because the altered documents could result in any number of issues. There is also an impact because of the labor needed to investigate and fix the problem.

After completing the Outline, the next step is to ask “why” questions to build the Cause Map. Why weren’t the scanners operating as expected? This happened because the scanners were changing some documents during the scanning process. Scanners use software to help interpret the original documents and Xerox has stated that the problem happened because of a software bug. Testing showed that the number substitutions were more likely to occur when the settings on the scanners were set to lower quality/higher compression because of the specific software used for these settings. Testing also showed that the error was more likely to occur when scanning those documents that were more difficult to read, such as those with small fonts or that had already been copied multiple times.
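
To illustrate how a higher compression setting could swap one digit for another, here is a toy model. It rests on a loud assumption: the post does not say which algorithm the scanners used, so the sketch assumes a generic compressor that stores one bitmap for each group of "similar enough" characters and reuses it. The 3x5 glyph bitmaps, the function names, and the `max_diff` setting are all invented for this example.

```python
# Each glyph is a 3x5 bitmap written row by row as 15 characters.
GLYPHS = {
    "6": "111100111101111",
    "8": "111101111101111",
    "1": "010110010010111",
}


def distance(a, b):
    """Number of pixels that differ between two bitmaps."""
    return sum(x != y for x, y in zip(a, b))


def scan(page, max_diff):
    """Return the page text as it would read after toy compression.

    max_diff is the assumed "compression" knob: how many differing
    pixels are tolerated before a new symbol is stored.
    """
    stored = []  # (character, bitmap) pairs kept in the symbol dictionary
    out = []
    for ch in page:
        bitmap = GLYPHS[ch]
        for kept_char, kept_bitmap in stored:
            if distance(bitmap, kept_bitmap) <= max_diff:
                out.append(kept_char)  # reuse the stored symbol
                break
        else:
            stored.append((ch, bitmap))  # no close match: store a new symbol
            out.append(ch)
    return "".join(out)


print(scan("1686", max_diff=0))  # strict matching keeps every digit
print(scan("1686", max_diff=1))  # looser matching merges "8" into "6"
```

In this toy model the "6" and "8" bitmaps differ by a single pixel, so a tolerance of one pixel is enough to replace the 8 with a 6. Small fonts and repeatedly copied originals would shrink the differences between characters in a similar way, which fits the testing results described in the post.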

Xerox had been aware of the potential for number substitution at lower quality settings, but didn’t appear to expect it to occur at factory settings (which was found to be very unlikely, but possible). A notice that stated that character substitutions were possible appeared on the scanners when lower resolution settings were selected and was included in some manuals, but this approach seems to have been ineffective since many users were caught unaware by this issue.

After a Cause Map has been built with enough detail to understand the issue, it can be used to help develop solutions. In this example, Xerox developed a software patch that corrected the error. Xerox also published several blog posts on its website to keep customers informed about the issue and worked with users to ensure that the patch was successful in correcting the error.

To see a high level Cause Map of this issue, click on “Download PDF” above.