Category Archives: Root Cause Analysis – Incident Investigation

Impact of Gasoline Spending on US Household Budgets

By Holly Maher

Having worked in an oil refinery for the majority of my career, I have been asked “why is gasoline so expensive?” on more than a few occasions.  The question is normally asked with a great deal of frustration and sometimes with a bit of anger directed at the oil companies (and those who work for them).  So, with summer driving season officially kicked off, it seems like an appropriate time to tackle this issue.

If we ask the question “What is the problem?” we can expect to get different answers:  crude oil price is too high, oil companies are making too much profit, people are driving too many SUVs, etc.  All of these answers give perspectives on what different people view as the problem, which is subjective.  So in order to start the analysis, we have to identify how this issue is impacting our goals.  In terms of the impact to the average American family, the annual spending on gasoline is impacting the household budget.  In 2011, the average spending on gasoline was $2,655 or roughly 4% of the average household gross income.

Once we have identified the impact to the goal we can begin the analysis.  We start by asking “why” questions and documenting the answers to visually lay out all the causes that contributed to this impact.  The cause-and-effect relationships lay out from left to right.  The average annual spending on gasoline is caused by both the price of a gallon of gasoline, which in 2011 was $3.52/gal, as well as the annual consumption of gasoline (average household consumption was 754 gallons in 2011).  Although the national discussion tends to focus on the price at the pump, the price alone does not create the impact to the household budget (you don’t see too many articles on the price of a gallon of milk, which, by the way, in 2011 was $3.57/gallon).
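As a quick check of the arithmetic above, here is a minimal Python sketch that multiplies the two 2011 figures quoted in this post.  The function name and the 10% scenarios are illustrative assumptions, not part of the original analysis.  The baseline works out to about $2,654, in line with the $2,655 quoted above (the small difference is presumably rounding in the inputs).

```python
# Figures quoted in the post for 2011.
PRICE_PER_GALLON = 3.52       # dollars per gallon
GALLONS_PER_HOUSEHOLD = 754   # gallons per household per year


def annual_gasoline_spending(price_per_gallon, gallons_per_year):
    """Annual spending is simply price multiplied by consumption."""
    return price_per_gallon * gallons_per_year


spending = annual_gasoline_spending(PRICE_PER_GALLON, GALLONS_PER_HOUSEHOLD)

# Hypothetical scenarios: cut the price by 10%, or cut consumption by 10%.
cheaper_gas = annual_gasoline_spending(PRICE_PER_GALLON * 0.9, GALLONS_PER_HOUSEHOLD)
less_driving = annual_gasoline_spending(PRICE_PER_GALLON, GALLONS_PER_HOUSEHOLD * 0.9)

print(f"Baseline spending:     ${spending:,.2f}")
print(f"10% lower price:       ${cheaper_gas:,.2f}")
print(f"10% lower consumption: ${less_driving:,.2f}")
```

Because spending is a simple product, a 10% cut in consumption lowers the household total by exactly as much as a 10% cut in price, which is why both branches of the Cause Map matter.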

The price of a gallon of gasoline is set by four primary causes:  crude oil price (~68% of the price), state/local/federal taxes (~13% of the price), transportation and marketing (~11% of the price) and the cost of refining the crude oil into usable products (~8% of the price).  The price of crude oil in 2011 was $94.87/bbl (barrel), compared to $27.39/bbl in 2000 and $23.19/bbl in 1990.  This price is set by normal supply and demand economics, both internationally and domestically.  The global demand for crude oil has dramatically shifted in recent years as the countries in eastern Asia have moved into their “industrial revolution”.  The supply of crude oil globally is set not only by total oil well capacity, but also by transportation availability, OPEC targets, and political sanctions on oil-producing countries.
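To put the percentages above into dollars, the following Python sketch applies the approximate shares to the 2011 average price of $3.52/gal quoted earlier.  The names `PRICE_SHARES` and `price_components` are made up for this illustration, and the shares are the rounded values from this post, so the results are rough estimates only.

```python
PRICE_PER_GALLON = 3.52  # 2011 average quoted earlier in the post

# Approximate shares of the pump price, as listed above.
PRICE_SHARES = {
    "crude oil": 0.68,
    "taxes": 0.13,
    "transportation and marketing": 0.11,
    "refining": 0.08,
}


def price_components(price_per_gallon, shares):
    """Split a pump price into dollar amounts using fractional shares."""
    return {name: price_per_gallon * share for name, share in shares.items()}


components = price_components(PRICE_PER_GALLON, PRICE_SHARES)
for name, dollars in components.items():
    print(f"{name}: ${dollars:.2f} per gallon")
```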

In addition to normal supply and demand economics, crude oil is traded as a commodity on futures markets and is susceptible to price fluctuation based on fear and speculation.  Prior to 2000, the energy market and trading of energy futures were regulated because of the significant impact they could have on the economy.  In 2000, the energy sector was deregulated as part of the Commodity Futures Modernization Act of 2000.

The average annual household consumption of gasoline in 2011 was 754 gallons.  This is caused by the annual miles driven per car (15,000 miles), the number of cars per household (1.95 cars), and the fuel efficiency of the cars.  The average mileage per car is caused by commute mileage, whether household members carpool, whether household members utilize public transportation, and recreational miles driven (outside of work).  The fuel efficiency of cars is determined by the types of cars driven, the fuel efficiency technology available and the vehicle fuel efficiency standards required by law.  In 2011, 50% of the household vehicles purchased were classified as light trucks.  New fuel efficiency standards were introduced for vehicles in 2011 requiring passenger cars to meet 30.2 miles per gallon (mpg) and light trucks to meet 24.2 mpg.  This was an increase of 2 mpg for each type of vehicle.
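To get a feel for what a 2 mpg increase means in gallons, here is a small Python sketch.  It assumes, purely for illustration, that a vehicle achieves exactly the standard's mpg on the road and is driven the 15,000 miles per year quoted above.  The function names are hypothetical.

```python
MILES_PER_CAR = 15_000  # annual miles per car quoted above


def gallons_per_year(miles, mpg):
    """Fuel used in a year is miles driven divided by miles per gallon."""
    return miles / mpg


def savings_from_new_standard(miles, new_mpg, increase_mpg=2.0):
    """Gallons saved per year by moving from the old standard to the new one."""
    old_mpg = new_mpg - increase_mpg
    return gallons_per_year(miles, old_mpg) - gallons_per_year(miles, new_mpg)


car_savings = savings_from_new_standard(MILES_PER_CAR, 30.2)    # passenger cars
truck_savings = savings_from_new_standard(MILES_PER_CAR, 24.2)  # light trucks

print(f"Passenger car: about {car_savings:.0f} gallons saved per year")
print(f"Light truck:   about {truck_savings:.0f} gallons saved per year")
```

Under these assumptions the same 2 mpg improvement saves more fuel on the less efficient vehicle, because consumption varies with the inverse of mpg.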

Once the analysis has been broken down into its causes, solutions can be identified to mitigate the impact to the goal.  Even with this initial, basic analysis, solutions start to become visible.  Household members could carpool more (with friends, co-workers, or their spouse).  Household members could take public transportation, if available, and communities could work to make public transportation more available to residents.  Households could purchase more fuel-efficient vehicles.  The government could continue to increase fuel efficiency requirements.  The government could pass a law re-regulating the energy sector.

As with any incident or problem with significant impact to the goal(s), the analysis always reveals more than one single cause.  Being able to see multiple causes gives us the opportunity to find more than one potential solution.

To view the Outline, Cause Map, and solutions please click “Download PDF” above.

Software Glitch in Electronic Voting System during Belgium’s Federal Election

By Kim Smiley

At the most basic level, the idea behind elections seems very simple: let every citizen vote one time and count the votes.  But in reality, it often proves difficult to quickly and accurately collect and count thousands and thousands of votes.  The recent software bug during the May federal elections in Belgium illustrates some of the technical difficulties that can come into play during an election.

Belgium held federal elections on May 25, 2014 and used an electronic voting system to collect and count many of the votes.  While computing election results, officials realized that some of the votes weren’t calculating correctly.  Announcement of the election results was delayed while the problem was addressed, but the bigger problem is that any software hiccups during elections make people question the validity of the vote.

Government officials have stated that the problem was quickly addressed and that the impacted votes would not have changed the outcome of the election, but the lack of transparency in the process worries some.  In fact, many countries have banned the use of electronic voting because of concern over potential issues, and Belgium is one of the few European countries to still use e-voting machines.

There are two separate electronic voting systems in use in Belgium.  The software glitch impacted the older, first-generation Jites system, which runs on computers using the DOS operating system.  The Jites system was certified and tested, but the test program should be reevaluated before future elections because it missed a significant software glitch.  Another option would be to upgrade the first-generation computers before the next election, which would reduce the risk of future issues by leaving only one system to test and maintain.

Conducting a large-scale national vote is a tricky problem and worth pondering.  The system needs to be transparent enough that the public feels the system is “fair”, but secret enough that individual voters are assured privacy.  Officials need to be able to ensure that only eligible voters participate, but the process cannot be so onerous that it inhibits citizens’ ability to navigate it (think of the ongoing debate in the US regarding photo IDs).  There are a number of strong, opposing forces at play in the process and any issues like a software error only add fuel to the fire.

To view the Outline and the root cause analysis Cause Map, please click “Download PDF” above.  Or click here to read more.

Kitty Litter Cause of Radiological Leak?

By ThinkReliability Staff

The rupture of a container filled with nuclear waste from Department of Energy (DOE) sites that resulted in the radiological contamination of 21 workers appears to have resulted from a heat-producing reaction, possibly between the nuclear waste and the kitty litter used to stabilize the waste.

DOE photo of damaged container

Yes, you read that correctly. The same stuff you use for Fluffy’s “business” is also used to stabilize nuclear waste.  However, the kitty litter typically used is clay.  One of the sites that provides waste to the Waste Isolation Pilot Plant, where the release occurred, changed from clay kitty litter to organic kitty litter, which is made of plant material.  Although the reaction that resulted in the container’s rupture has not yet been determined, it is possible that it was due to the change in litter.

We can look at this incident in a Cause Map, or visual root cause analysis, to lay out both the effects and causes.  In this case, the effects were significant.  Twenty-one workers were found to have internal radiological contamination, impacting the safety goal.  A radiological release off-site impacted the environmental goal.  The waste repository has been shut down and is not accepting shipments, impacting both the customer service and production goals.  The release requires investigation by a formal Accident Investigation Board, impacting the regulatory and labor goals.  Lastly, the damage to the container is an impact to the property goal.

The release was caused by the rupture of a container that stored radiological waste, including americium and plutonium.  The release was able to leave the underground storage facility through a leak path in the ventilation system.  The ventilation system was not designed for containment because the safety analysis assumed that a release within the storage facility would result from a roof panel fall, a scenario it considered adequately prevented.

The rupture appears to have resulted from a heat-producing reaction. The constituents of that reaction have not yet been determined, but the change from clay to organic kitty litter has been identified as a possible cause.  (A possible cause indicates a cause for which evidence is not yet available.)  More research is being done to determine the actual reaction.  This will also allow a determination of which other waste containers may be at risk for rupture.

A solution that has already been implemented is to seal the leaks in the ventilation system with foam to reduce the risk of leak-by.  Other solutions that have been suggested are to add heavy-duty containment around the affected casks, reclassify the ventilation system as containment, and perform an independent review of the safety analysis of the site.  Once appropriate solutions are determined and implemented, it is hoped the site will be able to reopen.

To view the Outline and Cause Map, please click “Download PDF” above.

Smoke at FAA Facility Results in Major Flight Disruptions

By Kim Smiley

A smoking bathroom fan resulted in the disruption of more than a thousand flights in the Chicago area on May 13, 2014, in a dramatic demonstration of real-time cause-and-effect.  This incident illuminates how a relatively small issue can quickly grow into an expensive and time-consuming problem.  In an ideal world, a smoking bathroom fan wouldn’t result in national headlines.

So what happened?  How did a smoking bathroom fan that wasn’t even at the airport delay so many flights?  A Cause Map, a visual method for performing a root cause analysis, is a useful tool for understanding the causes that contributed to an issue.  When building a Cause Map, causes are laid out based on cause-and-effect relationships to clearly show what led to the problem.

In this example, flights were delayed because there was limited support from air traffic control available and air traffic control support is necessary for safe operation.  Air traffic control support was reduced because the Elgin FAA facility that monitors airports in the Chicago area was evacuated for several hours because the building was filled with smoke.  The building had to be evacuated for personnel safety and it took some time to reestablish safe conditions.  Emergency personnel had a difficult time pinpointing the source of the smoke because it spread through the space.  The smoke was throughout the building because the source of the smoke, a bathroom fan, was part of the HVAC system.

The media reports didn’t provide details about why exactly the bathroom fan was smoking in this particular case, but bathroom fans are a relatively common cause of building fires.  Lint or dust can build up in the fan motor over time, eventually leading to the motor overheating.  The situation can quickly become dangerous, particularly when a motor is left powered after it has seized, which is a common failure mode for this equipment.

A few fairly easy things can be done to reduce the risk of bathroom fan fires.  Fans should be cleaned at least annually, and more frequently if they appear dirty or dusty.  A motor that is making unusual noises should be immediately turned off and inspected by an electrician prior to being returned to service.  Any fan that isn’t making the typical whizz sound should also be powered off and repaired or replaced prior to use because a motor that isn’t rotating has a greater likelihood of overheating.  Older models that aren’t thermally protected are most at risk for a fire and replacing them with a newer model with thermal protection can significantly reduce the risk of fire.

To view a high level Cause Map, click on “Download PDF” above.

Hundreds of Flights Disrupted After Air-Traffic Control System Confused by U-2 Spy Plane

By Kim Smiley

Hundreds of flights were disrupted in the Los Angeles area on April 30, 2014, when the En Route Automation Modernization (ERAM) air traffic control system crashed.  It’s been reported that the presence of a U-2 spy plane played a role in the air traffic control issues.

This issue can be analyzed by building a Cause Map, a visual format for performing a root cause analysis.  A Cause Map intuitively lays out the cause-and-effect relationships so that the problem can be better understood and a wider range of solutions considered.  In order to build a Cause Map, the impacted goals are determined and “why” questions are asked to determine all the causes that contributed to the issue.

In this example, the schedule goal was clearly impacted because 50 flights were canceled and more than 400 were delayed.  Why did this occur?  The flight schedule was disrupted because planes were unable to land or depart safely because the air traffic control system used to monitor the landings was down.  The computer system crashed because it became overwhelmed when it tried to reroute a large number of flights in a short period of time.

The system attempted to reroute so many flights at once because the system’s calculations showed that there was a risk of plane collisions because the system misinterpreted the flight path, specifically the altitude, of a U-2 on a routine training mission in the area.  U-2s are designed for ultra-high altitude reconnaissance, and the plane is reported to have been flying above 60,000 feet, well above any commercial flights.  The system didn’t realize that the U-2 was thousands of feet above any other aircraft so it frantically worked to reroute planes so they wouldn’t be in unsafe proximity.
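The details of ERAM’s conflict-detection logic were not given in the reports, so the following is only a toy Python sketch of the general idea described above: two aircraft are flagged only if they are too close both horizontally and vertically, and ignoring or misreading altitude turns a harmless overflight into an apparent conflict.  The `Track` class, the separation thresholds, the positions, and the 35,000 ft airliner altitude are all hypothetical values chosen for illustration; only the 60,000 ft U-2 altitude comes from the post.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Track:
    x_nm: float         # east-west position, nautical miles
    y_nm: float         # north-south position, nautical miles
    altitude_ft: float  # altitude in feet


# Placeholder thresholds for this sketch, not values taken from ERAM.
HORIZONTAL_MIN_NM = 5.0
VERTICAL_MIN_FT = 1000.0


def in_conflict(a, b, use_altitude=True):
    """Flag a conflict when both horizontal and vertical separation are too small."""
    horizontal = hypot(a.x_nm - b.x_nm, a.y_nm - b.y_nm)
    if horizontal >= HORIZONTAL_MIN_NM:
        return False
    if not use_altitude:
        # Without usable altitude, horizontal proximity alone looks like a conflict.
        return True
    return abs(a.altitude_ft - b.altitude_ft) < VERTICAL_MIN_FT


u2 = Track(x_nm=0.0, y_nm=0.0, altitude_ft=60_000)
airliner = Track(x_nm=2.0, y_nm=1.0, altitude_ft=35_000)

print(in_conflict(u2, airliner))                      # altitude considered
print(in_conflict(u2, airliner, use_altitude=False))  # altitude ignored
```

With altitude taken into account the two tracks are well separated; once altitude is ignored, every airliner passing under the U-2’s path would appear to need rerouting.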

It took several hours to sort out the problem, after which the Federal Aviation Administration was able to implement a short-term fix and get the ERAM system back online.  The ERAM system is being evaluated to determine whether any other fixes are needed to ensure that a similar problem doesn’t occur again.  It’s also worth noting that ERAM is a relatively new system (implementation began in 2002) that is replacing the obsolete 1970s-era hardware and software system that had been in place previously.  Hopefully there won’t be many more growing pains with the changeover to a new air traffic control system.

To see a high level Cause Map of this problem, click on “Download PDF” above.

1990 Cascading Long Distance Failure

By ThinkReliability Staff

On January 15, 1990, a cascading failure left tens of thousands of people in the Northeast US without long distance service for up to 9 hours.  This resulted in over 50 million calls being blocked at an estimated loss of $60 million.  (Remember, there weren’t really any other ways to quickly connect outside of the immediate area at the time.)

We can examine this historical incident in a Cause Map, or visual root cause analysis, to demonstrate what went wrong and what was done to fix the problem.  First, we begin with the impact to the goals.  No impacts to the safety, environmental, or property goals were discussed in the resources I used, but it is possible they were impacted, so we’ll leave those as unknown.  The customer service and production goals were clearly impacted by the loss of service, which was considerable and estimated to cost $60 million, not including time for troubleshooting and repairs.

Asking “Why” questions allows development of the cause-and-effect relationships that led to the impacted goals.  In this case, the outage was due to a cascading switch failure: 114 switches crashed and rebooted over and over again.  A switch would crash upon receiving a message from a neighboring switch.  This message was meant to inform other switches that one switch was busy to ensure messages were routed elsewhere.  (A Process Map demonstrating how long distance calls were connected is included on the downloadable PDF.)  Unfortunately, instead of allowing the call to be redirected, the message caused the receiving switch to crash.  This occurred when an errant line in the coding of the process allowed optional tasks to overwrite crucial communication data.  The error was included in a software upgrade designed to increase throughput of messages.
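The cascade described above can be illustrated with a toy Python simulation.  This is a deliberately simplified model and not the actual switch software: the six-switch ring, the round-based timing, and the `simulate` function are all assumptions made for this sketch.  Each round, every crashed switch reboots and sends its status message to its neighbors; with the bug present, each neighbor that receives the message crashes in turn.

```python
def simulate(neighbors, first_failure, buggy, rounds):
    """Return how many switches are down at the start of each round."""
    crashed = {first_failure}
    history = []
    for _ in range(rounds):
        history.append(len(crashed))
        # Every crashed switch reboots and notifies its neighbors.
        notified = set()
        for switch in crashed:
            notified.update(neighbors[switch])
        # With the bug, each notified switch crashes; without it,
        # the message is handled normally and calls are rerouted.
        crashed = notified if buggy else set()
    return history


# Hypothetical network: six switches connected in a ring.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

print(simulate(ring, first_failure=0, buggy=True, rounds=5))
print(simulate(ring, first_failure=0, buggy=False, rounds=5))
```

In this model a single failure without the bug clears after one round, while the buggy version never settles: the crash keeps bouncing between neighbors, which mirrors the repeated crash-and-reboot behavior described above.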

It’s not entirely clear how the error (one added line of code that would bring down a huge portion of the long distance network) was released.  The line appears to have been added after testing was complete, during a busy holiday season.  That a line of code was added after testing seems to indicate that the release process wasn’t followed.

In this case, a solution needed to be found quickly.  The upgraded software was pulled and replaced with the previous version.  Presumably testing and release practices were tightened afterward, because a problem of this magnitude has rarely been seen since.

To view the Outline, Cause Map and Process Map, please click “Download PDF” above.  Or click here to read more.

Hundreds Die When South Korean Ferry Capsizes

By ThinkReliability Staff

The nation of South Korea was devastated after a ferry capsized off Byungpoong on April 16, 2014.  While the ferry tipped over and sank quickly (within two hours), the evacuation orders came slowly (a half-hour after the first distress call).  The combination resulted in over 300 people being trapped within the ship and killed.  The Captain and much of the crew were able to escape.

There are a multitude of causes involved in this tragedy, which can be captured within a Cause Map.  A Cause Map visually develops the cause-and-effect relationships that led to the impacted organizational goals.

Clearly, the safety goal in this case was impacted, due to the large number of deaths (at the time of this blog, 226 bodies had been found and 73 people were still missing).  In addition, legal action is being taken against the captain and the members of the crew responsible for navigation, for negligence and failure to assist passengers.  The Captain has also been arrested for “undertaking an excessive change of course without slowing down”.  The loss of the ship can be considered an impact to the property goal, and the massive rescue and recovery operations are an impact to the labor/time goal.

By asking why questions, the cause-and-effect relationships are developed.  Most of the deaths resulted from passengers drowning when they were trapped in the ship as it capsized and sank.  The ferry capsized because of a sharp turn and stability issues.  The ship was turned too quickly at excess speed, possibly because the third mate in charge of navigation was inexperienced (this was her first time) and because of steering gear issues, which were reported two weeks prior to the accident and apparently not fixed.  The ship had been recently modified to add more passenger cabins, which made it top heavy.  As a result of the modifications, the recommended cargo weight was reduced.  The ship was carrying three times the cargo weight recommended at the time of the accident.

Passengers became trapped in the ferry prior to the evacuation order, which was issued thirty minutes after the first distress call (and which it appears not all passengers were able to hear).  During this time, the ship had listed to a point that made it impossible to get out.  The Captain was concerned about the safety of his passengers in the water and appears to have called the parent company to request permission to evacuate.  Additionally, the ship’s life rafts could not be used.  Photos show crew members being unable to release life rafts.  Only 2 of the 46 life rafts on the ship were successfully deployed.  Lastly, the crew provided insufficient assistance, abandoning ship without making necessary efforts to free the passengers.

This tragic incident has been compared to the Titanic (due to the insufficient number of lifeboats and people being unable to leave the ship), the Valdez oil spill (because an inexperienced third mate was performing navigation while the Captain was in his cabin), and the Costa Concordia (when the Captain left the ship without supervising the evacuation effort).  As long as lessons from other organizations (and even industries) are not understood by those performing similar work, these tragedies will continue to happen.

To view the Outline and Cause Map, please click “Download PDF” above.  Or click here to read more.

Why Can’t the Missing Malaysia Airliner be Found?

By Holly Maher

On March 8, 2014, Malaysia Airlines Flight MH370 took off from Kuala Lumpur heading for Beijing, China.  The aircraft had 239 passengers and crew aboard.  Less than 1 hour into the flight, communication and radar contact with the aircraft were lost.  Forty-nine days later, the location and fate of the aircraft are still unknown despite a massive international effort to locate the missing airliner.  The search effort has dominated the news for the last month and the question is still out there: how, with today’s technology, can an entire aircraft go missing?

Since we may never know what happened to flight MH370, this analysis is intended to understand why we can’t find it and identify the causes required to produce this effect.  This will allow us to identify many possible solutions for preventing it from happening again.  We start by asking “why” questions and documenting the answers to visually lay out all the causes that contributed to this incident.  The cause-and-effect relationships lay out from left to right.

In this example, the Customer Service Goal is impacted because we are missing 239 passengers and crew.  This is caused by the fact that we can’t locate Malaysia Airlines Flight MH370.  The inability to locate the airliner is a result of a number of causes over the 49 day period.  One reason is that 3 days were initially spent looking in the wrong location, along the original flight path from Kuala Lumpur to Beijing, in the Gulf of Thailand and the South China Sea.  The reason 3 days were mistakenly spent looking in this location is that the airliner had left the original flight path and officials were unaware of that fact.  Why the aircraft left the original flight path is still unknown, but we can look at some of the causes that allowed the flight to leave the original flight path undetected.

One of the reasons the aircraft was able to leave the original flight path undetected was that air traffic control was unable to track the airplane with radar. The transponder onboard the aircraft, which allows the ground control to track the aircraft using airspeed and altitude, was turned off less than one hour into the flight.  We don’t know the reason the transponder was turned off; however, the fact that it is designed to be turned off manually is a cause of the transponder being turned off.  It is designed to be manually turned off to reduce risk in the event of failure or fire, and to reduce radio traffic when the airplane is on the ground.  After 9/11, when 3 out of the 4 hijacked airplanes had transponders that had been turned off, the airline industry debated the manual on/off design of the transponder, but aviation experts strongly supported the need for the pilots to be able to turn off the transponders, as needed, for the safety of the flight.

Another reason the aircraft left the original flight path undetected was that the flight crew outside the cockpit did not communicate distress or a change of route.  This is because all communications from the airplane come from or through the cockpit.  The aircraft is not currently equipped to allow for communication, specifically distress communications, from outside the cockpit.

Days into the investigation, radar data was identified which showed the change of course of the aircraft.  This changed the area of the search away from the original flight path.  However, this radar detection was not identified in real time, as the plane was moving away from the original flight path.  This is also a cause of the aircraft being able to leave the flight path undetected.

Once the search area moved west, the size of the potential search area was incredibly large, another cause of being unable to locate the aircraft.  At its largest, the search area was 2.96 million square miles.  This was based on an analysis of how far the flight could have gotten with the amount of fuel on board.  Further analysis of satellite data, or “handshakes” with the computer framework on board the aircraft, continued to refine the search area.

Many people have asked why no one on the flight made cell phone calls indicating distress (if this was an act of terrorism).  The reason no cell phone calls were made is that cell phones do not work above 2,000 ft, because there is no direct line to a cellular tower.

Another cause of being unable to locate MH370 is being unable to locate the black box.  The black box is made of aluminum and is very heavy, designed to withstand significant forces in the event of a crash.  This causes the black box to sink, instead of float, making it difficult to locate.  The depth of the ocean in which the search is occurring ranges from 4,000 to 23,000 ft, adding to the difficulty of finding the black box.  Acoustic pings were last detected from the black box on April 8, 2014, 32 days into the search.  This is because the battery life on the black box is ~30 days.  This had been the battery design life criterion prior to the Air France Flight 447 crash in 2009.  It took over 2 years to locate the black box and wreckage from Flight 447, so the design criterion for the black box battery life was changed from 30 days to 90 days.  This would allow search crews more time to locate the black box.  Malaysia Airlines Flight MH370 still had a black box with a battery life of 30 days.

Once the analysis has broken down the incident into its causes, solutions can be identified to mitigate the risk of a similar incident in the future.

To view the Outline and Cause Map, please click “Download PDF” above.  Or click here to read more.

Nearly 2.6 Million GM Vehicles Recalled, Costs Soar to $1.3 Billion

By Kim Smiley

During the first quarter of 2014, General Motors (GM) recalled 2.6 million vehicles due to ignition switch issues tied to at least 13 deaths.  Costs associated with the issue are estimated to be around $1.3 billion.  Possibly even more damaging to the long-term health of the company is the beating its reputation has taken.

The ignition switch issues are caused by a small, inexpensive part called a switch detent plunger.  An ignition switch has four main positions (off, accessories, on, and start) and the switch detent plunger holds the ignition switch in position.  In the accidents associated with the recent recall, the ignition switch slipped out of the on position and into the accessories position because the switch detent plunger didn’t provide enough torque to hold it in place.  When the ignition is put into the accessories mode, the car loses both power steering and power braking, and the air bags won’t inflate.  It’s easy to see how a situation that makes a car less safe and more difficult to control can quickly create a dangerous, or even deadly, situation.  Additionally, it’s important to know the problem is most likely to occur when driving on a bumpy surface or if a heavy key ring is pulling on the key.

The other key element of this issue is how the problem has been handled by GM.  There are a lot of hard questions being asked about what was known about the problem and when it was known.  It is known that the faulty part was redesigned in 2006 to address the problem, but the new design of the part wasn’t given a new part number as would normally be done.  Multiple federal inquiries are working to determine when it was known that the faulty parts posed a danger to drivers and why there was such a long delay before a recall was done.  The fact that the redesigned part wasn’t assigned a new part number has also led to questions about whether there was an attempt to cover up the issue.  GM is not civilly liable for deaths and injuries associated with the faulty ignition switches because of its 2009 bankruptcy, but the company could potentially be found criminally liable.

No company ever wants to recall a product, but it’s important to remember that how the recall is handled is just as important as getting the technical details right.  Consumers need to believe that a company will do the right thing and that any safety concerns will quickly and openly be addressed.  Once consumers lose faith in a company’s integrity the cost will be far greater than the price of a recall.

If you drive a GM car, you can get more information about the recall here.  The recalled models are Chevrolet Cobalts and Pontiac G5s from the 2005 through 2007 model years; Saturn Ion compacts from 2003 through 2007; and Chevrolet HHR SUVs, and Pontiac Solstice and Saturn Sky sports cars from 2006 and 2007.

To view the Outline and Cause Map showing the root cause analysis of this issue, please click “Download PDF” above.  Or click here to read more.

Chicago O’Hare Commuter Train Derailment Injures 33

By Sarah Wrenn

At 2:49 AM on March 24, 2014, a Blue Line commuter train entered the Chicago-O’Hare International Airport Station, collided with the track bumper post, and derailed, landing on an escalator and stairway.  Thirty-two passengers and the train operator were injured and transported to nearby hospitals.  Images showing the lead rail car perched on the escalator look like the train was involved in filming an action movie.

So what caused a Chicago Transit Authority (CTA) train, part of the nation’s second largest public transportation system, to derail?  We can use the Cause Mapping process to analyze this specific incident with the following three steps: 1) Define the problem, 2) Conduct the analysis and 3) Identify the best solutions.

We start by defining the problem.  In the problem outline, you’ll notice we’ve asked four questions: What is the problem? When did it happen? Where did it happen? And how did it impact the goals?

Next we’ll analyze the incident.  We start with the impacted goals and begin asking “why” questions while documenting the answers to visually lay out all the causes that contributed to the incident.  The cause and effect relationships lay out from left to right.  As can be seen in the problem outline, this incident resulted in multiple goals being impacted.

In this incident, 33 people were injured when the train they were riding derailed in the O’Hare station, thereby affecting our safety goal of zero injuries.  The injuries were caused by the train derailing, so let’s dig in to why the train derailed.  Let’s first ask why the train operator was unable to stop the train.  Operator statements are crucial to understanding exactly what happened.  Here, it is important to avoid blame by asking questions about the process followed by the operator.  Interestingly, 45 seconds before the crash, the operator manually reduced the train speed.  However, at some point, the train operator dozed off.  The train operator’s schedule (working nearly 60 hours the previous week), length of shift, and time off are all possible causes of the lack of rest.  Evidence that the operator was coming off of an 18 hour break allows us to eliminate insufficient time off between shifts as a cause.  In addition, the train operator was relatively new (she qualified as a train operator in January 2014) and was also an “extra-board” employee, meaning she substituted for other train operators who were out sick or on vacation.

Next, let’s ask why the train was unable to stop.  An automatic braking system is installed at this station and the system activated when the train crossed the fixed trip stop.  The train was unable to stop because there was insufficient stopping distance for the train’s speed.  At the location of the trip stop, the train speed limit was 25 mph and the train was traveling 26 mph.  While the emergency braking system functioned correctly, the limited distance and the speed of the train did not allow the train to stop.
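To see how speed affects stopping distance, here is a minimal Python sketch using the textbook constant-deceleration relationship, distance = speed² / (2 × deceleration).  The deceleration value is a hypothetical placeholder, since the train’s actual braking rate and the distance available past the trip stop were not given in the reports; only the 25 mph and 26 mph speeds come from the post.

```python
MPH_TO_MPS = 0.44704  # miles per hour to metres per second


def braking_distance_m(speed_mph, decel_mps2):
    """Distance needed to stop from a given speed at constant deceleration."""
    speed_mps = speed_mph * MPH_TO_MPS
    return speed_mps ** 2 / (2 * decel_mps2)


ASSUMED_DECEL = 1.5  # m/s^2, hypothetical value for illustration only

at_limit = braking_distance_m(25, ASSUMED_DECEL)  # speed limit at the trip stop
actual = braking_distance_m(26, ASSUMED_DECEL)    # reported train speed
ratio = actual / at_limit

print(f"Stopping distance at 26 mph is {ratio:.4f} times that at 25 mph")
```

Because stopping distance grows with the square of speed, the ratio does not depend on the assumed deceleration.  It also suggests why the available stopping distance mattered so much here, and why the solutions included both a lower speed limit and relocated trip stops.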

The train derailing impacted multiple organizational goals, but also the personal goal of the train operator, who was fired.  During the investigation, we learned that the train operator failed to appear at a disciplinary hearing and had a previous safety violation in which she dozed off and overshot a station.  These details reveal themselves on the Cause Map by asking why questions.

The final step of the investigation is to use the Cause Map to identify and select the best solutions that will reduce the risk of the incident recurring.  On April 4, 2014, the CTA announced proposed changes to the train operator scheduling policy.  In addition, the CTA changed the speed limit when entering a station and moved the trip stops to increase the stopping distance.  Each of these identified solutions reduces the risk of a future incident by addressing many of the causes identified during the investigation.