Stuck in the Chunnel

By Kim Smiley

High-speed train service in the Channel Tunnel (connecting Britain, France and Belgium) resumed partially on Tuesday, December 22, 2009 after a complete stoppage that began Friday, December 18th when five trains failed inside the tunnel.

Eurostar, the operator of the trains, has stated that the train failures were caused by an electrical failure due to condensation, which resulted from snow getting past the snow screens protecting the engine combined with higher temperatures inside the tunnel than outside. Unseasonably cold weather is believed to have produced finer, lighter snow than usual, which was able to pass through the screens.

There was a delay in rescuing the trapped passengers in the tunnel – some of whom were trapped for up to 18 hours. Responsibility for rescue lies with both the train operator and the tunnel operator, and the process for rescue obviously needs to be reviewed by both parties to determine a better course of action the next time a rescue plan is needed. Additionally, the train operator will want to review its policies based on the reports of abysmal customer service throughout the event.

Eurostar, British Rail Class 373 at St Pancras railway station by Oxyman (11/23/07)

Eurostar took immediate action to install finer filters on the engine intakes and trains were put back into service on Tuesday, the 22nd. The company has also stated it will reimburse passengers for the delay, but this solution will take longer to implement.

The entire root cause analysis investigation so far is shown on the downloadable PDF. A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page. Even more detail can be added to this Cause Map as the analysis continues. As with any investigation the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals. To view the PDF, click on “Download PDF” above.

Toyota Recall: Problems, Interim Solutions and Permanent Solutions

By Kim Smiley

On September 29, 2009, Toyota/Lexus issued a safety advisory that some 2004-2010 model year vehicles could be prone to a rapid acceleration issue if the floor mat moved out of place and jammed the accelerator pedal. Although the recall is only applicable in the U.S. and Canada because of the type of floor mats used, over 4 million vehicles are affected by the recall.

Although all the solutions to this problem have not yet been implemented, we can look at the issue so far in a Cause Map, or visual root cause analysis. First we define the problem. Here we could consider the problem the recall, or the acceleration problems. We can list all the models and years that are affected by the recall, and that the recall is limited to the U.S. and Canada.

We define the problem with respect to the organization’s goals. There have been at least 5 fatalities addressed by the National Highway Traffic Safety Administration (NHTSA), though some media outlets have reported more. Additionally, the NHTSA has reported 17 accidents (again, some claim more) and has received at least 100 complaints. The fatalities and accidents are impacts to the safety goal. Complaints are impacts to the customer service goal. The recall of more than 4 million cars is an impact to the production/schedule goal, and the replacement of the accelerator pedals and floor mats as a result of the recall is estimated to cost $250 million, which is an impact to the property goal.

Once we’ve completed the outline, we can begin the Cause Map, or the analysis step of the process. The fatalities are caused by vehicle crashes resulting from a loss of control of the vehicle. The loss of control is caused by a sudden surge of acceleration, inability to brake, and sometimes an inability to shut down the engine of the car. Toyota says the sudden bursts of acceleration are caused by entrapment of the accelerator pedal due to interference from floor mats. Toyota rejects the possibility that there may be a malfunction in the electronic control system, saying it’s been ruled out by Toyota research.

The vehicles are unable to brake because the brake is non-functional when the accelerator pedal is engaged, as it is in these cases. Additionally, owners whose models are equipped with keyless ignition cannot quickly turn off their ignition. These models require the ignition button to be pressed for 3 seconds to prevent inadvertent engine stops, and the instructions are not posted on the dashboard, so owners who weren’t meticulous about reading (or remembering) instructions from the owners’ manual may not know how to turn off the car while moving at high speeds.

When the Cause Map is complete to a sufficient level of detail, it’s time to explore some solutions. In this case, the permanent solutions (which will reduce the risk of these accidents most significantly) to be implemented by Toyota are to reconfigure the accelerator pedal, replace the floor mats, and install a brake override system which will allow the brakes to function even with the accelerator pedal engaged. However, designing and implementing these changes for more than 4 million cars will take some time, so owners of Toyotas require interim solutions. Interim solutions are those that do not sufficiently reduce the risk for long-term applicability but can be used as a stop-gap until permanent solutions are put in place. In this case, Toyota has asked owners to remove floor mats, and has put out guidance that drivers who are in an uncontrolled acceleration situation should shift the transmission into neutral, which will disengage the engine and allow the brake to stop the car.

View the high level summary of the investigation by clicking “Download PDF” above.

Learn more about the recall at the NHTSA website.

Airlink Incidents: Viewing Trends in Visual Form

By ThinkReliability Staff

Over the past three months, South Africa’s Airlink airline has had four incidents, ranging from embarrassing to fatal. Four similar incidents such as these start to point out a trend, which should be investigated to improve processes and increase safety. But how do we start the investigation?

In the Cause Mapping root cause analysis method, we begin by defining the problem. Here we can define four problems, which are the four incidents over the last three months. We can look at one incident at a time in a problem outline, the first step of the Cause Mapping process. We’ll start with the earliest incident first.

On September 24, 2009 at approximately 8 a.m. a Jetstream 41 crashed into a school yard in Durban Bluff just after take-off from Durban International Airport. This was a forced landing necessitated by the loss of an engine. The pilot was killed. There were also two serious injuries of the crew, and a minor injury of a person on the ground. There were no passengers on the plane, and the impact to Airlink’s schedule is unclear. However, the plane was lost.

We can capture this information more clearly and succinctly in an outline. For example, the above paragraph has more than 80 words. The outline, which records the same information, uses only 42 words in an easily understandable visual form. (The outline for all four incidents can be viewed by clicking on “Download PDF” above.)

The second incident: On November 18, 2009 at 1:30 p.m. a BAE Systems Jetstream 41 aborted take-off for East London and slid off the runway at Port Elizabeth airport. There were high velocity cross winds, and the pilot may have been unable to establish directional control. There were no injuries, no environmental impact and damages to the plane are unknown. However, new travel arrangements had to be made by the airline for all the passengers. The frequency of Airlink incidents is now two in eight weeks. (Over 80 words; the outline has 49 words.)

The third incident: On November 24, 2009 at approximately 8 a.m. a flight en route to Harare carrying a Prime Minister was forced to return to Johannesburg Airport after it experienced a technical fault. There were no injuries, but it caused a delay in the Prime Minister’s schedule. The damage to the airplane is unclear. The frequency of Airlink incidents is now three in two months. (Over 60 words; the outline has 33 words.)

The fourth incident: On December 7, 2009 at approximately 11 a.m. an SA Airlink Embraer 135 regional commuter jet hydroplaned and overshot the runway while landing at George Airport during rainy weather. There were five injuries, including a sprained ankle. This incident has led to a poor public perception of the airline and increased supervision from the authorities. We do not have a dollar amount on the property damage. The frequency of Airlink incidents is now four in ten weeks. (Over 70 words; the outline has 42 words.)
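The running frequency quoted after each incident can be checked with a little date arithmetic. The sketch below is only an illustration: the dates come from the four descriptions above, while the function name and the output format are assumptions made for this example.

```python
from datetime import date

# Incident dates taken from the four descriptions above.
incident_dates = [
    date(2009, 9, 24),
    date(2009, 11, 18),
    date(2009, 11, 24),
    date(2009, 12, 7),
]


def running_frequency(dates):
    """Return (incident count, days elapsed since the first incident) pairs."""
    ordered = sorted(dates)
    first = ordered[0]
    return [(i + 1, (d - first).days) for i, d in enumerate(ordered)]


for count, days in running_frequency(incident_dates):
    print(f"{count} incident(s) in {days} days (about {days / 7:.1f} weeks)")
```

Counting from the first incident, the second falls roughly eight weeks later and the fourth a little over ten weeks later, in line with the figures given above.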

In addition to the increased brevity of the outline, it provides an easy visual comparison of the four incidents by showing them in a similar visual form. On one page, we can show the timeline and the outlines of the four incidents for easy comparison. This is especially useful as a briefing tool for busy managers.

Another Train Collision for the Washington D.C. Metro

By ThinkReliability Staff

In the early morning hours of Sunday, November 29th, after the Washington D.C. Metro shut down for the night, train 902 pulled into the West Falls Church station for cleaning. However, instead of stopping just behind the parked train already on the tracks, it rammed into it.

We can put this incident into a Cause Map, or a visual form of root cause analysis. A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page. The first step in the Cause Mapping process is to outline the problem. After entering the “what, when and where” we frame the incident with respect to the Washington Transit Authority’s goals.

The operator, plus two other employees who were on the parked car cleaning, suffered minor injuries. This is an impact to the safety goal. The train cars, however, suffered extensive damage. Three of the cars will have to be replaced (at a cost of $3 million per car) and the extent of the damage to the other 9 cars involved is unclear. These are both impacts to the property goal. There may have been other goals that were impacted, but these are the main concerns.

West Falls Church-VT/UVA station, photographed by Ben Schumin on July 28, 2001

The second step of the Cause Mapping process is the Cause Map itself, or the analysis of the problem. To fill out the Cause Map, we begin with the goals that were impacted and ask why questions. The injuries and damage were caused by the parked train being struck by a moving train. The moving train was not stopped in time because the automatic train control system was not on (it’s not used in the railyard) and the speed suddenly increased, OR the operator wasn’t paying attention. (We don’t know yet, at this point of the investigation.)

Another train operator has come forward to say that this type of car suffers from power surges at low speeds (such as speeds used in the rail yard), which could have caused the speed to suddenly increase. We add this information to the map, and also add an evidence box showing where the information came from. This can be invaluable when sorting through a lot of information.

Although it is known that the operator had surpassed a ten-hour shift, it’s not known if fatigue or other causes of inattentiveness were involved. A union representative has asserted that the training program was unsatisfactory, which may have also played a part. As the National Transportation Safety Board (NTSB) and the Transit Authority continue their investigation, more detail can be added to this Cause Map. As with any investigation, the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals.

Barge Grounds Off Virginia Beach

By ThinkReliability Staff

At approximately 11:00 p.m. on October 12th, 2009, the two 500,000 lb strength towlines connecting La Prinsesa barge to its tug broke free. The tug was unable to recapture the barge, which drifted for about seven hours in heavy seas caused by a wind-driven rain storm before grounding at Sandbridge Beach in Virginia, just shy of the Sandbridge pier.

So far the 84 hazardous material (HAZMAT) loads the barge was carrying appear to be intact.   There were no injuries, as the barge was unmanned.  Damage to the ship is not known at this time.   However, the incident had the potential to cause injuries, a HAZMAT spill that could have led to an evacuation, and far more damage to the ship and the beach.  The incident did lead to the loss of the towlines, which are valued at approximately $70,000 and a delay in the barge’s arrival.

It’s unclear what caused the towlines to break free.  Initial solutions are to clear the area and ballast the tug to attempt to keep it from drifting.  On November 17, the barge began being towed to open waters where the cargo can be off-loaded safely. However, long-term solutions that would prevent another incident of this type will only be determined after the causes of the issue are determined.

Click on “Download PDF” to view a PDF showing the root cause analysis investigation based on what is known  so far.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.  Even more detail can be added to the Cause Map as the analysis continues. As with any investigation the level of detail in the analysis is based on the magnitude of the impacts (or potential impacts).

How to Determine Your Organization’s Goals

By ThinkReliability Staff

The first step of the Cause Mapping strategy of root cause analysis is to define the problem with respect to the organization’s goals. In order to do this, you need to know what an organization’s goals are. While we provide Cause Mapping root cause analysis templates that will give you an idea of where to start, your organization may wish to personalize its investigations so that they correspond to its particular goals.

To define your organization’s goals, try to imagine a perfect day for your organization.  No matter what industry you’re in, that perfect day doesn’t include anyone getting hurt or killed.  This is the safety goal.  However, if your organization regularly is responsible for the health and welfare of people other than your employees, you may wish to have more than one category of safety.  For example, a hospital may have both “patient safety” and “employee safety” goals.  A public school may have “student safety” and “employee safety” goals.

Another goal generally common to all industries is the goal of not impacting the environment.   However, some industries have a base level of environmental impact, so their goal might be to not surpass that level rather than having no impact.  Environmental impacts usually result from leaks or spills of any material other than water, but may also result from improper storage or disposal of hazardous material.

Some organizations may have as a goal to meet regulatory requirements.  If an organization has an OSHA (Occupational Safety and Health Administration) reportable injury, this is an impact to the “Regulatory Compliance” goal.  Organizations may also have a “Compliance” goal if they are subject to another governing body, such as a trade group or an external accreditation.

Organizations usually exist to provide products, services, or both. If an organization provides products, a goal of that organization may be to get a set amount of products produced and delivered on a certain schedule. We call this the “Production/Schedule” goal. An organization that provides services wants to ensure that its customers are satisfied with the services it provides. This is the “customer service” goal. Many organizations will use both goals to define a problem.

Another area of concern for almost all organizations is cost.  An incident that requires additional labor, rework, or lost product results in unplanned costs for the organization.  We call this goal the “material and labor goal”.  If an incident results in many costs, it’s possible to itemize them within the problem outline.  Quantifying all the costs associated with an incident can help prioritize which incidents require the most immediate attention.  It also provides a bound for the cost of solutions – installing a $100,000 machine to solve an infrequent $20,000 problem doesn’t make sense.  (Of course, for incidents that involve impacts that can’t be easily quantified – human safety, regulatory requirements, customer service, etc.  – these impacts must be considered above and beyond the “cost” of the incident.)
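The cost bound described above can be sketched as a simple comparison. This is a minimal illustration rather than part of the Cause Mapping method itself: the function name, its parameters, and the reading of "infrequent" as a single expected occurrence are all assumptions made for this example.

```python
def solution_is_justified(solution_cost, incident_cost, expected_incidents):
    """Compare a solution's cost with the quantifiable incident cost it should avoid.

    Covers only costs that can be expressed in dollars; safety, regulatory and
    customer service impacts have to be weighed separately.
    """
    avoided_cost = incident_cost * expected_incidents
    return solution_cost <= avoided_cost


# The example from the text: a $100,000 machine for an infrequent $20,000 problem.
# Assumption: "infrequent" means one occurrence over the period being considered.
print(solution_is_justified(100_000, 20_000, expected_incidents=1))
```

If the same $20,000 problem were expected to recur often enough, the comparison would come out the other way, which is why itemizing and quantifying the costs in the outline matters.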

Once you’ve determined all of the goals that are meaningful to your organization, you’re ready to make an outline for the first step of the Cause Mapping method of root cause analysis – define the problem.  But what order do you put the goals in?  Generally, the goals go in order from most to least important.  The safety goal is almost always at the top.  Your organization’s mission statement is an excellent resource to determine the order of the goals.  Ideally, they’ll follow along with your mission statement, with any goals not specifically called out (such as the “material and labor” goal) listed below.  It’s also possible to use a different order so that the biggest impacts from an incident are listed at the top.  However, your organization may prefer to always use the same order for consistency.

If an incident resulted in no impact to one of your organization’s goals, don’t delete the goal from the problem outline.  Instead, write “N/A” next to the goal.  That way, it’s clear that the goal was considered but it was determined that there was no impact.  Deleting the goal may lead others to believe that it’s no longer a goal of the organization!
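One way to make sure a goal is never silently dropped is to build every outline from a fixed list of goals. The sketch below assumes a hypothetical organization: the goal names come from this post, but their order, the function name, and the sample incident are invented for illustration.

```python
# Goals in the order this hypothetical organization has chosen to list them.
GOALS = [
    "Safety",
    "Environmental",
    "Regulatory Compliance",
    "Customer Service",
    "Production/Schedule",
    "Material and Labor",
]


def build_outline(impacts):
    """Return (goal, impact) pairs for every goal, with "N/A" where there was no impact."""
    unknown = set(impacts) - set(GOALS)
    if unknown:
        raise ValueError(f"Impacts recorded for unknown goals: {sorted(unknown)}")
    return [(goal, impacts.get(goal, "N/A")) for goal in GOALS]


# Hypothetical incident with impacts to only two of the goals.
outline = build_outline({
    "Safety": "One minor injury",
    "Material and Labor": "$20,000 in rework",
})
for goal, impact in outline:
    print(f"{goal}: {impact}")
```

Because the list of goals is fixed, an incident with no environmental impact still shows "Environmental: N/A" rather than omitting the line.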

Check out our examples to see a problem definition in action!

ThinkReliability has specialists who can solve all types of problems. We investigate errors, defects, failures, losses, outages and incidents in a wide variety of industries.  Contact us for investigation services and root cause analysis training.

Damage to the San Francisco-Oakland Bay Bridge (Again)

By ThinkReliability Staff

In a previous blog, I wrote about the impressively quick repairs to the San Francisco-Oakland Bay Bridge. These repairs allowed the heavily-traveled bridge to reopen only an hour and a half later than scheduled, despite the unexpected discovery of a cracked eyebar during the scheduled repairs.

However, during evening rush hour on October 27, less than 2 months after the eyebar repair had been completed, two metal rods and a 5,000 pound metal beam fell onto the roadway.  The items that fell were part of the previous repair, which was supposed to have lasted until the new bridge opened in 2013. Although only one motorist was injured, other injuries or even fatalities were possible, and the damage to the bridge necessitated repairs and closing the transportation route for 280,000 cars a day for more than 5 days.

The “cause” given for the failure of one of the rods (which snapped, leading to the falling of the other rod and the beam) was fatigue caused by high (over 30 mile per hour) winds. However, an adequate repair should have been able to withstand 2 months of traffic and 30 mile per hour winds, so the rod failure must have been caused by the combination of the high winds and an inadequate repair.

Given the speed with which the repair was completed (see our previous blog), it’s possible that the repair job was rushed.  Additionally, the Federal Highway Administration did not inspect the bridge after the repairs were completed, instead relying on state inspection reports.  Had another agency inspected the repairs, it’s possible the problems with the repair would have been noticed and fixed before the bridge was re-opened.

A summary of the investigation to date can be found on the downloadable PDF.  (To open, click on “Download PDF” above.)  The investigation includes a timeline, which can aid in the understanding of this issue, the problem outline, and the Cause Map (visual root cause analysis).  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.  As with any investigation, as more information becomes known, more detail can be added to the Cause Map.

Missed by 150 Miles?

By Kim Smiley

On October 21, 2009, Northwest Airlines Flight 188 left San Diego and overshot its intended destination, Minneapolis-St. Paul, by about 150 miles. Luckily, the incident resulted in a safe landing at the intended destination, but the circumstances surrounding the flight remain vague and unsettling.

One of the strangest facts that have come out is that the plane lost contact with air-traffic controllers for one hour and 18 minutes.  In the post 9/11 aviation environment, controllers are very sensitive to planes that quit responding to communications.  The Federal Aviation Administration had contacted military authorities about the possibility of terrorism.  Fighter jets were ready to respond and prepared to intercept the plane if necessary.

So what happened?  How did the pilots overshoot the airport by such a significant amount without realizing their mistake?

Initial reports were that the pilots stated that they were in a heated discussion and simply lost situational awareness, but many aviation experts have stated it is unlikely that pilots would miss repeated hails for over an hour because of an argument.  Other reports have speculated that maybe both pilots fell asleep.

The most recent information to come out is that the pilots were using their laptops during the time they failed to respond to hails.  The pilots stated that they both were working on laptops and that they were discussing monthly flight crew scheduling.

Details concerning the overshoot are still being investigated, but an initial root cause analysis can be started to help document the investigation as it progresses.  This is what an Outline could look like at this stage:

A preliminary Cause Map can be started at this stage of an investigation. As more information becomes known, a detailed Cause Map can be built to document all the relevant information.

More data should be available soon.  The Cockpit Voice Recorder and the Flight Data Recorder have both been sent to the National Transportation Safety Board for analysis and interviews of all involved parties continue.

On October 28, it was announced that the FAA has revoked the licenses of the two pilots involved because they violated several federal regulations, including failing to comply with air traffic control instructions and operating carelessly and recklessly. There are currently no specific federal rules banning the use of laptops after the flight reaches 10,000 feet.

Genesis Spacecraft Crash

By Kim Smiley

The mission of the Genesis spacecraft was to collect the first samples of the solar wind and return the samples to earth to be analyzed. The goal was to provide fundamental data to help scientists determine the composition of the sun and learn more about the formation of our solar system.

Unfortunately, during descent on September 8, 2004, the Genesis crashed into the earth at high velocity. Its descent was only slowed by air resistance and the collection capsule was damaged on impact.

What happened? What went wrong with the re-entry?

A root cause analysis can be performed to evaluate this incident. The investigation can be documented by building a Cause Map that collects all the information associated with the incident in a visual format that is easy to follow.

In this case, the main goal we’ll consider is the production goal. The production goal was impacted because the collection capsule was damaged, which had the potential to destroy all the physical data collected during the three year mission.

The investigation can proceed by asking “why” questions and adding the causes to the Cause Map. In this scenario, the collection capsule was damaged because it impacted the earth at high velocity. This occurred because the parachute that was intended to slow the descent to allow for a midair recovery by helicopter failed to deploy.

Post-accident investigation determined that the parachute was never triggered to deploy because gravity switches were installed backwards. The backward installation occurred for several reasons: the design was flawed, the design review process didn’t detect the error and the testing performed didn’t detect the error.
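The chain of "why" questions above can also be written down as a small data structure. This is only a sketch of the idea, not the Cause Map shown in the PDF: the dictionary layout and the function name are assumptions, and the wording of each cause is paraphrased from the paragraphs above.

```python
# Each effect maps to the causes that produced it, as described in the text.
cause_map = {
    "Production goal impacted": ["Collection capsule damaged"],
    "Collection capsule damaged": ["Capsule hit the earth at high velocity"],
    "Capsule hit the earth at high velocity": ["Parachute failed to deploy"],
    "Parachute failed to deploy": ["Gravity switches installed backwards"],
    "Gravity switches installed backwards": [
        "Design was flawed",
        "Design review did not detect the error",
        "Testing did not detect the error",
    ],
}


def deepest_causes(effect):
    """Ask "why" repeatedly and return the causes with no further cause recorded."""
    causes = cause_map.get(effect, [])
    if not causes:
        return [effect]
    found = []
    for cause in causes:
        found.extend(deepest_causes(cause))
    return found


for cause in deepest_causes("Production goal impacted"):
    print(cause)
```

The causes printed are simply where the map currently stops. Asking "why" again for any of them (for example, why testing did not detect the error) would add another level of detail.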

Luckily, the impact to the production goal has been less significant than it might have been in this case. The collection capsule was cushioned somewhat by the soft ground and while desert dirt entered the capsule, liquid water did not. The solar wind particles were embedded in the collection materials and the contaminating dirt was able to be removed for the most part. NASA has been able to retrieve significant amounts of data from the mission.

NASA’s Mishap Report can be downloaded for free for additional information on the incident.

A one page PDF showing a high level Cause Map of the incident can be downloaded by clicking on the button above.

Sugar Dust Explosion

By ThinkReliability Staff

On February 7, 2008, an explosion at the sugar refinery in Port Wentworth, Georgia resulted in the deaths of 14 workers. It also injured 36 and caused significant damage to the refinery. Immediately following the incident, we began a very simple root cause analysis, leaving the more detailed analysis for when the Chemical Safety Board (CSB) report was released and more detailed information could be found. The CSB final draft report was recently issued and with the information it contains, we can add more detail to our Cause Map.

We can begin our analysis by beginning with a goal that was impacted and using the “5-whys” approach.  The 14 deaths and 36 injuries were caused by the propagation of secondary explosions and fire.  The secondary explosions and fires were caused by a primary explosion, which was caused by an explosive concentration of sugar dust, which was caused by inadequate housekeeping.

From here we can add more detail to our map.  For example, difficulty evacuating the plant was also a cause of the deaths and injuries.  The difficulty was caused by having no evacuation drills, and using cell phones and radios to communicate instead of an intercom or emergency alert system.

In order for the explosions to propagate, they needed additional fuel.  This was found in the accumulated sugar dust in open areas of the plant, due to inadequate housekeeping, and a dust removal system that was not functioning properly and had ducts filled with sugar dust.

Since “inadequate housekeeping” has now come up twice on our map, let’s expand on that a little.  There was a lack of awareness of the hazards of sugar dust.  The facility risk assessment did not address these hazards, there was very little training on dust hazards, and there was little regulatory oversight which might have created more awareness or cleanliness requirements.  OSHA’s hazardous dust safety standards were limited to grain, and the State of Georgia had no regulations addressing dust.  (Both of these issues are in the process of being fixed.)

Although the sugar dust accumulated due to lack of housekeeping, accumulation alone was not enough to reach explosive levels; the dust also had to be contained. This containment was provided by steel panels installed around the conveyor, which were designed to protect the sugar from contamination. An explosion also required an ignition source. Due to the extensive damage, the CSB was not able to pinpoint the ignition source.

The CSB identified several solutions that would mitigate the risk of future incidents. Some of these solutions are for Imperial Sugar to implement at this site, such as holding evacuation drills, increasing training on dust hazards, improving the housekeeping program, and installing (and using) an intercom system. As discussed above, OSHA and the State of Georgia are implementing standards and regulations to decrease the chances of a dust explosion in their jurisdictions. Also, the CSB has recommended that the company that performed the risk assessment at Imperial Sugar consider dust hazards as a risk.

Click on “Download PDF” above to see all the information discussed above in a visual form.

Learn more about dust explosions.