Tag Archives: root cause analysis

Early Problems with Mark 14 Torpedoes

By Kim Smiley

The problems with Mark 14 torpedoes at the start of World War II are a classic example of the importance of robust testing.  The Mark 14 design included brand new, carefully guarded technology and was developed during a time of economic austerity following the Great Depression.  The desire to minimize costs and to protect the new exploder design led to such a limited test program that not a single live-fire test with a production model was done prior to deploying the Mark 14.

The Mark 14 torpedo design was a step change in torpedo technology. The new Mark VI exploder was a magnetic exploder designed to detonate under a ship where there was little to no armor and where the damage would be greatest.  The new exploder was tested using specially instrumented test torpedoes, but never a standard torpedo. Unsurprisingly, given the lack of testing, the torpedoes routinely failed to function as designed once deployed.

The Mark 14 torpedoes tended to run too deep and often failed to detonate near the target. One of the problems was that the live torpedoes were heavier than the test torpedoes so they behaved differently. There were also issues with the torpedo’s depth sensor.  The pressure tap for the sensor was in the rear cone section where the measured pressure was substantially less than the hydrostatic pressure when the torpedo was traveling through the water.  This meant that the depth sensor read too shallow and resulted in the torpedo running at deeper depths than its set point.  Eventually the design of the torpedo was changed to move the depth sensor tap to the mid-body of the torpedo where the readings were more accurate.
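The depth error described above can be illustrated with a short calculation. This is only a sketch under stated assumptions: the depth controller is assumed to hold the indicated depth at the set point, and the 30 kPa pressure deficit at the tail cone and the 3 meter set depth are made-up illustrative values, not historical Mark 14 figures.

```python
RHO_SEAWATER = 1025.0  # kg/m^3, typical seawater density
G = 9.81               # m/s^2

def indicated_depth(actual_depth_m, pressure_deficit_pa):
    """Depth the sensor reports when its tap sees less than hydrostatic pressure."""
    hydrostatic_pa = RHO_SEAWATER * G * actual_depth_m
    measured_pa = hydrostatic_pa - pressure_deficit_pa
    return measured_pa / (RHO_SEAWATER * G)

def running_depth(set_depth_m, pressure_deficit_pa):
    """Actual depth at which the indicated depth equals the set point."""
    return set_depth_m + pressure_deficit_pa / (RHO_SEAWATER * G)

# Hypothetical values chosen only for illustration.
SET_DEPTH_M = 3.0
DEFICIT_PA = 30_000.0

actual = running_depth(SET_DEPTH_M, DEFICIT_PA)
print(f"set depth:       {SET_DEPTH_M:.2f} m")
print(f"actual depth:    {actual:.2f} m")
print(f"sensor reading:  {indicated_depth(actual, DEFICIT_PA):.2f} m")
```

With these assumed numbers the sensor reports the set depth while the torpedo actually runs about 3 meters deeper, which is the kind of error a live-fire test against a real target would expose.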

The Mark 14 design also had issues with premature explosions.  The magnetic exploder was intended to explode near a ship without actually contacting it.  It used small changes in the magnetic field to identify the location of a target. The magnetic exploder had been designed and tested at higher latitudes and it wasn’t as accurate closer to the equator where the earth’s magnetic field is slightly different.

In desperation, many crews disabled the magnetic exploder on Mark 14 torpedoes even before official orders to do so came in July 1943.  Use of the traditional contact exploder revealed yet another design flaw in the Mark 14 torpedoes.  A significant number of torpedoes failed to explode even during a direct hit on a target.  The conventional contact exploder that was initially used on the Mark 14 torpedo had been designed for earlier, slower torpedoes.  The firing pin sometimes missed the exploder cap in the faster Mark 14 design.

The early technical issues of the Mark 14 torpedoes were eventually fixed and the torpedo went on to play a major role in World War II.  Mark 14 torpedoes were used by the US Navy for nearly 40 years despite the early issues.  But there is no doubt that it would have been far more effective and less painful to identify the technical issues during testing rather than in the field during war time.  There are times when thorough testing may seem too expensive and time consuming, but having to fix a problem later is generally much more difficult.  No one wants to waste effort on unnecessary tests, but a reasonable test program that verifies performance under realistic conditions is almost always worth the investment.

To view a high level Cause Map of the early issues of the Mark 14 torpedoes, click “Download PDF”.

You can also learn more about the torpedoes by clicking here and here.

Deadly Train-Car Collision

By Kim Smiley

On February 3, 2015, an SUV was struck by a commuter train near Valhalla, New York.  The driver of the vehicle and 5 train passengers were killed in the accident.  The National Transportation Safety Board (NTSB) is investigating the accident to determine what went wrong.

An initial Cause Map, a visual root cause analysis, can be built to analyze and document what is known about this train-car collision.  A Cause Map visually lays out the cause-and-effect relationships that contributed to an issue and focuses on understanding all the causes, not THE root cause.  Generally, identifying more causes results in a greater number of potential solutions being considered.

So why did the train hit a vehicle?  Eyewitnesses have stated that the SUV was hit by a crossing gate as it descended.    It is not clear why the SUV didn’t stop prior to entering the railroad crossing area. The driver pulled the SUV forward onto the tracks rather than backing up and the train struck the vehicle shortly after.  Investigators don’t know why the driver stopped on the tracks, but initial reports are that all safety features, such as the crossing gate, signs and train horn, were functioning properly at the time of the accident.

Unfortunately, it’s not unusual for passengers in a vehicle struck by a train to be injured or killed, but fatalities among train passengers are less common.  Investigators are working to determine what made this accident particularly dangerous for train passengers.  The NTSB plans to use information about the passengers’ injuries and a diagram of where people were sitting on the train to try to understand what happened during the collision.  Post-accident photos of the train show that significant fire damage occurred, likely fueled by the gas in the SUV.

One of the open questions is whether the electrified third rail contributed to the accident and subsequent injuries. Metro-North uses an unusual “under-running” third rail design where power is taken from the bottom of the rail.  During the collision, 400 feet of the third rail broke apart and 12 pieces pierced both the SUV and the train. This rail design uses a metal shoe that slips underneath the third rail and some think that the force of the collision may have essentially pried up the rail and thrown it into the train and vehicle.

Open questions can be documented on the initial Cause Map with a question mark.  As more information becomes available, the Cause Map can quickly be updated.  Typically, Cause Maps are built in Excel and different versions can be saved as different sheets to document the investigation process.

Click on “Download PDF” above to view an initial Cause Map of this accident, built from the information in the media articles on the accident.

TransAsia Plane Crashes into River in Taiwan

By Kim Smiley

On February 4, 2015, there were 53 passengers onboard TransAsia Airways Flight 235 when the plane crashed into the Keelung River shortly after taking off from Taipei Songshan Airport.  There were 15 survivors from this dramatic crash, in which the plane hit a bridge and a taxi cab, then turned upside down before hitting the river. (The crash was caught on video by dash cameras from a vehicle on the bridge and can be seen here.)

Investigators are still working to determine exactly what happened, but some early findings have been released.  The plane involved in this crash was a turboprop with two engines.  This model of plane can fly safely with only one engine, but both engines had issues immediately prior to the crash so the pilots were unable to control the plane.

Data from the flight recorder shows that the right engine idled 37 seconds after takeoff.  No details about what caused the problem with the right engine have been made available.  The initial investigation findings are that the left engine was likely manually shut down by the pilots.  It’s not clear why the functioning engine would have been intentionally shut down. Early speculation is that it was a mistake and that the pilots were attempting to restart the idled right engine when they hit the switch for the operating left engine.

The investigation into the crash is ongoing and the final report isn’t expected to be released for about a year, but based on the initial findings, a few solutions to help reduce the likelihood of future crashes have already been implemented.  TransAsia has grounded most of its turboprop aircraft pending additional pilot instruction and requalification because it is believed that pilot action may well have contributed to the deadly accident.  More than 100 domestic flights have been canceled as a result.  Additionally, Taiwan’s Civil Aeronautics Administration has announced that the carrier will be banned from adding new international routes for 12 months.  A previous crash in July 2014 had already tarnished TransAsia’s reputation and this latest disaster will certainly be scrutinized by the authorities.

An initial Cause Map, a visual root cause analysis, can be built to analyze the information that is available on this crash and to document where there are still open questions.  To view a Cause Map and Outline of this incident, click on “Download PDF” above.

Working Conditions Raise Concerns at Fukushima Daiichi

By ThinkReliability Staff

The nearly 7,000 workers toiling to decommission the reactors at Fukushima Daiichi after they were destroyed by the earthquake and tsunami on March 11, 2011 face a daunting task (described in our previous blog). Recent events have led to questions about the working conditions and safety of these workers.

On January 16, 2015, the local labor bureau instructed the utility that owns the plants to reduce industrial accidents. (The site experienced 23 accidents in fiscal year 2013 and 55 so far this fiscal year.) Three days later, on January 19, a worker fell into a water storage tank and was taken to the hospital. He died the next day, as did a worker at Fukushima Daini when his head got caught in machinery. (Fukushima Daini is nearby and was less impacted by the 2011 event. It is now being used as a staging site for the decommissioning work at Fukushima Daiichi.)

Although looking at all industrial accidents will provide the most effective solutions, often digging into just one in greater detail will provide a starting point for site improvements. In this case, we will look at the January 19 fall at Fukushima Daiichi to identify some of the challenges facing the site that may be leading to worker injuries and fatalities.

A Cause Map, or visual form of root cause analysis, is begun by determining the organizational impacts as a result of an incident. In this case the worker fall impacted the safety goal due to the death of the worker. The environmental goal was not impacted. (Although the radiation levels at the site still require extensive personal protective equipment, the incident was not radiation-related.) Workers on site have noted difficult working conditions, which are thought to be at least partially responsible for the rise in incidents, as is the huge number of workers at the site (itself an impact to the labor/time goal). Lastly, local organizations have raised regulatory concerns due to the high number of incidents at the site.

An analysis of the issues begins with one impacted goal. In this case, the worker death resulted from a fall into a ten-meter empty tank. The worker was apparently not found immediately (though specific timeline details and whether or not that impacted the worker’s outcome have not been released) because it appears he was working alone, likely due to the massive manpower needs at the site. Additionally, the face masks worn by all workers (due to the high radiation levels still present) limit visibility.

The worker was checking for leaks at the top of the tank, which is being used to store water used to cool the reactors at the site. There is a general concern about lack of knowledge of workers (many of whom have been hired recently with little or no experience doing the types of tasks they are now performing), though again, it’s unclear whether this was applicable in this case. Of more concern is the ineffective safety equipment – apparently the worker did not securely fasten his safety harness.

The reasons for this, and the worker falling in the first place, are likely due to worker fatigue or lack of concentration. Workers at the site face difficult conditions doing difficult work all day (or night) long, and have to travel far afterwards, as the surrounding area is still evacuated. Reports of mental health issues and fatigue in these workers have led to the opening of a new site providing meals and rest for these workers.

These factors are likely contributing to the increase in accidents, as is the number of workers at the site, which doubled from December 2013 to December 2014. Local organizations are still calling for action to reduce these accidents. “It’s not just the number of accidents that has been on the rise. It’s the serious cases, including deaths and serious injuries that have risen, so we asked Tokyo Electric to improve the situation,” says Katsuyoshi Ito, a local labor standards inspector.

In addition to improving working conditions, the site is implementing improved worker training – and looking at discharging wastewater instead of storing it, which would reduce the pieces of equipment required to be monitored and maintained. Improvements must be made, because decades of work remain before work at the site will be completed.

Click here to sign up for our FREE webinar “Root Cause Analysis Case Study: Fukushima Daiichi” at 2:00 pm EDT on March 12 to learn more about how the earthquake and tsunami on March 11, 2011 impacted the plant.

Prison Bus Collides With Freight Train

By Kim Smiley

On the morning of January 14, 2015, a prison bus went off an overpass and collided with a moving freight train.  Ten were killed and five more injured.  Investigators believe the accident was weather-related.

This tragic accident can be analyzed by building a Cause Map, a visual root cause analysis.  A Cause Map visually lays out the cause-and-effect relationships to show all the causes (not just a single root cause) that contributed to an accident.  The first step in the Cause Mapping method is to determine how the incident impacted the overall organizational goals.  Typically, more than one goal needs to be considered.  Clearly the safety goal was impacted because of the deaths and injuries.  The property goal is impacted because of the damage to both the bus and train (two train cars carrying UPS packages were damaged).  The schedule goal is impacted because of the delays in the train schedule and the impact on vehicle traffic.

The Cause Map itself is built by starting at one of the impacted goals and asking “why” questions. So why were there fatalities and injuries?  This occurred because there were 15 people on a bus and the bus collided with a train.  The bus was traveling between two prison facilities and drove over an overpass.  While on the overpass, the bus hit a patch of ice and slid off the road, falling onto a moving freight train that was passing under the roadway.  No one onboard the train was injured and the train did not derail, but it was significantly damaged.  The bus was severely damaged.

The prisoners onboard the bus were not wearing seat belts, as is typical on many buses.  They were also handcuffed together, although it’s difficult to say how much this contributed to the injuries and fatalities.
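The “why” questions above can be sketched as a small cause-and-effect graph. This is a minimal illustration only: the `CauseMap` class and its method names are hypothetical and are not part of any Cause Mapping tool (the maps referenced here are typically built in Excel). Causes whose contribution is still uncertain are flagged with a question mark.

```python
class CauseMap:
    """Hypothetical minimal cause-and-effect graph: each effect points to its causes."""

    def __init__(self):
        self._causes = {}  # effect -> list of (cause, is_open_question)

    def add(self, effect, cause, open_question=False):
        self._causes.setdefault(effect, []).append((cause, open_question))

    def ask_why(self, effect, depth=1):
        """Return indented lines for every cause found by repeatedly asking why."""
        lines = []
        for cause, open_question in self._causes.get(effect, []):
            mark = " ?" if open_question else ""
            lines.append("  " * depth + "why? " + cause + mark)
            lines.extend(self.ask_why(cause, depth + 1))
        return lines


goal = "Safety goal impacted: fatalities and injuries"
collision = "Bus collided with a moving freight train"
slide = "Bus slid off the overpass"

cmap = CauseMap()
cmap.add(goal, "15 people were on the bus")
cmap.add(goal, collision)
cmap.add(goal, "Prisoners were not wearing seat belts")
# The text says it is hard to tell how much this contributed, so mark it open.
cmap.add(goal, "Prisoners were handcuffed together", open_question=True)
cmap.add(collision, slide)
cmap.add(collision, "Train was passing under the roadway")
cmap.add(slide, "Bus hit a patch of ice")

report = [goal] + cmap.ask_why(goal)
print("\n".join(report))
```

Each new piece of evidence becomes another `add` call, so the map can grow as the investigation progresses without restructuring what is already documented.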

Useful solutions to prevent these types of accidents can be tricky.  The prison system may want to review how they evaluate road conditions prior to transporting prisoners.  This accident occurred early in the morning and waiting until later in the day when temperatures had increased may have reduced the risk of a bus accident.  Transportation officials may also want to look at how roads, especially overpasses, are treated in freezing weather to see if additional efforts are warranted.

To view a high level Cause Map of this accident, click on “Download PDF” above.

You can also read our previous blogs to learn more about other train collisions:

Freight Trains Collide Head-on in Arkansas

Freight Train Carrying Crude Oil Explodes After Colliding with Another

“Ghost Train” Causes Head-on Collision in Chicago

Deadly Train Collision in Poland

Passengers trapped in smoke-filled metro train

By Kim Smiley

A standard commute quickly turned into a terrifying ordeal for passengers on a metro train in Washington, DC the afternoon of January 12, 2015.  Shortly after leaving a station, the train abruptly stopped and then quickly filled with thick smoke. One passenger died as a result of the incident and 84 more were treated for injuries, predominantly smoke inhalation.

This incident can be analyzed by building a Cause Map, a visual root cause analysis.  A Cause Map visually lays out the cause-and-effect relationships to show all the causes that contributed to an issue.  The first step in the Cause Mapping process is to define the problem by filling in an Outline with the basic background information as well as documenting how the issue impacts the overall goals.  For this example, the safety goal is clearly impacted by the passenger death and injuries.  A number of other goals should also be considered such as the schedule goal which was impacted by significant metro delays.  (To view an Outline and initial Cause Map for this issue, click on “Download PDF” above.)

So why were passengers injured and killed?  Passengers were trapped on the train and it filled with smoke.  It is unclear why the train wasn’t able to back up to the nearby station once the smoke formed and investigators are working to learn more.  (Open issues can be documented on the Cause Map with a question mark to indicate that more evidence is needed.)  There are also questions about the time emergency workers took to reach the train to aid in evacuation of passengers so this is another area that will require more information to fully understand. By some accounts, it took 40 minutes for firefighters to reach the trapped passengers.

Initial reports are that smoke was caused by an electrical arcing event, likely from the cables supporting the high voltage third rail used to power the trains. The specifics of what caused the arc are being investigated by the National Transportation Safety Board and will be released when the investigation is concluded.  What is known is that there was significant smoke caused by the arc, but no fire.  There have also been reports of water near the rails that may have been a factor in the arcing.

Eyewitness accounts of this incident are horrifying.  People had little information and didn’t know whether there was fire nearby at first.  They were told to remain on the train and await rescue, but the rescue took some time, which surely felt longer to the scared passengers.  It won’t be clear what solutions need to be implemented to prevent similar problems in the future until the investigation is complete, but I think we can agree that metro officials need to work to ensure passenger safety going forward.

Bad Weather Believed to Have Brought Down AirAsia Flight QZ8501

By ThinkReliability Staff

AirAsia flight QZ8501, and the 162 people on-board, was lost on December 28, 2014 while flying through high-altitude thunderstorms. Because of a delay in finding the plane and continuing bad weather in the area, the black box, which contains data that will give investigators more detail on why the plane went down, has not yet been recovered. Even without the black box’s data, experts believe that the terrible weather in the area was a likely cause of the crash.

“From our data it looks like the last location of the plane had very bad weather and it was the biggest factor in behind the crash. These icy conditions can stall the engines of the plane and freeze and damage the plane’s machinery,” says Edvin Aldrian, the head of Research at an Indonesian weather agency. Beyond the icing of engines, there are other theories on how weather-related issues may have brought down the plane.

Early speculation was that the plane was struck by lightning; while that may have happened, experts say it’s unlikely it would have brought the plane down, because modern planes are fairly well-equipped to deal with direct lightning strikes. High levels of turbulence can also result in stalling due to a loss of airflow over the wings. There are also some who believe the plane (an Airbus A320) may have been pushed into a vertical climb past the limit for safe operation (to escape the weather) which resulted in a stall.

While the actual mechanism of how the weather (or an unrelated issue) brought the plane down is still to be determined, aviation safety organizations are already implementing some interventions to increase the safety of air travel in the area based on some specific areas of concern. (These areas of concern can be viewed visually in a Cause Map, or visual root cause analysis, by clicking on “Download PDF” above.)

AirAsia pilots relied on “self-briefings” regarding the weather. Pilots in other locations have expressed concern about the adequacy of weather information pilots obtain using this method. Direct pilot briefings with dispatchers based on detailed weather reporting are recommended to ensure that pilots have the information they need to safely traverse areas of poor weather (or stay out of them altogether).

Heavy air traffic in the area delayed approval to climb out of the storm. At 6:12 local time the flight crew requested to climb to a higher altitude to attempt to escape the storm. Air traffic control did not attempt to respond to the plane until 6:17, at which point it could no longer be contacted. Air traffic in the area was heavy, possibly because:

The plane did not have permission to fly the route it was on. AirAsia was licensed to fly the route it was taking at the time of the crash four days a week, but not the day of the crash. The takeoff airport used incorrect information in allowing the plane to take off in the first place (and the airline certainly used incorrect information in trying to fly the route as well). The selection of the route has been determined not to be a factor in the crash, but it certainly may have resulted in the overcrowding that led to the delayed response from air traffic control. It also resulted in the airline’s flights on that route being suspended.

It took almost three days to find the plane. The delay is renewing calls for universal tracking of aircraft or real-time streaming of flight data that were initially raised after the loss of Malaysia Airlines flight MH370, which is still missing ten months after losing radar contact. (See our previous blog on the difficulties finding it.) Not only would this reduce the suffering of families while waiting to hear their loved ones’ fates, it would reduce resources required to find lost aircraft and, in cases where survival is possible, increase the chance of survival of those on the plane.

 

Hundreds Saved by Arduous Helicopter Rescue From Ferry Fire

By Kim Smiley

In a grueling rescue effort, 427 people were saved from a passenger ferry, Norman Atlantic, which caught fire December 28, 2014 off the coast of Greece.  About 150 people managed to escape the fire in lifeboats, but the remaining passengers were lifted to safety via helicopter.  Gale force winds, heavy rain and darkness all combined to make a difficult rescue operation even more daunting. Ten people died as a result of the accident with few details known about what caused the fatalities.

A Cause Map, a visual root cause analysis, can be built to analyze this incident.  The investigation is just beginning and there are still many unknowns, but an initial Cause Map can be started now and easily expanded to incorporate new information as it becomes available.  Even the exact number of people onboard has been difficult to determine because there were several stowaways discovered during the rescue operations that weren’t listed on the ship’s manifest.

What is known is that the fire began early in the morning of December 28th and 427 people were rescued off the ferry. Early reports are that the fire started on the parking deck where there were tanker trucks filled with oil.  Witness accounts indicate that the fire spread fairly quickly, leading to speculation that the fire doors failed.  As the fire progressed, the ship lost power.  Once power was gone, the lifeboats were useless because they require electricity to be lowered.  The heat from the fire drove passengers to the top deck and bridge where they were bombarded by cold, rain and thick smoke for many miserable and likely terrifying hours.  Helicopters pulled passengers to safety one by one, working through the windy night with night vision goggles.

In a stark contrast to the South Korea ferry that capsized off Byungpoong in April, the captain was the last person to leave the Norman Atlantic. The rescue effort was truly impressive.  As Greek Prime Minister Antonis Samaras said, the “massive and unprecedented operation saved the lives of hundreds of passengers following the fire on the ship in the Adriatic Sea under the most difficult circumstances.”

The Italian Transport Ministry has seized the vessel pending an investigation into the fire and thorough inspection of the ship.  Whenever a disaster of this magnitude occurs, it is worth understanding exactly what happened and reviewing what could be done better in the future.  There will be many lessons to learn from this incident, both in how to prevent and fight shipboard fires and how to perform helicopter rescues at sea.

To view a high level Cause Map of this incident, click on “Download PDF” above.

Dreamliner fire: firefighter injured when battery explodes

By ThinkReliability Staff

On January 7, 2013, smoke was discovered on a recently deplaned Boeing 787 Dreamliner. The recently released National Transportation Safety Board (NTSB) investigation found that an internal short circuit within a cell of the auxiliary power unit (APU) battery spread to adjacent cells and led to a thermal runaway which released fire and smoke aboard the aircraft. A firefighter responding to the fire was injured when the battery exploded. Only 9 days later, an incident involving the main battery, which is the same model as that used for the APU, resulted in an emergency landing of another Boeing 787. As a result of these two incidents, the entire Dreamliner fleet was grounded for 3 months for the ensuing investigation and incorporation of modifications. (See our previous blog about the grounding.) Before the fleet was allowed to resume operations, certain protective modifications were required to be implemented.

The investigation found that the cause of the internal short circuit, which provided the initial heat source for the fire within the battery cell, could not be definitively determined due to severe damage in the area, but was potentially related to defects discovered during the manufacturing process. (Defects that could result in this type of short circuit were found on similar components.) The investigation found issues within the manufacturing process and with the oversight of subcontractors by contractors, as opposed to the manufacturers themselves.

The high temperatures resulting from the battery fire allowed it to spread to adjacent cells. Localized high temperatures were found greater than allowable at times of maximum current discharge, such as the APU startup, which had recently occurred. The high temperatures were not detected by the monitoring system (the impact could have been minimized had the issue come to light sooner), because temperatures were not monitored at individual cells, but only on two cell bus bars.

The systems were not prepared to deal with a spreading fire as the design of the aircraft assumed that a short circuit internal to the cell would not propagate. The NTSB determined that the guidance provided to determine key assumptions was ineffective and that the validation of these assumptions had failed. Likely related to this assumption, the safety assessment and testing on the battery system was ineffective. The rate of occurrence of cell venting (the spreading of fire from cell to cell) was calculated by the manufacturer to be 1 in 10 million flight hours. The two occurrences that resulted in the grounding both involved cell venting and occurred while the 787 fleet had less than 52,000 flight hours.
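The gap between the predicted and observed rates can be made concrete with a rough calculation. This is a sketch under an assumption that is not in the NTSB report: venting events are treated as a Poisson process with a constant rate per flight hour. Because the fleet had fewer than 52,000 flight hours, the observed rate computed here is a lower bound.

```python
import math

def prob_at_least(k, rate_per_hour, hours):
    """Poisson probability of seeing k or more events in the given exposure."""
    lam = rate_per_hour * hours
    below_k = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below_k

PREDICTED_RATE = 1 / 10_000_000  # manufacturer's estimate: 1 per 10 million flight hours
FLEET_HOURS = 52_000             # upper bound on fleet hours at the time of the events
OBSERVED_EVENTS = 2

expected_events = PREDICTED_RATE * FLEET_HOURS
p_two_or_more = prob_at_least(OBSERVED_EVENTS, PREDICTED_RATE, FLEET_HOURS)
observed_rate = OBSERVED_EVENTS / FLEET_HOURS
rate_ratio = observed_rate / PREDICTED_RATE

print(f"expected events at predicted rate: {expected_events:.4f}")
print(f"chance of 2+ events if prediction held: {p_two_or_more:.2e}")
print(f"observed rate is at least {rate_ratio:.0f}x the prediction")
```

Under this simple model, two events in that exposure would be expected roughly once in 74,000 such periods if the prediction were right, which is strong evidence that the underlying assumption, not chance, was at fault.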

Immediate actions that were required by the NTSB prior to a return to flight were to enclose the battery case, vent from the interior of the enclosure containing the battery to the exterior of the plane (keeping smoke out of the occupied spaces), and modify the battery to minimize the most severe effects from an internal short circuit. The NTSB also made multiple safety recommendations to the manufacturer, subcontractor and the Federal Aviation Administration (FAA).

One of these recommendations was to ensure that assumptions are validated. According to the NTSB report, “Validation of assumptions related to failure conditions that can impact safety is a critical step in the development and certification of an aircraft. The validation process must employ a level of rigor that is consistent with the potential hazard to the aircraft in case an assumption is incorrect.” This statement is true for any object that’s manufactured. Just replace the word “aircraft” with whatever is being manufactured, such as “car” or “pacemaker”. (See another disaster that resulted from unvalidated assumptions: the collapse of the I-35 Bridge.)

Click on “Download PDF” above to view a high level Cause Map of this issue.

10,000 Pound Buoy Falls on Workers

By Kim Smiley

On December 10, 2014, a buoy that weighs close to 10,000 pounds fell onto workers at an inactive ship maintenance facility in Pearl Harbor. Two workers were killed and two others sustained injuries. While an object this large is an extreme example of the dangers of dropped objects, worker injuries and deaths from falling objects of all sizes are a significant safety concern. A US census of fatal occupational injuries states that 245 workers were killed after being struck by falling objects in 2013 alone.

The case of the dropped buoy can be built into a Cause Map, a visual root cause analysis, to better understand what happened. Understanding the details of an accident is necessary to ensure that a wide range of solutions is considered and that any solutions implemented will be effective at preventing future incidents.

The investigation into the falling buoy is still underway so some information is not yet available, but it can easily be incorporated into the Cause Map once it is known. Any causes that need more information or evidence can be noted with a question mark to show that there is still an open question.

Exactly what caused the buoy to drop hasn’t been released yet, but it is known that the safety lines attached to the buoy failed. Both of these issues need to be investigated to ensure that solutions can be implemented to prevent further tragedies.

Additionally, there are open questions about why people were working under the path of the lift. The workers were wearing hard hats, but this is obviously inadequate protection against a 10,000-pound buoy. The contractors were working to strengthen mooring lines at the time of the accident, but no one should be where they could be crushed if such a large object were dropped, as it was in this case. As stated by Jeff Romeo, the Occupational Safety and Health Administration (OSHA) Honolulu area director, “We’re still looking at the facts to try to determine the exact locations of where these employees were located. If in fact, they were working directly underneath the load, then that would be an alarming situation.”

The OSHA investigation is currently underway and is expected to take four to six months. Additionally, the Navy is launching a Safety Investigation Board to review the accident with findings expected to be released by February. Once the investigation is complete, work processes will need to be reviewed to see what changes need to be made to prevent any future injuries from falling objects.

To view an initial Cause Map of this incident, click on “Download PDF” above.