All posts by Angela Griffith

I lead comprehensive investigations by collecting and organizing all related information into a coherent record of the issue. Let me solve a problem for you!

Uncategorized

Extensive Contingency Plans Prevent Loss of Pluto Mission

July 23, 2015 Angela Griffith

By ThinkReliability Staff

Beginning July 14, 2015, the New Horizons probe started sending photos of Pluto back to Earth, much to the delight of the world (and social media). The New Horizons probe was launched more than 9 years ago (on January 19, 2006) – so long ago that when it left, Pluto was still considered a planet. (It’s been downgraded to a dwarf planet now.) A mission that long isn’t without a few bumps in the road. Most notably, just ten days before New Horizons’ Pluto flyby, mission control lost contact with the probe.

Loss of communication with the New Horizons probe while it was nearly 3 billion miles away could have resulted in the loss of the mission. However, because of contingency and troubleshooting plans built into the design of the probe and the mission, communication was restored, and the New Horizons probe continued on to Pluto.

The potential loss of a mission is a near miss. Analyzing near misses can provide important information and improvements for future issues and response. In this case, the mission goal is impacted by the potential loss of the mission (near miss). The labor and time goal is impacted by the time for response and repair. Because of the distance between mission control on Earth and the probe on its way to Pluto, the time required for troubleshooting was considerable, owing mainly to the delay in communications that had to travel nearly 3 billion miles each way (a 9-hour round trip).
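The 9-hour figure can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the signal travels at the speed of light in a vacuum and uses the rounded 3-billion-mile distance from the text. The function and constant names are made up for this example.

```python
# Speed of light in a vacuum, in miles per second (rounded).
SPEED_OF_LIGHT_MILES_PER_SEC = 186_282


def round_trip_hours(distance_miles: float) -> float:
    """Hours for a signal to reach the probe and for the reply to return."""
    one_way_seconds = distance_miles / SPEED_OF_LIGHT_MILES_PER_SEC
    return 2 * one_way_seconds / 3600


# Rounded distance quoted in the post: nearly 3 billion miles.
delay_hours = round_trip_hours(3e9)
print(f"Round-trip delay: about {delay_hours:.1f} hours")
```

Every command-and-response cycle during troubleshooting costs roughly this much waiting time, which is why even a short list of diagnostic steps can take days.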

The potential loss of the mission was caused by the loss of communication between mission control and the probe. Details on the error have not been released, but its description as a “hard to detect” error implies that it wasn’t noticed in testing prior to launch. Because the particular command sequence that led to the loss of communication was not being repeated in the mission, once communication was restored there was no concern for a repeat of this issue.

Not all causes are negative. In this case, the “loss of mission” became a “potential loss of mission” because communication with the probe was restored. This is due to the contingency and troubleshooting plans built into the design of the mission. After the error, the probe automatically switched to a backup computer, per contingency design. Once communication was restored, the spacecraft automatically transmitted data back to mission control to aid in troubleshooting.

Of the mission, Alice Bowman, the Mission Operations Manager, says, “There’s nothing we could do but trust we’d prepared it well to set off on its journey on its own.” Clearly, they did.

Uncategorized

Trading Suspended on the NYSE for More Than 3 Hours

July 16, 2015 Angela Griffith

By ThinkReliability Staff

On July 8, 2015, trading was suspended on the New York Stock Exchange (NYSE) at 11:32 AM. According to the NYSE president Tom Farley, “the root cause was determined to be a configuration issue.” This still leaves many questions unanswered. This issue can be examined in a Cause Map, a visual form of root cause analysis.

There are three steps to the Cause Mapping problem-solving method. First, the problem is defined with respect to the impact to the goals. The basic problem information is captured – the what, when, and where. In a case such as this, where the problem unfolded over hours, a timeline can be useful to provide an overview of the incident. Problems with the NYSE began when a system upgrade to meet timestamp requirements began on the evening of July 7. As traders attempted to connect to the system early the next morning, communication issues were found and worsened until the NYSE suspended trading. The system was restarted and full trading resumed at 3:10 PM.

The impacts to the goals are also documented as part of the basic problem information. In this case, there were no impacts to safety or the environment as a result of this issue. Additionally, there was no impact to customers, whose trades automatically shifted to other exchanges. However, an investigation by the Securities & Exchange Commission (SEC) and political hearings are expected as a result of the outage, impacting the regulatory goal. The outage itself is an impact to the production goal, and the time spent on response and repairs is an impact to the labor/time goal.

The cause-and-effect relationships that led to these impacts to the goals can be developed by asking “why” questions. This can be done even for positive impacts to the goals. For example, in this case customer service was NOT impacted adversely because customers were able to continue making trades even through the NYSE outage. This occurred because there are 13 exchanges, and current technology automatically transfers the trades to other exchanges. Because of this, the outage was nearly transparent to the general public.

In the case of the outage itself, as discussed above, the NYSE has stated it was due to a configuration issue. Specifically, the gateways were not loaded with the proper configuration for the upgrade that was rolled out July 7. However, information about what exactly the configuration issue was, or what checks failed and allowed the improper configuration to be loaded, is not currently available. (Although some have said that the chance of this failure happening on the same date as two other large-scale outages could not be coincidental, the NYSE and government have ruled out hacking.) According to NYSE president Tom Farley, “We found what was wrong and we fixed what was wrong and we have no evidence whatsoever to suspect that it was external. Tonight and overnight starts the investigation of what exactly we need to change. Do we need to change those protocols? Absolutely. Exactly what those changes are I’m not prepared to say.”

Another concern is the backup plan in place for these types of issues. Says Harvey Pitt, SEC Chairman 2001 to 2003, “This kind of stuff is inevitable. But if it’s inevitable, that means you can plan for it. What confidence are we going to have that this isn’t going to happen anymore, or that what did happen was handled as good as anyone could have expected?” The backup plan in place appeared to be shifting operations to a disaster recovery center. This was not done because it was felt that requiring traders to reconnect would be disruptive. Other backup plans (if any) were not discussed. This has led some to question the oversight role of the SEC and its ability to prevent issues like this from recurring.

To view the investigation file, including the problem outline, Cause Map, and timeline, click on “Download PDF” above. To view the NYSE statement on the outage, click here.

Root Cause Analysis - Incident Investigation

Cause-and-Effect: Alcohol Consumption

June 30, 2015 Angela Griffith

By ThinkReliability Staff

The human body is a pretty amazing thing. Many of the processes that take place in our body on a regular basis – keeping us breathing, walking and playing video games or skydiving (or both, though hopefully not at the same time) – have not yet been replicated. They’re that complex.

Which of course raises a lot of questions: why do our bodies work the way they do? It also leads to the subset of questions, when x happens, why does y happen? If your question is, when I drink, why do I feel so great, then so lousy, science has the answers for you . . . and yes, we can capture them in a Cause Map!

If your goal for your body is to feel well and behave pretty consistently, then drinking alcohol is going to impact those goals. First, drinking is going to result in a decrease in control of your behavior. The specifics of how this manifests are legion, but I am assured you probably have many examples. Your post-binge feelings are also going to be impacted: most likely your drinking is going to result in a hangover (generally awful feelings centered around your abdomen and head), dehydration and frequent urination. If your goal is not to eat everything in sight without any consideration about what it will do to your waistline, then your diet may also be impacted due to a desire for carbohydrates.

Beginning with one of these goals, we can ask our favorite question: Why? For example, our decrease in behavior control results from the hypothalamus, pituitary gland, and cerebellum being depressed. This decreases inhibitions and the ability to think clearly, and also releases a whole slew of hormones and dopamine. Additionally, alcohol impacts neurotransmitters which direct emotions, actions and motor skills, so the combination may make you think you can dance on a table . . . but really you can barely walk.

Now about the ill after-effects. That lovely hangover results from your digestive system attempting to detoxify your body from alcohol, and the pounding headache is caused by dehydration. When your digestive system works to remove alcohol, the byproduct is acetaldehyde and your body doesn’t like it at all. Most of the alcohol from your body is going to be flushed through your bladder. In order to speed its exit, your body redirects all the liquid it can to your bladder, leaving you dehydrated. (That’s also why you have to run to the bathroom so many times after drinking.) The whole process of removing alcohol from your body takes energy. In order to direct as much energy towards alcohol removal as possible, your brain shuts down most of your other functions (which doesn’t help with the ability to function). To get that energy back, your body craves food – carbs in particular (grease optional).

With all these bad effects, you may wonder why people drink at all. Well, when you drink, the alcohol depresses some systems as discussed above, resulting in the release of a bunch of hormones and dopamine. These make us feel good (or even fabulous!). That’s why we keep drinking. (There’s also a whole bunch of social pressures which I’m not going to go into here.)

Giving up drinking altogether is difficult, and many people don’t want to. There are, however, ways to minimize the ill effects of drinking. Food in your stomach helps absorb some of the alcohol, so eating before you drink can help. The headache portion of the hangover can be minimized by drinking a lot of water (though that won’t help with the frequent urination issue). AND OF COURSE, because drinking does a number on your fine motor control and general behavior, you should never, ever drink and drive or operate other heavy machinery.

To view the Cause Map of what happens when you drink, click on “Download PDF” above. The information used to create this blog is from:

“The Science of Getting Drunk” and

“Every Time You Get Drunk This Is What Happens To Your Body And Your Brain”

Root Cause Analysis - Incident Investigation

Rollercoaster Crash Under Investigation

June 17, 2015 Angela Griffith

By ThinkReliability Staff

A day at a resort/theme park ended in horror on June 2, 2015 when a carriage filled with passengers on the Smiler rollercoaster crashed into an empty car in front of it. The 16 people in the carriage were injured, 5 seriously (including limb amputations). While the incident is still under investigation by the Health and Safety Executive (HSE), information that is known can be collected in cause-and-effect relationships within a Cause Map, or visual root cause analysis.

The analysis begins with determining the impact to the goals. Clearly the most important goal affected in this case is the safety goal, impacted because of the 16 injuries. In addition to the safety impacts, customer service was impacted because of the passengers who were stranded for hours in the air at a 45-degree angle. The HSE investigation and expected lawsuits are an impact to the regulatory goal. The park was closed completely for 6 days, at an estimated cost of £3 M. (The involved rollercoaster and others with similar safety concerns remain closed.) The damage to the rollercoaster and the response, rescue and investigation are impacts to the property and labor goals, respectively.

The Cause Map is built by laying out the cause-and-effect relationships starting with one of the impacted goals. In this case, the safety goal was impacted because of the 16 injuries. 16 passengers were injured due to the force on the carriage in which they were riding. The force was due to the speed of the carriage (estimated at 50 mph) when it collided with an empty carriage. According to a former park employee, the collision resulted from both a procedural and mechanical failure.

The passenger-filled carriage should not have been released while an empty car was still on the tracks, making a test run. It’s unclear what specifically went wrong to allow the release, but that information will surely be addressed in the HSE investigation and procedural improvements going forward. There is also believed to have been a mechanical failure. The former park employee stated, “Technically, it should be absolutely impossible for two cars to enter the same block, which is down to sensors run by a computer.” If this is correct, then it is clear that there was a failure with the sensors that allowed the cars to collide. This will also be a part of the investigation and potential improvements.

After the cause-and-effect relationships have been developed as far as possible (in this case, there is much information still to be added as the investigation continues), it’s important to ensure that all the impacted goals are included on the Cause Map. In this case, the passengers were stranded in the air because the carriage was stuck on the track due to the force upon it (as described above) and also due to the time required for rescue. According to data that has so far been released, it was 38 minutes before paramedics arrived on-scene, and even longer for fire crews to arrive with the necessary equipment to begin a rescue made very difficult by the design of the rollercoaster (the world record holder for most loops: 14). The park staff did not contact outside emergency services until 16 minutes after the accident – an inexcusably long time given the gravity of the incident. The delayed emergency response will surely be another area addressed by the investigation and continuing improvements.

Although the investigation is ongoing, the owners of the park are already making improvements, not only to the Smiler but to all its rollercoasters. In a statement released June 5, the owner group said “Today we are enhancing our safety standards by issuing an additional set of safety protocols and procedures that will reinforce the safe operation of our multi-car rollercoasters. These are effective immediately.” The Smiler and similar rollercoasters remain closed while these corrective actions are implemented.

Dr. Tony Cox, a former Health and Safety Executive (HSE) advisory committee chairman, hopes the improvements don’t stop there and issues a call to action for all rollercoaster operators. “If you haven’t had the accident yourself, you want all that information and you’re going to make sure you’ve dealt with it . . . They can just call HSE and say, ‘Is there anything we need to know?’ and HSE will . . . make sure the whole industry knows. That’s part of their role. It’s unthinkable that they wouldn’t do that.”

To view the information available thus far in a Cause Map, please click “Download PDF” above.

Root Cause Analysis - Incident Investigation

New Regulations Aim to Reduce Railroad Crude Oil Spills

May 14, 2015 Angela Griffith

By ThinkReliability Staff

The tragic train derailment in Lac-Mégantic, Quebec on July 6, 2013 (see our previous blog on this topic) ushered in new concerns about the transport of crude oil by rail in the US and Canada. Unfortunately, the increased attention has highlighted a growing problem: spills of crude oil transported via rail, which can result in fires, explosions, evacuations, and potentially deaths. (Luckily there have been no fatalities since the Lac-Mégantic derailment.) According to Steve Curwood of Living on Earth, “With pipelines at capacity the boom has lead a 4,000 percent increase in the volume of crude oil that travels by rail, and that brought more accidents and more oil spills in 2014 than over the previous 38 years.”

This follows a period of increases in railroad safety – according to the US Congressional Research Service, “From 1980 to 2012, railroads reduced the number of accidents releasing hazmat product per 100,000 hazmat carloads from 14 to 1.” From October 19, 2013 to May 6, 2015, there were at least 12 railcar derailments that resulted in crude oil spills. (To see the list of events, click on “Download PDF” and go to the second page.)

Says Sarah Feinberg, acting administrator of the Federal Railroad Administration (FRA), “There will not be a silver bullet for solving this problem. This situation calls for an all-of-the-above approach – one that addresses the product itself, the tank car it is being carried in, and the way the train is being operated.” All of these potential risk-reducing solutions are addressed by the final rule released by the FRA on May 1, 2015. (On the same day, the Canadian Ministry of Transport released similar rules.) In order to view how the various requirements covered by the rule impact the risk to the public as a result of crude oil spills from railcars, we can diagram the cause-and-effect relationships that lead to the risk, and include the solutions directly over the cause they control. (To view the Cause Map, or visual root cause analysis, of crude oil train car derailments, click on “Download PDF”.)

The product: Bakken crude oil (as well as bitumen) can be more volatile than other types of crude oil and has been implicated in many of the recent oil fires and explosions. In addition to being more volatile, the composition (and thus volatility) can vary. If a material is not properly sampled and characterized, proper precautions may not be taken. The May 1 rule incorporates a more comprehensive sampling and testing program to ensure the properties of unrefined petroleum-based products are known and provided to the DOT upon request. (Note that in the May 6, 2015 derailment and fire in Heimdahl, North Dakota, the oil had been treated to reduce its volatility, so this clearly isn’t an end-all answer.)

The tank car: Older tank cars (known as DOT-111s) were involved in the Lac-Mégantic and other 2013 crude oil fires. An upgrade to these cars, known as CPC-1232, was intended to reduce these accidents. However, CPC-1232 cars have been involved in all of the issues since 2013. Cynthia Quarterman, former director of the Pipeline and Hazardous Materials Safety Administration, says that the recent accidents involving the newer tank cars “confirm that the CPC-1232 just doesn’t cut it.”

The new FRA rule establishes requirements for any “high-hazard flammable train” (HHFT) transported over the US rail network. A HHFT is a train comprised of 20 or more loaded tank cars of a Class 3 flammable liquid (which includes crude oil and ethanol) in a continuous block or 35 or more loaded tank cars of a Class 3 flammable liquid across the entire train. Tank cars used in HHFTs constructed after October 1, 2015 are required to meet DOT-117 design criteria, and existing cars must be retrofitted based on a risk-based schedule.
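The two-part HHFT definition above (20 or more in a continuous block, or 35 or more across the whole train) can be expressed as a short check. This is a minimal sketch of the definition as summarized in this post, not the regulatory text. The function name and the list-of-booleans representation of a train are assumptions made for illustration.

```python
from typing import List

# Thresholds as summarized in the post.
BLOCK_THRESHOLD = 20   # loaded Class 3 tank cars in a continuous block
TOTAL_THRESHOLD = 35   # loaded Class 3 tank cars across the entire train


def is_hhft(loaded_class3: List[bool]) -> bool:
    """Return True if the train meets the HHFT definition described above.

    loaded_class3[i] is True when car i is a loaded tank car carrying a
    Class 3 flammable liquid (hypothetical representation of a consist).
    """
    total = sum(loaded_class3)
    longest_block = current_block = 0
    for is_loaded in loaded_class3:
        # Extend the current run of qualifying cars, or reset it.
        current_block = current_block + 1 if is_loaded else 0
        longest_block = max(longest_block, current_block)
    return longest_block >= BLOCK_THRESHOLD or total >= TOTAL_THRESHOLD


# 20 qualifying cars in a row: meets the continuous-block test.
print(is_hhft([True] * 20))
# 19 in a row, a buffer car, then 15 more: 34 total, longest block 19.
print(is_hhft([True] * 19 + [False] + [True] * 15))
```

The second example shows why both tests matter: a single buffer car breaks the continuous block, so the train only qualifies once the train-wide total reaches 35.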

The way the train is being operated: The way the train is being operated includes not only the mechanics of operating the train, but also the route the train takes and the notifications required along the way. Because the risk for injuries and fatalities increases as the population density increases, the rule includes requirements to perform an analysis to determine the best route for a train. Notification of affected jurisdictions is also required.

Trains carrying crude oil tend to be very large (sometimes exceeding one mile in length). This can impact stopping distance as well as increase the risk of derailment if sudden stopping is required. To reduce these risks, HHFTs are restricted to 50 mph in all areas, and 40 mph in certain circumstances based on risk (one of the criteria is urban vs. rural areas). HHFTs are also required to have in place a functioning two-way end of train or distributed power braking system. Advanced braking systems are required for trains including 70 or more loaded tank cars containing Class 3 flammable liquids and traveling at speeds greater than 30 mph, though this requirement will be phased in over decades.

It is important to note that this new rule does not address inspections of rails and tank cars. According to a study of derailments from 2001 to 2010, track problems were the most important causes of derailments (with broken rails or track welds accounting for 23% of total cars derailed). A final rule issued January 24, 2014 required railroads to achieve a specified track failure rate and to prioritize remedial action.

To view the May 1 rule regarding updates to crude-by-rail requirements, click here. To view the timeline of incidents and the Cause Map showing the cause-and-effect relationships leading to these incidents, click “Download PDF”.

Uncategorized

Crash of Germanwings flight 9525 Leads to Questions

April 7, 2015 Angela Griffith

By ThinkReliability Staff

On March 24, 2015, Germanwings flight 9525 crashed into the French Alps, killing all 150 onboard. Evidence available thus far suggests the copilot deliberately locked the pilot out of the cockpit and intentionally crashed the plane. While evidence collection is ongoing, because of the magnitude of this catastrophe, solutions to prevent similar recurrences are already being discussed and, in some cases, implemented.

What is known about the crash can be captured in a Cause Map, or visual form of root cause analysis. Visually diagramming all the cause-and-effect relationships allows the potential for addressing all related causes, leading to a larger number of potential solutions. The analysis begins by capturing the impacted goals in the problem outline. In this case, the loss of 150 lives (everybody aboard the plane) is an impact to the safety goal and of primary concern in the investigation. Also impacted are the property goal, due to the loss of the plane, and the labor goal, due to the recovery and investigation efforts (which are particularly challenging in this case because the crash site is difficult to access).

Asking “Why” questions from the impacted goals develops cause-and-effect relationships. In this case, the deaths resulted from the crash of the plane into the mountains of the French Alps. So far, available information appears to support the theory that the copilot deliberately crashed the plane. Audio recordings of the pilot requesting re-entry into the cockpit, the normal breathing of the copilot, and the manual increase of speed of the descent while crash warnings sounded all suggest that the crash was deliberate. Questions have been raised about the copilot’s fitness for duty. Some have suggested increased psychological testing for pilots, but the trade group Airlines for America says that the current system (at least in the US) is working: “All airlines can and do conduct fitness-for-duty testing on pilots if warranted. As evidenced by our safety record, the U.S. airline industry remains the largest and safest aviation system in the world as a result of the ongoing and strong collaboration among airlines, airline employees, manufacturers and government.”

Some think that technology is the answer. The flight voice recorder captured cockpit alarms indicating an impending crash. But these were simply ignored by the copilot. If flight guidance software were able to take over for an incapacitated pilot (or one who deliberately ignores these warnings), disasters like this one could be avoided. Former Department of Transportation Inspector General Mary Schiavo says, “This technology, I believe, would have saved the flight. Not only would it have saved this flight and the Germanwings passengers, it would also save lives in situations where it is not a suicidal, homicidal pilot. It has implications literally for safer flight across the industry.”

Others say cockpit procedures should be able to prevent an issue like this. According to aviation lawyers Brian Alexander & Justin Green, in a blog for CNN, “If Germanwings had implemented a procedure to require a second person in the cockpit at all times – a rule that many other airlines followed – he would not have been able to lock the pilot out.”

After 9/11, cockpit doors were reinforced to prevent any forced entry (according to the Federal Aviation Administration, they should be strong enough to withstand a grenade blast). The doors have 3 settings – unlock, normal, and lock. Under normal settings, the cockpit can be unlocked by crewmembers with a code after a delay. But under the lock setting (to be used, for example, to prevent hijackers who have obtained the crew code from entering the cockpit), no codes will allow access. (The lock setting has to be reset every 5 minutes.) Because of the possibility a rogue crewmember could lock out all other crewmembers, US airlines instituted the rule that there must always be two people in the cockpit. (Of course, if only a three-person crew is present, this can cause other issues, such as when a pilot became locked in the bathroom while the only other two flight crew members onboard were locked in the cockpit, nearly resulting in a terror alert. See our previous blog on this issue.)
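The three door settings described above can be modeled as a small state check. This is a simplified sketch based only on the description in this post. It assumes that a lock setting which is not renewed within 5 minutes falls back to normal, and it does not model the keypad delay. All names are hypothetical and do not come from any aircraft system.

```python
from enum import Enum


class DoorSetting(Enum):
    UNLOCK = "unlock"
    NORMAL = "normal"
    LOCK = "lock"


# From the text: the lock setting has to be reset every 5 minutes.
LOCK_DURATION_MIN = 5


def effective_setting(selected: DoorSetting, minutes_since_selected: float) -> DoorSetting:
    """Assumption: a LOCK that has not been renewed falls back to NORMAL."""
    if selected is DoorSetting.LOCK and minutes_since_selected >= LOCK_DURATION_MIN:
        return DoorSetting.NORMAL
    return selected


def crew_code_opens_door(selected: DoorSetting,
                         minutes_since_selected: float,
                         code_is_valid: bool) -> bool:
    """Whether someone outside the cockpit can get in (keypad delay not modeled)."""
    setting = effective_setting(selected, minutes_since_selected)
    if setting is DoorSetting.UNLOCK:
        return True
    if setting is DoorSetting.LOCK:
        return False          # no code works while the lock is active
    return code_is_valid      # NORMAL: a valid crew code opens the door


# A valid code is refused while the lock is active...
print(crew_code_opens_door(DoorSetting.LOCK, 2, code_is_valid=True))
# ...but works again if nobody inside renews the lock.
print(crew_code_opens_door(DoorSetting.LOCK, 6, code_is_valid=True))
```

Under this model, a person inside who keeps renewing the lock can keep everyone else out indefinitely, which is the gap the two-person rule is meant to close.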

James Hall, the former chairman of the National Transportation Safety Board, agrees. He says, “The flight deck is capable of accommodating three pilots and there shouldn’t ever be a situation where there is only one person in the cockpit.” In response, many airlines in Europe and Canada, including Germanwings’ parent company Lufthansa, have since instituted a rule requiring at least two people in the cockpit at all times. Other changes to increase airline safety may be implemented after more details regarding the crash are discovered.

Uncategorized

March 27, 1977: Two Jets Collide on Runway, Killing 583

March 27, 2015 Angela Griffith

By ThinkReliability Staff

March 27, 1977 was a difficult day for the aviation industry. Just after noon, a bomb exploded at the Las Palmas passenger terminal in the Canary Islands. Five large passenger planes were diverted to the Tenerife-Norte Los Rodeos Airport, where they completely covered the taxiway of the one-runway regional airport. Less than five hours later, when the planes were finally given permission to take off, two collided on the runway, killing 583 and making this the worst aviation accident at the time (and second now only to the September 11, 2001 attacks in the US).

With the benefit of nearly 40 years of hindsight, it is possible to review the causes of the accident, as well as look at the solutions implemented after this accident, which are still being used in the aviation industry today. First we look at the impact to the goals as a result of this tragedy. The deaths of 583 people (out of a total of 644 on both planes) are an impact to the safety goal. The compensation to families of the victims (paid by the operating company of one of the planes) is an impact to the customer service goal. The property goal was impacted due to the destruction of both the planes, and the labor goal was impacted by the rescue, response, and investigation costs that resulted from the accident.

Beginning with one of the impacted goals, we can ask why questions to diagram the cause-and-effect relationships related to the incident. The deaths of the 583 people onboard were due to the runway collision of two planes. The collision occurred when one plane was taking off on the runway, and the other was taxiing to takeoff position on the same runway (called backtracking).

Backtracking is not common (most airports have separate runways and taxiways), but was necessary in this case because the taxiway was unavailable for taxiing. The taxiway was blocked by the three other large planes parked at the airport. A total of five planes were diverted to Tenerife which, having only one runway and a parallel taxiway, was not built to accommodate this number of planes. There were four turnoffs from the runway to the taxiway; the taxiing plane had been instructed to turn off at the third turn (the first turn that was not blocked by other planes). For unknown reasons, it did not, and the collision resulted between the third and fourth turnoff. (Experts disagree on whether the plane would have been able to successfully make the sharp turn at the third turnoff.)

One plane was attempting takeoff when it ran into the second plane on the runway. The plane taking off was unaware of the presence of the taxiing plane. There was no ground radar and the airport was under heavy fog cover, so the control tower was relying on positions reported by radio. At the time the taxiing plane reported its position, the first plane was discussing takeoff plans with the control tower, resulting in interference rendering most of the conversation inaudible. The pilot of the plane taking off believed he had clearance, due to confusing communication between the plane and the air traffic control tower. Not only did the flight crews and control tower speak different languages, the word “takeoff” was used during a conversation that was not intended to provide clearance for takeoff. Based on discussions between the pilot and flight crew on the plane taking off, investigators believed, but were not able to definitively determine, that other crew members may have questioned the clearance for takeoff, but not to the extent that the pilot asked the control tower for clarification or delayed the takeoff.

After the tragedy, the airport was upgraded to include ground radar. Solutions that impacted the entire aviation industry included the use of English as the official control language (to be used when communicating between aircraft and control towers) and a prohibition on the use of the word “takeoff” except when approving or revoking takeoff clearance. The possibility that action by one of the other crew members could have saved the flights contributed to the concept of Crew Resource Management, which aims to ensure that all flight crew members can and will speak up when they have questions related to the safety of the plane.

Though this is by far the runway collision with the greatest impact to human life, runway collisions are still a concern. In 2011, an Airbus A380 clipped the wing of a Bombardier CRJ (see our previous blog). Los Angeles International Airport (LAX) experienced 21 runway incursions in 2007, after which officials redesigned the runways and taxiways so that they wouldn’t intersect, and installed radar-equipped warning lights to provide planes with a visual warning of potential collisions (see our previous blog).

To view the outline, Cause Map and recommended solutions from the Tenerife runway collision of 1977, click on “Download PDF” above. Or, click here to read more.

Root Cause Analysis - Incident Investigation

Plane Narrowly Avoids Rolling into Bay

March 11, 2015 Angela Griffith

By ThinkReliability Staff

Passengers landing at LaGuardia airport in New York amidst a heavy snowfall on March 5, 2015, were stunned (and 23 suffered minor injuries) when their plane overran the runway and approached Flushing Bay. The National Transportation Safety Board (NTSB) is currently investigating the accident to determine not only what went wrong in this particular case, but what standards can be implemented to reduce the risk of runway overruns in the future.

Says Steven Wallace, the former director of the FAA’s accident investigations office (2000-2008), “Runway overruns are the accident that never goes away. There has been a huge emphasis on runway safety and different improvements, but landing too long and too fast can result in an overrun.” Runway overruns are the most frequent type of accident (there are about 30 runway overruns due to wet or icy runways across the globe every year), and runway overruns are the primary cause of major damage to airliners.

Currently, the NTSB is collecting data (evidence) to aid in its investigation of the accident. The plane is being physically examined, and the crew is being interviewed. The data recorders on the flight are being downloaded and analyzed. While little information can be verified or ruled out at this point, there is still value in organizing the questions related to the investigation in a logical way.

We can do this using the Cause Mapping method of root cause analysis, which organizes cause-and-effect relationships related to an incident. We begin by capturing the impact to an organization’s goals. In this case, 23 minor passenger injuries were reported, an impact to the safety goal. There was a fuel leak of unknown quantity, which impacts the environmental goal. Customer service was impacted due to a scary landing and evacuation from the aircraft via slides. Air traffic at LaGuardia was shut down for 3 hours, impacting the production goal. Both the airplane and the airport perimeter fence suffered major damage, which impacts the property/equipment goal. The labor goal was also impacted due to the response and ongoing investigation.

By beginning with an impacted goal and asking “why” questions, we can begin to diagram the potential causes that may have resulted in an incident. Potential causes are causes without evidence. If evidence is obtained that supports a cause, it becomes a cause and it is no longer followed by a question mark. If evidence rules out a cause, it can be crossed out but left on the Cause Map. This reduces uncertainty as to whether a potential cause has been considered and ruled out, or not considered at all.
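The bookkeeping described above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the Cause Mapping method itself: the `Cause` class, the `Status` values, and the "[ruled out]" text marker (standing in for a crossed-out box) are all assumptions made for the example, and the causes and evidence notes are generic placeholders rather than findings from any investigation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    POTENTIAL = "potential"   # no evidence yet
    SUPPORTED = "supported"   # evidence supports the cause
    RULED_OUT = "ruled out"   # evidence rules the cause out


@dataclass
class Cause:
    """Hypothetical record for one box on a Cause Map."""
    description: str
    status: Status = Status.POTENTIAL
    evidence: list = field(default_factory=list)

    def support(self, note):
        # Evidence supports the cause: it is no longer a potential cause.
        self.evidence.append(note)
        self.status = Status.SUPPORTED

    def rule_out(self, note):
        # Evidence rules the cause out: mark it, but keep it on the map.
        self.evidence.append(note)
        self.status = Status.RULED_OUT

    def label(self):
        if self.status is Status.POTENTIAL:
            return self.description + "?"
        if self.status is Status.RULED_OUT:
            return "[ruled out] " + self.description
        return self.description


# Generic placeholder causes; every cause starts out as a potential cause.
cause_map = [Cause("Cause A"), Cause("Cause B"), Cause("Cause C")]
labels_before = [c.label() for c in cause_map]

# Placeholder evidence notes, for illustration only.
cause_map[0].support("placeholder evidence supporting Cause A")
cause_map[1].rule_out("placeholder evidence against Cause B")
labels_after = [c.label() for c in cause_map]

print(labels_before)
print(labels_after)
```

Because a ruled-out cause stays in the list instead of being deleted, anyone reading the map later can tell the difference between a cause that was considered and dismissed and one that was never considered.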

In this case, the NTSB will be looking into runway conditions, landing procedures, and the condition of the plane. According to the airport, the runway was cleared within a few minutes of the plane landing, although the crew has said it appeared all white during landing. The National Weather Service reported 7″ of snow in the New York area on the day of the overrun. Procedures for closing runways or aborting landings are also being considered. Just prior to the landing, other pilots who had recently landed reported braking conditions as good.

The crew has also reported that although the auto brakes were set to max, they did not feel any deceleration. The entire braking system will be investigated to determine if equipment failure was involved in the accident. (Previous overruns have been due to brake system failures or the failure of reverse thrust from one of the engines, causing the plane to veer.) The pilot also reported that the automatic spoilers did not deploy, but that they were deployed manually.

Also being investigated are the landing speed and position, though there is no evidence to suggest that there was any issue with crew performance. As more information is released, it can be added to the investigation. When the cause-and-effect relationships are better determined, the NTSB can begin looking at recommendations to reduce future runway overruns.

Uncategorized

Working Conditions Raise Concerns at Fukushima Daiichi

February 4, 2015 Angela Griffith

By ThinkReliability Staff

The nearly 7,000 workers toiling to decommission the reactors at Fukushima Daiichi after they were destroyed by the earthquake and tsunami on March 11, 2011 face a daunting task (described in our previous blog). Recent events have led to questions about the working conditions and safety of these workers.

On January 16, 2015, the local labor bureau instructed the utility that owns the plants to reduce industrial accidents. (The site experienced 23 accidents in fiscal year 2013 and 55 so far this fiscal year.) Three days later, on January 19, a worker fell into a water storage tank and was taken to the hospital. He died the next day, as did a worker at Fukushima Daini when his head got caught in machinery. (Fukushima Daini is nearby and was less impacted by the 2011 event. It is now being used as a staging site for the decommissioning work at Fukushima Daiichi.)

Although looking at all industrial accidents will provide the most effective solutions, often digging into just one in greater detail will provide a starting point for site improvements. In this case, we will look at the January 19 fall at Fukushima Daiichi to identify some of the challenges facing the site that may be leading to worker injuries and fatalities.

A Cause Map, or visual form of root cause analysis, is begun by determining the organizational impacts as a result of an incident. In this case the worker fall impacted the safety goal due to the death of the worker. The environmental goal was not impacted. (Although the radiation levels at the site still require extensive personal protective equipment, the incident was not radiation-related.) Workers on site have noted difficult working conditions, which are thought to be at least partially responsible for the rise in incidents, as is the huge number of workers at the site (itself an impact to the labor/time goal). Lastly, local organizations have raised regulatory concerns due to the high number of incidents at the site.

An analysis of the issues begins with one impacted goal. In this case, the worker death resulted from a fall into an empty ten-meter tank. The worker was apparently not found immediately (though specific timeline details and whether or not that impacted the worker’s outcome have not been released) because it appears he was working alone, likely due to the massive manpower needs at the site. Additionally, the face masks worn by all workers (due to the high radiation levels still present) limit visibility.

The worker was checking for leaks at the top of the tank, which is being used to store water used to cool the reactors at the site. There is a general concern about lack of knowledge of workers (many of whom have been hired recently with little or no experience doing the types of tasks they are now performing), though again, it’s unclear whether this was applicable in this case. Of more concern is the ineffective safety equipment – apparently the worker did not securely fasten his safety harness.

Both this and the fall itself are likely due to worker fatigue or lack of concentration. Workers at the site face difficult conditions doing difficult work all day (or night) long, and have to travel far afterwards, as the surrounding area is still evacuated. Reports of mental health issues and fatigue in these workers have led to the opening of a new site providing meals and rest for these workers.

These factors are likely contributing to the increase in accidents, as is the number of workers at the site, which doubled from December 2013 to December 2014. Local organizations are still calling for action to reduce these accidents. “It’s not just the number of accidents that has been on the rise. It’s the serious cases, including deaths and serious injuries that have risen, so we asked Tokyo Electric to improve the situation,” says Katsuyoshi Ito, a local labor standards inspector.

In addition to improving working conditions, the site is implementing improved worker training and is looking at discharging wastewater instead of storing it, which would reduce the amount of equipment that must be monitored and maintained. Improvements must be made, because decades of work remain before decommissioning at the site is complete.

Click here to sign up for our FREE webinar “Root Cause Analysis Case Study: Fukushima Daiichi” at 2:00 pm EDT on March 12 to learn more about how the earthquake and tsunami on March 11, 2011 impacted the plant.

Root Cause Analysis - Incident Investigation

Bad Weather Believed to Have Brought Down AirAsia Flight QZ8501

January 8, 2015 Angela Griffith

By ThinkReliability Staff

AirAsia flight QZ8501, with 162 people on board, was lost on December 28, 2014 while flying through high-altitude thunderstorms. Because of a delay in finding the plane and continuing bad weather in the area, the black box, which contains data that will give investigators more detail on why the plane went down, has not yet been recovered. Even without the black box’s data, experts believe that the terrible weather in the area was a likely cause of the crash.

“From our data it looks like the last location of the plane had very bad weather and it was the biggest factor in behind the crash. These icy conditions can stall the engines of the plane and freeze and damage the plane’s machinery,” says Edvin Aldrian, the head of Research at an Indonesian weather agency. Beyond the icing of engines, there are other theories on how weather-related issues may have brought down the plane.

Early speculation was that the plane was struck by lightning; while it may have been struck, experts say it’s unlikely that a strike would have brought the plane down, because modern planes are fairly well-equipped to deal with direct lightning strikes. High levels of turbulence can also result in stalling due to a loss of airflow over the wings. There are also some who believe the plane (an Airbus A320) may have been pushed into a vertical climb past the limit for safe operation (to escape the weather), which resulted in a stall.

While the actual mechanism of how the weather (or an unrelated issue) brought the plane down is still to be determined, aviation safety organizations are already implementing some interventions to increase the safety of air travel in the area based on some specific areas of concern. (These areas of concern can be viewed in a Cause Map, or visual root cause analysis, by clicking on “Download PDF” above.)

AirAsia pilots relied on “self-briefings” regarding the weather. Pilots in other locations have expressed concern about the adequacy of weather information pilots obtain using this method. Direct pilot briefings with dispatchers based on detailed weather reporting are recommended to ensure that pilots have the information they need to safely traverse areas of poor weather (or stay out of them altogether).

Heavy air traffic in the area delayed approval to climb out of the storm. At 6:12 local time the flight crew requested to climb to a higher altitude to attempt to escape the storm. Air traffic control did not attempt to respond to the plane until 6:17, at which point it could no longer be contacted. Air traffic in the area was heavy, possibly because:

The plane did not have permission to fly the route it was on. AirAsia was licensed to fly the route it was taking at the time of the crash four days a week, but not the day of the crash. The takeoff airport used incorrect information in allowing the plane to take off in the first place (and the airline certainly used incorrect information in trying to fly the route as well). The selection of the route has been determined not to be a factor in the crash, but it certainly may have resulted in the overcrowding that led to the delayed response from air traffic control. It also resulted in the airline’s flights on that route being suspended.

It took almost three days to find the plane. The delay is renewing calls for universal tracking of aircraft or real-time streaming of flight data that were initially raised after the loss of Malaysia Airlines flight MH370, which is still missing ten months after losing radar contact. (See our previous blog on the difficulties finding it.) Not only would this reduce the suffering of families while waiting to hear their loved ones’ fates, it would reduce resources required to find lost aircraft and, in cases where survival is possible, increase the chance of survival of those on the plane.