
Spider in air monitoring equipment causes erroneously high readings

By Kim Smiley

Smoke drifting north from wildfires in Washington state has raised concerns about air quality in Calgary, but staff decided to check an air monitoring station after it reported an alarming rating of 28 on a 1-10 scale.  What they found was a bug, or rather a spider, in the system that was causing erroneously high readings.

The air monitoring station measures the amount of particulate matter in air by shining a beam of light through a sample of air.  The less light that makes it through the sample, the higher the number of particulates in the sample and the worse the quality of air.  You can see the problem that would arise if the beam of light was blocked by a spider.
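The relationship between transmitted light and the reported number can be sketched in a few lines. This is a hypothetical, Beer-Lambert-style mapping chosen for illustration; the logarithmic form and the 10-point scaling are assumptions, not the station’s actual calibration.

```python
import math

def particulate_reading(transmitted_fraction, scale_max=10.0):
    """Map the fraction of light that passes through an air sample to an
    air-quality number. Illustrative only: the logarithmic mapping and
    10-point scaling are assumed, not the monitor's real calibration."""
    if not 0.0 < transmitted_fraction <= 1.0:
        raise ValueError("transmitted fraction must be in (0, 1]")
    # More particulates -> less light gets through -> higher attenuation.
    attenuation = -math.log(transmitted_fraction)
    return attenuation * scale_max

print(round(particulate_reading(0.95), 1))  # clean air: well under 1
print(round(particulate_reading(0.06), 1))  # blocked beam: far off the 10-point scale
```

Note how a spider blocking most of the beam is indistinguishable, to the instrument, from catastrophically dirty air: both simply reduce the transmitted light.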

This example is a great reminder not to rely solely on instrument readings.  Instruments are obviously useful tools, but the output should always be run through the common sense check.  Does it make sense that the air quality would be so far off the scale?  If there is any question about the accuracy of readings, the instrument should probably be checked because the unexpected sometimes happens.

In this case, inaccurate readings of 10+ were reported by both Environment Canada and Alberta Environment before the issue was discovered and the air quality rating was adjusted down to a 4.  Ideally, the inaccurate readings would have been identified prior to posting potentially alarming information on public websites.  The timing of the spider’s visit was unfortunate because it coincided with smoky conditions that made the problem more difficult to identify, but extremely high readings should be verified before making them public if at all possible.

Adding an additional verification step when there are very high readings prior to publicly posting the information could be a potential solution to reduce the risk of a similar problem recurring.  A second air monitoring station could be added to create a built-in double check because an error would be more obvious if the monitoring stations didn’t have similar readings.
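One way to implement both ideas is a simple publish gate: hold any off-scale reading for inspection, and hold readings that disagree with a second station. This is a hypothetical sketch, not an actual agency procedure; the threshold values are assumptions.

```python
SCALE_MAX = 10.0  # top of the air-quality scale

def vet_reading(primary, secondary=None, tolerance=2.0):
    """Decide whether a reading is safe to publish automatically.
    Hypothetical sanity check; thresholds are illustrative assumptions."""
    if primary > SCALE_MAX:
        return "hold: off-scale reading, inspect instrument before publishing"
    if secondary is not None and abs(primary - secondary) > tolerance:
        return "hold: stations disagree, verify before publishing"
    return "publish"

print(vet_reading(28.0))       # off-scale -> held for inspection
print(vet_reading(4.0, 4.5))   # stations agree -> publish
print(vet_reading(9.0, 4.0))   # stations disagree -> held
```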

Depending on how often insects and spiders crawl into the air monitoring equipment, the equipment itself could be modified to reduce the risk of a similar problem recurring in the future.

To view a Cause Map, a visual root cause analysis, of this issue, click on “Download PDF” above.

Power grid near Google datacenter struck by lightning 4 times

By Kim Smiley

A small amount of data was permanently lost at a Google datacenter after lightning struck the nearby power grid four times on August 13, 2015. About five percent of the disks in Google’s Europe-west1-b cloud zone datacenter were impacted by the lightning strikes, but nearly all of the data was eventually recovered, with less than 0.000001% of the stored data unrecoverable.

A Cause Map, or visual root cause analysis, can be built to analyze this issue. The first step in the Cause Mapping process is to fill in an Outline with the basic background information such as the date, time and specific equipment involved. The bottom of the Outline has a spot to list the impacted goals to help define the scope of an issue. The impacted goals are then used to begin building the Cause Map. The impacted goals are listed in red boxes on the Cause Map and the impacts are the first cause boxes on the Cause Map. Why questions are then asked to add to the Cause Map and visually lay out the cause-and-effect relationships.

For this example, the customer service goal was impacted because some data was permanently lost. Why did this happen? Data was lost because datacenter equipment failed, because this particular data was stored on a less stable system, and because it wasn’t duplicated in another location. Google has stated that the lost data was newly written data located on storage systems that were more susceptible to power failures. The datacenter equipment failed because the nearby power grid was struck by lightning four times and was damaged. Additionally, the automatic auxiliary power systems and backup battery were not able to prevent data loss after the lightning damage.

When more than one cause was required to produce an effect, all the causes are listed vertically and separated by an “and”. You can click on “Download PDF” above to see a high level Cause Map of this issue that shows how an “and” can be used to build a Cause Map. A more detailed Cause Map could be built that could include all the technical details of exactly why the datacenter equipment failed. This would be useful to the engineers developing detailed solutions.

The final step in the Cause Mapping process is to develop solutions to reduce the risk of a problem recurring in the future. For this example, Google has stated that it is upgrading the datacenter equipment so that it is more robust against similar events in the future. Google also stated that customers should back up essential data so that it is stored in another physical location to improve reliability.

Few of us probably design datacenter storage systems, but this incident is a good reminder of the importance of having a backup. If data is essential to you or your business, make sure there is a backup that is stored in a physically separate location from the original. As with the “unsinkable” Titanic, it is always a good idea to include enough lifeboats, or backups, in a design just in case something you didn’t expect goes wrong. Sometimes lightning strikes four times, so it’s best to be prepared just in case.

Legionnaires’ Disease Outbreak Blamed on Contaminated Cooling Towers

By ThinkReliability Staff

An outbreak of Legionnaires’ disease has affected at least 115 and killed 12 in the South Bronx area of New York City. While Legionnaires’, a respiratory disease caused by breathing in vaporized Legionella bacteria, has struck the New York City area before, the magnitude of the current outbreak is catching the area by surprise. (Because vaporization is required, drinking water is safe, as is home air conditioning.) It’s also galvanizing a call for actions to better regulate the causes of the outbreak.

It’s important when dealing with an outbreak that affects public health to fully analyze an issue to determine all the causes that contributed to the problem. In the case of the current Legionnaires’ outbreak, our analysis will be performed in the form of a Cause Map, or visual root cause analysis. We begin by capturing the basic information (what, when and where) about the issue in a problem outline. Because the issue unfolded over months, we will reference the timeline (to view the analysis including the timeline, click on “Download PDF”) to describe when the incident occurred. Some important differences to note – people with underlying medical conditions and smokers are at a higher risk from Legionnaires’, and Legionella bacteria are resistant to chlorine. Infection results from breathing in contaminated mist, which has been determined to have come from South Bronx area cooling towers (which are part of the air conditioning and heating systems of some large buildings).

Next we capture the impact to the goals. The safety goal is impacted due to the 12 deaths, and 115 who have been infected. The customer service goal is impacted by the outbreak of Legionnaires’. The environmental and property goals are impacted because at least eleven cooling towers in the area have been found to be contaminated with Legionella. The issue is resulting in increased regulation, an impact to the regulatory goal, and testing and disinfection, which is being performed by at least 350 workers and is an impact to the labor goal.

The analysis begins by asking “why” questions from one of the impacted goals. In this case, the deaths resulted from an outbreak of Legionnaires’ disease. The outbreak results from exposure to mist from one of the contaminated cooling towers. The design of some cooling towers allows exposure to the mist produced. It is common for water sources to contain Legionella (which, again, is resistant to chlorine), but certain conditions allow the bacteria to “take root”: the damp, warm environment found in cooling towers and insufficient cleaning/disinfection. The cost of cleaning is believed to be an issue – studies have found that impoverished areas, like the one affected by this outbreak, are more prone to these types of outbreaks. Additionally, there are insufficient regulations regarding cooling towers. The city does not regularly inspect cooling towers. According to the mayor and the city’s deputy commissioner for disease control, there just hasn’t been enough evidence to indicate that cooling towers are a potential source of Legionnaires’ outbreaks.

Evidence would indicate otherwise, however. A study that researched risk factors for Legionnaires’ in New York City from 2002-2011 specifically indicated that proximity to cooling towers was an environmental risk. A 2010 hearing on indoor air quality discussed Legionella after a failed resolution in 2000 to reduce outbreaks at area hospitals. New York City is no stranger to Legionnaires’; the first outbreak occurred in 1977, just after Legionnaires’ was identified. There have been two previous outbreaks of Legionnaires’ this year. Had other outbreaks been examined, such as the 2012 outbreak in Quebec City, cooling towers would have been identified as a definite risk factor.

For now, though the outbreak appears to be waning (no new cases have been reported since August 3), the city is playing catch-up. Though it is requiring all cooling towers to be disinfected by August 20 and plans to increase inspections, right now there isn’t even a list of all the cooling towers in the city. Echoing the frustrations of many, Bill Pearson, a member of the committee that wrote standards to address the risk of Legionella in cooling towers, says “Hindsight is 20-20, but it’s not a new disease. And it’s not like we haven’t known about the risk of cooling towers, and it’s not like people in New York haven’t died of Legionnaires’ before.”

Ruben Diaz Jr., Bronx borough president, brings up a good point for other cities that may have Legionella risks from cooling towers: “Why, instead of doing a good job responding, don’t we do a good job proactively inspecting?” Let’s hope this outbreak will be a call for others to learn from these tragic deaths, and take a proactive approach to protecting their citizens from Legionnaires’ disease.

Extensive Contingency Plans Prevent Loss of Pluto Mission

By ThinkReliability Staff

Beginning July 14, 2015, the New Horizons probe started sending photos of Pluto back to earth, much to the delight of the world (and social media).  The New Horizons probe was launched more than 9 years ago (on January 19, 2006) – so long ago that when it left, Pluto was still considered a planet. (It’s been downgraded to dwarf planet now.)  A mission that long isn’t without a few bumps in the road.  Most notably, just ten days before New Horizons’ Pluto flyby, mission control lost contact with the probe.

Loss of communication with the New Horizons probe while it was nearly 3 billion miles away could have resulted in the loss of the mission.  However, because of contingency and troubleshooting plans built into the design of the probe and the mission, communication was able to be restored, and the New Horizons probe continued on to Pluto.

The potential loss of a mission is a near miss. Analyzing near misses can provide important information and improvements for future issues and response.  In this case, the mission goal is impacted by the potential loss of the mission (near miss).  The labor and time goals are impacted by the time for response and repair.  Because of the distance between mission control on earth and the probe on its way to Pluto, the time required for troubleshooting was considerable, owing mainly to the delay in communications, which had to travel nearly 3 billion miles (a 9-hour round trip).
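That nine-hour figure follows directly from the speed of light, which sets a hard floor on how quickly any command-and-response cycle with the probe could complete:

```python
SPEED_OF_LIGHT_MILES_PER_SEC = 186_282  # speed of light in miles per second

def round_trip_hours(distance_miles):
    """Round-trip light-speed signal delay, in hours."""
    return 2 * distance_miles / SPEED_OF_LIGHT_MILES_PER_SEC / 3600

# New Horizons was nearly 3 billion miles from Earth during the anomaly.
print(f"{round_trip_hours(3e9):.1f} hours")  # just under 9 hours per exchange
```

Every troubleshooting question asked of the spacecraft therefore cost roughly nine hours before the answer arrived, which is why the recovery took days rather than minutes.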

The potential loss of the mission was caused by the loss of communication between mission control and the probe.  Details on the error have not been released, but its description as a “hard to detect” error implies that it wasn’t noticed in testing prior to launch.  Because the particular command sequence that led to the loss of communication was not being repeated in the mission, once communication was restored there was no concern for a repeat of this issue.

Not all causes are negative.  In this case, the “loss of mission” became a “potential loss of mission” because communication with the probe was able to be restored.  This is due to the contingency and troubleshooting plans built into the design of the mission.  After the error, the probe automatically switched to a backup computer, per contingency design.  Once communication was restored, the spacecraft automatically transmitted data back to mission control to aid in troubleshooting.

Of the mission, Alice Bowman, the Mission Operations Manager, says, “There’s nothing we could do but trust we’d prepared it well to set off on its journey on its own.”  Clearly, they did.

Make safeguards an automatic step in the process

By Holly Maher

On the morning of May 13, 2015, a parent was following his normal morning routine on his way to work.  He dropped off his older daughter at school and then proceeded to the North Quincy MBTA (Massachusetts Bay Transportation Authority) station where he boarded a commuter train headed to work.  When he arrived, approximately 35 minutes later, he realized that he had forgotten to drop off his one-year-old daughter at her day care and had left her in his SUV in the North Quincy station parking lot.  The frantic father called 911 as he boarded a train returning to North Quincy.  Thankfully, the police and emergency responders were able to find and remove the infant from the vehicle.  The child showed no signs of medical distress as a result of being in the parked car for over 35 minutes.

Had this incident resulted in an actual injury or fatality, I am not sure I would have had the heart to write about it.  However, because the impact was only a potential injury or fatality, I think there is great value in understanding the details of what happened and specifically how we can learn from this incident.  Unfortunately, this is not an isolated incident.  According to kidsandcars.org, an average of 38 children die in hot cars annually.  About half of those children were accidentally left in the vehicle by a parent, grandparent or caretaker.  While some people want to talk about these incidents using the terms “negligence” or “irresponsibility”, in the cases identified as accidental it is clear the parents were not trying to forget their children.  They often describe going into “autopilot” mode and just forgetting.  How many of us can identify with that statement?

On the morning this incident happened, the parent was following his typical routine.  After dropping off his older child at school, he went into “autopilot” and went directly to the North Quincy MBTA station, parked and left the vehicle to board the train.  His one-year-old daughter was not visible to him at that point because she was in the back seat of the vehicle in a rear-facing car seat, as required by law.  Airbags were originally introduced in the 1970s but became more commercially available in the early 1990s.  In 1998, all vehicles were required to have airbags in both the driver and passenger positions.  This safety improvement, which has surely reduced deaths related to vehicle accidents, had the unintended consequence of placing children’s car seats in a position less visible to the parent.  The number of hot car deaths has significantly increased since the early 1990s.

On the morning of the incident the ambient conditions were relatively mild, about 59 degrees Fahrenheit.  However, the temperature in a vehicle can quickly exceed the ambient conditions due to what is called the greenhouse effect.  Even with the windows down, the temperature in a vehicle can rise quickly.  80% of that temperature rise occurs within the first 10 minutes.
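A rough way to see what those numbers imply is to model the cabin temperature as an exponential approach to a peak rise, with the time constant chosen so that 80% of the rise occurs in the first 10 minutes, as stated above. The 40-degree maximum rise is an assumed figure for illustration only; actual peaks depend on sunlight, vehicle color and glazing.

```python
import math

def cabin_temp(t_minutes, ambient_f=59.0, max_rise_f=40.0):
    """Rough exponential model of vehicle cabin heating. The 40 F maximum
    rise is assumed for illustration; the time constant is chosen so that
    80% of the rise occurs in the first 10 minutes, per the article."""
    tau = 10.0 / math.log(5.0)  # solves 1 - exp(-10/tau) = 0.8
    return ambient_f + max_rise_f * (1.0 - math.exp(-t_minutes / tau))

for t in (0, 10, 35):
    print(f"{t:>2} min: {cabin_temp(t):.0f} F")
```

Under these assumptions, even a mild 59-degree morning puts the cabin above 90 degrees within ten minutes, well before the parent’s 35-minute commute ended.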

When the parent arrived at his destination, approximately 35 minutes later, he realized he had forgotten the infant and reboarded a train to return to the North Quincy station.  Thankfully, the parent also called 911 which expedited the rescue of the infant.  The time in the vehicle would obviously have been longer had he not called 911.

One other interesting detail about this incident is that the parent reported that he normally had a “safeguard” procedure that he followed to make sure this didn’t happen, but he didn’t follow it on this particular day.  It is unknown what the safeguard was or why it wasn’t followed.  This certainly makes an interesting point: we don’t follow safeguards when we know something is going to happen, we follow safeguards in case something happens.  As I told my daughter (who didn’t want to wear her seatbelt on the way from school to home because it “wasn’t that far”), you wear your seat belt not because you know you are going to get into an accident, you wear it in case you get into an accident.

The solutions that have been identified for this incident have been taken directly from kidsandcars.org.  They promote and encourage a consistent process to manage this risk not when you know you are going to forget, but in case you forget.  Consider placing something you need (phone, shoe, briefcase, purse) on the rear floorboard so that you are required to open the rear door of the vehicle.  Always open the rear door when leaving your vehicle; this is called the “Look before you Lock” campaign.  Consider keeping a stuffed animal in the car seat; when the car seat is occupied, place the stuffed animal in the front seat as a visual cue/reminder that the child is in the car.  Consider implementing a process where the day care or caretaker calls if your child does not show up when expected.  This will minimize the amount of time the child might be left in the car.

For more information about this topic, visit kidsandcars.org.

Deadly Train Derailment Near Philadelphia

By Kim Smiley

On the evening of May 12, 2015, an Amtrak train derailed near Philadelphia, killing 8 and injuring more than 200.  The investigation is still ongoing with significant information about the accident still unknown, but changes are already being implemented to help reduce the risk of future rail accidents and improve investigations.

Data collected from the train’s onboard event recorder shows that the train sped up in the moments before the accident until it was traveling 106 mph in a 50 mph zone where the train track curved.  The excessive speed clearly played a role in the accident, but there has been little information released about why the train was traveling so fast going into a curve.  The engineer controlling the train suffered a head injury during the accident and has stated that he has no recollection of the accident. The engineer was familiar with the route and appears to have had all required training and qualifications.

As a result of this accident and the difficulty determining exactly what happened, Amtrak has announced that cameras will be installed inside locomotives to record the actions of engineers.  While the cameras may not directly reduce the risk of future accidents, the recorded data will help future investigations be more accurate and timely.

The excessive speed at the time of the accident is also fueling the ongoing debate about how trains should be controlled and the implementation of positive train control (PTC) systems that can automatically reduce speed.  There was no PTC system in place at the curve in the northbound direction where the derailment occurred, and experts have speculated that one would have prevented the accident. In 2008, Congress mandated nationwide installation and operation of positive train control systems by 2015.  Prior to the recent accident, the Association of American Railroads stated that more than 80 percent of the track covered by the mandate would not have functional PTC systems by the deadline. The installation of PTC systems requires a large commitment of funds and resources, as well as communication bandwidth that has been difficult to secure in some areas, and some think the end-of-year deadline is unrealistic. Congress is currently considering two different bills that would address some of the issues.  The recent deadly crash is sure to be front and center in their debates.
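The core idea of positive train control can be sketched as an overspeed check that escalates from a warning to an automatic penalty brake. Real PTC systems enforce braking curves computed from track geometry, grade and train dynamics; this simple threshold version, with an assumed warning margin, is only illustrative.

```python
def ptc_action(speed_mph, limit_mph, margin_mph=3):
    """Simplified sketch of a positive train control overspeed check.
    The 3 mph warning margin is an assumption for illustration; real PTC
    enforces braking curves, not a flat threshold."""
    if speed_mph > limit_mph + margin_mph:
        return "penalty brake"   # system takes over and slows the train
    if speed_mph > limit_mph:
        return "warn engineer"
    return "no action"

# The derailed train entered a 50 mph curve at 106 mph.
print(ptc_action(106, 50))  # an operating PTC system would brake automatically
```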

In response to the recent accident, the Federal Railroad Administration ordered Amtrak to submit plans for PTC systems at all curves where the speed limit is 20 mph less than the track leading to the curve for the main Northeast Corridor (running between Washington, D.C. and Boston).  Only time will tell how quickly positive train control systems will be implemented on the Northeast Corridor as well as the rest of the nation, and the debate on the best course of action will not be a simple one.

An initial Cause Map, a visual root cause analysis, can be created to capture the information that is known at this time.  Additional information can easily be incorporated into the Cause Map as it becomes available.  To view a high level initial Cause Map of this accident, click on “Download PDF”.

New Regulations Aim to Reduce Railroad Crude Oil Spills

By ThinkReliability Staff

The tragic train derailment in Lac-Mégantic, Quebec on July 6, 2013 (see our previous blog on this topic) ushered in new concerns about the transport of crude oil by rail in the US and Canada. Unfortunately, the increased attention has highlighted a growing problem: spills of crude oil transported via rail, which can result in fires, explosions, evacuations, and potentially deaths. (Luckily there have been no fatalities since the Lac-Mégantic derailment.) According to Steve Curwood of Living on Earth, “With pipelines at capacity the boom has led to a 4,000 percent increase in the volume of crude oil that travels by rail, and that brought more accidents and more oil spills in 2014 than over the previous 38 years.”

This follows a period of increases in railroad safety – according to the US Congressional Research Service, “From 1980 to 2012, railroads reduced the number of accidents releasing hazmat product per 100,000 hazmat carloads from 14 to 1.” From October 19, 2013 to May 6, 2015, there were at least 12 railcar derailments that resulted in crude oil spills. (To see the list of events, click on “Download PDF” and go to the second page.)

Says Sarah Feinberg, acting administrator of the Federal Railroad Administration (FRA), “There will not be a silver bullet for solving this problem. This situation calls for an all-of-the-above approach – one that addresses the product itself, the tank car it is being carried in, and the way the train is being operated.” All of these potential risk-reducing solutions are addressed by the final rule released by the FRA on May 1, 2015. (On the same day, the Canadian Ministry of Transport released similar rules.) In order to view how the various requirements covered by the rule impact the risk to the public as a result of crude oil spills from railcars, we can diagram the cause-and-effect relationships that lead to the risk, and include the solutions directly over the cause they control. (To view the Cause Map, or visual root cause analysis, of crude oil train car derailments, click on “Download PDF”.)

The product: Bakken crude oil (as well as bitumen) can be more volatile than other types of crude oil and has been implicated in many of the recent oil fires and explosions. In addition to being more volatile, the composition (and thus volatility) can vary. If a material is not properly sampled and characterized, proper precautions may not be taken. The May 1 rule incorporates a more comprehensive sampling and testing program to ensure the properties of unrefined petroleum-based products are known and provided to the DOT upon request.   (Note that in the May 6, 2015 derailment and fire in Heimdahl, North Dakota, the oil had been treated to reduce its volatility, so this clearly isn’t an end-all answer.)

The tank car: Older tank cars (known as DOT-111s) were involved in the Lac-Mégantic and other 2013 crude oil fires. An upgrade to these cars, known as the CPC-1232, was intended to reduce these accidents. However, CPC-1232 cars have been involved in all of the issues since 2013. Cynthia Quarterman, former director of the Pipeline and Hazardous Materials Safety Administration, says that the recent accidents involving the newer tank cars “confirm that the CPC-1232 just doesn’t cut it.”

The new FRA rule establishes requirements for any “high-hazard flammable train” (HHFT) transported over the US rail network. A HHFT is a train composed of 20 or more loaded tank cars of a Class 3 flammable liquid (which includes crude oil and ethanol) in a continuous block, or 35 or more loaded tank cars of a Class 3 flammable liquid across the entire train. Tank cars used in HHFTs constructed after October 1, 2015 are required to meet DOT-117 design criteria, and existing cars must be retrofitted based on a risk-based schedule.
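The HHFT definition is mechanical enough to express as a check. A sketch, assuming each car in the consist is simply flagged as a loaded Class 3 flammable tank car or not (the real rule involves further details this ignores):

```python
def is_hhft(cars):
    """Classify a train as a high-hazard flammable train (HHFT): 20+ loaded
    Class 3 flammable tank cars in a continuous block, or 35+ across the
    entire train. `cars` is a sequence of booleans flagging which cars are
    loaded Class 3 flammable tank cars. Simplified sketch of the FRA rule."""
    if sum(cars) >= 35:
        return True
    longest = run = 0
    for flammable in cars:
        run = run + 1 if flammable else 0
        longest = max(longest, run)
    return longest >= 20

# 25 flammable cars in one continuous block -> HHFT
print(is_hhft([True] * 25 + [False] * 10))   # True
# 30 flammable cars, but scattered in blocks of 10 -> not an HHFT
print(is_hhft(([True] * 10 + [False]) * 3))  # False
```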

The way the train is being operated: The way the train is being operated includes not only the mechanics of operating the train, but also the route the train takes and the notifications required along the way. Because the risk for injuries and fatalities increases as the population density increases, the rule includes requirements to perform an analysis to determine the best route for a train. Notification of affected jurisdictions is also required.

Trains carrying crude oil tend to be very large (sometimes exceeding one mile in length). This can impact stopping distance as well as increase the risk of derailment if sudden stopping is required. To reduce these risks, HHFTs are restricted to 50 mph in all areas, and 40 mph in certain circumstances based on risk (one of the criteria is urban vs. rural areas). HHFTs are also required to have in place a functioning two-way end of train or distributed power braking system. Advanced braking systems are required for trains including 70 or more loaded tank cars containing Class 3 flammable liquids and traveling at speeds greater than 30 mph, though this requirement will be phased in over decades.

It is important to note that this new rule does not address inspections of rails and tank cars. According to a study of derailments from 2001 to 2010, track problems were the most important causes of derailments (with broken rails or track welds accounting for 23% of total cars derailed). A final rule issued January 24, 2014 required railroads to achieve a specified track failure rate and to prioritize remedial action.

To view the May 1 rule regarding updates to crude-by-rail requirements, click here. To view the timeline of incidents and the Cause Map showing the cause-and-effect relationships leading to these incidents, click “Download PDF”.

Distraction Related Accidents: Eyes on Road, Hands on Wheel, AND Mind on Task

By Sarah Wrenn

Admit it – you’ve checked your phone while driving.  We’ve likely all been guilty of it at some point.  And despite knowing that we’re not supposed to do it – it’s against the law in most states and we understand that the distraction increases our risk of having an accident – we still do it.  Why?

On March 31, 2015, the National Transportation Safety Board (NTSB) held its first roundtable discussion on distractions within the transportation industry.  In 2015, the NTSB added “Disconnect from Deadly Distractions” to its “Most Wanted List of Transportation Safety Improvements for 2015.”  This list represents the NTSB’s priorities to increase awareness and support for key issues related to transportation safety.  Other critical topics include “Make Mass Transit Safer” and “Require Medical Fitness for Duty.”

Representatives from all modes of transportation, technology, law enforcement, insurance, researchers, advocates, and educators came together for discussion related to distractions facing vehicle operators.

“New technologies are connecting us as never before – to information, to entertainment, and to each other,” said NTSB Member Robert Sumwalt. “But when those technologies compete for our attention while we’re behind the wheel of a car or at the controls of other vehicles, the results can be deadly.”

Digging into the causes

So let’s take a look at some of the causes related to an accident where the operator is distracted.  In addition to the accident occurring because of the distraction, the level of driver expertise is also a factor.  A large effort has been made to raise awareness and provide education to teenage drivers.  This is in part because, as novice drivers, they have more limited exposure to driving situations and may not be able to react as a more skilled driver would.

Operators become distracted

We also want to understand the causes that led to the operator being distracted.  Three factors combine to produce operator distraction: the type (or mode) of distraction introduced, the duration of the distraction, and the individual’s inability to ignore it.  While the type of distraction plays a large role in taking the operator’s eyes off the road, hands off the wheel or mind off the task, the duration of the distraction is also a key factor.  For example, while one’s eyes remain on the road during a phone call, the duration of that call disengages the brain from the task for more time than the act of dialing the phone.  This is not to say that one of these actions is more or less impactful; it is important to note that they both play a role in distracting the individual.

It’s not just the text that is distracting

There are three primary forms of distractions – Visual (taking eyes off of the road), Manual (taking hands off of the wheel), and Cognitive (taking mind off of the task).  Visual and manual types of distractions are very easy to define and generally recognized as risky behaviors while operating a vehicle.  Cognitive distractions are less tangible and therefore more difficult to define.  Research and studies generally define cognitive distractions as when the individual’s attention is divided between two or more tasks.  While technology and activities such as texting or talking on the phone are typically identified as the primary forms of distraction, it is interesting to note that cognitive distractions such as allowing your mind to wander while operating a vehicle can be just as risky.  The AAA Foundation released a 2013 study “Measuring Cognitive Distraction in the Automobile.”  The study rates various tasks such as using a hands-free cell phone and listening to the radio according to the amount of cognitive workload imposed upon an operator.  The study concludes that “while some tasks, like listening to the radio, are not very distracting, others – such as maintaining phone conversations and interacting with speech-to-text systems – place a high cognitive demand on drivers and degrade performance and brain activity necessary for safe driving.”

The forum discussed the concept that the ability to multitask is actually a myth, with evidence and data concluding that for certain types of activities multitasking is not only difficult, but impossible.  For example, tasks such as navigation and speech require the use of the same circuits within the brain.  As such, the brain cannot do both tasks at once.  Instead, the brain is switching between these tasks, resulting in a reduction of focus on the primary task (driving) while attempting to perform a secondary task (speaking).  Therefore, attempting to multitask introduces a cognitive distraction that increases the risk of unsafe driving.

Just ignore it

Why don’t we just ignore the temptation to become distracted?

Our brains function by releasing serotonin and dopamine when an action occurs that makes us feel good.  Dr. Paul Atchley of the University of Kansas stated: “There is nothing more interesting to the human brain than other people.  I don’t care how you design your vehicle or your roadways, if you have technologies in the vehicle that allow you to be social, your brain will not be able to ignore them.  There are only two things we love, serotonin and dopamine.  The two reward chemicals that come along with all those other things that make us feel good.  There is really nothing more rewarding to us than the opportunity to talk to someone else.”

Surveys performed by various organizations have revealed that a large percentage of people (sometimes 3 out of 4) will admit to being distracted while driving.  Meanwhile, a staggering share (upwards of 90%) will rationalize the behavior, which is a sign of addiction.

Finally, the level of brain development controls our ability to respond to distractions.  A teenager has a less developed frontal cortex than an adult, which means, as Dr. David Strayer of the University of Utah explains: “Teens’ frontal cortex, the parts of the brain that do decision-making in terms of multitasking, are underdeveloped.”  Much of the attention on distracted driving is focused on teens, and this is justified because their brain development is not yet complete.  It is, however, important to note that this is not just an issue for teens who can’t be separated from their phones or seniors who don’t understand them; this is an issue that crosses all demographics.  Level of brain development is just one factor.

So what can we do?

At the end of the day, we want to identify solutions that will effectively reduce the risk of distraction-related accidents.  While there will always be some risk, it is key to take a comprehensive approach to education, technology, and policy.  Programs like EndDD.org and stopdistractions.org are focused on bringing awareness, education, and training to youth and adults about the risks of operating vehicles while distracted.  Technology can also be used in a variety of ways to reduce the risk of these types of accidents: sensors can be built into vehicles to identify distractions and alert drivers, or apps can disable functions so that the receipt of calls and texts is delayed.  Finally, establishing policies and laws that are realistic and enforceable is important so that individuals are held accountable for risky behaviors before an accident occurs.  No single solution is going to reach everyone, and no single solution is going to eliminate the risk of deadly accidents.  Each of these solutions has limitations, but each also has advantages.  With a balanced approach to raise awareness and education, provide resources and tools to drivers, and change the culture of what is acceptable while driving, we can reduce the number of accidents and save lives.


NTSB Roundtable: Disconnect from Deadly Distractions held March 31, 2015, from 9:00 a.m. – 4:00 p.m.

AAA Foundation: Measuring Cognitive Distraction in the Automobile, June 2013

Crash of Germanwings flight 9525 Leads to Questions

By ThinkReliability Staff

On March 24, 2015, Germanwings flight 9525 crashed into the French Alps, killing all 150 onboard. Evidence available thus far suggests the copilot deliberately locked the pilot out of the cockpit and intentionally crashed the plane. While evidence collection is ongoing, because of the magnitude of this catastrophe, solutions to prevent similar recurrences are already being discussed and, in some cases, implemented.

What is known about the crash can be captured in a Cause Map, or visual form of root cause analysis. Visually diagramming all the cause-and-effect relationships allows all related causes to be addressed, leading to a larger number of potential solutions. The analysis begins by capturing the impacted goals in the problem outline. In this case, the loss of 150 lives (everybody aboard the plane) is an impact to the safety goal and of primary concern in the investigation. Also impacted are the property goal, due to the loss of the plane, and the recovery and investigation efforts (which are particularly challenging in this case because of the hard-to-access location of the crash).
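As a rough illustration of the method, a Cause Map can be modeled as a simple graph of effect-to-cause relationships, traversed by repeatedly asking “why.”  The node names below are hypothetical simplifications of this incident for the sake of the sketch, not the actual published map:

```python
# Hypothetical sketch of a Cause Map as a graph: each effect maps to the
# cause(s) revealed by asking "why?".  Node names are simplified and are
# not the full analysis.
cause_map = {
    "150 deaths (safety goal)": ["plane crashed into the French Alps"],
    "plane crashed into the French Alps": [
        "copilot deliberately began descent",
        "pilot unable to re-enter cockpit",
    ],
    "pilot unable to re-enter cockpit": ["cockpit door set to 'lock'"],
}

def trace_causes(effect, depth=0):
    """Return the chain of causes behind an effect, found by asking 'why' repeatedly."""
    lines = []
    for cause in cause_map.get(effect, []):
        lines.append("  " * depth + "why? -> " + cause)
        lines.extend(trace_causes(cause, depth + 1))
    return lines

print("\n".join(trace_causes("150 deaths (safety goal)")))
```

Because each cause is itself an effect with its own causes, the traversal naturally produces the branching chains that a drawn Cause Map shows visually.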

Asking “Why” questions from the impacted goals develops cause-and-effect relationships. In this case, the deaths resulted from the crash of the plane into the mountains of the French Alps. So far, available information appears to support the theory that the copilot deliberately crashed the plane. Audio recordings of the pilot requesting re-entry into the cockpit, the normal breathing of the co-pilot, and the manual increase of speed of the descent while crash warnings sounded all suggest that the crash was deliberate. Questions have been raised about the co-pilot’s fitness for duty. Some have suggested increased psychological testing for pilots, but the trade association Airlines for America says that the current system (at least in the US) is working: “All airlines can and do conduct fitness-for-duty testing on pilots if warranted. As evidenced by our safety record, the U.S. airline industry remains the largest and safest aviation system in the world as a result of the ongoing and strong collaboration among airlines, airline employees, manufacturers and government.”

Some think that technology is the answer. The flight voice recorder captured cockpit alarms indicating an impending crash, but these were simply ignored by the co-pilot. If flight guidance software were able to take over for an incapacitated pilot (or one who deliberately ignores these warnings), disasters like this one could be avoided. Former Department of Transportation Inspector General Mary Schiavo says, “This technology, I believe, would have saved the flight. Not only would it have saved this flight and the Germanwings passengers, it would also save lives in situations where it is not a suicidal, homicidal pilot. It has implications literally for safer flight across the industry.”

Others say cockpit procedures should be able to prevent an issue like this. According to aviation lawyers Brian Alexander & Justin Green, in a blog for CNN, “If Germanwings had implemented a procedure to require a second person in the cockpit at all times – a rule that many other airlines followed – he would not have been able to lock the pilot out.”

After 9/11, cockpit doors were reinforced to prevent any forced entry (according to the Federal Aviation Administration, they should be strong enough to withstand a grenade blast). The doors have 3 settings – unlock, normal, and lock. Under normal settings, the cockpit can be unlocked by crewmembers with a code after a delay. But under the lock setting (to be used, for example, to prevent hijackers who have obtained the crew code from entering the cockpit), no codes will allow access. (The lock setting has to be reset every 5 minutes.) Because of the possibility a rogue crewmember could lock out all other crewmembers, US airlines instituted the rule that there must always be two people in the cockpit. (Of course, if only a three-person crew is present, this can cause other issues, such as when a pilot became locked in the bathroom while the only other two flight crew members onboard were locked in the cockpit, nearly resulting in a terror alert. See our previous blog on this issue.)
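The door behavior described above can be sketched as a small state check. This is a hypothetical model for illustration only, not the actual avionics logic; the mode names and return strings are assumptions:

```python
from enum import Enum

class DoorMode(Enum):
    """The three cockpit door settings described in the text."""
    UNLOCK = "unlock"
    NORMAL = "normal"
    LOCK = "lock"    # overrides the crew code; must be reset every 5 minutes

def request_entry(mode, has_crew_code):
    """Outcome of a cockpit entry request under a given door mode (illustrative)."""
    if mode is DoorMode.UNLOCK:
        return "door opens"
    if mode is DoorMode.NORMAL:
        # Under normal settings, a crew code opens the door after a delay
        return "door opens after delay" if has_crew_code else "entry denied"
    # Under the lock setting, no code is accepted at all
    return "entry denied"
```

The key point the sketch makes is that in LOCK mode, even a valid crew code (`request_entry(DoorMode.LOCK, True)`) is denied – which is exactly what allows a rogue crewmember to bar everyone else from the cockpit.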

James Hall, the former chairman of the National Transportation Safety Board, agrees. He says, “The flight deck is capable of accommodating three pilots and there shouldn’t ever be a situation where there is only one person in the cockpit.” In response, many airlines in Europe and Canada, including Germanwings’ parent company Lufthansa, have since instituted a rule requiring at least two people in the cockpit at all times.   Other changes to increase airline safety may be implemented after more details regarding the crash are discovered.

Software Error Causes 911 Outage

By Kim Smiley

On April 9, 2014, more than 6,000 calls to 911 went unanswered.  The problem was spread across seven states and went on for hours.  Calling 911 is one of those things that every child is taught and every person hopes they will never need to do, and having emergency calls go unanswered has the potential to turn into a nightmare.

The Federal Communications Commission (FCC) investigated this 911 outage and has released a study detailing what went wrong on that day in April.  The short answer is that a software error led to the unanswered calls, but there is nearly always more to the story than a single “root cause”.  A Cause Map, an intuitive format for performing a root cause analysis, can be used to better understand this issue by visually laying out the causes (plural) that led to the outage.

There are three steps in the Cause Mapping process. The first is to define an issue by completing an Outline that documents the basic background information and how the problem impacts the overall goals.  Most incidents impact more than one goal and this issue is no exception, but for simplicity let’s focus on the safety goal.  The safety goal was impacted because there was the potential for deaths and injuries.  Once the Outline is completed (including the impacts to the goals), the Cause Map is built by asking “why” questions.

The second step of the Cause Mapping process is to analyze the problem by building the Cause Map.  Starting with the impacted safety goal: why was there the potential for deaths and injuries?  This occurred because more than 6,000 911 calls were not answered.  The automated system designed to answer the calls wouldn’t accept new calls for hours.  There was a bug in the automated system’s software AND the issue wasn’t identified for a significant period of time.  The error occurred because the software used a counter with a pre-set limit to assign each call a tracking number.  When the counter hit the limit, it couldn’t assign a tracking number, so it quit accepting new calls.
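A minimal sketch of the counter flaw described above, using a deliberately tiny limit for illustration (the class and method names are hypothetical; the real system’s code and configured limit are not reproduced here):

```python
class CallRouter:
    """Illustrative model of a call router that assigns tracking numbers."""

    def __init__(self, limit):
        self.limit = limit        # pre-set maximum tracking number
        self.next_number = 0      # last tracking number assigned

    def accept_call(self):
        """Assign a tracking number, or reject the call once the counter is exhausted."""
        if self.next_number >= self.limit:
            # The bug: with no tracking number available, every new call is refused
            return None
        self.next_number += 1
        return self.next_number

router = CallRouter(limit=3)
print([router.accept_call() for _ in range(5)])  # [1, 2, 3, None, None]
```

Once the counter reaches its limit, every subsequent call fails the same way, which is why the outage persisted for hours instead of affecting only a handful of calls.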

The delay in identifying the problem is also important to capture in the investigation because the problem would have been much less severe if it had been found and corrected more quickly.  Any 911 outage is a problem, but one that lasts 30 minutes is less alarming than one that plays out over 8 hours.  In this example, the system identified the issue and issued alerts, but categorized them as “low level,” so they were never flagged for human review.

The final step in the Cause Mapping process is to develop and implement solutions to reduce the risk of the problem recurring.  To fix the software issue, the pre-set limit on the counter has been increased, and the counter will be checked periodically to ensure that the maximum isn’t hit again.  Additionally, to help identify problems more quickly, an alert has been added to notify operators when the number of successful calls falls below a certain percentage.
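The new operator alert could work roughly like the check below.  The function name and the 95% threshold are assumptions for illustration, not the actual configured values:

```python
def needs_human_review(total_calls, answered_calls, threshold_pct=95.0):
    """Flag an alert for operators when the call success rate drops below a threshold.

    The 95% default is an illustrative assumption, not the real configured figure.
    """
    if total_calls == 0:
        return False  # no traffic yet, nothing to judge
    success_rate = 100.0 * answered_calls / total_calls
    return success_rate < threshold_pct
```

During the April 2014 outage, a check like this would have escalated quickly: a success rate of 40% (e.g., 400 of 1,000 calls answered) falls far below any reasonable threshold, whereas normal operation stays above it.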

New issues will likely continue to crop up as emergency systems move toward internet-powered infrastructure, but hopefully the systems will become more robust as lessons are learned and solutions are implemented.  I imagine there aren’t many experiences more frightening than frantically calling 911 for help and having no one answer.

To view a high level Cause Map of this issue, including a completed Outline, click on “Download PDF” above.