Tag Archives: root cause analysis

Hundreds of Flights Disrupted After Air-Traffic Control System Confused by U-2 Spy Plane

By Kim Smiley

Hundreds of flights were disrupted in the Los Angeles area on April 30, 2014 when the air traffic control system known as En Route Automation Modernization (ERAM) crashed.  It’s been reported that the presence of a U-2 spy plane played a role in the air traffic control issues.

This issue can be analyzed by building a Cause Map, a visual format for performing a root cause analysis.  A Cause Map intuitively lays out the cause-and-effect relationships so that the problem can be better understood and a wider range of solutions considered.  To build a Cause Map, the impacted goals are identified and “why” questions are asked to uncover all the causes that contributed to the issue.
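For readers who like to tinker, here is one rough way to picture that structure.  This is only an illustrative sketch in Python, with made-up function and variable names; it is not our Cause Mapping software or notation.  Each answer to a “why” question becomes a cause linked to the effect it explains, and the map can be walked outward from an impacted goal.

```python
# Illustrative sketch only: a Cause Map pictured as an effect -> causes graph.
# Names and structure are hypothetical, not ThinkReliability tooling.
from collections import defaultdict

cause_map = defaultdict(list)

def add_cause(effect, cause):
    """Record that 'cause' is one answer to 'why did this effect occur?'."""
    cause_map[effect].append(cause)

# Causes drawn from the ERAM example in this post.
add_cause("Schedule goal impacted", "50 flights canceled, 400+ delayed")
add_cause("50 flights canceled, 400+ delayed", "Planes unable to land or depart safely")
add_cause("Planes unable to land or depart safely", "ERAM system crashed")
add_cause("ERAM system crashed", "System overwhelmed rerouting many flights at once")
add_cause("System overwhelmed rerouting many flights at once",
          "System misinterpreted the U-2's altitude as a collision risk")

def print_map(effect, depth=0):
    """Walk the map from an impacted goal, printing causes left to right."""
    print("  " * depth + effect)
    for cause in cause_map[effect]:
        print_map(cause, depth + 1)

print_map("Schedule goal impacted")
```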

In this example, the schedule goal was clearly impacted because 50 flights were canceled and more than 400 were delayed.  Why did this occur?  The flight schedule was disrupted because planes could not land or depart safely while the air traffic control system used to monitor them was down.  The computer system crashed because it became overwhelmed when it tried to reroute a large number of flights in a short period of time.

The system attempted to reroute so many flights at once because its calculations showed a risk of plane collisions; it had misinterpreted the flight path, specifically the altitude, of a U-2 on a routine training mission in the area.  U-2s are designed for ultra-high-altitude reconnaissance, and the plane is reported to have been flying above 60,000 feet, well above any commercial flights.  The system didn’t recognize that the U-2 was thousands of feet above any other aircraft, so it frantically worked to reroute planes so they wouldn’t be in unsafe proximity.

It took several hours to sort out the problem, but the Federal Aviation Administration was able to implement a short-term fix relatively quickly and get the ERAM system back online.  The ERAM system is being evaluated to determine whether additional fixes are needed to prevent a similar problem from recurring.  It’s also worth noting that ERAM is a relatively new system (implementation began in 2002) that is replacing the obsolete 1970s-era hardware and software previously in place.  Hopefully there won’t be many more growing pains with the changeover to a new air traffic control system.

To see a high level Cause Map of this problem, click on “Download PDF” above.

Why Can’t the Missing Malaysia Airliner be Found?

By Holly Maher

On March 8, 2014, Malaysia Airlines Flight MH370 took off from Kuala Lumpur heading for Beijing, China.  The aircraft had 239 passengers and crew aboard.  Less than 1 hour into the flight, communication and radar contact were lost with the aircraft.  Forty-nine days later, the location and fate of the aircraft are still unknown despite a massive international effort to locate the missing airliner.  The search effort has dominated the news for the last month and the question is still out there: how, with today’s technology, can an entire aircraft go missing?

Since we may never know what happened to flight MH370, this analysis is intended to understand why we can’t find it and identify the causes required to produce this effect.  This will allow us to identify many possible solutions for preventing it from happening again.  We start by asking “why” questions and documenting the answers to visually lay out all the causes that contributed to this incident.  The cause-and-effect relationships lay out from left to right.

In this example, the Customer Service Goal is impacted because we are missing 239 passengers and crew.  This is caused by the fact that we can’t locate Malaysia Airlines Flight MH370.  The inability to locate the aircraft is a result of a number of causes over the 49-day period.  One reason is that 3 days were initially spent looking in the wrong location, along the original flight path from Kuala Lumpur to Beijing, in the Gulf of Thailand and the South China Sea.  The reason 3 days were mistakenly spent looking in this location is that the aircraft had left the original flight path and officials were unaware of that fact.  Why the aircraft left the original flight path is still unknown, but we can look at some of the causes that allowed the flight to leave the original flight path undetected.

One of the reasons the aircraft was able to leave the original flight path undetected was that air traffic control was unable to track the airplane with radar.  The transponder onboard the aircraft, which allows ground control to track the aircraft using airspeed and altitude, was turned off less than one hour into the flight.  We don’t know the reason the transponder was turned off; however, the fact that it is designed to be turned off manually is a cause of the transponder being turned off.  It is designed to be manually turned off to reduce risk in the event of failure or fire, and to reduce radio traffic when the airplane is on the ground.  After 9/11, when 3 out of the 4 hijacked airplanes had transponders that had been turned off, the airline industry debated the manual on/off design of the transponder, but aviation experts strongly supported the need for pilots to be able to turn off the transponders, as needed, for the safety of the flight.

Another reason the aircraft left the original flight path undetected was that the flight crew outside the cockpit did not communicate distress or a change of route.  This is because all communications from the airplane come from or through the cockpit.  The aircraft is not currently equipped to allow for communication, specifically distress communications, from outside the cockpit.

Days into the investigation, radar data was identified which showed the change of course of the aircraft.  This changed the area of the search away from the original flight path.  However, this radar detection was not identified in real time, as the plane was moving away from the original flight path.  This is also a cause of the aircraft being able to leave the flight path undetected.

Once the search moved west, the potential search area was incredibly large, another cause of being unable to locate the aircraft.  At its largest, the search area was 2.96 million square miles.  This was based on an analysis of how far the flight could have gotten with the amount of fuel on board.  Further analysis of satellite data, or “handshakes” with the computer framework on board the aircraft, continued to refine the search area.

Many people have asked why no one on the flight made cell phone calls indicating distress (if this was an act of terrorism).  The reason no cell phone calls were made is that cell phones do not work above 2,000 feet, because at that altitude there is no direct line to a cellular tower.

Another cause of being unable to locate MH370 is being unable to locate the black box.  The black box is made of aluminum and is very heavy, designed to withstand significant forces in the event of a crash.  This causes the black box to sink, instead of float, making it difficult to locate.  The depth of the ocean in which the search is occurring ranges from 4,000-23,000 ft, adding to the difficulty of finding the black box.  Acoustic pings were last detected from the black box on April 8, 2014, 32 days into the search.  This is because the battery life of the black box is ~30 days.  This had been the battery design life criterion prior to the Air France Flight 447 crash in 2009.  It took over 2 years to locate the black box and wreckage from Flight 447, so the design criterion for black box battery life was changed from 30 days to 90 days.  This would allow search crews more time to locate the black box.  Malaysia Airlines Flight MH370 still had a black box with a battery life of 30 days.

Once the analysis has broken the incident down into its causes, solutions can be identified to mitigate the risk of a similar incident in the future.

To view the Outline and Cause Map, please click “Download PDF” above.  Or click here to read more.

Nearly 2.6 Million GM Vehicles Recalled, Costs Soar to $1.3 Billion

By Kim Smiley

During the first quarter of 2014, General Motors (GM) recalled 2.6 million vehicles due to ignition switch issues tied to at least 13 deaths.  Costs associated with the issue are estimated to be around $1.3 billion and, possibly even more damaging to the long-term health of the company, its reputation has taken a beating.

The ignition switch issues are caused by a small, inexpensive part called a switch indent plunger.  An ignition switch has four main positions (off, accessories, on, and start) and the switch indent plunger holds the ignition switch in position.  In the accidents associated with the recent recall, the ignition switch slipped out of the on position and into the accessories position because the switch indent plunger didn’t have enough torque to hold it in place.  When the ignition is in the accessories position, the car loses both power steering and power braking, and the air bags won’t inflate.  It’s easy to see how a situation that makes a car less safe and more difficult to control can quickly create a dangerous, or even deadly, situation.  Additionally, it’s important to know the problem is most likely to occur when driving on a bumpy surface or if a heavy key ring is pulling on the key.

The other key element of this issue is how the problem has been handled by GM.  There are a lot of hard questions being asked about what was known about the problem and when it was known.  It is known that the faulty part was redesigned in 2006 to address the problem, but the new design of the part wasn’t given a new part number as would normally be done.  Multiple federal inquiries are working to determine when it was known that the faulty parts posed a danger to drivers and why there was such a long delay before a recall was issued.  The fact that the redesigned part wasn’t assigned a new part number has also led to questions about whether there was an attempt to cover up the issue.  GM is not civilly liable for deaths and injuries associated with the faulty ignition switches because of its 2009 bankruptcy, but the company could potentially be found criminally liable.

No company ever wants to recall a product, but it’s important to remember that how the recall is handled is just as important as getting the technical details right.  Consumers need to believe that a company will do the right thing and that any safety concerns will quickly and openly be addressed.  Once consumers lose faith in a company’s integrity the cost will be far greater than the price of a recall.

If you drive a GM car, you can get more information about the recall here.  The recalled models are Chevrolet Cobalts and Pontiac G5s from the 2005 through 2007 model years; Saturn Ion compacts from 2003 through 2007; and Chevrolet HHR SUVs, and Pontiac Solstice and Saturn Sky sports cars from 2006 and 2007.

To view the Outline and Cause Map showing the root cause analysis of this issue, please click “Download PDF” above.  Or click here to read more.

Chicago O’Hare Commuter Train Derailment Injures 33

By Sarah Wrenn

At 2:49 AM on March 24, 2014, a Blue Line commuter train entered the Chicago-O’Hare International Airport Station, collided with the track bumper post, and derailed, coming to rest on an escalator and stairway.  Thirty-two passengers and the train operator were injured and transported to nearby hospitals.  Images of the lead rail car perched on the escalator look as if the train had been involved in filming an action movie.

So what caused a Chicago Transit Authority (CTA) train, part of the nation’s second largest public transportation system, to derail?  We can use the Cause Mapping process to analyze this specific incident with the following three steps: 1) Define the problem, 2) Conduct the analysis and 3) Identify the best solutions.

We start by defining the problem.  In the problem outline, you’ll notice we’ve asked four questions: What is the problem? When did it happen? Where did it happen? And how did it impact the goals?
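As an illustration only (the field names below are hypothetical, not part of the Cause Mapping method itself), the problem outline can be thought of as a simple structured record holding the answers to those four questions, filled in here with the details of this derailment:

```python
# Hypothetical problem-outline record for the O'Hare derailment, for illustration only.
problem_outline = {
    "what":  "Blue Line train struck the track bumper post and derailed",
    "when":  "2:49 AM, March 24, 2014",
    "where": "Chicago-O'Hare International Airport Station",
    "impact_to_goals": {
        "safety":   "33 people injured (32 passengers and the train operator)",
        "property": "Lead rail car came to rest on an escalator and stairway",
    },
}

# Print the outline one question at a time.
for question, answer in problem_outline.items():
    print(f"{question}: {answer}")
```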

Next we’ll analyze the incident.  We start with the impacted goals and begin asking “why” questions while documenting the answers to visually lay out all the causes that contributed to the incident.  The cause and effect relationships lay out from left to right.  As can be seen in the problem outline, this incident resulted in multiple goals being impacted.

In this incident, 33 people were injured when the train they were riding derailed in the O’Hare station, thereby affecting our safety goal of zero injuries.  The injuries were caused by the train derailing, so let’s dig in to why the train derailed.  Let’s first ask why the train operator was unable to stop the train.  Operator statements are crucial to understanding exactly what happened.  Here, it is important to avoid blame by asking questions about the process followed by the operator.  Interestingly, 45 seconds before the crash, the operator manually reduced the train speed.  However, at some point, the train operator dozed off.  The train operator’s schedule (she had worked nearly 60 hours the previous week), length of shift, and time off are all possible causes of the lack of rest.  Evidence that the operator was coming off an 18-hour break allows us to eliminate insufficient time off between shifts as a cause.  In addition, the train operator was relatively new (she qualified as a train operator in January 2014) and was an “extra-board” employee, meaning she substituted for other train operators who were out sick or on vacation.

Next, let’s ask why the train was unable to stop.  An automatic braking system is installed at this station and the system activated when the train crossed the fixed trip stop.  The train was unable to stop because there was insufficient stopping distance for the train’s speed.  At the location of the trip stop, the speed limit was 25 mph and the train was traveling 26 mph.  While the emergency braking system functioned correctly, the limited distance and the speed of the train did not allow the train to stop.

The train derailing impacted multiple organizational goals, but also the personal goal of the train operator, who was fired.  During the investigation, we learn that the train operator failed to appear at a disciplinary hearing and had a previous safety violation in which she dozed off and overshot a station.  These details reveal themselves on the Cause Map by asking “why” questions.

The final step of the investigation is to use the Cause Map to identify and select the best solutions that will reduce the risk of the incident recurring.  On April 4, 2014, the CTA announced proposed changes to the train operator scheduling policy.  In addition, the CTA changed the speed limit when entering a station and moved the trip stops to increase the stopping distance.  Each of these identified solutions reduces the risk of a future incident by addressing many of the causes identified during the investigation.

Risks of Future Landslides – and Actual Past Landslides – Ignored

By ThinkReliability Staff

Risk is determined by both the probability of a given issue occurring, and the consequence (impact) if it does. In the case of the mudslide that struck Oso, Washington on March 22, 2014, both the probability and consequence were unacceptably high.
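As a rough illustration of that definition (a simplification of my own, not a formal risk model), risk can be treated as the product of the two factors, so reducing either one reduces the overall risk:

```python
# Simplified illustration: risk as probability x consequence, on hypothetical 1-5 scales.
def risk_score(probability, consequence):
    """Both inputs on a 1 (low) to 5 (high) scale; a higher product means higher risk."""
    return probability * consequence

# In Oso, both factors were high: repeatedly documented slides (probability)
# and growing residential development in the slide path (consequence).
print(risk_score(probability=5, consequence=5))  # 25 -> unacceptably high
print(risk_score(probability=5, consequence=1))  # 5  -> same hazard, far fewer people exposed
```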

Not only had the probability of a landslide in the area been well documented in reports as far back as 1951, but the same area where dozens were killed on March 22 had also experienced 5 prior landslides since 1949.  The consequences of those prior landslides were smaller than in 2014, both because the 2014 landslide was more severe and because increased residential development since then meant more people were in harm’s way.

While the search for victims is still ongoing, the causes and impacts of the landslide are mostly known. This incident can be analyzed using a Cause Map, or visual root cause analysis, to show the cause-and-effect relationships that led to the tragic landslide.

First, we capture the background information and the impact to the goals in the problem outline, thereby defining the problem.  The landslide (actually a reactivation of an existing landslide, according to Professor Dave Petley, in his blog) occurred around 10:40 a.m. on March 22, 2014 in an Oso, Washington residential area.  As previously noted, there had been prior landslides in the area, and there were outdated boundaries used for logging permissions (which we’ll talk more about later).  The safety goal was impacted due to the 30 known deaths and 15 people missing.  (Not all of the 27 have been identified, so the known dead and missing numbers may overlap.  However, at this point, there is little hope that any additional rescues will take place.)  The environmental goal was impacted due to the landslide, and the customer service goal (insofar as the residents can be considered customers of their local area) was impacted due to the displacement of 30 families.  Logging in an area that should have been protected impacts the regulatory goal.  The estimated losses (of residences and belongings) are approximately $10 million, impacting the property goal, and the massive search and recovery effort impacts the labor goal.

Beginning with these impacted goals, asking “why” questions allows us to develop cause-and-effect relationships showing how the incident occurred.  The safety goal was impacted because of the deaths and missing, which resulted from people being overcome by a landslide.  In order for this to occur, the landslide had to occur, and the people had to be in the vicinity of the landslide.

As is known from history (see the timeline on the downloadable PDF), this area is prone to landslides.  Previous reports identified the erosion of the area due to the proximity of the river as a cause of these landslides.  An additional cause is water seepage in the area.  Water seepage is increased when the water table rises from overly wet weather (as is typically found at the end of winter).  Trees can help reduce water seepage by absorbing the water.  When trees are removed, water seepage in an area can increase significantly.  Because of this, removal of trees (for logging or other purposes) is generally restricted near areas prone to landslides.  However, for reasons yet unknown, logging was permitted in what should have been a restricted area, because the maps used to allow it were outdated.  Says the geologist who developed the new maps, “I suspect it just got lost in the shuffle somewhere.”  Additionally, according to analysis by the Seattle Times, the logging went into the “old” restricted area as well.  The State Forester is investigating the allegations and whether the logging played a role in the landslide.

Regardless of the magnitude of the impact of the logging and weather, the area was prone to landslides.  Yet it was allowed to be developed, despite multiple reports warning of danger and five previous landslides.  In fact, construction in the area resumed just three days after the last landslide in 2006.  The 2006 landslide also interrupted a plan to divert the river farther from the landslide area.  Despite all of this, the area was built up (with houses built as recently as 2009) and those residents were allowed to stay.  (While buying out the residents was under consideration, it was apparently dismissed because the residents did not want to move.)  While officials in the area maintain that they thought it was safe, a long history of reports and landslides suggests otherwise.

If a lack of knowledge of the risk of the area continues to be a concern, aerial scanning with advanced technology (lidar) could help. Use of lidar in nearby Seattle identified four times the number of landslide zones that were spotted with aerial surveying, which is more typically used.

To view a summary of the investigation, including a timeline, problem outline and Cause Map, please click “Download PDF” above.

When You Call Yourself ThinkReliability…

By ThinkReliability Staff

While I was bombasting about the Valdez oil spill in 1989, one of those ubiquitous internet fairies decided that I did not really need the network connection at my remote office.  Sadly this meant that the attendees on my Webinar had to listen only to me speaking without seeing the pretty diagrams I made for the occasion (after a short delay to switch audio mode).

Though I have all sorts of redundancies built into my Webinar presentations (seriously, I use a checklist every time), I had not prepared for the complete loss of network access, which is what happened during my March 20, 2014 Webinar.  I’m not going to use the term “root cause”, because I still had another plan . . . (yep, that failed, too).

For our mutual amusement (and because I get asked for this all the time), here is a Cause Map, or visual root cause analysis – the very method I was demonstrating during the failure – of what happened.

First we start with the what, when and where.  No who because blame isn’t the point, though in this case I will provide full disclosure and clarify that I am, in fact, writing about myself.  The Webinar in question was presented on March 20, 2014 at 2:00 PM EST (although to my great relief the issues didn’t start until around 2:30 pm).  That little thorn in my side? It was the loss of a network connection at the Wisconsin remote office (where I typically present from).  I was using Citrix Online’s GoToWebinar© program to present a root cause analysis case study of the Valdez oil spill online.

Next we capture the impact to the organization’s (in this case, ThinkReliability’s) goals.  Luckily, in the grand scheme of things, the impacted goals were pretty minor.  I annoyed a bunch of customers who didn’t get to see my slides, and I spent some time following up with those who were impacted, scheduling an additional Webinar, and writing this blog.

Next we start with the impacted goals and ask “Why” questions.  The customer service goal was impacted because of the interruption in the Webinar.  GoToWebinar© (as well as other online meeting programs) has two parts: audio and visual.  I temporarily lost audio as I was using the online option (VOIP), which I use as a default because I like my USB headset better than my wireless headset.  The other option is to dial in using the phone.  As soon as I figured out I had lost audio, I switched to phone and was able to maintain the audio connection until the end of the Webinar (and after, for those lucky enough to hear me venting my frustration at my office assistant).

In addition to losing audio, I lost the visual screen-sharing portion of the Webinar.   Unlike audio, there’s only one option for this.  Screen sharing occurs through an online connection to GoToWebinar©.  Loss of that connection means there’s a problem with the GoToWebinar© program, or my network connection.  (I’ve had really good luck with GoToWebinar; over the last 5 years I have used the program at least weekly with only two connection problems attributed to Citrix.)  At this point I started running through my troubleshooting checklist.  I was able to reconnect to audio, so it seemed the problem was not with GoToWebinar©.  I immediately changed from my wired router connection to wireless, which didn’t help.  Meanwhile my office assistant checked the router and determined that the router was not connected to the network.

You will quickly see that at this point I reached the end of my expertise.  I had my assistant restart the router, which didn’t work, at least not immediately.  At this point, my short-term connection attempts (“immediate solutions”) were over.  Router troubleshooting (beyond the restart) or a call to my internet provider were going to take far longer than I had on the Webinar.

Normally there would have been one other possibility to save the day.  For online presentations, I typically have other staff members online to assist with questions and connection issues, who have access to the slides I’m presenting.  That presenter (and we have done this before) could take over the screen sharing while I continued the audio presentation.  However, the main office in Houston was unusually short-staffed last week (which is to say most everyone was out visiting cool companies in exciting places).  And (yes, this was the wound that this issue rubbed salt in), I had been out sick until just prior to the Webinar.  I didn’t do my usual coordination of ensuring I had someone online as my backup.

Because my careful plans failed me so completely, I scheduled another Webinar on the same topic.  I’ll have another staff member (at another location) ready online to take over the presentation should I experience another catastrophic failure (or a power outage, which did not occur last week but would also result in complete network loss to my location).  Also, as was suggested by an affected attendee, I’ll send out the slides ahead of time.  That way, even if this exact series of unfortunate events should recur, at least everyone can look at the slides while I keep talking.

To view my comprehensive analysis of a presentation that didn’t quite go as planned, please click “Download PDF” above.  To view one of our presentations that will be “protected” by my new redundancy plans, please see our upcoming Webinar schedule.

Microsoft Withdrawing Support for Windows XP, Still Used by 95% of World’s 2.2 Million ATMs

By ThinkReliability Staff

On April 8, 2014, Microsoft will withdraw support for its XP operating system.  While this isn’t new news (Microsoft made the announcement in 2007), it’s quickly becoming an issue for the world’s automated teller machines (ATMs).  Of the 2.2 million ATMs in the world, 95% run Windows XP.  Of these, only about a third will be upgraded by the April 8th deadline.

These banks then face a choice: upgrade to a newer operating system (which will have to be done eventually anyway), pay for extended support, or go it alone.  We can look at the potential consequences for each decision – and the reasons behind the choices – in a Cause Map, a visual form of root cause analysis.

First we look at the consequences, or the impacts to the goals.  The customer service goal is impacted by the potential exposure to security threats.  (According to Microsoft, it’s more than just potential.  Says Timothy Rains, Microsoft’s Director of Trustworthy Computing, “The probability of attackers using security updates for Windows 7, Windows 8, Windows Vista to attack Windows XP is about 100 per cent.”)  Required upgrades, estimated by security experts to cost each bank in the United Kingdom $100M (US), impact the production/schedule and property/equipment goals.  Lastly, if implemented, extended service/support contracts will impact the labor/time goal.  Though many banks have announced they will extend their contracts, the costs of such an extension are unclear and likely vary with particular circumstances.

As mentioned above, banks have a choice.  They can upgrade immediately, as will be required at some point anyway.  However, it’s estimated that most banks worldwide (about two-thirds) won’t make the deadline.  They will then continue to operate on XP, with or without an extended service/support contract.

Operating without an extended contract will create a high vulnerability to security risks – hackers and viruses.  It has been surmised that hackers will take security upgrades developed for other operating systems and reverse engineer them to find weaknesses in XP.  The downside of the extended contracts is the cost.

Given the risk of security issues with maintaining XP as an operating system, why haven’t more banks upgraded in the 7 years since Microsoft announced it would be withdrawing support?  There are multiple reasons.  First, because of the huge number of banks that still need to upgrade, experts available to assist with the upgrade are limited.  Many banks use proprietary software based on the operating system, so it’s not just the operating system that would need to be upgraded – so would many additional programs.

The many changes that banks have been dealing with as a result of the financial crisis may have also contributed to the delay.  (For more on the financial crisis, see our example page.)  Banks are having trouble implementing the many changes within the time periods specified.  Another potential cause is that banks may be trying to perform many upgrades together.  For example, some ATMs will move to a new operating system and begin accepting chip cards as part of the same upgrade.  (For more about the move towards chip cards, see our previous blog.)

Some banks are just concerned about such a substantial change.  “I ask these companies why they are using old software, they say ‘Come on, it works and we don’t want to touch that,'” says Jaime Blasco, a malware researcher for AlienVault.  The problem is, soon it won’t be working.

To view the Outline and Cause Map, please click “Download PDF” above.  Or click here to read more.

Cleaning up Fukushima Daiichi

By ThinkReliability Staff

The nuclear power plants at Fukushima Daiichi were damaged beyond repair during the earthquake and subsequent tsunami on March 11, 2011.  (Read more about the issues that resulted in the damage in our previous blog.)  Release of radioactivity as a result of these issues is ongoing and will end only after the plants have been decommissioned.  Decommissioning the nuclear power plants at Fukushima Daiichi will be a difficult and time-consuming process.  Both the process and the equipment being used are essentially being developed on the fly for this particular purpose.

Past nuclear incidents offer no help.  The reactor at Chernobyl which exploded was entombed in concrete, not dismantled as is the plan for the reactors at Fukushima Daiichi.  The reactor at Three Mile Island which overheated was defueled, but the pressure vessel and buildings in that case were not damaged, meaning the cleanup was of an entirely different magnitude.  Lake Barrett, the site director during the decommissioning process at Three Mile Island and a consultant on the Fukushima Daiichi cleanup, says that nothing like Fukushima has ever happened before.

An additional challenge?  Though the reactors have been shut down since March 2011, the radiation levels remain too high for human access (and will be for some time).  All access, including for inspection, has to be done by robot.

The decommissioning process involves 5 basic steps (though completing them will take decades).

First, an inspection of the site must be completed using robots.  These inspection robots aren’t your run-of-the-mill Roombas.  Because of the steel and concrete structures involved with nuclear power, wireless communication is difficult.  One type of robot used to survey got stuck in reactor 2 after its cable was entangled and damaged.   The next generation of survey robots unspools cable, takes up slack when it changes direction and plugs itself in for a recharge.  This last one is particularly important: not only can humans not access the reactor building, they can’t handle the robots after they’ve been in there.  The new robots should be able to perform about 100 missions before component failure, pretty impressive for access in a site where the hourly radiation dose can be the same as a cleanup worker’s annual limit (54 millisieverts an hour).

Second, internal surfaces will be decontaminated.  This requires even more robots, with different specialties.  One type of robot will clear a path for another, which will be outfitted to blast water and dry ice at surfaces, removing the outer layer and the radiation with it.  The robots will then vacuum up and remove the radioactive sludge from the building.  The resulting sludge will have to be stored, though the plan for that storage is not yet clear.

Third, spent fuel rods will be removed, further reducing the radiation within the buildings.  A shielded cask is lowered with a crane-like machine, which then packs the fuel assemblies into the cask.  The cask is then removed and transported to a common pool for storage.  (The fuel assemblies must remain in water due to the decay heat still being produced.)

Fourth, radioactive water must be contained.  An ongoing issue with the Fukushima Daiichi reactors is the flow of groundwater through contaminated buildings.  (Read more about the issues with water contamination in a previous blog.)  First, the flow of groundwater must be stopped.  The current plan is to freeze soil to create a wall of ice and put in a series of pumps to reroute the water.    Then, the leaks in the pressure vessels must be found and fixed.  If the leaks can’t be fixed, the entire system may be blocked off with concrete.

Another challenge is what to do with the radioactive water being collected.  So far, over 1,000 tanks have been installed.  But these tanks have had problems with leaks.    Public sentiment is against releasing the water into the ocean, though the contamination is low and of a form that poses a “negligible threat”.  The alternative would be using evaporation to dispose of the water over years, as was done after Three Mile Island.

Finally, the remaining damaged nuclear material must be removed.  More mapping is required, to determine the location of the melted fuel.  This fuel must then be broken up using long drills capable of withstanding the radiation that will still be present.  The debris will then be taken into more shielded casks to a storage facility, the location of which is yet to be determined.  The operator of the plant estimates this process will take at least 20 years.

To view the Process Map laid out visually, please click “Download PDF” above.  Or click here to read more.

Dangerous Combination: Propane Shortages and a Bitterly Cold Winter

By Kim Smiley

Propane shortages and skyrocketing prices in parts of the United States have made it difficult for some homeowners to affordably and consistently heat their homes this winter.   The brutally cold winter many regions are experiencing is also worsening both the causes and effects of the shortages.

A Cause Map can be built to help understand this issue.  Cause Maps are a visual format for performing a root cause analysis that intuitively lay out the causes that contributed to an issue to show the cause-and-effect relationships.  To view a high level Cause Map of this issue, click on “Download PDF” above.

Why have there been recent propane shortages in regions of the United States?  This question is particularly interesting given the fact that propane production in the United States has increased 15 percent in the past year.   One of the reasons that propane prices have dramatically increased is because of a spike in demand.  There was a larger than normal grain crop this fall, which was also wetter than usual.  Wet grains must be dried prior to storing to prevent spoiling and propane is used in the process.  Local propane supplies were depleted in some areas because five times more propane was used to dry crops this year than last.   About 5 percent of homes in the United States depend on propane for heat and the unusually frigid temperatures this winter have resulted in additional increases in propane demand.

In addition to the increase in demand, there have been issues replenishing local supplies of propane quickly enough to support the increased demand.  There have been some logistical problems transporting propane this winter.  The Cochin pipeline was out of service for repairs, limiting how quickly propane could be transported to areas experiencing shortages.  There were rail rerouting issues that impacted shipments from Canada.

Additionally, many are asking questions about what role propane exports have played in the domestic shortages.  Propane exports have quadrupled in the last 3 years.  New mining techniques and improved infrastructure have made exporting propane to foreign markets more lucrative, and companies have begun to ship more propane overseas.  As more propane is shipped to foreign markets, there is less available for use in the United States.

The propane shortages are an excellent example of supply and demand in action.  Increasing demand combined with decreasing supply will result in higher prices.  Unfortunately, addressing the problem isn’t simple.  There are very complex logistical and economic issues that need to be addressed, but if people don’t have access to affordable heating, the situation can quickly become dangerous, or even deadly.  In the short term, lawmakers are taking a number of steps to get propane shipped to the impacted areas, but how the US chooses to deal with this issue in the long term is still being debated.

1 Dead and 27 Hospitalized from Carbon Monoxide at Restaurant

By Holly Maher

On Saturday evening, February 22, 2014, 1 person died and 27 others were hospitalized due to carbon monoxide poisoning.  The individuals were exposed to high levels of carbon monoxide that had built up in the basement of a restaurant.  The restaurant was evacuated and subsequently closed until the location could be deemed safe and the water heater, located in the basement, was inspected and cleared for safe operation.

So what caused the fatality and 27 hospitalizations?  We start by asking “why” questions and documenting the answers to visually lay out all the causes that contributed to the incident.  The cause and effect relationships lay out from left to right.

In this example, the 1 fatality and 27 hospitalizations occurred because of an exposure to high levels of carbon monoxide gas, which is poisonous.  The exposure to high levels of carbon monoxide gas was caused not only by the high levels of carbon monoxide gas being present, but also because the restaurant employees and emergency responders were unaware of the high levels of carbon monoxide gas.

Let’s first ask why there were high levels of carbon monoxide present.  This was due to carbon monoxide gas being released into the basement of the restaurant.  The carbon monoxide gas was released into the basement because there was carbon monoxide in the water heater flue gas and because the flue gas pipe, intended to direct the flue gas to the outside atmosphere, was damaged.  The carbon monoxide was present in the flue gas because of incomplete combustion in the water heater.  At this point in the investigation, we don’t have any further information.  This can be indicated as a follow-up point on the Cause Map using a question mark.  We have also marked the reason for the flue gas pipe damage with a question mark, as we do not currently have the exact failure mechanism (physical damage, corrosion, etc.) for the flue gas pipe.  What we can identify as one of the causes of the flue gas pipe failure is an ineffective inspection process.  How do we know the inspection process was ineffective?  Because we didn’t catch the failure before it happened, which is the whole point of requiring periodic inspections.  This water heater had passed its annual inspection in March of 2013 and was due again in March 2014.

If we now ask the question “why were the employees unaware of the high levels of carbon monoxide present?”, we can identify that not only is carbon monoxide colorless and odorless, but also that there was no carbon monoxide detector present in the restaurant.  There was no carbon monoxide detector installed because one is not legally required by state or local codes.  The regulations only require carbon monoxide detectors to be installed in residences or businesses where people sleep, such as hotels.

Once all the causes of the fatality and hospitalizations have been identified, possible solutions to prevent the incident from happening again can be brainstormed.  Although we still have open questions in this investigation, we can already see some possible ways to mitigate this risk going forward.  One possible solution would be to legally require carbon monoxide detectors in restaurants.  This would have alerted both employees and responders of the hazard present.  Another possible solution would be to require more frequent inspections of this type of combustion equipment.

To view the Outline and Cause Map, please click “Download PDF” above.