Tag Archives: root cause analysis

City Facing Default

By ThinkReliability Staff

A small Rhode Island town is on the brink of financial disaster.  A low tax base and mounting liabilities are leaving Central Falls with few options short of filing for bankruptcy protection.  The town has requested financial assistance from state and federal governments and is begging pensioners to accept lower benefits.  But how did the city get to this point, and what can be done to keep neighboring towns – and the state itself – from bankruptcy?  A Cause Map visually shows how this occurred.

Like other towns facing financial difficulty, Central Falls took on more debt than it is now able to pay.  This two-part problem (too much debt, too little income) sits at the center of the Cause Map.  All of the effects Central Falls now faces – such as closed town services and the loss of local jobs – stem from the fact that the city had to cut spending.  The city had to cut spending because it is facing bankruptcy.  The Cause Map method allows us to trace the reasons back even further and build a complete picture.

The first piece is that the town has a large debt – $80M to be exact – in pension liabilities for its 214 city police officers and fire fighters; this is in addition to $25M in budget deficits over the next five years.  The generous pensions can be traced back to two state laws regarding public worker negotiations.  Rhode Island is one of the few states that allows workers unlimited collective bargaining, meaning that workers can negotiate for a higher salary for any reason.  Without any limits, talks often broke down.  When talks broke down, arbitrators stepped in, and their decisions were binding.  In past years, arbitrators often settled on benefits comparable to those in surrounding towns instead of what the city could actually afford.  Unlimited collective bargaining and binding arbitration together contributed to the poor negotiations and overly generous benefits.

The second piece is that the town doesn't have a large income.  It has a small tax base, since the median family income is only around $33,000.  Other sources of income have been pulled back as well – like state and federal funding.  The state is facing similar issues and is in no position to bail out the multiple municipalities at risk.  The federal government had extended aid, but rescinded it when Central Falls' credit rating was downgraded by Moody's.

Municipal bankruptcy is a rare occurrence, with fewer than 50 filings nationwide in the last three decades.  State bankruptcy is practically unheard of; Arkansas was the last state to default on its bonds, in the wake of the Great Depression.  This is due in part to bankruptcy laws put in place afterward to prevent such an occurrence.  When one town goes bankrupt, neighboring communities are often negatively affected.  The resulting domino effect could be disastrous.  Rhode Island is a small state with little room to maneuver if local towns – like Central Falls – start going bankrupt.

Great Seattle Fire

By ThinkReliability Staff

On June 6, 1889, a cabinet-maker was heating glue over a gasoline fire.  At about 2:30 p.m., some of the glue boiled over and thus began the greatest fire in Seattle’s history.  We can look at the causes behind this fire in a visual root cause analysis, or Cause Map.  A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page.

First we begin with the impacts to the goals.  There was one confirmed death resulting from the fire, and other fatalities resulting from the cleanup.  These are impacts to the safety goal.  The damage to the surrounding areas can be considered an impact to the environmental goal.  The fire-fighting efforts were insufficient; this can be considered an impact to the customer service goal.  Loss of water and electrical services is an impact to the production goal, the destruction of at least 25 city blocks is an impact to the property goal, and the rebuilding efforts are an impact to the labor goal.

Beginning with these impacted goals, we can lay out the causes of the fire.  The fire did so much damage because of the large area it covered.  It was able to spread over downtown Seattle because it continued to have the three elements required for fire – heat, fuel, and oxygen.  The heat was provided by the initial fire, the oxygen by the atmosphere, and plenty of fuel by dry timber buildings.  The weather had been unusually dry for the Pacific Northwest, and most of the downtown area had been built with cheap, abundant wood.

Additionally, fire fighters were unable to successfully douse the flames.  The all-volunteer fire department (most of whom reportedly quit after this fire) had insufficient water – hydrants were placed only at every other block, and the water pressure was unable to sustain multiple fire-fighting hoses.  Some of the water piping was also made of wood, and it burned in the fire.  Firefighters attempted to pump water from the nearby bay, but their hoses were not long enough.

Before spreading across the city, the fire spread across the building where it began.  The fire began when glue being heated on a gasoline fire boiled over and lit.  The fire then began to burn the wood chips and turpentine spilled on the floor.  When the worker attempted to spray water at the fire, it only succeeded in spreading the lit turpentine, and thus the fire.  When firefighters arrived, the smoke was so thick that they were unable to find the source of the fire, and so it continued to burn.

The city of Seattle instituted many improvements as a result of this fire.  Wooden buildings were banned in the district, and wood pipes were replaced.  A professional fire department was formed, and the city took over the distribution of water.  Possibly because of the vast improvements being made (and maybe because of the reported death of 1 million rats in the fire), the population of Seattle more than doubled in the year after the fire.

View the Cause Map by clicking on “Download PDF” above

The Side Effects of Fracking: Explosive Water?

By ThinkReliability Staff

America’s push for clean energy has certainly been a source of intense debate – the safety of off-shore drilling, the hidden costs of ethanol subsidies, even the aesthetics of wind farms.  New evidence is set to increase the intensity on yet another topic – the debate over hydraulic fracturing.

Hydraulic fracturing is a process where internal fluid pressure is used to extend cracks, or fractures, into a rock formation.  It can occur in nature, but in man-made operations fractures are made deep in the earth by pumping fluid (mostly water) and a proppant (such as sand) out the bottom of a well.  The proppant prevents the fracture from closing back up after the injection of fluid stops.  Chemicals are sometimes added to the pumping fluid to aid in the process.  These fractures allow the gas or liquid trapped in the rock formation to flow back through the fracture, up the well and out for production.

More commonly known as “fracking”, the technique is used to release natural gas from shale rock formations.  These formations, especially common on the East Coast and in Canada, have provided thousands of new, well-paying jobs.  Fracking has allowed natural gas companies to access enormous reserves of natural gas previously thought inaccessible and prohibitively expensive to drill.  In fact, fracking has allowed drillers to tap what is potentially the world’s largest known reserve of natural gas in the Marcellus and Utica shale deposits, stretching from New York to Georgia.

As with any new technology, however, there are potential consequences.  Lawmakers and regulators have debated the safety of the largely unregulated fracking industry, but with little definitive evidence either way…until now.  A study by Duke University has concluded that fracking does indeed lead to methane contamination in drinking water.  Methane is the primary component of natural gas and is not lethal to consume.  However, high concentrations are explosive.

The study determined that fracking causes methane to leak into drinking water.  Water sources within a kilometer of drilling sites were found to have significant levels of methane, more than 17 times higher than in wells located farther away.  Furthermore, the methane was determined to be the much older methane released from the bedrock, rather than newer methane produced naturally in the environment.

The exact reason for this is unclear, but a Cause Map can lay out the possible areas needing further investigation.  For instance, the frack chemicals might enter the water supply accidentally during the drilling process.  Spills could also contaminate surface water, or chemicals could migrate into the water supply.

The study indicates that chemical migration is most likely what’s happening.  Surface spills, which have happened, are not a major contributor to the widespread methane contamination, so that cause can be left in the Cause Map but won’t be investigated further for our purposes.  Furthermore, the study produced no evidence that the drilling process itself was causing the contamination, so that block can be crossed off the Cause Map.

That leaves one possibility – migration.  The chemicals (including methane) could migrate in two different ways – through the well casing or through the bedrock.  The study’s authors felt it was unlikely that chemicals were migrating thousands of feet through bedrock, so migration through well casings experiencing high-pressure flow is more probable.  While more evidence is needed, it is possible that the well casings are weakened by the fracking process, which pushes sand through the casings at high pressure.

An EPA study aims to definitively determine fracking’s impact on drinking water, and specifically on human health.  However, that study is not scheduled to be completed until 2014.  Until then, lawsuits and tighter regulations are likely to dominate headlines.

Gaming Network Hacked

By Kim Smiley

Gamers worldwide have been twiddling their thumbs for the last two weeks, after a major gaming network was hacked last month.  Sony, a company with a strong reputation for security, quickly shut down the PlayStation Network after it learned of the attacks, but not before 100+ million customers were exposed to potential identity theft.  Newspapers have been abuzz with similar high-profile database breaches in the last few weeks, but this one seems to linger.  The shutdown has now prompted a Congressional inquiry and multiple lawsuits.  What went so wrong?

A Cause Map can help outline the root causes of the problem.  The first step is to determine how the event impacted company goals.  Because of the magnitude of the breach, there were significant impacts to customer service, property and sales goals.  The impact to Sony’s customer service goals is most obvious; customers were upset that the gaming and music networks were taken offline.  They were also upset that their personal data was stolen and they might face identity fraud.

However, these impacts changed as more information came to light and the service outage lingered.  Sony has faced significant negative publicity from the ongoing service outage and even multiple lawsuits.  Furthermore, customers were upset by the delay in notification, especially considering that the company wasn’t sure whether credit card information had been compromised as well.

As the investigation unfolded, new evidence came to light about what happened.  This provided enough information to start building an in-depth Cause Map.  It turns out that the network was hacked for three reasons.  Sony was busy fending off denial-of-service attacks, and simultaneously hackers (who may or may not have been affiliated with the DoS attacks) attempted to access the personal information database.  A third condition was required, though: the database had to actually be accessible to hack into, and unfortunately it was.

Why were hackers able to infiltrate Sony’s database?  At first, there was speculation that they may have entered Sony’s system through its trusted developer network.  It turns out that all the hackers needed to do was target the server software Sony was running.  That software was outdated and was not protected by a firewall.  With the company distracted, it was easy for hackers to breach these minimal defenses.

Most of the data that the hackers targeted was also unencrypted.  Had the data been encrypted, it would have been useless to them.  This raises major liability questions for the company.  To fend off both the negative criticism and the lawsuits, Sony has been proactive about implementing solutions to protect consumers from identity fraud.  U.S. customers will soon be eligible for up to $1M in identity theft insurance.  However, other solutions need to be implemented as well to prevent or correct the other causes.  Look at the Cause Map and notice that if you only correct issues related to fraud, there are still impacts without a solution.
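The point about encryption at rest can be made concrete.  The sketch below is a loose illustration only, not Sony’s actual setup; the record contents are invented, and the open-source Python cryptography package is assumed simply because it is widely available.  An attacker who copies encrypted records still needs the key:

# Illustrative only: encrypting a customer record at rest with the
# Python "cryptography" package (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # stored separately from the database
cipher = Fernet(key)

record = b"name=Jane Doe;card=4111111111111111"   # invented example data
stored = cipher.encrypt(record)    # what an attacker would exfiltrate

print(stored)                      # unreadable ciphertext without the key
print(cipher.decrypt(stored))      # original record, recoverable only with the key

Of course, key management then becomes the hard part; a stolen key makes encrypted data just as exposed as plain text.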

Sony obviously needs to correct the server software and encryption flaws which let the hackers access customers’ data in the first place.  Looking at the upper branch of the Cause Map is also important, because the targeted DoS attack and the possibly coordinated data breach jointly contributed to the system outage.  More detailed information on this branch will probably never become public, but further investigation might produce effective changes that would prevent a similar event from occurring.

Loss of Network Cloud Compute Service

By ThinkReliability Staff

On April 21, 2011, some of Amazon’s Elastic Compute Cloud (EC2) customers experienced issues when a combination of events left the East region’s Elastic Block Store (EBS) unable to process read or write operations.  This seriously impacted customer service.  Massive recovery efforts were undertaken, and services, and most data, were restored within 3 days.  Amazon has released its post-mortem analysis of these events.  Using the information provided, we can begin a visual root cause analysis, or Cause Map, laying out the event.

We begin with the affected goal.  Customer service was impacted because of the inability to process read or write operations.  This ability was lost due to a degraded EBS cluster.  (A cluster is a group of nodes, which are responsible for replicating data and processing read and write requests.)  The cluster was degraded by the failure of some nodes, and a high number of nodes searching for replicas.

At this point, we’ll look into the process to explain what’s going on.  When a user makes a request, a control plane accepts the request and passes it to an EBS cluster.  The cluster elects a node to be the primary replica of this data.  That node stores the data, and looks for other available nodes to make backup replicas.  If the node-to-node connection is lost, the primary replica searches for another node.  Once it has established connectivity with that node, the new node becomes another replica.  This process is continuous.
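To make that replication search easier to follow, here is a minimal sketch in Python.  The class and method names are invented for illustration; Amazon’s actual EBS node software is not public.

# Minimal sketch of the node replication search described above.
# Names and structure are illustrative, not Amazon's implementation.

class Node:
    def __init__(self, name):
        self.name = name
        self.replica_peer = None          # node currently holding the backup replica

    def lose_connection(self):
        """Node-to-node connectivity to the replica peer is lost."""
        self.replica_peer = None

    def find_new_replica(self, cluster, network_up):
        """Search the cluster for a free node to hold a new replica."""
        if not network_up:
            return False                  # cannot reach any peer; the search continues
        for candidate in cluster:
            if candidate is not self and candidate.replica_peer is None:
                self.replica_peer = candidate
                candidate.replica_peer = self
                return True
        return False

# A primary replica loses its peer and re-mirrors to a spare node.
cluster = [Node("n1"), Node("n2"), Node("n3")]
primary, old_peer = cluster[0], cluster[1]
primary.replica_peer, old_peer.replica_peer = old_peer, primary
primary.lose_connection()                                    # connectivity to n2 is lost
print(primary.find_new_replica(cluster, network_up=True))    # True: n3 becomes the new replica

When the network is unavailable for many nodes at once, each of them keeps returning to this search, which is the growing search volume described below.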

In this case, a higher number of nodes were searching for replicas because they had lost their connections to other nodes.  Following the process described above, these nodes began searching for new replica nodes.  However, they were unable to find any, because the network was unavailable and the nodes could not communicate with each other.  The nodes had a long time-out period for the search, so their searching continued while more and more nodes lost communication and began searches of their own, increasing the volume.

The network communication was lost because data was shifted off the primary network.  This was caused by an error during a network configuration change to upgrade the capacity of the primary network.  The data should have been transferred to a redundant router on the primary network but was instead transferred to the secondary network.  The secondary network did not have sufficient capacity to handle all the data and so was unable to maintain connectivity.

In addition to a large number of nodes searching for other nodes, the EBS cluster was impacted by node failures.  Some nodes failed because of a race condition in the node software that caused a node to fail when it attempted to process multiple concurrent requests for replicas, and those requests were generated by the situation described above.  Additionally, the failing nodes led to more nodes losing their replicas, compounding the difficulty of recovering from this event.

Service is back to normal, and Amazon has made some changes to prevent this type of issue from recurring.  Immediately, the data was shifted back to the primary network and the error which caused the shift was corrected.  Additional capacity was added to prevent the EBS cluster from being overwhelmed.  The retry logic which resulted in the nodes continuing to search for long periods of time has been modified, and the source of the race condition resulting in the failure of the nodes has been identified and repaired.
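Amazon has not published the modified retry logic, but the general pattern it describes, capping the number of attempts and backing off between them instead of searching indefinitely, might look something like this sketch (the limits shown are illustrative placeholders, not Amazon’s values):

import random
import time

def search_with_backoff(find_replica, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a replica search a bounded number of times with exponential backoff.

    find_replica is any callable returning True once a replica node is found.
    The attempt limit and delays are illustrative placeholders.
    """
    for attempt in range(max_attempts):
        if find_replica():
            return True
        # Exponential backoff with jitter keeps stranded nodes from
        # retrying in lockstep and overwhelming the network again.
        delay = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(delay + random.uniform(0, delay / 2))
    return False   # give up and leave recovery to a separate process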

View the root cause analysis investigation of this event – including an outline, timeline, Cause Map, solutions and Process Map, by clicking “Download PDF” above.

75-Year-Old Woman Cuts Internet Service to Armenia With a Shovel

By Kim Smiley

On March 28, 2011, a 75-year-old woman out digging for scrap metal accidentally cut internet service to nearly all of Armenia.  There were also service interruptions in Azerbaijan and part of Georgia.  Some regions were able to switch to alternative internet suppliers within a few hours, but some areas were without internet service for 12 hours.

How did this happen?  How could an elderly woman and a shovel cause such chaos without even trying?

A root cause analysis can be performed and a Cause Map built to show what contributed to this incident.  Building a Cause Map begins with determining the impacts to the organizational goals.  Then “why” questions are asked and causes are added to the map.

In this example, the Customer Service Goal is impacted because there was a significant internet service interruption, and the Production Schedule Goal was also impacted because of lost worker productivity.  The Material/Labor Goal also needs to be considered because of the cost of repairs.

Now causes are added to the Cause Map by asking “why” questions.  Internet service was disrupted because a fiber optic cable was damaged by a shovel.  In addition, this one cable provided 90 percent of Armenia’s internet so damaging it created a huge interruption in internet service.

Why would a 75-year-old woman be out digging for cables?  The woman was looking for copper cable and accidentally hit the fiber optic cable.  This happened because both types of cables are usually buried inside PVC conduit and can look similar.  The reason she was looking for copper cable is that there is a market for scrap metal.  Metal scavenging is a common practice in this region because there are many abandoned copper cables left in the ground.  She was also able to hit the fiber optic cable because it was closer to the surface than intended, likely exposed by mudslides or heavy rains.
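As a rough sketch only (this is not ThinkReliability’s Cause Mapping software), the chain of “why” answers above can be represented as a simple directed graph in which each effect points to its causes:

# The Cause Map for this incident expressed as an effect -> causes mapping.
cause_map = {
    "internet service interrupted": ["fiber optic cable damaged by a shovel",
                                     "cable carried 90% of Armenia's internet traffic"],
    "fiber optic cable damaged by a shovel": ["woman digging for scrap copper cable",
                                              "cable closer to the surface than intended"],
    "woman digging for scrap copper cable": ["market for scrap metal",
                                             "abandoned copper cables left in the ground"],
}

def ask_why(effect, depth=0):
    """Walk the map, printing each level of 'why' beneath an effect."""
    for cause in cause_map.get(effect, []):
        print("  " * depth + "why? " + cause)
        ask_why(cause, depth + 1)

ask_why("internet service interrupted")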

The woman, who had been dubbed the spade-hacker by local media, has been released from police custody.  She is still waiting to hear if she faces any punishment, but police statements implied that the prosecutor won’t push for the maximum of three years in prison due to her age.

To see the Cause Map of this issue, click on the “Download the PDF” button above.

Grounding the 737s: SWA Flight 812

By ThinkReliability Staff

As new information comes to light, processes need to be reevaluated.  A hole in the fuselage of a 15-year-old Boeing 737-300 led to the emergency descent of Southwest Airlines Flight 812.  737s have been grounded as federal investigators determine why the hole appeared.  At the moment, the consensus is that a lap joint supporting the top of the fuselage cracked.

While the investigation is still in its early stages, it appears that metal fatigue caused a lap joint to fail.  Fatigue is a well-known phenomenon, caused in aircraft by the repeated pressurization and depressurization that occur during each flight.  The mechanical engineers designing the aircraft would have been well aware of this phenomenon.  The S-N curve, which plots a metal’s expected lifespan against stress, has been used for well over a century.
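For readers unfamiliar with the S-N curve, Basquin’s relation is the classic equation for its high-cycle region.  The sketch below simply rearranges it to estimate cycles to failure; the coefficients are generic placeholders, not properties of the 737’s fuselage alloy.

# Basquin's relation for the high-cycle region of an S-N curve:
#   sigma_a = sigma_f * (2 * N_f) ** b
# Rearranged to estimate cycles to failure N_f at a given stress amplitude.
# sigma_f and b below are generic placeholders, not values for the 737's skin alloy.

def cycles_to_failure(stress_amplitude_mpa, sigma_f=900.0, b=-0.1):
    return 0.5 * (stress_amplitude_mpa / sigma_f) ** (1.0 / b)

for stress in (100, 150, 200):   # stress amplitude in MPa
    print(f"{stress} MPa -> {cycles_to_failure(stress):,.0f} cycles")

The point of the curve is the steepness of that relationship: a modest increase in stress amplitude removes an enormous fraction of a part’s expected life, which is why unexpected loading can undermine a maintenance schedule.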

Just as a car needs preventative maintenance, planes are inspected regularly for parts that are ready to fail.  However, the crack in the lap joint wasn’t detected during routine maintenance.  In fact, that joint wasn’t even checked.  This wasn’t an oversight, however.  Often the design engineers also set the maintenance schedule, because they hold the expertise needed to determine a reasonable procedure.  The engineers didn’t expect the part to fail for at least another 20,000 flight hours.  At the moment, it’s unclear why it failed so early.

In response to the incident, the FAA has grounded all similar aircraft and ordered inspections of aircraft nearing 30,000 flight hours.  Cracks have been found in 5 of the 80 grounded aircraft so far.  However, a looming concern is how to deal with 737s not based in the United States, and therefore outside the FAA’s jurisdiction.

Air Traffic Controller Asleep On the Job

By Kim Smiley

At least three times over the past decade, air traffic controller fatigue has been investigated by the National Transportation Safety Board (NTSB) in near-miss airline accidents.  Five years ago, controller fatigue was a significant factor in a Lexington, KY crash that killed 49, the last fatal crash related to this problem.  Last week, controller fatigue was in the news again when two early-morning aircraft landed without tower assistance at Reagan National Airport near Washington D.C.  The controller on duty had 20 years of experience, most of it at Reagan, and was also a supervisor.  But no level of experience can overcome the effects of fatigue.  After being relieved, the controller stated that he had worked the 10 p.m. to 6 a.m. shift four nights in a row.

Faced with harsh criticism over the latest incident, the FAA reacted by mandating a second controller at Reagan National Airport and reviewing traffic management policies at all single-person towers.  Regional radar controllers are now required to check in with single-person towers during night shifts to ensure controllers are prepared to handle incoming traffic.

Controller fatigue is a well known problem, and multiple solutions have been suggested over the past two decades.  It has been a part of the NTSB’s Most Wanted list since 1990.  In 2007 following the Lexington crash, the NTSB urged the Federal Aviation Administration (FAA) to overhaul their controller schedules, claiming that the stressful work and hectic pace were putting passengers and crews at risk.  The FAA responded, and is currently working with the National Air Traffic Controllers Association (NATCA) to develop “a science-based controller fatigue mitigation plan”.

In addition, from 2007 to 2011, more than 5,500 new air traffic controllers were hired.  However, many of these simply replaced air traffic controllers who were retiring, resulting in no net gain in the pool of available labor.  Air traffic controllers have a mandated retirement age of 56, with exceptions available up to age 61.  Additionally, on-the-job training is extensive, requiring two to four years just to receive initial certification.  Adding staff is therefore more difficult than it first appears.

Faced with an expected increase in air traffic and an aging infrastructure, the FAA has aggressively pursued a long-term modernization program called NextGen.  With the proposed modernization and staffing, the 2011 FAA budget request is now $1.14B, a $275M (31%) increase from 2010.  While material and personnel changes are often necessary, sometimes simpler solutions are equally effective or quicker to implement.

The associated Cause Map reflects the multiple solutions suggested, and even implemented, to combat the problem of controller fatigue.  As discussed, the FAA, NTSB and NATCA have pursued multiple paths to overcome the issue of controller fatigue.  However, as the Cause Map shows, there are multiple contributing factors in this case.  Controller fatigue isn’t the only reason those planes landed without tower assistance, and controller fatigue wasn’t caused by just four night shifts in a row.  Because there are multiple reasons why this happened, there are also multiple opportunities to prevent future problems.  The key isn’t eliminating all of the causes, but rather eliminating the right one.

Issues at Fukushima Daiichi Unit 3

By ThinkReliability Staff

There are many complex events occurring with some of Japan’s nuclear power plants as a result of the earthquake and tsunami on March 11, 2011.  Although the issues are still very much ongoing, it is possible to begin a root cause analysis of the events and issues.  In order to clearly show one issue, our analysis within this blog is limited to the issues affecting Fukushima Daiichi Unit 3.  This is not to minimize the issues occurring at the other plants and units, but rather to clearly demonstrate the cause-and-effect within one small piece of the overall picture.

The issues surrounding Unit 3 are extremely complex.  In events such as these, where many events contribute to the issues, it can be helpful to make a timeline of events.  A timeline of the events so far can be seen by clicking “Download PDF” above.  A timeline can not only help to clarify the order of contributing events, it can also help create the Cause Map, or visual root cause analysis.  To show how the events on the timeline fit into the Cause Map, some of the entries are denoted with numbers, which are matched to the same events on the Cause Map.  Notice that in general, because Cause Maps build from right to left with time, earlier entries are found to the right of newer events.  For example, the earthquake was the cause of the tsunami, so the earthquake is to the right of the tsunami on the map.  Many of the timeline events are causes, but some are also solutions.  For example, the venting of the reactor is a solution to the high pressure.  (It also becomes a cause on the map.)

A similar analysis could be put together for all of the units affected by the earthquake, tsunami and resulting events.  Parts of this Cause Map could be reused, as many of the issues affecting the other plants and units are similar to the analysis shown here.  It would also be possible to build a larger Cause Map including all impacts from the earthquake.

The impact to goals needs to be determined prior to building a Cause Map. As a direct result of the events at Unit 3, 7 workers were injured.  This is an impact to the worker safety goal.  There is the potential for health effects to the population, which is an impact to the public safety goal.  The environmental goal was impacted due to the release of radioactivity into the environment.  The customer service goal was impacted due to evacuations and rolling blackouts, caused by the loss of electrical production capacity, which is an impact to the production goal.  The loss of capacity was caused by catastrophic damage to the plant, which is an impact to the property goal.  Additionally, the massive effort to cool the reactor is an impact to the labor goal.

The worker safety and property goals were impacted because of a hydrogen explosion, which was caused by a buildup of pressure in the plant, caused in turn by increasing reactor temperature.  Heat continues to be generated by a nuclear reactor even after it is shut down, as a natural part of the operating process.  In this case, the normal cooling supply was lost when external power lines were knocked down by the tsunami (which was caused by the earthquake).  The tsunami also apparently damaged the diesel generators which powered the emergency cooling system.  The backup to the emergency cooling supply stopped automatically and was unable to be restarted, for reasons that are as yet unknown.
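The fact that a reactor keeps producing heat after shutdown can be roughly quantified with the classic Way-Wigner decay-heat approximation.  The sketch below is a textbook estimate under an assumed operating history, not a model of Unit 3:

# Way-Wigner approximation for decay heat after reactor shutdown:
#   P(t) / P0 ~= 0.0622 * (t ** -0.2 - (t0 + t) ** -0.2)
# where t is seconds since shutdown and t0 is seconds of prior operation at power.
# A rough textbook estimate only; it is not a model of Fukushima Daiichi Unit 3.

def decay_heat_fraction(t_seconds, t0_seconds=365 * 24 * 3600):
    return 0.0622 * (t_seconds ** -0.2 - (t0_seconds + t_seconds) ** -0.2)

for label, t in [("1 minute", 60), ("1 hour", 3600), ("1 day", 86400)]:
    print(f"{label} after shutdown: {decay_heat_fraction(t) * 100:.2f}% of full power")

Even at a fraction of a percent of full power, a large reactor still produces megawatts of heat, which is why losing every cooling path in succession leads to the temperature and pressure buildup described above.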

The outline, timeline and Cause Map shown on the PDF are extremely simplified.  Part of this simplification is due to the fact that the event is still ongoing and not all information is known or has been released.  Once more information becomes available, it can be added to the analysis, or the analysis can be revised.

To learn more about the reactor issues at Fukushima Daiichi, view our video summary.  To see a blog about the impact of the fallout on the health of babies in the US, see our healthcare blog.

Two Killed in Barge/Tour Boat Collision

By ThinkReliability Staff

On July 7, 2010, a barge being propelled by a tug boat collided with a tour boat that had dropped anchor in the Delaware River.  As a result of the collision, two passengers on the tour boat were killed and twenty-six were injured.  The tour boat sank in 55 feet of water.

Detail regarding the incident has just been released in an updated NTSB report.  We can use the information in this report to begin a Cause Map, or visual root cause analysis.  The information in the report can also point us toward important questions that remain to be answered to determine exactly what happened and, most importantly, how incidents like these can be prevented in the future.

In this case, the tour boat had dropped anchor to deal with mechanical problems.  According to the tour boat crew’s testimony and radio recordings, the crew attempted to get in touch with the tug boat by yelling and making radio calls.  Neither was answered or, apparently, even noticed.  The barge being propelled by the tug boat crashed into the tour boat, resulting in deaths, injuries and the loss of the tour boat.

The lookout on the tug boat was inadequate (had it been adequate, the tug boat would have noticed the tour boat in time to avoid the collision).  The report has determined that the tug boat master was off-duty and below-deck at the time of the collision.  According to cell phone records, the mate who was on lookout duty was on a phone call at the time of the collision and had made several phone calls during his duty. The inadequate lookout combined with the inability of the tour boat to make contact with the tug boat resulted in the collision.

There are two obvious areas where more detail is needed in the Cause Map to determine what was going on aboard the tug boat.  Specifically, why was the lookout on the cell phone, and why wasn’t the tour boat able to contact the tug boat by radio?  Because of the strict requirements for lookouts on marine duty, there is also an ongoing criminal investigation into the lookout’s actions.  When the final NTSB report is issued and the criminal case is closed, these questions should be answered.  More detail can be added to this Cause Map as the analysis continues.  As with any investigation, the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals.