All posts by Kim Smiley

Mechanical engineer, consultant and blogger for ThinkReliability, obsessive reader and big believer in lifelong learning

Train Crash in China Kills 39

By Kim Smiley

It is rare for the conduct of the investigation to be one of the biggest headlines in the week following an accident, but this has been the case after a recent train crash in China.  On July 23, 2011, two trains collided in Wenzhou, China, killing 39 and sending another 192 people to the hospital.

What appears to have happened is that a train moving at speed rear-ended another train that had stalled on the tracks. It was announced that the first train had stalled after a lightning strike. Soon after the accident, people reported seeing the damaged train cars broken apart by backhoes and buried, meaning the evidence was literally being buried without ever having been thoroughly examined. The Chinese government stated that the cars contained “State-level” technology and were being buried to keep it safe.

The internet frenzy and public outrage fueled by how this investigation was handled have been impressive. According to a recent New York Times article, 26 million messages about the tragedy have been posted on China’s popular Twitter-like microblogs. So powerful has the public outrage been that the first car from the oncoming train has been dug up and sent to Wenzhou for analysis.

More information on the technical reasons for the train crash is slowly coming to light. Five days after the accident, government officials stated that a signal which would have stopped the moving train failed to turn red and the error wasn’t noticed by workers. There is talk about system design errors and inadequate training.

It’s unlikely that all the details will ever be public knowledge, but there is one takeaway from this accident that can be applied to any organization in any industry that performs investigations – the importance of transparency. The Chinese government spent over $100 billion in 2010 expanding the high speed rail system, but if people don’t feel safe riding the rail system it won’t be money well spent.  Customers need to feel that an adequate investigation has been performed following an accident or they won’t use the products produced by the company.

To view an initial Cause Map built for this train accident, please click on “Download PDF” above.  A Cause Map is an intuitive, visual method of performing a root cause analysis.  One of the benefits of a Cause Map is that it’s easily understood and can help improve the transparency of an investigation for all involved.

Potential Power of Solar Flares

By Kim Smiley

The largest solar flare in recorded history occurred on September 1, 1859. As the energy released from the sun hit the earth’s atmosphere, the skies erupted in a rainbow of colored auroras that were visible as far south as Jamaica and Hawaii. The most alarming consequence of this “Carrington Event” (named for solar astronomer Richard Carrington, who witnessed it) was its effect on the telegraph system. Operators were shocked and telegraph paper caught fire.

No solar flares approaching the magnitude of the Carrington Event have occurred since, but the question must be asked – What if a similarly sized solar flare happened today?

There is some debate on how severe the consequences would be, but the bottom line is that modern technology would be significantly impacted by a large solar flare. When large numbers of charged particles bombard the earth’s atmosphere (as occurs during a large solar flare), the earth’s magnetic field is deformed. A changing magnetic field will induce current in wires that are inside it, resulting in large currents in electrical components within the earth’s atmosphere during a solar flare.
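The induction effect described above is usually expressed with Faraday's law. The block below is a general textbook statement rather than anything taken from the reports cited in this post; the symbols (the induced voltage $\mathcal{E}$, the magnetic flux $\Phi_B$ through the area $S$ enclosed by a conducting loop, and the magnetic field $\mathbf{B}$) are introduced here only for illustration.

```latex
\mathcal{E} = -\frac{d\Phi_B}{dt},
\qquad
\Phi_B = \int_S \mathbf{B} \cdot d\mathbf{A}
```

The faster the flux through a loop changes, the larger the induced voltage. Long conductors such as power lines, together with their return path through the ground, can enclose a very large area, which is one way to see why the power grid is so exposed when the earth's magnetic field shifts quickly.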

Satellites would likely malfunction, taking with them wireless communication, GPS capabilities and other technologies. This would severely impact the modern world, but the largest impact would likely be to the power grid. There is debate on how long power would be out and how severe the damage would be, but it is clear that solar flares have the ability to significantly damage the power grid. Solar flares much smaller than the Carrington Event have caused blackouts, but power was returned relatively quickly. One of the more impressive of these examples occurred in 1989 when the entire province of Quebec lost power for about 12 hours. (Click here to read more.)

NASA works to predict and monitor sun activity so that preventive actions can be taken to help minimize damage if a large solar flare occurs. For example, portions of the power grid could be shut down to help protect against overheating. Scientists continue to study the issue, working to improve predictions of solar flare activity and learn how to better protect technology from flares. Click the “Download PDF” button above to view a high level Cause Map, a visual root cause analysis, built for this issue.

More information can be found in a report by the National Academy of Sciences, Severe Space Weather Events–Understanding Societal and Economic Impacts, and on the NASA website.

Deadly E. coli Outbreak from Sprouts

By Kim Smiley

Since May, at least 31 people have died and nearly 3,000 have been sickened by E. coli infections in Europe in one of the most widespread and deadliest E. coli outbreaks in recent memory. After days of confusion, German authorities determined that the source of the contamination was sprouts from an organic farm in northern Germany. The farm has suspended the sale of produce and won’t reopen until it is determined to be safe.

This issue can be investigated by creating a Cause Map, an intuitive format for performing a root cause analysis.  In a Cause Map, the causes contributing to an incident are determined and organized by cause-and-effect relationships.  To view a high level Cause Map of this incident, please click on “Download PDF” above.

This investigation is still underway and additional information can easily be added to the Cause Map as it becomes available. The initial source of contamination at the farm has not yet been determined, but sprouts are known to have a high risk of carrying dangerous bacteria.

Sprouts are considered to be a high risk food for a number of reasons.  The seeds are often grown in countries with less stringent inspection criteria so they can arrive at growers already contaminated. Seeds can be contaminated in any number of ways.  E. coli live in the gut of mammals so any time animals or animal waste are near sprout seeds there is a chance of contamination.

It can also be difficult to sanitize the seeds.  Bacteria can hide inside damaged seeds and be missed during sanitizing steps.  Sprouts are also grown in warm water, ideal conditions for growing bacteria as well.  Another factor to consider is that many people eat sprouts raw; cooking would kill any bacteria that were present.

Sprouts have been the source of many bacterial outbreaks in the past. The U.S. has had at least 30 reported outbreaks related to sprouts in the last 15 years. Sprouts are associated with enough risk that the Food and Drug Administration has issued warnings for those at high risk (children, the elderly, pregnant women and people with compromised immune systems) to avoid eating raw sprouts. If you fall into the high risk category or are just feeling nervous after recent events, the easiest way to prevent bacterial infection from sprouts is to cook them.

Chicago Plans for a Warmer Future

By Kim Smiley

The very existence of climate change continues to be controversial, but some cities have already decided to start preparing for a hotter future.  While the rest of the world continues to debate whether man’s impact on the world is producing climate change, the city of Chicago is already taking action to prepare for a warmer climate.

The effort to adapt Chicago to the predicted climate of the future began in 2006 under then-mayor Richard M. Daley. The first step in the process was a model created by scientists specializing in climate change to predict how global warming would affect Chicago. The output of the model shocked city planners. Experts predicted that summers in Chicago would be like current summers in the Deep South, with as many as 72 days over 90 degrees by the end of the century. A private risk assessment firm was tasked with determining how the predicted climate shift would impact the city. The dire predictions included an invasion of termites, heat-related deaths reaching 1,200 a year and billions of dollars’ worth of deterioration to buildings and infrastructure in the city. Chicago decided the time to take action was now.

Created by Robert A. Rohde as part of the Global Warming Art project.

Armed with the predictions, city planners began to plan how best to adapt Chicago for the warmer future.  There are a number of ways that Chicagoans are already changing how they maintain the city.  Much attention has been given to the paved spaces in the city to improve drainage to accommodate higher levels of predicted rain.  13,000 concrete alleys in Chicago were originally built without drainage and city planners are working to change this.  150 alleys have already been remade with permeable pavers that allow 80 percent of rainwater to filter to the ground below.  City planners are also changing the mix of trees that are planted to make sure they are selecting varieties that can withstand hotter temperatures.  Air conditioning is also being planned for Chicago’s public schools, which have been heated but not air conditioned until now.

Time will tell whether the steps Chicago is taking will prove necessary, but Chicago’s adaptation strategy is an interesting case study in a nation still debating the existence of global warming.

When trying to select the best solutions to a problem such as this one, the Cause Mapping method of root cause analysis can be an effective way to organize all the information. A Cause Map detailing the many causes of a problem may make it easier to select the most cost effective and efficient means of preventing a problem. A Cause Map can also be adapted to fit the scope of the problem. In this example, a Cause Map could be built to detail the issue of preparing Chicago for a warmer future, or a bigger Cause Map could be built to tackle the problem of global warming on a larger scale.

To read more about the Chicago Climate Action Plan, please visit their website.

Nuclear Waste Stalemate in US

By Kim Smiley

America’s 104 commercial nuclear reactors produce about 2,000 metric tons of spent nuclear fuel each year. The United States currently has no long term solution in place to deal with spent nuclear fuel. The end result of this stalemate is that there are more than 75,000 tons of spent nuclear fuel at 122 temporary sites in 39 states with nowhere to go.

Much of the nation’s spent fuel is currently stored in pools near operating nuclear reactors or near sites where reactors once were. Recent events at the Fukushima nuclear plant in Japan have sparked discussion about the potential safety risk of having so much fuel stored near operating reactors, which creates a situation where a single event can trigger a larger release of radiation. To make things more complicated, storage pools at US plants are more heavily loaded than the ones at the Fukushima reactors. Additionally, the pools will reach capacity at some point in the not so distant future and the fuel will have to be moved if the US plans to continue operating nuclear reactors.

How did we get in this situation?  The problem of no long term solution for spent nuclear fuel can be analyzed by building a Cause Map.  A Cause Map is a visual root cause analysis that lays out the causes that contribute to a problem in an intuitive, easy to understand format. Click on “Download PDF” above to view a high level Cause Map of this issue.

Looking at the Cause Map, it’s apparent that one of the causes of this problem is that the plan for the Yucca Mountain Nuclear Waste Repository was canceled without an alternative plan being created. The Yucca Mountain Repository was planned to be a deep geological repository where nuclear waste would be stored indefinitely, shielded and packaged to prevent any release of radiation. The Yucca Mountain Repository was canceled in 2009 for a number of reasons, some technological and some political. Environmentalists and residents near the planned site were very vocal about their opposition to the selection of the Yucca Mountain site for the nation’s repository.

A Blue Ribbon Commission of experts appointed by President Obama recently presented its recommendations on how to approach this problem. The proposal was to develop one or more sites where spent reactor fuel could be stored in above ground steel and concrete structures. These structures could contain fuel for decades, allowing time for a more permanent solution to be developed. These structures would not require any cooling beyond simple circulation of air and they could be built at locations deemed safe, with the lowest risk of earthquakes and other disasters. Hopefully the recommendations by the commission are the first step to solving this problem and developing a safe long term storage solution for the nation’s nuclear waste.

Gaming Network Hacked

By Kim Smiley

Gamers worldwide have been twiddling their thumbs for the last two weeks, after a major gaming network was hacked last month. Sony, well known for its reputation for security, quickly shut down the PlayStation Network after it learned of the attacks, but not before 100+ million customers were exposed to potential identity theft. Newspapers have been abuzz with similar high-profile database breaches in the last few weeks, but this one seems to linger. The shutdown has now prompted a Congressional inquiry and multiple lawsuits. What went so wrong?

A Cause Map can help outline the root causes of the problem.  The first step is to determine how the event impacted company goals.  Because of the magnitude of the breach, there were significant impacts to customer service, property and sales goals.  The impact to Sony’s customer service goals is most obvious; customers were upset that the gaming and music networks were taken offline.  They were also upset that their personal data was stolen and they might face identity fraud.

However, these impacts changed as more information came to light and the service outage lingered.  Sony has faced significant negative publicity from the ongoing service outage and even multiple lawsuits.  Furthermore customers were upset by the delay in notification, especially considering that the company wasn’t sure if credit card information had been compromised as well.

As the investigation unfolded, new evidence came to light about what happened. This provided enough information to start building an in-depth Cause Map. It turns out that the network was hacked for three reasons. Sony was busy fending off Denial of Service attacks, and simultaneously hackers (who may or may not have been affiliated with the DoS attacks) attempted to access the personal information database. A third condition was required though. The database had to actually be accessible to hack into, and unfortunately it was.

Why were hackers able to infiltrate Sony’s database?  At first, there was speculation that they may have entered Sony’s system through its trusted developer network.  It turns out that all the hackers needed to do was target the server software Sony was running.  That software was outdated and did not have firewalls installed.  With the company distracted, it was easy for hackers to breach their minimal defenses.

Most of the data that the hackers targeted was also unencrypted. Had the data been encrypted, it would have been useless to the hackers. This raises major liability questions for the company. To fend off both the criticism and the lawsuits, Sony has been proactive about implementing solutions to protect consumers from identity fraud. U.S. customers will soon be eligible for up to $1M in identity theft insurance. However, other solutions need to be implemented as well to prevent or correct other causes. Look at the Cause Map; notice that if you only correct issues related to fraud, there are still impacts without a solution.

Sony obviously needs to correct the server software and encryption flaws which let the hackers access customers’ data in the first place. Looking to the upper branch of the Cause Map is also important, because the targeted DoS attack and possibly coordinated data breach jointly contributed to the system outage. More detailed information on this branch will probably never become public, but further investigation might produce effective changes that would prevent a similar event from occurring.

Plane Clips Another While Taxiing at JFK Airport

By Kim Smiley

Around 8:30 pm on April 11, 2011, a large passenger airplane taxiing at John F. Kennedy Airport in New York clipped the wing of a smaller plane. The larger plane involved in the incident was an Airbus A380 carrying 485 passengers and 25 crew members. The smaller plane was a Bombardier CRJ carrying 52 passengers and 4 crew members at the time it was clipped.

At the time of the accident, the Airbus was taxiing to take off and the CRJ had recently landed and was waiting to park.  The incident was caught on amateur video and it appears that the left wing tip of the Airbus struck the left horizontal stabilizer of the CRJ. No injuries were reported, but both planes sustained some damage.

After the planes made contact, the fire department responded as a precautionary measure.  Passengers were deplaned from the Airbus so that the planes could be inspected and information could be gathered to support the investigation.

At this time there is limited information available about what caused this incident, but the National Transportation Safety Board (NTSB) has begun an investigation. The NTSB has requested flight recorders from both airplanes and also plans to review the air traffic control tapes and the ground movement radar data to determine how this happened.

Even though the investigation is just getting started, it is still possible to create a Cause Map based on what is known. The first step is to create an Outline of the event by determining the impact to the organization’s goals. In this example, the Safety Goal was impacted because there was the potential for injuries, the Customer Service Goal was impacted because the passengers were unable to reach their destination, the Production Schedule Goal was impacted because the flight was unable to depart, and the Material and Labor Goal was impacted because there was damage to both planes.

From this point, Causes can be added to the Cause Map by asking “why” questions. Missing information can be noted by adding a Cause box with a “?”. Any additional information can be added later. To see an initial Cause Map of this incident and the Outline, click on “Download PDF” above.

75 Year Old Woman Cuts Internet Service to Armenia With a Shovel

By Kim Smiley

On March 28, 2011, a 75-year-old woman out digging for scrap metal accidentally cut internet service to nearly all of Armenia.  There were also service interruptions in Azerbaijan and part of Georgia.  Some regions were able to switch to alternative internet suppliers within a few hours, but some areas were without internet service for 12 hours.

How did this happen?  How could an elderly woman and a shovel cause such chaos without even trying?

A root cause analysis can be performed and a Cause Map built to show what contributed to this incident.  Building a Cause Map begins with determining the impacts to the organizational goals.  Then “why” questions are asked and causes are added to the map.

In this example, the Customer Service Goal was impacted because there was a significant internet service interruption, and the Production Schedule Goal was also impacted because of the loss of worker productivity. The Material and Labor Goal also needs to be considered because of the cost of repairs.

Now causes are added to the Cause Map by asking “why” questions.  Internet service was disrupted because a fiber optic cable was damaged by a shovel.  In addition, this one cable provided 90 percent of Armenia’s internet so damaging it created a huge interruption in internet service.

Why would a 75-year-old woman be out digging for cables? The woman was looking for copper cable and accidentally hit the fiber optic cable. This happened because both types of cables are usually buried inside PVC conduit and can look similar. She was looking for copper cable because there is a market for scrap metal. Metal scavenging is a common practice in this region because there are many abandoned copper cables left in the ground. She was also able to hit the fiber optic cable because it was closer to the surface than intended, likely exposed by mudslides or heavy rains.
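For readers who think in code, the "why" chain described above can be sketched as a small data structure. This is only an illustrative sketch, not a ThinkReliability tool: the dictionary layout and the function names `ask_why` and `open_ends` are assumptions made for this example, and the cause labels are paraphrased from the post. Each effect maps to the causes that answer "why did this happen?".

```python
# Hypothetical sketch: a Cause Map stored as effect -> list of causes.
# Labels are paraphrased from the post; the structure is an assumption.
cause_map = {
    "Customer Service Goal impacted": ["Internet service interrupted"],
    "Internet service interrupted": [
        "Fiber optic cable damaged by shovel",
        "One cable carried 90 percent of Armenia's internet",
    ],
    "Fiber optic cable damaged by shovel": [
        "Woman digging for copper cable",
        "Fiber optic cable mistaken for copper cable",
        "Cable closer to the surface than intended",
    ],
    "Woman digging for copper cable": [
        "Market for scrap metal",
        "Abandoned copper cables left in the ground",
    ],
    "Fiber optic cable mistaken for copper cable": [
        "Both cable types buried in similar-looking PVC conduit",
    ],
    "Cable closer to the surface than intended": [
        "Possibly exposed by mudslides or heavy rains (?)",
    ],
}


def ask_why(effect, cmap, depth=0):
    """Return indented lines tracing an effect back through its causes."""
    lines = ["  " * depth + effect]
    for cause in cmap.get(effect, []):
        lines.extend(ask_why(cause, cmap, depth + 1))
    return lines


def open_ends(effect, cmap):
    """Collect causes with no recorded cause of their own (the edge of the map)."""
    causes = cmap.get(effect, [])
    if not causes:
        return [effect]
    found = []
    for cause in causes:
        for leaf in open_ends(cause, cmap):
            if leaf not in found:
                found.append(leaf)
    return found


if __name__ == "__main__":
    print("\n".join(ask_why("Customer Service Goal impacted", cause_map)))
```

Calling `open_ends` on the top-level goal lists the boxes where the next "why" question could be asked, which mirrors how additional information gets added to a Cause Map as an investigation continues.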

The woman, who has been dubbed the spade-hacker by local media, has been released from police custody. She is still waiting to hear if she faces any punishment, but police statements implied that the prosecutor won’t push for the maximum of three years in prison due to her age.

To see the Cause Map of this issue, click on the “Download the PDF” button above.

Air Traffic Controller Asleep On the Job

By Kim Smiley

At least three times over the past decade, air traffic controller fatigue has been investigated by the National Transportation Safety Board (NTSB) in near-miss airline accidents.  Five years ago, controller fatigue was a significant factor in a Lexington, KY crash killing 49, the last fatal crash related to this problem.  Again last week, controller fatigue was in the news when two early-morning aircraft had uncontrolled landings at Reagan National Airport near Washington D.C.  The controller, who had 20 years of experience with most of them at Reagan, was clearly well experienced.  In fact, the controller was also a supervisor.  But no level of experience can overcome the effects of fatigue.  The relieved controller stated that he had worked the 10 p.m. to 6 a.m. shift four nights in a row.

Faced with harsh criticism over the latest incident, the FAA reacted by mandating a second controller at Reagan National Airport and reviewing traffic management policies at all single-person towers.  Regional radar controllers are now required to check in with single-person towers during night shifts to ensure controllers are prepared to handle incoming traffic.

Controller fatigue is a well known problem, and multiple solutions have been suggested over the past two decades.  It has been a part of the NTSB’s Most Wanted list since 1990.  In 2007 following the Lexington crash, the NTSB urged the Federal Aviation Administration (FAA) to overhaul their controller schedules, claiming that the stressful work and hectic pace were putting passengers and crews at risk.  The FAA responded, and is currently working with the National Air Traffic Controllers Association (NATCA) to develop “a science-based controller fatigue mitigation plan”.

In addition, from 2007 to 2011, more than 5,500 new air traffic controllers were hired.  However, many of these simply replaced air traffic controllers who were retiring, resulting in no net gain in the pool of available labor.  Air traffic controllers have a mandated retirement age of 56, with exceptions available up to age 61.  Additionally, on-the-job training is extensive, requiring two to four years just to receive initial certification.  Adding staffing therefore is more difficult than initially meets the eye.

Faced with an expected increase in air traffic and an aging infrastructure, the FAA has aggressively pursued a long-term modernization called NextGen.  With the proposed modernization and staffing, the 2011 FAA budget request is now $1.14B, a $275M or 31% increase from 2010.  While material and personnel changes are often necessary, sometimes simpler solutions are equally effective or quicker to implement.
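The budget figures quoted above can be sanity-checked with a few lines of arithmetic. This sketch uses only the two numbers stated in the text (the $1.14B request and the $275M increase); the variable names are illustrative, and the 2010 figure is derived here rather than taken from any FAA document.

```python
# Figures from the text; everything else is derived from them.
request_2011 = 1.14e9   # 2011 budget request in dollars
increase = 275e6        # stated increase over 2010 in dollars

budget_2010 = request_2011 - increase           # implied 2010 figure
percent_increase = increase / budget_2010 * 100

print(f"Implied 2010 budget: ${budget_2010 / 1e6:.0f}M")
print(f"Percent increase: {percent_increase:.1f}%")
```

The implied 2010 figure is about $865M, and the increase works out to between 31 and 32 percent, consistent with the roughly 31% quoted.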

The associated Cause Map reflects the multiple solutions suggested, and even implemented, to combat the problem of controller fatigue.  As discussed, the FAA, NTSB and NATCA have pursued multiple paths to overcome the issue of controller fatigue.  However, as the Cause Map shows, there are multiple contributing factors in this case.  Controller fatigue isn’t the only reason those planes had an uncontrolled landing, and controller fatigue wasn’t caused by just four night shifts in a row.  Because there are multiple reasons why this happened, it also means there are multiple opportunities to correct future problems.  The key isn’t eliminating all of the causes, but rather eliminating the right one.

Deadly Tiger Attack

By Kim Smiley

On December 25, 2007, a tiger escaped her enclosure at the San Francisco Zoo and attacked three people.  One 17 year old boy was killed and the other two were injured. The enclosure was built in the 1940s and had safely contained tigers for more than 60 years without incident.

So how did this happen?  How did the tiger escape?

A Cause Map can be built using this example to help determine how this incident was able to occur. To begin a Cause Map, the impacts to the organizational goals are first determined and then “why” questions are asked to add causes to the map.  In this case, there was obviously an impact to the safety goal because one zoo patron was killed and two were injured.  The customer service goal was also impacted because the zoo was closed until January 3, 2008 following the incident.  Why was a zoo patron killed?  He was killed because he was mauled by a tiger.  Why was he mauled?  Because the tiger escaped her enclosure and she went after the victims.

Let’s focus on the question of how the tiger escaped her enclosure first. An investigation was conducted by the United States Department of Agriculture’s Animal and Plant Health Inspection Service, the government body that is charged with overseeing the nation’s zoos. Based on claw marks and other evidence at the scene, they determined that the tiger jumped from the bottom of a dry moat and was able to pull herself over the fence surrounding her enclosure. The investigation also determined the fence was lower than typically used around tiger enclosures. The Association of Zoos & Aquariums recommends that walls around a tiger exhibit be at least 16.4 feet high, and the fence at the San Francisco Zoo was only 12.5 feet high at the time.

The second question of why the tiger went after the boys is not as easy to answer.  A few experts have stated that the tiger didn’t behave in a typical way.  There has been significant speculation in the media that the victims taunted the tiger or provoked her in some way, but nothing has ever officially been determined.

This focus on the behavior of the victims is a good example of some of the issues that can come up during an investigation.  It can be tempting to focus on assigning blame when investigating an incident.  But the real question is “What should we do to prevent this from happening again?”.  Whether or not the boys provoked the tiger, she should never have been able to escape her enclosure.

After the incident, the zoo extensively remodeled the tiger enclosure, adding a much higher fence with hotwire at the top to prevent any similar incidents from occurring.