All posts by Kim Smiley

Mechanical engineer, consultant and blogger for ThinkReliability, obsessive reader and big believer in lifelong learning

App Takes Down National Weather Service Website

By Kim Smiley

The National Weather Service (NWS) website was down for hours on August 25, 2014.  Emergency weather alerts such as tornado warnings were still disseminated through other channels, but this issue raises questions about the robustness of a vital website.

This issue can be analyzed by building a Cause Map, a visual format for performing a root cause analysis.  Cause Maps are built by laying out all the causes that contributed to a problem to show the cause-and-effect relationships.  The idea is to identify all the causes (plural), not just THE one root cause.

This example is a good illustration of the potential danger of focusing on a single root cause.  The NWS website outage was caused by an abusive Android app that bogged the site down with excessive traffic.  The app was designed to provide current weather information and it pulled data directly from the forecast.weather.gov website.  The app inadvertently queried the website thousands of times a second because of a programming error and the website was essentially overwhelmed.  It was similar to the denial of service attacks that have been directed at websites such as Bank of America and Citigroup, but the spike in traffic in this case wasn’t deliberate.

It may be tempting to say that the app was the root cause. Or you could be more specific and say the programming error was the root cause.  But labeling either of these “the root cause” would imply that you solved the problem once you fix the software error. The root cause is gone, no more problem…right?  In order to address the issue, NWS installed a filter to block the excessive queries and worked with the app developer to ensure the error was fixed, but there are other factors that must be considered to effectively reduce the risk of a similar problem recurring.

One of the things that must be considered in this example is why a filter that blocked denial of service attacks wasn’t already in place.  Flooding a website with excessive traffic is a well-known strategy of hackers.  If an app could accidentally take the site down for hours, it is worrisome to consider what somebody with malicious intent could do.  The NWS is responsible for disseminating important safety information to the public and needs a reasonably robust website.  In order to reduce the impact of a similar issue in the future, the NWS needs to evaluate the protections it has in place for its website and see if any other safeguards should be implemented beyond the filter that addressed this specific issue.
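As a rough illustration of the kind of filter discussed here, the sketch below implements a per-client token bucket, one common way to cap request rates. It is not the filter the NWS installed, whose details were not published; the class name, the rate and burst values, and the idea of keying on a client ID are all assumptions made for the example.

```python
import time
from collections import defaultdict


class TokenBucketFilter:
    """Per-client token bucket: a hypothetical sketch, not the NWS filter."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens added per second, per client
        self.burst = burst            # maximum tokens a client can bank
        self.clock = clock            # injectable clock, handy for testing
        self.tokens = defaultdict(lambda: burst)
        self.last_seen = {}

    def allow(self, client_id):
        """Return True if this request should be served, False if dropped."""
        now = self.clock()
        elapsed = now - self.last_seen.get(client_id, now)
        self.last_seen[client_id] = now
        # Refill tokens for the time elapsed, capped at the burst size.
        refilled = self.tokens[client_id] + elapsed * self.rate
        self.tokens[client_id] = min(self.burst, refilled)
        if self.tokens[client_id] >= 1:
            self.tokens[client_id] -= 1
            return True
        return False
```

With a limit like this in front of the site, a client that fires off a thousand requests in the same instant would get only its burst allowance served, while a different client making an occasional request would be unaffected. Choosing the rate and burst values is the hard part in practice and depends on what legitimate traffic looks like.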

If the investigation was focused too narrowly on a single root cause, the entire discussion of cyber security could be missed.  Building a Cause Map of many causes ensures that a wider variety of solutions are considered and that can lead to more effective risk prevention.

To view a high level Cause Map of this issue, click on “Download PDF” above.

Ice Bucket Challenge Ends in Serious Injuries

By Kim Smiley

In a terrible reminder that awful things can happen at any time, two firefighters were seriously injured helping the Campbellsville University marching band raise money for amyotrophic lateral sclerosis (ALS) research by participating in the trendy ice bucket challenge.  If you ever log onto Facebook, you are probably already familiar with the concept, but in case you are not a social media fan, the idea is that friends tag each other to either donate $100 to an ALS-related charity or dump a bucket of ice water over their heads.  If you choose the ice bucket, you are supposed to take a video or photo as evidence and post it online.

Trying to create an entertaining video of the ice bucket dumping is part of the fun for many of the participants.  In order to make a memorable video to post on social media, the firefighters that were injured used a fire truck ladder to dump ice water on the band from above.  While on the ladder, the firefighters were near high voltage power lines (although they never actually touched the lines) and electricity arced out, injuring four firefighters.  Two firefighters were treated and released, but two were still hospitalized days later.  One was listed as stable, but the other was in critical condition.

This accident clearly illustrates that high voltage can be extremely dangerous even if you don’t touch the equipment. An arc flash can occur when a flashover of electric current leaves its intended path and travels through the air from one conductor to another or to the ground.  The closer a person is when an arc happens, the more dangerous it is.  Arcs are exceptionally hot and can cause very serious injuries and even death from several feet away when high voltage is in use.

The Public Service Commission stated that they will investigate the location to ensure that the power line had the correct clearance from the ground, trees and structures, but initial reports do not indicate any problems with the power poles.  Possible solutions that could be used to reduce the risk of a similar problem in the future are increased education on the risks of high voltage and ensuring that adequate warning signs are in place.

These have been the most dramatic injuries associated with the ice bucket challenge, but there is a slew of videos featuring buckets dropped on heads, slips and a variety of other unintended outcomes that look painful.  If you are considering doing the ice bucket challenge, please remember that a gallon of water weighs over 8 pounds, so a five-gallon bucket filled with water weighs more than 40 pounds.  Think the plan through carefully before you ask somebody to dump water on you off a balcony because it may end badly.

Freight Trains Collide Head-On in Arkansas

By Kim Smiley

On August 17, 2014, two freight trains collided head-on in Arkansas, killing two and injuring two more.  The accident resulted in a fire after alcohol spilled from a damaged rail car ignited, prompting evacuation of about 500 people from nearby homes.  The trains were carrying toxic chemicals, but none of the cars carrying the toxic chemicals are believed to have been breached during the accident.

The National Transportation Safety Board (NTSB) is currently investigating this accident, but an initial Cause Map, or visual root cause analysis, can still be built to help document and illustrate the information that is known.  One of the benefits of a Cause Map is that it can easily be expanded to incorporate information as it becomes available.  The first step of the Cause Mapping process is to fill in an Outline with the basic information for an incident.  In addition, anything that was different at the time of the accident is listed.  How the incident impacts the overall goals is also documented on the bottom of the Outline.

As with many incidents, a number of goals were impacted by this train collision.  The safety goal is obviously impacted by the two fatalities and two injuries.  The property goal is impacted because of the significant damage to the trains and freight.  The labor/time goal is impacted because of the response effort and investigation that are required as a result of the accident. Potential impacts or near misses should also be documented, so the potential release of toxic chemicals is considered an impact to the environmental goal.

The second step is to perform the analysis by building the Cause Map.  To build the Cause Map, start with one impacted goal and ask “why” questions.  Each answer is added to the Cause Map.  Each impacted goal should be considered and the cause boxes should all connect at some location on the Cause Map.  Starting with the safety goal in this example, the first question would be: why were two people killed?  This occurred because there was a train collision.  The trains collided because they were traveling toward each other on the same track.  No details have been released about how the trains ended up on the same track.  The trains’ data recorders (which provide information about the trains’ speed, braking and throttle) have been found and will be analyzed by investigators. The NTSB has stated that it will be looking into a number of factors, such as the train signals and, since the accident occurred late at night, fatigue.
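For readers who think in code, the “why” questioning above can be sketched as a small directed graph in Python. This is only an illustration of the idea, not ThinkReliability's tooling; the `CauseMap` class and its method names are invented for the example, and the boxes are the ones already named in this post.

```python
from collections import defaultdict


class CauseMap:
    """Minimal cause-and-effect graph; all names here are illustrative."""

    def __init__(self):
        self.causes = defaultdict(list)   # effect -> list of direct causes

    def add_why(self, effect, cause):
        """Record one answer to 'why did <effect> happen?'."""
        if cause not in self.causes[effect]:
            self.causes[effect].append(cause)

    def open_questions(self, start):
        """Return boxes reachable from `start` that have no recorded cause yet."""
        seen, stack, leaves = set(), [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            children = self.causes.get(node, [])
            if not children:
                leaves.append(node)
            stack.extend(children)
        return leaves


cmap = CauseMap()
cmap.add_why("Safety goal impacted: two fatalities", "Train collision")
cmap.add_why("Property goal impacted: damaged trains and freight", "Train collision")
cmap.add_why("Train collision", "Trains traveling toward each other on the same track")
```

Asking `cmap.open_questions(...)` from either impacted goal leads to the same unanswered box, the trains being on the same track, which is exactly where the investigation still has to supply the next “why”. Both goals sharing the collision box also shows how the cause boxes connect on a single map.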

The final step in the Cause Mapping process is to develop solutions that can be implemented to reduce the risk of a similar problem recurring in the future.  Since the investigation is ongoing, talk of solutions is premature at this point.  Once more is known about the causes that contributed to this issue, the lessons that are learned can be used to develop solutions.

Software Glitch Delays U.S. Travel Documents

By Kim Smiley

The Consular Consolidated Database (CCD) is the global database used by the U.S. State Department to process visas and other travel documents.  On July 20, 2014, the CCD experienced software issues and had to be taken offline.  The outage lasted several days, with the CCD being returned to service with limited capacity on July 23.  The CCD is huge, one of the largest Oracle-based warehouses in the world, and is used to process a hefty number of visas each year, so the effects of the software glitch have been felt worldwide.  The State Department processed over 9 million immigrant and non-immigrant visas overseas in 2013, so a delay of even a few days means a significant backlog.
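A back-of-the-envelope estimate gives a feel for the size of that backlog. The Python sketch below uses the 9 million figure from above and assumes, purely for illustration, that visas are processed evenly across all 365 days of the year and that nothing at all was processed for three days; neither assumption comes from the State Department, and the figure covers visas only, not other travel documents.

```python
VISAS_PER_YEAR = 9_000_000   # "over 9 million" visas processed overseas in 2013
DAYS_PER_YEAR = 365          # assumption: work is spread evenly over the calendar year
OUTAGE_DAYS = 3              # assumption: fully offline from July 20 to July 23


def backlog(visas_per_year, outage_days, days_per_year=DAYS_PER_YEAR):
    """Estimate how many applications pile up while the system is offline."""
    return visas_per_year / days_per_year * outage_days


estimate = backlog(VISAS_PER_YEAR, OUTAGE_DAYS)
print(f"Roughly {estimate:,.0f} visas delayed")
```

Under these assumptions the backlog comes to somewhere around 74,000 visas, and it keeps growing for as long as the system runs at limited capacity.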

This issue can be analyzed by building a Cause Map, a visual root cause analysis.  A Cause Map visually lays out the different causes that contribute to an issue so that the problem is better understood and a wider range of solutions can be considered.  The first step in the Cause Mapping process is to define the problem, which includes documenting the impacts to the overall goals.  Most problems impact more than one goal and this example is no exception.

The customer service goal is clearly impacted because thousands – and potentially even millions – have had their travel document processing delayed.  The negative publicity can also be considered an impact to the customer service goal because this software glitch isn’t doing the international image of the U.S. any favors.  The delay in travel document services is an impact to the production/schedule goal and the recovery effort and investigation into the problems impact the labor/time goal.  Additionally, there are potential economic impacts to both individuals who may have had to change travel plans and to the U.S. economy because these issues may discourage international tourism.

The next step in the Cause Mapping method is to build the Cause Map.  This is done by asking “why” questions and using the answer to visually lay out the cause-and-effect relationships.  The delay in processing travel documents occurred because the CCD is needed to process them and the CCD had to be taken offline as a result of software issues.  Why were there issues with the database? Maintenance was done on the CCD on July 20 and the performance issues began shortly thereafter.  The maintenance was done to improve system performance and to fix previous intermittent performance issues. The State Department has stated that this was not a terrorist act or anything more malicious than a software glitch.  An investigation is currently underway to determine exactly what caused the software glitch, but the details have not been released at this time.  It can be assumed that the test program for the software was inadequate since the glitch wasn’t identified prior to implementation.

The final step in the Cause Mapping process is to identify solutions that can be implemented to reduce the risk of a problem recurring.  Details of exactly what was done to deal with the issue in the short term and bring the CCD back online aren’t available, but the State Department has stated that additional servers were added to increase capacity and improve response time.  There is also a plan to improve the CCD in the longer term by upgrading to a newer version of the Oracle database software by the end of the year which will hopefully prove more stable.

To view an Outline and high level Cause Map of this issue, click on “Download PDF” above.

Deadly Moscow Metro Derailment

By Kim Smiley

On July 15, 2014, a routine morning commute on the Moscow subway quickly became a nightmare when a metro train dramatically derailed, resulting in 23 deaths and about 150 injuries.  A massive rescue operation took hours and the investigation into the incident promises to be lengthy as well.

The investigation into this horrific accident is still ongoing, but an initial Cause Map can be built to capture the information that is already available and the Cause Map can be expanded as more details are known.  A Cause Map is a format for performing a visual root cause analysis.  The first step is to define the problem by filling in an Outline with the background information for the incident.  Additionally, any different or unique elements are documented because differences should almost always be investigated.  The impacts to the overall goals are documented on the bottom half of the Outline.  Once the problem is defined, the analysis is performed by asking “why” questions and using the answer to build the Cause Map. (To view the Outline and an initial Cause Map for this accident, click on “Download PDF” above.)

The safety goal was clearly impacted in this example because of the fatalities and injuries.  Why were so many hurt?  This occurred because a metro train derailed.  According to initial reports in the media, the train derailed because of an issue with a track switch mechanism that had recently been repaired.  It appears there was a problem with the repair work that was done and it can be assumed that the supervision or inspection of the work wasn’t adequate since the problem wasn’t discovered prior to the accident.

A second impact to the safety goal is that it was particularly difficult to quickly access and treat the injured passengers after the accident.  The derailment occurred at the deepest metro station on the Moscow subway, about 275 feet underground.  Rescue workers had to climb down steps to reach injured passengers and had to carry many up to the surface.

After a Cause Map is completed, the final step is to use it to develop solutions that can be implemented to reduce the risk of a similar accident occurring.  In this example, there may be changes needed to how track work is managed.  At a minimum, a careful look into how repair work is inspected prior to a track being put back into service seems warranted after this accident.

The Controversy Over the 2010 World Cup Ball

By Kim Smiley

Unlike other sports where the balls remain relatively constant, a new soccer ball is typically unveiled for the World Cup every four years.  The changes made to the balls aren’t just cosmetic; the behavior of the soccer ball can vary between designs.  One of the most controversial designs in recent memory was the Jabulani, the official ball of the 2010 World Cup that was widely criticized and dubbed “the beach ball”.

The issues surrounding the Jabulani soccer ball can be analyzed by building a Cause Map, a visual root cause analysis.  To build a Cause Map, “why” questions are asked to determine what factors contributed to an issue and answers are visually laid out to show the cause-and-effect relationships. To view a Cause Map of this issue, click on “Download PDF”.

So why was the 2010 World Cup ball the focus of so many complaints? Players felt that the ball was unpredictable and behaved differently than previous ball designs.  Scientists studied the Jabulani ball and determined that it was less aerodynamically stable, particularly at the speeds typical for a professional free kick, which made the goalie’s job significantly harder and caused tempers to flare.  The Jabulani ball was a fundamentally different design: it had fewer panels (8 instead of the traditional 32), a smoother surface and internal stitches.  The ball was so smooth that it changed how air flowed around it, including the speed at which the transition between smooth and turbulent flow occurred.  The placement of the seams was also significantly different and not as balanced, so the ball moved erratically at times.  One can assume that the testing program for the new soccer ball design was inadequate since the changes in flight path patterns were not intentional, so that is another cause that needs to be considered.

It’s also worth noting that the use of a new soccer ball design for the 2010 World Cup is itself a cause of the problem.  Few other sports change their equipment so frequently or debut new equipment at major international events. So why is there a new ball for every World Cup?  Money certainly plays a role since there is a huge demand for World Cup merchandise and a new ball means a new product to sell.  The restrictions governing soccer ball design are also vague (for example, the number of panels is not specified), which allows plenty of wiggle room for innovation.

The problems with the 2010 World Cup ball seem to have been fixed and the 2014 World Cup ball, the Brazuca, doesn’t seem to be generating close to the amount of negative press.  In order to smooth out the flight pattern, this design is about a half-ounce heavier, has a slightly rougher surface and deeper seams.  There has been some speculation that the fast flying Brazuca is responsible for the high number of goals scored this World Cup, but the ball appears to fly predictably – if fast. If you want a stylish new Brazuca official match ball of your own, they are selling for $160 each.

If you are still feeling blue that the US is out of the World Cup, try searching #ThingsTimHowardCouldSave.  It should cheer you up a bit.

Can a “Super Banana” Reduce Vitamin A Deficiency?

By Kim Smiley

Vitamin A deficiency is rare in developed countries, but it remains a major public health issue in more than half of all countries, especially in Africa and South-East Asia. Researchers at the Queensland University of Technology have created a “super banana” genetically engineered to contain alpha- and beta-carotene that they hope will reduce vitamin A deficiency in parts of the world where bananas are a staple crop.

The problem of vitamin A deficiency can be analyzed using a Cause Map, a visual format for performing a root cause analysis. A Cause Map is built by determining how an issue impacts the overall goals and then asking “why” questions and laying out the answers visually to show the cause-and-effect relationships. In this example, the overall goal of public safety is impacted because vitamin A deficiency causes 650,000 to 700,000 deaths and results in blindness in 250,000 to 500,000 children annually. This occurs because the body, especially a growing body, needs vitamin A to function properly and the diet does not contain adequate vitamin A.

Bodies use vitamin A in a number of ways. For example, vitamin A is important for healthy vision and a lack of it will result in blindness.  It has also been shown to play an important role in the immune system. Diets in some regions of the world lack enough vitamin A because poor subsistence-farming communities there predominantly consume locally grown crops, and the local crops don’t contain sufficient vitamin A.

A number of approaches have been tried to reduce the occurrence of vitamin A deficiency, such as distribution of vitamins and introduction of new crops, but the deficiency is still widespread, which led to the idea of genetically modifying local crops to be more nutritious. The idea behind the “super bananas” is that they would look the same as other East African Highland bananas and grow in the same conditions, but that they would be enriched with additional nutrients. The inside of the “super bananas” is more orange than regular East African Highland bananas, but the outside looks the same.

Lab tests with gerbils have been successful and the first human trials of the modified bananas are scheduled starting this summer. If the human trials are successful, the next necessary step is for Uganda’s legislature to approve a bill allowing the crops to be grown. Researchers are hoping to have the modified bananas growing in Uganda by 2020 if the government approves the project.

To view a high level Cause Map, click on “Download PDF” above.

Software Glitch in Electronic Voting System during Belgium’s Federal Election

By Kim Smiley

At the most basic level, the idea behind elections seems very simple: let every citizen vote one time and count the votes.  But in reality, it often proves difficult to quickly and accurately collect and count thousands and thousands of votes.  The recent software bug during the May federal elections in Belgium illustrates some of the technical difficulties that can come into play during an election.

Belgium held federal elections on May 25, 2014 and used an electronic voting system to collect and count many of the votes.  While computing election results, officials realized that some of the votes weren’t calculating correctly.  Announcement of the election results was delayed while the problem was addressed, but the bigger problem is that any software hiccups during elections make people question the validity of the vote.

Government officials have stated that the problem was quickly addressed and that the impacted votes would not have changed the outcome of the election, but the lack of transparency in the process worries some.  In fact, many countries have banned the use of electronic voting because of concern over potential issues and Belgium is one of the only European countries to still use e-voting machines.

There are two separate electronic voting systems in use in Belgium.  The software glitch impacted the older, first generation Jites system computers using DOS operating systems.  The Jites system was certified and tested, but the test program should be reevaluated before future elections because it missed a significant software glitch.  Another option would be to upgrade the first generation computers before the next election to reduce the risk of future issues by only having one system to test and maintain.

Conducting a large-scale national vote is a tricky problem and worth pondering.  The system needs to be transparent enough that the public feels the system is “fair”, but secret enough that individual voters are assured privacy.  Officials need to be able to ensure that only eligible voters participate, but the process can’t be so onerous that it inhibits citizens’ ability to navigate it (think of the ongoing debate in the US regarding photo IDs).  There are a number of strong, opposing forces at play in the process and any issues like a software error only add fuel to the fire.

To view the Outline and the root cause analysis Cause Map, please click “Download PDF” above.

Smoke at FAA Facility Results in Major Flight Disruptions

By Kim Smiley

A smoking bathroom fan resulted in the disruption of more than a thousand flights in the Chicago area on May 13, 2014 in a dramatic demonstration of real time cause-and-effect.  This incident illuminates how a relatively small issue can quickly grow into an expensive and time-consuming problem.  In an ideal world, a smoking bathroom fan wouldn’t result in national headlines.

So what happened?  How did a smoking bathroom fan that wasn’t even at the airport delay so many flights?  A Cause Map, a visual method for performing a root cause analysis, is a useful tool for understanding the causes that contributed to an issue.  When building a Cause Map, causes are laid out based on cause-and-effect relationships to clearly show what led to the problem.

In this example, flights were delayed because there was limited support from air traffic control available and air traffic control support is necessary for safe operation.  Air traffic control support was reduced because the Elgin FAA facility that monitors airports in the Chicago area was evacuated for several hours because the building was filled with smoke.  The building had to be evacuated for personnel safety and it took some time to reestablish safe conditions.  Emergency personnel had a difficult time pinpointing the source of the smoke because it spread through the space.  The smoke spread throughout the building because its source, a bathroom fan, was part of the HVAC system.

The media reports didn’t provide details about why exactly the bathroom fan was smoking in this particular case, but bathroom fans are a relatively common cause of building fires.  Lint or dust can build up in the fan motor over time, eventually leading to the motor overheating.  The situation can quickly become dangerous, particularly when a motor is left powered after it has seized which is a common failure mode for this equipment.

A few fairly easy things can be done to reduce the risk of bathroom fan fires.  Fans should be cleaned at least annually, and more frequently if they appear dirty or dusty.  A motor that is making unusual noises should be immediately turned off and inspected by an electrician prior to being returned to service.  Any fan that isn’t making the typical whizz sound should also be powered off and repaired or replaced prior to use because a motor that isn’t rotating has a greater likelihood of overheating.  Older models that aren’t thermally protected are most at risk for a fire and replacing them with a newer model with thermal protection can significantly reduce the risk of fire.

To view a high level Cause Map, click on “Download PDF” above.

Hundreds of Flights Disrupted After Air-Traffic Control System Confused by U-2 Spy Plane

By Kim Smiley

Hundreds of flights were disrupted in the Los Angeles area on April 30, 2014 when the En Route Automation Modernization (ERAM) air traffic control system crashed.  It’s been reported that the presence of a U-2 spy plane played a role in the air traffic control issues.

This issue can be analyzed by building a Cause Map, a visual format for performing a root cause analysis.  A Cause Map intuitively lays out the cause-and-effect relationships so that the problem can be better understood and a wider range of solutions considered.  In order to build a Cause Map, the impacted goals are determined and “why” questions are asked to determine all the causes that contributed to the issue.

In this example, the schedule goal was clearly impacted because 50 flights were canceled and more than 400 were delayed.  Why did this occur?  The flight schedule was disrupted because planes were unable to land or depart safely because the air traffic control system used to monitor the landings was down.  The computer system crashed because it became overwhelmed when it tried to reroute a large number of flights in a short period of time.

The system attempted to reroute so many flights at once because its calculations showed a risk of plane collisions.  It had misinterpreted the flight path, specifically the altitude, of a U-2 on a routine training mission in the area.  U-2s are designed for ultra-high altitude reconnaissance, and the plane is reported to have been flying above 60,000 feet, well above any commercial flights.  The system didn’t recognize that the U-2 was thousands of feet above any other aircraft, so it frantically worked to reroute planes so they wouldn’t be in unsafe proximity.
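To make the altitude point concrete, here is a deliberately simplified conflict check in Python. It is not how ERAM works; the reports don’t describe its logic. The `Track` type, the separation thresholds and the choice to model a misread altitude as a missing value are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional

# Illustrative thresholds only; not taken from ERAM or FAA rules.
MIN_HORIZONTAL_NM = 5.0
MIN_VERTICAL_FT = 1000.0


@dataclass
class Track:
    callsign: str
    x_nm: float                    # position east of a reference point, nautical miles
    y_nm: float                    # position north of a reference point, nautical miles
    altitude_ft: Optional[float]   # None models an altitude the system cannot use


def in_conflict(a: Track, b: Track) -> bool:
    """Flag a conflict only when horizontal and vertical separation are both too small."""
    too_close_horizontally = hypot(a.x_nm - b.x_nm, a.y_nm - b.y_nm) < MIN_HORIZONTAL_NM
    if a.altitude_ft is None or b.altitude_ft is None:
        # Conservative choice: with no usable altitude, assume the worst.
        return too_close_horizontally
    too_close_vertically = abs(a.altitude_ft - b.altitude_ft) < MIN_VERTICAL_FT
    return too_close_horizontally and too_close_vertically


airliner = Track("AIRLINER", 0.0, 0.0, 35_000)
u2_known = Track("U2", 1.0, 1.0, 60_000)
u2_misread = Track("U2", 1.0, 1.0, None)
```

In this toy model, `in_conflict(airliner, u2_known)` is false because the two aircraft are 25,000 feet apart vertically, while `in_conflict(airliner, u2_misread)` is true. Multiply that one false alarm by every flight crossing beneath the U-2’s path and it is easy to see how a rerouting workload could balloon.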

It took several hours to sort out the problem, but the Federal Aviation Administration was then able to implement a short term fix relatively quickly and get the ERAM system back online.  The ERAM system is being evaluated to determine whether any other fixes are needed to ensure that a similar problem doesn’t occur again.  It’s also worth noting that ERAM is a relatively new system (implementation began in 2002) that is replacing the obsolete 1970s-era hardware and software system that had been in place previously.  Hopefully there won’t be many more growing pains with the changeover to a new air traffic control system.

To see a high level Cause Map of this problem, click on “Download PDF” above.