All posts by Kim Smiley

Mechanical engineer, consultant and blogger for ThinkReliability, obsessive reader and big believer in lifelong learning

5 killed and dozens injured when duck tour boat collides with bus

By Kim Smiley

Five people were killed and dozens more injured when an amphibious Ride the Ducks tour boat collided with a charter bus in Seattle on September 24, 2015.  The circumstances of the accident were particularly unfortunate because it involved two large vehicles carrying tour groups across a busy bridge.  Traffic was snarled for hours as emergency responders worked to treat the high number of victims, investigate the accident and clear the roadway.

The National Transportation Safety Board (NTSB) is investigating the accident to determine exactly what led to the collision and whether there are lessons learned that could help reduce the risk of a similar crash in the future.  Potential issues with the duck boat are an early focus of the investigation.  In case you are unfamiliar, duck boats are amphibious landing craft used by the U.S. Army during World War II that have been refurbished into tour vehicles that can travel on both water and land, giving visitors a unique way to experience a city.  Their military designation, DUKW, was softened into the more user-friendly duck boat moniker used by tour companies throughout the world.

Eyewitnesses have reported that the duck boat unexpectedly swerved while crossing the bridge, slamming into the driver’s side of the tour bus.  Reports are that the left front wheel of the duck boat locked up and the driver lost control of the vehicle.  NTSB investigators have stated that the duck boat did not undergo an axle repair that was recommended in 2013, and they are working to determine whether this played a role in the accident.

Investigators are also looking into whether Seattle Ride the Ducks was notified of the recommended repair.  Photos of the wrecked duck boat show that the front axle sheared and the left wheel popped off the vehicle, but it hasn’t been conclusively determined whether the damage caused the accident or occurred during it.  The issues with the axle certainly look like a smoking gun, but a thorough investigation still needs to be performed, and the process may take up to a year.  If there was a mechanical failure on the duck boat unrelated to the already identified axle issue, it will need to be identified and reviewed to see if it applies to other duck tour vehicles.

The severity of this accident is raising concerns about the overall safety of duck tours.  The duck boat involved underwent regular annual inspections and was found to meet federal standards.  If a mechanical failure was in fact involved, hard questions about the adequacy of those standards and inspections will need to be asked.  The recommended repair that was never done also raises questions about how such recommendations are passed along to companies running duck boat tours and incorporated into inspection standards.

Click on “Download PDF” above to see an outline and Cause Map of this issue.

Volkswagen admits to use of a ‘defeat device’

By Kim Smiley

The automotive industry was recently rocked by Volkswagen’s acknowledgement that the company knowingly cheated on emissions testing of several models of 4-cylinder diesel cars starting in 2009.  The diesel cars in question include software “defeat devices” that turn on full emissions control only during emissions testing.  Full emissions control is not activated during normal driving conditions, and the cars have been shown to emit as much as 40 times the allowable level of pollutants.  Customers are understandably outraged, especially since many of them purchased a “clean diesel” car in an effort to be greener.

The investigation into this issue is ongoing and many details aren’t known yet, but an initial Cause Map, a visual format for performing a root cause analysis, can be created to document and analyze what is known.  The first step in the Cause Mapping process is to fill in a Problem Outline with the basic background information and how the issue impacts the overall organizational goals.  The “defeat device” issue is a complex problem that impacts many different organizational goals.  The increased emissions obviously impact the environmental goal, and the potential health effects of those emissions are an impact to the safety goal.  Some specifics are still unknown, like the exact amount of the fines the company will face, but we can safely assume the company will be paying significant fines (on the order of billions) as a result of this blatant violation of the law.

The Volkswagen stock price also took a major hit, dropping more than 20 percent following the announcement of the diesel emissions issues.  It is difficult to quantify how much the loss of consumer confidence will hurt the company long-term, but being perceived as dishonest by many will certainly impact sales.  A large recall that will be both time-consuming and costly is also in Volkswagen’s future.  Depending on the investigation findings, there is also the potential for criminal prosecution because of the intentional nature of this issue.

Once the overall impacts to the goals are defined, the actual Cause Map can be built by asking “why” questions.  So why did these cars include “defeat devices” to cheat on emissions tests?  The simple answer is increased profits.  Designing cars that appeared to have much lower emissions than they had in reality allowed Volkswagen to market a more desirable car.  Car design has always involved a trade-off between emissions and performance.  Detailed information hasn’t been released yet, but it is likely that the cars had better fuel economy and driving performance during normal driving conditions when full emissions control wasn’t activated.  Whoever was involved in the design of the “defeat device” also likely assumed the deception would never be discovered, which raises concerns about how emissions testing is performed.

The “defeat device” is believed to work by taking advantage of unique conditions that exist during emissions testing.  During normal driving, the steering column moves as the driver steers the car, but during emissions testing the wheels rotate while the steering column stays still.  The “defeat device” software appears to have monitored the steering column and wheels to sense when conditions indicated an emissions test was underway.  When the wheels turned without corresponding steering wheel motion, the software turned the catalytic scrubber up to full power, reducing emissions and allowing the car to pass emissions tests.  Details on how the “defeat device” was developed and approved for inclusion in the design haven’t been released, but hopefully the investigation will help explain exactly how something this far over the line occurred.
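To make that logic concrete, here is a minimal sketch of the kind of test-detection scheme described above.  It is purely illustrative: Volkswagen’s actual software has not been published, and the signal names and thresholds below are assumptions.

```python
# Illustrative sketch of the dyno-detection logic described above.
# NOT Volkswagen's actual code (which has not been published); the
# signal names and thresholds are assumptions for illustration only.

def looks_like_emissions_test(wheel_speed_kph: float,
                              steering_change_deg: float) -> bool:
    """Guess that the car is on a test dynamometer when the wheels
    are turning but the steering wheel is essentially motionless."""
    wheels_turning = wheel_speed_kph > 5.0           # assumed threshold
    steering_still = abs(steering_change_deg) < 0.5  # assumed threshold
    return wheels_turning and steering_still

def select_emissions_mode(wheel_speed_kph: float,
                          steering_change_deg: float) -> str:
    if looks_like_emissions_test(wheel_speed_kph, steering_change_deg):
        return "full emissions control"  # clean enough to pass the test
    return "normal calibration"          # better performance, far higher emissions

print(select_emissions_mode(40.0, 0.0))   # on the dyno -> full emissions control
print(select_emissions_mode(40.0, 12.5))  # normal driving -> normal calibration
```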

Only time will tell exactly how this issue impacts the overall health of the Volkswagen company, but the short-term effects are likely to be severe.  This issue may also have far-reaching impacts on the diesel market as consumer confidence in the technology is shaken.

To view an Outline and initial Cause Map of this issue, click on “Download PDF” above.

Spider in air monitoring equipment causes erroneously high readings

By Kim Smiley

Smoke drifting north from wildfires in Washington state has raised concerns about air quality in Calgary, but staff decided to check an air monitoring station after it reported an alarming rating of 28 on a 1-10 scale.  What they found was a bug, or rather a spider, in the system that was causing erroneously high readings.

The air monitoring station measures the amount of particulate matter by shining a beam of light through a sample of air.  The less light that makes it through the sample, the higher the number of particulates and the worse the air quality.  You can see the problem that would arise if the beam of light were blocked by a spider.
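As a rough illustration of that measurement principle (not the station’s actual firmware; the calibration constant is an assumed value), the calculation might look something like this:

```python
import math

# Rough sketch of the light-attenuation principle described above --
# not the monitoring station's actual firmware. The calibration
# constant is an assumed value chosen for illustration.

CALIBRATION = 10.0  # assumed scaling from attenuation to the air quality index

def air_quality_index(light_in: float, light_out: float) -> float:
    """More particulates -> less light transmitted -> higher index.
    Attenuation follows a Beer-Lambert-style relationship."""
    transmitted_fraction = light_out / light_in
    attenuation = -math.log(transmitted_fraction)  # 0 when nothing blocks the beam
    return CALIBRATION * attenuation

print(round(air_quality_index(100.0, 95.0), 1))  # clean air: ~0.5
print(round(air_quality_index(100.0, 6.0), 1))   # spider in the beam: ~28
```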

This example is a great reminder not to rely solely on instrument readings.  Instruments are obviously useful tools, but their output should always be run through a common-sense check.  Does it make sense that the air quality would be so far off the scale?  If there is any question about the accuracy of readings, the instrument should probably be checked, because the unexpected sometimes happens.

In this case, inaccurate readings of 10+ were reported by both Environment Canada and Alberta Environment before the issue was discovered and the air quality rating was adjusted down to a 4.  Ideally, the inaccurate readings would have been identified prior to posting potentially alarming information on public websites.  The timing of the spider’s visit was unfortunate because it coincided with smoky conditions that made the problem more difficult to identify, but extremely high readings should be verified before making them public if at all possible.

One potential solution to reduce the risk of a similar problem recurring is to add a verification step for very high readings before the information is posted publicly.  A second air monitoring station could also be added as a built-in double check, because an error would be obvious if the two stations didn’t report similar readings.
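Here is a minimal sketch of that double-check idea, assuming a hypothetical function name and a made-up agreement tolerance (this is not Environment Canada’s or Alberta Environment’s actual software):

```python
# Minimal sketch of the verification idea above. The function name,
# scale limit and tolerance are assumptions, not any agency's real code.

SCALE_MAX = 10.0           # top of the published 1-10 scale
AGREEMENT_TOLERANCE = 2.0  # assumed: how far two stations may disagree

def ok_to_publish(primary: float, secondary: float) -> bool:
    """Hold back off-scale readings and readings the second station
    doesn't corroborate, so a human can verify the instrument first."""
    if primary > SCALE_MAX:
        return False  # off the scale: check the instrument
    if abs(primary - secondary) > AGREEMENT_TOLERANCE:
        return False  # stations disagree: one may be fouled
    return True

print(ok_to_publish(28.0, 4.0))  # False -> send a technician
print(ok_to_publish(4.0, 3.5))   # True  -> safe to publish
```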

Depending on how often insects and spiders crawl into the air monitoring equipment, the equipment itself could be modified to reduce the risk of a similar problem recurring in the future.

To view a Cause Map, a visual root cause analysis, of this issue, click on “Download PDF” above.

Power grid near Google datacenter struck by lightning 4 times

By Kim Smiley

A small amount of data was permanently lost at a Google datacenter after lightning struck the nearby power grid four times on August 13, 2015.  About five percent of the disks in Google’s europe-west1-b cloud zone were impacted by the lightning strikes, but nearly all of the data was eventually recovered, with less than 0.000001% of the stored data permanently lost.

A Cause Map, or visual root cause analysis, can be built to analyze this issue.  The first step in the Cause Mapping process is to fill in an Outline with the basic background information such as the date, time and specific equipment involved.  The bottom of the Outline has a spot to list the impacted goals, which helps define the scope of an issue.  The impacted goals are then used to begin building the Cause Map: they are listed in red boxes and the impacts become the first cause boxes.  “Why” questions are then asked to expand the Cause Map and visually lay out the cause-and-effect relationships.

For this example, the customer service goal was impacted because some data was permanently lost.  Why did this happen?  Data was lost because datacenter equipment failed, the data was stored on a less stable storage system, and it wasn’t duplicated in another location.  Google has stated that the lost data was newly written data located on storage systems that were more susceptible to power failures.  The datacenter equipment failed because the nearby power grid was struck by lightning four times and was damaged.  Additionally, the automatic auxiliary power systems and backup battery were not able to prevent data loss after the lightning damage.

When more than one cause is required to produce an effect, the causes are listed vertically and separated by an “and”.  You can click on “Download PDF” above to see a high-level Cause Map of this issue that shows how an “and” is used.  A more detailed Cause Map could also be built to include the technical details of exactly why the datacenter equipment failed, which would be useful to the engineers developing detailed solutions.
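For readers who think in data structures, a Cause Map can be loosely pictured as a graph in which each effect points at the cause, or the “and”-joined causes, that produced it.  The sketch below only illustrates that structure; it is not ThinkReliability’s actual tooling:

```python
# Loose sketch of the structure described above: each effect maps to
# the cause(s) that produced it; multiple entries represent causes
# joined by an "and". Illustrative only, not ThinkReliability's tooling.

cause_map = {
    "data permanently lost": [
        "datacenter equipment failed",
        "data stored on a less stable system",
        "data not duplicated in another location",
    ],
    "datacenter equipment failed": [
        "power grid struck by lightning four times",
        "auxiliary power and battery backup did not prevent data loss",
    ],
}

def print_causes(effect: str, depth: int = 0) -> None:
    """Walk the map, printing each effect followed by its causes."""
    print("  " * depth + effect)
    for cause in cause_map.get(effect, []):
        print_causes(cause, depth + 1)

print_causes("data permanently lost")
```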

The final step in the Cause Mapping process is to develop solutions to reduce the risk of the problem recurring.  For this example, Google has stated that it is upgrading the datacenter equipment to be more robust against similar failures in the future.  Google also stated that customers should back up essential data so that it is stored in another physical location, improving reliability.

Few of us design datacenter storage systems, but this incident is a good reminder of the importance of having a backup.  If data is essential to you or your business, make sure there is a backup stored in a physically separate location from the original.  As with the “unsinkable” Titanic, it is always a good idea to include enough lifeboats, or backups, in a design just in case something you didn’t expect goes wrong.  Sometimes lightning strikes four times, so it’s best to be prepared.

A single human error resulted in the deadly SpaceShipTwo crash

By Kim Smiley

The National Transportation Safety Board (NTSB) has issued a report on its investigation into the deadly SpaceShipTwo crash that occurred during a test flight on October 31, 2014.  Investigators confirmed early suspicions that the space plane tore apart after the tail boom braking system was released too early, as discussed in a previous blog.  The tail booms are designed to feather to increase drag and slow down the space plane, but when the drag was applied earlier than expected, the additional aerodynamic forces ripped the space plane apart at high altitude and high velocity.  Amazingly, one of the two pilots survived the accident.

Information from the newly released report can be used to expand the Cause Map from the previous blog.  The investigation determined that the co-pilot pulled the lever that unlocked the braking system too early.  Even though the pilots did not command the tail booms into the braking position, the aerodynamic forces on them pushed them into the feathered position once they were unlocked.  The space plane could not withstand the additional aerodynamic forces created by the feathered tail booms while still accelerating, and it tore apart around the pilots.

A Cause Map is built by asking “why” questions and documenting the answers in cause boxes to visually display the cause-and-effect relationships.  So why did the co-pilot pull the lever too early?  A definitive answer may never be known since the co-pilot did not survive the crash, but it’s easy to understand how a mistake could be made in a high-stress environment while trying to recall multiple tasks from memory very quickly.  Additionally, the NTSB found that training did not emphasize the dangers of unlocking the tail booms too early, so the co-pilot may not have been fully aware of the potential consequences of this particular error.

A more useful question to ask is how a single mistake could result in a deadly crash.  The plane had to be designed so that it was possible for one early lever pull to create a dangerous situation.  Ideally, no single mistake would be able to cause a deadly accident; safeguards would have been built into the design to prevent the tail booms from feathering prematurely.  The NTSB determined the probable cause of this accident to be “failure to consider and protect against the possibility that a single error could result in a catastrophic hazard to the SpaceShipTwo vehicle.”  The investigation found that the design of the space plane assumed that the pilots would perform the correct actions every time.  Test pilots are highly trained and the best at what they do, but assuming human perfection is generally a dangerous proposition.
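As an illustration of the kind of safeguard the NTSB had in mind, an interlock could simply refuse an early unlock command.  The sketch below is hypothetical and is not the actual SpaceShipTwo flight logic; the Mach numbers come from public reporting that the feather was meant to be unlocked at about Mach 1.4 and was actually unlocked near Mach 0.8.

```python
# Hypothetical sketch of a safeguard against a single human error --
# NOT the actual SpaceShipTwo design. Public reporting indicates the
# feather was meant to be unlocked at about Mach 1.4; it was actually
# unlocked near Mach 0.8.

SAFE_UNLOCK_MACH = 1.4

def handle_unlock_lever(current_mach: float) -> str:
    """Interlock: ignore an early unlock command so that one mistake
    cannot by itself put the vehicle in a catastrophic configuration."""
    if current_mach >= SAFE_UNLOCK_MACH:
        return "feather unlocked"
    return "unlock inhibited: below safe Mach number"

print(handle_unlock_lever(0.8))  # early pull -> inhibited
print(handle_unlock_lever(1.5))  # within envelope -> unlocked
```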

The NTSB identified a few causes that contributed to the lack of safeguards in the SpaceShipTwo design.  Designing commercial spacecraft is a relatively new field; there is limited human factors guidance for commercial space operators, and the flight database for commercial space mishaps is incomplete.  Additionally, there was insufficient review during the design process because no one identified that a single error could cause a catastrophic failure.  To see the recommendations and more information on the investigation, view the synopsis of the NTSB’s report.

To see an updated Cause Map of this accident, click on “Download PDF” above.

Small goldfish can grow into a large problem in the wild

By Kim Smiley

Believe it or not, the unassuming goldfish can cause big problems when released into the wild.  I personally would have assumed that a goldfish set loose into the environment would quickly become a light snack for a native species, but invasive goldfish have managed to survive and thrive in lakes and ponds throughout the world.  Goldfish will keep growing as long as the environment they are in supports it.  So while goldfish kept in an aquarium will generally remain small, without the constraints of a tank, goldfish the size of dinner plates are not uncommon in the wild. These large goldfish both compete with and prey on native species, dramatically impacting native fish populations.

This issue can be better understood by building a Cause Map, a visual format of root cause analysis that intuitively lays out the cause-and-effect relationships which contributed to the problem.  A Cause Map is built by asking “why” questions and recording each answer in a cause box.  So why are invasive goldfish causing problems?  The problems are occurring because there are large populations of goldfish in the wild AND the goldfish are reducing native fish populations.  When two causes are needed to produce an effect, as in this case, both causes are listed on the Cause Map vertically and separated by an “and”.  Keep asking “why” questions to continue building the Cause Map.

So why are there large populations of goldfish in the wild?  Goldfish are being introduced to the wild by pet owners who no longer want to care for them and don’t want to kill their fish.  The owners likely don’t understand the potential environmental impacts of dumping non-native fish into their local lakes and ponds.  Goldfish are also hardy and some may survive being flushed down a toilet and end up happily living in a lake if a pet owner chooses to try that method of fish disposal.

Why do goldfish have such a large impact on native species?  Goldfish can grow larger than many native species and compete with them for the same food sources.  In addition, goldfish eat small fish as well as the eggs of native species.  Invasive goldfish can also introduce new diseases into bodies of water that can spread to native species.  The presence of a large number of goldfish can even change the environment in a body of water: goldfish stir up mud and other matter when they feed, making the water cloudier and impacting aquatic plants.  Some scientists also believe that large populations of goldfish can lead to algae blooms, because goldfish feces are a potential nutrient source for algae.

Scientists are working to develop the most effective methods to deal with invasive goldfish.  In some cases, officials may drain a lake or use electrofishing to remove the goldfish.  As an individual, you can help by refraining from releasing pet fish into the wild.  It’s an understandable impulse to want to free an unwanted pet, but the consequences can be much larger than expected.  If you need to get rid of aquarium fish, contact local pet stores; some will allow you to return the fish.

To view a Cause Map of this problem, click on “Download PDF” above.

Deadly balcony collapse in Berkeley

By Kim Smiley

A 21st birthday celebration quickly turned into a nightmare when a fifth-story apartment balcony collapsed in Berkeley, California on June 16, 2015, killing 6 and injuring 7.  The apartment building was less than 10 years old and there were no obvious signs to the untrained eye that the balcony was unsafe prior to the accident.

The balcony was a cantilevered design, attached to the building on only one side by support beams.  A report by Berkeley’s Building and Safety Division stated that dry rot had significantly deteriorated the support beams, causing the balcony to catastrophically fail under the weight of 13 people.

Dry rot is decay caused by fungus and occurs when wood is exposed to water, especially in spaces that are not well-ventilated.  The building in question was built in 2007, and the extensive damage to the support beams indicates that there were likely problems with the waterproofing done during construction of the balcony.  Initial speculation is that the wood was not caulked and sealed properly when the balcony was built, which allowed the wood to be exposed to moisture and led to significant dry rot.  However, the initial report by the Building and Safety Division did not identify any construction code violations, which raises obvious questions about whether the codes are adequate as written.

As a short-term solution to address potential safety concerns, the other balconies in the building were inspected to determine whether they were at risk of a similar collapse so they could be repaired.  As a potential longer-term solution to reduce the risk of future balcony collapses in Berkeley as a whole, officials proposed new inspection and construction rules this week.  Among other things, the proposed changes would require balconies to include better ventilation and require building owners to perform more frequent inspections.  Only time will tell whether the proposed code changes will be approved by the Berkeley City Council, but something should change to help ensure public safety.

A reasonable long-term solution to this problem is needed because balconies and porches, naturally exposed to the weather, are inherently susceptible to rot.  Deaths from balcony failures are not common, but there have been thousands of injuries.  Since 2003, only 29 deaths from collapsing balconies and porches have been reported in the United States (including this accident), but an estimated 6,500 people have been injured.

Click on “Download PDF” above to see a Cause Map of this accident.  A Cause Map, a visual format of root cause analysis, lays out all the causes that contributed to an issue to show the cause-and-effect relationships.

Live anthrax mistakenly shipped to as many as 24 labs

By Kim Smiley

The Pentagon recently announced that live anthrax samples were mistakenly shipped to as many as 24 laboratories in 11 different states and two foreign countries.  The anthrax samples were intended to be inert, but testing found that at least some of them still contained live anthrax.  There have been no reports of illness, but more than two dozen people have been treated for potential exposure.  Work has been disrupted at many labs during the investigation as testing and cleaning are performed to ensure that no unaccounted-for live anthrax remains.

The investigation is still ongoing, but the issues with anthrax samples appear to have been occurring for at least a year without being identified.  The fact that some of the samples containing live anthrax were transported via FedEx and other commercial shipping companies has heightened concern over possible implications for public safety.

Investigations are underway by both the Centers for Disease Control and the Defense Department to figure out exactly what went wrong and to determine the full scope of the problem.  Initial statements by officials indicated that there may be problems with the procedure used to inactivate the anthrax.  Investigators have indicated that the work procedure was followed, but it may not have effectively killed 100 percent of the anthrax as intended.  Technicians believed the samples were inert prior to shipping them out.

It may be tempting to call the issues with the work process used to inactivate the anthrax the “root cause” of this problem, but in reality more than one cause contributed to this issue, and more than one solution should be used to reduce the risk of future problems to acceptable levels.  Clearly, there is a problem if the procedure used to create inactive anthrax samples doesn’t kill all the bacteria present, and that will need to be addressed, but there is also a problem if there aren’t appropriate checks and testing in place to identify that live anthrax remains in samples.  When dealing with potentially deadly consequences, a work process should, where possible, be designed so that a single failure cannot create a dangerous situation.  An effective test for live anthrax prior to shipping would have contained the problem to a single facility designed to handle live anthrax and drastically reduced the impact of the issue.  Additionally, another layer of protection could be added by requiring that a facility receiving anthrax samples test them upon receipt and handle them with additional precautions until they are determined to be fully inert.
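A minimal sketch of that layered-verification idea follows.  The workflow and names are hypothetical, not an actual laboratory protocol:

```python
# Hypothetical sketch of the layered checks described above -- the
# workflow and names are illustrative, not a real lab protocol.

def cleared_to_ship(inactivation_complete: bool,
                    pre_ship_test_negative: bool) -> bool:
    """First layer: require an independent viability test on top of
    the inactivation step, so one failed step can't ship live anthrax."""
    return inactivation_complete and pre_ship_test_negative

def cleared_to_handle_as_inert(shipped_cleared: bool,
                               receipt_test_negative: bool) -> bool:
    """Second layer: the receiving lab re-tests before treating the
    sample as inert."""
    return shipped_cleared and receipt_test_negative

# The inactivation step silently fails, but the pre-shipment test
# catches it and contains the problem to the originating facility:
print(cleared_to_ship(inactivation_complete=True,
                      pre_ship_test_negative=False))  # False -> do not ship
```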

Building additional testing into a work process does add time and cost, but sometimes it is worth it to identify small problems before they become much larger ones.  If the issues with the process used to create inert anthrax samples had been identified the first time it failed to kill all the anthrax, they could have been dealt with long before it was headline news and people were unknowingly exposed to live anthrax.  Testing both before shipping and after receipt of samples may be overkill in this case, but something more than just fixing the process for creating inert samples needs to be done, because inadvertently shipping live anthrax for more than a year indicates that issues are not being identified in a timely manner.

6/4/2015 Update: It was announced that anthrax samples suspected of inadvertently containing live anthrax were sent to 51 facilities in 17 states, DC and 3 foreign countries (Australia, Canada and South Korea).  Ten samples in 9 states have tested positive for live anthrax, and the number is expected to grow as more testing is completed.  31 people have been treated preventatively for exposure to anthrax, but there are still no reports of illness.  Click here to read more.

Deadly Train Derailment Near Philadelphia

By Kim Smiley

On the evening of May 12, 2015, an Amtrak train derailed near Philadelphia, killing 8 and injuring more than 200.  The investigation is ongoing and significant information about the accident remains unknown, but changes are already being implemented to help reduce the risk of future rail accidents and improve investigations.

Data collected from the train’s onboard event recorder shows that the train sped up in the moments before the accident until it was traveling 106 mph in a 50 mph zone where the track curved.  The excessive speed clearly played a role, but little information has been released about why the train was traveling so fast going into a curve.  The engineer controlling the train suffered a head injury during the accident and has stated that he has no recollection of it.  The engineer was familiar with the route and appears to have had all required training and qualifications.

As a result of this accident and the difficulty determining exactly what happened, Amtrak has announced that cameras will be installed inside locomotives to record the actions of engineers.  While the cameras may not directly reduce the risk of future accidents, the recorded data will help future investigations be more accurate and timely.

The excessive speed at the time of the accident is also fueling the ongoing debate about how trains should be controlled, and in particular about the implementation of positive train control (PTC) systems that can automatically reduce speed.  There was no PTC system in place in the northbound direction at the curve where the derailment occurred, and experts have speculated that one would have prevented the accident.  In 2008, Congress mandated nationwide installation and operation of positive train control systems by 2015.  Prior to the recent accident, the Association of American Railroads stated that more than 80 percent of the track covered by the mandate will not have functional PTC systems by the deadline.  The installation of PTC systems requires a large commitment of funds and resources, as well as communication bandwidth that has been difficult to secure in some areas, and some think the end-of-year deadline is unrealistic.  Congress is currently considering two different bills that would address some of these issues.  The recent deadly crash is sure to be front and center in the debates.
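To illustrate the basic principle, here is a simplified sketch of the kind of overspeed check a PTC system performs.  It is illustrative only, not any railroad’s actual implementation, and the distances are assumed inputs:

```python
# Simplified sketch of a PTC overspeed check -- illustrative only,
# not any railroad's actual system. The distances are assumed inputs
# that a real system would compute from track data and train dynamics.

def ptc_enforce(speed_mph: float, curve_limit_mph: float,
                distance_to_curve_ft: float, braking_distance_ft: float) -> str:
    """Apply the brakes automatically if the train can no longer slow
    to the curve's limit in the distance remaining."""
    if speed_mph <= curve_limit_mph:
        return "no action"
    if distance_to_curve_ft <= braking_distance_ft:
        return "automatic brake application"
    return "overspeed warning to engineer"

# 106 mph approaching a 50 mph curve, already past the safe braking point:
print(ptc_enforce(106, 50, distance_to_curve_ft=2500, braking_distance_ft=4000))
```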

In response to the recent accident, the Federal Railroad Administration ordered Amtrak to submit plans for PTC systems on the main Northeast Corridor (running between Washington, D.C. and Boston) at all curves where the speed limit is 20 mph less than that of the track leading to the curve.  Only time will tell how quickly positive train control systems will be implemented on the Northeast Corridor as well as the rest of the nation, and the debate on the best course of action will not be a simple one.

An initial Cause Map, a visual root cause analysis, can be created to capture the information that is known at this time.  Additional information can easily be incorporated into the Cause Map as it becomes available.  To view a high level initial Cause Map of this accident, click on “Download PDF”.

ISS Supply Mission Fails

By Kim Smiley

An unmanned Progress supply capsule failed to reach the International Space Station (ISS) and is expected to burn up during reentry into the atmosphere along with 3 tons of cargo.  Extra supplies are stored on the ISS and the astronauts onboard are in no immediate danger, but the failure of this supply mission is another in a string of high-profile issues with space technology.

This issue can be analyzed by building a Cause Map, a visual format of root cause analysis.  A Cause Map intuitively lays out the causes that contributed to an issue to show the cause-and-effect relationships.  To build a Cause Map, “why” questions are asked and the answers are documented on the Cause Map along with any relevant evidence to support the cause.

So why did the supply mission fail?  The mission failed because the supply capsule was unable to dock with the ISS, and it was unable to dock because mission control could not communicate with the spacecraft.  The Progress is an unmanned, expendable Russian cargo capsule that cannot safely dock with a space station without communication with mission control.  Mission control needs to verify that all systems are functional after launch and needs a communication link to navigate the unmanned capsule through docking.

Images of the capsule showed that two of its five antennas failed to unfold, leading to the communication issues.  Debris spotted around the capsule while it was in orbit indicates a possible explosion.  No further information has been released about what might have caused the explosion, and it may be difficult to decisively determine the cause since the capsule will be destroyed during reentry.

The ISS recycles oxygen and water to an impressive degree, so food is the first supply that would run out, but NASA has stated that there are at least four months of food onboard at this time.  The failure of this mission may mean that the cargo for future missions will need to be altered to include more basic necessities and less scientific equipment, but astronaut safety is not a concern at this time.  The failure does put additional pressure on the next resupply mission, scheduled to be flown by SpaceX in June, in addition to creating more bad press for space programs already struggling through a turbulent time.

To view an intermediate Cause Map of this issue, click on “Download PDF” above.