All posts by Kim Smiley

Mechanical engineer, consultant and blogger for ThinkReliability, obsessive reader and big believer in lifelong learning

Possible Toyota Prius Recall

By Kim Smiley

A new potential safety issue has developed, and Toyota may recall the newest model of the gas-electric hybrid Prius, which has been sold since last May.  The National Highway Traffic Safety Administration has received 124 reports from consumers claiming that the brakes sometimes fail to engage immediately.  Toyota has stated that it has received 180 reports of braking problems in Japan and the United States, including four incidents that resulted in accidents, with two people receiving minor injuries.

Even a slight delay in the response of car braking systems can be very dangerous because cars can travel nearly 100 feet in one second at highway speeds.
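As a quick sanity check on the 100-feet-per-second figure, the mph-to-ft/s conversion can be sketched in a few lines. The 55-70 mph speeds below are assumed typical US highway speeds, not figures from the post:

```python
# Sanity check: feet traveled per second at highway speeds.
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600

def feet_per_second(mph: float) -> float:
    """Convert miles per hour to feet per second."""
    return mph * FEET_PER_MILE / SECONDS_PER_HOUR

for mph in (55, 65, 70):
    print(f"{mph} mph = {feet_per_second(mph):.0f} ft/s")
```

At 65-70 mph a car does indeed cover roughly 95-103 feet every second, so a one-second braking delay consumes nearly a third of a football field.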

No official details are known yet on what is causing the delay in brake engagement.  In one article, a powertrain expert speculated that it was a software glitch occurring when the hybrid switches between the electric motor and the internal combustion engine.  In the Prius design, the same motor that powers the car also powers the brakes.  When the hybrid switches between motors, there may be a momentary loss of power to the brakes during the transition.

A preliminary root cause analysis can be started using the available information.  The Cause Map can be expanded and revised as necessary as new information becomes available.  Click on the “Download PDF” button above to view the initial Cause Map.

Toyota has not stated whether a formal recall will be made.  A potential recall would affect 300,000 vehicles worldwide.

This new issue comes on the heels of a major announcement on January 21 where 2.3 million cars were recalled because of sticky gas pedals that can cause sudden acceleration. Additionally, there was a recall issued in September 2009 because there was a potential for floor mats to move out of place and cause the accelerator to stick. (A previous blog addressed this issue.)

Toyota shares dropped 21 percent following the January announcement, and any further safety issues will likely hurt consumer confidence and stock prices.

Two DC Metro Workers Killed

By Kim Smiley

On January 26, 2010 just before 2 am, two Metro workers were killed near the Rockville metro station.  They were crushed by a metro utility vehicle while working on the track to install safety equipment.

The utility vehicle is a gas-powered truck designed to operate on the track when electricity is shut off.  These vehicles, called high-rail vehicles, are typically used to carry equipment.  At the time of the accident, the vehicle was placing devices that alert approaching trains that a work crew is in the area.

Many details of this accident are not available yet, but a preliminary root cause analysis can be started.  The basic information can be documented in an Outline and an initial cause map can be started.  Click on the “Download PDF” button above to see what this would look like.

The men killed and the workers in the vehicle were not part of the same crew and it’s not clear why the driver of the truck wasn’t aware that workers were in the area.  At the time of the accident the vehicle was traveling in reverse, which is a routine mode of operation.

Safety regulations require all vehicle operators to be informed about work crew locations, but it isn’t clear if that is being done effectively.

The National Transportation Safety Board (NTSB) has begun to investigate this incident and more details should be available as their investigation progresses.   The NTSB is currently reviewing employee work history and training and gathering all relevant data such as radio recordings and work procedures.

The DC Metro system has the worst safety record of any metro system in the country.  Five workers have now been killed while on the tracks in the last seven months.  There was also a metro train accident that killed 9 people on June 22, 2009.  To see a cause map of the June accident, click here.

Stuck in the Chunnel

By Kim Smiley

High-speed train service in the Channel Tunnel (connecting Britain, France and Belgium) resumed partially on Tuesday, December 22, 2009 after a complete stoppage that began Friday, December 18th when five trains failed inside the tunnel.

Eurostar, the operator of the trains, has stated that the failures were caused by an electrical fault: unseasonably cold weather was believed to produce finer, lighter snow than usual, which penetrated the snow screens protecting the engines and then condensed in the warmer air inside the tunnel.

There was a delay in rescuing the trapped passengers in the tunnel – some of whom were trapped for up to 18 hours. Responsibility for rescue lies with both the train operator and the tunnel operator, and the process for rescue obviously needs to be reviewed by both parties to determine a better course of action the next time a rescue plan is needed. Additionally, the train operator will want to review its policies based on the reports of abysmal customer service throughout the event.

Eurostar, British Rail Class 373 at St Pancras railway station by Oxyman (11/23/07)

Eurostar took immediate action to install finer filters on the engine intakes and trains were put back into service on Tuesday, the 22nd. The company has also stated it will reimburse passengers for the delay, but this solution will take longer to implement.

The entire root cause analysis investigation so far is shown on the downloadable PDF. A thorough root cause analysis built as a Cause Map can capture all of the causes in a simple, intuitive format that fits on one page. Even more detail can be added to this Cause Map as the analysis continues. As with any investigation the level of detail in the analysis is based on the impact of the incident on the organization’s overall goals. To view the PDF, click on “Download PDF” above.

Toyota Recall: Problems, Interim Solutions and Permanent Solutions

by Kim Smiley

On September 29, 2009, Toyota/Lexus issued a safety advisory that some 2004-2010 model year vehicles could be prone to a rapid acceleration issue if the floor mat moved out of place and jammed the accelerator pedal. Although the recall is only applicable in the U.S. and Canada because of the type of floor mats used, over 4 million vehicles are affected by the recall.

Although all the solutions to this problem have not yet been implemented, we can look at the issue so far in a Cause Map, or visual root cause analysis. First we define the problem. Here we could consider the problem the recall, or the acceleration problems. We can list all the models and years that are affected by the recall, and that the recall is limited to the U.S. and Canada.

We define the problem with respect to the organization’s goals. There have been at least 5 fatalities addressed by the National Highway Traffic Safety Administration (NHTSA), though some media outlets have reported more. Additionally, the NHTSA has reported 17 accidents (again, some claim more) and has received at least 100 complaints. The fatalities and accidents are impacts to the safety goal. Complaints are impacts to the customer service goal. The recall of more than 4 million cars is an impact to the production/schedule goal, and the replacement of the accelerator pedals and floor mats as a result of the recall is estimated to cost $250 million, which is an impact to the property goal.

Once we’ve completed the outline, we can begin the Cause Map, or the analysis step of the process. The fatalities are caused by vehicle crashes resulting from a loss of control of the vehicle. The loss of control is caused by a sudden surge of acceleration, inability to brake, and sometimes an inability to shut down the engine of the car. Toyota says the sudden bursts of acceleration are caused by entrapment of the accelerator pedal due to interference from floor mats. Toyota disputes the possibility that there may be a malfunction in the electronic control system, saying its own research has ruled it out.

The vehicles are unable to brake because the brake is non-functional when the accelerator pedal is engaged, as it is in these cases. Additionally, owners whose models are equipped with keyless ignition cannot quickly turn off their ignition. These models require the ignition button to be pressed for 3 seconds to prevent inadvertent engine stops, and the instructions are not posted on the dashboard, so owners who weren’t meticulous about reading (or remembering) the owners’ manual may not know how to turn off the car while moving at high speed.

When the Cause Map is complete to a sufficient level of detail, it’s time to explore some solutions. In this case, the permanent solutions (which will reduce the risk of these accidents most significantly) to be implemented by Toyota are to reconfigure the accelerator pedal, replace the floor mats, and install a brake override system which will allow the brakes to function even with the accelerator pedal engaged. However, designing and implementing these changes for more than 4 million cars will take some time, so owners of Toyotas require interim solutions. Interim solutions are those that do not sufficiently reduce the risk for long-term applicability but can be used as a stop-gap until permanent solutions are put in place. In this case, Toyota has asked owners to remove floor mats, and has put out guidance that drivers who are in an uncontrolled acceleration situation should shift the engine into neutral, which will disengage the engine and allow the brake to stop the car.

View the high level summary of the investigation by clicking “Download PDF” above.

Learn more about the recall at the NHTSA website.

Missed by 150 Miles?

By Kim Smiley

On October 21, 2009, Northwest Airlines Flight 188 left San Diego and overshot its intended destination, Minneapolis-St. Paul, by about 150 miles. Luckily, the incident resulted in a safe landing at the intended destination, but the circumstances surrounding the flight remain vague and unsettling.

One of the strangest facts to come out is that the plane lost contact with air-traffic controllers for one hour and 18 minutes.  In the post-9/11 aviation environment, controllers are very sensitive to planes that stop responding to communications.  The Federal Aviation Administration contacted military authorities about the possibility of terrorism.  Fighter jets were ready to respond and prepared to intercept the plane if necessary.

So what happened?  How did the pilots overshoot the airport by such a significant amount without realizing their mistake?

Initial reports were that the pilots stated that they were in a heated discussion and simply lost situational awareness, but many aviation experts have stated it is unlikely that pilots would miss repeated hails for over an hour because of an argument.  Other reports have speculated that maybe both pilots fell asleep.

The most recent information to come out is that the pilots were using their laptops during the time they failed to respond to hails.  The pilots stated that they both were working on laptops and that they were discussing monthly flight crew scheduling.

Details concerning the overshoot are still being investigated, but an initial root cause analysis can be started to help document the investigation as it progresses, beginning with an Outline of the basic information.

A preliminary Cause Map can also be started at this stage of an investigation.  As more information becomes known, a detailed Cause Map can be built to document all the relevant information.

More data should be available soon.  The Cockpit Voice Recorder and the Flight Data Recorder have both been sent to the National Transportation Safety Board for analysis and interviews of all involved parties continue.

On October 28, it was announced that the FAA has revoked the licenses of the two pilots involved because they violated several federal regulations, including failing to comply with air traffic control instructions and operating carelessly and recklessly. There are currently no specific federal rules banning the use of laptops once a flight climbs above 10,000 feet.

Genesis Spacecraft Crash

By Kim Smiley

The mission of the Genesis spacecraft was to collect the first samples of the solar wind and return the samples to earth to be analyzed. The goal was to provide fundamental data to help scientists determine the composition of the sun and learn more about the formation of our solar system.

Unfortunately, during descent on September 8, 2004, Genesis crashed into the earth at high velocity. Its descent was slowed only by air resistance, and the collection capsule was damaged on impact.

What happened? What went wrong with the re-entry?

A root cause analysis can be performed to evaluate this incident. The investigation can be documented by building a Cause Map that collects all the information associated with the incident in a visual format that is easy to follow.

In this case, the main goal we’ll consider is the production goal. The production goal was impacted because the collection capsule was damaged, which had the potential to destroy all the physical data collected during the three year mission.

The investigation can proceed by asking “why” questions and adding the causes to the Cause Map. In this scenario, the collection capsule was damaged because it impacted the earth at high velocity. This occurred because the parachute that was intended to slow the descent to allow for a midair recovery by helicopter failed to deploy.

Post-accident investigation determined that the parachute was never triggered to deploy because gravity switches were installed backwards. The backward installation occurred for several reasons: the design was flawed, the design review process didn’t detect the error and the testing performed didn’t detect the error.
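The why-question chain described above can be sketched as a simple cause-and-effect structure. The sketch below models it as a Python dictionary mapping each effect to its causes, with wording paraphrased from this post:

```python
# A minimal sketch of the Genesis cause chain: effect -> list of causes.
cause_map = {
    "production goal impacted": ["collection capsule damaged"],
    "collection capsule damaged": ["impacted the earth at high velocity"],
    "impacted the earth at high velocity": ["parachute failed to deploy"],
    "parachute failed to deploy": ["gravity switches installed backwards"],
    "gravity switches installed backwards": [
        "flawed design",
        "design review missed the error",
        "testing missed the error",
    ],
}

def why_chain(effect: str, depth: int = 0) -> None:
    """Print each effect followed by its causes, indented one level per 'why'."""
    print("  " * depth + effect)
    for cause in cause_map.get(effect, []):
        why_chain(cause, depth + 1)

why_chain("production goal impacted")
```

Note how a single chain of "whys" branches at the end: the backward switch installation has three parallel causes, which is why a Cause Map is a map rather than a single line of dominoes.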

Luckily, the impact to the production goal has been less significant than it might have been in this case. The collection capsule was cushioned somewhat by the soft ground and while desert dirt entered the capsule, liquid water did not. The solar wind particles were embedded in the collection materials and the contaminating dirt was able to be removed for the most part. NASA has been able to retrieve significant amounts of data from the mission.

NASA’s Mishap Report can be downloaded for free for additional information on the incident.

A one page PDF showing a high level Cause Map of the incident can be downloaded by clicking on the button above.

The Space Junk Problem

By Kim Smiley

Last week, the Defense Advanced Research Projects Agency (DARPA) issued a request for ideas on how to clean up orbital debris, commonly known as space junk. The term refers to all the objects currently in orbit around earth that no longer serve a useful purpose.

Why would DARPA want to put effort into removing space junk?  Why is it a problem?

A root cause analysis of this issue can be performed.  The first step is to identify the problem.  Then the investigation can be documented as a Cause Map and the causes contributing to the space junk problem should be investigated. In this case, the problem is that space junk poses a threat to unmanned and manned spacecraft, including satellites.

Space junk comes from a variety of sources (which will be discussed later) and is a wide variety of sizes. Impacts with large debris (greater than 1 kilogram) can destroy spacecraft at orbital velocities.  The only protection currently available is to move the spacecraft out of the path of space junk. Impacts with tiny debris cause erosion damage and can substantially shorten the life span of spacecraft.  Solar panels and windows are especially vulnerable to this type of damage.
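A back-of-the-envelope kinetic energy calculation shows why even a 1 kilogram object is so destructive at orbital speed. The ~7.8 km/s low-earth-orbit velocity and the TNT comparison below are illustrative assumptions, not figures from this post:

```python
# Rough kinetic-energy estimate for a 1 kg piece of debris at low-earth-orbit
# speed. The velocity and TNT energy density are assumed illustrative values.
def kinetic_energy_joules(mass_kg: float, velocity_m_s: float) -> float:
    """Classical kinetic energy: KE = 1/2 * m * v^2."""
    return 0.5 * mass_kg * velocity_m_s ** 2

LEO_VELOCITY_M_S = 7800        # typical low-earth-orbit speed (~7.8 km/s)
TNT_JOULES_PER_KG = 4.184e6    # energy released by 1 kg of TNT

ke = kinetic_energy_joules(1.0, LEO_VELOCITY_M_S)
print(f"1 kg at {LEO_VELOCITY_M_S} m/s carries {ke / 1e6:.1f} MJ "
      f"(~{ke / TNT_JOULES_PER_KG:.1f} kg of TNT equivalent)")
```

Roughly 30 megajoules in a 1 kg object, on the order of several kilograms of TNT, which is why dodging large debris is currently the only viable defense.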

Destroyed spacecraft then become part of the problem as long as they remain in orbit as defunct space junk themselves.

In addition to nonfunctioning, dead spacecraft, some of the causes of space junk are boosters from past spacecraft launches, lost equipment, and debris from weapons testing.  These causes should all be added to the cause map.

The problems associated with space junk continue to grow as more and more debris is created in earth’s orbit.

The largest space debris incident in history occurred in 2007, when China performed an anti-satellite missile test and intentionally blew up a defunct satellite.  The test targeted a satellite in the most heavily populated region of earth’s orbit.

Currently, the Space Surveillance Network tracks more than 20,000 objects in orbit, and this number only includes those large enough to track.  Thousands more objects too small to track are estimated to be in orbit.

Hopefully DARPA is able to find an effective solution to mitigate the problem and reduce the risk posed by space junk.

Click on the “Download PDF” button above to view an intermediate level Cause Map of this issue.

NASA Budget Realities

By Kim Smiley

A recent report by a White House panel of independent space experts says NASA’s current goal to return to the moon isn’t feasible with the current budget. The panel estimates that NASA would need about $3 billion extra a year beyond the current budget to continue with human space flight.

The budget shortfall is obviously a problem that may prevent NASA from meeting their overall organizational goals.  A root cause analysis built as a Cause Map can be created to understand how this issue developed.

In this case, the production goal is impacted because NASA is likely to be unable to meet the stated goal of a moon mission by 2020.  This is caused by the high cost of a moon mission, other budget considerations (such as the cost of possibly extending the space shuttle program and the International Space Station) and the limited NASA budget.  The causes of each of these can then be explored.

NASA has been working toward a return to the moon because, five years ago, then-President George W. Bush directed that NASA should work to return astronauts to the moon, with a proposed date of 2020.  NASA has already spent $7.7 billion working toward this goal, including the design and construction of new rockets.

Part of the plan to pay for this venture was to retire the space shuttle in 2010 and deorbit the International Space Station in 2015, but the panel also recommended reevaluating these deadlines, which would add further budget pressure.

The panel found that extending the life of the space station beyond 2015 would allow a better return on the billions of dollars invested in it.  The panel also felt the space shuttle should be evaluated for possible life extension in order to continue servicing the space station, since no viable alternative will be developed in the necessary time frame.

NASA’s budget continues to be squeezed as national budget constraints increase.  In order to raise funds, the panel also recommended involving other countries and private for-profit firms in addition to increasing NASA’s budget.

This problem has no easy, clear solution.  Only time will tell how President Obama will choose to respond to these findings and if human space flight will continue to be a goal for NASA.

Loan Payments and Unemployment

By Kim Smiley

In recent years, the cost of attending college in the US has risen about three times faster than inflation, and average student loan amounts have increased accordingly.  Combine this with the highest unemployment rate for college graduates since 1979 (4.8% in May, up from 2.1% at the start of the recession) and the lower starting salaries typical of recessions, and many recent graduates are struggling to make their loan payments.

As with any problem, it is possible to perform a root cause analysis of the issue. To begin, let’s assume the production goal of an individual considering college is to earn a comfortable living, and the potential impact to this goal is difficulty making loan payments.

The causes of this potential problem repaying loans could be unemployment after graduation, lower starting salaries for new hires during a recession and large loan payments.  The causes of each of these factors can then be explored.  Click on the Download PDF graphic above to view an intermediate level Cause Map with more causes added.

A recent USA Today article entitled “In a Recession, Is College Worth It? Fear of Debt Changes Plans” discussed how many students are rethinking their college plans.  Enrollment at community colleges is soaring, and many students are choosing less expensive options and skipping the big-name private institutions.

This makes sense when considering the potential difficulty repaying college loans because the only cause that the student has direct control over is the size of the loan payments.

The bottom line is that each individual needs to think through their particular situation, consider how much the college costs and how much the starting salary for their particular degree is projected to be.  There are very real dangers in amassing large student loans without calculating the monthly payments and ensuring that they are within a realistic budget.
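Calculating that monthly payment is straightforward with the standard fixed-rate amortization formula. The loan amount, interest rate, and term below are hypothetical figures chosen for illustration, not data from the article:

```python
# Monthly payment for a fixed-rate loan: M = P*r / (1 - (1+r)^-n),
# where r is the monthly rate and n the number of monthly payments.
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Standard fixed-rate amortization formula."""
    r = annual_rate / 12      # monthly interest rate
    n = years * 12            # number of monthly payments
    if r == 0:
        return principal / n  # interest-free edge case
    return principal * r / (1 - (1 + r) ** -n)

payment = monthly_payment(30_000, 0.068, 10)
print(f"$30,000 at 6.8% over 10 years: ${payment:.2f}/month")
```

Running this kind of number before signing the loan papers, rather than after graduation, is exactly the budgeting check the paragraph above recommends.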

The reality is that some universities cost more and there is no guarantee that attending a more expensive college will result in a higher salary.  It may well be a smart decision to choose a less expensive option when selecting a college.

If you’re interested in reading an analysis of the 2009 Financial Mess, please click here.

Lessons from Three Mile Island

By Kim Smiley

The partial meltdown of a reactor core at the Three Mile Island nuclear power plant is one of the best-known engineering disasters in US history.  Luckily, no one was injured and there was no significant environmental impact, but the potential for major issues was very real.  Three Mile Island also had a huge impact on the nuclear industry and required a major cleanup effort.

Performing a root cause analysis of historical incidents is useful because there are a number of lessons learned that can often be applied across a variety of industries.

As is true with any complex system, there were many causes that contributed to the Three Mile Island incident.  At the most simplified level, cooling water flow was stopped to the primary system (the nuclear portion).  The primary system then started to heat up, increasing the pressure to the point that a relief valve lifted.  The relief valve then failed to reseat and a large volume of coolant was lost.  The core eventually overheated because it was uncovered due to the loss of coolant.

Another factor that contributed significantly to the Three Mile Island incident was operator action during the casualty, which unfolded over several shifts.  Had operators been able to understand the status of the plant sooner, it could have been put into a safe condition.

At first glance, it’s easy to stop at this point and use a term like “operator error,” but a thorough analysis requires more digging. Even if the technology being considered is radically different from a nuclear power plant, there are many lessons to be learned from studying how the control room design affected operator actions during the incident.

The design of the control room significantly contributed to the operators’ inability to identify plant conditions.  The control room was huge, with hundreds of instruments to monitor, some of which were on the backs of the control panels and couldn’t be viewed from the normal watch-standing locations.  Dozens of alarms, both audible and flashing lights, went off in a very short period of time without any obvious priority.  The alarms continued throughout the casualty, and the sheer volume of information was nearly impossible to interpret accurately.
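One of those design lessons, alarm prioritization, can be illustrated with a toy sketch: surface the most critical alarms first rather than presenting operators with an undifferentiated flood. The alarm names and priority levels below are invented for the example:

```python
# A toy alarm queue: lower number = higher priority, served most-critical first.
import heapq

def prioritized(alarms):
    """Return alarms ordered most-critical first using a binary heap."""
    heap = list(alarms)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

incoming = [
    (3, "printer out of paper"),
    (1, "reactor coolant level low"),
    (2, "relief valve position unknown"),
    (1, "core exit temperature high"),
]
for priority, message in prioritized(incoming):
    print(priority, message)
```

The point is not the data structure but the design principle: when dozens of alarms fire at once, the interface, not the operator, should do the triage.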

Many industries continue to benefit from the lessons learned from the design of the control room.

For more detailed information on the Three Mile Island accident, please see the NRC’s Three Mile Island fact sheet.