All posts by Kim Smiley

Mechanical engineer, consultant and blogger for ThinkReliability, obsessive reader and big believer in lifelong learning

Boeing 747 “Dreamlifter” Cargo Jet Lands At Wrong Airport

By Kim Smiley

On November 21, 2013, a massive Boeing 747 Dreamlifter cargo jet made national headlines after it landed at the wrong airport near Wichita, Kansas.  For a time, the Dreamlifter looked to be stuck at the small airport with a relatively short runway, but it was able to take off safely the next day after some quick calculations and a little help turning around.

At the time of the airport mix-up, the Dreamlifter was on its way to McConnell Air Force Base to retrieve Dreamliner nose sections made by nearby Spirit AeroSystems. Dreamlifters are notably large because they are modified jumbo jets designed to haul pieces of Dreamliners between the different facilities that manufacture parts for the aircraft.

So how does an airplane land at the wrong airport? It’s not entirely clear yet how a mistake of this magnitude was made. The Federal Aviation Administration is planning to investigate the incident to determine what happened and to see whether any regulations were violated. What is known is that the airports have some similarities in layout that can be confusing from the air. First off, there are three airports in fairly close proximity in the region. The intended destination was McConnell Air Force Base, which has a runway configuration similar to Jabara airfield where the Dreamlifter landed by mistake. Both runways run north-south and are nearly parallel. It can also be difficult to determine how long a runway is from the air, so the shorter length isn’t necessarily easy to see. Beyond the airport similarities, the details of how the plane landed at the wrong airport haven’t been released yet.

What is known can be captured by building an initial Cause Map, a visual format for performing a root cause analysis. One of the advantages of Cause Maps is they can be easily expanded to incorporate more information as it becomes available. The first step in Cause Mapping is to fill in an Outline with the basic background information and to list how the issue impacts the overall goals. There are a number of goals impacted in this example. The potential for a plane crash means that there was an impact to both the safety and property goals because of the possibility of fatalities and damage to the jet. The effort needed to ensure that the jet could safely take off on a shorter runway is an impact to the labor goal and the delay was an impact to the schedule goal. The negative publicity surrounding this incident can also be considered an impact to the customer service goal.

Once the Outline is completed, the Cause Map is built by asking “why” questions and intuitively laying out the answers until all the causes that contributed to the issue are documented.  Click on “Download PDF” above to see an Outline and initial Cause Map of this issue.

Good luck with any air travel planned for this busy holiday week.  And if your plane makes it to the right airport (even if it’s a little late), take a moment to be thankful because it’s apparently not the given I’ve generally assumed.

Can the Epidemic of Smartphone Thefts be Stopped?

By Kim Smiley

About 1.6 million handheld devices were stolen in the United States in 2012, the majority of which were smartphones.  In fact, the frequency at which the popular Apple devices are taken has given rise to a whole new term, “apple picking”.  Stolen smartphones cost consumers nearly $30 billion a year.  These thefts affect a significant number of smartphone owners with approximately 10 percent reporting that they have had a device stolen.

The problem of smartphone theft can be analyzed by building a Cause Map, a method for performing a visual root cause analysis.  A Cause Map is built by completing an Outline by both filling in the basic background information and listing how the issue impacts the overall goals.  The impacts to the goals from the Outline are then used as the first step in building the Cause Map.  Causes are then added by asking “why” questions to determine what other causes contributed to an issue.  (To view a high level Cause Map of this issue, click on “Download PDF” above.)

So why do so many smartphones get taken? Smartphones are a popular target because it is lucrative to resell them, they are relatively easy to steal, and many of the crimes go unpunished. Smartphones are fairly easy to steal because they are readily available since so many people carry them, and they are both small and lightweight. Many criminals who steal smartphones go unpunished because there are so many of them taken and it is difficult to locate the thieves. Many stolen smartphones are shipped overseas, which further complicates the situation.

The black market for smartphones is lucrative because the items are popular and relatively expensive to buy new.  People buy stolen smartphones because they are cheaper and they are able to be used by the “new owner”, especially overseas where the networks are different and phones deactivated in the US may be able to be used.

One of the possible solutions suggested to reduce the number of smartphone thefts is to include a kill switch in smartphone software.  This kill switch would essentially make the phone worthless because it would no longer function no matter where it was in the world.  If smartphones no longer have resale value, then there would be little incentive to steal them and the number of thefts should dramatically decrease.  While this idea is elegant in its simplicity, like most things there is more that needs to be considered.

The addition of a kill switch was recently rejected by cellphone carriers because of concerns about hacking and problems with reactivation.  If hackers found a way to flip the kill switches they would have the ability to destroy a huge number of smartphones from anywhere in the world.  Depending on how many users were targeted this could have a huge impact, which could be especially problematic for people who use their phones in an official capacity like law enforcement. It doesn’t take much imagination to see how this scenario could go horribly wrong. The proposed kill switch is also permanent so users won’t be able to reactivate their phones and any stolen phones that were recovered would be useless.  Companies continue to work on a number of ideas to make it more difficult to resell smartphones, but there isn’t general agreement on the best approach yet.  Only time will tell if the tide of smartphone thefts has peaked.

The Morris Worm: The First Significant Cyber Attack

By Kim Smiley

In 1988 the world was introduced to the concept of a software worm when the Morris worm made headlines for significantly disrupting the fledgling internet. The mess left in the wake of the Morris worm took several days to clean up. The estimates for the cost of the Morris worm vary greatly, from $100,000 to $10,000,000, but even at the lower end of the range the numbers are still substantial.

A Cause Map, or visual root cause analysis, can be used to analyze this issue.  A Cause Map is built by asking “why” questions and using the answers to visually lay out the causes that contributed to an issue to show the cause-and-effect relationships.  In this example, a programmer was trying to build a “harmless” worm that could be used to gauge the size of the internet, but he made a mistake.  The goal was to infect each computer one time, but the worm was designed to duplicate itself every seventh time a computer indicated it already had the worm to make the worm hard to defend against.  The problem was that the speed of propagation was underestimated. Once released, the worm quickly reinfected computers over and over again until they were unable to function and the internet came crashing down.  (To view a Cause Map of this example, click on “View PDF” above.)
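The reinfection rule described above can be sketched in a few lines of Python. This is only a toy model built from the description in this post, not the worm's actual code: the function name `copies_on_host` and the idea of counting "attempts" against a single computer are assumptions made for illustration.

```python
def copies_on_host(attempts, reinfect_every=7):
    """Count worm copies on one computer after `attempts` infection attempts.

    Toy model (an assumption, not the real worm's logic): the first attempt
    always installs a copy. After that, the computer reports that it already
    has the worm, and the worm backs off except on every
    `reinfect_every`-th such report, when it installs another copy anyway.
    """
    copies = 0
    already_infected_reports = 0
    for _ in range(attempts):
        if copies == 0:
            copies = 1  # first infection of a clean computer
            continue
        already_infected_reports += 1
        if already_infected_reports % reinfect_every == 0:
            copies += 1  # duplicate despite the "already infected" answer
    return copies


# The copy count keeps climbing as long as attempts keep arriving.
for attempts in (1, 8, 71, 7001):
    print(attempts, "attempts ->", copies_on_host(attempts), "copies")
```

Even in this simplified picture, nothing caps the number of copies on a computer. If attempts arrive faster than expected, the copies pile up, which is consistent with the underestimated speed of propagation described above.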

One of the lasting impacts from the Morris worm that is hard to quantify is the impact on cyber security.  The worm exploited known bugs that no one had worried about enough to fix.  At the time of the Morris worm, there was no commercial traffic on the internet or even Web sites.  The people who had access to the internet were a small, elite group and concerns about cyber security hadn’t really come up.  If the first “hacker” attack had had malicious intent behind it and came a little later it’s likely that the damage would have been much more severe.  While the initial impacts of the Morris worm were all negative, it’s a positive thing that it highlighted the need to consider cyber security relatively early in the development of the internet.

It’s also interesting to note that the programmer behind the Morris worm, Robert Tappan Morris, became the first person to be indicted under the 1986 Computer Fraud and Abuse Act. He was sentenced to a $10,050 fine, 400 hours of community service, and three years of probation. Morris was a 23-year-old graduate student at the time he released his infamous worm. After this initial hiccup, Morris went on to have a successful career and now works in the MIT Computer Science and Artificial Intelligence Laboratory.

“Ghost Train” Causes Head-On Collision in Chicago

By Kim Smiley

On September 30, 2013, an unoccupied train collided head-on with another train, sending 30 people to the hospital in Chicago. In a nod to the season and the bizarre circumstances of the accident, the unoccupied train has been colorfully dubbed “the ghost train”.

So what caused the “ghost train” and how did it end up causing a dangerous train collision?  Investigators from the National Transportation Safety Board (NTSB) are still reviewing the details of the accident, but some information is available.  An initial Cause Map, or visual root cause analysis, can be built to capture what is already known and can be expanded to incorporate more information as the investigation progresses.  A Cause Map is built by asking “why” questions and documenting the answers to visually lay out all the causes that contributed to an accident to show the cause-and-effect relationships from left to right.

In this example, the trains collided because an unoccupied train began moving and the safety systems in place did not stop the train.  Investigators still haven’t determined exactly what caused the train cars to move, but a key piece of the puzzle is that there was still power to the cars while they were being stored in a repair terminal awaiting maintenance.  The NTSB believes that it was common practice to leave power to cars so that the lights could be used to illuminate the terminal.  Workers used the lights to discourage graffiti and vandalism because the terminal was located in a high crime neighborhood. 

Investigators will need to not only determine why the train started rolling, but also learn more about why the safety systems didn’t prevent the accident. Before colliding with another train, the unoccupied train traveled through five mechanical train-stop mechanisms, each of which should have stopped a train without a driver. At each train-stop, the emergency brakes were applied and the train paused momentarily, but then it started moving again because the setting on the master lever caused the train to restart. Review of the safety systems will need to be part of the investigation to ensure that adequate protection is in place to prevent anything similar from occurring again.
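For readers who think in code, the cause-and-effect relationships described so far can be sketched as a small data structure. This is only an illustration of the idea, not how the Cause Map in the PDF is built: the dictionary layout, the wording of each cause, and the helper name `all_causes` are assumptions, and the entries cover only what has been reported so far (the powered cars are a contributing factor, not a confirmed cause of the movement).

```python
# Each effect maps to the causes found by asking "why" about it.
cause_map = {
    "trains collided": [
        "unoccupied train began moving",
        "safety systems did not stop the train",
    ],
    "unoccupied train began moving": [
        "power was left on to the stored cars",
    ],
    "power was left on to the stored cars": [
        "lights were used to illuminate the terminal",
    ],
    "lights were used to illuminate the terminal": [
        "workers wanted to discourage graffiti and vandalism",
    ],
    "safety systems did not stop the train": [
        "master lever setting restarted the train after each emergency stop",
    ],
}


def all_causes(cause_map, effect):
    """Return every cause that sits behind `effect`, following each "why"."""
    found = []
    for cause in cause_map.get(effect, []):
        found.append(cause)
        found.extend(all_causes(cause_map, cause))
    return found


for cause in all_causes(cause_map, "trains collided"):
    print("-", cause)
```

Because the map is just a lookup from effect to causes, new findings from the investigation can be added as extra entries without changing anything that is already there.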

The NTSB investigation is still ongoing, but the NTSB has stated that de-energizing propulsion power and using an alternate brake setting could help prevent unintended movement of unoccupied train cars. Additionally, the NTSB believes the use of a wheel chock and/or derail would ensure that a train stopped by a mechanical train-stop mechanism remains stopped. Based on the information already uncovered, the NTSB has issued an urgent safety recommendation to the Federal Transit Administration (FTA). The NTSB recommended that the FTA issue a safety advisory to all rail transit properties to review procedures for storing unoccupied train cars to ensure that they are left in a safe condition that won’t allow unintended movement and to ensure that they have redundant means of stopping any unintended movement. More information is needed to fully understand this accident, but these precautions would be effective solutions that can be quickly implemented to reduce the risk of train accidents.

Rules on Inflight Electronics May be Changing Soon

By Kim Smiley

In welcome news to many airline passengers, it looks like the FAA may soon allow the use of personal electronic devices during the entire duration of flights, including takeoff and landing. The current restrictions on the use of personal electronics are being reviewed following a recent recommendation by an aviation advisory committee made up of pilots, mechanics, engineers and other aviation experts.

A Cause Map, a visual format for performing a root cause analysis, can be used to analyze this issue. A Cause Map is built by asking “why” questions and intuitively laying out the many causes that contributed to an issue to show the cause-and-effect relationships. The first step in the Cause Mapping process is to document the basic background information as well as list how the issue impacts the goals in an Outline.

One of the major impacts for this example is that there is concern that use of personal electronic devices onboard aircraft may be dangerous and increase the risk of a plane crash.  Currently, the use of personal electronics is allowed once a plane is above 10,000 feet, which is basically the whole flight except landing and takeoff which are considered the most critical portions of the flight.   These restrictions are in place because pilots depend on electronic systems, such as navigation and communications systems, to safely do their job and there is concern about the potential for interference with these vital systems.

How likely it is that dangerous interference could be an actual issue is debated. There were 75 reports by pilots of suspected electronic device interference between 2003 and 2009, according to the International Air Transport Association. However, it’s difficult to reproduce interference and it has never been cited as a cause in any airplane accident. The current ban on the use of electronics also seems to be loosely enforced, raising questions about its necessity and effectiveness. (A survey by the Consumer Electronics Association also found that nearly a third of airplane passengers said they left a portable electronic device on during a flight in the previous year.) There seems to be a general consensus that this is a low-risk issue, but the potentially high consequences if it occurs have made some reluctant to reduce the restrictions.

There are also some non-technical issues that need to be considered with the onboard use of electronics.  There is concern that passengers enthralled with their devices will be distracted and miss important information during preflight safety briefs.  There is also a concern that larger devices, such as laptops, could become a missile hazard and hurt passengers if the plane moves unexpectedly.

If the new recommendations are approved, passengers will be able to use any device that doesn’t transfer data for the entire flight, including takeoff and landing. Passengers would be able to leave all devices turned on, but they would need to set them to airplane mode so that no data is transmitted. So you won’t be able to make calls on your smartphone or stream video, but you would be able to rock out to music already downloaded or read a book on a Kindle. Larger devices will still need to be stowed during takeoff and landing because nobody wants to be hit with a laptop, but smaller gadgets will be fair game if the new recommendations are adopted.

To see a Cause Map of this issue, click on “Download PDF” above.

Sea life Devastated by Molasses Spill

By Kim Smiley

On September 9, 2013, a reported 1,400 tons of molasses was inadvertently spilled into Honolulu Harbor in Hawaii, devastating the sea life.   When I think of ocean spills, pictures of oil-covered animals jump into my mind, but the molasses spill is proving to be potentially just as damaging to the environment.

This incident can be analyzed by building a Cause Map, an intuitive format for performing a root cause analysis.  A Cause Map visually lays out the causes that contribute to an accident to show the cause-and-effect relationships between them so that it’s easier to understand the factors that led to the issue.  Understanding all the causes and not just focusing on a single “root cause” helps broaden the potential solutions that are considered and can lead to a better long term solution. The first step in the Cause Mapping process is to define how the problem impacted the goals and then these impacts are used as the starting point for the Cause Map.

The most obvious impact from the molasses spill is that thousands of fish and other marine life were killed. They suffocated because the molasses sank and displaced the oxygen-containing seawater in the harbor. The density of molasses is what makes this spill so different from an oil spill. Oil is lighter than water and floats on top of the ocean while molasses sinks to the bottom, with devastating effects at all levels in the ocean. Divers investigating the molasses spill reported that there were no signs of life in the ocean near the spill; all bottom dwellers had been killed.

The fact that molasses sinks also means that there is no practical way to clean it up.  One positive about molasses is that molasses, unlike oil, will mix with water. It sits on the bottom until it is diluted and ocean movements disperse it.  Since the spill occurred in a protected harbor, the ocean movements are weaker and the time frame to move the molasses is longer than it would be in the open ocean, but nature will eventually return oxygen levels in the harbor to life-supporting levels.

The cause of the spill has been reported to be a leaking pipe. Molasses produced in Hawaii was being pumped into a ship for transportation to the mainland, where it was planned for use in animal feed. During the transfer, the molasses was accidentally pumped through a pipe with a leak and nobody noticed before the majority of the molasses had been released into the harbor. Details about what specifically caused the leak haven’t been released.

There are also other impacts from the spill that are worth considering. With any environmental issue, the cost of the investigation and any cleanup that needs to be done is always substantial. Many businesses in the area were also impacted by a drop in tourism because the harbor was closed for about two weeks after the accident, and normal tourism levels will probably not return until marine life in the area begins to recover. There was also a potential safety risk to any swimmers for a time after the accident because the presence of thousands of dead fish could attract predators.

To view an Outline and high level Cause Map of this accident, click on “Download PDF” above.

Is Your Emergency Response Plan Really Good Enough?

By Kim Smiley

The deadly shooting at the Washington Navy Yard this week hit especially close to home.  I live about 15 miles away from the Navy Yard and I also worked there for 5 years.  And my husband, a Navy civilian, still does.

Among the thousands impacted by the shootings, my family was very lucky. My spouse came home safely while others did not. Additionally, we hit the jackpot from a logistics standpoint because he was in his office (which is not in the affected building) when the order to shelter in place was issued. He had access to his phone and internet (as well as a bathroom and his packed lunch). I had word almost immediately that he was safe and was able to communicate with him throughout the day. When word came that he could go home but his car couldn’t, we were able to coordinate and get him home as quickly as possible.

Like many people across the country, I was riveted by the news and was holding my breath as the information fluctuated by the hour. At ThinkReliability, we are generally called in to help investigate or document information after an incident, so the opportunity to watch an incident as it plays out in real time is fairly rare. I was throwing together a makeshift family emergency response as I was bombarded with calls and messages from concerned friends and family, as well as trying to figure out when and how my husband would get home. And for somebody who works on processes and solutions for a living, my personal emergency response wasn’t very impressive. Take my word for it: the ideal time to discover your mother-in-law has your old cell phone number isn’t when your husband’s place of employment just made national headlines.

A time like this is an excellent opportunity to review both your organization’s and your family’s emergency response plans. Is your organization ready to handle a shelter-in-place situation? Do you know which authorities to contact in case of emergencies? And, one piece that I think is often overlooked: how would you handle the flow of information? How do you pass word to families if something significant occurs and do people know where to look for the information? Would you post the information on the website? Would an old-fashioned phone tree serve your needs? Do you have updated contact information and home phone numbers?

It’s also important to have a basic plan in place for your family in case something unforeseen happens. There was a flurry of activity on Monday as everyone worked to make sure that there was a plan for all the children of the people we knew on the Navy Yard to be picked up and potentially kept overnight. Thousands of people work on the Navy Yard and there were several cases where a single parent or both parents were stuck on lock-down for an indeterminate amount of time. Are you really ready to handle a situation like that? If your family or employees have any special needs, like requiring medication, I would recommend making a plan to deal with it. I also highly recommend taking a moment to make sure that any list with people allowed to pick up your children is up to date and includes a few folks who do not work in your building or even on the same side of town. Fairly simple precautions can make a tough situation go much more smoothly.

And don’t think you don’t need a basic plan if you have no dependents.   Do you know how you would get home if you suddenly had to leave your car at work like many of the Navy Yard employees did?  What if your wallet was left behind in a rushed evacuation?  It might be a good idea to have enough money to cover cab fare in your car or in your badge holder if you wear one to work.  How do you pass word to your parents that you’re okay, especially if you don’t have access to a phone?  Would your mom think to check her email?  Do you have a friend who has your parents’ or siblings’ phone numbers and could call them for you if they aren’t comfortable with social media or computers?  Trust me; your families would be very interested in hearing that you’re okay.

I hope you never experience any crisis even remotely close to the tragedy at the Navy Yard.  But if there is ever an emergency, you’ll be grateful if you made a plan beforehand.

NYT Website Disrupted for Hours

By Kim Smiley

On Tuesday, August 27, 2013, the New York Times website went dark for several hours after being attacked by a well-known group of hackers. Reports of hacked websites are becoming increasingly common and the New York Times was just one of many recent victims.

A Cause Map, or visual root cause analysis, can be used to analyze the recent attack on the New York Times website.  A Cause Map lays out the many causes that contribute to an issue in an intuitive format that illustrates the cause-and-effect relationships.   A Cause Map is useful for understanding all the causes involved and can help when brainstorming solutions.  To see a Cause Map of this example, click on “Download PDF” above.

Some details of how the attack was done have been released, as documented on the Cause Map. The New York Times website itself was not technically hacked, but traffic was redirected away from the legitimate website to another web domain.   To pull off this feat, hackers changed the domain name records for the New York Times website after acquiring the user name and password of an employee at the domain name registrar company.  The employee inadvertently provided the information to the hackers by responding to a phishing email asking for personal information.

The email sent by the hackers looked legitimate enough to fool the employee.

So why did hackers target the New York Times in the first place? The answer is that the New York Times is one of many Western media outlets to be targeted by the Syrian Electronic Army (S.E.A.), which has claimed responsibility for the attack. The S.E.A. supports President Bashar al-Assad of Syria and is generally unhappy with the way the events in Syria have been portrayed in the West.

So the next logical question is how do you protect yourself from a phishing scheme?  The first step is awareness.  Pretty much everybody who uses email can expect to receive some suspicious emails.  A few things to look out for:  attachments, links, misspellings, and a mismatched “from” field or subject line.  Also any alarming language should be a red flag.  For example, an email from your credit card company warning you that your account will be closed unless you take immediate action is probably not the real deal.  A good rule of thumb is to never respond to any email with personal information or to click on links in emails. If you think a request for action may be real, either call the company or open a new web browser window and type in the company’s web address.  It’s best to delete any suspicious emails immediately.
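The warning signs listed above can be turned into a simple checklist. The Python sketch below is a rough illustration only: the `Email` fields, the `red_flags` function, the sample phrases, and the example addresses are all made up for this post, and it skips the misspelling check because that needs a dictionary. A real mail filter is far more sophisticated, and no checklist replaces calling the company when in doubt.

```python
from dataclasses import dataclass, field

# Hypothetical examples of alarming language; not an exhaustive list.
ALARM_PHRASES = (
    "immediate action",
    "account will be closed",
    "verify your password",
)


@dataclass
class Email:
    from_address: str  # actual sending address
    subject: str
    body: str
    attachments: list = field(default_factory=list)


def red_flags(email, expected_domain):
    """Return the warning signs found in `email`.

    `expected_domain` is the domain the sender claims to write from,
    for example the domain printed on your credit card statement.
    """
    flags = []
    text = f"{email.subject} {email.body}".lower()
    if email.attachments:
        flags.append("attachment")
    if "http://" in text or "https://" in text:
        flags.append("link")
    if not email.from_address.lower().endswith("@" + expected_domain.lower()):
        flags.append("mismatched from field")
    if any(phrase in text for phrase in ALARM_PHRASES):
        flags.append("alarming language")
    return flags


suspicious = Email(
    from_address="alerts@examp1e-bank.test",
    subject="Urgent notice",
    body="Your account will be closed unless you take immediate action: "
         "http://examp1e-bank.test/login",
)
print(red_flags(suspicious, expected_domain="example-bank.test"))
```

An empty list does not mean an email is safe. It only means none of these particular warning signs were found.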

This example is also a good reminder to be aware that websites can get hacked. A great example of this is when the S.E.A. hacked the Associated Press’s Twitter feed last April and used it to announce (falsely) that the White House had been bombed. That one tweet is estimated to have caused a $136 billion loss in the stock markets as people responded to the news. In general, it is probably good to be skeptical about anything shocking you read online until the information is confirmed.

What Happens When a Copy Isn’t a Copy?

By Kim Smiley

Think of how many documents are scanned every day. Imagine how important some of these pieces of paper are, such as invoices, property records, and medical files. Now try to picture what might happen if the copies of these documents aren’t true copies. This is exactly the scenario that Xerox was recently facing.

It recently came to light that some copies of scanned documents were altered by the scanning process. Specifically, some scanner/copier machines changed numbers on documents. This issue can be analyzed by building a Cause Map, an intuitive, visual format for performing a root cause analysis. The first step in the Cause Mapping process is to fill in an Outline with the basic background information on an issue. Additionally, the impacts to the overall goals are documented on the Outline to help clarify the severity of any given issue. In this example, the customer service goal is impacted because the scanners weren’t operating as expected. There is also a potential impact to the overall economic goal because the altered documents could result in any number of issues. There is also an impact because of the labor needed to investigate and fix the problem.

After completing the Outline, the next step is to ask “why” questions to build the Cause Map. Why weren’t the scanners operating as expected? This happened because the scanners were changing some documents during the scanning process. Scanners use software to help interpret the original documents and Xerox has stated that the problem happened because of a software bug. Testing showed that the number substitutions were more likely to occur when the scanners were set to lower quality/higher compression because of the specific software used for these settings. Testing also showed that the error was more likely to occur when scanning documents that were more difficult to read, such as those with small fonts or those that had already been copied multiple times.

Xerox had been aware of the potential for number substitution at lower quality settings, but didn’t appear to expect it to occur at factory settings (which was found to be very unlikely, but possible). A notice that stated that character substitutions were possible appeared on the scanners when lower resolution settings were selected and was included in some manuals, but this approach seems to have been ineffective since many users were caught unaware by this issue.

After a Cause Map has been built with enough detail to understand the issue, it can be used to help develop solutions. In this example, Xerox developed a software patch that corrected the error. Xerox also posted several blogs on their website to keep customers informed about the issue and worked with users to ensure that the patch was successful in correcting the error.

To see a high level Cause Map of this issue, click on “Download PDF” above.

Trading Glitch Loses Goldman Sachs Millions

By Kim Smiley

A Goldman Sachs trading glitch on August 20, 2013, caused a large number of erroneous single stock and ETF options trades. About 80 percent of the errant trades were cancelled, but the financial damage is still speculated to be as much as one hundred million dollars. The company also finds itself once again in the uncomfortable position of making headlines for negative reasons, which is never good for business.

The glitch occurred during an update to an internal computer system that is used to determine where to price options. The update changed the software so that the system began inadvertently misinterpreting non-binding indications of interest as actual bids and offers. The system acted on these bids and executed a large volume of trades at errant prices that were out of touch with actual market prices.

This issue can be built into a Cause Map, an intuitive method for performing a root cause analysis.  One of the advantages of a Cause Map is that it visually lays out all the causes and the cause-and-effect relationships between them. Seeing all the causes can broaden the solutions that are considered.

In this example, a Cause Map can help illustrate the fact that the software glitch itself isn’t the only thing worth focusing on. The lack of an effective test program also contributed to the problem, and testing may be the easiest place to implement an effective solution. If the problem had been caught in testing, the only cost would have been the time and effort needed to fix the software. The importance of a robust test program for software is difficult to overstate. If the software is vital to whatever your company’s mission is, develop a way to test it.
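To make the testing point concrete, here is a minimal sketch in Python of the kind of check that could catch this class of error before an update goes live. Nothing here reflects Goldman Sachs’s actual system: the `Message` type, the `"order"` and `"ioi"` labels, and the `executable_orders` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    kind: str  # "order" (binding) or "ioi" (non-binding indication of interest)
    symbol: str
    price: float
    quantity: int


def executable_orders(messages):
    """Return only the messages the system is allowed to trade on."""
    return [m for m in messages if m.kind == "order"]


def test_indications_of_interest_are_never_executed():
    """Regression test: an update must not start treating IOIs as orders."""
    messages = [
        Message("order", "XYZ", 10.50, 100),
        Message("ioi", "XYZ", 1.00, 5000),  # non-binding, must be ignored
        Message("ioi", "ABC", 2.25, 300),
    ]
    result = executable_orders(messages)
    assert all(m.kind == "order" for m in result)
    assert len(result) == 1


test_indications_of_interest_are_never_executed()
print("indications of interest were not executed")
```

A test this small costs minutes to write and run. If a software update changes how messages are classified, the test fails on a developer’s machine instead of in the market.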

To view a high level Cause Map of this issue, click on “Download PDF” above.  Click here to read about the loss of the Mars Climate Orbiter, another excellent example of a software error with huge consequences.