
Olympic Track Worker Hit By Bobsled

By Kim Smiley

A worker at the bobsled track for the Sochi Winter Olympics was hit by a bobsled on February 13, 2014.  The worker suffered two broken legs and a possible concussion, but is reported to be stable after undergoing surgery.  There was also minor damage done to the track.  Part of a lighting system suspended from the ceiling was replaced and time was needed to clean small plastic shards off the ice.

Investigation into this accident is still underway, but the information that is available in the media can be used to build an initial Cause Map. One of the advantages of using Excel to build Cause Maps is that they can be easily modified to incorporate additional information once the investigation is complete.

When beginning the Cause Mapping process, the first step is to fill in an Outline with the basic background information for an issue.  How an incident impacted the overall organizational goals is also documented on the bottom half of the Outline.  Once the Outline is completed, the Cause Map is built by asking “why” questions. (Click on “Download PDF” above to view a high level Cause Map and Outline for this accident.)

So why was the worker hit by a bobsled?  This occurred because a forerunner sled was sent down the track while the worker was on the track.  The forerunner sled was on the track because they are used to test the track prior to training runs and competitions, and training was scheduled later that day.  Forerunner sleds ensure that ice conditions are good and that all systems, like the timing system, are functional.  People at the top of the track can’t see the entire track so there wasn’t an easy way for them to identify the position of the worker prior to running the sled.  Initial reports are that the normal announcements were made to the workers prior to running the forerunner sled so it doesn’t appear that the people on the top of the track had any reason to suspect a problem.
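The chain of “why” answers above lends itself to a simple tree of cause-and-effect links. As a rough sketch (the node wording and dictionary layout are mine, not ThinkReliability’s Excel template):

```python
# A cause-and-effect chain modeled as nested "why" links. The wording and
# structure here are illustrative, not the official Cause Map format.
cause_map = {
    "effect": "Worker hit by bobsled",
    "causes": [
        {
            "effect": "Forerunner sled run while worker was on track",
            "causes": [
                {"effect": "Sled run to test track before training", "causes": []},
                {"effect": "Crew at top could not see worker's position", "causes": []},
            ],
        },
        {
            "effect": "Worker unaware sled was running",
            "causes": [
                {"effect": "Loud air blower masked announcement and sled noise",
                 "causes": []},
            ],
        },
    ],
}

def print_map(node, depth=0):
    """Walk the map, printing each effect above the 'why' answers beneath it."""
    print("  " * depth + ("Why? " if depth else "") + node["effect"])
    for cause in node["causes"]:
        print_map(cause, depth + 1)

print_map(cause_map)
```

Laying the causes out this way makes it clear the accident had multiple contributing causes, not a single root cause, which is exactly the point of the method.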

The worker was on the track doing work to prepare it for the training runs and competition scheduled that day.  We can safely assume that he was unaware that the forerunner sled was running the track at the same time.  Investigators have determined that the worker was using a loud motorized air blower and believe he was unable to hear both the announcement and the approaching bobsled.  Two other workers were also working on the track, but they were able to scramble out of danger as the bobsled approached.  Until the investigation is complete, it won’t be clear if other factors were involved, but it seems the use of loud equipment played a role in the accident.

The final step in the Cause Mapping process is to find solutions to reduce the risk of a problem recurring.  It appears that the current method of letting workers know to clear the track isn’t adequate in all situations.  Officials will need to modify the process, especially when loud equipment is in use, to ensure the safety of all workers.  Workers need to be on the track at times in order to do their jobs, and there needs to be a way to ensure they have moved to a safe location before any sled runs the track.

It’s worth noting this is not the first time someone has been hit by a bobsled. In 2005, skeleton racer Noelle Pikus-Pace, who won silver at this year’s Olympics, was hit by a bobsled.  She shattered a leg and ended up missing the 2006 Turin Olympics as a result.  That accident occurred on a different track, but it highlights the dangers of bobsled tracks and the importance of ensuring safety.

Concerns Raised About Safety of Olympic Slopestyle Course

By Kim Smiley 

One of the stories making headlines leading up to the start of the 2014 Winter Olympics was concern about the safety of the slopestyle course.  There were early rumblings about the course, especially after a few falls during training runs, but media interest intensified after well-known snowboarder Shaun White withdrew from the event.  There is also a heightened sensitivity to safety concerns after the death of a luger during the last Winter Olympics, which was the first death in Olympic training or competition since 1964.

Safety of the athletes involved in the Olympics is obviously paramount, but media coverage of slopestyle course safety concerns is also an issue because it created negative press for both the Olympics and the host country.  A Cause Map can be built to help analyze this issue and illustrate all the factors involved with the controversy surrounding the Olympic slopestyle course. (To see a high level Cause Map of this issue, click on “Download PDF”.)

Several athletes fell during training runs on the slopestyle course, which led to questions about course safety.   There were some injuries on the course, the most notable being Torstein Horgmo of Norway who broke his collarbone during a practice run.  Horgmo was a favorite to medal in the event and was unable to compete after his injury, which has to be heartbreaking.

The course is different from the typical slopestyle course, partly because this is the Olympics and the designer wanted an exciting course.   Athletes are getting more air time from the jumps on the course because they are large step-down jumps where the landing zones are below the ramps.  Designing the first Olympic slopestyle course was a unique challenge and there was no precedent.

The weather has been an added challenge for the course designer.  The jumps were intentionally built oversized, with plans to modify them as needed to accommodate melting in the above-freezing weather.  It’s much easier to make a jump smaller than to make it larger, so designers would rather err on the side of too big.  Rain and warm weather also played havoc with plans to test the course.  A test event scheduled for last February was canceled because of weather, and although construction was scheduled to allow extra time to groom the course before the Olympics, six days of heavy rain pushed course completion past schedule.

It’s also worth noting that there is inherent danger in slopestyle.  Slopestyle is an extreme sport with snowboarders performing high intensity tricks in the air.  Factor in the pressure to bring the goods in an Olympic event and snowboarders are going to be pushing their limits.  The falls don’t all happen on the jumps, despite media focus on the large jumps on this course.  Torstein Horgmo’s Olympic-ending crash occurred on the stair set on top of the course.   While a course can be made too dangerous, there will never be a completely safe slopestyle course because of the nature of the sport.

Snowboarder Shaun White made headlines when he pulled out of slopestyle because of injury concerns, but it’s also important to remember that slopestyle isn’t White’s main event.  Although White failed to reach the podium this Olympics, he was the defending gold medalist on the halfpipe and wasn’t willing to risk his chance to compete in that event.  White suffered minor injuries from a crash on the slopestyle course and he didn’t want to impact his halfpipe chances by getting hurt worse.  Halfpipe came after slopestyle so the consequences of a potential injury were high for White.  I’m willing to bet he would have been much more likely to compete in slopestyle if it occurred after the halfpipe event.

The slopestyle course was modified after training runs, which is typical for an untested slopestyle course.  Forty to fifty centimeters were removed from the top deck of the jumps and snow was added to the knuckles of each landing.  The course crew has been credited with listening to athletes’ concerns and being responsive to issues. Lessons learned from the experience with the first Olympic slopestyle course will hopefully help things go more smoothly next time.  I hope the focus during the next Olympics is on the amazing athletes and not so much on the course.

Volunteer Killed in Helicopter Fall

By ThinkReliability Staff

On September 12, 2013, the California National Guard invited Shane Krogen, the executive director of the High Sierra Volunteer Trail Crew and the U.S. Forest Service’s Regional Forester’s Volunteer of the Year for 2012, to assist in the reclamation effort of a portion of the Sequoia National Forest where a marijuana crop had been removed three weeks earlier.  Because the terrain in the area was steep, the team was to be lowered from a helicopter into the area.

After Mr. Krogen left the helicopter to be lowered, an equipment failure caused the volunteer to fall 40 feet.  He later died from blunt force trauma injuries. The Air Force’s report on the incident, which was released in January, determined that Mr. Krogen had been improperly harnessed.  The report also found that he should have never been invited on the flight.

To show the combination of factors that resulted in the death of the volunteer, we can capture the information from the Air Force report in a Cause Map, or visual root cause analysis.  First it’s important to determine the impacts to the goals.  In this case, Mr. Krogen’s death is an impact to the safety goal, and the primary consideration.  Additionally, the improper harnessing can be considered an impact to the customer service goal, as Mr. Krogen was dependent on the expertise of National Guard personnel to ensure he was properly outfitted.  Because it was contrary to Air Force regulations, which say civilian volunteers cannot be passengers on counter-drug operations, the fact that Mr. Krogen was allowed on the flight can be considered an impact to the regulatory goal.  Lastly, the time spent performing the investigation impacts the labor goal because of the resources used.

Beginning with the impacted goal of primary concern – the safety goal – asking “Why” questions allows for the determination of the causes that resulted in the impacted goal (the end effect).  In this case, Mr. Krogen died of blunt force trauma injuries from falling 40 feet.  He fell 40 feet because he was being lowered from a helicopter and his rigging failed.  He was being lowered from a helicopter to aid in reclamation efforts and because the terrain was too steep for the helicopter to land.

The rigging failure resulted from the failure of a D-ring which was used to connect the harness to the hoist.  Specifically, the D-ring was not strong enough to handle the weight of a person being lowered on it.  This is because the hoist was connected to Mr. Krogen’s personal, plastic D-ring instead of a government-issued, load-bearing metal D-ring.  After Mr. Krogen mistakenly connected the wrong D-ring, his rigging was checked by National Guard personnel.  The airman doing the checking didn’t notice the mistake, likely because of the proximity of the two D-rings and the fact that Mr. Krogen was wearing his own tactical vest, loaded with equipment, over the harness to which the metal D-ring was connected.

I think Mark Thompson sums up the incident best in his article for Time:   “The death of Shane Krogen, executive director of the High Sierra Volunteer Trail Crew, last summer in the Sequoia National Forest, just south of Yosemite National Park, was a tragedy. But it was an entirely preventable one.  It stands as a reminder of how dangerous military missions can be, and on the importance of a second set of eyes to make sure that potentially deadly errors, whenever possible, are reviewed and reversed before it is too late.”

To view the Outline and Cause Map, please click “Download PDF” above.

Millions Impacted by Data Breach At Target

By Kim Smiley

Are you one of the millions of customers affected by the recent data breach at Target?  Because I am.  I for one am curious about how data for approximately 40 million credit and debit cards was compromised at one of the United States’ largest retailers.

The investigation is ongoing and many details about the data breach haven’t been released, but an initial Cause Map can be built to begin analyzing this incident.  The latest information released is that the Justice Department is performing an investigation into this incident.  An initial Cause Map can capture the information that is available now and can easily be expanded to include more detail in the future.  A box with a question mark can be used to indicate that more information is needed on the Cause Map. (Click on “Download PDF” to view an Outline and high level Cause Map.)

One of the causes that I think is worth discussing is that retailers in the United States have been specifically targeted for this type of attack in recent years.  The vast majority of credit and debit cards in use in the United States are magnetic stripe cards, while Europe has been transitioning to newer credit card technology that uses chips.  Magnetic stripe cards are a more desirable target for criminals because the technology to create fake magnetic stripe cards is readily available.  The data on a magnetic stripe also stays the same from transaction to transaction, while chips generate a unique code for each transaction.  Cards with chips also require a PIN when used, adding an additional layer of protection.
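The difference between the two card technologies can be sketched in a few lines of code. This is a deliberately simplified stand-in for EMV’s actual cryptogram scheme, not the real protocol; the key, counter, and function names are invented for illustration:

```python
import hashlib
import hmac

# Magnetic-stripe model: the same track data is exposed on every swipe,
# so a single skimmed read is enough to produce a working clone.
stripe_data = "4111111111111111|12/25"   # made-up card number and expiry
swipe_1 = stripe_data
swipe_2 = stripe_data
assert swipe_1 == swipe_2  # identical every time

# Chip-card model (simplified): the chip holds a secret key and a counter
# and emits a unique code per transaction, so a captured code can't be
# replayed once the counter has moved on.
card_key = b"secret-stored-only-in-chip"  # invented key for illustration

def transaction_code(counter: int) -> str:
    msg = f"{stripe_data}|{counter}".encode()
    return hmac.new(card_key, msg, hashlib.sha256).hexdigest()[:16]

assert transaction_code(1) != transaction_code(2)  # fresh code every time
```

The key point is that skimming a stripe yields everything needed to clone the card, while skimming a chip transaction yields a code that is useless for the next one.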

So why does the United States still use magnetic stripe cards?  One of the main complicating factors is money.  Transitioning to cards that use chips requires a significant investment by both banks and retailers.  The cost to transition to the higher-tech cards is estimated at $8 billion, so the investment required is considerable. Both parties are nervous about being the first to commit to the process.

Rising credit card fraud rates in the United States have been increasing the pressure to move to newer credit card technology.  Credit card fraud rates in the U.S. have doubled in the 10 years since Europe began using chip cards.  As long as the United States remains the softest target, the rates are likely to increase.

On a positive note, the transition to the newer chip cards should be gaining traction in the next few years.  Credit card companies have typically footed the bill for credit card fraud, but many have announced that, by the end of 2015, liability for fraudulent purchases that the higher-tech cards would have prevented will shift to whichever merchant or bank has not transitioned to chip cards.

The frustrating thing is that there are limited ways individual consumers can protect themselves short of switching to cash.  You can be smart about where you swipe your cards, for example avoiding unmanned ATM kiosks, but a major retailer like Target didn’t seem suspicious.  As somebody who has had multiple instances of credit card fraud in the last few years, I look forward to a safer credit card in the future.

300,000 Unable to Use Water after Chemical Spill in West Virginia

By Kim Smiley

Hundreds of thousands of West Virginians were unable to use their water for days after it was contaminated by a chemical spill on January 9, 2014. About 7,500 gallons of 4-methylcyclohexanemethanol, known as MCHM, leaked out of a storage tank and into the Elk River.  At the time of the spill, little was known about MCHM, but officials ordered residents not to use the water because the chemical can cause vomiting, nausea, and skin, eye and throat irritation.  The ban on water usage obviously meant that residents should not drink the water, but they were also told not to cook, bathe, wash clothes or brush their teeth with it.

The investigation into this incident is still ongoing, but some information is available.  An initial Cause Map, or visual root cause analysis, can be built now and it can easily be expanded in the future.  A Cause Map is used to illustrate the cause-and-effect relationships between the many causes that contribute to any incident.  In this example, it is known that the MCHM leaked into the river because it was being stored in a tank near the river and the tank failed.  MCHM was being stored in a tank because it is used in coal processing and it was profitable for the company to sell it.

The cause of the tank failure hasn’t been officially determined, but the company that owned the facility has stated that an object punctured the tank after the ground under the tank froze.  (Suspected causes can be included on the Cause Map with a question mark to indicate that more evidence is needed to confirm their validity.)

The tank in question was older, built about 70 years ago.  There were no regulations that required the tank to be inspected while it was being used to store MCHM because the chemical is not currently legally considered a hazardous material.  The tank is also an atmospheric tank so it is exempt from current federal safety inspections because it is not under pressure, cooled or heated.

Many are asking why a tank full of a chemical that can make people sick, sited so close to the water supply, was subject to so little regulation and no required inspections.  The debate sparked by this accident will force a close review of the current regulations governing these types of facilities.

It’s also alarming how little was known about this chemical prior to this accident.  It’s still not well understood exactly how dangerous MCHM is.  Experts have stated that the long term impacts should be minimal, but it would be awfully reassuring to the people living in the area if there was more information about the chemical available.

Companies need to have a clear understanding of the risks involved in their operations if they hope to reduce the risk to the lowest reasonable level and develop effective emergency response plans to deal with any issues that do arise.  As the old saying goes – failure to plan is planning to fail.  Just ask the company involved.  Freedom Industries filed bankruptcy papers on January 17, 2014 as a direct result of this accident.

Freight Train Carrying Crude Oil Explodes After Colliding With Another

By Kim Smiley

On Monday, December 30, 2013, a 106-car freight train carrying crude oil derailed in North Dakota and violently exploded after colliding with another derailed train that was on the tracks.  No injuries were reported, but the accident did cause an impressive plume of hazardous smoke and major damage to two freight trains.

The investigation into the accident is ongoing and it’s still unknown what caused the first train to derail. Investigators have stated that it appears there was nothing wrong with the railroad track or with the signals.  It is known that a westbound freight train carrying grain derailed at about 2:20 p.m.  A portion of this train jumped onto the track in front of the eastbound train.  There wasn’t enough time for the mile-long train loaded with crude oil to stop, and it smashed into the grain train, causing the eastbound oil train to derail.  (To see a Cause Map of this accident, click on “Download PDF” above.)

Train cars carrying crude oil were damaged and oil leaked out during the accident.  The train accident created near ideal conditions for an explosion: sparks and a large quantity of flammable fluid.   The fire burned for more than 24 hours, resulting in a voluntary evacuation of nearby Casselton, North Dakota due to concerns over air quality.  The track was closed for several days while the initial investigation was performed and the track was cleaned up.

The accident has raised several important issues.  The safety of the train cars used to transport oil has been questioned.  Starting in 2009, tank cars have been built to tougher safety standards, but most tank cars in use are older designs that haven’t been retrofitted to meet the more stringent standards.  This accident, and others in recent years involving the older-design tank cars, has experts asking hard questions about their safety and whether they should still be in use.

The age of the train cars is particularly concerning since the amount of oil being transported by rail has grown dramatically in recent years.  Around 9,500 carloads of oil were reportedly transported in 2008, and nearly 300,000 carloads were moved during the first three quarters of 2013.  The oil industry in North Dakota has rapidly expanded in recent years as new technology makes oil extraction in the area profitable.  North Dakota is now second only to Texas in oil production since the development of the Bakken shale formation.  Pretty much the only way to transport the crude oil extracted in North Dakota is via rail; there isn’t a pipeline infrastructure or other alternative available.

And most of the time, transporting oil via freight train is a safe operation.  The Association of American Railroads has reported that 99.99 percent of all hazardous materials shipped by rail reach their destination safely.  But it’s that 0.01 percent that can get you in trouble.  As a nation, we have to decide whether the status quo is good enough or whether it’s worth the money to require all tank cars used to transport oil to be retrofitted to meet the newest safety standards, a proposition that isn’t cheap.
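A quick back-of-envelope calculation shows why that 0.01 percent matters at today’s volumes. The annualization below is my own rough extrapolation from the carload figures above, not a reported statistic:

```python
# Figures quoted above; the annualization is my own rough extrapolation.
carloads_2008 = 9_500
carloads_2013_3q = 300_000
carloads_2013_est = carloads_2013_3q * 4 / 3      # ~400,000 for a full year

growth = carloads_2013_est / carloads_2008        # ~42x growth in five years
incident_rate = 1 - 0.9999                        # the 0.01 percent

expected_incidents = carloads_2013_est * incident_rate
print(f"~{growth:.0f}x growth, ~{expected_incidents:.0f} problem carloads/year")
# → "~42x growth, ~40 problem carloads/year"
```

Even a rate that sounds excellent in percentage terms adds up to dozens of problem carloads a year once volumes grow forty-fold.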

Department of Energy Cyber Breach Affects Thousands, Costs Millions

By ThinkReliability Staff

Personally identifiable information (PII), including social security numbers (SSNs) and banking information, for more than 104,000 individuals currently or formerly employed by the Department of Energy (DOE) was accessed by hackers from the Department’s Employee Data Repository database (DOEInfo) through the Department’s Management Information System (MIS).  A recently released special report by the DOE’s Inspector General analyzes the causes of the breach and provides recommendations for preventing or mitigating future breaches.

The report notes that, “While we did not identify a single point of failure that led to the MIS/DOEInfo breach, the combination of the technical and managerial problems we observed set the stage for individuals with malicious intent to access the system with what appeared to be relative ease.”  Because of the complex interactions between the systems, the personnel, and the safety precautions (or lack thereof) that led to system access by hackers, a diagram showing the cause-and-effect relationships can be helpful.  Here those relationships – and the impacts they had on the DOE and DOE personnel – are captured within a Cause Map, a form of visual root cause analysis.

In this case, the report uncovered concerns that other systems were at risk for compromise – and that a breach of those systems could impact public health and safety.  The loss of PII for more than 104,000 current and former personnel can be considered an impact to the customer service goal.  The event (combined with two other cyber breaches since May 2011) has resulted in a loss of confidence in cyber security at the Department, an impact to the mission goal.  Affected employees were given 4 hours of authorized leave to deal with potential impacts from the breach, impacting both the production and labor goals.  (Labor costs for recovery and lost productivity are estimated at $2.1 million.)  The Department has paid for credit monitoring and established a call center for the affected individuals at an additional cost of $1.6 million, bringing the cost of this event to $3.7 million.  With an average of one cyber breach a year for the past 3 years, the Department could be looking at multi-million dollar annual costs related to cyber breaches.

These impacts to the goals resulted from hackers gaining access to unencrypted PII.  Hackers were able to gain access to the system, which was unencrypted and contained significant amounts of PII because the database was the central repository for current and former employees.  The PII within the database included SSNs, which were used as identifiers, contrary to Federal guidance.  For reasons that are unknown, there appeared to have been no effort to remove SSNs as identifiers despite a 5-year-old requirement to do so.  The system appears to have remained unencrypted because of performance concerns, though these were not well documented or understood.
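Removing SSNs as identifiers is conceptually simple: assign each record an opaque surrogate key and demote the SSN to (ideally encrypted) payload data. A minimal sketch, with invented field names and sample data:

```python
import uuid

# Hypothetical before-state: records keyed directly by SSN.
employees_by_ssn = {
    "123-45-6789": {"name": "A. Employee", "bank": "routing/account data"},
}

# After: each record gets an opaque surrogate key, and the SSN becomes
# ordinary payload (which a real system would also encrypt at rest).
employees_by_id = {}
ssn_to_id = {}
for ssn, record in employees_by_ssn.items():
    emp_id = str(uuid.uuid4())
    employees_by_id[emp_id] = {**record, "ssn": ssn}
    ssn_to_id[ssn] = emp_id

# Downstream systems now reference emp_id; a leaked identifier reveals nothing.
assert "123-45-6789" not in employees_by_id
```

The migration effort is real but modest, which makes the five years of inaction on the requirement all the more striking.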

Hackers were able to “access the system with what appeared to be relative ease” because the system had inadequate security controls (only a user name and password were required for access) and could be directly accessed from the internet, presumably in order to accomplish necessary tasks.  In the report, the ability to access the system was directly related to “continued operation with known vulnerabilities,” a concept that may be familiar to many at a time when most organizations are trying to do more with less.  Several causes contributed to the delay in addressing those vulnerabilities: a perceived lack of authority to restrict operation, unclear responsibility for applying patches, vulnerabilities that went undetected because of limited development, testing, troubleshooting and ongoing scanning of the system, and cost.

According to the report, “The Department should have considered costs associated with mitigating a system breach … We noted the Department procured the updated version in March 2013 for approximately $4,200. That amount coupled with labor costs associated with testing and installing the upgrade were significantly less than the cost to mitigate the affected system, notify affected individuals of the compromise of PII and rebuild the Department’s reputation.”

The updated system referred to was purchased in March 2013, though the system had not been updated since early 2011 and core support for the application upon which the system was built ended in July 2012.  Additionally, “the vulnerability exploited by the attacker was specifically identified by the vendor in January 2013.”  The update, though purchased in March, was not installed until after the breach occurred.  Officials stated that a decision to upgrade the system had not been made until December 2012 because it had not reached the end of its useful life.  The Inspector General’s note about considering the costs of mitigating a system breach is pointed, comparing the several-thousand-dollar cost of an on-time upgrade to the several-million-dollar cost of mitigating a breach.  However, like the DOE, many companies find themselves in the same situation, cutting costs on prevention and paying exponentially higher costs to deal with the inevitable problem that arises.
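Putting the report’s two numbers side by side makes the point starkly (figures as quoted above):

```python
# Figures from the Inspector General's report, as quoted above.
upgrade_cost = 4_200         # on-time software upgrade, purchased March 2013
breach_cost = 3_700_000      # labor, lost productivity, credit monitoring, call center

ratio = breach_cost / upgrade_cost
print(f"The breach cost roughly {ratio:.0f}x the price of the upgrade")  # ~881x
```

Even adding generous labor costs for testing and installing the upgrade, prevention here was cheaper than the breach by nearly three orders of magnitude.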

To view the Outline, Cause Map and recommended solutions based on the DOE Inspector General’s report, please click “Download PDF” above.

Boeing 747 “Dreamlifter” Cargo Jet Lands At Wrong Airport

By Kim Smiley

On November 21, 2013, a massive Boeing 747 Dreamlifter cargo jet made national headlines after it landed at the wrong airport near Wichita, Kansas.  For a time, the Dreamlifter looked to be stuck at the small airport with a relatively short runway, but it was able to take off safely the next day after some quick calculations and a little help turning around.

At the time of the airport mix-up, the Dreamlifter was on its way to McConnell Air Force Base to retrieve Dreamliner nose sections made by nearby Spirit AeroSystems.  Dreamlifters are notably large because they are modified jumbo jets designed to haul pieces of Dreamliners between the different facilities that manufacture parts for the aircraft.

So how does an airplane land at the wrong airport?  It’s not entirely clear yet how a mistake of this magnitude was made.  The Federal Aviation Administration is planning to investigate the incident to determine what happened and to see whether any regulations were violated.  What is known is that the airports have some similarities in layout that can be confusing from the air.  First off, there are three airports in fairly close proximity in the region.  The intended destination was McConnell Air Force Base, which has a runway configuration similar to Jabara airfield, where the Dreamlifter landed by mistake.  Both runways run north-south and are nearly parallel.  It can also be difficult to judge the length of a runway from the air, so the shorter length isn’t necessarily easy to see.  Beyond the airport similarities, the details of how the plane landed at the wrong airport haven’t been released yet.

What is known can be captured by building an initial Cause Map, a visual format for performing a root cause analysis.  One of the advantages of Cause Maps is they can be easily expanded to incorporate more information as it becomes available.  The first step in Cause Mapping is to fill in an Outline with the basic background information and to list how the issue impacts the overall goals.  There are a number of goals impacted in this example.  The potential for a plane crash means that there was an impact to both the safety and property goals because of the possibility of fatalities and damage to the jet.  The effort needed to ensure that the jet could safely take off on a shorter runway is an impact to the labor goal, and the delay was an impact to the schedule goal.  The negative publicity surrounding this incident can also be considered an impact to the customer service goal.

Once the Outline is completed, the Cause Map is built by asking “why” questions and intuitively laying out the answers until all the causes that contributed to the issue are documented.  Click on “Download PDF” above to see an Outline and initial Cause Map of this issue.

Good luck with any air travel planned for this busy holiday week.  And if your plane makes it to the right airport (even if it’s a little late), take a moment to be thankful because it’s apparently not the given I’ve generally assumed.

Pilot Response to Turbulence Leads to Crash

By ThinkReliability Staff

All 260 people onboard Flight 587, plus 5 on the ground, were killed when the plane crashed into a residential area on November 12, 2001.  Flight 587 took off shortly after another large aircraft.  The plane experienced turbulence.  According to the NTSB, the pilot’s overuse of the rudder mechanism, which had been redesigned and as a result was unusually sensitive, resulted in such high stress that the vertical stabilizer separated from the body of the plane.

This event is an example of an Aircraft Pilot Coupling (APC) event.  According to the National Research Council, “APC events are collaborations between the pilot and the aircraft in that they occur only when the pilot attempts to control what the aircraft does.  For this reason, pilot error is often listed as the cause of accidents and incidents that include an APC event.  However, the [NRC] committee believes that the most severe APC events attributed to pilot error are the result of the adverse APC that misleads the pilot into taking actions that contribute to the severity of the event.  In these situations, it is often possible, after the fact, to analyze the event carefully and identify a sequence of actions the pilot could have taken to overcome the aircraft design deficiencies and avoid the event.  However, it is typically not feasible for the pilot to identify and execute the required actions in real time.”

This crash is a case where it is tempting to chalk up the accident to pilot error and move on.  However, a more thorough investigation of causes identifies multiple issues that contributed to the accident and, most importantly, multiple opportunities to increase safety for future pilots and passengers.  The impacts to the goals, causes of these impacts, and possible solutions can be organized visually in cause-and-effect relationships by using a Cause Map.  To view the Outline and Cause Map, please click “Download PDF” above.

The wake turbulence that initially affected the flight was due to the small separation distance between the flight and a large plane that took off 2 minutes prior (the separation required by the FAA).  This led to a recommendation to re-evaluate the separation standards, especially for extremely large planes.  In the investigation, the NTSB found that the training provided to pilots on this particular type of aircraft was inadequate, especially because changes to the aircraft’s flight control system had rendered the rudder control system extremely sensitive.  This combination is believed to be what led to the overuse of the rudder system, creating stress on the vertical stabilizer that resulted in its detachment from the plane.  Specific formal training for pilots based on the flight control system for this particular plane was incorporated, as was evaluation of changes to the flight control system and a requirement for handling evaluations when design changes are made to flight control systems for previously certified aircraft. A caution box related to rudder sensitivity was incorporated on these planes, as was a detailed inspection to verify the stabilizer-to-fuselage and rudder-to-stabilizer attachments.  An additional inspection was required for planes that experience extreme in-flight lateral loading events.  Lastly, the airplane upset recovery training aid was revised to assist pilots in recovering from upsets such as the one in this event.

Had this investigation been limited to a discussion of pilot error, revised training might still have been developed, but the causes behind the other solutions recommended and implemented as a result of this accident would likely never have been discussed.  It's important to ensure that incident investigations address all the causes so that as many solutions as possible can be considered.

The Morris Worm: The First Significant Cyber Attack

By Kim Smiley

In 1988 the world was introduced to the concept of a software worm when the Morris worm made headlines for significantly disrupting the fledgling internet.  The mess left in the wake of the Morris worm took several days to clean up. Estimates of the cost of the Morris worm vary greatly, from $100,000 to $10,000,000, but even at the low end of that range the numbers are substantial.

A Cause Map, or visual root cause analysis, can be used to analyze this issue.  A Cause Map is built by asking “why” questions and using the answers to visually lay out the causes that contributed to an issue, showing the cause-and-effect relationships.  In this example, a programmer was trying to build a “harmless” worm that could be used to gauge the size of the internet, but he made a mistake.  The goal was to infect each computer only once, but to make the worm hard to defend against, it was designed to duplicate itself anyway one out of every seven times a computer indicated it was already infected.  The problem was that the speed of propagation was underestimated. Once released, the worm quickly reinfected computers over and over again until they were unable to function and the internet came crashing down.  (To view a Cause Map of this example, click on “Download PDF” above.)
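To see why that one-in-seven rule was so dangerous, consider a toy simulation of the propagation logic described above. This is an illustrative sketch only, not the actual worm code; the network size, round counts, and function names are all hypothetical. The key point it demonstrates: without the duplicate-anyway rule each machine holds at most one copy, but with it the number of running copies keeps growing even after every machine is infected.

```python
import random

def simulate_worm(num_computers=20, rounds=25, reinfect_every=7, seed=1988):
    """Toy model of the Morris worm's propagation rule (illustrative only).

    Each running copy probes one random machine per round.  An uninfected
    target always gets infected.  An already-infected target gets a
    *duplicate* copy one time in `reinfect_every` -- the safeguard meant
    to defeat machines that falsely claim "already infected."
    Pass reinfect_every=None to model the intended one-copy-per-machine
    behavior with no duplication.
    """
    rng = random.Random(seed)
    copies = [0] * num_computers   # worm copies running on each machine
    copies[0] = 1                  # patient zero
    for _ in range(rounds):
        snapshot = copies[:]       # copies present at the start of the round
        for count in snapshot:
            for _ in range(count):              # every running copy probes once
                target = rng.randrange(num_computers)
                if copies[target] == 0:
                    copies[target] = 1          # fresh infection
                elif reinfect_every is not None and rng.randrange(reinfect_every) == 0:
                    copies[target] += 1         # duplicate copy: the fatal load
    return copies
```

Under the intended behavior (`reinfect_every=None`) the total number of copies can never exceed the number of machines. With the one-in-seven rule, duplicates accumulate without bound, so each machine ends up running more and more copies until, as in 1988, it grinds to a halt.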

One of the lasting impacts of the Morris worm that is hard to quantify is its effect on cyber security.  The worm exploited known bugs that no one had worried about enough to fix.  At the time of the Morris worm, there was no commercial traffic on the internet, nor even any websites.  The people who had access to the internet were a small, elite group, and concerns about cyber security hadn't really come up.  If the first “hacker” attack had had malicious intent behind it and had come a little later, the damage would likely have been much more severe.  While the initial impacts of the Morris worm were all negative, it's a positive thing that it highlighted the need to consider cyber security relatively early in the development of the internet.

It’s also interesting to note that the programmer behind the Morris worm, Robert Tappan Morris, became the first person to be indicted under the 1986 Computer Fraud and Abuse Act. He was sentenced to a $10,050 fine, 400 hours of community service, and three years of probation. Morris was a 23-year-old graduate student at the time he released his infamous worm.  After this initial hiccup, Morris went on to have a successful career and now works at the MIT Computer Science and Artificial Intelligence Laboratory.