Tag Archives: internet

Experts warn that vehicles are vulnerable to cyberattacks

By Kim Smiley 

By now, you have probably heard of the “internet of things” and the growing concern about the number of things potentially vulnerable to cyberattacks as more and more everyday objects are designed to connect to the internet.  According to a new report by the Government Accountability Office (GAO), cyberattacks on vehicles should be added to the list of potential cybersecurity concerns.  It’s easy to see how bad a situation could quickly become if a hacker was able to gain control of a vehicle, especially while it was being driven.

A Cause Map, a visual root cause analysis, can be built to analyze the issue of the potential for cyberattacks on vehicles.  The first step in the Cause Mapping process is to define the problem by filling out an Outline with basic background information as well as how the problem impacts the overall goals.  The Cause Map is then built by starting at one of the goals and asking “why” questions to visually lay out the cause-and-effect relationships. 

In this example, the safety goal would be impacted because of the potential for injuries and fatalities. Why is there this potential? Because cyberattacks on cars could cause crashes. Continuing down this path, cyberattacks on cars could happen because most modern car designs include advanced electronics that connect to outside networks, and these electronics could be hacked.  Additionally, most of the computer systems in a car are connected in some way, so gaining access to one electronic system can give hackers a doorway into other systems in the car.

Hackers can gain access to systems in the car via direct access to the vehicle (by plugging into the on-board diagnostic port or the CD player) or, a scenario that may be even more frightening, they may be able to gain access remotely through a wireless network.  Researchers have shown that it is possible to gain remote access to cars because many modern car designs connect to outside networks and cars in general have limited cybersecurity built into them. Why cars don’t have better cybersecurity built into them is a more difficult question to answer, but it appears that the potential need for better security hadn’t been identified.

As of right now, the concern over potential cyberattacks on cars is mostly a theoretical one.  There have been no reports of injuries caused by a car being attacked.  There have been cases of cars being hacked, such as at Texas Auto Center in 2010 when a disgruntled ex-employee caused cars to honk their horns at odd hours and disabled starters, but there are few (if any) reports of cyberattacks on moving vehicles.  However, the threat is concerning enough that government agencies are determining the best way to respond to it. The National Highway Traffic Safety Administration established a new division in 2012 to focus on vehicle electronics, which includes cybersecurity. Ideally, possible cyberattacks should be considered and appropriate cybersecurity built into designs as more and more complexity is added to the electronics in vehicles, and as objects ranging from pacemakers to refrigerators are designed to connect to wireless networks.

Facebook Bug Makes Users Feel Old

By ThinkReliability Staff

In a real blow for an industry constantly trying to remain hip and relevant, many Facebook users were notified of “46 year anniversaries” of their relationships with friends on Facebook on the last day of 2015. Facebook (which is itself only 11 years old) issued a statement saying “We’ve identified this bug and the team’s fixing it now so everyone can ring in 2016 feeling young again.”

While Facebook didn’t release any details about what caused the bug, a pretty convincing explanation was posted by Microsoft engineer Mark Davis. We can use his theory to create an initial Cause Map, or visual root cause analysis. The first step in the Cause Mapping process is to fill out a problem outline. The problem outline captures the what (Facebook glitch), when (December 31, 2015), where (Facebook) and the impact to the organization’s goals. In this case, the only goals that appear to be impacted are the customer service goal (resulting from the negative publicity to Facebook) and the labor/time goal (resulting from the time required to fix the glitch).

The next step in the Cause Mapping process is the analysis. The Cause Map begins with an impacted goal. Asking “Why” questions develops the cause-and-effect relationship that resulted in the effect. In this case, the impact to the customer service goal results from the negative publicity. Continuing to ask “Why” questions will add more detail to the Cause Map. The negative publicity was caused by Facebook posting incorrect anniversaries.

Some effects will result from more than one cause. Facebook posting incorrect anniversaries can be considered an effect that was caused by incorrect anniversary dates being identified by Facebook AND Facebook posting anniversary dates. Because both of these causes were required to produce an effect, they are joined with an “AND” on the Cause Map. (If the anniversary dates had been identified correctly, or if they weren’t posted on Facebook, the issue would not have occurred.) The incorrect anniversary dates were due to a software glitch (or bug), according to Facebook. Inadequate testing can generally be considered a cause whenever any bug is found in software that is used or released to the public. Had a larger range of dates been used to test this feature, the software glitch would have been identified before it resulted in public postings on Facebook.

Other impacted goals are added to the Cause Map as effects of the appropriate causes. In this case, the labor/time goal is impacted because of the time needed to fix the glitch. The cause of this is the software glitch. All impacted goals should be added to the Cause Map.

The cause of the software bug is not definitively known. To indicate potential causes, we include a “?” after the cause, and include as much evidence as possible to support it. Testimony can be used as evidence for causes. In this case, the source of the potential cause is a Microsoft engineer, who described a scenario that could lead to this issue at Facebook. Unix, which is an operating system, associates the value “0” with the date 1/1/1970 (known as the Unix epoch). If the date a user friended another user was stored as “0”, the system would identify the friending date as 1/1/1970 and, with some accounting for time zones, would see 46 years of friendship on December 31, 2015. It is presumed that the friend date would be stored as “0” if a friendship already existed before Facebook began tracking anniversaries.
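Davis’s epoch theory is easy to demonstrate. The sketch below is hypothetical code, not Facebook’s, but it shows how a friend date stored as 0 resolves to the evening of December 31, 1969 in a US time zone, making December 31, 2015 look like a 46-year anniversary:

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset as a stand-in for a US time zone (DST ignored).
eastern = timezone(timedelta(hours=-5))

def friendship_years(friended_ts, today):
    """Whole years between a Unix-timestamp friend date and today."""
    friended = datetime.fromtimestamp(friended_ts, tz=today.tzinfo)
    years = today.year - friended.year
    # Subtract a year if the anniversary hasn't arrived yet this year.
    if (today.month, today.day) < (friended.month, friended.day):
        years -= 1
    return years

# A missing friend date defaulting to 0 is the Unix epoch, which in a
# US time zone falls on December 31, 1969...
assert datetime.fromtimestamp(0, tz=eastern).date() == datetime(1969, 12, 31).date()

# ...so on December 31, 2015 the "anniversary" comes out to 46 years.
today = datetime(2015, 12, 31, tzinfo=eastern)
print(friendship_years(0, today))  # 46
```

Note that in UTC itself the epoch is January 1, 1970, which would give only 45 years on that date; the time zone offset is what pushes the count to 46.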

Errors associated with the Unix epoch are pretty common, but this appears to be the first time a bug like this has bitten Facebook. Presumably the error was quickly fixed, but we won’t know for sure until next December.

App Takes Down National Weather Service Website

By Kim Smiley

The National Weather Service (NWS) website was down for hours on August 25, 2014.  Emergency weather alerts such as tornado warnings were still disseminated through other channels, but this issue raises questions about the robustness of a vital website.

This issue can be analyzed by building a Cause Map, a visual format for performing a root cause analysis.  Cause Maps are built by laying out all the causes that contributed to a problem to show the cause-and-effect relationships.  The idea is to identify all the causes (plural), not just THE one root cause.

This example is a good illustration of the potential danger of focusing on a single root cause.  The NWS website outage was caused by an abusive Android app that bogged the site down with excessive traffic.  The app was designed to provide current weather information and it pulled data directly from the forecast.weather.gov website.  The app inadvertently queried the website thousands of times a second because of a programming error and the website was essentially overwhelmed.  It was similar to the denial of service attacks that have been directed at websites such as Bank of America and Citigroup, but the spike in traffic in this case wasn’t deliberate.
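The NWS has not released the app’s code, but the failure mode is a familiar one: a refresh loop with no delay between requests. A hypothetical Python sketch of the difference (function names and the interval are illustrative):

```python
import time

def fetch_forecast():
    """Stand-in for the app's HTTP request to forecast.weather.gov."""
    return "forecast data"

def poll_without_throttle(iterations):
    # Buggy pattern: back-to-back requests as fast as the loop can run,
    # which can easily reach thousands of queries per second.
    return [fetch_forecast() for _ in range(iterations)]

def poll_with_throttle(iterations, min_interval=0.005):
    # Safer pattern: enforce a minimum interval between requests.
    results = []
    for _ in range(iterations):
        results.append(fetch_forecast())
        time.sleep(min_interval)
    return results

start = time.monotonic()
poll_with_throttle(10)
print(f"throttled: {time.monotonic() - start:.3f}s for 10 requests")
```

A real client would also back off on errors, but even the minimal sleep above caps the request rate at a level a server can absorb.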

It may be tempting to say that the app was the root cause. Or you could be more specific and say the programming error was the root cause.  But labeling either of these “the root cause” would imply that the problem is solved once the software error is fixed. The root cause is gone, no more problem…right?  In order to address the issue, NWS installed a filter to block the excessive queries and worked with the app developer to ensure the error was fixed, but there are other factors that must be considered to effectively reduce the risk of a similar problem recurring.

One of the things that must be considered in this example is why a filter that blocked denial of service attacks wasn’t already in place.  Flooding a website with excessive traffic is a well-known strategy of hackers.  If an app could accidentally take the site down for hours, it is worrisome to consider what somebody with malicious intent could do.  The NWS is responsible for disseminating important safety information to the public and needs a reasonably robust website.  In order to reduce the impact of a similar issue in the future, the NWS needs to evaluate the protections it has in place for its website and see if any other safeguards should be implemented beyond the filter that addressed this specific issue.
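The details of the NWS filter have not been published, but a common server-side safeguard is a per-client request budget over a sliding time window. A simplified sketch (class name and limits are illustrative, not the NWS implementation):

```python
import time
from collections import deque

class RateLimitFilter:
    """Reject clients that exceed a request budget within a time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.history = {}  # client id -> deque of request timestamps

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        times = self.history.setdefault(client_id, deque())
        # Drop timestamps that have aged out of the window.
        while times and now - times[0] > self.window:
            times.popleft()
        if len(times) >= self.max_requests:
            return False  # over budget: block the request
        times.append(now)
        return True

limiter = RateLimitFilter(max_requests=3, window_seconds=1.0)
print([limiter.allow("app", now=0.1 * i) for i in range(5)])
# [True, True, True, False, False]
```

With a filter like this in front of the web server, one misbehaving client exhausts its own budget instead of exhausting the site.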

If the investigation was focused too narrowly on a single root cause, the entire discussion of cyber security could be missed.  Building a Cause Map of many causes ensures that a wider variety of solutions are considered and that can lead to more effective risk prevention.

To view a high level Cause Map of this issue, click on “Download PDF” above.

Software Glitch Delays U.S. Travel Documents

By Kim Smiley

The Consular Consolidated Database (CCD) is the global database used by the U.S. State Department to process visas and other travel documents.  On July 20, 2014, the CCD experienced software issues and had to be taken offline.  The outage lasted several days, with the CCD being returned to service with limited capacity on July 23.  The CCD is huge (one of the largest Oracle-based data warehouses in the world) and processes a hefty number of visas each year, so the effects of the software glitch have been felt worldwide.  The State Department processed over 9 million immigrant and non-immigrant visas overseas in 2013, so a delay of even a few days means a significant backlog.

This issue can be analyzed by building a Cause Map, a visual root cause analysis.  A Cause Map visually lays out the different causes that contribute to an issue so that the problem is better understood and a wider range of solutions can be considered.  The first step in the Cause Mapping process is to define the problem, which includes documenting the overall impacts to the goal.  Most problems impact more than one goal and this example is no exception.

The customer service goal is clearly impacted because thousands – and potentially even millions – have had their travel document processing delayed.  The negative publicity can also be considered an impact to the customer service goal because this software glitch isn’t doing the international image of the U.S. any favors.  The delay in travel document services is an impact to the production/schedule goal and the recovery effort and investigation into the problems impact the labor/time goal.  Additionally, there are potential economic impacts to both individuals who may have had to change travel plans and to the U.S. economy because these issues may discourage international tourism.

The next step in the Cause Mapping method is to build the Cause Map.  This is done by asking “why” questions and using the answers to visually lay out the cause-and-effect relationships.  The delay in processing travel documents occurred because the CCD is needed to process them and the CCD had to be taken offline as a result of software issues.  Why were there issues with the database? Maintenance was done on the CCD on July 20 and the performance issues began shortly thereafter.  The maintenance was done to improve system performance and to fix previous intermittent performance issues. The State Department has stated that this was not a terrorist act or anything more malicious than a software glitch.  An investigation is currently underway to determine exactly what caused the software glitch, but the details have not been released at this time.  It can be assumed that the test program for the software was inadequate since the glitch wasn’t identified prior to implementation.

The final step in the Cause Mapping process is to identify solutions that can be implemented to reduce the risk of a problem recurring.  Details of exactly what was done to deal with the issue in the short term and bring the CCD back online aren’t available, but the State Department has stated that additional servers were added to increase capacity and improve response time.  There is also a plan to improve the CCD in the longer term by upgrading to a newer version of the Oracle database software by the end of the year which will hopefully prove more stable.

To view an Outline and high level Cause Map of this issue, click on “Download PDF” above.

When You Call Yourself ThinkReliability…

By ThinkReliability Staff

While I was bombasting about the Valdez oil spill in 1989, one of those ubiquitous internet fairies decided that I did not really need the network connection at my remote office.  Sadly this meant that the attendees on my Webinar had to listen only to me speaking without seeing the pretty diagrams I made for the occasion (after a short delay to switch audio mode).

Though I have all sorts of redundancies built into Webinar presentations (seriously, I use a checklist every time), I had not prepared for the complete loss of network access, which is what happened during my March 20th, 2014 Webinar.  I’m not going to use the term “root cause”, because I still had another plan . . . (yep, that failed, too).

For our mutual amusement (and because I get asked for this all the time), here is a Cause Map, or visual root cause analysis – the very method I was demonstrating during the failure – of what happened.

First we start with the what, when and where.  No who because blame isn’t the point, though in this case I will provide full disclosure and clarify that I am, in fact, writing about myself.  The Webinar in question was presented on March 20, 2014 at 2:00 PM EST (although to my great relief the issues didn’t start until around 2:30 pm).  That little thorn in my side? It was the loss of a network connection at the Wisconsin remote office (where I typically present from).  I was using Citrix Online’s GoToWebinar© program to present a root cause analysis case study of the Valdez oil spill online.

Next we capture the impact to the organization’s (in this case, ThinkReliability) goals.  Luckily, in the grand scheme of things, the impacted goals were pretty minor.  I annoyed a bunch of customers who didn’t get to see my slides and I scheduled an additional Webinar.  Also I spent some time doing follow-up to those who were impacted, scheduling another Webinar, and writing this blog.

Next we start with the impacted goals and ask “Why” questions.  The customer service goal was impacted because of the interruption in the Webinar.  GoToWebinar© (as well as other online meeting programs) has two parts: audio and visual.  I temporarily lost audio as I was using the online option (VOIP), which I use as a default because I like my USB headset better than my wireless headset.  The other option is to dial in using the phone.  As soon as I figured out I had lost audio, I switched to phone and was able to maintain the audio connection until the end of the Webinar (and after, for those lucky enough to hear me venting my frustration at my office assistant).

In addition to losing audio, I lost the visual screen-sharing portion of the Webinar.   Unlike audio, there’s only one option for this.  Screen sharing occurs through an online connection to GoToWebinar©.  Loss of that connection means there’s a problem with the GoToWebinar© program, or my network connection.  (I’ve had really good luck with GoToWebinar; over the last 5 years I have used the program at least weekly with only two connection problems attributed to Citrix.)  At this point I started running through my troubleshooting checklist.  I was able to reconnect to audio, so it seemed the problem was not with GoToWebinar©.  I immediately changed from my wired router connection to wireless, which didn’t help.  Meanwhile my office assistant checked the router and determined that the router was not connected to the network.

You will quickly see that at this point I reached the end of my expertise.  I had my assistant restart the router, which didn’t work, at least not immediately.  At this point, my short-term connection attempts (“immediate solutions”) were over.  Router troubleshooting (beyond the restart) or a call to my internet provider were going to take far longer than I had on the Webinar.

Normally there would have been one other possibility to save the day.  For online presentations, I typically have other staff members online to assist with questions and connection issues, who have access to the slides I’m presenting.  That presenter (and we have done this before) could take over the screen sharing while I continued the audio presentation.  However, the main office in Houston was unusually short-staffed last week (which is to say most everyone was out visiting cool companies in exciting places).  And (yes, this was the wound that this issue rubbed salt in), I had been out sick until just prior to the Webinar.  I didn’t do my usual coordination of ensuring I had someone online as my backup.

Because my careful plans failed me so completely, I scheduled another Webinar on the same topic.  I’ll have another staff member (at another location) ready online to take over the presentation should I experience another catastrophic failure (or a power outage, which did not occur last week but would also result in complete network loss to my location).   Also, as was suggested by an affected attendee, I’ll send out the slides ahead of time.  That way, even if this exact series of unfortunate events should recur, at least everyone can look at the slides while I keep talking.

To view my comprehensive analysis of a presentation that didn’t quite go as planned, please click “Download PDF” above.  To view one of our presentations that will be “protected” by my new redundancy plans, please see our upcoming Webinar schedule.

The Morris Worm: The First Significant Cyber Attack

By Kim Smiley

In 1988 the world was introduced to the concept of a software worm when the Morris worm made headlines for significantly disrupting the fledgling internet.  The mess left in the wake of the Morris worm took several days to clean up. Estimates of its cost vary greatly, from $100,000 to $10,000,000, but even at the lower end the numbers are substantial.

A Cause Map, or visual root cause analysis, can be used to analyze this issue.  A Cause Map is built by asking “why” questions and using the answers to visually lay out the causes that contributed to an issue to show the cause-and-effect relationships.  In this example, a programmer was trying to build a “harmless” worm that could be used to gauge the size of the internet, but he made a mistake.  The goal was to infect each computer one time, but to make the worm hard to defend against, it was designed to duplicate itself every seventh time a computer indicated it already had the worm.  The problem was that the speed of propagation was underestimated. Once released, the worm quickly reinfected computers over and over again until they were unable to function, and the internet came crashing down.  (To view a Cause Map of this example, click on “View PDF” above.)
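The original worm was written in C, but its fatal decision can be paraphrased in a few lines. The sketch below is a simplification, not the original code; it shows how the 1-in-7 override guaranteed runaway reinfection:

```python
import random

REINFECT_RATE = 7  # install anyway 1 time in 7, even on "infected" hosts

def should_install(host_reports_infected, rng=random.random):
    # The override was meant to defeat hosts that falsely claimed to be
    # infected, but it meant roughly every seventh probe reinfected a
    # genuinely infected victim.
    if not host_reports_infected:
        return True
    return rng() < 1.0 / REINFECT_RATE

random.seed(42)
trials = 70_000
reinstalls = sum(should_install(True) for _ in range(trials))
# Roughly 1 in 7 (~14%) of probes against already-infected hosts still
# install another copy, so each host accumulates copies without bound.
print(f"{reinstalls / trials:.1%} of probes against infected hosts reinstalled")
```

Because infected machines kept probing each other, each host's copy count grew until the duplicated worms consumed all available processing power.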

One of the lasting impacts from the Morris worm that is hard to quantify is the impact on cyber security.  The worm exploited known bugs that no one had worried about enough to fix.  At the time of the Morris worm, there was no commercial traffic on the internet, and there were no websites.  The people who had access to the internet were a small, elite group and concerns about cyber security hadn’t really come up.  If the first “hacker” attack had had malicious intent behind it and had come a little later, it’s likely that the damage would have been much more severe.  While the initial impacts of the Morris worm were all negative, it’s a positive thing that it highlighted the need to consider cyber security relatively early in the development of the internet.

It’s also interesting to note that the programmer behind the Morris worm, Robert Tappan Morris, became the first person to be indicted under the 1986 Computer Fraud and Abuse Act. He was sentenced to a $10,050 fine, 400 hours of community service, and three years of probation. Morris was a 23-year-old graduate student at the time he released his infamous worm.  After this initial hiccup, Morris went on to have a successful career and now works in the MIT Computer Science and Artificial Intelligence Laboratory.

NYT Website Disrupted for Hours

By Kim Smiley

On Tuesday, August 27, 2013 the New York Times website went dark for several hours after being attacked by a well-known group of hackers.   Reports of hacked websites are becoming increasingly common and the New York Times was just one of many recent victims.

A Cause Map, or visual root cause analysis, can be used to analyze the recent attack on the New York Times website.  A Cause Map lays out the many causes that contribute to an issue in an intuitive format that illustrates the cause-and-effect relationships.   A Cause Map is useful for understanding all the causes involved and can help when brainstorming solutions.  To see a Cause Map of this example, click on “Download PDF” above.

Some details of how the attack was done have been released, as documented on the Cause Map. The New York Times website itself was not technically hacked, but traffic was redirected away from the legitimate website to another web domain.   To pull off this feat, hackers changed the domain name records for the New York Times website after acquiring the user name and password of an employee at the domain name registrar company.  The employee inadvertently provided the information to the hackers by responding to a phishing email asking for personal information.

The email sent by the hackers looked legitimate enough to fool the employee.

So why did hackers target the New York Times in the first place?  The answer is that the New York Times is one of many western media outlets to be targeted by Syrian Electronic Army (S.E.A.), who has claimed responsibility for the attack.  The S.E.A. supports President Bashar al-Assad of Syria and is generally unhappy with the way the events in Syria have been portrayed in the West.

So the next logical question is how do you protect yourself from a phishing scheme?  The first step is awareness.  Pretty much everybody who uses email can expect to receive some suspicious emails.  A few things to look out for:  attachments, links, misspellings, and a mismatched “from” field or subject line.  Also any alarming language should be a red flag.  For example, an email from your credit card company warning you that your account will be closed unless you take immediate action is probably not the real deal.  A good rule of thumb is to never respond to any email with personal information or to click on links in emails. If you think a request for action may be real, either call the company or open a new web browser window and type in the company’s web address.  It’s best to delete any suspicious emails immediately.
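Some of these checks can even be automated. The sketch below is a toy heuristic, not a real spam filter, and the domain names and phrase list are invented for illustration; it flags two of the signs mentioned above, a From address that doesn’t match the domain the message claims to come from, and alarming language:

```python
from email import message_from_string

SUSPICIOUS_PHRASES = ("verify your account", "immediate action",
                      "account will be closed")

def phishing_red_flags(raw_email, claimed_domain):
    """Return a list of simple phishing warning signs found in an email."""
    msg = message_from_string(raw_email)
    flags = []
    sender = msg.get("From", "")
    # Sign 1: the sender address doesn't contain the expected domain.
    if claimed_domain not in sender.lower():
        flags.append("from-address mismatch")
    # Sign 2: alarming language designed to rush the reader.
    body = msg.get_payload()
    if isinstance(body, str):
        lowered = body.lower()
        for phrase in SUSPICIOUS_PHRASES:
            if phrase in lowered:
                flags.append(f"alarming language: {phrase!r}")
    return flags

sample = (
    "From: security@examp1e-bank.com\n"   # note the digit "1" in the domain
    "Subject: Urgent notice\n"
    "\n"
    "Your account will be closed unless you take immediate action.\n"
)
print(phishing_red_flags(sample, "example-bank.com"))
```

Real mail filters weigh many more signals (link targets, authentication headers, sender reputation), but the habit is the same: check whether the message is what it claims to be before acting on it.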

This example is also a good reminder to be aware that websites can get hacked.  A great example of this is when the S.E.A. hacked the Associated Press’s Twitter feed last April and used it to announce (falsely) that the White House had been bombed.  That one tweet is estimated to have caused a $136 billion loss in the stock markets as people responded to the news.  In general, it is probably good to be skeptical about anything shocking you read online until the information is confirmed.

Gaming Network Hacked

By Kim Smiley

Gamers worldwide have been twiddling their thumbs for the last two weeks, after a major gaming network was hacked last month.  Sony, well known for its security reputation, quickly shut down the PlayStation Network after it learned of the attacks, but not before 100+ million customers were exposed to potential identity theft.  Newspapers have been abuzz with similar high-profile database breaches in the last few weeks, but this one seems to linger.  The shutdown has now prompted a Congressional inquiry and multiple lawsuits.  What went so wrong?

A Cause Map can help outline the root causes of the problem.  The first step is to determine how the event impacted company goals.  Because of the magnitude of the breach, there were significant impacts to customer service, property and sales goals.  The impact to Sony’s customer service goals is most obvious; customers were upset that the gaming and music networks were taken offline.  They were also upset that their personal data was stolen and they might face identity fraud.

However, these impacts changed as more information came to light and the service outage lingered.  Sony has faced significant negative publicity from the ongoing service outage and even multiple lawsuits.  Furthermore, customers were upset by the delay in notification, especially considering that the company wasn’t sure if credit card information had been compromised as well.

As the investigation unfolded, new evidence came to light about what happened.  This provided enough information to start building an in-depth Cause Map.  It turns out that the network was hacked for three reasons.  Sony was busy fending off denial-of-service attacks, and simultaneously hackers (who may or may not have been affiliated with the DoS attacks) attempted to access the personal information database.  A third condition was required, though: the database had to actually be accessible to hack into, and unfortunately it was.

Why were hackers able to infiltrate Sony’s database?  At first, there was speculation that they may have entered Sony’s system through its trusted developer network.  It turns out that all the hackers needed to do was target the server software Sony was running.  That software was outdated and did not have firewalls installed.  With the company distracted, it was easy for hackers to breach their minimal defenses.

Most of the data that the hackers targeted was also unencrypted.  Had the data been encrypted, it would have been useless to the hackers.  This raises major liability questions for the company.  To fend off both the negative criticism and lawsuits, Sony has been proactive about implementing solutions to protect consumers from identity fraud.  U.S. customers will soon be eligible for up to $1M in identity theft insurance.  However, other solutions need to be implemented as well to prevent or correct other causes.  Look at the Cause Map; notice that if you only correct issues related to fraud, there are still impacts without a solution.

Sony obviously needs to correct the server software and encryption flaws which let the hackers access customers’ data in the first place.  Looking to the upper branch of the Cause Map is also important, because the targeted DoS attack and possibly coordinated data breach jointly contributed to the system outage.  More detailed information on this branch will probably never become public, but further investigation might produce effective changes that would prevent a similar event from occurring.

75 Year Old Woman Cuts Internet Service to Armenia With a Shovel

By Kim Smiley

On March 28, 2011, a 75-year-old woman out digging for scrap metal accidentally cut internet service to nearly all of Armenia.  There were also service interruptions in Azerbaijan and part of Georgia.  Some regions were able to switch to alternative internet suppliers within a few hours, but some areas were without internet service for 12 hours.

How did this happen?  How could an elderly woman and a shovel cause such chaos without even trying?

A root cause analysis can be performed and a Cause Map built to show what contributed to this incident.  Building a Cause Map begins with determining the impacts to the organizational goals.  Then “why” questions are asked and causes are added to the map.

In this example, the Customer Service Goal is impacted because there was significant internet service interruption and the Production Schedule Goal was also impacted because of loss of worker productivity.  The Material Labor Goal also needs to be considered because of the cost of repairs.

Now causes are added to the Cause Map by asking “why” questions.  Internet service was disrupted because a fiber optic cable was damaged by a shovel.  In addition, this one cable provided 90 percent of Armenia’s internet so damaging it created a huge interruption in internet service.

Why would a 75-year-old woman be out digging for cables?  The woman was looking for copper cable and accidentally hit the fiber optic cable.  This happened because both types of cables are usually buried inside PVC conduit and can look similar.  She was looking for copper cable because there is a market for scrap metal.  Metal scavenging is a common practice in this region because there are many abandoned copper cables left in the ground.  She was also able to hit the fiber optic cable because it was closer to the surface than intended, likely exposed by mudslides or heavy rains.

The woman, who has been dubbed the spade-hacker by local media, has been released from police custody.  She is still waiting to hear whether she faces any punishment, but police statements implied that the prosecutor won’t push for the maximum of three years in prison due to her age.

To see the Cause Map of this issue, click on the “Download the PDF” button above.