
BlackBerry’s Widespread Failure

By ThinkReliability Staff

BlackBerry faced yet another setback last month when service went down worldwide for multiple days.  Research in Motion (RIM), the company behind BlackBerry and already facing stiff competition from other smartphone vendors, apologized profusely for the outage and vowed to woo back its customers.  What caused the extensive and possibly business-ending service outage?

A root cause analysis can help identify what occurred.  The first step is to outline the incident.  The service outage originated in Europe, then spread to four other continents over a 72-hour period.  Customers were furious about both the outage and the company’s slow PR response.  The outage impacted two major RIM goals: generating revenue for shareholders and maintaining customer satisfaction.  Working backwards from these goals, the Cause Map shows what events led to the catastrophic failure and where further investigation is needed.

The company faces a potential loss of revenue if it loses customers.  In the past, the company may not have had to worry about the impact of such service outages, but customers now have viable alternatives such as Apple and Android devices.  Continuing to work backwards, customers were upset because of a service outage.  At this point, it helps to examine the BlackBerry network architecture.

BlackBerry’s architecture is fundamentally different from that of Apple and Android.  All data is filtered through the company’s internal service network before being passed on to carrier networks such as Sprint and Verizon.  Apple and Android don’t provide processing in the middle.  When BlackBerry’s core switch failed in an English data center, a backup switch was supposed to take over.  Although the backup had been tested successfully, it didn’t work, leading to a buildup of messages waiting to be processed.  That mountain of messages led to backlogs in other data centers worldwide.  The switch failure also corrupted the database software managing all the messages within the network.
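The buildup of unprocessed messages can be illustrated with a small simulation.  This is only a sketch under invented assumptions: the function name, the arrival rate, the processing capacity and the outage window are all hypothetical and do not come from RIM.  It shows why a backlog can linger well after a failed switch is restored, since the queue drains only as fast as spare capacity allows.

```python
def simulate_backlog(arrival_per_tick, capacity_per_tick,
                     outage_start, outage_end, total_ticks):
    """Return the number of queued messages at the end of each tick.

    All parameters are hypothetical illustration values. During the
    outage window [outage_start, outage_end) nothing is processed,
    which models both the core switch and its backup being down.
    """
    backlog = 0
    history = []
    for tick in range(total_ticks):
        backlog += arrival_per_tick  # new messages keep arriving
        switch_up = not (outage_start <= tick < outage_end)
        if switch_up:
            # Process as many queued messages as capacity allows.
            backlog -= min(backlog, capacity_per_tick)
        history.append(backlog)
    return history


# Hypothetical numbers: 100 messages arrive per tick, the switch can
# handle 120 per tick, and it is down for ticks 2 through 4.
history = simulate_backlog(100, 120, 2, 5, 10)
print(history)
```

With these made-up numbers the queue grows by 100 messages on each tick of the outage but shrinks by only 20 per tick afterwards, so recovery takes far longer than the failure itself.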

It turns out that this network architecture is both a liability and at the heart of the company’s business success.  By centrally processing all data messages (both compressing and encrypting them), RIM provides additional security and reduces the processing required at the user device, meaning lower energy use and a longer battery life.  Despite these strengths, RIM would be wise to find out why its network crashed.  As users store more data within the network, as with cloud computing, outages could cripple the system for even longer.

Gaming Network Hacked

By Kim Smiley

Gamers worldwide have been twiddling their thumbs for the last two weeks, after a major gaming network was hacked last month.  Sony, well known for its security, quickly shut down the PlayStation Network after it learned of the attacks, but not before 100+ million customers were exposed to potential identity theft.  Newspapers have been abuzz with similar high-profile database breaches in the last few weeks, but this one seems to linger.  The shutdown has now prompted a Congressional inquiry and multiple lawsuits.  What went so wrong?

A Cause Map can help outline the root causes of the problem.  The first step is to determine how the event impacted company goals.  Because of the magnitude of the breach, there were significant impacts to customer service, property and sales goals.  The impact to Sony’s customer service goals is most obvious; customers were upset that the gaming and music networks were taken offline.  They were also upset that their personal data was stolen and they might face identity fraud.

However, these impacts changed as more information came to light and the service outage lingered.  Sony has faced significant negative publicity from the ongoing service outage and even multiple lawsuits.  Furthermore, customers were upset by the delay in notification, especially considering that the company wasn’t sure if credit card information had been compromised as well.

As the investigation unfolded, new evidence came to light about what happened.  This provided enough information to start building an in-depth Cause Map.  It turns out that the network was hacked for three reasons.  Sony was busy fending off Denial of Service (DoS) attacks, and simultaneously hackers (who may or may not have been affiliated with the DoS attacks) attempted to access the personal information database.  A third condition was required though.  The database had to actually be accessible to hack into, and unfortunately it was.
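The way these three causes combine can be sketched in a few lines of code.  The sketch below treats them as jointly required, as the paragraph above describes them; the dictionary keys and the function name are hypothetical labels chosen for illustration, not part of any Cause Mapping tool.

```python
# Hypothetical labels for the three causes described above.
causes = {
    "company busy fending off DoS attacks": True,
    "hackers attempted to access the database": True,
    "database was accessible": True,
}


def effect_occurs(required_causes):
    """The effect happens only if every required cause is present."""
    return all(required_causes.values())


print(effect_occurs(causes))  # all three causes present

# Removing any single cause is enough to prevent the effect.
for name in causes:
    scenario = dict(causes, **{name: False})
    print(name, "removed ->", effect_occurs(scenario))
```

Because the causes are joined in this way, a solution that eliminates any one of them (for example, making the database inaccessible) would have been enough to prevent the breach in this simplified model.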

Why were hackers able to infiltrate Sony’s database?  At first, there was speculation that they may have entered Sony’s system through its trusted developer network.  It turns out that all the hackers needed to do was target the server software Sony was running.  That software was outdated and did not have firewalls installed.  With the company distracted, it was easy for hackers to breach its minimal defenses.

Most of the data that the hackers targeted was also unencrypted.  Had the data been encrypted, it would have been useless to them.  This raises major liability questions for the company.  To fend off both the criticism and the lawsuits, Sony has been proactive about implementing solutions to protect consumers from identity fraud.  U.S. customers will soon be eligible for up to $1M in identity theft insurance.  However, other solutions need to be implemented as well to prevent or correct other causes.  Look at the Cause Map; notice that if you only correct issues related to fraud, there are still impacts without a solution.

Sony obviously needs to correct the server software and encryption flaws that let the hackers access customers’ data in the first place.  Looking to the upper branch of the Cause Map is also important, because the targeted DoS attack and possibly coordinated data breach jointly contributed to the system outage.  More detailed information on this branch will probably never become public, but further investigation might produce effective changes that would prevent a similar event from occurring.