BlackBerry faced yet another setback last month when service went down worldwide for multiple days. Research in Motion (RIM), the company behind BlackBerry and already facing stiff competition from other smartphone vendors, apologized profusely for the outage and vowed to woo back its customers. What caused the extensive and possibly business-ending service outage?
A root cause analysis can help identify what occurred. The first step is to outline the incident. The service outage originated in Europe, then spread to four other continents over a 72-hour period. Customers were furious about both the outage and the company's slow PR response. The outage impacted two major RIM goals: generating revenue for shareholders and maintaining customer satisfaction. Working backwards from these goals, the Cause Map shows which events led to the catastrophic failure and where further investigation is needed.
The company faces a potential loss of revenue if it loses customers. In the past, RIM may not have had to worry much about the impact of such service outages, but customers now have viable alternatives such as Apple and Android devices. Continuing to work backwards, customers were upset because of the service outage. At this point, it helps to examine the BlackBerry network architecture.
BlackBerry’s architecture is fundamentally different from that of Apple and Android. All data is routed through the company’s internal service network before being passed on to carrier networks such as Sprint and Verizon. Apple and Android don’t provide processing in the middle. When BlackBerry’s core switch failed in an English data center, a backup switch was supposed to take over. Although the backup had been tested successfully, it didn’t work when it was needed, leading to a buildup of messages waiting to be processed. That mountain of messages led to backlogs in other data centers worldwide. The switch failure also corrupted the database software managing all the messages within the network.
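The backlog described above can be illustrated with a minimal sketch of a message queue whose processing capacity drops to zero for a while. The function name and every number here are hypothetical and chosen only for illustration; they are not RIM figures and do not model the actual network.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick):
    """Return the number of queued messages after each time tick.

    arrivals_per_tick: messages arriving in each tick
    capacity_per_tick: messages that can be processed in each tick
    """
    backlog = 0
    history = []
    for arrived, capacity in zip(arrivals_per_tick, capacity_per_tick):
        # Unprocessed messages carry over; the queue never goes below zero.
        backlog = max(0, backlog + arrived - capacity)
        history.append(backlog)
    return history


# Hypothetical numbers: steady traffic, with the switch (and its backup)
# processing nothing during ticks 2 through 4.
arrivals = [100] * 8
capacity = [120, 120, 0, 0, 0, 120, 120, 120]

history = simulate_backlog(arrivals, capacity)
print(history)  # [0, 0, 100, 200, 300, 280, 260, 240]
```

In this toy model the queue keeps growing for as long as processing is down, and because spare capacity is small, it drains far more slowly than it built up once processing resumes.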
It turns out that this network architecture is both a liability and the heart of the company’s business success. By centrally processing all data messages – both compressing and encrypting them – RIM provides additional security and reduces the processing required on the user's device, which means lower energy use and longer battery life. Despite these strengths, RIM would be wise to find out why its network crashed. As users store more data within the network – as with cloud computing – outages could cripple the system for even longer.