On April 21, 2011, some of Amazon’s Elastic Compute Cloud (EC2) customers experienced issues when a combination of events left Elastic Block Store (EBS) volumes in the US East Region unable to process read or write operations. This seriously impacted customer service. Massive recovery efforts were undertaken, and services, along with most of the affected data, were restored within three days. Amazon has released its post-mortem analysis of these events. Using the information it provides, we can begin a visual root cause analysis, or Cause Map, laying out the event.
We begin with the affected goal: customer service was impacted by the inability to process read or write operations. That ability was lost because an EBS cluster became degraded. (A cluster is a group of nodes responsible for replicating data and processing read and write requests.) The cluster was degraded by the failure of some nodes and by a large number of nodes searching for other nodes on which to create new replicas.
At this point, we’ll look into the process to explain what’s going on. When a user makes a request, a control plane accepts it and passes it to an EBS cluster. The cluster elects a node to be the primary replica of that data. That node stores the data and looks for other available nodes on which to create backup replicas. If the node-to-node connection is lost, the primary replica searches for another node; once it has established connectivity with that node, the new node becomes another replica. This process runs continuously.
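To make that flow concrete, here is a minimal Python sketch of the replication process, under a deliberately simplified model: a control plane picks a primary node, and that node re-mirrors its data to a peer. The `Cluster`, `Node`, and `find_available_node` names are hypothetical and do not reflect Amazon’s actual implementation.

```python
import random

class Node:
    """A hypothetical EBS storage node holding the primary copy of some data."""
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self.replica_peer = None  # node holding the backup replica

    def store(self, data):
        """Store data locally, then re-mirror it to a peer node."""
        self.data = data
        self.re_mirror()

    def re_mirror(self):
        """Search the cluster for an available peer and copy the data to it.
        If the connection to the current peer is lost, this runs again."""
        peer = self.cluster.find_available_node(exclude=self)
        if peer is None:
            raise RuntimeError(f"{self.name}: no peer available for replication")
        self.replica_peer = peer
        peer.data = self.data  # backup replica now lives on the peer

class Cluster:
    """A hypothetical EBS cluster: a group of nodes plus a simple control plane."""
    def __init__(self, size):
        self.nodes = [Node(f"node-{i}", self) for i in range(size)]

    def find_available_node(self, exclude):
        candidates = [n for n in self.nodes if n is not exclude]
        return random.choice(candidates) if candidates else None

    def handle_write(self, data):
        """Control plane: elect a primary replica for the incoming request."""
        primary = random.choice(self.nodes)
        primary.store(data)
        return primary
```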
In this case, a large number of nodes were searching for new replicas because they had lost connection to the nodes holding their existing replicas. Following the process described above, these nodes began searching for other nodes. They could not find any, however, because the network was unavailable and the nodes could not communicate with one another. The nodes had a long time-out period for the search, so their searches continued while still more nodes lost communication and began searches of their own, steadily increasing the volume.
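The sketch below illustrates, with entirely made-up numbers, why a long search time-out lets these stuck searches accumulate: every node that started searching within the time-out window is still searching, so the number of concurrent searches grows with the length of the time-out.

```python
# Hypothetical numbers, for illustration only: with a long per-search time-out,
# every node that started searching within the last SEARCH_TIMEOUT_S seconds is
# still searching, so concurrent searches pile up instead of failing fast.
SEARCH_TIMEOUT_S = 600          # long time-out keeps each search alive
NODES_LOSING_PEERS_PER_S = 50   # rate at which nodes lose their replica peer

def concurrent_searches(elapsed_s):
    return NODES_LOSING_PEERS_PER_S * min(elapsed_s, SEARCH_TIMEOUT_S)

for t in (10, 60, 300, 600):
    print(f"after {t:>3}s: ~{concurrent_searches(t):,} nodes searching concurrently")
```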
The network communication was lost because traffic was shifted off the primary network. This was caused by an error during a network configuration change intended to upgrade the capacity of the primary network. The traffic should have been shifted to a redundant router on the primary network but was instead routed onto the secondary network. The secondary network did not have sufficient capacity to handle all of the traffic and so was unable to maintain connectivity.
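As a rough illustration of the capacity mismatch (the actual bandwidth figures were not published, so every number below is an assumption), shifting the same traffic onto a lower-capacity network immediately pushes that network past its limit:

```python
# Illustrative capacities only; the real network figures were not published.
PRIMARY_CAPACITY_GBPS = 100    # high-bandwidth primary EBS network (assumed)
SECONDARY_CAPACITY_GBPS = 10   # lower-capacity secondary network (assumed)
NORMAL_TRAFFIC_GBPS = 60       # steady-state replication traffic (assumed)

def network_is_healthy(traffic_gbps, capacity_gbps):
    """A link carrying more traffic than its capacity cannot maintain connectivity."""
    return traffic_gbps <= capacity_gbps

# Intended change: move traffic to a redundant router on the primary network.
print(network_is_healthy(NORMAL_TRAFFIC_GBPS, PRIMARY_CAPACITY_GBPS))    # True

# Actual change: traffic was routed onto the secondary network instead.
print(network_is_healthy(NORMAL_TRAFFIC_GBPS, SECONDARY_CAPACITY_GBPS))  # False
```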
In addition to the large number of nodes searching for other nodes, the EBS cluster was impacted by node failures. Some nodes failed because of a race condition in the node software that could cause a node to fail when it attempted to process many concurrent replication requests, and those requests were generated by the situation described above. Each failed node also left other nodes without their replicas, compounding the difficulty of recovering from this event.
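Amazon did not publish the code involved, but a generic check-then-act race looks like the hypothetical Python sketch below: two threads both see a request as open, and the second attempt to close it fails. Because races are timing-dependent, the failure may not fire on every run.

```python
import threading

# A hypothetical check-then-act race, loosely analogous to a node failing while
# closing many concurrent replication requests. Nothing here reflects Amazon's
# actual code.
open_requests = set(range(50_000))
failures = []

def close_request(req_id):
    # Two threads can both pass the membership check before either removes the
    # entry; the second remove() then raises KeyError.
    if req_id in open_requests:
        open_requests.remove(req_id)

def worker(name):
    for req_id in range(50_000):
        try:
            close_request(req_id)
        except KeyError:
            failures.append((name, req_id))  # the "node" hit the race and failed

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"race triggered {len(failures)} time(s)")
```

The usual fix for this class of bug is to guard the check and the removal with a single lock so the two steps cannot interleave.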
Service is back to normal, and Amazon has made changes to prevent this type of issue from recurring. As an immediate fix, the traffic was shifted back to the primary network and the error that caused the shift was corrected. Additional capacity was added to keep the EBS cluster from being overwhelmed. The retry logic that allowed nodes to keep searching for long periods has been modified, and the source of the race condition that caused the node failures has been identified and repaired.
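Amazon has not published the revised retry logic; one common pattern for preventing this kind of pile-up is capped exponential backoff with a bounded number of attempts, sketched below. The `search_fn` callable and all parameter values are assumptions for illustration.

```python
import random
import time

def find_replica_peer_with_backoff(search_fn, max_attempts=6,
                                   base_delay_s=1.0, cap_s=60.0):
    """Retry a peer search with capped exponential backoff and jitter, then
    give up instead of searching indefinitely. `search_fn` is a hypothetical
    callable that returns a peer node or None."""
    for attempt in range(max_attempts):
        peer = search_fn()
        if peer is not None:
            return peer
        delay = min(cap_s, base_delay_s * (2 ** attempt))
        time.sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries
    return None  # surface the failure rather than retrying forever
```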
View the root cause analysis investigation of this event, including an outline, timeline, Cause Map, solutions and Process Map, by clicking “Download PDF” above.