Geographic Replication

Aerospike’s integrated geographic replication system automatically and continuously copies data between clusters, enabling Internet-scale, disaster-proof applications.

Cross Datacenter Replication (XDR) supports different replication topologies, including active-active, active-passive, chain, star, and multi-hop configurations.

In a normal deployment state (i.e., when there are no failures), each node logs the operations that occur on that node for both master and replica partitions. However, each node ships to remote clusters only the data for the master partitions it owns. The changes logged on behalf of replica partitions are used only in the case of node failure: if a node fails, the other nodes detect the failure and take over the pending shipping work on its behalf. This scheme scales horizontally, as one can simply add nodes to handle an increasing replication load.
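The failover behavior above can be sketched as follows. This is a simplified model, not Aerospike's actual partition map: assume each partition has one master and one replica, both log the same changes, and only the master ships unless it is down.

```python
# Sketch of XDR shipping-responsibility failover (simplified model):
# every node logs changes for both its master and replica partitions,
# but only the master's log is shipped -- until the master fails.

from dataclasses import dataclass

@dataclass
class Partition:
    pid: int
    master: str
    replica: str

def shipping_responsibilities(partitions, live_nodes):
    """Return {node: [pid, ...]} of partitions each live node must ship.

    Normally the master ships; if the master is down, the replica
    (which has been logging the same changes) takes over.
    """
    duties = {n: [] for n in live_nodes}
    for p in partitions:
        if p.master in live_nodes:
            duties[p.master].append(p.pid)
        elif p.replica in live_nodes:
            duties[p.replica].append(p.pid)   # failover takeover
    return duties

parts = [Partition(0, "A", "B"), Partition(1, "B", "C"), Partition(2, "C", "A")]
# With node C down, node A (C's replica for partition 2) takes over:
assert shipping_responsibilities(parts, {"A", "B"}) == {"A": [0, 2], "B": [1]}
```

Because the replica was already logging those changes, the takeover requires no data transfer between surviving nodes, only a reassignment of shipping duty.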

When a write happens, the system first logs the change; the record itself is read and shipped later. A few optimizations reduce the amount of data read locally and shipped across the network. Changes are read in batches from the log file. First, the system checks whether the same record was updated multiple times within a batch; the record is read exactly once on behalf of all the changes in that batch. Once the record is read, the system compares its current generation with the generation recorded in the log file. If the generation in the log file is less than the generation of the record, the system skips shipping, since a later log entry will cover the newer change. There is a maximum number of times a record can be skipped, because a record that is updated continuously might otherwise never be shipped. These optimizations yield a large benefit when the system has hot keys whose records are updated frequently.
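The batch deduplication and generation-skip logic can be sketched as below. The shape of the log entries, the `current_gen` lookup, and the `MAX_SKIPS` constant are illustrative assumptions, not Aerospike parameters.

```python
# Hedged sketch of the batched read/skip optimization: collapse duplicate
# updates within a batch, skip records whose logged generation is stale,
# and cap how often a hot record may be skipped so it eventually ships.

MAX_SKIPS = 3  # illustrative cap, not an Aerospike default

def records_to_ship(batch, current_gen, skip_counts):
    """batch: list of (key, logged_generation) log entries.
    current_gen(key) -> the record's live generation.
    skip_counts: mutable dict tracking consecutive skips per key.
    Returns the keys to read once and ship for this batch."""
    # Collapse duplicate updates: keep the highest logged generation per key.
    latest = {}
    for key, gen in batch:
        latest[key] = max(gen, latest.get(key, gen))

    to_ship = []
    for key, logged_gen in latest.items():
        # A newer generation exists, so a later log entry covers this
        # change -- skip, unless the record has been skipped too often.
        if logged_gen < current_gen(key) and skip_counts.get(key, 0) < MAX_SKIPS:
            skip_counts[key] = skip_counts.get(key, 0) + 1
        else:
            skip_counts.pop(key, None)
            to_ship.append(key)
    return to_ship
```

A hot key that is rewritten on every batch keeps getting deferred until it hits the skip cap, at which point it is shipped once on behalf of all its accumulated changes.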

The XDR component on each node acts as a client to the remote cluster. It performs all the roles of a regular client: it tracks remote cluster state changes, connects to all the nodes of the remote cluster, maintains connection pools, and so on. This makes shipping a robust distributed process with no single point of failure. All nodes in the source cluster ship data in proportion to their partition ownership, and all nodes in the destination cluster receive data in proportion to their partition ownership. This algorithm allows the source and destination clusters to have different sizes.
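The proportionality argument can be illustrated with a toy routing function. The modulo-based ownership below stands in for the real partition map, which it does not reproduce; the point is only that load on each side tracks partition ownership regardless of cluster size.

```python
# Illustrative sketch (not Aerospike's actual partition map): each source
# node ships the partitions it masters, and each record is routed to the
# destination node that owns its partition, so work on both sides stays
# proportional to ownership even when the clusters differ in size.

def route(partition_id, dest_nodes):
    """Pick the destination node owning this partition (round-robin
    ownership stands in for the real partition map)."""
    return dest_nodes[partition_id % len(dest_nodes)]

# A 3-node source shipping to a 2-node destination still balances load:
src_duties = {"S1": [0, 3], "S2": [1, 4], "S3": [2, 5]}
dest_load = {}
for pids in src_duties.values():
    for pid in pids:
        dest = route(pid, ["D1", "D2"])
        dest_load[dest] = dest_load.get(dest, 0) + 1
# dest_load is {"D1": 3, "D2": 3}: each destination node receives a
# share proportional to the partitions it owns.
```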

Our model ensures that clusters continue to ship new changes as long as at least one node survives in each of the source and destination clusters. It also adjusts easily to node additions in either cluster and is able to utilize the resources of both clusters evenly.

For cross-datacenter shipping, Aerospike uses an asynchronous pipelined scheme. Each node in the source cluster communicates with all the nodes in the destination cluster. Each shipping node keeps a pool of 64 open connections, which it uses in round-robin fashion to ship records. Records are shipped asynchronously: multiple records are written to an open connection before the source waits for responses, so at any given time several records can be in flight on a connection, waiting to be written at the destination. This pipelining is the main reason the system can deliver high throughput over high-latency WAN links. When the remote node writes a shipped record, it sends an acknowledgment back to the shipping node with the return code. An upper limit on the number of in-flight records throttles network utilization.
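The pipelined loop can be sketched as follows. The `Connection` stub, pool size, and in-flight cap are illustrative (the text specifies 64 connections; the throttle value here is made up), and real code would use non-blocking sockets rather than this toy model.

```python
# Minimal sketch of pipelined shipping: round-robin over a connection
# pool, send without waiting for acks, and throttle on in-flight count.

from itertools import cycle

POOL_SIZE = 4          # the real shipper keeps 64 open connections
MAX_IN_FLIGHT = 8      # illustrative throttle, not an Aerospike default

class Connection:
    """Stand-in for one open socket to a destination node."""
    def __init__(self, name):
        self.name = name
        self.sent = []

    def send(self, record):          # async send: returns immediately
        self.sent.append(record)

class Shipper:
    def __init__(self):
        self.conns = [Connection(f"c{i}") for i in range(POOL_SIZE)]
        self.pool = cycle(self.conns)    # round-robin use of the pool
        self.in_flight = 0

    def ship(self, record):
        if self.in_flight >= MAX_IN_FLIGHT:
            return False                 # throttled until acks drain
        next(self.pool).send(record)     # pipelined: no wait for the ack
        self.in_flight += 1
        return True

    def on_ack(self, return_code):
        self.in_flight -= 1              # remote node wrote the record
```

Because sends return immediately and acknowledgments drain the in-flight count asynchronously, the WAN link stays full even when the round-trip latency is high.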

Specifically, as shown in the above figure, grouping data objects by namespace into the same arena optimizes the long-term pattern of object creation, access, modification, and deletion, and minimizes fragmentation.