Hybrid Memory Architecture
Aerospike can use multiple flash storage devices (SSD, PCIe, NVMe) in parallel on one machine to serve reads at sub-millisecond latencies and very high throughput (100K to 1M operations per second), even under a heavy write load. Using SSDs this way enables enormous vertical scale-up at a 5x lower total cost of ownership (TCO) than pure RAM.
Aerospike implements a hybrid memory architecture in which the index is held purely in memory (not persisted), while data is stored only on persistent storage (SSD) and read directly from the device. No disk I/O is needed to access the index, which makes performance predictable. This design is possible because SSD read latency is the same whether access is random or sequential. Because the index is not persisted, this model relies on optimizations that avoid the cost of scanning the device to rebuild indexes.
This ability to do random read I/O comes at the cost of a limited number of write cycles on SSDs. To avoid wearing out one part of the SSD faster than the rest, Aerospike does not perform in-place updates. Instead, it employs a copy-on-write mechanism using large block writes. This wears the SSD down evenly, which in turn improves device durability. Aerospike bypasses the operating system’s file system and uses attached flash devices directly as block devices with a custom data layout.
When a record is updated, the old copy of the record is read from the device and the updated copy is written into a write buffer. This buffer is flushed to storage once it is completely full.
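The copy-on-write flow above can be illustrated with a minimal Python sketch. It is a toy model, not Aerospike's implementation: the class name `ToyStore`, the tiny block size, and the in-memory list standing in for the raw device are all assumptions made for illustration. It also simplifies two things: it flushes the buffer when the next record would not fit (rather than when exactly full), and an update replaces the whole record instead of reading the old copy first.

```python
class ToyStore:
    """Toy copy-on-write store: RAM-only index, append-only block 'device'."""

    def __init__(self, wblock_size=64):
        self.wblock_size = wblock_size  # hypothetical; tiny for demonstration
        self.device = []                # flushed blocks, never modified in place
        self.buffer = bytearray()       # write buffer for the block being filled
        self.index = {}                 # key -> (block_id, offset, length)

    def put(self, key, value: bytes):
        if len(value) > self.wblock_size:
            raise ValueError("record larger than a write block")
        # Simplification: flush when the next record would not fit.
        if len(self.buffer) + len(value) > self.wblock_size:
            self.flush()
        block_id = len(self.device)     # id this buffer gets once flushed
        # Repoint the index at the new copy; the old copy is left untouched.
        self.index[key] = (block_id, len(self.buffer), len(value))
        self.buffer += value

    def flush(self):
        # Write the buffer out as one large, padded block.
        self.device.append(bytes(self.buffer).ljust(self.wblock_size, b"\0"))
        self.buffer = bytearray()

    def get(self, key):
        block_id, offset, length = self.index[key]  # index lookup needs no I/O
        if block_id < len(self.device):
            block = self.device[block_id]           # one read from the 'device'
        else:
            block = self.buffer                     # record not flushed yet
        return bytes(block[offset:offset + length])
```

Updating a key only appends a new copy and moves the index entry; the stale copy stays in its old block until space is reclaimed, which this sketch does not model.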
The unit of read, the RBLOCK, is 128 bytes in size. Addressing storage in RBLOCKs rather than bytes increases the addressable space, so a single storage device of up to 2 TB can be accommodated. Writes are performed in units of a WBLOCK (configurable, usually 1 MB), which optimizes disk life.
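The relationship between these sizes is simple arithmetic, sketched below. The constants come from the text; reading "2 TB" as 2 TiB and assuming a record occupies a whole number of RBLOCKs are assumptions of this sketch, and the helper name `rblocks_for` is hypothetical.

```python
RBLOCK_SIZE = 128                 # bytes, unit of read (from the text)
WBLOCK_SIZE = 1024 * 1024         # bytes, usual write unit (from the text)
MAX_DEVICE_BYTES = 2 * 1024**4    # "2 TB", read here as 2 TiB (assumption)

def rblocks_for(record_bytes: int) -> int:
    """RBLOCKs needed for a record, assuming sizes round up to whole RBLOCKs."""
    return -(-record_bytes // RBLOCK_SIZE)   # ceiling division

# Number of distinct RBLOCK addresses on a maximum-size device.
addressable_rblocks = MAX_DEVICE_BYTES // RBLOCK_SIZE

# Bits needed to name any one of those RBLOCKs.
address_bits = (addressable_rblocks - 1).bit_length()

# How many RBLOCKs fit in one large write.
rblocks_per_wblock = WBLOCK_SIZE // RBLOCK_SIZE
```

Under these assumptions a 2 TiB device holds 2^34 RBLOCKs, so a 34-bit block number is enough to address it, whereas byte-level addressing of the same device would need 41 bits.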
Aerospike operates on multiple storage units of this type by striping the data across multiple devices based on a robust hash function; this allows parallel access to the data while avoiding any hot spots.
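As a rough illustration of hash-based striping, the sketch below maps each record key to one of several devices. The text does not name the hash function, so SHA-1 from Python's standard library is used purely as a stand-in, and the function name `pick_device` and the device paths are hypothetical.

```python
import hashlib

def pick_device(key: bytes, devices: list[str]) -> str:
    """Choose a device for a key from a hash of the key (illustrative only)."""
    digest = hashlib.sha1(key).digest()          # stand-in hash, an assumption
    slot = int.from_bytes(digest[:8], "big") % len(devices)
    return devices[slot]

# Hypothetical device names for demonstration.
devices = ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1", "/dev/nvme3n1"]
target = pick_device(b"user:42", devices)
```

Because the hash output is effectively uniform, keys spread evenly across devices regardless of any pattern in the keys themselves, which is what prevents hot spots and lets reads proceed on all devices in parallel.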
Note that SSDs can store an order of magnitude more data per node than DRAM. The IOPS supported by devices keep increasing; for instance, NVMe drives can now perform 100K IOPS per drive. Many 20-30 node Aerospike clusters use this setup and run millions of operations/second 24×7 with sub-millisecond latency.
All Flash Feature
Aerospike’s All Flash feature extends Aerospike’s Hybrid Memory offering to support a broader set of use cases. These cases involve a very large number (hundreds of billions) of small (under 1,000 bytes) objects, which is common in architectures that store individual behaviors as separate database elements, or that need to segregate data elements for GDPR-style data layouts.
In these cases, Aerospike can now be configured to use storage (often NVMe flash drives) for the index. This can radically reduce cost, although it also increases latency, because the index lives on storage rather than in DRAM as in Aerospike’s traditional Hybrid Memory configuration. Performance remains very high: in testing, we’ve seen 150-byte objects still achieve 99% of requests in under 2 ms on Amazon i3 instances, with much higher performance on more modern hardware.
Uniform Balance
Uniform balance is a feature that subtly changes Aerospike’s data distribution mechanism, with large positive effects for deployments with larger cluster sizes, often reducing both the cluster size and the hardware cost of running Aerospike.
Aerospike’s prior implementation used a form of random allocation of partitions to nodes, which would normally result in slight differences in the amount of data held by each node. That algorithm was chosen because it minimizes the amount of data motion required when nodes are added to or removed from a cluster. However, we also found that, because of the random distribution, some customers needed more nodes than were strictly necessary.
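The effect of random allocation on balance can be seen in a small simulation. This is only an illustration of the statistical point, not Aerospike's actual partition assignment algorithm (old or new): the purely random and round-robin assignments, the fixed seed, and the function names are assumptions of this sketch, as is the partition count of 4096.

```python
import random
from collections import Counter

N_PARTITIONS = 4096   # assumed partition count for this illustration

def random_assignment(n_nodes: int, seed: int = 0) -> Counter:
    """Assign each partition to a node chosen independently at random."""
    rng = random.Random(seed)
    return Counter(rng.randrange(n_nodes) for _ in range(N_PARTITIONS))

def uniform_assignment(n_nodes: int) -> Counter:
    """Assign partitions round-robin, so node loads differ by at most one."""
    return Counter(p % n_nodes for p in range(N_PARTITIONS))

def spread(counts: Counter, n_nodes: int) -> int:
    """Difference between the most and least loaded node."""
    loads = [counts.get(node, 0) for node in range(n_nodes)]
    return max(loads) - min(loads)

nodes = 30
random_spread = spread(random_assignment(nodes), nodes)
uniform_spread = spread(uniform_assignment(nodes), nodes)
```

Since a cluster must be sized for its most heavily loaded node, any spread in the random case translates directly into extra capacity that sits unused on the other nodes. The round-robin scheme here is perfectly even but ignores the data-motion concern the text mentions, so it should not be read as the approach Aerospike takes.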