Author: Russ Sullivan, Principal Architect at Aerospike
Aerospike is a proven, horizontally scalable distributed key-value store with very fast (200K TPS) single-node performance. We recently spent a little time optimizing Aerospike to scale more linearly along two relevant hardware trends: the increasing number of cores per CPU and the increasing prevalence of dual CPU-socket servers. The optimizations (Aerospike performance Beta, coming soon) pushed our 200K TPS per node to 500K TPS on a cheap server ($2K) and 1M TPS on a reasonably priced server ($5K). The best TPS-per-hardware-dollar achieved was 250; in other words, for each $1 you spend on hardware, you get 250 TPS ☺
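The TPS-per-hardware-dollar figure is simple division, and it can be checked for both servers with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the prices are the rounded $2K and $5K figures quoted above, not exact invoices.

```python
def tps_per_dollar(tps, hardware_cost_usd):
    """Transactions per second bought by each dollar of hardware."""
    return tps / hardware_cost_usd


# Figures quoted in the post; prices are the rounded "$2K" and "$5K" values.
servers = {
    "cheap server": (500_000, 2_000),
    "reasonably priced server": (1_000_000, 5_000),
}

for name, (tps, cost) in servers.items():
    print(f"{name}: {tps_per_dollar(tps, cost):.0f} TPS per dollar")
```

By this arithmetic the cheaper server is the one that hits 250 TPS per dollar, while the 1M TPS server lands at 200.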
Using many commodity machines as a cluster has been proven to be the lowest TCO solution to BigData problems. Recent hardware trends are redefining commodity hardware to include multiple PhysicalCPUs, each of which has multiple CPU-cores. Clustering commodity machines is evolving from a predominantly horizontal scaling problem into one that also includes the issues of vertical scaling. Vertical scaling is the lowest TCO solution to traffic bursts (TPS scaling), whereas horizontal scaling is the lowest TCO solution to SizeOfData scaling. A system that scales efficiently both vertically and horizontally allows a system architect to spec out hardware requirements predictably, accurately, and conservatively (i.e. with over-capacity for bursts), and even to customize for expected use-cases in a cost-effective manner.
Two classic bottlenecks encountered when trying to reach 1M database-transactions over-the-wire per second are network-I/O-soft-interrupt overhead and unnecessary context switches. Widening these two bottlenecks in multi-core and multi-PhysicalCPU architectures is a complicated enough business that the correct solution requires an architecture grounded in a philosophy. Our belief is that isolation is the key to fast multi-core, multi-PhysicalCPU performance in shared nothing architectures, and our interpretation of how to apply this philosophy is outlined below.
Aerospike’s shared nothing architecture has the inherent benefit that it can be extended to scale linearly not just horizontally but also vertically, provided the software is able to divide and isolate a single physical machine’s resources. The two axes of vertical scaling that we optimized (in this iteration) in order to beat our personal best TPS by 5X were:
1) Number of cores per chip
2) Number of PhysicalCPUs
Aerospike’s approach to scaling high TPS workloads across Physical CPUs is to marry a Physical CPU to a physical NIC port. For instance: a quad cpu-socket machine (w/ 4 Physical CPUs) should utilize a quad port NIC, and each physical NIC port should communicate exclusively w/ one PhysicalCPU. Using IRQ affinity to direct each physical NIC port's interrupts to the CPU cores of a single Physical CPU, and using ip-rules to ensure the reverse, effectively isolates the PhysicalCPU <-> NIC communication and ensures near-linear scalability of Network I/O as the number of PhysicalCPUs increases.
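To make the IRQ-affinity half of this concrete, here is a minimal Python sketch of how the pairing could be computed on Linux, where an interrupt's allowed cores are set by writing a hex bitmask to /proc/irq/<irq>/smp_affinity. The core numbering, the IRQ numbers, and all function names are hypothetical (real values come from the machine's topology and /proc/interrupts), and this is not Aerospike's own tooling. The ip-rules side is not covered here.

```python
def smp_affinity_mask(cores):
    """Format core ids as the hex bitmask Linux expects in smp_affinity.

    Masks wider than 32 bits are written as comma-separated 32-bit groups,
    most significant group first.
    """
    mask = 0
    for core in cores:
        mask |= 1 << core
    groups = []
    while True:
        groups.append(f"{mask & 0xFFFFFFFF:08x}")
        mask >>= 32
        if mask == 0:
            break
    return ",".join(reversed(groups))


def plan_irq_affinity(port_irqs, socket_cores):
    """Pair NIC port i with PhysicalCPU i and return {irq: mask string}."""
    plan = {}
    for port, irqs in enumerate(port_irqs):
        mask = smp_affinity_mask(socket_cores[port])
        for irq in irqs:
            plan[irq] = mask
    return plan


def apply_plan(plan, proc_root="/proc/irq"):
    """Write the masks out. Needs root on Linux; not called in this sketch."""
    for irq, mask in plan.items():
        with open(f"{proc_root}/{irq}/smp_affinity", "w") as f:
            f.write(mask)


# Hypothetical dual-socket machine: cores 0-3 on socket 0, cores 4-7 on socket 1,
# and made-up IRQ numbers for the queues of a dual-port NIC.
socket_cores = [[0, 1, 2, 3], [4, 5, 6, 7]]
port_irqs = [[40, 41], [50, 51]]

plan = plan_irq_affinity(port_irqs, socket_cores)
for irq, mask in sorted(plan.items()):
    print(f"IRQ {irq} -> smp_affinity {mask}")
```

Keeping the planning step separate from the write step means the mapping can be inspected before anything touches /proc.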
Another benefit of isolating PhysicalCPU/NIC pairs is that it automatically adds NUMA-awareness to Aerospike’s data-storage layer, as the threads pinned to a given PhysicalCPU allocate memory exclusively from their local NUMA pool. NUMA-awareness increases throughput and run-time predictability, because memory accesses always go to the fastest (local) memory pool.
Aerospike’s approach to scaling efficiently as the number of cores per PhysicalCPU increases is largely the result of exhaustive testing of the possible thread-to-core configuration-space. This means the solution for a QuadCore differs from that of a Hexacore, Octacore, etc. The existence of an extremely fast communication path between CPU-cores adds another optimization/complexity axis, but the philosophy of isolating different hardware/OS/software resources once again yielded optimal performance.
One of the most interesting findings was that on CPUs w/ greater than 4 cores, optimal performance was attained by leaving one (or more) cores entirely free to process and then distribute hardware interrupts from the NIC.
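As a rough illustration of that finding, the sketch below splits one PhysicalCPU's cores into an interrupt-handling set and a worker set. The function name, the choice to reserve the lowest-numbered cores, and the default of a single reserved core are all assumptions made for the example; the post only says that one or more cores were left free on CPUs w/ more than 4 cores.

```python
def split_cores(cores, reserved_for_irq=1):
    """Return (irq_cores, worker_cores) for one PhysicalCPU.

    Assumption for this sketch: CPUs with more than 4 cores give up their
    lowest-numbered core(s) to interrupt handling, while CPUs with 4 or
    fewer cores reserve nothing.
    """
    cores = sorted(cores)
    if len(cores) <= 4:
        return [], cores
    return cores[:reserved_for_irq], cores[reserved_for_irq:]


# Hypothetical hexacore: core 0 takes NIC interrupts, cores 1-5 run workers.
irq_cores, worker_cores = split_cores(range(6))
print("interrupt cores:", irq_cores)
print("worker cores:", worker_cores)
```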
On QuadCores, being able to dynamically pin (& un-pin) groups of threads to groups of CPU-cores at run-time, and to target NIC queues’ IRQ affinities directly at the CPU-cores running network-facing threads, provided a very predictable/stable TPS-turbo-boost-mode for our software w/ very little code change.
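Run-time pinning and un-pinning of a group of threads can be sketched in Python on Linux with os.sched_setaffinity, which accepts a native thread id and a set of cores. This only illustrates the mechanism, not Aerospike's implementation: the helper names are hypothetical, the worker threads do nothing but wait, and the code assumes Linux and Python 3.8 or later.

```python
import os
import threading


def pin_threads(native_ids, cores):
    """Restrict each thread (by native thread id) to the given set of cores."""
    for tid in native_ids:
        os.sched_setaffinity(tid, cores)


def demo():
    """Pin two idle worker threads to one core, then un-pin them again."""
    allowed = os.sched_getaffinity(0)   # cores this process may currently use
    one_core = {min(allowed)}           # arbitrary choice for the sketch
    stop = threading.Event()
    workers = [threading.Thread(target=stop.wait) for _ in range(2)]
    for w in workers:
        w.start()
    tids = [w.native_id for w in workers]
    try:
        pin_threads(tids, one_core)
        pinned = [os.sched_getaffinity(tid) for tid in tids]
        pin_threads(tids, allowed)      # un-pin: restore the full core set
        unpinned = [os.sched_getaffinity(tid) for tid in tids]
    finally:
        stop.set()
        for w in workers:
            w.join()
    return pinned, unpinned


if hasattr(os, "sched_setaffinity"):    # Linux only
    pinned, unpinned = demo()
    print("pinned to:", pinned)
    print("after un-pinning:", unpinned)
```

Un-pinning here is just pinning back to the full allowed set, which is why toggling the mode at run-time needs very little code.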
For anyone interested in the rest of the hard-core tech details, I wrote a highscalability.com blog post describing in detail the architectural principles necessary to attain Aerospike’s raw speed and the actual specs and results are here.
Aerospike was built from the ground up to scale linearly on the horizontal axis, and with each node capable of processing up to 1M TPS, the aggregate cluster top speed is ready for VERY Big & Fast Data. Most people need only a fraction of this speed, but this top speed has the subtle benefit of increased stability & predictability at lower speeds. At Aerospike, we want to go fast and are all about speed, Hot Nasty Bad Ass Speed.