Amazon EC2 Capacity Planning

AWS offers a large number of instance types, but some are better suited to running Aerospike than others. The table below lists the relevant instance families and the configurations they lend themselves to.

| Instance type | Use case |
| --- | --- |
| m5(d) | Low cost |
| r5(d) | In-memory clusters |
| c5(d) | Low latency / high throughput |
| i3 | High data capacity |
| i3en | Data capacity and throughput |

Do not use burst instances: noisy neighbors can consume CPU or bandwidth, causing latency spikes.
Instances with a shared SSD controller: instances with less than a full-size local SSD share an SSD controller.
i3: must be over-provisioned by 20%.
m5d, r5d, c5d: these have multiple disk drives; test your pool to find the lowest common denominator.
OS: Amazon Linux 2023 with Aerospike Database 6.4 or later is recommended for compatibility and performance.
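
To make the 20% over-provisioning figure concrete, here is a minimal sketch of the arithmetic. It assumes that over-provisioning means leaving 20% of each drive unused by Aerospike. The function name `usable_gb` and the 950 GB drive size are illustrative values, not taken from this guide.

```shell
# Usable capacity after reserving a percentage of the raw drive for over-provisioning.
# Arguments: raw drive size in GB, percentage to leave unused (assumed interpretation).
usable_gb() {
    raw_gb=$1
    reserve_pct=$2
    echo $(( raw_gb * (100 - reserve_pct) / 100 ))
}

usable_gb 950 20    # prints: 760
```

Plan namespace capacity against the reduced figure rather than the raw drive size.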

Aerospike as In-Memory with no Persistence

In-memory storage without storage-backed persistence is ideal for cache-based use cases. See Configure namespace storage.

Aerospike Network Planning

Each network interface on an Amazon Linux HVM instance can handle about 250K packets per second. If you need higher performance per instance, do one of the following:

Add more NICs/ENIs

Elastic Network Interfaces (ENIs) provide a way to add multiple (virtual) NICs to an instance. A single NIC peaks at around 250K TPS, bottlenecked by the cores that process interrupts. Adding more interfaces lets the same instance process more packets per second. ENIs with private IPs incur no additional cost in AWS.

You can specify separate network interfaces for service, info, and fabric traffic. This helps alleviate both packets-per-second and bandwidth limits on individual ENIs, but adds to the complexity of your Aerospike cluster.
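
As a rough illustration of splitting traffic across interfaces, the sketch below prints an `aerospike.conf` network fragment that binds each traffic type to a different address. The helper name `print_network_stanza` and the three private IPs are hypothetical, the ports shown are the commonly used defaults, and the heartbeat stanza is left out, so treat this as a starting point to check against the Aerospike configuration reference rather than a complete configuration.

```shell
# Print an illustrative network fragment with one address per traffic type.
# Arguments (all hypothetical): private IPs of the ENIs for service, fabric, and info.
print_network_stanza() {
    service_ip=$1
    fabric_ip=$2
    info_ip=$3
    cat <<EOF
network {
    service {
        address $service_ip
        port 3000
    }
    fabric {
        address $fabric_ip
        port 3001
    }
    info {
        address $info_ip
        port 3003
    }
    # heartbeat stanza omitted from this fragment
}
EOF
}

print_network_stanza 10.0.1.10 10.0.2.10 10.0.3.10
```

Generating the fragment from a function keeps the per-node addresses in one place, which helps contain the extra complexity mentioned above.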

Receive Packet Steering

RPS is only available in kernel version 2.6.35 and above.

A simpler alternative is to distribute interrupt processing across multiple cores using Receive Packet Steering (RPS):
```
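# Hex mask f (binary 1111) lets CPUs 0-3 process packets received on eth0 queue rx-0 (run as root).
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus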
```
With Aerospike, this eliminates the need for multiple NICs or ENIs, which simplifies management while delivering similar TPS. A single NIC with RPS enabled can achieve up to 800K TPS with interrupts spread over 4 cores. Ensure your instance type is sized appropriately for this.
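
The value written to `rps_cpus` is a hexadecimal CPU bitmask, which is why `f` corresponds to 4 cores. The sketch below derives the mask for the first N cores. The function name `rps_mask` is hypothetical, and the simple form shown is only intended for up to 32 cores, since larger masks use comma-separated groups.

```shell
# Build the hexadecimal bitmask that selects CPUs 0..N-1.
# Example: 4 cores -> binary 1111 -> hex f.
rps_mask() {
    cores=$1
    printf '%x\n' $(( (1 << cores) - 1 ))
}

rps_mask 4    # prints: f

# To apply the mask (requires root; eth0 and rx-0 as in the command shown earlier):
# rps_mask 4 > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

The write to sysfs is left commented out so the snippet can be run safely on any machine.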

Aerospike as a Fast Persistent Data Store

The storage engine suited for this use case is the SSD Storage Engine.

Amazon EC2 provides storage in the form of Elastic Block Store (EBS) volumes, which are network-attached to instances.

EBS performance is set by choosing either Provisioned IOPS or General Purpose volumes. Provisioned IOPS (io1) volumes deliver consistent IOPS but are costly. General Purpose (gp2) volumes have variable performance based on size. For information about the relationship between volume size and IOPS for gp2 volumes, see Amazon EBS volume types.
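
To illustrate how gp2 performance depends on volume size, the sketch below computes the baseline IOPS using the figures AWS has published for gp2: 3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000. The function name `gp2_baseline_iops` is hypothetical, burst behavior is ignored, and the numbers should be confirmed against the current Amazon EBS volume types page.

```shell
# Baseline (non-burst) IOPS for a gp2 volume of the given size in GiB.
# Assumes 3 IOPS per GiB, clamped to the range 100..16000.
gp2_baseline_iops() {
    size_gib=$1
    iops=$(( size_gib * 3 ))
    if [ "$iops" -lt 100 ]; then iops=100; fi
    if [ "$iops" -gt 16000 ]; then iops=16000; fi
    echo "$iops"
}

gp2_baseline_iops 500    # prints: 1500
```

Under these assumptions a small gp2 volume is limited by IOPS long before it runs out of space, which is worth checking when sizing volumes for a write-heavy namespace.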

High Availability using Availability Zones

Amazon EC2 is hosted in multiple locations worldwide. These locations are composed of regions and Availability Zones. Each region is a separate geographic area containing multiple isolated locations known as Availability Zones. Amazon EC2 lets you place resources, such as instances and data, in multiple locations. Resources are not replicated across regions unless you explicitly replicate them. Amazon operates state-of-the-art, highly available datacenters. Although rare, failures can occur that affect the availability of instances in the same location, so spreading cluster nodes across Availability Zones limits the impact of a single-location failure.

Ephemeral SSD Based Cache Backed by EBS Persistence

Benefits:

The RAM requirement is the same as for the EBS-only persistence model.
Provides the persistence of EBS while avoiding the EBS performance bottleneck by using ephemeral SSDs as a caching layer.
Combines performance and persistence by using ephemeral SSDs as an alternative to RAM, with EBS as the persistent store.

Drawbacks:

More operational overhead than any other storage model.
Requires instances that provide the needed number and capacity of ephemeral SSD instance store volumes.

Autoscaling

There are no sensible default thresholds or step sizes; base them on your workload characteristics and on non-standard CloudWatch metrics.
Autoscaling a cluster down can lose data! Make sure you know your lower bound and that it is properly set in your Auto Scaling group (ASG) logic.
Autoscaling a cluster up and back down for daily cycles can cost more in cross-AZ network charges (migrations) than it saves in compute!
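
One way to estimate a lower bound for scaling down is by capacity: the cluster must still hold every replica of the data. The sketch below is a simplified calculation under that assumption only. The function name `min_nodes` and the example figures are hypothetical, and the result leaves no headroom for node failures or in-flight migrations, so a real minimum should be set higher.

```shell
# Smallest node count that can hold all replicas of the data.
# Arguments: unique data in GB, replication factor, usable capacity per node in GB.
min_nodes() {
    data_gb=$1
    rf=$2
    usable_per_node_gb=$3
    # Ceiling division: total replicated data divided by per-node capacity.
    needed=$(( (data_gb * rf + usable_per_node_gb - 1) / usable_per_node_gb ))
    # A cluster needs at least as many nodes as the replication factor.
    if [ "$needed" -lt "$rf" ]; then needed=$rf; fi
    echo "$needed"
}

min_nodes 1000 2 760    # prints: 3
```

A value derived this way can serve as a starting point for the minimum size of the Auto Scaling group.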