NoSQL flexibility & RDBMS reliability

Flexibility, Reliability & Operational efficiency

Aerospike is a fast key-value store (a distributed hash table) architected as a flexible NoSQL platform for today’s high-scale applications. It is designed to meet the reliability (ACID) requirements of traditional databases: there is no single point of failure (SPOF), and data is never lost. Aerospike can be used as an in-memory database and is uniquely optimized to take advantage of the dramatic cost benefits of flash storage. Written in C, Aerospike runs on Linux.

Based on our own experience developing mission-critical applications with high-scale databases and our interactions with customers, we’ve developed a general philosophy of operational efficiency that guides product development. Three principles drive the Aerospike architecture: NoSQL flexibility, traditional database reliability, and operational efficiency.

First published in the Proceedings of VLDB (Very Large Databases) in 2010, the Aerospike architecture consists of 3 layers:

Aerospike database architecture overview
  1. The cluster-aware Client Layer includes open source client libraries that implement Aerospike APIs, track nodes and know where data reside in the cluster.
  2. The self-managing Distribution Layer oversees cluster communications and automates fail-over, replication, cross data center synchronization and intelligent re-balancing and data migration.
  3. The flash-optimized Data Storage Layer reliably stores data in DRAM and Flash.

Aerospike cluster-aware Client Layer

Smart Client

The Aerospike “smart client” is designed for speed. It is implemented as an open source linkable library available in C, C#, Java, Ruby, PHP and Python, and developers are free to contribute new clients or modify them as needed. The Client Layer has the following functions:
  • Implements the Aerospike API and the client-server protocol, and talks directly to the cluster.
  • Tracks nodes and knows where data is stored, instantly learning of changes to cluster configuration or when nodes go up or down.
  • Implements its own TCP/IP connection pool for efficiency. It also detects transaction failures that have not risen to the level of node failures in the cluster and re-routes those transactions to nodes with copies of the data.
  • Transparently sends requests directly to the node with the data and retries or re-routes requests as needed, for example during cluster reconfigurations.

This architecture reduces transaction latency, offloads work from the cluster and eliminates work for the developer. It also ensures that applications do not have to be restarted when nodes are brought up or down. Finally, it eliminates the need to set up and manage additional cluster management servers or proxies.
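
The routing behaviour described above can be illustrated with a minimal sketch. It is not the Aerospike client API: `SmartClient`, `NodeDown`, the `send` transport callback and the shape of the partition map are all hypothetical names chosen for illustration, and the hashing scheme is an assumption. The sketch only shows the idea of sending a request straight to the node that owns the data and falling back to a node holding a copy when that fails.

```python
import hashlib


class NodeDown(Exception):
    """Raised by the hypothetical transport when a node cannot serve a request."""


class SmartClient:
    """Illustrative cluster-aware client (not the real Aerospike API)."""

    def __init__(self, partition_map, send):
        # partition_map: partition id -> list of node names, owner first,
        # followed by the nodes holding copies (assumed layout).
        self.partition_map = partition_map
        # send(node, key) is a stand-in for the client-server protocol.
        self.send = send

    def partition_of(self, key):
        # Hash the key and reduce it to a partition id (assumed scheme).
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.partition_map)

    def get(self, key):
        last_error = None
        # Try the owner first, then re-route to nodes with copies of the data.
        for node in self.partition_map[self.partition_of(key)]:
            try:
                return self.send(node, key)
            except NodeDown as err:
                last_error = err
        raise RuntimeError(f"no node could serve {key!r}") from last_error


def fake_send(node, key):
    # Simulated transport in which node-a is unreachable.
    if node == "node-a":
        raise NodeDown(node)
    return f"{key}@{node}"


pmap = {p: ["node-a", "node-b"] for p in range(8)}
client = SmartClient(pmap, fake_send)
result = client.get("user:42")
```

Here the application calls `get` once; the re-route from `node-a` to `node-b` happens inside the client, so no proxy or application-level retry logic is needed.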

Aerospike self-managing Distribution Layer

Shared-Nothing Architecture

The Aerospike “shared nothing” architecture is designed to scale and never fail. This layer scales linearly, implements many of the ACID guarantees and reliably stores terabytes of data with automatic fail-over, replication and cross data center synchronization. The Distribution layer is also designed to eliminate manual operations with the systematic automation of all cluster management functions. It includes 3 modules:
  • The Cluster Management Module tracks nodes in the cluster. The key algorithm is a Paxos-like consensus voting process which determines which nodes are considered part of the cluster. Once membership in the cluster has been agreed upon, each node uses a distributed hash algorithm to divide the primary index space into data ‘slices’ and assign read and write masters and replicas to each of the slices. Because the division is purely algorithmic, the system scales without a master and eliminates the need for additional configuration that is required in a sharded environment.
  • When a node is added or removed, the Data Migration Module intelligently balances the distribution of data across nodes in the cluster, and ensures that each piece of data is duplicated across nodes and across data centers, as specified by the system’s configured replication factor.
  • The Transaction Processing Module reads and writes data as requested and provides many of the consistency and isolation guarantees. For writes with immediate consistency, it propagates changes to all replicas before committing the data and returning the result to the client. In rare cases during cluster reconfigurations, when the Client Layer may be briefly out of date, it transparently proxies the request to another node. Finally, when a cluster is recovering from being partitioned, it resolves any conflicts that may have occurred between different copies of data. Resolution can be configured to be automatic, in which case the data with the latest timestamp is canonical, or both copies of the data can be returned to the application for resolution at that higher level.
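
To make the idea of a purely algorithmic division concrete, here is a minimal sketch of how every node could compute the same slice assignment from nothing more than the agreed membership list. The text does not specify the hash algorithm Aerospike uses, so this sketch substitutes rendezvous (highest-random-weight) hashing as a stand-in; `assign_slices`, `_score` and the slice count are hypothetical.

```python
import hashlib


def _score(slice_id, node):
    # Deterministic pseudo-random weight for a (slice, node) pair.
    data = f"{slice_id}:{node}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def assign_slices(nodes, n_slices, replication_factor):
    """Return slice id -> [master, replica, ...] for the given membership.

    Every node that runs this with the same membership list gets the same
    table, so no master node or stored shard configuration is required.
    """
    copies = min(replication_factor, len(nodes))
    table = {}
    for slice_id in range(n_slices):
        # Rank nodes by weight; the highest is the master, the rest replicas.
        ranked = sorted(nodes, key=lambda n: _score(slice_id, n), reverse=True)
        table[slice_id] = ranked[:copies]
    return table


nodes = ["node-a", "node-b", "node-c"]
before = assign_slices(nodes, n_slices=16, replication_factor=2)
# After node-c leaves, the remaining members recompute the table on their own.
after = assign_slices(["node-a", "node-b"], n_slices=16, replication_factor=2)
```

With this particular stand-in, slices that did not involve the departed node keep their master and replica, so only the affected slices need data migration.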

Aerospike flash-optimized Data Layer

Schema-less for maximum flexibility

This layer implements the schema-less Aerospike data model. Data is organized into policy containers called ‘namespaces’, semantically similar to ‘databases’ in an RDBMS. Namespaces are configured when the cluster is started, and are used to control retention and reliability requirements for a given set of data. One of the most important system configuration policies is the replication factor, which controls the number of copies of every piece of data stored.

Within a namespace, data is subdivided into ‘sets’ (similar to ‘tables’) and ‘records’ (similar to ‘rows’). Each record has an indexed ‘key’ that is unique in the set, and one or more named ‘bins’ (similar to columns) that hold values associated with the record.

Sets and bins do not need to be defined up front, but can be added during run-time for maximum flexibility. Values in bins are strongly typed, and can include strings, integers, and binary data, as well as language-specific binary blobs that are automatically serialized and de-serialized by the system. Bins themselves are not typed, so different records could have the same bin with values of different types.
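
The namespace, set, record and bin hierarchy can be modelled in a few lines. This is a toy in-memory model for illustration only, not the Aerospike client API: the `Namespace` class and its `put` and `get` methods are hypothetical, and only strings, integers and binary data are accepted, leaving out the automatically serialized language-specific blobs.

```python
# Value types accepted by this toy model (a simplification of the real list).
ALLOWED_TYPES = (str, int, bytes)


class Namespace:
    """Toy policy container: holds sets, which hold records, which hold bins."""

    def __init__(self, name, replication_factor):
        self.name = name
        self.replication_factor = replication_factor
        self.sets = {}  # set name -> {record key: {bin name: value}}

    def put(self, set_name, key, bins):
        # Values are typed, so reject anything outside the supported types.
        for bin_name, value in bins.items():
            if not isinstance(value, ALLOWED_TYPES):
                raise TypeError(f"unsupported type for bin {bin_name!r}")
        # Sets and bins are created on first use; nothing is declared up front.
        records = self.sets.setdefault(set_name, {})
        records.setdefault(key, {}).update(bins)

    def get(self, set_name, key):
        return self.sets[set_name][key]


ns = Namespace("test", replication_factor=2)
ns.put("users", "u1", {"name": "Ada", "age": 36})
# Same bin name, different value type, plus a bin that u1 does not have.
ns.put("users", "u2", {"name": "Bob", "age": "unknown", "avatar": b"\x89PNG"})
```

The two records share the `age` bin, yet one holds an integer and the other a string, which is what untyped bins with typed values means in practice.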

Aerospike Namespace

The Data Layer was designed specifically for speed and a dramatic reduction in hardware costs. It can operate entirely in memory, eliminating the need for a caching layer, or it can take advantage of unique optimizations for flash storage. In either case, data is never lost.

Indexes (primary keys) are stored in DRAM for ultra-fast access and values can be stored either in DRAM or more cost-effectively on SSDs. Each namespace can be configured separately, so small namespaces can take advantage of DRAM and larger ones gain the cost benefits of SSDs.

  • 100 million keys take up only 6.4 GB. Although keys have no size limitations, each key is efficiently stored in just 64 bytes (100,000,000 × 64 bytes = 6.4 GB).
  • Native, multi-threaded, multi-core flash I/O and an Aerospike log-structured file system take advantage of low-level SSD read and write patterns. In addition, writes to disk are performed in large blocks to minimize latency. This mechanism bypasses the standard file system, which has historically been tuned to rotational disks.
  • Also built in are a smart Defragmenter and an intelligent Evictor. These processes work together to ensure that there is space in DRAM, that data is safely written to disk, and that data is never lost. The Defragmenter tracks the number of active records in each block and reclaims blocks that fall below a minimum level of use. The Evictor removes expired records and reclaims memory if the system gets beyond a set high water mark. Expiration times are configured per namespace, and the age of a record is calculated from the last time it was modified. The application can override the default lifetime at any time and can specify that a record should never be evicted.
  • For greater throughput, data in each namespace is spread evenly across every node in the cluster, and duplicated as specified by the namespace’s configured Replication Factor. For optimal efficiency, the location of any piece of data in the system can be determined algorithmically, without the need for a stored lookup table.
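
The Evictor and Defragmenter rules in the list above can be sketched as two small functions. This is a simplified model under stated assumptions, not Aerospike's implementation: `Record`, `run_evictor` and `blocks_to_reclaim` are hypothetical names, capacity is counted in records rather than bytes, and evicting the records closest to expiry first is an assumed policy that the text does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    key: str
    modified_at: int    # time of last modification, in seconds
    ttl: Optional[int]  # lifetime in seconds; None means never evict


def is_expired(record, now):
    # Age is measured from the last modification time.
    return record.ttl is not None and now - record.modified_at >= record.ttl


def run_evictor(records, now, capacity, high_water_mark):
    """Drop expired records, then evict more if usage is above the mark."""
    live = [r for r in records if not is_expired(r, now)]
    limit = int(capacity * high_water_mark)
    excess = len(live) - limit
    if excess > 0:
        # Assumed policy: evict records closest to expiry first, and never
        # touch records marked as never-evict.
        evictable = sorted(
            (r for r in live if r.ttl is not None),
            key=lambda r: r.modified_at + r.ttl,
        )
        doomed = {r.key for r in evictable[:excess]}
        live = [r for r in live if r.key not in doomed]
    return live


def blocks_to_reclaim(active_counts, records_per_block, min_use):
    """Return ids of blocks whose share of active records is below min_use."""
    return [
        block_id
        for block_id, active in active_counts.items()
        if active / records_per_block < min_use
    ]


records = [
    Record("a", modified_at=0, ttl=10),    # already expired at now=100
    Record("b", modified_at=90, ttl=50),   # expires at 140
    Record("c", modified_at=95, ttl=20),   # expires at 115
    Record("d", modified_at=0, ttl=None),  # never evicted
]
kept = run_evictor(records, now=100, capacity=4, high_water_mark=0.5)
sparse = blocks_to_reclaim({0: 10, 1: 90, 2: 49}, records_per_block=100, min_use=0.5)
```

In this run, record `a` is removed because it has expired, `c` is evicted to get back under the high water mark, and blocks 0 and 2 are flagged for defragmentation because fewer than half of their slots hold active records.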

Learn more about Aerospike’s ACID compliance, scalability and cross data center replication capabilities.