This blog post shows how to implement transactional semantics in your Aerospike application. Because Aerospike transactions are scoped to a single record, and because certain cluster transitions can leave a transaction's outcome temporarily unknown, some situations must be handled differently than in the traditional databases you may be familiar with. Specifically, the post discusses:

  • How to handle common read-write transactions, and
  • How to resolve “in-doubt” transactions because of partition changes due to network or node failures.

Different Data, One Database

As a developer, you need to store and process different types of data across the correctness and availability spectrum. For a class of data, correctness is critical. For example, financial applications must preserve records accurately, logins must use the latest password, and sharing on a social network must stop immediately after unfriending a connection. Database transactions provide ACID guarantees for such data:

  • Atomicity: Either succeeds or fails as a whole.
  • Consistency: Leaves the database in a consistent state.
  • Isolation: Can be ordered serially with respect to other transactions.
  • Durability: Preserves data across power or node failures.

In a distributed and replicated database, a transactional update should also guarantee consistency of values across all replicas.

For another class of data you may decide to prioritize availability over consistency. For example, product recommendations in e-commerce applications, ad selection in an ad serving platform, and fraud risk scoring in payment applications may value availability with reasonable accuracy over absolute recency and correctness of underlying data.

Enterprises spend enormous effort designing, developing, and operating multiple databases targeted at such diverse data needs. Fortunately, Aerospike handles diverse data access across the availability and consistency spectrum with predictable millisecond performance, petabyte scale, and high resilience. Aerospike also obviates the need for a front-end cache, which is typically required to meet the latency and throughput requirements of modern applications and complicates design and deployment even further.

Strongly Consistent (SC) and Available (AP) Modes

Aerospike is a distributed key-value database (with support for the document-oriented data model) that automatically partitions (shards) data across a cluster of nodes with the desired replication factor. Data is stored in Aerospike as records (rows) that are accessed using a unique key. A record can contain one or more bins (columns) of different data types.

Per the famous CAP theorem, one must choose between consistency and availability if the database cluster is tolerant of network partitions (continues to function). Aerospike provides both consistency and availability modes by allowing namespaces (databases) to be configured in either SC (Strongly Consistent, synonymous with CP in CAP) or AP (Available during Partition) mode. It should be noted that C in the CAP theorem refers to maintaining consistency among replicas, whereas C in ACID refers to enforcing integrity constraints.

All single record requests in Aerospike are atomic. The AP mode has the “eventual consistency” guarantee that a typical NoSQL database provides. However, in AP mode:

  • a record across all replicas may not be consistent at all times,
  • reads may sometimes access stale data, and
  • occasionally writes may be lost.

While this sounds undesirable, there is a CAP trade-off to be made. Because data loss and stale reads can be very rare, many applications may justify the trade-off of higher availability over consistency and pick the AP mode.

However, many applications demand absolute correctness. In such cases, to ensure that a read gets only the latest committed value (no stale or dirty/uncommitted reads) and that no committed data is lost (durable commits), Aerospike offers the SC mode, which guarantees strong consistency. In this mode, a “linearizable” (explained below) read is always consistent across replicas. Aerospike’s strong consistency support has been independently confirmed through Jepsen testing.

The rest of this post assumes the SC mode, which is essential for transactional guarantees.

Transactions in Aerospike

In Aerospike, all single record operations have transactional guarantees. Every single record request, including those containing multiple operations on one or more “bins” (columns), executes atomically under a record lock with isolation and durability, and ensures that all replicas are consistent.

Aerospike transactions do not span multiple record boundaries. An application can eliminate or minimize the need for multi-record transactions by taking advantage of Aerospike’s rich data model, which includes nested List and Map types, and its multi-operation single record transactions. With appropriate data modeling, multi-record transactions requiring joins and constraints can be avoided. For instance, data spread over multiple tables in a relational database may be modeled as a single denormalized record. Transaction control operations such as Begin, Commit, and Abort are therefore not part of the Aerospike API.
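
For instance, a customer's order history can live in a nested list bin on the customer record rather than in a separate orders table, so adding an order remains a single-record operation. Below is a minimal sketch using the Aerospike Java client; the namespace, set, and bin names are placeholders and error handling is omitted.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.ListOperation;

import java.util.Map;

public class DenormalizedModel {
    // Keep a customer's order history in a nested "orders" list bin on the customer
    // record, so adding an order is a single-record (and therefore atomic) operation.
    static void addOrder(AerospikeClient client, String customerId,
                         Map<String, Object> order) {
        Key key = new Key("test", "customers", customerId);  // placeholder namespace/set
        client.operate(null, key, ListOperation.append("orders", Value.get(order)));
    }
}
```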

Implementing Read-Write Transactions

Aerospike allows multiple read/write operations on the same key in a single transaction. However, a write in Aerospike is a simple operation (e.g., set, add, append) and cannot be an arbitrary function of one or more bins within the record. In other words, complex update logic cannot be sent to the server for execution within a transaction. This design keeps common read operations simple and predictably fast, deferring to the application any added complexity of less frequent read-write transactions.

Aerospike has two options for general read-write transactions:

  1. Using Record Read-Modify-Write (or Check-and-Set) pattern on the client side within the application.
  2. Implementing the update logic in a User Defined Function (UDF) that resides on the server. The UDF, akin to a stored procedure in databases, must be installed by an administrator and can then be invoked through the API.

UDFs, which are written in Lua and reside on the server, are slower to execute, develop, test, and change. As such, they are not best suited for most read-write transactions in applications. While they are the solution of choice in many situations, this post will focus on the first alternative, the R-M-W approach.

The R-M-W approach makes use of the record’s “generation” metadata which essentially is a counter that is incremented on every write operation on the record. The pattern is as follows:

  1. Reading the record.
  2. Modifying the record in the application.
  3. Setting the write policy to GEN_EQUAL, requiring that the write succeed only if the record generation still matches the generation observed at the time of the read.
  4. Writing the modified data.

R-M-W pattern for a read-write transaction
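
A minimal sketch of the pattern with the Aerospike Java client follows. The bin name, the credit-by-100 update logic, and the retry budget are placeholders, and exact policy field names may differ slightly across client versions.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.ResultCode;
import com.aerospike.client.policy.GenerationPolicy;
import com.aerospike.client.policy.WritePolicy;

public class RmwExample {
    // Read-modify-write on a single record, retried on a generation conflict.
    static boolean creditBalance(AerospikeClient client, Key key, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // 1. Read the record and remember its generation.
            Record rec = client.get(null, key);
            if (rec == null) {
                return false;  // record does not exist; handle as appropriate
            }

            // 2. Modify the value in the application (hypothetical update logic).
            long newBalance = rec.getLong("balance") + 100;

            // 3. Make the write conditional on the generation read in step 1.
            WritePolicy wp = new WritePolicy();
            wp.generationPolicy = GenerationPolicy.EXPECT_GEN_EQUAL;
            wp.generation = rec.generation;
            wp.maxRetries = 0;  // never auto-retry writes (see the notes later in the post)

            // 4. Write the modified data.
            try {
                client.put(wp, key, new Bin("balance", newBalance));
                return true;
            } catch (AerospikeException ae) {
                if (ae.getResultCode() == ResultCode.GENERATION_ERROR) {
                    continue;  // record changed in between; retry the whole R-M-W
                }
                throw ae;  // other failures (including in-doubt) are handled later in the post
            }
        }
        return false;  // gave up after repeated conflicts
    }
}
```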

In the R-M-W approach, we ensure that the record is unchanged by performing the write only if the generation is equal to its value at the time of the read. If there was an interim change, the GEN_EQUAL condition will fail and so will the write, and the entire R-M-W pattern may be retried.

Instead of making the write conditional on the record’s generation metadata, the write may be made conditional on the read value using a Predicate Filter. In that case, the write will succeed only if the previously read value has not changed.
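
As a rough sketch, using the Java client's expression filters (the successor to the original predicate filter API; exact classes vary by client version) and a hypothetical balance bin:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.exp.Exp;
import com.aerospike.client.policy.WritePolicy;

public class FilteredWrite {
    // Write the new balance only if the "balance" bin still holds the previously read value.
    static void writeIfUnchanged(AerospikeClient client, Key key,
                                 long readBalance, long newBalance) {
        WritePolicy wp = new WritePolicy();
        wp.filterExp = Exp.build(Exp.eq(Exp.intBin("balance"), Exp.val(readBalance)));
        wp.failOnFilteredOut = true;  // surface a FILTERED_OUT error instead of silently skipping
        wp.maxRetries = 0;
        client.put(wp, key, new Bin("balance", newBalance));
    }
}
```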

Resolving In-Doubt Transactions

An application must know whether a transaction succeeded or failed so that it can make any necessary adjustments. If the transaction fails, the application can take appropriate steps to retry the transaction (or escalate to higher-level logic, including manual intervention, to take appropriate action).

On rare occasions during a cluster transition (when the cluster is splitting), it is not immediately possible to know whether the transaction succeeded or failed. In such a case, the write transaction times out with the “InDoubt” flag in the exception object set to true, reflecting the still-unknown outcome. The in-doubt transaction can be resolved when the affected data partition becomes live again as part of the automatic cluster recovery process.

A transaction’s outcome may be unknown when the network is in the midst of splitting: the write was received in what becomes the minority sub-cluster, and the other sub-cluster became inaccessible before all replicas were known to be successfully updated. The exact outcome of the write is not immediately available as success or failure. The affected partition needs to become live and accessible after recovery in order to resolve the outcome of the write.

A request that is in-doubt will time out. There isn’t a built-in way to resolve an in-doubt transaction, so the client has to devise its own scheme to find out the outcome of the write. Again, this is in line with Aerospike’s design philosophy: to keep common operations simple and predictably fast, it defers to applications any added complexity of handling rare situations.

A typical way to resolve an in-doubt write is to record the transaction ID (a client-generated, globally unique ID) atomically in the record along with the update through a multi-operation request. A “txns” bin in the record maintains the list of transaction IDs that succeeded, as each write prepends its transaction ID to the list through an atomic multi-operation request. The application can determine whether an in-doubt write was successful by checking whether its transaction ID is in the list. The txns list is kept from growing indefinitely by trimming it to a specific size. The size needs to be chosen appropriately for the write frequency of the application: it should be large enough to ensure the transaction ID has not already been trimmed when the check is performed, but not so large that it unduly increases record size and disk usage. The txns list may be viewed as a simple record-specific transaction log managed by the application.

The following pseudo-code describes a way to resolve in-doubt transactions. It requires the record to have a “txns” list bin containing the last MAX_TXNS transaction ids.

  1. Read the record.
  2. Write the record using operate and the GEN_EQUAL policy to match the read generation, with these operations:
    1. write,
    2. prepend to txns list, and
    3. trim txns list.
  3. Handle the result: if the transaction is successful, return Success; if it failed with a generation error, return Retry; if it timed out and is in-doubt, resolve the in-doubt transaction.

Read-write transaction with in-doubt handling
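
The sketch below is one possible Java rendering of this pseudo-code. The bin names (balance, txns), the MAX_TXNS cap, the Outcome and Result types, and the credit-by-amount update logic are illustrative, not part of the Aerospike API.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Operation;
import com.aerospike.client.Record;
import com.aerospike.client.ResultCode;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.ListOperation;
import com.aerospike.client.policy.GenerationPolicy;
import com.aerospike.client.policy.WritePolicy;

public class TxnWrite {
    static final int MAX_TXNS = 20;  // hypothetical cap on the per-record transaction log

    enum Outcome { SUCCESS, RETRY, IN_DOUBT, FAILED }

    static class Result {
        final Outcome outcome;
        final int readGeneration;  // needed later to resolve an in-doubt outcome
        Result(Outcome outcome, int readGeneration) {
            this.outcome = outcome;
            this.readGeneration = readGeneration;
        }
    }

    // One read-write transaction: read, compute the new value, and write it back
    // together with the transaction ID, conditional on the generation read.
    static Result creditBalance(AerospikeClient client, Key key,
                                String txnId, long amount) {
        // 1. Read the record.
        Record rec = client.get(null, key);
        if (rec == null) return new Result(Outcome.FAILED, 0);
        long newBalance = rec.getLong("balance") + amount;  // hypothetical update logic

        // 2. Write with GEN_EQUAL so the write succeeds only if the record is unchanged.
        WritePolicy wp = new WritePolicy();
        wp.generationPolicy = GenerationPolicy.EXPECT_GEN_EQUAL;
        wp.generation = rec.generation;
        wp.maxRetries = 0;  // never auto-retry a write whose outcome may be in doubt

        try {
            client.operate(wp, key,
                Operation.put(new Bin("balance", newBalance)),      // 2a. the update itself
                ListOperation.insert("txns", 0, Value.get(txnId)),  // 2b. prepend the txn id
                ListOperation.trim("txns", 0, MAX_TXNS));           // 2c. cap the txns list
            return new Result(Outcome.SUCCESS, rec.generation);
        } catch (AerospikeException ae) {
            // 3. Map the error to the outcomes described above.
            if (ae.getResultCode() == ResultCode.GENERATION_ERROR) {
                return new Result(Outcome.RETRY, rec.generation);     // interim change: retry R-M-W
            }
            if (ae.getInDoubt()) {
                return new Result(Outcome.IN_DOUBT, rec.generation);  // resolve as shown next
            }
            return new Result(Outcome.FAILED, rec.generation);
        }
    }
}
```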

The in-doubt transaction can be resolved by waiting until the partition recovers and checking for the presence of transaction ID in the txns list. Below is the logic and pseudo-code for the polling read. Note that the generation must go past the original read generation to be sure that the original write is not still “in process” on the server. After a suitable elapsed time, if the generation has not moved past the original read generation, a touch operation (which only increments generation without modifying the record) may be issued with gen-equal check to force the generation past the original value. A successful touch operation will confirm the write failed or ensure it will fail. Otherwise the write is still in-doubt and must be resolved at a higher level (including through manual intervention).

  1. Poll the record until timeout or Success.
    1. Read txns while the record generation is still the same as the read generation and total wait is less than RESOLVE_TIMEOUT.
    2. If the transaction ID is in the txns list, the transaction definitely succeeded; return Success. Otherwise, the transaction definitely failed and may be retried.
  2. If the record generation is still equal to the read generation, touch the record with gen-equal check in an attempt to force failure and retry.
    1. If touch fails with generation-error, read txns and resolve as in 1b.
    2. If touch fails with any other error, return InDoubt for out-of-band resolution of the in-doubt transaction.
    3. If touch succeeds, the original transaction has failed or will fail (because it was issued with the same gen-equal check), and can be safely retried.

Resolving an in-doubt transaction
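
Below is an illustrative Java version of this resolution logic, continuing the sketch above. RESOLVE_TIMEOUT_MS, POLL_INTERVAL_MS, and the Resolution type are hypothetical names; only the get, touch, and policy calls are Aerospike client API.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.ResultCode;
import com.aerospike.client.policy.GenerationPolicy;
import com.aerospike.client.policy.WritePolicy;

import java.util.List;

public class InDoubtResolver {
    static final long RESOLVE_TIMEOUT_MS = 10_000;  // hypothetical overall polling budget
    static final long POLL_INTERVAL_MS = 500;       // hypothetical delay between polls

    enum Resolution { SUCCESS, FAILED_CAN_RETRY, STILL_IN_DOUBT }

    static Resolution resolve(AerospikeClient client, Key key,
                              String txnId, int readGeneration)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + RESOLVE_TIMEOUT_MS;

        // 1. Poll until the generation moves past the one observed at read time,
        //    or the polling budget is exhausted.
        Record rec = client.get(null, key, "txns");
        while (rec != null && rec.generation == readGeneration
                && System.currentTimeMillis() < deadline) {
            Thread.sleep(POLL_INTERVAL_MS);
            rec = client.get(null, key, "txns");
        }
        if (rec != null && rec.generation != readGeneration) {
            // 1b. The generation moved: the txns list now tells us the definite outcome.
            return txnApplied(rec, txnId) ? Resolution.SUCCESS
                                          : Resolution.FAILED_CAN_RETRY;
        }

        // 2. Generation has not moved: touch with a gen-equal check to force it past
        //    the original value, so the in-doubt write can no longer quietly succeed.
        WritePolicy wp = new WritePolicy();
        wp.generationPolicy = GenerationPolicy.EXPECT_GEN_EQUAL;
        wp.generation = readGeneration;
        wp.maxRetries = 0;
        try {
            client.touch(wp, key);
            return Resolution.FAILED_CAN_RETRY;  // original write failed or will fail; safe to retry
        } catch (AerospikeException ae) {
            if (ae.getResultCode() == ResultCode.GENERATION_ERROR) {
                // 2a. Something (possibly the in-doubt write) bumped the generation: re-check.
                rec = client.get(null, key, "txns");
                return (rec != null && txnApplied(rec, txnId))
                        ? Resolution.SUCCESS : Resolution.FAILED_CAN_RETRY;
            }
            return Resolution.STILL_IN_DOUBT;  // 2b. resolve out of band (e.g., manual intervention)
        }
    }

    private static boolean txnApplied(Record rec, String txnId) {
        List<?> txns = rec.getList("txns");
        return txns != null && txns.contains(txnId);
    }
}
```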

Here are a few additional things to remember while implementing transactions in Aerospike:

  • The retry policy must be set to “no retry” on a write request so that it is not automatically repeated on a time-out, because the timed-out (in-doubt) transaction might still succeed. For this (as shown above), use a write policy with max_retries set to zero.
  • All updates are written to all replicas synchronously, and each replica asynchronously flushes its write buffer to persistent storage in a single block for efficiency. For additional durability, developers can instruct Aerospike to commit each write to the device on a per-transaction basis using the commit-to-device configuration option, sketched below.
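
A rough aerospike.conf fragment for an SC namespace with commit-to-device enabled might look like the following; the namespace name and device path are placeholders, so consult the server configuration reference for your version.

```
namespace test {
    strong-consistency true          # namespace runs in SC mode
    storage-engine device {
        device /dev/nvme0n1          # placeholder device
        commit-to-device true        # acknowledge a write only after it reaches the device
    }
}
```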

Example

The following simple example shows a caller using the RMW function for a read-write transaction.

Example read-write caller
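
A minimal sketch of such a caller, tying together the creditBalance and resolve helpers sketched earlier; the host, namespace, set, and key are placeholders.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;

import java.util.UUID;

public class TransferCaller {
    public static void main(String[] args) throws InterruptedException {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);  // placeholder host
        Key account = new Key("test", "accounts", "user-42");             // placeholder record

        String txnId = UUID.randomUUID().toString();  // client-generated, globally unique ID
        TxnWrite.Result r = TxnWrite.creditBalance(client, account, txnId, 100);

        switch (r.outcome) {
            case SUCCESS:
                break;  // committed on all replicas
            case RETRY:
                // Generation conflict: re-run the whole R-M-W (typically with a fresh txnId).
                break;
            case IN_DOUBT: {
                // Outcome unknown: resolve against the record's txns list.
                InDoubtResolver.Resolution res =
                        InDoubtResolver.resolve(client, account, txnId, r.readGeneration);
                // SUCCESS: committed; FAILED_CAN_RETRY: safe to retry; STILL_IN_DOUBT: escalate.
                break;
            }
            case FAILED:
                break;  // definite failure: report or handle
        }
        client.close();
    }
}
```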

Choosing Read Consistency

Aerospike has client-side controls for the read consistency level in read policies. The client-level default applies to all reads unless a per-transaction setting is specified, in which case the latter takes precedence. The available levels include:

  • linearizable or globally consistent,
  • session-consistent, that is, consistent to client, and
  • additional relaxed consistency modes.

For additional details see the documentation.
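
As a rough illustration with the Java client (the readModeSC field and ReadModeSC enum are from recent client versions; older clients use different names), the level can be chosen per read policy:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.Policy;
import com.aerospike.client.policy.ReadModeSC;

public class ReadConsistencyExample {
    // Linearizable: globally consistent, at the cost of an extra regime check.
    static Record readLinearizable(AerospikeClient client, Key key) {
        Policy policy = new Policy();
        policy.readModeSC = ReadModeSC.LINEARIZE;
        return client.get(policy, key);
    }

    // Session consistency: consistent from this client's point of view; faster.
    static Record readSessionConsistent(AerospikeClient client, Key key) {
        Policy policy = new Policy();
        policy.readModeSC = ReadModeSC.SESSION;
        return client.get(policy, key);
    }

    // Relaxed modes such as ReadModeSC.ALLOW_REPLICA and ReadModeSC.ALLOW_UNAVAILABLE
    // trade consistency for read availability; see the documentation for details.
}
```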

Note that while the API provides durability-related client-side write policies, they apply only to AP mode, and are ignored in SC mode.

Session-consistent reads are faster than linearizable reads because they avoid the additional round trip in which the master checks the “regime” (the cluster formation version) with the replicas. The cost of the increased speed is occasional stale reads during network partition transitions. Note that the application doesn’t have to do anything extra to get this behavior in session-consistent mode: the driver library automatically detects and avoids stale reads by keeping track of the highest known regime for a partition and re-issuing a read if it comes from a lower regime.

A read may fail in SC mode for various reasons, and the application should handle the failure appropriately, usually by retrying. In some cases a read fails because it cannot return a consistent (latest committed) value: the cluster has split, and the connected sub-cluster has no valid replica of the partition available. The rules for how each partition becomes available in different sub-clusters of a split cluster are explained in this whitepaper.

Conclusion

As an application developer, you will need to handle data that requires consistency across all replicas in a distributed cluster. Aerospike’s Strong Consistency (SC) mode provides such strong consistency and other transactional guarantees for single record operations without forsaking speed, throughput, and resilience. Aerospike also allows applications to choose availability over consistency for other data through the AP mode. The choice of SC versus AP mode is made at the namespace (database) level on the server side, and it governs the transactional semantics for all data within the namespace. The Read-Modify-Write pattern is required for most read-write transactions. During cluster transitions, a transaction’s outcome may be uncertain and not immediately known. In such cases, the application needs to implement a scheme that records and checks a transaction ID to ascertain whether the transaction definitely succeeded or definitely failed.