
Aerospike 4.4: Change Notification and Operational Improvements

Brian Bulkowski
Aerospike Founder and CTO
November 19, 2018 | 4 min read

Aerospike is pleased to announce the availability of our 4.4 release. The 4.4 release contains a number of operational improvements, along with one major feature: Change Notification. Existing deployments should carefully read the operational improvements (listed below) – implementing them may save millions of dollars and result in vastly improved latency and reliability.

Aerospike’s Change Notification feature allows outside databases and message queues to be updated in parallel with writes from database clients. We consider this a key usability feature for Aerospike as the database of record. Built on Aerospike’s well-accepted Cross Datacenter Replication (XDR), Change Notification has been battle-tested in large-scale deployments across the globe.

Aerospike’s Kafka connector will be the first Aerospike-shipped component to use this feature. You’ll be able to take your writes from Aerospike directly to Kafka, which simplifies client processing. Write to one database, and Aerospike will update the destinations – even if there are network lags or other operational issues, the data will be safe in Aerospike.
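
As a sketch of what the downstream side could look like, here is a minimal Java consumer reading the records that the connector publishes to Kafka. The topic name and record format are assumptions for illustration; the actual topic and payload depend on how the connector is configured.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AerospikeChangeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "aerospike-changes");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // "aerospike-writes" is a placeholder topic name.
            consumer.subscribe(Collections.singletonList("aerospike-writes"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record corresponds to a write applied in Aerospike.
                    System.out.printf("key=%s value=%s%n",
                            record.key(), record.value());
                }
            }
        }
    }
}
```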

This exciting platform technology allows any partner or programmer to write a simple plugin that can update an analytics cache, refresh legacy machine learning stores, or feed a wide range of other destinations. By supporting real-time updates of different analytic stores, it can dramatically improve analytics responsiveness and enable practical HTAP-like (i.e. hybrid transactional/analytical processing) use. Surges in writes are effectively buffered in Aerospike, smoothing out periods where an analytic database may falter; the destination is then synced with Aerospike as soon as possible.

This enterprise feature may require a separate trial license; please contact Aerospike support for access.

Operational improvement features include:

Quiesce (4.3.1). With this functionality, you can route database traffic away from a node ahead of planned maintenance, avoiding any operational impact from planned changes. This can also be used in public clouds or VM environments that require “live migration” of underlying machines and notify virtual machines in advance.
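
As an illustration, the quiesce sequence can be driven with info commands from the Java client, as in this sketch (the node ID is a placeholder, and the same commands can be issued with the asinfo tool):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Info;
import com.aerospike.client.cluster.Node;

public class QuiesceForMaintenance {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        try {
            // Placeholder node ID for the node about to undergo maintenance.
            Node node = client.getNode("BB9020011AC4202");

            // Quiesce the node, then trigger a recluster so client traffic
            // drains to the remaining nodes before the node is taken down.
            System.out.println(Info.request(null, node, "quiesce:"));
            System.out.println(Info.request(null, node, "recluster:"));
        } finally {
            client.close();
        }
    }
}
```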

Limiting data migration (4.3.1). During a short planned maintenance, it is wasteful to move data between cluster nodes only for it to move back moments later. A new parameter, ‘migrate-fill-delay’, limits synchronization to updates only, preventing this extra data motion. It replaces customer workarounds based on limiting the number of data migration threads and using the ‘single-replica-limit’ parameter. By implementing both ‘quiesce’ and ‘migrate-fill-delay’, early customers have seen maintenance periods with no database timeouts, even in the strictest operational environments.
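
A minimal sketch of setting the parameter dynamically across the cluster before a maintenance window, using the same info-command pattern (the 300-second value is an arbitrary example):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Info;
import com.aerospike.client.cluster.Node;

public class SetMigrateFillDelay {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        try {
            // Delay "fill" migrations for 300 seconds on every node, so a
            // short maintenance only replays fresh updates when the node returns.
            for (Node node : client.getNodes()) {
                String result = Info.request(null, node,
                        "set-config:context=service;migrate-fill-delay=300");
                System.out.println(node.getName() + ": " + result);
            }
        } finally {
            client.close();
        }
    }
}
```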

Rack-aware reads (client releases). In many cloud environments, it is best practice to build a cluster that spans multiple availability zones, and Aerospike’s clustering is a great fit thanks to its rack-aware deployment model. Aerospike now provides a mechanism for database clients to preferentially read from servers in their own rack/zone, dramatically reducing cost and latency. Implementing this feature saved one large public cloud installation over $1M a year in traffic charges, with lower latency and increased stability. It is available starting in Java client 4.2.3, C# client 3.6.8, C client 4.4.0, and Go client 1.36, and will be coming soon to other clients. This feature currently works only with AP namespaces.
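
Enabling this from the Java client looks roughly like the following sketch (the rack ID, namespace, set, and key are illustrative, and the server namespace must be configured with a matching rack-id):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.ClientPolicy;
import com.aerospike.client.policy.Policy;
import com.aerospike.client.policy.Replica;

public class RackAwareRead {
    public static void main(String[] args) {
        ClientPolicy clientPolicy = new ClientPolicy();
        clientPolicy.rackAware = true; // enable rack awareness
        clientPolicy.rackId = 1;       // the rack/zone this client runs in

        AerospikeClient client = new AerospikeClient(clientPolicy, "127.0.0.1", 3000);
        try {
            Policy readPolicy = new Policy();
            // Prefer replicas on servers in the client's own rack.
            readPolicy.replica = Replica.PREFER_RACK;

            Key key = new Key("test", "demo", "user1");
            Record record = client.get(readPolicy, key);
            System.out.println(record);
        } finally {
            client.close();
        }
    }
}
```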

Read cache (4.3.1). Recent improvements in the Linux kernel have increased the performance and reliability of the Linux page cache. Until recently, Aerospike would by default set the O_DIRECT device flag to bypass lower-level read caches, and would never recommend allowing the page cache. Based on our research with kernels 4.4 and 4.13 in particular, we now recommend enabling the page cache for cases with a modest working set. This modification also reduces pollution of the page cache by maintenance tasks, making the most of DRAM space.
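
As a sketch, and assuming the ‘read-page-cache’ storage parameter is dynamically settable on your server version (it can also be set in the namespace’s storage-engine stanza of aerospike.conf; the namespace name here is illustrative):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Info;
import com.aerospike.client.cluster.Node;

public class EnableReadPageCache {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        try {
            // Enable page-cache reads for namespace "test" on every node.
            for (Node node : client.getNodes()) {
                String result = Info.request(null, node,
                        "set-config:context=namespace;id=test;read-page-cache=true");
                System.out.println(node.getName() + ": " + result);
            }
        } finally {
            client.close();
        }
    }
}
```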

Histogram and timeout fixes. We have found and fixed an internal flaw in Aerospike latency measurements, which affects how the server calculates timeouts and the statistics used to monitor server health. Measurements were incorrect when the incoming database request was large (large batch requests and large writes), and thus did not properly account for the network transfer into the server. The incorrect accounting affected the “start” portion of all transaction histograms. Servers would also process transactions that the client had already abandoned, where correct accounting would have timed out the request. When you apply 4.4, you might notice that monitoring elements are subtly different, or you might notice a greater number of timeouts if you had set the client timeout to “unlimited” but had set a server timeout. We have analyzed several deployment scenarios, and found in every case that 4.4’s accounting is correct. This correction aids diagnosis of network performance issues.
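
If you had been relying on an “unlimited” client timeout, one way to avoid surprises under the corrected accounting is to set explicit client-side timeouts, as in this Java sketch (the values are illustrative):

```java
import com.aerospike.client.policy.Policy;

public class ExplicitTimeouts {
    public static Policy readPolicy() {
        Policy policy = new Policy();
        policy.socketTimeout = 200;  // timeout per network attempt, in milliseconds
        policy.totalTimeout = 1000;  // hard deadline for the whole transaction;
                                     // 0 means unlimited and defers to the server
        return policy;
    }
}
```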

4.4 Release Notes