Meltdown / Spectre and Aerospike

Executive Summary

The Meltdown and Spectre CPU bugs require OS and cloud hypervisor patches to provide data security. These patches, while necessary, come at a performance cost. In general, we have seen little impact across the Aerospike customer base, thanks to Aerospike's remarkably low CPU and system call use.

This post will help you monitor, analyze, and in most cases mitigate the performance impact of these required security patches.

Background

The Meltdown and Spectre CPU bugs allow attacks in which untrusted processes can read data from the kernel and from trusted processes. Most critically, they allow virtual machines on the same physical machine to steal data from each other, which is devastating for the public cloud.

Hypervisor and OS fixes rolled out on public clouds and Linux distributions in the first week of January 2018. These patches cause increased CPU use and higher latency, which can impact your database. As Aerospike is primarily used in low-latency use cases, we are sharing a number of our results as guidance for proper Aerospike operations.

In general, we have seen few impacts across the Aerospike customer base caused by these changes. Aerospike has extraordinarily low CPU and system call use compared to Java-based systems or relational systems, and in our sizing recommendations with customers, rarely is CPU the bottleneck to performance compared to network and device limitations. Aerospike already contains a large number of optimizations to reduce system call use except for necessary network and IO calls.

In tests against high-transaction Flash systems, we have observed a range of end-to-end latency increases. In one public cloud test on Amazon i3 instances, we observed an increase of 300 microseconds (0.3 ms), which was the largest we observed. Bare metal tests ranged from 16 microseconds (lower load, 500K transactions per server) to 150 microseconds (higher load) on a Haswell CPU.

Latency increases above this level appear to be caused by CPU overload. In all cases we have observed so far, the overload can be mitigated either through the strategies detailed below (such as avoiding core 0 overload) or by increasing the number of CPUs (scale out).

Many of these steps apply to all environments, but we will specifically discuss the Amazon cloud and bare metal / hybrid cloud deployments.

In all cases, patch your servers, observe latency, and hunt down and mitigate any CPU increases that these patches may have introduced throughout your stack.

Checking KPTI is enabled on your Linux kernel

Aerospike recommends updating your bare metal and/or guest Linux kernel to include the KPTI patches. The following external links can be helpful in validating that you have correctly applied the necessary patches:

Debian / Ubuntu specific:
https://askubuntu.com/questions/992137/how-to-check-that-kpti-is-enabled-on-my-ubuntu

Generic tool:

https://raw.githubusercontent.com/speed47/spectre-meltdown-checker/master/spectre-meltdown-checker.sh
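For a quick local check alongside the tools linked above, the following is a minimal Python sketch, not a replacement for them. It assumes a kernel that exposes /sys/devices/system/cpu/vulnerabilities/meltdown (present on newer and many backported kernels) and otherwise falls back to looking for the "pti" or "kaiser" flag in /proc/cpuinfo. The function names are hypothetical, and the exact strings may differ on some vendor kernels.

```python
from pathlib import Path
from typing import Optional


def cpu_flags(cpuinfo: str) -> set:
    """Return the CPU feature flags from /proc/cpuinfo text (x86 layout)."""
    for line in cpuinfo.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def kpti_status(meltdown_sysfs: Optional[str], cpuinfo: str) -> str:
    """Classify KPTI as 'enabled', 'disabled', 'not affected' or 'unknown'."""
    if meltdown_sysfs is not None:
        text = meltdown_sysfs.strip()
        if text.startswith("Mitigation: PTI"):
            return "enabled"
        if text.startswith("Vulnerable"):
            return "disabled"
        if text.startswith("Not affected"):
            return "not affected"
    # Assumption: patched kernels without the sysfs file advertise
    # page table isolation as a 'pti' (or, on older backports, 'kaiser') flag.
    if cpu_flags(cpuinfo) & {"pti", "kaiser"}:
        return "enabled"
    return "unknown"


def read_text(path: str) -> Optional[str]:
    """Read a file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    sysfs = read_text("/sys/devices/system/cpu/vulnerabilities/meltdown")
    cpuinfo = read_text("/proc/cpuinfo") or ""
    print("KPTI:", kpti_status(sysfs, cpuinfo))
```

An "unknown" result only means this sketch could not tell; fall back to the linked checker script in that case.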

Recommended Actions

We have observed CPU increases which range from 3 percentage points on the low end to a maximum of 20 percentage points in high-load cases. All tests were on Aerospike databases running at moderate operational load, and we focused our testing on Flash-based configurations. We generally find that increased per-packet overhead in network packet processing is the likely culprit.

These guidelines will help you understand the impact of these patches and mitigate it.

Monitor latency – While CPU load is important to functional systems and avoiding operational problems, the most likely cause of business impact will be increases in latency. Aerospike is used in low-latency cases where microseconds count, so be sure – before you hastily attempt mitigations – that you have a true business problem.

Monitor CPU load – We have seen that these security patches increase CPU load on clients and servers. Make sure to monitor the “per core” utilization (‘mpstat -P ALL’, or top showing core statistics). The observed increase in IO processing overhead may disproportionately affect the cores processing packets or IO. Re-tuning the system to distribute packet load among more cores may be necessary, which is discussed in this Aerospike forum article. Alternatively, the fastest short-term mitigation may be to add application servers or database servers to your fleet or cluster, especially if running in a public cloud.
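If mpstat is not available, per-core utilization can be approximated directly from /proc/stat. The sketch below is illustrative only: it samples the per-core counters twice and reports the busy percentage for each core over the interval, treating idle and iowait time as not busy. The function names and the one-second interval are assumptions, not part of any Aerospike tooling.

```python
import time


def parse_proc_stat(text):
    """Return {core: (busy_jiffies, total_jiffies)} for each cpuN line."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate 'cpu' line and all non-CPU lines.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        # Fields: user nice system idle iowait irq softirq steal
        values = [int(v) for v in parts[1:9]]
        total = sum(values)
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        cores[parts[0]] = (total - idle, total)
    return cores


def per_core_utilization(before, after):
    """Return {core: busy percent} between two parse_proc_stat() samples."""
    util = {}
    for core, (busy1, total1) in after.items():
        if core not in before:
            continue
        busy0, total0 = before[core]
        delta_total = total1 - total0
        util[core] = 100.0 * (busy1 - busy0) / delta_total if delta_total else 0.0
    return util


if __name__ == "__main__":
    try:
        with open("/proc/stat") as f:
            first = parse_proc_stat(f.read())
        time.sleep(1.0)
        with open("/proc/stat") as f:
            second = parse_proc_stat(f.read())
        for core, pct in sorted(per_core_utilization(first, second).items()):
            print(f"{core}: {pct:.1f}% busy")
    except OSError:
        print("/proc/stat is not available on this system")
```

A single core sitting far above the others (often cpu0) is the pattern to look for before deciding whether to redistribute packet processing.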

Upgrade Kernels – Researchers have found a large number of kernel improvements to be beneficial in combination with the security patches. Those include PCID (support for the “invpcid” instruction in Haswell and newer CPUs), up-to-date ethernet drivers, NVMe storage drivers, and better page table support. Current research focuses on the currently stable 4.14 kernels. It can be difficult to find the right steps for using “backport” versions, but researchers and Amazon are touting benefits.
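Whether the CPU (or, in a virtual machine, the hypervisor) exposes PCID can be read from the feature flags in /proc/cpuinfo. The following Python sketch looks for the "pcid" and "invpcid" flags; the function names are hypothetical. Note that seeing the flags only shows hardware support, and whether a given kernel actually uses PCID depends on the kernel version.

```python
from pathlib import Path


def cpu_flags(cpuinfo):
    """Return the CPU feature flags from /proc/cpuinfo text (x86 layout)."""
    for line in cpuinfo.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def pcid_support(cpuinfo):
    """Report whether the 'pcid' and 'invpcid' flags are present."""
    flags = cpu_flags(cpuinfo)
    return {"pcid": "pcid" in flags, "invpcid": "invpcid" in flags}


if __name__ == "__main__":
    try:
        text = Path("/proc/cpuinfo").read_text()
    except OSError:
        text = ""
    print(pcid_support(text))
```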

Network optimizations (huge packets and ENI) – Aerospike has always recommended ENI networking, and Amazon recommends this change in order to reduce the effect of the patches. Huge packets for local connections are also highly advised, because access to records over 1500 bytes will result in far fewer system calls. Numerous guides exist on the internet for enabling this feature and checking configuration.
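To make the packet-size argument concrete, here is a rough Python sketch. It assumes that "huge packets" refers to jumbo frames (an MTU of around 9000 instead of 1500) and that each frame carries 40 bytes of IPv4 and TCP headers with no options; both are assumptions for illustration. It counts frames on the wire, which is only a proxy for the per-packet processing cost, and it also lists the MTU currently configured on each local interface.

```python
import math
from pathlib import Path

HEADER_BYTES = 40  # assumption: 20-byte IPv4 + 20-byte TCP headers, no options


def frames_needed(record_bytes, mtu):
    """Estimate how many frames are needed to carry one record."""
    payload = mtu - HEADER_BYTES
    if payload <= 0:
        raise ValueError("MTU too small for the assumed headers")
    return max(1, math.ceil(record_bytes / payload))


def interface_mtus(root="/sys/class/net"):
    """Return {interface: mtu} as reported by sysfs (empty if unavailable)."""
    mtus = {}
    try:
        for iface in Path(root).iterdir():
            try:
                mtus[iface.name] = int((iface / "mtu").read_text())
            except (OSError, ValueError):
                continue
    except OSError:
        pass
    return mtus


if __name__ == "__main__":
    for size in (1000, 8000):
        print(size, "bytes:",
              frames_needed(size, 1500), "frames at MTU 1500,",
              frames_needed(size, 9000), "frames at MTU 9000")
    print(interface_mtus())
```

Under these assumptions, an 8000-byte record needs six frames at MTU 1500 but only one at MTU 9000, while a 1000-byte record fits in a single frame either way.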

Enable hugepages – Amazon reports that enabling hugepages can be beneficial for workloads that do large amounts of memory access, but only on very recent kernels (4.14+). While we have previously mandated disabling Transparent Hugepages (THP), due to several noted problems including those related to JEMalloc and Aerospike Secondary Indexes, Amazon’s positive recommendation for THP on new kernels (notably 4.14) should be noted, and perhaps considered. This should be approached with great caution after substantial testing with your configuration, and we have not reproduced this result.
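Before experimenting, it helps to know which THP mode a host is currently in. On Linux this is reported in /sys/kernel/mm/transparent_hugepage/enabled, where the active mode is shown in square brackets. The Python sketch below only reads that setting; the function name is hypothetical and nothing here changes the configuration.

```python
import re
from pathlib import Path

THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def thp_mode(text):
    """Return the bracketed (active) mode, e.g. 'madvise', or None."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    try:
        print("THP mode:", thp_mode(Path(THP_PATH).read_text()))
    except OSError:
        print("THP setting not available on this system")
```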

Consider disabling PTI patches – While data security is important, it is also important to balance the business risk of reduced performance against the business risk of certain attack vectors. The potential for data leaks through public cloud infrastructure means hypervisor patches are required, but on dedicated instances and bare metal deployments, the attack vector is solely through an attacker who has already breached your network and gained access to an unprivileged account on the host. For non-personally identifiable (non-PII) data, you may come to the conclusion that your existing network and host security is strong enough to allow short-term disabling of KPTI changes. This change can also be used to validate the performance impact of KPTI kernels, which is a great shortcut in diagnosing problems. If desired, the KPTI patches can be disabled through the ‘nopti’ kernel option.
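When comparing hosts with and without KPTI, it is easy to lose track of which ones were booted with the option. The following Python sketch checks the running kernel's command line in /proc/cmdline for 'nopti', and also for the equivalent 'pti=off' spelling accepted by mainline kernels. The function name is hypothetical, and vendor kernels may offer other switches that this sketch does not detect.

```python
from pathlib import Path

# Boot parameters that turn page table isolation off on mainline kernels.
DISABLE_TOKENS = {"nopti", "pti=off"}


def pti_disabled_by_cmdline(cmdline):
    """Return True if the kernel command line explicitly disables PTI."""
    return any(token in DISABLE_TOKENS for token in cmdline.split())


if __name__ == "__main__":
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        cmdline = ""
    print("PTI disabled at boot:", pti_disabled_by_cmdline(cmdline))
```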

Amazon EC2

Although many customers have reported no problems and only minimal CPU increases (less than 5%), we have one report from a customer whose Aerospike servers are seeing a 20 percentage point increase in CPU utilization (from 30% to 50%) due to Amazon’s hypervisor changes.

In that particular case, the customer reported “some” latency impact on the deployment, although analysis was difficult because all tiers were impacted.

The low impact of these fixes on Aerospike is likely due to Aerospike’s generally low CPU use in deployment. Even after the Meltdown / Spectre patches, the deployment was still running under 50% CPU utilization.

At Aerospike, we were fortunate to be in the middle of a round of performance tests when the hypervisor changes affected a set of test instances. We were able to directly observe – on a lightly loaded system – the latency impact independent from CPU increases. The increase in end-to-end latency in this particular test was about 300 microseconds. This shifted all latency measurements – best case, average and 95% worst case latency – by the same amount, consistent with increased code path lengths along IO and interrupt handlers.
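The reasoning here, that a fixed per-request cost moves every percentile by the same amount, can be checked against your own measurements. The Python sketch below uses synthetic latency samples (not Aerospike test data) and a simple nearest-rank percentile; the function names and sample values are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile; pct=0 returns the minimum (best case)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def percentile_shift(before, after, pcts=(0, 50, 95)):
    """Return {pct: after - before} for each requested percentile."""
    return {p: percentile(after, p) - percentile(before, p) for p in pcts}


if __name__ == "__main__":
    # Synthetic latencies in microseconds, for illustration only.
    before = [400, 450, 500, 700, 1200]
    fixed_cost = [x + 300 for x in before]    # constant overhead per request
    tail_only = [400, 450, 500, 700, 1500]    # only the slowest request changes
    print("fixed cost:", percentile_shift(before, fixed_cost))
    print("tail only: ", percentile_shift(before, tail_only))
```

A uniform shift across best case, median and 95th percentile suggests added per-request overhead, whereas a shift concentrated in the tail points more toward queuing or CPU saturation.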

While this number is significant, it does include both client and server latency, as well as device latency (EC2 i3 instances). We expect pure in-memory latency increases to be lower.

That internal test case did not include some of the Amazon recommendations, such as running on a PCID-enabled kernel, although it was running with ENI.

We have had one customer who was using smaller instances (m3.medium) and benefited greatly from moving to a larger and more recent instance type. This is in line with Amazon’s current recommendations, which point out that post-Haswell CPUs include the PCID CPU instruction, which reduces the impact of the patches. There may be other benefits to larger instances, such as fewer neighbors.

Interestingly, our testing did NOT show performance loss caused by running patched Amazon Linux guest kernels. Once the hypervisor fix was in place, we observed no additional cost to a patched guest kernel.

Other Clouds

Google’s GCE hypervisor fixes rolled out previously. Google is not releasing the date they began or ended this rollout, as per their blog post on the topic. Given the uncertain timing of the rollout, we are unable to offer a “before and after” measurement, but we have re-measured latency on GCE vs EC2, and continue to find that GCE has a measurable latency edge over EC2 in our test cases, although we realize there are many factors in public cloud provider selection.

Azure’s changes have rolled out even more recently. We do not yet have numbers based on their changes, but would expect them to be in line with the measurements we have on EC2 and/or bare metal, and the same recommended actions should be applied.

The general tuning and analysis guidelines above apply to all cloud environments.

Bare Metal

As of January 9th 2018, it appears that all major distributions, from Red Hat-derived distributions such as CentOS to Debian-derived systems such as Ubuntu, include patched kernels with mitigating patch sets applied.

Our testing on bare metal has been mixed. In a test on CentOS 7 (3.10 kernel) and Haswell CPUs, we saw latency increases that ranged from negligible (5% or 16 microseconds) to substantial (20% or 150 microseconds). Across these two test runs, the higher latency impact was observed when the utilization of the machine was only slightly (20%) higher.

A slight reduction in utilization may therefore greatly reduce the latency impact. These results imply that the latency impact of the patches can be negligible when running a system that is not overloaded.

Please contact Aerospike support for further recommendations regarding bare metal latency mitigations.