At Aerospike, we serve customers with very strict requirements for latency and throughput. We take these needs very seriously.
One of our larger customers needed to expand their cluster, and the drives they had been using were no longer available. They asked us for a recommendation, and we ran a round of testing with the open-source tool ‘fio’. Based on those numbers, we recommended a particular drive (which shall remain nameless). The customer deployed a few servers with that drive. Within a day, those servers had severe performance problems, which would have crippled their business had they deployed more servers with this drive. Every few hours brought more timeouts and degraded performance, and our customer wanted the problem solved…immediately.
We had to get to the bottom of the problem. While we suspected the drives, there could have been other unintended differences. We realized we needed a specific test for Aerospike’s flash optimized I/O patterns, so we quickly wrote the ACT program and put sample drives through the test sequence. At first, the drives looked very fast – similar to the numbers we received from ‘fio’, but then we saw the drives’ characteristics changing. At about 8 hours of sustained load, performance started degrading, and it reached unacceptable levels within 12 hours. We gave the code to the manufacturer, who replicated the result and determined they couldn’t fix the behavior in the current generation of hardware. We then recommended another drive, which the customer has used with great success.
In another case, we went to a major manufacturer – Intel – whose drive also failed ACT, but they were able to prescribe a workaround – overprovisioning – which allowed us to recommend their drive. We have run ACT on a large number of drives, found some premium enterprise products that failed, and received firmware updates that have since benefited all customers. We also found some less expensive consumer drives that succeeded.
The benefits of a source-available tool focused on latency measurement became clear to us. In cases where a drive would not meet a customer’s required SLA, we could work with manufacturers and hardware engineers easily. They could simply run the tests themselves to improve their firmware and make recommendations.
There are a number of enterprise flash brands – Hitachi, STEC, PureStorage, and Violin – which we have not tested. If you are considering these brands or others, run ACT and publish your own results.
ACT measures latency under sustained load – with large block writes – at increasing throughput until failure
In our blog post at High Scalability, we’ve laid out the general principles we’ve used to optimize for flash storage. Data must be written in large blocks – as in a log-based approach – and read in short blocks with high levels of parallelism.
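The write side of this pattern can be illustrated with a short sketch. This is a simplified Python model, not Aerospike's actual implementation: small records accumulate in an in-memory buffer and are flushed to the device as one large block write once the buffer fills. The class and device names here are hypothetical.

```python
# Illustrative sketch of a log-style write buffer (not Aerospike's actual code).
# Small records accumulate in memory and are flushed as one large block write.

RECORD_BYTES = 1536              # 1.5 KB objects
LARGE_BLOCK_BYTES = 128 * 1024   # 128 KB write buffer

class WriteBuffer:
    def __init__(self, device):
        self.device = device     # any object with a write(bytes) method
        self.buf = bytearray()

    def put(self, record: bytes):
        self.buf += record
        if len(self.buf) >= LARGE_BLOCK_BYTES:
            self.flush()

    def flush(self):
        if self.buf:
            self.device.write(bytes(self.buf[:LARGE_BLOCK_BYTES]))
            del self.buf[:LARGE_BLOCK_BYTES]

class CountingDevice:
    """Stand-in for a raw device; just counts large block writes."""
    def __init__(self):
        self.writes = 0
    def write(self, block: bytes):
        self.writes += 1

dev = CountingDevice()
wb = WriteBuffer(dev)
for _ in range(1000):            # 1000 records of 1.5 KB = ~1.5 MB
    wb.put(b"x" * RECORD_BYTES)
wb.flush()
# ~1.5 MB of small writes become 12 device writes (11 full 128 KB blocks + 1 partial)
```

The device sees only a handful of large sequential writes instead of a thousand small random ones, which is the behavior flash controllers handle best.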
The ACT code and tool we’ve open-sourced does this. The project is available on GitHub at http://github.com/aerospike/act, and it runs under Linux. The README file describes the configuration file, and building the test requires simply executing ‘make’.
We started by simulating a base load per device of 2,000 transactions per second (TPS) of read load, and 1,000 TPS of write load, with 1.5 KB objects – what we call 1x load. This is a good “base load” and object size for our Web session management customers.
Aerospike optimizes for flash by using large block writes and small block reads. Each read is easy to model – a call to the read() function – but writes are combined into larger blocks. The size of these large block operations, the write buffer, is configurable, but we found most drives perform best with a size of 128 KB. Because of defragmentation, a half-full drive must be written at twice the desired rate. Thus, to simulate the 1x load, we need 1.5 MB per second of primary writes, another 1.5 MB per second of defragmentation writes, and 3.0 MB per second of reads.
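The arithmetic above generalizes to any load multiple, and can be checked with a short calculation. This is a sketch; the constants simply mirror the 1x profile described in the text, and the function name is our own.

```python
# Bandwidth implied by the ACT load profile described above.
RECORD_BYTES = 1536    # 1.5 KB objects
READ_TPS_1X = 2000     # read transactions per second at 1x
WRITE_TPS_1X = 1000    # write transactions per second at 1x
DEFRAG_FACTOR = 2      # a half-full drive rewrites every block once via defrag

def load_mb_per_sec(multiple: int):
    """Return (read MB/s, total write MB/s) for an Nx load."""
    reads = READ_TPS_1X * multiple * RECORD_BYTES / 1e6
    primary_writes = WRITE_TPS_1X * multiple * RECORD_BYTES / 1e6
    return reads, primary_writes * DEFRAG_FACTOR

print(load_mb_per_sec(1))   # roughly 3 MB/s of reads and 3 MB/s of writes
print(load_mb_per_sec(3))   # roughly 9 MB/s of reads and 9 MB/s of writes
```

At 1x this works out to about 3 MB per second in each direction, matching the figures in the text once the defragmentation writes are included.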
If the latency of the drive is acceptable at this “1x” rate, we run a test at “3x”, then “6x”, and continue upward. As a database provider, we are most concerned when the fraction of read requests taking more than 1 ms grows past 5%. We are also very concerned with read requests that take a very long time – over 64 ms – because pauses of this type will hang up threads in the database, as well as stall an application server waiting for responses.
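The acceptance rule described here can be expressed as a small check. This is an illustrative sketch in our own words – ACT itself reports latency histograms, and the judgment is made from those; the function name and the strict zero-tolerance for 64 ms events are our simplifying assumptions.

```python
# Illustrative pass/fail check against the latency thresholds described above.
# Inputs are the percentage of reads exceeding each latency threshold.

def acceptable(pct_over_1ms: float, pct_over_64ms: float) -> bool:
    """Sketch of the acceptance rule: reads over 1 ms must stay at or
    below 5%, and reads over 64 ms should be absent entirely."""
    return pct_over_1ms <= 5.0 and pct_over_64ms == 0.0

print(acceptable(0.9, 0.0))    # True  -> well within the thresholds
print(acceptable(17.9, 0.4))   # False -> too many slow reads
```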
The ACT tool is different from most benchmarks because it measures latency across multiple buckets under a single, defined load profile. Most benchmarks measure throughput, and some measure latency at peak throughput, which is not how anyone would run a device in production. For example, the usually thorough Tom’s Hardware benchmarks do not measure latency or simultaneous reads and writes. Storage Review does a very thorough job measuring latency and throughput, but does not measure latency under a defined load.
The following numbers show the difference in our testing between the first generation of Intel drives – the X25M – and the second generation. The result shows that the second generation was not appreciably faster for this use case.
The performance numbers are the percentage of read requests that took longer than 1 ms, 8 ms, or 64 ms to complete. In each pair, the second number counts only the delays caused by the SSD itself, while the first is the total percentage of delayed transactions, including queue delays inside Aerospike caused by the drive’s slowness.
| Drive name | Capacity | Load | 20% over-provisioning? | > 1 ms (% total / ssd only) | > 8 ms (% total / ssd only) | > 64 ms (% total / ssd only) | Test date |
|---|---|---|---|---|---|---|---|
| Intel X25M | 160 GB | 1x | No | 17.9 / 16.9 | 0.6 / 0.02 | 0.4 / 0.01 | 10/23/11 |
| Intel 320 | 160 GB | 1x | No | 15.9 / 15.6 | 0.02 / 0.01 | 0 / 0 | 11/2/11 |
| Intel 320 | 160 GB | 1x | Yes | 5.4 / 5.2 | 0 / 0 | 0 / 0 | 11/2/11 |
| Intel 320 | 160 GB | 3x | Yes | 18.0 / 13.2 | 0.3 / 0.01 | 0.07 / 0 | 11/3/11 |
These test results show the poor initial showing of the Intel 320 drive, with numbers only slightly better than the previous-generation X25M. With nearly 16% of requests requiring more than 1 ms, the drive would not meet the needs of our customers, even at our “1x” load (3 MB per second of reads, 1.5 MB per second of primary writes). Intel suggested applying 20% overprovisioning; with it, only 5.4% of reads took more than 1 ms at 1x, though at 3x (9 MB per second of reads) we still saw a substantial number of slow requests. The 1x result was judged acceptable by the standards of late 2011.
The following more comprehensive table includes drives from a wider variety of manufacturers.
| Drive name | Capacity | Load | 20% over-provisioning? | > 1 ms (% total / ssd only) | > 8 ms (% total / ssd only) | > 64 ms (% total / ssd only) | Test date |
|---|---|---|---|---|---|---|---|
| Unnamed drive | 100 GB | 1x | No | 4.5 / 1.9 | 2.6 / 0.08 | 2.3 / 0.04 | 8/14/11 |
| OCZ Deneva 2 SLC | 120 GB | 1x | Yes | 0.9 / 0.7 | 0.08 / 0.02 | 0 / 0 | 10/12/11 |
| OCZ Deneva 2 SLC | 120 GB | 3x | Yes | 3.2 / 2.2 | 0.4 / 0.03 | 0 / 0 | |
| Samsung SS805 | 100 GB | 1x | Yes | 2.0 / 1.7 | 0.1 / 0.01 | 0 / 0 | 8/20/11 |
| Samsung SS805 | 100 GB | 3x | Yes | 12.7 / 8.6 | 1.9 / 0.1 | 0.03 / 0 | 8/24/11 |
| Samsung 830 | 256 GB | 1x | Yes | 0.64 / 0.59 | 0 / 0 | 0 / 0 | 1/14/12 |
| Samsung 830 | 256 GB | 3x | Yes | 2.21 / 1.86 | 0 / 0 | 0 / 0 | 1/15/12 |
| Samsung 830 | 256 GB | 6x | Yes | 6.09 / 3.96 | 0 / 0 | 0 / 0 | 1/16/12 |
| Samsung 840 | 256 GB | 1x | Yes | 11.67 / 11.44 | 0 / 0 | 0 / 0 | 11/30/12 |
| Samsung 840 | 256 GB | 3x | Yes | 59.74 / 34.37 | 21.75 / 0.84 | 10.92 / 0 | 11/30/12 |
| Samsung 840 Pro | 256 GB | 3x | Yes | 11.77 / 9.75 | 0 / 0 | 0 / 0 | 11/30/12 |
| OCZ Vertex 3 Max IOPS | 120 GB | 1x | Yes | 3.8 / 3.4 | 0.4 / 0.04 | 0.03 / 0 | 11/4/11 |
| OCZ Vertex 4 | 256 GB | 1x | Yes | 1.39 / 1.36 | 0.01 / 0 | 0.01 / 0 | 10/30/12 |
| OCZ Vertex 4 | 256 GB | 3x | Yes | 5.38 / 5.33 | 0.02 / 0 | 0.01 / 0 | 11/1/12 |
| OCZ Vertex 4 | 256 GB | 6x | Yes | 16.86 / 11.25 | 0.09 / 0 | 0.06 / 0 | 11/4/12 |
| OCZ Vertex 4 | 256 GB | 12x | Yes | 93.70 / 93.60 | 0.36 / 0.18 | 0.1 / 0 | 11/5/12 |
| Fusion-io ioDrive2 MLC | 785 GB | 3x | No | 2.62 / 1.56 | 0 / 0 | 0 / 0 | 12/16/12 |
| Fusion-io ioDrive2 MLC | 785 GB | 6x | No | 7.33 / 2.81 | 0.10 / 0 | 0 / 0 | 12/16/12 |
| Fusion-io ioDrive2 MLC | 785 GB | 12x | No | 15.04 / 9.24 | 0 / 0 | 0 / 0 | 12/16/12 |
| Fusion-io ioDrive2 MLC | 785 GB | 24x | No | 57.09 / 19.63 | 0 / 0 | 0 / 0 | 12/16/12 |
| Intel S3700 | 400 GB | 1x | Yes | 0.56 / 0.48 | 0 / 0 | 0 / 0 | 11/10/12 |
| Intel S3700 | 400 GB | 3x | Yes | 1.6 / 1.29 | 0 / 0 | 0 / 0 | 11/10/12 |
| Intel S3700 | 400 GB | 6x | Yes | 5.4 / 2.92 | 0 / 0 | 0 / 0 | 11/10/12 |
| Intel S3700 | 400 GB | 12x | Yes | 12.2 / 11.3 | 0 / 0 | 0 / 0 | 11/10/12 |
| Intel S3700 | 400 GB | 1x | No | 0.47 / 0.40 | 0 / 0 | 0 / 0 | 11/16/12 |
| Intel S3700 | 400 GB | 3x | No | 1.66 / 1.35 | 0 / 0 | 0 / 0 | 11/16/12 |
| Intel S3700 | 400 GB | 6x | No | 5.13 / 2.73 | 0 / 0 | 0 / 0 | 11/16/12 |
A variety of conclusions can be drawn from this raw data. We see strong performance from the OCZ Vertex 4 drive with its next-generation controller, while the now-discontinued Samsung SS805 drive has lower latency at the same performance as the Vertex 4. Samsung made strong positive strides between the SS805 and the 830, but the 840 model is a step backward.
Fusion-io’s product, best known for the acclaim it has won accelerating traditional relational databases, is exceptional here as well. Even at high loads, no requests landed in the higher latency buckets. However, the fraction of requests requiring more than 1 ms was higher than we expected at these performance levels.
The Intel S3700 is a very interesting product, as its per-drive performance is very high. Even testing at the 12x level produces no long requests. These drives also show initial performance that matches their performance at the 12- and 24-hour marks, making production configuration quicker and more predictable. Importantly, the S3700 takes no CPU from the main processor, does not impact the memory bus, and will typically be deployed with between 4 and 12 drives per server – giving a further performance boost to already exceptional numbers. The drives do not benefit from overprovisioning, and should be used at full capacity.
Flash has moved from special-purpose hardware to commodity in only a few years. Vendors are revising models rapidly and tuning them to today’s real-time database demands. As they do, we will continue to use the ACT tool to evaluate their performance, and we recommend that anyone evaluating flash run the test themselves to determine the best drive for their real-time big data demands.