Open Source Databases: The Unknown Community

Aerospike Founder and CTO

February 9, 2016|5 min read

It’s hard to judge how popular – and how well-used – an open source project is. At Aerospike, we’d also like more insight into the use of our open source database software. To this end, we propose to collect statistics from running Aerospike clusters to better understand usage patterns, system behavior, cluster topologies and other hard metrics, rather than rely on our assumptions of how we think people use our software.

Many claim downloads, GitHub forks, and social posts as a proxy for popularity. While these numbers can seem to go up and to the right, they are often hard to verify or simply don’t correlate with real product usage.

Apache attempts to solve this problem with the Apache Community Process. Those who contribute determine the direction of the project. But those who contribute don’t know how a project is being used; they only have their guesses. Several of the most significant open source projects – like Linux, Python and Docker – have a BDFL (Benevolent Dictator for Life). These “keepers of the flame” field information from the industry and provide vision and direction.

At Aerospike, what kind of information do we receive about our database software? We get posts to the Aerospike forum. We see the support issues raised by subscribers. We see downloads, web views, and discussions about Aerospike in the press and at conferences. And we combine all this data – industry information, perspective from advisors, requests from forums and postings – with our best judgment to decide how best to steer our project.

Changing the Game

You may have noticed – here in 2016 – that many open source projects are asking you to contribute data. Android, Firefox, Chrome, Eclipse: each one asks you to contribute usage data and lets you say either “yes” or “no”. They all have user interfaces with a sensible place to click in order to accept or reject data collection.

At Aerospike, we’d also like you to contribute usage data. We want to know how people are using our database software. We want to know which features – which languages, what kind of queries, at what speeds, with what size clusters – are being used. We want to know if your servers have been misbehaving so we can arrange our forces to get a fix out, pronto.

We get that kind of rich information from our paid subscribers, who raise support tickets and whose experience contributes to roadmap updates. Our community users provide feedback through the Aerospike forum. But we want more.

Personally, I’d like it if all the infrastructure providers knew how their communities are using their code. I’d like to discuss – fairly, with data in hand – who is using what. For example, I’d like to know how much Apache Cassandra is being used, at what speeds, with what size clusters. Often, I contend, “we know” a project is popular, based on little data other than download numbers; but the latter can be gamed by frequent updates, or suffer from being well-distributed through third parties which don’t track usage.

Privacy

We know you need your privacy. I like my privacy, too.

We don’t need to know who you are. At all. We really, really don’t want the data you store in Aerospike. We don’t want your IP address, or your MAC address, either.

We’ll anonymize everything. We want you to review the distributed source code (in Python) to verify our commitment.

We just need to know what you’re doing, so we can build the best Aerospike possible.
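To make the idea of anonymization concrete, here is a minimal sketch in Python of how a usage report might be scrubbed before it leaves a server. It is an illustration under stated assumptions, not the code Aerospike ships: the field names (`node_id`, `ip_address`, `mac_address`, `cluster_size`) and the `anonymize` helper are hypothetical, and salted SHA-256 hashing is just one reasonable way to keep reports groupable without exposing the original identifier.

```python
import hashlib
import secrets

# Hypothetical field names chosen for illustration; they are not taken
# from Aerospike's actual telemetry code.
SENSITIVE_FIELDS = {"ip_address", "mac_address", "hostname"}


def anonymize(report: dict, salt: bytes) -> dict:
    """Return a copy of `report` with identifying data removed.

    Sensitive fields are dropped outright. The node identifier is replaced
    by a salted SHA-256 digest, so reports from the same node can still be
    grouped without revealing the original identifier.
    """
    cleaned = {k: v for k, v in report.items() if k not in SENSITIVE_FIELDS}
    if "node_id" in cleaned:
        digest = hashlib.sha256(salt + str(cleaned["node_id"]).encode("utf-8"))
        cleaned["node_id"] = digest.hexdigest()
    return cleaned


# The salt is generated locally and is never transmitted.
salt = secrets.token_bytes(16)

# Made-up sample values, for illustration only.
raw = {
    "node_id": "example-node-1",
    "ip_address": "10.0.0.5",
    "mac_address": "00:11:22:33:44:55",
    "cluster_size": 4,
}
safe = anonymize(raw, salt)
```

In this sketch the salt stays on the server, so the receiving side cannot reverse the digest by hashing guesses for known node identifiers.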

Opt-in vs. Opt-out

We considered letting users opt in to enable this usage data collection. But opt-in rates for infrastructure projects, which have no user interface, tend to be low – as low as 10% according to estimates I’ve seen – even when tied to a monitoring service. And without being tied to a monitoring service, it’s unclear who would ever contribute data.

Moreover, even if you trust the provider and wish to opt in, updating a config file for every server in order to do so (and updating that config when a package updates) makes for a very cumbersome process.

Lastly, according to the Research Ethics Guidebook, there is a range of evidence showing that opt-in samples are less representative than samples collected by opt-out methods; opting in has been shown to bias samples in ways that may inadvertently exclude certain categories of users.

All these reasons lead us to believe that the most effective way to achieve a valid and representative data sample for our infrastructure project is to collect usage data on an opt-out basis. Thus, we are making data collection on by default, with the option to opt out.

Volunteering more information about you – such as your email address – will enable us to build a path to routing your information to our support organization so we can be proactive in the future. We’ll even be able to do things like find cases where you’re impacted by a bug that’s already been fixed, and tell you that there’s an update.

Having said all this, we will support your ability to easily decline the collection of your data.

Our Promise

The code that “phones home” must be distributed in its source form and in an easy-to-read format. The anonymization of identifiable data must be clear in the published source code.

You shall have the ability to turn off data collection. Disabling must be obvious, straightforward, and published everywhere – within the project, in distributed binaries, and prominently on the project’s website.
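As an illustration of what an obvious, straightforward off switch could look like, the sketch below reads a single setting from an INI-style config and treats collection as enabled only when the user has not turned it off. The `[telemetry]` section and `enabled` key are hypothetical names invented for this example; they are not Aerospike's actual configuration options.

```python
import configparser


def telemetry_enabled(config_text: str) -> bool:
    """Return False only when the user has explicitly opted out.

    The section and key names are assumptions made for this sketch.
    A missing section or key falls back to True, which mirrors an
    opt-out default.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.getboolean("telemetry", "enabled", fallback=True)


# A user who wants to decline collection sets one line in the config.
opted_out = "[telemetry]\nenabled = false\n"
untouched = ""
```

Here `telemetry_enabled(opted_out)` evaluates to `False`, while `telemetry_enabled(untouched)` evaluates to `True`, so a single line is enough to decline.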

We’ll publish these statistics in aggregate form on a regular basis.

Our hope is that better data will lead to a better understanding of how Aerospike is used; this feedback loop will help to make our project better. We think this is the best approach.

As an open source project, we value your feedback and input on this topic and on all things related to Aerospike.
