Taking Advantage of Probabilistic Data Structures

Andre Chang, Solutions Architect Blog, Developer

Note: the code examples in this post are written in python.  HLL support requires at least version 4.0.0 of the Aerospike python client.

pip install "aerospike>=4.0.0"

Complete sample code can be found at https://github.com/aerospike-examples/hll-python

I once worked on a team that was involved with testing and profiling new hardware.  Test engineers crafted and executed numerous tests against a phalanx of hardware for each hardware revision, each test collecting an array of time series data totalling GBs of storage.  After the test suite was completed, the data from each test would be loaded, analysed, and compiled into a report.

My team was tasked with addressing the length of time it took to perform the post-processing.  We examined the database, the network, and anything else we could think of, and we made big improvements.  But the test engineers still complained.  We then turned to the data analysis pipeline.  We found that, in most cases, the analysis was reporting only the minimum, maximum, and mean of most of the time series.  To do so, the pipeline had to load each time series into memory and examine each measurement.

No doubt you’re thinking that this pipeline could have been a lot more efficient.  We could have kept accumulators that tracked the minimum and maximum values as the measurements were ingested, and maintained a running total with the count of the number of samples to calculate the mean at the end of the test.  Not only did the data not have to be read and processed again after the test, but they never had to be individually stored in the first place.

Use Case – Online Ad Targeting

HyperLogLog is an algorithm for estimating the number of distinct elements in a multiset.  The HyperLogLog data type was introduced in Aerospike 4.9 as a native data type, encompassing the algorithm and underlying data structures.  (The HyperLogLog data type also implements the HyperMinHash algorithm for estimating similarity between sets.)  With HyperLogLog (and HyperMinHash), we are able to apply the same principle to new classes of problems.  For example, a common big data problem is online ad targeting.

Each time a user performs an action online, information is collected about the user.  For site visits, data is typically collected in the form of a set of tags (eg. user characteristics like “male” or “female”, interests like “databases”, or locations like “California”) associated with a user identifier (eg. an email address).  This is how advertisers are able to target marketing to a specific demographic or interest.  Before launching a marketing campaign, the advertiser wants to know the audience that will be reached by the campaign (ie. the number of unique user profiles that match a specific set of tags over a given period of time).

Suppose we wanted to estimate the number of unique users who searched for the term “aerospike” in a particular month.  The naïve approach is to record all the identifiers and associated tags.  Then we can count the number of unique identifiers that are associated with the desired tag within the time period.  But, like the example of finding the minimum, maximum, and mean values of a time series, we can do better.

Using HyperLogLog (HLL), we can estimate the number of unique values in a set.  It is a probabilistic data structure, so it does not yield a precise count.  Instead, it estimates a count with a maximum error that is a function of the number of bits used for the HLL bins.  In our example, we use 12 bits of HLL index; this yields a maximum error of 1.6250% (see https://www.aerospike.com/docs/guide/hyperloglog.html#error-bounds-by-hyperloglog-operation).

Ingesting data

Suppose, throughout a month, we are given (partial) user profiles.  Rather than storing the raw profiles, we can add the profile identifier to HLL data structures instead.  We create a new Aerospike record for each tag in a month; the key for this record is “<tag>:<month>”, and the HLL is stored in the bin “hll_bin”.  For each partial profile ingested, we add the profile identifier to the corresponding record for each tag contained in the partial profile.

# code - hll_add

    for tag in profile:
        ops = [
            hll_operations.hll_add(HLL_BIN, [str(i)], HLL_INDEX_BITS)
        ]

        _, _, result = client.operate(getkey(tag, month), ops)

Adding the same identifier to a HLL element is idempotent.  Thus throughout the month, we simply add partial profiles as we receive them without needing to know if that identifier has previously been added to the HLL elements for the month.

Data footprint

With 12 HLL bits (and 0 HyperMinHash bits), each HLL bin requires just over 3 kB of storage (3,088 bytes to be precise).

Suppose we were not using HLL, but, instead, stored each partial user profile with each tag as a string.  Each time a partial profile includes the tag “aerospike”, storing the tag requires 18 bytes (9 overhead plus length 9).  If the “aerospike” tag were to be included in 1000 profiles in a month, then that corresponds to 18 kB of storage; using HLL reduces the storage in this case by 83%.

As our usage scales up and we process more partial profiles per month, the storage required for HLL remains constant.  However, the storage required for the raw tags increases, and this savings becomes even greater.

Counting

Once our partial profiles are ingested over the month, we can get the estimated count of unique profile identifiers in the HLL element.

# code - hll_get_count

    ops = [
        hll_operations.hll_get_count(HLL_BIN)
    ]
    _, _, result = client.operate(getkey(tag, month), ops)
    print 'tag:%s month:%d count:%d' % (tag, month, result[HLL_BIN])

# output

tag:aerospike month:0 count:3367
tag:aerospike month:1 count:3324
tag:aerospike month:2 count:3419
tag:aerospike month:3 count:3298
tag:aerospike month:4 count:3298
tag:aerospike month:5 count:3515
tag:aerospike month:6 count:3353
tag:aerospike month:7 count:3412
tag:aerospike month:8 count:3423
tag:aerospike month:9 count:3308
tag:aerospike month:10 count:3326
tag:aerospike month:11 count:3530

Union

Now after a year of collecting data, we wish to estimate the number of unique profile identifiers for a tag over the year.  We can’t simply add the monthly estimates since some of those identifiers may be counted in multiple months.  But we can calculate a union of multiple HLL data structures.

# code - hll_get_union_count

    records = [record[2][HLL_BIN] for record in client.get_many([getkey(tag, time) for time in times])]

    ops = [
        hll_operations.hll_get_union_count(HLL_BIN, records)
    ]

    _, _, result = client.operate(getkey(tag, t0), ops)
    print 'tag:%s months:%d-%d count:%d' % (tag, t0, t0+months-1, result[HLL_BIN])

# output

tag:aerospike months:0-11 count:18381

Intersection

Our data contains geographic tags to identify the location of the users.  We’ve already created HLL records for all of our tags, so it’s easy to check how many of our users are located in Vancouver:

tag:vancouver month:0 count:6217

There are two cities on the Pacific coast named Vancouver: one in Canada and the other in Washington.  We can get the count of users in each of those:

tag:canada month:0 count:5460
tag:washington month:0 count:6720

But how many of our users are from Vancouver, Washington?  We can use HLL intersection to estimate this.

# code - get_hll_intersect_count

    records = [record[2][HLL_BIN] for record in client.get_many([getkey(tag, month) for tag in tags[1:]])]

    ops = [
        hll_operations.hll_get_intersect_count(HLL_BIN, records)
    ]

    _, _, result = client.operate(getkey(tags[0], month), ops)
    print 'tags:%s month:%d count:%d' % (tags, month, result[HLL_BIN])

# output

tags:['vancouver', 'washington'] month:0 count:724

HLL estimates that of the 6217 profiles that contain the tag “vancouver”, 724 of them also contain the tag “washington”.

Conclusion

There are many use cases where we are interested in the approximate size of a set, of the union of sets, or of the intersection of sets.  For these use cases, HyperLogLog is remarkably efficient, in both time and space, at estimating the cardinality of sets.  Operational costs are lower with improvements in either one of the dimensions; the combination allows for significant cost savings that should not be neglected.

Complete sample code can be found at https://github.com/aerospike-examples/hll-python

Share:

About Author

mm

    Andre Chang, Solutions Architect

    All posts by this author
    Andre is an avid outdoors enthusiast with decades of experience in the backcountry. He applies his expertise to Search and Rescue, volunteering his time to the tracking of those unfortunate enough to be lost in the wilderness. In a previous life, Andre was a software engineer, honing his technical skills over a myriad of companies and industries. That led to his current role as a Solutions Architect where he is driven by his desire to understand and solve problems faced by his customers.