Aerospike Engineering Blog, Technology, Data Modeling

Faceting is a pattern that can be applied to many domain problems. In this context, faceting presents summaries of important metrics and trends and aggregations of key quantitative and qualitative information sources. Applying faceting techniques to your operational datastore can help you make faster, more informed business decisions than if you were to use a purpose-built analytics solution.

To illustrate how to use facets with Aerospike, let’s start with a JSON schema and consider this common problem—you have a product catalog that you need to query on a distinct set of attributes.

products: { sku: "123-ABC-723",
name: "Wheelbarrow",
pre_assembled: True,
weight_in_kg: 12,
category: ["Garden", "Tool"]
}

Let’s assume that SKU (Stock Keeping Unit) is the primary key. If we know the primary key, we need a direct lookup in order to get the record. But how many of us remember the SKU of a product we want to buy? You are likely to want to find the product through some other criterion, such as name or product category.

The RDBMS Solution

In an RDBMS, you might be tempted to create secondary indexes on an attribute you want to query on (such as name or weight_in_kg), or you might simply allow some slow queries to be implemented as table scans. Creating multiple indexes means that as a record is inserted, these indexes have to be changed. Updates to those indexed values will also require index updates. A list-like category would need to be implemented as an intersection table between the unique categories and the usage by the product. In order to effectively query on multiple criteria, you’ll end up with multiple tables, joins, and potential index merges. As the index structures are maintained, this can impact the concurrency, latencies, and throughput of the overall system. When you then distribute this data across many machines using partitioning, for example, you’ll need to scatter the query across each of these nodes and gather all the results together. Again, this can increase the latency of the returned results..

Faceting

Faceting is simply turning each of these possible secondary index queries into multiple primary key lookups. Let’s look at a possible JSON schema:

products:
{ sku: "123-ABC-723",
name: "Wheelbarrow",
pre_assembled: True,
pickup_only: True,
weight_in_kg: 12,
category: ["Garden", "Tool"]
}

{ sku: "737-DEF-911",
name: "Bicycle Pump",
pre_assembled: True,
pickup_only: False,
weight_in_kg: 0.5,
category: ["Tool"]
}

{ sku: "320-GHI-921",
name: "Kite",
pre_assembled: False,
pickup_only: False,
weight_in_kg: 0.5,
category: ["Toy"]
}

lookups:
{ key: "category/Tool", products: [ "123-ABC-723", "737-DEF-911" ] }
{ key: "category/Garden", products: [ "123-ABC-723"] }
{ key: "category/Toy", products: [ "320-GHI-921" ] }
{ key: "pre_assembled/True", products: [ "123-ABC-723", "737-DEF-911"] }

If we are looking for a product in the category “Tool”, we can perform a primary key query on lookups with the value “category/Tool”. This returns a record with an encapsulated set of product SKUs. We can then iterate through the products list and perform primary key lookups for each product.

We can now avoid performing a scatter and gather across a distributed system and turn this into a series of primary key lookups, each of which can be parallelized. This dramatically improves the throughput, latency, and concurrency of the system. It also has the added bonus of being the fastest way to return the first record, as whichever server happens to be the most lightly loaded will return a response that can then be delivered to the application.

The final benefit of this pattern is that you no longer need to build and maintain indexes on each of the attributes. Building a new lookup simply requires creating and inserting a new lookup record. Changing the product attributes requires some compromise, as a secondary process is required to manage the associated lookup records. You need to strike a balance between your requirements for speedy and flexible lookups and the anticipated cost of maintenance. If your values are slowly changing or immutable, it may be a worthwhile tradeoff.

Multiple Attribute Faceting

But what about filtering on multiple attributes? This is a typical request, so let’s again look at an example—how can we find products where pickup_only and pre_assembled are both True? There are a couple of ways to achieve this. We could perform two queries and find the intersection. Here’s an example of how to do that in Python:

import aerospike
import os
import hashlib

config = { 'hosts': [(os.environ.get("AEROSPIKE_HOST", "127.0.01"), 3000)],
'policies': { 'key': aerospike.POLICY_KEY_SEND }
}

client = aerospike.client(config).connect()

def create_products():
wheelbarrow = { 'sku': "123-ABC-723",
'name': "Wheelbarrow",
'pre_assembled': True,
'pickup_only': True,
'weight_in_kg': 12,
'category': ["Garden", "Tool"]
}
pump = { 'sku': "737-DEF-911",
'name': "Bicycle Pump",
'pre_assembled': True,
'pickup_only': False,
'weight_in_kg': 0.5,
'category': ["Tool"]
}
kite = { 'sku': "320-GHI-921",
'name': "Kite",
'pre_assembled': False,
'pickup_only': False,
'weight_in_kg': 0.5,
'category': ["Toy"]
}
client.put(("test", "products", wheelbarrow['sku']), wheelbarrow)
client.put(("test", "products", pump['sku']), pump)
client.put(("test", "products", kite['sku']), kite)

def create_lookups():
client.put(("test", "lookups", "pre_assembled/True"),
{ 'products': [ "123-ABC-723", "737-DEF-911"] })
client.put(("test", "lookups", "store_pickup_only/True"),
{ 'products': [ "123-ABC-723"] })

def match(key1, key2):
m = []
(key1, meta1, record1) = client.get(("test","lookups",key1))
(key2, meta2, record2) = client.get(("test","lookups",key2))
matches = list(set(record1['products']) & set(record2['products']))
for sku in matches:
(key, meta, record) = client.get(("test", "products", sku))
m.append(record)
return m

# Find matches based on two criteria
create_products()
create_lookups()
# Find the match
matches = match("pre_assembled/True", "store_pickup_only/True")
for m in matches:
print m

We can use the list manipulation built into most languages to find the intersection (see bolded code above). Running the code will result in the following:

{'sku': '123-ABC-723', 'category': ['Garden', 'Tool'], 'name': 'Wheelbarrow',
 'pre_assembled': True, 'pickup_only': True, 'weight_in_kg': 12}

While this adds a lot of complexity due to the fact that you need to calculate the intersection of the results of each query, it’s still a reasonable, scalable solution. Aerospike’s user-defined functions (UDFs) can provide even more flexible in-database filtering capabilities as shown in this example project. But how can we remove the need to find the intersection in the client code, especially if each of those lists starts to get really big?

Compounding and Hashing Attribute Keys

In effect, this faceting pattern creates a compound key and can be implemented in an RDBMS, along with the previously discussed overhead. The pattern is similar with Aerospike; in essence, we build a compound key from the attribute we want to query and then use this compound key to create a primary key for the object we want to lookup. Here’s a sample of what that data would look like in JSON form:

lookups:
{ key: {pre_assembled: True, pickup_only: True},
products: [ "123-ABC-723"] }

Here’s a piece of Python code to perform the compound query:

def create_hashed_lookups(lookup_key, products):
h = hashlib.new("ripemd160")
h.update(str(lookup_key))
client.put(("test", "lookups", h.hexdigest()),
{ 'products': products})

def match_hashed(lookup_key):
m = []
h = hashlib.new("ripemd160")
h.update(str(lookup_key))
(key, meta, found) = client.get(("test", "lookups", h.hexdigest()))
for sku in found['products']:
(key, meta, record) = client.get(("test", "products", sku))
m.append(record)
return m

# Find matches based on hashed criteria
lookup_key={'pickup_only': True, 'pre_assembled': True}
create_hashed_lookups(lookup_key, ["123-ABC-723"])
# Find the match
matches = match_hashed(lookup_key)
for m in matches:
print m

The function create_hashed_lookups is creating a hash (using RIPEMD-160) of the compound values we want to query on, thus providing a compact and reproducible value to query against. We wanted a deterministic hash that minimizes collision, so we used RIPEMD-160, which is also used in the Bitcoin algorithm. We could have easily used SHA512 or another popular hash instead. We could also have used a simple concatenation of strings, but a hash avoids the problems of key size and key distribution. This allows a primary key lookup to be made on these compound values. Once the lookup record has been returned, we can then execute the subsequent primary key lookups of the product data as we have previously done.

When you run the code, you will see the matching product printed:

{'sku': '123-ABC-723', 'category': ['Garden', 'Tool'], 'name': 'Wheelbarrow',
 'pre_assembled': True, 'pickup_only': True, 'weight_in_kg': 12}

Summary

Faceting is a powerful pattern that enables complex query patterns to be executed in an efficient way within a key-value store. With any denormalization, you have to consider the cost of propagating the changes to the denormalized data, taking into account the tradeoff between the frequency of changes to your database and the query flexibility needed by your application.

In the next article, we will discuss how to deal with structures such as queues and state machines.

References

Code samples can be found in Github.

About Author

    Aerospike Engineering

    All posts by this author