Aerospike Engineering Blog, Technology, Data Modeling

The first section reviews the basic data modeling concepts of encapsulation and embedding, linking, and denormalization. These concepts will be built upon throughout this article. 

Encapsulation and Embedding  

The Aerospike record defines the transactional boundary. But unlike an RDBMS record, an Aerospike record can describe a complex object and its relationship.

fig1Figure 1: Simple containment relationship

In Figure 1, entity A encapsulates or owns all the associated B entities. When A is deleted, all the associated Bs are removed as well. In an RDBMS, you could implement this as two tables and a foreign key with a delete cascade rule.

In Aerospike, you implement this very differently. One alternative is to encapsulate the B instances within the A record. In JSON, this is shown as an array of objects:

A:
{ a1: "foo",
  B: [ {b1: "bar"}, 
       {b1: "baz"}
     ]
}

This means that instances of entity B are held in an array within record A. Thus, removing record A implicitly removes the A record and all the encapsulated B records. Any operation that modifies record A, such as setting the attribute a1 to “foobar” or adding or removing an element in the B array, is still a single record operation, and thus, atomic.

Linking

If we take the same example from Figure 1 and change A to Department and B to Employee, how does this change the way we think about the data? It’s reasonable to assume that an employee works for a specific department at any given moment in time, but perhaps they transferred to a new department last week. We could model this in the following way:

fig2Figure 2: Linking and “Owning”

What this is saying is that each department has a number of assignments, and each assignment is for a specific employee. That assignment has a start_date so you can see current and historical assignments. Bear in mind that there are many alternative models, but we’re going to concentrate on the generic patterns.

Using JSON again, here’s one way this could be modeled:

departments:
  { name: "HQ", 
    assignments: [ { ename: "Fred", start_date: "01-Jan-2016" },
                   { ename: "Anne", start_date: "18-Feb-2014" } 
                 ]
  }

employees:
  {ename: "Fred", job_title: "Machinist"},
  {ename: "Anne", job_title: "Sales Exec"}

In this model, assignments have been encapsulated within the department. Assuming ename is the primary key for the employees (please take this with a grain of salt), you can use this information to perform a second query on the employees to determine other details such as  job_title. Aerospike, like many NoSQL databases, does not perform traditional joins like an RDBMS. Again, this is because the whole operation could require the coordination of many nodes to satisfy the query, which kills the overall latency of the operation. Aerospike works on the basis that you can break down the operations into discrete steps. In our example, we can now perform two primary key accesses to obtain all the information required. But since these are now discrete operations, eventual consistency does come into play as interleaved reads will see changed data.

Any operation on department now encapsulates the assignment, and changes to either are atomic within a single operation, just as in the last example.

Bidirectional Linking

We also need to consider that some relationships are not linked in one way but are instead bidirectional. In the model above, we can find the employee within the department, but there is no simple way to find which department a given employee works for. Our linking model would require you to look at each department and check whether the employee has an active assignment.

We can optimize this pattern and store multiple links, as described in the following JSON:

departments:
  { name: "HQ", 
    assignments: [ { ename: "Fred", start_date: "01-Jan-2016" },
                   { ename: "Anne", start_date: "18-Feb-2014" } 
                 ]
  }

employees:
  {ename: "Fred", job_title: "Machinist", dept: "HQ" },
  {ename: "Anne", job_title: "Sales Exec", dept: "HQ" }

This adds some additional processing steps as the link has to be maintained at both ends when performing changes such as moving an employee to a new department. The ins and outs of maintaining a bidirectional linking system are beyond the scope of this article, but we will explore bidirectional linking further in an upcoming article.

Denormalization

Referring back to the last example, note that the bidirectional model requires you to execute two primary key lookups in order to obtain an employee attribute (e.g., job_title). If you need to generate employee information comprised of more than one attribute, such as a summary of all active employees for a department, each individual attribute will require two lookups. Given the high performance of a key-value optimized database, this is often a reasonable pattern. In cases where you need to execute not just two lookups but potentially 5, 10, or even 20 lookups, one for each relationship, this can create a performance issue. Can we optimize this process and avoid extra lookups? The short answer is yes, through the process of denormalization. Remember that the process of normalization removes any repeated data, so it should come as no surprise that denormalization reintroduces that repeated data. The following JSON schema shows the denormalized job_title on the assignment.

departments:
  { name: "HQ", 
    assignments: [ { employee: {emp_name: "Fred", 
                                job_title: "Machinist" },
                     start_date: "01-Jan-2016" },
                   { employee: {emp_name: "Anne",
                                job_tile: "Sales Exec" },
                     start_date: "18-Feb-2014" } 
                 ]
  }

employees:
  {ename: "Fred", title: "Machinist", dept: "HQ" },
  {ename: "Anne", title: "Sales Exec", dept: "HQ" }

As you can see, we have now added or denormalized more that just the primary key of the employee into the assignment. All the data required has been denormalized into this object, which removes the need for a second query. But this does not come without tradeoffs. Extra storage is required as we are now storing the same data (e.g. job_title) multiple times. If “Anne” gets promoted to “VP of Sales”, we potentially have to make multiple updates—once to the employee record, and once to the department record that encapsulates the assignment.

For this use case, you can argue that perhaps you don’t need to update the assignment because you can simply insert a new assignment into the array with the new job_title and start_date. You would probably need to update the employee record with the new job_title or simply remove the job_title from employee because the current job_title is on the assignment.

As you can see, there are many options depending on your data requirements and use case.

References

Code samples can be found in Github.

About Author

    Aerospike Engineering

    All posts by this author