Granulate nodes

The graph modeling pattern that we will discuss in this section is what we will call the granulate pattern. It means that in graph database modeling, we tend to build much more fine-grained data models, with a higher level of granularity, than we would be used to in a relational model.

In a relational model, we use a process called database normalization to come up with the granularity of our model. Wikipedia defines this process as follows:

"…the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database using the defined relationships."

The reality of this process is that we will create smaller and smaller table structures until we reach the third normal form. This is a convention that the IT industry seems to have agreed on: a database is considered to have been normalized as soon as it achieves the third normal form. Visit http://en.wikipedia.org/wiki/Database_normalization#Normal_forms for more details.

As we discussed before, this model can be quite expensive, as it effectively introduces the need for join tables and join operations at query time. Database administrators tend to denormalize the data for this very reason, which introduces data duplication, another very tricky problem to manage.

In graph database modeling, however, normalization is much cheaper for the simple reason that these infamous join operations are much easier to perform. This is why we see a clear tendency in graph models to create thin nodes and relationships, that is, nodes and relationships with few properties on them. Such nodes and relationships are very granular; we say that they have been granulated.

Related to this pattern is a typical question that we ask ourselves in every modeling session: should I keep this as a property, or should the property become its own node? For example, should we model the alcohol percentage of a beer as a property on a beer brand? The following diagram shows the model with the alcohol percentage as a property:

A data model with fatter nodes

The alternative would be to split the alcohol percentage off as a different kind of node.
The following diagram illustrates this:

A data model with a granulated node structure
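To make the difference between the two diagrams concrete, here is a minimal sketch in plain Python rather than in a real graph database. It takes a "fat" model, in which the alcohol percentage is a property on each beer brand, and splits that property off into separate, shared nodes. The brand names, the percentage values, the `AlcoholPercentage` label, the `HAS_ALCOHOL_PERCENTAGE` relationship type, and the `granulate` helper are all assumptions made up for illustration; none of them come from the book's dataset.

```python
# Fat-node model: the alcohol percentage is a property on each beer brand.
# Names and values are purely illustrative.
fat_nodes = {
    "brand_a": {"label": "BeerBrand", "name": "Brand A", "alcohol_percentage": 6.5},
    "brand_b": {"label": "BeerBrand", "name": "Brand B", "alcohol_percentage": 8.0},
    "brand_c": {"label": "BeerBrand", "name": "Brand C", "alcohol_percentage": 8.0},
}


def granulate(nodes, prop, new_label, rel_type):
    """Split `prop` off every node into shared value nodes plus relationships.

    Returns (thin_nodes, relationships), where each relationship is a
    (start_node_id, relationship_type, end_node_id) tuple.
    """
    thin_nodes = {}
    relationships = []
    for node_id, props in nodes.items():
        # Keep every property except the one being split off.
        thin_nodes[node_id] = {k: v for k, v in props.items() if k != prop}
        if prop in props:
            value = props[prop]
            value_id = f"{new_label}:{value}"
            # One node per distinct value, shared by all brands that have it.
            thin_nodes.setdefault(value_id, {"label": new_label, "value": value})
            relationships.append((node_id, rel_type, value_id))
    return thin_nodes, relationships


thin_nodes, relationships = granulate(
    fat_nodes, "alcohol_percentage", "AlcoholPercentage", "HAS_ALCOHOL_PERCENTAGE"
)
```

In this sketch, the three fat nodes become three thin brand nodes plus two alcohol percentage nodes, because Brand B and Brand C point to the same percentage node. That shared node is what a traversal can later pass through.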

Which one of these models is right? I would say both and neither. The fundamental point here is that we should look at our queries to determine which version is appropriate. In general, I would present the following arguments:

If we don't need to evaluate the alcohol percentage during the course of a graph traversal, we are probably better off keeping it as a property of the end node of the traversal. After all, we keep our model a bit simpler when doing this, and everyone appreciates simplicity.
If we need to evaluate the alcohol percentage of a particular beer brand, or of a set of beer brands, during the course of our graph traversal, then splitting it off into its own node category is probably a good idea. Traversing through a node is often easier and faster than evaluating properties for each and every path.
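The two arguments above can be illustrated with a small sketch, again in plain Python and not in a real graph database. It answers the question "which other brands have the same alcohol percentage as this one?" in two ways: by comparing the property on every candidate brand, and by hopping through a shared percentage node. The data and the function names are hypothetical. The sketch mirrors only the shape of the two approaches and says nothing about the actual performance of any particular database.

```python
from collections import defaultdict

# Hypothetical data: brand id -> alcohol percentage.
brand_props = {"brand_a": 6.5, "brand_b": 8.0, "brand_c": 8.0}


def same_strength_by_property(start):
    """Fat-node approach: evaluate the property on every candidate brand."""
    target = brand_props[start]
    return sorted(
        brand for brand, pct in brand_props.items()
        if brand != start and pct == target
    )


# Granulated approach: each distinct percentage is its own node, and the
# relationships are stored as adjacency in both directions.
brand_to_pct = {}
pct_to_brands = defaultdict(set)
for brand, pct in brand_props.items():
    pct_node = ("AlcoholPercentage", pct)
    brand_to_pct[brand] = pct_node
    pct_to_brands[pct_node].add(brand)


def same_strength_by_traversal(start):
    """Granulated approach: hop brand -> percentage node -> other brands."""
    pct_node = brand_to_pct[start]
    return sorted(pct_to_brands[pct_node] - {start})
```

Both functions return the same brands. The difference is that the first one has to inspect a property on every candidate, while the second one only follows the relationships attached to a single percentage node.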

As we will see in the next paragraph, many people actually take this approach a step further by working with in-graph indexes.


Learning Neo4j 3.x (Second Edition)
