Graph or relational databases

From Endeavour Knowledge Base

A question arises as to whether the Discovery Data Service should use relational, property graph, document, or key-value database management systems in its main patient record data stores.

This article considers the differences between graph and relational from the perspective of the known information requirements of health query. It reaches a conclusion as a result.

The starting point is to examine a few key logical differences, then move on to some technical differences. The two models are then compared from the perspective of the types of query needed, in order to conclude which one to choose.

Firstly, it's worth noting that a pure graph model is not included for consideration, as there are known problems at scale with pure graphs. If one were to adopt graph, then a property graph is the most likely variant. A property graph has properties on the nodes and, in the case of Neo4j, also has properties on the edges (which are called relationships).

Graph and relational relationships

Taking some very simple entity relationships from a health record, the following illustrates the different logical approaches. Graph is on the left, relational is on the right.

Graph versus relational.jpeg

The logical difference is that graphs can be explicit about the type of relationship between entities, whereas a relational model uses foreign keys to link entities together, i.e. the field on the right, "patient id", points to the patient entity on the left but does not state the semantic reason for the link. On the left, the graph states that a patient is a subject of an encounter, and could conversely indicate that an encounter has a subject that is a patient. The graph approach is clearer and more intuitive. In this sense a graph is more relational than a relational database. Conclusion: graph wins for intuitiveness.

The following diagram illustrates the processes involved in traversing from a patient to an observation (graph) versus a join between a patient and an observation (relational), in both cases using the encounter as an intermediate entity (accepting that in reality there may be a more direct relationship).

Graph vs relational traverse.jpeg

It can be seen that in graph, the relationship indexes are adjacent to the connecting nodes, and therefore 4 direct memory address pointers are needed. In the relational model, the foreign key of the observation table is searched for in the encounter primary index, back to the encounter and up again, making a total of 5 searches, i.e. each index block must be searched to find the destination of the record in each table. Conclusion: graph wins for multi-hop traversals over relational joins (which is, of course, what graph databases are for).
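The difference between the two access patterns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relationship names (SUBJECT_OF, HAS_OBSERVATION), the keys, and the dictionary-based "indexes" are hypothetical, and a Python dictionary is a hash table standing in for the index blocks a real relational engine would search.

```python
from dataclasses import dataclass, field

# Graph style: every node holds direct references to its neighbours,
# grouped by relationship type (names here are hypothetical).
@dataclass
class Node:
    label: str
    key: int
    out: dict = field(default_factory=dict)  # relationship type -> list of Nodes

def link(src, rel_type, dst):
    """Attach dst to src under a typed relationship."""
    src.out.setdefault(rel_type, []).append(dst)

patient = Node("Patient", 1)
encounter = Node("Encounter", 10)
observation = Node("Observation", 100)
link(patient, "SUBJECT_OF", encounter)
link(encounter, "HAS_OBSERVATION", observation)

def graph_observations(p):
    """Two hops, each one a pointer dereference with no index search."""
    return [o
            for e in p.out.get("SUBJECT_OF", [])
            for o in e.out.get("HAS_OBSERVATION", [])]

# Relational style: rows live in tables, and each hop is a lookup
# in a foreign key index before the row itself is fetched.
observation_rows = {100: {"id": 100, "encounter_id": 10}}
encounter_by_patient = {1: [10]}        # index on encounter.patient_id
observation_by_encounter = {10: [100]}  # index on observation.encounter_id

def relational_observations(patient_id):
    """Two index searches followed by a row fetch per matching observation."""
    result = []
    for enc_id in encounter_by_patient.get(patient_id, []):
        for obs_id in observation_by_encounter.get(enc_id, []):
            result.append(observation_rows[obs_id])
    return result

print([o.key for o in graph_observations(patient)])
print([r["id"] for r in relational_observations(1)])
```

Both functions return the same observation. The point of the sketch is only that the graph version follows stored references, while the relational version has to search an index at each hop.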

Health query patterns

The above section illustrates the power of graph over relational. The question is whether this is helpful in health query.

The main characteristic of health query, compared with standard business query, is the idea of subsumption testing. Subsumption testing involves the examination of concepts that are recorded in health events to see if they are subtypes of a high-level type. For example, a query such as "find me the incidence of all infectious diseases in the population of London in the last 10 years" involves (in theory) examining billions of observations in patient records, determining that the dates are in the last 10 years and then, for each concept recorded in each observation, determining whether it is an infectious disease.

To do this, the general approach is to start with a list of infectious diseases. This involves using an ontology that has concepts with definitions which include the fact that the concept is a disease and has a pathological process which is infectious. This would result in a list of around 10,000 concepts.
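As a simplified sketch of how such a list might be produced, the following Python walks a toy is-a hierarchy to collect every subtype of a high-level concept. The concept names and the IS_A table are invented for illustration, and the sketch assumes the subtype relationships have already been derived. A real ontology would infer them from the concept definitions (is a disease, has an infectious pathological process) rather than read them from a hand-written table.

```python
from collections import deque

# Toy is-a hierarchy: child concept -> parent concepts (hypothetical data).
IS_A = {
    "infectious disease": ["disease"],
    "bacterial infection": ["infectious disease"],
    "viral infection": ["infectious disease"],
    "influenza": ["viral infection"],
    "tuberculosis": ["bacterial infection"],
    "fracture of femur": ["disease"],
}

def descendants_or_self(root, is_a):
    """Return root plus every concept that is a direct or indirect subtype of it."""
    # Invert the child -> parents map so the hierarchy can be walked downwards.
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:  # guard against concepts with several parents
                seen.add(child)
                queue.append(child)
    return seen

infectious = descendants_or_self("infectious disease", IS_A)
print(sorted(infectious))
```

The resulting set plays the role of the list of around 10,000 concepts: it is computed once, before any patient data is touched.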

Rather than examine the billions of observations in total, it may be beneficial to use some form of index or relationship on the observation.

In a relational query this would mean joining the 10,000 concepts (likely to be in a table prepared for the query) against the observations via an index on concept.
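A minimal sketch of this relational pattern, using Python's built-in sqlite3 module, is shown here. The table and column names (observation, query_concept, effective_date) and the sample rows are assumptions made for the example, not the Discovery Data Service schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE observation (
        id             INTEGER PRIMARY KEY,
        patient_id     INTEGER NOT NULL,
        concept        TEXT NOT NULL,
        effective_date TEXT NOT NULL
    );
    -- The index on concept lets the join avoid a full scan of observation.
    CREATE INDEX observation_concept_idx ON observation (concept, effective_date);
    -- Table prepared for the query, holding the expanded concept list.
    CREATE TABLE query_concept (concept TEXT PRIMARY KEY);
""")

conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?)", [
    (1, 1, "influenza", "2020-01-15"),
    (2, 1, "fracture of femur", "2021-03-02"),
    (3, 2, "tuberculosis", "2005-06-30"),
    (4, 3, "tuberculosis", "2019-11-11"),
])
conn.executemany("INSERT INTO query_concept VALUES (?)",
                 [("influenza",), ("tuberculosis",)])

# Join the prepared concept list against observations, then filter by date.
rows = conn.execute("""
    SELECT o.id, o.patient_id, o.concept
    FROM query_concept AS q
    JOIN observation AS o ON o.concept = q.concept
    WHERE o.effective_date >= ?
    ORDER BY o.id
""", ("2015-01-01",)).fetchall()

print(rows)
```

Each concept in the prepared table results in one search of the concept index, so the cost grows with the size of the concept list and the number of matching observations rather than with the total number of observations.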

In a graph query there are two options, depending on how concepts are related to the observations:

a) If a concept is a property on the observation node, then the list of 10,000 concepts would be matched against a concept index on the observation nodes.

b) If a concept is linked via a relationship to the observations then the relationships between each of the 10,000 concepts and their observations would be used directly.
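The two options can be contrasted with a small in-memory sketch in Python. The data, the ConceptNode class and the function names are hypothetical, and plain dictionaries and lists stand in for the property index and the stored relationships of a real property graph engine.

```python
# Hypothetical observation nodes, keyed by node id.
observations = {
    1: {"concept": "influenza"},
    2: {"concept": "fracture of femur"},
    3: {"concept": "tuberculosis"},
}

# Option a: the concept is a property on the observation node,
# and a separate property index maps concept -> observation ids.
concept_index = {}
for obs_id, props in observations.items():
    concept_index.setdefault(props["concept"], []).append(obs_id)

def match_by_property(concepts):
    """Look each concept up in the property index."""
    return sorted(o for c in concepts for o in concept_index.get(c, []))

# Option b: each concept is itself a node that holds relationships
# pointing directly at the observations in which it was recorded.
class ConceptNode:
    def __init__(self, name):
        self.name = name
        self.recorded_in = []  # ids of linked observation nodes

concept_nodes = {}
for obs_id, props in observations.items():
    node = concept_nodes.setdefault(props["concept"], ConceptNode(props["concept"]))
    node.recorded_in.append(obs_id)

def match_by_relationship(nodes):
    """Follow the relationships stored on each concept node."""
    return sorted(o for n in nodes for o in n.recorded_in)

wanted = ["influenza", "tuberculosis"]
print(match_by_property(wanted))
print(match_by_relationship([concept_nodes[c] for c in wanted]))
```

In this toy version the two options differ only in where the list of observations is kept: in a separate index (option a) or on the concept node itself (option b). Both return the same observations.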