Resolution health query language

From Discovery Data Service
Jump to navigation Jump to search

Background

Data is complex, and getting data from data is even more complex.

In order to answer questions relating to this vast pool of data, it is necessary to extract some data and transform it into a form that can be used to provide the answer. This is query.

The current mainstream languages that a computer can understand require great skill to use. These query languages originated from the need to provide a “relatively” easy to understand means of accessing data from a data store. By “easy to understand” this means easy for people with mathematical or programming backgrounds to understand, not the 99% of people who don’t have technical backgrounds. In essence this rules out 99% of people

A different approach is needed for at least a proportion of the other 99%. The usual approach is to use an application that enables mere mortals to build queries. Most of these query applications build the query definitions using some form of underlying language for example DAX or M. These languages tend to be tailored towards the needs of the application and the underlying data formats.

The gap between a layout of an understandable query and an underlying mainstream query language such as SPARQL or SQL is large. A useful approach would be to have a "halfway house" that models the logic of a query and can map directly to text or graphics that a lay person can understand, and also map directly to a process that can use a mainstream query language.

This is the purpose of the health query language.

The world’s store of digital data can be considered as a graph of nodes and connections. Properties and relationships form the connections and the objects form the nodes.

To make sense of the trillions of nodes and connections a model of the world’s data is helpful. A data model of a graph can be considered as a graph of types. In a model each node, relationship, or property in the data store is given a type. A data model can be created as a graph of types, bound where appropriate to value sets or classes from an ontology.

We use SHACL as the language to represent these graph types as SHACL shapes. SHACL shapes are generally used to model data but as predicates are also verbs, they can also be used to model process.

Thus the query language is merely some additional vocabulary that indicates the processes that a query should follow and the data model the query should operate on.

A model of a query (which is the definition of a query) is a graph structured in a way that conforms to a number of shapes or (classes) . An application that helps a user construct a query would result in a conformant query graph and a conformant query graph would be used as the source for the display. The query graph would be used as the source to generate the run time query syntax, itself probably generated from query templates e.g. SPARQL query templates.

Overall approach

In Resolution, the approach taken towards this query involves the generation of a result graph (or target graph) from a source data graph via a series of step. An actual data set is then produced in any form the user wishes e.g. Relational files or database.

The result graph contains data from the source graph but may also contain newly created nodes and connections generated by functions..

To model the generation of a graph from a graph it is necessary to describe a set of instructions as to what the resulting graph should contain, and from what parts of the source graph the result graph is generated from, and what manipulation on the source data is needed.

To make it easier for someone to construct the instructions, a tool to enable the instructions to be generated is required and this tool needs some things:

  1. A model of the main data graph, which the query will operate on. A Data Model
  2. A model of the process involved in generating the graph. A Query Model.
  3. A user interface supporting a user experience commensurate with the knowledge and skill set of the person asking the question.
  4. Artefacts, examples and statistics to aid above e.g. query term library, templates, ontologies, concept sets etc.
  5. A means by which a computer can follow the instructions i.e. an interpreter to a mainstream query language.

Challenges

A small graph is normally easy to understand. For example, the fact that a patient had a consultation with a doctor is easy to visualize as 3 nodes and two connections. As follows:

A 3 node representation of a consultation

The problem arises when things get complex, and most questions have a degree of complexity that creates challenges when using plain language to describe.

For example,

“Persons (patients) who at any time have been registered with a London GP practice EXCLUDING those whose latest registration status is either 'left' or 'died' before 1st Jan 2019”

This is relatively unambiguous and complete. However, it’s a bit tricky to interpret. This logic can be presented in a number of different ways, all meaning the same thing. For example, it could also be presented as:

“Patients who have been registered with a London GP practice at any time, but if they left or died before 1st Jan 2019 AND have NOT subsequently been registered again, will NOT be included”

Pretty obscure logic which requires understanding of Boolean operators and implicit nesting.

A way of eliminating the need for deep understanding and mental “subquery” is required.

Working example

Take the above two narratives as a starting point. There are clearly several steps needed in order to simplify. The logic is complex because if a patient left and came back after 2019 and left again they should still be included.

Furthermore, they may also have been registered in other practices so this needs to be checked also.

Logical text

The totality of the steps appear more complex than the initial statement but more properly reflects the logic.

The step approach can be illustrated as follows. This is a textual form example, but could also be diagrammatic

Step 1 gets any patient registered with a GP in London

Step 1 GET Patient

Where Patient has a registration history,

with registration status of Registered ,

at organisation of type = GP Practice,

commissioned by = London CCG

Call this London Registered At Some Time

Step 2 looks for and adds in the latest registration status for these patients noting only those statuses in London practices

Step 2 ADD to step 1 (London Registered at some time)

Latest : registration history/date

Where Patient has a registration history

at organisation of type = GP Practice,

commissioned by = London CCG

Call this Latest London GP Registration

Step 3 Keep only patients from step 2 if their latest status was left before 2019.

Step 3 KEEP FROM Step 2 (LatestLondonGPRegistration)

Where has registration history,

Status IS Left or died

Date before 01/01/2019

Call this Latest is Gone before 1-1-2019

Step 4 simply removes the patients in step 3 from step 1 to yield the result

Step 4 FROM Step 1 ((London Registered At Some Time)

EXCLUDE Step 3 (Latest is Gone before1-1-2019) Call the result : Patients registered at some time with a gp unless they left before 2019 and never returned.

Run time Query

The above logic, being mapped to the query model can then be interpreted to a standard query language. In this example SPARQL is used but could be SQL

INSERT {GRAPH tmp:LondonRegisteredAtSomeTime {     #create a temporary graph

        ?mainSubject rdf:type im:Patient} }  

WHERE {GRAPH im:Discovery {                       #from the source data  discovery graph

   ?mainSubject rdf:type im:Patient.                  #patients

   ?mainSubject im:registrationHistory ?ob1.          #has registration history

   ?ob1 im:status ?ob2.                               #with registration status

   ?ob2 im:memberOf im:CSET_Registered.               # of registered

   ?ob1 im:organisation ?ob3.                         #at organisation<

   ?ob3 rdf:type im:GPPractice.                       # of type GP Practice

   ?ob3 im:commissionedBy :LondonCCG.                 #commissioned by London CCG } }

The additional steps use the graph and the source graphs with very simple joins to build the result


Query model classes

The query model classes, as all classes can be, can be modelled in graph , using SHACL. A visualisation of these classes could be

Special clauses and functions

Some health care queries such as the above example, imply the use of derived tables (SQL) or subqueries. In the above example "latest" means latest for that patient which means that for every patient, events must be sorted and ordered with the most recent selected.

There are several ways of doing this, and this can get more complex when the request is for the latest 2.

The approach taken is to extend the model to include these functions and map them to templates built from the query languages. For example: Step 2 looks for and adds in the latest registration status for these patients noting only those statuses in London practices

Step 2 ADD to step 1 (London Registered at some time)

Latest : registration history/date

Where Patient has a registration history

at organisation of type = GP Practice,

commissioned by = London CCG

Call this Latest London GP Registration

The term "latest" implies some operation on the patient registration history when defining this in a query language such as SPARQL this would map to a sub-select

INSERT {GRAPH tmp:LatestLondonGPRegistration {
        ?mainSubject im:latestDate ?recent.                #gets the main patient from step 1 graph
           }}
 WHERE {
 SELECT ?mainSubject (max(?date) as ?recent)               #sub select to project the maximum date
 where { 
   GRAPH tmp:LondonRegisteredAtSomeTime {                  #gets the patients from step 1
	?mainSubject ?p ?o
         }
  GRAPH im:Discovery {                                    #gets the dates
   ?mainSubject im:registrationHistory ?ob4.              
    ?ob4 im:date ?date.
    ?ob1 im:organisation  ?ob3.
    ?ob3 rdf:type im:GPPractice.
            
        }
  }
  group by ?mainSubject                                   #groups by patient

This would be developed as a SPARQL template with the interpreter using a function to pass in the triples, the grouping and function.

Run time optimisation

One of the advantages of a logical model abstracted from the run time is that the run time queries can be optimised.

In the above example, the 4 steps could be combined into a single query by extracting the intersections and the subquery.

This can occur independently of the query definition which can remain static. Likewise different techniques can be used for different purposes such as Window functions in SQL.