Information model query

From Discovery Data Service
Jump to navigation Jump to search

Background to IMQ

Information model query language is designed to operate as an intermediary between plain language and the underlying run time query languages. The language supports query of both the information model and any health records that map to a data model as defined within the IM.

As an RDF Graph knowledge base, the information model could (and can) be directly queried using SPARQL. The IM also holds text data which can be queried directly using open Search or elastic.

However, as Health records are likely to be stored as relational, or at least SQL compatible data bases, querying health records that are aligned with the model will require SQL to query them.

Consider the following issues with SQL/SPARQL:

  • Directly authoring SQL and SPARQL languages require a high degree of skill and health query in particular needs heavily nested subqueries , including some of the more advanced techniques such as correlated query or window functions.
  • Translating a user oriented intuitive query builder into SQL or SPARQL directly and in reverse is very difficult. Most query applications use an intermediate language from which the queries are then generated. Examples include GraphQL or Power BI DAX and M.
  • Enabling direct query via SPARQL end points or SQL APIs can result in crippling performance problems and are hard to police.

Consequently the IM provides a pragmatic Query domain specific language (DSL) to help bridge the gap between a plain language representation and the run time query. This DSL, being in plain JSON, can be used to exchange query definitions across multiple instances via standard REST APIs.

The IM, as open source, also provides reference software showing how SQL or SPARQL or OpenSearch Query can be generated from the DSL and how a plain language or diagrammatic interpretation can be produced from the DSL. The reference software also shows the converse i.e. the generation of the DSL from plain language.

It is not expected that the language must be understood by clinicians, or even those familiar with SQL. Instead the purpose of the language is to enable interoperability of query definition and use across standard REST APIs. However, the logic of a query definition precisely follows the logic of a plain language description of the output required and the criteria to be applied. Thus IM query is one way of fully documenting the logic of health query, in a human and machine readable format.

The language is designed to meet the following requirements

Query language requirements

Requirement 1 - Should support the vast majority of query patterns for defining and producing data sets or patient profiles that are needed in the real world.

Requirement 2 - Should enable mapping to SQL via simple type-table, property- field maps, for health data held in relational forms, as long as the health data content conforms to an IM data model

Requirement 3 - Should enable mapping to SPARQL and Elastic and open search directly, for querying the IM itself

Requirement 4 - Should enable mapping directly from and to Snomed Expression constraint language (ECL) for searching and set definitions.

Requirement 5 - Should enable a technical query to be built as Java Script or POJO objects avoiding the need for a language specific parser.

Requirement 6- Should embed inference statements such as subtype, super type, or set inclusion as part of the query definition, thus avoiding the need for explicit modelling of the complex logic in the query itself

Requirement 7 - Should support object result format as well as relational format i.e. nested json object results as well as flat table results akin to GraphQL,

Requirement 8 - Should follow a logical plain but logical language definition of the output and criteria to be applied but there is no requirement for the language itself to be understandable by non technical users. Plain logical language is distinct from natural language, the latter being prone to ambiguity.

Requirement 9 - Should support at least one plain language and at least one diagrammatic representation of the query that can be understood by lay user.

In other words the requirement is not only to create an intermediate query model but to include a translator to and from an understandable language or diagrammatic user interface.

Language overview

The language follows the familiar pattern of most query languages with constraints. As well as incorporating the core concepts of SQL and SPARQL it includes the nested structure approach as used by GRAPHQL.

The syntax uses JSON and therefore is somewhat verbose compared with a succinct language but In return there is less ambiguity and conforms to a standard object representation.

A query consists of the following main clauses:

QueryRequest : The query run time wrapper holding the run time parameters such as reference date, paging, and contains the query either as a reference IRI to a previously authored query or the inline query itself. A Query Request contains one query

Query : Includes the iri, name, description, result format, use of prefixes, the main entity, Select clause, sub-select clause (where a query produces many column groups). Query contains one select, but as selects contain sub-selects, complex multiple queries against a single entity type can be constructed as a single select.

Select : Equivalent to SQL and SPARQL SELECT with GraphQL nesting. Includes a list of properties and aliases as well as any nested properties i.e. select property , select and a match clause and any ordering or limit. Select contains properties to select, one or more match clauses (implied and) or one or more subset selects.

Match : Equivalent to SQL FROM/WHERE and SPARQL WHERE/FILTER. Defines the graph patterns filters or functions including ordering and limit followed by a test i.e. subquery. Match clause contains match Boolean operators (and./or/not) , and subquery, with any number of levels.

Order/ Limit/Offset : used in both select and match, Orders by a field and optionally limits return to a number of entities. . Often used for paging but the commonest use in within a match clause to match the most recent entry of something that matches, in order to support another match on the result.

The language grammar and syntax can be represented as a formal grammar (ABNF) or Json schema. this article presents the grammar by plain language example and goes on to the more formal specification of the language.

IM Query API

The information model manager provides a REST API for querying the IM or validating health queries. As the preferred approach to API is to exchange JSON, IMQ uses the Json syntax and because entities are represented as RDF URIs, Json-LD is the preferred format.

N.B. The difference between JSON-LD and plain JSON is the representation of ontology IRIS. The JSON-LD context object enables prefixed iris for ease of reading and iri references are presented as objects with the "@id" : "http://.." format

Within the body of a query the query predicates or "key words" are represented by their local names, making it easier to map to JS or Java objects.

IMQ by example

This section takes a set of common snippets of plain language query and shows the DSL equivalent.

Examples start with simple patterns and progress to complex temporal patterns of the kind used in real world query.

For convenience data IRIs are represented by their labels.

Simple queries

Two properties of persons.

Get me the NHS number and age in years of all patients.

  1. Select things of type person,
  2. Select their NHS number,.
  3. Select their age

(this being a "function property" that is defined in the IM as the time

difference between the date of birth and a reference date, with the units of YEARS

IMQ example
IMQ Result (json)
"select" :{ 
  "entityType" : {"@id" : "Person"},
  "property" [ {
    "@id" : {"@id": "nhsNumber" } ,
   {
    "@id" : {"@id": "age"},
    "argument": [ { "unit" : "YEARS" }
}]}
[ {

    nhsNumber : 1234567890,
    age : 44},

[ {
    nhsNumber : 0987654321,
    age : 14}]]

Two properties of persons, this time as CSV

IMQ Example
IMQ Result csv)
{"query" : {
"resultFormat" : "CSV",
"select" :{ 
  "entityType" : {"@id" : "Person"},
  "property" [ {
    "@id" : {"@id": "nhsNumber" } ,
   {
    "@id" : {"@id": "age"},
    "argument": [ { "unit" : "YEARS" }
}]}}

nhsNumber , age

1234567890, 44

0987654321, 14


The same DSL can be used to query the IM itself.

Get me the code and term of 'body structures' that match

the phrase "ches wall"

Input term "ches wall"

get the first page of length 10

that are subtypes of the concept "body structure"

IMQ Example
IMQ Result json
{"queryRequest" : {
  "textSearch" : "ches wall",
   "page" : 1, "pageSize": 10,
  "query" : {
    "select" :{
      "entityId" : {
         "@id": "Body structure",
        "includeSubtypes" : true},
      "property" : [ { 
          "@id" : "code", "alias" : "code"},
          "@id" : "label","alias" : "term"}]}}}
{[
  code : 244237006
  term : Chest wall artery },
{
   code : 78904004
   term  : chest wall structure ,

Cohort definition

Get me patients registered as a GMS patient with a general practice on the reference date of the query

  1. things of type person
  2. Their GP registration entries that consist o
  3. a regular GMS patient type
  4. a start date before the reference date
  5. either no end date (still registered) or an end date after the reference date (were registered at the time)
IMQ Example
IMQ Result json
{"select" : {
  "entityType" : { "@id" :"Person"},
  "match" : [ {
    "pathTo" : [ {"@id": "isSubjectOf"}],
    "entityType" : { "@id" : "GPRegistration"},                        
    "property" : [ {
      "@id" : "patientType",
      "isConcept" : {"@id" : regular patient}},                           
      {    
      "@id" : "effectiveDate",
      "value" : { 
         "comparison" : "LESS_THAN_OR_EQUAL",          
          "valueData" : "$referenceDate"}}], 
    "orProperty" : [ {                                            
      "notExist" : true,
      "@id" : "endDate"},                                           
      {
      "@id" :"endDate",
      "value" : {                                            
         "comparison" : "GREATER_THAN",
         "valueData" : "$referenceDate"}}]}}
{[
  {id : 1},
  { id : 2}]

The date range 'latest from and test' pattern

Get me patients whose latest systolic blood pressure within the last 18 months was taken in the surgery and was > 140.

  1. Select things of type person
  2. Look for observations and check for
  3. a systolic blood pressure
  4. within 18 months prior to the reference date,
  5. Then sort and select the latest
  6. Test whether it is office based (and not therefore not home based)
  7. Test to see if it is over 140


N.B only the match clause is illustrated here. The outer select and boolean match clauses would be present in the real query.

IMQ Example
IMQ Result json
 1 {"match" : {
 2   "pathTo": [ {"@id" : "isSubjectOf"}],
 3   "entityType" : {"@id": "Observation"},
 4   "property": [ {
 5     "@id" : "concept",
 6     "inSet" : {"@id" : "Systolic arterial measurements"},
 7     {
 8      "@id" : "effectiveDate",
 9      "function" : {
10         "@id" :"TimeDifference", 
11         "argument": { [{
12           "first date":  "$this"}, 
13         { "second date" :"referenceDate"},
14         { "units" : "MONTHS"}]},
15      "value" : {
16         "comparison" : "greater or equal", 
17         "valueData" : -18}],
18      "orderBy" : { 
19          "@id" :"effectiveDate",
20          "direction" : "descending",
21          "count 1"},
22      "testProperty" [ {
23         ":@id" :"concept",
24         "inSet" : { "@id" : "Office based systolic blood pressures"}},
25         { 
26          "value" : { "comparison" : "greater", "valueData" : 140}}]}]}
{[
  {id : 1},
  { id : 2}]

Expression constraint ECL

Get me concepts that are oral none steroidal ant inflammatory products

ECL for which is

<<763158003 | medinal products :

<<127489000 : has active ingredient = <<372665008 | none steroidal anti inflammatory agent,

<<411116001 : manufactured dose form = <<385268001 | oral

IMQ Example
IMQ
{"match" : [{
  "entityId" : { 
      "@id" : "medicinal product",
      "includeSubtypes": true},
   "property" : [{ 
     "@id" : "has active ingredient",
     "includeSubtypes" : true,
     "isConcept" : {
          "@id" : "non steroidal anti inflammatory agent",
          "includeSubtypes" : true}},
     {
       "@id" : "manufactured dose form",
       "includeSubtypes" : true,
        "isConcept" : {
             "@id" :"oral",
              "includeSubtypes" :  true} ]}

Data Sets

Get me a data set for a cohort of patients with 120 fields grouped into 50 groups

This is a line level report from a list of patients, with data items from the health record filtered by criteria.

(Only the first couple are shown for brevity)

  1. Select things of type person,.
  2. must be in the search result "Cardiovascular bleed search"
  3. Patient details : Full name, age
  4. Aspirin details : date, medication for Aspirin

a) issued in the last 6 months,

b) still an active authorised medication.

c) Get the latest

IMQ example
IMQ
{"query": {
   "select" :{
      "entityType" : {"@id" : "Person"},
      "match" : [{ "entityInSet" : { "@id": "Cardiovascular bleed search"}}],
      "subselect" : [{
         "name" : "patient details",
          "property" : [ {
              "@id": "fullName"},
              {"@id" : "age"},
           {
            "name" : "Aspirin",
           "pathTo" : "hasEntry",
            "property" : [{
               "@id": "effectiveDate","alias" : "date of issue"},
               {"@id": "medication",
                "select" : {
                   "property" : [{"@id": "rdfs:label","alias" : "name"}]}}],
          "match" :[{ 
            "entityType": { "@id" : "medicationRequest"},
             "property": [{
	       "@id" : "medication",
               "isConcept" : { 
                  "@id" : "Aspirin products",
                  "includeSubtypes" : true},
              {
               "@id" : "effectiveDate",
                "value": {
                    "comparison": "LESS_THAN_OR_EQUAL",	"valueData": 6},
                 "function": {
                    "@id": "Time Difference",
		       "argument": [
                        {
                          "parameter": "units", 
                          "valueData": "MONTH"},
			{
                           "parameter": "firstDate", 
                          "valueVariable": "$this"},
			{
                          "parameter": "secondDate", 
                          "valueVariable": "$referenceDate"}]},
               {
               "pathTo": [{"@id": "authorisation"},
               "@id": "course Status",
	       "isConcept": {"@id" : "Active"}}],
	      "orderLimit": {
                 "orderBy": {"@id": "effective Date"}}]}}

Connections between entities

A common use case is to find out whether there is a connection between an entity and another entity.

For example, We wish to support the idea that an enquirer, when building a query,  will be able to enter a term which matches any concept in the information model. Once having selected the concept, they may have one of the following intentions:

a)      The concept is a thing of interest, and they want to know something about it and in what way it may be related to health record entries.

b)     They have an intention to report on some record entries that hold something relating to the concept but have not yet refined their query i.e. want the application to help.

We have decided to support these by helping the user construct a query of health records that are in someway related to the selected concept, whilst at the same time illustrating how the concept they have selected relates to record entries.

Examples of things they may enter  are “Atenolol”, “date of birth”, “post code”, “Angina pectoris”, “beta blockers”.

There are some basic objectives for the application when helping users build a query. These are:

1.       We wish to constrain query to a finite set of entity types, for example a patient or organisation. This is for performance reasons and based on the assumption that most business requirements involve the identification of properties relating to a certain entity type. We refer to this type as the “main entity” and we have decided, based on experience, that there may be around half a dozen of these.

2.       We wish to present to the user a pretty much preformed query definition as quickly as possible, allowing them to tweak the query a bit e..g. by expanding.  This query definition will also illustrate the relationship between the main entity and Atenolol.


The use case has some special features:

a)      We wish to include the data model entities and data model properties in the path

b)     The concept may be a property of a a data model entity, a concept that is a member of a value set against a property in the data model, or may be the value of a property of a concept.

c)      If the selected entity is a concept property concept, We DO NOT wish to include the concepts that have the properties as there may  be too many. For example, many drugs contain Atenolol so we don’t want to list them.

d)     We DO NOT need to include the value set identifiers in the response as the data model properties will suffice. In other words, a property “type of observation” points to the value set for observations, so including the value set would be pointless.

e)      We DO wish to include the properties that link a concept property value to the  concept who are in the value sets. For example we do may wish to look for types of medication that have active ingredients of atenolol.

f)       If the concept we are looking for is a value of a concept property and there is a super property in common only show the super property. For example, active ingredient is a super property of specific active ingredient and precise active ingredient.

Examples may be

Selected concept Data model relationships
Atenolol Patient -> has entry -> Prescription - >type of medication -> has active ingredient- Atenolol

Patient -> has entry -. Allergy -> type of allergy -> causative agent - >atenolol

Date of birth Patient

Practitioner in role

Post code Patient -> address- > Address

Patient - > provider -> Organisation - >address - >Address

Address Patient -> address


The UI can then use this information to offer clarifications to the user and auto construct a query and concept set.

The following examples considers how to get from one of the main entities (e.g. patient) to a constituent of a drug.

Health records record actual or virtual products so the system needs to link from a patient through to the types of record that record products to products that have active ingredients of atenolol.

IMQ example
IMQ
  {
  "query" : {
    "name" : "Path query- Main record types to Atenolol - substance",
    "select" : {
      "entityId" : {
        "@id" : "http://endhealth.info/im#MainRecordType"
      },
      "pathToTarget" : {
        "name" : "atenolol substance",
        "depth" : 3,
        "@id" : "http://snomed.info/sct#387506000"
      }
    }
  }
}

Language grammar

The grammar presented here, roughly following ABNF format, omits the JSON syntax overlay i.e. assumes the implementation of the tokens and data is in JSON-LD syntax.

Rule Subrules or lexical pattern Explanation
queryRequest textSearch

page

pageSize

argument+

queryIri | query

textSearch {characters} text search term which may be a word a phrase and partial words.

query
































Grammar

This is the section on grammar