Health Information modelling language - overview

From Endeavour Knowledge Base
This article describes the languages used for creating, querying and maintaining the information model, as well as the means by which health record queries can be defined in a system independent manner.

== Purpose and scope of the language ==

For those familiar with the semantic web languages, it is safe to assume that the language described here is simply a dialect (or profile) of the standard W3C recommendations RDF, RDFS, OWL2, SHACL and SPARQL. The profile is designed to simplify implementations of the model and to constrain its features to ensure relevance and optimum performance. The language is designed to support an information model conceived as a graph, and thus the model and related health data are represented logically as a graph.

It should be noted that this is the modelling language, not the physical data schema or actual query language. The modelling language uses logical constructs that an interpreter converts to the relevant schema or query. If the actual implementation supports the modelling languages themselves as low level languages (e.g. RDF or SPARQL), then the logical model and the physical model are the same.

As the information model is an RDF graph, the modelling language uses the mainstream semantic web languages. In addition there is a pragmatic JSON-LD based domain specific language for query definition, which is a constraint of, and specialisation of, SPARQL with support for GraphQL style queries.

The language includes description logic, shape constraints, expression constraints, and a pragmatic approach to modelling query of real data.

Before tackling the language it is worth reading the article on [[Health Information modelling|information modelling]].

Details of the W3C standard languages that make up the grammar are described below.
== Purpose, background and rationale ==
Question: Yet another language? Surely not.

Answer: No, not at all. What can be said, though, is that it is not a health domain language. It is a language used commonly across all sectors across the world, applied to the health and social care domain.

One of the ambitions of the language is to free the representation of health data from the modelling silos within healthcare itself. This is based on the principle that the only difference between health data and other data is the vocabulary and ontologies used.

The main purpose of a modelling language is to exchange data and information about [[Discovery health information model|information models]] in a way that machines can understand. It is not expected that clinicians would use the languages directly. The use of standard languages ensures that the models are interoperable across many domains, including non healthcare domains.

The language enables the information model to be readable by machines. It covers all areas of the model, using different grammars for the different approaches. The different parts of the IM can be listed as follows; this article covers item 8, which models items 1 to 7:


# An ontology, which is a vocabulary and definitions of the concepts used in healthcare, or more simply put, a vocabulary of health. The ontology is made up of the world's leading ontology, Snomed-CT, with a London extension, supplemented with additional concepts for data modelling.
# A data model, which is a set of classes and properties, using the vocabulary, that represent the data and relationships as published by live systems that have published data to a data service that uses these models. Note that this data model is NOT a standard model but a collated set of entities and relationships bound to the concepts, based on real data, that are mapped to a single model.
# A library of business specific concept and value sets, which are expression constraints on the ontology for the purpose of query.
# A catalogue of reference data, such as geographical areas, organisations and people, derived and updated from public resources.
# A library of queries for querying and extracting instance data from reference data or health records.
# A set of maps creating mappings between published concepts and the core ontology, as well as structural mappings between submitted data and the data model.
# An open source set of utilities that can be used to browse, search, or maintain the model.
# A modelling language using the world wide web semantic languages that can be used to exchange all elements of the model.
A language must be able to describe the types of structures that make up an information model in a way that both machines and humans can understand. Diagrams and pictures are all very well, but they cannot be used by machines. It is necessary to support both human and machine readability so that a model can be validated both by humans and by computers. A purely human language would be ambiguous, as all human languages are. A language that is both human and machine readable can be used to promote a shared understanding of often complex structures, whilst enabling machines to process data in a consistent way.

It is almost always the case that a very precise machine readable language is hard for humans to follow, and that a human understandable language is hard to compute consistently. As a compromise, many languages are presented in a variety of grammars and syntaxes, each targeted at different readers. The languages in this article all adopt a multi-grammar approach in line with this dual purpose.

Multi-grammars are the norm in computer languages. Different software implementations use different technologies, and different grammars are needed. The crucial point is that for systems to interoperate effectively, the different grammars must mean the same thing in the end; they are just presented with different syntaxes.
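The requirement that different grammars mean the same thing can be illustrated with a toy sketch: a single statement, parsed from a Turtle style line and from a JSON-LD style object, yields the same set of triples. Everything here is illustrative scaffolding, not part of the IM tooling.

```python
import json

def triples_from_turtle_line(line):
    """Parse one whitespace-separated 'subject predicate object .' line."""
    subject, predicate, obj = line.rstrip(" .").split()
    return {(subject, predicate, obj)}

def triples_from_jsonld(doc):
    """Flatten a one-level JSON-LD-style object into triples."""
    subject = doc["@id"]
    return {(subject, p, o["@id"]) for p, o in doc.items() if p != "@id"}

turtle = ":Patient rdfs:subClassOf :Person ."
jsonld = json.loads('{"@id": ":Patient", "rdfs:subClassOf": {"@id": ":Person"}}')

# Two syntaxes, one meaning: both parse to the identical graph.
assert triples_from_turtle_line(turtle) == triples_from_jsonld(jsonld)
print(triples_from_turtle_line(turtle))  # {(':Patient', 'rdfs:subClassOf', ':Person')}
```

Real parsers handle far more (blank nodes, lists, contexts), but the interoperability principle is exactly this: the parsed graphs must be identical.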
== Requirements and contributory languages ==
Health data can be conceptualised as a graph, and thus the model of health data is a graph model.

The language must be both human and machine readable in text form, using the recognisable plain language characters of UTF-8. For human readability the characters read from left to right; for machine readability a graph is a character stream from beginning to end.

When exchanging models, both JSON-LD and Turtle grammars are supported, as well as the more specialised syntaxes such as OWL functional syntax or expression constraint language.


The modelling language is an amalgam of the following languages:
* [https://www.w3.org/TR/REC-rdf-syntax/ RDF]. An information model can be modelled as a graph, i.e. a set of nodes and edges (nodes and relationships, nodes and properties). Likewise, health data can be modelled as a graph conforming to the information model graph. RDF forms the statements describing the data. RDF in itself holds no semantics whatsoever, i.e. it is not practical to infer, validate or query based purely on an RDF structure. To use RDF it is necessary to provide semantic definitions for certain predicates and adopt certain conventions. In providing those semantic definitions, the predicates themselves can then be used to semantically define many other things. RDF can be represented using either Turtle syntax or JSON-LD.
* [https://www.w3.org/TR/rdf-schema/ RDFS]. The first of the semantic languages. It is used for some of the ontology axioms, such as subclasses, domains and ranges, as well as the standard annotation properties such as 'label'.
* [https://www.w3.org/TR/owl2-primer/ OWL2 DL]. For the ontology. This brings more sophisticated description logic, such as equivalent classes and existential quantifications, and is used in the ontology and for defining things when an open world assumption is required.
* [https://www.w3.org/TR/shacl/ SHACL]. For the data models. Used for everything that defines the shape of data, or logical entities and attributes, where a closed world assumption is required. Although SHACL is designed for validation of RDF, because SHACL describes what things 'should be' it can be used as a data modelling language.
* [https://www.w3.org/TR/sparql11-query/ SPARQL]. Used as the logical means of querying model conformant data (not to be confused with the actual query language used, which may be SQL).
* RML, an extension of R2RML, used to map between RDF and other formats such as RDBMS, JSON or CSV.
* [[Information model query|IM Query Language]], used as a bridge between plain language and the mainstream query languages such as SQL or SPARQL.
 
 
A model presented in the human legible grammar must be translatable directly to the machine representation without loss of semantics.


'''Example (OWL2)'''


Consider a definition of a grandfather: a grandfather is equivalent to a person who is male ''and'' has children who are people that themselves have children.

Using the Turtle syntax:
<syntaxhighlight lang="turtle" style="border:3px solid grey">
:Grandfather
  owl:equivalentClass [
    a owl:Class ;
    owl:intersectionOf (
      :Person
      [ a owl:Restriction ;
        owl:onProperty :hasGender ;
        owl:someValuesFrom :Male ]
      [ a owl:Restriction ;
        owl:onProperty :hasChild ;
        owl:someValuesFrom [
          a owl:Class ;
          owl:intersectionOf (
            :Person
            [ a owl:Restriction ;
              owl:onProperty :hasChild ;
              owl:someValuesFrom :Person ] ) ] ] )
  ] .
</syntaxhighlight>


JSON is a popular syntax and is therefore supported as an alternative. JSON represents subjects, predicates and objects as object names and values, with values being either literals or objects. JSON itself has no inherent mechanism for differentiating between different types of entities, and therefore JSON-LD is used. In JSON-LD identifiers resolve to @id, and the use of @context enables prefixed IRIs and aliases.
The above Grandfather can be represented in JSON-LD (context not shown) as follows:<syntaxhighlight lang="json-ld" style="border:3px solid grey">
{ "@id" : ":Grandfather",
  "owl:equivalentClass" : [ {
    "owl:intersectionOf" : [
      { "@id" : ":Person" },
      { "owl:onProperty" : ":hasGender",
        "owl:someValuesFrom" : { "@id" : ":Male" } },
      { "owl:onProperty" : ":hasChild",
        "owl:someValuesFrom" : {
          "owl:intersectionOf" : [
            { "@id" : ":Person" },
            { "owl:onProperty" : ":hasChild",
              "owl:someValuesFrom" : { "@id" : ":Person" } } ] } } ] } ] }
</syntaxhighlight>
 
== Grammars and syntaxes ==


=== Foundation grammars and syntaxes - RDF, TURTLE and JSON-LD ===




==== Identifiers, aliasing, prefixes and context ====
Concepts are identified and referenced by the use of international resource identifiers (IRIs).


# MAPPING CONTEXT definitions for system level vocabularies. This provides sufficient context to uniquely identify a local code or term by including details such as the health care provider, the system, and the table within a system. In essence, a specialised class with the various property values making up the context.
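The aliasing described above can be sketched as a simple context lookup, in the style of a JSON-LD @context: an alias maps to a namespace IRI, so a compact identifier expands to a full IRI. The namespace IRIs below are placeholders for illustration, not the IM's actual namespaces.

```python
# Hypothetical context: alias -> namespace IRI (illustrative values only).
context = {
    "sn": "http://snomed.info/sct#",
    "im": "http://example.org/im#",
}

def expand(curie, ctx):
    """Expand a prefixed identifier (CURIE) using a JSON-LD style context."""
    prefix, _, local = curie.partition(":")
    # Unknown prefixes are returned unchanged, as a plain identifier.
    return ctx[prefix] + local if prefix in ctx else curie

print(expand("sn:71388002", context))  # http://snomed.info/sct#71388002
```

The same mechanism run in reverse produces the compact, human-legible form used throughout the examples in this article.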


=== Ontology - OWL2 and RDFS ===

For the purposes of authoring and reasoning, the semantic ontology axiom and class expression vocabulary uses the tokens and structure from the OWL2 profile [https://www.w3.org/TR/owl2-profiles/#OWL_2_EL OWL EL], which is itself a sublanguage of the [https://www.w3.org/TR/owl2-syntax/ OWL2 language].

In addition to the open world assumption of OWL, the RDFS constructs of domain and range are used in a closed world manner.

Within an information model instance, data relationships are held in their post-inferred closed form, i.e. inferred properties and relationships are explicitly stated, using a normalisation process to eliminate duplications from supertypes. In other words, whereas an ontology may be authored using the open world assumption, classifications and inheritance are resolved prior to population of the live IM. This follows the same approach as Snomed-CT, whereby the inferred relationships containing the inherited properties, and the "is a" relationship, are included explicitly.

In the live IM many of these constructs are replaced with the RDFS standard terms and simplified. For example, OWL existential quantifications are mapped to "role groups" in line with Snomed-CT.

In addition, some standard OWL2 DL axioms are used to provide a means of specifying additional relationships that are of value when defining concepts. The following table lists the main OWL constructs used, an example of each, and the conversion used in the live IM. Note that aliases are used for brevity; please refer to the OWL2 specification for their meanings.
{| class="wikitable"
|+
!OWL construct
!Usage example
!IM live conversion
|-
|Class
|An entity that is a class concept, e.g. a Snomed-CT concept or a general concept
|rdfs:Class
|-
|ObjectProperty
|'hasSubject' (an observation '''has a subject''' that is a patient)
|rdf:Property
|-
|DataProperty
|'dateOfBirth' (a patient record has a date of birth attribute)
|owl:DatatypeProperty
|-
|AnnotationProperty
|'description' (a concept has a description)
|
|-
|SubClassOf
|Patient is a subclass of Person
|rdfs:subClassOf
|-
|EquivalentTo
|Adverse reaction to Atenolol is equivalent to an adverse reaction to a drug AND has causative agent of Atenolol (substance)
|rdfs:subClassOf
|-
|DisjointWith
|Father is disjoint with Mother
|
|-
|SubPropertyOf
|'has responsible practitioner' is a subproperty of 'has responsible agent'
|rdfs:subPropertyOf
|-
|Property chain
|'is sibling of' / 'is parent of' / 'has parent' is a sub property chain of 'is first cousin of'
|owl:propertyChainAxiom
|-
|Inverse property
|'is subject of' is the inverse of 'has subject'
|
|-
|Transitive property
|'is child of' is transitive
|
|-
|Existential quantification (ObjectSomeValuesFrom)
|Chest pain and finding site of {some} thoracic structure
|im:roleGroup
|-
|Object intersection
|Chest pain is equivalent to pain of truncal structure AND finding in region of thorax AND finding site of thoracic structure
|rdfs:subClassOf plus role groups
|-
|Individual
|All chest pain subclasses, but not the specific ''instance of acute chest pain''
|
|-
|DataType definition
|Date time is a restriction on a string, with a regex that allows approximate dates
|
|-
|Property domain
|A property domain of 'has causative agent' is allergic reaction
|rdfs:domain
|-
|Property range
|A property range of 'has causative agent' is a substance
|rdfs:range
|}
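The "IM live conversion" of an existential quantification into a Snomed-CT style role group, shown in the table above, can be sketched as follows. The dictionary layout is an illustrative assumption, not the IM's actual serialisation; the property names (owl:onProperty, im:roleGroup, etc.) follow the article.

```python
def to_role_group(restriction):
    """Rewrite an OWL someValuesFrom restriction as a role-group entry.

    Illustrative sketch only: real conversion also handles grouping of
    several restrictions into one role group and nested expressions.
    """
    return {"im:roleGroup": [{restriction["owl:onProperty"]:
                              restriction["owl:someValuesFrom"]}]}

# The chest pain example: finding site - some thoracic structure.
chest_pain_site = {"owl:onProperty": ":findingSite",
                   "owl:someValuesFrom": ":ThoracicStructure"}

print(to_role_group(chest_pain_site))
# {'im:roleGroup': [{':findingSite': ':ThoracicStructure'}]}
```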
 
{| class="wikitable"
|+
!Annotation
!Meaning
|-
|rdfs:label
|The name or term for an entity
|-
|rdfs:comment
|The description of an entity
|}

==== The ubiquitous 'is a' and inferred views ====
OWL has a number of ways of handling subtypes. The axioms "subclass of", "sub property of" and "equivalent class" all indicate subtypes. In the information model, in line with Snomed-CT, the predicate "is a" is used as the supertype relationship in order to simplify subtyping. In addition, 'is a' assumes that the ontology is classified, i.e. that from the stated axioms a subtype hierarchy has been produced by a reasoner, and it is this subtype hierarchy that 'is a' refers to.

In other words, for every stated subclass or equivalent class axiom an 'is a' relationship is generated, and this is used throughout the model in query.

Similarly, when querying for properties, the stated axioms can create complex sub property and sub range logic. Consequently, the 'inferred view' is assumed in query, i.e. a descendant concept inherits its properties directly, unless overridden by sub properties.

In practical implementations of the IM, both 'is a' and the inferred properties are explicitly modelled as direct relationships, enabling the ontology to be queried without the need for axioms within the query itself.

==== Use of annotation properties for original codes ====
Annotation properties provide information beyond that needed for reasoning. They form no part in the ontological reasoning, but without them the information model would be impossible for most people to understand. Annotation properties can also be used for implementation supporting properties such as release status, version control, and authoring dates and times.

Typical annotation properties are names and descriptions. They are also used for metadata, such as the status of a concept or the version of a document.

Many concepts are derived directly from source systems that used them as codes, or even free text.

The concept indicates the source and the original code or text (or combination) in the form actually entered into the source system. It should be noted that many systems do not record codes exactly as determined by an official classification, or provide codes via mappings from an internal id. It is the codes or text used, from the publisher's perspective, that are taken as the source.

Thus in many cases it is convenient to auto-generate a code, which is then placed as the value of the "code" property in the concept, together with the scheme. From this, the provenance of the code can be inferred.

Each code must have a scheme. A scheme may be an official scheme, a proprietary scheme, or a local scheme related to a particular subsystem.

For example, here are some scheme/code combinations:
{| class="wikitable"
!Scheme
!Original code/text/context
!Concept code/auto code
!Meaning
|-
|Snomed-CT
|47032000
|47032000
|Primary hydrocephaly
|-
|EMIS - Read
|H33-1
|H33-1
|Bronchial asthma
|-
|EMIS - EMIS
|EMISNQCO303
|EMLOC_EMISNQCO303
|Confirmed corona virus infection
|-
|Barts/Cerner
|Event/Order=687309281
|BC_687309281
|Tested for Coronavirus (misuse of code term in context)
|-
|Barts/Cerner
|Event/Order= 687309281/ResultTxt= SARS-CoV-2 RNA DETECTED
|BC_dsdsdsdx7
|Positive coronavirus result
|}
Note that in the last example the original code is actually text, and has been contextualised as being from the Cerner event table, with the order field having a value of 687309281 and the result text having a value of SARS-CoV-2 RNA DETECTED.
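The inferred 'is a' view discussed in this section can be sketched as a transitive closure over stated subclass axioms: every ancestor reachable through the stated hierarchy becomes a direct 'is a' relationship, so queries need no recursion. The concept names and function below are illustrative, not the IM implementation.

```python
from collections import defaultdict

def isa_closure(stated):
    """Return every (descendant, ancestor) pair implied by stated subclassing."""
    parents = defaultdict(set)
    for child, parent in stated:
        parents[child].add(parent)

    def ancestors(node, seen):
        # Walk upwards through stated parents, collecting every ancestor once.
        for p in parents[node]:
            if p not in seen:
                seen.add(p)
                ancestors(p, seen)
        return seen

    return {(node, a) for node in list(parents) for a in ancestors(node, set())}

# Hypothetical mini-hierarchy for illustration.
stated = [("AcuteChestPain", "ChestPain"), ("ChestPain", "Pain")]
print(sorted(isa_closure(stated)))
# [('AcuteChestPain', 'ChestPain'), ('AcuteChestPain', 'Pain'), ('ChestPain', 'Pain')]
```

Materialising this closure is the design choice described above: storage grows, but 'is a' becomes a single direct predicate at query time.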


=== Data model - SHACL ===
As in the semantic ontology, the language borrows constructs from the W3C shapes constraint language, SHACL, which can also be represented in any of the RDF supporting syntaxes such as Turtle or JSON-LD.


'''Example'''


=== Set definitions - SHACL/SPARQL ===
In line with expression constraint language, defining a set is a query over the ontology resulting in a set of concepts for use in a subsequent query. As the IM itself is held as RDF quads (triples plus graphs), it can be queried using SPARQL for graph query and Lucene (OpenSearch) for text query.

The information model uses SHACL to define the set metadata (e.g. the name of the set, the fact that it is a set) and SPARQL for the expression constraint.

SHACL has the idea of a "focus node", which is an RDF node that the shape validates. As the focus nodes of a set are all members of the class of concepts, there has to be a way of narrowing down the nodes from the millions of concepts. SHACL provides the idea of a 'custom target', which is a filter applied to the graph. A typical custom target is a SPARQL target, and therefore a SPARQL target is used to specify the constraint.
 
==== Example - complex set ====
Let's say a commissioner needs to know which patients have had Covid vaccines.

Covid vaccines are recorded either as immunisation records, or medication records, or both. To query the medication records, a set of vaccine medication concepts is searched for, these being stored in medication order record entries. Covid vaccines change every few weeks as new brands or strengths are released.

A definition of a Covid vaccine is therefore helpful, and so a concept set is defined:
<br /><syntaxhighlight lang="turtle">
im:CSET_CVSMeds
  a im:ConceptSet, sh:NodeShape ;
  rdfs:label "Covid vaccine study medication set" ;
  sh:target [                                        # Custom target for the shape
    sh:targetType sh:SPARQLTarget ;                  # Focus nodes are SPARQL results
    sh:prefixes im:, sn: ;                           # Prefixes used in the query
    sh:select """
      SELECT ?this                                   # Select the concept
      WHERE {                                        # where the concept ...
          { ?this im:isA sn:39330711000001103 }      # is a Covid vaccine
        UNION                                        # or
          { ?this im:isA sn:10363601000001109 .      # is a UK product, and
            ?this sn:10362601000001103 sn:39330711000001103 }  # has a VMP that is a Covid vaccine
      }
      """
  ] .
</syntaxhighlight>The above states that a Covid vaccine must be a subtype of either

a) a Covid vaccine product,

OR

b) a UK product whose virtual medicinal product is a Covid vaccine.

Note the direct use of the inferred 'is a' and property relationships, avoiding the need for complex axioms.
 
==== Example - set with exclusion ====
It is common to remove certain subclasses. OWL object complement cannot be used for this, due to the open world assumption.

Exclusion assumes something to exclude against (i.e. a set of things from which to exclude).

Let us assume that we define an event type of 'procedure', and we model a procedure as having a property of 'concept' whose range is the class of procedures as defined by Snomed-CT. Within Snomed-CT, immunisations are also classified as procedures, but within the data model they are classified as immunisations. Consequently, in the procedure value set we wish to exclude immunisations.<syntaxhighlight lang="turtle">
 
im:VSET_Category_Procedures
  a im:ValueSet, sh:NodeShape ;
  rdfs:label "Value set - Procedures" ;
  sh:target [                                        # Custom target for the shape
    sh:targetType sh:SPARQLTarget ;                  # Focus nodes are SPARQL results
    sh:prefixes im:, sn: ;                           # Prefixes used in the query
    sh:select """
      SELECT ?this
      WHERE {
          ?this im:isA sn:71388002 .                 # is a procedure
        MINUS {
          ?this im:isA ?exclusions                   # minus those that are subtypes of the exclusions,
          VALUES ?exclusions { sn:33879002 sn:51116004 }  # which are vaccination and passive immunisation
        }
      }
      """
  ] .
</syntaxhighlight>
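The set semantics behind the two definitions above (an 'is a' expansion of the included concepts, UNION-ed together, MINUS the expansion of any excluded concepts) can be sketched as plain set algebra. The mini-ontology and concept names here are hypothetical, chosen only to mirror the procedure/immunisation example.

```python
def descendants(concept, isa):
    """All concepts with an (inferred) 'is a' relationship to `concept`, plus itself."""
    return {c for c, a in isa if a == concept} | {concept}

# Hypothetical inferred 'is a' pairs: (descendant, ancestor).
isa = {("Vaccination", "Procedure"), ("Appendectomy", "Procedure"),
       ("CovidVaccination", "Vaccination"), ("CovidVaccination", "Procedure")}

procedures = descendants("Procedure", isa)     # the base 'is a' expansion
exclusions = descendants("Vaccination", isa)   # the excluded subtree
value_set = procedures - exclusions            # the MINUS clause as set difference

print(sorted(value_set))  # ['Appendectomy', 'Procedure']
```

Because the closure is materialised, the engine can evaluate each definition as cheap set operations rather than recursive subsumption tests.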
 
=== Catalogue query - SHACL/SPARQL ===
In the same way that concept sets use SHACL/SPARQL to query the ontology, instance data can be queried as a set in the same way.
 
==== Example - reference data set ====
It is often the case that a subscriber wishes to access data from only a limited number of source publishers, i.e. a set of organisations.

During the query process, this set of organisations will be used to filter organisations according to their data controller.

Whilst these may be selected as a static list, it is also possible to define a set by certain criteria so that the definition can be re-used.

Using the same technique, a set can be defined, for example:<syntaxhighlight lang="turtle">
im:SET_OrgSetNELGP
    a sh:NodeShape, im:Set;
    rdfs:label "North East London commissioned general practices located in E1";
    sh:target [
        sh:targetType sh:SPARQLTarget;                   # focus nodes are SPARQL results
        sh:prefixes im:, org:;                           # prefixes used in the query
        sh:select """
            Select ?this
            Where {
                ?this rdf:type im:Organisation;          # searching the organisation resources
                      im:commissionedBy org:NELCCG;      # commissioned by NELCCG
                      im:organisationType im:GPPracticeType;     # organisation type general practice
                      im:hasMainLocation/im:address/im:postcode ?postCode.
                FILTER regex(?postCode, "^E1")           # main location with an address with a
                                                         # postcode starting with E1
            } """
    ].
</syntaxhighlight>
 
=== Health Record cohort style query - SHACL/SPARQL ===
From a logical perspective there is no difference between querying instance data from the catalogue and health records.
 
Data sets are normally based on a base population of some kind, such as a cohort of patients.
 
==== Example - patient cohort based on events ====
For example, let us define 'active diabetics': people who have had the condition of diabetes recorded, and who may or may not also have a record of it being resolved. If they do, the date of a diabetes record must be on or after the date of resolution (otherwise their diabetes is considered resolved).<syntaxhighlight lang="turtle">
 
dds:ActiveDiabetics                                      # iri for the cohort definition
    a sh:NodeShape;
    rdfs:label "Patients with active diabetes";
    sh:target [
        sh:targetType sh:SPARQLTarget;
        sh:prefixes im:, sn:;                            # prefixes used in the query
        sh:select """
            Select ?patient
            Where {
                ?patient im:hasEvent ?event.             # patient has event
                ?event im:concept im:SET_Diabetes.       # event has concept in the diabetes set
                ?event im:effectiveDate ?date.           # date must be the latest from the sub-query
                FILTER (?date >= ?max)
                { Select ?patient (max(?date2) as ?max)  # get the maximum date
                  Where {
                      VALUES ?dores { im:SET_Diabetes im:SET_Resolved }  # diabetes or resolved
                      ?patient im:hasEvent ?event2.      # patient has event
                      ?event2 im:concept ?dores.         # event has concept in either set
                      ?event2 im:effectiveDate ?date2. } # get the date
                  group by ?patient }                    # group by patient
            } """
    ].
 
</syntaxhighlight>This states the following:

1. First, as a sub-query, find each patient and their most recent date among events whose concepts are diabetes or resolved, grouped by patient.

2. Then, as an outer query, find matching patients with diabetes events whose date is the same as, or later than, the patient's most recent date from the sub-query.
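The two-step logic can be sketched in plain Python. The patients, events and concept-set names below are hypothetical illustrations (concept-set membership is pre-resolved here; a real implementation would resolve it via entailment):

```python
from datetime import date

# Illustrative event lists: each event is (concept_set, effective_date).
events = {
    "patient1": [("diabetes", date(2019, 3, 1))],                                   # still active
    "patient2": [("diabetes", date(2018, 1, 1)), ("resolved", date(2020, 5, 2))],   # resolved
    "patient3": [("resolved", date(2017, 1, 1)), ("diabetes", date(2021, 6, 9))],   # re-recorded
}

def active_diabetics(events: dict) -> set:
    cohort = set()
    for patient, evs in events.items():
        relevant = [(c, d) for c, d in evs if c in ("diabetes", "resolved")]
        if not relevant:
            continue
        latest = max(d for _, d in relevant)       # the sub-query's max(?date)
        # outer query: a diabetes event on (or after) the latest relevant date
        if any(c == "diabetes" and d >= latest for c, d in relevant):
            cohort.add(patient)
    return cohort

print(active_diabetics(events))  # patient1 and patient3
```

patient2 is excluded because the resolution is the most recent relevant event; patient3 is included because diabetes was re-recorded after the resolution.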
 
There are some implicit assumptions in this query definition that the query interpreter should use:

# There is no need to state that ?patient refers to resources of type patient. This is because, within the IM, the domain of "im:hasEvent" IS patient, and therefore only patient resources will be retrieved.
# There is no need to explicitly state that the concepts being searched for are members of the diabetes set. The interpreter uses 'entailment', which means that any member of a concept set (or indeed any sub property or subclass) will be entailed in the statement.

The result of this is a simpler definition. The above, when implemented on a relational database, would expect the concept set to be expanded to a list of members.
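The expansion of a concept set into a full member list, as a relational implementation would require, can be sketched as a transitive closure walk. The set contents and subclass graph below are hypothetical:

```python
# Hypothetical concept-set membership and subclass graph.
SET_MEMBERS = {"im:SET_Diabetes": {"sn:T1DM", "sn:T2DM"}}
SUBCLASS = {"sn:T2DM": {"sn:T2DM_with_retinopathy"}}

def expand_set(set_iri: str) -> set:
    """Expand a concept set to its members plus all of their subclasses,
    i.e. the entailed members a relational query would enumerate."""
    members = set(SET_MEMBERS[set_iri])
    stack = list(members)
    while stack:
        concept = stack.pop()
        for sub in SUBCLASS.get(concept, ()):
            if sub not in members:
                members.add(sub)
                stack.append(sub)
    return members

print(expand_set("im:SET_Diabetes"))
```

The SPARQL form leaves this expansion implicit via im:isA entailment; a SQL translation would typically materialise the expanded list into a join table.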


=== Health Record data set style query - SHACL / SPARQL ===
A data set is normally separated into three steps, each of which is defined independently:

# Identification of the source of data to be searched. This might be a set of organisations, as in the example above.
# Definition of a base population of some entity type, which is used to filter any instances of that type in the data set. The base population definition is independent of the subset of the data to be extracted, i.e. many data sets may use the same base population. An example is described above.
# Definition of the entities and attributes in the data set: in essence, a list of entity types, together with any filtering, and a list of attributes to extract.

From a logical perspective the data set has two main patterns:

a) A statement of the entities and attributes in the data set, which are a subset of the entities and attributes within the data model. These are usually restricted to those entities within the base population.

b) A query definition of additional filters on the entities included in the data set, e.g. only certain events, or only certain values of certain attributes, may be included.
 
==== Example - data set ====
Let us say that, for the population of active diabetics, we wish to extract the first date of their condition and the type of diabetes they had, together with their latest systolic blood pressure.
 
The data set shape is defined as having two resources (the condition and the observation), i.e. two "tables", each with filter criteria associated with it. The first consists of the code and scheme as well as the date; the latter consists only of the value.
The names of the resources and the fields are defined by the shape, thus enabling user defined fields.<syntaxhighlight lang="turtle">
dds:DiabetesOnsetBp                                      # iri for the data set definition
    a im:Dataset;
    rdfs:label "Onset of diabetes and latest systolic";
    sh:property [
        sh:path im:hasResource;
        sh:node dds:DiabetesOnset;
        sh:name "DiabetesOnset" ];
    sh:property [
        sh:path im:hasResource;
        sh:node dds:LatestSystolic;
        sh:name "LatestSystolic" ]
.
dds:DiabetesOnset
    a im:DataSetResource;
    sh:target [
        sh:targetType sh:SPARQLTarget;
        sh:prefixes im:, sn:;                            # prefixes used in the query
        sh:select """
            Select ?this
            Where {
                VALUES ?concept { im:SET_Diabetes }
                ?this rdf:type im:Condition;
                      im:concept ?concept;
                      im:effectiveDate ?date. }
            ORDER BY ?date
            LIMIT 1 """ ];
    sh:property [
        sh:path (im:concept im:code);
        sh:name "code";
        sh:datatype xsd:string ];
    sh:property [
        sh:path (im:concept im:scheme);
        sh:name "scheme";
        sh:nodeKind sh:IRI ];
    sh:property [
        sh:path im:effectiveDate;
        sh:datatype xsd:dateTime ]
.
dds:LatestSystolic
    a im:DataSetResource;
    sh:target [
        sh:targetType sh:SPARQLTarget;
        sh:prefixes im:, sn:;                            # prefixes used in the query
        sh:select """
            Select ?this
            Where {
                VALUES ?concept { im:SET_Systolic }
                ?this rdf:type im:Observation;
                      im:concept ?concept;
                      im:effectiveDate ?date. }
            ORDER BY DESC(?date)
            LIMIT 1 """ ];
    sh:property [
        sh:path im:numericValue;
        sh:name "Latest BP";
        sh:datatype xsd:integer ]
.
 
</syntaxhighlight>
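The intended result, one row of user-named fields per patient, can be sketched in plain Python. The records and field names below are hypothetical; ISO-format date strings are compared lexicographically, which orders them correctly:

```python
# Hypothetical per-patient records. In the definitions above, each resource
# query keeps one row per patient (ORDER BY ... LIMIT 1).
conditions = [   # (patient, code, effective_date)
    ("p1", "sn:T2DM", "2015-02-01"),
    ("p1", "sn:T2DM", "2019-07-01"),
]
observations = [  # (patient, systolic_value, effective_date)
    ("p1", 132, "2021-01-01"),
    ("p1", 128, "2022-03-05"),
]

def dataset_row(patient: str) -> dict:
    # DiabetesOnset: earliest condition date (ORDER BY ?date LIMIT 1)
    onset = min((d for p, c, d in conditions if p == patient), default=None)
    # LatestSystolic: value of the most recent observation (ORDER BY DESC(?date) LIMIT 1)
    latest = max(((d, v) for p, v, d in observations if p == patient), default=None)
    return {"DiabetesOnset": onset,
            "LatestSystolic": latest[1] if latest else None}

print(dataset_row("p1"))  # {'DiabetesOnset': '2015-02-01', 'LatestSystolic': 128}
```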
 
=== Data mapping and matching ===
 
This part of the language is used to define mappings between the data model and an actual schema, to enable queries and filters to cope automatically with the ever-extending ontology and data properties.
 
The processes involved in mapping and matching concepts are described in the article on [[Mapping and matching concepts|mapping of concepts, codes and structures]].


<br />
<br />
=== ABAC language ===
''Main article : [[Attribute based access control|ABAC Language]]''
The Discovery attribute based access control language is presented as a pragmatic JSON based profile of the XACML language, modified to use the information model query language (SPARQL) to define policy rules. ABAC attributes are defined in the semantic ontology in the same way as all other classes and properties.
The language is used to support some of the data access authorisation processes as described in the specification - [[Identity Authentication Authorisation|Identity, authentication and authorisation]].
This article specifies the scope of the language, the grammar and the syntax, together with examples. Whilst presented as a JSON syntax, in line with other components of the information modelling language, the syntax can also be accessed via the ABAC XML schema (which includes the baseline information model XSD schema) on the Endeavour GitHub, with example content viewable in the information manager data files folder.<br />

Revision as of 14:44, 4 June 2022

Purpose and scope of the language

This article describes the languages used for creating, querying and maintaining the information model, as well as the means by which health record queries can be defined in a system independent manner.

As the information model is an RDF graph, the modelling language uses the mainstream semantic web languages. In addition there is a pragmatic JSON-LD based domain specific language for query definition, which is based on a constraint and specialisation of SPARQL, with support for GraphQL type queries.

The language includes description logic, shape constraints, expression constraints, and a pragmatic approach to modelling query of real data.

Details on the W3C standard languages that make up the grammar are described below.

It should be noted that these are modelling languages, not the physical data schema or actual query language; the latter are defined in languages commensurate with the technology (e.g. SQL).

The main purpose of a modelling language is to exchange data and information about information models in a way that machines can understand. It is not expected that clinicians would use the languages directly. The use of standard languages ensures that the models are interoperable across many domains including non health care domains.

The languages cover the following areas:

  1. An ontology, which is a vocabulary and definitions of the concepts used in healthcare, or more simply put, a vocabulary of health. The ontology is made up of the world's leading clinical ontology, Snomed-CT, with a London extension, supplemented with additional concepts for data modelling.
  2. A data model, which is a set of classes and properties, using the vocabulary, that represent the data and relationships as published by live systems to a data service that uses these models. Note that this data model is NOT a standard model, but a collated set of entities and relationships, bound to the concepts, based on real data and mapped to a single model.
  3. A library of business specific concept and value sets, which are expression constraints on the ontology for the purpose of query.
  4. A catalogue of reference data, such as geographical areas, organisations and people, derived and updated from public resources.
  5. A library of queries for querying and extracting instance data from reference data or health records.
  6. A set of maps creating mappings between published concepts and the core ontology, as well as structural mappings between submitted data and the data model.

Contributory languages

Health data can be conceptualised as a graph, and thus the model is a graph model.

When exchanging models using the language grammar, both JSON-LD and TURTLE are supported, as well as the more specialised syntaxes such as OWL functional syntax or expression constraint language.

The modelling language is an amalgam of the following languages:

  • RDF. An information model can be modelled as a graph, i.e. a set of nodes and edges (nodes and relationships, nodes and properties). Likewise, health data can be modelled as a graph conforming to the information model graph. RDF forms the statements describing the data. RDF in itself holds no semantics whatsoever, i.e. it is not practical to infer, validate or query based purely on an RDF structure. To use RDF it is necessary to provide semantic definitions for certain predicates and adopt certain conventions. In providing those semantic definitions, the predicates themselves can then be used to semantically define many other things. RDF can be represented using either TURTLE syntax or JSON-LD.
  • RDFS. The first of the semantic languages. It is used for some of the ontology axioms, such as subclasses, domains and ranges, as well as the standard annotation properties such as 'label'.
  • OWL2 DL. For the ontology. This brings with it more sophisticated description logic, such as equivalent classes and existential quantifications, and is used in the ontology and for defining things when an open world assumption is required.
  • SHACL. For the data models. Used for everything that defines the shape of data or logical entities and attributes. Although SHACL is designed for validation of RDF, because SHACL describes what things 'should be' it can also be used as a data modelling language.
  • SPARQL. Used as the logical means of querying model-conformant data (not to be confused with the actual query language used, which may be SQL).
  • IM Query Language. Used as a bridge between plain language and the mainstream query languages such as SQL or SPARQL.

Example (OWL2)

Consider a definition of chest pain

Chest pain
 is Equivalent to : pain of truncal structure
                    and
                    has site : Thoracic structure
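An "equivalent to" definition of this kind can be read as a set of necessary-and-sufficient conditions. The sketch below is illustrative only: a real classifier would test subsumption against the ontology rather than simple equality, and the property names are simplifications:

```python
# Illustrative "equivalent to" definition as necessary-and-sufficient conditions.
CHEST_PAIN = {
    "isa": "pain of truncal structure",
    "has site": "thoracic structure",
}

def satisfies(entity: dict, definition: dict) -> bool:
    """True if the entity meets every condition of the definition,
    in which case it would be classified under the defined concept."""
    return all(entity.get(prop) == value for prop, value in definition.items())

candidate = {"isa": "pain of truncal structure", "has site": "thoracic structure"}
print(satisfies(candidate, CHEST_PAIN))  # True
```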

Grammars and syntaxes

Foundation grammars and syntaxes - RDF, TURTLE and JSON-LD

The Discovery language has its own grammars, built on the foundations of the W3C RDF grammars:

  • A terse abbreviated language, TURTLE
  • SPARQL for query
  • JSON-LD representation, which can be used by systems that prefer JSON, wish to use standard approaches, and are able to resolve identifiers via the JSON-LD context structure.


Identifiers, aliasing prefixes and context

Concepts are identified and referenced by the use of Internationalized Resource Identifiers (IRIs).

Identifiers are universal and presented in one of the following forms:

  1. Full IRI (Internationalized Resource Identifier), which is the fully resolved identifier enclosed in <>
  2. Abbreviated IRI: a prefix, followed by a ":", followed by the local name, which is resolved to a full IRI
  3. Aliases. The core language tokens (which are themselves concepts) have aliases for ease of use. For example, rdfs:subClassOf is aliased to subClassOf.

There is of course nothing to stop applications using their own aliases, and when used with JSON-LD, @context may be used to enable the use of aliases.

Data is considered to be linked across the world, which means that IRIs are the main identifiers. However, IRIs can be unwieldy to use, and some languages such as GraphQL do not use them. Furthermore, when used in JSON (the main exchange syntax via APIs) they can cause significant bloat. Also, identifiers such as codes or terms have often been created for local use in single local systems, and in isolation are ambiguous.

To create linked data from local identifiers or vocabulary, the concept of context is applied. The main forms of context in use are:

  1. PREFIX declaration for IRIs, which enables the use of abbreviated IRIs. This approach is used in OWL, RDF TURTLE, SHACL and Discovery itself.
  2. VOCABULARY CONTEXT declaration for both IRIs and other tokens. This approach is used in JSON-LD, which converts local JSON properties and objects into linked data identifiers via the @context keyword. This enables applications that know their context to use simple identifiers such as aliases.
  3. MAPPING CONTEXT definitions for system level vocabularies. These provide sufficient context to uniquely identify a local code or term by including details such as the healthcare provider, the system and the table within the system; in essence a specialised class, with the various property values making up the context.
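Resolving the three identifier forms can be sketched in a few lines. The prefix and alias maps below are illustrative, not the IM's actual tables:

```python
# Illustrative prefix and alias maps.
PREFIXES = {"rdfs": "http://www.w3.org/2000/01/rdf-schema#"}
ALIASES = {"subClassOf": "rdfs:subClassOf"}  # core-language alias

def resolve(identifier: str) -> str:
    """Resolve a full IRI, abbreviated IRI, or alias to a full IRI."""
    if identifier.startswith("<") and identifier.endswith(">"):
        return identifier[1:-1]                       # form 1: full IRI in <>
    identifier = ALIASES.get(identifier, identifier)  # form 3: alias -> abbreviated IRI
    prefix, _, local = identifier.partition(":")
    return PREFIXES[prefix] + local                   # form 2: abbreviated -> full IRI

print(resolve("subClassOf"))  # http://www.w3.org/2000/01/rdf-schema#subClassOf
```

A JSON-LD @context plays the same role as the ALIASES and PREFIXES maps here, letting applications use short names that expand deterministically to IRIs.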

OWL2 and RDFS

For the purposes of authoring and reasoning, the semantic ontology axiom and class expression vocabulary uses the tokens and structure of OWL EL, a profile (sublanguage) of OWL2.

In addition to the open world assumption of OWL, the RDFS constructs of domain and range are used in a closed world manner.

Within an information model instance itself, the data relationships are held in their post-inferred, closed form, i.e. inferred properties and relationships are explicitly stated, using a normalisation process to eliminate duplications from supertypes. In other words, whereas an ontology may be authored using the open world assumption, classifications and inheritance are resolved prior to population of the live IM. This uses the same approach as Snomed-CT, whereby the inferred relationships containing the inherited properties, and the "isa" relationship, are included explicitly.

In the live IM many of these are replaced with the RDFS standard terms and simplified. For example, OWL existential quantifications are mapped to "role groups", in line with Snomed-CT.
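The mapping from an OWL existential quantification to a Snomed-CT style role group can be sketched as a simple structural transformation. The dictionary shapes are hypothetical simplifications; the Snomed-CT codes (finding site, thoracic structure) are illustrative:

```python
# Illustrative OWL axiom fragment: ObjectSomeValuesFrom(property, filler).
owl_axiom = {"someValuesFrom": {"property": "sn:363698007",    # finding site
                                "filler": "sn:51185008"}}      # thoracic structure

def to_role_group(axiom: dict) -> dict:
    """Re-express an existential quantification as a role group entry,
    as the live IM conversion described above would."""
    existential = axiom["someValuesFrom"]
    return {"im:roleGroup": [{existential["property"]: existential["filler"]}]}

print(to_role_group(owl_axiom))
```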

Use of Annotation properties

Annotation properties are the properties that provide information beyond that needed for reasoning.  They form no part in the ontological reasoning, but without them, the information model would be impossible for most people to understand. 

Typical annotation properties are names and descriptions.

Owl construct | Usage example | IM live conversion
Class | An entity that is a class concept, e.g. a Snomed-CT concept or a general concept | rdfs:Class
ObjectProperty | 'hasSubject' (an observation has a subject that is a patient) | rdf:Property
DataProperty | 'dateOfBirth' (a patient record has a date of birth attribute) | owl:DatatypeProperty
AnnotationProperty | 'description' (a concept has a description) |
SubClassOf | Patient is a subclass of Person | rdfs:subClassOf
EquivalentTo | Adverse reaction to Atenolol is equivalent to an adverse reaction to a drug AND has causative agent of Atenolol (substance) | rdfs:subClassOf
SubPropertyOf | 'has responsible practitioner' is a subproperty of 'has responsible agent' | rdfs:subPropertyOf
Property chain | 'is sibling of' / 'is parent of' / 'has parent' is a sub property chain of 'is first cousin of' | owl:propertyChainAxiom
Existential quantification (ObjectSomeValuesFrom) | Chest pain: finding site of {some} thoracic structure | im:roleGroup
Object intersection | Chest pain is equivalent to pain of truncal structure AND finding in region of thorax AND finding site of thoracic structure | rdfs:subClassOf + role groups
DataType definition | Date time is a restriction on a string, with a regex that allows approximate dates |
Property domain | A property domain of 'has causative agent' is allergic reaction | rdfs:domain
Property range | A property range of 'has causative agent' is a substance | rdfs:range

Annotation | Meaning
rdfs:label | The name or term for an entity
rdfs:comment | The description of an entity

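The "approximate date" datatype mentioned in the table can be illustrated with a regex. This particular pattern (year, year-month, year-month-day, optionally with a time) is an assumption for illustration, not the IM's actual definition:

```python
import re

# Hypothetical datatype restriction: a string regex allowing approximate
# dates, from a bare year down to a full date-time.
APPROX_DATE = re.compile(
    r"^\d{4}(-\d{2}(-\d{2}([T ]\d{2}:\d{2}(:\d{2})?)?)?)?$")

for value in ("2022", "2022-06", "2022-06-04", "2022-06-04 14:44", "June"):
    print(value, bool(APPROX_DATE.fullmatch(value)))
```

A plain xsd:dateTime would reject "2022" or "2022-06", whereas a regex-restricted string datatype of this kind accepts progressively less precise values.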

SHACL shapes - data model

For the data model shapes, as in the semantic ontology, the language borrows its constructs from the W3C standard SHACL, which can also be represented in any of the RDF supporting syntaxes such as TURTLE or JSON-LD.

Example

SHACL for part of the Encounter record type data model. Note that it is both a class and a shape: it is classified as a subclass of an event, which means it inherits the properties of an event (such as effective date), but the inherited "has concept" property has its range constrained to the class of encounter types in the London extension (such as GP consultation).

im:Encounter
  a sh:NodeShape, owl:Class;
  rdfs:label "Encounter (record type)";
  im:isA im:Event;
  im:status im:Active;
  rdfs:subClassOf im:PatientEvent;
  rdfs:comment "An interaction between a patient (or on behalf of the patient) and a health professional or health provider. It includes consultations as well as care processes such as admission and discharge. It also includes the noting or filing of a document or report.";
  sh:property
       [sh:path im:additionalPractitioners;
        sh:class im:PractitionerInRole],
       [sh:path im:completionStatus;
        sh:class im:894281000252100],
       [sh:path im:duration;
        sh:minCount "1"^^xsd:integer;
        sh:class im:894281000252100],
       [sh:path im:linkedAppointment;
        sh:class im:Appointment],
       [sh:path im:concept;
        sh:maxCount "1"^^xsd:integer;
        sh:minCount "1"^^xsd:integer;
        sh:class im:1741000252102]
      ......
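The cardinality constraints in a shape like this can be enforced with very little machinery. The sketch below is a toy validator over minCount/maxCount, not a SHACL engine, and the abbreviated property names are assumptions:

```python
# Toy cardinality constraints mirroring part of the Encounter shape above.
SHAPE = {
    "concept":  {"minCount": 1, "maxCount": 1},
    "duration": {"minCount": 1},
}

def violations(record: dict) -> list:
    """Return a message per violated minCount/maxCount constraint."""
    out = []
    for prop, constraint in SHAPE.items():
        n = len(record.get(prop, []))
        if n < constraint.get("minCount", 0):
            out.append(f"{prop}: expected at least {constraint['minCount']}")
        if n > constraint.get("maxCount", n):
            out.append(f"{prop}: expected at most {constraint['maxCount']}")
    return out

encounter = {"concept": ["gp consultation"], "duration": []}
print(violations(encounter))  # the missing duration is reported
```

This is exactly the sense in which SHACL, although designed for validation, doubles as a data modelling language: the same declarations that describe the shape also check conformance.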

Query and temporal concepts

As the IM itself is held as RDF quads (triples plus a graph name), the IM can be queried using SPARQL for graph query and Lucene (OpenSearch) for text query.

However, in order to support real world queries, which are usually articulated in plain language, the IM employs a pragmatic Query DSL to make the query of the IM, or any health record store, easier to construct.

The query DSL has two main roles:

1. Extends description logic to define concepts that have temporal or functional relationships with other concepts. For example, defining whether a patient has diabetes involves making sure that the statement that they have diabetes is not followed by a statement that it has resolved. Age is an example of a function based concept with a date of birth and a reference date being parameters.

2. Provides a simple method of defining data sets or report data, or derived data models for the purposes of further analysis.

The query DSL can also be seen as a "bridge" between the plain language of a user interface and the lower level queries in SQL or SPARQL that result from it. It is designed to hide much of the complexity involved in the underlying query constructs, which for health record query can result in hundreds of lines of obscure advanced SQL. It is designed to be easy for developers to use, and can be understood by informaticians by direct visualisation.

The DSL itself, in JSON form, can be submitted via the APIs to the IM service, or to a data service that supports a model compliant patient record data store, whether that is in relational form or not.

As well as the DSL the IM provides reference software for SQL and SPARQL conversion.

The query DSL is specified more fully in the article on information model query.