Health Information modelling language - overview: Difference between revisions

From Endeavour Knowledge Base
N.B. Not to be confused with the [[Information model meta model|Information model meta model]], which specifies the classes that hold the information model data. Those classes are described using the languages defined below.


<span style="color:#FF0000">Please note. The information in this section represents a specification of intent and work in progress. Actual implementations using the language have partial implementation of the grammars and syntaxes described here.</span>
This article describes the languages used in the information model meta model. In other words, the underlying grammar and syntax used as the building bricks for the classes that make up the model, instances of those classes being objects that conform to the class properties.  


== Background and rationale ==
Details on the W3C standard languages that make up the grammar are described below.
Question: Yet another language? Surely not.


Answer: No, or at least, not quite.
In addition,


Firstly, it is worth considering what the purpose of a modelling language is.  
If a system can consume RDF in its two main syntaxes (turtle and JSON-LD) then the model can be easily exchanged.


The main purpose of a modelling language is to exchange data and information about data models in a way that both machines and humans can understand. A purely machine based language would be no more than a series of binary bits, perhaps recognisable as hexadecimal digits. A purely human based language would be ambiguous as all human languages are. A language that is both can be used to promote a shared understanding of often complex structures whilst enabling machines to process data in a consistent way.  
The main advantage of RDF and the W3C standards is that types and properties are given internationally unique identifiers which are both humanly readable and can be resolved via the world wide web protocols.


The Discovery modelling language serves this purpose in an unusual way. Instead of a new language, it can be considered "a mixed language" representing a convergence of modern semantic web based modelling languages. The language is used as a means of eliminating the conflicting grammars and syntaxes from different open standard modelling languages. It does so by applying the relevant parts of the different languages (or profiles), using a set of conventions, to achieve an integrated whole. The language itself supports its own grammar, making pragmatic decisions about the, but the constituent languages can be used in their native form also as they are directly translatable.
Thus, in the information model, all classes, properties and value types (subjects and predicates and objects) are IRIs which are defined by ontological techniques.


It is worth looking at the background from a historical perspective, looking back over the last 40 years or so.
== Contributory languages ==
Health data can be conceptualised as a graph, and thus the model is a graph model.


Prior to the semantic web idea, information modelling was considered in either hierarchical or relational terms. Healthcare informatics adopted the hierarchical approach, which resulted in the partial adoption of standards such as EDIFACT and HL7, or in some cases home grown XML compliant representations such as those used in the NHS Data Dictionary.
As the information model is a graph, and both classes and properties are uniquely identified, [[wikipedia:Resource_Description_Framework|RDF]] is the language used. As the technical community uses JSON as the mainstream syntax for exchanging objects, the preferred syntax for the model classes and properties is [[wikipedia:JSON-LD|JSON-LD]], with instances in plain [[wikipedia:JSON|JSON]].


Following the proposed Semantic Web, the publication of the Resource Description Framework (RDF, and subsequently RDFS) and OWL brought together the fundamentals of spoken language grammar such as Subject/Predicate/Object, with the mathematical constructs of description logic and graph theory. Since then a plethora of W3C grammars have evolved, each designed to tackle different aspects of data modelling.
RDF itself has limited grammar, so the modelling language uses the mainstream semantic web grammars and vocabularies, these being RDFS, OWL and SHACL. Additional vocabularies are added to the IM to accommodate the shortfalls in these vocabularies.


All of these show a degree of convergence in that they are all based on the same fundamentals, which include graph and description logic. There is a tendency towards the use of an IRI (Internationalized Resource Identifier) to represent concepts and the use of graphs to model relationships. However, their grammars and syntaxes create tensions between the different perspectives on seemingly similar concepts. These different perspectives are gradually being harmonised but remain incompatible in places.
In addition the IM accommodates some languages required to use the main health ontology, i.e. Expression Constraint Language and Snomed compositional grammar. Within the IM, ECL is modelled as query and Snomed-CT compositional grammar is modelled as a Concept class.


Rather than propose a new harmonised language, the Discovery modelling language is designed to demonstrate a real world practical application of a convergent approach to modelling, using a set of multi-organisational health records covering a population of millions of citizens. This combined language enables a single integrated approach to modelling data whilst at the same time supporting interoperable standards based languages used for the various different specialised purposes. The open community based languages, and their various syntaxes, can thus be considered specialised sub languages of the Discovery language.
Finally, as a means of bridging the gap between user visualisation of query definitions and the underlying query languages such as SPARQL and SQL, the IM uses a set of classes to model query definitions, using a form that maps directly to SPARQL, SQL and GraphQL.


== The main language component areas ==
When exchanging models using the language grammar, both JSON-LD and Turtle are supported, as well as the more specialised syntaxes such as OWL functional syntax or Expression Constraint Language.
[[File:Language components.png|thumb|Venn diagram of language components]]
The language is designed to support the 3 main purposes of information modelling, which are: '''Inference, validation''' and '''enquiry.'''


To support these purposes, the language is used to model 3 main types of constructs: '''Ontology, Data model (or shapes), and Query.'''
The modelling language is an amalgam of the following languages:


It is not necessary to understand the standard languages used in order to understand the modelling or use Discovery, but for those who have an interest, and have a technical aptitude, the best places to start are with [https://www.w3.org/TR/owl2-primer/ OWL2], [https://www.w3.org/TR/shacl/ SHACL], [https://www.w3.org/TR/sparql11-query/ SPARQL], [https://graphql.org/ GraphQL] and specialised use case based constructs such as [[wikipedia:XACML|ABAC]]. For those who want to get to grips with the underlying logic, the best place to start is first order logic, description logic, and an understanding of at least one programming language such as C#, Java, JavaScript or Python, plus any query language such as SQL.
* [https://www.w3.org/TR/REC-rdf-syntax/ RDF.] An information model can be modelled as a graph, i.e. a set of nodes and edges (nodes and relationships, nodes and properties). Likewise, health data can be modelled as a graph conforming to the information model graph. RDF forms the statements describing the data. RDF in itself holds no semantics whatsoever, i.e. it is not practical to infer, validate or query based purely on an RDF structure. To use RDF it is necessary to provide semantic definitions for certain predicates and adopt certain conventions. In providing those semantic definitions, the predicates themselves can then be used to semantically define many other things. RDF can be represented using either Turtle syntax or JSON-LD.
* [https://www.w3.org/TR/rdf-schema/ RDFS]. This is the first of the semantic languages. It is used for the purposes of some of the ontology axioms such as subclasses, domains and ranges, as well as the standard annotation properties such as 'label'.


The only purpose of a language is to help create, maintain, and represent information models and thus how the languages are used are best seen in the sections on the [[Information modelling in Discovery|Information model.]]
*[https://www.w3.org/TR/shacl/ SHACL]. For the data models of types. Used for everything that defines the shape of data or logical entities and attributes. Although SHACL is designed for validation of RDF, as SHACL describes what things 'should be' it can be used as a data modelling language.


*[https://www.w3.org/TR/owl2-primer/ OWL2 DL.] This is supported in the authoring phase, but is simplified within the model. This brings with it more sophisticated description logic such as equivalent classes and existential quantifications, and is used in the ontology and for defining things when an open world assumption is required. This has contributed to the design of the IM languages, but OWL is removed in the run time models, with class expressions being replaced by RDFS subclass and role groups.
*[https://confluence.ihtsdotools.org/display/DOCECL#:~:text=The%20Expression%20Constraint%20Language%20is,either%20precoordinated%20or%20postcoordinated%20expressions. ECL.] This is a specialised query language created for Snomed-CT, used for simple concepts modelled as subtypes, role groups and roles, and is of great value in defining sets of concepts for the myriad of business purposes used in health.
*[https://confluence.ihtsdotools.org/display/DOCSCG/Compositional+Grammar+-+Specification+and+Guide SCG]. Snomed compositional grammar, created for Snomed-CT, which is a concise syntax for representing simple concepts modelled as subtypes, role groups and roles, and is a way of displaying concept definitions.
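
The RDF and RDFS entries in the list above note that RDF statements carry no semantics until conventions are attached to particular predicates. The following is a minimal sketch of that idea, not the IM implementation: triples are held as plain tuples, and a helper gives rdfs:subClassOf its conventional transitive reading. The <code>ex:</code> identifiers and the function names are hypothetical, chosen only for illustration.

<syntaxhighlight lang="python">
# A tiny graph held as (subject, predicate, object) triples.
# The ex: identifiers are hypothetical and used for illustration only.
TRIPLES = {
    ("ex:ChestPain", "rdfs:subClassOf", "ex:PainOfTruncalStructure"),
    ("ex:PainOfTruncalStructure", "rdfs:subClassOf", "ex:Pain"),
    ("ex:ChestPain", "rdfs:label", "Chest pain"),
}


def objects(subject: str, predicate: str) -> set:
    """Pure structural lookup: no meaning is attached to the predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def superclasses(cls: str) -> set:
    """Apply the RDFS convention that rdfs:subClassOf is transitive."""
    found = set()
    pending = [cls]
    while pending:
        for parent in objects(pending.pop(), "rdfs:subClassOf"):
            if parent not in found:
                found.add(parent)
                pending.append(parent)
    return found
</syntaxhighlight>

The structural lookup only returns the directly stated superclass, whereas <code>superclasses</code> also returns the inherited one, because the transitive meaning has been supplied by convention rather than by RDF itself.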




<br />


== Language and the information models ==
'''Example: multiple syntaxes and grammars'''
The language (or languages) are a means to an end i.e. a human and machine readable means of exchanging information models and use of the language to interact with implementations of health records.
[[File:IM logical object model.png|thumb]]
An information model is an abstract representation of data but an information model must have content and that content must be stored.


Data cannot be stored conceptually, only physically, and thus there must be a relationship between the abstract model and the physical store.
Consider a definition of chest pain in several syntaxes. Note that the OWL definition is in a form prior to classification, whereas the others use the post classified structure (so called inferred).
<div class="toccolours mw-collapsible mw-collapsed">
Chest pain in Manchester syntax, SCG, ECL, OWL FS, IM JSON-LD:
<div class="mw-collapsible-content">
<syntaxhighlight lang="turtle" style="border:3px solid grey">
# Definition of Chest pain in OWL Manchester syntax
equivalentTo  sn:298705000 and sn:301366005 and (sn:363698007 sn:51185008)


An abstract model is best instantiated as a set of objects, which are instances of classes.  In reality those objects are instantiated in some form of language. e.g. Java.  
#In RDF turtle
sn:29857009
  rdfs:subClassOf
        sn:301366005 ,  
        sn:298705000;
  im:roleGroup [im:groupNumber "1"^^xsd:integer;
  sn:363698007 sn:51185008];
  rdfs:label "Chest pain (finding)" .


The model can then be used as the source and target of the exchange of data, the latter using a language interoperating via a set of APIs


This can be visualised as in the diagram on the right. It can be seen that the inner physical store, is accessed by an object model layer, which is itself accessed by APIs using modelling language grammar and syntax. The diagram shows the main grammars supported by the Discovery information model, including the Discovery information modelling language grammar itself.
# In Snomed compositional grammar
=== 298705000 |Finding of region of thorax (finding)| +
    301366005 |Pain of truncal structure (finding)| :
            { 363698007 |Finding site (attribute)| = 51185008 |Thoracic structure (body structure)| }


Support for the main languages means that a Discovery information model instance has 2 levels of separation of concerns from the languages used to exchange data, and the languages are industry standards (or industry adopted),  thus the models are  therefore interoperable. There is no reason to buy into Discovery language to use the information model.
# When using ECL to retrieve chest pain
<<298705000 |Finding of region of thorax (finding)| and  
    (<<301366005 |Pain of truncal structure (finding)| :
            { 363698007 |Finding site (attribute)| = 51185008 |Thoracic structure (body structure)| })


The remainder of this article describes the language itself, starting with some high level sections on the components, and eventually providing a specification of the language and links to technical implementations, all of which are open source.


# When used in OWL functional syntax
EquivalentClasses(
:29857009 |Chest pain (finding)|
ObjectIntersectionOf(
:22253000 |Pain (finding)|
ObjectSomeValuesFrom(
:609096000 |Role group (attribute)|
ObjectSomeValuesFrom(
:363698007 |Finding site (attribute)|
:51185008 |Thoracic structure (body structure)|
)
)
)
)
# In JSON-LD


{
  "@id" : "sct:29857009",
  "rdfs:label" : "Chest pain (finding)",
  "im:definitionalStatus" : {"@id" : "im:1251000252106","name" : "Concept definition is sufficient (equivalent status)"},
  "rdfs:subClassOf" : [ {
    "@id" : "sct:301366005",
    "name" : "Pain of truncal structure (finding)"
  }, {
    "@id" : "sct:298705000",
    "name" : "Finding of region of thorax (finding)"
  } ],
  "im:roleGroup" : [ {
    "im:groupNumber" : 1,
    "sct:363698007" : [ {
      "@id" : "sct:51185008",
      "name" : "Thoracic structure (body structure)"
    } ]
  } ]
}
</syntaxhighlight>
</div>
</div> <div class="mw-collapsible-content">&nbsp;</div>


== The main language constructs ==
== Internal IM languages for IMAPI usage ==
This section describes some of the high level conceptual constructs used in the language. This does not describe the language grammar itself, as this is described later on.
An implementation of the IM as a terminology server or query library exists.


=== The Concept ===
This implementation uses the following mainstream languages:
Common to all of the language is the modelling abstraction '''"concept"''', which is an idea that can be defined, or at least described. All classes and properties in a model are represented as concepts. In line with semantic web standards a concept is represented in two forms:


# A named concept, the name being an Internationalized Resource Identifier ('''IRI'''). A concept is normally also annotated with human readable labels such as clinical terms, scheme dependent codes, and descriptions.
* Java, used for the main server-side business logic and for serving the REST APIs used to exchange information with the IM server
# An unnamed (anonymous) concept, which is defined by an expression, which itself is made up of named concepts or expressions
* JavaScript / TypeScript, used for business logic that provides UI-specific APIs to the web applications


The information model itself is a graph. Thus the IRIs can also be said to be nodes or edges, and the anonymous concepts are used as anonymous nodes, something which many of the languages support.
*[https://www.w3.org/TR/sparql11-query/ SPARQL]. Used as the logical means of querying model conformant data (not to be confused with the actual query language used, which may be SQL). Used as the query language for the IM and mapped from IM Query. Health queries would generally use SQL.
*[https://opensearch.org/docs/latest/opensearch/query-dsl/index/ OpenSearch / Elastic.] Used for complex free text query for finding concepts using the AWS OpenSearch DSL (a derivative of Lucene Query). Note that simple free text Lucene indexing is supported by the IM database engines and is used in combined graph/text query.
*[[Meta model class specification#Query .2FSet definition|IM Query.]] Not strictly a language but a class definition representing a scheme independent  way of defining sets (query results) including all the main health queries used by clinicians and analysts.


Concepts are specialised into classes or properties and there is a wide variety of types and purposes of properties.
== Grammars and syntaxes ==


=== Foundation syntaxes - RDF, TURTLE and JSON-LD ===
Discovery language has its own Grammars built on the foundations of the W3C RDF grammars:


The language vocabulary also includes specialised types of properties, effectively used as reserved words. For example, the ontology uses a type of property known as an '''Axiom''', which states the definition of a concept, for example the axiom "''is a subclass of''" to state that class A is entailed by class B. A data model may use a specialised property "target class" to state the class which the shape is describing and constraining, for a particular business purpose. The content of these vocabularies is dictated by the grammar specification, but the properties and their purpose are derived directly from the sublanguages.
* A terse abbreviated language, TURTLE


=== Context ===
* JSON-LD representation, which can be used by systems that prefer JSON (the majority) and are able to resolve identifiers via the JSON-LD context structure.
Data is considered in a linked form, which means that IRIs are the main identifiers. However, IRIs can be unwieldy to use and some of the languages such as GraphQL do not use them. Furthermore, when used in JSON (the main exchange syntax via APIs), they can cause significant bloat. Also, identifiers themselves have often been created for local use in local single systems.


To create linked data from local identifiers or vocabulary, the concept of Context is applied. The main forms of context in use are:
'''Identifiers, aliasing prefixes and context'''


# PREFIX declaration for IRIs, which enable the use of abbreviated IRIs. This approach is used in OWL, RDF turtle, SHACL and Discovery itself.
Concepts are identified and referenced by the use of Internationalized Resource Identifiers (IRIs).
# VOCABULARY CONTEXT declaration for both IRIs and other tokens. This approach is used in JSON-LD, which converts local JSON properties and objects into linked data identifiers via the @context keyword
# MAPPING CONTEXT definitions for system level vocabularies. This provides sufficient context to uniquely identify a local code or term by including details such as the health care provider, the system and the table within a system.


=== Grammars and syntaxes ===
Identifiers are universal and presented in one of the following forms:
The Discovery language, as a mixed language, has its own grammars as below, but in addition the language sub components can be used in their respective grammars and syntaxes. This enables multiple levels of interoperability, including between specialised community based languages and more general languages.


For example, the Snomed-CT community has a specialised language "Expression constraint language" (ECL), which can also be directly mapped to OWL2 and Discovery, and thus Discovery language maps to the 4-6 main OWL syntaxes as well as ECL. Each language has its own nuances, usually designed to simplify representations of complex structures. For example, in ECL, the reserved word MINUS (used to exclude certain subclasses from a superclass) maps to the much more obscure OWL2 syntax that requires the modelling of class IRIs "punned" as individual IRIs in order to properly exclude instances when generating lists of concepts.
# Full IRI (Internationalized Resource Identifier), which is the fully resolved identifier enclosed in <>
# Abbreviated IRI: a prefix followed by a ":" followed by the local name, which is resolved to a full IRI
# Aliases. The core language tokens (that are themselves concepts) have aliases for ease of use. For example, rdfs:subClassOf is aliased to subClassOf.


Discovery language has its own Grammars which include:
There is of course nothing to stop applications using their own aliases, and when JSON-LD is used, @context may be used to enable them.
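
The three identifier forms (full IRI, abbreviated IRI and alias) can be illustrated with a short resolver. This is a sketch under stated assumptions rather than the IM's own code: the prefix table, the alias table and the function name <code>expand_identifier</code> are hypothetical, and the <code>sn</code> namespace IRI is assumed for illustration. Only the RDFS namespace is a known standard value.

<syntaxhighlight lang="python">
# Hypothetical prefix table. The rdfs namespace is the W3C standard one;
# the sn namespace is an assumption made for this example.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "sn": "http://snomed.info/sct#",
}

# Hypothetical alias table: alias -> abbreviated IRI.
ALIASES = {"subClassOf": "rdfs:subClassOf"}


def expand_identifier(token: str) -> str:
    """Resolve a full IRI, an abbreviated IRI or an alias to a full IRI."""
    # Form 1: a full IRI enclosed in angle brackets.
    if token.startswith("<") and token.endswith(">"):
        return token[1:-1]
    # Form 3: an alias stands in for an abbreviated IRI.
    token = ALIASES.get(token, token)
    # Form 2: prefix, colon, local name.
    prefix, separator, local_name = token.partition(":")
    if separator and prefix in PREFIXES:
        return PREFIXES[prefix] + local_name
    raise ValueError(f"Cannot resolve identifier: {token!r}")
</syntaxhighlight>

With these tables, <code>expand_identifier("subClassOf")</code> and <code>expand_identifier("rdfs:subClassOf")</code> resolve to the same full IRI, which is the property that lets applications choose whichever form is most convenient.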


* A human natural language approach to describing content, presented as optional terminal literals to the terse language
Data is considered to be linked across the world, which means that IRIs are the main identifiers. However, IRIs can be unwieldy to use and some of the languages such as GraphQL do not use them. Furthermore, when used in JSON (the main exchange syntax via APIs), they can cause significant bloat. Also, identifiers such as codes or terms have often been created for local use in local single systems and in isolation are ambiguous.


* A terse abbreviated language, similar to Turtle
To create linked data from local identifiers or vocabulary, the concept of Context is applied. The main forms of context in use are:


* A proprietary JSON based grammar, which maps directly to the internal class structures used in Discovery
# PREFIX declaration for IRIs, which enable the use of abbreviated IRIs. This approach is used in OWL, RDF turtle, SHACL and Discovery itself.
# VOCABULARY CONTEXT declaration for both IRIs and other tokens. This approach is used in JSON-LD, which converts local JSON properties and objects into linked data identifiers via the @context keyword. This enables applications that know their context to use simple identifiers such as aliases.
# MAPPING CONTEXT definitions for system level vocabularies. This provides sufficient context to uniquely identify a local code or term by including details such as the health care provider, the system and the table within a system. In essence a specialised class with the various property values making up the context.
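
A mapping context can be pictured as a small record whose property values, together with the local code, identify a concept unambiguously. The sketch below assumes a simple in-memory lookup. The class name, field names, provider names, local codes and the <code>example.org</code> IRI are all hypothetical, and the actual IM classes may be structured differently.

<syntaxhighlight lang="python">
from typing import NamedTuple, Optional


class MappingContext(NamedTuple):
    """Context needed to disambiguate a local code (hypothetical fields)."""
    provider: str
    system: str
    table: str


# Hypothetical map from (context, local code) to a concept IRI.
# The same local code "CP01" means different things in different contexts.
CODE_MAP = {
    (MappingContext("Hospital A", "PAS", "diagnosis"), "CP01"):
        "http://snomed.info/sct#29857009",
    (MappingContext("Practice B", "GP system", "observation"), "CP01"):
        "http://example.org/local#CarePlanReview",
}


def resolve_local_code(context: MappingContext, code: str) -> Optional[str]:
    """Return the concept IRI for a local code in a context, or None if unmapped."""
    return CODE_MAP.get((context, code))
</syntaxhighlight>

Keying on the whole context rather than the code alone is what prevents two systems that happen to reuse the same local code from being conflated.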


* An open standard JSON-LD representation
=== OWL2 and RDFS ===


Because the information models are accessible via APIs, this means that systems can use any of the above, or exchange information in the specialised standard sublanguages which are:
For the purposes of authoring and reasoning, the semantic ontology axiom and class expression vocabulary uses the tokens and structure from the OWL2 profile [https://www.w3.org/TR/owl2-profiles/#OWL_2_EL OWL EL], which itself is a sublanguage of the [https://www.w3.org/TR/owl2-syntax/ OWL2 language].


* Expression constraint language (ECL) with its single string syntax
In addition to the open world assumption of OWL, the RDFS constructs of domain and range (as in OWL DL) are also used, but in a closed world manner as in RDFS.


* OWL2 DL presented as functional syntax, RDF/XML, Manchester, JSON-LD
Within an information model instance itself, the data relationships are held in their post inferred closed form, i.e. inferred properties and relationships are explicitly stated, using a normalisation process to eliminate duplications from super types. In other words, whereas an ontology may be authored using the open world assumption, prior to population of the live IM, classifications and inheritance are resolved. This uses the same approach as followed by Snomed-CT, whereby the inferred relationship containing the inherited properties and the "isa" relationship are included explicitly.


* SHACL presented as JSON-LD
In the live IM OWL Axioms are replaced with the RDFS standard terms and simplified. For example OWL existential quantifications are mapped to "role groups" in line with Snomed-CT.
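
As a rough illustration of this replacement, the sketch below takes an authored definition held as a nested dictionary (a stand-in for an OWL equivalent class axiom with an intersection and existential restrictions) and produces the subclass plus role group form used in the chest pain JSON-LD example earlier in this article. It is a simplification under stated assumptions: the input dictionary layout and the function name <code>to_live_form</code> are hypothetical, each existential restriction is placed in its own role group, and classification and normalisation are not attempted.

<syntaxhighlight lang="python">
def to_live_form(authored: dict) -> dict:
    """Convert a simplified authored definition to subclass and role group form.

    Named classes in the intersection become rdfs:subClassOf entries and each
    existential restriction becomes a numbered im:roleGroup entry.
    """
    superclasses = []
    role_groups = []
    for operand in authored["equivalentTo"]["intersectionOf"]:
        if isinstance(operand, str):
            # A named class in the intersection.
            superclasses.append({"@id": operand})
        else:
            # An existential restriction: property some filler.
            restriction = operand["someValuesFrom"]
            role_groups.append({
                "im:groupNumber": len(role_groups) + 1,
                restriction["property"]: [{"@id": restriction["filler"]}],
            })
    return {
        "@id": authored["iri"],
        "rdfs:subClassOf": superclasses,
        "im:roleGroup": role_groups,
    }


# Hypothetical input layout mirroring the Manchester syntax chest pain example.
chest_pain = {
    "iri": "sn:29857009",
    "equivalentTo": {
        "intersectionOf": [
            "sn:298705000",
            "sn:301366005",
            {"someValuesFrom": {"property": "sn:363698007", "filler": "sn:51185008"}},
        ]
    },
}
live = to_live_form(chest_pain)
</syntaxhighlight>

The result carries the two superclasses under <code>rdfs:subClassOf</code> and a single role group holding the finding site, so consumers of the live model need no OWL reasoner to read the definition.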


* GRAPHQL presented as JSON-LD(GraphQL-LD)  or GraphQL natively
'''Use of Annotation properties'''
*XACML presented as JSON


=== Semantic Ontology ===
Annotation properties are the properties that provide information beyond that needed for reasoning. They form no part in the ontological reasoning, but without them, the information model would be impossible for most people to understand.


''Main article''  [[Discovery semantic ontology language]]
Typical annotation properties are names and descriptions.
{| class="wikitable"
|+
!Owl construct
!usage examples
!'''IM live conversion'''
|-
|Class
|An entity that is a class concept e.g. A snomed-ct concept or a general concept
|rdfs:Class
|-
|ObjectProperty
|'hasSubject' (an observation '''has a subject''' that is a patient)
|rdf:Property
|-
|DataProperty
|'dateOfBirth' (a patient record has a date of birth attribute)
|owl:DatatypeProperty
|-
|annotationProperty
|'description'  (a concept has a description)
|
|-
|SubClassOf
|Patient is a subclass of a Person
|rdfs:subClassOf
|-
|Equivalent To
|Adverse reaction to Atenolol is equivalent to An adverse reaction to a drug AND has causative agent of Atenolol (substance)
|rdfs:subClassOf
<br />
|-
|Sub property of
|has responsible practitioner is a subproperty of has responsible agent
|rdfs:subPropertyOf
|-
|Property chain
|'is sibling of' / 'is parent of' / 'has parent' is a sub property chain of 'is first cousin of'
|owl:Property chain
|-
|Existential quantification ( ObjectSomeValuesFrom)
|Chest pain and
Finding site of - {some} thoracic structure
|im:roleGroup
|-
|Object Intersection
|Chest pain is equivalent to pain of truncal structure AND finding in region of thorax AND finding site of thoracic structure
|rdfs:subClassOf


The semantic ontology subsumes OWL2 DL.
+


OWL2, like Snomed-CT, forms the '''logical basis''' for the static data representations, including semantic definition, data modelling and modelling of value sets. OWL2 subsets of Discovery are available in the Discovery syntaxes or the OWL 2 syntaxes.
role groups
|-
|DataType definition
|Date time  is a restriction on a string with a regex that allows approximate dates
|
|-
|Property domain
|a property domain of has causative agent is allergic reaction
|rdfs:domain
|-
|Property range
|A property range of has causative agent is a substance
|rdfs:range
|}
{| class="wikitable"
|+
!Annotation
!Meaning
|-
|rdfs:label
|The name or term for an entity
|-
|rdfs:comment
|the description of an entity
|-
|
|
|}


In its usual use, OWL2 EL is used for reasoning and classification via the use of the [[wikipedia:Open-world_assumption|Open world assumption]]. In effect this means that OWL2 can be used to infer X from Y which forms the basis of most [[Subsumption test|subsumption]] or entailment queries in healthcare.  
=== SHACL shapes ===
SHACL is used as a means of specifying the "data model types" of health record entities and also the IM itself as described directly in the [[Information model meta model#Meta model class specification|meta model article]].


OWL2 DL can also be used to model property domains and ranges so that they may be used as editorial policies. Where classic OWL2 DL normally models domains of a property in order to infer the class of a certain entity, one can use the same grammar for use in editorial policies, i.e. only certain properties are allowed for certain classes.
SHACL is used in its standard form and is not extended.


For example, where OWL2 may say that one of the domains of a causative agent is an allergy (i.e. an unknown class with a property of causative agent is likely to be an allergy), in the modelling the editorial policy states that an allergy ''can only'' have properties that are allowed via the property domain. Thus the Snomed MRCM can be modelled in OWL2 DL.
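
The editorial policy reading of property domains can be sketched as a simple check: a class may use a property only if the class, or one of its superclasses, appears in that property's domain. The class names, property names, tables and function names below are hypothetical and serve only to illustrate the closed world use of domains. They are not taken from the IM or from the Snomed MRCM.

<syntaxhighlight lang="python">
# Hypothetical property domains: property -> classes allowed to use it.
PROPERTY_DOMAINS = {
    "hasCausativeAgent": {"AllergicReaction"},
    "hasFindingSite": {"ClinicalFinding"},
}

# Hypothetical class hierarchy: class -> direct superclasses.
SUPERCLASSES = {
    "AllergicReaction": {"ClinicalFinding"},
    "ClinicalFinding": set(),
}


def ancestors_and_self(cls: str) -> set:
    """Return the class together with all of its direct and indirect superclasses."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUPERCLASSES.get(current, ()))
    return seen


def disallowed_properties(cls: str, used_properties: list) -> list:
    """List the properties that the editorial policy does not allow for cls."""
    allowed_for = ancestors_and_self(cls)
    return sorted(
        prop for prop in used_properties
        if not (PROPERTY_DOMAINS.get(prop, set()) & allowed_for)
    )
</syntaxhighlight>

Here an allergic reaction may use both properties, because it inherits the finding site domain from its superclass, whereas a plain clinical finding using the causative agent property would be reported as a policy violation.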
=== OWL extension : data property expressions ===
Within health care, (and in common parlance), data properties are often used as syntactical short cuts to objects with qualifiers  and a literal value element.  


The grammar for the semantic ontology language used for reasoning is [https://www.w3.org/TR/owl2-profiles/#OWL_2_EL OWL EL], which is a limited profile of OWL DL. Thus only existential quantification and object intersections are used for reasoning. However, the language is also used for some aspects of data modelling and [[Value sets|value set]] modelling, which requires [https://www.w3.org/TR/owl2-syntax/ OWL2 DL] as the more expressive constructs such as union (ORs) are required.
For example, the data property "Home telephone number" would be expected to simply contain a number. But a home telephone number also has a number of properties by implication, such as the fact that its usage is "home", and has a country and area code.


As such the ontology supports the OWL2 syntaxes such as the Functional syntax and Manchester syntax, but can be represented by JSON-LD or the Discovery JSON based syntax, as part of the full information modelling language.  
OWL 2 has a known limitation (as described in the OWL specification itself) in respect of data property expressions. OWL2 can only define data property expressions as data property IRIs with annotations.  


Together with the query language, OWL2 DL makes the language compatible also with [https://confluence.ihtsdotools.org/display/DOCECL/Expression+Constraint+Language+-+Specification+and+Guide Expression constraint language] which is used as the standard for specifying Snomed-CT expression query.
In many health care standards such as HL7 FHIR, these data properties are object properties, with the objects having the "value" as one of their properties.


Ontology purists will notice that modelling a "content model" in OWL2 is in fact a breach of the fundamental [[wikipedia:Open-world_assumption|open world assumption]] view of the world taken in ontologies, and applies the [[wikipedia:Closed-world_assumption|closed world assumption]] view instead. Consequently, the sublanguage used for data modelling uses OWL for inferencing (open world) but SHACL for describing the models (closed world).
For example, in FHIR the patient's home telephone number is carried explicitly as the property contact {property = telecom -> value = {property use = Home, property system = coding system, value = the actual number}}, i.e. 3 levels of nesting.


The ontologies that are modelled are considered as modular ontologies. It is not expected that one "mega ontology" would be authored, but that there would be maximum sharing of concept definitions (known as axioms), which results in a super ontology of modular ontologies.
Whilst explicit modelling is vital for information exchanged between systems with different data models, if stored in this way, queries would underperform, so the actual systems usually store the home telephone number perhaps in  a field "home telephone"  in the patient table or a simple triple.


=== Data  modelling and shapes ===
To bridge the gap between a complex object definition and a simple data property, the information model supports data property expressions (but without introducing a new language construct) as follows:


Data models model classes and properties according to business purposes. This is a different approach to the open world assumption of semantic ontologies.
# Simple data property against the class e.g. a "contact"
# Patient's home telephone number modelled as a ''sub property'' "homeTelephoneNumber", which is a sub property of "telephone number", which is itself a sub property of "contact".
# A standard RDFS property of the homeTelephone property entity -> "isDefinedBy", which points to a class expression that defines a home telephone number (itself a subclass of a class expression TelephoneNumber), thus allowing all property values to be "implicit but defined" as part of the ontology.


To illustrate the difference, take the modelling of a human being or person.


From a semantic perspective a person could be said to be equivalent to an animal with a certain set of DNA (nuclear or mitochondrial), perhaps including the means of growth, or perhaps being defined at some point before, at the start of, or sometime after the embryonic phase. One would normally just state that a person is an instance of Homo sapiens and that Homo sapiens is a species of.... etc.


From a data model perspective we may wish to model a record of a person. We could say that a record of a person models a person, and will have one date of birth, one current gender, and perhaps a main residence. 


The difference is between the open and closed world: the model of the person is a constraint on the possible (unlimited) properties of a person.


A particular data model is a particular business oriented perspective on a set of concepts. As there are potentially thousands of different perspectives (e.g. a GP versus a geneticist) there is a potentially unlimited number of data models. All the data models modelled in Discovery share the same atomic concepts and the same semantic ontological definitions across ontologies where possible; where this is not possible, mapping relationships are used.


The binding of a data model to its property values is based on a business specific model. For example a standard FHIR resource will map directly to the equivalent data model class, property and value set, whose meaning is defined in the semantic ontology, but the same data may be carried in a non FHIR resource without loss of interoperability.


A common approach to modelling and use of a standard approach to ontology, together with modularisation, means that any sending or receiving machine which uses concepts from the semantic ontology can adopt full semantic interoperability. If both machines use the same data model for the same business, the data may be presented in the same relationship. If the two machines use different data models for different businesses they may present the data in different ways, but without any loss of meaning or query capability.
<br />
 
'''''The integration between data model shapes and ontological concepts makes the information model very powerful and is the single most important contributor to semantic interoperability.'''''
 
=== Data mapping ===
 
This part of the language is used to define mappings between the data model and an actual schema, to enable queries and filters to automatically cope with the ever extending ontology and data properties.
 
This is part of the semantic ontology but uses the idea of context (described later on).
 
=== Query ===
It is fair to say that data modelling and semantic ontology is useless without the means of query.
 
The current approach to the specification of query uses the GRAPHQL approach with type extensions and directive extensions.
 
GRAPHQL (despite its name) is not in itself a query language but a way of representing the graph like structure of an underlying model that has been built using OWL. GRAPHQL has a very simple class property representation, is ideal for REST APIs, and its results are JSON objects, in line with the approach taken by the above Discovery syntax.
 
Nevertheless, GRAPHQL considers properties to be functions (high order logic) and therefore properties can accept parameters. For example, a patient's average systolic blood pressure reading could be considered a property with a single parameter being a list of the last 3 blood pressure readings. Parameters are types and types can be created and extended.
 
In addition GRAPHQL supports the idea of extensions of directives which further extend the grammar.
 
Thus GRAPHQL capability is extended by enabling property parameters as types, to support such things as filtering, sorting and limiting in the same way as any other query language, by modelling types passed as parameters. Subqueries are then supported in the same way.
 
GRAPHQL itself is used when the enquirer is familiar with the local logical schema i.e. understands the available types and fields. In order to support semantic web concepts an extension to GRAPHQL, GRAPHQL-LD is used, which is essentially GRAPH-QL with JSON-LD context.
 
GRAPHQL-LD has been chosen over SPARQL for reasons of simplicity, and many now consider GRAPHQL to be a de-facto standard. However, this is an ongoing consideration.
 
=== ABAC language ===
''Main article : [[Attribute based access control|ABAC Language]]''
 
The Discovery attribute based access control language is presented as a pragmatic JSON based profile of the XACML language, modified to use the information model query language (SPARQL) to define policy rules. ABAC attributes are defined in the semantic ontology in the same way as all other classes and properties.
 
The language is used to support some of the data access authorisation processes as described in the specification - [[Identity Authentication Authorisation|Identity, authentication and authorisation]] .
 
This article specifies the scope of the language, the grammar and the syntax, together with examples. Whilst presented as a JSON syntax, in line with other components of the information modelling language, the syntax can also be accessed via the ABAC xml schema which includes the baseline Information model XSD schema on the Endeavour GitHub, and example content can be viewed in the information manager data files folder.<br />

Latest revision as of 14:53, 5 January 2023

N.B. Not to be confused with the Information model meta model, which specifies the classes that hold the information model data; those classes are described using the languages defined below.

This article describes the languages used in the information model meta model. In other words, the underlying grammar and syntax used as the building bricks for the classes that make up the model, instances of those classes being objects that conform to the class properties.

Details on the W3C standard languages that make up the grammar are described below.

In addition, if a system can consume RDF in its two main syntaxes (Turtle and JSON-LD) then the model can be easily exchanged.

The main advantage of RDF and the W3C standards is that types and properties are given internationally unique identifiers which are both human readable and can be resolved via the world wide web protocols.

Thus, in the information model, all classes, properties and value types (subjects and predicates and objects) are IRIs which are defined by ontological techniques.

Contributory languages

Health data can be conceptualised as a graph, and thus the model is a graph model.

As the information model is a graph, and both classes and properties are uniquely identified, RDF is the language used. As the technical community uses JSON as the mainstream syntax for exchanging objects, the preferred syntax for the model classes and properties is JSON-LD, with instances in plain JSON.

As RDF itself has limited grammar, the modelling language uses the mainstream semantic web grammars and vocabularies, these being RDFS, OWL and SHACL. Additional vocabularies are added to the IM to accommodate the shortfalls in these vocabularies.

In addition the IM accommodates some languages required to use the main health ontology, i.e. Expression Constraint Language and Snomed compositional grammar. Within the IM, ECL is modelled as query and Snomed-CT compositional grammar is modelled as a Concept class.

Finally, as a means of bridging the gap between user visualisation of query definitions and the underlying query languages such as SPARQL and SQL, the IM uses a set of classes to model query definitions, using a form that maps directly to SPARQL, SQL and GRAPHQL.

When exchanging models using the language grammar, both JSON-LD and Turtle are supported, as well as the more specialised syntaxes such as OWL functional syntax or Expression Constraint Language.

The modelling language is an amalgam of the following languages:

  • RDF. An information model can be modelled as a graph, i.e. a set of nodes and edges (nodes and relationships, nodes and properties). Likewise, health data can be modelled as a graph conforming to the information model graph. RDF forms the statements describing the data. RDF in itself holds no semantics whatsoever, i.e. it is not practical to infer or validate or query based purely on an RDF structure. To use RDF it is necessary to provide semantic definitions for certain predicates and adopt certain conventions. In providing those semantic definitions, the predicates themselves can then be used to semantically define many other things. RDF can be represented using either TURTLE syntax or JSON-LD.
  • RDFS. This is the first of the semantic languages. It is used for the purposes of some of the ontology axioms such as subclasses, domains and ranges, as well as the standard annotation properties such as 'label'.
  • SHACL. For the data models of types. Used for everything that defines the shape of data or logical entities and attributes. Although SHACL is designed for validation of RDF, as SHACL describes what things 'should be' it can be used as a data modelling language.
  • OWL2 DL. This is supported in the authoring phase, but is simplified within the model. This brings with it more sophisticated description logic such as equivalent classes and existential quantifications, and is used in the ontology and for defining things when an open world assumption is required. This has contributed to the design of the IM languages, but OWL is removed in the run time models, with class expressions being replaced by RDFS subclass and role groups.
  • ECL. This is a specialised query language created for Snomed-CT, used for simple concepts modelled as subtypes, role groups and roles, and is of great value in defining sets of concepts for the myriad of business purposes used in health.
  • SCG. Snomed compositional grammar, created for Snomed-CT, which is a concise syntax for representing simple concepts modelled as subtypes, role groups and roles, and is a way of displaying concept definitions.


Example multiple syntaxes and grammars

Consider a definition of chest pain in several syntaxes. Note that the OWL definition is in a form prior to classification, whereas the others use the post classified structure (so called inferred).

Chest pain in Manchester syntax, SCG, ECL, OWL FS, IM Json-LD:

# Definition of Chest pain in OWL Manchester syntax
 EquivalentTo: sn:298705000 and sn:301366005 and (sn:363698007 some sn:51185008)

#In RDF turtle
sn:29857009
   rdfs:subClassOf 
         sn:301366005 , 
         sn:298705000;
   im:roleGroup [im:groupNumber "1"^^xsd:integer;
   sn:363698007 sn:51185008];
   rdfs:label "Chest pain (finding)" .


# In Snomed compositional grammar
=== 298705000 |Finding of region of thorax (finding)| + 
    301366005 |Pain of truncal structure (finding)| :
            { 363698007 |Finding site (attribute)| = 51185008 |Thoracic structure (body structure)| }

# When using ECL to retrieve chest pain
<<298705000 |Finding of region of thorax (finding)| and 
    (<<301366005 |Pain of truncal structure (finding)| :
            { 363698007 |Finding site (attribute)| = 51185008 |Thoracic structure (body structure)| })


#When used in OWL functional syntax
EquivalentClasses(
	:29857009 |Chest pain (finding)|
	ObjectIntersectionOf(
		:22253000 |Pain (finding)|
		ObjectSomeValuesFrom(
			:609096000 |Role group (attribute)|
			ObjectSomeValuesFrom(
				:363698007 |Finding site (attribute)|
				:51185008 |Thoracic structure (body structure)|
			)
		)
	)
)
# In Json-LD

{
  "@id" : "sct:29857009",
  "rdfs:label" : "Chest pain (finding)",
  "im:definitionalStatus" : {"@id" : "im:1251000252106","name" : "Concept definition is sufficient (equivalent status)"},
  "rdfs:subClassOf" : [ {
    "@id" : "sct:301366005",
    "name" : "Pain of truncal structure (finding)"
  }, {
    "@id" : "sct:298705000",
    "name" : "Finding of region of thorax (finding)"
  } ],
  "im:roleGroup" : [ {
    "im:groupNumber" : 1,
    "sct:363698007" : [ {
      "@id" : "sct:51185008",
      "name" : "Thoracic structure (body structure)"
    } ]
  } ]
}
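
As the JSON-LD form is plain JSON, a consuming system can read the definition with an ordinary JSON parser. The following is a minimal Python sketch, assuming a trimmed copy of the chest pain example above; the helper names `parents` and `role_group_values` are hypothetical and are not part of any IM library.

```python
import json

# Trimmed copy of the chest pain example (the "name" entries are omitted).
doc = json.loads("""
{
  "@id": "sct:29857009",
  "rdfs:label": "Chest pain (finding)",
  "rdfs:subClassOf": [
    {"@id": "sct:301366005"},
    {"@id": "sct:298705000"}
  ],
  "im:roleGroup": [
    {"im:groupNumber": 1, "sct:363698007": [{"@id": "sct:51185008"}]}
  ]
}
""")


def parents(node):
    """Return the identifiers of the stated super types of a concept."""
    return [parent["@id"] for parent in node.get("rdfs:subClassOf", [])]


def role_group_values(node):
    """Return (group number, attribute, value) for every role in every group."""
    roles = []
    for group in node.get("im:roleGroup", []):
        number = group["im:groupNumber"]
        for attribute, values in group.items():
            if attribute == "im:groupNumber":
                continue  # the group number is not itself a role
            for value in values:
                roles.append((number, attribute, value["@id"]))
    return roles


print(parents(doc))
print(role_group_values(doc))
```

The sketch treats the prefixed names as opaque strings; resolving them to full IRIs would require the JSON-LD context, which is omitted here.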
 

Internal IM languages for IMAPI usage

An implementation of the IM as a terminology server or query library exists.

This implementation uses the following mainstream languages:

  • Java, used for the main server side business logic, which services the REST APIs used to exchange information with the IM server
  • JavaScript / TypeScript, used for business logic that provides UI specific APIs to the web applications
  • SPARQL. Used as the logical means of querying model conformant data (not to be confused with the actual query language used, which may be SQL). It is the query language for the IM and is mapped from IM Query. Health queries would generally use SQL.
  • OpenSearch / Elastic. Used for complex free text query for finding concepts using the AWS OpenSearch DSL (derivative of Lucene Query). Note that simple free text Lucene indexing is supported by the IM database engines and is used in combined graph/text query.
  • IM Query. Not strictly a language but a class definition representing a scheme independent way of defining sets (query results) including all the main health queries used by clinicians and analysts.

Grammars and syntaxes

Foundation syntaxes - RDF, TURTLE and JSON-LD

Discovery language has its own Grammars built on the foundations of the W3C RDF grammars:

  • A terse abbreviated language, TURTLE
  • JSON-LD representation, which can be used by systems that prefer JSON (the majority), and are able to resolve identifiers via the JSON-LD context structure.

Identifiers, aliasing prefixes and context

Concepts are identified and referenced by the use of Internationalized Resource Identifiers (IRIs).

Identifiers are universal and presented in one of the following forms:

  1. Full IRI (Internationalized Resource Identifier), which is the fully resolved identifier enclosed in <>
  2. Abbreviated IRI, a prefix followed by a ":" followed by the local name, which is resolved to a full IRI
  3. Aliases. The core language tokens (that are themselves concepts) have aliases for ease of use. For example rdfs:subClassOf is aliased to subClassOf.

There is of course nothing to stop applications using their own aliases; with JSON-LD, @context may be used to enable them.

Data is considered to be linked across the world, which means that IRIs are the main identifiers. However, IRIs can be unwieldy to use and some of the languages such as GRAPH-QL do not use them. Furthermore, when used in JSON, (the main exchange syntax via APIs) they can cause significant bloat. Also, identifiers such as codes or terms have often been created for local use in local single systems and in isolation are ambiguous.

To create linked data from local identifiers or vocabulary, the concept of Context is applied. The main forms of context in use are:

  1. PREFIX declaration for IRIs, which enable the use of abbreviated IRIs. This approach is used in OWL, RDF turtle, SHACL and Discovery itself.
  2. VOCABULARY CONTEXT declaration for both IRIs and other tokens. This approach is used in JSON-LD which converts local JSON properties and objects into linked data identifiers via the @context keyword. This enables applications that know their context to use simple identifiers such as aliases.
  3. MAPPING CONTEXT definitions for system level vocabularies. This provides sufficient context to uniquely identify a local code or term by including details such as the health care provider, the system and the table within a system. In essence a specialised class with the various property values making up the context.
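
The first two forms of context can be illustrated with a minimal Python sketch that resolves a full IRI, an abbreviated IRI or an alias to a full IRI. The `sn` and `im` namespace IRIs and the `expand` function are assumptions made for illustration only; a real implementation would take them from the PREFIX declarations or the JSON-LD @context.

```python
# Prefix map: the "sn" and "im" namespace IRIs are assumptions for illustration.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "sn": "http://snomed.info/sct#",
    "im": "http://endhealth.info/im#",
}

# Aliases for core language tokens, as in the rdfs:subClassOf example.
ALIASES = {"subClassOf": "rdfs:subClassOf"}


def expand(token: str) -> str:
    """Resolve a full IRI in <>, an abbreviated IRI or an alias to a full IRI."""
    if token.startswith("<") and token.endswith(">"):
        return token[1:-1]  # already a full IRI
    token = ALIASES.get(token, token)  # replace an alias by its abbreviated IRI
    prefix, separator, local_name = token.partition(":")
    if separator and prefix in PREFIXES:
        return PREFIXES[prefix] + local_name
    raise ValueError(f"cannot resolve {token!r} without more context")


print(expand("subClassOf"))
print(expand("sn:29857009"))
```

A token that matches neither an alias nor a known prefix raises an error rather than being guessed, which reflects the point that local identifiers are ambiguous in isolation.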

OWL2 and RDFS

For the purposes of authoring and reasoning, the semantic ontology axiom and class expression vocabulary uses the tokens and structure from the OWL2 profile OWL EL, which itself is a sublanguage of the OWL2 language.

In addition to the open world assumption of OWL, the RDFS constructs of domain and range (as in OWL DL) are used, but in a closed world manner, as in RDFS.

Within an information model instance itself the data relationships are held in their post inferred closed form, i.e. inferred properties and relationships are explicitly stated, using a normalisation process to eliminate duplications from super types. In other words, whereas an ontology may be authored using the open world assumption, prior to population of the live IM, classifications and inheritance are resolved. This uses the same approach as followed by Snomed-CT, whereby the inferred relationship containing the inherited properties and the "isa" relationship are included explicitly.

In the live IM OWL Axioms are replaced with the RDFS standard terms and simplified. For example OWL existential quantifications are mapped to "role groups" in line with Snomed-CT.
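
One part of such a normalisation, removing super types that are already implied by other super types, can be sketched in Python as follows. This is an illustrative sketch only and not the IM's actual classification process; the small hierarchy and the function names are assumptions.

```python
def ancestors(concept, parents):
    """Return every super type reachable from a concept via the parent map."""
    seen = set()
    stack = list(parents.get(concept, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def proximal_parents(concept, parents):
    """Keep only the stated super types that no other stated super type implies."""
    direct = set(parents.get(concept, ()))
    redundant = set()
    for parent in direct:
        redundant |= ancestors(parent, parents) & direct
    return direct - redundant


# Hypothetical hierarchy: "pain" is stated directly but is also inherited.
hierarchy = {
    "chest pain": {"pain", "pain of truncal structure", "finding of region of thorax"},
    "pain of truncal structure": {"pain"},
    "finding of region of thorax": {"finding"},
    "pain": {"finding"},
}

print(sorted(proximal_parents("chest pain", hierarchy)))
```

Here "pain" is dropped from the stated super types of "chest pain" because it is already inherited through "pain of truncal structure".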

Use of Annotation properties

Annotation properties are the properties that provide information beyond that needed for reasoning. They play no part in the ontological reasoning, but without them the information model would be impossible for most people to understand.

Typical annotation properties are names and descriptions.

Owl construct | Usage examples | IM live conversion
Class | An entity that is a class concept e.g. a Snomed-CT concept or a general concept | rdfs:Class
ObjectProperty | 'hasSubject' (an observation has a subject that is a patient) | rdf:Property
DataProperty | 'dateOfBirth' (a patient record has a date of birth attribute) | owl:DatatypeProperty
annotationProperty | 'description' (a concept has a description) |
SubClassOf | Patient is a subclass of a Person | rdfs:subClassOf
Equivalent To | Adverse reaction to Atenolol is equivalent to an adverse reaction to a drug AND has causative agent of Atenolol (substance) | rdfs:subClassOf
Sub property of | has responsible practitioner is a subproperty of has responsible agent | rdfs:subPropertyOf
Property chain | 'is sibling of' / 'is parent of' / 'has parent' is a sub property chain of 'is first cousin of' | owl:Property chain
Existential quantification (ObjectSomeValuesFrom) | Chest pain and Finding site of - {some} thoracic structure | im:roleGroup
Object Intersection | Chest pain is equivalent to pain of truncal structure AND finding in region of thorax AND finding site of thoracic structure | rdfs:subClassOf + role groups
DataType definition | Date time is a restriction on a string with a regex that allows approximate dates |
Property domain | A property domain of has causative agent is allergic reaction | rdfs:domain
Property range | A property range of has causative agent is a substance | rdfs:range

Annotation | Meaning
rdfs:label | The name or term for an entity
rdfs:comment | The description of an entity

SHACL shapes

SHACL is used as a means of specifying the "data model types" of health record entities and also the IM itself as described directly in the meta model article.

SHACL is used in its standard form and is not extended.

OWL extension : data property expressions

Within health care, (and in common parlance), data properties are often used as syntactical short cuts to objects with qualifiers and a literal value element.

For example, the data property "Home telephone number" would be expected to simply contain a number. But a home telephone number also has a number of properties by implication, such as the fact that its usage is "home", and has a country and area code.

OWL 2 has a known limitation (as described in the OWL specification itself) in respect of data property expressions. OWL2 can only define data property expressions as data property IRIs with annotations.

In many health care standards such as HL7 FHIR, these data properties are object properties, with the objects having the "value" as one of their properties.

For example, in FHIR the patient's home telephone number is carried explicitly as the property contact {property= telecom -> value = {property use= Home, /property System= coding system,/ value = the actual number } } i.e. 3 levels of nesting.

Whilst explicit modelling is vital for information exchanged between systems with different data models, if stored in this way, queries would underperform, so the actual systems usually store the home telephone number perhaps in a field "home telephone" in the patient table or a simple triple.

To bridge the gap between a complex object definition and a simple data property, the information model supports data property expressions (but without introducing a new language construct) as follows:

  1. Simple data property against the class e.g. a "contact"
  2. Patient's home telephone number modelled as a sub property "homeTelephoneNumber", which is a sub property of "telephone number", which is itself a sub property of "contact".
  3. A standard RDFS property of the homeTelephone property entity -> "isDefinedBy" which points to a class expression which defines a home telephone number (itself a subclass of a class expression TelephoneNumber), thus allowing all property values to be "implicit but defined" as part of the ontology.

By this technique, subsumption queries that look for home contacts or home telephone numbers, or find numbers with US country codes, will find the relevant field and the relevant sub pattern of a data property.
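
The sub property part of this technique can be sketched in Python as below. This is a minimal illustration under stated assumptions: the property names follow the example above, whereas `emailAddress`, the sample record and the function names are hypothetical.

```python
# Each property points to its super property, following the list above.
# "emailAddress" is a hypothetical sibling added for illustration.
SUB_PROPERTY_OF = {
    "homeTelephoneNumber": "telephoneNumber",
    "telephoneNumber": "contact",
    "emailAddress": "contact",
}


def is_sub_property(prop, target):
    """True if prop is the target property or one of its descendants."""
    while prop is not None:
        if prop == target:
            return True
        prop = SUB_PROPERTY_OF.get(prop)
    return False


def select(record, target):
    """Return the fields of a flat record that are subsumed by the target property."""
    return {field: value for field, value in record.items()
            if is_sub_property(field, target)}


# A flat record, as a system might actually store it.
patient = {
    "homeTelephoneNumber": "+1 212 555 0100",
    "emailAddress": "someone@example.org",
    "dateOfBirth": "1980-01-01",
}

print(select(patient, "telephoneNumber"))
print(select(patient, "contact"))
```

A query for "contact" finds both the telephone and the email field, while a query for "telephoneNumber" finds only the home telephone number, even though the record stores each as a simple field. Matching on sub patterns such as a country code would additionally need the class expression referenced by "isDefinedBy", which is not modelled in this sketch.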

Implementations would still need to parse numbers into properties if they stored numbers as simple numbers, but these parsing rules would be part of a data model map against the IM model's definition.

Information model meta classes

See main article Information model meta classes

Using the above languages this defines the classes used to model all health data.