Health Information modelling language - overview: Difference between revisions

Revision as of 15:31, 26 December 2020

Please note: the information in this section represents a specification of intent and is a work in progress. Actual implementations using the language only partially implement the grammars and syntaxes described here.

Background and rationale

Question: Yet another language? Surely not.

Answer: Not quite.

The Discovery modelling language can be considered "a mixed language representing a convergence of modern semantic web based modelling languages".

This means it is designed to illustrate a means of eliminating the conflicting syntaxes from different open standard modelling languages by applying a pragmatic and easy to understand grammar to achieve an integrated approach.

The approach is based on the observation that, since the idea of the semantic web has become mainstream, despite the apparent plethora of recommendations, as well as disagreements amongst informaticians, there is an underlying convergence towards a common approach to representing data relationships.

Prior to the semantic web idea, information modelling was considered as either hierarchical or relational. Healthcare informatics adopted the hierarchical approach, which resulted in the adoption of standards such as EDIFACT and HL7, or simple inline typed constructs such as those used in the NHS Data Dictionary.

Following the publication of the Resource Description Framework (RDF), which brought in the fundamentals of spoken language grammar such as Subject/Predicate/Object, and its combination with the mathematical constructs of description logic and graph theory, a plethora of grammars has evolved, each designed to tackle different aspects of data modelling. There is nevertheless a tendency towards the use of an IRI (Internationalised Resource Identifier) to represent classes and properties, and the use of chained triples or graphs to represent relationships.

All of these show a degree of convergence in that they are all based on the same fundamentals. The Discovery information modelling language is designed to demonstrate a real world practical application of a convergent approach to modelling the multi-organisational health records of a population of millions of citizens. The combined language enables a single integrated approach to modelling data whilst at the same time supporting the interoperable, standards based languages used for the various specialised purposes. The open community based languages, and their various syntaxes, can be considered specialised sublanguages of the Discovery language.

Venn diagram of language components

The language is designed to support the 3 main purposes of information modelling, which are: Inference, validation and enquiry.

To support these purposes, the language is used to model 3 main types of constructs: Ontology, Data model (or shapes), and Query.

It is not necessary to understand the standard languages in order to understand the modelling or to use Discovery. For those who have an interest and a technical aptitude, the best places to start are OWL2, SHACL, SPARQL, GraphQL and specialised use case based constructs such as ABAC. For those who want to get to grips with the underlying logic, the best places to start are first order logic and description logic, together with an understanding of at least one programming language (such as C#, Java, JavaScript or Python) and any query language such as SQL.

The only purpose of a language is to help create, maintain, and represent information models, and thus how the languages are used is best seen in the sections on the Information model.

The remainder of this article describes the language itself, starting with some high level sections on the components, and eventually providing a specification of the language and links to technical implementations, all of which are open source.

The language components

The Concept

Common to all of the language is the modelling abstraction "concept", which is an idea that can be defined, or at least described. All classes and properties in a model are represented as concepts. In line with semantic web standards a concept is represented in two forms:

A named concept, the name being an Internationalised Resource Identifier (IRI). A concept is normally annotated with human readable labels such as clinical terms and descriptions.
An unnamed (anonymous) concept, which is defined by an expression, which itself is made up of named concepts or expressions.
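The two forms of concept can be sketched as a small data structure. This is only an illustration, not the Discovery class model: the class names `NamedConcept` and `Intersection`, the helper `named_parts` and the `http://example.org/im#` IRIs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class NamedConcept:
    iri: str         # the identifier that names the concept
    label: str = ""  # human readable annotation


@dataclass(frozen=True)
class Intersection:
    """An anonymous concept defined by an expression over other concepts."""
    operands: Tuple["Concept", ...]


Concept = Union[NamedConcept, Intersection]


def named_parts(concept):
    """Collect the IRIs of every named concept used in an expression."""
    if isinstance(concept, NamedConcept):
        return {concept.iri}
    parts = set()
    for operand in concept.operands:
        parts |= named_parts(operand)
    return parts


# Hypothetical IRIs, used only for illustration.
asthma = NamedConcept("http://example.org/im#Asthma", "Asthma")
severe = NamedConcept("http://example.org/im#Severe", "Severe")
severe_asthma = Intersection((asthma, severe))  # anonymous: it has no IRI of its own
```

Because expressions may nest, `named_parts` recurses until it reaches named concepts, which mirrors the statement that an expression is itself made up of named concepts or expressions.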

Concepts are specialised into classes or properties and there is a wide variety of types and purposes of properties.

The language vocabulary also includes specialised types of properties, effectively used as reserved words. For example, the ontology uses a type of property known as an Axiom, which states the definition of a concept, such as the axiom "is a subclass of" to state that class A is subsumed by class B (every instance of A is also an instance of B). A data model may use a specialised property "target class" to state the class which the shape is describing and constraining for a particular business purpose. The content of these vocabularies is dictated by the grammar specification, but the properties and their purpose are derived directly from the sublanguages.
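As a minimal sketch of how an "is a subclass of" axiom supports inference, the following walks a table of subclass axioms to answer a subsumption question. The `ex:` concept identifiers and the `SUBCLASS_OF` table are hypothetical, and a real reasoner does far more than this transitive walk.

```python
# Hypothetical "is a subclass of" axioms: child -> set of direct parents.
SUBCLASS_OF = {
    "ex:AllergicAsthma": {"ex:Asthma"},
    "ex:Asthma": {"ex:RespiratoryDisorder"},
    "ex:RespiratoryDisorder": {"ex:Disorder"},
}


def is_subsumed_by(concept, ancestor, axioms=SUBCLASS_OF):
    """True if concept equals ancestor or reaches it through subclass axioms."""
    if concept == ancestor:
        return True
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in axioms.get(current, ()):
            if parent == ancestor:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False
```

For example, `is_subsumed_by("ex:AllergicAsthma", "ex:Disorder")` is true because the axioms chain together, while the reverse question is false.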

Grammars and syntaxes

The Discovery language, as a mixed language, has its own grammars as below, but in addition the language sub components can be used in their respective grammars and syntaxes. This enables multiple levels of interoperability, including between specialised community based languages and more general languages.

For example, the Snomed-CT community has a specialised language, "Expression constraint language" (ECL), which can also be directly mapped to OWL2 and Discovery, and thus the Discovery language maps to the 4-6 main OWL syntaxes as well as ECL. Each language has its own nuances, usually designed to simplify the representation of complex structures. For example, in ECL the reserved word MINUS (used to exclude certain subclasses from a superclass) maps to the much more obscure OWL2 syntax that requires class IRIs to be "punned" as individual IRIs in order to properly exclude instances when generating lists of concepts.
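The effect of MINUS when generating a list of concepts can be pictured as a set difference over a hierarchy. The sketch below is an assumption-laden illustration: the `ex:` concepts and the `CHILDREN` table are invented, and it models only the common ECL pattern `<< A MINUS << B` (A and its descendants, excluding B and its descendants).

```python
# Hypothetical hierarchy: parent -> list of direct children.
CHILDREN = {
    "ex:Diabetes": ["ex:Type1Diabetes", "ex:Type2Diabetes", "ex:GestationalDiabetes"],
    "ex:Type2Diabetes": ["ex:Type2DiabetesWithRetinopathy"],
}


def descendants_or_self(concept, children=CHILDREN):
    """The concept together with everything below it in the hierarchy."""
    result = {concept}
    for child in children.get(concept, []):
        result |= descendants_or_self(child, children)
    return result


def minus(include, exclude):
    """Members of `include` and its descendants, without `exclude` and its descendants."""
    return descendants_or_self(include) - descendants_or_self(exclude)
```

Here `minus("ex:Diabetes", "ex:Type2Diabetes")` keeps the top concept and the two unrelated subtypes, and drops the excluded branch in full.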

The Discovery language has its own grammars, which include:

A human natural language approach to describing content, presented as optional terminal literals to the terse language

A terse abbreviated language, similar to Turtle

A proprietary JSON based grammar, which maps directly to the internal class structures used in Discovery

An open standard JSON-LD representation

Because the information models are accessible via APIs, systems can use any of the above, or exchange information in the specialised standard sublanguages, which are:

Expression constraint language (ECL) with its single string syntax

OWL2 DL presented as functional syntax, RDF/XML, Manchester, JSON-LD

SHACL presented as JSON-LD

GraphQL presented as JSON-LD (GraphQL-LD) or natively as GraphQL

GRAPHQL

GraphQL, despite its name, is not in itself a query language but a way of representing the graph-like structure of an underlying model that has been built using OWL. GraphQL has a very simple class and property representation and is ideal for REST APIs; results are JSON objects, in line with the approach taken by the Discovery syntax above.

Nevertheless, GraphQL considers properties to be functions (higher order logic) and therefore properties can accept parameters. For example, a patient's average systolic blood pressure reading could be considered a property with a single parameter, being a list of the last 3 blood pressure readings.

Thus the capability of GraphQL is extended by modelling the types passed as property parameters, which supports filtering, sorting and limiting of results in the same way as any other query language. Subqueries are supported in the same way.
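The idea of a property as a function with parameters can be sketched in plain Python, without any GraphQL library. Everything here is hypothetical: the `PATIENT` record, the property names `blood_pressure` and `average_systolic`, and the parameter names are assumptions chosen to mirror the blood pressure example.

```python
from statistics import mean

# Hypothetical patient record with blood pressure readings.
PATIENT = {
    "id": "p1",
    "bloodPressure": [
        {"date": "2020-01-10", "systolic": 160},
        {"date": "2020-06-02", "systolic": 140},
        {"date": "2020-03-15", "systolic": 150},
        {"date": "2020-11-20", "systolic": 130},
    ],
}


def blood_pressure(patient, sort_by="date", descending=True, limit=None, min_systolic=None):
    """A property that accepts parameters for filtering, sorting and limiting."""
    readings = patient["bloodPressure"]
    if min_systolic is not None:
        readings = [r for r in readings if r["systolic"] >= min_systolic]
    readings = sorted(readings, key=lambda r: r[sort_by], reverse=descending)
    return readings if limit is None else readings[:limit]


def average_systolic(patient, last=3):
    """A derived property computed from a parameterised subquery."""
    readings = blood_pressure(patient, limit=last)
    return mean(r["systolic"] for r in readings)
```

`average_systolic` reuses `blood_pressure` as a subquery, which is the same composition the text describes: the parameters of one property drive the selection that another property is computed from.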

GraphQL has been chosen over SPARQL for reasons of simplicity, and many now consider GraphQL to be a de facto standard.

ABAC language

Main article: Discovery ABAC language

The standard XACML specifies a language that may be used to implement ABAC. XACML includes a set of grammatical concepts such as policy sets, policies, rules, combination rules, targets, obligations, effects and so on with many and variable sophisticated tokens and functions used to build the policy rules. XACML has its own XML syntax that can be used directly.

This language is somewhat disconnected from the other standards in terms of syntax and approach to vocabulary. Consequently Discovery uses a JSON profile of XACML as its ABAC language, which itself models the attributes as OWL properties and uses SPARQL as its rule representation.
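To make the policy, rule, target and effect vocabulary concrete, here is a minimal sketch of attribute based evaluation. It is not the Discovery JSON profile and does not use SPARQL for its rules: the `POLICY` structure, the attribute names and the simple equality matching are all assumptions, and the rules are combined with a deny-overrides strategy.

```python
# Hypothetical policy: attribute names and values are illustrative only.
POLICY = {
    "rules": [
        {"effect": "Permit", "target": {"role": "clinician", "purpose": "direct care"}},
        {"effect": "Deny", "target": {"patientConsent": "withdrawn"}},
    ]
}


def evaluate(policy, request):
    """Return Permit, Deny or NotApplicable using a deny-overrides combination."""
    effects = [
        rule["effect"]
        for rule in policy["rules"]
        # A rule applies when every attribute in its target matches the request.
        if all(request.get(attr) == value for attr, value in rule["target"].items())
    ]
    if "Deny" in effects:
        return "Deny"
    if "Permit" in effects:
        return "Permit"
    return "NotApplicable"
```

A request from a clinician for direct care is permitted unless the consent attribute says consent has been withdrawn, in which case the deny rule takes precedence.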

Semantic Ontology

Main article: Discovery semantic ontology language

The semantic ontology subsumes OWL2 DL.

OWL2, like Snomed-CT, forms the logical basis for the static data representations, including semantic definition, data modelling and modelling of value sets. OWL2 subsets of Discovery are available in the Discovery syntaxes or the OWL2 syntaxes.

In its usual use, OWL2 is used for reasoning and classification via the use of the Open world assumption. In effect this means that OWL2 can be used to infer X from Y which forms the basis of most subsumption or entailment queries in healthcare.

OWL2 is also used to model property domains that may then be used as editorial policies. Where OWL2 normally models the domains of a property in order to infer the class of a certain entity, one can use the same grammar in editorial policies, i.e. only certain properties are allowed for certain classes. For example, where OWL2 may say that one of the domains of a causative agent is an allergy (i.e. an unknown class with a property of causative agent is likely to be an allergy), in the modelling the editorial policy states that an allergy can only have properties that are allowed via the property domain. Thus the Snomed MRCM can be modelled in OWL2.
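The two readings of a property domain can be contrasted in a short sketch. The `PROPERTY_DOMAINS` table and the `ex:` identifiers are hypothetical, and the policy check ignores the class hierarchy for brevity, so it is a simplification rather than a reasoner or an MRCM implementation.

```python
# Hypothetical property domains: property -> classes it may be used on.
PROPERTY_DOMAINS = {
    "ex:causativeAgent": {"ex:Allergy"},
    "ex:dose": {"ex:MedicationOrder"},
}


def infer_classes(properties):
    """Inference reading: using a property implies membership of its domain classes."""
    inferred = set()
    for prop in properties:
        inferred |= PROPERTY_DOMAINS.get(prop, set())
    return inferred


def disallowed_properties(declared_class, properties):
    """Editorial policy reading: list properties not allowed for the declared class."""
    return [p for p in properties if declared_class not in PROPERTY_DOMAINS.get(p, set())]
```

The same table answers both questions: `infer_classes` concludes that an entity with a causative agent is an allergy, while `disallowed_properties` rejects a dose recorded against an allergy.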

The grammar for the semantic ontology language used for reasoning is OWL EL, which is a limited profile of OWL DL. The language used for some aspects of data modelling and value set modelling is OWL2 DL, as the more expressive constructs such as union (ORs) are required.

As such the ontology supports the OWL2 syntaxes such as the Functional syntax and Manchester syntax, but can be represented by JSON-LD or the Discovery JSON based syntax, as part of the full information modelling language.

Together with the query language, OWL2 DL also makes the language compatible with Expression constraint language, which is used as the standard for specifying Snomed-CT expression queries.

Ontology purists will notice that modelling a "data model" in OWL2 is in fact a breach of the fundamental open world assumption taken in ontologies and applies the closed world assumption instead. Consequently, the sublanguage used for data modelling uses OWL for inferencing but SHACL for describing the models.

The ontologies that are modelled are considered as modular ontologies. It is not expected that one "mega ontology" would be authored, but that there would be maximum sharing of concept definitions (known as axioms), which results in a super ontology of modular ontologies.

Data modelling and semantic interoperability

Data models, concept definitions and objects are modelled in the OQL language using the graph paradigm. As a result, all content can be viewed as semantic triples consisting of subject, predicate and object.
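Viewing content as triples can be illustrated with a tiny in-memory graph and a pattern match. The triples and the `ex:` identifiers are invented for the example, and `match` is a hypothetical helper rather than part of any Discovery API.

```python
# Hypothetical content expressed as (subject, predicate, object) triples.
TRIPLES = [
    ("ex:Asthma", "rdfs:subClassOf", "ex:RespiratoryDisorder"),
    ("ex:Asthma", "rdfs:label", "Asthma"),
    ("ex:Encounter", "ex:hasPatient", "ex:Patient"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every position that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]
```

Leaving a position as `None` acts as a wildcard, so `match(TRIPLES, subject="ex:Asthma")` returns everything stated about that concept, whichever kind of construct stated it.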

Data modelling takes account of ontology modularisation. A particular data model is a particular business oriented perspective on a set of concepts. As there are potentially thousands of different perspectives (e.g. a GP versus a geneticist), there is a potentially unlimited number of data models. All the data models in Discovery share the same atomic concepts and the same semantic definitions across ontologies where possible; where this is not possible, mapping relationships are used. The binding of a data model to its property values is based on a business specific model. For example, a standard FHIR resource will map directly to the equivalent data model class, property and value set, whose meaning is defined in the semantic ontology, but the same data may be carried in a non-FHIR resource without loss of interoperability.

A common approach to modelling and the use of a standard approach to ontology, together with modularisation, means that any sending or receiving machine which uses concepts from the "super" ontology can adopt full semantic interoperability. If both machines use the same data model for the same business, the data may be presented in the same relationship; if the two machines use different data models for different businesses, they may present the data in different ways, but without any loss of meaning or query capability.

Data mapping

This part of the language is used to define mappings between the data model and an actual schema, to enable queries and filters to cope automatically with the ever-extending ontology and data properties.

The language can be used to auto generate starter schemas for implementation i.e. schemas that will then be optimised for real world use.
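Generating a starter schema from a data model might look something like the following. This is a sketch under stated assumptions: the `SHAPE` dictionary, its keys, the `SQL_TYPES` table and the `starter_ddl` function are all hypothetical and do not reflect the actual Discovery mapping grammar.

```python
# Hypothetical data model shape and datatype mapping, for illustration only.
SHAPE = {
    "targetClass": "Patient",
    "properties": [
        {"name": "nhs_number", "datatype": "string", "required": True},
        {"name": "date_of_birth", "datatype": "date", "required": True},
        {"name": "registered_practice", "datatype": "string", "required": False},
    ],
}

SQL_TYPES = {"string": "VARCHAR(255)", "date": "DATE", "integer": "INTEGER"}


def starter_ddl(shape):
    """Build a CREATE TABLE statement with one column per data model property."""
    columns = []
    for prop in shape["properties"]:
        null = " NOT NULL" if prop.get("required") else ""
        columns.append(f'  {prop["name"]} {SQL_TYPES[prop["datatype"]]}{null}')
    header = f'CREATE TABLE {shape["targetClass"].lower()} (\n'
    return header + ",\n".join(columns) + "\n);"
```

The output is deliberately naive: one table per target class and one column per property. That matches the idea of a starter schema which is then optimised by hand for real world use.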

The main use case for the mapping sublanguage is data transformation. This uses techniques such as object relational mapping (ORM), and therefore the transform instructions, in the form of maps, follow this approach. There is no single standard for ORM maps, but best practice of the kind supported by open source utilities such as Hibernate is followed:

@@ Line 67: / Line 67: @@
 * SHACL presented as JSON-LD
-* SHACL SPARQL
+* GRAPHQL presented as JSON-LD(GraphQL-LD)  or GraphQL natively
-* GRAPHQL-LD  as JSON-LD
+<br />
-=== OWL2 ===
-OWL2, like Snomed-CT, forms the log'''ical basis''' for the static data representations, including semantic definition, data modelling and modelling of value sets.
-Because raw OWL2 language (with its 4 syntaxes) is quite arcane to use, Discovery has created a JSON/XML based projection of the language which is simpler to follow and use.  This makes access to the information model building blocks more interoperable, for example via the use of REST APIs and JSON. The JSON is mapped directly to classes to enable processing via languages such as Java. RDF/XML (one of the OWL syntaxes) can be exchanged as an option but many are unfamiliar with the OWL variation.
-In its usual use, OWL2 is used for reasoning and classification via the use of the [[wikipedia:Open-world_assumption|Open world assumption]]. This is also used in the information modelling, but  the need to constrain the chaos of health data via data models requires the use of a [[wikipedia:Closed-world_assumption|closed world assumption.]] To enable this the OWL syntax is still used but is interpreted in a closed world manner.  For example, where OWL2 models domains of a property in order to infer the class of a certain entity, Discovery uses the same syntax for use in editorial policies. Where OWL2 may say that one of the  domains of a causative agent is an allergy (i.e.an unknown class with a property of causative agent is likely to be an allergy), in the data modelling the editorial policy states that an allergy can only have properties that are allowed via the property domain. Thus the Snomed MRCM can be modelled in OWL2.
 === GRAPHQL ===
@@ Line 94: / Line 87: @@
 This language is somewhat disconnected with the other standards in terms of syntax and approach to vocab. Consequently Discovery uses a J[[Discovery ABAC language|SON profile of XACML]] as its ABAC language which itself models the attributes as OWL properties, and uses SPARQL as its rule representation.
-== Objectives and purposes ==
+=== Semantic Ontology ===
-The purpose of the language is to help build a health information information model in a way that supports implementations of the model using different data base technologies and query languages.
-The underlying philosophy behind the use language is described in the article : [[Information modelling language - philosophy]]
+''Main article''  [[Discovery semantic ontology language]]
-[[File:Sublanguages.png|thumb]]
-This article focuses on the description of the language itself.
-As mentioned above each sublanguage is based on the grammar of  a single recognised standards based language, the language having been selected as the ones that are the closest fit to the information requirements that the Discovery health information model is designed to support. A single grammar and optional single syntax enables the model to operate in an integrated manner, but at the same time enables the sublanguages to be represented in their native standard languages.
+The semantic ontology subsumes OWL2 DL.
-Like [https://www.nlm.nih.gov/research/umls/index.html UMLS], the sublanguages use a common link in cross reference, a [[Discovery semantic ontology language|concept,]] which is identified with a unique identifier (Internationalised resource identifier - IRI) . a concept is usually named and defined semantically, and forms the means of traversing the model from different starting points to different end points, for different purposes.
+OWL2, like Snomed-CT, forms the log'''ical basis''' for the static data representations, including semantic definition, data modelling and modelling of value sets.OWL2 subsets of Discovery are available in the Discovery syntaxes or the OWL 2 syntaxes.
-== Semantic Ontology ==
+In its usual use, OWL2 is used for reasoning and classification via the use of the [[wikipedia:Open-world_assumption|Open world assumption]]. In effect this means that OWL2 can be used to infer X from Y which forms the basis of most [[Subsumption test|subsumption]] or entailment queries in healthcare.
-''Main article''  [[Discovery semantic ontology language]]
-The semantic ontology language is part of the Discovery information modelling language.
+OWL2 is also used to model property domains that then may be used as editorial policies.  Where OWL2 normally models domains of a property in order to infer the class of a certain entity, one can use the same grammar for use in editorial policies i,e. only certain properties are allowed for certain classes.  For example, where OWL2 may say that one of the  domains of a causative agent is an allergy (i.e.an unknown class with a property of causative agent is likely to be an allergy), in the modelling the editorial policy states that an allergy ''can only'' have properties that are allowed via the property domain. Thus the Snomed MRCM can be modelled in OWL2.
-The grammar for the semantic ontology language used for the Discovery ontology is  [https://www.w3.org/TR/owl2-profiles/#OWL_2_EL OWL EL], which is  limited profile of OWL DL. The language used for data modelling and [[Value sets|value set]] modelling is [https://www.w3.org/TR/owl2-syntax/ OWL2 DL] as the more expressive constructs are required.
+The grammar for the semantic ontology language used for reasoning is  [https://www.w3.org/TR/owl2-profiles/#OWL_2_EL OWL EL], which is  limited profile of OWL DL. The language used for some aspects of data modelling and [[Value sets|value set]] modelling is [https://www.w3.org/TR/owl2-syntax/ OWL2 DL] as the more expressive constructs such as union (ORS) are required.
-As such the ontology supports the OWL2 syntaxes such as the Functional syntax and Manchester syntax, but also supports the Discovery JSON based syntax, as part of the full information modelling language.
+As such the ontology supports the OWL2 syntaxes such as the Functional syntax and Manchester syntax, but can be represented by JSON-LD or the Discovery JSON based syntax, as part of the full information modelling language.
 Together with the query language, OWL2 DL makes the language compatible also with [https://confluence.ihtsdotools.org/display/DOCECL/Expression+Constraint+Language+-+Specification+and+Guide Expression constraint language] which is used as the standard for specifying Snomed-CT expression query.
-Ontology purists will notice that modelling a data model in OWL2 is in fact a breach of the fundamental &nbsp;[[wikipedia:Open-world_assumption|open world assumption]]&nbsp;view of the world taken in ontologies and instead applies the&nbsp;[[wikipedia:Closed-world_assumption|closed world assumption]]&nbsp;view instead. Consequently, a data model would normally be used independently of DL
+Ontology purists will notice that modelling a "data model" in OWL2 is in fact a breach of the fundamental &nbsp;[[wikipedia:Open-world_assumption|open world assumption]]&nbsp;view of the world taken in ontologies and instead applies the&nbsp;[[wikipedia:Closed-world_assumption|closed world assumption]]&nbsp;view instead. Consequently, the sublanguage used for data modelling uses OWL for inferencing but SHACL for describing the models.
 The ontologies that are modelled are considered as modular ontologies. it is not expected that one "mega ontology" would be authored but that there would be maximum sharing of concept definitions (known as axioms)  which results in a super ontology of modular ontologies.