Health Information modelling language - overview

From Endeavour Knowledge Base
Revision as of 10:49, 26 December 2020

Please note. The information in this section represents a specification of intent and work in progress. Actual implementations using the language are under continuous development with partial implementation of the grammars and syntaxes.

Background and rationale

Question: Yet another language? Surely not.

Answer: Not quite.

The Discovery modelling language can be considered "an operational demonstrator of a convergence of modern data modelling languages". This means it is designed to illustrate a means of eliminating the conflicting syntaxes of different open-standard modelling languages by applying a pragmatic, easy-to-understand grammar to achieve an integrated approach.

The approach is based on the observation that, since the idea of the semantic web became mainstream, and despite the apparent plethora of recommendations, there has been an underlying convergence towards a common approach to representing data relationships.

Prior to the semantic web, information modelling was considered either hierarchical or relational. Healthcare informatics adopted the hierarchical approach, which resulted in standards such as EDIFACT and HL7, or simple online typed constructs such as the NHS Data Dictionary.

The semantic web brought in the fundamentals of spoken-language grammar, such as subject/predicate/object. Combined with the mathematical constructs of description logic and graph theory, this has produced a plethora of grammars, each designed to tackle a different aspect of data modelling. There is nevertheless a tendency towards the use of an IRI (Internationalised Resource Identifier) to represent classes and properties, and the use of chained triples, or graphs, to represent relationships.
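The idea of IRI-named classes and properties linked by chained triples can be sketched in a few lines of code. This is a minimal illustration only; the IRIs and class names below are hypothetical, not part of any standard or of the Discovery language itself.

```python
# Minimal sketch: relationships as subject-predicate-object triples,
# with IRIs naming the entities and properties (identifiers hypothetical).
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str    # IRI of the entity being described
    predicate: str  # IRI of the property (the relationship)
    object: str     # IRI of the related entity, or a literal value

# Two chained triples form a small graph: the object of one triple
# is the subject of the next.
triples = [
    Triple("http://example.org/Asthma", "rdfs:subClassOf", "http://example.org/LungDisease"),
    Triple("http://example.org/LungDisease", "rdfs:subClassOf", "http://example.org/Disease"),
]

def superclasses(iri: str) -> set:
    """Follow subClassOf links to collect all superclasses of a node."""
    result = set()
    for t in triples:
        if t.subject == iri and t.predicate == "rdfs:subClassOf":
            result.add(t.object)
            result |= superclasses(t.object)
    return result

print(superclasses("http://example.org/Asthma"))
```

Traversing the chain in this way is the essence of graph-based querying: the relationships, not a fixed schema, determine what can be reached.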

All of these show a degree of convergence in that they are based on the same fundamentals. Discovery is designed to demonstrate the practical application of a convergent approach to modelling the multi-organisational health records of a population of millions of citizens. The combined language enables a single integrated approach to modelling data whilst at the same time supporting the interoperable, standards-based languages used for various specialised purposes. The standards-based languages, and their various syntaxes, can be considered specialised sublanguages of the Discovery language.

Venn diagram of language components

The Discovery language is designed to support the three main purposes of information modelling: inference, validation and enquiry.

To support these purposes, the language is used to model three main types of construct: ontology, data model (or shapes), and query.

It is not necessary to understand the standard languages in order to understand the modelling or to use Discovery, but for those with an interest and a technical aptitude, the best places to start are OWL2, SHACL, SPARQL, GraphQL and specialised use-case-based constructs such as ABAC. For those who want to get to grips with the underlying logic, the best starting points are first-order logic and description logic, together with at least one programming language such as C#, Java, JavaScript or Python, plus a query language such as SQL.

The only purpose of a language is to help create, maintain and represent information models, and thus how the languages are used is best seen in the sections on the information model.


The language components

The Common Concept

Common to all parts of the language is the modelling abstraction of a concept: an "idea" that can be defined, or at least described. All classes and properties are represented as concepts. In line with web standards, a concept is represented in two forms:

  1. A named concept, the name being an Internationalised Resource Identifier (IRI)
  2. An unnamed (anonymous) concept, defined by an expression, which is itself made up of named concepts or further expressions.
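The two forms of concept can be sketched as simple data structures. This is an illustrative sketch only; the class names, operator vocabulary and IRIs are assumptions, not the actual Discovery grammar.

```python
# Hedged sketch of the two forms of concept (names hypothetical).
from dataclasses import dataclass

@dataclass
class NamedConcept:
    iri: str          # an Internationalised Resource Identifier names it

@dataclass
class ConceptExpression:
    operator: str     # e.g. "intersectionOf"
    operands: list    # named concepts and/or nested expressions

# A named concept:
diabetes = NamedConcept("http://example.org/DiabetesMellitus")

# An anonymous concept, defined only by an expression over named concepts:
type2_in_pregnancy = ConceptExpression(
    "intersectionOf",
    [NamedConcept("http://example.org/Type2Diabetes"),
     NamedConcept("http://example.org/OccursDuringPregnancy")],
)
```

The anonymous form has no IRI of its own; it exists only as the expression that defines it, which is why expressions must ultimately bottom out in named concepts.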

Concepts are specialised into classes or properties, and properties come in a wide variety of types and serve many purposes.

The language vocabulary also includes specialised types of properties, effectively used as reserved words. For example, the ontology uses a type of property known as an axiom, which states the definition of a concept (for example, a "is a subclass of" b), whereas a data model may use a specialised property, "target class", to state the class that a shape describes and constrains for a particular business purpose. The content of these vocabularies is dictated by the grammar specification but is derived directly from the sublanguages.
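The two reserved-word properties just mentioned can be illustrated side by side. The dictionary keys and the `ex:` IRIs below are hypothetical; only `rdfs:subClassOf` and `sh:targetClass` are real terms, borrowed from RDFS and SHACL respectively.

```python
# Sketch: specialised properties used as reserved words (field names and
# ex: IRIs are hypothetical; rdfs:subClassOf and sh:targetClass are real).

# An ontology axiom stating "Asthma is a subclass of LungDisease":
axiom = {
    "subject": "ex:Asthma",
    "property": "rdfs:subClassOf",
    "object": "ex:LungDisease",
}

# A data-model shape using a "target class" property to state which
# class it describes and constrains (cf. sh:targetClass in SHACL):
shape = {
    "subject": "ex:AllergyShape",
    "property": "sh:targetClass",
    "object": "ex:Allergy",
}

print(axiom["property"], shape["property"])
```

Both are ordinary triples structurally; it is the reserved predicate that gives each one its special meaning to the ontology or the data model.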

Grammars and syntaxes

The Discovery language, as a mixed language, has its own grammars, but in addition the language sub-components can be used in their respective grammars and syntaxes. This enables multiple levels of interoperability, including between specialised community-based languages and more general languages.

For example, the Snomed-CT community has a specialised language, Expression Constraint Language (ECL), which can be directly mapped to OWL2 and to Discovery; thus the Discovery language maps to the 4-6 main OWL syntaxes as well as to ECL. Each language has its own nuances, usually designed to simplify the representation of complex structures. For example, in ECL the reserved word MINUS (used to exclude certain subclasses from a superclass) maps to a much more obscure OWL2 construct that requires class IRIs to be "punned" as individual IRIs in order to properly exclude instances when generating lists of concepts.
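The set semantics behind ECL's MINUS can be shown in a few lines. This is a sketch of the semantics only, not a real ECL evaluator, and the concept names and the pre-computed descendant sets are hypothetical.

```python
# Sketch of the set semantics behind ECL's MINUS (concept names and
# descendant sets hypothetical): "<< A MINUS << B" yields the
# descendants-and-self of A, excluding the descendants-and-self of B.
descendants_of = {
    "ClinicalFinding": {"ClinicalFinding", "Asthma", "Fracture", "Allergy"},
    "Allergy": {"Allergy"},
}

def ecl_minus(superclass: str, excluded: str) -> set:
    """Evaluate a simple '<< superclass MINUS << excluded' constraint."""
    return descendants_of[superclass] - descendants_of[excluded]

print(ecl_minus("ClinicalFinding", "Allergy"))
```

A single reserved word in ECL thus captures a set difference that takes considerably more machinery to express in OWL2.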

Discovery language has its own Grammars which include:

  • A human natural-language approach to describing content, presented as optional terminal literals alongside the terse language
  • A terse, abbreviated language similar to Turtle
  • A proprietary JSON-based grammar, which maps directly to the internal class structures used in Discovery
  • An open-standard JSON-LD representation

Because the information models are accessible via APIs, systems can use any of the above, or exchange information in the specialised standard sublanguages, which are:

  • Expression constraint language (ECL) with its single string syntax
  • OWL2 DL presented as functional syntax, RDF/XML, Manchester, JSON-LD
  • SHACL presented as JSON-LD
  • SHACL SPARQL
  • GRAPHQL-LD as JSON-LD

OWL2

OWL2, like Snomed-CT, forms the logical basis for the static data representations, including semantic definition, data modelling and modelling of value sets.

Because the raw OWL2 language (with its four syntaxes) is quite arcane to use, Discovery has created a JSON/XML-based projection of the language which is simpler to follow and use. This makes access to the information model building blocks more interoperable, for example via REST APIs and JSON. The JSON maps directly to classes, enabling processing in languages such as Java. RDF/XML (one of the OWL syntaxes) can be exchanged as an option, but many are unfamiliar with that OWL variation.
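A JSON projection of an OWL2 axiom might look something like the following. This is purely illustrative: the field names are assumptions for the sake of the example, not the actual Discovery grammar.

```python
# Purely illustrative: a hypothetical JSON projection of an OWL2 subclass
# axiom, of the kind described above. Field names are assumptions, not
# the actual Discovery grammar.
import json

axiom_json = """
{
  "concept": "ex:Asthma",
  "subClassOf": [
    {"intersectionOf": [
      {"class": "ex:LungDisease"},
      {"property": "ex:hasSite", "valueType": "ex:LungStructure"}
    ]}
  ]
}
"""

axiom = json.loads(axiom_json)
print(axiom["concept"])
```

The appeal of such a projection is that it parses directly into ordinary objects, so a REST client never needs an OWL-specific parser.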

In its usual use, OWL2 is used for reasoning and classification under the open world assumption. This is also used in the information modelling, but the need to constrain the chaos of health data via data models requires a closed world assumption. To enable this, the OWL syntax is still used but is interpreted in a closed world manner. For example, where OWL2 models the domains of a property in order to infer the class of an entity, Discovery uses the same syntax for editorial policies. Where OWL2 may say that one of the domains of a causative agent is an allergy (i.e. an entity of unknown class with a causative agent property is likely to be an allergy), in the data modelling the editorial policy states that an allergy can only have the properties allowed by the property domains. Thus the Snomed MRCM can be modelled in OWL2.
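The closed-world reading of property domains can be sketched as a validation check. All names below are hypothetical; the point is only the direction of inference: instead of inferring class membership from a property (open world), the editorial policy rejects properties not permitted for the entity's class (closed world).

```python
# Hedged sketch of the closed-world reading of property domains
# (all names hypothetical).
property_domains = {
    "ex:causativeAgent": {"ex:Allergy"},            # allowed only on Allergy
    "ex:severity": {"ex:Allergy", "ex:Fracture"},   # allowed on both
}

def validate(entity_class: str, properties: list) -> list:
    """Return the properties NOT permitted for this class (closed world)."""
    return [p for p in properties
            if entity_class not in property_domains.get(p, set())]

# An Allergy may carry a causative agent...
print(validate("ex:Allergy", ["ex:causativeAgent", "ex:severity"]))
# ...but a Fracture may not.
print(validate("ex:Fracture", ["ex:causativeAgent"]))
```

An open-world reasoner, given the same domain axioms, would instead conclude that anything carrying `ex:causativeAgent` is probably an allergy; the syntax is shared, the interpretation differs.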

GraphQL

GraphQL, despite its name, is not in itself a query language but a way of representing the graph-like structure of an underlying model that has been built using OWL. GraphQL has a very simple class/property representation, is ideal for REST APIs, and its results are JSON objects, in line with the approach taken by the Discovery syntax above.

Nevertheless, GraphQL considers properties to be functions (higher-order logic), and therefore properties can accept parameters. For example, a patient's average systolic blood pressure could be considered a property with a single parameter: the list of the last 3 blood pressure readings.

Thus the GraphQL capability is extended by enabling property parameters, as types, to support such things as filtering, sorting and limiting in the same way as any other query language. Subqueries are supported in the same way.
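The blood pressure example above can be sketched as a property-as-function. The readings and the function name are hypothetical; the point is that the property takes a parameter (how many recent readings to average) rather than being a stored value.

```python
# Sketch of a property-as-function (names and data hypothetical): an
# "average systolic blood pressure" property parameterised by the number
# of most recent readings, as in the GraphQL-style extension described above.
readings = [128, 135, 131, 140, 138]   # systolic readings, oldest to newest

def average_systolic(last_n: int) -> float:
    """A computed property: average of the last_n most recent readings."""
    recent = readings[-last_n:]
    return sum(recent) / len(recent)

print(average_systolic(3))
```

In GraphQL terms this is simply a field with an argument, e.g. `averageSystolic(lastN: 3)` (hypothetical field name); the resolver computes the value on demand.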

GraphQL has been chosen over SPARQL for reasons of simplicity, and many now consider GraphQL to be a de facto standard.

ABAC language

Main article: Discovery ABAC language

The XACML standard specifies a language that may be used to implement ABAC. XACML includes a set of grammatical concepts such as policy sets, policies, rules, combination rules, targets, obligations and effects, with many and varied sophisticated tokens and functions used to build the policy rules. XACML has its own XML syntax that can be used directly.

This language is somewhat disconnected from the other standards in terms of syntax and approach to vocabulary. Consequently, Discovery uses a JSON profile of XACML as its ABAC language, which models the attributes as OWL properties and uses SPARQL as its rule representation.
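The flavour of evaluating a JSON-expressed ABAC rule can be sketched as follows. This is illustrative only: the field names and attribute IRIs are assumptions in the spirit of a JSON profile of XACML, not the actual Discovery profile.

```python
# Illustrative sketch only: evaluating a single ABAC rule expressed as a
# JSON-style structure. Field names and attribute IRIs are assumptions,
# not the actual Discovery JSON profile of XACML.
rule = {
    "effect": "permit",
    "target": {"attribute": "ex:role", "equals": "ex:GeneralPractitioner"},
}

def evaluate(rule: dict, request_attributes: dict) -> str:
    """Permit if the request's attribute matches the rule's target."""
    target = rule["target"]
    if request_attributes.get(target["attribute"]) == target["equals"]:
        return rule["effect"]
    return "deny"

print(evaluate(rule, {"ex:role": "ex:GeneralPractitioner"}))
print(evaluate(rule, {"ex:role": "ex:Receptionist"}))
```

Real XACML adds combining algorithms, obligations and richer target functions on top of this basic attribute-match-to-effect pattern.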

Objectives and purposes

The purpose of the language is to help build a health information model in a way that supports implementations of the model using different database technologies and query languages.

The underlying philosophy behind the use of the language is described in the article: Information modelling language - philosophy.

Sublanguages.png

This article focuses on the description of the language itself.

As mentioned above, each sublanguage is based on the grammar of a single recognised standards-based language, each selected as the closest fit to the information requirements that the Discovery health information model is designed to support. A single grammar, with an optional single syntax, enables the model to operate in an integrated manner, while at the same time enabling the sublanguages to be represented in their native standard languages.

Like UMLS, the sublanguages use a common link for cross-reference: the concept, identified by a unique identifier (an Internationalised Resource Identifier, or IRI). A concept is usually named and defined semantically, and forms the means of traversing the model from different starting points to different end points, for different purposes.

Semantic Ontology

Main article: Discovery semantic ontology language

The semantic ontology language is part of the Discovery information modelling language.

The grammar for the semantic ontology language used for the Discovery ontology is OWL EL, which is a limited profile of OWL DL. The language used for data modelling and value set modelling is OWL2 DL, as the more expressive constructs are required.

As such, the ontology supports the OWL2 syntaxes, such as the functional syntax and Manchester syntax, but also supports the Discovery JSON-based syntax as part of the full information modelling language.

Together with the query language, OWL2 DL also makes the language compatible with Expression Constraint Language, which is the standard for specifying Snomed-CT expression queries.

Ontology purists will notice that modelling a data model in OWL2 is in fact a breach of the fundamental open world assumption taken in ontologies, applying a closed world view instead. Consequently, a data model would normally be used independently of DL.

The ontologies that are modelled are considered modular ontologies. It is not expected that one "mega ontology" would be authored, but rather that there would be maximum sharing of concept definitions (known as axioms), resulting in a super-ontology of modular ontologies.

Data modelling and semantic interoperability

Data models, concept definitions and objects are modelled in the OWL language using the graph paradigm. As a result, all content can be viewed as semantic triples consisting of subject, predicate and object.

Data modelling takes account of ontology modularisation. A particular data model is a particular business-oriented perspective on a set of concepts. As there are potentially thousands of different perspectives (e.g. a GP versus a geneticist), there is a potentially unlimited number of data models. All the data models in Discovery share the same atomic concepts and the same semantic definitions across ontologies where possible; where not, mapping relationships are used. The binding of a data model to its property values is based on a business-specific model. For example, a standard FHIR resource will map directly to the equivalent data model class, property and value set, whose meaning is defined in the semantic ontology, but the same data may be carried in a non-FHIR resource without loss of interoperability.

A common approach to modelling, the use of a standard approach to ontology, and modularisation mean that any sending or receiving machine which uses concepts from the "super" ontology can achieve full semantic interoperability. If both machines use the same data model for the same business, the data may be presented in the same relationships; if the two machines use different data models for different businesses, they may present the data in different ways, but without any loss of meaning or query capability.
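The idea that different data models preserve meaning through a shared ontology can be sketched as follows. The bindings here are hypothetical: a FHIR-style field name and a local field name both bound to the same (invented) concept IRI.

```python
# Sketch (all bindings hypothetical): two data models use different field
# names, but both bind to the same ontology concept IRI, so meaning
# survives the exchange between them.
fhir_binding = {"Patient.birthDate": "ex:DateOfBirth"}
local_binding = {"dob": "ex:DateOfBirth"}

def translate(field: str, source: dict, target: dict) -> str:
    """Map a field in one data model to its counterpart in another,
    via the shared concept IRI."""
    concept = source[field]                   # field -> shared concept
    for target_field, c in target.items():    # shared concept -> target field
        if c == concept:
            return target_field
    raise KeyError(f"no binding for {concept}")

print(translate("Patient.birthDate", fhir_binding, local_binding))
```

The concept IRI, not the field name, carries the semantics, which is why the two machines can disagree about structure without disagreeing about meaning.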

Data mapping

This part of the language is used to define mappings between the data model and an actual schema, enabling query and filters to automatically cope with the ever-extending ontology and data properties.

The language can be used to auto-generate starter schemas for implementation, i.e. schemas that will then be optimised for real-world use.

The main use case for the mapping sublanguage is data transformation. This uses techniques such as object-relational mapping (ORM), and the transform instructions, in the form of maps, follow this approach. There is no single standard for ORM maps, but best practice of the kind supported by open-source utilities such as Hibernate is followed.
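The combination of an ORM-style map and starter-schema generation can be sketched as follows. The map structure, table name and column names are all hypothetical; a real map would follow Hibernate-style conventions rather than this simplified shape.

```python
# Hedged sketch: an ORM-style map from data-model properties to columns,
# used to auto-generate a "starter" schema of the kind described above.
# Map structure, table and column names are hypothetical.
entity_map = {
    "table": "allergy",
    "columns": {
        "ex:causativeAgent": "causative_agent_concept",
        "ex:effectiveDate": "effective_date",
    },
}

def starter_schema(entity_map: dict) -> str:
    """Generate a naive CREATE TABLE statement from the map. A real
    implementation would derive column types from the data model."""
    cols = ",\n  ".join(f"{col} VARCHAR(255)"
                        for col in entity_map["columns"].values())
    return f"CREATE TABLE {entity_map['table']} (\n  {cols}\n);"

print(starter_schema(entity_map))
```

The generated schema is deliberately a starting point only; as the text notes, it would then be optimised by hand for real-world use.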