Information modelling

From Endeavour Knowledge Base

Background

To make sense of huge variation, with thousands of data types and millions of codes from thousands of providers using scores of different systems, it is useful to create an information model covering a data model, an ontology of concepts, and value sets bound to the data model.

It is also useful to visualise the information model via a publicly accessible web application and to provide a set of APIs that enable users and systems to use the data within the model.

Having established such a model, it is then possible to construct logical definitions of queries and concept sets that can be used on the data published from the sources. The information model thus also contains models of set definitions and queries.

Services that link and normalise the data can use the model and/or the ontologies within it, creating maps between source data and a common model.

This article and its linked pages describe one approach to an information model based on linked data principles, as established as part of the idea of the semantic web.

Most models in healthcare either use bespoke healthcare languages, such as those used by HL7 or openEHR, or conventional entity diagrams with a separate terminology server. The approach used in the Endeavour information model is to adopt and adapt the mainstream semantic web languages, based on a view of health data as a graph with the nodes and edges modelled as RDF IRIs.
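As a minimal sketch of the graph view described above, the following represents a fragment of health data as subject-predicate-object triples in which every node and edge is an IRI string. The namespaces, identifiers, and property names are hypothetical and chosen only for illustration; they are not the IRIs used by the Endeavour IM. Plain Python is used here rather than an RDF library, so this shows the shape of the data and not a full RDF implementation.

```python
# Hypothetical namespaces; not the IRIs actually used by the Endeavour IM.
EX = "http://example.org/im#"
DATA = "http://example.org/data/"
CONCEPT = "http://example.org/concept#"

# Each triple is (subject, predicate, object); every element is an IRI string.
triples = {
    (DATA + "patient/1", EX + "hasObservation", DATA + "observation/42"),
    (DATA + "observation/42", EX + "concept", CONCEPT + "Hypertension"),
    (DATA + "observation/42", EX + "recordedBy", DATA + "organisation/7"),
}


def objects(subject, predicate):
    """Follow one edge type out of a node and return the nodes it reaches."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def neighbours(subject):
    """Return every (edge, node) pair leaving the given node."""
    return sorted((p, o) for s, p, o in triples if s == subject)
```

Calling objects(DATA + "patient/1", EX + "hasObservation") walks from the patient node to its observation, which is the basic traversal that graph queries build on.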

The model is not a new standard or an invention of new concepts. Instead, the content of the Endeavour IM incorporates concepts from a number of recognised sources including:

a) The mainstream health ontology Snomed-CT, with extensions to accommodate the unmapped NHS data dictionary attributes, local codes, and code taxonomies such as OPCS and ICD10, as well as the legacy mappings to Read 2.

b) The mainstream messaging model resources such as FHIR, making the IM FHIR compatible via simple transforms.

c) The mainstream query definitions such as QOF rules and dataset definitions.

General approach

The IM is a representation of the meaning and structure of data held in the electronic records of the health and social care sector, together with libraries of queries, value sets, concept sets, data set definitions and mappings. These are computable abstract logical models, not physical schemas. "Computable" means that operational software operates directly from the model artefacts, as opposed to using the model for illustration purposes. As a logical model, it models data that may be physically held in any of a variety of different types of data stores, including relational or graph data stores. Because the model is independent of the physical schemas, the model itself has to be interoperable and free of any proprietary lock-in.

The IM is a broad model that integrates a set of different approaches to modelling using a common ontology. The components of the model are:

  1. A set of ontologies, which provide a vocabulary and definitions of the concepts used in healthcare, or more simply put, a vocabulary of health. The ontologies are made up of the world's leading ontology, Snomed-CT, with a London extension, together with various code-based taxonomies (e.g. ICD10, Read, supplier codes and local codes).
  2. A common data model, which is a set of classes and properties, using the vocabulary, that represent the data and relationships published by live systems. Note that this data model is NOT a standard model but a collated set of entities and relationships, bound to the concepts and based on real data, that are mapped to a common model.
  3. A library of business-specific concept value sets (also known as reference sets), which are expression constraints on the ontology for the purpose of query.
  4. A catalogue of reference data such as geographical areas, organisations and people derived and updated from public resources.
  5. A library of data set (query) definitions for querying and extracting instance data from the information model, reference data, or health records.
  6. A set of maps, comprising mappings between published concepts and the core ontology as well as structural mappings between submitted data and the data model.
  7. An open source set of utilities that can be used to browse, search, or maintain the model.
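To illustrate what it means for a concept value set to be an expression constraint on the ontology, here is a minimal sketch that expands a constraint of the form "this concept and everything that is a kind of it, excluding that concept and its kinds" over a toy subsumption hierarchy. The concept names and the hierarchy are invented for the example, and the function names are hypothetical; a real expansion would run against the full ontology.

```python
# Toy subsumption hierarchy: child -> set of direct parents (invented concepts).
IS_A = {
    "Diabetes": {"Disorder"},
    "Type1Diabetes": {"Diabetes"},
    "Type2Diabetes": {"Diabetes"},
    "GestationalDiabetes": {"Diabetes"},
    "Asthma": {"Disorder"},
}


def descendants_or_self(concept):
    """Return the concept plus everything that is transitively a kind of it."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for child, parents in IS_A.items():
            if current in parents and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def expand(include, exclude=()):
    """Expand a simple constraint: members of 'include' minus members of 'exclude'."""
    members = set()
    for concept in include:
        members |= descendants_or_self(concept)
    for concept in exclude:
        members -= descendants_or_self(concept)
    return members


# A value set defined by a constraint rather than by an enumerated list of codes.
diabetes_set = expand(include=["Diabetes"], exclude=["GestationalDiabetes"])
```

Defining the set as a constraint means that a concept added to the hierarchy later is picked up on the next expansion without editing the set definition.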

Modelling languages

To build a model, it is necessary to use building blocks. In computing, this means the use of high-level languages of some kind.

The approach taken in the Endeavour IM is to use a combination of the semantic web languages, standard JSON, standard APIs and property graph query, thus ensuring compatibility with mainstream web-based approaches.

The information model languages are thus constrained forms of the semantic web languages, with vocabularies tailored to the IM requirements.

Query

Query can be thought of as a set of questions chained together to refine the questions iteratively. If one has a model of health data, then it seems sensible to model the query of that data in a way that follows the logic of the plain language questions.

Thus, the IM provides a query definition object model. This 'query model' is not a new query language or DSL. Instead, it is a machine- and human-readable object model of a query definition, modelled on the well-known property graph language CYPHER, that can be converted into the query construct of choice at run time. The purpose of this object model is to create a bridge between the run time query and a user interface for building query definitions.

The class model comes with open source interpreters that generate a plain language representation, as well as reference interpreters for three main query languages (CYPHER, SQL, and SPARQL), with an Elastic interpreter for text queries.
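The idea of one query definition object with several interpreters can be sketched as follows. This is an illustrative toy under stated assumptions, not the IM's actual class model: the class names QueryDefinition and Where, the functions to_sql, to_cypher and to_text, and the entity and property names are all hypothetical. Values are inlined as literals for readability; a real interpreter would use parameterised queries.

```python
from dataclasses import dataclass, field


@dataclass
class Where:
    property: str   # property of the entity being tested
    operator: str   # comparison operator, e.g. "=" or ">"
    value: object   # literal to compare against


@dataclass
class QueryDefinition:
    entity: str                                # class of the data model being queried
    where: list = field(default_factory=list)  # clauses combined with AND


def _literal(value):
    """Quote strings and leave numbers as they are."""
    return f"'{value}'" if isinstance(value, str) else str(value)


def to_sql(q):
    clauses = " AND ".join(f"{w.property} {w.operator} {_literal(w.value)}" for w in q.where)
    return f"SELECT * FROM {q.entity}" + (f" WHERE {clauses}" if clauses else "")


def to_cypher(q):
    clauses = " AND ".join(f"n.{w.property} {w.operator} {_literal(w.value)}" for w in q.where)
    return f"MATCH (n:{q.entity})" + (f" WHERE {clauses}" if clauses else "") + " RETURN n"


def to_text(q):
    clauses = " and ".join(f"{w.property} {w.operator} {_literal(w.value)}" for w in q.where)
    return f"Find every {q.entity}" + (f" where {clauses}" if clauses else "")


# One definition, three renderings.
query = QueryDefinition(
    "Observation",
    [Where("concept", "=", "Hypertension"), Where("numericValue", ">", 140)],
)
```

Because the definition is plain data, a user interface can build or edit it without knowing which query language will be used at run time.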

The approach to modelling query definitions is described, with outputs in a machine- and human-readable form, in either JSON or plain text, covering the majority of health data query requirements.

Meta model

To build an information model, it is necessary to use a meta model that defines the classes used to populate the model. Because the IM uses semantic web languages as its building blocks, the meta model uses the SHACL semantic web approach to describe the classes.
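The role of a shape-based meta model can be sketched in a few lines. The example below is a simplified stand-in written in plain Python, not a SHACL engine and not the IM's actual shapes: the dictionary keys echo the SHACL terms sh:targetClass, sh:path, sh:minCount and sh:maxCount, while the Patient shape, its property names, and the validate function are hypothetical.

```python
# A node shape in the spirit of SHACL: a target class plus property constraints.
PATIENT_SHAPE = {
    "targetClass": "Patient",
    "properties": [
        {"path": "dateOfBirth", "minCount": 1, "maxCount": 1},
        {"path": "nhsNumber", "minCount": 0, "maxCount": 1},
        {"path": "address", "minCount": 0},  # no maxCount: any number allowed
    ],
}


def validate(node, shape):
    """Return violation messages for one node, given as property -> list of values."""
    violations = []
    for constraint in shape["properties"]:
        path = constraint["path"]
        count = len(node.get(path, []))
        if count < constraint.get("minCount", 0):
            violations.append(
                f"{path}: expected at least {constraint['minCount']}, found {count}"
            )
        if "maxCount" in constraint and count > constraint["maxCount"]:
            violations.append(
                f"{path}: expected at most {constraint['maxCount']}, found {count}"
            )
    return violations


valid_patient = {"dateOfBirth": ["1980-01-01"], "address": ["1 High St", "2 Low Rd"]}
invalid_patient = {"nhsNumber": ["1", "2"]}
```

Here validate(valid_patient, PATIENT_SHAPE) returns an empty list, while the second node is reported for a missing date of birth and for having more than one NHS number.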

Transformation of data

Data held in actual systems vary massively in their data models and ontologies. Healthcare interoperability operates by the adoption of maps or transforms between one format and another. An information model without the ability to transform data is in itself of little value. Just as modelling data and queries is of value, so is modelling data transformation.

Currently, open mapping languages such as R2RML or RML are somewhat crude, and most mapping models remain proprietary and obscure. A more open approach to a data transformation language is necessary. An approach to the mapping of concepts and published data is described.
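The two kinds of mapping mentioned in this article, concept maps and structural maps, can be illustrated with a minimal sketch. Everything here is an assumption made for the example: the scheme names, codes, field names, the core: prefix, and the transform function are invented and do not come from the IM's own mapping model.

```python
# Concept map: (source scheme, source code) -> core ontology concept. All entries are invented.
CONCEPT_MAP = {
    ("LocalScheme", "BP01"): "core:SystolicBloodPressure",
    ("LegacyScheme", "X123"): "core:SystolicBloodPressure",
}

# Structural map: source field name -> property of the common data model (hypothetical names).
FIELD_MAP = {"pat_id": "patient", "obs_dt": "effectiveDate", "val": "value"}


def transform(record, scheme):
    """Map one source record onto the common model; unmapped codes are kept but flagged."""
    target = {FIELD_MAP[k]: v for k, v in record.items() if k in FIELD_MAP}
    concept = CONCEPT_MAP.get((scheme, record["code"]))
    target["concept"] = concept
    target["sourceCode"] = f"{scheme}:{record['code']}"
    target["mapped"] = concept is not None
    return target


source = {"pat_id": "1", "obs_dt": "2023-05-01", "val": 142, "code": "BP01"}
result = transform(source, "LocalScheme")
```

Keeping the original code alongside the mapped concept means that records with no mapping yet are not lost and can be remapped once the concept map is extended.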