Information modelling

From Endeavour Knowledge Base

Background

To make sense of the huge variation across thousands of data types and millions of codes, from thousands of providers using scores of different systems, it is useful to create an information model comprising a data model, an ontology of concepts, and value sets bound to the data model.

It is useful to visualise the information model via a publicly accessible web application, and to provide a set of APIs that enable users and systems to use the data within the model.

Having established such a model, it is then possible to construct logical definitions of queries and concept sets that can be used on the data published from the sources. The information model thus contains models of set definitions and queries.

Services that link and normalise the data can use the model and/or the ontologies within it, creating maps between source data and a common model.

This article and its linked pages describe one approach to an information model based on linked data principles, as established as part of the idea of the semantic web.

Most models in healthcare use either bespoke health care languages, such as those used by HL7 or openEHR, or conventional entity diagrams with a separate terminology server. The approach used in the Endeavour information model is to adopt and adapt the mainstream semantic web languages, based on a view of health data as a graph with the nodes and edges modelled as RDF IRIs.
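
The graph view can be illustrated with a minimal sketch in plain Python, in which each statement is a subject, predicate, object triple and nodes and edges are identified by IRIs. The `http://endhealth.info/im#` namespace is taken from the links in this article; the property names `hasEntry`, `concept` and `effectiveDate`, the `example.org` namespace and the record itself are assumptions made purely for illustration and are not taken from the IM.

```python
from typing import List, NamedTuple

IM = "http://endhealth.info/im#"    # namespace as it appears in the IM links in this article
EX = "http://example.org/record#"   # hypothetical namespace for instance data


class Triple(NamedTuple):
    subject: str    # node IRI
    predicate: str  # edge IRI
    obj: str        # node IRI or literal value


# Hypothetical record: one patient with one coded entry.
# The property names are illustrative, not actual IM properties.
GRAPH: List[Triple] = [
    Triple(EX + "patient1", IM + "hasEntry", EX + "entry1"),
    Triple(EX + "entry1", IM + "concept", EX + "Asthma"),
    Triple(EX + "entry1", IM + "effectiveDate", "2021-03-04"),
]


def objects(graph: List[Triple], subject: str, predicate: str) -> List[str]:
    """Follow one edge type out of a node and return what it points to."""
    return [t.obj for t in graph if t.subject == subject and t.predicate == predicate]


def concepts_for(graph: List[Triple], patient: str) -> List[str]:
    """Walk patient -> entry -> concept across the graph."""
    found: List[str] = []
    for entry in objects(graph, patient, IM + "hasEntry"):
        found.extend(objects(graph, entry, IM + "concept"))
    return found


print(concepts_for(GRAPH, EX + "patient1"))
```

In practice an RDF library and a triple store would take the place of the list and the helper functions; the sketch only shows the shape of the data.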

The model is not a new standard or an invention of new concepts. Instead, the content of the Endeavour IM incorporates concepts from a number of recognised sources including:

a) The mainstream health ontology Snomed-CT, with extensions to accommodate the unmapped NHS data dictionary attributes, local codes, and code taxonomies such as OPCS and ICD10, as well as the legacy mappings to Read 2.

b) The mainstream messaging model resources such as FHIR, making the IM FHIR compatible via simple transforms.

c) The mainstream query definitions such as QOF rules and dataset definitions.

General approach

The IM is a representation of the meaning and structure of data held in the electronic records of the health and social care sector, together with libraries of queries, value sets, concept sets, data set definitions and mappings. These are computable abstract logical models, not physical schemas. "Computable" means that operational software operates directly from the model artefacts, as opposed to using the model for illustration purposes. As a logical model, it models data that may be physically held in a variety of different types of data stores, including relational or graph data stores. Because the model is independent of the physical schemas, the model itself has to be interoperable and free of proprietary lock-in.

The IM is a broad model that integrates a set of different approaches to modelling using a common ontology. The components of the model are:

  1. A set of ontologies, which is a vocabulary and definitions of the concepts used in healthcare, or more simply put, a vocabulary of health. The ontologies are made up of the world's leading ontology Snomed-CT, with a London extension, and various code-based taxonomies (e.g. ICD10, Read, supplier codes and local codes).
  2. A common data model, which is a set of classes and properties, using the vocabulary, that represent the data and relationships as published by live systems. Note that this data model is NOT a standard model but a collated set of entities and relationships, based on real data and bound to the concepts, that are mapped to a common model.
  3. A library of business-specific concept value sets (aka reference sets), which are expression constraints on the ontology for the purpose of query.
  4. A catalogue of reference data such as geographical areas, organisations and people, derived and updated from public resources.
  5. A library of data set (query) definitions for querying and extracting instance data from the information model, reference data, or health records.
  6. A set of maps between published concepts and the core ontology, as well as structural mappings between submitted data and the data model.
  7. An open source set of utilities that can be used to browse, search, or maintain the model.

Modelling languages

To build a model, it is necessary to use building blocks. In computing, this means the use of high-level languages of some kind.

The approach taken in the Endeavour IM is to use a combination of the Semantic Web languages, standard JSON, standard APIs and property graph query, thus ensuring compatibility with mainstream web-based approaches.

The information model languages are thus constrained forms of the semantic web languages, with vocabularies tailored to the IM requirements.

Logical Data model

To make sense of data, a data model is normally required. A data model should be capable of being viewed by a non-technical person, as well as being machine readable for the purposes of class generation.

The Endeavour information model manager contains a visualisation of a SHACL-based data model of health data, visible at

https://im.endeavourhealth.net/#/directory/folder/http:%2F%2Fendhealth.info%2Fim%23DataModel
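
As an illustration of what "machine readable for the purposes of class generation" can mean, the sketch below generates a Python class from a much simplified shape description. The dictionary layout only loosely mirrors the SHACL ideas of a target class, property paths, datatypes and minimum counts; the `Condition` shape and its properties are hypothetical and are not copied from the IM.

```python
from dataclasses import field, make_dataclass
from typing import Optional

# Hypothetical, simplified stand-in for a SHACL node shape.
CONDITION_SHAPE = {
    "targetClass": "Condition",
    "properties": [
        {"path": "concept", "datatype": str, "minCount": 1},
        {"path": "effectiveDate", "datatype": str, "minCount": 1},
        {"path": "note", "datatype": str, "minCount": 0},
    ],
}


def class_from_shape(shape: dict) -> type:
    """Build a dataclass whose fields follow the shape's property list.

    Properties with minCount >= 1 become required fields; the rest
    become optional fields that default to None.
    """
    required, optional = [], []
    for prop in shape["properties"]:
        if prop.get("minCount", 0) >= 1:
            required.append((prop["path"], prop["datatype"]))
        else:
            optional.append(
                (prop["path"], Optional[prop["datatype"]], field(default=None))
            )
    # Required fields must come first: dataclasses reject a field without
    # a default that follows a field with one.
    return make_dataclass(shape["targetClass"], required + optional)


Condition = class_from_shape(CONDITION_SHAPE)
entry = Condition(concept="Asthma", effectiveDate="2021-03-04")
print(entry)
```

Omitting a required property, for example `Condition(concept="Asthma")`, raises a `TypeError`, which is the kind of constraint a shape-driven class can enforce at construction time.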

Ontologies

Along with a data model, it is also necessary to model the values of the properties (fields), often referred to as concepts or codes.

The Endeavour information model manager contains visualisations of a set of ontologies authored as OWL-defined concepts, which enables inference for the purposes of classification. For example, the Endeavour IM contains UK Snomed-CT and a local extension.

This is available at https://im.endeavourhealth.net/#/directory/folder/http:%2F%2Fendhealth.info%2Fim%23HealthModelOntology
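
A very small part of what inference for classification provides can be sketched as transitive subsumption over asserted subclass links. This is only a stand-in for an OWL reasoner, which also handles defined classes, property restrictions and more. The concept names and hierarchy below are hypothetical and chosen for readability rather than taken from Snomed-CT.

```python
from typing import Dict, Set

# Hypothetical asserted hierarchy: child -> direct parents.
SUBCLASS_OF: Dict[str, Set[str]] = {
    "Type2Diabetes": {"DiabetesMellitus"},
    "Type1Diabetes": {"DiabetesMellitus"},
    "DiabetesMellitus": {"EndocrineDisorder"},
    "EndocrineDisorder": {"Disease"},
    "Asthma": {"Disease"},
}


def ancestors(concept: str, subclass_of: Dict[str, Set[str]]) -> Set[str]:
    """Return every concept reachable by following parent links upwards."""
    seen: Set[str] = set()
    stack = list(subclass_of.get(concept, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(subclass_of.get(parent, ()))
    return seen


def is_a(concept: str, candidate: str, subclass_of: Dict[str, Set[str]]) -> bool:
    """True if concept is the candidate itself or one of its inferred subtypes."""
    return concept == candidate or candidate in ancestors(concept, subclass_of)


print(is_a("Type2Diabetes", "Disease", SUBCLASS_OF))
print(is_a("Asthma", "DiabetesMellitus", SUBCLASS_OF))
```

Only the direct parent of each concept is asserted; the fact that a type 2 diabetes concept is also a disease is inferred by walking the links.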

Value sets (code sets)

When using health records for decision support, or when searching health records, it is necessary to author and create sets of concepts, referred to as value sets, concept sets or code sets. The definition of value sets can be quite sophisticated, as the definition can make full use of the entailment and inference from the ontology.

When using health records to store data, the ontological concepts used as values within the records are usually bound to the type of entry that uses them. Thus some value sets are "bound" to data models. For example, an entry type of "Condition" is bound to a value set for "Conditions" consisting of all of the concepts considered to be conditions.

The Endeavour information model manager contains visualisations of value sets, such as those used in GP system queries. This is available at
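
The two ideas above, a value set defined as a constraint over the ontology and a value set bound to a property of the data model, can be sketched as follows. The definition format (`include` and `exclude` lists of concepts whose descendants are taken), the hierarchy and the binding table are all assumptions for illustration; the IM's own set definition syntax is not shown here.

```python
from typing import Dict, List, Set

# Hypothetical hierarchy: parent -> direct children.
CHILDREN: Dict[str, List[str]] = {
    "Disease": ["DiabetesMellitus", "Asthma"],
    "DiabetesMellitus": ["Type1Diabetes", "Type2Diabetes"],
}


def descendants_or_self(concept: str) -> Set[str]:
    """The concept plus everything below it in the hierarchy."""
    result = {concept}
    for child in CHILDREN.get(concept, []):
        result |= descendants_or_self(child)
    return result


def expand(definition: dict) -> Set[str]:
    """Turn a set definition into its member concepts."""
    members: Set[str] = set()
    for concept in definition["include"]:
        members |= descendants_or_self(concept)
    for concept in definition.get("exclude", []):
        members -= descendants_or_self(concept)
    return members


# A definition, not a list of codes: members follow from the hierarchy.
DIABETES_EXCLUDING_TYPE1 = {"include": ["DiabetesMellitus"], "exclude": ["Type1Diabetes"]}

# Hypothetical binding of a data model property to a value set definition.
BINDINGS = {("Condition", "concept"): {"include": ["Disease"]}}


def is_valid(entry_type: str, prop: str, value: str) -> bool:
    """Check a value against the value set bound to (entry_type, prop)."""
    return value in expand(BINDINGS[(entry_type, prop)])


print(sorted(expand(DIABETES_EXCLUDING_TYPE1)))
print(is_valid("Condition", "concept", "Asthma"))
```

Because the set is defined against the hierarchy, a concept added under `DiabetesMellitus` later would join the expansion without the definition being edited.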

https://im.endeavourhealth.net/#/directory/folder/http:%2F%2Fendhealth.info%2Fim%23Sets

Query

Query can be thought of as a set of questions chained together to refine the questions iteratively. If one has a model of health data, then it seems sensible to model the query of that data in a way that follows the logic of the plain language questions.

Thus, the IM provides a query definition object model. This 'query model' is not a new query language or DSL. Instead, it is a machine- and human-readable object model of a query definition of the well-known property graph language CYPHER, which can be converted into the query construct of choice at run time. The purpose of this object model is to create a bridge between the run time query and a user interface for query definition building.

The class model comes with open source interpreters that generate a plain language representation, as well as reference interpreters for three main query languages (CYPHER, SQL and SPARQL), with an Elastic interpreter for text queries.

The approach to modelling query definitions is described, with outputs in a machine and human readable form, in either JSON or plain text, covering the majority of health data query requirements.
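
The idea of one query definition object feeding several interpreters can be sketched as below. This is not the IM query model: the `Match` and `QueryDefinition` classes, the table and column names, and the `value_set_member` table are hypothetical, and only a plain language and a SQL interpreter are shown. An in-memory SQLite database is used so that the generated SQL can actually be run.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Match:
    entry_type: str  # data model class, e.g. "Condition"
    prop: str        # property being tested
    value_set: str   # name of the concept set the value must belong to


@dataclass
class QueryDefinition:
    base_type: str        # what is being selected, e.g. "Patient"
    matches: List[Match]  # questions chained together with "and"


def to_plain_language(query: QueryDefinition) -> str:
    clauses = [
        f"have a {m.entry_type} whose {m.prop} is in the set '{m.value_set}'"
        for m in query.matches
    ]
    return f"{query.base_type}s who " + " and ".join(clauses)


def to_sql(query: QueryDefinition) -> Tuple[str, List[str]]:
    """Generate SQL for a hypothetical schema with one table per class.

    Table and column names come from the trusted definition; value set
    names are passed as bound parameters.
    """
    base = query.base_type.lower()
    lines = [f"SELECT p.id FROM {base} p"]
    params: List[str] = []
    for i, m in enumerate(query.matches):
        keyword = "WHERE" if i == 0 else "AND"
        lines.append(
            f"{keyword} EXISTS (SELECT 1 FROM {m.entry_type.lower()} e "
            f"JOIN value_set_member v ON v.concept = e.{m.prop} "
            f"WHERE e.{base} = p.id AND v.value_set = ?)"
        )
        params.append(m.value_set)
    return "\n".join(lines), params


QUERY = QueryDefinition(
    base_type="Patient",
    matches=[
        Match("Condition", "concept", "Diabetes"),
        Match("Medication", "concept", "Metformin"),
    ],
)

# Hypothetical data: patient 1 has diabetes and metformin, patient 2 does not.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (id INTEGER);
    CREATE TABLE condition (patient INTEGER, concept TEXT);
    CREATE TABLE medication (patient INTEGER, concept TEXT);
    CREATE TABLE value_set_member (value_set TEXT, concept TEXT);
    INSERT INTO patient VALUES (1), (2);
    INSERT INTO condition VALUES (1, 'Type2Diabetes'), (2, 'Asthma');
    INSERT INTO medication VALUES (1, 'Metformin'), (2, 'Metformin');
    INSERT INTO value_set_member VALUES
        ('Diabetes', 'Type2Diabetes'), ('Metformin', 'Metformin');
""")

sql, params = to_sql(QUERY)
rows = conn.execute(sql, params).fetchall()
print(to_plain_language(QUERY))
print(rows)
```

The same `QUERY` object drives both outputs, which is the bridging role described above: a user interface can build and display the definition, while an interpreter chosen at run time turns it into the target query language.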

Meta model

To build an information model, it is necessary to use a meta model that defines the classes used to populate the model. Because the IM uses semantic web languages as its building blocks, the meta model uses the SHACL semantic web approach to describe the classes.

Transformation of data

Data held in actual systems vary massively in their data models and ontologies. Health care interoperability operates by the adoption of maps or transforms between one format and another. An information model without the ability to transform data is in itself of little value. In the same way as modelling data and query is of value, modelling data transformation is also of value.

Currently, open mapping languages such as R2RML or RML are somewhat crude, and most mapping models remain proprietary and obscure. A more open approach to data transformation language is necessary. An approach to mapping of concepts and published data is described.
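
To make the two kinds of map concrete, the sketch below applies a concept map (source code to core concept) and a structural map (source field to data model property) to one source record. It does not use R2RML, RML or the IM's own mapping approach; the scheme name, codes, field names and target properties are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical concept map: (source scheme, source code) -> core concept.
CONCEPT_MAP: Dict[Tuple[str, str], str] = {
    ("LocalScheme", "A123"): "Asthma",
    ("LocalScheme", "D456"): "Type2Diabetes",
}

# Hypothetical structural map: source field -> data model property.
FIELD_MAP: Dict[str, str] = {
    "code": "concept",
    "recorded_on": "effectiveDate",
}


def transform(record: dict, scheme: str) -> dict:
    """Map one source record onto data model properties and core concepts."""
    target = {}
    for source_field, value in record.items():
        prop = FIELD_MAP.get(source_field)
        if prop is None:
            continue  # field not covered by the structural map in this sketch
        if prop == "concept":
            key = (scheme, value)
            if key not in CONCEPT_MAP:
                # Surface unmapped codes rather than passing them through silently.
                raise LookupError(f"no concept map for {scheme} code {value!r}")
            value = CONCEPT_MAP[key]
        target[prop] = value
    return target


source_record = {"code": "A123", "recorded_on": "2021-03-04", "internal_id": 7}
print(transform(source_record, "LocalScheme"))
```

Keeping both maps as data rather than code means they can be published, reviewed and versioned alongside the rest of the model.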