Health information model: Difference between revisions

From Endeavour Knowledge Base


Information modelling is the set of processes by which representations of data relationships are created and maintained. The Discovery models are designed both for human visualisation and for computers to use directly.  


Implementations that use a model can use three approaches:


# Direct use of the model data, either in a simple relational form, in one of the open standard representation syntaxes, or in the Discovery syntax
# Use via a set of APIs designed either to provide access to the data within the model or to generate implementable data structures that use 1) and 2)
__TOC__


== Objectives of the Information models ==
The information models are sets of components designed as a contribution to achieving the following objectives:


*Enable people who are not technical experts to visualise and understand the structure and content of health records.  
*Enable people who are technical experts to design systems based on the logical structure and content of the model  
*Enable people to define the data they need in order to perform advanced analytics or decision support, in particular where the definition involves [[Subsumption_test|subsumption testing]] 
*Enable query authors to have a library of value sets (sets of concepts) and query definitions for re-use across the health sector


The models are independent of implementation technology, i.e. they are abstract models, thus can be implemented in technologies of choice.


== Business domains and domain types ==
<br />
A model is only relevant for a particular set of business purposes and there is no single model that can accommodate all business purposes, although common information models can accommodate quite broad purposes. A reasonably well understood set of business purposes is referred to in these topics as a "business domain" or "domain of interest", and a particular information model is designed to cover a business domain.
 
An example of a business domain might be patient related health characteristics, clinical management in General Practice, or commissioning in the English NHS. Domains may be specialised, for example a rapid access chest pain clinic can also be considered as a business domain. A common information model will generally include data relationships needed by many domains, arranged in a way that inconsistency or unreliability is avoided.
 
Discovery modelling covers two main types of Domain:
 
# A semantic based definition and classification of concepts used within a domain. These are ontologies and represented in a standards based way that supports advanced ontological techniques such as classification and reasoning. All concepts represented in a business domain will have a semantic definition in a semantic ontology. An ontology covering the health care domain would typically include Snomed-CT, commonly used classifications such as ICD, as well as bespoke system specific concepts used for business purposes. The Discovery common information model semantic ontology is a super-ontology of ontologies.
# Data models for business purposes. The term "business" includes the business of data recording for clinical purposes and thus covers commissioning, research, as well as clinical management and interoperability between systems. Data models may be derived from other data models. Other terms are frequently used such as a "data set definition" (which is a model of a data set) or a "content model", which is a specification for use in data exchange, but they all refer to the same thing.
 
== The model building bricks ==
All the Discovery models, whether semantic models or business related data models are built using the same machine readable language. More specifically, the language grammars used are based on the following set of open standard based languages, interpreted in a way that matches the businesses:
 
# [https://www.w3.org/TR/owl2-syntax/ OWL2 Description Logic.] The grammar is interpreted according to the domain type and use case. For example, when used for classification and reasoning the language is interpreted via the [[wikipedia:Open-world_assumption|open world assumption]], i.e. standard OWL reasoners may be used. However, when used for decision support (such as checking whether two drugs interact), the [[wikipedia:Closed-world_assumption|closed world assumption]] is used. This converts the language into a multi-purpose language able to support negation and to remove undecidability (this being a necessary mental process in diagnosis and clinical judgement).
# [https://www.w3.org/TR/sparql11-query/ SPARQL Query.] Discovery uses a small subset of the language in order to preserve an easy direct mapping to SQL. It uses an OWL entailment regime of a kind that simplifies query by assuming subsumption unless otherwise specified.
# [[Discovery ABAC language|Attribute based access control]] language, which is specialised to include a vocabulary used to control access permissions. It maps precisely to XACML.
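
The difference between the open and closed world interpretations in the list above can be illustrated with a minimal Python sketch. It is not part of the Discovery tooling; the fact store <code>KNOWN_INTERACTIONS</code> and both function names are assumptions made purely for illustration.

```python
from typing import Optional

# Hypothetical asserted facts: unordered pairs of drugs recorded as interacting.
KNOWN_INTERACTIONS = {frozenset({"warfarin", "aspirin"})}


def interacts_open_world(a: str, b: str) -> Optional[bool]:
    """Open world: a missing assertion means 'unknown' (None), not 'false'."""
    return True if frozenset({a, b}) in KNOWN_INTERACTIONS else None


def interacts_closed_world(a: str, b: str) -> bool:
    """Closed world: anything that is not asserted is treated as false."""
    return frozenset({a, b}) in KNOWN_INTERACTIONS
```

Under the closed world reading the decision support question always receives a yes or no answer, which is what allows negation to be used in a rule.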
 
Both OWL2 and SPARQL bring complexity with them and are not in themselves optimised for implementation using common technologies such as object oriented languages and relational databases. To address this problem, Discovery has a pragmatic [[Information modelling language|JSON based syntax]] which maps 1:1 to Manchester OWL syntax, SPARQL and SQL. The syntax enables easier auto-generation of user interfaces, implementation classes and relational schemas.


== Ontologies and modules ==
The Discovery common information model can be thought of as an ontology of ontologies. More precisely, though, it should be considered as an ontology consisting of a set of [[wikipedia:Ontology_modularization|ontology modules]], with each module defined according to business needs. The principle of concept sharing, whereby one concept is identified once across the entire set of domains, suggests that there is a single ontology. However, a data model that is specified for a particular business purpose may have different class structures from another business purpose even though they share the same semantic definitions.


For example, take the idea of recording information about a blood pressure. This is an example of a component in a data model. In General Practice, it would be common practice to record a systolic and diastolic blood pressure and thus the component would consist of 3 classes. However, in a specialist research study involving different interpretations of blood pressures, including perhaps the size or nature of the cuff, or the exact position of the patient, this component may be more complex.


This is addressed by modularisation where the axioms that define the classes belong to a particular model, even though the property domains and their ranges are shared across the ontology. This is analogous to the idea of templates derived from subsets of archetypes. The difference is that there is no "super-archetype" requiring international agreement on the items in the archetype, but instead there is a demand that the same identifier of the diastolic blood pressure record class is used throughout, even though the class definition is business specific.


== Disambiguation of terms ==
Throughout the healthcare domain, the same terms are often used to mean different things when used in different contexts. This can create some fundamental problems in design, which the information modelling attempts to overcome.


Take "blood pressure" as an example. This could be used to mean:


a) The actual blood pressure observation itself (a blood pressure was observed)  


b) The record of a blood pressure (Mrs Smith's blood pressure on 1st February),


c) The blood pressure procedure (A blood pressure procedure was undertaken).


This ambiguity is addressed by using different concept identifiers for the different contexts. Editorial policy would normally disambiguate the term e.g. blood pressure (observation), blood pressure measurement (procedure), or blood pressure (record).


The net result of disambiguation is the generation of many more identifiers for record based classes in the model. In the blood pressure example several identifiers are likely to be used to model the blood pressure related in the following way:


In the GP Domain Module a blood pressure data model might be:


Blood pressure (record) -> is a subclass of -> observation (record)
 
& -> is a record of -> a blood pressure (observation)
 
& -> has subcomponent -> [systolic blood pressure (record) and diastolic blood pressure (record)]
 
 
This creates a type hierarchy in the data model similar to the type hierarchy in the semantic model. From an implementation perspective it is likely that the "Observation" would be mapped to a table directly, whereas the blood pressure record would be assigned to a "record of" or "type" column, the content of which could be any object that is a member of the observation type value set, and 2 additional observations as components would be created. HL7 FHIR uses this model via the use of the generic term "code" and "component".
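
As a rough illustration of the implementation described above, the following Python sketch holds a blood pressure record as one generic observation with two component observations, then flattens it into table-like rows. The class name <code>Observation</code>, the field names and the helper functions are hypothetical; they are not taken from the Discovery schemas or from FHIR.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Observation:
    # Concept that would sit in the "record of" (or "type") column.
    record_of: str
    value: Optional[float] = None
    components: List["Observation"] = field(default_factory=list)


def blood_pressure(systolic: float, diastolic: float) -> Observation:
    """Build a blood pressure record with its two sub-component records."""
    return Observation(
        record_of="blood pressure (record)",
        components=[
            Observation("systolic blood pressure (record)", systolic),
            Observation("diastolic blood pressure (record)", diastolic),
        ],
    )


def flatten(obs: Observation, parent: Optional[str] = None) -> List[Tuple]:
    """Return (record_of, value, parent) rows, parent first, components after."""
    rows = [(obs.record_of, obs.value, parent)]
    for component in obs.components:
        rows.extend(flatten(component, obs.record_of))
    return rows
```

Calling <code>flatten(blood_pressure(120, 80))</code> yields three rows: one parent observation and the two additional component observations.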
 
Unlike other frameworks such as FHIR or OpenEHR there is no health specific "Reference" model as such beyond the OWL standard need to differentiate classes from properties from data types. This is because a selected structure from the top down is determined by interpretation, use case and the communicating community.
 
== Models viewed in packages ==
Another way of categorising the information models is by the use of the idea of packages.
 
Discovery information models can be said to reside within one of 5 packages, each package directed at particular sets of use cases. Two of the packages fall into the semantic ontology domain type, and 3 fall into the data modelling domain type.
 
Crucially, all the packages are integrated by a common language and share the same concepts, each of which is defined within a semantic ontology.
 
__TOC__[[File:Information model.png|thumb]]<br />
 
*The semantic ontology is the set of concepts used in all parts of the information model, from clinical concepts through to data structure concepts
*The data model is a set of entities, attributes and value sets, all of which are defined precisely in the ontology, but the data model, being created for a specific business of healthcare, is separate from the ontology.
*Value sets, or concept sets, are business purpose specific collections of concepts from the ontology used in the data model or in query. They contain concepts as defined in the ontology, using the ontology language, including advanced concept classes.
*Data set definitions apply rules and filters to a data model in order to specify the nature of the entries and their content required in a purpose specific data set
*Model maps specify how data is transformed from a data model to a particular database or messaging format.
*Data base schemas are reference schemas (RDB and maps) showing an implementation of a data model and data sets. Strictly speaking these are not part of the information model but are included as “proof of solution” of the model.
*Query definitions are a library of re-usable queries.


<br />
<br />
=== Semantic ontology ===
A semantic ontology defines the ''meaning'' of the concepts that make up the content of health records. The meaning is defined in a way that a computer can use to classify, reason and analyse.
As a semantic ontology is based on Description Logic, the semantic ontology applies the same interpretation to its concepts for classification as does OWL. However, when used for purposes such as authoring and value set generation, a closed world assumption is used, of the kind described in the Snomed expression constraint language.
The world leading Snomed-CT ontology forms the major part of the semantic ontology for the definition of health characteristics, supplemented by maps to legacy concepts such as ICD10 or Read codes. Consequently the semantic ontology can be represented in several syntaxes including the main OWL syntaxes, Manchester Syntax, [https://confluence.ihtsdotools.org/display/SLPG/SNOMED+CT+Compositional+Grammar Snomed compositional grammar] and [https://confluence.ihtsdotools.org/display/DOCECL/Expression+Constraint+Language+-+Specification+and+Guide Expression constraint language].
[[File:Ontology.jpg|Main ontology structures|alt=|thumb]]
An ontology is made up of a number of axioms which relate concepts to other concepts in a [https://en.wikipedia.org/wiki/Fractal/ fractal]-like manner.
The ontology may also be defined using the [[Discovery semantic ontology language]], which is itself a syntactical simplification of the standard OWL2 language. The Discovery language exists in order to accommodate additional constructs not covered in OWL, namely data set definitions, value set definitions, and transactional messaging.
[[File:Super ontology.png|thumb|Core and legacy parts of the ontology]]
=== Relationship between core and legacy ===
The semantic ontology can be modularised into Core and Legacy concepts according to the namespaces of the concept identifiers. That is not to say that the legacy concepts are not used. Quite the reverse. Given that 99%+ of all healthcare data is still recorded using legacy concepts the semantic ontology must incorporate these. In addition a vast number of system specific or provider specific codes are in use.
Both core and legacy are defined by OWL2 DL axioms. However, the core concepts are likely to be defined in a way that sufficiently identifies the concept within the domain whereas legacy concepts are more likely to be defined only to the extent necessary for query subsumption. Nevertheless legacy concepts may also be sufficiently defined. In many cases those definitions come in the form of expressions containing core concepts.
For example, an adverse reaction to Atenolol is a GP legacy concept. This can be sufficiently defined in the following axiom:
Adverse reaction to Atenolol - is equivalent to - (Adverse drug reaction & causative agent = Atenolol)
The Discovery common model creates relationships between core and legacy using a mapping relationship, the commonest being:
# Equivalent. Where the legacy code or term is deemed to be equivalent in meaning and definition to the core concept
# Subclass. Where the legacy code or term is deemed to be a subclass of the core concept
# Mapped to. Where the legacy code would be expected to be a member of the set defined by the core concept, but may not be sufficiently defined to be confident of equivalence or subclass.
From a mapping perspective the maps operate from Core -> Legacy and not the other way round. For example, if one were searching for Diabetes using a core concept, and a patient had a diagnosis of the ICD10 code "Diabetes without mention of complication", then one would expect that patient to be found (depending on the enquirer's preference). However, if querying on "Diabetes without mention of complication" then no core concept would be found, as the relationship does not operate in that direction. The exception to this rule is the "equivalent" axiom, which is bidirectional.
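
The directionality rule can be sketched in Python as follows. This is an illustrative sketch only: the map table <code>CORE_TO_LEGACY</code>, the concept labels and the two function names are assumptions, not the Discovery mapping API.

```python
# Hypothetical maps: core concept -> list of (relationship, legacy concept).
CORE_TO_LEGACY = {
    "Diabetes (core)": [
        ("subclass", "ICD10: Diabetes without mention of complication"),
    ],
    "Asthma (core)": [
        ("equivalent", "Read: Asthma"),
    ],
}


def expand_core(core: str) -> set:
    """Query on a core concept: every mapped legacy concept is included."""
    return {core} | {legacy for _, legacy in CORE_TO_LEGACY.get(core, [])}


def expand_legacy(legacy: str) -> set:
    """Query on a legacy concept: only 'equivalent' maps lead back to core."""
    found = {legacy}
    for core, maps in CORE_TO_LEGACY.items():
        for relationship, target in maps:
            if target == legacy and relationship == "equivalent":
                found.add(core)
    return found
```

A query on the core Diabetes concept therefore reaches the ICD10 code, whereas a query on the ICD10 code returns only that code.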
If the relationship between a core and a legacy is "equivalent" or "subclass", this does not mean that the child codes of the legacy codes would normally be included, as the child codes are often not subclasses from a semantic perspective. This is important to recognise when authoring queries using core concepts, operating on data that uses legacy codes.
=== Value sets, concept sets or reference sets ===
''Main article:'' [[Value_sets|Value sets]]
A value set definition, and its run time counterpart, the value set transitive closure, is a set of [https://wiki.discoverydataservice.org/index.php?title=Concepts_classes_and_properties class expressions] collected together for a particular business purpose.
There is a range of purposes for a value set. Examples include defining a data set according to a set of recorded concepts, indicating the expected range of a property in a health record, and testing the presence of a feature in a patient record.
=== Data models ===
''Main article :'' [[Data Model]]
A data model is a module of an ontology that defines classes required for particular business purposes.
Business purposes vary from the need to store particular items of data through to the need to display items in a certain way. This is the model that defines the ever evolving structure of health records held within multi-domain health records, varying from common high level classes through to specialised classes. An example of the former is an 'observation', and an example of the latter is a 'Blood pressure' or a 'histological/immunological report on a breast carcinoma'.
N.B. in ISO 13606 these are called archetypes and their derivative templates. In FHIR they are referred to as resources and profiles.
=== Data definitions - query ===
Data set definitions or queries are a key component of the information model.
A data set definition is a specification of a subset of data derived from one or more data models.
A data set definition, once established, can also be used as a source data model and thus data sets can be chained by placing a data set into the role of a data model.
A data set uses query like constructs to define its structures. Data set entities and data set attributes may be derived from a combination of ontology and data model query. To that extent, a data set definition can be said to use a query language.
The Discovery data definition language is not designed to operate as an actual query language, as it does not extend to include all the sophistication needed by a run time query language. For example, there are no optimisation techniques employed or references to the use of indexes. However, the language is sufficiently rich to be able to easily generate SQL or Cypher from the specification when used with a data model map to the implementation schema.
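
To make the last point concrete, here is a minimal Python sketch of generating parameterised SQL from a declarative definition plus a data model map. The dictionary layout, the entity and column names and the function <code>to_sql</code> are all hypothetical; the actual Discovery definition syntax is not reproduced here.

```python
# Hypothetical data model map: entity and attribute names -> table and columns.
DATA_MODEL_MAP = {
    "Observation": {
        "table": "observation",
        "columns": {
            "patient": "patient_id",
            "recordOf": "concept_id",
            "effectiveDate": "effective_date",
        },
    },
}

# Hypothetical data set definition: which attributes to return, which to filter.
definition = {
    "entity": "Observation",
    "select": ["patient", "effectiveDate"],
    "where": {"recordOf": ["bp-record"]},
}


def to_sql(definition: dict, model_map: dict):
    """Translate a definition into a parameterised SQL string and its values."""
    mapping = model_map[definition["entity"]]
    cols = mapping["columns"]
    select = ", ".join(cols[attr] for attr in definition["select"])
    clauses, params = [], []
    for attr, values in definition["where"].items():
        placeholders = ", ".join("?" for _ in values)
        clauses.append(f"{cols[attr]} IN ({placeholders})")
        params.extend(values)
    sql = f"SELECT {select} FROM {mapping['table']}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

The definition itself never mentions tables or columns, so the same definition could be paired with a different map to target another schema.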
=== Data Maps ===
Data maps hold the maps for a variety of purposes, mainly being:
* Maps between the data model and an implementation schema to enable auto generation of query syntax such as SQL or Cypher
* Maps between legacy data models and/or their values and the common information models

Latest revision as of 10:28, 21 August 2022

This article describes the approach taken to producing information models, including what they are, what their purpose is, and what the technical components of the models are.

The article does not include the content of any particular model.

What is the health information model (IM) and what is its purpose?

The IM is a representation of the meaning and structure of data held in the electronic records of the health and social care sector, together with libraries of query, value sets, concept sets, data set definitions and mappings.

The main purpose is to bridge the chasm that exists between highly technical digital representations and plain language so that when questions are asked of data, a lay person could use plain language without prior knowledge of the underlying models.

It is a computable abstract logical model, not a physical structure or schema. "Computable" means that operational software operates directly from the model artefacts, as opposed to using the model for illustration purposes. As a logical model it models data that may be physically held in a variety of different types of data stores, including relational or graph data stores. Because the model is independent of the physical schemas, the model itself has to be interoperable and without any proprietary lock-in.

The IM is a broad model that integrates a set of different approaches to modelling using a common ontology. The components of the model are:

  1. A set of ontologies, which is a vocabulary and definitions of the concepts used in healthcare, or more simply put, a vocabulary of health. The ontologies are made up of the world's leading ontology, Snomed-CT, with a London extension, and various code based taxonomies (e.g. ICD10, Read, supplier codes and local codes).
  2. A common data model, which is a set of classes and properties, using the vocabulary, that represent the data and relationships published by live systems. Note that this data model is NOT a standard model but a collated set of entities and relationships, bound to the concepts and based on real data, that are mapped to a common model.
  3. A library of business specific concept value sets (aka reference sets), which are expression constraints on the ontology for the purpose of query.
  4. A catalogue of reference data such as geographical areas, organisations and people derived and updated from public resources.
  5. A library of Data set (query) definitions for querying and extracting instance data from the information model, reference data, or health records.
  6. A set of maps creating mappings between published concepts and the core ontology as well as structural mappings between submitted data and the data model.
  7. An open source set of utilities that can be used to browse, search, or maintain the model.


Model building blocks and visualisation

The model consists of classes, sets and objects that are instances of classes.

Figure: Ethnicity

Objects can act as objects in their own right (e.g. an instance of chest pain) or may also act as classes (e.g. the class of objects that are chest pain). Likewise sets have members that are objects, and the objects may also act as classes or sets. For example, a set for the 2011 Ethnicity census will contain a member object of "British", which is also a set with members such as English and so on.

The model itself is stored as an RDF based knowledge graph, which means it is implementable in any mainstream Graph database technology. There are no vendor specific extensions to RDF.

In line with the RDF standard, all persistent types, classes, property identifiers and object value identifiers are uniquely named using Internationalized Resource Identifiers (IRIs). In most cases the identifiers are externally provided (e.g. Snomed-CT identifiers), whilst in others they have been created for a particular model. Organisations that author elements of the models use their own identifiers.

From a data modelling perspective the arrangements of types may be referred to as archetypes, which are conceptually similar to FHIR profiles. In the semantic web world they would be considered "shapes". There are an unlimited number of these which frees the model from any particular conventional relational database schema. Inheritance of types is supported which enables broad classifications of types and re-usability.

The parts of the model that model terminology concepts and those that model data use slightly different grammars, in keeping with their different purposes. The information model language describes the differences.

The models can be viewed in their raw technical form (in JSON or Turtle) or can be viewed with the information model viewer at the online tool Information model directory (https://im.endeavourhealth.net/#/).

Information model language

Main article: information modelling language, which describes the language in more detail.

The semantic web approach is adopted for the purposes of identifiers and grammar. In this approach, data can be described via the use of a plain language grammar consisting of a subject, a predicate, and an object: a triple, with an additional context referred to as a graph or RDF data set. The theory is that all health data can be described in this way (with predicates being extended to include functions).
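
As a minimal sketch of this grammar, the Python below represents each statement as a subject, predicate and object together with its graph context, and matches statements against a pattern. The namespace <code>IM</code>, the identifiers and the <code>match</code> helper are hypothetical and use only the standard library; a real implementation would use an RDF store.

```python
from typing import List, NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str  # the additional context (named graph)


IM = "http://example.org/im#"  # hypothetical namespace for illustration

quads = [
    Quad(IM + "obs1", IM + "recordOf", IM + "BloodPressure", IM + "gp-graph"),
    Quad(IM + "obs1", IM + "patient", IM + "patient42", IM + "gp-graph"),
    Quad(IM + "obs2", IM + "recordOf", IM + "ChestPain", IM + "hospital-graph"),
]


def match(
    data: List[Quad],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
    graph: Optional[str] = None,
) -> List[Quad]:
    """Return the quads that agree with every non-None part of the pattern."""
    pattern = (subject, predicate, obj, graph)
    return [
        q for q in data
        if all(want is None or want == have for want, have in zip(pattern, q))
    ]
```

Leaving a position as <code>None</code> acts like a variable in a query pattern, so <code>match(quads, predicate=IM + "recordOf")</code> returns both observation statements.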

However, the semantic web languages are highly complex and a set of more pragmatic approaches is taken for the more specialised structures.

The consequence of this approach is that W3C web standards such as the Resource Description Framework (RDF) can be used. This sees the world as a set of triples (subject / predicate / object) with some things named and some things anonymous. Systems that adopt this approach can exchange data in a way that preserves the semantics. Whilst RDF is an incredibly arcane language at a machine level, the things it can describe can be very intuitive when represented visually. In other words, the information modelling approach involves an RDF graph.

In addition to semantic web languages, other commonly used languages are used to enable the model to be accessed by more people. For example, the Snomed-CT expression constraint language (ECL) is a common way of defining concept sets. ECL is logically equivalent to a closed world query on an open world OWL ontology. The IM language uses the semantic language of SPARQL together with entailment to model ECL, but ECL can be exported or used as input as an alternative.
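
The idea of a concept set defined by subsumption can be sketched in Python as below. It mimics the ECL descendant-or-self operator (written << in ECL) over a tiny hand-made hierarchy. The <code>IS_A</code> table and the function name are assumptions for illustration, not Snomed-CT content or IM language syntax.

```python
# Hypothetical "is a" hierarchy: child concept -> list of parent concepts.
IS_A = {
    "Type 2 diabetes": ["Diabetes mellitus"],
    "Type 1 diabetes": ["Diabetes mellitus"],
    "Diabetes mellitus": ["Disorder"],
    "Asthma": ["Disorder"],
}


def descendants_or_self(concept: str) -> set:
    """Expand a concept to itself plus everything it subsumes (closed world)."""
    result = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for child, parents in IS_A.items():
            if current in parents and child not in result:
                result.add(child)
                frontier.append(child)
    return result
```

Only concepts asserted in the hierarchy are returned, which reflects the closed world reading of the query described above.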