Information modelling language - philosophy

This article discusses the underlying rationale and philosophy behind the approach taken in creating the Discovery information modelling language syntax.

It starts with the fundamentals behind software languages in general and moves on to discuss the more specific purposes of the language syntax.

The syntaxes themselves, and references to the underlying grammars, are further described in the article Discovery information modelling language.

Software languages and information

The value of all software lies in its ability to produce more output per person than it costs to produce and run.

The realisation, over 80 years ago, that machines could "think" has led to an exponential increase in output per person. Nowadays, it is expected that one person developing a software algorithm can produce outputs that affect the way billions of people behave and act.

The general approach is to develop software in a language that produces further software that a computer can then use. Over the years, the number of layers of language has steadily increased. Nowadays a web developer is expected to write software in a language that produces JavaScript and HTML, those languages in turn being used by software in a browser application to produce, ultimately, the machine code that renders content on a computer screen.

Software languages that describe arrangements of data are many and varied. However, languages that produce information from data are far less well developed. One particular language, the Unified Modelling Language (UML), was developed in an attempt to make sense of data relationships, but it was primarily designed as a specification for producing diagrams or narrative. These are very informative for humans, but not very useful for machines. In the end, UML diagrams form the basis of specifications from which human beings produce software by hand.

The problem of producing information from data resides in the requirement to use reasoning to produce inferences. Reasoning can be used in the form of deductive reasoning by machines, or inductive reasoning by human beings or machines. In order to reason, it is necessary to classify data in a way that allows deduction to be applied to vast volumes of data. Without classification, information is overwhelming to view and impractical to infer from.

It follows that software languages that produce information from data, taking account of reasoning, are likely to follow the same multi-layered approach to software i.e. software that produces software. This is the basis for the placement of Discovery IML in the chain.

Data through to information

To make sense of data we convert it either to plain language or to diagrams and pictures.

At a fairly deep level (close to a machine language) there is a need to represent mathematical constructs such as logic gates and rules in a language. An example of one such language is predicate logic (first order logic, or FOL), which is used in the mathematical community to represent logical relations between variables. A language of particular value that uses part of FOL, together with other constructs, is Description Logic, which is itself a family of languages designed to describe and define objects.
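
As a flavour of what that notation looks like, a Description Logic style definition might be written along the following lines (the concept and property names are purely illustrative):

    CommonCold ≡ RespiratoryTractInfection ⊓ ∃causedBy.ColdVirus

which can be read as "a common cold is exactly a respiratory tract infection that is caused by some cold virus".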

These languages are all very well, but they are incredibly arcane for the majority of people, including software developers. As a consequence, a number of human-understandable languages have evolved and become established as web standards. Examples of these are OWL2 and, for more basic structural approaches to transporting data, RDF.

Languages such as OWL2 have a single grammar and a number of syntax standards that can be used to represent it. Even these syntaxes can be somewhat impenetrable for most software engineers, who nowadays are more familiar with manipulating JSON or XML into and out of software objects that are themselves specified by software classes.

Thus Discovery Syntax has been created to carry OWL2 as a JSON syntax for convenience.
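
To illustrate the kind of content being carried, the Description Logic definition sketched above could be written in the standard OWL2 functional syntax as follows (names illustrative):

    EquivalentClasses(
      :CommonCold
      ObjectIntersectionOf(
        :RespiratoryTractInfection
        ObjectSomeValuesFrom(:causedBy :ColdVirus)
      )
    )

A JSON rendering of the same axiom might then take a shape broadly like the sketch below. This is purely illustrative and is not the actual Discovery syntax, which is defined in the article referenced above; it simply shows how the same grammar can be carried in a structure that ordinary JSON tooling can read.

    {
      "name": "CommonCold",
      "equivalentTo": {
        "intersectionOf": [
          { "class": "RespiratoryTractInfection" },
          { "someValuesFrom": { "property": "causedBy", "class": "ColdVirus" } }
        ]
      }
    }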

Nevertheless, Description Logic remains difficult to grasp, and using the OWL syntax, whether directly or through the Discovery syntax, to process data remains quite hard.

From this follows the need to be able to use familiar approaches to holding and visualising the data, i.e. the use of conventional structures such as database schemas, XML schemas, or language classes such as Java classes. These conventional structures, rather than simply reflecting the underlying grammars, are designed in a way that is closer to the end use case.

For example, a user may wish to view a hierarchy of disease information in the form of a classification of disease classes: a common cold being a type of respiratory tract infection, and an infection caused by a cold virus, and so on. It is unlikely that the user would be familiar with the idea of object intersection and existential quantification. What they need is a tree with parent nodes and child nodes. Likewise, the software developer is going to use a tree control, and that tree control will have nodes associated with the diseases arranged as parents and children.

To address this problem, the Discovery information modelling language can be perceived as a set of meta classes, i.e. a set of classes that can be used to define a set of programming classes. For this to happen, a set of software mapping utilities is required that can map the arcane IML content to simpler representations such as tree views, or to more purpose-specific classes and properties.
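
As a very small sketch of what such a mapping might look like, assuming nothing more than a hypothetical list of "child is a subclass of parent" statements extracted from the model (and not the actual Discovery utilities), a tree view could be populated along these lines:

    import java.util.*;

    // A minimal sketch, not the actual Discovery mapping utilities:
    // it turns "child is a subclass of parent" statements into the
    // parent/child nodes that a tree control can display.
    public class TreeMapper {

        static class TreeNode {
            final String name;
            final List<TreeNode> children = new ArrayList<>();
            TreeNode(String name) { this.name = name; }
        }

        // Build a node per concept, then attach each child to its parent.
        static Map<String, TreeNode> buildTree(Map<String, String> subclassOf) {
            Map<String, TreeNode> nodes = new HashMap<>();
            subclassOf.forEach((child, parent) -> {
                nodes.computeIfAbsent(child, TreeNode::new);
                nodes.computeIfAbsent(parent, TreeNode::new);
            });
            subclassOf.forEach((child, parent) ->
                nodes.get(parent).children.add(nodes.get(child)));
            return nodes;
        }

        public static void main(String[] args) {
            // Illustrative concept names only.
            Map<String, String> subclassOf = Map.of(
                "Common cold", "Respiratory tract infection",
                "Respiratory tract infection", "Disorder");
            print(buildTree(subclassOf).get("Disorder"), 0);
        }

        static void print(TreeNode node, int depth) {
            System.out.println("  ".repeat(depth) + node.name);
            node.children.forEach(child -> print(child, depth + 1));
        }
    }

The point is not the code itself but the shape of the result: the intersection and existential constructs disappear, and the user simply sees "Common cold" nested under "Respiratory tract infection".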

A further challenge occurs when an information model, containing a data model, needs to be supported by an implementation that stores objects and allows those objects to be queried. There is a gap between the Discovery syntax representation of the underlying model and the storage of objects that conform to it. A set of utilities is therefore developed that can generate reference schemas from the underlying model. These may be useful even if, in the end, the implementer chooses a different schema.
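
As an indication of what a generated reference schema might look like, the sketch below assumes a hypothetical flattening of a model entity into a relational table; it is not the actual generator, and the entity and property names are illustrative only.

    import java.util.List;

    // A minimal sketch, assuming a hypothetical flattening of a model
    // entity into a relational table; not the actual schema generator.
    public class SchemaSketch {

        static String toCreateTable(String entity, List<String> properties) {
            StringBuilder ddl = new StringBuilder("CREATE TABLE " + entity + " (\n");
            ddl.append("  id VARCHAR(64) PRIMARY KEY");
            for (String property : properties) {
                ddl.append(",\n  ").append(property).append(" VARCHAR(255)");
            }
            return ddl.append("\n);").toString();
        }

        public static void main(String[] args) {
            // Illustrative entity and property names only.
            System.out.println(toCreateTable("encounter",
                List.of("patient", "effective_date", "encounter_type")));
        }
    }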

Why another language?

As well as producing a JSON-based syntax that conforms to the underlying grammar of an existing language, there is an additional reason to bring the several languages together as a single syntax.

Within the health informatics community, a historical separation has evolved between two modelling camps: those that model the semantics of concepts via an ontology (aka terminologists), and those that model data structures for storing and transmitting data (aka structuralists). This separation reflects the difference in purpose, the difference in mindset, and the difference in skills required by the two disciplines.

A problem with this separation occurs at the points of overlap: the different camps model their tokens and vocabularies in different ways, from both a grammar and a syntax perspective.

For example, in health care it is possible to model a surgical operation as a data structure with a "body site" attribute; the FHIR R4 Procedure resource does precisely this. It is equally possible to model a procedure by including the body site either as a qualifier of a type of procedure, or as part of the procedure definition itself; SNOMED CT does precisely this. Both approaches might use the same concept for the body site itself (e.g. the hip), but they would use separate property concepts for the property of "has body site" itself. This separation of approach can lead to massive divergences. Taking the structuralist approach and extending it results in archetypes of the kind modelled by OpenEHR. Taking the ontological approach further leads to complex nested expressions which are nigh on impenetrable.
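
To make the contrast concrete, the structural approach places the body site alongside the procedure code in the record. The fragment below is a deliberately minimal, illustrative FHIR R4 Procedure resource (required elements such as the subject are omitted, and coded values are replaced by plain text):

    {
      "resourceType": "Procedure",
      "status": "completed",
      "code": { "text": "Hip replacement" },
      "bodySite": [ { "text": "Hip joint" } ]
    }

The ontological approach instead folds the body site into the meaning of the procedure concept itself, along the lines of the post-coordinated expression sketched below. The real SNOMED CT compositional grammar uses numeric concept identifiers, which are deliberately omitted here; the bracketed labels are placeholders only.

    [hip replacement] : [procedure site] = [hip joint structure]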

Health record query can be achieved via the use of a standard language such as SPARQL, or a specialised form of query such as AQL. However, when querying the attributes of a user as part of an attribute based access control (ABAC) policy, a completely different way of representing the query may be used.

Having a grammar and syntax that encompasses both semantics and structure makes the common overlapping concepts much easier to manage. Having a common syntax for query definition means that a rule in an ABAC policy can use the same syntax as a health record query. Having a common message format in line with interoperability standards such as FHIR helps make sure that the data is never locked in an information silo. A classic structural concept, such as an encounter record, and its semantic definition can then be seamlessly integrated.

Selecting one existing language is not an option. For example, it is possible to model data in OWL2 DL by extensive use of complex OWL constructs including functional properties, property domains, ranges, and precise cardinalities. It is also possible to model query as OWL expressions, except for function parameters. However, the purpose of OWL is to support reasoning, and reasoners use the open world assumption: the absence of a statement does not allow the conclusion that it is false, so a record with no documented allergy does not imply that the patient has no allergies. Data models and data query use a closed world assumption, and query languages are declarative in nature, i.e. instructions as to what to do. Using OWL for purposes other than reasoning and/or classification is like using English to prove Pythagoras' theorem.

Bringing the languages together, at least as a temporary measure to solve a particular set of information requirements, seems worthwhile. Hence a new language.
