Information model query: Difference between revisions

From Endeavour Knowledge Base
Revision as of 09:49, 2 June 2022

Background

Having an information model is one thing. Querying to extract data is another.

As an RDF graph knowledge base, the information model (IM) can be queried directly using SPARQL. The IM also holds text data, which can be queried directly using OpenSearch or Elasticsearch.

However, health records are likely to be stored in relational, or at least SQL-compatible, databases, so querying health records that are aligned with the model will require SQL.

Using SQL, SPARQL or Elasticsearch directly to develop queries over the IM, or over the health records aligned with it, presents several problems:

  • Authoring SQL or SPARQL directly requires a high degree of skill. Health queries in particular need heavily nested subqueries and some of the more advanced techniques, such as correlated subqueries or window functions.
  • Translating a user-oriented, intuitive query builder directly into SQL or SPARQL, and back again, is very difficult. Most query applications use an intermediate language from which the queries are then generated. Examples include GraphQL, and DAX and M in Power BI.
  • Enabling direct query via SPARQL endpoints or SQL APIs can result in crippling performance problems.

Consequently, the IM provides a pragmatic query domain-specific language (DSL) to bridge the gap between a plain language representation and the run-time query. This DSL can be used to exchange query definitions across multiple instances.

The IM also provides reference software showing how SQL, SPARQL or OpenSearch queries can be generated from the DSL, and how a plain language or diagrammatic interpretation can be produced from it. The reference software also shows the converse, i.e. the generation of the DSL from plain language.

The language is designed to meet the following requirements.

Query language requirements

Requirement 1 - Should support the vast majority of query patterns for defining and producing data sets or patient profiles that are needed in the real world.

Requirement 2 - Should enable mapping to SQL via simple type-to-table and property-to-field maps, for health data held in relational form, as long as the health data content conforms to an IM data model.
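
As a rough illustration of Requirement 2, the sketch below shows how a type-to-table and property-to-field map might drive SQL generation. The `SimpleQuery` shape, the map contents, and the table and column names are all hypothetical; they are assumptions made for this example and are not taken from the IM DSL or its reference software.

```typescript
// Hypothetical shapes for illustration only; not the IM DSL itself.
interface TypeMap {
  table: string;                   // relational table holding this IM type
  fields: Record<string, string>;  // IM property name -> column name
}

interface SimpleQuery {
  type: string;                    // IM type to select from
  select: string[];                // IM properties to return
  where?: { property: string; value: string | number };
}

// Assumed example map; a real map would be derived from the IM data model.
const typeMaps: Record<string, TypeMap> = {
  Observation: {
    table: "observation",
    fields: { patient: "patient_id", concept: "concept_id", effectiveDate: "effective_date" },
  },
};

function toSql(
  query: SimpleQuery,
  maps: Record<string, TypeMap>,
): { sql: string; params: (string | number)[] } {
  const map = maps[query.type];
  if (!map) throw new Error("No table map for type " + query.type);

  // Resolve an IM property to its column, failing loudly on unmapped names.
  const column = (property: string): string => {
    const field = map.fields[property];
    if (!field) throw new Error("No field map for property " + property);
    return field;
  };

  const params: (string | number)[] = [];
  let sql = "SELECT " + query.select.map(column).join(", ") + " FROM " + map.table;
  if (query.where) {
    // Values are passed as parameters rather than concatenated into the SQL text.
    sql += " WHERE " + column(query.where.property) + " = ?";
    params.push(query.where.value);
  }
  return { sql, params };
}

const sqlResult = toSql(
  {
    type: "Observation",
    select: ["patient", "effectiveDate"],
    where: { property: "concept", value: "39330711000001103" },
  },
  typeMaps,
);
console.log(sqlResult.sql, sqlResult.params);
```

Because only mapped property names ever reach the SQL text, the same query definition could in principle be run against any relational store for which such a map exists.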

Requirement 3 - Should enable mapping to SPARQL directly for querying the IM itself.
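
To make Requirement 3 more concrete, here is a minimal sketch of generating SPARQL text from a small query object. The `EntityQuery` shape and the `http://example.org/im#` IRI are placeholders invented for this example; only the `rdf:` and `rdfs:` namespaces and the SPARQL functions used are standard. The mandatory `limit` is one simple way to bound the cost of a generated query.

```typescript
// Hypothetical query shape; the real IM DSL may differ.
interface EntityQuery {
  typeIri: string;         // IRI of the class whose instances are wanted
  labelContains?: string;  // optional case-insensitive text filter on rdfs:label
  limit: number;           // upper bound on returned rows
}

// Escape a string for use inside a double-quoted SPARQL literal.
function sparqlLiteral(text: string): string {
  return '"' + text.replace(/\\/g, "\\\\").replace(/"/g, '\\"') + '"';
}

function toSparql(query: EntityQuery): string {
  if (/[<>\s]/.test(query.typeIri)) throw new Error("Invalid IRI: " + query.typeIri);
  if (!(query.limit > 0) || Math.floor(query.limit) !== query.limit) {
    throw new Error("limit must be a positive integer");
  }
  const lines = [
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>",
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
    "SELECT ?entity ?label WHERE {",
    "  ?entity rdf:type <" + query.typeIri + "> ;",
    "          rdfs:label ?label .",
  ];
  if (query.labelContains !== undefined) {
    lines.push(
      "  FILTER(CONTAINS(LCASE(?label), LCASE(" + sparqlLiteral(query.labelContains) + ")))",
    );
  }
  lines.push("}", "LIMIT " + query.limit);
  return lines.join("\n");
}

const sparqlText = toSparql({
  typeIri: "http://example.org/im#Concept",  // placeholder IRI, not a real IM class
  labelContains: 'covid "19"',
  limit: 10,
});
console.log(sparqlText);
```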

Requirement 4 - Should enable mapping directly from and to Expression Constraint Language (ECL) for searching and set definitions.

Requirement 5 - Should enable a query to be built as JavaScript objects or POJOs, avoiding the need for a language-specific parser.

Requirement 6 - Should embed inference statements such as subtype, supertype, or set inclusion as part of the query definition, thus avoiding the need for explicit modelling of the complex logic in the query itself.
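
One way to picture Requirement 6 is a concept reference that carries a flag asking for subtypes, leaving the hierarchy walk to the query engine. The sketch below assumes a hypothetical `ConceptRef` shape and a toy child-to-parents table with made-up `ex:` identifiers; a real implementation would resolve subtypes against the IM rather than an in-memory map.

```typescript
// Hypothetical flag-based concept reference; names are illustrative only.
interface ConceptRef {
  iri: string;
  descendantsOrSelfOf?: boolean;  // include the concept and all of its subtypes
}

// Toy child -> parents table standing in for the IM's subtype relationships.
const parentsOf: Record<string, string[]> = {
  "ex:CovidVaccineProductA": ["ex:CovidVaccine"],
  "ex:CovidVaccineProductB": ["ex:CovidVaccine"],
  "ex:CovidVaccine": ["ex:Vaccine"],
  "ex:Vaccine": [],
};

// Walk upwards from the candidate until the ancestor is found or parents run out.
function isSubtypeOrSelf(
  candidate: string,
  ancestor: string,
  hierarchy: Record<string, string[]>,
): boolean {
  const pending: string[] = [candidate];
  const seen: Record<string, boolean> = {};
  while (pending.length > 0) {
    const current = pending.pop() as string;
    if (current === ancestor) return true;
    if (seen[current]) continue;
    seen[current] = true;
    const parents = hierarchy[current] || [];
    for (const parent of parents) pending.push(parent);
  }
  return false;
}

// The query author sets a flag; the subtype logic stays out of the query itself.
function matches(
  candidate: string,
  ref: ConceptRef,
  hierarchy: Record<string, string[]>,
): boolean {
  return ref.descendantsOrSelfOf
    ? isSubtypeOrSelf(candidate, ref.iri, hierarchy)
    : candidate === ref.iri;
}

console.log(matches("ex:CovidVaccineProductA", { iri: "ex:Vaccine", descendantsOrSelfOf: true }, parentsOf));
console.log(matches("ex:CovidVaccineProductA", { iri: "ex:Vaccine" }, parentsOf));
```

The flag plays a role similar to the `<<` operator in ECL: the query states the intent and the engine supplies the inference.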

Requirement 7 - Should support object result format as well as relational format, i.e. nested JSON object results as well as flat table results.
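
The difference between the two result formats in Requirement 7 can be sketched as follows. The `PatientResult` and `FlatRow` shapes and their field names are hypothetical and chosen only to show a nested object result being flattened into table rows.

```typescript
// Hypothetical result shapes; field names are illustrative only.
interface ObservationResult {
  concept: string;
  effectiveDate: string;
}

interface PatientResult {
  id: string;
  observations: ObservationResult[];
}

interface FlatRow {
  patientId: string;
  concept: string | null;
  effectiveDate: string | null;
}

// One row per patient-observation pair. A patient with no observations
// still produces a single row, with nulls in the observation columns.
function flatten(patients: PatientResult[]): FlatRow[] {
  const rows: FlatRow[] = [];
  for (const patient of patients) {
    if (patient.observations.length === 0) {
      rows.push({ patientId: patient.id, concept: null, effectiveDate: null });
      continue;
    }
    for (const obs of patient.observations) {
      rows.push({ patientId: patient.id, concept: obs.concept, effectiveDate: obs.effectiveDate });
    }
  }
  return rows;
}

const nestedResults: PatientResult[] = [
  {
    id: "p1",
    observations: [
      { concept: "ex:CovidVaccine", effectiveDate: "2021-01-10" },
      { concept: "ex:CovidVaccine", effectiveDate: "2021-04-02" },
    ],
  },
  { id: "p2", observations: [] },
];

const flatRows = flatten(nestedResults);
console.log(JSON.stringify(flatRows, null, 2));
```

The nested form keeps each patient as one object, which suits display in an application, while the flat form repeats the patient identifier on every row, which suits export to CSV or a relational table.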


ECL example


<<39330711000001103                              /* is a Covid vaccine */
OR (                                             /* or ( */
  <<10363601000001109 :                          /* is a UK product */
                                                 /* and */
    <<10362601000001103 = 10362601000001103      /* has VMP Covid vaccine */
)

Grammar

This is the section on grammar