No edit summary |
No edit summary |
||
Line 56: | Line 56: | ||
''<small>Example date format</small>'' | ''<small>Example date format</small>'' | ||
==== Zip files ==== | |||
The zip files are PGP encrypted, using a pre-agreed public key. The Remote Filer has the private key, allowing it to decrypt the files. | The zip files are PGP encrypted, using a pre-agreed public key. The Remote Filer has the private key, allowing it to decrypt the files. | ||
Revision as of 11:09, 11 June 2021
This article describes how the Discovery Data Service makes data available for Remote Subscriber Databases (RSDs) and how the DDS Remote Filer application interacts with this to update an RSD. Although it is recommended that the Remote Filer application be used for updating RSDs, any other technical solution can be substituted provided it is able to match the Remote Filer behaviour as far as the points of interaction are concerned.
Subscriber database schema
DDS currently provides SQL scripts for creating subscriber databases, one for each of the two support database engines; the database scripts are the same no matter what configuration options are selected when setting up the feed from DDS (e.g. PI versus de-identified).
- MySQL
- SQL Server (or compatible, e.g. Azure SQL DB)
Please note the following:
|
Subscriber feeds
There are two separate feeds of data that DDS sends to each RSD:
- Published Data Feed – this includes all patient data, plus some supporting data (clinicians and organisations for example) that is sent into DDS by external publishers.
- Reference Data Feed – this includes lookups and mappings for clinical codes (Read2 to SNOMED for example) that is not directly published into DDS but is updated in subscriber databases.
Data for each feed is staged in a separate directory on the DDS SFTP server for each subscriber. The DDS Remote Filer application runs supports running in two different modes, one to download and process the Published Data Feed and the other to download and process the Reference Data Feed.
If you replace the Remote Filer application with an alternative solution it must support both feeds. |
DDS SFTP server
The DDS SFTP server is used to stage all data for all RSDs.
A user is created on this server for each DDS subscriber, with their own username and SSH certificate, to allow them to securely access the data intended for them.
If a DDS subscriber has multiple RSDs (for example, one for GP data and one for acute data), the same SFTP user is used for both RSDs.
For each RSD, the following three directories are created under the SFTP user home directory to:
- stage data for download for the Published Data Feed.
- upload feedback files related to the Published Data Feed
(feedback files are explained later in this article). - stage data for download for the Reference Data Feed.
Example SFTP user home directory structure
Published Data Feed Staging Directory
This directory is used by DDS to stage the published data intended for the RSD, and includes all the patient and clinical data. When data from the DDS is available, for a subscriber, it is placed in this directory.
The files placed in this directory are always named in the following format:
<YYYYMMDDHHMMSS>_Subscriber_Data.zip
Where YYYYMMDDHHMMSS is the date and time the data is staged for collection. When a Remote Filer connects and downloads the files, they should be sorted by file name, so they are in date order, and applied in that order.
Example date format
Zip files
The zip files are PGP encrypted, using a pre-agreed public key. The Remote Filer has the private key, allowing it to decrypt the files.
If a zip file exceeds 10MB, it is fragmented into a multi-part zip. |
Within each zip file is one or more 'inner' zip files that generally represent updates to individual patient records (although this is not always the case). These zip files are named with the pattern:
<YYYYMMDD>_<Ordinal>_<UUID>.zip
Where:
- <YYYYMMDD> is the date the file was staged.
- <Ordinal> is the number that indicates that files should be processed by the subscriber.
There may gaps in the ordinals due to the incrementing integer being used across all DDS subscribers. For example, file 63579 should be processed before 63581, but the fact that file 63580 was not found is not an error.
- <UUID> is a unique identifier generated by DDS for each inner zip file.
Example zip files showing the naming convention and required order
Files should be sorted by ordinal number and applied in that order.
Each inner zip file contains one or more CSV files, which contain the actual data to be applied to the Remote Subscriber Database, and a single JSON file.
Example inner zip file contents showing the CSV and JSON files
CSV files
Each CSV file relates to a table in the Remote Subscriber Database, with the same name; for example, patient.csv relates to the patient table.
Within each CSV file, the first column is always called is_delete. This is either:
- FALSE - indicates that the record is an upsert (update or insert), in which case the remaining columns directly map to the columns on the table, giving the new values. If any cell is empty (as 0 is in the below image) then this should be treated as a null value rather than an empty String.
- TRUE - indicates that the record should be deleted.
The second column is always an id column, giving the unique identifier for the record.
Example CSV file showing is_delete and id columns
JSON files
The JSON file contains metadata about the CSV files and source data types in DDS (which map to the data types used in the Remote Subscriber database). For example, the below image shows the JSON file definition of the patient.csv file, stating that the “title” field was generated from a String value in DDS (which would indicate that the equivalent field in the RSD is a varchar), and that the “date_of_birth” field is a date.