#FHIR and Bulk Data Access Proposal

Oct 22, 2017

ONC have asked the FHIR community to add new capabilities to the FHIR specification to increase support for API-based access and push of data for large number of patients in support of provider-based exchange, analytics and other value-based services. The background to this requirement is that while FHIR allows for access to data from multiple patients at a time, the Argonaut implementations are generally constrained to a single patient access, and requires human mediated login on a regular basis. This is mainly because the use case on which the Argonaut community focused was patient portal access. If this work is going to be extended to provide support for API based access to a large number of patients in support of provider based exchange, the following questions (among others) need to be answered:

how does a business client (a backend service, not a human) get access to the service? How are authorizations handled?
How do the client and server agree about which patients are being accessed, and which data is available?
What format is the data made available in?
How is the request made on a RESTful API?
How would the client and server most efficiently ensure the client gets the data it asks for, without sending all data every time?

The last few questions are important because the data could be pretty large - potentially >100000000 resources, and we’ve been focused on highly granular exchanges so far. Our existing solutions don’t scale well.

In response to some of these problems, the SMART team had drafted an initial strawman proposal, which a group of us (FHIR editors, ONC staff, EHR & other vendors) met to discuss further late one night at the San Diego WGM last week. Discussion was - as expected - vigorous. Between us, we hammered out the following refined proposal:

Summary

This proposal describes a way of granting an application access to data on a set of patients. The application can request a copy of all pertinent (clinical) access to the patients in a single download. Note: We expect that this data will be pretty large.

High-level Use Case Description – FHIR-enabled Population Services (this section provided by ONC)

Ecosystem outcome expected to enable many specific use case/business needs: Providers and organizations accountable for managing the health of populations can efficiently access to large volumes of informationon a specified group of individuals without having to access one record at a time. This population-level access would enable these stakeholders to: assess the value of the care provided, conduct population analyses, identify at-risk populations, and track progress on quality improvement.
Technical Expectations: There would be a standardized method built into the FHIR standard to support access to and transfer of a large amount of data on a specified group of patients and that such method could be reused for any number of specific business purposes.
Policy Expectations: All existing legal requirements for accessing identifiable patient information via other bulk methods (e.g., ETL) used today would continue to apply (e.g., through HIPAA BAAs/contracts, Data Use Agreements, etc).

Authorizing Access

Access to the data is granted by using the SMART backend services spec.

Note: We discussed this at length, but we didn’t see a need for Group/* or Launch/* kind of scopes - System/.read will do fine. (or User/.*, for interactive processes, though interactive processes are out of scope for this work). This means that a user cannot restrict Authorization down to just a group, but in this context, users will trust their agents.

Accessing Data

The application can do either of the following queries:

 GET [base]/Patient/$everything?start=[date-time]&_type=[type,type]
 GET [base]/Group/[id]/$everything?start=[date-time]&_type=[type,type]

Notes:

The first query returns all data on all patients that the client’s account has access to, since the starting date time provided.
The second query provides access to all data on all patients in the nominatedgroup. The point of this is that applications can request data on a subset of all their patients without needing a new access account provisioned (exactly how the Group resource is created/identified/defined/managed is out of scope for now - the question of whether we need to do sort this out has been referred to ONC for consideration).
The start date/time means only records since the nominated time. In the absence of the parameter, it means all data ever
The _type parameter is used to specify which resource types are part of the focal query - e.g. what kind of resources are returned in the main set. The _type parameter has no impact on which related resources are included) (e.g. practitioner details for clinical resources). In the absence of this parameter, all types are included.
The data that is available for return includes at least the CCDS (we referred the question of exactly what the data should cover back to the ONC)
The FHIR specification will be modified to allow Patient/$everything to cross patients, and to add $everything to Group
Group will be added as a compartment type in the base Specification

Asynchronous Query

Generally, this is expected to result in quite a lot of data. The client is expected to request this asynchronously, per rfc 7240. To do this, the client uses the Prefer header:

Prefer: respond-async

When the server sees this return header, instead of generating the response, and then returning it, the server returns a 202 Accepted header, and a Content-Location at which the client can use to access the response.

The client then queries this content location using GET content-location (no prefixing). The response can be one of 3 outcomes:

a 202 Accepted that indicates that processing is still happening. This may have an “X-Progress header” that provides some indication of progress to the user (displayed as is to the user - no format restrictions but should be <100 characters in length). The client repeats this request periodically until it gets either a 200 or a 5xx
a 5xx Error that indicates that preparing the response has failed. The body is an OperationOutcome describing the error
a 200 OK with the response for the original request. This response has one or more Link: headers (seerfc 5988) that list the files that are available for download as a result of servicing the request. The response can also carry a X-Available-Until header to indicate when the response will no longer be available

Notes:

This asynchronous protocol will be added as a general feature to the FHIR spec for all calls. it will be up to server discretion when to support it.
The client can cancel a task or advise the server it’s ok to delete the outcome using DELETE [content-location]
Other than the 5xx response, these responses have no body, except when the accept content type is ‘text/html’, in which case the responses should have an HTML representation of the content in the header (e.g. a redirect, an error, or a list of files to download) (it’s up to server discretion to decide whether to support text/html - typically, the reference/test servers do, and the production servers don’t)
Link Headers can have one or more links in them, per rfc 5988
Todo: decide whether to add ‘supports asynchronous’ flag to the CapabilityStatement resource

Format of returned data

If the client uses the Accept type if application/fhir+json or application/fhir+xml, the response will be a bundle in the specified format. Alternatively, the client can use the type application/fhir+ndjson. In this case:

The response is a set of files in ndjson format (see http://ndjson.org/).
Each file contains only resources of a single type.
There can be more than one file for each resource type.
Bundles are broken up at Bundle.entry.resource - e.g. a bundle is split on Entries so the the bundle json file will contain the bundle without the entry resource, and the resources are found (by id) in the type specific resource files (todo: how does that work for history?)

The nd-json files are split up by resource type to facilitate processing by generic software that reads nd-json into storage services such as Hadoop.

Notes:

the content type application/fhir+ndjson will be documented in the base spec
We may need to do some registration work to make +ndjson legal
We spent some time discussing formats such as Apache Avro and Parquet - these have considerable performance benefits over nd-json but are much harder to produce and consume. Clients and servers are welcome to do content type negotiation to support Parquet/ORC/etc, but for now, only nd-json is required. We’ll monitor implementation experience to see how it goes

Follow up Requests

Having made the initial request, applications should store and retain the data, and then only retrieve subsequent changes. this is done by providing a _start time on the request.

Notes:

Todo: Is _start the right parameter (probably need _lastUpdated, or a new one)?
Todo: where does the marker time (to go into the start/date of the next follow up request) go?
clients should be prepared to receive resources that change on the boundary more than once (still todo)

Subscriptions

The ONC request included “push of data”. It became clear, when discussing this, that server side push is hard for servers. Given that this applications can perform these queries regularly/as needed, we didn’t believe that push (e.g. based on subscriptions) was needed, and we have not described how to orchestrate a push based on these semantics at this time

Prototyping

It’s time to test out this proposal and see how it actually works. With that in mind, we’ll work to prototype this API on the reference servers, and then we’ll hold a connectathon on this API at the New Orleans HL7 meeting in January. Given this is an ONC request, we’ll be working with ONC to find participants in the connectathon, but we’ll be asking the Argonaut vendors to have a prototype available for this connectathon.

Also, I’ll be creating FHIR change proposals for community review for all the changes anticipated in this write up. I guess we’ll also be talking about an updated Argonaut implementation guide at some stage.