A JSON representation for HL7 v2?

Jan 27, 2015

Several weeks ago, I was in Amsterdam for the Furore FHIR DevDays. While there, Nikolay from Health Samurai showed off a neat JavaScript-based framework for sharing scripts that convert from HL7 v2 to FHIR. Sharing these scripts, however, requires a standard JSON representation for HL7 v2 messages, and that turns out to have its challenges. Let’s start with what looks like a nice simple representation:

{
 "MSH" : ["|", null, "HC", "HC1456", "ATI", "ATI1001", "200209100000",
    null, null, [ "ACK", "ACK" ], "11037", "P", "2.4"],
 "MSA" : [ "AA", "345345" ]
}

This is a pretty natural way to represent a version 2 message in JSON, but it has a number of deficiencies. The first is that a message can contain more than one segment of the same type, and JSON property names must be unique (actually, JSON doesn’t explicitly say this, but Tim Bray’s clarification does). So the first thing we need to do is make the segments an array:

{
 "v2" : [
  [ "MSH", "|", null, "HC", "HC1456", "ATI", "ATI1001", "200209100000",
     null, null, [ "ACK", "ACK" ], "11037", "P", "2.4"],
  [ "MSA", "AA", "345345" ]
 ]
}

This format - where the segment code is item 0 in the array of values that represent the segment - has the useful property that field “1” in the HL7 definitions becomes item 1 in the array.
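As a minimal sketch of why that indexing property is convenient, here is how a script might read fields from this format. The helper names `segments` and `field` are hypothetical, chosen for illustration only; they are not part of any framework mentioned here.

```python
import json

# The two-segment example from above, in the "array of segments" form.
message = json.loads("""
{
 "v2" : [
  [ "MSH", "|", null, "HC", "HC1456", "ATI", "ATI1001", "200209100000",
     null, null, [ "ACK", "ACK" ], "11037", "P", "2.4"],
  [ "MSA", "AA", "345345" ]
 ]
}
""")


def segments(msg, code):
    """Return every segment whose item 0 matches the code (segments may repeat)."""
    return [seg for seg in msg["v2"] if seg[0] == code]


def field(segment, number):
    """Return item `number` of a segment, or None if the segment is too short."""
    return segment[number] if number < len(segment) else None


msa = segments(message, "MSA")[0]
print(field(msa, 1))  # AA
print(field(msa, 2))  # 345345
print(field(msa, 5))  # None
```

Because the segment code occupies item 0, the field number is used as the array index directly, with no off-by-one adjustment.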

Btw, alert readers will note that the { “v2”: } part is pure syntax, and could potentially be dropped, but my experience is that many JSON parsers can only accept an object, not an array (arrays must be properties of objects), so we really should have an object wrapper. At the DevDays, we discussed pulling out some data from the MSH, and making it explicit:

{
 "event" : "ACK",
 "msg" : "ACK",
 "structure" : "ACK",
 "segments" : [
   ...
 ]
}

I’m not sure whether that’s justified or not. The information is already in the MSH segment, so it’s straight duplication.

Problems

However, this nice, simple-to-grasp format turns out to be relatively unstable: the way an item is represented depends on the values around it, so scripts won’t be shareable across different implementations. As an example, take the representation of MSH-3, of type HD (ST from 2.1 to 2.2). In the example above, it’s represented as “HC” - just a simple string, corresponding to |HC|. If, however, the source message uses one of the other components of the HD data type, e.g. |^HC^L|, then the JSON representation would change to, say:

 { "Universal ID" : "HC", "Universal ID Type" : "L" }

So the first problem is that whether or not subsequent components appear changes the representation of the first component. Note that this is an ambiguity built into v2 itself, and is handled in various different ways by the many existing HL7 v2 libraries. The second problem with this particular format is that the names given to the fields have varied across the different versions of HL7 v2, as they have never been regarded as significant. Universal ID is known as “universal ID” from v2.3 to 2.4 - other fields have much more variation than that. So it’s better to avoid names altogether, especially since implementers regularly just use additional components that are not yet defined:

 { "2" : "HC", "3" : "L" }

but if all we’re going to do is have index values, then let’s just use an array:

 [ null, "HC", "L" ]

Though this does have the problem that component 2 is element 1. We could fix that with this representation:

 [ "HD", null, "HC", "L" ]

where the first item in the array is its type; this would be variable across versions, and could be omitted (e.g. replaced with null) - I’m not sure whether that’s a value addition or not. Below, I’m not going to add the type to offset the items in the array, but it’s still an option.
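To make the instability concrete, here is a rough sketch of the defensive code a shared script would need if a field could arrive either as a bare string or as an array of components. The helper `component` is hypothetical, and it assumes the array form without the type offset, so component n sits at index n - 1.

```python
def component(value, number):
    """Return component `number` (1-based) of a field in the unstable format.

    The field may be None, a bare string (only component 1 present),
    or an array in which component n sits at index n - 1.
    """
    if value is None:
        return None
    if isinstance(value, str):
        # A bare string stands for component 1 only.
        return value if number == 1 else None
    index = number - 1
    return value[index] if index < len(value) else None


print(component("HC", 1))               # HC
print(component([None, "HC", "L"], 2))  # HC
print(component([None, "HC", "L"], 1))  # None
```

Every script that touches a field has to repeat this kind of branching, which is the cost that the formats discussed next try to manage in different ways.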

The general structure for a version 2 message (or batch) is:

  • A list of segments.
  • Each segment has a code, and a number of data elements
  • Each data element can occur more than once

then:

  • Each Data element has a type, which is either a simple text value, or one or more optional components
  • Each component has a type, which is either a simple text value, or one or more optional sub-components
  • Each subcomponent has a text value of some type

or:

  • Each Data element has one or more components
  • Each component has one or more subcomponents
  • Each subcomponent has a text value of some type
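The last of these lists maps directly onto nested splitting of the wire text. The sketch below is an illustration only: it assumes the default delimiters (~ for repeats, ^ for components, & for subcomponents), ignores escape sequences, and represents empty values as None, which is a choice made here rather than anything the format mandates. The function name `parse_field` is hypothetical.

```python
def parse_field(text, repeat="~", component="^", subcomponent="&"):
    """Split one field's wire text into repeats > components > subcomponents.

    Empty strings become None. Escape sequences are not handled.
    """
    if text == "":
        return None
    return [
        [
            [sub if sub != "" else None for sub in comp.split(subcomponent)]
            for comp in rep.split(component)
        ]
        for rep in text.split(repeat)
    ]


print(parse_field("HC"))     # [[['HC']]]
print(parse_field("^HC^L"))  # [[[None], ['HC'], ['L']]]
print(parse_field("A~B&C"))  # [[['A']], [['B', 'C']]]
```

Note that |HC| and |^HC^L| come out with the same nesting depth, so nothing about the shape depends on which components happen to be populated.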

Aside: where’s the abstract message syntax? Well, we tried to introduce it into the wire format in v2.xml - this was problematic for several reasons (names vary, people don’t follow the structure, the structures are ambiguous in some commonly used versions, and most of all, injecting the names into the wire format was hard), and it didn’t actually give you much validation, which was the original intent, since people don’t always follow them. That’s why it’s called “abstract message syntax”. Here, we’re dealing with concrete message syntax.

The first is what the specification describes, but the wire format hides the difference between the various forms, and you can only tell them apart if you have access to the definitions. The problem is, often you don’t, since the formats are often extended informally or formally, and implementers make a mess of this across versions. And this practice is fostered by the way the HL7 committees change things. I’ve found, after much experimentation, that the best way to handle this is to hide the difference behind an API - then it doesn’t matter. But we don’t have an API to hide our JSON representation behind, and therefore we have to decide.

That gives us a poisoned chalice. We can decide for a more rigorous format that follows my second list; this makes for more complicated conversion scripts that get written against the wire format, but they are much more re-usable. Or we can decide for a less rigorous format that’s easier to work with and follows the v2 definitions more naturally, but is less robust and less re-usable.

Option #1: Rigor

In this option, there’s an array for every level, and following the second list:

  1. Array for segments
  2. Array for Data Elements
  3. Array for repeats
  4. Array for components
  5. Array for sub-components

And our example message looks like this:

{
 "v2" : [
  [ [[["MSH"]]], [[["|"]]], null, [[["HC"]]], [[["HC1456"]]], [[["ATI"]]],
    [[["ATI1001"]]], [[["200209100000"]]], null, null, [[["ACK"], ["ACK"]]],
    [[["11037"]]], [[["P"]]], [[["2.4"]]] ],
  [ [[["MSA"]]], [[["AA"]]], [[["345345"]]] ]
 ]
}

This doesn’t look nice, and writing accessors for data values means accessing at the sub-component level always, which would be a chore, but it would be very robust across implementations and versions. I’m not sure how to evaluate whether that’s worthwhile - mostly, but not always, it’s safe to ignore additional components that are added across versions, or in informal extensions.
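The chore can be contained in a single accessor. The following is a sketch under stated assumptions: the function `get` and its 1-based repeat, component, and subcomponent arguments are hypothetical conventions chosen here, and it only looks at the first segment with a given code.

```python
import json

# The example message in the rigorous form (an array at every level).
rigorous = json.loads("""
{
 "v2" : [
  [ [[["MSH"]]], [[["|"]]], null, [[["HC"]]], [[["HC1456"]]], [[["ATI"]]],
    [[["ATI1001"]]], [[["200209100000"]]], null, null, [[["ACK"], ["ACK"]]],
    [[["11037"]]], [[["P"]]], [[["2.4"]]] ],
  [ [[["MSA"]]], [[["AA"]]], [[["345345"]]] ]
 ]
}
""")


def get(msg, code, field, repeat=1, component=1, subcomponent=1):
    """Fetch one text value; return None when any level is absent.

    Uses the first segment with the given code. The field number indexes
    the segment directly; the deeper levels are 1-based.
    """
    for seg in msg["v2"]:
        if seg[0][0][0][0] == code:
            node = seg
            for index in (field, repeat - 1, component - 1, subcomponent - 1):
                if node is None or index >= len(node):
                    return None
                node = node[index]
            return node
    return None


print(get(rigorous, "MSA", 1))                # AA
print(get(rigorous, "MSH", 3))                # HC
print(get(rigorous, "MSH", 10, component=2))  # ACK (item 10 of the example array)
print(get(rigorous, "MSH", 2))                # None
```

With defaults for the deeper levels, a caller that only wants the first value of a field writes no more than it would against the simple format, and the call keeps working if extra components appear later.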

Option #2: Simplicity

In this option, there’s a choice of string or array:

  1. Array for segments
  2. Array for Data Elements
  3. Array for repeats
  4. String or Array for components
  5. String or Array for sub-components

And our example message looks like this:

{
 "v2" : [
  [ "MSH", ["|"], null, ["HC"], ["HC1456"], ["ATI"], ["ATI1001"], ["200209100000"],
     null, null, [[ "ACK", "ACK" ]], ["11037"], ["P"], ["2.4"]],
  [ "MSA", ["AA"], ["345345"] ]
 ]
}

The annoying thing here is that we haven’t achieved the simplicity that we really wanted (what we had at the top) because of repeating fields. I can’t figure out a way to remove that layer without introducing an object (more complexity), or introducing ambiguity.
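One way to see how the two options relate: the simple form can be promoted mechanically to the rigorous one by wrapping every bare string in the arrays it implies. This is a sketch only, and `to_rigorous` is a hypothetical name. It assumes the simple form is exactly as described above: the segment code as a plain string, each data element either null or an array of repeats, and strings allowed at the component and subcomponent levels.

```python
import json

simple = json.loads("""
{
 "v2" : [
  [ "MSH", ["|"], null, ["HC"], ["HC1456"], ["ATI"], ["ATI1001"], ["200209100000"],
     null, null, [[ "ACK", "ACK" ]], ["11037"], ["P"], ["2.4"]],
  [ "MSA", ["AA"], ["345345"] ]
 ]
}
""")


def to_rigorous(msg):
    """Convert a simple-form message to the array-at-every-level form."""

    def component(value):
        # A component is an array of subcomponents, or a single bare value.
        return value if isinstance(value, list) else [value]

    def repeat(value):
        # A repeat is an array of components, or a single bare string.
        if isinstance(value, list):
            return [component(c) for c in value]
        return [[value]]

    def element(value):
        # A data element is null or an array of repeats.
        return None if value is None else [repeat(r) for r in value]

    result = []
    for seg in msg["v2"]:
        code_element = [[[seg[0]]]]  # the segment code, wrapped like any other element
        result.append([code_element] + [element(e) for e in seg[1:]])
    return {"v2": result}


converted = to_rigorous(simple)
print(converted["v2"][1])      # [[[['MSA']]], [[['AA']]], [[['345345']]]]
print(converted["v2"][0][10])  # [[['ACK'], ['ACK']]]
```

The conversion in this direction loses nothing. Going the other way requires deciding when to collapse a one-item array back to a string, which is exactly where the ambiguity comes in.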

Summary

Which is better? That depends, and I don’t know how to choose. For the FHIR purpose, I think that the robust format is probably better, because it would allow for more sharing of conversion scripts. But for other users, the simpler format might be more appropriate.

p.s. Nikolay watched the discussion between James Agnew and myself on this with growing consternation, and decided to cater for multiple JSON formats. That’s probably a worse outcome, but I could understand his thinking.