FHIR Resources and Unicode
Feb 9, 2013
In the FHIR specification we say that the basic language for resources is unicode:
The XML character set is always Unicode.
Actually, that’s not the right wording - what it should have said is “The character set of a resource is always Unicode”.
Now if the character set is unicode, then any character encoding that is fully mapped to unicode is therefore valid. However, elsewhere in the specification, it says:
FHIR uses UTF-8 for all request and response bodies
This attracted several comments, all along the same lines - why require UTF-8? Well, the logic is fairly simple:
- content type negotiation doesn’t work very well for character sets
- while it might be legal to represent a resource in any character encoding mapped to unicode, what would you do if someone asked you to represent a resource in a character set that doesn’t have a mapping for one or more of the unicode characters it contains?
- Even though it’s possible to convert resources between character sets, what happens to digital signatures?
- What’s going to happen if systems with different encodings, or with different supported subsets try to interoperate?
- As for which unicode encoding: why support more than one? UTF-8 is widely supported, and is required by several HL7 Asian affiliates for v2
- It’s just simpler to say, everyone use UTF-8.
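Two of the points above (unmappable characters and digital signatures) can be sketched in a few lines of Python. This is only an illustration: the resource fragments are made up, and a SHA-256 digest stands in for whatever a real signature scheme would compute over the bytes.

```python
import hashlib

# Hypothetical resource fragment; "ř" (U+0159) is not in ISO-8859-1.
resource = '<name value="Dvořák"/>'

# A strict encode to ISO-8859-1 fails because there is no mapping for "ř".
try:
    resource.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Text that both encodings CAN represent still yields different bytes,
# so a digest (and any signature computed over it) depends on the encoding.
text = '<name value="José"/>'
digest_utf8 = hashlib.sha256(text.encode("utf-8")).hexdigest()
digest_latin1 = hashlib.sha256(text.encode("iso-8859-1")).hexdigest()

print("ISO-8859-1 can represent the first fragment:", latin1_ok)
print("digests match across encodings:", digest_utf8 == digest_latin1)
```

With a single wire encoding, neither situation arises: every character has a representation, and the signed bytes are the same for everyone.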
One problem with requiring UTF-8 is that the HTTP default is ISO-8859-1. This means that you have to specify UTF-8 as the character set on all the HTTP requests and responses. But since it’s a parameter of the content type, and you have to specify the content type anyway, I didn’t see that as particularly painful - but it did get comment in the connectathons, because you do have to remember to do it.
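As a rough sketch of what "remembering" amounts to, the snippet below reads the charset parameter from a Content-Type header and falls back to the ISO-8859-1 default described above when the parameter is missing. The helper name `charset_of` and the `application/xml` media type are placeholders for illustration, not values taken from the specification.

```python
from email.message import Message

# The HTTP default described above, applied when no charset parameter is sent.
HTTP_DEFAULT_CHARSET = "iso-8859-1"


def charset_of(content_type: str) -> str:
    """Return the charset parameter of a Content-Type value (lower-cased),
    or the HTTP default if the parameter is absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(HTTP_DEFAULT_CHARSET)


# Placeholder media type; the point is only the charset parameter.
with_charset = charset_of("application/xml; charset=UTF-8")
without_charset = charset_of("application/xml")

print(with_charset)
print(without_charset)
```

A sender that forgets the parameter is, by default, claiming ISO-8859-1 even though the body is UTF-8, which is exactly the mistake that came up in the connectathons.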
Unicode subsets
However, if you don’t support unicode natively - and that is still true of a large number of systems - then the fact that resources are always in UTF-8 presents you with a problem: you have to do something about the unicode issue, even if you are positive that all your trading partners are using pure ASCII. There are still so many systems that don’t support unicode because, even though the platforms support unicode relatively well, supporting it in your application means the entire eco-system - database, UIs, printers, messaging formats, etc. - has to support unicode, and for many vendors sorting this out simply isn’t financially feasible.
What I see in practice is systems that can’t interoperate safely because they thought they were using pure ASCII, but they weren’t. (In fact, it’s not that unusual to see systems that don’t fully operate, let alone interoperate.) So I’d always prefer unicode as the wire format - it makes everyone deal with the issue.
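Checking the "pure ASCII" assumption is cheap. The sketch below uses a hypothetical helper, `non_ascii_positions`, to report where a payload stops being ASCII, and shows that a stray ISO-8859-1 byte is also not valid UTF-8. The payloads are invented examples.

```python
def non_ascii_positions(data: bytes) -> list:
    """Return the offsets of every byte outside the 7-bit ASCII range."""
    return [i for i, b in enumerate(data) if b > 0x7F]


# A payload that really is ASCII, and one that only looks like it:
# "ë" becomes the single byte 0xEB in ISO-8859-1.
clean = "Patient: Zoe".encode("ascii")
dirty = "Patient: Zoë".encode("iso-8859-1")

clean_hits = non_ascii_positions(clean)
dirty_hits = non_ascii_positions(dirty)

# The same bytes are not valid UTF-8, so a UTF-8 receiver rejects them
# rather than silently showing the wrong character.
try:
    dirty.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(clean_hits, dirty_hits, valid_utf8)
```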
So, we have several comments - why require UTF-8? Why not allow at least ISO-8859-1? Or why not allow any round-trip encoding? What if we require all interfaces to “support” UTF-8 in addition to anything else that they also do? Or maybe we require all servers to support UTF-8 at least?
We’ve discussed this in committee several times, and we’re just not sure what to do here. Seen as an entire eco-system - and I do think FHIR interfaces will be highly interconnected - a simple blanket rule of always UTF-8 is obviously much simpler overall. But it imposes an entry cost on many systems - especially the existing data stores, which are generally older systems - and maybe this isn’t a very good idea?
HHS HIT Standards Committee & Character Set
The situation is somewhat complicated by this (a private communication that made its way to me):
The HHS HIT Standards Committee was asked how EHR language display should be certified using standards and the recommendation was ISO 8859-15 aka “Latin 9” which has character support for all the required ISO 639 languages including direct support for the Eastern European languages and transliteration to Latin characters for e.g. cyrillic and mandarin. This EHR certification requirement is anticipated to raise issues for HL7 standards and HL7 implementers particularly for systems with interfaces to certified EHRS.
I’ve got to say, I don’t really understand this. If you’re going to recommend something, why not Unicode? The point is, US EHR vendors (which includes all the multinationals) are going to be forced to change towards whatever this committee recommends. But now, instead of migrating to unicode, which is at least a sensible long term option, they’re going to be spending their money changing from ISO-8859-1 - which is the default for all the systems I’ve ever looked at personally - to ISO-8859-15. I can only see that as a sideways move, and not a good investment on behalf of the end users. And how will that play in other countries, where ISO-8859-15 is not on the list of supported character sets in national standards?
In terms of FHIR and unicode, I’m not exactly sure what the impact of this is. ISO-8859-15 is fully mapped to unicode, so it probably doesn’t really change the basic question - unicode, or something else that makes subset support explicit? But EHR vendors are going to be important adopters of FHIR, so I think this weighs on the decision.
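Both claims (that ISO-8859-15 is fully mapped to unicode, and that moving to it from ISO-8859-1 is a sideways step) can be checked against Python’s built-in codecs. This is a sketch that relies on Python’s codec tables rather than on anything in the committee’s recommendation.

```python
# Every one of the 256 byte values decodes under ISO-8859-15 and
# encodes back to the same byte, i.e. the set is fully mapped to unicode.
all_bytes = bytes(range(256))
latin9_text = all_bytes.decode("iso8859_15")
round_trips = latin9_text.encode("iso8859_15") == all_bytes

# Compare with ISO-8859-1 byte by byte: only a handful of positions differ.
latin1_text = all_bytes.decode("iso8859_1")
differences = {
    hex(b): (latin1_text[b], latin9_text[b])
    for b in range(256)
    if latin1_text[b] != latin9_text[b]
}

print("round trips:", round_trips)
for position, (latin1_char, latin9_char) in differences.items():
    print(position, latin1_char, "->", latin9_char)
```

The differing positions swap a few symbols (for instance, the byte 0xA4 becomes the euro sign), so existing ISO-8859-1 data containing those bytes would be reinterpreted rather than gaining any real unicode coverage.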