Version 2 and character sets and encoding
Jul 14, 2011
I’ve been rewriting my v2 parser and trying to make it fully conformant to the v2 specification with regard to character sets. It’s a tough problem, and several parts of it make it tough.
Finding the character set/encoding
The character set is embedded many bytes into the message content at MSH-18. So you need to read the first 40-100 bytes or so into characters before you know how to turn them into characters... sounds like fun. Actually, it’s a pretty manageable problem, because there’s no need to use characters with value >127 before MSH-18 (note that there’s no need, but it’s possible to use them). Given that the message starts with ‘MSH’, you can tell by inspecting the first 6 bytes whether you have single or double byte encoding, and if it’s double byte encoding, what the endianness is. Note that you can also tell that from a byte order mark (BOM) if there is one. Given this, and provided the sender didn’t send any characters >127 before MSH-18 while using UTF-8, you can reliably find and read MSH-18. Once I’ve read that, I reset the parser and start again with the specified encoding.
Of course, it’s always possible that the character set as specified by the BOM or made clear by inspecting the first 6 bytes differs from what is implied by the value of MSH-18… I ignore MSH-18 if it doesn’t match.
Note that v2 doesn’t say anything about the BOM - I think it should in a future version.
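The first pass described above can be sketched roughly as follows. This is a minimal illustration in Python, not my actual parser: the function names `sniff_encoding` and `read_msh18` are made up for the example, it assumes the message starts at byte 0 (or right after a BOM), and it does not try to recognise UTF-32 without a BOM.

```python
import codecs

def sniff_encoding(data: bytes) -> str:
    """Guess the byte layout of a v2 message from its first few bytes."""
    # A BOM, if present, settles it. Check UTF-32 before UTF-16 because
    # the UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: the message has to start with "MSH", so look at how it is laid out.
    if data[:3] == b"MSH":
        return "ascii"  # single byte, or at least ASCII-compatible so far
    if data[:6] == "MSH".encode("utf-16-le"):
        return "utf-16-le"
    if data[:6] == "MSH".encode("utf-16-be"):
        return "utf-16-be"
    raise ValueError("message does not start with MSH")

def read_msh18(data: bytes) -> str:
    """First pass: return the raw MSH-18 value, or '' if it is absent."""
    # errors="replace" so that bytes >127 before MSH-18 don't abort this pass.
    text = data.decode(sniff_encoding(data), errors="replace")
    header = text.split("\r", 1)[0]
    if len(header) < 4:
        return ""
    fields = header.split(header[3])  # header[3] is the field separator (MSH-1)
    # MSH-1 is the separator itself, so MSH-n sits at index n-1 after the split.
    return fields[17] if len(fields) > 17 else ""
```

The value returned is the raw field, so any repetitions or components in MSH-18 still need to be picked apart before the second pass re-reads the message with the real encoding.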
Understanding the character set/encoding
The second part of the problem is that MSH-18 is sometimes character set and sometimes character encoding (see here for discussion) - the values are an unholy mix of the two. In addition, the list of values matches precisely the list of values in DICOM, and as far as I can tell, no other list at all. Here’s a list of the possible values for MSH-18 (v2.6):
- ASCII - The printable 7-bit ASCII character set.
- 8859/1 - The printable characters from the ISO 8859/1 Character set
- 8859/2 - The printable characters from the ISO 8859/2 Character set
- 8859/3 - The printable characters from the ISO 8859/3 Character set
- 8859/4 - The printable characters from the ISO 8859/4 Character set
- 8859/5 - The printable characters from the ISO 8859/5 Character set
- 8859/6 - The printable characters from the ISO 8859/6 Character set
- 8859/7 - The printable characters from the ISO 8859/7 Character set
- 8859/8 - The printable characters from the ISO 8859/8 Character set
- 8859/9 - The printable characters from the ISO 8859/9 Character set
- 8859/15 - The printable characters from the ISO 8859/15 (Latin-15)
- ISO IR14 - Code for Information Exchange (one byte)(JIS X 0201-1976).
- ISO IR87 - Code for the Japanese Graphic Character set for information interchange (JIS X 0208-1990)
- ISO IR159 - Code of the supplementary Japanese Graphic Character set for information interchange (JIS X 0212-1990).
- GB 18030-2000 - Code for Chinese Character Set (GB 18030-2000)
- KS X 1001 - Code for Korean Character Set (KS X 1001)
- CNS 11643-1992 - Code for Taiwanese Character Set (CNS 11643-1992)
- BIG-5 - Code for Taiwanese Character Set (BIG-5)
- UNICODE - The world wide character standard from ISO/IEC 10646-1-1993
- UNICODE UTF-8 - UCS Transformation Format, 8-bit format
- UNICODE UTF-16 - UCS Transformation Format, 16-bit format
- UNICODE UTF-32 - UCS Transformation Format, 32-bit format
That’s a fun list. The default is ASCII, btw. Now I’m not going to write my own general character encoding engine - who is? I’m going to use the inbuilt functions in Windows to convert everything to Unicode. That means I have to map these values to Windows code pages to pass to the Windows conversion routines. But mapping between these values and the Windows code page values is a problem in itself. Here’s my mapping list.
- ASCII = 20127 or 437
- 8859/1 = 28591 : ISO 8859 : Latin Alphabet 1
- 8859/2 = 28592 : ISO 8859 : Latin Alphabet 2
- 8859/3 = 28593 : ISO 8859 : Latin Alphabet 3
- 8859/4 = 28594 : ISO 8859 : Latin Alphabet 4
- 8859/5 = 28595 : ISO 8859 : Cyrillic
- 8859/6 = 28596 : ISO 8859 : Arabic
- 8859/7 = 28597 : ISO 8859 : Greek
- 8859/8 = 28598 : ISO 8859 : Hebrew
- 8859/9 = 28599 : ISO 8859-9 Turkish
- 8859/15 = 28605 : ISO 8859-15 Latin 9
- ISO IR14 = ??
- ISO IR87 = ??
- ISO IR159 = ??
- GB 18030-2000 = 54936 : GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)
- KS X 1001 = ??
- CNS 11643-1992 = ??
- BIG-5 = 950, ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5)
As you can see, it’s incomplete. I just don’t know enough to map between the HL7/DICOM codes and the Windows code pages. Searching on the internet didn’t quickly resolve them either. All the links I found pointed to either the HL7 or DICOM standards, or copies thereof.
If you know what the mappings are, please let me know, and I’ll update the list.
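In code, the mapping above boils down to a lookup table. Here’s a minimal sketch in Python; the names `HL7_TO_CODEPAGE` and `code_page_for` are made up for the example, the values I couldn’t map are deliberately left out rather than guessed, and 20127 is used for ASCII although 437 is listed as an alternative.

```python
from typing import Optional

# MSH-18 value -> Windows code page identifier, following the list above.
# 28597 is the Windows identifier for ISO 8859-7 (Greek).
HL7_TO_CODEPAGE = {
    "ASCII": 20127,
    "8859/1": 28591,
    "8859/2": 28592,
    "8859/3": 28593,
    "8859/4": 28594,
    "8859/5": 28595,
    "8859/6": 28596,
    "8859/7": 28597,
    "8859/8": 28598,
    "8859/9": 28599,
    "8859/15": 28605,
    "GB 18030-2000": 54936,
    "BIG-5": 950,
}

def code_page_for(msh18: str) -> Optional[int]:
    """Return the code page for an MSH-18 value, or None if it is unmapped."""
    value = msh18.strip()
    if not value:
        # An empty MSH-18 means the default, which is ASCII.
        return HL7_TO_CODEPAGE["ASCII"]
    return HL7_TO_CODEPAGE.get(value)
```

Returning None for the unmapped values (ISO IR14, KS X 1001 and so on) forces the caller to decide what to do with them, instead of silently falling back to some default code page.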
The character set can change
If that’s not enough, the character set is allowed to change mid-message. There are a couple of escape sequences (\C..\ and \M….) that allow the stream to switch character sets mid-stream. This makes for a slow parser because of the way Windows does character conversion: you can’t ask for x characters to be read off the stream, only for x bytes to be read into characters. (How do you tell how many bytes were actually read? Convert the characters back to bytes? I suspect that this isn’t deterministic, and that there are some valid Unicode sequences that some Windows applications will fail to read, but I don’t know how to test that.) So you have to keep reading a byte or two at a time until you get a character back, because you can’t let an encoder read ahead on the stream - you might have to switch encoders.
Having said that, I’ve never seen these escape sequences used in the wild, and it seems like a sensationally dumb idea to me (however, I’ll make a post about Unicode and the Japanese in the future).
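To illustrate the byte-at-a-time reading described above, here’s a rough sketch using Python’s incremental decoders rather than the Windows routines I actually use. The function name `read_chars` is made up for the example, and it only shows the reading loop, not the handling of the escape sequences themselves.

```python
import codecs
import io

def read_chars(stream, encoding: str, count: int) -> str:
    """Read up to `count` characters, consuming no more bytes than needed."""
    decoder = codecs.getincrementaldecoder(encoding)()
    out = ""
    while len(out) < count:
        byte = stream.read(1)
        if not byte:
            break  # end of stream
        # Returns '' until enough bytes have arrived to complete a character.
        out += decoder.decode(byte)
    return out

# One Latin-1 character followed by one UTF-8 character on the same stream.
stream = io.BytesIO("é".encode("latin-1") + "日".encode("utf-8"))
first = read_chars(stream, "latin-1", 1)   # consumes exactly 1 byte
second = read_chars(stream, "utf-8", 1)    # consumes the next 3 bytes
```

Because nothing is read ahead, the stream is positioned exactly at the next character when the decoder has to be swapped, which is the whole point, and also why it’s slow.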
If I have any Japanese readers, how does character encoding in v2 actually work in Japan?
Mostly, implementers get this wrong
This stuff is sufficiently poorly understood that most implementers assume they’re working in ANSI, use characters from their local code page, put them in and claim they’re using something else. The Windows character conversion routines fail in some of these cases. I don’t know what to do about that.
There. That’s enough. We really really need to retire v2. Its time has passed.