Question: What content can ED contain?

Nov 20, 2011

One of the most frequently asked questions is what content the V3 data type ED (“Encapsulated Data”) can contain. There’s a simple answer. ED can contain the following types of data:

* plain text
* base64 encoded data
* XML - a CDA document, a V3 message, or any other kind of XML
* CDA structured narrative
* a reference to a URL from which the data can be obtained

But when we start looking at the details, it’s not quite so simple, which is why it’s such a common question.

## ED Abstract Definition

The abstract definition (R2) is as follows:

So an ED has data, which is a list of boolean values (bits). The “abstract” data type definition is an in-principle definition of the meaning of the data types, without considering any implementation details. And this is the place where its abstractness is most evident: binary data is considered to be a list of bits. I’ve never handled data like that, and I doubt you have either. (OK, if you’re doing Huffman… ) Note: the R1 definition differs a little, in that ED specializes BIN, rather than having a data property, but this is syntactic sugar: the meaning is no different.

Anyhow, the only thing we need to learn from the abstract spec in this regard is that the data is a series of bytes, and the form in which the data is provided is neither here nor there - we just break it down to a series of bytes. In principle.

The abstract data types specification also notes that the ED can carry a reference to the data instead of the data itself. In fact, the abstract specification only introduces the reference property in order to make some rules about the reference - principally that the reference can never be used for any other data. It doesn’t really matter whether the data is provided as binary data directly, or whether it’s provided by a reference to some URL - it’s just a stream of bytes.

An instance can provide the data directly, or it can provide a reference to the data, or it can provide both. If it provides both, they must be the same - so there’s not really a lot of utility in providing both. The normal use for a reference is to provide an image, a thumbnail of the image, and a reference to where the whole (big fat large) image can be retrieved from if the user desires.

## XML Representation

When it comes to the XML representation, we can, as the abstract spec describes, provide the data directly, a reference, or both. If we are going to provide a reference, then the XML looks like this:

<xx mediaType="image/png">

 <reference value="http://temp.myurl.com/images?id=23..."/>

</xx>

xx is the name of the element. We don't know what that is - ED is a type, and the name of the element is assigned in the context in which it is used (this will become relevant later). The ED contains a single element "reference" in the standard v3 XML namespace, which carries the reference. Remember that a reference may always be provided, whether the data is provided or not.

In the XML representation, there are three different ways to represent the data directly in the instance. The first way is as simple plain text:

````
This is some plain text
````

The plain text is the simple text content of the element itself. This is pretty easy, but only suitable when

1. the character encoding of the text is the same as the character encoding of the XML data (or you can make it so) (this is usually the case)
2. you don't really care about whitespace at the start and the end of the text - or you are sure there isn't any (not an unusual condition)
3. the plain text won't need lots of escaping for non-XML-savvy characters (i.e. it isn't binary data like a PDF file)

Note that according to the data types XML specification, if you have thumbnail or reference elements, they come first before the text - though there's probably no meaningful use of a thumbnail or reference for plain text anyway.

Also note that the plain text may contain special characters such as tabs, line feeds etc., but this is usually a bad idea - implementors generally do not handle these characters well or consistently, whether represented directly in the XML or as character entity references. If you have to exchange these characters, use base64 - this encourages the use of non-XML tools for handling the data that contains them.

If the content doesn't meet the conditions above - which usually means it's a PDF, a Word document, an image or a video, but anything else is possible and allowed - then the usual way to include the data is as base64 encoded data.

````
MNYD83jmMdomSJUEdmde9j44zmMir....
````

You can always tell when the data is base64 encoded this way, because the representation="B64" attribute must be present.

Note: Base64 encoding is not the most dense representation of the data. You don't have to do it this way. You can embed the XML that contains the ED in a MIME package, add the binary content as a MIME section, and put a reference in the ED instead. Or you can use DIME (shudder!). But whatever you do requires that the recipient expects to receive this; you can't be sure about that, whereas you can be sure that they can accept base64 encoded data. And base64 encoded data compresses down to about the same size as the same compressed binary.

There's a third option for representing the data: if it's XML, HTML, or SGML, and it's **well formed** and in the same character encoding as the document, you can stick it straight in as XML. In particular, you could put another CDA document, or a v3 message. Here's an example:

[XML example not preserved]

The specification says that in this case, the XML fragment must be well formed, and that it must be in a single element in the ED. So you couldn't have this, which would be just confusing:

[XML example not preserved]

There is, however, a special case, which is CDA structured narrative (it also appears in SPL and will be used more widely, I think). Here's an example:

````
Henry Levin, the 7^th is a 67 year old male referred for further asthma management. Onset of asthma in his twenties teens. He was hospitalized twice last year, and already twice this year. He has not been able to be weaned off steroids for the past several months.
````

So while in general an ED carrying XML contains a single well formed element, in the case of CDA it can carry a mix of text and other elements as described by the CDA structured narrative schema (and the structured narrative cannot have a reference or a thumbnail).
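To make the inline forms a little more concrete, here's a minimal Python sketch of how a reader might classify an R1 ED element and get at its content. It's an illustration under assumptions, not a reference implementation: the function names are made up for this example, it assumes the element sits in the v3 namespace (`urn:hl7-org:v3`), and it deliberately doesn't handle CDA structured narrative or resolve references.

```python
import base64
import xml.etree.ElementTree as ET

V3_NS = "urn:hl7-org:v3"  # the standard v3 XML namespace
REFERENCE = f"{{{V3_NS}}}reference"
THUMBNAIL = f"{{{V3_NS}}}thumbnail"


def direct_text(ed):
    """Text owned by the ED element itself, ignoring text inside child elements.

    A thumbnail is itself an ED and may carry its own base64 text, so we
    only collect the element's own text and the tails of its children.
    """
    return (ed.text or "") + "".join(child.tail or "" for child in ed)


def read_r1_ed(xml_text):
    """Return (kind, content) for a single R1 ED element (hypothetical helper)."""
    ed = ET.fromstring(xml_text)
    text = direct_text(ed)
    # Anything that is not reference/thumbnail is candidate XML content.
    payload = [c for c in ed if c.tag not in (REFERENCE, THUMBNAIL)]

    if ed.get("representation") == "B64":
        # Drop whitespace that XML formatting may have introduced.
        return "base64", base64.b64decode("".join(text.split()))
    if payload:
        if len(payload) > 1 or text.strip():
            raise ValueError("expected a single well formed child element")
        return "xml", ET.tostring(payload[0])
    if text.strip():
        return "text", text.encode("utf-8")
    reference = ed.find(REFERENCE)
    if reference is not None:
        return "reference", reference.get("value")
    return "empty", None
```

For instance, `read_r1_ed('<xx xmlns="urn:hl7-org:v3">This is some plain text</xx>')` gives back the kind `"text"` and the UTF-8 bytes of the text. Note how much of the logic is about working out which form you've been given.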
Irrespective of how the data is provided - as plain text, as base64 encoded something, as a reference to an attachment or some other source, or as XML - it can be stripped down to a plain old sequence of bytes. In principle. In practice, due to XML handling techniques, character set and character encoding issues, and reference resolution, it's not always so easy to do this, and it's not really required very often. For instance, the theoretical definition of equality says that you derive the sequence of bytes for two EDs, and compare these, but it's extremely rare to compare two ED values, except in the case of plain text data in names.

### Media Type

The media type (or "mime type") of the content **must** be known and stated in the instance. It has a default value: text/plain. If the media type is something different, you have to say so - even if you're providing the data as a reference.

The only exception to this rule is in CDA structured narrative, where the media type is fixed and defaulted to "text/x-hl7-text+xml". One interesting result of the way the structured narrative is defined is that you can't use it as is in a general ED; you can only use it directly in a CDA section, or wherever else the applicable specification explicitly allows its use. If you want to use the structured narrative in a normal ED (e.g. an Act.text in a v3 message), you have to push it down into a single child element and fill out the mediaType:

````
Henry Levin, the 7^th is a 67 year old male referred for further asthma management. Onset of asthma in his twenties teens. He was hospitalized twice last year, and already twice this year. He has not been able to be weaned off steroids for the past several months.
````

Note that the most likely name for xx in this context is actually "text". The name of the inner element is arbitrary and not specified anywhere, but "text" seems like the most reasonable name to use.

## R1 ED Schema
The R1 ED schema content model is:

[schema fragment not preserved]

and BIN is:

[schema fragment not preserved]

The really fun thing about these schema fragments is, where's the data we've been talking about? There's the "representation" attribute, and a mediaType attribute with the default value, but the data is not actually described...

Well, no, it's not described. And it's a great source of confusion for newbie implementors, particularly those not well versed in XML and schema. In addition to the reference and thumbnail elements, the element for the complexType ED can contain text, which may or may not be base64 encoded. There's no way to explicitly describe this text: all we can do is say that the type has mixed content (mixed="true"). Unfortunately, simply indicating mixed type doesn't convey what the intent is: that you can have text **after** the reference and thumbnail elements - the schema is basically useless here.

Note: I think this is a major limitation of schema; there should be a text type, so that we can be specific about the contents of mixed content, instead of simply yielding control like that. However, that's what we have to deal with.

If that problem isn't bad enough, the schema makes no mention at all of the other things that are allowed in the ED. We can have any additional single element - instead of text - and that element can have any name in a namespace other than the v3 namespace, **or** it can be in the v3 namespace and be a valid v3 instance, with the appropriate name. Schema can't describe this content model completely. For some reason lost in the mists of time, HL7 (we, I) didn't describe the content as well as we could have, and simply shipped a schema that doesn't describe the XML feature at all. This causes real problems when it comes to conformance, because the schema is **wrong**: it wrongly rejects valid instances.
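Since schema can't express these rules, they tend to end up in hand-written code. Here's a rough Python sketch of the checks described above: reference and thumbnail come first, and after them you get either text or a single child element, not both. The function names and messages are invented for this example, it assumes the v3 namespace is `urn:hl7-org:v3`, and it ignores CDA structured narrative and doesn't check that a v3-namespace child is a valid v3 instance.

```python
import xml.etree.ElementTree as ET

V3_NS = "urn:hl7-org:v3"
LEADING = {f"{{{V3_NS}}}reference", f"{{{V3_NS}}}thumbnail"}


def content_kinds(ed):
    """List the kinds of content inside an ED element, in document order."""
    kinds = []
    if ed.text and ed.text.strip():
        kinds.append("text")
    for child in ed:
        kinds.append("leading" if child.tag in LEADING else "element")
        # Text that follows a child element lives in that child's tail.
        if child.tail and child.tail.strip():
            kinds.append("text")
    return kinds


def check_r1_ed(xml_text):
    """Return a list of content model problems (empty if none were found)."""
    body = content_kinds(ET.fromstring(xml_text))
    problems = []
    # Skip over the reference/thumbnail elements at the front.
    while body and body[0] == "leading":
        body = body[1:]
    if "leading" in body:
        problems.append("reference/thumbnail must come before the data")
    if "text" in body and "element" in body:
        problems.append("text and a child element cannot be mixed")
    if body.count("element") > 1:
        problems.append("only a single child element is allowed")
    return problems
```

An ED with a reference followed by text passes with an empty list; put the text before the reference and the first problem is reported.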
Here's an improved schema (courtesy of [Keith Boone](http://motorcycleguy.blogspot.com/)):

[schema fragment not preserved]

This one allows the additional element, though only in another namespace. Keith defined this schema to allow the incorporation of elements from the v3 namespace:

[schema fragment not preserved]

````
<xs:element name="foo" substitutionGroup="abstractInteraction" type="foo"/>
````

This schema needs to be hand coded to allow whatever v3 contents are appropriate.

## Changes in R2 (ISO 21090)

This is a considerable pain point for implementers - and by far the most common FAQ for the editors of the specification - so after much debate, we elected to change the XML form in Release 2 (which is ISO 21090). Briefly, the changes are:

* ED is no longer a mixed content type
* plain text is moved into an attribute "value" instead of being represented as text in the element
* base64 encoded content is moved into a "data" element, which is explicitly assigned a type of base64Binary in the schema
* XML content is moved into an "xml" element. The xml element contains the same single element that the ED would have contained previously
* the Structured Narrative type is unchanged
* You can only have one of (value attribute | data element | xml element)

Though there is no functional change, we have drawn apart the three kinds of data; this allows the content model to be properly described in schema, and for much simpler parsers to be written that do not have to indulge in speculative logic to read all the valid contents of an ED data type (say, when using SAX). In addition, the ED definition makes the three forms of data explicitly clear:

[schema fragment not preserved]

## Compression and Compressed Data

One last issue to cover about ED. In the ED, you can indicate that the data is compressed using deflate, gzip, zlib, or compress. If you provide the data in the instance as plain text or XML, then you cannot use the compression attribute - you can't compress that kind of data.
If you provide the data as base64, you can indicate that the data has been compressed using one of these methods. We need to carefully differentiate between "has been compressed" and "is compressed". The problem is that some data is inherently compressed using one of these methods in its "native form". In that case, should you mark it as compressed? What if you compressed it again - what does that mean now? The problem is worsened by the fact that some transfer protocols - such as http - provide their own internal compression. What should you do if the data is provided as an http: reference, and the web server that serves the request will automatically use gzip compression on the answer?

After debate, the answer the HL7 committee agreed to is this:

* Any compression that is defined in a reference protocol - such as http: - is not described by the ED compression attribute
* Whether the data is provided in-line or as a reference, compression is only specified if (and must be specified if) decompression using the specified algorithm is required to obtain the specified mediaType.

So:

* If the mediaType is application/gzip, and the content is gzipped in the appropriate form, then you don't say that it's compressed.
* If the mediaType is text/html, and the content is provided as a reference, and is still gzipped after resolving any compression specified by the web server in its response, then you do say it's compressed with gzip.

Note that a smart web server might see or know that the content of a gzipped file is html, and send the content as a mediaType text/html and gzipped by protocol - so just because the web server content is gzipped doesn't mean you should assume that the ED reference should be marked as gzip. This extra complexity is the price of redundancy in protocols.
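For in-line base64 data, the rule can be sketched in a few lines of Python: decode the base64, and only decompress if the ED says compression was applied on top of the stated media type. This is a sketch under assumptions rather than anything normative: the function name is made up, and I'm assuming the compression attribute uses the HL7 codes DF (deflate), GZ (gzip), ZL (zlib) and Z (compress). The Python standard library has no decoder for compress, so that case is left unimplemented here.

```python
import base64
import gzip
import zlib


def ed_content_bytes(b64_text, compression=None):
    """Return the bytes in the stated media type for in-line base64 ED data.

    `compression` is the value of the ED compression attribute, or None
    if the attribute is absent (hypothetical helper for illustration).
    """
    raw = base64.b64decode(b64_text)
    if compression is None:
        # No compression declared: the bytes already are the media type,
        # even if that media type is itself a compressed format.
        return raw
    if compression == "GZ":
        return gzip.decompress(raw)
    if compression == "ZL":
        return zlib.decompress(raw)
    if compression == "DF":
        # Raw deflate stream with no zlib header or checksum.
        return zlib.decompress(raw, -zlib.MAX_WBITS)
    raise NotImplementedError(f"unsupported compression code: {compression}")
```

So an ED with mediaType text/html and compression GZ comes back as HTML bytes, while an ED with mediaType application/gzip and no compression attribute comes back still gzipped, which is exactly what its media type promises.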