Testing FHIR Data types for equality

Mar 16, 2014

One of the more difficult parts of building an object library is deciding what to do about equality checking. There are several common operations where you want to do equality checking:

Are objects A and B the same instance?
Does object A have exactly the same values as object B?
Do objects A and B refer to the same thing?
I am merging lists in an update - is A in a List L of objects?
I am keeping a map (dictionary) - is A in the map?
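
To make these distinctions concrete, here is a minimal sketch in Python (chosen purely for illustration; the `Ident` class is hypothetical and not part of any reference implementation). It shows that instance identity, value equality, list membership, and map lookup are separate questions that all depend on how equality is defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # eq (default) plus frozen generates __eq__ and __hash__
class Ident:
    system: str
    value: str


a = Ident("urn:example:mrn", "12345")
b = Ident("urn:example:mrn", "12345")

same_instance = a is b        # identity: a and b are two separate objects
same_values = a == b          # value equality via the generated __eq__
in_list = a in [b]            # list membership relies on ==
in_map = a in {b: "record"}   # dict lookup relies on __hash__, then ==
```

Without a value-based `__eq__` and `__hash__`, the last three checks would fall back to instance identity.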

Even within most of these choices, there are more subtle issues, such as:

Do you just want to check the simple properties of this object, or do you want to check all the properties of all the objects it contains (shallow vs deep)?
Do you have to worry about circularity if you’re doing a deep comparison?
Does order matter in lists?
Do empty properties mean match or no match?

There’s no simple one-size-fits-all answer, so we don’t override equals() methods etc. in the reference implementations.

However, there’s no doubt that for anything but the primitive data types, some properties are identifying and others are descriptive, and you’d treat them differently when testing whether two values refer to the same thing. Here’s an analysis of the different data types:

Primitive Types

boolean, integer, decimal, base64Binary, string, date: value = identity, and the comparison is simple
uri: in principle, the comparison is simple, but some URIs can carry access properties (e.g. FTP username/password) or descriptive labels (tel:), and there may be a need to canonicalize the URI (e.g. http parameter order only matters within a set of parameters with the same name)
dateTime: this is complicated by the optional presence of a timezone. Obviously, if two date times are the same, and have the same timezone, then they are the same. If they have different timezones, then they might still be considered the same if they refer to the same instant when converted to UTC (they probably should be considered the same, but it does depend on context). If there are no timezones, then the correct action depends on what can be assumed about timezones based on the local context of use.
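
The dateTime rules can be sketched roughly as follows. This is only an illustration in Python: the function name `same_instant` is made up, and returning `None` for "unsure" is an assumption of the sketch, since what to do without timezones depends on local context.

```python
from datetime import datetime, timezone
from typing import Optional


def same_instant(a: datetime, b: datetime) -> Optional[bool]:
    """True/False when both values carry a timezone, None when unsure."""
    a_has_tz = a.utcoffset() is not None
    b_has_tz = b.utcoffset() is not None
    if a_has_tz and b_has_tz:
        # Convert both to UTC so that different offsets for the
        # same instant compare as equal.
        return a.astimezone(timezone.utc) == b.astimezone(timezone.utc)
    # At least one timezone is missing: the answer depends on what can
    # be assumed locally, so this sketch declines to decide.
    return None


sydney = datetime.fromisoformat("2014-03-16T10:00:00+10:00")
utc = datetime.fromisoformat("2014-03-16T00:00:00+00:00")
local = datetime.fromisoformat("2014-03-16T10:00:00")
```

Here `same_instant(sydney, utc)` is true because both describe the same instant, while `same_instant(sydney, local)` is undecided.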

Attachment

The url and data properties are identifying. If the bits and/or the address are the same, then this must refer to the same record. The properties contentType, language, size, and hash are derived from the content, though they may or may not be provided, and you’d ignore them when comparing the objects. The title property is descriptive, and should also be ignored, though it might vary between different instances.

Coding

The system and code properties are identifying. Both must be present, or else you can’t tell whether they are the same (which shows that comparison for sameness is not a binary outcome: you might be sure that they are, sure that they’re not, or unsure). Whether version counts towards the comparison depends on the semantics of the underlying code system, and could be very complicated indeed.

The display property might vary (if the system defines multiple displays, like SNOMED CT), and shouldn’t count toward the comparison. The primary property is about how a code was used, and doesn’t count toward whether two codings refer to the same content, but may count towards comparison of items that contain a Coding (see next). The same logic applies to the valueSet.
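
A rough Python sketch of this three-way outcome, under some assumptions: the `Coding` class below is a simplified stand-in rather than a reference implementation class, the field holding the code itself is named `code` here, and version is ignored entirely even though a real code system may need it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Coding:
    system: Optional[str] = None
    code: Optional[str] = None
    display: Optional[str] = None   # descriptive, not compared
    primary: Optional[bool] = None  # about usage, not compared


def same_coding(a: Coding, b: Coding) -> Optional[bool]:
    """True = same, False = different, None = cannot tell."""
    if None in (a.system, a.code, b.system, b.code):
        # An identifying property is missing on one side.
        return None
    return a.system == b.system and a.code == b.code


mi_short = Coding("http://snomed.info/sct", "22298006", display="MI")
mi_long = Coding("http://snomed.info/sct", "22298006", display="Myocardial infarction")
no_system = Coding(code="22298006")
```

With these values, `same_coding(mi_short, mi_long)` is true despite the different displays, and `same_coding(mi_short, no_system)` is `None`.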

CodeableConcept

This is the most difficult test: the rules are highly contextual, and also dependent on the underlying code system. Generally, you could assume that if two CodeableConcepts have a matching coding (e.g. both of them have multiple codings, and at least one of them is the same) then they are the same concept. However, if you were merging a list of CodeableConcepts, then you’d probably want to check that all the codings match (order doesn’t matter), and depending on the use case, that primary was the same too.

If two CodeableConcepts only have text, then they would be the same if the text matches (ignoring whitespace, and case, and possibly some grammatical characters).
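
The two levels of strictness, and the text fallback, might look like this in Python. The sketch assumes each coding has already been reduced to a (system, code) pair, and it picks one possible reading of "ignoring whitespace, and case, and possibly some grammatical characters"; all function names are illustrative.

```python
import re
from typing import Iterable, Tuple

CodingKey = Tuple[str, str]  # (system, code)


def share_a_coding(a: Iterable[CodingKey], b: Iterable[CodingKey]) -> bool:
    """Loose test: at least one coding appears in both concepts."""
    return bool(set(a) & set(b))


def all_codings_match(a: Iterable[CodingKey], b: Iterable[CodingKey]) -> bool:
    """Strict test for merging: same codings, order ignored."""
    return set(a) == set(b)


def normalize_text(text: str) -> str:
    # Lower-case, drop punctuation, and collapse runs of whitespace.
    without_punctuation = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(without_punctuation.split())


def same_text(a: str, b: str) -> bool:
    return normalize_text(a) == normalize_text(b)


x = [("urn:example:a", "1"), ("urn:example:b", "2")]
y = [("urn:example:b", "2"), ("urn:example:c", "3")]
```

Here `x` and `y` pass the loose test but fail the strict one, which is the difference between "same concept" and "safe to merge".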

Quantity

Logically, two quantity values are the same if they have the same value and units. However, there are several issues around this:

What to do if the precisions of the values vary (e.g. 1.3 vs 1.299) depends on the context, and there’s no general answer
If a code for the units is provided, then you compare based on that. Subsumption or equivalence testing is required (rather than straight string matching)
If there’s no code for the units, then you can compare the human readable unit, but this is fragile (ug vs µg)
For extra points, you can do comparison based on canonical units if a ucum code is provided
If a comparator is present, it can’t be ignored. Is <4 the same as <3? There’s a set of cases here where you might be unsure whether they are the same
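
As a hedged sketch of how these rules could combine, the Python below uses a simplified, hypothetical `Quantity` class. It compares unit codes by plain string equality (so it does not do the subsumption or equivalence testing called for above), treats the caller-supplied `tolerance` as the context-dependent answer to the precision question, and returns `None` whenever it cannot be sure.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Quantity:
    value: Decimal
    units: Optional[str] = None       # human-readable unit
    code: Optional[str] = None        # coded unit, e.g. a UCUM code
    comparator: Optional[str] = None  # e.g. "<" or ">"


def normalize_unit(text: str) -> str:
    # Fold the micro sign and the Greek letter mu to "u" (ug vs µg).
    return text.strip().replace("\u00b5", "u").replace("\u03bc", "u")


def units_confirmed_equal(a: Quantity, b: Quantity) -> bool:
    if a.code is not None and b.code is not None:
        return a.code == b.code
    if a.units is not None and b.units is not None:
        return normalize_unit(a.units) == normalize_unit(b.units)
    return False


def same_quantity(a: Quantity, b: Quantity,
                  tolerance: Decimal = Decimal("0")) -> Optional[bool]:
    if not units_confirmed_equal(a, b):
        # Different strings may still be equivalent units.
        return None
    if a.comparator or b.comparator:
        identical = a.comparator == b.comparator and a.value == b.value
        return True if identical else None  # e.g. <4 vs <3 stays unsure
    return abs(a.value - b.value) <= tolerance
```

For example, 1.3 mg and 1.299 mg compare as different with the default tolerance and as the same with a tolerance of 0.01.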

Range

At last - something simple! Two ranges are the same if the low and high properties are the same (including if they are absent).

Ratio

This is also simple: two ratios are the same if the numerator and denominator properties are the same (including if they are absent). Note that 1:2 is not the same as 2:4 (if that were logically true, then a straight quantity should have been used instead).
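
A small Python sketch of that rule, assuming for simplicity that the two parts of a ratio are plain decimal numbers (the `Ratio` class and its field names are illustrative). The point is that the parts are compared as given, with no reduction to lowest terms.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Ratio:
    numerator: Optional[Decimal] = None
    denominator: Optional[Decimal] = None


def same_ratio(a: Ratio, b: Ratio) -> bool:
    # Compare part by part; two absent parts (None) count as equal.
    return a.numerator == b.numerator and a.denominator == b.denominator


one_to_two = Ratio(Decimal(1), Decimal(2))
two_to_four = Ratio(Decimal(2), Decimal(4))
```

So `same_ratio(one_to_two, two_to_four)` is false even though the two ratios are numerically equal.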

Period

In principle, this is also simple: two periods are the same if the start and end properties are the same (including if they are absent). However, the issues around the precision of the dates are tricky, as is comparing timezones (see the notes on dateTime above).

SampledData

Two SampledData values would be the same if their origin, period, dimensions and data match, after adjusting for the factor. lowerLimit and upperLimit probably matter too.

There’s not a lot of call for matching SampledData values; whether they match would be driven by the surrounding data wherever they are used.

Identifier

The system and value properties are identifying. Both must be present, or else you can’t tell whether they are the same.

The use, label, period and assigner properties are all descriptive, and don’t count toward comparison of whether this is the same identifier.

Human Name

Given the huge variation in how names are used, and how systems track how names are used, there’s no one-size-fits-all method. However, the following is a good base to start from:

The family name list must match (order dependent)
The given name list must match (order dependent) though you might terminate the match checking when the shorter list ends (or if the short list only has one name)
Prefix and Suffix usually don’t count
Period doesn’t matter, nor does use
text should be ignored, unless there’s no name parts, in which case you match on that. For advanced points, you can compare text and parts, but that’s hard and potentially risky
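
The starting rules above could be sketched like this in Python. The `HumanName` class is a simplified stand-in, matching is exact (no case folding), and the handling of an empty given name list is an assumption of the sketch rather than something the rules spell out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HumanName:
    family: List[str] = field(default_factory=list)
    given: List[str] = field(default_factory=list)
    text: Optional[str] = None
    # prefix, suffix, period and use are left out: they are not compared.


def given_names_match(a: List[str], b: List[str]) -> bool:
    if not a or not b:
        # Assumption: an empty list only matches another empty list.
        return not a and not b
    # Order dependent; zip stops when the shorter list ends.
    return all(x == y for x, y in zip(a, b))


def same_name(a: HumanName, b: HumanName) -> bool:
    has_parts = bool(a.family or a.given or b.family or b.given)
    if not has_parts:
        # No name parts at all: fall back to the text.
        return a.text is not None and a.text == b.text
    return a.family == b.family and given_names_match(a.given, b.given)


full = HumanName(family=["Smith"], given=["John", "Paul"])
short = HumanName(family=["Smith"], given=["John"])
swapped = HumanName(family=["Smith"], given=["Paul", "John"])
```

Here `full` matches `short` because checking stops when the shorter given list ends, but `full` does not match `swapped` because order matters.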

Address

Address is similar to name, but the order of the parts doesn’t matter:

The line list must match. Order is not important
The city, state, and zip must match. In the absence of a city and zip, there’s no match. State doesn’t matter if the country doesn’t have states
There may be a default country, based on either system location, or patient information
Period doesn’t matter, nor does use
text should be ignored, unless there are no address parts, in which case you match on that. For advanced points, you can compare text and parts, but that’s hard and potentially risky
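
A minimal Python sketch of the address rules, with several assumptions labelled: the `Address` class is simplified, `default_country` and `stateless_countries` are hypothetical parameters supplied by the caller, countries are required to match once the default is applied, and the text fallback is omitted.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass
class Address:
    line: List[str] = field(default_factory=list)
    city: Optional[str] = None
    state: Optional[str] = None
    zip: Optional[str] = None
    country: Optional[str] = None
    # period and use are left out: they are not compared.


def same_address(a: Address, b: Address,
                 default_country: Optional[str] = None,
                 stateless_countries: FrozenSet[str] = frozenset()) -> bool:
    country_a = a.country or default_country
    country_b = b.country or default_country
    if country_a != country_b:
        return False  # assumption: different countries never match
    for addr in (a, b):
        if addr.city is None and addr.zip is None:
            return False  # neither city nor zip: no match
    if a.city != b.city or a.zip != b.zip:
        return False
    if country_a not in stateless_countries and a.state != b.state:
        return False
    # Lines must match, but their order is not important.
    return sorted(a.line) == sorted(b.line)


home = Address(["Unit 1", "10 Main St"], "Springfield", "ST", "12345")
same_home = Address(["10 Main St", "Unit 1"], "Springfield", "ST", "12345", "XX")
```

With `default_country="XX"`, `home` and `same_home` match even though the lines are in a different order and only one of them states the country.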

Contact

The system and value properties are identifying. Both must be present, or else you can’t tell whether they are the same (well, in the absence of system, you might be able to).

The use and period properties are descriptive, and don’t count toward comparison of whether this is the same contact.

Schedule

Technically, two schedules are the same if they describe the same set of times, but since they could describe these differently, that’s a pretty complicated test. In practice, there’s not a lot of call for sophistication here, and it suffices to simply do a comparison of the sub-properties. Notionally, the order of events would not matter, but these would usually be in order anyway.

Resources

The rules for each resource differ. If there’s enough interest, I might do this for some of the resources (comments please).

Question

Is this kind of knowledge worth embedding in the resource definitions as meta-data in future versions?