People have been telling me for a long time that I ought to start a blog to share my thoughts and ideas. I've finally caved and fired one up. This inaugural post happens to be about EDI and XML, which are particular formats for representing data, but I expect to write articles on a variety of different topics. Most of the posts will probably be related to computer programming. I may also resurrect some write-ups I have done in other forums and repost them here. If you have any suggestions about topics, or any other thoughts or comments, by all means contact me. :-)
I've been working on integrating with a large partner recently. We want to be able to automate billing, because with the expected volume, manually reconciling what they actually bill us for with what we expect to be billed for would be infeasible. In this day and age, one might expect the billing information to be supplied in the form of XML documents, or perhaps (effectively the same thing) complex types in a SOAP response, but one would be wrong.
It turns out, long before XML was created, back when computers were barely capable of actually storing data, let alone operating on it, it was nevertheless noticed that having the computer store the data instead of storing hard copy made data management much easier. For instance, eliminating data more than, say, 7 years old was one command to the system that took maybe 20 seconds to type, instead of an exhaustive search of poorly-organized filing cabinets that could occupy the better part of a week. In order to store data, though, it needs to have a format.
We humans are pretty good at recognizing information. We can look at two post-it notes:
Our brains understand intuitively that these are the same kind of note, but they have slightly different information, and it's laid out differently. A computer could be made to understand both types of notes, but it would have to be given a complete description of each possible layout. A third note with another layout would need more programming. Computers don't have the pattern recognition that our brains do, so working with data in multiple formats requires a lot of programming work, a different path through the code -- a different plan -- for pulling relevant bits of information from each.
Because of this characteristic, when people are organizing data for computers to understand, they tend to do three things:
- The data is organized in a format that takes as little programming code to understand as possible. This isn't actually strictly true, because there are other factors in the trade-off. Sometimes a format more complicated to parse is preferred for other reasons, such as making it easy not just for computers but for human eyes as well. However, the principle is there: an overcomplicated format is no fun to work with, because the code needed is difficult to write and understand, and code that is difficult to write and understand is likely to have lots of bugs in it.
- More importantly, the format is devised in such a way that a given piece of information in it can only be understood in one way. There is no ambiguity, not even ambiguity that can be resolved with context. There is always exactly one right way to read the information.
- Most importantly, once a given format has been established, all the data is presented in exactly the same format. There are no exceptions, no loose rules. The format says it must be done in this way, so it is.
There are other concerns as well, but these are the fundamental essentials of encoding information for computers to work with.
So, back in the 1980s, people were looking for a way to use computers for business information. Nowadays, computers tend to fall into one of two baskets, PC or Mac, and while the way they work behind the scenes is very different, on the surface, things have converged so that using one isn't significantly different from using the other, and by and large, computer programs for one can talk with computer programs for the other. For instance, images tend to be JPEG, PNG or SVG files, music tends to be MP3 files, and almost everybody can read PDF and DOC files. We've achieved a standardization in data formats that makes it much easier for people to work together. It's easy to forget that not much more than ten years ago, the files common on each type of system were not interchangeable. Before the proliferation of MP3s, people shared short audio clips. On Windows computers, they shared WAV files, while on Mac computers, they shared AIFF files. If a Windows user gave his Mac-using friend a floppy disk with a WAV file on it, then setting aside the fact that the Macintosh computer couldn't even read the floppy disk, assuming that he could get to the file, it would take special software on the Macintosh to be able to understand the WAV file. Similarly, very little PC software existed that could understand AIFF files.
This same discrepancy in formats existed in business data in the early 1980s, but the problem was much more severe. Business computers weren't either PCs or Macs. Every business had their own solution, generally with their own custom software with its own custom data format. It was virtually impossible to share data, short of printing it out in human-readable form on one system, and then typing it back in on the other. Something had to be done.
Various industry consortia got together and decided that they needed to settle upon a way of representing data. They called their system Electronic Data Interchange, or EDI. They decided that virtually any document could be divided up into logical pieces, and they called these pieces segments. The defining characteristic of a segment is that it has a fixed number of pieces of information, or data elements, and each piece is a single unit. For instance, a line item on an invoice has a description, a unit price and a quantity. No line item will have two unit prices or five quantities. However, an invoice may have multiple line items. So, a line item is a segment, and the description, unit price and quantity are data elements. A report on changing stock prices over time would put each price into a different segment, because the length of the time period, and thus the number of prices, is not fixed.
With EDI, they decided that a segment starts out with a few characters that identify the segment's meaning, and then consists of a series of data elements. The computer tells where one data element ends and the next begins by means of a special character selected to be a separator. Similarly, the computer tells where one segment ends and the next one starts by means of a special character selected to be a segment terminator. A typical segment looks like this:
BIG*20121004*4207359598**ACW/JUU1287005005***DI~
This segment is a Beginning of Invoice Group. The first data element is the date, the second is a unique ID assigned by the sender's system, the third one is empty, the fourth one is a purchase order number, data elements 5 and 6 are empty, and the final element, "DI", indicates that this is a Debit Invoice, as opposed to "CR", a Credit Memo, or any number of other codes.
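To make the splitting concrete, here is a minimal Python sketch of pulling that segment apart. It assumes '*' as the data element separator and '~' as the segment terminator, as in the example above; the function name parse_segment is my own invention for illustration, not part of any EDI library.

```python
def parse_segment(raw, element_sep="*", segment_term="~"):
    """Split one raw EDI segment into its identifier and its data elements."""
    body = raw.strip()
    # Drop the trailing segment terminator, if present.
    if body.endswith(segment_term):
        body = body[: -len(segment_term)]
    # The first piece identifies the segment; the rest are data elements.
    segment_id, *elements = body.split(element_sep)
    return segment_id, elements


segment_id, elements = parse_segment(
    "BIG*20121004*4207359598**ACW/JUU1287005005***DI~"
)
print(segment_id)  # BIG
for position, value in enumerate(elements, start=1):
    # Empty strings are data elements that were left blank.
    print(position, repr(value))
```

Note that the code only recovers positions. Knowing that position 4 is the purchase order number still has to come from somewhere else.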
Having come up with this basic concept of segments and data elements, they then set about coming up with specific arrangements. The key thing here is that a segment might want to, logically, have other segments inside of it. For instance, if you have a segment that describes an individual, you might want to be able to present more than one telephone number for that individual. Each telephone number would have to have its own segment, but somewhere along the line, it's important also that those segments be seen as part of the higher-level segment that introduces the individual in the first place.
With EDI, what was done to address this problem was to create data dictionaries. In addition to this basic underlying format of segments and data elements, the data dictionary goes on to specify what the possible segments are, how many data elements they have and what they mean, and, crucially, how the segments tie together to create structure in the document. It is here, for instance, that the N1 segment describing an individual is ascribed a child segment of PER describing a contact number. More than one PER can be contained in one N1.
They created these data dictionaries, hundreds and thousands of pages of them describing countless types of documents, and it was glorious, but it was wrong. Very, very wrong.
To understand where they went wrong, let's investigate another data format: XML. It was introduced more than ten years after the EDI formats with the Utopian goal of allowing any two computer systems to talk to each other. Of course, this doesn't happen by magic; programmers still have to write code to talk XML specifically instead of something else specifically. The thinking behind it, though, was to come up with a format so standard and ubiquitous that the necessary programming to work with it could already be a part of the basic set of tools that any programmer starts out with when creating a piece of software. Largely, they have succeeded, and yet they have produced no data dictionaries at all. The core XML standard is entirely divorced from exactly what data is being represented.
This basic separation of concerns is actually a core facet of software development. The better a programmer is, the better he is at untangling the different pieces that talk to each other and designing them in such a way that they, by and large, don't need to know anything about what other pieces of code are using them. Another word for this is modularization. XML is well-modularized, because its format isn't tied to any particular type of data.
On the surface, XML works quite similarly to EDI. Data is separated into elements, rather than segments, and elements have attributes rather than data elements, but they serve roughly the same purpose. The above "BIG" segment could, in XML, be presented like this:
<InvoiceGroup Date="2012-10-04" ID="4207359598" PurchaseOrderNumber="ACW/JUU1287005005" Type="DebitInvoice" />
All the same data is there. Of course, one difference is immediately evident: Instead of simply having to know that, for instance, data element #4 is the purchase order number, the attributes have names. This isn't actually always a good thing, because every time you write out an <InvoiceGroup> element, you have to repeat the names. In any XML document, the names of the attributes tend to be a significant proportion of the amount of data being stored; it isn't very compact. However, this has benefits in other areas. If, for whatever reason, a human being needs to work with XML data, it can often be largely self-explaining.
So far, XML is just a prettier but more verbose EDI. Where XML really diverges, though, is in its representation of structure. Rather than structure being a part of a data dictionary, so that the precise usage of segments depends on what exactly you're storing with it, XML makes structure a part of the basic representation of data, much as EDI specifies '*' to separate data elements. Consider the following EDI data:
N1**ARTHUR JONES*1*9012345918341*~
PER**HOME*TE*212-555-1234~
PER**WORK*TE*212-555-4321~
N1**ROBERT SMITH*1*9041371238134*~
PER**WORK*TE*212-555-6543~
In trying to understand this document, you must simply know beforehand that PER is subsidiary to N1, so that Arthur Jones has a home and a work telephone number on file, whereas Robert Smith has only his work number. Without the data dictionary, you are left guessing at the relationship, if any, between the N1 and PER segments.
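Here is a rough Python sketch of what "knowing beforehand" means in code. The PARENT_OF table is a hypothetical, hard-coded stand-in for the one fact we need from the data dictionary (PER belongs to the most recent N1); the function names and the dict layout are likewise my own assumptions for illustration.

```python
# Hypothetical stand-in for the data dictionary: child segment ID -> parent segment ID.
PARENT_OF = {"PER": "N1"}

EDI_DATA = (
    "N1**ARTHUR JONES*1*9012345918341*~"
    "PER**HOME*TE*212-555-1234~"
    "PER**WORK*TE*212-555-4321~"
    "N1**ROBERT SMITH*1*9041371238134*~"
    "PER**WORK*TE*212-555-6543~"
)


def split_segments(data, element_sep="*", segment_term="~"):
    """Yield each segment as a list: [segment ID, data element, ...]."""
    for raw in data.split(segment_term):
        raw = raw.strip()
        if raw:
            yield raw.split(element_sep)


def group_segments(data):
    """Attach each child segment to the most recent segment of its parent type."""
    top_level = []
    last_seen = {}  # segment ID -> most recently read node with that ID
    for segment_id, *elements in split_segments(data):
        node = {"id": segment_id, "elements": elements, "children": []}
        parent_id = PARENT_OF.get(segment_id)
        if parent_id is None:
            top_level.append(node)
        elif parent_id in last_seen:
            last_seen[parent_id]["children"].append(node)
        else:
            raise ValueError(f"{segment_id} segment appeared before any {parent_id}")
        last_seen[segment_id] = node
    return top_level


for person in group_segments(EDI_DATA):
    numbers = [child["elements"][3] for child in person["children"]]
    print(person["elements"][1], numbers)
```

Delete the PARENT_OF table and the code has no way of telling that the PER segments belong to anybody.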
Represented as XML, however, this same data might look like this:
<Individual Name="Arthur Jones" ID="9012345918341">
<Contact Name="Home" Type="Phone" Number="212-555-1234" />
<Contact Name="Work" Type="Phone" Number="212-555-4321" />
</Individual>
<Individual Name="Robert Smith" ID="9041371238134">
<Contact Name="Work" Type="Phone" Number="212-555-6543" />
</Individual>
I am going to assume the reader has a basic familiarity with the structure being represented (most importantly, that everything between "<Individual>" and "</Individual>" is contained by that Individual element). The key thing, which should be quite clear, is that the <Contact> elements have a logical parent, and that parent is very easily identifiable.
EDI does not separate the concern of structure from the concern of meaning, and the result is that it is quite difficult and a lot of work to parse an EDI document. Having parsed one EDI document, writing code to parse another type requires you to essentially start from scratch in specifying what segments are allowed and how they relate to one another. With XML, everything about the nature of the data is representable and understandable using the basic underlying format of XML, without needing to know anything about what the data actually means.
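As a small illustration, the following Python sketch walks the XML above using only the standard library's xml.etree.ElementTree module, with no knowledge of what an Individual or a Contact is. One assumption on my part: a well-formed XML document needs a single root element, so I have wrapped the two Individual elements in a made-up <Individuals> element.

```python
import xml.etree.ElementTree as ET

# The <Individuals> wrapper is an assumption added for this sketch;
# XML requires exactly one root element.
XML_DATA = """
<Individuals>
  <Individual Name="Arthur Jones" ID="9012345918341">
    <Contact Name="Home" Type="Phone" Number="212-555-1234" />
    <Contact Name="Work" Type="Phone" Number="212-555-4321" />
  </Individual>
  <Individual Name="Robert Smith" ID="9041371238134">
    <Contact Name="Work" Type="Phone" Number="212-555-6543" />
  </Individual>
</Individuals>
"""


def outline(element, depth=0):
    """Return (depth, tag, attributes) for an element and all its descendants."""
    rows = [(depth, element.tag, dict(element.attrib))]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(XML_DATA.strip())
for depth, tag, attributes in outline(root):
    # Indentation mirrors the nesting found in the document itself.
    print("  " * depth + tag, attributes)
```

The outline function never mentions a tag or attribute name, yet it recovers the full parent/child structure. The equivalent for EDI cannot be written without first encoding part of the data dictionary.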
Of course, I have omitted certain factors from the story, for simplicity:
- EDI allows data elements to be subdivided into component data elements. Another separator character is used for this. Thus, there is some minimal structure. However, the basic constraint remains that data elements be in a fixed arrangement within a segment, without any variance in the number or ordering.
- There are multiple EDI standards, and they have some core differences. In my work so far, I've looked into ANSI ASC X12, used predominantly in North America, and the United Nations EDIFACT standard which is common in most other parts of the world (but especially Europe), and is internationally recognized. A core difference between the standards is that the ANSI X12 standard is closed and commercial, while EDIFACT is openly documented. To view the EDIFACT standards (though I do warn that they are not written to be easily understood!), you can go to the site of the United Nations Economic Commission for Europe (UNECE). To get solid information about ANSI X12, on the other hand, you must commit to purchasing expensive volumes from a private standards body called the Accredited Standards Committee X12 (ASC X12).
- XML is inarguably much nicer to work with than EDI, but it is still an overwhelmingly complicated standard. Also, other standards have grown up around XML that are, at least in some circumstances, core to working with it, and they add substantial complexity of their own (e.g. XSD, XSLT).
- A growing number of common file formats have come into existence over the last ten years that are really specific uses of XML. There's a good chance your word processor writes .docx files -- these are really compressed archives of XML files. Many line drawing editors support a format called SVG; these are also XML files with a specific layout. Essentially, this is the same as EDI's concept of a data dictionary, but since they are designed around XML, instead of defining structure they simply make use of it. The structure is still present in the absence of having a particular meaning for the data.
- In order to work with EDI, many companies have resorted to third-party service providers who offer to translate the data from EDI to XML. Once in XML format, it is much easier to load the document and interpret its contents. In my integration work, this is essentially the strategy that I adopted as well, though rather than pay a monthly fee, I decided to write my own code to parse the invoice data from EDI documents and transform it to XML. Writing this code, especially the part that keeps track of e.g. PER segments being children of N1 segments, was a bit of an adventure!
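
To give a flavour of that last point, here is a heavily simplified Python sketch of the EDI-to-XML idea. It is not the code I actually wrote. The DICTIONARY table, the tag and attribute names, and the "Document" root element are all assumptions made up for this example, and it only knows about the N1 and PER segments used earlier in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini data dictionary:
# segment ID -> (XML tag, attribute name per data element position, parent segment ID).
# A name of None means "ignore the data element in this position".
DICTIONARY = {
    "N1": ("Individual", [None, "Name", None, "ID"], None),
    "PER": ("Contact", [None, "Name", "Type", "Number"], "N1"),
}

EDI_DATA = (
    "N1**ARTHUR JONES*1*9012345918341*~"
    "PER**HOME*TE*212-555-1234~"
    "PER**WORK*TE*212-555-4321~"
    "N1**ROBERT SMITH*1*9041371238134*~"
    "PER**WORK*TE*212-555-6543~"
)


def edi_to_xml(data, root_tag="Document", element_sep="*", segment_term="~"):
    """Convert flat EDI segments into a nested XML tree using DICTIONARY."""
    root = ET.Element(root_tag)
    latest = {}  # segment ID -> most recent XML element created for it
    for raw in data.split(segment_term):
        raw = raw.strip()
        if not raw:
            continue
        segment_id, *values = raw.split(element_sep)
        tag, names, parent_id = DICTIONARY[segment_id]
        # Children hang off the most recent segment of their parent type.
        parent = root if parent_id is None else latest[parent_id]
        node = ET.SubElement(parent, tag)
        for name, value in zip(names, values):
            if name and value:
                node.set(name, value)
        latest[segment_id] = node
    return root


print(ET.tostring(edi_to_xml(EDI_DATA), encoding="unicode"))
```

Values are copied over untranslated (the Type attribute comes out as "TE" rather than "Phone"), and an unknown segment or a PER with no preceding N1 simply raises a KeyError. Real invoice data needs a much larger table and proper error handling, which is where the adventure comes in.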
