Tuesday, 30 October 2012

On Compiler Reductionism

In response to an e-mail that went around the office here, I recently took a poke at explaining some of the features the C♯ programming language implements. As always, my philosophy is that if you understand how a tool works under the hood, you can better use it.
The C♯ programming language is full of nice features that make code more concise and easier to read; 'foreach' statements and 'using' blocks are two examples. It also has core language features that provide the basic functionality needed to write a program in the .NET environment, such as flow control with 'if' and 'while' and error handling with 'try', 'catch' and 'finally'.
Some of these features are implemented directly. When you write an 'if' statement, the compiler translates it to a conditional branch. For instance, this:
    if (condition)
      truePart();
    else
      falsePart();
..becomes a somewhat more verbose equivalent to this:
    if (!condition)
      goto _else;

    truePart();
    goto _endif;

    _else:
    falsePart();

    _endif: ;
The compiler doesn't translate your 'if' into a series of 'goto's, though. Flow control is the native level of the code in its conceptual representation. Your code is represented as a directed graph, where each node is a basic block (a stretch of code that can only run one way, straight through, with no branches), and nodes are joined to one another by edges that say, "if this condition is met, do this node next". Statements like 'if', 'while', 'for', 'switch' and so on get translated directly to this form. Similarly, 'try', 'catch' and 'finally' are actually represented by metadata in the final compiler output that says, "If an exception happens between these two byte offsets of MSIL code, then jump to this other byte offset to handle it, and this fourth byte offset is the 'finally' block that needs to be executed whether flow control exits the 'try' range via an exception or a 'leave' opcode."
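The post's code is C♯, but the graph idea is language-neutral, so here's a minimal runnable sketch in Python (every name in it, like 'run_cfg', is hypothetical, not anything a real compiler uses) of the if/else above as basic blocks joined by conditional edges:

```python
def run_cfg(nodes, entry, env):
    """Walk the graph from 'entry', running one straight-line block at a time,
    then following the first edge whose condition holds."""
    node = entry
    trace = []
    while node is not None:
        block, edges = nodes[node]
        trace.extend(block)          # a block runs top to bottom, no branches
        next_node = None
        for cond, target in edges:   # edges: "if this condition, go there next"
            if cond(env):
                next_node = target
                break
        node = next_node             # no matching edge means execution ends
    return trace

# The example: "if (condition) truePart(); else falsePart();"
nodes = {
    "entry":      ([], [(lambda e: e["condition"], "true_part"),
                        (lambda e: True,           "false_part")]),
    "true_part":  (["truePart()"],  [(lambda e: True, "endif")]),
    "false_part": (["falsePart()"], [(lambda e: True, "endif")]),
    "endif":      ([], []),
}

print(run_cfg(nodes, "entry", {"condition": True}))   # ['truePart()']
print(run_cfg(nodes, "entry", {"condition": False}))  # ['falsePart()']
```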
(I mentioned MSIL code in that last sentence -- this is the "machine code" upon which .NET operates. In order to be truly multi-platform, code written for .NET doesn't actually get translated for the CPU -- not immediately, anyway. MSIL is an intermediate language, similar in spirit to Java's bytecode, designed to make it relatively easy to quickly produce efficient machine code for a variety of actual hardware. Thus, the same compiled EXE can be used on PCs, tablets, phones, PDAs, even Macs. :-)
So, some features are core to how the language gets translated to the underlying machine code. These are called primitives. Other features, though, are implemented in terms of these underlying primitives. The 'using' statement, for instance, is really a 'try' block with a 'finally' that always calls .Dispose() on the object being "used". Similarly, the 'foreach' statement is really a 'while' loop that calls the .MoveNext() method of an enumerator and, as long as it gets a next element, retrieves it with the .Current property before running the body of the loop. The 'while' loop sits inside a 'using' block to ensure that the enumerator gets disposed when it's done.
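Python happens to have the same pair of composed features, which makes for a convenient runnable sketch of this reduction: its 'for' loop is really a 'while' around the iterator protocol, much as 'foreach' is a 'while' around MoveNext() and Current. This is just the idea, not actual compiler output:

```python
items = ["a", "b", "c"]

# Composed form, analogous to C#'s foreach:
collected = []
for item in items:
    collected.append(item)

# What it reduces to: an explicit iterator, advanced one element at a time.
# (Python signals "no more elements" by raising StopIteration where .NET's
# MoveNext() returns false.)
desugared = []
iterator = iter(items)
while True:
    try:
        item = next(iterator)   # MoveNext() and Current rolled into one call
    except StopIteration:
        break
    desugared.append(item)

print(collected == desugared)   # True
```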
I'm going to call these features that are based on other features composed.
Some of these composed features are so complicated that the compiler needs to synthesize entire classes to support their operation. Most features that work this way were introduced in more recent versions of C♯.
The e-mail I initially responded to was an educational thread about one of these more advanced composed features called "closures". To be sure, C♯'s closures are not the true lexical closures of a language like Scheme, but they do allow the implicit sharing of variables between scopes, with the superficial look & feel of first-class functions being passed around. As a practical tool, that's still very useful, even if it doesn't satisfy language purists.
The question that was asked is as follows:
Hey guys,

I told the group today that I’d send out a better example for variable scoping when using delegates, so here it is.  I’m sending it to all developers because it’s useful for everyone.

Without running this code, what will be the
output?  Why?  What’s the scope of the variable x?
    public class Program
    {
        private static void Main(string[] args)
        {
            var resultOfFoo = Foo();
            resultOfFoo();
            resultOfFoo();
            resultOfFoo();
        }

        public static Action Foo()
        {
            int x = 0;
            return () => Console.WriteLine(x++);
        }
    }
To understand this, I find it simplest to understand what the compiler does when it sees this kind of code.
Roughly speaking, when interpreting the code, the compiler turns this (repeated here for reference):
  public class Program
  {
    private static void Main(string[] args)
    {
      var resultOfFoo = Foo();
      resultOfFoo();
      resultOfFoo();
      resultOfFoo();
    }

    public static Action Foo()
    {
      int x = 0;
      return () => Console.WriteLine(x++);
    }
  }
..into this:
  public class Program
  {
    private static void Main(string[] args)
    {
      var resultOfFoo = Foo();
      resultOfFoo();
      resultOfFoo();
      resultOfFoo();
    }

    public static Action Foo()
    {
      VariableScopeClass scope = new VariableScopeClass();

      scope.x = 0;

      return (Action)scope.AnonymousMethod;
    }

    private class VariableScopeClass
    {
      public int x;

      public void AnonymousMethod()
      {
        Console.WriteLine(x++);
      }
    }
  }
When C♯ was first released, it didn't have support for the type of construct shown in the upper listing, where a function is apparently created and returned directly in the last line of the "Foo()" method. This trick of synthesizing a class to hold the captured variable (and the method!) allowed the language designers to implement the new feature without requiring any changes to the runtime. If you're writing a plug-in for a program written back before .NET supported this feature, you can still use it in your implementation; your method looks the same as if it were written out the long way, and can be called by code that has no knowledge that the feature ever existed.
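The same lowering can be sketched in Python, whose closures capture variables in much the same way. The class below plays the role of the compiler-generated VariableScopeClass; the names are mine, not anything a real compiler emits. It also answers the quiz: the output is 0, 1, 2, because all three calls share the single captured 'x'.

```python
# What you write: a closure capturing a local variable.
def foo():
    x = 0
    def anonymous():
        nonlocal x      # 'x' lives on past foo()'s return, shared by all calls
        value = x
        x += 1
        return value
    return anonymous

result_of_foo = foo()
print(result_of_foo(), result_of_foo(), result_of_foo())  # 0 1 2

# What the lowering amounts to: the captured variable is hoisted into a
# heap-allocated scope object, and the "function" you pass around is really
# that object plus a method on it.
class VariableScope:                 # hypothetical stand-in for the synthesized class
    def __init__(self):
        self.x = 0
    def anonymous_method(self):
        value = self.x
        self.x += 1
        return value

scope = VariableScope()
lowered = scope.anonymous_method     # a bound method: object + code, like a delegate
print(lowered(), lowered(), lowered())  # 0 1 2
```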
Another feature that involves this sort of translation, though one that is considerably more involved, is the definition of enumerators as coroutines. To be sure, as with the not-quite-lexical-scopes, these are not-quite-coroutines, but they are close enough to be a useful tool. The idea behind a coroutine is to have two methods running in tandem, taking turns (i.e., not concurrently), where each method passes control to the other by what it thinks is a method call. When it makes that call, the other method sees a call that it made earlier return. Eventually, method #2 will make a call back to method #1, which has been patiently waiting for its own call to return, and, mirror image, method #2's call becomes method #1's return. Kind of wacky stuff to wrap your head around!
The way C♯ has exposed this functionality is designed to be somewhat easier to wrap the head around. Instead of two general-purpose methods bouncing back and forth, swapping out their stacks and generally causing mayhem (if not in the execution environment, then in the brains of the programmers having to deal with them), C♯ models one half of the coroutine as generating a sequence of objects. To generate the next object in the sequence, you use the statement 'yield return'. Conceptually, you can simply consider your function as continuing execution after the 'yield return', yielding one value after another until finally the function completes execution and returns.
Enumerations in .NET, though, are a form of "lazy evaluation". The caller has an object representing a sequence of objects, but it only sees one of them at a time. When it's done with the current object, it probes to see if there is another object by calling 'MoveNext()'. If MoveNext() returns true, then there is another object waiting in the 'Current' property, and if it returns false, there is not. How is this compatible with the idea of a method whose execution straightforwardly produces the entire list?
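Python generators expose exactly this lazy protocol, so they make a handy model: the function body doesn't run at all until the first element is requested, and then runs only as far as the next 'yield' each time. A sketch (the 'log' list is just instrumentation so we can see when the body actually runs):

```python
log = []

def test_enumerator(bigger):
    log.append("started")        # side effect: proves the body has begun running
    yield "starting"
    if bigger:
        yield "BIG"
        yield "GER"
    else:
        yield "smaller"

seq = test_enumerator(True)      # nothing has run yet...
print(log)                       # []

first = next(seq)                # ...now the body runs, up to the first yield
print(first, log)                # starting ['started']
```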
The C♯ compiler is doing a bit of rewriting on your behalf. The example here is a bit longer, so my apologies in advance for the wall of code that follows.
Before I show how the enumerator itself is generated, I'll show how the caller is generated, because the majority of C♯ code never directly refers to the 'MoveNext()' method and 'Current' property I've been talking about.
To access the sequence of objects from an enumeration, you write code something like this:
    static void Main(string[] args)
    {
      foreach (string str in TestEnumerator(true))
        Console.WriteLine(str);
    }
The compiler pretends you've written code like this, which is semantically identical, a reduction of the 'foreach' compound feature to its underlying implementation:
    static void Main(string[] args)
    {
      using (var enumerator = TestEnumerator(true).GetEnumerator())
      {
        while (enumerator.MoveNext())
        {
          string str = enumerator.Current;
          Console.WriteLine(str);
        }
      }
    }
In turn, reducing the 'using' compound feature to the true underlying primitives, the compiler treats this code as though it were this:
    static void Main(string[] args)
    {
      var enumerator = TestEnumerator(true).GetEnumerator();

      try
      {
        while (enumerator.MoveNext())
        {
          string str = enumerator.Current;
          Console.WriteLine(str);
        }
      }
      finally
      {
        if (enumerator != null)
          enumerator.Dispose();
      }
    }
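One property of this reduced form is worth dwelling on: the caller drives the loop, so if it stops calling MoveNext(), no further elements are ever produced. A Python sketch (generators follow the same protocol) makes that visible with a side-effect list recording how far the producer actually ran:

```python
produced = []

def numbers():
    for i in range(100):
        produced.append(i)   # side effect, so we can see how far we got
        yield i

it = numbers()
first_two = [next(it), next(it)]   # advance exactly twice, then stop asking

print(first_two)   # [0, 1]
print(produced)    # [0, 1] -- elements 2 through 99 were never generated
```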
This is the code that, finally, gets turned into MSIL for execution. You can see here the way a conceptual stream of objects is turned into a discrete sequence of function calls, each one retrieving the next in sequence. A key point here is that you could, at any time, decide not to call MoveNext again, and no further elements would be retrieved from that enumerator. Let's take a look at what the TestEnumerator method might look like:
    static IEnumerable<string> TestEnumerator(bool bigger)
    {
      yield return "starting";

      if (bigger)
      {
        yield return "BIG";
        yield return "GER";
      }
      else
        yield return "smaller";

      for (int i = 1; i <= 10; i++)
        yield return "iterating: #" + i;

      yield return "finishing";
    }
This looks like a straightforward method that will run through and generate a sequence of objects. This isn't compatible, though, with the idea of a MoveNext() call that may or may not even take place. To support the MoveNext() mechanic, and a Current property that holds its value between calls, some sort of "stop-and-go" mechanism is clearly needed, and that is exactly what the compiler supplies. The above code gets translated into something rather like this behind the scenes:
    static IEnumerable<string> TestEnumerator(bool bigger)
    {
      var ret = new TestEnumeratorStateMachine(
        TestEnumeratorState.ParentIEnumerable);

      ret.bigger = bigger;

      return ret;
    }

    enum TestEnumeratorState
    {
      ParentIEnumerable = -2, // means that the IEnumerator<string>.MoveNext() method won't do anything for this instance
      Interrupted = -1, // means that flow didn't leave the MoveNext() method the expected way, and the machine is in an error state

      Initial,
      AfterStarting,
      AfterBIG,
      AfterBIGGERorsmaller,
      TopOfForLoop,
      BottomOfForLoop,
      AfterForLoop,
      EndOfEnumeration,
    }

    class TestEnumeratorStateMachine : IEnumerable<string>, IEnumerator<string>
    {
      public bool bigger;

      public TestEnumeratorStateMachine(TestEnumeratorState initialState)
      {
        _state = initialState;
      }

      private TestEnumeratorState _state;
      private string _current;

      private int i;

      public string Current { get { return _current; } }

      public bool MoveNext()
      {
        switch (_state)
        {
          case TestEnumeratorState.Initial:
            _state = TestEnumeratorState.Interrupted;

            _current = "starting";
            _state = TestEnumeratorState.AfterStarting;
            break;
          case TestEnumeratorState.AfterStarting:
            _state = TestEnumeratorState.Interrupted;

            if (this.bigger)
            {
              _current = "BIG";
              _state = TestEnumeratorState.AfterBIG;
            }
            else
            {
              _current = "smaller";
              _state = TestEnumeratorState.AfterBIGGERorsmaller;
            }

            break;
          case TestEnumeratorState.AfterBIG:
            _state = TestEnumeratorState.Interrupted;

            _current = "GER";
            _state = TestEnumeratorState.AfterBIGGERorsmaller;

            break;
          case TestEnumeratorState.AfterBIGGERorsmaller:
            _state = TestEnumeratorState.Interrupted;

            i = 1; // for loop initialization
            goto case TestEnumeratorState.TopOfForLoop;

          case TestEnumeratorState.TopOfForLoop:
            _state = TestEnumeratorState.Interrupted;

            if (i <= 10)
            {
              _current = "iterating: #" + i;
              _state = TestEnumeratorState.BottomOfForLoop;
            }
            else
              goto case TestEnumeratorState.AfterForLoop;

            break;
          case TestEnumeratorState.BottomOfForLoop:
            _state = TestEnumeratorState.Interrupted;

            i++;
            goto case TestEnumeratorState.TopOfForLoop;
          case TestEnumeratorState.AfterForLoop:
            _state = TestEnumeratorState.Interrupted;

            _current = "finishing";
            _state = TestEnumeratorState.EndOfEnumeration;
            break;

          default:
            // Handles states ParentIEnumerable, Interrupted, and
            // EndOfEnumeration. Returning false here tells the caller
            // that Current is now indeterminate and there is nothing
            // further in the sequence.
            return false;
        }

        return true;
      }

      public IEnumerator<string> GetEnumerator()
      {
        var enumerator = new TestEnumeratorStateMachine(TestEnumeratorState.Initial);

        enumerator.bigger = this.bigger;

        return enumerator;
      }

      void IDisposable.Dispose()
      {
        // This would handle any 'using' blocks in the original
        // method. If a 'yield return' occurred in the middle of a
        // 'using', we'd have an object waiting to be .Dispose()d.
        // Continuing the enumeration and letting it flow out of the
        // 'using' block would achieve that, but if the caller
        // decided to stop enumerating and dispose the enumerator,
        // it'd still need to be cleaned up. That would go here :-)
      }

      // IEnumerable<string> and IEnumerator<string> both extend the
      // older, non-generic IEnumerable and IEnumerator interfaces.
      // So, to implement the generic methods, we also have to supply
      // non-generic methods.

      System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { return this.GetEnumerator(); }
      object System.Collections.IEnumerator.Current { get { return this.Current; } }
      void System.Collections.IEnumerator.Reset() { throw new NotSupportedException(); }
    }
Wow, that's a lot of code! When this particular type of code happens to be exactly what you need, it's very nice to have a compiler that can interpret the 10-line version using 'yield return' and produce all of this in an automated way, with no risk of errors, typos or silly bugs.
As you can see, each section of the code between 'yield return' statements (even if it's empty, like between 'BIG' and 'GER') becomes its own 'case', so that a subsequent call to 'MoveNext()' can jump straight back there. Also notice how every case block starts off by setting the state to 'Interrupted'. Before returning, it sets it to the actual next state. If an exception occurs, though, the state is left at 'Interrupted', and henceforth, the 'switch' block always jumps to the 'default' case that returns false: Nothing more to come from this enumeration.
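Python generators behave the same way as the 'Interrupted' state, which makes for a quick demonstration: once an exception escapes the generator's body, the generator is finished for good, and every later advance just reports that the sequence is over.

```python
def fragile():
    yield 1
    raise ValueError("boom")
    yield 2                   # never reached

gen = fragile()
print(next(gen))              # 1

try:
    next(gen)                 # the exception propagates to the caller...
except ValueError:
    pass

# ...and from now on the generator only ever says "nothing more to come"
try:
    next(gen)
    survived = True
except StopIteration:
    survived = False
print(survived)               # False
```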
(The more astute reader may note that the "TopOfForLoop" state is not externally reachable. As such, it doesn't need to set the state to 'Interrupted'; it will already be in that state when it gets there. I've left it in for consistency. The actual compiler output does not use a separate state in this way, but I found it instructive, as it mirrors the conceptual flow control of the bits in a "for (init; cond; iter) stmt;" block: "init; top: if (cond) { stmt; iter; goto top; }".)
I should mention that I've omitted certain aspects of the true compiler output. For instance, in both this case and the first example with the anonymous method, the compiler uses names that aren't valid C♯ identifiers (but are valid in MSIL). This guarantees that there's no chance of you declaring a variable of your own that unwittingly steals the name the compiler wants to use. The true enumerator code also has a small optimization in it, where if GetEnumerator() is called from the same thread that created the object in the ParentIEnumerable state, it reuses the same object, transitioning it to the Initial state in the GetEnumerator() call. (Why the same thread? Simply to ensure that there aren't two threads doing it at the same time. It doesn't matter which thread it is -- if you only let one thread do that transition, then you can't ever end up with two threads holding a transitioned initial object.)
These transformations, reductions from compound features to primitive features during the compilation phase, are taking place countless times in our codebase and any other piece of complex C♯ code.

Friday, 26 October 2012

Getting it from here to there

My last post was all about EDI and its horrific intertwining of data structure with meaning. Suppose you are a business that integrates with another business via EDI. You've just created an EDI document describing some business object, such as an invoice, that your partner needs to see (almost certainly by means of translating your document from a less-arcane internal format -- there might be no help for you if your business uses EDI internally!). How do you get it to them?
While the EDI X12 standard is a product of private industry and costs many pennies to get ahold of, the people charged with figuring out ways to get data from one place to another took a different approach, one more in line with the atmosphere of open communication that characterizes the technological core of the Internet.
There are many standards out there (many many standards) that define how the Internet works. (What does it mean to say "how the Internet works"? The Internet is a "network of networks". Within each network, it's a fair bet that, most of the time, the computers are all of the same type, and even where they're not, there's a local system administrator making sure every machine knows how to talk to every other machine. Two disparate networks, however, could be running very different systems. The whole purpose of the Internet is to provide a consistent set of protocols so that a computer in network A and a computer in network B have a common language that allows them to address each other and share information.) Each of these standards is called an RFC, short for Request For Comments, and describes how one small piece works.
To give an example, back when people were first inventing e-mails, an e-mail message consisted of a block of monospaced text. There was no formatting, no special meaning, no means to attach files. It was just a block of text. Within one mainframe system, users sending each other e-mail was handled by a piece of code on the server that had the special privilege to go into the recipient's files and add the text of the new message to the end of a "mailbox" file. How exactly the file was formatted wasn't particularly important, as long as the program for reading mail within that system could understand the files produced by the code for delivering mail.
Next thing you know, though, people at one site (a university, most likely) want to send messages to people at another site, and the two mainframes aren't compatible. Their mail programs don't write the same format, and even if they did, since they're different computers, it isn't possible for the mail delivery software on one to get its fingers into the files on the other. The two computers could speak to each other only by establishing connections. A connection was a lot like a file in that you could read from it and write to it, but it had one very significant difference. With a file, you can "rewind" or "fastforward" -- you can jump to a particular point in the file and do your reading or writing there. With a network connection (really, with any stream, of which a connection is a particular type), you can write data, but it must come after the data you wrote just before it, and the other end will get it in just the same order. Similarly, you can read data, but it'll always come in exactly the order the other end sent it. So, you want to have a piece of software at site A talk to a piece of software at site B and deliver mail. Clearly, they must be speaking the same language.
Enter the RFC. In the early 1980s, some smart men got together and put some thought into exactly what pieces of information would need to be exchanged in what order, and they came up with a generic way for any site to present an e-mail message to another site. Once received, the other site could do with it what it wanted, such as appending it to a user's "mailbox" file. The key thing was that over the wire, there was a standard way to deliver the mail. It might have looked something like this:
MAIL FROM: <alice>
RCPT TO: <bob>
DATA
Hello Bob,
Want to go for lunch?
-Alice

.
QUIT
E-mail today is still sent using a protocol very much like this, called SMTP, or Simple Mail Transfer Protocol. It was originally defined by RFC 821 and is maintained today as RFC 5321.
Now, you could imagine an argument taking place between two computer scientists at different universities. One might be telling the other one, "Our mailboxes store the date that a message was sent. It's very useful! You should do the same thing." But where to put this date? The other might reply, "That's all very well, but we've found it more useful to be able to send the same message to more than one person, for a group discussion. Our system stores a list of people involved in such a group discussion." Where to put this list?
A separate RFC describing an Internet Mail Format takes care of these types of details. Completely separate from how the e-mail is delivered from one place to another, it specifies how the e-mail is laid out. The SMTP protocol moves a block of bytes from one server to another, and the mail format says what exactly those bytes are.
The Internet Mail Format tells you not only how to indicate the date, but exactly what layout to use for it, and it tells you to indicate who the message is from and who it's for and exactly how to lay out their e-mail addresses, and it provides for a clear separation of these bits of "extra" information, which it calls headers, from the message itself, which it calls the body.
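Python's standard 'email' package implements this very layout, so a short runnable sketch can show the split between headers and body, using the lunch message from this post:

```python
from email import message_from_string

raw = (
    "From: Alice MacBob <alice>\n"
    "To: Bob MacAlice <bob>\n"
    "Subject: lunch\n"
    "Date: Fri, 26 Oct 2012 10:30:00 -0500\n"
    "\n"                                # the blank line separates headers from body
    "Hello Bob,\n"
    "Want to go for lunch?\n"
    "-Alice\n"
)

msg = message_from_string(raw)
print(msg["Subject"])                     # lunch
print(msg["From"])                        # Alice MacBob <alice>
print(msg.get_payload().splitlines()[0])  # Hello Bob,
```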
An SMTP delivery of a message in this format looks something like this:
MAIL FROM: <alice>
RCPT TO: <bob>
DATA
From: Alice MacBob <alice>
To: Bob MacAlice <bob>
Subject: lunch
Date: Fri, 26 Oct 2012 10:30:00 -0500

Hello Bob,
Want to go for lunch?
-Alice

.
QUIT
This split between headers and body allows a fair degree of flexibility, but it doesn't give you the ability to attach a video file of a cat doing something hilariously stupid. To allow for this, an extension of the Internet Mail Format was created called Multipurpose Internet Mail Extensions, or MIME.
MIME was defined so that any valid Internet Mail Format message is also a valid MIME entity, but MIME allows you to do so much more. To start with, it allows you to define a message in multiple parts. These parts could be alternative versions of the same thing, such as one using HTML to provide nice formatting to the content of a message, and another with the same message stripped down to plain text for mail programs that don't support HTML, or it could be multiple documents, such as a message body and attachments associated with it.
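As a sketch of that first case, here's Python's standard library building a two-part multipart/alternative message, one plain-text part and one HTML part (the message content is just illustrative):

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# "alternative" tells mail programs the parts are versions of the same thing:
# render the richest one you understand, ignore the rest.
msg = MIMEMultipart("alternative")
msg["From"] = "Alice MacBob <alice>"
msg["To"] = "Bob MacAlice <bob>"
msg["Subject"] = "lunch"

msg.attach(MIMEText("Want to go for lunch?", "plain"))
msg.attach(MIMEText("<p>Want to go for <b>lunch</b>?</p>", "html"))

print(msg.get_content_type())    # multipart/alternative
print(len(msg.get_payload()))    # 2
```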
So, we started out talking about the delivery of EDI from one place to another, and we ended up talking about how e-mails work. What could these two things possibly have in common? To misuse a metaphor, we've gone off a tangent onto a curve, and that curve has come full circle. :-)
I mentioned earlier that a group of people was tasked with coming up with a standardized way of getting EDI data from one business server to another. The result of their work was a series of documents they called Applicability Statements: AS1, AS2, AS3 and, later, AS4 (the first three were each published as an RFC). These strangely-named documents all have one thing in common: They don't actually specify new protocols or formats. Instead, they list existing formats and describe how they can be applied to business-to-business communication. The first of those standards, AS1? Yes, it does specify business-to-business communication by e-mail.
Setting aside for the moment how crazy this really is, let's look at how it recommends you do this. First, here's a series of headings from within the AS1 specification:

3.0 Referenced RFCs and Their Contribution

3.1 RFC 821 SMTP [7]

3.2 RFC 822 Text Message Format [3]

3.3 RFC 1847 MIME Security Multiparts [6]

3.4 RFC 1892 Multipart/report [9]

3.5 RFC 1767 EDI Content [2]

3.6 RFC 2015, 3156, 2440 PGP/MIME [4]

3.7 RFC 2045, 2046, and 2049 MIME [1]

3.8 RFC 2298 Message Disposition Notification [5]

3.9 RFC 2633 and 2630 S/MIME Version 3 Message Specifications [8]

Yes, that is fourteen different RFCs upon which this standard is building.
The basic idea is that the EDI document gets treated in roughly the same way as that "Hey Bob, Want to go for lunch? -Alice" message earlier -- a sequence of bytes that get wrapped up in the structure defined by the MIME RFC. The resulting MIME entity is then delivered by SMTP to the remote server. Once delivered, the remote server can process it however it wants -- it might be spooled into a mailbox for another application to read, or the mail server might be customized for AS1 and pass it off directly to a handler on-the-spot.
In the integration I'm working on, we are fortunately not using AS1. The idea of involving a mail server in a critical business flow frankly gives me nightmares. Our integration uses Applicability Statement 2, which describes the official, standard means of delivering EDI documents via HTTP.
HTTP, as a protocol, is far better-suited to this type of activity. It has long had as one of its primary responsibilities moving machine-readable data from one process to another for the specific purpose of processing that data and producing a response.
An HTTP request is actually remarkably similar to a document encoded according to the rules of the Internet Mail Format -- similar enough to arouse suspicions that it was intentionally designed that way. Here is a comparison:
POST /handler HTTP/1.1
Host: servername
Content-Type: text/plain
Content-Length: 123
Accept: text/plain, */*

Hello Bob,
Want to go for lunch?
-Alice

..and here is the Internet Mail Format message it resembles:

From: Alice MacBob <alice>
To: Bob MacAlice <bob>
Subject: lunch
Date: Fri, 26 Oct 2012 10:30:00 -0500

Hello Bob,
Want to go for lunch?
-Alice
Now, here's where things get a bit weird. AS2 says, on the one hand, to use MIME, with its header/data structure, as the "packaging" for the EDI data, and, on the other hand, to use HTTP, with its header/data structure, as the "delivery method", but it says to treat them as the same thing!
When delivering an AS2 request, the MIME entity that represents the wrapped-up EDI document has a bunch of headers, a blank line, and then data. The HTTP request that does the actual delivery also has a bunch of headers, a blank line, and then data. The fully-formed AS2 request merges the MIME headers into the HTTP headers, so that the body of the MIME entity is the body of the HTTP request:
POST /path/to/request/handler HTTP/1.1
Host: servername
Content-Type: application/edi-x12
Content-Length: 123
AS2-From: Fabrikam
AS2-To: Contoso
AS2-Version: 1.0
Message-ID: 09A8D1403A2F4C87BF3FEC1120C4EFA3
Date: Fri, 26 Oct 2012 10:30:00 -0500

edi data here
This is weird. It's also interesting to implement, because while web servers are very good at the request processing model, they aren't typically designed to give you full control over the request or response. In most cases, they handle the headers for you, since in most cases, only the body of the request and the body of the response really matter significantly to the application. Meanwhile, most components for working with MIME entities have no reason to suspect this split of the headers from the body data, so parsing the MIME entity being sent in the request and sending the response MIME entity back are both a bit tricky.
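To make the merge concrete, here's a Python sketch (the header values are the illustrative ones from above, not from any real AS2 implementation): the MIME entity's headers are folded into the HTTP headers over a single body, and because the two formats share a syntax, the standard 'email' package can parse the result right back apart.

```python
from email import message_from_string

edi_body = "edi data here"

# Headers belonging to the MIME entity that wraps the EDI document:
mime_headers = {"Content-Type": "application/edi-x12"}

# Headers the HTTP layer itself needs (illustrative values):
http_headers = {
    "Host": "servername",
    "AS2-From": "Fabrikam",
    "AS2-To": "Contoso",
    "Content-Length": str(len(edi_body)),
}

# The AS2 request merges the two header sets over a single body.
merged = {**http_headers, **mime_headers}
request = "".join(f"{k}: {v}\n" for k, v in merged.items()) + "\n" + edi_body

# Header/body syntax is shared, so a mail parser can take it apart again:
parsed = message_from_string(request)
print(parsed["AS2-From"], parsed["Content-Type"])  # Fabrikam application/edi-x12
print(parsed.get_payload())                        # edi data here
```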
Just for completeness, I'll mention that the AS3 standard takes delivery of business documents for on-demand processing into the realm of FTP. It models the delivery of a MIME-encoded EDI document around a file upload. E-mail may not be the best fit for on-demand request processing, but surely FTP uploads represent the polar opposite of what is desirable!
I think the message to take away from all of this is that the people out there making the standards that everyone ends up having to implement are not geniuses, and they're not focused on ways to simplify the implementation. They really should be, though, because the simpler the implementation is, the fewer bugs there will be, and the more reliable the software will be in the long run. :-)

Tuesday, 16 October 2012

The Nightmare that is EDI

Welcome to my blog :-)
People have been telling me for a long time that I ought to start one to share my thoughts and ideas. I've finally caved and fired one up. This inaugural post happens to be about EDI and XML, which are particular formats for representing data, but I expect to write articles on a variety of different topics. Most of the posts will probably be related to computer programming. I may also resurrect some write-ups I have done in other forums and repost them here. If you have any suggestions about topics, or any other thoughts or comments, by all means contact me. :-)
I've been working on integrating with a large partner recently. We want to be able to automate billing, because with the expected volume, manually reconciling what they actually bill us for with what we expect to be billed for would be infeasible. In this day and age, one might expect the billing information to be supplied in the form of XML documents, or perhaps (effectively the same thing) complex types in a SOAP response, but one would be wrong.
It turns out, long before XML was created, back when computers were barely capable of actually storing data, let alone operating on it, it was nevertheless noticed that having the computer store the data instead of storing hard copy made data management much easier. For instance, eliminating data more than, say, 7 years old was one command to the system that took maybe 20 seconds to type, instead of an exhaustive search of poorly-organized filing cabinets that could occupy the better part of a week. In order to store data, though, it needs to have a format.
We humans are pretty good at recognizing information. Imagine looking at two post-it notes that record the same kind of reminder, each written out in its own way:
Our brains understand intuitively that these are the same kind of note, even though they carry slightly different information laid out differently. A computer could be made to understand both types of notes, but it would have to be given a complete description of each possible layout. A third note with yet another layout would need more programming. Computers lack the pattern recognition our brains have, so working with data in multiple formats requires a lot of programming work: a different path through the code -- a different plan -- for pulling the relevant bits of information out of each.
Because of this characteristic, when people are organizing data for computers to understand, they tend to do three things:
  1. The data is organized in a format that takes as little programming code to understand as possible. This isn't actually strictly true, because there are other factors in the trade-off. Sometimes a format more complicated to parse is preferred for other reasons, such as making it easy not just for computers but for human eyes as well. However, the principle is there: an overcomplicated format is no fun to work with, because the code needed is difficult to write and understand, and code that is difficult to write and understand is likely to have lots of bugs in it.
  2. More importantly, the format is devised in such a way that a given piece of information in it can only be understood in one way. There is no ambiguity, not even ambiguity that can be resolved with context. There is always exactly one right way to read the information.
  3. Most importantly, once a given format has been established, all the data is presented in exactly the same format. There are no exceptions, no loose rules. The format says it must be done in this way, so it is.
There are other concerns as well, but these are the fundamental essentials of encoding information for computers to work with.
So, back in the 1980s, people were looking for a way to use computers for business information. Nowadays, computers tend to fall into one of two baskets, PC or Mac, and while the way they work behind the scenes is very different, on the surface, things have converged so that using one isn't significantly different from using the other, and by and large, computer programs for one can talk with computer programs for the other. For instance, images tend to be JPEG, PNG or SVG files, music tends to be MP3 files, and almost everybody can read PDF and DOC files. We've achieved a standardization in data formats that makes it much easier for people to work together.
It's easy to forget that not much more than ten years ago, the files common on each type of system were not interchangeable. Before the proliferation of MP3s, people shared short audio clips: on Windows computers, WAV files; on Macs, AIFF files. If a Windows user gave his Mac-using friend a floppy disk with a WAV file on it, then, setting aside the fact that the Macintosh couldn't even read the floppy disk, assuming he could get to the file at all, it would take special software on the Macintosh to understand the WAV file. Similarly, very little PC software existed that could understand AIFF files.
This same discrepancy in formats existed in business data in the early 1980s, but the problem was much more severe. Business computers weren't either PCs or Macs. Every business had their own solution, generally with their own custom software with its own custom data format. It was virtually impossible to share data, short of printing it out in human-readable form on one system, and then typing it back in on the other. Something had to be done.
Various industry consortia got together and decided that they needed to settle upon a way of representing data. They called their system Electronic Data Interchange, or EDI. They decided that virtually any document could be divided up into logical pieces, which they called segments. The defining characteristic of a segment is that it has a fixed number of pieces of information, or data elements, and each piece is a single unit. For instance, a line item on an invoice has a description, a unit price and a quantity. No line item will have two unit prices or five quantities. However, an invoice may have multiple line items. So, a line item is a segment, and the description, unit price and quantity are its data elements. A report on stock prices changing over time, by contrast, would put each price into a separate segment, because the length of the reporting period, and thus the number of prices, is not fixed.
With EDI, they decided that a segment starts with a few characters that identify the segment's meaning, followed by a series of data elements. The computer tells where one data element ends and the next begins by means of a special character selected as a separator, and where one segment ends and the next one starts by means of another special character, the segment terminator. A typical segment looks like this:
BIG*20121004*4207359598**ACW/JUU1287005005***DI~
This segment is a Beginning of Invoice Group. The first data element is the date, the second is a unique ID assigned by the sender's system, the third one is empty, the fourth one is a purchase order number, data elements 5 and 6 are empty, and the final element, "DI", indicates that this is a Debit Invoice, as opposed to "CR", a Credit Memo, or any number of other codes.
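The splitting rule is simple enough to sketch in a few lines of Python. This is a toy illustration, not a real EDI parser: the '*' and '~' characters are hardcoded here, whereas a real X12 interchange declares its separators in the interchange header.

```python
def parse_segment(raw, element_sep="*", terminator="~"):
    """Return (segment_id, [data elements]) for one raw EDI segment."""
    raw = raw.rstrip(terminator)              # drop the segment terminator
    segment_id, *elements = raw.split(element_sep)
    return segment_id, elements

seg_id, elements = parse_segment("BIG*20121004*4207359598**ACW/JUU1287005005***DI~")
# seg_id is "BIG"; elements[3] is the purchase order number,
# and the final element is "DI", the Debit Invoice code.
```

Note that the empty data elements survive the split as empty strings, which is what keeps the positions stable: the purchase order number is always the fourth element, whether or not the third is filled in.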
Having come up with this basic concept of segments and data elements, they then set about defining specific arrangements. The key thing here is that a segment might logically need to contain other segments. For instance, if you have a segment that describes an individual, you might want to present more than one telephone number for that individual. Each telephone number needs its own segment, but those segments must also somehow be recognized as part of the higher-level segment that introduces the individual in the first place.
With EDI, this problem was addressed by creating data dictionaries. On top of the basic underlying format of segments and data elements, the data dictionary specifies what the possible segments are, how many data elements they have and what they mean, and, crucially, how the segments tie together to create structure in the document. It is here, for instance, that the N1 segment describing an individual is assigned PER, describing a contact number, as a child segment. More than one PER can be contained in one N1.
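As a toy sketch of what the data dictionary contributes (segment names taken from the examples in this post; real dictionaries also pin down element counts, data types and repetition limits), the point is that the parent/child relationships live entirely out-of-band, in a structure like this rather than in the data stream:

```python
# A fragment of a hypothetical data dictionary. The flat segment
# format alone says nothing about nesting, so it must be spelled
# out separately, segment by segment.
DATA_DICTIONARY = {
    "N1":  {"name": "Individual",     "children": ["PER"]},
    "PER": {"name": "Contact number", "children": []},
}

# A parser consults the dictionary to learn what may nest where:
per_is_child_of_n1 = "PER" in DATA_DICTIONARY["N1"]["children"]
```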
They created these data dictionaries, hundreds and thousands of pages of them describing countless types of documents, and it was glorious, but it was wrong. Very, very wrong.
To understand where they went wrong, let's investigate another data format: XML. It was introduced more than ten years after the EDI formats with the utopian goal of allowing any two computer systems to talk to each other. Of course, this doesn't happen by magic; programmers still have to write code that speaks XML rather than some other format. The thinking behind it, though, was to come up with a format so standard and ubiquitous that the code needed to work with it would already be part of the basic set of tools any programmer starts out with when creating a piece of software. Largely, they have succeeded, and yet they have produced no data dictionaries at all. The core XML standard is entirely divorced from what data is being represented.
This basic separation of concerns is actually a core facet of software development. The better a programmer is, the better he is at untangling the different pieces that talk to each other and designing them in such a way that they, by and large, don't need to know anything about what other pieces of code are using them. Another word for this is modularization. XML is well-modularized, because its format isn't tied to any particular type of data.
On the surface, XML works quite similarly to EDI. Data is separated into elements, rather than segments, and elements have attributes rather than data elements, but they serve roughly the same purpose. The above "BIG" element could, in XML, be presented like this:
<InvoiceGroup Date="2012-10-04" ID="4207359598" PurchaseOrderNumber="ACW/JUU1287005005" Type="DebitInvoice" />
All the same data is there. Of course, one difference is immediately evident: Instead of simply having to know that, for instance, data element #4 is the purchase order number, the attributes have names. This isn't actually always a good thing, because every time you write out an <InvoiceGroup> element, you have to repeat the names. In any XML document, the names of the attributes tend to be a significant proportion of the amount of data being stored; it isn't very compact. However, this has benefits in other areas. If, for whatever reason, a human being needs to work with XML data, it can often be largely self-explaining.
So far, XML is just prettier, more verbose EDI. Where XML really diverges, though, is in its representation of structure. Rather than structure being part of a data dictionary, so that the precise usage of segments depends on exactly what you're storing, XML makes structure part of the basic representation of data, just as EDI specifies '*' to separate data elements. Consider the following EDI data:
N1**ARTHUR JONES*1*9012345918341*~
PER**HOME*TE*212-555-1234~
PER**WORK*TE*212-555-4321~
N1**ROBERT SMITH*1*9041371238134*~
PER**WORK*TE*212-555-6543~
In trying to understand this document, you must simply know beforehand that PER is subsidiary to N1, so that Arthur Jones has a home and a work telephone number on file, whereas Robert Smith has only his work number. Without the data dictionary, you are left guessing at the relationship, if any, between the N1 and PER segments.
Represented as XML, however, this same data might look like this:
<Individual Name="Arthur Jones" ID="9012345918341">
  <Contact Name="Home" Type="Phone" Number="212-555-1234" />
  <Contact Name="Work" Type="Phone" Number="212-555-4321" />
</Individual>
<Individual Name="Robert Smith" ID="9041371238134">
  <Contact Name="Work" Type="Phone" Number="212-555-6543" />
</Individual>
I am going to assume the reader has a basic familiarity with the structure being represented (most importantly, that everything between "<Individual>" and "</Individual>" is contained by that Individual element). The key thing, which should be quite clear, is that the <Contact> elements have a logical parent, and that parent is very easily identifiable.
EDI does not separate the concern of structure from the concern of meaning, and the result is that it is quite difficult and a lot of work to parse an EDI document. Having parsed one EDI document, writing code to parse another type requires you to essentially start from scratch in specifying what segments are allowed and how they relate to one another. With XML, everything about the nature of the data is representable and understandable using the basic underlying format of XML, without needing to know anything about what the data actually means.
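To make that concrete, here is a minimal Python sketch of parsing the N1/PER data shown above. The separators and element positions are assumed from the earlier examples, and the nesting rule is hardcoded, standing in for a data dictionary; the point is that this rule has to come from somewhere other than the data.

```python
def parse_individuals(text, element_sep="*", terminator="~"):
    """Group flat N1/PER segments into a nested structure.

    The knowledge that PER nests under N1 is hardcoded here; it
    comes from the data dictionary, never from the data itself.
    """
    individuals = []
    for raw in filter(None, (s.strip() for s in text.split(terminator))):
        seg_id, *elements = raw.split(element_sep)
        if seg_id == "N1":
            individuals.append({"name": elements[1], "contacts": []})
        elif seg_id == "PER":
            # Attach to the most recent N1 -- pure prior knowledge.
            individuals[-1]["contacts"].append(
                {"label": elements[1], "number": elements[3]})
    return individuals

edi = ("N1**ARTHUR JONES*1*9012345918341*~"
       "PER**HOME*TE*212-555-1234~"
       "PER**WORK*TE*212-555-4321~"
       "N1**ROBERT SMITH*1*9041371238134*~"
       "PER**WORK*TE*212-555-6543~")
people = parse_individuals(edi)
# Arthur Jones ends up with two numbers on file, Robert Smith with one.
```

An equivalent XML parser needs none of the N1/PER special-casing: the nesting falls out of the format itself.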
Of course, I have omitted certain factors from the story, for simplicity:
  • EDI allows data elements to be subdivided into component data elements. Another separator character is used for this. Thus, there is some minimal structure. However, the basic constraint remains that data elements be in a fixed arrangement within a segment, without any variance in the number or ordering.
  • There are multiple EDI standards, and they have some core differences. In my work so far, I've looked into EDI ANSI ASC X12, used predominantly in North America, and the United Nations EDIFACT standard which is common in most other parts of the world (but especially Europe), and is internationally recognized. A core difference between the standards is that the ANSI X12 standard is closed and commercial, while EDIFACT is openly documented. To view the EDIFACT standards (though I do warn that they are not written to be easily understood!), you can go to the site of the United Nations Economic Commission for Europe (UNECE). To get solid information about ANSI X12, on the other hand, you must commit to purchasing expensive volumes from a private standards body called the Accredited Standards Committee.
  • XML is inarguably much nicer to work with than EDI, but it is still an overwhelmingly complicated standard. Also, other standards have grown up around XML that are, at least in some circumstances, core to working with it, and they add substantial complexity of their own (e.g. XSD, XSLT).
  • A growing number of common file formats have come into existence over the last ten years that are really specific uses of XML. There's a good chance your word processor writes .docx files -- these are really compressed XML files. Many line drawing editors support a format called SVG; these are also XML files with a specific layout. Essentially, this is the same as EDI's concept of a data dictionary, but since these formats are designed around XML, instead of defining structure they simply make use of it. The structure is still present even without a particular meaning attached to the data.
  • In order to work with EDI, many companies have resorted to third-party service providers who offer to translate the data from EDI to XML. Once in XML format, it is much easier to load the document and interpret its contents. In my integration work, this is essentially the strategy that I adopted as well, though rather than pay a monthly fee, I decided to write my own code to parse the invoice data from EDI documents and transform it to XML. Writing this code, especially the part that keeps track of e.g. PER segments being children of N1 segments, was a bit of an adventure!
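The heart of such a transformation can be sketched in a few lines of Python. This is heavily simplified: the separators, the N1/PER relationship and the element positions are all assumed from the examples above, where a real translator would drive all of this from a data dictionary.

```python
import xml.etree.ElementTree as ET

def edi_to_xml(text, element_sep="*", terminator="~"):
    """Translate flat N1/PER segments into nested XML elements."""
    root = ET.Element("Individuals")
    current = None  # the most recent N1, to which PER segments attach
    for raw in filter(None, (s.strip() for s in text.split(terminator))):
        seg_id, *elements = raw.split(element_sep)
        if seg_id == "N1":
            current = ET.SubElement(root, "Individual",
                                    Name=elements[1], ID=elements[3])
        elif seg_id == "PER" and current is not None:
            ET.SubElement(current, "Contact", Name=elements[1],
                          Type="Phone", Number=elements[3])
    return root

edi = ("N1**ARTHUR JONES*1*9012345918341*~"
       "PER**HOME*TE*212-555-1234~"
       "PER**WORK*TE*212-555-4321~")
xml_text = ET.tostring(edi_to_xml(edi), encoding="unicode")
```

Once the data is in this form, any off-the-shelf XML tooling can query it, which is exactly the appeal of the translate-to-XML strategy.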