A computer science professor of mine once said, “For me to understand your code, show me your data.” The design of data is central to designing code. It can shape the character of the code. Architectural decisions can turn on an estimation of how much and what kind of data is used during the program execution.
While it’s not uncommon in software applications to read data from relational databases or even flat files with columns of data (CSV or TSV), often a more elegant structure is needed to express more intricate relationships between data. This is where XML and JSON have come into wide use. XML was used for many years, but gradually JSON has taken over as the data format of choice in many applications.
XML and JSON each have some fundamental features that reflect the way data is organized in computer applications:
- data objects with attributes
- hierarchy to express subordinate relationships between data objects
- arrays to gather a possibly large number of similar data objects in one place
Data with attributes is a fundamental concept in computer science. It’s a central feature of object-oriented programming, and before that C and C++ had structs, Lisp had assoc lists and properties. Attributes capture features of data. A data object representing a customer would have details like a first name, last name, age, gender, etc. Data objects with attributes can also express dictionaries, constructs which map from one set of data values to another (like a map of month names to month numbers, “January” is 1, “February” is 2, and so on). This is a powerful way of encoding some intelligence in software, defining associations that reflect meaning between data.
Hierarchy is a common way of expressing a relationship between related objects. A customer might have an address, which in turn has attributes like street name, city, country and mail code. Hierarchy might also involve grouping, like a list of product orders outstanding for a customer.
Arrays provide a way to collect multiple instances of data in one place, offering the opportunity to process the data in a simple loop construct in code. The same programmatic loop can process any amount of data, be it 500 or 5,000,000, and is key for creating powerful code that can flexibly handle arbitrarily large amounts of data.
In the mid-1990s software developers started using XML to define structured data. HTML had been used very successfully to tag elements of a web document to specify their appearance. XML used a very similar tagged notation to specify parts of data and their significance. HTML was designed to be read and interpreted by a web browser. XML was designed to be read mostly by application software.
Here’s an example of XML syntax, representing some data about a customer and their recent orders, demonstrating attributes, hierarchy, and arrays:
(The example here is nicely formatted and indented for readability. In real applications, the newlines and indentation would most likely be stripped away — computers can still read it even if humans can’t)
XML became wildly popular as a way to exchange data between the client and server sides in so-called “multi-tier” applications and was also commonly used to define the format of configuration files for many applications. Software standards and tools were developed to specify, validate and manipulate XML-structured data. DTDs (Data Type Definitions) and later XSchema to express the structure of XML data; XSLT to transform XML data from one format to another — each of these themselves encoded in XML format (XML-like in the case of DTDs).
But the popularity of XML also coincided with the growth of B2B applications. XML began to be used to pass business-critical data between partner corporations large and small, and startup companies like Aruba and Commerce One appeared at this time providing platforms and toolkits for an exchange of data. SOAP (“Simple Object Access Protocol”) was introduced as an XML-based interchange protocol: a common “envelope” of XML headers which provided a way to specify addressing/routing and security, and “payload” section that carried application-specific data to be sent from one computer to another. Other standards were developed for use under the general umbrella of “Electronic Data Interchange” (EDI) for B2B applications.
XML was a powerful standard for structuring data for processing and exchanging data. But it had some quirks and limitations.
It could be very verbose. The leading tag at the start of an XML element defines the content for processing by machines and to be readable by people alike. When you see “Customer” as the start of an XML element, you know what kind of data that element encloses. The trailing tag improves readability slightly for people but doesn’t really add anything for machine readability. Eliminating the closing tag of XML element in favor of a simpler way of terminating the content could measurably reduce the size of the data.
Also, there is no explicit representation of an array element in XML. Collections of similar objects that were intended to be processed as a group were simply put together under a common element. But there’s no explicit indication of this intention in the XML data. A spec in a DTD or XSchema could be created to define this, and it would be clear from reading the code that processes the data that the code is looping to process repeated XML elements.
But XML offers no visual indicator of a data array. It’s possible to create such an indicator by using a wrapping element (like a “<orders>” element around a group of “<order>” elements), but this syntax is not required in XML.
XML does support namespacing, a prefix to the element name indicating that it belongs in a certain group of related tags, most likely originated by a separate organization and governed by a distinct XML schema. It’s useful for organization and validation by a computer (especially for partitioning/classifying the parts of a data exchange: SOAP envelope vs the payload, etc), but adds complexity to parsing of XML as well as visual clutter for the human reader.
Then there’s one of the classic topics of debate in software engineering (right in there with “curly braces on the same line or next line”): should attributes or elements be used for properties of a data object? XML leaves this choice open to the implementer. Details about a Customer object could equally be specified using XML attributes:
…or using subelements of the XML data object:
Attribute names have to be unique to the element, there can’t be more than one. But there can be more than one subelement with the same tag name under any given element.
Subelements have an implicit order that could be treated as significant by the producing and consuming code (without any visual cue). Attributes do not have an explicit order.
There’s kind of a notion that attributes should express an “is-a” relationship to the XML element, whereas subelements express a “has-a” relationship, but in a lot of cases, the decision is a gray area.
Here’s an example of the same customer data, formatted as JSON:
JSON simplified things quite a bit compared to XML format. The only association that can be expressed in JSON is an attribute. Hierarchy is expressed by nested curly braces, where each curly brace-enclosed object is associated with a property of its parent. And there’s no terminating name or label at each level of hierarchy, just a closing curly brace, making JSON a much simpler and more succinct way than XML of encoding a collection of data.
Here is an example of the same customer data represented in Lisp:
And here is a simple Lisp function that interprets the data:
…and a demo of how the function and the data work together:
The first element in a Lisp list is significant. In code, it begins an executable “form” (a function), but in data often serves as a label that is somehow associated with the succeeding elements in the list. As demonstrated in the above code, the “assoc” function looks up data by testing the first element of each of the sublists. This is the equivalent of a dictionary lookup in other programming languages.
However, XML is hardly dead and gone today. Website syndication with RSS is still widely used (it’s a basic feature of WordPress, which powers a significant number of today’s websites), and a recent article suggested that it may stage a comeback. Electronic data interchange (EDI) is still in wide use by major corporations. A recent story about the NotPetya ransomware attacktold of the international shipping firm Maersk and how it was shut down for days when their shipping and logistics EDI would no longer run, resulting in container trucks lined up at shipping terminals and stalled deliveries around the world.
But representing associations between objects as a nested hierarchy doesn’t fit some application domains. One example is social network data, for which GraphQL (championed by Facebook, and still using a JSON-like representation) is often a choice.
RDF (an XML-based representation developed by the W3C Semantic Web group) also expresses non-hierarchical graphs of data using “(subject, predicate, object)” triples, where the “object” part may be a reference to another triple to define a general graph of relationships between data. It’s being used in many projects on the web.
And namespacing that was originally used in XML now finds its way into tag data in HTML (for example, semantic markup like the “twitter:” and “og:” namespaces in Twitter and Facebook card markup).
Plug: LogRocket, a DVR for web apps
LogRocket is a frontend logging tool that lets you replay problems as if they happened in your own browser. Instead of guessing why errors happen, or asking users for screenshots and log dumps, LogRocket lets you replay the session to quickly understand what went wrong. It works perfectly with any app, regardless of framework, and has plugins to log additional context from Redux, Vuex, and @ngrx/store.