usable in any place a human can be used

20100201

open format

[caption id="attachment_663" align="alignright" width="300" caption="We\'ll just lock your data up in here, don\'t worry we\'ll open it up later if you need it. This is what a closed format sounds like"]bank safe[/caption]

Back in the days when a computer had 64k of RAM and high-capacity meant you had a double-sided floppy disc, there were these funny little things called binary file formats. We still have them today, they are the "pattern" that a program will write and read to and from disc to save and load a file into a program. Some are open, like image file formats, and some are closed, like Microsoft's binary blob formats for Office. As the world has advanced and storage has become dirt cheap people started looking for an easier way, and whenever there is a problem that we don't have a tool for yet, we reached for XML.


XML is actually not a bad fit for this problem domain, its a little on the ugly side, but that's ok, very few people crack open a raw file to read it. The people that are cracking open files in a text editor to peer inside are probably tech-savvy enough to read a little XML. The big bonus of switching to XML, like ODF or DOCX, is that there are very mature XML parsers and writers for almost every programming language. That means that a determined programmer can grab a copy of your format spec, her favorite XML parser, and write a program that operates on your file format. This is the essence of that oh-so-marketing-speak word synergy.


Now I would love to take credit for this awesome idea, but if you've read The Art of Unix Programming you will recognize this as nothing more than an extension of the Rule of Composition. Now that you've read the Rule of Composition (and you should make the time to read the rest of the Art of Unix Programming, even if you never plan on programming anything for Unix, the lessons within are just in general great to know), you will recognize the inherent advantage of having parsable file formats. Now that I have cunningly converted you over to my way of thinking, what format should you use?


XML


XML (eXtensible Markup Language) is the old standby, it is reliable, standardized by the W3C, well-supported, and there are tons of tools around for playing with XML. XML is "human-readable" for those of you unfamiliar with it here is an example.


[xml]
<book>
<title>Example Book</title>
<pages>
<page>
<![CDATA[
...page stuff here...
]]>
</page>
<page>
...you get the idea...
</page>
</pages>
</book>
[/xml]

Its a bit tag-heavy and that means a slightly large file size, this isn't a huge concern since storage is so cheap, but you should be aware of it. XML has a neat feature called DTD (Document Type Definition), armed with this most parsers will be able to tell right away if a document is well formed. XML is big and clunky but a fine choice for many applications, especially if you need typing information.


YAML


YAML (YAML Ain't Markup Language) is the format of choice for the ruby community. YAML is well supported by most mainstream programming languages, it is a more lightweight choice than XML. Here is the same thing as above in YAML.


[ruby]
book: Example Book
pages:
- page: >
...here goes my page data...
- page: >
...you get the idea....
[/ruby]

YAML uses the structure of the text to indicate the structure of the data. Ending tags are dropped and indentation becomes more important. YAML looks simplistic at first but has a wide-array of functionality hiding below the simple hello world examples. References, hashes, arrays, and much more are possible with YAML. The specification allows you to make concise documents that contain an astounding amount of information.


JSON


JSON (JavaScript Object Notation) is a lightweight way to represent data structures. JSON excels by being incredibly simple to learn and use. There is native support for it in JavaScript which makes it ideal for use in AJAX (which would then technically by called AJAJ), and there are JSON parser available in most mainstream languages. Here is the example from above in JSON.


[javascript]
{title: "Example Book", pages: [ page: "...page stuff goes here...", page: "...you get the idea..." ] };
[/javascript]

Just like in JavaScript everything in JSON is a Hash or Array. JSON is a simple typeless data recording system, perfect for dynamic languages. JSON is a proper subset of YAML 1.2 so most YAML parsers can also parse JSON. JSON's incredibly lightweight nature lends itself for being used when sending data over a wire or when space is at a premium.


BSON


BSON (Binary JavaScript Object Notation) is a binary form of JSON. It is meant to be more space efficient than JSON but maintain the ease of parsing. It is described as "A General-Purpose Data Format" and was devised as the serialization medium for the MongoDB project. It is an interesting format to look at if space is at a premium and there are already parsers for C, C++, PHP, Python, and Ruby.




File formats no longer need to be gnarled, tangled messes. We have new standards that allow us to freely share data created in one program to be analyzed, manipulated, and turned into something new in another program. Being able to freely interoperate is the future of computing, it is the driving force behind web standardizations, micro formats, and open file formats. The next time you need to persist data to the file system, resist the urge to roll your own serialization scheme, and check out one of the technologies presented above. The world will thank you.

1 comment:

  1. Great post, I was confused for a moment when you metioned binary them XML (a text markup). I am guilty of rolling my own serialization a number of times, but the data formats were always documented. I've grown to really like json for use in rails.

    Optimal data exchange is gzipped json right (the binary representation is inherent to a zipped file)?

    ReplyDelete