50. Compress if space is a problem

50. Compress if space is a problem

Verbosity is a common criticism of XML. However, in practice, most developers' intuitions about the verbosity of XML are wrong. XML documents are almost always smaller than the equivalent binary file format. The sad truth is that most modern software pays little to no attention to optimizing documents for space. However, if your XML documents are so big or your available space so small that size is a real issue, you can simply gzip (or zip or bzip or compress) the XML documents.

For example, consider Microsoft Word. A seventy page chapter including about a dozen screen shots and diagrams from one of my previous books occupied 6.7 MB. Opening that document in OpenOffice 1.0 and immediately resaving it into OpenOffice's native compressed XML format reduced the file's size to 522K, a savings of more than 90%. I unzipped the OpenOffice document into its component parts, and the resulting directory was also 6.7 MB, almost exactly the same size as the original binary file format. Most of that space was taken up by the pictures.

For another example, consider a typical database. One of the fundamental principles of a modern RDBMS is that the physical storage is decoupled from the logical representation. This allows the database to optimize performance by carefully deciding where to place which fields on the disk. Holes are left in the files to allow for insertion of additional data in the future. Indexes are created across the data. Some data may even be duplicated in multiple places if that helps to optimize performance. But one thing that is not optimized is storage space. A typical relational database uses several times to several dozen times the space that would be required purely to store the data without worrying about optimization.

As an experiment, I took a small FileMaker Pro 6 database containing information about 650 books and exported it to XML. The original database was 1.5 megabytes. The exported XML document was only 1.0 megabytes large, a savings of 33%. This is actually on the small side of the savings you can expect by moving to XML, mostly because FileMaker does a better than average job of cramming data into limited space. It's not uncommon to produce XML documents that are as small as 10% of the size of the original database.

Information theory tells us that given a perfectly efficient compression algorithm two documents containing the same information will compress to the same final size, regardless of format. Reasonably fast compression algorithms like gzip and bzip2 aren't perfectly efficient. Nonetheless, in actual tests I've found that comparing gzipped XML documents to the gzipped binary equivalents mostly results in files that are within 10% of each other in size. Whether the gzipped binary file is 10% smaller or 10% larger than the gzipped XML equivalent seems unpredictable. Sometimes it's one way, sometimes the other; but at this point the details are too small to care about.

Java includes built-in support for zip, gzip and inflate/deflate algorithms in the java.util.zip package. These are all implemented as filter streams so it's straight-forward to hook one up to your original source of data, and then pass it to a parser which reads from or writes to the stream as normal. For example, suppose you've built up a DOM document object named doc in memory and you want to serialize it into a file named data.xml.gz in the current working directory. The data in the file will be gzipped. First open a FileOutputStream to the file, chain this to a GZipOutputStream, and then write the document onto the OutputStream as normal. For example, the following code uses Xerces's XMLSerializer class to write a DOM Document object into a compressed file:

Document doc;
// load the document...
try {
  OutputStream fout    = new FileOutputStream("data.xml.gz");
  OutputStream out     = new GZipOutputStream(fout);
  OutputFormat format  = new OutputFormat(document);
  XMLSerializer output = new XMLSerializer(out, format);
  output.serialize(doc);
}
catch (IOException ex) {
  System.err.println(ex);
}

From this point forward you neither need to know nor care that the data is compressed. It's all done behind the scenes automatically.

Input is equally easy. For example, suppose later you want to read data.xml.gz back into your program. Decompression adds just one line of code to hook up the GZipInputStream:

  InputStream fin = new FileInputStream("data.xml.gz");
  InputStream in  = new GZipInputStream(fin);
  DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();   
  DocumentBuilder parser = factory.newDocumentBuilder();
  Document doc = parser.parse(in);
  // work with the document...

Of course, the same techniques work if you need to read or write from the network instead of a file. You'll just hook up the filter streams to network streams rather than file streams.

Similar techniques are available for C and C++. Although compression is not a standard part of the C or C++ libraries, Greg Roelofs, Mark Adler, and Jean-loup Gailly's zlib library <http://www.gzip.org/zlib/> should satisfy most needs. zlib is available in source and binary forms for pretty much all modern platforms. Indeed the java.util.zip package is just a wrapper around calls to this library. Python includes the GzipFile class for convenient access to this same library. The Compress::Zlib module available from CPAN <http://www.cpan.org/modules/by-module/Compress/> performs the same task for Perl. .Net aficionados can use Mike Krueger's open source #ziplib <http://www.icsharpcode.net/OpenSource/SharpZipLib/>.

Finally, if you're serving data over the web, modern web servers and browsers have built in support for compression. They can transparently compress and decompress documents as necessary before transmitting them. Since bandwidth tends to be a lot more expensive and limited on both ends than CPU speed, this is normally a win-win proposition.

By no means should you let fear of fatness stop you from using XML file formats. Most of the time the fear is unfounded. Even in those rare cases where it isn't, standard compression algorithms neatly solve the problem.