Output Streams, Writers, and Encodings


Most of the time you don’t want to dump an XML document to System.out. Instead you want to write it to a file or onto a network socket. You might even store it in a string and pass it to some other process. What connects all these possible targets in Java is the java.io.OutputStream class. Files, sockets, and even strings can all be treated as just another kind of stream.
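To make the point about interchangeable targets concrete, here is a minimal sketch that writes the same small document to any OutputStream and then captures it in a string by way of a ByteArrayOutputStream. The class name StreamTargets, its method names, and the greeting element are my own illustrations, and the UncheckedIOException wrapper assumes Java 8 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical example class; none of these names come from a library.
public class StreamTargets {

  // Writes a tiny document to whatever stream the caller supplies:
  // a file, a socket, or an in-memory buffer.
  public static void writeDocument(OutputStream target) throws IOException {
    Writer out = new OutputStreamWriter(target, "UTF-8");
    out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n");
    out.write("<greeting>Hello</greeting>\r\n");
    out.flush();  // push any buffered characters through to the stream
  }

  // Captures the document in memory and returns it as a string.
  public static String documentAsString() {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      writeDocument(buffer);
      return buffer.toString("UTF-8");
    }
    catch (IOException e) {
      // Not expected for an in-memory stream, but the API requires handling it.
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.print(documentAsString());
  }

}
```

The same writeDocument() method would work unchanged if it were handed a FileOutputStream or the stream returned by a socket’s getOutputStream() method.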

XML documents are text, and that text is made up of Unicode characters. When those characters are actually written onto a stream, you need to pick a character encoding that specifies how each character is converted into bytes. This encoding can be one of the Unicode encodings such as UTF-8 or UTF-16, or it can be a local code page such as ISO-8859-1 or MacRoman. Characters that don’t exist in the local code page can be escaped using numeric character references. The encoding declaration should name the encoding actually in use. The normal way Java converts characters into bytes in a specific encoding is by chaining an OutputStreamWriter to an OutputStream. As chars and strings are written onto the OutputStreamWriter, they are converted to bytes in the specified encoding, and those bytes are then written onto the underlying OutputStream.
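As a rough illustration of the numeric character reference technique, the sketch below replaces each character a given encoding cannot represent with a hexadecimal reference. It assumes the java.nio.charset package (Java 1.4 and later) and StringBuilder (Java 5 and later). The class and method names are hypothetical, and the method deliberately does not escape markup characters such as < and &, which is a separate job.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper; the names are illustrative only.
public class CharacterReferences {

  // Returns the text with every character the named encoding cannot
  // represent replaced by a hexadecimal numeric character reference.
  // Does NOT escape <, >, or &.
  public static String escapeUnencodable(String text, String charsetName) {
    CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
    StringBuilder result = new StringBuilder();
    int i = 0;
    while (i < text.length()) {
      // Work in code points so surrogate pairs stay together
      int codePoint = text.codePointAt(i);
      int length = Character.charCount(codePoint);
      String piece = text.substring(i, i + length);
      if (encoder.canEncode(piece)) {
        result.append(piece);
      }
      else {
        result.append("&#x");
        result.append(Integer.toHexString(codePoint).toUpperCase());
        result.append(';');
      }
      i += length;
    }
    return result.toString();
  }

}
```

For instance, the Euro sign is not part of Latin-1, so escapeUnencodable("\u20AC5", "ISO-8859-1") returns the string &#x20AC;5, while an é passes through untouched.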

Let’s suppose you want to dump the Fibonacci numbers into a file called fibonacci.xml in the current working directory. First you would open a FileOutputStream to that file, like this:

OutputStream fout = new FileOutputStream("fibonacci.xml");

If performance is at all a concern, you would immediately chain this FileOutputStream to a BufferedOutputStream like this:

OutputStream bout = new BufferedOutputStream(fout);

Then you would chain the BufferedOutputStream to an OutputStreamWriter. You’d pass the Java name of the encoding you want as the second argument to the OutputStreamWriter() constructor. For example, this line chooses the ISO-8859-1 (Latin-1) encoding, though it uses Java’s name for this encoding, “8859_1”:

OutputStreamWriter out = new OutputStreamWriter(bout, "8859_1");

Finally, you’d write the output onto that OutputStreamWriter, making sure to include the right encoding declaration using the XML name for the encoding. Example 3.9 demonstrates.

Example 3.9. A Java program that writes an XML file

import java.math.BigInteger;
import java.io.*;


public class FibonacciFile {

  public static void main(String[] args) {
   
      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;      
      
      try {        
        OutputStream fout= new FileOutputStream("fibonacci.xml");
        OutputStream bout= new BufferedOutputStream(fout);
        OutputStreamWriter out 
         = new OutputStreamWriter(bout, "8859_1");
      
        out.write("<?xml version=\"1.0\" ");
        out.write("encoding=\"ISO-8859-1\"?>\r\n");  
        out.write("<Fibonacci_Numbers>\r\n");  
        for (int i = 1; i <= 10; i++) {
          out.write("  <fibonacci index=\"" + i + "\">");
          out.write(low.toString());
          out.write("</fibonacci>\r\n");
          BigInteger temp = high;
          high = high.add(low);
          low = temp;
        }
        out.write("</Fibonacci_Numbers>\r\n"); 
        
        out.flush();  // Don't forget to flush!
        out.close();
      }
      catch (UnsupportedEncodingException e) {
        System.out.println(
         "This VM does not support the Latin-1 character set."
        );
      }
      catch (IOException e) {
        System.out.println(e.getMessage());        
      }

  }

}

One change from the System.out version is that the line breaks have to be encoded explicitly. They’re not really necessary for this XML document, but the examples are prettier if the XML isn’t just one long line of text. I recommend using a carriage-return/linefeed pair (\r\n) as your line break. This is the native format for DOS and Windows, and most Unix and Macintosh text editors can handle it. More importantly it is the standard line ending for network protocols such as HTTP and SMTP. XML parsers do normalize all line breaks to a single linefeed on input, so the proper choice of line break for an XML document is not nearly as fraught as for some other types of files. Nonetheless, picking carriage-return/linefeed does help when processing or transmitting XML documents with non-XML-aware tools.

Although most XML parsers written in Java support exactly those encodings that are available in Java, they don’t use the same names. Java tends to use underscores where XML uses hyphens in encoding names, or to eliminate the hyphens completely. The reason is that earlier Java virtual machines used reflection to locate the classes that convert between different encodings, so the encoding names had to be legal Java class names. Table 3.1 lists the Java and XML equivalents of the standard character sets and encodings. Later versions of Java, especially Java 1.4, often allow multiple names for the same encoding. Here I’ve picked the names that are supported across the broadest range of virtual machines.
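If you can assume Java 1.4 or later, the java.nio.charset.Charset class can do some of this name mapping for you. The following sketch, with a hypothetical class name of my own choosing, asks Java for the canonical name behind an older Java-style name. For the handful of charsets every Java platform must support, that canonical name is the one you’d put in the encoding declaration. For other encodings, treat the result as a hint and check it against the table.

```java
import java.nio.charset.Charset;

// Hypothetical example class; only Charset itself is a real API.
public class EncodingNames {

  // Maps a Java encoding name or alias to the charset's canonical name.
  // Throws an unchecked exception if the name is unknown or illegal.
  public static String canonicalName(String javaName) {
    return Charset.forName(javaName).name();
  }

  public static void main(String[] args) {
    System.out.println(canonicalName("8859_1")); // prints ISO-8859-1
    System.out.println(canonicalName("UTF8"));   // prints UTF-8
  }

}
```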

Table 3.1. Standard Character Sets and Encodings

XML Name	Java Name	First supported in Java	Scripts and Languages
ISO-8859-1	8859_1	1.1	Latin-1: ASCII plus the accented characters needed for most Western European languages including Albanian, Basque, Breton, Catalan, Cornish, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Frisian, Galician, German, Greenlandic, Icelandic, Irish, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Sorbian, Spanish, and Swedish as well as many non-European languages written in the Latin alphabet such as Swahili and Malaysian
ISO-8859-2	8859_2	1.1	Latin-2: ASCII plus the accented characters needed for most Central European languages including Albanian, Croatian, Czech, Finnish, German, Hungarian, Latin, Polish, Romanian, Slovak, Slovenian, and Sorbian
ISO-8859-3	8859_3	1.1	Latin-3: ASCII plus the accented characters needed for most Southern European languages including English, Esperanto, Finnish, French, German, Italian, Latin, Maltese, Portuguese, and Turkish
ISO-8859-4	8859_4	1.1	Latin-4: ASCII plus the accented characters needed for most Northern European languages including Danish, English, Estonian, Finnish, German, Greenlandic, Latin, Latvian, Lithuanian, Norwegian, Sámi, Slovenian, and Swedish
ISO-8859-5	8859_5	1.1	ASCII plus Cyrillic
ISO-8859-6	8859_6	1.1	ASCII plus Arabic
ISO-8859-7	8859_7	1.1	ASCII plus Greek
ISO-8859-8	8859_8	1.1	ASCII plus Hebrew
ISO-8859-9	8859_9	1.1	Latin-5: same as Latin-1 except the Turkish letters Ğ, ğ, İ, ı, Ş, and ş take the place of the Icelandic letters þ, Þ, ý, Ý, Ð, and ð
ISO-8859-13	ISO8859_13	1.3	Latin-7: ASCII plus the accented characters needed for most Baltic languages including Latvian, Lithuanian, Estonian, and Finnish, as well as English, Danish, Swedish, German, Slovenian, and Norwegian.
ISO-8859-15	ISO8859_15_FDIS	1.2	Latin-9: same as Latin-1 but with the Euro sign € instead of the international currency symbol ¤. It also replaces the infrequently used symbol characters ¦, ¨, ´, ¸, ¼, ½, and ¾ with the infrequently used French and Finnish letters Š, š, Ž, ž, Œ, œ, and Ÿ.
UTF-8	UTF8	1.1	The default encoding of XML documents; each Unicode character is represented in between 1 and 4 bytes.
UTF-16	UnicodeBig or UnicodeLittle	1.2	An encoding of Unicode in which characters in the Basic Multilingual Plane are encoded in two bytes, and all other characters are encoded as two two-byte surrogates
ISO-10646-UCS-2	N/A	N/A	A straightforward encoding in which each Unicode character is represented as a two-byte integer; cannot represent characters outside the Basic Multilingual Plane
ISO-10646-UCS-4	N/A	N/A	A straightforward encoding in which each Unicode character is represented as a four-byte integer
ISO-2022-JP	JIS	1.1	Japanese
Shift_JIS	SJIS	1.1	Japanese
EUC-JP	EUCJIS	1.1	Japanese
US-ASCII	ASCII	1.2	English
GBK	GBK	1.1	Simplified Chinese
Big5	Big5	1.1	Traditional Chinese
ISO-2022-CN	ISO2022CN	1.1	Traditional Chinese
ISO-2022-KR	ISO2022KR	1.1	Korean

I have deliberately omitted XML legal encodings that are not yet supported by Java such as ISO-8859-10, ISO-8859-11, ISO-8859-14, and ISO-8859-16. It’s not hard to add them in Java 1.4; but since they’re not available by default, you’re better off picking UTF-8 or one of the other encodings of Unicode.

The exact list of encodings Java supports varies from virtual machine to virtual machine and version to version. Java 1.4 is a major leap forward in support for many character sets, as well as for different aliases for character set names. However, since the standard Unicode and ISO encodings let you handle most environments today, there’s little reason to use other encodings in XML documents.
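Because the list varies, it can be worth asking the virtual machine what it actually supports before committing to an encoding. Here is a small sketch along those lines, again assuming Java 1.4 or later for the Charset class. The class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical example class; the names are illustrative only.
public class SupportedEncodings {

  // Reports whether this virtual machine supports the named encoding.
  public static boolean isAvailable(String name) {
    try {
      return Charset.isSupported(name);
    }
    catch (IllegalCharsetNameException e) {
      // The name contains characters that no charset name may contain
      return false;
    }
  }

  public static void main(String[] args) {
    // The output differs from one VM and version to the next
    System.out.println(Charset.availableCharsets().keySet());
    System.out.println("ISO-8859-16 available? " + isAvailable("ISO-8859-16"));
  }

}
```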

Some parsers, including Xerces-J, have an option to recognize the Java names, as well as all the other encodings Java supports. This can be useful when you’re reading XML documents sent to you by other people and systems. However, you should not generate documents that use these encodings. The standard encodings in Table 3.1 should be sufficient for any document you need to create, and are a lot more cross-platform compatible than platform-specific code pages such as Cp1252 and MacRoman. In a few cases you might prefer to use standard non-Unicode, non-ISO national character sets such as KS C 5601 for Korean or KOI8-R for Cyrillic. These are OK too, but still a little less well recognized around the world than the standard encodings shown in Table 3.1. The general principle is to be conservative in what you generate, but liberal in what you accept. Try to stick to the most standard encodings when writing documents, but accept any encoding you recognize when reading documents.

Other encodings that are available in Java but that you should not use for XML include UTF-16BE and UTF-16LE. These are big-endian and little-endian encodings of Unicode without an explicit byte order mark. XML documents in UTF-16 must begin with a byte order mark; it may not be omitted. UTF-8 documents may have a byte order mark, but for maximum compatibility with other software they generally should not.
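You can see the difference by looking at the first bytes each encoding produces. This sketch uses the modern charset names UTF-16, UTF-16BE, and UTF-16LE rather than the older names in Table 3.1, and the class and method names are hypothetical. In current Java versions the plain UTF-16 encoder writes a big-endian byte order mark, while the BE and LE variants write none.

```java
import java.nio.charset.Charset;

// Hypothetical example class; the names are illustrative only.
public class ByteOrderMarks {

  // Returns the first two encoded bytes of the text as four hex digits.
  // The text must not be empty.
  public static String firstTwoBytes(String text, String charsetName) {
    byte[] bytes = text.getBytes(Charset.forName(charsetName));
    return String.format("%02X%02X", bytes[0] & 0xFF, bytes[1] & 0xFF);
  }

  public static void main(String[] args) {
    System.out.println(firstTwoBytes("<", "UTF-16"));   // FEFF, the byte order mark
    System.out.println(firstTwoBytes("<", "UTF-16BE")); // 003C, no byte order mark
    System.out.println(firstTwoBytes("<", "UTF-16LE")); // 3C00, no byte order mark
  }

}
```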

Note

Output streams, output stream writers, files, Unicode, character sets and character encodings, and many other aspects of input and output in Java are covered in much more detail in my book Java I/O (O’Reilly & Associates, 1999, ISBN 1-56592-485-1).

Copyright 2001, 2002 Elliotte Rusty Harold	elharo@metalab.unc.edu	Last Modified May 24, 2002