Summary

XML is a standard textual markup language suitable for encoding almost any sort of data. It works very well both for unstructured narrative data written by people and for the record-oriented data common in computer applications. About the only thing it is not really suitable for is digitized binary content such as photographs and recorded sound.

Logically, an XML document is made up of nested elements. Each element has a name, a set of attributes, and some content. The content can include plain text, other elements, or both. The attributes are name-value pairs associated with the element. Each document has a single topmost element called the root or document element. Since all non-root elements nest completely inside other elements, an XML document has a natural tree structure. Besides elements and text nodes, XML documents can also contain comments, processing instructions, an XML declaration, and a document type declaration.
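The tree structure described above can be seen by walking a parsed document. The following is only a minimal sketch in Python using the standard library's xml.etree.ElementTree module; the Order document and the outline function are hypothetical, invented here for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
SAMPLE = """<Order id="1001">
  <Customer>Jane Doe</Customer>
  <Item>
    <Name>Widget</Name>
    <Quantity>17</Quantity>
  </Item>
</Order>"""


def outline(element, depth=0):
    """Return one indented line per element, visiting the tree depth first."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an element yields its child elements
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)  # root is the document element, Order
print("\n".join(outline(root)))
print(root.attrib)  # the root element's attributes as a dict
```

Each level of indentation in the printed outline corresponds to one level of element nesting, with the root element at the top.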

Syntactically, elements are delimited by tags that look like <Quantity>, </Quantity>, and <Quantity/>. <Quantity> is a start-tag that must be matched by the corresponding end-tag </Quantity>. The content of the Quantity element comes between these two tags. <Quantity/> is an empty-element tag that represents a Quantity element with no content. Attributes are indicated by name="value" pairs inside start-tags and empty-element tags. For example, <Quantity number="17"/> is an empty Quantity element that has a number attribute with the value 17.
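As a rough illustration of these tag forms, the sketch below parses the two Quantity examples with Python's standard xml.etree.ElementTree module. The variable names are hypothetical; the point is only to show how element content and attribute values surface once the tags have been parsed.

```python
import xml.etree.ElementTree as ET

# A start-tag and end-tag with content between them.
with_content = ET.fromstring("<Quantity>17</Quantity>")

# An empty-element tag carrying a number attribute.
empty = ET.fromstring('<Quantity number="17"/>')

print(with_content.tag, with_content.text)  # element name and its text content
print(empty.tag, empty.text)                # an empty element has no text (None)
print(empty.get("number"))                  # attribute values are always strings
```

Note that the attribute value comes back as the string "17", not as a number; any conversion to a numeric type is left to the application.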

Physically, an XML document is divided into storage units called entities. These entities can be files, database records, data structures in memory, or something else. The document entity contains the root element of the document. Parsed entities contain XML text and markup that will be merged to form the entire document. Parsed entities are located via general entity references such as &anaconda; in the document entity or another parsed entity. Unparsed entities contain non-XML, possibly binary data that will be identified by ENTITY type attributes in the document.
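A general entity reference can be tried out with a small internal entity. This is only a sketch under stated assumptions: the snakes document and the replacement text for anaconda are hypothetical, and the entity is declared in the document's own internal DTD subset rather than stored in a separate file, since Python's xml.etree.ElementTree does not load external entities.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the anaconda entity is declared in the internal
# DTD subset, and its replacement text is invented for illustration.
DOC = """<!DOCTYPE snakes [
  <!ENTITY anaconda "a very large snake">
]>
<snakes><snake>&anaconda;</snake></snakes>"""

root = ET.fromstring(DOC)

# The parser merges the entity's replacement text into the document,
# so the application sees the text rather than the reference.
snake_text = root.find("snake").text
print(snake_text)
```

By the time the application sees the tree, the reference has been replaced, and nothing in the parsed result records that the text came from an entity.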

Every XML document must be well-formed. Among other things, this means that every start-tag must have a matching end-tag, every attribute value must be quoted, and only certain characters can be used in element names. If a document is not well-formed, it is not an XML document, and XML parsers will not accept it. Beyond well-formedness, documents that have a schema may be (but do not have to be) valid. A valid document adheres to all the constraints listed in the schema. Schema languages include Document Type Definitions (DTDs), the W3C XML Schema Language, and the XPath-based Schematron.
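The well-formedness rules just listed can be checked by handing text to a parser and seeing whether it is rejected. The sketch below uses Python's standard xml.etree.ElementTree module; is_well_formed is a hypothetical helper name. It tests well-formedness only, since this parser does not validate against a DTD or any other schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser refuses anything that is not well-formed.
        return False
    return True


print(is_well_formed("<Quantity>17</Quantity>"))  # matching start-tag and end-tag
print(is_well_formed("<Quantity>17"))             # missing end-tag
print(is_well_formed("<Quantity number=17/>"))    # unquoted attribute value
print(is_well_formed("<17Quantity/>"))            # name starts with a digit
```

Only the first of the four strings is accepted; each of the others breaks one of the rules named above.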

XML markup normally focuses on the structure and semantics of the information it contains rather than on its appearance. Before a document can be shown to a human reader, it must therefore be associated with a style sheet that tells the browser or other tool how to format it for display. The two most popular style languages are Cascading Style Sheets (CSS) and the Extensible Stylesheet Language (XSL). CSS is a non-XML declarative language for applying simple styles such as font-weight to elements of certain types. XSL is actually two separate XML applications: the XSL-FO page description language and XSLT, a Turing-complete functional language. An XSLT style sheet is used to transform a source XML document into the XSL-FO vocabulary. However, XSLT can also be used to transform to other XML vocabularies such as XHTML.


Copyright 2001, 2002 Elliotte Rusty Harold, elharo@metalab.unc.edu. Last Modified July 29, 2001