1 MARCH, 2005
The Joy of SAX: XML processing with the SAX Parser

By John Hunt

In this Java Jolt column, we are going to briefly explore the SAX (Simple API for XML) API for processing XML documents from Java. There are in fact two standard APIs for processing XML documents. As well as the SAX, there is also the DOM (Document Object Model). The difference between the two relates to how they present the XML document to a Java program. The SAX provides access to the data in an XML document as it is read in. In contrast, the DOM loads the whole XML document into memory in a hierarchical data structure (which is the document's object model). It is then possible to traverse the tree to access the information within it. The end result is that the SAX may be faster and require less memory, but the DOM offers a more sophisticated environment.

The JAXP API
The JAXP API (Java API for XML Processing) from Sun (and bundled with Java since Java 2 SDK 1.4) provides access not only to SAX and DOM parsers but also to XSLT processors.

One area of XML processing that can be confusing within the Java world is the relationship between the JAXP API and the actual SAX (Simple API for XML) and DOM (Document Object Model) parsers and the XSLT processor. It might seem at first sight that having the JAXP API from Sun means that it is not necessary to have a separate SAX or DOM parser or XSLT processor. However, the JAXP API is really just a front end to such parsers and processors. It provides a common front end to different parsers but is not a parser itself. This is very useful because using a SAX or a DOM parser directly typically requires knowledge of the specific implementation of the parser.

The JAXP API allows different parsers to be plugged in via a pluggability layer. This pluggability mechanism allows a compliant SAX or DOM parser to be "plugged in" with no visible effect on the JAXP interface.

This scenario is not unique within the Java world as the JDBC (Java Database Connectivity API) provides a common front end to different database drivers but does not itself provide an actual connection to a database.
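
As a rough illustration of the pluggability layer, the sketch below asks JAXP for a SAXParserFactory and a SAXParser and prints the names of the concrete classes that were plugged in. The class name PluggabilityDemo and its methods are made up for this example, and the names printed will vary from one Java installation to another.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

class PluggabilityDemo {

    /** Returns the name of the concrete factory class that JAXP located. */
    static String factoryImplementation() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    /** Returns the name of the concrete parser class the factory creates. */
    static String parserImplementation() throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        return parser.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        // The application code only names JAXP types; the implementation
        // classes printed here depend on what is installed.
        System.out.println("Factory: " + factoryImplementation());
        System.out.println("Parser:  " + parserImplementation());
    }
}
```

One way to select a different implementation is to set the javax.xml.parsers.SAXParserFactory system property to the name of another factory class; the application code itself does not change.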

In our examples, we have used the default distribution, which includes the Crimson SAX and DOM parsers. Crimson was derived from the Java Project X parser from Sun but is now available from Apache. The Xalan XSLT processor from Apache is also used. Note that the future plan is to move from Crimson to Xerces. The examples presented here should work with either parser plugged into JAXP 1.1.

The JAXP is made up of the classes and interfaces in the javax.xml package and its subpackages. There are four classes in the javax.xml.parsers package that are used to enable an application to load XML documents. Two relate to the SAX API and two to the DOM API. These classes are:

DocumentBuilder - Defines the API to obtain DOM Document instances from an XML document.
DocumentBuilderFactory - Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents.
SAXParser - Defines the API that wraps an XMLReader implementation class.
SAXParserFactory - Defines a factory API that enables applications to configure and obtain a SAX based parser to parse XML documents.
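
To make the roles of these four classes a little more concrete, here is a minimal sketch that counts the CONTACT elements in a small document twice: once through DocumentBuilderFactory and DocumentBuilder, and once through SAXParserFactory and SAXParser. The class name FourClassesDemo and the inline sample document are invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class FourClassesDemo {

    // Hypothetical sample document, invented for this example.
    static final String XML =
        "<CONTACTS><CONTACT><NAME>Denise Cooke</NAME></CONTACT>"
        + "<CONTACT><NAME>John Hunt</NAME></CONTACT></CONTACTS>";

    static ByteArrayInputStream input() {
        return new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8));
    }

    /** DOM route: the whole document becomes a tree that we then query. */
    static int countWithDom() throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = dbf.newDocumentBuilder();
        Document doc = builder.parse(input());
        return doc.getElementsByTagName("CONTACT").getLength();
    }

    /** SAX route: elements are counted as the parser reports them. */
    static int countWithSax() throws Exception {
        final int[] count = {0};
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser parser = spf.newSAXParser();
        parser.parse(input(), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("CONTACT".equals(qName)) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("DOM count: " + countWithDom());
        System.out.println("SAX count: " + countWithSax());
    }
}
```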

However, as we are focussing on SAX here we will ignore the other aspects of JAXP for the moment.

The SAX API
The SAX API has an interesting history: it was originally a collaborative effort by members of the XML-DEV mailing list. This effort was coordinated and finalized by David Megginson, who is still considered the father of the SAX API.

The SAX API is the de facto standard interface for event-based XML parsing. It is lightweight and fast, with low memory overheads. This is because the SAX API merely informs the application using it of each element it has found in turn. It does not build up any in-memory structures, nor does it remember which elements it has already processed. It is up to whatever is using the SAX to do this. Interestingly, the SAX is often used as the basis of higher-level APIs; for example, a DOM tree can be built from SAX events.
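
Because the parser keeps no record of where it is in the document, any such bookkeeping falls to the handler. The following sketch, with a made-up class name PathTracker, keeps its own stack of open elements so that it can record the full path of each element as it is reported.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PathTracker extends DefaultHandler {

    // The handler, not the parser, remembers which elements are open.
    private final Deque<String> open = new ArrayDeque<>();
    final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        open.addLast(qName);
        paths.add("/" + String.join("/", open));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.removeLast();
    }

    /** Parses the given XML text and returns the path of every element. */
    static List<String> pathsOf(String xml) throws Exception {
        PathTracker tracker = new PathTracker();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), tracker);
        return tracker.paths;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, invented for this example.
        String xml = "<CONTACTS><CONTACT><NAME>Denise Cooke</NAME>"
                   + "</CONTACT></CONTACTS>";
        for (String path : pathsOf(xml)) {
            System.out.println(path);
        }
    }
}
```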

To develop a SAX-based application you need an XML parser that supports SAX, such as Apache's Xerces or Crimson, or the default implementation in the JAXP distribution. The SAX distribution is a collection of classes and interfaces that comes with SAX-compliant parsers and as part of JAXP. In the case of the Crimson implementation, the SAX can be found in the package org.xml.sax and its subpackages. From a Java point of view, SAX essentially provides a set of event interfaces that must be implemented by one or more event handlers that will deal with document events such as "here is a new element".

The majority of the interfaces in the SAX API are used to represent the concepts that may be found in an XML document. These interfaces are then implemented as appropriate by different implementations of the SAX specification (such as Xerces and Crimson). The key interfaces are presented below.


Figure: The SAX API interfaces

The key interfaces in the SAX API for processing an XML document are:

ContentHandler - This is the main interface that most SAX applications must implement when the application needs to be notified of the elements in the XML document. The application class must implement this interface and then call the setContentHandler method on the actual SAX parser used. The parser uses the instance to report basic document-related events like the start and end of elements and character data.

EntityResolver - If a SAX application needs to implement customized handling for external entities, it must implement this interface and register an instance with the SAX driver using the setEntityResolver method.

ErrorHandler - If a SAX application needs to implement customized error handling, it must implement this interface and then register an instance with the XML reader using the setErrorHandler method. The parser will then report all errors and warnings through this interface. Note that if no error handler is registered, warnings and recoverable errors go unreported; only fatal errors surface, as a SAXParseException thrown by the parser.

DTDHandler - If a SAX application needs information about notations and unparsed entities, then the application implements this interface and registers an instance with the SAX parser using the parser's setDTDHandler method. The parser uses the instance to report notation and unparsed entity declarations to the application. Note that this interface includes only those DTD events that the XML recommendation requires processors to report: notation and unparsed entity declarations. This means that if what you want to do is to process the DTD used by an XML document then you need to look at the DeclHandler interface in the org.xml.sax.ext package.
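
As an illustration of the ErrorHandler interface described above, here is a minimal sketch of a handler that simply collects whatever the parser reports. The class name CollectingErrorHandler, the sample documents and the small DTD embedded in them are all invented for this example; the exact message texts depend on the parser in use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class CollectingErrorHandler implements ErrorHandler {

    // Hypothetical DTD and documents, invented for this example.
    static final String DTD =
        "<!DOCTYPE CONTACT [<!ELEMENT CONTACT (NAME)>"
        + "<!ELEMENT NAME (#PCDATA)>]>";
    static final String VALID =
        DTD + "<CONTACT><NAME>John Hunt</NAME></CONTACT>";
    static final String INVALID =
        DTD + "<CONTACT><NAME>John Hunt</NAME><PHONE>555</PHONE></CONTACT>";

    final List<String> problems = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        problems.add("warning: " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        // Validity errors are recoverable, so parsing continues.
        problems.add("error: " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXParseException {
        problems.add("fatal: " + e.getMessage());
        throw e;
    }

    /** Parses with validation switched on and returns the reported problems. */
    static List<String> validate(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setValidating(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        CollectingErrorHandler handler = new CollectingErrorHandler();
        reader.setErrorHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Valid document:   " + validate(VALID));
        System.out.println("Invalid document: " + validate(INVALID));
    }
}
```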

SAX2 represents an extension of the original SAX specification and provides two interfaces in the org.xml.sax.ext package which can be very useful:

DeclHandler - This is an optional extension handler for SAX2 to provide information about DTD declarations in an XML document. XML readers are not required to support this handler, and this handler is not included in the core SAX2 distribution.

LexicalHandler - This is an optional extension handler for SAX2 to provide lexical information about an XML document, such as comments and CDATA section boundaries; XML readers are not required to support this handler, and it is not part of the core SAX2 distribution.
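
Extension handlers are not registered with a dedicated setter; instead they are passed to the XMLReader as a property. The sketch below registers a LexicalHandler through the SAX2 property http://xml.org/sax/properties/lexical-handler and collects the comments in a document. The class name CommentCollector is invented here, and it extends org.xml.sax.ext.DefaultHandler2 purely for convenience. As noted above, a reader is free not to support this property, in which case setProperty throws an exception.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class CommentCollector extends DefaultHandler2 {

    final List<String> comments = new ArrayList<>();

    // LexicalHandler callback: invoked for each comment in the document.
    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length).trim());
    }

    /** Parses the given XML text and returns the text of its comments. */
    static List<String> commentsIn(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CommentCollector collector = new CommentCollector();
        // Optional SAX2 extension: unsupported readers throw here.
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", collector);
        reader.parse(new InputSource(new StringReader(xml)));
        return collector.comments;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, invented for this example.
        String xml = "<CONTACTS><!-- first entry --><CONTACT/></CONTACTS>";
        System.out.println(commentsIn(xml));
    }
}
```

A DeclHandler is registered in the same way, using the property http://xml.org/sax/properties/declaration-handler.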

There are other interfaces, but the most important of these is the org.xml.sax.XMLReader interface. An XMLReader is the object that actually reads an XML document and reports its contents to the registered handlers through callbacks. In essence, the SAXParser class provided by the JAXP API wraps a class implementing the XMLReader interface and uses that class to actually read the XML document.
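
The relationship between SAXParser and XMLReader can be sketched as follows: the wrapped reader is fetched with getXMLReader(), a handler is registered on it, and its parse method is called directly. The class name XmlReaderDemo and the sample document are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class XmlReaderDemo {

    /** Returns the element names in the order the reader reports them. */
    static List<String> elementNames(String xml) throws Exception {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        // The JAXP SAXParser hands out the XMLReader it wraps.
        XMLReader reader = saxParser.getXMLReader();

        final List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        });

        // parse() returns nothing: everything arrives through the handler.
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, invented for this example.
        String xml = "<CONTACTS><CONTACT><NAME>John Hunt</NAME>"
                   + "</CONTACT></CONTACTS>";
        System.out.println(elementNames(xml));
    }
}
```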

When you are actually creating a class or classes that will implement one or more of the interfaces in the SAX API for event monitoring, it is common to start with the DefaultHandler class from the org.xml.sax.helpers package. DefaultHandler is a convenience base class for SAX2 applications: it provides default implementations for all of the callbacks in the four core SAX2 handler interfaces (that is, for EntityResolver, DTDHandler, ContentHandler and ErrorHandler). A developer can extend the DefaultHandler class when they need to implement only part of an interface.

To actually use the SAX to process the contents of an XML document you must therefore implement one or more interfaces (for example the ContentHandler), which can be done by extending the DefaultHandler convenience class. Create an instance of this class and register it with a SAXParser obtained from the SAXParserFactory provided as part of the JAXP API. Then, when this SAXParser is used to load an XML document, each element found will be reported to this instance.

The methods in the ContentHandler interface that must be implemented include:

startDocument() - Notifies beginning of a document
endDocument() - Notifies end of a document
startElement(String namespaceUri,
                    String localName,
                    String qName,
                    Attributes atts) - Start of an element
endElement(…) - Notifies end of element
characters(char ch[], int start, int length) - Notifies character data
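
One detail of the characters method is worth a sketch: a parser is free to deliver a single run of text in several characters calls, so a handler that wants the complete text of an element should accumulate it between startElement and endElement. The class name NameCollector and the sample document below are invented for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class NameCollector extends DefaultHandler {

    private final StringBuilder buffer = new StringBuilder();
    private boolean inName;
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if ("NAME".equals(qName)) {
            buffer.setLength(0);   // start a fresh run of text
            inName = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per element, so append.
        if (inName) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("NAME".equals(qName)) {
            names.add(buffer.toString());
            inName = false;
        }
    }

    /** Parses the given XML text and returns the text of each NAME element. */
    static List<String> namesIn(String xml) throws Exception {
        NameCollector collector = new NameCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), collector);
        return collector.names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, invented for this example.
        String xml = "<CONTACTS><CONTACT><NAME>Cooke &amp; Hunt</NAME>"
                   + "</CONTACT></CONTACTS>";
        System.out.println(namesIn(xml));
    }
}
```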

The Verifier application
To illustrate the use of the SAX API we will look at a simple program that uses the SAX to verify whether an XML document in a file is both well formed and valid. This program is called Verifier. The Verifier class itself extends DefaultHandler and therefore implements the four core interfaces of the SAX API including ContentHandler and ErrorHandler. We can therefore register the Verifier instance with the parser object we obtain from the SAXParser.

To obtain the SAXParser we first obtain a SAXParserFactory instance using the newInstance() method and then configure the SAXParserFactory object we obtain. This configuration allows the SAXParserFactory to determine which SAXParser to provide. You can then either work with the SAXParser object obtained or, as we do in the Verifier class, obtain the implementation of the XMLReader that it wraps. This allows us to work directly with the XMLReader instance and to request that the XML document is parsed by the XMLReader.

Once the application has set up the parser, it calls the parse method on the XMLReader within the Verifier.verify() method. This causes the XMLReader object to notify the Verifier instance of any errors and XML elements. This is done by calling the startDocument, endDocument, startElement, endElement etc. methods in the order that matches the contents of the XML file.

The Verifier application is presented below:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class Verifier extends DefaultHandler {
    private String uri;
    private XMLReader parser;

    /**
     * Runs the application
     */
    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("Usage: java Verifier xml-file");
            System.exit(1);
        }
        Verifier v = new Verifier(args[0]);
        v.verify();
    }

    /**
     * Constructor for the class
     * @param file the name of the XML file to load
     */
    public Verifier(String file) {
        try {
            // Set the URI for the file to load
            uri = "file:" + (new File(file)).getAbsolutePath();
            // Obtain the parser provided by the SAXParser factory
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setValidating(true);
            spf.setNamespaceAware(true);
            parser = spf.newSAXParser().getXMLReader();
            // Set this object to receive notification of both
            // XML elements and of any errors
            parser.setContentHandler(this);
            parser.setErrorHandler(this);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Initiates the processing of the XML file
     */
    public void verify() {
        try {
            parser.parse(uri);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    //**********************************************************
    // DefaultHandler methods
    //**********************************************************

    public void setDocumentLocator(Locator l) {
        // Useful if need to resolve relative URIs
    }

    public void startDocument() throws SAXException {
        System.out.println("<?xml version='1.0'?>");
    }

    public void endDocument() throws SAXException {
        System.out.println("XML verification of " + uri + " complete");
    }

    public void startElement(String namespaceUri,
                             String localName,
                             String qName,
                             Attributes atts) throws SAXException {
        System.out.print("<" + qName);
        for (int i = 0; i < atts.getLength(); i++)
            System.out.print(" " + atts.getLocalName(i) + " = \"" +
                             atts.getValue(i) + "\"");
        System.out.println(">");
    }

    public void endElement(String namespaceUri,
                           String localName,
                           String qName) throws SAXException {
        System.out.println("</" + qName + ">");
    }

    public void characters(char buf[],
                           int start,
                           int length) throws SAXException {
        String output = new String(buf, start, length);
        if (output.length() != 0)
            System.out.print(output);
    }

    public void ignorableWhitespace(char buf[],
                                    int start,
                                    int length) throws SAXException {
        // Ignorable - so we do
    }

    public void processingInstruction(String target, String data)
                                      throws SAXException {
        System.out.println("<?" + target + " " + data + "?>");
    }

    public void startPrefixMapping(String prefix, String uri)
                                   throws SAXException {
        // We ignore the prefix and uri as namespaceAware
        // property is true
    }

    public void endPrefixMapping(String prefix) throws SAXException {}

    public void skippedEntity(String name) throws SAXException {
        // Most parsers will not skip entities so we can ignore
    }

    //*********************************************************
    // Error handling methods
    //*********************************************************

    public void error(SAXParseException exp) throws SAXException {
        System.out.println("** Error" +
                           ". line " + exp.getLineNumber() +
                           ", uri " + exp.getSystemId());
        System.out.println(" " + exp.getMessage());
    }

    public void fatalError(SAXParseException exp) throws SAXException {
        System.out.println("*** Fatal error" +
                           ". line " + exp.getLineNumber() +
                           ", uri " + exp.getSystemId());
        System.out.println(" " + exp.getMessage());
        throw exp;
    }

    public void warning(SAXParseException exp) throws SAXException {
        System.out.println("* Warning" +
                           ". line " + exp.getLineNumber() +
                           ", uri " + exp.getSystemId());
        System.out.println(" " + exp.getMessage());
    }
}

The test XML file we will use is contacts.xml, presented in the figure below.



Figure: contacts.xml

The result of running the Verifier application on the contacts.xml file is presented below:

C:\ >java Verifier contacts.xml
<?xml version='1.0'?>
* Warning. line 3, uri file:C:\contacts.xml
  Valid documents must have a <!DOCTYPE declaration.
** Error. line 3, uri file:C:\contacts.xml
  Element type "CONTACTS" is not declared.
<CONTACTS>
  ** Error. line 4, uri file:C:\contacts.xml
  Element type "CONTACT" is not declared.
<CONTACT>
    ** Error. line 5, uri file:C:\contacts.xml
  Element type "NAME" is not declared.
<NAME>
Denise Cooke</NAME>
    ** Error. line 6, uri file:C:\contacts.xml
  Element type "ADDRESS" is not declared.
<ADDRESS>
10 High St</ADDRESS>
  </CONTACT>
  <CONTACT>
  <NAME>
John Hunt</NAME>
    <ADDRESS>
24 Grange Close</ADDRESS>
  </CONTACT>
</CONTACTS>
XML verification of file:C:\contacts.xml complete
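
The warning and errors in this run arise because the document has no DOCTYPE declaration, so a validating parser has nothing to validate it against. As a rough sketch of the difference a DTD makes, the example below parses the same kind of document with and without an internal DTD subset and counts the validation errors reported. The class name DtdComparison and the DTD itself are invented for this illustration, modelled only on the element names visible in the output above; the number and wording of the errors for the DTD-less document depend on the parser in use.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DtdComparison {

    // Body modelled on the element names visible in the Verifier output.
    static final String BODY =
        "<CONTACTS>"
        + "<CONTACT><NAME>Denise Cooke</NAME>"
        + "<ADDRESS>10 High St</ADDRESS></CONTACT>"
        + "<CONTACT><NAME>John Hunt</NAME>"
        + "<ADDRESS>24 Grange Close</ADDRESS></CONTACT>"
        + "</CONTACTS>";

    // Hypothetical DTD, invented for this example.
    static final String DOCTYPE =
        "<!DOCTYPE CONTACTS ["
        + "<!ELEMENT CONTACTS (CONTACT*)>"
        + "<!ELEMENT CONTACT (NAME, ADDRESS)>"
        + "<!ELEMENT NAME (#PCDATA)>"
        + "<!ELEMENT ADDRESS (#PCDATA)>"
        + "]>";

    /** Parses with validation on and returns the number of errors reported. */
    static int validationErrors(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setValidating(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        final int[] errors = {0};
        reader.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors[0]++;   // recoverable validity error
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Without DTD: " + validationErrors(BODY)
                           + " error(s)");
        System.out.println("With DTD:    " + validationErrors(DOCTYPE + BODY)
                           + " error(s)");
    }
}
```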




   




© Copyright 2001 - 2005 by ADA Communications Ltd. All rights reserved. Statements of opinion and fact are made on the responsibility of the authors alone and do not imply an opinion on the part of ADA Communications Ltd or the editorial staff. Registered in England No. 04843018