MARCH, 2005
The Joy of SAX: XML processing with the SAX Parser
By John Hunt
In this Java Jolt column, we briefly explore SAX (the Simple API for XML),
an API for processing XML documents from Java. There are in fact two
standard APIs for processing XML documents: as well as SAX, there is also
the DOM (Document Object Model). The difference between the two lies in
how they present the XML document to a Java program. SAX provides access
to the data in an XML document as it is read in. In contrast, the DOM loads
the whole XML document into memory in a hierarchical data structure (the
document's object model). It is then possible to traverse the tree to
access the information within it. The end result is that SAX may be faster
and require less memory, but the DOM offers a more sophisticated environment.
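
To make the contrast concrete, here is a minimal sketch that counts elements in the same document both ways. The class name SaxVersusDom and the sample document are invented for this illustration; only the JAXP, SAX and DOM calls are standard.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxVersusDom {

    // Hypothetical sample document, used only for this illustration.
    static final String XML =
        "<CONTACTS><CONTACT><NAME>Denise Cooke</NAME></CONTACT>"
        + "<CONTACT><NAME>John Hunt</NAME></CONTACT></CONTACTS>";

    /** SAX: count elements as the parser reports them; nothing is retained. */
    static int countWithSax(String xml, String elementName) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals(elementName)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return count[0];
    }

    /** DOM: load the whole document into a tree, then query the tree. */
    static int countWithDom(String xml, String elementName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName(elementName).getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("SAX count: " + countWithSax(XML, "CONTACT"));
        System.out.println("DOM count: " + countWithDom(XML, "CONTACT"));
    }
}
```

Both methods report the same count. The SAX version never holds more than the current event, whereas the DOM version keeps the whole tree available for further queries after parsing.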
The JAXP API
The JAXP API (Java API for XML Processing, originally named the Java API
for XML Parsing) from Sun (bundled with Java since Java 2 SDK 1.4)
provides access not only to SAX and DOM parsers but also to XSLT processors.
One area of XML processing that can be confusing within the Java world is
the relationship between the JAXP API and the actual SAX and DOM parsers
and the XSLT processor. It might seem at first sight that having the JAXP
API from Sun means that it is not necessary to have a separate SAX or DOM
parser or XSLT processor. However, the JAXP API is really just a front end
to such parsers and processors. It provides a common front end to different
parsers but is not a parser itself. This is very useful because using a SAX
or a DOM parser directly typically requires knowledge of the specific
implementation of the parser.
The JAXP API allows different parsers to be plugged in via a pluggability
layer. This pluggability mechanism allows a compliant SAX or DOM parser
to be "plugged in" with no visible effect on the JAXP interface.
This scenario is not unique within the Java world: JDBC (the Java Database
Connectivity API) provides a common front end to different database drivers
but does not itself provide an actual connection to a database.
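
One way to see the pluggability layer at work is to ask JAXP which concrete classes it has supplied. This is a small sketch. The class name WhichParser is invented, and the names printed depend entirely on the Java runtime and on any parser found on the classpath.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

class WhichParser {

    /** Returns the name of the concrete SAXParser class that JAXP plugged in. */
    static String saxParserClassName() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser parser = factory.newSAXParser();
        return parser.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory saxFactory = SAXParserFactory.newInstance();
        // The application codes only against the abstract javax.xml.parsers
        // classes; the implementation classes are chosen at run time.
        System.out.println("SAX factory : " + saxFactory.getClass().getName());
        System.out.println("SAX parser  : " + saxParserClassName());
        System.out.println("XMLReader   : "
            + saxFactory.newSAXParser().getXMLReader().getClass().getName());
        System.out.println("DOM factory : "
            + DocumentBuilderFactory.newInstance().getClass().getName());
    }
}
```

The application code stays the same whichever parser is plugged in. The implementation can be switched without recompiling, for example by setting the javax.xml.parsers.SAXParserFactory system property to the name of another factory class.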
In our examples, we have used the default distribution, which includes the
Crimson SAX and DOM parsers. Crimson was derived from the Java Project X
parser from Sun but is now available from Apache. The Xalan XSLT processor
from Apache is also used. Note that the future plan is to move from Crimson
to Xerces. The examples presented here should work with either parser
plugged into JAXP 1.1.
The JAXP is made up of the classes and interfaces in the javax.xml package
and its subpackages. There are four classes in the javax.xml.parsers package
that are used to enable an application to load XML documents. Two relate to
the SAX API and two to the DOM API. These classes are:
| DocumentBuilder | Defines the API to obtain DOM Document instances from an XML document. |
| DocumentBuilderFactory | Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents. |
| SAXParser | Defines the API that wraps an XMLReader implementation class. |
| SAXParserFactory | Defines a factory API that enables applications to configure and obtain a SAX based parser to parse XML documents. |
However, as we are focussing on SAX here, we will ignore the other aspects
of JAXP for the moment.
The SAX API
The SAX API has an interesting history: it was originally a collaborative
effort by members of the XML-DEV mailing list. This effort was coordinated
and finalized by David Megginson, who is still considered the father of
the SAX API.
The SAX API is the de facto standard interface for event-based XML parsing.
It is lightweight and fast with low memory overheads. This is because the
SAX API merely informs the application of each element in turn as it is
found. It does not build up any in-memory structures, nor does it remember
which elements it has already processed. It is up to whatever is using SAX
to do this. Interestingly, SAX is often the basis of higher-level APIs such
as DOM.
To develop a SAX-based application you need an XML parser that supports SAX,
such as Apache's Xerces or Crimson, or the default implementation in the
JAXP distribution. The SAX distribution is a collection of classes and
interfaces that comes with SAX-compliant parsers and as part of JAXP. In
the case of the Crimson implementation, the SAX classes can be found in the
package org.xml.sax and its subpackages. From a Java point of view, SAX
essentially provides a set of event interfaces that must be implemented by
one or more event handlers that will deal with document events such as
"here is a new element".
The majority
of the interfaces in the SAX API are used to represent the concepts that
may be found in an XML document. These interfaces are then implemented
as appropriate by different implementations of the SAX specification (such
as Xerces and Crimson). The key interfaces are presented below.

Figure: The SAX API interfaces
The key interfaces in the SAX API for processing an XML document are:
ContentHandler - This is the main interface that most SAX applications must
implement whenever we want the application to be notified of the elements in
the XML document. The application class must implement this interface and
then call the setContentHandler method on the actual SAX parser used. The
parser uses the instance to report basic document-related events like the
start and end of elements and character data.
EntityResolver - If a SAX application needs to implement customized handling
for external entities, it must implement this interface and register an
instance with the SAX driver using the setEntityResolver method.
ErrorHandler - If a SAX application needs to implement customized error
handling, it must implement this interface and then register an instance
with the XML reader using the setErrorHandler method. The parser will then
report all errors and warnings through this interface. Note that if no error
handler is provided then errors and warnings go unreported; only fatal
errors still surface, as a SAXParseException thrown by the parser.
DTDHandler - If a SAX application needs information about notations and
unparsed entities, then the application implements this interface and
registers an instance with the SAX parser using the parser's setDTDHandler
method. The parser uses the instance to report notation and unparsed entity
declarations to the application. Note that this interface includes only
those DTD events that the XML recommendation requires processors to report:
notation and unparsed entity declarations. This means that if what you want
to do is to process the DTD used by an XML document then you need to look at
the DeclHandler interface in the org.xml.sax.ext package.
SAX2 represents an extension of the original SAX specification and provides
two interfaces in the org.xml.sax.ext package which can be very useful:
DeclHandler - This is an optional extension handler for SAX2 to provide
information about DTD declarations in an XML document. XML readers are not
required to support this handler, and this handler is not included in the
core SAX2 distribution.
LexicalHandler - This is an optional extension handler for SAX2 to provide
lexical information about an XML document, such as comments and CDATA
section boundaries; XML readers are not required to support this handler,
and it is not part of the core SAX2 distribution.
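
Extension handlers have no dedicated setter on the XMLReader. They are registered as properties instead. The sketch below collects the comments in a document through a LexicalHandler. The class name CommentCollector is invented, and the example assumes a Java 5 or later runtime (for org.xml.sax.ext.DefaultHandler2) and a parser that recognises the lexical-handler property; a parser that does not will throw an exception from setProperty.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// DefaultHandler2 adds empty LexicalHandler and DeclHandler implementations
// to DefaultHandler, so only comment() needs to be overridden here.
class CommentCollector extends DefaultHandler2 {

    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length).trim());
    }

    /** Parses the given XML text and returns the comments it contains. */
    static List<String> collect(String xml) throws Exception {
        CommentCollector handler = new CommentCollector();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Standard SAX2 property id for registering a LexicalHandler.
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.comments;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            collect("<a><!-- first --><b/><!-- second --></a>"));
    }
}
```

A DeclHandler is registered in the same way, using the property id http://xml.org/sax/properties/declaration-handler.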
There are other interfaces, the most important of which is the
org.xml.sax.XMLReader interface. An XMLReader is the object that actually
reads an XML document and reports its contents to the registered handlers.
In essence, the SAXParser class provided by the JAXP API wraps a class
implementing the XMLReader interface and uses that class to actually read
the XML document.
When you are actually creating a class or classes that will implement one
or more of the interfaces in the SAX API for event monitoring, it is common
to start with the DefaultHandler class from the org.xml.sax.helpers package.
DefaultHandler is a convenience base class for SAX2 applications: it
provides default implementations for all of the callbacks in the four core
SAX2 handler interfaces (that is, for EntityResolver, DTDHandler,
ContentHandler and ErrorHandler). A developer can extend the DefaultHandler
class when they need to implement only part of an interface.
To actually use SAX to process the contents of an XML document you must
therefore implement one or more interfaces (for example the ContentHandler),
which can be done by extending the DefaultHandler convenience class. Create
an instance of this class and register it with a SAXParser obtained from the
SAXParserFactory provided as part of the JAXP API. Then, when this SAXParser
is used to load an XML document, each element found will be reported to this
instance.
The methods in the ContentHandler interface that must be implemented include:
startDocument() - Notifies beginning of a document
endDocument() - Notifies end of a document
startElement(String namespaceUri, String localName, String qName, Attributes atts) - Notifies start of an element
endElement(…) - Notifies end of element
characters(char ch[], int start, int length) - Notifies character data
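
The following sketch shows how these callbacks typically cooperate. The class name NameCollector and the sample document are invented for the example. It gathers the text of every NAME element, and it accumulates the text in a buffer because a parser is allowed to deliver the character data of one element in several characters() calls.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class NameCollector extends DefaultHandler {

    final List<String> names = new ArrayList<>();
    private StringBuilder text;   // non-null only while inside a NAME element

    @Override
    public void startElement(String namespaceUri, String localName,
                             String qName, Attributes atts) {
        if ("NAME".equals(qName)) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per element, so append each chunk.
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String namespaceUri, String localName, String qName) {
        if ("NAME".equals(qName)) {
            names.add(text.toString().trim());
            text = null;
        }
    }

    /** Parses the given XML text and returns the content of each NAME element. */
    static List<String> namesIn(String xml) throws Exception {
        NameCollector handler = new NameCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document.
        System.out.println(namesIn(
            "<CONTACTS><CONTACT><NAME>Denise &amp; Co</NAME></CONTACT>"
            + "<CONTACT><NAME>John Hunt</NAME></CONTACT></CONTACTS>"));
    }
}
```

Only the three callbacks of interest are overridden; the remaining ContentHandler methods keep the empty implementations inherited from DefaultHandler.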
The Verifier application
To illustrate the use of the SAX API we will look at a simple program that
uses SAX to verify whether an XML document in a file is both well formed
and valid. This program is called Verifier. The Verifier class itself
extends DefaultHandler and therefore implements the four core interfaces of
the SAX API, including ContentHandler and ErrorHandler. We can therefore
register the Verifier instance with the parser object we obtain from the
SAXParser.
To obtain the SAXParser we first obtain a SAXParserFactory instance using
the newInstance() method and then configure the SAXParserFactory object we
obtain. This configuration allows the SAXParserFactory to determine which
SAXParser to provide. You can then either work with the SAXParser object
obtained or, as we do in the Verifier class, obtain the implementation of
the XMLReader that it wraps. This allows us to work directly with the
XMLReader instance and to request that the XML document is parsed by the
XMLReader.
Once the application has set up the parser, it calls the parse method on the
XMLReader within the Verifier.verify() method. This causes the XMLReader
object to notify the Verifier instance of any errors and XML elements. This
is done by calling the startDocument, endDocument, startElement, endElement
etc. methods in the order that matches the contents of the XML file.
The Verifier
application is presented below:
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class Verifier extends DefaultHandler {

    private String uri;
    private XMLReader parser;

    /**
     * Runs the application
     **/
    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("Usage: java Verifier xml-file");
            System.exit(1);
        }
        Verifier v = new Verifier(args[0]);
        v.verify();
    }

    /**
     * Constructor for the class
     * @param file the name of the XML file to load
     **/
    public Verifier(String file) {
        try {
            // Set the URI for the file to load
            uri = "file:" + (new File(file)).getAbsolutePath();
            // Obtain the parser provided by the SAXParser factory
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setValidating(true);
            spf.setNamespaceAware(true);
            parser = spf.newSAXParser().getXMLReader();
            // Set this object to receive notification of both
            // XML elements and of any errors
            parser.setContentHandler(this);
            parser.setErrorHandler(this);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Initiates the processing of the XML file
     **/
    public void verify() {
        try {
            parser.parse(uri);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    //**********************************************************
    // DefaultHandler methods
    //**********************************************************

    public void setDocumentLocator(Locator l) {
        // Useful if need to resolve relative URIs
    }

    public void startDocument() throws SAXException {
        System.out.println("<?xml version='1.0'?>");
    }

    public void endDocument() throws SAXException {
        System.out.println("XML verification of " + uri + " complete");
    }

    public void startElement(String namespaceUri,
                             String localName,
                             String qName, Attributes atts)
            throws SAXException {
        System.out.print("<" + qName);
        for (int i = 0; i < atts.getLength(); i++)
            System.out.print(" " + atts.getLocalName(i) + " = \"" +
                             atts.getValue(i) + "\"");
        System.out.println(">");
    }

    public void endElement(String namespaceUri,
                           String localName,
                           String qName) throws SAXException {
        System.out.println("</" + qName + ">");
    }

    public void characters(char buf[],
                           int start,
                           int length) throws SAXException {
        String output = new String(buf, start, length);
        if (output.length() != 0)
            System.out.print(output);
    }

    public void ignorableWhitespace(char buf[],
                                    int start,
                                    int length)
            throws SAXException {
        // Ignorable - so we do nothing
    }

    public void processingInstruction(String target, String data)
            throws SAXException {
        System.out.println("<?" + target + " " + data + "?>");
    }

    public void startPrefixMapping(String prefix, String uri)
            throws SAXException {
        // We ignore the prefix and uri as namespaceAware
        // property is true
    }

    public void endPrefixMapping(String prefix)
            throws SAXException {}

    public void skippedEntity(String name) throws SAXException {
        // Most parsers will not skip entities so we can ignore
    }

    //*********************************************************
    // Error handling methods
    //*********************************************************

    public void error(SAXParseException exp)
            throws SAXException {
        System.out.println("** Error" +
                           ". line " + exp.getLineNumber() +
                           ", uri " + exp.getSystemId());
        System.out.println(" " + exp.getMessage());
    }

    public void fatalError(SAXParseException exp)
            throws SAXException {
        System.out.println("*** Fatal error" +
                           ". line " + exp.getLineNumber() +
                           ", uri " + exp.getSystemId());
        System.out.println(" " + exp.getMessage());
        throw exp;
    }

    public void warning(SAXParseException exp)
            throws SAXException {
        System.out.println("* Warning" +
                           ". line " + exp.getLineNumber() +
                           ", uri " + exp.getSystemId());
        System.out.println(" " + exp.getMessage());
    }
}
The test XML file we will use is contacts.xml, presented in the figure below.

Figure: contacts.xml
The result
of running the Verifier
application on the contacts.xml file is presented below:
C:\>java Verifier contacts.xml
<?xml version='1.0'?>
* Warning. line 3, uri file:C:\contacts.xml
 Valid documents must have a <!DOCTYPE declaration.
** Error. line 3, uri file:C:\contacts.xml
 Element type "CONTACTS" is not declared.
<CONTACTS>
** Error. line 4, uri file:C:\contacts.xml
 Element type "CONTACT" is not declared.
<CONTACT>
** Error. line 5, uri file:C:\contacts.xml
 Element type "NAME" is not declared.
<NAME>
Denise Cooke</NAME>
** Error. line 6, uri file:C:\contacts.xml
 Element type "ADDRESS" is not declared.
<ADDRESS>
10 High St</ADDRESS>
</CONTACT>
<CONTACT>
<NAME>
John Hunt</NAME>
<ADDRESS>
24 Grange Close</ADDRESS>
</CONTACT>
</CONTACTS>
XML verification of file:C:\contacts.xml complete
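
The warning and errors above arise because the document has no DOCTYPE declaration, so a validating parser has nothing to validate against. As a sketch of what a valid version could look like, the example below validates a document that carries an internal DTD. The DTD is an assumption inferred from the element names in the output, since the original contacts.xml is not reproduced here, and the class name ValidContacts is invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ValidContacts {

    // Assumed DTD and content: reconstructed from the output above,
    // not taken from the original contacts.xml file.
    static final String XML =
        "<?xml version='1.0'?>\n"
        + "<!DOCTYPE CONTACTS [\n"
        + "  <!ELEMENT CONTACTS (CONTACT*)>\n"
        + "  <!ELEMENT CONTACT (NAME, ADDRESS)>\n"
        + "  <!ELEMENT NAME (#PCDATA)>\n"
        + "  <!ELEMENT ADDRESS (#PCDATA)>\n"
        + "]>\n"
        + "<CONTACTS>\n"
        + "  <CONTACT><NAME>Denise Cooke</NAME>"
        + "<ADDRESS>10 High St</ADDRESS></CONTACT>\n"
        + "  <CONTACT><NAME>John Hunt</NAME>"
        + "<ADDRESS>24 Grange Close</ADDRESS></CONTACT>\n"
        + "</CONTACTS>\n";

    /** Returns the warnings and errors reported while validating the text. */
    static List<String> validate(String xml) throws Exception {
        List<String> problems = new ArrayList<>();
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setValidating(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setErrorHandler(new DefaultHandler() {
            @Override
            public void warning(SAXParseException e) {
                problems.add("warning: " + e.getMessage());
            }
            @Override
            public void error(SAXParseException e) {
                problems.add("error: " + e.getMessage());
            }
            // fatalError is inherited: DefaultHandler rethrows the exception.
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("With DTD   : " + validate(XML));
        System.out.println("Without DTD: " + validate("<CONTACTS/>"));
    }
}
```

With the DTD in place the list of problems is empty; without it the parser again reports validity errors, although the exact messages vary between parser implementations.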