Parsing XML Into Programming Languages: Jaxp, Dom, Sax, Jdom/Dom4J, Xerces, Xalan, JAXB

This document discusses various strategies for parsing XML into programming languages, including parsing by hand, parsing into a generic tree structure using DOM, and parsing as a sequence of events using SAX. It also covers JAXP, which defines standard Java APIs for XML processing, and JAXB, which allows XML schemas to be automatically mapped to Java classes.

Uploaded by

Jenny Liza

Parsing XML into programming languages
JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB

Prepared By: Mr. Mark Umadhay


Parsing XML
• Goal: read XML files into data structures in
programming languages

• Possible strategies
– Parse by hand with some reusable libraries
– Parse into generic tree structure
– Parse as sequence of events
– Automagically parse to language-specific objects
Parsing by hand
• Advantages
– Complete control
– Good if simple needs – build off of regex package

• Disadvantages
– Must write the initial code yourself, even if it becomes
generalized
– Pretty tedious and error prone.
– Gets very hard when using schema or DTD to validate
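As a rough illustration of the by-hand approach built on a regex package, the sketch below pulls the text out of every <name> element. The element name, the sample document and the class name HandParseSketch are assumptions made up for this example. It only works for flat, well-behaved input (no attributes, nesting, CDATA or entity references), which is exactly why this strategy becomes tedious and error prone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HandParseSketch {
    // Matches <name>text</name> where the text contains no markup.
    private static final Pattern NAME = Pattern.compile("<name>([^<]*)</name>");

    // Collects the trimmed text of every <name> element, in document order.
    static List<String> names(String xml) {
        List<String> out = new ArrayList<>();
        Matcher m = NAME.matcher(xml);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<students><name>Ada</name><name> Grace </name></students>";
        System.out.println(names(xml)); // prints [Ada, Grace]
    }
}
```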
Parsing into generic tree structure
• Advantages
– Industry-wide, language neutral standard exists called DOM
(Document Object Model)
– Learning DOM for one language makes it easy to learn for any
other
– As of JAXP 1.2, support for Schema
– Have to write much less code to get XML to something you want
to manipulate in your program

• Disadvantages
– Non-intuitive API, doesn't take full advantage of Java
– Still quite a bit of work
What is JAXP?
• JAXP: Java API for XML Processing
– In the Java language, the definitions of these standard APIs (together with the XSLT API) make up a set of interfaces known as JAXP
– Java also provides standard implementations, together with a vendor pluggability layer
– Some of these come standard with the J2SDK; others are only available with the Web Services Developer Pack
– We will study these shortly
Another alternative
• JDOM: Native Java published API for
representing XML as tree
• Like DOM but much more Java-specific,
object oriented
• However, not supported by other languages
• Also, no support for schema
• dom4j is another alternative
JAXB
• JAXB: Java Architecture for XML Binding

• Defines an API for automagically representing XML schema as collections of Java classes.

• Most convenient for application programming

• Will cover next class


DOM
About DOM
• Stands for Document Object Model

• A World Wide Web Consortium (W3C) standard

• The standard keeps adding new features; Level 3 Core was just released this month

• We'll cover most of the basics. There's always more, and it's always changing.
DOM abstraction layer in Java: architecture
• Emphasis is on allowing vendors to supply their own DOM implementation without requiring changes to source code
• The factory returns a specific parser implementation
• Parsing produces an org.w3c.dom.Document
Sample Code
// The factory instance selects the parser implementation. It can be changed
// with a runtime system property. The JDK has a default; Xerces is much better.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();

/* set some factory options here */

// From the factory one obtains an instance of the parser.
DocumentBuilder builder = factory.newDocumentBuilder();

// xmlFile can be a java.io.File, an InputStream, etc.
Document doc = builder.parse(xmlFile);

For reference:
javax.xml.parsers.DocumentBuilderFactory
javax.xml.parsers.DocumentBuilder
org.w3c.dom.Document
Notice that the Document class comes from the w3c-specified bindings.
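The factory, builder and parse steps can be put together in a small runnable sketch. The class name DomParseSketch and the inline sample document are assumptions for illustration; the input is read from an in-memory stream rather than a file so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DomParseSketch {
    // Parses an XML string into a DOM Document using the default JAXP parser.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // one example of a factory option
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<course><title>XML</title></course>");
        System.out.println(doc.getDocumentElement().getTagName()); // prints course
    }
}
```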
Validation
• Note that by default the parser will not validate against a schema or DTD

• As of JAXP 1.2, Java provides a default parser that can handle most schema features

• See next slide for details on how to set this up
Important: Schema validation
String JAXP_SCHEMA_LANGUAGE =
"http://java.sun.com/xml/jaxp/properties/schemaLanguage";
String W3C_XML_SCHEMA =
"http://www.w3.org/2001/XMLSchema";

Next, you need to configure DocumentBuilderFactory to generate a namespace-aware, validating parser that uses XML Schema:

… DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
try {
    factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
} catch (IllegalArgumentException x) {
    // Happens if the parser does not support JAXP 1.2 ...
}
Associating document with schema

• An XML file can be associated with a schema in two ways
1. Directly in the XML file, in the regular way
2. Programmatically from Java

• The latter is done as:
– factory.setAttribute(JAXP_SCHEMA_SOURCE, new File(schemaSource));
where JAXP_SCHEMA_SOURCE is the string "http://java.sun.com/xml/jaxp/properties/schemaSource"
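A minimal end-to-end sketch of programmatic schema association follows. The one-element schema, the temporary file and the class name SchemaValidationSketch are assumptions invented for this example. It also assumes the parser in use supports the JAXP 1.2 properties, as the JDK default does. An ErrorHandler is installed because validation errors are reported to the handler rather than thrown.

```java
import java.io.File;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class SchemaValidationSketch {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Hypothetical schema: a single <count> element holding an integer.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='count' type='xs:int'/></xs:schema>";

    // Returns the validation error messages reported while parsing xml.
    static List<String> validate(String xml) throws Exception {
        File schemaFile = File.createTempFile("count", ".xsd");
        schemaFile.deleteOnExit();
        Files.write(schemaFile.toPath(), XSD.getBytes(StandardCharsets.UTF_8));

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        factory.setAttribute(JAXP_SCHEMA_SOURCE, schemaFile);

        List<String> errors = new ArrayList<>();
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) { errors.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate("<count>3</count>"));
        System.out.println(validate("<count>three</count>"));
    }
}
```

The first document should produce an empty list and the second at least one message, since "three" is not an xs:int.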
A few notes
• Factory allows ease of switching parser
implementations
– Java provides simple DOM implementation, but
much better to use vendor-supplied when doing
serious work
– Xerces, part of the Apache project, is installed on the cluster as an Eclipse plugin. We'll use it next week.
– Note that some properties are not supported by
all parser implementations.
Document object
• Once a Document object is obtained, there is a rich API to manipulate it.

• First call is usually
Element root = doc.getDocumentElement();
This gets the root element of the Document as an instance of the Element class

• Note that Element subclasses Node and has methods getNodeType(), getNodeName(), getNodeValue(), and getChildNodes()
Types of Nodes
• Note that there are many types of Nodes (i.e., subclasses of Node):
Attr, CDATASection, Comment, Document, DocumentFragment,
DocumentType, Element, Entity, EntityReference, Notation,
ProcessingInstruction, Text

Each of these has a special and non-obvious associated type, value, and name.

The standards are language-neutral and are specified in the chart on the following slide.

Important: keep this chart nearby when using DOM


Node                   getNodeName()              getNodeValue()                        getAttributes()  getNodeType()
Attr                   Attribute name             Value of attribute                    null             2
CDATASection           #cdata-section             CDATA content                         null             4
Comment                #comment                   Comment content                       null             8
Document               #document                  null                                  null             9
DocumentFragment       #document-fragment         null                                  null             11
DocumentType           Doc type name              null                                  null             10
Element                Tag name                   null                                  NamedNodeMap     1
Entity                 Entity name                null                                  null             6
EntityReference        Name of entity referenced  null                                  null             5
Notation               Notation name              null                                  null             12
ProcessingInstruction  Target                     Entire content excluding the target   null             7
Text                   #text                      Actual text                           null             3
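To see the chart in action, the sketch below walks a small document and records the type code, name and value of each node it meets. The class name NodeChartSketch and the sample document are assumptions for illustration. Attributes are listed but not descended into, since they are not children of their element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeChartSketch {
    // Formats one node as "type name value".
    static String line(Node n) {
        return n.getNodeType() + " " + n.getNodeName() + " " + n.getNodeValue();
    }

    // Depth-first walk: the node itself, then its attributes, then its children.
    static void walk(Node node, List<String> out) {
        out.add(line(node));
        NamedNodeMap attrs = node.getAttributes(); // non-null only for Element
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                out.add(line(attrs.item(i)));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), out);
        }
    }

    static List<String> describe(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        List<String> out = new ArrayList<>();
        walk(doc, out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Prints, one per line:
        // 9 #document null, 1 a null, 2 id 7, 3 #text hi, 8 #comment note
        for (String s : describe("<a id=\"7\">hi<!--note--></a>")) {
            System.out.println(s);
        }
    }
}
```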
Transforming XML
The JAXP Transformation Packages

• JAXP Transformation APIs:


– javax.xml.transform
• This package defines the factory class you use to get a Transformer object. You then
configure the transformer with input (Source) and output (Result) objects, and invoke its
transform() method to make the transformation happen. The source and result objects are
created using classes from one of the other three packages.
– javax.xml.transform.dom
• Defines the DOMSource and DOMResult classes that let you use a DOM as an input to or
output from a transformation.
– javax.xml.transform.sax
• Defines the SAXSource and SAXResult classes that let you use a SAX event generator as
input to a transformation, or deliver SAX events as output to a SAX event processor.
– javax.xml.transform.stream
• Defines the StreamSource and StreamResult classes that let you use an I/O stream as an
input to or output from a transformation.
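As a small sketch of how Source and Result classes from different packages combine, the example below feeds a StreamSource into a DOMResult through an identity transformer (one created without a stylesheet). The class name StreamToDomSketch and the sample document are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Node;

class StreamToDomSketch {
    // Reads XML text from a character stream and delivers it as a DOM tree.
    static Node toDom(String xml) throws Exception {
        // newTransformer() with no arguments yields an identity transformation.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(new StreamSource(new StringReader(xml)), result);
        return result.getNode(); // the node that holds the result tree
    }

    public static void main(String[] args) throws Exception {
        Node doc = toDom("<greeting>hello</greeting>");
        System.out.println(doc.getFirstChild().getNodeName()); // prints greeting
    }
}
```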
Transformer Architecture
Writing DOM to XML
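import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class WriteDOM {
    public static void main(String[] argv) throws Exception {
        // Parse the file named on the command line into a DOM tree
        File f = new File(argv[0]);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(f);

        // An identity transformer copies the DOM source to standard output
        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer();
        DOMSource source = new DOMSource(document);
        StreamResult result = new StreamResult(System.out);
        transformer.transform(source, result);
    }
}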
Creating a DOM from scratch
• Sometimes you may want to create a DOM
tree directly in memory. This is done with:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.newDocument();
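Once the empty Document exists, elements, attributes and text are created through the Document and attached with appendChild. The sketch below builds a tiny tree; the element names, the attribute and the class name BuildDomSketch are assumptions for illustration.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class BuildDomSketch {
    // Builds <students><student id="1">Ada</student></students> in memory.
    static Document build() throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.newDocument();

        Element root = document.createElement("students");
        document.appendChild(root); // a Document holds exactly one root element

        Element student = document.createElement("student");
        student.setAttribute("id", "1");
        student.appendChild(document.createTextNode("Ada"));
        root.appendChild(student);
        return document;
    }

    public static void main(String[] args) throws Exception {
        Element root = build().getDocumentElement();
        System.out.println(root.getTagName());                     // prints students
        System.out.println(root.getFirstChild().getTextContent()); // prints Ada
    }
}
```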
Manipulating Nodes
• Once the root node is obtained, typical tree methods exist to manipulate other elements:
boolean node.hasChildNodes()
NodeList node.getChildNodes()
Node node.getNextSibling()
Node node.getParentNode()
String node.getNodeValue()
String node.getNodeName()
String node.getTextContent()
void node.setNodeValue(String nodeValue)
Node node.insertBefore(Node newChild, Node refChild)
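A short sketch of these tree methods in use: it inserts a new element before the first child of the root and then lists the children by following sibling links. The class name NodeEditSketch and the sample document are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeEditSketch {
    // Lists the names of a node's children by walking the sibling chain.
    static List<String> childNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            names.add(n.getNodeName());
        }
        return names;
    }

    // Parses <list><b/><c/></list> and inserts <a/> in front of <b/>.
    static Element example() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader("<list><b/><c/></list>")));
        Element root = doc.getDocumentElement();
        Element a = doc.createElement("a");
        root.insertBefore(a, root.getFirstChild());
        return root;
    }

    public static void main(String[] args) throws Exception {
        Element root = example();
        System.out.println(root.hasChildNodes()); // prints true
        System.out.println(childNames(root));     // prints [a, b, c]
    }
}
```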
SAX

Simple API for XML


About SAX
• SAX in Java is hosted on SourceForge

• SAX is not a W3C standard

• Originated purely in Java

• Other languages have chosen to implement it in their own ways based on this prototype
SAX vs. …
• Please don't compare unrelated things:
– SAX is an alternative to DOM, but realize that DOM is often built on top of SAX

– SAX and DOM do not compete with JAXP

– They do both compete with JAXB implementations
How a SAX parser works
• A SAX parser scans an XML stream on the fly and responds to certain parsing events as it encounters them.

• This is very different from digesting an entire XML document into memory.

• Much faster, requires less memory.

• However, you need to reparse if you need to revisit data.


Obtaining a SAX parser
• Important classes
javax.xml.parsers.SAXParserFactory
javax.xml.parsers.SAXParser
javax.xml.parsers.ParserConfigurationException

// get the parser
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();

// parse the document
saxParser.parse(new File(argv[0]), handler);
DefaultHandler
• Note that an event handler has to be passed to the SAX parser.

• It must implement the interface org.xml.sax.ContentHandler

• It is easier to extend the adapter org.xml.sax.helpers.DefaultHandler, which is also the handler type that SAXParser.parse() accepts
Overriding Handler methods
• Most important methods to override
– void startDocument()
• Called once when document parsing begins
– void endDocument()
• Called once when parsing ends
– void startElement(...)
• Called each time an element begin tag is encountered
– void endElement(...)
• Called each time an element end tag is encountered
– void characters(...)
• Called one or more times between startElement and endElement calls to deliver accumulated character data
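The handler methods listed above can be watched in action with a handler that simply records each event. This is a minimal sketch; the class name SaxTraceSketch and the sample document are assumptions for illustration. The characters callback is left out here to keep the trace short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTraceSketch extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("startDocument"); }

    @Override public void endDocument() { events.add("endDocument"); }

    @Override public void startElement(String uri, String localName, String qName, Attributes attrs) {
        StringBuilder sb = new StringBuilder("start " + qName);
        // Attribute info is obtained by querying the Attributes object by index.
        for (int i = 0; i < attrs.getLength(); i++) {
            sb.append(" ").append(attrs.getQName(i)).append("=").append(attrs.getValue(i));
        }
        events.add(sb.toString());
    }

    @Override public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    // Parses xml and returns the recorded events in the order they fired.
    static List<String> trace(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        SaxTraceSketch handler = new SaxTraceSketch();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        // prints [startDocument, start a x=1, start b, end b, end a, endDocument]
        System.out.println(trace("<a x=\"1\"><b/></a>"));
    }
}
```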
startElement
• public void startElement(
String namespaceURI, // namespace URI, if one is associated
String sName, // simple (local) name, without prefix
String qName, // qualified name
Attributes attrs) // list of attributes

• Attribute info is obtained by querying the Attributes object.
Characters
• public void characters(
char buf[], // buffer of accumulated chars
int offset, // index of the first char to read
int len) // number of chars

• Note that characters() may be called more than once between a begin tag and its end tag

• Also, mixed-content elements require careful handling
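Because characters() can fire several times for one element, a common pattern is to append every chunk to a buffer and read the buffer only at endElement. The sketch below does this for <title> elements. The element name, the sample document and the class name TextCollectorSketch are assumptions for illustration, and it assumes <title> has no child elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextCollectorSketch extends DefaultHandler {
    private final StringBuilder current = new StringBuilder();
    final List<String> titles = new ArrayList<>();

    @Override public void startElement(String uri, String localName, String qName, Attributes attrs) {
        current.setLength(0); // start a fresh buffer for each element
    }

    @Override public void characters(char[] buf, int offset, int len) {
        current.append(buf, offset, len); // may be one of several chunks
    }

    @Override public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(current.toString().trim()); // the text is complete only here
        }
    }

    static List<String> titles(String xml) throws Exception {
        TextCollectorSketch handler = new TextCollectorSketch();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book><title>SAX</title></book>"
            + "<book><title>DOM &amp; more</title></book></books>";
        System.out.println(titles(xml)); // prints [SAX, DOM & more]
    }
}
```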


Entity references
• Recall that entity references are special character sequences for referring to characters that have special meaning in XML syntax
– '<' is &lt;
– '>' is &gt;
• In SAX these are automatically converted and passed to the characters stream unless they are part of a CDATA section
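The difference between an entity reference in ordinary text and the same characters inside a CDATA section can be checked with a handler that concatenates everything passed to characters(). The class name EntityTextSketch and the sample document are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EntityTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override public void characters(char[] buf, int offset, int len) {
        text.append(buf, offset, len);
    }

    // Returns all character data in the document, concatenated.
    static String text(String xml) throws Exception {
        EntityTextSketch handler = new EntityTextSketch();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // The first &lt; is an entity reference and arrives as '<';
        // the second sits inside CDATA and arrives unchanged.
        System.out.println(text("<m>&lt;<![CDATA[&lt;]]></m>")); // prints <&lt;
    }
}
```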
Choosing a Parser
• Choosing your parser implementation
– If no other factory class is specified, the default SAXParserFactory class is used. To use a different manufacturer's parser, you can change the value of the system property that points to it. You can do that from the command line, like this:
• java -Djavax.xml.parsers.SAXParserFactory=yourFactoryHere ...

• The factory name you specify must be a fully qualified class name (all package prefixes included). For more information, see the documentation of the newInstance() method of the SAXParserFactory class.
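One quick way to confirm which implementation was actually picked up is to print the class of the factory that newInstance() returns. This is only a sketch, and the class name WhichParserSketch is an assumption for illustration.

```java
import javax.xml.parsers.SAXParserFactory;

class WhichParserSketch {
    // Returns the fully qualified class name of the factory JAXP selected.
    static String factoryClassName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(factoryClassName());
    }
}
```

Running it once normally and once with the -Djavax.xml.parsers.SAXParserFactory option set shows whether the override took effect.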
Validating SAX Parsers
String JAXP_SCHEMA_LANGUAGE =
"http://java.sun.com/xml/jaxp/properties/schemaLanguage";
String W3C_XML_SCHEMA =
"http://www.w3.org/2001/XMLSchema";

Next, you need to configure SAXParserFactory to generate a namespace-aware, validating parser that uses XML Schema. SAXParserFactory has no setAttribute() method, so the schema language is set as a property on the parser itself:

… SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
SAXParser saxParser = factory.newSAXParser();
try {
    saxParser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
} catch (SAXNotRecognizedException x) {
    // Happens if the parser does not support JAXP 1.2 ...
}
Transforming arbitrary data structures using SAX and Transformer
Goal
• Now that we know SAX and a little about transformations, there are some cool things we can do.

• One immediate thing is to create XML files from plain text files with the help of a faux SAX parser

• This turns out to be more robust than doing it by hand


Transformers
• Recall that transformers easily let us go between any source and result by arbitrary wirings of
– StreamSource / StreamResult
– SAXSource / SAXResult
– DOMSource / DOMResult

• We used this to write a DOM tree to an XML file

• Now we will use a SAXSource together with a StreamResult to convert our text file
Strategy
• We construct our own SAX "parser", i.e. a class that implements the XMLReader interface

• This class must have a parse method (among others)

• We use parse to read our input file and fire the appropriate SAX events.
What?
• What are we really doing here?

• We're having the SAX "parser" pretend that it has encountered certain SAX XML events when it reads the text file.

• Exactly where we pretend these things occur is where the appropriate XML will get written by the transformer
Main snippet
public static void main(String[] argv) throws Exception {
    // Create the SAX "parser"
    StudentReader parser = new StudentReader();

    // Create the transformer
    TransformerFactory tFactory = TransformerFactory.newInstance();
    Transformer transformer = tFactory.newTransformer();

    // Use the text file as the transformer source
    FileReader fr = new FileReader("students.txt");
    BufferedReader br = new BufferedReader(fr);
    InputSource inputSource = new InputSource(br);
    SAXSource source = new SAXSource(parser, inputSource);

    // Write the resulting XML as text to standard output
    StreamResult result = new StreamResult(System.out);
    transformer.transform(source, result);
}
XMLReader implementation

• To have a valid SAXSource we need a class that implements the XMLReader interface

public void parse(InputSource input)
public void setContentHandler(ContentHandler handler)
public ContentHandler getContentHandler()
.
.
.

• Shown are the important methods for a simple app
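A complete faux reader might look like the sketch below. The slides do not show the StudentReader source or the layout of students.txt, so everything specific here is an assumption: the class name StudentReaderSketch, an input format of one student name per line, and the element names students and student. The XMLReader methods that a simple app does not need are left as no-ops.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.DTDHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;

class StudentReaderSketch implements XMLReader {
    private ContentHandler handler;
    private final AttributesImpl noAttrs = new AttributesImpl();

    // Reads the text input and fires SAX events as if it were parsing XML.
    // Assumed input format: one student name per line.
    public void parse(InputSource input) throws IOException, SAXException {
        BufferedReader br = new BufferedReader(input.getCharacterStream());
        handler.startDocument();
        handler.startElement("", "students", "students", noAttrs);
        String line;
        while ((line = br.readLine()) != null) {
            if (line.trim().isEmpty()) continue; // skip blank lines
            char[] text = line.trim().toCharArray();
            handler.startElement("", "student", "student", noAttrs);
            handler.characters(text, 0, text.length);
            handler.endElement("", "student", "student");
        }
        handler.endElement("", "students", "students");
        handler.endDocument();
    }

    public void parse(String systemId) throws SAXException {
        throw new SAXException("this sketch only supports character-stream input");
    }

    public void setContentHandler(ContentHandler h) { handler = h; }
    public ContentHandler getContentHandler() { return handler; }

    // The remaining XMLReader methods are no-ops in this sketch.
    public boolean getFeature(String name) { return false; }
    public void setFeature(String name, boolean value) { }
    public Object getProperty(String name) { return null; }
    public void setProperty(String name, Object value) { }
    public void setEntityResolver(EntityResolver resolver) { }
    public EntityResolver getEntityResolver() { return null; }
    public void setDTDHandler(DTDHandler h) { }
    public DTDHandler getDTDHandler() { return null; }
    public void setErrorHandler(ErrorHandler h) { }
    public ErrorHandler getErrorHandler() { return null; }

    // Wires the faux reader to an identity transformer and returns the XML text.
    static String toXml(String text) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        SAXSource source = new SAXSource(new StudentReaderSketch(),
            new InputSource(new StringReader(text)));
        StringWriter out = new StringWriter();
        transformer.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("Ada\nGrace\n"));
    }
}
```

Because the transformer's serializer writes the tags and escapes the text, the reader never has to build XML strings itself, which is the robustness gain over writing the output by hand.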


End
