Parsing large XML files with Woodstox library

Posted on Sep 16, 2012 in Code Snippets, General Programming, NoSql, Tools, by admin. 5 comments.


Today I wanted to play a bit with a library that supports StAX (the Streaming API for XML, JSR-173). I have recently worked a bit with XML parsers in Ruby (such as LibXml) while reading Seven Databases in Seven Weeks by E. Redmond and J. R. Wilson, which is, by the way, a really great book IMHO. For the last couple of weeks the topic of streaming large XML files has come up quite often, even though I don't actually use large XML files at work.
Anyway, while playing around with CouchDB recently I thought it would be nice to create a POC with some Java StAX library that would help me import large datasets into the database for later use while I learn some NoSql bits.
As in the ‘Seven Databases..’ book, I decided to use an XML file available online from the Jamendo service: a database dump in XML form listing Artists, Albums, Tracks and Tags. The file I managed to download is about 150MB in size, which may not be huge, but it is large enough for testing (it contains information about over 27 thousand artists).

The file is available here

I believe the most popular StAX implementation is the Woodstox library, and because it offers some nice and simple configuration options I decided to use it and play around with it.

First of all, I created some simple POJO classes to hold the information retrieved from the Jamendo XML file. The file contains a list of Artists, each Artist has a number of Albums, each Album consists of one or more Tracks, and each Track can be tagged with zero or more Tags.
The POJOs are dead simple, so I will include the source of just the Artist; full sources are available online at: https://github.com/softberries/woodstoxex

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class Artist implements Serializable {

    private String id;
    private String name;
    private String url;
    private String mbgid;
    private String image;
    private String country;
    private String city;
    private String latitude;
    private String longitude;
    private String state;
    // Album is one of the other POJOs in the project
    private List<Album> albums = new ArrayList<Album>();

    public String getState() {
        return state;
    }

    public void setState(String state) {
        this.state = state;
    }

    public String getId() {
        return id;
    }

    // ...... the rest of getters and setters ......
}

Once we have a model ready to be filled with data, we can start building the streaming parser that will do that. Using the XML streaming reader is dead simple and consists of the following steps:

get an XMLInputFactory (actually an XMLInputFactory2 object)
create XMLStreamReader instance
call hasNext() and next() like you would iterate any Iterable
on each iteration check the element type and read its contents when applicable
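
The four steps can be sketched end to end with nothing but the standard javax.xml.stream API that Woodstox implements. This is a minimal sketch, not the code from the project: the class name StaxStepsSketch, the method countStartElements and the sample XML are made up for illustration, and it runs on whichever StAX implementation is on the classpath (Woodstox if its jar is present, otherwise the one bundled with the JDK).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxStepsSketch {

    // Hypothetical sample input, not the real Jamendo dump.
    public static final String SAMPLE =
        "<artists><artist><name>A</name></artist><artist><name>B</name></artist></artists>";

    /** Counts the START_ELEMENT events whose local name equals elementName. */
    public static int countStartElements(String xml, String elementName) {
        try {
            // Step 1: get the factory
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Step 2: create the stream reader
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            int count = 0;
            try {
                // Step 3: iterate with hasNext() / next()
                while (reader.hasNext()) {
                    int eventType = reader.next();
                    // Step 4: check the event type and read what we need
                    if (eventType == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
            return count;
        } catch (XMLStreamException e) {
            // Wrapped so callers of this sketch need no checked exception handling
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("artist elements: " + countStartElements(SAMPLE, "artist"));
    }
}
```

Reading from a StringReader keeps the sketch self-contained; for the real dump you would pass a FileInputStream instead, as in the snippets that follow.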

First, our XMLInputFactory2 object:

XMLInputFactory2 xmlif2 = (XMLInputFactory2)XMLInputFactory2.newInstance();

I have actually moved the creation of this object to a separate class with some static factory methods, where I initialize it with different configuration settings, but more about that later.

Getting the XMLStreamReader is no more difficult than the previous step:

XMLStreamReader2 reader = (XMLStreamReader2)xmlif2.createXMLStreamReader(fileName, new FileInputStream(fileName));

where fileName is the name (including the path) of our XML data source.

Once the reader object is ready and initialised with the file handle, we can start processing. This is done by iterating over the XML events and checking whether the current one is a start element, an end element or just content between XML tags (there are more event types than just start/end/content, but I'll not cover them here).

logger.info("Starting to parse " + fileName);
try {
    XMLStreamReader2 reader = (XMLStreamReader2) xmlif2.createXMLStreamReader(fileName, new FileInputStream(fileName));
    int eventType = 0;
    String curElement = "";

    // Objects currently being built; null means we are not inside that element
    Artist artist = null;
    Album album = null;
    Track track = null;
    Tag tag = null;

    while (reader.hasNext()) {
        eventType = reader.next();
        switch (eventType) {
            case XMLEvent.START_ELEMENT:
                // Opening tag: start a new object for one of the four main elements
                curElement = reader.getName().toString();
                if (ARTIST.equals(curElement)) {
                    artist = new Artist();
                } else if (ALBUM.equals(curElement)) {
                    album = new Album();
                } else if (TRACK.equals(curElement)) {
                    track = new Track();
                } else if (TAG.equals(curElement)) {
                    tag = new Tag();
                }
                break;
            case XMLEvent.CHARACTERS:
                // Text content: set it as a property of the innermost open object
                String content = reader.getText();
                Object obj = null;
                if ((obj = getCurrentlyActiveObject(tag, track, album, artist)) != null)
                    PropertyUtils.setProperty(obj, curElement, content);
                break;
            case XMLEvent.END_ELEMENT:
                // Closing tag: hand the finished object over to its parent
                curElement = reader.getName().toString();
                if (ARTIST.equals(curElement)) {
                    processArtist(artist);
                    artist = null;
                } else if (ALBUM.equals(curElement)) {
                    artist.addAlbum(album);
                    album = null;
                } else if (TRACK.equals(curElement)) {
                    album.addTrack(track);
                    track = null;
                } else if (TAG.equals(curElement)) {
                    track.addTag(tag);
                    tag = null;
                }
                break;
            case XMLEvent.END_DOCUMENT:
                logger.info("document parsing finishing..");
                break;
        }
    }
} catch (Exception ex) {
    ex.printStackTrace();
}

That's maybe not the most beautiful code you can find online, but for the purpose of this tutorial it's more than enough. All it does is check whether the current element is one of the four mentioned (Artist, Album, Track or Tag); if so, it initialises the object when START_ELEMENT is encountered and processes it when the END_ELEMENT for this tag is found.
We are interested in creating an Artist object with all the fields filled in, so once an Artist tag is found we create an Artist object and then process its children (Albums, Tracks and Tags). When we encounter the END_ELEMENT for Artist we know that we have come back up the XML tree to the closing Artist node, so we can finish processing that Artist and start from scratch.
The filled-in Artist object can then be processed further inside ‘processArtist(artist);’. For the purpose of this tutorial we just count each fully created Artist object.
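
To make the START_ELEMENT / END_ELEMENT bookkeeping concrete, here is a cut-down sketch of the same idea on a tiny in-memory document. It is an illustration under assumptions, not the project's code: the element names (artist, album, name), the sample XML and the class and method names are invented, the objects are reduced to a name and a list, fields are set directly instead of through reflection, and it uses the plain javax.xml.stream API rather than the Woodstox-specific classes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class NestedParseSketch {

    // Simplified stand-ins for the real POJOs
    public static final class Album { public String name; }
    public static final class Artist {
        public String name;
        public final List<Album> albums = new ArrayList<>();
    }

    // Hypothetical sample input, not the real Jamendo structure
    public static final String SAMPLE =
        "<artists>"
      + "<artist><name>First</name>"
      + "<album><name>One</name></album><album><name>Two</name></album></artist>"
      + "<artist><name>Second</name></artist>"
      + "</artists>";

    /** Returns every artist that was completely read, in document order. */
    public static List<Artist> parse(String xml) {
        List<Artist> processed = new ArrayList<>();
        Artist artist = null;
        Album album = null;
        String curElement = "";
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask for each text node in one piece so a plain assignment is enough
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        curElement = reader.getLocalName();
                        if ("artist".equals(curElement)) {
                            artist = new Artist();
                        } else if ("album".equals(curElement)) {
                            album = new Album();
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if ("name".equals(curElement) && !reader.isWhiteSpace()) {
                            // The innermost open object receives the value
                            if (album != null) {
                                album.name = reader.getText();
                            } else if (artist != null) {
                                artist.name = reader.getText();
                            }
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        // Choice made for this sketch: forget the element name so
                        // text between elements is never assigned to a property
                        curElement = "";
                        if ("album".equals(reader.getLocalName())) {
                            artist.albums.add(album);
                            album = null;
                        } else if ("artist".equals(reader.getLocalName())) {
                            processed.add(artist); // stands in for processArtist(artist)
                            artist = null;
                        }
                        break;
                    default:
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("artists parsed: " + parse(SAMPLE).size());
    }
}
```

Collecting the artists in a list is only for demonstration; with a 150MB file you would hand each finished Artist over (to a counter, a database import and so on) and let it go, which is the whole point of streaming.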

The interesting bit is probably the way we handle the remaining tags, which are the properties of our main objects. This is done with simple reflection, using PropertyUtils from the Apache Commons BeanUtils library:

if ((obj = getCurrentlyActiveObject(tag, track, album, artist)) != null)
    PropertyUtils.setProperty(obj, curElement, content);

where ‘getCurrentlyActiveObject’ returns the object currently being processed (the innermost one which is not null at this moment). The null checks go from the innermost element outwards:

private Object getCurrentlyActiveObject(Tag tag, Track track, Album album, Artist artist) {
    if (tag != null) {
        return tag;
    } else if (track != null) {
        return track;
    } else if (album != null) {
        return album;
    } else if (artist != null) {
        return artist;
    }
    return null;
}

where ‘tag’ is a child of ‘track’, ‘track’ is a child of ‘album’, etc.
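
For readers who have not used PropertyUtils: conceptually the call finds the setter that matches the property name and invokes it. The sketch below imitates that with plain java.lang.reflect so it runs without Commons BeanUtils on the classpath. It is a simplified stand-in, not how PropertyUtils is implemented; the Track bean, setProperty and firstNonNull are hypothetical names, and only String properties are handled.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

final class SetPropertySketch {

    // Minimal bean used only for this illustration
    public static final class Track {
        private String name;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    /**
     * Calls setXxx(String) for the given property name.
     * Returns false when the bean has no such setter (the real
     * PropertyUtils.setProperty throws NoSuchMethodException in that case).
     */
    public static boolean setProperty(Object bean, String property, String value) {
        String setterName = "set"
            + Character.toUpperCase(property.charAt(0))
            + property.substring(1);
        try {
            Method setter = bean.getClass().getMethod(setterName, String.class);
            setter.invoke(bean, value);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        } catch (IllegalAccessException | InvocationTargetException e) {
            throw new IllegalStateException("Could not set " + property, e);
        }
    }

    /** Generic form of the innermost-first null check: first non-null candidate wins. */
    public static Object firstNonNull(Object... candidates) {
        for (Object candidate : candidates) {
            if (candidate != null) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Track track = new Track();
        // Order matters: innermost object first, as in getCurrentlyActiveObject
        Object active = firstNonNull(null, track, "album placeholder");
        setProperty(active, "name", "Intro");
        System.out.println(track.getName());
    }
}
```

Mapping element names straight onto setters keeps the parser loop free of per-field code, at the price that every element name inside an object has to match a bean property.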

A very nice feature of Woodstox I have discovered is the ability to tweak the processing with a single method call while creating the ‘XMLInputFactory2’ object.
The available profiles are:

configureForXmlConformance()

Method to call to make Reader created conform as closely to XML standard as possible, doing all checks and transformations mandated by the XML specification (linefeed conversions, attr value normalizations).

configureForConvenience()

Method to call to make Reader created be as “convenient” to use as possible; ie try to avoid having to deal with some of things like segmented text chunks. This may incur some slight performance penalties, but should not affect XML conformance.

configureForSpeed()

Method to call to make the Reader created be as fast as possible reading documents, especially for long-running processes where caching is likely to help. This means reducing amount of information collected (ignorable white space in prolog/epilog, accurate Location information for Event API), and possibly even including simplifying handling of XML-specified transformations (skip attribute value and text linefeed normalization). Potential downsides are somewhat increased memory usage (for full-sized input buffers), and reduced XML conformance (will not do some of transformations).

configureForLowMemUsage()

Method to call to minimize the memory usage of the stream/event reader; both regarding Objects created, and the temporary memory usage during parsing. This generally incurs some performance penalties, due to using smaller input buffers.

configureForRoundTripping()

Method to call to make Reader try to preserve as much of input formatting as possible, so that round-tripping would be as lossless as possible. This means that the matching writer should be able to reproduce output as closely matching input format as possible (most implementations won’t be able to provide 100% vis-a-vis; white space between attributes is generally lost, as well as use of character entities).
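
One of the things these profiles trade off is the "segmented text chunks" mentioned under configureForConvenience(). The effect can be seen with the standard StAX property IS_COALESCING alone, which, as far as I understand, is among the settings the convenience profile switches on. The sketch below is an assumption-laden illustration: it uses only javax.xml.stream so it runs without Woodstox, the class name, method name and sample XML are invented, and the exact number of chunks you get without coalescing depends on the parser implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class CoalescingSketch {

    // Text, a CDATA section and more text inside a single element
    public static final String SAMPLE = "<name>foo<![CDATA[bar]]>baz</name>";

    /** Returns the text of every CHARACTERS or CDATA event, in order. */
    public static List<String> textChunks(String xml, boolean coalescing) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Standard StAX switch; with Woodstox you would instead call one of the
        // configureForXxx() methods on the XMLInputFactory2 before creating readers
        factory.setProperty(XMLInputFactory.IS_COALESCING, coalescing);
        List<String> chunks = new ArrayList<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int eventType = reader.next();
                if (eventType == XMLStreamConstants.CHARACTERS
                        || eventType == XMLStreamConstants.CDATA) {
                    chunks.add(reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println("without coalescing: " + textChunks(SAMPLE, false));
        System.out.println("with coalescing:    " + textChunks(SAMPLE, true));
    }
}
```

This matters for the parsing loop shown earlier: it assigns the text of each CHARACTERS event directly to a property, so a value delivered in several chunks would end up holding only the last one unless the reader coalesces the text.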

I was expecting parsing a 160MB XML file to take a bit longer, but simply counting the fully created Artist objects takes about 5-8 seconds on my machine for the whole set.

Anyway, I wanted to compare some of the options available in Woodstox, so I created 3 simple benchmarks using the Google Caliper microbenchmarking framework.

I have tested the following options: configureForLowMemUsage(), configureForSpeed() and configureForXmlConformance().
With execution times of 5-8 seconds the differences are not big, but they are still visible:

Benchmarks were run with the standard Google Caliper settings and executed with the ‘--printScore’ flag, with 10 runs for each benchmark case.

Execution times:

Measured using the standard currentTimeMillis over 5 runs for each case.

All tests were executed on my MacBook Pro, 2.2GHz Intel Core i7, 8GB 1333MHz RAM.
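
The execution times were taken with plain wall-clock timing. A harness along the following lines would do the job; it is a rough sketch rather than the code behind the published numbers, and TimingSketch, timeRuns, average and the dummy workload are made-up names. In a real measurement the task would be a full parse of the Jamendo file with one of the factory profiles.

```java
final class TimingSketch {

    /** Runs the task the given number of times and returns each duration in milliseconds. */
    public static long[] timeRuns(Runnable task, int runs) {
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.currentTimeMillis();
            task.run();
            elapsed[i] = System.currentTimeMillis() - start;
        }
        return elapsed;
    }

    /** Arithmetic mean of the measured durations. */
    public static double average(long[] values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return values.length == 0 ? 0.0 : (double) sum / values.length;
    }

    public static void main(String[] args) {
        // Dummy workload standing in for "parse the whole XML file"
        Runnable task = () -> {
            long total = 0;
            for (int i = 0; i < 5_000_000; i++) {
                total += i % 7;
            }
            if (total < 0) {
                System.out.println("unreachable, keeps the loop from being optimised away");
            }
        };
        long[] times = timeRuns(task, 5);
        System.out.println("average over 5 runs: " + average(times) + " ms");
    }
}
```

Timing of this kind is coarse and includes JIT warm-up in the first runs, which is one reason to use a microbenchmarking framework such as Caliper alongside it.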

Actual heap allocations as well as CPU usage looked similar in all three cases:

The full source code (a Git + Maven project) is available from GitHub at: https://github.com/softberries/woodstoxex

5 responses on “Parsing large XML files with Woodstox library”

Tomasz Dziurko September 16, 2012 at 11:26 am

Thanks for sharing. I am not a big fan of parsing XMLs but Google Caliper for benchmarking looks interesting, didn’t know about it before.

szimano September 17, 2012 at 9:12 am

Hey good stuff. Would be nice to see a comparison with the default SAX java implementation 😉

Maciej Biłas September 18, 2012 at 6:36 pm

Krzysiek, cool stuff.

What does the Caliper score represent? It looks like an inverse of the execution time, but how is it measured and what is it useful for? You know, if it’s only dependent on the execution time, why bother computing some derivative values?

admin September 18, 2012 at 7:10 pm

The Caliper score is meaningless if you do something other than compare results with each other. It's a different unit than a simple time (in milliseconds), as it is calculated using (mainly) time and used-memory factors. In other words, it's just a quick and convenient way of saying roughly which algorithm has better performance.

Software Passion

by Krzysztof Grajek
