In a recent project I was asked to parse a massive XML document from a third-party provider. We had two problems. First, the XML document definition was huge: about 120 groups of elements, for a total of around 1500 different elements to parse into Java objects. Second, our only documentation was a massive Word document that described the XML format in plain English and a series of tables.
Fortunately the Word document was rather consistent. Each table had a title and a list of attributes, each with a name, a type and a comment. Names were ugly UPPERCASENAMEALLTOGETHER. Titles contained a name followed by the XML element name in parentheses: OVERALCREDITSCORE (XYZ78).
We had chosen JiBX to parse XML (and also to generate XML in other parts of the application) for efficiency reasons: it is lightning fast, and very flexible when mapping XML structures onto Java structures. JiBX requires you to write a custom XML mapping file, and obviously we also had to create the Java classes to map to.
The total amount of work was huge: with 120 groups of about 12 XML elements each, we had to create 120 corresponding Java classes of about 12 fields each, plus a mapping file with around 1500 mapping instructions (from XML element to Java class attribute). I could not imagine doing this by hand; it is horrible work, even for a trainee.
So I copy-pasted the Word document into Excel to convert it into a comma-separated file, fixed the little inconsistencies, then wrote a rustic parser to extract the XML metamodel, using some dumb rules such as: “if cells 3 and 4 are empty then it is a value from the enumeration of possible values”, “if every cell is set then it is an element”… Even the parentheses were of interest, since they gave the enclosing element name! Relying on such stupid criteria seemed frightening, but it worked, and worked quite well!
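To give an idea of how little code such rules need, here is a minimal sketch in Java. It is not the original parser: the four-cell row layout, the names `RowKind` and `RowClassifier`, and the naive comma split (no quoted commas) are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical row categories; the original parser's vocabulary is not known.
enum RowKind { TITLE, ELEMENT, ENUM_VALUE, UNKNOWN }

final class RowClassifier {
    // Matches a table title such as "OVERALCREDITSCORE (XYZ78)".
    private static final Pattern TITLE =
            Pattern.compile("^\\s*(\\w+)\\s*\\((\\w+)\\)\\s*$");

    /** Classifies one line of the comma-separated export (assumed: 4 cells per row). */
    static RowKind classify(String line) {
        // Limit -1 keeps trailing empty cells, which the rules rely on.
        String[] cells = line.split(",", -1);
        // Title rows are checked first: a title followed by empty cells would
        // otherwise look like an enumeration value.
        if (TITLE.matcher(cells[0]).matches() && allEmpty(cells, 1)) {
            return RowKind.TITLE;
        }
        if (cells.length < 4 || isEmpty(cells[0])) {
            return RowKind.UNKNOWN;
        }
        // "If cells 3 and 4 are empty then it is a value from the enumeration."
        if (isEmpty(cells[2]) && isEmpty(cells[3])) {
            return RowKind.ENUM_VALUE;
        }
        // "If every cell is set then it is an element."
        if (!isEmpty(cells[1]) && !isEmpty(cells[2]) && !isEmpty(cells[3])) {
            return RowKind.ELEMENT;
        }
        return RowKind.UNKNOWN;
    }

    /** Returns the XML element name found in parentheses, or null if absent. */
    static String elementName(String titleCell) {
        Matcher m = TITLE.matcher(titleCell);
        return m.matches() ? m.group(2) : null;
    }

    private static boolean isEmpty(String cell) {
        return cell.trim().isEmpty();
    }

    private static boolean allEmpty(String[] cells, int from) {
        for (int i = from; i < cells.length; i++) {
            if (!isEmpty(cells[i])) return false;
        }
        return true;
    }
}
```

With these assumptions, `RowClassifier.elementName("OVERALCREDITSCORE (XYZ78)")` yields `XYZ78`, and a row whose last two cells are empty is treated as an enumeration value.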
Then, using a simple Java metamodel (class, field, constants as enumerations) and some dedicated Visitors, I generated the binding file, the DTD, the Java model, a sample XML document to use for testing… in 2 seconds. The complete work including testing took 2 days. I was happy the project manager trusted me on this; he could have decided it would not work and refused to let me do it…
I haven't told you how I converted the ugly UPPERCASENAMEALLTOGETHER into correct Java naming conventions. Using a short business-specific dictionary of words, say {UPPER, CASE, NAME, TOGETHER, ALL…}, a function tries to recognise the dictionary words inside the ugly names in order to split them into relevant parts for camel-case conversion. It did not work 100% of the time, but after a few tries, adding or removing words to resolve conflicts, it did the whole job automatically!
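The splitting idea can be sketched roughly as follows. This is a reconstruction, not the original function: the class name `NameSplitter`, the longest-word-first ordering and the backtracking search are my assumptions, and dictionary words are assumed to be upper case like the names themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: splits run-together upper-case names with a word dictionary. */
final class NameSplitter {
    private final List<String> dictionary = new ArrayList<>();

    NameSplitter(String... words) {
        for (String word : words) {
            if (!word.isEmpty()) dictionary.add(word); // empty words would never advance
        }
        // Try longer words first, so "TOGETHER" is preferred over a shorter prefix.
        dictionary.sort((a, b) -> Integer.compare(b.length(), a.length()));
    }

    /** Returns words covering the whole name, or an empty list if no full split exists. */
    List<String> split(String name) {
        List<String> parts = new ArrayList<>();
        return splitFrom(name, 0, parts) ? parts : new ArrayList<>();
    }

    private boolean splitFrom(String name, int start, List<String> parts) {
        if (start == name.length()) return true;
        for (String word : dictionary) {
            if (name.startsWith(word, start)) {
                parts.add(word);
                if (splitFrom(name, start + word.length(), parts)) return true;
                parts.remove(parts.size() - 1); // dead end: backtrack, try another word
            }
        }
        return false;
    }

    /** Converts e.g. UPPERCASENAME to upperCaseName; falls back to plain lower case. */
    String toCamelCase(String name) {
        List<String> parts = split(name);
        if (parts.isEmpty()) return name.toLowerCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder(parts.get(0).toLowerCase(Locale.ROOT));
        for (String part : parts.subList(1, parts.size())) {
            sb.append(part.charAt(0))
              .append(part.substring(1).toLowerCase(Locale.ROOT));
        }
        return sb.toString();
    }
}
```

For example, `new NameSplitter("UPPER", "CASE", "NAME", "TOGETHER", "ALL").toCamelCase("UPPERCASENAMEALLTOGETHER")` gives `upperCaseNameAllTogether`. A name that cannot be fully covered falls back to plain lower case, which is the cue to add a word to the dictionary and run again.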
Such a disposable, dedicated code generator works great. I don't care that I won't reuse it, as long as I get the work done quickly…
Note: the design is actually very common and reusable: an object model (Composite pattern), a parser (Builder pattern) and some code generators (Visitor pattern).
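As a rough illustration of that design, here is a minimal sketch of a two-level metamodel walked by two generators. All names (`ModelVisitor`, `ClassModel`, `FieldModel`, `JavaSourceGenerator`, `DtdGenerator`) are hypothetical, the real model also had enumerations, and the generated output is deliberately simplistic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical visitor contract: one callback per kind of model node.
interface ModelVisitor {
    void visitClass(ClassModel model);
    void visitField(FieldModel field);
    void endClass(ClassModel model);
}

final class FieldModel {
    final String xmlName, javaName, javaType;
    FieldModel(String xmlName, String javaName, String javaType) {
        this.xmlName = xmlName; this.javaName = javaName; this.javaType = javaType;
    }
    void accept(ModelVisitor visitor) { visitor.visitField(this); }
}

final class ClassModel {
    final String xmlName, javaName;
    final List<FieldModel> fields = new ArrayList<>();
    ClassModel(String xmlName, String javaName) {
        this.xmlName = xmlName; this.javaName = javaName;
    }
    ClassModel field(String xmlName, String javaName, String javaType) {
        fields.add(new FieldModel(xmlName, javaName, javaType));
        return this;
    }
    // The model drives the traversal; each generator only decides what to emit.
    void accept(ModelVisitor visitor) {
        visitor.visitClass(this);
        for (FieldModel field : fields) field.accept(visitor);
        visitor.endClass(this);
    }
}

/** Emits a bare Java class with one private field per model field. */
final class JavaSourceGenerator implements ModelVisitor {
    final StringBuilder out = new StringBuilder();
    public void visitClass(ClassModel m) {
        out.append("public class ").append(m.javaName).append(" {\n");
    }
    public void visitField(FieldModel f) {
        out.append("    private ").append(f.javaType).append(' ')
           .append(f.javaName).append(";\n");
    }
    public void endClass(ClassModel m) { out.append("}\n"); }
}

/** Emits DTD element declarations for the same model. */
final class DtdGenerator implements ModelVisitor {
    final StringBuilder out = new StringBuilder();
    public void visitClass(ClassModel m) {
        List<String> names = new ArrayList<>();
        for (FieldModel f : m.fields) names.add(f.xmlName);
        String content = names.isEmpty() ? "EMPTY" : "(" + String.join(", ", names) + ")";
        out.append("<!ELEMENT ").append(m.xmlName).append(' ').append(content).append(">\n");
    }
    public void visitField(FieldModel f) {
        out.append("<!ELEMENT ").append(f.xmlName).append(" (#PCDATA)>\n");
    }
    public void endClass(ClassModel m) { /* nothing to close in a DTD */ }
}
```

Build a model once, for instance `new ClassModel("XYZ78", "OverallCreditScore").field("SCORE", "score", "int")`, then pass it to each generator with `accept`. Adding another output, such as a binding file or a sample XML document, means writing one more visitor without touching the model or the parser.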
Initially published on Jroller on Tuesday March 29, 2005