Why use diffxml?

I’m the author of the diffxml tool for comparing XML documents. In this post I’d like to explain why you might want to use diffxml to compare XML documents rather than traditional text tools such as the UNIX diff command.

diffxml understands two things that diff doesn’t: the syntax of XML documents (e.g. <br/> is equivalent to <br></br>) and the hierarchical structure they represent.
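To make the syntax point concrete, here is a minimal sketch in Python (not part of diffxml, and not how diffxml works internally). It assumes Python 3.8 or later, where the standard library can write XML in canonical form; the helper name same_xml is made up for this illustration. Two documents that differ only in how their tags are written come out identical once canonicalized:

```python
import xml.etree.ElementTree as ET


def same_xml(left: str, right: str) -> bool:
    """Return True when two XML strings share the same canonical (C14N 2.0) form."""
    return ET.canonicalize(xml_data=left) == ET.canonicalize(xml_data=right)


# An empty-element tag and a start/end tag pair are the same element.
print(same_xml("<br/>", "<br></br>"))                      # True
# A line break inside a start tag is not significant either.
print(same_xml("<a\n>text<b/></a>", "<a>text<b></b></a>"))  # True
# A genuinely different element is still reported as different.
print(same_xml("<a><c/></a>", "<a><d/></a>"))              # False
```

A line-based tool sees all three pairs as changed text; only the last pair is a real change to the document.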

The advantages of understanding XML syntax are pretty easy to explain. Consider these two XML documents:

<a
>text<b/>
<c></c>
</a>

and

<a>text<b></b>
<d/>
</a>

If we compare these using diff, we get the following output:

1,3c1,2
< <a
<     >text<b/>
< <c></c>
---
> <a>text<b></b>
> <d/>

This tells us that every line except the closing tag has changed. However, if we use diffxml to compare the documents, we get:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<delta>
<insert charpos="2" childno="4" name="d" nodetype="1" parent="/node()[1]"/>
<delete node="/node()[1]/node()[5]"/>
</delta>

This tells us that the difference between the documents is the insertion of an element “d” and the removal of the element “c” [1].
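The same conclusion can be reached with a rough element-level comparison. The sketch below is only an illustration under simplifying assumptions: it looks at the names of the root element's children and nothing else, it uses Python's general-purpose difflib matcher rather than diffxml's algorithm, and the function names are invented for the example.

```python
import difflib
import xml.etree.ElementTree as ET

OLD = "<a\n>text<b/>\n<c></c>\n</a>"
NEW = "<a>text<b></b>\n<d/>\n</a>"


def child_tags(xml_text):
    """List the tag names of the root element's child elements, in order."""
    return [child.tag for child in ET.fromstring(xml_text)]


def element_changes(old_xml, new_xml):
    """Describe inserted and deleted child elements of the root, by name only."""
    old_tags, new_tags = child_tags(old_xml), child_tags(new_xml)
    matcher = difflib.SequenceMatcher(a=old_tags, b=new_tags, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            changes += [("delete", tag) for tag in old_tags[i1:i2]]
        if op in ("insert", "replace"):
            changes += [("insert", tag) for tag in new_tags[j1:j2]]
    return changes


print(element_changes(OLD, NEW))  # [('delete', 'c'), ('insert', 'd')]
```

Because the comparison runs on parsed elements, the differently written "a" and "b" tags do not show up as changes at all.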

The other major advantage of diffxml is that it understands the hierarchical, or “tree” structure of XML documents. It’s a little harder to explain what this means, but consider the following. The XML document:

<a><b><d/></b><c><e/></c></a>

Can be represented as:

[Figure: tree representation of the first XML document]

And the XML document:

<a><b/><c><d/><e/></c></a>

Can be represented as:

[Figure: tree representation of the second XML document]
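For readers who want to reproduce the two trees as text, here is a small Python sketch that prints an indented outline of element names. It is an illustration only; the outline function is a made-up helper and it ignores text and attributes.

```python
import xml.etree.ElementTree as ET


def outline(xml_text):
    """Return an indented outline of element names, one line per element."""
    lines = []

    def walk(element, depth):
        lines.append("  " * depth + element.tag)
        for child in element:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 0)
    return "\n".join(lines)


print(outline("<a><b><d/></b><c><e/></c></a>"))
# a
#   b
#     d
#   c
#     e

print(outline("<a><b/><c><d/><e/></c></a>"))
# a
#   b
#   c
#     d
#     e
```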

It’s clear from the diagrams that the only change is that the element “d” has moved from element “b” to element “c”. A line-based differencing utility cannot tell us this, but diffxml gives us:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<delta>
<move childno="1" new_charpos="1" node="/node()[1]/node()[1]/node()[1]" old_charpos="1" parent="/node()[1]/node()[2]"/>
</delta>

This correctly identifies that the only difference is the move of a single element to a new parent.
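To see which nodes the two paths in the move operation point at, the following Python sketch walks them by hand. It assumes the paths follow ordinary XPath positional semantics (positions start at 1 and are evaluated against the original document), and it handles only /node()[n] steps. The resolve helper is invented for this illustration and is not part of diffxml or its patch tool.

```python
import re
from xml.dom import minidom


def resolve(document, path):
    """Follow a path made only of /node()[n] steps and return the DOM node."""
    node = document
    for index in re.findall(r"/node\(\)\[(\d+)\]", path):
        node = node.childNodes[int(index) - 1]  # XPath positions start at 1
    return node


doc = minidom.parseString("<a><b><d/></b><c><e/></c></a>")
moved = resolve(doc, "/node()[1]/node()[1]/node()[1]")  # the "node" attribute
target = resolve(doc, "/node()[1]/node()[2]")           # the "parent" attribute
print(moved.nodeName, "->", target.nodeName)            # d -> c
```

Read this way, the operation says: take “d” (first child of “b”) and attach it to “c”.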

I hope this makes it clear why tools such as diffxml, which understand the hierarchical nature of XML documents, are often a better choice than line-based equivalents for comparing XML documents.

  1. Admittedly the output is currently a little hard for humans to read. A couple of things could improve this (using proper node names instead of the node() test, and putting a descriptive string in the nodetype attribute), and in the future I hope to provide some sort of graphical interface.

12 Responses to “Why use diffxml?”

  1. John Knottenbelt Says:

    Excellent stuff! Visual Studio project (.vcproj) and solution (.sln) files are XML, and I frequently have problems merging in changes using TortoiseMerge (which is very good, but line-based).

    Is it possible to use diffxml as a custom diff / merge tool with TortoiseSVN?

  2. Adrian Mouat Says:

    Not at the minute, but it sounds like a useful idea – I’ll add it to the list of wanted features.

    The main focus at the moment is getting the quality right; in the current version you can still expect to run into the odd bug.

  3. Brian spencer Says:

    I have just come across this utility and am following a company project based on it. It’s undoubtedly an easy solution, but can we automate the process of comparing two XML files with this tool? I have been asked this question and am looking for an answer, so that we can include diffxml in our projects.

    Thanks

  4. Adrian Mouat Says:

    I’m not 100% sure what you mean, but I think the answer is yes.

    They are command-line utilities, so it’s dead simple to create a wrapper script or something similar. You could also access the Java classes directly, but that’s a little more work (and remember that they are GPL licensed).

  5. Senthil Says:

    Please, we absolutely need a GUI interface for this. Awesome tool!

  6. Mark Says:

    Interesting, but there is one big problem: no context checking. A standard diff tool checks that the patch is being applied in the right place; it doesn’t just say “change line three”, it says “change the line that comes after these three lines, and before these three lines”. That way, if the original file has changed in some non-relevant way (a few lines added or removed elsewhere), the patch should still work, and if the patch is applied on top of something else that has also changed, you can detect the conflict.

    For instance, I’ve got a project which has XML docs that we then need to update in “customized” releases. However, as the base project moves along, the original XML doc changes. I could generate a diff (between the original and customized versions of the XML doc) with your tool, but the instant I added or removed elements in the root doc, the patch would start modifying or deleting the wrong elements.

  7. Adrian Mouat Says:

    Hi Mark,

    Sorry for the slow reply, for some reason WP marked you as spam.

    You are right about context matching. I want to get the basics working properly first though!

  8. Allan Clark Says:

    @Mark: you might get better results using XSLT to transform your existing doc to a “custom” version.

  9. Allan Clark Says:

    Can diffxml favour id="" attributes in the source? If it generated XPath expressions involving unique IDs recognized in the initial XML, and patchxml used them, you’d be able to use IDs in your source to improve the accuracy of patching rather than (@Mark) rewriting the diff/patch as an XSLT.

  10. Adrian Mouat Says:

    Hi Allan,

    No, there is no favouring of ID attributes, but it is a good suggestion.

  11. James Robertson Says:

    Hi,

    if anyone is interested in a Windows application which performs two and three way comparison and merging of XML files, Project: Merge is such a tool I recently released. I originally wrote it to specifically solve the problem of resolving conflicts in Visual Studio project files.

    More information and a trial version can be found at http://www.projectmerge.com

    Cheers,
    James

  12. mkf Says:

    sadsadsa
