Why use diffxml?
I’m the author of the diffxml tool for comparing XML documents. In this post I’d like to explain why you might want to use diffxml to compare XML documents rather than traditional text tools such as the UNIX diff command.
There are two things that diffxml understands that diff doesn’t: the syntax of XML documents (e.g. <br/> is equivalent to <br></br>) and the hierarchical structure they represent.
The advantages of understanding XML syntax are pretty easy to explain. Consider these two XML documents:
<a
>text<b/>
<c></c>
</a>
and
<a>text<b></b>
<d/>
</a>
If we compare these using diff, we get the following output:
1,3c1,2
< <a
< >text<b/>
< <c></c>
---
> <a>text<b></b>
> <d/>
This tells us that every line apart from the closing tag has changed. However, if we use diffxml to difference the documents, we get:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<delta>
<insert charpos="2" childno="4" name="d" nodetype="1" parent="/node()[1]"/>
<delete node="/node()[1]/node()[5]"/>
</delta>
This tells us that the difference between the documents is the insertion of an element “d” and the removal of the element “c” [1].
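To make the syntax point concrete, here is a minimal Python sketch. It is not how diffxml works internally; it only assumes that rewriting both documents in canonical XML (via the standard library's `ET.canonicalize`, Python 3.8+) is a fair stand-in for "understanding XML syntax". The helper names `changed_lines` and `normalise` are made up for this illustration.

```python
import difflib
import xml.etree.ElementTree as ET
from typing import List

# The two example documents from the post.
DOC_A = "<a\n>text<b/>\n<c></c>\n</a>"
DOC_B = "<a>text<b></b>\n<d/>\n</a>"


def changed_lines(old: str, new: str) -> List[str]:
    """Lines a line-based comparison marks as removed ("- ") or added ("+ ")."""
    delta = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in delta if line.startswith(("- ", "+ "))]


def normalise(xml_text: str) -> str:
    """Rewrite a document as canonical XML, so <b/> and <b></b> serialise identically."""
    return ET.canonicalize(xml_data=xml_text)


if __name__ == "__main__":
    # Raw text: three lines removed and two added.
    print(len(changed_lines(DOC_A, DOC_B)))
    # After normalising the syntax, only the <c>/<d> lines still differ.
    print(changed_lines(normalise(DOC_A), normalise(DOC_B)))
```

Once the purely syntactic differences are normalised away, the line-based comparison is left with just the real change: one element gone and another in its place.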
The other major advantage of diffxml is that it understands the hierarchical, or “tree” structure of XML documents. It’s a little harder to explain what this means, but consider the following. The XML document:
<a><b><d/></b><c><e/></c></a>
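Can be represented as a tree in which “a” is the root with two children, “b” and “c”; “d” is the child of “b” and “e” is the child of “c”.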
And the XML document:
<a><b/><c><d/><e/></c></a>
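Can be represented as a tree with the same root and the same two children, except that “b” is now empty and “c” has two children, “d” and “e”.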
Comparing the two trees, it’s clear that the only change is that the element “d” has moved from element “b” to element “c”. There is no way that a line-based differencing utility could tell us this, but diffxml gives us:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<delta>
<move childno="1" new_charpos="1" node="/node()[1]/node()[1]/node()[1]" old_charpos="1" parent="/node()[1]/node()[2]"/>
</delta>
Which correctly identifies that the only difference is the move of a single element to a new parent.
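The tree view can be sketched in a few lines of Python. This is only an illustration of the idea, not diffxml's algorithm: it assumes every element name is unique within the document (true for the example, not for XML in general), and the helper names `outline`, `parent_map` and `moved_elements` are invented here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

OLD = "<a><b><d/></b><c><e/></c></a>"
NEW = "<a><b/><c><d/><e/></c></a>"


def outline(element: ET.Element, depth: int = 0) -> List[str]:
    """Render the tree as indented tag names, one element per line."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


def parent_map(xml_text: str) -> Dict[str, Optional[str]]:
    """Map each tag to its parent's tag (None for the root). Assumes unique tags."""
    root = ET.fromstring(xml_text)
    parents: Dict[str, Optional[str]] = {root.tag: None}
    for parent in root.iter():
        for child in parent:
            parents[child.tag] = parent.tag
    return parents


def moved_elements(old_xml: str, new_xml: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Tags present in both documents whose parent differs: tag -> (old parent, new parent)."""
    old, new = parent_map(old_xml), parent_map(new_xml)
    return {tag: (old[tag], new[tag]) for tag in old if tag in new and old[tag] != new[tag]}


if __name__ == "__main__":
    print("\n".join(outline(ET.fromstring(OLD))))
    print("\n".join(outline(ET.fromstring(NEW))))
    print(moved_elements(OLD, NEW))
```

For the two example documents, the only entry reported is “d”, with “b” as its old parent and “c” as its new one.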
I hope this makes it clear why tools such as diffxml, which understand the hierarchical nature of XML documents, are often a better choice than line-based equivalents for comparing XML documents.
[1] Admittedly the output is currently a little hard for humans to read. There are a couple of things that can be done to improve this (use proper node names instead of the node() axis, and put a string in the nodetype attribute), but in the future I hope to provide some sort of graphical interface.
May 20th, 2009 at 10:07 pm
Excellent stuff! Visual Studio project (.vcproj) and solution (.sln) files are XML and I frequently get problems with merging in changes using TortoiseMerge (which is very good, but line-based).
Is it possible to use diffxml as a custom diff / merge tool with TortoiseSVN?
May 21st, 2009 at 6:59 pm
Not at the minute, but it sounds like a useful idea – I’ll add it to the list of wanted features.
The main focus at the moment is getting the quality right; in the current version you can still expect to run into the odd bug.
June 8th, 2009 at 10:27 am
I have just seen this utility and am following a company project based on it. It’s undoubtedly an easy solution, but can we automate the process of comparing two XML files through this tool? I have been asked this question and am looking for an answer, so that we can include diffxml in our projects.
Thanks
June 8th, 2009 at 5:48 pm
I’m not 100% sure what you mean, but I think the answer is yes.
They are command line utilities, so it’s dead simple to create a wrapper script or something. You could also directly access the Java classes, but that’s a little more work (and remember that they are GPL licensed).
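A wrapper script along those lines might look like the following Python sketch. It rests on assumptions that are not confirmed here: that the tool is installed on the PATH under the name `diffxml`, that it takes the two file paths as arguments, and that it writes the delta to standard output. The function name `run_xml_diff` is made up, and the exit code is returned rather than interpreted because its meaning is not documented in this post.

```python
import subprocess
import sys
from typing import Sequence, Tuple


def run_xml_diff(
    old_path: str,
    new_path: str,
    command: Sequence[str] = ("diffxml",),  # assumed executable name; adjust to your install
) -> Tuple[int, str]:
    """Run the diff command on two files and return (exit code, captured stdout)."""
    completed = subprocess.run(
        [*command, old_path, new_path],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
    )
    return completed.returncode, completed.stdout
```

Calling `run_xml_diff("old.xml", "new.xml")` from a build script would then give you the delta as a string to store or inspect. The `command` parameter is there so that a different executable name, or a stand-in command for testing, can be swapped in.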
August 27th, 2009 at 8:04 pm
Please, we absolutely need a GUI interface for this. Awesome tool!
April 9th, 2010 at 1:45 am
Interesting, but there is one big problem: no context checking. A standard diff tool checks that the change really applies where the patch says it does. It doesn’t just say “change line three”; it says “change the line that comes after these three lines and before these three lines”. That way, if the original file has changed in some non-relevant way (a few lines added or removed elsewhere), the patch should still work, and if the patch is applied on top of something else that has also changed, you can detect the conflict.
For instance, I’ve got a project which has XML docs that we then need to update in “customized” releases. However, as the base project moves along, the original XML doc changes. I could generate a diff (between the original and customized versions of the XML doc) with your tool, but the instant I added or removed elements in the root doc, the patch would start modifying or deleting the wrong elements.
May 8th, 2010 at 9:55 pm
Hi Mark,
Sorry for the slow reply; for some reason WP marked you as spam.
You are right about context matching. I want to get the basics working properly first though!
October 12th, 2010 at 5:00 pm
@Mark: you might get better results using XSLT to transform your existing doc to a “custom” version.
October 12th, 2010 at 6:46 pm
Can diffxml favour id="" attributes in the source? If it generated XPath involving unique IDs recognized in the initial XML, and patchxml used that, you’d be able to use IDs in your source to improve the accuracy of patching rather than (@Mark) rewriting the diff/patch as an XSLT.
October 25th, 2010 at 9:53 pm
Hi Allan,
No, there is no favouring of ID attributes, but it is a good suggestion.
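To illustrate why the suggestion is attractive, here is a small Python sketch. It is not a diffxml or patchxml feature, only a demonstration of the difference between addressing a node by position and by ID; the sample documents and the helper names `index_path`, `resolve_index_path` and `find_by_id` are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

BASE = '<items><item id="x">one</item><item id="y">two</item></items>'
# The same document after the base project added an element at the front.
UPDATED = '<items><item id="new">zero</item><item id="x">one</item><item id="y">two</item></items>'


def index_path(root: ET.Element, target: ET.Element) -> Optional[List[int]]:
    """Child indexes (0-based) leading from root to target, or None if not found."""
    if root is target:
        return []
    for i, child in enumerate(root):
        rest = index_path(child, target)
        if rest is not None:
            return [i] + rest
    return None


def resolve_index_path(root: ET.Element, path: List[int]) -> ET.Element:
    """Follow a list of child indexes down from root."""
    node = root
    for i in path:
        node = node[i]
    return node


def find_by_id(root: ET.Element, id_value: str) -> Optional[ET.Element]:
    """First descendant whose id attribute equals id_value."""
    return root.find(f".//*[@id='{id_value}']")


if __name__ == "__main__":
    base, updated = ET.fromstring(BASE), ET.fromstring(UPDATED)
    path = index_path(base, find_by_id(base, "y"))
    # The positional path now lands on a different element in the updated document...
    print(resolve_index_path(updated, path).get("id"))
    # ...while the ID lookup still finds the intended one.
    print(find_by_id(updated, "y").text)
```

A path recorded by position silently points at a different element once a sibling is inserted ahead of it, which is the failure described earlier in the thread, whereas a lookup by a unique ID keeps pointing at the same element.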
February 27th, 2011 at 9:38 pm
Hi,
if anyone is interested in a Windows application which performs two and three way comparison and merging of XML files, Project: Merge is such a tool I recently released. I originally wrote it to specifically solve the problem of resolving conflicts in Visual Studio project files.
More information and a trial version can be found at http://www.projectmerge.com
Cheers,
James