Why Python is useless for serious XML processing

I have a Python application in which, for my sins, I decided to use XML as an on-disk storage format. Unfortunately, when I made this decision, I neglected to measure the performance of the available Python XML processing implementations.

Bad, bad, bad mistake. I expected that I was going to trade a little saved work for some performance, but when I finally got around to profiling my app today, to see why it was so slow, I was shocked.

Using the xml.sax module, I am able to process a 2.5MB document in 2.5 seconds on a reasonably fast Pentium 4 system. That gives me one megabyte per second of emphysema-wheezing parsing power. This number is so spectacularly, laughably bad that I actually spent several hours rechecking my measurements to see if I was doing something heinously stupid. I wasn’t; that is, beyond naïvely hoping for decent performance in the first place.
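For anyone who wants to repeat this kind of measurement, here is a minimal sketch of how throughput can be timed with xml.sax. It is not the code from my application: the document layout, the CountingHandler class, and the make_document and measure_sax helpers are all made up for illustration, and the numbers will depend heavily on the machine and on what the handler actually does.

```python
import io
import time
import xml.sax


class CountingHandler(xml.sax.ContentHandler):
    """Counts start tags; a stand-in for real application logic."""

    def __init__(self):
        super().__init__()
        self.elements = 0

    def startElement(self, name, attrs):
        self.elements += 1


def make_document(n_records):
    """Build a synthetic XML document (hypothetical layout) as bytes."""
    rows = "".join(
        '<record id="%d"><name>item %d</name></record>' % (i, i)
        for i in range(n_records)
    )
    return ("<records>" + rows + "</records>").encode("utf-8")


def measure_sax(data):
    """Parse data with xml.sax and return (element count, MB per second)."""
    handler = CountingHandler()
    start = time.perf_counter()
    xml.sax.parse(io.BytesIO(data), handler)
    elapsed = time.perf_counter() - start
    return handler.elements, len(data) / 1e6 / elapsed


if __name__ == "__main__":
    doc = make_document(50000)
    count, mb_per_s = measure_sax(doc)
    print("%d elements, %.1f MB/s" % (count, mb_per_s))
```

Building the document in memory keeps disk I/O out of the timing, so the figure reflects the parser and the handler callbacks only.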

Now, I could use PyRXP, and I have before, but it’s only about three times faster than xml.sax. I can chew through vastly more data using fp.write(repr(obj));eval(fp.read())!
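The repr/eval round trip mentioned above can be sketched roughly as follows. This is an illustration under assumptions, not my actual storage code: the dump and load helper names are hypothetical, and the sketch swaps in ast.literal_eval for plain eval so that a tampered file cannot execute arbitrary code. I have not measured whether that substitution keeps the same speed advantage.

```python
import ast
import os
import tempfile


def dump(obj, path):
    """Write the repr() of a plain Python object to a text file."""
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(repr(obj))


def load(path):
    """Read the file back; literal_eval only accepts Python literals."""
    with open(path, "r", encoding="utf-8") as fp:
        return ast.literal_eval(fp.read())


if __name__ == "__main__":
    # Hypothetical data; only literals (dicts, lists, tuples, strings,
    # numbers, None, booleans) survive this round trip.
    state = {"title": "example", "sizes": [1, 2, 3], "pair": (1.5, None)}
    path = os.path.join(tempfile.mkdtemp(), "state.txt")
    dump(state, path)
    print(load(path) == state)
```

This only works for objects whose repr is a valid Python literal, which is a much narrower contract than XML offers.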

I really need something that can parse tens of megabytes of data per second, so as far as I can tell, I simply can’t mix XML and Python at all. Sigh.

Posted in python, software
14 comments on “Why Python is useless for serious XML processing”
  1. John Kimball says:

    Bryan;

    Have you checked out lxml? I’m not an xml user but follow Python news and this seems to have the most buzz.

    From their website: “lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique in that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.”

    JK

  2. Wheelwright says:

    Personally, I don’t use Python, but if it’s possible and not too much trouble, could you post the XML and the output you want after parsing? (I am curious how long XLINQ on VB.NET would take and whether it is a viable technology performance-wise.)

    BTW: this information might also be useful for other Python users, who just might be able to point out ways to speed up the process.

  3. Tempura says:

    So, why not use cElementTree, which has been in the standard library since Python 2.5?

  4. Guy Murphy says:

    Hi, it’s been an awful lot of years since I used Python for XML parsing, but I remember that you used to have to be careful to ensure that the SAX parser was using the C module/driver (expat) for its underlying parse rather than the Python one.

    Python used to have a reputation for being really rather nippy with XML, and I know back then it was very popular with people in and around the W3C for prototyping XML-related stuff, so I’m rather surprised to hear that it has fallen from grace in this regard.

    As a side note, XLINQ really is not a viable option for fast XML parsing or for large amounts of data. On the .NET platform you might want to consider using the XML pull parser directly, which really is rather fast. It’s also closer to your experience with SAX, except that it is a pull parser rather than a push parser (although it’s trivial to layer a push layer on top of it). There’s always IronPython on the .NET platform if you want to remain in Python. I’ve used IronPython for quick scripts around the edges of an in-house .NET framework, and it’s quite cool for small tasks.

  5. Suvash says:

    You should check out lxml if you are looking for performance.

  6. qebab says:

    The cElementTree module, as mentioned already, is supposedly the best alternative for this.

  7. John says:

    I do quite a bit of heavy XML and HTML processing with lxml and it’s blazingly fast. Of course the reason it’s so fast is that it’s using libxml (which is written in C) behind the scenes… but it provides very nice pythonic bindings. Definitely check it out.

  8. manatlan says:

    Use lxml; it’s by far the fastest XML processing available for Python. It uses libxml2, which is really fast (in the past, it was a lot faster than MSXML!).

  9. Alec says:

    To add on to the ElementTree suggestion, I’m pretty sure that it is part of the core as of 2.5 as xml.etree. And yes, doing anything manually with SAX is painful, no matter what language you’re using.

  10. Kai says:

    I used the libxml2 XmlTextReader interface (http://xmlsoft.org/xmlreader.html) for some text-mining research on the 5GB Wikipedia XML file. That was really fine and fast.

    Another idea for fast XML processing is Oracle Berkeley DB XML. It has a Python interface and is free for non-commercial use.

  11. hoanghung says:

    Use PHP 5’s XMLReader. It is much faster than Python at XML processing.

  12. Eric Larson says:

    Another option to consider is Amara. The second version will be released soon and it uses expat internally, which makes it blazingly fast. Also, it doesn’t have the libxml/libxslt dependency, which can be a hassle on some occasions.

    http://xml3k.org/Amara

  13. William says:

    In case you’re wondering why there is so much activity on this blog posting, someone recently posted this article to dzone.com. So you have a lot of people seeing your post from 4 years ago and assuming that you’re still having this issue now (or perhaps they don’t notice the byline or the URL indicating that this blog post was made back in 2004).

  14. Derek says:

    This is something I have just started looking at: I’m curious if you ever came back to try again or have gone (permanently) down another road?
