Subscribe to
Posts
Comments

I have a Python application in which, for my sins, I decided to use XML as an on-disk storage format. Unfortunately, when I made this decision, I neglected to measure the performance of the available Python XML processing implementations.

Bad, bad, bad mistake. I expected that I was going to trade a little saved work for some performance, but when I finally got around to profiling my app today, to see why it was so slow, I was shocked.

Using the xml.sax module, I am able to process a 2.5MB document in 2.5 seconds on a reasonably fast Pentium 4 system. That gives me one megabyte per second of emphysema-wheezing parsing power. This number is so spectacularly, laughably bad that I actually spent several hours rechecking my measurements to see if I was doing something heinously stupid. I wasn’t–that is, beyond naïvely hoping for decent performance in the first place.

Now, I could use PyRXP, and I have before, but it’s only about three times faster than xml.sax. I can chew through vastly more data using fp.write(repr(obj));eval(fp.read())!

I really need something that can parse tens of megabytes of data per second, so as far as I can tell, I simply can’t mix XML and Python at all. Sigh.

Leave a Reply