lxml makes XML (almost) fun

Working with XML isn’t my favorite task. Sure, I get why it’s useful for data transfer, but I’d rather deal with an interface that hides the gory details … meaning the raw XML. Of course, sometimes that isn’t possible. As of version 2.4, the Python standard library offers only the DOM and SAX APIs, which, while useful, just don’t provide enough power and flexibility to do ad hoc XML processing without significant pain, at least for the casual user. Python 2.5 adds the ElementTree API (xml.etree.ElementTree), but we’re still missing full XPath and XSLT support.
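To give a rough idea of what the standard-library ElementTree does cover, here is a minimal sketch. The sample document and the variable names (xmlsrc, first, all_text) are made up for illustration, and it only uses simple path lookups, since XPath functions and most axes fall outside ElementTree's subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
xmlsrc = '<root><element>text</element><element>more</element></root>'
root = ET.fromstring(xmlsrc)

# Simple child lookup: text of the first matching direct child.
first = root.findtext('element')

# Descendant search with the limited path syntax ElementTree understands.
all_text = [e.text for e in root.findall('.//element')]

print(first)
print(all_text)
```

That covers the everyday cases, but anything that needs a real XPath expression or a stylesheet is still out of reach.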

In steps lxml, a library that, for my purposes, makes everything else obsolete. Prior to lxml there was 4Suite, but its latest release was posted in December 2006 and the project appears to be dead. Fortunately, lxml runs on Python 2.3 or later (I’m stuck on 2.4 for the time being). The only hiccup, depending on your environment, can be getting the C dependencies installed: libxml2 and libxslt. But if you’re going to do some serious work with XML, it’s worth the effort to get lxml installed. You won’t want to go back.

DOM

>>> xmlsrc = '<root><element>text</element></root>'
>>> from xml.dom.minidom import parseString
>>> parseString(xmlsrc).getElementsByTagName('element')[0].firstChild.nodeValue
u'text'

Yikes!

lxml

>>> xml = '<root><element>text</element></root>'
>>> from lxml import etree
>>> etree.fromstring(xml).findtext('.//element')
'text'

Ahh … better.
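The one-liner above only scratches the surface. As a hedged sketch of the full XPath and XSLT support mentioned earlier, something like the following should work, assuming lxml is installed. The sample document, the id attributes, and the tiny stylesheet are all invented for illustration.

```python
from lxml import etree

# Hypothetical sample document, invented for this sketch.
xml = '<root><element id="a">text</element><element id="b">more</element></root>'
root = etree.fromstring(xml)

# Full XPath 1.0: functions and attribute predicates both work.
count = root.xpath('count(//element)')             # XPath numbers come back as floats
second = root.xpath('//element[@id="b"]/text()')   # list of matching text nodes

# XSLT: a made-up stylesheet that joins the element texts with commas.
xslt_src = '''<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//element">
      <xsl:value-of select="."/>
      <xsl:text>,</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>'''
transform = etree.XSLT(etree.fromstring(xslt_src))
result_text = str(transform(root))

print(count)
print(second)
print(result_text)
```

The xpath() method and the XSLT class are the parts you don't get from the standard library, and they are the main reason the install hassle pays off.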
