Event vs. DOM Driven Parsing of XML

Published on Tuesday, April 29, 2008

I have recently been playing with parsing GPX files and writing the results out to a special KML file. I initially wrote the parser using minidom, but the first run pegged my Core2Duo laptop at 100% utilization for 10 seconds, so I realized I needed to rewrite it using something else.

I spent a little time reading about the different XML parsers available for Python and eventually landed on cElementTree. It is included with Python 2.5, sweet.

I quickly rewrote the code and ran some tests. First, the two versions of the code for parsing my GPX file:

minidom-speed.py

#!/usr/bin/python

from xml.dom import minidom
from genshi.template import TemplateLoader

def collect_info():
    # Load the whole GPX file into a DOM tree
    dom = minidom.parse('airport.gpx')
    for node in dom.getElementsByTagName('trkpt'):
        lat = node.getAttribute('lat')
        lon = node.getAttribute('lon')
        # Text of the first <speed> child, scaled by 10
        speed = node.getElementsByTagName('speed')[0].firstChild.data
        speed = float(speed) * 10
        # KML coordinates are ordered lon,lat; the scaled speed goes third
        coords = '%s,%s' % (lon, lat)
        coords_speed = '%s,%s' % (coords, speed)
        yield {
            'coordinates': coords_speed
        }

loader = TemplateLoader(['.'])
template = loader.load('template-speed.kml')
stream = template.generate(collection=collect_info())

f = open('minidom.kml', 'w')
f.write(stream.render())
f.close()


cet-speed.py

#!/usr/bin/python

import string
import xml.etree.cElementTree as ET
from genshi.template import TemplateLoader

def collect_info():
    # ElementTree tag names carry the namespace in {braces}
    mainNS = string.Template("{http://www.topografix.com/GPX/1/0}$tag")

    wptTag = mainNS.substitute(tag="trkpt")
    speedTag = mainNS.substitute(tag="speed")

    et = ET.parse("airport.gpx")
    # ".//" searches the whole tree below the root
    for wpt in et.findall(".//" + wptTag):
        wptinfo = []
        wptinfo.append(wpt.get("lon"))
        wptinfo.append(wpt.get("lat"))
        wptinfo.append(str(float(wpt.findtext(speedTag)) * 10))
        coords_speed = ",".join(wptinfo)
        yield {
            'coordinates': coords_speed,
        }

loader = TemplateLoader(['.'])
template = loader.load('template-speed.kml')
stream = template.generate(collection=collect_info())

f = open('cet.kml', 'w')
f.write(stream.render())
f.close()
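
Worth noting: ET.parse() still builds the whole tree in memory, much like minidom does, only far more cheaply. For a genuinely event-driven pass, ElementTree also offers iterparse(). What follows is only a minimal sketch, not the code that was profiled here. It assumes a current Python, where xml.etree.ElementTree is already the C-accelerated module (on Python 2.5 the import would be xml.etree.cElementTree), and it uses a tiny made-up sample in place of airport.gpx. The names iter_coords and SAMPLE_GPX are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

GPX_NS = "{http://www.topografix.com/GPX/1/0}"

# Made-up two-point track standing in for airport.gpx
SAMPLE_GPX = b"""<?xml version="1.0"?>
<gpx xmlns="http://www.topografix.com/GPX/1/0" version="1.0">
  <trk><trkseg>
    <trkpt lat="47.5" lon="-122.25"><speed>1.5</speed></trkpt>
    <trkpt lat="47.75" lon="-122.5"><speed>2.0</speed></trkpt>
  </trkseg></trk>
</gpx>"""

def iter_coords(source):
    """Yield 'lon,lat,speed*10' strings as each <trkpt> element closes."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == GPX_NS + "trkpt":
            speed = float(elem.findtext(GPX_NS + "speed")) * 10
            yield "%s,%s,%s" % (elem.get("lon"), elem.get("lat"), speed)
            # Release the point's children and attributes once it has been used
            elem.clear()

coords = list(iter_coords(io.BytesIO(SAMPLE_GPX)))
print(coords)
```

In a real script, iter_coords would be given the path to the GPX file instead of the in-memory sample.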


The speed difference is hard to miss.

minidom-speed.py

$ python -m cProfile minidom-speed.py
4405376 function calls (3787047 primitive calls) in 32.142 CPU seconds


cet-speed.py

$ python -m cProfile cet-speed.py
1082061 function calls (904167 primitive calls) in 6.736 CPU seconds
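
Keep in mind that these totals cover each whole script, including Genshi's template rendering and the profiler's own overhead. A rough sketch for timing only the parsing step on identical input might look like the following. The helper names (make_gpx, with_minidom, with_elementtree, timed) and the synthetic track points are hypothetical stand-ins for the real GPX file, and the sketch assumes a current Python with xml.etree.ElementTree rather than the cElementTree import used above.

```python
import io
import time
from xml.dom import minidom
import xml.etree.ElementTree as ET

NS = "{http://www.topografix.com/GPX/1/0}"

def make_gpx(n):
    """Build a synthetic GPX 1.0 document with n track points (made-up data)."""
    pts = "".join(
        '<trkpt lat="%d.5" lon="-%d.25"><speed>1.5</speed></trkpt>' % (i % 90, i % 180)
        for i in range(n)
    )
    return ('<gpx xmlns="http://www.topografix.com/GPX/1/0" version="1.0">'
            '<trk><trkseg>%s</trkseg></trk></gpx>' % pts).encode("utf-8")

def with_minidom(data):
    # Same extraction as minidom-speed.py, minus the templating
    dom = minidom.parseString(data)
    return [
        "%s,%s,%s" % (n.getAttribute("lon"), n.getAttribute("lat"),
                      float(n.getElementsByTagName("speed")[0].firstChild.data) * 10)
        for n in dom.getElementsByTagName("trkpt")
    ]

def with_elementtree(data):
    # Same extraction as cet-speed.py, minus the templating
    root = ET.parse(io.BytesIO(data)).getroot()
    return [
        "%s,%s,%s" % (p.get("lon"), p.get("lat"), float(p.findtext(NS + "speed")) * 10)
        for p in root.iter(NS + "trkpt")
    ]

def timed(func, data):
    """Return (result, elapsed seconds) for a single call of func(data)."""
    start = time.perf_counter()
    result = func(data)
    return result, time.perf_counter() - start

data = make_gpx(2000)
dom_result, dom_secs = timed(with_minidom, data)
et_result, et_secs = timed(with_elementtree, data)
print("minidom: %.4fs  ElementTree: %.4fs" % (dom_secs, et_secs))
```

The absolute numbers will differ by machine and Python version, so treat them as a relative comparison only.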


That is roughly a quarter as many function calls and almost 5x less CPU time (32.142 vs. 6.736 seconds), at least that's how I interpret the results. Much better!
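
The ratios can be checked directly from the two cProfile summary lines. A small sketch, with the figures copied from the output above and variable names chosen only for illustration:

```python
# Figures copied from the cProfile summaries above
minidom_calls, minidom_secs = 4405376, 32.142
cet_calls, cet_secs = 1082061, 6.736

# How many times more calls and CPU seconds the minidom version needed
call_ratio = minidom_calls / float(cet_calls)
speedup = minidom_secs / cet_secs

print("calls: %.2fx fewer, time: %.2fx faster" % (call_ratio, speedup))
```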