python - Prune some elements from a large XML file
I have an XML file of more than 1 GB and want to reduce its size by removing unwanted children of a parent tag, either by creating a new XML file or by rewriting the existing one. How can this be done in Python? Because the file is so large, a simple parse such as tree = ElementTree.parse(xmlfile) won't work.
In the file, for every parent tag tasksreportnode, I want to keep only the child tablerow whose rowcount attribute has the value 0 and reject all the other children (tablerow elements) of that parent.
sample xml code:
    <tasksreportnode name="task15">
      <tabledata numrows="97" numcolumns="15">
        <tablerow rowcount="0">
          <tablecolumn name="task"><![CDATA[ task15 [get - /pulsev31/appview/projectfeedhidden.jsp - 200]]]></tablecolumn>
          <tablecolumn name="status"><![CDATA[success]]></tablecolumn>
          <tablecolumn name="successful"><![CDATA[96]]></tablecolumn>
          <tablecolumn name="failed"><![CDATA[0]]></tablecolumn>
          <tablecolumn name="timedout"><![CDATA[0]]></tablecolumn>
          <tablecolumn name="total"><![CDATA[96]]></tablecolumn>
          <tablecolumn name="min(ms)"><![CDATA[15]]></tablecolumn>
          <tablecolumn name="avg(ms)"><![CDATA[24.20]]></tablecolumn>
          <tablecolumn name="avg-90%(ms)"><![CDATA[54.55]]></tablecolumn>
          <tablecolumn name="90%ile(ms)"><![CDATA[89.98]]></tablecolumn>
          <tablecolumn name="95%ile(ms)"><![CDATA[95.24]]></tablecolumn>
          <tablecolumn name="99%ile(ms)"><![CDATA[99.45]]></tablecolumn>
          <tablecolumn name="max(ms)"><![CDATA[94]]></tablecolumn>
          <tablecolumn name="std. dev."><![CDATA[15.74]]></tablecolumn>
          <tablecolumn name="bytes recd(kb)"><![CDATA[192]]></tablecolumn>
        </tablerow>
        <tablerow rowcount="1">
          <tablecolumn name="task"><![CDATA[ virtualuser1]]></tablecolumn>
          <tablecolumn name="status"><![CDATA[success]]></tablecolumn>
          <tablecolumn name="successful"><![CDATA[1]]></tablecolumn>
          <tablecolumn name="failed"><![CDATA[0]]></tablecolumn>
          <tablecolumn name="timedout"><![CDATA[0]]></tablecolumn>
          <tablecolumn name="total"><![CDATA[1]]></tablecolumn>
          <tablecolumn name="min(ms)"><![CDATA[934]]></tablecolumn>
          <tablecolumn name="avg(ms)"><![CDATA[934.00]]></tablecolumn>
          <tablecolumn name="avg-90%(ms)"><![CDATA[950.00]]></tablecolumn>
          <tablecolumn name="90%ile(ms)"><![CDATA[1,000.50]]></tablecolumn>
          <tablecolumn name="95%ile(ms)"><![CDATA[1,000.50]]></tablecolumn>
          <tablecolumn name="99%ile(ms)"><![CDATA[1,000.50]]></tablecolumn>
          <tablecolumn name="max(ms)"><![CDATA[934]]></tablecolumn>
          <tablecolumn name="std. dev."><![CDATA[0.00]]></tablecolumn>
          <tablecolumn name="bytes recd(kb)"><![CDATA[0]]></tablecolumn>
        </tablerow>
      </tabledata>
      <tabledata numrows="1" numcolumns="2">
        <tablerow rowcount="0">
          <tablecolumn name="response time interval (ms)"><![CDATA[0 - 99]]></tablecolumn>
          <tablecolumn name="frequency"><![CDATA[96]]></tablecolumn>
        </tablerow>
      </tabledata>
    </tasksreportnode>
    <tasksreportnode name="task16">
      <tabledata numrows="97" numcolumns="15">
        <tablerow rowcount="0">
          <tablecolumn name="task"><![CDATA[ task16 [get - /pulsev31/appview/projectcommenthidden.jsp - 200]]]></tablecolumn>
          <tablecolumn name="status"><![CDATA[success]]></tablecolumn>
          <tablecolumn name="successful"><![CDATA[96]]></tablecolumn>
          <tablecolumn name="failed"><![CDATA[0]]></tablecolumn>
          <tablecolumn name="timedout"><![CDATA[0]]></tablecolumn>
          <tablecolumn name="total"><![CDATA[96]]></tablecolumn>
          <tablecolumn name="min(ms)"><![CDATA[15]]></tablecolumn>
          <tablecolumn name="avg(ms)"><![CDATA[22.73]]></tablecolumn>
          <tablecolumn name="avg-90%(ms)"><![CDATA[54.55]]></tablecolumn>
          <tablecolumn name="90%ile(ms)"><![CDATA[90.93]]></tablecolumn>
          <tablecolumn name="95%ile(ms)"><![CDATA[96.25]]></tablecolumn>
          <tablecolumn name="99%ile(ms)"><![CDATA[100.50]]></tablecolumn>
          <tablecolumn name="max(ms)"><![CDATA[109]]></tablecolumn>
          <tablecolumn name="std. dev."><![CDATA[14.76]]></tablecolumn>
          <tablecolumn name="bytes recd(kb)"><![CDATA[192]]></tablecolumn>
        </tablerow>
      </tabledata>
    </tasksreportnode>

Here is what I have tried:
    xml = 'f:\\reports\\logs\\result_tg1_v16.xml'
    context = etree.iterparse(xml, events=("start", "end"))
    for event, element in context:
        if element.tag == 'tasksreportnode':
            for child1 in element:
                for child2 in child1:
                    if child2.get("rowcount") == "0":
                        for child3 in child2:
                            print(child3.tag, child3.text)
            element.clear()  # discard the element
    del context

Now I have the rows with rowcount value '0'. How can they be added back to the parent, leaving out the other siblings?
I recommend using lxml, as it is in many regards more efficient than the stdlib xml.etree.ElementTree.
You should not attempt to parse the whole document at once, because it is too large. Instead, process the source document iteratively.
See the lxml documentation pages on event-driven parsing.
There are two options:
- etree.iterparse
- using a custom parser target, firing SAX-like events
I prefer etree.iterparse because it gives you the parsed elements in a more convenient way. You must not forget to clean up the parts you have already processed, otherwise you will not save any memory compared to parsing the whole document at once.
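The clean-up step mentioned above can be sketched as follows. This is only a minimal illustration, and it assumes the stdlib xml.etree.ElementTree instead of lxml so that it runs without extra packages; lxml's etree.iterparse is used in the same way. The function name count_rows and the tiny in-memory sample are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def count_rows(source, row_tag="tablerow"):
    """Count row elements, clearing each one once it has been handled."""
    count = 0
    # With no events argument, only "end" events are reported.
    for event, elem in ET.iterparse(source):
        if elem.tag == row_tag:
            count += 1
            # Drop the children, text and attributes of the finished row
            # so that its content does not pile up in memory.
            elem.clear()
    return count


# Hypothetical sample standing in for the large file.
sample = io.BytesIO(
    b"<root><tabledata>"
    b"<tablerow rowcount='0'><tablecolumn>a</tablecolumn></tablerow>"
    b"<tablerow rowcount='1'><tablecolumn>b</tablecolumn></tablerow>"
    b"</tabledata></root>"
)
row_total = count_rows(sample)
```

Note that a cleared element is still attached to its parent as an empty shell, so for very many rows the shells themselves also have to be removed at some point.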
EDIT: added a real example
An example speaks better than tons of theory. Here is my attempt:
    from lxml import etree

    # fname = "large.xml"    # 78 MB
    fname = "verylarge.xml"  # 773 MB

    toremove = []
    for event, element in etree.iterparse(fname):
        if element.tag == "tablerow":
            if element.attrib["rowcount"] != "0":
                element.clear()
                # removing the current element here causes a segmentation fault
                # element.getparent().remove(element)
                toremove.append(element)
        if element.tag == "tabledata":
            # the parent is complete, so the collected rows can be removed safely
            for rowelm in toremove:
                element.remove(rowelm)
            toremove = []

    # the last processed element is the root one
    # (etree.tostring returns bytes, so the file is opened in binary mode)
    with open("out.xml", "wb") as f:
        f.write(etree.tostring(element))

To test the performance, I took a large sample file (73 MB), repeated its inner part 10 times, got a 773 MB XML file, and processed that.
The processing took 24 seconds (Zenbook, Core i7, 4 GB RAM) and the resulting file is 4.7 MB large.
Example explained
By default, iterparse provides only "end" events, which are fired when an element has been completely parsed.
This solution relies on the fact that, with iterparse, the elements are kept in memory. This is used in the following places:
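The default event behaviour can be checked on a tiny document. The sketch below assumes the stdlib xml.etree.ElementTree (lxml's iterparse has the same default) and a made-up two-element sample.

```python
import io
import xml.etree.ElementTree as ET

doc = b"<tabledata><tablerow rowcount='0'/></tabledata>"

# No events argument: only "end" events are reported, children before parents.
default_events = [(event, elem.tag) for event, elem in ET.iterparse(io.BytesIO(doc))]

# Asking explicitly for both kinds of events.
both_events = [
    (event, elem.tag)
    for event, elem in ET.iterparse(io.BytesIO(doc), events=("start", "end"))
]

print(default_events)
print(both_events)
```

Because a child's "end" event always comes before its parent's "end" event, the parent can be used as the point where work on its children is finished.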
- During iterparse, the elements that are not needed are cleared (element.clear()) and removed (element.remove(rowelm)). clear() clears the inner content of an element, but the element itself still exists. remove() works on the parent element and removes the given child from it.
- The elements we want to keep are neither removed nor cleared, so at the end we find them still present in the root element.
- Finally, when everything is processed, the last processed element is the root one. It is still in memory, so we can write it to a file as a string.
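The difference between clear() and remove() can be seen on a small tree. This is a sketch using the stdlib xml.etree.ElementTree, where both methods exist just as they do in lxml; the sample fragment is made up.

```python
import xml.etree.ElementTree as ET

parent = ET.fromstring(
    '<tabledata>'
    '<tablerow rowcount="1"><tablecolumn>x</tablecolumn></tablerow>'
    '</tabledata>'
)
row = parent[0]

# clear() empties the row (children, text, attributes),
# but the row itself is still a child of its parent.
row.clear()
children_after_clear = len(parent)
row_children_after_clear = len(row)
after_clear = ET.tostring(parent, encoding="unicode")

# remove() is called on the parent and detaches the row from it.
parent.remove(row)
children_after_remove = len(parent)
after_remove = ET.tostring(parent, encoding="unicode")
```

After clear() the parent still has one (empty) child; only after remove() is the child count zero.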
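One has to be careful about when to remove() an element. Trying to remove an element from its parent at the moment when it is the currently iterated element caused a segmentation fault. This is the reason why the code postpones remove() of a "tablerow" element until parsing of its parent "tabledata" element is complete.
The variable toremove is used to collect the "tablerow" elements to be removed, and it is consumed once the parent "tabledata" element has been parsed. Note that remove() works only on the real parent of an element, so we must be sure to call it at the proper time.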
Ideas for larger files
For even larger files, this solution is limited by the size of the resulting XML document, which is kept in memory until pruning of the source XML is completed.
For such scenarios, you would have to write out the output during parsing and get rid of the elements in memory as soon as they are processed. In practice, you would write out the opening part of an XML element (like <taskreportsummary att="a" otheratt="bb">) when its "start" event appears, and write the closing part (</taskreportsummary>) at its "end" event.
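One possible shape of such a streaming variant is sketched below. It is not part of the measured solution above and rests on several assumptions: it uses the stdlib xml.etree.ElementTree rather than lxml, the function name prune_streaming is made up, the document has no namespaces, and elements outside a "tablerow" carry only whitespace text (as in the sample), so their text is not copied. The XML declaration is not written either.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def prune_streaming(source, sink, row_tag="tablerow", keep_value="0"):
    """Copy XML from source to sink, keeping only rows with rowcount == keep_value."""
    depth_in_row = 0  # > 0 while the parser is inside a row element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if depth_in_row or elem.tag == row_tag:
                # Rows are written as a whole at their "end" event.
                depth_in_row += 1
            else:
                # Attributes are already available at the "start" event.
                attrs = "".join(
                    f" {name}={quoteattr(value)}" for name, value in elem.attrib.items()
                )
                sink.write(f"<{elem.tag}{attrs}>")
        elif depth_in_row:
            depth_in_row -= 1
            if depth_in_row == 0:  # a complete row has been parsed
                if elem.get("rowcount") == keep_value:
                    elem.tail = None  # do not copy whitespace following the row
                    sink.write(ET.tostring(elem, encoding="unicode"))
                elem.clear()  # free the row content, kept or not
        else:
            sink.write(f"</{elem.tag}>")
            elem.clear()  # drops the empty shells of already processed children


# Hypothetical miniature input standing in for the large file.
demo_source = io.BytesIO(
    b'<root><tabledata numrows="2">'
    b'<tablerow rowcount="0"><tablecolumn name="a">x</tablecolumn></tablerow>'
    b'<tablerow rowcount="1"><tablecolumn name="a">y</tablecolumn></tablerow>'
    b'</tabledata></root>'
)
demo_sink = io.StringIO()
prune_streaming(demo_source, demo_sink)
demo_output = demo_sink.getvalue()
```

For a real file, source would be the path of the large XML file and sink a file opened for writing in text mode. Rows are only cleared here, never removed while they are the current element, which follows the caution about remove() described earlier.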