Python 3: Split Concatenated Xml Files
Solution 1:
Assuming the input file is rather large and you don't want to load it into memory in full, it makes sense to stream it.
The ideal tool would be a generator that breaks the stream of incoming lines from the file into chunks at certain points, i.e. whenever a line is equal to your "splitting" line.
This can be generalized as a function that can split any iterable into groups. itertools.groupby lends itself to the task: all we need to do is increment an index when we hit the "split here" value, and use that index as the group key:
from itertools import groupby

def split_chunks(values, split_val):
    '''splits a list of values into chunks at a certain value'''
    index = 0

    def chunk_index(val):
        # every occurrence of split_val starts a new group
        nonlocal index
        if val == split_val:
            index += 1
        return index

    return groupby(values, chunk_index)
Test - let's split a list of numbers into chunks at 0:
for i, numbers in split_chunks([0,1,2,3,0,4,5,6,0,7,8,9], 0):
    print(list(numbers))
prints
[0, 1, 2, 3]
[0, 4, 5, 6]
[0, 7, 8, 9]
No empty chunk appears before the first 0, because groupby only creates a group once it has seen a value for it. This is different from splitting a string: 'abcabc'.split('a') returns an empty string as its first element.
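To make the edge behaviour concrete, here is a small self-contained sketch. It repeats split_chunks so that it runs on its own; chunks_as_lists is a hypothetical helper introduced only for this illustration.

```python
from itertools import groupby

def split_chunks(values, split_val):
    # Same helper as above, repeated so this snippet runs on its own.
    index = 0

    def chunk_index(val):
        nonlocal index
        if val == split_val:
            index += 1
        return index

    return groupby(values, chunk_index)

def chunks_as_lists(values, split_val):
    # Hypothetical helper: materialize every chunk so it can be inspected.
    return [list(chunk) for _, chunk in split_chunks(values, split_val)]

# Input starts with the split value: no empty leading chunk.
starts_with_split = chunks_as_lists([0, 1, 2, 0, 3], 0)

# Values before the first split value form a chunk of their own (key 0).
starts_with_other = chunks_as_lists([9, 0, 1, 0, 2], 0)

# str.split, in contrast, returns an empty first element.
string_parts = 'abcabc'.split('a')

print(starts_with_split)
print(starts_with_other)
print(string_parts)
```

Note that the split value stays inside each chunk as its first element, which is what you want for XML declarations, while str.split drops the separator.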
So this works. Usage with "lines in a large text file" instead of "numbers" is simple:
import xml.etree.ElementTree as ET

# lines read from a file keep their trailing newline, so include it in the split value
xml_declaration = '<?xml version="1.0"?>\n'

with open('large_container_file', 'r', encoding='utf8') as container_file:
    for doc_num, doc in split_chunks(container_file, xml_declaration):
        print(f'processing sub-document #{doc_num}')
        tree = ET.fromstringlist(doc)
Make sure you open the container file with the correct encoding.
Since generators only do work when you advance the iteration, reading of the large_container_file stops while you process the current tree, so memory usage should be fairly low independently of the input file size.
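The laziness can be checked with a small experiment. This is only a sketch under assumptions: io.StringIO stands in for the large container file, and counting_lines, lines_read and progress are hypothetical names added for illustration. It records how many lines have been pulled from the input each time a sub-document has been parsed.

```python
import io
import xml.etree.ElementTree as ET
from itertools import groupby

def split_chunks(values, split_val):
    # Same helper as above, repeated so this snippet runs on its own.
    index = 0

    def chunk_index(val):
        nonlocal index
        if val == split_val:
            index += 1
        return index

    return groupby(values, chunk_index)

DECL = '<?xml version="1.0"?>\n'

# Stand-in for a large container file: three tiny documents, six lines.
container = io.StringIO(
    DECL + '<a>1</a>\n' +
    DECL + '<b>2</b>\n' +
    DECL + '<c>3</c>\n'
)

lines_read = {'count': 0}

def counting_lines(lines):
    # Pass lines through unchanged, counting each one as it is read.
    for line in lines:
        lines_read['count'] += 1
        yield line

progress = []
for doc_num, doc in split_chunks(counting_lines(container), DECL):
    root = ET.fromstringlist(doc)
    # Record how far into the input we are after parsing this document.
    progress.append((root.tag, lines_read['count']))

print(progress)
```

groupby has to read one line past the end of a chunk to notice that the chunk is over, so the reader is never more than one line ahead of the document being processed.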
doc is a lazy iterator in this scenario, which is good, because it is very memory-efficient. But in contrast to a list, you can't easily find out up front whether it contains a usable document. It will not, for example, if the file has blank lines before the very first '<?xml version="1.0"?>' line, because those lines form a chunk of their own.
ET.fromstringlist() is happy with iterators, but it will throw when the chunk contains no XML element. However, it will also throw when there is an error in the XML, so what I would do is add a try:
try:
    tree = ET.fromstringlist(doc)
except ET.ParseError:
    # no element found or malformed XML: skip this chunk
    pass
Alternatively, you can call list() up front and then check whether there are any non-blank lines:
lines = list(doc)
if any(line.strip() for line in lines):
    tree = ET.fromstringlist(lines)
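Putting the pieces together, a minimal end-to-end sketch could look like the following. It is not the only way to do this: parse_documents is a hypothetical wrapper name, the sample input is made up, and io.StringIO again stands in for the real container file. The sample deliberately contains a blank line before the first declaration and one broken document, so both failure cases are exercised.

```python
import io
import xml.etree.ElementTree as ET
from itertools import groupby

DECL = '<?xml version="1.0"?>\n'

def split_chunks(values, split_val):
    # Same helper as above, repeated so this snippet runs on its own.
    index = 0

    def chunk_index(val):
        nonlocal index
        if val == split_val:
            index += 1
        return index

    return groupby(values, chunk_index)

def parse_documents(lines, split_val=DECL):
    # Hypothetical wrapper: yield (doc_num, root) for every chunk that
    # parses, and report the chunks that do not.
    for doc_num, doc in split_chunks(lines, split_val):
        try:
            root = ET.fromstringlist(doc)
        except ET.ParseError as err:
            print(f'skipping sub-document #{doc_num}: {err}')
            continue
        yield doc_num, root

# Made-up sample data: a leading blank line and one unclosed element.
sample = io.StringIO(
    '\n' +
    DECL + '<a>1</a>\n' +
    DECL + '<b>unclosed\n' +
    DECL + '<c>3</c>\n'
)

parsed = [(num, root.tag) for num, root in parse_documents(sample)]
print(parsed)
```

Catching only ET.ParseError keeps unrelated problems, such as an I/O error or a typo in your own processing code, from being silently swallowed.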