Skip over element in large XML file (python 3) - python-3.x

I'm new to xml parsing, and I've been trying to figure out a way to skip over a parent element's contents because there is a nested element that contains a large amount of data in its text attribute (I cannot change how this file is generated). Here's an example of what the xml looks like:
<root>
<Parent>
<thing_1>
<a>I need this</a>
</thing_1>
<thing_2>
<a>I need this</a>
</thing_2>
<thing_3>
<subgroup>
<huge_thing>enormous string here</huge_thing>
</subgroup>
</thing_3>
</Parent>
<Parent>
<thing_1>
<a>I need this</a>
</thing_1>
<thing_2>
<a>I need this</a>
</thing_2>
<thing_3>
<subgroup>
<huge_thing>enormous string here</huge_thing>
</subgroup>
</thing_3>
</Parent>
</root>
I've tried lxml.iterparse and xml.sax implementations to try and work this out, but no dice. These are the majority of the answers I've found in my searches:
Use the tag keyword in iterparse.
This does not work, because, although lxml cleans up the elements in the background, the large text in the element is still parsed into memory, so I'm getting large memory spikes.
Create a flag where you set it to True if the start event for that element is found and then ignore the element in parsing.
This does not work, as the element is still parsed into memory at the end event.
Break before you reach the end event of the specific element.
I cannot just break when I reach the element, because there are multiples of these elements that I need specific children data from.
This is not possible as stream parsers still have an end event and generate the full element.
... ok.
I'm currently trying to directly edit the stream data that the GzipFile sends to iterparse in hopes that it would be able to not even know that the element exists, but I'm running into issues with that. Any direction would be greatly appreciated.

I don't think you can get a parser to selectively ignore some part of the XML it's parsing. Here are my findings using the SAX parser...
I took your sample XML, blew it up to just under 400MB, created a SAX parser, and ran it against my big.xml file two different ways.
For the straightforward approach, sax.parse('big.xml', MyHandler()), memory peaked at 12M.
For a buffered file reader approach, using 4K chunks, parser.feed(chunk), memory peaked at 10M.
I then doubled the size, for an 800M file, re-ran both ways, and the peak memory usage didn't change, ~10M. The SAX parser seems very efficient.
I ran this script against your sample XML to create some really big text nodes, 400M each.
with open('input.xml') as f:
    data = f.read()

with open('big.xml', 'w') as f:
    f.write(data.replace('enormous string here', 'a'*400_000_000))
Here's big.xml's size in MB:
du -ms big.xml
763 big.xml
Here's my SAX ContentHandler, which only handles the character data if the path to the data's parent ends in thing_*/a (which, according to your sample, disqualifies huge_thing)...
BTW, much appreciation to l4mpi for this answer, showing how to buffer the character data you do want:
from xml import sax

class MyHandler(sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []
        self._path = []

    def _getCharacterData(self):
        data = ''.join(self._charBuffer).strip()  # remove strip() if whitespace is important
        self._charBuffer = []
        return data

    def characters(self, data):
        if len(self._path) < 2:
            return
        if self._path[-1] == 'a' and self._path[-2].startswith('thing_'):
            self._charBuffer.append(data)

    def startElement(self, name, attrs):
        self._path.append(name)

    def endElement(self, name):
        self._path.pop()
        if len(self._path) == 0:
            return
        if self._path[-1].startswith('thing_'):
            print(self._path[-1])
            print(self._getCharacterData())
For both the whole-file parse method, and the chunked reader, I get:
thing_1
I need this
thing_2
I need this
thing_3
thing_1
I need this
thing_2
I need this
thing_3
It's printing thing_3 because of my simple logic, but the data in subgroup/huge_thing is ignored.
Here's how I call the handler with the straightforward parse() method:
handler = MyHandler()
sax.parse('big.xml', handler)
When I run that with Unix/BSD time, I get:
/usr/bin/time -l ./main.py
...
1.45 real 0.64 user 0.11 sys
...
11027456 peak memory footprint
Here's how I call the handler with the more complex chunked reader, using a 4K chunk size:
handler = MyHandler()
parser = sax.make_parser()
parser.setContentHandler(handler)

Chunk_Sz = 4096
with open('big.xml') as f:
    chunk = f.read(Chunk_Sz)
    while chunk != '':
        parser.feed(chunk)
        chunk = f.read(Chunk_Sz)
/usr/bin/time -l ./main.py
...
1.85 real 1.65 user 0.19 sys
...
10453952 peak memory footprint
Even with a 512B chunk size, it doesn't get below 10M, but the runtime doubled.
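Since the question mentions feeding iterparse from a GzipFile: the same feed() loop works directly on a gzip stream, so nothing is ever decompressed to disk. A small self-contained sketch (the file name and the trivial handler are stand-ins, not the asker's real ones):

```python
import gzip
from xml import sax

# Create a small gzipped XML file just for demonstration
with gzip.open('big.xml.gz', 'wt', encoding='utf-8') as f:
    f.write('<root><thing_1><a>I need this</a></thing_1></root>')

class CollectA(sax.handler.ContentHandler):
    """Collects the character data of every <a> element."""
    def __init__(self):
        self.values = []
        self._in_a = False
    def startElement(self, name, attrs):
        self._in_a = (name == 'a')
    def characters(self, data):
        if self._in_a:
            self.values.append(data)
    def endElement(self, name):
        self._in_a = False

handler = CollectA()
parser = sax.make_parser()
parser.setContentHandler(handler)

# Feed decompressed chunks straight from the gzip stream
with gzip.open('big.xml.gz', 'rb') as f:
    while chunk := f.read(4096):
        parser.feed(chunk)
parser.close()

print(handler.values)  # ['I need this']
```

The gzip object is just another file-like object here, so chunk size tuning works the same as with the plain file above.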
I'm curious to see what kind of performance you're getting.

You cannot use a DOM parser, as that would by definition stuff the whole document into RAM. But a DOM parser is basically just a SAX parser that creates a DOM as it goes through the SAX events.
When creating your custom SAX parser, you can not only build the DOM (or whatever other in-memory representation you prefer) but also start ignoring events when they relate to some specific location in the document.
Be aware that parsing needs to continue, so you know when to stop ignoring the events; but the output of the parser will not contain this unneeded large chunk of data.
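That idea can be sketched with the standard library (a hedged sketch based on the sample XML above, not a definitive implementation; the skipped tag subgroup comes from that sample): a SAX ContentHandler that feeds an ElementTree TreeBuilder, but suppresses every event inside the skipped subtree, so the huge text never becomes a node in memory.

```python
import io
from xml import sax
from xml.etree.ElementTree import TreeBuilder

class PruningHandler(sax.handler.ContentHandler):
    """Builds an ElementTree, but drops everything beneath <subgroup>."""
    def __init__(self, skip_tag='subgroup'):
        self._skip_tag = skip_tag
        self._skip_depth = 0          # > 0 while inside a skipped subtree
        self._builder = TreeBuilder()
        self.root = None

    def startElement(self, name, attrs):
        if self._skip_depth or name == self._skip_tag:
            self._skip_depth += 1     # track nesting so we know when to resume
            return
        self._builder.start(name, dict(attrs))

    def characters(self, data):
        if not self._skip_depth:
            self._builder.data(data)

    def endElement(self, name):
        if self._skip_depth:
            self._skip_depth -= 1
            return
        elem = self._builder.end(name)
        if name == 'root':
            self.root = elem

xml_doc = """<root><Parent>
<thing_1><a>I need this</a></thing_1>
<thing_3><subgroup><huge_thing>enormous string here</huge_thing></subgroup></thing_3>
</Parent></root>"""

handler = PruningHandler()
sax.parse(io.BytesIO(xml_doc.encode()), handler)
print([a.text for a in handler.root.iter('a')])  # ['I need this']
print(list(handler.root.iter('huge_thing')))     # pruned: []
```

Note that the characters for the huge element are still pushed through the parser (that's unavoidable with a stream parser), but they arrive in small buffers and are discarded immediately, so memory stays flat.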

Related

Why does page 323 from Automate The boring Stuff generate an int of 21?

I'm going back through the book "Automate the Boring Stuff" (which has been a great book, btw) as I need to brush up on CSV parsing for a project, and I'm trying to understand why each output is generated. Why does this code from page 323 create an output of '21', when it's four words, 16 characters, and three commas? Not to mention that I'm entering strings and it outputs numbers.
#%%
import csv
outputFile = open('output.csv', 'w', newline='')
outputWriter = csv.writer(outputFile)
outputWriter.writerow(['spam', 'eggs', 'bacon', 'ham'])
First I thought OK, it's the number of characters, but that adds up to 16. Then I thought maybe each word has a space, plus one at the beginning and end of the CSV file? That might technically explain it, but nothing is explicit; I can't find anything stating what is added or how that number is produced.
There seems like a plausible explanation but I don't understand why it's 21.
I've tried breakpoint or pdb but I'm still learning how to use those, to get the following breakdown which I don't see containing anything that answers it. No counting or summation that I can see.
The docs state that csv.csvwriter.writerow returns "the return value of the call to the write method of the underlying file object.".
In your example
outputFile = open('output.csv', 'w', newline='')
is your underlying file object which you then hand to csv.writer().
If we look a bit deeper we can find the type of outputFile with print(type(outputFile)).
<class '_io.TextIOWrapper'>
While the docs don't explicitly define the write method for TextIOWrapper, they do state that it inherits from TextIOBase, which defines its write() method as "Write the string s to the stream and return the number of characters written.".
If we look at the text file written:
spam,eggs,bacon,ham
We count 19 visible characters; the csv module terminates the row with '\r\n' (its default lineterminator), so 21 characters were written in total.
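You can see both numbers without touching the filesystem, since writerow returns whatever the underlying write() returned:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
returned = writer.writerow(['spam', 'eggs', 'bacon', 'ham'])

print(returned)                     # 21
print(repr(buf.getvalue()))         # 'spam,eggs,bacon,ham\r\n'
print(len('spam,eggs,bacon,ham'))   # 19 visible characters + '\r\n' = 21
```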

Is it possible to remove characters from a compressed file without extracting it?

I have a compressed file that's about 200 MB, in the form of a tar.gz file. I understand that I can extract the xml files in it. It contains several small and one 5 GB xml file. I'm trying to remove certain characters from the xml files.
So my very basic question is: is it even possible to accomplish this without ever extracting the content of the compressed file?
I'm trying to speed up the process of reading through xml files looking for characters to remove.
You will have to decompress, change, and then recompress the files. There's no way around that.
However, this does not necessarily include writing the file to a storage. You might be able to do the changes you like in a streaming fashion, i.e. that everything is only done in memory without ever having the complete decompressed file somewhere. Unix uses pipes for such tasks.
Here is an example on how to do it:
Create two random files:
echo "hello world" > a
echo "hello world" > b
Create a compressed archive containing both:
tar -c -z -f x.tgz a b
Pipe the contents of the uncompressed archive through a changer. Unfortunately I haven't found any shell-based way to do this but you also specified Python in the tags, and with the tarfile module you can achieve this:
Here is the file tar.py:
#!/usr/bin/env python3
import sys
import tarfile

tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    if tar_info.path == 'a':  # my example file names are "a" and "b"
        # now comes the code which makes our change:
        # we just skip the first two bytes in each file:
        reader.read(2)      # skip two bytes
        tar_info.size -= 2  # reduce size in info object as well
    # add the (maybe changed) file to the output:
    tar_out.addfile(tar_info, reader)
tar_out.close()
tar_in.close()
This can be called like this:
./tar.py < x.tgz > y.tgz
y.tgz will contain both files again, but in a, the first two bytes will have been skipped (so its contents will be llo world).
You will have noticed that you need to know the resulting size of your change beforehand. tar is designed to handle files, so it writes the size of each entry into the tar header block which precedes every entry in the resulting archive; I see no way around this. With a compressed output it also isn't possible to seek back after writing all the output and adjust the file size.
But as you phrased your question, this might be possible in your case.
All you will have to do is provide a file-like object (could be a Popen object's output stream) like reader in my simple example case.
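If the change alters the size unpredictably (like removing characters), one workable approach, sketched here on a tiny in-memory archive, is to filter each member into a buffer first, so the new size is known before addfile():

```python
import io
import tarfile

# Build a tiny .tgz in memory to stand in for the real archive
src = io.BytesIO()
with tarfile.open(fileobj=src, mode='w:gz') as t:
    info = tarfile.TarInfo('a')
    payload = b'he\rllo\r world'
    info.size = len(payload)
    t.addfile(info, io.BytesIO(payload))
src.seek(0)

dst = io.BytesIO()
with tarfile.open(fileobj=src, mode='r:gz') as tar_in, \
     tarfile.open(fileobj=dst, mode='w:gz') as tar_out:
    for tar_info in tar_in:
        # Remove every carriage return from the member's contents
        data = tar_in.extractfile(tar_info).read().replace(b'\r', b'')
        tar_info.size = len(data)  # new size is only known after filtering
        tar_out.addfile(tar_info, io.BytesIO(data))

dst.seek(0)
with tarfile.open(fileobj=dst, mode='r:gz') as t:
    result = t.extractfile('a').read()
print(result)  # b'hello world'
```

For a 5 GB member this buffers too much in memory, so in practice you'd spool the filtered data to a temporary file instead of a BytesIO; the principle (filter first, then addfile with the known size) stays the same.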

shutil.copyfileobj but without headers or skip first line [duplicate]

I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).
#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")

stockstats_data = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")
This code works fine, but it is slow. It can take up to 2 days to process.
I was given a single line script for Terminal command line that does the same (but with no headers). This script takes 20 seconds.
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.
Thanks.
If you don't need the CSV in memory, just copying from input to output, it'll be a lot cheaper to avoid parsing at all, and copy without building up in memory:
import glob
import shutil

#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters

with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
        print(fname + " has been imported.")
That's it; shutil.copyfileobj handles efficiently copying the data, dramatically reducing the Python-level work to parse and reserialize. Don't omit the allFiles.sort()!†
This assumes all the CSV files have the same format, encoding, line endings, etc.; that the encoding encodes newlines as a single byte equivalent to ASCII \n which is the last byte in the character (so ASCII and all ASCII-superset encodings work, as do UTF-16-BE and UTF-32-BE, but not UTF-16-LE and UTF-32-LE); and that the header doesn't contain embedded newlines. If that's the case, it's a lot faster than the alternatives.
For the cases where the encoding's version of a newline doesn't look enough like an ASCII newline, or where the input files are in one encoding and the output file should be in a different encoding, you can add the work of encoding and decoding without adding CSV parsing/serializing work. (Add a from io import open if on Python 2, to get Python 3-like efficient encoding-aware file objects, and define known_input_encoding as a string naming the known encoding of the input files, e.g. known_input_encoding = 'utf-16-le', and optionally a different encoding for the output file.)
# Other imports and setup code prior to first with unchanged from before

# Perform encoding to chosen output encoding, disabling line-ending
# translation to avoid conflicting with CSV dialect, matching raw binary behavior
with open('someoutputfile.csv', 'w', encoding=output_encoding, newline='') as outfile:
    for i, fname in enumerate(allFiles):
        # Decode with known encoding, disabling line-ending translation
        # for same reasons as above
        with open(fname, encoding=known_input_encoding, newline='') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing,
            # just letting the file object decode from input and encode to output
            shutil.copyfileobj(infile, outfile)
        print(fname + " has been imported.")
This is still much faster than involving the csv module, especially in modern Python (where the io module has undergone greater and greater optimization, to the point where the cost of decoding and reencoding is pretty minor, especially next to the cost of performing I/O in the first place). It's also a good validity check for self-checking encodings (e.g. the UTF family) even if the encoding is not supposed to change; if the data doesn't match the assumed self-checking encoding, it's highly unlikely to decode validly, so you'll get an exception rather than silent misbehavior.
Because some of the duplicates linked here are looking for an even faster solution than copyfileobj, some options:
The only succinct, reasonably portable option is to continue using copyfileobj and explicitly pass a non-default length parameter, e.g. shutil.copyfileobj(infile, outfile, 1 << 20) (1 << 20 is 1 MiB, a number which shutil has switched to for plain shutil.copyfile calls on Windows due to superior performance).
Still portable, but only working for binary files and not succinct, would be to copy the underlying code copyfile uses on Windows, which uses a reusable bytearray buffer with a larger size than copyfileobj's default (1 MiB, rather than 64 KiB), removing some allocation overhead that copyfileobj can't fully avoid for large buffers. You'd replace shutil.copyfileobj(infile, outfile) with the following code (using 3.8+'s walrus operator, :=, for brevity), adapted from CPython 3.10's implementation of shutil._copyfileobj_readinto (which you could always use directly if you don't mind using non-public APIs):
buf_length = 1 << 20  # 1 MiB buffer; tweak to preference
# Using a memoryview gets zero copy performance when short reads occur
with memoryview(bytearray(buf_length)) as mv:
    while n := infile.readinto(mv):
        if n < buf_length:
            with mv[:n] as smv:
                outfile.write(smv)
        else:
            outfile.write(mv)
Non-portably, if you can (in any way you feel like) determine the precise length of the header, and you know it will not change by even a byte in any other file, you can write the header directly, then use OS-specific calls similar to what shutil.copyfile uses under the hood to copy the non-header portion of each file, using OS-specific APIs that can do the work with a single system call (regardless of file size) and avoid extra data copies (by pushing all the work to in-kernel or even within file-system operations, removing copies to and from user space) e.g.:
a. On Linux kernel 2.6.33 and higher (and any other OS that allows the sendfile(2) system call to work between open files), you can replace the .readline() and copyfileobj calls with:
filesize = os.fstat(infile.fileno()).st_size # Get underlying file's size
os.sendfile(outfile.fileno(), infile.fileno(), header_len_bytes, filesize - header_len_bytes)
To make it signal resilient, it may be necessary to check the return value from sendfile, and track the number of bytes sent + skipped and the number remaining, looping until you've copied them all (these are low level system calls, they can be interrupted).
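That retry loop might look like the following (Linux-specific, since sendfile between regular files requires kernel 2.6.33+; the header_len_bytes value and temp files here are purely illustrative):

```python
import os
import tempfile

def sendfile_all(out_fd, in_fd, offset, count):
    """Copy `count` bytes from in_fd starting at `offset`, retrying short sends."""
    while count > 0:
        sent = os.sendfile(out_fd, in_fd, offset, count)
        if sent == 0:
            raise EOFError('unexpected end of input file')
        offset += sent
        count -= sent

# Demo: copy everything after a 7-byte header ("header\n")
with tempfile.TemporaryFile() as infile, tempfile.TemporaryFile() as outfile:
    infile.write(b'header\nbody line 1\nbody line 2\n')
    infile.flush()
    header_len_bytes = 7
    filesize = os.fstat(infile.fileno()).st_size
    sendfile_all(outfile.fileno(), infile.fileno(),
                 header_len_bytes, filesize - header_len_bytes)
    outfile.seek(0)
    copied = outfile.read()

print(copied)  # b'body line 1\nbody line 2\n'
```

Because sendfile takes an explicit offset, the loop never touches the Python-level file position of infile; it only advances its own offset/count bookkeeping.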
b. On Linux kernel 4.5+, with Python 3.8+ built against glibc >= 2.27, where the files are all on the same filesystem, you can replace sendfile with copy_file_range:
filesize = os.fstat(infile.fileno()).st_size # Get underlying file's size
os.copy_file_range(infile.fileno(), outfile.fileno(), filesize - header_len_bytes, header_len_bytes)
With similar caveats about checking for copying fewer bytes than expected and retrying.
c. On OSX/macOS, you can use the completely undocumented, and therefore even less portable/stable API shutil.copyfile uses, posix._fcopyfile for a similar purpose, with something like (completely untested, and really, don't do this; it's likely to break across even minor Python releases):
infile.seek(header_len_bytes) # Skip past header
posix._fcopyfile(infile.fileno(), outfile.fileno(), posix._COPYFILE_DATA)
which assumes fcopyfile pays attention to the seek position (docs aren't 100% on this) and, as noted, is not only macOS-specific, but uses undocumented CPython internals that could change in any release.
† An aside on sorting the results of glob: That allFiles.sort() call should not be omitted; glob imposes no ordering on the results, and for reproducible results, you'll want to impose some ordering (it wouldn't be great if the same files, with the same names and data, produced an output file in a different order simply because in-between runs, a file got moved out of the directory, then back in, and changed the native iteration order). Without the sort call, this code (and all other Python+glob module answers) will not reliably read from a directory containing a.csv and b.csv in alphabetical (or any other useful) order; it'll vary by OS, file system, and often the entire history of file creation/deletion in the directory in question. This has broken stuff before in the real world, see details at A Code Glitch May Have Caused Errors In More Than 100 Published Studies.
Are you required to do this in Python? If you are open to doing this entirely in shell, all you'd need to do is first cat the header row from a randomly selected input .csv file into merged.csv before running your one-liner:
cat a-randomly-selected-csv-file.csv | head -n1 > merged.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
You don't need pandas for this; the plain csv module works fine.
import csv

df_out_filename = 'df_out.csv'
write_headers = True
with open(df_out_filename, 'w', newline='') as fout:
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            headers = next(reader)
            if write_headers:
                write_headers = False  # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)  # Write all remaining rows.
Here's a simpler approach using pandas (though I am not sure how it will help with RAM usage):
import glob
import pandas as pd

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")

stockstats_data = pd.DataFrame()
for file_ in allFiles:
    df = pd.read_csv(file_)
    stockstats_data = pd.concat((df, stockstats_data), axis=0)

Output data from subprocess command line by line

I am trying to read a large data file (millions of rows, in a very specific format) using a pre-built (in C) routine. I then want to yield the results, line by line, via a generator function.
I can read the file OK, but where as just running:
<command> <filename>
directly in Linux will print the results line by line as it finds them, I've had no luck trying to replicate this within my generator function. It seems to output the entire lot as a single string that I need to split on newlines, and of course everything then needs reading before I can yield line 1.
This code will read the file, no problem:
import subprocess
import config
file_cmd = '<command> <filename>'
for rec in subprocess.check_output([file_cmd], shell=True).decode(config.ENCODING).split('\n'):
    yield rec
(ENCODING is set in config.py to iso-8859-1 - it's a Swedish site)
The code I have works, in that it gives me the data, but in doing so, it tries to hold the whole lot in memory. I have larger files than this to process which are likely to blow the available memory, so this isn't an option.
I've played around with bufsize on Popen, but not had any success (and also, I can't decode or split after the Popen, though I guess the fact I need to split right now is actually my problem!).
I think I have this working now, so will answer my own question in the event somebody else is looking for this later ...
import shlex
import subprocess

proc = subprocess.Popen(shlex.split(file_cmd), stdout=subprocess.PIPE)
while True:
    output = proc.stdout.readline()
    if output == b'' and proc.poll() is not None:
        break
    if output:
        yield output.decode(config.ENCODING).strip()
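For what it's worth, you can also iterate the pipe directly, which reads lazily without the poll() bookkeeping (a sketch; file_cmd and the encoding are as above, and echo here is only a stand-in command):

```python
import shlex
import subprocess

def stream_lines(cmd, encoding='iso-8859-1'):
    """Yield the command's stdout one decoded line at a time."""
    with subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE) as proc:
        for raw in proc.stdout:  # reads one line at a time, not the whole output
            yield raw.decode(encoding).rstrip('\n')

print(list(stream_lines('echo hello')))  # ['hello']
```

The file object wrapping proc.stdout does its own buffering, so this stays line-by-line from the caller's point of view while still reading the pipe in efficient chunks.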

Parse a huge JSON file

I have a very large JSON file (about a gigabyte) which I want to parse.
I tried the JsonSlurper, but it looks like it tries to load the whole file into memory, which causes an out-of-memory exception.
Here is a piece of code I have:
def parser = new JsonSlurper().setType(JsonParserType.CHARACTER_SOURCE)
def result = parser.parse(new File("equity_listing_full_201604160411.json"))
result.each {
    println it.Listing.ID
}
And the JSON is something like this, but much longer, with more columns and rows:
[
{"Listing": {"ID":"2013056","VERSION":"20160229:053120:000","NAME":"XXXXXX","C_ID":["1927445"],}},
{"Listing": {"ID":"2013057","VERSION":"20160229:053120:000","NAME":"XXXXXX","C_ID":["1927446"],}},
{"Listing": {"ID":"2013058","VERSION":"20160229:053120:000","NAME":"XXXXXX","C_ID":["1927447"],}}
]
I want to be able to read it row by row. I can probably just parse each row separately, but was thinking that there might be something for parsing as you read.
I suggest using Gson by Google.
It has a streaming parsing option: https://sites.google.com/site/gson/streaming
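For readers working from Python (the language of the other answers on this page) rather than Groovy/Java: the one-object-per-line shape of this file means each row can be parsed on its own with the stdlib json module, a sketch assuming that layout holds (and that each line is itself valid JSON once the trailing comma and brackets are stripped):

```python
import io
import json

def listings(fileobj):
    """Yield one parsed {"Listing": ...} dict per line of a [ {...}, {...} ] file."""
    for line in fileobj:
        line = line.strip().rstrip(',')
        if line in ('[', ']', ''):
            continue  # skip the array brackets and blank lines
        yield json.loads(line)

sample = io.StringIO('''[
{"Listing": {"ID":"2013056","NAME":"XXXXXX"}},
{"Listing": {"ID":"2013057","NAME":"XXXXXX"}}
]''')
ids = [row['Listing']['ID'] for row in listings(sample)]
print(ids)  # ['2013056', '2013057']
```

This keeps only one row in memory at a time, which is the same property the Gson streaming API gives you.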
