I'm going back through the book "Automate the Boring Stuff" (which has been a great book, by the way) as I need to brush up on CSV parsing for a project, and I'm trying to understand why each output is generated. Why does this code from page 323 produce an output of '21', when it's four words, 16 characters, and three commas? Not to mention that I'm entering strings and it outputs a number.
#%%
import csv
outputFile = open('output.csv', 'w', newline='')
outputWriter = csv.writer(outputFile)
outputWriter.writerow(['spam', 'eggs', 'bacon', 'ham'])
First I thought OK, it's the number of characters, but that only adds up to 16. Then I thought maybe each word gets a space, plus one at the beginning and end of the CSV file? That could technically account for it, but nothing in the book states it explicitly; it reads as if it's supposed to be obvious, and I can't find any reference to how that number is actually produced.
There seems to be a plausible explanation, but I still don't understand why it's 21.
I've tried breakpoint() and pdb (I'm still learning how to use those), but the breakdown I get doesn't seem to contain anything that answers it. There's no counting or summation that I can see.
The docs state that csvwriter.writerow() returns "the return value of the call to the write method of the underlying file object".
In your example
outputFile = open('output.csv', 'w', newline='')
is your underlying file object which you then hand to csv.writer().
If we look a bit deeper we can find the type of outputFile with print(type(outputFile)).
<class '_io.TextIOWrapper'>
While the docs don't explicitly define the write method for TextIOWrapper, they do state that it inherits from TextIOBase, which defines its write() method as "Write the string s to the stream and return the number of characters written.".
If we look at the text file written:
spam,eggs,bacon,ham
we see 19 visible characters, plus the '\r\n' line terminator that csv.writer appends to each row by default, which gives the 21 characters that writerow() reported.
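As a quick check (a minimal sketch, using io.StringIO so nothing is written to disk and the same row as your example), you can count the characters yourself:

import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
returned = writer.writerow(['spam', 'eggs', 'bacon', 'ham'])
print(returned)             # 21, the value writerow() passes back from write()
print(len(buf.getvalue()))  # 21: 16 letters + 3 commas + '\r\n'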
I have a compressed file that's about 200 MB, in the form of a tar.gz file. I understand that I can extract the XML files from it. It contains several small XML files and one 5 GB XML file. I'm trying to remove certain characters from the XML files.
So my very basic question is: is it even possible to accomplish this without ever extracting the content of the compressed file?
I'm trying to speed up the process of reading through xml files looking for characters to remove.
You will have to decompress, change, and then recompress the files. There's no way around that.
However, this does not necessarily mean writing the decompressed file to storage. You might be able to make your changes in a streaming fashion, i.e. everything is done in memory without ever having the complete decompressed file anywhere. Unix uses pipes for such tasks.
Here is an example on how to do it:
Create two random files:
echo "hello world" > a
echo "hello world" > b
Create a compressed archive containing both:
tar -c -z -f x.tgz a b
Pipe the contents of the uncompressed archive through a changer. Unfortunately I haven't found any shell-based way to do this but you also specified Python in the tags, and with the tarfile module you can achieve this:
Here is the file tar.py:
#!/usr/bin/env python3
import sys
import tarfile
# 'r|gz' / 'w|gz' are tarfile's streaming (pipe) modes; the ':' modes expect a
# seekable file, which stdin/stdout are not.
tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r|gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w|gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    if tar_info.path == 'a':  # my example file names are "a" and "b"
        # now comes the code which makes our change:
        # we just skip the first two bytes in each file:
        reader.read(2)      # skip two bytes
        tar_info.size -= 2  # reduce size in info object as well
    # add the (maybe changed) file to the output:
    tar_out.addfile(tar_info, reader)
tar_out.close()
tar_in.close()
This can be called like this:
./tar.py < x.tgz > y.tgz
y.tgz will contain both files again, but in a, the first two bytes will have been skipped (so its contents will be llo world).
You will have noticed that you need to know the resulting size of your change beforehand. tar is designed to handle files, so it needs to write the size of each entry into the tar header block that precedes the entry's data in the resulting archive; I see no way around this. With compressed output it also isn't possible to seek back after writing everything and adjust the size field afterwards.
But as you phrased your question, this might well be possible in your case: since you only want to remove certain characters, you can know the resulting size in advance if you can determine how many characters will be removed from each file.
All you will have to do is provide a file-like object (could be a Popen object's output stream) like reader in my simple example case.
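For illustration, here is a rough, untested sketch of such a file-like wrapper (the StripBytes class and the choice of b'\r' as the byte to remove are my own inventions, not part of the code above); you would pass StripBytes(reader) to tar_out.addfile() after reducing tar_info.size by the number of removed bytes, which you still have to determine beforehand:

class StripBytes:
    """Minimal read()-only wrapper that drops every occurrence of one byte."""
    def __init__(self, raw, to_strip=b'\r'):
        self.raw = raw            # underlying file-like object (e.g. the tar reader)
        self.to_strip = to_strip  # single byte to drop, so chunk boundaries are not an issue
        self.buf = b''

    def read(self, size=-1):
        # Refill the buffer until we can satisfy the request or the source is exhausted
        while size < 0 or len(self.buf) < size:
            chunk = self.raw.read(64 * 1024)
            if not chunk:
                break
            self.buf += chunk.replace(self.to_strip, b'')
        if size < 0:
            out, self.buf = self.buf, b''
        else:
            out, self.buf = self.buf[:size], self.buf[size:]
        return out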
I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).
import glob
import pandas as pd

#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")
This code works fine, but it is slow. It can take up to 2 days to process.
I was given a single line script for Terminal command line that does the same (but with no headers). This script takes 20 seconds.
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.
Thanks.
If you don't need the CSV in memory, just copying from input to output, it'll be a lot cheaper to avoid parsing at all, and copy without building up in memory:
import shutil
import glob
#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort() # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
        print(fname + " has been imported.")
That's it; shutil.copyfileobj handles copying the data efficiently, eliminating the Python-level work of parsing and reserializing. Don't omit the `allFiles.sort()`!†
This assumes all the CSV files have the same format, encoding, line endings, etc.; that the encoding is one where newlines appear as a single byte equivalent to ASCII \n and it's the last byte in the character (so ASCII and all ASCII-superset encodings work, as do UTF-16-BE and UTF-32-BE, but not UTF-16-LE and UTF-32-LE); and that the header doesn't contain embedded newlines. If that's the case, it's a lot faster than the alternatives.
For the cases where the encoding's version of a newline doesn't look enough like an ASCII newline, or where the input files are in one encoding and the output file should be in a different one, you can add the work of encoding and decoding without adding CSV parsing/serializing work. Add a from io import open if on Python 2 (to get Python 3-like efficient encoding-aware file objects), define known_input_encoding as a string naming the known encoding of the input files (e.g. known_input_encoding = 'utf-16-le'), and optionally define output_encoding for the output file:
# Other imports and setup code prior to the first with block unchanged from before

# Perform encoding to chosen output encoding, disabling line-ending
# translation to avoid conflicting with CSV dialect, matching raw binary behavior
with open('someoutputfile.csv', 'w', encoding=output_encoding, newline='') as outfile:
    for i, fname in enumerate(allFiles):
        # Decode with known encoding, disabling line-ending translation
        # for same reasons as above
        with open(fname, encoding=known_input_encoding, newline='') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing,
            # just letting the file object decode from input and encode to output
            shutil.copyfileobj(infile, outfile)
        print(fname + " has been imported.")
This is still much faster than involving the csv module, especially in modern Python (where the io module has undergone greater and greater optimization, to the point where the cost of decoding and reencoding is pretty minor, especially next to the cost of performing I/O in the first place). It's also a good validity check for self-checking encodings (e.g. the UTF family) even if the encoding is not supposed to change; if the data doesn't match the assumed self-checking encoding, it's highly unlikely to decode validly, so you'll get an exception rather than silent misbehavior.
Because some of the duplicates linked here are looking for an even faster solution than copyfileobj, some options:
The only succinct, reasonably portable option is to continue using copyfileobj and explicitly pass a non-default length parameter, e.g. shutil.copyfileobj(infile, outfile, 1 << 20) (1 << 20 is 1 MiB, a number which shutil has switched to for plain shutil.copyfile calls on Windows due to superior performance).
Still portable, but only works for binary files and not succinct, would be to copy the underlying code copyfile uses on Windows, which uses a reusable bytearray buffer with a larger size than copyfileobj's default (1 MiB, rather than 64 KiB), removing some allocation overhead that copyfileobj can't fully avoid for large buffers. You'd replace shutil.copyfileobj(infile, outfile) with (3.8+'s walrus operator, :=, used for brevity) the following code adapted from CPython 3.10's implementation of shutil._copyfileobj_readinto (which you could always use directly if you don't mind using non-public APIs):
buf_length = 1 << 20  # 1 MiB buffer; tweak to preference
# Using a memoryview gets zero-copy performance when short reads occur
with memoryview(bytearray(buf_length)) as mv:
    while n := infile.readinto(mv):
        if n < buf_length:
            with mv[:n] as smv:
                outfile.write(smv)
        else:
            outfile.write(mv)
Non-portably, if you can (in any way you feel like) determine the precise length of the header, and you know it will not change by even a byte in any other file, you can write the header directly, then use OS-specific calls similar to what shutil.copyfile uses under the hood to copy the non-header portion of each file, using OS-specific APIs that can do the work with a single system call (regardless of file size) and avoid extra data copies (by pushing all the work to in-kernel or even within file-system operations, removing copies to and from user space) e.g.:
a. On Linux kernel 2.6.33 and higher (and any other OS that allows the sendfile(2) system call to work between open files), you can replace the .readline() and copyfileobj calls with:
filesize = os.fstat(infile.fileno()).st_size # Get underlying file's size
os.sendfile(outfile.fileno(), infile.fileno(), header_len_bytes, filesize - header_len_bytes)
To make it signal-resilient, it may be necessary to check the return value from sendfile, and track the number of bytes sent + skipped and the number remaining, looping until you've copied them all (these are low-level system calls; they can be interrupted). A sketch of such a loop follows after item c below.
b. On any system Python 3.8+ built with glibc >= 2.27 (or on Linux kernel 4.5+), where the files are all on the same filesystem, you can replace sendfile with copy_file_range:
filesize = os.fstat(infile.fileno()).st_size # Get underlying file's size
os.copy_file_range(infile.fileno(), outfile.fileno(), filesize - header_len_bytes, header_len_bytes)
With similar caveats about checking for copying fewer bytes than expected and retrying.
c. On OSX/macOS, you can use the completely undocumented, and therefore even less portable/stable API shutil.copyfile uses, posix._fcopyfile for a similar purpose, with something like (completely untested, and really, don't do this; it's likely to break across even minor Python releases):
infile.seek(header_len_bytes) # Skip past header
posix._fcopyfile(infile.fileno(), outfile.fileno(), posix._COPYFILE_DATA)
which assumes fcopyfile pays attention to the seek position (docs aren't 100% on this) and, as noted, is not only macOS-specific, but uses undocumented CPython internals that could change in any release.
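To illustrate the retry point from (a) above, here is a rough sketch of such a loop (assuming the filesize and header_len_bytes variables from that snippet; os.sendfile returns the number of bytes it actually copied, which can be less than requested):

offset = header_len_bytes
remaining = filesize - header_len_bytes
while remaining > 0:
    # sendfile may copy fewer bytes than requested, e.g. if interrupted by a signal
    sent = os.sendfile(outfile.fileno(), infile.fileno(), offset, remaining)
    if sent == 0:
        break  # unexpected end of input
    offset += sent
    remaining -= sent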
† An aside on sorting the results of glob: That allFiles.sort() call should not be omitted; glob imposes no ordering on the results, and for reproducible results, you'll want to impose some ordering (it wouldn't be great if the same files, with the same names and data, produced an output file in a different order simply because in-between runs, a file got moved out of the directory, then back in, and changed the native iteration order). Without the sort call, this code (and all other Python+glob module answers) will not reliably read from a directory containing a.csv and b.csv in alphabetical (or any other useful) order; it'll vary by OS, file system, and often the entire history of file creation/deletion in the directory in question. This has broken stuff before in the real world, see details at A Code Glitch May Have Caused Errors In More Than 100 Published Studies.
Are you required to do this in Python? If you are open to doing this entirely in shell, all you'd need to do is first cat the header row from a randomly selected input .csv file into merged.csv before running your one-liner:
cat a-randomly-selected-csv-file.csv | head -n1 > merged.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
You don't need pandas for this, just the simple csv module would work fine.
import csv

# allFiles is the same list of input CSV paths built with glob in the question
df_out_filename = 'df_out.csv'
write_headers = True
with open(df_out_filename, 'w', newline='') as fout:  # Python 3: text mode with newline=''
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            headers = next(reader)        # Python 3: next(reader) instead of reader.next()
            if write_headers:
                write_headers = False     # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)      # Write all remaining rows.
Here's a simpler approach using pandas (though I am not sure how much it will help with RAM usage):

import pandas as pd
import glob

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()

for file_ in allFiles:
    df = pd.read_csv(file_)
    stockstats_data = pd.concat((df, stockstats_data), axis=0)
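If memory allows, a common variant of the same idea (my own suggestion, not part of the answer above) is usually noticeably faster: read all the files first and concatenate once at the end, instead of re-concatenating a growing frame inside the loop:

# Read every file into a list of DataFrames, then concatenate a single time
frames = [pd.read_csv(file_) for file_ in allFiles]
stockstats_data = pd.concat(frames, ignore_index=True)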
I am trying to read a large data file (millions of rows, in a very specific format) using a pre-built (in C) routine. I then want to yield the results, line by line, via a generator function.
I can read the file OK, but whereas just running:
<command> <filename>
directly in Linux will print the results line by line as it finds them, I've had no luck replicating this within my generator function. It seems to output the entire lot as a single string that I need to split on newlines, and of course everything then needs reading before I can yield line 1.
This code will read the file, no problem:
import subprocess
import config
file_cmd = '<command> <filename>'
# check_output reads the entire output before returning, which is the memory problem
for rec in subprocess.check_output(file_cmd, shell=True).decode(config.ENCODING).split('\n'):
    yield rec
(ENCODING is set in config.py to iso-8859-1 - it's a Swedish site)
The code I have works, in that it gives me the data, but in doing so, it tries to hold the whole lot in memory. I have larger files than this to process which are likely to blow the available memory, so this isn't an option.
I've played around with bufsize on Popen, but not had any success (and also, I can't decode or split after the Popen, though I guess the fact I need to split right now is actually my problem!).
I think I have this working now, so will answer my own question in the event somebody else is looking for this later ...
import shlex
import subprocess
import config

proc = subprocess.Popen(shlex.split(file_cmd), stdout=subprocess.PIPE)
while True:
    output = proc.stdout.readline()
    # readline() returns b'' at EOF; only stop once the process has also exited
    if output == b'' and proc.poll() is not None:
        break
    if output:
        yield output.decode(config.ENCODING).strip()
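For what it's worth, a slightly simpler variant of the same idea (a sketch, assuming Python 3.6+ where Popen accepts an encoding argument) is to open the pipe in text mode and iterate it directly:

proc = subprocess.Popen(shlex.split(file_cmd), stdout=subprocess.PIPE,
                        encoding=config.ENCODING)
for line in proc.stdout:   # iterating the pipe yields one line at a time
    yield line.rstrip('\n')
proc.wait()                # reap the child process once the output is exhausted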
I have a very large JSON file (about a gigabyte) which I want to parse.
I tried the JsonSlurper, but it looks like it tries to load the whole file into memory, which causes an out-of-memory exception.
Here is a piece of code I have:
def parser = new JsonSlurper().setType(JsonParserType.CHARACTER_SOURCE);
def result = parser.parse(new File("equity_listing_full_201604160411.json"))
result.each {
    println it.Listing.ID
}
And the JSON is something like this, but much longer, with more columns and rows:
[
{"Listing": {"ID":"2013056","VERSION":"20160229:053120:000","NAME":"XXXXXX","C_ID":["1927445"],}},
{"Listing": {"ID":"2013057","VERSION":"20160229:053120:000","NAME":"XXXXXX","C_ID":["1927446"],}},
{"Listing": {"ID":"2013058","VERSION":"20160229:053120:000","NAME":"XXXXXX","C_ID":["1927447"],}}
]
I want to be able to read it row by row. I can probably just parse each row separately, but was thinking that there might be something for parsing as you read.
I suggest using GSON by Google.
It has a streaming parsing option: https://sites.google.com/site/gson/streaming