I wanted to compare the content of two files compressed from the same source file with gzip.open() from the Python 3 standard library. Below is the code creating the two compressed files:
import shutil
import gzip

with open('file1.txt', 'rb') as f_in:
    with gzip.open('output1.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

with open('file1.txt', 'rb') as f_in:
    with gzip.open('output2.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
Then I compared the gzipped files with the filecmp module from the standard library:
import filecmp
print(filecmp.cmp("output1.gz", "output2.gz", shallow=False))
That returns False because of the different timestamps and filenames stored in the headers of the files, as pointed out by Kris in this thread.
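For illustration, the header difference is easy to see directly; a quick peek at the raw bytes (assuming the two files created above):

# The 10-byte fixed header ends with MTIME in bytes 4-7, and the FNAME field
# ("output1" / "output2") follows it, so these dumps will differ.
with open('output1.gz', 'rb') as a, open('output2.gz', 'rb') as b:
    print(a.read(16).hex())
    print(b.read(16).hex())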
According to RFC 1952, the filename field of the header, which comes after the timestamp field, is terminated by a null byte. So I wrote a function that looks for this byte and then hashes the remaining bytes with hashlib.md5() from the standard library, and I compare the resulting hashes of the two files.
import hashlib

def compareContent(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f_in:
        # Skip the header: read byte by byte until the null terminator
        # of the FNAME field is found.
        while True:
            text = f_in.read(1)
            if text == b'\x00':
                break
        # Hash the rest of the file in chunks.
        while True:
            data = f_in.read(8192)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()
hash_one = compareContent("output1.gz")
hash_two = compareContent("output2.gz")
print(hash_one == hash_two)
This finally returns True and seems to work well.
I'm just a little bit surprised that I could not find an already existing function to do this work.
As a beginner, my questions are:
Am I missing something? Does such a function already exist?
If not, is there any better way to achieve this or to improve my code?
Here is the code available in a Colab notebook: compare gzipped-files
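One simpler alternative would be to hash the decompressed content instead of skipping the header, since that sidesteps the header differences entirely; a minimal sketch along those lines (hash_decompressed is a name introduced just for this example):

import gzip
import hashlib

def hash_decompressed(filename):
    """Hash the decompressed payload so the gzip headers never matter."""
    md5 = hashlib.md5()
    with gzip.open(filename, 'rb') as f_in:
        while True:
            data = f_in.read(8192)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()

print(hash_decompressed("output1.gz") == hash_decompressed("output2.gz"))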
Hi, I wonder if there is a better way to do this in terms of code readability and repetition.
I have a large file that does not fit in memory. The file may or may not be compressed (.gz).
If it is compressed, I need to open it using gzip from the standard library.
I am not sure the code I ended up with is the best way to deal with this situation.
import gzip
from pathlib import Path

def parse_open_file(openfile):
    """Parse the content of the file."""
    return

def parse_file(file_: Path):
    if file_.suffix == ".gz":
        with gzip.open(file_, 'rb') as f:
            parse_open_file(f)
    else:
        with open(file_, 'rb') as f:
            parse_open_file(f)
One way to handle this is to assign either open or gzip.open to a variable, depending on file type, then use that as an 'alias' in the with statement. For example:
if file_.suffix == ".gz":
    myOpen = gzip.open
else:
    myOpen = open

with myOpen(file_, 'rb') as f:
    parse_open_file(f)
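Another option along the same lines is to wrap the choice in a small helper so callers never branch on the file type; a sketch (open_maybe_gzip is a name introduced here, not a library function):

import gzip
from pathlib import Path

def open_maybe_gzip(path: Path, mode: str = 'rb'):
    """Return a file object, using gzip transparently for .gz files."""
    if path.suffix == ".gz":
        return gzip.open(path, mode)
    return open(path, mode)

def parse_file(file_: Path):
    with open_maybe_gzip(file_) as f:
        parse_open_file(f)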
I want my server to send a file to the user and then delete the file.
The problem is that, in order to return the file to the user, I am using this:
return send_file(pathAndFilename, as_attachment=True, attachment_filename = requestedFile)
Since this returns, how can I delete the file from the OS with os.remove(pathAndFilename)?
I also tried this:
send_file(pathAndFilename, as_attachment=True, attachment_filename = requestedFile)
os.remove(pathAndFilename)
return 0
But I got this error:
TypeError: The view function did not return a valid response. The return type must be a string, dict, tuple, Response instance, or WSGI callable, but it was a int.
Since send_file already returns the response from the endpoint, it is not possible to execute code after it.
However, it is possible to copy the file into a stream before it is deleted and then to send the stream in the response.
from flask import send_file
import io, os, shutil

@app.route('/download/<path:filename>')
def download(filename):
    path = os.path.join(
        app.static_folder,
        filename
    )
    cache = io.BytesIO()
    with open(path, 'rb') as fp:
        shutil.copyfileobj(fp, cache)
        cache.flush()
    cache.seek(0)
    os.remove(path)
    return send_file(cache, as_attachment=True, attachment_filename=filename)
To make better use of memory with larger files, I think a temporary file is more suitable as a buffer.
from flask import send_file
import os, shutil, tempfile

@app.route('/download/<path:filename>')
def download(filename):
    path = os.path.join(
        app.static_folder,
        filename
    )
    cache = tempfile.NamedTemporaryFile()
    with open(path, 'rb') as fp:
        shutil.copyfileobj(fp, cache)
        cache.flush()
    cache.seek(0)
    os.remove(path)
    return send_file(cache, as_attachment=True, attachment_filename=filename)
I hope this meets your requirements.
Have fun implementing your project.
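As a side note, Flask also provides after_this_request, which can schedule the deletion once the response has been generated; whether the removal succeeds at that point depends on the platform (on Windows the handle opened by send_file may still be in use, which is presumably why buffering first, as above, is the safer route). A rough sketch, reusing app and app.static_folder from the snippets above:

import os
from flask import send_file, after_this_request

@app.route('/download/<path:filename>')
def download(filename):
    path = os.path.join(app.static_folder, filename)

    @after_this_request
    def cleanup(response):
        # Try to remove the file once the response has been built.
        try:
            os.remove(path)
        except OSError:
            pass
        return response

    return send_file(path, as_attachment=True)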
How can I make this code work?
There is a zip file with folders and .png files in it. The folder ".\icons_by_year" is empty. I need to get every file one by one, without unzipping the whole archive, and copy it to the root of the selected folder (so no extra folders are made).
import re
import shutil
import zipfile


class ArrangerOutZip(Arranger):
    def __init__(self):
        self.base_source_folder = '\\icons.zip'
        self.base_output_folder = ".\\icons_by_year"

    def proceed(self):
        self.create_and_copy()

    def create_and_copy(self):
        reg_pattern = re.compile(r'.+\.\w{1,4}$')
        f = open(self.base_source_folder, 'rb')
        zfile = zipfile.ZipFile(f)
        for cont in zfile.namelist():
            if reg_pattern.match(cont):
                with zfile.open(cont) as file:
                    shutil.copyfileobj(file, self.base_output_folder)
        zfile.close()
        f.close()


arranger = ArrangerOutZip()
arranger.proceed()
shutil.copyfileobj uses file objects for source and destination files. To open the destination you need to construct a file path for it. pathlib is a part of the standard python library and is a nice way to handle file paths. And ZipFile.extract does some of the work of creating intermediate output directories for you (plus sets file metadata) and can be used instead of copyfileobj.
One risk of unzipping files is that they can contain absolute or relative paths outside of the target directory you intend (e.g., "../../badvirus.exe"). extract is a bit too lax about that - putting those files in the root of the target directory - so I wrote a little something to reject the whole zip if you are being messed with.
With a few tweaks to make this a testable program:
from pathlib import Path
import re
import zipfile
#import shutil

#class ArrangerOutZip(Arranger):
class ArrangerOutZip:
    def __init__(self, base_source_folder, base_output_folder):
        self.base_source_folder = Path(base_source_folder).resolve(strict=True)
        self.base_output_folder = Path(base_output_folder).resolve()

    def proceed(self):
        self.create_and_copy()

    def create_and_copy(self):
        """Unzip files matching pattern to base_output_folder, raising
        ValueError if any resulting paths are outside of that folder.
        Output folder created if it does not exist."""
        reg_pattern = re.compile(r'.+\.\w{1,4}$')
        with open(self.base_source_folder, 'rb') as f:
            with zipfile.ZipFile(f) as zfile:
                wanted_files = [cont for cont in zfile.namelist()
                                if reg_pattern.match(cont)]
                rebased_files = self._rebase_paths(wanted_files,
                                                   self.base_output_folder)
                for cont, rebased in zip(wanted_files, rebased_files):
                    print(cont, rebased, rebased.parent)
                    # option 1: use shutil
                    #rebased.parent.mkdir(parents=True, exist_ok=True)
                    #with zfile.open(cont) as file, open(rebased, 'wb') as outfile:
                    #    shutil.copyfileobj(file, outfile)
                    # option 2: zipfile does the work for you
                    zfile.extract(cont, self.base_output_folder)

    @staticmethod
    def _rebase_paths(pathlist, target_dir):
        """Rebase relative file paths to target directory, raising
        ValueError if any resulting paths are not within target_dir"""
        target = Path(target_dir).resolve()
        newpaths = []
        for path in pathlist:
            newpath = target.joinpath(path).resolve()
            newpath.relative_to(target)  # raises ValueError if not subpath
            newpaths.append(newpath)
        return newpaths


#arranger = ArrangerOutZip('\\icons.zip', '.\\icons_by_year')
import sys
try:
    arranger = ArrangerOutZip(sys.argv[1], sys.argv[2])
    arranger.proceed()
except IndexError:
    print("usage: test.py zipfile targetdir")
I'd take a look at the zipfile library's getinfo(), and also at zipfile.Path() for construction, since the constructor can work with path objects that way if you intend to do any creation. zipfile.Path() builds an object wrapping a path inside the archive, and it appears to be modelled on pathlib. Assuming you don't need to create zip files, you can ignore zipfile.Path().
However, that's not exactly what I wanted to point out. Rather, consider ZipFile.getinfo().
There are examples of people handling what I think is this exact situation here:
https://www.programcreek.com/python/example/104991/zipfile.getinfo
They use getinfo() to get at an entry's path. It's also clear that NOT every zip file carries that info.
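For illustration, inspecting entries with ZipFile.infolist() and getinfo() looks roughly like this (a sketch assuming an archive such as the icons.zip above):

import zipfile

with zipfile.ZipFile("icons.zip") as zf:
    for info in zf.infolist():          # ZipInfo objects for every entry
        print(info.filename, info.file_size, info.is_dir())
    # getinfo() returns the same metadata for one known entry name.
    first = zf.getinfo(zf.namelist()[0])
    print(first.date_time)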
Computing an MD5 needs a stream of bytes to pass through it. I'm assuming it is possible to intercept what csv.writer writes as a stream of bytes while a million rows are written. In the Python code below, a million rows are written; how do I compute the MD5 without reading the huge file back into memory just for the hash?
def query2csv(connection, fileUri, sqlQuery, args):
    import csv
    tocsvfile = open(fileUri, 'w+')
    writer = csv.writer(tocsvfile, delimiter=',', quotechar='"')  # , quoting=csv.QUOTE_MINIMAL
    # As a huge blob goes into writer, pass through, md5 how?
    # I do not want to read the huge file through memory just to compute md5
    with connection.cursor() as cur:
        cur.execute(sqlQuery, args)
        column_names = list(map(lambda x: x[0], cur.description))
        writer.writerow(column_names)
        writer.writerows(__batch_rows(cur))
From the docs for csv.writer (emphasis mine):
csv.writer(csvfile, dialect='excel', **fmtparams)
Return a writer object responsible for converting the user’s data into delimited strings on the given file-like object. csvfile can be any object with a write() method. If csvfile is a file object, it should be opened with newline=''.
So we can intercept calls to .write(), and feed the data into the MD5 stream, while also passing it on to the real file. The cleanest way to do this is to define a class with a write method which just calls some functions (i.e. one for the MD5 stream, one for the file object):
import csv
import hashlib

class WriterTee:
    def __init__(self, *outs):
        self.outs = outs

    def write(self, s):
        for f in self.outs:
            f(s)

def query2csv(connection, fileUri, sqlQuery, args):
    md5 = hashlib.md5()
    with open(fileUri, 'w+', newline='') as tocsvfile, connection.cursor() as cur:
        tee = WriterTee(
            tocsvfile.write,
            lambda s: md5.update(s.encode())
        )
        writer = csv.writer(tee, delimiter=',', quotechar='"')
        cur.execute(sqlQuery, args)
        column_names = list(map(lambda x: x[0], cur.description))
        writer.writerow(column_names)
        writer.writerows(__batch_rows(cur))
    return md5.hexdigest()
I've made a couple of other changes, to manage both resources in the with block, and to use newline='' as the docs say one should.
By the way, I would recommend against using MD5 for any purpose, if you have a choice. MD5 is not secure, and cryptographers have been recommending against it since 1996. Even if you don't consider the security properties to be relevant to your application, there is no downside to using a secure hash algorithm, and the hashlib APIs are the same whichever algorithm you choose.
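To illustrate that last point, switching algorithms only means changing the constructor; the update()/hexdigest() calls stay the same (a minimal sketch):

import hashlib

digest = hashlib.sha256()        # drop-in replacement for hashlib.md5()
digest.update(b"some data")
print(digest.hexdigest())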
I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to the S3 bucket as it is being created, rather than writing the whole file locally and then uploading it at the end.
Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:
import csv
import io
import boto
from boto.s3.key import Key
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'
fieldnames = ['first_name', 'last_name']
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
k.set_contents_from_stream(writer.writeheader())
I received this error: BotoClientError: s3 does not support chunked transfer
UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'

testDict = [{
    "fieldA": "8",
    "fieldB": None,
    "fieldC": "888888888888"},
    {
    "fieldA": "9",
    "fieldB": None,
    "fieldC": "99999999999"}]

f = io.StringIO()
fieldnames = ['fieldA', 'fieldB', 'fieldC']
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
k.set_contents_from_string(f.getvalue())

for row in testDict:
    writer.writerow(row)
    k.set_contents_from_string(f.getvalue())

f.close()
This writes 3 lines to the file; however, I'm unable to release memory so that I can write a big file. If I add:
f.seek(0)
f.truncate(0)
to the loop, then only the last line of the file is written. Is there any way to release resources without deleting lines from the file?
I did find a solution to my question, which I will post here in case anyone else is interested. I decided to do this as parts of a multipart upload, since you can't stream directly to S3. There is also a package available that turns your streaming writes into a multipart upload, which is what I used: Smart Open.
import smart_open
import io
import csv

testDict = [{
    "fieldA": "8",
    "fieldB": None,
    "fieldC": "888888888888"},
    {
    "fieldA": "9",
    "fieldB": None,
    "fieldC": "99999999999"}]

fieldnames = ['fieldA', 'fieldB', 'fieldC']
f = io.StringIO()

with smart_open.smart_open('s3://dev-test/bar/foo.csv', 'wb') as fout:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    fout.write(f.getvalue())

    for row in testDict:
        f.seek(0)
        f.truncate(0)
        writer.writerow(row)
        fout.write(f.getvalue())

f.close()
Here is a complete example using boto3
import boto3
import io

session = boto3.Session(
    aws_access_key_id="...",
    aws_secret_access_key="..."
)
s3 = session.resource("s3")

buff = io.BytesIO()
buff.write("test1\n".encode())
buff.write("test2\n".encode())

# bucket and keypath are assumed to be defined, e.g.:
# bucket, keypath = "my-bucket", "path/to/object.txt"
s3.Object(bucket, keypath).put(Body=buff.getvalue())
We were trying to upload file contents to s3 when it came through as an InMemoryUploadedFile object in a Django request. We ended up doing the following because we didn't want to save the file locally. Hope it helps:
@action(detail=False, methods=['post'])
def upload_document(self, request):
    document = request.data.get('image').file
    s3.upload_fileobj(document, BUCKET_NAME,
                      DESIRED_NAME_OF_FILE_IN_S3,
                      ExtraArgs={"ServerSideEncryption": "aws:kms"})
According to the docs it's possible:
s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
so we can use StringIO in the ordinary way.
Update: the smart_open lib from @inquiring minds' answer is a better solution.
There's an interesting code solution mentioned in a GitHub smart_open issue (#82) that I've been meaning to try out. Copy-pasting here for posterity... looks like boto3 is required:
import csv
import gzip
import io

import boto3

# my_data, bucket_name and key are placeholders from the original snippet.
csv_data = io.StringIO()
writer = csv.writer(csv_data)
writer.writerows(my_data)

gz_stream = io.BytesIO()
with gzip.GzipFile(fileobj=gz_stream, mode="w") as gz:
    gz.write(csv_data.getvalue().encode())
gz_stream.seek(0)

s3 = boto3.client('s3')
s3.upload_fileobj(gz_stream, bucket_name, key)
This specific example is streaming to a compressed S3 key/file, but it seems like the general approach -- using the boto3 S3 client's upload_fileobj() method in conjunction with a target stream, not a file -- should work.
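For example, the uncompressed variant would presumably look something like this (bucket and key names are placeholders):

import io
import boto3

# Build the CSV in memory, then hand the stream to upload_fileobj().
s3 = boto3.client('s3')
buffer = io.BytesIO("col_a,col_b\n1,2\n".encode())
s3.upload_fileobj(buffer, "my-bucket", "exports/data.csv")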
There's a well supported library for doing just this:
pip install s3fs
s3fs is really trivial to use:
import s3fs

s3 = s3fs.S3FileSystem(anon=False)

with s3.open('mybucket/new-file', 'wb') as f:
    f.write(2*2**20 * b'a')
    f.write(2*2**20 * b'a')
Incidentally there's also something built into boto3 (backed by the AWS API) called MultiPartUpload.
This isn't factored as a python stream which might be an advantage for some people. Instead you can start an upload and send parts one at a time.
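A rough sketch of that low-level multipart flow with boto3 (bucket and key are placeholders; note that every part except the last must be at least 5 MiB):

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "exports/data.csv"

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for number, chunk in enumerate([b"a" * 5 * 1024 * 1024, b"tail"], start=1):
    resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=number,
                          UploadId=mpu["UploadId"], Body=chunk)
    parts.append({"ETag": resp["ETag"], "PartNumber": number})

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})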
To write a string to an S3 object, use:
s3.Object('my_bucket', 'my_file.txt').put(Body='Hello there')
So convert the stream to string and you're there.
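A minimal sketch of what that looks like with an in-memory stream (bucket and key names are just examples):

import io
import boto3

s3 = boto3.resource('s3')
buf = io.StringIO()
buf.write("first line\n")
buf.write("second line\n")

# Convert the accumulated stream to a string and upload it in one call.
s3.Object('my_bucket', 'my_file.txt').put(Body=buf.getvalue())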