How to best manage file either gz or not? - python-3.x

Hi I wonder if there is a better way in terms of code readability and repetition.
I have a large file that do not fit in memory. The file is either compressed .gz or not.
If it is compressed I need to open it using gzip from standard lib.
I am not sure the code I ended up is the best way to deal with that situation.
import gzip
from Path import pathlib
def parse_open_file(openfile):
"""parse the content of the file"""
return
def parse_file(file_: Path):
if file.suffix == ".gz":
with gzip.open(file_, 'rb') as f:
parse_open_file(f)
else:
with open(file_, 'rb') as f:
parse_open_file(f)

One way to handle this is to assign either open or gzip.open to a variable, depending on file type, then use that as an 'alias' in the with statement. For example:
if file.suffix == ".gz":
myOpen = gzip.open
else:
myOpen = open
with myOpen(file_, 'rb') as f:
parse_open_file(f)

Related

Python script output need to save as a text file

import os ,fnmatch
import os.path
import os
file_dir= '/home/deeghayu/Desktop/mec_sim/up_folder'
file_type = ['*.py']
for root, dirs,files in os.walk( file_dir ):
for extension in ( tuple(file_type) ):
for filename in fnmatch.filter(files, extension):
filepath = os.path.join(root, filename)
if os.path.isfile( filepath ):
print(filename , 'executing...');
exec(open('/home/deeghayu/Desktop/mec_sim/up_folder/{}'.format(filename)).read())
else:
print('Execution failure!!!')
Hello everyone I am working on this code which execute a python file using a python code. I need to save my output of the code as a text file. Here I have shown my code. Can any one give me a solution how do I save my output into a text file?
Piggybacking off of the original answer since they are close but it isn't a best practice to open and close files that way.
It's better to use a context manager instead of saying f = open() since the context manager will handle closing the resource for you regardless of whether your code succeeds or not.
You use it like,
with open("file.txt","w+") as f:
for i in range(10):
f.write("This is line %d\r\n" % (i+1))
try
Open file
f= open("file.txt","w+")
Insert data into file
for i in range(10):
f.write("This is line %d\r\n" % (i+1))
Close the file
f.close()

Compare the content of gzipped files from gzip.open() in python3

I wanted to compare the content of two files compressed from the same file with the gzip.open() method from the python3 standard library. Above the code code creating two compressed files:
import shutil
import gzip
with open('file1.txt', 'rb') as f_in:
with gzip.open('output1.gz',"wb") as f_out:
shutil.copyfileobj(f_in, f_out)
with open('file1.txt', 'rb') as f_in:
with gzip.open('output2.gz',"wb") as f_out:
shutil.copyfileobj(f_in, f_out)
Then I compared the gzipped-files with the filecmp module from the standard library:
import filecmp
print(filecmp.cmp("output1.gz", "output2.gz", shallow=False))
That return False due to different timestamps and filename in the headers of the files as pointed by Kris in this thread.
So according to the RFC 1952 the filename section of the headers that comes after the timestamp section is terminated with a null byte. So I created a function to look for this byte and start hashing the rest of the bytes with the md5() method from the hashlib module from the standard library. And then compare the hashes of the files.
import hashlib
def compareContent(filename):
md5 = hashlib.md5()
with open(filename, 'rb') as f_in:
while True:
text = f_in.read(1)
if text == b'\x00':
break
while True:
data = f_in.read(8192)
if not data:
break
md5.update(data)
return md5.hexdigest()`
hash_one = compareContent("output1.gz")
hash_two = compareContent("output2.gz")
print(hash_one == hash_two)
This finally returns True and seems to work well.
I'm just a little bit surprised that I could not find an already existing function to do this work.
As a beginner, my questions are:
Am I missing something? Does such a function already exist?
If not, Is there any better way to achieve this or to improve my code?
Here is the code available in a Colab notebook: compare gzipped-files

Opening two different format files with a single statement

I am Reading 2 files .txt and .tsv, i had 2 different methods to read these type of files but i want to do it in a single way. for .tsv file we have to pass an argument delimiter and for .txt files it does not need any argument. How can i pass delimiter argu? so that i would not need separate functions.
class DataReader:
def __init__(self, **type="tsv"**):
self.weather_records = []
def read(self, path: str, file: str):
with open(path + file) as file:
for line in csv.DictReader(file, delimiter="\t"):
self.__load(line)
You can try this. I have added comments in the code below.
class DataReader:
def __init__(self, type="tsv"): #type doesn't matter for file reading anymore
self.weather_records = []
def read(self, path, file):
with open(path + file) as file:
if file.endswith('.tsv'):
#do something with delimiter
#You could also set the type here, if you want to use it later
else:
#do without delimiter.

FileNotFoundError But The File Is There: Cryptography Edition

I'm working on a script that takes a checksum and directory as inputs.
Without too much background, I'm looking for 'malware' (ie. a flag) in a directory of executables. I'm given the SHA512 sum of the 'malware'. I've gotten it to work (I found the flag), but I ran into an issue with the output after generalizing the function for different cryptographic protocols, encodings, and individual files instead of directories:
FileNotFoundError: [Errno 2] No such file or directory : 'lessecho'
There is indeed a file lessecho in the directory, and as it happens, is close to the file that returns the actual flag. Probably a coincidence. Probably.
Below is my Python script:
#!/usr/bin/python3
import hashlib, sys, os
"""
### TO DO ###
Add other encryption techniques
Include file read functionality
"""
def main(to_check = sys.argv[1:]):
dir_to_check = to_check[0]
hash_to_check = to_check[1]
BUF_SIZE = 65536
for f in os.listdir(dir_to_check):
sha256 = hashlib.sha256()
with open(f, 'br') as f: <--- line where the issue occurs
while True:
data = f.read(BUF_SIZE)
if not data:
break
sha256.update(data)
f.close()
if sha256.hexdigest() == hash_to_check:
return f
if __name__ == '__main__':
k = main()
print(k)
Credit to Randall for his answer here
Here are some humble trinkets from my native land in exchange for your wisdom.
Your listdir call is giving you bare filenames (e.g. lessecho), but that is within the dir_to_check directory (which I'll call foo for convenience). To open the file, you need to join those two parts of the path back together, to get a proper path (e.g. foo/lessecho). The os.path.join function does exactly that:
for f in os.listdir(dir_to_check):
sha256 = hashlib.sha256()
with open(os.path.join(dir_to_check, f), 'br') as f: # add os.path.join call here!
...
There are a few other issues in the code, unrelated to your current error. One is that you're using the same variable name f for both the file name (from the loop) and file object (in the with statement). Pick a different name for one of them, since you need both available (because I assume you intend return f to return the filename, not the recently closed file object).
And speaking of the closed file, you're actually closing the file object twice. The first one happens at the end of the with statement (that's why you use with). The second is your manual call to f.close(). You don't need the manual call at all.

How to copy from zip file to a folder without unzipping it?

How to make this code works?
There is a zip file with folders and .png files in it. Folder ".\icons_by_year" is empty. I need to get every file one by one without unzipping it and copy to the root of the selected folder (so no extra folders made).
class ArrangerOutZip(Arranger):
def __init__(self):
self.base_source_folder = '\\icons.zip'
self.base_output_folder = ".\\icons_by_year"
def proceed(self):
self.create_and_copy()
def create_and_copy(self):
reg_pattern = re.compile('.+\.\w{1,4}$')
f = open(self.base_source_folder, 'rb')
zfile = zipfile.ZipFile(f)
for cont in zfile.namelist():
if reg_pattern.match(cont):
with zfile.open(cont) as file:
shutil.copyfileobj(file, self.base_output_folder)
zfile.close()
f.close()
arranger = ArrangerOutZip()
arranger.proceed()
shutil.copyfileobj uses file objects for source and destination files. To open the destination you need to construct a file path for it. pathlib is a part of the standard python library and is a nice way to handle file paths. And ZipFile.extract does some of the work of creating intermediate output directories for you (plus sets file metadata) and can be used instead of copyfileobj.
One risk of unzipping files is that they can contain absolute or relative paths outside of the target directory you intend (e.g., "../../badvirus.exe"). extract is a bit too lax about that - putting those files in the root of the target directory - so I wrote a little something to reject the whole zip if you are being messed with.
With a few tweeks to make this a testable program,
from pathlib import Path
import re
import zipfile
#import shutil
#class ArrangerOutZip(Arranger):
class ArrangerOutZip:
def __init__(self, base_source_folder, base_output_folder):
self.base_source_folder = Path(base_source_folder).resolve(strict=True)
self.base_output_folder = Path(base_output_folder).resolve()
def proceed(self):
self.create_and_copy()
def create_and_copy(self):
"""Unzip files matching pattern to base_output_folder, raising
ValueError if any resulting paths are outside of that folder.
Output folder created if it does not exist."""
reg_pattern = re.compile('.+\.\w{1,4}$')
with open(self.base_source_folder, 'rb') as f:
with zipfile.ZipFile(f) as zfile:
wanted_files = [cont for cont in zfile.namelist()
if reg_pattern.match(cont)]
rebased_files = self._rebase_paths(wanted_files,
self.base_output_folder)
for cont, rebased in zip(wanted_files, rebased_files):
print(cont, rebased, rebased.parent)
# option 1: use shutil
#rebased.parent.mkdir(parents=True, exist_ok=True)
#with zfile.open(cont) as file, open(rebased, 'wb') as outfile:
# shutil.copyfileobj(file, outfile)
# option 2: zipfile does the work for you
zfile.extract(cont, self.base_output_folder)
#staticmethod
def _rebase_paths(pathlist, target_dir):
"""Rebase relative file paths to target directory, raising
ValueError if any resulting paths are not within target_dir"""
target = Path(target_dir).resolve()
newpaths = []
for path in pathlist:
newpath = target.joinpath(path).resolve()
newpath.relative_to(target) # raises ValueError if not subpath
newpaths.append(newpath)
return newpaths
#arranger = ArrangerOutZip('\\icons.zip', '.\\icons_by_year')
import sys
try:
arranger = ArrangerOutZip(sys.argv[1], sys.argv[2])
arranger.proceed()
except IndexError:
print("usage: test.py zipfile targetdir")
I'd take a look at the zipfile libraries' getinfo() and also ZipFile.Path() for construction since the constructor class can also use paths that way if you intend to do any creation.
Specifically PathObjects. This is able to do is to construct an object with a path in it, and it appears to be based on pathlib. Assuming you don't need to create zipfiles, you can ignore this ZipFile.Path()
However, that's not exactly what I wanted to point out. Rather consider the following:
zipfile.getinfo()
There is a person who I think is getting at this exact situation here:
https://www.programcreek.com/python/example/104991/zipfile.getinfo
This person seems to be getting a path using getinfo(). It's also clear that NOT every zipfile has the info.

Resources