How to register .gz format in shutil.register_archive_format to use same format in shutil.unpack_archive - python-3.x

I have Example.json.gz and I want to unpack (extract) it in Python using shutil.unpack_archive().
However, it raises shutil.ReadError: Unknown archive format, because '.gz' is not in the list of default formats.
So it has to be registered first using shutil.register_archive_format. Can somebody please help me register the format and unpack the file?

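You should define a function that knows how to extract a .gz file and then register it with shutil.register_unpack_format (shutil.register_archive_format only covers creating archives with make_archive). You could use the gzip module, for instance: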
import os
import re
import gzip
import shutil

def gunzip_something(gzipped_file_name, work_dir):
    """gunzip the given gzipped file"""
    # see warning about filename
    filename = os.path.split(gzipped_file_name)[-1]
    filename = re.sub(r"\.gz$", "", filename, flags=re.IGNORECASE)
    with gzip.open(gzipped_file_name, 'rb') as f_in:  # <<========== extraction happens here
        with open(os.path.join(work_dir, filename), 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

try:
    shutil.register_unpack_format('gz', ['.gz', ], gunzip_something)
except shutil.RegistryError:
    # the 'gz' name or the '.gz' extension is already registered
    pass

shutil.unpack_archive("Example.json.gz", os.curdir, 'gz')
WARNING: if you extract into the same directory where your gzipped file resides and the file name does not end in .gz, the output path is the same as the input path, so opening it for writing truncates the archive before it can be read.
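The filename concern in the warning can be guarded against explicitly. The following is only a sketch: safe_gunzip is a hypothetical name, and refusing to write when the output path equals the input path is one possible policy, not something shutil does for you.

```python
import gzip
import os
import shutil


def safe_gunzip(gzipped_file_name, work_dir):
    """Decompress a .gz file into work_dir and return the output path.

    Hypothetical helper: raises ValueError instead of writing over its input.
    """
    name = os.path.basename(gzipped_file_name)
    if name.lower().endswith(".gz"):
        name = name[:-3]  # drop the ".gz" suffix
    target = os.path.join(work_dir, name)
    # Without a .gz suffix the name is unchanged, so the target may be the input.
    if os.path.abspath(target) == os.path.abspath(gzipped_file_name):
        raise ValueError("output path would overwrite the input: " + target)
    with gzip.open(gzipped_file_name, "rb") as f_in, open(target, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return target
```

It takes the same two positional arguments as the unpack function above, so it could be passed to shutil.register_unpack_format in its place.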

Related

Converting multiple files in a directory into .txt format. But file names become Binary

So I am creating plagiarism software, and for that I need to convert .pdf, .docx, etc. files into .txt format. I successfully found a way to convert all the files in one directory to another. BUT the problem is that this method changes the file names into binary values. I need to keep the original file names, which I will need in the next phase.
**Code:**
import os
import uuid
import textract

source_directory = os.path.join(os.getcwd(), "C:/Users/syedm/Desktop/Study/FOUNDplag/Plagiarism-checker-Python/mainfolder")
for filename in os.listdir(source_directory):
    file, extension = os.path.splitext(filename)
    unique_filename = str(uuid.uuid4()) + extension
    os.rename(os.path.join(source_directory, filename), os.path.join(source_directory, unique_filename))

training_directory = os.path.join(os.getcwd(), "C:/Users/syedm/Desktop/Study/FOUNDplag/Plagiarism-checker-Python/trainingdata")
for process_file in os.listdir(source_directory):
    file, extension = os.path.splitext(process_file)
    # We create a new text file name by concatenating the .txt extension to file UUID
    dest_file_path = file + '.txt'
    # extract text from the file
    content = textract.process(os.path.join(source_directory, process_file))
    # We create and open the new and we prepare to write the Binary Data which is represented by the wb - Write Binary
    write_text_file = open(os.path.join(training_directory, dest_file_path), "wb")
    # write the content and close the newly created file
    write_text_file.write(content)
    write_text_file.close()
Remove the line where you rename the files (the whole first loop can go, since renaming is all it does):
    os.rename(os.path.join(source_directory, filename), os.path.join(source_directory, unique_filename))
Also, the new names are not binary; they are UUIDs.
Cheers
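With the rename gone, the conversion loop can derive each .txt name from the original file name. This is a rough sketch, not the asker's final code: convert_directory and its extract parameter are hypothetical, with extract standing for any callable that takes a path and returns bytes (textract.process is used that way in the question).

```python
import os


def convert_directory(source_directory, training_directory, extract):
    """Write one .txt file per source file, keeping the original base name.

    `extract` is assumed to take a file path and return the text as bytes.
    """
    written = []
    for filename in sorted(os.listdir(source_directory)):
        stem, _extension = os.path.splitext(filename)
        dest_path = os.path.join(training_directory, stem + ".txt")
        content = extract(os.path.join(source_directory, filename))
        # "wb" because the extractor returns bytes, not str
        with open(dest_path, "wb") as out:
            out.write(content)
        written.append(dest_path)
    return written
```

A call such as convert_directory(source_directory, training_directory, textract.process) would then produce report.txt from report.pdf. Two inputs that share a base name (report.pdf and report.docx) would write to the same output file, so that case needs its own handling.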

Colab OSError: [Errno 36] File name too long when reading a docx2text file

I am studying NLP techniques and while I have some experience with .txt files, using .docx has been troublesome. I am trying to use regex on strings, and since I am using a word document, this is my approach:
I use textract to convert the docx to text and decode the bytes to a string:
import textract
my_text = textract.process("1337.docx")
my_text = my_text.decode("utf-8")
I read the file:
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
I then try to apply some regexes, such as removing all numbers, and when executing it in the main:
def regextest(doc):
    ...
    ...

text = load_doc(my_text)
tokens = regextest(text)
print(tokens)
I get the exception:
OSError: [Errno 36] File name too long: Are you buying a Tesla?\n\n\n\n - I believe the pricing is...(and more text from the file)
I know I am transforming my docx file to a text file and then, when I read the "filename", it is actually the whole text. How can I preserve the file and make it work? How would you guys approach this?
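It seems that you are passing the contents of the file (my_text) as the filename parameter of load_doc, hence the error.
I would think that you rather want to pass an actual file name and not the contents of the file. Note that load_doc opens the file as plain text, so it only makes sense for a .txt file; since my_text already holds the decoded text, you can also skip load_doc and pass my_text to regextest directly.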
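To illustrate the point, here is a small sketch that runs a regex directly on the decoded text instead of treating it as a file name. The bytes literal stands in for the return value of textract.process("1337.docx"), and strip_numbers is a hypothetical placeholder for the elided regextest.

```python
import re


def strip_numbers(doc):
    """Hypothetical stand-in for regextest: drop digit runs, split on whitespace."""
    return re.sub(r"\d+", "", doc).split()


# Stand-in for textract.process("1337.docx"), which returns bytes.
raw = b"Are you buying a Tesla?\n\n\n\n - I believe the pricing is 40000 dollars"

text = raw.decode("utf-8")    # already the document text, not a path
tokens = strip_numbers(text)  # no load_doc call is needed here
print(tokens)
```

If the text is needed on disk for a later step, writing it to a .txt file and passing that file's name to load_doc would also work.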

How to read data from home directory in Python

I am trying to read data from a JSON file. The file is stored in project > Requests > request1.json. In a script I am trying to read data from the JSON file and failing badly. This is the code I'm trying to use to open the file in read mode.
I am trying to replace (on Windows)
    f = open('D:\\Test\\projectname\\RequestJson\\request1.json', 'r')
with
    f = open(os.path.expanduser('~user') + "Requests/request1.json", 'r')
Any help would be greatly appreciated.
Use the current directory path (assuming that it is inside the project) and append the remaining static file path:
import json
import os
current_dir = os.path.abspath(os.getcwd())
path = os.path.join(current_dir, "RequestJson", "request1.json")
with open(path, 'r') as f:
    data = json.load(f)  # the file is opened for reading, so parse it rather than write to it
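If the file really lives under the home directory, as the question title suggests, the path can be built from the home directory instead. Two details of the attempt in the question matter here: '~user' refers to the home directory of an account literally named user, and joining paths with + drops the separator. The sketch below is only an illustration; load_request and its parameters are hypothetical, and the Requests/request1.json layout is taken from the question.

```python
import json
from pathlib import Path


def load_request(base_dir=None, relative_path=("Requests", "request1.json")):
    """Load a JSON file located under base_dir (defaults to the home directory)."""
    base = Path(base_dir) if base_dir is not None else Path.home()
    path = base.joinpath(*relative_path)  # inserts the separators itself
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)
```

Calling load_request() reads ~/Requests/request1.json for the current user; passing base_dir points it at a project directory instead.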

How to compress csv encoded file into zip archive directly?

I want to write data to a cp1250-encoded file and zip it without temporarily storing it on the filesystem.
I figured out that I need something like this:
f = io.TextIOBase(newline='', encoding='cp1250')
writer = csv.writer(f, delimiter=';', dialect='excel', quoting=csv.QUOTE_ALL)
writer.writerow([3,3,3,4])
with ZipFile('cvs.zip', 'w') as zip_file:
    zip_file.writestr('test.cvs', f.getvalue())
But now on the third line (the writer.writerow call) I get:
io.UnsupportedOperation: write
This is probably because I use io.TextIOBase, but I can't set an encoding on any StringIO.
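One possible way around this, sketched under the assumption that the whole CSV fits in memory: wrap an io.BytesIO in an io.TextIOWrapper, which accepts both an encoding and newline='', and hand the resulting bytes to ZipFile.writestr. The function name rows_to_zip and its parameters are hypothetical.

```python
import csv
import io
from zipfile import ZipFile


def rows_to_zip(rows, zip_target, member_name="test.csv", encoding="cp1250"):
    """Write rows as an encoded CSV member of a zip archive without a temp file.

    zip_target may be a file name or a binary file-like object.
    """
    raw = io.BytesIO()
    # TextIOWrapper does the str -> bytes encoding that csv.writer cannot do itself.
    text = io.TextIOWrapper(raw, encoding=encoding, newline="")
    writer = csv.writer(text, delimiter=";", dialect="excel", quoting=csv.QUOTE_ALL)
    writer.writerows(rows)
    text.flush()  # push buffered text down into the BytesIO
    with ZipFile(zip_target, "w") as zip_file:
        zip_file.writestr(member_name, raw.getvalue())
    text.detach()  # keep the wrapper from closing `raw` behind our back
```

For example, rows_to_zip([[3, 3, 3, 4]], 'csv.zip') writes a single-member archive. A simpler variant is to write into io.StringIO and call .getvalue().encode('cp1250') before writestr.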

Custom filetype in Python 3

How do I start creating my own file type in Python? I have a design in mind, but how do I pack my data into a file with a specific format?
For example, I would like my file format to be a mix of an archive (like zip, apk, jar, etc., which are basically all archives) with some room for packed files, plus a section of the file containing settings and serialized data that will not be accessed by an archive-manager application.
My requirement is to do all this with the default modules of CPython, without external modules.
I know that this can be long to explain and do, but I can't see how to start this in Python 3.x with CPython.
Try this:
from zipfile import ZipFile
import json
data = json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
with ZipFile('foo.filetype', 'w') as myzip:
    myzip.writestr('digest.json', data)
The file is now a zip archive containing a JSON file (which is easy to read back in many languages) for the data. You can add files to the archive with myzip.write or myzip.writestr. You can read the data back with:
with ZipFile('foo.filetype', 'r') as myzip:
    json_data_read = myzip.read('digest.json')
    newdata = json.loads(json_data_read)
Edit: you can append arbitrary data to the file with:
f = open('foo.filetype', 'a')
f.write(data)
f.close()
This works for WinRAR, but Python can no longer process the zip file.
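Another layout that stays within the standard library is to put the settings in front of the zip payload rather than after it, and to strip that header before handing the rest to zipfile. The sketch below is only an illustration of the idea: the MYFT signature, the function names, and the header layout (4-byte magic, 4-byte little-endian length, JSON settings, then the zip bytes) are all made up for this example.

```python
import io
import json
import struct
from zipfile import ZipFile

MAGIC = b"MYFT"  # hypothetical 4-byte signature for the custom format


def write_container(path, settings, files):
    """Write magic + length-prefixed JSON settings, followed by a zip payload."""
    header = json.dumps(settings).encode("utf-8")
    payload = io.BytesIO()
    with ZipFile(payload, "w") as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    with open(path, "wb") as out:
        out.write(MAGIC)
        out.write(struct.pack("<I", len(header)))  # settings length in bytes
        out.write(header)
        out.write(payload.getvalue())


def read_container(path):
    """Return (settings, {member name: bytes}) from a file made by write_container."""
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("not a MYFT container")
        (size,) = struct.unpack("<I", f.read(4))
        settings = json.loads(f.read(size).decode("utf-8"))
        payload = io.BytesIO(f.read())  # everything after the header is the zip
    with ZipFile(payload) as zf:
        files = {name: zf.read(name) for name in zf.namelist()}
    return settings, files
```

Because the reader slices the header off itself, it does not rely on how any particular archive tool treats data around a zip archive.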
Use this:
import base64
import gzip
import ast

def save(data):
    # repr() keeps the quotes around strings so literal_eval can parse them back
    data = "[{!r}]".format(data).encode()
    data = base64.b64encode(data)
    return gzip.compress(data)

def load(data):
    data = gzip.decompress(data)
    data = base64.b64decode(data)
    return ast.literal_eval(data.decode())[0]

How to use this with a file:
with open(filename, "wb") as f:
    f.write(save(data))  # save data
with open(filename, "rb") as f:
    data = load(f.read())  # load data
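This might look like something an archive program can open, and a tool may well recognize the gzip layer,
but what it unpacks is Base64 text, which still has to be decoded before the data is readable.
You can also store any value that ast.literal_eval can parse (strings, bytes, numbers, tuples, lists, dicts, sets, booleans, None).
Example: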
open(filename, "wb").write(save({"foo": "bar"})) # dict
open(filename, "wb").write(save("foo bar")) # string
open(filename, "wb").write(save(b"foo bar")) # bytes
# there's more you can store!
This may not be exactly what you asked for, but I think it may help you.
I faced a similar problem and ended up creating a zip file and then renaming it to use my custom file extension. It can still be opened with WinRAR, though.

Resources