Converting multiple files in a directory into .txt format, but file names become binary - python-3.x

So I am creating plagiarism software, and for that I need to convert .pdf, .docx, etc. files into .txt format. I successfully found a way to convert all the files in one directory to another, BUT the problem is that this method changes the file names into what look like binary values. I need to keep the original file name, which I am going to need in the next phase.
**Code:**
import os
import uuid
import textract

source_directory = os.path.join(os.getcwd(), "C:/Users/syedm/Desktop/Study/FOUNDplag/Plagiarism-checker-Python/mainfolder")
for filename in os.listdir(source_directory):
    file, extension = os.path.splitext(filename)
    unique_filename = str(uuid.uuid4()) + extension
    os.rename(os.path.join(source_directory, filename), os.path.join(source_directory, unique_filename))

training_directory = os.path.join(os.getcwd(), "C:/Users/syedm/Desktop/Study/FOUNDplag/Plagiarism-checker-Python/trainingdata")
for process_file in os.listdir(source_directory):
    file, extension = os.path.splitext(process_file)
    # We create a new text file name by concatenating the .txt extension to the file's base name
    dest_file_path = file + '.txt'
    # Extract text from the file
    content = textract.process(os.path.join(source_directory, process_file))
    # Create and open the new file in "wb" (write binary) mode
    write_text_file = open(os.path.join(training_directory, dest_file_path), "wb")
    # Write the content and close the newly created file
    write_text_file.write(content)
    write_text_file.close()

Remove these lines where you rename the files:
os.rename(os.path.join(source_directory, filename), os.path.join(source_directory, unique_filename))
Those names are not binary, by the way; they are UUIDs.
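If you skip the renaming altogether, the second loop already writes one .txt per source file under its original base name. A minimal sketch of the trimmed script, assuming the same directories and textract setup as above:

import os
import textract

source_directory = "C:/Users/syedm/Desktop/Study/FOUNDplag/Plagiarism-checker-Python/mainfolder"
training_directory = "C:/Users/syedm/Desktop/Study/FOUNDplag/Plagiarism-checker-Python/trainingdata"

for process_file in os.listdir(source_directory):
    # Keep the original base name and only swap the extension for .txt
    file, extension = os.path.splitext(process_file)
    dest_file_path = file + '.txt'
    # textract returns the extracted text as bytes, hence "wb" below
    content = textract.process(os.path.join(source_directory, process_file))
    with open(os.path.join(training_directory, dest_file_path), "wb") as write_text_file:
        write_text_file.write(content)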
Cheers

Related

Python Pillow library not saving files with same name in same location

Below is the code I am using to convert binary data into an image and then save it:
img = base64.b64decode(rec.image3)
img_conv = Image.open(io.BytesIO(img))
img_format = img_conv.format
img_conv.save('{}/{}'.format(path, rec.image_name), format(img_format))
There are 4 images processed by the same code, and I want to handle the scenario where the file names are all the same in the same location: it should still save all 4 images even though they have duplicate names.
Any suggestion would be appreciated. Thanks.
Supposing that you want to keep each file under a different name: append '_' to the original filename for as long as a file with that name exists in your directory.
from pathlib import Path

path_to_save = Path(path, rec.image_name)
while path_to_save.exists():
    path_to_save = Path(str(path_to_save) + '_')
img_conv.save(path_to_save, format(img_format))
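Note that this appends the underscore after the extension (image.png, image.png_, ...). If the extension needs to stay intact, a small variation on the same idea, assuming rec.image_name includes the extension:

from pathlib import Path

path_to_save = Path(path, rec.image_name)
counter = 1
while path_to_save.exists():
    # image.png -> image_1.png, image_2.png, ...
    stem, suffix = Path(rec.image_name).stem, Path(rec.image_name).suffix
    path_to_save = Path(path, f"{stem}_{counter}{suffix}")
    counter += 1
img_conv.save(path_to_save, format(img_format))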

How to read file as .dat and write it as a .txt

So I'm making a thing that reads data from a .dat file and saves it as a list, then takes that list and writes it to a .txt file (basically a .dat to .txt converter). However, whenever I run it and it makes the file, it is a .txt file but it contains the .dat data. After troubleshooting, the variable that is written to the file is normal, legible text, not weird .dat data...
Here is my code (please don't roast me, I'm very new, I know it sucks and has lots of mistakes, just leave me be xD):
#import dependencies
import sys
import pickle
import time

#define constants and get file path
data = []
index = 0
path = input("Absolute file path:\n")

#checks if the last character is a space (common in copy+pasting) and removes it if there is one
if path.endswith(' '):
    path = path[:-1]

#load the .dat file into a list named bits
with open(path, 'rb') as fp:
    bits = pickle.load(fp)

#convert the data from bits into a new list called data
while index < len(bits):
    print("Decoding....\n")
    storage = bits[index]
    #str() returns a new string, so assign the result back
    storage = str(storage)
    data.append(storage)
    index += 1
    time.sleep(0.1)

#removes the .dat extension from the file path
split = path[:-4]

#creates the new txt file with _convert.txt added to the end
with open(f"{split}_convert.txt", "wb") as fp:
    pickle.dump(data, fp)

#tells the user where the file has been created
close_file = str(split) + "_convert.txt"
print(f"\nA decoded txt file has been created. Run this command to open it: cd {close_file}\n\n")
Quick review: I'm setting a variable named data which contains all of the data from the .dat file, then I want to save that variable to a .txt file, but whenever I save it, the file has the contents of the .dat file, even though when I call print(data) it shows me the data as normal, legible text. Thanks for any help.
with open(f"{split}_convert.txt", "wb") as fp:
pickle.dump(data, fp)
When you open the file in wb mode and call pickle.dump, pickled binary data is written to it, not plain text. To write plain text to the .txt file, use
with open(f"{split}_convert.txt", "w") as fp:
fp.write(data)
Since data is a list, you can't write it straight away like that. You'll need to write each item, using a loop.
with open(f"{split}_convert.txt", "w") as fp:
for line in data:
fp.write(line)
For more details on file writing, check this article as well: https://www.tutorialspoint.com/python3/python_files_io.htm
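Putting it together, a minimal sketch of the whole converter with this fix applied, assuming every pickled item can be rendered with str() (a newline is added after each item so the lines don't run together):

import pickle

path = input("Absolute file path:\n").strip()
with open(path, 'rb') as fp:
    bits = pickle.load(fp)

# Replace the .dat extension and write one legible line per item
with open(f"{path[:-4]}_convert.txt", "w") as fp:
    for item in bits:
        fp.write(str(item) + "\n")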

Read multiple text files, search few strings , replace and write in python

I have tens of text files in my local directory named something like test1, test2, test3, and so on. I would like to read all these files, search for a few strings in them, replace those with other strings, and finally save them back to my directory as newtest1, newtest2, newtest3, and so on.
For instance, if there was a single file, I would have done following:
#Read the file
with open('H:\\Yugeen\\TestFiles\\test1.txt', 'r') as file:
    filedata = file.read()

#Replace the target string
filedata = filedata.replace('32-83 Days', '32-60 Days')

#write the file out again
with open('H:\\Yugeen\\TestFiles\\newtest1.txt', 'w') as file:
    file.write(filedata)
Is there any way that I can achieve this in Python?
If you use Python 3, you can use scandir from the os library.
Python 3 docs: os.scandir
With that you can get the directory entries:
with os.scandir('H:\\Yugeen\\TestFiles') as it:
Then loop over these entries; your code could look something like this.
Notice I changed the path in your code to the entry object's path.
import os

# Get the directory entries
with os.scandir('H:\\Yugeen\\TestFiles') as it:
    # Iterate over directory entries
    for entry in it:
        # If not a file, continue to the next iteration
        # (not needed if you are 100% sure the directory only contains files)
        if not entry.is_file():
            continue
        # Read the file
        with open(entry.path, 'r') as file:
            filedata = file.read()
        # Replace the target string
        filedata = filedata.replace('32-83 Days', '32-60 Days')
        # Write the file out again
        with open(entry.path, 'w') as file:
            file.write(filedata)
If you use Python 2, you can use os.listdir (also applicable to Python 3).
Python 2 docs: os.listdir
The code structure is the same, but you also need to build the full path to each file yourself, since listdir only returns the filename.
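Note that the scandir version above rewrites each file in place. If you want the newtest1, newtest2, ... naming from the question, a sketch with os.listdir, assuming the same directory and strings:

import os

directory = 'H:\\Yugeen\\TestFiles'
for filename in os.listdir(directory):
    # listdir returns bare filenames, so build the full path ourselves
    with open(os.path.join(directory, filename), 'r') as file:
        filedata = file.read()
    filedata = filedata.replace('32-83 Days', '32-60 Days')
    # Write to a "new"-prefixed copy instead of overwriting the original
    with open(os.path.join(directory, 'new' + filename), 'w') as file:
        file.write(filedata)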

Create folders dynamically and write csv files to that folders

I would like to read several input files from a folder, perform some transformations, create folders on the fly, and write the csv files to the corresponding folders. The point here is that I have an input path like
"Input files\P1_set1\Set1_Folder_1_File_1_Hour09.csv" - for a single patient (this file contains readings of patient P1 at the 9th hour)
Similarly, there are multiple files for each patient, and each patient's files are grouped under their own folder as shown below.
So, to read each file, I am using a wildcard pattern as shown in the code below.
I have already tried using the glob package and am able to read the files successfully, but I am facing issues while creating the output folders and saving the files. I am parsing the file string as shown below:
f = "Input files\P1_set1\Set1_Folder_1_File_1_Hour09.csv"
f[12:] = "P1_set1\Set1_Folder_1_File_1_Hour09.csv"
filenames = sorted(glob.glob('Input files\P*_set1\*.csv'))

for f in filenames:
    print(f)       # This prints the full path
    print(f[12:])  # This prints the folder structure along with the filename
    df_transform = pd.read_csv(f)
    df_transform = df_transform.drop(['Format 10','Time','Hour'], axis=1)
    df_transform.to_csv("Output\\" + str(f[12:]), index=False)
I expect the output folder to have the csv files grouped by patient under their respective folders. The screenshot below shows how the transformed files should be arranged in the output folder (the same structure as the input folder). Please note that the "Output" folder already exists (it's easy to create one folder, you know).
So, to read the files in a folder, use the os library; then you can do:
import os
import pandas as pd

folder_path = "path_to_your_folder"
for x in os.listdir(folder_path):
    df_transform = pd.read_csv(os.path.join(folder_path, x))
    df_transform = df_transform.drop(['Format 10','Time','Hour'], axis=1)
    # Create the output folder first if it does not exist yet
    if os.path.isdir("Output"):
        df_transform.to_csv("Output/" + x, index=False)
    else:
        os.makedirs("Output")
        df_transform.to_csv("Output/" + x, index=False)
Now, instead of using f[12:], split x inside the for loop:
file_name = x.split('/')[-1]  # if you want filename.csv
Let me know if this is what you wanted.
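The listdir approach above writes everything into a single Output folder, though. To mirror the per-patient folder structure, here is a sketch building on the question's own glob pattern, assuming the same "Input files" layout and an existing "Output" folder:

import glob
import os
import pandas as pd

for f in sorted(glob.glob(os.path.join('Input files', 'P*_set1', '*.csv'))):
    relative = os.path.relpath(f, 'Input files')  # e.g. P1_set1\Set1_Folder_1_File_1_Hour09.csv
    out_path = os.path.join('Output', relative)
    # Create the per-patient output folder on the fly if it is missing
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    df = pd.read_csv(f)
    df = df.drop(['Format 10', 'Time', 'Hour'], axis=1)
    df.to_csv(out_path, index=False)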

Custom filetype in Python 3

How do I start creating my own file type in Python? I have a design in mind, but how do I pack my data into a file with a specific format?
For example, I would like my file format to be a mix of an archive (like other formats such as zip, apk, jar, etc.; they are basically all archives) with some room for packed files, plus a section of the file containing settings and serialized data that will not be accessed by an archive-manager application.
My requirement is to do all this with the default modules for CPython, without external modules.
I know that this can be long to explain and do, but I can't see how to start this in Python 3.x with CPython.
Try this:
from zipfile import ZipFile
import json

data = json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
with ZipFile('foo.filetype', 'w') as myzip:
    myzip.writestr('digest.json', data)
The file is now a zip archive with a JSON file inside (which is easy to read back in many languages) for your data. You can add files to the archive with myzip.write or myzip.writestr. You can read the data back with:
with ZipFile('foo.filetype', 'r') as myzip:
    json_data_read = myzip.read('digest.json')
    newdata = json.loads(json_data_read)
Edit: you can append arbitrary data to the end of the file with:
f = open('foo.filetype', 'a')
f.write(data)
f.close()
This still opens in WinRAR, but Python's zipfile module can no longer process the file.
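If you need extra settings that survive round-trips through Python, one option is to store them as just another member of the archive instead of appended raw bytes (an archive manager will still see the member, though). A sketch along the lines of the digest.json above; settings.json and its contents are made up for illustration:

from zipfile import ZipFile
import json

settings = {'version': 1, 'author': 'me'}  # hypothetical settings payload
with ZipFile('foo.filetype', 'a') as myzip:
    myzip.writestr('settings.json', json.dumps(settings))

with ZipFile('foo.filetype', 'r') as myzip:
    settings = json.loads(myzip.read('settings.json'))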
Use this:
import base64
import gzip
import ast

def save(data):
    # repr() produces a literal that ast.literal_eval can parse back
    data = "[{!r}]".format(data).encode()
    data = base64.b64encode(data)
    return gzip.compress(data)

def load(data):
    data = gzip.decompress(data)
    data = base64.b64decode(data)
    return ast.literal_eval(data.decode())[0]
How to use this with a file:
open(filename, "wb").write(save(data)) # save data
data = load(open(filename, "rb").read()) # load data
This might look like something that could be opened with an archive program, but it cannot, because the contents are base64-encoded and gzip-compressed, so anything reading them has to decode them first.
Also, you can store almost any type of variable in it!
example:
open(filename, "wb").write(save({"foo": "bar"})) # dict
open(filename, "wb").write(save("foo bar")) # string
open(filename, "wb").write(save(b"foo bar")) # bytes
# there's more you can store!
This may not be appropriate for your question, but I think it may help.
I faced a similar problem, and ended up creating a zip file and then renaming it to my custom file format's extension. It can still be opened with WinRAR, though.
