Python3 multiprocessing

I am an absolute beginner. I fumble my way through code by analogy to examples so apologies for any misuse of terminology.
I have written a small piece of code in python 3 which:
takes a user input (a folder on their computer)
searches the folder for PDF files
turns each page of each PDF into a sequentially numbered image
iterates through the jpgs in numerical order, converting them to black and white
OCR-scans each image (via pytesseract) and saves the text contents to a single .txt file
deletes the jpgs, leaving only the .txt file
Most of the time is spent converting to jpgs, and possibly in making them black and white.
The code works, though I am sure it could be improved. It takes a while so I thought I'd try multiprocessing using Pools.
My code appears to create pools. I can also get the function to print a list of files in the folder, so it appears to have the list passed to it in one form or another.
I cannot get it to work and have now hacked the code about repeatedly, producing various errors. I think the main problem is that I am clueless.
My code begins:
User input block (asks for a folder in the user's Documents directory, checks it is a valid folder, etc.)
OCR block as a function (parses a PDF, then outputs the contents into a single .txt file)
For loop block as a function (is supposed to loop over each PDF in the folder and execute the OCR block on it)
Multiprocessing block (is supposed to feed the list of files in the directory to the loop block)
To avoid writing War and Peace, I set out the last version of the loop and multiprocessing blocks below:
#import necessary modules
home_path = os.path.expanduser('~')

#ask for input with various checking mechanisms to make sure a useful pdfDir is obtained
pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')

def textExtractor():
    #convert pdf to jpeg with a tesseract friendly resolution
    with Img(filename=pdf_filename, resolution=300) as img: #some can be encrypted so use OCR instead of other libraries
        #various lines of code here
        compilation_temp.close()

def per_file_process(subject_files):
    for pdf in subject_files:
        #decode the whole file name as a string
        pdf_filename = os.fsdecode(pdf)
        #check whether the string ends in .pdf
        if pdf_filename.endswith(".pdf"):
            #call the OCR function on it
            textExtractor()
        else:
            print('nonsense')

if __name__ == '__main__':
    pool = Pool(2)
    pool.map(per_file_process, os.listdir(pdfDir))
Is anyone willing/able to point out my errors, please?
The relevant bits of the code whilst working:
#import necessary
home_path = os.path.expanduser('~')

#block accepting input
pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')

def textExtractor():
    #convert pdf to jpeg with a tesseract friendly resolution
    with Img(filename=pdf_filename, resolution=300) as img: #need to think about using generic expanduser or other libraries to allow portability
        #various lines of code to OCR and output .txt file
        compilation_temp.close()

subject_files = os.listdir(pdfDir)
for pdf in subject_files:
    #decode the whole file name as a string
    pdf_filename = os.fsdecode(pdf)
    #check whether the string ends in .pdf
    if pdf_filename.endswith(".pdf"):
        textExtractor()
    else:
        #print for debugging
        pass

Pool.map calls the worker function repeatedly with each name returned by os.listdir. In per_file_process, subject_files is a single filename, and for pdf in subject_files: enumerates the individual characters of that name. Further, listdir returns only base names, without the directory, so you aren't looking in the right place for the pdf. You can use glob to filter by extension and get a usable path to each file.
Your example is confusing: textExtractor() takes no parameters, so how is it to know which file it is processing? I'm going out on a limb and assuming that it really does take the path to the file it is processing. If so, you can parallelize rather easily just by feeding the pdf paths to it via map. Assuming processing time will vary by pdf, I am setting chunksize to 1 so that an early-finishing worker can grab extra files to process.
from glob import glob
import os
from multiprocessing import Pool

def textExtractor(pdf_filename):
    #convert pdf to jpeg with a tesseract friendly resolution
    with Img(filename=pdf_filename, resolution=300) as img: #some can be encrypted so use OCR instead of other libraries
        #...various lines of code here
        compilation_temp.close()

if __name__ == '__main__':
    #pdfDir is the folder inputted by user
    with Pool(2) as pool:
        # assuming call signature: textExtractor(path_to_file)
        pool.map(textExtractor,
                 (filename for filename in glob(os.path.join(pdfDir, '*.pdf'))
                  if os.path.isfile(filename)),
                 chunksize=1)
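If it helps to see the whole pipeline in one runnable piece, here is a rough sketch along the same lines. It swaps the wand-based conversion for pdf2image plus pytesseract so the example is self-contained; the folder path and output naming are placeholders rather than the asker's actual code.

import os
from glob import glob
from multiprocessing import Pool

import pytesseract
from pdf2image import convert_from_path

def textExtractor(pdf_filename):
    # Render each page at a tesseract-friendly 300 dpi, convert to greyscale, OCR it
    pages = convert_from_path(pdf_filename, dpi=300)
    text = "\n".join(pytesseract.image_to_string(page.convert('L')) for page in pages)
    # Write the result next to the source pdf
    with open(os.path.splitext(pdf_filename)[0] + '.txt', 'w', encoding='utf-8') as out:
        out.write(text)

if __name__ == '__main__':
    pdfDir = os.path.expanduser('~/Documents/scans')  # placeholder for the user-supplied folder
    with Pool(2) as pool:
        pool.map(textExtractor, glob(os.path.join(pdfDir, '*.pdf')), chunksize=1)

Because the pages stay in memory as PIL images, there are no intermediate jpgs to delete afterwards.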

Related

How to paste image from clipboard into PyQT5 Document?

I'm a noob PyQt5 user following a tutorial, and I'm confused about how I might extend the sample code below.
The two Qt5 handler methods canInsertFromMimeData and insertFromMimeData accept an image mime datatype dragged and dropped onto the document (that works great). They both receive a parameter, source, which is a QMimeData object.
However, if I try to paste an image copied from the Windows clipboard into the document, it just crashes, as there is no handler for this.
Searching the Qt5 documentation at https://doc.qt.io/qt-5/qmimedata.html just leads me to further confusion, as I'm not a C++ programmer and I'm using Python 3.x and PyQt5 to do this.
How would I write a handler to allow an image copied to the clipboard to be pasted into the document directly?
class TextEdit(QTextEdit):
    def canInsertFromMimeData(self, source):
        if source.hasImage():
            return True
        else:
            return super(TextEdit, self).canInsertFromMimeData(source)

    def insertFromMimeData(self, source):
        cursor = self.textCursor()
        document = self.document()
        if source.hasUrls():
            for u in source.urls():
                file_ext = splitext(str(u.toLocalFile()))
                if u.isLocalFile() and file_ext in IMAGE_EXTENSIONS:
                    image = QImage(u.toLocalFile())
                    document.addResource(QTextDocument.ImageResource, u, image)
                    cursor.insertImage(u.toLocalFile())
                else:
                    # If we hit a non-image or non-local URL break the loop and fall out
                    # to the super call & let Qt handle it
                    break
            else:
                # If all were valid images, finish here.
                return
        elif source.hasImage():
            image = source.imageData()
            uuid = hexuuid()
            document.addResource(QTextDocument.ImageResource, uuid, image)
            cursor.insertImage(uuid)
            return
        super(TextEdit, self).insertFromMimeData(source)
code source: https://www.learnpyqt.com/examples/megasolid-idiom-rich-text-editor/
I was in exactly the same position as you. I am also new to Python, so there might be mistakes.
The bare uuid string in document.addResource(QTextDocument.ImageResource, uuid, image) does not work; it should be a path, i.e. QUrl(uuid).
Now you can insert the image. However, because the path to an image taken from the clipboard changes, it would be better to use a different path, for example the directory where you are also saving the files.
Also be aware that the user has to select the file type when saving (.html).
For my own project I am going to print the file as a pdf; that way you don't have to worry about paths to images ^-^
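To make that concrete, here is a minimal sketch of a text edit that handles only the clipboard-image case with the QUrl wrapper described above; the uuid-based resource name is just a stand-in for the tutorial's hexuuid().

import uuid
from PyQt5.QtCore import QUrl
from PyQt5.QtGui import QTextDocument
from PyQt5.QtWidgets import QTextEdit

class PasteImageTextEdit(QTextEdit):
    def canInsertFromMimeData(self, source):
        return source.hasImage() or super().canInsertFromMimeData(source)

    def insertFromMimeData(self, source):
        if source.hasImage():
            image = source.imageData()
            name = uuid.uuid4().hex  # stand-in for the tutorial's hexuuid()
            # Register the resource under a QUrl, not a bare string, so that
            # cursor.insertImage(name) can resolve it again.
            self.document().addResource(QTextDocument.ImageResource, QUrl(name), image)
            self.textCursor().insertImage(name)
            return
        super().insertFromMimeData(source)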
I got around this by converting the images to base64 and embedding them inline; then there are no resource files, as everything is in one file.
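A sketch of that approach, assuming the image arrives as a QImage from the mime data: encode it as a PNG data URI and insert it as HTML, so no external resources are needed.

import base64
from PyQt5.QtCore import QBuffer, QIODevice

def image_to_data_uri(image):
    # Serialise the QImage to PNG in memory and base64-encode it
    buffer = QBuffer()
    buffer.open(QIODevice.WriteOnly)
    image.save(buffer, 'PNG')
    encoded = base64.b64encode(bytes(buffer.data())).decode('ascii')
    return 'data:image/png;base64,' + encoded

# e.g. inside insertFromMimeData:
#     self.textCursor().insertHtml('<img src="%s"/>' % image_to_data_uri(source.imageData()))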

Formation of folder redundantly

I have the following structure. I want to iterate through the sub-folders (machine, gunshot), process the .wav files, and build an mfccresult folder in each category with the .csv files in it. With the following code, the MFCC folder keeps being created inside the already created MFCC folder.
parent_dir = 'sound'
for subdirs, dirs, files in os.walk(parent_dir):
    resultsDirectory = subdirs + '/MFCC/'
    if not os.path.isdir("resultsDirectory"):
        os.makedirs(resultsDirectory)
    for filename in os.listdir(subdirs):
        if filename.endswith('.wav'):
            (rate, sig) = wav.read(subdirs + "/" + filename)
            mfcc_feat = mfcc(sig, rate)
            fbank_feat = logfbank(sig, rate)
            outputFile = resultsDirectory + "/" + os.path.splitext(filename)[0] + ".csv"
            file = open(outputFile, 'w+')
            numpy.savetxt(file, fbank_feat, delimiter=",")
            file.close()
What version of Python are you using? Not sure if this has changed in the past, but os.walk does not return "subdirs" as the first element of the tuple; it returns the dirpath. See here for Python 3.6.
I don't know your absolute path, but seeing as you are passing in the path sound as a relative reference, I assume it is a folder inside the directory where you run your Python code. So for example, let's say you are running this file (let's call it mycode.py) from
/home/username/myproject/mycode.py
and you have some subdirectory:
/home/username/myproject/sound/
So:
resultsDirectory = subdirs + '/MFCC/'
as written in your code above would resolve to:
/home/username/myproject/sound/MFCC/
So your first if statement will be entered since this is not an existing directory. Thereby you create a new directory:
/home/username/myproject/sound/MFCC/
From there, you take
filename in os.listdir(subdirs)
This also appears to be a misunderstanding of the output of this function. os.listdir() returns every entry in the directory, including subdirectory names, not just files. See here for the docs on that.
So now you are looping through the directories in:
/home/username/myproject/sound/
Here, I assume you have some of the directories from your diagram already made. So I assume you have:
/home/username/myproject/sound/machine_sound
/home/username/myproject/sound/gun_shot_sound
or something along those lines.
So the if statement will never be entered, since your directory names do not end with '.wav'.
Even if it did enter, you'd still have issues, as filename would actually be equal to machine_sound on the first loop and gun_shot_sound the second time through.
Maybe you are using some other wav library, but the python built-in is called wave and you need to call the wave.open() on the file not wav.read(). See here for the docs.
I'm not sure what you were trying to achieve with the call to os.path.splitext(filename)[0], but you can read about it here. In this case you will end up with the same thing that went in, so machine_sound and gun_shot_sound.
Your output file will thus result in:
/home/username/myproject/sound/MFCC/machine_sound.csv
on the first loop, and
/home/username/myproject/sound/MFCC/gun_shot_sound.csv
the second time through.
So in conclusion, I'm not sure what is happening when you say "MFCC folder is keep forming in already formed MFCC folder", but you definitely have a lot of reading ahead of you before you can understand your own code and have any hope of fixing it to do what you want. Assuming you read through the links I provided, you should be able to do that, though. Good luck!
Additionally, you had quite a few typos in your code that I edited, including the immensely important whitespace characters. You should clean that up and ensure your code runs before posting it here, then double-check that your copy/paste action did not result in any errors. People will be much more willing to help if you clean up your presentation a bit.
for subdir, dirs, files in os.walk(parent_dir):
    for folder in next(os.walk(parent_dir))[1]:
        resultsDirectory = folder + '/MFCC'
        absPath = os.path.join(parent_dir, resultsDirectory)
        if not os.path.isdir(absPath):
            os.makedirs(absPath)
    for filename in os.listdir(subdir):
        print('listdir')
        if filename.endswith('.wav'):
            print("csv file writing")
            (rate, sig) = wav.read(subdir + "/" + filename)
            mfcc_feat = mfcc(sig, rate)
            fbank_feat = logfbank(sig, rate)
            print("fbank_feat")
            outputFile = subdir + "/MFCC" + "/" + os.path.splitext(filename)[0] + ".csv"
            file = open(outputFile, "w+")
            numpy.savetxt(file, fbank_feat, delimiter=",")
            file.close()
Here the csv file is stored in the subdirectory, not in the MFCC folder for each category. I have an issue with the output file path.
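Putting the answer's points together, a minimal sketch of one way to get an MFCC folder with one csv per .wav file in each category. It assumes the imports behind wav, mfcc and logfbank are scipy.io.wavfile and python_speech_features, which the variable names suggest but the question does not show.

import os
import numpy
import scipy.io.wavfile as wav                      # assumed source of wav.read
from python_speech_features import mfcc, logfbank  # assumed source of mfcc / logfbank

parent_dir = 'sound'

for subdir, dirs, files in os.walk(parent_dir):
    # Do not descend into the MFCC folders we create ourselves
    dirs[:] = [d for d in dirs if d != 'MFCC']
    wav_files = [f for f in files if f.endswith('.wav')]
    if not wav_files:
        continue
    resultsDirectory = os.path.join(subdir, 'MFCC')
    os.makedirs(resultsDirectory, exist_ok=True)
    for filename in wav_files:
        rate, sig = wav.read(os.path.join(subdir, filename))
        mfcc_feat = mfcc(sig, rate)     # computed as in the original, though only fbank_feat is saved
        fbank_feat = logfbank(sig, rate)
        outputFile = os.path.join(resultsDirectory, os.path.splitext(filename)[0] + '.csv')
        numpy.savetxt(outputFile, fbank_feat, delimiter=',')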

How to scrape a pdf file to make a DF

I need to build a database with several data points. Most of the data is contained in PDF files. The PDF files are all the same, differing only in the data (for example, one of the files I have to work with: https://documentos.serviciocivil.cl/actas/dnsc/documentService/downloadWs?uuid=aecfeb7c-d494-4631-ade4-584d67ea120e).
I have been trying to extract the data with PyPDF, tabula, pdfminer (I even tried textract, but it didn't work through Anaconda) and other tools, but I didn't get what I want.
Then I tried to convert those PDF files to txt files and mine them, but didn't get anything. I also tried regex but didn't understand how to use it, although the code doesn't show errors when running:
import re
import sys

recording = False
your_file = "D:\Magister\Tercer semestre\Tesis I\Txt\ResultadoConcurso1.txt"
start_pattern = 'apellidos:'
stop_pattern = '1.2'
output_section = []

for line in open(your_file).readlines():
    if recording is False:
        if re.search(start_pattern, line) is not None:
            recording = True
            output_section.append(line.strip())
    elif recording is True:
        if re.search(stop_pattern, line) is not None:
            recording = False
            sys.exit()
        output_section.append(line.strip())

print("".join(output_section))
As you can see in the link I left above, the pdf files have different sections. I need to get the info that's inside those sections. For example, one of the fields in my database is going to be "Nombre y apellido" (name and last name). It's contained between "apellidos:" and "1.2".
What should I do? Can I work directly with the PDF format, or should I work with txt files? And then, what should I use to get the info? (Python 3.XX; Anaconda)
Thanks
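For the regex part, a minimal sketch of pulling out the text between the two markers, assuming the PDF has already been converted to a txt file as described above (the file name is the one from the question; the function name is just illustrative):

import re

def extract_between(text, start='apellidos:', stop='1.2'):
    # DOTALL lets the capture span line breaks between the two markers
    match = re.search(re.escape(start) + r'(.*?)' + re.escape(stop), text, re.DOTALL)
    return match.group(1).strip() if match else None

with open('ResultadoConcurso1.txt', encoding='utf-8') as f:
    print(extract_between(f.read()))

Each field extracted this way could then become a column of a pandas DataFrame, one row per PDF.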

Copying files across a network with Python

I am trying to loop through a list of filepaths for files I have throughout the entire network at my company. The filepaths have locations of various drives throughout the network.
The user submitted the file once upon a time, and the filepath was captured at the point of submission. However, the drive letter is not the same for every user, nor the same as what that drive is mapped to on my machine.
For example: a path like X:\Users\Submissions\Bob's File.xlsx may refer to the same drive and file, but be named differently on my machine:
K:\Users\Submissions\Bob's File.xlsx
Each user may be using a different letter for that particular drive, for any number of reasons.
Is there a way I can make the path string I pass in smart enough to find the proper directory and locate that file? Any ideas would be great.
Thank you
import pandas as pd
import shutil as sh

copydir = r"C:\Users\me\Desktop\PythonSpyderDesktop\Extractor\Models"
file_path_list = r"C:\Users\me\Desktop\PythonSpyderDesktop\Extractor\FilePathList.csv"
data = pd.read_csv(file_path_list)

i = 1  #Start at 2nd row
for i in range(1, len(data)):
    try:
        sh.copyfile(data.FilePath[i], copydir)
        print("Copied over file: " + data.FilePath[i])
    except:
        print("File not found.")
Your question is unclear. It revolves around the source & dest arguments being passed to copyfile:
sh.copyfile(data.FilePath[i], copydir)
It's hard to tell what pathnames you're extracting from the .CSV, but apparently source files may have the "wrong" drive letter, and/or the destination directory copydir may have the "wrong" drive letter. The script apparently runs on multiple machines, and those machines have diverse drive letters mounted.
Write a helper function that finds the "right" drive letter. It should accept a pathname like copydir, then probe a search list, then return a corrected pathname.
Given a list of drive letters, you can iterate through them and test whether a pathname exists using os.path.exists(). Return the first one found.
Use splitdrive() to parse out components of the input pathname.
Suppose that both source and dest may need their drive letters fixed up. Then the call might look like this:
sh.copyfile(fix_path(data.FilePath[i]), fix_path(copydir))
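And a minimal sketch of such a helper, along the lines described above; the set of drive letters to probe is an assumption:

import os
import string

def fix_path(pathname, drives=string.ascii_uppercase):
    """Return pathname under the first drive letter where it actually exists."""
    if os.path.exists(pathname):
        return pathname                     # already correct on this machine
    drive, rest = os.path.splitdrive(pathname)
    for letter in drives:
        candidate = letter + ':' + rest
        if os.path.exists(candidate):
            return candidate
    return pathname                         # nothing found; let the caller report the error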

PyPDF2 returns only empty lines for some files

I am working on a script that "reads" PDF files and then automatically renames the files it recognizes from a dictionary. PyPDF2, however, only returns empty lines for some PDFs, while working fine for others. The code for reading the files:
import PyPDF2

# File name
file = 'sample.pdf'

# Open File
with open(file, "rb") as f:
    # Read in file
    pdfReader = PyPDF2.PdfFileReader(f)
    # Check number of pages
    number_of_pages = pdfReader.numPages
    print(number_of_pages)
    # Get first page
    pageObj = pdfReader.getPage(0)
    # Extract text from page 1
    text = pageObj.extractText()
    print(text)
It does get the number of pages correctly, so it is able to open the PDF.
If I replace print(text) with repr(text) for files it doesn't read, I get something like:
"'\\n\\n\\n\\n\\n\\n\\n\\nn\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n'"
Weirdly enough, when I enhance (OCR) the files with Adobe, the script performs slightly worse: it recognized 140 out of 800 files before enhancing and just 110 afterwards.
The PDFs are machine readable/searchable, because I am able to copy/paste text to Notepad. I tested some files with "pdfminer" and it does show some text, but it also throws a lot of errors. If possible I'd like to keep working with PyPDF2.
Specifications of the software I am using:
Windows: 10.0.15063
Python: 3.6.1
PyPDF: 1.26.0
Adobe version: 17.009.20058
Does anyone have any suggestions? Your help is very much appreciated!
I had the same issue; I fixed it using another Python library called slate.
Fortunately, I found a fork, slate3k, that works in Python 3.6.5:
import slate3k as slate

with open('file.pdf', 'rb') as f:
    extracted_text = slate.PDF(f)
print(extracted_text)
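Building on that, a rough sketch of the renaming workflow from the question, assuming slate3k behaves like the original slate (a list-like object with one text string per page) and using a keyword-to-filename dictionary and folder name that are purely illustrative:

import os
import slate3k as slate

rename_map = {'invoice': 'invoice.pdf', 'contract': 'contract.pdf'}  # illustrative mapping
folder = 'pdfs'  # assumed input folder

for name in os.listdir(folder):
    if not name.lower().endswith('.pdf'):
        continue
    path = os.path.join(folder, name)
    with open(path, 'rb') as f:
        pages = slate.PDF(f)            # one extracted string per page
    text = ' '.join(pages).lower()
    for keyword, new_name in rename_map.items():
        if keyword in text:
            os.rename(path, os.path.join(folder, new_name))
            break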
