How to scrape a pdf file to make a DF - python-3.x

I need to build a database from several data sources. Most of the data is contained in PDF files. The PDF files all share the same layout; only the data changes. (For example, one of the files I have to work with: https://documentos.serviciocivil.cl/actas/dnsc/documentService/downloadWs?uuid=aecfeb7c-d494-4631-ade4-584d67ea120e)
I have tried to extract the data with PyPDF, tabula, and pdfminer (I even tried textract, but it didn't work through Anaconda), among other tools, but I didn't get what I wanted.
Then I tried converting the PDF files to txt files and mining those, but didn't get anything. I also tried regex, but I didn't understand how to use it, although the code runs without errors:
import re
import sys

recording = False
your_file = "D:\Magister\Tercer semestre\Tesis I\Txt\ResultadoConcurso1.txt"
start_pattern = 'apellidos:'
stop_pattern = '1.2'
output_section = []
for line in open(your_file).readlines():
    if recording is False:
        if re.search(start_pattern, line) is not None:
            recording = True
            output_section.append(line.strip())
    elif recording is True:
        if re.search(stop_pattern, line) is not None:
            recording = False
            sys.exit()
        output_section.append(line.strip())
print("".join(output_section))
As you can see in the link above, the PDF files have different sections. I need to get the information inside those sections. For example, one of the fields in my database is going to be "Nombre y apellido" (name and last name). It is contained between "apellidos:" and "1.2".
What should I do? Can I work directly from the PDF format, or should I work with txt files? And then, what should I use to get the information? (Python 3.XX; Anaconda)
Thanks
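A minimal sketch of one way to pull out the text between two markers, assuming the PDF has already been converted to plain text. The helper name extract_between and the sample text are hypothetical, made up for illustration; the real file contents may be laid out differently. Note that the stop marker is escaped, because in a regex an unescaped '1.2' also matches strings such as '1x2'.

```python
import re

def extract_between(text, start_marker, stop_marker):
    """Return the text between two literal markers, or None if not found."""
    # re.escape treats the markers as literal text, so "1.2" matches only "1.2"
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(stop_marker)
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces into single spaces
    return " ".join(match.group(1).split())

# Hypothetical sample standing in for the contents of one converted txt file
sample_text = "1.1 Nombres y apellidos: Juan\nPérez Soto\n1.2 Cargo: Director\n"

name = extract_between(sample_text, "apellidos:", "1.2")
print(name)
```

Running the same function over every file and collecting the results in a list of dictionaries would give rows that can later be loaded into a DataFrame.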

Related

How to save the output of text from selenium chrome (Python)

I'm using Selenium to extract YouTube comments.
Everything went well, but when I print comment.text, the output is only the last sentence.
I don't know how to save it for further analysis (cleaning and tokenization).
from selenium import webdriver
import time

# This is the path where I downloaded and saved my Chrome driver
path = "/mnt/c/Users/xxx/chromedriver.exe"
chrome = webdriver.Chrome(path)
url = "https://www.youtube.com/watch?v=WPni755-Krg"
chrome.get(url)
chrome.maximize_window()
# scroll down
sleep = 5
chrome.execute_script('window.scrollTo(0, 500);')
time.sleep(sleep)
chrome.execute_script('window.scrollTo(0, 1080);')
time.sleep(sleep)
text_comment = chrome.find_element_by_xpath('//*[@id="contents"]')
comments = text_comment.find_elements_by_xpath('//*[@id="content-text"]')
comment_ids = []
# Try this approach for getting the text of all comments. (The for-loop part was edited: there was no indentation in the previous code.)
for comment in comments:
    comment_ids.append(comment.get_attribute('id'))
    print(comment.text)
When I print, I can see all the texts here, but how can I open them for further study? Should I always use a for loop? I want to tokenize the texts, but the output is only the last sentence. Is there a way to save a text file with all the texts inside it and open it again? I googled it a lot without success.
So it sounds like you're just trying to store these comments to reference later. Is your current solution to append them to a string and use a token to create substrings? I'm not familiar with Python's data structures, but this sounds like a great job for an array or a list, depending on how you plan to reference this data.
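A minimal sketch of the list idea, assuming the comment texts have already been collected (for example with [comment.text for comment in comments]). The helper names save_comments and load_comments, the file name comments.json, and the sample strings are all hypothetical. JSON is used here only because it keeps multi-line comments intact; a plain text file would also work for single-line comments.

```python
import json
import os
import tempfile

def save_comments(comment_texts, path):
    """Write the list of comment strings to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(comment_texts, f, ensure_ascii=False, indent=2)

def load_comments(path):
    """Read the list of comment strings back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Stand-in for the texts scraped with Selenium
comment_texts = ["First comment", "Second comment, with more words"]

path = os.path.join(tempfile.gettempdir(), "comments.json")
save_comments(comment_texts, path)

restored = load_comments(path)
# Naive whitespace tokenization, one token list per comment
tokens = [text.split() for text in restored]
print(tokens)
```

Because every comment is kept in the list, nothing is overwritten inside the loop, and the saved file can be reopened later for cleaning and tokenization.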

Extracting title from pdf using pypdf2 not working

I'm trying to extract the title of PDF files using pyPDF2. The output is either none or a wrong title. I tried using PDFminer as well, still the same result. I tried using 3 different pdf files. Is there a better way to extract the title with better accuracy?
This is the code I used:
from PyPDF2 import PdfFileReader

def get_pdf_title(pdf_file_path):
    pdf_reader = PdfFileReader(open(pdf_file_path, "rb"))
    return pdf_reader.getDocumentInfo().title

title = get_pdf_title('C:/PythonPrograms/Test.pdf')
print(title)
Your code is working, at least for me on Python 3.5.2. Check in the PDF properties that it indeed has a title.
A PDF's title is part of its metadata and needs to be set explicitly. It is not mandatory, and it is not tied to the document's content (other than by the will of the person writing it) or to its filename.
If you use your snippet on a file with no title, its output will be empty (None or an empty string).
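Since the metadata title is optional, one possible workaround is to fall back to something else when it is missing. The sketch below assumes the file name is an acceptable substitute, which may not hold for every collection; choose_title is a hypothetical helper, not part of PyPDF2.

```python
from pathlib import Path

def choose_title(metadata_title, pdf_path):
    """Prefer the metadata title; fall back to the file name without extension."""
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()
    # Assumption: the file name is a reasonable stand-in for a missing title
    return Path(pdf_path).stem

print(choose_title(None, 'C:/PythonPrograms/Test.pdf'))
print(choose_title('  Annual Report ', 'C:/PythonPrograms/Test.pdf'))
```

The first argument would be whatever get_pdf_title returned for the file.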

PyPDF2 returns only empty lines for some files

I am working on a script that "reads" PDF files and then automatically renames the files it recognizes from a dictionary. However, PyPDF2 returns only empty lines for some PDFs, while working fine for others. The code for reading the files:
import PyPDF2

# File name
file = 'sample.pdf'
# Open File
with open(file, "rb") as f:
    # Read in file
    pdfReader = PyPDF2.PdfFileReader(f)
    # Check number of pages
    number_of_pages = pdfReader.numPages
    print(number_of_pages)
    # Get first page
    pageObj = pdfReader.getPage(0)
    # Extract text from page 1
    text = pageObj.extractText()
    print(text)
It does get the number of pages correctly, so it is able to open the PDF.
If I replace print(text) with print(repr(text)) for files it doesn't read, I get something like:
"'\\n\\n\\n\\n\\n\\n\\n\\nn\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n'"
Weirdly enough, when I enhance (OCR) the files with Adobe, the script performs slightly worse: it recognized 140 out of 800 files before enhancing and just 110 after.
The PDFs are machine readable/searchable, because I am able to copy/paste text to Notepad. I tested some files with "pdfminer" and it does show some text, but it also throws a lot of errors. If possible, I would like to keep working with PyPDF2.
Specifications of the software I am using:
Windows: 10.0.15063
Python: 3.6.1
PyPDF: 1.26.0
Adobe version: 17.009.20058
Does anyone have any suggestions? Your help is very much appreciated!
I had the same issue; I fixed it using another Python library called slate.
Fortunately, I found a fork that works in Python 3.6.5:
import slate3k as slate

with open('file.pdf', 'rb') as f:
    extracted_text = slate.PDF(f)
print(extracted_text)
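With around 800 files, it might be convenient to try one extractor first and switch to another only when the result is blank. The sketch below shows that idea with hypothetical stand-in functions; in practice each extractor would wrap a real library call (for example the PyPDF2 code from the question, then slate3k). The name first_nonblank_text is an assumption for illustration.

```python
def first_nonblank_text(pdf_path, extractors):
    """Return the first extractor result that contains non-whitespace text."""
    for extract in extractors:
        text = extract(pdf_path)
        if text and text.strip():
            return text
    # Every extractor returned only whitespace (or nothing)
    return ""

# Hypothetical stand-ins for real extractor functions
def blank_extractor(pdf_path):
    return "\n\n\n\n"

def working_extractor(pdf_path):
    return "Invoice 123\n"

result = first_nonblank_text("sample.pdf", [blank_extractor, working_extractor])
print(repr(result))
```

Keeping the preferred library first in the list means the fallback is only used for the files it cannot read.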

Opening another .py file in a function to pass arguments in Python3.5

I'm pretty new to Python, and the overall goal of the project I am working on is to set up a SQLite DB that will allow easy entries in the future for non-programmers (this is for a small group of people who are all technically competent). The way I am trying to accomplish this right now is to have people save their new data entry as a .py file through a simple text editor and then open that .py file within the function that enters the values into the DB. So far I have:
def newEntry(material=None, param=None, value=None):
    if param == 'density':
        print('The density of %s is %s' % (material, value))

import fileinput
for line in fileinput.input(files=('testEntry.py')):
    process(line)
Then, with a simple text editor, I created a file called testEntry.py that will hopefully be called by newEntry.py when newEntry is executed in the terminal. The idea here is that a user would just have to write the function name with the arguments they are inputting within the parentheses. testEntry.py is simply:
# Some description for future users
newEntry(material='water', param='density', value='1')
When I run newEntry.py in my terminal nothing happens. Is there some other way to open and execute a .py file within another that I do not know of? Thank you very much for any help.
Your solution works, but as a commenter said, it is very insecure and there are better ways. Presuming your process(...) method is just executing some arbitrary Python code, this could be abused to execute system commands, such as deleting files (very bad).
Instead of using a .py file consisting of a series of newEntry(...) calls, one per line, have your users produce a CSV file with the appropriate column headers. For example:
material,param,value
water,density,1
Then parse this csv file to add new entries:
import csv

with open('entries.csv') as entries:
    csv_reader = csv.reader(entries)
    header = True
    for row in csv_reader:
        if header:  # Skip header
            header = False
            continue
        material = row[0]
        param = row[1]
        value = row[2]
        if param == 'density':
            print('The density of %s is %s' % (material, value))
Your users could use Microsoft Excel, Google Sheets, or any other spreadsheet software that can export .csv files to create/edit these files, and you could provide a template to the users with predefined headers.
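Since the stated goal is a SQLite DB, here is a minimal sketch of how the CSV rows could be inserted, assuming a table named entries with three text columns (the table name, the load_entries helper, and the in-memory database are assumptions for illustration). csv.DictReader reads the header row itself, so no manual skipping is needed, and the parameterized INSERT avoids executing anything the user typed.

```python
import csv
import io
import sqlite3

def load_entries(csv_file, connection):
    """Insert every CSV row into the (hypothetical) entries table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS entries (material TEXT, param TEXT, value TEXT)"
    )
    reader = csv.DictReader(csv_file)  # uses the header row for the keys
    rows = [(row["material"], row["param"], row["value"]) for row in reader]
    # Placeholders keep user-supplied values as data, never as SQL or code
    connection.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)

# In-memory stand-ins for entries.csv and the real database file
sample_csv = io.StringIO("material,param,value\nwater,density,1\n")
connection = sqlite3.connect(":memory:")

inserted = load_entries(sample_csv, connection)
stored = connection.execute("SELECT material, param, value FROM entries").fetchall()
print(inserted, stored)
```

With a real file, the StringIO object would be replaced by open('entries.csv', newline='') and the connection by sqlite3.connect with a path to the database file.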

Python3 multiprocessing

I am an absolute beginner. I fumble my way through code by analogy to examples so apologies for any misuse of terminology.
I have written a small piece of code in python 3 which:
takes a user input (a folder on their computer)
searches the folder for PDF files
turns each page of the PDF into an image with sequential numbering
iterates through the jpgs in order of numbering, turning them black and white
OCR scans the files and outputs the text into an object
saves the text contents to a .txt file (via pytesseract)
deletes the jpgs, leaving the .txt file
Most of the time is taken in converting to jpgs and possibly making them black and white.
The code works, though I am sure it could be improved. It takes a while so I thought I'd try multiprocessing using Pools.
My code appears to create pools. I can also get the function to print a list of files in the folder, so it appears to have the list passed to it in one form or another.
I cannot get it to work and have now hacked the code about repeatedly with various errors. I think the main problem is, I am clueless.
My code begins:
User input block (asks for a folder in the user's directory, checks it is a valid folder, etc.).
OCR block as a function (parses the PDF, then outputs the contents into a single .txt file).
For loop block as a function (is supposed to loop over each PDF in the folder and execute the OCR block on it).
Multiprocessing block (is supposed to feed the list of files in the directory to the loop block).
To avoid writing War and Peace, I set out the last version of the loop and multiprocessing blocks below:
#import necessary modules
home_path = os.path.expanduser('~')
#ask for input with various checking mechanisms to make sure a useful pdfDir is obtained
pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')
def textExtractor():
    #convert pdf to jpeg with a tesseract friendly resolution
    with Img(filename=pdf_filename, resolution=300) as img: #some can be encrypted so use OCR instead of other libraries
        #various lines of code here
        compilation_temp.close()
def per_file_process (subject_files):
    for pdf in subject_files:
        #decode the whole file name as a string
        pdf_filename = os.fsdecode(pdf)
        #check whether the string ends in .pdf
        if pdf_filename.endswith(".pdf"):
            #call the OCR function on it
            textExtractor()
        else:
            print ('nonsense')
if __name__ == '__main__':
    pool = Pool(2)
    pool.map(per_file_process, os.listdir(pdfDir))
Is anyone willing/able to point out my errors, please?
The relevant bits of the code whilst working:
#import necessary
home_path = os.path.expanduser('~')
#block accepting input
pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')
def textExtractor():
    #convert pdf to jpeg with a tesseract friendly resolution
    with Img(filename=pdf_filename, resolution=300) as img: #need to think about using generic expanduser or other libraries to allow portability
        #various lines of code to OCR and output .txt file
        compilation_temp.close()
subject_files = os.listdir(pdfDir)
for pdf in subject_files:
    #decode the whole file name as a string you can see
    pdf_filename = os.fsdecode(pdf)
    #check whether the string ends in /pdf
    if pdf_filename.endswith(".pdf"):
        textExtractor()
    else:
        #print for debugging
Pool.map calls the worker function repeatedly, once with each name returned by os.listdir. In per_file_process, subject_files is therefore a single filename, and for pdf in subject_files: enumerates the individual characters in that name. Further, listdir returns only base names, without the directory path, so you aren't looking in the right place for the PDF. You can use glob to filter by extension and return a working path to each file.
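A small sketch of the first point. To keep it runnable as a plain script, it uses multiprocessing.pool.ThreadPool, which has the same map interface as Pool but runs threads instead of processes; that substitution, the worker names, and the sample file names are assumptions made for illustration.

```python
from multiprocessing.pool import ThreadPool

def per_file_process_buggy(subject_files):
    # map passes ONE file name here, so this loop walks over its characters
    return [pdf for pdf in subject_files]

def per_file_process(pdf_filename):
    # Treat the argument as what it is: a single file name
    return pdf_filename

names = ["a.pdf", "notes.txt"]  # stand-in for os.listdir(pdfDir)

with ThreadPool(2) as pool:
    buggy = pool.map(per_file_process_buggy, names)
    fixed = pool.map(per_file_process, names)

print(buggy[0])
print(fixed)
```

The buggy worker sees 'a', '.', 'p', 'd', 'f' one at a time, so the endswith(".pdf") check never succeeds.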
Your example is confusing... textExtractor() takes no parameters, so how is it to know which file it is processing? I'm going out on a limb and assuming that it really does take the path to the file it is processing. If so, you can parallelize rather easily just by feeding the PDF paths to it via map. Assuming processing time will vary by PDF, I am setting chunksize to 1 so that an early-finishing worker can grab extra files to process.
from glob import glob
import os
from multiprocessing import Pool

def textExtractor(pdf_filename):
    #convert pdf to jpeg with a tesseract friendly resolution
    with Img(filename=pdf_filename, resolution=300) as img: #some can be encrypted so use OCR instead of other libraries
        #...various lines of code here
        compilation_temp.close()

if __name__ == '__main__':
    #pdfDir is the folder inputted by user
    with Pool(2) as pool:
        # assuming call signature: textExtractor(path_to_file)
        pool.map(textExtractor,
                 (filename for filename in glob(os.path.join(pdfDir, '*.pdf'))
                  if os.path.isfile(filename)),
                 chunksize=1)
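To see the difference between os.listdir and glob in isolation, here is a small sketch that builds a throwaway folder with made-up file names (the temporary directory and the names are assumptions for illustration).

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as pdf_dir:
    # Create three empty files standing in for the user's folder
    for name in ("a.pdf", "b.pdf", "notes.txt"):
        open(os.path.join(pdf_dir, name), "w").close()

    # listdir: bare names only, every extension included
    base_names = sorted(os.listdir(pdf_dir))

    # glob: only *.pdf, each with the directory prefix attached
    pdf_paths = sorted(glob(os.path.join(pdf_dir, "*.pdf")))

    all_exist = all(os.path.isfile(path) for path in pdf_paths)
    pdf_base_names = [os.path.basename(path) for path in pdf_paths]

print(base_names)
print(pdf_base_names, all_exist)
```

The paths from glob can be opened directly from any working directory, while the bare names from listdir only work if the script happens to run inside that folder.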

Resources