PyPDF2/PDFMiner text extraction issue when PDF is created using InDesign - python-3.x

We have a large number of PDFs which have been created with InDesign and not all the text was being extracted by PyPDF2. Here is the code:-
for pageNum in range(0, pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    # note text is a bytes object not string.
    text = pageObj.extractText().encode('utf-8')
    search_text = text.lower()
    if search_word in search_text.decode("utf-8"):
        search_word = search_word.strip()
        search_word_count += 1
        print("Pattern Found on Page: " + str(pageNum+1))
search_word_count_list.append(search_word_count)
print("The word:- '{}' was found:- {} times\n".format(search_word, search_word_count))
I did some testing with PDFMiner and found I had the same results, i.e. the same bits of text were extracted/not extracted. So I figured there must be something going on with the PDF.
Off the back of this, I worked with a typesetter doing some testing and discovered that when text boxes are locked in InDesign (Ctrl+L), the exported PDF has its text locked and it is not extractable; that is, the locked parts are not extractable via PyPDF2 or PDFMiner.
Going forward I can ask the typesetters to unlock text before exporting PDFs, but with the thousands of existing PDFs I want to be able to extract the locked text; asking the typesetters to unlock thousands of files is not an option. Does anyone have experience of this? Any ideas on how to access the locked text?
Edit 1
So I did some testing with Adobe Acrobat Pro 11. When saving as plain text, the locked text is not saved to the text file, but the unlocked text is saved to the .txt file.
Checking the Security tab in Acrobat:-
With all tested documents, open in Acrobat I pick File -> Properties, switch to the Security tab of the Document Properties dialog, and there I read "Security Method: No Security", and under the restrictions everything is 'Allowed' (Printing, Changing, Copying ...). So I think these are all valid PDFs, which are unprotected.
Edit 2
I have tried to install pdf2txt but my machine does not meet requirements as I am missing "Microsoft Visual C++ 14.0" and as it's a work machine it is locked down.
Edit 3
Acrobat says the PDF Version 1.7. PDF Producer Adobe PDF Library 15.0
I can copy-paste the locked text so I do not think rasterisation is the problem.
Edit 4 - possible solution
So I have tested using the https://pdftotext.com/ and it was able to access the locked text. So I will talk to the IT department to get "Microsoft Visual C++ 14.0" installed so I can use the pdf2txt library.
Edit 5
Have not had much luck with installing PDFtotext due to problems installing poppler which is nothing short of a nightmare to install.
Off the back of usr2564301's input, I have done some more testing with PDFMiner. Here is the code I am using to test with:-
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

def pdf_to_text(path):
    manager = PDFResourceManager()
    retstr = BytesIO()
    layout = LAParams(all_texts=True)
    device = TextConverter(manager, retstr, laparams=layout)
    filepath = open(path, 'rb')
    interpreter = PDFPageInterpreter(manager, device)
    for page in PDFPage.get_pages(filepath, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    filepath.close()
    device.close()
    retstr.close()
    return text
As for versions I have pdfminer==20191125 installed. On Github it says "Supports PDF-1.7. (well, almost)" so maybe the problem is PDF 1.7?
Edit 6
Just tried to use PDFMiner (pdf2txt.py) via Command Prompt, using this command:
python C:\Users\my_name\AppData\Local\Programs\Python\Python37-32\Scripts\pdf2txt.py -o output.txt file_name.pdf
but I get the same result, i.e. locked text does not come through.
Edit 7
So I did some testing with a designer and we proved that it is when the text is on the master page that it is not accessible to PDFMiner. The designers have multiple preset master pages so that they can drag the required one onto the front cover. If a designer wrongly works directly on the master page, the content is locked and cannot be accessed via PDFMiner.
Note that there is no problem when text is on the front cover and locked, only when it is on the master page.
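As a side note on the first snippet: its loop counts pages that contain the word rather than occurrences, and it strips search_word only after the comparison has been made. A minimal sketch of the counting step on its own, kept independent of any PDF library so the same check can be run on text from PyPDF2, PDFMiner or another extractor. The name count_word_by_page and its arguments are hypothetical and not part of either library, and the sample strings are invented:

```python
def count_word_by_page(page_texts, search_word):
    """Return {page_number: occurrences} for pages containing search_word.

    page_texts is assumed to be a list of already-extracted page strings,
    in page order. Matching is case-insensitive; page numbers start at 1.
    """
    needle = search_word.strip().lower()
    if not needle:
        return {}
    hits = {}
    for page_number, text in enumerate(page_texts, start=1):
        # str.count does plain substring matching, so "text" also
        # matches inside longer words such as "context".
        count = text.lower().count(needle)
        if count:
            hits[page_number] = count
    return hits


# Invented sample pages standing in for extracted text.
pages = ["Locked text on the cover", "nothing here", "Text, more TEXT"]
hits = count_word_by_page(pages, " text ")
print(hits)
print("total:", sum(hits.values()))
```

Running the same function on the output of two different extractors makes it easy to see which pages lose text.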

Related

How to use system default icons for a file type in a QTreeView

For reference, this is all using PyQt5 and Python 3.6:
I've got a QStandardItemModel that is built from QStandardItems that are strings of the items in a zip (the model displays all the contents of a zipfile). I went with this choice as I cannot cache the files locally, and my research shows that QFileSystemModel cannot work on archives unless I unpack at least temporarily.
All items in the QStandardItemModel end in the correct extension for the file (.csv, .txt, etc.), and I need to display the icon a user would see if they were looking at the file in Windows Explorer, but show it in the QTreeView (a user seeing content.csv should also see the icon for Excel). On that note, this application only runs on Windows.
How can I pull the extension's default system file icon and set it while I am setting these items? Would I have to manually download the icons for my known file types and do this, or does the system store them somewhere I can access?
Here's some basic code of how I build and display the model and treeview:
self.zip_model = QtGui.QStandardItemModel()
# My Computer directory explorer
self.tree_zip = QTreeView()
self.tree_zip.setModel(self.zip_model)

def build_zip_model(self,current_directory):
    self.zip_model.clear()
    with zipfile.ZipFile(current_directory) as zip_file:
        for item in zip_file.namelist():
            model_item = QtGui.QStandardItem(item)
            self.zip_model.appendRow(model_item)
You can use QFileIconProvider:
def build_zip_model(self, current_directory):
    iconProvider = QtWidgets.QFileIconProvider()
    self.zip_model.clear()
    with zipfile.ZipFile(current_directory) as zip_file:
        for item in zip_file.namelist():
            icon = iconProvider.icon(QtCore.QFileInfo(item))
            model_item = QtGui.QStandardItem(icon, item)
            self.zip_model.appendRow(model_item)

setting default printer custom page size with python

At the organization I work for, different printers are set up at various locations. All are mainly used to print A4-sized documents, so the defaults are set up accordingly.
We are also using a bunch of custom-sized forms which people have up to now been filling in by hand.
Recently, I was tasked with setting up print-automation onto the said forms from our central database.
I'm using reportlab to create temporary pdf files which I am then trying to send to the default printer. All is relatively simple, save for getting the printers to register a custom paper size.
I got as far as the following code snippet, but I'm really stuck.
import tempfile
import win32api
import win32print
pdf_file = tempfile.mktemp(".pdf")
#CREATION OF PDF FILE WITH REPORTLAB
printer = win32print.GetDefaultPrinter()
PRINTER_DEFAULTS = {"DesiredAccess":win32print.PRINTER_ALL_ACCESS}
pHandle = win32print.OpenPrinter(printer, PRINTER_DEFAULTS)
level = 2
properties = win32print.GetPrinter(pHandle, level)
pDevModeObj = properties["pDevMode"]
pDevModeObj.PaperSize = 0
pDevModeObj.PaperLength = 2200 #SIZE IN 1/10 mm
pDevModeObj.PaperWidth = 1000 #SIZE IN 1/10 mm
properties["pDevMode"]=pDevModeObj
win32print.SetPrinter(pHandle,level,properties,0)
#OPTION ONE
#win32api.ShellExecute(0, "print", pdf_file, None, ".", 0)
#OPTION TWO
win32api.ShellExecute (0,"printto",pdf_file,'"%s"' % printer,".",0)
win32print.ClosePrinter(pHandle)
It just does not work. Printers do not report a "paper size mismatch", like they should when a non-A4 document is being sent to them. And when I try printing to a PDF printer, it also defaults to A4.
When calling
print(pDevModeObj.PaperSize)
print(pDevModeObj.PaperLength)
print(pDevModeObj.PaperWidth)
everything seems to be in order, so I'm guessing I don't know how to send those paper size values back to the printer settings.
Here is a list of all the resources I checked out (examples not all in python, and a few are not using the win32api), and couldn't get the thing to work properly:
Programmatically Print a PDF File - Specifying Printer
Python's win32api only printing to default printer
https://mail.python.org/pipermail/python-win32/2005-August/003683.html
https://learn.microsoft.com/en-us/troubleshoot/windows/win32/modify-printer-settings-setprinter-api
Print PDF file in duplex mode via Python
https://www.thinbug.com/q/39249360
Saving / Restoring Printer DevModes - wxPython / win32print
pywin32: how do I get a pyDEVMODE object?
https://learn.microsoft.com/en-us/troubleshoot/windows/win32/modify-printer-settings-documentproperties
How to change printer preference settings using python
Print file to continuous paper using win32print Python
python win32print can't set custom page size
http://timgolden.me.uk/pywin32-docs/PyDEVMODE.html
https://newcenturycomputers.net/projects/pythonicwindowsprinting.html
Printing a file and configure printer settings
Change printer default paper size
https://grokbase.com/t/python/python-win32/085x5hdbtd/how-to-change-paper-size-while-printing
openpyxl - set custom paper size for printing
Python win32print changing advanced printer options
Printing PDF files with Python
Python silent print PDF to specific printer
https://learn.microsoft.com/en-us/windows/win32/cimwin32prov/win32-printerconfiguration
Printing PDF's using Python,win32api, and Acrobat Reader 9
Python print pdf file with win32print
How to chose Paper Format when printing a PDF File with Python?
Access denied when attempting to remove printer
https://www.programcreek.com/python/example/24860/win32api.ShellExecute
https://opensource.gonnerman.org/?p=192
Python27 - on windows 10 how can i tell printing paper size is 50.8mm x 25.4mm?
https://mail.python.org/pipermail/python-win32/2008-May/007640.html
http://timgolden.me.uk/python/win32_how_do_i/print.html
ShellExecute uses the default printing parameters. If you need to print with the modified DEVMODE, you can use CreateDC.
Refer: GDI Print API
If you use SetPrinter to modify the default DEVMODE structure for a
printer (globally setting the printer defaults), you must first call
the DocumentProperties function to validate the DEVMODE structure.
Refer:
SetPrinter Remarks
Modify printer settings by using the SetPrinter function
You can also directly use DocumentProperties to modify printer initialization information.
Then pass pDevModeObj to CreateDC, and use StartDoc and StartPage to print.
Similar case: Change printer tray with pywin32

How to scrape a pdf file to make a DF

I need to build a database with several pieces of data. Most of that data is contained in PDF files. The PDF files all have the same layout and differ only in the data (for example, one of the files I have to work with: https://documentos.serviciocivil.cl/actas/dnsc/documentService/downloadWs?uuid=aecfeb7c-d494-4631-ade4-584d67ea120e)
I have been trying to extract the data with PyPDF, tabula, pdfminer (I even tried textract, but it didn't work through Anaconda) and other tools, but I didn't get what I want.
Then I tried converting those PDF files to txt files and mining those, but didn't get anything. I also tried regex but didn't understand how to use it, although the code doesn't show errors when running:
import re
import sys
recording = False
your_file = "D:\Magister\Tercer semestre\Tesis I\Txt\ResultadoConcurso1.txt"
start_pattern = 'apellidos:'
stop_pattern = '1.2'
output_section = []
for line in open(your_file).readlines():
    if recording is False:
        if re.search(start_pattern, line) is not None:
            recording = True
            output_section.append(line.strip())
    elif recording is True:
        if re.search(stop_pattern, line) is not None:
            recording = False
            sys.exit()
        output_section.append(line.strip())
print("".join(output_section))
As you can see in the link above, the PDF files have different sections. I need to get the info that's inside those sections. For example, one of the fields in my database is going to be "Nombre y apellido" (first name and last name). It's contained between "apellidos:" and "1.2".
What should I do? Can I work directly from the PDF format, or should I work with txt files? And then, what should I use to get the info? (Python 3.XX; Anaconda)
Thanks
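One possible way to pull the text between the two markers once a PDF has been converted to text is a single regular expression with a lazy match, which avoids the line-by-line state tracking. This is only a sketch: extract_between is a hypothetical helper, the sample string is invented for illustration, and the layout of the real files may differ. Two details worth noting: the markers are passed through re.escape so that the dot in "1.2" is matched literally, and nothing calls sys.exit() before the result is printed.

```python
import re


def extract_between(text, start_marker, stop_marker):
    """Return the text between the first start_marker and the next stop_marker.

    Both markers are treated as literal strings (re.escape) and the search
    is case-insensitive. Returns None when the markers are not found in order.
    """
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(stop_marker)
    match = re.search(pattern, text, flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces left over from the PDF layout.
    return " ".join(match.group(1).split())


# Invented sample text, standing in for one converted PDF.
sample = "1.1 Nombre y apellidos:\n  Juan\n  Pérez\n1.2 Cargo: ..."
print(extract_between(sample, "apellidos:", "1.2"))
```

The same function can be called once per section with a different pair of markers, and the results collected into one row per file.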

Xref table not zero-indexed. ID numbers for objects will be corrected. won't continue

I am trying to open a pdf to get the number of pages. I am using PyPDF2.
Here is my code:
def pdfPageReader(file_name):
    try:
        reader = PyPDF2.PdfReader(file_name, strict=True)
        number_of_pages = len(reader.pages)
        print(f"{file_name} = {number_of_pages}")
        return number_of_pages
    except:
        return "1"
But then I run into this error:
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
I tried both strict=True and strict=False. When it is True, it displays this message and then nothing; I waited for 30 minutes, but nothing happened. When it is False, it displays nothing at all and just hangs. If I press Ctrl+C in the terminal (cmd, Windows 10), it cancels opening that file and continues (I run this on a batch of PDF files). Only one file in the batch has this problem.
My questions are, how do I fix this, or how do I skip this, or how can I cancel this and move on with the other pdf files?
If anybody has a similar problem and it even crashes the program with this error message:
File "C:\Programy\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1604, in getObject
% (indirectReference.idnum, indirectReference.generation, idnum, generation))
PyPDF2.utils.PdfReadError: Expected object ID (14 0) does not match actual (13 0); xref table not zero-indexed.
What helped me was passing strict=False to my PDF reader:
pdf_reader = PdfReader(input_file, strict=False)
For anybody else who may be running into this problem, and found that strict=False didn't help, I was able to solve the problem by just re-saving a new copy of the file in Adobe Acrobat Reader. I just opened the PDF file inside an actual copy of Adobe Acrobat Reader (the plain ol' free version on Windows), did a "Save as...", and gave the file a new name. Then I ran my script again using the newly saved copy of my PDF file.
Apparently, the PDF file I was using, which was generated directly from my scanner, was somehow corrupt, even though I could open and view it just fine in Reader. Making a duplicate copy of the file by re-saving it in Acrobat Reader somehow seemed to correct whatever was missing.
I had the same problem and looked for a way to skip it. I am not a programmer but looking at the documentation about warnings there is a piece of code that helps you avoid such hindrance.
Although I wouldn't recommend this as a solution, the piece of code that I used for my purpose is (just copied and pasted from the documentation at the link):
import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
This happens to me when the file was created in a printer / scanner combo that generates PDFs. I could read in the PDF with only a warning though so I read it in, and then rewrote it as a new file. I could append that new one.
from PyPDF2 import PdfMerger, PdfReader, PdfWriter
reader = PdfReader("scanner_generated.pdf", strict=False)
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
with open("fixedPDF.pdf", "wb") as fp:
    writer.write(fp)
merger = PdfMerger()
merger.append("fixedPDF.pdf")
I had the exact same problem, and the solutions above (setting strict=False and re-saving the document using Acrobat Reader) helped but didn't solve the problem completely.
I still got a stream error, but I was able to fix it with an online PDF repair tool. I used sejda.com, but please be aware that you are uploading your PDF to a website, so make sure there is nothing sensitive in it.
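For the "how can I cancel this and move on" part of the question, none of the answers above covers a file that simply hangs. One option, sketched here under assumptions, is to count the pages in a separate Python process and give up after a timeout, so a single bad file cannot stall the whole batch. count_pages_with_timeout, PAGE_COUNT_SNIPPET and the 30-second default are hypothetical choices for illustration; only subprocess.run and its timeout argument come from the standard library.

```python
import subprocess
import sys

# One-line child program (an assumption for this sketch): it opens the file
# given as its first argument with PyPDF2 and prints the page count.
PAGE_COUNT_SNIPPET = (
    "import sys, PyPDF2; "
    "print(len(PyPDF2.PdfReader(sys.argv[1], strict=False).pages))"
)


def count_pages_with_timeout(file_name, timeout_seconds=30,
                             snippet=PAGE_COUNT_SNIPPET):
    """Return the page count, or None if the child fails or times out."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet, file_name],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return None
    if result.returncode != 0:
        return None  # the child raised, e.g. a PdfReadError
    try:
        return int(result.stdout.strip())
    except ValueError:
        return None
```

In a batch loop, a None result can be logged and the loop continued with the next file. Starting one process per file is slower than reading in-process, which is the price of being able to kill a reader that never returns.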

PyPDF2 returns only empty lines for some files

I am working on a script that "reads" PDF files and then automatically renames the files it recognizes from a dictionary. PyPDF2, however, only returns empty lines for some PDFs, while working fine for others. The code for reading the files:
import PyPDF2
# File name
file = 'sample.pdf'
# Open File
with open(file, "rb") as f:
    # Read in file
    pdfReader = PyPDF2.PdfFileReader(f)
    # Check number of pages
    number_of_pages = pdfReader.numPages
    print(number_of_pages)
    # Get first page
    pageObj = pdfReader.getPage(0)
    # Extract text from page 1
    text = pageObj.extractText()
    print(text)
It does get the number of pages correctly, so it is able to open the PDF.
If I replace the print(text) by repr(text) for files it doesn't read, I get something like:
"'\\n\\n\\n\\n\\n\\n\\n\\nn\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n'"
Weirdly enough, when I enhance (OCR) the files with Adobe, the script performs slightly worse. It recognized 140 out of 800 files before enhancing, and just 110 after.
The PDFs are machine readable/searchable, because I am able to copy/paste text to Notepad. I tested some files with "pdfminer" and it does show some text, but also throws a lot of errors. If possible I would like to keep working with PyPDF2.
Specifications of the software I am using:
Windows: 10.0.15063
Python: 3.6.1
PyPDF: 1.26.0
Adobe version: 17.009.20058
Anyone any suggestions? Your help is very much appreciated!
I had the same issue; I fixed it using another Python library called slate.
Fortunately, I found a fork that works in Python 3.6.5:
import slate3k as slate
# The file name must be a string; 'file.pdf' is a placeholder.
with open('file.pdf', 'rb') as f:
    extracted_text = slate.PDF(f)
print(extracted_text)
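Since PyPDF2 returns only newlines for some files while other libraries recover text from them, one option is to keep PyPDF2 as the first choice and fall back to another extractor only when the result is blank. A sketch of that idea, with the caveat that first_non_blank and the two stand-in extractors are hypothetical; real extractors would wrap the PyPDF2 and slate3k calls shown above.

```python
def first_non_blank(path, extractors):
    """Try each extractor in order and return the first non-blank result.

    extractors is a list of (name, callable) pairs; each callable takes a
    file path and returns the extracted text as a string. An extractor
    that raises is skipped. Returns (name, text), or (None, "") if every
    extractor failed or produced only whitespace.
    """
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            continue  # treat a crashing extractor like an empty result
        if text and text.strip():
            return name, text
    return None, ""


# Stand-in extractors for illustration only.
def blank_extractor(path):
    return "\n\n\n\n"


def working_extractor(path):
    return "Invoice 2017\n"


result = first_non_blank(
    "sample.pdf",
    [("blank", blank_extractor), ("working", working_extractor)],
)
print(result)
```

Recording which extractor succeeded for each file also shows how many of the 800 files actually need the fallback.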
