PyPDF2 encoding issues - python-3.x

I'm having trouble identifying why the extracted text doesn't match the PDF's actual content, and I'd like to know whether there are any tricks to fix this, as it's not an isolated issue.
import PyPDF2

with open(file, 'rb') as f:
    binary = PyPDF2.pdf.PdfFileReader(f)
    text = binary.getPage(x).extractText()  # x is the page index
    print(text)
file: "I/O filters, 292–293"
output: "I/O Þlters, 292Ð293"
The Ð seems to stand in for every dash (the en dash in "292–293") and Þ for every "fi".
I am using Windows CMD for output while testing, and I know some characters don't render correctly there, but that still leaves me baffled by something like the 'fi'.

PyPDF2's text extraction was massively improved in the 2.x releases, and the whole project has since moved to pypdf.
I recommend you give it another try: https://pypdf.readthedocs.io/en/latest/user/extract-text.html
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())
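Incidentally, the Þ/Ð output in the question is consistent with Mac OS Roman text being displayed as Latin-1: in MacRoman, byte 0xDE is the 'ﬁ' ligature and 0xD0 is the en dash, while Latin-1 maps those same bytes to Þ and Ð. If that is what happened, a re-decode is a possible stopgap for old PyPDF2 output; this is a hedged sketch, assuming the garbling really is MacRoman-shown-as-Latin-1:
# Sketch: repair MacRoman text that was mis-displayed as Latin-1.
garbled = "I/O Þlters, 292Ð293"
repaired = garbled.encode("latin-1").decode("mac_roman")
print(repaired)  # I/O ﬁlters, 292–293 ('ﬁ' is the single ligature glyph)
# Optionally normalize ligatures: repaired.replace("ﬁ", "fi")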

Related

Unknown encoding of files in a resulting Beautiful Soup txt file

I downloaded 13,000 files (10-K reports from different companies) and I need to extract a specific part of them (Section 1A, Risk Factors). The problem is that I can open these files in Word easily and they look perfect, while in a plain text editor each document appears to be HTML with tons of encoded text at the end (EDIT: I suspect this is due to the XBRL format of these files). The same happens as a result of using BeautifulSoup.
I've tried an online decoder, because I thought this might be Base64 encoding, but it seems that none of the known encodings helps. I saw that at the beginning of some files there is something like "created with Certent Disclosure Management 6.31.0.1" and other program names, so I thought maybe that causes the encoding. Nevertheless, Word is able to open these files, so I guess there must be a known key to it. This is a sample of the encoded data:
M1G2RBE#MN)T='1,SC4,]%$$Q71T3<XU#[AHMB9#*E1=E_U5CKG&(77/*(LY9
ME$N9MY/U9DC,- ZY:4Z0EWF95RMQY#J!ZIB8:9RWF;\"S+1%Z*;VZPV#(MO
MUCHFYAJ'V#6O8*[R9L<VI8[I8KYQB7WSC#DMFGR[E6+;7=2R)N)1Q\24XQ(K
MYQDS$>UJ65%MV4+(KBRHJ3HFIAR76#G/F$%=*9FOU*DM-6TSTC$Q\[C$YC$/
And here is a sample file from the 13,000 that I downloaded.
Below is the BeautifulSoup code that I use to extract the text. It does its job, but I need to find a clue to this encoded string and somehow decode it in the Python code below.
from bs4 import BeautifulSoup

with open("98752-TOROTEL INC-10-K-2019-07-23", "r") as f:
    contents = f.read()
soup = BeautifulSoup(contents, 'html.parser')
print(soup.getText())

with open("extracted_test.txt", "w", encoding="utf-8") as f:
    f.write(soup.getText())  # the with block closes the file; no f.close() needed
What I want to achieve is to decode this string at the end of the file.
OK, this is going to be somewhat messy, but it will get you close to what you are looking for without using regex (which is notoriously problematic with HTML). The fundamental problem you'll face is that EDGAR filings are VERY inconsistent in their formatting, so what works for one 10-Q (or 10-K or 8-K) filing may not work for a similar filing (even from the same filer...). For example, the word 'item' may appear in lower, upper, or mixed case, hence the use of the string.lower() method, etc. So there's going to be some cleanup under all circumstances.
Having said that, the code below should get you the RISK FACTORS sections from both filings (including the one which has none):
import requests
from bs4 import BeautifulSoup as bs

url = [one of these two]
response = requests.get(url)
soup = bs(response.content, 'html.parser')

risks = soup.find_all('a')
for risk in risks:
    if 'item' in str(risk.attrs).lower() and '1a' in str(risk.attrs).lower():
        for i in risk.findAllNext():
            if 'item' in str(i.attrs).lower():
                break
            else:
                print(i.text.strip())
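As for the M-prefixed block at the end of the file: it looks like a uuencoded binary attachment (EDGAR filings embed images and other binaries this way), though I can't verify that from the snippet alone. If so, something along these lines should recover it; the begin/end markers and the output file name are assumptions about what the filing contains:
import binascii

# Sketch: decode a uuencoded attachment embedded in an EDGAR filing.
decoded = bytearray()
inside = False
with open("98752-TOROTEL INC-10-K-2019-07-23") as f:
    for line in f:
        if line.startswith("begin "):      # e.g. "begin 644 image.jpg"
            inside = True
        elif line.strip() == "end":
            break
        elif inside and line.strip():
            decoded.extend(binascii.a2b_uu(line))

with open("attachment.bin", "wb") as out:
    out.write(decoded)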
Good luck with your project!

’, Â, � etc... How to fix strange encoding characters in python

I tried to retrieve data from Google+ using the API. When I wrote the data into a CSV file, I observed weird, garbled characters like 😀😄😚😉😠’.
After googling, I concluded this is an encoding issue.
To write the retrieved data to a file, I used the following code:
import csv

file = open('filename', 'a', encoding='utf-8')  # the original was missing the closing quote
writer = csv.writer(file)
writer.writerow(values)  # values = the rows retrieved from the API
To check my terminal encoding, I used
import sys
sys.getdefaultencoding()
Output is: utf-8
I don't know where the problem is.
Your minimal, reproducible example appears too minimal to be complete and verifiable. In any case, it looks like double mojibake:
value = "‘😀😄😚😉😠’" ### gathered from the question
print(value.encode('cp1252','backslashreplace').decode('utf-8','backslashreplace'))
‘😀😄😚😉😠’
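To see how this class of mojibake arises and why the cp1252/UTF-8 round trip reverses it, here is a minimal self-contained sketch (the sample string is mine, not the asker's data):
# Simulate the bug (UTF-8 bytes wrongly decoded as cp1252), then undo it.
original = "‘😀😄😚😉😠’"
mojibake = original.encode("utf-8").decode("cp1252")
print(mojibake)                # â€˜ðŸ˜€...-style garbage
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == original)    # True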

Vader Sentiment with multiple PDFs

I recently merged 20 PDFs into 1 PDF via Adobe. I imported the PDF into Python with this code.
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file = open('/Users/cj/Desktop/PEI.pdf', 'rb')
newfile = open('rjtjj.txt', 'w')

pdf_reader = PdfFileReader(pdf_file)
pdf_writer = PdfFileWriter()
print(pdf_reader.numPages)

n = pdf_reader.getNumPages()
for i in range(n):  # note: range(0, n-1) would skip the last page
    # pdf_writer.addPage(pdf_reader.getPage(i))
    gft = pdf_reader.getPage(i)
    newfile.write(gft.extractText())

pdf_file.close()
newfile.close()
I'm trying to use VaderSentiment to analyse the PDF. What I want to do is analyse individually the 20 PDFs that were merged into 1.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
with open('rjtjj.txt', 'r') as f:
    for line in f.read().split("\n"):
        vs = analyzer.polarity_scores(line)
I know my code is wrong, because it only gives me the first line of the entire PDF. I am new to this; I would really appreciate your help.
Thank you
Your problem really isn't about Vader sentiment analysis -- it is about correct extraction of text from a PDF.
PostScript's Forth-like interpreter is Turing-complete, so some PDF documents are "hard" to parse. You didn't post your PDF, so we can only guess at the issue. You might try poppler's pdftotext command-line utility instead. Ubuntu calls the package "poppler-utils"; on a Mac you would use brew install poppler. Running the file through pdf2ps and ps2ascii will sometimes offer different, and helpful, results.
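If poppler is installed, calling pdftotext from Python is straightforward; a short sketch (the input path is the question's, the -layout flag is optional):
import subprocess

# Run poppler's pdftotext on the merged file; writes a .txt next to it.
subprocess.run(
    ["pdftotext", "-layout", "/Users/cj/Desktop/PEI.pdf", "/Users/cj/Desktop/PEI.txt"],
    check=True,
)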
If you continue to find it difficult to retrieve proper text from the PDF, you may want to contact whoever produced it and arrange to have the same information supplied in a different format.
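On the "analyse the 20 PDFs individually" part of the question: the merged file no longer records where each source document began, so you will have to regroup pages yourself. A per-page sketch using the same PyPDF2 API as the question (the grouping into 20 documents is left as an assumption to fill in):
from PyPDF2 import PdfFileReader
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
with open('/Users/cj/Desktop/PEI.pdf', 'rb') as f:
    reader = PdfFileReader(f)
    # Score each page separately; regroup pages per source document as needed.
    for i in range(reader.getNumPages()):
        text = reader.getPage(i).extractText()
        print(i, analyzer.polarity_scores(text))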

How to compare two .ico (icon) files in Python?

Although I am new to Python, I just want to compare two .ico files.
Can anyone with the expertise tell me how I can do that?
Is there any package or library readily available in Python to do so?
Thanks for reading the question. Your suggestions will be appreciated.
What I am currently doing is as follows, but it is not giving me what I expect:
import cv2
import numpy as np

# Caveat: ICO is not among OpenCV's documented imread formats, so these
# calls may return None. Both images must also have identical dimensions
# for cv2.subtract to work.
Original = cv2.imread("1.ico")
Edited = cv2.imread("chrome.ico")
diff = cv2.subtract(Original, Edited)
cv2.imwrite("diff.jpg", diff)
If you just want to check whether the files differ, you can use Python's hashlib. The code below computes a file's hash:
import hashlib

h = hashlib.md5()
with open('ico_file.ico', 'rb') as f:
    buffer = f.read()
    h.update(buffer)
print(buffer)  # may not be needed
print(h.hexdigest())
Run the above code for each of the two files you want to compare and then match their output hashes. If the hashes are the same, the files are very likely the same; if they differ, the files are definitely different.
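Wrapped up as a small helper (the file names are the ones from the question; the rest is a sketch):
import hashlib

def file_digest(path):
    # Return the MD5 hex digest of a file's raw bytes.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        h.update(f.read())
    return h.hexdigest()

print("identical" if file_digest("1.ico") == file_digest("chrome.ico") else "different")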

PyPDF2 returns only empty lines for some files

I am working on a script that "reads" PDF files and then automatically renames the files it recognizes from a dictionary. PyPDF2, however, only returns empty lines for some PDFs, while working fine for others. The code for reading the files:
import PyPDF2
# File name
file = 'sample.pdf'
# Open File
with open(file, "rb") as f:
# Read in file
pdfReader = PyPDF2.PdfFileReader(f)
# Check number of pages
number_of_pages = pdfReader.numPages
print(number_of_pages)
# Get first page
pageObj = pdfReader.getPage(0)
# Extract text from page 1
text = pageObj.extractText()
print(text)
It does get the number of pages correctly, so it is able to open the PDF.
If I replace print(text) with repr(text) for the files it can't read, I get something like:
"'\\n\\n\\n\\n\\n\\n\\n\\nn\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n'"
Weirdly enough, when I enhance (OCR) the files with Adobe, the script performs slightly worse: it recognized 140 out of 800 files before enhancing, and just 110 after.
The PDFs are machine-readable/searchable, because I am able to copy/paste text into Notepad. I tested some files with pdfminer and it does show some text, but it also throws a lot of errors. If possible, I'd like to keep working with PyPDF2.
Specifications of the software I am using:
Windows: 10.0.15063
Python: 3.6.1
PyPDF: 1.26.0
Adobe version: 17.009.20058
Does anyone have any suggestions? Your help is very much appreciated!
I had the same issue; I fixed it using another Python library called slate.
Fortunately, I found a fork, slate3k, that works in Python 3.6.5:
import slate3k as slate

with open('file.pdf', 'rb') as f:  # the original open(file.pdf, 'rb') lacked quotes
    extracted_text = slate.PDF(f)
print(extracted_text)
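If I recall slate's API correctly, slate.PDF(f) behaves like a list with one string per page, so "\n".join(extracted_text) should give you the whole document as one string; treat that as an assumption and check the slate3k docs.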
