pdfparser from pdfminer: PDFException: PDFDocument is not initialized - python-3.x

I don't understand this error. I want to open a PDF and loop over its pages, but I get this exception and couldn't find much by googling it.
Here is the example that fails:
from pdfminer.pdfparser import PDFParser, PDFDocument
from os.path import basename, splitext
file = 'tmpfiles/tmpfile.pdf'
filename = splitext(basename(file))[0]
fp = open(file, 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
num_page = 0
text = ""
pages = doc.get_pages()
for p in pages:
    print("do whatever")
Here is the traceback
Traceback (most recent call last):
File "test.py", line 20, in <module>
for p in pages:
File "/home/.../anaconda3/lib/python3.6/site-packages/pdfminer/pdfparser.py", line 544, in get_pages
raise PDFException('PDFDocument is not initialized')
pdfminer.pdftypes.PDFException: PDFDocument is not initialized
I have python 3.6
Before doing this, I save the PDF file like this, because I have its contents as a base64-encoded string:
decoded = base64.b64decode(content_string)
with open(tmpfiles_path+'tmpfile.pdf', 'wb') as fout:
    fout.write(decoded)
Could it be that the file is being saved with some protection?

The problem was the version of pdfminer I was using. I installed pdfminer.six and changed the code as follows:
from pdfminer.pdfpage import PDFPage
file = 'tmpfiles/tmpfile.pdf'
fp = open(file, 'rb')
pages = PDFPage.get_pages(fp)
for p in pages:
    print("do whatever")
Now it works.
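The question also asks whether the save step could be at fault. A minimal sketch of one way to rule that out, using only the standard library: decode the base64 string and check that the result begins with the "%PDF-" marker that PDF files normally start with. The helper name decode_pdf and the sample payload are hypothetical, and the check is only a sanity test, not a full validation of the file.

```python
import base64
import binascii


def decode_pdf(content_string: str) -> bytes:
    """Decode a base64 string and check that the result looks like a PDF."""
    try:
        decoded = base64.b64decode(content_string, validate=True)
    except binascii.Error as exc:
        raise ValueError("content_string is not valid base64") from exc
    # PDF files normally start with the marker "%PDF-" followed by a version.
    if not decoded.startswith(b"%PDF-"):
        raise ValueError("decoded data does not start with a PDF header")
    return decoded


# Tiny stand-in payload for illustration only; not a complete PDF document.
sample = base64.b64encode(b"%PDF-1.4\n% hypothetical content\n").decode("ascii")
pdf_bytes = decode_pdf(sample)
print(pdf_bytes[:8])
```

If this check passes, the decoded bytes can be written with open(..., 'wb') as in the question, and the problem is more likely in how the file is parsed than in how it was saved.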

Related

Unable to take input from a text file in Python crawler

I have created a basic crawler in Python and I want it to take its input from a text file.
I used open/raw_input, but there was an error.
When I used the input("") function, it prompted for input and worked fine.
The problem occurs only when reading from a file.
import re
import urllib.request
url = open('input.txt', 'r')
data = urllib.request.urlopen(url).read()
data1 = data.decode("utf8")
print(data1)
file =open('output.txt' , 'w')
file.write(data1)
file.close()
The error output is below:
Traceback (most recent call last):
File "scrape.py", line 8, in <module>
data = urllib.request.urlopen(url).read()
File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.6/urllib/request.py", line 518, in open
protocol = req.type
AttributeError: '_io.TextIOWrapper' object has no attribute 'type'
The open() function returns a file object, not the content of the file as a string. If you want url to contain the content as a string, change the line to:
url = open('input.txt', 'r').read()
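One caveat with .read() is that the string keeps any trailing newline from the file. A minimal sketch of a slightly more defensive variant, assuming input.txt holds one URL per line: the helper read_urls and the example URLs are hypothetical, and a temporary file stands in for input.txt so the snippet runs on its own.

```python
import os
import tempfile


def read_urls(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstration with a temporary file standing in for input.txt.
with tempfile.TemporaryDirectory() as tmp:
    input_path = os.path.join(tmp, "input.txt")
    with open(input_path, "w", encoding="utf-8") as handle:
        handle.write("https://example.com/\n\nhttps://example.org/page\n")
    urls = read_urls(input_path)

print(urls)
```

Each entry is a plain str, so it can be passed to urllib.request.urlopen(url) in a loop.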

Editing PDF metadata fields with Python3 and pdfrw

I'm trying to edit the metadata Title field of PDFs, to include the ASCII equivalents when possible. I'm using Python3 and the module pdfrw.
How can I do string operations that replace the metadata fields?
My test code is here:
from pdfrw import PdfReader, PdfWriter, PdfString
import unicodedata
def edit_title_metadata(inpdf):
    trailer = PdfReader(inpdf)
    # this statement is breaking pdfrw
    trailer.Info.Title = unicode_normalize(trailer.Info.Title)
    # also have tried:
    #trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))
    PdfWriter("test.pdf", trailer=trailer).write()
    return
def unicode_normalize(s):
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
if __name__ == "__main__":
    edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')
And the traceback is:
Traceback (most recent call last):
File "get_metadata.py", line 68, in <module>
main()
File "get_metadata.py", line 54, in main
edit_title_metadata(pdf)
File "get_metadata.py", line 11, in edit_title_metadata
trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))
File "get_metadata.py", line 18, in unicode_normalize
return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
File "/path_to_python/python3.7/site-packages/pdfrw/objects/pdfstring.py", line 550, in encode
if isinstance(source, uni_type):
TypeError: isinstance() arg 2 must be a type or tuple of types
Notes:
This issue at GitHub may be related.
FWIW, I also get the same error with Python 3.6.
I've shared the PDF (which has non-ASCII hyphens, Unicode character \u2010):
wget https://gist.github.com/philshem/71507d4e8ecfabad252fbdf4d9f8bdd2/raw/cce346ab39dd6ecb3a718ad3f92c9f546761e87b/Anadon-2011-Scientific%2520Opinion%2520on%2520the%2520safety%2520e.pdf
You have to use the .decode() method on the metadata fields:
trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
And full working code:
from pdfrw import PdfReader, PdfWriter
import unicodedata
def edit_title_metadata(inpdf):
    trailer = PdfReader(inpdf)
    # decode() turns the PdfString into a regular Python string first
    trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
    PdfWriter("test.pdf", trailer=trailer).write()
    return
def unicode_normalize(s):
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
if __name__ == "__main__":
    edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')
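The normalization step can be tried on its own, without pdfrw. The sketch below is a standalone variant, not the code from the answer: it returns a str rather than bytes, and it adds an explicit mapping for the \u2010 hyphen mentioned in the question. The names PUNCTUATION_MAP and to_ascii are hypothetical.

```python
import unicodedata

# Explicit mapping so that U+2010 (HYPHEN) becomes "-" rather than depending
# on NFKD, which targets compatibility decompositions and may leave such
# punctuation to be dropped by the "ignore" step.
PUNCTUATION_MAP = str.maketrans({"\u2010": "-"})


def to_ascii(text: str) -> str:
    """Return an ASCII-only approximation of text as a str (not bytes)."""
    replaced = text.translate(PUNCTUATION_MAP)
    decomposed = unicodedata.normalize("NFKD", replaced)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Anadon\u20102011 caf\u00e9"))  # prints: Anadon-2011 cafe
```

Whether the hyphen should be replaced or dropped is a design choice; the mapping can be extended with other characters that matter for the titles at hand.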

How to find the exact line of error from python 3.6.5?

from lxml import etree
import os
import copy
import xml.etree.ElementTree as ET
XMLDoc = etree.parse(open('aa.xml'))
XSLDoc = etree.parse(open('aa.xsl'))
try:
    transform = etree.XSLT(XSLDoc)
except:
    for error in etree.XSLT.error_log:
        print(error.message, error.line)
v = '/person/name'
for Node in XMLDoc.xpath(v):
    m2 = copy.deepcopy(Node)
    m3 = etree.tostring(m2, method="xml", xml_declaration=True, encoding="utf-8", with_tail=False)
    m3 = m3.decode("utf-8")
    dc = open('pq.xml', 'w')
    dc.write(str(m3))
    dc.close()
    xm = etree.parse(open('pq.xml'))
    q = transform(xm)
    print(q)
I am using lxml to transform our XML into another XML document through XSLT, but I get a parsing error in our XSLT.
Traceback (most recent call last):
File "C:\Anil\PTest\08\qqq.py", line 13, in <module>
for error in etree.XSLT(XSLDoc).error_log:
File "src\lxml\xslt.pxi", line 410, in lxml.etree.XSLT.__init__
lxml.etree.XSLTParseError: Failed to compile predicate
Please suggest how to find the exact problem in our XSLT.
XSLT should have a log attached to it, so you can do something like the following:
try:
    # Your Code
except:
    for error in YourXSLTObject.error_log:
        print(error.message, error.line)
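When the failure happens while the stylesheet is being compiled, as in the traceback in the question, no XSLT object exists yet to read a log from. A minimal sketch of an alternative, assuming lxml is installed: catch etree.XSLTParseError and read the error_log attached to the exception. The broken stylesheet and the helper compile_stylesheet are made up for illustration, and how precise the reported line numbers are depends on what libxslt records.

```python
from lxml import etree

# Deliberately broken stylesheet for illustration: the predicate in the
# select expression is never closed, so the XPath cannot be compiled.
BROKEN_XSLT = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:value-of select="person[name"/>
  </xsl:template>
</xsl:stylesheet>
"""


def compile_stylesheet(xslt_bytes):
    """Return (transform, problems); transform is None if compilation fails."""
    xslt_doc = etree.fromstring(xslt_bytes)
    try:
        return etree.XSLT(xslt_doc), []
    except etree.XSLTParseError as exc:
        # The exception carries the log entries collected during compilation.
        return None, [(entry.line, entry.message) for entry in exc.error_log]


transform, problems = compile_stylesheet(BROKEN_XSLT)
for line, message in problems:
    print(f"line {line}: {message}")
```

Catching the specific exception instead of a bare except also keeps unrelated errors, such as a missing file, from being silently swallowed.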

Os walk directory return string literal for unzipping

The problem occurs when the string variable is read by zipfile.ZipFile(variable). The traceback says the file does not exist, because an extra \ is being added to each folder. I tried several things to make the string into a literal, but when I do this it prints an extra \\\. I even tried making that a literal, but as I would expect, it repeated the same thing, returning \\\\\. Any insight would be greatly appreciated!
The code is below, and here are the tracebacks:
Traceback (most recent call last):
File "C:\Users\Yume\Desktop\Walk a Dir Unzip and copy to.py", line 17, in <module>
z = zipfile.ZipFile(zf, 'r')
File "C:\Users\Yume\AppData\Local\Programs\Python\Python36-32\lib\zipfile.py", line 1090, in __init__
self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: "C:\\Users\\Yume\\AppData\\Roaming\\
Traceback (most recent call last):
File "C:\Users\Yume\Desktop\Walk a Dir Unzip and copy to.py", line 17, in <module>
z = zipfile.ZipFile(rzf, 'r')
File "C:\Users\Yume\AppData\Local\Programs\Python\Python36-32\lib\zipfile.py", line 1090, in __init__
self.fp = io.open(file, filemode)
OSError: [Errno 22] Invalid argument: '"C:\\\\Users\\\\Yume\\\\AppData\\\\Roaming\\\\
z = zipfile.ZipFile(rrzf, 'r')
File "C:\Users\Yume\AppData\Local\Programs\Python\Python36-32\lib\zipfile.py", line 1090, in __init__
self.fp = io.open(file, filemode)
OSError: [Errno 22] Invalid argument: '\'"C:\\\\\\\\Users\\\\\\\\Yume\\\\\\\\AppData\\\\\\\\Roaming\\\\\\\
CODE:
import os
import re
from shutil import copyfile
import zipfile
import rarfile
isRar = re.compile('.+\\.rar')
isZip = re.compile('.+\\.zip')
for folderName, subfolders, filenames in os.walk(r'C:\Users\Yume\AppData'):
    if isZip.match(str(filenames)):
        zf = folderName+ str(subfolders)+str(filenames)
        rzf = '%r'%zf
        z = zipfile.ZipFile(zf)
        z.extractall(r'C:\Users\Yume')
        z.close()
    if isRar.match(str(filenames)):
        rf = folderName+ str(subfolders)+str(filenames)
        rrf = "%r"%rf
        r = rarfile.RarFile(rf)
        r.extractall(r'C:\Users\Yume')
        r.close()
My problem was fixed by running a for loop over the file list produced by os.walk. I cleaned up my code, and here is a working program that walks a directory, finds each compressed .zip or .rar file, and extracts it to the specified destination.
import os
import re
import zipfile
import rarfile
import unrar
isRar = re.compile('.*\\.rar')
isZip = re.compile('.*\\.zip')
# A raw string cannot end with a backslash, so the root has no trailing \
for dirpath, dirnames, filenames in os.walk(r'C:\Users'):
    for file in filenames:
        if isZip.match(file):
            # join the directory and the file name into a full path
            zf = os.path.join(dirpath, file)
            z = zipfile.ZipFile(zf)
            z.extractall(r'C:\Users\Unzipped')
            z.close()
        if isRar.match(file):
            rf = os.path.join(dirpath, file)
            r = rarfile.RarFile(rf)
            r.extractall(r'C:\Users\Unzipped')
            r.close()
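The same walk-and-extract pattern can be sketched with the standard library alone, which makes it easy to test in a scratch directory before pointing it at real folders. This version handles only .zip files (rarfile is a third-party package), matches on the file extension instead of a regular expression, and uses a hypothetical helper extract_zips with made-up file names.

```python
import os
import tempfile
import zipfile


def extract_zips(root, destination):
    """Extract every .zip found under root into destination; return their paths."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".zip"):
                # os.path.join builds the full path from directory and file name.
                archive_path = os.path.join(dirpath, name)
                with zipfile.ZipFile(archive_path) as archive:
                    archive.extractall(destination)
                found.append(archive_path)
    return found


# Demonstration in temporary directories (hypothetical file names).
with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as dest:
    nested = os.path.join(root, "nested")
    os.makedirs(nested)
    with zipfile.ZipFile(os.path.join(nested, "sample.zip"), "w") as archive:
        archive.writestr("hello.txt", "hello")
    extracted_from = extract_zips(root, dest)
    extracted_names = sorted(os.listdir(dest))

print(extracted_names)  # prints: ['hello.txt']
```

Keeping the destination outside the walked tree avoids walking into files that were just extracted.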

UnicodeDecodeError Python 3.5.1 Email Script

I am attempting to send an email with an attachment to an SMS gateway address. However, I am currently getting UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 60.
I'm not sure how to go about fixing this and would be interested in your advice. Below are my code and the full error.
import smtplib, os
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
msg = MIMEMultipart()
msg['Subject'] = 'Cuteness'
msg['From'] = 'sample#outlook.com'
msg['To'] = '111111111#messaging.sprintpcs.com'
msg.preamble = "Would you pet me? I'd Pet me so hard..."
here = os.getcwd()
file = open('cutecat.png')#the png shares directory with actual script
for here in file: #Error appears to be in this section
    with open(file, 'rb') as fp:
        img = MIMEImage(fp.read())
    msg.attach(img)
s = smtplib.SMTP('Localhost')
s.send_message(msg)
s.quit()
""" Traceback (most recent call last):
File "C:\Users\Thomas\Desktop\Test_Box\msgr.py", line 16, in <module>
for here in file:
File "C:\Users\Thomas\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 60: character maps to <undefined>"""
You're trying to open the file twice. First you have:
file = open('cutecat.png')
The default mode to open files is to read them in text mode. That is generally not what you want to do with a binary file like a PNG file.
And then you do:
for here in file:
    with open(file, 'rb') as fp:
        img = MIMEImage(fp.read())
    msg.attach(img)
You get an exception in the first line because Python is trying to decode the contents of a binary file as text and fails. The chances of this happening are quite high. It is unlikely that a binary file is also a valid text file in your standard encoding.
But even if that had worked, you would then try to open the file again for every line in it, which makes no sense.
Were you just copy/pasting from the examples, especially the third one? You should note that this example is incomplete. The variable pngfiles used in that example (and which should be a sequence of file names) is not defined.
Try this instead:
with open('cutecat.png', 'rb') as fp:
    img = MIMEImage(fp.read())
msg.attach(img)
Or if you want to include multiple files:
pngfiles = ('cutecat.png', 'kitten.png')
for filename in pngfiles:
    with open(filename, 'rb') as fp:
        img = MIMEImage(fp.read())
    msg.attach(img)
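To check the attachment logic without a mail server or real image files, the message can be built from bytes held in memory and inspected before sending. This is a minimal sketch under stated assumptions: build_message is a hypothetical helper, the byte string is a placeholder rather than a valid PNG, and the subtype is passed explicitly so that no type detection is needed.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart


def build_message(images):
    """Build a multipart message from a mapping of file name -> PNG bytes."""
    msg = MIMEMultipart()
    msg["Subject"] = "Cuteness"
    for filename, data in images.items():
        # Passing the subtype explicitly avoids relying on type detection.
        img = MIMEImage(data, _subtype="png")
        img.add_header("Content-Disposition", "attachment", filename=filename)
        msg.attach(img)
    return msg


# Placeholder bytes for illustration; in practice read them with open(path, 'rb').
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
message = build_message({"cutecat.png": fake_png, "kitten.png": fake_png})
print([part.get_content_type() for part in message.get_payload()])
```

Once the parts look right, the message can be handed to smtplib's send_message as in the question.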

Resources