I'm trying to edit the metadata Title field of PDFs so that non-ASCII characters are replaced with their ASCII equivalents when possible. I'm using Python 3 and the pdfrw module.
How can I do string operations that replace the metadata fields?
My test code is here:
from pdfrw import PdfReader, PdfWriter, PdfString
import unicodedata
def edit_title_metadata(inpdf):
    trailer = PdfReader(inpdf)
    # this statement is breaking pdfrw
    trailer.Info.Title = unicode_normalize(trailer.Info.Title)
    # also have tried:
    #trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))
    PdfWriter("test.pdf", trailer=trailer).write()
    return

def unicode_normalize(s):
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

if __name__ == "__main__":
    edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')
And the traceback is:
Traceback (most recent call last):
File "get_metadata.py", line 68, in <module>
main()
File "get_metadata.py", line 54, in main
edit_title_metadata(pdf)
File "get_metadata.py", line 11, in edit_title_metadata
trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))
File "get_metadata.py", line 18, in unicode_normalize
return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
File "/path_to_python/python3.7/site-packages/pdfrw/objects/pdfstring.py", line 550, in encode
if isinstance(source, uni_type):
TypeError: isinstance() arg 2 must be a type or tuple of types
Notes:
This issue at GitHub may be related.
FWIW, I'm also getting the same error with Python 3.6.
I've shared the PDF (which has non-ASCII hyphens, Unicode character \u2010):
wget https://gist.github.com/philshem/71507d4e8ecfabad252fbdf4d9f8bdd2/raw/cce346ab39dd6ecb3a718ad3f92c9f546761e87b/Anadon-2011-Scientific%2520Opinion%2520on%2520the%2520safety%2520e.pdf
You have to use the .decode() method on the metadata fields:
trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
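To see why the .decode() call matters, it can help to inspect the type of the field before and after decoding; a quick sketch using the same trailer object as above:

trailer = PdfReader(inpdf)
print(type(trailer.Info.Title))           # pdfrw's PdfString wrapper around the raw PDF string
print(type(trailer.Info.Title.decode()))  # the decoded value that unicode_normalize can work with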
And full working code:
from pdfrw import PdfReader, PdfWriter
import unicodedata

def edit_title_metadata(inpdf):
    trailer = PdfReader(inpdf)
    trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
    PdfWriter("test.pdf", trailer=trailer).write()
    return

def unicode_normalize(s):
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

if __name__ == "__main__":
    edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')
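A quick way to verify the result (a sketch, assuming test.pdf was written to the current directory) is to read the output file back and print its Title:

from pdfrw import PdfReader
print(PdfReader("test.pdf").Info.Title)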
I'm trying to create a text file with a tree of all files / dirs from a place that I choose using os.chdir(). My approach is to print the tree and to save all prints to the text file. The problem is that it doesn't copy the printed tree and the file is blank.
What am I doing wrong?
And is there a way to write this kind of data to the file without to actually print it?
My code:
import os
import sys
f = open("tree.txt", "w")
os.chdir("c:\\Users\Daniel\Desktop")
sys.stdout = f
os.system("tree /f")
f.close()
Edit
I was able to get the file tree from the clipboard after executing the command, but it gives me an error when it tries to write it to the txt file.
code:
import os
import tkinter

with open("tree.txt", "w") as f:
    os.system("tree /f |clip")
    root = tkinter.Tk()
    tree = root.clipboard_get()
    print(tree)
    f.write(tree)
Error:
Traceback (most recent call last):
File "c:\Users\Daniel\Desktop\Tick\code_test\files.py", line 9, in <module>
f.write(tree)
File "C:\Users\Daniel\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2502' in position 80: character maps to <undefined>
Solution
So I found the problem: I needed to use the codecs module to be able to write Unicode to the text file. Now it works very well.
code:
import os
import tkinter
import codecs

with codecs.open("tree.txt", "w", "utf8") as f:
    os.chdir("c:\\Users")
    os.system("tree /f |clip")
    root = tkinter.Tk()
    tree = root.clipboard_get()
    f.write(tree)
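For what it's worth, on Python 3 the built-in open also accepts an encoding argument, so codecs.open is not strictly required; an equivalent sketch of the same solution:

import os
import tkinter

with open("tree.txt", "w", encoding="utf8") as f:
    os.chdir("c:\\Users")
    os.system("tree /f |clip")
    tree = tkinter.Tk().clipboard_get()
    f.write(tree)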
The check_output method from the subprocess module can help you capture the program's output:
import subprocess
f = open("tree.txt", "wb")
tree_output = subprocess.check_output('tree /f', shell=True, cwd=r'c:\Users\Daniel\Desktop')
f.write(tree_output)
f.close()
Or with a context manager:
import subprocess

with open("tree.txt", "wb") as f:
    f.write(subprocess.check_output('tree /f', shell=True, cwd=r'c:\Users\Daniel\Desktop'))
The wb mode is required because check_output returns bytes, not str. If you want to process the output as a string, call tree_output.decode() first.
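For example, a decoded, text-mode variant of the same approach might look like this (the cp437 code page is an assumption; the actual console code page used by tree can differ):

import subprocess

tree_output = subprocess.check_output('tree /f', shell=True, cwd=r'c:\Users\Daniel\Desktop')
with open("tree.txt", "w", encoding="utf-8") as f:
    f.write(tree_output.decode("cp437", errors="replace"))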
I've been trying to translate Python 2.7 code to Python 3. I believe everything above checkpoint 1 should be correct, but I'm getting an error that I associate with the second half. I can always download the file I need straight from the link, but I'd like to know what's breaking here.
import urllib
from urllib.request import urlopen
import tarfile
import os
path = 'https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz'
url = urlopen(path)
#checkpoint 1
os.chdir('..')
tfile = tarfile.open(url, "r:gz")
tfile.extractall(".")
Error:
Traceback (most recent call last):
File "startup.py", line 43, in <module>
tfile = tarfile.open(url, "r:gz")
File "/anaconda3/lib/python3.6/tarfile.py", line 1589, in open
return func(name, filemode, fileobj, **kwargs)
File "/anaconda3/lib/python3.6/tarfile.py", line 1636, in gzopen
fileobj = gzip.GzipFile(name, mode + "b", compresslevel, fileobj)
File "/anaconda3/lib/python3.6/gzip.py", line 163, in __init__
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
TypeError: expected str, bytes or os.PathLike object, not HTTPResponse
When confronted with an error like this, closely look at the traceback, and read the documentation for the functions and objects involved.
urllib.request.urlopen returns an HTTPResponse object.
If you look at the error message, you see that tarfile.open expects a str, bytes or os.PathLike object for the parameter name.
However, tarfile.open supports using a file object as a third argument fileobj, and HTTPResponse implements the io.BufferedIOBase interface. The classes in io are basically the file objects that the open function returns.
So you should be able to do this:
tfile = tarfile.open(None, "r:gz", url)
or
tarfile.open(fileobj=url, mode="r:gz")
The latter could be considered more Pythonic ("explicit is better than implicit").
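If streaming straight from the HTTP response gives you trouble, a minimal end-to-end sketch that downloads to a local file first (filename taken from the URL in the question) would be:

import tarfile
from urllib.request import urlretrieve

urlretrieve('https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz',
            'enron_mail_20150507.tar.gz')
with tarfile.open('enron_mail_20150507.tar.gz', 'r:gz') as tfile:
    tfile.extractall('.')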
os.chdir('..')
tfile = tarfile.open("enron_mail_20150507.tar.gz", "r:gz")
Instead of doing the above two steps, can you just pass the fully qualified file name as the parameter to tarfile.open, just to rule out the possibility that the path is incorrect?
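In other words, something along these lines (the path below is only a placeholder for wherever the archive actually lives), just to confirm the file can be found before involving urlopen:

import tarfile

# Fully qualified path instead of os.chdir('..') plus a relative name
tfile = tarfile.open(r"C:\data\enron_mail_20150507.tar.gz", "r:gz")
tfile.extractall(".")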
While parsing XML using lxml, I get the error "reading file objects must return bytes objects". Here's the code:
from lxml import etree
from io import StringIO

def parseXML(xmlFile):
    """
    parse the xml
    """
    data = open(xmlFile)
    xml = data.read()
    data.close()

    tree = etree.parse(StringIO(xml))
    context = etree.iterparse(StringIO(xml))
    for action, elem in context:
        if not elem.text:
            text = "None"
        else:
            text = elem.text
        print(elem.tag + "=>" + text)

if __name__ == "__main__":
    parseXML("C:\\Users\\karthik\\Desktop\\xml_path\\bgm.xml")
BGM XML:
<?xml version="1.0" ?>
<zAppointments reminder="15">
    <appointment>
        <begin>1181251680</begin>
        <uid>040000008200E000</uid>
        <alarmTime>1181572063</alarmTime>
        <state></state>
        <location></location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
    <appointment>
        <begin>1234360800</begin>
        <duration>1800</duration>
        <subject>Check MS Office website for updates</subject>
        <location></location>
        <uid>604f4792-eb89-478b-a14f-dd34d3cc6c21-1234360800</uid>
        <state>dismissed</state>
    </appointment>
</zAppointments>
Error:
Traceback (most recent call last):
File "C:/Users/karthik/source/ChartAttributes/crecords", line 34, in <module>
parseXML("C:\\Users\\karthik\\Desktop\\xml_path\\bgm.xml")
File "C:/Users/karthik/source/ChartAttributes/crecords", line 26, in parseXML
for action, elem in context:
File "src\lxml\iterparse.pxi", line 208, in lxml.etree.iterparse.__next__ (src\lxml\lxml.etree.c:150010)
File "src\lxml\iterparse.pxi", line 193, in lxml.etree.iterparse.__next__ (src\lxml\lxml.etree.c:149708)
File "src\lxml\iterparse.pxi", line 221, in lxml.etree.iterparse._read_more_events (src\lxml\lxml.etree.c:150208)
TypeError: reading file objects must return bytes objects
Process finished with exit code 1
I think you need the XML as a byte array rather than a character string.
Open the file in binary mode to get a bytes object:
data=open(xmlFile, 'rb')
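With the file read in binary mode, the in-memory approach from the question would then use BytesIO rather than StringIO; a minimal sketch:

from io import BytesIO
from lxml import etree

def parseXML(xmlFile):
    # read the file as bytes so iterparse receives bytes objects
    with open(xmlFile, 'rb') as data:
        xml = data.read()
    for action, elem in etree.iterparse(BytesIO(xml)):
        text = elem.text or "None"
        print(elem.tag + "=>" + text)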
But it's probably just easier to pass the filename to LXML and let it take care of opening and reading the file:
from lxml import etree

def parseXML(xmlFile):
    for action, elem in etree.iterparse(xmlFile):
        text = elem.text or "None"
        print(elem.tag + "=>" + text)
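Usage stays the same as in the question, for example:

if __name__ == "__main__":
    parseXML(r"C:\Users\karthik\Desktop\xml_path\bgm.xml")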
I have a set of pickled text documents which I would like to stem using nltk's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside of a django app view.
However, when stemming the documents inside the django view, I receive an IndexError: string index out of range exception from PorterStemmer().stem() for the string 'oed'. As a result, running the following:
# xkcd_project/search/views.py
from django.shortcuts import render
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')
    return render(request, 'list.html')
raises the mentioned error:
Traceback (most recent call last):
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner
response = get_response(request)
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
response = self.process_exception_by_middleware(e, request)
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results
s.stem('oed')
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem
stem = self._step1b(stem)
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b
lambda stem: (self._measure(stem) == 1 and
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
if suffix == '*d' and self._ends_double_consonant(word):
File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
word[-1] == word[-2] and
IndexError: string index out of range
Now what is really odd is that running the same stemmer on the same string outside Django (be it a separate Python file or an interactive Python console) produces no error. In other words:
# test.py
from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')
followed by:
python test.py
# successfully prints 'o'
What is causing this issue?
This is an NLTK bug specific to NLTK version 3.2.2, for which I am to blame. It was introduced by PR https://github.com/nltk/nltk/pull/1261 which rewrote the Porter stemmer.
I wrote a fix which went out in NLTK 3.2.3. If you're on version 3.2.2 and want the fix, just upgrade, e.g. by running:
pip install -U nltk
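To confirm which version you actually have installed (and whether the upgrade took effect), something like this works:

import nltk
print(nltk.__version__)  # the bug described above is specific to 3.2.2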
I debugged the nltk.stem.porter module using pdb. After a few iterations, in _apply_rule_list() you get:
>>> rule
(u'at', u'ate', None)
>>> word
u'o'
At this point the _ends_double_consonant() method tries to do word[-1] == word[-2] and it fails.
If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:
def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):
        return False
    return self._cons(word, len(word)-1)
As far as I can see, the len(word) < 2 check is missing in the new version.
Changing _ends_double_consonant() to something like this should work:
def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )
I just proposed this change in the related NLTK issue.
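Until a fixed release is installed, one possible stopgap is to patch the method at runtime; a rough sketch (method names as they appear in NLTK 3.2.2, so treat this as a workaround, not the official fix):

from nltk.stem import porter

def _patched_ends_double_consonant(self, word):
    # same logic as the fix above, with the missing length check restored
    if len(word) < 2:
        return False
    return word[-1] == word[-2] and self._is_consonant(word, len(word) - 1)

porter.PorterStemmer._ends_double_consonant = _patched_ends_double_consonant
print(porter.PorterStemmer().stem('oed'))  # should no longer raise IndexError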
How am I supposed to use the Amfy module? I try to use it like the JSON module (amfy.loads or amfy.load), but it just gives me errors:
C:\Users\Other>"C:\Users\Other\Desktop\Python3.5.2\test amf.py"
Traceback (most recent call last):
File "C:\Users\Other\Desktop\Python3.5.2\test amf.py", line 4, in <module>
print(amfy.load(cn_rsp.text))
File "C:\Users\Other\Desktop\Python3.5.2\lib\site-packages\amfy\__init__.py", line 9, in load
return Loader().load(input, proto=proto)
File "C:\Users\Other\Desktop\Python3.5.2\lib\site-packages\amfy\core.py", line 33, in load
return self._read_item3(stream, context)
File "C:\Users\Other\Desktop\Python3.5.2\lib\site-packages\amfy\core.py", line 52, in _read_item3
marker = stream.read(1)[0]
AttributeError: 'str' object has no attribute 'read'
This is what I wrote:
import requests
import amfy
cn_rsp = requests.get("http://realm498.c10.castle.rykaiju.com/api/locales/en/get_serialized_new")
print(amfy.load(cn_rsp.text))
After tinkering around and googling some stuff, I found a fix:
New code:
import amfy, requests, json

url = "http://realm416.c9.castle.rykaiju.com/api/locales/en/get_serialized_static"
req = requests.get(url)
if req.status_code == 200:
    ret = req.json() if "json" in req.headers["content-type"] else amfy.loads(req.content)
else:
    ret = {"failed": req.reason}
with open("doa manifest.txt", 'w', encoding='utf-8') as dump:
    json.dump(ret, dump)
The terminal throws a UnicodeEncodeError, but I was able to fix that by running chcp 65001 and then set PYTHONIOENCODING=utf-8.
The load method expects an input stream, but you are giving it a string. Just convert your string into an in-memory buffer that supports the read method, like this:
import io
print(amfy.load(io.BytesIO(cn_rsp.text.encode())))
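Since requests already exposes the raw response bytes, you can also wrap cn_rsp.content directly and skip the encode round-trip:

import io
print(amfy.load(io.BytesIO(cn_rsp.content)))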
Unfortunately, serialization fails when using this. Is there another URL where it would work, a test URL maybe?
File "C:\Python34\lib\site-packages\amfy\core.py", line 146, in _read_vli
byte = stream.read(1)[0]
IndexError: index out of range