How to extract a PDF's text using pdfrw - python-3.5

Can pdfrw extract the text out of a document?
I was thinking something along the lines of
from pdfrw import PdfReader

doc = PdfReader(pdf_path)
page_texts = []
for page_nr in range(doc.numPages):
    page_texts.append(doc.getPage(page_nr).parse_page())  # ..or something

In the docs they explain how to extract the text. However, it's just a byte stream. You could iterate over the pages and decode them individually.
from pdfrw import PdfReader
doc = PdfReader(pdf_path)
for page in doc.pages:
    bytestream = page.Contents.stream  # a str holding the raw stream bytes, not a bytes object
    string = ...  # somehow decode bytestream, maybe using zlib.decompress
    # do something with that text
Edit:
May be worth noting that pdfrw does not yet support text decompression due to its complexity, according to the author.

It depends on which filters are applied to page.Contents.stream. If it is only FlateDecode you can use pdfrw.uncompress.uncompress([page.Contents]) to decode it.
Note: pass the whole Contents object to the function, wrapped in a list
Note: This is not the same as pdfrw.PdfReader.uncompress()
And then you have to parse the string to find your text. It will be in blocks of lines between BT (begin text) and ET (end text) markers; the text itself sits inside round brackets on lines ending in either 'Tj' or 'TJ'.
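A minimal sketch of that approach, assuming each page has a single FlateDecode-compressed content stream and that text is drawn with simple (...) Tj operators (TJ arrays, escaped parentheses and font encodings are not handled):
import re
from pdfrw import PdfReader
from pdfrw.uncompress import uncompress

doc = PdfReader(pdf_path)
for page in doc.pages:
    uncompress([page.Contents])  # decodes FlateDecode streams in place
    stream = page.Contents.stream
    for block in re.findall(r"BT(.*?)ET", stream, re.DOTALL):   # text blocks
        for text in re.findall(r"\((.*?)\)\s*Tj", block):       # literal strings shown with Tj
            print(text)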

Here's an example that may be useful:
for pg_num in range(number_of_pages):
    pg_obj = pdfreader.getPage(pg_num)
    print(pg_num)
    if re.search(r'CSE', pg_obj.extractText()):
        cse_count += 1
        pdfwriter.addPage(pg_obj)
Here extractText() extracts the text of each page, and pages containing the keyword CSE are counted and added to the writer. Note that getPage(), extractText() and addPage() are PyPDF2's reader/writer API rather than pdfrw's.
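A more complete sketch of the same idea, filling in the missing setup (PyPDF2's classic API is assumed; pdf_path and out_path are placeholders):
import re
from PyPDF2 import PdfFileReader, PdfFileWriter

pdfreader = PdfFileReader(pdf_path)
pdfwriter = PdfFileWriter()
cse_count = 0
for pg_num in range(pdfreader.getNumPages()):
    pg_obj = pdfreader.getPage(pg_num)
    if re.search(r'CSE', pg_obj.extractText() or ''):
        cse_count += 1
        pdfwriter.addPage(pg_obj)
with open(out_path, 'wb') as out_file:
    pdfwriter.write(out_file)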

Related

Parsing a non-Unicode string with Flask-RESTful

I have a webhook developed with Flask-RESTful which gets several parameters with POST.
One of the parameters is a non-Unicode string, encoded in cp1251.
Can't find a way to correctly parse this argument using reqparse.
Here is the fragment of my code:
parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()
Then, I write msg to a text file, and it looks like this:
{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}
As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, like ! or \n are processed correctly.
Is there anything I can do to tell RequestParser which string encoding to use?
Here is my code for writing the text to disk:
f = open('log_msg.txt', 'w+')
f.write(json.dumps(msg))
f.close()
I tried f = open('log_msg.txt', 'w+', encoding='cp1251') with the same result.
Then, I tried
f = open('log_msg_ascii.txt', 'w+')
f.write(ascii(json.dumps(msg)))
Also, no difference.
So, I'm pretty sure RequestParser() tries to be too smart and can't understand the non-Unicode input.
Thanks!
Okay, I finally found a workaround. Thanks to @lenz for helping me with this issue. It seems that reqparse wrongly assumes that every string parameter comes as UTF-8. So when it sees a non-Unicode input field (among other Unicode fields!), it tries to load it as Unicode and fails. As a result, all its characters become U+FFFD (the replacement character).
So, to access that non-Unicode field, I did the following trick.
First, I load raw data using get_data(), decode it using cp1251 and parse with a simple regexp.
import re
from flask import request

raw_data = request.get_data()
contents = raw_data.decode('windows-1251')
match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
text = match.group(2)
Not the most beautiful solution, but it works.
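As a follow-up (a sketch, not from the original answer): once text is a proper str, writing the log with an explicit encoding keeps the Cyrillic characters intact instead of producing \ufffd:
import json

with open('log_msg.txt', 'w', encoding='utf-8') as f:
    json.dump({'text': text}, f, ensure_ascii=False)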

How to decode a text file by extracting alphabet characters and listing them into a message?

So we were given an assignment to create a program that would sort through a long message filled with special characters (i.e. [, {, %, $, *) with only a few alphabet characters throughout the entire thing, to make a special message.
I've been searching on this site for a while and haven't found anything specific enough that would work.
I put the text file into a pastebin if you want to see it
https://pastebin.com/48BTWB3B
Anywho, this is what I've come up with for code so far
code = open('code.txt', 'r')
lettersList = code.readlines()
lettersList.sort()
for letters in lettersList:
    print(letters)
It prints code.txt out, but as a list of short lines, essentially cutting it into smaller pieces. I want it to find and sort out the alphabet characters into a list and print the decoded message.
This is something you can do pretty easily with regex.
import re

with open('code.txt', 'r') as filehandle:
    contents = filehandle.read()

letters = re.findall("[a-zA-Z]+", contents)
If you want to condense the list into a single string, you can use a join:
single_str = ''.join(letters)
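The same filtering can also be done without a regex, using str.isalpha (a small sketch of an alternative, not part of the original answer):
with open('code.txt', 'r') as filehandle:
    single_str = ''.join(ch for ch in filehandle.read() if ch.isalpha())
print(single_str)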

How to extract text inserted with track-changes in python-docx

I want to extract text from word documents that were edited in "Track Changes" mode. I want to extract the inserted text and ignore the deleted text.
Running the below code I saw that paragraphs inserted in "track changes" mode return an empty Paragraph.text
import docx

doc = docx.Document('C:\\test track changes.docx')
for para in doc.paragraphs:
    print(para)
    print(para.text)
Is there a way to retrieve the text in revisioned inserts (w:ins elements) ?
I'm using python-docx 0.8.6, lxml 3.4.0, python 3.4, Win7
Thanks
I was having the same problem for years (maybe as long as this question existed).
By looking at the code by "etienned" posted by @yiftah and at the attributes of Paragraph, I have found a solution to retrieve the text after accepting the changes.
The trick was to use p._p.xml to get the XML of the paragraph and then run "etienned"'s code on it (i.e. retrieving all the <w:t> elements from the XML, which cover both regular runs and runs inside <w:ins> blocks).
Hope it can help the souls lost like I was:
from docx import Document
try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML

WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"

def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        runs = (node.text for node in tree.getiterator(TEXT) if node.text)
        return "".join(runs)
    else:
        return p.text

doc = Document("Hello.docx")
for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")
Not directly using python-docx; there's no API support yet for tracked changes/revisions.
It's a pretty tricky job, which you'll discover if you search on the element names, perhaps 'open xml w:ins' for a start; that brings up this document as the first result:
https://msdn.microsoft.com/en-us/library/ee836138(v=office.12).aspx
If I needed to do something like that in a pinch I'd get the body element using:
body = document._body._body
and then use XPath on that to return the elements I wanted, something vaguely like this aircode:
from docx.text.paragraph import Paragraph

inserted_ps = body.xpath('./w:ins//w:p')
for p in inserted_ps:
    paragraph = Paragraph(p, None)
    print(paragraph.text)
You'll be on your own for figuring out what XPath expression will get you the paragraphs you want.
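For instance, a sketch that collects the text of inserted runs rather than whole paragraphs (assuming, as in the aircode above, that python-docx's body element accepts the w: prefix in xpath(); docx.oxml.ns.qn is used to build the qualified <w:t> tag name, and the file path is a placeholder):
from docx import Document
from docx.oxml.ns import qn

document = Document('test track changes.docx')
body = document._body._body
for ins in body.xpath('.//w:ins'):
    # join the text of every <w:t> run inside this insertion
    inserted_text = ''.join(t.text or '' for t in ins.iter(qn('w:t')))
    print(inserted_text)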
opc-diag may be a friend in this, allowing you to quickly scan the XML of the .docx package. http://opc-diag.readthedocs.io/en/latest/index.html
The code below from etienned worked for me; it works directly with the document's XML (and does not use python-docx):
http://etienned.github.io/posts/extract-text-from-word-docx-simply/
I needed a quick solution to make text surrounded by "smart tags" visible to docx's text property, and found that the solution could also be adapted to make some tracked changes visible.
It uses lxml.etree.strip_tags to remove the surrounding "smartTag" and "ins" tags and promote their contents, and lxml.etree.strip_elements to remove the "del" elements entirely.
import docx
import lxml.etree

def para2text(p, quiet=False):
    if not quiet:
        unsafeText = p.text
    lxml.etree.strip_tags(p._p, "{*}smartTag")
    lxml.etree.strip_elements(p._p, "{*}del")
    lxml.etree.strip_tags(p._p, "{*}ins")
    safeText = p.text
    if not quiet:
        if safeText != unsafeText:
            print()
            print('para2text: unsafe:')
            print(unsafeText)
            print('para2text: safe:')
            print(safeText)
            print()
    return safeText

docin = docx.Document(filePath)
for para in docin.paragraphs:
    text = para2text(para)
Beware that this only works for a subset of "tracked changes", but it might be the basis of a more general solution.
If you want to see the XML of a docx file directly: rename it as .zip, extract the "document.xml", and view it by dropping it into Chrome or your favourite viewer.

Unable to remove string from text I am extracting from html

I am trying to extract the main article from a web page. I can accomplish the main text extraction using Python's readability module. However, the text I get back often contains several &#13 strings (there is a ; at the end of this string, but this editor won't allow the full string to be entered (strange!)). I have tried the Python replace function, the regular expression replace function, and the Unicode encode and decode functions. None of these approaches have worked. For the replace and regular-expression approaches I just get back my original text with the &#13 strings still present, and with the Unicode encode/decode approach I get back the error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 2099: ordinal not in range(128)
Here is the code I am using that takes the initial URL and, using readability, extracts the main article. I have left in all my commented-out code corresponding to the different approaches I have tried to remove the &#13; string. It appears as though &#13; is interpreted to be u'\xa9'.
import re
import urllib

from readability.readability import Document

def find_main_article_text_2():
    #url = 'http://finance.yahoo.com/news/questcor-pharmaceuticals-closes-transaction-acquire-130000695.html'
    url = "http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cbsm/SIG=11iiumket/*http://www.marketwatch.com/News/Story/Story.aspx?guid=4D9D3170-CE63-4570-B95B-9B16ABD0391C&siteid=yhoof2"
    html = urllib.urlopen(url).read()
    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()
    #readable_article.replace("u'\xa9'"," ")
    #print re.sub("&#13;", '', readable_article)
    #unicodedata.normalize('NFKD', readable_article).encode('ascii','ignore')
    print readable_article
    #print readable_article.decode('latin9').encode('utf8'),
    print "There are ", readable_article.count("&#13;"), "&#13;'s"
    #print readable_article.encode( sys.stdout.encoding , '' )
    #sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    #sents = sent_tokenizer.tokenize(readable_article)
    #new_sents = []
    #for sent in sents:
    #    unicode_sent = sent.decode('utf-8')
    #    s1 = unicode_sent.encode('ascii', 'ignore')
    #    #s2 = s1.replace("\n","")
    #    new_sents.append(s1)
    #print new_sents
    # u'\xa9'
I have a URL that I have been testing the code with included inside the def. If anybody has any ideas on how to remove this &#13 I would appreciate the help. Thanks, George
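Not from the original question, but for context: &#13; is the HTML character reference for a carriage return (U+000D), so one hedged sketch of cleaning it up, and of avoiding the separate UnicodeEncodeError by encoding explicitly, might look like this (strip_cr is just an illustrative helper name, and readable_article is taken from the snippet above):
def strip_cr(text):
    # remove both the literal entity and any real carriage-return characters
    text = text.replace("&#13;", "")
    return text.replace(u"\r", "")

cleaned = strip_cr(readable_article)
print(cleaned.encode("utf-8"))  # encode explicitly instead of relying on the ASCII default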

Extracting source code from html file using python3.1 urllib.request

I'm trying to obtain data using regular expressions from a html file, by implementing the following code:
import re
import urllib.request

def extract_words(wdict, urlname):
    uf = urllib.request.urlopen(urlname)
    text = uf.read()
    print(text)
    match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
which returns an error:
File "extract.py", line 33, in extract_words
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
File "/usr/lib/python3.1/re.py", line 192, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
Upon experimenting further in IDLE, I noticed that uf.read() indeed returns the HTML source code the first time I invoke it. From then on, it returns an empty bytes object, b''. Is there any way to get around this?
uf.read() will only read the contents once. Then you have to close it and reopen it to read it again. This is true for any kind of stream. This is however not the problem.
The problem is that reading from any kind of binary source, such as a file or a webpage, will return the data as a bytes type, unless you specify an encoding. But your regexp is not specified as a bytes type, it's specified as a unicode str.
The re module will quite reasonably refuse to use unicode patterns on byte data, and the other way around.
The solution is to make the regexp pattern a bytes string, which you do by putting a b in front of it. Hence:
match = re.findall(b"<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
Should work. Another option is to decode the text so it also is a unicode str:
encoding = uf.headers.get_content_charset()
text = text.decode(encoding)
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
(Also, to extract data from HTML, I would say that lxml is a better option).
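For example, a sketch of the lxml route (lxml is a third-party package; the two-column table layout targeted by the regex is assumed, and urlname comes from the question's function):
import urllib.request
import lxml.html

uf = urllib.request.urlopen(urlname)
tree = lxml.html.fromstring(uf.read())  # lxml accepts bytes and sniffs the encoding
rows = []
for tr in tree.xpath('//tr'):
    cells = [td.text_content().strip() for td in tr.xpath('./td')]
    if len(cells) == 2:
        rows.append(tuple(cells))
print(rows)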