How to extract text inserted with track-changes in python-docx

I want to extract text from word documents that were edited in "Track Changes" mode. I want to extract the inserted text and ignore the deleted text.
Running the code below, I saw that paragraphs inserted in "Track Changes" mode return an empty Paragraph.text.
import docx

doc = docx.Document('C:\\test track changes.docx')
for para in doc.paragraphs:
    print(para)
    print(para.text)
Is there a way to retrieve the text in revision-marked insertions (w:ins elements)?
I'm using python-docx 0.8.6, lxml 3.4.0, python 3.4, Win7
Thanks

I was having the same problem for years (maybe as long as this question has existed).
By looking at the code from "etienned" posted by @yiftah and at the attributes of Paragraph, I found a solution to retrieve the text after accepting the changes.
The trick is to use p._p.xml to get the XML of the paragraph and then run "etienned"'s code on that (i.e., retrieve all the <w:t> elements from the XML, which covers both regular runs and <w:ins> blocks).
Hope it can help the souls lost like I was:
from docx import Document

try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML

WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"

def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes."""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        # Deleted text lives in <w:delText>, so collecting <w:t> nodes keeps
        # regular runs and inserted runs while skipping deletions.
        runs = (node.text for node in tree.iter(TEXT) if node.text)
        return "".join(runs)
    else:
        return p.text

doc = Document("Hello.docx")
for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")

Not directly using python-docx; there's no API support for tracked changes/revisions yet.
It's a pretty tricky job, as you'll discover if you search on the element names. Try 'open xml w:ins' for a start; that brings up this document as the first result:
https://msdn.microsoft.com/en-us/library/ee836138(v=office.12).aspx
If I needed to do something like that in a pinch I'd get the body element using:
body = document._body._body
and then use XPath on that to return the elements I wanted, something vaguely like this aircode:
from docx.text.paragraph import Paragraph

inserted_ps = body.xpath('./w:ins//w:p')
for p in inserted_ps:
    paragraph = Paragraph(p, None)
    print(paragraph.text)
You'll be on your own for figuring out what XPath expression will get you the paragraphs you want.
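For instance, a minimal sketch in the same spirit (untested aircode like the above; the .//w:ins//w:t expression is just one guess at what you might want):
from docx import Document

document = Document('test track changes.docx')
body = document._body._body  # lxml element; python-docx pre-binds the w: prefix for xpath()

# w:ins wraps inserted runs, so collecting their w:t descendants
# yields only the revision-marked insertions.
inserted_text = "".join(t.text or "" for t in body.xpath('.//w:ins//w:t'))
print(inserted_text)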
opc-diag may be a friend in this, allowing you to quickly scan the XML of the .docx package. http://opc-diag.readthedocs.io/en/latest/index.html

The code below from Etienne worked for me; it works directly with the document's XML (and does not use python-docx):
http://etienned.github.io/posts/extract-text-from-word-docx-simply/

I needed a quick solution to make text surrounded by "smart tags" visible to docx's text property, and found that the approach could also be adapted to make some tracked changes visible.
It uses lxml.etree.strip_tags to remove the surrounding "smartTag" and "ins" tags while promoting their contents, and lxml.etree.strip_elements to remove the "del" elements entirely.
import docx
import lxml.etree

def para2text(p, quiet=False):
    if not quiet:
        unsafeText = p.text
    # Drop the wrapping tags but keep their contents
    lxml.etree.strip_tags(p._p, "{*}smartTag")
    # Remove deleted runs entirely, tag and contents
    lxml.etree.strip_elements(p._p, "{*}del")
    lxml.etree.strip_tags(p._p, "{*}ins")
    safeText = p.text
    if not quiet:
        if safeText != unsafeText:
            print()
            print('para2text: unsafe:')
            print(unsafeText)
            print('para2text: safe:')
            print(safeText)
            print()
    return safeText

docin = docx.Document(filePath)
for para in docin.paragraphs:
    text = para2text(para)
Beware that this only works for a subset of "tracked changes", but it might be the basis of a more general solution.
If you want to see the XML of a docx file directly: rename it as .zip, extract word/document.xml, and view it by dropping it into Chrome or your favourite viewer.
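Equivalently, a quick standard-library sketch to dump that XML without renaming anything ("Hello.docx" is a placeholder path):
import zipfile

# A .docx file is a zip package; the main document part is word/document.xml
with zipfile.ZipFile("Hello.docx") as package:
    print(package.read("word/document.xml").decode("utf-8"))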

Related

Automating The Boring Stuff With Python - Chapter 8 - Exercise - Regex Search

I'm trying to complete the exercise for Chapter 8, which takes a user-supplied regular expression and uses it to search each string in each text file in a folder.
I keep getting the error:
AttributeError: 'NoneType' object has no attribute 'group'
The code is here:
import os, glob, re

os.chdir("C:\\Automating The Boring Stuff With Python\\Chapter 8 - "
         "Reading and Writing Files\\Practice Projects\\RegexSearchTextFiles")

userRegex = re.compile(input('Enter your Regex expression :'))

for textFile in glob.glob("*.txt"):
    currentFile = open(textFile)  # open the text file and assign it to a file object
    textCurrentFile = currentFile.read()  # read the contents of the text file and assign to a variable
    print(textCurrentFile)
    # print(type(textCurrentFile))
    searchedText = userRegex.search(textCurrentFile)
    searchedText.group()
When I try this individually in the IDLE shell it works:
textCurrentFile = "What is life like for those left behind when the last foreign troops flew out of Afghanistan? Four people from cities and provinces around the country told the BBC they had lost basic freedoms and were struggling to survive."
>>> userRegex = re.compile(input('Enter the your Regex expression :'))
Enter the your Regex expression :troops
>>> searchedText = userRegex.search(textCurrentFile)
>>> searchedText.group()
'troops'
But I can't seem to make it work in the code when I run it. I'm really confused.
Thanks
Since you are just looping across all the .txt files, there could be files that don't have the word "troops" in them. To prove this, don't call .group(); just perform:
print(textFile, textCurrentFile, searchedText)
If you see that searchedText is None, then that means the contents of textFile (which is textCurrentFile) don't have the word "troops".
You could either:
Add the word troops to all the .txt files.
Only select the target .txt files, not all of them.
Check first if the match is found before accessing .group():
print(searchedText.group() if searchedText else None)
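Putting that together, a minimal sketch of the corrected loop (the None guard is the only substantive change to the question's code):
import glob
import re

userRegex = re.compile(input('Enter your Regex expression :'))

for textFile in glob.glob("*.txt"):
    with open(textFile) as currentFile:
        textCurrentFile = currentFile.read()
    searchedText = userRegex.search(textCurrentFile)
    # search() returns None when nothing matches, so guard before .group()
    if searchedText:
        print(textFile, ':', searchedText.group())
    else:
        print(textFile, ': no match')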

Excluding the Header and Footer Contents of a page of a PDF file while extracting text?

Is it possible to exclude the contents of footers and headers of a page in a PDF file while extracting the text from it? These contents are of little importance and are almost redundant.
Note: For extracting the text from the .pdf file, I am using the PyPDF2 package on Python 3.7.
How to exclude the contents of the footers and headers in PyPDF2. Any help is appreciated.
The code snippet is as follows:
import PyPDF2

def Read(startPage, endPage):
    global text
    text = []
    cleanText = " "
    pdfFileObj = open('C:\\Users\\Rocky\\Desktop\\req\\req\\0000 - gamma j.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    print(num_pages)
    while startPage <= endPage:
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.strip().split()
    print(text)

Read(1, 1)
As there is no such feature provided by PyPDF2 officially, I've written a function of my own to exclude the headers and footers in a PDF page, which is working fine for my use case. You can add your own regex patterns to the page_format_pattern variable. Here I'm checking only the first and last elements of my text list.
You can run this function for each page.
import re

def remove_header_footer(self, pdf_extracted_text):
    page_format_pattern = r'([page]+[\d]+)'
    pdf_extracted_text = pdf_extracted_text.lower().split("\n")
    header = pdf_extracted_text[0].strip()
    footer = pdf_extracted_text[-1].strip()
    # Drop the first line if it looks like a page header
    if re.search(page_format_pattern, header) or header.isnumeric():
        pdf_extracted_text = pdf_extracted_text[1:]
    # Drop the last line if it looks like a page footer
    if re.search(page_format_pattern, footer) or footer.isnumeric():
        pdf_extracted_text = pdf_extracted_text[:-1]
    pdf_extracted_text = "\n".join(pdf_extracted_text)
    return pdf_extracted_text
Hope you find this helpful.
At the moment, PyPDF2 does not offer this. It's also unclear how to do it well, as headers and footers are not semantically represented within the PDF.
As a heuristic, you could search for duplicates at the top / bottom of the extracted text of pages. That would likely work well for long documents and not work at all for 1-page documents.
You need to consider that the first few pages might have no header or a different header than the rest. Also, there can be differences between chapters and even / odd pages.
(side note: I'm the maintainer of PyPDF2 and I think this would be awesome to have)
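A rough sketch of that duplicate-line heuristic (my own illustration, not a PyPDF2 feature; the edge_lines and min_ratio thresholds are arbitrary assumptions):
from collections import Counter

def strip_repeated_edges(pages_text, edge_lines=2, min_ratio=0.6):
    """Drop lines that recur at the top/bottom of most pages (likely headers/footers)."""
    pages = [text.splitlines() for text in pages_text]
    edge_counts = Counter()
    for lines in pages:
        # Count each candidate line once per page
        edges = {l.strip() for l in lines[:edge_lines]} | {l.strip() for l in lines[-edge_lines:]}
        edge_counts.update(edges)
    repeated = {l for l, n in edge_counts.items() if l and n >= min_ratio * len(pages)}
    return ["\n".join(l for l in lines if l.strip() not in repeated) for lines in pages]
Here pages_text would be the per-page strings from whatever extractor you use, e.g. the extractText() results gathered in the question's loop. As noted above, it will misfire on short documents and on varying headers.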

How to extract a PDF's text using pdfrw

Can pdfrw extract the text out of a document?
I was thinking something along the lines of
from pdfrw import PdfReader

doc = PdfReader(pdf_path)
page_texts = []
for page_nr in doc.numPages:
    page_texts.append(doc.getPage(page_nr).parse_page())  # ..or something
In the docs they explain how to extract the text. However, it's just a bytestream. You could iterate over the pages and decode them individually.
from pdfrw import PdfReader

doc = PdfReader(pdf_path)
for page in doc.pages:
    bytestream = page.Contents.stream  # This is a string with bytes, not a bytestring
    string = ...  # somehow decode bytestream, maybe using zlib.decompress
    # do something with that text
Edit:
It may be worth noting that pdfrw does not yet support text decompression due to its complexity, according to the author.
This depends on which filters are applied to page.Contents.stream. If it is only FlateDecode, you can use pdfrw.uncompress.uncompress([page.Contents]) to decode it.
Note: pass the whole Contents object to the function in a list.
Note: this is not the same as pdfrw.PdfReader.uncompress().
You then have to parse the string to find your text. It will be in blocks of lines between BT (begin text) and ET (end text) markers, on lines ending in either 'TJ' or 'Tj', with the text inside round brackets.
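A minimal sketch combining those steps (assuming a single FlateDecode-only content stream per page and a very naive scan for (...) literals before Tj; the file name is a placeholder):
import re
from pdfrw import PdfReader
from pdfrw.uncompress import uncompress

doc = PdfReader("some_document.pdf")
for page in doc.pages:
    uncompress([page.Contents])  # decodes the stream in place when the filter is FlateDecode
    content = page.Contents.stream
    # Naive scan: string literals shown with the Tj operator
    for snippet in re.findall(r'\((.*?)\)\s*Tj', content):
        print(snippet)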
Here's an example that may be useful:
for pg_num in range(number_of_pages):
    pg_obj = pdfreader.getPage(pg_num)
    print(pg_num)
    if re.search(r'CSE', pg_obj.extractText()):
        cse_count += 1
        pdfwriter.addPage(pg_obj)
Here extractText() would extract the text of the page containing the keyword CSE (note that getPage() and extractText() are PyPDF2 reader methods, not pdfrw).

combining links into one one output in beautifulsoup

I am trying to grab all of the links within a certain div tag, which I can accomplish. The problem is that every link is displayed on a new line. For example:
Home
Wire Wheels
Crimped
I would like it to show Home,Wire Wheels,Crimped
Is this possible?
Here is the python code I am using to grab the data:
for crumbs in soup.find('div', {"id": "breadcrumbs"}).find_all('a'):
    crumbs2 = crumbs.text
    print(crumbs2)
You can specify a different line ending for print. The default for its end parameter is '\n':
crumbs = list(soup.find('div', {"id": "breadcrumbs"}).find_all('a'))
for ind, crumb in enumerate(crumbs):
    if ind < len(crumbs) - 1:
        ending = {'end': ', '}
    else:
        ending = {}
    print(crumb.text, **ending)
That being said, you should definitely go with @alecxe's answer.
Use .get_text() to get the stripped text directly and str.join() to join the strings:
",".join([crumbs.get_text(strip=True)
for crumbs in soup.find('div',{"id":"breadcrumbs"}).find_all('a')])
Also note that soup.find('div',{"id":"breadcrumbs"}).find_all('a') can be simplified to soup.select("#breadcrumbs a").
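For completeness, a self-contained sketch of that shorter form (the sample markup is made up from the question's breadcrumbs):
from bs4 import BeautifulSoup

html = '<div id="breadcrumbs"><a>Home</a><a>Wire Wheels</a><a>Crimped</a></div>'
soup = BeautifulSoup(html, "html.parser")
print(",".join(a.get_text(strip=True) for a in soup.select("#breadcrumbs a")))
# -> Home,Wire Wheels,Crimped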

python-docx insertion point

I am not sure if I've been missing something obvious, but I have not found anything documented about how one would go about inserting Word elements (tables, for example) at a specific place in a document.
I am loading an existing MS Word .docx document by using:
my_document = Document('some/path/to/my/document.docx')
My use case would be to get the 'position' of a bookmark or section in the document and then proceed to insert tables below that point.
I'm thinking about an API that would allow me to do something along those lines:
insertion_point = my_document.bookmarks['bookmark_name'].position
my_document.add_table(rows=10, cols=3, position=insertion_point+1)
I saw that there are plans to implement something akin to the 'range' object of the MS Word API, this would effectively solve that problem. In the meantime, is there a way to instruct the document object methods where to insert the new elements?
Maybe I can glue some lxml code to find a node and pass that to these python-docx methods? Any help on this subject would be much appreciated! Thanks.
I remembered an old adage, "use the source, Luke!", and managed to figure it out. A post from the python-docx owner on its git project page also gave me a hint: https://github.com/python-openxml/python-docx/issues/7.
The full XML document model can be accessed using its _document_part._element property. It behaves exactly like an lxml etree element. From there, everything is possible.
To solve my specific insertion point problem, I created a temp docx.Document object which I used to store my generated content.
import docx
from docx.oxml.shared import qn

tmp_doc = docx.Document()

# Generate content in the tmp_doc document
tmp_doc.add_heading('New heading', 1)
# more content generation using the docx API
# ...

# Reference the tmp_doc XML content
tmp_doc_body = tmp_doc._document_part._element.body
# You could pretty-print it by using:
# print(docx.oxml.xmlchemy.serialize_for_reading(tmp_doc_body))
I then loaded my docx template (containing a bookmark named 'insertion_point') into a second docx.Document object.
doc = docx.Document('/some/path/example.docx')
doc_body = doc._document_part._element.body
#print(docx.oxml.xmlchemy.serialize_for_reading(doc_body))
The next step is parsing the doc XML to find the index of the insertion point. I defined a small function for the task at hand, which returns a named bookmark parent paragraph element:
def get_bookmark_par_element(document, bookmark_name):
    """
    Return the named bookmark parent paragraph element. If no matching
    bookmark is found, the result is '1'. If an error is encountered, '2'
    is returned.
    """
    doc_element = document._document_part._element
    bookmarks_list = doc_element.findall('.//' + qn('w:bookmarkStart'))
    for bookmark in bookmarks_list:
        name = bookmark.get(qn('w:name'))
        if name == bookmark_name:
            par = bookmark.getparent()
            if not isinstance(par, docx.oxml.CT_P):
                return 2
            else:
                return par
    return 1
The newly defined function was used to get the 'insertion_point' bookmark's parent paragraph. Error control is left to the reader.
bookmark_par = get_bookmark_par_element(doc, 'insertion_point')
We can now use bookmark_par's etree index to insert our tmp_doc generated content at the right place:
bookmark_par_parent = bookmark_par.getparent()
index = bookmark_par_parent.index(bookmark_par) + 1
# Copy the children first: inserting an element into the other tree moves it
# out of tmp_doc_body, which would otherwise skip elements during iteration.
for child in list(tmp_doc_body):
    bookmark_par_parent.insert(index, child)
    index = index + 1
bookmark_par_parent.remove(bookmark_par)
The document is now finalized, the generated content having been inserted at the bookmark location of an existing Word document.
# Save result
# print(docx.oxml.xmlchemy.serialize_for_reading(doc_body))
doc.save('/some/path/generated_doc.docx')
I hope this can help someone, as the documentation regarding this has yet to be written.
You put [image] as a token in your template document:
from docx.shared import Inches

for paragraph in document.paragraphs:
    if "[image]" in paragraph.text:
        paragraph.text = paragraph.text.strip().replace("[image]", "")
        run = paragraph.add_run()
        run.add_picture(image_path, width=Inches(3))
You can have the paragraph in a table cell as well; just find the cell and do as above.
The python-docx owner suggests how to insert a table into the middle of an existing document:
https://github.com/python-openxml/python-docx/issues/156
Here it is with some improvements:
import re
from docx import Document

def move_table_after(document, table, search_phrase):
    regexp = re.compile(search_phrase)
    for paragraph in document.paragraphs:
        if paragraph.text and regexp.search(paragraph.text):
            tbl, p = table._tbl, paragraph._p
            p.addnext(tbl)
            return paragraph

if __name__ == '__main__':
    document = Document('Existing_Document.docx')
    table = document.add_table(rows=..., cols=...)
    ...
    move_table_after(document, table, "your search phrase")
    document.save('Modified_Document.docx')
Have a look at python-docx-template, which allows jinja2-style template insertion points in a docx file rather than Word bookmarks:
https://pypi.org/project/docxtpl/
https://docxtpl.readthedocs.io/en/latest/
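A minimal sketch of that approach (the template file name and the title variable are made up for illustration; the template would contain a {{ title }} tag where content should land):
from docxtpl import DocxTemplate

tpl = DocxTemplate("template_with_tags.docx")  # hypothetical template containing {{ title }}
tpl.render({"title": "New heading"})
tpl.save("generated_doc.docx")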
Thanks a lot for taking the time to explain all of this.
I was going through more or less the same issue. My specific case was how to merge two or more docx documents at the end.
It's not exactly a solution to your problem, but here is the function I came up with:
from docx import Document

def combinate_word(main_file, files, output):
    main_doc = Document(main_file)
    for file in files:
        sub_doc = Document(file)
        for element in sub_doc._document_part.body._element:
            main_doc._document_part.body._element.append(element)
    main_doc.save(output)
Unfortunately, it's not yet possible nor easy to copy images with python-docx. I fell back to win32com...
