python-docx insertion point - python-3.x

I am not sure if I've been missing something obvious, but I have not found anything documented about how one would go about inserting Word elements (tables, for example) at a specific place in a document.
I am loading an existing MS Word .docx document by using:
my_document = Document('some/path/to/my/document.docx')
My use case would be to get the 'position' of a bookmark or section in the document and then proceed to insert tables below that point.
I'm thinking about an API that would allow me to do something along those lines:
insertion_point = my_document.bookmarks['bookmark_name'].position
my_document.add_table(rows=10, cols=3, position=insertion_point+1)
I saw that there are plans to implement something akin to the 'range' object of the MS Word API, which would effectively solve this problem. In the meantime, is there a way to instruct the document object methods where to insert the new elements?
Maybe I can glue some lxml code to find a node and pass that to these python-docx methods? Any help on this subject would be much appreciated! Thanks.

I remembered an old adage, "use the source, Luke!", and was able to figure it out. A post from the python-docx owner on the project's GitHub page also gave me a hint: https://github.com/python-openxml/python-docx/issues/7.
The full XML document model can be accessed through the document's _document_part._element property. It behaves exactly like an lxml etree element; from there, everything is possible.
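For orientation, a minimal sketch of that access path. Note that _document_part is an internal of the python-docx version current at the time of writing; recent releases expose the same element as document.element, so adjust to whatever your version provides:
import docx

doc = docx.Document('some/path/to/my/document.docx')
# The lxml element behind the main document part, i.e. the <w:document> root.
body = doc._document_part._element.body  # older python-docx internals
# On current releases the equivalent would be: body = doc.element.body
print(body.tag)  # fully qualified '{...wordprocessingml/2006/main}body' tag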
To solve my specific insertion point problem, I created a temp docx.Document object which I used to store my generated content.
import docx
from docx.oxml.shared import qn
tmp_doc = docx.Document()
# Generate content in tmp_doc document
tmp_doc.add_heading('New heading', 1)
# more content generation using docx API.
# ...
# Reference the tmp_doc XML content
tmp_doc_body = tmp_doc._document_part._element.body
# You could pretty print it by using:
#print(docx.oxml.xmlchemy.serialize_for_reading(tmp_doc_body))
I then loaded my docx template (containing a bookmark named 'insertion_point') into a second docx.Document object.
doc = docx.Document('/some/path/example.docx')
doc_body = doc._document_part._element.body
#print(docx.oxml.xmlchemy.serialize_for_reading(doc_body))
The next step is parsing the doc XML to find the index of the insertion point. I defined a small function for the task at hand, which returns a named bookmark parent paragraph element:
def get_bookmark_par_element(document, bookmark_name):
    """
    Return the named bookmark parent paragraph element. If no matching
    bookmark is found, the result is '1'. If an error is encountered, '2'
    is returned.
    """
    doc_element = document._document_part._element
    bookmarks_list = doc_element.findall('.//' + qn('w:bookmarkStart'))
    for bookmark in bookmarks_list:
        name = bookmark.get(qn('w:name'))
        if name == bookmark_name:
            par = bookmark.getparent()
            if not isinstance(par, docx.oxml.CT_P):
                return 2
            else:
                return par
    return 1
The newly defined function was used to get the 'insertion_point' bookmark's parent paragraph. Error handling is left to the reader.
bookmark_par = get_bookmark_par_element(doc, 'insertion_point')
We can now use bookmark_par's etree index to insert our tmp_doc generated content at the right place:
bookmark_par_parent = bookmark_par.getparent()
index = bookmark_par_parent.index(bookmark_par) + 1
# Inserting an lxml element under a new parent *moves* it out of
# tmp_doc_body, so iterating over tmp_doc_body directly would skip
# every other child; snapshot the children into a list first.
for child in list(tmp_doc_body):
    bookmark_par_parent.insert(index, child)
    index = index + 1
bookmark_par_parent.remove(bookmark_par)
The document is now finalized, the generated content having been inserted at the bookmark location of an existing Word document.
# Save result
# print(docx.oxml.xmlchemy.serialize_for_reading(doc_body))
doc.save('/some/path/generated_doc.docx')
I hope this can help someone, as the documentation regarding this has yet to be written.

You put [image] as a token in your template document:
from docx.shared import Inches

for paragraph in document.paragraphs:
    if "[image]" in paragraph.text:
        paragraph.text = paragraph.text.strip().replace("[image]", "")
        run = paragraph.add_run()
        run.add_picture(image_path, width=Inches(3))
You may have a paragraph in a table cell as well; just find the cell and do as above, as in the sketch below.
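A minimal sketch of the table-cell case, assuming the same [image] token convention (document, image_path, and the Inches width carry over from the snippet above):
for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if "[image]" in paragraph.text:
                    paragraph.text = paragraph.text.strip().replace("[image]", "")
                    run = paragraph.add_run()
                    run.add_picture(image_path, width=Inches(3))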

The python-docx owner suggests how to insert a table into the middle of an existing document:
https://github.com/python-openxml/python-docx/issues/156
Here it is with some improvements:
import re
from docx import Document

def move_table_after(document, table, search_phrase):
    regexp = re.compile(search_phrase)
    for paragraph in document.paragraphs:
        if paragraph.text and regexp.search(paragraph.text):
            tbl, p = table._tbl, paragraph._p
            p.addnext(tbl)
            return paragraph

if __name__ == '__main__':
    document = Document('Existing_Document.docx')
    table = document.add_table(rows=..., cols=...)
    ...
    move_table_after(document, table, "your search phrase")
    document.save('Modified_Document.docx')

Have a look at python-docx-template, which allows jinja2-style template insertion points in a docx file rather than Word bookmarks:
https://pypi.org/project/docxtpl/
https://docxtpl.readthedocs.io/en/latest/
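For illustration, a minimal docxtpl sketch; the template file name and context keys here are hypothetical, and the template would contain matching jinja2 tags such as {{ company_name }}:
from docxtpl import DocxTemplate

doc = DocxTemplate("template.docx")
context = {"company_name": "ACME Corp", "items": ["a", "b", "c"]}
doc.render(context)
doc.save("generated.docx")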

Thanks a lot for taking the time to explain all of this.
I was going through more or less the same issue; my specific case was merging two or more docx documents, one at the end of the other.
It's not exactly a solution to your problem, but here is the function I came up with:
def combinate_word(main_file, files, output):
    main_doc = Document(main_file)
    for file in files:
        sub_doc = Document(file)
        for element in sub_doc._document_part.body._element:
            main_doc._document_part.body._element.append(element)
    main_doc.save(output)
Unfortunately, it's not yet possible nor easy to copy images with python-docx; I fell back to win32com ...
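One caveat with the function above: a document body's last child is a w:sectPr (section properties) element, so appending every child of each sub-document also appends its page setup after the main document's. A hedged variant that deep-copies nodes, skips the sub-document's sectPr, and keeps the main document's sectPr last (using the element accessor of current python-docx; adjust for older versions):
import copy
from docx import Document
from docx.oxml.ns import qn

def append_document(main_doc, sub_path):
    """Append the body content of sub_path to main_doc."""
    sub_doc = Document(sub_path)
    main_body = main_doc.element.body
    main_sectPr = main_body.find(qn('w:sectPr'))
    for element in sub_doc.element.body:
        if element.tag == qn('w:sectPr'):
            continue  # skip the sub-document's section properties
        new_el = copy.deepcopy(element)
        if main_sectPr is not None:
            main_sectPr.addprevious(new_el)  # keep w:sectPr as last child
        else:
            main_body.append(new_el)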

Related

Extract data from embedded script tag in html

I'm trying to fetch data inside a (big) script tag within HTML. Using BeautifulSoup I can get at the necessary script, yet I cannot extract the data I want.
What I'm looking for inside this tag resides within a list called "Beleidsdekkingsgraad", more specifically:
["Beleidsdekkingsgraad","107,6","107,6","109,1","109,8","110,1","111,5","112,5","113,3","113,3","114,3","115,7","116,3","116,9","117,5","117,8","118,1","118,3","118,4","118,6","118,8","118,9","118,9","118,9","118,5","118,1","117,8","117,6","117,5","117,1","116,7","116,2"]
Even more specifically, I want the last entry in the list (116,2).
Following approaches 1 or 2 did not get the job done.
What I've done so far:
base='https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed'
url=requests.get(base)
soup=BeautifulSoup(url.text, 'html.parser')
all_scripts = soup.find_all('script')
all_scripts[3].get_text()[1907:2179]
This, however, is not satisfying, since the indexing has to be changed every time new numbers are added.
What I'm looking for is, first, an easy way to extract the list from the script tag and, second, a way to catch the last number of the extracted list (i.e. 116,2).
You could regex out the JavaScript object holding that item, then parse it with the json library:
import requests,re,json
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'window\.infographicData=(.*);')
data = json.loads(p.findall(r.text)[0])
result = [i for i in data['elements'][1]['data'][0] if 'Beleidsdekkingsgraad' in i][0][-1]
print(result)
Or do the whole thing with regex:
import requests,re
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'\["Beleidsdekkingsgraad".+?,"([0-9,]+)"\]')
print(p.findall(r.text)[0])
Another option:
import requests,re, json
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'(\["Beleidsdekkingsgraad".+?"\])')
print(json.loads(p.findall(r.text)[0])[-1])

Caching parsed document

I have a set of YAML files. I would like to cache these files so that as much work as possible is re-used.
Each of these files contains two documents. The first document contains “static” information that will always be interpreted in the same way. The second document contains “dynamic” information that must be reinterpreted every time the file is used. Specifically, it uses a tag-based macro system, and the document must be constructed anew each time the file is used. However, the file itself will not change, so the results of parsing the entire file could be cached (at a considerable resource savings).
In ruamel.yaml, is there a simple way to parse an entire file into multiple parsed documents, then run construction on each document individually? This would allow me to cache the result of constructing the first “static” document and cache the parse of the second “dynamic” document for later construction.
Example file:
---
default_argument: name
...
%YAML 1.2
%TAG ! tag:yaml-macros:yamlmacros.lib.extend,yamlmacros.lib.arguments:
---
!merge
name: !argument name
The first document contains metadata that is used (along with other data from elsewhere) in the construction of the second document.
If you don't want to process all YAML documents in a stream completely, you'll have to split up the stream by hand, which is not entirely easy to do in a generic way.
What you need to know is what a YAML stream can consist of:
zero or more documents. Subsequent documents require some sort of separation marker line. If a document is not terminated by a document end marker line, then the following document must begin with a directives end marker line.
A document end marker line is a line that starts with ... followed by space/newline and a directives end marker line is --- followed by space/newline.
The actual production rules are slightly more complicated and "starts with" should ignore the fact that you need to skip any mid-stream byte-order marks.
If you don't have any directives, byte-order marks, or document-end markers (and most multi-document YAML streams that I have seen do not have those), then you can just read the multi-document YAML as a string using data = Path(filename).read_text(), split it using l = data.split('\n---'), and process only the appropriate element of the resulting list with YAML().load(l[N]).
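A sketch of that shortcut, under the stated assumptions (no directives, no byte-order marks, no document-end markers; the file name is a placeholder). It would not work for the example file above, which has both directives and a '...' marker:
from pathlib import Path
import ruamel.yaml

data = Path("multi_doc.yaml").read_text()
# Naive split on directives-end marker lines; see the caveats above.
parts = data.split('\n---')
yaml = ruamel.yaml.YAML()
second_doc = yaml.load(parts[1])  # construct only the document you need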
I am not sure the following properly handles all cases, but it does handle your multi-doc stream:
import sys
from pathlib import Path
import ruamel.yaml

docs = []
current = ""
state = "EOD"
for line in Path("example.yaml").open():
    if state in ["EOD", "DIR"]:
        if line.startswith("%"):
            state = "DIR"
        else:
            state = "BODY"
        current += line
        continue
    if line.startswith('...') and line[3].isspace():
        state = "EOD"
        docs.append(current)
        current = ""
        continue
    if state == "BODY" and current and line.startswith('---') and line[3].isspace():
        docs.append(current)
        current = ""
        continue
    current += line
if current:
    docs.append(current)

yaml = ruamel.yaml.YAML()
data = yaml.load(docs[1])
print(data['name'])
which gives:
name
It looks like you can indeed directly operate the parser internals of ruamel.yaml; it just isn't documented. The following function will parse a YAML string into document nodes:
from ruamel.yaml import SafeLoader

def parse_documents(text):
    loader = SafeLoader(text)
    composer = loader.composer
    while composer.check_node():
        yield composer.get_node()
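For instance, a quick sanity check of the generator on the example file from the question (file name assumed); the yielded nodes are composed but not yet constructed:
from pathlib import Path

text = Path("example.yaml").read_text()
nodes = list(parse_documents(text))
print(len(nodes))  # 2: one node per document in the stream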
From there, the documents can be individually constructed. In order to solve my problem, something like the following should work:
def process_yaml(text):
    my_constructor = get_my_custom_constructor()
    parsed_documents = list(parse_documents(text))
    metadata = my_constructor.construct_document(parsed_documents[0])
    return (metadata, parsed_documents[1])
cache = {}

def do_the_thing(file_path):
    if file_path not in cache:
        cache[file_path] = process_yaml(Path(file_path).read_text())
    metadata, document = cache[file_path]
    my_constructor = get_my_custom_constructor(metadata)
    return my_constructor.construct_document(document)
This way, all of the file I/O and parsing is cached, and only the final construction step needs to be performed each time.
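Hypothetical usage, assuming get_my_custom_constructor is defined elsewhere; the second call re-runs only the final construction step:
doc_a = do_the_thing("example.yaml")  # reads, parses, and caches
doc_b = do_the_thing("example.yaml")  # cache hit: construction only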

Reportlab PDF creating with python duplicating text

I am trying to automate the production of PDFs by reading data from a pandas data frame and writing it to a page of an existing PDF form, using PyPDF2 and ReportLab. The main meat of the program is here:
def pdfOperations(row, bp):
    packet = io.BytesIO()
    can = canvas.Canvas(packet, pagesize=letter)
    createText(row, can)
    packet.seek(0)
    new_pdf = PdfFileReader(packet)
    textPage = new_pdf.getPage(0)
    secondPage = bp.getPage(1)
    secondPage.mergePage(textPage)
    assemblePDF(frontPage, secondPage, row)
    del packet, can, new_pdf, textPage, secondPage

def main():
    df = openData()
    bp = readPDF()
    frontPage = bp.getPage(0)
    for ind in df.index:
        row = df.loc[ind]
        pdfOperations(row, bp)
This works fine for the first row of data and the first PDF generated, but for the subsequent ones all the text is overwritten; i.e. the second PDF contains text from both the first and second iterations. I thought the garbage collection would take care of all the in-memory changes, but that does not seem to be happening. Does anyone know why?
I even tried forcing the objects to be deleted after the function has run its course, but no luck...
You read bp only once, before the loop. Then in the loop, you obtain its second page via getPage(1) and merge stuff into it. But since it's always the same object (bp), each iteration merges into the same page, so all the merges done before add up.
While I can't find any way to create a deep copy of a page in PyPDF2's docs, it should work to just create a new bp object for each iteration.
Somewhere in readPDF you must have opened your template PDF as a binary stream and passed it to PdfFileReader. Instead, you could read the data into a variable:
with open(filename, "rb") as f:
    bp_bin = f.read()
And from that, create a new PdfFileReader instance for each loop iteration:
for ind in df.index:
    row = df.loc[ind]
    # Wrap the raw bytes in a fresh stream each time; PdfFileReader
    # expects a file-like object (or a path), not a bytes value.
    bp = PdfFileReader(io.BytesIO(bp_bin))
    pdfOperations(row, bp)
This should "reset" the secondPage every time without any additional file I/O overhead. Only the parsing is done again on each iteration, but depending on the file size and contents, the time that takes may be low enough to live with.

How to perform a check with a csv file

I want to know if there is a better way than iterating through a CSV when performing a check. I am using SoapUI (free version) to test a web service based on a search.
What I want to do is look at the response from a particular search request (the SOAP request step name is 'Search Request') and find all instances of the test ID between the XML tags <TestID>, within both <IFInformation> and <OFInformation> (this will be in a Groovy script step).
def groovyUtils = new com.eviware.soapui.support.GroovyUtils(context)
import groovy.xml.XmlUtil

def response = messageExchange.response.responseContent
def xml = new XmlParser().parseText(response)
def IF = xml.'soap:Body'.IF*.TestId.text()
def OF = xml.'soap:Body'.OF*.TestId.text()
Now what I want to do is for each instance of the 'DepartureAirportId', I want to check that the ID is within a CSV file. There are two columns within the csv file (let's call it Search.csv) and both columns contain many rows. If the flight is found within any row within the first column, add a count +1 for the variable 'Test1', else if found in second column in csv, add count +1 for variable 'Test2'. If not found within any, add count +1 for variable 'NotFound'
I don't know whether iterating through the CSV is the best approach, or whether to read all the data from the CSV into a list and iterate through that, but I want to know how this can be done and which way is best, for my own learning.
I don't know about your algorithm, but the easiest way to iterate through a simple CSV file in Groovy is line by line, splitting each line on the separator:
new File("/1.csv").splitEachLine(",") { line ->
    println " ${line[0]} ${line[1]} "
}
http://docs.groovy-lang.org/latest/html/groovy-jdk/java/io/File.html#splitEachLine(java.lang.String,%20groovy.lang.Closure)
You might want to use CSV Validator.
Format.of(String regex)
It should do the trick: just provide the literal you're looking for as a rule for the first column and check whether it throws an exception or not.

How to extract text inserted with track-changes in python-docx

I want to extract text from word documents that were edited in "Track Changes" mode. I want to extract the inserted text and ignore the deleted text.
Running the below code I saw that paragraphs inserted in "track changes" mode return an empty Paragraph.text
import docx
doc = docx.Document('C:\\test track changes.docx')
for para in doc.paragraphs:
    print(para)
    print(para.text)
Is there a way to retrieve the text in revision-marked inserts (w:ins elements)?
I'm using python-docx 0.8.6, lxml 3.4.0, python 3.4, Win7
Thanks
I was having the same problem for years (maybe as long as this question has existed).
By looking at the code by "etienned" posted by @yiftah and at the attributes of Paragraph, I found a solution to retrieve the text after accepting the changes.
The trick was to use p._p.xml to get the XML of the paragraph and then run "etienned"'s code on that (i.e. retrieving all the <w:t> elements from the XML, which include both regular runs and runs inside <w:ins> blocks).
Hope it can help souls as lost as I was:
from docx import Document

try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML

WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"

def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes."""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        runs = (node.text for node in tree.getiterator(TEXT) if node.text)
        return "".join(runs)
    else:
        return p.text

doc = Document("Hello.docx")
for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")
Not directly using python-docx; there's no API support yet for tracked changes/revisions.
It's a pretty tricky job, which you'll discover if you search on the element names, perhaps 'open xml w:ins' for a start, that brings up this document as the first result:
https://msdn.microsoft.com/en-us/library/ee836138(v=office.12).aspx
If I needed to do something like that in a pinch I'd get the body element using:
body = document._body._body
and then use XPath on that to return the elements I wanted, something vaguely like this aircode:
from docx.text.paragraph import Paragraph

inserted_ps = body.xpath('./w:ins//w:p')
for p in inserted_ps:
    paragraph = Paragraph(p, None)
    print(paragraph.text)
You'll be on your own for figuring out what XPath expression will get you the paragraphs you want.
opc-diag may be a friend in this, allowing you to quickly scan the XML of the .docx package. http://opc-diag.readthedocs.io/en/latest/index.html
The code below from Etienne worked for me; it works directly with the document's XML (and does not use python-docx):
http://etienned.github.io/posts/extract-text-from-word-docx-simply/
I needed a quick solution to make text surrounded by "smart tags" visible to docx's text property, and found that the solution could also be adapted to make some tracked changes visible.
It uses lxml.etree.strip_tags to remove the surrounding "smartTag" and "ins" tags, promoting their contents, and lxml.etree.strip_elements to remove the "del" elements entirely.
import lxml.etree
import docx

def para2text(p, quiet=False):
    if not quiet:
        unsafeText = p.text
    # Promote the contents of smartTag and ins wrappers, and drop
    # del elements entirely.
    lxml.etree.strip_tags(p._p, "{*}smartTag")
    lxml.etree.strip_elements(p._p, "{*}del")
    lxml.etree.strip_tags(p._p, "{*}ins")
    safeText = p.text
    if not quiet:
        if safeText != unsafeText:
            print()
            print('para2text: unsafe:')
            print(unsafeText)
            print('para2text: safe:')
            print(safeText)
            print()
    return safeText

docin = docx.Document(filePath)
for para in docin.paragraphs:
    text = para2text(para)
Beware that this only works for a subset of "tracked changes", but it might be the basis of a more general solution.
If you want to see the XML of a docx file directly: rename it to .zip, extract document.xml (under the word/ folder), and view it by dropping it into Chrome or your favourite viewer.
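Equivalently, a short sketch that pulls out the XML without renaming anything (the file name is a placeholder; the main document part lives at word/document.xml inside the package):
import zipfile

with zipfile.ZipFile("test track changes.docx") as pkg:
    xml = pkg.read("word/document.xml").decode("utf-8")
print(xml[:1000])  # peek at the start of the document part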
