I am using python-docx-template and python-docx to create a DOCX file with one page. I need to duplicate that page in the document n times. How can I do this with Python?
python-docx doesn't have pages. However, it recognizes sections, so before you load the document with python-docx, make sure you insert section breaks before and after your target page.
However, python-docx currently has no API for grabbing the content of a section. If you really need it, you will have to walk through the underlying XML. You can start from document.element.body, for example with print(document.element.body.xml).
You are basically looking for the contents between w:sectPr. See its documentation here:
https://python-docx.readthedocs.io/en/latest/dev/analysis/features/sections.html
I am very new to all of this, so please bear with me.
I have a OneNote Notebook with several Sections each containing hundreds of pages.
I need to retrieve all the content of the pages (while keeping the structure of the page titles / section titles) and end up with that content in a somewhat usable state in Excel. Of course, doing so manually will take me weeks. That's why I'm looking for an automated/semi-automated approach.
Is there a way to do that? What do I need to look into? I haven't found an answer on the internet, but I guess I may need to use the OneNote API? Maybe find a way to export the OneNote content to .csv and then process it in Excel? Maybe the OneNote content can be retrieved directly by a macro in Excel?
What would you look into to achieve my goal?
Thank you for reading.
The OneNote API/Graph API can get sections and pages:
GET https://graph.microsoft.com/v1.0/me/onenote/notebooks/{id}/sections
GET https://graph.microsoft.com/v1.0/me/onenote/sections/{id}/pages?$count=true&$top=100
There may be a limit on the number of pages the API can access (recently introduced?). A further call then gets the content of an individual page:
GET https://graph.microsoft.com/v1.0/me/onenote/pages/{id}/content[?includeIDs=true]
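The listing calls above could be combined roughly as follows. This is only a sketch using the standard library: it assumes you already have a valid OAuth access token (obtaining one is not shown), and the helper names `make_fetcher`, `iter_collection`, `list_pages`, and `write_csv` are made up for illustration. Paged results are followed through the `@odata.nextLink` field that Graph returns, and the result is written as CSV so Excel can open it. Page bodies are HTML, so fetching /pages/{id}/content would need a separate, non-JSON request.

```python
import csv
import json
import urllib.request

GRAPH = "https://graph.microsoft.com/v1.0/me/onenote"


def make_fetcher(token):
    """Return a function that GETs a Graph URL and decodes the JSON body."""
    def fetch_json(url):
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"}
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch_json


def iter_collection(fetch_json, url):
    """Yield every item of a paged collection, following @odata.nextLink."""
    while url:
        payload = fetch_json(url)
        yield from payload.get("value", [])
        url = payload.get("@odata.nextLink")


def list_pages(fetch_json, notebook_id):
    """Return (section name, page title, page id) rows for one notebook."""
    rows = []
    sections_url = f"{GRAPH}/notebooks/{notebook_id}/sections"
    for section in iter_collection(fetch_json, sections_url):
        pages_url = f"{GRAPH}/sections/{section['id']}/pages?$top=100"
        for page in iter_collection(fetch_json, pages_url):
            rows.append((section["displayName"], page["title"], page["id"]))
    return rows


def write_csv(rows, path):
    """Write the rows with a header; utf-8-sig helps Excel detect the encoding."""
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.writer(handle)
        writer.writerow(["section", "title", "page_id"])
        writer.writerows(rows)
```

Passing the fetch function in as an argument keeps the paging logic testable without network access.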
An alternative may be to use this Rust OneNote notebook parser, which creates HTML files.
How can I add bookmarks to a PDF using PyMuPDF? I have seen many ways of doing it with PyPDF2, but since I'm already using PyMuPDF for other annotations I would prefer to use it for bookmarks too. I would also like to highlight text and add bookmarks pointing to it.
You cannot add single bookmarks like you can in other packages.
If you have looked at how this is done there, or rather at the relevant part of the PDF specification, you will know that it is an unnecessarily complex task.
PyMuPDF in contrast has this simple approach to offer:
Prepare a Python list that looks like a traditional table of contents (TOC):
Every entry in the list contains the hierarchy level, the text to display, and the page number. Optionally, it can also say where on the target page the bookmark should point.
Then use doc.set_toc(toc_list). All the pesky details are taken care of for you.
If the PDF already has a TOC, extract it to a list of that same structure via toc_list = doc.get_toc().
Then modify as required.
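As a rough sketch of that workflow: the helpers `add_bookmark` and `write_bookmarks` below are hypothetical names, and the file paths are placeholders. Each entry has the form [level, title, page] with 1-based page numbers. The level check mirrors the usual TOC rule that the first entry is level 1 and a level may rise by at most one from the previous entry.

```python
def add_bookmark(toc, level, title, page):
    """Append one [level, title, page] entry after checking the hierarchy."""
    previous_level = toc[-1][0] if toc else 0
    if level < 1 or level > previous_level + 1:
        raise ValueError(
            f"level {level} cannot follow level {previous_level}"
        )
    toc.append([level, title, page])
    return toc


def write_bookmarks(src_path, dst_path, toc):
    """Replace the PDF's table of contents with `toc` and save a copy."""
    import fitz  # PyMuPDF, imported here so the helper above works without it

    doc = fitz.open(src_path)
    doc.set_toc(toc)
    doc.save(dst_path)
    doc.close()


toc = []
add_bookmark(toc, 1, "Introduction", 1)
add_bookmark(toc, 2, "Background", 2)
add_bookmark(toc, 1, "Results", 5)
# write_bookmarks("input.pdf", "output.pdf", toc)  # placeholder file names
```

To keep bookmarks that already exist, start from doc.get_toc() instead of an empty list.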
Before you dismiss this post as using LibreOffice documents THE WRONG WAY, let me explain what I'm trying to achieve. I am generating ODT documents programmatically, which is mostly no big deal. I have hit a wall, however, trying to insert internal references into the document. It's quite simple to include an anchor in content.xml with:
<text:reference-mark text:name="anchor"/>
inside <text:p> element. But when you want to reference it later LibreOffice inserts a reference with the page number. Obviously I don't know the page number where the anchor is, but I can easily include a reference to the anchor with
<text:reference-ref text:reference-format="page" text:ref-name="anchor"/>
The question is: how do I make LibreOffice recreate and insert the page number when it reads the document?
It turns out that LibreOffice does recreate page numbers, provided that some number is actually included as the content of text:reference-ref:
<text:reference-ref text:reference-format="page" text:ref-name="anchor">1</text:reference-ref>
Once the file has been opened, LibreOffice updates the page number as soon as the document changes.
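A small sketch of generating both elements with Python's standard library, in case it helps when building content.xml programmatically. The function names are made up for illustration, and the placeholder "1" is only the dummy number described above, not the real page.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ET.register_namespace("text", TEXT_NS)  # serialize with the usual text: prefix


def q(name):
    """Qualify a local name with the ODF text namespace."""
    return f"{{{TEXT_NS}}}{name}"


def anchor_paragraph(name, content):
    """Build <text:p> holding a reference mark followed by the paragraph text."""
    paragraph = ET.Element(q("p"))
    mark = ET.SubElement(paragraph, q("reference-mark"), {q("name"): name})
    mark.tail = content
    return paragraph


def page_reference(name, placeholder="1"):
    """Build a page reference with dummy content so the number gets refreshed."""
    ref = ET.Element(
        q("reference-ref"),
        {q("reference-format"): "page", q("ref-name"): name},
    )
    ref.text = placeholder
    return ref


print(ET.tostring(anchor_paragraph("anchor", "Target text"), encoding="unicode"))
print(ET.tostring(page_reference("anchor"), encoding="unicode"))
```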
I managed to search the contents of text files using a custom search index, as described in the link below: https://docs.kentico.com/k8/custom-development/miscellaneous-custom-development-tasks/smart-search-api/creating-custom-smart-search-indexes
But it is not able to search in the filename. For example, if my search text is "Roman", the file "RomanRaj.txt" should show up in the results. Please help.
Try adding the file name to your search index through index content customization. See the documentation on this topic.
I'd suggest NOT creating a custom smart search index but look at using attachments and searching those. Out of the box, Kentico will allow you to search attachments and their contents without writing any code.
I am using example source code from the Lucene 4.2.0 demo API:
http://lucene.apache.org/core/4_2_0/demo/overview-summary.html
I run IndexFiles.java to create an index from a directory of rtf, pdf, doc, and docx files. I then run SearchFiles.java and notice several cases where my searches fail, i.e. a document that contains the word I searched for is not returned.
I suspect it has to do with Lucene 4.2.0 not being able to correctly index non .txt files without additional customization.
Question: Can the IndexFiles.java source code (Lucene 4.2.0) correctly index pdf, doc, docx files as it is written in the provided link? Does anyone have examples or references on how to code that functionality?
Thank You
No, it can't. IndexFiles is a demo, an example for you to learn from, and is not really designed for production use. If you take a look at the code, you'll see it just uses a FileInputStream (wrapped with an InputStreamReader, wrapped with a BufferedReader). Generally, Lucene does not handle parsing of different file formats (except its own index files, of course). How to parse a file to provide meaningful content to Lucene is up to you to define.
Apache Tika might be a good place to look for this functionality. Here is a simple example using Tika with Lucene.
You might also consider using Solr.