How do I add bookmarks to a PDF using PyMuPDF? I have seen many ways using PyPDF2, but since I'm already using PyMuPDF for other annotations, I would prefer PyMuPDF for adding bookmarks. I would also like to highlight text and add bookmarks to it.
You cannot add single bookmarks like you can in other packages.
If you have looked at the details there, or rather in the underlying PDF specification, you will see this is an unnecessarily complex task.
PyMuPDF in contrast has this simple approach to offer:
Prepare a Python list that looks like a traditional table of contents (TOC):
Every item in the list contains the hierarchy level, the text to display and the page number. Optionally, it can also say where on the target page the link should point.
Then use doc.set_toc(toc_list). All the pesky detail is taken care of for you.
If the PDF already has a TOC, extract it into a list with that same structure via toc_list = doc.get_toc().
Then modify it as required and write it back. A short sketch of both operations follows.
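A minimal sketch (the file names, bookmark title, searched text and page numbers are placeholders; page numbers in the TOC list are 1-based):

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")

# Each TOC entry is [level, title, page], optionally followed by destination details.
toc = doc.get_toc()                    # existing TOC, or [] if there is none
toc.append([1, "My new bookmark", 3])  # level-1 bookmark pointing at page 3
doc.set_toc(toc)                       # write the whole outline back

# Optionally highlight some text on that page as well.
page = doc[2]                          # page indices are 0-based here
for rect in page.search_for("some text"):
    page.add_highlight_annot(rect)

doc.save("output.pdf")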
Related
I have a fillable form-field PDF that I'd like to fill using a Python script; I can't use JavaScript, which seems to be the preferred solution. Is this something that can be done through a Python script? The catch is that I can't install any libraries like PyPDF2, fillpdf, or fdfgen as in this answer. Is this still possible? Can I homebrew a similar solution?
The fields can be tabbed through, and assuming the alt text of a field is its name, I have those as well, though at least one field appears to be unnamed.
Ideally I'd like to fill multiple specific fields with variables that I can programmatically generate and then save the PDF as a new file.
I am using python-docx-template and python-docx to create a DOCX file with one page. I need to duplicate the page in the document n times. How can I do this with Python?
python-docx doesn't have pages. However, it recognizes sections, so before you load the document with python-docx, make sure you insert section breaks before and after your target page.
However, python-docx currently has no API for grabbing the content of a section. If you really want it, you will have to walk through the underlying XML. You can start looking at it from document.element.body, for example with print(document.element.xml).
You are basically looking for the content between the w:sectPr elements. See the documentation here:
https://python-docx.readthedocs.io/en/latest/dev/analysis/features/sections.html
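If you do go down that route, here is a rough sketch of the idea: duplicate the body content n times by copying the underlying XML elements. The file names and the copy count are placeholders, and it assumes a standard single-section document whose last body element is the w:sectPr.

from copy import deepcopy
from docx import Document

doc = Document("one_page.docx")   # placeholder input file
body = doc.element.body           # the underlying <w:body> element

# The trailing <w:sectPr> holds the section properties and must remain last,
# so the copies are inserted just before it.
sect_pr = body[-1]
content = list(body)[:-1]         # the block-level elements making up the page

n = 3                             # number of extra copies (placeholder)
for _ in range(n):
    for el in content:
        sect_pr.addprevious(deepcopy(el))

doc.save("duplicated.docx")

Note that this simply repeats the content; if each copy should start on a new page, you would also have to insert a page break between copies.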
Before you dismiss this post as using LibreOffice documents THE WRONG WAY, let me explain what I'm trying to achieve. I am programmatically generating ODT documents, which is mostly no big deal. I have hit a wall, however, trying to insert internal references into the document. It's quite simple to include an anchor in the content.xml with:
<text:reference-mark text:name="anchor"/>
inside a <text:p> element. But when you want to reference it later, LibreOffice inserts a reference with the page number. Obviously I don't know the page number where the anchor ends up, but I can easily include a reference to the anchor with
<text:reference-ref text:reference-format="page" text:ref-name="anchor"/>
The question is: how do I make LibreOffice recreate and insert the page number when it reads the document?
It turns out that LibreOffice does recreate page numbers, provided there is actually some number included as the content of text:reference-ref:
<text:reference-ref text:reference-format="page" text:ref-name="anchor">1</text:reference-ref>
When the document is opened, LibreOffice updates the page number as soon as anything in the file changes.
I'm making a personal website with Hakyll, and I'd like to list my publications.
I've found this module and this guide for how to print the references from a markdown document at the bottom.
The problem with this is, it assumes you've got some document, where you cite all the things you want printed.
What I want is to generate a document that lists every entry in my .bib file. In particular:
I don't want to have to manually write the bibtex name of each publication I want listed
I just want the "references" section printed, i.e. there's no place in the document where the publication is referenced, they're just listed at the end.
Is it possible to get this information from the Hakyll.Web.Pandoc.Biblio module? Or do I need to parse the .bib file separately to get it? And once I do, how would I go about generating this page with Hakyll?
You could use this trick from the pandoc manual, the equivalent of biblatex's \nocite{*}:
It is possible to create a bibliography with all the citations, whether or not they appear in the document, by using a wildcard:
---
nocite: |
  @*
---
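If you want to sanity-check the trick outside Hakyll first, put nothing but the metadata into a standalone file, say publications.md (the file names here are placeholders):

---
bibliography: refs.bib
nocite: |
  @*
---

and render it with pandoc's built-in citation processing (recent pandoc versions):

pandoc --citeproc publications.md -o publications.html

Inside Hakyll you would instead wire the .bib and .csl files up through Hakyll.Web.Pandoc.Biblio, but it is the nocite metadata that makes every entry appear without being cited in the text.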
I am new to this topic, but my requirement is to parse documents of different types (HTML, PDF, TXT) using a crawler. Please suggest which crawler to use for my requirement, and point me to some tutorials or examples of how to parse documents with a crawler.
Thank you.
This is a very broad question, so my answer is also very broad and only touches the surface.
It all comes down to two steps, (1) extracting the data from its source, and (2) matching and parsing the relevant data.
1a. Extracting data from the web
There are many ways to scrape data from the web. Different strategies can be used depending on whether the source is static or dynamic.
If the data is on static pages, you can download the HTML source for all the pages (automated, not manually) and then extract the data out of the HTML source. Downloading the HTML source can be done with many different tools (in many different languages), even a simple wget or curl will do.
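For example, a few lines of Python standard library are already enough to save a static page to disk (the URL and file name are placeholders):

from urllib.request import urlopen

url = "https://example.com/page.html"        # placeholder URL
html = urlopen(url).read().decode("utf-8")   # fetch the raw HTML source (assumes UTF-8)

with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)                            # save it for parsing later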
If the data is on a dynamic page (for example, if the data sits behind a form and is only shown after a database query), then a good strategy is to use an automated web-scraping or testing tool. There are many of these.
See this list of Automated Data Collection resources [1]. If you use such a tool, you can extract the data right away; you usually don't have the intermediate step of explicitly saving the HTML source to disk and parsing it afterwards.
1b. Extracting data from PDF
Try Tabula first. It's an open source web application that lets you visually extract tabular data from PDFs.
If your PDF doesn't have its data neatly structured in simple tables or you have way too much data for Tabula to be feasible, then I recommend using the *NIX command-line tool pdftotext for converting Portable Document Format (PDF) files to plain text.
Use the command man pdftotext to see the manual page for the tool. One useful option is the -layout option which tries to preserve the original layout in the text output. The default option is to "undo" the physical layout of the document, and instead output the text in reading order.
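For example (the file names are placeholders):

pdftotext -layout report.pdf report.txt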
1c. Extracting data from spreadsheets
Try xls2text for converting to text.
2. Parsing the (HTML/text) data
For parsing the data, there are also many options. For example, you can use a combination of grep and sed, or the BeautifulSoup Python library if you're dealing with HTML source, but don't limit yourself to these options; use a language or tool that you're familiar with.
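As a small illustration, here is a sketch that pulls every link out of a saved HTML file with BeautifulSoup (the file name is a placeholder, and the bs4 package has to be installed separately):

from bs4 import BeautifulSoup

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")  # parse the saved HTML source

# Print the text and target of every link on the page.
for a in soup.find_all("a", href=True):
    print(a.get_text(strip=True), a["href"])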
When you're parsing and extracting the data, you're essentially doing pattern matching.
Look for unique patterns that make it easy to isolate the data you're after.
One method of course is regular expressions. Say I want to extract email addresses from a text file named file.
egrep -io "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b" file
The above command will print the email addresses [2]. If you instead want to save them to a file, append > filename to the end of the command.
[1] Note that this list is not exhaustive; it's missing many options.
[2] This regular expression isn't bulletproof, there are some extreme cases it will not cover.
Alternatively, you can use a script that I've created which is much better for extracting email addresses from text files. It's more accurate at finding email addresses, easier to use, and you can pass it multiple files at once. You can access it here: https://gist.github.com/dideler/5219706