PolyAnalyst: Is there a way to read in the bookmarks of a PDF file? - nlp

I have a PDF with bookmarks. I tried to read it into PolyAnalyst, but it looks like it completely ignores the bookmarks. Is there a way to make PA keep the bookmarks of the file?

There is currently no method for extracting PDF bookmark metadata. You can, however, extract bookmark metadata from a Microsoft Word file.

Related

Edit contents of RTF file with Powershell - Hyperlink/Mailto

I'm trying to update a mailto value in an RTF document with a PowerShell script, but the issue seems to be related to the RTF file rather than to PowerShell, because I can't get it to work when doing it by hand either:
I have previously made small find-and-replace changes to RTF files by finding the relevant text and changing it within the file, using plain string processing rather than any RTF library or cmdlet. With more recent RTF files, updating the mailto: value in the raw file text does not update the address that a new message is sent to; the previous value is still used for the new message's To field. At that point the previous value no longer appears in the file in plain text, yet it is still used once the mailto link is clicked.
I am aware that RTF files have changed over time and that not all of the content is purely ASCII text and formatting control markup, so I presume that the mailto: target is held somewhere that is not plain text. I need to know where this other instance of the data is held in the file, and a way to edit it.
[Screenshot: mailto still showing the old email that no longer appears in plaintext in file]
[Screenshot: mailto value updated in file]
Thanks for any thoughts or suggestions for things to try next!
Kind regards,
Oscar
I am able to update HTML and TXT files just fine, but more recent RTF files do not show the updated values, apparently because they store the hyperlink target in a second place in the file that is not human-readable. Updating other elements just by changing the human-readable instances of them in the file works fine; only the hyperlink 'mailto:' section fails. Updating the link in a word processor causes the human-readable 'mailto:' section to be updated when viewed in a text editor, but updating the value in a text editor and then opening the file in the word processor does not show the update. So, as far as I've established, the value is stored in multiple places and the plain-text version is not the one used when they differ.
Perhaps there is an RTF cmdlet or library that lets you edit the binary portion (or whatever alternate location it's held in) of the RTF file, or perhaps it would be easier to create the RTF version of the file from the properly updated HTML version. Open to ideas!
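Not a confirmed answer, but one way to test the presumption above is to look for the old address in hex-encoded form. The sketch below assumes (unverified for these particular files) that the word processor writes a second copy of the hyperlink target as hex digits of its UTF-16LE bytes, for example inside a `\datafield` group, possibly wrapped across lines. It is in Python rather than PowerShell, and the function names are illustrative only.

```python
import re


def utf16le_hex(text):
    """Return the lowercase hex digits of `text` encoded as UTF-16LE."""
    return text.encode("utf-16-le").hex()


def find_copies(rtf_text, address):
    """Count plain-text and hex-encoded (UTF-16LE) occurrences of `address`.

    Whitespace is stripped before the hex search because hex data in RTF
    may be broken across lines. This is a diagnostic only: it does not
    parse RTF groups, so it reports matches anywhere in the file.
    """
    compact = re.sub(r"\s+", "", rtf_text).lower()
    return {
        "plain": rtf_text.count(address),
        "hex_utf16le": compact.count(utf16le_hex(address)),
    }
```

If `hex_utf16le` is non-zero for the old address while `plain` is zero, that would match the behaviour described. Replacing the hex run with the encoding of a new address may not be enough on its own if the surrounding binary data also records lengths, so regenerating the RTF from the updated HTML could still be the simpler route.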

Can we print a docx file without opening it, using any Office DLL?

Can I print a docx file, or convert it to PDF, without opening it in Winword via VSTO or OOXML, on a desktop/server with MS Office installed? Is there a DLL that can do this?
I need to print and convert a bunch of files to PDF.
Yes, you can.
There are lots of libraries, such as:
Free Word API to Operate DocX documents - Convert your docx to PDF and then print it.
Aspose.Words: more or less the standard library for working with docx in .NET, but not really cheap.
In addition, you can find some similar questions here and here.

How to preview word files through documents portlet like pdf files?

I have added the documents portlet and uploaded a few PDFs and Word files to it.
When I click on the PDF files they generate a preview, but the docx files don't. Why?
I want to preview the docx files as well. Is there a way to achieve this?
Liferay makes use of OpenOffice/LibreOffice for deciphering MS-Office documents. You'll have to set up a connection to OpenOffice or LibreOffice (which has to run in server mode) in order to use it the way you want.

How to programmatically create MS Office .doc or .docx files on a linux server

In the past I've used catdoc for reading .doc files, but now I need to write them.
What is the best way to go about this? I don't need it to be perfect or fully featured.
A quick and dirty way would be to write your file as HTML and save it with a .doc extension.
Because Word can open HTML, you would effectively have a Word file.
Beware that when you open the file in Word, the "web-view-mode" (Web Layout view) is sometimes selected.
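A minimal sketch of the HTML-as-.doc trick described above, in Python. The function name and arguments are illustrative, and it assumes plain paragraphs are all you need; the result is an HTML file with a .doc extension, not a real binary Word document.

```python
from html import escape
from pathlib import Path


def write_html_as_doc(path, title, paragraphs):
    """Write a small HTML document to `path` (e.g. 'report.doc')."""
    # Escape user text so characters like < and & survive as literals.
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    html = (
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"<h1>{escape(title)}</h1>\n"
        f"{body}\n"
        "</body>\n</html>\n"
    )
    Path(path).write_text(html, encoding="utf-8")
    return path
```

For example, `write_html_as_doc("report.doc", "Weekly report", ["First paragraph.", "Second paragraph."])` produces a file that Word should open as a document, subject to the view-mode caveat above.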

Converting MS documents to csv files

I have a large batch of MS Office documents and I'm using Ubuntu. I need to convert all of these documents to CSV format.
Is there any way to do it?
You could try opening them with OpenOffice.org and then saving them as CSV files.
Look for the 'xlhtml' package.
It's far from perfect, but if your Excel documents are simple it'll probably work.
It'll only convert Excel files. If you want to convert Word to plain text, look at 'antiword'.
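The open-and-save-as-CSV suggestion above can also be run in bulk without the GUI. This is a rough sketch in Python, assuming a LibreOffice install whose `soffice` binary is on the PATH and supports `--headless --convert-to`; the function names and the `*.xls` pattern are illustrative, and the results for anything other than simple spreadsheets are not guaranteed.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir, target="csv"):
    """Build the soffice command line that converts one file."""
    return [
        "soffice", "--headless",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_all(src_dir, out_dir, pattern="*.xls", target="csv"):
    """Convert every file in `src_dir` matching `pattern`, one at a time."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob(pattern)):
        # check=True raises CalledProcessError if soffice exits non-zero.
        subprocess.run(build_convert_command(src, out_dir, target), check=True)
```

Calling `convert_all("docs", "csv_out")` would write one CSV per matching spreadsheet into `csv_out`.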
