Convert Microsoft Office documents to Text [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
I'm looking for a library (or command line tool) to turn MS Office documents into either plaintext or HTML (for conversion to text).
It must run on Linux (not via Wine!).
I found antiword, but its last release was in 2005, so it won't read the new Office 2007 formats.
I need it to read Word, Excel, and PowerPoint documents.

The new office 2007 format is just (ZIP) compressed XML.
All the text (in at least the .docx format) is located, once you decompress the file, in the document.xml file inside the word folder. Strip out all the XML tags and you'll get the text. You'll lose the formatting, no doubt, but if you want to do text indexing or something like it, the formatting isn't relevant anyway. The order is preserved.
I haven't analyzed Excel and PowerPoint, but the approach should be similar. Excel might be trickier, depending on how the cells are stored in the XML file.
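The unzip-and-strip approach described above can be sketched in Python using only the standard library. This is a minimal sketch, not a full converter: the function name docx_to_text is made up for illustration, it reads only word/document.xml (so headers, footers, and footnotes are ignored), and it assumes the usual WordprocessingML namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the plain text of a .docx file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper, named here for illustration only.)
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # <w:t> elements hold the visible text runs of a paragraph.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Parsing the XML rather than deleting tags with a regular expression keeps paragraph boundaries, which is usually what a text indexer wants.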

The Apache POI library can extract text from office formats. This is used by Tika in Lucene. Tika can be executed as a command line tool:
curl http://.../document.doc \
| java -jar tika-app-x.y.jar --text \
| grep -q keyword

PyODConverter automates OpenOffice; use it to do the conversions.
The OONinja example converts Doc to PDF, but any import or export format that OpenOffice supports should work. It also has the advantage of running headless if required.
Other options include:
AbiWord
or, if you really just want a command-line tool, wvWare, though I don't think it supports Docx.

You can use Autonomy Keyview with the appropriate licence to use in your application. It seems to be extremely powerful and can extract text from almost everything; we use it to identify text within arbitrary format files.
I've no idea what the licensing terms are, but they're available from your account manager :)

Related

How to remove PDF/A markers without using Adobe Acrobat in GNU/Linux

I am trying to remove the PDF/A markers in a file — I have no access to Adobe Acrobat — as some tools balk at PDF/As. Is there a way to revert a PDF/A to a normal PDF with free software tools? I am running Debian testing.

The indicators for PDF/A are in the Metadata entry, but you do not want to erase that entire entry. Instead you would want to modify it.
To modify, you can extract the XML string, modify using whatever XML tool is handy for you, and then "update".
These three entries are the ones you want to erase.
pdfaid:part
pdfaid:amd
pdfaid:conformance
Of course this still leaves you with the following tasks, with 1 and 3 normally done using a PDF SDK library.
1. Find and extract the Metadata entry (it could be compressed in the PDF)
2. Read and edit the XML (should be trivial)
3. Update the Metadata entry with your modified XML
Since you gave no indications of platform+OS I can't advise any further.
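The middle task (editing the XML) can be sketched as follows, assuming the XMP packet has already been extracted as a string by some PDF library. The function name strip_pdfa_markers is hypothetical, and the sketch assumes the conventional pdfaid prefix is used in the packet. It handles the properties whether they are written as child elements or as attributes, and it leaves the rest of the metadata untouched.

```python
import re

# The three PDF/A identification properties named above.
PDFA_KEYS = ("part", "amd", "conformance")


def strip_pdfa_markers(xmp):
    """Remove pdfaid:part, pdfaid:amd and pdfaid:conformance from an XMP string.

    Assumes the properties use the conventional `pdfaid` prefix.
    (Hypothetical helper, named here for illustration only.)
    """
    keys = "|".join(PDFA_KEYS)
    # Element form: <pdfaid:part>1</pdfaid:part>
    xmp = re.sub(r"\s*<pdfaid:(%s)>[^<]*</pdfaid:\1>" % keys, "", xmp)
    # Attribute form: pdfaid:part="1"
    xmp = re.sub(r"""\s+pdfaid:(?:%s)\s*=\s*(["'])[^"']*\1""" % keys, "", xmp)
    return xmp
```

Finding the Metadata stream and writing the edited packet back into the PDF still needs a PDF library, as noted above.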

XLSX corrupt after accidental open & save in Notepad [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 5 years ago.
Is there a programmatic, built-in, or external method to recover a corrupt Excel file?
I accidentally opened my .XLSX file (Excel 2010) in Windows Notepad, added a line of recognizable text, and saved it. Now the file cannot be opened by Excel, as the internally compressed .XLSX file cannot be uncompressed.
My research showed me that:
The .XLSX file is a compressed archive, starting with PK….
By saving the file in Notepad, all Null/\0 characters were replaced with spaces (0x20).
Before the mistake happened, the file already contained several hundred 0x20 characters, so replacing all 0x20 with 0x00 won't help.
Any ideas are appreciated! Thanks in advance!

Wow, that sucks, bigtime.
I've seen lots of corrupt files caused lots of different ways but that's a new one.
That series of events is probably near the top of the list of "Things not to do to Excel files." We're sorry, we'll have to suspend your "Excel Operator's License".
I guess there are a couple of things you could try...
Plan A
Is there any chance you have AutoRecover turned on? If not, you should probably turn it on now, for "next time".
If it is on, then hopefully the Document Recovery task pane appears when you try to open the file (in Excel). If so, see:
Office.com : Recover your Office Files
Even if your file doesn't show up in that dialog, double-check all of the folders/files located in the default AutoRecover save location to see if a recent version was saved, "just in case":
%AppData%\Microsoft\Excel
Plan B:
Since an XLSX file is actually just a .ZIP file, there's a chance you may be able to use WinZip or a ZIP Repair Utility to recover your data (depending, of course, on how badly you messed it up.)
Change the file's extension to .ZIP and open it with WinZip.
It may try to repair the file, and if it does it might even succeed (on some or all "parts").
Put all the files back into a new .ZIP.
Change the extension back to .XLSX.
Cross your fingers and try opening it with Excel.
There are also lots of standalone ZIP Repair Utilities out there, so you could try a few others with the same process.
I have no idea if any will actually work in this case, but please report back if you do end up trying any of them (whether or not they fix it), so we all know.
More about Excel File Structure:
XML & ZIP: Explore Your Excel Workbooks File Structure
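Before reaching for a repair utility, the Plan B check can be scripted: open the file as a ZIP archive and see which parts still decompress and pass their CRC check. The sketch below uses Python's standard zipfile module; the function name salvage_zip_members is made up for illustration. With damage as severe as the Notepad case, the archive's central directory may be unreadable, in which case nothing is recovered.

```python
import zipfile


def salvage_zip_members(path_or_file):
    """Try to read every member of a ZIP archive (an .xlsx is one).

    Returns (recovered, failed): a dict of name -> bytes for members that
    decompressed and passed the CRC check, and a list of names that did not.
    (Hypothetical helper, named here for illustration only.)
    """
    recovered, failed = {}, []
    try:
        archive = zipfile.ZipFile(path_or_file)
    except zipfile.BadZipFile:
        # The central directory itself is unreadable; nothing to salvage here.
        return recovered, failed

    with archive:
        for name in archive.namelist():
            try:
                recovered[name] = archive.read(name)
            except Exception:
                # Bad CRC, truncated data, broken deflate stream, and so on.
                failed.append(name)
    return recovered, failed
```

If the worksheet parts (typically under xl/worksheets/) show up in the recovered dict, they can be written into a fresh archive as described in the steps above.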
Plan C:
Short of that working, you could try opening it back up in Notepad and see if there's any legible data you can copy & paste out manually... might be there a while, if there's anything at all...
Plan D:
There is no Plan D. Sorry, you're SOL.
How to turn on AutoRecover (for next time):
Click File → Options → Save
Make sure the Save AutoRecover information every x minutes box is selected
Important: Even after turning AutoRecover on, the Save button is still your best friend.
To be sure you don’t lose your latest work, click Save (or press Ctrl+S) often.
Oh, and in the future:
Don't open Excel files in anything but Excel.
Source & More Info:
Office.com: Use AutoSave and AutoRecover to help protect your files in case of a crash

Edit an applescript file from a linux computer [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
Though applescript appears to be a scripting language like any other (wikipedia/applescript), for reasons I don't understand it seems these scripts are often saved as binaries. It seems like this isn't an issue for someone working on a Mac with a mac-based text editor that can open these scripts into a plain-text format where they can be edited and read, but for the rest of us, we just see gibberish. For instance, Github has many examples of .scpt files committed to repositories instead of/without the plain-text equivalent (a bit of Googling suggests this would be a .applescript file instead)
Question: Is there an open-source tool that can parse and serialize these binaries so that they can be viewed/edited in a standard plain text editor and saved back as .scpt?
(My context: I'd like to provide a user-friendly, os native button-click way to launch my application on a mac, rather than tell users to open a bash terminal and type stuff.)
Edit I only have access to a linux machine, I don't own a mac.

Instead of trying to create an AppleScript on a non-Mac, what you can do is simply name your shell script file with a .command suffix and make sure that it has execute POSIX permissions for the user. The user can then double-click the file in the Finder to execute your script instead of having to enter Terminal commands.
If you would like to take advantage of AppleScript commands within your shell script file to add some simple GUI functionality, you can use the osascript command.
BTW, for reference: on a Mac the application "Script Editor" (or "AppleScript Editor" on older systems) is generally used to create AppleScripts. It provides several save options - the .scpt binary and .applescript plain text files you noted as well as .scptd script bundles and .app standard, double-clickable applications.
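Because the .command approach only needs a text file with the execute bit set, the launcher can be generated on the Linux machine itself. The following is a minimal sketch; write_command_file is a hypothetical helper, and the launch command in the usage example is a placeholder for whatever actually starts your application.

```python
import os
import stat


def write_command_file(path, script_body):
    """Write a shell script to `path` and mark it executable (like chmod +x).

    (Hypothetical helper, named here for illustration only.)
    """
    if not path.endswith(".command"):
        raise ValueError("expected a file name ending in .command")
    # Force Unix line endings so the shebang line is parsed correctly.
    with open(path, "w", newline="\n") as handle:
        handle.write("#!/bin/sh\n" + script_body.rstrip("\n") + "\n")
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path
```

For example, write_command_file("Launch.command", "osascript -e 'display dialog \"Starting...\"'\n./myapp") would produce a script that shows a dialog through osascript and then runs a placeholder ./myapp. The execute bit has to survive however the file is shipped, so distribute it in a format that preserves permissions (a tar archive or a git repository, for instance).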

CLI pdf viewer for linux [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Hey, for quite a while now I have been looking for a pdf viewer for the command line.
As I like to work without X on Linux, and often work on a remote machine, I would like to have a tool to read pdfs. There are quite a lot of really good graphical programs (evince, okular, acroread, ...) to do the job, so I figured there should be at least one decent text-mode tool. But I don't even know of a crappy one!
Currently, I either start X only to read pdfs, or use pdftohtml+lynx.
However, the latter does not produce very good output, and most documents are just unreadable, especially if they contain mathematical formulas.
Google is full of people saying either it's not possible or suggesting the pdftohtml version.
I realise, this is not exactly a programming question, but I am currently considering starting a project to implement such a program, unless there already is a good one out there.
Thanks for any suggestions.

Hi, I don't think you need to write a program just to read pdf files in console mode, because the less command can already do it for you (on systems where less is set up with a lesspipe-style input preprocessor that handles PDFs). So use it and enjoy it.
less "the name of pdf file"

Ok, you asked to know even "crappy" ones. Here are two (decide yourself about their respective crappiness):
First: Ghostscript's txtwrite output device
gs \
-dBATCH \
-dNOPAUSE \
-sDEVICE=txtwrite \
-sOutputFile=- \
/path/to/your/pdf
Second: XPDF's pdftotext CLI utility (better than Ghostscript):
pdftotext \
-f 13 \
-l 17 \
-layout \
-opw supersecret \
-upw secret \
-eol unix \
-nopgbrk \
/path/to/your/pdf \
- | less
This displays pages 13 (first page) through 17 (last page) of a PDF file protected by both a user password (secret) and an owner password (supersecret), preserving the layout, using the Unix EOL convention, and without inserting page breaks between PDF pages, all piped through less.
pdftotext -h displays all available commandline options.
Of course, both tools only work for the text parts of PDFs (if they have any). Oh, and mathematical formulas also won't work too well... ;-)
Edit: I had mis-typed the command above (originally using pdftops instead of pdftotext).
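If you want to call pdftotext from a script rather than type the options each time, the command line above can be assembled programmatically. This is a rough sketch: the function names are invented for illustration, only the page-range, layout, EOL, and page-break options from the example are covered, and extract_pages assumes pdftotext is installed and on the PATH.

```python
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None, layout=True):
    """Build a pdftotext argument list that writes the text to stdout.

    (Hypothetical helper, named here for illustration only.)
    """
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    if layout:
        cmd.append("-layout")
    # A trailing "-" as the output file name sends the text to stdout.
    cmd += ["-eol", "unix", "-nopgbrk", pdf_path, "-"]
    return cmd


def extract_pages(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.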

There is also the green PDF viewer. There is a demo on YouTube.

By the way, I'm always in the same situation, and I use mc (Midnight Commander), which handles text PDFs very well...
Just view the file (F3) in mc

Try fbgs, which should be provided by the fbi or fbida package depending on your distribution. Note that it only works in real terminals (ttys).
http://web.archive.org/web/20150316143120/http://linuxers.org/howto/how-open-pdf-files-linux-console-using-fbgs-framebuffer-pdf-viewer

fbpdf is a framebuffer pdf viewer.
There is also a fork, jfbpdf, but at the moment I am not able to get it working.

This would only work if your PDF document is structured, i.e. it is a tagged PDF document.
This is required to get the correct reading-order of the text objects in the document.
Tagged PDF documents also allow you to re-flow the document, though I am not aware of any tool doing that with command line output.

Can I embed an exe payload in a pdf, doc, ppt or any other file format? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Is there any way that I can embed a .exe file in a .pdf, .doc, .xls, or .ppt file in such a way that upon opening the containing file, the document processor will run the .exe automatically without the user intentionally executing it?

Yes, this is totally possible and pretty easy to accomplish - so long as you have an active exploit in the PDF viewer. Check out one of the many Adobe Acrobat Exploits in the Metasploit framework. Next you can use a download+exec shellcode to download and execute your payload, err I mean ".exe".

You can embed files in EXE or any other format. However, the ability to have the EXE run automatically depends on the viewer application and its security settings. This PDF feature has been exploited by a lot of malware, so there is no guarantee that it will work on all end-user systems. Be warned that if you make this feature part of a commercial application, security software will soon flag it as malware, which can adversely affect your company's reputation.

Yes. Besides using an exploit, you can just paste the file in using Acrobat Professional. Acrobat allows you to add arbitrary attachments these days.
If you make your PDF files with pdflatex, you can embed any file using the embedfile package. I use this frequently to add all kinds of files to PDF files. They show up as attachments.
\usepackage{embedfile}
\embedfile{my-wonderful-file.exe}
You can also use the Acrobat GUI to do it.

In short, no. These file formats have no provision for embedding a Win32 PE executable inside of them.
For the Office files, you could use VBA to write a script that runs when the document is opened.
