Convert doc to txt via commandline - linux

We're looking for a program that converts a .doc or .docx document to a .txt file. We're working on Linux and want to start a website that converts user-uploaded doc files. We don't want to use OpenOffice/LibreOffice because we've had bad experiences with it, and Pandoc can't handle .doc files :/
Does anyone have an idea?

You will have to use two different command-line tools, depending on whether you are working with the .doc or the .docx format.
For .doc use catdoc:
catdoc foo.doc > foo.txt
For .docx use docx2txt:
docx2txt foo.docx
The latter will produce a file called foo.txt in the same directory as the original.
I'm not sure which Linux distribution you are using, but both catdoc and docx2txt are available from the Ubuntu repositories, for example:
apt-get install docx2txt
Or with Homebrew on Mac:
brew install docx2txt
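Since the website will receive both formats, the choice between the two tools can be made from the file extension. A minimal sketch, assuming catdoc and docx2txt are installed; the function names doc_to_txt_cmd and doc_to_txt are made up for illustration, and passing "-" as the output name relies on docx2txt's documented option to write to stdout:

```shell
# Sketch: report which converter applies to a file, based on its extension.
# doc_to_txt_cmd is a hypothetical helper name, not part of either tool.
doc_to_txt_cmd() {
    case "$1" in
        *.docx|*.DOCX) echo "docx2txt" ;;
        *.doc|*.DOC)   echo "catdoc" ;;
        *)             echo "unsupported"; return 1 ;;
    esac
}

# Sketch: convert one file and write the text to stdout.
# Assumes catdoc and docx2txt are installed and on PATH.
doc_to_txt() {
    case "$(doc_to_txt_cmd "$1")" in
        catdoc)   catdoc "$1" ;;
        docx2txt) docx2txt "$1" - ;;   # "-" asks docx2txt to write to stdout
        *)        echo "unsupported file type: $1" >&2; return 1 ;;
    esac
}
```

It could then be called as doc_to_txt upload.docx > upload.txt. Checking the extension alone is a simplification; an uploaded file may not be what its name says.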

Here is a Perl project which claims to do it. I have also done a lot of this by hand, using XSLT on document.xml. The .docx file itself is just a zip file; you can unzip it and inspect the elements. This is not hard to do for specific files, but it is very hard in the general case, because of the lack of documentation for how Word stores things internally and the variance in internal representation.
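To illustrate the "just a zip file" point, here is a rough sketch that pulls text out of word/document.xml with sed instead of XSLT. It assumes GNU sed and unzip; docx_xml_to_text is a made-up name. It only handles the simple case: paragraph ends become newlines, every other tag is dropped, and XML entities such as &amp; are left undecoded.

```shell
# Sketch: turn WordprocessingML read from stdin into rough plain text.
# docx_xml_to_text is a hypothetical helper name.
# Assumes GNU sed, which accepts "\n" in the replacement text.
docx_xml_to_text() {
    # First expression: end of paragraph (</w:p>) becomes a newline.
    # Second expression: strip every remaining tag.
    sed -e 's/<\/w:p>/\n/g' -e 's/<[^>]*>//g'
}

# Typical use (assumes unzip is installed and foo.docx exists):
#   unzip -p foo.docx word/document.xml | docx_xml_to_text > foo.txt
```

Tabs, line breaks inside a paragraph, tables, headers and footnotes are not treated specially here, which is exactly the general-case difficulty described above.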

For doc files you may use antiword, it's available on Homebrew and Ubuntu.

You can also use pandoc (for .docx files):
To keep a layout close to what you see when viewing the document (lines are wrapped):
pandoc -s mydocument.docx -o output.txt
To get a newline only where the original text has an explicit line break:
pandoc --wrap=none -s mydocument.docx -o output.txt

Related

Convert old .doc format to the new .docx format on Linux?

Is there a command on Linux to convert an old MS Word document (.doc) to the new .docx format?
A similar question was asked here. You can use the following commands from this git repository:
First, install unoconv.
On Ubuntu, use this command:
sudo apt-get install unoconv
Then go to your file's location and run:
unoconv -d document --format=docx {your_file_name}.doc
Or, to convert all .doc files in that directory:
unoconv -d document --format=docx *.doc
Hope this works for you.

How can I convert rtf to txt files on Kali Linux by command-line?

I have 30 RTF files that I need to convert to .txt. Is there a script I can run on the command line to convert multiple files at once?
I'm using Kali Linux 2020.2.
Thanks
I've just had the same problem and tested the recommended programs here.
Of those, I found unoconv to work the best. A one-liner to convert everything in one directory is:
for rtf in rtf/*.rtf; do unoconv -f txt "$rtf"; done
Please note that unoconv adds a BOM (byte order mark) at the beginning of each .txt file (at least for my RTF files), so if your later processing can't handle that, you should strip it.
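One way to strip the BOM afterwards is with sed. This is a sketch, assuming GNU sed (for -i and the \xHH escapes) and a UTF-8 BOM (bytes EF BB BF); strip_bom is a made-up helper name:

```shell
# Sketch: remove a leading UTF-8 BOM (bytes EF BB BF) from a file, in place.
# strip_bom is a hypothetical helper name; assumes GNU sed.
strip_bom() {
    # LC_ALL=C makes sed treat the file as raw bytes.
    # The "1" address restricts the substitution to the first line.
    LC_ALL=C sed -i '1s/^\xEF\xBB\xBF//' "$1"
}

# Example, assuming the .txt files were written next to the .rtf files:
#   for txt in rtf/*.txt; do strip_bom "$txt"; done
```

Files without a BOM are left unchanged, so it is safe to run over the whole directory.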

How to convert .xls file which has html tags in it to delimited .csv from command line

We have a requirement to download an export from the Jira portal and insert the information into a table for reporting. The challenge is that the file downloaded from Jira has a .xls extension but contains HTML links. When I use xls2csv (from the catdoc package) and other converter tools on the command line, they can't recognize the file format and fail to convert it. I need this file converted to a .csv file with some delimiter so that I can use SQL*Loader to load it into a table.
If you have libreoffice installed on your system, running the following code on a bash terminal (as a normal user, not root - see the reason here) might help you:
libreoffice --invisible --convert-to csv my_file.xls
If you need this command in a script that runs as root, you can still run it safely by executing it as a "normal" user instead of root, such as:
su - myuser -c 'libreoffice --invisible --convert-to csv my_file.xls'
To find out which user to run the command above as, one of the best options is the logname command, such as:
myuser="$(logname 2>/dev/null)"
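Putting the two pieces together might look like the sketch below. The helper name pick_login_user and the fallbacks to SUDO_USER and id -un are my own assumptions for the case where logname prints nothing (for example, when there is no controlling terminal); they are not part of the commands above.

```shell
# Sketch: choose the user to run the conversion as.
# pick_login_user is a hypothetical helper name. The SUDO_USER and
# "id -un" fallbacks are assumptions for when logname fails.
pick_login_user() {
    logname 2>/dev/null || echo "${SUDO_USER:-$(id -un)}"
}

# Sketch: run the conversion as that user from a root script.
# Assumes libreoffice is installed and the file is readable by that user.
convert_xls_as_user() {
    myuser="$(pick_login_user)"
    su - "$myuser" -c "libreoffice --invisible --convert-to csv '$1'"
}
```

Because su - starts a login shell in that user's home directory, passing an absolute path (for example convert_xls_as_user /data/my_file.xls, a made-up path) is safer than a relative one. The single quotes around the path break if the file name itself contains a single quote.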

Gitbash version does not allow grep -o, is it possible to install new grep package?

I am trying to do a directory-wide search for specific strings in JSON files. The only problem is that these JSON files are only one line, so when I cat all of them, all strings occur a magical "1" time...since there's only one line even when I string them all together.
An easy solution, which I see a lot (here and here), is grep -o. Only problem is it doesn't come standard on my Gitbash. I solved my immediate problem by just installing the latest Cygwin. However, I'm wondering if there was an easier/more granular solution. Is it possible to do the equivalent of "apt-get install" or similar on Gitbash? Or can someone explain to me a quick-and-dirty way to extract and install the tar file in Gitbash?
The other approach is to:
use a cmd session (using the git-cmd.bat which is packaged with Git for Windows)
use the grep included in Gnu for Windows, which supports the -o option (and actually allows you to use most of the other Unix commands that your script might currently be using)
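For reference, once a grep with -o support is available (through Cygwin or the Gnu for Windows tools), counting occurrences in single-line JSON could look like this sketch; count_occurrences is a made-up helper name:

```shell
# Sketch: count every occurrence of a literal string read from stdin,
# even when the whole input is a single line.
# count_occurrences is a hypothetical helper name.
count_occurrences() {
    # -o prints each match on its own line; -F treats the pattern as a
    # literal string; wc -l then counts the matches.
    grep -o -F -- "$1" | wc -l | tr -d ' '
}

# Usage: cat *.json | count_occurrences '"status"'
```

Without -o, grep counts matching lines, which is why one-line JSON files always report a count of 1.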

Convert xlsx to csv in Linux with command line

I'm looking for a way to convert xlsx files to csv files on Linux.
I do not want to use PHP/Perl or anything like that since I'm looking at processing several millions of lines, so I need something quick. I found a program on the Ubuntu repos called xls2csv but it will only convert xls (Office 2003) files (which I'm currently using) but I need support for the newer Excel files.
Any ideas?
The Gnumeric spreadsheet application comes with a command line utility called ssconvert that can convert between a variety of spreadsheet formats:
$ ssconvert Book1.xlsx newfile.csv
Using exporter Gnumeric_stf:stf_csv
$ cat newfile.csv
Foo,Bar,Baz
1,2,3
123.6,7.89,
2012/05/14,,
The,last,Line
To install on Ubuntu:
apt-get install gnumeric
To install on Mac:
brew install gnumeric
You can do this with LibreOffice:
libreoffice --headless --convert-to csv "$filename" --outdir "$outdir"
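For reasons not clear to me, you might need to run this with sudo. You can make LibreOffice work with sudo without requiring a password by adding a line like this to your sudoers file (sudoers needs the full path to the command, so adjust it to wherever libreoffice is installed; the % prefix assumes the rule is meant for the users group):
%users ALL=(ALL) NOPASSWD: /usr/bin/libreoffice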
If you already have a desktop environment, then I'm sure Gnumeric or LibreOffice would work well, but on a headless server (such as Amazon Web Services), they require dozens of dependencies that you also need to install.
I found this Python alternative: xlsx2csv
pip install xlsx2csv
xlsx2csv file.xlsx > newfile.csv
It took two seconds to install and works like a charm.
If you have multiple sheets, you can export all at once, or one at a time:
xlsx2csv file.xlsx --all > all.csv
xlsx2csv file.xlsx --all -p '' > all-no-delimiter.csv
xlsx2csv file.xlsx -s 1 > sheet1.csv
The author also links to several alternatives built in Bash, Python, Ruby, and Java.
Use csvkit:
in2csv data.xlsx > data.csv
For details, check their excellent documentation.
In Bash, I used this LibreOffice command (executable libreoffice) to convert all my .xlsx files in the current directory:
for i in *.xlsx; do libreoffice --headless --convert-to csv "$i" ; done
Close all your LibreOffice open instances before executing, or it will fail silently.
The command takes care of spaces in the filename.
I tried it again some years later, and it didn't work. This question gives some tips, but the quickest solution was to run it as root (or with sudo libreoffice). It is not elegant, but it is quick.
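Because the conversion can fail silently, it may help to check that each expected .csv actually appeared. A sketch, assuming LibreOffice writes the output into the current directory under the same base name; expected_csv and convert_all_xlsx are made-up names:

```shell
# Sketch: name of the CSV that is expected for a given input file
# (same base name, .csv extension). Hypothetical helper name.
expected_csv() {
    printf '%s\n' "${1%.*}.csv"
}

# Sketch: convert every .xlsx in the current directory and report inputs
# for which no non-empty CSV exists afterwards.
# Assumes libreoffice is installed and no other instance is running.
convert_all_xlsx() {
    for i in *.xlsx; do
        [ -e "$i" ] || continue    # the glob matched nothing
        libreoffice --headless --convert-to csv "$i"
        if [ ! -s "$(expected_csv "$i")" ]; then
            echo "conversion failed or produced no output: $i" >&2
        fi
    done
}
```

A CSV left over from an earlier run would hide a failure, so removing old output first makes the check more reliable.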
Use the command scalc.exe in Windows.
Another option would be to use R via a small Bash wrapper for convenience:
xlsx2txt(){
echo '
require(xlsx)
write.table(read.xlsx2(commandArgs(TRUE)[1], 1), stdout(), quote=FALSE, row.names=FALSE, col.names=TRUE, sep="\t")
' | Rscript --vanilla - "$1" 2>/dev/null
}
xlsx2txt file.xlsx > file.txt
If the .xlsx file has many sheets, the -s flag can be used to get the sheet you want. For example:
xlsx2csv "my_file.xlsx" -s 2 second_sheet.csv
second_sheet.csv would contain the data of the second sheet in my_file.xlsx.
Using the Gnumeric spreadsheet application, which comes with a command-line utility called ssconvert, is indeed super simple:
find . -name '*.xlsx' -exec ssconvert -T Gnumeric_stf:stf_csv {} \;
and you're done!
If you are OK with running a Java command line, you can do it with Apache POI HSSF's Excel Extractor. It has a main method that is described as the command-line extractor. This one seems to just dump everything out. They point to this example that converts to CSV. You would have to compile it before you can run it, but it too has a main method, so you should not have to do much coding per se to make it work.
Another option that might fly, but will require some work on the other end, is to have your Excel files come to you as Excel XML Data or XML Spreadsheet, or whatever MS calls that format these days. It will open a whole new world of opportunities for you to slice and dice it the way you want.
You can use executable libreoffice to convert your .xlsx files to csv:
libreoffice --headless --convert-to csv ABC.xlsx
Argument --headless indicates that we don't need GUI.
As others said, the libreoffice executable can convert Excel files (.xls) to CSV. The problem for me was the sheet selection.
This LibreOffice Python script does a fine job at converting a single sheet to CSV.
Usage is:
./libreconverter.py File.xls:"Sheet Name" output.csv
The only downside (on my end) is that --headless doesn't seem to work. I have a LibreOffice window that shows up for a second and then quits.
That's OK with me; it's the only tool that does the job rapidly.
You can use script getsheets.py. Add dependencies first:
pip3 install pandas xlrd openpyxl
Then call the script: python3 getsheets.py <file.xlsx>
