Command-line tool to search docx files under MS-DOS or Cygwin/Linux

Is there a command-line tool that can search docx files under MS-DOS or Cygwin?
I have tried grep; it does not work with docx files, though it works fine with txt files.
I know I could always convert the docx to txt first and then search with grep, but I am wondering:
is there a tool that can search docx files directly from the command line?
Thanks

I wrote a small bash script which should help you:
#!/bin/bash
export DOCKEY="$1"
function searchdoc(){
    # Matches in the file read as plain text
    VK1=$(grep -i "$DOCKEY" "$1" | wc -l)
    # Matches inside the zip archive (a .docx is just a zip of XML files)
    VK2=$(unzip -c "$1" | grep -i "$DOCKEY" | wc -l)
    let NUM=$VK1+$VK2
    if [ "$NUM" -gt 0 ]; then
        echo "$NUM matching lines in $1"
        echo "opening file."
        gnome-open "$1"
    fi
}
export -f searchdoc
echo "searching for $DOCKEY ..."
find . -type f -exec bash -c 'searchdoc "$1" 2>/dev/null' _ {} \;
save it as docfind.sh and you can invoke
$> ./docfind.sh searchterm
from any folder you want to scan.

After trying things out, I found the easiest way to do this is to use a Linux utility to batch-convert all the docx files into txt files, then grep those txt files.
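For example, a batch conversion might look like this (a sketch assuming LibreOffice is installed; the docx2txt utility is another option):
# Convert every .docx in the current directory to plain text, then grep the results
libreoffice --headless --convert-to txt *.docx --outdir txt/
grep -i "searchterm" txt/*.txt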

zgrep might work for you? It usually works in OpenOffice documents, and both are compressed archives containing XML:
zgrep "some string" *.xdoc
I have no .xdoc files to test this with, but in theory it should work...

You can use zipgrep, which calls grep on all files of a zip archive (which a docx file is).
You might be disappointed with the result, though, as it returns raw content of XML files containing both the text and XML tags.
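For example (report.docx is a hypothetical file; the sed filter is a rough way to strip XML tags, not a real parser):
# Run grep over every file inside the docx archive
zipgrep -i "some string" report.docx
# Optionally strip the XML tags from the matches for readability
zipgrep -i "some string" report.docx | sed 's/<[^>]*>//g'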

save it as docfind.sh and you can invoke
Newbies like me might need to be told that for the .sh script to be executable from any directory, it needs to have the executable bit set and be located in /usr/bin or elsewhere on your PATH.
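For example (using /usr/local/bin, a common choice for user scripts):
chmod +x docfind.sh
sudo mv docfind.sh /usr/local/bin/docfind.sh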
I was able to set up the nemo file manager in Linux Mint to open a terminal from any folder's context menu (information here).

Related

How to search multiple DOCX files for a string within a Word field?

Is there any Windows app that will search for a string of text within fields in a Word (DOCX) document? Apps like Agent Ransack and its big brother FileLocator Pro can find strings in the Word docs but seem incapable of searching within fields.
For example, I would like to be able to find all occurrences of the string "getProposalTranslations" within a collection of Word documents that have fields with syntax like this:
{ AUTOTEXTLIST \t "<wr:out select='$.shared_quote_info' datasource='getProposalTranslations'/>" }
Note that string doesn't appear within the text of the document itself but rather only within a field. Essentially the DOCX file is just a zip file, I believe, so if there's a tool that can grep within archives, that might work. Note also that I need to be able to search across hundreds or perhaps thousands of files in many directories, so unzipping the files one by one isn't feasible. I haven't found anything on my own and thought I'd ask here. Thanks in advance.
This script should accomplish what you are trying to do. Let me know if that isn't the case. I don't usually write entire scripts because it can hurt the learning process, so I have commented each command so that you might learn from it.
#!/bin/sh
# Create ~/tmp/WORDXML folder if it doesn't exist already
mkdir -p ~/tmp/WORDXML
# Change directory to ~/tmp/WORDXML
# (pass absolute paths to the script, since it changes directory before unzipping)
cd ~/tmp/WORDXML
# Iterate through each file passed to this script
for FILE in "$@"; do
{
    # unzip it into ~/tmp/WORDXML
    # > /dev/null 2>&1 discards all output to the terminal
    unzip "$FILE" > /dev/null 2>&1
    # find all of the xml files
    find . -type f -name '*.xml' | \
    # open them in xmllint to make them pretty. Discard errors.
    xargs xmllint --recover --format 2> /dev/null | \
    # search for and report if found
    grep 'getProposalTranslations' && echo " ^ found in file '$FILE'"
    # remove the temporary contents
    rm -rf ~/tmp/WORDXML/*
}; done
# remove the temporary folder
rm -rf ~/tmp/WORDXML
Save the script wherever you like. Name it whatever you like. I'll name it docxfind. Make it executable by running chmod +x docxfind. Then you can run the script like this (assuming your terminal is running in the same directory): ./docxfind filenames...
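To cover hundreds of files across many directories (as in the question), you could let find supply the filenames; absolute paths are safest here since the script changes directory:
# Search every .docx under the current tree, passing absolute paths in batches
find "$PWD" -type f -name '*.docx' -exec ./docxfind {} +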

Sort files according to their filetype

After an HD problem and some recovery work, I have a bunch of files with names like "f1234", "f1235", etc.
My goal is to sort these files according to their filetype. For example, I want to move all the PDF files into a "pdfs" directory.
For one file, I can run "file f1234", and if it's a PDF, "mv f1234 pdfs/". But I have thousands of files... Can you help me with a bash or zsh command to sort all the PDFs in one pass? Thanks
The hard part here is reliably turning the output of file into a directory name. I think the best candidate for that is the file's MIME type rather than the human-readable output of file. I'd use something like:
mkdir sorted
for f in f*
do
    d=$(file -b --mime-type "$f" | tr / -)
    mkdir -p "sorted/$d"
    mv "$f" "sorted/$d/"
done
Obviously I'd test that out a bit before running it on your files, but something pretty close to that should work.
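If you only want the PDFs moved into a pdfs directory, as in the question, a narrower variant of the same idea might look like this (a sketch; test it on a few files first):
# Move only the files whose MIME type is application/pdf
mkdir -p pdfs
for f in f*; do
    [ "$(file -b --mime-type "$f")" = "application/pdf" ] && mv "$f" pdfs/
done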

Script to open latest text file from a directory

I need a shell script to open the latest text file from a given directory. It will then be copied to another directory. How can I achieve this?
I need logic that will find and return the latest file in a directory (the name of the text file can be anything, so I need to determine the latest text file).
You can do something like this:
#!/bin/sh
SOURCE_DIR=/home/juned/Downloads
DEST_DIR=/tmp/
LAST_MODIFIED_FILE=$(ls -t "$SOURCE_DIR" | head -1)
echo "$LAST_MODIFIED_FILE"
# Open file
vim "$SOURCE_DIR/$LAST_MODIFIED_FILE"
# Copy file
cp "$SOURCE_DIR/$LAST_MODIFIED_FILE" "$DEST_DIR"
echo "File copied successfully"
You can open the file in any application you like (gedit, kate, etc.); here I've used vim.
xdg-open - opens a file or URL in the user's preferred application
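So, to avoid hard-coding an editor in the script above, the open step could use the preferred application instead:
# Let the desktop environment pick the application
xdg-open "$SOURCE_DIR/$LAST_MODIFIED_FILE"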
Not an expert in bash, but you can try this logic:
First, grab the latest file using ls -t (-t sorts by modification time; head -1 takes the first entry):
F=$(ls -t * | head -1)
Then open the file in an editor:
xdg-open "$F"
gedit "$F"
...
As suggested by @AJefferiss, you can do it directly:
xdg-open $(ls -t * | head -1)
gedit $(ls -t * | head -1)
For editing the latest modified/created file:
vim $(ls -t | head -1)
For editing the latest in alphanumerical order:
vim $(ls -1 | tail -1)
In one line (if you are sure that there are only files):
vim `ls -t . | head -1`
It will be opened in vim (or use another text editor).
If there are directories, you should write a script with a loop that tests every entry (to check it's not a directory):
if [ -f "$FILE" ];
Or you can also use find, or a pipe, to get the latest file:
ls -lt . | sed -n 2p | grep -v '^d'
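A minimal sketch of that loop (with the usual caveat that parsing ls breaks on unusual filenames):
# Open the newest regular file, skipping directories
for FILE in $(ls -t); do
    if [ -f "$FILE" ]; then
        vim "$FILE"
        break
    fi
done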
The existing answers are helpful, but fall short when it comes to dealing with filenames with embedded spaces or other shell metacharacters.[1]
# Get the most recently modified *.txt file.
# (On *assignment*, names with spaces, ... are not a concern.)
f=$(ls -t *.txt | head -n 1)
# *Use* the variable enclosed in *double-quotes* to ensure that it is passed
# to the target command unmodified.
xdg-open "$f" # could also use "$(ls -t *.txt | head -n 1)" directly
Additionally, some answers use all-uppercase shell variable names, which should be avoided so as to prevent conflicts with environment variables.
[1] Due to use of ls, filenames with embedded newlines won't be handled correctly, but that's rarely a real-world concern.

Embedded Linux shell (BusyBox) check over list of files if they exist and run commands

I'm trying to download a list of files (a text file, one filename per line, no spaces or newlines in the filenames), and then check for each file whether it exists and run commands accordingly. The first part seems to work just fine: the file from the web server is downloaded, and the first echo outputs the filenames. But the file existence check does not work.
#!/bin/sh
wget -qO- http://web.server/x/files.txt | while read file
do
echo $file
if [ -f $file ]; then
echo $file exists
else
echo $file does not exist
fi
done
output when executed in a directory where the second file (temp.txt) does exist:
file1.tmp
does not exist
temp.txt
does not exist
file3.tmp
does not exist
file4.tmp
does not exist
The second file does exist, yet the echo commands in the if statement apparently don't recognize the $file variable either.
Any help is appreciated. I tried cobbling this together with info found here. A problem might be that this is not a full Linux system, but embedded Linux (OpenELEC) with BusyBox v1.22.1.
UPDATE: thanks to the commenters, we figured out that the code as-is basically works fine AS LONG AS the files.txt from the web server contains only Unix line endings; it doesn't work with Windows CRLF line endings.
Now how could the script be made to work regardless of the line endings in the file from the web server?
dos2unix is a utility which converts Windows line endings to Unix line endings. You can use it in your script like this:
wget -qO- http://web.server/x/files.txt | dos2unix | while read file
Or:
while read line; do
...
done < <(wget -qO- http://web.server/x/files.txt | dos2unix)
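Note that the process-substitution form above requires bash; BusyBox sh doesn't support it. If dos2unix isn't available either (BusyBox-based systems such as OpenELEC often lack it), stripping the carriage returns with tr achieves the same thing in plain sh:
# Delete CR characters so `read` sees clean Unix line endings
wget -qO- http://web.server/x/files.txt | tr -d '\r' | while read file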

Open the latest downloaded file with a bash script

Below is my attempt at this problem. It's a functional script, but I have to specify the application to be used for each file type. Since this information about default applications must already be stored somewhere on Linux/Ubuntu, how may I access it and incorporate it into my script?
Also, can my script be more "elegant" in any way?
Thank you for helping a Bash script beginner! I appreciate any comment.
#!/bin/bash
# Open the latest file in ~/Downloads
filename=$(ls -t ~/Downloads | head -1)
filetype=$(echo -n $filename | tail -c -3)
if [ $filetype == "txt" ]; then
leafpad ~/Downloads/$filename
elif [ $filetype == "pdf" ]; then
evince ~/Downloads/$filename
fi
"How do I open a file in its default program - Linux" should help you with the first part of your question:
xdg-open ~/Downloads/$filename
As mentioned in other answers, it's best not to trust the output of ls in scripts, especially if you have unusual characters like newlines in your filenames. One way to robustly get a list of filenames in a script is with the find command, and null-delimiting them into a pipe.
So to answer your question with a one-liner:
find ~/Downloads -maxdepth 1 -type f -printf "%T@ %p\0" | sort -zrn | { read -d '' ts file; xdg-open "$file"; }
Breaking it down:
The find command lists files in the ~/Downloads directory, but doesn't descend any deeper into subdirectories. The filenames are printed with the given printf format, which lists a numerical timestamp, followed by a space, followed by a null delimiter. Note that the printf format specifiers for find are different from those for regular printf
The sort command numerically sorts (-n) the resulting null-delimited list (-z) by the first field (numerical timestamp). Sort order is reversed (-r) so that the latest entry is displayed first
The read command reads the timestamp and filename of the first file in the list into the ts and file variables. -d '' tells read to use null delimiters.
The file is opened using xdg-open.
Note the read and xdg-open commands are in a curly bracket inline group, so the file variable is in scope for both.
Welcome to bash programming. :-)
First off, I'll refer you to the Bash FAQ. Great resource, lots of tips, perspectives and warnings.
One of them is the classic Parsing LS problem that your script suffers from. The basic idea is that you don't want to trust the output of the ls command, because special characters like spaces and control characters may be represented in a way that doesn't allow you to refer to the file.
You're opening the "last" file, as determined by a sort that the ls command is doing. In order to detect the most recent file without ls, we'll need some extra code. For example:
#!/bin/sh
last=0
for filename in ~/Downloads/*; do
    when=$(stat -c '%Y' "$filename")
    if [ "$when" -gt "$last" ]; then
        last=$when
        to_open="$filename"
    fi
done
xdg-open "$to_open"
The idea is that we'll walk through each file in your Downloads directory and find the one with the largest timestamp using the stat command. Then we open that file using xdg-open, which may already be installed on your system because it's part of a tool set that's a dependency for a number of other applications.
If you don't have xdg-open, you can probably install it from the xdg-utils package using whatever package management system your Linux distro provides.
Another possibility is gnome-open, which is part of the Gnome desktop (the libgnome package, to be precise). YMMV. We'd need to know more about your distro and your desktop environment to come up with better advice.
Note that if you do want to continue selecting your application by extension, you might want to consider using a case statement instead of a series of ifs:
...
case "${filename##*.}" in
    txt)
        leafpad "$filename"
        ;;
    pdf)
        xdg-open "$filename"
        ;;
    *)
        echo "ERROR: can't open '$filename'" >&2
        ;;
esac
mimeopen might be useful? There's an explanation of Mime types here.
Also: are your filetype extensions always exactly three letters, as the tail -c -3 implies? If they're of variable length, you may want a regular expression instead.
As previously mentioned, xdg-open and mimeopen may be useful and more elegant; from their manpages:
xdg-open opens a file or URL in the user's preferred application. If a URL is provided the URL will be opened in the user's preferred web browser. If a file is provided the file will be opened in the preferred application for files of that type.
[mimeopen] tries to determine the mimetype of a file and open it with the default desktop application. If no default application is configured the user is prompted with an "open with" menu in the terminal.
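For example (report.pdf is a hypothetical file; the -d flag prompts for, and remembers, a new default application):
# Open with the configured default application for the file's mimetype
mimeopen ~/Downloads/report.pdf
# Choose and remember a new default application, then open
mimeopen -d ~/Downloads/report.pdf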
For more elegance in the original script, replace
filetype=$(echo -n $filename | tail -c -3)
with
filetype=${filename: -3}
and instead of the five-line if/elif/fi structure, consider using two lines as follows.
[ $filetype == "txt" ] && leafpad ~/Downloads/$filename
[ $filetype == "pdf" ] && evince ~/Downloads/$filename
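If the extensions aren't always exactly three characters (see the earlier comment about tail -c -3), a parameter expansion that strips everything up to the last dot is a more general alternative:
# Works for extensions of any length: "notes.txt" -> "txt", "archive.gz" -> "gz"
filetype=${filename##*.}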
