Search MS Word files in a directory for specific content in Linux

I have a directory structure full of MS Word files and I need to search the directory for a particular string. Until now I have been using the following commands to search the files in a directory:
find . -exec grep -li 'search_string' {} \;
find . -name '*' -print | xargs grep 'search_string'
But this search doesn't work for MS Word files.
Is it possible to do a string search in MS Word files on Linux?

I'm a translator and know next to nothing about scripting but I was so pissed off about grep not being able to scan inside Word .doc files that I worked out how to make this little shell script to use catdoc and grep to search a directory of .doc files for a given input string.
You need to install the catdoc and docx2txt packages.
#!/bin/bash
echo -e "\n
Welcome to scandocs. This will search .doc AND .docx files in this directory for a given string. \n
Type in the text string you want to find... \n"
read response
find . -name "*.doc" |
while read i; do catdoc "$i" |
grep --color=auto -iH --label="$i" "$response"; done
find . -name "*.docx" |
while read i; do docx2txt < "$i" |
grep --color=auto -iH --label="$i" "$response"; done
All improvements and suggestions welcome!

Here's a way to use "unzip" to print the entire contents to standard output, then pipe to "grep -q" to detect whether the desired string is present in the output. It works for docx format files.
#!/bin/bash
PROG=$(basename "$0")
if [ $# -eq 0 ]
then
    echo "Usage: $PROG string file.docx [file.docx...]"
    exit 1
fi
findme="$1"
shift
for file in "$@"
do
    unzip -p "$file" | grep -q "$findme"
    [ $? -eq 0 ] && echo "$file"
done
Save the script as "inword" and search for "wombat" in three files with:
$ ./inword wombat file1.docx file2.docx file3.docx
file2.docx
Now you know file2.docx contains "wombat". You can get fancier by adding support for other grep options. Have fun.
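One hedged way to support extra grep options is to forward any leading dash arguments to grep before treating the next argument as the search string; this is a sketch of my own, not part of the original script:
#!/bin/bash
# Sketch: ./inword [-i] [-w] ... string file.docx [file.docx...]
opts=()
while [ $# -gt 0 ] && [ "${1#-}" != "$1" ]; do
    opts+=("$1")   # collect options such as -i or -w for grep
    shift
done
findme="$1"
shift
for file in "$@"; do
    unzip -p "$file" | grep -q "${opts[@]}" -- "$findme" && echo "$file"
done
For example, ./inword -i wombat file1.docx file2.docx would also match "Wombat".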

The more recent versions of MS Word intersperse ascii[0] in between each of the letters of the text for purposes I cannot yet understand. I have written my own MS Word search utilities that insert ascii[0] in between each of the characters in the search field and it just works fine. Clumsy but OK. A lot of questions remain. Perhaps the junk characters are not always the same. More tests need to be done. It would be nice if someone could write a utility that would take all this into account. On my windows machine the same files respond well to searches.
We can do it!
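Those interleaved zero bytes are most likely the second byte of UTF-16LE characters (newer Word formats store text two bytes per character), so one workaround is to build a grep pattern with \x00 after each letter. A minimal sketch of that idea, assuming GNU grep built with PCRE support (-P); file and search names are hypothetical:
#!/bin/bash
# Sketch: search a Word file whose text is stored as UTF-16LE,
# i.e. each ASCII letter is followed by a zero byte.
word="$1"      # e.g. wombat
file="$2"      # e.g. report.doc
# turn "wombat" into the pattern 'w\x00o\x00m\x00b\x00a\x00t\x00'
pattern=$(printf '%s' "$word" | sed 's/./&\\x00/g')
LC_ALL=C grep -qaP "$pattern" "$file" && echo "$file"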

In a .doc file the text is generally present and can be found by grep, but that text is broken up and interspersed with field codes and formatting information so searching for a phrase you know is there may not match. A search for something very short has a better chance of matching.
A .docx file is actually a zip archive collecting several files together in a directory structure (try renaming a .docx to .zip then unzipping it!) -- with zip compression it's unlikely that grep will find anything at all.
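Because of that, a common workaround for .docx is to extract just the main document part and grep the XML; a minimal sketch with hypothetical file and search names:
# word/document.xml holds most of the visible text of a .docx
unzip -p report.docx word/document.xml | grep -c 'search_string'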

The opensource command line utility crgrep will search most MS document formats (I'm the author).

Have you tried with awk '/Some|Word|In|Word/' document.docx ?

If it's not too many files, you can write a script that incorporates something like catdoc: http://manpages.ubuntu.com/manpages/gutsy/man1/catdoc.1.html by looping over each file, performing a catdoc and grep, storing that in a bash variable, and outputting it if it's satisfactory.
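A minimal sketch of that loop, assuming catdoc is installed and using a hypothetical search string:
for f in *.doc; do
    # convert the .doc to plain text and keep any matching lines
    out=$(catdoc "$f" | grep -i 'search_string')
    # only print the file name when something matched
    [ -n "$out" ] && printf '%s:\n%s\n' "$f" "$out"
done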

If you have installed the program called antiword, you can use this command:
find -iname "*.doc" |xargs -I {} bash -c 'if (antiword {}|grep "string_to_search") > /dev/null 2>&1; then echo {} ; fi'
Replace "string_to_search" in the above command with your text. This command prints the name(s) of the files containing "string_to_search".
The command is not perfect because it behaves oddly on small files (the result can be unreliable), since for some reason antiword prints this text:
"I'm afraid the text stream of this file is too small to handle."
if the file is small (whatever that means).

The best solution I came upon was to use unoconv to convert the Word documents to HTML. It also has a .txt output, but that dropped content in my case.
http://linux.die.net/man/1/unoconv
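A minimal sketch of that approach, assuming unoconv is installed; file and search names are hypothetical:
# convert report.doc to report.html next to it, then search the HTML
unoconv -f html report.doc && grep -i 'search_string' report.html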

I've found a way of searching Word files (doc and docx) that uses the preprocessor functionality of ripgrep.
This depends on the following being installed:
ripgrep (more information about the preprocessor here)
LibreOffice
docx2txt
this catdoc2 script, which I've added to my $PATH:
#!/bin/bash
temp_dir=$(mktemp -d)
trap "rm $temp_dir/* && rmdir $temp_dir" 0 2 3 15
libreoffice --headless --convert-to "txt:Text (encoded):UTF8" --outdir "${temp_dir}" "$1" 1>/dev/null
cat "${temp_dir}/$(basename -s .doc "$1").txt"
The command pattern for a one-level recursive search is:
$ rg --pre <preprocessor> --glob <glob with filetype> <search string>
Example:
$ ls *
one:
a.docx
two:
b.docx c.doc
$ rg --pre docx2txt --glob *.docx This
two/b.docx
1:This is file b.
one/a.docx
1:This is file a.
$ rg --pre catdoc2 --glob *.doc This
two/c.doc
1:This is file c.

Here's the full script I use on macOS (Catalina, Big Sur, Monterey).
It's based on Ralph's suggestion, but uses the built-in textutil for .doc files.
#!/bin/bash
searchInDoc() {
# in .doc
find "$DIR" -name "*.doc" |
while read -r i; do
textutil -stdout -cat txt "$i" | grep --color=auto -iH --label="$i" "$PATTERN"
done
}
searchInDocx() {
for i in "$DIR"/*.docx; do
#extract
docx2txt.sh "$i" 1> /dev/null
#point, grep, remove
txtExtracted="$i"
txtExtracted="${txtExtracted//.docx/.txt}"
grep -iHn "$PATTERN" "$txtExtracted"
rm "$txtExtracted"
done
}
askPrompts() {
local i
for i in DIR PATTERN; do
#prompt
printf "\n%s to search: \n" "$i"
#read & assign
read -e REPLY
eval "$i=$REPLY"
done
}
makeLogs() {
local i
for i in results errors; do
# extract dir for log name
dirNAME="${DIR##*/}"
# set var
eval "${i}LOG=$HOME/$i-$PATTERN-$dirNAME.log"
local VAR="${i}LOG"
# warn if it already exists (it will be overwritten by this run)
if [ -f "${!VAR}" ]; then
printf "WARNING: %s will be overwritten.\n" "${!VAR}"
fi
# touch file
touch "${!VAR}"
done
}
checkDocx2txt() {
#see if soft exists
if ! command -v docx2txt.sh 1>/dev/null; then
printf "\nWARNING: docx2txt is required.\n"
printf "Use \e[3mbrew install docx2txt\e[0m.\n\n"
exit
else
printf "\n~~~~~~~~~~~~~~~~~~~~~~~~\n"
printf "Welcome to scandocs macOS.\n"
printf "~~~~~~~~~~~~~~~~~~~~~~~~\n"
fi
}
parseLogs() {
# header
printf "\n------\n"
printf "Scandocs finished.\n"
# results
if [ ! -s "$resultsLOG" ]; then
printf "But no results were found."
printf "\"%s\" did not match in \"%s\"" "$PATTERN" "$DIR" > "$resultsLOG"
else
printf "See match results in %s" "$resultsLOG"
fi
# errors
if [ ! -s "$errorsLOG" ]; then
rm -f "$errorsLOG"
else
printf "\nWARNING: there were some errors. See %s" "$errorsLOG"
fi
# footer
printf "\n------\n\n"
}
#the soft
checkDocx2txt
askPrompts
makeLogs
{
searchInDoc
searchInDocx
} 1>"$resultsLOG" 2>"$errorsLOG"
parseLogs

Related

Echo to all files found from GREP

I'm having trouble with my code.
grep looks for files that don't have the word 'code',
and I need to add 'doesn't have' as the last line in those files.
By logic:
echo 'doesnt have' >> grep -ril 'code' file/file
I'm using -ril to ignore case and get the file names.
Does anyone know how to append text to each .txt file found by the grep searches?
How's this for a novel alternative?
echo "doesn't have" |
tee -a $(grep -riL 'code' file/file)
I switched to the -L option to list the files which do not contain the search string.
This is unfortunately rather brittle in that it assumes your file names do not contain whitespace or other shell metacharacters. This can be fixed at the expense of some complexity (briefly, have grep output zero-terminated matches, and read them into an array with readarray -d ''. This requires a reasonably recent Bash, and probably GNU grep.)
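A minimal sketch of that safer variant, assuming GNU grep (for -Z) and Bash 4.4 or later (for readarray -d ''):
# collect NUL-terminated names of files that do NOT contain 'code'
readarray -d '' files < <(grep -riLZ 'code' file/file)
for f in "${files[@]}"; do
    echo "doesn't have" >> "$f"
done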
The 'echo' command can append output to a single file, which must be specified by redirecting the standard output. To update multiple files, a loop is needed. The loop iterates over all the files found with 'grep':
for file in $(grep -ril 'code' file/file) ; do
echo 'doesnt have' >> $file
done
Using a while read loop and Process Substitution.
#!/usr/bin/env bash
while IFS= read -r files; do
echo "doesn't have" >> "$files"
done < <(grep -ril 'code' file/file)
As mentioned by @dibery:
#!/bin/sh
grep -ril 'code' file/file | {
while IFS= read -r files; do
echo "doesn't have" >> "$files"
done
}

for loop in Linux treats pattern as filename when no files exist

I ran the following in a directory with no files:
for file in *.20191017.*;do echo ${file}; done
what it returned was this:
*.20191017.*
which is a little awkward, since this was just a pattern and not a filename itself.
Can anyone please help with this?
Found the reason for this anomaly (source: https://www.cyberciti.biz/faq/bash-loop-over-file/)
You can do filename expansion in loop such as work on all pdf files in current directory:
for f in *.pdf; do
echo "Removing password for pdf file - $f"
done
However, there is one problem with the above syntax. If there are no pdf files in the current directory it will expand to *.pdf (i.e. f will be set to *.pdf). To avoid this problem add the following statement before the for loop:
#!/bin/bash
# Usage: remove all utility bills pdf file password
shopt -s nullglob # expands the glob to empty string when there are no matching files in the directory.
for f in *.pdf; do
echo "Removing password for pdf file - $f"
pdftk "$f" output "output.$f" user_pw "YOURPASSWORD-HERE"
done
The for loop simply iterates over the words between in and ; (possibly expanded by bash). Here, file is just the variable name. If you want to iterate over only the files that are actually present, you can, for example, add an if to check that ${file} really exists:
for file in *.20191017.*
do
if [ -e "${file}" ]
then
echo ${file}
fi
done
Or you can use, e.g., find
find . -maxdepth 1 -name '*.20191017.*'
-maxdepth 1 is to avoid recursion.

Create file with egrep matches and file names

I need some help...
I'm creating a script of unit tests using shell scripts. That script stores all the beeline calls from all files inside a directory.
The script is serving its purpose, but I don't want to append the file name if grep does not return any results.
That's my code:
for file in $(ls)
do
cat $file | egrep -on '^( +)?\bbeeline.*password=;"?' >> testa_scripts.sh
echo $file >> testa_scripts.sh
done
How can I do that?
Thanks
grep returns a non-zero exit status (1) if it doesn't find any matching lines, so you can put in an if statement to test whether it matched anything. Inverted with ! here:
for file in ./*; do
if ! egrep -on '...' "$file" >> somefile; then
echo 'grep did not match anything'
fi
done
(I don't think there's any need for the ls instead of just a shell glob here.)
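If the goal is the original one (only record the file name when grep did match), a minimal sketch without the inversion, reusing the asker's pattern and output file:
for file in ./*; do
    # append the matches; egrep exits 0 only when something matched
    if egrep -on '^( +)?\bbeeline.*password=;"?' "$file" >> testa_scripts.sh; then
        echo "$file" >> testa_scripts.sh
    fi
done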

Keeping *nix Format

My code works, but not in the way I want to exactly. Basically, what my code does is that it looks through the current directory, searches for folders, and within those folders it looks at the files. If one of the file extensions is the value of the $Example variable, then it should delete all other files with the same beginning file name, regardless of extension and rename the one with the $Example extension to the same name, just without the $Example extension. Here is the code:
#!/bin/sh
set -o errexit
Example=dummy
for d in *; do
if test "$(ls -A "$d" 2>/dev/null)"; then
if [ $(ls -1 ${d}/*.$Example 2>/dev/null | wc -l) -ge 1 ]; then
cd $(pwd)/$d;
for f in *.$Example; do
fileName="${f%.$Example}";
mv "$f" "${f%.$Example}";
#tr "\r" "\n" < "${f%.$Example}" > "${f%.$Example}"
listToDelete=$(find -name "$fileName.*");
for g in $listToDelete; do
rm $g;
done;
done;
cd ..;
fi;
fi;
done
The files being used have been created in VIM, so are supposed to have Linux formatting, rather than Windows formatting. For some reason or other, once the extension has been stripped, using this code, it gets formatted with \r, and the file fails to run. I added the comment where my temporary solution is located, but I was wondering if either there is some way to alter the mv function to keep the Linux formatting, or maybe there is another way to achieve what I want. Thanks
The files being used have been created in VIM, so are supposed to have Linux formatting, rather than Windows formatting.
That has no impact on the line separator being used.
But I can think of 3 different possible causes.
ViM substitution
The \r and \n may have been mixed up, or the Windows line separator (\r\n) may have been stripped out incorrectly. If you're just trying to convert the Windows line separator at some point, convert the files to Unix line-separated files with dos2unix instead of sed or vim magic if possible, and then edit them. If the files are adding or replacing line separators using ViM, remember that in ViM line separators are searched for using \n and replaced with \r, since ViM just loads file data into a buffer. e.g. %s/\n/somestring/ and %s/somestring/\r/g.
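For reference, a minimal sketch of that conversion on the command line, using a hypothetical file name:
# convert CRLF line endings to LF in place
dos2unix script.sh
# or, with GNU sed, strip the trailing CR from every line
sed -i 's/\r$//' script.sh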
Cygwin
If you're using Cygwin, I'm pretty sure it defaults to ViM using Windows line separators, but you can change it to Unix line separators. Don't remember how to off the top of my head though.
ViM default line separators
Not sure how this would have happened if you're on a GNU/Linux system, but the line separator ViM uses for a buffer can be set to Unix with :e ++ff=unix
This answer does not address the issue with a possible stray \r in the script source code, but rather gives an alternative way of doing what I believe is what the user wants to achieve:
#!/bin/bash
ext='dummy'
tmpfile=$(mktemp)
find . -type f -name "*.$ext" \
-execdir bash -x -c \
'cp "$3" "$2" &&
tee "${3%.$1}"* <"$2" >/dev/null' bash "$ext" "$tmpfile" {} \;
rm -f "$tmpfile"
This finds all files with the given extension and for each of them executes
cp "$3" "$2" && tee "${3%.$1}"* <"$2" >/dev/null
$1 will be the extension without the dot.
$2 will be the name of a temporary file.
$3 will be the found file.
The command will first copy the found file to a temporary file, then it will feed this temporary file through tee which will duplicate its contents to all files with the same prefix (${3%.$1} will strip the extension from the found filename and ${3%.$1}* will expand to all files in the same directory that has the same prefix).
Most modern implementations of find supports -execdir which works like -exec with the difference that the given utility will be executed in the directory of the found name. Also, {} will be the pathless basename of the found thing.
Given the following files in a directory:
$ ls test/
f1.bar f1.foo f2.bar f2.foo f3.bar f3.foo
f1.dummy f1.txt f2.dummy f2.txt f3.dummy f3.txt
This script does the following:
$ bash -x script.sh
+ ext=dummy
++ mktemp
+ tmpfile=/tmp/tmp.9v5JMAcA12
+ find . -type f -name '*.dummy' -execdir bash -x -c 'cp "$3" "$2" &&
tee "${3%.$1}"* <"$2" >/dev/null' sh dummy /tmp/tmp.9v5JMAcA12 '{}' ';'
+ cp f1.dummy /tmp/tmp.9v5JMAcA12
+ tee f1.bar f1.dummy f1.foo f1.txt
+ cp f2.dummy /tmp/tmp.9v5JMAcA12
+ tee f2.bar f2.dummy f2.foo f2.txt
+ cp f3.dummy /tmp/tmp.9v5JMAcA12
+ tee f3.bar f3.dummy f3.foo f3.txt
+ rm -f /tmp/tmp.9v5JMAcA12
Remove -x from the invocation of bash (and from bash on the command line) to disable tracing.
I used bash rather than sh because my sh can't grok ${3%.$1} properly.

Find string inside pdf with shell

I'd like to know if there is any way to check if there is a string inside a pdf file using a shell script? I was looking for something like:
if [search(string,pdf_file)] > 0 then
echo "exist"
fi
This approach converts the .pdf files page-wise, so the occurrences of the search string $query can be located more specifically.
# search for query string in available pdf files pagewise
for i in *.pdf; do
pagenr=$(pdfinfo "$i" | grep "Pages" | grep -o "[0-9][0-9]*")
fileid="\n$i\n"
for (( p=1; p<=pagenr; p++ )); do
matches=$(pdftotext -q -f $p -l $p "$i" - | grep --color=always -in "$query")
if [ -n "$matches" ]; then
echo -e "${fileid}PAGE: $p"
echo "$matches"
fileid=""
fi
done
done
pdftotext -f $p -l $p limits the range to be converted to only one page, identified by the number $p. grep --color=always preserves the match highlights through the subsequent echo. fileid="" just makes sure the file name of the .pdf document is only printed once for multiple matches.
As nicely pointed out by Simon, you can simply convert the pdf to plain text using pdftotext, and then just search for what you're looking for.
After conversion, you may use grep, bash regex, or any variation you want:
while read line; do
if [[ ${line} =~ [0-9]{4}(-[0-9]{2}){2} ]]; then
echo ">>> Found date;";
fi
done < <(pdftotext infile.pdf -)
Each letter within a PDF doc is typically set individually. Therefore, you have to convert the .pdf to text, which will reduce the text to a simple stream.
I would try this:
grep -q 'a \+string' <(pdftotext some.pdf - | tr '\n' ' ') && echo exists
The tr joins line breaks. The \+ allows for 1 or more space chars between words. Finally, grep -q only returns exit status 0/1 based on a match. It does not print matching lines.
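For completeness, the asker's if-style check can be written directly; a minimal sketch assuming pdftotext from poppler-utils and a hypothetical file name:
if pdftotext pdf_file.pdf - | grep -q 'string'; then
    echo "exist"
fi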
