Find string inside PDF with shell - Linux

Is there any way to check whether a string occurs inside a PDF file using a shell script? I was looking for something like:
if [search(string,pdf_file)] > 0 then
echo "exist"
fi

This approach converts the .pdf files page-wise, so the occurrences of the search string $query can be located more precisely.
# search for query string in available pdf files pagewise
for i in *.pdf; do
    pagenr=$(pdfinfo "$i" | grep "Pages" | grep -o "[0-9][0-9]*")
    fileid="\n$i\n"
    for (( p=1; p<=pagenr; p++ )); do
        matches=$(pdftotext -q -f "$p" -l "$p" "$i" - | grep --color=always -in "$query")
        if [ -n "$matches" ]; then
            echo -e "${fileid}PAGE: $p"
            echo "$matches"
            fileid=""
        fi
    done
done
pdftotext -f $p -l $p limits the conversion range to the single page numbered $p. grep --color=always preserves the match highlighting through the subsequent echo. fileid="" ensures the file name of the .pdf document is printed only once even when there are multiple matching pages.

As nicely pointed out by Simon, you can simply convert the PDF to plain text using pdftotext and then search for whatever you're looking for.
After conversion, you may use grep, bash regex, or any variation you want:
while read line; do
    if [[ ${line} =~ [0-9]{4}(-[0-9]{2}){2} ]]; then
        echo ">>> Found date;"
    fi
done < <(pdftotext infile.pdf -)
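For the plain existence check from the question, the same idea reduces to a short test (a minimal sketch, assuming pdftotext from poppler-utils; the file name and string are placeholders):
if pdftotext infile.pdf - | grep -q "string"; then
    echo "exist"
fi
Here pdftotext writes the extracted text to standard output because - is given as the output file, and grep -q reports a match only through its exit status.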

Each letter within a PDF document is typically set individually. Therefore, you have to convert the .pdf to text first, which reduces the content to a simple character stream.
I would try this:
grep -q 'a \+string' <(pdftotext some.pdf - | tr '\n' ' ') && echo exists
The tr joins the lines by turning line breaks into spaces, so a phrase can match even when it is split across lines. The \+ allows for one or more space characters between the words. Finally, grep -q prints nothing; it only returns exit status 0 or 1 depending on whether there was a match.
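To see why the tr matters, here is the same pipeline with printf standing in for pdftotext output in which the phrase happens to be split across a line break (an illustration, not part of the original answer):
printf 'looking for a\nstring here\n' | tr '\n' ' ' | grep -q 'a \+string' && echo exists
Without the tr, the two words would sit on different lines, and grep, which matches line by line, would never see them together.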

Related

bash/awk/unix detect changes in lines of csv files

I have a timestamp in this format:
(normal_file.csv)
timestamp
19/02/2002
19/02/2002
19/02/2002
19/02/2002
19/02/2002
19/02/2002
The dates are usually uniform, however, there are files with irregular dates pattern such as this example:
(abnormal_file.csv)
timestamp
19/02/2002
19/02/2003
19/02/2005
19/02/2006
In my directory, there are hundreds of files like normal_file.csv and abnormal_file.csv.
I want to write a bash or awk script that detects the date pattern in all files of the directory. Files with an abnormal pattern should be moved automatically to a new, separate directory (let's say dir_different/).
Currently, I have tried the following:
#!/bin/bash
mkdir dir_different
for FILE in *.csv;
do
# pipe 1: detect the changes in the line
# pipe 2: print the timestamp column (first column, columns are comma-separated)
awk '$1 != prev {print ; prev = $1}' < "$FILE" | awk -F , '{print $1}'
done
If the timestamp in a given file is normal, then only one single timestamp will be printed; but for abnormal files, multiple dates will be printed.
I am not sure how to separate the abnormal files from the normal files, and I have tried the following:
do
    output=$(awk 'FNR==3{print $0}' "$FILE")
    echo "${output}"
    if [[ ${output} =~ ([[:space:]]) ]]
    then
        mv "$FILE" dir_different/
    fi
done
Or is there an easier method to detect changes in lines and separate files that have different lines? Thank you for any suggestions :)
Assuming that none of your "normal" CSV files have trailing newlines this should do the separation just fine:
#!/bin/bash
mkdir -p dir_different
for FILE in *.csv;
do
    if awk '{a[$1]++}END{if(length(a)<=2){exit 1}}' "$FILE" ; then
        echo mv "$FILE" dir_different
    fi
done
After a dry-run just get rid of the echo :)
Edit:
{a[$1]++} This bit creates an array a that gets the first field of each line as an index, and that gets incremented every time the same value is seen.
END{if(length(a)<=2){exit 1}} This checks how many elements are in the array. If there are fewer than 3 (which should be the case if there is always the same date: 1 header line plus 1 distinct date), exit the processing with status 1.
"$FILE" is part of the bash script, not awk, and I quoted your variable out of habit, should you ever have files w/ spaces in their names you'll see why :)
So, a "normal" file contains only two different lines:
timestamp
dd/mm/yyyy
Testing if a file is normal is thus as simple as:
[ $(sort -u file.csv | wc -l) -eq 2 ]
This leads to the following possible solution:
#!/usr/bin/env bash
mkdir -p dir_different
for FILE in *.csv;
do
    if [ $(sort -u "$FILE" | wc -l) -ne 2 ] ; then
        echo mv "$FILE" dir_different
    fi
done

Echo to all files found from GREP

I'm having trouble with my code.
grep looks for files that don't contain the word 'code',
and I need to append 'doesn't have' as the last line of those files.
Logically, it would be something like:
echo 'doesnt have' >> grep -ril 'code' file/file
I'm using -ril to ignore case and get the file names.
Does anyone know how to append a text to each of the .txt files found by the grep search?
How's this for a novel alternative?
echo "doesn't have" |
tee -a $(grep -riL 'code' file/file)
I switched to the -L option to list the files which do not contain the search string.
This is unfortunately rather brittle in that it assumes your file names do not contain whitespace or other shell metacharacters. This can be fixed at the expense of some complexity (briefly, have grep output zero-terminated matches, and read them into an array with readarray -d ''. This requires a reasonably recent Bash, and probably GNU grep.)
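A sketch of that more robust variant (assumes GNU grep for -Z and Bash 4.4+ for readarray -d ''):
readarray -d '' files < <(grep -riLZ 'code' file/file)
for f in "${files[@]}"; do
    echo "doesn't have" >> "$f"
done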
The 'echo' command can append output to a single file, whose name must be given by redirecting standard output. To update multiple files, a loop is needed. The loop iterates over all the files found with 'grep':
for file in $(grep -ril 'code' file/file) ; do
    echo 'doesnt have' >> "$file"
done
Using a while read loop and Process Substitution.
#!/usr/bin/env bash
while IFS= read -r files; do
    echo "doesn't have" >> "$files"
done < <(grep -ril 'code' file/file)
As mentioned by @dibery:
#!/bin/sh
grep -ril 'code' file/file | {
    while IFS= read -r files; do
        echo "doesn't have" >> "$files"
    done
}

Create file with egrep matches and file names

I need some help...
I'm creating a script of unit tests using shell scripts. That script stores all the beeline calls from all files inside a directory.
The script serves its purpose, but I don't want to append the file name if grep does not return results.
That's my code:
for file in $(ls)
do
    cat $file | egrep -on '^( +)?\bbeeline.*password=;"?' >> testa_scripts.sh
    echo $file >> testa_scripts.sh
done
How can I do that?
Thanks
grep returns a falsy exit status (1) if it doesn't find any matching lines, so you can use an if statement to test whether it matched anything; inverted with ! here:
for file in ./*; do
    if ! egrep -on '...' "$file" >> somefile; then
        echo 'grep did not match anything'
    fi
done
(I don't think there's any need for the ls instead of just a shell glob here.)
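Applied to the question's actual pattern and output file, the same exit-status test appends the file name only when something matched (a sketch):
for file in ./*; do
    if egrep -on '^( +)?\bbeeline.*password=;"?' "$file" >> testa_scripts.sh; then
        echo "$file" >> testa_scripts.sh
    fi
done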

How to check if sed has changed a file

I am trying to find a clever way to figure out if the file passed to sed has been altered successfully or not.
Basically, I want to know if the file has been changed or not without having to look at the file modification date.
The reason why I need this is because I need to do some extra stuff if sed has successfully replaced a pattern.
I currently have:
grep -q $pattern $filename
if [ $? -eq 0 ]
then
    sed -i s:$pattern:$new_pattern: $filename
    # DO SOME OTHER STUFF HERE
else
    # DO SOME OTHER STUFF HERE
fi
The above code is a bit expensive and I would love to be able to use some hacks here.
A bit late to the party but for the benefit of others, I found the 'w' flag to be exactly what I was looking for.
sed -i "s/$pattern/$new_pattern/w changelog.txt" "$filename"
if [ -s changelog.txt ]; then
# CHANGES MADE, DO SOME STUFF HERE
else
# NO CHANGES MADE, DO SOME OTHER STUFF HERE
fi
changelog.txt will contain each change (i.e. the changed text) on its own line. If there were no changes, changelog.txt will be zero bytes.
A really helpful sed resource (and where I found this info) is http://www.grymoire.com/Unix/Sed.html.
I believe you may find these GNU sed extensions useful
t label
If a s/// has done a successful substitution since the last input line
was read and since the last t or T command, then branch to label; if
label is omitted, branch to end of script.
and
q [exit-code]
Immediately quit the sed script without processing any more input, except
that if auto-print is not disabled the current pattern space will be printed.
The exit code argument is a GNU extension.
It seems like exactly what you are looking for.
This might work for you (GNU sed):
sed -i.bak '/'"$old_pattern"'/{s//'"$new_pattern"'/;h};${x;/./{x;q1};x}' file || echo changed
Explanation:
/'"$old_pattern"'/{s//'"$new_pattern"'/;h} if the pattern space (PS) contains the old pattern, replace it by the new pattern and copy the PS to the hold space (HS).
${x;/./{x;q1};x} on encountering the last line, swap to the HS and test it for the presence of any string. If a string is found in the HS (i.e. a substitution has taken place) swap back to the original PS and exit using the exit code of 1, otherwise swap back to the original PS and exit with the exit code of 0 (the default).
You can diff the original file with the sed output to see if it changed:
sed -i.bak s:$pattern:$new_pattern: "$filename"
if ! diff "$filename" "$filename.bak" &> /dev/null; then
    echo "changed"
else
    echo "not changed"
fi
rm "$filename.bak"
You could use awk instead:
awk '$0 ~ p { gsub(p, r); t=1 } 1; END{ exit (!t) }' p="$pattern" r="$repl"
I'm ignoring the -i feature: you can use the shell to do redirections as necessary.
Sigh. Many comments below are asking for a basic shell tutorial. You can use the above command as follows:
if awk '$0 ~ p { gsub(p, r); t=1 } 1; END{ exit (!t) }' \
    p="$pattern" r="$repl" "$filename" > "${filename}.new"; then
    cat "${filename}.new" > "${filename}"
    # DO SOME OTHER STUFF HERE
else
    # DO SOME OTHER STUFF HERE
fi
It is not clear to me if "DO SOME OTHER STUFF HERE" is the same in each case. Any similar code in the two blocks should be refactored accordingly.
On macOS I just do it as follows:
changes=""
changes+=$(sed -i '' "s/$to_replace/$replacement/g w /dev/stdout" "$f")
if [ "$changes" != "" ]; then
    echo "CHANGED!"
fi
I checked, and this is faster than md5, cksum and sha comparisons
I know it is an old question, and using awk instead of sed is perhaps the best idea, but if one wants to stick with sed, an idea is to use the w flag of the s command. The file given to the w flag receives only the lines where a substitution happened, so we only need to check that it is not empty.
perl -sple '$replaced++ if s/$from/$to/g;
END{if($replaced != 0){ print "[Info]: $replaced replacement done in $ARGV(from/to)($from/$to)"}
else {print "[Warning]: 0 replacement done in $ARGV(from/to)($from/$to)"}}' -- -from="FROM_STRING" -to="$DESIRED_STRING" </file/name>
Example:
The command will produce the following output, stating the number of changes made/file.
perl -sple '$replaced++ if s/$from/$to/g;
END{if($replaced != 0){ print "[Info]: $replaced replacement done in $ARGV(from/to)($from/$to)"}
else {print "[Warning]: 0 replacement done in $ARGV(from/to)($from/$to)"}}' -- -from="timeout" -to="TIMEOUT" *
[Info]: 5 replacement done in main.yml(from/to)(timeout/TIMEOUT)
[Info]: 1 replacement done in task/main.yml(from/to)(timeout/TIMEOUT)
[Info]: 4 replacement done in defaults/main.yml(from/to)(timeout/TIMEOUT)
[Warning]: 0 replacement done in vars/main.yml(from/to)(timeout/TIMEOUT)
Note: I have removed -i from the above command , so it will not update the files for the people who are just trying out the command. If you want to enable in-place replacements in the file add -i after perl in above command.
To check if sed has changed MANY files (a recursive replace of all files in one directory that also produces a list of all modified files), use a workaround with two stages: match, then replace.
g='hello.*world'
s='s/hello.*world/bye world/g;'
d='./' # directory of input files
o='modified-files.txt'
grep -r -l -Z -E "$g" "$d" | tee "$o" | xargs -0 sed -i "$s"
the file paths in $o are zero-delimited
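To inspect the list afterwards, the NUL delimiters can be turned into newlines (a sketch):
tr '\0' '\n' < "$o"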
$ echo hi > abc.txt
$ sed "s/hi/bye/g; t; q1;" -i abc.txt && (echo "Changed") || (echo "Failed")
Changed
$ sed "s/hi/bye/g; t; q1;" -i abc.txt && (echo "Changed") || (echo "Failed")
Failed
https://askubuntu.com/questions/1036912/how-do-i-get-the-exit-status-when-using-the-sed-command/1036918#1036918
Note that q quits sed entirely: with -i on a multi-line file, everything after the first line that fails to match is lost, so this trick is only safe when the file has a single line or every line matches.
Don't use sed to tell if it has changed a file; instead, use grep to tell if it is going to change a file, then use sed to actually change the file. Notice the single line of sed usage at the very end of the Bash function below:
# Usage: `gs_replace_str "regex_search_pattern" "replacement_string" "file_path"`
gs_replace_str() {
    REGEX_SEARCH="$1"
    REPLACEMENT_STR="$2"
    FILENAME="$3"
    num_lines_matched=$(grep -c -E "$REGEX_SEARCH" "$FILENAME")
    # Count number of matches, NOT lines (`grep -c` counts lines),
    # in case there are multiple matches per line; see:
    # https://superuser.com/questions/339522/counting-total-number-of-matches-with-grep-instead-of-just-how-many-lines-match/339523#339523
    num_matches=$(grep -o -E "$REGEX_SEARCH" "$FILENAME" | wc -l)
    # If num_matches > 0
    if [ "$num_matches" -gt 0 ]; then
        echo -e "\n${num_matches} matches found on ${num_lines_matched} lines in file"\
            "\"${FILENAME}\":"
        # Now show these exact matches with their corresponding line 'n'umbers in the file
        grep -n --color=always -E "$REGEX_SEARCH" "$FILENAME"
        # Now actually DO the string replacing on the files 'i'n place using the `sed`
        # 's'tream 'ed'itor!
        sed -i "s|${REGEX_SEARCH}|${REPLACEMENT_STR}|g" "$FILENAME"
    fi
}
Place that in your ~/.bashrc file, for instance. Close and reopen your terminal and then use it.
Usage:
gs_replace_str "regex_search_pattern" "replacement_string" "file_path"
Example: replace do with bo so that "doing" becomes "boing" (I know, we should be fixing spelling errors not creating them :) ):
$ gs_replace_str "do" "bo" test_folder/test2.txt
9 matches found on 6 lines in file "test_folder/test2.txt":
1:hey how are you doing today
2:hey how are you doing today
3:hey how are you doing today
4:hey how are you doing today hey how are you doing today hey how are you doing today hey how are you doing today
5:hey how are you doing today
6:hey how are you doing today?
References:
https://superuser.com/questions/339522/counting-total-number-of-matches-with-grep-instead-of-just-how-many-lines-match/339523#339523
https://unix.stackexchange.com/questions/112023/how-can-i-replace-a-string-in-a-files/580328#580328

Search MS Word files in a directory for specific content in Linux

I have a directory structure full of MS Word files and I have to search the directory for a particular string. Until now I was using the following commands to search files in a directory:
find . -exec grep -li 'search_string' {} \;
find . -name '*' -print | xargs grep 'search_string'
But this search doesn't work for MS Word files.
Is it possible to do a string search in MS Word files on Linux?
I'm a translator and know next to nothing about scripting, but I was so pissed off about grep not being able to scan inside Word .doc files that I worked out how to make this little shell script to use catdoc and grep to search a directory of .doc files for a given input string.
You need to install the catdoc and docx2txt packages.
#!/bin/bash
echo -e "\n
Welcome to scandocs. This will search .doc AND .docx files in this directory for a given string. \n
Type in the text string you want to find... \n"
read response
find . -name "*.doc" |
while read i; do
    catdoc "$i" | grep --color=auto -iH --label="$i" "$response"
done
find . -name "*.docx" |
while read i; do
    docx2txt < "$i" | grep --color=auto -iH --label="$i" "$response"
done
All improvements and suggestions welcome!
Here's a way to use "unzip" to print the entire contents to standard output, then pipe to "grep -q" to detect whether the desired string is present in the output. It works for docx format files.
#!/bin/bash
PROG=`basename $0`
if [ $# -eq 0 ]
then
    echo "Usage: $PROG string file.docx [file.docx...]"
    exit 1
fi
findme="$1"
shift
for file in "$@"
do
    unzip -p "$file" | grep -q "$findme"
    [ $? -eq 0 ] && echo "$file"
done
Save the script as "inword" and search for "wombat" in three files with:
$ ./inword wombat file1.docx file2.docx file3.docx
file2.docx
Now you know file2.docx contains "wombat". You can get fancier by adding support for other grep options. Have fun.
The more recent versions of MS Word intersperse ascii[0] in between each of the letters of the text for purposes I cannot yet understand. I have written my own MS Word search utilities that insert ascii[0] in between each of the characters in the search field and it just works fine. Clumsy but OK. A lot of questions remain. Perhaps the junk characters are not always the same. More tests need to be done. It would be nice if someone could write a utility that would take all this into account. On my windows machine the same files respond well to searches.
We can do it!
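If the interleaved nulls are the only obstacle (interspersed NUL bytes are typical of UTF-16 encoded text), stripping them before searching may help (a speculative sketch, not from the original answer):
tr -d '\000' < some.doc | grep -a 'search_string'
Here grep -a forces the remaining binary content to be treated as text.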
In a .doc file the text is generally present and can be found by grep, but that text is broken up and interspersed with field codes and formatting information so searching for a phrase you know is there may not match. A search for something very short has a better chance of matching.
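For such short strings, grep can be pointed at the raw .doc bytes directly (a sketch; -a, i.e. --text, makes grep treat the binary file as text):
grep -la 'short_string' *.doc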
A .docx file is actually a zip archive collecting several files together in a directory structure (try renaming a .docx to .zip then unzipping it!) -- with zip compression it's unlikely that grep will find anything at all.
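A quick way to see this for a single file (the main body text of a .docx lives in word/document.xml inside the archive):
unzip -p document.docx word/document.xml | grep -q 'search_string' && echo found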
The opensource command line utility crgrep will search most MS document formats (I'm the author).
Have you tried awk '/Some|Word|In|Word/' document.docx ?
If it's not too many files, you can write a script that incorporates something like catdoc (http://manpages.ubuntu.com/manpages/gutsy/man1/catdoc.1.html) by looping over each file, performing catdoc and grep on it, storing the result in a bash variable, and outputting it if it's satisfactory, as sketched below.
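A minimal sketch of that loop (assumes catdoc is installed and 'search_string' stands in for your text):
for f in *.doc; do
    out=$(catdoc "$f" | grep -i 'search_string')
    if [ -n "$out" ]; then
        echo "$f:"
        echo "$out"
    fi
done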
If you have installed the program called antiword, you can use this command:
find -iname "*.doc" |xargs -I {} bash -c 'if (antiword {}|grep "string_to_search") > /dev/null 2>&1; then echo {} ; fi'
replace "string_to_search" in above command with your text. This command spits file name(s) of files containing "string_to_search"
The command is not perfect because works weird on small files (the result can be untrustful), becasue for some reseaon antiword spits this text:
"I'm afraid the text stream of this file is too small to handle."
if file is small (whatever it means .o.)
The best solution I came upon was to use unoconv to convert the Word documents to HTML. It also has a .txt output, but that dropped content in my case.
http://linux.die.net/man/1/unoconv
I've found a way of searching Word files (doc and docx) that uses the preprocessor functionality of ripgrep.
This depends on the following being installed:
ripgrep (more information about the preprocessor here)
LibreOffice
docx2txt
this catdoc2 script, which I've added to my $PATH:
#!/bin/bash
temp_dir=$(mktemp -d)
trap "rm $temp_dir/* && rmdir $temp_dir" 0 2 3 15
libreoffice --headless --convert-to "txt:Text (encoded):UTF8" --outdir "${temp_dir}" "$1" 1>/dev/null
cat "${temp_dir}/$(basename -s .doc "$1").txt"
The command pattern for a one-level recursive search is:
$ rg --pre <preprocessor> --glob <glob with filetype> <search string>
Example:
$ ls *
one:
a.docx
two:
b.docx c.doc
$ rg --pre docx2txt --glob *.docx This
two/b.docx
1:This is file b.
one/a.docx
1:This is file a.
$ rg --pre catdoc2 --glob *.doc This
two/c.doc
1:This is file c.
Here's the full script I use on macOS (Catalina, Big Sur, Monterey).
It's based on Ralph's suggestion, but uses the built-in textutil for .doc files.
#!/bin/bash
searchInDoc() {
    # in .doc
    find "$DIR" -name "*.doc" |
    while read -r i; do
        textutil -stdout -cat txt "$i" | grep --color=auto -iH --label="$i" "$PATTERN"
    done
}
searchInDocx() {
    for i in "$DIR"/*.docx; do
        # extract
        docx2txt.sh "$i" 1> /dev/null
        # point, grep, remove
        txtExtracted="$i"
        txtExtracted="${txtExtracted//.docx/.txt}"
        grep -iHn "$PATTERN" "$txtExtracted"
        rm "$txtExtracted"
    done
}
askPrompts() {
    local i
    for i in DIR PATTERN; do
        # prompt
        printf "\n%s to search: \n" "$i"
        # read & assign
        read -e REPLY
        eval "$i=$REPLY"
    done
}
makeLogs() {
    local i
    for i in results errors; do
        # extract dir for log name
        dirNAME="${DIR##*/}"
        # set var
        eval "${i}LOG=$HOME/$i-$PATTERN-$dirNAME.log"
        local VAR="${i}LOG"
        # warn if the log file already exists
        if [ -f "${!VAR}" ]; then
            printf "WARNING: %s will be overwritten.\n" "${!VAR}"
        fi
        # touch file
        touch "${!VAR}"
    done
}
checkDocx2txt() {
    # see if the software exists
    if ! command -v docx2txt.sh 1>/dev/null; then
        printf "\nWARNING: docx2txt is required.\n"
        printf "Use \e[3mbrew install docx2txt\e[0m.\n\n"
        exit
    else
        printf "\n~~~~~~~~~~~~~~~~~~~~~~~~\n"
        printf "Welcome to scandocs macOS.\n"
        printf "~~~~~~~~~~~~~~~~~~~~~~~~\n"
    fi
}
parseLogs() {
    # header
    printf "\n------\n"
    printf "Scandocs finished.\n"
    # results
    if [ ! -s "$resultsLOG" ]; then
        printf "But no results were found."
        printf "\"%s\" did not match in \"%s\"" "$PATTERN" "$DIR" > "$resultsLOG"
    else
        printf "See match results in %s" "$resultsLOG"
    fi
    # errors
    if [ ! -s "$errorsLOG" ]; then
        rm -f "$errorsLOG"
    else
        printf "\nWARNING: there were some errors. See %s" "$errorsLOG"
    fi
    # footer
    printf "\n------\n\n"
}
# the main flow
checkDocx2txt
askPrompts
makeLogs
{
    searchInDoc
    searchInDocx
} 1>"$resultsLOG" 2>"$errorsLOG"
parseLogs
