find and grep: get filenames - linux

I need to find the reports (.docx files), read them with docx2txt, find the second match of "passed" (excluding "not passed"), and save these filenames to a text file. Here is what I tried:
OIFS="$IFS"
IFS=$'\n'
for f in $(find . -wholename '*_done/(*Report*.docx' |grep -v appendix)
do
docx2txt "$f" - | (grep -q -m2 passed || grep -q -v "not passed") || echo $f >> failed
done
IFS="$OIFS"
But this script gives me an empty file. If I replace || with && before the echo, all filenames are stored in the file. grep works fine when it is not in the script, as does docx2txt. What am I doing wrong here?

There are quite a lot of problems with the grep commands.
grep -q always exits successfully on the first match.
With -q the -m2 has no effect: if there is one match, grep exits successfully. It does not check whether there is a second match.
To check that there are (at least) two matches, count the matches and then use test/[ ] to check the number of found matches. If there is at most one passed per line, grep -c is sufficient. If there can be multiple matches per line, you need grep -o ... | wc -l.
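A quick illustration of the difference (grep -c counts matching lines, not matches):
$ printf 'passed\npassed passed\n' | grep -c passed
2
$ printf 'passed\npassed passed\n' | grep -o passed | wc -l
3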
-q and -v together mean: is there at least one line that does not contain the pattern? When grep finds such a line, it exits successfully. The only way for this command to fail is an input in which every line contains not passed (this includes the empty file).
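You can reproduce this in any shell:
$ printf 'not passed\nnot passed\n' | grep -qv 'not passed'; echo $?
1
$ printf 'not passed\npassed\n' | grep -qv 'not passed'; echo $?
0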
Matching passed but not not passed is trickier than one might suspect. If there can be at most one passed/not passed per line, you can use grep -v 'not passed' | grep passed. Otherwise you need a negative lookbehind, which is only available in Perl-compatible regular expressions (PCRE).
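With GNU grep's -P, the lookbehind version looks like this:
$ echo 'not passed but passed' | grep -oP '(?<!not )passed'
passed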
In addition to that command | (grep ... || grep ...) might not do what you expect. command produces output only once. After the first grep read some of this output, that read part is gone. The second grep will then continue reading where the first grep stopped.
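The shared-stream effect is easy to demonstrate with read, which consumes exactly one line and leaves the rest for the next reader:
$ printf 'a\nb\nc\n' | { read -r first; cat; }
b
c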
BTW: for … in $(find … | grep -v …) can be turned into a single, safe find command using -not and -exec.
Solution
If each line contains at most one passed/not passed, use
find . -wholename '*_done/(*Report*.docx' -not -wholename '*appendix*' \
-exec sh -c '[ $(docx2txt "$0" - | grep -v "not passed" | grep -cm2 passed) = 2 ]' {} \; -print
If there can be multiple passed/not passed per line, you need GNU grep or pcregrep:
find . -wholename '*_done/(*Report*.docx' -not -wholename '*appendix*' \
-exec sh -c '[ $(docx2txt "$0" - | grep -Pom2 "(?<!not )passed" | wc -l) = 2 ]' {} \; -print
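A note on the construction: find passes {} to sh as $0 here, and -print only fires for files where the -exec test succeeded, so you can collect the matching names simply by redirecting the output (the output file name here is just an example):
find . -wholename '*_done/(*Report*.docx' -not -wholename '*appendix*' \
-exec sh -c '[ $(docx2txt "$0" - | grep -v "not passed" | grep -cm2 passed) = 2 ]' {} \; -print > matching_reports.txt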

When you run into a problem like this, it's a good idea to remove as much code as possible. If we just take that one line with the multiple grep statements, we can first verify that the current expression doesn't work:
$ echo passed | (grep -q -m2 passed || grep -q -v "not passed") || echo failed
$ echo not passed | (grep -q -m2 passed || grep -q -v "not passed") || echo failed
We can see that neither of these commands produces any output.
Let's think carefully about the logic:
The || operator means "if the first command doesn't succeed, run the second command". In both cases the first grep succeeds (because both passed and not passed contain the string passed), so the second grep never runs. And since the first command succeeded, the entire grep ... || grep ... group succeeds, which means the final echo $f never runs.
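You can confirm that the first grep succeeds even on the negative case, because not passed contains passed as a substring:
$ echo 'not passed' | grep -q -m2 passed; echo $?
0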
I was trying to think of a clever way to solve this, but it seems simplest if we make use of a temporary file:
OIFS="$IFS"
IFS=$'\n'
tmpfile=$(mktemp docXXXXXX)
trap "rm -f $tmpfile" EXIT
for f in $(find . -wholename '*_done/(*Report*.docx' |grep -v appendix)
do
docx2txt "$f" - | head -2 > $tmpfile
if grep -q passed $tmpfile && ! grep -q 'not passed' $tmpfile; then
echo $f >> failed
fi
done
IFS="$OIFS"

Related

Using multiple arguments to find and grep

I am trying to make a script so that I can give a command with a variable number of arguments, myfind one two three, and it finds all files in the folder, then applies grep -i one, then grep -i two, then grep -i three, and so on.
I tried the following code:
#! /bin/bash
FULLARG="find . | "
for arg in "$#"
do
FULLARG=$FULLARG" grep -i "$arg" | "
done
echo $FULLARG
$FULLARG
However, though the command is created, it does not work and gives the following error:
$ ./myfind one two three
find . | grep -i one | grep -i two | grep -i three |
find: unknown predicate `-i'
Where is the problem and how can it be solved?
You could store the result of find . and keep filtering it with each command-line argument:
#!/bin/bash
result="$(find .)"
for arg in "$#"
do
result=$(echo "$result" | grep -i "${arg}")
done
echo "$result"

Set file modification time from the date string present in the filename

I'm restoring a number of archives with dates within their names, something along the lines of:
user-2018.12.20.tar.xz
user-2019.01.10.tar.xz
user-2019.02.25.tar.xz
user-2019.04.19.tar.xz
...
I want to set each file's modification date to match the date in their filename by piping the filenames to touch via xargs and using replace-str to set the dates.
touch -m -t will take a datetime in the format [CCYYMMDDhhmm], but I'm having trouble substituting inline:
find . -name "*.xz" | xargs -I {} touch -m -t $(sed -e 's/\.tar\.xz//g; s/user-//g; s/\.//g; s/\///g; s/$/0000/g' {}) {}
Returns touch: invalid date format ‘./user-2018.03.22.tar.xz’, even though this:
find . -name "*.xz" | sed -e 's/\.tar\.xz//g; s/user-//g; s/\.//g; s/\///g; s/$/0000/g'
Returns properly-formatted dates, for example 201812200000. Am I misusing command substitution in my replace string somehow?
EDIT : Yes, a simple script could do this no problem. But the question remains...
You don't need find, sed, xargs, or any third-party tools; just use the shell's built-in regex support to extract the timestamp from the filename:
for file in *.tar.xz; do
[ -f "$file" ] || continue
if [[ $file =~ ^user-([[:digit:]]+)\.([[:digit:]]+)\.([[:digit:]]+)\.tar\.xz$ ]]; then
dateStr="${BASH_REMATCH[1]}${BASH_REMATCH[2]}${BASH_REMATCH[3]}0000"
touch -m -t "$dateStr" "$file"
fi
done
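To see what the loop builds for one of the sample names, here is a transcript (bash):
$ file=user-2018.12.20.tar.xz
$ [[ $file =~ ^user-([[:digit:]]+)\.([[:digit:]]+)\.([[:digit:]]+)\.tar\.xz$ ]] && echo "${BASH_REMATCH[1]}${BASH_REMATCH[2]}${BASH_REMATCH[3]}0000"
201812200000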
The problem is that the command substitution will be evaluated once when you call xargs, not for each argument. You would need to spawn a shell for that:
find . -name "*.xz" \
| xargs -I {} bash -c 'touch -m --date "$(sed -e "s/\.tar\.xz//;s/user-//g; s/\.//g; s/\///g;" <<< "$1")" "$1"' -- {}
Note: xargs is not needed because you can use the -exec option of find:
find . -name "*.xz" -exec bash -c 'touch -m --date "$(sed -e "s/\.tar\.xz//;s/user-//g; s/\.//g; s/\///g;" <<< "$1")" "$1"' -- {} \;
PS: A small for loop would be more readable:
for file in user-*.tar.xz ; do
# remove prefix and suffix
date=${file#user-}
date=${date%.tar.xz}
# replace dots by /
date=${date//./\/}
touch -m --date "${date}" "${file}"
done
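You can check the result afterwards with GNU stat; the exact timestamp and timezone in the output will depend on your system:
$ stat -c '%y %n' user-2018.12.20.tar.xz
2018-12-20 00:00:00.000000000 +0100 user-2018.12.20.tar.xz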
This might work for you (GNU parallel):
parallel --dryrun touch -m --date '{= s/[^0-9]//g =}' {} ::: *.xz
When happy that the commands are correct, then remove the --dryrun option.
Alternative:
parallel touch -m --date '{= s/user-//;s/\.tar\.xz//;s/\.//g =}' {} ::: *.xz

Perl Script to Grep Directory For String and Print

I would like to create a perl or bash script that will read keyboard input and assign it to a variable, perform a fixed-string grep recursively within the current directory filled with Snort logs, then automatically tcpdump the matched files, grep the output, and print the specified lines to the terminal. Does anyone have a good idea of how this should work?
Here is an example of the methodology I want from the script:
step 1: Read keyboard input and assign it to a variable named string.
step 2 command: grep -Fr "$string"
step 2 output: snort.log.1470609906 matches
step 3 command: tcpdump -r snort.log.1470609906 | grep -F "$string" -C10
step 3 output:
Snort log
Here's some bash code that does that:
s="google.com"
grep -Frl "$s" | \
while IFS= read -r x; do
tcpdump -r "$x" | grep -F "$s" -C10
done
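One portability note: with -r and no path operand, recent GNU grep searches the current directory, so the grep above is equivalent to the following; with other grep implementations you may need to spell out the directory:
grep -Frl "$s" .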
I don't know about Perl, but you can do it easily enough just in shell:
str="google.com"
find . -type f -name 'snort.log.*' -exec grep -FlZ "$str" {} + |
xargs -0 -I {} sh -c 'tcpdump -r "$1" | grep -F "$2" -C10' sh {} "$str"

how to use && with grep in bash

I want to check if multiple lines exist in a file in bash.
For that I use grep -q, which works with only one line:
if grep -q string1 "/path/to/file";then
echo 'exists'
else
echo 'does not exist'
fi
I tried many things in various combinations, for example:
if grep -q [ string1 ] && grep -q [ string2 ] "path/to/file";then
I also tried it with -E:
grep -E 'pattern1' filename | grep -E 'pattern2'
but nothing seems to work. Any ideas?
Rather than running multiple grep commands, you can use this gnu-awk command to assert the presence of multiple strings in a file:
awk -v RS='\\Z' '/string1/ && /string2/ && /string3/{e=1} END{exit !e}' file &&
echo 'exists' || echo 'does not exist'
RS='\\Z' makes awk read all the input as a single record.
Using && between multiple search terms makes sure all the search words exist in the input file.
This will print exists only if all 3 search terms exist in the input file.
Since @iruvar hasn't posted his comment as an answer, I'll put it here:
grep -q string_1 file && grep -q string_2 file
Now, here is my contribution: is @anubhava's more computationally complex awk answer, which reads the file only once, any faster than @iruvar's simpler answer, which reads the file three times?
awk 11.730 s
grep && grep 0.258 s
No.
This surely will depend on the speed of the filesystem vs the CPU, and on how much caching goes on, but on my system, which is probably a typical B+/A- workstation, grep kw1 file && grep kw2 file && grep kw3 file is ~50x as fast as @anubhava's awk solution. This held true both on SSD and spindle RAID. (Details: the test file was 5,000,000 lines, 160M, with kw1 on the first line, kw2 on the 2.5-millionth, and kw3 on the 5-millionth.)
Some easy optimization is possible: for example, if you can solve your problem by matching whole lines, do so (with grep -x); it's twice as fast in this case.
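To illustrate what -x changes, it requires the pattern to match the whole line:
$ printf 'kw1 and more\nkw1\n' | grep -cx kw1
1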
For many (e.g., >1,000) files, it is faster to use grep -l and xargs:
grep -lZ kw1 *.txt | xargs -0 grep -lZ kw2 | xargs -0 grep -q kw3
as opposed to a loop:
for f in *.txt; do
    grep -q kw1 "$f" && grep -q kw2 "$f" && grep -q kw3 "$f"
done
With the same test file, grep -l | xargs grep took 0.258 s, just like grep && grep. With two test files, it was still no faster than grep && grep. With 2000 test files of 5,000 lines each, none of which contained any matches, grep -l | xargs grep was ~10x as fast as grep && grep.
There are a couple of ambiguities in your question, but assuming you want pattern_1 and pattern_2 to exist in a file (not necessarily on the same line), you can do this:
for file in *; do
    grep -Eq pattern_1 "$file" && grep -Eq pattern_2 "$file" && echo "$file"
done
With grep -P you can match multiple patterns in the same line:
grep -P '(?=.*string1)(?=.*string2)' file
The above will print lines that match both string1 and string2.
(?=...) is a positive lookahead, which matches a pattern without making it part of the match.
And -z will slurp the whole file:
% seq 1 100 | grep -qzP '(?=.*1)(?=.*5)'; echo $?
0
% seq 1 100 | grep -qzP '(?=.*1)(?=.*a)'; echo $?
1
You can do it like this:
if grep -q 'string1' /path/to/file; then
    if grep -q 'string2' /path/to/file; then
        echo 'exists'
    else
        echo 'does not exist'
    fi
else
    echo 'does not exist'
fi
Or:
grep -q 'string1' /path/to/file &&
grep -q 'string2' /path/to/file &&
echo exists ||
echo 'does not exist'
you can use "-q" to search using grep
if grep -q string1 "/path/to/file" && grep -q string2 "/path/to/file";then
echo 'exists'
else
echo 'does not exist'
fi

How do I search for a file based on what is output by a command running on that file

I am working on a project for one of my professors, and he asked me to sort a couple hundred .fits images based on their header files (specifically, what star they are images of). I think that grep would be the best way to do this; however, I can't seem to figure out how to use grep based on the header.
I am entering:
ls | imhead *.fits | grep -E -r "PG\ 1104+243" *
to just list them out for now; once they are listed, I know how to copy them into a directory.
I am new to using grep, so I am unsure where my error lies. Any help would be greatly appreciated! Thanks!
Assuming that imhead will extract the headers of the .fits files as text, you can use a simple shell script to do it:
script.sh
#!/bin/bash
grep "$1" "$2" > /dev/null 2>&1 && echo "$2"
Note that + is a special character if you use extended regular expressions, i.e. if you pass -E as in the question. A plain grep without any options should do the trick here.
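You can see the difference directly:
$ echo 'PG 1104+243' | grep -E 'PG 1104+243'; echo $?
1
$ echo 'PG 1104+243' | grep 'PG 1104+243'
PG 1104+243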
Use find to exec the script on every *.fits file in the current folder:
find . -maxdepth 1 -name '*.fits' -exec ./script.sh 'PG 1104+243' {} \;
If you are going to copy/move/alter or do something with the files you find, you might be better off, in terms of complexity and ease of quoting, using a loop like this:
#!/bin/bash
find . -name '*.fits' -print0 | while IFS= read -r -d '' file; do
    echo "Checking file: $file"
    if imhead "$file" | grep -q 'PG 1104+243'; then
        echo "Object matches: $file"
    fi
done
