I'm writing a Bash script where I need to look through the output of a command and do certain actions based on that output. For clarity, this command will output a few million lines of text and it may take roughly an hour or so to do so.
Currently, I'm executing the command and piping it into a while loop that reads a line at a time and looks for certain criteria. If the criterion is found, the script updates a .dat file and reprints the screen. Below is a snippet of the script.
eval "$command"| while read line ; do
if grep -Fq "Specific :: Criterion"; then
#pull the sixth word from the line which will have the data I need
temp=$(echo "$line" | awk '{ printf $6 }')
#sanity check the data
echo "\$line = $line"
echo "\$temp = $temp"
#then push $temp through a case statement that does what I need it to do.
fi
done
So here's the problem: the sanity check on the data is showing weird results. It is printing lines that don't contain the grep criterion.
To make sure that my grep statement is working properly, I grepped the log file that keeps a record of the text output by the command, and it printed only the lines containing the specified criterion.
I'm still fairly new to Bash, so I'm not sure what's going on. Could it be that the command is force-feeding the while loop a new $line before it can process the $line that met the grep criterion?
Any ideas would be much appreciated!
How does grep know what $line looks like? You never pass it the line: with no file operand, grep reads from the same stdin the while loop reads from, so it silently consumes lines that read never sees.
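Here's a tiny demonstration of that effect (made-up input, not your command's output):

printf 'one\ntwo\nthree\n' | while read -r line ; do
    if grep -Fq "two"; then
        # grep just swallowed the rest of the input looking for "two",
        # so this branch fires while $line still holds the first line.
        echo "matched, but \$line = $line"    # prints: matched, but $line = one
    fi
done

To test the line you already read instead, feed $line to grep yourself: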
if ( printf '%s\n' "$line" | grep -Fq "Specific :: Criterion"); then
But I can't help feeling that you are overcomplicating this a lot.
function process() {
    echo "I can do anything I want"
    echo " per element $1"
    echo " that I want here"
}
export -f process
$command | grep -F "Specific :: Criterion" | awk '{print $6}' | xargs -I % -n 1 bash -c "process %";
Run the command, filter only the matching lines, and pull the sixth field. Then, if you need to run arbitrary code on it, send it to a function (exported so it is visible in subprocesses) via xargs.
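A slightly safer variant of that last pipeline (just a sketch): pass the extracted field to the inner shell as a positional argument instead of splicing it into the command string, so odd characters in the data can't be re-parsed by the shell.

$command | grep -F "Specific :: Criterion" | awk '{print $6}' | xargs -I % bash -c 'process "$1"' _ %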
What are you applying the grep to?
Modify
if grep -Fq "Specific :: Criterion"; then
as below
if ( echo "$line" | grep -Fq "Specific :: Criterion" ); then
Related
I am trying to count the number of line matches in a very LARGE file and store the counts in variables using only Bash shell commands.
Currently, I am scanning the results of a very large file twice and using a separate grep statement each time, like so:
$ cat test.txt
first example line one
first example line two
first example line three
second example line one
second example line two
$ FIRST=$( cat test.txt | grep 'first example' | wc --lines ; ) ; ## first run
$ SECOND=$(cat test.txt | grep 'second example' | wc --lines ; ) ; ## second run
and I end up with this:
$ echo $FIRST
3
$ echo $SECOND
2
Ideally, I want to scan the large file only once. Also, I have never used Awk and would rather not use it!
The |tee option is new to me. It seems that passing the results into two separate grep statements may mean that we only have to scan the large file once.
Ideally, I would also like to be able to do this without having to create any temporary files & subsequently having to remember to delete them.
I have tried multiple ways using something like these below:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST=$( grep 'first example' | wc --lines ;);) \
>(SECOND=$(grep 'second example' | wc --lines ;);) \
>/dev/null ;
and using read:
FIRST=''; SECOND='';
cat test.txt \
|tee >(grep 'first example' | wc --lines | (read FIRST); ); \
>(grep 'second example' | wc --lines | (read SECOND); ); \
> /dev/null ;
cat test.txt \
| tee <( read FIRST < <(grep 'first example' | wc --lines )) \
<( read SECOND < <(grep 'second example' | wc --lines )) \
> /dev/null ;
and with curly brackets:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST={$( grep 'first example' | wc --lines ;)} ) \
>(SECOND={$(grep 'second example' | wc --lines ;)} ) \
>/dev/null ;
but none of these allow me to save the line count into variables FIRST and SECOND.
Is this even possible to do?
tee isn't saving any work. Each grep is still going to do a full scan of the file. Either way you've got three passes through the file: two greps and one Useless Use of Cat. In fact tee actually just adds a fourth program that loops over the whole file.
The various | tee invocations you tried don't work because of one fatal flaw: variable assignments don't work in pipelines. That is to say, they "work" insofar as a variable is assigned a value, it's just the value is almost immediately lost. Why? Because the variable is in a subshell, not the parent shell.
Every command in a | pipeline executes in a different process and it's a fundamental fact of Linux systems that processes are isolated from each other and don't share variable assignments.
As a rule of thumb, you can write variable=$(foo | bar | baz) where the variable is on the outside. No problem. But don't try foo | variable=$(bar) | baz where it's on the inside. It won't work and you'll be sad.
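A tiny demonstration of the subshell problem (made-up data):

first=0
printf 'first\nsecond\nfirst\n' | { first=$(grep -c 'first'); echo "inside the pipeline: first=$first"; }
echo "after the pipeline: first=$first"    # still 0 - the assignment died with the subshell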
But don't lose hope! There are plenty of ways to skin this cat. Let's go through a few of them.
Two greps
Getting rid of cat yields:
first=$(grep 'first example' test.txt | wc -l)
second=$(grep 'second example' test.txt | wc -l)
This is actually pretty good and will usually be fast enough. Linux maintains a large page cache in RAM. Any time you read a file Linux stores the contents in memory. Reading a file multiple times will usually hit the cache and not the disk, which is super fast. Even multi-GB files will comfortably fit into modern computers' RAM, particularly if you're doing the reads back-to-back while the cached pages are still fresh.
One grep
You could improve this by using a single grep call that searches for both strings. It could work if you don't actually need the individual counts but just want the total:
total=$(grep -e 'first example' -e 'second example' test.txt | wc -l)
Or if there are very few lines that match, you could use it to filter down the large file into a small set of matching lines, and then use the original greps to pull out the separate counts:
matches=$(grep -e 'first example' -e 'second example' test.txt)
first=$(grep 'first example' <<< "$matches" | wc -l)
second=$(grep 'second example' <<< "$matches" | wc -l)
Pure bash
You could also build a Bash-only solution that does a single pass and invokes no external programs. Forking processes is slow, so using only built-in commands like read and [[ can offer a nice speedup.
First, let's start with a while read loop to process the file line by line:
while IFS= read -r line; do
    ...
done < test.txt
You can count matches by using double square brackets [[ and string equality ==, which accepts * wildcards:
first=0
second=0
while IFS= read -r line; do
    [[ $line == *'first example'* ]] && ((++first))
    [[ $line == *'second example'* ]] && ((++second))
done < test.txt
echo "$first" ## should display 3
echo "$second" ## should display 2
Another language
If none of these are fast enough then you should consider using a "real" programming language like Python, Perl, or, really, whatever you are comfortable with. Bash is not a speed demon. I love it, and it's really underappreciated, but even I'll admit that high-performance data munging is not its wheelhouse.
If you're going to be doing things like this, I'd really recommend getting familiar with awk; it's not scary, and IMO it's much easier to do complex things like this with it vs. the weird pipefitting you're looking at. Here's a simple awk program that'll count occurrences of both patterns at once:
awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt
Explanation: /first example/ {first++} means for each line that matches the regex pattern "first example", increment the first variable. /second example/ {second++} does the same for the second pattern. Then END {print first, second} means at the end, it should print the two variables. Simple.
But there is one tricky thing: splitting the two numbers it prints into two different variables. You could do this with read:
bothcounts=$(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
read first second <<<"$bothcounts"
(Note: I recommend using lower- or mixed-case variable names, to avoid conflicts with the many all-caps names that have special functions.)
Another option is to skip the bothcounts variable by using process substitution to feed the output from awk directly into read:
read first second < <(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
">" is about redirect to file/device, not to the next command in pipe. So tee will just allow you to redirect pipe to multiple files, not to multiple commands.
So just try this:
FIRST=$(grep 'first example' test.txt| wc --lines)
SECOND=$(grep 'second example' test.txt| wc --lines)
It's possible to get the matches and count them in a single pass, then pull the count of each from the result.
matches="$(grep -e 'first example' -e 'second example' --only-matching test.txt | sort | uniq -c | tr -s ' ')"
FIRST=$(grep -e 'first example' <<<"$matches" | cut -d ' ' -f 2)
echo $FIRST
Result:
3
Using awk is the best option I think.
So I'm currently trying to grep a single result from a random file in a specific directory. The grepping works just fine and the expected output file is populated as expected, but for some reason, even after the output file has been filled, the process won't stop. This is the grep command where the program seems to be getting stuck:
searchFILE(){
    case $2 in
        pref)
            echo "Populating output file: $3-$1.data.out"
            dataOutputFile="$3-$1.data.out"
            zgrep -a "\"someParameter\"\:\"$1\"" /folder/anotherFolder/filetemplate.log.* | zgrep -a "\"parameter2\"\:\"$3\"" | head -1 > $dataOutputFile
            ;;
        *)
            echo "Unrecognized command"
            ;;
    esac
    echo "Query finished"
}
What is currently happening is that the output file is being populated as expected with the head pipe, but for some reason I'm not getting the "Query finished" message, and the process seems not to stop at all.
grep does not know that head -n1 is no longer reading from the pipe until it attempts to write to the pipe, which it will only do if another match is found. There is no direct communication between the processes. It will eventually stop, but only once all the data is read, a second match is found and write fails with EPIPE, or some other error occurs.
You can watch this happen in a simple pipeline like this:
cat /dev/urandom | grep -ao "12[0-9]" | head -n1
With a sufficiently rare pattern, you will observe a delay between output and exit.
One solution is to change your stop condition. Instead of waiting for SIGPIPE as your pipeline does, wait for grep to match once using the -m1 option:
cat /dev/urandom | grep -ao -m1 "12[0-9]"
I saw better performance results with the zcat myZippedFile | grep whatever approach...
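For the function in the question, that would look roughly like this (a sketch only; it assumes the rotated filetemplate.log.* files are gzip-compressed, and it keeps the original head -1 ending):

zcat /folder/anotherFolder/filetemplate.log.* \
    | grep -a "\"someParameter\":\"$1\"" \
    | grep -a "\"parameter2\":\"$3\"" \
    | head -1 > "$dataOutputFile"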
The first difference to try is piping into | head -z --lines=1.
The reason is null-terminated lines instead of newlines (just in case).
My example script below worked (you can drop the case statement to make it simpler). If I hold onto $1 $2 inside functions, things go wrong. I use named $parameters and only touch $1 $2 $# once, because it also goes wrong for me if I don't, and in any case you can then shift over the arguments and catch them. The $# in the script itself is not the same as inside bash functions.
Grep searching for two or more parameters in any order means using grep twice; in your case zgrep | grep. The second grep is a normal grep! You only need the first grep to be zgrep to do the unzip. Your question is simpler if you drop the case statement, as bash case scares people off; bash was always ugly, but it works well for short scripts.
zgrep searches plain or compressed text, but LINUX-style and WINDOWS-style newlines are not the same, so use dos2unix to convert files so that newline matching works. I use a compressed file simply because it is strange and rare to see zgrep, so it is demonstrated in a shell script with a compressed file! It works for me. I changed a few things, like >> and "sort -u", but you can obviously change them back.
#!/usr/bin/env bash
# Search for egA AND egB using option go
# COMMAND LINE: ./zgrp egA go egB
A="$1"
cOPT="$2" # expecting case go
B="$3"
LOG="./filetemplate.log" # use parameters for long names.
# Generate some data with gzip and delete the temporary file.
echo "\"pramA\":\"$A\" \"pramB\":\"$B\"" >> $B$A.tmp
rm -f ${LOG}.A; tar czf ${LOG}.A $B$A.tmp
rm -f $B$A.tmp
# Use parameterised $names, not $1 etc., because you may want to do shift etc.
searchFILE()
{
    outFile="$B-$A.data.out"
    case $cOPT in
        go) # This is zgrep | grep, NOT zgrep | zgrep
            zgrep -a "\"pramA\":\"$A\"" ${LOG}.* | grep -a "\"pramB\":\"$B\"" | head -z --lines=1 >> $outFile
            sort -u $outFile > ${outFile}.sorted # sort unique on your output.
            ;;
        *)  echo -e "ERROR second argument must be go.\n Usage: ./zgrp egA go egB"
            exit 9
            ;;
    esac
    echo -e "\n ============ Done: $0 $# Fin. ============="
}
searchFILE "$#"
cat ${outFile}.sorted
I am trying to implement a bash script that takes piped input, cuts the first column, and returns a total. The difference is that the cut is not from a file, but from the piped input. For example:
#!/bin/bash
total=$(cut -f1 $1 | numsum)
echo $total
Basically, in my scenario the first column will always be from the passed input.
Example usage is:
./script.sh "cat data.txt | grep -i status | grep something"
data.txt contains:
1 status
2 something
This will produce something like:
2
How can this be achieved? I have noticed that cut only works on files. I cannot see any examples on Google.
I have managed to fix the issue myself. The code is:
#!/bin/bash
total=$(eval $1 | awk '{sum+=$1} END {print sum}')
echo $total
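An alternative sketch, in case you'd rather avoid eval: have the script read the pipeline's output on stdin and do the summing there, then pipe into it (the invocation in the comment is just an example):

#!/bin/bash
# Usage: cat data.txt | grep -i status | grep something | ./script.sh
# awk with no file argument reads the script's stdin and sums the first column.
total=$(awk '{sum+=$1} END {print sum}')
echo "$total"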
I have a file containing a list of 4000 words (A.txt). Now I want to grep the lines from another file (sentence_per_line.txt) that contain any of those 4000 words mentioned in A.txt.
The shell script I wrote for the above problem is
#!/bin/bash
file="A.txt"
while IFS= read -r line
do
    # display $line or do something with $line
    printf '%s\n' "$line"
    grep $line sentence_per_line.txt >> output.txt
    # tried printing the grep command to check if it's working or not
    result=$(grep "$line" sentence_per_line.txt >> output.txt)
    echo "$result"
done <"$file"
And A.txt looks like this
applicable
available
White
Black
..
The code is neither working nor showing any error.
Grep has this built in:
grep -f A.txt sentence_per_line.txt > output.txt
Remarks to your code:
Looping over a file to execute grep/sed/awk on each line is typically an antipattern, see this Q&A.
If your $line parameter contains more than one word, you have to quote it (doesn't hurt anyway), or grep tries to look for the first word in a file named after the second word:
grep "$line" sentence_per_line.txt >> output.txt
If you write output in a loop, don't redirect within the loop, do it outside:
while read -r line; do
grep "$line" sentence_per_line.txt
done < "$file" > output.txt
but remember, it's usually not a good idea in the first place.
If you'd like to write to a file and at the same time see what you're writing, you can use tee:
grep "$line" sentence_per_line.txt | tee output.txt
writes to output.txt and stdout.
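One caveat if you put that inside the loop (a sketch): plain tee truncates output.txt on every iteration, so either use tee -a or, better, attach the tee to the whole loop:

while read -r line; do
    grep "$line" sentence_per_line.txt
done < "$file" | tee output.txt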
If A.txt contains words which you want to match only if the complete word matches, i.e., pattern should not match longerpattern, you can use grep -wf – the -w matches only complete words.
If the words in A.txt aren't regular expressions but fixed strings, you can use grep -Ff – the -F option looks for fixed strings and is faster. These two can be combined: grep -wFf
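For example, with the file names from the question, treating every entry in A.txt as a fixed string and matching whole words only (a sketch of the combined form):

grep -wFf A.txt sentence_per_line.txt > output.txt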
I'm currently figuring my way around a bash script (sorry, can't use other languages like Perl) to keep track of a running log during a server startup. Basically, I have to trigger certain events depending on whether or not I run into certain strings or patterns while the log is being written. Currently, I have this code:
LOG=path_to_logfile
LINE1="[1-9][0-9]* some string"
LINE2="another string"
LINE3="third string"
tail -fn0 $LOG | \
while read line
do
    echo $line | grep "$LINE1" || echo $line | grep "$LINE2" || echo $line | grep "$LINE3"
    if [ $? = 0 ]
    then
        TMP=<echo line above>
        ... bunch of conditional statements...
    fi
done
However, this is kinda slow; by the time the line I need to track is detected by the chain of echo/grep combinations, it's way past the point where the server has already started up. What's a good alternative to the above? I've read that awk should be used, but when I tried writing it in awk, either I wrote it wrong or the processing was also taking too much time to finish.
Any help will be appreciated. Thanks!
Rather than calling grep (potentially several times) on each line, let bash do the regular expression matching.
LOG=path_to_logfile
LINE1="[1-9][0-9]* some string"
LINE2="another string"
LINE3="third string"
tail -fn0 $LOG | while read line
do
    if [[ $line =~ $LINE1|$LINE2|$LINE3 ]]; then
        TMP=<echo line above>
        ... bunch of conditional statements...
    fi
done
I'd try something like this instead:
tail -fn0 $LOG | egrep "$LINE1|$LINE2|$LINE3" | \
while read TMP
do
    ...
done
That way, the while read loop, which at a guess is going to be the slowest part of this whole operation, is only invoked when egrep actually finds a matching line in the input log.
You can have multiple match statements, which are ORed together to see if the line matches:
tail -f -n0 "$LOG" | grep -e "$LINE1" -e "$LINE2" -e "$LINE3" | while IFS= read -r line
do
    # Do something with each matching $line
done