Grep versus Awk: How do the search mechanisms differ?

I am writing a script that must loop: on each iteration, different scripts pull variables from external files, and the last step compiles them. I am trying to maximize the speed at which this loop can run, and thus to find the best programs for the job.
The rate-limiting step right now is searching through a file which has 2 columns and 4.5 million lines. Column one is a key and column two is the value I am extracting.
The two programs I am evaluating are awk and grep. I have put the two scripts and their run times for finding the last value below.
time awk -v a=15 'BEGIN{B=10000000}$1==a{print $2;B=NR}NR>B{exit}' infile
T
real 0m2.255s
user 0m2.237s
sys 0m0.018s
time grep "^15 " infile |cut -d " " -f 2
T
real 0m0.164s
user 0m0.127s
sys 0m0.037s
This brings me to my question... how does grep search? I understand awk runs line by line and field by field, which is why it takes longer as the file gets longer and I have to search further into it.
How does grep search? Clearly not line by line, or if it is, it does so in a much different manner than awk, considering the almost 20x time difference.
(I have noticed awk runs faster than grep for short files, and I have yet to find the point where they diverge, but at those sizes it really doesn't matter nearly as much!)
I'd like to understand this so I can make good decisions for future program usage.

The awk command you posted does far more than the grep+cut:
awk -v a=15 'BEGIN{B=10000000}$1==a{print $2;B=NR}NR>B{exit}' infile
grep "^15 " infile |cut -d " " -f 2
so a time difference is very understandable. Try this awk command, which IS equivalent to the grep+cut, and see what results you get so we can compare apples to apples:
awk '/^15 /{print $2}' infile
or even:
awk '$1==15{print $2}' infile
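If the key appears only once and the early exit in your original awk is the part you care about, both tools can stop at the first match; a rough equivalent (assuming your grep supports the -m option, as GNU grep does) would be:
awk '$1==15{print $2; exit}' infile
grep -m 1 "^15 " infile | cut -d " " -f 2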

Related

Linux Shell Bash Split text file by columns

I have a massive list of English words that looks something like this:
1. do accomplish, prepare, resolve, work out
2. say suggest, disclose, answer
3. go continue, move, lead
4. get bring, attain, catch, become
5. make create, cause, prepare, invest
6. know understand, appreciate, experience, identify
7. think contemplate, remember, judge, consider
8. take accept, steal, buy, endure
9. see detect, comprehend, scan
10. come happen, appear, extend, occur
11. want choose, prefer, require, wish
12. look glance, notice, peer, read
13. use accept, apply, handle, work
14. find detect, discover, notice, uncover
15. give grant, award, issue
16. tell confess, explain, inform, reveal
And I would like to be able to extract the second column:
do
say
go
get
make
know
think
take
see
come
want
look
use
find
give
tell
Does anybody know how to do this in bash?
Thanks.
Using bash
$ cat tst.sh
#!/usr/bin/env bash
while read -r line; do
    line=${line//[0-9.]}   # strip the digits and dot of the leading numbering
    line=${line/,*}        # drop everything from the first comma onwards
    echo ${line% *}        # drop the trailing synonym; unquoted echo also trims the leading space
done < /path/to/input_file
$ ./tst.sh
do
say
go
get
make
know
think
take
see
come
want
look
use
find
give
tell
Using sed
$ sed 's/[^a-z]*\([^ ]*\).*/\1/' input_file
do
say
go
get
make
know
think
take
see
come
want
look
use
find
give
tell
There are a lot of ways to do that:
awk '{print $2}' input-file
cut -d ' ' -f 5 input-file # assuming 4 spaces between the first columns
< input-file tr -s ' ' | cut -d ' ' -f 2
< input-file tr -s ' ' \\t | cut -f 2
perl -lane 'print $F[1]' input-file
sed 's/[^ ]* *\([^ ]*\).*/\1/' input-file
while read a b c; do printf '%s\n' "$b"; done < input-file
What about awk:
cat file.txt | awk '{print $2}'
Does this work?
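A side note on that last one: the cat is not needed, since awk can read the file directly, which saves a process:
awk '{print $2}' file.txt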

Grep a "text" and print all the line before and after the text. each log session is separated by 2 blank lines

1
1
1
1
Text
11
1
1
1
The text above can be anywhere, therefore grep -C won't help.
I have tried this with awk, but I want to do it using grep:
zgrep -C25 "Text" engine.*.log.gz
doesn't work, as the text may appear anywhere.
I know the awk option, but it doesn't work on .gz files; I would have to convert them first, which is a lengthy process:
awk -v RS= '/Text/' engine.*.log.gz
Just do it the trivial, obvious, robust, portable way:
zcat engine.*.log.gz | awk -v RS= '/Text/{print; exit}'
That should also be pretty efficient since the SIGPIPE that zcat gets when awk exits on finding the first Text should terminate the zcat.
Or if Text can appear multiple times in the input and you want all associated records output:
zcat engine.*.log.gz | awk -v RS= -v ORS='\n\n' '/Text/'
Thank you so much for the help, but I figured out the solution
zgrep -C34 "text" engine.*.log.gz |awk '/20yy-mm-dd hh:mm:ss/,/SENT MESSAGES (asynchronous)/'
Search the "text" along side 34 lines before and after it to get average out all the lines required
All of my logs session always begins with date and time so is used AWK for that
All of my logs session always ends with "SENT MESSAGES (asynchronous):" So AWK did that trick for me.
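If the -C34 window ever turns out to be too small, a variant of the zcat + awk approach above can key off those same session markers instead of a fixed line count; a rough sketch, assuming every session starts with a timestamp line (the /^20yy-mm-dd/ regex below is only a placeholder for the real date format, and /text/ is the string being searched for):
zcat engine.*.log.gz | awk '/^20yy-mm-dd/{if (buf ~ /text/) printf "%s", buf; buf=""} {buf=buf $0 "\n"} END{if (buf ~ /text/) printf "%s", buf}'
This buffers each session as it is read and prints it only if it contained the text, so no guess about the window size is needed.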

With Bash, how to print the whole lines of a .txt file which contain, in their first section, a specific string?

So far, I have used grep for this kind of question, but here I think it can't be used. Indeed, I want a match in the first section of each line, and then to print the whole line.
I wrote something like that:
cat file.txt | cut -d " " -f1 | grep root
But of course, this command does not print the whole line, but only the first section of each line that contains "root". I've heard of the awk command, but even with the manual, I do not understand how to reach my goal.
Thank you in advance for your answer.
Simple awk should do it:
awk '$1 ~ /root/' file.txt
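Note that $1 ~ /root/ matches any first field that merely contains root (so e.g. rootkit would match as well); if you want an exact match on the first field, use a string comparison instead:
awk '$1 == "root"' file.txt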

Optimizing search in Linux

I have a huge log file close to 3GB in size.
My task is to generate some reports based on the number of times something is being logged.
I need to find the number of times StringA, StringB, and StringC are each being called.
What I am doing right now is:
grep "StringA" server.log | wc -l
grep "StringB" server.log | wc -l
grep "StringC" server.log | wc -l
This is a long process and my script takes close to 10 minutes to complete. What I want to know is whether this can be optimized or not. Is it possible to run one grep command and find out the number of times StringA, StringB and StringC have each been called?
You can use grep -c instead of wc -l:
grep -c "StringA" server.log
A single grep can't report separate counts for individual strings, though. You can use awk:
out=$(awk '/StringA/{a++;} /StringB/{b++;} /StringC/{c++;} END{print a, b, c}' server.log)
Then you can extract each count with a simple bash array:
arr=($out)
echo "StringA="${arr[0]}
echo "StringA="${arr[1]}
echo "StringA="${arr[2]}
This (grep -c without wc) is certainly going to be faster, and the awk solution is possibly faster still, but I haven't measured either.
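If you would rather skip the bash array step, the awk script can print the labels itself; a small variation on the above (the +0 guards against empty output when a string never appears):
awk '/StringA/{a++} /StringB/{b++} /StringC/{c++} END{print "StringA=" a+0; print "StringB=" b+0; print "StringC=" c+0}' server.log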
Certainly this approach could be optimized, since grep doesn't perform any text indexing. I would use a text indexing engine like one of those from this review or this stackexchange QA. Also, you may consider using journald from systemd, which stores logs in a structured and indexed format, so lookups are more effective.
So many greps, so little time... :-)
According to David Lyness, a straight grep search is about 7 times as fast as an awk in large file searches.
If that is the case, the current approach could be optimized by changing grep to fgrep, but only if the patterns being searched for are not regular expressions. fgrep is optimized for fixed patterns.
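For example (grep -F is the portable spelling of fgrep; both treat the pattern as a literal string rather than a regular expression):
grep -F -c "StringA" server.log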
If the number of instances is relatively small compared to the original log file entries, it may be an improvement to use the egrep version of grep to create a temporary file filled with all three instances:
egrep "StringA|StringB|StringC" server.log > tmp.log
grep "StringA" tmp.log | wc -c
grep "StringB" tmp.log | wc -c
grep "StringC" tmp.log | wc -c
The egrep variant of grep allows a | (vertical bar/pipe) character to be used between two or more separate search strings, so that you can find multiple strings in a single statement. You can use grep -E to do the same thing.
Full documentation is in the man grep page, and information about the Extended Regular Expressions that egrep uses is in man 7 re_format.

Awk, tail, sed or others - which one faster for big files?

I have scripts for big log files. I can check all the lines and do something with tail and awk.
Tail:
tail -n +$startline $LOG
Awk:
awk 'NR>='"$startline"' {print}' $LOG
Checking the times, tail takes 6 minutes 39 seconds and awk takes 6 minutes 42 seconds, so the two commands do the same thing in about the same time.
I don't know how to do this with sed. Could sed be faster than tail and awk? Or maybe some other command?
Second question: I use $startline, and each run continues from where the last one left off. For example, I use the script like this:
10:00AM -> ./script -> $startline=1, do something, write the last line number to a save file (e.g. 25),
10:05AM -> ./script -> $startline=26 (read save file + 1), do something, write the line number to the save file (55),
10:10AM -> ./script -> $startline=56 (read save file + 1), do something ....
But while the script is running, it still reads through all the lines, and only starts doing something once it reaches $startline. That is a little slow because of the huge files.
Any suggestions to make it faster?
Script example:
lastline=$(tail -1 "line.save")
startline=$(($lastline + 1))
tail -n +$startline $LOG | while read -r
do
....
done
linecount=$(wc -l "$LOG" | awk '{print $1}')
echo $linecount >> line.save
tail and head are tools created especially for this purpose, so the intuitive idea is that they are quite optimized for it. On the other hand, awk and sed can perfectly well do it, because they are like a Swiss Army knife, but this is not supposed to be their best "skill" among the many others that they have.
In Efficient way to print lines from a massive file using awk, sed, or something else? there is a nice comparison on methods and head / tail is seen as the best approach.
Hence, I would go for tail + head.
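For a contiguous range of lines, the tail + head combination would look something like this (just a sketch; startline and endline are assumed to be set already):
tail -n +"$startline" "$LOG" | head -n "$((endline - startline + 1))"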
Note also that if you want not just the last lines, but a set of them within the file, awk (or sed) gives you the option to exit after the last line you wanted. This way, you avoid having the script read the file all the way to the end.
So this:
awk '{if (NR>=10 && NR<20) print} NR==20 {print; exit}'
is faster than
awk 'NR>=10 && NR<=20'
if your input happens to contain more than 20 lines.
Regarding your expression:
awk 'NR>='"$startline"' {print}' $LOG
note that it is more straightforward to write:
awk -v start="$startline" 'NR>=start' $LOG
There is no need to say print because it is implicit.
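And if you want to combine that with the earlier early-exit idea, both endpoints can be passed the same way (endline here is just an assumed variable name):
awk -v start="$startline" -v end="$endline" 'NR>=start{print} NR==end{exit}' "$LOG"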
