Comparing two files with simple answears, summing them up with no output to console - linux

I have two files as atributes in my script with ten lines in each. They contain simple answers: YES or NO. I need to compare those two files and get number of good answers and bad answers. I need to have them in separate variables.
I have tried :
diff <(nl $1) <(nl $2) | grep -E '<' | wc -l
and i get number of bad aswers, but i want it to be in some variable.

Either you're no good at describing what you want to achieve or the command doesn't do what you think it does.
Anyway - to store the output in a variable just do:
variable=$(diff <(nl $1) <(nl $2) | grep -E '<' | wc -l)

Related

Saving values in BASH shell variables while using |tee

I am trying to count the number of line matches in a very LARGE file and store them in variables using only the BASH shell commands.
Currently, i am scanning the results of a very large file twice and using a separate grep statement each time, like so:
$ cat test.txt
first example line one
first example line two
first example line three
second example line one
second example line two
$ FIRST=$( cat test.txt | grep 'first example' | wc --lines ; ) ; ## first run
$ SECOND=$(cat test.txt | grep 'second example' | wc --lines ; ) ; ## second run
and I end up with this:
$ echo $FIRST
3
$ echo $SECOND
2
Hopefully, I want to only scan the large file just once. And I have never used Awk and would rather not use that!
The |tee option is new to me. It seems that passing the results into two separate grep statements may mean that we only have to scan the large file once.
Ideally, I would also like to be able to do this without having to create any temporary files & subsequently having to remember to delete them.
I have tried multiple ways using something like these below:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST=$( grep 'first example' | wc --lines ;);) \
>(SECOND=$(grep 'second example' | wc --lines ;);) \
>/dev/null ;
and using read:
FIRST=''; SECOND='';
cat test.txt \
|tee >(grep 'first example' | wc --lines | (read FIRST); ); \
>(grep 'second example' | wc --lines | (read SECOND); ); \
> /dev/null ;
cat test.txt \
| tee <( read FIRST < <(grep 'first example' | wc --lines )) \
<( read SECOND < <(grep 'sedond example' | wc --lines )) \
> /dev/null ;
and with curly brackets:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST={$( grep 'first example' | wc --lines ;)} ) \
>(SECOND={$(grep 'second example' | wc --lines ;)} ) \
>/dev/null ;
but none of these allow me to save the line count into variables FIRST and SECOND.
Is this even possible to do?
tee isn't saving any work. Each grep is still going to do a full scan of the file. Either way you've got three passes through the file: two greps and one Useless Use of Cat. In fact tee actually just adds a fourth program that loops over the whole file.
The various | tee invocations you tried don't work because of one fatal flaw: variable assignments don't work in pipelines. That is to say, they "work" insofar as a variable is assigned a value, it's just the value is almost immediately lost. Why? Because the variable is in a subshell, not the parent shell.
Every command in a | pipeline executes in a different process and it's a fundamental fact of Linux systems that processes are isolated from each other and don't share variable assignments.
As a rule of thumb, you can write variable=$(foo | bar | baz) where the variable is on the outside. No problem. But don't try foo | variable=$(bar) | baz where it's on the inside. It won't work and you'll be sad.
But don't lose hope! There are plenty of ways to skin this cat. Let's go through a few of them.
Two greps
Getting rid of cat yields:
first=$(grep 'first example' test.txt | wc -l)
second=$(grep 'second example' test.txt | wc -l)
This is actually pretty good and will usually be fast enough. Linux maintains a large page cache in RAM. Any time you read a file Linux stores the contents in memory. Reading a file multiple times will usually hit the cache and not the disk, which is super fast. Even multi-GB files will comfortably fit into modern computers' RAM, particularly if you're doing the reads back-to-back while the cached pages are still fresh.
One grep
You could improve this by using a single grep call that searches for both strings. It could work if you don't actually need the individual counts but just want the total:
total=$(grep -e 'first example' -e 'second example' test.txt | wc -l)
Or if there are very few lines that match, you could use it to filter down the large file into a small set of matching lines, and then use the original greps to pull out the separate counts:
matches=$(grep -e 'first example' -e 'second example' test.txt)
first=$(grep 'first example' <<< "$matches" | wc -l)
second=$(grep 'second example' <<< "$matches" | wc -l)
Pure bash
You could also build a Bash-only solution that does a single pass and invokes no external programs. Forking processes is slow, so using only built-in commands like read and [[ can offer a nice speedup.
First, let's start with a while read loop to process the file line by line:
while IFS= read -r line; do
...
done < test.txt
You can count matches by using double square brackets [[ and string equality ==, which accepts * wildcards:
first=0
second=0
while IFS= read -r line; do
[[ $line == *'first example'* ]] && ((++first))
[[ $line == *'second example'* ]] && ((++second))
done < test.txt
echo "$first" ## should display 3
echo "$second" ## should display 2
Another language
If none of these are fast enough then you should consider using a "real" programming language like Python, Perl, or, really, whatever you are comfortable with. Bash is not a speed demon. I love it, and it's really underappreciated, but even I'll admit that high-performance data munging is not its wheelhouse.
If you're going to be doing things like this, I'd really recommend getting familiar with awk; it's not scary, and IMO it's much easier to do complex things like this with it vs. the weird pipefitting you're looking at. Here's a simple awk program that'll count occurrences of both patterns at once:
awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt
Explanation: /first example/ {first++} means for each line that matches the regex pattern "first example", increment the first variable. /second example/ {second++} does the same for the second pattern. Then END {print first second} means at the end, it should print the two variables. Simple.
But there is one tricky thing: splitting the two numbers it prints into two different variables. You could do this with read:
bothcounts=$(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
read first second <<<"$bothcounts"
(Note: I recommend using lower- or mixed-case variable names, to avoid conflicts with the many all-caps names that have special functions.)
Another option is to skip the bothcounts variable by using process substitution to feed the output from awk directly into read:
read first second < <(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
">" is about redirect to file/device, not to the next command in pipe. So tee will just allow you to redirect pipe to multiple files, not to multiple commands.
So just try this:
FIRST=$(grep 'first example' test.txt| wc --lines)
SECOND=$(grep 'second example' test.txt| wc --lines)
It's possible to get matches an count them in a single pass, then get the count of each from the result.
matches="$(grep -e 'first example' -e 'second example' --only-matching test.txt | sort | uniq -c | tr -s ' ')"
FIRST=$(grep -e 'first example' <<<"$matches" | cut -d ' ' -f 2)
echo $FIRST
Result:
3
Using awk is the best option I think.

Optimizing search in linux

I have a huge log file close to 3GB in size.
My task is to generate some reporting based on # of times something is being logged.
I need to find the number of time StringA , StringB , StringC is being called separately.
What I am doing right now is:
grep "StringA" server.log | wc -l
grep "StringB" server.log | wc -l
grep "StringC" server.log | wc -l
This is a long process and my script takes close to 10 minutes to complete. What I want to know is that whether this can be optimized or not ? Is is possible to run one grep command and find out the number of time StringA, StringB and StringC has been called individually ?
You can use grep -c instead of wc -l:
grep -c "StringA" server.log
grep can't report count of individual strings. You can use awk:
out=$(awk '/StringA/{a++;} /StringB/{b++;} /StringC/{c++;} END{print a, b, c}' server.log)
Then you can extract each count with a simple bash array:
arr=($out)
echo "StringA="${arr[0]}
echo "StringA="${arr[1]}
echo "StringA="${arr[2]}
This (grep without wc) is certainly going to be faster and possibly awk solution is also faster. But I haven't measured any.
Certainly this approach could be optimized since grep doesn't perform any text indexing. I would use a text indexing engine like one of those from this review or this stackexchange QA . Also you may consider using journald from systemd which stores logs in a structured and indexed format so lookups are more effective.
So many greps so little time... :-)
According to David Lyness, a straight grep search is about 7 times as fast as an awk in large file searches.
If that is the case, the current approach could be optimized by changing grep to fgrep, but only if the patterns being searched for are not regular expressions. fgrep is optimized for fixed patterns.
If the number of instances is relatively small compared to the original log file entries, it may be an improvement to use the egrep version of grep to create a temporary file filled with all three instances:
egrep "StringA|StringB|StringC" server.log > tmp.log
grep "StringA" tmp.log | wc -c
grep "StringB" tmp.log | wc -c
grep "StringC" tmp.log | wc -c
The egrep variant of grep allows for a | (vertical bar/pipe) character to be used between two or more separate search strings so that you can find multiple strings in statement. You can use grep -E to do the same thing.
Full documentation is in the man grep page and information about the Extended Regular Expressions that egrep uses from the man 7 re_format command.

How can I assign output to two bash arrays variables

There are vaguely similar answers here but nothing that could really answer my question. I am at a point in my bash script where I have to fill two arrays from an output that looks like the following:
part-of-the-file1:line_32
part-of-the-file1:line_97
part-of-the-file2:line_88
part-of-the-file2:line_93
What I need to do is pull out the files and line numbers in there own separate arrays. So far I have:
read FILES LINES <<<($(echo $INPUTFILES | xargs grep -n $id | cut -f1,2 -d':' | awk '{print $1 " " $2}'
with a modified IFS=':' but this doesn't work. I'm sure there are better solutions since I am NOT a scripting or tools wizard.
read cannot populate two arrays at a time like it can populate two regular parameters using word-splitting. It's best to just iterate over the output of your pipeline to build the arrays.
while IFS=: read fname lineno therest; do
files+=("$fname")
lines+=("$lineno")
done < <(echo $INPUTFILES | xargs grep -n "$id")
Here, I'm letting read do the splitting and discarding of the the tail end of grep's output in place of using cut and awk.

How to capture the output of a bash command into a variable when using pipes and apostrophe? [duplicate]

This question already has answers here:
How do I set a variable to the output of a command in Bash?
(15 answers)
Closed 6 years ago.
I am not sure how to save the output of a command via bash into a variable:
PID = 'ps -ef | grep -v color=auto | grep raspivid | awk '{print $2}''
Do I have to use a special character for the apostrophe or for the pipes?
Thanks!
To capture the output of a command in shell, use command substitution: $(...). Thus:
pid=$(ps -ef | grep -v color=auto | grep raspivid | awk '{print $2}')
Notes
When making an assignment in shell, there must be no spaces around the equal sign.
When defining shell variables for local use, it is best practice to use lower case or mixed case. Variables that are important to the system are defined in upper case and you don't want to accidentally overwrite one of them.
Simplification
If the goal is to get the PID of the raspivid process, then the grep and awk can be combined into a single process:
pid=$(ps -ef | awk '/[r]aspivid/{print $2}')
Note the simple trick that excludes the current process from the output: instead of searching for raspivid we search for [r]aspivid. The string [r]aspivid does not match the regular expression [r]aspivid. Hence the current process is removed from the output.
The Flexibility of awk
For the purpose of showing how awk can replace multiple calls to grep, consider this scenario: suppose that we want to find lines that contain raspivid but that do not contain color=auto. With awk, both conditions can be combined logically:
pid=$(ps -ef | awk '/raspivid/ && !/color=auto/{print $2}')
Here, /raspivid/ requires a match with raspivid. The && symbol means logical "and". The ! before the regex /color=auto/ means logical "not". Thus, /raspivid/ && !/color=auto/ matches only on lines that contain raspivid but not color=auto.
A more straightforward approach:
pid=$(pgrep raspivid)
... or a little different
echo pgrep [t]eleport

Shell script to get count of a variable from a single line output

How can I get the count of the # character from the following output. I had used tr command and extracted? I am curious to know what is the best way to do it? I mean other ways of doing the same thing.
{running_device,[test#01,test#02]},
My solution was:
echo '{running_device,[test#01,test#02]},' | tr ',' '\n' | grep '#' | wc -l
I think it is simpler to use:
echo '{running_device,[test#01,test#02]},' | tr -cd # | wc -c
This yields 2 for me (tested on Mac OS X 10.7.5). The -c option to tr means 'complement' (of the set of specified characters) and -d means 'delete', so that deletes every non-# character, and wc counts what's provided (no newline, so the line count is 0, but the character count is 2).
Nothing wrong with your approach. Here are a couple of other approaches:
echo $(echo {running_device,[test#01,test#02]}, |awk -F"#" '{print NF - 1}')
or
echo $((`echo {running_device,[test#01,test#02]} | sed 's+[^#]++g' | wc -c` - 1 ))
The only concern I would have is if you are running this command in a loop (e.g. once for every line in a large file). If that is the case, then execution time could be an issue as stringing together shell utilities incurs the overhead of launching processes which can be sloooow. If this is the case, then I would suggest writing a pure awk version to process the entire file.
Use GNU Grep to Avoid Character Translation
Here's another way to do this that I personally find more intuitive: extract just the matching characters with grep, then count grep's output lines. For example:
echo '{running_device,[test#01,test#02]},' |
fgrep --fixed-strings --only-matching # |
wc -l
yields 2 as the result.

Resources