Linux command line: find and add up numbers

The log file may contain information in different forms. I need to grep only the lines that contain the substring "ABC", then from the chosen lines extract the number of Kb at the end (it always exists; the pattern is ": %n Kb", where %n is a number from 0 upward). Finally I need to add up all the values to get the amount of memory used by an app.
2016-01-14T16:15:01.695Z [INFO] application - ABC 5f18dda7-a30a-44f5-82dd-69d4b5469245: 118 Kb
2016-01-14T16:15:04.535Z [INFO] application - 5f18dda7-a30a-44f5-82dd-69d4b5469245

grep isn't a verb, but awk is!
awk '/ABC/ { s += $(NF-1) } END { print s " Kb" }'
should work (untested)

You can use the following chain:
grep ABC logfile.txt | egrep -o "[0-9]+ Kb" | cut -f1 -d" " | paste -s -d+ | bc
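For reference, here is what each stage of that chain produces on the sample lines above (assuming they are saved as logfile.txt):

$ grep ABC logfile.txt
2016-01-14T16:15:01.695Z [INFO] application - ABC 5f18dda7-a30a-44f5-82dd-69d4b5469245: 118 Kb
$ grep ABC logfile.txt | egrep -o "[0-9]+ Kb"
118 Kb
$ grep ABC logfile.txt | egrep -o "[0-9]+ Kb" | cut -f1 -d" " | paste -s -d+ | bc
118

With a single matching line the sum is just that value; with more matches, paste joins the numbers as 118+240+... and bc evaluates the expression.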

I need to grep only the lines that contain the substring "ABC", then from the chosen lines extract the number of Kb at the end (it always exists)
This looks like a job for awk. The number is always in the second-to-last column, which awk can extract easily:
awk '/ABC/ { print $(NF-1) }' filename_here
Here NF-1 is the index of the second-to-last column, and $ gets the value in that column.
But you want to sum it up, rather than just extract. That's a simple task, and it shows off a slightly more advanced usage of awk:
awk '
BEGIN { sum = 0; }
/ABC/ { sum += $(NF-1); }
END { print sum; }
' filename_here
Technically speaking you can omit the entire BEGIN line, but I consider it good style to be up-front about the variables you expect to use in the program.
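As a quick check, running the script against a file holding the two sample lines from the question (app.log is an assumed name) sums only the ABC line:

$ awk 'BEGIN { sum = 0 } /ABC/ { sum += $(NF-1) } END { print sum " Kb" }' app.log
118 Kb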

Related

awk Print Line Issue

I'm experiencing some issues with an awk command right now. The original script was developed using awk on MacOS and was then ported to Linux, where awk shows different behavior.
What I want to do is to count the occurrences of single strings provided via /tmp/test.uniq.txt in the file /tmp/test.txt.
awk '{print $1, system("cat /tmp/test.txt | grep -o -c " $1)}' /tmp/test.uniq.txt
Mac delivers an expected output like:
test1 2
test2 1
The output is on one line: the string and the number of occurrences, separated by whitespace.
Linux delivers an output like:
2
test1 1
test2
The output is not on one line, and the output of the system command is printed first.
Sample input:
test.txt looks like:
test1 test test
test1 test test
test2 test test
test.uniq.txt looks like:
test1
test2
As the comments suggested, shelling out to cat and grep via the system() function is not recommended, since awk is a complete language that can perform most of these tasks itself.
You can use the following awk command to replace your cat | grep functionality:
awk 'FNR == NR {a[$1]=0; next} {for (i=1; i<=NF; i++) if ($i in a) a[$i]++}
END { for (i in a) print i, a[i] }' uniq.txt test.txt
test1 2
test2 1
Note that this output doesn't match the count of 5 that your question states; your sample data is probably different.
References:
Effective AWK Programming
Awk Tutorial
It looks to me as if you're trying to count the number of lines containing each unique string in the uniq file. But the way you're doing it is... awkward, and as you've demonstrated, inconsistent between versions of awk.
The following might work a little better:
$ awk '
NR==FNR {
a[$1]
next
}
{
for (i in a) {
if ($1~i) {
a[i]++
}
}
}
END {
for (i in a)
printf "%6d\t%s\n",a[i],i
}
' test.uniq.txt test.txt
2 test1
1 test2
This loads your uniq file into an array, then for every line in your text file, steps through the array to count the matches.
Note that these are being compared as regular expressions, without word boundaries, so test1 will also be counted as part of test12.
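If that matters, a sketch of one fix is to anchor each string with GNU awk's \< and \> word-boundary operators when matching:

$ awk '
NR==FNR {
    a[$1]
    next
}
{
    for (i in a) {
        if ($1 ~ ("\\<" i "\\>")) {
            a[i]++
        }
    }
}
END {
    for (i in a)
        printf "%6d\t%s\n",a[i],i
}
' test.uniq.txt test.txt

With the boundaries in place, test12 no longer counts as a match for test1.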
Another way might be to use grep+sort+uniq:
grep -o -w -F -f uniq.txt test.txt | sort | uniq -c
It's a pipeline, but a short one.
From man grep:
-F, --fixed-strings, --fixed-regexp
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX, --fixed-regexp is an obsoleted alias, please do not use it in new scripts.)
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero patterns and therefore matches nothing. (-f is specified by POSIX.)
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
-w, --word-regexp
Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
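On the sample files above (with the patterns file saved as uniq.txt, as in the command), the pipeline gives:

$ grep -o -w -F -f uniq.txt test.txt | sort | uniq -c
      2 test1
      1 test2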

How to extract specific value using grep and awk?

I am having trouble extracting a specific value from a .txt file using grep and awk.
I show below an excerpt from the .txt file:
"-
bravais-lattice index = 2
lattice parameter (alat) = 10.0000 a.u.
unit-cell volume = 250.0000 (a.u.)^3
number of atoms/cell = 2
number of atomic types = 1
number of electrons = 28.00
number of Kohn-Sham states= 18
kinetic-energy cutoff = 60.0000 Ry
charge density cutoff = 300.0000 Ry
convergence threshold = 1.0E-09
mixing beta = 0.7000
I have also defined some variables: ELEMENT and lat.
I want to extract the "unit-cell volume" value which is equal to 250.00.
I tried the following to extract the value using grep and awk:
volume=`grep "unit-cell volume" ./latt.10/$ELEMENT.scf.latt_$lat.out | awk '{printf "%15.12f\n",$5}'`
However, when I run the bash script I always get 00.000000 as a result instead of the correct value of 250.00.
Can anyone help, please?
Thanks in advance.
awk '{printf "%15.12f\n",$5}'
You're asking awk to print out the fifth field of the line ($5).
unit-cell   volume   =   250.0000   (a.u.)^3
   $1         $2    $3      $4         $5
The fifth field is (a.u.)^3, which you are then asking awk to interpret as a number via the %f format code. It's not a number, though (or actually, doesn't start with a number), and when awk is asked to treat a non-numeric string as a number, it uses 0 instead. Thus it prints 0.
Solution: use $4 instead.
By the way, you can skip invoking grep by using awk itself to select the line, e.g.
awk '/^ unit-cell/ { ... }'
The /^ unit-cell/ is a regular expression that matches "unit-cell" (with a leading space) at the beginning of the line. Adjust as necessary if you have other lines that start with unit-cell which you don't want to select.
You never need grep when you're using awk since awk can do anything useful that grep can do. It sounds like this is all you need:
$ awk -F'=' '/unit-cell volume/{printf "%.2f\n",$2}' file
250.00
The above works because when FS is =, $2 is <spaces>250.0000 (a.u.)^3. When awk is asked to convert a string to a number, it strips leading spaces and ignores everything after the numeric part, which leaves 250.0000 to be converted to a number by %.2f.
In the script you posted $5 was failing because the 5th space-separated field in:
$1 $2 $3 $4 $5
<unit-cell> <volume> <=> <250.0000> <(a.u.)^3>
is (a.u.)^3 - you could have just added print $5 to see that.
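To see it for yourself (the spacing mimics the sample line):

$ echo '     unit-cell volume          =     250.0000 (a.u.)^3' | awk '{ print $5 }'
(a.u.)^3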
Since you are processing key-value pairs where the key can contain a variable amount of space, you would need to tune the field number ($4, $5, etc.) separately for each record you want to process, unless you set the field separator (FS) appropriately, to FS=" *= *". Then the key will always be in $1 and the value in $2.
Then use split to split the value and unit parts from each other.
Also, you can lose that grep by giving awk a pattern (or condition, /unit-cell volume/) for the print action:
$ awk 'BEGIN{FS=" *= *"} /unit-cell volume/{split($2,a," +");print a[1]}' file
250.0000
Explained:
$ awk '
BEGIN { FS=" *= *" } # set appropriate field separator
/unit-cell volume/ { # pattern or condition
split($2,a," +") # split value part to value and possible unit parts
print a[1] # output value part
}' file

Retaining one member of a pair

Good afternoon to all,
I have a file containing two fields, each representing a member of a pair.
I want to retain one member of each pair and it does not matter which member as these are codes for duplicate samples in a study.
Each pair appears twice in my file, with each member of the pair appearing once in either column.
An example of an input file is:
XXX1 XXX7
XXX2 XXX4
abc2 dcb3
XXX7 XXX1
dcb3 abc2
XXX4 XXX2
And an example of the desired output would be
XXX1
XXX2
abc2
How might this be accomplished in bash? Thank you.
Here is a combination of GNU awk, cut and sort; store the script as duplicatePairs.awk:
{ if ( $1 < $2) print $1, $2
else print $2, $1
}
and run it like this: awk -f duplicatePairs.awk your_file | sort -u | cut -d" " -f1
The if sorts the pairs such that a line with x,y and a line with y,x will be printed the same. Then sort -u can remove the duplicate lines. And the cut selects the first column.
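For example, with the sample input saved as pairs.txt (an assumed name), the stages look like this (the sort order may vary with your locale):

$ awk -f duplicatePairs.awk pairs.txt
XXX1 XXX7
XXX2 XXX4
abc2 dcb3
XXX1 XXX7
abc2 dcb3
XXX2 XXX4
$ awk -f duplicatePairs.awk pairs.txt | sort -u | cut -d" " -f1
XXX1
XXX2
abc2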
With a slightly larger awk script, we can meet the requirements with awk alone:
{
smallest = $1;
if ( $1 > $2) {
smallest = $2
}
if( !(smallest in seen) ) {
seen [ smallest ] = 1
print smallest
}
}
Run it like this: awk -f duplicatePairs.awk your_file
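On the sample input this prints each pair's smaller member on first sight:

$ awk -f duplicatePairs.awk pairs.txt
XXX1
XXX2
abc2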
While the answer posted by Lars above works very well I would like to suggest an alternative, just in case someone stumbles upon this problem.
I had previously used awk '!seen[$2,$1]++ {print $1}' with the same result. I didn't realize it had worked, since the number of lines in my file wasn't halved. This turned out to be because of some wrong assumptions I made about my data.

Comparing two files using awk and printing contents which match from the other file

I have two files:
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
file2.txt
919167000000
919594000000
Output
919167000000,hutch,mumbai
919594000000,idea,mumbai
How can I achieve this using AWK? I've got a huge file of phone numbers which needs to be compared like this. I believe awk can handle it; if not, please let me know how I can do this.
Extra definitions
Is the common part always a 6-digit number? Yes, always 6.
Are the two files already sorted? file1 is not sorted; file2 can be sorted.
Are the trailing digits in file 2 always zeros? No, these are phone numbers and can vary; the purpose of this is to get series information for the phone number.
Is there any danger of file 1 containing three records for a given number while file 2 contains 2 records, or is it one-to-one? It's one-to-one.
Can there be records in file 1 with no match in file 2, or vice versa? Yes.
If so, do you want to see the unmatched records? Yes, I want both records.
Extended data
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
918888,airtel,karnataka
file2.txt
919167838888
919594998484
919212334323
Output Expected:
919167838888,hutch,mumbai
919594998484,idea,mumbai
919212334323,nomatch,nomatch
As I noted in a comment, there's a lot of unstated information needed to give a definitive answer. However, we can make some plausible guesses:
The common number is the first 6 digits of file 2 (we don't care about the trailing digits, but will simply copy them to the output).
The files are sorted in order.
If there are unmatched records in either file, those records will be ignored.
The tools of choice are probably sed and join:
sed 's/^\([0-9]\{6\}\)/\1,\1/' file2.txt |
join -t, -o 1.2,2.2,2.3 - file1.txt
This edits file2.txt to create a comma-separated first field with the 6-digit phone number followed by all the rest of the line. The input is fed to the join command, which joins on the first column, and outputs the 'rest of the line' (column 2) from file2.txt and columns 2 and 3 from file1.txt.
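For instance, with the first file2.txt above, the sed step emits join-ready lines keyed on the 6-digit prefix:

$ sed 's/^\([0-9]\{6\}\)/\1,\1/' file2.txt
919167,919167000000
919594,919594000000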
If the phone numbers are variable length, then the matching operation is horribly complex. For that, I'd drop into Perl (or Python) to do the work. If the data is unsorted, it can be sorted before being fed into the commands. If you want unmatched records, you can specify how to handle those in the options to join.
The extra information needed is now available. The key information is that the 6-digit prefix is fixed. Phew! Since you're on Linux, I'm assuming bash is available with 'process substitution':
sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -a 2 -e 'no-match' - <(sort file1.txt)
If process substitution is not available, simply sort file1.txt in situ:
sort -o file1.txt file1.txt
Then use file1.txt in place of <(sort file1.txt).
I think the comment might be asking for inputs such as:
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
902130,airtel,karnataka
file2.txt
919167000000
919594000000
919342313242
Output
no-match,airtel,karnataka
919167000000,hutch,mumbai
919342313242,no-match,no-match
919594000000,idea,mumbai
If that's not what the comment is about, please clarify by editing the question to add the extra data and output in a more readable format than comments allow.
Working with the extended data, this mildly modified command:
sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -e 'no-match' - <(sort file1.txt)
produces the output:
919167838888,hutch,mumbai
919212334323,no-match,no-match
919594998484,idea,mumbai
which looks rather like a sorted version of the desired output. The -a n options control whether the unmatched records from file 1 or file 2 (or both) are printed; the -e option controls the value printed for the unmatched fields. All of this is readily available from the man pages for join, of course.
Here's one way using GNU awk. Run like:
awk -f script.awk file2.txt file1.txt
Contents of script.awk:
BEGIN {
FS=OFS=","
}
FNR==NR {
sub(/[ \t]+$/, "")
line = substr($0, 1, 6)   # first 6 digits; substr() indexes from 1
array[line]=$0
next
}
{
print (($1 in array) ? $0 : "FILE1 no match --> " $0)
dup[$1]++
}
END {
for (i in array) {
if (!(i in dup)) {
printf "FILE2 no match --> %s\n", array[i]
}
}
}
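With the extended data above this produces output along these lines (the order of any FILE2 lines from the END block is not guaranteed):

$ awk -f script.awk file2.txt file1.txt
919167,hutch,mumbai
919594,idea,mumbai
FILE1 no match --> 918888,airtel,karnataka
FILE2 no match --> 919212334323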
Alternatively, here's the one-liner:
awk 'BEGIN { FS=OFS="," } FNR==NR { sub(/[ \t]+$/, ""); line = substr($0, 1, 6); array[line]=$0; next } { print (($1 in array) ? $0 : "FILE1 no match --> " $0); dup[$1]++ } END { for (i in array) if (!(i in dup)) printf "FILE2 no match --> %s\n", array[i] }' file2.txt file1.txt
awk -F, 'FNR==NR{a[$1]=$2","$3;next}{for(i in a){if($1~("^" i)) print $1","a[i]}}' file1.txt file2.txt
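With the extended file1.txt and file2.txt, that one-liner prints only the matching lines:

919167838888,hutch,mumbai
919594998484,idea,mumbai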

Getting n-th line of text output

I have a script that generates two lines as output each time. I'm really just interested in the second line. Moreover, I'm only interested in the text that appears between a pair of #'s on the second line. Additionally, between the hashes another delimiter is used: ^A. It would be great if I could also break apart each piece of ^A-delimited text. (Note that ^A is the SOH control character and can be typed using Ctrl-A.)
output | sed -n '1p' #prints the 1st line of output
output | sed -n '1,3p' #prints the 1st, 2nd and 3rd line of output
your.program | tail -n +2 | cut -d# -f2
should get you 2/3 of the way.
Improving Grumdrig's answer:
your.program | head -n 2 | tail -n 1 | cut -d# -f2
I'd probably use awk for that.
your_script | awk -F# 'NR == 2 && NF == 3 {
num_tokens=split($2, tokens, "^A") # "^A" here is a literal SOH (Ctrl-A); in gawk you can write "\001" instead
for (i = 1; i <= num_tokens; ++i) {
print tokens[i]
}
}'
This says
1. Set the field separator to #
2. On lines that are the 2nd line, and also have 3 fields (text#text#text)
3. Split the middle (2nd) field using "^A" as the delimiter into the array named tokens
4. Print each token
Obviously this makes a lot of assumptions. You might need to tweak it if, for example, # or ^A can appear legitimately in the data, without being separators. But something like that should get you started. You might need to use nawk or gawk or something, I'm not entirely sure if plain awk can handle splitting on a control character.
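As a sanity check, here is the same logic with the separator written as the escape "\001", so the command can be pasted without typing a literal Ctrl-A (the input line is made up):

$ printf 'ignored\nfoo#one\001two\001three#bar\n' | awk -F# 'NR == 2 && NF == 3 {
    n = split($2, t, "\001")
    for (i = 1; i <= n; ++i) print t[i]
}'
one
two
three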
bash:
read                                        # discard the first line
read line                                   # keep the second line
result="${line#*#}"                         # strip through the first #
result="${result%#*}"                       # strip the last # and what follows
IFS=$'\001' read -r -a result <<< "$result" # split on ^A into an array
$result is now an array that contains the elements you're interested in. Just pipe the output of the script to this one.
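A quick demonstration with made-up input, where the second line carries the #-delimited payload:

$ printf 'first\nfoo#one\001two\001three#bar\n' | {
    read
    read line
    result="${line#*#}"
    result="${result%#*}"
    IFS=$'\001' read -r -a result <<< "$result"
    printf '%s\n' "${result[@]}"
}
one
two
three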
here's a possible awk solution
awk -F"#" 'NR==2{
for(i=2;i<=NF;i+=2){
split($i,a,"\001") # split on SOH
for(o in a ) print o # print the splitted hash
}
}' file
