Grep input from file and sum columns based on search - linux

I am using grep to read search strings from a file and awk to print the sum of a column over the matching lines:
grep -f input data.txt |awk '{ sum+=$2} END {print sum}'
This gives me one sum over all the input strings. How do I get a sum for each input string separately?
Sample input
a
b
c
Sample data.txt
a/cell1 5
b/cell1 5
a/cell2 8
c/cell1 10
Number of lines in input: ~32
Size of data.txt: 5 GB
Expected results:
a 13
b 5
c 10

$ awk 'NR==FNR{sum[$0]=0;next} {split($1,p,"/")} p[1] in sum{sum[p[1]]+=$2} END{for (key in sum) print key, sum[key]}' input data.txt
a 13
b 5
c 10

Hard to tell without seeing your files, but maybe:
grep -f input data.txt | \
awk '{sum[$1] += $2} END { for (key in sum) { print key, sum[key] } }'

The following avoids accumulating unnecessary details, and therefore may circumvent the memory allocation error. It assumes the list of the strings of interest is in a file named input:
awk -v dict=input '
BEGIN {while((getline<dict) > 0) {a[$1]=1}}
a[$1] {sum[$1] += $2}
END { for (key in sum) { print key, sum[key] } }'
If this does not resolve the memory issue, then please give some details about your awk, OS, and anything else that may be relevant.
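Note that in the sample data the strings of interest only appear as the prefix before the /, so an exact match on $1 will never fire. A sketch of a prefix-matching variant, assuming the prefix is always delimited by / (the input and data.txt contents below just recreate the sample):

```shell
# Sketch, assuming the search strings always form the prefix before "/".
# Recreate the sample files, then sum per prefix.
printf 'a\nb\nc\n' > input
printf 'a/cell1 5\nb/cell1 5\na/cell2 8\nc/cell1 10\n' > data.txt

awk -v dict=input '
  BEGIN { while ((getline line < dict) > 0) want[line] = 1 }
  { split($1, p, "/") }                  # p[1] is the part before the "/"
  p[1] in want { sum[p[1]] += $2 }
  END { for (k in sum) print k, sum[k] }
' data.txt
```

This keeps at most one counter per search string, so memory stays bounded regardless of the size of data.txt.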

Is this fast enough running against your 5GB file?
awk 'NR == FNR {sum[$1]+=$2} NR != FNR {printf "%s %s\n", $1, sum[$1] }' file1 file2
Where file1 is the 5GB file and file2 is the file containing the strings you want to find in file1.
EDIT
As @EdMorton commented earlier, my solution will print a blank for sum[$1] when $1 is not found.
In addition, @EdMorton provided an answer which will print 0 instead.
I suggest checking out his answer first, as it is likely to meet your needs better.
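For reference, a minimal sketch of the 0-instead-of-blank behaviour: adding 0 coerces an unset array element to a number (the file names and contents below are just placeholders):

```shell
# Sketch: sum[$1] + 0 prints 0 for search strings never seen in the
# data, instead of an empty string. file1 is the data, file2 the strings.
printf 'x 1\nx 2\n' > file1
printf 'x\ny\n' > file2

awk 'NR == FNR { sum[$1] += $2; next }
     { printf "%s %s\n", $1, sum[$1] + 0 }' file1 file2
```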


Count number of ';' in column

I use the following command to count the number of ; in the first line of a file:
awk -F';' '(NR==1){print NF;}' $filename
I would like to do the same with all lines in the same file, that is, count the number of ; on every line of the file.
What I have :
$ awk -F';' '(NR==1){print NF;}' $filename
11
What I would like to have :
11
11
11
11
11
11
A straightforward method to count ; per line:
awk '{print gsub(/;/,"&")}' Input_file
To skip empty lines, try:
awk 'NF{print gsub(/;/,"&")}' Input_file
To do it OP's way, subtract 1 from NF:
awk -F';' '{print (NF-1)}' Input_file
OR
awk -F';' 'NF{print (NF-1)}' Input_file
I'd say you can solve your problem with the following:
awk -F';' '{if (NF) {a += NF-1;}} END {print a}' test.txt
You want to keep a running count of all the occurrences made (variable a).
As NF will return the number of fields, which is one more than the number of separators, you'll need to subtract 1 for each line. This is the NF-1 part.
However, you don't want to count "-1" for the lines in which there is no separator at all. To skip those you need the if (NF) part.
Here's a (perhaps contrived) example:
$ cat test.txt
;;
; ; ; ;;
; asd ;;a
a ; ;
$ awk -F';' '{if (NF) {a += NF-1;}} END {print a}' test.txt
12
Notice the empty line at the end (to test against the "no separator" case).
A different approach using tr and wc:
$ tr -cd ';' < file | wc -c
42
Your code returns a number one greater than the number of semicolons: NF is the number of fields you get from splitting on semicolons (so, for example, one semicolon splits the line into two fields).
If you want to add this number over all lines, that's easy:
awk -F ';' '{ sum += NF-1 } END { print sum }' "$filename"
If the number of fields is consistent, you could also just count the lines and multiply:
awk -F ';' 'END { print NR * (NF-1) }' "$filename"
But that's obviously wrong if you can't guarantee that every line contains exactly the same number of fields.
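The off-by-one relationship between NF and the separator count is easy to check on a one-liner:

```shell
# One line with two semicolons splits into three fields, so NF-1 is the
# separator count; gsub returns the same number directly.
echo 'a;b;c' | awk -F';' '{ print NF, NF-1, gsub(/;/,"&") }'
# prints: 3 2 2
```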

using variable defined outside Awk

I have coded the following lines:
ARRAY=($(awk 'FS = ";" {print $3}' file.txt))
LINE_CREATOR=`echo "aaaa;bbbb;cccccccc" |
'{awk -F";"};
END
for (i in ARRAY)
{
print $'${ARRAY['i']}'
}
}'`
the File.txt looks like
1;8;3
4;6;1
7;9;2
Explanation:
the array contains the values 3 1 2,
so the loop will iterate over the array and extract fields $3, $1, $2 from "aaaa;bbbb;cccccccc" using awk,
and the final output should be this:
ccccccccaaaabbbb
I still get some errors when launching my script.
I'm making a few guesses here but I think that this does what you want:
$ echo "aaaa;bbbb;cccccccc" | awk -F\; 'NR == FNR { n = split($0, a); next }
{ printf "%s", a[$3] } END { print "" }' - file
ccccccccaaaabbbb
NR == FNR means that the block is only run for the first input. - as an argument tells awk to read first from standard input. The string is split on FS (;) into the array a. next skips the rest of the script.
The second block is only run for the second input (the text file). The values in the third field are used to print the elements in the array a.
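The mechanics are easy to see in isolation: FNR resets for each input file, so NR == FNR holds only while the first input is being read. A tiny sketch (second.txt is a throwaway file for illustration):

```shell
# FNR restarts at 1 for each file; NR keeps counting across files.
printf 'three\n' > second.txt
printf 'one\ntwo\n' |
awk '{ print (NR == FNR ? "first" : "second"), FNR, $0 }' - second.txt
# first 1 one
# first 2 two
# second 1 three
```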
if you want to pass the index as an awk variable, here is another way
$ awk -F';' -v ix="$(cut -d\; -f3 file | paste -sd\;)" '
BEGIN{n=split(ix,a)}
{for(i=1;i<n;i++) printf "%s",$a[i];
printf "%s\n",$a[n]}' <<< "aaaa;bbbb;cccccccc"
ccccccccaaaabbbb

How to replace fields using substr comparison

I have two files. I need to take the last 6 characters of field 11 of file1 and look them up in file2; where they match, I need to replace field 9 of file1 with fields 1 and 2 of file2.
file1:
12345||||||756432101000||756432||756432101000||
aaaaa||||||986754812345||986754||986754812345||
ccccc||||||134567222222||134567||134567222222||
file2:
101000|AAAA
812345|20030
The expected output is:
12345||||||756432101000||101000AAAA||756432101000||
aaaaa||||||986754812345||81234520030||986754812345||
ccccc||||||134567222222||134567||134567222222||
I have tried:
awk -F '|' -v OFS='|' 'NR==FNR{a[$1,$2];next} {b=substr($11,length($11)-7)} b in a {$9=a[$1,$2]}1'
I'd write it this way as a full script in a file, rather than a one-liner:
#!/usr/bin/awk -f
BEGIN {
FS = "|";
OFS = FS;
}
NR == FNR { # first file read (file2 on the command line): the replacements to use
map[$1] = $2
next;
}
{ # second file read (file1): the main file to manipulate
b = substr($11,length($11)-5);
if (map[b]) {
$9 = b map[b]
}
print
}
$ awk -F '|' -v OFS='|' 'NR==FNR{a[$1]=$2;next} {b=substr($11,length($11)-5)} b in a {$9=b a[b]}1' file2 file1
12345||||||756432101000||101000AAAA||756432101000||
aaaaa||||||986754812345||81234520030||986754812345||
ccccc||||||134567222222||134567||134567222222||
How it works
awk implicitly loops through every line in both files, starting with file2 because it is specified first on the command line.
-F '|'
This tells awk to use | as the field separator on input
-v OFS='|'
This tells awk to use | as the field separator on output
NR==FNR{a[$1]=$2;next}
While reading the first file, file2, this saves the second field, $2, as the value of associative array a with the first field, $1, as the key.
next tells awk to skip the rest of the commands and start over on the next line.
b=substr($11,length($11)-5)
This extracts the last six characters of field 11 and saves them in variable b.
b in a {$9=b a[b]}
This tests to see if b is one of the keys of associative array a. If it is, this assigns the ninth field, $9, to the combination of b and a[b].
1
This is awk's cryptic shorthand for print-the-line.
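Note the subtle difference from the if (map[b]) test used in the script form above: in checks for key existence only, while arr[key] as a condition is also false when the stored value is empty or 0 (and looking up a missing key creates it as a side effect). A quick sketch:

```shell
# "key in arr" tests existence; a value test treats empty/0 as false.
awk 'BEGIN {
  a["x"] = 0
  if ("x" in a)    print "x exists"
  if (!a["x"])     print "but a[\"x\"] is falsy"
  if (!("y" in a)) print "y missing"
}'
```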
You are almost there:
$ awk -F '|' -v OFS='|' 'NR==FNR{a[$1]=$2;next} {b=substr($11,length($11)-5)} b in a {$9=b a[b]}1' file2 file1
12345||||||756432101000||101000AAAA||756432101000||
aaaaa||||||986754812345||81234520030||986754812345||
ccccc||||||134567222222||134567||134567222222||
$

using sed or similar to modify lines in a file

I have a huge file composed of the following:
this is text
1234.1234567
this is another text
1234.1234567
and so on
I would like to transform it into:
this is text:1234.1234567
this is another text:1234.1234567
Is this possible using sed, or any other similar command?
Thanks
If you just want to join lines using : as separator, you could use paste:
paste -d : - - < file.txt
Or using awk:
awk -v sep=: '{ if (NR % 2 == 0) { print prev sep $0 } else prev = $0 }' file.txt
If you have lines containing just alphabets and other containing floating point numbers, you can do the following:
awk '/[a-zA-Z]+/ {printf "%s:", $0}
/[0-9.]+/ {print $0}' data
data is the filename. You can redirect the output to another file.

How do you implement an array into an AWK expression?

I am writing a script, and I have a delimited file that looks like this.
1|Anderson|399.00|123
2|Smith|29.99|234
3|Smith|98.00|345
4|Smith|29.98|456
5|Edwards|399.00|567
6|Kreitzer|234.56|456
Here's an awk statement that will grab all the values in column one of a row that contain "Smith".
echo $(awk -F '|' 'BEGIN {count=0;} $2=="Smith" {count++; print $1}' customer)
The output would be:
2 3 4
How could I also store the values into an array as awk iterates? I tried this:
echo $(awk -F '|' 'BEGIN {count=0;} $2=="Smith" {count++; arr[count]=$1; print $1}' customer)
Edit: Later in the script, when I type
echo ${array[1]}
nothing outputs.
Your code seems to be right! Perhaps I haven't understood your question correctly?
I slightly enhanced your code to print the values stored in the array at the end of execution, preceded by a short message.
echo $(awk -F '|' 'BEGIN {count=0;} $2=="Smith" {count++; arr[count]=$1; print $1} END { print "Printing the inputs"; for (i in arr) print arr[i] }' customer)
2 3 4 Printing the inputs 2 3 4
Your question is not very clear. Looking for something like this?
awk -F "|" '$2=="Smith" {arr[count++]=$1}
END {n=length(arr); for (i=0;i<n;i++) print (i+1), arr[i]}' in.file
OUTPUT
1 2
2 3
3 4
Found an easy solution. Set the output of awk into a variable. Then turn the variable into an array.
list=$(awk -F '|' 'BEGIN {count=0;} $2=="Smith" {count++; print $1}' customer)
array=($list)
Typing:
echo ${array[1]}
Will give you the second entry in the array
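On bash 4+, mapfile (a.k.a. readarray) is a sketch of a slightly more robust variant, since array=($list) word-splits on all whitespace and expands globs, whereas here each output line becomes exactly one element (the customer file below just recreates the sample):

```shell
# Sketch, bash 4+: read each line of awk output into one array element.
printf '1|Anderson|399.00|123\n2|Smith|29.99|234\n3|Smith|98.00|345\n4|Smith|29.98|456\n5|Edwards|399.00|567\n6|Kreitzer|234.56|456\n' > customer

mapfile -t array < <(awk -F '|' '$2 == "Smith" { print $1 }' customer)
echo "${array[1]}"   # second entry: 3
```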
