How do you implement an array into an AWK expression? - linux

I am writing a script, and I have a delimited file that looks like this:
1|Anderson|399.00|123
2|Smith|29.99|234
3|Smith|98.00|345
4|Smith|29.98|456
5|Edwards|399.00|567
6|Kreitzer|234.56|456
Here's an awk statement that grabs the column-one value of every row whose second column is "Smith".
echo $(awk -F '|' 'BEGIN {count=0;} $2=="Smith" {count++; print $1}' customer)
The output would be:
2 3 4
How could I also store the values in an array as awk increments the counter? I tried this:
echo $(awk -F '|' 'BEGIN {count=0;} $2=="Smith" {count++; arr[count]=$1; print $1}' customer)
Edit: Later in the script, when I type
echo ${array[1]}
nothing outputs.

Your code seems to be right! Perhaps I haven't understood your question correctly?
I slightly enhanced your code to print the values stored in the array at the end of execution, with a print statement just before them so you can see where the array contents begin.
echo $(awk -F '|' 'BEGIN {count=0;} $2=="Smith" {count++; arr[count]=$1; print $1} END { print "Printing the inputs"; for (i in arr) print arr[i] }' customer)
2 3 4 Printing the inputs 2 3 4

Your question is not very clear. Are you looking for something like this?
awk -F "|" '$2=="Smith" {arr[count++]=$1}
END {n=length(arr); for (i=0;i<n;i++) print (i+1), arr[i]}' in.file
OUTPUT
1 2
2 3
3 4

Found an easy solution: capture the output of awk in a variable, then turn the variable into an array.
list=$(awk -F '|' 'BEGIN {count=0;} $2=="Smith" {count++; print $1}' customer)
array=($list)
Typing:
echo ${array[1]}
will print the second entry in the array (bash arrays are zero-indexed).
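Note that array=($list) relies on word splitting and will also glob any wildcard characters; that is harmless here because the values are plain numbers. A slightly more robust sketch, assuming bash 4+ for mapfile, would be:
mapfile -t array < <(awk -F '|' '$2=="Smith" {print $1}' customer)
echo "${array[1]}"
which reads one value per line into the array and likewise prints 3.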

Related

Count number of ';' in column

I use the following command to count the number of ; in the first line of a file:
awk -F';' '(NR==1){print NF;}' $filename
I would like to do the same with all lines in the file, that is to say, count the number of ; on every line.
What I have:
$ awk -F';' '(NR==1){print NF;}' $filename
11
What I would like to have:
11
11
11
11
11
11
A straightforward method to count ; per line would be:
awk '{print gsub(/;/,"&")}' Input_file
To skip empty lines, try:
awk 'NF{print gsub(/;/,"&")}' Input_file
To do this the OP's way, subtract 1 from the value of NF:
awk -F';' '{print (NF-1)}' Input_file
OR
awk -F';' 'NF{print (NF-1)}' Input_file
I'd say you can solve your problem with the following:
awk -F';' '{if (NF) {a += NF-1;}} END {print a}' test.txt
You want to keep a running count of all the occurrences (variable a).
As NF will return the number of fields, which is one more than the number of separators, you'll need to subtract 1 for each line. This is the NF-1 part.
However, you don't want to count "-1" for the lines in which there is no separator at all. To skip those you need the if (NF) part.
Here's a (perhaps contrived) example:
$ cat test.txt
;;
; ; ; ;;
; asd ;;a
a ; ;
$ awk -F';' '{if (NF) {a += NF-1;}} END {print a}' test.txt
12
Notice the empty line at the end (to test against the "no separator" case).
A different approach using tr and wc:
$ tr -cd ';' < file | wc -c
42
Your code returns a number one more than the number of semicolons; NF is the number of fields you get from splitting on a semicolon (so for example, if there is one semicolon, the line is split in two).
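For example, a line with two semicolons splits into three fields:
$ echo 'a;b;c' | awk -F';' '{print NF}'
3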
If you want to add up this number across all lines, that's easy:
awk -F ';' '{ sum += NF-1 } END { print sum }' "$filename"
If the number of fields is consistent, you could also just count the number of lines and multiply:
awk -F ';' 'END { print NR * (NF-1) }' "$filename"
But that's obviously wrong if you can't guarantee that all lines contain exactly the same number of fields.
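If you want to check that guarantee first, a small sketch in the same vein tallies how many lines have each separator count:
awk -F';' 'NF{c[NF-1]++} END { for (n in c) print c[n], "line(s) with", n, "separator(s)" }' "$filename"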

grep reverse with exact matching

I have a list file, which has an id and a number, and I am trying to get those lines from a master file which do not have those ids.
List file
nw_66 17296
nw_67 21414
nw_68 21372
nw_69 27387
nw_70 15830
nw_71 32348
nw_72 21925
nw_73 20363
master file
nw_1 5896
nw_2 52814
nw_3 14537
nw_4 87323
nw_5 56466
......
......
nw_n xxxxx
So far I am trying this, but it is not working as expected.
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
Kindly help
Give this awk one-liner a try:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master
Maybe this helps:
awk 'NR == FNR {id[$1]=1; next}
{
    if (id[$1] == "") {
        print $0
    }
}' listfile masterfile
We accept 2 files as input above, first one is listfile, second is masterfile.
NR == FNR is true while awk is going through listfile. In the associative array id[], every id in listfile is made a key, with 1 as the value.
When awk goes through masterfile, it only prints a line if $1, i.e. the id, is not a key in the array id[].
The OP attempted the following line:
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
This line will not work: for every entry $i, you print all entries in master.txt that are not equivalent to "$i". As a consequence, you end up with multiple copies of master.txt, each missing a single line.
Example:
$ for i in 1 2; do grep -v -w "$i" <(seq 1 3); done
2 \ copy of seq 1 3 without entry 1
3 /
1 \ copy of seq 1 3 without entry 2
3 /
Furthermore, the attempt reads the file master.txt multiple times. This is very inefficient.
The unix tool grep allows one to check multiple patterns stored in a file in a single pass. This is done using the -f flag. Normally this looks like:
$ grep -f list.txt master.txt
The OP can use this now in the following way:
$ grep -vwf <(awk '{print $1}' list.txt) master.txt
But this would match anywhere on the full line, not just in the first column.
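If you want to stay with grep, one way to tighten this, assuming the ids always sit at the start of a line in master.txt, is to anchor each pattern (a sketch, untested against your real data):
$ grep -vwf <(awk '{print "^" $1}' list.txt) master.txt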
The awk solution presented by Kent is more flexible and allows the OP to define a more tuned match:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master
Here the OP clearly states, I want to match column 1 of list with column 1 of master and I don't care about spaces or whatever is in column 2. The grep solution could still match entries in column 2.

using variable defined outside Awk

I have codded the following lines :
ARRAY=($(awk 'FS = ";" {print $3}' file.txt))
LINE_CREATOR=`echo "aaaa;bbbb;cccccccc" |
'{awk -F";"};
END
for (i in ARRAY)
{
print $'${ARRAY['i']}'
}
}'`
The file.txt looks like:
1;8;3
4;6;1
7;9;2
Explanation :
the array contains the values 3 1 2,
so the loop iterates over the array and extracts fields $3, $1, $2 from "aaaa;bbbb;cccccccc" using awk,
and the final output should be this
ccccccccaaaabbbb
I still have some errors while launching my script.
I'm making a few guesses here but I think that this does what you want:
$ echo "aaaa;bbbb;cccccccc" | awk -F\; 'NR == FNR { n = split($0, a); next }
{ printf "%s", a[$3] } END { print "" }' - file
ccccccccaaaabbbb
NR == FNR means that the block is only run for the first input. - as an argument tells awk to read first from standard input. The string is split on FS (;) into the array a. next skips the rest of the script.
The second block is only run for the second input (the text file). The values in the third field are used to print the elements in the array a.
if you want to pass the index as an awk variable, here is another way
$ awk -F';' -v ix="$(cut -d\; -f3 file | paste -sd\;)" '
BEGIN{n=split(ix,a)}
{for(i=1;i<n;i++) printf "%s",$a[i];
printf "%s\n",$a[n]}' <<< "aaaa;bbbb;cccccccc"
ccccccccaaaabbbb
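If you would rather keep your original two-step structure (bash array first, then awk), a sketch in the same spirit passes the array contents in with -v:
ARRAY=($(awk -F';' '{print $3}' file.txt))
echo "aaaa;bbbb;cccccccc" |
awk -F';' -v ix="${ARRAY[*]}" '
    BEGIN { n = split(ix, a, " ") }
    { for (i = 1; i <= n; i++) printf "%s", $a[i]; print "" }'
Here ix receives the space-joined array (3 1 2), which awk splits back apart and uses as field numbers.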

Grep input from file and sum columns based on search

I am using grep to read search strings from a file and awk to print the sum of a column based on the search results:
grep -f input data.txt |awk '{ sum+=$2} END {print sum}'
This gives me a single sum across all the input strings. How do I get a separate sum for each input string?
Sample input
a
b
c
Sample data.txt
a/cell1 5
b/cell1 5
a/cell2 8
c/cell1 10
Number of lines in input: ~32
Size of data.txt: 5GB
Expected results:
a 13
b 5
c 10
$ awk 'NR==FNR{sum[$0]=0; next} {split($1,k,"/"); if (k[1] in sum) sum[k[1]]+=$2} END{for (key in sum) print key, sum[key]}' input data.txt
a 13
b 5
c 10
(The split on "/" is needed because $1 in data.txt is e.g. a/cell1, while the keys read from input are just a, b, c.)
Hard to tell without seeing your files, but maybe:
grep -f input data.txt | \
awk '{sum[$1] += $2} END { for (key in sum) { print key, sum[key] } }'
The following avoids accumulating unnecessary details, and therefore may circumvent the memory allocation error. It assumes the list of the strings of interest is in a file named input:
awk -v dict=input '
BEGIN {while((getline<dict) > 0) {a[$1]=1}}
a[$1] {sum[$1] += $2}
END { for (key in sum) { print key, sum[key] } }'
If this does not resolve the memory issue, then please give some details about your awk, OS, and anything else that may be relevant.
Is this fast enough running against your 5GB file?
awk 'NR == FNR {sum[$1]+=$2} NR != FNR {printf "%s %s\n", $1, sum[$1] }' file1 file2
Where file1 is the 5GB file and file2 is the file containing the strings you want to find in file1.
EDIT
As #EdMorton commented earlier, my solution will print blank for sum[$1] when $1 is not found.
In addition, #EdMorton provided an answer which will print 0 instead.
I suggest to check out his answer first, as it is assumed to meet your needs better.
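For reference, a minimal tweak in that spirit, assuming the same two-file layout as above, coerces the missing sums to a number with %d so absent keys print 0 instead of blank:
awk 'NR == FNR {sum[$1]+=$2; next} {printf "%s %d\n", $1, sum[$1]}' file1 file2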

bash print first to nth column in a line iteratively

I am trying to get the column names of a file and print them iteratively. I guess the problem is with the print $i but I don't know how to correct it. The code I tried is:
#! /bin/bash
for i in {2..5}
do
set snp = head -n 1 smaller.txt | awk '{print $i}'
echo $snp
done
Example input file:
ID Name Age Sex State Ext
1 A 12 M UT 811
2 B 12 F UT 818
Desired output:
Name
Age
Sex
State
Ext
But the output I get is a blank screen.
You'd better just read the first line of your file and store the result as an array:
read -a header < smaller.txt
and then printf the relevant fields:
printf "%s\n" "${header[@]:1}"
Moreover, this uses bash only, and involves no unnecessary loops.
Edit. To also answer your comment, you'll be able to loop through the header fields thus:
read -a header < smaller.txt
for snp in "${header[@]:1}"; do
echo "$snp"
done
Edit 2. Your original method had many many mistakes. Here's a corrected version of it (although what I wrote before is a much preferable way of solving your problem):
for i in {2..5}; do
snp=$(head -n 1 smaller.txt | awk "{print \$$i}")
echo "$snp"
done
set probably doesn't do what you think it does.
Because of the single quotes in awk '{print $i}', the $i never gets expanded by bash.
This algorithm is not good since you're calling head and awk 4 times, whereas you don't need any external processes at all.
Hope this helps!
You can print it using awk itself:
awk 'NR==1{for (i=2; i<=5; i++) print $i}' smaller.txt
The main problem with your code is that your assignment syntax is wrong. Change this:
set snp = head -n 1 smaller.txt | awk '{print $i}'
to this:
snp=$(head -n 1 smaller.txt | awk -v i="$i" '{print $i}')
That is:
Do not use set. set is for setting shell options, numbered parameters, and so on, not for assigning arbitrary variables.
Remove the spaces around =.
To run a command and capture its output as a string, use $(...) (or `...`, but $(...) is less error-prone).
Pass the shell's $i into awk with -v i="$i". Inside the single quotes bash does not expand $i, and awk's own i is unset, so plain '{print $i}' would print $0, the whole line.
That said, I agree with gniourf_gniourf's approach.
Here's another alternative; not necessarily better or worse than any of the others (note that cut -d ' ' assumes single spaces between the header fields):
for n in $(head -n 1 smaller.txt | cut -d ' ' -f 2-)
do
echo "$n"
done
something like:
for x1 in $(head -n 1 smaller.txt); do
echo "$x1"
done
(note this also prints the first field, ID)
