awk: sum up multiple files and show lines that do not appear in both sets of files - linux

I have been using awk to sum up multiple files; it is used to sum the summary values from server log parsing, and it really does speed up the final overall count. But I have hit a minor problem, and the typical examples I have found on the web have not helped.
Here is the example:
cat file1
aa 1
bb 2
cc 3
ee 4
cat file2
aa 1
bb 2
cc 3
dd 4
cat file3
aa 1
bb 2
cc 3
ff 4
And the script:
cat test.sh
#!/bin/bash
files="file1 file2 file3"
i=0
oldname=""
for names in $(echo $files); do
  ((i++))
  if [ $i == 1 ]; then
    oldname=$names
    #echo "-- $i $names"
    shift
  else
    oldname1=$names.$$
    awk 'NR==FNR { _[$1]=$2 } NR!=FNR { if(_[$1] != "") nn=0; nn=($2+_[$1]); print $1" "nn }' $names $oldname > $oldname1
    if [ $i -gt 2 ]; then
      rm $oldname
    fi
    oldname=$oldname1
  fi
done
echo "------------------------------ $i"
cat $oldname
When I run this, the matching keys are added up, but keys that appear in only one of the files can be dropped:
./test.sh
------------------------------ 3
aa 3
bb 6
cc 9
ee 4
ff and dd do not appear in the list; from what I have seen, the problem is within the NR==FNR handling.
I have come across this:
http://dbaspot.com/shell/246751-awk-comparing-two-files-problem.html
If you want all the lines in file1 that are not in file2:
awk 'NR == FNR { a[$0]; next } !($0 in a)' file2 file1
If you want only unique lines in file1 that are not in file2:
awk 'NR == FNR { a[$0]; next } !($0 in a) { print; a[$0] }' file2 file1
but this only complicates the current issue further when attempted, since lots of other fields get duplicated.
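For reference, the reason keys vanish in my script is that the awk only prints lines of its second input file; keys read into _ from the first file but never matched live only in the array and are never printed. For two files, an END loop can rescue them - a minimal sketch, assuming plain two-column input:
awk 'NR==FNR { _[$1]=$2; next }
     { seen[$1]; print $1, $2 + _[$1] }
     END { for (k in _) if (!(k in seen)) print k, _[k] }' file2 file1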
Update after posting the question - more content and tests...
I wanted to stick with awk since it does appear to be a much shorter way of achieving the result, but there is still a problem:
awk '{a[$1]+=$2}END{for (k in a) print k,a[k]}' file1 file2 file3
aa 3
bb 6
cc 9
ee 4
ff 4
gg 4
RESULT_SET_4 0
RESULT_SET_3 0
RESULT_SET_2 0
RESULT_SET_1 0
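The RESULT_SET_* headings come out with a count of 0 because a[$1]+=$2 also runs on the one-field heading lines, creating keys with a value of 0, as the file listings below show. A guard on NF would skip them, though the headings are then lost entirely - a sketch:
awk 'NF==2 { a[$1]+=$2 } END { for (k in a) print k, a[k] }' file1 file2 file3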
$ cat file1
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
ff 4
$ cat file2
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
ee 4
The file content is not left as it was originally, i.e. the results are no longer under their headings; my original method did keep it all intact.
Updated input files, with the headings in context:
cat file1
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
ff 4
cat file2
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
ee 4
cat file3
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
gg 4
The test.sh awk line that produces the above is:
awk -v i=$i 'NR==FNR { _[$1]=$2 } NR!=FNR { if (_[$1] != "") { if ($2 ~ /[0-9]/) { nn=($2+_[$1]); print $1" "nn } else { print } } else { print } }' $names $oldname > $oldname1
./test.sh
------------------------------ 3
RESULT_SET_1
aa 3
RESULT_SET_2
bb 6
RESULT_SET_3
cc 9
RESULT_SET_4
ff 4
The following also sums everything, but destroys the required formatting:
awk '($2 != "") {a[$1]+=$2}; ($2 == "") { a[$1]=$2 } END {for (k in a) print k,a[k]} ' file1 file2 file3
aa 3
bb 6
cc 9
ee 4
ff 4
gg 4
RESULT_SET_4
RESULT_SET_3
RESULT_SET_2
RESULT_SET_1

$ awk '{a[$1]+=$2}END{for (k in a) print k,a[k]}' file1 file2 file3 | sort
aa 3
bb 6
cc 9
dd 4
ee 4
ff 4
Edit:
It's a bit of a hack but it does the job:
$ awk 'FNR==NR&&!/RESULT/{a[$1]=$2;next}($1 in a){a[$1]+=$2}END{for (k in a) print k,a[k]}' file1 file2 file3 | sort | awk '$1="RESULTS_SET_"NR"\n"$1'
RESULTS_SET_1
aa 3
RESULTS_SET_2
bb 6
RESULTS_SET_3
cc 9
RESULTS_SET_4
ff 4

You can do this in awk, as sudo_O suggested, but you can also do it in pure bash.
#!/bin/bash
# We'll use an associative array, where the indexes are strings.
declare -A a
# Our list of files, in an array (not associative)
files=(file1 file2 file3)
# Walk through array of files...
for file in "${files[@]}"; do
  # And for each file, increment the array index with the value.
  while read index value; do
    ((a[$index]+=$value))
  done < "$file"
done
# Walk through array. ${!...} returns a list of indexes.
for i in "${!a[@]}"; do
  echo "$i ${a[$i]}"
done
And the result:
$ ./doit
dd 4
aa 3
ee 4
bb 6
ff 4
cc 9
And if you want the output sorted ... you can pipe it through sort. :)
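For example:
$ ./doit | sort
aa 3
bb 6
cc 9
dd 4
ee 4
ff 4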

Here's one way using GNU awk. Run like:
awk -f script.awk File1 File2 File3
Contents of script.awk:
sub(/RESULT_SET_/,"") {
    i = $1
    next
}
{
    a[i][$1] += $2
}
END {
    for (j=1; j<=length(a); j++) {
        print "RESULT_SET_" j
        for (k in a[j]) {
            print k, a[j][k]
        }
    }
}
Results:
RESULT_SET_1
aa 3
RESULT_SET_2
bb 6
RESULT_SET_3
cc 9
RESULT_SET_4
ee 4
ff 4
gg 4
Alternatively, here's the one-liner:
awk 'sub(/RESULT_SET_/,"") { i = $1; next } { a[i][$1]+=$2 } END { for (j=1;j<=length(a);j++) { print "RESULT_SET_" j; for (k in a[j]) print k, a[j][k] } }' File1 File2 File3
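Note that the a[i][$1] arrays-of-arrays syntax requires GNU awk 4.0 or newer. For other awks, roughly the same thing can be done with classic SUBSEP pseudo-multidimensional keys - a sketch, assuming the headings are numbered consecutively from 1:
awk 'sub(/RESULT_SET_/,"") { i = $1; if (i > max) max = i; next }
     { sum[i,$1] += $2; name[i,$1] = $1 }
     END { for (j=1; j<=max; j++) {
               print "RESULT_SET_" j
               for (k in name) { split(k, p, SUBSEP); if (p[1] == j) print name[k], sum[k] }
           } }' File1 File2 File3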

Fixed using this:
Basically it goes through each file; if an entry is missing from the running result file, it inserts that entry at the approximate line number with a 0 value so that the content can still be summed up. I have been testing this on my current output and it seems to be working really well.
#!/bin/bash
files="file1 file2 file3 file4 file5 file6 file7 file8"
RAND="$$"
i=0
oldname=""
for names in $files; do
  ((i++))
  if [ $i == 1 ]; then
    oldname=$names
  else
    oldname1=$names.$RAND
    # collect "linenumber-key%0" entries for keys present in $names but missing from $oldname
    for entries in $(awk 'NR==FNR { _[$1]=$2 } NR!=FNR { if (_[$1] == "" && $2 ~ /[0-9]/) { _[$1]+=$2; print FNR"-"$1"%0" } }' $oldname $names); do
      line=${entries%%-*}
      content=$(echo ${entries#*-} | tr "%" " ")
      # insert the missing key with a 0 value at the approximate line number
      ed -s $oldname >/dev/null 2>&1 << EOF
$line
a
$content
.
w
q
EOF
    done
    awk 'NR==FNR { _[$1]=$2 } NR!=FNR { if (_[$1] != "") { if ($2 ~ /[0-9]/) print $1" "($2+_[$1]); else print $1 } else print }' $names $oldname > $oldname1
    oldname=$oldname1
  fi
done
cat $oldname
#rm file?.*

Related

Executing Concatenation for all rows

I'm working with GWAS data.
Using the plink command I was able to get SNPslist, SNPs.map, SNPs.ped.
Here are the data files and commands I have for 2 SNPs (rs6923761, rs7903146):
$ cat SNPs.map
0 rs6923761 0 0
0 rs7903146 0 0
$ cat SNPs.ped
6 6 0 0 2 2 G G C C
74 74 0 0 2 2 A G T C
421 421 0 0 2 2 A G T C
350 350 0 0 2 2 G G T T
302 302 0 0 2 2 G G C C
bash commands I used:
echo -n IID > SNPs.csv
cat SNPs.map | awk '{printf ",%s", $2}' >> SNPs.csv
echo >> SNPs.csv
cat SNPs.ped | awk '{printf "%s,%s%s,%s%s\n", $1, $7, $8, $9, $10}' >> SNPs.csv
cat SNPs.csv
Output:
IID,rs6923761,rs7903146
6,GG,CC
74,AG,TC
421,AG,TC
350,GG,TT
302,GG,CC
This was for 2 SNPs, so I could see their positions manually and build the command above by hand. But now I have 2000 SNP IDs and their values. I need help with a bash command which can parse all 2000+ SNPs in the same way.
One awk idea that replaces all of the current code:
awk '
BEGIN { printf "IID" }
# process 1st file:
FNR==NR { printf ",%s", $2; next }
# process 2nd file:
FNR==1 { print "" } # terminate 1st line of output
{ printf "%s", $1 # print 1st column
for (i=7;i<=NF;i=i+2) # loop through columns 7-NF, incrementing index +2 on each pass
printf ",%s%s", $i, $(i+1) # print (i)th and (i+1)th columns
print "" # terminate line
}
' SNPs.map SNPs.ped
NOTE: remove comments to declutter code
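For reference, the same code with the comments stripped, as a one-liner:
awk 'BEGIN{printf "IID"} FNR==NR{printf ",%s",$2;next} FNR==1{print ""} {printf "%s",$1; for(i=7;i<=NF;i+=2) printf ",%s%s",$i,$(i+1); print ""}' SNPs.map SNPs.ped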
This generates:
IID,rs6923761,rs7903146
6,GG,CC
74,AG,TC
421,AG,TC
350,GG,TT
302,GG,CC
You can use the --recodeA flag in plink to have your IID as rows and SNPs as columns.
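An untested sketch of that route (plink 1.07-style syntax; the --file prefix and --out name here are assumptions):
plink --file SNPs --recodeA --out SNPs_recoded
# SNPs_recoded.raw then has one row per individual and one column per SNP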

I have two huge sequence files where I want to extract the same line numbers from file1 in file2

I have my two sequence files, and I have a list of rows/lines of interest from file1. I want to extract the lines from file2 with the same line numbers as in file1. The list is just one column of numbers.
I tried using awk in a loop, but all I get is an empty output file.
My code looks like this:
for i in <listfile>; do
  awk -F lnr="$i" 'NR==lnr' <file2> > outputfile
done
The output file is created but is just empty.
I could not find this question asked before, but if it has been, sorry for wasting your time.
If I understand the question - file 1 has a list of "line numbers" and you desire to print those lines in file2:
awk 'FNR==NR{line[$1]=1;next}{if(line[FNR]==1)print FNR, $0}' file1 file2
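A quick sanity check with throwaway files (the names here are just for illustration):
$ printf '2\n4\n' > file1
$ printf 'a\nb\nc\nd\ne\n' > file2
$ awk 'FNR==NR{line[$1]=1;next}{if(line[FNR]==1)print FNR, $0}' file1 file2
2 b
4 d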
Given the input...
for i in {a..z}; do echo $i; done > /tmp/list-1
for i in {z..a}; do echo $i; done > /tmp/list-2
The current line number within each file is stored in FNR, so you can use that.
$ awk -v a=4 -v b=9 'FNR >= a && FNR <= b { print FILENAME, NR, FNR, $0 }' /tmp/list-*
Sample output:
/tmp/list-1 4 4 d
/tmp/list-1 5 5 e
/tmp/list-1 6 6 f
/tmp/list-1 7 7 g
/tmp/list-1 8 8 h
/tmp/list-1 9 9 i
/tmp/list-2 30 4 w
/tmp/list-2 31 5 v
/tmp/list-2 32 6 u
/tmp/list-2 33 7 t
/tmp/list-2 34 8 s
/tmp/list-2 35 9 r
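If the list of line numbers lives in its own file, as in the question, the same FNR==NR idea replaces the shell loop in one pass - a sketch, assuming listfile contains one line number per line:
awk 'NR==FNR { want[$1]; next } FNR in want' listfile file2 > outputfile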

Insert a row and a column in a matrix using awk

I have a gridded dataset with 250 rows x 300 columns in matrix form:
ifile.txt
2 3 4 1 2 3
3 4 5 2 4 6
2 4 0 5 0 7
0 0 5 6 3 8
I would like to insert the latitude values in the first column and the longitude values at the top, so that it looks like:
ofile.txt
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
The increment is 0.33
I can do it manually for a small matrix, but I can't work out how to get my output in the desired format for the full grid. I was writing a script in the following way, but it is completely useless:
echo 20 > latitude.txt
for i in `seq 1 250`; do
  i1=$(( i + 0.33 ))   # bash can't recognize fractions
  echo $i1 >> latitude.txt
done
echo 100 > longitude.txt
for j in `seq 1 300`; do
  j1=$(( j + 0.33 ))
  echo $j1 >> longitude.txt
done
paste longitude.txt ifile.txt > dummy_file.txt
cat latitude.txt dummy_file.txt > ofile.txt
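Incidentally, the fractional increments that defeat bash's integer arithmetic are easy in awk itself - a sketch that would generate latitude.txt the way the loop above intends:
awk 'BEGIN { v = 20; for (i = 0; i <= 250; i++) { printf "%.2f\n", v; v += 0.33 } }' > latitude.txt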
$ cat tst.awk
BEGIN {
    lat = 100
    lon = 20
    latWid = lonWid = 6
    latDel = lonDel = 0.33
    latFmt = lonFmt = "%*.2f"
}
NR==1 {
    printf "%*s", latWid, ""
    for (i=1; i<=NF; i++) {
        printf lonFmt, lonWid, lon
        lon += lonDel
    }
    print ""
}
{
    printf latFmt, latWid, lat
    lat += latDel
    for (i=1; i<=NF; i++) {
        printf "%*s", lonWid, $i
    }
    print ""
}
$ awk -f tst.awk file
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
The following awk may also help you here.
awk -v col=100 -v row=20 'FNR==1{printf OFS;for(i=1;i<=NF;i++){printf row OFS;row=row+.33;};print ""} {col+=.33;$1=$1;print col OFS $0}' OFS="\t" Input_file
And here is a non-one-liner form of the same solution:
awk -v col=100 -v row=20 '
FNR==1{
  printf OFS
  for(i=1;i<=NF;i++){
    printf row OFS
    row=row+.33
  }
  print ""
}
{
  col+=.33
  $1=$1
  print col OFS $0
}
' OFS="\t" Input_file
Awk solution:
awk 'NR == 1{
long = 20.00; lat = 100.00; printf "%12s%.2f", "", long;
for (i=1; i<NF; i++) { long += 0.33; printf "\t%.2f", long } print "" }
NR > 1{ lat += 0.33 }
{
printf "%.2f%6s", lat, "";
for (i=1; i<=NF; i++) printf "\t%d", $i; print ""
}' file
With perl:
$ perl -lane 'print join "\t", "", map {20.00+$_*0.33} 0..$#F if $.==1;
print join "\t", 100+(0.33*$i++), @F' ip.txt
20 20.33 20.66 20.99 21.32 21.65
100 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
-a auto-splits input on whitespace, with the result saved in the @F array
See https://perldoc.perl.org/perlrun.html#Command-Switches for details on command line options
if $.==1 applies only to the first line of input
map {20.00+$_*0.33} 0..$#F iterates based on the size of the @F array; on each iteration we get a value from the expression inside {}, where $_ is 0, 1, etc. up to the last index of @F
print join "\t", "", map... uses a tab separator to print an empty element followed by the results of map
For all other lines, print the contents of @F prefixed with the result of 100+(0.33*$i++), where $i is initially 0 in numeric context. Again, tab is used as the separator when joining these values
Use sprintf if formatting is needed; also $, can be initialized instead of using join:
perl -lane 'BEGIN{$,="\t"; $st=0.33}
print "", map { sprintf "%.2f", 20+$_*$st} 0..$#F if $.==1;
print sprintf("%.2f", 100+($st*$i++)), @F' ip.txt

count using awk commands

I have fileA.txt and a few lines of it are shown below:
AA
BB
CC
DD
EE
And i have fileB.txt, and it has text like shown below:
Group col2 col3 col4
1 pp 4567 AA,BC,AB
1 qp 3428 AA
2 pp 3892 AA
3 ee 28399 AA
4 dd 3829 BB,CC
1 dd 27819 BB
5 ak 29938 CC
For every line in fileA.txt, it should count the number of times it is present in fileB.txt, deduplicated by the group value in column 1 of fileB.txt.
Sample output should look like:
AA 3
BB 2
CC 2
AA is present 4 times, but it is present in group "1" twice. If a value is present more than once in the same group (column 1), it should be counted only once; therefore in the above output the AA count is 3.
Any help using awk or any other one-liners?
Here is an awk one-liner that should work:
awk '
NR==FNR && !seen[$4,$1]++{count[$4]++;next}
($1 in count){print $1,count[$1]}' fileB.txt fileA.txt
Explanation:
The NR==FNR && !seen[$4,$1]++ pattern is only true the first time a given (column 4, column 1) pair is seen, so duplicates within the same group don't increment the counter.
$1 in count checks whether column 1 of the first file is present in the array. If it is, we print it along with its count.
Output:
$ awk 'NR==FNR && !seen[$4,$1]++{count[$4]++;next}($1 in count){print $1,count[$1]}' fileB.txt fileA.txt
AA 3
BB 2
CC 1
Update based on the modified question:
awk '
NR==FNR {
n = split($4,tmp,/,/);
for(x = 1; x <= n; x++) {
if(!seen[$1,tmp[x]]++) {
count[tmp[x]]++
}
}
next
}
($1 in count) {
print $1, count[$1]
}' fileB.txt fileA.txt
Outputs:
AA 3
BB 2
CC 2
Pure bash (4.0 or newer):
#!/bin/bash
declare -A items=()
# read in the list of items to track
while read -r; do items[$REPLY]=0; done <fileA.txt
# read fourth column from fileB and increment for each match
while read -r _ _ _ item _; do
[[ ${items[$item]} ]] || continue # skip unrecognized values
items[$item]=$(( items[$item] + 1 )) # otherwise, increment
done <fileB.txt
# print output
for key in "${!items[@]}"; do # iterate over keys
value="${items[$key]}" # look up values
printf '%s\t%s\n' "$key" "$value" # print them together
done
A simple awk one-liner.
awk 'NR>FNR{if($0 in a)print$0,a[$0];next}!a[$4,$1]++{a[$4]++}' fileB.txt fileA.txt
Note the order of files.

how to multiply two tables in BASH

I have two data files like this:
file1:
a1 a2 a3 ... aN
b1 b2 b3 ... bN
.
.
.
file2:
A1 A2 A3 ... AN
B1 B2 B3 ... BN
.
.
.
I want to multiply the two tables, i.e.,
a1*A1 a2*A2 a3*A3 ... aN*AN
b1*B1 b2*B2 b3*B3 ... bN*BN
.
.
.
Can I do it with AWK or something else in BASH? Thanks a lot!
Here's one way using GNU awk, assuming you have the same number of fields and rows in each file. Run like:
awk -f script.awk file1 file2
Contents of script.awk:
FNR==NR {
    for (i=1;i<=NF;i++) {
        a[NR][i]=$i
    }
    next
}
{
    for (j=1;j<=NF;j++) {
        $j = $j * a[FNR][j]
    }
}1
Alternatively, here's the one-liner:
awk 'FNR==NR { for(i=1;i<=NF;i++) a[NR][i]=$i; next } { for(j=1;j<=NF;j++) $j = $j * a[FNR][j] }1' file1 file2
Testing:
Contents of file1:
1 2 3
2 4 6
Contents of file2:
3 4 5
6 7 8
Results:
3 8 15
12 28 48
EDIT:
If, and I mean if, there could be extra fields that one file has that the other doesn't, change:
$j = $j * a[FNR][j]
to:
$j = (a[FNR][j] ? $j * a[FNR][j] : $j)
This will print the existing value and not zero. HTH.
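Note that the a[NR][i] syntax above also needs GNU awk 4.0 or newer. On other awks, the same multiplication can be written with classic SUBSEP pseudo-2D keys - a sketch:
awk 'FNR==NR { for (i=1; i<=NF; i++) a[NR,i] = $i; next }
     { for (j=1; j<=NF; j++) $j = $j * a[FNR,j] } 1' file1 file2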
