count using awk commands - Linux

I have fileA.txt and a few lines of it are shown below:
AA
BB
CC
DD
EE
And I have fileB.txt, with text like that shown below:
Group col2 col3 col4
1 pp 4567 AA,BC,AB
1 qp 3428 AA
2 pp 3892 AA
3 ee 28399 AA
4 dd 3829 BB,CC
1 dd 27819 BB
5 ak 29938 CC
For every line in fileA.txt, I want to count the number of times it is present in fileB.txt, counting each group (column 1 of fileB.txt) only once.
Sample output should look like:
AA 3
BB 2
CC 2
AA is present 4 times, but two of those occurrences are within group "1". If a value is present more than once in the same group (column 1), it should be counted only once; therefore, in the output above, the AA count is 3.
Any help using awk or any other one-liners?

Here is an awk one-liner that should work:
awk '
NR==FNR && !seen[$4,$1]++{count[$4]++;next}
($1 in count){print $1,count[$1]}' fileB.txt fileA.txt
Explanation:
The pattern NR==FNR && !seen[$4,$1]++ is only true the first time a given (column 4, column 1) pair is seen, so duplicate occurrences within the same group don't increment the counter.
$1 in count checks whether column 1 of the second file is present in the array; if it is, we print it along with its count.
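If the NR==FNR idiom is unfamiliar: FNR resets to 1 at the start of each input file while NR keeps counting across files, so NR==FNR is true only while the first file is being read. You can watch the two counters diverge with a throwaway command like this:
$ awk '{print FILENAME, NR, FNR}' fileA.txt fileB.txt | head -7
fileA.txt 1 1
fileA.txt 2 2
fileA.txt 3 3
fileA.txt 4 4
fileA.txt 5 5
fileB.txt 6 1
fileB.txt 7 2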
Output:
$ awk 'NR==FNR && !seen[$4,$1]++{count[$4]++;next}($1 in count){print $1,count[$1]}' fileB.txt fileA.txt
AA 3
BB 2
CC 1
Update based on the modified question:
awk '
NR==FNR {
    n = split($4, tmp, /,/)
    for (x = 1; x <= n; x++) {
        if (!seen[$1, tmp[x]]++) {
            count[tmp[x]]++
        }
    }
    next
}
($1 in count) {
    print $1, count[$1]
}' fileB.txt fileA.txt
Outputs:
AA 3
BB 2
CC 2
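The only real change from the first version is that the comma-separated fourth column is split into its parts before deduplicating. A quick check of what split() does here:
$ echo 'AA,BC,AB' | awk '{n = split($1, tmp, /,/); for (x = 1; x <= n; x++) print x, tmp[x]}'
1 AA
2 BC
3 AB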

Pure bash (4.0 or newer):
#!/bin/bash
declare -A items=()
# read in the list of items to track
while read -r; do items[$REPLY]=0; done <fileA.txt
# read fourth column from fileB and increment for each match
while read -r _ _ _ item _; do
    [[ ${items[$item]} ]] || continue     # skip unrecognized values
    items[$item]=$(( items[$item] + 1 ))  # otherwise, increment
done <fileB.txt
# print output
for key in "${!items[@]}"; do             # iterate over keys
    value="${items[$key]}"                # look up values
    printf '%s\t%s\n' "$key" "$value"     # print them together
done
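Note that this version counts every matching row of fileB.txt, so it answers the original question but not the updated one (per-group deduplication and the comma-separated fourth column). A sketch of how the same approach could be extended to cover both, assuming the file layouts shown above (the seen array and the variable names are mine):
#!/bin/bash
declare -A items=() seen=()
# read in the list of items to track
while read -r; do items[$REPLY]=0; done <fileA.txt
while read -r group _ _ field _; do
    IFS=, read -ra vals <<<"$field"                # split comma-separated column 4
    for item in "${vals[@]}"; do
        [[ ${items[$item]+x} ]] || continue        # skip values not in fileA.txt
        [[ ${seen[$group,$item]+x} ]] && continue  # count each group only once
        seen[$group,$item]=1
        items[$item]=$(( items[$item] + 1 ))
    done
done <fileB.txt
for key in "${!items[@]}"; do printf '%s\t%s\n' "$key" "${items[$key]}"; done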

A simple awk one-liner.
awk 'NR>FNR{if($0 in a)print $0,a[$0];next}!a[$4,$1]++{a[$4]++}' fileB.txt fileA.txt
Note the order of the files: fileB.txt must be read first so the counts exist before the lookups from fileA.txt happen.

Related

Executing Concatenation for all rows

I'm working with GWAS data.
Using the plink command I was able to get SNPslist, SNPs.map, SNPs.ped.
Here are the data files and commands I have for 2 SNPs (rs6923761, rs7903146):
$ cat SNPs.map
0 rs6923761 0 0
0 rs7903146 0 0
$ cat SNPs.ped
6 6 0 0 2 2 G G C C
74 74 0 0 2 2 A G T C
421 421 0 0 2 2 A G T C
350 350 0 0 2 2 G G T T
302 302 0 0 2 2 G G C C
bash commands I used:
echo -n IID > SNPs.csv
cat SNPs.map | awk '{printf ",%s", $2}' >> SNPs.csv
echo >> SNPs.csv
cat SNPs.ped | awk '{printf "%s,%s%s,%s%s\n", $1, $7, $8, $9, $10}' >> SNPs.csv
cat SNPs.csv
Output:
IID,rs6923761,rs7903146
6,GG,CC
74,AG,TC
421,AG,TC
350,GG,TT
302,GG,CC
This was for 2 SNPs, so I could find their positions manually and build the command above by hand. But now I have over 2000 SNP IDs and their values, and I need help with a bash command that can parse all of them in the same way.
One awk idea that replaces all of the current code:
awk '
BEGIN { printf "IID" }
# process 1st file:
FNR==NR { printf ",%s", $2; next }
# process 2nd file:
FNR==1 { print "" }                 # terminate 1st line of output
{ printf "%s", $1                   # print 1st column
  for (i=7; i<=NF; i=i+2)           # loop through columns 7-NF, incrementing index +2 on each pass
      printf ",%s%s", $i, $(i+1)    # print (i)th and (i+1)th columns
  print ""                          # terminate line
}
' SNPs.map SNPs.ped
NOTE: remove comments to declutter code
This generates:
IID,rs6923761,rs7903146
6,GG,CC
74,AG,TC
421,AG,TC
350,GG,TT
302,GG,CC
You can use --recodeA flag in plink to have your IID as rows and SNPs as columns.
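For reference, a hedged sketch of that plink invocation (assuming plink 1.x and that your .map/.ped files share the SNPs prefix; check your version's documentation for the exact recode syntax):
plink --file SNPs --recodeA --out SNPs_additive
# SNPs_additive.raw should then contain one row per IID and one column per SNP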

Bash column sum over a table of variable length

I'm trying to get the column sums (except for the first column) of a tab-delimited file containing numbers.
To find out the number of columns and store it in a variable, I use:
cols=$(awk '{print NF}' file.txt | sort -nu | tail -n 1)
Next I want to calculate the sum of all numbers in each column and store it in a variable again, in a for loop:
for c in 2:$col
do
num=$(cat file.txt | awk '{sum+$2 ; print $0} END{print sum}'| tail -n 1)
done
This
num=$(cat file.txt | awk '{sum+$($c) ; print $0} END{print sum}'| tail -n 1)
on its own, with a fixed column number and without variable input, works fine, but I cannot get it to accept the for-loop variable.
Thanks for the support
P.S. It would also be fine if I could sum all columns (except the first one) at once, without the loop trouble.
Assuming you want the sums of the individual columns,
$ cat file
1 2 3 4
5 6 7 8
9 10 11 12
$ awk '
{for (i=2; i<=NF; i++) sum[i] += $i}
END {for (i=2; i<=NF; i++) printf "%d%s", sum[i], OFS; print ""}
' file
18 21 24
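If you do want the per-column loop from your attempt, the main fix is passing the shell variable into awk with -v instead of trying to interpolate $c inside the single-quoted script. A minimal sketch:
cols=$(awk '{print NF}' file.txt | sort -nu | tail -n 1)   # widest row
for ((c = 2; c <= cols; c++)); do
    num=$(awk -v col="$c" '{sum += $col} END {print sum}' file.txt)
    echo "column $c: $num"
done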
In case you're not bound to awk, there's a nice tool for "command-line statistical operations" on textual files called GNU datamash.
With datamash, summing (probably the simplest operation of all) the 2nd column is as easy as:
$ datamash sum 2 < table
9
Assuming the table file holds tab-separated data like:
$ cat table
1 2 3 4
2 3 4 5
3 4 5 6
To sum all columns from 2 to n use column ranges (available in datamash 1.2):
$ n=4
$ datamash sum 2-$n < table
9 12 15
To include headers in the output, see the --header-out option.
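For example, on the same table (a sketch; the generated header labels may differ between datamash versions):
$ datamash --header-out sum 2-$n < table
sum(field-2)	sum(field-3)	sum(field-4)
9	12	15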

add a column with different label

I want to add a column that contains two different labels. Let's say I have this text
aa bb cc
dd ee ff
gg hh ii
ll mm nn
oo pp qq
and I want to add 1 at the first column of the first two lines and 2 at the first column of the remaining lines, so that eventually I will get this text:
1 aa bb cc
1 dd ee ff
2 gg hh ii
3 ll mm nn
4 oo pp qq
Do you know how to do it?
thanks
Assuming you are processing a text file in a Linux shell, you could use awk for this. Your problem description says you want two labels, 1 and 2; that would be:
cat input.txt | awk '{print (NR<=2 ? "1 ":"2 ") $0}'
Your expected output, however, says you want label 1 for the first two lines and then a count starting from 2 at the third line; that would be:
cat input.txt | awk '{print (NR<=2 ? "1 ":NR-1" ") $0}'
I'm assuming that you want to do this using the shell. If your data is in a file called input.txt, you can use either cat -n or nl.
% tail -n+2 input.txt | cat -n
1 dd ee ff
2 gg hh ii
3 ll mm nn
4 oo pp qq
% tail -n+2 input.txt | nl
1 dd ee ff
2 gg hh ii
3 ll mm nn
4 oo pp qq
The first line can be added back manually.
The two commands will behave differently if you have empty lines in your input file.
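For instance, with a blank line in the input, cat -n numbers it while nl skips it by default:
$ printf 'a\n\nb\n' | cat -n
     1	a
     2	
     3	b
$ printf 'a\n\nb\n' | nl
     1	a

     2	b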
Could you please try the following and let me know if it helps you.
Solution 1: Use a variable named count with an initial value of 1; if the line number is 1 or 2, simply prepend 1 to $1, otherwise increment count and prepend its value to $1.
awk -v count=1 '{$1=NR==1||NR==2?1 FS $1:++count FS $1} 1' Input_file
Solution 2: If the line number is 1 or 2, prepend 1 to $1; otherwise, if the line is not empty, prepend NR-1 (the line number minus 1) to $1.
awk '{$1=NR==1||NR==2?1 FS $1:(NF?FNR-1 FS $1:"")} 1' Input_file
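Both produce the expected output on the sample text, for example:
$ awk -v count=1 '{$1=NR==1||NR==2?1 FS $1:++count FS $1} 1' Input_file
1 aa bb cc
1 dd ee ff
2 gg hh ii
3 ll mm nn
4 oo pp qq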

If first two columns are equal, select top 3 based on descending order of 3rd column

I want to select the top 3 results for every group of lines that share the same first two columns.
For example the data will look like,
cat data.txt
A A 10
A A 1
A A 2
A A 5
A A 8
A B 1
A B 2
A C 6
A C 5
A C 10
A C 1
B A 1
B A 1
B A 2
B A 8
And for the result I want
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 1
B A 1
B A 2
Note that some of the "groups" do not contain 3 rows.
I have tried
sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt
It seems to be working, but since I am learning to code better, I was wondering if there is a better way to go about this. Also, my approach generates many intermediate files that I then have to delete.
You can do:
$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Explanation:
There are two key items to understanding the awk program: associative arrays and fields.
If you reference an empty awk array element, it is created as an empty container, ready for anything you put into it. You can use that as a counter.
You state: If first two columns are equal...
The sort puts the file in the desired order. The expression a[$1,$2] uses the values of the first two fields as a unique key into an associative array.
You then state: ...select top 3 based on descending order of 3rd column...
Once again, the sort puts the file into the desired order, and a[$1,$2]++ counts the entries for each key. Now just count up to three.
awk is organized into blocks of condition {action}. The condition a[$1,$2]++<3 is true for the first three occurrences of each key and false from the fourth occurrence on.
A wordier version of the program would be:
awk 'a[$1,$2]++<3 {print $0}'
But the default action if the condition is true is to print $0 so it is not needed.
If you are processing text in Unix, you should get to know awk. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.
Great place to start is the online book Effective AWK Programming by Arnold D. Robbins
@dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:
sort -k1,2 -k3,3nr file |
awk '
{key = $1 FS $2}
prev != key {prev = key; count = 1}
count <= 3 {print; count++}
'
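Because the input arrives sorted, this script only keeps the current key and one counter instead of an array entry for every distinct column pair, which is where the memory saving comes from. On the sample data it prints the same lines as the answer above:
$ sort -k1,2 -k3,3nr data.txt | awk '{key = $1 FS $2} prev != key {prev = key; count = 1} count <= 3 {print; count++}'
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1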
You can sort the file by the first two columns primarily and by the 3rd column in reverse numeric order secondarily, then read the output and only print the first three lines for each combination of the first two columns.
sort -k1,2 -k3,3rn data.txt \
| while read c1 c2 n ; do
    if [[ $c1 == $l1 && $c2 == $l2 ]] ; then
        ((c++))
    else
        c=0
    fi
    if (( c < 3 )) ; then
        echo $c1 $c2 $n
        l1=$c1
        l2=$c2
    fi
done

awk: sum up multiple files, show lines which do not appear in both sets of files

I have been using awk to sum up multiple files; it adds up the summaries of server-log parsing values, and it really does speed up the final overall count. But I have hit a minor problem, and the typical examples I have found on the web have not helped.
Here is the example:
cat file1
aa 1
bb 2
cc 3
ee 4
cat file2
aa 1
bb 2
cc 3
dd 4
cat file3
aa 1
bb 2
cc 3
ff 4
And the script:
cat test.sh
#!/bin/bash
files="file1 file2 file3"
i=0;
oldname="";
for names in $(echo $files); do
    ((i++));
    if [ $i == 1 ]; then
        oldname=$names
        #echo "-- $i $names"
        shift;
    else
        oldname1=$names.$$
        awk 'NR==FNR { _[$1]=$2 } NR!=FNR { if(_[$1] != "") nn=0; nn=($2+_[$1]); print $1" "nn }' $names $oldname > $oldname1
        if [ $i -gt 2 ]; then
            rm $oldname;
        fi
        oldname=$oldname1
    fi
done
echo "------------------------------ $i"
cat $oldname
When I run this, the identical keys are added up, but those that appear in only one of the files are not:
./test.sh
------------------------------ 3
aa 3
bb 6
cc 9
ee 4
ff and dd do not appear in the list; from what I have seen, the problem is within the NR==FNR block.
I have come across this:
http://dbaspot.com/shell/246751-awk-comparing-two-files-problem.html
If you want all the lines in file1 that are not in file2:
awk 'NR == FNR { a[$0]; next } !($0 in a)' file2 file1
If you want only the unique lines in file1 that are not in file2:
awk 'NR == FNR { a[$0]; next } !($0 in a) { print; a[$0] }' file2 file1
but this only complicates the current issue further when attempted, since lots of other fields get duplicated.
Updates after posting the question, with more tests:
I wanted to stick with awk since it appears to be a much shorter way of achieving the result, but there is still a problem:
awk '{a[$1]+=$2}END{for (k in a) print k,a[k]}' file1 file2 file3
aa 3
bb 6
cc 9
ee 4
ff 4
gg 4
RESULT_SET_4 0
RESULT_SET_3 0
RESULT_SET_2 0
RESULT_SET_1 0
$ cat file1
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
ff 4
$ cat file2
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
ee 4
The file content is not left as it was originally, i.e. the results are no longer under their headings; my original method did keep it all intact.
Updated expected output, with the headings in the correct context:
cat file1
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
ff 4
cat file2
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
ee 4
cat file3
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
gg 4
The test.sh awk line that produces the above is:
awk -v i=$i 'NR==FNR { _[$1]=$2 } NR!=FNR { if (_[$1] != "") { if ($2 ~ /[0-9]/) { nn=($2+_[$1]); print $1" "nn; } else { print;} }else { print; } }' $names $oldname> $oldname1
./test.sh
------------------------------ 3
RESULT_SET_1
aa 3
RESULT_SET_2
bb 6
RESULT_SET_3
cc 9
RESULT_SET_4
ff 4
This next attempt sums correctly but destroys the required formatting:
awk '($2 != "") {a[$1]+=$2}; ($2 == "") { a[$1]=$2 } END {for (k in a) print k,a[k]} ' file1 file2 file3
aa 3
bb 6
cc 9
ee 4
ff 4
gg 4
RESULT_SET_4
RESULT_SET_3
RESULT_SET_2
RESULT_SET_1
$ awk '{a[$1]+=$2}END{for (k in a) print k,a[k]}' file1 file2 file3 | sort
aa 3
bb 6
cc 9
dd 4
ee 4
ff 4
Edit:
It's a bit of a hack but it does the job:
$ awk 'FNR==NR&&!/RESULT/{a[$1]=$2;next}($1 in a){a[$1]+=$2}END{for (k in a) print k,a[k]}' file1 file2 file3 | sort | awk '$1="RESULTS_SET_"NR"\n"$1'
RESULTS_SET_1
aa 3
RESULTS_SET_2
bb 6
RESULTS_SET_3
cc 9
RESULTS_SET_4
ff 4
You can do this in awk, as sudo_O suggested, but you can also do it in pure bash.
#!/bin/bash
# We'll use an associative array, where the indexes are strings.
declare -A a
# Our list of files, in an array (not associative)
files=(file1 file2 file3)
# Walk through array of files...
for file in "${files[@]}"; do
    # And for each file, increment the array index with the value.
    while read index value; do
        ((a[$index]+=$value))
    done < "$file"
done
# Walk through array. ${!...} returns a list of indexes.
for i in "${!a[@]}"; do
    echo "$i ${a[$i]}"
done
And the result:
$ ./doit
dd 4
aa 3
ee 4
bb 6
ff 4
cc 9
And if you want the output sorted ... you can pipe it through sort. :)
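For example:
$ ./doit | sort
aa 3
bb 6
cc 9
dd 4
ee 4
ff 4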
Here's one way using GNU awk (arrays of arrays like a[i][k] require gawk 4.0 or newer). Run like:
awk -f script.awk File1 File2 File3
Contents of script.awk:
sub(/RESULT_SET_/,"") {
    i = $1
    next
}
{
    a[i][$1] += $2
}
END {
    for (j=1; j<=length(a); j++) {
        print "RESULT_SET_" j
        for (k in a[j]) {
            print k, a[j][k]
        }
    }
}
Results:
RESULT_SET_1
aa 3
RESULT_SET_2
bb 6
RESULT_SET_3
cc 9
RESULT_SET_4
ee 4
ff 4
gg 4
Alternatively, here's the one-liner:
awk 'sub(/RESULT_SET_/,"") { i = $1; next } { a[i][$1]+=$2 } END { for (j=1;j<=length(a);j++) { print "RESULT_SET_" j; for (k in a[j]) print k, a[j][k] } }' File1 File2 File3
Fixed using the following.
Basically it goes through each file; if an entry exists in the new file but not in the accumulated one, it inserts that entry into the accumulated file at the approximate line number with a 0 value, so that the summing pass can pick it up. I have been testing this on my current output and it seems to be working really well.
#!/bin/bash
files="file1 file2 file3 file4 file5 file6 file7 file8"
RAND="$$"
i=0;
oldname="";
for names in $(echo $files); do
    ((i++));
    if [ $i == 1 ]; then
        oldname=$names
        shift;
    else
        oldname1=$names.$RAND
        for entries in $(awk -v i=$i 'NR==FNR { _[$1]=$2 } NR!=FNR { if (_[$1] == "") { if ($2 ~ /[0-9]/) { nn=0; nn=(_[$1]+=$2); print FNR"-"$1"%0"} else { } } else { } }' $oldname $names); do
            line=$(echo ${entries%%-*})
            content=$(echo ${entries#*-})
            content=$(echo $content | tr "%" " ")
            edit=$(ed -s $oldname << EOF
$line
a
$content
.
w
q
EOF
)
            $edit >/dev/null 2>&1
        done
        awk -v i=$i 'NR==FNR { _[$1]=$2 } NR!=FNR { if (_[$1] != "") { if ($2 ~ /[0-9]/) { nn=0; nn=($2+_[$1]); print $1" "nn; } else { print $1;} } else { print; } }' $names $oldname > $oldname1
        oldname=$oldname1
    fi
done
cat $oldname
#rm file?.*
