Bash column sum over a table of variable length - linux

I'm trying to get the column sums (except for the first one) of a tab-delimited file containing numbers.
To find out the number of columns and store it in a variable I use:
cols=$(awk '{print NF}' file.txt | sort -nu | tail -n 1)
Next, I want to calculate the sum of all numbers in that column and store it in a variable again, inside a for loop:
for c in 2:$col
do
num=$(cat file.txt | awk '{sum+=$2 ; print $0} END{print sum}' | tail -n 1)
done
This:
num=$(cat file.txt | awk '{sum+=$($c) ; print $0} END{print sum}' | tail -n 1)
works fine on its own with a fixed number and without variable input, but I cannot get it to accept the for-loop variable.
Thanks for the support.
P.S. It would also be fine if I could sum all columns (except the first one) at once, without the loop trouble.

Assuming you want the sums of the individual columns,
$ cat file
1 2 3 4
5 6 7 8
9 10 11 12
$ awk '
{for (i=2; i<=NF; i++) sum[i] += $i}
END {for (i=2; i<=NF; i++) printf "%d%s", sum[i], OFS; print ""}
' file
18 21 24
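As for the loop in the question: a shell variable is not visible inside awk's single quotes, so $c never reaches awk. The usual fix is to pass it in with awk's -v option and then use $col inside awk as a field index. A minimal sketch of the fixed loop, assuming cols holds the column count computed above:
for c in $(seq 2 "$cols"); do
    # pass the shell loop variable into awk, then sum the field numbered col
    num=$(awk -v col="$c" '{sum += $col} END {print sum}' file.txt)
    echo "column $c: $num"
done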

In case you're not bound to awk, there's a nice tool for "command-line statistical operations" on textual files called GNU datamash.
With datamash, summing (probably the simplest operation of all) the 2nd column is as easy as:
$ datamash sum 2 < table
9
Assuming the table file holds tab-separated data like:
$ cat table
1 2 3 4
2 3 4 5
3 4 5 6
To sum all columns from 2 to n, use column ranges (available since datamash 1.2):
$ n=4
$ datamash sum 2-$n < table
9 12 15
To include headers, see the --header-out option.
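For example, with no input headers datamash generates output header names itself; a quick sketch (the exact header text may vary by datamash version):
$ datamash --header-out sum 2 < table
sum(field-2)
9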

Related

If first two columns are equal, select top 3 based on descending order of 3rd column

I want to select the top 3 results for every group of lines that has the same first two columns.
For example, the data will look like:
cat data.txt
A A 10
A A 1
A A 2
A A 5
A A 8
A B 1
A B 2
A C 6
A C 5
A C 10
A C 1
B A 1
B A 1
B A 2
B A 8
And for the result I want
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Note that some of the "groups" do not contain 3 rows.
I have tried
sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt
It seems to be working, but since I am learning to code better, I was wondering if there is a better way to go about this. Also, my code generates many intermediate files that I will have to delete.
You can do:
$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Explanation:
There are two key items to understand in the awk program: associative arrays and fields.
If you reference an empty awk array element, it is an empty container -- ready for anything you put into it. You can use that as a counter.
You state "If first two columns are equal..."
The sort puts the file in the desired order. The expression a[$1,$2] uses the values of the first two fields as a unique key into an associative array.
You then state "...select top 3 based on descending order of 3rd column..."
Once again, the sort puts the file into the desired order, and the expression a[$1,$2]++ counts them. Now just count up to three.
awk programs are organized into blocks of the form condition { action }. The condition a[$1,$2]++<3 is true until more than 3 lines with the same key have been seen.
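The post-increment is what makes this work: a[$1,$2]++ yields the old count and then adds one, so the first three lines of a group evaluate to 0, 1 and 2 (all < 3). A quick way to watch the counter, using a made-up repeated input:
$ printf 'x\nx\nx\nx\n' | awk '{print $0, a[$0]++}'
x 0
x 1
x 2
x 3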
A wordier version of the program would be:
awk 'a[$1,$2]++<3 {print $0}'
But the default action when a condition is true is to print $0, so the explicit action is not needed.
If you are processing text in Unix, you should get to know awk. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.
A great place to start is the online book Effective AWK Programming by Arnold D. Robbins.
@Dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:
sort -k1,2 -k3,3nr file |
awk '
{key = $1 FS $2}
prev != key {prev = key; count = 1}
count <= 3 {print; count++}
'
You can sort the file by first two columns primarily and by the 3rd one numerically secondarily, then read the output and only print the first three lines for each combination of the first two columns.
sort -k1,2 -k3,3rn data.txt \
| while read c1 c2 n ; do
if [[ $c1 == $l1 && $c2 == $l2 ]] ; then
((c++))
else
c=0
fi
if (( c < 3 )) ; then
echo "$c1 $c2 $n"
l1=$c1
l2=$c2
fi
done

Linux command (Calculating the sum)

I have a .txt file with the following content:
a 3
a 4
a 5
a 6
b 1
b 3
b 5
c 9
c 10
I am wondering if there is any command (no awk if possible) that can read the .txt file and give the following output (sorted by the second column):
c 19
a 18
b 9
You can use awk piped to sort:
awk '{sums[$1] += $2} END {for (i in sums) print i, sums[i]}' file | sort -rnk2
c 19
a 18
b 9
sums[$1] += $2 adds the value of $2 to an entry in the array sums that is indexed by the first field ($1).
sort -rnk2 reverse-sorts the output of awk numerically on field 2.
You can use this code:
cat 1.txt | awk '{arr[$1]+=$2}END{for (var in arr) print var," ",arr[var]}' | sort -rnk 2
Explanation:
cat 1.txt - reads the content of the 1.txt file
awk - a language very useful for data manipulation
{arr[$1]+=$2} - for each line in the file, increases the array item keyed by the first field by the value of the second field. The field separator is whitespace by default.
END{for (var in arr) print var," ",arr[var]} - after all lines are processed, prints the array content
sort -rnk 2 - reverse numeric sort on field 2
Non-awk solutions.
perl
perl -lane '
$sum{$F[0]} += $F[1]
} END {
$, = " ";
print $_, $sum{$_} for reverse sort {$sum{$a} <=> $sum{$b}} keys %sum
' file.txt
bash version 4:
declare -A sum
while read key val; do (( sum[$key] += $val )); done < file.txt
for key in "${!sum[@]}"; do echo "$key ${sum[$key]}"; done | sort -rn -k2
non-awk challenge accepted
vars=$(cut -d" " -f1 nums | uniq); paste <(echo "$vars") <(cat <(sed -e 's/ /+=/' nums) <(echo "$vars" | sed 's/$/;/') | bc) | sort -k2,2nr
c 19
a 18
b 9
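If GNU datamash is available, the grouped sum can also be done without awk; a sketch assuming whitespace-separated input (-W), with -s sorting the input by the group key first as datamash requires:
$ datamash -s -W -g 1 sum 2 < 1.txt | sort -k2,2nr
c 19
a 18
b 9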

How to add number of identical line next to the line itself? [duplicate]

I have a file file.txt which looks like this:
a
b
b
c
c
c
I want to know the command which takes file.txt as input and produces the output:
a 1
b 2
c 3
I think uniq is the command you are looking for. The output of uniq -c is a little different from your format, but this can be fixed easily.
$ uniq -c file.txt
1 a
2 b
3 c
If you want to count the occurrences you can use uniq with -c.
If the file is not sorted you have to use sort first
$ sort file.txt | uniq -c
1 a
2 b
3 c
If you really need the line first followed by the count, swap the columns with awk
$ sort file.txt | uniq -c | awk '{ print $2 " " $1}'
a 1
b 2
c 3
You can use this awk:
awk '!seen[$0]++{ print $0, (++c) }' file
a 1
b 2
c 3
seen is an array whose entries are incremented the first time an index is populated, so the condition !seen[$0]++ is true only for the first occurrence of each line. In the action we print the record and an incrementing counter.
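For comparison, !seen[$0]++ on its own is the classic awk de-duplication idiom; the condition is true only on the first occurrence of each line:
$ printf 'a\nb\nb\nc\nc\nc\n' | awk '!seen[$0]++'
a
b
c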
Update: Based on a comment below, if the intent is to get the repeat count in the 2nd column, then use this awk command:
awk 'seen[$0]++{} END{ for (i in seen) print i, seen[i] }' file
a 1
b 2
c 3

Find the maximum values in 2nd column for each distinct values in 1st column using Linux

I have two columns as follows
ifile.dat
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
What I am looking for is the maximum value in the 2nd column for each distinct value (1,2,3,4,5) in the 1st column.
ofile.dat
1 40
2 20
3 34
4 50
5 3
I found that someone has done this using another program, e.g. Get the maximum values of column B per each distinct value of column A.
awk seems a prime candidate for this task. Simply traverse your input file, keeping an array indexed by the first column's values and storing the value of column 2 whenever it is larger than the currently stored value. At the end of the traversal, iterate over the array to print the indices and their corresponding values:
awk '{
if (a[$1] < $2) {
a[$1]=$2
}
} END {
for (i in a) {
print i, a[i]
}
}' ifile.dat
Note that the result will not be sorted numerically on the first column, but that is easily fixable if required.
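For example, a compacted version of the same program piped through sort -n restores numeric order on the first column:
$ awk '{if (a[$1] < $2) a[$1]=$2} END {for (i in a) print i, a[i]}' ifile.dat | sort -n
1 40
2 20
3 34
4 50
5 3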
Another way is using sort.
First sort numerically on column 2 in decreasing order, then keep only the first (largest) entry for each value of column 1. As a one-liner:
sort -n -r -k 2 ifile.dat| sort -u -n -k 1
The easiest command to find the overall maximum value in the second column is something like this:
sort -nrk2 data.txt | awk 'NR==1{print $2}'
When doing min/max calculations, always seed the min/max variable using the first value read:
$ cat tst.awk
!($1 in max) || $2>max[$1] { max[$1] = $2 }
END {
PROCINFO["sorted_in"] = "@ind_num_asc"
for (key in max) {
print key, max[key]
}
}
$ awk -f tst.awk file
1 40
2 20
3 34
4 50
5 3
The above uses GNU awk 4.* for PROCINFO["sorted_in"] to control output order, see http://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Array-Traversal.
Assuming that your 1st field values start from 1, here is one more solution in awk:
awk '{a[$1]=($1 in a && a[$1]>$2)?a[$1]:$2} END{for(j=1;j<=length(a);j++){if(j in a){print j,a[j]}}}' Input_file
And here is one more way to do the same:
sort -k1 Input_file | awk 'prev != $1 && prev{print prev, val;val=prev=""} {val=val>$2?val:$2;prev=$1} END{print prev,val}'

How can I print the upper triangle of a matrix?

Using an awk command, I tried to print the upper triangle of a matrix:
awk '{for (i=1;i<=NF;i++) if (i>=NR) printf $i FS "\n"}' matrix
but the output comes out as a single column, one element per line.
Consider this sample matrix:
$ cat matrix
1 2 3
4 5 6
7 8 9
To print the upper-right triangle:
$ awk '{for (i=1;i<=NF;i++) printf "%s%s",(i>=NR)?$i:" ",FS; print""}' matrix
1 2 3
  5 6
    9
Or:
$ awk '{for (i=1;i<=NF;i++) printf "%2s",(i>=NR)?$i:" "; print""}' matrix
 1 2 3
   5 6
     9
To print the upper-left triangle:
$ awk '{for (i=1;i<=NF+1-NR;i++) printf "%s%s",$i,FS; print""}' matrix
1 2 3
4 5
7
Or:
$ awk '{for (i=1;i<=NF+1-NR;i++) printf "%2s",$i; print""}' matrix
 1 2 3
 4 5
 7
This might work for you (GNU sed):
sed -r ':a;n;H;G;s/\n//;:b;s/^\S+\s*(.*)\n.*/\1/;tb;$!ba' file
The hold space is used as a counter of the lines processed so far; for each current line, that many fields are removed from the front of the line.
N.B. The counter is incremented after the current line is printed, otherwise the first line would lose its first field.
On reflection an alternative/more elegant solution is:
sed -r '1!G;h;:a;s/^\S+\s*(.*)\n.*/\1/;ta' file
And to print the upper-left triangle:
sed -r '1!G;h;:a;s/^([^\n]*)\S+[^\n]*(.*)\n.*/\1\2/;ta' file
$ awk '{for (i=NR;i<=NF;i++) printf "%s%s",$i,(i<NF?FS:RS)}' file
1 2 3
5 6
9
