Find the maximum value in the 2nd column for each distinct value in the 1st column using Linux

I have two columns as follows
ifile.dat
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
What I'm looking for is the maximum value in the 2nd column for each of 1, 2, 3, 4, 5 in the 1st column.
ofile.dat
1 40
2 20
3 34
4 50
5 3
I found that someone has done this using another program, e.g. Get the maximum values of column B per each distinct value of column A.

awk seems a prime candidate for this task. Simply traverse your input file, keeping an array indexed by the first-column value that stores the column 2 value whenever it is larger than the currently stored one. At the end of the traversal, iterate over the array to print the indices and their corresponding values.
awk '{
    if (a[$1] < $2) {
        a[$1] = $2
    }
} END {
    for (i in a) {
        print i, a[i]
    }
}' ifile.dat
Note that the result will not be sorted numerically on the first column, but that is easily fixed if required.
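For example, piping the output through a numeric sort on the first column restores that order (a small sketch of the same command, assuming a standard sort is available):
awk '{ if (a[$1] < $2) a[$1]=$2 } END { for (i in a) print i, a[i] }' ifile.dat | sort -n -k1,1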

Another way is to use sort: first sort numerically on column 2 in decreasing order, then remove non-unique values of column 1. As a one-liner:
sort -n -r -k 2 ifile.dat | sort -u -n -k 1

If you only need the single overall maximum of the second column (rather than one maximum per group), the easiest command is something like this:
sort -nrk2 data.txt | awk 'NR==1{print $2}'

When doing min/max calculations, always seed the min/max variable using the first value read:
$ cat tst.awk
!($1 in max) || $2>max[$1] { max[$1] = $2 }
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (key in max) {
        print key, max[key]
    }
}
$ awk -f tst.awk file
1 40
2 20
3 34
4 50
5 3
The above uses GNU awk 4.* for PROCINFO["sorted_in"] to control output order, see http://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Array-Traversal.
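If GNU awk is not available, the PROCINFO assignment should have no special effect in other awks, and the output can instead be sorted externally (a sketch, not part of the original answer):
awk -f tst.awk file | sort -n -k1,1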

Assuming your 1st field starts from 1, here is one more awk solution.
awk '{a[$1]=$2>a[$1]?$2:(a[$1]?a[$1]:$2);} END{for(j=1;j<=length(a);j++){if(a[j]){print j,a[j]}}}' Input_file
Adding one more way to do the same here.
sort -k1 Input_file | awk 'prev != $1 && prev{print prev, val;val=prev=""} {val=val>$2?val:$2;prev=$1} END{print prev,val}'

Related

Translate specific columns to rows

I have this file (space-delimited):
bc1 no 12
bc1 no 15
bc1 yes 4
bc2 no 8
bc3 yes 14
bc3 yes 12
bc4 no 2
I would like to get this output :
bc1 3 no;no;yes 31
bc2 1 no 8
bc3 2 yes;yes 26
bc4 1 no 2
1st column: each distinct value of the first column in the input file
2nd: number of occurrences of that value in the input file
3rd: the corresponding 2nd-column values joined into one row with a ";" delimiter
4th: sum of the last column
I can do what I want with the "no/yes" column:
awk -F' ' 'NF>2{a[$1] = a[$1]";"$2}END{for(i in a){print i" "a[i]}}' test.txt | sort -k1,1n
With your shown samples, please try the following awk code. Since $1 (the first column) is already sorted, there is no need to sort it, hence this solution.
awk '
prev!=$1 && prev{
    print prev OFS count,value,sum
    count=sum=0
    prev=value=""
}
{
    prev=$1
    value=(value?value ";":"") $2
    count++
    sum+=$NF
}
END{
    if(prev){
        print prev OFS count,value,sum
    }
}
' Input_file
Here's a solution with newer versions of datamash (thanks glenn jackman for the tips about count and --collapse-delimiter):
datamash -t' ' -g1 --collapse-delimiter=';' count 2 collapse 2 sum 3 <ip.txt
With older versions (mine is 1.4) and awk:
$ datamash -t' ' -g1 count 2 collapse 2 sum 3 <ip.txt
bc1 3 no,no,yes 31
bc2 1 no 8
bc3 2 yes,yes 26
bc4 1 no 2
$ <ip.txt datamash -t' ' -g1 count 2 collapse 2 sum 3 | awk '{gsub(/,/, ";", $3)} 1'
bc1 3 no;no;yes 31
bc2 1 no 8
bc3 2 yes;yes 26
bc4 1 no 2
-t sets space as the field separator. Column 2 is collapsed using column 1 as the key. sum 3 gives the total of the numbers in column 3. count 2 counts the number of collapsed rows.
With the older version, awk is then used to change the "," in the third column to ";".
One alternative awk idea:
awk '
function print_row() {
    if (series)
        print key,c,series,sum
    c=sum=0
    series=sep=""
}
{
    if ($1 != key)      # if 1st column has changed then print previous data set
        print_row()
    key=$1
    c++
    series=series sep $2
    sep=";"
    sum+=$3
}
END { print_row() }     # flush last data set to stdout
' input
This generates:
bc1 3 no;no;yes 31
bc2 1 no 8
bc3 2 yes;yes 26
bc4 1 no 2

Bash column sum over a table of variable length

I'm trying to get the column sums (except for the first one) of a tab-delimited file containing numbers.
To find out the number of columns and store it in a variable I use:
cols=$(awk '{print NF}' file.txt | sort -nu | tail -n 1)
Next I want to calculate the sum of all numbers in that column and store it in a variable again, inside a for loop:
for c in 2:$col
do
num=$(cat file.txt | awk '{sum+$2 ; print $0} END{print sum}' | tail -n 1)
done
This
num=$(cat file.txt | awk '{sum+$($c) ; print $0} END{print sum}' | tail -n 1)
works fine on its own with a fixed number and without variable input, but I cannot get it to accept the for-loop variable.
Thanks for the support
P.S. It would also be fine if I could sum all columns (except the first one) at once, without the loop trouble.
Assuming you want the sums of the individual columns,
$ cat file
1 2 3 4
5 6 7 8
9 10 11 12
$ awk '
{for (i=2; i<=NF; i++) sum[i] += $i}
END {for (i=2; i<=NF; i++) printf "%d%s", sum[i], OFS; print ""}
' file
18 21 24
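As for the loop trouble in the original question: a shell variable is not expanded inside a single-quoted awk program, but it can be passed in with awk's -v option, and the column range can be generated with seq. A rough sketch along those lines (variable and file names taken from the question):
cols=$(awk '{ print NF }' file.txt | sort -nu | tail -n 1)
for c in $(seq 2 "$cols"); do
    num=$(awk -v col="$c" '{ sum += $col } END { print sum }' file.txt)
    echo "column $c: $num"
done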
In case you're not bound to awk, there's a nice tool for "command-line statistical operations" on textual files called GNU datamash.
With datamash, summing (probably the simplest operation of all) a 2nd column is as easy as:
$ datamash sum 2 < table
9
Assuming the table file holds tab-separated data like:
$ cat table
1 2 3 4
2 3 4 5
3 4 5 6
To sum all columns from 2 to n use column ranges (available in datamash 1.2):
$ n=4
$ datamash sum 2-$n < table
9 12 15
To include headers in the output, see the --header-out option.
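For instance (a small sketch, assuming datamash 1.2 or later), this prints generated column headers above the per-column sums:
datamash --header-out sum 2-4 < table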

If first two columns are equal, select top 3 based on descending order of 3rd column

I want to select the top 3 results for every group of lines that share the same first two columns.
For example, the data will look like this:
cat data.txt
A A 10
A A 1
A A 2
A A 5
A A 8
A B 1
A B 2
A C 6
A C 5
A C 10
A C 1
B A 1
B A 1
B A 2
B A 8
And for the result I want
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Note that some of the "groups" do not contain 3 rows.
I have tried
sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt
It seems to be working, but since I am learning to code better, I was wondering if there is a better way to go about this. Plus, my code generates many files that I then have to delete.
You can do:
$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Explanation:
There are two key items to understand the awk program; associative arrays and fields.
If you reference an empty awk array element, it is an empty container -- ready for anything you put into it. You can use that as a counter.
You state "If first two columns are equal...".
The sort puts the file in the desired order. The expression a[$1,$2] uses the values of the first two fields as a unique key into an associative array.
You then state "...select top 3 based on descending order of 3rd column...".
Once again, the sort puts the file into the desired order, and the expression a[$1,$2]++ counts the entries for each key. Now just count up to three.
awk is organized into blocks of condition {action}. The condition a[$1,$2]++<3 is true until more than 3 lines with the same key have been seen.
A wordier version of the program would be:
awk 'a[$1,$2]++<3 {print $0}'
But the default action if the condition is true is to print $0 so it is not needed.
If you are processing text in Unix, you should get to know awk. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.
A great place to start is the online book Effective AWK Programming by Arnold D. Robbins.
@dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:
sort -k1,2 -k3,3nr file |
awk '
{key = $1 FS $2}
prev != key {prev = key; count = 1}
count <= 3 {print; count++}
'
You can sort the file primarily by the first two columns and secondarily by the 3rd column in reverse numeric order, then read the output and only print the first three lines for each combination of the first two columns.
sort -k1,2 -k3,3rn data.txt \
| while read c1 c2 n ; do
    if [[ $c1 == $l1 && $c2 == $l2 ]] ; then
        ((c++))
    else
        c=0
    fi
    if (( c < 3 )) ; then
        echo $c1 $c2 $n
        l1=$c1
        l2=$c2
    fi
done

how can I make awk match up lines in file 1 with the lines in file 2 based on some number ranges in file 2

I have the following two files:
file 1:
22
2
42
32
file 2:
1 10 valuea
11 20 valueb
21 30 valuec
31 40 valued
41 50 valuee
51 60 valuef
How can I make awk grab each value from file 1, match it up with file 2 based on whether it falls between the number range in columns 1 and 2 of file 2, and then print out column 3 from the matched column in file 2? The output would resemble the following:
valuec
valuea
valuee
valued
I tried using the following AWK command (based on what I found in this post: How to check value of a column lies between values of two columns in other file and print corresponding value from column in Unix?), but it does not seem to be working correctly.
#!/bin/bash
awk 'FNR == NR { val[$1] = $1 }
FNR != NR { if (val[$1] >= $1 && val[$1] <= $2)
print $3
}' file1 file2
Also, I did not include it here for obvious reasons, but for the actual application of this script, file 1 would contain around 7,000 entries while file 2 would contain around 68,000 entries.
An alternative awk script:
$ awk 'FNR == NR {a[$1]=$2; v[$1]=$3; next}
{for(k in a)
if(k+0<=$1 && $1+0<=a[k]) print v[k]}' file2 file1
valuec
valuea
valuee
valued
Note that file2 is the first file. This will cover multiple range matches as well. The +0 is to force numerical comparison.

Select rows in one file based on specific values in the second file (Linux)

I have two files:
One is "total.txt". It has two columns: the first column is natural numbers (indicator) ranging from 1 to 20, the second column contains random numbers.
1 321
1 423
1 2342
1 7542
2 789
2 809
2 5332
2 6762
2 8976
3 42
3 545
... ...
20 432
20 758
The other one is "index.txt". It has three columns: (1: indicator, 2: low value, 3: high value).
1 400 5000
2 600 800
11 300 4000
I want to output the rows of "total.txt" whose first column matches the first column of "index.txt", and whose second column is at the same time larger than (>) the second column of "index.txt" and smaller than (<) the third column of "index.txt".
The expected result is as follows:
1 423
1 2342
2 809
2 5332
2 6762
11 ...
11 ...
I have tried this:
awk '$1==(awk 'print($1)' index.txt) && $2 > (awk 'print($2)' index.txt) && $1 < (awk 'print($2)' index.txt)' total.txt > result.txt
But it failed!
Can you help me with this? Thank you!
You need to read both files in the same awk script. When you read index.txt, store the other columns in an array.
awk 'FNR == NR { low[$1] = $2; high[$1] = $3; next }
$2 > low[$1] && $2 < high[$1] { print }' index.txt total.txt
FNR == NR is the common awk idiom to detect when you're processing the first file.
Use join like Barmar said:
# To join on the first columns
join -11 -21 total.txt index.txt
And if the files aren't sorted in lexical order by the first column then:
join -11 -21 <(sort -k1,1 total.txt) <(sort -k1,1 index.txt)
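Note that join on its own only pairs rows on the first column; the range conditions from the question still have to be applied afterwards. One way to finish (a sketch, not part of the original answer) is to filter the joined lines, where field 2 is the value from total.txt and fields 3 and 4 are the low and high bounds from index.txt:
join -11 -21 <(sort -k1,1 total.txt) <(sort -k1,1 index.txt) | awk '$2 > $3 && $2 < $4 { print $1, $2 }'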
