I would like to compare rows that share a value in the second column and keep the row with the highest values in the following columns, with priority column 3 > 4 > 5. I sorted my dataset on the second column so that equal values are grouped together.
My dataset looks like this:
X1 A 0.38 24.68 2.93
X2 A 0.38 20.22 14.54
X3 A 0.38 20.08 00.48
X3.3 A 0.22 11.55 10.68
C43 B 0.22 11.55 20.08
C4.2 C 0.22 11.55 3.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
C44 D 0.22 1.10 1.24
P1 E 0.42 0.42 0.42
P2 E 0.42 0.42 0.42
P3 E 0.42 0.42 0.42
Here, if a row shares its second-column value with another row, I compare their values in the third column and pick the row with the highest value in the third column.
If the rows have the same second and third columns, then I go to the fourth column, compare their values there, and take the row with the highest value.
If the rows sharing the second column still share the values in the third and fourth columns, then I pick the row with the highest value in the fifth column.
If the second, third, fourth, and fifth columns are all the same (complete duplicates), then I print them all, but add 'duplicate' next to their fifth column.
If a row does not share its second-column value with any other row, then there is no comparison and I keep that row.
Therefore, my expected output will be:
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42duplicate
P2 E 0.42 0.42 0.42duplicate
P3 E 0.42 0.42 0.42duplicate
What I have tried so far fails, because I can only compare on the second column, not condition on multiple columns, and I cannot keep the complete duplicates:
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++'
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42
I would appreciate learning how to fix it.
I'm afraid the code below is not sophisticated, but how about this:
awk -v OFS="\t" '$1=$1' "data.txt" | sort -k2,2 -k3nr -k4nr -k5nr > "tmp.txt"
awk -v OFS="\t" '
NR==FNR {                                # 1st pass over tmp.txt
    vals = $3","$4","$5
    if (max[$2] == "") max[$2] = vals    # first row per key is the max (input is sorted)
    else if (max[$2] == vals) dupe[$2] = 1
    next
} {                                      # 2nd pass over the same file
    vals = $3","$4","$5
    if (dupe[$2]) $6 = "duplicate"
    if (max[$2] == vals) print
}' "tmp.txt" "tmp.txt"
rm -f "tmp.txt"
It saves the sorted result in a temporary file, tmp.txt.
The second awk script processes the temporary file in two passes.
In the first pass, it extracts the "max value" for each second-column key.
It also detects duplications and sets the dupe flag when one is found.
In the second pass, it assigns the string "duplicate" to field $6
if the line's key carries the dupe flag.
Then it prints only the line(s) which hold the max value for each second-column key.
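For what it's worth, the two passes can also be collapsed into one by buffering the rows that tie on the max for each key. A minimal single-pass sketch, assuming the input has already been sorted into tmp.txt as above:
awk '
function emit(   i) {                      # print buffered max rows, tagging complete duplicates
    for (i = 1; i <= n; i++)
        print buf[i] (n > 1 ? "duplicate" : "")
}
{ vals = $3 SUBSEP $4 SUBSEP $5 }
$2 != prev { emit(); n = 0; prev = $2; best = vals }   # new key: the first row holds the max
vals == best { buf[++n] = $0 }                         # buffer rows tied on the max
END { emit() }' "tmp.txt"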
This may not be the most elegant solution, but it works:
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++' | cut -f2- > /tmp/fgrep.$$
cat data.txt | fgrep -f /tmp/fgrep.$$ | awk '{
    rec[NR] = $0                                   # keep every matching line
    idx = sprintf("%s %s %s %s",$2,$3,$4,$5)
    irec[NR] = idx
    dup[idx]++                                     # count complete duplicates
}
END{
    for (i = 1; i <= NR; i++) {                    # input order ("for (i in rec)" order is unspecified)
        if (dup[irec[i]] > 1) {
            print rec[i] "duplicate"
        } else {
            print rec[i]
        }
    }
}'
rm /tmp/fgrep.$$
I have a text file:
ifile.txt
x y z t value
1 1 5 01hr01Jan2018 3
1 1 5 02hr01Jan2018 3.1
1 1 5 03hr01Jan2018 3.2
1 3.4 3 01hr01Jan2018 4.1
1 3.4 3 02hr01Jan2018 6.1
1 3.4 3 03hr01Jan2018 1.1
1 4.2 6 01hr01Jan2018 6.33
1 4.2 6 02hr01Jan2018 8.33
1 4.2 6 03hr01Jan2018 5.33
3.4 1 2 01hr01Jan2018 3.5
3.4 1 2 02hr01Jan2018 5.65
3.4 1 2 03hr01Jan2018 3.66
3.4 3.4 4 01hr01Jan2018 6.32
3.4 3.4 4 02hr01Jan2018 9.32
3.4 3.4 4 03hr01Jan2018 12.32
3.4 4.2 8.1 01hr01Jan2018 7.43
3.4 4.2 8.1 02hr01Jan2018 7.93
3.4 4.2 8.1 03hr01Jan2018 5.43
4.2 1 3.4 01hr01Jan2018 6.12
4.2 1 3.4 02hr01Jan2018 7.15
4.2 1 3.4 03hr01Jan2018 9.12
4.2 3.4 5.5 01hr01Jan2018 2.2
4.2 3.4 5.5 02hr01Jan2018 3.42
4.2 3.4 5.5 03hr01Jan2018 3.21
4.2 4.2 6.2 01hr01Jan2018 1.3
4.2 4.2 6.2 02hr01Jan2018 3.4
4.2 4.2 6.2 03hr01Jan2018 1
Explanation: each coordinate (x,y) has a z-value and three time values. The spaces are not tabs; they are sequences of spaces.
I would like to format the t-column as a row and then convert the result to a CSV file. My expected output is:
ofile.txt
x,y,z,01hr01Jan2018,02hr01Jan2018,03hr01Jan2018
1,1,5,3,3.1,3.2
1,3.4,3,4.1,6.1,1.1
1,4.2,6,6.33,8.33,5.33
3.4,1,2,3.5,5.65,3.66
3.4,3.4,4,6.32,9.32,12.32
3.4,4.2,8.1,7.43,7.93,5.43
4.2,1,3.4,6.12,7.15,9.12
4.2,3.4,5.5,2.2,3.42,3.21
4.2,4.2,6.2,1.3,3.4,1
I am trying it in the following way, but am still not getting the desired output: my script prints some extra commas (,) at the end.
My algorithm and script are as follows:
#Step1:- Split into two files: one with x,y,z (0001.txt) and
# another with t,value (0002.txt).
awk '{n=3; for (i=1;i<=n;i++) printf "%s ", $i; print "";}' ifile.txt > 0001.txt
awk '{n=5; for (i=4;i<=n;i++) printf "%s ", $i; print "";}' ifile.txt > 0002.txt
#Step2:- In 0001.txt: Delete the repeated rows.
awk '!seen[$1,$2,$3]++' 0001.txt > 00011.txt
#Step3:- In 0002.txt: Delete the first row. For every 3 rows in the t-column,
# write the value-column as a row. Add the t-row at the top.
# This is very manual; I am wondering about a better command.
grep -E "^[0-9].*" 0002.txt > 0003.txt
awk -v n=3 '{ row = row $2 " "; if (NR % n == 0) { print row; row = "" } }' 0003.txt > 0004.txt
(echo "01hr01Jan2018,02hr01Jan2018,03hr01Jan2018";cat 0004.txt) > 00022.txt
#Step4:- Paste output of two and convert to csv.
paste 00011.txt 00022.txt > 0005.txt
cat 0005.txt | tr -s '[:blank:]' ',' > ofile.txt
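For reference, the extra commas come from the trailing space that printf "%s ", $i (and the row = row $2 " " concatenation) leaves at the end of every line: after paste, tr -s turns that trailing blank into a trailing comma. A minimal fix, keeping the rest of the pipeline as-is, is to trim trailing blanks before the conversion:
sed 's/[[:blank:]]*$//' 0005.txt | tr -s '[:blank:]' ',' > ofile.txt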
You may use this awk:
awk -v OFS=, '{k=$1 OFS $2 OFS $3}
!($4 in hdr){hn[++h]=$4; hdr[$4]}
k in row{row[k]=row[k] OFS $5; next}
{rn[++n]=k; row[k]=$5}
END {
printf "%s", rn[1]
for(i=2; i<=h; i++)   # hn[1] is the literal "t" from the header line; skip it
printf "%s", OFS hn[i]
print ""
for (i=2; i<=n; i++)
print rn[i], row[rn[i]]
}' file
x,y,z,01hr01Jan2018,02hr01Jan2018,03hr01Jan2018
1,1,5,3,3.1,3.2
1,3.4,3,4.1,6.1,1.1
1,4.2,6,6.33,8.33,5.33
3.4,1,2,3.5,5.65,3.66
3.4,3.4,4,6.32,9.32,12.32
3.4,4.2,8.1,7.43,7.93,5.43
4.2,1,3.4,6.12,7.15,9.12
4.2,3.4,5.5,2.2,3.42,3.21
4.2,4.2,6.2,1.3,3.4,1
A single awk program can generate your desired output, using GNU awk:
gawk '
BEGIN {SUBSEP = OFS = ","}
NR==1 {next}
{ groups[$4]; value[$1,$2,$3][$4] = $5 }
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
printf "x,y,z"
for (g in groups) printf ",%s", g
printf "\n"
for (a in value) {
printf "%s", a
for (g in groups) printf "%s%s", OFS, 0+value[a][g]
printf "\n"
}
}
' ifile.txt
Another similar awk, though without the right header:
$ awk -v OFS=, '{k=$1 OFS $2 OFS $3}
p!=k {if(p) print line; p=k; line=k}
{line=line OFS $NF}
END {print line}' file
x,y,z,value
1,1,5,3,3.1,3.2
1,3.4,3,4.1,6.1,1.1
1,4.2,6,6.33,8.33,5.33
3.4,1,2,3.5,5.65,3.66
3.4,3.4,4,6.32,9.32,12.32
3.4,4.2,8.1,7.43,7.93,5.43
4.2,1,3.4,6.12,7.15,9.12
4.2,3.4,5.5,2.2,3.42,3.21
4.2,4.2,6.2,1.3,3.4,1
I have a gridded dataset with 250 rows x 300 columns in matrix form (a small sample is shown here):
ifile.txt
2 3 4 1 2 3
3 4 5 2 4 6
2 4 0 5 0 7
0 0 5 6 3 8
I would like to insert the latitude values in the first column and the longitude values at the top, which looks like:
ofile.txt
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
The increment is 0.33.
I can do it manually for a small matrix, but I can't see how to produce my desired format at this scale. I was writing a script along the following lines, but it is completely useless:
echo 20 > latitude.txt
for i in `seq 1 250`;do
i1=$(( i + 0.33 )) #bash can't recognize fractions
echo $i1 >> latitude.txt
done
echo 100 > longitude.txt
for j in `seq 1 300`;do
j1=$(( j + 0.33 ))
echo $j1 >> longitude.txt
done
paste longitude.txt ifile.txt > dummy_file.txt
cat latitude.txt dummy_file.txt > ofile.txt
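For the record, bash integer arithmetic indeed cannot handle fractions, but awk can generate both fractional sequences directly (GNU seq with a fractional increment works too). A minimal sketch of the same file-based idea, keeping the script's file names, matching the sample's orientation (100.00... down the first column, 20.00... across the top), and ignoring the corner padding:
# first-column values: 250 of them, starting at 100.00, step 0.33
awk 'BEGIN{for(i=0;i<250;i++) printf "%.2f\n", 100+i*0.33}' > longitude.txt
# top-row values: 300 of them on one line, starting at 20.00, step 0.33
awk 'BEGIN{for(i=0;i<300;i++) printf "%s%.2f", (i?" ":""), 20+i*0.33; print ""}' > latitude.txt
paste -d' ' longitude.txt ifile.txt | cat latitude.txt - > ofile.txt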
$ cat tst.awk
BEGIN {
lat = 100
lon = 20
latWid = lonWid = 6
latDel = lonDel = 0.33
latFmt = lonFmt = "%*.2f"
}
NR==1 {
printf "%*s", latWid, ""
for (i=1; i<=NF; i++) {
printf lonFmt, lonWid, lon
lon += lonDel
}
print ""
}
{
printf latFmt, latWid, lat
lat += latDel
for (i=1; i<=NF; i++) {
printf "%*s", lonWid, $i
}
print ""
}
$ awk -f tst.awk file
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
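For what it's worth, the %*s and %*.2f conversions above take the field width from the next printf argument, which is how latWid/lonWid set once in BEGIN control the alignment; dynamic width is an extension supported by gawk and mawk rather than strict POSIX awk. A quick illustration:
$ awk 'BEGIN{ printf "[%*.2f]\n", 8, 3.5 }'
[    3.50]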
The following awk may also help here.
awk -v col=100 -v row=20 'FNR==1{printf OFS;for(i=1;i<=NF;i++){printf row OFS;row=row+.33;};print ""} {$1=$1;print col OFS $0;col+=.33}' OFS="\t" Input_file
Adding a non-one-liner form of the above solution too:
awk -v col=100 -v row=20 '
FNR==1{
printf OFS;
for(i=1;i<=NF;i++){
printf row OFS;
row=row+.33;
};
print ""
}
{
$1=$1;
print col OFS $0;
col+=.33
}
' OFS="\t" Input_file
Awk solution:
awk 'NR == 1{
long = 20.00; lat = 100.00; printf "%12s%.2f", "", long;
for (i=1; i<NF; i++) { long += 0.33; printf "\t%.2f", long } print "" }
NR > 1{ lat += 0.33 }
{
printf "%.2f%6s", lat, "";
for (i=1; i<=NF; i++) printf "\t%d", $i; print ""
}' file
With perl:
$ perl -lane 'print join "\t", "", map {20.00+$_*0.33} 0..$#F if $.==1;
print join "\t", 100+(0.33*$i++), @F' ip.txt
20 20.33 20.66 20.99 21.32 21.65
100 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
-a auto-splits input on whitespace; the result is saved in the @F array
See https://perldoc.perl.org/perlrun.html#Command-Switches for details on command line options
if $.==1 applies the print only to the first line of input
map {20.00+$_*0.33} 0..$#F iterates over the indices of the @F array; each iteration produces a value from the expression inside {}, where $_ is 0, 1, and so on up to the last index of @F
print join "\t", "", map... uses a tab separator to print an empty element followed by the results of the map
For all lines, print the contents of @F prefixed with the result of 100+(0.33*$i++), where $i is initially 0 in numeric context; again, tab is used as the separator when joining these values
Use sprintf if you need formatting; also, $, can be initialized instead of using join:
perl -lane 'BEGIN{$,="\t"; $st=0.33}
print "", map { sprintf "%.2f", 20+$_*$st} 0..$#F if $.==1;
print sprintf("%.2f", 100+($st*$i++)), @F' ip.txt
Hi, I have a file with six columns and I wish to know the average of three of them (columns 2, 3, 4) and the sum of the last two (columns 5 and 6) for each unique variable in column one.
A1234 0.526 0.123 0.456 0.986 1.123
A1234 0.423 0.256 0.397 0.876 0.999
A1234 0.645 0.321 0.402 0.903 1.101
A1234 0.555 0.155 0.406 0.888 1.009
B5678 0.111 0.345 0.285 0.888 0.789
B5678 0.221 0.215 0.305 0.768 0.987
B5678 0.336 0.289 0.320 0.789 0.921
I have come across code that will get the average of column 2 grouped by column one, but is there any way I can expand this across columns? Thanks.
awk '{a[$1]+=$2; c[$1]++} END{for (i in a) printf "%s%s%.2f\n", i, OFS, a[i]/c[i]}'
I would like the output to be in the following format; each variable in column one can also have a different number of rows:
A1234 0.53725 0.21375 0.41525 3.653 4.232
B5678 0.22233 0.283 0.30333 2.445 2.697
awk '{a[$1]+=$2;b[$1]+=$3;c[$1]+=$4;d[$1]+=$5;e[$1]+=$6;f[$1]++} END{for (i in a) print i,a[i]/f[i],b[i]/f[i],c[i]/f[i],d[i],e[i]}' file
Output:
B5678 0.222667 0.283 0.303333 2.445 2.697
A1234 0.53725 0.21375 0.41525 3.653 4.232
Try the following and let me know if it helps you.
awk '{A[$1]=A[$1]?A[$1]+$5+$6:$5+$6;C[$1]=C[$1]?C[$1]+$2+$3+$4:$2+$3+$4;B[$1]++} END{for(i in A){print "Avg. for " i" =\t",C[i]/(B[i]*3) RS "Sum for " i" =\t",A[i]}}' Input_file
EDIT: Adding a non-one-liner form of the solution too.
awk '{
A[$1]=A[$1]?A[$1]+$5+$6:$5+$6;
C[$1]=C[$1]?C[$1]+$2+$3+$4:$2+$3+$4;
B[$1]++
}
END{
for(i in A){
print "Avg. for " i" =\t",C[i]/(B[i]*3) RS "Sum for " i" =\t",A[i]
}
}
' Input_file
awk solution:
awk '{ a[$1]++; avg[$1]+=$2+$3+$4; sum[$1]+=$5+$6 }
END{ for(i in a) printf "%s%.2f%s%.2f\n",i OFS,avg[i]/(a[i]*3),OFS,sum[i] }' file
The output (the 2nd column is the average value, the 3rd column the sum):
B5678 0.27 5.14
A1234 0.39 7.88
To calculate the average of columns 2, 3, and 4 over the whole file:
awk '{ sum += $2 + $3 + $4 } END { print sum / (NR * 3) }'
To calculate the sum of columns 5 and 6, grouped by column 1:
awk '{ arr[$1] += $5 + $6 } END { for (a in arr) if (a) print a, arr[a] }'
To calculate the sum of columns 5 and 6 of the last row:
tail -n 1 file | awk '{sum += $5 + $6} END {print sum}'
I have output like this:
3.69
0.25
0.80
1.78
3.04
1.99
0.71
0.50
0.94
I want to find the biggest number and the smallest number in the above output.
I need output like:
smallest is 0.25 and biggest is 3.69
Just sort your input numerically first and print the first and last values. One method:
$ sort -n file | awk 'NR==1{min=$1}END{print "Smallest",min,"Biggest",$0}'
Smallest 0.25 Biggest 3.69
Hope this helps.
OUTPUT="3.69 0.25 0.80 1.78 3.04 1.99 0.71 0.50 0.94"
SORTED=$(echo $OUTPUT | tr ' ' '\n' | sort -n)
SMALLEST=$(echo "$SORTED" | head -n 1)
BIGGEST=$(echo "$SORTED" | tail -n 1)
echo "Smallest is $SMALLEST"
echo "Biggest is $BIGGEST"
Added an awk one-liner at the OP's request.
I'm not good at awk, but this works anyway. :)
echo "3.69 0.25 0.80 1.78 3.04 1.99 0.71 0.50 0.94" | awk '{
for (i=1; i<=NF; i++) {
if (length(s) == 0) s = $i;
if (length(b) == 0) b = $i;
if ($i < s) s = $i;
if (b < $i) b = $i;
}
print "Smallest is", s;
print "Biggest is", b;
}'
You want an awk solution?
echo "3.69 0.25 0.80 1.78 3.04 1.99 0.71 0.50 0.94" | \
awk -v RS=' ' '/.+/ { biggest = ((biggest == "") || ($1 > biggest)) ? $1 : biggest;
smallest = ((smallest == "") || ($1 < smallest)) ? $1:smallest}
END { print biggest, smallest}'
It produces the following output:
3.69 0.25
You can also use this method:
sort -n file | echo -e `sed -nr '1{s/(.*)/smallest is :\1/gp};${s/(.*)/biggest is :\1/gp}'`
TXR solution:
$ txr -e '(let ((nums [mapcar tofloat (gun (get-line))]))
(if nums
(pprinl `smallest is #(find-min nums) and biggest is #(find-max nums)`)
(pprinl "empty input")))'
0.1
-1.0
3.5
2.4
smallest is -1.0 and biggest is 3.5