Average column if value in other column matches and print as additional column - linux

I have a file like this:
Score 1 24 HG 1
Score 2 26 HG 2
Score 5 56 RP 0.5
Score 7 82 RP 1
Score 12 97 GM 5
Score 32 104 LS 3
I would like to average column 5 for rows where column 4 is identical, and print the average as column 6, so that it looks like this:
Score 1 24 HG 1 1.5
Score 2 26 HG 2 1.5
Score 5 56 RP 0.5 0.75
Score 7 82 RP 1 0.75
Score 12 97 GM 5 5
Score 32 104 LS 3 3
I have tried a couple of solutions I found on here, e.g.:
awk '{ total[$4] += $5; ++n[$4] } END { for(i in total) print i, total[i] / n[i] }'
but they all end up with this:
HG 1.5
RP 0.75
GM 5
LS 3
Which is undesirable as I lose a lot of information.

You can iterate through your table twice: calculate the averages (as you already do) on the first iteration, and then print them out on the second iteration:
awk 'NR==FNR { total[$4] += $5; ++n[$4] } NR>FNR { print $0, total[$4] / n[$4] }' file file
Notice that file appears twice at the end. While going through the "first" file, NR==FNR holds, and we sum the appropriate values, keeping them in memory (arrays total and n). During the "second" traversal, NR>FNR holds, and we print each original line plus its average:
Score 1 24 HG 1 1.5
Score 2 26 HG 2 1.5
Score 5 56 RP 0.5 0.75
Score 7 82 RP 1 0.75
Score 12 97 GM 5 5
Score 32 104 LS 3 3
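One caveat: this approach needs a real, re-readable file, because awk scans it twice. If the data arrives on a pipe instead, a minimal workaround (a sketch, assuming bash and mktemp are available) is to buffer stdin in a temporary file first:
tmp=$(mktemp)
cat > "$tmp"      # capture the piped data
awk 'NR==FNR { total[$4] += $5; ++n[$4] } NR>FNR { print $0, total[$4] / n[$4] }' "$tmp" "$tmp"
rm -f "$tmp"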

You can make a single pass through the file, but then you have to hold the entire file in memory, trading disk I/O for memory:
awk '
BEGIN {FS = OFS = "\t"}   # assumes tab-separated data; drop this line for the space-separated sample above
{total[$4] += $5; n[$4]++; line[NR] = $0; key[NR] = $4}   # accumulate sums/counts and remember every line and its key
END {for (i=1; i<=NR; i++) print line[i], total[key[i]] / n[key[i]]}
' file

Related

Sum each row in a CSV file and sort it by specific value bash

I have a question about the comma-separated CSV set below. I want to run a bash script that sums the values of columns 7, 8, and 9 for each row and, for each specific city, shows the row with the max value.
Original dataset:
Row,name,city,age,height,weight,good rates,bad rates,medium rates
1,john,New York,25,186,98,10,5,11
2,mike,New York,21,175,87,19,6,21
3,Sandy,Boston,38,185,88,0,5,6
4,Sam,Chicago,34,167,76,7,0,2
5,Andy,Boston,31,177,85,19,0,1
6,Karl,New York,33,189,98,9,2,1
7,Steve,Chicago,45,176,88,10,3,0
The desired output would be:
Row,name,city,age,height,weight,good rates,bad rates,medium rates,max rates by city
2,mike,New York,21,175,87,19,6,21,46
5,Andy,Boston,31,177,85,19,0,1,20
7,Steve,Chicago,45,176,88,10,3,0,13
I'm trying with this, but it gives me only the single highest sum overall (46); I need the maximum per city, printed with its whole row. Any ideas how to continue?
awk 'BEGIN {FS=OFS=","}{sum = 0; for (i=7; i<=9;i++) sum += $i} NR ==1 || sum >max {max = sum}
You may use this awk:
awk '
BEGIN {FS=OFS=","}
NR==1 {
print $0, "max rates by city"
next
}
{
s = $7+$8+$9
if (s > max[$3]) {
max[$3] = s
rec[$3] = $0
}
}
END {
for (i in max)
print rec[i], max[i]
}' file
Row,name,city,age,height,weight,good rates,bad rates,medium rates,max rates by city
7,Steve,Chicago,45,176,88,10,3,0,13
2,mike,New York,21,175,87,19,6,21,46
5,Andy,Boston,31,177,85,19,0,1,20
or to get tabular output:
awk 'BEGIN {FS=OFS=","} NR==1{print $0, "max rates by city"; next} {s=$7+$8+$9; if (s > max[$3]) {max[$3] = s; rec[$3] = $0}} END {for (i in max) print rec[i], max[i]}' file | column -s, -t
Row name city age height weight good rates bad rates medium rates max rates by city
7 Steve Chicago 45 176 88 10 3 0 13
2 mike New York 21 175 87 19 6 21 46
5 Andy Boston 31 177 85 19 0 1 20
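Note that for (i in max) visits keys in an unspecified order, which is why the rows above come out shuffled relative to the input. A minimal sketch of the same logic (the order/ncities bookkeeping is an addition of mine) that prints cities in the order they first appear:
awk '
BEGIN {FS = OFS = ","}
NR==1 {print $0, "max rates by city"; next}
{
  s = $7+$8+$9
  if (!($3 in max)) order[++ncities] = $3   # remember first-seen city order
  if (s > max[$3]) {max[$3] = s; rec[$3] = $0}
}
END {
  for (j = 1; j <= ncities; j++)
    print rec[order[j]], max[order[j]]
}' file
With the sample data this emits the New York, Boston, Chicago rows in input order, matching the output shown in the question.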

Linux: filter text rows by sum of specific columns

From raw sequencing data I created a count file (.txt) with the counts of unique sequences per sample.
The data looks like this:
sequence seqLength S1 S2 S3 S4 S5 S6 S7 S8
AAAAA... 46 0 1 1 8 1 0 1 5
AAAAA... 46 50 1 5 0 2 0 4 0
...
TTTTT... 71 0 0 5 7 5 47 2 2
TTTTT... 81 5 4 1 0 7 0 1 1
I would like to filter the sequences by row sum, so that rows whose total across all samples (sum of S1 to S8) is lower than, for example, 100 are removed.
This can probably be done with awk, but I have no experience with this text-processing utility.
Can anyone help?
Give this a try:
awk 'NR>1 {sum=0; for (i=3; i<=NF; i++) { sum+= $i } if (sum > 100) print}' file.txt
It will skip line 1 (NR>1).
Then it sums the fields of each row, starting from field 3 (S1 to S8 in your example):
{sum=0; for (i=3; i<=NF; i++) { sum+= $i }
Then it prints only rows whose sum is > 100: if (sum > 100) print
You can adjust the condition on the sum as needed, but hopefully this gives you an idea of how to do it with awk.
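As a small variation, a sketch that keeps the header row and takes the threshold from the command line (min is a variable name I'm introducing):
awk -v min=100 '
NR==1 {print; next}                     # keep the header row
{
  sum = 0
  for (i = 3; i <= NF; i++) sum += $i   # total S1..S8 for this row
  if (sum >= min) print                 # keep rows at or above the threshold
}' file.txt
It uses >= rather than >, so a row summing to exactly 100 is kept, which matches "lower than 100 are removed".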
The following awk may also help (it writes the kept rows to a file named out_file):
awk 'FNR>1{sum=0; for(i=3;i<=NF;i++) sum+=$i; if(sum>100) print > "out_file"}' Input_file
In case you need a separate output file per matching row, the following may help:
awk 'FNR>1{sum=0; for(i=3;i<=NF;i++) sum+=$i; if(sum>100) print > ("out_file" ++count)}' Input_file

Use part of a column in one file as search term in other file

I have two files. The output file I am searching has earthquake locations and has the following format:
19090212 1323 30.12 36 19.41 103 28.24 7.29 0.00 4 149 25.8 0.02 5.7 9.8 D - 0
19090216 1828 49.61 36 13.27 101 35.38 10.94 0.00 13 54 38.5 0.07 0.3 0.7 B 0
19090711 2114 54.11 35 1.07 99 56.42 7.00 0.00 7 177 18.7 4.00 63.3 53.2 D # 0
I want to use the last 6 digits of the first column (i.e. '090418' out of '19090418') with the first 3 digits of the second column (i.e. '072' out of '0728') as my search term. The file I am searching has the following format:
SC17 P 090212132329.89
X25A P 090212132330.50
AMTX P 090216182814.12
X29A P 090216182813.70
Y28A P 090216182822.36
MSTX P 090216182826.80
Y27A P 090216182831.43
After I search the second file for the term, I need to figure out how many lines are in that section. So for this example, if I were searching the terms shown for the second file above, I want to know there are 2 lines for 090212132 and 5 lines for 090216182.
This is my first post, so please let me know how I can improve clarity or conciseness in my posts. Thanks for the help!
awk to the rescue!
$ awk 'NR==FNR{a[substr($1,3) substr($2,1,3)]; next}
{k=substr($3,1,9)}
k in a{a[k]++}
END{for(k in a) if(a[k]>0) print k,a[k]}' file1 file2
With the sample input files above, this prints 090212132 2 and 090216182 5, matching the expected counts.
The answer karakfa suggested worked! My output looks like this:
100224194 7
100117172 18
091004005 11
090520220 10
090526143 21
090122033 20
Thanks for the help!
karakfa's answer, with explanation:
awk 'NR==FNR {                   # First file: build the search keys
    $1 = substr($1, 3)           # Last 6 characters of the first col
    $2 = substr($2, 1, 3)        # First 3 characters of the second col
    a[$1 $2]                     # Referencing the combined key adds it to the array
    next                         # Move to the next record of the first file
}
                                 # Second file from here on
{k = substr($3, 1, 9)}           # First 9 characters of the third col
k in a {a[k]++}                  # If the key is known, increment its count
END {
    for (k in a)                 # Iterate over the array
        if (a[k] > 0)            # If the key was matched at least once
            print k, a[k]        # Print the key and the number of occurrences
}' file1 file2
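If you also want to see search terms that matched nothing, a small tweak of the same script (a sketch; it initializes each key to 0 and drops the a[k] > 0 filter) prints those with a count of 0 instead of hiding them:
awk 'NR==FNR {a[substr($1,3) substr($2,1,3)] = 0; next}
     {k = substr($3, 1, 9)}
     k in a {a[k]++}
     END {for (k in a) print k, a[k]}' file1 file2
Note the file order matters in both versions: the earthquake-location file must come first so its keys are loaded before the second file is scanned.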

Calculating average in irregular intervals without considering missing values in shell script?

I have a dataset with many missing values as -999. Part of the data is
input.txt
30
-999
10
40
23
44
-999
-999
31
-999
54
-999
-999
-999
-999
-999
-999
10
23
2
5
3
8
8
7
9
6
10
and so on
I would like to calculate the average over a repeating interval of 5, 6, and 6 rows, ignoring the missing values.
The desired output is:
ofile.txt
25.75 (i.e. consider the first 5 rows and average, skipping missing values: (30+10+40+23)/4)
43 (i.e. consider the next 6 rows and average, skipping missing values: (44+31+54)/3)
-999 (i.e. consider the next 6 rows; since all are missing, write the missing value -999)
8.6 (i.e. consider the next 5 rows and average: (10+23+2+5+3)/5)
8 (i.e. consider the next 6 rows and average)
I can handle a regular interval (let's say 5) with this:
awk '!/\-999/{sum += $1; count++} NR%5==0{print count ? (sum/count) :-999;sum=count=0}' input.txt
I asked a similar question about a regular interval here: Calculating average without considering missing values in shell script? Here I am asking for a solution for irregular intervals.
With AWK
awk -v f="5" 'f&&f--&&$0!=-999{c++;v+=$0} NR%17==0{f=5;r++}
!f&&NR%17!=0{f=6;r++} r&&!c{print -999;r=0} r&&c{print v/c;r=v=c=0}
END{if(c!=0)print v/c}' input.txt
Output
25.75
43
-999
8.6
8
Breakdown
f&&f--&&$0!=-999{c++;v+=$0} # add valid values and increment the count
NR%17==0{f=5;r++}           # 5+6+6=17, so restart the 5,6,6 pattern every 17 rows
!f&&NF%17!=0{f=6;r++}       # start a 6-row interval once the current one is used up
r&&!c{print -999;r=0}       # print -999 if the interval had no valid values
r&&c{print v/c;r=v=c=0}     # print the average and reset
END{
if(c!=0)                    # print the average of any remaining values
print v/c
}
Alternatively, factoring the 5,6,6 cadence into a small function:
$ cat tst.awk
function nextInterval( intervals) {
numIntervals = split("5 6 6",intervals)
intervalsIdx = (intervalsIdx % numIntervals) + 1
return intervals[intervalsIdx]
}
BEGIN {
interval = nextInterval()
noVal = -999
}
$0 != noVal {
sum += $0
cnt++
}
++numRows == interval {
print (cnt ? sum / cnt : noVal)
interval = nextInterval()
numRows = sum = cnt = 0
}
$ awk -f tst.awk file
25.75
43
-999
8.6
8
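If the cadence may change, a minimal sketch of the same idea (assuming POSIX awk; the pattern and noVal variables are passed on the command line) avoids editing the script:
awk -v pattern="5 6 6" -v noVal=-999 '
BEGIN {numIntervals = split(pattern, intervals); idx = 1; interval = intervals[idx]}
$0 != noVal {sum += $0; cnt++}
++numRows == interval {
  print (cnt ? sum / cnt : noVal)
  idx = (idx % numIntervals) + 1   # advance cyclically through the pattern
  interval = intervals[idx]
  numRows = sum = cnt = 0
}' input.txt
Changing the cadence is then just a matter of passing, say, -v pattern="4 7".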

How do I parse a large file in Linux

I am a beginner with Linux. I have the following flat file, test.txt:
Iteration 1
Telephony
Pass/Fail
5.1.1.1 voiceCallPhoneBook 50 45
5.1.1.4 voiceCallPhoneHistory 50 49
5.1.1.7 receiveCall 100 100
5.1.1.8 deleteContacts 20 19
5.1.1.9 addContacts 20 20
Telephony 16:47:42
Messaging
Pass/Fail
5.1.2.3 openSMS 50 49
5.1.2.1 smsManuallyEntryOption 50 50
5.1.2.2 smsSelectContactsOption 50 50
Messaging 03:26:31
Email
Pass/Fail
Email 00:00:48
Email
Pass/Fail
Email 00:00:40
PIM
Pass/Fail
5.1.6.1 addAppointment 5 0
5.1.6.2 setAlarm 1 0
5.1.6.3 deleteAppointment 5 0
5.1.6.4 deleteAlarm 1 0
5.1.6.5 addTask 1 0
5.1.6.6 openTask 1 0
5.1.6.7 deleteTask 1 0
PIM 00:03:06
Multi-Media
Iteration 2
Telephony
Pass/Fail
5.1.1.1 voiceCallPhoneBook 50 47
5.1.1.4 voiceCallPhoneHistory 50 50
5.1.1.7 receiveCall 100 100
5.1.1.8 deleteContacts 20 20
5.1.1.9 addContacts 20 20
Telephony 04:02:05
Messaging
Pass/Fail
5.1.2.3 openSMS 50 50
5.1.2.1 smsManuallyEntryOption 50 50
5.1.2.2 smsSelectContactsOption 50 50
Messaging 03:20:01
Email
Pass/Fail
Email 00:00:47
Email
Pass/Fail
Email 00:00:40
PIM
Pass/Fail
5.1.6.1 addAppointment 5 5
5.1.6.2 setAlarm 1 1
5.1.6.3 deleteAppointment 5 5
5.1.6.4 deleteAlarm 1 1
5.1.6.5 addTask 1 1
5.1.6.6 openTask 1 1
5.1.6.7 deleteTask 1 1
PIM 00:09:20
Multi-Media
I want to count the number of occurrences of a specific word in the file. E.g., if I search for "voiceCallPhoneBook", it should report 2 occurrences.
I can use:
grep "5.1.1.1" reports.txt | cut -d' ' -f1,4,7,10
After running this, I get output like the below:
5.1.1.1 voiceCallPhoneBook 50 45
5.1.1.1 voiceCallPhoneBook 50 47
It is a very large file, and I want to use loops in bash/awk scripts and also find the averages of the 3rd and 4th column values. I am struggling with writing bash scripts. It would be appreciated if someone could give a solution for this.
Thanks
#!/usr/bin/awk -f
BEGIN{
c3 = 0
c4 = 0
count = 0
}
/voiceCallPhoneBook/{
c3 = c3 + $3;
c4 = c4 + $4;
count++;
}
END{
print "column 3 avg: " c3/count
print "column 4 avg: " c4/count
}
1) Save it in a file, for example countVoiceCall.awk.
2) Run: awk -f countVoiceCall.awk sample.txt
Output:
column 3 avg: 50
column 4 avg: 46
Brief explanation:
a. The BEGIN{...} block is used for variable initialization.
b. The /PATTERN/{...} block is used to match your keyword, for example "voiceCallPhoneBook".
c. The END{...} block is used to print the results.
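Rather than hardcoding the keyword in the script, a minimal sketch (kw is a variable name I'm introducing; test.txt is the file from the question) passes it on the command line:
awk -v kw="voiceCallPhoneBook" '
$2 == kw {c3 += $3; c4 += $4; count++}   # match the keyword in column 2
END {
  if (count) {
    print "column 3 avg: " c3/count
    print "column 4 avg: " c4/count
  }
}' test.txt
The if (count) guard avoids a divide-by-zero when the keyword never appears.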
This will match every result line starting with a section number (like 5.1.1.4),
make a per-test tally of the 3rd and 4th columns,
then print them all out:
awk '/^5\.?\.?\.?/ {a[$1" " $2] +=$3 ; b[$1" " $2] +=$4 }
END{ for (k in a){
printf("%-50s%-10i%-10i\n",k,a[k],b[k])}
}' $1
A duplicate from earlier today is here: Parse the large test files using awk
With headers, column averages, and an occurrence count, formatted a bit more neatly for easier reading :)
awk 'BEGIN{
printf("%-50s%-10s%-10s%-10s\n","Name","Col3 Tot","Col4 Tot","Occur")
}
/^5\.?\.?\.?/ {
count++
c3 = c3 + $3
c4 = c4 + $4
a[$1" " $2] +=$3
b[$1" " $2] +=$4
c[$1" " $2]++
}
END{
for (k in a)
{printf("%-50s%-10i%-10i%-10i\n",k,a[k],b[k],c[k])}
print "col3 avg: " c3/count "\ncol4 avg: " c4/count
}' $1
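Since this script reads the report file name from the shell positional parameter $1, save it as, say, report_summary.sh (a hypothetical name) and pass the file as an argument:
sh report_summary.sh test.txt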
