How do I parse a large file in Linux - linux

I am a beginner with Linux. I have the following flat file, test.txt:
Iteration 1
Telephony
Pass/Fail
5.1.1.1 voiceCallPhoneBook 50 45
5.1.1.4 voiceCallPhoneHistory 50 49
5.1.1.7 receiveCall 100 100
5.1.1.8 deleteContacts 20 19
5.1.1.9 addContacts 20 20
Telephony 16:47:42
Messaging
Pass/Fail
5.1.2.3 openSMS 50 49
5.1.2.1 smsManuallyEntryOption 50 50
5.1.2.2 smsSelectContactsOption 50 50
Messaging 03:26:31
Email
Pass/Fail
Email 00:00:48
Email
Pass/Fail
Email 00:00:40
PIM
Pass/Fail
5.1.6.1 addAppointment 5 0
5.1.6.2 setAlarm 1 0
5.1.6.3 deleteAppointment 5 0
5.1.6.4 deleteAlarm 1 0
5.1.6.5 addTask 1 0
5.1.6.6 openTask 1 0
5.1.6.7 deleteTask 1 0
PIM 00:03:06
Multi-Media
Iteration 2
Telephony
Pass/Fail
5.1.1.1 voiceCallPhoneBook 50 47
5.1.1.4 voiceCallPhoneHistory 50 50
5.1.1.7 receiveCall 100 100
5.1.1.8 deleteContacts 20 20
5.1.1.9 addContacts 20 20
Telephony 04:02:05
Messaging
Pass/Fail
5.1.2.3 openSMS 50 50
5.1.2.1 smsManuallyEntryOption 50 50
5.1.2.2 smsSelectContactsOption 50 50
Messaging 03:20:01
Email
Pass/Fail
Email 00:00:47
Email
Pass/Fail
Email 00:00:40
PIM
Pass/Fail
5.1.6.1 addAppointment 5 5
5.1.6.2 setAlarm 1 1
5.1.6.3 deleteAppointment 5 5
5.1.6.4 deleteAlarm 1 1
5.1.6.5 addTask 1 1
5.1.6.6 openTask 1 1
5.1.6.7 deleteTask 1 1
PIM 00:09:20
Multi-Media
I want to count the number of occurrences of a specific word in the file. E.g. if I search for "voiceCallPhoneBook" it should be reported as occurring 2 times.
I can use
cat reports.txt | grep "5.1.1.4" | cut -d' ' -f1,4,7,10 |
After running this I get output like the following:
5.1.1.4 voiceCallPhoneBook 50 45
5.1.1.4 voiceCallPhoneBook 50 47
It is a very large file, and I want to use loops in bash/awk scripts and also find the average of the sums of the 3rd and 4th column values. I am struggling to write this in a bash script. It would be appreciated if someone could provide a solution.
Thanks
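For the occurrence count on its own, a single grep is enough (a minimal sketch against the test.txt shown above; -c counts matching lines):
$ grep -c 'voiceCallPhoneBook' test.txt
2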

#!/usr/bin/awk -f
BEGIN{
c3 = 0
c4 = 0
count = 0
}
/voiceCallPhoneBook/{
c3 = c3 + $3;
c4 = c4 + $4;
count++;
}
END{
print "column 3 avg: " c3/count
print "column 4 avg: " c4/count
}
1) Save it in a file, for example countVoiceCall.awk.
2) Run it with: awk -f countVoiceCall.awk sample.txt
output:
column 3 avg: 50
column 4 avg: 46
Brief explanation:
a. The BEGIN{...} block is used for variable initialization.
b. The /PATTERN/{...} block is used to match your keyword, for example "voiceCallPhoneBook".
c. The END{...} block is used to print the results.
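To avoid hard-coding the keyword inside the script, the same idea can take the search word as a variable (a sketch; any POSIX awk should accept it):
awk -v word="voiceCallPhoneBook" '
$0 ~ word { c3 += $3; c4 += $4; count++ }   # accumulate columns 3 and 4 for matching lines
END {
    print "occurrences : " count
    print "column 3 avg: " (count ? c3/count : 0)
    print "column 4 avg: " (count ? c4/count : 0)
}' sample.txt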

This will match every line that starts with a test number (e.g. 5.1.1.4),
keep a running tally of the 3rd and 4th columns per test,
then print them all out:
awk '/^5\.?\.?\.?/ {a[$1" " $2] +=$3 ; b[$1" " $2] +=$4 }
END{ for (k in a){
printf("%-50s%-10i%-10i\n",k,a[k],b[k])}
}' $1
This is a duplicate of a question from earlier today: Parse the large test files using awk.
With headers, averages and occurrence counts, and formatted a bit more neatly for easier reading :)
awk 'BEGIN{
printf("%-50s%-10s%-10s%-10s\n","Name","Col3 Tot","Col4 Tot","Ocurr")
}
/^5\.?\.?\.?/ {
count++
c3 = c3 + $3
c4 = c4 + $4
a[$1" " $2] +=$3
b[$1" " $2] +=$4
c[$1" " $2]++
}
END{
for (k in a)
{printf("%-50s%-10i%-10i%-10i\n",k,a[k],b[k],c[k])}
print "col3 avg: " c3/count "\ncol4 avg: " c4/count
}' $1
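If per-test averages are wanted rather than totals, the same accumulated arrays already hold everything needed; just divide in the END block (a sketch, matching every line that starts with a test number):
awk '/^5\./ { a[$1" "$2]+=$3; b[$1" "$2]+=$4; c[$1" "$2]++ }
END { for (k in a) printf("%-50s%-10.1f%-10.1f%-10i\n", k, a[k]/c[k], b[k]/c[k], c[k]) }' test.txt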

Related

Sum each row in a CSV file and sort it by a specific value in bash

I have a question. Taking the comma-separated CSV below, I want to run a bash script that sums the values of columns 7, 8 and 9 for each city and shows the row with the max value.
So, the original dataset:
Row,name,city,age,height,weight,good rates,bad rates,medium rates
1,john,New York,25,186,98,10,5,11
2,mike,New York,21,175,87,19,6,21
3,Sandy,Boston,38,185,88,0,5,6
4,Sam,Chicago,34,167,76,7,0,2
5,Andy,Boston,31,177,85,19,0,1
6,Karl,New York,33,189,98,9,2,1
7,Steve,Chicago,45,176,88,10,3,0
The desired output would be:
Row,name,city,age,height,weight,good rates,bad rates,medium rates,max rates by city
2,mike,New York,21,175,87,19,6,21,46
5,Andy,Boston,31,177,85,19,0,1,20
7,Steve,Chicago,45,176,88,10,3,0,13
I'm trying with this, but it only gives me the single highest rate (46); I need it per city, and I need the whole row shown. Any ideas how to continue?
awk 'BEGIN {FS=OFS=","}{sum = 0; for (i=7; i<=9;i++) sum += $i} NR ==1 || sum >max {max = sum}
You may use this awk:
awk '
BEGIN {FS=OFS=","}
NR==1 {
print $0, "max rates by city"
next
}
{
s = $7+$8+$9
if (s > max[$3]) {
max[$3] = s
rec[$3] = $0
}
}
END {
for (i in max)
print rec[i], max[i]
}' file
Row,name,city,age,height,weight,good rates,bad rates,medium rates,max rates by city
7,Steve,Chicago,45,176,88,10,3,0,13
2,mike,New York,21,175,87,19,6,21,46
5,Andy,Boston,31,177,85,19,0,1,20
or to get tabular output:
awk 'BEGIN {FS=OFS=","} NR==1{print $0, "max rates by city"; next} {s=$7+$8+$9; if (s > max[$3]) {max[$3] = s; rec[$3] = $0}} END {for (i in max) print rec[i], max[i]}' file | column -s, -t
Row name city age height weight good rates bad rates medium rates max rates by city
7 Steve Chicago 45 176 88 10 3 0 13
2 mike New York 21 175 87 19 6 21 46
5 Andy Boston 31 177 85 19 0 1 20
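Note that for (i in max) visits the groups in an unspecified order, which is why the rows above come out shuffled; to restore the original row order, as in the desired output, you could sort on the numeric first field before columnising (a sketch):
awk 'BEGIN {FS=OFS=","} NR==1{print $0, "max rates by city"; next} {s=$7+$8+$9; if (s > max[$3]) {max[$3] = s; rec[$3] = $0}} END {for (i in max) print rec[i], max[i]}' file | sort -t, -k1,1n | column -s, -t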

Using AWK to check conditions in multiple columns to output average, min, max, and total occurrences from a dataset containing age, race, and sex

I am using PuTTY for school to learn UNIX/Linux and have a file, 2.asr, which is a large data set containing the age, sex, and race of multiple individuals in their own columns, for example:
19 Male White
23 Female White
23 Male White
45 Female Other
54 Male Asian
24 Male Other
34 Female Asian
23 Male Hispanic
45 Female Hispanic
38 Female White
I would like to find the average age, max age, min age, and total occurrences of unique demographics such as Male White or Female Hispanic.
I've tried using awk code as follows:
$ awk '$2 == "Male" && $3 == "Hispanic" {sum+=$1; n++}
(NR==1) {min=$1;max=$1+0};
(NR>=2) {if(min>$1) min=$1; if(max<$1) max=$1}
END {if (n>0)
print $2 " " $3 " Average Age: " sum/n ", Max: " max ", Min: " min ", Total: " n
}' 2.asr
However, regardless of what sex and race I input, the output always says "Male White", and the max and min values are those of the entire dataset rather than of the demographic conditions I've set. The average age and total occurrences do seem to be computed properly and change accordingly. I've tried using $2 and $3 at the start of the command in an if statement, and using BEGIN at the start, but I keep getting syntax errors at the end where I have my print statement. Is there a better way to approach this with if statements at the start of the command, or is my syntax off somewhere? Thanks to whoever wishes to assist!
do it wholesale
$ awk '{k=$2 FS $3}
!(k in c) {max[k]=min[k]=$1}
{sum[k]+=$1; c[k]++}
max[k]<$1 {max[k]=$1}
min[k]>$1 {min[k]=$1}
END {for(k in c) print k,max[k],min[k],sum[k]/c[k]}' file | sort | column -t
Female Asian 34 34 34
Female Hispanic 45 45 45
Female Other 45 45 45
Female White 38 23 30.5
Male Asian 54 54 54
Male Hispanic 23 23 23
Male Other 24 24 24
Male White 23 19 21
add the header
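One way to do that without sort reordering the header line is to echo the header separately before the sorted body (a sketch; the column names are just suggestions):
{ echo "Sex Race Max Min Avg"
  awk '{k=$2 FS $3}
       !(k in c) {max[k]=min[k]=$1}
       {sum[k]+=$1; c[k]++}
       max[k]<$1 {max[k]=$1}
       min[k]>$1 {min[k]=$1}
       END {for(k in c) print k, max[k], min[k], sum[k]/c[k]}' file | sort; } | column -t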
If this is for a class, it might not be an option, but GNU datamash is a useful tool intended just for this sort of statistics:
$ datamash -Ws -g2,3 mean 1 min 1 max 1 count 1 < input.txt
GroupBy(field-2) GroupBy(field-3) mean(field-1) min(field-1) max(field-1) count(field-1)
Female Asian 34 34 34 1
Female Hispanic 45 45 45 1
Female Other 45 45 45 1
Female White 30.5 23 38 2
Male Asian 54 54 54 1
Male Hispanic 23 23 23 1
Male Other 24 24 24 1
Male White 21 19 23 2
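If only one demographic is needed, you can filter first and drop the grouping (again assuming GNU datamash; with the sample data this should print roughly the following):
$ grep 'Female White' input.txt | datamash -W mean 1 min 1 max 1 count 1
30.5 23 38 2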
This will let you process all of your demographics at once while avoiding the need to store all of your input in memory (sort uses demand paging to handle that if necessary), which may matter since you said your input is a large data set:
$ cat tst.sh
#!/usr/bin/env bash
sort -k2 -k1,1n file |
awk '
BEGIN { OFS="\t" }
{ curr = $2 FS $3 }
curr != prev {
prt()
min = $1
sum = cnt = 0
prev = curr
}
{
max = $1
sum += $1
cnt++
}
END { prt() }
function prt() {
if (cnt) {
print prev, sum/cnt, max, min, cnt
}
}
'
$ ./tst.sh
Female Asian 34 34 34 1
Female Hispanic 45 45 45 1
Female Other 45 45 45 1
Female White 30.5 38 23 2
Male Asian 54 54 54 1
Male Hispanic 23 23 23 1
Male Other 24 24 24 1
Male White 21 19 23 2
To only find one group, say Female Asian, just change sort -k2 -k1,1n file | to grep 'Female Asian' file | sort -k2 -k1,1n | or tweak the awk script to test for those values or even just pipe the output to grep if you don't care much about efficiency:
$ ./tst.sh | grep 'Female Asian'
Female Asian 34 34 34 1
@rockytimmy, your code contained a few logical bugs.
Here is a minimal rewrite that keeps to your original requirements:
awk -v Sex="Female" -v Race="White" '
BEGIN {max=0; min=999; n=0; sum=0 }
$2 == Sex && $3 == Race {
print;
sum+=$1;
n++;
if ($1 < min) {min = $1};
if ($1 > max) {max = $1}
}
END { print Sex " " Race " Average Age: " sum/n ", Max: " max ", Min: " min ", Total: " n
}' 2.asr
NOTE: All matching entries are also printed out for verification.
Running the above awk script using the sample data you provided prints:
23 Female White
38 Female White
Female White Average Age: 30.5, Max: 38, Min: 23, Total: 2
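One remaining edge case: if the requested Sex/Race combination never appears, sum/n in the END block divides by zero. A small guard fixes that (a sketch; swap it in for the END block above):
END {
  if (n > 0)
    print Sex " " Race " Average Age: " sum/n ", Max: " max ", Min: " min ", Total: " n
  else
    print "No rows matched " Sex " " Race
}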

How to sort or rearrange numbers from multiple columns into multiple rows [fixed into 4 columns]?

I have one text file, test1.txt, which contains the following:
Input:
##[A1] [B1] [T1] [V1] [T2] [V2] [T3] [V3] [T4] [V4]## --> headers
1 1000 0 100 10 200 20 300 30 400
40 500 50 600 60 700 70 800
1010 0 101 10 201 20 301 30 401
40 501 50 601
2 1000 0 110 15 210 25 310 35 410
45 510 55 610 65 710
1010 0 150 10 250 20 350 30 450
40 550
Condition:
A1 and B1 -> for each A1 + (B1 + [Tn + Vn])
A1 should be in 1 column.
B1 should be in 1 column.
T1,T2,T3 and T4 should be in 1 column.
V1,V2,V3 and V4 should be in 1 column.
How do I rearrange it so it becomes like the output below?
Desire Output:
## A1 B1 Tn Vn ## --> headers
1 1000 0 100
10 200
20 300
30 400
40 500
50 600
60 700
70 800
1010 0 101
10 201
20 301
30 401
40 501
50 601
2 1000 0 110
15 210
25 310
35 410
45 510
55 610
65 710
1010 0 150
10 250
20 350
30 450
40 550
Here is my current code:
First Attempt:
Input
cat test1.txt | awk ' { a=$1 b=$2 } { for(i=1; i<=5; i=i+1) { t=substr($0,11+i*10,5) v=substr($0,16+i*10,5) if( t ~ /^\ +[0-9]+$/ || t ~ /^[0-9]+$/ || t ~ /^\ +[0-9]+\ +$/ ){ printf "%7s %7d %8d %8d \n",a,b,t,v } }}' | less
Output:
1 1000 400 0
40 500 800 0
1010 0 401 0
2 1000 410 0
1010 0 450 0
I'm trying with a simple awk command, but still can't get the result.
Can anyone help me with this?
Thanks,
Am
Unlike what is stated elsewhere, there's nothing tricky about this at all; you're just using fixed-width fields in your input instead of character/string-separated fields.
With GNU awk for FIELDWIDTHS to handle fixed width fields it really couldn't be much simpler:
$ cat tst.awk
BEGIN {
# define the width of the input and output fields
FIELDWIDTHS = "2 4 5 5 6 5 6 5 6 5 6 99"
ofmt = "%2s%5s%6s%5s%6s%s\n"
}
{
# strip leading/trailing blanks and square brackets from every field
for (i=1; i<=NF; i++) {
gsub(/^[[\s]+|[]\s]+$/,"",$i)
}
}
NR==1 {
# print the header line
printf ofmt, $1, $2, $3, "Tn", "Vn", " "$NF
next
}
{
# print every other line
for (i=4; i<NF; i+=2) {
printf ofmt, $1, $2, $3, $i, $(i+1), ""
$1 = $2 = $3 = ""
}
}
$ awk -f tst.awk file
## A1 B1 Tn Vn ## --> headers
1 1000 0 100
10 200
20 300
30 400
40 500
50 600
60 700
70 800
1010 0 101
10 201
20 301
30 401
40 501
50 601
2 1000 0 110
15 210
25 310
35 410
45 510
55 610
65 710
1010 0 150
10 250
20 350
30 450
40 550
With other awks you'd use a while() { substr() } loop instead of FIELDWIDTHS so it'd be a couple more lines of code but still trivial.
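For reference, a rough sketch of that substr() approach (same widths as the gawk version; it only carves out and trims the fields and prints the first few of them, to show the idea rather than reproduce the full formatting):
awk '
BEGIN { nw = split("2 4 5 5 6 5 6 5 6 5 6 99", w, " ") }
{
    pos = 1
    for (i = 1; i <= nw; i++) {
        f[i] = substr($0, pos, w[i])        # carve out the next fixed-width field
        gsub(/^[[ ]+|[] ]+$/, "", f[i])     # strip leading/trailing blanks and square brackets
        pos += w[i]
    }
    print f[1], f[2], f[3], f[4], f[5]
}' file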
The above will be orders of magnitude faster than an equivalent shell script. See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
This isn't easy because it is hard to identify when you have the different styles of row: those with values in both column 1 and column 2, those with no value in column 1 and a value in column 2, and those with no value in column 1 or 2. A first step is to make this easier; sed to the rescue:
$ sed 's/[[:space:]]\{1,\}$//
s/^....../&|/
s/|....../&|/
:a
s/|\( *[0-9][0-9]* \)\( *[^|]\)/|\1|\2/
t a' data
1 | 1000 | 0 | 100 | 10 | 200 | 20 | 300 | 30 | 400
| | 40 | 500 | 50 | 600 | 60 | 700 | 70 | 800
| 1010 | 0 | 101 | 10 | 201 | 20 | 301 | 30 | 401
| | 40 | 501 | 50 | 601
2 | 1000 | 0 | 110 | 15 | 210 | 25 | 310 | 35 | 410
| | 45 | 510 | 55 | 610 | 65 | 710
| 1010 | 0 | 150 | 10 | 250 | 20 | 350 | 30 | 450
| | 40 | 550
$
The first line removes any trailing white space, to avoid confusion. The next two expressions handle the fixed-width columns 1 and 2 (6 characters each). The next line creates a label a; the substitute finds a pipe |, some spaces, some digits, a space, and some trailing material which doesn't include a pipe; and inserts a pipe in the middle. The t a jumps back to the label if a substitution was done.
With that in place, it becomes easy to manage awk with a field separator of |.
This is verbose, but seems to do the trick:
awk -F '|' '
$1 > 0 { printf "%5d %4d %3d %3d\n", $1, $2, $3, $4
for (i = 5; i <= NF; i += 2) { printf "%5s %4s %3d %3d\n", "", "", $i, $(i+1) }
next
}
$2 > 0 { printf "%5s %4d %3d %3d\n", "", $2, $3, $4
for (i = 5; i <= NF; i += 2) { printf "%5s %4s %3d %3d\n", "", "", $i, $(i+1) }
next
}
{ for (i = 3; i <= NF; i += 2) { printf "%5s %4s %3d %3d\n", "", "", $i, $(i+1) }
next
}'
Output:
1 1000 0 100
10 200
20 300
30 400
40 500
50 600
60 700
70 800
1010 0 101
10 201
20 301
30 401
40 501
50 601
2 1000 0 110
15 210
25 310
35 410
45 510
55 610
65 710
1010 0 150
10 250
20 350
30 450
40 550
If you need to remove the headings, add 1d; to the start of the sed script.
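Putting the two stages together, the sed pre-processor can feed the awk formatter directly (a sketch; fix.sed and rearrange.awk are only suggested names for the two scripts above):
$ sed -f fix.sed data | awk -F '|' -f rearrange.awk   # script names are illustrative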
This might work for you (GNU sed):
sed -r '1d;s/^(.{11}).{11}/&\n\1/;s/^((.{5}).*\n)\2/\1 /;s/^(.{5}(.{6}).*\n.{5})\2/\1 /;/\S/P;D' file
Delete the first line (if the header is needed, see below). The key fields occupy the first 11 characters (the first key is 5 characters and the second 6) and the data fields occupy the next 11. Insert a newline and the key fields before each pair of data fields. Compare the keys on adjacent lines and replace them with spaces if they are duplicated. Do not print any blank lines.
If the header is needed, use the following:
sed -r '1{s/\[[^]]+\]\s*//5g;y/[]/ /;s/1/n/3g;s/B/ B/;G;b};s/^(.{11}).{11}/&\n\1/;s/^((.{5}).*\n)\2/\1 /;s/^(.{5}(.{6}).*\n.{5})\2/\1 /;/\S/P;D' file
This does additional formatting on the first line to remove the superfluous headings and the []'s, replace the 1's by n, add an additional space for alignment, and append an empty line.
Furthermore, by utilising the second line of the input file as a template for the data, a sed script can be created that does not contain any hard-coded values:
sed -r '2!d;s/\s*\S*//3g;s/.\>/&\n/;h;s/[^\n]/./g;G;s/[^\n.]/ /g;s#(.*)\n(.*)\n(.*)\n(.*)#1d;s/^(\1\2)\1\2/\&\\n\\1/;s/^((\1).*\\n)\\2/\\1\3/;s/^(\1(\2).*\\n\1)\\2/\\1\4/;/\\S/P;D#' file |
sed -r -f - file
The script created from the template is piped into a second invocation of the sed as a file and run against the original file to produce the required output.
Likewise the headers may be formatted if need be as so:
sed -r '2!d;s/\s*\S*//3g;s/.\>/&\n/;h;s/[^\n]/./g;G;s/[^\n.]/ /g;s#(.*)\n(.*)\n(.*)\n(.*)#s/^(\1\2)\1\2/\&\\n\\1/;s/^((\1).*\\n)\\2/\\1\3/;s/^(\1(\2).*\\n\1)\\2/\\1\4/;/\\S/P;D#' file |
sed -r -e '1{s/\[[^]]+\]\s*//5g;y/[]/ /;s/1/n/3g;s/B/ B/;G;b}' -f - file
By extracting the first four fields from the second line of the input file, four variables can be made: two regexps and two values. These variables can be used to build the sed script.
N.B. The sed script is created from strings extracted from the template, and the variables produced are also strings, so they can be concatenated to produce further new regexps, new values, and so on.
This is a rather tricky problem that can be handled a number of ways. Whether you use bash, perl or awk, you will need to handle the number of fields in a semi-generic way instead of just hard-coding values for your example.
Using bash, so long as you can rely on an even number of fields on every line (except for the lines with a sole initial value, e.g. 1010), you can accommodate the number of fields in a reasonably generic way. For lines beginning with 1, 2, etc. you know your initial output will contain 4 fields. For lines beginning with 1010, etc. you know the output will contain an initial 3 fields. For the remaining values you are simply outputting pairs.
The tricky part is handling the alignment. This is where printf helps: it allows you to set the field width with a parameter, using the form "%*s", where the conversion expects the next argument to be an integer giving the field width, followed by the argument for the string conversion itself. It takes a little gymnastics, but you could do something like the script below in bash itself.
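A quick demo of the parameterized width first (the values are purely illustrative):
$ printf '%*s|\n' 6 hi
    hi|
$ wd=10; printf '%*s|\n' "$wd" hi
        hi|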
(note: edit to match your output header format)
#!/bin/bash
declare -i nfields wd=6 ## total no. fields, printf field-width modifier
while read -r line; do ## read each line (preserve for header line)
arr=($line) ## separate into array
first=${arr[0]} ## check for '#' in first line for header
if [ "${first:0:1}" = '#' ]; then
nfields=$((${#arr[@]} - 2)) ## no. fields in header
printf "## A1 B1 Tn Vn ## --> headers\n" ## new header
continue
fi
fields=${#arr[@]} ## fields in line
case "$fields" in
$nfields ) ## fields -eq nfields?
cnt=4 ## handle 1st 4 values in line
printf " "
for ((i=0; i < cnt; i++)); do
if [ "$i" -eq '2' ]; then
printf "%*s" "5" "${arr[i]}"
else
printf "%*s" "$wd" "${arr[i]}"
fi
done
echo
for ((i = cnt; i < $fields; i += 2)); do ## handle rest
printf "%*s%*s%*s\n" "$((2*wd))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
done
;;
$((nfields - 1)) ) ## one less than nfields
cnt=3 ## handle 1st 3 values
printf " %*s%*s" "$wd" " "
for ((i=0; i < cnt; i++)); do
if [ "$i" -eq '1' ]; then
printf "%*s" "5" "${arr[i]}"
else
printf "%*s" "$wd" "${arr[i]}"
fi
done
echo
for ((i = cnt; i < $fields; i += 2)); do ## handle rest
if [ "$i" -eq '0' ]; then
printf "%*s%*s%*s\n" "$((wd+1))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
else
printf "%*s%*s%*s\n" "$((2*wd))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
fi
done
;;
* ) ## all other lines format as pairs
for ((i = 0; i < $fields; i += 2)); do
printf "%*s%*s%*s\n" "$((2*wd))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
done
;;
esac
done
Rather than reading a filename inside the script, just use redirection to feed the input file to your script (if you want to provide a filename instead, redirect the file into the while read ... loop).
Example Use/Output
$ bash text1format.sh <dat/text1.txt
## A1 B1 Tn Vn ## --> headers
1 1000 0 100
10 200
20 300
30 400
40 500
50 600
60 700
70 800
1010 0 101
10 201
20 301
30 401
40 501
50 601
2 1000 0 110
15 210
25 310
35 410
45 510
55 610
65 710
1010 0 150
10 250
20 350
30 450
40 550
As between awk and bash, awk will generally be faster, but here with formatted output, it may be closer than usual. Look things over and let me know if you have questions.

Linux: filter text rows by the sum of specific columns

From raw sequencing data I created a count file (.txt) with the counts of unique sequences per sample.
The data looks like this:
sequence seqLength S1 S2 S3 S4 S5 S6 S7 S8
AAAAA... 46 0 1 1 8 1 0 1 5
AAAAA... 46 50 1 5 0 2 0 4 0
...
TTTTT... 71 0 0 5 7 5 47 2 2
TTTTT... 81 5 4 1 0 7 0 1 1
I would like to filter the sequences by row sum, so that rows with a total sum across all samples (S1 to S8) lower than, for example, 100 are removed.
This can probably be done with awk, but I have no experience with this text-processing utility.
Can anyone help?
Give this a try:
awk 'NR>1 {sum=0; for (i=3; i<=NF; i++) { sum+= $i } if (sum > 100) print}' file.txt
It will skip line 1: NR>1.
Then it will sum the items per row starting from item 3 (S1 to S8 in your example):
{sum=0; for (i=3; i<=NF; i++) { sum+= $i }
Then it will only print rows whose sum is greater than 100: if (sum > 100) print}'
You could adjust the condition on the sum as needed, but hopefully this gives you an idea of how to do it with awk.
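Since the question says rows summing to less than 100 should be removed (i.e. rows totalling at least 100 are kept), and the header line presumably stays, a slight variant would be (a sketch):
awk 'NR==1 {print; next} {sum=0; for (i=3; i<=NF; i++) sum+=$i; if (sum >= 100) print}' file.txt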
The following awk may help as well:
awk 'FNR>1{for(i=3;i<=NF;i++){sum+=$i};if(sum>100){print sum > "out_file"};sum=""}' Input_file
In case you need a separate output file per qualifying row, the following may help (one file per input line number):
awk 'FNR>1{for(i=3;i<=NF;i++){sum+=$i};if(sum>100){print sum > ("out_file" FNR)};sum=""}' Input_file

Average column if value in other column matches and print as additional column

I have a file like this:
Score 1 24 HG 1
Score 2 26 HG 2
Score 5 56 RP 0.5
Score 7 82 RP 1
Score 12 97 GM 5
Score 32 104 LS 3
I would like to average column 5 if column 4 are identical and print the average as column 6 so that it looks like this:
Score 1 24 HG 1 1.5
Score 2 26 HG 2 1.5
Score 5 56 RP 0.5 0.75
Score 7 82 RP 1 0.75
Score 12 97 GM 5 5
Score 32 104 LS 3 3
I have tried a couple of solutions I found on here.
e.g.
awk '{ total[$4] += $5; ++n[$4] } END { for(i in total) print i, total[i] / n[i] }'
but they all end up with this:
HG 1.5
RP 0.75
GM 5
LS 3
Which is undesirable as I lose a lot of information.
You can iterate through your table twice: calculate the averages (as you already do) on the first iteration, and then print them out on the second iteration:
awk 'NR==FNR { total[$4] += $5; ++n[$4] } NR>FNR { print $0, total[$4] / n[$4] }' file file
Notice the file twice at the end. While going through the "first" file, NR==FNR, and we sum the appropriate values, keeping them in memory (variables total and n). During "second" file traversal, NR>FNR, and we print out all the original data + averages:
Score 1 24 HG 1 1.5
Score 2 26 HG 2 1.5
Score 5 56 RP 0.5 0.75
Score 7 82 RP 1 0.75
Score 12 97 GM 5 5
Score 32 104 LS 3 3
You can do it in one pass through the file, but then you have to store the entire file in memory, so it's a disk I/O vs memory tradeoff:
awk '
BEGIN {FS = OFS = "\t"}
{total[$4] += $5; n[$4]++; line[NR] = $0; key[NR] = $4}
END {for (i=1; i<=NR; i++) print line[i], total[key[i]] / n[key[i]]}
' file
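The sample rows look space-separated rather than tab-separated; if that is the case, the same one-pass idea works with awk's default field splitting (a sketch):
awk '{total[$4] += $5; n[$4]++; line[NR] = $0; key[NR] = $4}
END {for (i=1; i<=NR; i++) print line[i], total[key[i]] / n[key[i]]}' file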
