Using AWK to check conditions in multiple columns to output average, min, max, and total occurrences from a dataset containing age, race, and sex - linux

I am using PuTTY for school to learn UNIX/Linux and have a file 2.asr, which is a large data set containing the age, sex, and race of multiple individuals in their own columns, for example:
19 Male White
23 Female White
23 Male White
45 Female Other
54 Male Asian
24 Male Other
34 Female Asian
23 Male Hispanic
45 Female Hispanic
38 Female White
I would like to find the average age, max age, min age, and total occurrences of unique demographics such as Male White or Female Hispanic.
I've tried using awk code as follows:
$ awk '$2 == "Male" && $3 == "Hispanic" {sum+=$1; n++}
(NR==1) {min=$1;max=$1+0};
(NR>=2) {if(min>$1) min=$1; if(max<$1) max=$1}
END {if (n>0)
print $2 " " $3 " Average Age: " sum/n ", Max: " max ", Min: " min ", Total: " n
}' 2.asr
However, regardless of what sex and race I input, the output is always "Male White", and the max and min values are those of the entire dataset rather than of the demographic I've selected. The average age and total occurrences of each demographic do seem to be output properly and change accordingly. I've tried using $2 and $3 in an if statement at the start of the command and utilizing BEGIN at the start as well, but I keep getting syntax errors at the end where I have my print statement. Is there a better way to approach this with if statements at the start of the command, or is my syntax off somewhere? Thanks to whoever wishes to assist!

Do it wholesale:
$ awk '{k=$2 FS $3}
!(k in c) {max[k]=min[k]=$1}
{sum[k]+=$1; c[k]++}
max[k]<$1 {max[k]=$1}
min[k]>$1 {min[k]=$1}
END {for(k in c) print k,max[k],min[k],sum[k]/c[k]}' file | sort | column -t
Female Asian 34 34 34
Female Hispanic 45 45 45
Female Other 45 45 45
Female White 38 23 30.5
Male Asian 54 54 54
Male Hispanic 23 23 23
Male Other 24 24 24
Male White 23 19 21
Add the header if you need column names.
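For example, one way to do that (the column names below are my own choice) is to echo a header line before the sorted awk output so column -t aligns everything together:
$ { echo 'Sex Race Max Min Avg'
    awk '{k=$2 FS $3}
         !(k in c) {max[k]=min[k]=$1}
         {sum[k]+=$1; c[k]++}
         max[k]<$1 {max[k]=$1}
         min[k]>$1 {min[k]=$1}
         END {for(k in c) print k,max[k],min[k],sum[k]/c[k]}' file | sort
  } | column -t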

If this is for a class, it might not be an option, but GNU datamash is a useful tool intended just for this sort of statistics:
$ datamash -Ws -g2,3 mean 1 min 1 max 1 count 1 < input.txt
GroupBy(field-2) GroupBy(field-3) mean(field-1) min(field-1) max(field-1) count(field-1)
Female Asian 34 34 34 1
Female Hispanic 45 45 45 1
Female Other 45 45 45 1
Female White 30.5 23 38 2
Male Asian 54 54 54 1
Male Hispanic 23 23 23 1
Male Other 24 24 24 1
Male White 21 19 23 2
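As a side note, the GroupBy(...) header line shown above only appears when datamash is asked to emit output headers; a minimal sketch, assuming GNU datamash's --header-out option:
$ datamash --header-out -Ws -g2,3 mean 1 min 1 max 1 count 1 < input.txt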

This will let you process all of your demographics at once while avoiding the need to store all of your input in memory at once (sort uses demand paging to handle that if necessary), which may matter since you said your input is a large data set:
$ cat tst.sh
#!/usr/bin/env bash
sort -k2 -k1,1n file |
awk '
BEGIN { OFS="\t" }
{ curr = $2 FS $3 }
curr != prev {
prt()
min = $1
sum = cnt = 0
prev = curr
}
{
max = $1
sum += $1
cnt++
}
END { prt() }
function prt() {
if (cnt) {
print prev, sum/cnt, max, min, cnt
}
}
'
$ ./tst.sh
Female Asian 34 34 34 1
Female Hispanic 45 45 45 1
Female Other 45 45 45 1
Female White 30.5 38 23 2
Male Asian 54 54 54 1
Male Hispanic 23 23 23 1
Male Other 24 24 24 1
Male White 21 19 23 2
To only find one group, say Female Asian, just change sort -k2 -k1,1n file | to grep 'Female Asian' file | sort -k2 -k1,1n |, or tweak the awk script to test for those values, or even just pipe the output to grep if you don't care much about efficiency:
$ ./tst.sh | grep 'Female Asian'
Female Asian 34 34 34 1

@rockytimmy, your code contained a few logical bugs.
Here is a minimal rewrite that still keeps to your original requirements:
awk -v Sex="Female" -v Race="White" '
BEGIN { max=0; min=999; n=0; sum=0 }
$2 == Sex && $3 == Race {
    print
    sum += $1
    n++
    if ($1 < min) { min = $1 }
    if ($1 > max) { max = $1 }
}
END {
    print Sex " " Race " Average Age: " sum/n ", Max: " max ", Min: " min ", Total: " n
}' 2.asr
NOTE: All matching entries are also printed out for verification.
Running the above awk script using the sample data you provided prints:
23 Female White
38 Female White
Female White Average Age: 30.5, Max: 38, Min: 23, Total: 2
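If you save just the awk program (the part between the single quotes) in a file, say demo.awk (a file name I made up), you can switch demographics from the command line without touching the code:
$ awk -v Sex="Male" -v Race="Hispanic" -f demo.awk 2.asr
$ awk -v Sex="Female" -v Race="Other" -f demo.awk 2.asr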

Related

Sum each row in a CSV file and sort it by specific value bash

I have a question. Taking the comma-separated CSV below, I want to run a script in bash that sums the values of columns 7, 8, and 9 for each row and, for each city, shows the row with the max value.
The original dataset:
Row,name,city,age,height,weight,good rates,bad rates,medium rates
1,john,New York,25,186,98,10,5,11
2,mike,New York,21,175,87,19,6,21
3,Sandy,Boston,38,185,88,0,5,6
4,Sam,Chicago,34,167,76,7,0,2
5,Andy,Boston,31,177,85,19,0,1
6,Karl,New York,33,189,98,9,2,1
7,Steve,Chicago,45,176,88,10,3,0
The desired output would be:
Row,name,city,age,height,weight,good rates,bad rates,medium rates,max rates by city
2,mike,New York,21,175,87,19,6,21,46
5,Andy,Boston,31,177,85,19,0,1,20
7,Steve,Chicago,45,176,88,10,3,0,13
I'm trying this, but it only gives me the single highest rate number (46); I need it per city, and it should show the whole row. Any ideas how to continue?
awk 'BEGIN {FS=OFS=","}{sum = 0; for (i=7; i<=9;i++) sum += $i} NR ==1 || sum >max {max = sum}
You may use this awk:
awk '
BEGIN {FS=OFS=","}
NR==1 {
print $0, "max rates by city"
next
}
{
s = $7+$8+$9
if (s > max[$3]) {
max[$3] = s
rec[$3] = $0
}
}
END {
for (i in max)
print rec[i], max[i]
}' file
Row,name,city,age,height,weight,good rates,bad rates,medium rates,max rates by city
7,Steve,Chicago,45,176,88,10,3,0,13
2,mike,New York,21,175,87,19,6,21,46
5,Andy,Boston,31,177,85,19,0,1,20
or to get tabular output:
awk 'BEGIN {FS=OFS=","} NR==1{print $0, "max rates by city"; next} {s=$7+$8+$9; if (s > max[$3]) {max[$3] = s; rec[$3] = $0}} END {for (i in max) print rec[i], max[i]}' file | column -s, -t
Row name city age height weight good rates bad rates medium rates max rates by city
7 Steve Chicago 45 176 88 10 3 0 13
2 mike New York 21 175 87 19 6 21 46
5 Andy Boston 31 177 85 19 0 1 20
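One caveat: for (i in max) in the END block visits the cities in an unspecified order. If you want the rows printed in the order the cities first appear in the file (which happens to match the output shown in the question), here is a sketch of a variant that remembers that order:
awk '
BEGIN {FS=OFS=","}
NR==1 {print $0, "max rates by city"; next}
!($3 in max) {order[++n] = $3}          # remember each city the first time it is seen
{
    s = $7+$8+$9
    if (!($3 in rec) || s > max[$3]) {max[$3] = s; rec[$3] = $0}
}
END {
    for (i=1; i<=n; i++) {city = order[i]; print rec[city], max[city]}
}' file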

Use printf to format list that is uneven

I have a small list of student grades, and I need to format them side by side depending on the gender of the student, so one column is Male and the other Female. The problem is the list doesn't go male, female, male, female; it is uneven.
I've tried using printf to format the output so the 2 columns are side by side, but the format is ruined because of the uneven list.
Name Gender Mark1 Mark2 Mark3
AA M 20 15 35
BB F 22 17 44
CC F 19 14 25
DD M 15 20 42
EE F 18 22 30
FF M 0 20 45
This is the list I am talking about ^^
awk 'BEGIN {print "Male" " Female"} {if (NR!=1) {if ($2 == "M") {printf "%-s %-s %-s", $3, $4, $5} else if ($2 == "F") {printf "%s %s %s\n", $3, $4 ,$5}}}' text.txt
So I'm getting results like
Male Female
20 15 35 22 17 44
19 14 25
15 20 42 18 22 30
0 20 45
But I want it like this:
Male Female
20 15 35 22 17 44
15 20 42 19 14 25
0 20 45 18 22 30
I haven't added separators yet; I'm just trying to figure this out. I'm not sure if it would be better to put the marks into 2 arrays depending on gender and then print them out.
Another solution, which also tries to handle the case where the numbers of M and F rows are not equal:
$ awk 'NR==1 {print "Male\tFemale"}
NR>1 {k=$2;$1=$2="";sub(/ +/,"");
if(k=="M") m[++mc]=$0; else f[++fc]=$0}
END {max=mc>fc?mc:fc;
for(i=1;i<=max;i++) print (m[i]?m[i]:"-") "\t" (f[i]?f[i]:"-")}' file |
column -ts$'\t'
Male Female
20 15 35 22 17 44
15 20 42 19 14 25
0 20 45 18 22 30
Something like this?
awk 'BEGIN{format="%2s %2s %2s %2s\n";printf("Male Female\n"); }NR>1{if (s) { if ($2=="F") {printf(format, s, $3, $4, $5);} else {printf(format, $3,$4,$5,s);} s=""} else {s=sprintf("%2s %2s %2s", $3, $4, $5)}}' file
Another approach using awk:
awk '
BEGIN {
print "Male\t\tFemale"
}
NR > 1 {
I = ++G[$2]
A[$2 FS I] = sprintf("%2d %2d %2d", $(NF-2), $(NF-1), $NF)
}
END {
M = ( G["M"] > G["F"] ? G["M"] : G["F"] )
for ( i = 1; i <= M; i++ )
print A["M" FS i] ? A["M" FS i] : OFS, A["F" FS i] ? A["F" FS i] : OFS
}
' OFS='\t' file
This might work for you (GNU sed):
sed -Ee '1c\Male Female' -e 'N;s/^.. M (.*)\n.. F(.*)/\1\2/;s/^.. F(.*)\n.. M (.*)/\2\1/' file
Change the header line. Then compare a pair of lines and re-arrange them as appropriate.

Finding if a column is in a range

I have two files, and I want to find out whether a column of file1 falls within a range given by columns of file2.
file1.txt
1 19
1 21
1 24
2 22
4 45
file2.txt
1 19 23 A
1 20 28 A
4 42 45 A
I am trying to see if the 1st column of file1.txt is the same as the 1st column of file2.txt and whether the second column of file1.txt is between the 2nd and 3rd columns of file2.txt, and to append the match if it is in the range.
So the output should be:
output.txt
1 19 23 A 1 19
1 19 23 A 1 21
1 20 28 A 1 24
4 42 45 A 4 45
What I have so far checks whether the first columns are the same:
awk 'NR==FNR{c[$1]++;next};c[$1] > 0' file1.txt file2.txt
1 19 23 A
1 20 28 A
4 42 45 A
But I am not able to add the greater-than/less-than conditions.
How do I add them?
The following may also help you here:
while read first second
do
awk -v fir="$first" -v sec="$second" '$1==fir && ($2<=sec && $3>=sec){print $0,fir,sec}' file2
done < "file1"
Using join + awk:
join file2.txt file1.txt | awk '{if ($2 <= $5 && $5 <= $3) { print $1,$2,$3,$4,$1,$5 } }'
First two files are joined on the first column, then the columns are compared and output printed (with the first column printed twice, as join hides it).
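One caveat (my own note): join expects both inputs to be sorted on the join field. If the files might not already be sorted, something along these lines should work (same logic, with explicit sorting; <(...) is bash process substitution):
$ join <(sort -k1,1 file2.txt) <(sort -k1,1 file1.txt) | awk '$2 <= $5 && $5 <= $3 {print $1,$2,$3,$4,$1,$5}'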
Using awk:
$ awk 'NR==FNR{a[$1]=a[$1]" "$2;next} {split(a[$1],b);for(i in b) if(b[i]>=$2 && b[i]<=$3) print $0,$1,b[i]}' file1 file2
1 19 23 A 1 19
1 19 23 A 1 21
1 20 28 A 1 21
1 20 28 A 1 24
4 42 45 A 4 45
The first block statement stores the elements of file1 into the array a. The array index is the first column of the file and the array element is the concatenation of all numbers of the second column with the same number in the first column.
The second block statement loops over the array a element with the same index as the first column and checks whether each number stored there falls within the range.
Another approach is to use join:
$ join -o 1.1 1.2 1.3 1.4 1.1 2.2 file2 file1 | awk '$6 >= $2 && $6 <= $3'
1 19 23 A 1 19
1 19 23 A 1 21
1 20 28 A 1 21
1 20 28 A 1 24
4 42 45 A 4 45
join -o generates the expected output format. The awk statement filters the lines that are in range.

Transposing data based on unique ID - awk

I really hope you can help. I am completely new to (g)awk and I have been fighting with it for the last two weeks.
My original file is as follows - there is a column with a unique Id and another with unique names. Subsequent columns are various courses, and each field contains (when not empty) a mark for that course and that student. So each student has only one mark for each course:
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55
2 George 63
4 Alex 64
1 John 74
3 Emma 63
2 George 64
4 Alex 60
2 George 29
3 Emma 69
1 John 67
3 Emma 80
4 Alex 57
2 George 91
1 John 81
1 John 34
3 Emma 75
2 George 89
4 Alex 49
3 Emma 78
4 Alex 69
5 TERRY 67
6 HELEN 39
This is what I want to achieve - transpose the data, i.e. the marks, based on the unique Id, and place the marks below each corresponding course like below:
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55 69 64 60 49
2 George 29 64 89 91 63
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
5 TERRY 67
6 HELLEN 39
This is what I managed to get so far:
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55
2 George 29
3 Emma 63
4 Alex 57
5 TERRY
6 HELLEN
1 John 69
2 George 64
3 Emma 80
4 Alex 69
5 TERRY 67
6 HELLEN
1 John 64
2 George 89
3 Emma 75
4 Alex 64
5 TERRY
6 HELLEN 39
...and so on
It is really a bit tricky for me to achieve based on what I already know about awk (please note I am not interested in sed/perl etc. based solutions).
If you are going to provide some help (preferably NOT a one-liner), may I ask you to be a bit descriptive, as I am interested in the method as much as in the solution itself.
Any help will be very much appreciated.
EDIT
Here is the code I wrote to reach the last stage (and where I got stuck):
#!/bin/bash
files3="*.csv"
for j in $files3
do
#echo "processing $j..."
fi13=$(awk -F" " '(NR==1){field13=$13;}{print field13}' ./work1/test1YA.csv)
fi14=$(awk -F" " '(NR==1){field14=$14;}{print field14}' ./work1/test1YA.csv)
fi15=$(awk -F" " '(NR==1){field15=$15;}{print field15}' ./work1/test1YA.csv)
fi16=$(awk -F" " '(NR==1){field16=$16;}{print field16}' ./work1/test1YA.csv)
# awk -F" " 'BEGIN{OFS=" ";RS="\n"}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' "$j" >> ./work1/test2YA.csv
awk -F" " -v f13="$fi13" -v f14="$fi14" -v f15="$fi15" -v f16="$fi16" '{if($13==f13){$13=$6;$14=$15=$16=""}if($13==f14){$14=$6;$13=$15=$16=""}if($13==f15){$15=$6;$13=$14=$16=""}if($13==f16){$16=$6;$13=$14=$15=""}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16}}' "$j" >> ./work1/test2YA.csv
done;
awk -F" " 'BEGIN{print "ID","Title","FirstName","MiddleName","LastName","FinalMarks","Status","Username","Campus","Code","Programme","Year","course1","course2","course3","course4"}{print}' ./work1/test2YA.csv >> ./work1/test3YA.csv
Here is a solution for GNU awk:
course.awk
BEGIN { # setup field width for constant field splitting
FIELDWIDTHS = "2 2 12 7 1 7 1 7 1 7 1 7"
# setup sort order (by id)
PROCINFO["sorted_in"] = "#ind_num_asc"
}
NR == 1 { # print header
print
next
}
{
# add ids to names
names[ $1 ] = $3
# store under id and course number the mark if it is present
for( c = 1; c <= 5; c++ ) {
field = 2+ (c*2)
if( $(field) !~ /^ *$/ ) {
marks[ $1, c ] = $(field)
}
}
}
END {
# output
for( id in names ) {
printf("%-4s%-12s%7s %7s %7s %7s %7s\n",id, names[ id ], marks[ id, 1], marks[ id, 2], marks[ id, 3], marks[ id, 4], marks[ id, 5])
}
}
Use it like this: awk -f course.awk your_file.
The fact that the input is not tab delimited but has fixed column widths makes it a bit inelegant:
use of FIELDWIDTHS and %Ns, where N is derived from the FIELDWIDTHS
FIELDWIDTHS takes into account the empty column between ID and Name, Course1 and Course2, ...
the check whether a mark is present: if( $(field) !~ /^ *$/ ) verifies that the field does not consist entirely of spaces.
This could be an approximation in awk:
NR==1{
for(x=1;x<=NF;x++)
{
head=head $x"\t";
}
print head
}
NR>1{
for(i=3;i<=NF;i++)
{
students[$1"\t"$2]=students[$1"\t"$2] "\t"$i;
}
}
END{
for (stu in students)
{
print stu,students[stu];
}
}
Output:
Id Name Course1 Course2 Course3 Course4 Course5
5 TERRY 67
4 Alex 64 60 57 49 69
1 John 55 74 67 81 34
6 HELEN 39
3 Emma 63 69 80 75 78
2 George 63 64 29 91 89
Same ideas, perhaps simpler:
$ awk 'BEGIN{ FIELDWIDTHS="16 8 8 8 8 8"}
NR==1{print;next}
NR>1{keys[$1];
for(i=2;i<=6;i++)
{gsub(" ","",$i);
if($i) a[$1,i]=$i}}
END{for(k in keys)
{printf "%16s",k;
for(i=2;i<=6;i++) printf "%-8s",a[k,i];
print ""}}' file
Id Name Course1 Course2 Course3 Course4 Course5
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
6 HELEN 39
5 TERRY 67
1 John 55 67 81 74 34
2 George 29 64 89 91 63
You can sort the output as well by piping to sort -n:
... | sort -n
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55 67 81 74 34
2 George 29 64 89 91 63
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
5 TERRY 67
6 HELEN 39
With GNU awk for FIELDWIDTHS, 2D arrays, and sorted_in:
$ cat tst.awk
NR==1 {
print
split($0,f,/\S+\s*/,s)
for (i=1;i in s;i++) {
w[i] = length(s[i])
FIELDWIDTHS = FIELDWIDTHS (i>1?" ":"") w[i]
}
next
}
{
sub(/\s*$/," ")
for (i=1;i<=NF;i++) {
if ($i ~ /\S/) {
val[$1][i] = $i
}
}
}
END {
PROCINFO["sorted_in"] = "#ind_num_asc"
for (id in val) {
for (i=1;i<=NF;i++) {
printf "%*s", w[i], val[id][i]
}
print ""
}
}
$ awk -f tst.awk file
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55 67 81 74 34
2 George 29 64 89 91 63
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
5 TERRY 67
6 HELEN 39
Here's my take on this. This works in plain-old awk (doesn't use FIELDWIDTHS), and it automatically adjusts to different numbers of fields (i.e. add a Course7 column and you should be fine). Also, you can point it at multiple files, and it should process each one separately.
#!/usr/bin/awk -f
# Initialize variables on the first record of each input file
# (and also print the header)
#
FNR <= 1 {
print
delete name
delete score
next
}
# Process each line.
#
{
id = substr($0, 0, 16) #
name[id] # Store the unique identifier in an array
pos = 0 #
# Step through the score fields until we hit the end of the line,
# storing scores in another array.
do {
score[id, pos] += substr($0,17+pos*8,8) +0
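# (diagnostic trace of each parsed field; comment this printf out if you only want the final table)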
printf("id='%s' pos=%s value=%s total=%s\n", id, pos, substr($0,17+pos*8,8)+0, score[id, pos] );
} while (17+(++pos)*8 < length())
}
# Keep track of our maximum number of fields
pos>max { max=pos }
# Finally, generate our (randomly sorted) output.
END {
for (id in name) { # Step through the records...
printf("%-12s", id);
for (i=0; i<max; i++) { # Step through the fields...
if (score[id, i]==0) score[id, i]=""
printf("%-8s", score[id, i]);
}
printf("\n")
}
}
It's a bit long but I think it's easier to understand what it does.
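A possible invocation (the script and data file names here are placeholders of mine); thanks to the #!/usr/bin/awk -f line it can be run directly once made executable, and because it resets its arrays on FNR <= 1 you can pass several files in one go:
$ chmod +x transpose.awk
$ ./transpose.awk marks_term1.txt marks_term2.txt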

How do I parse a large file in Linux

I am a beginner with Linux. I have the following flat file, test.txt:
Iteration 1
Telephony
Pass/Fail
5.1.1.1 voiceCallPhoneBook 50 45
5.1.1.4 voiceCallPhoneHistory 50 49
5.1.1.7 receiveCall 100 100
5.1.1.8 deleteContacts 20 19
5.1.1.9 addContacts 20 20
Telephony 16:47:42
Messaging
Pass/Fail
5.1.2.3 openSMS 50 49
5.1.2.1 smsManuallyEntryOption 50 50
5.1.2.2 smsSelectContactsOption 50 50
Messaging 03:26:31
Email
Pass/Fail
Email 00:00:48
Email
Pass/Fail
Email 00:00:40
PIM
Pass/Fail
5.1.6.1 addAppointment 5 0
5.1.6.2 setAlarm 1 0
5.1.6.3 deleteAppointment 5 0
5.1.6.4 deleteAlarm 1 0
5.1.6.5 addTask 1 0
5.1.6.6 openTask 1 0
5.1.6.7 deleteTask 1 0
PIM 00:03:06
Multi-Media
Iteration 2
Telephony
Pass/Fail
5.1.1.1 voiceCallPhoneBook 50 47
5.1.1.4 voiceCallPhoneHistory 50 50
5.1.1.7 receiveCall 100 100
5.1.1.8 deleteContacts 20 20
5.1.1.9 addContacts 20 20
Telephony 04:02:05
Messaging
Pass/Fail
5.1.2.3 openSMS 50 50
5.1.2.1 smsManuallyEntryOption 50 50
5.1.2.2 smsSelectContactsOption 50 50
Messaging 03:20:01
Email
Pass/Fail
Email 00:00:47
Email
Pass/Fail
Email 00:00:40
PIM
Pass/Fail
5.1.6.1 addAppointment 5 5
5.1.6.2 setAlarm 1 1
5.1.6.3 deleteAppointment 5 5
5.1.6.4 deleteAlarm 1 1
5.1.6.5 addTask 1 1
5.1.6.6 openTask 1 1
5.1.6.7 deleteTask 1 1
PIM 00:09:20
Multi-Media
I want to count the number of occurrences of a specific word in the file. E.g. if I search for "voiceCallPhoneBook", it should be displayed as 2 times.
I can use
cat reports.txt | grep "5.1.1.4" | cut -d' ' -f1,4,7,10 |
After running this, I got output like below:
5.1.1.4 voiceCallPhoneBook 50 45
5.1.1.4 voiceCallPhoneBook 50 47
It is a very large file, and I want to make use of loops with bash/awk scripts and also find the average of the sum of the 3rd and 4th column values. I am struggling to write this in a bash script. It would be appreciated if someone could give a solution for it.
Thanks
#!/usr/bin/awk -f
BEGIN{
c3 = 0
c4 = 0
count = 0
}
/voiceCallPhoneBook/{
c3 = c3 + $3;
c4 = c4 + $4;
count++;
}
END{
print "column 3 avg: " c3/count
print "column 4 avg: " c4/count
}
1) Save it in a file, for example countVoiceCall.awk
2) Run it: awk -f countVoiceCall.awk sample.txt
output:
column 3 avg: 50
column 4 avg: 46
A brief explanation:
a. The BEGIN{...} block is used for variable initialization.
b. The /PATTERN/{...} block is used to match your keyword, for example "voiceCallPhoneBook".
c. The END{...} block is used to print the results.
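If you want to search for an arbitrary keyword without editing the script, a minimal sketch (the variable name word is my own choice):
$ awk -v word="voiceCallPhoneBook" '$2 == word {c3+=$3; c4+=$4; n++}
      END {if (n) print word ": count " n ", col3 avg " c3/n ", col4 avg " c4/n}' test.txt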
This will search for lines starting with a test id (5.x.x.x),
make a tally of the 3rd and 4th columns per test,
then print them all out:
awk '/^5\.?\.?\.?/ {a[$1" " $2] +=$3 ; b[$1" " $2] +=$4 }
END{ for (k in a){
printf("%-50s%-10i%-10i\n",k,a[k],b[k])}
}' $1
Duplicate from earlier today is here Parse the large test files using awk
With headers, averages, and occurrence counts, formatted a bit neater for easier reading :)
awk 'BEGIN{
printf("%-50s%-10s%-10s%-10s\n","Name","Col3 Tot","Col4 Tot","Ocurr")
}
/^5\.?\.?\.?/ {
count++
c3 = c3 + $3
c4 = c4 + $4
a[$1" " $2] +=$3
b[$1" " $2] +=$4
c[$1" " $2]++
}
END{
for (k in a)
{printf("%-50s%-10i%-10i%-10i\n",k,a[k],b[k],c[k])}
print "col3 avg: " c3/count "\ncol4 avg: " c4/count
}' $1
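Note that the script above reads the data file name from $1, so assuming it is saved as a shell script (report.sh is a name I made up), you would call it like this:
$ bash report.sh test.txt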
