Use printf to format an uneven list - Linux

I have a small list of student grades. I need to format them side by side in two columns depending on the gender of the student: one column Male, the other Female. The problem is that the list doesn't alternate male/female; it is uneven.
I've tried using printf to format the output so the 2 columns are side by side, but the format is ruined because of the uneven list.
Name Gender Mark1 Mark2 Mark3
AA M 20 15 35
BB F 22 17 44
CC F 19 14 25
DD M 15 20 42
EE F 18 22 30
FF M 0 20 45
This is the list I am talking about ^^
awk 'BEGIN {print "Male" " Female"} {if (NR!=1) {if ($2 == "M") {printf "%-s %-s %-s", $3, $4, $5} else if ($2 == "F") {printf "%s %s %s\n", $3, $4 ,$5}}}' text.txt
So I'm getting results like
Male Female
20 15 35 22 17 44
19 14 25
15 20 42 18 22 30
0 20 45
But I want it like this:
Male Female
20 15 35 22 17 44
15 20 42 19 14 25
0 20 45 18 22 30
I haven't added separators yet; I'm just trying to figure this out. I'm not sure if it would be better to put the marks into two arrays depending on gender and then print them out.

Another solution, which also handles the case where the M and F counts are not equal:
$ awk 'NR==1 {print "Male\tFemale"}
NR>1 {k=$2;$1=$2="";sub(/ +/,"");
if(k=="M") m[++mc]=$0; else f[++fc]=$0}
END {max=mc>fc?mc:fc;
for(i=1;i<=max;i++) print (m[i]?m[i]:"-") "\t" (f[i]?f[i]:"-")}' file |
column -ts$'\t'
Male Female
20 15 35 22 17 44
15 20 42 19 14 25
0 20 45 18 22 30
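This buffers the male and female mark strings in the arrays m and f, prints them pairwise in the END block (substituting a dash where the shorter list has run out), and leaves the final alignment to column -t.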

Something like this? (Note it pairs each buffered row with the next row and assumes the pending row is of the opposite gender, which holds for your sample.)
awk 'BEGIN { format = "%2s %2s %2s %2s\n"; printf("Male Female\n") }
     NR > 1 {
         if (s) {                 # a row is pending: pair it with the current one
             if ($2 == "F") printf(format, s, $3, $4, $5)
             else           printf(format, $3, $4, $5, s)
             s = ""
         } else {                 # no pending row yet: buffer this one
             s = sprintf("%2s %2s %2s", $3, $4, $5)
         }
     }' file

Another approach using awk:
awk '
BEGIN {
    print "Male\t\tFemale"
}
NR > 1 {
    # count rows per gender and buffer the three marks, keyed by gender and per-gender index
    I = ++G[$2]
    A[$2 FS I] = sprintf("%2d %2d %2d", $(NF-2), $(NF-1), $NF)
}
END {
    # iterate to the longer of the two lists, printing a bare tab where one side has run out
    M = ( G["M"] > G["F"] ? G["M"] : G["F"] )
    for ( i = 1; i <= M; i++ )
        print A["M" FS i] ? A["M" FS i] : OFS, A["F" FS i] ? A["F" FS i] : OFS
}
' OFS='\t' file

This might work for you (GNU sed):
sed -Ee '1c\Male Female' -e 'N;s/^.. M (.*)\n.. F(.*)/\1\2/;s/^.. F(.*)\n.. M (.*)/\2\1/' file
Change the header line. Then compare a pair of lines and re-arrange them as appropriate.

Related

Using AWK to check conditions in multiple columns to output average, min, max, and total occurrences from a dataset containing age, race, and sex

I am using PuTTY for school to learn UNIX/Linux and have a file, 2.asr, which is a large data set containing the age, sex, and race of multiple individuals in their own columns, for example:
19 Male White
23 Female White
23 Male White
45 Female Other
54 Male Asian
24 Male Other
34 Female Asian
23 Male Hispanic
45 Female Hispanic
38 Female White
I would like to find the average age, max age, min age, and total occurrences of unique demographics such as Male White or Female Hispanic.
I've tried using awk code as follows:
$ awk '$2 == "Male" && $3 == "Hispanic" {sum+=$1; n++}
(NR==1) {min=$1;max=$1+0};
(NR>=2) {if(min>$1) min=$1; if(max<$1) max=$1}
END {if (n>0)
print $2 " " $3 " Average Age: " sum/n ", Max: " max ", Min: " min ", Total: " n
}' 2.asr
However, regardless of what sex and race I input, the output always says "Male White", and the max and min values are those of the entire dataset rather than of the demographic I've selected. The average age and total occurrences of each demographic do seem to be output properly and change accordingly. I've tried putting $2 and $3 in an if statement at the start of the command, and using BEGIN as well, but I keep getting syntax errors around my print statement. Is there a better way to approach this with if statements at the start of the command, or is my syntax off somewhere? Thanks to whoever wishes to assist!
Do it wholesale, computing all the demographics in one pass:
$ awk '{k=$2 FS $3}
!(k in c) {max[k]=min[k]=$1}
{sum[k]+=$1; c[k]++}
max[k]<$1 {max[k]=$1}
min[k]>$1 {min[k]=$1}
END {for(k in c) print k,max[k],min[k],sum[k]/c[k]}' file | sort | column -t
Female Asian 34 34 34
Female Hispanic 45 45 45
Female Other 45 45 45
Female White 38 23 30.5
Male Asian 54 54 54
Male Hispanic 23 23 23
Male Other 24 24 24
Male White 23 19 21
Add a header line if you want one.
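For example (the header names here are just my choice), you can splice a header line in after the sort so that column -t aligns it with the data:
$ awk '{k=$2 FS $3}
  !(k in c) {max[k]=min[k]=$1}
  {sum[k]+=$1; c[k]++}
  max[k]<$1 {max[k]=$1}
  min[k]>$1 {min[k]=$1}
  END {for(k in c) print k,max[k],min[k],sum[k]/c[k]}' file |
  sort | { echo 'Sex Race Max Min Avg'; cat; } | column -t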
If this is for a class, it might not be an option, but GNU datamash is a useful tool intended just for this sort of statistics:
$ datamash -Ws -g2,3 mean 1 min 1 max 1 count 1 < input.txt
GroupBy(field-2) GroupBy(field-3) mean(field-1) min(field-1) max(field-1) count(field-1)
Female Asian 34 34 34 1
Female Hispanic 45 45 45 1
Female Other 45 45 45 1
Female White 30.5 23 38 2
Male Asian 54 54 54 1
Male Hispanic 23 23 23 1
Male Other 24 24 24 1
Male White 21 19 23 2
This will let you process all of your demographics at once while avoiding the need to hold all of your input in memory at once (sort uses demand paging to handle that if necessary), which may matter since you said your input is a large data set:
$ cat tst.sh
#!/usr/bin/env bash
sort -k2 -k1,1n file |
awk '
BEGIN { OFS="\t" }
{ curr = $2 FS $3 }
curr != prev {
prt()
min = $1
sum = cnt = 0
prev = curr
}
{
max = $1
sum += $1
cnt++
}
END { prt() }
function prt() {
if (cnt) {
print prev, sum/cnt, max, min, cnt
}
}
'
$ ./tst.sh
Female Asian 34 34 34 1
Female Hispanic 45 45 45 1
Female Other 45 45 45 1
Female White 30.5 38 23 2
Male Asian 54 54 54 1
Male Hispanic 23 23 23 1
Male Other 24 24 24 1
Male White 21 19 23 2
To only find one group, say Female Asian, just change the line sort -k2 -k1,1n file | to grep 'Female Asian' file | sort -k2 -k1,1n | instead; or tweak the awk script to test for those values; or even just pipe the output to grep if you don't care much about efficiency:
$ ./tst.sh | grep 'Female Asian'
Female Asian 34 34 34 1
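A sketch of the "tweak the awk script" option mentioned above: guard the prt() function so it only reports the group you care about (the group name here is just an example):
function prt() {
    # prev holds "$2 FS $3", so compare against the same construction
    if (cnt && prev == "Female" FS "Asian") {
        print prev, sum/cnt, max, min, cnt
    }
}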
@rockytimmy, your code contained a few logical bugs.
Here is a minimal rewrite that keeps to your original requirements:
awk -v Sex="Female" -v Race="White" '
BEGIN { max=0; min=999; n=0; sum=0 }   # min starts high so the first match lowers it
$2 == Sex && $3 == Race {
    print
    sum += $1
    n++
    if ($1 < min) min = $1
    if ($1 > max) max = $1
}
END { if (n > 0)    # guard against dividing by zero when nothing matched
          print Sex " " Race " Average Age: " sum/n ", Max: " max ", Min: " min ", Total: " n
}' 2.asr
NOTE: All matching entries are also printed out for verification.
Running the above awk script using the sample data you provided prints:
23 Female White
38 Female White
Female White Average Age: 30.5, Max: 38, Min: 23, Total: 2

How to print the contents of column fields whose strings are composed of n characters, using bash?

Say I have a file which contains:
22 30 31 3a 31 32 3a 32 " 0 9 : 1 2 : 2
30 32 30 20 32 32 3a 31 1 2 7 2 2 : 1
And, I want to print only the column fields that have string composed of 1 character. I want the output to be like this:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
Then, I want to print only those strings that are composed of two characters, the output should be:
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
I am a beginner and I really don't know how to do this. Thanks for your help!
You could try the following. It takes a different approach and was written and tested only with the provided samples.
To get the values before the BULK SPACE, try:
awk '
{
line=$0
while(match($0,/[[:space:]]+/)){
arr=arr>RLENGTH?arr:RLENGTH
start[arr]+=RSTART+prev_start
prev_start=RSTART
$0=substr($0,RSTART+RLENGTH)
}
var=substr(line,1,start[arr]-1)
sub(/ +$/,"",var)
print var
delete start
var=arr=""
}
' Input_file
Output will be as follows.
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
To get the values after the BULK SPACE, try:
awk '
{
line=$0
while(match($0,/[[:space:]]+/)){
arr=arr>RLENGTH?arr:RLENGTH
start[arr]+=RSTART+prev_start
prev_start=RSTART
$0=substr($0,RSTART+RLENGTH)
}
var=substr(line,start[arr])
sub(/^ +/,"",var)
print var
delete start
var=arr=""
}
' Input_file
Output will be as follows:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
You can try
awk '{for(i=1;i<=NF;++i)if(length($i)==1)printf("%s ", $i);print("")}'
For each field, check the length and print it if it's desired. You may pass the -F option to awk if it's not separated by blanks.
The awk script is expanded as:
for( i = 1; i <= NF; ++i )
if( length( $i ) == 1 )
printf( "%s ", $i );
print( "" );
The print outside the loop emits a newline after each input line.
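Following the same approach, only the length test changes for the two-character case the question also asks about:
awk '{for(i=1;i<=NF;++i)if(length($i)==2)printf("%s ", $i);print("")}'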
Assuming all the columns are tab-separated (so you can have a space as a column value, like the second line of your sample), this is easy to do with a perl one-liner:
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^.$/ } @F' foo.txt
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^..$/ } @F' foo.txt
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31

Find lines with a common value in a particular column

Suppose I have a file like this
5 kata 45 buu
34 tuy 3 rre
21 ppo 90 ty
21 ret 60 buu
09 ret 89 ty
21 plk 1 uio
23 kata 90 ty
I want to output only the lines that contain repeated values in the 4th column. Therefore, my desired output would be:
5 kata 45 buu
21 ppo 90 ty
21 ret 60 buu
09 ret 89 ty
23 kata 90 ty
How can I perform this task?
I can identify and isolate the column of my interest with:
awk -F"," '{print $4}' file1 > file1_temp
and then check if there are repeated values and how many with:
awk '{dups[$1]++} END{for (num in dups) {print num,dups[num]}}' file1_temp
but that's definitely not what I would like to do.
A simple way to preserve the ordering would be to run through the file twice. The first time, keep a record of the counts, then print the ones with a count greater than 1 on the second pass:
awk 'NR == FNR { ++count[$4]; next } count[$4] > 1' file file
If you prefer not to loop through the file twice, you can keep track of things in a few arrays and do the printing in the END block:
awk '{ line[NR] = $0; col[NR] = $4; ++count[$4] }
END { for (i = 1; i <= NR; ++i) if (count[col[i]] > 1) print line[i] }' file
Here line stores the contents of the whole line, col stores the fourth column and count does the same as before.

Compare two files having different numbers of columns and print matching rows to a new file

I have two files with more than 10000 rows:
File1 has 1 col:    File2 has 4 cols:
23                  23 88 90 0
34                  43 74 58 5
43                  54 87 52 3
54                  73 52 35 4
.                   .
.                   .
I want to compare each value in File1 with the first column of File2. If it exists, print that value along with the other three values from File2. In this example the output will be:
23 88 90 0
43 74 58 5
54 87 52 3
.
.
I have written the following script, but it is taking too much time to execute.
s1=1; s2=$(wc -l < File1.txt)
while [ $s1 -le $s2 ]
do n=$(awk 'NR=="$s1" {print $1}' File1.txt)
p1=1; p2=$(wc -l < File2.txt)
while [ $p1 -le $p2 ]
do awk '{if ($1==$n) printf ("%s %s %s %s\n", $1, $2, $3, $4);}'> ofile.txt
(( p1++ ))
done
(( s1++ ))
done
Is there any short/easy way to do it?
You can do it very concisely using awk:
awk 'FNR==NR{found[$1]++; next} $1 in found'
Test
>>> cat file1
23
34
43
54
>>> cat file2
23 88 90 0
43 74 58 5
54 87 52 3
73 52 35 4
>>> awk 'FNR==NR{found[$1]++; next} $1 in found' file1 file2
23 88 90 0
43 74 58 5
54 87 52 3
What it does:
FNR==NR checks whether FNR, the record number within the current file, equals NR, the record number across all input. The two are equal only while awk reads the first file, file1, because FNR is reset to 1 whenever awk starts a new file.
{found[$1]++; next} If the check is true, this records $1 (the first column of file1) as an index in the associative array found, then skips to the next record.
$1 in found This check only fires for the second file, file2. If the column-1 value $1 is an index in the associative array found, the entire line is printed (no action is written because printing the record is awk's default action).

Gawk print largest value from each column

I am writing an awk script that takes some columns of input in a text file and prints out the largest value in each column.
Input:
$ cat numbers
10 20 30.3 40.5
20 30 45.7 66.1
40 75 107.2 55.6
50 20 30.3 40.5
60 30 45.O 66.1
70 1134.7 50 70
80 75 107.2 55.6
Output:
80 1134.7 107.2 70
Script:
BEGIN {
val=0;
line=1;
}
{
if( $2 > $3 )
{
if( $2 > val )
{
val=$2;
line=$0;
}
}
else
{
if( $3 > val )
{
val=$3;
line=$0;
}
}
}
END{
print line
}
Current output:
60 30 45.O 66.1
What am I doing wrong? This is my first awk script.
=======SOLUTION=======
{
    # track the running maximum seen in every column
    for (i = 0; ++i <= NF;)
        $i > m[i] && m[i] = $i
}
END {
    # print the maxima separated by FS, with RS (a newline) after the last one
    for (i = 0; ++i <= NF;)
        printf "%s", (m[i] (i < NF ? FS : RS))
}
Thanks for the help
Since you have four columns, you'll need at least four variables, one for each column (or an array if you prefer). And you won't need to hold any line in its entirety. Treat each column independently.
You need to adapt something like the following for your purposes; it finds the maximum in a particular column (the second in this case).
awk 'BEGIN {max = 0} {if ($2>max) max=$2} END {print max}' numbers.dat
The approach you are taking with $2 > $3 seems to be comparing two columns with each other.
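A sketch extending that single-column pattern to all four columns (same numbers.dat as above). The +0 forces a numeric comparison, which matters here because the sample data contains 45.O, presumably a typo for 45.0; without the coercion awk would fall back to string comparison for that field:
awk '{ if ($1+0 > max1) max1 = $1+0   # running maxima; uninitialized variables start at 0,
       if ($2+0 > max2) max2 = $2+0   # which is safe because the data is non-negative
       if ($3+0 > max3) max3 = $3+0
       if ($4+0 > max4) max4 = $4+0 }
     END { print max1, max2, max3, max4 }' numbers.dat
For the sample data this prints 80 1134.7 107.2 70, matching the expected output above.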
You can create one user-defined function and then pass the individual column arrays to it to retrieve the max value. Something like this:
[jaypal:~/Temp] cat numbers
10 20 30.3 40.5
20 30 45.7 66.1
40 75 107.2 55.6
50 20 30.3 40.5
60 30 45.O 66.1
70 1134.7 50.0 70
80 75 107.2 55.6
[jaypal:~/Temp] awk '
function max(x){i=0;for(val in x){if(i<=x[val]){i=x[val];}}return i;}
{a[$1]=$1;b[$2]=$2;c[$3]=$3;d[$4]=$4;next}
END{col1=max(a);col2=max(b);col3=max(c);col4=max(d);print col1,col2,col3,col4}' numbers
80 1134.7 107.2 70
Or, since all the values here are positive (an uninitialized awk variable compares as 0), simply:
awk 'a<$1{a=$1}b<$2{b=$2}c<$3{c=$3}d<$4{d=$4} END{print a,b,c,d}' numbers
