Transposing data based on unique ID - awk - linux

I really hope you can help. I am completely new to (g)awk and I have been fighting with it for the last two weeks.
My original file is as follows: there is a column with a unique Id and another with unique names. Subsequent columns are the various courses, and each field contains (when not empty) a mark for that course and that student, so each student has only one mark per course:
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55
2 George 63
4 Alex 64
1 John 74
3 Emma 63
2 George 64
4 Alex 60
2 George 29
3 Emma 69
1 John 67
3 Emma 80
4 Alex 57
2 George 91
1 John 81
1 John 34
3 Emma 75
2 George 89
4 Alex 49
3 Emma 78
4 Alex 69
5 TERRY 67
6 HELEN 39
This is what I want to achieve: transpose the data (i.e. the marks) based on the unique Id and place each mark below its corresponding course, like below:
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55 69 64 60 49
2 George 29 64 89 91 63
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
5 TERRY 67
6 HELEN 39
This is what I managed to get so far:
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55
2 George 29
3 Emma 63
4 Alex 57
5 TERRY
6 HELEN
1 John 69
2 George 64
3 Emma 80
4 Alex 69
5 TERRY 67
6 HELEN
1 John 64
2 George 89
3 Emma 75
4 Alex 64
5 TERRY
6 HELEN 39
...and so on
It is really a bit tricky for me to achieve based on what I already know about awk (please note I am not interested in sed/perl etc. based solutions).
If you do provide some help (preferably NOT a one-liner), may I ask you to be a bit descriptive, as I am interested in the method as much as in the solution itself.
Any help will be very much appreciated.
EDIT
Here is the code I wrote to reach the last stage (and where I got stuck)
#!/bin/bash
files3="*.csv"
for j in $files3
do
    #echo "processing $j..."
    fi13=$(awk -F" " '(NR==1){field13=$13;}{print field13}' ./work1/test1YA.csv)
    fi14=$(awk -F" " '(NR==1){field14=$14;}{print field14}' ./work1/test1YA.csv)
    fi15=$(awk -F" " '(NR==1){field15=$15;}{print field15}' ./work1/test1YA.csv)
    fi16=$(awk -F" " '(NR==1){field16=$16;}{print field16}' ./work1/test1YA.csv)
    # awk -F" " 'BEGIN{OFS=" ";RS="\n"}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' "$j" >> ./work1/test2YA.csv
    awk -F" " -v f13="$fi13" -v f14="$fi14" -v f15="$fi15" -v f16="$fi16" '{if($13==f13){$13=$6;$14=$15=$16=""}if($13==f14){$14=$6;$13=$15=$16=""}if($13==f15){$15=$6;$13=$14=$16=""}if($13==f16){$16=$6;$13=$14=$15=""}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16}}' "$j" >> ./work1/test2YA.csv
done
awk -F" " 'BEGIN{print "ID","Title","FirstName","MiddleName","LastName","FinalMarks","Status","Username","Campus","Code","Programme","Year","course1","course2","course3","course4"}{print}' ./work1/test2YA.csv >> ./work1/test3YA.csv

Here is a solution for GNU awk:
course.awk
BEGIN { # setup field widths for constant field splitting
    FIELDWIDTHS = "2 2 12 7 1 7 1 7 1 7 1 7"
    # setup sort order (by id)
    PROCINFO["sorted_in"] = "#ind_num_asc"
}
NR == 1 { # print header
    print
    next
}
{
    # add ids to names
    names[ $1 ] = $3
    # store under id and course number the mark if it is present
    for( c = 1; c <= 5; c++ ) {
        field = 2 + (c*2)
        if( $(field) !~ /^ *$/ ) {
            marks[ $1, c ] = $(field)
        }
    }
}
END {
    # output
    for( id in names ) {
        printf("%-4s%-12s%7s %7s %7s %7s %7s\n", id, names[ id ], marks[ id, 1 ], marks[ id, 2 ], marks[ id, 3 ], marks[ id, 4 ], marks[ id, 5 ])
    }
}
Use it like this: awk -f course.awk your_file.
The fact that the input is not tab-delimited but has fixed column widths makes it a bit inelegant:
use of FIELDWIDTHS and %Ns formats where N is derived from the FIELDWIDTHS (a short standalone demo of FIELDWIDTHS follows below)
FIELDWIDTHS takes into account the empty column between Id and Name, Course1 and Course2, ...
the check whether a mark is present: if( $(field) !~ /^ *$/ ) tests that the field does not consist entirely of spaces.
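If FIELDWIDTHS is unfamiliar, here is a tiny standalone demo (the widths and the sample line are made up for illustration, not taken from the file above) of how gawk splits a fixed-width record:
echo '1 John        55     ' |
gawk 'BEGIN { FIELDWIDTHS = "2 12 7" }
      { printf("id=[%s] name=[%s] mark=[%s]\n", $1, $2, $3) }'
Each field keeps its fixed width, including any padding spaces, which is exactly why the !~ /^ *$/ test above is needed to detect empty marks.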

This could be an approximation in awk:
NR==1{
    for(x=1;x<=NF;x++)
    {
        head=head $x"\t";
    }
    print head
}
NR>1{
    for(i=3;i<=NF;i++)
    {
        students[$1"\t"$2]=students[$1"\t"$2] "\t"$i;
    }
}
END{
    for (stu in students)
    {
        print stu,students[stu];
    }
}
Id Name Course1 Course2 Course3 Course4 Course5
5 TERRY 67
4 Alex 64 60 57 49 69
1 John 55 74 67 81 34
6 HELEN 39
3 Emma 63 69 80 75 78
2 George 63 64 29 91 89
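Note that for (stu in students) visits the array keys in an arbitrary order, which is why the rows above are not sorted by Id. Piping the result through sort -n (as the next answer also suggests) restores the order; assuming the script above is saved as approx.awk:
awk -f approx.awk file | sort -n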

same ideas, perhaps simpler
$ awk 'BEGIN{ FIELDWIDTHS="16 8 8 8 8 8" }
       NR==1{print; next}
       NR>1{keys[$1]
            for(i=2;i<=6;i++) {
                gsub(" ","",$i)
                if($i) a[$1,i]=$i
            }
       }
       END{for(k in keys) {
               printf "%16s", k
               for(i=2;i<=6;i++) printf "%-8s", a[k,i]
               print ""
           }
       }' file
Id Name Course1 Course2 Course3 Course4 Course5
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
6 HELEN 39
5 TERRY 67
1 John 55 67 81 74 34
2 George 29 64 89 91 63
You can sort the output as well by piping to sort -n:
... | sort -n
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55 67 81 74 34
2 George 29 64 89 91 63
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
5 TERRY 67
6 HELEN 39

With GNU awk for FIELDWIDTHS, 2D arrays, and sorted_in:
$ cat tst.awk
NR==1 {
    print
    split($0,f,/\S+\s*/,s)
    for (i=1; i in s; i++) {
        w[i] = length(s[i])
        FIELDWIDTHS = FIELDWIDTHS (i>1 ? " " : "") w[i]
    }
    next
}
{
    sub(/\s*$/," ")
    for (i=1; i<=NF; i++) {
        if ($i ~ /\S/) {
            val[$1][i] = $i
        }
    }
}
END {
    PROCINFO["sorted_in"] = "#ind_num_asc"
    for (id in val) {
        for (i=1; i<=NF; i++) {
            printf "%*s", w[i], val[id][i]
        }
        print ""
    }
}
$ awk -f tst.awk file
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55 67 81 74 34
2 George 29 64 89 91 63
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
5 TERRY 67
6 HELEN 39

Here's my take on this. This works in plain-old awk (doesn't use FIELDWIDTHS), and it automatically adjusts to different numbers of fields (i.e. add a Course7 column and you should be fine). Also, you can point it at multiple files, and it should process each one separately.
#!/usr/bin/awk -f

# Initialize variables on the first record of each input file
# (and also print the header)
#
FNR <= 1 {
    print
    delete name
    delete score
    next
}

# Process each line.
#
{
    id = substr($0, 0, 16)          # the leading Id + Name portion of the fixed-width line
    name[id]                        # Store the unique identifier in an array
    pos = 0
    # Step through the score fields until we hit the end of the line,
    # storing scores in another array.
    do {
        score[id, pos] += substr($0, 17+pos*8, 8) + 0
        printf("id='%s' pos=%s value=%s total=%s\n", id, pos, substr($0,17+pos*8,8)+0, score[id, pos])   # per-field trace output
    } while (17 + (++pos)*8 < length())
}

# Keep track of our maximum number of fields
pos > max { max = pos }

# Finally, generate our (randomly sorted) output.
END {
    for (id in name) {                  # Step through the records...
        printf("%-12s", id)
        for (i = 0; i < max; i++) {     # Step through the fields...
            if (score[id, i] == 0) score[id, i] = ""
            printf("%-8s", score[id, i])
        }
        printf("\n")
    }
}
It's a bit long but I think it's easier to understand what it does.
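In case it helps, this is how such a script is typically run, assuming you saved it as transpose.awk (both file names here are placeholders):
chmod +x transpose.awk
./transpose.awk marks.txt        # or equivalently: awk -f transpose.awk marks.txt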

Related

Using AWK to check conditions in multiple columns to output average, min, max, and total occurrences from a dataset containing age, race, and sex

I am using PuTTY for school to learn UNIX/Linux and have a file 2.asr, which is a large data set containing the age, sex, and race of multiple individuals in their own columns, for example:
19 Male White
23 Female White
23 Male White
45 Female Other
54 Male Asian
24 Male Other
34 Female Asian
23 Male Hispanic
45 Female Hispanic
38 Female White
I would like to find the average age, max age, min age, and total occurrences of unique demographics such as Male White or Female Hispanic.
I've tried using awk code as follows:
$ awk '$2 == "Male" && $3 == "Hispanic" {sum+=$1; n++}
(NR==1) {min=$1;max=$1+0};
(NR>=2) {if(min>$1) min=$1; if(max<$1) max=$1}
END {if (n>0)
print $2 " " $3 " Average Age: " sum/n ", Max: " max ", Min: " min ", Total: " n
}' 2.asr
However, regardless of what sex and race I input, the output is always "Male White" and the max and min values are those of the entire dataset rather than of the unique demographic conditions I've set. It does seem, however, that the average age and total occurrences of each demographic are output properly and change accordingly. I've tried using $2 and $3 at the start of the command in an if statement and utilizing BEGIN at the start also, but I keep getting syntax errors at the end where I have my print function. Is there a better way to approach this with if statements at the start of the command, or is my syntax off somewhere? Thanks to whoever wishes to assist!
do it wholesale
$ awk '{k=$2 FS $3}
       !(k in c) {max[k]=min[k]=$1}
       {sum[k]+=$1; c[k]++}
       max[k]<$1 {max[k]=$1}
       min[k]>$1 {min[k]=$1}
       END {for(k in c) print k, max[k], min[k], sum[k]/c[k]}' file | sort | column -t
Female Asian 34 34 34
Female Hispanic 45 45 45
Female Other 45 45 45
Female White 38 23 30.5
Male Asian 54 54 54
Male Hispanic 23 23 23
Male Other 24 24 24
Male White 23 19 21
add the header
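One way that could be done (a sketch; the header labels are assumed from the Max/Min/Avg order printed above) is to emit a header line before the awk output and let column -t align the whole thing:
{ echo "Sex Race Max Min Avg"
  awk '{k=$2 FS $3}
       !(k in c) {max[k]=min[k]=$1}
       {sum[k]+=$1; c[k]++}
       max[k]<$1 {max[k]=$1}
       min[k]>$1 {min[k]=$1}
       END {for(k in c) print k, max[k], min[k], sum[k]/c[k]}' file | sort
} | column -t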
If this is for a class, it might not be an option, but GNU datamash is a useful tool intended just for this sort of statistics:
$ datamash -Ws -g2,3 mean 1 min 1 max 1 count 1 < input.txt
GroupBy(field-2) GroupBy(field-3) mean(field-1) min(field-1) max(field-1) count(field-1)
Female Asian 34 34 34 1
Female Hispanic 45 45 45 1
Female Other 45 45 45 1
Female White 30.5 23 38 2
Male Asian 54 54 54 1
Male Hispanic 23 23 23 1
Male Other 24 24 24 1
Male White 21 19 23 2
This will let you process all of your demographics at once while avoiding the need to store all of your input in memory at once (sort uses demand paging to handle that if necessary), which may matter since you said your input is a large data set:
$ cat tst.sh
#!/usr/bin/env bash
sort -k2 -k1,1n file |
awk '
BEGIN { OFS="\t" }
{ curr = $2 FS $3 }
curr != prev {
    prt()
    min = $1
    sum = cnt = 0
    prev = curr
}
{
    max = $1
    sum += $1
    cnt++
}
END { prt() }
function prt() {
    if (cnt) {
        print prev, sum/cnt, max, min, cnt
    }
}
'
$ ./tst.sh
Female Asian 34 34 34 1
Female Hispanic 45 45 45 1
Female Other 45 45 45 1
Female White 30.5 38 23 2
Male Asian 54 54 54 1
Male Hispanic 23 23 23 1
Male Other 24 24 24 1
Male White 21 19 23 2
To find only one group, say Female Asian, just change sort -k2 -k1,1n file | to grep 'Female Asian' file | sort -k2 -k1,1n |, or tweak the awk script to test for those values (a sketch of that follows below), or even just pipe the output to grep if you don't care much about efficiency:
$ ./tst.sh | grep 'Female Asian'
Female Asian 34 34 34 1
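If you go the "tweak the awk script" route instead, a minimal sketch (the group is passed in through assumed -v variables, and the output columns mirror tst.sh above) could look like this:
sort -k2 -k1,1n file |
awk -v sex="Female" -v race="Asian" '
    $2 == sex && $3 == race {
        if (!cnt) min = $1   # first matching row has the smallest age thanks to the sort
        max = $1             # the last matching row seen has the largest age so far
        sum += $1
        cnt++
    }
    END { if (cnt) print sex, race, sum/cnt, max, min, cnt }'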
@rockytimmy, your code contained a few logical bugs.
Here is a minimal rewrite and yet keeping to your "original requirements":
awk -v Sex="Female" -v Race="White" '
    BEGIN { max=0; min=999; n=0; sum=0 }
    $2 == Sex && $3 == Race {
        print
        sum += $1
        n++
        if ($1 < min) { min = $1 }
        if ($1 > max) { max = $1 }
    }
    END {
        print Sex " " Race " Average Age: " sum/n ", Max: " max ", Min: " min ", Total: " n
    }' 2.asr
NOTE: All matching entries are also printed out for verification.
Running the above awk script using the sample data you provided prints:
23 Female White
38 Female White
Female White Average Age: 30.5, Max: 38, Min: 23, Total: 2

How to print contents of column fields that have strings composed of "n" character/s using bash?

Say I have a file which contains:
22 30 31 3a 31 32 3a 32 " 0 9 : 1 2 : 2
30 32 30 20 32 32 3a 31 1 2 7 2 2 : 1
And I want to print only the column fields that have strings composed of 1 character. I want the output to be like this:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
Then, I want to print only those strings that are composed of two characters; the output should be:
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
I am a beginner and I really don't know how to do this. Thanks for your help!
Could you please try the following; I am approaching it in a different way for the provided samples. Written and tested with the provided samples only.
To get the values before the BULK SPACE, try:
awk '
{
    line=$0
    while(match($0,/[[:space:]]+/)){
        arr=arr>RLENGTH?arr:RLENGTH
        start[arr]+=RSTART+prev_start
        prev_start=RSTART
        $0=substr($0,RSTART+RLENGTH)
    }
    var=substr(line,1,start[arr]-1)
    sub(/ +$/,"",var)
    print var
    delete start
    var=arr=""
}
' Input_file
Output will be as follows.
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
To get the values after the BULK SPACE, try:
awk '
{
    line=$0
    while(match($0,/[[:space:]]+/)){
        arr=arr>RLENGTH?arr:RLENGTH
        start[arr]+=RSTART+prev_start
        prev_start=RSTART
        $0=substr($0,RSTART+RLENGTH)
    }
    var=substr(line,start[arr])
    sub(/^ +/,"",var)
    print var
    delete start
    var=arr=""
}
' Input_file
Output will be as follows:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
You can try
awk '{for(i=1;i<=NF;++i)if(length($i)==1)printf("%s ", $i);print("")}'
For each field, check its length and print it if it has the desired length. You may pass the -F option to awk if the input is not separated by blanks.
The awk script is expanded as:
for( i = 1; i <= NF; ++i )
    if( length( $i ) == 1 )
        printf( "%s ", $i );
print( "" );
The print outside the loop prints a newline after each input line.
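For example, if the columns happen to be tab-separated and you want the two-character strings instead, the same idea becomes (separator and length here are only illustrative):
awk -F'\t' '{ for (i = 1; i <= NF; ++i) if (length($i) == 2) printf("%s ", $i); print "" }' file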
Assuming all the columns are tab-separated (so you can have a space as a column value, like in the second line of your sample), it's easy to do with a perl one-liner:
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^.$/ } @F' foo.txt
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^..$/ } @F' foo.txt
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31

Problems combining awk scripts

I am trying to use awk to parse a tab-delimited table -- there are several duplicate entries in the first column, and I need to remove the duplicate rows that have the smaller total sum of the other 4 columns in the table. I can remove the first or second row easily, and sum the columns, but I'm having trouble combining the two. For my purposes there will never be more than 2 duplicates.
Example file: http://pastebin.com/u2GBnm2D
Desired output in this case would be to remove the rows:
lmo0330 1 1 0 1
lmo0506 7 21 2 10
And keep the other two rows with the same gene id in the column. The final parsed file would look like this: http://pastebin.com/WgDkm5ui
Here's what I have tried (this doesn't do anything useful as written, but the first part removes the second duplicate, and the second part sums the counts):
awk 'BEGIN {!a[$1]++} {for(i=1;i<=NF;i++) t+=$i; print t; t=0}'
I tried modifying the 2nd part of the script in the best answer of this question: Removing lines containing a unique first field with awk?
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
But unfortunately I don't really understand what's going on well enough to get it working. Can anyone help me out? I think I need to replace the a[$1] > 1 part with something like [remove the first or second duplicate, depending on which has the larger sum].
EDIT: I'm also using GNU Awk 3.1.7 if that matters.
You can use this awk command:
awk 'NR == 1 {
         print
         next
     }
     {
         s = $2 + $3 + $4 + $5
     }
     s >= sum[$1] {
         sum[$1] = s
         if (!($1 in rows))
             a[++n] = $1
         rows[$1] = $0
     }
     END {
         for (i=1; i<=n; i++)
             print rows[a[i]]
     }' file | column -t
Output:
gene SRR034450.out.rpkm_0 SRR034451.out.rpkm_0 SRR034452.out.rpkm_0 SRR034453.out.rpkm_0
lmo0001 160 323 533 293
lmo0002 135 317 504 306
lmo0003 1 4 5 3
lmo0004 35 59 58 48
lmo0005 113 218 257 187
lmo0006 279 519 653 539
lmo0007 563 1053 1165 1069
lmo0008 34 84 203 107
lmo0009 13 45 90 49
lmo0010 57 210 237 169
lmo0011 65 224 247 179
lmo0012 65 226 250 215
lmo0013 342 500 738 682
lmo0014 662 1032 1283 1311
lmo0015 321 413 631 637
lmo0016 175 253 273 325
lmo0017 3 6 6 6
lmo0018 33 38 46 45
lmo0019 13 1 39 1
lmo0020 3 12 28 15
lmo0021 3 4 14 12
lmo0022 2 3 5 1
lmo0023 2 0 3 2
lmo0024 1 0 2 6
lmo0330 1 1 1 3
lmo0506 151 232 60 204

Compare two files having different column numbers and print the requirement to a new file if condition satisfies

I have two files with more than 10000 rows:
File1 (1 col)    File2 (4 cols)
23               23 88 90 0
34               43 74 58 5
43               54 87 52 3
54               73 52 35 4
.                .
.                .
I want to compare each value in file1 with the first column of file2. If it exists, print the value along with the other three values from file2. In this example the output will be:
23 88 90 0
43 74 58 5
54 87 52 3
.
.
I have written the following script, but it is taking too much time to execute.
s1=1; s2=$(wc -l < File1.txt)
while [ $s1 -le $s2 ]
do
    n=$(awk 'NR=="$s1" {print $1}' File1.txt)
    p1=1; p2=$(wc -l < File2.txt)
    while [ $p1 -le $p2 ]
    do
        awk '{if ($1==$n) printf ("%s %s %s %s\n", $1, $2, $3, $4);}' > ofile.txt
        (( p1++ ))
    done
    (( s1++ ))
done
Is there any short/ easy way to do it?
You can do it very concisely using awk:
awk 'FNR==NR{found[$1]++; next} $1 in found'
Test
>>> cat file1
23
34
43
54
>>> cat file2
23 88 90 0
43 74 58 5
54 87 52 3
73 52 35 4
>>> awk 'FNR==NR{found[$1]++; next} $1 in found' file1 file2
23 88 90 0
43 74 58 5
54 87 52 3
What does it do?
FNR==NR checks whether FNR, the record number within the current file, equals NR, the total number of records read so far. This is true only for the first file, file1, because FNR is reset to 1 when awk starts reading a new file.
{found[$1]++; next} If that check is true, this creates an entry in an associative array indexed by $1, the first column of file1, and then skips to the next record.
$1 in found This check is only performed for the second file, file2. If the column 1 value $1 is an index in the associative array found, the entire line is printed (printing the line is the default action, so it does not need to be written explicitly).
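Since the question asks for the result in a new file, the full invocation (reusing the file names from the question) just redirects the output:
awk 'FNR==NR{found[$1]++; next} $1 in found' File1.txt File2.txt > ofile.txt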

sort blocks in tabular file

I would like to sort file A based on column 1, with the blank lines preserved and the top-to-bottom order of occurrence of each value maintained.
I have a tabular file A:
seq1 5 15
seq1 20 34
seq1 50 48
seq1 45 36
seq2 17 20
seq1 55 75
seq1 80 84
seq2 30 48
seq2 55 66
seq3 27 40
I would like to get an output as follows:
seq1 5 15
seq1 20 34
seq1 50 48
seq1 45 36
seq1 55 75
seq1 80 84
seq2 17 20
seq2 30 48
seq2 55 66
seq3 27 40
The blank lines should be preserved.
I have tried using sort but it removes blank lines and doesn't maintain the order from top to bottom.
sort -k1,1 fileA.txt
Could anyone point out what I am missing here?
Many thanks.
With GNU awk for sorting and 2D arrays:
$ cat tst.awk
BEGIN { RS=""; ORS="\n\n" }
{ rec[$1][++cnt[$1]] = $0 }
END {
    PROCINFO["sorted_in"] = "#ind_str_asc"
    for (key in rec) {
        for (nr=1; nr <= cnt[key]; nr++) {
            print rec[key][nr]
        }
    }
}
$
$ gawk -f tst.awk file
seq1 5 15
seq1 20 34
seq1 50 48
seq1 45 36
seq1 55 75
seq1 80 84
seq2 17 20
seq2 30 48
seq2 55 66
seq3 27 40
You'll need gawk version 4.0 at least.
For numeric ordering:
BEGIN { RS=""; ORS="\n\n" }
{ key=gensub(/^[^[:digit:]]+/,"","",$1); rec[key][++cnt[key]] = $0 }
END {
    PROCINFO["sorted_in"] = "#ind_num_asc"
    for (key in rec) {
        for (nr=1; nr <= cnt[key]; nr++) {
            print rec[key][nr]
        }
    }
}
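The only new piece is gensub, which strips any non-digit prefix from the key so that the blocks sort numerically. A quick standalone check of what it does (with a made-up value; a third argument of 1 means "replace the first match"):
$ gawk 'BEGIN { print gensub(/^[^[:digit:]]+/, "", 1, "seq12") }'
12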
