sort blocks in tabular file - linux

I would like to sort file A based on column 1, with the blank lines preserved and the top-to-bottom order of occurrence maintained within each key.
I have a tabular file A:
seq1 5 15
seq1 20 34
seq1 50 48
seq1 45 36
seq2 17 20
seq1 55 75
seq1 80 84
seq2 30 48
seq2 55 66
seq3 27 40
I would like to get an output as follows:
seq1 5 15
seq1 20 34
seq1 50 48
seq1 45 36
seq1 55 75
seq1 80 84
seq2 17 20
seq2 30 48
seq2 55 66
seq3 27 40
The blank lines should be preserved.
I have tried using sort but it removes blank lines and doesn't maintain the order from top to bottom.
sort -k1,1 fileA.txt
Could anyone point out what I am missing here?
Many thanks.

With GNU awk for sorting and 2D arrays:
$ cat tst.awk
BEGIN { RS=""; ORS="\n\n" }    # paragraph mode: each blank-line-separated block is one record
{ rec[$1][++cnt[$1]] = $0 }    # group the blocks by their first field, preserving input order
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"    # visit the keys in ascending string order
    for (key in rec) {
        for (nr=1; nr <= cnt[key]; nr++) {
            print rec[key][nr]
        }
    }
}
$
$ gawk -f tst.awk file
seq1 5 15
seq1 20 34
seq1 50 48
seq1 45 36
seq1 55 75
seq1 80 84
seq2 17 20
seq2 30 48
seq2 55 66
seq3 27 40
You'll need at least gawk 4.0, which introduced true multidimensional arrays and PROCINFO["sorted_in"].
For numeric ordering:
BEGIN { RS=""; ORS="\n\n" }
{ key = gensub(/^[^[:digit:]]+/, "", 1, $1); rec[key][++cnt[key]] = $0 }    # strip the leading non-digits so the keys sort numerically
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (key in rec) {
        for (nr=1; nr <= cnt[key]; nr++) {
            print rec[key][nr]
        }
    }
}
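If gawk 4.0 isn't available, a decorate-sort-undecorate pipeline over the same paragraph-mode records is one possible fallback. This is only a sketch: it assumes GNU sort (for the stable -s flag) and that the control character \001 never occurs in the data:
awk 'BEGIN{RS=""} {key=$1; gsub(/\n/,"\001"); print key "\t" $0}' fileA.txt |
sort -s -k1,1 |
awk '{sub(/^[^\t]*\t/,""); gsub(/\001/,"\n"); print $0 "\n"}'
The first awk flattens each block onto one line prefixed with its sort key, sort orders the blocks stably by that key, and the last awk strips the key and restores the newlines and blank separators.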

Related

How to print contents of column fields that have strings composed of "n" character/s using bash?

Say I have a file which contains:
22 30 31 3a 31 32 3a 32        " 0 9 : 1 2 : 2
30 32 30 20 32 32 3a 31        1 2 7 2 2 : 1
And I want to print only the column fields whose strings are composed of one character. I want the output to be like this:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
Then, I want to print only those strings that are composed of two characters; the output should be:
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
I am a beginner and I really don't know how to do this. Thanks for your help!
Could you please try the following? I am approaching it a different way; it was written and tested only with the provided samples.
To get the values before the bulk space (the widest run of whitespace), try:
awk '
{
    line = $0
    # find the widest run of whitespace on the line and record where it starts
    while (match($0, /[[:space:]]+/)) {
        arr = arr > RLENGTH ? arr : RLENGTH
        start[arr] += RSTART + prev_start
        prev_start = RSTART
        $0 = substr($0, RSTART + RLENGTH)
    }
    # everything before the widest gap
    var = substr(line, 1, start[arr] - 1)
    sub(/ +$/, "", var)
    print var
    delete start
    var = arr = ""
}
' Input_file
Output will be as follows:
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
To get the values after the bulk space, try:
awk '
{
    line = $0
    # locate the widest run of whitespace, as above
    while (match($0, /[[:space:]]+/)) {
        arr = arr > RLENGTH ? arr : RLENGTH
        start[arr] += RSTART + prev_start
        prev_start = RSTART
        $0 = substr($0, RSTART + RLENGTH)
    }
    # everything after the widest gap
    var = substr(line, start[arr])
    sub(/^ +/, "", var)
    print var
    delete start
    var = arr = ""
}
' Input_file
Output will be as follows:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
You can try
awk '{for(i=1;i<=NF;++i)if(length($i)==1)printf("%s ", $i);print("")}'
For each field, check its length and print it if it matches. You may pass the -F option to awk if the input is not separated by blanks.
The awk script is expanded as:
for (i = 1; i <= NF; ++i)
    if (length($i) == 1)
        printf("%s ", $i);
print("");
The print outside the loop emits a newline after each input line.
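For the question's second requirement (strings composed of two characters), the same loop with the length test changed to 2 should work:
awk '{for(i=1;i<=NF;++i)if(length($i)==2)printf("%s ", $i);print("")}'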
Assuming all the columns are tab-separated (so you can have a space as a column value, like the second line of your sample), this is easy to do with a Perl one-liner:
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^.$/ } @F' foo.txt
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^..$/ } @F' foo.txt
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31

Perl string weirdness : equal strings being not equal?

I am using Perl v5.16.2
I am using the Net::SMPP module, and it returns some data.
If I dump this data, I get this (simplified):
$VAR1 = bless( {
    'receipted_message_id' => '400002F6E09C61701222120140',
    '30' => '400002F6E09C61701222120140'
}, 'Net::SMPP::PDU' );
Now, let's assume this data is in $pdu and I do this:
$message_id = $pdu->{30}; # or $pdu->{receipted_message_id}, same result
myfunction($message_id);
Then, I have myfunction defined as:
sub myfunction {
    my $message_id = shift;
    my $message_id_static = '400002F6E09C61701222120140';
    print Dumper($message_id);
    print Dumper($message_id_static);
    print hexdump($message_id);
    print hexdump($message_id_static);
    if ($message_id eq $message_id_static) {
        print "match\n";
    }
    else {
        print "no match\n";
    }
}
The output of the program is:
$VAR1 = '400002F6E09C61701222120140';
$VAR1 = '400002F6E09C61701222120140';
Data::Hexdumper: data length isn't an integer multiple of lines
so has been padded with NULLs at the end.
0x0000 : 34 30 30 30 30 32 46 36 45 30 39 43 36 31 37 30 : 400002F6E09C6170
0x0010 : 31 32 32 32 31 32 30 31 34 30 00 00 00 00 00 00 : 1222120140......
Data::Hexdumper: data length isn't an integer multiple of lines
so has been padded with NULLs at the end.
0x0000 : 34 30 30 30 30 32 46 36 45 30 39 43 36 31 37 30 : 400002F6E09C6170
0x0010 : 31 32 32 32 31 32 30 31 34 30 00 00 00 00 00 00 : 1222120140......
no match
This doesn't make any sense to me!
If I try to use $message_id in an SQLite query, it fails miserably. If I use $message_id_static instead, it works perfectly.
So, is this a weird internal Perl bug, or am I missing something?
This has been driving me nuts for hours...
EDIT:
Using the Perl debugger, I get this:
DB<3> x $message_id_static
0 '400002F6E09C61701222120140'
DB<4> x $message_id
0 "400002F6E09C61701222120140\c#"
So at least I see there is a difference in the strings, but why isn't it seen by the hexdump, and what is that \c#?
Thanks!
The \c# character is Ctrl-#, which is the ASCII NUL character at code point zero.
You can't see it in your hexdump output because it is indistinguishable from the 00 padding at the end of the dump.
If you set $Data::Dumper::Useqq = 1, then it will be visible in the output from print Dumper $message_id.
You can remove it from the variable with s/\0\z// or tr/\0//d, but you should really investigate why it is there in the first place.
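Putting both suggestions together, here is a minimal sketch (assuming $pdu is the Net::SMPP::PDU object from the question):
use Data::Dumper;
$Data::Dumper::Useqq = 1;    # render control characters visibly

my $message_id = $pdu->{30};
print Dumper($message_id);   # the trailing NUL is now visible in the dumped string

$message_id =~ tr/\0//d;     # delete any NUL bytes before comparing or querying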

Transposing data based on unique ID - awk

I really hope you can help. I am completely new to (g)awk and I have been fighting with it for the last two weeks.
My original file is as follows - there is a column with a unique Id and another with unique names. Subsequent columns are various courses, and each field contains (when not empty) a mark for that course and that student. Each student has only one mark per course:
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55
2 George 63
4 Alex 64
1 John 74
3 Emma 63
2 George 64
4 Alex 60
2 George 29
3 Emma 69
1 John 67
3 Emma 80
4 Alex 57
2 George 91
1 John 81
1 John 34
3 Emma 75
2 George 89
4 Alex 49
3 Emma 78
4 Alex 69
5 TERRY 67
6 HELEN 39
This is what I want to achieve - transpose the data (i.e. the marks) based on the unique Id, placing the marks below each corresponding course, like below:
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55 69 64 60 49
2 George 29 64 89 91 63
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
5 TERRY 67
6 HELLEN 39
This is what I managed to get so far:
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55
2 George 29
3 Emma 63
4 Alex 57
5 TERRY
6 HELLEN
1 John 69
2 George 64
3 Emma 80
4 Alex 69
5 TERRY 67
6 HELLEN
1 John 64
2 George 89
3 Emma 75
4 Alex 64
5 TERRY
6 HELLEN 39
...and so on
It is really a bit tricky for me to achieve with what I already know about awk (please note I am not interested in sed/perl etc. based solutions).
If you can provide some help (preferably NOT a one-liner), may I ask you to be a bit descriptive, as I am interested in the method as much as in the solution itself.
Any help will be very much appreciated.
EDIT
Here is the code I wrote to reach the last stage (and where I got stuck):
#!/bin/bash
files3="*.csv"
for j in $files3
do
    #echo "processing $j..."
    fi13=$(awk -F" " '(NR==1){field13=$13;}{print field13}' ./work1/test1YA.csv)
    fi14=$(awk -F" " '(NR==1){field14=$14;}{print field14}' ./work1/test1YA.csv)
    fi15=$(awk -F" " '(NR==1){field15=$15;}{print field15}' ./work1/test1YA.csv)
    fi16=$(awk -F" " '(NR==1){field16=$16;}{print field16}' ./work1/test1YA.csv)
    # awk -F" " 'BEGIN{OFS=" ";RS="\n"}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' "$j" >> ./work1/test2YA.csv
    awk -F" " -v f13="$fi13" -v f14="$fi14" -v f15="$fi15" -v f16="$fi16" '{if($13==f13){$13=$6;$14=$15=$16=""}if($13==f14){$14=$6;$13=$15=$16=""}if($13==f15){$15=$6;$13=$14=$16=""}if($13==f16){$16=$6;$13=$14=$15=""}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16}}' "$j" >> ./work1/test2YA.csv
done
awk -F" " 'BEGIN{print "ID","Title","FirstName","MiddleName","LastName","FinalMarks","Status","Username","Campus","Code","Programme","Year","course1","course2","course3","course4"}{print}' ./work1/test2YA.csv >> ./work1/test3YA.csv
Here is a solution for GNU awk:
course.awk
BEGIN { # setup field widths for fixed-width field splitting
    FIELDWIDTHS = "2 2 12 7 1 7 1 7 1 7 1 7"
    # setup sort order (by id)
    PROCINFO["sorted_in"] = "@ind_num_asc"
}
NR == 1 { # print header
    print
    next
}
{
    # add ids to names
    names[$1] = $3
    # store the mark under id and course number, if it is present
    for (c = 1; c <= 5; c++) {
        field = 2 + (c * 2)
        if ($(field) !~ /^ *$/) {
            marks[$1, c] = $(field)
        }
    }
}
END {
    # output
    for (id in names) {
        printf("%-4s%-12s%7s %7s %7s %7s %7s\n", id, names[id], marks[id, 1], marks[id, 2], marks[id, 3], marks[id, 4], marks[id, 5])
    }
}
Use it like this: awk -f course.awk your_file.
The fact that the input is not tab-delimited but has fixed column widths makes it a bit inelegant:
use of FIELDWIDTHS and %Ns formats, where N is derived from the FIELDWIDTHS
FIELDWIDTHS takes into account the empty column between Id and Name, Course1 and Course2, ...
the check whether a mark is present: if( $(field) !~ /^ *$/ ) tests that the field does not consist entirely of spaces
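As a quick standalone illustration of how FIELDWIDTHS splits a record into fixed-width fields (not tied to the question's data):
$ echo 'AB  12' | gawk 'BEGIN{ FIELDWIDTHS = "2 2 2" } { print "[" $1 "][" $2 "][" $3 "]" }'
[AB][  ][12]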
This could be an approximation in awk:
NR==1 {
    for (x=1; x<=NF; x++) {
        head = head $x "\t"
    }
    print head
}
NR>1 {
    for (i=3; i<=NF; i++) {
        students[$1 "\t" $2] = students[$1 "\t" $2] "\t" $i
    }
}
END {
    for (stu in students) {
        print stu, students[stu]
    }
}
Id Name Course1 Course2 Course3 Course4 Course5
5 TERRY 67
4 Alex 64 60 57 49 69
1 John 55 74 67 81 34
6 HELEN 39
3 Emma 63 69 80 75 78
2 George 63 64 29 91 89
same ideas, perhaps simpler
$ awk 'BEGIN{ FIELDWIDTHS="16 8 8 8 8 8" }
       NR==1 { print; next }
       NR>1  { keys[$1]
               for (i=2; i<=6; i++) {
                   gsub(" ","",$i)
                   if ($i) a[$1,i] = $i
               }
             }
       END   { for (k in keys) {
                   printf "%16s", k
                   for (i=2; i<=6; i++) printf "%-8s", a[k,i]
                   print ""
               }
             }' file
Id Name Course1 Course2 Course3 Course4 Course5
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
6 HELEN 39
5 TERRY 67
1 John 55 67 81 74 34
2 George 29 64 89 91 63
you can sort the output as well by piping to sort -n
... | sort -n
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55 67 81 74 34
2 George 29 64 89 91 63
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
5 TERRY 67
6 HELEN 39
With GNU awk for FIELDWIDTHS, 2D arrays, and sorted_in:
$ cat tst.awk
NR==1 {
    print
    split($0,f,/\S+\s*/,s)
    for (i=1; i in s; i++) {
        w[i] = length(s[i])
        FIELDWIDTHS = FIELDWIDTHS (i>1 ? " " : "") w[i]
    }
    next
}
{
    sub(/\s*$/," ")
    for (i=1; i<=NF; i++) {
        if ($i ~ /\S/) {
            val[$1][i] = $i
        }
    }
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (id in val) {
        for (i=1; i<=NF; i++) {
            printf "%*s", w[i], val[id][i]
        }
        print ""
    }
}
$ awk -f tst.awk file
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55 67 81 74 34
2 George 29 64 89 91 63
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
5 TERRY 67
6 HELEN 39
Here's my take on this. This works in plain-old awk (doesn't use FIELDWIDTHS), and it automatically adjusts to different numbers of fields (i.e. add a Course7 column and you should be fine). Also, you can point it at multiple files, and it should process each one separately.
#!/usr/bin/awk -f
# Initialize variables on the first record of each input file
# (and also print the header).
#
FNR <= 1 {
    print
    delete name
    delete score
    next
}

# Process each line.
#
{
    id = substr($0, 0, 16)   # the Id+Name prefix of the line
    name[id]                 # store the unique identifier in an array
    pos = 0
    # Step through the score fields until we hit the end of the line,
    # storing scores in another array.
    do {
        score[id, pos] += substr($0, 17+pos*8, 8) + 0
        # debug trace; remove once the script behaves as expected
        printf("id='%s' pos=%s value=%s total=%s\n", id, pos, substr($0, 17+pos*8, 8)+0, score[id, pos])
    } while (17 + (++pos)*8 < length())
}

# Keep track of our maximum number of fields.
pos > max { max = pos }

# Finally, generate our (randomly sorted) output.
END {
    for (id in name) {               # step through the records...
        printf("%-12s", id)
        for (i = 0; i < max; i++) {  # step through the fields...
            if (score[id, i] == 0) score[id, i] = ""
            printf("%-8s", score[id, i])
        }
        printf("\n")
    }
}
It's a bit long, but I think that makes it easier to understand what it does.
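Usage might look like this, with transpose.awk standing in as a placeholder name for the saved, executable script (the data file names are placeholders too):
$ ./transpose.awk marks1.txt marks2.txt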

Compare two files having different column numbers and print the requirement to a new file if condition satisfies

I have two files with more than 10000 rows:
File1 (1 col)    File2 (4 cols)
23               23 88 90 0
34               43 74 58 5
43               54 87 52 3
54               73 52 35 4
.                .
.                .
I want to compare each value in file-1 with the first column of file-2. If it exists there, print that value along with the other three values in file-2. In this example the output will be:
23 88 90 0
43 74 58 5
54 87 52 3
.
.
I have written the following script, but it takes too much time to execute.
s1=1; s2=$(wc -l < File1.txt)
while [ $s1 -le $s2 ]
do
    n=$(awk 'NR=="$s1" {print $1}' File1.txt)
    p1=1; p2=$(wc -l < File2.txt)
    while [ $p1 -le $p2 ]
    do
        awk '{if ($1==$n) printf ("%s %s %s %s\n", $1, $2, $3, $4);}' > ofile.txt
        (( p1++ ))
    done
    (( s1++ ))
done
Is there any short/easy way to do it?
You can do this very concisely with awk:
awk 'FNR==NR{found[$1]++; next} $1 in found'
Test
>>> cat file1
23
34
43
54
>>> cat file2
23 88 90 0
43 74 58 5
54 87 52 3
73 52 35 4
>>> awk 'FNR==NR{found[$1]++; next} $1 in found' file1 file2
23 88 90 0
43 74 58 5
54 87 52 3
What does it do?
FNR==NR checks whether FNR, the record number within the current file, equals NR, the total record number. This is true only while reading the first file, file1, because FNR is reset to 1 when awk starts a new file, while NR keeps counting.
{found[$1]++; next} If the check is true, this creates an associative array indexed by $1, the first column in file1, and next skips to the next record.
$1 in found This check is done only for the second file, file2. If the column 1 value, $1, is an index in the associative array found, the entire line is printed (no action needs to be written, because printing the line is awk's default action).
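Since the title asks for the result in a new file, the same one-liner just needs a redirection (reusing the ofile.txt name from the question's script):
awk 'FNR==NR{found[$1]++; next} $1 in found' file1 file2 > ofile.txt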

Gawk print largest value from each column

I am writing an awk script that takes some columns of input in a text file and prints out the largest value in each column.
Input:
$ cat numbers
10 20 30.3 40.5
20 30 45.7 66.1
40 75 107.2 55.6
50 20 30.3 40.5
60 30 45.O 66.1
70 1134.7 50 70
80 75 107.2 55.6
Output:
80 1134.7 107.2 70
Script:
BEGIN {
    val = 0
    line = 1
}
{
    if ($2 > $3) {
        if ($2 > val) {
            val = $2
            line = $0
        }
    }
    else {
        if ($3 > val) {
            val = $3
            line = $0
        }
    }
}
END {
    print line
}
Current output:
60 30 45.O 66.1
What am I doing wrong? This is my first awk script.
======= SOLUTION =======
{
    for (i = 0; ++i <= NF;)
        $i > m[i] && m[i] = $i
}
END {
    for (i = 0; ++i <= NF;)
        printf "%s", (m[i] (i < NF ? FS : RS))
}
Thanks for the help
Since you have four columns, you'll need at least four variables, one for each column (or an array if you prefer). And you won't need to hold any line in its entirety. Treat each column independently.
You need to adapt something like the following for your purposes; it finds the maximum in a particular column (the second, in this case).
awk 'BEGIN {max = 0} {if ($2>max) max=$2} END {print max}' numbers.dat
The approach you are taking with $2 > $3 seems to be comparing two columns with each other.
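To extend that single-column idea to every column at once, an array indexed by the column number should work; here is a sketch along the same lines (the +0 forces numeric comparison):
awk '{ for (i = 1; i <= NF; i++) if ($i+0 > max[i]+0) max[i] = $i }
END { for (i = 1; i <= NF; i++) printf "%s%s", max[i], (i < NF ? OFS : ORS) }' numbers.dat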
You can create one user-defined function and then pass the individual column arrays to it to retrieve the max value. Something like this:
[jaypal:~/Temp] cat numbers
10 20 30.3 40.5
20 30 45.7 66.1
40 75 107.2 55.6
50 20 30.3 40.5
60 30 45.O 66.1
70 1134.7 50.0 70
80 75 107.2 55.6
[jaypal:~/Temp] awk '
function max(x){i=0;for(val in x){if(i<=x[val]){i=x[val];}}return i;}
{a[$1]=$1;b[$2]=$2;c[$3]=$3;d[$4]=$4;next}
END{col1=max(a);col2=max(b);col3=max(c);col4=max(d);print col1,col2,col3,col4}' numbers
80 1134.7 107.2 70
or
awk 'a<$1{a=$1}b<$2{b=$2}c<$3{c=$3}d<$4{d=$4} END{print a,b,c,d}' numbers
