How to print the contents of column fields whose strings are composed of "n" characters using bash? - linux

Say I have a file which contains:
22 30 31 3a 31 32 3a 32 " 0 9 : 1 2 : 2
30 32 30 20 32 32 3a 31 1 2 7 2 2 : 1
And I want to print only the column fields whose strings are composed of one character. I want the output to be like this:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
Then I want to print only the strings that are composed of two characters; the output should be:
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
I am a beginner and I really don't know how to do this. Thanks for your help!

Could you please try the following? I am approaching it in a different way for the provided samples; it is written and tested with the provided samples only.
To get the values before the bulk space (the long whitespace run that separates the two halves), try:
awk '
{
  line = $0
  prev_start = 0                        # reset per line; otherwise the offset carries over
  # Scan every whitespace run, remembering the width of the widest one (arr)
  # and accumulating match offsets to locate where it sits in the line.
  while (match($0, /[[:space:]]+/)) {
    arr = (arr > RLENGTH ? arr : RLENGTH)
    start[arr] += RSTART + prev_start
    prev_start = RSTART
    $0 = substr($0, RSTART + RLENGTH)   # drop everything up to and including this run
  }
  var = substr(line, 1, start[arr] - 1) # keep the part before the widest gap
  sub(/ +$/, "", var)                   # trim trailing spaces
  print var
  delete start
  var = arr = ""
}
' Input_file
Output will be as follows.
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
To get the values after the bulk space, try:
awk '
{
  line = $0
  prev_start = 0                        # reset per line; otherwise the offset carries over
  while (match($0, /[[:space:]]+/)) {
    arr = (arr > RLENGTH ? arr : RLENGTH)
    start[arr] += RSTART + prev_start
    prev_start = RSTART
    $0 = substr($0, RSTART + RLENGTH)
  }
  var = substr(line, start[arr])        # keep the part after the widest gap
  sub(/^ +/, "", var)                   # trim leading spaces
  print var
  delete start
  var = arr = ""
}
' Input_file
Output will be as follows:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
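If the bulk space is the only place where two or more whitespace characters appear in a row (an assumption; check it against your real data), a simpler sketch is to use that run as the field separator:
awk -F'[[:space:]][[:space:]]+' '{print $1}' Input_file
for the part before the gap, and
awk -F'[[:space:]][[:space:]]+' '{print $2}' Input_file
for the part after it.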

You can try:
awk '{for(i=1;i<=NF;++i)if(length($i)==1)printf("%s ", $i);print("")}'
For each field, check the length and print the field if it matches. You may pass the -F option to awk if the fields are not separated by blanks.
The awk script, expanded, is:
for (i = 1; i <= NF; ++i)
    if (length($i) == 1)
        printf("%s ", $i)
print ""
The print outside the loop emits a newline after each input line.
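For the two-character strings from the question, the same pattern applies; only the length test changes:
awk '{for(i=1;i<=NF;++i)if(length($i)==2)printf("%s ", $i);print("")}'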

Assuming all the columns are tab-separated (so a column value can itself contain a space, as in the second line of your sample), this is easy to do with a Perl one-liner:
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^.$/ } @F' foo.txt
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^..$/ } @F' foo.txt
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
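If you would rather keep the field length as a parameter instead of editing the regex, a hedged awk sketch of the same idea (again assuming tab-separated input):
awk -F'\t' -v n=1 '{
    out = ""
    for (i = 1; i <= NF; i++)          # collect the fields of exactly n characters
        if (length($i) == n)
            out = out (out == "" ? "" : "\t") $i
    print out
}' foo.txt
Set -v n=2 for the two-character case.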

Related

Use printf to format list that is uneven

I have a small list of student grades, and I need to format it side by side depending on the gender of the student, so one column is male and the other female. The problem is that the list doesn't alternate male, female, male, female; it is uneven.
I've tried using printf to format the output so the two columns are side by side, but the format is ruined because of the uneven list.
Name Gender Mark1 Mark2 Mark3
AA M 20 15 35
BB F 22 17 44
CC F 19 14 25
DD M 15 20 42
EE F 18 22 30
FF M 0 20 45
This is the list I am talking about ^^
awk 'BEGIN {print "Male" " Female"} {if (NR!=1) {if ($2 == "M") {printf "%-s %-s %-s", $3, $4, $5} else if ($2 == "F") {printf "%s %s %s\n", $3, $4 ,$5}}}' text.txt
So I'm getting results like
Male Female
20 15 35 22 17 44
19 14 25
15 20 42 18 22 30
0 20 45
But I want it like this:
Male Female
20 15 35 22 17 44
15 20 42 19 14 25
0 20 45 18 22 30
I haven't added separators yet; I'm just trying to figure this out. I'm not sure if it would be better to put the marks into two arrays depending on gender and then print them out.
Another solution, which also addresses the case where the M and F counts are unequal:
$ awk 'NR==1 {print "Male\tFemale"}
NR>1 {k=$2;$1=$2="";sub(/ +/,"");
if(k=="M") m[++mc]=$0; else f[++fc]=$0}
END {max=mc>fc?mc:fc;
for(i=1;i<=max;i++) print (m[i]?m[i]:"-") "\t" (f[i]?f[i]:"-")}' file |
column -ts$'\t'
Male Female
20 15 35 22 17 44
15 20 42 19 14 25
0 20 45 18 22 30
Something like this?
awk '
BEGIN {
    format = "%2s %2s %2s %2s\n"
    printf("Male Female\n")
}
NR > 1 {
    if (s) {                                # one row is already buffered
        if ($2 == "F")
            printf(format, s, $3, $4, $5)   # current row is female, buffered one male
        else
            printf(format, $3, $4, $5, s)   # current row is male, buffered one female
        s = ""
    } else {
        s = sprintf("%2s %2s %2s", $3, $4, $5)
    }
}' file
Another approach using awk
awk '
BEGIN {
print "Male\t\tFemale"
}
NR > 1 {
I = ++G[$2]
A[$2 FS I] = sprintf("%2d %2d %2d", $(NF-2), $(NF-1), $NF)
}
END {
M = ( G["M"] > G["F"] ? G["M"] : G["F"] )
for ( i = 1; i <= M; i++ )
print A["M" FS i] ? A["M" FS i] : OFS, A["F" FS i] ? A["F" FS i] : OFS
}
' OFS='\t' file
This might work for you (GNU sed):
sed -Ee '1c\Male Female' -e 'N;s/^.. M (.*)\n.. F(.*)/\1\2/;s/^.. F(.*)\n.. M (.*)/\2\1/' file
Change the header line. Then compare a pair of lines and re-arrange them as appropriate.
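Written out in multi-line form with comments (a sketch of the same GNU sed script):
sed -E '
# replace the header line
1c\
Male Female
# pull in the next line so we operate on a pair of rows
N
# male row first: strip the name/gender prefixes, keep order
s/^.. M (.*)\n.. F(.*)/\1\2/
# female row first: swap so the male marks come first
s/^.. F(.*)\n.. M (.*)/\2\1/
' file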

Transposing data based on unique ID - awk

I really hope you can help. I am completely new to (g)awk and I have been fighting with it for the last two weeks.
My original file is as follows: there is a column with a unique Id and another with unique names. Subsequent columns are various courses, and each field (when not empty) contains a mark for that course and that student, so each student has only one mark per course:
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55
2 George 63
4 Alex 64
1 John 74
3 Emma 63
2 George 64
4 Alex 60
2 George 29
3 Emma 69
1 John 67
3 Emma 80
4 Alex 57
2 George 91
1 John 81
1 John 34
3 Emma 75
2 George 89
4 Alex 49
3 Emma 78
4 Alex 69
5 TERRY 67
6 HELEN 39
This is what I want to achieve: transpose the data (i.e., the marks) based on the unique Id and place each mark below its corresponding course, like below:
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55 69 64 60 49
2 George 29 64 89 91 63
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
5 TERRY 67
6 HELLEN 39
This is what I managed to get so far:
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55
2 George 29
3 Emma 63
4 Alex 57
5 TERRY
6 HELLEN
1 John 69
2 George 64
3 Emma 80
4 Alex 69
5 TERRY 67
6 HELLEN
1 John 64
2 George 89
3 Emma 75
4 Alex 64
5 TERRY
6 HELLEN 39
...and so on
It is really a bit tricky for me to achieve with what I already know about awk (please note I am not interested in sed/perl etc. based solutions).
If you are able to help (preferably NOT with a one-liner), may I ask you to be a bit descriptive, as I am interested in the method as much as in the solution itself.
Any help will be very much appreciated.
EDIT
Here is the code I wrote to reach the last stage (and where I got stuck):
#!/bin/bash
files3="*.csv"
for j in $files3
do
    #echo "processing $j..."
    fi13=$(awk -F" " '(NR==1){field13=$13;}{print field13}' ./work1/test1YA.csv)
    fi14=$(awk -F" " '(NR==1){field14=$14;}{print field14}' ./work1/test1YA.csv)
    fi15=$(awk -F" " '(NR==1){field15=$15;}{print field15}' ./work1/test1YA.csv)
    fi16=$(awk -F" " '(NR==1){field16=$16;}{print field16}' ./work1/test1YA.csv)
    # awk -F" " 'BEGIN{OFS=" ";RS="\n"}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' "$j" >> ./work1/test2YA.csv
    awk -F" " -v f13="$fi13" -v f14="$fi14" -v f15="$fi15" -v f16="$fi16" '{if($13==f13){$13=$6;$14=$15=$16=""}if($13==f14){$14=$6;$13=$15=$16=""}if($13==f15){$15=$6;$13=$14=$16=""}if($13==f16){$16=$6;$13=$14=$15=""}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16}}' "$j" >> ./work1/test2YA.csv
done
awk -F" " 'BEGIN{print "ID","Title","FirstName","MiddleName","LastName","FinalMarks","Status","Username","Campus","Code","Programme","Year","course1","course2","course3","course4"}{print}' ./work1/test2YA.csv >> ./work1/test3YA.csv
awk -F" " 'BEGIN{print "ID","Title","FirstName","MiddleName","LastName","FinalMarks","Status","Username","Campus","Code","Programme","Year","course1","course2","course3","course4"}{print}' ./work1/test2YA.csv >> ./work1/test3YA.csv
Here is a solution for GNU awk:
course.awk
BEGIN { # setup field widths for fixed-width field splitting
    FIELDWIDTHS = "2 2 12 7 1 7 1 7 1 7 1 7"
    # setup sort order (by id)
    PROCINFO["sorted_in"] = "#ind_num_asc"
}
NR == 1 { # print header
    print
    next
}
{
    # add ids to names
    names[$1] = $3
    # store under id and course number the mark, if it is present
    for (c = 1; c <= 5; c++) {
        field = 2 + (c * 2)
        if ($(field) !~ /^ *$/) {
            marks[$1, c] = $(field)
        }
    }
}
END {
    # output
    for (id in names) {
        printf("%-4s%-12s%7s %7s %7s %7s %7s\n", id, names[id], marks[id, 1], marks[id, 2], marks[id, 3], marks[id, 4], marks[id, 5])
    }
}
Use it like this: awk -f course.awk your_file.
The fact that the input is not tab-delimited but has fixed column widths makes it a bit inelegant:
use of FIELDWIDTHS, and %Ns format widths where N is derived from the FIELDWIDTHS
FIELDWIDTHS takes into account the empty column between Id and Name, Course1 and Course2, ...
the check whether a mark is present: if( $(field) !~ /^ *$/ ) verifies that the field does not consist entirely of spaces
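To see what FIELDWIDTHS does on its own, here is a tiny illustrative sketch (GNU awk only; the input string is made up):
$ echo '12345678' | gawk 'BEGIN { FIELDWIDTHS = "2 3 3" } { print $1 "|" $2 "|" $3 }'
12|345|678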
This could be an approximation in awk:
NR==1 {
    for (x = 1; x <= NF; x++) {
        head = head $x "\t"
    }
    print head
}
NR>1 {
    for (i = 3; i <= NF; i++) {
        students[$1 "\t" $2] = students[$1 "\t" $2] "\t" $i
    }
}
END {
    for (stu in students) {
        print stu, students[stu]
    }
}
Output (the for (stu in students) loop iterates in arbitrary order):
Id Name Course1 Course2 Course3 Course4 Course5
5 TERRY 67
4 Alex 64 60 57 49 69
1 John 55 74 67 81 34
6 HELEN 39
3 Emma 63 69 80 75 78
2 George 63 64 29 91 89
Same ideas, perhaps simpler:
$ awk 'BEGIN{ FIELDWIDTHS="16 8 8 8 8 8"}
NR==1{print;next}
NR>1{keys[$1];
for(i=2;i<=6;i++)
{gsub(" ","",$i);
if($i) a[$1,i]=$i}}
END{for(k in keys)
{printf "%16s",k;
for(i=2;i<=6;i++) printf "%-8s",a[k,i];
print ""}}' file
Id Name Course1 Course2 Course3 Course4 Course5
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
6 HELEN 39
5 TERRY 67
1 John 55 67 81 74 34
2 George 29 64 89 91 63
You can sort the output as well by piping it to sort -n:
... | sort -n
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55 67 81 74 34
2 George 29 64 89 91 63
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
5 TERRY 67
6 HELEN 39
With GNU awk for FIELDWIDTHS, 2D arrays, and sorted_in:
$ cat tst.awk
NR==1 {
print
split($0,f,/\S+\s*/,s)
for (i=1;i in s;i++) {
w[i] = length(s[i])
FIELDWIDTHS = FIELDWIDTHS (i>1?" ":"") w[i]
}
next
}
{
sub(/\s*$/," ")
for (i=1;i<=NF;i++) {
if ($i ~ /\S/) {
val[$1][i] = $i
}
}
}
END {
PROCINFO["sorted_in"] = "#ind_num_asc"
for (id in val) {
for (i=1;i<=NF;i++) {
printf "%*s", w[i], val[id][i]
}
print ""
}
}
$ awk -f tst.awk file
Id Name Course1 Course2 Course3 Course4 Course5
1 John 55 67 81 74 34
2 George 29 64 89 91 63
3 Emma 63 80 75 78 69
4 Alex 57 69 64 60 49
5 TERRY 67
6 HELEN 39
Here's my take on this. This works in plain-old awk (doesn't use FIELDWIDTHS), and it automatically adjusts to different numbers of fields (i.e. add a Course7 column and you should be fine). Also, you can point it at multiple files, and it should process each one separately.
#!/usr/bin/awk -f
# Initialize variables on the first record of each input file
# (and also print the header)
#
FNR <= 1 {
print
delete name
delete score
next
}
# Process each line.
#
{
    id = substr($0, 1, 16)     # Id+Name columns: the first 16 characters
    name[id]                   # Store the unique identifier in an array
    pos = 0
    # Step through the score fields until we hit the end of the line,
    # accumulating scores in another array.
    do {
        score[id, pos] += substr($0, 17 + pos*8, 8) + 0
        # debug trace:
        # printf("id='%s' pos=%s value=%s total=%s\n", id, pos, substr($0, 17 + pos*8, 8) + 0, score[id, pos])
    } while (17 + (++pos)*8 < length())
}
# Keep track of our maximum number of fields
pos>max { max=pos }
# Finally, generate our (randomly sorted) output.
END {
for (id in name) { # Step through the records...
printf("%-12s", id);
for (i=0; i<max; i++) { # Step through the fields...
if (score[id, i]==0) score[id, i]=""
printf("%-8s", score[id, i]);
}
printf("\n")
}
}
It's a bit long but I think it's easier to understand what it does.
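Assuming you save the script under a hypothetical name such as transpose.awk and make it executable (it already carries the #!/usr/bin/awk -f shebang), usage would look like:
$ chmod +x transpose.awk
$ ./transpose.awk file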

Compare two files having different column numbers and print the requirement to a new file if condition satisfies

I have two files with more than 10000 rows. File1 has 1 column; File2 has 4 columns:
File1:    File2:
23        23 88 90 0
34        43 74 58 5
43        54 87 52 3
54        73 52 35 4
.         .
.         .
I want to compare each value in file-1 with the first column of file-2. If it exists there, print that value along with the other three values in file-2. In this example the output will be:
23 88 90 0
43 74 58 5
54 87 52 3
.
.
I have written the following script, but it is taking too much time to execute.
s1=1; s2=$(wc -l < File1.txt)
while [ $s1 -le $s2 ]
do n=$(awk 'NR=="$s1" {print $1}' File1.txt)
p1=1; p2=$(wc -l < File2.txt)
while [ $p1 -le $p2 ]
do awk '{if ($1==$n) printf ("%s %s %s %s\n", $1, $2, $3, $4);}'> ofile.txt
(( p1++ ))
done
(( s1++ ))
done
Is there any short/easy way to do it?
You can do it very concisely using awk:
awk 'FNR==NR{found[$1]++; next} $1 in found'
Test
>>> cat file1
23
34
43
54
>>> cat file2
23 88 90 0
43 74 58 5
54 87 52 3
73 52 35 4
>>> awk 'FNR==NR{found[$1]++; next} $1 in found' file1 file2
23 88 90 0
43 74 58 5
54 87 52 3
What does it do?
FNR==NR checks whether FNR, the record number within the current file, equals NR, the record number across all input. This holds only while the first file, file1, is being read, because FNR is reset to 1 each time awk opens a new file.
{found[$1]++; next} If the check is true, this creates an associative array indexed by $1, the first column of file1, and skips to the next record.
$1 in found This test only runs for the second file, file2. If the column-1 value $1 is an index in the associative array found, the entire line is printed (no action is written, because printing the line is awk's default action).
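To see the FNR/NR distinction in isolation, a quick illustrative sketch (with throwaway files a and b):
$ printf '1\n2\n' > a; printf '3\n' > b
$ awk '{print FILENAME, FNR, NR}' a b
a 1 1
a 2 2
b 1 3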

Gawk print largest value from each column

I am writing an awk script that takes some columns of input in a text file and prints out the largest value in each column.
Input:
$ cat numbers
10 20 30.3 40.5
20 30 45.7 66.1
40 75 107.2 55.6
50 20 30.3 40.5
60 30 45.O 66.1
70 1134.7 50 70
80 75 107.2 55.6
Output:
80 1134.7 107.2 70
Script:
BEGIN {
val=0;
line=1;
}
{
if( $2 > $3 )
{
if( $2 > val )
{
val=$2;
line=$0;
}
}
else
{
if( $3 > val )
{
val=$3;
line=$0;
}
}
}
END{
print line
}
Current output:
60 30 45.O 66.1
What am I doing wrong? This is my first awk script.
======= SOLUTION =======
{
    # remember the largest value seen so far in each column
    for (i = 0; ++i <= NF;)
        $i > m[i] && m[i] = $i
}
END {
    # print the per-column maxima, separated by FS, ending with a newline
    for (i = 0; ++i <= NF;)
        printf "%s", (m[i] (i < NF ? FS : RS))
}
Thanks for the help
Since you have four columns, you'll need at least four variables, one for each column (or an array if you prefer), and you won't need to hold any line in its entirety. Treat each column independently.
You need to adapt something like the following for your purposes; it finds the maximum in one particular column (the second, in this case):
awk 'BEGIN {max = 0} {if ($2>max) max=$2} END {print max}' numbers.dat
The approach you are taking with $2 > $3 compares two columns with each other, which is not what you want here.
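One caveat with initializing max = 0: it silently fails if the column contains only negative values. A hedged variant that seeds the maximum from the first row instead:
awk 'NR==1 {max = $2} $2 > max {max = $2} END {print max}' numbers.dat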
You can create one user defined function and then pass individual column arrays to it to retrieve the max value. Something like this -
[jaypal:~/Temp] cat numbers
10 20 30.3 40.5
20 30 45.7 66.1
40 75 107.2 55.6
50 20 30.3 40.5
60 30 45.O 66.1
70 1134.7 50.0 70
80 75 107.2 55.6
[jaypal:~/Temp] awk '
function max(x){i=0;for(val in x){if(i<=x[val]){i=x[val];}}return i;}
{a[$1]=$1;b[$2]=$2;c[$3]=$3;d[$4]=$4;next}
END{col1=max(a);col2=max(b);col3=max(c);col4=max(d);print col1,col2,col3,col4}' numbers
80 1134.7 107.2 70
or
awk 'a<$1{a=$1}b<$2{b=$2}c<$3{c=$3}d<$4{d=$4} END{print a,b,c,d}' numbers

how to subset a file - select a numbers of rows or columns

I would like your advice/help on how to subset a big file (millions of rows or lines).
For example,
(1)
I have a big file (millions of rows, tab-delimited). I want a subset of this file with only rows 10000 to 100000.
(2)
I have a big file (millions of columns, tab-delimited). I want a subset of this file with only columns 10000 to 100000.
I know there are tools like head, tail, cut, split, awk, and sed. I can use them to do simple subsetting, but I do not know how to do this job.
Could you please give any advice? Thanks in advance.
Filtering rows is easy, for example with AWK:
cat largefile | awk 'NR >= 10000 && NR <= 100000 { print }'
Filtering columns is easier with cut (a tab is already cut's default delimiter):
cat largefile | cut -f 10000-100000
As Rahul Dravid mentioned, cat is not a must here, and as Zsolt Botykai added, you can improve performance using:
awk 'NR > 100000 { exit } NR >= 10000 && NR <= 100000' largefile
cut -f 10000-100000 largefile
Some different solutions:
For row ranges:
In sed:
sed -n 10000,100000p somefile.txt
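As with the awk version above, you can tell sed to quit once it is past the range, so it does not scan the remaining millions of rows (a sketch of the same command with an early exit):
sed -n '10000,100000p;100000q' somefile.txt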
For column ranges in awk:
awk -v f=10000 -v t=100000 '{ for (i=f; i<=t;i++) printf("%s%s", $i,(i==t) ? "\n" : OFS) }' details.txt
For the first problem, selecting a set of rows from a large file, piping tail into head is very simple. You want 90000 rows from largefile starting at row 10000: tail grabs the back end of largefile starting at row 10000, and then head chops off all but the first 90000 rows.
tail -n +10000 largefile | head -n 90000 -
I was beaten to the sed solution, so I'll post a Perl ditto instead.
To print selected lines:
$ seq 100 | perl -ne 'print if $. >= 10 && $. <= 20'
10
11
12
13
14
15
16
17
18
19
20
To print a range of columns, use an @F array slice:
perl -lane 'print join " ", @F[1..3]'
-F is used in conjunction with -a to choose the delimiter on which to split lines.
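For example, to split on commas instead of the default whitespace (a small illustrative sketch):
$ echo a,b,c | perl -F, -lane 'print $F[1]'
b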
To test, use seq and paste to generate some columns:
$ seq 50 | paste - - - - -
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30
31 32 33 34 35
36 37 38 39 40
41 42 43 44 45
46 47 48 49 50
Let's print everything except the first and the last column:
$ seq 50 | paste - - - - - | perl -lane 'print join " ", @F[1..3]'
2 3 4
7 8 9
12 13 14
17 18 19
22 23 24
27 28 29
32 33 34
37 38 39
42 43 44
47 48 49
In the join expression above, the separator is a literal tab (typed as Ctrl-V Tab), although it may render as a space here.
