Supplement patterns - linux

I have records of this kind in a file:
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1871 121 1 13
1871 121 2 194
I would like to get this output:
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1870 121 0 0
1871 121 1 13
1871 121 2 194
The difference is the 1870 121 0 0 row.
So, if the difference between the numbers in the first column is greater than 1, we have to insert a line for each missing number (1870 in the case above) and fill in the other columns as follows: the second column takes the minimum of the values occurring in that column (in the example those values are 121 and 122), and likewise for the third column; the last column is always zero.
Can anybody suggest something? Thanks in advance!
I am trying to solve it with awk, but maybe there are other, nicer or more practical solutions...

Something like this could work -
awk 'BEGIN{getline;a=$1;b=$2;c=$3}
NR==FNR{if (b>$2) b=$2; if (c>$3) c=$3;next}
{if ($1-a>1) {x=($1-a); for (i=1;i<x;i++) {print (a+1)"\t"b,c,"0";a++};a=$1} else a=$1;print}' file file
Explanation:
BEGIN{getline;a=$1;b=$2;c=$3} -
In this BEGIN block we read the first line and assign values in column 1 to variable a, column 2 to variable b and column 3 to variable c.
NR==FNR{if (b>$2) b=$2; if (c>$3) c=$3;next} -
In this we scan through the entire file (NR==FNR) and keep track of the lowest possible values in column 2 and column 3 and store them in variables b and c respectively. We use next to avoid running the second pattern{action} statement.
{if ($1-a>1) {x=($1-a); for (i=1;i<x;i++) {print (a+1)"\t"b,c,"0";a++};a=$1} else a=$1;print} -
This action statement compares the value in column 1 with a. If the difference is more than 1, we run a for loop to print all the missing lines and then set a to $1. If the difference between column 1 values on successive lines is not greater than 1, we just assign column 1 to a and print the line.
Test:
[jaypal:~/Temp] cat file
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1871 121 1 13 # <--- 1870 skipped
1871 121 2 194
1875 120 1 12 # <--- 1872, 1873, 1874 skipped
[jaypal:~/Temp] awk 'BEGIN{getline;a=$1;b=$2;c=$3}
NR==FNR{if (b>$2) b=$2; if (c>$3) c=$3;next}
{if ($1-a>1) {x=($1-a); for (i=1;i<x;i++) {print (a+1)"\t"b,c,"0";a++};a=$1} else a=$1;print}' file file
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1870 120 0 0 # Assigned minimum value in col 2 (120) and col 3 (0).
1871 121 1 13
1871 121 2 194
1872 120 0 0 # Assigned minimum value in col 2 (120) and col 3 (0).
1873 120 0 0 # Assigned minimum value in col 2 (120) and col 3 (0).
1874 120 0 0 # Assigned minimum value in col 2 (120) and col 3 (0).
1875 120 1 12

Perl solution. It should work for large files, too, as it does not load the whole file into memory but instead passes over the file twice.
#!/usr/bin/perl
use warnings;
use strict;

my $file = shift;
open my $IN, '<', $file or die $!;

# First pass: find the minima of columns 2 and 3.
my @mins;
while (<$IN>) {
    my @cols = split;
    for (0, 1) {
        $mins[$_] = $cols[$_ + 1]
            if ! defined $mins[$_] or $cols[$_ + 1] < $mins[$_];
    }
}

# Second pass: print each line, filling in the gaps.
seek $IN, 0, 0;
my $last;
while (<$IN>) {
    my @cols = split;
    $last //= $cols[0];
    for my $i ($last .. $cols[0] - 2) {
        print $i + 1, "\t@mins 0\n";
    }
    print;
    $last = $cols[0];
}
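To run it, pass the data file as the argument (fill.pl is just a hypothetical name for the script saved above):
perl fill.pl file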

A Bash solution:
# initialize the minima of the 2nd and 3rd columns from the first line
read no min2 min3 c4 < "$infile"
# find the minima of the 2nd and 3rd columns
while read c1 c2 c3 c4 ; do
    [ "$c2" -lt "$min2" ] && min2=$c2
    [ "$c3" -lt "$min3" ] && min3=$c3
done < "$infile"
while read c1 c2 c3 c4 ; do
    # insert missing line(s)?
    while (( c1 - no > 1 )) ; do
        ((no++))
        echo -e "$no $min2 $min3 0"
    done
    # now print the existing line
    echo -e "$c1 $c2 $c3 $c4"
    no=$c1
done < "$infile"

One way using awk:
BEGIN {
if ( ARGC > 2 ) {
print "Usage: awk -f script.awk <file-name>"
exit 0
}
## Need to process file twice, duplicate the input filename.
ARGV[2] = ARGV[1]
++ARGC
col2 = -1
col3 = -1
}
## First processing of file. Get min values of second and third columns.
FNR == NR {
col2 = col2 < 0 || col2 > $2 ? $2 : col2
col3 = col3 < 0 || col3 > $3 ? $3 : col3
next
}
## Second processing of file.
FNR < NR {
## Get value of column 1 in first row.
if ( FNR == 1 ) {
col1 = $1
print
next
}
## Compare current value of column 1 with value of previous row.
## Add a new row while difference is bigger than '1'.
while ( $1 - col1 > 1 ) {
++col1
printf "%d\t%d %d %d\n", col1, col2, col3, 0
}
## Assign new value of column 1.
col1 = $1
print
}
Running the script:
awk -f script.awk infile
Result:
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1870 121 0 0
1871 121 1 13
1871 121 2 194

Related

How to rearrange the columns using awk?

I have a file with 120 columns. A part of it is here with 12 columns.
A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3
4 4 5 2 3 3 2 1 9 17 25 33
5 6 4 6 8 2 3 5 3 1 -1 -3
7 8 3 10 13 1 4 9 -3 -15 -27 -39
9 10 2 14 18 0 5 13 -9 -31 -53 -75
11 12 1 18 23 -1 6 17 -15 -47 -79 -111
13 14 0 22 28 -2 7 21 -21 -63 -105 -147
15 16 -1 26 33 -3 8 25 -27 -79 -131 -183
17 18 -2 30 38 -4 9 29 -33 -95 -157 -219
19 20 -3 34 43 -5 10 33 -39 -111 -183 -255
21 22 -4 38 48 -6 11 37 -45 -127 -209 -291
I would like to rearrange it by bringing all A columns together (A1 A2 A3 A4) and similarly all Bs (B1 B2 B3 B4), Cs (C1 C2 C3 C4), Ds (D1 D2 D3 D4) together.
I am looking to print the columns as
A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 D1 D2 D3 D4
My script is:
#!/bin/sh
sed -i '1d' input.txt
for i in {1..4};do
j=$(( 1 + $(( 3 * $(( i - 1 )) )) ))
awk '{print $'$j'}' input.txt >> output.txt
done
for i in {1..4};do
j=$(( 2 + $(( 3 * $(( i - 1 )) )) ))
awk '{print $'$j'}' input.txt >> output.txt
done
for i in {1..4};do
j=$(( 3 + $(( 3 * $(( i - 1 )) )) ))
awk '{print $'$j'}' input.txt >> output.txt
done
It is printing all in one column.
Here are two generic solutions that avoid hard-coding the field numbers from Input_file; the values can come in any order and will be sorted automatically. Written and tested in GNU awk with the shown samples.
1st solution: Traverse all the lines and their respective fields, then sort by values to index the headers.
awk '
FNR==1{
for(i=1;i<=NF;i++){
arrInd[i]=$i
}
next
}
{
for(i=1;i<=NF;i++){
value[FNR,arrInd[i]]=$i
}
}
END{
PROCINFO["sorted_in"]="@val_num_asc"
for(i in arrInd){
printf("%s%s",arrInd[i],i==length(arrInd)?ORS:OFS)
}
for(i=2;i<=FNR;i++){
for(k in arrInd){
printf("%s%s",value[i,arrInd[k]],k==length(arrInd)?ORS:OFS)
}
}
}
' Input_file
Or, in case you want the output in tabular format, make a small tweak to the above solution:
awk '
BEGIN { OFS="\t" }
FNR==1{
for(i=1;i<=NF;i++){
arrInd[i]=$i
}
next
}
{
for(i=1;i<=NF;i++){
value[FNR,arrInd[i]]=$i
}
}
END{
PROCINFO["sorted_in"]="@val_num_asc"
for(i in arrInd){
printf("%s%s",arrInd[i],i==length(arrInd)?ORS:OFS)
}
for(i=2;i<=FNR;i++){
for(k in arrInd){
printf("%s%s",value[i,arrInd[k]],k==length(arrInd)?ORS:OFS)
}
}
}
' Input_file | column -t -s $'\t'
2nd solution: Almost the same concept as the 1st solution, but here the array is traversed inside an if/else within a single loop in the END block, rather than in two separate loops.
awk '
BEGIN { OFS="\t" }
FNR==1{
for(i=1;i<=NF;i++){
arrInd[i]=$i
}
next
}
{
for(i=1;i<=NF;i++){
value[FNR,arrInd[i]]=$i
}
}
END{
PROCINFO["sorted_in"]="@val_num_asc"
for(i=1;i<=FNR;i++){
if(i==1){
for(k in arrInd){
printf("%s%s",arrInd[k],k==length(arrInd)?ORS:OFS)
}
}
else{
for(k in arrInd){
printf("%s%s",value[i,arrInd[k]],k==length(arrInd)?ORS:OFS)
}
}
}
}
' Input_file | column -t -s $'\t'
Is it just A,B,C,D,A,B,C,D all the way across? Something like this should work (quick and dirty and specific though it be):
awk -v OFS='\t' '{
for (i=0; i<4; ++i) { # i=0:A, i=1:B,etc.
for (j=0; 4*j+i<NF; ++j) {
if (i || j) printf "%s", OFS;
printf "%s", $(4*j+i+1);
}
}
printf "%s", ORS;
}'
A similar approach to @MarkReed's that manipulates the increment instead of the test condition can be written as:
awk '{
for (n=1; n<=4; n++)
for (c=n; c<=NF; c+=4)
printf "%s%s", ((c>1)?"\t":""), $c
print ""
}
' cols.txt
Example Use/Output
With your sample input in cols.txt you would have:
$ awk '{
> for (n=1; n<=4; n++)
> for (c=n; c<=NF; c+=4)
> printf "%s%s", ((c>1)?"\t":""), $c
> print ""
> }
> ' cols.txt
A1 A2 A3 B1 B2 B3 C1 C2 C3 D1 D2 D3
4 3 9 4 3 17 5 2 25 2 1 33
5 8 3 6 2 1 4 3 -1 6 5 -3
7 13 -3 8 1 -15 3 4 -27 10 9 -39
9 18 -9 10 0 -31 2 5 -53 14 13 -75
11 23 -15 12 -1 -47 1 6 -79 18 17 -111
13 28 -21 14 -2 -63 0 7 -105 22 21 -147
15 33 -27 16 -3 -79 -1 8 -131 26 25 -183
17 38 -33 18 -4 -95 -2 9 -157 30 29 -219
19 43 -39 20 -5 -111 -3 10 -183 34 33 -255
21 48 -45 22 -6 -127 -4 11 -209 38 37 -291
Here's a succinct generic solution that is not memory-bound, as @RavinderSingh13's solution is. (That is, it does not store the entire input in an array for printing in END.)
BEGIN {
OFS="\t" # output field separator
}
NR==1 {
# Sort column titles
for (i=1;i<=NF;i++) { sorted[i]=$i; position[$i]=i }
asort(sorted)
# And print them
for (i=1;i<=NF;i++) { $i=sorted[i] }
print
next
}
{
# Make an array of our input line...
split($0,line)
for (i=1;i<=NF;i++) { $i=line[position[sorted[i]]] }
print
}
The idea here is that at the first line of input, we record the position of our columns in the input, then sort the list of column names with asort(). It is important here that column names are not duplicated, as they are used as the index of an array.
As we step through the data, each line is reordered by replacing each field with the value from the position as sorted by the first line.
It is important that you set your input field separator correctly (whitespace, tab, comma, whatever), and have the complete set of fields in each line, or output will be garbled.
Also, this doesn't create columns. You mentioned A4 in your question, but there is no A4 in your sample data. We are only sorting what is there.
Lastly, this is a GNU awk program, due to the use of asort().
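If you needed to avoid the gawk dependency, the asort() call could be replaced by a hand-rolled sort of the titles. A minimal sketch (assuming, as above, that column names are unique) that should work in any POSIX awk: add this function and call sort_titles(NF) where asort(sorted) is called above.
function sort_titles(n,    i, j, tmp) {
    # simple insertion sort of sorted[1..n] by string comparison
    for (i = 2; i <= n; i++) {
        tmp = sorted[i]
        for (j = i - 1; j >= 1 && sorted[j] > tmp; j--)
            sorted[j + 1] = sorted[j]
        sorted[j + 1] = tmp
    }
}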
Using any awk, for any number of tags (non-numeric leading strings in the header line) and/or numbers associated with them in the header line, including different counts of each letter (so you could have A1 A2 but then B1 B2 B3 B4), reproducing the input order in the output, and storing only one line at a time in memory:
$ cat tst.awk
BEGIN { OFS="\t" }
NR == 1 {
for ( fldNr=1; fldNr<=NF; fldNr++ ) {
tag = $fldNr
sub(/[0-9]+$/,"",tag)
if ( !seen[tag]++ ) {
tags[++numTags] = tag
}
fldNrs[tag,++numTagCols[tag]] = fldNr
}
}
{
out = ""
for ( tagNr=1; tagNr<=numTags; tagNr++ ) {
tag = tags[tagNr]
for ( tagColNr=1; tagColNr<=numTagCols[tag]; tagColNr++ ) {
fldNr = fldNrs[tag,tagColNr]
out = (out=="" ? "" : out OFS) $fldNr
}
}
print out
}
$ awk -f tst.awk file
A1 A2 A3 B1 B2 B3 C1 C2 C3 D1 D2 D3
4 3 9 4 3 17 5 2 25 2 1 33
5 8 3 6 2 1 4 3 -1 6 5 -3
7 13 -3 8 1 -15 3 4 -27 10 9 -39
9 18 -9 10 0 -31 2 5 -53 14 13 -75
11 23 -15 12 -1 -47 1 6 -79 18 17 -111
13 28 -21 14 -2 -63 0 7 -105 22 21 -147
15 33 -27 16 -3 -79 -1 8 -131 26 25 -183
17 38 -33 18 -4 -95 -2 9 -157 30 29 -219
19 43 -39 20 -5 -111 -3 10 -183 34 33 -255
21 48 -45 22 -6 -127 -4 11 -209 38 37 -291
or with different formats of tags and different numbers of columns per tag:
$ cat file
foo1 bar1 bar2 bar3 foo2 bar4
4 4 5 2 3 3
5 6 4 6 8 2
$ awk -f tst.awk file
foo1 foo2 bar1 bar2 bar3 bar4
4 3 4 5 2 3
5 8 6 4 6 2
The above assumes you want the output order per tag to match the input order, not be based on the numeric values after each tag so if you have input of A2 B1 A1 then the output will be A2 A1 B1, not A1 A2 B1.
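If you did want the numeric suffixes to drive the order within each tag, one possible tweak (a sketch, untested, assuming every header has a numeric suffix) is to key the field numbers by the stripped-off number and loop up to the largest suffix seen:
BEGIN { OFS = "\t" }
NR == 1 {
    for ( fldNr=1; fldNr<=NF; fldNr++ ) {
        tag = num = $fldNr
        sub(/[0-9]+$/,"",tag)      # "A2" -> tag "A"
        sub(/^[^0-9]*/,"",num)     # "A2" -> num "2"
        if ( !seen[tag]++ ) {
            tags[++numTags] = tag
        }
        fldNrs[tag,num+0] = fldNr
        if ( num+0 > maxNum[tag] ) maxNum[tag] = num+0
    }
}
{
    out = ""
    for ( tagNr=1; tagNr<=numTags; tagNr++ ) {
        tag = tags[tagNr]
        for ( num=1; num<=maxNum[tag]; num++ ) {
            if ( (tag,num) in fldNrs ) {
                out = (out=="" ? "" : out OFS) $fldNrs[tag,num]
            }
        }
    }
    print out
}
With that, an input header of A2 B1 A1 would come out as A1 A2 B1.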

How to sort or rearrange numbers from multiple columns into multiple rows [fixed into 4 columns]?

I have one text file, text1.txt, which contains the following:
Input:
##[A1] [B1] [T1] [V1] [T2] [V2] [T3] [V3] [T4] [V4]## --> headers
1 1000 0 100 10 200 20 300 30 400
40 500 50 600 60 700 70 800
1010 0 101 10 201 20 301 30 401
40 501 50 601
2 1000 0 110 15 210 25 310 35 410
45 510 55 610 65 710
1010 0 150 10 250 20 350 30 450
40 550
Condition:
A1 and B1 -> for each A1 + (B1 + [Tn + Vn])
A1 should be in 1 column.
B1 should be in 1 column.
T1,T2,T3 and T4 should be in 1 column.
V1,V2,V3 and V4 should be in 1 column.
How do I rearrange it to become like below?
Desired output:
## A1 B1 Tn Vn ## --> headers
1 1000 0 100
10 200
20 300
30 400
40 500
50 600
60 700
70 800
1010 0 101
10 201
20 301
30 401
40 501
50 601
2 1000 0 110
15 210
25 310
35 410
45 510
55 610
65 710
1010 0 150
10 250
20 350
30 450
40 550
Here is my current code:
First Attempt:
Input
cat test1.txt | awk ' { a=$1 b=$2 } { for(i=1; i<=5; i=i+1) { t=substr($0,11+i*10,5) v=substr($0,16+i*10,5) if( t ~ /^\ +[0-9]+$/ || t ~ /^[0-9]+$/ || t ~ /^\ +[0-9]+\ +$/ ){ printf "%7s %7d %8d %8d \n",a,b,t,v } }}' | less
Output:
1 1000 400 0
40 500 800 0
1010 0 401 0
2 1000 410 0
1010 0 450 0
I'm trying a simple awk command, but still can't get the result.
Can anyone help me on this?
Thanks,
Am
Unlike what is stated elsewhere, there's nothing tricky about this at all; you're just using fixed-width fields in your input instead of character- or string-separated fields.
With GNU awk for FIELDWIDTHS to handle fixed width fields it really couldn't be much simpler:
$ cat tst.awk
BEGIN {
# define the width of the input and output fields
FIELDWIDTHS = "2 4 5 5 6 5 6 5 6 5 6 99"
ofmt = "%2s%5s%6s%5s%6s%s\n"
}
{
# strip leading/trailing blanks and square brackets from every field
for (i=1; i<=NF; i++) {
gsub(/^[[\s]+|[]\s]+$/,"",$i)
}
}
NR==1 {
# print the header line
printf ofmt, $1, $2, $3, "Tn", "Vn", " "$NF
next
}
{
# print every other line
for (i=4; i<NF; i+=2) {
printf ofmt, $1, $2, $3, $i, $(i+1), ""
$1 = $2 = $3 = ""
}
}
$ awk -f tst.awk file
## A1 B1 Tn Vn ## --> headers
1 1000 0 100
10 200
20 300
30 400
40 500
50 600
60 700
70 800
1010 0 101
10 201
20 301
30 401
40 501
50 601
2 1000 0 110
15 210
25 310
35 410
45 510
55 610
65 710
1010 0 150
10 250
20 350
30 450
40 550
With other awks you'd use a while() { substr() } loop instead of FIELDWIDTHS so it'd be a couple more lines of code but still trivial.
The above will be orders of magnitude faster than an equivalent shell script. See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
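For reference, a minimal sketch of that substr() variant for POSIX awks (untested, and assuming the same fixed widths as the FIELDWIDTHS string above):
# POSIX-awk sketch: split each line on the same fixed widths as above
BEGIN { nw = split("2 4 5 5 6 5 6 5 6 5 6 99", width, " ") }
{
    pos = 1
    for (i = 1; i <= nw; i++) {
        fld[i] = substr($0, pos, width[i])
        pos += width[i]
    }
    # fld[1]..fld[nw] now hold the raw fields; strip the blanks and
    # brackets and print them exactly as in the gawk version above
}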
This isn't easy because it is hard to identify when you have the different styles of row — those with values in both column 1 and column 2, those with no value in column 1 and a value in column 2, and those with no value in either column 1 or column 2. A first step is to make this easier — sed to the rescue:
$ sed 's/[[:space:]]\{1,\}$//
s/^....../&|/
s/|....../&|/
:a
s/|\( *[0-9][0-9]* \)\( *[^|]\)/|\1|\2/
t a' data
1 | 1000 | 0 | 100 | 10 | 200 | 20 | 300 | 30 | 400
| | 40 | 500 | 50 | 600 | 60 | 700 | 70 | 800
| 1010 | 0 | 101 | 10 | 201 | 20 | 301 | 30 | 401
| | 40 | 501 | 50 | 601
2 | 1000 | 0 | 110 | 15 | 210 | 25 | 310 | 35 | 410
| | 45 | 510 | 55 | 610 | 65 | 710
| 1010 | 0 | 150 | 10 | 250 | 20 | 350 | 30 | 450
| | 40 | 550
$
The first line removes any trailing white space, to avoid confusion. The next two expressions handle the fixed-width columns 1 and 2 (6 characters each). The next line creates a label a; the substitute finds a pipe |, some spaces, some digits, a space, and some trailing material which doesn't include a pipe; and inserts a pipe in the middle. The t a jumps back to the label if a substitution was done.
With that in place, it becomes easy to manage awk with a field separator of |.
This is verbose, but seems to do the trick:
awk -F '|' '
$1 > 0 { printf "%5d %4d %3d %3d\n", $1, $2, $3, $4
for (i = 5; i <= NF; i += 2) { printf "%5s %4s %3d %3d\n", "", "", $i, $(i+1) }
next
}
$2 > 0 { printf "%5s %4d %3d %3d\n", "", $2, $3, $4
for (i = 5; i <= NF; i += 2) { printf "%5s %4s %3d %3d\n", "", "", $i, $(i+1) }
next
}
{ for (i = 3; i <= NF; i += 2) { printf "%5s %4s %3d %3d\n", "", "", $i, $(i+1) }
next
}'
Output:
1 1000 0 100
10 200
20 300
30 400
40 500
50 600
60 700
70 800
1010 0 101
10 201
20 301
30 401
40 501
50 601
2 1000 0 110
15 210
25 310
35 410
45 510
55 610
65 710
1010 0 150
10 250
20 350
30 450
40 550
If you need to remove the headings, add 1d; to the start of the sed script.
This might work for you (GNU sed):
sed -r '1d;s/^(.{11}).{11}/&\n\1/;s/^((.{5}).*\n)\2/\1 /;s/^(.{5}(.{6}).*\n.{5})\2/\1 /;/\S/P;D' file
Delete the first line (if the header is needed see below). The key fields occupy the first 11 (the first key is 5 characters and the second 6) characters and the data fields occupy the next 11. Insert a newline and the key fields before each pair of data fields. Compare the keys on adjacent lines and replace by spaces if they are duplicated. Do not print any blank lines.
If the header is needed, use the following:
sed -r '1{s/\[[^]]+\]\s*//5g;y/[]/ /;s/1/n/3g;s/B/ B/;G;b};s/^(.{11}).{11}/&\n\1/;s/^((.{5}).*\n)\2/\1 /;s/^(.{5}(.{6}).*\n.{5})\2/\1 /;/\S/P;D' file
This does additional formatting on the first line to remove superfluous headings, []'s, replace 1's by n, add an additional space for alignment and a following empty line.
Furthermore, by utilising the second line of the input file as a template for the data, a sed script can be created that does not have any hard-coded values:
sed -r '2!d;s/\s*\S*//3g;s/.\>/&\n/;h;s/[^\n]/./g;G;s/[^\n.]/ /g;s#(.*)\n(.*)\n(.*)\n(.*)#1d;s/^(\1\2)\1\2/\&\\n\\1/;s/^((\1).*\\n)\\2/\\1\3/;s/^(\1(\2).*\\n\1)\\2/\\1\4/;/\\S/P;D#' file |
sed -r -f - file
The script created from the template is piped into a second invocation of the sed as a file and run against the original file to produce the required output.
Likewise the headers may be formatted if need be as so:
sed -r '2!d;s/\s*\S*//3g;s/.\>/&\n/;h;s/[^\n]/./g;G;s/[^\n.]/ /g;s#(.*)\n(.*)\n(.*)\n(.*)#s/^(\1\2)\1\2/\&\\n\\1/;s/^((\1).*\\n)\\2/\\1\3/;s/^(\1(\2).*\\n\1)\\2/\\1\4/;/\\S/P;D#' file |
sed -r -e '1{s/\[[^]]+\]\s*//5g;y/[]/ /;s/1/n/3g;s/B/ B/;G;b}' -f - file
By extracting the first four fields from the second line of the input file, four variables can be made: two regexps and two values. These variables can be used to build the sed script.
N.B. The sed script is created from strings extracted from the template and the variables produced are also strings so they can be concatenated to produce further new regexp's and new values etc etc
This is a rather tricky problem that can be handled a number of ways. Whether you use bash, perl or awk, you will need to handle the number of fields in some semi-generic way instead of just hardcoding values for your example.
Using bash, so long as you can rely on an even number of fields in all lines (except for the lines with a sole initial value, e.g. 1010), you can accommodate the number of fields in a reasonably generic way. For the lines beginning 1, 2, etc., you know your initial output will contain 4 fields. For lines beginning 1010, etc., you know the output will contain an initial 3 fields. For the remaining values you are simply outputting pairs.
The tricky part is handling the alignment. Here is where printf helps: it allows you to set the field width with a parameter, using the form "%*s", where the conversion specifier expects the next argument to be an integer field width, followed by an argument with the string to convert. It takes a little gymnastics, but you can do something like the following in bash itself:
(note: edit to match your output header format)
#!/bin/bash
declare -i nfields wd=6 ## total no. fields, printf field-width modifier
while read -r line; do ## read each line (preserve for header line)
arr=($line) ## separate into array
first=${arr[0]} ## check for '#' in first line for header
if [ "${first:0:1}" = '#' ]; then
nfields=$((${#arr[@]} - 2)) ## no. fields in header
printf "## A1 B1 Tn Vn ## --> headers\n" ## new header
continue
fi
fields=${#arr[@]} ## fields in line
case "$fields" in
$nfields ) ## fields -eq nfields?
cnt=4 ## handle 1st 4 values in line
printf " "
for ((i=0; i < cnt; i++)); do
if [ "$i" -eq '2' ]; then
printf "%*s" "5" "${arr[i]}"
else
printf "%*s" "$wd" "${arr[i]}"
fi
done
echo
for ((i = cnt; i < $fields; i += 2)); do ## handle rest
printf "%*s%*s%*s\n" "$((2*wd))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
done
;;
$((nfields - 1)) ) ## one less than nfields
cnt=3 ## handle 1st 3 values
printf " %*s%*s" "$wd" " "
for ((i=0; i < cnt; i++)); do
if [ "$i" -eq '1' ]; then
printf "%*s" "5" "${arr[i]}"
else
printf "%*s" "$wd" "${arr[i]}"
fi
done
echo
for ((i = cnt; i < $fields; i += 2)); do ## handle rest
if [ "$i" -eq '0' ]; then
printf "%*s%*s%*s\n" "$((wd+1))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
else
printf "%*s%*s%*s\n" "$((2*wd))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
fi
done
;;
* ) ## all other lines format as pairs
for ((i = 0; i < $fields; i += 2)); do
printf "%*s%*s%*s\n" "$((2*wd))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
done
;;
esac
done
Rather than opening a file within the script, just redirect the input file to your script (if you would rather pass a filename, redirect the file inside the script to feed the while read ... loop).
Example Use/Output
$ bash text1format.sh <dat/text1.txt
## A1 B1 Tn Vn ## --> headers
1 1000 0 100
10 200
20 300
30 400
40 500
50 600
60 700
70 800
1010 0 101
10 201
20 301
30 401
40 501
50 601
2 1000 0 110
15 210
25 310
35 410
45 510
55 610
65 710
1010 0 150
10 250
20 350
30 450
40 550
As between awk and bash, awk will generally be faster, but here with formatted output, it may be closer than usual. Look things over and let me know if you have questions.

How to append a special character in awk?

I have three files with different column and row sizes. For example,
ifile1.txt       ifile2.txt       ifile3.txt
  1   2   2        1   6            3   8
  2   5   6        3   8            9   0
  3   8   7        6   8           23   6
  6   7   6       23   6           44   5
  9  87  87       44   7           56   7
 23   6   6       56   8           78  89
 44   5  76       99   0           95  65
 56   6   7                        99  78
 78   7   8                       106   0
 95   6   7                       110   6
 99   6   4
106   5  34
110   6   4
Here ifile1.txt has 3 columns and 13 rows,
ifile2.txt has 2 columns and 7 rows, and
ifile3.txt has 2 columns and 10 rows.
The 1st column of each ifile is the ID;
this ID is sometimes missing from ifile2.txt and ifile3.txt.
I would like to make an outfile.txt with 4 columns: the 1st column holds all the IDs as in ifile1.txt, the 2nd column is $3 from ifile1.txt, the 3rd and 4th columns are $2 from ifile2.txt and ifile3.txt respectively, and missing stations in ifile2.txt and ifile3.txt are assigned the special character '?'.
Desired output:
outfile.txt
1 2 6 ?
2 6 ? ?
3 7 8 8
6 6 8 ?
9 87 ? 0
23 6 6 6
44 76 7 5
56 7 8 7
78 8 ? 89
95 7 ? 65
99 4 0 78
106 34 ? 0
110 4 ? 6
I was trying with the following algorithm, but wasn't able to write a script.
for each i in $1, awk '{printf "%3s %3s %3s %3s\n", $1, $3 (from ifile1.txt),
check if i is present in $1 (ifile2.txt), then
write corresponding $2 values from ifile2.txt
else write ?
similarly check for ifile3.txt
You can do that with GNU AWK using this script:
script.awk
# read lines from the three files
ARGIND == 1 { file1[ $1 ] = $3
# init the other files with ?
file2[ $1 ] = "?"
file3[ $1 ] = "?"
next;
}
ARGIND == 2 { file2[ $1 ] = $2
next;
}
ARGIND == 3 { file3[ $1 ] = $2
next;
}
# output the collected information
END { for( k in file1) {
printf("%3s%6s%6s%6s\n", k, file1[ k ], file2[ k ], file3[ k ])
}
}
Run the script like this: awk -f script.awk ifile1.txt ifile2.txt ifile3.txt > outfile.txt
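Note that ARGIND is GNU-awk-specific. If you need to run this under another awk, a common portable idiom is to count file transitions instead; a sketch of the same script under that assumption (untested, and assuming none of the input files is empty):
# portable variant: bump a counter each time a new file starts
FNR == 1    { fileno++ }
fileno == 1 { file1[ $1 ] = $3
              # init the other files with ?
              file2[ $1 ] = "?"
              file3[ $1 ] = "?"
              next
            }
fileno == 2 { file2[ $1 ] = $2; next }
fileno == 3 { file3[ $1 ] = $2; next }
END { for( k in file1 ) {
          printf("%3s%6s%6s%6s\n", k, file1[ k ], file2[ k ], file3[ k ])
      }
    }
Run it the same way: awk -f script.awk ifile1.txt ifile2.txt ifile3.txt > outfile.txt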

Problems combining awk scripts

I am trying to use awk to parse a tab delimited table -- there are several duplicate entries in the first column, and I need to remove the duplicate rows that have a smaller total sum of the other 4 columns in the table. I can remove the first or second row easily, and sum the columns, but I'm having trouble combining the two. For my purposes there will never be more than 2 duplicates.
Example file: http://pastebin.com/u2GBnm2D
Desired output in this case would be to remove the rows:
lmo0330 1 1 0 1
lmo0506 7 21 2 10
And keep the other two rows with the same gene id in the column. The final parsed file would look like this: http://pastebin.com/WgDkm5ui
Here's what I have tried (this doesn't do anything useful as written, but the first part removes the second duplicate, and the second part sums the counts):
awk 'BEGIN {!a[$1]++} {for(i=1;i<=NF;i++) t+=$i; print t; t=0}'
I tried modifying the 2nd part of the script in the best answer of this question: Removing lines containing a unique first field with awk?
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
But unfortunately I don't really understand what's going on well enough to get it working. Can anyone help me out? I think I need to replace the a[$1] > 1 part with [remove the first or second duplicate count, depending on which is larger].
EDIT: I'm also using GNU Awk 3.1.7 if that matters.
You can use this awk command:
awk 'NR == 1 {
print;
next
} {
s = $2+$3+$4+$5
} s >= sum[$1] {
sum[$1] = s;
if (!($1 in rows))
a[++n] = $1;
rows[$1] = $0
} END {
for(i=1; i<=n; i++)
print rows[a[i]]
}' file | column -t
Output:
gene SRR034450.out.rpkm_0 SRR034451.out.rpkm_0 SRR034452.out.rpkm_0 SRR034453.out.rpkm_0
lmo0001 160 323 533 293
lmo0002 135 317 504 306
lmo0003 1 4 5 3
lmo0004 35 59 58 48
lmo0005 113 218 257 187
lmo0006 279 519 653 539
lmo0007 563 1053 1165 1069
lmo0008 34 84 203 107
lmo0009 13 45 90 49
lmo0010 57 210 237 169
lmo0011 65 224 247 179
lmo0012 65 226 250 215
lmo0013 342 500 738 682
lmo0014 662 1032 1283 1311
lmo0015 321 413 631 637
lmo0016 175 253 273 325
lmo0017 3 6 6 6
lmo0018 33 38 46 45
lmo0019 13 1 39 1
lmo0020 3 12 28 15
lmo0021 3 4 14 12
lmo0022 2 3 5 1
lmo0023 2 0 3 2
lmo0024 1 0 2 6
lmo0330 1 1 1 3
lmo0506 151 232 60 204

Compare two files having different column numbers and print the required lines to a new file if the condition is satisfied

I have two files with more than 10000 rows:
File1 (1 col)    File2 (4 col)
23               23 88 90 0
34               43 74 58 5
43               54 87 52 3
54               73 52 35 4
.                .
.                .
I want to compare each value in file-1 with the first column of file-2. If it exists, then print that value along with the other three values from file-2. In this example the output will be:
23 88 90 0
43 74 58 5
54 87 52 3
.
.
I have written the following script, but it is taking too much time to execute.
s1=1; s2=$(wc -l < File1.txt)
while [ $s1 -le $s2 ]
do n=$(awk 'NR=="$s1" {print $1}' File1.txt)
p1=1; p2=$(wc -l < File2.txt)
while [ $p1 -le $p2 ]
do awk '{if ($1==$n) printf ("%s %s %s %s\n", $1, $2, $3, $4);}'> ofile.txt
(( p1++ ))
done
(( s1++ ))
done
Is there any short/easy way to do it?
You can do it very concisely using awk:
awk 'FNR==NR{found[$1]++; next} $1 in found'
Test
>>> cat file1
23
34
43
54
>>> cat file2
23 88 90 0
43 74 58 5
54 87 52 3
73 52 35 4
>>> awk 'FNR==NR{found[$1]++; next} $1 in found' file1 file2
23 88 90 0
43 74 58 5
54 87 52 3
What it does:
FNR==NR checks whether FNR, the per-file record number, equals NR, the total record number. This is true only for the first file, file1, because FNR is reset to 1 each time awk starts reading a new file.
{found[$1]++; next} If the check is true, this creates an entry in an associative array indexed by $1, the first column of file1, and skips to the next record.
$1 in found This check is only reached for the second file, file2. If the column 1 value $1 is an index in the associative array found, the entire line is printed (no action is written because printing the line is awk's default action).
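One caveat worth knowing: FNR==NR also holds all through the second file if the first file happens to be empty. A common, more robust variant (same idea, just testing the filename instead) is:
awk 'FILENAME == ARGV[1] { found[$1]++; next } $1 in found' file1 file2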
