How to append a special character in awk? - linux

I have three files with different column and row size. For example,
ifile1.txt
1 2 2
2 5 6
3 8 7
6 7 6
9 87 87
23 6 6
44 5 76
56 6 7
78 7 8
95 6 7
99 6 4
106 5 34
110 6 4
ifile2.txt
1 6
3 8
6 8
23 6
44 7
56 8
99 0
ifile3.txt
3 8
9 0
23 6
44 5
56 7
78 89
95 65
99 78
106 0
110 6
Here ifile1.txt has 3 columns and 13 rows,
ifile2.txt has 2 columns and 7 rows,
ifile3.txt has 2 columns and 10 rows.
The 1st column of each ifile is the ID.
This ID is sometimes missing in ifile2.txt and ifile3.txt.
I would like to make an outfile.txt with 4 columns. The 1st column should contain all the IDs from ifile1.txt, the 2nd column should be $3 from ifile1.txt, and the 3rd and 4th columns should be $2 from ifile2.txt and ifile3.txt respectively. IDs that are missing in ifile2.txt or ifile3.txt should be filled with the special character '?'.
Desired output:
outfile.txt
1 2 6 ?
2 6 ? ?
3 7 8 8
6 6 8 ?
9 87 ? 0
23 6 6 6
44 76 7 5
56 7 8 7
78 8 ? 89
95 7 ? 65
99 4 0 78
106 34 ? 0
110 4 ? 6
I was trying the following algorithm, but I am not able to write it as a script.
for each i in $1, awk '{printf "%3s %3s %3s %3s\n", $1, $3 (from ifile1.txt),
check if i is present in $1 (ifile2.txt), then
write corresponding $2 values from ifile2.txt
else write ?
similarly check for ifile3.txt

You can do that with GNU AWK using this script:
script.awk
# read lines from the three files
ARGIND == 1 {
    file1[ $1 ] = $3
    # init the other files with ?
    file2[ $1 ] = "?"
    file3[ $1 ] = "?"
    next
}
ARGIND == 2 { file2[ $1 ] = $2; next }
ARGIND == 3 { file3[ $1 ] = $2; next }
# output the collected information
END {
    # for-in order is otherwise unspecified; sort by numeric ID (gawk only),
    # which matches ifile1.txt because its IDs are ascending
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (k in file1) {
        printf("%3s%6s%6s%6s\n", k, file1[ k ], file2[ k ], file3[ k ])
    }
}
Run the script like this: awk -f script.awk ifile1.txt ifile2.txt ifile3.txt > outfile.txt
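ARGIND is specific to GNU awk. As a minimal sketch for other awks, the same join can be written by counting files with FNR == 1 and remembering the order of the IDs as they appear in ifile1.txt. The function name merge_ids and the array names are hypothetical, and the sketch assumes that none of the three files is empty.

```shell
#!/bin/sh
# merge_ids is a hypothetical helper name, not taken from the answer above.
# Usage: merge_ids ifile1.txt ifile2.txt ifile3.txt > outfile.txt
merge_ids() {
    awk '
        FNR == 1 { fnum++ }    # a new input file starts (assumes no file is empty)
        fnum == 1 { ids[++n] = $1; col2[$1] = $3; next }   # keep ID order and $3
        fnum == 2 { col3[$1] = $2; next }
        fnum == 3 { col4[$1] = $2; next }
        END {
            # walk the IDs in the order they appeared in the first file
            for (i = 1; i <= n; i++) {
                id = ids[i]
                printf "%3s%6s%6s%6s\n", id, col2[id],
                    ((id in col3) ? col3[id] : "?"),
                    ((id in col4) ? col4[id] : "?")
            }
        }
    ' "$1" "$2" "$3"
}
```

Because the IDs are stored in an indexed array, the output follows the row order of ifile1.txt even when the IDs are not sorted.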

Related

How to rearrange the columns using awk?

I have a file with 120 columns. A part of it is here with 12 columns.
A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3
4 4 5 2 3 3 2 1 9 17 25 33
5 6 4 6 8 2 3 5 3 1 -1 -3
7 8 3 10 13 1 4 9 -3 -15 -27 -39
9 10 2 14 18 0 5 13 -9 -31 -53 -75
11 12 1 18 23 -1 6 17 -15 -47 -79 -111
13 14 0 22 28 -2 7 21 -21 -63 -105 -147
15 16 -1 26 33 -3 8 25 -27 -79 -131 -183
17 18 -2 30 38 -4 9 29 -33 -95 -157 -219
19 20 -3 34 43 -5 10 33 -39 -111 -183 -255
21 22 -4 38 48 -6 11 37 -45 -127 -209 -291
I would like to rearrange it by bringing all A columns together (A1 A2 A3 A4) and similarly all Bs (B1 B2 B3 B4), Cs (C1 C2 C3 C4), Ds (D1 D2 D3 D4) together.
I am looking to print the columns as
A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 D1 D2 D3 D4
My script is:
#!/bin/sh
sed -i '1d' input.txt
for i in {1..4};do
j=$(( 1 + $(( 3 * $(( i - 1 )) )) ))
awk '{print $'$j'}' input.txt >> output.txt
done
for i in {1..4};do
j=$(( 2 + $(( 3 * $(( i - 1 )) )) ))
awk '{print $'$j'}' input.txt >> output.txt
done
for i in {1..4};do
j=$(( 3 + $(( 3 * $(( i - 1 )) )) ))
awk '{print $'$j'}' input.txt >> output.txt
done
It is printing all in one column.
Here are two generic solutions that do not hard-code the field numbers from Input_file; the headers can come in any order and are sorted automatically. Written and tested in GNU awk with the shown samples.
1st solution: Store every field of every line keyed by its header, then sort the headers by value and print the columns in that order.
awk '
FNR==1{
for(i=1;i<=NF;i++){
arrInd[i]=$i
}
next
}
{
for(i=1;i<=NF;i++){
value[FNR,arrInd[i]]=$i
}
}
END{
PROCINFO["sorted_in"]="@val_num_asc"
for(i in arrInd){
printf("%s%s",arrInd[i],i==length(arrInd)?ORS:OFS)
}
for(i=2;i<=FNR;i++){
for(k in arrInd){
printf("%s%s",value[i,arrInd[k]],k==length(arrInd)?ORS:OFS)
}
}
}
' Input_file
Or, if you want the output in tabular format, a small tweak to the above solution:
awk '
BEGIN { OFS="\t" }
FNR==1{
for(i=1;i<=NF;i++){
arrInd[i]=$i
}
next
}
{
for(i=1;i<=NF;i++){
value[FNR,arrInd[i]]=$i
}
}
END{
PROCINFO["sorted_in"]="@val_num_asc"
for(i in arrInd){
printf("%s%s",arrInd[i],i==length(arrInd)?ORS:OFS)
}
for(i=2;i<=FNR;i++){
for(k in arrInd){
printf("%s%s",value[i,arrInd[k]],k==length(arrInd)?ORS:OFS)
}
}
}
' Input_file | column -t -s $'\t'
2nd solution: Almost the same concept as the 1st solution, but the header and the data rows are printed from a single loop with a condition in the END block rather than from two separate loops.
awk '
BEGIN { OFS="\t" }
FNR==1{
for(i=1;i<=NF;i++){
arrInd[i]=$i
}
next
}
{
for(i=1;i<=NF;i++){
value[FNR,arrInd[i]]=$i
}
}
END{
PROCINFO["sorted_in"]="@val_num_asc"
for(i=1;i<=FNR;i++){
if(i==1){
for(k in arrInd){
printf("%s%s",arrInd[k],k==length(arrInd)?ORS:OFS)
}
}
else{
for(k in arrInd){
printf("%s%s",value[i,arrInd[k]],k==length(arrInd)?ORS:OFS)
}
}
}
}
' Input_file | column -t -s $'\t'
Is it just A,B,C,D,A,B,C,D all the way across? Something like this should work (quick and dirty and specific though it be):
awk -v OFS='\t' '{
    for (i=0; i<4; ++i) { # i=0:A, i=1:B,etc.
        for (j=0; 4*j+i<NF; ++j) {
            if (i || j) printf "%s", OFS;
            printf "%s", $(4*j+i+1);
        }
    }
    printf "%s", ORS;
}'
A similar approach to @MarkReed that manipulates the increment instead of the test condition can be written as:
awk '{
for (n=1; n<=4; n++)
for (c=n; c<=NF; c+=4)
printf "%s%s", ((c>1)?"\t":""), $c
print ""
}
' cols.txt
Example Use/Output
With your sample input in cols.txt you would have:
$ awk '{
> for (n=1; n<=4; n++)
> for (c=n; c<=NF; c+=4)
> printf "%s%s", ((c>1)?"\t":""), $c
> print ""
> }
> ' cols.txt
A1 A2 A3 B1 B2 B3 C1 C2 C3 D1 D2 D3
4 3 9 4 3 17 5 2 25 2 1 33
5 8 3 6 2 1 4 3 -1 6 5 -3
7 13 -3 8 1 -15 3 4 -27 10 9 -39
9 18 -9 10 0 -31 2 5 -53 14 13 -75
11 23 -15 12 -1 -47 1 6 -79 18 17 -111
13 28 -21 14 -2 -63 0 7 -105 22 21 -147
15 33 -27 16 -3 -79 -1 8 -131 26 25 -183
17 38 -33 18 -4 -95 -2 9 -157 30 29 -219
19 43 -39 20 -5 -111 -3 10 -183 34 33 -255
21 48 -45 22 -6 -127 -4 11 -209 38 37 -291
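The stride of 4 in the loops above is the number of distinct letters (A to D). As a minimal sketch, assuming the columns always cycle through the same n groups, the stride can be passed in as a variable. The function name regroup_cols is hypothetical.

```shell
#!/bin/sh
# regroup_cols is a hypothetical helper name.
# Usage: regroup_cols N file   (N = number of groups the columns cycle through)
regroup_cols() {
    awk -v n="$1" '{
        line = ""; sep = ""
        for (g = 1; g <= n; g++)            # g-th group: A, B, C, ...
            for (c = g; c <= NF; c += n) {  # every n-th column belongs to group g
                line = line sep $c
                sep = "\t"
            }
        print line
    }' "$2"
}
```

For the sample in cols.txt, regroup_cols 4 cols.txt would be the equivalent call.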
Here's a succinct generic solution that is not memory-bound, as RavinderSingh13's solution is. (That is, it does not store the entire input in an array for printing in END.)
BEGIN {
    OFS="\t" # output field separator
}
NR==1 {
    # Sort column titles
    for (i=1;i<=NF;i++) { sorted[i]=$i; position[$i]=i }
    asort(sorted)
    # And print them
    for (i=1;i<=NF;i++) { $i=sorted[i] }
    print
    next
}
{
    # Make an array of our input line...
    split($0,line)
    for (i=1;i<=NF;i++) { $i=line[position[sorted[i]]] }
    print
}
The idea here is that at the first line of input, we record the position of our columns in the input, then sort the list of column names with asort(). It is important here that column names are not duplicated, as they are used as the index of an array.
As we step through the data, each line is reordered by replacing each field with the value from the position as sorted by the first line.
It is important that you set your input field separator correctly (whitespace, tab, comma, whatever), and have the complete set of fields in each line, or output will be garbled.
Also, this doesn't create columns. You mentioned A4 in your question, but there is no A4 in your sample data. We are only sorting what is there.
Lastly, this is a GNU awk program, due to the use of asort().
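If GNU awk is not available, the asort() call can be replaced by a small hand-written sort. The following is a minimal sketch of that idea, not the answer's own code: it insertion-sorts the column positions by header name using string comparison, so A10 sorts before A2, and then prints every line in that order. The function name sort_columns and the array names are hypothetical.

```shell
#!/bin/sh
# sort_columns is a hypothetical helper name.
# Usage: sort_columns file
sort_columns() {
    awk '
    BEGIN { OFS = "\t" }
    NR == 1 {
        for (i = 1; i <= NF; i++) { name[i] = $i; pos[i] = i }
        # insertion sort of the column positions by header name;
        # appending "" forces a string comparison
        for (i = 2; i <= NF; i++) {
            p = pos[i]
            for (j = i - 1; j >= 1 && (name[pos[j]] "") > (name[p] ""); j--)
                pos[j + 1] = pos[j]
            pos[j + 1] = p
        }
        n = NF
    }
    {
        # the header line falls through to here as well
        out = ""
        for (i = 1; i <= n; i++)
            out = out (i > 1 ? OFS : "") $(pos[i])
        print out
    }' "$1"
}
```

Like the asort() version, this keeps only one line in memory at a time and expects every line to have the full set of fields.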
Using any awk, for any number of tags (the non-numeric leading strings in the header line) and any numbers associated with them, including a different count per tag (so you could have A1 A2 but then B1 B2 B3 B4). This reproduces the input order in the output and stores only 1 line at a time in memory:
$ cat tst.awk
BEGIN { OFS="\t" }
NR == 1 {
    for ( fldNr=1; fldNr<=NF; fldNr++ ) {
        tag = $fldNr
        sub(/[0-9]+$/,"",tag)
        if ( !seen[tag]++ ) {
            tags[++numTags] = tag
        }
        fldNrs[tag,++numTagCols[tag]] = fldNr
    }
}
{
    out = ""
    for ( tagNr=1; tagNr<=numTags; tagNr++ ) {
        tag = tags[tagNr]
        for ( tagColNr=1; tagColNr<=numTagCols[tag]; tagColNr++ ) {
            fldNr = fldNrs[tag,tagColNr]
            out = (out=="" ? "" : out OFS) $fldNr
        }
    }
    print out
}
$ awk -f tst.awk file
A1 A2 A3 B1 B2 B3 C1 C2 C3 D1 D2 D3
4 3 9 4 3 17 5 2 25 2 1 33
5 8 3 6 2 1 4 3 -1 6 5 -3
7 13 -3 8 1 -15 3 4 -27 10 9 -39
9 18 -9 10 0 -31 2 5 -53 14 13 -75
11 23 -15 12 -1 -47 1 6 -79 18 17 -111
13 28 -21 14 -2 -63 0 7 -105 22 21 -147
15 33 -27 16 -3 -79 -1 8 -131 26 25 -183
17 38 -33 18 -4 -95 -2 9 -157 30 29 -219
19 43 -39 20 -5 -111 -3 10 -183 34 33 -255
21 48 -45 22 -6 -127 -4 11 -209 38 37 -291
or with different formats of tags and different numbers of columns per tag:
$ cat file
foo1 bar1 bar2 bar3 foo2 bar4
4 4 5 2 3 3
5 6 4 6 8 2
$ awk -f tst.awk file
foo1 foo2 bar1 bar2 bar3 bar4
4 3 4 5 2 3
5 8 6 4 6 2
The above assumes you want the output order per tag to match the input order, not be based on the numeric values after each tag so if you have input of A2 B1 A1 then the output will be A2 A1 B1, not A1 A2 B1.

How to print contents of column fields that have strings composed of "n" character/s using bash?

Say I have a file which contains:
22 30 31 3a 31 32 3a 32 " 0 9 : 1 2 : 2
30 32 30 20 32 32 3a 31 1 2 7 2 2 : 1
And, I want to print only the column fields that have string composed of 1 character. I want the output to be like this:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
Then, I want to print only those strings that are composed of two characters, the output should be:
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
I am a beginner and I really don't know how to do this. Thanks for your help!
Could you please try the following? It approaches the problem in a different way and was written and tested with the provided samples only.
To get the values before the BULK SPACE, try:
awk '
{
line=$0
while(match($0,/[[:space:]]+/)){
arr=arr>RLENGTH?arr:RLENGTH
start[arr]+=RSTART+prev_start
prev_start=RSTART
$0=substr($0,RSTART+RLENGTH)
}
var=substr(line,1,start[arr]-1)
sub(/ +$/,"",var)
print var
delete start
var=arr=""
}
' Input_file
Output will be as follows.
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
To get the values after the BULK SPACE, try:
awk '
{
line=$0
while(match($0,/[[:space:]]+/)){
arr=arr>RLENGTH?arr:RLENGTH
start[arr]+=RSTART+prev_start
prev_start=RSTART
$0=substr($0,RSTART+RLENGTH)
}
var=substr(line,start[arr])
sub(/^ +/,"",var)
print var
delete start
var=arr=""
}
' Input_file
Output will be as follows:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
You can try
awk '{for(i=1;i<=NF;++i)if(length($i)==1)printf("%s ", $i);print("")}'
For each field, check the length and print the field if it has the desired length. You may pass the -F option to awk if the fields are not separated by blanks.
The awk script is expanded as:
for( i = 1; i <= NF; ++i )
    if( length( $i ) == 1 )
        printf( "%s ", $i );
print( "" );
The print outside the loop prints a newline after each input line.
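The question asks for the same filter with two different lengths, so the length can be passed in as a variable. This is a minimal sketch assuming whitespace-separated fields; the function name fields_of_len is hypothetical. It also builds the line before printing, which avoids the trailing space of the one-liner.

```shell
#!/bin/sh
# fields_of_len is a hypothetical helper name.
# Usage: fields_of_len LEN file
fields_of_len() {
    awk -v len="$1" '{
        out = ""; sep = ""
        for (i = 1; i <= NF; i++)
            if (length($i) == len) {   # keep only fields of exactly len characters
                out = out sep $i
                sep = " "
            }
        print out
    }' "$2"
}
```

Calling it with 1 and then with 2 would give the two outputs asked for. A field that consists of a single space cannot be detected this way, because default field splitting discards it.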
Assuming all the columns are tab-separated (So you can have a space as a column value like the second line of your sample), easy to do with a perl one-liner:
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^.$/ } @F' foo.txt
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^..$/ } @F' foo.txt
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31

Put every N rows of input into a new column

In bash, given input
1
2
3
4
5
6
7
8
...
And N for example 5, I want the output
1 6 11
2 7 12
3 8 ...
4 9
5 10
How do I do this?
Using a little known gem pr:
$ seq 20 | pr -ts' ' --column 4
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
5 10 15 20
Replace 5 in the following script with your number.
seq 20|xargs -n5| awk '{for (i=1;i<=NF;i++) a[i,NR]=$i; if (NF>m) m=NF }END{
for(i=1;i<=m;i++) {for(j=1;j<=NR;j++)printf "%s ", a[i,j]; print "" }}'
output:
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
5 10 15 20
Note that seq 20 above is only there to generate a number sequence for testing. You don't need it in your real work.
EDIT
As pointed out by sudo_O, here is a pure awk solution:
awk -v n=5 '{a[NR]=$0}END{ x=1; while (x<=n){ for(i=x;i<=length(a);i+=n) printf "%s ", a[i]; print ""; x++; } }' file
Test:
kent$ seq 20| awk -v n=5 '{a[NR]=$0}END{ x=1; while (x<=n){ for(i=x;i<=length(a);i+=n) printf "%s ", a[i]; print ""; x++; } }'
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
5 10 15 20
kent$ seq 12| awk -v n=5 '{a[NR]=$0}END{ x=1; while (x<=n){ for(i=x;i<=length(a);i+=n) printf "%s ", a[i]; print ""; x++; } }'
1 6 11
2 7 12
3 8
4 9
5 10
Here's how I'd do it with awk:
awk -v n=5 '{ c++ } c>n { c=1 } { a[c] = (a[c] ? a[c] FS : "") $0 } END { for (i=1;i<=n;i++) print a[i] }'
Some simple testing:
seq 21 | awk -v n=5 '{ c++ } c>n { c=1 } { a[c] = (a[c] ? a[c] FS : "") $0 } END { for (i=1;i<=n;i++) print a[i] | "column -t" }'
Results:
1 6 11 16 21
2 7 12 17
3 8 13 18
4 9 14 19
5 10 15 20
And another:
seq 40 | awk -v n=6 '{ c++ } c>n { c=1 } { a[c] = (a[c] ? a[c] FS : "") $0 } END { for (i=1;i<=n;i++) print a[i] | "column -t" }'
Results:
1 7 13 19 25 31 37
2 8 14 20 26 32 38
3 9 15 21 27 33 39
4 10 16 22 28 34 40
5 11 17 23 29 35
6 12 18 24 30 36

Count Number of occurrence at each line

I have the following file
ENST001 ENST002 4 4 4 88 9 9
ENST004 3 3 3 99 8 8
ENST009 ENST010 ENST006 8 8 8 77 8 8
Basically I want to count how many times ENST* is repeated in each line so the expected results is
2
1
3
Any suggestion please ?
Try this:
awk '{print gsub("ENST[0-9]+","")}' INPUTFILE
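The one-liner works because gsub() returns the number of substitutions it made, and it rewrites $0 as a side effect. It also counts matches that sit inside a longer word. As a minimal sketch of an alternative, assuming each ENST identifier is a whole whitespace-separated field, the fields can be counted directly without modifying the line. The function name count_enst is hypothetical.

```shell
#!/bin/sh
# count_enst is a hypothetical helper name.
# Usage: count_enst file     (reads standard input when no file is given)
count_enst() {
    awk '{
        n = 0
        for (i = 1; i <= NF; i++)
            if ($i ~ /^ENST[0-9]+$/) n++   # count only fields that are entirely ENST + digits
        print n
    }' "$@"
}
```

For the sample file in the question both variants print the same counts.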

Supplement patterns

I have these kind of records in a file:
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1871 121 1 13
1871 121 2 194
I would like to get this output:
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1870 121 0 0
1871 121 1 13
1871 121 2 194
The difference is the 1870 121 0 0 row.
So, if the difference between consecutive numbers in the first column is greater than 1, a line has to be inserted for each missing number (1870 in the case above), together with the other columns. The second column of an inserted line should be the minimum of the values that occur in that column (in the example these are 121 and 122), and likewise for the third column. The last column should always be zero.
Can anybody suggest something? Thanks in advance!
I am trying to solve it with awk, but maybe there are other nicer or more practical solutions for this...
Something like this could work -
awk 'BEGIN{getline;a=$1;b=$2;c=$3}
NR==FNR{if (b>$2) b=$2; if (c>$3) c=$3;next}
{if ($1-a>1) {x=($1-a); for (i=1;i<x;i++) {print (a+1)"\t"b,c,"0";a++};a=$1} else a=$1;print}' file file
Explanation:
BEGIN{getline;a=$1;b=$2;c=$3} -
In this BEGIN block we read the first line and assign values in column 1 to variable a, column 2 to variable b and column 3 to variable c.
NR==FNR{if (b>$2) b=$2; if (c>$3) c=$3;next} -
In this we scan through the entire file (NR==FNR) and keep track of the lowest possible values in column 2 and column 3 and store them in variables b and c respectively. We use next to avoid running the second pattern{action} statement.
{if ($1-a>1) {x=($1-a); for (i=1;i<x;i++) {print (a+1)"\t"b,c,"0";a++};a=$1} else a=$1;print} -
This action statement compares the value in column 1 with a. If the difference is more than 1, a for loop prints all the missing lines and a is then set to $1. Otherwise we simply assign the value of column 1 to a. In both cases the current line is then printed.
Test:
[jaypal:~/Temp] cat file
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1871 121 1 13 # <--- 1870 skipped
1871 121 2 194
1875 120 1 12 # <--- 1872, 1873, 1874 skipped
[jaypal:~/Temp] awk 'BEGIN{getline;a=$1;b=$2;c=$3}
NR==FNR{if (b>$2) b=$2; if (c>$3) c=$3;next}
{if ($1-a>1) {x=($1-a); for (i=1;i<x;i++) {print (a+1)"\t"b,c,"0";a++};a=$1} else a=$1;print}' file file
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1870 120 0 0 # Assigned minimum value in col 2 (120) and col 3 (0).
1871 121 1 13
1871 121 2 194
1872 120 0 0 # Assigned minimum value in col 2 (120) and col 3 (0).
1873 120 0 0 # Assigned minimum value in col 2 (120) and col 3 (0).
1874 120 0 0 # Assigned minimum value in col 2 (120) and col 3 (0).
1875 120 1 12
Perl solution. Should work for large files, too, as it does not load the whole file into memory, but goes over the file two times.
#!/usr/bin/perl
use warnings;
use strict;
my $file = shift;
open my $IN, '<', $file or die $!;
# first pass: minimums of columns 2 and 3
my @mins;
while (<$IN>) {
    my @cols = split;
    for (0, 1) {
        $mins[$_] = $cols[$_ + 1] if ! defined $mins[$_]
                                  or $cols[$_ + 1] < $mins[$_];
    }
}
# second pass: print the lines and fill the gaps in column 1
seek $IN, 0, 0;
my $last;
while (<$IN>) {
    my @cols = split;
    $last //= $cols[0];
    for my $i ($last .. $cols[0]-2) {
        print $i + 1, "\t@mins 0\n";
    }
    print;
    $last = $cols[0];
}
A Bash solution:
# input file (assumption: passed as the first argument)
infile=$1
# initialize minimum of 2. and 3. column
read no min2 min3 c4 < "$infile"
# get minimum of 2. and 3. column
while read c1 c2 c3 c4 ; do
    [ "$c2" -lt "$min2" ] && min2=$c2
    [ "$c3" -lt "$min3" ] && min3=$c3
done < "$infile"
while read c1 c2 c3 c4 ; do
    # insert missing line(s) ?
    while (( c1 - no > 1 )) ; do
        ((no++))
        echo "$no $min2 $min3 0"
    done
    # now insert existing line
    echo "$c1 $c2 $c3 $c4"
    no=$c1
done < "$infile"
One way using awk:
BEGIN {
    ## Exactly one input file is expected.
    if ( ARGC != 2 ) {
        print "Usage: awk -f script.awk <file-name>"
        exit 1
    }
    ## Need to process file twice, duplicate the input filename.
    ARGV[2] = ARGV[1]
    ++ARGC
    col2 = -1
    col3 = -1
}
## First processing of file. Get min values of second and third columns.
FNR == NR {
    col2 = col2 < 0 || col2 > $2 ? $2 : col2
    col3 = col3 < 0 || col3 > $3 ? $3 : col3
    next
}
## Second processing of file.
FNR < NR {
    ## Get value of column 1 in first row.
    if ( FNR == 1 ) {
        col1 = $1
        print
        next
    }
    ## Compare current value of column 1 with value of previous row.
    ## Add a new row while difference is bigger than '1'.
    while ( $1 - col1 > 1 ) {
        ++col1
        printf "%d\t%d %d %d\n", col1, col2, col3, 0
    }
    ## Assign new value of column 1.
    col1 = $1
    print
}
Running the script:
awk -f script.awk infile
Result:
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1870 121 0 0
1871 121 1 13
1871 121 2 194
