Extracting information using Awk - string

This post is related to my previous question about string splitting: Awk split string into words and numbers. Let's say we have the following string:
1A5T4
This string encodes the following information:
A at position 2 (1 item before A)
T at position 8 (7 items before T, i.e. 1 + A + 5)
No more letters appear past the rightmost one, which means there is no more relevant information to extract.
So the desired output here is A T 2 8
I'd like to write the Awk script to get this information, preferably in two arrays: one containing positions, the other containing letters. I thought this would be a convenient way to store it, as I need to use the values in other parts of the script that I am writing (or rather struggling to write).
I thought the first step would be to delimit the string by splitting it (credit goes to the helpful commenters on Awk split string into words and numbers).
echo 1A5T4 | awk '{gsub(/[^0-9]+/," & ")}1'
1 A 5 T 4
But maybe the delimiter is not necessary. I tried to do the task using a for loop, iterating through consecutive letter-number pairs and adding them to the arrays. However, I was not able to make it work (there is no array yet, as I could not get the loop to work properly):
echo 1A5T4 | awk '{gsub(/[0-9]+$/,"", $0); a = $0}{for (i = 1; i <= length(a); i++2) {b = substr(a, i, 1) + 1 + b; print b}}'
2
3
9
10
(The idea here was to get only the numbers first, and then the letters in a separate for loop.)
I also had the idea of expanding the string like this: .A.....T.... and then getting the positions of the letters by counting string lengths from the beginning until the letter.
The strings that I need to process will contain one more complication - another type of block: caret followed by a set of letters. In this block, the number of letters following a caret will be added to the final indices. Example below:
1A2^CCG3T4
A is 2 (as in the example above)
T is 11 (2 + 2 + 3 (the number of letters in CCG following the caret) + 3, i.e. 10 positions that precede T)
So the desired output here is A T 2 11
The letters following the caret are not relevant for anything else, except shifting the indices of the following letters by the length of the caret block.
It would be great to get some helpful hints on how to tackle this.
Clarification: the script should output all letters, as long as they are not preceded by a caret. The letters after the caret only shift the indices. For example:
27T19T^A16G8G29
should give
T T G G 28 48 66 75
and
27T19T16G8G29
should give
T T G G 28 48 65 74
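To make the goal concrete, here is a rough sketch (plain awk, written only for the block types described above, so it may well need tweaking) of the kind of logic I have in mind: numbers advance the position, caret blocks only shift it, and plain letters get collected into the two arrays:
# sketch.awk -- letters[] and positions[] are the two arrays I'd like to end up with
{
    pos = 0; n = 0
    split("", letters); split("", positions)   # reset the arrays for each line
    s = $0
    while (s != "") {
        if (match(s, /^[0-9]+/)) {              # a count of skipped items
            pos += substr(s, 1, RLENGTH)
            s = substr(s, RLENGTH + 1)
        } else if (match(s, /^\^[A-Za-z]+/)) {  # caret block: only shifts the positions
            pos += RLENGTH - 1
            s = substr(s, RLENGTH + 1)
        } else {                                # a letter to report
            n++
            letters[n] = substr(s, 1, 1)
            positions[n] = ++pos
            s = substr(s, 2)
        }
    }
    out = ""
    for (i = 1; i <= n; i++) out = out letters[i] " "
    for (i = 1; i <= n; i++) out = out positions[i] (i < n ? " " : "")
    print out
}
For example:
echo "27T19T^A16G8G29" | awk -f sketch.awk
T T G G 28 48 66 75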
Update:
Thanks to @vgersh99, I managed to improve the code. It first converts the text blocks that follow each caret to the same format as the other blocks. Then all the blocks are dealt with in the same way (the for loop), and in the end the caret values are simply not displayed (the if statement). However, there is still a problem when there are multiple caret blocks of variable lengths.
1A5T4
1A1^AAAAA2T2
1A2^CCG3T4
27T19T^A16G8G29
27T19T16G8G29
1A^AA5^TT4T4
10A3A1G9A10A25^TT1^G1^G42T12^G1G29
{
    match($0, /\^[A-Z]+/);
    a = "^"length(substr($0, RSTART, RLENGTH))-2"^";
    gsub(/\^[A-Z]+/, a)
}
# if a letter is directly followed by a caret, such carets are removed, as they would have count==0
{
    a = match($0, /[A-Z]+\^/);
    a = substr($0, RSTART, RLENGTH-1);
    gsub(/[A-Z]+\^/, a)
}
# intermediate string with transformed caret blocks is then used further
{
    sum=0; delete(out); str=""
    n=patsplit($0,b, /[[:alpha:]^]/, seps);
    for(i=1; i<=n;i++) {
        sum+=seps[i-1]+1
        # print b[i], sum
        if (b[i]!="^")
            {out[sum]=b[i]}
    }
    PROCINFO["sorted_in"] = "#ind_num_asc"
    for(i in out) {
        printf("%s ", out[i])
        str=(str? str OFS:"") i
    }
    print str
}
Running this script over tst.txt (the input lines above) gives:
A T 2 8
A T 2 12
A T 2 12
T T G G 28 48 66 75
T T G G 28 48 65 74
A T 2 17
A A G A A T G 11 15 17 27 38 117 134
The last two values in the last row are incorrect; they should be 112 and 127.
This is because gsub computes the replacement from the first match only, and therefore all the replacements are identical in the intermediate string:
10A3A1G9A10A25^1^1^1^1^1^42T12^1^1G29
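One way around that (just a sketch, keeping the same "^<length minus 2>^" convention as the prep block above; I have not verified every downstream case) would be to replace each caret block individually with match()/substr() instead of a single gsub(), so every block gets its own value:
{
    out = ""
    while (match($0, /\^[A-Z]+/)) {
        # append everything before this caret block plus this block's own replacement
        out = out substr($0, 1, RSTART - 1) "^" (RLENGTH - 2) "^"
        $0  = substr($0, RSTART + RLENGTH)
    }
    $0 = out $0
}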

It's a rough approximation, as I'm a bit confused about your explanation...
Will probably need to be tweaked a bit...
The implementation is gawk-specific, using gawk's support for patsplit and PROCINFO["sorted_in"].
Given myFile.txt:
1A5T4
1A1^AAAAA2T2
1A2^CCG3T4
27T19T^A16G8G29
27T19T16G8G29
1A^AA5^TT4T4
10A3A1G9A10A25^TT1^G1^G42T12^G1G29
$ cat tst.awk
# prep block for the following "core" mod block
{
    # if a caret is followed by letters, substitute it by a caret followed by the length of
    # the letter string (-1) followed by a caret
    # eg: 1A1^AAAAA2T2 -> 1A1^4^2T2
    #$0=gensub(/\^([[:alpha:]]+)/,"^" length("\\2")-2 "^","G")
    if(match($0,/\^([[:alpha:]]+)/,sub1))
        for (i=1;i in sub1;i++)
            sub(sub1[i],int(sub1[i,"length"])-1 "^")
    # if a letter is directly followed by a caret, such carets are removed, as they would have count==0
    $0=gensub(/([[:alpha:]])\^/,"\\1","G")
    #print "[" $0 "]"
    #next
}
# "core" mod block
# intermediate string with transformed caret blocks is then used further
{
    sum=0; delete(out); str=""
    n=patsplit($0,b, /[[:alpha:]^]/, seps);
    for(i=1; i<=n;i++) {
        sum+=seps[i-1]+1
        # print b[i], sum
        if (b[i]!="^")
            {out[sum]=b[i]}
    }
    PROCINFO["sorted_in"] = "#ind_num_asc"
    for(i in out) {
        printf("%s ", out[i])
        str=(str? str OFS:"") i
    }
    print str
}
$ gawk -f tst.awk myFile.txt
A T 2 8
A T 2 12
A T 2 12
T T G G 28 48 66 75
T T G G 28 48 65 74
A T T T 2 11 12 17
A A G A A G G T G G 11 15 17 27 38 69 72 115 129 131

% echo 1A5T4 | gawk 'BEGIN{ FS=""; }{ for (i=1;i<=NF;i++) { if($i>="A"){ s=s $i } else { for(j=1;j<=$i;j++)s=s "." }} print s }'
.A.....T....
% echo 1A2^CCG3T4 | gawk 'BEGIN{ FS=""; }{ for (i=1;i<=NF;i++) { if($i>="A"){ s=s $i } else { for(j=1;j<=$i;j++)s=s "." }} print s }'
.A..^CCG...T....
%
maybe the caret handling is wrong, but that should not be too hard to fix...
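Building on the same expansion idea (my own rough sketch, not part of the answer above): once the string is expanded, the letters and their positions can be read off by scanning it character by character:
% echo .A.....T.... | gawk '{ for (i = 1; i <= length($0); i++) { c = substr($0, i, 1); if (c != ".") { l = l c " "; p = p i " " } } print l p }'
A T 2 8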

maybe try this
{mawk/mawk2/gawk} 'BEGIN { FS = "[=]+";
                           OFS = "=";
} {
    outC = outP = pos = "";
    gsub(/\^/, "=&" );     # first split carets next to letters
    gsub(/[0-9]+/, "=&="); # insert delims around numbers
} { $1 = $1 } {
    while (match($0, /[\^][A-Z]+/)) { sub(/[\^][A-Z]+/, RLENGTH -1) }
} {
    x = 1; do {
        if ($(x) ~ /[0-9]+|^$/) { pos += int($(x)) } else {
            outC = outC "" $(x) " ";
            outP = outP "" ++pos " ";
    } } while (++x <= NF); print outC outP; } '
This version of the solution works in mawk and mawk2 as well. It doesn't require any patsplit / FPAT logic, nor any other gawk-specific feature, and it doesn't even require a single call to substr().
It also avoids the hash-index overhead associated with arrays, and it doesn't require any sorting either, since the input is read sequentially left to right anyway.

Related

How to print output in table format in shell script

I am new to shell scripting. I want to distribute all the data of a file in a table format and redirect the output into another file.
I have the input file File.txt below:
Fruit_label:1 Fruit_name:Apple
Color:Red
Type: S
No.of seeds:10
Color of seeds :brown
Fruit_label:2 fruit_name:Banana
Color:Yellow
Type:NS
I want it to look like this:
Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds
1 | apple | red | S | 10 | brown
2 | banana| yellow | NS
I want to read all the data line by line from the text file, build the header (Fruit_label, Fruit_name, Color, Type, No.of seeds, Color of seeds), and then print all the assigned values in rows. The data differs between fruits; for example, a banana doesn't have seeds, so I want to keep that row value blank.
Can anyone help me here?
Another approach is a "Decorate & Process" approach. What is "Decorate & Process"? To Decorate is to take the text you have and decorate it with another separator to make field-splitting easier -- in your case the fields can contain whitespace along with the ':' separator between the field-names and values, and the inconsistent whitespace around ':' makes it a nightmare to process simply.
So instead of worrying about what the separator is, think about "What should the fields be?" and then add a new separator (Decorate) between the fields and then Process with awk.
Here sed is used to Decorate your input with '|' as separators (a second call eliminates the '|' after the last field) and then a simpler awk process is used to split() the fields on ':' to obtain the field-name and field-value; the field-value is simply printed and the field-names are stored in an array. When a duplicate field-name is found, it is used as a "seen" marker to designate the change between records, e.g.
sed -E 's/([^:]+:[[:blank:]]*[^[:blank:]]+)[[:blank:]]*/\1|/g' file |
sed 's/|$//' |
awk '
BEGIN { FS = "|" }
{
    for (i=1; i<=NF; i++) {
        if (split ($i, parts, /[[:blank:]]*:[[:blank:]]*/)) {
            if (! n || parts[1] in fldnames) {
                printf "%s %s", n ? "\n" : "", parts[2]
                delete fldnames
                n = 1
            }
            else
                printf " | %s", parts[2]
            fldnames[parts[1]]++
        }
    }
}
END { print "" }
'
Example Output
With your input in file you would have:
1 | Apple | Red | S | 10 | brown
2 | Banana | Yellow | NS
You will also see a "Decorate-Sort-Undecorate" approach used to sort data on a new, non-existent column of values: "Decorate" your data with a new last field, sort on that field, and then "Undecorate" to remove the additional field when sorting is done. This allows sorting by data that may be the sum (or combination) of any two columns, etc.
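For instance (a hypothetical illustration: assume data.txt holds lines of the form "name qty price" and you want them sorted by qty * price):
awk '{ print $2 * $3, $0 }' data.txt |   # Decorate: prepend the computed sort key
sort -k1,1nr |                           # Sort on that key, numerically, descending
cut -d' ' -f2-                           # Undecorate: strip the key again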
Here is my solution. It is a New Year gift; usually you have to demonstrate what you have tried so far and we help you, rather than do it for you.
Disclaimer: some guru will probably come up with a simpler awk version, but this works.
File script.awk
# Remove space prefix
function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }
# Remove space suffix
function rtrim(s) { sub(/[ \t\r\n]+$/, "", s); return s }
# Remove both suffix and prefix spaces
function trim(s) { return rtrim(ltrim(s)); }
# Initialise or reset a fruit array
function array_init() {
    for (i = 0; i <= 6; ++i) {
        fruit[i] = ""
    }
}
# Print the content of the fruit
function array_print() {
    # Track whether something was printed, so that no newline is emitted
    # for an empty array.
    printedsomething = 0
    for (i = 0; i <= 6; ++i) {
        # Do not print if the content is empty
        if (fruit[i] != "") {
            printedsomething = 1
            if (i == 1) {
                # The first field must be further split, to remove "Fruit_name"
                # Split on the space
                split(fruit[i], temparr, / /)
                printf "%s", trim(temparr[1])
            }
            else {
                printf " | %s", trim(fruit[i])
            }
        }
    }
    if ( printedsomething == 1 ) {
        print ""
    }
}
BEGIN {
    FS = ":"
    print "Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds"
    array_init()
}
/Fruit_label/ {
    array_print()
    array_init()
    fruit[1] = $2
    fruit[2] = $3
}
/Color:/ {
    fruit[3] = $2
}
/Type/ {
    fruit[4] = $2
}
/No.of seeds/ {
    fruit[5] = $2
}
/Color of seeds/ {
    fruit[6] = $2
}
END { array_print() }
To execute, call awk -f script.awk File.txt
awk processes a file line per line. So the idea is to store fruit information into an array.
Every time the line "Fruit_label:....." is found, print the current fruit and start a new one.
Since each line is read in sequence, you tell awk what to do with each line, based on a pattern.
The patterns are what are enclosed between // characters at the beginning of each section of code.
Difficulty: since the first line contains two pieces of information for every fruit, and I cut the lines on the ':' character, the Fruit_label field will include "Fruit_name".
I.e.: the first line is cut like this: $1 = Fruit_label, $2 = 1 Fruit_name, $3 = Apple
This is why the array_print() function is so complicated.
Trim functions are there to remove spaces.
For example, for the Apple, "Type: S" split on ':' gives " S", which trim() reduces to "S".
If it meets your requirements, please see https://stackoverflow.com/help/someone-answers to accept it.

How to keep values from a messy table based on string characteristics

I have a really difficult file.asv (values separated with #) that contains lines with a varying number of columns.
Example:
name#age#city#lat#long
eric#paris#4.4283333333333331e+01#-1.0550000000000000e+02
dan#43#berlin#3.1366000000000000e+01#-1.0371500000000000e+02
london##2.5250000000000000e+01#1.0538333000000000e+02
Latitude and longitude values are pretty consistent. They have 22 or 23 characters (depending on whether the sign is negative or, when positive, absent), and are always in scientific notation. I would like to keep only the latitude and longitude from each line.
Expected output:
lat#long
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02
Headers are not totally necessary, I can add them later. I could also work with separated latitude and longitude outputs, and then paste them together. Any sed or awk command I could use?
Use this awk:
awk 'BEGIN{OFS=FS="#"} {print $(NF-1),$NF}' file
Here,
OFS - Output Field Separator
FS - Input Field Separator
NF - Number of Fields
Assuming that latitude and longitude are always the last two fields, $(NF-1) and $NF will print them.
Test:
$ awk 'BEGIN{OFS=FS="#"} {print $(NF-1),$NF}' file
lat#long
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02
A simple grep would do, assuming the -o option is available:
$ grep -o '[^#]*#[^#]*$' file.asv
lat#long
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02
Trying to select fields using a regular expression:
$ cat ll.awk
function rep(c, n, ans) { # repeat `c' `n' times
    while (n--) ans = ans c
    return ans
}
function build_re( d, s, os) { # build a regexp `r'
    d = "[0-9]"   # a digit
    s = "[+-]"    # sign
    os = s "?"    # optional sign
    r = os d "[.]" rep(d, 16) "e" s d d # adjust here
    r = "^" r "$" # match entire string
}
function process( sep, line, i) {
    for (i = 1; i <= NF; i++) {
        if (!($i ~ r)) continue # skip fields
        line = line sep $i; sep = FS
    }
    if (length(line)) print line
}
BEGIN {
    build_re()
    FS = "#"
}
{ # call on every line of input
    process()
}
Usage:
$ awk -f ll.awk file.txt
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02
One more in GNU awk, using gensub:
$ awk '{print gensub(/(.+)((#[^#]+){2})$/,"\\2","g",$0)}' file
#lat#long
#4.4283333333333331e+01#-1.0550000000000000e+02
#3.1366000000000000e+01#-1.0371500000000000e+02
#2.5250000000000000e+01#1.0538333000000000e+02
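If the leading '#' is unwanted, the same idea works with the last separator kept out of the second capture group, e.g. (a variant sketch):
$ awk '{print gensub(/(.+)#([^#]+#[^#]+)$/,"\\2","g",$0)}' file
lat#long
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02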

How to add spaces to a number: one space after the first four digits, then one after every two, using awk

I want to add multiple spaces to a number like "20120911162500": a space after the first four digits, then after every two.
Desired output is
2012 09 11 16 25 00
This is what I tried:
echo "2012 09 11 16 25 00" |sed 's/.\{4\}/& /g'
but the output is 2012 0911 1625 00.
This might work for you (GNU sed):
sed 's/../ &/3g' file
This prepends a space to the third pair of characters and every 2 characters thereafter.
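For example (GNU sed):
echo "20120911162500" | sed 's/../ &/3g'
2012 09 11 16 25 00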
You can do this way:
echo "20120911162500" |sed 's/.\{2\}/& /g;s/ //'
2012 09 11 16 25 00
Add a space after every 2 digits with s/.\{2\}/& /g, then remove the first space to make the first group 4 digits with s/ //.
Create a file
f.awk
function sep(i, n) { # which separator to use?
    if (i==n) return ""
    if (i<4) return ""
    if (i % 2 == 0) return " "
    return ""
}
function format(num, n, i, ans) {
    n = split(num, a, "")
    for (i=1; i<=n; i++)
        ans = ans a[i] sep(i, n)
    return ans
}
{
    print format($0)
}
Usage:
echo 12345678901234 | awk -f f.awk
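which should print (assuming an awk, such as gawk, where split() with an empty separator splits the string into individual characters):
1234 56 78 90 12 34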
Use the Add button Click event.
It will give a space first after four digits and then after every two digits, and so on.
For example:
string data = TextBox1.Text;
string[] no = new string[data.Length];
string number = "";
int k = 4;
for (int i = 0; i < data.Length; i++)
{
    if (i == k)
    {
        number += " ";
        k = k + 2;
    }
    number += data[i];
}
Label1.Text = number.ToString();

Awk: How do I count occurrences of a string across columns and find the maximum across rows?

I have a problem with my bash script on Linux.
My input looks like this:
input
Karydhs y n y y y n n y n n n y n y n
Markopoulos y y n n n y n y n y y n n n y
name3 y n y n n n n n y y n y n y n
etc...
where y=yes and n=no, and these are the results of voting. Now, using awk, I want to display each name and the total yes votes of that person, and the person that wins (got the most y). Any ideas?
This is what I tried:
awk '{count=0 for (I=1;i<=15;i++) if (a[I]="y") count++} {print $1,count}' filename
Here is a fast (no sort required, no explicit "for" loop), one-pass solution that takes into account the possibility of ties:
awk 'NF==0{next}
{ name=$1; $1=""; gsub(/[^y]/,"",$0); l=length($0);
  print name, l;
  if (mx=="" || mx < l) { mx=l; tie=""; winner=name; }
  else if (mx == l) {
    tie = 1; winner = winner", "name;
  }
}
END { fmt = tie ? "The winners have won %d votes each:\n" :
                  "The winner has won %d votes:\n";
      printf fmt, mx;
      print winner;
}'
Output:
Karydhs 7
Markopoulos 7
name3 6
The winners have won 7 votes each:
Karydhs, Markopoulos
NOTE: The program above is presented for readability; GNU awk accepts it with the line breaks shown. Certain awks disallow splitting the ternary conditional across lines.
What about this?
awk '{ for (i=2;i<NF;i++) { if ($i=="y") { a[$1" "$i]++} } } END { print "Yes tally"; l=0; for (i in a) { print i,a[i]; if (l>a[i]) { l=l } else { l=a[i];name=i } } split(name,a," "); print "Winner is ",a[1],"with ",l,"votes" } ' f
Yes tally
name3 y 6
Markopoulos y 6
Karydhs y 7
Winner is Karydhs with 7 votes
Here's yet another approach.
{ name=$1; $1=""; votes[name]=length(gensub("[^y]","","g")); }
END {asorti(votes,rank); for (r in rank) print rank[r], votes[rank[r]]; }
It is similar to the answer from @mklement0, but it uses asorti()¹ to sort inside of awk.
name=$1 saves the name from token 1
$1=""; clears token 1, which has the side effect of removing it from $0
votes[name] is an array indexed by the candidate's name
gensub("[^y]","","g") removes everything but 'y's from what's left of $0
and length() counts them
asorti(votes,rank) sorts votes by index into rank; at this point the arrays look like this:
votes rank
[name3] = 6 [1] = Karydhs
[Markopoulos] = 7 [2] = Markopoulos
[Karydhs] = 7 [3] = name3
for (r in rank) print rank[r], votes[rank[r]]; prints the results:
Karydhs 7
Markopoulos 7
name3 6
¹ the asorti() function may not be available in some versions of awk
Alternative two-pass awk
$ awk '{print $1; $1=""}1' votes |
awk -Fy 'NR%2{printf "%s ",$0; next} {print NF-1}' |
sort -k2nr
Karydhs 7
Markopoulos 7
name3 6
A simpler - and POSIX-compliant - awk solution, assisted by sort; note that no winner information (which may apply to multiple lines) is explicitly printed, but the sorting by votes in descending order should make the winner(s) obvious.
awk '{
printf "%s", $1
$1=""
yesCount=gsub("y", "")
printf " %s\n", yesCount
}' file |
sort -t ' ' -k2,2nr
printf "%s", $1 prints the name field only, without a trailing newline.
$1="" clears the 1st field, causing $0, the input line, to be rebuilt so that it contains the vote columns only.
yesCount=gsub("y", "") performs a dummy substitution that takes advantage of the fact that Awk's gsub() function returns the count of replacements performed; in effect, the return value is the number of y values on the line.
printf " %s\n", yesCount then prints the number of yes votes as the second output field and terminates the line.
sort -t ' ' -k2,2nr then sorts the resulting lines by the second (-k2,2) space-separated (-t ' ') field, numerically (n), in reverse order (r) so that the highest yes-vote counts appear first.
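With the sample input above, this should produce:
Karydhs 7
Markopoulos 7
name3 6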

Convert a text file into columns

Let's assume I have scientific data, all numbers arranged in a single column but representing an intensities matrix of n (width) by m (height). The column of the input file has in total n * m rows. An input example may look like that:
1
2
3
......
30
The new output should be such that I have n new columns with m rows. Sticking to my example with 30 fields input and n = 3, m = 10, I would need an output file like this (separator does not matter much, could be a blank, a tab etc.):
1 11 21
2 12 22
... ... ...
10 20 30
I use gawk under Windows. Please note that there is no special FS; more realistic real-world examples are 60 * 60 or bigger.
If you are not limited to awk but have GNU core-utils (cygwin, native, ..) then the simplest solution is to use pr:
pr -ts" " --columns 3 file
I believe this will do:
awk '
{ split($0,data); }
END {
    m = 10;
    n = 3;
    for (i = 1; i <= m; i++) {
        for (j = 0; j < n; j++) {
            printf "%s ", data[j*m + i] # output data plus space in one line
        }
        # here you might want to start a new line though you did not ask for it:
        printf "\n";
    }
}' inputfile
I might have the index counting wrong but I am sure you can figure it out. The trick is the split in the first line. It splits your input on whitespace and creates an array data. The END block runs after processing your file and just accesses data by index. Note that split() numbers the array elements starting from 1.
Assumption is all data is in a single line. Your question isn't quite clear on this. If it is on several lines you'd have to read it into the array differently.
Hope this gets you started.
EDIT
I notice you changed your question while I was answering it. So change
{ split($0,data); }
to
{ data[++i] = $1; }
to account for the input being on different lines. Actually, this would give you the option to read it into a two dimensional array in the first place.
EDIT 2
Read two dimensional array
To read as a two dimensional array assuming m and n are known beforehand and not encoded in the input somehow:
awk '
BEGIN {
    m = 10;
    n = 3;
}
{
    # one value per input line: element NR-1 (counting from 0) goes into
    # row (NR-1) % m and column int((NR-1) / m)
    data[(NR-1) % m, int((NR-1) / m)] = $0;
}
END {
    # do something with data
}' inputfile
However, since you only want to reformat your data, you can combine the two solutions, keeping the data array, printing in the END block, and passing m and n on the command line:
awk -v m=10 -v n=3 '
{ data[NR] = $0 }
END {
    for (i = 1; i <= m; i++) {
        for (j = 0; j < n; j++) {
            printf "%s ", data[j*m + i] # output data plus space in one line
        }
        printf "\n";
    }
}' inputfile
Here is a fairly simple solution (in the example I've set n equal to 3; plug in the appropriate value for n):
awk -v n=3 '{ row = row $1 " "; if (NR % n == 0) { print row; row = "" } }' FILE
This works by reading records one line at a time, concatenating each line with the preceding lines. When n lines have been concatenated, it prints the concatenated result on a single new line. This repeats until there are no more lines left in the input.
You can use the below command
paste - - - < input.txt
By default the delimiter is TAB; to change the delimiter, use the command below:
paste - - - -d' ' < input.txt
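For a larger n (such as the 60-column case mentioned in the question), the '-' arguments can be generated rather than typed out; a sketch assuming a POSIX shell with seq available:
paste -d' ' $(printf ' -%.0s' $(seq 60)) < input.txt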
