I want to remove the engine "map" { ... } block that contains "foobar".
I have tried many approaches, but it is hard because the block contains empty lines and ends with '}', so simple delimiter-based matching doesn't work.
mainfunc {
  var = "baz"
  engine "map" {
    func {
      var0 = "foo"
      border = { 1, 1, 1, 1 }
      var1 = "bar"
    }
  }
}
mainfunc {
  var = "baz"
  engine "map" {
    func {
      var0 = "foo"
      border = { 1, 1, 1, 1 }
      var1 = "foobar"
    }
  }
}
... # more functions like 'mainfunc'
I tried
sed '/engine/,/^\s\s}$/d' file
but that removes every engine block; I only need to remove the one containing "foobar". Maybe a pattern could match everything, including newlines, up to foobar, something like this:
sed '/engine(.*)foobar/,/^\s\s}$/d' file
Is it possible?
Try:
sed '/engine/{:a;N;/foobar/{N;N;d};/ }/b;ba}' filename
or:
awk '/engine/{c=1}c{b=b?b"\n"$0:$0;if(/{/)a++;if(/}/)a--;if(!a){if(b!~/foobar/)print b;c=0;b="";next}}!c' filename
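I would simply count the opening and closing braces once engine "map" is matched. I cannot say whether this only works in gawk. Note that, as written, it removes every engine "map" block, not only the one containing "foobar".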
awk '
  /^[ \t]*engine "map"/ {
    ship = 1   # ship is used as a boolean: we are inside a block to skip
    b = 0      # balance between opening and closing braces
  }
  ship {
    b += split($0, tmp, "{")   # pieces around { (one more than the number of {)
    b -= split($0, tmp, "}")   # pieces around } (the extra one cancels out)
    # When opening and closing braces balance, the block ends
    if (b == 0) {
      ship = 0
    }
    # Skip this line (do not print it)
    next
  }
  1   # Print every other line
' file
split() returns the number of pieces it created, not the number of separators; on a non-empty line that is one more than the number of separators, and the extra one cancels out between the two calls above. The description below is what the gawk manual gives for the closely related patsplit(string, array [, fieldpat [, seps ] ]):
Divide string into pieces defined by fieldpat and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The third argument, fieldpat, is a regexp describing the fields in string (just as FPAT is a regexp describing the fields in input records). It may be either a regexp constant or a string. If fieldpat is omitted, the value of FPAT is used. patsplit() returns the number of elements created.
I can't seem to find any way of sorting a word by its characters in awk.
For example, if the word is "hello", its sorted equivalent is "ehllo". How can I achieve this in awk?
With GNU awk for PROCINFO[], "sorted_in" (see https://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Scanning) and splitting with a null separator resulting in an array of chars:
$ echo 'hello' |
awk '
    BEGIN { PROCINFO["sorted_in"]="@val_str_asc" }
    {
        split($1,chars,"")
        word = ""
        for (i in chars) {
            word = word chars[i]
        }
        print word
    }
'
ehllo
$ echo 'hello' | awk -v ordr='@val_str_asc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ehllo
$ echo 'hello' | awk -v ordr='@val_str_desc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ollhe
Another option is a Decorate-Sort-Undecorate with sed. Essentially, you use sed to break "hello" into one character per-line (decorating each character with a newline '\n') and pipe the result to sort. You then use sed to do the reverse (undecorate each line by removing the '\n') to join the lines back together.
printf "hello" | sed 's/\(.\)/\1\n/g' | sort | sed '{:a N;s/\n//;ta}'
ehllo
There are several approaches you can use; this one is shell friendly, but using \n in the replacement text requires GNU sed.
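For comparison, the same split, sort and join steps can be written in a few lines of Python. This is only an illustration of the technique, not awk or sed; the name sort_chars is hypothetical.

```python
def sort_chars(word, reverse=False):
    # sorted() splits the string into characters and orders them;
    # join() glues the ordered characters back into one word.
    return "".join(sorted(word, reverse=reverse))


print(sort_chars("hello"))                # ehllo
print(sort_chars("hello", reverse=True))  # ollhe
```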
This would be more doable with gawk, which includes the asort function to sort an array:
awk 'BEGIN{FS=OFS=ORS=""}{split($0,a);asort(a);for(i in a)print a[i]}'<<<hello
This outputs:
ehllo
Demo: https://ideone.com/ylWQLJ
You need to write a function that sorts the letters in a word (see: https://www.gnu.org/software/gawk/manual/html_node/Join-Function.html):
function siw(word, result, arr, arrlen, arridx) {
    split(word, arr, "")
    arrlen = asort(arr)
    for (arridx = 1; arridx <= arrlen; arridx++) {
        result = result arr[arridx]
    }
    return result
}
And define a comparison function for two words (see: https://www.gnu.org/software/gawk/manual/html_node/Array-Sorting-Functions.html):
function compare_by_letters(i1, v1, i2, v2, left, right) {
    left = siw(v1)
    right = siw(v2)
    if (left < right)
        return -1
    else if (left == right)
        return 0
    else
        return 1
}
And use this function with awk sort function:
asort(array_test, array_test_result, "compare_by_letters")
Then, the sample program is:
function siw(word, result, arr, arrlen, arridx) {
    result = hash_word[word]
    if (result != "") {
        return result
    }
    split(word, arr, "")
    arrlen = asort(arr)
    for (arridx = 1; arridx <= arrlen; arridx++) {
        result = result arr[arridx]
    }
    hash_word[word] = result
    return result
}
function compare_by_letters(i1, v1, i2, v2, left, right) {
    left = siw(v1)
    right = siw(v2)
    if (left < right)
        return -1
    else if (left == right)
        return 0
    else
        return 1
}
{
    array_test[i++] = $0
}
END {
    alen = asort(array_test, array_test_result, "compare_by_letters")
    for (aind = 1; aind <= alen; aind++) {
        print array_test_result[aind]
    }
}
Executed like this:
echo -e "fail\nhello\nborn" | awk -f sort_letter.awk
Output:
fail
born
hello
Of course, if you have a big input, you can adapt the siw function to memoize its results for faster computation:
function siw(word, result, arr, arrlen, arridx) {
    result = hash_word[word]
    if (result != "") {
        return result
    }
    split(word, arr, "")
    arrlen = asort(arr)
    for (arridx = 1; arridx <= arrlen; arridx++) {
        result = result arr[arridx]
    }
    hash_word[word] = result
    return result
}
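As a cross-check of the idea (order words by their sorted letters, caching each word's sorted form), here is a small Python sketch. It is an illustration only, not part of the gawk program; the names letters_key and sort_by_letters are hypothetical, and functools.lru_cache stands in for the hash_word array.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def letters_key(word):
    # Sorted letters of the word, e.g. "hello" -> "ehllo"; cached per word.
    return "".join(sorted(word))


def sort_by_letters(words):
    # Order the words by comparing their sorted-letter forms.
    return sorted(words, key=letters_key)


print(sort_by_letters(["fail", "hello", "born"]))  # ['fail', 'born', 'hello']
```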
Here's a very unorthodox, quick-and-dirty approach, if you really want to sort "hello" into "ehllo":
mawk/mawk2/gawk 'BEGIN { FS="^$"
# to make it AaBbCc… etc; chr(65) = ascii "A"
for (x = 65; x < 91; x++) {
ref = sprintf("%s%c%c",ref, x, x+32)
} } /^[[:alpha:]]$/ { print } /[[:alpha:]][[:alpha:]]+/ {
# for gawk/nawk, feel free to change
# that to /[[:alpha:]]{2,}/
# the >= 2+ condition is to prevent wasting time
# sorting single letter words "A" and "I"
s=""; x=1; len=length(inp=$0);
while ( len && (x<53) ) {
if (inp~(ch = substr(ref,x++,1))) {
while ( sub(ch,"",inp) ) {
s = s ch;
len -= 1 ;
} } }
print s }'
I'm aware it's an extremely inefficient way of doing a selection sort. The potential time savings come from ending the loop the moment all letters are accounted for, instead of iterating over all 52 letters every time. The downside is that it doesn't pre-profile the input
(e.g. if you detect that a row is lower-case only, you can speed it up with a lowercase-only loop instead).
The upside is that it eliminates the need for custom functions, any gawk dependency, and the need to split every row into an array (or every character into its own field).
Yes, technically one can set FS to the null string, so that NF becomes the string length, but that can be slow when the input is large. If you need Unicode support, a match()-based approach is preferable.
I added the (x<53) condition to prevent runaway infinite loops in case the input isn't pure ASCII letters.
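The idea above (walk a fixed reference alphabet AaBbCc… and emit each letter as many times as it occurs, stopping early once every character is accounted for) can be sketched in Python as follows. This is a rough illustration rather than a translation of the awk one-liner; REF and alphabet_sort are hypothetical names, and, like the awk version, non-letters are simply dropped.

```python
import string

# Reference order "AaBbCc...Zz", mirroring the ref string built in BEGIN.
REF = "".join(u + l for u, l in zip(string.ascii_uppercase, string.ascii_lowercase))


def alphabet_sort(word):
    remaining = len(word)   # characters not yet placed in the output
    pieces = []
    for ch in REF:
        if remaining == 0:  # stop as soon as every character is placed
            break
        n = word.count(ch)
        pieces.append(ch * n)
        remaining -= n
    return "".join(pieces)


print(alphabet_sort("hello"))  # ehllo
```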
I want to write a regex that matches the following: any number of characters followed by an opening and a closing parenthesis. I also don't want it to match if there is a # in front of it.
Here is what I tried so far:
\s*^(?!#)*[A-Za-z0-9_]*\(\)
The problem is that this regex seems to match every line in the test file, not just the functions that I want.
Have a great day
EDIT:
function match(array, regex) {//Rematcher
back = [];
for (let i = 0; i < array.length; i++) {
let line = array[i];
if (regex.test(line)) {
back.push(line);
}
}
return back
}
function find(text) {//function finder
let reg = new RegExp("^(?!#)*\w*\(\)");
return reMatcher.match(text, reg);
}
let content = fs.readFileSync(file, "UTF-8");//starting point
let functions = find_functions.find(content);
content = content.split("\n");
//Testfile
meldet()
{
if true
then
if true
then
echo Pseudocode
fi
echo Pseudocode
fi
}
The regex should only match the first line, but instead it matches every line. It also shouldn't match lines where a # precedes the function header.
In the pattern \s*^(?!#)*[A-Za-z0-9_]*\(\) you can omit the \s* before the start-of-line anchor.
As the character class does not match a #, you can also omit the negative lookahead (?!#).
Instead of using reMatcher.match(text, reg);, you could call match on the text variable.
Remember to double-escape the backslashes in the RegExp constructor; otherwise "\w*\(\)" becomes w*(), which matches every line.
You could update the function find to:
function find(text) {
  // Backslashes are doubled because the pattern is given as a string
  let reg = new RegExp("^[A-Za-z0-9_]*\\(\\)");
  let m = text.match(reg);
  return m ? m[0] : null; // match() returns null when nothing matches
}
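To see the anchoring behaviour on a whole file rather than a single line, here is a small Python sketch of the same pattern applied in multi-line mode. It is an illustration under assumptions: find_function_headers is a hypothetical name, and the pattern uses + instead of * so that a bare "()" is not reported.

```python
import re

# ^ matches at the start of every line because of re.MULTILINE.
# A leading # can never match, since # is not in the character class.
HEADER = re.compile(r"^[A-Za-z0-9_]+\(\)", re.MULTILINE)


def find_function_headers(text):
    return HEADER.findall(text)


sample = "meldet()\n{\n  echo Pseudocode\n}\n#disabled()\nother()\n{\n}\n"
print(find_function_headers(sample))  # ['meldet()', 'other()']
```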
May I introduce you to the problem that destroyed my weekend. I have biological data in 4 columns
#ID:::12345/1 ACGACTACGA text !"#$%vwxyz
#ID:::12345/2 TATGACGACTA text :;<=>?VWXYZ
I would like to use awk to edit the first column to replace characters : and / with -
I would like to convert the string in the last column with a comma-separated string of decimals that correspond to each individual ASCII character (any character ranging from ASCII 33 - 126).
#ID---12345-1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345-2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90
The first part is easy, but I'm stuck on the second. I've tried using awk ordinal functions and sprintf; I can only get the former to work on the first character of the string, and I can only get the latter to convert hexadecimal to decimal, and not with spaces. I also tried this shell command:
$ od -t d1 test3 | awk 'BEGIN{OFS=","}{i = $1; $1 = ""; print $0}'
But don't know how to call this function within awk.
I would prefer to use awk as I have some downstream manipulations that can also be done in awk.
Many thanks in advance
Using the ordinal functions from the awk manual, you can do it like this:
awk -f ord.awk --source '{
    # replace : and / with - in the first field
    gsub(/[:\/]/,"-",$1)
    # calculate the ordinal by looping over the characters in the fourth field
    res=ord($4)
    for(i=2;i<=length($4);i++) {
        res=res","ord(substr($4,i))
    }
    $4=res
}1' file
Output:
#ID---12345-1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345-2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90
Here is ord.awk (taken as is from: http://www.gnu.org/software/gawk/manual/html_node/Ordinal-Functions.html)
# ord.awk --- do ord and chr
# Global identifiers:
# _ord_: numerical values indexed by characters
# _ord_init: function to initialize _ord_
BEGIN { _ord_init() }
function _ord_init( low, high, i, t)
{
low = sprintf("%c", 7) # BEL is ascii 7
if (low == "\a") { # regular ascii
low = 0
high = 127
} else if (sprintf("%c", 128 + 7) == "\a") {
# ascii, mark parity
low = 128
high = 255
} else { # ebcdic(!)
low = 0
high = 255
}
for (i = low; i <= high; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
function ord(str, c)
{
# only first character is of interest
c = substr(str, 1, 1)
return _ord_[c]
}
function chr(c)
{
# force c to be numeric by adding 0
return sprintf("%c", c + 0)
}
If you don't want to include the whole of ord.awk, you can do it like this:
awk 'BEGIN{ _ord_init()}
function _ord_init(low, high, i, t)
{
low = sprintf("%c", 7) # BEL is ascii 7
if (low == "\a") { # regular ascii
low = 0
high = 127
} else if (sprintf("%c", 128 + 7) == "\a") {
# ascii, mark parity
low = 128
high = 255
} else { # ebcdic(!)
low = 0
high = 255
}
for (i = low; i <= high; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
{
# replace : and / with - in the first field
gsub(/[:\/]/,"-",$1)
# calculate the ordinal by looping over the characters in the fourth field
res=_ord_[substr($4,1,1)]
for(i=2;i<=length($4);i++) {
res=res","_ord_[substr($4,i,1)]
}
$4=res
}1' file
Perl solution:
perl -lnae '$F[0] =~ s%[:/]%-%g; $F[-1] =~ s/(.)/ord($1) . ","/ge; chop $F[-1]; print "@F";' < input
The first substitution replaces : and / in the first field with a dash, the second one replaces each character in the last field with its ord and a comma, chop removes the last comma.
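The two steps (replace : and / in the first field, then turn the last field into comma-separated character codes) can also be sketched in Python for comparison. This is an illustrative sketch only; convert_line is a hypothetical name, and it assumes whitespace-separated fields as in the sample data.

```python
import re


def convert_line(line):
    fields = line.split()
    # Replace every ':' and '/' in the first field with '-'.
    fields[0] = re.sub(r"[:/]", "-", fields[0])
    # Replace the last field with the decimal code of each of its characters.
    fields[-1] = ",".join(str(ord(ch)) for ch in fields[-1])
    return " ".join(fields)


print(convert_line('#ID:::12345/1 ACGACTACGA text !"#$%vwxyz'))
# #ID---12345-1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
```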
I am trying to write a Perl script where a string is fragmented at every occurrence of 'E'. When the user enters positions of 'C' on the command line (say 3-8, or 3-8,13-18, or any such comma-separated positions of 'C', depending on how long the string is), the fragments containing those 'C's (say at positions 3 and 8) should be joined and shown in the output. Suppose the string is
"ABCDEABCDEABCDEABCDEABCDE" and the user enters 3-8; then the program output should be:
ABCDEABCDE
ABCDE
ABCDE
ABCDE
I wrote a script where the user enters 'C' positions on the command line and the string is cut at every 'E', but I'm not able to write the rest properly. Please help!
The code (edited) that I've written so far is:
use strict;
use warnings;
my $s = 'ABCDEABCDEABCDEABCDEABCDE';
my $i=0;
my @where;
my @array;
my @bond;
my @pos;
my @s_array;
my @s_array2;
for ($i=0; $i<=4; $i++) {
    $where[$i] = index($s,"C",$where[$i-1])+1;
    push(@array,$where[$i]);
}
print "All Positions of C: @array\n\n";
print "Enter C positions:\n";
my @join_C = <STDIN>;
foreach (@join_C) {
    @bond = split (',',$_);
}
foreach (@bond) {
    @pos = split ('-', $_);
}
print "entered C positions:@pos\n";
print "Resulting disulfide linked peptides\n\n";
my @a = split(/E/, $s);
my $result = join ("E,", @a);
my @final = split(/,/, $result);
foreach my $final (@final) {
    foreach my $pos(@pos) {
        my @source = split //, $final[@final];
        my $length = @source;
        for ($i=0; $i<=$length; $i++) {
            if (index($final[$i], "C") == $pos[$i]) {
                push (@s_array, $final[$i]);
            }
            else {
                push (@s_array2, $final[$i]);
            }
        }
    }
}
my $lastly_joined = join (',', @s_array);
print "Joined Fragments= @s_array\n";
print "Remaining fragments= @s_array2\n";
I will try to understand what you want to do.
I am trying to write a Perl script where a string get fragmented at every
occurrence of 'E'
Okay, first create the input. Let's use an array to make access to the
elements easier.
my @string = split ('', 'ABCDE' x 5);
I'm not sure how that string will look in your case. Can you please provide a
real-world example?
and when user enters c_pos of 'C' through command line (say 3-8
or 3-8,13-18 or any comma separated such c_pos of 'C' according to
the string in such format if the string is long)
I would suggest using command-line arguments. That makes it easier to use the
script later in a chain with other tools. Pass the arguments to the script:
script.pl 3-8,13-18
So we get a list of pairs:
my @pairs = split (',', join('', @ARGV));
Now you should check that the passed values point to 'C's. Valid combinations
are stored in a hash where the key is the start index and the value is the end
index.
my %c_pos;
foreach my $pair (@pairs) {
    my ($from, $to) = split('-', $pair);
    if (($string[$from-1] eq 'C') && ($string[$to-1] eq 'C')) {
        $c_pos{$from-1} = $to-1;
    } else {
        warn "position ${from}-${to} not valid => ignored!\n";
    }
}
the fragments containing 'C' (say at 3 and 8 positions) should be joined
and shown in the output.
Now we can iterate over the elements of @string. When we hit a start index,
a 'connection' starts, and this connection stays active until the end index is
reached.
We append every character to the current entry.
When we hit an 'E' and we are not inside a 'connection', the current entry is
pushed to our result and we start with the next empty entry.
for (my $i=0; $i<@string; $i++) {
    if ($c_pos{$i}) {
        $inside_connection = 1;
        $end = $c_pos{$i};
    } elsif ($i == $end) {
        $inside_connection = 0;
        $end = 0;
    }
    $entry.=$string[$i];
    if ($inside_connection) {
        # do not split on 'E'
    } elsif ($string[$i] eq 'E') {
        # split on 'E'
        push @result, $entry;
        $entry = '';
    }
}
Because I do not know better, I assumed that a chained connection like
3-8,8-13 works as if you had said 3-13. I hope that fits. Here is the
complete script:
use strict;
use warnings;
my @string = split ('', 'ABCDE' x 5);
my @pairs = split (',', join('', @ARGV));
my %c_pos;
foreach my $pair (@pairs) {
    my ($from, $to) = split('-', $pair);
    if (($string[$from-1] eq 'C') && ($string[$to-1] eq 'C')) {
        $c_pos{$from-1} = $to-1;
    } else {
        warn "position ${from}-${to} not valid => ignored!\n";
    }
}
my @result;
my $entry = '';
my $inside_connection = 0;
my $end=0;
for (my $i=0; $i<@string; $i++) {
    if ($c_pos{$i}) {
        $inside_connection = 1;
        $end = $c_pos{$i};
    } elsif ($i == $end) {
        $inside_connection = 0;
        $end = 0;
    }
    $entry.=$string[$i];
    if ($inside_connection) {
        # do not split on 'E'
    } elsif ($string[$i] eq 'E') {
        # split on 'E'
        push @result, $entry;
        $entry = '';
    }
}
print join ("\n", @result);
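The same walk over the string can be sketched in Python to make the state handling easier to follow. This is a minimal sketch of the approach described above, not the Perl script itself; join_fragments and its parameters are hypothetical names, and unlike the Perl version it also keeps a trailing piece that does not end in the cut character.

```python
def join_fragments(seq, pairs, cut="E", link="C"):
    # Map 0-based start index -> 0-based end index for every valid pair.
    links = {}
    for start, end in pairs:
        in_range = 1 <= start <= len(seq) and 1 <= end <= len(seq)
        if in_range and seq[start - 1] == link and seq[end - 1] == link:
            links[start - 1] = end - 1
        # Pairs that do not point at two link characters are ignored.

    fragments = []
    entry = ""
    end = None  # end index of the active connection, or None if there is none
    for i, ch in enumerate(seq):
        if i in links:
            end = links[i]    # a connection starts (or is chained) here
        elif i == end:
            end = None        # the connection ends here
        entry += ch
        if end is None and ch == cut:
            fragments.append(entry)   # cut only outside a connection
            entry = ""
    if entry:
        fragments.append(entry)       # trailing piece without a final cut
    return fragments


print("\n".join(join_fragments("ABCDE" * 5, [(3, 8)])))
```

With the pair 3-8 this prints ABCDEABCDE followed by three ABCDE lines, matching the expected output in the question.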
Is there a method in Perl (not BioPerl) to find the number of each two consecutive letters.
I.e., number of AA, AC, AG, AT, CC, CA, ... in a sequence like this:
$sequence = 'AACGTACTGACGTACTGGTTGGTACGA'
PS: We can do it manually with a regular expression, i.e., $GC=($sequence=~s/GC/GC/g), which returns the number of GC occurrences in the sequence.
I need an automated and generic way.
You had me confused for a while, but I take it you want to count the dinucleotides in a given string.
Code:
my @dinucs = qw(AA AC AG CC CA CG);
my %count;
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
for my $dinuc (@dinucs) {
    $count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
}
Output from Data::Dumper:
$VAR1 = {
"AC" => 5,
"CC" => "",
"AG" => "",
"AA" => 1,
"CG" => 3,
"CA" => ""
};
Close to TLP's answer, but without substitution:
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs = qw(AA AC AG AT CC CG);
my %count = map{$_ => 0}@dinucs;
for my $dinuc (@dinucs) {
    while($sequence=~/$dinuc/g) {
        $count{$dinuc}++;
    }
}
Benchmark:
use Benchmark qw(cmpthese);
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs = qw(AA AC AG AT CC CG);
my %count = map{$_ => 0}@dinucs;
my $count = -3;   # negative value: run each sub for at least 3 CPU seconds
my $r = cmpthese($count, {
    'match' => sub {
        for my $dinuc (@dinucs) {
            while($sequence=~/$dinuc/g) {
                $count{$dinuc}++;
            }
        }
    },
    'substitute' => sub {
        for my $dinuc (@dinucs) {
            $count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
        }
    }
});
Output:
              Rate substitute      Match
Substitute 13897/s         --       -11%
Match      15622/s        12%         --
Regex works if you're careful, but there's a simple solution using substr that will be faster and more flexible.
(As of this posting, the regex solution marked as accepted will fail to correctly count dinucleotides in repeated regions like 'AAAA...', of which there are many in naturally occurring sequences.
Once you match 'AA', the regex search resumes on the third character, skipping the middle 'AA' dinucleotide. This doesn't affect the other dinucleotides since if you have 'AC' at one position, you're guaranteed not to have it in the next base, naturally. The particular sequence given in the question will not suffer from this problem since no base appears three times in a row.)
The method I suggest is more flexible in that it can count words of any length; extending the regex method to longer words is complicated since you have to do even more gymnastics with your regex to get an accurate count.
sub substrWise {
    my ($seq, $wordLength) = @_;
    my $cnt = {};
    my $w;
    for my $i (0 .. length($seq) - $wordLength) {
        $w = substr($seq, $i, $wordLength);
        $cnt->{$w}++;
    }
    return $cnt;
}
sub regexWise {
    my ($seq, $dinucs) = @_;
    my $cnt = {};
    for my $d (@$dinucs) {
        if (substr($d, 0,1) eq substr($d, 1,1) ) {
            my $n = substr($d, 0,1);
            $cnt->{$d} = ($seq =~ s/$n(?=$n)/$n/g); # use look-ahead
        } else {
            $cnt->{$d} = ($seq =~ s/$d/$d/g);
        }
    }
    return $cnt;
}
my @dinucs = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
use Test::More tests => 1;
my $rWise = regexWise($sequence, \@dinucs);
my $sWise = substrWise($sequence, 2);
$sWise->{$_} //= '' for @dinucs; # substrWise will not create keys for words not found
                                 # this seems like desirable behavior IMO,
                                 # but i'm adding '' to show that the counts match
is_deeply($rWise, $sWise, 'verify equivalence');
use Benchmark qw(:all);
cmpthese(100000, {
    'regex' => sub {
        regexWise($sequence, \@dinucs);
    },
    'substr' => sub {
        substrWise($sequence, 2);
    }
});
Output:
1..1
ok 1 - verify equivalence
          Rate  regex substr
regex  11834/s     --   -85%
substr 76923/s   550%     --
For longer sequences (10-100 kbase), the advantage is not as pronounced, but it still wins by about 70%.
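The sliding-window counting that substrWise performs can be illustrated in a few lines of Python. This sketch is for comparison only and makes no claim about relative speed; count_words is a hypothetical name. Because the window advances one position at a time, overlapping occurrences such as the three 'AA' in 'AAAA' are all counted.

```python
from collections import Counter


def count_words(seq, word_length):
    # Slide a window of word_length over seq, one position at a time.
    return Counter(seq[i:i + word_length]
                   for i in range(len(seq) - word_length + 1))


counts = count_words("AACGTACTGACGTACTGGTTGGTACGA", 2)
print(counts["AC"], counts["CG"], counts["AA"])  # 5 3 1
print(count_words("AAAA", 2)["AA"])              # 3
```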