Taking a string as input for a calculator program in Perl

I am new to Perl and I'm trying to create a simple calculator program, but the rules are different from normal maths: all operators have the same precedence and the expression must be evaluated from left to right.
Here is an example:
123 - 10 + 4 * 10 = ((123 - 10) + 4) * 10 = 1170
8 * 7 / 3 + 2 = ((8 * 7) / 3) + 2 = 20.666
So in the first case the user needs to enter one string: 123 - 10 + 4 * 10.
How do I approach this task?
I'm sorry if it's too much of a general question, but I'm not sure how to even begin. Do I need a counter? Something like: every second character of the string is an operator, while the ones on either side are digits.

I'm afraid I'm lazy so I'll parse with a regex and process as I parse.
#!/usr/bin/env perl
#use Data::Dumper;
use Params::Validate (':all');
use 5.01800;
use warnings;

my $string=q{123 - 10 + 4 * 10};
my $result;

sub fee {
    my ($a)=validate_pos(@_,{ type=>SCALAR });
    #warn Data::Dumper->Dump([\$a],[qw(*a)]),' ';
    $result=$a;
};

sub fi {
    my ($op,$b)=validate_pos(@_,{ type=>SCALAR },{ type=>SCALAR });
    #warn Data::Dumper->Dump([\$op,\$b],[qw(*op *b)]),' ';
    $result = $op eq '+' ? $result+$b :
              $op eq '-' ? $result-$b :
              $op eq '*' ? $result*$b :
              $op eq '/' ? $result/$b :
              undef;
    #warn Data::Dumper->Dump([\$result],[qw(*result)]),' ';
};

$string=~ m{^(\d+)(?{ fee($1) })(?:(?: *([-+/*]) *)(\d+)(?{ fi($2,$3) }))*$};
say $result;
Note the use of (?{...}) blocks, which run Perl code as the regex matches.

To be clear, you are not looking for a regular calculator. You are looking for a calculator that bends the rules of math.
What you want is to extract the operands and operators, then handle them three at a time, with the first being the rolling "sum", the second an operator and the third an operand.
A simple way to handle it is to just eval the strings. But since eval is a dangerous operation, we need to de-taint the input. We do this with a regex match: /\d+|[+\-*\/]+/g. This matches either one or more digits (\d+) or one or more of the characters + - * / ([+\-*\/]+), and the /g modifier repeats the match as many times as possible.
use strict;
use warnings;
use feature 'say';

while (<>) {                                    # while we get input
    my ($main, @ops) = /\d+|[+\-*\/]+/g;        # extract the operands and operators
    while (@ops) {                              # while the list is not empty
        $main = calc($main, splice @ops, 0, 2); # take 2 items off the list and process
    }
    say $main;                                  # print result
}

sub calc {
    eval "@_";    # simply eval a string of 3 ops, e.g. eval("1 + 2")
}
You may wish to add some input checking, to count the args and make sure they are the correct number.
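For instance, a minimal sketch of such a check (my own illustration; the sub name and structure are not from the answer above) could verify that the tokens alternate number, operator, number before anything is evaluated:
# Hypothetical helper: returns true if the token list looks like n op n op n ...
sub valid_tokens {
    my @tokens = @_;
    return 0 unless @tokens % 2;                   # must be an odd count
    for my $i (0 .. $#tokens) {
        if ($i % 2 == 0) {                         # even positions: numbers
            return 0 unless $tokens[$i] =~ /^\d+$/;
        } else {                                   # odd positions: operators
            return 0 unless $tokens[$i] =~ m{^[-+*/]$};
        }
    }
    return 1;
}
# e.g. next unless valid_tokens($main, @ops);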
A more sensible solution is to use a dispatch table, with the operator as the key into a hash of subs designed to handle each math operation:
sub calc {
    my %proc = (
        "+" => sub { $_[0] + $_[1] },
        "-" => sub { $_[0] - $_[1] },
        "/" => sub { $_[0] / $_[1] },
        "*" => sub { $_[0] * $_[1] }
    );
    return $proc{$_[1]}($_[0], $_[2]);
}
As long as the middle argument is an operator, this will perform the required operation without the need for eval. This will also allow you to add other math operations that you might want for the future.
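For instance, fed by the same tokenizing regex as the eval version above, the dispatch-table calc drops straight in (a small usage sketch of my own, assuming the calc sub shown here):
my @tokens = "123 - 10 + 4 * 10" =~ /\d+|[-+*\/]/g;   # ("123", "-", "10", "+", "4", "*", "10")
my $main   = shift @tokens;
$main = calc($main, splice @tokens, 0, 2) while @tokens;
print "$main\n";                                      # prints 1170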

Just to read raw input from the user you would simply read the STDIN file handle.
$input = <STDIN>;
This will give you a string, say "123 + 234 - 345", which will have an end-of-line marker. You can remove this safely with the chomp command.
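For example:
my $input = <STDIN>;    # e.g. "123 + 234 - 345\n"
chomp $input;           # strip the trailing newline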
After that you will want to parse your string to get your appropriate variables. You can brute force this with a stream scanner that looks at each character as you read it and processes it accordingly. For example:
my @input = split //, $input;
for my $ch (@input) {
    if ($ch ge "0" and $ch le "9") {   # digit: accumulate the current number
        $tVal = ($tVal * 10) + $ch;
    } elsif ($ch eq " ") {
        $newVal = $oldVal;
    } elsif ($ch eq "+") {
        # Do addition stuff
    }...
}
Another approach would be to split it into words so you can just deal with whole terms.
my @input = split /\s+/, $input;
Instead of a stream of characters, the array values you process will be 123, +, 234, -, and 345...
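A left-to-right loop over those whole terms might then look something like this (a rough sketch of the idea, not code from the answer; variable names are my own):
my @terms  = split /\s+/, $input;    # e.g. (123, '+', 234, '-', 345)
my $result = shift @terms;
while (@terms) {
    my ($op, $num) = (shift @terms, shift @terms);
    if    ($op eq '+') { $result += $num }
    elsif ($op eq '-') { $result -= $num }
    elsif ($op eq '*') { $result *= $num }
    elsif ($op eq '/') { $result /= $num }
}
print "$result\n";    # 123 + 234 - 345 evaluated left to right gives 12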
Hope this points you in the right direction...


How to sort the characters of a word using awk?

I can't seem to find any way of sorting a word based on its characters in awk.
For example, if the word is "hello" then its sorted equivalent is "ehllo". How can I achieve this in awk?
With GNU awk, using PROCINFO["sorted_in"] (see https://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Scanning) and splitting with a null separator to get an array of chars:
$ echo 'hello' |
awk '
    BEGIN { PROCINFO["sorted_in"]="@val_str_asc" }
    {
        split($1,chars,"")
        word = ""
        for (i in chars) {
            word = word chars[i]
        }
        print word
    }
'
ehllo
$ echo 'hello' | awk -v ordr='@val_str_asc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ehllo
$ echo 'hello' | awk -v ordr='@val_str_desc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ollhe
Another option is a Decorate-Sort-Undecorate with sed. Essentially, you use sed to break "hello" into one character per-line (decorating each character with a newline '\n') and pipe the result to sort. You then use sed to do the reverse (undecorate each line by removing the '\n') to join the lines back together.
printf "hello" | sed 's/\(.\)/\1\n/g' | sort | sed '{:a N;s/\n//;ta}'
ehllo
There are several approaches you can use; this one is shell friendly, but the \n in the replacement requires GNU sed.
This would be more doable with gawk, which includes the asort function to sort an array:
awk 'BEGIN{FS=OFS=ORS=""}{split($0,a);asort(a);for(i in a)print a[i]}'<<<hello
This outputs:
ehllo
Demo: https://ideone.com/ylWQLJ
You need to write a function to sort the letters in a word (see: https://www.gnu.org/software/gawk/manual/html_node/Join-Function.html):
function siw(word, result, arr, arrlen, arridx) {
    split(word, arr, "")
    arrlen = asort(arr)
    for (arridx = 1; arridx <= arrlen; arridx++) {
        result = result arr[arridx]
    }
    return result
}
And define a comparison function to compare two words (see: https://www.gnu.org/software/gawk/manual/html_node/Array-Sorting-Functions.html):
function compare_by_letters(i1, v1, i2, v2, left, right) {
    left = siw(v1)
    right = siw(v2)
    if (left < right)
        return -1
    else if (left == right)
        return 0
    else
        return 1
}
And use this function with awk's asort function:
asort(array_test, array_test_result, "compare_by_letters")
Then, the sample program is:
function siw(word, result, arr, arrlen, arridx) {
    result = hash_word[word]
    if (result != "") {
        return result
    }
    split(word, arr, "")
    arrlen = asort(arr)
    for (arridx = 1; arridx <= arrlen; arridx++) {
        result = result arr[arridx]
    }
    hash_word[word] = result
    return result
}

function compare_by_letters(i1, v1, i2, v2, left, right) {
    left = siw(v1)
    right = siw(v2)
    if (left < right)
        return -1
    else if (left == right)
        return 0
    else
        return 1
}

{
    array_test[i++] = $0
}

END {
    alen = asort(array_test, array_test_result, "compare_by_letters")
    for (aind = 1; aind <= alen; aind++) {
        print array_test_result[aind]
    }
}
Executed like this:
echo -e "fail\nhello\nborn" | awk -f sort_letter.awk
Output:
fail
born
hello
Of course, if you have a big input, you could adapt the siw function to memoize its result for faster computation:
function siw(word, result, arr, arrlen, arridx) {
    result = hash_word[word]
    if (result != "") {
        return result
    }
    split(word, arr, "")
    arrlen = asort(arr)
    for (arridx = 1; arridx <= arrlen; arridx++) {
        result = result arr[arridx]
    }
    hash_word[word] = result
    return result
}
Here's a very unorthodox, quick-and-dirty approach, if you really want to sort "hello" into "ehllo":
mawk/mawk2/gawk 'BEGIN { FS="^$"
    # to make ref the string "AaBbCc…" etc; chr(65) = ascii "A"
    for (x = 65; x < 91; x++) {
        ref = sprintf("%s%c%c", ref, x, x+32)
    }
}
/^[[:alpha:]]$/ { print }
/[[:alpha:]][[:alpha:]]+/ {
    # for gawk/nawk, feel free to change that to /[[:alpha:]]{2,}/
    # the >= 2 condition is to prevent wasting time
    # sorting single-letter words "A" and "I"
    s = ""; x = 1; len = length(inp = $0);
    while ( len && (x < 53) ) {
        if (inp ~ (ch = substr(ref, x++, 1))) {
            while ( sub(ch, "", inp) ) {
                s = s ch;
                len -= 1;
            }
        }
    }
    print s
}'
I'm aware this is an extremely inefficient way of doing a selection sort. The potential time savings come from ending the loop the moment all letters have been placed, instead of iterating over all 52 letters every time. The downside is that it doesn't pre-profile the input (e.g. if you detect that a row is lower-case only, you could speed it up with a lowercase-only loop instead).
The upside is that it eliminates the need for custom functions, removes any gawk dependency, and avoids splitting every row into an array (or every character into its own field). Yes, technically you can set FS to the null string so that NF becomes the string length, but that can be slow if the input is large. If you need Unicode support, a match()-based approach is more desirable.
The (x < 53) condition was added to prevent runaway infinite loops in case the input isn't pure ASCII letters.

sed command to remove a string and everything after it in that line

The .cpp file in a directory contains this text:
/**
 * Performs the standard binary search using two comparisons per level.
 * Returns index where item is found or or the index where it chould
 * be inserted if not found
 */
template <typename Comparable>
int binarySearch( const Comparable* a, int size, const Comparable & x )
{
    int low = 0, high = size - 1;   // Set the bounds for the search
    while( low <= high )
    {
        // Examine the element at the midpoint
        int mid = ( low + high ) / 2;
        if( a[ mid ] < x )
            low = mid + 1;          // If x is in the array, it must be in the upper
        else if( a[ mid ] > x )
            high = mid - 1;         // If x is in the array, it must be in the lower
        else
            return mid;             // Found
    }
    // Return the position where x would be inserted to
    // preserve the ordering within the array.
    return low;
}
Using the unix sed command, how would I print the contents of the .cpp file above with all the inline comments strings deleted (which look like this: // ) and all the text after it in that row deleted? I put an example below of what I am looking for. All the // marks and everything after them on that row is gone in this desired output.
/**
 * Performs the standard binary search using two comparisons per level.
 * Returns index where item is found or or the index where it chould
 * be inserted if not found
 */
template <typename Comparable>
int binarySearch( const Comparable* a, int size, const Comparable & x )
{
    int low = 0, high = size - 1;
    while( low <= high )
    {
        int mid = ( low + high ) / 2;
        if( a[ mid ] < x )
            low = mid + 1;
        else if( a[ mid ] > x )
            high = mid - 1;
        else
            return mid;
    }
    return low;
}
If you don't need to use sed, this can be done easily with grep:
cat file.cpp | grep -v \/\/
Explanation:
grep -v will print all lines that don't match the pattern, and the pattern \/\/ is just an escaped version of //
If you do need to use sed, this can still be done easily (it's just arguably not the right tool for the job, and quite a bit slower).
cat file.cpp | sed '/\/\//d'
This deletes every line that contains //.
To remove every line that contains "//":
sed '/\/\//d' file.cpp
To remove "//" and all that follows it on the line:
sed 's|//.*||' file.cpp
To do both (i.e. remove the "//" and all that follows it on the line, and remove that whole line if nothing but whitespace came before it):
sed '/^ *\/\//d;s|//.*||' file.cpp

Fastest way to search for a string with two mismatches in big files in Perl?

I modified the code to work with two files. to_search.txt has the strings to be searched for. big_file.fastq has the lines to be searched; if a string is found (2 mismatches allowed, exact length, no insertions or deletions; the strings range from 8-10 characters), the record is written to the file named after that string. So each string is searched for in the second line of every 4-line record in big_file.fastq.
# to_search.txt: (length can be from 8-20 characters)
1 TCCCTTGT
2 ACGAGACT
3 GCTGTACG
4 ATCACCAG
5 TGGTCAAC
6 ATCGCACA
7 GTCGTGTA
8 AGCGGAGG
9 ATCCTTTG
10 TACAGCGC
#2000 search needed
# big_file.fastq: 2 billion lines (every 4 lines form one record; the string search is on the second line of each record).
# Second line can have 100-200 characters
@M04398:19:000000000-APDK3:1:1101:21860:1000 1:N:0:1
TCttTTGTGATCGATCGATCGATCGATCGGTCGTGTAGCCTCCAACCAAGCACCCCATCTGTTCCAAATCTTCTCCCACTGCTACTTGAAGACGCTGAAGTTGAAGGGCCACCTTCATCATTCTGG
+
#8ACCDGGGGGGGGGGGGGEGGGGGGGGDFFEGGG#FFGGGGGGGGGGGGGGGGGCF#<FFGGGGGFE9FGGGFEACE8EFFFGGGGGF9F?CECEFEG,CFGF,CF#CCC?BFFG<,9<9AEFG,,
@M04398:19:000000000-APDK3:1:1101:13382:1000 1:N:0:1
NTCGATCGATCGATCGATCGATCGATCGTTCTGAGAGGTACCAACCAAGCACACCACGGGCGACACAGACAGCTCCGTGTTGAACGGGTTGTTCTTCTTCTTGCCTTCATCATCCCCATCCTCAGTGGACGCAGCTTGCTCATCCTTCCTC
+
#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@M04398:19:000000000-APDK3:1:1101:18888:1000 1:N:0:1
NCAGAATGAGGAAGGATGAGCCCCGTCGTGTCGAAGCTATTGACACAGCGCTATTCCGTCTTTATGTTCACTTTAAGCGGTACAAGGAGCTGCTTGTTCTGATTCAGGAACCGAACCCTGGTGGTGTGCTTGGTTGGCAAGTTTACGGCTC
+
#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGE
Here is the code for two mismatches. I tried with exact matching and the speed is not bad: it takes around a day (I used the Time::Progress module to estimate this). With 2 mismatches it shows 115 days to finish. How can the speed be improved here?
#!/usr/bin/perl
use strict;
use warnings;
$| = 1;

open( IN_P1, "big_file.fastq" ) or die "File not found";

my ( @sample_file_names, @barcode1 );
open( BC_FILE, "to_search.txt" ) or die "No barcode file";
my @barcode_file_content = <BC_FILE>;
foreach (@barcode_file_content) {
    chomp $_;
    $_ =~ s/\r//;
    $_ =~ s/\n//;
    #print $_;
    my @elements = split( "(\t|,| )", $_ );
    push @sample_file_names, $elements[0];
    push @barcode1,          $elements[2];
}

# open FH
my @fh_array_R1;
foreach (@sample_file_names) {
    chomp $_;
    local *OUT_R1;
    open( OUT_R1, ">", "$_\.fq" ) or die "cannot write file";
    push @fh_array_R1, *OUT_R1;
}

# unknown barcode file
open( UNKNOWN_R1, ">unknown-barcode_SE.fq" ) or die "cannot create unknown-r1 file";

while ( defined( my $firstp1 = <IN_P1> ) ) {
    my $p1_first_line  = $firstp1;
    my $p1_second_line = <IN_P1>;
    my $p1_third_line  = <IN_P1>;
    my $p1_fourth_line = <IN_P1>;
    chomp( $p1_first_line, $p1_second_line, $p1_third_line, $p1_fourth_line, );
    my $matched_R1 = "$p1_first_line\n$p1_second_line\n$p1_third_line\n$p1_fourth_line\n";

    for ( my $j = 0 ; $j < scalar @barcode1 ; $j++ ) {
        chomp $barcode1[$j];
        my $barcode1_regex = make_barcode_fragments( $barcode1[$j] );
        if ( $p1_second_line =~ /$barcode1_regex/i ) {
            # keep if matched
            print { $fh_array_R1[$j] } $matched_R1;
            last;
        }
        else {
            # print to unknown
            print UNKNOWN_R1 $matched_R1;
        }
    }
}

# make two-mismatch pattern of barcode
sub make_barcode_fragments {
    my ($in1) = @_;
    my @subpats;
    for my $i ( 0 .. length($in1) - 1 ) {
        for my $j ( $i + 1 .. length($in1) - 1 ) {
            my $subpat = join( '',
                substr( $in1, 0, $i ),
                '\\w', substr( $in1, $i + 1, $j - $i - 1 ),
                '\\w', substr( $in1, $j + 1 ),
            );
            push @subpats, $subpat;
        }
    }
    my $pat = join( '|', @subpats );
    #print $pat;
    return "$pat";
}
exit;
If your algorithm cannot be changed or improved in Perl itself, you can still get a speedup by writing the time-consuming parts in C. Here is an example using Inline::C:
use strict;
use warnings;
use Benchmark qw(timethese);
use Inline C => './check_line_c.c';

my $find   = "MATCH1";
my $search = "saasdadadadadasd";
my %sub_info = (
    c    => sub { check_line_c( $find, $search ) },
    perl => sub { check_line_perl( $find, $search ) },
);
timethese( 4_000_000, \%sub_info );

sub check_line_perl {
    my ( $find, $search ) = @_;
    my $max_distance = 2;
    for my $offset ( 0 .. length($search) - length($find) ) {
        my $substr = substr( $search, $offset, length($find) );
        my $hd = hd( $find, $substr );
        if ( $hd <= $max_distance ) {
            return ( $hd, $substr );
        }
    }
    return ( undef, undef );
}

# Hamming distance: count the byte positions where the two strings differ
sub hd {
    return ( $_[0] ^ $_[1] ) =~ tr/\001-\377//;
}
where check_line_c.c is:
void check_line_c( char* find, char* search ) {
    int max_distance = 2;
    int flen = strlen(find);
    int last_ind = strlen(search) - flen;
    SV *dis = &PL_sv_undef;
    SV *match = &PL_sv_undef;
    for ( int ind = 0; ind <= last_ind; ind++ )
    {
        int count = 0;
        for ( int j = 0; j < flen; j++ )
        {
            if ( find[j] ^ search[ind+j] ) count++;
        }
        if ( count <= max_distance )  /* allow up to max_distance mismatches, as in the Perl version */
        {
            match = newSV(flen);
            sv_catpvn(match, search+ind, flen );
            dis = newSViv(count);
            break;
        }
    }
    Inline_Stack_Vars;
    Inline_Stack_Reset;
    Inline_Stack_Push(sv_2mortal(dis));
    Inline_Stack_Push(sv_2mortal(match));
    Inline_Stack_Done;
}
The output is (Ubuntu laptop using an Intel Core i7-4702MQ CPU @ 2.20GHz):
Benchmark: timing 4000000 iterations of c, perl...
         c:  2 wallclock secs ( 0.76 usr +  0.00 sys =  0.76 CPU) @ 5263157.89/s (n=4000000)
      perl: 19 wallclock secs (18.30 usr +  0.00 sys = 18.30 CPU) @ 218579.23/s (n=4000000)
So this gives a 24-fold speedup for this case.
I'd suggest creating a really bad hashing algorithm. Something nice and reversible and inefficient, like the sum of the characters. Or maybe the sum of unique values (1-4) represented by the characters.
Compute the target sums, and also compute the maximum allowed variance. That is, if the objective is a match with two substitutions, then what is the maximum possible difference? (4-1 + 4-1 = 6).
Then, for each "window" of text of the appropriate length in the target data file, compute a running score. (Add a character to the end, drop a character from the start, update the hash score.) If the score for a window is within the allowable range, you can do further investigation.
You might want to implement this as different passes. Possibly even as different stages in a shell pipeline or script. The idea being that you might be able to parallelize parts of the search. (For instance, all the match strings with the same length could be searched by one process, since the hash windows are the same.)
Also, of course, it is beneficial that you can keep your early work if your program crashes in the later stages. And you can even have the early parts of the process running while you are still developing the end stages.
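As a rough illustration of that rolling-score prefilter (my own sketch with a made-up line and a plain character-sum "hash"; not code from the answer), something along these lines could flag candidate windows before running the exact mismatch check:
use strict;
use warnings;

my $target = "TCCCTTGT";                      # one of the barcodes from to_search.txt
my $line   = "ACGTTCCCTAGTACGT";              # made-up sequence line
my $k      = length $target;

my $target_sum = 0;
$target_sum += ord for split //, $target;

# With 2 substitutions allowed over the alphabet A,C,G,T, a window's character
# sum can differ from the target's by at most 2 * (largest ord - smallest ord).
my @ords = map { ord } qw(A C G T);
my ($min, $max) = (sort { $a <=> $b } @ords)[0, -1];
my $max_diff = 2 * ($max - $min);

my $sum = 0;
$sum += ord substr($line, $_, 1) for 0 .. $k - 1;     # sum of the first window
for my $i (0 .. length($line) - $k) {
    if (abs($sum - $target_sum) <= $max_diff) {
        print "candidate window at offset $i: ", substr($line, $i, $k), "\n";
        # ...only these candidates need the exact 2-mismatch comparison...
    }
    last if $i + $k >= length $line;                  # no character left to slide in
    $sum += ord(substr($line, $i + $k, 1)) - ord(substr($line, $i, 1));   # slide the window
}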

Converting string of ASCII characters to string of corresponding decimals

May I introduce you to the problem that destroyed my weekend. I have biological data in 4 columns:
#ID:::12345/1 ACGACTACGA text !"#$%vwxyz
#ID:::12345/2 TATGACGACTA text :;<=>?VWXYZ
I would like to use awk to edit the first column, replacing the characters : and / with -.
I would also like to convert the string in the last column to a comma-separated string of decimal values corresponding to each individual ASCII character (any character in the range ASCII 33 - 126).
#ID---12345-1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345-2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90
The first part is easy, but I'm stuck on the second. I've tried using awk ordinal functions and sprintf; I can only get the former to work on the first character of the string, and I can only get the latter to convert hexadecimal to decimal and not with spaces. I also tried this shell command:
$ od -t d1 test3 | awk 'BEGIN{OFS=","}{i = $1; $1 = ""; print $0}'
But I don't know how to call this from within awk.
I would prefer to use awk as I have some downstream manipulations that can also be done in awk.
Many thanks in advance
Using the ordinal functions from the awk manual, you can do it like this:
awk -f ord.awk --source '{
# replace : with - in the first field
gsub(/:/,"-",$1)
# calculate the ordinal by looping over the characters in the fourth field
res=ord($4)
for(i=2;i<=length($4);i++) {
res=res","ord(substr($4,i))
}
$4=res
}1' file
Output:
#ID---12345-1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345-2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90
Here is ord.awk (taken as is from: http://www.gnu.org/software/gawk/manual/html_node/Ordinal-Functions.html)
# ord.awk --- do ord and chr

# Global identifiers:
#    _ord_:      numerical values indexed by characters
#    _ord_init:  function to initialize _ord_

BEGIN { _ord_init() }

function _ord_init(    low, high, i, t)
{
    low = sprintf("%c", 7) # BEL is ascii 7
    if (low == "\a") {     # regular ascii
        low = 0
        high = 127
    } else if (sprintf("%c", 128 + 7) == "\a") {
        # ascii, mark parity
        low = 128
        high = 255
    } else {               # ebcdic(!)
        low = 0
        high = 255
    }

    for (i = low; i <= high; i++) {
        t = sprintf("%c", i)
        _ord_[t] = i
    }
}

function ord(str,    c)
{
    # only first character is of interest
    c = substr(str, 1, 1)
    return _ord_[c]
}

function chr(c)
{
    # force c to be numeric by adding 0
    return sprintf("%c", c + 0)
}
If you don't want to include the whole of ord.awk, you can do it like this:
awk 'BEGIN{ _ord_init() }
function _ord_init(    low, high, i, t)
{
    low = sprintf("%c", 7) # BEL is ascii 7
    if (low == "\a") {     # regular ascii
        low = 0
        high = 127
    } else if (sprintf("%c", 128 + 7) == "\a") {
        # ascii, mark parity
        low = 128
        high = 255
    } else {               # ebcdic(!)
        low = 0
        high = 255
    }
    for (i = low; i <= high; i++) {
        t = sprintf("%c", i)
        _ord_[t] = i
    }
}
{
    # replace : and / with - in the first field
    gsub(/[:\/]/,"-",$1)
    # calculate the ordinals by looping over the characters in the fourth field
    res = _ord_[substr($4,1,1)]
    for (i = 2; i <= length($4); i++) {
        res = res "," _ord_[substr($4,i,1)]
    }
    $4 = res
}1' file
Perl solution:
perl -lnae '$F[0] =~ s%[:/]%-%g; $F[-1] =~ s/(.)/ord($1) . ","/ge; chop $F[-1]; print "@F";' < input
The first substitution replaces : and / in the first field with a dash, the second one replaces each character in the last field with its ord and a comma, and chop removes the trailing comma.

Find the number of occurrences of each two-character combination in a string in Perl

Is there a method in Perl (not BioPerl) to find the number of occurrences of each pair of consecutive letters?
I.e., the number of AA, AC, AG, AT, CC, CA, ... in a sequence like this:
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
PS: We can do it manually using a regular expression, e.g. $GC = ($sequence =~ s/GC/GC/g), which returns the number of GC occurrences in the sequence.
I need an automated and generic way.
You had me confused for a while, but I take it you want to count the dinucleotides in a given string.
Code:
my @dinucs = qw(AA AC AG CC CA CG);
my %count;
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
for my $dinuc (@dinucs) {
    $count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
}
Output from Data::Dumper:
$VAR1 = {
    "AC" => 5,
    "CC" => "",
    "AG" => "",
    "AA" => 1,
    "CG" => 3,
    "CA" => ""
};
Close to TLP's answer, but without substitution:
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs = qw(AA AC AG AT CC CG);
my %count = map { $_ => 0 } @dinucs;
for my $dinuc (@dinucs) {
    while ($sequence =~ /$dinuc/g) {
        $count{$dinuc}++;
    }
}
Benchmark:
use Benchmark qw(cmpthese);

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs = qw(AA AC AG AT CC CG);
my %count = map { $_ => 0 } @dinucs;
my $count = -3;
my $r = cmpthese($count, {
    'match' => sub {
        for my $dinuc (@dinucs) {
            while ($sequence =~ /$dinuc/g) {
                $count{$dinuc}++;
            }
        }
    },
    'substitute' => sub {
        for my $dinuc (@dinucs) {
            $count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
        }
    }
});
Output:
                Rate substitute      match
substitute   13897/s         --       -11%
match        15622/s        12%         --
Regex works if you're careful, but there's a simple solution using substr that will be faster and more flexible.
(As of this posting, the regex solution marked as accepted will fail to correctly count dinucleotides in repeated regions like 'AAAA...', of which there are many in naturally occurring sequences.
Once you match 'AA', the regex search resumes on the third character, skipping the middle 'AA' dinucleotide. This doesn't affect the other dinucleotides since if you have 'AC' at one position, you're guaranteed not to have it in the next base, naturally. The particular sequence given in the question will not suffer from this problem since no base appears three times in a row.)
The method I suggest is more flexible in that it can count words of any length; extending the regex method to longer words is complicated since you have to do even more gymnastics with your regex to get an accurate count.
sub substrWise {
    my ($seq, $wordLength) = @_;
    my $cnt = {};
    my $w;
    for my $i (0 .. length($seq) - $wordLength) {
        $w = substr($seq, $i, $wordLength);
        $cnt->{$w}++;
    }
    return $cnt;
}

sub regexWise {
    my ($seq, $dinucs) = @_;
    my $cnt = {};
    for my $d (@$dinucs) {
        if (substr($d, 0, 1) eq substr($d, 1, 1)) {
            my $n = substr($d, 0, 1);
            $cnt->{$d} = ($seq =~ s/$n(?=$n)/$n/g);    # use look-ahead
        } else {
            $cnt->{$d} = ($seq =~ s/$d/$d/g);
        }
    }
    return $cnt;
}

my @dinucs = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';

use Test::More tests => 1;

my $rWise = regexWise($sequence, \@dinucs);
my $sWise = substrWise($sequence, 2);
$sWise->{$_} //= '' for @dinucs;    # substrWise will not create keys for words not found;
                                    # this seems like desirable behavior IMO,
                                    # but I'm adding '' to show that the counts match
is_deeply($rWise, $sWise, 'verify equivalence');

use Benchmark qw(:all);
cmpthese(100000, {
    'regex' => sub {
        regexWise($sequence, \@dinucs);
    },
    'substr' => sub {
        substrWise($sequence, 2);
    }
});
Output:
1..1
ok 1 - verify equivalence
           Rate   regex  substr
regex   11834/s      --    -85%
substr  76923/s    550%      --
For longer sequences (10-100 kbase), the advantage is not as pronounced, but it still wins by about 70%.
