This question is closely related to another stackoverflow question. Looking for a very efficient solution to the question asked there. Have suffix arrays been implemented in perl ?
Here is my current solution in perl.
chomp(my $ipstr = <>);
my #bigstrchars = split(//, $ipstr);
my $length = (length $ipstr);
my $sum = 0;
my $span = 1;
my $flag = 0;
while ($span < $length) {
for ($j=0; $j+$span<$length; $j++) {
if ($bigstrchars[$j] eq $bigstrchars[$j+$span]) {
$sum++;
}
else {
last;
}
}
if ($span == 1 && $sum == ($length-1)) {
$sum = $length * ($length+1) * 0.5;
$flag = 1;
last;
}
$span++;
}
unless ($flag) {
$sum += $length;
}
How can this be improved ?
EDIT
Stating the problem here :
For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.
The problem is to give an algorithm to calculate the sum of similarities of a string S with each of it's suffixes. For example, let the string be : ababaa. Then, the suffixes of the string are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string ababaa are 6,0,3,0,1,1, respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11
What about Array::Suffix?
Flesk's solution is pretty elegant. You asked for efficiency, then you asked for improvement. When it comes to perl, I find whatever takes the less time to understand when coming back to it after 3 months to be the best improvement. So take into consideration something a little more descriptive:
use Data::Dumper;
use strict;
main();
sub main {
my $string = "ababaa"; # input string
my $parts; # hash ref
my #suffixes = split '',$string; # break input into tokens
my $running_sum = 0;
$"='';
# Build suffix tree
for (0..$#suffixes){
$parts->{"#suffixes"}=0;
shift #suffixes;
}
# Compare suffixes to initial string
for my $suffix (sort keys %$parts){
$parts->{$suffix} = getMatches($suffix,$string);
$running_sum += $parts->{$suffix};
}
# Output
$Data::Dumper::Sortkeys++;
print Dumper($parts), "\nTotal Matches: $running_sum";
}
sub getMatches{
my ($word,$string) = #_;
my $part = '';
my $offset = 0;
my $matches = 0;
for (0..(length($word) - 1)){
$offset++;
$part = substr($word,0,$offset);
if ($string =~ /^$part/){ $matches++; }
}
return $matches;
}
There are obvious inefficiencies that could be improved (loops, regex comparison, subroutine call), but the point of this answer is an alternative to something that I've already identified as better for the only benefit of better future comprehension.
If I understand the algorithm correctly and you want to calculate the sum of longest common prefixes, your implementation is incorrect, since you're lacking the ascending lexicographic sort.
Here's one way to solve your problem:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $ipstr = <>);
my #subipstrs = map [split//], sort map{substr $ipstr, $_} 0 .. length($ipstr) - 1;
my $sum = 0;
for my $i (1 .. $#subipstrs) {
my #last = #{$subipstrs[$i-1]};
my #this = #{$subipstrs[$i]};
my $j = 0;
$sum++ while $j < #last && $j < #this && $last[$j] eq $this[$j++];
}
For the example string ababaa in the question you refer to this will produce the suffix array
5 | a
4 | aa
2 | abaa
0 | ababaa
3 | baa
1 | babaa
represented by #subipstrs
#subipstrs = (
['a'],
['a', 'a'],
['a', 'b', 'a', 'a'],
['a', 'b', 'a', 'b', 'a', 'a'],
['b', 'a', 'a'],
['b', 'a', 'b', 'a', 'a']
);
This makes calculating lcps a matter of comparing neighboring array refs element by element while pairs match, and adding up the total number of matches. The result is
5 | a | 0
4 | aa | 1
2 | abaa | 1
0 | ababaa | 3
3 | baa | 0
1 | babaa | 2
With a total of 7, not 11.
EDIT: This solves the problem you're interested in:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $ipstr = <>);
my $len = my $sum = length($ipstr);
for my $i (1 .. $len -1) {
my $substr = substr $ipstr, $i;
chop $substr while $substr ne substr $ipstr, 0, length($substr);
$sum += length($substr);
}
And is a bit faster than your solution with your example string and 1M iterations:
trinity 80906/s -- -32%
flesk 119332/s 47% --
EDIT2: This one's faster because it works from the start of the strings and are able to discard negative matches faster:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $ipstr = <>);
my $len = my $sum = length($ipstr);
for my $i (1 .. $len - 1) {
my $ipstrcopy = reverse $ipstr;
my $substr = reverse substr $ipstr, $i;
my ($slen, $j) = (length($substr), 0);
$sum++ while $j++ <= $slen && chop $ipstrcopy eq chop $substr;
}
ababaa and 100K iterations:
trinity 81967/s -- -24%
flesk 107527/s 31% --
abcdefghijklmnopqrstuvwxyz and 100K iterations:
trinity 26178/s -- -15%
flesk 30769/s 18% --
aaaaaaaaaaabbbaaaaaaaaaaaaaaaabbbaaaaaaaaa and 100K iterations:
trinity 5435/s -- -30%
flesk 7800/s 44% --
The algorithm can probably be improved further by reversing $ipstr before the loop or just using substrs instead of chop.
Related
I've found index and rindex for finding the first or last occurrence of a character (or substring) in a string. I'm also aware that they have an offset that can be used to start at a certain index.
What I want to know is if there is a simple way to find the index of the Nth occurrence of a character or substring in a string. I'd prefer not to have to do it with a regex and would rather not to have to write a loop that just repeatedly calls index with an offset.
EDIT: I didn't state the restriction well. The reason I said "no loop" is I am looking for a built-in way to do it, which exists in many languages.
One possible implementation:
use strict;
use warnings;
use feature qw(say);
my $string = 'the quick brown fox jumped over the lazy dog';
my $substring = 'o';
my $n = 4;
sub nth_index {
my ($string, $substring, $n) = #_;
my ($times, $index) = (0, 0);
while ( $times < $n && $index != -1 ) {
$index = index(
$string,
$substring,
$times == 0
? 0
: $index + length($substring),
);
$times++;
}
return $index;
}
say nth_index($string, $substring, $n); # 42
Here are two examples of how I would solve the problem
Subroutine nth_index1 uses index, while nth_index2 uses a regex. Both involve loops, as any solution must
I believe the regex solution is much more readable, while the index solution is probably a little faster. But they are both so quick that they are highly unlikely to cause a bottleneck, and readability is always paramount
use strict;
use warnings 'all';
my $s = 'the quick brown fox jumps over the lazy dog';
my $ss = 'o';
for my $n ( 1 .. 4 ) {
printf "%d %d\n",
nth_index1($s, $ss, $n),
nth_index2($s, $ss, $n);
}
sub nth_index1 {
my ($s, $ss, $n) = #_;
my $i;
my $len = length $ss;
while ( $n-- ) {
$i = index($s, $ss, $i ? $i + $len : 0 );
return if $i < 0;
}
$i;
}
sub nth_index2 {
my ($s, $ss, $n) = #_;
while ( $s =~ /$ss/g ) {
return $-[0] unless --$n;
}
return;
}
output
12 12
17 17
26 26
41 41
As stated, there is no built-in for this. Here are a few ways, using split, index, and regex.
use warnings;
use strict;
use feature qw(say);
my $str = "Xab_ab_ab_ab_"; # 'Xab_ab'; # test failed (3) matches
my $N = 3;
foreach my $patt qw(a ab c) {
say "Find index of occurrence $N of |$patt| in: |$str|";
say "index: ", ( ind_Nth_match_1($str, $patt, $N) // "no $N matches" ); #/
say "split: ", ( ind_Nth_match_2($str, $patt, $N) // "no $N matches" ); #/
say "regex: ", ( ind_Nth_match_3($str, $patt, $N) // "no $N matches" ); #/
}
sub ind_Nth_match_1 {
my ($str, $patt, $N) = #_;
my ($pos, $cnt) = (0, 0);
while ($pos = index($str, $patt, $pos) + 1) { # != 0
return $pos-1 if ++$cnt == $N;
}
return;
}
sub ind_Nth_match_2 {
my ($str, $patt, $N) = #_;
my #toks = split /($patt)/, $str;
return if #toks < 2*$N;
return length( join '', #toks[0..2*$N-1] ) - length($patt);
}
sub ind_Nth_match_3 {
my ($str, $patt, $N) = #_;
my $cnt = 0;
while ($str =~ m/$patt/g) {
return $-[0] if ++$cnt == $N;
}
}
This prints
Find index of occurrence 3 of |a| in: |Xab_ab_ab_ab_|
index: 7
split: 7
regex: 7
Find index of occurrence 3 of |ab| in: |Xab_ab_ab_ab_|
index: 7
split: 7
regex: 7
Find index of occurrence 3 of |c| in: |Xab_ab_ab_ab_|
index: no 3 matches
split: no 3 matches
regex: no 3 matches
Notes
In split each delimiter is also returned in the output list, with capturing /($patt)/, for simpler length estimation. Thus we count 2*$N (and then take -1).
In regex the #- array is used, #LAST_MATCH_START, for the position of the last successful match. Here the /g in scalar context in while makes it jump from a match to the next in repeated executions, and $-[0] gives the starting position of the last (previous) such match.
The subs return undef if there are not required $N matches, including no matches at all.
Thanks to Borodin for comments on return from subs and on using #- instead of #+.
(This answer does not answer your question, but is here to help you down the road towards accepting a regex solution.)
The way you're asking for a numeric index of the position, it sounds like you're thinking of extracting data from the string once you have that number, in a way that C programmers might do it.
For example, say you had the string
my $str = "My daddy left home when I was three and he didn't leave much for ma and me";
and you wanted to extract all the data up to the first instance of the word "and". Here's the way you could do it, which is sort of a C way to do it using Perl.
my $pos = find_index_of_first_occurrence( $str, 'and' );
# Note that find_index_of_first_occurrence() is a hypothetical function.
print substr( $str, 0, $pos );
# Prints "My daddy left home when I was three "
The way you'd do that in Perl with regexes is much simpler.
$str =~ /^(.*?)and/;
print $1;
With regexes, you're combining the searching for the string and the extraction of the data in one operation. (Note that both code snippets ignore the case of not finding "and" at all, for the sake of simplicity)
I understand that you don't know regexes very well yet, and that regexes can be daunting at first, but you'll need to understand them as part of learning Perl if you're going to succeed with the language.
I have a bunch of decimal numbers (as strings) which I receive from an API. I need to 'unscale' them, i.e. divide them by some power of 10. This seems a simple task for integers, but I have decimals with no guaranteed range. So, basically I need a function that works like this:
move_point "12.34" 1; # "1.234"
move_point "12.34" 5; # "0.0001234"
I'd rather not use floats to avoid any rounding errors.
This is a bit verbose, but should do the trick:
sub move_point {
my ($n, $places) = #_;
die 'negative number of places' if $places < 0;
return $n if $places == 0;
my ($i, $f) = split /\./, $n; # split to integer/fractional parts
$places += length($f);
$n = sprintf "%0*s", $places+1, $i.$f; # left pad with enough zeroes
substr($n, -$places, 0, '.'); # insert the decimal point
return $n;
}
Demo:
my $n = "12.34";
for my $p (0..5) {
printf "%d %s\n", $p, move_point($n, $p);
}
0 12.34
1 1.234
2 0.1234
3 0.01234
4 0.001234
5 0.0001234
Unless your data has contains values with significantly more digits than you have shown then a floating-point value has more than enough accuracy for your purpose. Perl can reliably reproduce up to 16-digit values
use strict;
use warnings 'all';
use feature 'say';
say move_point("12.34", 1); # "1.234"
say move_point("12.34", 5); # "0.0001234"
say move_point("1234", 12);
say move_point("123400", -9);
sub move_point {
my ($v, $n) = #_;
my $dp = $v =~ /\.([^.]*)\z/ ? length $1 : 0;
$dp += $n;
$v /= 10**$n;
sprintf '%.*f', $dp < 0 ? 0 : $dp, $v;
}
output
1.234
0.0001234
0.000000001234
123400000000000
Update
If the limits of standard floating-point numbers are actually insuffcient for you then the core Math::BigFloat will do what you need
This program shows a number with sixteen digits of accuracy, multiplied by everything from 10E-20 to 10E20
use strict;
use warnings 'all';
use feature 'say';
use Math::BigFloat;
for ( -20 .. 20 ) {
say move_point('1234567890.1234567890', $_);
}
sub move_point {
my ($v, $n) = #_;
$v = Math::BigFloat->new($v);
# Build 10**$n
my $mul = Math::BigFloat->new(10)->bpow($n);
# Count new decimal places
my $dp = $v =~ /\.([^.]*)\z/ ? length $1 : 0;
$dp += $n;
$v->bdiv($mul);
$v->bfround(-$dp) if $dp >= 0;
$v->bstr;
}
output
123456789012345678900000000000
12345678901234567890000000000
1234567890123456789000000000
123456789012345678900000000
12345678901234567890000000
1234567890123456789000000
123456789012345678900000
12345678901234567890000
1234567890123456789000
123456789012345678900
12345678901234567890
1234567890123456789
123456789012345678.9
12345678901234567.89
1234567890123456.789
123456789012345.6789
12345678901234.56789
1234567890123.456789
123456789012.3456789
12345678901.23456789
1234567890.123456789
123456789.0123456789
12345678.90123456789
1234567.890123456789
123456.7890123456789
12345.67890123456789
1234.567890123456789
123.4567890123456789
12.34567890123456789
1.234567890123456789
0.1234567890123456789
0.01234567890123456789
0.001234567890123456789
0.0001234567890123456789
0.00001234567890123456789
0.000001234567890123456789
0.0000001234567890123456789
0.00000001234567890123456789
0.000000001234567890123456789
0.0000000001234567890123456789
0.00000000001234567890123456789
I have file with following entries:
1,2
2,3
4,5
1,3
1,4
5,6
...
This tells the ids: first column matches with second column. Now I want to find all id groups that are having all combinations only. i.e. the following needs to be output:
1,2,3
4,5
1,4
5,6
I tried to write a perl script for the solution:
while(<STDIN>) {
if(m/^(\d+),(\d+)/) {
$dub{$1}{$2} = 1;
$dub{$2}{$1} = 1;
$hs{$1} = 1;
$hs{$2} = 1;
}
}
$i=0;
foreach $a (keys %dub) {
$grp[$i]{$a} = 1;
foreach $b (keys %{$dub{$a}}) {
$grp[$i]{$b} = 1;
foreach $c (keys %hs) {
if($c == $a || $c == $b) { next; }
$flag = 1;
foreach $d (keys %{$grp[$i]}) {
if(!$dub{$d}{$c}) {
$flag = 0;
last;
}
}
$grp[$i]{$c} = 1 if($flag);
}
$i++;
}
}
for($i=0; $i<=$#grp; $i++) {
print join(",", (keys %{$grp[$i]}))."\n";
}
But this takes hell lot of time for execution.
Is there a better solution, algorithm or performance tune for above script?
Any solution in LAMP is appreciated.
Thanks
EDIT:
Think of this way:
(1,2) is defined as "1 and 2 are similar"
(2,3) is defined as "2 and 3 are similar"
(1,4) is defined as "1 and 4 are similar"
(1,3) is defined as "1 and 3 are similar"
From these similarities I conclude that group (1,2,3) are similar to each other but not group (1,2,3,4).
In order to form group (1,2,3,4) there should be other entries in data as (2,4) and (3,4).
Finally I wanted to find all groups in given set of co-ordinates.
From my understanding, {1,2,3} are in the same group because all point to each other ({1,2}, {2,3},{1,3} exist). So, we can reduce this problem to finding cliques in an undirected graph, which is NP-Complete problem. So, every solution will be quite inefficient on big data.
This works for me:
use Data::Dump;
my #results;
my ($last_a, $last_b) = (0,0);
while(<DATA>) {
chomp;
my ($a, $b) = split /,/;
if( $last_b == $a ) {
my $last_item = $results[$#results];
push #$last_item, $b;
}
else {
push #results, [$a, $b];
}
($last_a, $last_b) = ($a, $b);
}
dd #results; # ([1, 2, 3], [4, 5], [1, 3], [1, 4], [5, 6])
__DATA__
1,2
2,3
4,5
1,3
1,4
5,6
You haven't really described the algorithm that we're supposed to used. I can't really understand why your input generates "1,2,3" and "1,4" rather than just "1,2,3,4".
But is this what you want?
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my %data;
while (<DATA>) {
chomp;
my ($k, $v) = split /,/;
push #{ $data{$k} }, $v;
}
foreach (sort keys %data) {
say "$_,", join ',', #{ $data{$_ } };
}
__DATA__
1,2
2,3
4,5
1,3
1,4
5,6
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to plot a gene graph for a DNA sequence say ATGCCGCTGCGC?
Im trying to write a Perl script that compares two DNA sequences (60 characters in length each lets say) in alignment, and then show the ratio of matches to non-matches of the sequences to each other. But i'm not having much luck. if it helps i can upload my code, but its no use. here's an example of what im trying to achieve below.
e.g
A T C G T A C
| | | | | | |
T A C G A A C
So the matches of the above example would be 4. and non-matches are: 3. Giving it a ratio of 4.3.
Any help would be much appreciated. thanks.
in general, please do post your code. It does help. In any case, something like this should do what you are asking:
#!/usr/bin/perl -w
use strict;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my #dna1=split(//,$d1);
my #dna2=split(//,$d2);
my $matches=0;
for (my $i=0; $i<=$#dna1; $i++) {
$matches++ if $dna1[$i] eq $dna2[$i];
}
my $mis=scalar(#dna1)-$matches;
print "Matches/Mismatches: $matches/$mis\n";
Bear in mind though that the ratio of 4 to 3 is most certainly not 4.3 but ~1.3. If you post some information on your input file format I will update my answer to include lines for parsing the sequence from your file.
Normally I'd say "What have you tried" and "upload your code first" because it doesn't seem to be a very difficult problem. But let's give this a shot:
create two arrays, one to hold each sequence:
#sequenceOne = ("A", "T", "C", "G", "T", "A", "C");
#sequenceTwo = ("T", "A", "C", "G", "A", "A", "C");
$myMatch = 0;
$myMissMatch = 0;
for ($i = 0; $i < #sequenceOne; $i++) {
my $output = "Comparing " . $sequenceOne[$i] . " <=> " . $sequenceTwo[$i];
if ($sequenceOne[$i] eq $sequenceTwo[$i]) {
$output .= " MATCH\n";
$myMatch++;
} else {
$myMissMatch++;
$output .= "\n";
}
print $output;
}
print "You have " . $myMatch . " matches.\n";
print "You have " . $myMissMatch . " mismatches\n";
print "The ratio of hits to misses is " . $myMatch . ":" . $myMissMatch . ".\n";
Of course, you'd probably want to read the sequence from something else on the fly instead of hard-coding the array. But you get the idea. With the above code your output will be:
torgis-MacBook-Pro:platform-tools torgis$ ./dna.pl
Comparing A <=> T
Comparing T <=> A
Comparing C <=> C MATCH
Comparing G <=> G MATCH
Comparing T <=> A
Comparing A <=> A MATCH
Comparing C <=> C MATCH
You have 4 matches.
You have 3 mismatches
The ratio of hits to misses is 4:3.
So many ways to do this. Here's one.
use strict;
use warnings;
my $seq1 = "ATCGTAC";
my $seq2 = "TACGAAC";
my $len = length $seq1;
my $matches = 0;
for my $i (0..$len-1) {
$matches++ if substr($seq1, $i, 1) eq substr($seq2, $i, 1);
}
printf "Length: %d Matches: %d Ratio: %5.3f\n", $len, $matches, $matches/$len;
exit 0;
Just grab the length of one of the strings (we're assuming string lengths are equal, right?), and then iterate using substr.
my #strings = ( 'ATCGTAC', 'TACGAAC' );
my $matched;
foreach my $ix ( 0 .. length( $strings[0] ) - 1 ) {
$matched++
if substr( $strings[0], $ix, 1 ) eq substr( $strings[1], $ix, 1 );
}
print "Matches: $matched\n";
print "Mismatches: ", length( $strings[0] ) - $matched, "\n";
I think substr is the way to go, rather than splitting the strings into arrays.
This is probably most convenient if presented as a subroutine:
use strict;
use warnings;
print ratio(qw/ ATCGTAC TACGAAC /);
sub ratio {
my ($aa, $bb) = #_;
my $total = length $aa;
my $matches = 0;
for (0 .. $total-1) {
$matches++ if substr($aa, $_, 1) eq substr($bb, $_, 1);
}
$matches / ($total - $matches);
}
output
1.33333333333333
Bill Ruppert's right that there are many way to do this. Here's another:
use Modern::Perl;
say compDNAseq( 'ATCGTAC', 'TACGAAC' );
sub compDNAseq {
my $total = my $i = 0;
$total += substr( $_[1], $i++, 1 ) eq $1 while $_[0] =~ /(.)/g;
sprintf '%.2f', $total / ( $i - $total );
}
Output:
1.33
Here is an approach which gives a NULL, \0, for each match in an xor comparison.
#!/usr/bin/perl
use strict;
use warnings;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my $len = length $d1; # assumes $d1 and $d2 are the same length
my $matches = () = ($d1 ^ $d2) =~ /\0/g;
printf "ratio of %f", $matches / ($len - $matches);
Output: ratio of 1.333333
I am very new to Perl, and I am trying to write a word frequency counter as a learning exercise.
However, I am not able to figure out the error in my code below, after working on it. This is my code:
$wa = "A word frequency counter.";
#wordArray = split("",$wa);
$num = length($wa);
$word = "";
$flag = 1; # 0 if previous character was an alphabet and 1 if it was a blank.
%wordCount = ("null" => 0);
if ($num == -1) {
print "There are no words.\n";
} else {
print "$length";
for $i (0 .. $num) {
if(($wordArray[$i]!=' ') && ($flag==1)) { # start of a new word.
print "here";
$word = $wordArray[$i];
$flag = 0;
} elsif ($wordArray[$i]!=' ' && $flag==0) { # continuation of a word.
$word = $word . $wordArray[$i];
} elsif ($wordArray[$i]==' '&& $flag==0) { # end of a word.
$word = $word . $wordArray[$i];
$flag = 1;
$wordCount{$word}++;
print "\nword: $word";
} elsif ($wordArray[$i]==" " && $flag==1) { # series of blanks.
# do nothing.
}
}
for $i (keys %wordCount) {
print " \nword: $i - count: $wordCount{$i} ";
}
}
It's neither printing "here", nor the words. I am not worried about optimization at this point, though any input in that direction would also be much appreciated.
This is a good example of a problem where Perl will help you work out what's wrong if you just ask it for help. Get used to always adding the lines:
use strict;
use warnings;
to the top of your Perl programs.
Fist off,
$wordArray[$i]!=' '
should be
$wordArray[$i] ne ' '
according to the Perl documentation for comparing strings and characters. Basically use numeric operators (==, >=, …) for numbers, and string operators for text (eq, ne, lt, …).
Also, you could do
#wordArray = split(" ",$wa);
instead of
#wordArray = split("",$wa);
and then #wordArray wouldn't need to do the wonky character checking and you never would have had the problem. #wordArray will be split into the words already and you'll just have to count the occurrences.
You seem to be writing C in Perl. The difference is not just one of style. By exploding a string into a an array of individual characters, you cause the memory footprint of your script to explode as well.
Also, you need to think about what constitutes a word. Below, I am not suggesting that any \w+ is a word, rather pointing out the difference between \S+ and \w+.
#!/usr/bin/env perl
use strict; use warnings;
use YAML;
my $src = '$wa = "A word frequency counter.";';
print Dump count_words(\$src, 'w');
print Dump count_words(\$src, 'S');
sub count_words {
my $src = shift;
my $class = sprintf '\%s+', shift;
my %counts;
while ($$src =~ /(?<sequence> $class)/gx) {
$counts{ $+{sequence} } += 1;
}
return \%counts;
}
Output:
---
A: 1
counter: 1
frequency: 1
wa: 1
word: 1
---
'"A': 1
$wa: 1
=: 1
counter.";: 1
frequency: 1
word: 1