Related
Compare two strings and find mismatch and mismatch and count them both
string1 = "SEQUENCE"
string2 = "SEKUEAEE"
I want output like. With the mismatch and match count.
'SS' match 1
'EE' match 3
'UU' match 1
'QK' mismatch 1
'NA' mismatch 1
'CE' mismatch 1
Here's a solution in old Perl. Also works with however many strings you want
use warnings;
use strict;
use List::AllUtils qw( mesh part count_by pairs );
my #strings = ("SEQUENCES", "SEKUEAEES", "SEKUEAEES");
my $i = 0;
print join "",
map { $_->[0] . " " . ($_->[1] > 1 ? 'match' : 'mismatch') . " " . $_->[1] ."\n" }
pairs
count_by { $_ }
map { join "", #$_ }
part { int($i++/scalar #strings) }
&mesh( #{[ map { [ split // ] } #strings ]} )
;
And here for comparison, analogous code in Perl 6.
my #strings = "SEQUENCES", "SEKUEAEES", "SEKUEAEES";
([Z] #strings>>.comb)
.map({ .join })
.Bag
.map({ "{.key} { .value > 1 ?? 'match' !! 'mismatch' } {.value}\n" })
.join
.say;
Isn't that just pretty?
Solution that works for any amount of strings.
use List::Util qw(max);
use Perl6::Junction qw(all);
my #strings = qw(SEQUENCE SEKUEAEE);
my (%matches, %mismatches);
for my $i (0 .. -1 + max map { length } #strings) {
my #c = map { substr $_, $i, 1 } #strings;
if ($c[0] eq all #c) {
$matches{join '', #c}++;
} else {
$mismatches{join '', #c}++;
}
}
for my $k (keys %matches) {
printf "'%s' match %d\n", $k, $matches{$k};
}
for my $k (keys %mismatches) {
printf "'%s' mismatch %d\n", $k, $mismatches{$k};
}
__END__
'SS' match 1
'UU' match 1
'EE' match 3
'QK' mismatch 1
'NA' mismatch 1
'CE' mismatch 1
Useing the non-core but very handy List::MoreUtils module.
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/say/;
use List::MoreUtils qw/each_array/;
sub count_matches {
die "strings must be equal length!" unless length $_[0] == length $_[1];
my #letters1 = split //, $_[0];
my #letters2 = split //, $_[1];
my (%matches, %mismatches);
my $iter = each_array #letters1, #letters2;
while (my ($c1, $c2) = $iter->()) {
if ($c1 eq $c2) {
$matches{"$c1$c2"} += 1;
} else {
$mismatches{"$c1$c2"} += 1;
}
}
say "'$_' match $matches{$_}" for sort keys %matches;
say "'$_' mismatch $mismatches{$_}" for sort keys %mismatches;
}
count_matches qw/SEQUENCE SEKUEAEE/;
I have a bunch of decimal numbers (as strings) which I receive from an API. I need to 'unscale' them, i.e. divide them by some power of 10. This seems a simple task for integers, but I have decimals with no guaranteed range. So, basically I need a function that works like this:
move_point "12.34" 1; # "1.234"
move_point "12.34" 5; # "0.0001234"
I'd rather not use floats to avoid any rounding errors.
This is a bit verbose, but should do the trick:
sub move_point {
my ($n, $places) = #_;
die 'negative number of places' if $places < 0;
return $n if $places == 0;
my ($i, $f) = split /\./, $n; # split to integer/fractional parts
$places += length($f);
$n = sprintf "%0*s", $places+1, $i.$f; # left pad with enough zeroes
substr($n, -$places, 0, '.'); # insert the decimal point
return $n;
}
Demo:
my $n = "12.34";
for my $p (0..5) {
printf "%d %s\n", $p, move_point($n, $p);
}
0 12.34
1 1.234
2 0.1234
3 0.01234
4 0.001234
5 0.0001234
Unless your data has contains values with significantly more digits than you have shown then a floating-point value has more than enough accuracy for your purpose. Perl can reliably reproduce up to 16-digit values
use strict;
use warnings 'all';
use feature 'say';
say move_point("12.34", 1); # "1.234"
say move_point("12.34", 5); # "0.0001234"
say move_point("1234", 12);
say move_point("123400", -9);
sub move_point {
my ($v, $n) = #_;
my $dp = $v =~ /\.([^.]*)\z/ ? length $1 : 0;
$dp += $n;
$v /= 10**$n;
sprintf '%.*f', $dp < 0 ? 0 : $dp, $v;
}
output
1.234
0.0001234
0.000000001234
123400000000000
Update
If the limits of standard floating-point numbers are actually insuffcient for you then the core Math::BigFloat will do what you need
This program shows a number with sixteen digits of accuracy, multiplied by everything from 10E-20 to 10E20
use strict;
use warnings 'all';
use feature 'say';
use Math::BigFloat;
for ( -20 .. 20 ) {
say move_point('1234567890.1234567890', $_);
}
sub move_point {
my ($v, $n) = #_;
$v = Math::BigFloat->new($v);
# Build 10**$n
my $mul = Math::BigFloat->new(10)->bpow($n);
# Count new decimal places
my $dp = $v =~ /\.([^.]*)\z/ ? length $1 : 0;
$dp += $n;
$v->bdiv($mul);
$v->bfround(-$dp) if $dp >= 0;
$v->bstr;
}
output
123456789012345678900000000000
12345678901234567890000000000
1234567890123456789000000000
123456789012345678900000000
12345678901234567890000000
1234567890123456789000000
123456789012345678900000
12345678901234567890000
1234567890123456789000
123456789012345678900
12345678901234567890
1234567890123456789
123456789012345678.9
12345678901234567.89
1234567890123456.789
123456789012345.6789
12345678901234.56789
1234567890123.456789
123456789012.3456789
12345678901.23456789
1234567890.123456789
123456789.0123456789
12345678.90123456789
1234567.890123456789
123456.7890123456789
12345.67890123456789
1234.567890123456789
123.4567890123456789
12.34567890123456789
1.234567890123456789
0.1234567890123456789
0.01234567890123456789
0.001234567890123456789
0.0001234567890123456789
0.00001234567890123456789
0.000001234567890123456789
0.0000001234567890123456789
0.00000001234567890123456789
0.000000001234567890123456789
0.0000000001234567890123456789
0.00000000001234567890123456789
I have file with following entries:
1,2
2,3
4,5
1,3
1,4
5,6
...
This tells the ids: first column matches with second column. Now I want to find all id groups that are having all combinations only. i.e. the following needs to be output:
1,2,3
4,5
1,4
5,6
I tried to write a perl script for the solution:
while(<STDIN>) {
if(m/^(\d+),(\d+)/) {
$dub{$1}{$2} = 1;
$dub{$2}{$1} = 1;
$hs{$1} = 1;
$hs{$2} = 1;
}
}
$i=0;
foreach $a (keys %dub) {
$grp[$i]{$a} = 1;
foreach $b (keys %{$dub{$a}}) {
$grp[$i]{$b} = 1;
foreach $c (keys %hs) {
if($c == $a || $c == $b) { next; }
$flag = 1;
foreach $d (keys %{$grp[$i]}) {
if(!$dub{$d}{$c}) {
$flag = 0;
last;
}
}
$grp[$i]{$c} = 1 if($flag);
}
$i++;
}
}
for($i=0; $i<=$#grp; $i++) {
print join(",", (keys %{$grp[$i]}))."\n";
}
But this takes hell lot of time for execution.
Is there a better solution, algorithm or performance tune for above script?
Any solution in LAMP is appreciated.
Thanks
EDIT:
Think of this way:
(1,2) is defined as "1 and 2 are similar"
(2,3) is defined as "2 and 3 are similar"
(1,4) is defined as "1 and 4 are similar"
(1,3) is defined as "1 and 3 are similar"
From these similarities I conclude that group (1,2,3) are similar to each other but not group (1,2,3,4).
In order to form group (1,2,3,4) there should be other entries in data as (2,4) and (3,4).
Finally I wanted to find all groups in given set of co-ordinates.
From my understanding, {1,2,3} are in the same group because all point to each other ({1,2}, {2,3},{1,3} exist). So, we can reduce this problem to finding cliques in an undirected graph, which is NP-Complete problem. So, every solution will be quite inefficient on big data.
This works for me:
use Data::Dump;
my #results;
my ($last_a, $last_b) = (0,0);
while(<DATA>) {
chomp;
my ($a, $b) = split /,/;
if( $last_b == $a ) {
my $last_item = $results[$#results];
push #$last_item, $b;
}
else {
push #results, [$a, $b];
}
($last_a, $last_b) = ($a, $b);
}
dd #results; # ([1, 2, 3], [4, 5], [1, 3], [1, 4], [5, 6])
__DATA__
1,2
2,3
4,5
1,3
1,4
5,6
You haven't really described the algorithm that we're supposed to used. I can't really understand why your input generates "1,2,3" and "1,4" rather than just "1,2,3,4".
But is this what you want?
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my %data;
while (<DATA>) {
chomp;
my ($k, $v) = split /,/;
push #{ $data{$k} }, $v;
}
foreach (sort keys %data) {
say "$_,", join ',', #{ $data{$_ } };
}
__DATA__
1,2
2,3
4,5
1,3
1,4
5,6
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to plot a gene graph for a DNA sequence say ATGCCGCTGCGC?
Im trying to write a Perl script that compares two DNA sequences (60 characters in length each lets say) in alignment, and then show the ratio of matches to non-matches of the sequences to each other. But i'm not having much luck. if it helps i can upload my code, but its no use. here's an example of what im trying to achieve below.
e.g
A T C G T A C
| | | | | | |
T A C G A A C
So the matches of the above example would be 4. and non-matches are: 3. Giving it a ratio of 4.3.
Any help would be much appreciated. thanks.
in general, please do post your code. It does help. In any case, something like this should do what you are asking:
#!/usr/bin/perl -w
use strict;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my #dna1=split(//,$d1);
my #dna2=split(//,$d2);
my $matches=0;
for (my $i=0; $i<=$#dna1; $i++) {
$matches++ if $dna1[$i] eq $dna2[$i];
}
my $mis=scalar(#dna1)-$matches;
print "Matches/Mismatches: $matches/$mis\n";
Bear in mind though that the ratio of 4 to 3 is most certainly not 4.3 but ~1.3. If you post some information on your input file format I will update my answer to include lines for parsing the sequence from your file.
Normally I'd say "What have you tried" and "upload your code first" because it doesn't seem to be a very difficult problem. But let's give this a shot:
create two arrays, one to hold each sequence:
#sequenceOne = ("A", "T", "C", "G", "T", "A", "C");
#sequenceTwo = ("T", "A", "C", "G", "A", "A", "C");
$myMatch = 0;
$myMissMatch = 0;
for ($i = 0; $i < #sequenceOne; $i++) {
my $output = "Comparing " . $sequenceOne[$i] . " <=> " . $sequenceTwo[$i];
if ($sequenceOne[$i] eq $sequenceTwo[$i]) {
$output .= " MATCH\n";
$myMatch++;
} else {
$myMissMatch++;
$output .= "\n";
}
print $output;
}
print "You have " . $myMatch . " matches.\n";
print "You have " . $myMissMatch . " mismatches\n";
print "The ratio of hits to misses is " . $myMatch . ":" . $myMissMatch . ".\n";
Of course, you'd probably want to read the sequence from something else on the fly instead of hard-coding the array. But you get the idea. With the above code your output will be:
torgis-MacBook-Pro:platform-tools torgis$ ./dna.pl
Comparing A <=> T
Comparing T <=> A
Comparing C <=> C MATCH
Comparing G <=> G MATCH
Comparing T <=> A
Comparing A <=> A MATCH
Comparing C <=> C MATCH
You have 4 matches.
You have 3 mismatches
The ratio of hits to misses is 4:3.
So many ways to do this. Here's one.
use strict;
use warnings;
my $seq1 = "ATCGTAC";
my $seq2 = "TACGAAC";
my $len = length $seq1;
my $matches = 0;
for my $i (0..$len-1) {
$matches++ if substr($seq1, $i, 1) eq substr($seq2, $i, 1);
}
printf "Length: %d Matches: %d Ratio: %5.3f\n", $len, $matches, $matches/$len;
exit 0;
Just grab the length of one of the strings (we're assuming string lengths are equal, right?), and then iterate using substr.
my #strings = ( 'ATCGTAC', 'TACGAAC' );
my $matched;
foreach my $ix ( 0 .. length( $strings[0] ) - 1 ) {
$matched++
if substr( $strings[0], $ix, 1 ) eq substr( $strings[1], $ix, 1 );
}
print "Matches: $matched\n";
print "Mismatches: ", length( $strings[0] ) - $matched, "\n";
I think substr is the way to go, rather than splitting the strings into arrays.
This is probably most convenient if presented as a subroutine:
use strict;
use warnings;
print ratio(qw/ ATCGTAC TACGAAC /);
sub ratio {
my ($aa, $bb) = #_;
my $total = length $aa;
my $matches = 0;
for (0 .. $total-1) {
$matches++ if substr($aa, $_, 1) eq substr($bb, $_, 1);
}
$matches / ($total - $matches);
}
output
1.33333333333333
Bill Ruppert's right that there are many way to do this. Here's another:
use Modern::Perl;
say compDNAseq( 'ATCGTAC', 'TACGAAC' );
sub compDNAseq {
my $total = my $i = 0;
$total += substr( $_[1], $i++, 1 ) eq $1 while $_[0] =~ /(.)/g;
sprintf '%.2f', $total / ( $i - $total );
}
Output:
1.33
Here is an approach which gives a NULL, \0, for each match in an xor comparison.
#!/usr/bin/perl
use strict;
use warnings;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my $len = length $d1; # assumes $d1 and $d2 are the same length
my $matches = () = ($d1 ^ $d2) =~ /\0/g;
printf "ratio of %f", $matches / ($len - $matches);
Output: ratio of 1.333333
This question is closely related to another stackoverflow question. Looking for a very efficient solution to the question asked there. Have suffix arrays been implemented in perl ?
Here is my current solution in perl.
chomp(my $ipstr = <>);
my #bigstrchars = split(//, $ipstr);
my $length = (length $ipstr);
my $sum = 0;
my $span = 1;
my $flag = 0;
while ($span < $length) {
for ($j=0; $j+$span<$length; $j++) {
if ($bigstrchars[$j] eq $bigstrchars[$j+$span]) {
$sum++;
}
else {
last;
}
}
if ($span == 1 && $sum == ($length-1)) {
$sum = $length * ($length+1) * 0.5;
$flag = 1;
last;
}
$span++;
}
unless ($flag) {
$sum += $length;
}
How can this be improved ?
EDIT
Stating the problem here :
For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.
The problem is to give an algorithm to calculate the sum of similarities of a string S with each of it's suffixes. For example, let the string be : ababaa. Then, the suffixes of the string are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string ababaa are 6,0,3,0,1,1, respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11
What about Array::Suffix?
Flesk's solution is pretty elegant. You asked for efficiency, then you asked for improvement. When it comes to perl, I find whatever takes the less time to understand when coming back to it after 3 months to be the best improvement. So take into consideration something a little more descriptive:
use Data::Dumper;
use strict;
main();
sub main {
my $string = "ababaa"; # input string
my $parts; # hash ref
my #suffixes = split '',$string; # break input into tokens
my $running_sum = 0;
$"='';
# Build suffix tree
for (0..$#suffixes){
$parts->{"#suffixes"}=0;
shift #suffixes;
}
# Compare suffixes to initial string
for my $suffix (sort keys %$parts){
$parts->{$suffix} = getMatches($suffix,$string);
$running_sum += $parts->{$suffix};
}
# Output
$Data::Dumper::Sortkeys++;
print Dumper($parts), "\nTotal Matches: $running_sum";
}
sub getMatches{
my ($word,$string) = #_;
my $part = '';
my $offset = 0;
my $matches = 0;
for (0..(length($word) - 1)){
$offset++;
$part = substr($word,0,$offset);
if ($string =~ /^$part/){ $matches++; }
}
return $matches;
}
There are obvious inefficiencies that could be improved (loops, regex comparison, subroutine call), but the point of this answer is an alternative to something that I've already identified as better for the only benefit of better future comprehension.
If I understand the algorithm correctly and you want to calculate the sum of longest common prefixes, your implementation is incorrect, since you're lacking the ascending lexicographic sort.
Here's one way to solve your problem:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $ipstr = <>);
my #subipstrs = map [split//], sort map{substr $ipstr, $_} 0 .. length($ipstr) - 1;
my $sum = 0;
for my $i (1 .. $#subipstrs) {
my #last = #{$subipstrs[$i-1]};
my #this = #{$subipstrs[$i]};
my $j = 0;
$sum++ while $j < #last && $j < #this && $last[$j] eq $this[$j++];
}
For the example string ababaa in the question you refer to this will produce the suffix array
5 | a
4 | aa
2 | abaa
0 | ababaa
3 | baa
1 | babaa
represented by #subipstrs
#subipstrs = (
['a'],
['a', 'a'],
['a', 'b', 'a', 'a'],
['a', 'b', 'a', 'b', 'a', 'a'],
['b', 'a', 'a'],
['b', 'a', 'b', 'a', 'a']
);
This makes calculating lcps a matter of comparing neighboring array refs element by element while pairs match, and adding up the total number of matches. The result is
5 | a | 0
4 | aa | 1
2 | abaa | 1
0 | ababaa | 3
3 | baa | 0
1 | babaa | 2
With a total of 7, not 11.
EDIT: This solves the problem you're interested in:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $ipstr = <>);
my $len = my $sum = length($ipstr);
for my $i (1 .. $len -1) {
my $substr = substr $ipstr, $i;
chop $substr while $substr ne substr $ipstr, 0, length($substr);
$sum += length($substr);
}
And is a bit faster than your solution with your example string and 1M iterations:
trinity 80906/s -- -32%
flesk 119332/s 47% --
EDIT2: This one's faster because it works from the start of the strings and are able to discard negative matches faster:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $ipstr = <>);
my $len = my $sum = length($ipstr);
for my $i (1 .. $len - 1) {
my $ipstrcopy = reverse $ipstr;
my $substr = reverse substr $ipstr, $i;
my ($slen, $j) = (length($substr), 0);
$sum++ while $j++ <= $slen && chop $ipstrcopy eq chop $substr;
}
ababaa and 100K iterations:
trinity 81967/s -- -24%
flesk 107527/s 31% --
abcdefghijklmnopqrstuvwxyz and 100K iterations:
trinity 26178/s -- -15%
flesk 30769/s 18% --
aaaaaaaaaaabbbaaaaaaaaaaaaaaaabbbaaaaaaaaa and 100K iterations:
trinity 5435/s -- -30%
flesk 7800/s 44% --
The algorithm can probably be improved further by reversing $ipstr before the loop or just using substrs instead of chop.