Need better solution for combination of numbers - linux

I have file with following entries:
1,2
2,3
4,5
1,3
1,4
5,6
...
This tells the ids: first column matches with second column. Now I want to find all id groups that are having all combinations only. i.e. the following needs to be output:
1,2,3
4,5
1,4
5,6
I tried to write a perl script for the solution:
while(<STDIN>) {
if(m/^(\d+),(\d+)/) {
$dub{$1}{$2} = 1;
$dub{$2}{$1} = 1;
$hs{$1} = 1;
$hs{$2} = 1;
}
}
$i=0;
foreach $a (keys %dub) {
$grp[$i]{$a} = 1;
foreach $b (keys %{$dub{$a}}) {
$grp[$i]{$b} = 1;
foreach $c (keys %hs) {
if($c == $a || $c == $b) { next; }
$flag = 1;
foreach $d (keys %{$grp[$i]}) {
if(!$dub{$d}{$c}) {
$flag = 0;
last;
}
}
$grp[$i]{$c} = 1 if($flag);
}
$i++;
}
}
for($i=0; $i<=$#grp; $i++) {
print join(",", (keys %{$grp[$i]}))."\n";
}
But this takes hell lot of time for execution.
Is there a better solution, algorithm or performance tune for above script?
Any solution in LAMP is appreciated.
Thanks
EDIT:
Think of this way:
(1,2) is defined as "1 and 2 are similar"
(2,3) is defined as "2 and 3 are similar"
(1,4) is defined as "1 and 4 are similar"
(1,3) is defined as "1 and 3 are similar"
From these similarities I conclude that group (1,2,3) are similar to each other but not group (1,2,3,4).
In order to form group (1,2,3,4) there should be other entries in data as (2,4) and (3,4).
Finally I wanted to find all groups in given set of co-ordinates.

From my understanding, {1,2,3} are in the same group because all point to each other ({1,2}, {2,3},{1,3} exist). So, we can reduce this problem to finding cliques in an undirected graph, which is NP-Complete problem. So, every solution will be quite inefficient on big data.

This works for me:
use Data::Dump;
my #results;
my ($last_a, $last_b) = (0,0);
while(<DATA>) {
chomp;
my ($a, $b) = split /,/;
if( $last_b == $a ) {
my $last_item = $results[$#results];
push #$last_item, $b;
}
else {
push #results, [$a, $b];
}
($last_a, $last_b) = ($a, $b);
}
dd #results; # ([1, 2, 3], [4, 5], [1, 3], [1, 4], [5, 6])
__DATA__
1,2
2,3
4,5
1,3
1,4
5,6

You haven't really described the algorithm that we're supposed to used. I can't really understand why your input generates "1,2,3" and "1,4" rather than just "1,2,3,4".
But is this what you want?
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my %data;
while (<DATA>) {
chomp;
my ($k, $v) = split /,/;
push #{ $data{$k} }, $v;
}
foreach (sort keys %data) {
say "$_,", join ',', #{ $data{$_ } };
}
__DATA__
1,2
2,3
4,5
1,3
1,4
5,6

Related

Perl - parse string to integer from arguments and compare it

There are two arguments: $a and $b, and both are strings. They are will be compared, if a is greater/less/equal to b.
Other people can achieve 42 char, I don't know how to achieve it.
AUTOLOAD {
$_[0] + 0 > $_[1] + 0 ? "greater" :
$_[1] == $_[0] ? "equal" : "less"
}
This will do as you ask, but you give almost no information on your program so it is very difficult to suggest anything
use strict;
use warnings 'all';
for ( [ 1, 2 ], [2, 2], [3, 2] ) {
my ($aa, $bb) = #$_;
printf "%d is %s %d\n", $aa, compare($aa, $bb), $bb;
}
sub compare {
my ($aa, $bb) = #_;
('less than', 'equal to', 'greater than')[($aa <=> $bb) + 1];
}
output
1 is less than 2
2 is equal to 2
3 is greater than 2

Splitting a numerical string in Perl

I have a numerical string:
"13245988"
I want to split before and after consecutive numbers.
Expected output is:
1
32
45
988
Here is what I've tried:
#!/usr/bin/perl
use strict;
use warnings;
my $a="132459";
my #b=split("",$a);
my $k=0;
my #c=();
for(my $i=0; $i<=#b; $i++) {
my $j=$b[$i]+1;
if($b[$i] == $j) {
$c[$k].=$b[$i];
} else {
$k++;
$c[$k]=$b[$i];
$k++;
}
}
foreach my $z (#c) {
print "$z\n";
}
Editing based on clarified question. Something like this should work:
#!/usr/bin/perl
use strict;
use warnings;
my $a = "13245988";
my #b = split("",$a);
my #c = ();
push #c, shift #b; # Put first number into result.
for my $num (#b) { # Loop through remaining numbers.
my $last = $c[$#c] % 10; # Get the last digit of the last entry.
if(( $num <= $last+1) && ($num >= $last-1)) {
# This number is within 1 of the last one
$c[$#c] .= $num; # Append this one to it
} else {
push #c, $num; # Non-consecutive, add a new entry;
}
}
foreach my $z (#c) {
print "$z\n";
}
Output:
1
32
45
988

Determining the ratio of matches to non-matches of 2 primary strands? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to plot a gene graph for a DNA sequence say ATGCCGCTGCGC?
Im trying to write a Perl script that compares two DNA sequences (60 characters in length each lets say) in alignment, and then show the ratio of matches to non-matches of the sequences to each other. But i'm not having much luck. if it helps i can upload my code, but its no use. here's an example of what im trying to achieve below.
e.g
A T C G T A C
| | | | | | |
T A C G A A C
So the matches of the above example would be 4. and non-matches are: 3. Giving it a ratio of 4.3.
Any help would be much appreciated. thanks.
in general, please do post your code. It does help. In any case, something like this should do what you are asking:
#!/usr/bin/perl -w
use strict;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my #dna1=split(//,$d1);
my #dna2=split(//,$d2);
my $matches=0;
for (my $i=0; $i<=$#dna1; $i++) {
$matches++ if $dna1[$i] eq $dna2[$i];
}
my $mis=scalar(#dna1)-$matches;
print "Matches/Mismatches: $matches/$mis\n";
Bear in mind though that the ratio of 4 to 3 is most certainly not 4.3 but ~1.3. If you post some information on your input file format I will update my answer to include lines for parsing the sequence from your file.
Normally I'd say "What have you tried" and "upload your code first" because it doesn't seem to be a very difficult problem. But let's give this a shot:
create two arrays, one to hold each sequence:
#sequenceOne = ("A", "T", "C", "G", "T", "A", "C");
#sequenceTwo = ("T", "A", "C", "G", "A", "A", "C");
$myMatch = 0;
$myMissMatch = 0;
for ($i = 0; $i < #sequenceOne; $i++) {
my $output = "Comparing " . $sequenceOne[$i] . " <=> " . $sequenceTwo[$i];
if ($sequenceOne[$i] eq $sequenceTwo[$i]) {
$output .= " MATCH\n";
$myMatch++;
} else {
$myMissMatch++;
$output .= "\n";
}
print $output;
}
print "You have " . $myMatch . " matches.\n";
print "You have " . $myMissMatch . " mismatches\n";
print "The ratio of hits to misses is " . $myMatch . ":" . $myMissMatch . ".\n";
Of course, you'd probably want to read the sequence from something else on the fly instead of hard-coding the array. But you get the idea. With the above code your output will be:
torgis-MacBook-Pro:platform-tools torgis$ ./dna.pl
Comparing A <=> T
Comparing T <=> A
Comparing C <=> C MATCH
Comparing G <=> G MATCH
Comparing T <=> A
Comparing A <=> A MATCH
Comparing C <=> C MATCH
You have 4 matches.
You have 3 mismatches
The ratio of hits to misses is 4:3.
So many ways to do this. Here's one.
use strict;
use warnings;
my $seq1 = "ATCGTAC";
my $seq2 = "TACGAAC";
my $len = length $seq1;
my $matches = 0;
for my $i (0..$len-1) {
$matches++ if substr($seq1, $i, 1) eq substr($seq2, $i, 1);
}
printf "Length: %d Matches: %d Ratio: %5.3f\n", $len, $matches, $matches/$len;
exit 0;
Just grab the length of one of the strings (we're assuming string lengths are equal, right?), and then iterate using substr.
my #strings = ( 'ATCGTAC', 'TACGAAC' );
my $matched;
foreach my $ix ( 0 .. length( $strings[0] ) - 1 ) {
$matched++
if substr( $strings[0], $ix, 1 ) eq substr( $strings[1], $ix, 1 );
}
print "Matches: $matched\n";
print "Mismatches: ", length( $strings[0] ) - $matched, "\n";
I think substr is the way to go, rather than splitting the strings into arrays.
This is probably most convenient if presented as a subroutine:
use strict;
use warnings;
print ratio(qw/ ATCGTAC TACGAAC /);
sub ratio {
my ($aa, $bb) = #_;
my $total = length $aa;
my $matches = 0;
for (0 .. $total-1) {
$matches++ if substr($aa, $_, 1) eq substr($bb, $_, 1);
}
$matches / ($total - $matches);
}
output
1.33333333333333
Bill Ruppert's right that there are many way to do this. Here's another:
use Modern::Perl;
say compDNAseq( 'ATCGTAC', 'TACGAAC' );
sub compDNAseq {
my $total = my $i = 0;
$total += substr( $_[1], $i++, 1 ) eq $1 while $_[0] =~ /(.)/g;
sprintf '%.2f', $total / ( $i - $total );
}
Output:
1.33
Here is an approach which gives a NULL, \0, for each match in an xor comparison.
#!/usr/bin/perl
use strict;
use warnings;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my $len = length $d1; # assumes $d1 and $d2 are the same length
my $matches = () = ($d1 ^ $d2) =~ /\0/g;
printf "ratio of %f", $matches / ($len - $matches);
Output: ratio of 1.333333

Suffix arrays in perl?

This question is closely related to another stackoverflow question. Looking for a very efficient solution to the question asked there. Have suffix arrays been implemented in perl ?
Here is my current solution in perl.
chomp(my $ipstr = <>);
my #bigstrchars = split(//, $ipstr);
my $length = (length $ipstr);
my $sum = 0;
my $span = 1;
my $flag = 0;
while ($span < $length) {
for ($j=0; $j+$span<$length; $j++) {
if ($bigstrchars[$j] eq $bigstrchars[$j+$span]) {
$sum++;
}
else {
last;
}
}
if ($span == 1 && $sum == ($length-1)) {
$sum = $length * ($length+1) * 0.5;
$flag = 1;
last;
}
$span++;
}
unless ($flag) {
$sum += $length;
}
How can this be improved ?
EDIT
Stating the problem here :
For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.
The problem is to give an algorithm to calculate the sum of similarities of a string S with each of it's suffixes. For example, let the string be : ababaa. Then, the suffixes of the string are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string ababaa are 6,0,3,0,1,1, respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11
What about Array::Suffix?
Flesk's solution is pretty elegant. You asked for efficiency, then you asked for improvement. When it comes to perl, I find whatever takes the less time to understand when coming back to it after 3 months to be the best improvement. So take into consideration something a little more descriptive:
use Data::Dumper;
use strict;
main();
sub main {
my $string = "ababaa"; # input string
my $parts; # hash ref
my #suffixes = split '',$string; # break input into tokens
my $running_sum = 0;
$"='';
# Build suffix tree
for (0..$#suffixes){
$parts->{"#suffixes"}=0;
shift #suffixes;
}
# Compare suffixes to initial string
for my $suffix (sort keys %$parts){
$parts->{$suffix} = getMatches($suffix,$string);
$running_sum += $parts->{$suffix};
}
# Output
$Data::Dumper::Sortkeys++;
print Dumper($parts), "\nTotal Matches: $running_sum";
}
sub getMatches{
my ($word,$string) = #_;
my $part = '';
my $offset = 0;
my $matches = 0;
for (0..(length($word) - 1)){
$offset++;
$part = substr($word,0,$offset);
if ($string =~ /^$part/){ $matches++; }
}
return $matches;
}
There are obvious inefficiencies that could be improved (loops, regex comparison, subroutine call), but the point of this answer is an alternative to something that I've already identified as better for the only benefit of better future comprehension.
If I understand the algorithm correctly and you want to calculate the sum of longest common prefixes, your implementation is incorrect, since you're lacking the ascending lexicographic sort.
Here's one way to solve your problem:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $ipstr = <>);
my #subipstrs = map [split//], sort map{substr $ipstr, $_} 0 .. length($ipstr) - 1;
my $sum = 0;
for my $i (1 .. $#subipstrs) {
my #last = #{$subipstrs[$i-1]};
my #this = #{$subipstrs[$i]};
my $j = 0;
$sum++ while $j < #last && $j < #this && $last[$j] eq $this[$j++];
}
For the example string ababaa in the question you refer to this will produce the suffix array
5 | a
4 | aa
2 | abaa
0 | ababaa
3 | baa
1 | babaa
represented by #subipstrs
#subipstrs = (
['a'],
['a', 'a'],
['a', 'b', 'a', 'a'],
['a', 'b', 'a', 'b', 'a', 'a'],
['b', 'a', 'a'],
['b', 'a', 'b', 'a', 'a']
);
This makes calculating lcps a matter of comparing neighboring array refs element by element while pairs match, and adding up the total number of matches. The result is
5 | a | 0
4 | aa | 1
2 | abaa | 1
0 | ababaa | 3
3 | baa | 0
1 | babaa | 2
With a total of 7, not 11.
EDIT: This solves the problem you're interested in:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $ipstr = <>);
my $len = my $sum = length($ipstr);
for my $i (1 .. $len -1) {
my $substr = substr $ipstr, $i;
chop $substr while $substr ne substr $ipstr, 0, length($substr);
$sum += length($substr);
}
And is a bit faster than your solution with your example string and 1M iterations:
trinity 80906/s -- -32%
flesk 119332/s 47% --
EDIT2: This one's faster because it works from the start of the strings and are able to discard negative matches faster:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $ipstr = <>);
my $len = my $sum = length($ipstr);
for my $i (1 .. $len - 1) {
my $ipstrcopy = reverse $ipstr;
my $substr = reverse substr $ipstr, $i;
my ($slen, $j) = (length($substr), 0);
$sum++ while $j++ <= $slen && chop $ipstrcopy eq chop $substr;
}
ababaa and 100K iterations:
trinity 81967/s -- -24%
flesk 107527/s 31% --
abcdefghijklmnopqrstuvwxyz and 100K iterations:
trinity 26178/s -- -15%
flesk 30769/s 18% --
aaaaaaaaaaabbbaaaaaaaaaaaaaaaabbbaaaaaaaaa and 100K iterations:
trinity 5435/s -- -30%
flesk 7800/s 44% --
The algorithm can probably be improved further by reversing $ipstr before the loop or just using substrs instead of chop.

How to turn an array of words into an array containing the characters of the words in order?

I have an array of unknown number of words, with an unknown max length.
I need to convert it to another array, basically turning it into a column word array.
so with this original array:
#original_array = ("first", "base", "Thelongest12", "justAWORD4");
The resluting array would be:
#result_array = ("fbTj","iahu","rses","selt","t oA"," nW"," gO"," eR"," sD"," t4"," 1 "," 2 ");
Actually I will have this:
fbTj
iahu
rses
selt
t oA
nW
gO
eR
sD
t4
1
2
I need to do it in order to make a table, and these words are the table's headers.
I hope I have made myself clear, and appreciate the help you are willing to give.
I tried it with the split function, but I keep messing it up...
EDIT:
Hi all, thanks for all the tips and suggestions! I learned quite much from all of you hence the upvote. However I found tchrist's answer more convenient, maybe because I come from a c,c# background... :)
use strict;
use warnings;
use 5.010;
use Algorithm::Loops 'MapCarU';
my #original_array = ("first", "base", "Thelongest12", "justAWORD4");
my #result_array = MapCarU { join '', map $_//' ', #_ } map [split //], #original_array;
I have an old program that does this. Maybe you can adapt it:
$ cat /tmp/a
first
base
Thelongest12
justAWORD4
$ rot90 /tmp/a
fbTj
iahu
rses
selt
t oA
nW
gO
eR
sD
t4
1
2
Here’s the source:
$ cat ~/scripts/rot90
#!/usr/bin/perl
# rot90 - tchrist#perl.com
$/ = '';
# uncomment for easier to read, but not reversible:
### #ARGV = map { "fmt -20 $_ |" } #ARGV;
while ( <> ) {
chomp;
#lines = split /\n/;
$MAXCOLS = -1;
for (#lines) { $MAXCOLS = length if $MAXCOLS < length; }
#vlines = ( " " x #lines ) x $MAXCOLS;
for ( $row = 0; $row < #lines; $row++ ) {
for ( $col = 0; $col < $MAXCOLS; $col++ ) {
$char = ( length($lines[$row]) > $col )
? substr($lines[$row], $col, 1)
: ' ';
substr($vlines[$col], $row, 1) = $char;
}
}
for (#vlines) {
# uncomment for easier to read, but again not reversible
### s/(.)/$1 /g;
print $_, "\n";
}
print "\n";
}
use strict;
use warnings;
use List::Util qw(max);
my #original_array = ("first", "base", "Thelongest12", "justAWORD4");
my #new_array = transpose(#original_array);
sub transpose {
my #a = map { [ split // ] } #_;
my $max = max(map $#$_, #a);
my #out;
for my $n (0 .. $max) {
$out[$n] .= defined $a[$_][$n] ? $a[$_][$n] : ' ' for 0 .. $#a;
}
return #out;
}
It can be done easily by this simple one-liner:
perl -le'#l=map{chomp;$m=length if$m<length;$_}<>;for$i(0..$m-1){print map substr($_,$i,1)||" ",#l}'

Resources