How to move the decimal point N places to the left efficiently? - string

I have a bunch of decimal numbers (as strings) which I receive from an API. I need to 'unscale' them, i.e. divide them by some power of 10. This seems a simple task for integers, but I have decimals with no guaranteed range. So, basically I need a function that works like this:
move_point "12.34" 1; # "1.234"
move_point "12.34" 5; # "0.0001234"
I'd rather not use floats to avoid any rounding errors.

This is a bit verbose, but should do the trick:
sub move_point {
my ($n, $places) = #_;
die 'negative number of places' if $places < 0;
return $n if $places == 0;
my ($i, $f) = split /\./, $n; # split to integer/fractional parts
$places += length($f);
$n = sprintf "%0*s", $places+1, $i.$f; # left pad with enough zeroes
substr($n, -$places, 0, '.'); # insert the decimal point
return $n;
}
Demo:
my $n = "12.34";
for my $p (0..5) {
printf "%d %s\n", $p, move_point($n, $p);
}
0 12.34
1 1.234
2 0.1234
3 0.01234
4 0.001234
5 0.0001234

Unless your data has contains values with significantly more digits than you have shown then a floating-point value has more than enough accuracy for your purpose. Perl can reliably reproduce up to 16-digit values
use strict;
use warnings 'all';
use feature 'say';
say move_point("12.34", 1); # "1.234"
say move_point("12.34", 5); # "0.0001234"
say move_point("1234", 12);
say move_point("123400", -9);
sub move_point {
my ($v, $n) = #_;
my $dp = $v =~ /\.([^.]*)\z/ ? length $1 : 0;
$dp += $n;
$v /= 10**$n;
sprintf '%.*f', $dp < 0 ? 0 : $dp, $v;
}
output
1.234
0.0001234
0.000000001234
123400000000000
Update
If the limits of standard floating-point numbers are actually insuffcient for you then the core Math::BigFloat will do what you need
This program shows a number with sixteen digits of accuracy, multiplied by everything from 10E-20 to 10E20
use strict;
use warnings 'all';
use feature 'say';
use Math::BigFloat;
for ( -20 .. 20 ) {
say move_point('1234567890.1234567890', $_);
}
sub move_point {
my ($v, $n) = #_;
$v = Math::BigFloat->new($v);
# Build 10**$n
my $mul = Math::BigFloat->new(10)->bpow($n);
# Count new decimal places
my $dp = $v =~ /\.([^.]*)\z/ ? length $1 : 0;
$dp += $n;
$v->bdiv($mul);
$v->bfround(-$dp) if $dp >= 0;
$v->bstr;
}
output
123456789012345678900000000000
12345678901234567890000000000
1234567890123456789000000000
123456789012345678900000000
12345678901234567890000000
1234567890123456789000000
123456789012345678900000
12345678901234567890000
1234567890123456789000
123456789012345678900
12345678901234567890
1234567890123456789
123456789012345678.9
12345678901234567.89
1234567890123456.789
123456789012345.6789
12345678901234.56789
1234567890123.456789
123456789012.3456789
12345678901.23456789
1234567890.123456789
123456789.0123456789
12345678.90123456789
1234567.890123456789
123456.7890123456789
12345.67890123456789
1234.567890123456789
123.4567890123456789
12.34567890123456789
1.234567890123456789
0.1234567890123456789
0.01234567890123456789
0.001234567890123456789
0.0001234567890123456789
0.00001234567890123456789
0.000001234567890123456789
0.0000001234567890123456789
0.00000001234567890123456789
0.000000001234567890123456789
0.0000000001234567890123456789
0.00000000001234567890123456789

Related

Find the index of the Nth occurrence of a character in a string

I've found index and rindex for finding the first or last occurrence of a character (or substring) in a string. I'm also aware that they have an offset that can be used to start at a certain index.
What I want to know is if there is a simple way to find the index of the Nth occurrence of a character or substring in a string. I'd prefer not to have to do it with a regex and would rather not to have to write a loop that just repeatedly calls index with an offset.
EDIT: I didn't state the restriction well. The reason I said "no loop" is I am looking for a built-in way to do it, which exists in many languages.
One possible implementation:
use strict;
use warnings;
use feature qw(say);
my $string = 'the quick brown fox jumped over the lazy dog';
my $substring = 'o';
my $n = 4;
sub nth_index {
my ($string, $substring, $n) = #_;
my ($times, $index) = (0, 0);
while ( $times < $n && $index != -1 ) {
$index = index(
$string,
$substring,
$times == 0
? 0
: $index + length($substring),
);
$times++;
}
return $index;
}
say nth_index($string, $substring, $n); # 42
Here are two examples of how I would solve the problem
Subroutine nth_index1 uses index, while nth_index2 uses a regex. Both involve loops, as any solution must
I believe the regex solution is much more readable, while the index solution is probably a little faster. But they are both so quick that they are highly unlikely to cause a bottleneck, and readability is always paramount
use strict;
use warnings 'all';
my $s = 'the quick brown fox jumps over the lazy dog';
my $ss = 'o';
for my $n ( 1 .. 4 ) {
printf "%d %d\n",
nth_index1($s, $ss, $n),
nth_index2($s, $ss, $n);
}
sub nth_index1 {
my ($s, $ss, $n) = #_;
my $i;
my $len = length $ss;
while ( $n-- ) {
$i = index($s, $ss, $i ? $i + $len : 0 );
return if $i < 0;
}
$i;
}
sub nth_index2 {
my ($s, $ss, $n) = #_;
while ( $s =~ /$ss/g ) {
return $-[0] unless --$n;
}
return;
}
output
12 12
17 17
26 26
41 41
As stated, there is no built-in for this. Here are a few ways, using split, index, and regex.
use warnings;
use strict;
use feature qw(say);
my $str = "Xab_ab_ab_ab_"; # 'Xab_ab'; # test failed (3) matches
my $N = 3;
foreach my $patt qw(a ab c) {
say "Find index of occurrence $N of |$patt| in: |$str|";
say "index: ", ( ind_Nth_match_1($str, $patt, $N) // "no $N matches" ); #/
say "split: ", ( ind_Nth_match_2($str, $patt, $N) // "no $N matches" ); #/
say "regex: ", ( ind_Nth_match_3($str, $patt, $N) // "no $N matches" ); #/
}
sub ind_Nth_match_1 {
my ($str, $patt, $N) = #_;
my ($pos, $cnt) = (0, 0);
while ($pos = index($str, $patt, $pos) + 1) { # != 0
return $pos-1 if ++$cnt == $N;
}
return;
}
sub ind_Nth_match_2 {
my ($str, $patt, $N) = #_;
my #toks = split /($patt)/, $str;
return if #toks < 2*$N;
return length( join '', #toks[0..2*$N-1] ) - length($patt);
}
sub ind_Nth_match_3 {
my ($str, $patt, $N) = #_;
my $cnt = 0;
while ($str =~ m/$patt/g) {
return $-[0] if ++$cnt == $N;
}
}
This prints
Find index of occurrence 3 of |a| in: |Xab_ab_ab_ab_|
index: 7
split: 7
regex: 7
Find index of occurrence 3 of |ab| in: |Xab_ab_ab_ab_|
index: 7
split: 7
regex: 7
Find index of occurrence 3 of |c| in: |Xab_ab_ab_ab_|
index: no 3 matches
split: no 3 matches
regex: no 3 matches
Notes
In split each delimiter is also returned in the output list, with capturing /($patt)/, for simpler length estimation. Thus we count 2*$N (and then take -1).
In regex the #- array is used, #LAST_MATCH_START, for the position of the last successful match. Here the /g in scalar context in while makes it jump from a match to the next in repeated executions, and $-[0] gives the starting position of the last (previous) such match.
The subs return undef if there are not required $N matches, including no matches at all.
Thanks to Borodin for comments on return from subs and on using #- instead of #+.
(This answer does not answer your question, but is here to help you down the road towards accepting a regex solution.)
The way you're asking for a numeric index of the position, it sounds like you're thinking of extracting data from the string once you have that number, in a way that C programmers might do it.
For example, say you had the string
my $str = "My daddy left home when I was three and he didn't leave much for ma and me";
and you wanted to extract all the data up to the first instance of the word "and". Here's the way you could do it, which is sort of a C way to do it using Perl.
my $pos = find_index_of_first_occurrence( $str, 'and' );
# Note that find_index_of_first_occurrence() is a hypothetical function.
print substr( $str, 0, $pos );
# Prints "My daddy left home when I was three "
The way you'd do that in Perl with regexes is much simpler.
$str =~ /^(.*?)and/;
print $1;
With regexes, you're combining the searching for the string and the extraction of the data in one operation. (Note that both code snippets ignore the case of not finding "and" at all, for the sake of simplicity)
I understand that you don't know regexes very well yet, and that regexes can be daunting at first, but you'll need to understand them as part of learning Perl if you're going to succeed with the language.

Splitting a numerical string in Perl

I have a numerical string:
"13245988"
I want to split before and after consecutive numbers.
Expected output is:
1
32
45
988
Here is what I've tried:
#!/usr/bin/perl
use strict;
use warnings;
my $a="132459";
my #b=split("",$a);
my $k=0;
my #c=();
for(my $i=0; $i<=#b; $i++) {
my $j=$b[$i]+1;
if($b[$i] == $j) {
$c[$k].=$b[$i];
} else {
$k++;
$c[$k]=$b[$i];
$k++;
}
}
foreach my $z (#c) {
print "$z\n";
}
Editing based on clarified question. Something like this should work:
#!/usr/bin/perl
use strict;
use warnings;
my $a = "13245988";
my #b = split("",$a);
my #c = ();
push #c, shift #b; # Put first number into result.
for my $num (#b) { # Loop through remaining numbers.
my $last = $c[$#c] % 10; # Get the last digit of the last entry.
if(( $num <= $last+1) && ($num >= $last-1)) {
# This number is within 1 of the last one
$c[$#c] .= $num; # Append this one to it
} else {
push #c, $num; # Non-consecutive, add a new entry;
}
}
foreach my $z (#c) {
print "$z\n";
}
Output:
1
32
45
988

Determining the ratio of matches to non-matches of 2 primary strands? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to plot a gene graph for a DNA sequence say ATGCCGCTGCGC?
Im trying to write a Perl script that compares two DNA sequences (60 characters in length each lets say) in alignment, and then show the ratio of matches to non-matches of the sequences to each other. But i'm not having much luck. if it helps i can upload my code, but its no use. here's an example of what im trying to achieve below.
e.g
A T C G T A C
| | | | | | |
T A C G A A C
So the matches of the above example would be 4. and non-matches are: 3. Giving it a ratio of 4.3.
Any help would be much appreciated. thanks.
in general, please do post your code. It does help. In any case, something like this should do what you are asking:
#!/usr/bin/perl -w
use strict;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my #dna1=split(//,$d1);
my #dna2=split(//,$d2);
my $matches=0;
for (my $i=0; $i<=$#dna1; $i++) {
$matches++ if $dna1[$i] eq $dna2[$i];
}
my $mis=scalar(#dna1)-$matches;
print "Matches/Mismatches: $matches/$mis\n";
Bear in mind though that the ratio of 4 to 3 is most certainly not 4.3 but ~1.3. If you post some information on your input file format I will update my answer to include lines for parsing the sequence from your file.
Normally I'd say "What have you tried" and "upload your code first" because it doesn't seem to be a very difficult problem. But let's give this a shot:
create two arrays, one to hold each sequence:
#sequenceOne = ("A", "T", "C", "G", "T", "A", "C");
#sequenceTwo = ("T", "A", "C", "G", "A", "A", "C");
$myMatch = 0;
$myMissMatch = 0;
for ($i = 0; $i < #sequenceOne; $i++) {
my $output = "Comparing " . $sequenceOne[$i] . " <=> " . $sequenceTwo[$i];
if ($sequenceOne[$i] eq $sequenceTwo[$i]) {
$output .= " MATCH\n";
$myMatch++;
} else {
$myMissMatch++;
$output .= "\n";
}
print $output;
}
print "You have " . $myMatch . " matches.\n";
print "You have " . $myMissMatch . " mismatches\n";
print "The ratio of hits to misses is " . $myMatch . ":" . $myMissMatch . ".\n";
Of course, you'd probably want to read the sequence from something else on the fly instead of hard-coding the array. But you get the idea. With the above code your output will be:
torgis-MacBook-Pro:platform-tools torgis$ ./dna.pl
Comparing A <=> T
Comparing T <=> A
Comparing C <=> C MATCH
Comparing G <=> G MATCH
Comparing T <=> A
Comparing A <=> A MATCH
Comparing C <=> C MATCH
You have 4 matches.
You have 3 mismatches
The ratio of hits to misses is 4:3.
So many ways to do this. Here's one.
use strict;
use warnings;
my $seq1 = "ATCGTAC";
my $seq2 = "TACGAAC";
my $len = length $seq1;
my $matches = 0;
for my $i (0..$len-1) {
$matches++ if substr($seq1, $i, 1) eq substr($seq2, $i, 1);
}
printf "Length: %d Matches: %d Ratio: %5.3f\n", $len, $matches, $matches/$len;
exit 0;
Just grab the length of one of the strings (we're assuming string lengths are equal, right?), and then iterate using substr.
my #strings = ( 'ATCGTAC', 'TACGAAC' );
my $matched;
foreach my $ix ( 0 .. length( $strings[0] ) - 1 ) {
$matched++
if substr( $strings[0], $ix, 1 ) eq substr( $strings[1], $ix, 1 );
}
print "Matches: $matched\n";
print "Mismatches: ", length( $strings[0] ) - $matched, "\n";
I think substr is the way to go, rather than splitting the strings into arrays.
This is probably most convenient if presented as a subroutine:
use strict;
use warnings;
print ratio(qw/ ATCGTAC TACGAAC /);
sub ratio {
my ($aa, $bb) = #_;
my $total = length $aa;
my $matches = 0;
for (0 .. $total-1) {
$matches++ if substr($aa, $_, 1) eq substr($bb, $_, 1);
}
$matches / ($total - $matches);
}
output
1.33333333333333
Bill Ruppert's right that there are many way to do this. Here's another:
use Modern::Perl;
say compDNAseq( 'ATCGTAC', 'TACGAAC' );
sub compDNAseq {
my $total = my $i = 0;
$total += substr( $_[1], $i++, 1 ) eq $1 while $_[0] =~ /(.)/g;
sprintf '%.2f', $total / ( $i - $total );
}
Output:
1.33
Here is an approach which gives a NULL, \0, for each match in an xor comparison.
#!/usr/bin/perl
use strict;
use warnings;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my $len = length $d1; # assumes $d1 and $d2 are the same length
my $matches = () = ($d1 ^ $d2) =~ /\0/g;
printf "ratio of %f", $matches / ($len - $matches);
Output: ratio of 1.333333

Why is my word frequency counter example written in Perl failing to produce useful output?

I am very new to Perl, and I am trying to write a word frequency counter as a learning exercise.
However, I am not able to figure out the error in my code below, after working on it. This is my code:
$wa = "A word frequency counter.";
#wordArray = split("",$wa);
$num = length($wa);
$word = "";
$flag = 1; # 0 if previous character was an alphabet and 1 if it was a blank.
%wordCount = ("null" => 0);
if ($num == -1) {
print "There are no words.\n";
} else {
print "$length";
for $i (0 .. $num) {
if(($wordArray[$i]!=' ') && ($flag==1)) { # start of a new word.
print "here";
$word = $wordArray[$i];
$flag = 0;
} elsif ($wordArray[$i]!=' ' && $flag==0) { # continuation of a word.
$word = $word . $wordArray[$i];
} elsif ($wordArray[$i]==' '&& $flag==0) { # end of a word.
$word = $word . $wordArray[$i];
$flag = 1;
$wordCount{$word}++;
print "\nword: $word";
} elsif ($wordArray[$i]==" " && $flag==1) { # series of blanks.
# do nothing.
}
}
for $i (keys %wordCount) {
print " \nword: $i - count: $wordCount{$i} ";
}
}
It's neither printing "here", nor the words. I am not worried about optimization at this point, though any input in that direction would also be much appreciated.
This is a good example of a problem where Perl will help you work out what's wrong if you just ask it for help. Get used to always adding the lines:
use strict;
use warnings;
to the top of your Perl programs.
Fist off,
$wordArray[$i]!=' '
should be
$wordArray[$i] ne ' '
according to the Perl documentation for comparing strings and characters. Basically use numeric operators (==, >=, …) for numbers, and string operators for text (eq, ne, lt, …).
Also, you could do
#wordArray = split(" ",$wa);
instead of
#wordArray = split("",$wa);
and then #wordArray wouldn't need to do the wonky character checking and you never would have had the problem. #wordArray will be split into the words already and you'll just have to count the occurrences.
You seem to be writing C in Perl. The difference is not just one of style. By exploding a string into a an array of individual characters, you cause the memory footprint of your script to explode as well.
Also, you need to think about what constitutes a word. Below, I am not suggesting that any \w+ is a word, rather pointing out the difference between \S+ and \w+.
#!/usr/bin/env perl
use strict; use warnings;
use YAML;
my $src = '$wa = "A word frequency counter.";';
print Dump count_words(\$src, 'w');
print Dump count_words(\$src, 'S');
sub count_words {
my $src = shift;
my $class = sprintf '\%s+', shift;
my %counts;
while ($$src =~ /(?<sequence> $class)/gx) {
$counts{ $+{sequence} } += 1;
}
return \%counts;
}
Output:
---
A: 1
counter: 1
frequency: 1
wa: 1
word: 1
---
'"A': 1
$wa: 1
=: 1
counter.";: 1
frequency: 1
word: 1

Convert Memory Size (Human readable) into Actual Number (bytes) in Perl

Is there an actual package in CPAN to convert such string:
my $string = "54.4M"
my $string2 = "3.2G"
into the actual number in bytes:
54,400,000
3,200,000,000
And vice versa.
In principle what I want to do at the end is to sum out all the memory size.
To get the exact output you asked for, use Number::FormatEng and Number::Format:
use strict;
use warnings;
use Number::FormatEng qw(:all);
use Number::Format qw(:subs);
my $string = "54.4M" ;
my $string2 = "3.2G" ;
print format_number(unformat_pref($string)) , "\n";
print format_number(unformat_pref($string2)) , "\n";
__END__
54,400,000
3,200,000,000
By the way, only unformat_pref is needed if you are going to perform calculations with the result.
Since Number::FormatEng was intended for engineering notation conversion (not for bytes), its prefix is case-sensitive. If you want to use it for kilobytes, you must use lower case k.
Number::Format will convert these strings into actual bytes (kinda, almost).
use Number::Format qw(:subs);
my $string = "54.4M" ;
my $string2 = "3.2G" ;
print round(unformat_number($string) , 0), "\n";
print round(unformat_number($string2), 0), "\n";
__END__
57042534
3435973837
The reason I said "kinda, almost" is that Number::Format treats 1K as being equal to 1024 bytes, not 1000 bytes. That's probably why it gives a weird-looking result (with fractional bytes), unless it is rounded.
For your first problem, I did not find a CPAN package, but this code snippet might do:
sub convert_human_size {
my $size = shift;
my #suffixes = ('', qw(k m g));
for my $index (0..$#suffixes) {
my $suffix = $suffixes[$index];
if ( $size =~ /^([\d.]+)$suffix\z/i ) {
return int($1 * (1024 ** $index));
}
}
# No match
die "Didn't understand human-readable file size '$size'"; # or croak
}
Wrap the number through Number::Format's format_number function if you'd like pretty semi-colons (e.g. "5,124" instead of "5124")
CPAN solves the second part of your problem:
Number::Bytes::Human
For example:
use Number::Bytes::Human qw(format_bytes);
$size = format_bytes(54_400_000);
You may provide an optional bs => 1000 parameter to change the base of the conversion to 1000 instead of 1024.
This should get you started. You could add other factors, like kilobytes ("K") on your own, as well as formatting of output (comma separators, for example):
#!/usr/bin/perl -w
use strict;
use POSIX qw(floor);
my $string = "54.4M";
if ( $string =~ m/(\d+)?.(\d+)([M|G])/ ) {
my $mantissa = "$1.$2";
if ( $3 eq "M" ) {
$mantissa *= (2 ** 20);
}
elsif ( $3 eq "G" ) {
$mantissa *= (2 ** 30);
}
print "$string = ".floor($mantissa)." bytes\n";
}
Output:
54.4M = 57042534 bytes
Basically, to go from strings to numbers, all you need is a hash mapping units to multipliers:
#!/usr/bin/perl
use strict; use warnings;
my $base = 1000;
my %units = (
K => $base,
M => $base ** 2,
G => $base ** 3,
# etc
);
my #strings = qw( 54.4M 3.2G 1K 0.1M .);
my $pattern = join('|', sort keys %units);
my $total;
for my $string ( #strings ) {
while ( $string =~ /(([0-9]*(?:\.[0-9]+)?)($pattern))/g ) {
my $number = $2 * $units{$3};
$total += $number;
printf "%12s = %12.0f\n", $1, $number;;
}
}
printf "Total %.0f bytes\n", $total;
Output:
54.4M = 54400000
3.2G = 3200000000
1K = 1000
0.1M = 100000
Total 3254501000 bytes

Resources