What is the fastest way to increment a string in perl? - string

I would like to append a string in perl within a loop in a fast way, without having to copy the string for each iteration. I'm looking for something like StringBuilder from Java or C#.
I currently know the following alternatives in mind, in order to do 'a += b'.
a .= b # concat
a = join('', a, b); # join
push #a, b # array push
I am not interested in copying all string to the other. I need to copy one character per time, or append small strings foreach iteration. I am trying to solve the following problem: compress the input string 'aaabbccc' to '3a2b3c'. So the idea is to iterate over the input string, check how many repeated characters we have, and then append to the output in the compressed way. What is the most efficient to perform this in perl ?
Here is a link to the problem I was trying to solve. I's slightly different though.

For comparsion, I tried to test different versions for solving your actual problem of compressing the string. Here is my test script test.pl:
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Inline C => './compress_c.c';
my $str_len = 10000;
my #chars = qw(a b c d);
my $str;
$str .= [#chars]->[rand 4] for 1 .. $str_len;
cmpthese(
-1,
{
compress_array => sub { compress_array( $str ) },
compress_regex => sub { compress_regex( $str ) },
compress_str => sub { compress_str( $str ) },
compress_c => sub { compress_c( $str ) },
}
);
# Suggested by #melpomene in the comments
sub compress_regex {
return $_[0] =~ s/([a-z])\1+/($+[0] - $-[0]) . $1/egr;
}
sub compress_array {
my $result = '';
my #chrs = split //, $_[0];
my $prev = $chrs[0];
my $count = 1;
my #result;
for my $i ( 1..$#chrs ) {
my $char = $chrs[$i];
if ( $prev eq $char ) {
$count++;
next if $i < $#chrs;
}
if ( $count > 1) {
push #result, $count, $prev;
}
else {
push #result, $prev;
}
if ( ( $i == $#chrs ) and ( $prev ne $char ) ) {
push #result, $char;
last;
}
$count = 1;
$prev = $char;
}
return join '', #result;
}
sub compress_str {
my $result = '';
my $prev = substr $_[0], 0, 1;
my $count = 1;
my $lastind = (length $_[0]) - 1;
for my $i (1 .. $lastind) {
my $char = substr $_[0], $i, 1;
if ( $prev eq $char ) {
$count++;
next if $i < $lastind;
}
if ( $count > 1) {
$result .= $count;
}
$result .= $prev;
if ( ( $i == $lastind ) and ( $prev ne $char ) ) {
$result .= $char;
last;
}
$count = 1;
$prev = $char;
}
return $result;
}
where compress_c.c is:
SV *compress_c(SV* str_sv) {
STRLEN len;
char* str = SvPVbyte(str_sv, len);
SV* result = newSV(len);
char *buf = SvPVX(result);
char prev = str[0];
int count = 1;
int j = 0;
int i;
for (i = 1; i < len; i++ )
{
char cur = str[i];
if ( prev == cur ) {
count++;
if ( i < (len - 1) )
continue;
}
if ( count > 1) {
buf[j++] = count + '0'; // assume count is less than 10
}
buf[j++] = prev;
if ( (i == (len - 1)) && (prev != cur) ) buf[j++] = cur;
count = 1;
prev = cur;
}
buf[j] = '\0';
SvPOK_on(result);
SvCUR_set(result, j);
return result;
}
The result of running perl test.pl:
Rate compress_array compress_str compress_regex compress_c
compress_array 311/s -- -42% -45% -99%
compress_str 533/s 71% -- -6% -98%
compress_regex 570/s 83% 7% -- -98%
compress_c 30632/s 9746% 5644% 5273% --
Which shows that regex version is slightly faster than the string version. However, the C version is the fastest, and it is about 50 times as fast as the regex version.
Note: I tested this on my Ubuntu 16.10 laptop (Intel Core i7-7500U CPU # 2.70GHz)

I've performed the following benchmark in several ways to perform that:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
my $dna;
$dna .= [qw(G A T C)]->[rand 4] for 1 .. 10000;
sub frequency_concat {
my $result = '';
for my $idx (0 .. length($dna) - 1) {
$result .= substr($dna, $idx, 1);
}
return $result;
}
sub frequency_join {
my $result = '';
for my $idx (0 .. length($dna) - 1) {
$result = join '', $result, substr($dna,$idx,1);
}
return $result;
}
sub frequency_list_push {
my #result = ();
for my $idx (0 .. length($dna) - 1) {
push #result, substr($dna,$idx,1);
}
return join '', #result;
}
sub frequency_list_prealloc {
my #result = (' ' x length($dna));
for my $idx (0 .. length($dna) - 1) {
$result[$idx] = substr($dna,$idx,1);
}
return join '', #result;
}
cmpthese(-1, # Run each for at least 1 second(s) {
concat => \&frequency_concat,
join => \&frequency_join,
list_push => \&frequency_list_push,
list_list_prealloc => \&frequency_list_prealloc
}
);
The results below have shown that the concat (a . b) is the fastest operation. I don't understand why, since this will need to make several copies of the string.
Rate join list_push list_list_prealloc concat
join 213/s -- -38% -41% -74%
list_push 342/s 60% -- -5% -58%
list_list_prealloc 359/s 68% 5% -- -56%
concat 822/s 285% 140% 129% --

Related

Split string to fixed length chunks and write in separate line in Raku

I have a file test.txt:
Stringsplittingskills
I want to read this file and write to another file out.txt with three characters in each line like
Str
ing
spl
itt
ing
ski
lls
What I did
my $string = "test.txt".IO.slurp;
my $start = 0;
my $elements = $string.chars;
# open file in writing mode
my $file_handle = "out.txt".IO.open: :w;
while $start < $elements {
my $line = $string.substr($start,3);
if $line.chars == 3 {
$file_handle.print("$line\n")
} elsif $line.chars < 3 {
$file_handle.print("$line")
}
$start = $start + 3;
}
# close file handle
$file_handle.close
This runs fine when the length of string is not multiple of 3. When the string length is multiple of 3, it inserts extra newline at the end of output file. How can I avoid inserting new line at the end when the string length is multiple of 3?
I tried another shorter approach,
my $string = "test.txt".IO.slurp;
my $file_handle = "out.txt".IO.open: :w;
for $string.comb(3) -> $line {
$file_handle.print("$line\n")
}
Still it suffers from same issue.
I looked for here, here but still unable to solve it.
spurt "out.txt", "test.txt".IO.comb(3).join("\n")
Another approach using substr-rw.
subset PositiveInt of Int where * > 0;
sub break( Str $str is copy, PositiveInt $length )
{
my $i = $length;
while $i < $str.chars
{
$str.substr-rw( $i, 0 ) = "\n";
$i += $length + 1;
}
$str;
}
say break("12345678", 3);
Output
123
456
78
The correct answer is of course to use .comb and .join.
That said, this is how you might fix your code.
You could change the if line to check if it is at the end, and use else.
if $start+3 < $elements {
$file_handle.print("$line\n")
} else {
$file_handle.print($line)
}
Personally I would change it so that only the addition of \n is conditional.
while $start < $elements {
my $line = $string.substr($start,3);
$file_handle.print( $line ~ ( "\n" x ($start+3 < $elements) ));
$start += 3;
}
This works because < returns either True or False.
Since True == 1 and False == 0, the x operator repeats the \n at most once.
'abc' x 1; # 'abc'
'abc' x True; # 'abc'
'abc' x 0; # ''
'abc' x False; # ''
If you were very cautious you could use x+?.
(Which is actually 3 separate operators.)
'abc' x 3; # 'abcabcabc'
'abc' x+? 3; # 'abc'
infix:« x »( 'abc', prefix:« + »( prefix:« ? »( 3 ) ) );
I would probably use loop if I were going to structure it like this.
loop ( my $start = 0; $start < $elements ; $start += 3 ) {
my $line = $string.substr($start,3);
$file_handle.print( $line ~ ( "\n" x ($start+3 < $elements) ));
}
Or instead of adding a newline to the end of each line, you could add it to the beginning of every line except the first.
while $start < $elements {
my $line = $string.substr($start,3);
my $nl = "\n";
# clear $nl the first time through
once $nl = "";
$file_handle.print($nl ~ $line);
$start = $start + 3;
}
At the command line prompt, three one-liner solutions below.
Using comb and batch (retains incomplete set of 3 letters at end):
~$ echo 'StringsplittingskillsX' | perl6 -ne '.join.put for .comb.batch(3);'
Str
ing
spl
itt
ing
ski
lls
X
Simplifying (no batch, only comb):
~$ echo 'StringsplittingskillsX' | perl6 -ne '.put for .comb(3);'
Str
ing
spl
itt
ing
ski
lls
X
Alternatively, using comb and rotor (discards incomplete set of 3 letters at end):
~$ echo 'StringsplittingskillsX' | perl6 -ne '.join.put for .comb.rotor(3);'
Str
ing
spl
itt
ing
ski
lls

Compare two strings and find mismatch and match and count them both in perl

Compare two strings and find mismatch and mismatch and count them both
string1 = "SEQUENCE"
string2 = "SEKUEAEE"
I want output like. With the mismatch and match count.
'SS' match 1
'EE' match 3
'UU' match 1
'QK' mismatch 1
'NA' mismatch 1
'CE' mismatch 1
Here's a solution in old Perl. Also works with however many strings you want
use warnings;
use strict;
use List::AllUtils qw( mesh part count_by pairs );
my #strings = ("SEQUENCES", "SEKUEAEES", "SEKUEAEES");
my $i = 0;
print join "",
map { $_->[0] . " " . ($_->[1] > 1 ? 'match' : 'mismatch') . " " . $_->[1] ."\n" }
pairs
count_by { $_ }
map { join "", #$_ }
part { int($i++/scalar #strings) }
&mesh( #{[ map { [ split // ] } #strings ]} )
;
And here for comparison, analogous code in Perl 6.
my #strings = "SEQUENCES", "SEKUEAEES", "SEKUEAEES";
([Z] #strings>>.comb)
.map({ .join })
.Bag
.map({ "{.key} { .value > 1 ?? 'match' !! 'mismatch' } {.value}\n" })
.join
.say;
Isn't that just pretty?
Solution that works for any amount of strings.
use List::Util qw(max);
use Perl6::Junction qw(all);
my #strings = qw(SEQUENCE SEKUEAEE);
my (%matches, %mismatches);
for my $i (0 .. -1 + max map { length } #strings) {
my #c = map { substr $_, $i, 1 } #strings;
if ($c[0] eq all #c) {
$matches{join '', #c}++;
} else {
$mismatches{join '', #c}++;
}
}
for my $k (keys %matches) {
printf "'%s' match %d\n", $k, $matches{$k};
}
for my $k (keys %mismatches) {
printf "'%s' mismatch %d\n", $k, $mismatches{$k};
}
__END__
'SS' match 1
'UU' match 1
'EE' match 3
'QK' mismatch 1
'NA' mismatch 1
'CE' mismatch 1
Useing the non-core but very handy List::MoreUtils module.
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/say/;
use List::MoreUtils qw/each_array/;
sub count_matches {
die "strings must be equal length!" unless length $_[0] == length $_[1];
my #letters1 = split //, $_[0];
my #letters2 = split //, $_[1];
my (%matches, %mismatches);
my $iter = each_array #letters1, #letters2;
while (my ($c1, $c2) = $iter->()) {
if ($c1 eq $c2) {
$matches{"$c1$c2"} += 1;
} else {
$mismatches{"$c1$c2"} += 1;
}
}
say "'$_' match $matches{$_}" for sort keys %matches;
say "'$_' mismatch $mismatches{$_}" for sort keys %mismatches;
}
count_matches qw/SEQUENCE SEKUEAEE/;

why do I receive the errror

in my code, I have a function which verifies if a cell from an worksheet contains valid values, meaning:
1. it contains something.
2. the contained value is not SPACE, TAB or ENTER.
when i am just checking the function within a print (to print the value returned by the function for a single cell), apparently, everything works fine.
when I am integrating the "print of the result" into a while loop, for checking a range of cells, it received the error message: Can't call method "value" on an undefined value at D:\test.pl
here is my code:
use strict;
use Spreadsheet::ParseXLSX;
my $parser = Spreadsheet::ParseXLSX->new();
sub check_cell($) {
my ( $worksheet, $a, $b ) = #_;
if ( $worksheet->get_cell( $a, $b )
or not ord( $worksheet->get_cell( $a, $b )->value() ) ~~ [ 32, 9, 10 ] )
{
return 1;
} else {
return 0;
}
}
my $workbook = $parser->parse('D:\test.pl data.xlsx');
if ( !defined $workbook ) {
die $parser->error(), ".\n";
}
my $worksheet = $workbook->worksheet(0);
my $i = 8;
while ( &check_cell( $worksheet, $i, 0 ) ) {
print &check_cell( $worksheet, $i, 0 ), "\n";
$i++;
}
if I remove the while and index increment, everything works fine.
Can anyone tell me why the error occurs in 1st case?
Thank you.
When you compare two strings for equallity, you have to use eq, not ==
while( check_cell($worksheet, $i, 0) eq "correct cell" ) {
#...
};
Also, it's more natural to return 0 or 1 in your check_cell sub, so you haven't to test the result in while loop:
sub check_cell {
# return 1 if it's ok, else 0
}
while( check_cell($worksheet, $i, 0) ) {
#...
};
And the smartmatch operator ~~ has been marked as experimental since Perl 5.18.0
When $worksheet->get_cell($a, $b) doesn't return a defined value, you can't call ->value on the undefined value it returned. Don't you want to use and instead of or in the condition?
in this case, it worked modifying the condition:
instead of this:
if ( $worksheet->get_cell( $a, $b )
or not ord( $worksheet->get_cell( $a, $b )->value() ) ~~ [ 32, 9, 10 ] )
{
return 1;
}
I used:
if ( $worksheet -> get_cell($a, $b) ) {
if ( not ord($worksheet -> get_cell($a, $b) -> value()) ~~ [32, 9, 10] ){
return 1;
}
}
and the scrit exits without any error.
Thank you guys for your help.

substring comparison in perl

Im working on comparing 2 substrings sub1 and sub2 from 2 initial strings seq1 and seq2 allowing only one mismatch $k is the length of subsequence
foreach (my $i = 0; $i < length($seq1) - $k; $i += 1) {
my $sub1 = substr($seq1, $i, $k);
foreach (my $j = 0; $j < length($seq2) - $k; $i++) {
my $sub2 = substr($seq2, $j, $k);
my $diff = $sub1 ^ $sub2;
my $num_mismatch = $diff =~ tr/\0//c;
if ($num_mismatch == 1) {
$d{$sub1}++;
}
}
}
foreach (keys %d) {
print "$_\n";
}
*When I run the code It gets stuck until i kill the process and it doesnt give any result. Any Help about this ? *
foreach (my $j=0;$j<length($seq2)-$k;$i++)
should be
foreach (my $j=0;$j<length($seq2)-$k;$j++)
# ^^

How to turn an array of words into an array containing the characters of the words in order?

I have an array of unknown number of words, with an unknown max length.
I need to convert it to another array, basically turning it into a column word array.
so with this original array:
#original_array = ("first", "base", "Thelongest12", "justAWORD4");
The resluting array would be:
#result_array = ("fbTj","iahu","rses","selt","t oA"," nW"," gO"," eR"," sD"," t4"," 1 "," 2 ");
Actually I will have this:
fbTj
iahu
rses
selt
t oA
nW
gO
eR
sD
t4
1
2
I need to do it in order to make a table, and these words are the table's headers.
I hope I have made myself clear, and appreciate the help you are willing to give.
I tried it with the split function, but I keep messing it up...
EDIT:
Hi all, thanks for all the tips and suggestions! I learned quite much from all of you hence the upvote. However I found tchrist's answer more convenient, maybe because I come from a c,c# background... :)
use strict;
use warnings;
use 5.010;
use Algorithm::Loops 'MapCarU';
my #original_array = ("first", "base", "Thelongest12", "justAWORD4");
my #result_array = MapCarU { join '', map $_//' ', #_ } map [split //], #original_array;
I have an old program that does this. Maybe you can adapt it:
$ cat /tmp/a
first
base
Thelongest12
justAWORD4
$ rot90 /tmp/a
fbTj
iahu
rses
selt
t oA
nW
gO
eR
sD
t4
1
2
Here’s the source:
$ cat ~/scripts/rot90
#!/usr/bin/perl
# rot90 - tchrist#perl.com
$/ = '';
# uncomment for easier to read, but not reversible:
### #ARGV = map { "fmt -20 $_ |" } #ARGV;
while ( <> ) {
chomp;
#lines = split /\n/;
$MAXCOLS = -1;
for (#lines) { $MAXCOLS = length if $MAXCOLS < length; }
#vlines = ( " " x #lines ) x $MAXCOLS;
for ( $row = 0; $row < #lines; $row++ ) {
for ( $col = 0; $col < $MAXCOLS; $col++ ) {
$char = ( length($lines[$row]) > $col )
? substr($lines[$row], $col, 1)
: ' ';
substr($vlines[$col], $row, 1) = $char;
}
}
for (#vlines) {
# uncomment for easier to read, but again not reversible
### s/(.)/$1 /g;
print $_, "\n";
}
print "\n";
}
use strict;
use warnings;
use List::Util qw(max);
my #original_array = ("first", "base", "Thelongest12", "justAWORD4");
my #new_array = transpose(#original_array);
sub transpose {
my #a = map { [ split // ] } #_;
my $max = max(map $#$_, #a);
my #out;
for my $n (0 .. $max) {
$out[$n] .= defined $a[$_][$n] ? $a[$_][$n] : ' ' for 0 .. $#a;
}
return #out;
}
It can be done easily by this simple one-liner:
perl -le'#l=map{chomp;$m=length if$m<length;$_}<>;for$i(0..$m-1){print map substr($_,$i,1)||" ",#l}'

Resources