String partitioning (converting complex string to an array) in perl - string

There is a large string s, that contains item codes which are comma delimited.
e.g.:
$s="90320,328923,SKJS32767,DSIKUDIU,829EUE,AUSIUD0Q897,AJIUE98,387493420DA,93RE,AKDJ93,SADI983,90439,JADKJ84";
In my application these strings are passed to a function, which returns the price of these items, i.e. the output of the function is corresponding price for the item code input.
However, due to certain limitations, the maximum length of $s should not exceed 16. If the length of $s exceeds 16, then an exception is thrown. Thus, these strings should be partitioned into an array, such that, the length of each element of array is less than or equal to 16.
e.g: After partitioning $s, the array is:
$Arr[0]='90320,328923', # Note: the first 16 characters are 90320,328923,SKJ.
However, SKJ is dropped because it is an incomplete (partial) item code.
$Arr[1]='SKJS32767',
$Arr[2]='DSIKUDIU,829EUE',
$Arr[3]='AUSIUD0Q897',
$Arr[4]='AJIUE98',
$Arr[5]='387493420DA,93RE'
For a given $s, the function should return an array, following the constraints noted above.
My approach has been to use the substr function, and extract a string up to a 16 offset, from an updated position index. Can it be done in a better way?

This is very simple using a global /g regular expression match.
This program demonstrates. The regex pattern looks for as many characters as possible up to a maximum of sixteen that must be followed by a comma or the end of the string.
However, my first thought was the same as RobEarl's comment - why not just put one field from the string into each element of the array? Is there really a need to pack more than one into an element just because it is possible?
use strict;
use warnings;
use 5.010;
my $s = '90320,328923,SKJS32767,DSIKUDIU,829EUE,AUSIUD0Q897,AJIUE98,387493420DA,93RE,AKDJ93,SADI983,90439,JADKJ84';
my @partitions;
# {1,16} rather than {0,16}: avoids an extra empty match at the very end of the string
while ( $s =~ /\G(.{1,16})(?:,|\z)/g ) {
    push @partitions, $1;
}
say for @partitions;
output
90320,328923
SKJS32767
DSIKUDIU,829EUE
AUSIUD0Q897
AJIUE98
387493420DA,93RE
AKDJ93,SADI983
90439,JADKJ84

You need to look at the length of the current string plus the current article number to determine if it is too long.
Split the long string into single articles. For each article, append it to the last element of the new list of strings if the result stays within 16 characters; otherwise push the article number onto the list as a fresh string.
use Data::Dump qw(dd);

my $s="90320,328923,SKJS32767,DSIKUDIU,829EUE,AUSIUD0Q897,AJIUE98,387493420DA,93RE,AKDJ93,SADI983,90439,JADKJ84";
my @items = split /,/, $s;
my @strings = ( shift @items );
# Test with defined() so that an item code of "0" does not end the loop early
while ( defined( my $item = shift @items ) ) {
    if ( length($strings[-1]) + length($item) > 15) { # 15 because of the comma
        push @strings, $item;
    } else {
        $strings[-1] .= ',' . $item;
    }
}
dd \@strings;
__END__
[
"90320,328923",
"SKJS32767",
"DSIKUDIU,829EUE",
"AUSIUD0Q897",
"AJIUE98",
"387493420DA,93RE",
"AKDJ93,SADI983",
"90439,JADKJ84",
]
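For comparison, the same greedy packing can be sketched outside Perl. The following is only an illustration in Python; the function name `partition_codes` and the `limit` parameter are made up for this sketch, and it assumes that no single item code is longer than the limit (it raises an error otherwise).

```python
def partition_codes(s, limit=16):
    """Greedily pack comma-separated codes into chunks of at most `limit` characters."""
    chunks = []
    for item in s.split(","):
        if len(item) > limit:
            # Assumption of this sketch: an oversized item is an error, not something to split
            raise ValueError(f"item {item!r} is longer than {limit} characters")
        # The +1 accounts for the comma that joins the item to the current chunk
        if chunks and len(chunks[-1]) + 1 + len(item) <= limit:
            chunks[-1] += "," + item
        else:
            chunks.append(item)
    return chunks


s = ("90320,328923,SKJS32767,DSIKUDIU,829EUE,AUSIUD0Q897,AJIUE98,"
     "387493420DA,93RE,AKDJ93,SADI983,90439,JADKJ84")
parts = partition_codes(s)
for part in parts:
    print(part)
```

Joining the chunks back together with commas reproduces the original string, which is a cheap sanity check that nothing was dropped.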

Related

How can I pad a series of hyphen-separated numbers each to two digits?

I am fairly new to PowerShell programming, so need help with set of strings I have as described below:
"14-2-1-1"
"14-2-1-1-1"
"14-2-1-1-10"
I want to zero-pad each number between the hyphens if that number is between 1 and 9. So the result should look like:
"14-02-01-01"
"14-02-01-01-01"
"14-02-01-01-10"
I came up with the following code but was wondering if there is a better/faster solution.
$Filenum = "14-2-1-1"
$hicount = ($Filenum.ToCharArray() | Where-Object{$_ -eq '-'} | Measure-Object).Count
$FileNPad = ''
For ($i=0; $i -le $hicount; $i++) {
$Filesec= "{$i}" -f $Filenum.split('-')
If ([int]$Filesec -le 9)
{
$FileNPad = "$FileNPad-"+"0"+"$Filesec"
}
Else
{
$FileNPad="$FileNPad-$Filesec"
}
}
$FileNPad = $FileNPad.Trim("-"," ")
Instead of trying to manually keep track of how many elements there are and inspecting each value, you can simply split on -, call PadLeft() on each part, then join back together with -:
"14-2-1-1","14-2-1-1-1","14-2-1-1-10" | ForEach-Object {
($_ -split '-').PadLeft(2,'0') -join '-'
}
Which outputs
14-02-01-01
14-02-01-01-01
14-02-01-01-10
I'd be inclined to go with something like Doug Maurer's answer due to its clarity, but here's another way to look at this. The last section below shows a solution that might be just as clear with some advantages of its own.
Pattern of digit groups in the input
Your input strings are composed of one or more groups, where each group...
...contains one or more digits, and...
...is preceded by a - or the beginning of the string, and...
...is followed by a - or the end of the string.
Groups that require a leading "0" to be inserted contain exactly one digit; that is, they consist of...
...a - or the beginning of the string, followed by...
...a single digit, followed by...
...a - or the end of the string.
Replacing single-digit digit groups using the -replace operator
We can use regular expressions with the -replace operator to locate that pattern and replace the single digit with a "0" followed by that same digit...
'0-0-0-0', '00-00-00-00', '1-2-3-4', '01-02-03-04', '10-20-30-40', '11-22-33-44' |
ForEach-Object -Process { $_ -replace '(?<=-|^)(\d)(?=-|$)', '0$1' }
...which outputs...
00-00-00-00
00-00-00-00
01-02-03-04
01-02-03-04
10-20-30-40
11-22-33-44
As the documentation describes, the -replace operator is used like this...
<input> -replace <regular-expression>, <substitute>
The match pattern
The match pattern '(?<=-|^)(\d)(?=-|$)' means...
(?<=-|^): A zero-width positive lookbehind assertion for either - or the beginning of the string
In other words, match but don't capture either - or the beginning of the string
(\d): A single digit, captured and made available with the replacement substitution $1
(?=-|$): A zero-width positive lookahead assertion for either - or the end of the string
In other words, match but don't capture either - or the end of the string
The replacement pattern
The replacement pattern '0$1' means...
The literal text '0', followed by...
The value of the first capture ((\d))
Replacing single-digit digit groups using [Regex]::Replace() and a replacement [String]
Instead of the -replace operator you can also call the static Replace() method of the [Regex] class...
'0-0-0-0', '00-00-00-00', '1-2-3-4', '01-02-03-04', '10-20-30-40', '11-22-33-44' |
ForEach-Object -Process { [Regex]::Replace($_, '(?<=-|^)(\d)(?=-|$)', '0$1') }
...and the result is the same.
Replacing digit groups using [Regex]::Replace() and a [MatchEvaluator]
A hybrid of the regular expression and imperative solutions is to call an overload of the Replace() method that takes a [MatchEvaluator] instead of a replacement [String]...
# This [ScriptBlock] will be passed to a [System.Text.RegularExpressions.MatchEvaluator] parameter
$matchEvaluator = {
# The [System.Text.RegularExpressions.Match] parameter
param($match)
# The replacement [String]
return $match.Value.PadLeft(2, '0')
}
'0-0-0-0', '00-00-00-00', '1-2-3-4', '01-02-03-04', '10-20-30-40', '11-22-33-44' |
ForEach-Object -Process { [Regex]::Replace($_, '(\d+)', $matchEvaluator) }
This produces the same result as above.
A [MatchEvaluator] is a delegate that takes a Match to be replaced ($match) and returns the [String] with which to replace it (the matched text left-padded to two digits). Also note that, whereas above we are capturing only standalone digits (\d), here we capture all groups of one or more digits (\d+) and leave it to PadLeft() to figure out if a leading "0" is needed.
I think this is a much more compelling solution than regular expressions alone because it's the best of that and the imperative world:
It uses a simple regular expression pattern to locate digit groups in the input string
It uses a simple [ScriptBlock] to transform digit groups in the input string
By not splitting the input string apart it does not create as much intermediate string and array garbage
Whether this potential performance improvement is overshadowed by using regular expressions at all, I can't say without benchmarking
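The match-and-transform idea is not specific to .NET. As an illustration only, here is a rough equivalent in Python rather than PowerShell: `re.sub` accepts a callable in place of a replacement string, which plays the role of the [MatchEvaluator]. The function name `pad_groups` and its `width` parameter are invented for this sketch.

```python
import re


def pad_groups(text, width=2):
    """Left-pad every group of digits in `text` with zeros up to `width` characters."""
    # The lambda receives each match object and returns the replacement text;
    # str.zfill leaves groups that are already wide enough unchanged.
    return re.sub(r"\d+", lambda m: m.group(0).zfill(width), text)


for sample in ["14-2-1-1", "14-2-1-1-1", "14-2-1-1-10"]:
    print(pad_groups(sample))
```

As with the PadLeft() version, the pattern matches every digit group and leaves the decision about padding to the transform function.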

Splitting a String with Perl

I was following along with this tutorial on how to split strings when I came across a quote that confused me.
Words about Context
Put to its normal use, split is used in list context. It may also be
used in scalar context, though its use in scalar context is
deprecated. In scalar context, split returns the number of fields
found, and splits into the @_ array. It's easy to see why that might
not be desirable, and thus, why using split in scalar context is
frowned upon.
I have the following script that I've been working with:
#!/usr/bin/perl
use strict;
use warnings;
use v5.24;
doWork();
sub doWork {
my $str = "This,is,data";
my @splitData = split(/,/, $str);
say $splitData[1];
return;
}
I don't fully understand how you would use split on a list.
From my understanding, using the split function on my $str variable is frowned upon? How then would I go about splitting a string with the comma as the delimiter?
The frowned-upon behaviour documented by that passage was deprecated at least as far back as 5.8.8 (11 years ago) and was removed from Perl in 5.12 (7 years ago).
The passage documents that
my $n = split(...);
is equivalent to
my $n = do { @_ = split(...); @_ }; # <5.12
The assignment to @_ is unexpected. This type of behaviour is called "surprising action at a distance", and it can result in malfunctioning code. As such, before 5.12, using split in scalar context was frowned-upon. Since 5.12, however,
my $n = split(...);
is equivalent to
my $n = do { my @anon = split(...); @anon }; # ≥5.12
The surprising behaviour having been removed, it's no longer frowned-upon to use split in scalar context for the reason stated in the passage you quoted.
It should probably still be avoided, not just for backwards compatibility, but because there are far better ways of counting the number of substrings. I would use the following:
my $n = 1 + tr/,//; # Faster than: my $n = split(/,/, $_, -1);
You are using split in list context, so it does not exercise the frowned-upon behaviour, no matter what version of Perl you use. In other words, your usage is fine.
It's fine unless you are trying to handle CSV data, that is. In that case, you should be using Text::CSV_XS.
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
while (my $row = $csv->getline($fh)) { ... } # Parsing CSV
for (...) { $csv->say($fh, $row); } # Generating CSV
Calling split in scalar context isn't very useful. It effectively returns the number of separators plus one, and there are better ways of doing that.
For example,
my $str = "This,is,data";
my $splitData = split(/,/, $str);
say $splitData;
will print 3 as it counts the substrings after the split.
split in scalar context used to also return the split parts in @_, but that frowned-upon behaviour was removed because it's rather unexpected.
Using it as an array is perfect.
my $str = "This,is,data";
the above line is a single string.
my @splitData = split(/,/, $str);
You are now splitting $str into an array, or a list of values. So effectively you now have @splitData, which contains:
"This" "is" "data"
So you can either use them all, say @splitData, or use each of them as a scalar. A single element could be written @splitData[1], but that is a one-element slice, so it is always better to write it as $splitData[1].
The tutorial says it nicely. Use split on a string to create a list of substrings.
You can then loop over the list, which hands you each value in turn, instead of printing every element individually.
my $str = "This,is,data";
my @splitData = split(/,/, $str);
foreach my $value (@splitData) {
    say $value; # say already appends a newline
}
This makes $splitData[0], $splitData[1], etc. available in turn as the scalar $value.

Distance between matched substrings

I have a chromosome sequence and have to find subsequences in it and the distances between them.
For example:
string:
AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT
Substring:
ACGT
I have to find the distance between all occurrences of ACGT.
I normally do not recommend answering posts where it is obvious the OP just wants other people to do their work. However, there is already one answer the use of which will be problematic if input strings are largish, so here is something that uses Perl builtins.
The special variable @- stores the start offsets of the most recent successful match; $-[0] is the offset at which the whole match begins.
use strict;
use warnings;
use Data::Dumper;
my $string = 'AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT';
my @pos;
while ( $string =~ /ACGT/g ) {
    push @pos, $-[0];
}
my @dist;
for my $i (1 .. $#pos) {
    push @dist, $pos[$i] - $pos[$i - 1];
}
print Dumper(\@pos, \@dist);
This method uses less memory than splitting the original string (which may be a problem if the original string is large enough). Its memory footprint can be further reduced, but I focused on clarity by showing the accumulation of match positions and the calculation of deltas separately.
One open question is whether you want the index of the first match from the beginning of the string. Strictly speaking, "distances between matches" excludes that.
use strict;
use warnings;
use Data::Dumper;
my $string = 'AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT';
my @dist;
my $last;
while ($string =~ /ACGT/g) {
    no warnings 'uninitialized';
    push @dist, $-[0] - $last;
    $last = $-[0];
}
# Do we want the distance of the first
# match from the beginning of the string?
# If not, discard it.
shift @dist;
print Dumper \@dist;
Of course, it is possible to use index for this as well, but it looks considerably uglier.
You may split your input string by "ACGT" and remove the first and the last elements of the returned array to get all fragments between "ACGT". Then calculate the lengths of these fragments. Note that these lengths count only the characters between matches, so each one is 4 less than the corresponding start-to-start distance:
my $input = "AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT";
my @fragments = split("ACGT", $input, -1);
@fragments = @fragments[1..$#fragments - 1];
my @dist_arr = map {length} @fragments;
Demo: https://ideone.com/AqEwGu
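The two answers above measure slightly different things: the match-offset approach gives start-to-start distances, while the split approach gives the lengths of the gaps between matches. A small sketch makes the relationship concrete. It is written in Python purely as an illustration; the helper names `match_positions` and `distances` are made up, and `re.finditer` stands in for the Perl /g loop.

```python
import re


def match_positions(text, pattern):
    """Start offsets of successive non-overlapping matches of a literal pattern."""
    return [m.start() for m in re.finditer(re.escape(pattern), text)]


def distances(positions):
    """Differences between consecutive offsets."""
    return [b - a for a, b in zip(positions, positions[1:])]


seq = "AACCGGTTACGTTTGGCCAAACGTTTTTTGGGGAAACCCACGTACGTAAAGCCGGTTAAACGT"
pattern = "ACGT"

positions = match_positions(seq, pattern)
start_to_start = distances(positions)
# Gap length = start-to-start distance minus the length of the pattern itself
gaps = [d - len(pattern) for d in start_to_start]

print(positions)
print(start_to_start)
print(gaps)
```

The `gaps` list should agree with the fragment lengths produced by the split-based answer, and adding 4 to each entry recovers the start-to-start distances.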

Horde_Text_Diff string comparison library only comparing 1st character of strings

I am using Horde_Text_Diff to compute the difference between two strings. Sample code is as follows:
$check_diff = new Horde_Text_Diff( 'auto', array('asdf','asd11') );
$renderer = new Horde_Text_Diff_Renderer_Inline();
echo $renderer->render($check_diff);
This echoes nothing. The correct behaviour would be to show a difference at character 4.
If I change the comparison array from array('asdf','asd11') to, for instance, array('asdf','12345'), then it will output a1. In other words, it seems only to be comparing the first character. Any ideas?
When I try this, I get two warnings:
PHP Warning: array_walk() expects parameter 1 to be array, string given in /usr/share/php/Horde/Text/Diff/Engine/Native.php on line 33
PHP Warning: array_walk() expects parameter 1 to be array, string given in /usr/share/php/Horde/Text/Diff/Engine/Native.php on line 34
I.e., something is getting strings where it expects arrays.
That's because, rather than passing (an array containing) two strings to Horde_Text_Diff(), you should pass (an array containing) two arrays-of-strings (where each string represents a line of text).
If the actual strings you're currently trying to pass in contain multiple lines of text, then you can split them into arrays-of-strings using explode(), e.g.:
$a = "foo\nbar\nbaz";
$b = "foo\nqux\nbaz";
$a_lines = explode("\n", $a);
$b_lines = explode("\n", $b);
$check_diff = new Horde_Text_Diff( 'auto', array($a_lines, $b_lines) );
$renderer = new Horde_Text_Diff_Renderer_Inline();
echo $renderer->render($check_diff);
which outputs:
foo
<del>bar</del><ins>qux</ins>
baz

how to display specific words per page

Hey guys,
I have a string which contains, say, 100 words. I want to split that string into words and display 10 words per page. How can this be done? I can't work out the logic for it.
Please reply with how this could be resolved.
If the words are separated by whitespace, split the string on it (split in Java, explode in PHP) to get an array of words, then group the words into arrays of 10 and display one group on each page.
if you are talking about php, you can do something like this :
<?php
$words = explode(' ', $string); // create an array with each word as an item.
$start = 0; // index of the first word
$number = 10; // number of words per page
$words_part = array_slice($words, $start, $number); // take 10 words from the array, starting at $start
foreach($words_part as $word) {
    echo $word . ' '; // keep a space between the words
}
?>
Basically, we separate the words using a space as the separator, then we take the relevant part of the resulting array (the first ten words).
You can easily change the number of words taken in each part by modifying the $number variable, or the starting position with the $start variable.
Implementation in another language will be identical.
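As a rough illustration of that point, here is the same idea sketched in Python. The function name `page_of_words`, the 1-based `page` argument and the generated sample text are all assumptions made for this example, not part of the PHP code above.

```python
def page_of_words(text, page, per_page=10):
    """Return the list of words that belong on the given 1-based page."""
    words = text.split()  # split on any run of whitespace
    start = (page - 1) * per_page  # index of the first word on this page
    return words[start:start + per_page]


# Hypothetical sample text with 25 words: word1 ... word25
text = " ".join(f"word{i}" for i in range(1, 26))

print(" ".join(page_of_words(text, 1)))  # first ten words
print(" ".join(page_of_words(text, 3)))  # the remaining five words
```

A page number past the end simply yields an empty list, so the caller can use that to detect the last page.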
I hope this helps.

Resources