Perl: using an array to capitalize words - string

I'm treating large blocks of all-caps text, converting them to mixed case and adding punctuation. However, there's a large set of words and names that I want to capitalize, such as days of the week, months, people, and so on.
Rather than using a giant glob of substitutions, is there a way to use an array or hash of properly capitalized terms somehow? So if I have a string "and then on monday, we should see bob and sue" to convert it to "and then on Monday, we should see Bob and Sue" if I have the terms stored in an array or hash?
Thanks!

use feature qw( fc ); # Alternatively, "lc" is "close enough".
my #to_capitalize = qw( Monday Bob Sue );
my #abbreviations = qw( USA );
my #exceptions = qw( iPhone );
my %words = (
( map { fc($_) => ucfirst(lc($_)) } #to_capitalize ),
( map { fc($_) => uc($_) } #abbreviations ),
( map { fc($_) => $_ } #exceptions ),
);
my $pat =
join '|',
#map quotemeta, # If this is needed, our use of \b is incorrect.
keys %words;
s/\b($pat)\b/$words{fc($1)}/ig;
Adjust as needed.

You could create a hash of words to transformations, then if you encounter those words apply the expected transformation:
use strict;
use warnings;
use feature qw(say);
my %transformations = (
monday => \&titlecase,
bob => \&titlecase,
sue => \&titlecase,
shout => \&uppercase,
);
while ( my $line = <$filehandle> ) {
chomp $line;
foreach my $word ( split /\s+/, $line ) {
if ( my $transform = $transformations{lc $word} ) {
$line =~ s/$word/$transform->($word)/ge;
}
}
say $line;
}
sub titlecase {
my ($s) = #_;
return ucfirst lc $s;
}
sub uppercase {
my ($s) = #_;
return uc $s;
}

Thanks for the great ideas. I ended up using the idea of a hash (for the first time!). This may not be very optimal but it gets the job done :)
%capitalize = (
"usa" => "USA",
"nfl" => "NFL",
);
$row = "I live in the usa and love me some nfl.";
foreach my $key (keys %capitalize) {
if ($row =~ $key) {
$row =~ s/$key/$capitalize{$key}/g;
}
}
print $row;

Related

Parse Excel data raises "File error"

I have three functions in a Perl script to parse an Excel file to getthe desired output Excel file. I achieved the correct output, however I get an error
File error: data may have been lost
Likely the root cause is writing over the same Excel cell twice in the file.
How do I get rid of this error while maintaining the functions in the script?
Input file
A B
Apples Carrots
Oranges Broccoli
Grapes Spinach
Desire output file
A B
Apples Carrots
Oranges Broccoli
PEACHES ASPARAGUS
Perl code
use v5.10.0;
use warnings;
use Spreadsheet::ParseExcel;
use Spreadsheet::ParseExcel::SaveParser;
use Spreadsheet::WriteExcel;
my $parser = Spreadsheet::ParseExcel::SaveParser->new();
my $workbook_R = $parser->parse('C:\Perl\databases\Fruits_and_Veggies.xls');
my $workbook_W = Spreadsheet::WriteExcel->new('C:\Perl\databases\New_Fruits_and_Veggies.xls');
my $worksheet_W = $workbook_W->add_worksheet();
for our $worksheet_R ( $workbook_R->worksheets() ) {
my ( $row_min, $row_max ) = $worksheet_R->row_range();
my ( $col_min, $col_max ) = $worksheet_R->col_range();
for our $row ( $row_min .. $row_max ) {
for our $col ( $col_min .. $col_max ) {
FruitStand();
VeggieStand();
ComboStand();
#------------------------------------------------------------------------------
# sub FruitStand - parsing: replace Grapes with PEACHES
#------------------------------------------------------------------------------
sub FruitStand {
# if the cell contains Grapes write 'PEACHES' instead
my $cell_grapes = $worksheet_R->get_cell( $row, $col );
if ( $cell_grapes->value() =~ /Grapes/ ) {
$worksheet_W->write($row, $col,"PEACHES");
}
}
#------------------------------------------------------------------------------
# sub VeggieStand - parsing: repalce Spinach with ASPARAGUS
#------------------------------------------------------------------------------
sub VeggieStand {
# if the cell contains Spinach write 'ASPARAGUS' instead
my $cell_veggies = $worksheet_R->get_cell( $row, $col );
# my $cell = $worksheet_R->get_cell( $row, $col );
if (/ $cell_veggies->value() =~ /Spinach/ ) {
$worksheet_W->write($row, $col,"ASPARAGUS");
}
}
#------------------------------------------------------------------------------
# Writing all fruits and veggies with the 2 changes (PEACHES and ASPARAGUS)
#------------------------------------------------------------------------------
sub ComboStand {
my $cell = $worksheet_R->get_cell( $row, $col );
$worksheet_W->write($row, $col, $cell->value()) ;
}
}
}
}
As you have said, the error message is due to a cell being written several times. You can get rid of the error message by ensuring that each cell is only written once. Since your three subroutines have very similar functionality, they can be combined into a single set of lines that does everything, and use an if / else cascade to decide which action should be taken.
for our $row ( $row_min .. $row_max ) {
for our $col ( $col_min .. $col_max ) {
my $cell = $worksheet_R->get_cell( $row, $col );
if($cell->value() =~ /Spinach/) {
$worksheet_W->write($row, $col,"ASPARAGUS");
}
elsif($cell->value() =~ /Grapes/) {
$worksheet_W->write($row, $col,"PEACHES");
}
else {
$worksheet_W->write($row, $col, $cell->value()) ;
}
}
}
If you have to keep the functions, I suggest something like this, where you have a function that takes the current cell and applies any appropriate transformations to it, and returns the text ready for output. This keeps the repetition of reading the cell and writing the cell out of the subroutines:
for our $row ( $row_min .. $row_max ) {
for our $col ( $col_min .. $col_max ) {
my $cell = $worksheet_R->get_cell( $row, $col );
$worksheet_w->write( $row, $col, produce_check($cell->value) );
}
}
And the produce_check sub that performs the swaps would also be a good place to do any text normalisations or other checks you might want to do on the input, e.g. removing extra whitespace, setting all the output to title case, etc.
sub produce_check {
my $prod = shift;
# maybe we have to make sure there's no trailing whitespace on $prod
$prod =~ s/\s*$//;
my %swaps = (
grapes => 'peaches',
spinach => 'asparagus',
tins => 'cans',
zuchini => 'zucchini'
);
# is $prod one of pieces of produce we have to swap?
# perhaps our input is in a mixture of cases, uppercase, lowercase, titlecase
# to avoid having to add all those variations to the %swaps hash, we convert
# to lowercase using `lc`
if ( $swaps{ lc($prod) } ) {
$prod = $swaps{ lc($prod) };
}
# this line uses `ucfirst($prod)` to convert all output to titlecase.
# You could also convert everything to lowercase ( `lc($prod)` ), to
# uppercase ( `uc($prod)` ), or just leave it as-is by using `return $prod;`
return ucfirst( $prod );
}

Find the index of the Nth occurrence of a character in a string

I've found index and rindex for finding the first or last occurrence of a character (or substring) in a string. I'm also aware that they have an offset that can be used to start at a certain index.
What I want to know is if there is a simple way to find the index of the Nth occurrence of a character or substring in a string. I'd prefer not to have to do it with a regex and would rather not to have to write a loop that just repeatedly calls index with an offset.
EDIT: I didn't state the restriction well. The reason I said "no loop" is I am looking for a built-in way to do it, which exists in many languages.
One possible implementation:
use strict;
use warnings;
use feature qw(say);
my $string = 'the quick brown fox jumped over the lazy dog';
my $substring = 'o';
my $n = 4;
sub nth_index {
my ($string, $substring, $n) = #_;
my ($times, $index) = (0, 0);
while ( $times < $n && $index != -1 ) {
$index = index(
$string,
$substring,
$times == 0
? 0
: $index + length($substring),
);
$times++;
}
return $index;
}
say nth_index($string, $substring, $n); # 42
Here are two examples of how I would solve the problem
Subroutine nth_index1 uses index, while nth_index2 uses a regex. Both involve loops, as any solution must
I believe the regex solution is much more readable, while the index solution is probably a little faster. But they are both so quick that they are highly unlikely to cause a bottleneck, and readability is always paramount
use strict;
use warnings 'all';
my $s = 'the quick brown fox jumps over the lazy dog';
my $ss = 'o';
for my $n ( 1 .. 4 ) {
printf "%d %d\n",
nth_index1($s, $ss, $n),
nth_index2($s, $ss, $n);
}
sub nth_index1 {
my ($s, $ss, $n) = #_;
my $i;
my $len = length $ss;
while ( $n-- ) {
$i = index($s, $ss, $i ? $i + $len : 0 );
return if $i < 0;
}
$i;
}
sub nth_index2 {
my ($s, $ss, $n) = #_;
while ( $s =~ /$ss/g ) {
return $-[0] unless --$n;
}
return;
}
output
12 12
17 17
26 26
41 41
As stated, there is no built-in for this. Here are a few ways, using split, index, and regex.
use warnings;
use strict;
use feature qw(say);
my $str = "Xab_ab_ab_ab_"; # 'Xab_ab'; # test failed (3) matches
my $N = 3;
foreach my $patt qw(a ab c) {
say "Find index of occurrence $N of |$patt| in: |$str|";
say "index: ", ( ind_Nth_match_1($str, $patt, $N) // "no $N matches" ); #/
say "split: ", ( ind_Nth_match_2($str, $patt, $N) // "no $N matches" ); #/
say "regex: ", ( ind_Nth_match_3($str, $patt, $N) // "no $N matches" ); #/
}
sub ind_Nth_match_1 {
my ($str, $patt, $N) = #_;
my ($pos, $cnt) = (0, 0);
while ($pos = index($str, $patt, $pos) + 1) { # != 0
return $pos-1 if ++$cnt == $N;
}
return;
}
sub ind_Nth_match_2 {
my ($str, $patt, $N) = #_;
my #toks = split /($patt)/, $str;
return if #toks < 2*$N;
return length( join '', #toks[0..2*$N-1] ) - length($patt);
}
sub ind_Nth_match_3 {
my ($str, $patt, $N) = #_;
my $cnt = 0;
while ($str =~ m/$patt/g) {
return $-[0] if ++$cnt == $N;
}
}
This prints
Find index of occurrence 3 of |a| in: |Xab_ab_ab_ab_|
index: 7
split: 7
regex: 7
Find index of occurrence 3 of |ab| in: |Xab_ab_ab_ab_|
index: 7
split: 7
regex: 7
Find index of occurrence 3 of |c| in: |Xab_ab_ab_ab_|
index: no 3 matches
split: no 3 matches
regex: no 3 matches
Notes
In split each delimiter is also returned in the output list, with capturing /($patt)/, for simpler length estimation. Thus we count 2*$N (and then take -1).
In regex the #- array is used, #LAST_MATCH_START, for the position of the last successful match. Here the /g in scalar context in while makes it jump from a match to the next in repeated executions, and $-[0] gives the starting position of the last (previous) such match.
The subs return undef if there are not required $N matches, including no matches at all.
Thanks to Borodin for comments on return from subs and on using #- instead of #+.
(This answer does not answer your question, but is here to help you down the road towards accepting a regex solution.)
The way you're asking for a numeric index of the position, it sounds like you're thinking of extracting data from the string once you have that number, in a way that C programmers might do it.
For example, say you had the string
my $str = "My daddy left home when I was three and he didn't leave much for ma and me";
and you wanted to extract all the data up to the first instance of the word "and". Here's the way you could do it, which is sort of a C way to do it using Perl.
my $pos = find_index_of_first_occurrence( $str, 'and' );
# Note that find_index_of_first_occurrence() is a hypothetical function.
print substr( $str, 0, $pos );
# Prints "My daddy left home when I was three "
The way you'd do that in Perl with regexes is much simpler.
$str =~ /^(.*?)and/;
print $1;
With regexes, you're combining the searching for the string and the extraction of the data in one operation. (Note that both code snippets ignore the case of not finding "and" at all, for the sake of simplicity)
I understand that you don't know regexes very well yet, and that regexes can be daunting at first, but you'll need to understand them as part of learning Perl if you're going to succeed with the language.

perl match whole word only with array list

I need to be able to match a user to a list of users, but only an exact match.
In the sample code below:
- if strUser contains "info" there is a match and that's good.
- if strUser contains "theinfo", or "infostuff" there is a match and that's not good.
I can't use \b modifiers because the variable will only contain the user name, no spaces or word boundaries.
#!/usr/bin/perl
$strUser = "theinfo";
$strUsers = "(alpha|info|omega)";
if ( $strUser =~ /$strUsers/ ) {
print "match\n";
}
exit(0);
Use a boundaries within your regex ^ and \z:
if ( $strUser =~ /^$strUsers\z/ ) {
Note, since you're likely wanting to work with literal strings, I would put more effort into constructing your regex by using quotemeta:
#!/usr/bin/perl
use strict;
use warnings;
my $strUser = "theinfo";
my #users = qw(alpha info omega);
my $list_users_re = join '|', map {quotemeta} #users;
if ( $strUser =~ /^($list_users_re)\z/ ) {
print "match\n";
}
exit(0);
However, an even easier solution since you're looking for exact matches, is just to use a hash:
my #users = qw(alpha info omega);
my %is_user = map { $_ => 1 } #users;
if ( $is_user{$strUser} ) {
print "match\n";
}
Finally, always include use strict; and use warnings in EVERY perl script.

how to compare 2 strings by each characters in perl

basically I want to compare
$a = "ABCDE";
$b = "--(-)-";
and get output CE.
i.e where ever parentheses occur the characters of $a should be taken.
One of the rare uses of the bitwise or-operator.
# magic happens here ↓
perl -E'say (("ABCDE" | "--(-)-" =~ tr/-()/\377\000/r) =~ tr/\377//dr)'
prints CE.
Use this for golfing purposes only, AHA’s solution is much more maintainable.
Simple regex and pos solution:
my $str = "ABCDE";
my $pat = "--(-)-";
my #list;
while ($pat =~ /(?=[()])/g) {
last if pos($pat) > length($str); # Required to prevent matching outside $x
my $char = substr($str, pos($y), 1);
push #list, $char;
}
print #list;
Note the use of lookahead to get the position before the matching character.
Combined with Axeman's use of the #- variable we can get an alternative loop:
while ($pat =~ /[()]/g) {
last if $-[0] > length($str);
my $char = substr($str, $-[0], 1);
push #list, $char;
}
This is pretty much mentioned in the documentation for #-:
After a match against some variable $var :
....
$& is the same as substr($var, $-[0], $+[0] - $-[0])
In other words, the matched string $& equals that substring expression. If you replace $var with another string, you would get the characters matching the same positions.
In my example, the expression $+[0] - $-[0] (offset of end of match minus offset of start of match) would be 1, since that is the max length of the matching regex.
QED.
This uses the idea that you can scan one string for positions and just take the values of the other strings. #s is a reusable product.
use strict;
use warnings;
sub chars {
my $source = shift;
return unless #_;
my #chars = map { substr( $source, $_, 1 ) } #_;
return wantarray ? #chars, join( '', #chars );
}
my $a = "ABCDE";
my $b = "--(-)-";
my #s;
push #s, #- while $b =~ m/[()]/g;
my $res = chars( $a, #s );
Way faster than all the solutions except daxim's, and almost as fast as daxim's without preventing the use of characters 255 and above:
my $pat = $b =~ s/[^()]/.?/gr =~ s/[()]/(.?)/gr
my $c = join '', $a =~ /^$pat/s;
It changes
---(-)-
to
.?.?.?(.?).?(.?).?
Then uses the result as regex pattern to extract the desired characters.
This is easy to accomplish using each_array, each_arrayref or pairwise from List::MoreUtils:
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw( min );
use List::MoreUtils qw( each_array );
my $string = 'ABCDE';
my $pattern = '--(-)-';
my #string_chars = split //, $string;
my #pattern_chars = split //, $pattern;
# Equalise length
my $min_length = min $#string_chars, $#pattern_chars;
$#string_chars = $#pattern_chars = $min_length;
my $ea = each_array #string_chars, #pattern_chars;
while ( my ( $string_char, $pattern_char ) = $ea->() ) {
print $string_char if $pattern_char =~ /[()]/;
}
Using pairwise:
{
no warnings qw( once );
print pairwise {
$a if $b =~ /[()]/;
} #string_chars, #pattern_chars;
}
Without using List::MoreUtils:
for ( 0 .. $#string_chars ) {
print $string_chars[$_] if $pattern_chars[$_] =~ /[()]/;
}
Thanks to TLP for discovering the set $# technique without which this solution will have been longer and complicated. :-)
#!/usr/bin/perl
use strict;
use warnings;
my $a = "ABCDE";
my $b = "--(-)-";
my ($i, $c, $x, $y) = 0;
$c .= $y =~ /\(|\)/ ? $x : "" while ($x = substr $a, $i, 1) && ($y = substr $b, $i++, 1);
print "$c\n";

How to turn an array of words into an array containing the characters of the words in order?

I have an array of unknown number of words, with an unknown max length.
I need to convert it to another array, basically turning it into a column word array.
so with this original array:
#original_array = ("first", "base", "Thelongest12", "justAWORD4");
The resluting array would be:
#result_array = ("fbTj","iahu","rses","selt","t oA"," nW"," gO"," eR"," sD"," t4"," 1 "," 2 ");
Actually I will have this:
fbTj
iahu
rses
selt
t oA
nW
gO
eR
sD
t4
1
2
I need to do it in order to make a table, and these words are the table's headers.
I hope I have made myself clear, and appreciate the help you are willing to give.
I tried it with the split function, but I keep messing it up...
EDIT:
Hi all, thanks for all the tips and suggestions! I learned quite much from all of you hence the upvote. However I found tchrist's answer more convenient, maybe because I come from a c,c# background... :)
use strict;
use warnings;
use 5.010;
use Algorithm::Loops 'MapCarU';
my #original_array = ("first", "base", "Thelongest12", "justAWORD4");
my #result_array = MapCarU { join '', map $_//' ', #_ } map [split //], #original_array;
I have an old program that does this. Maybe you can adapt it:
$ cat /tmp/a
first
base
Thelongest12
justAWORD4
$ rot90 /tmp/a
fbTj
iahu
rses
selt
t oA
nW
gO
eR
sD
t4
1
2
Here’s the source:
$ cat ~/scripts/rot90
#!/usr/bin/perl
# rot90 - tchrist#perl.com
$/ = '';
# uncomment for easier to read, but not reversible:
### #ARGV = map { "fmt -20 $_ |" } #ARGV;
while ( <> ) {
chomp;
#lines = split /\n/;
$MAXCOLS = -1;
for (#lines) { $MAXCOLS = length if $MAXCOLS < length; }
#vlines = ( " " x #lines ) x $MAXCOLS;
for ( $row = 0; $row < #lines; $row++ ) {
for ( $col = 0; $col < $MAXCOLS; $col++ ) {
$char = ( length($lines[$row]) > $col )
? substr($lines[$row], $col, 1)
: ' ';
substr($vlines[$col], $row, 1) = $char;
}
}
for (#vlines) {
# uncomment for easier to read, but again not reversible
### s/(.)/$1 /g;
print $_, "\n";
}
print "\n";
}
use strict;
use warnings;
use List::Util qw(max);
my #original_array = ("first", "base", "Thelongest12", "justAWORD4");
my #new_array = transpose(#original_array);
sub transpose {
my #a = map { [ split // ] } #_;
my $max = max(map $#$_, #a);
my #out;
for my $n (0 .. $max) {
$out[$n] .= defined $a[$_][$n] ? $a[$_][$n] : ' ' for 0 .. $#a;
}
return #out;
}
It can be done easily by this simple one-liner:
perl -le'#l=map{chomp;$m=length if$m<length;$_}<>;for$i(0..$m-1){print map substr($_,$i,1)||" ",#l}'

Resources