We are migrating from an MS Excel OLE-based module to Spreadsheet::ParseExcel (or similar). As we have hundreds of programs using our module, we'd prefer to provide a drop-in replacement, i.e. the data returned is identical.
The problem is dates - using Excel we get a Win32::OLE::Variant object of type VT_DATE. As a workaround we can construct this manually by checking $cell->type() eq 'Date' and returning the object.
The problem is that the type is not reliably set, so we can't always do this.
The Date type is set in two places. This is the logic used in FmtDefault.pm:
if ( ( ( $iFmtIdx >= 0x0E ) && ( $iFmtIdx <= 0x16 ) )
|| ( ( $iFmtIdx >= 0x2D ) && ( $iFmtIdx <= 0x2F ) ) )
{
return "Date";
}
and if this check fails and we get Numeric, then it does a backup check in ParseExcel.pm:
if ( $FmtStr =~ m{^[dmy][-\\/dmy]*$}i ) {
$rhKey{Type} = "Date";
}
However, a number of common format strings are not detected as dates, for example:
[$-C09]dddd\\,\\ d\\ mmmm\\ yyyy;# i.e. Sunday, 24 January 1982
d/m/yyyy;# i.e. 24/1/1982
I've checked the Excel specification at openoffice.org and also read guides such as http://jonvonderheyden.net/excel/a-comprehensive-guide-to-number-formats-in-excel/#date_code and it seems that the below rule will match a date format string:
A string with d, m, or y characters that are not between "" or [], are not preceded by \ (unless it's a \\), and are not followed by - or *.
This seems very complicated and error-prone. Is there a better way?
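For what it's worth, that rule can be approximated with a few substitutions. This is only a hand-rolled sketch of the heuristic (the helper name looks_like_date_format() is made up, and edge cases such as time-only formats are not handled specially), not something taken from Spreadsheet::ParseExcel:
# Rough sketch of the rule described above; not from Spreadsheet::ParseExcel.
sub looks_like_date_format {
    my ($fmt) = @_;
    $fmt =~ s/"[^"]*"//g;             # ignore literal text between double quotes
    $fmt =~ s/\[[^\]]*\]//g;          # ignore [...] blocks (colours, locales, conditions)
    $fmt =~ s/\\\\//g;                # drop escaped backslashes first ...
    $fmt =~ s/\\.//g;                 # ... then any other escaped character
    return $fmt =~ /[dmy](?![-*])/i;  # an unescaped d/m/y not followed by - or *
}

# e.g. both of the problem formats above are classified as dates:
print looks_like_date_format('[$-C09]dddd\,\ d\ mmmm\ yyyy') ? "date\n" : "not a date\n";
print looks_like_date_format('d/m/yyyy')                     ? "date\n" : "not a date\n";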
It seems Spreadsheet::ParseExcel::Utility::ExcelFmt() flags a date format under $format_mode so perhaps this logic can be modified to return the flag? But I'd prefer something ready to go without changing the Spreadsheet::ParseExcel modules if possible.
Do you know what columns are supposed to be dates?
In my usage, I do, and convert them with:
$val = $cell->unformatted();
# if it was properly set as a Date cell, the value will be a number of days since 1900 or 1904
# that we can convert to a date, regardless of the format they were shown.
if ( $val =~ /^[0-9]{5}(?:\.[0-9]+)?\z/ ) {
$date = Spreadsheet::ParseExcel::Utility::ExcelFmt( 'YYYY-MM-DD', $val, $wb->{'Flg1904'} );
}
else {
$val = $cell->value();
$val =~ s/^'//;
# try parsing it with Date::Manip, which handles all common formats (see its ParseDateString doc)
use Date::Manip ();
Date::Manip::Date_Init("TZ=GMT","DateFormat=US");
$date = Date::Manip::UnixDate( $val, '%Y-%m-%d' );
}
Update: sounds like you are best off modifying ExcelFmt, something like this (untested):
--- Utility.pm.orig 2014-12-17 07:16:06.609942082 -0800
+++ Utility.pm 2014-12-17 07:18:14.453965764 -0800
@@ -69,7 +69,7 @@
#
sub ExcelFmt {
- my ( $format_str, $number, $is_1904, $number_type, $want_subformats ) = @_;
+ my ( $format_str, $number, $is_1904, $number_type, $want_subformats, $want_format_mode ) = @_;
# Return text strings without further formatting.
return $number unless $number =~ $qrNUMBER;
@@ -956,8 +956,14 @@
$result =~ s/^\$\-/\-\$/;
$result =~ s/^\$ \-/\-\$ /;
- # Return color and locale strings if required.
- if ($want_subformats) {
+ # Return format mode and/or color and locale strings if required.
+ if ( $want_subformats && $want_format_mode ) {
+ return ( $result, $color, $locale, $format_mode );
+ }
+ elsif ($want_format_mode) {
+ return ( $result, $format_mode );
+ }
+ elsif ($want_subformats) {
return ( $result, $color, $locale );
}
else {
Be sure to submit it to the maintainer for inclusion in a later release.
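With a patch along those lines applied, a caller could then ask for the mode directly. This is only an illustration of how the hypothetical $want_format_mode flag from the (untested) diff above might be used; the exact values $format_mode takes ('date' for date formats is assumed here) should be double-checked in Utility.pm:
use Spreadsheet::ParseExcel::Utility ();

# Assumes the patched Utility.pm from the diff above. $format_string, $cell and
# $workbook come from the normal parse; the trailing (undef, 0, 1) arguments
# select the hypothetical $want_format_mode flag.
my ( $formatted, $format_mode ) = Spreadsheet::ParseExcel::Utility::ExcelFmt(
    $format_string, $cell->unformatted(), $workbook->{Flg1904},
    undef, 0, 1,
);

if ( $format_mode eq 'date' ) {    # assuming 'date' is the value used for date formats
    # construct the Win32::OLE::Variant VT_DATE replacement here
}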
I have an xlsx file that I'm converting into a Perl hash:
Name    Type   Symbol   Colour
JOHN    SUV    X        R
ROB     MPV    Y        B
JAMES   4X4    Y        G
Currently, I can only set the hash super-key to the wanted column based on the column index. I can't seem to figure out how to choose it based on the column header.
use Data::Dumper;
use Text::Iconv;
my $converter = Text::Iconv->new("utf-8", "windows-1251");
use Spreadsheet::XLSX;
my $excel = Spreadsheet::XLSX->new('file.xlsx', $converter);
foreach my $sheet (@{$excel->{Worksheet}}) {
if ($sheet->{Name} eq "sheet1"){
my %data;
for my $row ( 0 .. $sheet-> {MaxRow}) {
if ($sheet->{Cells}[0][$col]->{Val} eq "Symbol"){
my $super_key = $sheet->{Cells}[$row][$col]{Val};
}
my $key = $sheet->{Cells}[$row][0]{Val};
my $value = $sheet->{Cells}[$row][2]{Val};
my $value2= $sheet->{Cells}[$row][3]{Val};
$data{$super_key}->{$key}->{$value}=${value2};
}
print Dumper \%data;
}}
The outcome I get is:
$VAR1 = {
'' => {
'JOHN' => {
'SUV' => R
I would like to have:
$VAR1 = {
'X' => {
'JOHN' => {
'SUV' => R
You are missing use strict; in your Perl script. If you had it, you would have seen the error yourself.
Declaring $super_key with my inside your if clause makes the variable go out of scope as soon as you exit it.
And using a variable $col without defining it doesn't work either.
Better (and probably working) is:
for my $row ( 0 .. $sheet-> {MaxRow}) {
my $super_key;
foreach my $col (0 .. 3) {
if ($sheet->{Cells}[0][$col]->{Val} eq "Symbol"){
$super_key = $sheet->{Cells}[$row][$col]{Val};
}
}
my $key = $sheet->{Cells}[$row][0]{Val};
my $value = $sheet->{Cells}[$row][2]{Val};
my $value2= $sheet->{Cells}[$row][3]{Val};
$data{$super_key}->{$key}->{$value}=${value2};
}
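A variation on the same idea (just a sketch using the same Spreadsheet::XLSX structures as above) is to find the header's column index once before the row loop and to skip the header row itself:
# Find which column holds the "Symbol" header once, before the row loop.
my $symbol_col;
for my $col ( 0 .. $sheet->{MaxCol} ) {
    my $cell = $sheet->{Cells}[0][$col];
    if ( defined $cell && defined $cell->{Val} && $cell->{Val} eq 'Symbol' ) {
        $symbol_col = $col;
        last;
    }
}
die "No 'Symbol' column found" unless defined $symbol_col;

for my $row ( 1 .. $sheet->{MaxRow} ) {    # row 0 is the header row
    my $super_key = $sheet->{Cells}[$row][$symbol_col]{Val};
    my $key       = $sheet->{Cells}[$row][0]{Val};
    my $value     = $sheet->{Cells}[$row][2]{Val};
    my $value2    = $sheet->{Cells}[$row][3]{Val};
    $data{$super_key}{$key}{$value} = $value2;
}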
I am new to Perl and I'm trying to create a simple calculator program, but the rules are different from normal maths. All operators have the same precedence and the expression must be evaluated from left to right.
Here is an example:
123 - 10 + 4 * 10 = ((123 - 10) + 4) * 10 = 1170
8 * 7 / 3 + 2 = ((8 * 7) / 3) + 2 = 20.666
So in the first case the user needs to enter one string: 123 - 10 + 4 * 10.
How do I approach this task?
I'm sorry if it's too general a question, but I'm not sure how to even begin. Do I need a counter? Like: every second character of the string is an operator, while the ones on either side are digits.
I'm afraid I'm lazy so I'll parse with a regex and process as I parse.
#!/usr/bin/env perl
#use Data::Dumper;
use Params::Validate (':all');
use 5.01800;
use warnings;
my $string=q{123 - 10 + 4 * 10};
my $result;
sub fee {
my ($a)=validate_pos(@_,{ type=>SCALAR });
#warn Data::Dumper->Dump([\$a],[qw(*a)]),' ';
$result=$a;
};
sub fi {
my ($op,$b)=validate_pos(@_,{ type=>SCALAR},{ type=>SCALAR });
#warn Data::Dumper->Dump([\$op,\$b],[qw(*op *b)]),' ';
$result = $op eq '+' ? $result+$b :
$op eq '-' ? $result-$b :
$op eq '*' ? $result*$b :
$op eq '/' ? $result/$b :
undef;
#warn Data::Dumper->Dump([\$result],[qw(*result)]),' ';
};
$string=~ m{^(\d+)(?{ fee($1) })(?:(?: *([-+/*]) *)(\d+)(?{ fi($2,$3) }))*$};
say $result;
Note the use of (?{...}).
To be clear, you are not looking for a regular calculator. You are looking for a calculator that bends the rules of math.
What you want is to extract the operands and operators, then handle them three at a time, with the first being the rolling "sum", the second an operator, and the third an operand.
A simple way to handle it is to just eval the strings. But since eval is a dangerous operation, we need to de-taint the input. We do this with a regex match: /\d+|[+\-*\/]+/g. This matches either one or more (+) digits (\d), or (|) one or more (+) of the characters +-*/. And we do this match as many times as we can (/g).
use strict;
use warnings;
use feature 'say';
while (<>) { # while we get input
my ($main, @ops) = /\d+|[+\-*\/]+/g; # extract the ops
while (@ops) { # while the list is not empty
$main = calc($main, splice @ops, 0, 2); # take 2 items off the list and process
}
say $main; # print result
}
sub calc {
eval "#_"; # simply eval a string of 3 ops, e.g. eval("1 + 2")
}
You may wish to add some input checking, to count the args and make sure they are the correct number.
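A sketch of such a check, working on the same token list that the match above produces (the helper name valid_expression() is made up):
# Returns true if the tokens form "number op number op ... number".
sub valid_expression {
    my @tokens = @_;
    return 0 unless @tokens % 2;    # a valid expression has an odd token count
    for my $i ( 0 .. $#tokens ) {
        if ( $i % 2 ) {             # odd positions must be operators
            return 0 unless $tokens[$i] =~ m{^[-+*/]$};
        }
        else {                      # even positions must be numbers
            return 0 unless $tokens[$i] =~ /^\d+$/;
        }
    }
    return 1;
}
Calling valid_expression($main, @ops) before the splice loop lets you reject malformed input early.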
A more sensible solution is to use a calling table, using the operator as the key from a hash of subs designed to handle each math operation:
sub calc {
my %proc = (
"+" => sub { $_[0] + $_[1] },
"-" => sub { $_[0] - $_[1] },
"/" => sub { $_[0] / $_[1] },
"*" => sub { $_[0] * $_[1] }
);
return $proc{$_[1]}($_[0], $_[2]);
}
As long as the middle argument is an operator, this will perform the required operation without the need for eval. This will also allow you to add other math operations that you might want for the future.
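For example, this calc() drops straight into the same splice loop shown earlier (here with a hard-coded token list instead of user input):
my @tokens = ( 123, '-', 10, '+', 4, '*', 10 );
my $main   = shift @tokens;
$main = calc( $main, splice( @tokens, 0, 2 ) ) while @tokens;
print "$main\n";    # prints 1170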
Just to read raw input from the user you would simply read the STDIN file handle.
$input = <STDIN>;
This will give you a string, say "123 + 234 - 345", which will have an end-of-line marker. You can remove this safely with the chomp command.
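For example (just the raw read-and-clean step):
print "Enter an expression: ";
my $input = <STDIN>;    # e.g. "123 + 234 - 345\n"
chomp $input;           # strip the trailing end-of-line marker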
After that you will want to parse your string to get your appropriate variables. You can brute force this with a stream scanner that looks at each character as you read it and processes it accordingly. For example:
@input = split //, $input;
for $ch (@input) {
if ($ch ge '0' and $ch le '9') {
$tVal = ($tVal * 10) + $ch;
} elsif ($ch eq " ") {
$newVal = $oldVal
} elsif ($ch eq "+") {
# Do addition stuff
}...
}
Another approach would be to split it into words so you can just deal with whole terms.
@input = split /\s+/, $input;
Instead of a stream of characters, as you process the array values will be 123, +, 234, -, and 345...
Hope this points you in the right direction...
My final goal for my first Perl program is to create an Excel sheet for reporting purposes and email the sheet as an attachment.
I have got as far as creating a CSV file. Now I want to convert this to an Excel sheet and autofit the content.
I have example code from our environment; could someone take the time to explain each line of the code below? I would be very grateful.
$outputfile, $urloutputfile and $scomoutputfile are the CSV files now being converted to Excel sheets.
Please also explain how an element is passed to the other function.
my $parser = Text::CSV::Simple->new;
my $workbook = Excel::Writer::XLSX->new($auditxl);
my @totcsvlist;
push(@totcsvlist,$outputfile);
push(@totcsvlist,$urloutputfile);
push(@totcsvlist,$scomoutputfile);
my @data;
my $subject = 'worksheet';
foreach my $totcsvlis (@totcsvlist)
{
undef @data;
chomp($totcsvlis);
if ($totcsvlis eq $outputfile) { $subject="Service Status"; }
if ($totcsvlis eq $urloutputfile) { $subject="URL Status"; }
if ($totcsvlis eq $scomoutputfile) { $subject="SCOM Agent Status"; }
@data = $parser->read_file($totcsvlis);
my $headers = shift @data;
import_data($workbook, $subject, $headers, \#data);
}
$workbook->close();
sub autofit_columns {
my $worksheet = shift;
my $col = 0;
for my $width (@{$worksheet->{__col_widths}}) {
$worksheet->set_column($col, $col, $width) if $width;
$col++;
}
}
sub import_data {
my $workbook = shift;
my $base_name = shift;
my $colums = shift;
my $data = shift;
my $limit = shift || 50_000;
my $start_row = shift || 1;
my $bold = $workbook->add_format();
$bold->set_bold(1);
$bold->set_bg_color('gray');
$bold->set_border();
my $celbor = $workbook->add_format();
$celbor->set_border();
my $worksheet = $workbook->add_worksheet($base_name);
$worksheet->add_write_handler(qr[\w], \&store_string_widths);
my $w = 1;
$worksheet->write('A' . $start_row, $colums, $bold);
my $i = $start_row;
my $qty = 0;
for my $row (@$data) {
$qty++;
$worksheet->write($i++, 0, $row,$celbor);
}
autofit_columns($worksheet);
warn "Convereted $qty rows.";
return $worksheet;
}
sub autofit_columns {
my $worksheet = shift;
my $col = 0;
for my $width (@{$worksheet->{__col_widths}}) {
$worksheet->set_column($col, $col, $width + 5) if $width;
$col++;
}
}
sub store_string_widths {
my $worksheet = shift;
my $col = $_[1];
my $token = $_[2];
return if not defined $token; # Ignore undefs.
return if $token eq ''; # Ignore blank cells.
return if ref $token eq 'ARRAY'; # Ignore array refs.
return if $token =~ /^=/; # Ignore formula
return if $token =~ m{^[fh]tt?ps?://};
return if $token =~ m{^mailto:};
return if $token =~ m{^(?:in|ex)ternal:};
my $old_width = $worksheet->{__col_widths}->[$col];
my $string_width = string_width($token);
if (not defined $old_width or $string_width > $old_width) {
$worksheet->{__col_widths}->[$col] = $string_width;
}
return undef;
}
sub string_width {
return length $_[0];
}
I have tried to search and read about the modules used in the above code, but it is over my head.
https://github.com/jmcnamara/spreadsheet-writeexcel/blob/master/examples/autofit.pl
-- has similar code and provided a basic overview, but I would like to understand it in detail.
Thank you so much in advance.
Regards,
Kaushik KM
Here is the documentation for the add_write_handler() method call. It says:
add_write_handler( $re, $code_ref )
This method is used to extend the Excel::Writer::XLSX write() method
to handle user defined data.
And later, it says:
The add_write_handler() method take two arguments, $re, a regular
expression to match incoming data and $code_ref a callback function
to handle the matched data
So, here you have a method call that takes two arguments. The first is a regex that tells the object what type of data this new write handler is used for. The second is a reference to the subroutine that should be used as the write handler for data that matches the regex.
The regex you have is qr[\w]. The actual regex bit of that is \w. And that just means "match a word character". The qr is to compile a string into a regex and the [ ... ] is just the delimiter for the regex string (qr/.../ is one of a class of Perl operators that allows you to use almost any character you want as a delimiter).
So, if your object is called on to write some data that contains at least one word character, the subroutine which is given as the second argument is used. But we take a reference to the subroutine.
Elsewhere in your code, you define the store_string_widths() subroutine. Subroutines in Perl are a lot like variables, and that means that they have their own sigil. The sigil for a subroutine is & (like the $ for scalars and @ for arrays). You very rarely need the & in modern Perl code, so you won't see it used very often. One place that it is still used is when we take a reference to a subroutine. You take a reference to any variable by putting a backslash in front of the variable's full name (like \@array or \%hash) and subroutines are no different. So \&store_string_widths means "get a reference to the subroutine called store_string_widths()".
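A tiny standalone illustration of the same idea (nothing to do with Excel::Writer::XLSX, just to show a subroutine reference being taken and later called):
sub shout { return uc $_[0] }

my $handler = \&shout;              # a reference to the subroutine, not a call
print $handler->('hello'), "\n";    # dereference and call it: prints HELLO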
You say that this is your first Perl program. I have to say that this feels a little ambitious for your first Perl code. I don't cover references at all in my two-day beginners course, and on my intermediate course I cover most references but only mention subroutine references in passing. If you can understand references enough to get this all working, then I think you're doing really well.
I modified the code to work with two files. to_search.txt holds the strings to be searched for. big_file.fastq holds the lines to be searched; if a string is found (2 mismatches allowed, exact length in the range 8-10, no insertions or deletions), the record is placed in the file named for that string. So each string is searched for in the second line of every 4-line record in big_file.fastq.
# to_search.txt: (length can be from 8-20 characters)
1 TCCCTTGT
2 ACGAGACT
3 GCTGTACG
4 ATCACCAG
5 TGGTCAAC
6 ATCGCACA
7 GTCGTGTA
8 AGCGGAGG
9 ATCCTTTG
10 TACAGCGC
#2000 search needed
# big_file.fastq: 2 billion lines (every 4 lines form one record; the string search is in the second line of each record).
# Second line can have 100-200 characters
@M04398:19:000000000-APDK3:1:1101:21860:1000 1:N:0:1
TCttTTGTGATCGATCGATCGATCGATCGGTCGTGTAGCCTCCAACCAAGCACCCCATCTGTTCCAAATCTTCTCCCACTGCTACTTGAAGACGCTGAAGTTGAAGGGCCACCTTCATCATTCTGG
+
#8ACCDGGGGGGGGGGGGGEGGGGGGGGDFFEGGG#FFGGGGGGGGGGGGGGGGGCF#<FFGGGGGFE9FGGGFEACE8EFFFGGGGGF9F?CECEFEG,CFGF,CF#CCC?BFFG<,9<9AEFG,,
@M04398:19:000000000-APDK3:1:1101:13382:1000 1:N:0:1
NTCGATCGATCGATCGATCGATCGATCGTTCTGAGAGGTACCAACCAAGCACACCACGGGCGACACAGACAGCTCCGTGTTGAACGGGTTGTTCTTCTTCTTGCCTTCATCATCCCCATCCTCAGTGGACGCAGCTTGCTCATCCTTCCTC
+
#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@M04398:19:000000000-APDK3:1:1101:18888:1000 1:N:0:1
NCAGAATGAGGAAGGATGAGCCCCGTCGTGTCGAAGCTATTGACACAGCGCTATTCCGTCTTTATGTTCACTTTAAGCGGTACAAGGAGCTGCTTGTTCTGATTCAGGAACCGAACCCTGGTGGTGTGCTTGGTTGGCAAGTTTACGGCTC
+
#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGE
Here is the code for two mismatches. I tried with an exact match and the speed is not bad: it takes around a day. I have used the Time::Progress module. When I allow 2 mismatches it shows 115 days to finish. How can the speed be improved here?
#!/usr/bin/perl
use strict;
use warnings;
$| = 1;
open( IN_P1, "big_file.fastq" ) or die "File not found";
my ( @sample_file_names, @barcode1 );
open( BC_FILE, "to_search.txt" ) or die "No barcode file";
my @barcode_file_content = <BC_FILE>;
foreach (@barcode_file_content) {
chomp $_;
$_ =~ s/\r//;
$_ =~ s/\n//;
#print $_;
my @elements = split( "(\t|,| )", $_ );
push @sample_file_names, $elements[0];
push @barcode1, $elements[2];
}
# open FH
my @fh_array_R1;
foreach (@sample_file_names) {
chomp $_;
local *OUT_R1;
open( OUT_R1, ">", "$_\.fq" ) or die "cannot write file";
push @fh_array_R1, *OUT_R1;
}
# unknown barcode file
open( UNKNOWN_R1, ">unknown-barcode_SE.fq" ) or die "cannot create unknown-r1 file";
while ( defined( my $firstp1 = <IN_P1> ) ) {
my $p1_first_line = $firstp1;
my $p1_second_line = <IN_P1>;
my $p1_third_line = <IN_P1>;
my $p1_fourth_line = <IN_P1>;
chomp( $p1_first_line, $p1_second_line, $p1_third_line, $p1_fourth_line, );
my $matched_R1 = "$p1_first_line\n$p1_second_line\n$p1_third_line\n$p1_fourth_line\n";
for ( my $j = 0 ; $j < scalar @barcode1 ; $j++ ) {
chomp $barcode1[$j];
my $barcode1_regex = make_barcode_fragments( $barcode1[$j] );
if ( $p1_second_line =~ /$barcode1_regex/i ) {
# keep if matched
print { $fh_array_R1[$j] } $matched_R1;
last;
}
else {
#print to unknown;
print UNKNOWN_R1 $matched_R1;
}
}
}
# make two-mismatch pattern of barcode
sub make_barcode_fragments {
my ($in1) = @_;
my @subpats;
for my $i ( 0 .. length($in1) - 1 ) {
for my $j ( $i + 1 .. length($in1) - 1 ) {
my $subpat = join( '',
substr( $in1, 0, $i ),
'\\w', substr( $in1, $i + 1, $j - $i - 1 ),
'\\w', substr( $in1, $j + 1 ),
);
push @subpats, $subpat;
}
}
my $pat = join( '|', @subpats );
#print $pat;
return "$pat";
}
exit;
If your algorithm cannot be changed/improved in Perl itself, you can still get a speedup by writing the time-consuming parts in C. Here is an example using Inline C:
use strict;
use warnings;
use Benchmark qw(timethese);
use Inline C => './check_line_c.c';
my $find = "MATCH1";
my $search = "saasdadadadadasd";
my %sub_info = (
c => sub { check_line_c( $find, $search ) },
perl => sub { check_line_perl( $find, $search ) },
);
timethese( 4_000_000, \%sub_info );
sub check_line_perl {
my ($find, $search ) = @_;
my $max_distance = 2;
for my $offset ( 0 .. length($search) - length($find) ) {
my $substr = substr( $search, $offset, length($find) );
my $hd = hd( $find, $substr );
if ( $hd <= $max_distance ) {
return ( $hd, $substr );
}
}
return ( undef, undef );
}
sub hd {
return ( $_[0] ^ $_[1] ) =~ tr/\001-\377//;
}
where check_line_c.c is:
void check_line_c( char* find, char * search ) {
int max_distance = 2;
int flen = strlen(find);
int last_ind = strlen(search) - flen;
SV *dis = &PL_sv_undef;
SV *match = &PL_sv_undef;
for ( int ind = 0; ind <= last_ind; ind++ )
{
int count = 0;
for ( int j = 0; j < flen; j++ )
{
if ( find[j] ^ search[ind+j] ) count++;
}
if ( count <= max_distance )
{
match = newSV(flen);
sv_catpvn(match, search+ind, flen );
dis = newSViv(count);
break;
}
}
Inline_Stack_Vars;
Inline_Stack_Reset;
Inline_Stack_Push(sv_2mortal(dis));
Inline_Stack_Push(sv_2mortal(match));
Inline_Stack_Done;
}
The output is (Ubuntu laptop using Intel Core i7-4702MQ CPU @ 2.20GHz):
Benchmark: timing 4000000 iterations of c, perl...
c: 2 wallclock secs ( 0.76 usr + 0.00 sys = 0.76 CPU) @ 5263157.89/s (n=4000000)
perl: 19 wallclock secs (18.30 usr + 0.00 sys = 18.30 CPU) @ 218579.23/s (n=4000000)
So this gives a 24-fold speedup for this case.
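To plug check_line_c() into the original barcode loop, the call might look roughly like this (a sketch only; it assumes the Inline C block above has been compiled into the script, and it replaces the $barcode1_regex match):
# Inside the per-record loop, in place of the regex match:
my ( $distance, $matched ) = check_line_c( $barcode1[$j], uc $p1_second_line );
# uc() because the C comparison is case-sensitive, unlike the original /i match.
if ( defined $distance ) {
    print { $fh_array_R1[$j] } $matched_R1;    # within 2 mismatches: keep the record
    last;
}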
I'd suggest creating a really bad hashing algorithm. Something nice and reversible and inefficient, like the sum of the characters. Or maybe the sum of unique values (1-4) represented by the characters.
Compute the target sums, and also compute the maximum allowed variance. That is, if the objective is a match with two substitutions, then what is the maximum possible difference? (4-1 + 4-1 = 6).
Then, for each "window" of text of the appropriate length in the target data file, compute a running score. (Add a character to the end, drop a character from the start, update the hash score.) If the score for a window is within the allowable range, you can do further investigation.
You might want to implement this as different passes. Possibly even as different stages in a shell pipeline or script. The idea being that you might be able to parallelize parts of the search. (For instance, all the match strings with the same length could be searched by one process, since the hash windows are the same.)
Also, of course, it is beneficial that you can keep your early work if your program crashes in the later stages. And you can even have the early parts of the process running while you are still developing the end stages.
I have a script which parses an XLSX document and converts it into another format.
In the part where the XLSX file is read, I have an exit condition:
do
{
...
...
...
}
until (uc(trim($worksheet -> get_cell($indexRow, 0) -> value())) eq "");
Unfortunately, when I execute the script, I receive this error:
Can't call method "value" on an undefined value at myfilename.pl line 94
Can anyone give me a hint or explain what I am doing wrong, please?
By the way, I am a beginner in Perl programming.
The expression
$worksheet -> get_cell($indexRow, 0)
is returning undef, so no methods can be called on it.
My advice is to refactor the code this way (but everybody has their own preference); note that the second condition takes spaces in the cell value into account via trim():
while (1) # 1 is true
{
...
...
...
# suppose $indexrow has been calculated in the loop.
if ( ! $worksheet -> get_cell($indexRow, 0) ) {
# do something (cell is not defined)
# Cell is empty?
}
elsif ( ! trim( $worksheet->get_cell($indexRow,0)->value() ) ) {
last; #escape this loop
}
}
An empty string is false in Perl, so trim(...) being true means "the string is holding text"; the ! negates that, so the loop is left when the trimmed cell value is empty.
Hope it helps.