Parsing Excel of ASCII format in Perl - excel

We have a Perl script whose job is to read an Excel file and convert it into a flat file. We get the Excel file from another system on a shared location.
The other system generates a flat file by dumping tab-separated data into a file and appending the extension .xls. The problem is that if the file contains a string with a leading 0, e.g. 012345, Excel displays it as 12345. To preserve the leading 0, they write the data in this fashion (in Java):
"=\"" + some string + "\""
Now if we open the file in Excel it displays properly, with no = or ", but when reading via Perl we get the string as it is, i.e. ="some string".
How can we work around this? I have tried trimming the leading =" and trailing ", but that does not feel like a clean solution. Can someone suggest anything else?

My suggestion (with the limited information you've given) would be to read the source file using Text::CSV, and then output via whatever means you would otherwise. For substitution, use regular expressions.
Simplified example (partially taken straight from the documentation):
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# set binary attribute and use a tab as a separator:
my $csv = Text::CSV->new( { binary => 1, sep_char => "\t" } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();
open my $fh, "<:encoding(utf8)", "test.xls" or die "test.xls: $!";
my @rows;
while ( my $row = $csv->getline($fh) ) {
    foreach my $column (@$row) {
        $column =~ s/^="|"$//g;    # remove opening '="' and closing '"'
    }
    push @rows, $row;
}
$csv->eof or $csv->error_diag();
close $fh;
# do file writing magic or further processing here
In case you don't know, the 's' at the beginning of the regular expression indicates you want to substitute, and the 'g' at the end means "repeat for all matches".
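To then write the cleaned rows back out, here is a minimal sketch of the output step (the file name cleaned.txt and the tab-separated format are assumptions; adjust to whatever your flat file needs):
open my $out, ">:encoding(utf8)", "cleaned.txt" or die "cleaned.txt: $!";
print {$out} join("\t", @$_), "\n" for @rows;
close $out;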
For more information, see:
https://metacpan.org/pod/Text::CSV
http://perldoc.perl.org/perlretut.html (Perl documentation on regular expressions)

Related

Search a multi-line string in multiple files in a directory

The string to be searched is:
the file_is being created_automaically {
period=20ns }
The Perl script I am using follows (it works fine for a single-line string but not for multi-line):
#!/usr/bin/perl
my $dir = "/home/vikas";
my @files = glob( $dir . '/*' );
#print "@files";
system ("rm -rf $dir/log.txt");
my $list;
foreach $list(@files){
if( !open(LOGFILE, "$list")){
open (File, ">>", "$dir/log.txt");
select (File);
print " $list \: unable to open file";
close (File);
else {
while (<LOGFILE>){
if($_ =~ /".*the.*automaically.*\{\n.*period\=20ns.*\}"/){
open (File, ">>", "$dir/log.txt");
select (File);
print " $list \: File contain the required string\n";
close (File);
break;
}
}
close (LOGFILE);
}
}
This code does not compile; it contains errors that cause it to fail to execute. You should never post code that you have not first tried to run.
The root of your problem is that for a multiline match you cannot read the file in line-by-line mode; you have to slurp the whole file into a variable. However, your program contains many more flaws. I will demonstrate. What follows are excerpts of your code (with fixed indentation and missing curly braces).
First off, always use:
use strict;
use warnings;
This will save you many headaches and long searches for hidden problems.
system ("rm -rf $dir/log.txt");
This is better done in Perl, where you can control for errors:
unlink "$dir/log.txt" or die "Cannot delete '$dir/log.txt': $!";
foreach my $list (@files) {
# ^^
Declare the loop variable in the loop itself, not before it.
if( !open(LOGFILE, "$list")){
open (File, ">>", "$dir/log.txt");
select (File);
print " $list \: unable to open file";
close (File);
You never have to explicitly select a file handle before you print to it. You just print to the file handle: print File "....". What you are doing is just changing the STDOUT file handle, which is not a good thing to do.
Also, this is error logging, which should go to STDERR instead. This can be done simply by opening STDERR to a file at the beginning of your program. Why do this? Because when the program does not run at a terminal, for example via the web or from some other process, STDERR does not show up on your screen. When you are debugging at a terminal, though, it is just extra work.
open STDERR, ">", "$dir/log.txt" or die "Cannot open 'log.txt' for overwrite: $!";
This has the added benefit of you not having to delete the log first. And now you do this instead:
if (! open LOGFILE, $list ) {
warn "Unable to open file '$list': $!";
} else ....
warn goes to STDERR, so it is basically the same as print STDERR.
Speaking of open, you should use three argument open with explicit file handle. So it becomes:
if (! open my $fh, "<", $list )
} else {
while (<LOGFILE>) {
Since you are looking for a multiline match, you need to slurp the file(s) instead. This is done by setting the input record separator to undef. Typically like this:
my $file = do { local $/; <$fh> }; # $fh is our file handle, formerly LOGFILE
Next how to apply the regex:
if($_ =~ /".*the.*automaically.*\{\n.*period\=20ns.*\}"/) {
$_ =~ is optional. A regex automatically matches against $_ if no other variable is used.
You should probably not use " in the regex, unless the target string actually contains ". I don't know why you put it there; maybe you think strings need to be quoted inside a regex. If so, that is wrong. To match the string you have above, you do:
if( /the.*automaically.*{.*period=20ns.*}/s ) {
You don't have to backslash curly braces {} or the equals sign =. You don't have to use quotes. The /s modifier makes . (the wildcard period) also match newline, so we can remove \n. We can remove .* from the start and end of the pattern, because regex matches are partial unless anchors are used.
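For example, a minimal check with the target string inlined:
my $text = "the file_is being created_automaically {\n    period=20ns }";
# without /s the . would stop at the newline and the match would fail
print "matched\n" if $text =~ /created_automaically .*period=20ns/s;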
break;
The break keyword is only used with the switch feature, which is experimental, and which you are neither using nor have enabled. So here it is just a bareword, which is wrong. If you want to exit a loop prematurely, you use last. Note that we don't have to use last at all, because we slurp the file, so we have no loop.
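For illustration, a minimal sketch of last in a line-by-line loop (not needed in the slurp approach used below):
while (my $line = <$input>) {
    if ($line =~ /period=20ns/) {
        print "found\n";
        last;    # exit the loop early; Perl's counterpart to "break"
    }
}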
Also, you should generally pick suitable variable names. If you have a list of files, the variable that holds each file name should not be called $list; the logical name is $file. Likewise, the input file handle should not be called LOGFILE; call it $input, or $infh (input file handle).
This is what I get if I apply the above to your program:
use strict;
use warnings;

my $dir = "/home/vikas";
my @files = glob( $dir . '/*' );
my $logfile = "$dir/log.txt";
open STDERR, ">", $logfile or die "Cannot open '$logfile' for overwrite: $!";

foreach my $file (@files) {
    if (! open my $input, "<", $file) {
        warn "Unable to open '$file': $!";
    } else {
        my $txt = do { local $/; <$input> };    # slurp the whole file
        if ($txt =~ /the.*automaically.*{.*period=20ns.*}/s) {
            print " $file : File contains the required string\n";
        }
    }
}
Note that the print goes to STDOUT, not to the error log. It is not common practice to have STDOUT and STDERR to the same file. If you want, you can simply redirect output in the shell, like this:
$ perl foo.pl > output.txt
The following sample code demonstrates usage of a regex for the multiline case, with a logger($fname,$msg) subroutine.
The snippet assumes that the input files are relatively small and can each be read into the variable $data (i.e. that the computer has enough memory to hold them).
NOTE: the input data files should be distinguishable from the other files in the home directory $ENV{HOME}; in this sample they are assumed to match the pattern test_*.dat. You probably do not intend to scan absolutely every file in your home directory (there could be many thousands, while you are interested in only a few).
#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';

my ($dir, $re, $logfile);

$dir     = '/home/vikas/';
$re      = qr/the file_is being created_automaically \{\s+period=20ns\s+\}/;
$logfile = $dir . 'logfile.txt';

unlink $logfile if -e $logfile;

for ( glob($dir . "test_*.dat") ) {
    if ( open my $fh, '<', $_ ) {
        my $data = do { local $/; <$fh> };    # slurp the file
        close $fh;
        logger($logfile, "INFO: $_ contains the required string")
            if $data =~ /$re/gsm;
    } else {
        logger($logfile, "WARN: unable to open $_");
    }
}

exit 0;

sub logger {
    my $fname = shift;
    my $text  = shift;
    open my $fh, '>>', $fname
        or die "Couldn't open $fname: $!";
    say $fh $text;
    close $fh;
}
Reference: regex modifiers, unlink, perlvar

How can I split my data into small enough chunks to feed to Seq?

I am working on a bioinformatics project where I am looking at very large genomes. Seg only reads 135 lines at a time, so when we feed the genomes in it gets overloaded. I am trying to create a Perl command that will split the input into 135-line sections. The character limit would be 10,800 since there are 80 columns. This is what I have so far:
#!/usr/bin/perl
use warnings;
use strict;
my $str =
'>AATTCCGG
TTCCGGAA
CCGGTTAA
AAGGTTCC
>AATTCCGG';
substr($str,17) = "";
print "$str";
It splits at the 17th character but only prints that section; I want it to continue printing the rest of the data. How do I add a command that allows the rest of the data to be shown? It should split at every 17th character and continue. (Then of course I can go back in and scale it up to the size I actually need.)
I assume that the "very large genome" is stored in a very large file, and that it is fine to collect data by number of lines (and not by number of characters) since this is the first mentioned criterion.
Then you can read the file line by line and assemble lines until there are 135 of them. Then hand them off to a program or routine that processes them, empty your buffer, and keep going:
use warnings;
use strict;
use feature 'say';

my $file = shift || 'default_filename.txt';
my $num_lines_to_process = 135;

open my $fh, '<', $file or die "Can't open $file: $!";

my $line_counter = 0;
my @buffer;

while (<$fh>) {
    chomp;
    if ($line_counter == $num_lines_to_process) {
        process_data(\@buffer);
        @buffer = ();
        $line_counter = 0;
    }
    push @buffer, $_;
    ++$line_counter;
}
process_data(\@buffer) if @buffer;    # last batch

sub process_data {
    my ($rdata) = @_;
    say for @$rdata; say '---';    # print data for a test
}
If your processing application/routine wants a string, you can append to a string every time instead of adding to an array, $buffer .= $_; and clear that by $buffer = ''; as needed.
If you need to pass a string but there is also some use of an array while collecting data (intermediate checks/pruning/processing?), then collect lines into an array and use as needed, and join into a string before handing it off, my $data = join '', @buffer;
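A minimal sketch of that string-buffer variant, assuming process_data takes a single string instead of an array reference:
my $buffer = '';
my $line_counter = 0;
while (<$fh>) {
    $buffer .= $_;    # keep the newline as part of the chunk
    if (++$line_counter == $num_lines_to_process) {
        process_data($buffer);
        $buffer = '';
        $line_counter = 0;
    }
}
process_data($buffer) if length $buffer;    # last batch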
You can also make use of the $. variable and the modulo operator (%)
while (<$fh>) {
    chomp;
    push @buffer, $_;
    if ($. % $num_lines_to_process == 0) {    # every $num_lines_to_process
        process_data(\@buffer);
        @buffer = ();
    }
}
process_data(\@buffer) if @buffer;    # last batch
In this case we need to first store a line and then check its number, since $. (line number read from a filehandle, see docs linked above) starts from 1 (not 0).
substr returns the removed part of a string; you can just run it in a loop:
while (length $str) {
    my $substr = substr $str, 0, 17, "";
    print $substr, "\n";
}
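With the $str from the question this prints 17-character chunks; note that the embedded newlines count toward the 17, so the first chunk printed is ">AATTCCGG", its newline, and the first seven characters of the next line.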

Perl: String in Substring or Substring in String

I'm working with DNA sequences in a file, and this file is formatted something like this, though with more than one sequence:
>name of sequence
EXAMPLESEQUENCEATCGATCGATCG
I need to be able to tell whether a variable (which is also a sequence) matches any of the sequences in the file, and if so, the name of the sequence it matches. Because of the nature of these sequences, my entire variable could be contained in a line of the file, or a line of the file could be a part of my variable.
Right now my code looks something like this:
use warnings;
use strict;
my $filename = "/users/me/file/path/file.txt";
my $exampleentry = "ATCG";
my $returnval = "The sequence does not match any in the file";
open file, "<$filename" or die "Can't find file";
my @Name;
my @Sequence;
my $inx = 0;
while (<file>){
$Name[$inx] = <file>;
$Sequence[$inx] = <file>;
$indx++;
}unless(index($Sequence[$inx], $exampleentry) != -1 || index($exampleentry, $Sequence[$inx]) != -1){
$returnval = "The sequence matches: ". $Name[$inx];
}
print $returnval;
However, even when I purposely set $entry as a match from the file, I still return The sequence does not match any in the file. Also, when running the code, I get Use of uninitialized value in index at thiscode.pl line 14, <file> line 3002. as well as Use of uninitialized value within @Name in concatenation (.) or string at thiscode.pl line 15, <file> line 3002.
How can I perform this search?
I will assume that the purpose of this script is to determine if $exampleentry matches any record in the file file.txt. A record describes here a DNA sequence and corresponds to three consecutive lines in the file. The variable $exampleentry will match the sequence if it matches the third line of the record. A match means here that either
$exampleentry is a substring of $line, or
$line is a substring of $exampleentry,
where $line refers to the corresponding line in the file.
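In code, that two-way test can be wrapped in a small helper (a sketch; the name is_match is made up here):
sub is_match {
    my ($line, $entry) = @_;
    # index returns -1 when the substring is absent
    return index($line, $entry) != -1      # $entry inside $line
        || index($entry, $line) != -1;     # $line inside $entry
}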
First, consider the input file file.txt:
>name of sequence
EXAMPLESEQUENCEATCGATCGATCG
In the program you try to read these two lines using three calls to readline. Accordingly, the last call to readline will return undef, since there are no more lines to read.
It therefore seems reasonable that the last two lines in file.txt are malformed, and the correct format should be:
>name of sequence
EXAMPLESEQUENCE
ATCGATCGATCG
If I now understand you correctly, I hope this could solve your problem:
use feature qw(say);
use strict;
use warnings;

my $filename = "file.txt";
my $exampleentry = "ATCG";
my $returnval = "The sequence does not match any in the file";
open (my $fh, '<', $filename) or die "Can't find file: $!";
my @name;
my @sequence;
my $inx = 0;
while (<$fh>) {
    chomp ($name[$inx] = $_);           # line 1 of the record: ">name of sequence"
    chomp (my $middle = <$fh>);         # line 2 of the record (not matched here)
    chomp ($sequence[$inx] = <$fh>);    # line 3 of the record: the sequence to test
    if (
        index($sequence[$inx], $exampleentry) != -1
        || index($exampleentry, $sequence[$inx]) != -1
    ) {
        $returnval = "The sequence matches: " . $name[$inx];
        last;
    }
}
say $returnval;
Notes:
I have changed variable names to follow snake_case convention. For example, the variable @Name is better written using all lower case as @name.
I changed the open() call to follow the new recommended 3-parameter style, see Don't Open Files in the old way for more information.
Used feature say instead of print
Added a chomp after each readline to avoid storing newline characters in the arrays.
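For reference, the chomp-on-assignment idiom used in the loop reads a line and strips its line ending in one step:
chomp(my $line = <$fh>);    # assign the line, then chomp the new variable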

Adding custom header to specific files in a directory

I would like to add a unique one-line header to each FOCUS*.tsv file in a specified directory. After that, I would like to combine all of these files into one file.
First I tried the sed command:
my $cmd9 = `sed -i '1i$SampleID[4]' $tsv_file`; print $cmd9;
It looked like it worked, but after I combined all of these files into one file in the next section of the code, the inserted row was listed four times for each file.
I then tried the following Perl script to accomplish the same, but it deleted the content of the file and only printed out the added header.
I’m looking for the simplest way to accomplish what I’m looking for.
Here is what I’ve tried.
#!perl
use strict;
use warnings;
use Tie::File;
my $home="/data/";
my $tsv_directory = $home."test_all_runs/".$ARGV[0];
my $tsvfiles = $home."test_all_runs/".$ARGV[0]."/tsv_files.txt";
my @run_directory = (); @run_directory = split /\//, $tsv_directory; print "The run directory is #############".$run_directory[3]."\n";
my $cmd = `ls $tsv_directory/FOCUS*\.tsv > $tsvfiles`; #print "$cmd";
my $cmda = "ls $tsv_directory/FOCUS*\.tsv > $tsvfiles"; #print "$cmda";
my @tsvfiles =();
#this code opens the vcf_files.txt file and passes each line into an array for indidivudal manipulation
open(TXT2, "$tsvfiles");
while (<TXT2>){
push (@tsvfiles, $_);
}
close(TXT2);
foreach (@tsvfiles){
chop($_);
}
#this loop works fine
for my $tsv_file (@tsvfiles){
open my $in, '>', $tsv_file or die "Can't write new file: $!";
open my $out, '>', "$tsv_file.new" or die "Can't write new file: $!";
$tsv_file =~ m|([^/]+)-oncomine.tsv$| or die "Can't extract Sample ID";
my $sample_id = $1;
#print "The sample ID is ############## $sample_id\n";
my $headerline = $run_directory[3]."/".$sample_id;
print $out $headerline;
while( <$in> ) {
print $out $_;
}
close $out;
close $in;
unlink($tsv_file);
rename("$tsv_file.new", $tsv_file);
}
Thank you
Apparently, the wrong '>' when opening the file for reading was the problem and it got solved.
However, I'd like to make a few comments on some of the rest of the code.
The list of files is built by running external ls redirected to a file, then reading this file into an array. However, that is exactly the job of glob and all of that is replaced by
my @tsvfiles = glob "$tsv_directory/FOCUS*.tsv";
Then you don't need the chomp either, and the chop that is used would actually hurt, since it removes the last character, not just the newline (really $/). If you are removing the line ending, use chomp, not chop.
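A quick illustration of the difference:
my $s = "sample\n";
chomp $s;   # removes the trailing line ending: "sample"
chop  $s;   # removes the last character unconditionally: "sampl"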
To extract a match and assign it, a common idiom is
my ($sample_id) = $tsv_file =~ m|([^/]+)-oncomine.tsv$|
    or die "Can't extract Sample ID from '$tsv_file'";
Note that the die message should say what failed. $! is set by failed system calls (such as open or move), not by a failed regex match, so here we report the offending file name instead.
The unlink and rename appear to be overwriting one file with another. You can do that by using move from the core module File::Copy
use File::Copy qw(move);
move ("$tsv_file.new", $tsv_file)
    or die "Can't move $tsv_file.new to $tsv_file: $!";
which renames the .new file into $tsv_file, overwriting it.
As for how the files need to be combined, more precise explanation would be needed.

Split a PDF by Bookmarks?

I need to process single PDFs that have each been created by 'merging' multiple PDFs. Each merged PDF marks the places where the component PDFs start with a bookmark.
Is there any way to automatically split this up by bookmarks with a script?
We only have the bookmarks to indicate the parts, not the page numbers, so we would need to infer the page numbers from the bookmarks. A Linux tool would be best.
pdftk can be used to split the PDF file and extract the page numbers of the bookmarks.
To get the page numbers of the bookmarks do
pdftk in.pdf dump_data
and make your script read the page numbers from the output.
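The bookmark records in that output look like this (titles, levels, and page numbers will vary with the document):
BookmarkBegin
BookmarkTitle: Chapter 1
BookmarkLevel: 1
BookmarkPageNumber: 3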
Then use
pdftk in.pdf cat A-B output out_A-B.pdf
to get the pages from A to B into out_A-B.pdf.
The script could be something like this:
#!/bin/bash
infile=$1 # input pdf
outputprefix=$2
[ -e "$infile" -a -n "$outputprefix" ] || exit 1 # Invalid args
pagenumbers=( $(pdftk "$infile" dump_data | \
grep '^BookmarkPageNumber: ' | cut -f2 -d' ' | uniq)
end )
for ((i=0; i < ${#pagenumbers[#]} - 1; ++i)); do
a=${pagenumbers[i]} # start page number
b=${pagenumbers[i+1]} # end page number
[ "$b" = "end" ] || b=$[b-1]
pdftk "$infile" cat $a-$b output "${outputprefix}"_$a-$b.pdf
done
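Assuming the script above is saved as split_by_bookmarks.sh (the name is arbitrary), a run would look like:
$ ./split_by_bookmarks.sh in.pdf out
which writes one file per bookmark range, named like out_1-12.pdf.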
There's a command line tool written in Java called Sejda where you can find the splitbybookmarks command that does exactly what you asked. It's Java, so it runs on Linux, and being a command line tool you can write a script to do it.
Disclaimer
I'm one of the authors
There are programs built for PDF splitting, like A-PDF Split, that can do that for you:
A-PDF Split is a very simple, lightning-quick desktop utility program that lets you split any Acrobat pdf file into smaller pdf files. It provides complete flexibility and user control in terms of how files are split and how the split output files are uniquely named. A-PDF Split provides numerous alternatives for how your large files are split - by pages, by bookmarks and by odd/even page. Even you can extract or remove part of a PDF file. A-PDF Split also offers advanced defined splits that can be saved and later imported for use with repetitive file-splitting tasks. A-PDF Split represents the ultimate in file splitting flexibility to suit every need.
A-PDF Split works with password-protected pdf files, and can apply various pdf security features to the split output files. If needed, you can recombine the generated split files with other pdf files using a utility such as A-PDF Merger to form new composite pdf files.
A-PDF Split does NOT require Adobe Acrobat, and produces documents compatible with Adobe Acrobat Reader Version 5 and above.
Edit: I also found a free, open-source program here if you do not want to pay.
Here's a little Perl program I use for the task. Perl isn't special; it's just a wrapper around pdftk to interpret its dump_data output to turn it into page numbers to extract:
#!perl
use v5.24;
use warnings;
use Data::Dumper;
use File::Path qw(make_path);
use File::Spec::Functions qw(catfile);

my $pdftk = '/usr/local/bin/pdftk';
my $file = $ARGV[0];
my $split_dir = $ENV{PDF_SPLIT_DIR} // 'pdf_splits';

die "Can't find $ARGV[0]\n" unless -e $file;

# Read the data that pdftk spits out.
open my $pdftk_fh, '-|', $pdftk, $file, 'dump_data';

my @chapters;
while( <$pdftk_fh> ) {
    state $chapter = 0;
    next unless /\ABookmark/;
    if( /\ABookmarkBegin/ ) {
        my( $title )       = <$pdftk_fh> =~ /\ABookmarkTitle:\s+(.+)/;
        my( $level )       = <$pdftk_fh> =~ /\ABookmarkLevel:\s+(.+)/;
        my( $page_number ) = <$pdftk_fh> =~ /\ABookmarkPageNumber:\s+(.+)/;

        # I only want to split on chapters, so I skip higher
        # level numbers (higher means more nesting, 1 is lowest).
        next unless $level == 1;

        # If you have front matter (preface, etc) then this numbering
        # will be off. Chapter 1 might be called Chapter 3.
        push @chapters, {
            title      => $title,
            start_page => $page_number,
            chapter    => $chapter++,
        };
    }
}

# The end page for one chapter is one before the start page for
# the next chapter. There might be some blank pages at the end
# of the split for PDFs where the next chapter needs to start on
# an odd page.
foreach my $i ( 0 .. $#chapters - 1 ) {
    my $last_page = $chapters[$i+1]->{start_page} - 1;
    $chapters[$i]->{last_page} = $last_page;
}
$chapters[$#chapters]->{last_page} = 'end';

make_path $split_dir;

foreach my $chapter ( @chapters ) {
    my( $start, $end ) = $chapter->@{qw(start_page last_page)};

    # slugify the title to use it as a filename
    my $title = lc( $chapter->{title} =~ s/[^a-z]+/-/gri );
    my $path  = catfile( $split_dir, "$title.pdf" );
    say "Outputting $path";

    # Use pdftk to extract that part of the PDF
    system $pdftk, $file, 'cat', "$start-$end", 'output', $path;
}
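A hypothetical invocation, assuming the program above is saved as split_chapters.pl:
$ PDF_SPLIT_DIR=chapters perl split_chapters.pl book.pdf
This writes one PDF per level-1 bookmark into the chapters/ directory, each named after its slugified bookmark title.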
