Perl: Extract data from different sections of a text file simultaneously - string

I want to extract data from different sections of a text file simultaneously. Is it possible to open the file using two separate filehandles (as shown below)? Or is it possible to cache the location of the first filehandle and then return to that point in the document when I close the second one?
Note: I am only reading data from the text file, never writing to it.
open( $filehandle, '<:encoding(UTF-8)', $filename )
or die "Could not open file '$filename' $!";
$row = <$filehandle>;
{
replace_unicode_char();
if ( $row =~ /$table_num/ ) {
open( $filehandle_reg, '<:encoding(UTF-8)', $filename )
or die "Could not open file '$filename' $!";
$line = <$filehandle_reg>;
if ( $line =~ /Section\_[0-9]+/ ) {
# Do something...
}
}
}

You can use the seek() function to move around in the file, and the tell() function to get the current position in the file.
So, instead of having two filehandles, have two variables storing a position in the file, and use seek() to jump back and forth between them.
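A minimal sketch of that idea, assuming a small throwaway file built just for the demonstration (the file contents and the variable names $resume and $section are illustrative, not taken from the question):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small sample file so the sketch is self-contained.
my ($tmp, $filename) = tempfile(UNLINK => 1);
print {$tmp} "Table 1\nrow a\nSection_1\nrow b\n";
close $tmp;

open my $fh, '<', $filename or die "Could not open '$filename': $!";

my $first  = <$fh>;       # read the "Table 1" line
my $resume = tell $fh;    # remember where the main scan stopped

# Wander off in the same handle: scan ahead for a section line.
my $section;
while (my $line = <$fh>) {
    if ($line =~ /Section_[0-9]+/) { $section = $line; last }
}

# Jump back to the remembered position (0 means "from start of file").
seek $fh, $resume, 0 or die "seek failed: $!";
my $next = <$fh>;         # reading resumes right after "Table 1"
close $fh;

print "found: $section", "resumed at: $next";
```

Only positions previously returned by tell() are passed to seek() here, which keeps the approach safe even when the handle has an encoding layer.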

Related

search multi line string from multiple files in a directory

The string to be searched is:
the file_is being created_automaically {
period=20ns }
The Perl script I am using is the following (this script works fine for a single-line string but not for a multi-line one):
#!/usr/bin/perl
my $dir = "/home/vikas";
my @files = glob( $dir . '/*' );
#print "@files";
system ("rm -rf $dir/log.txt");
my $list;
foreach $list(@files){
if( !open(LOGFILE, "$list")){
open (File, ">>", "$dir/log.txt");
select (File);
print " $list \: unable to open file";
close (File);
else {
while (<LOGFILE>){
if($_ =~ /".*the.*automaically.*\{\n.*period\=20ns.*\}"/){
open (File, ">>", "$dir/log.txt");
select (File);
print " $list \: File contain the required string\n";
close (File);
break;
}
}
close (LOGFILE);
}
}
This code does not compile; it contains errors that cause it to fail to execute. You should never post code that you have not first tried to run.
The root of your problem is that for a multiline match, you cannot read the file in line-by-line mode; you have to slurp the whole file into a variable. However, your program contains many flaws. I will demonstrate. Here follow excerpts of your code (with fixed indentation and missing curly braces).
First off, always use:
use strict;
use warnings;
This will save you many headaches and long searches for hidden problems.
system ("rm -rf $dir/log.txt");
This is better done in Perl, where you can control for errors:
unlink "$dir/log.txt" or die "Cannot delete '$dir/log.txt': $!";
foreach my $list (@files) {
# ^^
Declare the loop variable in the loop itself, not before it.
if( !open(LOGFILE, "$list")){
open (File, ">>", "$dir/log.txt");
select (File);
print " $list \: unable to open file";
close (File);
You never have to explicitly select a file handle before you print to it. You just print to the file handle: print File "....". What you are doing is just changing the STDOUT file handle, which is not a good thing to do.
Also, this is error logging, which should go to STDERR instead. This can be done simply by reopening STDERR to a file at the beginning of your program. This is worth doing when the program does not run at a terminal, for example when it is started by a web server or some other process where STDERR does not show up on your screen. If you are debugging at a terminal, it is just extra work.
open STDERR, ">", "$dir/log.txt" or die "Cannot open 'log.txt' for overwrite: $!";
This has the added benefit of you not having to delete the log first. And now you do this instead:
if (! open LOGFILE, $list ) {
warn "Unable to open file '$list': $!";
} else ....
warn goes to STDERR, so it is basically the same as print STDERR.
Speaking of open, you should use three argument open with explicit file handle. So it becomes:
if (! open my $fh, "<", $list )
} else {
while (<LOGFILE>) {
Since you are looking for a multiline match, you need to slurp the file(s) instead. This is done by setting the input record separator to undef. Typically like this:
my $file = do { local $/; <$fh> }; # $fh is our file handle, formerly LOGFILE
Next how to apply the regex:
if($_ =~ /".*the.*automaically.*\{\n.*period\=20ns.*\}"/) {
$_ =~ is optional. A regex automatically matches against $_ if no other variable is used.
You should probably not use " in the regex. Unless you have " in the target string. I don't know why you put it there, maybe you think strings need to be quoted inside a regex. If you do, that is wrong. To match the string you have above, you do:
if( /the.*automaically.*{.*period=20ns.*}/s ) {
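You don't have to escape the equal sign =, and you don't have to use quotes. The curly braces {} are written unescaped here, although newer Perl versions deprecate or reject an unescaped literal { in some positions, so writing \{ is the safer habit. The /s modifier makes . (wildcard character period) also match newline, so we can remove \n. We can remove .* from the start and end of the pattern, because that is implied: regex matches are always partial unless anchors are used.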
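To see the effect of slurping and of /s in isolation, here is a small sketch that keeps the target text in a string instead of a file (the surrounding "header" and "footer" lines are made up for the demo, and the literal { is escaped as \{):

```perl
use strict;
use warnings;

# The two-line target from the question, wrapped in made-up context lines.
my $text = "header\nthe file_is being created_automaically {\nperiod=20ns }\nfooter\n";

# Line by line: no single line holds both halves, so nothing matches.
my $line_hits = grep { /the.*automaically.*\{.*period=20ns.*\}/s } split /^/, $text;

# Whole text at once: /s lets . cross the newline, so the match succeeds.
my $slurp_hit = $text =~ /the.*automaically.*\{.*period=20ns.*\}/s ? 1 : 0;

# Without /s, . stops at the newline and the same pattern fails.
my $no_s_hit = $text =~ /the.*automaically.*\{.*period=20ns.*\}/ ? 1 : 0;

print "line-by-line: $line_hits, slurped: $slurp_hit, without /s: $no_s_hit\n";
```

This prints "line-by-line: 0, slurped: 1, without /s: 0", which is why both the slurp and the /s modifier are needed.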
break;
The break keyword is only used with the switch feature, which is experimental, plus you don't use it, or have it enabled. So it is just a bareword, which is wrong. If you want to exit a loop prematurely, you use last. Note that we don't have to use last because we slurp the file, so we have no loop.
Also, you generally should pick suitable variable names. If you have a list of files, the variable that contains the file name should not be called $list, I think. It is logical that it is called $file. And the input file handle should not be called LOGFILE, it should be called $input, or $infh (input file handle).
This is what I get if I apply the above to your program:
use strict;
use warnings;
my $dir = "/home/vikas";
my @files = glob( $dir . '/*' );
my $logfile = "$dir/log.txt";
open STDERR, ">", $logfile or die "Cannot open '$logfile' for overwrite: $!";
foreach my $file (@files) {
    if(! open my $input, "<", $file) {
        warn "Unable to open '$file': $!";
    } else {
        my $txt = do { local $/; <$input> };
        if($txt =~ /the.*automaically.*\{.*period=20ns.*\}/s) {
            print " $file : File contain the required string\n";
        }
    }
}
Note that the print goes to STDOUT, not to the error log. It is not common practice to have STDOUT and STDERR to the same file. If you want, you can simply redirect output in the shell, like this:
$ perl foo.pl > output.txt
The following sample code demonstrates the use of a regex for the multiline case, together with a logger($fname,$msg) subroutine.
The code snippet assumes that the input files are relatively small and can be read whole into a variable $data (that is, the computer has enough memory to hold each file).
NOTE: the input data files should be distinguishable from the rest of the files in the home directory $ENV{HOME}. In this code sample these files are assumed to match the pattern test_*.dat; you probably do not intend to scan absolutely all files in your home directory (there could be many thousands of files while you are interested in only a few).
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
my($dir,$re,$logfile);
$dir = '/home/vikas/';
$re = qr/the file_is being created_automaically \{\s+period=20ns\s+\}/;
$logfile = $dir . 'logfile.txt';
unlink $logfile if -e $logfile;
for ( glob($dir . "test_*.dat") ) {
if( open my $fh, '<', $_ ) {
my $data = do { local $/; <$fh> };
close $fh;
logger($logfile, "INFO: $_ contains the required string")
if $data =~ /$re/gsm;
} else {
logger($logfile, "WARN: unable to open $_");
}
}
exit 0;
sub logger {
my $fname = shift;
my $text = shift;
open my $fh, '>>', $fname
or die "Couldn't open $fname: $!";
say $fh $text;
close $fh;
}
Reference: regex modifiers, unlink, perlvar

Adding custom header to specific files in a directory

I would like to add a unique one-line header that pertains to each FOCUS*.tsv file in a specified directory. After that, I would like to combine all of these files into one file.
First I’ve tried sed command.
my $cmd9 = `sed -i '1i$SampleID[4]' $tsv_file`; print $cmd9;
It looked like it worked but after I’ve combined all of these files into one file in the next section of the code, the inserted row was listed four times for each file.
I’ve tried the following Perl script to accomplish the same but it deleted the content of the file and only prints out the added header.
I’m looking for the simplest way to accomplish what I’m looking for.
Here is what I’ve tried.
#!perl
use strict;
use warnings;
use Tie::File;
my $home="/data/";
my $tsv_directory = $home."test_all_runs/".$ARGV[0];
my $tsvfiles = $home."test_all_runs/".$ARGV[0]."/tsv_files.txt";
my @run_directory = (); @run_directory = split /\//, $tsv_directory; print "The run directory is #############".$run_directory[3]."\n";
my $cmd = `ls $tsv_directory/FOCUS*\.tsv > $tsvfiles`; #print "$cmd";
my $cmda = "ls $tsv_directory/FOCUS*\.tsv > $tsvfiles"; #print "$cmda";
my @tsvfiles =();
#this code opens the vcf_files.txt file and passes each line into an array for individual manipulation
open(TXT2, "$tsvfiles");
while (<TXT2>){
    push (@tsvfiles, $_);
}
close(TXT2);
foreach (@tsvfiles){
    chop($_);
}
#this loop works fine
for my $tsv_file (@tsvfiles){
open my $in, '>', $tsv_file or die "Can't write new file: $!";
open my $out, '>', "$tsv_file.new" or die "Can't write new file: $!";
$tsv_file =~ m|([^/]+)-oncomine.tsv$| or die "Can't extract Sample ID";
my $sample_id = $1;
#print "The sample ID is ############## $sample_id\n";
my $headerline = $run_directory[3]."/".$sample_id;
print $out $headerline;
while( <$in> ) {
print $out $_;
}
close $out;
close $in;
unlink($tsv_file);
rename("$tsv_file.new", $tsv_file);
}
Thank you
Apparently, the wrong '>' when opening the file for reading was the problem and it got solved.
However, I'd like to make a few comments on some of the rest of the code.
The list of files is built by running external ls redirected to a file, then reading this file into an array. However, that is exactly the job of glob and all of that is replaced by
my @tsvfiles = glob "$tsv_directory/FOCUS*.tsv";
Then you don't need the chomp either, and the chop that is used would actually hurt since it removes the last character, not only the newline (or really $/).
Use of chop is probably not what you want. If you are removing the linefeed ($/), use chomp.
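A short sketch of the difference, using made-up file names (FOCUS_a.tsv is illustrative only): chomp removes a trailing newline if there is one, while chop removes the last character whatever it is.

```perl
use strict;
use warnings;

my $with_newline = "FOCUS_a.tsv\n";   # e.g. a line read from a listing file
my $plain        = "FOCUS_a.tsv";     # e.g. a name returned by glob

chomp(my $chomped_nl    = $with_newline);   # newline removed
chomp(my $chomped_plain = $plain);          # nothing to remove, unchanged
chop (my $chopped_plain = $plain);          # last character is lost

print "$chomped_nl | $chomped_plain | $chopped_plain\n";
```

The last value comes out as FOCUS_a.ts, which is the kind of silent damage chop causes on names that have no trailing newline.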
To extract a match and assign it, a common idiom is
my ($sample_id) = $tsv_file =~ m|([^/]+)-oncomine.tsv$|
or die "Can't extract Sample ID from '$tsv_file'";
Note that I also added the file name to the message, so we know which file failed. ($! would not help here: it is only set by failed system calls, not by a failed regex match.)
The unlink and rename appear to be overwriting one file with another. You can do that by using move from the core module File::Copy
use File::Copy qw(move);
move ($tsv_file_new, $tsv_file)
or die "Can't move $tsv_file_new to $tsv_file: $!";
which renames the _new into $tsv_file, so overwriting it.
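Putting the pieces together, the header insertion could be sketched roughly as follows. The helper name prepend_header, the sample file, and the header text "run42/FOCUS_sample" are all hypothetical and only serve the demonstration; the original file is opened with '<' for reading, which was the bug in the question.

```perl
use strict;
use warnings;
use File::Copy qw(move);
use File::Temp qw(tempdir);

# Prepend one header line by writing a new copy and moving it over the
# original. prepend_header is a hypothetical helper name.
sub prepend_header {
    my ($path, $header) = @_;
    my $new = "$path.new";

    open my $in,  '<', $path or die "Can't read '$path': $!";
    open my $out, '>', $new  or die "Can't write '$new': $!";

    print {$out} "$header\n";       # header goes on its own line
    print {$out} $_ while <$in>;    # then the original content

    close $in;
    close $out or die "Can't close '$new': $!";
    move($new, $path) or die "Can't move '$new' to '$path': $!";
}

# Demonstration on a throwaway file.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/FOCUS_sample-oncomine.tsv";
open my $fh, '>', $file or die "Can't create '$file': $!";
print {$fh} "col1\tcol2\n1\t2\n";
close $fh;

prepend_header($file, "run42/FOCUS_sample");

open $fh, '<', $file or die "Can't read '$file': $!";
my @lines = <$fh>;
close $fh;
print @lines;
```

Because the helper runs once per file, each file should end up with exactly one header line as long as the script itself is run only once over the directory.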
As for how the files need to be combined, more precise explanation would be needed.

Reading new format in INI file perl

I have another problem regarding the INI file. My teacher had renewed the configuration of the files. Now my codes can't read this new configuration file. The format of it has been changed. How to read this format in the INI file in perl?
[section1]
value1
value2
As you can see, the INI file now contains only values; the parameter names are gone. How can Perl read these lines, which have no parameters and only values? I want to read only the values. I used Config::Tiny before to read the file, but I can't seem to solve this problem with it:
my $file = "file directory";
my $Config = Config::Tiny->read($file);
$Config->{"section1"}->{_};
Are my codes wrong? Because i can't get the output from this code. Can anybody help me fix this problem? Thank you.
You'll need to deal with it yourself:
#!/usr/bin/perl
use 5.10.0;
sub read_not_ini
{
my $file = shift;
my %values;
my $section;
open my $fh, '<', $file or die "Can't read $file: $!";
while (<$fh>)
{
# skip comments, blank lines.
next if /^\s*#/ or /^\s*$/;
# don't need/want end-line character.
chomp;
if (/^\[([^\]]+)\]/)
{
my $s = $1;
$section = $values{$s} //= [];
}
elsif($section)
{
push @$section, $_;
}
else
{
say STDERR "$file: values without a section: line $.";
}
}
return \%values;
}
my $Config = read_not_ini(shift); # pass in as param
say for @{$Config->{"section1"}};
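For orientation, the value returned by read_not_ini is a hash reference that maps each section name to an array reference of its values. A sketch of walking that structure, with the data written out by hand in the shape the sample file from the question would produce:

```perl
use strict;
use warnings;

# Hand-written stand-in for what read_not_ini would return for the
# sample file: section name => array reference of values.
my $Config = { section1 => [ 'value1', 'value2' ] };

for my $name (sort keys %$Config) {
    my @values = @{ $Config->{$name} };
    print "[$name] has ", scalar @values, " value(s): @values\n";
}
```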

using perl fetch a .txt file and for every line in that file do something [duplicate]

I'm pretty new in perl so please try to understand me.
I have in a .txt file defined some lines like this:
doc1.20131010.zip
doc2.20131010.zip
doc3.20131010.zip
doc4.20131010.zip
I made this code:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;
use Net::SFTP::Attributes;
use Net::SFTP;
use constant {
HOST => "x.x.x.x",
USER_NAME => "sftptest",
PASSWORD => "**********",
DEBUG => "0",
};
my $REMOTE_DIR = "IN";
my $LOCAL_DIR = "/home/rec";
my $sftp = Net::SFTP->new (
HOST,
timeout => 240,
user => USER_NAME,
password => PASSWORD,
autodie => 1,
);
#
# Fetch Files
#
#my $res = $sftp->ls($REMOTE_DIR,sub { print $_[0]{longname}, "\n" });
#print "$res";
my $ls = $sftp->ls($REMOTE_DIR)
or die "ls failed: " . $sftp->error;
open my $fh, '>', '/home/rec/listing' or die "unable to create file: $!";
print $fh $_->{filename}, "\n" for @$ls;
close $fh;
open F, "</home/docs/listing";
for my $line (<F>)
{
#print "$line";
$sftp->get("$line","$line") ;
}
Now when I run the above code it should give me the above files listed, instead I get this:
Couldn't stat remote file: No such file or directory at ./r.pl line 40.
You probably need to remove newline after reading file names from filehandle:
for my $line (<F>) {
chomp($line);
$sftp->get($line, $line);
}
or more commonly,
while (my $line = <F>) {
chomp($line);
$sftp->get($line, $line);
}
You use use autodie;, yet you have:
open my $fh, '>', '/home/rec/listing' or die "unable to create file: $!";
No need for the or die... since the program will automatically die.
You also have use feature qw(say);, yet you use print instead of say. The whole purpose of say is to prevent issues that might be the cause of your error.
You also should check the return results of your $sftp->get($line, $line); line to see if it was successful or not.
If you did both of these, you would have seen that your $sftp->get($line, $line) was failing because you forgot to chomp that NL at the end of the file.
Instead, you used:
print $line;
which printed the file name, but since the file name ended in a NL, the output looked fine. Otherwise, you would have seen the extra space and immediately seen the problem.
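One way to make a stray newline visible while debugging is to print the value between delimiters. A small sketch, using one of the file names from the question:

```perl
use strict;
use warnings;

my $line = "doc1.20131010.zip\n";   # as read from the listing file

# With a trailing newline, the closing ] lands on the next line.
my $before = "[$line]";
chomp $line;
# After chomp, the brackets hug the name.
my $after  = "[$line]";

print "$before\n$after\n";
```

The first bracketed value is split across two lines in the output, which shows at a glance that the name passed to get() was not the name on the server.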

How to print file content in parts to the screen?

I am trying to store command output into a file (that works fine) and then what I want to do, is to display the file content to the screen.
My problem is that I want it to be displayed in parts (for example 20 lines at a time) and let the user press [Enter] or any key to continue to the next part. I was thinking about piping the file content to more however it displays the whole file content at once instead of doing it by parts.
Here is my part of the code that is responsible for opening a file, writing to it, and then displaying it on the screen.
open FILE, '>', $filename;
print FILE @command;
open FILE, '<', $filename;
while (<FILE>) {
open MORE, '| more';
print MORE;
}
close MORE;
close FILE;
use strict;
use warnings;
my @command = map "output line $_\n", 1..100;
my $page_size = 20;
my $n = 0;
for my $line (@command) {
print $line;
$n ++;
if ($n % $page_size == 0) {
print "--More--";
<>;
}
}
You just need to move the open of more out of the loop:
close FILE;
open FILE, '<', $filename;
open MORE, '| more';
while (<FILE>) {
print MORE;
}
close MORE;
close FILE;
or without using more:
open my $file, '<', $filename or die("$!");
while (@command) {
print join("\n", splice(@command, 0, 20));
<>;
}
close $file;
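The splice-based paging can also be factored into a small helper, which keeps the chunking logic separate from the prompt. This is only a sketch: paginate is a hypothetical name, and the interactive part is commented out so the example runs unattended.

```perl
use strict;
use warnings;

# Split a list of lines into pages of at most $size lines each.
# paginate is a hypothetical helper name.
sub paginate {
    my ($size, @lines) = @_;
    my @pages;
    push @pages, [ splice @lines, 0, $size ] while @lines;
    return @pages;
}

my @command = map { "output line $_\n" } 1 .. 45;
my @pages   = paginate(20, @command);

# Interactive use (commented out so the sketch runs unattended):
# for my $page (@pages) {
#     print @$page;
#     print "--More--";
#     <STDIN>;          # wait for Enter before showing the next page
# }

printf "%d pages, last page has %d lines\n", scalar @pages, scalar @{ $pages[-1] };
```

Since the lines already end in a newline, each page is printed as-is rather than joined with "\n", which avoids blank lines between entries.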
