Errors in declaration when trying to parse a csv file - string

I'm trying to parse a CSV file that is formatted like this:
dog cats,yellow blue tomorrow,12445
birds,window bank door,-novalue-
birds,window door,5553
aspirin man,red,567
(there is no value where -novalue- is written)
use strict;
use warnings;
my $filename = 'in.txt';
my $filename2 = 'out.txt';
open(my $in, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
my $word = "";
while (my $row = <$in>) {
chomp $row;
my #fields = split(/,/,$row);
#Save the first word of the second column
($word) = split(/\s/,$fields[1]);
if ($word eq 'importartWord')
{
printf $out "$fields[0]".';'."$word".';'."$fields[2]";
}
else #keep as it was
{
printf $out "$fields[0]".';'."$fields[1]".';'."$fields[2]";
}
Use of uninitialized value $word in string ne at prueba7.pl line 22, <$in> line 10.
No matter where I define $word I cannot stop receiving that error and can't understand why. I think I have initialized $word correctly. I would really appreciate your help here.
Please if you are going to suggest using Text::CSV post a working code example since I haven't been able to apply it for the propose I have explained here. That's the reason I ended up writing the above code.
PD:
Because I know you are going to ask for my previous code using Text::CSV, here it is:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({ sep_char => ';', binary => 1 }) or
die "Cannot use CSV: ".Text::CSV->error_diag ();
#directorio donde esta esc_prim2.csv
my $file = 'C:\Users\Sergio\Desktop\GIS\perl\esc_prim2.csv';
my $sal = 'C:\Users\Sergio\Desktop\GIS\perl\esc_prim3.csv';
open my $data, "<:encoding(utf8)", "$file" or die "$file: $!";
open my $out, ">:encoding(utf8)", "$sal" or die "$sal: $!";
$csv->eol ("\r\n");
#initializing variables
my $row = "";
my $word = "";
my $validar = 0;
my $line1 = "";
my #mwords = [""];#Just a try to initialize mwords... doesn't work, error keeps showing
#save the first line with field names on the other file
$line1 = <$data>;
$csv->parse($line1);
my #fields = $csv->fields();
$csv->print($out,[$fields[0], $fields[1], $fields[2]]);
while ($row = <$data>) {
if ($csv->parse($row)) {
#fields = $csv->fields();
#save first word of the field's second element
#mwords = split (/\s/, $fields[1]);
#keep the first one
$word = $mwords[0];
printf($mwords[0]);
#if that word is not one of SAN, EL y LA... writes a line in the new file with the updated second field.
$validar = ($word ne 'SAN') && ($word ne 'EL') && ($word ne 'LA');
if ($validar)
{
$csv->print($out,[$fields[0], $word, $fields[2]]);
}
else { #Saves the line in the new file as it was in the old one.
$csv->print($out,[$fields[0], $fields[1], $fields[2]]);
}
} else {#error procesing row
warn "La row no se ha podido procesar\n";
}
}
close $data or die "$file: $!";
close $out or die "$sal: $!";
Here the line where $validar is declared brings the same error of "uninitialized value" although I did it.
I also tried the push #rows, $row; approach but I don't really know how to handle the $rows[$i] since they are references to arrays (pointers) and I know they can't be operated as variables... Couldn't find a working example on how to use them.

I think you're misunderstanding the error. It's not a problem with the declaration of the variable, but with the data that you're putting into the variable.
Use of uninitialized value
This means that you are trying to use a value that is undefined (not undeclared). That means you are using a variable that you haven't given a value.
You can get more details about the warning (and it's a warning, not an error) by adding use diagnostics to your code. You'll get something like this:
(W uninitialized) An undefined value was used as if it were already
defined. It was interpreted as a "" or a 0, but maybe it was a mistake.
To suppress this warning assign a defined value to your variables.
To help you figure out what was undefined, perl will try to tell you
the name of the variable (if any) that was undefined. In some cases
it cannot do this, so it also tells you what operation you used the
undefined value in. Note, however, that perl optimizes your program
and the operation displayed in the warning may not necessarily appear
literally in your program. For example, "that $foo" is usually
optimized into "that " . $foo, and the warning will refer to the
concatenation (.) operator, even though there is no . in
your program.
So, when you're populating $word, it's not getting a value. Presumably, that's because some lines in your input file have an empty record there.
I have no way of knowing whether or not that's a valid input for your program, so I can't really give any helpful suggestions on how to fix this.

The error message you provided ends with: line 22, <$in> line 10. but your question doesn't show line 10 of the data ($in) requiring some speculation in this answer - but, I'd say that the second field, $field[1], of line 10 of in.txt is empty.
Consequently, this line: ($word) = split(/\s/,$fields[1]); is causing $word to be undefined. As a result, some use of it latter - be it the ne operator (as displayed in the message) or anything else is going to generate an error.
As an aside - there's little point in interpolating a variable in a string on its own; instead of "$fields[0]", say $fields[0] unless you're going to put something else in there, like "$fields[0];". You may want to consider replacing
printf $out "$fields[0]".';'."$word".';'."$fields[2]";
with
printf $out $fields[0] . ';' . $word . ';' . $fields[2];
or
printf $out "$fields[0];$word;$fields[2]";
Of course, TMTOWTDI - so you may want to tell me to mind my own business instead. :-)

Related

Perl: String in Substring or Substring in String

I'm working with DNA sequences in a file, and this file is formatted something like this, though with more than one sequence:
>name of sequence
EXAMPLESEQUENCEATCGATCGATCG
I need to be able to tell if a variable (which is also a sequence) matches any of the sequences in the file, and what the name of the sequence it matches, if any, is. Because of the nature of these sequences, my entire variable could be contained in a line of the file, or a line of the variable could be a part of my variable.
Right now my code looks something like this:
use warnings;
use strict;
my $filename = "/users/me/file/path/file.txt";
my $exampleentry = "ATCG";
my $returnval = "The sequence does not match any in the file";
open file, "<$filename" or die "Can't find file";
my #Name;
my #Sequence;
my $inx = 0;
while (<file>){
$Name[$inx] = <file>;
$Sequence[$inx] = <file>;
$indx++;
}unless(index($Sequence[$inx], $exampleentry) != -1 || index($exampleentry, $Sequence[$inx]) != -1){
$returnval = "The sequence matches: ". $Name[$inx];
}
print $returnval;
However, even when I purposely set $entry as a match from the file, I still return The sequence does not match any in the file. Also, when running the code, I get Use of uninitialized value in index at thiscode.pl line 14, <file> line 3002. as well as Use of uninitialized value within #Name in concatenation (.) or string at thiscode.pl line 15, <file> line 3002.
How can I perform this search?
I will assume that the purpose of this script is to determine if $exampleentry matches any record in the file file.txt. A record describes here a DNA sequence and corresponds to three consecutive lines in the file. The variable $exampleentry will match the sequence if it matches the third line of the record. A match means here that either
$exampleentry is a substring of $line, or
$line is a substring of $exampleentry,
where $line referes to the corresponding line in the file.
First, consider the input file file.txt:
>name of sequence
EXAMPLESEQUENCEATCGATCGATCG
in the program you try to read these two lines, using three calls to readline. Accordingly, that last call to readline will return undef since there are no more lines to read.
It therefore seems reasonable that the two last lines in file.txt are malformed, and the correct format should be:
>name of sequence
EXAMPLESEQUENCE
ATCGATCGATCG
If I now understand you correctly, I hope this could solve your problem:
use feature qw(say);
use strict;
use warnings;
my $filename = "file.txt";
my $exampleentry = "ATCG";
my $returnval = "The sequence does not match any in the file";
open (my $fh, '<', $filename ) or die "Can't find file: $!";
my #name;
my #sequence;
my $inx = 0;
while (<$fh>) {
chomp ($name[$inx] = <$fh>);
chomp ($sequence[$inx] = <$fh>);
if (
index($sequence[$inx], $exampleentry) != -1
|| index($exampleentry, $sequence[$inx]) != -1
) {
$returnval = "The sequence matches: ". $name[$inx];
last;
}
}
say $returnval;
Notes:
I have changed variable names to follow snake_case convention. For example the variable #Name is better written using all lower case as #name.
I changed the open() call to follow the new recommended 3-parameter style, see Don't Open Files in the old way for more information.
Used feature say instead of print
Added a chomp after each readline to avoid storing newline characters in the arrays.

Adding custom header to specific files in a directory

I would like to add a unique one line header that pertains to each file FOCUS*.tsv file in a specified directory. After that, I would like to combine all of these files into one file.
First I’ve tried sed command.
`my $cmd9 = `sed -i '1i$SampleID[4]' $tsv_file`;` print $cmd9;
It looked like it worked but after I’ve combined all of these files into one file in the next section of the code, the inserted row was listed four times for each file.
I’ve tried the following Perl script to accomplish the same but it deleted the content of the file and only prints out the added header.
I’m looking for the simplest way to accomplish what I’m looking for.
Here is what I’ve tried.
#!perl
use strict;
use warnings;
use Tie::File;
my $home="/data/";
my $tsv_directory = $home."test_all_runs/".$ARGV[0];
my $tsvfiles = $home."test_all_runs/".$ARGV[0]."/tsv_files.txt";
my #run_directory = (); #run_directory = split /\//, $tsv_directory; print "The run directory is #############".$run_directory[3]."\n";
my $cmd = `ls $tsv_directory/FOCUS*\.tsv > $tsvfiles`; #print "$cmd";
my $cmda = "ls $tsv_directory/FOCUS*\.tsv > $tsvfiles"; #print "$cmda";
my #tsvfiles =();
#this code opens the vcf_files.txt file and passes each line into an array for indidivudal manipulation
open(TXT2, "$tsvfiles");
while (<TXT2>){
push (#tsvfiles, $_);
}
close(TXT2);
foreach (#tsvfiles){
chop($_);
}
#this loop works fine
for my $tsv_file (#tsvfiles){
open my $in, '>', $tsv_file or die "Can't write new file: $!";
open my $out, '>', "$tsv_file.new" or die "Can't write new file: $!";
$tsv_file =~ m|([^/]+)-oncomine.tsv$| or die "Can't extract Sample ID";
my $sample_id = $1;
#print "The sample ID is ############## $sample_id\n";
my $headerline = $run_directory[3]."/".$sample_id;
print $out $headerline;
while( <$in> ) {
print $out $_;
}
close $out;
close $in;
unlink($tsv_file);
rename("$tsv_file.new", $tsv_file);
}
Thank you
Apparently, the wrong '>' when opening the file for reading was the problem and it got solved.
However, I'd like to make a few comments on some of the rest of the code.
The list of files is built by running external ls redirected to a file, then reading this file into an array. However, that is exactly the job of glob and all of that is replaced by
my #tsvfiles = glob "$tsv_directory/FOCUS*.tsv";
Then you don't need the chomp either, and the chop that is used would actually hurt since it removes the last character, not only the newline (or really $/).
Use of chop is probably not what you want. If you are removing the linefeed ($/) use chomp
To extract a match and assign it, a common idiom is
my ($sample_id) = $tsv_file =~ m|([^/]+)-oncomine.tsv$|
or die "Can't extract Sample ID: $!";
Note that I also added $!, to actually print the error. Otherwise we just don't know what it was.
The unlink and rename appear to be overwriting one file with another. You can do that by using move from the core module File::Copy
use File::Copy qw(move);
move ($tsv_file_new, $tsv_file)
or die "Can't move $tsv_file to $tsv_file_new: $!";
which renames the _new into $tsv_file, so overwriting it.
As for how the files need to be combined, more precise explanation would be needed.

Reading new format in INI file perl

I have another problem regarding the INI file. My teacher had renewed the configuration of the files. Now my codes can't read this new configuration file. The format of it has been changed. How to read this format in the INI file in perl?
[section1]
value1
value2
As you can see, the format of the INI file now has only values in it. The parameter is gone. How can perl read this line? It does not have any parameters and only values are left. I want to read only the values. I use Config::Tiny before to read the line but i can't seem to solve this problem with this :
my $file = "file directory";
my $Config = Config::Tiny->read($file);
$Config->{"section1"}->{_};
Are my codes wrong? Because i can't get the output from this code. Can anybody help me fix this problem? Thank you.
You'll need to deal with it yourself:
#!/usr/bin/perl
use 5.10.0;
sub read_not_ini
{
my $file = shift;
my %values;
my $section;
open my $fh, '<', $file or die "Can't read $file: $!";
while (<$fh>)
{
# skip comments, blank lines.
next if /^\s*#/ or /^\s*$/;
# don't need/want end-line character.
chomp;
if (/^\[([^\]]+)\]/)
{
my $s = $1;
$section = $values{$s} //= [];
}
elsif($section)
{
push #$section, $_;
}
else
{
say STDERR "$file: values without a section: line $.";
}
}
return \%values;
}
my $Config = read_not_ini(shift); # pass in as param
say for #{$Config->{"section1"}};

Perl if condition parameters

I have a log file which looks like below:
4680 p4exp/v68 PJIANG-015394 25:34:19 IDLE none
8869 unnamed p4-python R integration semiconductor-project-trunktip-turbolinuxclient 01:33:52 IDLE none
8870 unnamed p4-python R integration remote-trunktip-osxclient 01:33:52
There are many such entries in the same log file such that some contains IDLE none at the end while some does not. I would like to retain the ones having "R integration" and "IDLE none" in a hash and ignore the rest. I have tried the following code but not getting the desired results.
#!/usr/bin/perl
open (FH,'/root/log.txt');
my %stat;
my ($killid, $killid_details);
while ($line = <FH>) {
if ($line =~ m/(\d+)/){
$killid = $1;
}
if ($line =~ /R integration/ and $line =~ /IDLE none/){
$killid_details = $line;
}
$stat{$killid} = {
killid => $killid_details
};
}
close (FH);
I am getting all the lines with R integration (for example I get 8869, 8870 lines) which should not be the case as 8870 should be ignored.
Please inform me if any mistake. I am still learning perl. Thank you.
I made a few changes in your program:
Always put in use strict; and use warnings;. These will catch 90% of your errors. (Although not this time).
When you open a file, you need to either use or die as in open my $fh, "<", $file or die qq(blah, blah, blah); or use use autodie; (which is now preferred). In your case, if the file didn't open, your program would have continued merrily along. You need to test whether or not the open statement worked.
Note my open statement. I use a variable for the file handle. This is preferred because it's not global, and it's easier to pass into subroutines. Also note I use the three parameter open. This way, you don't run into trouble if your file name begins with some strange character.
When you declare a variable, it's best to do it in scope. This way, variables go out of scope when you no longer need them. I moved where $killid and $killid_details to be declared inside the loop. That way, they no longer exist outside the loop.
You need to be more careful with your regular expressions. What if the phrase IDLE none appears elsewhere in your line? You only want it if its on the end of the line.
Now, for the issues you had:
You need to chomp lines when you read them. In Perl, the NL at the end of the line is read in. The chomp command removes it.
Your logic was a bit strange. You set $killid if your line had a digit in it (I modified it to look only for digits at the beginning of the line). However, you simply went on your merry way even if killid was not set. In your version, because you declared $killid outside of the loop, it had a value in each loop. Here I go to the next statement if $killid isn't defined.
You had a weird definition for your hash. You were defining a reference hash within a hash. No need for that. I made it a simple hash.
Here it is:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;
use Data::Dumper;
open my $log_fh, '<', '/root/log.txt';
my %stat;
while (my $line = <$log_fh>) {
chomp $line;
next if not $line =~ /^(\d+)\s+/;
my $killid = $1;
if ($line =~ /R\s+integration/ and $line =~ /IDLE\s+none$/){
my $killid_details = $line;
$stat{$killid} = $killid_details;
}
}
close $log_fh;
say Dumper \%stat;
I think this is probably what you want:
while (<FH>) {
next unless /^(\d+).*R integration.*IDLE none/;
$stat{$1} = $_;
}
The regexp should be anchored to the beginning of the line, so you don't match a number anywhere on the line. There's no need to do multiple regexp matches, assuming the order of R integration and IDLE none are always as in the example. You need to use next when there's no match, so you don't process non-matching lines.
And I suspect that you just want to set the value of the hash entry to the string, not a reference to another hash.

Simple perl opendir

I am completely new to perl and have just been learning it. I came across this script I need to run that has some network Tstat trace data. However, I get an error 'Cannot parse date.'
The code that generates this is here
foreach my $dir (#trace_dirs) {
undef #traces;
opendir(DIR, $dir) || die "Can't open dir: $dir \n";
#traces = grep { /.out$/ && -d "$dir/$_" } readdir(DIR);
foreach my $trace (#traces) {
$trace =~ /^(\d\d)_(\d\d)_(\d\d)_(\w\w\w)_(\d\d\d\d)/;
$trace_date=&ParseDate("$3/$4/$5 $1:$2") || die "Cannot parse date \n";
$traces{$trace_date} = $trace;
$trace_dir{$trace_date} = $dir;
}
closedir DIR;
}
can some tell me what this code is looking for?
When you run into problems like this, throw yourself a bone by looking at the data you are trying to play with. Make sure that the value in $trace is what you expect and that the date string you create is what you expect:
print "Trace is [$trace]\n";
if( $trace =~ /^(\d\d)_(\d\d)_(\d\d)_(\w\w\w)_(\d\d\d\d)/ ) {
my $date = "$3/$4/$5 $1:$2";
print "date is [$date]\n";
$trace_date= ParseDate( $date ) || die "Cannot parse date [$date]\n";
}
I'm guessing that the value in $4, which apparently is a string like 'Jan', 'Feb', and so on, isn't something that ParseDate likes.
Note that you should only use the capture variables after a successful pattern match, lest they be left over from a different match.
However, I get an error 'Cannot parse date.'
You get the error due to the line:
$trace =~ /^(\d\d)_(\d\d)_(\d\d)_(\w\w\w)_(\d\d\d\d)/;
The script expects that all files in the directory with extension .out have proper timestamps in the beginning of their names. And the line of the script lack any error handling.
Try adding some check here, e.g.:
unless($trace =~ /^(\d\d)_(\d\d)_(\d\d)_(\w\w\w)_(\d\d\d\d)/) {
warn "WRN: Malformed file name: $trace\n";
next;
}
That checks if the file name matches, and if it doesn't, warning would be printed and it would be skipped.
Alternatively you can also add the check to the grep {} readdir() line:
#traces = grep { /.out$/ && /^(\d\d)_(\d\d)_(\d\d)_(\w\w\w)_(\d\d\d\d)/ && -d "$dir/$_" } readdir(DIR);
to filter out misplaced .out files (hm, actually directories) before they reach the loop which calls the ParseDate function.

Resources