Perl: String in Substring or Substring in String

I'm working with DNA sequences in a file, and this file is formatted something like this, though with more than one sequence:
>name of sequence
EXAMPLESEQUENCEATCGATCGATCG
I need to be able to tell if a variable (which is also a sequence) matches any of the sequences in the file, and if so, what the name of the matching sequence is. Because of the nature of these sequences, my entire variable could be contained in a line of the file, or a line of the file could be a part of my variable.
Right now my code looks something like this:
use warnings;
use strict;
my $filename = "/users/me/file/path/file.txt";
my $exampleentry = "ATCG";
my $returnval = "The sequence does not match any in the file";
open file, "<$filename" or die "Can't find file";
my @Name;
my @Sequence;
my $inx = 0;
while (<file>){
$Name[$inx] = <file>;
$Sequence[$inx] = <file>;
$indx++;
}unless(index($Sequence[$inx], $exampleentry) != -1 || index($exampleentry, $Sequence[$inx]) != -1){
$returnval = "The sequence matches: ". $Name[$inx];
}
print $returnval;
However, even when I purposely set $exampleentry to a match from the file, I still return The sequence does not match any in the file. Also, when running the code, I get Use of uninitialized value in index at thiscode.pl line 14, <file> line 3002. as well as Use of uninitialized value within @Name in concatenation (.) or string at thiscode.pl line 15, <file> line 3002.
How can I perform this search?

I will assume that the purpose of this script is to determine if $exampleentry matches any record in the file file.txt. A record here describes a DNA sequence and corresponds to three consecutive lines in the file. The variable $exampleentry will match the sequence if it matches the third line of the record. A match here means that either
$exampleentry is a substring of $line, or
$line is a substring of $exampleentry,
where $line refers to the corresponding line in the file. (A small helper illustrating this two-way check is sketched below.)
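For illustration, the two conditions can be wrapped in a small helper sub (the name matches_either_way is my own; the full program below performs the same index tests inline):

sub matches_either_way {
    my ($line, $entry) = @_;
    # true if either string contains the other
    return index($line, $entry) != -1
        || index($entry, $line) != -1;
}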
First, consider the input file file.txt:
>name of sequence
EXAMPLESEQUENCEATCGATCGATCG
In the program you try to read these two lines using three calls to readline. Accordingly, the last call to readline will return undef, since there are no more lines to read.
It therefore seems reasonable that the last two lines in file.txt are malformed, and that the correct format should be:
>name of sequence
EXAMPLESEQUENCE
ATCGATCGATCG
If I now understand you correctly, I hope this could solve your problem:
use feature qw(say);
use strict;
use warnings;

my $filename = "file.txt";
my $exampleentry = "ATCG";
my $returnval = "The sequence does not match any in the file";

open (my $fh, '<', $filename ) or die "Can't find file: $!";
my @name;
my @sequence;
my $inx = 0;
while (<$fh>) {
    chomp ($name[$inx] = <$fh>);
    chomp ($sequence[$inx] = <$fh>);
    if (
        index($sequence[$inx], $exampleentry) != -1
        || index($exampleentry, $sequence[$inx]) != -1
    ) {
        $returnval = "The sequence matches: " . $name[$inx];
        last;
    }
}
say $returnval;
Notes:
I have changed variable names to follow the snake_case convention. For example, the variable @Name is better written in all lower case as @name.
I changed the open() call to follow the new recommended 3-parameter style, see Don't Open Files in the old way for more information.
Used feature say instead of print
Added a chomp after each readline to avoid storing newline characters in the arrays.

Related

How can I split my data in small enough chunks to feed to Seq?

I am working on a bioinformatics project where I am looking at very large genomes. Seg only reads 135 lines at a time, so when we feed the genomes in it gets overloaded. I am trying to create a Perl command that will split the data into 135-line sections. The character limit would be 10,800 since there are 80 columns. This is what I have so far:
#!usr/bin/perl
use warnings;
use strict;
my $str =
'>AATTCCGG
TTCCGGAA
CCGGTTAA
AAGGTTCC
>AATTCCGG';
substr($str,17) = "";
print "$str";
It splits at the 17th character but only prints that section; I want it to continue printing the rest of the data. How do I add a command that allows the rest of the data to be shown? It should split at every 17th character and keep going. (Then of course I can go back in and scale it up to the size I actually need.)
I assume that the "very large genome" is stored in a very large file, and that it is fine to collect data by number of lines (and not by number of characters) since this is the first mentioned criterion.
Then you can read the file line by line and assemble lines until there are 135 of them. Then hand them off to a program or routine that processes them, empty your buffer, and keep going:
use warnings;
use strict;
use feature 'say';

my $file = shift || 'default_filename.txt';
my $num_lines_to_process = 135;

open my $fh, '<', $file or die "Can't open $file: $!";

my @buffer;
my $line_counter = 0;    # initialized so the first comparison below doesn't warn

while (<$fh>) {
    chomp;
    if ($line_counter == $num_lines_to_process)
    {
        process_data(\@buffer);
        @buffer = ();
        $line_counter = 0;
    }
    push @buffer, $_;
    ++$line_counter;
}
process_data(\@buffer) if @buffer;    # last batch

sub process_data {
    my ($rdata) = @_;
    say for @$rdata; say '---';    # print data for a test
}
If your processing application/routine wants a string, you can append to a string each time instead of adding to an array, $buffer .= $_;, and clear it with $buffer = ''; as needed.
If you need to pass a string but there is also some use for an array while collecting data (intermediate checks/pruning/processing?), then collect lines into an array and use it as needed, and join into a string before handing it off: my $data = join '', @buffer;
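For illustration, a sketch of that string-buffer variant, assuming process_data is changed to take a single string instead of an array reference (everything else follows the loop above):

my $buffer = '';
my $line_counter = 0;
while (<$fh>) {
    if ($line_counter == $num_lines_to_process) {
        process_data($buffer);    # hand off one chunk as a single string
        $buffer = '';
        $line_counter = 0;
    }
    $buffer .= $_;    # keep the newline so the chunk stays line-oriented
    ++$line_counter;
}
process_data($buffer) if length $buffer;    # last batch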
You can also make use of the $. variable and the modulo operator (%)
while (<$fh>) {
    chomp;
    push @buffer, $_;
    if ($. % $num_lines_to_process == 0)    # every $num_lines_to_process
    {
        process_data(\@buffer);
        @buffer = ();
    }
}
process_data(\@buffer) if @buffer;    # last batch
In this case we need to first store a line and then check its number, since $. (line number read from a filehandle, see docs linked above) starts from 1 (not 0).
substr returns the removed part of a string; you can just run it in a loop:
while (length $str) {
    my $substr = substr $str, 0, 17, "";
    print $substr, "\n";
}
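Applied to the sample string from the question (chunk size 17, as there), a self-contained version might look like this; note that the embedded newlines count toward the 17 characters, just as they did in the original substr($str,17) call:

use strict;
use warnings;

my $str =
'>AATTCCGG
TTCCGGAA
CCGGTTAA
AAGGTTCC
>AATTCCGG';

# print the whole string in 17-character slices
while (length $str) {
    my $substr = substr $str, 0, 17, "";
    print $substr, "\n";
}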

Perl - Searching for values in a log file and storing/printing them as a string

I would like to search for the values after a specific phrase (Current value = ) in a log file, and make a string out of those values.
vcs_output.log: a log file
** Fault injection **
Count = 1533
0: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rcc_data_e[6]
0: Current value = x
1: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rs3_data_e[51]
1: Current value = x
2: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rs1_data_e[3]
2: Current value = 1
3: Path = cmp_top.iop.sparc0.exu.alu.shft_alu_shift_out_e[18]
3: Current value = 0
4: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rs3_data_e[17]
4: Current value = x
5: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rs1_data_e[43]
5: Current value = 0
6: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rcc_data_e[38]
6: Current value = x
7: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rs2_data_e_l[30]
7: Current value = 1
.
.
.
If I store the values after "Current value = ", they would be x,x,1,0,x,0,x,1. I ultimately want to save/print them as a string such as xx10x0x1.
Here is my code
code.pl:
#!/usr/bin/perl
use strict;
use warnings;
##### Read input
open ( my $input_fh, '<', 'vcs_output.log' ) or die $!;
chomp ( my @input = <$input_fh> );
my $i=0;
my @arr;
while (@input) {
if (/Current value = /)
$arr[i]= $input; # put the matched value to array
}
}
## make a string from the array using an additional loop
close ( $input_fh );
I think there is a way to make the string in one loop (or even without a loop). Please advise me on how to do it. Any suggestion is appreciated.
You can do both of the things you ask for.
To build a string directly, just append to it what you capture in the regex:
my $string;
while (<$input_fh>)
{
    my ($val) = /Current\s*value\s*=\s*(.*)/;
    $string .= $val;
}
If the match fails then $val is an empty string, so we don't have to test. You can also write the whole while loop in one line
$string .= (/Current\s*value\s*=\s*(.*)/)[0] while <$input_fh>;
but I don't see why that would be necessary. Note that this reads from the filehandle, and line by line. There is no reason to first read all lines into an array.
To avoid (explicit) looping, you can read all lines and pass them through map, naively as
my $string = join '',
map { (/Current\s*value\s*=\s*(.*)/) ? $1 : () } <$input_fh>;
Since map needs a list, the filehandle is in list context, returning the list of all lines in the file. Then each is processed by code in map's block, and its output list is then joined.
The trick map { ($test) ? $val : () } uses map to also do grep's job, to filter -- the empty list that is returned if $test fails is flattened into the output list, thus disappearing. The "test" here is the regex match, which in the scalar context returns true/false, while the capture sets $1.
But, like above, we can return the first element of the list that match returns, instead of testing whether the match was successful. And since we are in map we can in fact return the "whole" list
my $string = join '',
map { /Current\s*value\s*=\s*(.*)/ } <$input_fh>;
which may be clearer here.
Comments on the code in the question
the while (@input) is an infinite loop, since @input never gets depleted. You'd need foreach (@input) -- but better just read the filehandle, while (<$input_fh>)
your regex does match on a line with that string, but it doesn't attempt to match the pattern that you need (what follows =). Once you add that, it needs to be captured as well, by ()
you can assign to the i-th element (which should be $i) but then you'd have to increment $i as you go. Most of the time it is better to just push @array, $value. A minimal corrected version of the loop is sketched below.
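Putting those points together, here is a minimal sketch that keeps the question's read-all-lines-into-@input approach (the \S+ capture is an assumption about what a value looks like):

my @arr;
foreach my $line (@input) {
    if ( $line =~ /Current value = (\S+)/ ) {
        push @arr, $1;    # collect just the captured value
    }
}
my $string = join '', @arr;    # e.g. "xx10x0x1"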
You can use capturing parentheses to grab the string you want:
use strict;
use warnings;
my @arr;
open ( my $input_fh, '<', 'vcs_output.log' ) or die $!;
while (<$input_fh>) {
    if (/Current value = (.)/) {
        push @arr, $1;
    }
}
close ( $input_fh );
print "@arr\n";
__END__
x x 1 0 x 0 x 1
Use grep and perlre
http://perldoc.perl.org/functions/grep.html
http://perldoc.perl.org/perlre.html
If you are in a non-Unix environment, then...
#!/usr/bin/perl -w
use strict;
open (my $fh, '<', "vcs_output.log");
chomp (my @lines = <$fh>);
# Filter for lines which contain string 'Current value'
@lines = grep { /Current value/ } @lines;
# Substitute out what we don't want... leaving us with the 'xx10x0x1'
@lines = map { $_ =~ s/.*Current value = //; $_ } @lines;
my $str = join('', @lines);
print $str;
Otherwise...
#!/usr/bin/perl -w
use strict;
my $output = `grep "Current value" vcs_output.log | sed 's/.*Current value = //'`;
$output =~ s/\n//g;
print $output;

Errors in declaration when trying to parse a csv file

I'm trying to parse a CSV file that is formatted like this:
dog cats,yellow blue tomorrow,12445
birds,window bank door,-novalue-
birds,window door,5553
aspirin man,red,567
(there is no value where -novalue- is written)
use strict;
use warnings;
my $filename = 'in.txt';
my $filename2 = 'out.txt';
open(my $in, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
my $word = "";
while (my $row = <$in>) {
chomp $row;
my @fields = split(/,/,$row);
#Save the first word of the second column
($word) = split(/\s/,$fields[1]);
if ($word eq 'importartWord')
{
printf $out "$fields[0]".';'."$word".';'."$fields[2]";
}
else #keep as it was
{
printf $out "$fields[0]".';'."$fields[1]".';'."$fields[2]";
}
Use of uninitialized value $word in string ne at prueba7.pl line 22, <$in> line 10.
No matter where I define $word I cannot stop receiving that error and can't understand why. I think I have initialized $word correctly. I would really appreciate your help here.
Please, if you are going to suggest using Text::CSV, post a working code example, since I haven't been able to apply it for the purpose I have explained here. That's the reason I ended up writing the above code.
PD:
Because I know you are going to ask for my previous code using Text::CSV, here it is:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({ sep_char => ';', binary => 1 }) or
die "Cannot use CSV: ".Text::CSV->error_diag ();
# directory where esc_prim2.csv is located
my $file = 'C:\Users\Sergio\Desktop\GIS\perl\esc_prim2.csv';
my $sal = 'C:\Users\Sergio\Desktop\GIS\perl\esc_prim3.csv';
open my $data, "<:encoding(utf8)", "$file" or die "$file: $!";
open my $out, ">:encoding(utf8)", "$sal" or die "$sal: $!";
$csv->eol ("\r\n");
#initializing variables
my $row = "";
my $word = "";
my $validar = 0;
my $line1 = "";
my @mwords = [""]; # Just a try to initialize mwords... doesn't work, error keeps showing
#save the first line with field names on the other file
$line1 = <$data>;
$csv->parse($line1);
my @fields = $csv->fields();
$csv->print($out,[$fields[0], $fields[1], $fields[2]]);
while ($row = <$data>) {
if ($csv->parse($row)) {
@fields = $csv->fields();
#save first word of the field's second element
@mwords = split (/\s/, $fields[1]);
#keep the first one
$word = $mwords[0];
printf($mwords[0]);
#if that word is not one of SAN, EL and LA... write a line in the new file with the updated second field.
$validar = ($word ne 'SAN') && ($word ne 'EL') && ($word ne 'LA');
if ($validar)
{
$csv->print($out,[$fields[0], $word, $fields[2]]);
}
else { #Saves the line in the new file as it was in the old one.
$csv->print($out,[$fields[0], $fields[1], $fields[2]]);
}
} else { # error processing row
warn "The row could not be processed\n";
}
}
close $data or die "$file: $!";
close $out or die "$sal: $!";
Here, the line where $validar is assigned gives the same "uninitialized value" error, even though I did initialize it.
I also tried the push @rows, $row; approach, but I don't really know how to handle $rows[$i], since the elements are array references (pointers) and can't be used directly as plain variables... I couldn't find a working example of how to use them.
I think you're misunderstanding the error. It's not a problem with the declaration of the variable, but with the data that you're putting into the variable.
Use of uninitialized value
This means that you are trying to use a value that is undefined (not undeclared). That means you are using a variable that you haven't given a value.
You can get more details about the warning (and it's a warning, not an error) by adding use diagnostics to your code. You'll get something like this:
(W uninitialized) An undefined value was used as if it were already
defined. It was interpreted as a "" or a 0, but maybe it was a mistake.
To suppress this warning assign a defined value to your variables.
To help you figure out what was undefined, perl will try to tell you
the name of the variable (if any) that was undefined. In some cases
it cannot do this, so it also tells you what operation you used the
undefined value in. Note, however, that perl optimizes your program
and the operation displayed in the warning may not necessarily appear
literally in your program. For example, "that $foo" is usually
optimized into "that " . $foo, and the warning will refer to the
concatenation (.) operator, even though there is no . in
your program.
So, when you're populating $word, it's not getting a value. Presumably, that's because some lines in your input file have an empty record there.
I have no way of knowing whether or not that's a valid input for your program, so I can't really give any helpful suggestions on how to fix this.
The error message you provided ends with: line 22, <$in> line 10., but your question doesn't show line 10 of the data ($in), which requires some speculation in this answer - I'd say that the second field, $fields[1], of line 10 of in.txt is empty.
Consequently, this line: ($word) = split(/\s/,$fields[1]); is causing $word to be undefined. As a result, any later use of it - be it with the ne operator (as displayed in the message) or anything else - is going to generate that warning.
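For illustration, one way to guard against that empty field is to test it before splitting; whether defaulting $word to an empty string (as here) or skipping the row entirely is right depends on your data. The sample line below is made up:

my $row = 'birds,,5553';    # example line with an empty second column
chomp $row;
my @fields = split /,/, $row;
my $word = ( defined $fields[1] && $fields[1] =~ /\S/ )
         ? ( split /\s+/, $fields[1] )[0]    # first word of the second column
         : '';                               # fallback when the field is empty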
As an aside - there's little point in interpolating a variable in a string on its own; instead of "$fields[0]", say $fields[0] unless you're going to put something else in there, like "$fields[0];". You may want to consider replacing
printf $out "$fields[0]".';'."$word".';'."$fields[2]";
with
printf $out $fields[0] . ';' . $word . ';' . $fields[2];
or
printf $out "$fields[0];$word;$fields[2]";
Of course, TMTOWTDI - so you may want to tell me to mind my own business instead. :-)

Parsing Excel of ASCII format in Perl

We have a Perl script whose job is to read an Excel file and convert it into a flat file. We get the Excel file from some other system on a shared location.
The other system is actually generating a flat file, dumping data separated by tabs and naming it with a .xls extension. The problem is that in an xls file, if there is a string with a leading 0, e.g. 012345, it will be displayed as 12345 in Excel. To preserve the leading 0, what they do is write the data in this fashion (in Java):
"=\"" + some string + "\""
Now if we open the file in Excel it looks right, with no = or ", but when reading it via Perl we get the string as it is, i.e. ="some string".
How can we work around this? I have tried trimming the leading =" and the trailing ", but that does not feel like a clean solution. Can someone suggest anything else?
My suggestion (with the limited information you've given) would be to read the source file using Text::CSV, and then output via whatever means you would otherwise. For substitution, use regular expressions.
Simplified example (partially taken straight from the documentation):
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# set binary attribute and use a tab as a separator:
my $csv = Text::CSV->new ( { binary => 1, sep_char => "\t" } )
    or die "Cannot use CSV: " . Text::CSV->error_diag ();

open my $fh, "<:encoding(utf8)", "test.xls" or die "test.xls: $!";

my @rows;
while ( my $row = $csv->getline( $fh ) ) {
    foreach my $column (@$row) {
        $column =~ s/^="|"$//g; # remove opening '="' and closing '"'
    }
    push @rows, $row;
}
$csv->eof or $csv->error_diag();
close $fh;

# do file writing magic or further processing here
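The "file writing magic" step is left open above; a minimal sketch of writing the cleaned rows back out as tab-separated text (the output name out.txt is my own) could be:

$csv->eol("\n");    # terminate each record written by print()
open my $out, ">:encoding(utf8)", "out.txt" or die "out.txt: $!";
$csv->print($out, $_) for @rows;    # each element of @rows is an array ref of columns
close $out;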
In case you don't know, the 's' at the beginning of the regular expression indicates you want to substitute and the 'g' at the end means "repeat for all matches".
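As a quick standalone illustration of that substitution (the sample value is made up):

my $cell = '="012345"';
$cell =~ s/^="|"$//g;    # strip the leading '="' and the trailing '"'
print "$cell\n";         # prints: 012345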
For more information, see:
https://metacpan.org/pod/Text::CSV
http://perldoc.perl.org/perlretut.html (Perl documentation on regular expressions)

Perl if condition parameters

I have a log file which looks like below:
4680 p4exp/v68 PJIANG-015394 25:34:19 IDLE none
8869 unnamed p4-python R integration semiconductor-project-trunktip-turbolinuxclient 01:33:52 IDLE none
8870 unnamed p4-python R integration remote-trunktip-osxclient 01:33:52
There are many such entries in the same log file; some contain IDLE none at the end while some do not. I would like to retain the ones having "R integration" and "IDLE none" in a hash and ignore the rest. I have tried the following code but am not getting the desired results.
#!/usr/bin/perl
open (FH,'/root/log.txt');
my %stat;
my ($killid, $killid_details);
while ($line = <FH>) {
if ($line =~ m/(\d+)/){
$killid = $1;
}
if ($line =~ /R integration/ and $line =~ /IDLE none/){
$killid_details = $line;
}
$stat{$killid} = {
killid => $killid_details
};
}
close (FH);
I am getting all the lines with R integration (for example, I get lines 8869 and 8870), which should not be the case, as 8870 should be ignored.
Please inform me of any mistakes. I am still learning Perl. Thank you.
I made a few changes in your program:
Always put in use strict; and use warnings;. These will catch 90% of your errors. (Although not this time).
When you open a file, you need to either use or die as in open my $fh, "<", $file or die qq(blah, blah, blah); or use autodie; (which is now preferred). In your case, if the file didn't open, your program would have continued merrily along. You need to test whether or not the open statement worked.
Note my open statement. I use a variable for the file handle. This is preferred because it's not global, and it's easier to pass into subroutines. Also note I use the three parameter open. This way, you don't run into trouble if your file name begins with some strange character.
When you declare a variable, it's best to do it in scope. This way, variables go out of scope when you no longer need them. I moved the declarations of $killid and $killid_details inside the loop. That way, they no longer exist outside the loop.
You need to be more careful with your regular expressions. What if the phrase IDLE none appears elsewhere in your line? You only want it if it's at the end of the line.
Now, for the issues you had:
You need to chomp lines when you read them. In Perl, the NL at the end of the line is read in. The chomp command removes it.
Your logic was a bit strange. You set $killid if your line had a digit in it (I modified it to look only for digits at the beginning of the line). However, you simply went on your merry way even if $killid was not set. In your version, because you declared $killid outside of the loop, it had a value in each loop. Here I go to the next line if $killid can't be set.
You had a weird definition for your hash. You were storing a reference to another hash inside your hash. No need for that. I made it a simple hash.
Here it is:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;
use Data::Dumper;
open my $log_fh, '<', '/root/log.txt';
my %stat;
while (my $line = <$log_fh>) {
    chomp $line;
    next if not $line =~ /^(\d+)\s+/;
    my $killid = $1;
    if ($line =~ /R\s+integration/ and $line =~ /IDLE\s+none$/) {
        my $killid_details = $line;
        $stat{$killid} = $killid_details;
    }
}
close $log_fh;
say Dumper \%stat;
I think this is probably what you want:
while (<FH>) {
    next unless /^(\d+).*R integration.*IDLE none/;
    $stat{$1} = $_;
}
The regexp should be anchored to the beginning of the line, so you don't match a number anywhere on the line. There's no need to do multiple regexp matches, assuming the order of R integration and IDLE none is always as in the example. You need to use next when there's no match, so you don't process non-matching lines.
And I suspect that you just want to set the value of the hash entry to the string, not a reference to another hash.
