This question already has an answer here:
Closed 10 years ago.
Possible Duplicate:
How to join two substrings in perl
How could I find a substring of a string array which corresponds to user entered index of character?
For example: If there is a string $string = 'MFHYRAKCLAMSCTLPHCAKNDHGCTH';and it get broken into array #string = ( "MFHYRA","KCLAM", "SCTLP", "HCAKNDHGCTH" ) ; now if user enters position of A as 6 and 10, how could two corresponding substrings be searched and joined?
Essentially, what you ask is this: we have a position in the initial string. The string gets split up in substrings. In which of the substrings is the position?
Assume that
#pos = ( 5, 7, 9 ) ;
is the list of positions for which you would like to find the substrings.
my $n = 0 ; # current position
my %results ;
foreach my $ss ( #substrings ) {
$n += length( $ss ) ;
foreach my $p ( #pos ) {
if( ! $results{$p} and $p < $n ) { $results{$p} = $ss ; }
}
}
foreach my $p ( #pos ) {
print "Position $p, substring $results{$p}\n" ;
}
Clearly, this code could use some optimalization, for example -- no need to loop over the elements of #pos if we already have a substring for them, and we should break the operation after the last element of #pos got its substring, but for a few positions entered for the user this doesn't really matter.
Related
I would like to search values after a specific word (Current Value = ) in a log file, and makes a string with values.
vcs_output.log: a log file
** Fault injection **
Count = 1533
0: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rcc_data_e[6]
0: Current value = x
1: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rs3_data_e[51]
1: Current value = x
2: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rs1_data_e[3]
2: Current value = 1
3: Path = cmp_top.iop.sparc0.exu.alu.shft_alu_shift_out_e[18]
3: Current value = 0
4: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rs3_data_e[17]
4: Current value = x
5: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rs1_data_e[43]
5: Current value = 0
6: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rcc_data_e[38]
6: Current value = x
7: Path = cmp_top.iop.sparc0.exu.alu.byp_alu_rs2_data_e_l[30]
7: Current value = 1
.
.
.
If I store values after "Current value = ", then x,x,1,0,x,0,x,1. I ultimately save/print them as a string such as xx10x0x1.
Here is my code
code.pl:
#!/usr/bin/perl
use strict;
use warnings;
##### Read input
open ( my $input_fh, '<', 'vcs_output.log' ) or die $!;
chomp ( my #input = <$input_fh> );
my $i=0;
my #arr;
while (#input) {
if (/Current value = /)
$arr[i]= $input; # put the matched value to array
}
}
## make a string from the array using an additional loop
close ( $input_fh );
I think there is a way to make a string in one loop (or even not using a loop). Please advise me to make it. Any suggestion is appreciated.
You can do both that you ask for.
To build a string directly, just append to it what you capture in the regex
my $string;
while (<$input_fh>)
{
my ($val) = /Current\s*value\s*=\s*(.*)/;
$string .= $val;
}
If the match fails then $val is an empty string, so we don't have to test. You can also write the whole while loop in one line
$string .= (/Current\s*value\s*=\s*(.*)/)[0] while <$input_fh>;
but I don't see why that would be necessary. Note that this reads from the filehandle, and line by line. There is no reason to first read all lines into an array.
To avoid (explicit) looping, you can read all lines and pass them through map, naively as
my $string = join '',
map { (/Current\s*value\s*=\s*(.*)/) ? $1 : () } <$input_fh>;
Since map needs a list, the filehandle is in list context, returning the list of all lines in the file. Then each is processed by code in map's block, and its output list is then joined.
The trick map { ($test) ? $val : () } uses map to also do grep's job, to filter -- the empty list that is returned if $test fails is flattened into the output list, thus disappearing. The "test" here is the regex match, which in the scalar context returns true/false, while the capture sets $1.
But, like above, we can return the first element of the list that match returns, instead of testing whether the match was successful. And since we are in map we can in fact return the "whole" list
my $string = join '',
map { /Current\s*value\s*=\s*(.*)/ } <$input_fh>;
what may be clearer here.
Comments on the code in the question
the while (#input) is an infinite loop, since #input never gets depleted. You'd need foreach (#input) -- but better just read the filehandle, while (<$input_fh>)
your regex does match on a line with that string, but it doesn't attempt to match the pattern that you need (what follows =). Once you add that, it need be captured as well, by ()
you can assign to the i-th element (which should be $i) but then you'd have to increment $i as you go. Most of the time it is better to just push #array, $value
You can use capturing parentheses to grab the string you want:
use strict;
use warnings;
my #arr;
open ( my $input_fh, '<', 'vcs_output.log' ) or die $!;
while (<$input_fh>) {
if (/Current value = (.)/) {
push #arr, $1;
}
}
close ( $input_fh );
print "#arr\n";
__END__
x x 1 0 x 0 x 1
Use grep and perlre
http://perldoc.perl.org/functions/grep.html
http://perldoc.perl.org/perlre.html
If on a non-Unix environment then...
#!/usr/bin/perl -w
use strict;
open (my $fh, '<', "vcs_output.log");
chomp (my #lines = <$fh>);
# Filter for lines which contain string 'Current value'
#lines = grep{/Current value/} #lines;
# Substitute out what we don't want... leaving us with the 'xx10x0x1'
#lines = map { $_ =~ s/.*Current value = //;$_} #lines;
my $str = join('', #lines);
print $str;
Otherwise...
#!/usr/bin/perl -w
use strict;
my $output = `grep "Current value" vcs_output.log | sed 's/.*Current value = //'`;
$output =~ s/\n//g;
print $output;
I've found index and rindex for finding the first or last occurrence of a character (or substring) in a string. I'm also aware that they have an offset that can be used to start at a certain index.
What I want to know is if there is a simple way to find the index of the Nth occurrence of a character or substring in a string. I'd prefer not to have to do it with a regex and would rather not to have to write a loop that just repeatedly calls index with an offset.
EDIT: I didn't state the restriction well. The reason I said "no loop" is I am looking for a built-in way to do it, which exists in many languages.
One possible implementation:
use strict;
use warnings;
use feature qw(say);
my $string = 'the quick brown fox jumped over the lazy dog';
my $substring = 'o';
my $n = 4;
sub nth_index {
my ($string, $substring, $n) = #_;
my ($times, $index) = (0, 0);
while ( $times < $n && $index != -1 ) {
$index = index(
$string,
$substring,
$times == 0
? 0
: $index + length($substring),
);
$times++;
}
return $index;
}
say nth_index($string, $substring, $n); # 42
Here are two examples of how I would solve the problem
Subroutine nth_index1 uses index, while nth_index2 uses a regex. Both involve loops, as any solution must
I believe the regex solution is much more readable, while the index solution is probably a little faster. But they are both so quick that they are highly unlikely to cause a bottleneck, and readability is always paramount
use strict;
use warnings 'all';
my $s = 'the quick brown fox jumps over the lazy dog';
my $ss = 'o';
for my $n ( 1 .. 4 ) {
printf "%d %d\n",
nth_index1($s, $ss, $n),
nth_index2($s, $ss, $n);
}
sub nth_index1 {
my ($s, $ss, $n) = #_;
my $i;
my $len = length $ss;
while ( $n-- ) {
$i = index($s, $ss, $i ? $i + $len : 0 );
return if $i < 0;
}
$i;
}
sub nth_index2 {
my ($s, $ss, $n) = #_;
while ( $s =~ /$ss/g ) {
return $-[0] unless --$n;
}
return;
}
output
12 12
17 17
26 26
41 41
As stated, there is no built-in for this. Here are a few ways, using split, index, and regex.
use warnings;
use strict;
use feature qw(say);
my $str = "Xab_ab_ab_ab_"; # 'Xab_ab'; # test failed (3) matches
my $N = 3;
foreach my $patt qw(a ab c) {
say "Find index of occurrence $N of |$patt| in: |$str|";
say "index: ", ( ind_Nth_match_1($str, $patt, $N) // "no $N matches" ); #/
say "split: ", ( ind_Nth_match_2($str, $patt, $N) // "no $N matches" ); #/
say "regex: ", ( ind_Nth_match_3($str, $patt, $N) // "no $N matches" ); #/
}
sub ind_Nth_match_1 {
my ($str, $patt, $N) = #_;
my ($pos, $cnt) = (0, 0);
while ($pos = index($str, $patt, $pos) + 1) { # != 0
return $pos-1 if ++$cnt == $N;
}
return;
}
sub ind_Nth_match_2 {
my ($str, $patt, $N) = #_;
my #toks = split /($patt)/, $str;
return if #toks < 2*$N;
return length( join '', #toks[0..2*$N-1] ) - length($patt);
}
sub ind_Nth_match_3 {
my ($str, $patt, $N) = #_;
my $cnt = 0;
while ($str =~ m/$patt/g) {
return $-[0] if ++$cnt == $N;
}
}
This prints
Find index of occurrence 3 of |a| in: |Xab_ab_ab_ab_|
index: 7
split: 7
regex: 7
Find index of occurrence 3 of |ab| in: |Xab_ab_ab_ab_|
index: 7
split: 7
regex: 7
Find index of occurrence 3 of |c| in: |Xab_ab_ab_ab_|
index: no 3 matches
split: no 3 matches
regex: no 3 matches
Notes
In split each delimiter is also returned in the output list, with capturing /($patt)/, for simpler length estimation. Thus we count 2*$N (and then take -1).
In regex the #- array is used, #LAST_MATCH_START, for the position of the last successful match. Here the /g in scalar context in while makes it jump from a match to the next in repeated executions, and $-[0] gives the starting position of the last (previous) such match.
The subs return undef if there are not required $N matches, including no matches at all.
Thanks to Borodin for comments on return from subs and on using #- instead of #+.
(This answer does not answer your question, but is here to help you down the road towards accepting a regex solution.)
The way you're asking for a numeric index of the position, it sounds like you're thinking of extracting data from the string once you have that number, in a way that C programmers might do it.
For example, say you had the string
my $str = "My daddy left home when I was three and he didn't leave much for ma and me";
and you wanted to extract all the data up to the first instance of the word "and". Here's the way you could do it, which is sort of a C way to do it using Perl.
my $pos = find_index_of_first_occurrence( $str, 'and' );
# Note that find_index_of_first_occurrence() is a hypothetical function.
print substr( $str, 0, $pos );
# Prints "My daddy left home when I was three "
The way you'd do that in Perl with regexes is much simpler.
$str =~ /^(.*?)and/;
print $1;
With regexes, you're combining the searching for the string and the extraction of the data in one operation. (Note that both code snippets ignore the case of not finding "and" at all, for the sake of simplicity)
I understand that you don't know regexes very well yet, and that regexes can be daunting at first, but you'll need to understand them as part of learning Perl if you're going to succeed with the language.
I'm writing a Perl programm and I've got the following problem:
I have a large list of start and end positions in a string. This positions correspond to substrings in this string. I now want to transfer this positions to a second string. This second string is identical to the first string, except that it has additional hyphen.
Example for original String: "ABCDEF" and one Substring "BCDE"
What I have:
Positions of substring in this original string: Start = 1, End =
4
The original string with additional hyphen: "-AB---CD--E-F---"
What I want:
Position of the substring in the hyphen-string: Start=2, End=10
I have a large list of this substring positions.
I strongly suspect that you have shown a reduced version of the problem, in which case any solution may not work for the real situation.
However, it seems simplest to build a regex by interspersing -* (i.e. zero or more hyphens) between characters.
This program works that way, building a regex of B-*C-*D-*E and comparing it to both of your sample strings.
use strict;
use warnings;
my #strings = qw/ ABCDEF -AB---CD--E-F--- /;
my ($start, $end) = (1, 4);
my $substr = substr $strings[0], $start, $end-$start + 1;
my $regex = join '-*', split //, $substr;
$regex = qr/$regex/;
for my $string (#strings) {
if ($string =~ $regex) {
printf "Substring found at %d to %d in string %s\n", $-[0], $+[0]-1, $string;
}
}
output
Substring found at 1 to 4 in string ABCDEF
Substring found at 2 to 10 in string -AB---CD--E-F---
Does this work for you? It just searches for the characters specified by start and end in the hyphenated string and returns their indices.
sub hyphen_substrings {
my $original = shift;
my $hyphenated = shift;
my #substrings = #_;
my #return;
for my $substring (#substrings) {
my ($start, $end) = #{$substring}[0, 1];
my $start_h = index $hyphenated, substr $original, $start, 1;
my $end_h = index $hyphenated, substr $original, $end, 1;
push #return, [$start_h, $end_h];
}
return #return;
}
use strict;
use warnings;
my $theStringGivenAsAnInputExample="-AB---CD--E-F---";
my $start=1;
my $end=4;
my $theStringGivenAsAnotherInput="ABCDEF";
my $regexp=join("-*",split("",substr($theStringGivenAsAnotherInput,$start,$end))
);
$theStringGivenAsAnInputExample =~ /$regexp/p;
print ${^PREMATCH},"\n";
print ${^POSTMATCH},"\n";
print ${^MATCH},"\n";
my $startPosition = length(${^PREMATCH});
my $finishPosition = length(${^PREMATCH})+length(${^MATCH})-1;
print "start, $startPosition finish, $finishPosition\n";
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to plot a gene graph for a DNA sequence say ATGCCGCTGCGC?
Im trying to write a Perl script that compares two DNA sequences (60 characters in length each lets say) in alignment, and then show the ratio of matches to non-matches of the sequences to each other. But i'm not having much luck. if it helps i can upload my code, but its no use. here's an example of what im trying to achieve below.
e.g
A T C G T A C
| | | | | | |
T A C G A A C
So the matches of the above example would be 4. and non-matches are: 3. Giving it a ratio of 4.3.
Any help would be much appreciated. thanks.
in general, please do post your code. It does help. In any case, something like this should do what you are asking:
#!/usr/bin/perl -w
use strict;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my #dna1=split(//,$d1);
my #dna2=split(//,$d2);
my $matches=0;
for (my $i=0; $i<=$#dna1; $i++) {
$matches++ if $dna1[$i] eq $dna2[$i];
}
my $mis=scalar(#dna1)-$matches;
print "Matches/Mismatches: $matches/$mis\n";
Bear in mind though that the ratio of 4 to 3 is most certainly not 4.3 but ~1.3. If you post some information on your input file format I will update my answer to include lines for parsing the sequence from your file.
Normally I'd say "What have you tried" and "upload your code first" because it doesn't seem to be a very difficult problem. But let's give this a shot:
create two arrays, one to hold each sequence:
#sequenceOne = ("A", "T", "C", "G", "T", "A", "C");
#sequenceTwo = ("T", "A", "C", "G", "A", "A", "C");
$myMatch = 0;
$myMissMatch = 0;
for ($i = 0; $i < #sequenceOne; $i++) {
my $output = "Comparing " . $sequenceOne[$i] . " <=> " . $sequenceTwo[$i];
if ($sequenceOne[$i] eq $sequenceTwo[$i]) {
$output .= " MATCH\n";
$myMatch++;
} else {
$myMissMatch++;
$output .= "\n";
}
print $output;
}
print "You have " . $myMatch . " matches.\n";
print "You have " . $myMissMatch . " mismatches\n";
print "The ratio of hits to misses is " . $myMatch . ":" . $myMissMatch . ".\n";
Of course, you'd probably want to read the sequence from something else on the fly instead of hard-coding the array. But you get the idea. With the above code your output will be:
torgis-MacBook-Pro:platform-tools torgis$ ./dna.pl
Comparing A <=> T
Comparing T <=> A
Comparing C <=> C MATCH
Comparing G <=> G MATCH
Comparing T <=> A
Comparing A <=> A MATCH
Comparing C <=> C MATCH
You have 4 matches.
You have 3 mismatches
The ratio of hits to misses is 4:3.
So many ways to do this. Here's one.
use strict;
use warnings;
my $seq1 = "ATCGTAC";
my $seq2 = "TACGAAC";
my $len = length $seq1;
my $matches = 0;
for my $i (0..$len-1) {
$matches++ if substr($seq1, $i, 1) eq substr($seq2, $i, 1);
}
printf "Length: %d Matches: %d Ratio: %5.3f\n", $len, $matches, $matches/$len;
exit 0;
Just grab the length of one of the strings (we're assuming string lengths are equal, right?), and then iterate using substr.
my #strings = ( 'ATCGTAC', 'TACGAAC' );
my $matched;
foreach my $ix ( 0 .. length( $strings[0] ) - 1 ) {
$matched++
if substr( $strings[0], $ix, 1 ) eq substr( $strings[1], $ix, 1 );
}
print "Matches: $matched\n";
print "Mismatches: ", length( $strings[0] ) - $matched, "\n";
I think substr is the way to go, rather than splitting the strings into arrays.
This is probably most convenient if presented as a subroutine:
use strict;
use warnings;
print ratio(qw/ ATCGTAC TACGAAC /);
sub ratio {
my ($aa, $bb) = #_;
my $total = length $aa;
my $matches = 0;
for (0 .. $total-1) {
$matches++ if substr($aa, $_, 1) eq substr($bb, $_, 1);
}
$matches / ($total - $matches);
}
output
1.33333333333333
Bill Ruppert's right that there are many way to do this. Here's another:
use Modern::Perl;
say compDNAseq( 'ATCGTAC', 'TACGAAC' );
sub compDNAseq {
my $total = my $i = 0;
$total += substr( $_[1], $i++, 1 ) eq $1 while $_[0] =~ /(.)/g;
sprintf '%.2f', $total / ( $i - $total );
}
Output:
1.33
Here is an approach which gives a NULL, \0, for each match in an xor comparison.
#!/usr/bin/perl
use strict;
use warnings;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my $len = length $d1; # assumes $d1 and $d2 are the same length
my $matches = () = ($d1 ^ $d2) =~ /\0/g;
printf "ratio of %f", $matches / ($len - $matches);
Output: ratio of 1.333333
I am very new to Perl, and I am trying to write a word frequency counter as a learning exercise.
However, I am not able to figure out the error in my code below, after working on it. This is my code:
$wa = "A word frequency counter.";
#wordArray = split("",$wa);
$num = length($wa);
$word = "";
$flag = 1; # 0 if previous character was an alphabet and 1 if it was a blank.
%wordCount = ("null" => 0);
if ($num == -1) {
print "There are no words.\n";
} else {
print "$length";
for $i (0 .. $num) {
if(($wordArray[$i]!=' ') && ($flag==1)) { # start of a new word.
print "here";
$word = $wordArray[$i];
$flag = 0;
} elsif ($wordArray[$i]!=' ' && $flag==0) { # continuation of a word.
$word = $word . $wordArray[$i];
} elsif ($wordArray[$i]==' '&& $flag==0) { # end of a word.
$word = $word . $wordArray[$i];
$flag = 1;
$wordCount{$word}++;
print "\nword: $word";
} elsif ($wordArray[$i]==" " && $flag==1) { # series of blanks.
# do nothing.
}
}
for $i (keys %wordCount) {
print " \nword: $i - count: $wordCount{$i} ";
}
}
It's neither printing "here", nor the words. I am not worried about optimization at this point, though any input in that direction would also be much appreciated.
This is a good example of a problem where Perl will help you work out what's wrong if you just ask it for help. Get used to always adding the lines:
use strict;
use warnings;
to the top of your Perl programs.
Fist off,
$wordArray[$i]!=' '
should be
$wordArray[$i] ne ' '
according to the Perl documentation for comparing strings and characters. Basically use numeric operators (==, >=, …) for numbers, and string operators for text (eq, ne, lt, …).
Also, you could do
#wordArray = split(" ",$wa);
instead of
#wordArray = split("",$wa);
and then #wordArray wouldn't need to do the wonky character checking and you never would have had the problem. #wordArray will be split into the words already and you'll just have to count the occurrences.
You seem to be writing C in Perl. The difference is not just one of style. By exploding a string into a an array of individual characters, you cause the memory footprint of your script to explode as well.
Also, you need to think about what constitutes a word. Below, I am not suggesting that any \w+ is a word, rather pointing out the difference between \S+ and \w+.
#!/usr/bin/env perl
use strict; use warnings;
use YAML;
my $src = '$wa = "A word frequency counter.";';
print Dump count_words(\$src, 'w');
print Dump count_words(\$src, 'S');
sub count_words {
my $src = shift;
my $class = sprintf '\%s+', shift;
my %counts;
while ($$src =~ /(?<sequence> $class)/gx) {
$counts{ $+{sequence} } += 1;
}
return \%counts;
}
Output:
---
A: 1
counter: 1
frequency: 1
wa: 1
word: 1
---
'"A': 1
$wa: 1
=: 1
counter.";: 1
frequency: 1
word: 1