I am writing a program that is heavily dependent on the perl's index function to take strings. My objective so to break this string into a series of substrings that do not have any spaces.
The problem that I am having is that in the second half of the loop the index function starts skipping spaces.
I am using the test data:
1 12 a 2 5 P Q
I am expecting to get:
1, 12, a, 2, 5, P, Q,
Instead I get:
1, 12, a, 2 5, P Q,
My code follows:
use 5.010;
use strict;
use warnings;
my $ivalue = <stdin>;
chomp($ivalue);
$ivalue = $ivalue . " ";
my $current;
my $space = 0;
my $safespace = 0;
my $lastspace = 0;
my $closestspace = 0;
my $i = 0;
# Test data - 1 12 a 2 5 P Q
while ($space != -1){
$space = index($ivalue, " ", $space + $i++);
$closestspace = $space - $lastspace;
#print $lastspace . " " . $space . " = " . $closestspace . "; ";
$current = substr($ivalue, $lastspace, $closestspace);
#say "substring = " . $current;
$lastspace = $space + 1;
}
Thanks ahead of time! If anyone has any suggestion on how to improve the way I asked my question or on my code in general those are appreciated as well.
The problem is in this line:
$space = index($ivalue, " ", $space + $i++);
instead of using $i, make that $space + 1
Related
I have a file test.txt:
Stringsplittingskills
I want to read this file and write to another file out.txt with three characters in each line like
Str
ing
spl
itt
ing
ski
lls
What I did
my $string = "test.txt".IO.slurp;
my $start = 0;
my $elements = $string.chars;
# open file in writing mode
my $file_handle = "out.txt".IO.open: :w;
while $start < $elements {
my $line = $string.substr($start,3);
if $line.chars == 3 {
$file_handle.print("$line\n")
} elsif $line.chars < 3 {
$file_handle.print("$line")
}
$start = $start + 3;
}
# close file handle
$file_handle.close
This runs fine when the length of string is not multiple of 3. When the string length is multiple of 3, it inserts extra newline at the end of output file. How can I avoid inserting new line at the end when the string length is multiple of 3?
I tried another shorter approach,
my $string = "test.txt".IO.slurp;
my $file_handle = "out.txt".IO.open: :w;
for $string.comb(3) -> $line {
$file_handle.print("$line\n")
}
Still it suffers from same issue.
I looked for here, here but still unable to solve it.
spurt "out.txt", "test.txt".IO.comb(3).join("\n")
Another approach using substr-rw.
subset PositiveInt of Int where * > 0;
sub break( Str $str is copy, PositiveInt $length )
{
my $i = $length;
while $i < $str.chars
{
$str.substr-rw( $i, 0 ) = "\n";
$i += $length + 1;
}
$str;
}
say break("12345678", 3);
Output
123
456
78
The correct answer is of course to use .comb and .join.
That said, this is how you might fix your code.
You could change the if line to check if it is at the end, and use else.
if $start+3 < $elements {
$file_handle.print("$line\n")
} else {
$file_handle.print($line)
}
Personally I would change it so that only the addition of \n is conditional.
while $start < $elements {
my $line = $string.substr($start,3);
$file_handle.print( $line ~ ( "\n" x ($start+3 < $elements) ));
$start += 3;
}
This works because < returns either True or False.
Since True == 1 and False == 0, the x operator repeats the \n at most once.
'abc' x 1; # 'abc'
'abc' x True; # 'abc'
'abc' x 0; # ''
'abc' x False; # ''
If you were very cautious you could use x+?.
(Which is actually 3 separate operators.)
'abc' x 3; # 'abcabcabc'
'abc' x+? 3; # 'abc'
infix:« x »( 'abc', prefix:« + »( prefix:« ? »( 3 ) ) );
I would probably use loop if I were going to structure it like this.
loop ( my $start = 0; $start < $elements ; $start += 3 ) {
my $line = $string.substr($start,3);
$file_handle.print( $line ~ ( "\n" x ($start+3 < $elements) ));
}
Or instead of adding a newline to the end of each line, you could add it to the beginning of every line except the first.
while $start < $elements {
my $line = $string.substr($start,3);
my $nl = "\n";
# clear $nl the first time through
once $nl = "";
$file_handle.print($nl ~ $line);
$start = $start + 3;
}
At the command line prompt, three one-liner solutions below.
Using comb and batch (retains incomplete set of 3 letters at end):
~$ echo 'StringsplittingskillsX' | perl6 -ne '.join.put for .comb.batch(3);'
Str
ing
spl
itt
ing
ski
lls
X
Simplifying (no batch, only comb):
~$ echo 'StringsplittingskillsX' | perl6 -ne '.put for .comb(3);'
Str
ing
spl
itt
ing
ski
lls
X
Alternatively, using comb and rotor (discards incomplete set of 3 letters at end):
~$ echo 'StringsplittingskillsX' | perl6 -ne '.join.put for .comb.rotor(3);'
Str
ing
spl
itt
ing
ski
lls
How do I convert $var = "000000000" to $var = "0_0000_0000" in Perl ?
If the string is always 9 characters long, you can just use substr:
my $var = '000000000';
substr($var, 5, 0) = '_';
substr($var, 1, 0) = '_';
For formatting strings of arbitrary length you could use a function like this:
sub format_str {
my $str = reverse $_[0];
$str =~ s/(.{4})(?=.)/$1_/g;
return scalar reverse $str;
}
my $var = "000000000";
print format_str $var; # "0_0000_0000"
$var = "000000000";
$var2 = substr($var,0,1)."_".substr($var,1,4)."_".substr($var,5);
print $var2;
Assuming you're asking how to insert a _ after the first and fifth characters of a string, the following are a variety of straightforward solutions:
my $in = '000000000';
my $out = substr($var,0,1) . '_' . substr($var,1,4) . '_' . substr($var,5);
my $in = '000000000';
my $out = join('_', substr($var,0,1), substr($var,1,4), substr($var,5));
my $in = '000000000';
my $out = join('_', unpack('a1 a4 a4*', $in));
my $in = '000000000';
my $out = $in =~ s/^(.)(.{4})/${1}_${2}_/sr; # 5.14+
my $in = '000000000';
( my $out = $in ) =~ s/^(.)(.{4})/${1}_${2}_/s;
In-place:
my $var = '000000000';
$var =~ s/^(.)(.{4})/${1}_${2}_/s;
my $var = '000000000';
substr($var, 5, 0) = '_';
substr($var, 1, 0) = '_';
For a solution for any-length string and considering efficiency issues that arise for very-long strings, please see my previous question&answer: How to chunk text "from the back" in perl.
Per suggestion in comment, here is code using the idea in the linked question/answer which answers the OP question:
use integer;
my $la = length($var);
my $r = $la % 4;
my $q = $la / 4;
my $tr = $r ? "a$r" : "";
$var = join "_", unpack "$tr(a4)$q", $var;
Note: change all three 4s for a different grouping size.
If this is a commify problem that is solved in "How can I output my numbers with commas added?", available as perldoc -q 'commas added', then a similar solution will suffice, with extra parameters to define the separator and the size of the interval
You will want to read the perlfaq entry for other alternatives
use strict;
use warnings 'all';
print group_characters(1234567), "\n";
print group_characters('000000000', '_', 4), "\n";
print group_characters('0123456789ABCDEF', ' ', 4), "\n";
sub group_characters {
my ($s, $sep, $n) = #_;
$sep //= ',';
$n //= 3;
1 while $s =~ s/[^$sep]+\K(?=[^$sep]{$n})/$sep/;
$s;
}
output
1,234,567
0_0000_0000
0123 4567 89AB CDEF
I want to calculate the average of 15 files:- ifile1.txt, ifile2.txt, ....., ifile15.txt. Number of columns and rows of each file are same. But some of them are missing values. Part of the data looks as
ifile1.txt ifile2.txt ifile3.txt
3 ? ? ? . 1 2 1 3 . 4 ? ? ? .
1 ? ? ? . 1 ? ? ? . 5 ? ? ? .
4 6 5 2 . 2 5 5 1 . 3 4 3 1 .
5 5 7 1 . 0 0 1 1 . 4 3 4 0 .
. . . . . . . . . . . . . . .
I would like to find a new file which will show the average of these 15 fils without considering the missing values.
ofile.txt
2.66 2 1 3 . (i.e. average of 3 1 4, average of ? 2 ? and so on)
2.33 ? ? ? .
3 5 4.33 1.33 .
3 2.67 4 0.66 .
. . . . .
This question is similar to my earlier question Average of multiple files in shell where the script was
awk 'FNR == 1 { nfiles++; ncols = NF }
{ for (i = 1; i < NF; i++) sum[FNR,i] += $i
if (FNR > maxnr) maxnr = FNR
}
END {
for (line = 1; line <= maxnr; line++)
{
for (col = 1; col < ncols; col++)
printf " %f", sum[line,col]/nfiles;
printf "\n"
}
}' ifile*.txt
But I can't able to modify it.
Use this:
paste ifile*.txt | awk '{n=f=0; for(i=1;i<=NF;i++){if($i*1){f++;n+=$i}}; print n/f}'
paste will show all files side by side
awk calculates the averages per line:
n=f=0; set the variables to 0.
for(i=1;i<=NF;i++) loop trough all the fields.
if($i*1) if the field contains a digit (multiplication by 1 will succeed).
f++;n+=$i increment f (number of fields with digits) and sum up n.
print n/f calculate n/f.
awk '
{
for (i = 1;i <= NF;i++) {
Sum[FNR,i]+=$i
Count[FNR,i]+=$i!="?"
}
}
END {
for( i = 1; i <= FNR; i++){
for( j = 1; j <= NF; j++) printf "%s ", Count[i,j] != 0 ? Sum[i,j]/Count[i,j] : "?"
print ""
}
}
' ifile*
assuming file are correctly feeded (no trailing empty space line, ...)
awk 'FNR == 1 { nfiles++; ncols = NF }
{ for (i = 1; i < NF; i++)
if ( $i != "?" ) { sum[FNR,i] += $i ; count[FNR,i]++ ;}
if (FNR > maxnr) maxnr = FNR
}
END {
for (line = 1; line <= maxnr; line++)
{
for (col = 1; col < ncols; col++)
if ( count[line,col] > 0 ) printf " %f", sum[line,col]/count[line,col];
else printf " ? " ;
printf "\n" ;
}
}' ifile*.txt
I just check the '?' ...
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to plot a gene graph for a DNA sequence say ATGCCGCTGCGC?
Im trying to write a Perl script that compares two DNA sequences (60 characters in length each lets say) in alignment, and then show the ratio of matches to non-matches of the sequences to each other. But i'm not having much luck. if it helps i can upload my code, but its no use. here's an example of what im trying to achieve below.
e.g
A T C G T A C
| | | | | | |
T A C G A A C
So the matches of the above example would be 4. and non-matches are: 3. Giving it a ratio of 4.3.
Any help would be much appreciated. thanks.
in general, please do post your code. It does help. In any case, something like this should do what you are asking:
#!/usr/bin/perl -w
use strict;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my #dna1=split(//,$d1);
my #dna2=split(//,$d2);
my $matches=0;
for (my $i=0; $i<=$#dna1; $i++) {
$matches++ if $dna1[$i] eq $dna2[$i];
}
my $mis=scalar(#dna1)-$matches;
print "Matches/Mismatches: $matches/$mis\n";
Bear in mind though that the ratio of 4 to 3 is most certainly not 4.3 but ~1.3. If you post some information on your input file format I will update my answer to include lines for parsing the sequence from your file.
Normally I'd say "What have you tried" and "upload your code first" because it doesn't seem to be a very difficult problem. But let's give this a shot:
create two arrays, one to hold each sequence:
#sequenceOne = ("A", "T", "C", "G", "T", "A", "C");
#sequenceTwo = ("T", "A", "C", "G", "A", "A", "C");
$myMatch = 0;
$myMissMatch = 0;
for ($i = 0; $i < #sequenceOne; $i++) {
my $output = "Comparing " . $sequenceOne[$i] . " <=> " . $sequenceTwo[$i];
if ($sequenceOne[$i] eq $sequenceTwo[$i]) {
$output .= " MATCH\n";
$myMatch++;
} else {
$myMissMatch++;
$output .= "\n";
}
print $output;
}
print "You have " . $myMatch . " matches.\n";
print "You have " . $myMissMatch . " mismatches\n";
print "The ratio of hits to misses is " . $myMatch . ":" . $myMissMatch . ".\n";
Of course, you'd probably want to read the sequence from something else on the fly instead of hard-coding the array. But you get the idea. With the above code your output will be:
torgis-MacBook-Pro:platform-tools torgis$ ./dna.pl
Comparing A <=> T
Comparing T <=> A
Comparing C <=> C MATCH
Comparing G <=> G MATCH
Comparing T <=> A
Comparing A <=> A MATCH
Comparing C <=> C MATCH
You have 4 matches.
You have 3 mismatches
The ratio of hits to misses is 4:3.
So many ways to do this. Here's one.
use strict;
use warnings;
my $seq1 = "ATCGTAC";
my $seq2 = "TACGAAC";
my $len = length $seq1;
my $matches = 0;
for my $i (0..$len-1) {
$matches++ if substr($seq1, $i, 1) eq substr($seq2, $i, 1);
}
printf "Length: %d Matches: %d Ratio: %5.3f\n", $len, $matches, $matches/$len;
exit 0;
Just grab the length of one of the strings (we're assuming string lengths are equal, right?), and then iterate using substr.
my #strings = ( 'ATCGTAC', 'TACGAAC' );
my $matched;
foreach my $ix ( 0 .. length( $strings[0] ) - 1 ) {
$matched++
if substr( $strings[0], $ix, 1 ) eq substr( $strings[1], $ix, 1 );
}
print "Matches: $matched\n";
print "Mismatches: ", length( $strings[0] ) - $matched, "\n";
I think substr is the way to go, rather than splitting the strings into arrays.
This is probably most convenient if presented as a subroutine:
use strict;
use warnings;
print ratio(qw/ ATCGTAC TACGAAC /);
sub ratio {
my ($aa, $bb) = #_;
my $total = length $aa;
my $matches = 0;
for (0 .. $total-1) {
$matches++ if substr($aa, $_, 1) eq substr($bb, $_, 1);
}
$matches / ($total - $matches);
}
output
1.33333333333333
Bill Ruppert's right that there are many way to do this. Here's another:
use Modern::Perl;
say compDNAseq( 'ATCGTAC', 'TACGAAC' );
sub compDNAseq {
my $total = my $i = 0;
$total += substr( $_[1], $i++, 1 ) eq $1 while $_[0] =~ /(.)/g;
sprintf '%.2f', $total / ( $i - $total );
}
Output:
1.33
Here is an approach which gives a NULL, \0, for each match in an xor comparison.
#!/usr/bin/perl
use strict;
use warnings;
my $d1='ATCGTAC';
my $d2='TACGAAC';
my $len = length $d1; # assumes $d1 and $d2 are the same length
my $matches = () = ($d1 ^ $d2) =~ /\0/g;
printf "ratio of %f", $matches / ($len - $matches);
Output: ratio of 1.333333
I have an array of unknown number of words, with an unknown max length.
I need to convert it to another array, basically turning it into a column word array.
so with this original array:
#original_array = ("first", "base", "Thelongest12", "justAWORD4");
The resluting array would be:
#result_array = ("fbTj","iahu","rses","selt","t oA"," nW"," gO"," eR"," sD"," t4"," 1 "," 2 ");
Actually I will have this:
fbTj
iahu
rses
selt
t oA
nW
gO
eR
sD
t4
1
2
I need to do it in order to make a table, and these words are the table's headers.
I hope I have made myself clear, and appreciate the help you are willing to give.
I tried it with the split function, but I keep messing it up...
EDIT:
Hi all, thanks for all the tips and suggestions! I learned quite much from all of you hence the upvote. However I found tchrist's answer more convenient, maybe because I come from a c,c# background... :)
use strict;
use warnings;
use 5.010;
use Algorithm::Loops 'MapCarU';
my #original_array = ("first", "base", "Thelongest12", "justAWORD4");
my #result_array = MapCarU { join '', map $_//' ', #_ } map [split //], #original_array;
I have an old program that does this. Maybe you can adapt it:
$ cat /tmp/a
first
base
Thelongest12
justAWORD4
$ rot90 /tmp/a
fbTj
iahu
rses
selt
t oA
nW
gO
eR
sD
t4
1
2
Here’s the source:
$ cat ~/scripts/rot90
#!/usr/bin/perl
# rot90 - tchrist#perl.com
$/ = '';
# uncomment for easier to read, but not reversible:
### #ARGV = map { "fmt -20 $_ |" } #ARGV;
while ( <> ) {
chomp;
#lines = split /\n/;
$MAXCOLS = -1;
for (#lines) { $MAXCOLS = length if $MAXCOLS < length; }
#vlines = ( " " x #lines ) x $MAXCOLS;
for ( $row = 0; $row < #lines; $row++ ) {
for ( $col = 0; $col < $MAXCOLS; $col++ ) {
$char = ( length($lines[$row]) > $col )
? substr($lines[$row], $col, 1)
: ' ';
substr($vlines[$col], $row, 1) = $char;
}
}
for (#vlines) {
# uncomment for easier to read, but again not reversible
### s/(.)/$1 /g;
print $_, "\n";
}
print "\n";
}
use strict;
use warnings;
use List::Util qw(max);
my #original_array = ("first", "base", "Thelongest12", "justAWORD4");
my #new_array = transpose(#original_array);
sub transpose {
my #a = map { [ split // ] } #_;
my $max = max(map $#$_, #a);
my #out;
for my $n (0 .. $max) {
$out[$n] .= defined $a[$_][$n] ? $a[$_][$n] : ' ' for 0 .. $#a;
}
return #out;
}
It can be done easily by this simple one-liner:
perl -le'#l=map{chomp;$m=length if$m<length;$_}<>;for$i(0..$m-1){print map substr($_,$i,1)||" ",#l}'