I have an array like
$arr = array("arif", "arin", "asif", "armin", "arpan");
I want to search the array and get the indices of the elements that match a regex.
In this case I want to get the indices 0, 1, 3, 4, as they match my pattern
$regex = '|^ar|';
Use preg_grep() for that:
<?php
$arr = array("arif", "arin", "asif", "armin", "arpan");
$regex = '|^ar|';
$res = array_keys(preg_grep($regex, $arr));
var_dump($res);
Use the preg_match function while iterating through the input array:
$arr = array("arif", "arin", "asif", "armin", "arpan");
$keys = [];
foreach ($arr as $k => $item) {
if (preg_match('/^ar/', $item)) $keys[] = $k;
}
print_r($keys);
The output:
Array
(
[0] => 0
[1] => 1
[2] => 3
[3] => 4
)
Loop over each item in the array and test whether your regular expression matches the item using preg_match. If it does, add the index to another array of indices; if it doesn't, simply continue. You'll be left with an array of indices.
$words = array("arif", "arin", "asif", "armin", "arpan");
$pattern = '|^ar|';
$indices = array();
foreach ($words as $i => $word) {
// if there is a match
if (preg_match($pattern, $word)) {
// append the current index to the indices array
$indices[] = $i;
}
}
I have an xlsx file that I'm converting into a Perl hash.
Name   Type  Symbol  Colour
JOHN   SUV   X       R
ROB    MPV   Y       B
JAMES  4X4   Y       G
Currently, I can only set the hash super key to the wanted column by using the column number. I can't seem to figure out how to choose it based on the column header.
use Data::Dumper;
use Text::Iconv;
my $converter = Text::Iconv->new("utf-8", "windows-1251");
use Spreadsheet::XLSX;
my $excel = Spreadsheet::XLSX->new('file.xlsx', $converter);
foreach my $sheet (@{$excel->{Worksheet}}) {
if ($sheet->{Name} eq "sheet1"){
my %data;
for my $row ( 0 .. $sheet-> {MaxRow}) {
if ($sheet->{Cells}[0][$col]->{Val} eq "Symbol"){
my $super_key = $sheet->{Cells}[$row][$col]{Val};
}
my $key = $sheet->{Cells}[$row][0]{Val};
my $value = $sheet->{Cells}[$row][2]{Val};
my $value2= $sheet->{Cells}[$row][3]{Val};
$data{$super_key}->{$key}->{$value}=${value2};
}
print Dumper \%data;
}}
The outcome I get is:
$VAR1 = {
'' => {
'JOHN' => {
'SUV' => R
I would like to have:
$VAR1 = {
'X' => {
'JOHN' => {
'SUV' => R
You are missing use strict; in your Perl script. If you had it, you would have seen the error yourself.
Defining $super_key with my inside your if-clause makes the variable go out of scope as soon as you leave the block.
And using a variable $col without defining it doesn't work either.
Better (and probably working) is:
for my $row ( 0 .. $sheet-> {MaxRow}) {
my $super_key;
foreach my $col (0 .. 3) {
if ($sheet->{Cells}[0][$col]->{Val} eq "Symbol"){
$super_key = $sheet->{Cells}[$row][$col]{Val};
}
}
my $key = $sheet->{Cells}[$row][0]{Val};
my $value = $sheet->{Cells}[$row][2]{Val};
my $value2= $sheet->{Cells}[$row][3]{Val};
$data{$super_key}->{$key}->{$value}=${value2};
}
Here I have two arrays
var arry1 = [{id:1,name:"muni"},{id:2,name:"Anji"}, {id:3,name:"vinod"},{id:4,name:"anil"}];
var arry2 = [{id:3},{id:1}];
I want the following results
arry1= [{id:2,name:"Anji"},{id:4,name:"anil"}]
The elements selected in the second array should be removed from the first array.
You can use Array.filter to remove any element that is present in arry2. We can create a Set of the id values to filter out; this will be more efficient for larger arrays:
var arry1 = [{id:1,name:"muni"},{id:2,name:"Anji"}, {id:3,name:"vinod"},{id:4,name:"anil"}];
var arry2 = [{id:3},{id:1}];
// Filter out any elements in arry1 that are also present in arry2, first create a Set of IDs to filter
const idsToFilter = new Set(arry2.map(el => el.id));
const result = arry1.filter(el => !idsToFilter.has(el.id));
console.log("Result:", result)
While removing from an array, you should iterate backwards over it.
for (let i = arry1.length - 1; i >=0; i--) {
...
}
This ensures that no elements are skipped after an element is removed. See also this other question for more info on this.
Now for each element of arry1 we want to check whether it should be removed.
let idsToRemove = arry2.map(e => e.id); // [3, 1]
for (let i = arry1.length - 1; i >=0; i--) {
if (idsToRemove.includes(arry1[i].id)) {
// it should be removed
arry1.splice(i, 1);
}
}
Something like the above should then work for your problem. For easier understanding of the code, I first mapped arry2 to only the IDs, but of course you can also use another loop to see whether there is a match. The most important take-away is that to safely remove from an array while iterating it, you need to iterate from the last to the first element.
Try this, it will work; here filter keeps only those elements of arry1 that do not exist in arry2:
var myArray = arry1.filter(ar => !arry2.find(el => (el.id === ar.id) ))
I modified the code to work with two files. to_search.txt has the strings to be searched for. big_file.fastq has the lines to be searched; if a string is found (2 mismatches allowed, exact length ranging from 8-10, no insertions or deletions), the record is placed in the file for the respective name. So each string is searched for in the second line of every record in big_file.fastq.
# to_search.txt: (length can be from 8-20 characters)
1 TCCCTTGT
2 ACGAGACT
3 GCTGTACG
4 ATCACCAG
5 TGGTCAAC
6 ATCGCACA
7 GTCGTGTA
8 AGCGGAGG
9 ATCCTTTG
10 TACAGCGC
# 2000 searches needed
# big_file.fastq: 2 billion lines (each 4 lines form one record; the string to search for is in the second line of each record).
# Second line can have 100-200 characters
@M04398:19:000000000-APDK3:1:1101:21860:1000 1:N:0:1
TCttTTGTGATCGATCGATCGATCGATCGGTCGTGTAGCCTCCAACCAAGCACCCCATCTGTTCCAAATCTTCTCCCACTGCTACTTGAAGACGCTGAAGTTGAAGGGCCACCTTCATCATTCTGG
+
#8ACCDGGGGGGGGGGGGGEGGGGGGGGDFFEGGG#FFGGGGGGGGGGGGGGGGGCF#<FFGGGGGFE9FGGGFEACE8EFFFGGGGGF9F?CECEFEG,CFGF,CF#CCC?BFFG<,9<9AEFG,,
@M04398:19:000000000-APDK3:1:1101:13382:1000 1:N:0:1
NTCGATCGATCGATCGATCGATCGATCGTTCTGAGAGGTACCAACCAAGCACACCACGGGCGACACAGACAGCTCCGTGTTGAACGGGTTGTTCTTCTTCTTGCCTTCATCATCCCCATCCTCAGTGGACGCAGCTTGCTCATCCTTCCTC
+
#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@M04398:19:000000000-APDK3:1:1101:18888:1000 1:N:0:1
NCAGAATGAGGAAGGATGAGCCCCGTCGTGTCGAAGCTATTGACACAGCGCTATTCCGTCTTTATGTTCACTTTAAGCGGTACAAGGAGCTGCTTGTTCTGATTCAGGAACCGAACCCTGGTGGTGTGCTTGGTTGGCAAGTTTACGGCTC
+
#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGE
Here is the code for two mismatches. I tried with exact matching and the speed is not bad: it takes around a day. I have used the Time::Progress module. With 2 mismatches it shows 115 days to finish. How can the speed be improved here?
#!/usr/bin/perl
use strict;
use warnings;
$| = 1;
open( IN_P1, "big_file.fastq" ) or die "File not found";
my ( @sample_file_names, @barcode1 );
open( BC_FILE, "to_search.txt" ) or die "No barcode file";
my @barcode_file_content = <BC_FILE>;
foreach (@barcode_file_content) {
chomp $_;
$_ =~ s/\r//;
$_ =~ s/\n//;
#print $_;
my @elements = split( "(\t|,| )", $_ );
push @sample_file_names, $elements[0];
push @barcode1, $elements[2];
}
# open FH
my @fh_array_R1;
foreach (@sample_file_names) {
chomp $_;
local *OUT_R1;
open( OUT_R1, ">", "$_\.fq" ) or die "cannot write file";
push @fh_array_R1, *OUT_R1;
}
# unknown barcode file
open( UNKNOWN_R1, ">unknown-barcode_SE.fq" ) or die "cannot create unknown-r1 file";
while ( defined( my $firstp1 = <IN_P1> ) ) {
my $p1_first_line = $firstp1;
my $p1_second_line = <IN_P1>;
my $p1_third_line = <IN_P1>;
my $p1_fourth_line = <IN_P1>;
chomp( $p1_first_line, $p1_second_line, $p1_third_line, $p1_fourth_line, );
my $matched_R1 = "$p1_first_line\n$p1_second_line\n$p1_third_line\n$p1_fourth_line\n";
for ( my $j = 0 ; $j < scalar @barcode1 ; $j++ ) {
chomp $barcode1[$j];
my $barcode1_regex = make_barcode_fragments( $barcode1[$j] );
if ( $p1_second_line =~ /$barcode1_regex/i ) {
# keep if matched
print { $fh_array_R1[$j] } $matched_R1;
last;
}
else {
#print to unknown;
print UNKNOWN_R1 $matched_R1;
}
}
}
# make two-mismatch pattern of barcode
sub make_barcode_fragments {
my ($in1) = @_;
my @subpats;
for my $i ( 0 .. length($in1) - 1 ) {
for my $j ( $i + 1 .. length($in1) - 1 ) {
my $subpat = join( '',
substr( $in1, 0, $i ),
'\\w', substr( $in1, $i + 1, $j - $i - 1 ),
'\\w', substr( $in1, $j + 1 ),
);
push @subpats, $subpat;
}
}
my $pat = join( '|', @subpats );
#print $pat;
return "$pat";
}
exit;
If your algorithm cannot be changed/improved in Perl itself, you can still get speedup by writing the time consuming parts in C. Here is an example using inline C:
use strict;
use warnings;
use Benchmark qw(timethese);
use Inline C => './check_line_c.c';
my $find = "MATCH1";
my $search = "saasdadadadadasd";
my %sub_info = (
c => sub { check_line_c( $find, $search ) },
perl => sub { check_line_perl( $find, $search ) },
);
timethese( 4_000_000, \%sub_info );
sub check_line_perl {
my ($find, $search ) = @_;
my $max_distance = 2;
for my $offset ( 0 .. length($search) - length($find) ) {
my $substr = substr( $search, $offset, length($find) );
my $hd = hd( $find, $substr );
if ( $hd <= $max_distance ) {
return ( $hd, $substr );
}
}
return ( undef, undef );
}
sub hd {
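# XOR the two equal-length strings and count the non-NUL bytes,
# i.e. the number of positions where they differ (Hamming distance).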
return ( $_[0] ^ $_[1] ) =~ tr/\001-\377//;
}
where check_line_c.c is:
void check_line_c( char* find, char * search ) {
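/* Scan search for the first window of length strlen(find) whose mismatch
   count is within max_distance; push that count and the matching substring
   onto the Perl stack, or (undef, undef) if no such window exists. */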
int max_distance = 2;
int flen = strlen(find);
int last_ind = strlen(search) - flen;
SV *dis = &PL_sv_undef;
SV *match = &PL_sv_undef;
for ( int ind = 0; ind <= last_ind; ind++ )
{
int count = 0;
for ( int j = 0; j < flen; j++ )
{
if ( find[j] ^ search[ind+j] ) count++;
}
if ( count <= max_distance )
{
match = newSV(flen);
sv_catpvn(match, search+ind, flen );
dis = newSViv(count);
break;
}
}
Inline_Stack_Vars;
Inline_Stack_Reset;
Inline_Stack_Push(sv_2mortal(dis));
Inline_Stack_Push(sv_2mortal(match));
Inline_Stack_Done;
}
The output is (Ubuntu laptop with an Intel Core i7-4702MQ CPU @ 2.20GHz):
Benchmark: timing 4000000 iterations of c, perl...
c: 2 wallclock secs ( 0.76 usr + 0.00 sys = 0.76 CPU) @ 5263157.89/s (n=4000000)
perl: 19 wallclock secs (18.30 usr + 0.00 sys = 18.30 CPU) @ 218579.23/s (n=4000000)
So this gives a 24-fold speedup for this case.
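Not shown above, but as a rough sketch (assuming the surrounding script from the question, with its @barcode1, $p1_second_line, @fh_array_R1, $matched_R1 and UNKNOWN_R1), the C helper might replace the regex loop like this:
# Hypothetical replacement for the inner barcode loop of the question's script.
my $matched = 0;
for my $j ( 0 .. $#barcode1 ) {
    my ( $dis, $match ) = check_line_c( $barcode1[$j], $p1_second_line );
    if ( defined $dis ) {    # a window within 2 mismatches was found
        print { $fh_array_R1[$j] } $matched_R1;
        $matched = 1;
        last;
    }
}
# write the record to the unknown file only if no barcode matched
print UNKNOWN_R1 $matched_R1 unless $matched;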
I'd suggest creating a really bad hashing algorithm. Something nice and reversible and inefficient, like the sum of the characters. Or maybe the sum of unique values (1-4) represented by the characters.
Compute the target sums, and also compute the maximum allowed variance. That is, if the objective is a match with two substitutions, then what is the maximum possible difference? (4-1 + 4-1 = 6).
Then, for each "window" of text of the appropriate length in the target data file, compute a running score. (Add a character to the end, drop a character from the start, update the hash score.) If the score for a window is within the allowable range, you can do further investigation.
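A minimal sketch of such a rolling-score prefilter, assuming A/C/G/T-only sequences; candidate_offsets() is a hypothetical helper name, and the offsets it returns would still need the exact two-mismatch check:
use strict;
use warnings;

# Rolling-score prefilter sketch: sum of per-base values as the "bad hash".
my %val = ( A => 1, C => 2, G => 3, T => 4 );
my $max_variance = 6;    # two substitutions, each off by at most 4 - 1 = 3

sub candidate_offsets {
    my ( $barcode, $read ) = @_;
    my $len = length $barcode;
    return if length($read) < $len;

    my $target = 0;
    $target += $val{$_} // 0 for split //, $barcode;

    # score of the first window of the read
    my $score = 0;
    $score += $val{ substr( $read, $_, 1 ) } // 0 for 0 .. $len - 1;

    my @candidates;
    for my $i ( 0 .. length($read) - $len ) {
        push @candidates, $i if abs( $score - $target ) <= $max_variance;
        last if $i + $len >= length $read;
        # slide the window: drop the leading character, add the next one
        $score -= $val{ substr( $read, $i,        1 ) } // 0;
        $score += $val{ substr( $read, $i + $len, 1 ) } // 0;
    }
    return @candidates;    # offsets that deserve the full mismatch test
}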
You might want to implement this as different passes. Possibly even as different stages in a shell pipeline or script. The idea being that you might be able to parallelize parts of the search. (For instance, all the match strings with the same length could be searched by one process, since the hash windows are the same.)
Also, of course, it is beneficial that you can keep your early work if your program crashes in the later stages. And you can even have the early parts of the process running while you are still developing the end stages.
I am trying to write a Perl script where a string gets fragmented at every occurrence of 'E', and when the user enters positions of 'C' on the command line (say 3-8, or 3-8,13-18, or any comma-separated list of such 'C' positions in that format if the string is long), the fragments containing 'C' at those positions (say at positions 3 and 8) should be joined and shown in the output. Suppose the string is
"ABCDEABCDEABCDEABCDEABCDE" and the user enters 3-8, then the program output should be:
ABCDEABCDE
ABCDE
ABCDE
ABCDE
I wrote a script where the user enters the 'C' positions on the command line and the string gets cut at every 'E', but after that I'm not able to write it properly. Please help!
Code (edited) that I've written so far is:
use strict;
use warnings;
my $s = 'ABCDEABCDEABCDEABCDEABCDE';
my $i=0;
my @where;
my @array;
my @bond;
my @pos;
my @s_array;
my @s_array2;
for ($i=0; $i<=4; $i++) {
$where[$i] = index($s,"C",$where[$i-1])+1;
push(@array,$where[$i]);
}
print "All Positions of C: @array\n\n";
print "Enter C positions:\n";
my @join_C = <STDIN>;
foreach (@join_C) {
@bond = split (',',$_);
}
foreach (@bond) {
@pos = split ('-', $_);
}
print "entered C positions:@pos\n";
print "Resulting disulfide linked peptides\n\n";
my @a = split(/E/, $s);
my $result = join ("E,", @a);
my @final = split(/,/, $result);
foreach my $final (@final) {
foreach my $pos(@pos) {
my @source = split //, $final[@final];
my $length = @source;
for ($i=0; $i<=$length; $i++) {
if (index($final[$i], "C") == $pos[$i]) {
push (@s_array, $final[$i]);
}
else {
push (@s_array2, $final[$i]);
}
}
}
}
my $lastly_joined = join (',', @s_array);
print "Joined Fragments= @s_array\n";
print "Remaining fragments= @s_array2\n";
I will try to understand what you want to do.
I am trying to write a Perl script where a string get fragmented at every
occurrence of 'E'
Okay, first create the input. Let's use an array to make access to the
elements easier.
my @s = split ('', 'ABCDE' x 5);
I'm not sure how that string will look in your case. Can you please provide a
real-world example?
and when user enters c_pos of 'C' through command line (say 3-8
or 3-8,13-18 or any comma separated such c_pos of 'C' according to
the string in such format if the string is long)
I would suggest using command-line arguments. That makes it easier to use the
script later in a chain with other tools. Pass the arguments to the script:
script.pl 3-8,13-18
So we get a list of pairs:
my @pairs = split (',', join('', @ARGV));
Now you should check that the passed values point to 'C's. Valid combinations
are stored in a hash where the key is the start index and the value is the end
index.
my %c_pos;
foreach my $pair (@pairs) {
my ($from, $to) = split('-', $pair);
if (($string[$from-1] eq 'C') && ($string[$to-1] eq 'C')) {
$c_pos{$from-1} = $to-1;
} else {
warn "position ${from}-${to} not valid => ignored!\n";
}
}
the fragments containing 'C' (say at 3 and 8 positions) should be joined
and shown in the output.
Now we can iterate over the elements of @s. When we hit a start index,
a 'connection' starts, and this connection stays active until its end index
is reached.
We append each character to the current entry.
When we hit an 'E' and we are not inside a 'connection', the current entry is
pushed onto our result and we start with the next, empty entry.
for (my $i=0; $i<@string; $i++) {
if ($c_pos{$i}) {
$inside_connection = 1;
$end = $c_pos{$i};
} elsif ($i == $end) {
$inside_connection = 0;
$end = 0;
}
$entry.=$string[$i];
if ($inside_connection) {
# do not split on 'E'
} elsif ($string[$i] eq 'E') {
# split on 'E'
push @result, $entry;
$entry = '';
}
}
Because I do not know better, I assumed that a chained connection like
3-8,8-13 works as if you had said 3-13. Hope
that fits. Here is the complete script:
use strict;
use warnings;
my @string = split ('', 'ABCDE' x 5);
my @pairs = split (',', join('', @ARGV));
my %c_pos;
foreach my $pair (@pairs) {
my ($from, $to) = split('-', $pair);
if (($string[$from-1] eq 'C') && ($string[$to-1] eq 'C')) {
$c_pos{$from-1} = $to-1;
} else {
warn "position ${from}-${to} not valid => ignored!\n";
}
}
my @result;
my $entry = '';
my $inside_connection = 0;
my $end=0;
for (my $i=0; $i<@string; $i++) {
if ($c_pos{$i}) {
$inside_connection = 1;
$end = $c_pos{$i};
} elsif ($i == $end) {
$inside_connection = 0;
$end = 0;
}
$entry.=$string[$i];
if ($inside_connection) {
# do not split on 'E'
} elsif ($string[$i] eq 'E') {
# split on 'E'
push @result, $entry;
$entry = '';
}
}
print join ("\n", @result);
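Running it with the positions from the question should then reproduce the expected output (assuming the script is saved as script.pl, the name used above):
$ perl script.pl 3-8
ABCDEABCDE
ABCDE
ABCDE
ABCDE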
Is there a method in Perl (not BioPerl) to find the number of each pair of consecutive letters,
i.e., the number of AA, AC, AG, AT, CC, CA, ... occurrences in a sequence like this:
$sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
PS: We can do it manually with a regular expression, e.g. $GC = ($sequence =~ s/GC/GC/g), which returns the number of GC occurrences in the sequence.
I need an automated and generic way.
You had me confused for a while, but I take it you want to count the dinucleotides in a given string.
Code:
use Data::Dumper;
my @dinucs = qw(AA AC AG CC CA CG);
my %count;
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
for my $dinuc (@dinucs) {
$count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
}
print Dumper \%count;
Output from Data::Dumper:
$VAR1 = {
"AC" => 5,
"CC" => "",
"AG" => "",
"AA" => 1,
"CG" => 3,
"CA" => ""
};
Close to TLP's answer, but without substitution:
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs = qw(AA AC AG AT CC CG);
my %count = map{$_ => 0}@dinucs;
for my $dinuc (@dinucs) {
while($sequence=~/$dinuc/g) {
$count{$dinuc}++;
}
}
Benchmark:
use Benchmark qw(cmpthese);
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs = qw(AA AC AG AT CC CG);
my %count = map{$_ => 0}@dinucs;
my $count = -3;
my $r = cmpthese($count, {
'match' => sub {
for my $dinuc (@dinucs) {
while($sequence=~/$dinuc/g) {
$count{$dinuc}++;
}
}
},
'substitute' => sub {
for my $dinuc (@dinucs) {
$count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
}
}
});
Output:
                Rate substitute      match
substitute   13897/s         --       -11%
match        15622/s        12%         --
Regex works if you're careful, but there's a simple solution using substr that will be faster and more flexible.
(As of this posting, the regex solution marked as accepted will fail to correctly count dinucleotides in repeated regions like 'AAAA...', of which there are many in naturally occurring sequences.
Once you match 'AA', the regex search resumes at the third character, skipping the middle 'AA' dinucleotide. This doesn't affect the other dinucleotides, since if you have 'AC' at one position you're guaranteed not to have it starting at the next position. The particular sequence given in the question will not suffer from this problem, since no base appears three times in a row.)
The method I suggest is more flexible in that it can count words of any length; extending the regex method to longer words is complicated since you have to do even more gymnastics with your regex to get an accurate count.
sub substrWise {
my ($seq, $wordLength) = @_;
my $cnt = {};
my $w;
for my $i (0 .. length($seq) - $wordLength) {
$w = substr($seq, $i, $wordLength);
$cnt->{$w}++;
}
return $cnt;
}
sub regexWise {
my ($seq, $dinucs) = @_;
my $cnt = {};
for my $d (@$dinucs) {
if (substr($d, 0,1) eq substr($d, 1,1) ) {
my $n = substr($d, 0,1);
$cnt->{$d} = ($seq =~ s/$n(?=$n)/$n/g); # use look-ahead
} else {
$cnt->{$d} = ($seq =~ s/$d/$d/g);
}
}
return $cnt;
}
my @dinucs = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
use Test::More tests => 1;
my $rWise = regexWise($sequence, \@dinucs);
my $sWise = substrWise($sequence, 2);
$sWise->{$_} //= '' for @dinucs; # substrWise will not create keys for words not found
# this seems like desirable behavior IMO,
# but i'm adding '' to show that the counts match
is_deeply($rWise, $sWise, 'verify equivalence');
use Benchmark qw(:all);
cmpthese(100000, {
'regex' => sub {
regexWise($sequence, \@dinucs);
},
'substr' => sub {
substrWise($sequence, 2);
}
});
Output:
1..1
ok 1 - verify equivalence
           Rate  regex substr
regex   11834/s     --   -85%
substr  76923/s   550%     --
For longer sequences (10-100 kbase), the advantage is not as pronounced, but it still wins by about 70%.