How to troubleshoot a very slow Perl script that runs quickly on one server but slowly on another - linux

I received a Perl script that is supposed to filter probes for Hi-C data, but it runs extremely slowly (several days). The person who gave me the script told me it only took a few hours on her previous lab's server, so I am wondering what could cause such a difference in run time, and how I would go about making it run faster. My knowledge of Perl is very limited, but my Google searching so far suggests I may need to change some Perl configuration files? Below is the function that takes the longest in the whole script; the input file contains chromosome targets and locations. Any help would be greatly appreciated!
I have tried running with Perl 5.16 and 5.26 to see if maybe the version was too old, but it took just as long.
sub get_restriction_fragments_for_targets {
    my @targets = @_;

    # Split by chromosome and sort
    my %targets;
    foreach my $target (@targets) {
        push @{$targets{$target->{chr}}}, $target;
    }

    foreach my $chr (keys %targets) {
        my @sorted = sort { $a->{start} <=> $b->{start} } @{$targets{$chr}};

        # We also need to merge overlapping capture regions
        my @merged;
        my $last_region;
        foreach my $region (@sorted) {
            unless ($last_region) {
                $last_region = $region;
                next;
            }
            if ($region->{start} < $last_region->{end}) {
                # Merge
                if ($region->{end} > $last_region->{end}) {
                    $last_region->{end} = $region->{end};
                }
                next;
            }
            push @merged, $last_region;
            $last_region = $region;
        }
        push @merged, $last_region;
        $targets{$chr} = \@merged;
    }

    my @target_fragments;

    open (IN, 'all_restriction_fragments.txt') or die $!;

    my $last_chr   = "";
    my $last_index = 0;

    while (<IN>) {
        chomp;
        my $line = $_;
        my ($chr, $start, $end) = split(/\t/);

        if ($chr ne $last_chr) {
            warn "Moving to $chr\n";
            $last_chr   = $chr;
            $last_index = 0;
        }

        next unless (exists $targets{$chr});

        my @local_targets = @{$targets{$chr}};

        foreach my $index ($last_index .. $#local_targets) {
            my $target = $local_targets[$index];

            if ($target->{end} < $start) {
                $last_index = $index;
                next;
            }
            if ($target->{start} > $end) {
                last;
            }

            push @target_fragments, {
                id    => $target->{id},
                chr   => $chr,
                start => $start,
                end   => $end,
            };
            last;
        }
    }

    return @target_fragments;
}

One can't tell without knowing a lot more about the problem, first of all the size of the data structures and of the file that's processed.
Here is a generic comment on what would in principle be the most time-consuming operations in what the code seems to be doing, in (probably) decreasing order of importance and/or likelihood:
How big is the file? That's disk access, and it can take any amount of time.
Sorting in a loop: how big is the data that is getting sorted? A sort is time-consuming.
There's that "tap dance" around @sorted, which results in a data copy: how much data?
In general, there is a lot of data copying: how large is @targets passed into the routine?
As you can see, each factor comes with the question of how large the data is. Disk access is of course costly, but a lot of data copying between various data structures in the program is just as important.
So if you can provide some detail, that will help us give you a more detailed analysis/guess.
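One concrete place where that copying shows up in the posted code: @{$targets{$chr}} is copied into @local_targets for every single line read from the fragment file. A minimal sketch of how to avoid that (keeping the original variable names), plus a profiler run with Devel::NYTProf, if it is installed, which would confirm where the time actually goes; the script name below is just a placeholder:
# profile first, guess second (requires Devel::NYTProf from CPAN):
#   perl -d:NYTProf filter_probes.pl
#   nytprofhtml
#
# inside the while(<IN>) loop: keep a reference instead of copying the list
my $local_targets = $targets{$chr};              # no copy, just a reference

foreach my $index ($last_index .. $#{$local_targets}) {
    my $target = $local_targets->[$index];
    # ... rest of the loop body unchanged
}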

Related

Why does this Perl script gradually run out of memory

I have trouble running a Perl script with multiple threads. It continually consumes memory, and eventually the system runs out of memory and kills it. It seems that the sub-threads were detached but the system resources were not released when they finished. I am pretty new to Perl and couldn't find which part went wrong. This is the part of the script that may cause the problem. Could anyone help me with this?
use strict;
use warnings;

print "different number:\t";
my $num1 = <>;
chomp $num1;
if ($num1 !~ /[1 2 3 4 5]/)
{
    print "invalid input number\n";
    END;
}

my $i = 0;
my $no;
my @spacer1;
my $nn;
my @spacer2;

open IN, "file1.txt" or die "$!";
while (<IN>)
{
    chomp;
    if ($_ =~ /^>((\d)+)\|((\d)+)/)
    {
        $no = $1;
        $spacer1[$no][0] = $3;
    }
    else
    {
        $spacer1[$no][1] = $_;
    }
}
close IN;

open IN, "file2.txt" or die "$!";
while (<IN>)
{
    chomp;
    if ($_ =~ /^>((\d)+)\|((\d)+)/)
    {
        $nn = $1;
        $spacer2[$nn][0] = $3;
    }
    else
    {
        $spacer2[$nn][1] = $_;
    }
}
close IN;

#-----------------------------------------------------------------#create threads
use subs qw(sg_ana);
use threads;
use Thread::Semaphore;
my $cycl = (int($no / 10000)) + 1;
my $c;
my @thd;
my $thread_limit = Thread::Semaphore->new(3);
foreach $c (1 .. $cycl)
{
    $thread_limit->down();
    $thd[$c] = threads->create("sg_ana", $c - 1, $c, $num1);
    $thd[$c]->detach();
}
&waitquit;

#-------------------------------------------------------------#limit thread count
sub waitquit
{
    print "waiting\n";
    my $num = 0;
    while ($num < 3)
    {
        $thread_limit->down();
        $num++;
    }
}

#---------------------------------------------------------------#alignment
my $n;
my $n1;
my $j;
my $k;
my $l;
my $m;
my $num;        # number of matches
my $num2 = 0;   # arrange num
sub sg_ana
{
    my $c1 = shift;
    my $c2 = shift;
    $num1 = shift;
    open OUT, ">$num1.$c2.txt" or die "$!";
    if ($num1 == 1)
    {
        foreach $n ($c1 * 10000 .. $c2 * 10000 - 1)
        {
            if ($spacer2[$n][1])
            {
                my $presult1;
                my $presult2;
                $num2 = -1;
                foreach $i (0 .. 19)
                {
                    $num = 0;
                    $num2++;
                    my $tmp1 = (substr $spacer2[$n][1], 0, $i) . "\\" . "w" . (substr $spacer2[$n][1], $i + 1, 19 - $i);
                    foreach $n1 (0 .. @spacer1 - 1)
                    {
                        if ($spacer1[$n1][1])
                        {
                            my $tmp2 = substr $spacer1[$n1][1], 0, 20;
                            if ($tmp2 =~ /$tmp1/)
                            {
                                $num++;
                                $presult1 .= $n1 . ",";
                            }
                        }
                    }
                    $presult2 = $i + 1;
                    if ($num >= 4)
                    {
                        print OUT "\n";
                    }
                }
            }
        }
    }
    close OUT;
    $thread_limit->up();
}
Rule one of debugging Perl is to enable use strict; and use warnings; and then sort out the errors. Actually, you should probably do that first of all, before you even start writing code.
You're creating and limiting threads via a semaphore, but this is really inefficient because of how Perl does threads: they aren't lightweight, so spawning lots of them is a bad idea. A better way of doing this is via Thread::Queue, a bit like the sketch below.
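A rough sketch of that pattern, reusing the question's sg_ana, $cycl and $num1, with a small fixed pool of three workers fed from a queue (the pool size matches the semaphore limit above; the sketch assumes sg_ana no longer touches the semaphore):
use threads;
use Thread::Queue;

my $q = Thread::Queue->new();

# small fixed pool instead of one detached thread per chunk
my @workers = map {
    threads->create(sub {
        while (defined(my $job = $q->dequeue())) {
            sg_ana(@$job);    # assumes sg_ana no longer calls ->up()/->down()
        }
    });
} 1 .. 3;

# queue up all the work items
$q->enqueue([ $_ - 1, $_, $num1 ]) for 1 .. $cycl;

$q->end();                    # no more work is coming
$_->join() for @workers;      # wait for the pool to drain and exit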
Please use the three-argument open and lexical filehandles, e.g. open ( my $out, '>', "$num.$c2.txt" ) or die $!;. You're probably getting away with it here, but you have OUT as a global namespace variable being used by multiple threads. That way lies dragons.
Don't use single-letter variables. Given how you use $c, you'd be far better off with:
foreach my $value ( 1..$cycl ) {
    ## do stuff
}
The same is true of all your other single-letter variables, though; they're not meaningful.
You pass $num before it's initialised, so it's always going to be undef within your sub. So your actual subroutine is effectively just:
sub sg_ana
{
    my $c1 = shift;
    my $c2 = shift;
    $num1 = shift;
    open OUT, ">$num1.$c2.txt" or die "$!";
    close OUT;
    $semaphore->up();
}
Looking at it, I think you may be trying to do something with a shared variable there, but you're not actually sharing it. I can't decode the logic of your program, though (most likely thanks to the load of single-letter variables), so I can't say for sure.
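If a shared counter is what was intended, threads::shared is the mechanism for it; a minimal sketch (the variable and sub names are made up for illustration):
use threads;
use threads::shared;

my $match_count :shared = 0;   # visible to, and updatable from, every thread

sub record_match {
    lock($match_count);        # serialise concurrent updates
    $match_count++;
}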
You're calling a subroutine as &waitquit;. That's not good style: prefixing with an ampersand and supplying no arguments does something subtly different from invoking the sub 'normally' (the caller's current @_ is passed straight through), so you should avoid it and just write waitquit();.
Don't instantiate your semaphore like this:
my $semaphore=new Thread::Semaphore(3);
That's indirect object notation, and bad style. It would be better written as:
my $thread_limit = Thread::Semaphore->new(3);
I would suggest that rather than using semaphores like that, you'd be far better off not detaching your threads and just using join. You also don't need an array of threads; threads->list does that for you.
I can't reproduce your problem, because your sub isn't doing anything. Have you by any chance modified it for posting? But a classic reason for Perl memory exhaustion when threading is that each thread clones the parent process, so 100 threads means roughly 100 times the memory.

Address certain core for threads in Perl

I have a list of 40 files, which I want to modify through my script.
Since every file is processed in the same way, I want to use threads to speed it up.
Therefore I have this construct :
my $threads_ = sub
{
    while (defined(my $taskRef = $q->dequeue()))
    {
        my $work = shift(@{$workRef});
        &{\&{$work}}(@{$workRef});
        my $open = $q->open() - 1;
    }
};
my @Working;
for (my $i = 1; $i < 8; $i++)
{
    push @Working, threads->new($threads_);
}
And I have this code that queues up a piece of work for every file:
foreach my $File (@Filelist)
{
    $q->enqueue(['mySub', $FirstVar, $SecondVar]);
}
But it still takes way too long.
My question is: is there a way to assign each thread to a single core, in order to speed it up?
I'd use Parallel::ForkManager for something like this; it works great. I'd recommend not brewing your own when an accepted standard solution exists. By "address certain core", I take you to mean limiting the number of concurrent tasks to the number of available processors, and ForkManager will do this for you: just set the maximum number of processes when you initialize your ForkManager object.
The commenters above were absolutely correct to point out that I/O will eventually limit your throughput, but it's easy enough to determine the point at which adding more processes fails to speed things up.
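For reference, a minimal sketch of that setup, assuming each file can be handled independently by some process_file() routine (a made-up name standing in for whatever 'mySub' does):
use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(8);   # cap at roughly the number of cores

foreach my $file (@Filelist) {
    $pm->start and next;                  # parent: move on to the next file
    process_file($file);                  # child: do the per-file work
    $pm->finish;                          # child exits here
}
$pm->wait_all_children;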

How do I queue perl subroutines to a thread queue instead of data?

Background:
While reading up on how to multithread my Perl script, I came across the following (from http://perldoc.perl.org/threads.html#BUGS-AND-LIMITATIONS):
On most systems, frequent and continual creation and destruction of
threads can lead to ever-increasing growth in the memory footprint of
the Perl interpreter. While it is simple to just launch threads and
then ->join() or ->detach() them, for long-lived applications, it is
better to maintain a pool of threads, and to reuse them for the work
needed, using queues to notify threads of pending work.
My script will be long-lived; it's a PKI LDAP directory monitoring daemon that will always be running. The enterprise monitoring solution will generate an alarm if it stops running for any reason. My script will check that I can reach another PKI LDAP directory, as well as validate revocation lists on both.
Problem: Everything I can find on google shows passing variables (e.g. scalars) to the thread queue rather than the subroutine itself... I think I'm just not understanding how to implement a thread queue properly compared to how you implement a thread (without queues).
Question 1: How can I "maintain a pool of threads" to keep the Perl interpreter from slowly eating up more and more memory?
Question 2: (Unrelated, but while I have this code posted) Is there a safe amount of sleep at the end of the main program so that I don't start a thread more than once in a minute? 60 seems obvious, but could that ever cause it to run more than once if the loop is fast, or perhaps miss a minute because of processing time or something?
Thanks in advance!
#!/usr/bin/perl
use feature ":5.10";
use warnings;
use strict;
use threads;
use Proc::Daemon;
#
### Global Variables
use constant false => 0;
use constant true  => 1;
my $app = $0;
my $continue = true;
$SIG{TERM} = sub { $continue = false };
# Directory Server Agent (DSA) info
my @ListOfDSAs = (
    { name => "Myself (inbound)",
      host => "ldap.myco.ca",
      base => "ou=mydir,o=myco,c=ca",
    },
    { name => "Company 2",
      host => "ldap.comp2.ca",
      base => "ou=their-dir,o=comp2,c=ca",
    }
);
#
### Subroutines
sub checkConnections
{   # runs every 5 minutes
    my (@DSAs, $logfile) = @_;
    # Code to ldapsearch
    threads->detach();
}
sub validateRevocationLists
{   # runs every hour on minute xx:55
    my (@DSAs, $logfile) = @_;
    # Code to validate CRLs haven't expired, etc
    threads->detach();
}
#
### Main program
Proc::Daemon::Init;
while ($continue)
{
    my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);
    # Question 1: Queues??
    if ($min % 5 == 0 || $min == 0)
    { threads->create(&checkConnections, @ListOfDSAs, "/var/connect.log"); }
    if ($min % 55 == 0)
    { threads->create(&validateRevocationLists, @ListOfDSAs, "/var/RLs.log"); }
    sleep 60; # Question 2: Safer/better way to prevent multiple threads being started for same check in one matching minute?
}
# TERM RECEIVED
exit 0;
__END__
use threads;
use Thread::Queue 3.01 qw( );

my $check_conn_q      = Thread::Queue->new();
my $validate_revoke_q = Thread::Queue->new();

my @threads;
push @threads, async {
    while (my $job = $check_conn_q->dequeue()) {
        check_conn(@$job);
    }
};
push @threads, async {
    while (my $job = $validate_revoke_q->dequeue()) {
        validate_revoke(@$job);
    }
};

while ($continue) {
    my ($S,$M,$H,$m,$d,$Y) = localtime; $m+=1; $Y+=1900;

    $check_conn_q->enqueue([ @ListOfDSAs, "/var/connect.log" ])
        if $M % 5 == 0;

    $validate_revoke_q->enqueue([ @ListOfDSAs, "/var/RLs.log" ])
        if $M == 55;

    sleep 30;
}

$check_conn_q->end();
$validate_revoke_q->end();
$_->join for @threads;
I'm not sure parallelisation is needed here. If it's not, you could simply use
use List::Util qw( min );

sub sleep_until {
    my ($until) = @_;
    my $time = time;
    return if $time >= $until;
    sleep($until - $time);
}

my $next_check_conn = my $next_validate_revoke = time;
while ($continue) {
    sleep_until min $next_check_conn, $next_validate_revoke;
    last if !$continue;
    my $time = time;
    if ($time >= $next_check_conn) {
        check_conn(@ListOfDSAs, "/var/connect.log");
        $next_check_conn = time + 5*60;
    }
    if ($time >= $next_validate_revoke) {
        validate_revoke(@ListOfDSAs, "/var/RLs.log");
        $next_validate_revoke = time + 60*60;
    }
}
I would recommend just running the checks one at a time, as there does not appear to be a compelling reason to use threads here, and you don't want to add unnecessary complexity to a program that will be running all the time.
If you do want to learn how to use a thread pool, there are examples included with the threads module. There is also a Thread::Pool module that may be useful.
As for ensuring you don't repeat the checks in the same minute, you are correct that sleeping for 60 seconds will be inadequate. No matter what value you choose to sleep, you will have edge cases in which it fails: either it will be slightly shorter than a minute, and you will occasionally have two checks in the same minute, or it will be slightly longer than a minute, and you will occasionally miss a check altogether.
Instead, use a variable to remember when the task was last done. You can then use a shorter sleep time without worrying about multiple checks per minute.
my $last_task_time = -1;
while ($continue)
{
    my $min = (localtime(time))[1];
    if ($last_task_time != $min &&
        ($min % 5 == 0 || $min > ($last_task_time + 5) % 60))
    {
        # Check connections here.
        if ($min == 55 || ($last_task_time < 55 && $min > 55))
        {
            # Validate revocation lists here.
        }
        $last_task_time = $min;
    }
    else
    {
        sleep 55; # Ensures there is at least one check per minute.
    }
}
Update: I fixed the code so that it will recover if the last task runs too long. This is fine if it occasionally takes a long time; if the tasks frequently take longer than five minutes, though, you need a different solution (threads would probably make sense in that case).

Perl script execution keeps getting killed - running out of memory

I am trying to execute a perl script that processes a small 12 x 2 text file (approx. 260 bytes) and a large .bedgraph file (at least 1.3 MB in size). From these two files, the script outputs a new bedgraph file.
I have run this script on 3 other .bedgraph files, but when I try to run it on the rest of them the process keeps getting killed.
It should take about 20 minutes on average for the Perl script to run on each of the .bedgraph files.
I'm running the Perl script on my local machine (not on a server), on 64-bit Ubuntu 12.04 Linux with 4 GB of RAM.
Why does my Perl script keep getting killed, and how can I fix this?
Here's the script:
# input file handle
open(my $sizes_fh, '<', 'S_lycopersicum_chromosomes.size') or die $!;

# output file handles
open(my $output, '+>', 'tendaysafterbreaker_output.bedgraph') or die $!;

my @array;

while (<$sizes_fh>) {
    chomp;
    my ($chrom1, $size) = split(/\t/, $_);
    @array = (0) x $size;

    open(my $bedgraph_fh, '<', 'Solanum_lycopersicum_tendaysafterbreaker.bedgraph') or die $!;
    while (<$bedgraph_fh>) {
        chomp;
        my ($chrom2, $start, $end, $FPKM) = split(/\t/, $_);
        if ($chrom1 eq $chrom2) {
            for (my $i = $start; $i < $end; $i++) {
                $array[$i] += $FPKM;
            }
        }
    }
    close $bedgraph_fh or warn $!;

    my ($last_start, $last_end) = 0;
    my $last_value = $array[0];
    for (my $i = 1; $i < $#array; $i++) {
        my $curr_val = $array[$i];
        my $curr_pos = $i;
        # if the current value is not equal to the last value
        if ($curr_val != $last_value) {
            my $last_value = $curr_val;
            print $output "$chrom1\t$last_start\t$last_end\t$last_value\n";
            $last_start = $last_end = $curr_pos;
        } else {
            $last_end = $i;
        }
    }
}
close $sizes_fh or warn $!;
close $sizes_fh or warn $!;
You are trying to allocate an array of 90,000,000 elements. Perl, due to its flexible typing and other advanced variable features, uses a lot more memory for this than you would expect.
On my (Windows 7) machine, a program that just allocates such an array and does nothing else eats up 3.5 GB of RAM.
There are various ways to avoid this huge memory usage. Here are a couple:
The PDL module for scientific data processing, which is designed to efficiently store huge numeric arrays in memory. This will change the syntax for allocating and using the array, though (and it messes around with Perl's syntax in various other ways).
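A rough sketch of what the allocation and increment might look like with PDL (assuming PDL is installed; this only shows the shape of the change, not a drop-in replacement for the whole script):
use PDL;

# one packed array of 32-bit floats instead of ~90 million Perl scalars
my $coverage = zeroes(float, $size);

# add $FPKM to every position in [$start, $end); the slice shares data
# with its parent, so updating it updates $coverage
my $region = $coverage->slice("$start:" . ($end - 1));
$region += $FPKM;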
DBM::Deep is a module that allocates a database in a file, and then lets you access that database through a normal array or hash:
use DBM::Deep;
my @array;
my $db = tie @array, "DBM::Deep", "array.db";
# Now you can use @array like a normal array, but it will be stored in a database.
If you know a bit of C, it is quite simple to offload the array manipulation into low-level code. Using a C array takes less space and is a lot faster. However, you lose nice stuff like bounds checking. Here is an implementation with Inline::C:
use Inline 'C';
...;
__END__
__C__

// note: I don't know if your data contains only ints or doubles. Adjust types as needed
int array_len = -1; // last index
int *array = NULL;

void make_array(int size) {
    free(array);
    // if this fails, start checking return value of malloc for != NULL
    array = (int*) malloc(sizeof(int) * size);
    array_len = size - 1;
}

// returns false on bounds error
int array_increment(int start, int end, int fpkm) {
    if ((end - 1) > array_len) return 0;
    int i;
    for (i = start; i < end; i++) {
        array[i] += fpkm;
    }
    return 1;
}

// please check if this is actually equivalent to your code.
// I removed some unnecessary-looking variables.
void loop_over_array(char* chrom1) {
    int
        i,
        last_start = 0,
        last_end   = 0,
        last_value = array[0];
    for (i = 1; i < array_len; i++) { // are you sure not `i <= array_len`?
        if (array[i] != last_value) {
            last_value = array[i];
            // I don't know how to use Perl filehandles from C,
            // so just redirect the output on the command line
            printf("%s\t%d\t%d\t%d\n", chrom1, last_start, last_end, last_value);
            last_start = i;
        }
        last_end = i;
    }
}

void free_array() {
    free(array);
}
Minimal testing code:
use Test::More;
make_array(15);
ok !array_increment(0, 16, 2);
make_array(95_000_000);
ok array_increment(0, 3, 1);
ok array_increment(2, 95_000_000, 1);
loop_over_array("chrom");
free_array();
done_testing;
The output of this test case is
chrom 0 1 2
chrom 2 2 1
(with testing output removed). It may take a second to compile, but after that it should be quite fast.
In the records read from $bedgraph_fh, what's a typical value for $start? Although hashes have more overhead per entry than arrays, you may be able to save some memory if @array starts with a lot of unused entries. For example, if you have an @array of 90 million elements but the first 80 million are never used, then there's a good chance you'll be better off with a hash.
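A rough sketch of the hash-based variant, assuming the used positions really are sparse (illustrative only, not a tested drop-in):
my %coverage;   # key: position, value: summed FPKM

while (<$bedgraph_fh>) {
    chomp;
    my ($chrom2, $start, $end, $FPKM) = split /\t/;
    next unless $chrom1 eq $chrom2;
    $coverage{$_} += $FPKM for $start .. $end - 1;
}

# only positions that actually occurred are stored as keys
for my $pos (sort { $a <=> $b } keys %coverage) {
    # ... same run-length output logic as before, reading $coverage{$pos}
}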
Other than that, I don't see any obvious cases of this code holding on to data that's not needed by the algorithm it implements, although, depending on your actual objective, it is possible that there may be an alternative algorithm which doesn't require as much data to be held in memory.
If you really do need to deal with a set of 90 million active data elements, though, then your primary options are going to be either buying a lot of RAM or using some form of database. In the latter case, I'd opt for SQLite (via DBD::SQLite) for simplicity and light weight, but YMMV.

Efficiency of Threading/Forking

So, I was considering using forking or threading to do some simple parallelization. To make sure that it was worth it, I wrote three simple scripts to benchmark sequential versus threaded versus forked execution. I used two very simple methods to initialize an array of arrays, and then another method to find the max element in each array and write it to a file.
Methods:
sub initialize
{
    for (my $i = 0; $i <= 2; $i++)
    {
        for (my $j = 0; $j < 5000000; $j++)
        {
            $array[$i][$j] = $j + $i;
        }
    }
}
sub getMax
{
    my $num    = shift;
    my $array  = shift;
    my $length = scalar(@{$array});
    my $max    = -9**9**9;
    my @ra;
    for (my $i = 0; $i < $length; $i++)
    {
        if ($max < ${$array}[$i])
        {
            $max = ${$array}[$i];
        }
    }
    tie @ra, 'Tie::File', "test.txt" or die;
    $ra[$num] = $max;
}
Sequential:
my $start = Time::HiRes::time();
for (my $count = 0; $count <= 2; $count++)
{
    getMax($count, $array[$count]);
}
my $stop = Time::HiRes::time();
my $duration = $stop - $start;
print "Time spent: $duration\n";
print "End of main program\n";
Threading:
my @threads = ();
my $start = Time::HiRes::time();
for (my $count = 0; $count <= 2; $count++)
{
    my $t = threads->new(\&getMax, $count, $array[$count]);
    push(@threads, $t);
}
foreach (@threads)
{
    my $num = $_->join;
}
my $stop = Time::HiRes::time();
my $duration = $stop - $start;
print "Time spent: $duration\n";
print "End of main program\n";
Forking:
my $pm = Parallel::ForkManager->new(3);
my $start = Time::HiRes::time();
for (my $count = 0; $count <= 2; $count++)
{
    my $pid = $pm->start and next;
    getMax($count, $array[$count]);
    $pm->finish;
}
$pm->wait_all_children;
my $stop = Time::HiRes::time();
my $duration = $stop - $start;
print "Time spent: $duration\n";
print "\nEnd of main program\n";
Sequential: 2.88 sec
Threading: 4.10 sec
Forking: 3.88 sec
I guess that for my purposes (obviously not this, but something not too much more computationally intensive), threading/forking is not helpful. I understand that the two are not used solely for speed, but I imagine that's one of the benefits, depending on what you're doing. So, my question is: when exactly does threading/forking actually make one's code run faster?
The processor and memory are the fastest components of a computer. Because fast memory is also expensive, disk drives are used to store large amounts of data inexpensively, with the trade-off that they are very much slower to access.
When computer programs rely on data from slow media, the faster components can often be left with nothing to do until the necessary data arrives. The primary use of multithreading is to allow the processor to get on with something else while waiting for a required resource.
The sorts of things that can be done in parallel are
Keeping the user interface functional while waiting for something to complete
Doing multi-processor calculations
Fetching data from multiple internet sites
Reading from multiple disk drives
The important thing about all of these is that multithreading is only advantageous if the threads don't compete with each other for the same resources.
Trying to speed up a disk read by reading half the data in each of two threads, for instance, will not be successful, because there is a bottleneck at the disk controller and a limit to how fast it can return data. But RAID drives can speed things up by reading part of the data from each of several drives at the same time.
In your example, there is only one processor that can do the maximum calculation. Getting several threads doing it doesn't mean the processor can do the work any faster, and in fact it will be slowed down by having to switch between threads. However, if you could arrange for each thread to be run on a separate processor of a multi-processor system you would get an advantage. This technique is often used by audio-visual software to get the maximum speed of processing.
Similarly, fetching data from multiple internet sources in parallel can be very useful, but only until the capacity of the link has been reached, when the threads will start competing with each other for bandwidth.
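For instance, a rough sketch of the "fetch from multiple sites" case using Parallel::ForkManager (the URLs are placeholders; the speed-up comes from overlapping the network waits, not from extra CPU work):
use strict;
use warnings;
use LWP::Simple qw(get);
use Parallel::ForkManager;

my @urls = ('http://example.com/a', 'http://example.com/b', 'http://example.com/c');

my $pm = Parallel::ForkManager->new(scalar @urls);
foreach my $url (@urls) {
    $pm->start and next;            # parent continues the loop
    my $content = get($url);        # child blocks on the network, not the CPU
    warn "fetched $url: " . (defined $content ? length($content) : 0) . " bytes\n";
    $pm->finish;
}
$pm->wait_all_children;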
