Perl script execution keeps getting killed - running out of memory - linux

I am trying to execute a perl script that processes a small 12 x 2 text file (approx. 260 bytes) and a large .bedgraph file (at least 1.3 MB in size). From these two files, the script outputs a new bedgraph file.
I have ran this script on 3 other .bedgraph files but I try to run it on the rest of them the process keeps getting Killed.
It should take about 20 minutes on average for the perl script to run on each of the .bedgraph files.
I'm running the perl script on my local machine (not from a server). I'm using a Linux OS Ubuntu 12.04 system 64-bit 4GB RAM.
Why does my perl script execution keeps getting killed and how can I fix this?
Here's the script:
# input file handle
open(my $sizes_fh, '<', 'S_lycopersicum_chromosomes.size') or die $!;
# output file handles
open(my $output, '+>', 'tendaysafterbreaker_output.bedgraph') or die $!;
my #array;
while(<$sizes_fh>){
chomp;
my ($chrom1, $size) = split(/\t/, $_);
#array = (0) x $size;
open(my $bedgraph_fh, '<', 'Solanum_lycopersicum_tendaysafterbreaker.bedgraph') or die $!;
while(<$bedgraph_fh>){
chomp;
my ($chrom2, $start, $end, $FPKM) = split(/\t/, $_);
if ($chrom1 eq $chrom2){
for(my $i = $start; $i < $end; $i++){
$array[$i] += $FPKM;
}
}
}
close $bedgraph_fh or warn $!;
my ($last_start, $last_end) = 0;
my $last_value = $array[0];
for (my $i = 1; $i < $#array; $i++){
my $curr_val = $array[$i];
my $curr_pos = $i;
# if the current value is not equal to the last value
if ($curr_val != $last_value){
my $last_value = $curr_val;
print $output "$chrom1\t$last_start\t$last_end\t$last_value\n";
$last_start = $last_end = $curr_pos;
} else {
$last_end = $i;
}
}
}
close $sizes_fh or warn $!;

You are trying to allocate an array of 90,000,000 elements. Perl, due to its flexible typing and other advanced variable features, uses a lot more memory for this than you would expect.
On my (Windows 7) machine, a program that just allocates such an array and does nothing else eats up 3.5 GB of RAM.
There are various ways to avoid this huge memory usage. Here are a couple:
The PDL module for scientific data processing, which is designed to efficiently store huge numeric arrays in memory. This will change the syntax for allocating and using the array, though (and it messes around with Perl's syntax in various other ways).
DBM::Deep is a module that allocates a database in a file--and then lets you access that database through a normal array or hash:
use DBM::Deep;
my #array;
my $db = tie #array, "DBM::Deep", "array.db";
#Now you can use #array like a normal array, but it will be stored in a database.

If you know a bit of C, it is quite simple to offload the array manipulation into low-level code. Using a C array takes less space, and is a lot faster. However, you loose nice stuff like bounds checking. Here is an implementation with Inline::C:
use Inline 'C';
...;
__END__
__C__
// note: I don't know if your data contains only ints or doubles. Adjust types as needed
int array_len = -1; // last index
int *array = NULL;
void make_array(int size) {
free(array);
// if this fails, start checking return value of malloc for != NULL
array = (int*) malloc(sizeof(int) * size);
array_len = size - 1;
}
// returns false on bounds error
int array_increment(int start, int end, int fpkm) {
if ((end - 1) > array_len) return 0;
int i;
for (i = start; i < end; i++) {
array[i] += fpkm;
}
return 1;
}
// please check if this is actually equivalent to your code.
// I removed some unneccessary-looking variables.
void loop_over_array(char* chrom1) {
int
i,
last_start = 0,
last_end = 0,
last_value = array[0];
for(i = 1; i < array_len; i++) { // are you sure not `i <= array_len`?
if (array[i] != last_value) {
last_value = array[i];
// I don't know how to use Perl filehandles from C,
// so just redirect the output on the command line
printf("%s\t%d\t%d\t%d\n", chrom1, last_start, last_end, last_value);
last_start = i;
}
last_end = i;
}
}
void free_array {
free(array);
}
Minimal testing code:
use Test::More;
make_array(15);
ok !array_increment(0, 16, 2);
make_array(95_000_000);
ok array_increment(0, 3, 1);
ok array_increment(2, 95_000_000, 1);
loop_over_array("chrom");
free_array();
done_testing;
The output of this test case is
chrom 0 1 2
chrom 2 2 1
(with testing output removed). It may take a second to compile, but after that it should be quite fast.

In the records read from $bedgraph_fh, what's a typical value for $start? Although hashes have more overhead per entry than arrays, you may be able to save some memory if #array starts with a lot of unused entries. e.g., If you have an #array of 90 million elements, but the first 80 million are never used, then there's a good chance you'll be better off with a hash.
Other than that, I don't see any obvious cases of this code holding on to data that's not needed by the algorithm it implements, although, depending on your actual objective, it is possible that there may be an alternative algorithm which doesn't require as much data to be held in memory.
If you really need to be dealing with a set of 90 million active data elements, though, then your primary options are going to be either buy a lot of RAM or use some form of database. In the latter case, I'd opt for SQLite (via DBD::SQLite) for simplicity and light weight, but YMMV.

Related

How to improve the memory usage in nodejs code?

I tried one code at hackerearth : https://www.hackerearth.com/practice/data-structures/stacks/basics-of-stacks/practice-problems/algorithm/fight-for-laddus/description/
The speed seems fine however the memory usage exceed the 256mb limit by nearly 2.8 times.
In java and python the memory is 5 times less however the time is nearly twice.
What factor can be used to optimise the memory usage in nodejs code implementation?
Here is nodejs implementation:
// Sample code to perform I/O:
process.stdin.resume();
process.stdin.setEncoding("utf-8");
var stdin_input = "";
process.stdin.on("data", function (input) {
stdin_input += input; // Reading input from STDIN
});
process.stdin.on("end", function () {
main(stdin_input);
});
function main(input) {
let arr = input.split("\n");
let testCases = parseInt(arr[0], 10);
arr.splice(0,1);
finalStr = "";
while(testCases > 0){
let inputArray = (arr[arr.length - testCases*2 + 1]).split(" ");
let inputArrayLength = inputArray.length;
testCases = testCases - 1;
frequencyObject = { };
for(let i = 0; i < inputArrayLength; ++i) {
if(!frequencyObject[inputArray[i]])
{
frequencyObject[inputArray[i]] = 0;
}
++frequencyObject[inputArray[i]];
}
let finalArray = [];
finalArray[inputArrayLength-1] = -1;
let stack = [];
stack.push(inputArrayLength-1);
for(let i = inputArrayLength-2; i>=0; i--)
{
let stackLength = stack.length;
while(stackLength > 0 && frequencyObject[inputArray[stack[stackLength-1]]] <= frequencyObject[inputArray[i]])
{
stack.pop();
stackLength--;
}
if (stackLength > 0) {
finalArray[i] = inputArray[stack[stackLength-1]];
} else {
finalArray[i] = -1;
}
stack.push(i);
}
console.log(finalArray.join(" ") + "\n")
}
}
What factor can be used to optimize the memory usage in nodejs code implementation?
Here are some things to consider:
Don't buffer any more input data than you need to before you process it or output it.
Try to avoid making copies of data. Use the data in place if possible. Remember that all string operations create a new string that is likely a copy of the original data. And, many array operations like .map(), .filter(), etc... create new copies of the original array.
Keep in mind that garbage collection is delayed and is typically done during idle time. So, for example, modifying strings in a loop may create a lot of temporary objects that all must exist at once, even though most or all of them will be garbage collected when the loop is done. This creates a poor peak memory usage.
Buffering
The first thing I notice is that you read the entire input file into memory before you process any of it. Right away for large input files, you're going to use a lot of memory. Instead, what you want to do is read enough of a chunk to get the next testCase and then process it.
FYI, this incremental reading/processing will make the code significantly more complicated to write (I've written an implementation myself) because you have to handle partially read lines, but it will hold down memory use a bunch and that's what you asked for.
Copies of Data
After reading the entire input file into memory, you immediately make a copy of it all with this:
let arr = input.split("\n");
So, now you've more than doubled the amount of memory the input data is taking up. Instead of just one string for all the input, you now still have all of that in memory, but you've now broken it up into hundreds of other strings (each with a little overhead of its own for a new string and of course a copy of each line).
Modifying Strings in a Loop
When you're creating your final result which you call finalStr, you're doing this over and over again:
finalStr = finalStr + finalArray.join(" ") + "\n"
This is going to create tons and tons of incremental strings that will likely end up all in memory at once because garbage collection probably won't run until the loop is over. As an example, if you had 100 lines of output that were each 100 characters long so the total output (not count line termination characters) was 100 x 100 = 10,000 characters, then constructing this in a loop like you are would create temporary strings of 100, 200, 300, 400, ... 10,000 which would consume 5000 (avg length) * 100 (number of temporary strings) = 500,000 characters. That's 50x the total output size consumed in temporary string objects.
So, not only does this create tons of incremental strings each one larger than the previous one (since you're adding onto it), it also creates your entire output in memory before writing any of it out to stdout.
Instead, you can incremental output each line to stdout as you construct each line. This will put the worst case memory usage at probably about 2x the output size whereas you're at 50x or worse.

How to troubleshoot very slow perl script that runs faster on one server but slow on another

I received a perl script that is supposed to filter probes for Hi-C data, but it runs sooooooooo slow (several days). The one who gave me the script told me it only took a few hours on her previous lab's server, so I am wondering what would cause such a difference in run time, and how would I go about making it run faster? My knowledge of perl is very limited, but my google search so far suggests I may need to change some perl configuration files? Below is the main function in the whole script that takes the longest, the input file contains chromosome targets and locations. Any help would be greatly appreciated!
I have tried running with perl 5.16 and 5.26 to see if maybe the version was too old, but it took just as long.
sub get_restriction_fragments_for_targets {
my #targets = #_;
# Split by chromsome and sort
my %targets;
foreach my $target (#targets) {
push #{$targets{$target->{chr}}},$target;
}
foreach my $chr (keys %targets) {
my #sorted = sort {$a -> {start} <=> $b -> {start}} #{$targets{$chr}};
# We also need to merge overlapping capture regions
my #merged;
my $last_region;
foreach my $region (#sorted) {
unless ($last_region) {
$last_region = $region;
next;
}
if ($region->{start} < $last_region -> {end}) {
# Merge
if ($region ->{end} > $last_region->{end}) {
$last_region->{end} = $region->{end};
}
next;
}
push #merged,$last_region;
$last_region = $region;
}
push #merged,$last_region;
$targets{$chr} = \#merged;
}
my #target_fragments;
open (IN,'all_restriction_fragments.txt') or die $!;
my $last_chr = "";
my $last_index = 0;
while (<IN>) {
chomp;
my $line = $_;
my ($chr,$start,$end) = split(/\t/);
if ($chr ne $last_chr) {
warn "Moving to $chr\n";
$last_chr = $chr;
$last_index = 0;
}
next unless (exists $targets{$chr});
my #local_targets = #{$targets{$chr}};
foreach my $index ($last_index .. $#local_targets) {
my $target = $local_targets[$index];
if ($target -> {end} < $start) {
$last_index = $index;
next;
}
if ($target -> {start} > $end) {
last;
}
push #target_fragments,{
id => $target->{id},
chr => $chr,
start => $start,
end => $end,
};
last;
}
}
return #target_fragments;
}
One can't tell without knowing a lot more about the problem, in the first place about the size of data structures and of the file that's processed.
Here is a generic comment on what would in principle be most time consuming operations in what the code seems to be doing, in a (probably) decreasing order of importance and/or likelihood
How big is the file? That's disk access and that can take any time
Sorting in a loop -- how big is data that is getting sorted? A sort is time consuming
There's that "tap dance" around #sorted, which results in data copy -- how much data?
In general, there is lots of data copying -- how large is #targets passed into the routine?
As you can see, each factor comes with the question of how large the data is. Disk access is of course costly, but a lot of data copying between various data structures in the program is just as important.
So if you can provide some detail that'll help us give you a more detailed analysis/guess.

How do I queue perl subroutines to a thread queue instead of data?

Background:
In reading how to multithread my perl script, I read (from http://perldoc.perl.org/threads.html#BUGS-AND-LIMITATIONS)
On most systems, frequent and continual creation and destruction of
threads can lead to ever-increasing growth in the memory footprint of
the Perl interpreter. While it is simple to just launch threads and
then ->join() or ->detach() them, for long-lived applications, it is
better to maintain a pool of threads, and to reuse them for the work
needed, using queues to notify threads of pending work.
My script will be long-lived; it's an PKI LDAP directory monitoring daemon that will always be running. The enterprise monitoring solution will generate an alarm if it stops running for any reason. My script will check that I can reach another PKI LDAP directory, as well as validate revocation lists on both.
Problem: Everything I can find on google shows passing variables (e.g. scalars) to the thread queue rather than the subroutine itself... I think I'm just not understanding how to implement a thread queue properly compared to how you implement a thread (without queues).
Question 1: How can I "maintain a pool of threads" to avoid the perl interpreter from slowly eating up more and more memory?
Question 2: (Unrelated but while I have this code posted) Is there a safe amount of sleep at the end of the main program so that I don't start a thread more than once in a minute? 60 seems obvious but could that ever cause it to run more than once if the loop is fast, or perhaps miss a minute because of processing time or something?
Thanks in advance!
#!/usr/bin/perl
use feature ":5.10";
use warnings;
use strict;
use threads;
use Proc::Daemon;
#
### Global Variables
use constant false => 0;
use constant true => 1;
my $app = $0;
my $continue = true;
$SIG{TERM} = sub { $continue = false };
# Directory Server Agent (DSA) info
my #ListOfDSAs = (
{ name => "Myself (inbound)",
host => "ldap.myco.ca",
base => "ou=mydir,o=myco,c=ca",
},
{ name => "Company 2",
host => "ldap.comp2.ca",
base => "ou=their-dir,o=comp2,c=ca",
}
);
#
### Subroutines
sub checkConnections
{ # runs every 5 minutes
my (#DSAs, $logfile) = #_;
# Code to ldapsearch
threads->detach();
}
sub validateRevocationLists
{ # runs every hour on minute xx:55
my (#DSAs, $logfile) = #_;
# Code to validate CRLs haven't expired, etc
threads->detach();
}
#
### Main program
Proc::Daemon::Init;
while ($continue)
{
my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);
# Question 1: Queues??
if ($min % 5 == 0 || $min == 0)
{ threads->create(&checkConnections, #ListOfDSAs, "/var/connect.log"); }
if ($min % 55 == 0)
{ threads->create(&validateRevocationLists, #ListOfDSAs, "/var/RLs.log"); }
sleep 60; # Question 2: Safer/better way to prevent multiple threads being started for same check in one matching minute?
}
# TERM RECEIVED
exit 0;
__END__
use threads;
use Thread::Queue 3.01 qw( );
my $check_conn_q = Thread::Queue->new();
my $validate_revoke_q = Thread::Queue->new();
my #threads;
push #threads, async {
while (my $job = $check_conn_q->dequeue()) {
check_conn(#$job);
}
};
push #threads, async {
while (my $job = $validate_revoke_q->dequeue()) {
validate_revoke(#$job);
}
};
while ($continue) {
my ($S,$M,$H,$m,$d,$Y) = localtime; $m+=1; $Y+=1900;
$check_conn_q->enqueue([ #ListOfDSAs, "/var/connect.log" ])
if $M % 5 == 0;
$validate_revoke_q->enqueue([ #ListOfDSAs, "/var/RLs.log" ])
if $M == 55;
sleep 30;
}
$check_conn_q->end();
$validate_revoke_q->end();
$_->join for #threads;
I'm not sure parallelisation is needed here. If it's not, you could simply use
use List::Util qw( min );
sub sleep_until {
my ($until) = #_;
my $time = time;
return if $time >= $until;
sleep($until - $time);
}
my $next_check_conn = my $next_validate_revoke = time;
while ($continue) {
sleep_until min $next_check_conn, $next_validate_revoke;
last if !$continue;
my $time = time;
if ($time >= $next_check_conn) {
check_conn(#ListOfDSAs, "/var/connect.log");
$next_check_conn = time + 5*60;
}
if ($time >= $next_validate_revoke) {
validate_revoke(#ListOfDSAs, "/var/RLs.log");
$next_validate_revoke = time + 60*60;
}
}
I would recommend just running the checks one at a time, as there does not appear to be a compelling reason to use threads here, and you don't want to add unnecessary complexity to a program that will be running all the time.
If you do want to learn how use a thread pool, there are examples included with the threads module. There is also a Thread::Pool module that may be useful.
As for ensuring you don't repeat the checks in the same minute, you are correct that sleeping for 60 seconds will be inadequate. No matter what value you choose to sleep, you will have edge cases in which it fails: either it will be slightly shorter than a minute, and you will occasionally have two checks in the same minute, or it will be slightly longer than a minute, and you will occasionally miss a check altogether.
Instead, use a variable to remember when the task was last done. You can then use a shorter sleep time without worrying about multiple checks per minute.
my $last_task_time = -1;
while ($continue)
{
my $min = (localtime(time))[1];
if ($last_task_time != $min &&
($min % 5 == 0 || $min > ($last_task_time+5)%60))
{
#Check connections here.
if ($min == 55 || ($last_task_time < 55 && $min > 55))
{
#Validate revocation lists here.
}
$last_task_time = $min;
}
else
{
sleep 55; #Ensures there is at least one check per minute.
}
}
Update: I fixed the code so that it will recover if the last task ran too long. This would be fine if it occasionally takes a long time. If the tasks are frequently taking longer than five minutes, though, you need a different solution (threads would probably make sense in that case).

Efficiency of Threading/Forking

So, I was considering using forking or threading to do some simple parralelization. To make sure that it was worth it, I wrote three simple scripts to benchmark sequential vs threading vs forking. I used two very simple methods to initialize an array of arrays and then another method to find the max element in each array and write it to a file.
Methods:
sub initialize
{
for (my $i=0; $i <= 2; $i++)
{
for (my $j=0; $j < 5000000; $j++)
{
$array[$i][$j]=$j+$i;
}
}
}
sub getMax
{
my $num = shift;
my $array = shift;
my $length=scalar(#{$array});
my $max=-9**9**9;
my #ra;
for (my $i=0; $i < $length; $i++)
{
if ($max < ${$array}[$i])
{
$max=${$array}[$i];
}
}
tie #ra, 'Tie::File', "test.txt" or die;
$ra[$num]=$max;
}
Sequential:
my $start = Time::HiRes::time();
for (my $count = 0; $count <= 2; $count++)
{
getMax($count,$array[$count]);
}
my $stop = Time::HiRes::time();
my $duration = $stop-$start;
print "Time spent: $duration\n";
print "End of main program\n";
Threading:
my #threads=();
my $start = Time::HiRes::time();
for (my $count = 0; $count <= 2; $count++)
{
my $t = threads->new(\&getMax, $count, $array[$count]);
push(#threads,$t);
}
foreach (#threads)
{
my $num = $_->join;
}
my $stop = Time::HiRes::time();
my $duration = $stop-$start;
print "Time spent: $duration\n";
print "End of main program\n";
Forking:
my $pm = Parallel::ForkManager->new(3);
my $start = Time::HiRes::time();
for (my $count = 0; $count <= 2; $count++)
{
my $pid = $pm->start and next;
getMax($count,$array[$count]);
$pm->finish;
}
$pm->wait_all_children;
my $stop = Time::HiRes::time();
my $duration = $stop-$start;
print "Time spent: $duration\n";
print "\nEnd of main program\n";
Sequential: 2.88 sec
Threading: 4.10 sec
Forking: 3.88 sec
I guess that for my purposes (obviously not this, but something not too much more computationally intensive), threading/forking is not helpful. I understand that the two are not solely used for temporal efficiency, but I imagine that's one of the benefits depending on what you're doing. So, my question is when exactly does threading/forking actually make one's code run faster?
The processor and memory are the fastest components of a computer. Because fast memory is also expensive, disk drives are used to store large amounts of data inexpensively, with the trade-off that it is very much slower to access.
When computer programs rely on data from slow media, the faster components can often be left with nothing to do until the necessary data arrives. The primary use of multithreading is to allow the processor to get on with something else while waiting for a required resource.
The sorts of things that can be done in parallel are
Keeping the user interface functional while waiting for something to complete
Doing multi-processor calculations
Fetching data from from multiple internet sites
Reading from multiple disk drives
The important thing about all of these is that multithreading is only advantageous if the threads don't compete with each other for the same resources.
Trying to speed up a disk read by reading half the data in each of two threads, for instance, will not be successful, because there is a bottleneck at the disk controller and a limit to how fast it can return data. But RAID drives can speed things up by reading part of the data from each of several drives at the same time.
In your example, there is only one processor that can do the maximum calculation. Getting several threads doing it doesn't mean the processor can do the work any faster, and in fact it will be slowed down by having to switch between threads. However, if you could arrange for each thread to be run on a separate processor of a multi-processor system you would get an advantage. This technique is often used by audio-visual software to get the maximum speed of processing.
Similarly, fetching data from multiple internet sources in parallel can be very useful, but only until the capacity of the link has been reached, when the threads will start competing with each other for bandwidth.

Memory leak in WebTestCase or in Symfony2-Kernel - how to report to the gurus?

I think I have found a memory-leak in the WebTestCase class or in the Kernel itself. My questions at the end.
To reproduce, I make a new empty WebTestCase that does $max asserts (in my tests, $max ranges from 1 to 100.000) without instancing any kernel.
<?php
use Symfony\Bundle\FrameworkBundle\Test\WebTestCase;
class DemoTest extends WebTestCase
{
public function testHello( )
{
$max = 100000;
for( $i = 0; $i < 1; $i++ )
{
$this->assertTrue( true );
}
}
}
With $max=1
Time: 0 seconds, Memory: 5.75Mb
OK (1 test, 1 assertion)
With $max=100000
Time: 1 second, Memory: 5.75Mb
OK (1 test, 100000 assertions)
As expected, either running 1 time or 100.000 times, I consume the same memory.
Now I try the loop creating any arbitrary class, working with it and destroying it. I choose DOMDocument the same way I could have chosen any other class.
public function testHello( )
{
$max = 100000;
for( $i = 0; $i < $max; $i++ )
{
$dummy = new \DOMDocument();
$dummy->loadHTML( '<html><head><title>dummy</title></head><body>dummy</body></html>' );
unset( $dummy );
$this->assertTrue( true );
}
}
With $max=1
Time: 0 seconds, Memory: 5.75Mb
OK (1 test, 1 assertion)
With $max=100000
Time: 2 seconds, Memory: 5.75Mb
OK (1 test, 100000 assertions)
Again, regardless if I execute it 1 time or 100.000, I consume the same memory.
But... Now I create and destroy kernels within the loop. I NEITHER DO the ->boot() nor the ->shutdown(), just obtain the new kernel objects that, without even being booted, are then unreferenced, ans so they should die freeing all their resources.
public function testHello( )
{
$max = 10000;
for( $i = 0; $i < $max; $i++ )
{
$dummy = static::createKernel();
unset( $dummy );
$this->assertTrue( true );
}
}
With $max=1
Time: 0 seconds, Memory: 6.25Mb
OK (1 test, 1 assertion)
With $max=100000
Time: 9 seconds, Memory: 165.50Mb
OK (1 test, 100000 assertions)
It is logical that it raises from 5.75 to 6.25, it might be the size of the kernel's resources. Half mega. But what is incorrect it is the fact of taht by creating 100.000 kernels without any operations on them, we tend to consume 165 megas. Somebody is freeing bad the resources that it allocated.
My questions:
Is this a leak? Or it is the expected behaviour which, for any extrange reason I don't get why should this work as it does?
I use 2.0.10. If it is a leak, how can I know if this is corrected in a later version?
If it has not been corrected, which is the easiest way to report it to the core-team?
Thanks!
Xavi.
Symfony2 code is hosted on github.com and issues must declared on this site https://github.com/symfony/symfony/issues
I encounter this kind of issue with heavy web testcases too, and it's not fixed in 2.0.x versions. I can't ensure you it's a memory leak or normal behaviour...
Google Groups for developers : http://groups.google.com/group/symfony-devs?pli=1
Google Groups for users : http://groups.google.com/group/symfony2?pli=1
Symfony2 forum : http://forum.symfony-project.org/viewforum.php?f=23
I advise you the Google Groups mailing list, and then GitHub issues.

Resources