Why this Perl script run out of memory gradually - multithreading

I have a trouble in running a Perl script in muti-threads. It continued consume memory and finally the system ran out of memory and killed it. It seems that the sub-threads were detached but the system resource were not released when they finished. I am pretty new to Perl and couldn't find which part went wrong. This is part of the script that may cause this problem. Could anyone help me with this?
use strict;
use warnings;
print "different number:\t";
my $num1=<>;
chomp $num1;
if($num1!~/[1 2 3 4 5]/)
{
print "invalid input number\n";
END;
}
my $i=0;
my $no;
my #spacer1;
my $nn;
my #spacer2;
open IN,"file1.txt"or die"$!";
while(<IN>)
{
chomp;
if($_=~ /^>((\d)+)\|((\d)+)/)
{
$no=$1;
$spacer1[$no][0]=$3;
}
else
{
$spacer1[$no][1]=$_;
}
}
close IN;
open IN, "file2.txt" or die "$!";
while(<IN>)
{
chomp;
if($_=~ /^>((\d)+)\|((\d)+)/)
{
$nn=$1;
$spacer2[$nn][0]=$3;
}
else
{
$spacer2[$nn][1]=$_;
}
}
close IN;
#-----------------------------------------------------------------#create threads
use subs qw(sg_ana);
use threads;
use Thread::Semaphore;
my $cycl=(int($no/10000))+1;
my $c;
my #thd;
my $thread_limit= Thread::Semaphore -> new (3);
foreach $c(1..$cycl)
{
$thread_limit->down();
$thd[$c]=threads->create("sg_ana",$c-1,$c,$num1);
$thd[$c]->detach();
}
&waitquit;
#-------------------------------------------------------------#limite threads num
sub waitquit
{
print "waiting\n";
my $num=0;
while($num<3)
{
$thread_limit->down();
$num++;
}
}
#---------------------------------------------------------------#alignment
my $n;
my $n1;
my $j;
my $k;
my $l;
my $m;
my $num;#number of match
my $num2=0;;#arrange num
sub sg_ana
{
my $c1=shift;
my $c2=shift;
$num1=shift;
open OUT,">$num1.$c2.txt" or die "$!";
if($num1==1)
{
foreach $n($c1*10000..$c2*10000-1)
{
if($spacer2[$n][1])
{
my $presult1;
my $presult2;
$num2=-1;
foreach $i(0..19)
{
$num=0;
$num2++;
my $tmp1=(substr $spacer2[$n][1],0,$i)."\\"."w".(substr $spacer2[$n][1],$i+1,19-$i);
foreach $n1(0..#spacer1-1)
{
if($spacer1[$n1][1])
{
my $tmp2=substr $spacer1[$n1][1],0,20;
if($tmp2=~/$tmp1/)
{
$num++;
$presult1.=$n1.",";
}
}
}
$presult2=$i+1;
if($num>=4)
{
print OUT "\n";
}
}
}
}
}
close OUT;
$thread_limit->up();
}

Rule one of debugging perl is enable use strict; and use
warnings; and then sort out the errors. Actually, you should
probably do that first of all, before you even start writing code.
You're creating and limiting threads via a Semaphore - but actually
this is really inefficient because of how perl does threads - they
aren't lightweight, so spawning loads is a bad idea. A better way of doing this is via Thread::Queue a bit like this.
Please use 3 arg open and lexical file handles. e.g. open ( my
$out, '>', "$num.$c2.txt" ) or die $!;. You're probably getting
away with it here, but you have got OUT as a global namespace
variable being used by multiple threads. That way lies dragons.
Don't use single letter variables. And given how you you use $c
then you'd be far better off:
foreach my $value ( 1..$cycl ) {
## do stuff
}
The same is true of all your other single letter variables though - they're not meaningful.
You pass $num before it's initialised, so it's always going to
be undef within your sub. So your actual subroutine is just:
sub sg_ana
{
my $c1=shift;
my $c2=shift;
$num1=shift;
open OUT,">$num1.$c2.txt" or die "$!";
close OUT;
$semaphore->up();
}
Looking at it - I think you may be trying to do something with a shared variable there, but you're not actually sharing it. I can't decode the logic of your program though (thanks to having a load of single letter variables most likely) so I can't say for sure.
You're calling a subroutine &waitquit;. That's not good style -
prefixing with an ampersand and supplying no arguments does
something subtly different to just invoking the sub 'normally' - so
you should avoid it.
Don't instantiate your semaphore like this:
my $semaphore=new Thread::Semaphore(3);
That's an indirect procedure call, and bad style. It would be better written as:
my $thread_limit = Thread::Semaphore -> new ( 3 );
I would suggest rather than using Semaphores like that, you'd be far better off not detatching your threads, and just using join. You also don't need an array of threads - threads -> list does that for you.
I can't reproduce your problem, because your sub isn't doing
anything. Have you by any chance modified it for posting? But a classic reason for perl memory exhaustion when threading is because each thread clones the parent process - and so 100 threads is 100x the memory.

Related

Perl - turning foreach loop to a multi-threaded run

I have the following code:
foreach my $inst (sort keys %{ ... }) {
next if (...)
somefuntion($a, $b, $c, $inst);
}
I would like to run this function on all the $inst-s asynchronously.
I tried to make it multi-threaded, but I'm having trouble with the syntax or implementation.
*** EDIT: ***
Apparently (i haven't noticed until now), the function uses a hash and the updates gets lost.
Should Threads::shared help in this case? Is it relevant in this case or should I just try forks?
Perl's got three major ways I'd suggest to do parallel code
Threads
Forks
Nonblocking IO
The latter isn't strictly speaking 'parallel' in all circumstances, but it does let you do multiple things at the same time, without waiting for each to finish, so it's beneficial in certain circumstances.
E.g. maybe you want to open 10 concurrent ssh sessions - you can just do an IO::Select to find which of them are 'ready' and process them as they come in.
The ssh shells themselves are of course, separate processes.
But when doing parallel, you need to be aware of a couple of pitfalls - one being 'self denial of service' - you can generate huge resource consumption very easily. The other being that you've got some inherent race conditions, and no longer a deterministic flow of program - that brings you a whole new class of exciting bugs.
Threads
I wouldn't advocate spawning a thread-per-instance, as that scales badly. Threads in perl are NOT lightweight, like you might be assuming. That means that implementing them as if they are, gives you a denial of service condition.
What I'd typically suggest is running with Thread::Queue and some "worker" threads - and use the Queue to pass data to some number of workers that are scaled to your resource availability. Depending on what is your limiting factor here that's making you do parallel.
(e.g. disk, network, cpu, etc.)
So to use a simplistic example that I've posted previously:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
my $nthreads = 5;
my $process_q = Thread::Queue->new();
my $failed_q = Thread::Queue->new();
#this is a subroutine, but that runs 'as a thread'.
#when it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is defined as 'shared'.
#Behind the scenes - Thread::Queue are 'shared' arrays.
sub worker {
#NB - this will sit a loop indefinitely, until you close the queue.
#using $process_q -> end
#we do this once we've queued all the things we want to process
#and the sub completes and exits neatly.
#however if you _don't_ end it, this will sit waiting forever.
while ( my $server = $process_q->dequeue() ) {
chomp($server);
print threads->self()->tid() . ": pinging $server\n";
my $result = `/bin/ping -c 1 $server`;
if ($?) { $failed_q->enqueue($server) }
print $result;
}
}
#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);
#we 'end' process_q - when we do, no more items may be inserted,
#and 'dequeue' returns 'undefined' when the queue is emptied.
#this means our worker threads (in their 'while' loop) will then exit.
$process_q->end();
#start some threads
for ( 1 .. $nthreads ) {
threads->create( \&worker );
}
#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
$thr->join();
}
#collate results. ('synchronise' operation)
while ( my $server = $failed_q->dequeue_nb() ) {
print "$server failed to ping\n";
}
This will start 5 threads, and queue up some number of jobs, such that 5 are running in parallel at any given time, and 'unwind' gracefully after.
Forking
Parallel::Forkmanager is the tool for the job here.
Unlike threads, forks are quite efficient on a Unix system, as the native fork() system call is well optimised.
But what it's not so good at is passing data around - you've got to hand roll any IPCs between your forks in a way that you don't so much with Threads.
A simple example of this would be:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
my $concurrent_fork_limit = 4;
my $fork_manager = Parallel::ForkManager->new($concurrent_fork_limit);
foreach my $thing ( "fork", "spoon", "knife", "plate" ) {
my $pid = $fork_manager->start;
if ($pid) {
print "$$: Fork made a child with pid $pid\n";
} else {
print "$$: child process started, with a key of $thing ($pid)\n";
}
$fork_manager->finish;
}
$fork_manager->wait_all_children();
This does spawn off subprocesses, but cleans up after them fairly readily.
Nonblocking IO
Using IO::Select you would open some number of filehandles to subprocesses, and then use the can_read function to process the ones that are ready to run.
The perldoc IO::Select covers most of the detail here, which I'll reproduce for convenience:
use IO::Select;
$select = IO::Select->new();
$select->add(\*STDIN);
$select->add($some_handle);
#ready = $select->can_read($timeout);
#ready = IO::Select->new(#handles)->can_read(0);
You could use threads.
Here's an example that should take about 5 seconds to finish although it calls sleep(5) twice:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
my %data = (
'foo' => 'bar',
'apa' => 'bepa',
);
sub somefuntion {
my $key = shift;
print "$key\n";
sleep(5);
return $data{$key};
}
my #threads;
for my $inst (sort keys %data) {
push #threads, threads->create('somefuntion', $inst);
}
print "running...\n";
for my $thr (#threads) {
print $thr->join() . "\n";
}
print "done\n";
This answer was made to show how threads works in Perl because you mentioned threads. Just a word of caution:
The "interpreter-based threads" provided by Perl are not the fast, lightweight system for multitasking that one might expect or hope for. Threads are implemented in a way that makes them easy to misuse. Few people know how to use them correctly or will be able to provide help.
The use of interpreter-based threads in perl is officially discouraged.

Perl seg fault while joining threads

I have a code similar to the below. I have one main script which is calling another module named initial.pm. initial.pm opens up connection with an AMQP server (In my case RabbitMQ)and using Net::AMQP::RabbitMQ library for establishing the connection. Everything works fine except when I try to join my threads I get segmentation fault.
I think the Net::AMQP::RabbitMQ is not thread safe. But this is only being used by the main thread. Im pretty sure you can reproduce the error if you just copy and past the codes below.
How do I fix it ?
main.pl
#!/usr/bin/perl
use Cwd qw/realpath/;
use File::Basename qw/dirname/;
use lib 'lib';
use threads;
use threads::shared;
use initial;
my #threads = ();
my $run :shared = 1;
my $init = load initial($name);
$SIG{'TERM'} = sub {
$run = 0;
};
threads->create(\&proc1);
threads->create(\&proc2);
while($run){
sleep(1);
print "I am main thread\n";
}
$_->join() for threads->list();
sub proc1 {
while($run){
sleep(1);
print "I am child thread 1 \n"
}
}
sub proc2 {
while($run){
sleep(1);
print "I am child thread 2 \n";
}
}
lib/initial.pm
package initial;
use Net::AMQP::RabbitMQ;
use Cwd qw/realpath/;
use File::Basename qw/dirname/;
my $mq;
my $stop = 0;
sub load {
my $class = shift;
my $self = {};
connectq();
bless $self,$class;
return $self;
}
sub connectq {
$mq = Net::AMQP::RabbitMQ->new();
my ($host,$port,$user,$pass) = ('localhost','5672','guest','guest');
$mq->connect($host, {
user => $user,
password => $pass,
port => $port,
timeout => 10,
});
$mq->channel_open(1);
$mq->consume(1, 'logger');
}
1;
I can't reproduce your problem directly, because I don't have the library installed.
One way of 'faking' thread safety in a not-thread-safe module is to rescope your 'use' to only the bit where you'll be using it.
You see, when you start a thread, it copies the program state - loaded libraries and everything.
If your run (something like):
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;
sub thread1 {
print threads->self->tid.": Includes:", Dumper \%INC,"\n";
}
#main;
print "Main includes:", Dumper \%INC,"\n";
threads -> create ( \&thread1 );
You'll see XML::Twig is loaded in both. If the process of 'loading' the module causes some state changes (and it can) then you immediately have a potential thread-safety issue.
However if you instead do:
#!/usr/bin/env perl
use strict;
use warnings;
use threads;
use Data::Dumper;
sub thread1 {
require XML::Twig;
XML::Twig -> import;
print threads->self->tid.": Includes:", Dumper (\%INC),"\n";
}
#main;
print "Main includes:", Dumper (\%INC),"\n";
threads -> create ( \&thread1 );
foreach my $thr ( threads -> list ) {
$thr -> join;
}
You effectively cause the module to be dynamically loaded within the thread - the module is only present in one 'code instance' so you are much less likely to be tripped up by 'thread safety' issues.
Alternatively - forking instead of threading ... might be an alternative. This has slightly different 'safety' problems.
But there really is no way to avoid this. Even with shared variables, the core problem is - when you thread, bits of code happen in a different order. There's all sorts of fruity things that can happen as a result. A shared var is one way of ensuring it's the same variable being checked each time - e.g. share $init, but that may well make things worse, because you're then potentially trampling over the same instance/sockets with different threads.
You can, however, reduce the 'thread safety' problem to a limited scope, and use e.g. Thread::Queue to pass messages to/from your 'module user' thread.

How to get started multithreading in Perl

I have a perl program that takes over 13 hours to run. I think it could benefit from introducing multithreading but I have never done this before and I'm at a loss as to how to begin.
Here is my situation:
I have a directory of hundreds of text files. I loop through every file in the directory using a basic for loop and do some processing (text processing on the file itself, calling an outside program on the file, and compressing it). When complete I move on to the next file. I continue this way doing each file, one after the other, in a serial fashion. The files are completely independent from each other and the process returns no values (other than success/failure codes) so this seems like a good candidate for multithreading.
My questions:
How do I rewrite my basic loop to take advantage of threads? There appear to be several moduals for threading out there.
How do I control how many threads are currently running? If I have N cores available, how do I limit the number of threads to N or N - n?
Do I need to manage the thread count manually or will Perl do that for me?
Any advice would be much appreciated.
Since your threads are simply going to launch a process and wait for it to end, best to bypass the middlemen and just use processes. Unless you're on a Windows system, I'd recommend Parallel::ForkManager for your scenario.
use Parallel::ForkManager qw( );
use constant MAX_PROCESSES => ...;
my $pm = Parallel::ForkManager->new(MAX_PROCESSES);
my #qfns = ...;
for my $qfn (#qfns) {
my $pid = $pm->start and next;
exec("extprog", $qfn)
or die $!;
}
$pm->wait_all_children();
If you wanted you avoid using needless intermediary threads in Windows, you'd have to use something akin to the following:
use constant MAX_PROCESSES => ...;
my #qfns = ...;
my %children;
for my $qfn (#qfns) {
while (keys(%children) >= MAX_PROCESSES) {
my $pid = wait();
delete $children{$pid};
}
my $pid = system(1, "extprog", $qfn);
++$children{$pid};
}
while (keys(%children)) {
my $pid = wait();
delete $children{$pid};
}
Someone's given your a forking example. Forks aren't native on Windows, so I'd tend to prefer threading.
For the sake of completeness - here's a rough idea of how threading works (and IMO is one of the better approaches, rather than respawning threads).
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
my $nthreads = 5;
my $process_q = Thread::Queue->new();
my $failed_q = Thread::Queue->new();
#this is a subroutine, but that runs 'as a thread'.
#when it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is defined as 'shared'.
#Behind the scenes - Thread::Queue are 'shared' arrays.
sub worker {
#NB - this will sit a loop indefinitely, until you close the queue.
#using $process_q -> end
#we do this once we've queued all the things we want to process
#and the sub completes and exits neatly.
#however if you _don't_ end it, this will sit waiting forever.
while ( my $server = $process_q->dequeue() ) {
chomp($server);
print threads->self()->tid() . ": pinging $server\n";
my $result = `/bin/ping -c 1 $server`;
if ($?) { $failed_q->enqueue($server) }
print $result;
}
}
#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);
#we 'end' process_q - when we do, no more items may be inserted,
#and 'dequeue' returns 'undefined' when the queue is emptied.
#this means our worker threads (in their 'while' loop) will then exit.
$process_q->end();
#start some threads
for ( 1 .. $nthreads ) {
threads->create( \&worker );
}
#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
$thr->join();
}
#collate results. ('synchronise' operation)
while ( my $server = $failed_q->dequeue_nb() ) {
print "$server failed to ping\n";
}
If you need to move complicated data structures around, I'd recommend having a look at Storable - specifically freeze and thaw. These will let you shuffle around objects, hashes, arrays etc. easily in queues.
Note though - for any parallel processing option, you get good CPU utilisation, but you don't get more disk IO - that's often a limiting factor.

Perl: Correctly passing array for threads to work on

I'm learning how to do threading in Perl. I was going over the example code here and adapted the solution code slightly:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Semaphore;
my $sem = Thread::Semaphore->new(2); # max 2 threads
my #names = ("Kaku", "Tyson", "Dawkins", "Hawking", "Goswami", "Nye");
my #threads = map {
# request a thread slot, waiting if none are available:
foreach my $whiz (#names) {
$sem->down;
threads->create(\&mySubName, $whiz);
}
} #names;
sub mySubName {
return "Hello Dr. " . $_[0] . "\n";
# release slot:
$sem->up;
}
foreach my $t (#threads) {
my $hello = $t->join();
print "$hello";
}
Of course, this is now completely broken and does not work. It results in this error:
C:\scripts\perl\sandbox>threaded.pl
Can't call method "join" without a package or object reference at C:\scripts\perl\sandbox\threaded.pl line 24.
Perl exited with active threads:
0 running and unjoined
9 finished and unjoined
0 running and detached
My objective was two-fold:
Enforce max number of threads allowed at any given time
Provide the array of 'work' for the threads to consume
In the original solution, I noticed that the 0..100; code seems to specify the amount of 'work' given to the threads. However, in my case where I want to supply an array of work, do I still need to supply something similar?
Any guidance and corrections very welcome.
You're storing the result of foreach into #threads rather than the result of threads->create.
Even if you fix this, you collect completed threads too late. I'm not sure how big of a problem that is, but it might prevent more than 64 threads from being started on some systems. (64 is the max number of threads a program can have at a time on some systems.)
A better approach is to reuse your threads. This solves both of your problems and avoids the overhead of repeatedly creating threads.
use threads;
use Thread::Queue 3.01 qw( );
use constant NUM_WORKERS => 2;
sub work {
my ($job) = #_;
...
}
{
my $q = Thread::Queue->new();
for (1..NUM_WORKERS) {
async {
while (my $job = $q->dequeue()) {
work($job);
}
};
}
$q->enqueue(#names); # Can be done over time.
$q->end(); # When you're done adding.
$_->join() for threads->list();
}

Error using ithreads with Memoize

I just introduced threads to a Perl program, where one of its modules was using Memoize.
I'm getting this error message:
Thread 1 terminated abnormally: Anonymous function called in forbidden scalar context; faulting
The error occurs if I have both threads and Memoize, but will disappear if I take away one of these elements. But the problem isn't because Memoize isn't thread-safe - in my code, all the memoization happens within the same thread.
Is this a bug with Memoize? Is there a way I can work around this? Otherwise I'm going to get rid of Memoize.
Here's some sample code to isolate the problem:
use strict;
use warnings;
use threads;
use Thread::Semaphore;
use Memoize;
my $semaphore = Thread::Semaphore->new;
memoize('foo');
sub foo {
return shift;
}
sub invoke_foo {
$semaphore->down; # ensure memoization is thread-safe
my $result = foo(#_);
$semaphore->up;
return $result;
}
my #threads;
foreach (1 .. 5) {
my $t = threads->create( sub { invoke_foo($_) });
push #threads, $t;
}
$_->join foreach #threads;
Memoize stores the caches for every memoized function in one hash (instead of using a closure). It uses the address of the function as the index into that hash.
The problem is that the address of the function changes when it's cloned into a new thread. (Add print(\&foo, "\n"); in invoke_foo.). It's a bug in Memoize.
Workaround: Load the memoised module from within the threads. the following simulates (the relevant aspects of) that:
use strict;
use warnings;
use threads;
use Memoize;
sub foo {
return shift;
}
sub invoke_foo {
return foo(#_);
}
my #threads;
foreach (1 .. 5) {
my $t = threads->create( sub {
memoize('foo');
invoke_foo($_);
});
push #threads, $t;
}
$_->join foreach #threads;
By the way, each thread has its own cache. that could also be considered a bug.
As noted, Memoize is not thread aware. If you want per thread memoization, ikegami's restructuring will work well. If instead you want global memoization, then replacing Memoize with something like the following could work:
use strict;
use warnings;
use 5.010;
use threads;
use threads::shared;
sub memoize_shared {
my $name = shift;
my $glob = do {
no strict 'refs';
\*{(caller)."::$name"}
};
my $code = \&$glob;
my $sep = $;;
my (%scalar, %list) :shared;
no warnings 'redefine';
*$glob = sub {
my $arg = join $sep => #_;
if (wantarray) {
#{$list{$arg} ||= sub {\#_}->(&$code)}
}
else {
exists $scalar{$arg}
? $scalar{$arg}
:($scalar{$arg} = &$code)
}
}
}
and to use it:
sub foo {
my $x = shift;
say "foo called with '$x'";
"foo($x)"
}
memoize_shared 'foo';
for my $t (1 .. 4) {
threads->create(sub {
my $x = foo 'bar';
say "thread $t got $x"
})->join
}
which prints:
foo called with 'bar'
thread 1 got foo(bar)
thread 2 got foo(bar)
thread 3 got foo(bar)
thread 4 got foo(bar)
The memoize_shared function above is fairly complicated because it deals with propegating list and scalar contexts as well as replacing the named subroutine. It is sometimes easier to just build the memoziation into the target subroutine:
{my %cache :shared;
sub foo {
my $x = shift;
if (exists $cache{$x}) {$cache{$x}}
else {
say "foo called with '$x'";
$cache{$x} = "foo($x)"
}
}}
Building the memoization into the subroutine does make it a bit more complicated, but it will be faster than using a wrapper function like memoize. And it gives you exact control over how to memoize the subroutine, including things like using a threads::shared cache.
Memoize should work under threads, albeit a bit slower:
"There is some problem with the way goto &f works under threaded Perl, perhaps because of the lexical scoping of #_. This is a bug in
Perl, and until it is resolved, memoized functions will see a slightly
different caller() and will perform a little more slowly on threaded
perls than unthreaded perls."

Resources