Can I reuse joined threads in perl? - multithreading

I have a module that is running multiple threads and pushing them onto a list of threads.
ex:
#!/usr/bin/perl
#test_module.pm
package test_module;
use strict;
use warnings;
use threads;
sub main {
    my $max_threads = 10;
    my @threads = ();
    # create threads
    while (scalar @threads < $max_threads) {
        my $thread = threads->new(\&thread_sub);
        push @threads, $thread;
    }
    # join threads
    for my $thread (@threads) {
        $thread->join();
    }
}
sub thread_sub {
    my $id = threads->tid();
    print "I am in thread $id\n";
}
1;
The problem is that I am calling this module multiple times from one Perl script and instead of eliminating the old threads and creating new ones, the thread ids just keep incrementing. I have heard that if you don't properly get rid of old threads in Perl this will cause a memory leak and slow your program down, is this true? Is the data from my old threads just sitting in memory taking up space?
If so this can become a large problem since my script will be part of a much larger program that may generate hundreds or thousands of threads all of which would just be taking up memory even after they are done being used. How can I stop this from happening? Can my threads be reused?
Here is an example script that will call the module and show how the threads will continue to increment even though I joined the old threads (I thought that "join" was how you cleaned up after them, am I doing something wrong?) The way this script will be used I can't afford to have memory from old threads sitting there taking up space.
ex:
#!/usr/bin/perl
#testing.pl
use strict;
use warnings;
use test_module;
test_module::main();
test_module::main();
test_module::main();
system 'pause';
Thanks!

Don't worry about thread IDs incrementing - that doesn't mean the number of running threads is increasing. Once a thread is joined it has finished executing and been terminated.
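To see this for yourself, here is a minimal sketch (the helper name run_batch is made up for illustration, and it assumes a Perl built with thread support). It joins two batches of threads and then asks threads->list() how many threads are still being tracked:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;

# Hypothetical helper: start $count short-lived threads, join them all,
# and return the thread IDs they reported.
sub run_batch {
    my ($count) = @_;
    my @workers;
    push @workers, threads->create( sub { return threads->tid() } ) for 1 .. $count;
    return map { $_->join() } @workers;
}

my @first  = run_batch(3);
my @second = run_batch(3);

# In scalar context, list() counts threads that are neither joined nor detached.
my $still_tracked = scalar( threads->list() );

print "first batch tids:  @first\n";
print "second batch tids: @second\n";
print "threads still tracked after joining: $still_tracked\n";
```

The IDs in the second batch are higher than those in the first, but the count of tracked threads after joining should be zero: the ID is just a counter, not a measure of how many threads exist.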
However, continuously respawning threads isn't ideal either - creating a thread isn't a particularly lightweight operation in perl. So if you've got to do something like that, and are particularly focussing on efficiency - look to fork() instead.
I find I tend to use a 'worker thread' model, using Thread::Queue:
use threads;
use Thread::Queue;
my $processing_q = Thread::Queue -> new();
sub worker_thread {
    while ( my $item = $processing_q -> dequeue() ) {
        # do stuff to $item
    }
}
for ( 1 .. $num_threads ) {
    my $thr = threads -> create ( \&worker_thread );
}
$processing_q -> enqueue ( @generic_list_of_things );
$processing_q -> end;
foreach my $thread ( threads -> list() ) {
    $thread -> join();
}
This will feed in a batch of items into a queue, and your worker threads will process them one at a time - means you can have a sensible number running, without having to continuously respawn.
As an alternative though - take a look at Parallel::ForkManager - fork style parallel processing may seem counterintuitive initially, but fork() is a native system call on Unix systems, so it tends to be better optimised.

Related

Perl - turning foreach loop to a multi-threaded run

I have the following code:
foreach my $inst (sort keys %{ ... }) {
    next if (...);
    somefuntion($a, $b, $c, $inst);
}
I would like to run this function on all the $inst-s asynchronously.
I tried to make it multi-threaded, but I'm having trouble with the syntax or implementation.
*** EDIT: ***
Apparently (I hadn't noticed until now), the function updates a hash and those updates get lost.
Would threads::shared help in this case? Is it relevant here, or should I just try forks?
Perl's got three major ways I'd suggest to do parallel code
Threads
Forks
Nonblocking IO
The latter isn't strictly speaking 'parallel' in all circumstances, but it does let you do multiple things at the same time, without waiting for each to finish, so it's beneficial in certain circumstances.
E.g. maybe you want to open 10 concurrent ssh sessions - you can just do an IO::Select to find which of them are 'ready' and process them as they come in.
The ssh shells themselves are of course, separate processes.
But when doing parallel, you need to be aware of a couple of pitfalls - one being 'self denial of service' - you can generate huge resource consumption very easily. The other being that you've got some inherent race conditions, and no longer a deterministic flow of program - that brings you a whole new class of exciting bugs.
Threads
I wouldn't advocate spawning a thread-per-instance, as that scales badly. Threads in perl are NOT lightweight, like you might be assuming. That means that implementing them as if they are, gives you a denial of service condition.
What I'd typically suggest is running with Thread::Queue and some "worker" threads, using the queue to pass data to a number of workers scaled to your resource availability. How many that is depends on the limiting factor that's making you go parallel (e.g. disk, network, CPU).
So to use a simplistic example that I've posted previously:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
my $nthreads = 5;
my $process_q = Thread::Queue->new();
my $failed_q = Thread::Queue->new();
#this is a subroutine, but that runs 'as a thread'.
#when it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is defined as 'shared'.
#Behind the scenes - Thread::Queue are 'shared' arrays.
sub worker {
#NB - this will sit a loop indefinitely, until you close the queue.
#using $process_q -> end
#we do this once we've queued all the things we want to process
#and the sub completes and exits neatly.
#however if you _don't_ end it, this will sit waiting forever.
while ( my $server = $process_q->dequeue() ) {
chomp($server);
print threads->self()->tid() . ": pinging $server\n";
my $result = `/bin/ping -c 1 $server`;
if ($?) { $failed_q->enqueue($server) }
print $result;
}
}
#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);
#we 'end' process_q - when we do, no more items may be inserted,
#and 'dequeue' returns 'undefined' when the queue is emptied.
#this means our worker threads (in their 'while' loop) will then exit.
$process_q->end();
#start some threads
for ( 1 .. $nthreads ) {
threads->create( \&worker );
}
#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
$thr->join();
}
#collate results. ('synchronise' operation)
while ( my $server = $failed_q->dequeue_nb() ) {
print "$server failed to ping\n";
}
This will start 5 threads, and queue up some number of jobs, such that 5 are running in parallel at any given time, and 'unwind' gracefully after.
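On the question's edit about hash updates getting lost: as the comments in the worker note, each thread works on its own copy of a variable unless it is marked shared. A minimal sketch of the difference, assuming the core threads::shared module (the sub name record and the keys are purely illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

my %plain;              # every thread gets its own private copy
my %shared :shared;     # one copy, visible to all threads

# Illustrative worker: writes the same key into both hashes.
sub record {
    my ($key) = @_;
    $plain{$key} = 1;   # only changes this thread's copy
    lock(%shared);      # serialise concurrent updates
    $shared{$key} = 1;
    return;
}

my @workers = map { threads->create( \&record, $_ ) } qw(alpha beta gamma);
$_->join() for @workers;

printf "plain keys: %d, shared keys: %d\n",
    scalar( keys %plain ), scalar( keys %shared );
```

In the main thread the plain hash stays empty while the shared one holds all three keys. Note that nested structures stored in a shared hash need to be shared as well (for example with shared_clone), which is one reason passing results back through a queue is often simpler.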
Forking
Parallel::ForkManager is the tool for the job here.
Unlike threads, forks are quite efficient on a Unix system, as the native fork() system call is well optimised.
But what it's not so good at is passing data around - you've got to hand roll any IPCs between your forks in a way that you don't so much with Threads.
A simple example of this would be:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
my $concurrent_fork_limit = 4;
my $fork_manager = Parallel::ForkManager->new($concurrent_fork_limit);
foreach my $thing ( "fork", "spoon", "knife", "plate" ) {
my $pid = $fork_manager->start;
if ($pid) {
print "$$: Fork made a child with pid $pid\n";
} else {
print "$$: child process started, with a key of $thing ($pid)\n";
}
$fork_manager->finish;
}
$fork_manager->wait_all_children();
This does spawn off subprocesses, but cleans up after them fairly readily.
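To give an idea of what "hand rolling" the IPC looks like, here is a rough sketch using only the built-in fork and pipe rather than Parallel::ForkManager. The work done in the child (taking the length of a word) is a stand-in, and the variable names are my own:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %reader_for;    # read end of each child's pipe, keyed by work item
my %result_for;    # what each child sent back

for my $thing (qw(fork spoon knife)) {
    pipe( my $reader, my $writer ) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {
        # Child: send one line back to the parent, then exit.
        close $reader;
        print {$writer} length($thing), "\n";
        close $writer;
        exit 0;
    }

    # Parent: keep the read end, close our copy of the write end.
    close $writer;
    $reader_for{$thing} = $reader;
}

# Collect one line from each child (blocks until that child has written).
for my $thing ( sort keys %reader_for ) {
    my $fh = $reader_for{$thing};
    chomp( my $line = <$fh> );
    close $fh;
    $result_for{$thing} = $line;
}

# Reap the children so no zombies are left behind.
1 while wait() != -1;

print "$_ => $result_for{$_}\n" for sort keys %result_for;
```

Anything more structured than a line of text has to be serialised by hand before it goes down the pipe, which is the extra work you avoid with threads and a queue.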
Nonblocking IO
Using IO::Select you would open some number of filehandles to subprocesses, and then use the can_read function to process the ones that are ready to run.
The perldoc IO::Select covers most of the detail here, which I'll reproduce for convenience:
use IO::Select;
$select = IO::Select->new();
$select->add(\*STDIN);
$select->add($some_handle);
@ready = $select->can_read($timeout);
@ready = IO::Select->new(@handles)->can_read(0);
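For something you can actually run, here is a small sketch along the same lines. It assumes a Unix-like system and uses child perl processes (started via $^X) as stand-ins for the ssh sessions; each child sleeps for a different time and then prints a single line:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Select;

my $select = IO::Select->new();

# Start three subprocesses that finish at different times.
for my $delay ( 2, 0, 1 ) {
    open( my $fh, '-|', $^X, '-e', "sleep $delay; print qq{slept $delay\\n}" )
        or die "cannot start child: $!";
    $select->add($fh);
}

my @finished;
while ( $select->count() ) {
    # can_read() blocks until at least one handle has data (or EOF).
    for my $fh ( $select->can_read() ) {
        my $line = <$fh>;
        if ( defined $line ) {
            chomp $line;
            push @finished, $line;
        }
        # Each child writes a single line, so we are done with this handle.
        $select->remove($fh);
        close $fh;
    }
}

print "$_\n" for @finished;
```

The handles are serviced in the order they become ready rather than the order they were opened, so the whole run takes about as long as the slowest child instead of the sum of all of them.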
You could use threads.
Here's an example that should take about 5 seconds to finish although it calls sleep(5) twice:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
my %data = (
    'foo' => 'bar',
    'apa' => 'bepa',
);
sub somefuntion {
    my $key = shift;
    print "$key\n";
    sleep(5);
    return $data{$key};
}
my @threads;
for my $inst (sort keys %data) {
    push @threads, threads->create('somefuntion', $inst);
}
print "running...\n";
for my $thr (@threads) {
    print $thr->join() . "\n";
}
print "done\n";
This answer was made to show how threads works in Perl because you mentioned threads. Just a word of caution:
The "interpreter-based threads" provided by Perl are not the fast, lightweight system for multitasking that one might expect or hope for. Threads are implemented in a way that makes them easy to misuse. Few people know how to use them correctly or will be able to provide help.
The use of interpreter-based threads in perl is officially discouraged.

How to get started multithreading in Perl

I have a perl program that takes over 13 hours to run. I think it could benefit from introducing multithreading but I have never done this before and I'm at a loss as to how to begin.
Here is my situation:
I have a directory of hundreds of text files. I loop through every file in the directory using a basic for loop and do some processing (text processing on the file itself, calling an outside program on the file, and compressing it). When complete I move on to the next file. I continue this way doing each file, one after the other, in a serial fashion. The files are completely independent from each other and the process returns no values (other than success/failure codes) so this seems like a good candidate for multithreading.
My questions:
How do I rewrite my basic loop to take advantage of threads? There appear to be several modules for threading out there.
How do I control how many threads are currently running? If I have N cores available, how do I limit the number of threads to N or N - n?
Do I need to manage the thread count manually or will Perl do that for me?
Any advice would be much appreciated.
Since your threads are simply going to launch a process and wait for it to end, best to bypass the middlemen and just use processes. Unless you're on a Windows system, I'd recommend Parallel::ForkManager for your scenario.
use Parallel::ForkManager qw( );
use constant MAX_PROCESSES => ...;
my $pm = Parallel::ForkManager->new(MAX_PROCESSES);
my @qfns = ...;
for my $qfn (@qfns) {
    my $pid = $pm->start and next;
    exec("extprog", $qfn)
        or die $!;
}
$pm->wait_all_children();
If you wanted to avoid using needless intermediary threads on Windows, you'd have to use something akin to the following:
use constant MAX_PROCESSES => ...;
my @qfns = ...;
my %children;
for my $qfn (@qfns) {
    while (keys(%children) >= MAX_PROCESSES) {
        my $pid = wait();
        delete $children{$pid};
    }
    my $pid = system(1, "extprog", $qfn);
    ++$children{$pid};
}
while (keys(%children)) {
    my $pid = wait();
    delete $children{$pid};
}
Someone's given you a forking example. Forks aren't native on Windows, so there I'd tend to prefer threading.
For the sake of completeness - here's a rough idea of how threading works (and IMO is one of the better approaches, rather than respawning threads).
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
my $nthreads = 5;
my $process_q = Thread::Queue->new();
my $failed_q = Thread::Queue->new();
#this is a subroutine, but that runs 'as a thread'.
#when it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is defined as 'shared'.
#Behind the scenes - Thread::Queue are 'shared' arrays.
sub worker {
#NB - this will sit a loop indefinitely, until you close the queue.
#using $process_q -> end
#we do this once we've queued all the things we want to process
#and the sub completes and exits neatly.
#however if you _don't_ end it, this will sit waiting forever.
while ( my $server = $process_q->dequeue() ) {
chomp($server);
print threads->self()->tid() . ": pinging $server\n";
my $result = `/bin/ping -c 1 $server`;
if ($?) { $failed_q->enqueue($server) }
print $result;
}
}
#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);
#we 'end' process_q - when we do, no more items may be inserted,
#and 'dequeue' returns 'undefined' when the queue is emptied.
#this means our worker threads (in their 'while' loop) will then exit.
$process_q->end();
#start some threads
for ( 1 .. $nthreads ) {
threads->create( \&worker );
}
#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
$thr->join();
}
#collate results. ('synchronise' operation)
while ( my $server = $failed_q->dequeue_nb() ) {
print "$server failed to ping\n";
}
If you need to move complicated data structures around, I'd recommend having a look at Storable - specifically freeze and thaw. These will let you shuffle around objects, hashes, arrays etc. easily in queues.
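As a rough illustration of that idea (the job layout with name and values keys is invented for the example), each job is frozen into a plain string before it goes into the queue and thawed again inside the worker:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use Storable qw(freeze thaw);

my $job_q    = Thread::Queue->new();
my $result_q = Thread::Queue->new();

sub worker {
    # dequeue() returns undef once the queue is ended and drained.
    while ( defined( my $frozen = $job_q->dequeue() ) ) {
        my $job   = thaw($frozen);    # rebuild the nested structure
        my $total = 0;
        $total += $_ for @{ $job->{values} };
        $result_q->enqueue( freeze( { name => $job->{name}, total => $total } ) );
    }
    return;
}

my @workers = map { threads->create( \&worker ) } 1 .. 2;

$job_q->enqueue( freeze( { name => 'first',  values => [ 1, 2, 3 ] } ) );
$job_q->enqueue( freeze( { name => 'second', values => [ 10, 20 ] } ) );
$job_q->end();

$_->join() for @workers;

# Collect the results once all workers have finished.
my %total_for;
while ( defined( my $frozen = $result_q->dequeue_nb() ) ) {
    my $result = thaw($frozen);
    $total_for{ $result->{name} } = $result->{total};
}

print "$_ => $total_for{$_}\n" for sort keys %total_for;
```

Because only strings travel through the queue, the worker gets an ordinary unshared copy of the data to work with and nothing needs to be declared shared.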
Note though - for any parallel processing option, you get good CPU utilisation, but you don't get more disk IO - that's often a limiting factor.

perl threads self detach

I'm pretty new to Perl (and to programming) and have been toying around with threads for the last couple of weeks. So far I've understood that using them to perform similar parallel tasks is discouraged: memory consumption is uncontrollable if your number of threads depends on some input values, and simply limiting that number and doing some interim joins seems pretty silly.
So I've tried to trick threads to return me some values through queues followed by detaching those threads (and without actually joining them) - here's an example with parallel ping:
#!/usr/bin/perl
#
use strict;
use warnings;
use threads;
use NetAddr::IP;
use Net::Ping;
use Thread::Queue;
use Thread::Semaphore;
########## get my IPs from CIDR-notation #############
my @ips;
for my $cidr (@ARGV) {
    my $n = NetAddr::IP->new($cidr);
    foreach ( @{ $n->hostenumref } ) {
        push @ips, ( split( '/', $_ ) )[0];
    }
}
my $ping = Net::Ping->new("icmp");
my $pq = Thread::Queue->new( @ips, undef ); # ping-worker-queue
my $rq = Thread::Queue->new(); # response queue
my $semaphore = Thread::Semaphore->new(100); # I hoped this may be useful to limit # of concurrent threads
while ( my $phost = $pq->dequeue() ) {
    $semaphore->down();
    threads->create( { 'stack_size' => 32 * 4096 }, \&ping_th, $phost );
}
sub ping_th {
    $rq->enqueue( $_[0] ) if $ping->ping( $_[0], 1 );
    $semaphore->up();
    threads->detach();
}
$rq->enqueue(undef);
while ( my $alive_ip = $rq->dequeue() ) {
    print $alive_ip, "\n";
}
I couldn't find a fully comprehensive description of how threads->detach() should work from within a threaded subroutine and thought that this might work... and it does, if I do something in the main program (thread) that stretches its lifetime (sleep does well), so that all the detached threads finish up and enqueue their part to my $rq. Otherwise it will run some threads, collect their results in the queue, and exit with warnings like:
Perl exited with active threads:
5 running and unjoined
0 finished and unjoined
0 running and detached
Making the main program "sleep" for a while, once again, seems silly - is there no way to make threads do their stuff and detach ONLY after the actual threads->detach() call?
So far my guess is that threads->detach() inside a sub applies as soon as the thread is created and so this is not the way.
I tried this out with CentOS's good old v5.10.1. Would this change with a modern v5.16 or v5.18 (usethreads-compiled)?
Detaching a thread isn't particularly useful, because you're effectively saying 'I don't care when they exit'.
This isn't typically what you want - your process is finishing with threads still running.
Generally though - creating threads has an overhead, because your process is cloned in memory. You want to avoid doing this. Thread::Queue is also good to use, because it's a thread safe way of transferring information. In your code, you don't actually need it for $pq because you're not actually threading at the point where you're using it.
Your semaphore is one approach to doing it, but can I suggest as an alternative:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Net::Ping;
use Thread::Queue;
my $nthreads = 100;
my @ip_list; # fill with the addresses to ping, e.g. the @ips built in the question
my $ping_q = Thread::Queue -> new();
my $results_q = Thread::Queue -> new();
sub ping_host {
    my $pinger = Net::Ping->new("icmp");
    while ( my $hostname = $ping_q -> dequeue() ) {
        if ( $pinger -> ping ( $hostname, 1 ) ) {
            $results_q -> enqueue ( $hostname );
        }
    }
}
#start the threads
for ( 1..$nthreads ) {
    threads -> create ( \&ping_host );
}
#queue the workload
$ping_q -> enqueue ( @ip_list );
#close the queue, so '$ping_q -> dequeue' returns undef, breaking the while loop.
$ping_q -> end();
#wait for pingers to finish.
foreach my $thr ( threads -> list() ) {
    $thr -> join();
}
$results_q -> end();
#collate results
while ( my $successful_host = $results_q -> dequeue_nb() ) {
    print $successful_host, "\n";
}
This way you spawn the threads up front, queue the targets and then collate the results when you're done. You don't incur the overhead of repeatedly respawning threads, and your program will wait until all the threads are done. That may take a while, because the ping timeout on a 'down' host will be quite long.
Since detached threads can't be joined, you can wait for threads to finish their jobs,
sleep 1 while threads->list();

Multithreading management in Perl

What exactly does Perl do to threads that have completed their tasks? Does it let them idle or just kill them? I have a basic code structure below and I was wondering how best to optimize it.
use threads;
use Thread::Semaphore;
my $s = Thread::Semaphore->new($maxThreads);
my @threads;
my $thread;
foreach my $tasktodo (@tasktodo) {
    $s->down();
    $thread = threads->new(\&doThis);
    push @threads, $thread;
}
foreach my $thr (@threads) {
    $thr->join();
}
sub doThis {
    # blah blah
    # completed, gonna let more threads run with $s->up()
    $s->up();
}
In this case, once a thread completes, I want to free up resources for more threads to run. I'm worried about joining threads at the end of the loop. If 4 threads are created over the whole program life cycle, will @threads still have 4 threads in it when joining?
Let's say $maxThreads is 2. Will it run 2 threads, then when those 2 complete, kill them and run 2 more? At the end, will it only join or wait for the 2 threads still running?
EDIT: I also don't care for the return values of these threads, that's why I want to free up resources. Only reason I'm joining is I want all threads to complete before continuing with the script. Again, is this the best implementation?
The usual method for terminating a thread is to return EXPR from the entry point function with the appropriate return value(s).
The join function waits for this return value and cleans up the thread. So, in my opinion, your code is fine.
Another way to exit a thread is this:
threads->exit(status);
Also, you can get a list of joinable threads with:
threads->list(threads::joinable);
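If holding on to finished threads until the very end is a concern, one option is to join them as soon as they become joinable. This is only a sketch of that idea: the limit of 2, the placeholder tasks and the body of do_this are assumptions, and it uses polling instead of the semaphore from the question:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;

my $max_threads = 2;            # assumed concurrency limit
my @tasks       = ( 1 .. 6 );   # placeholder work items
my @results;

# Stand-in for the real work.
sub do_this {
    my ($task) = @_;
    return $task * 2;
}

for my $task (@tasks) {
    # Wait for a free slot, joining finished threads as we go.
    while ( threads->list(threads::running) >= $max_threads ) {
        push @results, $_->join() for threads->list(threads::joinable);
        threads->yield();
    }
    # Ask for scalar context explicitly so join() returns the value.
    threads->create( { 'context' => 'scalar' }, \&do_this, $task );
}

# Join whatever is left, blocking until each thread finishes.
push @results, $_->join() for threads->list();

print scalar(@results), " tasks completed\n";
```

This way there are never more than a couple of finished-but-unjoined threads hanging around, and the final loop still guarantees that everything has completed before the script continues.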

Perl: Correctly passing array for threads to work on

I'm learning how to do threading in Perl. I was going over the example code here and adapted the solution code slightly:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Semaphore;
my $sem = Thread::Semaphore->new(2); # max 2 threads
my @names = ("Kaku", "Tyson", "Dawkins", "Hawking", "Goswami", "Nye");
my @threads = map {
    # request a thread slot, waiting if none are available:
    foreach my $whiz (@names) {
        $sem->down;
        threads->create(\&mySubName, $whiz);
    }
} @names;
sub mySubName {
    return "Hello Dr. " . $_[0] . "\n";
    # release slot:
    $sem->up;
}
foreach my $t (@threads) {
    my $hello = $t->join();
    print "$hello";
}
Of course, this is now completely broken and does not work. It results in this error:
C:\scripts\perl\sandbox>threaded.pl
Can't call method "join" without a package or object reference at C:\scripts\perl\sandbox\threaded.pl line 24.
Perl exited with active threads:
0 running and unjoined
9 finished and unjoined
0 running and detached
My objective was two-fold:
Enforce max number of threads allowed at any given time
Provide the array of 'work' for the threads to consume
In the original solution, I noticed that the 0..100; code seems to specify the amount of 'work' given to the threads. However, in my case where I want to supply an array of work, do I still need to supply something similar?
Any guidance and corrections very welcome.
You're storing the result of foreach into @threads rather than the result of threads->create.
Even if you fix this, you collect completed threads too late. I'm not sure how big of a problem that is, but it might prevent more than 64 threads from being started on some systems. (64 is the max number of threads a program can have at a time on some systems.)
A better approach is to reuse your threads. This solves both of your problems and avoids the overhead of repeatedly creating threads.
use threads;
use Thread::Queue 3.01 qw( );
use constant NUM_WORKERS => 2;
sub work {
    my ($job) = @_;
    ...
}
{
    my $q = Thread::Queue->new();
    for (1..NUM_WORKERS) {
        async {
            while (my $job = $q->dequeue()) {
                work($job);
            }
        };
    }
    $q->enqueue(@names); # Can be done over time.
    $q->end(); # When you're done adding.
    $_->join() for threads->list();
}
