How to pause Thread::Queue access while waiting for prior threads to finish up?

So this is what I have right now:
for my $action (@actionList) {
    $q->enqueue([$_, $action]) for @component_dirs;
    print "\nWaiting for prior actions to finish up...\n";
    until (!defined($q->peek())) {}
}
$q->end();
$_->join() for threads->list();
But this doesn't seem to work. Is there a better way to force the queue to wait for previous $action items to complete before allowing access again?
edit: Oddly enough, it's magically started working... maybe it was working all along and I just didn't make the output apparent enough. Either way, my question still stands - is there a better way?

Your code doesn't wait until the previous action has completed, it just wastes CPU until another thread starts working on the last job.
For things like “flags”, you should generally use semaphores instead. Semaphores are thread-safe counters with up and down methods. For example, we could pass a semaphore along with the job, which starts with count zero. Each thread increments the semaphore when it finishes a job. Our main thread tries to decrement the semaphore by the count of jobs, which will block until all threads have finished:
use threads;
use Thread::Queue;
use Thread::Semaphore;
my $q = Thread::Queue->new;
my @workers = map { threads->create(\&worker, $q) } 1 .. $NUM_WORKERS;
for my $action (@actionList) {
    my $sem = Thread::Semaphore->new(0);
    $q->enqueue([$_, $action, $sem]) for @component_dirs;
    $sem->down(0+@component_dirs); # wait for the threads
}
$q->end;
$_->join for @workers;
sub worker {
    my ($q) = @_;
    while (my $job = $q->dequeue) {
        my ($component, $action, $sem) = @$job;
        ...
        $sem->up;
    }
}
Actually, we could reuse the semaphore.
See the Thread::Semaphore docs for more details.
This usage is similar to barriers.
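As a rough sketch of the "reuse the semaphore" idea: one semaphore is created up front and shared by all workers, so it no longer has to travel with each job. The action and component names, the worker count and the @log array are made up for illustration; the real work would replace the placeholder comment.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;
use Thread::Semaphore;

# Hypothetical inputs, standing in for @actionList and @component_dirs.
my @actions     = qw(clean build test);
my @components  = qw(core ui docs);
my $num_workers = 2;

my @log :shared;                        # records which action each finished job belonged to
my $q   = Thread::Queue->new;
my $sem = Thread::Semaphore->new(0);    # one semaphore, reused for every action

sub worker {
    while (defined(my $job = $q->dequeue)) {
        my ($component, $action) = @$job;
        # ... real work for ($component, $action) would go here ...
        { lock(@log); push @log, $action; }
        $sem->up;                       # signal: one job of this action is done
    }
}

my @workers = map { threads->create(\&worker) } 1 .. $num_workers;

for my $action (@actions) {
    $q->enqueue([$_, $action]) for @components;
    $sem->down(scalar @components);     # block until every job has called up()
}

$q->end;
$_->join for @workers;
print join(' ', @log), "\n";
```

Because the main thread only enqueues the next action after down() returns, the semaphore count is back at zero each time round, which is what makes reusing it safe here.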

Related

Perl - turning foreach loop to a multi-threaded run

I have the following code:
foreach my $inst (sort keys %{ ... }) {
next if (...);
somefuntion($a, $b, $c, $inst);
}
I would like to run this function on all the $inst-s asynchronously.
I tried to make it multi-threaded, but I'm having trouble with the syntax or implementation.
*** EDIT: ***
Apparently (I hadn't noticed until now), the function updates a hash and the updates get lost.
Would threads::shared help in this case? Is it relevant here, or should I just try forks?
Perl has three main approaches I'd suggest for parallel code:
Threads
Forks
Nonblocking IO
The latter isn't strictly speaking 'parallel' in all circumstances, but it does let you do multiple things at the same time, without waiting for each to finish, so it's beneficial in certain circumstances.
E.g. maybe you want to open 10 concurrent ssh sessions - you can just do an IO::Select to find which of them are 'ready' and process them as they come in.
The ssh shells themselves are of course, separate processes.
But when doing parallel, you need to be aware of a couple of pitfalls - one being 'self denial of service' - you can generate huge resource consumption very easily. The other being that you've got some inherent race conditions, and no longer a deterministic flow of program - that brings you a whole new class of exciting bugs.
Threads
I wouldn't advocate spawning a thread-per-instance, as that scales badly. Threads in perl are NOT lightweight, like you might be assuming. That means that implementing them as if they are, gives you a denial of service condition.
What I'd typically suggest is running with Thread::Queue and some "worker" threads, using the queue to pass data to a number of workers scaled to your resource availability. The right number depends on the limiting factor that is making you go parallel (e.g. disk, network, CPU).
So to use a simplistic example that I've posted previously:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
my $nthreads = 5;
my $process_q = Thread::Queue->new();
my $failed_q = Thread::Queue->new();
#this is a subroutine, but that runs 'as a thread'.
#when it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is defined as 'shared'.
#Behind the scenes - Thread::Queue are 'shared' arrays.
sub worker {
#NB - this will sit a loop indefinitely, until you close the queue.
#using $process_q -> end
#we do this once we've queued all the things we want to process
#and the sub completes and exits neatly.
#however if you _don't_ end it, this will sit waiting forever.
while ( my $server = $process_q->dequeue() ) {
chomp($server);
print threads->self()->tid() . ": pinging $server\n";
my $result = `/bin/ping -c 1 $server`;
if ($?) { $failed_q->enqueue($server) }
print $result;
}
}
#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);
#we 'end' process_q - when we do, no more items may be inserted,
#and 'dequeue' returns 'undefined' when the queue is emptied.
#this means our worker threads (in their 'while' loop) will then exit.
$process_q->end();
#start some threads
for ( 1 .. $nthreads ) {
threads->create( \&worker );
}
#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
$thr->join();
}
#collate results. ('synchronise' operation)
while ( my $server = $failed_q->dequeue_nb() ) {
print "$server failed to ping\n";
}
This will start 5 threads, and queue up some number of jobs, such that 5 are running in parallel at any given time, and 'unwind' gracefully after.
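On the edit about hash updates getting lost: each thread works on its own copy of ordinary variables, so writes to a plain hash never reach the parent. A minimal sketch of the threads::shared route, assuming the function only needs to store one scalar per instance (the instance names and the length() "work" are placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my %results :shared;          # visible to, and writable by, every thread
my $q = Thread::Queue->new;

sub worker {
    while (defined(my $inst = $q->dequeue)) {
        my $value = length $inst;   # placeholder for the real per-instance work
        lock(%results);             # released at the end of this loop iteration
        $results{$inst} = $value;
    }
}

my @workers = map { threads->create(\&worker) } 1 .. 3;
$q->enqueue(qw(alpha beta gamma delta));
$q->end;
$_->join for @workers;

print "$_ => $results{$_}\n" for sort keys %results;
```

Only the top level of the hash is shared this way; nested structures would need to be shared as well (threads::shared provides shared_clone for that).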
Forking
Parallel::ForkManager is the tool for the job here.
Unlike threads, forks are quite efficient on a Unix system, as the native fork() system call is well optimised.
But what it's not so good at is passing data around: you have to hand-roll any IPC between your forks, in a way that you largely don't with threads.
A simple example of this would be:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
my $concurrent_fork_limit = 4;
my $fork_manager = Parallel::ForkManager->new($concurrent_fork_limit);
foreach my $thing ( "fork", "spoon", "knife", "plate" ) {
my $pid = $fork_manager->start;
if ($pid) {
print "$$: Fork made a child with pid $pid\n";
} else {
print "$$: child process started, with a key of $thing ($pid)\n";
}
$fork_manager->finish;
}
$fork_manager->wait_all_children();
This does spawn off subprocesses, but cleans up after them fairly readily.
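To give an idea of what "hand-rolling" the IPC looks like, here is a sketch using only core fork() and pipe(), without Parallel::ForkManager and without any limit on concurrency. Each child sends one line back to the parent; the uc() call is just a stand-in for real work, and a Unix-like system is assumed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @things = qw(fork spoon knife plate);
my %reader_for;    # child pid => read end of that child's pipe

for my $thing (@things) {
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {                        # child
        close $reader;
        print {$writer} uc($thing), "\n";   # placeholder "result"
        close $writer;
        exit 0;
    }

    close $writer;                          # parent keeps only the read end
    $reader_for{$pid} = $reader;
}

my %result_for;
for my $pid (keys %reader_for) {
    my $fh = $reader_for{$pid};
    chomp(my $line = <$fh>);                # one line per child in this sketch
    close $fh;
    waitpid($pid, 0);                       # reap the child
    $result_for{$pid} = $line;
}

print join(' ', sort values %result_for), "\n";
```

This is fine for short results; anything larger or more structured needs a proper serialisation format and more careful reading.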
Nonblocking IO
Using IO::Select you would open some number of filehandles to subprocesses, and then use the can_read function to process the ones that are ready to run.
The perldoc IO::Select covers most of the detail here, which I'll reproduce for convenience:
use IO::Select;
$select = IO::Select->new();
$select->add(\*STDIN);
$select->add($some_handle);
@ready = $select->can_read($timeout);
@ready = IO::Select->new(@handles)->can_read(0);
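A slightly fuller sketch of that idea, assuming a Unix-like system (select on pipes is not reliable on Windows): three subprocesses are opened as read pipes and drained as they become ready. The "perl -e" child command and the child1..child3 labels are placeholders for whatever you actually run.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Select;

my $select = IO::Select->new();
my %name_for;    # stringified handle => label

for my $n (1 .. 3) {
    # "perl -e ..." is only a stand-in for a real long-running command.
    open(my $fh, '-|', $^X, '-e', "print qq{job $n done\\n}")
        or die "cannot start child $n: $!";
    $select->add($fh);
    $name_for{$fh} = "child$n";
}

my @lines;
while ($select->count) {
    for my $fh ($select->can_read(5)) {     # wait up to 5 seconds
        my $got = sysread($fh, my $buf, 4096);
        if ($got) {
            push @lines, "$name_for{$fh}: $buf";
        } else {                            # EOF (0) or error (undef)
            $select->remove($fh);
            close $fh;
        }
    }
}
print sort @lines;
```

sysread is used rather than <$fh> because mixing buffered reads with select can leave data sitting in Perl's buffer while select reports nothing to read.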
You could use threads.
Here's an example that should take about 5 seconds to finish although it calls sleep(5) twice:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
my %data = (
'foo' => 'bar',
'apa' => 'bepa',
);
sub somefuntion {
my $key = shift;
print "$key\n";
sleep(5);
return $data{$key};
}
my @threads;
for my $inst (sort keys %data) {
    push @threads, threads->create('somefuntion', $inst);
}
print "running...\n";
for my $thr (@threads) {
    print $thr->join() . "\n";
}
print "done\n";
This answer was written to show how threads work in Perl, because you mentioned threads. Just a word of caution:
The "interpreter-based threads" provided by Perl are not the fast, lightweight system for multitasking that one might expect or hope for. Threads are implemented in a way that makes them easy to misuse. Few people know how to use them correctly or will be able to provide help.
The use of interpreter-based threads in perl is officially discouraged.

How to get started multithreading in Perl

I have a perl program that takes over 13 hours to run. I think it could benefit from introducing multithreading but I have never done this before and I'm at a loss as to how to begin.
Here is my situation:
I have a directory of hundreds of text files. I loop through every file in the directory using a basic for loop and do some processing (text processing on the file itself, calling an outside program on the file, and compressing it). When complete I move on to the next file. I continue this way doing each file, one after the other, in a serial fashion. The files are completely independent from each other and the process returns no values (other than success/failure codes) so this seems like a good candidate for multithreading.
My questions:
How do I rewrite my basic loop to take advantage of threads? There appear to be several modules for threading out there.
How do I control how many threads are currently running? If I have N cores available, how do I limit the number of threads to N or N - n?
Do I need to manage the thread count manually or will Perl do that for me?
Any advice would be much appreciated.
Since your threads are simply going to launch a process and wait for it to end, best to bypass the middlemen and just use processes. Unless you're on a Windows system, I'd recommend Parallel::ForkManager for your scenario.
use Parallel::ForkManager qw( );
use constant MAX_PROCESSES => ...;
my $pm = Parallel::ForkManager->new(MAX_PROCESSES);
my @qfns = ...;
for my $qfn (@qfns) {
my $pid = $pm->start and next;
exec("extprog", $qfn)
or die $!;
}
$pm->wait_all_children();
If you wanted to avoid using needless intermediary threads on Windows, you'd have to use something akin to the following:
use constant MAX_PROCESSES => ...;
my @qfns = ...;
my %children;
for my $qfn (@qfns) {
while (keys(%children) >= MAX_PROCESSES) {
my $pid = wait();
delete $children{$pid};
}
my $pid = system(1, "extprog", $qfn);
++$children{$pid};
}
while (keys(%children)) {
my $pid = wait();
delete $children{$pid};
}
Someone's given you a forking example. Forks aren't native on Windows, so I'd tend to prefer threading.
For the sake of completeness - here's a rough idea of how threading works (and IMO is one of the better approaches, rather than respawning threads).
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
my $nthreads = 5;
my $process_q = Thread::Queue->new();
my $failed_q = Thread::Queue->new();
#this is a subroutine, but that runs 'as a thread'.
#when it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is defined as 'shared'.
#Behind the scenes - Thread::Queue are 'shared' arrays.
sub worker {
#NB - this will sit a loop indefinitely, until you close the queue.
#using $process_q -> end
#we do this once we've queued all the things we want to process
#and the sub completes and exits neatly.
#however if you _don't_ end it, this will sit waiting forever.
while ( my $server = $process_q->dequeue() ) {
chomp($server);
print threads->self()->tid() . ": pinging $server\n";
my $result = `/bin/ping -c 1 $server`;
if ($?) { $failed_q->enqueue($server) }
print $result;
}
}
#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);
#we 'end' process_q - when we do, no more items may be inserted,
#and 'dequeue' returns 'undefined' when the queue is emptied.
#this means our worker threads (in their 'while' loop) will then exit.
$process_q->end();
#start some threads
for ( 1 .. $nthreads ) {
threads->create( \&worker );
}
#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
$thr->join();
}
#collate results. ('synchronise' operation)
while ( my $server = $failed_q->dequeue_nb() ) {
print "$server failed to ping\n";
}
If you need to move complicated data structures around, I'd recommend having a look at Storable - specifically freeze and thaw. These will let you shuffle around objects, hashes, arrays etc. easily in queues.
Note though - for any parallel processing option, you get good CPU utilisation, but you don't get more disk IO - that's often a limiting factor.
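For the Storable suggestion, a minimal sketch of what freeze and thaw look like in practice: each job is serialised to a plain string before it goes into the queue and rebuilt on the other side. The job layout (name, numbers) and the summing "work" are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use Storable qw(freeze thaw);

my $jobs    = Thread::Queue->new;
my $results = Thread::Queue->new;

sub worker {
    while (defined(my $frozen = $jobs->dequeue)) {
        my $job = thaw($frozen);                 # back to a normal hash ref
        my $sum = 0;
        $sum += $_ for @{ $job->{numbers} };
        $results->enqueue(freeze({ name => $job->{name}, sum => $sum }));
    }
}

my @workers = map { threads->create(\&worker) } 1 .. 2;

$jobs->enqueue(freeze({ name => 'a', numbers => [1, 2, 3] }));
$jobs->enqueue(freeze({ name => 'b', numbers => [10, 20] }));
$jobs->end;
$_->join for @workers;

$results->end;
my %sum_for;
while (defined(my $frozen = $results->dequeue)) {
    my $r = thaw($frozen);
    $sum_for{ $r->{name} } = $r->{sum};
}
print "$_: $sum_for{$_}\n" for sort keys %sum_for;
```

The loops test defined(...) rather than plain truthiness so that a payload which happens to be false would not end a worker early.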

Multithreading management in Perl

What exactly does Perl do to threads that have completed their task? Does it let them idle or just kill them? I have a basic code structure below and I was wondering how best to optimize it.
use threads;
use Thread::Semaphore;
my $s = Thread::Semaphore->new($maxThreads);
my @threads;
my $thread;
foreach my $tasktodo (@tasktodo) {
    $s->down();
    $thread = threads->new(\&doThis);
    push @threads, $thread;
}
foreach my $thr (@threads) {
    $thr->join();
}
sub doThis {
# blah blah
# completed, gonna let more threads run with $s->up()
$s->up();
}
In this case, once a thread completes, I want to free up resources for more threads to run. I'm worried about joining threads at the end of the loop. If 4 threads are created over the whole life of the program, will @threads still have 4 threads in it when joining?
Let's say $maxThreads is 2: will it run 2 threads, kill them when they complete, and then run 2 more? At the end, will it only join or wait for the 2 threads still running?
EDIT: I also don't care for the return values of these threads, that's why I want to free up resources. Only reason I'm joining is I want all threads to complete before continuing with the script. Again, is this the best implementation?
The usual method for terminating a thread is to return EXPR from the entry point function with the appropriate return value(s).
The join function waits for this return value and cleans up the thread. So, in my opinion, your code is fine.
Another way to exit a thread is this:
threads->exit(status);
Also, you can get a list of joinable threads with:
threads->list(threads::joinable);
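If the concern is finished threads piling up until the final join loop, one option is to join whatever is already joinable each time round the loop. This is only a sketch of that idea on top of the semaphore approach from the question; the task list, $max_threads, do_this and the $reaped counter are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Semaphore;

my $max_threads = 2;
my @tasks       = (1 .. 6);        # hypothetical work items
my $slots       = Thread::Semaphore->new($max_threads);
my $reaped      = 0;

sub do_this {
    my ($task) = @_;
    # ... real work for $task would go here ...
    $slots->up();                  # free a slot for the next thread
    return;
}

for my $task (@tasks) {
    $slots->down();                # blocks while $max_threads are busy
    threads->create(\&do_this, $task);

    # Reap whatever has already finished, so it does not pile up.
    for my $thr (threads->list(threads::joinable)) {
        $thr->join();
        $reaped++;
    }
}

# Wait for the stragglers.
for my $thr (threads->list()) {
    $thr->join();
    $reaped++;
}
print "joined $reaped threads\n";
```

Every thread is still joined exactly once; the difference is only that most of them are cleaned up while the loop is running rather than all at the end.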

Strange variable behaviour using Perl ithreads

I'm trying to implement a multithreaded application based on a slightly altered boss/worker model. Basically the main thread creates several boss threads, which in turn spawn two worker threads each (possibly more). That's because the boss threads deal with one host or network device each, and the worker threads could take a while to complete their work.
I'm using Thread::Pool to realize this concept, and so far it works quite well; I also don't think my problem is related to Thread::Pool (see below). Very simplified pseudocode ahead:
use strict;
use warnings;
my $bosspool = create_bosspool(); # spawns all boss threads
my $taskpool = undef; # created in each boss thread at
# creation of each boss thread
# give device jobs to boss threads
while (1) {
foreach my $device ( @devices ) {
$bosspool->job($device);
}
sleep(1);
}
# This sub is called for jobs passed to the $bosspool
sub process_boss
{
my $device = shift;
foreach my $task ( $device->{tasks} ) {
# process results as they become available
process_result() while ( $taskpool->results );
# give task jobs to task threads
scalar $taskpool->job($device, $task);
sleep(1); ### HACK ###
}
# process remaining results / wait for all tasks to finish
process_result() while ( $taskpool->results || $taskpool->todo );
# happy result processing
}
sub process_result
{
my $result = $taskpool->result_any();
# mangle $result
}
# This sub is called for jobs passed to the $taskpool of each boss thread
sub process_task
{
# not so important stuff
return $result;
}
By the way, the reason I'm not using the monitor()-routine is because I have to wait for all jobs in the $taskpool to finish. Now, this code works just wonderful, unless you remove the ### HACK ### line. Without sleeping, $taskpool->todo() won't deliver the right number of jobs still open if you add them or receive their results too "fast". Like, you add 4 jobs in total but $taskpool->todo() will only return 2 afterwards (with no pending results). This leads to all sorts of interesting effects.
OK, so Thread::Pool->todo() is crap, let's try a workaround:
sub process_boss
{
my $device = shift;
my $todo = 0;
foreach my $task ( $device->{tasks} ) {
# process results as they become available
while ( $taskpool->results ) {
process_result();
$todo--;
}
# give task jobs to task threads
scalar $taskpool->job($device, $task);
$todo++;
}
# process remaining results / wait for all tasks to finish
while ( $todo ) {
process_result();
sleep(1); ### HACK ###
$todo--;
}
}
This will also work fine, as long as I keep the ### HACK ### line. Without this line, this code will reproduce the problems of Thread::Pool->todo(), as $todo does not only get decremented by 1, but 2 or even more.
I've tested this code with only one boss thread, so there was basically no multithreading involved (when it comes to this subroutine). $bosspool, $taskpool and especially $todo aren't :shared, no side effects possible, right? What's happening in this subroutine, which gets executed by only one boss thread, with no shared variables, semaphores, etc.?
I would suggest that the best way to implement a 'worker' threads model, is with Thread::Queue. The problem with doing something like this, is figuring out when queues are complete, or whether items are dequeued and pending processing.
With Thread::Queue you can use a while loop to fetch elements from the queue, and end the queue, such that the while loop returns undef and the threads exit.
So you don't always need multiple 'boss' threads, you can just use multiple different flavours of worker and input queues. I would question why you need a 'boss' thread model in that instance. It seems unnecessary.
With reference to:
Perl daemonize with child daemons
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
my $nthreads = 4;
my @targets = qw ( device1 device2 device3 device4 );
my $task_one_q = Thread::Queue->new();
my $task_two_q = Thread::Queue->new();
my $results_q = Thread::Queue->new();
sub task_one_worker {
while ( my $item = $task_one_q->dequeue ) {
#do something with $item
$results_q->enqueue("$item task_one complete");
}
}
sub task_two_worker {
while ( my $item = $task_two_q->dequeue ) {
#do something with $item
$results_q->enqueue("$item task_two complete");
}
}
#start threads;
for ( 1 .. $nthreads ) {
threads->create( \&task_one_worker );
threads->create( \&task_two_worker );
}
foreach my $target (@targets) {
$task_one_q->enqueue($target);
$task_two_q->enqueue($target);
}
$task_one_q->end;
$task_two_q->end;
#Wait for threads to exit.
foreach my $thr ( threads->list() ) {
$thr->join();
}
$results_q->end();
while ( my $item = $results_q->dequeue() ) {
print $item, "\n";
}
You could do something similar with a boss thread if you were desirous - you can create a queue per boss and pass it by reference to the workers. I'm not sure that it's necessary though.

How to create threads in Perl?

I have a simple Perl script with a big loop in which I call the function my_fun() roughly a million times. I would like to create a pool of threads to handle it, with at most 5 threads calling the function at the same time.
It is really important for me to use the fastest library. It would be really nice to see examples.
My code looks like this:
for (my $i = 0; $i < 1000000 ; $i++) {
my_fun();
}
Thank you in advance
Have a look at Parallel::ForkManager. It's using fork, not threads, but it should get your job done very simply.
Example lifted from the docs and slightly altered:
use Parallel::ForkManager;
my $pm = Parallel::ForkManager->new(5); # number of parallel processes
for my $i (0 .. 999999) {
# Forks and returns the pid for the child:
my $pid = $pm->start and next;
#... do some work with $data in the child process ...
my_fun();
$pm->finish; # Terminates the child process
}
$pm->wait_all_children;
We can't give you the fastest way, since that depends on the work, and you didn't tell us what the work is.
But you did ask about threads, so I'll give you the foundation of a threaded application. Here is a worker model. It's robust, maintainable and extendable.
use threads;
use Thread::Queue qw( ); # Version 3.01+ required
my $NUM_WORKERS = 5;
sub worker {
my ($job) = @_;
...
}
my $q = Thread::Queue->new();
my @workers;
for (1..$NUM_WORKERS) {
push @workers, async {
while (defined(my $job = $q->dequeue())) {
worker($job);
}
};
}
$q->enqueue($_) for @jobs; # Send work
$q->end(); # Tell workers they're done.
$_->join() for @workers; # Wait for the workers to finish.
This is a basic flow (one-directional), but it's easy to make bi-directional by adding a response queue.
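A sketch of the bi-directional variant, assuming each job is a plain number and my_fun() returns something worth collecting (here it just squares its input as a placeholder). The response queue and the %square_of hash are names made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;    # end() needs version 3.01+

my $num_workers = 5;
my $request_q   = Thread::Queue->new();
my $response_q  = Thread::Queue->new();

sub my_fun {                       # placeholder for the real my_fun()
    my ($n) = @_;
    return $n * $n;
}

my @workers;
for (1 .. $num_workers) {
    push @workers, async {
        while (defined(my $job = $request_q->dequeue())) {
            # Send the input back with the result so the caller can match them up.
            $response_q->enqueue([$job, my_fun($job)]);
        }
    };
}

$request_q->enqueue($_) for 1 .. 10;
$request_q->end();
$_->join() for @workers;

$response_q->end();
my %square_of;
while (defined(my $pair = $response_q->dequeue())) {
    my ($in, $out) = @$pair;
    $square_of{$in} = $out;
}
print "$_ squared is $square_of{$_}\n" for sort { $a <=> $b } keys %square_of;
```

Here the responses are only read after all workers have been joined, which is fine for ten jobs; with a million you would probably want to drain the response queue while the workers are still running.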
This uses actual threads, but you can switch to using processes by switching use threads; to use forks;.
Parallel::ForkManager can also be used to provide a worker model, but it's continually creating new processes instead of reusing them. This does allow it to handle child death easily, though.
Ref: Thread::Queue (or Thread::Queue::Any)
Take a look at the threads documentation.
